<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Njeri Gitome</title>
    <description>The latest articles on Forem by Njeri Gitome (@njerigitome).</description>
    <link>https://forem.com/njerigitome</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1025831%2F4879998e-f5d8-416b-8390-684323a8a886.jpg</url>
      <title>Forem: Njeri Gitome</title>
      <link>https://forem.com/njerigitome</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/njerigitome"/>
    <language>en</language>
    <item>
      <title>Getting started with Sentiment Analysis and Implementation</title>
      <dc:creator>Njeri Gitome</dc:creator>
      <pubDate>Sat, 25 Mar 2023 03:27:59 +0000</pubDate>
      <link>https://forem.com/njerigitome/getting-started-with-sentiment-analysis-and-implementation-2g4d</link>
      <guid>https://forem.com/njerigitome/getting-started-with-sentiment-analysis-and-implementation-2g4d</guid>
      <description>&lt;p&gt;The advent of the internet has revolutionized how people communicate their thoughts and opinions. Today, millions of people share their daily lives and express their emotions on social media platforms like Facebook, Twitter, Google, and others.&lt;/p&gt;

&lt;p&gt;A significant amount of sentiment-rich data is being generated via social media in the form of tweets, status updates, blog posts, comments, reviews, etc. This essential data drives analysts to discover insights and patterns through sentiment analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Sentiment Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sentiment analysis is the process of extracting the emotions from the user's written text by processing unstructured information and preparing a model to extract the knowledge from it. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.&lt;/p&gt;

&lt;p&gt;It involves the use of data mining, machine learning (ML), artificial intelligence and computational linguistics to mine text for sentiment and subjective information such as whether it is expressing positive, negative or neutral feelings.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Different approaches for sentiment analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are various approaches for sentiment analysis on linguistic data, and which approach to use depends on the nature of the data and the platform one is working on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--woYeRzdn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfgbkfy4sa66n1ah275i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--woYeRzdn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfgbkfy4sa66n1ah275i.png" alt="sentimental analysis approaches" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
Most research carried out in the field of sentiment analysis employs lexicon-based analysis or machine learning techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lexicon-Based approach
&lt;/h3&gt;

&lt;p&gt;Also known as the dictionary-based approach, it classifies sentiment in linguistic data using lexical databases such as SentiWordNet and WordNet.&lt;/p&gt;

&lt;p&gt;It obtains a score for each word in the sentence or document and annotates it using features from the lexical database. Text polarity is derived from a set of words, each annotated with a weight, and the information each word contributes is combined to conclude the overall sentiment of the text.&lt;/p&gt;
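As a minimal sketch of the idea, here is a toy lexicon-based scorer. The hand-made dictionary below is purely illustrative; a real system would draw its weights from a lexical database such as SentiWordNet.

```python
# Toy lexicon: each word carries a hand-assigned polarity weight.
# A real implementation would load these weights from SentiWordNet.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0, "bad": -1.0, "terrible": -2.0}

def lexicon_score(text):
    # Sum the annotated weight of every known word; unknown words contribute 0.
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

def classify(text):
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"

print(classify("I love this great phone"))  # positive
print(classify("terrible battery bad screen"))  # negative
```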

&lt;h3&gt;
  
  
  &lt;strong&gt;Machine Learning approach&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this approach, the words in the sentence are represented as vectors and analyzed using machine learning algorithms such as Naïve Bayes, Support Vector Machines (SVM) and Maximum Entropy.&lt;/p&gt;

&lt;p&gt;In this article, we will use the Sentiment140 dataset, which is available at:&lt;br&gt;
&lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;https://www.kaggle.com/datasets/kazanova/sentiment140&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data consists of sentiments expressed by users in various tweets. Each tweet is a record, classified as either positive or negative.&lt;/p&gt;

&lt;p&gt;The data is filtered and analyzed using Natural Language Processing techniques, and sentiment polarity is calculated based on the emotion words detected in the user tweets. This approach is implemented using the Python programming language and the Natural Language Toolkit (NLTK).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Text-Preprocessing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Natural Language Processing (NLP) is a branch of data science that deals with text data. Text data is unstructured and therefore needs extensive preprocessing.&lt;/p&gt;

&lt;p&gt;Some steps of the preprocessing are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lower casing&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removing Hyperlinks&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removing punctuations&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removing Stop words&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tokenization&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stemming&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lemmatization&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start by loading the data!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xjRWR17x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q8p4h0dk9r92g41z67ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xjRWR17x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q8p4h0dk9r92g41z67ai.png" alt="load data" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our columns of interest are those of unstructured textual tweets and sentiment. Therefore, the rest of the columns are dropped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JNgrLUsU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/npquj1eym8p4j7sa6ftk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JNgrLUsU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/npquj1eym8p4j7sa6ftk.png" alt="drop columns" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Lowercase all the tweets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first step is transforming the tweets to lowercase to maintain consistency during the NLP tasks and text mining.&lt;br&gt;
For example, 'Nation' and 'nation' would be treated as two different words in any sentence; hence, we make all the words in the tweets lowercase to avoid duplication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uG4ZGAUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5yk4nyz6giuqp6iekthu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uG4ZGAUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5yk4nyz6giuqp6iekthu.png" alt="lowercase" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;
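The lowercasing step can be sketched in plain Python. The article applies it to a pandas DataFrame column; a plain list is used here only to keep the sketch self-contained.

```python
tweets = ["Nation first!", "Our NATION is great"]

# Lowercase every tweet so 'Nation' and 'nation' map to the same token.
tweets = [tweet.lower() for tweet in tweets]

print(tweets)  # ['nation first!', 'our nation is great']
```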

&lt;h3&gt;
  
  
  &lt;strong&gt;Remove Hyperlinks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Hyperlinks are very common in tweets and add no information relevant to our problem of sentiment analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IUCtjVTC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/12dubq329obxiqkn03ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IUCtjVTC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/12dubq329obxiqkn03ii.png" alt="hyperlinks" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
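A regex-based sketch of the hyperlink-removal step, assuming links appear either as http/https URLs or bare www links:

```python
import re

def remove_hyperlinks(text):
    # Drop http/https URLs and bare www links; they carry no sentiment.
    return re.sub(r"https?://\S+|www\.\S+", "", text)

print(remove_hyperlinks("check this out https://example.com now"))
```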

&lt;h3&gt;
  
  
  &lt;strong&gt;Remove Punctuations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For most NLP problems, punctuation does not provide additional language information and is generally removed.&lt;br&gt;
Similarly, punctuation symbols are not crucial for sentiment analysis; they are redundant, and removing punctuation before text modelling is highly recommended.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uq5Whmox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4etow9yfb5h85j125voe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uq5Whmox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4etow9yfb5h85j125voe.png" alt="punctuation" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;
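One common way to implement this step is a deletion table over Python's built-in punctuation set:

```python
import string

def remove_punctuation(text):
    # str.translate with a deletion table strips every ASCII punctuation mark.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("wow!!! so, good..."))  # wow so good
```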

&lt;h3&gt;
  
  
  &lt;strong&gt;Remove Stop words&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stop words are English words that do not add much meaning to a sentence. They are removed as they do not add value to the analysis.&lt;/p&gt;

&lt;p&gt;The NLTK library includes a list of words that are considered stop words for the English language. Some of them are: [i, me, my, myself, we, our, ours]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PQhvc8Ys--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfr7idyit1ddd71bwcpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PQhvc8Ys--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfr7idyit1ddd71bwcpn.png" alt="stopwords" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;
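A sketch of the stop word filter. Only a small hand-picked subset of the stop word list is hardcoded here; the full list comes from `nltk.corpus.stopwords.words('english')` after running `nltk.download('stopwords')`.

```python
# Small subset of NLTK's English stop word list, hardcoded for illustration.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "is", "a", "the"}

def remove_stopwords(text):
    # Keep only the words that are not stop words.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stopwords("i think the movie is a masterpiece"))  # think movie masterpiece
```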

&lt;h3&gt;
  
  
  &lt;strong&gt;Tokenization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This refers to splitting a larger body of text into smaller units such as sentences and words. These pieces are called tokens (either word tokens or sentence tokens). They help in understanding the context and in building a vocabulary.&lt;/p&gt;

&lt;p&gt;Below is an example of a string of data:&lt;br&gt;
&lt;em&gt;"What is your favourite food joint?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In order for this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts. &lt;/p&gt;

&lt;p&gt;Tokens:&lt;br&gt;
"What" "is" "your" "favourite" "food" "joint" "?"&lt;/p&gt;

&lt;p&gt;Code sample:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5F8ozAzf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ng2vivvuupgru4ll2c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5F8ozAzf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ng2vivvuupgru4ll2c6.png" alt="tokenize" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;
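The example sentence can be tokenized with a small regex. For simple sentences like this one, NLTK's `word_tokenize` produces the same tokens; the regex version is shown only to keep the sketch dependency-free.

```python
import re

def tokenize(text):
    # Split into word tokens, keeping punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("What is your favourite food joint?"))
# ['What', 'is', 'your', 'favourite', 'food', 'joint', '?']
```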

&lt;p&gt;Tokenization of the tweets:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xMgt5QUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gvso8nsrmuw2e6vj83er.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xMgt5QUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gvso8nsrmuw2e6vj83er.png" alt="tweet tokenization" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stemming and Lemmatization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These are text normalization techniques for Natural Language Processing. Both processes aim to reduce a word to a common base or root form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iJNsggKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6u8nxw7yvxboo5ilr2am.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iJNsggKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6u8nxw7yvxboo5ilr2am.png" alt="stem vs lem" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stemming&lt;/strong&gt;&lt;br&gt;
This is the process of reducing a word to its stem by stripping common suffixes such as (-ing, -ed, -es).&lt;br&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: Faster to execute on large datasets.&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: May result in meaningless words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lemmatization&lt;/strong&gt;&lt;br&gt;
The process of reducing a word to its lemma (dictionary form) through linguistic analysis of the word.&lt;br&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: Preserves the meaning of the extracted root word.&lt;br&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Computationally expensive.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lemmatization is almost always preferred over stemming unless there is a need for super-fast execution on a massive corpus of text data.&lt;/p&gt;
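The contrast can be sketched with a deliberately naive stemmer and a tiny lemma dictionary. These stand in for NLTK's `PorterStemmer` and `WordNetLemmatizer`, which implement the real algorithms.

```python
def naive_stem(word):
    # Crude suffix stripping in the spirit of a Porter-style stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Tiny illustrative lemma dictionary; WordNetLemmatizer consults WordNet instead.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

print(naive_stem("studies"))       # 'studi' - stemming can yield a non-word
print(naive_lemmatize("studies"))  # 'study' - lemmatization keeps a real word
```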

&lt;p&gt;Applying lemmatization to the tweets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_lscBVZI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a9383tc927mp4fi47pim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_lscBVZI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a9383tc927mp4fi47pim.png" alt="lemmatization" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Text Exploratory Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, we analyze the text length for the different sentiments.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Word Cloud&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A word cloud is a graphical representation of word frequency. The larger a word appears in the visualization, the more frequently it occurred in the document(s).&lt;/p&gt;
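Under the hood, a word cloud is driven by word frequencies; the `wordcloud` library then scales each word's font size by its count. The frequencies themselves can be computed with a `Counter` (the three tweets below are made-up examples):

```python
from collections import Counter

tweets = ["love this day", "love the weekend", "happy day today"]

# Count how often each word occurs across all tweets.
counts = Counter(" ".join(tweets).split())

print(counts.most_common(2))  # the two most frequent words
```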

&lt;p&gt;Word cloud for positive tweets in our dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mTrKUngH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cy0i36onrlj44kluy5qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mTrKUngH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cy0i36onrlj44kluy5qy.png" alt="code for wordcloud" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ada0rbks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zyw94r66mf0mhska6hou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ada0rbks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zyw94r66mf0mhska6hou.png" alt="positive word cloud" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Word cloud for negative tweets in our dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9jpnRw86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hn4ay1b8flf6b1d9tjkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9jpnRw86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hn4ay1b8flf6b1d9tjkt.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rctYc2Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk8gic0pqqshjuuafxm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rctYc2Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk8gic0pqqshjuuafxm9.png" alt="negative wordcloud" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analysis</category>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Njeri Gitome</dc:creator>
      <pubDate>Sat, 11 Mar 2023 05:11:39 +0000</pubDate>
      <link>https://forem.com/njerigitome/essential-sql-commands-for-data-analysis-3c3d</link>
      <guid>https://forem.com/njerigitome/essential-sql-commands-for-data-analysis-3c3d</guid>
      <description>&lt;p&gt;Structured Query Language (SQL) is important for a data scientist because it is a powerful way to access, process, clean and analyze data stored in relational databases.&lt;/p&gt;

&lt;p&gt;An understanding of essential SQL commands is therefore crucial to allow the data scientist to perform efficiently in their role.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Categories of SQL commands&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Data Definition Language (DDL)
&lt;/h3&gt;

&lt;p&gt;These are the SQL commands used to define the database schema.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;List of DDL commands:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CREATE&lt;/strong&gt; - create the database or its objects (table, index, function, views, store procedure and triggers)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DROP&lt;/strong&gt; - delete objects from the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ALTER&lt;/strong&gt; - make changes to a table, view or entire database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TRUNCATE&lt;/strong&gt; - remove all records from a table without deleting the table structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
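The DDL commands above can be exercised with Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; note that SQLite has no TRUNCATE statement, so an unfiltered DELETE plays that role.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database

con.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
con.execute("ALTER TABLE tweets ADD COLUMN sentiment TEXT")

# Inspect the schema after the ALTER.
cols = [row[1] for row in con.execute("PRAGMA table_info(tweets)")]
print(cols)  # ['id', 'text', 'sentiment']

con.execute("DELETE FROM tweets")  # SQLite's stand-in for TRUNCATE
con.execute("DROP TABLE tweets")
```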

&lt;h3&gt;
  
  
  2. Data Manipulation Language (DML)
&lt;/h3&gt;

&lt;p&gt;A subset of SQL commands used to manipulate the data that is present in the database.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;List of DML commands:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;INSERT&lt;/strong&gt; - insert data into a table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt; - modify or change existing records in a table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DELETE&lt;/strong&gt; - delete records from a table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SELECT&lt;/strong&gt; - query data from one or more tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
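A short sketch of the four DML commands in sequence, again using SQLite via Python's sqlite3 module with a made-up `scores` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score INTEGER)")

con.execute("INSERT INTO scores VALUES ('ann', 70), ('ben', 85)")
con.execute("UPDATE scores SET score = 90 WHERE name = 'ben'")
con.execute("DELETE FROM scores WHERE score = 70")

rows = con.execute("SELECT name, score FROM scores").fetchall()
print(rows)  # [('ben', 90)]
```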

&lt;h2&gt;
  
  
  SQL for Data Scientists
&lt;/h2&gt;

&lt;p&gt;One is able to perform the following using SQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Basics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Subqueries &amp;amp; Temporary Tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Window Functions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's delve deeper into these processes in SQL, as well as the essential SQL commands that come in handy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Basic SQL Commands&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This comprises the CRUD (Create, Read, Update, Delete) operations, made possible by the DDL and DML commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;SQL Joins&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;JOIN statements are used to combine two or more tables, based on a related column between them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;(INNER) JOIN&lt;/strong&gt; - returns records that have matching values in both tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LEFT JOIN&lt;/strong&gt; - returns all records from the left table, and the matched records from the right table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RIGHT JOIN&lt;/strong&gt; - returns all records from the right table, and the matched records from the left table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FULL JOIN&lt;/strong&gt; - returns all records when there is a match in either the left or the right table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
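A sketch of the first two join types on two made-up tables, using SQLite through Python's sqlite3 module. SQLite only gained RIGHT and FULL JOIN in version 3.39, so the sketch sticks to INNER and LEFT JOIN.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, item TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO orders VALUES (1, 'book');
""")

# INNER JOIN keeps only users with a matching order.
inner = con.execute(
    "SELECT u.name, o.item FROM users u JOIN orders o ON u.id = o.user_id"
).fetchall()

# LEFT JOIN keeps every user; 'ben' gets NULL for the missing order.
left = con.execute(
    "SELECT u.name, o.item FROM users u LEFT JOIN orders o ON u.id = o.user_id"
).fetchall()

print(inner)  # [('ann', 'book')]
print(left)   # [('ann', 'book'), ('ben', None)]
```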

&lt;h2&gt;
  
  
  &lt;strong&gt;SQL Aggregations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are commands that are used to perform calculations on a set of rows in a table and return a single value as the result.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Here are the most common SQL aggregation functions:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SUM()&lt;/strong&gt; - calculates the sum of values in a column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AVG()&lt;/strong&gt; -calculates the average value of a column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MAX()&lt;/strong&gt; - finds the maximum value in a column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MIN()&lt;/strong&gt; - finds the minimum value in a column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;COUNT()&lt;/strong&gt; - counts the number of rows in a table or the number of rows that meet a certain condition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aggregation functions are very useful for data scientists as they enable them to perform calculations on large datasets quickly and efficiently. They can be used to generate summaries of data, identify trends, and make informed decisions based on the results.&lt;/p&gt;
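All five aggregation functions can be seen in one query over a made-up `sales` table (SQLite via Python's sqlite3 module):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (amount INTEGER)")
con.execute("INSERT INTO sales VALUES (10), (20), (30)")

# Each aggregate collapses the whole table into a single value.
row = con.execute(
    "SELECT SUM(amount), AVG(amount), MAX(amount), MIN(amount), COUNT(*) "
    "FROM sales"
).fetchone()

print(row)  # (60, 20.0, 30, 10, 3)
```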

&lt;h2&gt;
  
  
  &lt;strong&gt;SQL Subqueries &amp;amp; Temporary Tables&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A subquery is a query nested inside another query, typically in the SELECT, FROM or WHERE clause. A temporary table stores the intermediate result of a query so that it can be reused in later statements; it is dropped automatically at the end of the session.&lt;/p&gt;
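As a sketch, the following uses a subquery to find rows above the table's own average, then materialises an intermediate result in a temporary table (SQLite via Python's sqlite3 module, with a made-up `scores` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
con.execute("INSERT INTO scores VALUES ('ann', 70), ('ben', 90), ('cat', 80)")

# Subquery in the WHERE clause: rows scoring above the table's own average.
above_avg = con.execute(
    "SELECT name FROM scores WHERE score > (SELECT AVG(score) FROM scores)"
).fetchall()

# Temporary table: store an intermediate result for reuse.
con.execute("CREATE TEMP TABLE top AS SELECT * FROM scores WHERE score >= 80")
top_count = con.execute("SELECT COUNT(*) FROM top").fetchone()[0]

print(above_avg)  # [('ben',)]
print(top_count)  # 2
```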

&lt;h2&gt;
  
  
  &lt;strong&gt;Data Cleaning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;SQL string functions are used to clean data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LEFT()&lt;/strong&gt; - extracts  characters from a string starting from the left.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RIGHT()&lt;/strong&gt; - extracts  characters from a string starting from the right.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SUBSTR()&lt;/strong&gt; - extracts a substring from a string starting at any position.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CONCAT()&lt;/strong&gt; - adds two or more strings together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CAST()&lt;/strong&gt; - converts a value of any type into a specific, different data type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;POSITION()&lt;/strong&gt; - used to find the position of a substring within a string, starting from a specified position.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;STRPOS()&lt;/strong&gt; - finds the position of the first occurrence of a string inside another string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;COALESCE()&lt;/strong&gt; - returns the first non-null value in a list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
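String function names vary by SQL dialect. The sketch below runs in SQLite (via Python's sqlite3 module), which has no LEFT(), RIGHT(), POSITION() or STRPOS(); SUBSTR and INSTR are its equivalents, and || is the standard string concatenation operator:

```python
import sqlite3

con = sqlite3.connect(":memory:")

row = con.execute(
    "SELECT SUBSTR('data science', 1, 4), "     # first 4 chars, like LEFT()
    "       'data' || ' ' || 'science', "       # concatenation, like CONCAT()
    "       CAST('42' AS INTEGER), "            # type conversion
    "       COALESCE(NULL, NULL, 'fallback'), " # first non-null value
    "       INSTR('data science', 'science')"   # 1-based position, like STRPOS()
).fetchone()

print(row)  # ('data', 'data science', 42, 'fallback', 6)
```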

&lt;h2&gt;
  
  
  &lt;strong&gt;Window Functions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A window function allows one to compare one row to another without doing any joins. Window functions are effective for measuring trends over time or ranking on a specific column, and they retain the total number of records without collapsing or condensing the original dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregate Window Functions&lt;/strong&gt;&lt;br&gt;
Aggregate functions such as &lt;em&gt;SUM()&lt;/em&gt;, &lt;em&gt;COUNT()&lt;/em&gt;, &lt;em&gt;AVG()&lt;/em&gt;, &lt;em&gt;MAX()&lt;/em&gt; and &lt;em&gt;MIN()&lt;/em&gt; applied over a particular window (set of rows) are called aggregate window functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranking Window Functions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RANK()&lt;/strong&gt; - assigns a rank to every row within each partition. Rank 1 is given to the first row, and rows with the same value receive the same rank. After a tie, the following rank values are skipped, one for each tied row. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DENSE_RANK()&lt;/strong&gt; - assigns a rank to each row within a partition. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between &lt;em&gt;RANK()&lt;/em&gt; and &lt;em&gt;DENSE_RANK()&lt;/em&gt; is that in &lt;em&gt;DENSE_RANK()&lt;/em&gt;, the rank after a tie is the next consecutive integer; no rank is skipped. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROW_NUMBER()&lt;/strong&gt; - assigns consecutive integers to all the rows within a partition. Within a partition, no two rows can have the same row number. &lt;/li&gt;
&lt;/ul&gt;
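The three ranking functions can be compared side by side on a made-up table with a tie (SQLite via Python's sqlite3 module; window functions need SQLite 3.25 or newer, which ships with all recent Python builds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
con.execute("INSERT INTO scores VALUES ('ann', 90), ('ben', 90), ('cat', 80)")

rows = con.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC),
           DENSE_RANK() OVER (ORDER BY score DESC),
           ROW_NUMBER() OVER (ORDER BY score DESC)
    FROM scores ORDER BY name
""").fetchall()

for row in rows:
    print(row)
# ann and ben tie at 90: both get RANK 1 and DENSE_RANK 1. cat gets RANK 3
# (rank 2 is skipped) but DENSE_RANK 2. ROW_NUMBER is unique for every row,
# with the order among the two tied rows left unspecified.
```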

</description>
      <category>database</category>
      <category>datascience</category>
      <category>sql</category>
    </item>
    <item>
      <title>Exploratory Data Analysis Ultimate Guide</title>
      <dc:creator>Njeri Gitome</dc:creator>
      <pubDate>Thu, 02 Mar 2023 10:57:24 +0000</pubDate>
      <link>https://forem.com/njerigitome/exploratory-data-analysis-ultimate-guide-2h9g</link>
      <guid>https://forem.com/njerigitome/exploratory-data-analysis-ultimate-guide-2h9g</guid>
      <description>&lt;p&gt;The phrase "Data is the new gold", emphasizes the increasing value of data in today's world. When properly analyzed, data can uncover valuable insights that inform critical decisions and shape the future. In order to extract insights from data, one must first understand it. This is where Exploratory Data Analysis (EDA) comes in. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. What is Exploratory Data Analysis?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Exploratory data analysis is one of the first steps in the data analytics process. It entails applying various techniques to the dataset in order to understand the data.&lt;/p&gt;

&lt;p&gt;Understanding the dataset simply means getting to know the data and its characteristics, which helps in identifying potential issues, patterns and relationships within the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. What is the objective of Exploratory Data Analysis?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are two main objectives of EDA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;EDA assists in identifying faulty points in the data. Once the faulty points have been identified, they can be easily removed, resulting in data cleaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also helps in understanding the relationship between the variables. This gives a wider perspective on the data which helps  in building models by utilizing the relationship between various features(variables).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Types of Exploratory Data Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are two main types of exploratory data analysis which are Univariate EDA and Multivariate EDA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Univariate EDA
&lt;/h3&gt;

&lt;p&gt;Uni means one and variate means variable, so in univariate analysis there is only one variable under study. The goal of univariate analysis is simply to describe the data and find patterns within it. Univariate EDA techniques include:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Univariate non-graphical EDA techniques:&lt;/em&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Central Tendency (mean, mode and median)&lt;/li&gt;
&lt;li&gt;Dispersion (range, variance)&lt;/li&gt;
&lt;li&gt;Quartiles (interquartile range) &lt;/li&gt;
&lt;li&gt;Standard deviation.&lt;/li&gt;
&lt;/ul&gt;
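All of these non-graphical summaries are available in Python's standard `statistics` module; a sketch over a small made-up sample:

```python
import statistics

values = [1, 2, 2, 3, 4, 6]

print(statistics.mean(values))      # 3   - central tendency
print(statistics.median(values))    # 2.5
print(statistics.mode(values))      # 2
print(max(values) - min(values))    # 5   - the range, a measure of dispersion
print(statistics.variance(values))  # sample variance
print(statistics.stdev(values))     # standard deviation

q1, q2, q3 = statistics.quantiles(values, n=4)
print(q3 - q1)  # interquartile range
```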

&lt;h4&gt;
  
  
  &lt;em&gt;Univariate graphical EDA techniques:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;These are graphical methods that provide a visualization of the data. Common types of univariate graphics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Histograms&lt;/strong&gt;, a graph plot in which each bar represents the frequency distribution of numerical data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Box plots&lt;/strong&gt;, which graphically depict the five-number summary of minimum, first quartile, median, third quartile and maximum.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multivariate EDA
&lt;/h3&gt;

&lt;p&gt;This is a method of analyzing data involving more than one variable. The goal is to understand patterns, correlations and interactions between variables. Multivariate techniques include:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Multivariate non-graphical EDA techniques:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;These techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Multivariate graphical EDA techniques:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;These are graphical methods that display the relationships between two or more sets of data. Common types of multivariate graphics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scatter plot&lt;/strong&gt;, used to plot two quantitative variables on the horizontal (x) and vertical (y) axes to display the relationship between the continuous variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate chart&lt;/strong&gt;, is a graphical representation of the relationships between factors and a response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run chart&lt;/strong&gt;, a line graph drawn over time. It visually illustrates the data values in a time sequence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bubble chart&lt;/strong&gt;, scatter plots that display multiple circles (bubbles) in a two-dimensional plot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heatmap&lt;/strong&gt;, a graphical representation of data in the form of a map or diagram in which data values are represented as colors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Exploratory Data Analysis Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;

&lt;p&gt;Python is used for many EDA tasks, such as finding missing values in the collected data, describing the data, handling outliers and obtaining insights through charts.&lt;/p&gt;

&lt;h4&gt;
  
  
  R
&lt;/h4&gt;

&lt;p&gt;The R programming language is a regular choice of data scientists and statisticians for making statistical observations and analyzing data, i.e., performing detailed EDA.&lt;/p&gt;

&lt;h4&gt;
  
  
  MATLAB
&lt;/h4&gt;

&lt;p&gt;It is common among engineers and domain experts due to its strong mathematical calculation ability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Steps involved in Exploratory Data Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are three main steps involved in exploratory data analysis: data collection, data cleaning, and analysis of the relationships between the variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Collection
&lt;/h3&gt;

&lt;p&gt;Data collection is the first step in EDA; it involves gathering relevant data from various sources. Some reliable sources are Kaggle, GitHub, the UCI Machine Learning Repository, etc.&lt;/p&gt;

&lt;p&gt;The data used in the example is the 120 Years of Olympic History dataset, which is available on Kaggle.&lt;/p&gt;

&lt;p&gt;In the IDE of your choice, start by importing the necessary libraries.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy1lfu03u29bvva4jvfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy1lfu03u29bvva4jvfs.png" alt="Importing necessary Libraries" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
Then, load the dataset into DataFrames:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1itvrihmrfcu1bmhtvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1itvrihmrfcu1bmhtvx.png" alt="Loading dataset" width="800" height="74"&gt;&lt;/a&gt;&lt;br&gt;
Display the content of the datasets:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zyqs4j6lqhmq5j7d0s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zyqs4j6lqhmq5j7d0s2.png" alt="Main df" width="800" height="222"&gt;&lt;/a&gt;&lt;br&gt;
Regions dataset:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj7h0vy2mfv3himh8s2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj7h0vy2mfv3himh8s2f.png" alt="Regions" width="800" height="178"&gt;&lt;/a&gt;&lt;br&gt;
Check the shape of the DataFrames:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa69vdo4gm35bm3tcc88n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa69vdo4gm35bm3tcc88n.png" alt="df.shape" width="800" height="89"&gt;&lt;/a&gt;&lt;br&gt;
This DataFrame shape is (271116, 15), which means that it has 271116 observations (rows) and 15 features (columns).&lt;br&gt;
Checking the region's DataFrame shape:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7vb4xcda08r7hlck3ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7vb4xcda08r7hlck3ha.png" alt="region" width="800" height="99"&gt;&lt;/a&gt;&lt;br&gt;
The DataFrame shape is (230, 3), which implies that it has 230 rows and 3 columns.&lt;br&gt;
Next, merge the two DataFrames:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1vjkqodaczc27lcbt23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1vjkqodaczc27lcbt23.png" alt="Olympics df" width="800" height="74"&gt;&lt;/a&gt;&lt;br&gt;
Check the shape of the Olympics DataFrame:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey40nuv23vgkmr6z531z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey40nuv23vgkmr6z531z.png" alt="Olympics shape" width="800" height="82"&gt;&lt;/a&gt;&lt;br&gt;
Display the content of the Olympics DataFrame:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm8lrdlhwjdwemaasy82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm8lrdlhwjdwemaasy82.png" alt="olympics.head()" width="800" height="260"&gt;&lt;/a&gt;&lt;br&gt;
Check the concise summary of the DataFrame using the info() function.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g0736ctpphtap1m1gfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g0736ctpphtap1m1gfm.png" alt="information" width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
Check the descriptive statistics of the DataFrame using the describe() function. It provides summary statistics such as count, mean, and quartiles for the numeric columns.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oyq8fqjs8iy0znxx7bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oyq8fqjs8iy0znxx7bo.png" alt="descriptive analysis" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;
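&lt;p&gt;The loading and merging steps above can be condensed into a short pandas sketch. Small stand-in DataFrames are built inline so the snippet is self-contained; with the real Kaggle files you would call pd.read_csv on the two CSVs instead:&lt;/p&gt;

```python
import pandas as pd

# Stand-ins for the two Kaggle files; in practice they are loaded with
# pd.read_csv, e.g. pd.read_csv("athlete_events.csv")
athletes = pd.DataFrame({
    "Name": ["A Dijiang", "Edgar Aabye", "Christine Aaftink"],
    "NOC": ["CHN", "DEN", "NED"],
    "Year": [1992, 1900, 1988],
})
regions = pd.DataFrame({
    "NOC": ["CHN", "DEN", "NED"],
    "region": ["China", "Denmark", "Netherlands"],
})

print(athletes.shape, regions.shape)  # (rows, columns) of each frame

# Merge the two frames on the shared NOC column, keeping every athlete row
olympics = athletes.merge(regions, on="NOC", how="left")

olympics.info()             # concise summary: columns, dtypes, non-null counts
print(olympics.describe())  # descriptive statistics for the numeric columns
```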

&lt;h3&gt;
  
  
  2. Data Cleaning
&lt;/h3&gt;

&lt;p&gt;This is a critical step in EDA that involves identifying and correcting errors and inconsistencies in the data to ensure its accuracy and integrity.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. &lt;em&gt;Handling the missing values.&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is a crucial step in data analysis. Missing values can be handled in various ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Removing missing values&lt;/em&gt; - this is simply removing any rows or columns that contain missing values. This is only appropriate if the amount of missing data is small relative to the size of the dataset and removing the missing data does not significantly affect the analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Imputing missing values&lt;/em&gt; - this is imputing the missing value with an estimated value. The simplest approach is to impute the missing values with the mean, median, or mode of the non-missing values. More advanced imputation techniques involve using machine learning algorithms to predict the missing values based on other features in the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Ignoring missing values&lt;/em&gt; - in some cases, it may be appropriate to ignore missing values if they do not significantly affect the analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
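&lt;p&gt;The three strategies above can be sketched with pandas on a tiny hypothetical frame:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in each strategy's target column
df = pd.DataFrame({
    "Age":    [24.0, np.nan, 31.0, 28.0],
    "Height": [175.0, 182.0, np.nan, 168.0],
    "Notes":  [np.nan, np.nan, np.nan, "DNF"],
})

# 1. Removing: drop every row that contains a missing value
dropped = df.dropna()

# 2. Imputing: replace missing heights with the column mean
df["Height"] = df["Height"].fillna(df["Height"].mean())

# 3. Ignoring: leave Age as-is when the gap is too small to matter
print(dropped.shape, df["Height"].tolist())
```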

&lt;p&gt;&lt;strong&gt;Handling missing values in the Olympics dataset:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First check for missing values:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzbaoxnf6hd3p0iksnv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzbaoxnf6hd3p0iksnv0.png" alt="missing values" width="800" height="159"&gt;&lt;/a&gt;&lt;br&gt;
Then, the percentage of missing values:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rwg3txou6n5ylh9sg9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rwg3txou6n5ylh9sg9j.png" alt="Percentage of missing values" width="800" height="180"&gt;&lt;/a&gt;&lt;br&gt;
The results above provide insights on how to handle the missing values in the Olympics dataset.&lt;/p&gt;

&lt;p&gt;The Notes column has 98% of its data missing and can therefore be dropped.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0mef1bs0cqcp1shr940.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0mef1bs0cqcp1shr940.png" alt="drop column" width="800" height="79"&gt;&lt;/a&gt;&lt;br&gt;
The Height and Weight missing values can be imputed with the mean.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpzaepj84tqv297g77fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpzaepj84tqv297g77fk.png" alt="impute missing" width="800" height="95"&gt;&lt;/a&gt;&lt;br&gt;
The Age column has 3% of its data missing, while Region has 0.3%; these amounts are relatively small, so the columns can be left unmodified. The missing values in the Medal column are ignored since NaN indicates that no medal was won.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. &lt;em&gt;Handling duplicate values.&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This involves identifying and removing or modifying duplicates. Here are some common approaches to handling duplicate values:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Identifying and removing exact duplicates&lt;/em&gt; - exact duplicates are rows that have identical values in all columns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Identifying and removing partial duplicates&lt;/em&gt; - partial duplicates are rows that have the same values in some columns but differ in others.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
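&lt;p&gt;Both kinds of duplicates can be handled with pandas; the rows below are hypothetical:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows: the first two are exact duplicates,
# the third matches them only on Name
df = pd.DataFrame({
    "Name": ["Usain Bolt", "Usain Bolt", "Usain Bolt"],
    "Year": [2008, 2008, 2012],
})

# Exact duplicates: identical values in every column
exact_mask = df.duplicated()       # flags the second 2008 row
deduped = df.drop_duplicates()

# Partial duplicates: same values in a subset of columns only
partial = df.drop_duplicates(subset=["Name"])

print(len(deduped), len(partial))
```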

&lt;p&gt;Here's a code example of how to handle the duplicates mentioned above:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4on58hjwrvhwypia7gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4on58hjwrvhwypia7gw.png" alt="duplicates" width="800" height="176"&gt;&lt;/a&gt;&lt;br&gt;
The Olympics dataset does not require this check because duplicates are inevitable given the nature of the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Analyzing the Relationship Between the Variables
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Univariate non-graphical EDA&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Top 10 participating countries&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksbtm1k84aoxxk9mba9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksbtm1k84aoxxk9mba9a.png" alt="Top 10 participating countries" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;
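&lt;p&gt;A frequency table like the one above is typically produced with value_counts(); here a tiny stand-in region column replaces the merged Olympics frame:&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in for the merged Olympics frame's region column
olympics = pd.DataFrame({
    "region": ["USA", "USA", "USA", "UK", "UK", "France"],
})

# Frequency table of participating countries; head(10)
# keeps the top ten in the full dataset
top = olympics["region"].value_counts().head(10)
print(top)
```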

&lt;h4&gt;
  
  
  &lt;em&gt;&lt;strong&gt;Univariate graphical EDA&lt;/strong&gt;&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Bar plot for Top 10 participating countries&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704dxys7c9eror8rqez0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704dxys7c9eror8rqez0.png" alt="barplot" width="800" height="412"&gt;&lt;/a&gt;&lt;br&gt;
code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2enz7r5brqfmvkkp1913.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2enz7r5brqfmvkkp1913.png" alt="code" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Age distribution of the athletes&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frryidjxo3xwwk71njlvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frryidjxo3xwwk71njlvy.png" alt="Age distribution of the athletes" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25988djsjv2zmy043492.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25988djsjv2zmy043492.png" alt="code" width="800" height="115"&gt;&lt;/a&gt;&lt;br&gt;
Interpretation: Most participants were aged between 23 and 26 years.&lt;/p&gt;
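&lt;p&gt;A histogram like the one above can be reproduced with matplotlib; the ages below are randomly generated stand-ins for the Age column:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in for the Age column
rng = np.random.default_rng(1)
ages = rng.normal(25, 4, 500)

# Histogram: the classic univariate graphical EDA tool
fig, ax = plt.subplots()
ax.hist(ages, bins=20, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Number of athletes")
ax.set_title("Age distribution of the athletes")
fig.savefig("age_hist.png")
```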

&lt;p&gt;&lt;strong&gt;Height distribution of the athletes&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf16tfsmh675yp9b38pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf16tfsmh675yp9b38pw.png" alt="height" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
Code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8lvelq3rkpfyaicnzaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8lvelq3rkpfyaicnzaq.png" alt="code" width="800" height="109"&gt;&lt;/a&gt;&lt;br&gt;
Interpretation: The heights of the athletes range from 150 cm to 178 cm. Most of the participants had a height of 175 cm.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Multivariate non-graphical EDA&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Number of athletes by gender&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4n7eczvbqn9jd1jc7nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4n7eczvbqn9jd1jc7nb.png" alt="Gender" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;
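&lt;p&gt;A per-gender count like the one above can be sketched with a pandas groupby on a hypothetical roster:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical athlete roster
olympics = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve"],
    "Sex":  ["F", "M", "F", "M", "F"],
})

# Count distinct athletes per gender
counts = olympics.groupby("Sex")["Name"].nunique()
print(counts)
```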

&lt;p&gt;&lt;strong&gt;Top 15 Countries and number of Gold Medals Won in the 2016 Olympics&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyp27m47mwvhffuy56zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyp27m47mwvhffuy56zn.png" alt="top 15" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Multivariate graphical EDA&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Pie plot for male and female distribution of athletes&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3k5px0sbyzvbvllatf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3k5px0sbyzvbvllatf9.png" alt="Pie Plot" width="482" height="502"&gt;&lt;/a&gt;&lt;br&gt;
Code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hh9rrgn7781amxggyis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hh9rrgn7781amxggyis.png" alt="code" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Line Plot of Female Athletes over time&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzlajws6rvalt9gg0ems.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzlajws6rvalt9gg0ems.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;br&gt;
Code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykpynizsknutthl0a0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiykpynizsknutthl0a0r.png" alt="code" width="800" height="111"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Bar-plot for Top 15 Countries and number of Gold Medals Won in the 2016 Olympics&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8rust93gt9lwc0hztt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf8rust93gt9lwc0hztt.png" alt="top 15" width="620" height="437"&gt;&lt;/a&gt;&lt;br&gt;
Code:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e4yggzaytrcy7kz8ufo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e4yggzaytrcy7kz8ufo.png" alt=" " width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;
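&lt;p&gt;A line plot over time like the one above can be sketched with pandas and matplotlib; the records below are hypothetical stand-ins for the Olympics frame:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical athlete records across three Games
olympics = pd.DataFrame({
    "Year": [1900, 1900, 1950, 1950, 1950, 2000],
    "Sex":  ["F", "M", "F", "F", "M", "F"],
})

# Number of female athletes per Games
female = olympics[olympics["Sex"] == "F"].groupby("Year").size()

fig, ax = plt.subplots()
ax.plot(female.index, female.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Female athletes")
ax.set_title("Female athletes over time")
fig.savefig("female_over_time.png")
```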

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It is crucial to keep in mind that EDA is an iterative process and that the steps used can change based on the dataset and the objectives of the analysis. In addition, domain knowledge and context are important factors in understanding and drawing meaningful insights from the data.&lt;/p&gt;

</description>
      <category>software</category>
      <category>showcase</category>
    </item>
    <item>
      <title>SQL 101:Introduction to SQL for Data Analysis</title>
      <dc:creator>Njeri Gitome</dc:creator>
      <pubDate>Fri, 17 Feb 2023 09:41:07 +0000</pubDate>
      <link>https://forem.com/njerigitome/introduction-to-sql-for-data-analysis-4di0</link>
      <guid>https://forem.com/njerigitome/introduction-to-sql-for-data-analysis-4di0</guid>
      <description>&lt;p&gt;SQL is like a paintbrush for data analysis. It is an essential tool that every data analysts analyst should have in their arsenal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is SQL?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;SQL (Structured Query Language) is a programming language used in relational databases to manage and manipulate data.&lt;/p&gt;

&lt;p&gt;It allows the user to interact with databases through a set of commands, or statements. It is used to create tables; insert, update, and delete records; and query the database to extract information.&lt;/p&gt;

&lt;p&gt;SQL has also become a widely used language in data analysis and data science, because it provides a powerful set of tools for manipulating and analyzing large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How is SQL used for Data Analysis?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Data Extraction
&lt;/h3&gt;

&lt;p&gt;One of the primary responsibilities of a Data Analyst is to extract and analyze data from various sources. SQL comes in handy to achieve this.&lt;/p&gt;

&lt;p&gt;Using SQL, data can be retrieved from one or more databases using SQL queries. This process is simply known as data extraction.&lt;/p&gt;

&lt;p&gt;Here are steps for data extraction in SQL:&lt;/p&gt;

&lt;p&gt;Log in to the MySQL server using a user account:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgxmize2jj1uw8mod09l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgxmize2jj1uw8mod09l.png" alt="Image description" width="800" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start by creating the database:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wmgtryrrup3s1j5sul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wmgtryrrup3s1j5sul2.png" alt="Image description" width="800" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, select the newly created database in order to use it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6q16mzfzg15irfd5uo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6q16mzfzg15irfd5uo5.png" alt="Image description" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, &lt;strong&gt;create a table&lt;/strong&gt; within the &lt;strong&gt;Bookstore&lt;/strong&gt; database:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62eqzg3uwfdczluaqqi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62eqzg3uwfdczluaqqi6.png" alt="Image description" width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, &lt;strong&gt;input data&lt;/strong&gt; into the table:&lt;/p&gt;

&lt;p&gt;Example 1: Populate one record:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb2s69gh85up95dwzwd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb2s69gh85up95dwzwd9.png" alt="Image description" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example 2: Populate multiple records:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfaml8smjr8x1tr4wddw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfaml8smjr8x1tr4wddw.png" alt="Image description" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To &lt;strong&gt;extract data&lt;/strong&gt;, the &lt;strong&gt;SELECT&lt;/strong&gt; statement is used:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt7oxaycqscax5y8ebhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt7oxaycqscax5y8ebhs.png" alt="Image description" width="800" height="71"&gt;&lt;/a&gt;&lt;br&gt;
Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoe1e5ct5oyetub1rz7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoe1e5ct5oyetub1rz7w.png" alt="Image description" width="793" height="364"&gt;&lt;/a&gt;&lt;/p&gt;
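&lt;p&gt;The extraction steps above can be sketched end to end with Python's built-in sqlite3 module. The schema below is an illustrative simplification of the Bookstore example, and MySQL syntax differs slightly (e.g. its AUTO_INCREMENT keyword):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create a table in the Bookstore-style schema
cur.execute("""
    CREATE TABLE Books (
        book_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        price   REAL
    )
""")

# Populate one record, then several at once
cur.execute("INSERT INTO Books (title, price) VALUES (?, ?)",
            ("SQL Basics", 15.99))
cur.executemany("INSERT INTO Books (title, price) VALUES (?, ?)",
                [("Data Analysis 101", 22.50), ("Python for Data", 30.00)])

# Extract data with a SELECT statement
rows = cur.execute("SELECT title, price FROM Books ORDER BY price").fetchall()
print(rows)
```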

&lt;p&gt;&lt;strong&gt;In addition&lt;/strong&gt;&lt;br&gt;
In this case, our scenario is a Bookstore database. More tables, namely Books, Stock, and Categories, were created.&lt;/p&gt;

&lt;p&gt;Below is an example of creating a table that has a relationship with another.&lt;/p&gt;

&lt;p&gt;Books Table:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto3vbuews0ykxnuwlqpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto3vbuews0ykxnuwlqpf.png" alt="Image description" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stock Table:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2odoc874p07hhgkecyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2odoc874p07hhgkecyd.png" alt="Image description" width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a database entity relationship diagram that will be useful while analyzing the data:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbtfrux6t7ineah1h8gp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbtfrux6t7ineah1h8gp.png" alt="Image description" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Joins
&lt;/h3&gt;

&lt;p&gt;SQL joins are a powerful tool for data analysts because they allow them to combine data from multiple tables into a single result set. This is crucial because data is often spread across multiple tables, and combining this data is necessary to answer complex business questions.&lt;/p&gt;

&lt;p&gt;There are several joins that can be used in SQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INNER JOIN&lt;/li&gt;
&lt;li&gt;LEFT JOIN&lt;/li&gt;
&lt;li&gt;RIGHT JOIN &lt;/li&gt;
&lt;li&gt;FULL OUTER JOIN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are Venn diagrams that provide a visual illustration of SQL joins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8731rv6k9n6ykt5kmli1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8731rv6k9n6ykt5kmli1.png" alt="Image description" width="631" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As data analysts, we will answer a business question; depending on which SQL join we use, we will obtain different results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Question:&lt;br&gt;
Which book categories are available for purchase in our database?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Inner Join&lt;/strong&gt;&lt;br&gt;
It can also be written simply as &lt;strong&gt;JOIN&lt;/strong&gt;. It returns only the rows that have matching values in both tables being joined.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec357qavdon1g33g52z2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec357qavdon1g33g52z2.png" alt="Image description" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9cy69vgvuc6uyynn7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9cy69vgvuc6uyynn7o.png" alt="Image description" width="731" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Left Join&lt;/strong&gt;&lt;br&gt;
It returns all rows from the left table (table1) as well as the matched rows from the right table (table2). If no match is found, NULL values are returned.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cnipk4vcbfu45aa1tb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cnipk4vcbfu45aa1tb5.png" alt="Image description" width="800" height="159"&gt;&lt;/a&gt;&lt;br&gt;
Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn90d2oyk8wjwcruag9pb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn90d2oyk8wjwcruag9pb.png" alt="Image description" width="731" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Right Join&lt;/strong&gt;&lt;br&gt;
Returns all rows from the right table (table2), and the matched rows from the left table (table1). If there is no match, NULL values are returned.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoewarhzdacfs1qa7fzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoewarhzdacfs1qa7fzj.png" alt="Image description" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrvfuaye2datl0g23ig0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrvfuaye2datl0g23ig0.png" alt="Image description" width="656" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Full Outer Join&lt;/strong&gt;&lt;br&gt;
Returns all rows from both tables, with NULL values for any unmatched rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxn6yv9qtih3fk33m3b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxn6yv9qtih3fk33m3b2.png" alt="Image description" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmj9yhdjtbnkqcbqh9l8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmj9yhdjtbnkqcbqh9l8.png" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
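&lt;p&gt;Since the queries above appear only as screenshots, here is a minimal runnable sketch of the join types using Python's built-in sqlite3 module. The table and column names are illustrative assumptions, not the article's actual schema. Note that SQLite gained native RIGHT and FULL OUTER JOIN only in version 3.39, so the sketch emulates RIGHT JOIN with a swapped LEFT JOIN for portability.&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema, loosely modeled on the article's book/category
# tables (names here are illustrative, not the real schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER);
    INSERT INTO category VALUES (1, 'Romance'), (2, 'Thriller'), (3, 'History');
    INSERT INTO book VALUES
        (1, 'The Notebook', 1),
        (2, 'Gone Girl', 2),
        (3, 'Uncatalogued Title', NULL);  -- no matching category
""")

# INNER JOIN: only rows with matching values in both tables.
inner = cur.execute("""
    SELECT b.title, c.name
    FROM book b
    JOIN category c ON b.category_id = c.category_id
""").fetchall()

# LEFT JOIN: every book, with NULL where no category matches.
left = cur.execute("""
    SELECT b.title, c.name
    FROM book b
    LEFT JOIN category c ON b.category_id = c.category_id
""").fetchall()

# RIGHT JOIN emulated by swapping the tables in a LEFT JOIN; a FULL
# OUTER JOIN can likewise be emulated as a UNION of the two LEFT JOINs.
right = cur.execute("""
    SELECT b.title, c.name
    FROM category c
    LEFT JOIN book b ON b.category_id = c.category_id
""").fetchall()

print(inner)  # 2 matched rows
print(left)   # 3 rows: the uncatalogued book carries NULL
print(right)  # 3 rows: the unmatched 'History' category carries NULL
```

&lt;p&gt;The same SQL runs unchanged on PostgreSQL; MySQL also lacks FULL OUTER JOIN and needs the UNION emulation.&lt;/p&gt;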

&lt;h3&gt;
  
  
  3. Data Filtering
&lt;/h3&gt;

&lt;p&gt;This is an important technique for data analysis that selects a subset of the data based on certain criteria. SQL uses the WHERE clause to filter data.&lt;/p&gt;

&lt;p&gt;Case 1: Filtering by a single condition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business question:&lt;br&gt;
Which Nicholas Sparks novels are available for purchase?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl309tnh1o1b1mt0ladye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl309tnh1o1b1mt0ladye.png" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswf3wz160j2632ewt9hd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswf3wz160j2632ewt9hd.png" alt="Image description" width="626" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business question:&lt;br&gt;
Which bookshops are located in Kisumu county?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqr7f1azi5sjudj6lneu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqr7f1azi5sjudj6lneu.png" alt="Image description" width="794" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqu5khzalxmzrdamk2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqu5khzalxmzrdamk2u.png" alt="Image description" width="798" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Case 2: Filtering by multiple conditions.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
Is there a bookstore called Bookworms Haven in Kisumu County?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2y1wszxvmrhbmkehcsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2y1wszxvmrhbmkehcsc.png" alt="Image description" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Result:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeyi61l5ri95a6gwx023.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeyi61l5ri95a6gwx023.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;
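&lt;p&gt;Both filtering cases can be sketched as follows, again with a hypothetical bookstore table (the store names and schema are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE bookstore (store_id INTEGER PRIMARY KEY, name TEXT, county TEXT);
    INSERT INTO bookstore VALUES
        (1, 'Bookworms Haven', 'Kisumu'),
        (2, 'Page Turners',    'Nairobi'),
        (3, 'Lakeside Books',  'Kisumu');
""")

# Case 1: a single condition in the WHERE clause.
kisumu = cur.execute("""
    SELECT name FROM bookstore
    WHERE county = 'Kisumu'
    ORDER BY store_id
""").fetchall()

# Case 2: multiple conditions combined with AND.
haven = cur.execute("""
    SELECT name FROM bookstore
    WHERE county = 'Kisumu' AND name = 'Bookworms Haven'
""").fetchall()

print(kisumu)  # [('Bookworms Haven',), ('Lakeside Books',)]
print(haven)   # [('Bookworms Haven',)]
```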

&lt;h3&gt;
  
  
  4. Data Aggregation
&lt;/h3&gt;

&lt;p&gt;This is the process of summarizing and grouping data to obtain useful insights and metrics. It is a common task in data analysis and is often used to generate reports, perform statistical analysis, and identify trends.&lt;/p&gt;

&lt;p&gt;In SQL, this is achieved via aggregate functions. Here are some common functions used for aggregation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SUM() Function&lt;/li&gt;
&lt;li&gt;AVG() Function&lt;/li&gt;
&lt;li&gt;MIN() Function&lt;/li&gt;
&lt;li&gt;MAX() Function&lt;/li&gt;
&lt;li&gt;COUNT() Function&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  SUM() Function
&lt;/h4&gt;

&lt;p&gt;It returns the total sum of a numeric column.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
 How many book copies are in stock?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lb2u91dqba2rhufunk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lb2u91dqba2rhufunk.png" alt="Image description" width="701" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sxrezbj7xa2mqkihfbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sxrezbj7xa2mqkihfbu.png" alt="Image description" width="590" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AVG() Function
&lt;/h4&gt;

&lt;p&gt;It calculates the average value of a numeric column.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
What is the average number of books in stock?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyyn5gltoojw0adiezmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyyn5gltoojw0adiezmm.png" alt="Image description" width="800" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l315kj57xh7dk0ns52u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l315kj57xh7dk0ns52u.png" alt="Image description" width="750" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  MIN() Function
&lt;/h4&gt;

&lt;p&gt;It returns the lowest (smallest) value of a numeric column.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
What is the minimum number of book copies available in a bookstore?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiv54ya323w39q94wjq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiv54ya323w39q94wjq4.png" alt="Image description" width="795" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrxmuha6nv9msi0fqz8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrxmuha6nv9msi0fqz8t.png" alt="Image description" width="782" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  MAX() Function
&lt;/h4&gt;

&lt;p&gt;It returns the largest value of a numerical column.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
What is the maximum number of book copies available in a bookstore?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz30xp98pigriwn8nww5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz30xp98pigriwn8nww5.png" alt="Image description" width="758" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggq65yipu60ciwhtvul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdggq65yipu60ciwhtvul.png" alt="Image description" width="705" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  COUNT() Function
&lt;/h4&gt;

&lt;p&gt;It counts the number of rows in a table or the number of non-null values in a column.&lt;br&gt;
&lt;strong&gt;Business question:&lt;br&gt;
  What is the count of the stock entries?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8asp048exzsbm9rizo4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8asp048exzsbm9rizo4v.png" alt="Image description" width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqnusomvcqfoa90gdby6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqnusomvcqfoa90gdby6.png" alt="Image description" width="689" height="97"&gt;&lt;/a&gt;&lt;/p&gt;
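&lt;p&gt;All five aggregate functions above can be demonstrated in a single query against a hypothetical stock table (the numbers and column names are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, book_id INTEGER, copies INTEGER);
    INSERT INTO stock VALUES (1, 1, 10), (2, 2, 4), (3, 3, 16);
""")

row = cur.execute("""
    SELECT SUM(copies),    -- total copies in stock
           AVG(copies),    -- average copies per entry
           MIN(copies),    -- smallest stock level
           MAX(copies),    -- largest stock level
           COUNT(*)        -- number of stock entries
    FROM stock
""").fetchone()

total, average, lowest, highest, entries = row
print(row)  # (30, 10.0, 4, 16, 3)
```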

&lt;h3&gt;
  
  
  5. Data Transformation
&lt;/h3&gt;

&lt;p&gt;This is the process of converting data from one format or structure to another to make it more suitable for analysis. In SQL, data transformation can be achieved using various techniques such as filtering, aggregating, joining, and grouping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business question:&lt;br&gt;
How many different book titles are available in each book category?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrrygy7gnb8rt6317s2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrrygy7gnb8rt6317s2t.png" alt="Image description" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Result:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffseojyek2su04cjh5cl6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffseojyek2su04cjh5cl6.png" alt="Image description" width="751" height="183"&gt;&lt;/a&gt;&lt;/p&gt;
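&lt;p&gt;A query like the one above can be sketched by combining GROUP BY with COUNT(DISTINCT ...) on a hypothetical book table (titles and categories are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT, category TEXT);
    INSERT INTO book VALUES
        (1, 'The Notebook', 'Romance'),
        (2, 'A Walk to Remember', 'Romance'),
        (3, 'Gone Girl', 'Thriller');
""")

# GROUP BY collapses the rows into one group per category;
# COUNT(DISTINCT title) counts the different titles within each group.
per_category = cur.execute("""
    SELECT category, COUNT(DISTINCT title) AS titles
    FROM book
    GROUP BY category
    ORDER BY category
""").fetchall()

print(per_category)  # [('Romance', 2), ('Thriller', 1)]
```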

&lt;h3&gt;
  
  
  6. Data Cleaning
&lt;/h3&gt;

&lt;p&gt;Data cleaning is an important step in preparing data for analysis in SQL. Here are some common techniques used for data cleaning in SQL:&lt;br&gt;
&lt;strong&gt;1. Removing duplicates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbks4257kfdqdoujl1la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbks4257kfdqdoujl1la.png" alt="Image description" width="785" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query will remove any duplicate rows from the table, and return only unique rows based on the columns specified in the SELECT statement.&lt;br&gt;
Result:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusr5wu1ite6ngpohndn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusr5wu1ite6ngpohndn4.png" alt="Image description" width="783" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Handling missing values:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw11h6a3fp5dd71gya5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw11h6a3fp5dd71gya5t.png" alt="Image description" width="736" height="106"&gt;&lt;/a&gt;&lt;br&gt;
This query will return only the rows that do not contain null values in the specified columns. Alternatively, one can replace null values with a default value or with values from other rows or columns.&lt;br&gt;
Result:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lc3l877ogmbc3ntsac9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lc3l877ogmbc3ntsac9.png" alt="Image description" width="728" height="347"&gt;&lt;/a&gt;&lt;/p&gt;
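&lt;p&gt;Both cleaning techniques can be sketched on a small hypothetical table; COALESCE is shown as one way to replace NULLs with a default value (all names and values here are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE raw_books (title TEXT, author TEXT);
    INSERT INTO raw_books VALUES
        ('The Notebook', 'Nicholas Sparks'),
        ('The Notebook', 'Nicholas Sparks'),  -- duplicate row
        ('Gone Girl', NULL);                  -- missing author
""")

# 1. Removing duplicates with SELECT DISTINCT.
unique_rows = cur.execute(
    "SELECT DISTINCT title, author FROM raw_books"
).fetchall()

# 2. Handling missing values: keep only complete rows...
complete = cur.execute(
    "SELECT title, author FROM raw_books WHERE author IS NOT NULL"
).fetchall()

# ...or replace NULLs with a default value using COALESCE.
filled = cur.execute(
    "SELECT title, COALESCE(author, 'Unknown') FROM raw_books"
).fetchall()

print(unique_rows)  # 2 distinct rows remain
print(filled)       # the NULL author becomes 'Unknown'
```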

&lt;p&gt;In conclusion, SQL is a valuable tool for retrieving, transforming, and analyzing large datasets stored in relational databases, making it a must-have skill for any data analyst.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
  </channel>
</rss>
