Forem: Elia

Filter Swahili SMS by categories using machine learning.

Elia — Wed, 10 Aug 2022 09:22:42 +0000

You hear a "ding" and almost fall over running to your phone, hoping to see the long-awaited SMS, only to discover it's a promotional message from brand XYZ. This can be really annoying. Promotional and spam SMS keep clogging our inboxes, and the problem gets worse over time, stealing our precious time and attention.

What can we learn from Gmail?

The problem is not new. It also exists in email, and one approach that providers like Gmail adopted with great success is grouping emails into categories by intent (promotional, social, or primary) while also filtering out fraudulent emails (spam).

Can we replicate the Gmail approach for SMS? If yes, how?

This article is centered on answering that question. We are going to learn how to classify SMS messages into categories according to their intent. You might be asking yourself how one gets to know the intent of an SMS. We are going to train a machine learning model that learns what the messages in each category have in common, and then use it to group new SMS into categories.

Data Collection and Annotations

The first step was sourcing, collecting, and annotating the SMS data used to train our machine learning model. The data was collected from multiple individual contributors using an SMS backup application, which outputs well-organized JSON containing each SMS and its details, as shown in the snippet below.

[
    {
        "_id": "7126",
        "address": "TIGO",
        "body": "Tigo inakutakia maadhimisho mema ya siku ya Muungano.",
        "date": "1619430394016",
        "errorCode": "-1",
        "locked": "0",
        "messageDirection": "INCOMING",
        "messageType": "SMS",
        "protocol": "0",
        "read": "1",
        "replyPathPresent": "0",
        "seen": "1",
        "serviceCenter": "+2557********",
        "status": "-1",
        "text": "Tigo inakutakia maadhimisho mema ya siku ya Muungano.",
        "threadId": "492",
        "type": "1"
    },
    ...
  ]
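A minimal sketch of how one record from such an export could be read with Python's standard library. It assumes the `date` field is a Unix timestamp in milliseconds (common for Android SMS backups, but an assumption here) and uses a shortened copy of the record above. `flatten_record` is a hypothetical helper, not part of our original pipeline.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Shortened copy of the record shown above (only the fields used later).
raw_export = """
[
    {
        "_id": "7126",
        "address": "TIGO",
        "date": "1619430394016",
        "messageDirection": "INCOMING",
        "text": "Tigo inakutakia maadhimisho mema ya siku ya Muungano."
    }
]
"""

def flatten_record(record):
    """Keep the useful fields and turn the timestamp into a readable date.

    Assumption: `date` holds milliseconds since the Unix epoch.
    """
    sent_at = datetime.fromtimestamp(int(record["date"]) / 1000, tz=timezone.utc)
    return {
        "address": record["address"],
        "text": record["text"],
        "messageDirection": record["messageDirection"],
        "date": sent_at.strftime("%Y-%m-%d %H:%M:%S"),
    }

rows = [flatten_record(r) for r in json.loads(raw_export)]

# Write the flattened rows as CSV (an in-memory buffer here, a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["address", "text", "messageDirection", "date"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Under that assumption, this record was sent on 2021-04-26 (UTC), which is Union Day in Tanzania and matches the message text.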

We then annotated our data into distinct categories based on the context and intent of the text messages. These are the categories we came up with:

Promotional
Notification
Transaction
Sports Bettings
Michezo ya Bahati Nasibu (General gambling SMS)
Survey
Verification
Informational
Personal
SPAM

We then exported our data into CSV format, ready for crunching. Where is the data? We can't share it for now because some of the SMS contain personally identifiable information. We are currently cleaning the data and ensuring it is of good quality, and will then share it through our GitHub repository.

Here we go

Now that you have a bit of background on the data we are going to use to train our model, let's get our hands dirty. We will break the task down into three steps: data preprocessing, training the machine learning model, and model evaluation.

Data Preprocessing

Data preprocessing is a way of converting raw data into a format that can be easily parsed by a machine learning model. We need to preprocess our dataset before we can train our model. But first, let's read the dataset and view its structure with the help of the pandas library.

import pandas as pd

data = pd.read_csv('./raw sms data/data.csv')
data.head()

As we can see, the dataset has a number of columns. Let's start by exploring the messageDirection values our data has:

data["messageDirection"].value_counts()

# Output: INCOMING    3384  "There are 3384 incoming messages"
#         OUTGOING      62

The data consists of both OUTGOING and INCOMING SMS, but given the nature of our task, our primary interest lies in the incoming messages only. We therefore keep only the rows whose messageDirection is INCOMING, along with the columns we need.

incoming_sms = data[data["messageDirection"] == "INCOMING"]

interested_data = incoming_sms[['address', 'text', 'label']]

Examining Label distribution

Examining the distribution of labels is crucial because it can reveal information about how well your model will work with a particular label.

As you can see, most of our messages are labeled "NOTIFICATION", while "SPAM" messages are the fewest, which means our dataset is not balanced.
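A minimal sketch of how a label distribution can be counted, using made-up labels rather than our real dataset. With the DataFrame above, `interested_data['label'].value_counts()` gives the same kind of counts.

```python
from collections import Counter

# Made-up labels for illustration only; these are NOT the real dataset counts.
labels = ["NOTIFICATION"] * 6 + ["PROMOTIONAL"] * 3 + ["SPAM"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Print labels from most to least frequent with their share of the data.
for label, count in counts.most_common():
    print(f"{label:<13} {count:>3}  {count / total:.0%}")

# A simple imbalance indicator: how many times larger the biggest class is than the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The larger the ratio, the more likely the model is to favour the big classes and miss the small ones.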

Let's also remove duplicates from our dataset.

texts = interested_data['text'].tolist()
ids = interested_data.index.tolist()
dirty_dict = dict(zip(ids, texts))

cleaned_dict = {}
used_texts = []
for id, text in dirty_dict.items():
    if text in used_texts:
        continue
    cleaned_dict[id] = text
    used_texts.append(text)

ids = list(cleaned_dict.keys())
print(len(ids))

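# Keep only the rows of interested_data whose index is in ids.
# .copy() gives an independent DataFrame, so later column assignments don't raise SettingWithCopyWarning.
cleaned_incoming_sms = interested_data[interested_data.index.isin(ids)].copy()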

len(cleaned_incoming_sms)
# Output: 1920

Our data has been reduced from 3384 to 1920 messages, which means about 43% of the dataset was duplicates, but this is still an 'okay' amount of data to train our model.
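The loop above checks every text against a list, which gets slower as the data grows. Here is a sketch of the same first-occurrence rule using a set, run on made-up messages; `drop_duplicate_texts` is a hypothetical helper name. With pandas, `interested_data.drop_duplicates(subset='text', keep='first')` is another way to get a similar result in one call.

```python
def drop_duplicate_texts(id_to_text):
    """Return a dict keeping only the first id seen for each distinct text."""
    seen = set()  # set membership checks are O(1) on average
    unique = {}
    for sms_id, text in id_to_text.items():
        if text in seen:
            continue
        seen.add(text)
        unique[sms_id] = text
    return unique

# Made-up messages keyed by row index, for illustration only.
messages = {
    0: "Umepokea Tsh 5000",
    1: "Shinda zawadi leo",
    2: "Umepokea Tsh 5000",   # duplicate of row 0
    3: "Shinda zawadi leo",   # duplicate of row 1
    4: "Salio lako ni dogo",
}

unique_messages = drop_duplicate_texts(messages)
print(list(unique_messages))  # [0, 1, 4]
```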

Now let's get a good look at our data by visualizing it as a word cloud. Before that, we need to remove a few stopwords. Then we can see how often words are used in the texts of each category.

# Removing stop words
stopwords = ["na", "ya", "wa", "kwa", "pia", "kisha", "au"]

cleaned_incoming_sms['text'] = cleaned_incoming_sms['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))

cleaned_incoming_sms.loc[4:9]

The result above shows that cleaned_incoming_sms is still not particularly clean. We need to put in some extra effort:

Make all of the text lowercase.
Remove all of the non-alphanumeric characters like ",", "+", "%", "!", ":"
Remove all numbers in the text messages.

import re

# Clean the texts
def clean_text(text):
    text = text.lower()  # convert text to lowercase
    text = re.sub('[‘’“”…,]', '', text)  # drop curly quotes, ellipses, and commas
    text = re.sub('[()]', '', text)  # drop parentheses
    text = re.sub('[^a-zA-Z]', ' ', text)  # replace every remaining non-letter (digits, punctuation) with a space
    text = re.sub(' +', ' ', text)  # collapse runs of spaces into one
    return text

cleaned_incoming_sms['text'] = cleaned_incoming_sms['text'].apply(clean_text)
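To see what clean_text does to a single message, here is a small self-contained check on a made-up SMS (the function is repeated so the example runs on its own).

```python
import re

def clean_text(text):
    # Same steps as the function above, repeated so this example runs on its own.
    text = text.lower()
    text = re.sub('[‘’“”…,]', '', text)
    text = re.sub('[()]', '', text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = re.sub(' +', ' ', text)
    return text

sample = "Umepokea Tsh 5,000 kutoka kwa JOHN!"  # made-up message
cleaned = clean_text(sample)
print(repr(cleaned))  # 'umepokea tsh kutoka kwa john '
```

Note the trailing space left where the "!" was. It does not matter for the vectorizer used later, and `cleaned.strip()` removes it if needed.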

All of our texts are clean now, so we can start training our model.

Training Machine Learning Model

We are going to use the scikit-learn library, which provides all the tools we need to train our model. Let's import the required tools and train our model.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(cleaned_incoming_sms['text'], cleaned_incoming_sms['label'], test_size=0.2, random_state=42)

pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, max_features=1000, stop_words=stopwords),
    RandomForestClassifier(n_estimators=100, random_state=42)
)

pipeline.fit(x_train, y_train)
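To give a feel for what the TfidfVectorizer step in the pipeline computes, here is a rough pure-Python sketch of TF-IDF weighting on made-up messages. It follows the smoothed IDF formula and L2 normalisation that scikit-learn documents as its defaults, but it splits on whitespace only, so it is a simplification and not a replacement for the real vectorizer. `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter

# Made-up toy messages, for illustration only.
docs = [
    "umepokea pesa kutoka benki",
    "shinda pesa leo",
    "umepokea salio",
]

def tfidf_vectors(documents):
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many messages does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_freq.items()}
        # L2-normalise so long and short messages are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf_vectors(docs)
for term, weight in sorted(vectors[1].items()):
    print(f"{term:<8} {weight:.3f}")
```

In the second message, "pesa" gets a lower weight than "shinda" or "leo" because it also appears in another message, which makes it less distinctive.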

Since our dataset is not too large, the model will finish training in a very short time. Once training finishes, we can check its score on the test data.

pipeline.score(x_test, y_test)

# Output: 0.9380530973451328

As you can see, our model reaches an accuracy of about 94% on the test data, which is quite good.
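On an unbalanced dataset, a high overall accuracy can hide weak results on the small labels, so it helps to look at each label separately. Below is a minimal sketch with made-up predictions (not our model's results); `recall_per_label` is a hypothetical helper. With scikit-learn, `sklearn.metrics.classification_report(y_test, pipeline.predict(x_test))` prints per-label precision and recall for the real model.

```python
from collections import Counter

def recall_per_label(y_true, y_pred):
    """Fraction of each true label that the model predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up predictions, for illustration only (not the article's results).
y_true = ["NOTIFICATION"] * 8 + ["SPAM"] * 2
y_pred = ["NOTIFICATION"] * 8 + ["NOTIFICATION", "SPAM"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.0%}")
for label, recall in recall_per_label(y_true, y_pred).items():
    print(f"{label:<13} recall: {recall:.0%}")
```

In this toy case the accuracy is 90%, yet only half of the spam messages are caught.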

Testing our model

Let's save our model for later use and then we will import it again into another file to test it with some other messages.

import joblib
joblib.dump(pipeline, './pipeline.pkl')

NOTE: Before we test our model on new messages, we must pass them through the clean_text function so they are preprocessed the same way as the training data (non-alphanumeric characters and numbers removed, etc.).

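import joblib

# clean_text must be defined or imported in this file as well
pipeline = joblib.load('./pipeline.pkl')

with open('test_data.txt', "r") as f:
    # one message per line; skip blank lines and drop trailing newlines
    test_data = [line.strip() for line in f if line.strip()]

for text in test_data:
    print(f"Text: {text} Prediction: {pipeline.predict([clean_text(text)])[0]}")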

Results

We tested our model with 14 messages that it has never seen before. As you can see from the result above, most of the messages in the test data were "SPAM" messages, but the model couldn't pick up most of them because there were few spam messages in the training data.

The model also didn't perform well on the "PROMOTIONAL" label because, after removing duplicate messages from our dataset, the label distribution changed a lot.

Label distribution after removing duplicates

Conclusion

Any model's performance is strongly influenced by the quality and size of its dataset. We couldn't access a large dataset, but by spending more time thoroughly cleaning our training data, we can try to improve the accuracy of our model. Furthermore, we can tune some parameters before training, or experiment with alternative machine learning classifiers such as Decision Tree or SVM, to improve the performance of our model.
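As a starting point for that kind of experiment, here is a hedged sketch that compares two alternative classifiers with cross-validation. The data is made up and far too small to say anything about real performance, and the choice of `class_weight="balanced"` for the SVM is our assumption about what might help with unbalanced labels, not a tested result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data, for illustration only.
texts = [
    "shinda pesa leo", "shinda zawadi kubwa", "cheza ushinde pesa", "zawadi kubwa leo",
    "umepokea pesa kutoka benki", "salio lako ni dogo", "umepokea salio", "malipo yamepokelewa benki",
]
labels = ["PROMOTIONAL"] * 4 + ["NOTIFICATION"] * 4

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "linear SVM": LinearSVC(class_weight="balanced"),
}

scores = {}
for name, classifier in candidates.items():
    # Same pipeline shape as before: vectorize the text, then classify.
    model = make_pipeline(TfidfVectorizer(), classifier)
    # Mean accuracy over 2 stratified folds.
    scores[name] = cross_val_score(model, texts, labels, cv=2).mean()
    print(f"{name}: {scores[name]:.2f}")
```

On the real dataset, the same loop can be run with cleaned_incoming_sms['text'] and cleaned_incoming_sms['label'] and a larger number of folds.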

Thank you.

4 Popular Natural Language Processing Techniques

Elia — Wed, 10 Aug 2022 08:55:11 +0000

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Source: Wikipedia

You have most likely used NLP in one way or another. If you have ever contacted a business through messages and got an immediate reply, it was probably NLP at work. Or perhaps you have just gotten home from work, filled your cup with coffee, and asked Siri to play some relaxing seaside sounds. Without a doubt, you were using NLP.

Human language is very complex, filled with sarcasm, idioms, metaphors, and grammar to mention a few. All of these make it difficult for computers to easily grasp the intended meaning of a certain sentence.

Take an example of a sarcastic conversation:

John is sewing clothes with his eyes closed.
Martin: John, what are you doing? You're going to hurt yourself.
John: No, I won't.
After a few moments, John accidentally pricks himself with the needle.
Martin: Well, what a surprise.

With natural language processing (NLP) techniques, we can break down human texts and sentences and process them so that computers can understand what's happening. In this article, we are going to learn, with examples, about the most common techniques and how they're applied. We will look at:

Sentiment Analysis
Text Classification
Text Summarization
Named Entity Recognition

Sentiment Analysis

Most businesses want to know what their customers think of their services or products, but there may be millions of pieces of customer feedback. Analyzing all of it by hand is painful and boring, even if you are offered a large amount of money to do it. Sentiment analysis can be useful in this situation.

Sentiment analysis is a natural language processing technique used to determine whether textual data expresses a positive, negative, or neutral sentiment.

Businesses also use sentiment analysis to determine whether a customer's comment indicates any interest in the product or service. Sentiment analysis can be developed further to detect the mood of the text (sad, furious, or excited).

Source: revechat.com

Use case

To accomplish this, let's use the Hugging Face transformers library. We are going to use a pre-trained model from the Hugging Face model hub called "distilbert-base-uncased-finetuned-sst-2-english".

# Let's first install the transformers library

$ pip install transformers

Once the library is installed, completing the task is quite simple.

from transformers import pipeline
analyser = pipeline("sentiment-analysis")

The above code imports the library and, because no model is named, uses the pipeline's default pre-trained model for sentiment analysis, which is the DistilBERT model mentioned above.

user_comment = "The product is very useful. It have helped me alot."
result = analyser(user_comment)
print(result)

# Output: [{'label': 'POSITIVE', 'score': 0.9997726082801819}]

The output shows that the sentiment of the user's comment is POSITIVE, and the model is about 99.98% confident.
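The pipeline returns a list with one dictionary per input text. A small sketch of turning that structure into a readable line, using the output shown above pasted in as plain data so it runs without downloading a model; `describe_sentiment` is a hypothetical helper.

```python
def describe_sentiment(results):
    """Format pipeline-style output as one 'LABEL (xx.xx%)' string per input."""
    return [f"{item['label']} ({item['score']:.2%})" for item in results]

# The output shown above, pasted in as plain data.
result = [{'label': 'POSITIVE', 'score': 0.9997726082801819}]
print(describe_sentiment(result))  # ['POSITIVE (99.98%)']
```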

Text Classification

Text classification, also known as text categorization, is a natural language processing technique that analyses textual data and assigns it to a predefined category.

Spam emails occasionally arrive in your mailbox, and if you click a link in one of them, your computer may become infected with malware. Therefore, practically all email service providers employ this NLP technique to classify each email as either spam or not spam.

To effectively categorize your incoming emails, text classifiers are trained using a lot of spam and non-spam email data.

Use case

Let's try to create a simple text classifier to classify whether the text we input is spam. We are going to use the TextBlob library to achieve this.

Let's create some training data to train our own classifier.

train = [
    ('Congratulation you won a your prize', 'spam'),
    ('URGENT You have won a 1 week FREE membership in our 100000 Prize Jackpot', 'spam'),
    ('SIX chances to win CASH From 100 to 20000 pounds ', 'spam'),
    ('WINNER As a valued network customer you have been selected to receive 900 prize reward', 'spam'),
    ("Free entry in 2 a weekly competition to win FA Cup final tickets 21st May 2005. Text FA to 87121 to receive", 'spam'),
    ('I do not like this restaurant', 'no-spam'),
    ('I am tired of this stuff.', 'no-spam'),
    ("I can't deal with this", 'no-spam'),
    ('he is my sworn enemy!', 'no-spam'),
    ('my boss is horrible.', 'no-spam'),
    ('This job is bad', 'no-spam')
]

Now, let's import our classifier from the TextBlob library and train it with the training data we created. We are going to use NaiveBayesClassifier.

from textblob.classifiers import NaiveBayesClassifier

classifier = NaiveBayesClassifier(train)

After our training is complete (which might take less than two seconds, depending on your computer), we will input our text to see if it works.

classifier.classify("Congratulation you won a free prize of 20000 dollars and Iphone 13")

# Output: 'spam'

Our simple model correctly identified our message as "spam," which it is.
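To give an intuition for what a Naive Bayes text classifier does, here is a tiny from-scratch sketch trained on a few of the examples above. It is a simplified word-count variant with add-one smoothing, written for illustration; it is not how TextBlob implements its classifier, and `train_naive_bayes` and `classify` are hypothetical helpers.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of training texts
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per known word.
        score = math.log(label_counts[label] / total_texts)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen label/word pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# A few of the training examples from above.
train_subset = [
    ('Congratulation you won a your prize', 'spam'),
    ('SIX chances to win CASH From 100 to 20000 pounds ', 'spam'),
    ('I do not like this restaurant', 'no-spam'),
    ('This job is bad', 'no-spam'),
]

model = train_naive_bayes(train_subset)
print(classify("you won a prize", model))   # spam
print(classify("this job is bad", model))   # no-spam
```

Each label's score is the sum of the log probabilities of the words under that label, and the label with the highest score wins.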

Text Summarization

Text summarization is a natural language processing technique for producing a shorter version of a long piece of text.

Let's imagine that while you are dozing off, your boss sends you a message telling you to read a specific document. When you check it, the document is ten pages long. For you, text summarization might be a ground-breaking concept.

Many text summarization models pull the most crucial sentences out of a document and include them in the final text (extractive summarization). Others go further and try to express the meaning of the lengthy text in their own words (abstractive summarization).
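The first approach, picking out the most important sentences, can be sketched in a few lines. This is a naive frequency-based illustration, not what the transformers model used below does: it scores each sentence by how common its words are in the whole text and keeps the top ones. `extractive_summary` is a hypothetical helper.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, in their original order."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]

text = (
    "The Solar System is the gravitationally bound system of the Sun and the objects that orbit it. "
    "It formed 4.6 billion years ago. "
    "The vast majority of the system's mass is in the Sun."
)
summary = extractive_summary(text, n_sentences=2)
print(summary)
```

Real summarizers are far more sophisticated, but the output here is still built purely from sentences that already exist in the input.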

Use case

For this, we'll also make use of the transformers library.

from transformers import pipeline

Then we are going to use the "summarization" pipeline to summarize our long text.

summarizer = pipeline("summarization")

# If you don't have the summarization model on your machine, it will be downloaded from the internet.

The lengthy text can then be copied and pasted from anywhere for summarization.

long_text = """
    The Solar System is the gravitationally bound system of the Sun and the objects that orbit it. It formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud. The vast majority (99.86%) of the system's mass is in the Sun, with most of the remaining mass contained in the planet Jupiter. The four inner system planets—Mercury, Venus, Earth and Mars—are terrestrial planets, being composed primarily of rock and metal. The four giant planets of the outer system are substantially larger and more massive than the terrestrials. The two largest, Jupiter and Saturn, are gas giants, being composed mainly of hydrogen and helium; the next two, Uranus and Neptune, are ice giants, being composed mostly of volatile substances with relatively high melting points compared with hydrogen and helium, such as water, ammonia, and methane. All eight planets have nearly circular orbits that lie near the plane of Earth's orbit, called the ecliptic.
"""

# The optional max_length parameter caps the length of the generated summary (measured in tokens, not words)

summarizer(long_text, max_length=80)

# Output: [{'summary_text': " The Solar System formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud . The vast majority (99.86%) of the system's mass is in the Sun, with most of the remaining mass contained in the planet Jupiter . The four inner system planets are terrestrial planets, being composed primarily of rock and metal ."}]

Named Entity Recognition (NER)

Have you ever heard fanciful tales about how a particular firm listens in on all calls, chats, and online interactions to see what people are saying about it? Well, if this is true, one of its strategies might be named entity recognition, because NER is a natural language processing technique that identifies and classifies named entities in text data.

Named entities are simply real-world objects such as a person, organization, location, or product. For example, an NER model can identify ‘Dar-es-salaam’ as a location or ‘Michael’ as a person's name.

Source: shaip.com

Use case

We will use the spaCy library for this task. We need to install it and download a pre-trained English model to help us achieve our task faster.

$ pip install -U spacy

# Then download the model
$ python -m spacy download en_core_web_sm

We are going to import the spacy library and then load the model we downloaded so we can perform our task.

import spacy

nlp = spacy.load("en_core_web_sm")

After loading our model, we can simply input our text and spaCy will give us the named entities present in it.

doc = nlp("The ISIS has claimed responsibility for a suicide bomb blast in the Tunisian capital earlier this week.")

for ent in doc.ents:
    print(ent.text, ent.label_)

# Highlight the entities in the text (displays inline in a Jupyter notebook)
from spacy import displacy
displacy.render(doc, style="ent")

# Output: ISIS ORG
#         Tunisian NORP
#         earlier this week DATE

The output shows different entities detected by spaCy with their respective labels.
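A common next step is grouping the detected entities by label. Here is a small sketch that uses the pairs printed above as plain data so it runs without spaCy installed; with a live doc, the same pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs as printed above.
entities = [("ISIS", "ORG"), ("Tunisian", "NORP"), ("earlier this week", "DATE")]

# Collect all entity texts under their label.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'ORG': ['ISIS'], 'NORP': ['Tunisian'], 'DATE': ['earlier this week']}
```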

NOTE: If you don't understand the meaning of a label in spaCy, you can use spacy.explain() to explain it.

# Let's say you don't understand the meaning of the label "ORG"

spacy.explain("ORG")

# Output: 'Companies, agencies, institutions, etc.'

The good news is that it's simple to get started with these techniques nowadays. Large language models like Google's LaMDA and GPT-3 are available to aid in NLP tasks. You can easily build helpful natural language processing projects with tools like spaCy and Hugging Face.

Thanks.