<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michelle Njuguna</title>
    <description>The latest articles on Forem by Michelle Njuguna (@michellenjeriscientist).</description>
    <link>https://forem.com/michellenjeriscientist</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1869257%2Fe226c0b4-a2e4-4a11-ba79-e157ccf575e0.jpg</url>
      <title>Forem: Michelle Njuguna</title>
      <link>https://forem.com/michellenjeriscientist</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michellenjeriscientist"/>
    <language>en</language>
    <item>
      <title>Classification Metrics</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Mon, 10 Mar 2025 20:21:52 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/classification-metrics-l0m</link>
      <guid>https://forem.com/michellenjeriscientist/classification-metrics-l0m</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In machine learning, classification is the task of predicting the class to which input data belongs. One example would be to classify whether the text from an email (input data) is spam (one class) or not spam (another class).&lt;/p&gt;

&lt;p&gt;When building a classification system, we need a way to evaluate the performance of the classifier. And we want to have evaluation metrics that are reflective of the classifier’s true performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Classification Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Accuracy: &lt;br&gt;
This is the simplest and most intuitive metric, measuring the proportion of correct predictions. Use it when the dataset is balanced and you want a simple overall measure.&lt;br&gt;
It’s calculated as (True Positives + True Negatives) / Total Predictions.&lt;br&gt;
Example: A sample class has 80% males and 20% females. A model that simply predicts “male” for every student achieves 80% accuracy without learning anything, which is why accuracy can be misleading on imbalanced data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Precision: &lt;br&gt;
Precision, also known as positive predictive value, measures the proportion of true positive predictions among all positive predictions. It helps answer the question: “Of all the positive predictions made by the model, how many were correct?” The formula for precision is True Positives / (True Positives + False Positives).&lt;br&gt;
It is used where false positives are costly. &lt;br&gt;
Example: Fraud detection, where every transaction flagged as fraudulent triggers a costly investigation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recall:&lt;br&gt;
Recall, or sensitivity, gauges the proportion of true positive predictions among all actual positive instances. It answers the question: “Of all the actual positive instances, how many did the model correctly predict?” The formula for recall is True Positives / (True Positives + False Negatives).&lt;br&gt;
It is used where false negatives are costly.&lt;br&gt;
Example: Sensitive medical cases, such as a patient who has cancer testing negative (a false negative).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;F1 Score: &lt;br&gt;
The F1 score is the harmonic mean of precision and recall. It provides a balance between these two metrics, giving you a single value that considers both false positives and false negatives. The formula for the F1 score is 2 * (Precision * Recall) / (Precision + Recall).&lt;br&gt;
Used when we have an imbalanced dataset.&lt;br&gt;
Example: Sentiment Analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
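
&lt;p&gt;The four formulas above can be sketched directly in Python; the label arrays below are made-up toy data, used only for illustration:&lt;/p&gt;

```python
# Direct implementation of the metric formulas, using only the standard library.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual classes (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions (toy data)

# Count the four outcome types.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

&lt;p&gt;In practice a library such as scikit-learn (accuracy_score, precision_score, recall_score, f1_score) computes the same values for you.&lt;/p&gt;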

&lt;p&gt;5. Specificity: &lt;br&gt;
Specificity, also known as the true negative rate, measures the proportion of true negative predictions among all actual negative instances. It’s calculated as True Negatives / (True Negatives + False Positives).&lt;br&gt;
The focus is on correctly identifying negatives and minimizing false alarms.&lt;br&gt;
Example: Screening healthy people for a rare disease, where we want those without the disease to be correctly cleared rather than subjected to false alarms.&lt;/p&gt;

&lt;p&gt;6. Confusion Matrix: &lt;br&gt;
While not a single metric, the confusion matrix is a table that summarizes the model’s performance. It includes values for true positives, true negatives, false positives, and false negatives, providing a detailed view of classification results.&lt;/p&gt;
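
&lt;p&gt;For a binary problem the confusion matrix is easy to build by hand; the label arrays below are the same made-up toy data as before:&lt;/p&gt;

```python
# Build a 2x2 confusion matrix, indexed as matrix[actual][predicted].
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Layout: [[TN, FP], [FN, TP]]
print(matrix)
```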

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Terms to remember:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True Positives: It is the case where we predicted Yes and the real output was also yes.&lt;/li&gt;
&lt;li&gt;True Negatives: It is the case where we predicted No and the real output was also No.&lt;/li&gt;
&lt;li&gt;False Positives: It is the case where we predicted Yes but it was actually No.&lt;/li&gt;
&lt;li&gt;False Negatives: It is the case where we predicted No but it was actually Yes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;7. AUC-ROC: &lt;br&gt;
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The Area Under the Curve (AUC) of the ROC curve quantifies the model’s ability to distinguish between positive and negative classes.&lt;br&gt;
It focuses on the trade-off between the true positive rate and the false positive rate, which is particularly important when dealing with imbalanced datasets. &lt;/p&gt;
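
&lt;p&gt;One way to see what AUC measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch with made-up scores:&lt;/p&gt;

```python
# AUC from pairwise comparisons of scores (standard library only).
y_true   = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

pos = [s for y, s in zip(y_true, y_scores) if y == 1]
neg = [s for y, s in zip(y_true, y_scores) if y == 0]

# A positive "wins" a pair if it outscores the negative; ties count half.
wins = sum(1.0 if p_s > n_s else 0.5 if p_s == n_s else 0.0
           for p_s in pos for n_s in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.75
```

&lt;p&gt;For real work, scikit-learn’s roc_auc_score gives the same result from labels and scores.&lt;/p&gt;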

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;True Positive Rate:&lt;br&gt;
Also termed sensitivity. The true positive rate is the proportion of positive data points that are correctly classified as positive, with respect to all data points that are actually positive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;True Negative Rate:&lt;br&gt;
Also termed specificity. The true negative rate is the proportion of negative data points that are correctly classified as negative, with respect to all data points that are actually negative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;False Positive Rate:&lt;br&gt;
The false positive rate is the proportion of actual negatives that are incorrectly classified as positives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;8. Matthews Correlation Coefficient (MCC):&lt;/p&gt;

&lt;p&gt;The Matthews Correlation Coefficient is a metric that takes into account true positives, true negatives, false positives, and false negatives to provide a balanced measure of classification performance. It ranges from -1 (total disagreement) to 1 (perfect agreement), with 0 indicating no better than random chance.&lt;/p&gt;
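
&lt;p&gt;MCC can be computed straight from its definition; the outcome counts below are made up for illustration:&lt;/p&gt;

```python
import math

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
tp, tn, fp, fn = 2, 0, 1, 1  # toy counts from a 4-sample prediction

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(mcc, 3))
```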

&lt;p&gt;9. Log Loss (Logarithmic Loss):&lt;/p&gt;

&lt;p&gt;Log Loss, or Cross-Entropy Loss, is a metric used to evaluate probabilistic classifiers. It quantifies how well the predicted probabilities match the actual class labels; lower log loss values indicate better model performance.&lt;br&gt;
It also works well with multi-class classification. &lt;br&gt;
To compute log loss, the classifier must assign a probability to every class for each sample. If there are N samples across M classes, the log loss is calculated as follows:&lt;/p&gt;

&lt;p&gt;Logarithmic Loss = -(1/N) × Σi Σj [ yij × log(pij) ], summing i over the N samples and j over the M classes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;yij indicates whether sample i belongs to class j (1 if it does, 0 otherwise).&lt;/li&gt;
&lt;li&gt;pij: the predicted probability that sample i belongs to class j.&lt;/li&gt;
&lt;li&gt;The range of log loss is [0, ∞). A log loss near 0 indicates high accuracy; the further it is from zero, the lower the accuracy.&lt;/li&gt;
&lt;li&gt;Minimizing log loss gives you a more accurate classifier.&lt;/li&gt;
&lt;/ul&gt;
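
&lt;p&gt;The formula above translates directly into code; the one-hot labels and probabilities below are made up for illustration:&lt;/p&gt;

```python
import math

def log_loss_manual(y_true, probs):
    # y_true[i][j] is 1 if sample i belongs to class j, else 0;
    # probs[i][j] is the predicted probability of that class.
    total = 0.0
    for yi, pi in zip(y_true, probs):
        for yij, pij in zip(yi, pi):
            if yij:
                total += math.log(pij)
    return -total / len(y_true)

y_true = [[1, 0], [0, 1], [1, 0]]              # one-hot labels, 3 samples
probs  = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # predicted probabilities
print(round(log_loss_manual(y_true, probs), 4))
```

&lt;p&gt;scikit-learn’s log_loss computes the same quantity from labels and probability arrays.&lt;/p&gt;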

&lt;p&gt;&lt;strong&gt;Regression Evaluation Metrics&lt;/strong&gt;&lt;br&gt;
A regression model predicts a target variable that takes continuous values. To evaluate the performance of such a model, the evaluation metrics listed below are used:&lt;/p&gt;

&lt;p&gt;1. Mean Absolute Error (MAE)&lt;br&gt;
It is the average absolute distance between predicted and actual values, showing how far, on average, our predictions deviate from the actual output. However, it has one limitation: it gives no idea about the direction of the error, i.e. whether we are under-predicting or over-predicting.&lt;/p&gt;

&lt;p&gt;2. Mean Squared Error (MSE)&lt;br&gt;
It is similar to mean absolute error, but it takes the average of the squared differences between predicted and actual values. Its main advantage is that the gradient is easier to calculate, whereas computing the gradient of mean absolute error requires more complicated tools. Because squaring pronounces larger errors more than smaller ones, MSE lets us focus on the larger errors.&lt;/p&gt;

&lt;p&gt;3. Root Mean Square Error (RMSE)&lt;br&gt;
RMSE is obtained by simply taking the square root of the MSE value. Like MSE, it is not robust to outliers, since it gives higher weight to large errors in predictions.&lt;/p&gt;

&lt;p&gt;4. Root Mean Squared Logarithmic Error (RMSLE)&lt;br&gt;
Sometimes the target variable spans a wide range of values, and we do not want to penalize overestimates of the target as heavily as underestimates. For such cases, RMSLE is used as the evaluation metric, since it penalizes underestimation more than overestimation.&lt;/p&gt;

&lt;p&gt;5. R² Score&lt;br&gt;
The coefficient of determination, also called the R² score, is used to evaluate the performance of a linear regression model. It is the proportion of variation in the dependent (output) attribute that is predictable from the independent input variable(s), and it is used to check how well observed results are reproduced by the model.&lt;/p&gt;
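
&lt;p&gt;The regression metrics above, computed by hand on made-up predictions:&lt;/p&gt;

```python
import math

# Toy actual and predicted values (made up for illustration).
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
mae  = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mse  = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```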

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing the right classification metric depends on the problem you're solving. Accuracy is useful when classes are balanced, but in cases where false positives or false negatives have serious consequences, metrics like precision, recall, and F1 score are more reliable. The confusion matrix provides a detailed breakdown, while AUC-ROC helps evaluate model performance across different thresholds. Other metrics like MCC and Log Loss offer deeper insights, especially for imbalanced datasets or probabilistic models.&lt;/p&gt;

&lt;p&gt;For regression models, MAE, MSE, RMSE, and R²-score help evaluate predictions of continuous values. Each metric has its strengths, and selecting the right one ensures your model is not just working, but working well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification helps categorize data into classes.&lt;/li&gt;
&lt;li&gt;Common evaluation metrics include Accuracy, Precision, Recall, F1 Score, Specificity, AUC-ROC, and MCC.&lt;/li&gt;
&lt;li&gt;The choice of metric depends on the nature of the dataset and the impact of errors.&lt;/li&gt;
&lt;li&gt;Regression models predict continuous values and use metrics like MAE, MSE, RMSE, and R²-score for evaluation.&lt;/li&gt;
&lt;li&gt;Understanding these metrics ensures better model performance and more reliable results.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Chi-square Tests &amp; Calculating Degrees of Freedom.</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Mon, 10 Mar 2025 20:18:44 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/chi-square-tests-calculating-degrees-of-freedom-24gb</link>
      <guid>https://forem.com/michellenjeriscientist/chi-square-tests-calculating-degrees-of-freedom-24gb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's learn Chi square tests with a twist. Are you a fan of Formula one? If yes then this will be a perfect article for you, if not it will help you learn a lot so don't be scared.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's a chi square test?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From Wikipedia, the &lt;strong&gt;Chi-Square test&lt;/strong&gt; is a non-parametric statistical procedure for determining the difference between observed and expected data. &lt;br&gt;
It can also be used to decide whether the data correlates with our categorical variables, and thus helps determine whether a difference between two categorical variables is due to chance or to a relationship between them.&lt;/p&gt;

&lt;p&gt;It is one of the most widely used techniques for hypothesis testing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Chi-square tests&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;The chi-square goodness of fit test&lt;/em&gt; is used to test whether the frequency distribution of a categorical variable is different from your expectations.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;The chi-square test of independence&lt;/em&gt; is used to test whether two categorical variables are related to each other.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The two types of Pearson’s chi-square tests test whether the observed frequency distribution of a categorical variable is significantly different from its expected frequency distribution. &lt;br&gt;
A frequency distribution describes how observations are distributed between different groups.&lt;br&gt;
&lt;em&gt;Frequency distributions&lt;/em&gt; are often displayed using frequency distribution tables. A frequency distribution table shows the number of observations in each group. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use a chi-square test&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Testing a hypothesis about one or more categorical variables. If one or more of your variables is quantitative, you should use a different statistical test. &lt;/li&gt;
&lt;li&gt;The sample was randomly selected from the population.&lt;/li&gt;
&lt;li&gt;There are a minimum of five observations expected in each group or combination of groups.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;How to Solve Chi-Square Problems?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;State the Hypotheses&lt;br&gt;
Null hypothesis (H0): There is no association between the variables&lt;br&gt;
Alternative hypothesis (H1): There is an association between the variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the Expected Frequencies&lt;br&gt;
Use the formula: E = (Row Total × Column Total) / Grand Total&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute the Chi-Square Statistic&lt;br&gt;
Use the formula: χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the Degrees of Freedom (df)&lt;br&gt;
Use the formula: df = (number of rows − 1) × (number of columns − 1)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find the Critical Value and Compare&lt;br&gt;
Use the chi-square distribution table to find the critical value for the given df and significance level (usually 0.05).&lt;br&gt;
Compare the chi-square statistic to the critical value to decide whether to reject the null hypothesis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
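
&lt;p&gt;The steps above can be coded directly. The 2×2 table is made up: say the rows are two tire strategies and the columns are podium / no-podium finishes:&lt;/p&gt;

```python
# Chi-square test of independence, step by step (standard library only).
observed = [[30, 10],
            [20, 40]]  # made-up contingency table

rows  = [sum(r) for r in observed]        # row totals
cols  = [sum(c) for c in zip(*observed)]  # column totals
grand = sum(rows)

# Step 2: expected frequency E = (row total * column total) / grand total
expected = [[rt * ct / grand for ct in cols] for rt in rows]

# Step 3: chi-square statistic (without Yates' continuity correction)
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# Step 4: degrees of freedom
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)
```

&lt;p&gt;In practice, scipy.stats.chi2_contingency performs the same computation (by default it applies Yates’ continuity correction for 2×2 tables, so its statistic differs slightly). The final step is comparing the statistic to the chi-square critical value for df at the chosen significance level.&lt;/p&gt;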

&lt;p&gt;&lt;strong&gt;Example of Use with Formula One&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why We Use the Chi-Square Test&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;To analyze categorical data&lt;/strong&gt; - Many factors in Formula 1, such as tire choices, pit stop strategies and driver performance are categorical variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To test independence&lt;/strong&gt; - The Chi-Square test can be used to determine if two categorical factors are related, such as whether race outcomes are dependent on specific tire strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To compare expected vs. observed outcomes&lt;/strong&gt; - This helps determine if trends in Formula 1 results occur by chance or if they are statistically significant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When We Use the Chi-Square Test&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Team Performance vs. Engine Supplier&lt;/strong&gt;: Testing if certain engine suppliers (e.g., Mercedes, Ferrari, Honda) correlate with better race performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pit Stop Strategy vs. Race Results&lt;/strong&gt;: Checking if a team's pit stop strategy significantly affects their final position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver Nationality vs. Team Preference&lt;/strong&gt;: Investigating if specific nationalities are more likely to be hired by particular teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Degrees of Freedom in Chi-Square Tests&lt;/strong&gt;&lt;br&gt;
Degrees of freedom play a crucial role in determining the validity of the Chi-Square test. In Formula 1, there are three main ways to calculate this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Contingency Tables (df = (rows - 1) * (columns - 1))&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: If we analyze tire choices (soft, medium, hard) across three different teams, the degrees of freedom would be (3-1) * (3-1) = 4.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Goodness-of-Fit Test (df = categories - 1)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: If we test whether the distribution of podium finishes among 5 teams is as expected, the degrees of freedom would be 5-1 = 4.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Homogeneity Test (df = (number of groups - 1) * (number of categories - 1))&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example: If we test whether wet weather races impact finishing positions differently across four different seasons, we calculate df as (4-1) * (2-1) = 3.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The Chi-Square test is a valuable statistical tool for analyzing trends and associations in Formula 1. Whether it’s evaluating pit stop strategies or comparing team performance across different seasons, this method helps uncover patterns beyond random chance.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Hypothesis Testing.</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Mon, 10 Mar 2025 20:15:45 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/hypothesis-testing-2822</link>
      <guid>https://forem.com/michellenjeriscientist/hypothesis-testing-2822</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have you ever disliked a dish yet you still have to eat it sometimes? Most people dislike vegetables and would prefer not to eat them, yet they are really good for our bodies. Now imagine machine learning and AI as a body, and guess what the vegetables in question are: Mathematics, specifically statistics.&lt;/p&gt;

&lt;p&gt;I know statistics might not be the most exciting part of AI for many people, but it’s the foundation of how machines learn and process data. I will try as much as I can to help you understand statistics and how its major concepts work, so that as we code, we actually know what we are doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe hypothesis testing is the backbone of all models. It helps determine whether the problem you're solving is truly significant and proves if your solution is actually viable.&lt;/p&gt;

&lt;p&gt;From Wikipedia, Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between 2 statistical variables.&lt;/p&gt;

&lt;p&gt;Example: In my intro, I made an assumption that most people do not like eating vegetables. I could perform tests to validate what I said by collecting and evaluating a representative sample from the data set under study. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importance of Hypothesis Testing&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Avoiding Misleading Conclusions and Making Smart Decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Imagine you're an oncologist with a new method for testing cancer. Two patients, Patient A and Patient B, come in for testing. Patient A is actually healthy, but your test incorrectly diagnoses them with cancer—this is a Type I error (false positive). On the other hand, Patient B does have cancer, but your test fails to detect it, and you send them home thinking they're fine—this is a Type II error (false negative).&lt;/p&gt;

&lt;p&gt;2. Optimizing Business Tactics.&lt;/p&gt;

&lt;p&gt;Hypothesis testing is invaluable for testing new ideas and strategies before fully committing to them. Example: checking whether investing in a model that predicts early signs of cancer would bring a health institution more patients who want to be screened, hence bringing in new clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis Testing Formula&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Z = ( x̅ – μ0 ) / (σ /√n)&lt;/p&gt;

&lt;p&gt;x̅ = sample mean,&lt;br&gt;
μ0= population mean,&lt;br&gt;
σ = standard deviation,&lt;br&gt;
n= sample size.&lt;/p&gt;
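
&lt;p&gt;A worked example of the formula, with made-up numbers (sample mean 52, population mean 50, σ = 4, n = 36):&lt;/p&gt;

```python
import math

# Z = (x_bar - mu0) / (sigma / sqrt(n)), with illustrative values.
x_bar, mu0, sigma, n = 52.0, 50.0, 4.0, 36
z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(z)

# Two-tailed p-value from the standard normal CDF.
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
print(round(p, 4))
```

&lt;p&gt;If p ≤ α (say 0.05), we would reject the null hypothesis that the sample mean equals the population mean.&lt;/p&gt;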

&lt;p&gt;&lt;strong&gt;1. Formulating a Hypothesis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Null Hypothesis (H0): This is the default assumption that there is no effect or difference.&lt;br&gt;
Alternative Hypothesis (Ha): This is the hypothesis that there is an effect or difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose the Significance Level (α)&lt;/strong&gt;&lt;br&gt;
The significance level, often denoted by alpha (α), is the probability of rejecting the null hypothesis when it is true. Common choices for α are 0.05 (5%), 0.01 (1%), and 0.10 (10%).&lt;/p&gt;

&lt;p&gt;The significance level (α) directly controls the probability of a Type I error. Decreasing α reduces the chance of Type I errors but increases the risk of Type II errors.&lt;/p&gt;

&lt;p&gt;To balance these errors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adjust the significance level based on the consequences of each error type.&lt;/li&gt;
&lt;li&gt;Increase sample size to improve the power of the test.&lt;/li&gt;
&lt;li&gt;Use one-tailed tests when appropriate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;2. Select the Appropriate Test&lt;/strong&gt;&lt;br&gt;
Choose a statistical test based on the type of data and the hypothesis. Common tests include t-tests, chi-square tests, ANOVA, and regression analysis. The selection depends on data type, distribution, sample size, and whether the hypothesis is one-tailed or two-tailed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Collect Data&lt;/strong&gt;&lt;br&gt;
Gather the data that will be analyzed in the test. This data should be representative of the population.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Calculate the Test Statistic&lt;/strong&gt;&lt;br&gt;
Based on the collected data and the chosen test, calculate a test statistic that reflects how much the observed data deviates from the null hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Determine the p-value&lt;/strong&gt;&lt;br&gt;
The p-value is the probability of observing test results at least as extreme as the results observed, assuming the null hypothesis is correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Make a Decision&lt;/strong&gt;&lt;br&gt;
Compare the p-value to the chosen significance level:&lt;br&gt;
If the p-value ≤ α: Reject the null hypothesis, suggesting sufficient evidence in the data supports the alternative hypothesis.&lt;br&gt;
If the p-value &amp;gt; α: Do not reject the null hypothesis, suggesting insufficient evidence to support the alternative hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Report the Results&lt;/strong&gt;&lt;br&gt;
Present the findings from the hypothesis test, including the test statistic, p-value, and the conclusion about the hypotheses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Hypothesis Testing&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Z Test&lt;br&gt;
It usually checks to see if two means are the same (the null hypothesis). Only when the population standard deviation is known and the sample size is 30 data points or more can a z-test be applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;T Test&lt;br&gt;
Compares the means of two groups. To determine whether two groups differ or if a procedure or treatment affects the population of interest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chi-Square &lt;br&gt;
To determine if the expected and observed results are well-fitted, the Chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data should be compared to the predicted values that would be present if the null hypothesis were true.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ANOVA&lt;br&gt;
Analysis of Variance, is a statistical method used to compare the means of three or more groups. It’s particularly useful when you want to see if there are significant differences between multiple groups.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Modern Approaches to Hypothesis Testing&lt;/strong&gt;&lt;br&gt;
In addition to traditional hypothesis testing methods, there are several modern approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Permutation or randomization tests&lt;br&gt;
These tests involve randomly shuffling the observed data many times to create a distribution of possible outcomes under the null hypothesis. They are particularly useful when dealing with small sample sizes or when the assumptions of parametric tests are not met.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bootstrapping&lt;br&gt;
Bootstrapping is a resampling technique that involves repeatedly sampling with replacement from the original dataset. It can be used to estimate the sampling distribution of a statistic and construct confidence intervals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jackknife &lt;br&gt;
Jackknife is a cross-validation technique and therefore, a form of resampling. It is especially useful for bias and variance estimation. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monte Carlo simulation&lt;br&gt;
Monte Carlo methods use repeated random sampling to obtain numerical results. In hypothesis testing, they can be used to estimate p-values for complex statistical models or when analytical solutions are difficult to obtain.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
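
&lt;p&gt;Bootstrapping, for instance, takes only a few lines; the data sample below is made up:&lt;/p&gt;

```python
import random

# Bootstrap a 95% confidence interval for the sample mean (stdlib only).
random.seed(0)
data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0]

boot_means = []
for _ in range(10_000):
    # Resample with replacement, same size as the original sample.
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
low  = boot_means[249]   # 2.5th percentile (approximate)
high = boot_means[9749]  # 97.5th percentile (approximate)
print(round(low, 2), round(high, 2))
```

&lt;p&gt;The interval (low, high) estimates where the true mean lies, without any normality assumption.&lt;/p&gt;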

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
While statistical findings are important, it's also crucial to apply common sense when working with hypothesis testing. The practicality of your ideas will lead to better tests and more reliable conclusions. I do hope I have helped you like vegetables a little bit. Remember, with statistics, practice makes perfect, so keep solving as many problems as you can.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data protection, privacy and ethics.</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Sun, 09 Feb 2025 07:24:30 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/data-protection-privacy-and-ethics-5gpj</link>
      <guid>https://forem.com/michellenjeriscientist/data-protection-privacy-and-ethics-5gpj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with a simple but unsettling thought:&lt;/p&gt;

&lt;p&gt;By now, I think we all know I am a woman. My reproductive health is deeply personal, and the data around it is, for me, very sensitive. To track my period and ovulation, I use Flo, an app that helps me understand my body better through the different hormonal stages. Now imagine I get a call from a gynecologist at a hospital I’ve never been to, talking about symptoms I logged in my app and telling me there are some irregularities and that I need to do some tests at their hospital to confirm everything is OK.&lt;/p&gt;

&lt;p&gt;How do they know that? I never shared my data with them. When I investigate further, I find Flo my tracking app sold my personal health data to the hospital without my consent. I’d feel betrayed. I might uninstall the app, warn my friends or even take legal action. This isn’t just a hypothetical scenario. It actually happened.&lt;/p&gt;

&lt;p&gt;From Wikipedia, in 2021 Flo was caught sharing the reproductive health data women logged in the app with Facebook and Google without asking for permission. This was quite disturbing: a lot of sensitive data is shared by women on these apps, yet it was handed to third parties without consent and with no openness about what it was being used for. Flo settled a case with the Federal Trade Commission, but the damage was already done: thousands of women’s sensitive health information had been exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Does This Matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In today’s digital world, data is currency. Every time we use an app, shop online, or even just browse, we leave behind a trail of data. You know how we visit sites and you are offered cookies, well that's just how we share our data with different sites. Companies collect this data to improve services but some sell it for profit—often without us even knowing.&lt;/p&gt;

&lt;p&gt;This is where data privacy, protection, and ethics come in.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Privacy: 
Privacy is about who gets access to your information and how it’s used. &lt;/li&gt;
&lt;li&gt;Data Protection: 
Protection is about preventing unauthorized access. &lt;/li&gt;
&lt;li&gt;Data Ethics:
Ethics is about responsible data use. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Importance of Data Ethics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Builds Trust&lt;/em&gt; – Ethical data practices strengthen customer and stakeholder trust.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Legal Compliance&lt;/em&gt; – Helps organizations follow laws like GDPR and CCPA avoiding penalties.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Risk Prevention&lt;/em&gt; – Reduces data breaches and misuse, preventing financial and reputational harm.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Social Responsibility&lt;/em&gt; – Protects individuals from harm, bias or discrimination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Principles of Data Ethics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Privacy&lt;/em&gt; – Collect and use personal data only with explicit consent.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Transparency&lt;/em&gt; – Be open about data collection, usage and policies.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Accountability&lt;/em&gt; – Take responsibility for data protection and any misuse.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Fairness&lt;/em&gt; – Ensure data practices are unbiased and do not discriminate.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Security&lt;/em&gt; – Protect data from unauthorized access or modifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Ethical Data Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Customer Loyalty&lt;/em&gt; – People trust and stay loyal to ethical businesses.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Competitive Edge&lt;/em&gt; – Ethical companies stand out from competitors.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Better Data Quality&lt;/em&gt; – Ensures accuracy and accountability in data management.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Risk Reduction&lt;/em&gt; – Minimizes legal issues and reputational damage.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Encourages Innovation&lt;/em&gt; – A culture of integrity leads to long-term success.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In an age where data is a powerful asset, ethical data practices are not just a legal requirement but a moral obligation. Organizations that prioritize privacy, transparency, accountability, fairness, and security not only build trust with their customers but also gain a competitive advantage in the long run. I hope you've gained some knowledge on data privacy, protection and ethics. Till next time!&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Feature Engineering</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Wed, 11 Sep 2024 13:22:54 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/feature-engineering-f4a</link>
      <guid>https://forem.com/michellenjeriscientist/feature-engineering-f4a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hey there, today we are going to demystify feature engineering. It may seem like a tough topic, but by the end of this article I hope you will understand at least the basics of it.&lt;/p&gt;

&lt;p&gt;From Wikipedia, &lt;strong&gt;Feature engineering&lt;/strong&gt; is a machine learning method that uses data to create new variables that are not included in the training set. &lt;br&gt;
It can generate new features for both supervised and unsupervised learning. &lt;br&gt;
It makes data transformations easier and faster while improving the model's accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature engineering Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Data Cleaning&lt;/em&gt;: This is tidying up your data. You address missing information, correct errors, and remove any inconsistencies.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Transformation&lt;/em&gt;: This is reshaping or adjusting your data. Example: scaling large numbers down or normalizing data so that it fits within a certain range. The important thing is to make these changes without altering the data's meaning.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Feature Extraction&lt;/em&gt;: This is where we explore existing data and create new features that can offer new insights. It makes the model simpler and faster without losing useful details.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Feature Selection&lt;/em&gt;: This involves picking out the pieces of data most closely related to your target prediction. It gets rid of unnecessary information, making the model more focused.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Feature Iteration&lt;/em&gt;: This is all about trial and error. You add or remove certain features, test how they impact the model, and keep the ones that improve its performance.&lt;/li&gt;
&lt;/ol&gt;
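&lt;p&gt;The techniques above can be sketched in a few lines of pandas. This is only a minimal illustration on a tiny made-up table (hypothetical column names and values), not a full workflow:&lt;/p&gt;

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values, not from the article).
df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [40000, 52000, 61000, 85000],
    "target": [0, 0, 1, 1],
})

# 1. Data cleaning: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Data transformation: min-max scale income into the 0-1 range
#    without changing what the numbers mean relative to each other.
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# 4. Feature selection: rank features by absolute correlation with the target.
correlations = df[["age", "income_scaled"]].corrwith(df["target"]).abs()
print(correlations.sort_values(ascending=False))
```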

&lt;p&gt;&lt;strong&gt;Types of Features in Machine Learning&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Numerical Features&lt;/em&gt;: Numbers that can be measured; they are straightforward and often continuous. Example: age.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Categorical Features&lt;/em&gt;: Values drawn from a fixed set of categories. Example: eye color.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Time-series Features&lt;/em&gt;: Data recorded over time. Example: stock prices.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Text Features&lt;/em&gt;: Features made from words or text. Example: customer reviews.&lt;/li&gt;
&lt;/ol&gt;
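&lt;p&gt;As a quick illustration of why the feature type matters, categorical features usually need encoding before a model can use them. A minimal sketch with made-up values, using one-hot encoding:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical examples of a numerical and a categorical feature.
df = pd.DataFrame({
    "age": [23, 35, 41],                      # numerical
    "eye_color": ["brown", "blue", "brown"],  # categorical
})

# One-hot encoding turns each category into its own 0/1 column,
# which most models can consume directly.
encoded = pd.get_dummies(df, columns=["eye_color"])
print(encoded.columns.tolist())
```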

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
I hope I explained the terms well; I believe these are the few things you need to know theoretically as a beginner. Next time we discuss feature engineering, it will be in more practical terms.&lt;br&gt;
Till next time!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Movie Dataset Exploration and Visualization</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Tue, 10 Sep 2024 13:04:46 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/movie-dataset-exploration-and-visualization-2cbc</link>
      <guid>https://forem.com/michellenjeriscientist/movie-dataset-exploration-and-visualization-2cbc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Practice makes perfect&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That saying has a lot in common with being a data scientist. Theory is only one side of the equation; the most crucial part is putting it into practice. Today I will record the entire process of developing my capstone project, which involves studying a movie dataset. &lt;/p&gt;

&lt;p&gt;These are the objectives:&lt;br&gt;
&lt;strong&gt;Objective:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download a movie dataset from Kaggle or retrieve it using the TMDb API.&lt;/li&gt;
&lt;li&gt;Explore various aspects such as movie genres, ratings, director popularity, and release year trends.&lt;/li&gt;
&lt;li&gt;Create dashboards that visualize these trends and optionally recommend movies based on user preferences.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;1. Data Collection&lt;/strong&gt;&lt;br&gt;
I decided to use Kaggle to find my dataset. It is important to keep in mind the key variables you will need from the dataset you are working with. Mine ought to include the following: release year trends, director popularity, ratings, and movie genres, so I had to make sure the dataset I chose covered at least those.&lt;br&gt;
I found my dataset on Kaggle, and I'll provide the link below. Download the dataset, unzip it, and extract the CSV version of the file. Look it over to understand what you already have and what kinds of insights you hope to obtain from the data you will be examining.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/datasets/shivamb/netflix-shows" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Describing the data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we must import the required libraries and load the necessary data. I'm using the Python programming language and Jupyter Notebooks for my project so that I can write and see my code more efficiently.&lt;br&gt;
You will import the libraries that we will be using and load the data as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq0osn5maiym61mnlzdw.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq0osn5maiym61mnlzdw.JPG" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will then run the following command to get more details about our dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.head() # dispalys the first rows of the dataset.
data.tail() # displays the last rows of the dataset.
data.shape # Shows the total number of rows and columns.
len(data.columns)  # Shows the total number of columns.
data.columns # Describes different column names.
data.dtypes # Describes different data types.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now know what the dataset comprises and the insights we hope to extract after obtaining all the descriptions we require. Example: Using my dataset, I wish to investigate patterns in the popularity of directors, ratings distribution, and movie genres. I also want to suggest movies depending on user-selected preferences, such as preferred directors and genres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This phase involves finding any null values and removing them. In order to move on with data visualization, we will also examine our dataset for duplicates and remove any that we find. To do this, we'll run the code that follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. data['show_id'].value_counts().sum() # Checks for the total number of rows in my dataset
2. data.isna().sum() # Checks for null values(I found null values in director, cast and country columns)
3. data[['director', 'cast', 'country']] = data[['director', 'cast', 'country']].replace(np.nan, "Unknown ") # Fill null values with unknown.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will then drop the rows with unknown values and confirm we have dropped all of them. We will also check the number of rows remaining that have cleaned data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3lvxkq1jjyf4zktq0ln.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3lvxkq1jjyf4zktq0ln.JPG" alt="Image description" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code that follows checks for duplicates and unique features. Although there are no duplicates in my dataset, you may still need these checks in case your datasets have some.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.duplicated().sum() # Checks for duplicates
data.nunique() # Checks for unique features
data.info # Confirms if nan values are present and also shows datatypes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My date/time column was stored as an object, and I wanted it in a proper date/time format, so I used &lt;code&gt;data['date_added'] = data['date_added'].astype('datetime64[ms]')&lt;/code&gt; to convert it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Data Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;My dataset's type column has two categories, TV Shows and Movies, and I used a bar graph to present this categorical data with the values each category represents. &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7hib0lo1yinu497cgxu.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7hib0lo1yinu497cgxu.JPG" alt="Image description" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I also used a pie chart to represent the same data. The code used is as follows, and the expected outcome is shown below.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Pie chart display
plt.figure(figsize=(8, 8))  
data['type'].value_counts().plot(
    kind='pie', 
    autopct='%1.1f%%',  
    colors=['skyblue', 'lightgreen'], 
    startangle=90, 
    explode=(0.05, 0) 
)
plt.title('Distribution of Content Types (Movies vs. TV Shows)')
plt.ylabel('')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92l1hado46an72swajpw.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92l1hado46an72swajpw.JPG" alt="Image description" width="800" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I then used &lt;code&gt;pd.crosstab(data.type, data.country)&lt;/code&gt; to create a tabled comparison of the types based on release dates, countries, and other factors (you can try changing the columns in the code independently). Below is the code to use and the expected comparison. I also checked the first 20 countries leading in the production of TV Shows and visualized them in a bar graph. You can copy the code in the image and ensure your outcome is almost similar to mine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5nrhg2zd9mazt7rsiu3.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5nrhg2zd9mazt7rsiu3.JPG" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lx9dz9pwyf30lib1o70.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lx9dz9pwyf30lib1o70.JPG" alt="Image description" width="800" height="808"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I then checked for the top 10 movie genres as shown below. You can also use the code to check for TV shows; just substitute the proper variable names.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8m0o6r1170yr34xs51.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8m0o6r1170yr34xs51.JPG" alt="Image description" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I extracted months and years separately from the dates provided so that I could visualize some histogram plots over the years.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pj6b9bkxvgjvzhyvf0n.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pj6b9bkxvgjvzhyvf0n.JPG" alt="Image description" width="800" height="835"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c89mpmy2cgu527h148z.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c89mpmy2cgu527h148z.JPG" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27byj1sdtfbrdnq9i6qz.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27byj1sdtfbrdnq9i6qz.JPG" alt="Image description" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checked for the top 10 directors with the most movies and compared them using a bar graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kcicshseuivtf76m57u.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kcicshseuivtf76m57u.JPG" alt="Image description" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checked for the cast with the highest rating and visualized them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0erjncptampoyux9lskn.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0erjncptampoyux9lskn.JPG" alt="Image description" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Recommendation System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I then built a recommendation system that takes in genre or director's name as input and produces a list of movies as per the user's preference. If the input cannot be matched by the algorithm then the user is notified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrdoozlnkx680i8ivm05.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrdoozlnkx680i8ivm05.JPG" alt="Image description" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for the above is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def recommend_movies(genre=None, director=None):
    recommendations = data
    if genre:
        recommendations = recommendations[recommendations['listed_in'].str.contains(genre, case=False, na=False)]
    if director:
        recommendations = recommendations[recommendations['director'].str.contains(director, case=False, na=False)]
    if not recommendations.empty:
        return recommendations[['title', 'director', 'listed_in', 'release_year', 'rating']].head(10)
    else:
        return "No movies found matching your preferences."
print("Welcome to the Movie Recommendation System!")
print("You can filter movies by Genre or Director (or both).")
user_genre = input("Enter your preferred genre (or press Enter to skip): ")
user_director = input("Enter your preferred director (or press Enter to skip): ")
recommendations = recommend_movies(genre=user_genre, director=user_director)
print("\nRecommended Movies:")
print(recommendations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My goals were achieved, and I had a great time taking on this challenge: it reminded me that learning is a process, with days when I succeed and days when I fail. This was definitely a success. Here, we celebrate victories as well as defeats since, in the end, each teaches us something. Do let me know if you attempt this.&lt;br&gt;
Till next time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note!!&lt;/strong&gt;&lt;br&gt;
The code is in my GitHub:&lt;br&gt;
&lt;a href="https://github.com/MichelleNjeri-scientist/Movie-Dataset-Exploration-and-Visualization" rel="noopener noreferrer"&gt;https://github.com/MichelleNjeri-scientist/Movie-Dataset-Exploration-and-Visualization&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kaggle dataset is:&lt;br&gt;
&lt;a href="https://www.kaggle.com/datasets/shivamb/netflix-shows" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/shivamb/netflix-shows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Demystifying Data Science: The Ultimate Guide!</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Tue, 10 Sep 2024 11:22:02 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/demystifying-data-science-5gf5</link>
      <guid>https://forem.com/michellenjeriscientist/demystifying-data-science-5gf5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hey there! Today I will be sharing what I believe to be the key skills an expert in Data Science should possess. This is a follow-up to the beginner's guide article, so if you missed it, the link is below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/michellenjeriscientist/demystifying-data-science-a-beginners-guide-pa6" rel="noopener noreferrer"&gt;https://dev.to/michellenjeriscientist/demystifying-data-science-a-beginners-guide-pa6&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to Wikipedia, Data Science &lt;em&gt;is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;A Data Scientist is responsible for many roles in an organization, like business analytics, building data products, developing visualizations, ML algorithms and more. I believe the first step is to understand the different roles of data specialists so that you know what piques your interest. These specializations are: data analyst, data scientist, data administrator, data architect, business analyst, business intelligence manager, and data/analytics manager.&lt;/p&gt;

&lt;p&gt;The following are important skills you must have as a data scientist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Mastering programming languages like R or Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python and R are both free, open-source languages. They both run smoothly on the most common operating systems, i.e. Linux, macOS and Windows. Both languages have a long list of functionalities and can easily take on any data analysis task. Beginner or expert, the languages are easy to learn and execute. &lt;br&gt;
Python is a general-purpose, object-oriented programming language. Its easy syntax makes it perfect for collaboration. It ensures smooth execution of tasks with flexibility, stability and code readability.&lt;br&gt;
R is a popular statistical programming language built to facilitate computing and data visualization. R has numerous abilities, including statistical analysis, data visualization and data manipulation.&lt;br&gt;
You can compare the two and choose what works best for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Statistics and Applied Mathematics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Statistics and data science are fields of applied mathematics designed to interpret data in all its many forms. This is done by applying mathematical models and statistical theory that relate the data at hand to the underlying questions and often hidden features of interest. A key advantage of statistical science is the ability to quantify the uncertainty in a prediction or decision and for decision making this aspect is often as important as the estimate itself. &lt;/p&gt;

&lt;p&gt;A simpler summary of this from Wikipedia &lt;em&gt;From hypothesis testing to regression analysis, statistical methods enable professionals to validate hypotheses, quantify uncertainties, and draw conclusions with confidence.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Working Knowledge of Hadoop and Spark.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Apache Hadoop&lt;/strong&gt; is an open-source software utility that allows users to manage big data sets by enabling a network of computers to solve vast and intricate data problems. &lt;br&gt;
It is a highly scalable, cost-effective solution that stores and processes structured, semi-structured and unstructured data.&lt;/p&gt;

&lt;p&gt;Benefits of the Hadoop framework include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data protection amid a hardware failure.&lt;/li&gt;
&lt;li&gt;Vast scalability from a single server to thousands of machines.&lt;/li&gt;
&lt;li&gt;Real-time analytics for historical analyses and decision-making processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; is a data processing engine for big data sets. Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop because it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot.&lt;/p&gt;

&lt;p&gt;Benefits of the Spark framework include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unified engine that supports SQL queries, streaming data, machine learning (ML) and graph processing.&lt;/li&gt;
&lt;li&gt;Can be 100x faster than Hadoop for smaller workloads via in-memory processing, disk data storage, etc.&lt;/li&gt;
&lt;li&gt;APIs designed for ease of use when manipulating semi-structured data and transforming data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These systems are two of the most prominent distributed systems for processing data on the market today. Hadoop is used mainly for disk-heavy operations with the MapReduce paradigm, and Spark is a more flexible, but more costly in-memory processing architecture. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Databases: SQL and NoSQL.&lt;/strong&gt;&lt;br&gt;
As a data scientist, it is generally recommended to learn both SQL and NoSQL databases, as they serve different purposes and are often used in complementary ways.&lt;/p&gt;

&lt;p&gt;SQL (Structured Query Language) databases, such as PostgreSQL, MySQL, and Oracle, are well-suited for structured, tabular data and are widely used for data storage, management, and retrieval. They excel at performing complex queries, ensuring data integrity, and supporting transactions.&lt;br&gt;
NoSQL databases, such as MongoDB, Cassandra, and Elasticsearch, are designed to handle unstructured, semi-structured, or rapidly changing data that does not fit well into the rigid structure of traditional SQL databases. NoSQL databases offer features like horizontal scalability, flexible schema, and high availability.&lt;/p&gt;

&lt;p&gt;As a data scientist, having expertise in both SQL and NoSQL databases can be advantageous, as it allows you to choose the appropriate database technology for a given problem or dataset. By learning both SQL and NoSQL, data scientists can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gain a deeper understanding of data storage and management techniques.&lt;/li&gt;
&lt;li&gt;Become more versatile in adapting to different data requirements and use cases.&lt;/li&gt;
&lt;li&gt;Leverage the strengths of each database type to build robust and scalable data solutions.&lt;/li&gt;
&lt;li&gt;Seamlessly integrate SQL and NoSQL databases within their data architecture.&lt;/li&gt;
&lt;li&gt;Enhance their ability to work with diverse data sources and formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Machine Learning and Neural Networks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning and data science are inextricably linked. Machine learning is defined as a machine's ability to extract knowledge from data. Machines can't learn much if they don't have any data. If anything, the growing use of machine learning in a variety of industries will act as a catalyst for data science to grow dramatically. Data scientists are expected to have a basic understanding of machine learning.&lt;/p&gt;

&lt;p&gt;From Wikipedia, &lt;em&gt;a neural network is a method for performing machine learning tasks, training a computer with labelled training data.&lt;/em&gt; In other words, a computer program can learn to make decisions based on a model it builds from a training dataset. A common goal for data scientists in artificial intelligence is to be able to classify data and associate data with different categories. Neural networks help us develop powerful algorithms that can achieve this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Proficiency in Deep Learning Frameworks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deep learning is a subset of ML that has proven successful in recognizing data patterns. It is a neural network-based approach that allows computers to learn to do things independently rather than being explicitly programmed by humans. &lt;br&gt;
Experts forecast deep learning to become the dominant technique for data analysis across all domains in the coming few years, and its impact on data science will be significant. &lt;br&gt;
Deep learning algorithms can learn more from data than traditional ML algorithms, because they learn not only from the raw input but also from hidden layers that represent higher-level concepts.&lt;br&gt;
In addition, deep learning algorithms can be trained on massive datasets, which gives them an advantage over traditional ML algorithms, which struggle with big data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Creative Thinking &amp;amp; Industry Knowledge.&lt;/strong&gt;&lt;br&gt;
A data scientist requires a foundation of technical skills. These include the ability to interpret, manipulate, and extract meaning from data, and then use it to build predictive models and generate business insights. Creativity in data science can be seen in anything from innovative modeling, thinking up original ways to collect data, developing new tools, and being able to visualize data process a few years down the line. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learning all these skills will take time, but they are definitely essential for wholesome growth in the Data Science field. Take your time, learn your track, put in the work and watch the magic unfold.&lt;br&gt;
Till next time!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>learning</category>
    </item>
    <item>
      <title>Demystifying Data Science: The Ultimate Guide!</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Sat, 24 Aug 2024 08:52:53 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/demystifying-data-science-the-ultimate-guide-2gco</link>
      <guid>https://forem.com/michellenjeriscientist/demystifying-data-science-the-ultimate-guide-2gco</guid>
      <description></description>
    </item>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Sun, 11 Aug 2024 20:17:32 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/understanding-your-data-the-essentials-of-exploratory-data-analysis-1l1c</link>
      <guid>https://forem.com/michellenjeriscientist/understanding-your-data-the-essentials-of-exploratory-data-analysis-1l1c</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;According to Wikipedia, &lt;em&gt;Exploratory data analysis is an analysis approach that identifies general patterns in the data. These patterns include outliers and features of the data that might be unexpected.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;EDA is an approach to analyzing data in order to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize the main characteristics of data&lt;/li&gt;
&lt;li&gt;Gain better understanding of the dataset&lt;/li&gt;
&lt;li&gt;Uncover relationships between different variables and extract the ones that are important to the problem we are trying to solve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important EDA techniques that we will discuss are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Descriptive Statistics.&lt;/li&gt;
&lt;li&gt;Grouping data with &lt;code&gt;groupby&lt;/code&gt; to transform the dataset.&lt;/li&gt;
&lt;li&gt;Correlation.&lt;/li&gt;
&lt;li&gt;Advanced Correlation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Descriptive Statistics&lt;/strong&gt;&lt;br&gt;
This is a branch of statistics that involves summarizing, organizing, and presenting data meaningfully and concisely.&lt;br&gt;
Before exploring our dataset, we must first import the relevant libraries and load the data. Mine was an Excel sheet, so I loaded it as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb25hwo9fjn3kc3kkjj0d.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb25hwo9fjn3kc3kkjj0d.JPG" alt="Image description" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before building models, we need to understand the data we have, which is why we start with descriptive statistics. The following functions help with the process:&lt;br&gt;
&lt;code&gt;data.describe()&lt;/code&gt; - gives a summary of all the numerical features of the dataset. As per my dataset below, you can see the output it gives. I have added &lt;code&gt;include='all'&lt;/code&gt; to also include the categorical variable summary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gffe8j8uohwc5lfgyor.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gffe8j8uohwc5lfgyor.JPG" alt="Image description" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;value_counts()&lt;/code&gt; - checks the categorical variables in our dataset. These are variables that can be divided into different groups and take discrete values. When you run the code, it will show you the counts for your dataset as below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl2uerwime5f4aoqd0g7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl2uerwime5f4aoqd0g7.JPG" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boxplots&lt;/strong&gt;&lt;br&gt;
A great way to visualize numeric data, since you can see the distribution of the data at a glance. A box plot shows five descriptive statistics: the minimum and maximum values (excluding outliers), the median, and the first and third quartiles. Optionally, it can also show the mean. &lt;br&gt;
It is the right choice if you're interested only in these statistics, without digging into the real underlying data distribution. Here is how I generated my boxplot; you can substitute the details from your own dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assuming 'data' is your DataFrame and 'Temp_C' is the column for temperature
plt.figure(figsize=(8, 6))
sns.boxplot(y=data['Temp_C'])
plt.title('Boxplot of Temperature (°C)')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scatter Plot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use scatter plots to visualize relationships between continuous variables. Each value in the dataset is represented by a dot.&lt;/p&gt;

&lt;p&gt;Let's draw a simple scatter plot where x represents the age of the car and y the speed of the car. (Remember to import all the libraries to be used; for this, you'll need Matplotlib.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]-independent variable
y = [99,86,87,88,111,86,103,87,94,78,77,85,86] - dependent variable
plt.scatter(x, y)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Group By&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;groupby&lt;/code&gt; works with categorical variables. It is used for grouping the data by category and applying a function to each group, which helps to aggregate data efficiently.&lt;br&gt;
It makes the task of splitting a DataFrame over some criteria easy and efficient. &lt;br&gt;
In my dataset below, I needed the mean for each of the weather conditions across the elements of my dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvpedaentw6xqt2itfp.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvpedaentw6xqt2itfp.JPG" alt="Image description" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heatmap Plot&lt;/strong&gt;&lt;br&gt;
A heatmap is a table-style data visualization type where each numeric data point is depicted based on a selected color scale and according to the data point's magnitude within the dataset.&lt;br&gt;
These plots illustrate potential hot and cold spots of the data that may require special attention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correlation&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Correlation&lt;/strong&gt; is a statistical metric for measuring to what extent different variables are interdependent. We use it to check how strongly variables move together. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced Correlation&lt;/strong&gt;&lt;br&gt;
We can measure the strength of the correlation between continuous numerical variables by using the Pearson correlation.&lt;br&gt;
The Pearson correlation method gives you two values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;correlation coefficient&lt;/u&gt;&lt;/strong&gt;
&lt;em&gt;a value close to 1 shows a large positive correlation, while a value close to -1 implies a large negative correlation, and a value close to 0 implies no correlation between the variables.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;u&gt;p-value&lt;/u&gt;&lt;/strong&gt;
&lt;em&gt;tells us how certain we are about the correlation that we calculated.
A p-value less than 0.001 gives us strong certainty about the correlation coefficient, a value between 0.001 and 0.05 gives moderate certainty, a value between 0.05 and 0.1 gives only weak certainty, and a p-value larger than 0.1 gives no certainty of correlation at all.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can say that there is a strong correlation when the correlation coefficient is close to 1 or -1 and the p-value is less than 0.001.&lt;/p&gt;
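&lt;p&gt;As an illustration, SciPy's &lt;code&gt;pearsonr&lt;/code&gt; returns exactly these two values; the engine-size and price figures below are made up for the example:&lt;/p&gt;

```python
from scipy import stats

# Made-up, strongly related variables
engine_size = [1.0, 1.4, 1.8, 2.0, 2.5, 3.0]
price = [10000, 13500, 17000, 19000, 24500, 29000]

# pearsonr returns (correlation coefficient, p-value)
coef, p_value = stats.pearsonr(engine_size, price)
print(f"coefficient={coef:.3f}, p-value={p_value:.6f}")
```

&lt;p&gt;With data this close to a straight line, the coefficient comes out near 1 and the p-value well below 0.001, which by the rules above counts as a strong correlation.&lt;/p&gt;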

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
    <item>
      <title>Demystifying Data Science: A Beginner’s Guide!</title>
      <dc:creator>Michelle Njuguna</dc:creator>
      <pubDate>Sun, 04 Aug 2024 09:32:03 +0000</pubDate>
      <link>https://forem.com/michellenjeriscientist/demystifying-data-science-a-beginners-guide-pa6</link>
      <guid>https://forem.com/michellenjeriscientist/demystifying-data-science-a-beginners-guide-pa6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hey there! I'm Michelle, and I like to call myself a data enthusiast! Data science might sound intimidating, but it’s not. I’ve been where you are now, staring at the screen with that “where do I even begin?” look. But don’t worry, I’m here to guide you through this journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is Data Science?&lt;/strong&gt;&lt;br&gt;
We first need to understand data. &lt;strong&gt;&lt;u&gt;Data&lt;/u&gt;&lt;/strong&gt; is a very broad term that can refer to raw facts, processed data, or information. &lt;br&gt;
There are two types of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional data- structured data.&lt;/li&gt;
&lt;li&gt;Big data- unstructured data. With Big Data came the evolution of Data analysis roles like Data Science and Machine learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to Wikipedia, &lt;em&gt;Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, scientific visualization, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In simpler terms, data science is all about discovering hidden insights in data and making predictions for the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;u&gt;Business Intelligence&lt;/u&gt; (BI): studies the numbers and explains where and why some things went well and others not so well. Having the business context in mind, the business intelligence analyst will present the data in the form of reports and dashboards(translates raw data).&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Traditional Methods&lt;/u&gt;: were designed prior to the existence of big data, when the technology simply wasn't as advanced as it is today. They involve applying statistical approaches to create predictive models.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Machine Learning&lt;/u&gt; (ML): utilizes unconventional methods, or A.I., to predict behavior in unprecedented ways, using machine learning techniques and tools.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Traditional Data&lt;/strong&gt;&lt;br&gt;
A quick run-down of how to handle traditional data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;u&gt;Data Collection&lt;/u&gt;: Start with gathering your raw data. This could be survey responses, sales records etc. &lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Data Preprocessing&lt;/u&gt;: Now that you’ve got your data, it’s time to clean it up. This is like sorting through a pile of maize after harvesting. For example, sales numbers are numerical data, while customer feedback is categorical.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Data Cleansing&lt;/u&gt;: Sometimes your data is messy: maybe someone wrote "two" instead of "2" or mistyped a name. Cleansing is all about fixing inconsistencies. &lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Balancing&lt;/u&gt;: If you’ve got uneven data, you need to balance it out so your results aren’t skewed or biased. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Big Data&lt;/strong&gt;&lt;br&gt;
Wikipedia states that &lt;em&gt;With 619 million active users, X creates around 12 TB daily. X, formerly known as Twitter, generates about 4.3 PB annually. The social media platform amasses around 500 million tweets daily, amounting to 560 GB of data.&lt;/em&gt; &lt;br&gt;
This is just X, so you can imagine how much data is put out on other platforms daily! And it is growing every day. This data comes in many different forms, hence it is unstructured, and maybe now you can see why Big Data is characterized by the 3 V’s, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Volume refers to the amount of data. &lt;/li&gt;
&lt;li&gt;Velocity refers to the speed of data processing.&lt;/li&gt;
&lt;li&gt;Variety refers to the number of types of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to handle Big Data&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Text Mining: the process of deriving valuable information from unstructured text. &lt;/li&gt;
&lt;li&gt;Data Masking: preserving confidential information by concealing the original data, which keeps business or governmental activity credible.&lt;/li&gt;
&lt;li&gt;Predictive Analytics: looking into the future using data. You can do this with traditional statistical methods or with machine learning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;u&gt;Traditional Methods&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regression: a model for quantifying causal relationships among the different variables included in your analysis.&lt;/li&gt;
&lt;li&gt;Clustering: Grouping similar things together.&lt;/li&gt;
&lt;/ul&gt;
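&lt;p&gt;As an illustrative sketch of the regression idea (made-up advertising figures, fitted with NumPy's least-squares &lt;code&gt;polyfit&lt;/code&gt; rather than any particular statistics package):&lt;/p&gt;

```python
import numpy as np

# Hypothetical data: advertising spend vs. resulting sales
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Fit sales = slope * spend + intercept by least squares
slope, intercept = np.polyfit(spend, sales, deg=1)
print(slope, intercept)
```

&lt;p&gt;The fitted slope and intercept quantify how sales change as spend changes, which is the relationship regression is after.&lt;/p&gt;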

&lt;p&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;&lt;br&gt;
Training computers to learn from data and make predictions without being explicitly programmed.&lt;/p&gt;

&lt;p&gt;There are three main types of ML:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;Supervised Learning&lt;/u&gt;: Works with labeled data to predict outcomes. Examples: support vector machines, neural networks, deep learning, random forest models, and Bayesian networks are all types of supervised learning.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Unsupervised Learning&lt;/u&gt;: Works with unlabeled data to uncover patterns. Some neural networks can be applied in an unsupervised way, but K-means is the most common unsupervised approach.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Reinforcement Learning&lt;/u&gt;: Training models with rewards and punishments.&lt;/li&gt;
&lt;/ul&gt;
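&lt;p&gt;Since K-means comes up above, here is a minimal sketch of it, assuming scikit-learn is installed; the 2-D points are made up to form two obvious blobs:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of unlabeled 2-D points
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Ask K-means for two clusters; it labels each point without any supervision
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)
```

&lt;p&gt;No labels were given, yet each blob ends up with its own cluster label, which is the essence of unsupervised learning.&lt;/p&gt;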

&lt;p&gt;&lt;em&gt;Deep learning can be applied in supervised, unsupervised, and reinforcement settings.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
I know—it’s a lot. But remember, every expert was once a beginner. Keep experimenting, and don’t be afraid to make mistakes. That’s how you learn!&lt;/p&gt;

&lt;p&gt;That’s it for now! Stay tuned for more articles where we’ll dive deeper into these topics. &lt;/p&gt;

</description>
      <category>beginners</category>
      <category>bigdata</category>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
