Forem: Faith Cherotich

Supervised Learning: How Classification Models Work

Faith Cherotich — Sun, 24 Aug 2025 20:59:08 +0000

Have you ever wondered how your phone detects spam texts and automatically sends them to the junk folder? This is an example of classification in supervised learning, a branch of machine learning.
In this article, we'll delve into supervised learning, its models, and real-world applications.

What is Supervised Learning?

Supervised learning is a branch of machine learning that learns patterns from labelled datasets by relating input variables to known outcomes. The patterns learned from the training data form a model that can apply the same underlying rules to new, unseen data (testing data).

Supervised learning has two branches, classification and regression. Our main focus here is classification.

Classification

Classification algorithms produce outputs that are restricted to a limited set of values, known as classes. A model is trained on labelled data to recognise patterns in the input and assign a class accordingly.

Examples:

Identifying an email as spam or not spam.
Flagging a transaction as fraudulent or legitimate.

Types of Classification Tasks

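Binary Classification- two possible classes, for example spam or not spam.
Multi-Class- more than two classes, with each input assigned exactly one of them.
Multi-Label- each input can belong to multiple classes at the same time, for example a film tagged as both comedy and romance.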
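
As a rough illustration of how the three task types differ, the sketch below shows what the labels might look like in each case. The inputs, class names, and the count_classes helper are all made up for illustration and do not come from a real dataset or library.

```python
# Hypothetical labels, invented purely to show the shape of each task type.

# Binary: each input gets exactly one of two classes.
binary_labels = {"email_1": "spam", "email_2": "not spam"}

# Multi-class: each input gets exactly one class out of more than two.
multi_class_labels = {"photo_1": "cat", "photo_2": "dog", "photo_3": "bird"}

# Multi-label: each input can carry several classes at once.
multi_label_labels = {
    "article_1": {"politics", "economy"},
    "article_2": {"sports"},
}


def count_classes(labels):
    """Return the number of distinct classes used across all inputs."""
    classes = set()
    for value in labels.values():
        # A multi-label value is a set of classes; otherwise it is one class.
        classes.update(value if isinstance(value, set) else {value})
    return len(classes)


print(count_classes(binary_labels))       # 2
print(count_classes(multi_class_labels))  # 3
print(count_classes(multi_label_labels))  # 3
```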

Models used for classification

Logistic Regression- typically used for binary classification, where the model predicts one of two discrete classes, for example pregnant or not pregnant.
K-Nearest Neighbors- a supervised learning technique that classifies a new data point according to the classes of the data points closest to it.
Decision Trees- break the classification down into a series of decisions that can be drawn as a neat tree diagram, hence the name. A tree begins at the root node, from which branches lead through further decision nodes and end at leaf nodes, each of which holds a predicted class.
Random Forest- a collection of multiple decision trees whose individual predictions are combined, typically by majority vote.
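
To make one of these models concrete, here is a minimal sketch of K-Nearest Neighbors written in plain Python. It is not how a library would implement it; the function name knn_predict, the choice of k=3, and the tiny spam dataset are assumptions made up for illustration.

```python
import math
from collections import Counter


def knn_predict(training_points, training_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to new_point with its label.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up features: (number of links, number of exclamation marks) per message.
training_points = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 5), (4, 6)]
training_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

print(knn_predict(training_points, training_labels, (5, 5)))  # spam
print(knn_predict(training_points, training_labels, (1, 1)))  # not spam
```

The labelled points play the role of the training data, and the two calls at the end stand in for new, unseen data.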

Real-world applications of Classification

Education: Determining whether a student should be awarded a scholarship.
Banking: Determining whether a customer is eligible for a loan.
Agriculture: Predicting whether weather conditions are suitable for optimum production of certain crops.

Final Thoughts

Classification is more than just a theoretical concept; it's a powerful tool for making data-driven decisions. From optimizing business strategies like product recommendations to enhancing your grasp of machine learning's inner workings, mastering this technique is fundamental to building intelligent systems.

Predicting win probabilities of Premier League teams based on last season's performance using Python

Faith Cherotich — Thu, 31 Jul 2025 20:56:08 +0000

In this article, I'll delve into the world of football in an attempt to predict teams' winning chances based on last season's performance.

First, I'll retrieve last season's standings from the https://www.football-data.org/ API.

import requests

# Retrieving data from the API
standings_url = "https://api.football-data.org/v4/competitions/PL/standings?season=2024"
headers = {
    "X-Auth-Token": "api_key"  # placeholder: replace with your own API key
}

response = requests.get(standings_url, headers=headers)
response.raise_for_status()  # stop early if the request failed (e.g. invalid key)
standings_data = response.json()

Next, I'll estimate each team's probability of winning this season from its win rate last season: the number of games won divided by the number of games played, expressed as a percentage.

def win_probability(standings_data):
    teams_win_probabilities = []
    table = standings_data["standings"][0]["table"]  

    for row in table:
        team_name = row["team"]["name"]
        wins = row["won"]
        played = row["playedGames"]
        if played > 0:
            win_rate = (wins / played) * 100
            teams_win_probabilities.append({"team": team_name, "playedGames": played, "won": wins, "win_rate": round(win_rate, 2)})
        else:
            print(f"Skipping {row['team']['name']} because playedGames={played}")
    print(f"Total teams processed: {len(teams_win_probabilities)}")
    return teams_win_probabilities

To print the results:

results = win_probability(standings_data)
for index, team in enumerate(results, start=1):
    print(f"{index}. {team['team']}: \nPlayed Games- {team['playedGames']}, Wins- {team['won']}, {team['win_rate']}% win rate")

Sample output for the first five teams on the 2024/25 EPL table:

Total teams processed: 20
1. Liverpool FC: 
Played Games- 38, Wins- 25, 65.79% win rate
2. Arsenal FC: 
Played Games- 38, Wins- 20, 52.63% win rate
3. Manchester City FC: 
Played Games- 38, Wins- 21, 55.26% win rate
4. Chelsea FC: 
Played Games- 38, Wins- 20, 52.63% win rate
5. Newcastle United FC: 
Played Games- 38, Wins- 20, 52.63% win rate
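
The sample output appears to follow league-table order rather than win-rate order: Manchester City FC has a higher win rate than Arsenal FC but is listed below it. As a small sketch, the rows below are copied from the sample output and re-ranked by win rate. The variable names sample_results and ranked are only illustrative.

```python
# Rows copied from the sample output (first five teams only).
sample_results = [
    {"team": "Liverpool FC", "playedGames": 38, "won": 25, "win_rate": 65.79},
    {"team": "Arsenal FC", "playedGames": 38, "won": 20, "win_rate": 52.63},
    {"team": "Manchester City FC", "playedGames": 38, "won": 21, "win_rate": 55.26},
    {"team": "Chelsea FC", "playedGames": 38, "won": 20, "win_rate": 52.63},
    {"team": "Newcastle United FC", "playedGames": 38, "won": 20, "win_rate": 52.63},
]

# Sort from highest to lowest win rate. Python's sort is stable, so teams
# with the same win rate keep their original league-table order.
ranked = sorted(sample_results, key=lambda team: team["win_rate"], reverse=True)

for index, team in enumerate(ranked, start=1):
    print(f"{index}. {team['team']}: {team['win_rate']}% win rate")
```

With the full list returned by win_probability, the same sorted call would rank all 20 teams.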

Final Thoughts

This will not tell us exactly which team will win this season, but it gives a rough estimate of how likely each team is to win, based on last season's form.
Such estimates can help football pundits and analysts gain insights into trends in the world of football.

Measures of central tendency and their significance in the field of data science

Faith Cherotich — Tue, 22 Jul 2025 20:48:37 +0000

In this article, we are going to explore the methods used to measure the central tendency of data and their importance in the field of data science.

What are measures of central tendency?

These are numerical values that represent the middle or typical value of a dataset, also known as averages. They are important for summarizing data with a single representative value.
They describe the extent to which numerical variables tend to group around a specific value.

The commonly used measures are:

- The Mean
It is also known as the average. It is computed by dividing the sum of the values by the number of values that were summed.
The mean is one way of finding the most typical value in a set of data values. It uses every value in the sample, so outliers can distort it.

- The Median
This is the middle value in a sorted set of values.
When the number of data values is even, there is no natural middle value, so the median is computed as the mean of the two middle values.
The median splits the set of ordered values into two parts that have an equal number of values. It is a good alternative for a dataset that has outliers, since it is not affected by extreme values.

- The Mode
It is the value that appears most frequently in a distribution.

Example using Python
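
Here is a minimal sketch using Python's built-in statistics module. The salary figures are made up for illustration; the second list adds one deliberate outlier to show how it shifts the mean far more than the median.

```python
import statistics

# Made-up salaries (in thousands); an even number of values.
salaries = [40, 45, 45, 50, 55, 60]
# The same data with one extreme value added.
with_outlier = salaries + [400]

for label, values in (("Without outlier", salaries), ("With outlier", with_outlier)):
    mean = round(statistics.mean(values), 2)
    median = statistics.median(values)
    mode = statistics.mode(values)
    print(f"{label}: mean={mean}, median={median}, mode={mode}")

# Without outlier: mean=49.17, median=47.5, mode=45
# With outlier: mean=99.29, median=50, mode=45
```

In the first list the median is the mean of the two middle values (45 and 50). Adding the outlier roughly doubles the mean, while the median and mode barely move.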

Importance of measures of central tendency in the field of data science

• Summarize large datasets making them easier to understand.
• Detect outliers/anomalies to help identify potential errors in the data for accurate assessments.
• Communicate insights from analysis of differences/similarities between different datasets or time periods for better decision making.
• Draw inductive inferences, since statistics from data samples are used to draw conclusions about the corresponding statistics in larger populations.
• Make predictions through an understanding of the average expected outcome. For example, in real estate, analyzing sales trends over a given time period can show which region customers prefer most.

Applications

• A clothing store stocking the most common sizes purchased.
• Companies evaluating their average employee salaries.
• Insurance providers evaluating the median age of their customers.

How Excel is used in Real-World Data Analysis

Faith Cherotich — Wed, 11 Jun 2025 14:49:55 +0000

Overview of Excel

Excel is a widely known data analysis and visualization tool. Its accessibility and usability make it a valuable asset for professional growth.
It is useful for data entry, data cleaning, visualization and generation of reports that can be presented in dashboards. The built-in formulas aid in performing simple calculations and analysis of given datasets.

Applications of Excel in the real world

Basic data analysis – built-in functions, e.g. MAX, SUM, AVERAGE, are essential in handling arithmetic computations. They allow users to analyze data quickly and with ease. Examples include computing profitability and liquidity ratios.

Financial reporting – Excel allows for presentation of financial data in a clear and concise manner. This can be presented visually using features such as pivot charts to create dashboards and reports necessary for decision making. Examples include sales and expenditure reports.

Expense Tracking – Users can customize an Excel spreadsheet to track expenses, grouped into categories such as fuel, shopping and rent. This makes budgeting easier because spending patterns become visible.

Common Excel features/formulas

Average – Returns the average of the numbers in a selected range. It divides the total of the range by the number of entries in the selection.

Count – Counts how many cells in a given range contain numbers. The related COUNTIF function can be used to check for duplicates.

Data validation – a feature that restricts the data users can input in a cell, ensuring consistency and accuracy.

Pivot tables and charts – Pivot tables present summarized data, while pivot charts provide graphical visualizations of the said data for easier identification of trends and patterns. Both are used when creating dashboards.

Find and replace – Find helps the user locate required data in the spreadsheet, while Replace allows one to change original data to new data where necessary.

Reflection

Approximately 402 million terabytes of data are generated every day. Even though Excel cannot handle such huge datasets, it remains a wonderful data analysis tool, and knowledge of data preparation, analysis and visualization is crucial in drawing actionable insights from data. Features such as pivot tables and filters make it easier to sift through large datasets and produce summarized reports. Additionally, visually appealing dashboards make comprehension and pattern recognition easier, both of which are key to storytelling with data.
Having a background in finance and accounting, I’ve used Excel before. However, during the past week of learning Excel, I’ve explored more of what it can do beyond the statistical analysis I’m accustomed to, and I’m even more excited about how it will shape my data analytics journey.