Forem: Faith Cheptoo

Unsupervised Learning: Clustering

Faith Cheptoo — Sun, 14 Sep 2025 19:34:27 +0000

Discovering hidden patterns

Ever wondered how Spotify suggests playlists you didn't even know you would like? Or how online stores group similar products for you? That is where unsupervised learning comes in, and specifically, clustering.

What is Unsupervised Learning?

Unlike supervised learning, where a model is trained with labeled data, unsupervised learning works without labels. It analyses the data and tries to find patterns or structures on its own. Think of it like walking into a library for the first time: you notice that books on the same shelf share a topic, even though no one has told you how the shelves are organized.

How Clustering Works

Clustering is a method for grouping data points that are similar to one another. It is like having a basket of fruit: apples, oranges, and bananas. Without being told the categories, you might sort them by color or size. Clustering algorithms perform a similar function with data, automatically identifying groups that share common characteristics.

Popular Clustering Models

Some common clustering techniques include:

K-Means: Divides data into a number of clusters (k) that you choose in advance. Simple but effective.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Detects clusters of arbitrary shape and flags outliers as noise. Great for messy data.
Hierarchical Clustering: Builds a tree of nested clusters, which is useful for seeing how groups relate to one another.
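
To make the K-Means idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code used for the student dataset mentioned in this post: the `kmeans` function, the two features, and the six data points are all hypothetical. In practice a library implementation would normally be used instead.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-Means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: (prompts per week, average session minutes) per student.
students = [(2, 5), (3, 4), (1, 6), (40, 30), (42, 28), (38, 33)]
centroids, labels = kmeans(students, k=2)
print(labels)  # light users share one label, heavy users the other
```

The number of clusters `k` is the parameter to experiment with: a different `k` on the same data can give groups that are much harder to interpret.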

I first tried clustering on a dataset about students who used an AI assistant, and I realized that the choice of clusters mattered a lot. At first, the groups didn’t make sense, but after tuning parameters and visualizing the data, I uncovered meaningful patterns that were not obvious at first glance. Some students clustered together because they interacted heavily with prompts, while others barely used the system.

Why Clustering Matters

Clustering can reveal hidden insights, guide decision-making, and even improve user experiences. Whether it’s grouping customers, students, or products, the ability to find structure in unlabeled data is incredibly powerful.

Where to Trade off Type 1 and Type 2 Errors

Faith Cheptoo — Sun, 31 Aug 2025 04:44:21 +0000

A Medical Use Case
In medicine, decisions often involve uncertainty. Whether diagnosing a disease, interpreting test results, or determining a course of treatment, healthcare professionals must balance the risks of false positives and false negatives. Understanding this trade-off is crucial because it directly affects patient health, healthcare costs, and trust in the medical system.

False Positives vs. False Negatives: Definitions
False Positive (Type I Error) – The test indicates a patient has the condition when they do not.
False Negative (Type II Error) – The test indicates a patient does not have the condition when they do.

Medical Scenario: Organ Compatibility Testing
Type I Error (False Positive)
Concluding that the donor and recipient are compatible when they are not.
Impact in transplantation: The organ is transplanted into a recipient who cannot properly accept it, leading to rejection, organ failure, or even death. This also wastes a precious donor organ.
Type II Error (False Negative)
Concluding that the donor and recipient are not compatible when they are.
Impact in transplantation: A suitable organ is turned down. The recipient remains on the waiting list, potentially losing their chance at life-saving surgery.

A Critical Balancing Act
When an organ becomes available, time is short. Compatibility testing must be both fast and accurate. Doctors must balance two competing risks:
• Transplant failure due to a false positive match.
• Missed opportunity due to a false negative match.
Because transplant failure is catastrophic for the patient and wastes an irreplaceable organ, minimizing Type I errors often takes priority. This means that testing criteria are strict; only highly certain matches proceed to surgery.
However, if the criteria are too strict, more potential matches are wrongly rejected, increasing Type II errors and leaving more patients without transplants.
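
The effect of a stricter or looser criterion can be shown with a small sketch. It assumes, purely for illustration, that each donor and recipient pair gets a single compatibility score between 0 and 1 and that pairs at or above a threshold proceed to surgery. The scores, the outcomes, and the `count_errors` helper are hypothetical and are not clinical data or a real matching procedure.

```python
def count_errors(scores, truly_compatible, threshold):
    """Count Type I and Type II errors when pairs scoring >= threshold are accepted."""
    false_pos = sum(
        1 for s, ok in zip(scores, truly_compatible) if s >= threshold and not ok
    )
    false_neg = sum(
        1 for s, ok in zip(scores, truly_compatible) if s < threshold and ok
    )
    return false_pos, false_neg

# Hypothetical compatibility scores and the true outcome for each pair.
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30]
truly_compatible = [True, True, False, True, True, False, False, False]

for threshold in (0.5, 0.7, 0.85):
    fp, fn = count_errors(scores, truly_compatible, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, raising the threshold from 0.5 to 0.85 removes the false positives but introduces false negatives, which is the trade-off described above.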

Strategies for Managing the Trade-off
Medical teams reduce both errors through:
• Multiple layers of testing (blood typing, tissue typing, cross-matching) to confirm compatibility.
• Rapid confirmatory testing to minimize wrongly declined matches.
• Organ-sharing networks to reallocate declined organs quickly to other patients.
• Continuous refinement of matching algorithms using genetic and immunological data.

When the Trade-off Shifts
The priority between avoiding Type I vs. Type II errors can change:
• High-demand, rare donor situations: Avoiding Type I errors is crucial; losing one organ to a false positive could be devastating.
• Abundant supply or high-urgency cases: Thresholds may be slightly relaxed to reduce Type II errors and get more patients transplanted faster.
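
One way to picture this shift is to attach a cost to each kind of error and pick the threshold with the lowest total cost. The sketch below does that under strong simplifying assumptions: the scores, the outcomes, the candidate thresholds, and the cost weights of 1 and 10 are all hypothetical, and real allocation decisions involve far more than a single number.

```python
def total_cost(scores, truly_compatible, threshold, cost_fp, cost_fn):
    """Sum the cost of every error made when pairs scoring >= threshold are accepted."""
    cost = 0
    for score, ok in zip(scores, truly_compatible):
        accepted = score >= threshold
        if accepted and not ok:
            cost += cost_fp  # Type I: incompatible pair accepted
        elif not accepted and ok:
            cost += cost_fn  # Type II: compatible pair declined
    return cost

def best_threshold(scores, truly_compatible, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, truly_compatible, t, cost_fp, cost_fn),
    )

# Hypothetical scores, outcomes, and candidate thresholds.
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30]
truly_compatible = [True, True, False, True, True, False, False, False]
candidates = [0.5, 0.7, 0.85]

# Scarce organs: a false positive is weighted ten times worse than a false negative.
scarce = best_threshold(scores, truly_compatible, candidates, cost_fp=10, cost_fn=1)
# Urgent cases: the weights are reversed.
urgent = best_threshold(scores, truly_compatible, candidates, cost_fp=1, cost_fn=10)
print(scarce, urgent)
```

With these made-up weights, the scarce-organ setting selects the strictest threshold and the urgent setting selects the most relaxed one.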

Conclusion
In organ transplantation, the decisions are truly life-or-death. It might seem like the goal should simply be to “avoid all mistakes,” but in reality, every compatibility decision comes with trade-offs. By understanding what false positives and false negatives mean, transplant teams can set their testing standards in a way that both saves the most lives and protects the precious organs available.

Calculating the Win Probabilities of Premier League Teams (2024)

Faith Cheptoo — Thu, 31 Jul 2025 08:28:20 +0000

By Cheptoo Faith

This analysis estimated the win probabilities of the Premier League teams using actual match data obtained from football-data.org. The article breaks down the step-by-step process of how the probabilities were obtained.

Step 1: Gathering the Match Data
I began by collecting real-time and historical data from the Football-Data.org API (https://api.football-data.org/v4/competitions/PL/matches?season=2024), which provides detailed stats for every Premier League match. Using Python and the requests library, I sent a request to the API and retrieved the data, including match status, team names, and final scores. This gave me the raw data needed for the analysis.
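
The request described in this step might look roughly like the sketch below. It is not the original script. The `X-Auth-Token` header name and the top-level `"matches"` key are assumptions about the API that should be checked against its documentation, and `build_request` and `fetch_matches` are hypothetical helper names.

```python
API_URL = "https://api.football-data.org/v4/competitions/PL/matches"

def build_request(api_token, season=2024):
    """Return the URL, query parameters, and headers for the matches request."""
    params = {"season": season}
    headers = {"X-Auth-Token": api_token}  # assumed header name, check the API docs
    return API_URL, params, headers

def fetch_matches(api_token, season=2024):
    """Download the season's matches and return the list under the assumed "matches" key."""
    import requests  # third-party; imported here so build_request works without it

    url, params, headers = build_request(api_token, season)
    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json().get("matches", [])
```

Calling `fetch_matches("YOUR_TOKEN")` with a valid token would then return the raw list of match objects used in the next step.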

Step 2: Structuring the Data
After retrieving the match data, I used the pandas library to organize it into a structured DataFrame. I extracted key details from each match, including status, home and away teams, and final scores. These were stored in a list of dictionaries and then converted into a DataFrame for easier analysis and manipulation.
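
A minimal sketch of this structuring step is shown below. The nested field names (`homeTeam`, `awayTeam`, `score.fullTime`) are assumptions about the shape of the API response, the sample payload is made up rather than real results, and `flatten_matches` is a hypothetical helper name.

```python
def flatten_matches(matches):
    """Turn raw match objects into flat rows, one dictionary per match."""
    rows = []
    for match in matches:
        full_time = match.get("score", {}).get("fullTime", {})
        rows.append({
            "status": match.get("status"),
            "home_team": match.get("homeTeam", {}).get("name"),
            "away_team": match.get("awayTeam", {}).get("name"),
            "home_score": full_time.get("home"),
            "away_score": full_time.get("away"),
        })
    return rows

# Hypothetical payload in the assumed shape; not real results.
sample = [
    {"status": "FINISHED", "homeTeam": {"name": "Team A"}, "awayTeam": {"name": "Team B"},
     "score": {"fullTime": {"home": 2, "away": 1}}},
    {"status": "TIMED", "homeTeam": {"name": "Team B"}, "awayTeam": {"name": "Team C"},
     "score": {"fullTime": {"home": None, "away": None}}},
]
rows = flatten_matches(sample)
print(rows[0])
```

With pandas installed, passing this list of dictionaries to `pandas.DataFrame(rows)` gives the kind of DataFrame described above.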

Step 3: Calculating Win Probabilities
In this step, I calculated the win probability for each Premier League team based on the 2024 match data. For every team, I counted the number of matches they won both at home and away, then divided the total number of wins by the total number of games they played. The results were stored in a dictionary and converted into a DataFrame, sorted from the highest to the lowest win probability.

Win Probability Formula
Win Probability = (Total Wins/Total Games Played)
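
The formula can be applied to flat match rows as in the sketch below. It uses plain dictionaries instead of pandas to stay self-contained. The `"FINISHED"` status value, the choice to skip unfinished matches, the row field names, and the sample rows are all assumptions made for illustration, not the original analysis code.

```python
from collections import defaultdict

def win_probabilities(rows):
    """Return (team, wins / games played) pairs, highest first, from finished matches."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for row in rows:
        if row["status"] != "FINISHED":  # assumption: skip matches without a result
            continue
        home, away = row["home_team"], row["away_team"]
        played[home] += 1
        played[away] += 1
        if row["home_score"] > row["away_score"]:
            wins[home] += 1
        elif row["away_score"] > row["home_score"]:
            wins[away] += 1
        # A draw counts as a game played but a win for neither side.
    table = {team: wins[team] / played[team] for team in played}
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rows; not real results.
rows = [
    {"status": "FINISHED", "home_team": "Team A", "away_team": "Team B", "home_score": 2, "away_score": 1},
    {"status": "FINISHED", "home_team": "Team B", "away_team": "Team C", "home_score": 0, "away_score": 0},
    {"status": "FINISHED", "home_team": "Team C", "away_team": "Team A", "home_score": 1, "away_score": 3},
    {"status": "TIMED", "home_team": "Team A", "away_team": "Team B", "home_score": None, "away_score": None},
]
ranking = win_probabilities(rows)
print(ranking)
```

Draws are the reason the probabilities across teams do not need to add up to any fixed total: a drawn match adds a game played for both sides and a win for neither.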

How Excel is used in Real-World Analysis

Faith Cheptoo — Wed, 11 Jun 2025 05:02:07 +0000

By Faith

WHAT IS EXCEL?
Excel is a powerful spreadsheet program developed by Microsoft that plays a huge role in today's data-driven world. It helps in organizing data, formatting information, performing calculations, and generating insights from the data. Excel helps us make sense of numbers and turn raw data into meaningful information that can be used in decision-making.

USE IN REAL-WORLD DATA ANALYSIS
Excel is used in various sectors in the world today, including:
1. Business Analysis
Its uses include:

Sales tracking: comparing monthly trends and forecasting future revenue.
Inventory management: monitoring stock levels and product turnover rates.
Financial planning: managing budgets through expense tracking and profit/loss calculation.

2. Marketing Performance
Uses of Excel in marketing performance include:

Campaign analysis: Marketers track which campaigns brought in the most sales and clicks.
Social media metrics analysis: Excel is used to analyze follower growth, engagement rates, and reach across social media platforms.

3. Healthcare
Excel is used in healthcare systems for:

Patient record management: It is used to organize patient information, visit history, medication logs, and lab results.
Data cleaning for research: Health researchers use Excel to clean and prepare data before statistical analysis.
Hospital operations: Excel is used to track resource usage, such as supplies over time, and analyze patient treatment success rates.

Excel Features/Formulas

1. LOOKUP FUNCTIONS (for example, VLOOKUP and HLOOKUP)
VLOOKUP searches the first column of a table for a specific value and returns related data from the same row; HLOOKUP does the same across the first row. For example, in a hospital dataset, we could use VLOOKUP to find a patient’s treatment plan by searching for their ID.

2. LOGICAL FUNCTION (IF Function)
The IF function returns one value when a condition is true and another when it is false, which is really helpful when analyzing performance. For instance, in marketing data, we can create a column that flags whether a product’s discount is “High” or “Low” based on a set percentage.

3. PIVOT TABLES
Pivot tables allow us to summarize large datasets quickly. We can use them to find average sales per region or average ratings by product category, without writing complex formulas.

PERSONAL REFLECTION
Before learning Excel, I used to see data as just “a lot of numbers.” Now, I see it as a place of stories and insights, and Excel is the tool that helps in finding them. Whether I’m analyzing healthcare records, marketing performance, or product reviews, Excel has made it possible to organize the mess, spot patterns, and draw real conclusions. It’s made me more confident and curious about working with data.