Forem: juved

Class imbalance issues

juved — Mon, 15 Jan 2024 03:07:24 +0000

This work is based on the article "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression".

Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, and Ben Van Calster

Intro

Imagine that you are teaching a smart computer to identify dogs and cats in pictures. This process, in which the computer learns to perform the task, is called prediction modelling. How it works depends on the type of program we are using (here, logistic regression) and on the kind of problem we want to solve. In our case, we just want to teach the computer to tell dogs from cats.
While teaching the computer, we find out that we have many more pictures of dogs than of cats. This is what we call class imbalance. To fix it, we can remove some pictures of dogs (random undersampling) or add duplicate pictures of cats (random oversampling). There is also a technique called SMOTE, which creates new synthetic pictures of cats from the existing ones to reduce the imbalance.
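The two resampling ideas described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the toy data (80 "dog" rows, 20 "cat" rows) and the random seed are assumptions, and SMOTE is not shown because it needs a dedicated library.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop majority-class rows at random until every class has the minority size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def random_oversample(X, y, rng):
    """Keep every row and duplicate minority rows at random up to the majority size."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=n_max - n, replace=True)
        parts.append(np.concatenate([members, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

rng = np.random.default_rng(0)
# Hypothetical toy data: label 0 = dog (80 rows), label 1 = cat (20 rows).
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 3))

X_under, y_under = random_undersample(X, y, rng)
X_over, y_over = random_oversample(X, y, rng)
print("undersampled class counts:", np.bincount(y_under))
print("oversampled class counts: ", np.bincount(y_over))
```

Undersampling throws information away, while oversampling repeats the same minority rows; neither adds new information about cats.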

What was the original context of this paper when it was written?

What is the impact of addressing the class imbalance issue mentioned above, such as not having a balanced number of dog and cat pictures to train our smart computer?

The article investigates the impact of the methods commonly used to correct class imbalance, and suggests that these approaches might compromise the quality of the predictions.
The researchers assess each model's performance in terms of how well it distinguishes between the two groups (discrimination), how accurate its predicted probabilities are (calibration), and how effectively it assigns cases to categories (classification).

The paper uses a real-world case, predicting ovarian cancer, to illustrate these findings.

Summary of the paper findings/outcomes

The training dataset included information from 2695 women, 518 of whom had ovarian cancer, and the test set had data from 674 women, including 140 with ovarian cancer.
The big surprise: correcting the imbalance did not make the model any better at its job.
The authors found that, no matter which method they used, the accuracy stayed around 79% to 80%. Even worse, correcting the imbalance made the model flag more people as likely to have the event of interest than actually had it. Using SMOTE, for example, led the model to overestimate the risk of ovarian cancer. Such overestimation can lead patients to unnecessary treatments or interventions, which can harm them for no benefit and create unnecessary costs for the healthcare system.
The study indicates that the common methods used to correct imbalance might not be as helpful as we think. They can even lead to less accurate predictions.
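To see why balancing the classes pushes predicted risks upward, consider a deliberately simplified sketch (not the paper's analysis): a logistic regression with no predictors always predicts the observed event rate. The counts come from the training set described above; the undersampling step and the random seed are assumptions made for illustration.

```python
import numpy as np

# Counts reported for the training set: 2695 women, 518 with ovarian cancer.
n_total, n_events = 2695, 518
y = np.array([1] * n_events + [0] * (n_total - n_events))

# An intercept-only logistic model predicts the observed event rate for everyone.
risk_original = y.mean()

# Random undersampling (assumed here): keep all events and an equal number of non-events.
rng = np.random.default_rng(0)
kept_non_events = rng.choice(np.flatnonzero(y == 0), size=n_events, replace=False)
y_balanced = np.concatenate([y[y == 1], y[kept_non_events]])
risk_balanced = y_balanced.mean()

print(f"predicted risk, original data: {risk_original:.3f}")
print(f"predicted risk, balanced data: {risk_balanced:.3f}")
```

A model trained on the balanced data starts from a baseline risk of one in two, while fewer than one in five women in the original data had the disease, so its probabilities are too high unless they are recalibrated.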

How can this paper inform your work as a junior?

As juniors, we are often excited about the outcomes, regardless of the methodology we are using. This article highlights how a simple and common methodology can affect our results. It emphasizes the need to interpret results in light of the context and of the method used to address class imbalance, and to understand the implications of using model predictions to make decisions, especially in settings like healthcare.

Why is this paper important/why does it matter to a non-technical business stakeholder?

The paper highlights the importance of understanding the implications of using predictive modelling for decision-making. Non-technical business stakeholders should not blindly rely on predictions from models built with imbalance correction methods, given the negative consequences they might cause. Moreover, these predictions should be assessed with the associated risks and impacts in mind.

Linear Transformation

juved — Tue, 19 Dec 2023 13:35:14 +0000

Why use linear transformations when they appear to have no significant impact on the model's performance metrics?
The secret lies in their ability to make results more relatable and understandable for stakeholders. They make the interpretation of the results more compelling, a strategic move that gives the results a human touch.

Scenario: Predicting Energy Consumption

Imagine that we have to predict energy consumption in buildings based on several features, including temperature, square footage, and the number of rooms. Let's build a linear regression model to understand how these features influence energy consumption. We will then modify our model using linear transformation techniques (scaling, shifting, and standardizing).

Data Understanding

We will apply the different linear transformation techniques to the energy_consumption_data_set.

We proceed by importing the necessary libraries: pandas, statsmodels and sklearn. We then use pandas to load our energy consumption dataset.

Let's build our initial model, selecting the features: temperatures, square_footage, room_count, and our target, energy_consumption.
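The original dataset is not reproduced here, so the following is a minimal sketch on synthetic data rather than the model from this post. The variable names, value ranges and "true" coefficients (20, 100, 49) are assumptions chosen for illustration, and the fit uses NumPy's least squares solver; an ordinary least squares fit in statsmodels would estimate the same quantities.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Hypothetical stand-in for the energy consumption dataset; every value is synthetic.
temperature = rng.uniform(30, 100, n)              # degrees Fahrenheit
square_footage = rng.uniform(500, 3000, n)
room_count = rng.integers(1, 10, n).astype(float)
energy_consumption = (
    1000 + 20 * temperature + 100 * square_footage + 49 * room_count
    + rng.normal(0, 50, n)                         # random noise
)

def fit_ols(X, y):
    """Ordinary least squares with an intercept: returns (intercept, coefs, r2)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[0], beta[1:], r2

X = np.column_stack([temperature, square_footage, room_count])
intercept, coefs, r2 = fit_ols(X, energy_consumption)
for name, coef in zip(["temperature", "square_footage", "room_count"], coefs):
    print(f"{name}: {coef:.2f}")
print(f"R^2: {r2:.4f}")
```

Each coefficient is read as the expected change in the target for a one-unit increase in that feature, with the other features held constant.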

*Interpretation:*
For every increase of 1 degree Fahrenheit, energy consumption increases by 20 energy consumption units. For every increase of 1 square foot, there is an increase of 100 energy consumption units. For every additional room, there is an increase of 49 energy consumption units. Each of these effects holds with the other features kept constant.
The results look reasonable; however, the stakeholders are more familiar with Celsius. Let's proceed by converting the temperature to Celsius.

Scaling:

Scaling in linear regression aims to give our variables a tailored makeover. It makes our model more relatable without disrupting the overall metrics. Think of it as the same model expressed in different units, which makes it easier to tell a story that can be shared with stakeholders.

-- We first make a copy of the subset, and then apply the conversion formula, Celsius = (Fahrenheit - 32) × 5/9, to the temperature.

-- Let's build the Celsius model using the converted feature.
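A minimal sketch of this step on synthetic data (the variable names and values are assumptions, not the post's dataset). It refits the model with temperature in Celsius and compares the two fits: the temperature coefficient should be multiplied by 9/5 = 1.8, while the other coefficients and R² should not change.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic, hypothetical data standing in for the energy consumption dataset.
temp_f = rng.uniform(30, 100, n)
sqft = rng.uniform(500, 3000, n)
rooms = rng.integers(1, 10, n).astype(float)
y = 1000 + 20 * temp_f + 100 * sqft + 49 * rooms + rng.normal(0, 50, n)

def fit_ols(X, y):
    """Ordinary least squares with an intercept: returns (intercept, coefs, r2)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[0], beta[1:], r2

# Work on a converted copy so the Fahrenheit values stay untouched.
temp_c = (temp_f - 32) * 5 / 9

_, coefs_f, r2_f = fit_ols(np.column_stack([temp_f, sqft, rooms]), y)
_, coefs_c, r2_c = fit_ols(np.column_stack([temp_c, sqft, rooms]), y)

print("Celsius / Fahrenheit coefficient ratio:", coefs_c[0] / coefs_f[0])
print("R^2 (Fahrenheit):", r2_f)
print("R^2 (Celsius):   ", r2_c)
```

One degree Celsius spans 1.8 degrees Fahrenheit, so the same physical effect is expressed with a coefficient 1.8 times larger.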

*Interpretation:*
For every increase of 1 degree Celsius, energy consumption increases by 37 energy consumption units.
For every increase of 1 square foot, there is an increase of 100 energy consumption units.
For every additional room, there is an increase of 49 energy consumption units.

Shifting:

In the context of linear regression models, shifting is a common practice to improve model interpretability, reduce multicollinearity (mainly when the model includes interaction or polynomial terms) and give the intercept a meaningful interpretation. It typically refers to centring (mean-centring) the variables, that is, subtracting the mean of a variable from each individual data point of that variable. Shifting should be carried out before building the model.

Below, we shift our features so that 0 represents the mean.

-- Let's build our centred model using X_centred.
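A minimal sketch of centring on synthetic data (names and values are assumptions; only X_centred follows the naming used above). After centring, the slopes should be unchanged and the intercept should equal the mean of the target, which is the prediction for a building with average features.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic, hypothetical data standing in for the energy consumption dataset.
temp = rng.uniform(30, 100, n)
sqft = rng.uniform(500, 3000, n)
rooms = rng.integers(1, 10, n).astype(float)
y = 1000 + 20 * temp + 100 * sqft + 49 * rooms + rng.normal(0, 50, n)

def fit_ols(X, y):
    """Ordinary least squares with an intercept: returns (intercept, coefs)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

X = np.column_stack([temp, sqft, rooms])
X_centred = X - X.mean(axis=0)     # each column now has mean 0

intercept_raw, coefs_raw = fit_ols(X, y)
intercept_centred, coefs_centred = fit_ols(X_centred, y)

print("raw intercept:    ", intercept_raw)
print("centred intercept:", intercept_centred)
print("mean of target:   ", y.mean())
```

The raw intercept describes a building at 0 degrees with 0 square feet and 0 rooms, which is not a realistic case; the centred intercept describes the average building.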

*Interpretation:*

We would expect about 205157 energy consumption units for a building with average temperature, average square_footage and average room_count.

Standardizing:

Standardizing provides the benefit of making the coefficients comparable to each other, because every feature is expressed in units of its own standard deviation.

-- Let's run .describe() on the dataset to inspect the values before standardization.

-- Let's standardize the features.

-- Let's build our standardized model.
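A minimal sketch of standardization on synthetic data (names and values are assumptions, not the post's dataset). Each feature is centred and divided by its standard deviation, computed here with NumPy's default population formula, which is also what scikit-learn's StandardScaler uses. Each standardized coefficient should equal the raw coefficient multiplied by that feature's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic, hypothetical data standing in for the energy consumption dataset.
temp = rng.uniform(30, 100, n)
sqft = rng.uniform(500, 3000, n)
rooms = rng.integers(1, 10, n).astype(float)
y = 1000 + 20 * temp + 100 * sqft + 49 * rooms + rng.normal(0, 50, n)

def fit_ols(X, y):
    """Ordinary least squares with an intercept: returns (intercept, coefs)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

X = np.column_stack([temp, sqft, rooms])
sd = X.std(axis=0)                          # population standard deviation per feature
X_standardized = (X - X.mean(axis=0)) / sd  # mean 0, standard deviation 1

_, coefs_raw = fit_ols(X, y)
_, coefs_std = fit_ols(X_standardized, y)

for name, raw, std in zip(["temperature", "square_footage", "room_count"],
                          coefs_raw, coefs_std):
    print(f"{name}: raw={raw:.2f}, standardized={std:.2f}")
```

In this synthetic data, square_footage has by far the largest standardized coefficient, because it combines a large raw coefficient with a wide spread of values.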

*Interpretation:*
For each increase of 1 standard deviation in temperature, we see an associated increase of about 21 energy_consumption units.
For each increase of 1 standard deviation in square_footage, we see an associated increase of about 100 energy_consumption units.
For each increase of 1 standard deviation in room_count, we see an associated increase of about 49 energy_consumption units.

Conclusion:

These transformations modify the features without compromising the linear regression itself, as illustrated in the example of predicting energy consumption. There are other scaling techniques, such as Min-Max Scaling and Unit Vector Transformation, along with many other tools provided by Scikit-learn.

More :
https://sebastianraschka.com/Articles/2014_about_feature_scaling.html
https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

STARTING WITH GITHUB

juved — Tue, 28 Nov 2023 03:01:38 +0000

GitHub

GitHub is a cloud-based service for version control and software development, and it is very popular in data science and many other fields. It enables collaboration, encourages teams to work together, and gives them confidence in managing changes and versions.

*Git and GitHub*
Git, created in 2005 by Linus Torvalds, is an open-source version control system. It allows each team member to work on a separate branch. GitHub uses Git as its version control system.

Work Flow understanding

REPOSITORY
A repository, or repo, is a storage location containing all project files, including code, images, documents, and the revision history. It serves as the central location where team members can contribute to the project from their own branches.
In GitHub, there are two types of repositories:

Public Repositories: They are visible and accessible to anyone online, and the content can be cloned. They are offered for free by GitHub to facilitate collaboration on open-source projects.
Private Repositories: They are visible only to the repository owner and invited collaborators. The owner controls access by inviting team members to collaborate on a private repository. GitHub's free plan also includes private repositories, while paid plans add further features.

BRANCH

A branch is a parallel line of development within the repository, where each team member can work in isolation without affecting the main (or master) branch.

Here are some common git commands:
-- git branch <branch_name> # to create a new branch
-- git checkout <branch_name> # to switch to a branch
-- git switch <branch_name> # to switch to a branch
-- git checkout -b <branch_name> # to create and switch to a new branch
-- git branch # to list all local branches
-- git merge <branch_name> # to merge a branch into the current branch
-- git branch -d <branch_name> # to delete a local branch
-- git push origin --delete <branch_name> # to delete a remote branch
-- git push origin <branch_name> # to push a local branch to a remote
-- git push -f origin <branch_name> # to force push a branch to a remote (use with caution)
-- git branch -m <new_branch_name> # to rename the current branch
-- git branch -v # to view the last commit on each branch
-- git diff <branch_name1>..<branch_name2> # to compare changes between branches
-- git merge --abort # to abort a merge
-- git branch -r # to list remote branches

More Git commands:

Git status: this command shows the state of the working directory and the staging area. It is recommended to use it often to check your directory.
-- git status

Git init: creates a new repository in the current directory. Avoid creating a repository inside another one.
-- git init

Git clone: downloads an existing repository to your local machine, so that you can work locally on the project files.
-- git clone <repository_url>

Git add: after modifying files in the directory, git add stages them for the next commit.
-- git add <file_name>

Git commit: after staging the files with git add, git commit saves the changes. A message describing the changes is recommended.
-- git commit -m "example message"

Git push: uploads your local commits to the remote repository.
-- git push origin <branch_name>

Fetching and Merging
Fetching retrieves all the changes from the remote repository without merging them into your local branches or working directory. Merging them is a separate step, done with git merge.

-- git fetch origin master
-- git fetch upstream

Pull

Git pull is used to update your local repository with changes made by team members on the remote repository. It combines two commands: git fetch and git merge.

Pull request

A pull request allows a team member to propose changes to a repository. It is a way to notify other team members about changes that you have pushed to a branch in a repository.

Things to avoid

Pushing sensitive information
Avoid pushing sensitive information, such as passwords or private files, to a public repository. Use a .gitignore file to exclude them.
Force pushing to a shared branch
Be careful with force pushing, particularly to branches that others work on. It can cause confusion and overwrite their changes.
Inconsistent branch naming
Consistent branch naming is a must. It makes it easier for collaborators to understand the purpose of each branch.
Ignoring pull requests and issues
Not regularly updating forks
If you fork a repository, synchronize regularly with the upstream repository to make sure you integrate the changes made by the collaborators.

Git documentation https://git-scm.com/docs

Alternatives to GitHub: GitLab, GitKraken, Beanstalk, Gitea, RhodeCode, SourceForge

What next for me ?

juved — Mon, 06 Nov 2023 02:35:14 +0000

What next...?

After years of moving around and working in academia and the IT industry, I find myself, like everyone at some point in their career, asking the question that comes with growth and change: “What next?”

Data Science?

Inevitably, data science has always been on my radar, persistently haunting me… I have worked on data science projects in the past, wearing the hat of a project manager. I was captivated by the discussions within the teams and fascinated by the questions we were attempting to answer. It seems like a fun and highly creative field, so why not give it a try?

Opportunities ?

It’s universally acknowledged that we are living in the data era, with millions of terabytes of data generated daily. The data market is estimated to be worth hundreds of billions of dollars. The opportunities are therefore vast and present in many domains. The creativity and innovation have surprised many in terms of what can be achieved with data. The demand for data scientists is at an all-time high.

The vision !

Becoming a Data Scientist Project Manager who contributes technically and leads successful data-related projects that will shape the future and benefit everyone.