<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Damini Jain</title>
    <description>The latest articles on Forem by Damini Jain (@jaindamini1111).</description>
    <link>https://forem.com/jaindamini1111</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F787750%2F04a1b207-3e2e-4eac-9f35-1af52c25704c.png</url>
      <title>Forem: Damini Jain</title>
      <link>https://forem.com/jaindamini1111</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jaindamini1111"/>
    <language>en</language>
    <item>
      <title>Sampling Bias and how to fix it?</title>
      <dc:creator>Damini Jain</dc:creator>
      <pubDate>Sun, 09 Jan 2022 12:09:18 +0000</pubDate>
      <link>https://forem.com/jaindamini1111/sampling-bias-and-how-to-fix-it-1k5n</link>
      <guid>https://forem.com/jaindamini1111/sampling-bias-and-how-to-fix-it-1k5n</guid>
      <description>&lt;p&gt;Pre-requisites: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic knowledge of Python &lt;/li&gt;
&lt;li&gt;A working Jupyter notebook for testing the code&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  What is sampling bias?
&lt;/h2&gt;

&lt;p&gt;To understand sampling bias, let us consider an example.&lt;/p&gt;

&lt;p&gt;Suppose we are statisticians conducting a study on a new drug introduced in the market for producing more serotonin. We have been asked to identify whether the new drug is more effective on the male or the female population.&lt;/p&gt;

&lt;p&gt;We collect data on the population who have taken the drug and create more features around it, where our target &lt;code&gt;(y)&lt;/code&gt; is gender:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1 if it's a female, 0 if it's a male&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you guessed it we have a &lt;strong&gt;binary classification&lt;/strong&gt; problem at our hands.&lt;/p&gt;

&lt;p&gt;When we load the dataset we realize - &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye9cvsonflohksxl727u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye9cvsonflohksxl727u.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are 4 females and 1 male in the dataset. &lt;em&gt;(Since it's a hypothetical scenario and my drawing is horrible, I could only make 5 stick figures.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now imagine we get a full dataset with the same property: the &lt;strong&gt;ratio of the classes in the target is imbalanced&lt;/strong&gt; - for every 4 females there is 1 male.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any analysis we do and any model we build on this data would be misleading: because our &lt;strong&gt;target is highly imbalanced, the model gets biased towards the majority class&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
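
&lt;p&gt;To see why this bias matters, here is a minimal sketch in plain Python (the labels are hypothetical, not the article's data): on a 4:1 dataset, a "model" that always predicts the majority class already scores 80% accuracy while learning nothing about the minority class.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

#Hypothetical labels with a 4:1 female-to-male ratio (1 = female, 0 = male).
y = [1, 1, 1, 1, 0] * 100

#A "model" that always predicts the majority class...
majority_class = Counter(y).most_common(1)[0][0]
predictions = [majority_class] * len(y)

#...gets 0.8 accuracy without learning anything about the minority class.
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;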

&lt;p&gt;Feeling a bit stressed, eh?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pkce8sqj7e1prj5xcig.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pkce8sqj7e1prj5xcig.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  To deal with imbalanced datasets, we can take two routes -
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Either we drop rows where the gender is female until both classes are in equal proportion, resulting in a &lt;strong&gt;case of undersampling&lt;/strong&gt; of the female (majority) class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Or we add rows where the gender is male until both classes are in equal proportion, resulting in a &lt;strong&gt;case of oversampling&lt;/strong&gt; of the male (minority) class.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
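
&lt;p&gt;The two routes can be sketched on a toy list of labels (a hypothetical 8:2 split, plain Python only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random

random.seed(42)
females = ["f"] * 8   #majority class
males = ["m"] * 2     #minority class

#Route 1: undersample the majority class down to the minority size.
undersampled = random.sample(females, len(males)) + males

#Route 2: oversample the minority class (with replacement) up to the majority size.
oversampled = females + random.choices(males, k=len(females))

print(len(undersampled), len(oversampled))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;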

&lt;p&gt;There are &lt;a href="https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis" rel="noopener noreferrer"&gt;several techniques&lt;/a&gt; for oversampling, such as - &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Random oversampling &lt;/li&gt;
&lt;li&gt;SMOTE&lt;/li&gt;
&lt;li&gt;ADASYN &lt;/li&gt;
&lt;li&gt;Augmentation &lt;/li&gt;
&lt;/ol&gt;
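
&lt;p&gt;As a quick taste of SMOTE, here is a minimal sketch on a synthetic dataset (not the article's data): SMOTE synthesizes new minority points by interpolating between a minority sample and its nearest minority-class neighbours.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

#Synthetic imbalanced data: roughly 4 majority samples per minority sample.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
print(sorted(Counter(y).items()))

#SMOTE interpolates between minority neighbours to create synthetic samples.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(sorted(Counter(y_res).items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;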

&lt;p&gt;For undersampling we have - &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Random undersampling &lt;/li&gt;
&lt;li&gt;Cluster centroids &lt;/li&gt;
&lt;li&gt;Tomek links &lt;/li&gt;
&lt;/ol&gt;
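
&lt;p&gt;For a flavour of the undersampling side, here is a Tomek links sketch (again on synthetic data, not the article's dataset): it removes majority samples that form a Tomek link - mutual nearest neighbours of opposite classes - cleaning the class boundary rather than forcing an exactly equal split.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

#Tomek links drops majority points that are mutual nearest neighbours
#of a minority point, so the counts get closer but not exactly equal.
X_res, y_res = TomekLinks().fit_resample(X, y)
print(sorted(Counter(y).items()), sorted(Counter(y_res).items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;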

&lt;p&gt;In this article, we will learn about &lt;a href="https://pypi.org/project/imbalanced-learn/" rel="noopener noreferrer"&gt;imblearn&lt;/a&gt; and resolve the case of sampling bias. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2gl6dw2ct9xamwp9dj0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2gl6dw2ct9xamwp9dj0.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The imblearn library offers methods by which we can generate a dataset that has an equivalent proportion of classes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let us build a classification model and see this working. For the  dataset and code used in the article please refer &lt;a href="https://github.com/jaindamini1111/ML-DT-Course-Files/tree/main/Article" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;br&gt;
&lt;em&gt;The data has been generated using Excel.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2o2qjwa2woxw5rdky6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2o2qjwa2woxw5rdky6.jpg" alt="Target distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see above the distribution of female vs male in our target variable is imbalanced. Let us use the sampling technique to overcome the situation.&lt;/p&gt;
&lt;h3&gt;
  We will first perform oversampling.
&lt;/h3&gt;

&lt;p&gt;Top 2 rows of our data after preprocessing (steps available in the repo)&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah0gkbrsb6har489e3jv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah0gkbrsb6har489e3jv.jpg" alt="preprocessed data"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import imblearn  #Importing all libraries. 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter
from xgboost import XGBClassifier


y = data["Gender"]   #Creating target and feature set; 'data' is the preprocessed DataFrame from the repo.
X = data.drop("Gender", axis=1)

#Splitting data to create train, valid &amp;amp; test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle = True, test_size = 0.2, random_state=42, stratify = y)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, shuffle = True, test_size = 0.2, random_state=42, stratify = y_train)

print(sorted(Counter(y_train).items())) #Let's see the group counts in our training set prior to sampling.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisd9c4dy503qtdamterv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisd9c4dy503qtdamterv.jpg" alt="print result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, before correcting for sampling we have 261 instances of class 0 and 635 instances of class 1. If we oversample class 0 we will have a balanced dataset; let us do that using imblearn's &lt;code&gt;RandomOverSampler&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The important thing to note here is that &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;we fit the sampling object only on the training data, not on the valid &amp;amp; test sets; otherwise, data leakage occurs.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
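
&lt;p&gt;One way to make this rule hard to break - a sketch using imblearn's &lt;code&gt;Pipeline&lt;/code&gt; on synthetic data, not the article's setup - is to put the sampler inside the pipeline. During cross-validation, resampling is then applied only to each training fold, while validation folds are scored untouched:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

#The sampler runs only on each training fold; validation folds stay untouched.
pipe = Pipeline([
    ("sampler", RandomOverSampler(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;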

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ros = imblearn.over_sampling.RandomOverSampler(random_state=42)
x_ros, y_ros = ros.fit_resample(X_train, y_train) #Fitting the oversampling object to our training set.
print(sorted(Counter(y_ros).items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1r54l5unpo2fqxmiy9o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1r54l5unpo2fqxmiy9o.jpg" alt="Oversampling the class 0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the classes are balanced. We will build an &lt;strong&gt;XGBoost model&lt;/strong&gt; on the new data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
model = XGBClassifier(objective='binary:logistic')
model.fit(x_ros, y_ros)
y_pred = model.predict(X_valid)
print(model.score(X_valid, y_valid))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get a 0.59 accuracy score on the validation dataset without any hyperparameter tuning.&lt;/p&gt;

&lt;h3&gt;
  Let us perform undersampling now.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ros_under = imblearn.under_sampling.RandomUnderSampler(random_state=42)
x_ros_under, y_ros_under = ros_under.fit_resample(X_train, y_train) #Fitting the undersampling object to our training set.
print(sorted(Counter(y_ros_under).items()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jpwxy6h0sw99wyem1sm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jpwxy6h0sw99wyem1sm.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = XGBClassifier(objective='binary:logistic')
model.fit(x_ros_under, y_ros_under)
y_pred = model.predict(X_valid)
print(model.score(X_valid, y_valid))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get a 0.51 accuracy score on the validation dataset when we undersample the data, again without any hyperparameter tuning.&lt;/p&gt;

&lt;p&gt;One reason the accuracy is lower with undersampling is that the model has less data to learn from. Hence, in our case, oversampling without hyperparameter tuning wins on the validation dataset.&lt;/p&gt;

&lt;h3&gt;
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we discussed how we can pre-process an imbalanced-class dataset before building predictive models.&lt;/p&gt;

&lt;p&gt;We first performed oversampling and then undersampling. You can check the imblearn documentation &lt;a href="https://imbalanced-learn.org/stable/index.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before we part, note that most ensemble algorithms today offer a parameter for handling imbalanced datasets. In XGBoost we have the &lt;code&gt;scale_pos_weight&lt;/code&gt; parameter, but it leaves less room for configuration; with resampling techniques such as SMOTE we keep control over how the bias is resolved.&lt;/p&gt;
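
&lt;p&gt;As a rough sketch of that parameter (on synthetic data, not the article's dataset): &lt;code&gt;scale_pos_weight&lt;/code&gt; is conventionally set to the negative-to-positive count ratio, which upweights the minority class inside the loss instead of resampling the rows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

#Conventional heuristic: scale_pos_weight = count(negative) / count(positive).
counts = Counter(y_train)
model = XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=counts[0] / counts[1],
    random_state=42,
)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;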

&lt;p&gt;Edit 01: You may notice a difference in accuracy scores since I didn't set the &lt;code&gt;random_state&lt;/code&gt; parameter while creating my classifier.&lt;br&gt;
Shoutout to Barbara Bredner for pointing this out :)&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
