<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sparsh Gupta</title>
    <description>The latest articles on Forem by Sparsh Gupta (@imsparsh).</description>
    <link>https://forem.com/imsparsh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG</url>
      <title>Forem: Sparsh Gupta</title>
      <link>https://forem.com/imsparsh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/imsparsh"/>
    <language>en</language>
    <item>
      <title>Importance of Data Visualization — Anscombe’s Quartet Way</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Mon, 27 Jul 2020 17:48:56 +0000</pubDate>
      <link>https://forem.com/imsparsh/importance-of-data-visualization-anscombe-s-quartet-way-5693</link>
      <guid>https://forem.com/imsparsh/importance-of-data-visualization-anscombe-s-quartet-way-5693</guid>
      <description>&lt;h4&gt;
  
  
  Four datasets that fool a linear regression model if one is fitted without first looking at the data.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F1%2AteCUzrolOckJEyHsNhi_Ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F1%2AteCUzrolOckJEyHsNhi_Ng.png" alt="Image by Author" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Anscombe’s quartet&lt;/strong&gt; comprises four &lt;a href="https://en.wikipedia.org/wiki/Data_set" rel="noopener noreferrer"&gt;data sets&lt;/a&gt; that have nearly identical simple &lt;a href="https://en.wikipedia.org/wiki/Descriptive_statistics" rel="noopener noreferrer"&gt;descriptive statistics&lt;/a&gt;, yet have very different distributions and appear very different when graphed.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — Wikipedia
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Anscombe’s Quartet&lt;/strong&gt; is a group of four data sets that are &lt;strong&gt;nearly identical in simple descriptive statistics&lt;/strong&gt;, yet each contains peculiarities that &lt;strong&gt;fool a regression model&lt;/strong&gt; fitted to it. The datasets have very different distributions and &lt;strong&gt;appear very different&lt;/strong&gt; when shown on scatter plots.&lt;/p&gt;

&lt;p&gt;The quartet was constructed in 1973 by the statistician &lt;strong&gt;Francis Anscombe&lt;/strong&gt; to illustrate the &lt;strong&gt;importance&lt;/strong&gt; of &lt;strong&gt;plotting the data&lt;/strong&gt; before analysing it and building models, and the effect of &lt;strong&gt;outliers and other influential observations on statistical properties&lt;/strong&gt;. The four datasets share nearly the &lt;strong&gt;same statistical summaries&lt;/strong&gt;, i.e. they provide the same statistical information, such as the &lt;strong&gt;mean&lt;/strong&gt; and &lt;strong&gt;variance&lt;/strong&gt; of all the x, y points, across all four datasets.&lt;/p&gt;

&lt;p&gt;This tells us about the importance of visualising the data before applying algorithms to build models from it: the data features must be plotted in order to see the distribution of the samples, which helps you identify anomalies present in the data such as outliers, the diversity of the data, its linear separability, and so on. Also, Linear Regression can only be considered a good fit for &lt;strong&gt;data with linear relationships&lt;/strong&gt; and is incapable of handling any other kind of dataset properly. These four plots can be defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AwMuoOLohuNbTWbbu_rpujg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AwMuoOLohuNbTWbbu_rpujg.png" alt="Image by Author" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The statistical information for all four datasets is approximately the same and can be computed as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AUrXAppaF09s88C_rG0KRjA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AUrXAppaF09s88C_rG0KRjA.png" alt="Image by Author" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When these datasets are drawn as scatter plots, each one produces a very different kind of plot, even though a regression algorithm, fooled by these peculiarities, reports nearly identical results for all of them, as can be seen here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2314%2F1%2A4H7ByZaIXvke8NVAOZ8E2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2314%2F1%2A4H7ByZaIXvke8NVAOZ8E2g.png" alt="Image by Author" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The four datasets can be described as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt; this &lt;strong&gt;fits&lt;/strong&gt; the linear regression model pretty well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt; linear regression &lt;strong&gt;cannot fit&lt;/strong&gt; this data well, as the relationship is non-linear.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 3:&lt;/strong&gt; contains an &lt;strong&gt;outlier&lt;/strong&gt; that pulls the fitted line away from the otherwise linear trend, which the linear regression model &lt;strong&gt;cannot handle&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 4:&lt;/strong&gt; contains a single &lt;strong&gt;high-leverage point&lt;/strong&gt; that produces a high correlation even though the remaining points show no relationship, which the linear regression model also &lt;strong&gt;cannot handle&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We have described the four datasets that were intentionally constructed to demonstrate the importance of data visualisation and how easily a regression algorithm can be fooled without it. Hence, all the important features in a dataset should be visualised before any machine learning algorithm is applied to them, which helps in building a well-fitting model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh" rel="noopener noreferrer"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-story__hidden-navigation-link"&gt;Assumptions in Linear Regression you might not know.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/imsparsh" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.dev.to%2Fdynamic%2Fimage%2Fwidth%3D90%2Cheight%3D90%2Cfit%3Dcover%2Cgravity%3Dauto%2Cformat%3Dauto%2Fhttps%253A%252F%252Fdev-to-uploads.s3.amazonaws.com%252Fuploads%252Fuser%252Fprofile_image%252F416289%252F18dab641-23d9-4b5a-b8f2-64eaa6b8deb1.jpeg" alt="imsparsh profile" class="crayons-avatar__image" width="612" height="612"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/imsparsh" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sparsh Gupta
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sparsh Gupta
                
              
              &lt;div id="story-author-preview-content-400657" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/imsparsh" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.dev.to%2Fdynamic%2Fimage%2Fwidth%3D90%2Cheight%3D90%2Cfit%3Dcover%2Cgravity%3Dauto%2Cformat%3Dauto%2Fhttps%253A%252F%252Fdev-to-uploads.s3.amazonaws.com%252Fuploads%252Fuser%252Fprofile_image%252F416289%252F18dab641-23d9-4b5a-b8f2-64eaa6b8deb1.jpeg" class="crayons-avatar__image" alt="" width="612" height="612"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sparsh Gupta&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 16 '20&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" id="article-link-400657"&gt;
          Assumptions in Linear Regression you might not know.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;



&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-story__hidden-navigation-link"&gt;Most Common Loss Functions in Machine Learning&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/imsparsh" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG" alt="imsparsh profile" class="crayons-avatar__image" width="800" height="1066"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/imsparsh" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sparsh Gupta
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sparsh Gupta
                
              
              &lt;div id="story-author-preview-content-387064" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/imsparsh" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG" class="crayons-avatar__image" alt="" width="800" height="1066"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sparsh Gupta&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 9 '20&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" id="article-link-387064"&gt;
          Most Common Loss Functions in Machine Learning
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/computerscience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;computerscience&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;31&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
    <item>
      <title>Assumptions in Linear Regression you might not know.</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Thu, 16 Jul 2020 17:36:15 +0000</pubDate>
      <link>https://forem.com/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6</link>
      <guid>https://forem.com/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6</guid>
      <description>&lt;h4&gt;
  
  
  The data should conform to these assumptions for Linear Regression to produce the best possible fit.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pGQpYK9h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11150/0%2A0O7LtlDWczZU05fX" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pGQpYK9h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11150/0%2A0O7LtlDWczZU05fX" alt="Photo by [Joseph Barrientos](https://unsplash.com/@jbcreate_?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— All the images (plots) are generated and modified by Author.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;To begin with, Linear Regression is a method of modelling the best &lt;strong&gt;linear relationship&lt;/strong&gt; between the &lt;strong&gt;independent&lt;/strong&gt; variables and the &lt;strong&gt;dependent&lt;/strong&gt; variable. The simplest form of Linear Regression can be defined by the following equation, with one independent and one dependent variable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cNat4d17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2At2YgYdjJQJcyypqlj2T_4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cNat4d17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2At2YgYdjJQJcyypqlj2T_4w.png" alt="Simple Linear Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;x&lt;/strong&gt; is the independent variable, &lt;br&gt;
&lt;strong&gt;y&lt;/strong&gt; is the dependent variable,&lt;br&gt;
&lt;strong&gt;β1&lt;/strong&gt; is the coefficient of x, i.e. slope, &lt;br&gt;
&lt;strong&gt;β0&lt;/strong&gt; is the intercept (constant), i.e. the value of y where the line crosses the y-axis.&lt;/p&gt;
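
&lt;p&gt;To make the equation concrete, here is a small illustrative sketch (the data values are made up) that fits y = β0 + β1·x with scikit-learn and prints the estimated slope and intercept:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# illustrative sketch: fit y = b0 + b1*x on toy data (values are made up)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])            # dependent variable

model = LinearRegression().fit(x, y)
print("beta1 (slope):", model.coef_[0])
print("beta0 (intercept):", model.intercept_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;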

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Linear regression&lt;/strong&gt; is a &lt;a href="https://en.wikipedia.org/wiki/Linearity"&gt;linear&lt;/a&gt; approach to modelling the relationship between a scalar response (or &lt;a href="https://en.wikipedia.org/wiki/Dependent_variable"&gt;dependent variable&lt;/a&gt;) and one or more &lt;a href="https://en.wikipedia.org/wiki/Explanatory_variable"&gt;explanatory variables&lt;/a&gt; (or &lt;a href="https://en.wikipedia.org/wiki/Independent_variable"&gt;independent variables&lt;/a&gt;).
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — &lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;Wikipedia&lt;/a&gt;
&lt;/h1&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Linear Regression Types
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Simple Linear Regression&lt;/strong&gt; — The simplest form of regression, involving one independent variable and one dependent variable, as explained above, where we fit a straight line to the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multiple Linear Regression&lt;/strong&gt; — A more general form of regression involving multiple independent variables and one dependent variable, which can be described by the following equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qkx3Km-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATvjV8mzecaWtYjpTznZgVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qkx3Km-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATvjV8mzecaWtYjpTznZgVA.png" alt="Multiple Linear Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;x1&lt;/strong&gt; to &lt;strong&gt;xn&lt;/strong&gt; are the independent variables, &lt;br&gt;
&lt;strong&gt;y&lt;/strong&gt; is the dependent variable,&lt;br&gt;
&lt;strong&gt;β1&lt;/strong&gt; to &lt;strong&gt;βn&lt;/strong&gt; are the coefficients of the respective x features, and&lt;br&gt;
&lt;strong&gt;β0&lt;/strong&gt; is the intercept (constant), i.e. the predicted value of y when all the x features are zero.&lt;/p&gt;
&lt;h2&gt;
  
  
  Assumptions in Linear Regression
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y79rFBoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2A8W0qNKPE5SdjMVLt" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y79rFBoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2A8W0qNKPE5SdjMVLt" alt="Photo by [Tom Roberts](https://unsplash.com/@tomrdesigns?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Linear Relationship&lt;/strong&gt; — It is assumed that the relationship between the independent variables and the dependent variable is linear, i.e. the model is linear in the coefficients, which are what we estimate during model building and use for prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c7GqPA5w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfLObzuuBDCG369iGH0h3_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c7GqPA5w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfLObzuuBDCG369iGH0h3_A.png" alt="Image by Author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predictor variables are treated as fixed values and may themselves be transformed by arbitrary functions such as polynomials or trigonometric functions, but the model must remain strictly linear in the coefficients of those predictor variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bsFW_sjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Apm-InClFV1k2q3WNzC_EVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bsFW_sjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Apm-InClFV1k2q3WNzC_EVw.png" alt="Polynomial Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This property is what enables &lt;strong&gt;Polynomial regression&lt;/strong&gt;, which uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable while keeping the relationship with the coefficients linear.&lt;/p&gt;
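
&lt;p&gt;A minimal sketch of this idea, assuming scikit-learn and toy data, fits a cubic polynomial while the underlying model stays linear in its coefficients:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: polynomial regression is linear regression on polynomial features (toy data)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.3, size=50)

# cubic features of x, but the model is still linear in its coefficients
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.named_steps["linearregression"].coef_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;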

&lt;p&gt;&lt;strong&gt;2. Homoscedasticity (Constant Variance)&lt;/strong&gt; — It is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the errors is the same across different values of the independent features, regardless of the values of the predictor variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wEEbXIvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3326/1%2AJan9oVOzNqQyhA4bSg_zwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wEEbXIvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3326/1%2AJan9oVOzNqQyhA4bSg_zwA.png" alt="Image by Author — Modified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There should be no clear pattern in the distribution of the residuals; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. the variance stays constant. The middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant-variance rule, and the rightmost graph reveals a pattern where the error terms decrease with the predicted values, again representing heteroscedasticity. Two or more normal distributions are homoscedastic if they share a common covariance (or correlation) matrix.&lt;/p&gt;
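
&lt;p&gt;A simple way to eyeball this assumption is a residuals-versus-fitted plot; the sketch below uses toy data with constant-variance noise, so the points should scatter evenly around zero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: residuals vs. fitted values; a funnel or curve suggests heteroscedasticity (toy data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(scale=1.0, size=200)   # constant-variance noise

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;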

&lt;p&gt;&lt;strong&gt;3. Multivariate Normality&lt;/strong&gt; — It is assumed that the error terms are normally distributed with a mean of zero. A less widely known fact is that, as the sample size grows large, the normality assumption for the residuals is no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HcNz05MC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am5u5g0Gs7r0L8fS0mOkIDA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HcNz05MC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am5u5g0Gs7r0L8fS0mOkIDA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above q-q plot shows that the errors, or residuals, are normally distributed. The error term can be seen as a composite of many minor errors; as the number of these minor errors increases, the distribution of the error term tends to approach the normal distribution. This tendency is explained by the Central Limit Theorem. Note that the t-test and F-test used for inference are only strictly applicable when the error term is normally distributed.&lt;/p&gt;
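
&lt;p&gt;A q-q plot like the one above can be drawn with scipy; the sketch below uses stand-in residuals, since the article’s own residuals are not included here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: Q-Q plot of residuals against a normal distribution (stand-in residuals)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=300)        # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;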

&lt;p&gt;&lt;strong&gt;4. No Multicollinearity&lt;/strong&gt; — Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are not, or only weakly, correlated with each other, which keeps them truly independent. As a practical rule of thumb, the correlation between two independent features should not exceed roughly 30%, since high correlation weakens the statistical power of the model. Pair plots (scatter plots) and heatmaps (correlation matrices) can be used to identify highly correlated features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tpVjKCRr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVYPi2Tqx02Lw1VYs28rtDg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tpVjKCRr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVYPi2Tqx02Lw1VYs28rtDg.png" alt="Correlation Heatmap — Image by Author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Highly correlated features should not be used together in the model because they tend to change in unison. When one feature changes, its correlated partner changes with it rather than staying constant, yet the usual interpretation of a regression coefficient assumes all other features are held constant while the outcome is predicted from the weighted coefficients. With multicollinearity, that expected interpretation of the regression coefficients no longer holds.&lt;/p&gt;
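
&lt;p&gt;In practice, a correlation heatmap together with variance inflation factors (VIF) from statsmodels can flag such features; the sketch below uses a toy DataFrame with one deliberately correlated pair:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: flag correlated predictors with a correlation heatmap and VIF (toy DataFrame)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=200)})
X["x2"] = X["x1"] * 0.9 + rng.normal(scale=0.3, size=200)   # deliberately correlated with x1
X["x3"] = rng.normal(size=200)

sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.show()

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # noticeably large values usually signal problematic multicollinearity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;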

&lt;p&gt;&lt;strong&gt;5. No Auto-correlation&lt;/strong&gt; — It is assumed that there is no auto-correlation in the data. Auto-correlation mainly occurs when there is dependency between the residual errors, i.e. the residuals are correlated, positively or negatively, with one another instead of being spread randomly. This usually occurs in time-series models, where the next instant depends on the previous one. The presence of correlation in the residual terms also reduces the model’s predictability.&lt;/p&gt;

&lt;p&gt;Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pgwuGyCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AlP_1ng3tQZ3UAidLDoGVXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pgwuGyCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AlP_1ng3tQZ3UAidLDoGVXQ.png" alt="Durbin-Watson Equation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Durbin-Watson test statistic always has a value between 0 and 4. An exact value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.&lt;/p&gt;
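
&lt;p&gt;Rather than computing the statistic by hand, statsmodels provides it directly; the sketch below applies it to stand-in residuals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: Durbin-Watson statistic on residuals (stand-in residuals; a value near 2 means no autocorrelation)
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
residuals = rng.normal(size=300)         # stand-in for model residuals

print(durbin_watson(residuals))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;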

&lt;p&gt;&lt;strong&gt;6. No Extrapolation&lt;/strong&gt; — Extrapolation is estimation beyond the original observation range. It is assumed that the trained model will predict values of the dependent variable only for independent feature values that lie within the range of the training data. The model therefore cannot guarantee its predictions for values beyond the range of the independent features it was trained on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SAYGLR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2118/1%2Aennj4kl3b724C5w1to42rQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SAYGLR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2118/1%2Aennj4kl3b724C5w1to42rQ.jpeg" alt="Image by Author — Modified"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We have explained the most important assumptions that should be checked before fitting a Linear Regression model to a given set of data. These assumptions are a formal measure to ensure that the predictions of the fitted linear regression model are good enough to give us the best possible results for the data set. If they are not satisfied, a linear regression model can still be built, but it is satisfying them that gives us confidence in the model’s predictions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XMQln4Ie--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/fit/c/96/96/1%2AOou0KMNLO-BurnkyhAa1yA.jpeg" alt="Sparsh Gupta"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;What makes Logistic Regression a Classification Algorithm? | by Sparsh Gupta | Jul, 2020 | Towards Data Science&lt;/h2&gt;
      &lt;h3&gt;Sparsh Gupta ・ &lt;time&gt;Jul 3, 2020&lt;/time&gt; ・ 6 min read&lt;/h3&gt;
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KBvj_QRD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/medium_icon-90d5232a5da2369849f285fa499c8005e750a788fdbf34f5844d5f2201aae736.svg" alt="Medium Logo"&gt;
        towardsdatascience.com
      &lt;/div&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Insightful Loan Default Analysis</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Fri, 10 Jul 2020 15:24:03 +0000</pubDate>
      <link>https://forem.com/imsparsh/insightful-loan-default-analysis-3ojg</link>
      <guid>https://forem.com/imsparsh/insightful-loan-default-analysis-3ojg</guid>
      <description>&lt;h4&gt;
  
  
  Visualize Insights and Discover Driving Features in Lending Credit Risk Model for Loan Defaults
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwtqouj2p5a5yjc75qbr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwtqouj2p5a5yjc75qbr.jpeg" alt="(Image by Author)" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lendingclub.com/" rel="noopener noreferrer"&gt;Lending Club&lt;/a&gt; is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface.&lt;/p&gt;

&lt;p&gt;As with most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). The credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders.&lt;/p&gt;

&lt;p&gt;Therefore, using &lt;strong&gt;Data Science&lt;/strong&gt;, &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt; and public data from &lt;strong&gt;Lending Club&lt;/strong&gt;, we will explore and crunch out the driving factors that exist behind loan default, i.e. the variables which are strong indicators of default. The company can then utilise this knowledge for its portfolio and risk assessment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11232%2F0%2Ar0HYsDqb1VdNPV0A" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11232%2F0%2Ar0HYsDqb1VdNPV0A" alt="Photo by [Shane](https://unsplash.com/@theyshane?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="760" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;About Lending Club Loan Dataset&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The dataset contains complete loan data for all loans issued from &lt;strong&gt;2007 through 2011&lt;/strong&gt;, including the current loan status (&lt;strong&gt;Current, Charged-off, Fully Paid&lt;/strong&gt;) and the latest payment information. Additional features include credit scores, number of finance inquiries, and collections, among others. The file is a matrix of about 39 thousand observations and 111 variables. A &lt;strong&gt;Data Dictionary&lt;/strong&gt; is provided in a separate file in the dataset. The dataset can be downloaded here on &lt;a href="https://www.kaggle.com/imsparsh/lending-club-loan-dataset-2007-2011" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What set of loan data are we working with?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What types of &lt;strong&gt;features&lt;/strong&gt; do we have?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do we need to treat &lt;strong&gt;missing values&lt;/strong&gt;?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the distribution of Loan Status?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the distribution of Loan Default with other features?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Which plots can we draw for &lt;strong&gt;inferring the relation&lt;/strong&gt; with Loan Default?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most importantly, what are the &lt;strong&gt;driving features&lt;/strong&gt; that describe Loan Default?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Distribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Loan Characteristics&lt;/strong&gt; such as &lt;strong&gt;loan amount, term, purpose&lt;/strong&gt;, which describe the loan itself and will help us in finding loan default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Demographic Variables&lt;/strong&gt; such as &lt;strong&gt;age, employment status, relationship status&lt;/strong&gt;, which describe the borrower profile and are not useful for us.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Behavioural Variables&lt;/strong&gt; such as &lt;strong&gt;next payment date, EMI, delinquency&lt;/strong&gt;, which contain information updated only after the loan is granted and are therefore not useful in our case, since default analysis must support the decision of whether to approve the loan in the first place.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Here is a quick overview of things we are going to see in this article:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset Overview&lt;/strong&gt; (Distribution of Loans)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt; (Missing Values, Standardize Data, Outlier Treatment)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Derivation&lt;/strong&gt; (Binning)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate Analysis&lt;/strong&gt; (Categorical/Continuous Features)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bivariate Analysis&lt;/strong&gt; (Box Plots, Scatter Plots, Violin Plots)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate Analysis&lt;/strong&gt; (Correlation Heatmap)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data/Library Imports
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import required libraries
import numpy as np
print('numpy version:',np.__version__)
import pandas as pd
print('pandas version:',pd.__version__)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)
pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 400)

# file path variable
case_data = "/kaggle/input/lending-club-loan-dataset-2007-2011/loan.csv"
loan = pd.read_csv(case_data, low_memory=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Data set has 111 columns and 39717 rows&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Dataset Overview
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plotting pie chart for different types of loan_status
chargedOffLoans = loan.loc[(loan["loan_status"] == "Charged Off")]
currentLoans = loan.loc[(loan["loan_status"] == "Current")]
fullyPaidLoans = loan.loc[(loan["loan_status"]== "Fully Paid")]

data  = [{"Charged Off": chargedOffLoans["funded_amnt_inv"].sum(), "Fully Paid":fullyPaidLoans["funded_amnt_inv"].sum(), "Current":currentLoans["funded_amnt_inv"].sum()}]

investment_sum = pd.DataFrame(data) 
chargedOffTotalSum = float(investment_sum["Charged Off"])
fullyPaidTotalSum = float(investment_sum["Fully Paid"])
currentTotalSum = float(investment_sum["Current"])
loan_status = [chargedOffTotalSum,fullyPaidTotalSum,currentTotalSum]
loan_status_labels = 'Charged Off','Fully Paid','Current'
plt.pie(loan_status,labels=loan_status_labels,autopct='%1.1f%%')
plt.title('Loan Status Aggregate Information')
plt.axis('equal')
plt.legend(loan_status,title="Loan Amount",loc="center left",bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoDr0uBiNWcNahkZIM5_o2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoDr0uBiNWcNahkZIM5_o2g.png" alt="(Image by Author)" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plotting pie chart for different types of purpose
loans_purpose = loan.groupby(['purpose'])['funded_amnt_inv'].sum().reset_index()

plt.figure(figsize=(14, 10))
plt.pie(loans_purpose["funded_amnt_inv"],labels=loans_purpose["purpose"],autopct='%1.1f%%')

plt.title('Loan purpose Aggregate Information')
plt.axis('equal')
plt.legend(loan_status,title="Loan purpose",loc="center left",bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlLugFGIaa4vdy4Hj-xqTlQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlLugFGIaa4vdy4Hj-xqTlQ.png" alt="(Image by Author)" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# in dataset, we can see around half of the columns are null
# completely, hence remove all columns having no values
loan = loan.dropna(axis=1, how="all")
print("Looking into remaining columns info:")
print(loan.info(max_cols=200))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are left with the following columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Looking into remaining columns info:
&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 39717 entries, 0 to 39716
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          39717 non-null  int64  
 1   member_id                   39717 non-null  int64  
 2   loan_amnt                   39717 non-null  int64  
 3   funded_amnt                 39717 non-null  int64  
 4   funded_amnt_inv             39717 non-null  float64
 5   term                        39717 non-null  object 
 6   int_rate                    39717 non-null  object 
 7   installment                 39717 non-null  float64
 8   grade                       39717 non-null  object 
 9   sub_grade                   39717 non-null  object 
 10  emp_title                   37258 non-null  object 
 11  emp_length                  38642 non-null  object 
 12  home_ownership              39717 non-null  object 
 13  annual_inc                  39717 non-null  float64
 14  verification_status         39717 non-null  object 
 15  issue_d                     39717 non-null  object 
 16  loan_status                 39717 non-null  object 
 17  pymnt_plan                  39717 non-null  object 
 18  url                         39717 non-null  object 
 19  desc                        26777 non-null  object 
 20  purpose                     39717 non-null  object 
 21  title                       39706 non-null  object 
 22  zip_code                    39717 non-null  object 
 23  addr_state                  39717 non-null  object 
 24  dti                         39717 non-null  float64
 25  delinq_2yrs                 39717 non-null  int64  
 26  earliest_cr_line            39717 non-null  object 
 27  inq_last_6mths              39717 non-null  int64  
 28  mths_since_last_delinq      14035 non-null  float64
 29  mths_since_last_record      2786 non-null   float64
 30  open_acc                    39717 non-null  int64  
 31  pub_rec                     39717 non-null  int64  
 32  revol_bal                   39717 non-null  int64  
 33  revol_util                  39667 non-null  object 
 34  total_acc                   39717 non-null  int64  
 35  initial_list_status         39717 non-null  object 
 36  out_prncp                   39717 non-null  float64
 37  out_prncp_inv               39717 non-null  float64
 38  total_pymnt                 39717 non-null  float64
 39  total_pymnt_inv             39717 non-null  float64
 40  total_rec_prncp             39717 non-null  float64
 41  total_rec_int               39717 non-null  float64
 42  total_rec_late_fee          39717 non-null  float64
 43  recoveries                  39717 non-null  float64
 44  collection_recovery_fee     39717 non-null  float64
 45  last_pymnt_d                39646 non-null  object 
 46  last_pymnt_amnt             39717 non-null  float64
 47  next_pymnt_d                1140 non-null   object 
 48  last_credit_pull_d          39715 non-null  object 
 49  collections_12_mths_ex_med  39661 non-null  float64
 50  policy_code                 39717 non-null  int64  
 51  application_type            39717 non-null  object 
 52  acc_now_delinq              39717 non-null  int64  
 53  chargeoff_within_12_mths    39661 non-null  float64
 54  delinq_amnt                 39717 non-null  int64  
 55  pub_rec_bankruptcies        39020 non-null  float64
 56  tax_liens                   39678 non-null  float64
dtypes: float64(20), int64(13), object(24)
memory usage: 17.3+ MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we will remove all the &lt;strong&gt;Demographic and Customer Behavioural&lt;/strong&gt; features, which are of no use for default analysis at the time of credit approval.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# remove non-required columns
# id - not required
# member_id - not required
# acc_now_delinq - empty
# funded_amnt - not useful, funded_amnt_inv is useful which is funded to person
# emp_title - brand names not useful
# pymnt_plan - fixed value as n for all
# url - not useful
# desc - can be applied some NLP but not for EDA
# title - too many distinct values not useful
# zip_code - complete zip is not available
# delinq_2yrs - post approval feature
# mths_since_last_delinq - only half values are there, not much information
# mths_since_last_record - only 10% values are there
# revol_bal - post/behavioural feature
# initial_list_status - fixed value as f for all
# out_prncp - post approval feature
# out_prncp_inv - not useful as its for investors
# total_pymnt - post approval feature
# total_pymnt_inv - not useful as it is for investors
# total_rec_prncp - post approval feature
# total_rec_int - post approval feature
# total_rec_late_fee - post approval feature
# recoveries - post approval feature
# collection_recovery_fee - post approval feature
# last_pymnt_d - post approval feature
# last_credit_pull_d - irrelevant for approval
# last_pymnt_amnt - post feature
# next_pymnt_d - post feature
# collections_12_mths_ex_med - only 1 value 
# policy_code - only 1 value
# acc_now_delinq - single valued
# chargeoff_within_12_mths - post feature
# delinq_amnt - single valued
# tax_liens - single valued
# application_type - single
# pub_rec_bankruptcies - single valued for more than 99%
# addr_state - may not depend on location as its in financial domain

colsToDrop = ["id", "member_id", "funded_amnt", "emp_title", "pymnt_plan", "url", "desc", "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record", "revol_bal", "initial_list_status", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt", "next_pymnt_d", "last_credit_pull_d", "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "chargeoff_within_12_mths", "delinq_amnt", "tax_liens", "application_type", "pub_rec_bankruptcies", "addr_state"]
loan.drop(colsToDrop, axis=1, inplace=True)
print("Features we are left with",list(loan.columns))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;We are left with &lt;em&gt;[‘loan_amnt’, ‘funded_amnt_inv’, ‘term’, ‘int_rate’, ‘installment’, ‘grade’, ‘sub_grade’, ‘emp_length’, ‘home_ownership’, ‘annual_inc’, ‘verification_status’, ‘issue_d’, ‘loan_status’, ‘purpose’, ‘dti’, ‘earliest_cr_line’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘revol_util’, ‘total_acc’]&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, dealing with &lt;strong&gt;missing values&lt;/strong&gt; by removing/imputing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# in 12 unique values we have 10+ years the most for emp_length, 
# but it is highly dependent variable so we will not impute
# but remove the rows with null values which is around 2.5%

loan.dropna(axis=0, subset=["emp_length"], inplace=True)

# remove NA rows for revol_util as its dependent and is around 0.1%

loan.dropna(axis=0, subset=["revol_util"], inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we standardize some feature columns to make data compatible for analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# update int_rate, revol_util without % sign and as numeric type

loan["int_rate"] = pd.to_numeric(loan["int_rate"].apply(lambda x:x.split('%')[0]))

loan["revol_util"] = pd.to_numeric(loan["revol_util"].apply(lambda x:x.split('%')[0]))

# remove text data from term feature and store as numerical

loan["term"] = pd.to_numeric(loan["term"].apply(lambda x:x.split()[0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Removing records with loan status as “Current”, as the loan is currently running and we can’t infer any information regarding default from such loans.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# remove the rows with loan_status as "Current"
loan = loan[loan["loan_status"].apply(lambda x:False if x == "Current" else True)]


# update loan_status as Fully Paid to 0 and Charged Off to 1
loan["loan_status"] = loan["loan_status"].apply(lambda x: 0 if x == "Fully Paid" else 1)

# update emp_length feature with continuous values as int
# where (&amp;lt; 1 year) is assumed as 0 and 10+ years is assumed as 10 and rest are stored as their magnitude

loan["emp_length"] = pd.to_numeric(loan["emp_length"].apply(lambda x:0 if "&amp;lt;" in x else (x.split('+')[0] if "+" in x else x.split()[0])))

# look through the purpose value counts
loan_purpose_values = loan["purpose"].value_counts()*100/loan.shape[0]

# remove rows with less than 1% of value counts in paricular purpose 
loan_purpose_delete = loan_purpose_values[loan_purpose_values&amp;lt;1].index.values
loan = loan[[False if p in loan_purpose_delete else True for p in loan["purpose"]]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Outlier Treatment
&lt;/h2&gt;

&lt;p&gt;Looking at the quantile values of each feature, we will treat outliers for some of the features.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# for annual_inc, the highest value is 6000000 where 75% quantile value is 83000, and is 100 times the mean
# we need to remove outliers from annual_inc i.e. 99 to 100%
annual_inc_q = loan["annual_inc"].quantile(0.99)
loan = loan[loan["annual_inc"] &amp;lt; annual_inc_q]

# for open_acc, the highest value is 44 where 75% quantile value is 12, and is 5 times the mean
# we need to remove outliers from open_acc i.e. 99.9 to 100%
open_acc_q = loan["open_acc"].quantile(0.999)
loan = loan[loan["open_acc"] &amp;lt; open_acc_q]

# for total_acc, the highest value is 90 where 75% quantile value is 29, and is 4 times the mean
# we need to remove outliers from total_acc i.e. 98 to 100%
total_acc_q = loan["total_acc"].quantile(0.98)
loan = loan[loan["total_acc"] &amp;lt; total_acc_q]

# for pub_rec, the highest value is 4 where 75% quantile value is 0, and is 4 times the mean
# we need to remove outliers from pub_rec i.e. 99.5 to 100%
pub_rec_q = loan["pub_rec"].quantile(0.995)
loan = loan[loan["pub_rec"] &amp;lt;= pub_rec_q]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now this is how our data looks after cleaning and standardizing the features:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2414%2F1%2AM9yad6u8f03TYCpKwSvf-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2414%2F1%2AM9yad6u8f03TYCpKwSvf-Q.png" alt="(Image by Author)" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;
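&lt;p&gt;A quick way to verify the cleaned frame (a minimal check, not part of the original output shown above) is to look at its shape and column types:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sanity-check the cleaned dataframe
print(loan.shape)    # rows remaining after dropping nulls and outliers
print(loan.dtypes)   # confirm int_rate, revol_util and term are now numeric
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;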

&lt;h2&gt;
  
  
  Metrics Derivation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The issue date is not in a standard format; we also split it into separate month and year columns, which makes the analysis easier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parsing with datetime expects a two-digit year (00 to 99), but in some records the year is a single digit (e.g. 9), so we write a small function that pads such dates to avoid exceptions during conversion, as shown in the snippet below.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pad single-digit years so that the datetime parsing does not fail
def standerdisedate(date):
    year = date.split("-")[0]
    if(len(year) == 1):
        date = "0"+date
    return date

from datetime import datetime
loan['issue_d'] = loan['issue_d'].apply(lambda x:standerdisedate(x))
loan['issue_d'] = loan['issue_d'].apply(lambda x: datetime.strptime(x, '%b-%y'))

# extracting month and year from issue_d
loan['month'] = loan['issue_d'].apply(lambda x: x.month)
loan['year'] = loan['issue_d'].apply(lambda x: x.year)

# get the year from earliest_cr_line and replace the column with it
loan["earliest_cr_line"] = pd.to_numeric(loan["earliest_cr_line"].apply(lambda x:x.split('-')[1]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Binning Continuous features:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create bins for loan_amnt range
bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
bucket_l = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000','25000+']
loan['loan_amnt_range'] = pd.cut(loan['loan_amnt'], bins, labels=bucket_l)

# create bins for int_rate range
bins = [0, 7.5, 10, 12.5, 15, 100]
bucket_l = ['0-7.5', '7.5-10', '10-12.5', '12.5-15', '15+']
loan['int_rate_range'] = pd.cut(loan['int_rate'], bins, labels=bucket_l)

# create bins for annual_inc range
bins = [0, 25000, 50000, 75000, 100000, 1000000]
bucket_l = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
loan['annual_inc_range'] = pd.cut(loan['annual_inc'], bins, labels=bucket_l)

# create bins for installment range
def installment(n):
    if n &amp;lt;= 200:
        return 'low'
    elif n &amp;gt; 200 and n &amp;lt;=500:
        return 'medium'
    elif n &amp;gt; 500 and n &amp;lt;=800:
        return 'high'
    else:
        return 'very high'

loan['installment'] = loan['installment'].apply(lambda x: installment(x))

# create bins for dti range
bins = [-1, 5.00, 10.00, 15.00, 20.00, 25.00, 50.00]
bucket_l = ['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25%+']
loan['dti_range'] = pd.cut(loan['dti'], bins, labels=bucket_l)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The following bins are created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ALYqZZCW6yCmaKMkRE5EbfA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ALYqZZCW6yCmaKMkRE5EbfA.png" alt="(Image by Author)" width="797" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
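&lt;p&gt;To confirm how the records are spread across the new buckets (a small optional check on the bin columns created above), a value_counts per column is enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# inspect how many loans fall into each bucket
for col in ['loan_amnt_range', 'int_rate_range', 'annual_inc_range', 'dti_range', 'installment']:
    print(loan[col].value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;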

&lt;h2&gt;
  
  
  Visualising Data Insights
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for amount of defaults in the data using countplot
plt.figure(figsize=(14,5))
sns.countplot(y="loan_status", data=loan)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATW68dOdVWpzIHRPGkdOX5Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATW68dOdVWpzIHRPGkdOX5Q.png" alt="(Image by Author)" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot we can see that around 16%, i.e. 5062 out of 35152 records, are defaulters.&lt;/p&gt;
&lt;/blockquote&gt;
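&lt;p&gt;The same figures can also be computed directly from the data instead of being read off the plot (a small sketch; the counts quoted above come from the author’s run):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# default count and percentage, since loan_status is already encoded as 0/1
defaults = loan["loan_status"].sum()
total = loan.shape[0]
print(defaults, total, round(defaults * 100 / total, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;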

&lt;h3&gt;
  
  
  Univariate Analysis
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function for plotting the count plot features wrt default ratio
def plotUnivariateRatioBar(feature, data=loan, figsize=(10,5), rsorted=True):
    plt.figure(figsize=figsize)
    if rsorted:
        feature_dimension = sorted(data[feature].unique())
    else:
        feature_dimension = data[feature].unique()
    feature_values = []
    for fd in feature_dimension:
        feature_filter = data[data[feature]==fd]
        feature_count = len(feature_filter[feature_filter["loan_status"]==1])
        feature_values.append(feature_count*100/feature_filter["loan_status"].count())
    plt.bar(feature_dimension, feature_values, color='orange', edgecolor='white')
    plt.title("Loan Defaults wrt "+str(feature)+" feature - countplot")
    plt.xlabel(feature, fontsize=16)
    plt.ylabel("defaulter %", fontsize=16)
    plt.show()

# function to plot univariate with default status scale 0 - 1
def plotUnivariateBar(x, figsize=(10,5)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', data=loan)
    plt.title("Loan Defaults wrt "+str(x)+" feature - countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;a. Categorical Features&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt term in the data using countplot
plotUnivariateBar("term", figsize=(8,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AuK1lpSTHtBZxd6GrBtHuNg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AuK1lpSTHtBZxd6GrBtHuNg.png" alt="(Image by Author)" width="513" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘term’ we can infer that the default rate increases with term, hence the chances of a loan getting defaulted are lower for 36 months than for 60 months.&lt;br&gt;
&lt;strong&gt;is term beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt grade in the data using countplot
plotUnivariateRatioBar("grade")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9z50C6UrHj94vlit9MkQbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9z50C6UrHj94vlit9MkQbQ.png" alt="(Image by Author)" width="614" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘grade’ we can infer that the default rate increases with grade, hence the chances of a loan getting defaulted increase as the grade moves from A towards G.&lt;br&gt;
&lt;strong&gt;is grade beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt sub_grade in the data using countplot
plotUnivariateBar("sub_grade", figsize=(16,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AmJU8rDj899qdgLhYs4QciQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AmJU8rDj899qdgLhYs4QciQ.png" alt="(Image by Author)" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘sub_grade’ we can infer that the default rate increases with sub_grade, hence the chances of a loan getting defaulted increase as the sub_grade moves from A1 towards G5.&lt;br&gt;
&lt;strong&gt;is sub_grade beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt home_ownership in the data 
plotUnivariateRatioBar("home_ownership")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoPEtddeaAlMCePsuUOie-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoPEtddeaAlMCePsuUOie-Q.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘home_ownership’ we can infer that the default rate is roughly constant (it is somewhat higher for OTHER, but since we don’t know what that category contains, we won’t consider it for the analysis), hence default does not depend on home_ownership.&lt;br&gt;
&lt;strong&gt;is home_ownership beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt verification_status in the data
plotUnivariateRatioBar("verification_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AbN8Np1G9mY5GhICjZeYAnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AbN8Np1G9mY5GhICjZeYAnw.png" alt="(Image by Author)" width="614" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘verification_status’ we can infer that the default rate is lower for Not Verified users than for Verified ones, but this is not useful for the analysis.&lt;br&gt;
&lt;strong&gt;is verification_status beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt purpose in the data using countplot
plotUnivariateBar("purpose", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0CNZVWG6Y6eeP0zvEb9E6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0CNZVWG6Y6eeP0zvEb9E6A.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘purpose’ we can infer that the default rate is nearly constant for all purpose types except ‘small business’, hence the rate depends on the purpose of the loan.&lt;br&gt;
&lt;strong&gt;is purpose beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt open_acc in the data using countplot
plotUnivariateRatioBar("open_acc", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-p8jXygS8l5TIJ6r_VfD9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-p8jXygS8l5TIJ6r_VfD9Q.png" alt="(Image by Author)" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘open_acc’ we can infer that the default rate is nearly constant across open_acc values, hence the rate does not depend on the open_acc feature.&lt;br&gt;
&lt;strong&gt;is open_acc beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt pub_rec in the data using countplot
plotUnivariateRatioBar("pub_rec")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4CzY6Ijk7ZppfbQp36_2Zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4CzY6Ijk7ZppfbQp36_2Zg.png" alt="(Image by Author)" width="614" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘pub_rec’ we can infer that the default rate is somewhat increasing, being lower for 0 and higher for pub_rec value 1; but since the non-zero values are very rare compared to 0, we won’t consider this feature.&lt;br&gt;
&lt;strong&gt;is pub_rec beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;b. Continuous Features&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt emp_length in the data using countplot
plotUnivariateBar("emp_length", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5G4B5xJEs5Ex_eXgqQXXpA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5G4B5xJEs5Ex_eXgqQXXpA.png" alt="(Image by Author)" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘emp_length’ we can infer that the default rate is roughly constant, hence default does not depend on emp_length.&lt;br&gt;
&lt;strong&gt;is emp_length beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt month in the data using countplot
plotUnivariateBar("month", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABUSR3e7wwe08zBUy_zCdOw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABUSR3e7wwe08zBUy_zCdOw.png" alt="(Image by Author)" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘month’ we can infer that the default rate is nearly constant, so this feature is not useful.&lt;br&gt;
&lt;strong&gt;is month beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt year in the data using countplot
plotUnivariateBar("year")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArR-YGz2i2DlJ9ohEIaIlCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArR-YGz2i2DlJ9ohEIaIlCQ.png" alt="(Image by Author)" width="625" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘year’ we can infer that the default rate is nearly constant, so this feature is not useful.&lt;br&gt;
&lt;strong&gt;is year beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt earliest_cr_line in the data
plotUnivariateBar("earliest_cr_line", figsize=(16,10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApoJ_DLqP2J54jNEKdt_pPw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApoJ_DLqP2J54jNEKdt_pPw.png" alt="(Image by Author)" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘earliest_cr_line’ we can infer that the default rate is nearly constant across years except around ’65, hence the rate does not depend on the person’s earliest_cr_line.&lt;br&gt;
&lt;strong&gt;is earliest_cr_line beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt inq_last_6mths in the data
plotUnivariateBar("inq_last_6mths")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApcjCBtvhxDY8usHTjcKmhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApcjCBtvhxDY8usHTjcKmhg.png" alt="(Image by Author)" width="618" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘inq_last_6mths’ we can infer that the default rate does not increase consistently with inq_last_6mths, hence this feature is not useful.&lt;br&gt;
&lt;strong&gt;is inq_last_6mths beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt revol_util in the data using countplot
plotUnivariateRatioBar("revol_util", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AQC-L8uGtP82P4JkiJpIWgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AQC-L8uGtP82P4JkiJpIWgg.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘revol_util’ we can infer that the default rate fluctuates, with some values showing a full 100% default ratio, and that it generally increases with the magnitude, hence the rate depends on the revol_util feature.&lt;br&gt;
&lt;strong&gt;is revol_util beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt total_acc in the data using countplot
plotUnivariateRatioBar("total_acc", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFUKyC8bBeYDKbLywvcmmiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFUKyC8bBeYDKbLywvcmmiA.png" alt="(Image by Author)" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘total_acc’ we can infer that the default rate is nearly constant for all total_acc values, hence the rate does not depend on the total_acc feature.&lt;br&gt;
&lt;strong&gt;is total_acc beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range in the data using countplot
plotUnivariateBar("loan_amnt_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMuWOXJ17ZIZqYFIb0FoT4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMuWOXJ17ZIZqYFIb0FoT4w.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘loan_amnt_range’ we can infer that the default rate increases with loan_amnt_range values, hence the rate depends on the loan_amnt_range feature.&lt;br&gt;
&lt;strong&gt;is loan_amnt_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt int_rate_range in the data
plotUnivariateBar("int_rate_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIYPAh_lOZbh2iouXfcJ8FA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIYPAh_lOZbh2iouXfcJ8FA.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘int_rate_range’ we can infer that the default rate is decreasing with int_rate_range values, hence the rate depends on the int_rate_range feature.&lt;br&gt;
&lt;strong&gt;is int_rate_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc_range in the data
plotUnivariateBar("annual_inc_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUaNIPD-CvoG5KKqOp7ywyA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUaNIPD-CvoG5KKqOp7ywyA.png" alt="(Image by Author)" width="632" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘annual_inc_range’ we can infer that the default rate decreases with annual_inc_range values, hence the rate depends on the annual_inc_range feature.&lt;br&gt;
&lt;strong&gt;is annual_inc_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt dti_range in the data using countplot
plotUnivariateBar("dti_range", figsize=(16,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqB8Tk84liQ4x_YsscpiBhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqB8Tk84liQ4x_YsscpiBhw.png" alt="(Image by Author)" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘dti_range’ we can infer that the default rate increases with dti_range values, hence the rate depends on the dti_range feature.&lt;br&gt;
&lt;strong&gt;is dti_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt installment range in the data
plotUnivariateBar("installment", figsize=(8,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUrIIOXa7ImOQqvGFFQNzBw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUrIIOXa7ImOQqvGFFQNzBw.png" alt="(Image by Author)" width="520" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘installment’ we can infer that the default rate increases with installment values, hence the rate depends on the installment feature.&lt;br&gt;
&lt;strong&gt;is installment beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Therefore, the following are the important features we deduced from the above univariate analysis:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, dti, installment&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bivariate Analysis
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function to plot scatter plot for two features
def plotScatter(x, y):
    plt.figure(figsize=(16,6))
    sns.scatterplot(x=x, y=y, hue="loan_status", data=loan)
    plt.title("Scatter plot between "+x+" and "+y)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()

def plotBivariateBar(x, hue, figsize=(16,6)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', hue=hue, data=loan)
    plt.title("Loan Default ratio wrt "+x+" feature for hue "+hue+" in the data using countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plotting pairs of features with respect to the loan default ratio using bar plots and scatter plots.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc and purpose in the data using countplot
plotBivariateBar("annual_inc_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGLZMhgWep9chvY7TvF787A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGLZMhgWep9chvY7TvF787A.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot, we can infer that it doesn’t show any correlation.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt term and purpose in the data 
plotBivariateBar("term", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqAms9crOazFvEQTP7tYYAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqAms9crOazFvEQTP7tYYAA.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt term.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt grade and purpose in the data 
plotBivariateBar("grade", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A739tg_5vL-ESU2XHDFxL7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A739tg_5vL-ESU2XHDFxL7w.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt grade.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range and purpose in the data
plotBivariateBar("loan_amnt_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AapiMX_RuUx0Rhw6egOBNVg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AapiMX_RuUx0Rhw6egOBNVg.png" alt="(Image by Author)" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt loan_amnt_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range and term in the data
plotBivariateBar("loan_amnt_range", "term")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACpAeRCM7ioXtbrozUqgW9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACpAeRCM7ioXtbrozUqgW9g.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every term wrt loan_amnt_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc_range and purpose in the data
plotBivariateBar("annual_inc_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AYBPNPOuP8Sob9pfT-E5Qeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AYBPNPOuP8Sob9pfT-E5Qeg.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt annual_inc_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt installment and purpose in the data
plotBivariateBar("installment", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKzawDl8vXodiSYj1_xmROQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKzawDl8vXodiSYj1_xmROQ.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt installment, except for small_business.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" alt="(Image by Author)" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for funded_amnt_inv with dti
plotScatter("funded_amnt_inv", "dti")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aqj9I9HUyqTbMSMkgP9nDTQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aqj9I9HUyqTbMSMkgP9nDTQ.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for funded_amnt_inv with annual_inc
plotScatter("annual_inc", "funded_amnt_inv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Agkqu1SdnS70kyurViDVCzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Agkqu1SdnS70kyurViDVCzg.png" alt="(Image by Author)" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a sloped pattern on the plot, there is a positive relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for loan_amnt with int_rate
plotScatter("loan_amnt", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACDGz5t6IKJs3PN3jhAp4ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACDGz5t6IKJs3PN3jhAp4ig.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" alt="(Image by Author)" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a negative correlation pattern with reduced density on the plot, there is some relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for earliest_cr_line with int_rate
plotScatter("earliest_cr_line", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKQ-MFXf0NnrNXHg2RJRAyQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKQ-MFXf0NnrNXHg2RJRAyQ.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a positive correlation pattern with increasing density on the plot, there is a correlation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for annual_inc with emp_length
plotScatter("annual_inc", "emp_length")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzVCmy-ZlKvcLaqzIq5WT3Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzVCmy-ZlKvcLaqzIq5WT3Q.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for earliest_cr_line with dti
plotScatter("earliest_cr_line", "dti")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH48tltYADFRY-Vesan0ahw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH48tltYADFRY-Vesan0ahw.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plotting pairs of features with respect to the loan default status using box plots and violin plots.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function to plot boxplot for comparing two features
def plotBox(x, y, hue="loan_status"):
    plt.figure(figsize=(16,6))
    sns.boxplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Box plot between "+x+" and "+y+" for each "+hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()
    plt.figure(figsize=(16,8))
    sns.violinplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Violin plot between "+x+" and "+y+" for each "+hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()

# plot box for term vs int_rate for each loan_status
plotBox("term", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ADLE3T9btaqwI6S1_KxzMkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ADLE3T9btaqwI6S1_KxzMkw.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AAUD_Vy97bK8rXTn50XczPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AAUD_Vy97bK8rXTn50XczPA.png" alt="(Image by Author)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate increases with the loan term, and the chances of default also increase&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for loan_status vs int_rate for each purpose
plotBox("loan_status", "int_rate", hue="purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFuG__UlU9QGgUvDfgwRhYQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFuG__UlU9QGgUvDfgwRhYQ.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AZ2kMugNfcHx1cl3R9E-pqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AZ2kMugNfcHx1cl3R9E-pqw.png" alt="(Image by Author)" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate is noticeably higher where the loan is defaulted, for every purpose value&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for purpose vs revo_util for each status
plotBox("purpose", "revol_util")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Anp0nayXZmReLF5xAXWFonw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Anp0nayXZmReLF5xAXWFonw.png" alt="(Image by Author)" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACHgaGWi5iEp_ibrcc6ynkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACHgaGWi5iEp_ibrcc6ynkQ.png" alt="(Image by Author)" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;revol_util is higher for every purpose value where the loan is defaulted, and is quite high for credit_card&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for grade vs int_rate for each loan_status
plotBox("grade", "int_rate", "loan_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AP60vD-fSPsXU2jBbg5IdpA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AP60vD-fSPsXU2jBbg5IdpA.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AtxlvMupNHkXH2FA-upxoQw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AtxlvMupNHkXH2FA-upxoQw.png" alt="(Image by Author)" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate increases with every grade, and for every grade the defaulters’ median lies near the non-defaulters’ 75% quantile of int_rate&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for issue_d vs int_rate for each loan_status
plotBox("month", "int_rate", "loan_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A6WSUa5MRmG40EJbfFYogHg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A6WSUa5MRmG40EJbfFYogHg.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Au5iqMyGYQndTSQxEcF_Ztg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Au5iqMyGYQndTSQxEcF_Ztg.png" alt="(Image by Author)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate for defaulters is higher in every month, with the defaulters’ median near the non-defaulters’ 75% quantile of int_rate; but since it is almost constant across months, this is not useful&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Therefore, the following are the important features we deduced from the above bivariate analysis:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, installment&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multivariate Analysis (Correlation)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot heat map to see correlation between features
continuous_f = ["funded_amnt_inv", "annual_inc", "term", "int_rate", "loan_status", "revol_util", "pub_rec", "earliest_cr_line"]
loan_corr = loan[continuous_f].corr()
sns.heatmap(loan_corr,vmin=-1.0,vmax=1.0,annot=True, cmap="YlGnBu")
plt.title("Correlation Heatmap")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AdN-lNZl4NBKIcx-67PDt1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AdN-lNZl4NBKIcx-67PDt1Q.png" alt="(Image by Author)" width="753" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hence, the important related features from the above &lt;strong&gt;Multivariate analysis&lt;/strong&gt; are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv&lt;/strong&gt;&lt;/p&gt;
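&lt;p&gt;To rank the numeric features by how strongly they correlate with default (a small addition on top of the heatmap, reusing the &lt;em&gt;loan_corr&lt;/em&gt; matrix computed above), the loan_status column of the correlation matrix can be sorted directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rank features by absolute correlation with loan_status
status_corr = loan_corr["loan_status"].drop("loan_status")
print(status_corr.reindex(status_corr.abs().sort_values(ascending=False).index))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;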

&lt;h2&gt;
  
  
  Final Findings
&lt;/h2&gt;

&lt;p&gt;After analysing all the related features available in the dataset, we conclude by deducing the main &lt;em&gt;driving features&lt;/em&gt; for the &lt;strong&gt;Lending Club Loan Default&lt;/strong&gt; analysis:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The best driving features for the Loan default analysis are:&lt;/em&gt; &lt;strong&gt;term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv&lt;/strong&gt;&lt;/p&gt;
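&lt;p&gt;If you were to take this forward into modelling (which is outside the scope of this analysis), one possible next step would be to collect the driving features into a single frame, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical next step: assemble the driving features for a future model
driver_cols = ["term", "grade", "purpose", "revol_util", "int_rate",
               "installment", "annual_inc", "funded_amnt_inv", "loan_status"]
model_df = loan[driver_cols].copy()
print(model_df.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;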

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Classify any Object using pre-trained CNN Model</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Fri, 10 Jul 2020 14:55:47 +0000</pubDate>
      <link>https://forem.com/imsparsh/classify-any-object-using-pre-trained-cnn-model-1pbm</link>
      <guid>https://forem.com/imsparsh/classify-any-object-using-pre-trained-cnn-model-1pbm</guid>
      <description>&lt;h4&gt;
  
  
  Large Scale Image Classification using pre-trained Inception v3 Convolution Neural Network Model
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10000%2F0%2AhPVHCGVgepguJt8I" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10000%2F0%2AhPVHCGVgepguJt8I" alt="Photo by [Lenin Estrada](https://unsplash.com/@lenin33?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="760" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today we have a super-effective technique called &lt;strong&gt;Transfer Learning&lt;/strong&gt;, where we can use a pre-trained model by &lt;strong&gt;Google AI&lt;/strong&gt; to classify images of a wide range of visual object classes in computer vision.&lt;/p&gt;

&lt;p&gt;Transfer learning is a machine learning method which utilizes a pre-trained neural network. Here, the &lt;em&gt;image recognition&lt;/em&gt; model called Inception-v3 consists of two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature extraction&lt;/strong&gt; part with a convolutional neural network.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification&lt;/strong&gt; part with fully-connected and softmax layers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Inception-v3 is a pre-trained convolutional neural network model that is 48 layers deep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is a version of the network already trained on more than a million images from the &lt;a href="http://www.image-net.org" rel="noopener noreferrer"&gt;&lt;strong&gt;ImageNet&lt;/strong&gt;&lt;/a&gt; database. It is the third edition of Google’s Inception CNN model, originally introduced during the &lt;strong&gt;ImageNet Recognition Challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This pre-trained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 299-by-299. The model extracts general features from input images in the first part and classifies them based on those features in the second part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ut9m6knpm7jrvkk01rq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ut9m6knpm7jrvkk01rq.png" alt="Schematic diagram of Inception v3 — By Google AI" width="800" height="311"&gt;&lt;/a&gt;&lt;em&gt;Schematic diagram of Inception v3 — By Google AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Inception v3&lt;/em&gt;&lt;/strong&gt; is a widely-used image recognition model that has been shown to attain greater than 78.1% top-1 accuracy and around 93.9% top-5 accuracy on the ImageNet dataset. The model is the culmination of many ideas introduced by multiple researchers over the past years. It is based on the original paper: “&lt;a href="https://arxiv.org/abs/1512.00567" rel="noopener noreferrer"&gt;Rethinking the Inception Architecture for Computer Vision&lt;/a&gt;” by Szegedy et al.&lt;/p&gt;

&lt;p&gt;More information about the Inception architecture can be found &lt;a href="https://github.com/tensorflow/models/tree/master/research/inception" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  In Transfer Learning, when you build a new model to classify your original dataset, you reuse the feature extraction part and re-train the classification part with your dataset. Since you don’t have to train the feature extraction part (which is the most complex part of the model), you can train the model with less computational resources and training time.
&lt;/h1&gt;
&lt;/blockquote&gt;
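
&lt;p&gt;To make the idea in the quote above concrete, here is a minimal, hypothetical sketch (separate from the prediction pipeline used in the rest of this article) of reusing the frozen Inception v3 feature extractor with a new classification head via the Keras API; the number of classes and the training data are placeholders you would replace with your own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Reuse the feature extraction part: Inception v3 with ImageNet weights, no top classifier
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the feature extraction part

# Re-train only a new classification head on your own dataset (5 classes here as a placeholder)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # train_images/train_labels are your own data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;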

&lt;p&gt;In this article, we will just use the Inception v3 model to predict some images and fetch the top 5 predicted classes for the same. Let’s begin.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We are using TensorFlow v2.x, with v1 compatibility mode enabled for the Slim-based model.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Import Data
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import numpy as np
from PIL import Image
from imageio import imread
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import tf_slim as slim
from tf_slim.nets import inception
import cv2
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data Loading
&lt;/h2&gt;

&lt;p&gt;Set up all the initial variables with the default file locations and their respective values.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ckpt_path = "/kaggle/input/inception_v3.ckpt"
images_path = "/kaggle/input/animals/*"
img_width = 299
img_height = 299
batch_size = 16
batch_shape = [batch_size, img_height, img_width, 3]
num_classes = 1001
predict_output = []
class_names_path = "/kaggle/input/imagenet_class_names.txt"
with open(class_names_path) as f:
    class_names = f.readlines()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Create Inception v3 model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = tf.placeholder(tf.float32, shape=batch_shape)

with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(
        X, num_classes=num_classes, is_training=False
    )

predictions = end_points["Predictions"]
saver = tf.train.Saver(slim.get_model_variables())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Define a function that loads images in RGB mode and resizes them to the input size expected by the model before sending them for evaluation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_images(input_dir):
    global batch_shape
    images = np.zeros(batch_shape)
    filenames = []
    idx = 0
    batch_size = batch_shape[0]
    files = tf.gfile.Glob(input_dir)[:20]
    files.sort()
    for filepath in files:
        with tf.gfile.Open(filepath, "rb") as f:
            # np.float is deprecated in recent NumPy releases; read as float32 scaled to [0, 1]
            imgRaw = np.array(Image.fromarray(imread(f, as_gray=False, pilmode="RGB")).resize((299, 299))).astype(np.float32) / 255.0
        # Inception v3 expects pixel values in the [-1, 1] range
        images[idx, :, :, :] = imgRaw * 2.0 - 1.0
        filenames.append(os.path.basename(filepath))
        idx += 1
        if idx == batch_size:
            yield filenames, images
            filenames = []
            images = np.zeros(batch_shape)
            idx = 0
    if idx &amp;gt; 0:
        yield filenames, images
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Load Pre-Trained Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;session_creator = tf.train.ChiefSessionCreator(
        scaffold=tf.train.Scaffold(saver=saver),
        checkpoint_filename_with_path=ckpt_path,
        master='')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Classify Images using Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with tf.train.MonitoredSession(session_creator=session_creator) as sess:
    for filenames, images in load_images(images_path):
        labels = sess.run(predictions, feed_dict={X: images})
        for filename, label, image in zip(filenames, labels, images):
            predict_output.append([filename, label, image])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Predictions
&lt;/h2&gt;

&lt;p&gt;We will use some images from the &lt;a href="https://www.kaggle.com/alessiocorrado99/animals10" rel="noopener noreferrer"&gt;Animals-10&lt;/a&gt; dataset from Kaggle to demonstrate the model predictions.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for x in predict_output:
    out_list = list(x[1])
    topPredict = sorted(range(len(out_list)), key=lambda i: out_list[i], reverse=True)[:5]
    plt.imshow((((x[2]+1)/2)*255).astype(int))
    plt.show()
    print("Filename:",x[0])
    print("Displaying the top 5 Predictions for above image:")
    for p in topPredict:
        print(class_names[p-1].strip())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpgtybdr2f0a4vhmjtfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpgtybdr2f0a4vhmjtfb.png" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjv1n4ncuknm2fpgnq9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjv1n4ncuknm2fpgnq9d.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_nw0jzhFN7sjB8BbFzpmRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_nw0jzhFN7sjB8BbFzpmRg.png" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFKO8wrDcUE_H2hlMzq2CPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFKO8wrDcUE_H2hlMzq2CPA.png" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Afik-gIv8cMwpyCXww3mPhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Afik-gIv8cMwpyCXww3mPhw.png" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AjvaX4wEUAQvUE8O5iMzPzQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AjvaX4wEUAQvUE8O5iMzPzQ.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, all of the images are classified correctly, and we can also see that the top 5 classes predicted by the model for each image are quite sensible and precise.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Most Common Loss Functions in Machine Learning</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Thu, 09 Jul 2020 06:11:13 +0000</pubDate>
      <link>https://forem.com/imsparsh/most-common-loss-functions-in-machine-learning-57p7</link>
      <guid>https://forem.com/imsparsh/most-common-loss-functions-in-machine-learning-57p7</guid>
      <description>&lt;h4&gt;
  
  
  Every Machine Learning Engineer should know about these common Loss functions in Machine Learning and when to use them.
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  In &lt;a href="https://en.wikipedia.org/wiki/Mathematical_optimization" rel="noopener noreferrer"&gt;mathematical optimization&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Decision_theory" rel="noopener noreferrer"&gt;decision theory&lt;/a&gt;, a loss function or cost function is a function that maps an &lt;a href="https://en.wikipedia.org/wiki/Event_(probability_theory)" rel="noopener noreferrer"&gt;event&lt;/a&gt; or values of one or more variables onto a &lt;a href="https://en.wikipedia.org/wiki/Real_number" rel="noopener noreferrer"&gt;real number&lt;/a&gt; intuitively representing some “cost” associated with the event.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — &lt;a href="https://en.wikipedia.org/wiki/Loss_function" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7132%2F1%2Al3kNYfW54bLC2b1VW4whmA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7132%2F1%2Al3kNYfW54bLC2b1VW4whmA.jpeg" alt="Photo by [Josh Rose](https://unsplash.com/@joshsrose?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a core element, the Loss function is a method of evaluating how well your Machine Learning algorithm models your dataset. It is defined as &lt;strong&gt;a measurement of how good your model is in terms of predicting the expected outcome.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Cost function&lt;/em&gt; and &lt;em&gt;Loss function&lt;/em&gt; refer to the same idea: the loss function is calculated for each sample by comparing its predicted output to the actual value, whereas the cost function is the average of these loss values over the whole dataset.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The Loss function is directly related to the predictions of the model you have built: the lower its value, the better the model’s results. The loss function, or rather the cost function used to evaluate model performance, needs to be minimized in order to improve that performance.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s now dive into the Loss functions.&lt;/p&gt;

&lt;p&gt;Broadly speaking, the Loss functions can be grouped into two major categories concerning the types of problems that we come across in the real world — &lt;a href="https://en.wikipedia.org/wiki/Loss_functions_for_classification" rel="noopener noreferrer"&gt;&lt;strong&gt;Classification&lt;/strong&gt;&lt;/a&gt; and &lt;strong&gt;Regression&lt;/strong&gt;. In Classification, the task is to predict the respective probabilities of all classes that the problem is dealing with. In Regression, on the other hand, the task is to predict a continuous value from a given set of independent features fed to the learning algorithm.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&lt;strong&gt;Assumptions:&lt;/strong&gt;&lt;br&gt;
    n/m — Number of training samples.&lt;br&gt;
    i — ith training sample in a dataset.&lt;br&gt;
    y(i) — Actual value for the ith training sample.&lt;br&gt;
    y_hat(i) — Predicted value for the ith training sample.&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Classification Losses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Binary Cross-Entropy Loss / Log Loss
&lt;/h3&gt;

&lt;p&gt;This is the most common Loss function used in Classification problems. The cross-entropy loss decreases as the predicted probability converges to the actual label. It measures the performance of a classification model whose predicted output is a probability value between 0 and 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the number of classes is 2, &lt;em&gt;Binary Classification&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxv6xyr4evx6svdezt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxv6xyr4evx6svdezt6.png" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the number of classes is more than 2, &lt;em&gt;Multi-class Classification&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgorsl0meqc743rkdj1jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgorsl0meqc743rkdj1jl.png" width="561" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlyi7bv988qxi8k8p56l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlyi7bv988qxi8k8p56l.png" width="556" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Cross-Entropy Loss formula is derived from the likelihood function: taking the negative logarithm of the likelihood turns the product of predicted probabilities into the sum of log terms shown above.&lt;/p&gt;
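
&lt;p&gt;As a minimal illustration of the binary case above, here is a small NumPy sketch (not from the original derivation) that computes the Binary Cross-Entropy for a hypothetical batch of labels and predicted probabilities:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_pred))  # roughly 0.41
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;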

&lt;h3&gt;
  
  
  2. Hinge Loss
&lt;/h3&gt;

&lt;p&gt;The second most common loss function used for Classification problems and an alternative to Cross-Entropy loss function is Hinge Loss, primarily developed for Support Vector Machine (SVM) model evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ck5uqwtr4wl5lb81jl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ck5uqwtr4wl5lb81jl.jpeg" width="421" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AaxK_lLrWa20u8_tM0V7uBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AaxK_lLrWa20u8_tM0V7uBQ.png" width="424" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hinge Loss not only penalizes the wrong predictions but also the right predictions that are not confident. It is primarily used with SVM Classifiers with class labels as -1 and 1. Make sure you change your negative class labels from 0 to -1.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
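
&lt;p&gt;A minimal NumPy sketch of Hinge Loss, assuming class labels of -1 and 1 and raw model scores as input (the numbers are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def hinge_loss(y_true, scores):
    # y_true must use labels -1 and 1; scores are raw (unbounded) model outputs
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([0.8, -0.5, 0.3, 0.6])  # the last prediction is on the wrong side
print(hinge_loss(y_true, scores))  # 0.75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;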

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2A4ZxpnNHNK9ehInmM" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2A4ZxpnNHNK9ehInmM" alt="Photo by [Jen Theodore](https://unsplash.com/@jentheodore?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Regression Losses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Mean Square Error / Quadratic Loss / L2 Loss
&lt;/h3&gt;

&lt;p&gt;MSE loss function is defined as the average of squared differences between the actual and the predicted value. It is the most commonly used Regression loss function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ASRKwqe7YM2-cV3PMMN7VTg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ASRKwqe7YM2-cV3PMMN7VTg.png" width="581" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH1mkqlq7buDZ7rDWD_4qIw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH1mkqlq7buDZ7rDWD_4qIw.png" width="576" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The corresponding cost function is the &lt;strong&gt;Mean&lt;/strong&gt; of these &lt;strong&gt;Squared Errors (MSE)&lt;/strong&gt;. The MSE Loss function penalizes the model for making large errors by squaring them and this property makes the MSE cost function less robust to outliers. Therefore, &lt;em&gt;it should not be used if the data is prone to many outliers.&lt;/em&gt;&lt;/p&gt;
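
&lt;p&gt;A minimal NumPy sketch of the MSE loss on a hypothetical set of actual and predicted values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def mse_loss(y_true, y_pred):
    # Average of squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mse_loss(y_true, y_pred))  # 0.875
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;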

&lt;h3&gt;
  
  
  2. Mean Absolute Error / L1 Loss
&lt;/h3&gt;

&lt;p&gt;MAE loss function is defined as the average of absolute differences between the actual and the predicted value. It is the second most commonly used Regression loss function. It measures the average magnitude of errors in a set of predictions, without considering their directions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5A-4HIx11hquyDaOImEFkA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5A-4HIx11hquyDaOImEFkA.png" width="644" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AID63kMgh8F0fcgmsrmjS5A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AID63kMgh8F0fcgmsrmjS5A.png" width="576" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The corresponding cost function is the &lt;strong&gt;Mean&lt;/strong&gt; of these &lt;strong&gt;Absolute Errors (MAE)&lt;/strong&gt;. The MAE Loss function is more robust to outliers compared to MSE Loss function. Therefore, &lt;em&gt;it should be used if the data is prone to many outliers.&lt;/em&gt;&lt;/p&gt;
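
&lt;p&gt;The corresponding NumPy sketch for MAE, using the same hypothetical values as above; note how the large errors are penalized less heavily than under MSE:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def mae_loss(y_true, y_pred):
    # Average of absolute differences between actual and predicted values
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mae_loss(y_true, y_pred))  # 0.75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;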

&lt;h3&gt;
  
  
  3. Huber Loss / Smooth Mean Absolute Error
&lt;/h3&gt;

&lt;p&gt;Huber loss function is defined as a combination of the MSE and MAE Loss functions: it approaches &lt;strong&gt;MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers)&lt;/strong&gt;. It is essentially Mean Absolute Error that becomes quadratic when the error is small. How small the error has to be for the loss to become quadratic is controlled by a hyperparameter, 𝛿 (delta), which can be tuned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVx7otH8Vzkkw_Gxzt9xpgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVx7otH8Vzkkw_Gxzt9xpgg.png" width="525" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkBXeqZvMpfVMsYry0Chs5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkBXeqZvMpfVMsYry0Chs5g.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The choice of the delta value is critical because it determines what you’re willing to consider as an outlier. Hence, the Huber Loss function could be less sensitive to outliers than the MSE Loss function, depending upon the hyperparameter value. Therefore, &lt;strong&gt;&lt;em&gt;it can be used if the data is prone to outliers, and&lt;/em&gt; we might need to tune the hyperparameter delta, which is an iterative process.&lt;/strong&gt;&lt;/p&gt;
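
&lt;p&gt;A minimal NumPy sketch of the Huber loss, with delta treated as a tunable hyperparameter (delta = 1.0 here is only an example value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    small = np.abs(error) &amp;lt;= delta                     # quadratic (MSE-like) region
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)      # linear (MAE-like) region
    return np.mean(np.where(small, squared, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(huber_loss(y_true, y_pred, delta=1.0))  # 0.40625
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;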

&lt;h3&gt;
  
  
  4. Log-Cosh Loss
&lt;/h3&gt;

&lt;p&gt;The Log-Cosh loss function is defined as the logarithm of the hyperbolic cosine of the prediction error. It is another function used in regression tasks that is much smoother than MSE Loss. It has all the advantages of Huber loss and, unlike Huber loss, it is twice differentiable everywhere. This matters because some learning algorithms such as XGBoost use Newton’s method to find the optimum, and hence need the second derivative (&lt;em&gt;Hessian&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGrNtSStzBEwM343vuZoIIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGrNtSStzBEwM343vuZoIIQ.png" width="436" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVQf8ToK0Td-XQjfuWrD-ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVQf8ToK0Td-XQjfuWrD-ew.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that ‘logcosh’ works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.&lt;/em&gt;&lt;br&gt;
 — &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh" rel="noopener noreferrer"&gt;Tensorflow Docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
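
&lt;p&gt;A minimal NumPy sketch of the Log-Cosh loss on the same hypothetical values (note that np.cosh can overflow for very large errors, so this is only a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def log_cosh_loss(y_true, y_pred):
    # Logarithm of the hyperbolic cosine of the prediction error
    return np.mean(np.log(np.cosh(y_pred - y_true)))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(log_cosh_loss(y_true, y_pred))  # roughly 0.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;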

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Quantile Loss&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A quantile is a value below which a given fraction of samples in a group falls. Machine learning models work by minimizing (or maximizing) an objective function, and as the name suggests, the quantile regression loss function is applied to predict quantiles. For a set of predictions, the loss is the average of the per-sample quantile losses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AS4LuBEEuOrPHv55YBisPNQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AS4LuBEEuOrPHv55YBisPNQ.png" width="717" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AG9ei3IOtTOz2N8QCTj-37Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AG9ei3IOtTOz2N8QCTj-37Q.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/deep-quantile-regression-c85481548b5a" rel="noopener noreferrer"&gt;Quantile loss function&lt;/a&gt; turns out to be useful when we are interested in predicting an interval instead of only point predictions.&lt;/p&gt;

&lt;p&gt;Thank you for reading! I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or on my &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6030%2F1%2ANgtElaMk5jG3trBElXPG8A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6030%2F1%2ANgtElaMk5jG3trBElXPG8A.jpeg" alt="Photo by [Crawford Jolly](https://unsplash.com/@crawford?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>What makes Logistic Regression a Classification Algorithm?</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Mon, 06 Jul 2020 15:36:08 +0000</pubDate>
      <link>https://forem.com/imsparsh/what-makes-logistic-regression-a-classification-algorithm-199l</link>
      <guid>https://forem.com/imsparsh/what-makes-logistic-regression-a-classification-algorithm-199l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5786%2F1%2AKE_Ccr6hyC_ZK0Or1kfhiQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5786%2F1%2AKE_Ccr6hyC_ZK0Or1kfhiQ.jpeg" alt="Photo by [Caleb Jones](https://unsplash.com/@gcalebjones?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral) — Edited" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  Logistic regression is a &lt;a href="https://en.wikipedia.org/wiki/Statistical_model" rel="noopener noreferrer"&gt;statistical model&lt;/a&gt; that in its basic form uses a &lt;a href="https://en.wikipedia.org/wiki/Logistic_function" rel="noopener noreferrer"&gt;logistic function&lt;/a&gt; to model a &lt;a href="https://en.wikipedia.org/wiki/Binary_variable" rel="noopener noreferrer"&gt;binary&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Dependent_variable" rel="noopener noreferrer"&gt;dependent variable&lt;/a&gt;, although many more complex &lt;a href="https://en.wikipedia.org/wiki/Logistic_regression#Extensions" rel="noopener noreferrer"&gt;extensions&lt;/a&gt; exist.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — Wikipedia.
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;— All the images (plots) are generated and modified by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Probably, for every Data Practitioner, &lt;strong&gt;&lt;em&gt;Linear Regression&lt;/em&gt;&lt;/strong&gt; happens to be the starting point when implementing Machine Learning, where you learn about &lt;em&gt;predicting a continuous value for a given set of independent features&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Logistic, not Linear?
&lt;/h3&gt;

&lt;p&gt;Let us start with the most basic case, &lt;strong&gt;&lt;em&gt;Binary Classification&lt;/em&gt;&lt;/strong&gt;: the model should be able to predict the dependent variable as one of two possible classes, &lt;em&gt;0 or 1&lt;/em&gt;. If we consider using &lt;em&gt;Linear Regression&lt;/em&gt;, we can predict a value for the given set of input features, but the model will forecast continuous values like 0.03, +1.2, -0.9, etc., which are neither suitable for assigning one of the two classes nor interpretable as a probability of belonging to a class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;E.g.&lt;/em&gt;&lt;/strong&gt; When we have to predict whether a website is malicious, given the length of its URL as a feature, the response variable has two values: benign and malicious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt0jmtaa2sim7o09kcxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt0jmtaa2sim7o09kcxx.png" alt="Linear Regression on categorical data — By Author" width="363" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we try to fit a Linear Regression model to a binary classification problem, the fitted model is a straight line, and the plot above shows why it is not suitable for this task.&lt;/p&gt;

&lt;p&gt;To overcome this problem, we use a &lt;strong&gt;&lt;em&gt;sigmoid function&lt;/em&gt;&lt;/strong&gt;, which fits an S-shaped curve to the data and yields a much better model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic/Sigmoid Function
&lt;/h2&gt;

&lt;p&gt;The Logistic Regression can be explained with &lt;em&gt;Logistic function&lt;/em&gt;, also known as &lt;em&gt;Sigmoid function&lt;/em&gt; that takes any real input &lt;em&gt;x&lt;/em&gt;, and outputs a probability value between 0 and 1 which is defined as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2of9dfr64io9ksbgfb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2of9dfr64io9ksbgfb7.png" width="598" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model fit using the above Logistic function can be seen as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz9uevet9pj7yu3orvfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz9uevet9pj7yu3orvfv.png" alt="Logistic Regression on categorical data — By Author" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Further, let us consider &lt;em&gt;t&lt;/em&gt; as a linear function of a single explanatory variable in a univariate regression model, where &lt;em&gt;β0&lt;/em&gt; is the intercept and &lt;em&gt;β1&lt;/em&gt; is the slope, given by,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30dpwi7vjaqk7o7u5a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30dpwi7vjaqk7o7u5a7.png" width="344" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The general Logistic function &lt;em&gt;p&lt;/em&gt; which outputs a value between 0 and 1 will become,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Ac5vPslbYETp28VpmYRvPVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Ac5vPslbYETp28VpmYRvPVA.png" width="691" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that data separable into two classes can be modelled using a Logistic function applied to a linear function of the given variable. However, the relation between the input variable x and the output probability produced by the sigmoid function is not easy to interpret directly, so we now introduce the &lt;strong&gt;&lt;em&gt;Logit&lt;/em&gt;&lt;/strong&gt; (log-odds) function, which makes this model interpretable in a linear fashion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logit (Log-Odds) Function
&lt;/h2&gt;

&lt;p&gt;The Log-odds function, &lt;em&gt;a.k.a. the natural logarithm of the odds&lt;/em&gt;, is the inverse of the standard Logistic function and can be defined and further simplified as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2742%2F1%2Akv33IT2dtfjRPG1xcSAAvA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2742%2F1%2Akv33IT2dtfjRPG1xcSAAvA.png" width="800" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above equation, the terms are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;g&lt;/em&gt; is the &lt;a href="https://en.wikipedia.org/wiki/Logit" rel="noopener noreferrer"&gt;logit&lt;/a&gt; function. The equation for &lt;em&gt;g(p(x))&lt;/em&gt; shows that the logit is equivalent to the linear regression expression&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;ln&lt;/em&gt; denotes the &lt;a href="https://en.wikipedia.org/wiki/Natural_logarithm" rel="noopener noreferrer"&gt;natural logarithm&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;p(x)&lt;/em&gt; is the probability of the dependent variable that falls in one of the two classes 0 or 1, given some linear combination of the predictors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;β0&lt;/em&gt; is the &lt;a href="https://en.wikipedia.org/wiki/Y-intercept" rel="noopener noreferrer"&gt;intercept&lt;/a&gt; from the linear regression equation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;β1&lt;/em&gt; is the regression coefficient multiplied by some value of the predictor&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On further simplifying the above equation and exponentiating both sides, we can deduce the relationship between the probability and the linear model as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa1wdaYLavV_K3VSqQPMIKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa1wdaYLavV_K3VSqQPMIKg.png" width="456" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left term is called &lt;strong&gt;&lt;em&gt;odds&lt;/em&gt;&lt;/strong&gt;, which is defined as equivalent to the exponential function of the linear regression expression. With &lt;em&gt;ln&lt;/em&gt; (log base e) on both sides, we can interpret the relation as linear between the log-odds and the independent variable &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;
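
&lt;p&gt;Continuing the hypothetical values from the earlier sketch, a quick numeric check shows that the log-odds of the predicted probability recover exactly the linear expression β0 + β1x:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

beta0, beta1 = -4.0, 2.0                  # same hypothetical intercept and slope as before
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

log_odds = np.log(p / (1 - p))            # the logit of the predicted probability
print(log_odds.round(3))                  # [-4. -2.  0.  2.  4.], i.e. beta0 + beta1 * x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;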

&lt;h3&gt;
  
  
  Why Regression?
&lt;/h3&gt;

&lt;p&gt;The change in probability &lt;em&gt;p(x)&lt;/em&gt; with a change in variable x cannot be understood directly, as it is defined by the sigmoid function. But from the above expression, we can see that the log-odds change linearly with a change in the variable &lt;em&gt;x&lt;/em&gt; itself. The plot of the log-odds against the linear equation can be seen as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A7ETm9YJTSixm_Y7ovZrD7w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A7ETm9YJTSixm_Y7ovZrD7w.jpeg" alt="Log-odds vs independent variable x — By Author" width="578" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The probability outcome of the dependent variable shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation with the sigmoid function, the resulting probability &lt;em&gt;p(x)&lt;/em&gt; ranges between 0 and 1, i.e. 0&amp;lt;p&amp;lt;1. Therefore, this is &lt;strong&gt;what makes Logistic Regression, a regression at heart, work as a Classification algorithm&lt;/strong&gt;: it assigns the value of the linear regression to a particular class depending upon the decision boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Boundary
&lt;/h2&gt;

&lt;p&gt;The decision boundary is defined as a &lt;em&gt;threshold&lt;/em&gt; value that helps us classify the predicted probability value given by the sigmoid function into a particular class, positive or negative.&lt;/p&gt;
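
&lt;p&gt;A minimal sketch of applying the threshold, assuming the conventional value of 0.5:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

p = np.array([0.02, 0.35, 0.51, 0.88])    # predicted probabilities from the sigmoid
threshold = 0.5                           # the decision boundary
predicted_class = (p &amp;gt;= threshold).astype(int)
print(predicted_class)  # [0 0 1 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;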

&lt;h3&gt;
  
  
  Linear Decision Boundary
&lt;/h3&gt;

&lt;p&gt;When two or more classes can be linearly separable,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUEMtd_lo2ve5M1jTYVg4DQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUEMtd_lo2ve5M1jTYVg4DQ.jpeg" alt="Linear Decision Boundary — By Author" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Linear Boundary
&lt;/h3&gt;

&lt;p&gt;When two or more classes are not linearly separable,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F1%2ARggWqrx86u_4JhUY1zHznw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F1%2ARggWqrx86u_4JhUY1zHznw.png" alt="Non-Linear Decision Boundary — By Author" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Class Classification
&lt;/h2&gt;

&lt;p&gt;The basic intuition behind &lt;a href="https://en.wikipedia.org/wiki/Multiclass_classification" rel="noopener noreferrer"&gt;Multi-Class&lt;/a&gt; and Binary Logistic Regression is the same. However, for a multi-class classification problem, we follow a &lt;a href="https://houxianxu.github.io/implementation/One-vs-All-LogisticRegression.html" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;one v/s all classification&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. If there are multiple independent variables for the model, the traditional equation is modified as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2050%2F1%2A7uXOwzakASQcTZ3SLMer9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2050%2F1%2A7uXOwzakASQcTZ3SLMer9Q.png" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the Log-Odds are linearly related to the multiple independent variables, and the linear regression becomes a multiple regression with &lt;em&gt;m&lt;/em&gt; explanatory variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;E.g.&lt;/em&gt;&lt;/strong&gt; If we have to predict whether the weather is sunny, rainy, or windy, we are dealing with a Multi-class problem. We turn this into three binary classification problems, i.e. whether it is sunny or not, whether it is rainy or not, and whether it is windy or not. We run all three classifiers &lt;em&gt;independently&lt;/em&gt; on the input features, and the class for which the predicted probability is the maximum becomes the solution, as illustrated in the sketch below.&lt;/p&gt;
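
&lt;p&gt;Here is a minimal, hypothetical sketch of the one v/s all idea using scikit-learn (which is not used elsewhere in this article); the toy features and labels are placeholders for real weather data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical weather features (e.g. humidity, wind speed) with labels
# 0 = sunny, 1 = rainy, 2 = windy
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.3], [0.8, 0.4], [0.3, 0.9], [0.4, 0.8]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary logistic regression per class; the class with the highest probability wins
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[0.85, 0.35]]))                 # expected: [1] (rainy)
print(clf.predict_proba([[0.85, 0.35]]).round(2))  # per-class probabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;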

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Logistic regression is one of the simplest Machine Learning models. It is easy to understand, interpretable, and can give pretty good results. Every practitioner using Logistic Regression should know about the Log-Odds, the main concept behind this learning algorithm. Logistic Regression is also highly interpretable with respect to business needs, since it explains how the model behaves with each independent variable used. This post aimed to provide an easy way to understand the idea behind the regression and the transparency provided by Logistic Regression.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh" rel="noopener noreferrer"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
  </channel>
</rss>
