<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sparsh Gupta</title>
    <description>The latest articles on Forem by Sparsh Gupta (@imsparsh).</description>
    <link>https://forem.com/imsparsh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG</url>
      <title>Forem: Sparsh Gupta</title>
      <link>https://forem.com/imsparsh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/imsparsh"/>
    <language>en</language>
    <item>
      <title>Importance of Data Visualization — Anscombe’s Quartet Way</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Mon, 27 Jul 2020 17:48:56 +0000</pubDate>
      <link>https://forem.com/imsparsh/importance-of-data-visualization-anscombe-s-quartet-way-5693</link>
      <guid>https://forem.com/imsparsh/importance-of-data-visualization-anscombe-s-quartet-way-5693</guid>
      <description>&lt;h4&gt;
  
  
  Four datasets that fool a linear regression model if one is fitted without first looking at the data.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F1%2AteCUzrolOckJEyHsNhi_Ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2326%2F1%2AteCUzrolOckJEyHsNhi_Ng.png" alt="Image by Author" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Anscombe’s quartet&lt;/strong&gt; comprises four &lt;a href="https://en.wikipedia.org/wiki/Data_set" rel="noopener noreferrer"&gt;data sets&lt;/a&gt; that have nearly identical simple &lt;a href="https://en.wikipedia.org/wiki/Descriptive_statistics" rel="noopener noreferrer"&gt;descriptive statistics&lt;/a&gt;, yet have very different distributions and appear very different when graphed.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — Wikipedia
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Anscombe’s Quartet&lt;/strong&gt; is a group of four data sets that are &lt;strong&gt;nearly identical in simple descriptive statistics&lt;/strong&gt;, yet each contains peculiarities that &lt;strong&gt;fool a regression model&lt;/strong&gt; fitted to it. The datasets have very different distributions and &lt;strong&gt;appear very different&lt;/strong&gt; when shown on scatter plots.&lt;/p&gt;

&lt;p&gt;The quartet was constructed in 1973 by the statistician &lt;strong&gt;Francis Anscombe&lt;/strong&gt; to illustrate the &lt;strong&gt;importance&lt;/strong&gt; of &lt;strong&gt;plotting the data&lt;/strong&gt; before analysing it and building models, and the effect of &lt;strong&gt;outliers and other influential observations on statistical properties&lt;/strong&gt;. The four datasets share nearly the &lt;strong&gt;same statistical summaries&lt;/strong&gt;, i.e. they provide the same statistical information, such as the &lt;strong&gt;mean&lt;/strong&gt; and &lt;strong&gt;variance&lt;/strong&gt; of all the x, y points, across all four datasets.&lt;/p&gt;

&lt;p&gt;This tells us about the importance of visualising the data before applying algorithms to build models from it: the data features must be plotted in order to see the distribution of the samples, which helps you identify anomalies present in the data such as outliers, the diversity of the data, its linear separability, and so on. Also, Linear Regression can only be considered a good fit for &lt;strong&gt;data with linear relationships&lt;/strong&gt; and is incapable of handling any other kind of dataset properly. These four plots can be defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AwMuoOLohuNbTWbbu_rpujg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AwMuoOLohuNbTWbbu_rpujg.png" alt="Image by Author" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The statistical information for all four datasets is approximately the same and can be computed as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AUrXAppaF09s88C_rG0KRjA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2268%2F1%2AUrXAppaF09s88C_rG0KRjA.png" alt="Image by Author" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When these datasets are drawn as scatter plots, each one produces a very different kind of plot, even though a regression algorithm, fooled by these peculiarities, reports nearly identical results for all of them, as can be seen here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2314%2F1%2A4H7ByZaIXvke8NVAOZ8E2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2314%2F1%2A4H7ByZaIXvke8NVAOZ8E2g.png" alt="Image by Author" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The four datasets can be described as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt; this &lt;strong&gt;fits&lt;/strong&gt; the linear regression model pretty well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt; linear regression &lt;strong&gt;cannot fit&lt;/strong&gt; this data well, as the relationship is non-linear.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 3:&lt;/strong&gt; contains an &lt;strong&gt;outlier&lt;/strong&gt; that pulls the fitted line away from the otherwise linear trend, which the linear regression model &lt;strong&gt;cannot handle&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset 4:&lt;/strong&gt; contains a single &lt;strong&gt;high-leverage point&lt;/strong&gt; that produces a high correlation even though the remaining points show no relationship, which the linear regression model also &lt;strong&gt;cannot handle&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We have described the four datasets that were intentionally constructed to demonstrate the importance of data visualisation and how easily a regression algorithm can be fooled without it. Hence, all the important features in a dataset should be visualised before any machine learning algorithm is applied to them, which helps in building a well-fitting model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh" rel="noopener noreferrer"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-story__hidden-navigation-link"&gt;Assumptions in Linear Regression you might not know.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/imsparsh" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.dev.to%2Fdynamic%2Fimage%2Fwidth%3D90%2Cheight%3D90%2Cfit%3Dcover%2Cgravity%3Dauto%2Cformat%3Dauto%2Fhttps%253A%252F%252Fdev-to-uploads.s3.amazonaws.com%252Fuploads%252Fuser%252Fprofile_image%252F416289%252F18dab641-23d9-4b5a-b8f2-64eaa6b8deb1.jpeg" alt="imsparsh profile" class="crayons-avatar__image" width="612" height="612"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/imsparsh" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sparsh Gupta
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sparsh Gupta
                
              
              &lt;div id="story-author-preview-content-400657" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/imsparsh" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.dev.to%2Fdynamic%2Fimage%2Fwidth%3D90%2Cheight%3D90%2Cfit%3Dcover%2Cgravity%3Dauto%2Cformat%3Dauto%2Fhttps%253A%252F%252Fdev-to-uploads.s3.amazonaws.com%252Fuploads%252Fuser%252Fprofile_image%252F416289%252F18dab641-23d9-4b5a-b8f2-64eaa6b8deb1.jpeg" class="crayons-avatar__image" alt="" width="612" height="612"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sparsh Gupta&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 16 '20&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" id="article-link-400657"&gt;
          Assumptions in Linear Regression you might not know.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;



&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-story__hidden-navigation-link"&gt;Most Common Loss Functions in Machine Learning&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/imsparsh" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG" alt="imsparsh profile" class="crayons-avatar__image" width="800" height="1066"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/imsparsh" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sparsh Gupta
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sparsh Gupta
                
              
              &lt;div id="story-author-preview-content-387064" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/imsparsh" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F416289%2Fa8e8e463-f1c8-4998-993f-cbf951b28bee.JPG" class="crayons-avatar__image" alt="" width="800" height="1066"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sparsh Gupta&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 9 '20&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" id="article-link-387064"&gt;
          Most Common Loss Functions in Machine Learning
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/computerscience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;computerscience&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;31&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/imsparsh/most-common-loss-functions-in-machine-learning-57p7#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
    <item>
      <title>Assumptions in Linear Regression you might not know.</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Thu, 16 Jul 2020 17:36:15 +0000</pubDate>
      <link>https://forem.com/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6</link>
      <guid>https://forem.com/imsparsh/assumptions-in-linear-regression-you-might-not-know-58c6</guid>
      <description>&lt;h4&gt;
  
  
  The data should conform to these assumptions for Linear Regression to produce the best possible fit.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pGQpYK9h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11150/0%2A0O7LtlDWczZU05fX" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pGQpYK9h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11150/0%2A0O7LtlDWczZU05fX" alt="Photo by [Joseph Barrientos](https://unsplash.com/@jbcreate_?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— All the images (plots) are generated and modified by Author.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;To begin with, Linear Regression is a method of modelling the best &lt;strong&gt;linear relationship&lt;/strong&gt; between the &lt;strong&gt;independent&lt;/strong&gt; variables and the &lt;strong&gt;dependent&lt;/strong&gt; variable. The simplest form of Linear Regression can be defined by the following equation, with one independent and one dependent variable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cNat4d17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2At2YgYdjJQJcyypqlj2T_4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cNat4d17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2At2YgYdjJQJcyypqlj2T_4w.png" alt="Simple Linear Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;x&lt;/strong&gt; is the independent variable, &lt;br&gt;
&lt;strong&gt;y&lt;/strong&gt; is the dependent variable,&lt;br&gt;
&lt;strong&gt;β1&lt;/strong&gt; is the coefficient of x, i.e. slope, &lt;br&gt;
&lt;strong&gt;β0&lt;/strong&gt; is the intercept (constant), i.e. the value of y where the line crosses the y-axis.&lt;/p&gt;
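
&lt;p&gt;To make the equation concrete, here is a small illustrative sketch (the data values are made up) that fits y = β0 + β1·x with scikit-learn and prints the estimated slope and intercept:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# illustrative sketch: fit y = b0 + b1*x on toy data (values are made up)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])            # dependent variable

model = LinearRegression().fit(x, y)
print("beta1 (slope):", model.coef_[0])
print("beta0 (intercept):", model.intercept_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;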

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;Linear regression&lt;/strong&gt; is a &lt;a href="https://en.wikipedia.org/wiki/Linearity"&gt;linear&lt;/a&gt; approach to modelling the relationship between a scalar response (or &lt;a href="https://en.wikipedia.org/wiki/Dependent_variable"&gt;dependent variable&lt;/a&gt;) and one or more &lt;a href="https://en.wikipedia.org/wiki/Explanatory_variable"&gt;explanatory variables&lt;/a&gt; (or &lt;a href="https://en.wikipedia.org/wiki/Independent_variable"&gt;independent variables&lt;/a&gt;).
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — &lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;Wikipedia&lt;/a&gt;
&lt;/h1&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Linear Regression Types
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Simple Linear Regression&lt;/strong&gt; — The simplest form of regression, involving one independent variable and one dependent variable, as explained above, where we fit a straight line to the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multiple Linear Regression&lt;/strong&gt; — A more general form of regression involving multiple independent variables and one dependent variable, which can be described by the following equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qkx3Km-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATvjV8mzecaWtYjpTznZgVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qkx3Km-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ATvjV8mzecaWtYjpTznZgVA.png" alt="Multiple Linear Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;x1&lt;/strong&gt; to &lt;strong&gt;xn&lt;/strong&gt; are the independent variables, &lt;br&gt;
&lt;strong&gt;y&lt;/strong&gt; is the dependent variable,&lt;br&gt;
&lt;strong&gt;β1&lt;/strong&gt; to &lt;strong&gt;βn&lt;/strong&gt; are the coefficients of the respective x features, and&lt;br&gt;
&lt;strong&gt;β0&lt;/strong&gt; is the intercept (constant), i.e. the predicted value of y when all the x features are zero.&lt;/p&gt;
&lt;h2&gt;
  
  
  Assumptions in Linear Regression
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y79rFBoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2A8W0qNKPE5SdjMVLt" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y79rFBoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/12000/0%2A8W0qNKPE5SdjMVLt" alt="Photo by [Tom Roberts](https://unsplash.com/@tomrdesigns?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Linear Relationship&lt;/strong&gt; — It is assumed that the relationship between the independent variables and the dependent variable is linear, i.e. the model is linear in the coefficients, which are what we estimate during model building and use for prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c7GqPA5w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfLObzuuBDCG369iGH0h3_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c7GqPA5w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfLObzuuBDCG369iGH0h3_A.png" alt="Image by Author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predictor variables are treated as fixed values and may themselves be transformed by arbitrary functions such as polynomials or trigonometric functions, but the model must remain strictly linear in the coefficients of those predictor variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bsFW_sjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Apm-InClFV1k2q3WNzC_EVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bsFW_sjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Apm-InClFV1k2q3WNzC_EVw.png" alt="Polynomial Regression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This property is what enables &lt;strong&gt;Polynomial regression&lt;/strong&gt;, which uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable while keeping the relationship with the coefficients linear.&lt;/p&gt;
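
&lt;p&gt;A minimal sketch of this idea, assuming scikit-learn and toy data, fits a cubic polynomial while the underlying model stays linear in its coefficients:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: polynomial regression is linear regression on polynomial features (toy data)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.3, size=50)

# cubic features of x, but the model is still linear in its coefficients
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.named_steps["linearregression"].coef_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;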

&lt;p&gt;&lt;strong&gt;2. Homoscedasticity (Constant Variance)&lt;/strong&gt; — It is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the errors is the same across different values of the independent features, regardless of the values of the predictor variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wEEbXIvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3326/1%2AJan9oVOzNqQyhA4bSg_zwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wEEbXIvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3326/1%2AJan9oVOzNqQyhA4bSg_zwA.png" alt="Image by Author — Modified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There should be no clear pattern in the distribution of the residuals; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. the variance stays constant. The middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant-variance rule, and the rightmost graph reveals a pattern where the error terms decrease with the predicted values, again representing heteroscedasticity. Two or more normal distributions are homoscedastic if they share a common covariance (or correlation) matrix.&lt;/p&gt;
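
&lt;p&gt;A simple way to eyeball this assumption is a residuals-versus-fitted plot; the sketch below uses toy data with constant-variance noise, so the points should scatter evenly around zero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: residuals vs. fitted values; a funnel or curve suggests heteroscedasticity (toy data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(scale=1.0, size=200)   # constant-variance noise

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;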

&lt;p&gt;&lt;strong&gt;3. Multivariate Normality&lt;/strong&gt; — It is assumed that the error terms are normally distributed with a mean of zero. A less widely known fact is that, as the sample size grows large, the normality assumption for the residuals is no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HcNz05MC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am5u5g0Gs7r0L8fS0mOkIDA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HcNz05MC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2Am5u5g0Gs7r0L8fS0mOkIDA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above q-q plot shows that the errors, or residuals, are normally distributed. The error term can be seen as a composite of many minor errors; as the number of these minor errors increases, the distribution of the error term tends to approach the normal distribution. This tendency is explained by the Central Limit Theorem. Note that the t-test and F-test used for inference are only strictly applicable when the error term is normally distributed.&lt;/p&gt;
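
&lt;p&gt;A q-q plot like the one above can be drawn with scipy; the sketch below uses stand-in residuals, since the article’s own residuals are not included here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: Q-Q plot of residuals against a normal distribution (stand-in residuals)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=300)        # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;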

&lt;p&gt;&lt;strong&gt;4. No Multicollinearity&lt;/strong&gt; — Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are not, or only weakly, correlated with each other, which keeps them truly independent. As a practical rule of thumb, the correlation between two independent features should not exceed roughly 30%, since high correlation weakens the statistical power of the model. Pair plots (scatter plots) and heatmaps (correlation matrices) can be used to identify highly correlated features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tpVjKCRr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVYPi2Tqx02Lw1VYs28rtDg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tpVjKCRr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AVYPi2Tqx02Lw1VYs28rtDg.png" alt="Correlation Heatmap — Image by Author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Highly correlated features should not be used together in the model because they tend to change in unison. When one feature changes, its correlated partner changes with it rather than staying constant, yet the usual interpretation of a regression coefficient assumes all other features are held constant while the outcome is predicted from the weighted coefficients. With multicollinearity, that expected interpretation of the regression coefficients no longer holds.&lt;/p&gt;
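
&lt;p&gt;In practice, a correlation heatmap together with variance inflation factors (VIF) from statsmodels can flag such features; the sketch below uses a toy DataFrame with one deliberately correlated pair:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: flag correlated predictors with a correlation heatmap and VIF (toy DataFrame)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=200)})
X["x2"] = X["x1"] * 0.9 + rng.normal(scale=0.3, size=200)   # deliberately correlated with x1
X["x3"] = rng.normal(size=200)

sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.show()

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # noticeably large values usually signal problematic multicollinearity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;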

&lt;p&gt;&lt;strong&gt;5. No Auto-correlation&lt;/strong&gt; — It is assumed that there is no auto-correlation in the data. Auto-correlation mainly occurs when there is dependency between the residual errors, i.e. the residuals are correlated, positively or negatively, with one another instead of being spread randomly. This usually occurs in time-series models, where the next instant depends on the previous one. The presence of correlation in the residual terms also reduces the model’s predictability.&lt;/p&gt;

&lt;p&gt;Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pgwuGyCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AlP_1ng3tQZ3UAidLDoGVXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pgwuGyCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AlP_1ng3tQZ3UAidLDoGVXQ.png" alt="Durbin-Watson Equation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Durbin-Watson test statistic always has a value between 0 and 4. An exact value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.&lt;/p&gt;
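
&lt;p&gt;Rather than computing the statistic by hand, statsmodels provides it directly; the sketch below applies it to stand-in residuals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: Durbin-Watson statistic on residuals (stand-in residuals; a value near 2 means no autocorrelation)
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
residuals = rng.normal(size=300)         # stand-in for model residuals

print(durbin_watson(residuals))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;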

&lt;p&gt;&lt;strong&gt;6. No Extrapolation&lt;/strong&gt; — Extrapolation is estimation beyond the original observation range. It is assumed that the trained model will predict values of the dependent variable only for independent feature values that lie within the range of the training data. The model therefore cannot guarantee its predictions for values beyond the range of the independent features it was trained on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SAYGLR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2118/1%2Aennj4kl3b724C5w1to42rQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SAYGLR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2118/1%2Aennj4kl3b724C5w1to42rQ.jpeg" alt="Image by Author — Modified"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We have explained the most important assumptions that should be checked before fitting a Linear Regression model to a given set of data. These assumptions are a formal measure to ensure that the predictions of the fitted linear regression model are good enough to give us the best possible results for the data set. If they are not satisfied, a linear regression model can still be built, but it is satisfying them that gives us confidence in the model’s predictions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XMQln4Ie--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/fit/c/96/96/1%2AOou0KMNLO-BurnkyhAa1yA.jpeg" alt="Sparsh Gupta"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;What makes Logistic Regression a Classification Algorithm? | by Sparsh Gupta | Jul, 2020 | Towards Data Science&lt;/h2&gt;
      &lt;h3&gt;Sparsh Gupta ・ &lt;time&gt;Jul 3, 2020&lt;/time&gt; ・ 6 min read&lt;/h3&gt;
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KBvj_QRD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/medium_icon-90d5232a5da2369849f285fa499c8005e750a788fdbf34f5844d5f2201aae736.svg" alt="Medium Logo"&gt;
        towardsdatascience.com
      &lt;/div&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Insightful Loan Default Analysis</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Fri, 10 Jul 2020 15:24:03 +0000</pubDate>
      <link>https://forem.com/imsparsh/insightful-loan-default-analysis-3ojg</link>
      <guid>https://forem.com/imsparsh/insightful-loan-default-analysis-3ojg</guid>
      <description>&lt;h4&gt;
  
  
  Visualize Insights and Discover Driving Features in Lending Credit Risk Model for Loan Defaults
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwtqouj2p5a5yjc75qbr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwtqouj2p5a5yjc75qbr.jpeg" alt="(Image by Author)" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lendingclub.com/" rel="noopener noreferrer"&gt;Lending Club&lt;/a&gt; is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface.&lt;/p&gt;

&lt;p&gt;As with most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). The credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders.&lt;/p&gt;

&lt;p&gt;Therefore, using &lt;strong&gt;Data Science&lt;/strong&gt;, &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt; and public data from &lt;strong&gt;Lending Club&lt;/strong&gt;, we will explore and crunch out the driving factors that exist behind loan default, i.e. the variables which are strong indicators of default. The company can then utilise this knowledge for its portfolio and risk assessment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11232%2F0%2Ar0HYsDqb1VdNPV0A" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11232%2F0%2Ar0HYsDqb1VdNPV0A" alt="Photo by [Shane](https://unsplash.com/@theyshane?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="760" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;About Lending Club Loan Dataset&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The dataset contains complete loan data for all loans issued from &lt;strong&gt;2007 through 2011&lt;/strong&gt;, including the current loan status (&lt;strong&gt;Current, Charged-off, Fully Paid&lt;/strong&gt;) and the latest payment information. Additional features include credit scores, number of finance inquiries, and collections, among others. The file is a matrix of about 39 thousand observations and 111 variables. A &lt;strong&gt;Data Dictionary&lt;/strong&gt; is provided in a separate file in the dataset. The dataset can be downloaded here on &lt;a href="https://www.kaggle.com/imsparsh/lending-club-loan-dataset-2007-2011" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What set of loan data are we working with?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What types of &lt;strong&gt;features&lt;/strong&gt; do we have?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do we need to treat &lt;strong&gt;missing values&lt;/strong&gt;?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the distribution of Loan Status?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the distribution of Loan Default with other features?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Which plots can we draw for &lt;strong&gt;inferring the relation&lt;/strong&gt; with Loan Default?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most importantly, what are the &lt;strong&gt;driving features&lt;/strong&gt; that describe Loan Default?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Distribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Loan Characteristics&lt;/strong&gt; such as &lt;strong&gt;loan amount, term, purpose&lt;/strong&gt;, which describe the loan itself and will help us in finding loan default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Demographic Variables&lt;/strong&gt; such as &lt;strong&gt;age, employment status, relationship status&lt;/strong&gt;, which describe the borrower profile and are not useful for us.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Behavioural Variables&lt;/strong&gt; such as &lt;strong&gt;next payment date, EMI, delinquency&lt;/strong&gt;, which contain information updated only after the loan is granted and are therefore not useful in our case, since default analysis must support the decision of whether to approve the loan in the first place.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Here is a quick overview of things we are going to see in this article:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset Overview&lt;/strong&gt; (Distribution of Loans)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt; (Missing Values, Standardize Data, Outlier Treatment)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Derivation&lt;/strong&gt; (Binning)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate Analysis&lt;/strong&gt; (Categorical/Continuous Features)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bivariate Analysis&lt;/strong&gt; (Box Plots, Scatter Plots, Violin Plots)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate Analysis&lt;/strong&gt; (Correlation Heatmap)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data/Library Imports
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import required libraries
import numpy as np
print('numpy version:',np.__version__)
import pandas as pd
print('pandas version:',pd.__version__)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)
pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 400)

# file path variable
case_data = "/kaggle/input/lending-club-loan-dataset-2007-2011/loan.csv"
loan = pd.read_csv(case_data, low_memory=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Data set has 111 columns and 39717 rows&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Dataset Overview
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plotting pie chart for different types of loan_status
chargedOffLoans = loan.loc[(loan["loan_status"] == "Charged Off")]
currentLoans = loan.loc[(loan["loan_status"] == "Current")]
fullyPaidLoans = loan.loc[(loan["loan_status"]== "Fully Paid")]

data  = [{"Charged Off": chargedOffLoans["funded_amnt_inv"].sum(), "Fully Paid":fullyPaidLoans["funded_amnt_inv"].sum(), "Current":currentLoans["funded_amnt_inv"].sum()}]

investment_sum = pd.DataFrame(data) 
chargedOffTotalSum = float(investment_sum["Charged Off"])
fullyPaidTotalSum = float(investment_sum["Fully Paid"])
currentTotalSum = float(investment_sum["Current"])
loan_status = [chargedOffTotalSum,fullyPaidTotalSum,currentTotalSum]
loan_status_labels = 'Charged Off','Fully Paid','Current'
plt.pie(loan_status,labels=loan_status_labels,autopct='%1.1f%%')
plt.title('Loan Status Aggregate Information')
plt.axis('equal')
plt.legend(loan_status,title="Loan Amount",loc="center left",bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoDr0uBiNWcNahkZIM5_o2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoDr0uBiNWcNahkZIM5_o2g.png" alt="(Image by Author)" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plotting pie chart for different types of purpose
loans_purpose = loan.groupby(['purpose'])['funded_amnt_inv'].sum().reset_index()

plt.figure(figsize=(14, 10))
plt.pie(loans_purpose["funded_amnt_inv"],labels=loans_purpose["purpose"],autopct='%1.1f%%')

plt.title('Loan purpose Aggregate Information')
plt.axis('equal')
plt.legend(loan_status,title="Loan purpose",loc="center left",bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlLugFGIaa4vdy4Hj-xqTlQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlLugFGIaa4vdy4Hj-xqTlQ.png" alt="(Image by Author)" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# in dataset, we can see around half of the columns are null
# completely, hence remove all columns having no values
loan = loan.dropna(axis=1, how="all")
print("Looking into remaining columns info:")
print(loan.info(max_cols=200))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are left with the following columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Looking into remaining columns info:
&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 39717 entries, 0 to 39716
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          39717 non-null  int64  
 1   member_id                   39717 non-null  int64  
 2   loan_amnt                   39717 non-null  int64  
 3   funded_amnt                 39717 non-null  int64  
 4   funded_amnt_inv             39717 non-null  float64
 5   term                        39717 non-null  object 
 6   int_rate                    39717 non-null  object 
 7   installment                 39717 non-null  float64
 8   grade                       39717 non-null  object 
 9   sub_grade                   39717 non-null  object 
 10  emp_title                   37258 non-null  object 
 11  emp_length                  38642 non-null  object 
 12  home_ownership              39717 non-null  object 
 13  annual_inc                  39717 non-null  float64
 14  verification_status         39717 non-null  object 
 15  issue_d                     39717 non-null  object 
 16  loan_status                 39717 non-null  object 
 17  pymnt_plan                  39717 non-null  object 
 18  url                         39717 non-null  object 
 19  desc                        26777 non-null  object 
 20  purpose                     39717 non-null  object 
 21  title                       39706 non-null  object 
 22  zip_code                    39717 non-null  object 
 23  addr_state                  39717 non-null  object 
 24  dti                         39717 non-null  float64
 25  delinq_2yrs                 39717 non-null  int64  
 26  earliest_cr_line            39717 non-null  object 
 27  inq_last_6mths              39717 non-null  int64  
 28  mths_since_last_delinq      14035 non-null  float64
 29  mths_since_last_record      2786 non-null   float64
 30  open_acc                    39717 non-null  int64  
 31  pub_rec                     39717 non-null  int64  
 32  revol_bal                   39717 non-null  int64  
 33  revol_util                  39667 non-null  object 
 34  total_acc                   39717 non-null  int64  
 35  initial_list_status         39717 non-null  object 
 36  out_prncp                   39717 non-null  float64
 37  out_prncp_inv               39717 non-null  float64
 38  total_pymnt                 39717 non-null  float64
 39  total_pymnt_inv             39717 non-null  float64
 40  total_rec_prncp             39717 non-null  float64
 41  total_rec_int               39717 non-null  float64
 42  total_rec_late_fee          39717 non-null  float64
 43  recoveries                  39717 non-null  float64
 44  collection_recovery_fee     39717 non-null  float64
 45  last_pymnt_d                39646 non-null  object 
 46  last_pymnt_amnt             39717 non-null  float64
 47  next_pymnt_d                1140 non-null   object 
 48  last_credit_pull_d          39715 non-null  object 
 49  collections_12_mths_ex_med  39661 non-null  float64
 50  policy_code                 39717 non-null  int64  
 51  application_type            39717 non-null  object 
 52  acc_now_delinq              39717 non-null  int64  
 53  chargeoff_within_12_mths    39661 non-null  float64
 54  delinq_amnt                 39717 non-null  int64  
 55  pub_rec_bankruptcies        39020 non-null  float64
 56  tax_liens                   39678 non-null  float64
dtypes: float64(20), int64(13), object(24)
memory usage: 17.3+ MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we will remove all the &lt;strong&gt;Demographic and Customer Behavioural&lt;/strong&gt; features, which are of no use for default analysis at the time of credit approval.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# remove non-required columns
# id - not required
# member_id - not required
# acc_now_delinq - empty
# funded_amnt - not useful, funded_amnt_inv is useful which is funded to person
# emp_title - brand names not useful
# pymnt_plan - fixed value as n for all
# url - not useful
# desc - can be applied some NLP but not for EDA
# title - too many distinct values not useful
# zip_code - complete zip is not available
# delinq_2yrs - post approval feature
# mths_since_last_delinq - only half values are there, not much information
# mths_since_last_record - only 10% values are there
# revol_bal - post/behavioural feature
# initial_list_status - fixed value as f for all
# out_prncp - post approval feature
# out_prncp_inv - not useful as its for investors
# total_pymnt - post approval feature
# total_pymnt_inv - not useful as it is for investors
# total_rec_prncp - post approval feature
# total_rec_int - post approval feature
# total_rec_late_fee - post approval feature
# recoveries - post approval feature
# collection_recovery_fee - post approval feature
# last_pymnt_d - post approval feature
# last_credit_pull_d - irrelevant for approval
# last_pymnt_amnt - post feature
# next_pymnt_d - post feature
# collections_12_mths_ex_med - only 1 value 
# policy_code - only 1 value
# acc_now_delinq - single valued
# chargeoff_within_12_mths - post feature
# delinq_amnt - single valued
# tax_liens - single valued
# application_type - single
# pub_rec_bankruptcies - single valued for more than 99%
# addr_state - may not depend on location as its in financial domain

colsToDrop = ["id", "member_id", "funded_amnt", "emp_title", "pymnt_plan", "url", "desc", "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record", "revol_bal", "initial_list_status", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt", "next_pymnt_d", "last_credit_pull_d", "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "chargeoff_within_12_mths", "delinq_amnt", "tax_liens", "application_type", "pub_rec_bankruptcies", "addr_state"]
loan.drop(colsToDrop, axis=1, inplace=True)
print("Features we are left with",list(loan.columns))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;We are left with &lt;em&gt;[‘loan_amnt’, ‘funded_amnt_inv’, ‘term’, ‘int_rate’, ‘installment’, ‘grade’, ‘sub_grade’, ‘emp_length’, ‘home_ownership’, ‘annual_inc’, ‘verification_status’, ‘issue_d’, ‘loan_status’, ‘purpose’, ‘dti’, ‘earliest_cr_line’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘revol_util’, ‘total_acc’]&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, dealing with &lt;strong&gt;missing values&lt;/strong&gt; by removing/imputing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# in 12 unique values we have 10+ years the most for emp_length, 
# but it is highly dependent variable so we will not impute
# but remove the rows with null values which is around 2.5%

loan.dropna(axis=0, subset=["emp_length"], inplace=True)

# remove NA rows for revol_util as its dependent and is around 0.1%

loan.dropna(axis=0, subset=["revol_util"], inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we standardize some feature columns to make data compatible for analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# update int_rate, revol_util without % sign and as numeric type

loan["int_rate"] = pd.to_numeric(loan["int_rate"].apply(lambda x:x.split('%')[0]))

loan["revol_util"] = pd.to_numeric(loan["revol_util"].apply(lambda x:x.split('%')[0]))

# remove text data from term feature and store as numerical

loan["term"] = pd.to_numeric(loan["term"].apply(lambda x:x.split()[0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Removing records with loan status as “Current”, as the loan is currently running and we can’t infer any information regarding default from such loans.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# remove the rows with loan_status as "Current"
loan = loan[loan["loan_status"].apply(lambda x:False if x == "Current" else True)]


# update loan_status as Fully Paid to 0 and Charged Off to 1
loan["loan_status"] = loan["loan_status"].apply(lambda x: 0 if x == "Fully Paid" else 1)

# update emp_length feature with continuous values as int
# where (&amp;lt; 1 year) is assumed as 0 and 10+ years is assumed as 10 and rest are stored as their magnitude

loan["emp_length"] = pd.to_numeric(loan["emp_length"].apply(lambda x:0 if "&amp;lt;" in x else (x.split('+')[0] if "+" in x else x.split()[0])))

# look through the purpose value counts
loan_purpose_values = loan["purpose"].value_counts()*100/loan.shape[0]

# remove rows with less than 1% of value counts in paricular purpose 
loan_purpose_delete = loan_purpose_values[loan_purpose_values&amp;lt;1].index.values
loan = loan[[False if p in loan_purpose_delete else True for p in loan["purpose"]]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Outlier Treatment
&lt;/h2&gt;

&lt;p&gt;Looking at the quantile values of each feature, we will treat outliers for some of the features.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# for annual_inc, the highest value is 6000000 where 75% quantile value is 83000, and is 100 times the mean
# we need to remove outliers from annual_inc i.e. 99 to 100%
annual_inc_q = loan["annual_inc"].quantile(0.99)
loan = loan[loan["annual_inc"] &amp;lt; annual_inc_q]

# for open_acc, the highest value is 44 where 75% quantile value is 12, and is 5 times the mean
# we need to remove outliers from open_acc i.e. 99.9 to 100%
open_acc_q = loan["open_acc"].quantile(0.999)
loan = loan[loan["open_acc"] &amp;lt; open_acc_q]

# for total_acc, the highest value is 90 where 75% quantile value is 29, and is 4 times the mean
# we need to remove outliers from total_acc i.e. 98 to 100%
total_acc_q = loan["total_acc"].quantile(0.98)
loan = loan[loan["total_acc"] &amp;lt; total_acc_q]

# for pub_rec, the highest value is 4 where 75% quantile value is 0, and is 4 times the mean
# we need to remove outliers from pub_rec i.e. 99.5 to 100%
pub_rec_q = loan["pub_rec"].quantile(0.995)
loan = loan[loan["pub_rec"] &amp;lt;= pub_rec_q]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now this is how our data looks after cleaning and standardizing the features:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2414%2F1%2AM9yad6u8f03TYCpKwSvf-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2414%2F1%2AM9yad6u8f03TYCpKwSvf-Q.png" alt="(Image by Author)" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;
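&lt;p&gt;A quick way to verify the cleaned frame (a minimal check, not part of the original output shown above) is to look at its shape and column types:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sanity-check the cleaned dataframe
print(loan.shape)    # rows remaining after dropping nulls and outliers
print(loan.dtypes)   # confirm int_rate, revol_util and term are now numeric
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;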

&lt;h2&gt;
  
  
  Metrics Derivation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The issue date is not in a standard format; we also split it into separate month and year columns, which makes the analysis easier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parsing with datetime expects a two-digit year (00 to 99), but in some records the year is a single digit (e.g. 9), so we write a small function that pads such dates to avoid exceptions during conversion, as shown in the snippet below.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pad single-digit years so that the datetime parsing does not fail
def standerdisedate(date):
    year = date.split("-")[0]
    if(len(year) == 1):
        date = "0"+date
    return date

from datetime import datetime
loan['issue_d'] = loan['issue_d'].apply(lambda x:standerdisedate(x))
loan['issue_d'] = loan['issue_d'].apply(lambda x: datetime.strptime(x, '%b-%y'))

# extracting month and year from issue_d
loan['month'] = loan['issue_d'].apply(lambda x: x.month)
loan['year'] = loan['issue_d'].apply(lambda x: x.year)

# get the year from earliest_cr_line and replace the column with it
loan["earliest_cr_line"] = pd.to_numeric(loan["earliest_cr_line"].apply(lambda x:x.split('-')[1]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Binning Continuous features:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create bins for loan_amnt range
bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
bucket_l = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000','25000+']
loan['loan_amnt_range'] = pd.cut(loan['loan_amnt'], bins, labels=bucket_l)

# create bins for int_rate range
bins = [0, 7.5, 10, 12.5, 15, 100]
bucket_l = ['0-7.5', '7.5-10', '10-12.5', '12.5-15', '15+']
loan['int_rate_range'] = pd.cut(loan['int_rate'], bins, labels=bucket_l)

# create bins for annual_inc range
bins = [0, 25000, 50000, 75000, 100000, 1000000]
bucket_l = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
loan['annual_inc_range'] = pd.cut(loan['annual_inc'], bins, labels=bucket_l)

# create bins for installment range
def installment(n):
    if n &amp;lt;= 200:
        return 'low'
    elif n &amp;gt; 200 and n &amp;lt;=500:
        return 'medium'
    elif n &amp;gt; 500 and n &amp;lt;=800:
        return 'high'
    else:
        return 'very high'

loan['installment'] = loan['installment'].apply(lambda x: installment(x))

# create bins for dti range
bins = [-1, 5.00, 10.00, 15.00, 20.00, 25.00, 50.00]
bucket_l = ['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25%+']
loan['dti_range'] = pd.cut(loan['dti'], bins, labels=bucket_l)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The following bins are created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ALYqZZCW6yCmaKMkRE5EbfA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ALYqZZCW6yCmaKMkRE5EbfA.png" alt="(Image by Author)" width="797" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
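&lt;p&gt;To confirm how the records are spread across the new buckets (a small optional check on the bin columns created above), a value_counts per column is enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# inspect how many loans fall into each bucket
for col in ['loan_amnt_range', 'int_rate_range', 'annual_inc_range', 'dti_range', 'installment']:
    print(loan[col].value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;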

&lt;h2&gt;
  
  
  Visualising Data Insights
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for amount of defaults in the data using countplot
plt.figure(figsize=(14,5))
sns.countplot(y="loan_status", data=loan)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATW68dOdVWpzIHRPGkdOX5Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATW68dOdVWpzIHRPGkdOX5Q.png" alt="(Image by Author)" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot we can see that around 16%, i.e. 5062 out of 35152 records, are defaulters.&lt;/p&gt;
&lt;/blockquote&gt;
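&lt;p&gt;The same figures can also be computed directly from the data instead of being read off the plot (a small sketch; the counts quoted above come from the author’s run):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# default count and percentage, since loan_status is already encoded as 0/1
defaults = loan["loan_status"].sum()
total = loan.shape[0]
print(defaults, total, round(defaults * 100 / total, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;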

&lt;h3&gt;
  
  
  Univariate Analysis
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function for plotting the count plot features wrt default ratio
def plotUnivariateRatioBar(feature, data=loan, figsize=(10,5), rsorted=True):
    plt.figure(figsize=figsize)
    if rsorted:
        feature_dimension = sorted(data[feature].unique())
    else:
        feature_dimension = data[feature].unique()
    feature_values = []
    for fd in feature_dimension:
        feature_filter = data[data[feature]==fd]
        feature_count = len(feature_filter[feature_filter["loan_status"]==1])
        feature_values.append(feature_count*100/feature_filter["loan_status"].count())
    plt.bar(feature_dimension, feature_values, color='orange', edgecolor='white')
    plt.title("Loan Defaults wrt "+str(feature)+" feature - countplot")
    plt.xlabel(feature, fontsize=16)
    plt.ylabel("defaulter %", fontsize=16)
    plt.show()

# function to plot univariate with default status scale 0 - 1
def plotUnivariateBar(x, figsize=(10,5)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', data=loan)
    plt.title("Loan Defaults wrt "+str(x)+" feature - countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;a. Categorical Features&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt term in the data using countplot
plotUnivariateBar("term", figsize=(8,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AuK1lpSTHtBZxd6GrBtHuNg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AuK1lpSTHtBZxd6GrBtHuNg.png" alt="(Image by Author)" width="513" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘term’ we can infer that the default rate increases with term, hence the chances of a loan getting defaulted are lower for 36 months than for 60 months.&lt;br&gt;
&lt;strong&gt;is term beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt grade in the data using countplot
plotUnivariateRatioBar("grade")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9z50C6UrHj94vlit9MkQbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9z50C6UrHj94vlit9MkQbQ.png" alt="(Image by Author)" width="614" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘grade’ we can infer that the default rate increases with grade, hence the chances of a loan getting defaulted increase as the grade moves from A towards G.&lt;br&gt;
&lt;strong&gt;is grade beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt sub_grade in the data using countplot
plotUnivariateBar("sub_grade", figsize=(16,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AmJU8rDj899qdgLhYs4QciQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AmJU8rDj899qdgLhYs4QciQ.png" alt="(Image by Author)" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘sub_grade’ we can infer that the default rate increases with sub_grade, hence the chances of a loan getting defaulted increase as the sub_grade moves from A1 towards G5.&lt;br&gt;
&lt;strong&gt;is sub_grade beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt home_ownership in the data 
plotUnivariateRatioBar("home_ownership")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoPEtddeaAlMCePsuUOie-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AoPEtddeaAlMCePsuUOie-Q.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘home_ownership’ we can infer that the default rate is roughly constant (it is somewhat higher for OTHER, but since we don’t know what that category contains, we won’t consider it for the analysis), hence default does not depend on home_ownership.&lt;br&gt;
&lt;strong&gt;is home_ownership beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt verification_status in the data
plotUnivariateRatioBar("verification_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AbN8Np1G9mY5GhICjZeYAnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AbN8Np1G9mY5GhICjZeYAnw.png" alt="(Image by Author)" width="614" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘verification_status’ we can infer that the default rate is lower for Not Verified users than for Verified ones, but this is not useful for the analysis.&lt;br&gt;
&lt;strong&gt;is verification_status beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt purpose in the data using countplot
plotUnivariateBar("purpose", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0CNZVWG6Y6eeP0zvEb9E6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A0CNZVWG6Y6eeP0zvEb9E6A.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘purpose’ we can infer that the default rate is nearly constant for all purpose types except ‘small business’, hence the rate depends on the purpose of the loan.&lt;br&gt;
&lt;strong&gt;is purpose beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt open_acc in the data using countplot
plotUnivariateRatioBar("open_acc", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-p8jXygS8l5TIJ6r_VfD9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A-p8jXygS8l5TIJ6r_VfD9Q.png" alt="(Image by Author)" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘open_acc’ we can infer that the default rate is nearly constant across open_acc values, hence the rate does not depend on the open_acc feature.&lt;br&gt;
&lt;strong&gt;is open_acc beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt pub_rec in the data using countplot
plotUnivariateRatioBar("pub_rec")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4CzY6Ijk7ZppfbQp36_2Zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A4CzY6Ijk7ZppfbQp36_2Zg.png" alt="(Image by Author)" width="614" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘pub_rec’ we can infer that the default rate is somewhat increasing, being lower for 0 and higher for pub_rec value 1; but since the non-zero values are very rare compared to 0, we won’t consider this feature.&lt;br&gt;
&lt;strong&gt;is pub_rec beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;b. Continuous Features&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt emp_length in the data using countplot
plotUnivariateBar("emp_length", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5G4B5xJEs5Ex_eXgqQXXpA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5G4B5xJEs5Ex_eXgqQXXpA.png" alt="(Image by Author)" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘emp_length’ we can infer that the default rate is roughly constant, hence default does not depend on emp_length.&lt;br&gt;
&lt;strong&gt;is emp_length beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt month in the data using countplot
plotUnivariateBar("month", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABUSR3e7wwe08zBUy_zCdOw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABUSR3e7wwe08zBUy_zCdOw.png" alt="(Image by Author)" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘month’ we can infer that the default rate is nearly constant, so this feature is not useful.&lt;br&gt;
&lt;strong&gt;is month beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt year in the data using countplot
plotUnivariateBar("year")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArR-YGz2i2DlJ9ohEIaIlCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ArR-YGz2i2DlJ9ohEIaIlCQ.png" alt="(Image by Author)" width="625" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘year’ we can infer that the default rate is nearly constant, so this feature is not useful.&lt;br&gt;
&lt;strong&gt;is year beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt earliest_cr_line in the data
plotUnivariateBar("earliest_cr_line", figsize=(16,10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApoJ_DLqP2J54jNEKdt_pPw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApoJ_DLqP2J54jNEKdt_pPw.png" alt="(Image by Author)" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘earliest_cr_line’ we can infer that the default rate is nearly constant across years except around ’65, hence the rate does not depend on the person’s earliest_cr_line.&lt;br&gt;
&lt;strong&gt;is earliest_cr_line beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt inq_last_6mths in the data
plotUnivariateBar("inq_last_6mths")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApcjCBtvhxDY8usHTjcKmhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ApcjCBtvhxDY8usHTjcKmhg.png" alt="(Image by Author)" width="618" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘inq_last_6mths’ we can infer that the default rate does not increase consistently with inq_last_6mths, hence this feature is not useful.&lt;br&gt;
&lt;strong&gt;is inq_last_6mths beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt revol_util in the data using countplot
plotUnivariateRatioBar("revol_util", figsize=(16,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AQC-L8uGtP82P4JkiJpIWgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AQC-L8uGtP82P4JkiJpIWgg.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘revol_util’ we can infer that the default rate fluctuates, with some values showing a full 100% default ratio, and that it generally increases with the magnitude, hence the rate depends on the revol_util feature.&lt;br&gt;
&lt;strong&gt;is revol_util beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt total_acc in the data using countplot
plotUnivariateRatioBar("total_acc", figsize=(14,6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFUKyC8bBeYDKbLywvcmmiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFUKyC8bBeYDKbLywvcmmiA.png" alt="(Image by Author)" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘total_acc’ we can infer that the default rate is nearly constant for all total_acc values, hence the rate does not depend on the total_acc feature.&lt;br&gt;
&lt;strong&gt;is total_acc beneficial -&amp;gt; No&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range in the data using countplot
plotUnivariateBar("loan_amnt_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMuWOXJ17ZIZqYFIb0FoT4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMuWOXJ17ZIZqYFIb0FoT4w.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘loan_amnt_range’ we can infer that the default rate increases with loan_amnt_range values, hence the rate depends on the loan_amnt_range feature.&lt;br&gt;
&lt;strong&gt;is loan_amnt_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt int_rate_range in the data
plotUnivariateBar("int_rate_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIYPAh_lOZbh2iouXfcJ8FA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AIYPAh_lOZbh2iouXfcJ8FA.png" alt="(Image by Author)" width="625" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘int_rate_range’ we can infer that the default rate is decreasing with int_rate_range values, hence the rate depends on the int_rate_range feature.&lt;br&gt;
&lt;strong&gt;is int_rate_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc_range in the data
plotUnivariateBar("annual_inc_range")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUaNIPD-CvoG5KKqOp7ywyA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUaNIPD-CvoG5KKqOp7ywyA.png" alt="(Image by Author)" width="632" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘annual_inc_range’ we can infer that the default rate decreases with annual_inc_range values, hence the rate depends on the annual_inc_range feature.&lt;br&gt;
&lt;strong&gt;is annual_inc_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt dti_range in the data using countplot
plotUnivariateBar("dti_range", figsize=(16,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqB8Tk84liQ4x_YsscpiBhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqB8Tk84liQ4x_YsscpiBhw.png" alt="(Image by Author)" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘dti_range’ we can infer that the default rate increases with dti_range values, hence the rate depends on the dti_range feature.&lt;br&gt;
&lt;strong&gt;is dti_range beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt installment range in the data
plotUnivariateBar("installment", figsize=(8,5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUrIIOXa7ImOQqvGFFQNzBw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUrIIOXa7ImOQqvGFFQNzBw.png" alt="(Image by Author)" width="520" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot for ‘installment’ we can infer that the default rate increases with installment values, hence the rate depends on the installment feature.&lt;br&gt;
&lt;strong&gt;is installment beneficial -&amp;gt; Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Therefore, the following are the important features we deduced from the above univariate analysis:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, dti, installment&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bivariate Analysis
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function to plot scatter plot for two features
def plotScatter(x, y):
    plt.figure(figsize=(16,6))
    sns.scatterplot(x=x, y=y, hue="loan_status", data=loan)
    plt.title("Scatter plot between "+x+" and "+y)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()

def plotBivariateBar(x, hue, figsize=(16,6)):
    plt.figure(figsize=figsize)
    sns.barplot(x=x, y='loan_status', hue=hue, data=loan)
    plt.title("Loan Default ratio wrt "+x+" feature for hue "+hue+" in the data using countplot")
    plt.xlabel(x, fontsize=16)
    plt.ylabel("defaulter ratio", fontsize=16)
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plotting pairs of features with respect to the loan default ratio using bar plots and scatter plots.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc and purpose in the data using countplot
plotBivariateBar("annual_inc_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGLZMhgWep9chvY7TvF787A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGLZMhgWep9chvY7TvF787A.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From the above plot, we can infer that it doesn’t show any correlation.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt term and purpose in the data 
plotBivariateBar("term", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqAms9crOazFvEQTP7tYYAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AqAms9crOazFvEQTP7tYYAA.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt term.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt grade and purpose in the data 
plotBivariateBar("grade", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A739tg_5vL-ESU2XHDFxL7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A739tg_5vL-ESU2XHDFxL7w.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt grade.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range and purpose in the data
plotBivariateBar("loan_amnt_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AapiMX_RuUx0Rhw6egOBNVg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AapiMX_RuUx0Rhw6egOBNVg.png" alt="(Image by Author)" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt loan_amnt_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt loan_amnt_range and term in the data
plotBivariateBar("loan_amnt_range", "term")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACpAeRCM7ioXtbrozUqgW9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACpAeRCM7ioXtbrozUqgW9g.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every term wrt loan_amnt_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt annual_inc_range and purpose in the data
plotBivariateBar("annual_inc_range", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AYBPNPOuP8Sob9pfT-E5Qeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AYBPNPOuP8Sob9pfT-E5Qeg.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt annual_inc_range.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check for defaulters wrt installment and purpose in the data
plotBivariateBar("installment", "purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKzawDl8vXodiSYj1_xmROQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKzawDl8vXodiSYj1_xmROQ.png" alt="(Image by Author)" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the plot shows, the default ratio increases for every purpose wrt installment, except for small_business.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" alt="(Image by Author)" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for funded_amnt_inv with dti
plotScatter("funded_amnt_inv", "dti")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aqj9I9HUyqTbMSMkgP9nDTQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aqj9I9HUyqTbMSMkgP9nDTQ.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for funded_amnt_inv with annual_inc
plotScatter("annual_inc", "funded_amnt_inv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Agkqu1SdnS70kyurViDVCzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Agkqu1SdnS70kyurViDVCzg.png" alt="(Image by Author)" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a sloped pattern on the plot, there is a positive relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for loan_amnt with int_rate
plotScatter("loan_amnt", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACDGz5t6IKJs3PN3jhAp4ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACDGz5t6IKJs3PN3jhAp4ig.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for int_rate with annual_inc
plotScatter("int_rate", "annual_inc")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AlVoAve3aZZCjQ0pJS4JR8g.png" alt="(Image by Author)" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a negative correlation pattern with reduced density on the plot, there is some relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for earliest_cr_line with int_rate
plotScatter("earliest_cr_line", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKQ-MFXf0NnrNXHg2RJRAyQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKQ-MFXf0NnrNXHg2RJRAyQ.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see a positive correlation pattern with increasing density on the plot, there is a correlation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - Y&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for annual_inc with emp_length
plotScatter("annual_inc", "emp_length")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzVCmy-ZlKvcLaqzIq5WT3Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzVCmy-ZlKvcLaqzIq5WT3Q.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we can see straight-line patterns on the plot, there is no relation between the above-mentioned features.&lt;br&gt;
&lt;strong&gt;related - N&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot scatter for earliest_cr_line with dti
plotScatter("earliest_cr_line", "dti")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH48tltYADFRY-Vesan0ahw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH48tltYADFRY-Vesan0ahw.png" alt="(Image by Author)" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plotting pairs of features with respect to the loan default status using box plots and violin plots.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function to plot boxplot for comparing two features
def plotBox(x, y, hue="loan_status"):
    plt.figure(figsize=(16,6))
    sns.boxplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Box plot between "+x+" and "+y+" for each "+hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()
    plt.figure(figsize=(16,8))
    sns.violinplot(x=x, y=y, data=loan, hue=hue, order=sorted(loan[x].unique()))
    plt.title("Violin plot between "+x+" and "+y+" for each "+hue)
    plt.xlabel(x, fontsize=16)
    plt.ylabel(y, fontsize=16)
    plt.show()

# plot box for term vs int_rate for each loan_status
plotBox("term", "int_rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ADLE3T9btaqwI6S1_KxzMkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ADLE3T9btaqwI6S1_KxzMkw.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AAUD_Vy97bK8rXTn50XczPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AAUD_Vy97bK8rXTn50XczPA.png" alt="(Image by Author)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate increases with the loan term, and the chances of default also increase&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for loan_status vs int_rate for each purpose
plotBox("loan_status", "int_rate", hue="purpose")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFuG__UlU9QGgUvDfgwRhYQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFuG__UlU9QGgUvDfgwRhYQ.png" alt="(Image by Author)" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AZ2kMugNfcHx1cl3R9E-pqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AZ2kMugNfcHx1cl3R9E-pqw.png" alt="(Image by Author)" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate is noticeably higher where the loan is defaulted, for every purpose value&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for purpose vs revo_util for each status
plotBox("purpose", "revol_util")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Anp0nayXZmReLF5xAXWFonw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Anp0nayXZmReLF5xAXWFonw.png" alt="(Image by Author)" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACHgaGWi5iEp_ibrcc6ynkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACHgaGWi5iEp_ibrcc6ynkQ.png" alt="(Image by Author)" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;revol_util is higher for every purpose value where the loan is defaulted, and is quite high for credit_card&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for grade vs int_rate for each loan_status
plotBox("grade", "int_rate", "loan_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AP60vD-fSPsXU2jBbg5IdpA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AP60vD-fSPsXU2jBbg5IdpA.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AtxlvMupNHkXH2FA-upxoQw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AtxlvMupNHkXH2FA-upxoQw.png" alt="(Image by Author)" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate increases with every grade, and for every grade the defaulters’ median lies near the non-defaulters’ 75% quantile of int_rate&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot box for issue_d vs int_rate for each loan_status
plotBox("month", "int_rate", "loan_status")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A6WSUa5MRmG40EJbfFYogHg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A6WSUa5MRmG40EJbfFYogHg.png" alt="(Image by Author)" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Au5iqMyGYQndTSQxEcF_Ztg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Au5iqMyGYQndTSQxEcF_Ztg.png" alt="(Image by Author)" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;int_rate for defaulters is higher in every month, with the defaulters’ median near the non-defaulters’ 75% quantile of int_rate; but since it is almost constant across months, this is not useful&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Therefore, the following are the important features we deduced from the above bivariate analysis:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, pub_rec, revol_util, funded_amnt_inv, int_rate, annual_inc, installment&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multivariate Analysis (Correlation)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# plot heat map to see correlation between features
continuous_f = ["funded_amnt_inv", "annual_inc", "term", "int_rate", "loan_status", "revol_util", "pub_rec", "earliest_cr_line"]
loan_corr = loan[continuous_f].corr()
sns.heatmap(loan_corr,vmin=-1.0,vmax=1.0,annot=True, cmap="YlGnBu")
plt.title("Correlation Heatmap")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AdN-lNZl4NBKIcx-67PDt1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AdN-lNZl4NBKIcx-67PDt1Q.png" alt="(Image by Author)" width="753" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hence, the important related features from the above &lt;strong&gt;Multivariate analysis&lt;/strong&gt; are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv&lt;/strong&gt;&lt;/p&gt;
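&lt;p&gt;To rank the numeric features by how strongly they correlate with default (a small addition on top of the heatmap, reusing the &lt;em&gt;loan_corr&lt;/em&gt; matrix computed above), the loan_status column of the correlation matrix can be sorted directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rank features by absolute correlation with loan_status
status_corr = loan_corr["loan_status"].drop("loan_status")
print(status_corr.reindex(status_corr.abs().sort_values(ascending=False).index))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;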

&lt;h2&gt;
  
  
  Final Findings
&lt;/h2&gt;

&lt;p&gt;After analysing all the related features available in the dataset, we conclude by deducing the main &lt;em&gt;driving features&lt;/em&gt; for the &lt;strong&gt;Lending Club Loan Default&lt;/strong&gt; analysis:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The best driving features for the Loan default analysis are:&lt;/em&gt; &lt;strong&gt;term, grade, purpose, revol_util, int_rate, installment, annual_inc, funded_amnt_inv&lt;/strong&gt;&lt;/p&gt;
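&lt;p&gt;If you were to take this forward into modelling (which is outside the scope of this analysis), one possible next step would be to collect the driving features into a single frame, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical next step: assemble the driving features for a future model
driver_cols = ["term", "grade", "purpose", "revol_util", "int_rate",
               "installment", "annual_inc", "funded_amnt_inv", "loan_status"]
model_df = loan[driver_cols].copy()
print(model_df.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;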

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Classify any Object using pre-trained CNN Model</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Fri, 10 Jul 2020 14:55:47 +0000</pubDate>
      <link>https://forem.com/imsparsh/classify-any-object-using-pre-trained-cnn-model-1pbm</link>
      <guid>https://forem.com/imsparsh/classify-any-object-using-pre-trained-cnn-model-1pbm</guid>
      <description>&lt;h4&gt;
  
  
  Large Scale Image Classification using pre-trained Inception v3 Convolution Neural Network Model
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10000%2F0%2AhPVHCGVgepguJt8I" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10000%2F0%2AhPVHCGVgepguJt8I" alt="Photo by [Lenin Estrada](https://unsplash.com/@lenin33?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="760" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today we have a super-effective technique called &lt;strong&gt;Transfer Learning&lt;/strong&gt;, where we can use a pre-trained model by &lt;strong&gt;Google AI&lt;/strong&gt; to classify images of a wide range of visual object classes in computer vision.&lt;/p&gt;

&lt;p&gt;Transfer learning is a machine learning method which utilizes a pre-trained neural network. Here, the &lt;em&gt;image recognition&lt;/em&gt; model called Inception-v3 consists of two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature extraction&lt;/strong&gt; part with a convolutional neural network.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification&lt;/strong&gt; part with fully-connected and softmax layers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Inception-v3 is a pre-trained convolutional neural network model that is 48 layers deep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is a version of the network already trained on more than a million images from the &lt;a href="http://www.image-net.org" rel="noopener noreferrer"&gt;&lt;strong&gt;ImageNet&lt;/strong&gt;&lt;/a&gt; database. It is the third edition of Google’s Inception CNN model, originally introduced during the &lt;strong&gt;ImageNet Recognition Challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This pre-trained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 299-by-299. The model extracts general features from input images in the first part and classifies them based on those features in the second part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ut9m6knpm7jrvkk01rq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ut9m6knpm7jrvkk01rq.png" alt="Schematic diagram of Inception v3 — By Google AI" width="800" height="311"&gt;&lt;/a&gt;&lt;em&gt;Schematic diagram of Inception v3 — By Google AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Inception v3&lt;/em&gt;&lt;/strong&gt; is a widely-used image recognition model that has been shown to attain greater than 78.1% top-1 accuracy and around 93.9% top-5 accuracy on the ImageNet dataset. The model is the culmination of many ideas introduced by multiple researchers over the past years. It is based on the original paper: “&lt;a href="https://arxiv.org/abs/1512.00567" rel="noopener noreferrer"&gt;Rethinking the Inception Architecture for Computer Vision&lt;/a&gt;” by Szegedy et al.&lt;/p&gt;

&lt;p&gt;More information about the Inception architecture can be found &lt;a href="https://github.com/tensorflow/models/tree/master/research/inception" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  In Transfer Learning, when you build a new model to classify your original dataset, you reuse the feature extraction part and re-train the classification part with your dataset. Since you don’t have to train the feature extraction part (which is the most complex part of the model), you can train the model with less computational resources and training time.
&lt;/h1&gt;
&lt;/blockquote&gt;
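
&lt;p&gt;To make the idea in the quote above concrete, here is a minimal, hypothetical sketch (separate from the prediction pipeline used in the rest of this article) of reusing the frozen Inception v3 feature extractor with a new classification head via the Keras API; the number of classes and the training data are placeholders you would replace with your own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Reuse the feature extraction part: Inception v3 with ImageNet weights, no top classifier
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the feature extraction part

# Re-train only a new classification head on your own dataset (5 classes here as a placeholder)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # train_images/train_labels are your own data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;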

&lt;p&gt;In this article, we will just use the Inception v3 model to predict some images and fetch the top 5 predicted classes for the same. Let’s begin.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We are using TensorFlow v2.x, with v1 compatibility mode enabled for the Slim-based model.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Import Data
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import numpy as np
from PIL import Image
from imageio import imread
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import tf_slim as slim
from tf_slim.nets import inception
import cv2
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data Loading
&lt;/h2&gt;

&lt;p&gt;Set up all the initial variables with the default file locations and their respective values.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ckpt_path = "/kaggle/input/inception_v3.ckpt"
images_path = "/kaggle/input/animals/*"
img_width = 299
img_height = 299
batch_size = 16
batch_shape = [batch_size, img_height, img_width, 3]
num_classes = 1001
predict_output = []
class_names_path = "/kaggle/input/imagenet_class_names.txt"
with open(class_names_path) as f:
    class_names = f.readlines()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Create Inception v3 model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = tf.placeholder(tf.float32, shape=batch_shape)

with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(
        X, num_classes=num_classes, is_training=False
    )

predictions = end_points["Predictions"]
saver = tf.train.Saver(slim.get_model_variables())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Define a function that loads images in RGB mode and resizes them to the input size expected by the model before sending them for evaluation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_images(input_dir):
    global batch_shape
    images = np.zeros(batch_shape)
    filenames = []
    idx = 0
    batch_size = batch_shape[0]
    files = tf.gfile.Glob(input_dir)[:20]
    files.sort()
    for filepath in files:
        with tf.gfile.Open(filepath, "rb") as f:
            # np.float is deprecated in recent NumPy releases; read as float32 scaled to [0, 1]
            imgRaw = np.array(Image.fromarray(imread(f, as_gray=False, pilmode="RGB")).resize((299, 299))).astype(np.float32) / 255.0
        # Inception v3 expects pixel values in the [-1, 1] range
        images[idx, :, :, :] = imgRaw * 2.0 - 1.0
        filenames.append(os.path.basename(filepath))
        idx += 1
        if idx == batch_size:
            yield filenames, images
            filenames = []
            images = np.zeros(batch_shape)
            idx = 0
    if idx &amp;gt; 0:
        yield filenames, images
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Load Pre-Trained Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;session_creator = tf.train.ChiefSessionCreator(
        scaffold=tf.train.Scaffold(saver=saver),
        checkpoint_filename_with_path=ckpt_path,
        master='')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Classify Images using Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with tf.train.MonitoredSession(session_creator=session_creator) as sess:
    for filenames, images in load_images(images_path):
        labels = sess.run(predictions, feed_dict={X: images})
        for filename, label, image in zip(filenames, labels, images):
            predict_output.append([filename, label, image])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Predictions
&lt;/h2&gt;

&lt;p&gt;We will use some images from the &lt;a href="https://www.kaggle.com/alessiocorrado99/animals10" rel="noopener noreferrer"&gt;Animals-10&lt;/a&gt; dataset from Kaggle to demonstrate the model predictions.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for x in predict_output:
    out_list = list(x[1])
    topPredict = sorted(range(len(out_list)), key=lambda i: out_list[i], reverse=True)[:5]
    plt.imshow((((x[2]+1)/2)*255).astype(int))
    plt.show()
    print("Filename:",x[0])
    print("Displaying the top 5 Predictions for above image:")
    for p in topPredict:
        print(class_names[p-1].strip())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpgtybdr2f0a4vhmjtfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpgtybdr2f0a4vhmjtfb.png" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjv1n4ncuknm2fpgnq9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjv1n4ncuknm2fpgnq9d.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_nw0jzhFN7sjB8BbFzpmRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_nw0jzhFN7sjB8BbFzpmRg.png" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFKO8wrDcUE_H2hlMzq2CPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AFKO8wrDcUE_H2hlMzq2CPA.png" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Afik-gIv8cMwpyCXww3mPhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Afik-gIv8cMwpyCXww3mPhw.png" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AjvaX4wEUAQvUE8O5iMzPzQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AjvaX4wEUAQvUE8O5iMzPzQ.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, all of the images are classified correctly, and we can also see that the top 5 classes predicted by the model for each image are quite sensible and precise.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Most Common Loss Functions in Machine Learning</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Thu, 09 Jul 2020 06:11:13 +0000</pubDate>
      <link>https://forem.com/imsparsh/most-common-loss-functions-in-machine-learning-57p7</link>
      <guid>https://forem.com/imsparsh/most-common-loss-functions-in-machine-learning-57p7</guid>
      <description>&lt;h4&gt;
  
  
  Every Machine Learning Engineer should know about these common Loss functions in Machine Learning and when to use them.
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  In &lt;a href="https://en.wikipedia.org/wiki/Mathematical_optimization" rel="noopener noreferrer"&gt;mathematical optimization&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Decision_theory" rel="noopener noreferrer"&gt;decision theory&lt;/a&gt;, a loss function or cost function is a function that maps an &lt;a href="https://en.wikipedia.org/wiki/Event_(probability_theory)" rel="noopener noreferrer"&gt;event&lt;/a&gt; or values of one or more variables onto a &lt;a href="https://en.wikipedia.org/wiki/Real_number" rel="noopener noreferrer"&gt;real number&lt;/a&gt; intuitively representing some “cost” associated with the event.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — &lt;a href="https://en.wikipedia.org/wiki/Loss_function" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7132%2F1%2Al3kNYfW54bLC2b1VW4whmA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F7132%2F1%2Al3kNYfW54bLC2b1VW4whmA.jpeg" alt="Photo by [Josh Rose](https://unsplash.com/@joshsrose?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a core element, the Loss function is a method of evaluating how well your Machine Learning algorithm models your dataset. It is defined as &lt;strong&gt;a measurement of how good your model is in terms of predicting the expected outcome.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Cost function&lt;/em&gt; and &lt;em&gt;Loss function&lt;/em&gt; refer to the same idea: the loss function is calculated for each sample by comparing its predicted output to the actual value, whereas the cost function is the average of these loss values over the whole dataset.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The Loss function is directly related to the predictions of the model you have built: the lower its value, the better the model’s results. The loss function, or rather the cost function used to evaluate model performance, needs to be minimized in order to improve that performance.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s now dive into the Loss functions.&lt;/p&gt;

&lt;p&gt;Broadly speaking, the Loss functions can be grouped into two major categories concerning the types of problems that we come across in the real world — &lt;a href="https://en.wikipedia.org/wiki/Loss_functions_for_classification" rel="noopener noreferrer"&gt;&lt;strong&gt;Classification&lt;/strong&gt;&lt;/a&gt; and &lt;strong&gt;Regression&lt;/strong&gt;. In Classification, the task is to predict the respective probabilities of all classes that the problem is dealing with. In Regression, on the other hand, the task is to predict a continuous value from a given set of independent features fed to the learning algorithm.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&lt;strong&gt;Assumptions:&lt;/strong&gt;&lt;br&gt;
    n/m — Number of training samples.&lt;br&gt;
    i — ith training sample in a dataset.&lt;br&gt;
    y(i) — Actual value for the ith training sample.&lt;br&gt;
    y_hat(i) — Predicted value for the ith training sample.&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Classification Losses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Binary Cross-Entropy Loss / Log Loss
&lt;/h3&gt;

&lt;p&gt;This is the most common Loss function used in Classification problems. The cross-entropy loss decreases as the predicted probability converges to the actual label. It measures the performance of a classification model whose predicted output is a probability value between 0 and 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the number of classes is 2, &lt;em&gt;Binary Classification&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxv6xyr4evx6svdezt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxv6xyr4evx6svdezt6.png" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the number of classes is more than 2, &lt;em&gt;Multi-class Classification&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgorsl0meqc743rkdj1jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgorsl0meqc743rkdj1jl.png" width="561" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlyi7bv988qxi8k8p56l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlyi7bv988qxi8k8p56l.png" width="556" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Cross-Entropy Loss formula is derived from the likelihood function: taking the negative logarithm of the likelihood turns the product of predicted probabilities into the sum of log terms shown above.&lt;/p&gt;
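
&lt;p&gt;As a minimal illustration of the binary case above, here is a small NumPy sketch (not from the original derivation) that computes the Binary Cross-Entropy for a hypothetical batch of labels and predicted probabilities:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_pred))  # roughly 0.41
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;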

&lt;h3&gt;
  
  
  2. Hinge Loss
&lt;/h3&gt;

&lt;p&gt;The second most common loss function used for Classification problems and an alternative to Cross-Entropy loss function is Hinge Loss, primarily developed for Support Vector Machine (SVM) model evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ck5uqwtr4wl5lb81jl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ck5uqwtr4wl5lb81jl.jpeg" width="421" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AaxK_lLrWa20u8_tM0V7uBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AaxK_lLrWa20u8_tM0V7uBQ.png" width="424" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hinge Loss not only penalizes the wrong predictions but also the right predictions that are not confident. It is primarily used with SVM Classifiers with class labels as -1 and 1. Make sure you change your negative class labels from 0 to -1.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
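
&lt;p&gt;A minimal NumPy sketch of Hinge Loss, assuming class labels of -1 and 1 and raw model scores as input (the numbers are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def hinge_loss(y_true, scores):
    # y_true must use labels -1 and 1; scores are raw (unbounded) model outputs
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([0.8, -0.5, 0.3, 0.6])  # the last prediction is on the wrong side
print(hinge_loss(y_true, scores))  # 0.75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;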

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2A4ZxpnNHNK9ehInmM" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2A4ZxpnNHNK9ehInmM" alt="Photo by [Jen Theodore](https://unsplash.com/@jentheodore?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Regression Losses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Mean Square Error / Quadratic Loss / L2 Loss
&lt;/h3&gt;

&lt;p&gt;MSE loss function is defined as the average of squared differences between the actual and the predicted value. It is the most commonly used Regression loss function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ASRKwqe7YM2-cV3PMMN7VTg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ASRKwqe7YM2-cV3PMMN7VTg.png" width="581" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH1mkqlq7buDZ7rDWD_4qIw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AH1mkqlq7buDZ7rDWD_4qIw.png" width="576" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The corresponding cost function is the &lt;strong&gt;Mean&lt;/strong&gt; of these &lt;strong&gt;Squared Errors (MSE)&lt;/strong&gt;. The MSE Loss function penalizes the model for making large errors by squaring them and this property makes the MSE cost function less robust to outliers. Therefore, &lt;em&gt;it should not be used if the data is prone to many outliers.&lt;/em&gt;&lt;/p&gt;
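
&lt;p&gt;A minimal NumPy sketch of the MSE loss on a hypothetical set of actual and predicted values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def mse_loss(y_true, y_pred):
    # Average of squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mse_loss(y_true, y_pred))  # 0.875
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;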

&lt;h3&gt;
  
  
  2. Mean Absolute Error / L1 Loss
&lt;/h3&gt;

&lt;p&gt;MAE loss function is defined as the average of absolute differences between the actual and the predicted value. It is the second most commonly used Regression loss function. It measures the average magnitude of errors in a set of predictions, without considering their directions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5A-4HIx11hquyDaOImEFkA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A5A-4HIx11hquyDaOImEFkA.png" width="644" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AID63kMgh8F0fcgmsrmjS5A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AID63kMgh8F0fcgmsrmjS5A.png" width="576" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The corresponding cost function is the &lt;strong&gt;Mean&lt;/strong&gt; of these &lt;strong&gt;Absolute Errors (MAE)&lt;/strong&gt;. The MAE Loss function is more robust to outliers compared to MSE Loss function. Therefore, &lt;em&gt;it should be used if the data is prone to many outliers.&lt;/em&gt;&lt;/p&gt;
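
&lt;p&gt;The corresponding NumPy sketch for MAE, using the same hypothetical values as above; note how the large errors are penalized less heavily than under MSE:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def mae_loss(y_true, y_pred):
    # Average of absolute differences between actual and predicted values
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mae_loss(y_true, y_pred))  # 0.75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;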

&lt;h3&gt;
  
  
  3. Huber Loss / Smooth Mean Absolute Error
&lt;/h3&gt;

&lt;p&gt;Huber loss function is defined as a combination of the MSE and MAE Loss functions: it approaches &lt;strong&gt;MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers)&lt;/strong&gt;. It is essentially Mean Absolute Error that becomes quadratic when the error is small. How small the error has to be for the loss to become quadratic is controlled by a hyperparameter, 𝛿 (delta), which can be tuned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVx7otH8Vzkkw_Gxzt9xpgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVx7otH8Vzkkw_Gxzt9xpgg.png" width="525" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkBXeqZvMpfVMsYry0Chs5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkBXeqZvMpfVMsYry0Chs5g.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The choice of the delta value is critical because it determines what you’re willing to consider as an outlier. Hence, the Huber Loss function could be less sensitive to outliers than the MSE Loss function, depending upon the hyperparameter value. Therefore, &lt;strong&gt;&lt;em&gt;it can be used if the data is prone to outliers, and&lt;/em&gt; we might need to tune the hyperparameter delta, which is an iterative process.&lt;/strong&gt;&lt;/p&gt;
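
&lt;p&gt;A minimal NumPy sketch of the Huber loss, with delta treated as a tunable hyperparameter (delta = 1.0 here is only an example value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    small = np.abs(error) &amp;lt;= delta                     # quadratic (MSE-like) region
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)      # linear (MAE-like) region
    return np.mean(np.where(small, squared, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(huber_loss(y_true, y_pred, delta=1.0))  # 0.40625
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;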

&lt;h3&gt;
  
  
  4. Log-Cosh Loss
&lt;/h3&gt;

&lt;p&gt;The Log-Cosh loss function is defined as the logarithm of the hyperbolic cosine of the prediction error. It is another function used in regression tasks that is much smoother than MSE Loss. It has all the advantages of Huber loss and, unlike Huber loss, it is twice differentiable everywhere. This matters because some learning algorithms such as XGBoost use Newton’s method to find the optimum, and hence need the second derivative (&lt;em&gt;Hessian&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGrNtSStzBEwM343vuZoIIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AGrNtSStzBEwM343vuZoIIQ.png" width="436" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVQf8ToK0Td-XQjfuWrD-ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AVQf8ToK0Td-XQjfuWrD-ew.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that ‘logcosh’ works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.&lt;/em&gt;&lt;br&gt;
 — &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh" rel="noopener noreferrer"&gt;Tensorflow Docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
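
&lt;p&gt;A minimal NumPy sketch of the Log-Cosh loss on the same hypothetical values (note that np.cosh can overflow for very large errors, so this is only a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def log_cosh_loss(y_true, y_pred):
    # Logarithm of the hyperbolic cosine of the prediction error
    return np.mean(np.log(np.cosh(y_pred - y_true)))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(log_cosh_loss(y_true, y_pred))  # roughly 0.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;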

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Quantile Loss&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A quantile is a value below which a given fraction of samples in a group falls. Machine learning models work by minimizing (or maximizing) an objective function, and as the name suggests, the quantile regression loss function is applied to predict quantiles. For a set of predictions, the loss is the average of the per-sample quantile losses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AS4LuBEEuOrPHv55YBisPNQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AS4LuBEEuOrPHv55YBisPNQ.png" width="717" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AG9ei3IOtTOz2N8QCTj-37Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AG9ei3IOtTOz2N8QCTj-37Q.png" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/deep-quantile-regression-c85481548b5a" rel="noopener noreferrer"&gt;Quantile loss function&lt;/a&gt; turns out to be useful when we are interested in predicting an interval instead of only point predictions.&lt;/p&gt;

&lt;p&gt;Thank you for reading! I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or on my &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6030%2F1%2ANgtElaMk5jG3trBElXPG8A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F6030%2F1%2ANgtElaMk5jG3trBElXPG8A.jpeg" alt="Photo by [Crawford Jolly](https://unsplash.com/@crawford?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>What makes Logistic Regression a Classification Algorithm?</title>
      <dc:creator>Sparsh Gupta</dc:creator>
      <pubDate>Mon, 06 Jul 2020 15:36:08 +0000</pubDate>
      <link>https://forem.com/imsparsh/what-makes-logistic-regression-a-classification-algorithm-199l</link>
      <guid>https://forem.com/imsparsh/what-makes-logistic-regression-a-classification-algorithm-199l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5786%2F1%2AKE_Ccr6hyC_ZK0Or1kfhiQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5786%2F1%2AKE_Ccr6hyC_ZK0Or1kfhiQ.jpeg" alt="Photo by [Caleb Jones](https://unsplash.com/@gcalebjones?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral) — Edited" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  Logistic regression is a &lt;a href="https://en.wikipedia.org/wiki/Statistical_model" rel="noopener noreferrer"&gt;statistical model&lt;/a&gt; that in its basic form uses a &lt;a href="https://en.wikipedia.org/wiki/Logistic_function" rel="noopener noreferrer"&gt;logistic function&lt;/a&gt; to model a &lt;a href="https://en.wikipedia.org/wiki/Binary_variable" rel="noopener noreferrer"&gt;binary&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Dependent_variable" rel="noopener noreferrer"&gt;dependent variable&lt;/a&gt;, although many more complex &lt;a href="https://en.wikipedia.org/wiki/Logistic_regression#Extensions" rel="noopener noreferrer"&gt;extensions&lt;/a&gt; exist.
&lt;/h1&gt;
&lt;h1&gt;
  
  
  — Wikipedia.
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;— All the images (plots) are generated and modified by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Probably, for every Data Practitioner, &lt;strong&gt;&lt;em&gt;Linear Regression&lt;/em&gt;&lt;/strong&gt; happens to be the starting point when implementing Machine Learning, where you learn about &lt;em&gt;predicting a continuous value for a given set of independent features&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Logistic, not Linear?
&lt;/h3&gt;

&lt;p&gt;Let us start with the most basic case, &lt;strong&gt;&lt;em&gt;Binary Classification&lt;/em&gt;&lt;/strong&gt;: the model should be able to predict the dependent variable as one of two possible classes, &lt;em&gt;0 or 1&lt;/em&gt;. If we consider using &lt;em&gt;Linear Regression&lt;/em&gt;, we can predict a value for the given set of input features, but the model will forecast continuous values like 0.03, +1.2, -0.9, etc., which are neither suitable for assigning one of the two classes nor interpretable as a probability of belonging to a class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;E.g.&lt;/em&gt;&lt;/strong&gt; When we have to predict whether a website is malicious, given the length of its URL as a feature, the response variable has two values: benign and malicious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt0jmtaa2sim7o09kcxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt0jmtaa2sim7o09kcxx.png" alt="Linear Regression on categorical data — By Author" width="363" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we try to fit a Linear Regression model to a binary classification problem, the fitted model is a straight line, and the plot above shows why it is not suitable for this task.&lt;/p&gt;

&lt;p&gt;To overcome this problem, we use a &lt;strong&gt;&lt;em&gt;sigmoid function&lt;/em&gt;&lt;/strong&gt;, which fits an S-shaped curve to the data and yields a much better model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic/Sigmoid Function
&lt;/h2&gt;

&lt;p&gt;The Logistic Regression can be explained with &lt;em&gt;Logistic function&lt;/em&gt;, also known as &lt;em&gt;Sigmoid function&lt;/em&gt; that takes any real input &lt;em&gt;x&lt;/em&gt;, and outputs a probability value between 0 and 1 which is defined as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2of9dfr64io9ksbgfb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2of9dfr64io9ksbgfb7.png" width="598" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model fit using the above Logistic function can be seen as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz9uevet9pj7yu3orvfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz9uevet9pj7yu3orvfv.png" alt="Logistic Regression on categorical data — By Author" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Further, let us consider &lt;em&gt;t&lt;/em&gt; as a linear function of a single explanatory variable in a univariate regression model, where &lt;em&gt;β0&lt;/em&gt; is the intercept and &lt;em&gt;β1&lt;/em&gt; is the slope, given by,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30dpwi7vjaqk7o7u5a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30dpwi7vjaqk7o7u5a7.png" width="344" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The general Logistic function &lt;em&gt;p&lt;/em&gt; which outputs a value between 0 and 1 will become,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Ac5vPslbYETp28VpmYRvPVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Ac5vPslbYETp28VpmYRvPVA.png" width="691" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that data separable into two classes can be modelled using a Logistic function applied to a linear function of the given variable. However, the relation between the input variable x and the output probability produced by the sigmoid function is not easy to interpret directly, so we now introduce the &lt;strong&gt;&lt;em&gt;Logit&lt;/em&gt;&lt;/strong&gt; (log-odds) function, which makes this model interpretable in a linear fashion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logit (Log-Odds) Function
&lt;/h2&gt;

&lt;p&gt;The Log-odds function, &lt;em&gt;a.k.a. the natural logarithm of the odds&lt;/em&gt;, is the inverse of the standard Logistic function and can be defined and further simplified as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2742%2F1%2Akv33IT2dtfjRPG1xcSAAvA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2742%2F1%2Akv33IT2dtfjRPG1xcSAAvA.png" width="800" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above equation, the terms are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;g&lt;/em&gt; is the &lt;a href="https://en.wikipedia.org/wiki/Logit" rel="noopener noreferrer"&gt;logit&lt;/a&gt; function. The equation for &lt;em&gt;g(p(x))&lt;/em&gt; shows that the logit is equivalent to the linear regression expression&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;ln&lt;/em&gt; denotes the &lt;a href="https://en.wikipedia.org/wiki/Natural_logarithm" rel="noopener noreferrer"&gt;natural logarithm&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;p(x)&lt;/em&gt; is the probability of the dependent variable that falls in one of the two classes 0 or 1, given some linear combination of the predictors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;β0&lt;/em&gt; is the &lt;a href="https://en.wikipedia.org/wiki/Y-intercept" rel="noopener noreferrer"&gt;intercept&lt;/a&gt; from the linear regression equation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;β1&lt;/em&gt; is the regression coefficient multiplied by some value of the predictor&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On further simplifying the above equation and exponentiating both sides, we can deduce the relationship between the probability and the linear model as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa1wdaYLavV_K3VSqQPMIKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa1wdaYLavV_K3VSqQPMIKg.png" width="456" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left term is called &lt;strong&gt;&lt;em&gt;odds&lt;/em&gt;&lt;/strong&gt;, which is defined as equivalent to the exponential function of the linear regression expression. With &lt;em&gt;ln&lt;/em&gt; (log base e) on both sides, we can interpret the relation as linear between the log-odds and the independent variable &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;
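
&lt;p&gt;Continuing the hypothetical values from the earlier sketch, a quick numeric check shows that the log-odds of the predicted probability recover exactly the linear expression β0 + β1x:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

beta0, beta1 = -4.0, 2.0                  # same hypothetical intercept and slope as before
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

log_odds = np.log(p / (1 - p))            # the logit of the predicted probability
print(log_odds.round(3))                  # [-4. -2.  0.  2.  4.], i.e. beta0 + beta1 * x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;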

&lt;h3&gt;
  
  
  Why Regression?
&lt;/h3&gt;

&lt;p&gt;The change in probability &lt;em&gt;p(x)&lt;/em&gt; with a change in variable x cannot be understood directly, as it is defined by the sigmoid function. But from the above expression, we can see that the log-odds change linearly with a change in the variable &lt;em&gt;x&lt;/em&gt; itself. The plot of the log-odds against the linear equation can be seen as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A7ETm9YJTSixm_Y7ovZrD7w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A7ETm9YJTSixm_Y7ovZrD7w.jpeg" alt="Log-odds vs independent variable x — By Author" width="578" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The probability outcome of the dependent variable shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation with the sigmoid function, the resulting probability &lt;em&gt;p(x)&lt;/em&gt; ranges between 0 and 1, i.e. 0&amp;lt;p&amp;lt;1. Therefore, this is &lt;strong&gt;what makes Logistic Regression, a regression at heart, work as a Classification algorithm&lt;/strong&gt;: it assigns the value of the linear regression to a particular class depending upon the decision boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Boundary
&lt;/h2&gt;

&lt;p&gt;The decision boundary is defined as a &lt;em&gt;threshold&lt;/em&gt; value that helps us classify the predicted probability value given by the sigmoid function into a particular class, positive or negative.&lt;/p&gt;
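
&lt;p&gt;A minimal sketch of applying the threshold, assuming the conventional value of 0.5:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

p = np.array([0.02, 0.35, 0.51, 0.88])    # predicted probabilities from the sigmoid
threshold = 0.5                           # the decision boundary
predicted_class = (p &amp;gt;= threshold).astype(int)
print(predicted_class)  # [0 0 1 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;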

&lt;h3&gt;
  
  
  Linear Decision Boundary
&lt;/h3&gt;

&lt;p&gt;When two or more classes can be linearly separable,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUEMtd_lo2ve5M1jTYVg4DQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AUEMtd_lo2ve5M1jTYVg4DQ.jpeg" alt="Linear Decision Boundary — By Author" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Linear Boundary
&lt;/h3&gt;

&lt;p&gt;When two or more classes are not linearly separable,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F1%2ARggWqrx86u_4JhUY1zHznw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F1%2ARggWqrx86u_4JhUY1zHznw.png" alt="Non-Linear Decision Boundary — By Author" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Class Classification
&lt;/h2&gt;

&lt;p&gt;The basic intuition behind &lt;a href="https://en.wikipedia.org/wiki/Multiclass_classification" rel="noopener noreferrer"&gt;Multi-Class&lt;/a&gt; and Binary Logistic Regression is the same. However, for a multi-class classification problem, we follow a &lt;a href="https://houxianxu.github.io/implementation/One-vs-All-LogisticRegression.html" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;one v/s all classification&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. If there are multiple independent variables for the model, the traditional equation is modified as,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2050%2F1%2A7uXOwzakASQcTZ3SLMer9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2050%2F1%2A7uXOwzakASQcTZ3SLMer9Q.png" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the Log-Odds are linearly related to the multiple independent variables, and the linear regression becomes a multiple regression with &lt;em&gt;m&lt;/em&gt; explanatory variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;E.g.&lt;/em&gt;&lt;/strong&gt; If we have to predict whether the weather is sunny, rainy, or windy, we are dealing with a Multi-class problem. We turn this into three binary classification problems, i.e. whether it is sunny or not, whether it is rainy or not, and whether it is windy or not. We run all three classifiers &lt;em&gt;independently&lt;/em&gt; on the input features, and the class for which the predicted probability is the maximum becomes the solution, as illustrated in the sketch below.&lt;/p&gt;
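
&lt;p&gt;Here is a minimal, hypothetical sketch of the one v/s all idea using scikit-learn (which is not used elsewhere in this article); the toy features and labels are placeholders for real weather data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical weather features (e.g. humidity, wind speed) with labels
# 0 = sunny, 1 = rainy, 2 = windy
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.3], [0.8, 0.4], [0.3, 0.9], [0.4, 0.8]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary logistic regression per class; the class with the highest probability wins
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[0.85, 0.35]]))                 # expected: [1] (rainy)
print(clf.predict_proba([[0.85, 0.35]]).round(2))  # per-class probabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;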

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Logistic regression is one of the simplest Machine Learning models. It is easy to understand, interpretable, and can give pretty good results. Every practitioner using Logistic Regression should know about the Log-Odds, the main concept behind this learning algorithm. Logistic Regression is also highly interpretable with respect to business needs, since it explains how the model behaves with each independent variable used. This post aimed to provide an easy way to understand the idea behind the regression and the transparency provided by Logistic Regression.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading. You can find my other &lt;a href="https://towardsdatascience.com/@imsparsh" rel="noopener noreferrer"&gt;Machine Learning related posts here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at &lt;a href="https://www.linkedin.com/in/imsparsh/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
  </channel>
</rss>
