<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Milan Mitrovic</title>
    <description>The latest articles on Forem by Milan Mitrovic (@milanzmitrovic).</description>
    <link>https://forem.com/milanzmitrovic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F607979%2Fee6e26e8-9a7c-45dd-859d-a3d993d81974.jpeg</url>
      <title>Forem: Milan Mitrovic</title>
      <link>https://forem.com/milanzmitrovic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/milanzmitrovic"/>
    <language>en</language>
    <item>
      <title>Plotly Histogram</title>
      <dc:creator>Milan Mitrovic</dc:creator>
      <pubDate>Sun, 26 Dec 2021 02:30:50 +0000</pubDate>
      <link>https://forem.com/milanzmitrovic/plotly-histogram-a68</link>
      <guid>https://forem.com/milanzmitrovic/plotly-histogram-a68</guid>
      <description>&lt;p&gt;Recently I started exploring an interesting dataset: about 150k rows from a large European bank. I wanted to find out whether there is any correlation between a borrower's monthly salary and their probability of default.&lt;/p&gt;

&lt;p&gt;The first thing I charted was a histogram of monthly salary. Next, I wanted the same histogram with defaulting and non-defaulting borrowers shown in different colours.&lt;/p&gt;

&lt;p&gt;Since the dataset is strictly confidential, let's create an artificial dataset for testing purposes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import plotly.express as px

import plotly.io as pio
pio.renderers.default = 'browser'

x = np.random.exponential(size=100000, scale=20) + 50000

df = pd.DataFrame({
            'monthly_salary': x,
            'default': np.random.choice([1, 0], size=len(x))
            })

# Helper column of ones: summing it counts the clients
# that fall into each bin.
df['help_column'] = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what the generated table looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m3zeFCLd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ididbxvz0bjxosr57ob8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m3zeFCLd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ididbxvz0bjxosr57ob8.png" alt="Image description" width="828" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea is to create a histogram of monthly salaries. There are two ways: a quick one, and another that gives us more control over what is going on under the hood.&lt;/p&gt;

&lt;p&gt;Let's start with the easier approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig = px.histogram(
    data_frame=df,
    x='monthly_salary',
    nbins=200
)
fig.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what the chart looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xkiR6xs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7aszdczqulwqqmyyraw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xkiR6xs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7aszdczqulwqqmyyraw6.png" alt="Image description" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the bins are created automatically inside the Plotly Express function. We only supplied the number of bins, which Plotly treats as an upper limit on the bin count; the bin size itself is chosen for us.&lt;/p&gt;
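
&lt;p&gt;If a fixed bin width is all that is needed, one option is to set it on the histogram trace after the figure is created. The snippet below is a minimal sketch: the seed and the start, end and size values are my own assumptions, chosen to match the synthetic data above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import plotly.express as px

# Synthetic salaries, generated like the data above (seed added for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'monthly_salary': rng.exponential(scale=20, size=100000) + 50000
})

fig = px.histogram(data_frame=df, x='monthly_salary')

# Set the bins explicitly on the histogram trace:
# width 1, starting at 50000 and ending at 50100 (assumed range)
fig.update_traces(xbins=dict(start=50000, end=50100, size=1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling fig.show() afterwards displays the chart as before.&lt;/p&gt;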

&lt;p&gt;Another approach, which is a bit more complicated, is to use pandas functions to create bins of arbitrary size. After that, we assign each client to the corresponding bin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bins_ = pd.interval_range(start=50000, end=50100, freq=1)
df['monthly_salary_BINS'] = pd.cut(x=x, bins=bins_)

# Keep only the left (lower) edge of each interval instead of the full
# interval; a single number is easier to plot.
df['monthly_salary_BINS_left'] = df['monthly_salary_BINS'].apply(func=lambda interval: interval.left)

xx = df[['help_column', 'monthly_salary_BINS_left']].groupby(by='monthly_salary_BINS_left').sum().reset_index()

fig = px.bar(
    x=xx['monthly_salary_BINS_left'],
    y=xx['help_column']
)

fig.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a look at this chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hiV3AOJW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n30wf9xegwczevwxaaoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hiV3AOJW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n30wf9xegwczevwxaaoc.png" alt="Image description" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;
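
&lt;p&gt;Since the bins are now built by hand, it can be worth cross-checking the counts. The sketch below is not part of the original workflow: it recomputes the same counts with NumPy on freshly generated data (the seed is my own assumption) and also counts the rows that fall outside the 50000-50100 range and therefore get no bin.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.exponential(scale=20, size=100000) + 50000

# Counts per bin via pandas, using the same intervals as above
bins_ = pd.interval_range(start=50000, end=50100, freq=1)
binned = pd.cut(x=x, bins=bins_)
pandas_counts = pd.Series(binned).value_counts(sort=False).to_numpy()

# Counts per bin via NumPy, using the same edges.
# pandas intervals are closed on the right and NumPy bins on the left,
# so the two can only differ for values lying exactly on an edge.
edges = np.arange(50000, 50101)
numpy_counts, _ = np.histogram(x, bins=edges)

# Values above 50100 match no interval and become NaN in pd.cut
n_outside = int(pd.isna(binned).sum())

print(pandas_counts.sum(), numpy_counts.sum(), n_outside)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The binned counts plus the rows outside the range should add up to the total number of rows.&lt;/p&gt;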

&lt;p&gt;...&lt;/p&gt;

&lt;p&gt;What if we want one histogram for defaulting and one for non-defaulting borrowers? &lt;br&gt;
Again, there are two approaches.&lt;/p&gt;

&lt;p&gt;Let's start with the easier one again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig = px.histogram(
    data_frame=df,
    x='monthly_salary',
    color='default',
    nbins=200,
    barmode='group'
)
fig.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the new chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AgE02n6Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2goec1yuiz2aeyk9iee3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AgE02n6Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2goec1yuiz2aeyk9iee3.png" alt="Image description" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;
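
&lt;p&gt;Grouped bars are only one way to compare the two groups. As a variation, the sketch below overlays the two histograms and normalises each one to percentages, which may help when the groups differ in size. The string labels, seed and opacity value are my own assumptions, not part of the original data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(0)
x = rng.exponential(scale=20, size=100000) + 50000

df = pd.DataFrame({
    'monthly_salary': x,
    # String labels (assumed) make it explicit that colour is a category
    'default': rng.choice(['default', 'non-default'], size=len(x)),
})

fig = px.histogram(
    data_frame=df,
    x='monthly_salary',
    color='default',
    nbins=200,
    barmode='overlay',   # draw both groups on top of each other
    opacity=0.6,         # keep the group underneath visible
    histnorm='percent',  # each group's bars sum to 100
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As before, fig.show() renders the figure.&lt;/p&gt;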

&lt;p&gt;The second way of plotting the histogram requires pivoting the data. The idea is to create a cross tabulation first, and then plot it.&lt;/p&gt;

&lt;p&gt;Personally, I prefer this way. It gives me more control: I can clearly see the table underlying the chart, so I can check its quality early.&lt;/p&gt;

&lt;p&gt;This way is also more efficient. Not all of the data is sent to the browser; only the aggregated data is stored on the front end, which is a significantly smaller amount.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_pivoted = pd.pivot_table(
    data=df,
    values='help_column',
    index='monthly_salary_BINS_left',
    columns='default',
    aggfunc='sum'
)
fig = px.bar(
    df_pivoted,
    barmode='group'
)
fig.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the pivot table underlying the chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pOQfQ0o---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/movluuz0dd2j1jtdxj3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pOQfQ0o---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/movluuz0dd2j1jtdxj3b.png" alt="Image description" width="838" height="652"&gt;&lt;/a&gt;&lt;/p&gt;
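
&lt;p&gt;A similar table can also be built with pd.crosstab, which counts rows directly and so does not need the helper column of ones. This is a sketch of an alternative, not the method used above; the seed and the df_crosstab name are my own assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.exponential(scale=20, size=100000) + 50000

df = pd.DataFrame({
    'monthly_salary': x,
    'default': rng.choice([1, 0], size=len(x)),
})

bins_ = pd.interval_range(start=50000, end=50100, freq=1)
df['monthly_salary_BINS'] = pd.cut(x=df['monthly_salary'], bins=bins_)

# Rows outside 50000-50100 have no bin (NaN); drop them before reading .left
df = df.dropna(subset=['monthly_salary_BINS']).copy()
df['monthly_salary_BINS_left'] = [iv.left for iv in df['monthly_salary_BINS']]

# Cross tabulation: one row per bin, one column per value of 'default'
df_crosstab = pd.crosstab(
    index=df['monthly_salary_BINS_left'],
    columns=df['default'],
)

print(df_crosstab.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting table can be passed to px.bar with barmode='group' in the same way as df_pivoted.&lt;/p&gt;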

&lt;p&gt;Here is the amazing chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Kj-LoFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4e8fllx55ovhdjad3g53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Kj-LoFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4e8fllx55ovhdjad3g53.png" alt="Image description" width="880" height="646"&gt;&lt;/a&gt;&lt;/p&gt;
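
&lt;p&gt;To make the efficiency point concrete, the sketch below compares how many data points each kind of figure carries. It uses a single unsplit histogram for simplicity, and the seed and variable names are my own assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(0)
x = rng.exponential(scale=20, size=100000) + 50000
df = pd.DataFrame({'monthly_salary': x})

# Approach 1: the figure carries every raw value;
# the binning is done when the chart is drawn
hist_fig = px.histogram(data_frame=df, x='monthly_salary', nbins=200)

# Approach 2: aggregate with pandas first, then plot only the counts
bins_ = pd.interval_range(start=50000, end=50100, freq=1)
counts = pd.Series(pd.cut(x=x, bins=bins_)).value_counts(sort=False)
bar_fig = px.bar(x=bins_.left.to_numpy(), y=counts.to_numpy())

n_hist_points = len(hist_fig.data[0].x)
n_bar_points = len(bar_fig.data[0].x)

print(n_hist_points, n_bar_points)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The histogram figure holds one value per row of the dataset, while the bar figure holds one value per bin.&lt;/p&gt;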

&lt;p&gt;I hope you have enjoyed this tutorial. Happy cooking and see you in the next plotting endeavour :)&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>plotly</category>
    </item>
  </channel>
</rss>
