Forem: Piyush Raj

Pandas - EDA Case Study - 7 Days of Pandas

Piyush Raj — Tue, 27 Dec 2022 10:25:26 +0000

Welcome to the seventh (and final) article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In the fifth article, we looked at how to aggregate and group data in Pandas.

In the sixth article, we looked at how to visualize the data in a pandas dataframe.

In this tutorial, we will apply the methods learned so far in a case study. We'll be working with a demo assignment on performing EDA from the open source mlcourse.ai project.

The Task

In this task you should use Pandas to answer a few questions about the Adult dataset.

Unique values of all features (for more information, please see the links above):

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
salary: >50K, <=50K.

Let's now read the data as a dataframe.

import pandas as pd

# read data from csv file
df = pd.read_csv("adult.data.csv")
# display the first five rows
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

1. How many men and women (sex feature) are represented in this dataset?

# we need to get the value counts in the "sex" column
df["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

2. What is the average age (age feature) of women?

# filter for women and then get their average age
df[df["sex"]=="Female"]["age"].mean()

36.85823043357163

3. What is the percentage of German citizens (native-country feature)?

# find the number of German citizens and divide that by the total population
len(df[df["native-country"]=="Germany"])/len(df) * 100

0.42074874850281013

Only 0.42%

4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

# group on salary and then calculate the mean and std for the age
df.groupby(by="salary")["age"].agg(['mean', 'std'])

	mean	std
salary
<=50K	36.783738	14.020088
>50K	44.249841	10.519028

6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

# filter the dataframe for >50k and see the distribution of education
df[df["salary"] == ">50K"]["education"].value_counts()

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64

No, we can see that there are individuals with less than high-school education in the >50K bucket.
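The same check can also be made explicit instead of reading it off the value counts. The snippet below is a minimal sketch on a small made-up sample (not the real Adult data); the `sample` frame and the `below_hs` list of education levels treated as "below high school" are assumptions for illustration.

```python
import pandas as pd

# hypothetical mini-sample standing in for the Adult data (values are made up)
sample = pd.DataFrame({
    "education": ["Bachelors", "10th", "HS-grad", "Masters", "7th-8th"],
    "salary": [">50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# assumption: these levels are treated as "below high school"
below_hs = ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"]

# keep only the high earners, then count those below high school
high_earners = sample[sample["salary"] == ">50K"]
n_below_hs = high_earners["education"].isin(below_hs).sum()

# the claim holds only if nobody in the >50K group is below high school
all_at_least_hs = n_below_hs == 0
print(n_below_hs, all_at_least_hs)
```

Replacing `sample` with the actual `df` should give the same kind of answer for the full dataset.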

7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.

# for each race
df.groupby(by="race")["age"].describe()

	count	mean	std	min	25%	50%	75%	max
race
Amer-Indian-Eskimo	311.0	37.173633	12.447130	17.0	28.0	35.0	45.5	82.0
Asian-Pac-Islander	1039.0	37.746872	12.825133	17.0	28.0	36.0	45.0	90.0
Black	3124.0	37.767926	12.759290	17.0	28.0	36.0	46.0	90.0
Other	271.0	33.457565	11.538865	17.0	25.0	31.0	41.0	77.0
White	27816.0	38.769881	13.782306	17.0	28.0	37.0	48.0	90.0

# for each gender
df.groupby(by="sex")["age"].describe()

	count	mean	std	min	25%	50%	75%	max
sex
Female	10771.0	36.858230	14.013697	17.0	25.0	35.0	46.0	90.0
Male	21790.0	39.433547	13.370630	17.0	29.0	38.0	48.0	90.0

# for each race and gender
df.groupby(by=["race", "sex"])["age"].describe()

		count	mean	std	min	25%	50%	75%	max
race	sex
Amer-Indian-Eskimo	Female	119.0	37.117647	13.114991	17.0	27.0	36.0	46.00	80.0
Amer-Indian-Eskimo	Male	192.0	37.208333	12.049563	17.0	28.0	35.0	45.00	82.0
Asian-Pac-Islander	Female	346.0	35.089595	12.300845	17.0	25.0	33.0	43.75	75.0
Asian-Pac-Islander	Male	693.0	39.073593	12.883944	18.0	29.0	37.0	46.00	90.0
Black	Female	1555.0	37.854019	12.637197	17.0	28.0	37.0	46.00	90.0
Black	Male	1569.0	37.682600	12.882612	17.0	27.0	36.0	46.00	90.0
Other	Female	109.0	31.678899	11.631599	17.0	23.0	29.0	39.00	74.0
Other	Male	162.0	34.654321	11.355531	17.0	26.0	32.0	42.00	77.0
White	Female	8642.0	36.811618	14.329093	17.0	25.0	35.0	46.00	90.0
White	Male	19174.0	39.652498	13.436029	17.0	29.0	38.0	49.00	90.0
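The last part of the question can be read off the table above: the max value in the Amer-Indian-Eskimo / Male row is 82.0. It can also be computed directly by filtering on both columns. Here is a minimal sketch on a small made-up sample (the `sample` frame is hypothetical, not the real data):

```python
import pandas as pd

# hypothetical mini-sample standing in for the Adult data (values are made up)
sample = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "White", "Amer-Indian-Eskimo"],
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [45, 80, 90, 62],
})

# combine both conditions with & (each condition needs its own parentheses)
mask = (sample["race"] == "Amer-Indian-Eskimo") & (sample["sex"] == "Male")

# maximum age among the rows that satisfy both conditions
max_age = sample.loc[mask, "age"].max()
print(max_age)
```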

8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

# add a new column "is-married"
df["is-married"] = df["marital-status"].str.startswith("Married")

# display the dataframe
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	is-married
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	False
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	True
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	False
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	True
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	True

df.groupby(by=["is-married"])["salary"].value_counts(normalize=True)

is-married  salary
False       <=50K     0.935546
            >50K      0.064454
True        <=50K     0.563080
            >50K      0.436920
Name: salary, dtype: float64

We can see that among married people, a higher proportion earn more than 50K (about 44% versus about 6% for the rest).
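Note that the question asks specifically about men, while the grouping above covers everyone in the dataset. A possible way to restrict the comparison to men is to filter on the "sex" column first. The snippet below is a sketch on a small made-up sample (the `sample` frame is hypothetical), so its numbers say nothing about the real data.

```python
import pandas as pd

# hypothetical mini-sample standing in for the Adult data (values are made up)
sample = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "marital-status": ["Married-civ-spouse", "Married-civ-spouse", "Never-married",
                       "Divorced", "Married-civ-spouse", "Widowed"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# same "is-married" flag as above
sample["is-married"] = sample["marital-status"].str.startswith("Married")

# keep only men, then compare salary proportions for married vs. single
men = sample[sample["sex"] == "Male"]
proportions = men.groupby("is-married")["salary"].value_counts(normalize=True)
print(proportions)
```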

9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?

# max number of hours a person works per week
df["hours-per-week"].max()

# how many people work the above maximum number of hours
len(df[df["hours-per-week"] == df["hours-per-week"].max()])

# percentage of people in the above population that earn more than 50K
df[df["hours-per-week"] == df["hours-per-week"].max()]['salary'].value_counts(normalize=True)

<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64

Only 29%

10. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

# group the data on native country and salary and find the average work time for each group
with pd.option_context('display.max_rows', None):
    print(df.groupby(by=["native-country", "salary"])["hours-per-week"].mean())

native-country              salary
?                           <=50K     40.164760
                            >50K      45.547945
Cambodia                    <=50K     41.416667
                            >50K      40.000000
Canada                      <=50K     37.914634
                            >50K      45.641026
China                       <=50K     37.381818
                            >50K      38.900000
Columbia                    <=50K     38.684211
                            >50K      50.000000
Cuba                        <=50K     37.985714
                            >50K      42.440000
Dominican-Republic          <=50K     42.338235
                            >50K      47.000000
Ecuador                     <=50K     38.041667
                            >50K      48.750000
El-Salvador                 <=50K     36.030928
                            >50K      45.000000
England                     <=50K     40.483333
                            >50K      44.533333
France                      <=50K     41.058824
                            >50K      50.750000
Germany                     <=50K     39.139785
                            >50K      44.977273
Greece                      <=50K     41.809524
                            >50K      50.625000
Guatemala                   <=50K     39.360656
                            >50K      36.666667
Haiti                       <=50K     36.325000
                            >50K      42.750000
Holand-Netherlands          <=50K     40.000000
Honduras                    <=50K     34.333333
                            >50K      60.000000
Hong                        <=50K     39.142857
                            >50K      45.000000
Hungary                     <=50K     31.300000
                            >50K      50.000000
India                       <=50K     38.233333
                            >50K      46.475000
Iran                        <=50K     41.440000
                            >50K      47.500000
Ireland                     <=50K     40.947368
                            >50K      48.000000
Italy                       <=50K     39.625000
                            >50K      45.400000
Jamaica                     <=50K     38.239437
                            >50K      41.100000
Japan                       <=50K     41.000000
                            >50K      47.958333
Laos                        <=50K     40.375000
                            >50K      40.000000
Mexico                      <=50K     40.003279
                            >50K      46.575758
Nicaragua                   <=50K     36.093750
                            >50K      37.500000
Outlying-US(Guam-USVI-etc)  <=50K     41.857143
Peru                        <=50K     35.068966
                            >50K      40.000000
Philippines                 <=50K     38.065693
                            >50K      43.032787
Poland                      <=50K     38.166667
                            >50K      39.000000
Portugal                    <=50K     41.939394
                            >50K      41.500000
Puerto-Rico                 <=50K     38.470588
                            >50K      39.416667
Scotland                    <=50K     39.444444
                            >50K      46.666667
South                       <=50K     40.156250
                            >50K      51.437500
Taiwan                      <=50K     33.774194
                            >50K      46.800000
Thailand                    <=50K     42.866667
                            >50K      58.333333
Trinadad&Tobago             <=50K     37.058824
                            >50K      40.000000
United-States               <=50K     38.799127
                            >50K      45.505369
Vietnam                     <=50K     37.193548
                            >50K      39.200000
Yugoslavia                  <=50K     41.600000
                            >50K      49.500000
Name: hours-per-week, dtype: float64

# for Japan
df[df["native-country"]=="Japan"].groupby(by=["salary"])["hours-per-week"].mean()

salary
<=50K    41.000000
>50K     47.958333
Name: hours-per-week, dtype: float64

Pandas - Visualizing Dataframe Data - 7 Days of Pandas

Piyush Raj — Mon, 26 Dec 2022 10:09:11 +0000

Welcome to the sixth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In the fifth article, we looked at how to aggregate and group data in Pandas.

In this tutorial, we will look at how to plot data in a pandas dataframe with the help of some examples.

Data visualizations are a great way to present data and can help us find insights that may not have been obvious with the data in just tabular form. For example, if you have the data of salaries of employees in an office, a bar chart would give you a much more intuitive feel for comparing them.

How to visualize data in pandas dataframes?

You can use the pandas dataframe plot() function to create a plot from the dataframe values. It creates a matplotlib plot. You can specify the x and y values of the plot with x and y parameters respectively and the type of plot you want to create with the kind parameter.

Let's look at some common types of plots that you can create from pandas dataframe data.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Gender	Age	Department	Salary
0	Tim	M	26	Marketing	60000
1	Shaym	M	28	Product	70000
2	Noor	F	27	Product	82000
3	Esha	F	32	HR	55000
4	Sam	M	24	Product	58000
5	James	M	31	HR	55000
6	Lily	F	33	Marketing	65000

Scatter Plot

To create a scatter plot with dataframe data, pass "scatter" to the kind parameter of the plot() function. For example, let's create a scatter plot of the "Age" vs "Salary" data in the above dataframe.

df.plot(x="Age", y="Salary", kind="scatter")

You can also customize the plot with additional parameters to the plot() function. For example, let's add a title to the plot and change the color of the points.

df.plot(x="Age", y="Salary", kind="scatter", title="Salary v/s Age", color='red')

Bar Plot

To create a bar plot, pass "bar" as an argument to the kind parameter. Let's create a bar plot of the "Salary" column in the above dataframe.

df.plot(y="Salary", x="Name", kind="bar")

You can also customize the plot with additional parameters to the plot() function. For example, let's rotate the xtick labels slightly and change the color of the bars.

df.plot(y="Salary", x="Name", kind="bar", rot=30, color="teal")

Histogram

A histogram is used to look at the distribution of a continuous variable. To plot a histogram on pandas dataframe data, pass "hist" to the kind parameter.

For example, let's plot a histogram of the values in the "Age" column.

df.plot(y="Age", kind="hist", bins=3)

You can also directly apply the plot() function to a pandas series.

df['Age'].plot(kind="hist", bins=3)

We get the same result.

You can similarly plot other types of plots (for example, line plot, pie chart, etc.) with the plot() function using the appropriate parameters.
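As an illustration, here is a minimal sketch of a line plot and a pie chart on the same employee data. It assumes matplotlib is installed; the "Agg" backend and the explicit `plt.subplots()` axes are choices made here so that the snippet runs as a plain script and each plot lands on its own axes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# one figure with two axes, so the plots do not draw over each other
fig, (line_ax, pie_ax) = plt.subplots(1, 2, figsize=(10, 4))

# line plot of salary against age (sorted by age so the line runs left to right)
df.sort_values("Age").plot(x="Age", y="Salary", kind="line", title="Salary by Age", ax=line_ax)

# pie chart of the number of employees in each department
dept_counts = df["Department"].value_counts()
dept_counts.plot(kind="pie", title="Employees per Department", ax=pie_ax)
```

In a notebook, the figure is shown automatically and the `matplotlib.use("Agg")` line can be dropped.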

Pandas - Aggregating and Grouping Data - 7 Days of Pandas

Piyush Raj — Sun, 25 Dec 2022 10:08:00 +0000

Welcome to the fifth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In this tutorial, we will look at how to aggregate and group data in Pandas.

Aggregating and grouping data is a common task when working with datasets, and pandas provides a range of functions and methods to help you do this efficiently.

In this tutorial, we will cover the following topics:

Applying aggregate functions to pandas dataframe.
Grouping data in pandas dataframe (and applying aggregate functions to the grouped data).

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Gender	Age	Department	Salary
0	Tim	M	26	Marketing	60000
1	Shaym	M	28	Product	70000
2	Noor	F	27	Product	82000
3	Esha	F	32	HR	55000
4	Sam	M	24	Product	58000
5	James	M	31	HR	55000
6	Lily	F	33	Marketing	65000

Applying aggregate functions

Pandas comes with a number of aggregate functions that you can apply to the entire dataframe or one or more columns in the dataframe.

For example, you can apply the sum() function to get the sum of values in each column.

df.sum()

Name                         TimShaymNoorEshaSamJamesLily
Gender                                            MMFFMMF
Age                                                   201
Department    MarketingProductProductHRProductHRMarketing
Salary                                             445000
dtype: object

Note that for object type columns (here, the text columns), sum() concatenates the strings instead of adding numbers.
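If you only want the totals of the numeric columns, one option is the numeric_only parameter of sum(). A minimal sketch on the same employee data:

```python
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# sum only the numeric columns; the text columns are skipped
totals = df.sum(numeric_only=True)
print(totals)
```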

You can select which columns to apply the aggregate functions to.

For example, let's get the mean value of the "Age" and the "Salary" columns.

df[["Age", "Salary"]].mean()

Age          28.714286
Salary    63571.428571
dtype: float64

Grouping data

To group a pandas DataFrame by one or more columns, you can use the pandas dataframe groupby() method. This method takes one or more column names as arguments and returns a groupby object that can be used to apply various operations to the grouped data.

Let's group the above data on the "Gender" column.

df.groupby("Gender")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10684b970>

We get a groupby object. You can now use this object to apply aggregations to the grouped data. For example, let's get the average "Age" value for each group.

df.groupby("Gender")["Age"].mean()

Gender
F    30.666667
M    27.250000
Name: Age, dtype: float64

We get the mean value of the "Age" column for each group (here, "Gender") in the data.

You can group the data on more than one column as well. For example, let's group the data on "Gender" and "Department" and get the average "Age" in each group.

df.groupby(["Gender", "Department"])["Age"].mean()

Gender  Department
F       HR            32.0
        Marketing     33.0
        Product       27.0
M       HR            31.0
        Marketing     26.0
        Product       26.0
Name: Age, dtype: float64

You can also apply multiple aggregate functions to the grouped data using the .agg() function.

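# select the numeric columns explicitly; newer pandas versions may raise an
# error when asked for the mean of a text column such as "Name"
df.groupby(["Gender", "Department"])[["Age", "Salary"]].agg(['mean', 'count'])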

		Age		Salary
		mean	count	mean	count
Gender	Department
F	HR	32.0	1	55000.0	1
	Marketing	33.0	1	65000.0	1
	Product	27.0	1	82000.0	1
M	HR	31.0	1	55000.0	1
	Marketing	26.0	1	60000.0	1
	Product	26.0	2	64000.0	2
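The .agg() function also accepts a dictionary when you want different aggregations for different columns. The snippet below is a small sketch of that form on the same employee data, grouped on "Department" only to keep the output short.

```python
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# mean of "Age", and both min and max of "Salary", for each department
result = df.groupby("Department").agg({"Age": "mean", "Salary": ["min", "max"]})
print(result)
```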

Pandas - Handling Missing Values - 7 Days of Pandas

Piyush Raj — Sat, 24 Dec 2022 15:54:37 +0000

Welcome to the fourth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In this tutorial, we will look at how to handle missing values in data.

When working with data, it's not uncommon to find missing values. Missing values can occur for a variety of reasons, for example, errors in data capture, encoding issues, etc. It's important to deal with missing values before you proceed with further analysis, and doing so is a major step in data preprocessing.

In this tutorial, we will cover the following topics:

Identifying missing values.
Handling missing values.
- Filling missing values.
- Removing missing values.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd
import numpy as np

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, None, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
1	Shaym	28.0	Product	NaN
2	Noor	27.0	Product	82000.0
3	Esha	32.0	HR	NaN
4	Sam	24.0	Product	58000.0
5	James	NaN	NaN	55000.0
6	Lily	33.0	Marketing	65000.0

Identifying missing values

To identify missing values in a pandas DataFrame, you can use the pandas isna() method, which returns a boolean mask indicating the presence of missing values. You can then use this mask to select the rows or columns with missing values.

For example, let's check which values in the above dataframe are missing using the isna() function.

df.isna()

	Name	Age	Department	Salary
0	False	False	False	False
1	False	False	False	True
2	False	False	False	False
3	False	False	False	True
4	False	False	False	False
5	False	True	True	False
6	False	False	False	False

You can see the resulting boolean mask.

To check which columns in the dataframe have missing values, apply the any() function on the resulting boolean dataframe with axis=0.

df.isna().any(axis=0)

Name          False
Age            True
Department     True
Salary         True
dtype: bool

We see that only the "Name" column doesn't have any missing values.
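The same boolean mask can also be used to count the missing values in each column and to select the rows that contain at least one missing value. A minimal sketch on the same employee data:

```python
import numpy as np
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, None, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000],
})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = df.isna().sum()

# any(axis=1) checks across columns, giving one True/False per row
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```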

Handling missing values

Handling missing values is an important step in the data preparation pipeline. Generally, there are two approaches to handling missing values:

Fill the missing value with some appropriate value (for example, a constant or mean, median, etc. for continuous variables, and mode for categorical fields).
Remove the missing values (remove the records with missing values).

Let's now look at how to do both of them in pandas.

Filling missing values

To fill missing values in a pandas DataFrame, you can use the pandas fillna() method. This method allows you to specify a value to fill the missing values with, or a method for imputing the missing values.

For example, let's see what we get if we fill the missing values with 0.

df.fillna(0)

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
1	Shaym	28.0	Product	0.0
2	Noor	27.0	Product	82000.0
3	Esha	32.0	HR	0.0
4	Sam	24.0	Product	58000.0
5	James	0.0	0	55000.0
6	Lily	33.0	Marketing	65000.0

It returns a dataframe with the missing values filled with the constant value. Note that the fillna() function didn't modify the original dataframe in-place. It returned the resulting dataframe after filling the missing values.
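To keep the filled values, assign the result to a variable (or back to the original name). The sketch below uses a smaller two-column frame, an assumption made here to keep the example short, and shows that the original stays unchanged.

```python
import numpy as np
import pandas as pd

# smaller frame with the same "Salary" values as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000],
})

# fillna() returns a new dataframe; store it to keep the result
filled = df.fillna({"Salary": 0})

# the original dataframe still has its two missing salaries
print(df["Salary"].isna().sum())
print(filled["Salary"].isna().sum())
```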

You can also specify different values for different columns when filling missing values.
For example, let's fill missing values in "Age" and "Salary" columns with their respective means and the missing value in the "Department" column with its mode (the most frequent value).

df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean(), 'Department': df['Department'].mode()[0]})

	Name	Age	Department	Salary
0	Tim	26.000000	Marketing	60000.0
1	Shaym	28.000000	Product	64000.0
2	Noor	27.000000	Product	82000.0
3	Esha	32.000000	HR	64000.0
4	Sam	24.000000	Product	58000.0
5	James	28.333333	Product	55000.0
6	Lily	33.000000	Marketing	65000.0

Dropping rows with missing values

Another strategy for handling missing values is to remove the rows that contain them. This is used when the proportion of missing values is comparatively small and we can afford to discard that data.

Use the pandas dropna() function to remove rows with missing values.

df.dropna()

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
2	Noor	27.0	Product	82000.0
4	Sam	24.0	Product	58000.0
6	Lily	33.0	Marketing	65000.0

This is how the dataframe looks after removing rows with any missing values.
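The dropna() function also lets you control what gets dropped. For example, the subset parameter limits the check to specific columns, and axis=1 drops columns instead of rows. A minimal sketch on the same employee data:

```python
import numpy as np
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, None, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000],
})

# drop only the rows where "Salary" is missing (missing "Age" is tolerated)
has_salary = df.dropna(subset=["Salary"])

# drop every column that contains at least one missing value
complete_columns = df.dropna(axis=1)

print(has_salary)
print(complete_columns)
```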

Pandas - Basic Exploratory Data Analysis - 7 Days of Pandas

Piyush Raj — Fri, 23 Dec 2022 07:29:36 +0000

Welcome to the third article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.
In the second article, we looked at how to perform basic data manipulation.
In this tutorial, we will look at some of the common operations that we perform on a dataframe during the exploratory data analysis (EDA) phase.

Exploratory Data Analysis (EDA) helps us better understand the data at hand. In this phase, we use descriptive statistics and visualizations to derive insights from the data.

The pandas library comes with a number of useful functions that help us explore the data. In this tutorial, we will cover the following topics:

Get the first and the last N rows of a dataframe.
Using the info() function.
Get descriptive statistics with the describe() function.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Age	Department	Salary
0	Tim	26	Marketing	60000
1	Shaym	28	Product	70000
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000
5	James	31	HR	55000
6	Lily	33	Marketing	65000

We have a dataframe with information about some employees in an office.

Get the first and the last N rows of a dataframe

After loading or creating a dataframe, a good first step is to look at the first few rows to see whether the data is as expected and whether there are any obvious issues with it (for example, missing fields).

You can use the pandas dataframe head() function to get the first n rows of the dataframe. Pass the number of rows you want from the top as an argument. By default, n is 5.

# get the first five rows
df.head(5)

	Name	Age	Department	Salary
0	Tim	26	Marketing	60000
1	Shaym	28	Product	70000
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000

You can similarly get the last n rows of the dataframe, using the pandas dataframe tail() function. Pass the number of rows you want from the bottom as an argument. By default, n is 5.

# get the last five rows
df.tail(5)

	Name	Age	Department	Salary
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000
5	James	31	HR	55000
6	Lily	33	Marketing	65000

Use the info() function

You can use the pandas dataframe info() function to get a concise summary of the dataframe. It gives information such as the column dtypes, count of non-null values in each column, the memory usage of the dataframe, etc.

# summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Age         7 non-null      int64 
 2   Department  7 non-null      object
 3   Salary      7 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes

Get descriptive statistics with the describe() function

The pandas dataframe describe() function returns some descriptive statistics for a dataframe. For example, for numerical columns, it returns the count, mean, standard deviation, min, max, percentile values, etc.

# get dataframe's descriptive statistics
df.describe()

	Age	Salary
count	7.000000	7.000000
mean	28.714286	63571.428571
std	3.352327	9778.499252
min	24.000000	55000.000000
25%	26.500000	56500.000000
50%	28.000000	60000.000000
75%	31.500000	67500.000000
max	33.000000	82000.000000

Note that the pandas dataframe describe() function, by default, includes only the numeric columns when generating the dataframe's description.

You can, however, use the include parameter to specify other column types (or all the columns) to include in the statistics.

# get descriptive statistics for the object type columns
df.describe(include='object')

	Name	Department
count	7	7
unique	7	3
top	Tim	Product
freq	1	3

For object type columns, we get the information about the count, number of unique values, top (the most frequent value), and freq (the count of the most frequent value in the column).

These descriptive statistics give us valuable insights into the distribution of the data in different columns.
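To get the statistics for all the columns at once, you can pass include="all". A minimal sketch on the same employee data; statistics that don't apply to a column's type show up as NaN.

```python
import pandas as pd

# same employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# one summary table covering both the numeric and the object type columns
summary = df.describe(include="all")
print(summary)
```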

Pandas - Basic Data Manipulation - 7 Days of Pandas

Piyush Raj — Thu, 22 Dec 2022 06:10:10 +0000

Welcome to the second article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

Pandas is a powerful Python library that is widely used for data manipulation and analysis. It provides a range of functions and methods that allow you to easily manipulate and transform data in a variety of formats. In this tutorial, we will cover the following topics:

Selecting rows and columns
Filtering data
Sorting data
Adding and deleting columns

Before we begin, let's first import pandas and read in a sample data file. We will use the pandas.read_csv() function to read in a CSV file and store it in a DataFrame object.

We'll assume that a CSV file, "sample_data.csv", exists in the current working directory, and we'll read it into a dataframe.

import pandas as pd

df = pd.read_csv("sample_data.csv")

Now that we have a DataFrame, let's dive into the first topic: selecting rows and columns.

Selecting Rows and Columns

There are several ways to select specific rows and columns from a pandas DataFrame. One way is to use the loc attribute, which allows you to select rows and columns based on their labels. For example, to select the row with the index label 0 (which is the first row when the DataFrame uses the default integer index), you can use the following code:

# select the first row
df.loc[0]

To select a specific column, you can pass the column name as a string:

# select column by its name
df.loc[:, "column_name"]

You can also use the iloc attribute to select rows and columns based on their integer indices. For example, to select the first row using iloc, you can use the following code:

# select the first row
df.iloc[0]

To select a specific column, you can pass the column index as an integer:

# select column by column index
df.iloc[:, 0]
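
The difference between loc and iloc is easiest to see when the index labels do not match the row positions. The following is a minimal sketch using a hypothetical dataframe whose integer labels are deliberately out of order. The data and variable names are assumptions made for illustration.

```python
import pandas as pd

# hypothetical data with index labels that differ from row positions
df = pd.DataFrame({"column_name": [10, 20, 30]}, index=[2, 0, 1])

# loc looks up the row whose index label is 0 (the second row here)
row_by_label = df.loc[0]

# iloc looks up the row at integer position 0 (the first row here)
row_by_position = df.iloc[0]

print(row_by_label["column_name"], row_by_position["column_name"])
```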

Filtering Data

In addition to selecting rows and columns, you can also use pandas to filter your data based on specific conditions.

You can use boolean indexing to filter the data in a dataframe. Boolean indexing allows you to filter a DataFrame based on the values in one or more columns. The idea is to use a boolean expression that results in a boolean index, which we then use to filter the original data.

To do this, you pass a boolean expression to the DataFrame's indexing operator, []. For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5, you can use the following code:

# filter dataframe
df[df["column_name"] > 5]

You can also filter the dataframe on multiple conditions by using the logical operators & (and) and | (or). For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5 and the value in the "other_column" column is less than 10, you can use the following code:

# filter dataframe on multiple conditions
df[(df["column_name"] > 5) & (df["other_column"] < 10)]

Alternatively, you can also use the query() function in pandas to filter a dataframe.
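
For instance, the two-condition filter from above could be written with query() roughly as follows. This is a sketch on hypothetical data. The column names match the placeholders used in this section, and the values are made up.

```python
import pandas as pd

# hypothetical data using the placeholder column names from this section
df = pd.DataFrame({
    "column_name": [3, 6, 8, 12],
    "other_column": [4, 15, 7, 2],
})

# boolean indexing, as shown earlier
mask_result = df[(df["column_name"] > 5) & (df["other_column"] < 10)]

# the same filter expressed as a query string
query_result = df.query("column_name > 5 and other_column < 10")

print(query_result)
```

Both approaches should return the same rows. The query string can be easier to read when there are several conditions.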

Sorting Data

To sort a pandas DataFrame, you can use the pandas dataframe sort_values() method. This method allows you to specify one or multiple columns to sort by, as well as the sort order (ascending or descending).

For example, to sort the DataFrame by the "column_name" column in ascending order, you can use the following code:

# sort dataframe by "column_name" in ascending order
df.sort_values("column_name")

To sort in descending order, you can set the ascending parameter to False:

# sort dataframe by "column_name" in descending order
df.sort_values("column_name", ascending=False)

You can also sort by multiple columns by passing a list of column names:

# sort dataframe by multiple columns
df.sort_values(["column_name_1", "column_name_2"])
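
When sorting by multiple columns, the ascending parameter can also take a list with one boolean per column. Here is a small sketch on hypothetical data that sorts the first column in ascending order and the second in descending order. The values are made up for illustration.

```python
import pandas as pd

# hypothetical data using the placeholder column names from this section
df = pd.DataFrame({
    "column_name_1": ["b", "a", "b", "a"],
    "column_name_2": [1, 2, 3, 4],
})

# ascending for column_name_1, descending for column_name_2
sorted_df = df.sort_values(
    ["column_name_1", "column_name_2"],
    ascending=[True, False],
)

print(sorted_df)
```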

Adding and Deleting Columns

To add a new column to a pandas DataFrame, you can simply assign a new value to a column that doesn't exist. For example, to add a new column called "new_column" with a default value of 0 for all rows, you can use the following code:

# create a new column with all values as 0
df["new_column"] = 0

You can also assign different values to each row using a list or another Series object.

There are other methods to add a column as well.
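
The following sketch shows a few of these options on a hypothetical dataframe: assigning a list, assigning a Series, and using the assign() method. The new column names (from_list, from_series, total) are assumptions chosen for this example.

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({"column_name": [1, 2, 3]})

# a list must have one value per row
df["from_list"] = [10, 20, 30]

# a Series is aligned to the dataframe on its index
df["from_series"] = pd.Series([0.1, 0.2, 0.3])

# assign() returns a new dataframe with the extra column
df = df.assign(total=df["column_name"] + df["from_list"])

print(df)
```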

To delete a column from a DataFrame, you can use the drop() method and specify the column name and the axis parameter set to 1 (columns). For example, to delete the "new_column" from the DataFrame, you can use the following code:

# remove the column "new_column" from the dataframe
df = df.drop("new_column", axis=1)
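
As an alternative to axis=1, drop() also accepts a columns keyword, which can be handy when removing several columns at once. A minimal sketch on hypothetical data follows. It also shows that drop() returns a new dataframe and leaves the original unchanged, which is why the result is assigned back in the code above.

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# remove two columns in one call using the columns keyword
trimmed = df.drop(columns=["b", "c"])

print(list(trimmed.columns))
print(list(df.columns))
```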

That concludes this tutorial on basic data manipulation with pandas. We hope that you found it useful.

In the coming articles, we'll look at other useful operations in Pandas.

Pandas - Read and Write Data From CSV files - 7 Days of Pandas

Piyush Raj — Wed, 21 Dec 2022 12:11:41 +0000

Welcome to the 7 days of Pandas challenge!

In this series, we'll cover the basics and the commonly used operations in the pandas library in Python, which is primarily used for data manipulation.

In pandas, the main object we use is a DataFrame, which stores data in a tabular form and lets us perform operations on it.

Day 1 - Read and Write Data from CSV files using `pandas`

In this article, we will cover how to load data from a CSV file into a dataframe and then write a dataframe to a CSV file using the pandas module.

Read data from CSV file in `pandas`

You can use the pandas.read_csv() function to read data from a CSV file into a dataframe. The following is the syntax -

import pandas as pd
# read data from csv file
df = pd.read_csv(PATH_TO_FILE)

Pass the path to the CSV file as an argument to the pandas.read_csv() function. It reads the data from the CSV file and returns the resulting dataframe with that data.

Let's look at an example.

We'll read the data from a file called "Pokemon.csv" saved in the current working directory as a dataframe.

# import the pandas module
import pandas as pd

# read data from csv file
df = pd.read_csv("Pokemon.csv")

# display the first five rows of the dataframe
df.head()

Output:

You can see that the data from the CSV file was loaded in the dataframe. Now, you can go ahead and analyze/manipulate the data as per your requirements.

Write data to a CSV file using `pandas`

You can also use the pandas module to save a dataframe as a CSV file. For example, after working with and changing the data in a dataframe, you may want to save it for later use.

Use the pandas.DataFrame.to_csv() function to save a pandas dataframe as a CSV file. The following is the syntax -

# save dataframe to a csv file
df.to_csv(PATH_TO_NEW_FILE)

Pass the path (or just the file name in case you want to save the dataframe as csv in the current working directory) as an argument to the pandas.DataFrame.to_csv() function.

Note that, if you do not want the dataframe index as an additional column in the resulting CSV file, pass index=False as an argument.

Let's look at an example.

Let's write the above dataframe df to a new CSV file called "Pokemon2.csv".

# write dataframe to a csv file
df.to_csv("Pokemon2.csv", index=False)

If you open the CSV file, it looks something like this -

You can see that the data was successfully written to the CSV file.
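
To see what index=False changes, here is a small sketch that writes a dataframe to CSV text in memory and reads it back. It uses io.StringIO instead of files on disk so it can run anywhere. The two-row dataframe is hypothetical sample data, not the contents of "Pokemon.csv".

```python
import io
import pandas as pd

# hypothetical sample data, not the actual Pokemon.csv contents
df = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "HP": [45, 39]})

# to_csv() returns the CSV text when no path is given
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# read both versions back into dataframes
back_with_index = pd.read_csv(io.StringIO(csv_with_index))
back_without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(back_with_index.columns))
print(list(back_without_index.columns))
```

With the default settings, the index is written as an extra unnamed column, which comes back as "Unnamed: 0" when the file is read again.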

That'll be it for this article. In the coming articles, we will dive deep into using pandas and some of its most powerful and useful functionalities.


Python - If Else in List Comprehension

Piyush Raj — Mon, 12 Sep 2022 12:27:44 +0000

In this tutorial, we will look at how to create a list using a list comprehension that uses if else logic to decide the final list values.

List comprehensions offer a concise way to create lists in Python and are particularly useful when creating lists using other iterables, filtering lists, etc.

Let's say you have a list ls of integers from 1 to 10 (both inclusive) and you want to create a new list new_ls that contains the string "odd" or "even" depending on whether the corresponding value in ls is odd or even.

Using a list comprehension can be a valid approach for such cases.

How to use `if else` logic inside a list comprehension in Python?

Use the following syntax to incorporate if else logic inside a list comprehension.

new_ls = [expression if condition else other_expression for member in iterable]

Here, for each member in the iterable, we check the condition. If it evaluates to True, we use the value resulting from expression; otherwise, we use the value resulting from other_expression as the value in the resulting list.

Let's now look at an example.

Let's create a list with the string values "odd" and "even" for each corresponding value in the ls list.

# list of integers from 1 to 10
ls = list(range(1, 11))
# create list using list comprehension
new_ls = ["odd" if num % 2 != 0 else "even" for num in ls]
print(new_ls)

Output:

['odd', 'even', 'odd', 'even', 'odd', 'even', 'odd', 'even', 'odd', 'even']

We get a list with "odd" and "even" values. For example, for the value 4 in the list ls, its corresponding value in the list new_ls is "even".

Filter values with `if` in list comprehension

Note that you can also just use the if statement (without the else part) inside the list comprehension. This is commonly used when filtering lists.

Use the following syntax to use just the if construct inside a list comprehension.

new_ls = [expression for member in iterable if condition]

Here, for each member in the iterable, we check the condition. If it evaluates to True, we use the value resulting from expression as the value in the resulting list; otherwise, we skip that member.

For example, let's filter the above list of integers ls to create a new list with only odd integers.

# list of integers from 1 to 10
ls = list(range(1, 11))
# create list of odd values in ls using list comprehension
odd_ls = [num for num in ls if num % 2 != 0]
print(odd_ls)

Output:

[1, 3, 5, 7, 9]

We get a list of only the odd numbers in ls.
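
The two forms can also be combined in a single comprehension: the if at the end filters the members, and the if else at the front decides the value for each member that is kept. Here is a minimal sketch. The cutoff of 5 and the name labels are assumptions chosen for illustration.

```python
# list of integers from 1 to 10
ls = list(range(1, 11))

# keep only values greater than 5, then label each one as "odd" or "even"
labels = ["odd" if num % 2 != 0 else "even" for num in ls if num > 5]
print(labels)
```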

How to Sort a List in Python?

Piyush Raj — Wed, 17 Aug 2022 14:12:50 +0000

Lists are a common data structure used to store sequences and/or collections of data in Python. Since lists are an ordered collection, it can be handy to know how to sort a list in ascending or descending order.

Why sort lists or any data for that matter?
The list sort() function
The built-in sorted() function
Conclusion

Why sort lists or any data for that matter?

In general, data presented in sorted order is more intuitive to look at and infer from. Additionally, some problems become easier to solve when the data is already sorted. Binary search is one example: searching for an element takes linear time if the list is not sorted, but only logarithmic time if the list is already sorted.
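
As a rough illustration of the binary search point, the sketch below uses the bisect module from the Python standard library to look up values in a sorted list. The helper name contains_sorted is hypothetical and written only for this example.

```python
from bisect import bisect_left

def contains_sorted(sorted_ls, target):
    """Check membership in an already sorted list using binary search."""
    # bisect_left returns the leftmost position where target could be inserted
    pos = bisect_left(sorted_ls, target)
    return pos < len(sorted_ls) and sorted_ls[pos] == target

ls = [3, 5, 1, 2, 4, 7, 6]
sorted_ls = sorted(ls)

found = contains_sorted(sorted_ls, 5)
missing = contains_sorted(sorted_ls, 8)
print(found, missing)
```

Note that this only works if the list is sorted first. On an unsorted list, the result of bisect_left is not meaningful.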

The list `sort()` function

To sort the list in-place, use the list sort() function. It has the following syntax -

ls.sort(reverse=False, key=None)

It sorts the list in-place and does not return any value.

The optional parameter reverse specifies whether to sort the list in descending order (it is False by default), whereas the optional key parameter lets you pass a custom function that determines the sorting order.

Here's an example -

# create a list
ls = [3, 5, 1, 2, 4, 7, 6]
# sort the list
ls.sort()
# display the list
print(ls)

Output:

[1, 2, 3, 4, 5, 6, 7]

The list got sorted in-place.

Let's now sort the above list in descending order. For this, pass reverse=True to the sort() function.

# create a list
ls = [3, 5, 1, 2, 4, 7, 6]
# sort the list in descending order
ls.sort(reverse=True)
# display the list
print(ls)

Output:

[7, 6, 5, 4, 3, 2, 1]

The list is now sorted in descending order.
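
The key parameter described above can be sketched in the same way. In the example below, the built-in len function is passed as the key so that strings are ordered by their length rather than alphabetically. The list of words is made up for illustration.

```python
# create a list of strings
words = ["banana", "kiwi", "apple", "fig"]

# sort in-place by string length, shortest first
words.sort(key=len)
print(words)

# combine key with reverse to sort by length, longest first
words.sort(key=len, reverse=True)
print(words)
```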

The built-in `sorted()` function

If you do not want to modify the original list, you can use the Python built-in sorted() function. Its syntax is very similar to the list sort() function -

sorted(iterable, reverse=False, key=None)

It returns a new sorted list and leaves the original list unchanged.

Let's look at this method in action.

# create a list
ls = [3, 5, 1, 2, 4, 7, 6]
# sort the list
res_ls = sorted(ls)
# display the original list
print("Original list - ", ls)
# display the resulting list
print("Resulting list - ", res_ls)

Output:

Original list -  [3, 5, 1, 2, 4, 7, 6]
Resulting list -  [1, 2, 3, 4, 5, 6, 7]

The original list is unaffected and the returned list is sorted.
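
Since sorted() accepts any iterable, it is not limited to lists. The sketch below passes a tuple and a string. In both cases the result is a new list. The sample values are assumptions chosen for illustration.

```python
# a tuple and a string are both iterables
tup = (3, 1, 2)
word = "pandas"

# sorted() always returns a new list, whatever the input type
sorted_tup = sorted(tup)
sorted_chars = sorted(word)

print(sorted_tup)
print(sorted_chars)
```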

Conclusion

We looked at two methods to sort a list in Python (both with similar parameters). The key takeaways from this tutorial are:

To sort the list in-place use the list sort() function.
To keep the original list unaltered use the Python built-in sorted() function which returns a sorted copy of the original list.

Both the sort() and the sorted() functions take the optional parameters reverse and key. The key parameter can be very useful if you're looking to apply custom sorting logic to a list. Refer to the tutorial - Python List Sort - With Examples for examples on how to perform such sort operations using the key parameter.
