Forem: pexpeter

Ultimate Guide to Exploratory Data Analysis

pexpeter — Tue, 28 Feb 2023 20:24:56 +0000

Definition

According to Wikipedia, Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

It helps us identify potential issues with our dataset, e.g. missing data and outliers, understand the nature and type of the variables and the relationships between them, and communicate our findings effectively. This supports the company's decision-making, since the advice given is data-driven.

Exploratory Data Analysis Process

The EDA process is similar across data science programming languages, e.g. R and Python. The process involves three major steps:

Data Input/Reading

This involves assigning your data to an object in your programming language so that it is stored in memory. Each data source has its own way of being read. Example in Python and R:

  # Python
  import pandas as pd
  df = pd.read_csv('filepath/data.csv')

  # R
  df = read.csv('filepath/data.csv')

The method to use is dictated by the data size and format in which it's stored.
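As a rough sketch of how data size can influence the choice, the example below reads the same small CSV text twice with pandas: once in a single call, and once in pieces using the `chunksize` parameter, which is one option when a file is too large to fit in memory. The inline CSV text and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "id,score\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# Small file: read everything into memory at once
df = pd.read_csv(io.StringIO(csv_text))

# Large file: read a few rows at a time and combine partial results
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["score"].sum()
    n_chunks += 1

print(df.shape)          # rows and columns of the full table
print(total, n_chunks)   # sum of scores and number of chunks read
```

With a real file you would pass the file path instead of the `io.StringIO` object.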

Data Cleaning and Analysis

This involves getting a deeper understanding of the data you imported. It will involve:

Identifying missing data

During data collection, some respondents skip questions, or pick an option without giving the required input. This results in missing data that could otherwise have given better insight. EDA helps in identifying which values are missing and how much of the data is missing. Missing values are most commonly represented as null or NaN.
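A minimal sketch of counting missing values with pandas is shown below. The DataFrame and its column names are hypothetical and only stand in for real survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with some skipped answers
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["Nairobi", "Mombasa", None, "Kisumu"],
})

# Number of missing values per column
missing_count = df.isna().sum()

# Share of missing values per column, as a percentage
missing_pct = df.isna().mean() * 100

print(missing_count)
print(missing_pct)
```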

Identifying outliers

Outliers are data points that are significantly different from other data points in the dataset. We tend to identify them through quantiles or percentiles. A simple rule of thumb treats values outside the (0.1, 0.9) quantile range as outliers. A common action is to drop the outliers, although it is worth checking first whether they are errors or genuine extreme values. Example in Python:

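 # get the lower and upper borders
 low, high = df['column'].quantile([0.1, 0.9])
 # boolean mask: True where the value lies within the borders
 mask = df['column'].between(low, high)
 # keep only the rows within the borders
 df = df[mask]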
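To make the quantile approach concrete, here is a small self-contained sketch on made-up numbers. The data and the names `df_trimmed` and `outliers` are assumptions for illustration; the flagged rows are set aside rather than discarded, so they can be inspected before deciding what to do with them.

```python
import pandas as pd

# Hypothetical measurements with one very low and one very high value
df = pd.DataFrame({"column": [-50, 10, 10, 11, 11, 12, 12, 12, 13, 13, 200]})

# Borders at the 0.1 and 0.9 quantiles
low, high = df["column"].quantile([0.1, 0.9])

# Boolean mask: True where the value lies within the borders (inclusive)
mask = df["column"].between(low, high)

# Keep the rows inside the borders; set the rest aside for inspection
df_trimmed = df[mask]
outliers = df[~mask]

print(outliers["column"].tolist())  # [-50, 200]
```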

Distribution of the data

We query the central tendency (mean, median), spread (standard deviation, variance, etc.), and shape of the data to get insights into the data distribution. This helps us decide which statistical test or analysis to use, and also helps in checking data skewness (an asymmetric distribution).
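These summary measures can be computed with pandas, as in the sketch below. The values are invented and deliberately right-skewed so that the mean and median differ.

```python
import pandas as pd

# Hypothetical right-skewed values: most are small, one is large
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])

mean = values.mean()        # central tendency
median = values.median()    # central tendency, less sensitive to extreme values
std = values.std()          # spread (sample standard deviation)
variance = values.var()     # spread (sample variance)
skewness = values.skew()    # shape: a value above 0 suggests a longer right tail

print(values.describe())    # count, mean, std, min, quartiles, max
print(mean, median, skewness)
```

Here the mean (5.7) sits well above the median (3.0), which is a typical sign of right skew.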

Statistical Analysis

The method used for statistical analysis depends on the data distribution and data types. It often involves computing correlations between variables, which helps in detecting issues such as multicollinearity between the variables.
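As an illustration, a correlation matrix can be computed with pandas. The columns below are hypothetical; `x2` is constructed as an exact multiple of `x1`, so the pair is perfectly correlated.

```python
import pandas as pd

# Hypothetical data: x2 is an exact multiple of x1, x3 moves the other way
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 4, 1, 2],
})

# Pairwise Pearson correlation between the numeric columns
corr = df.corr()
print(corr.round(2))

# Pairs of predictors with a correlation close to 1 or -1 hint at multicollinearity
print(corr.loc["x1", "x2"])
```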

We also perform statistical tests on the data to help us test the hypotheses or meet the objectives we set out with.

We can also perform regression analysis on the data set to get more insights into our response variable and independent variables.
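A simple linear regression can be sketched with NumPy's `polyfit`, as shown below. This is only one of several ways to fit a regression, and the data is generated to follow y = 2x + 1 exactly so that the result is easy to check.

```python
import numpy as np

# Hypothetical data following y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the response for a new value of the independent variable
y_new = slope * 5.0 + intercept
print(slope, intercept, y_new)
```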

Data Visualization

This is considered the last step of Exploratory Data Analysis.

We visualize our data to identify patterns and trends (e.g. in time series) that would be difficult to see in raw data.

It also involves communicating our results in simple language that policymakers can understand. We present our results in pictorial format to highlight the major insights and support the data-driven recommendations.

The most common data visualization packages are ggplot2 (R), matplotlib (Python), Seaborn (Python), and plotly (R and Python).
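A minimal matplotlib sketch of a trend chart follows. The sales figures are invented, and the non-interactive `Agg` backend plus the in-memory buffer are used here only so the example runs without opening a window or writing a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 145, 170]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

# Render the chart as PNG into an in-memory buffer
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook you would typically call `plt.show()` instead, or pass a filename to `savefig`.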

Importance

The significance of EDA falls into three main areas:

Data Quality and Better Understanding

EDA gives you a better understanding of the data you are working with by revealing trends, patterns, and outliers (anomalies). This helps in planning how to analyze and interpret the data.

Identifying missing values helps in ensuring we use reliable data.

Communication

The use of visualization and summaries aids in presenting our results. This makes the results more understandable to most people.

Decision Making

EDA can aid in making decisions backed up by data. This gives policymakers a chance to make informed decisions that tend to be more effective and achievable.

Conclusion

EDA is a critical step in data processing for getting insights about data sets. These insights support data-driven decisions, which tend to be more effective.

It's essential for a data analyst to master EDA in order to help policymakers understand past and current events and the best practices for future ones.

I hope this ultimate guide serves as a valuable resource for anyone looking to improve their EDA skills.

You can check a sample EDA procedure in this GitHub repository and never feel shy about asking for guidance.

Python 101: Introduction to Python for Data Science

pexpeter — Sat, 18 Feb 2023 15:14:19 +0000

Python Definition

Python is an interpreted, high-level, general-purpose programming language. It was first released in 1991 by Guido van Rossum and has since become one of the most popular programming languages in the world. Its syntax has made it popular, as it's easy to learn and use.

Advantages of Python for Data Analysis

There are several programming languages and tools preferred for data analysis, including R, Stata and SAS.

Python is often preferred over these because of:

Ease of use: Its syntax makes it easy to learn, write, and maintain code, even for beginners.
Range of libraries: Python has a large number of libraries that provide a range of functionalities for data analysis, such as NumPy, Pandas, and Matplotlib.
Open-source: Python is open-source, which means that it is freely available and can be used and modified by anyone.

Installing Python

The Python installer can be downloaded for different operating systems from its official website, python.org. I will mainly use Windows for this article.

After installing Python, you have to choose an IDE (Integrated Development Environment), which is a software application that provides a comprehensive environment for software development.

Common IDEs for data science are:

PyCharm
Spyder
Visual Studio Code
Jupyter Notebooks
IDLE (comes with the Python installation)

First Code

A successful setup of your code writing environment means you are ready to code. Python for data science requires several basic libraries to simplify your coding. They can be installed using pip, Python's package manager, by running the following in your command prompt (cmd):

NumPy

pip install numpy

Pandas

pip install pandas

Matplotlib

pip install matplotlib

Scikit-Learn

pip install scikit-learn

These libraries make it easier to work with tabular data (Pandas), arrays (NumPy), visualizations (Matplotlib) and machine learning (scikit-learn). As you advance, you will come across more libraries that come in handy in your data science projects.
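One way to confirm that the installs worked is to ask Python whether it can locate each package. The sketch below uses the standard library's `importlib.util.find_spec`; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the libraries installed above
# (scikit-learn is imported as "sklearn")
libraries = ["numpy", "pandas", "matplotlib", "sklearn"]

# find_spec returns None when a top-level package cannot be found
installed = {
    name: importlib.util.find_spec(name) is not None for name in libraries
}

for name, ok in installed.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```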

Using Python Libraries for Data Science

The installed libraries are not usable until they, or modules within them, are imported. This is done with the statements import library and from library import module. Example:

# we use `as` to give a library a short alias and simplify our code
# pandas library
import pandas as pd
# numpy library
import numpy as np
# matplotlib library
import matplotlib.pyplot as plt
# scikit-learn library
from sklearn.pipeline import make_pipeline

As you may have noted, from is used to import a certain method or module from a library depending on the project you are working on.
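The difference between the two import styles can be tried out with modules from Python's standard library, so nothing extra needs to be installed. The numbers used are arbitrary.

```python
# Import the whole module and give it a short alias
import statistics as st

# Import a single function from a module
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(data)   # accessed through the alias
root = sqrt(16)        # used directly, no prefix needed

print(mean, root)      # 5 4.0
```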

Python Syntax

Operators in Python for Data Science

Arithmetic operators: used for performing arithmetic operations such as addition (+), subtraction (-), multiplication (*), division (/), and modulus (%).
Comparison operators: used for comparing two values and returning a Boolean value (True or False). They include equal to (==), not equal to (!=), greater than (>), less than (<), greater than or equal to (>=) and less than or equal to (<=).
Logical operators: used for combining Boolean values and returning a Boolean result. In Python these are the keywords and, or, and not.
Assignment operators: used for assigning a value to a variable. The augmented forms also perform an operation on the variable at the same time. These include:

a = 5
a += 3    # equivalent to a = a + 3
a -= 2    # equivalent to a = a - 2
a *= 4    # equivalent to a = a * 4
a /= 2    # equivalent to a = a / 2
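The short script below exercises each group of operators. The variable names are made up for illustration; note that `/` always produces a float, so `a` ends up as `12.0` rather than `12`.

```python
# Arithmetic operators
print(7 + 2, 7 - 2, 7 * 2)   # 9 5 14
print(7 / 2)                 # 3.5 (division always returns a float)
print(7 % 2)                 # 1 (remainder)

# Comparison operators return Booleans
is_adult = 20 >= 18          # True

# Logical operators use the keywords and, or, not
has_id = False
can_enter = is_adult and has_id      # False
needs_check = is_adult or has_id     # True
blocked = not can_enter              # True

# Augmented assignment
a = 5
a += 3    # 8
a -= 2    # 6
a *= 4    # 24
a /= 2    # 12.0, now a float
print(a)
```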

Python Data Structures

Python comes with built-in data structures that enable data scientists to store and manipulate data sets. They are the foundation that makes it easy to integrate with the data science libraries.
The most common data structures are:

Lists:

A list is a collection of ordered elements, which can be of any data type. Example:

mylist = [1,2,3,4]

Tuples:

A tuple is a collection of ordered elements, similar to a list. However, tuples are immutable, which means that once a tuple is created, its elements cannot be modified. Example:

mytuple = (1, 2, 3, 4, 5)

Dictionaries:

A dictionary is a collection of key-value pairs, where each key is associated with a value. Since Python 3.7, dictionaries preserve insertion order. They are mutable, which means that you can add, remove, or modify key-value pairs in a dictionary. Example:

mydict = {"a": 2, "b": 3, "c": 4}

These are the most commonly used data structures but others include sets and arrays.
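The following sketch shows what mutability means in practice for each structure. The flag `tuple_changed` is just a helper name for this example.

```python
# Lists are ordered and mutable
mylist = [1, 2, 3, 4]
mylist.append(5)          # add to the end
mylist[0] = 10            # replace an element

# Tuples are ordered but immutable
mytuple = (1, 2, 3, 4, 5)
try:
    mytuple[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False  # item assignment is not allowed

# Dictionaries map keys to values and are mutable
mydict = {"a": 2, "b": 3, "c": 4}
mydict["d"] = 5           # add a pair
mydict["a"] = 20          # modify a value
del mydict["b"]           # remove a pair

# Sets keep only unique elements
myset = set([1, 1, 2, 3, 3])

print(mylist, tuple_changed, mydict, myset)
```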

Conclusion

Data science involves a lot of projects, from data collection to machine learning. The kind of project will dictate the libraries to use and the code to write. The most common data sources are APIs, flat files (Excel, CSV), structured databases (SQL), unstructured databases (MongoDB), and sometimes a mix of these.

Python offers easy integration with these data sources, e.g. the pymongo library for MongoDB databases, sqlite3 for SQLite databases, and pandas for flat files (Excel, CSV, etc.).

Examples:
-Pymongo

from pymongo import MongoClient
# connect to a MongoDB server running on this machine (default port)
client = MongoClient(host="localhost", port=27017)

-sqlite3

import sqlite3
# the two lines below are Jupyter magics and assume the ipython-sql extension is installed
%load_ext sql
%sql sqlite:///path

-pandas

import pandas as pd
df = pd.read_csv(filepath)
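Outside a notebook, the sqlite3 module can be used directly from plain Python. The sketch below works against a temporary in-memory database, so it runs without any existing file; the table name and rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, for illustration only
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)
conn.commit()

# Aggregate with SQL and fetch the result as a list of tuples
rows = cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)  # [('east', 150.0), ('west', 250.0)]
```

To work with a database file on disk, pass its path to `sqlite3.connect` instead of `":memory:"`.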

The kind of data will also determine the type of code to write and the libraries to install and use.

It's always advisable to come up with a clear plan of how to handle your project, to avoid choosing the wrong methods or libraries.

Data science with Python is fun and, with dedication, easy to learn.

Thank you. For any clarification, feel free to reach out.