<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stephen Ndichu</title>
    <description>The latest articles on Forem by Stephen Ndichu (@s_ndichu).</description>
    <link>https://forem.com/s_ndichu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1172538%2F6670ae26-f353-4b96-b6df-a7ff5f1fb5ca.jpg</url>
      <title>Forem: Stephen Ndichu</title>
      <link>https://forem.com/s_ndichu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/s_ndichu"/>
    <language>en</language>
    <item>
      <title>Understanding your data: The Essentials of Exploratory Data Analysis (EDA).</title>
      <dc:creator>Stephen Ndichu</dc:creator>
      <pubDate>Fri, 16 Aug 2024 12:02:59 +0000</pubDate>
      <link>https://forem.com/s_ndichu/understanding-your-data-the-essentials-of-exploratory-data-analysis-eda-1hh7</link>
      <guid>https://forem.com/s_ndichu/understanding-your-data-the-essentials-of-exploratory-data-analysis-eda-1hh7</guid>
      <description>&lt;p&gt;Once data has been collected and stored, there's need for its analysis to derive meaningful understanding of it. It is for this reason that exploratory data analysis (EDA) comes into play. As the name suggests, we are &lt;strong&gt;'exploring'&lt;/strong&gt; the data i.e. getting a general overview of it. &lt;/p&gt;

&lt;p&gt;The data collected may be text, video or images, and will often be stored in an unstructured manner. Rarely will you find data that is 100% clean, i.e. without any anomalies. Additionally, data may come in various formats such as &lt;strong&gt;Excel, CSV (comma-separated values), JSON and Parquet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the world of data, EDA goes hand in hand with &lt;strong&gt;data manipulation&lt;/strong&gt; and &lt;strong&gt;data cleaning&lt;/strong&gt;. Practitioners in the industry emphasize the importance of cleaning data to remove &lt;strong&gt;'junk'&lt;/strong&gt;, as junk data can negatively impact both results and predictions. Structured data, usually in tabular format, can be analysed using several techniques and tools (like Excel, Power BI and SQL), but we will focus on Python for this illustration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EDA using Python&lt;/strong&gt;&lt;br&gt;
The Python programming language is one of the most widely used tools for EDA owing to its versatility, which allows it to be used across multiple industries, be it finance, education, healthcare, mining or hospitality, among others.&lt;br&gt;
Libraries such as Pandas and NumPy (installed separately rather than built into Python) are highly effective in this regard and work across the board, whether you're using &lt;strong&gt;Anaconda/Jupyter Notebook, Google Colab, or an IDE like Visual Studio Code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Below are the common steps, and the code to execute, when performing EDA:&lt;/p&gt;

&lt;p&gt;First, import the Python libraries necessary for manipulation/analysis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secondly, load the dataset:&lt;br&gt;
&lt;strong&gt;df = pd.read_excel('File path')&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;strong&gt;df&lt;/strong&gt; is the conventional variable name for a Pandas DataFrame, the tabular structure the data is loaded into.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once loaded, you can preview the data using the code:&lt;br&gt;
&lt;strong&gt;df.head()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This will show the first 5 rows of the dataset.&lt;br&gt;
Alternatively, you can simply run &lt;strong&gt;df&lt;/strong&gt;, which will show a select few rows (from both the top and bottom) of the dataset as well as all its columns.&lt;/p&gt;

&lt;p&gt;Thirdly, understand all the datatypes using:&lt;br&gt;
&lt;strong&gt;df.info()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Datatypes include &lt;strong&gt;integers&lt;/strong&gt; (whole numbers), &lt;strong&gt;floats&lt;/strong&gt; (decimals) and &lt;strong&gt;objects&lt;/strong&gt; (typically text/qualitative data).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At this step, it's advisable to get summary statistics of the data using:&lt;br&gt;
&lt;strong&gt;df.describe()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This will give you stats like the &lt;strong&gt;count, mean, standard deviation, maximum/minimum values and the quartiles&lt;/strong&gt; for each numeric column.&lt;/p&gt;

&lt;p&gt;Fourthly, identify whether null values exist in the dataset using:&lt;br&gt;
&lt;strong&gt;df.isnull().sum()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;strong&gt;df.isnull()&lt;/strong&gt; on its own returns a table of True/False values; chaining &lt;strong&gt;.sum()&lt;/strong&gt; gives a count of nulls per column.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This can then be followed by checking for duplicates (repeated entries):&lt;br&gt;
&lt;strong&gt;df.duplicated().sum()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Other key aspects of EDA are checking how the various variables in a dataset relate to each other (&lt;strong&gt;correlation&lt;/strong&gt;) and how they are &lt;strong&gt;distributed&lt;/strong&gt;.&lt;br&gt;
Correlation can be positive or negative and ranges from -1 to 1. The code is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;df.corr(numeric_only=True)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;strong&gt;numeric_only=True&lt;/strong&gt; restricts the calculation to numeric columns; in recent versions of Pandas, calling &lt;strong&gt;df.corr()&lt;/strong&gt; on a dataset with text columns raises an error.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: A correlation figure close to &lt;strong&gt;1&lt;/strong&gt; indicates a &lt;strong&gt;strong positive correlation&lt;/strong&gt;, while a figure close to &lt;strong&gt;-1&lt;/strong&gt; indicates a &lt;strong&gt;strong negative correlation&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A distribution check looks at how &lt;strong&gt;symmetrical&lt;/strong&gt; or &lt;strong&gt;asymmetrical&lt;/strong&gt; the data is, i.e. its &lt;strong&gt;skewness&lt;/strong&gt;. Common distributions include the normal, binomial, Bernoulli and Poisson distributions.&lt;/p&gt;
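
&lt;p&gt;As a minimal sketch of a distribution check, the snippet below computes skewness with Pandas on a small made-up sample (the column name &lt;strong&gt;sales&lt;/strong&gt; is purely illustrative):&lt;/p&gt;

```python
import pandas as pd

# A small, made-up sample: values placed symmetrically around 30
df = pd.DataFrame({"sales": [10, 20, 30, 40, 50]})

# Skewness near 0 suggests a roughly symmetrical distribution;
# positive values indicate a right tail, negative values a left tail
skewness = df["sales"].skew()
print(skewness)
```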

&lt;p&gt;In summary, exploratory data analysis is an important process in gaining a better understanding of the data. It allows for better visualizations and model building.&lt;/p&gt;

</description>
      <category>python</category>
      <category>excel</category>
      <category>powerbi</category>
      <category>sql</category>
    </item>
    <item>
      <title>The Ultimate Guide to Data Analytics: Techniques and Tools.</title>
      <dc:creator>Stephen Ndichu</dc:creator>
      <pubDate>Fri, 16 Aug 2024 09:15:43 +0000</pubDate>
      <link>https://forem.com/s_ndichu/the-ultimate-guide-to-data-analytics-techniques-and-tools-1c0b</link>
      <guid>https://forem.com/s_ndichu/the-ultimate-guide-to-data-analytics-techniques-and-tools-1c0b</guid>
      <description>&lt;p&gt;With the ever-increasing advancements in technology in our day to day lives, there's an increase in the amount of data being generated. This has necessitated the need for data analysts and data scientists to analyse the raw data so as to make sense of it.&lt;br&gt;
There are a variety of specializations in the data industry that beginners need to familiarize themselves with before deciding which one suits them best. They include &lt;strong&gt;data architects, data analysts, data engineers, machine learning engineers, business analysts, analytics engineers&lt;/strong&gt; among others.&lt;br&gt;
Globally, more and more people are seeking to join the field of data science and analytics. There is an abundance of resources that beginners can tap into as they start their journey into data. Below is a guide for those looking to venture into data science and analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data collection/gathering&lt;/strong&gt;&lt;br&gt;
Usually, most organizational data will be stored in databases in structured formats. However, data can be unstructured or semi-structured depending on how it's collected and stored. It is therefore useful for data analysts to be knowledgeable about &lt;strong&gt;data structures&lt;/strong&gt; and &lt;strong&gt;data warehousing&lt;/strong&gt;.&lt;br&gt;
Additionally, data analysts may be required to extract data from websites and structure it in a manner that allows for meaningful analysis. Some tools used in &lt;strong&gt;web scraping&lt;/strong&gt; include &lt;strong&gt;Beautiful Soup and Selenium&lt;/strong&gt;.&lt;/p&gt;
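
&lt;p&gt;As a minimal sketch of what Beautiful Soup does (assuming the &lt;strong&gt;beautifulsoup4&lt;/strong&gt; package is installed), the snippet below parses a small inline HTML fragment instead of a live website; the &lt;strong&gt;price&lt;/strong&gt; class is purely illustrative:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a scraped page
html = """
<ul>
  <li class="price">100</li>
  <li class="price">250</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Extract the text of every list item carrying the 'price' class
prices = [int(tag.get_text()) for tag in soup.find_all("li", class_="price")]
print(prices)
```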

&lt;p&gt;&lt;strong&gt;Data Manipulation&lt;/strong&gt;&lt;br&gt;
This is usually a vital step in data analytics and goes hand in hand with &lt;strong&gt;exploratory data analysis (EDA)&lt;/strong&gt;. While manipulating data, it's important to note that data exists in various forms (text, videos or images). Here, data is cleaned, processed and transformed with the aim of identifying trends and getting a general outlook of the data. With this, data can be grouped, missing values identified and unwanted values removed.&lt;br&gt;
Some key tools for this are &lt;strong&gt;MS Excel and Google Sheets&lt;/strong&gt;, which are universally used in the business sector for computation and analysis. They offer a wide range of inbuilt functions necessary for manipulating data.&lt;/p&gt;
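
&lt;p&gt;The manipulation steps described above (grouping data, identifying missing values, removing unwanted entries) can be sketched in Pandas with a made-up dataset; the column names are purely illustrative:&lt;/p&gt;

```python
import pandas as pd

# Made-up sales records; None marks a missing value
df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "sales": [100, 200, None, 400],
})

missing = df["sales"].isnull().sum()             # count missing values
clean = df.dropna()                              # remove rows with missing entries
totals = clean.groupby("region")["sales"].sum()  # group and aggregate
print(missing, totals.to_dict())
```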

&lt;p&gt;&lt;strong&gt;Maths and Statistics&lt;/strong&gt;&lt;br&gt;
Data is either &lt;strong&gt;quantitative&lt;/strong&gt; or &lt;strong&gt;qualitative&lt;/strong&gt; i.e., numeric or non-numeric. Data specialists should be able to perform both simple and complex mathematical and statistical operations on data sets to identify historical trends and predict future outcomes.&lt;br&gt;
For these, beginners should know basics of &lt;strong&gt;descriptive statistics, inferential statistics and probability&lt;/strong&gt;.&lt;/p&gt;
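
&lt;p&gt;A few of these descriptive-statistics basics can be computed with Python's built-in &lt;strong&gt;statistics&lt;/strong&gt; module; the scores below are a made-up sample:&lt;/p&gt;

```python
import statistics

scores = [70, 75, 80, 85, 90]  # made-up sample

mean = statistics.mean(scores)      # central tendency
median = statistics.median(scores)  # middle value
stdev = statistics.stdev(scores)    # sample standard deviation (spread)
print(mean, median, round(stdev, 2))
```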

&lt;p&gt;&lt;strong&gt;Programming&lt;/strong&gt;&lt;br&gt;
Whereas there are many programming languages that can be used in data analytics, beginners should at least know the fundamentals of &lt;strong&gt;R and Python&lt;/strong&gt;. These are easy to grasp and use in addition to being open source.&lt;br&gt;
Python, in particular, is widely used across different sectors/industries and is supported by a rich ecosystem of libraries for data analysis and visualization. These include &lt;strong&gt;Pandas, NumPy, Matplotlib and Seaborn&lt;/strong&gt;, among others.&lt;br&gt;
Additionally, people starting out in data analytics should learn &lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt;, which is used to manage and query relational databases.&lt;/p&gt;
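
&lt;p&gt;SQL and Python also work well together. The sketch below uses Python's built-in &lt;strong&gt;sqlite3&lt;/strong&gt; module with an in-memory database and a made-up table, so it is fully self-contained:&lt;/p&gt;

```python
import sqlite3

# In-memory database so the example is self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100), ("West", 200), ("East", 300)],
)

# Aggregate sales per region with a GROUP BY query
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
conn.close()
```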

&lt;p&gt;&lt;strong&gt;Visualizations&lt;/strong&gt;&lt;br&gt;
In addition to the Matplotlib and Seaborn libraries, other programs may be used to create graphs and tables that are easily understandable to all users.&lt;br&gt;
The most common tools are &lt;strong&gt;Tableau, Power BI and Looker&lt;/strong&gt;, which are useful for developing visualizations that derive insights and enable better decision making for stakeholders.&lt;br&gt;
Another emerging aspect of visualization is &lt;strong&gt;data physicalisation&lt;/strong&gt;. This involves the use of tangible interfaces and shape-changing displays, which can be especially useful for visually impaired stakeholders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;&lt;br&gt;
The patterns and trends identified during data manipulation and analysis can be used to make future predictions. This is what machine learning (ML) entails. Machine learning engineers use data and algorithms to train models in a variety of ways: supervised, unsupervised, semi-supervised or reinforcement learning.&lt;br&gt;
Popular Python libraries for ML (installed separately rather than built in) include &lt;strong&gt;scikit-learn and TensorFlow&lt;/strong&gt;.&lt;/p&gt;
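
&lt;p&gt;As a minimal supervised-learning sketch (assuming scikit-learn is installed), the snippet below fits a linear regression on made-up data following y = 2x + 1 and predicts an unseen value:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

# Supervised learning: fit a model on known inputs and outputs,
# then predict the output for an unseen input
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[5]]))[0]
print(round(prediction, 2))  # expect a value near 11
```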

&lt;p&gt;In a nutshell, data analytics encompasses many tools and techniques which complement each other. The ability to work with these tools enhances both individual and organizational capabilities when handling and interpreting data.&lt;br&gt;
It is worth remembering that data analytics also requires skills in problem solving, communication and domain knowledge.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analytics</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
