<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Qasim H. (aiwithqasim 🚀)</title>
    <description>The latest articles on Forem by Qasim H. (aiwithqasim 🚀) (@aiwithqasim).</description>
    <link>https://forem.com/aiwithqasim</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F741750%2Ffb7c33a6-ded5-47eb-8e87-6e02917b0a65.png</url>
      <title>Forem: Qasim H. (aiwithqasim 🚀)</title>
      <link>https://forem.com/aiwithqasim</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aiwithqasim"/>
    <language>en</language>
    <item>
      <title>Batch Processing using PySpark on AWS EMR</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Sat, 11 Nov 2023 04:01:22 +0000</pubDate>
      <link>https://forem.com/aws-builders/batch-processing-using-pyspark-on-aws-emr-59n4</link>
      <guid>https://forem.com/aws-builders/batch-processing-using-pyspark-on-aws-emr-59n4</guid>
      <description>&lt;p&gt;Are you a Data Engineer and want to do hands-on on AWS services? This blog is about batch data processing using AWS services, you will learn to do batch processing using AWS services: S3 as storage,  EMR as processing cluster, and Athena for querying the processed results. &lt;/p&gt;

&lt;h3&gt;
  
  
  About Batch Data Pipeline:
&lt;/h3&gt;

&lt;p&gt;The Wikipedia Activity data will be put into a folder in the S3 bucket. We will have PySpark code that will run on the EMR cluster. This code will fetch the data from the S3  bucket, perform filtering and aggregation on this data, and push the processed data back into  S3 in another folder. We will then use Athena to query this processed data present in S3. We will create a table on top of the processed data by providing the relevant schema and then use  ANSI SQL to query the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Diagram:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxpb963tyc7i5rf2qbw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxpb963tyc7i5rf2qbw4.png" alt="Architecture Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages&lt;/strong&gt; - Python &lt;br&gt;
&lt;strong&gt;Package&lt;/strong&gt; - PySpark &lt;br&gt;
&lt;strong&gt;Services&lt;/strong&gt; - AWS EMR, AWS S3, AWS Athena.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset:
&lt;/h3&gt;

&lt;p&gt;We'll be using the Wikipedia activity logs JSON dataset, which has a large payload comprising 15+ fields.&lt;/p&gt;

&lt;p&gt;NOTE: In our script we'll keep only those payloads where &lt;strong&gt;&lt;em&gt;isRobot&lt;/em&gt;&lt;/strong&gt; is &lt;strong&gt;&lt;em&gt;False&lt;/em&gt;&lt;/strong&gt; &amp;amp; the user's &lt;strong&gt;&lt;em&gt;country&lt;/em&gt;&lt;/strong&gt; is the &lt;strong&gt;&lt;em&gt;United States&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7osv4y1on84ssmjmqdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7osv4y1on84ssmjmqdx.png" alt="dataset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps of Working:
&lt;/h3&gt;

&lt;p&gt;1- Create an S3 bucket with a suitable name, e.g., &lt;em&gt;emr-batchprocessing-raw-useast1-Account ID-dev&lt;/em&gt;, &amp;amp; inside the bucket create the following folders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;input-source&lt;/em&gt;&lt;/strong&gt; (upload your dataset here; this is the source folder),&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;output-destination&lt;/em&gt;&lt;/strong&gt; (the data processed on AWS EMR will be written here for further querying), &amp;amp;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;logs&lt;/em&gt;&lt;/strong&gt; (AWS EMR logs will be saved here; we'll specify this directory during EMR creation)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6flo8wg7q99a7tze3rqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6flo8wg7q99a7tze3rqx.png" alt="S3 bucket Creation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2- Go to EC2 key pairs &amp;amp; create a key pair; we'll use it during EMR cluster creation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb8ruqe3pot7lzp0kt1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb8ruqe3pot7lzp0kt1l.png" alt="EC2 key pair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3- Now we have to create an AWS EMR cluster. Go to AWS EMR from the AWS Console &amp;amp; choose &lt;strong&gt;EMR on EC2: Clusters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv9pyw9bva8jy8lk1cei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv9pyw9bva8jy8lk1cei.png" alt="EMR clsuter creation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4- During cluster creation, provide a suitable name (I used "&lt;strong&gt;&lt;em&gt;emr-batch-processing&lt;/em&gt;&lt;/strong&gt;") &amp;amp; select &lt;strong&gt;&lt;em&gt;Spark&lt;/em&gt;&lt;/strong&gt; as the processing engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvt7tbt34asgeim5rau1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvt7tbt34asgeim5rau1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5- EMR stands for Elastic MapReduce &amp;amp; works on the principle of distributed processing, so we need a &lt;strong&gt;&lt;em&gt;master node&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;worker/core nodes&lt;/em&gt;&lt;/strong&gt; for processing. &lt;em&gt;Note: I removed the Task nodes here during creation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl012n2lxrmpjzu75aoou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl012n2lxrmpjzu75aoou.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6- For cluster scaling and provisioning, let's go with 2 &lt;strong&gt;&lt;em&gt;worker/core nodes&lt;/em&gt;&lt;/strong&gt;, since our workload is minimal and this keeps the setup realistic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw161ld0h9jprpqkdgth7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw161ld0h9jprpqkdgth7.png" alt="EMR Scaling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NOTE: Keep the default settings for &lt;strong&gt;&lt;em&gt;Networking&lt;/em&gt;&lt;/strong&gt; &amp;amp; &lt;strong&gt;&lt;em&gt;Cluster termination&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;7- Select the EC2 key pair that we created in Step 2 so that we can SSH in from a terminal.&lt;/p&gt;

&lt;p&gt;8- For cluster logs, choose the &lt;em&gt;logs&lt;/em&gt; folder that we created during the bucket creation step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rr688eygvvs26e2xr7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rr688eygvvs26e2xr7t.png" alt="EMR logs directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;9- Lastly, we need to create the &lt;strong&gt;&lt;em&gt;Amazon EMR service role&lt;/em&gt;&lt;/strong&gt; for Identity and Access Management (IAM) &amp;amp; similarly an &lt;strong&gt;&lt;em&gt;Instance Profile&lt;/em&gt;&lt;/strong&gt; for the EC2 instances that Amazon EMR launches. After that, review the steps &amp;amp; click Create cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4jdx6baqc9p7qwg4na6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4jdx6baqc9p7qwg4na6.png" alt="Service Role &amp;amp; Instance Profile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;10- Let's create the script that we want to run on our EMR cluster. NOTE: The code given below is only the filtering script; for the full scripts please refer to GitHub.&lt;/p&gt;

&lt;p&gt;Github link: &lt;a href="https://github.com/aiwithqasim/emr-batch-processing" rel="noopener noreferrer"&gt;https://github.com/aiwithqasim/emr-batch-processing&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from pyspark.sql import SparkSession

S3_DATA_INPUT_PATH="&amp;lt;&amp;lt;bucket link to source dataset&amp;gt;&amp;gt;"
S3_DATA_OUTPUT_PATH_FILTERED="&amp;lt;&amp;lt;bucket link to output folder&amp;gt;&amp;gt;/filtered"

def main():
    spark = SparkSession.builder.appName('EMRBatchProcessing').getOrCreate()
    df = spark.read.json(S3_DATA_INPUT_PATH)
    print(f'The total number of records in the source data set is {df.count()}')
    filtered_df = df.filter((df.isRobot == False) &amp;amp; (df.countryName == 'United States'))
    print(f'The total number of records in the filtered data set is {filtered_df.count()}')
    filtered_df.show(10)
    filtered_df.printSchema()
    filtered_df.write.mode('overwrite').parquet(S3_DATA_OUTPUT_PATH_FILTERED)
    print('The filtered output is uploaded successfully')

if __name__ == '__main__':
    main()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;11- Make sure the EMR cluster you created has SSH port 22 open for cluster connection from a local &lt;strong&gt;&lt;em&gt;terminal&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;PuTTY&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;12- Connect to your AWS EMR EC2 instance using the connection command as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqms0mh5zelipr72ihe8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqms0mh5zelipr72ihe8a.png" alt="connectin to AWS EMR"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;13- Create the main script (containing the code we created above) on the cluster and submit it using &lt;strong&gt;&lt;em&gt;spark-submit main.py&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86tw8x268koc62i6xdcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86tw8x268koc62i6xdcb.png" alt="creating &amp;amp; submitting spark script"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;14- After completion, validate that the code ran successfully &amp;amp; that the terminal has printed the schema as shown below,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g6ms4zviqs014xjhg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75g6ms4zviqs014xjhg4.png" alt="code completetion Validation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and that the S3 bucket has the processed data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1pfhx9z7b0ixh24e6ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1pfhx9z7b0ixh24e6ck.png" alt="Prcessed data in S3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;15- Go to the AWS Athena query editor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a table using the data in the S3 output folder (processed data)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure the &lt;strong&gt;&lt;em&gt;table&lt;/em&gt;&lt;/strong&gt; &amp;amp; &lt;strong&gt;&lt;em&gt;database&lt;/em&gt;&lt;/strong&gt; are created properly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a new &lt;strong&gt;&lt;em&gt;query&lt;/em&gt;&lt;/strong&gt; to select data from the table you created&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure the query returns the result properly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bd8vfybn3jpboqk7ei3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bd8vfybn3jpboqk7ei3.png" alt="Querying using Athena"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In a professional data engineering career, you have various scenarios where data gets collected every day. The data can be processed once a day, i.e., batch processed, and the processed results are stored in a location to derive insights and take appropriate action based on the insights. In this blog, we have implemented a batch-processing pipeline using AWS services. We have taken a day’s worth of data related to Wikipedia, and performed batch processing on it.&lt;/p&gt;

&lt;p&gt;For more such content please follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/qasimhassan/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/qasimhassan/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/aiwithqasim" rel="noopener noreferrer"&gt;https://github.com/aiwithqasim&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://chat.whatsapp.com/LoLXrRI18lPJlLiDK7Mson" rel="noopener noreferrer"&gt;AWS Data Engineering WhatsApp&lt;/a&gt; Group &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datapipeline</category>
      <category>pyspark</category>
      <category>dataengineering</category>
      <category>amazonwebservices</category>
    </item>
    <item>
      <title>What is Databricks Lakehouse?</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Tue, 04 Oct 2022 08:09:52 +0000</pubDate>
      <link>https://forem.com/aiwithqasim/what-is-databricks-lakehouse--35cp</link>
      <guid>https://forem.com/aiwithqasim/what-is-databricks-lakehouse--35cp</guid>
      <description>&lt;p&gt;In this blog, I’ll introduce you to the Databricks Lakehouse platform &amp;amp; discuss some of the problems that Lakehouse addresses. Databricks is a data and AI company. They provide the first Lakehouse which offers one simple platform to unify all your data analytics and AI workloads.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J4QHCLir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wg4x7dfalmqbdvdrobow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J4QHCLir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wg4x7dfalmqbdvdrobow.png" alt="what is Databricks Lakehouse platform" width="649" height="396"&gt;&lt;/a&gt;&lt;br&gt;
Challenges that most organizations have with data generally start with the architecture. Note in Fig2 below the different isolated stacks for Data Warehousing, Data Engineering &amp;amp; Streaming, and Data Science and Machine Learning. Having to develop and maintain very different technology stacks can result in a lot of complication and confusion; often the underlying technologies don't work well together, and it's difficult to maintain the overall ecosystem. &lt;br&gt;
Secondly, a lot of different tools power each of these architectures, and it ends up being a huge slew of open-source tools that you have to connect. In the Data Warehousing stack you're often dealing with proprietary data formats, and if you want to enable advanced use cases you have to move the data across the stacks, which ends up being expensive and resource-intensive to manage. All of this complexity ends up being felt by your data teams: both the people trying to query and analyze your data, and those responsible for maintaining these systems. Because the systems are siloed, the teams become siloed too, so communication slows down, hindering innovation and speed, and different teams often end up with different versions of the truth. We end up with many copies of data, no consistent security or governance model, closed systems, and disconnected, less productive teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bo10LZlK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mrrsnqxzt0srwsq6dfw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bo10LZlK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mrrsnqxzt0srwsq6dfw4.png" alt="Data roles, tools &amp;amp; responsibilities" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
The core problem lies in the technologies these stacks are built upon. The solution to this is the Data Lakehouse. Data Lakes and Data Warehouses have complementary but different benefits that have led both to exist in most enterprise environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--csaUr7aU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zoh1wibmz17lsrbp7ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--csaUr7aU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zoh1wibmz17lsrbp7ej.png" alt="Delta Lake &amp;amp; Data warehouse" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Lakes&lt;/strong&gt;, do a great job supporting machine learning, they have open formats and a big ecosystem, but they have poor support for business intelligence and suffer from complex data Quality problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, &lt;strong&gt;Data Warehouses&lt;/strong&gt; are great for business intelligence applications, but they have limited support for machine learning workloads, and they are proprietary systems with only a SQL interface.&lt;br&gt;
Unifying these systems can be transformational in how we think about data. Therefore, Databricks' culture of innovation and dedication to open source brought about the Lakehouse, one platform to unify all of your data analytics and AI.&lt;br&gt;
While it's an oversimplification, a good starting point for talking about the Lakehouse is to describe it as enabling the design patterns and use cases associated with Data Warehouses on data stored in an open format in economical cloud object storage, also known as a Data Lake.&lt;br&gt;
With proper planning and implementation, a Lakehouse can provide organizations of any size with &lt;strong&gt;&lt;em&gt;robust&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;secure&lt;/em&gt;&lt;/strong&gt; &amp;amp; &lt;strong&gt;&lt;em&gt;scalable systems&lt;/em&gt;&lt;/strong&gt; that drastically lower total operating costs while increasing the simplicity of system maintenance and reducing the time to reach actionable insights.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uxTckcb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pmcq4hm9eor7y2ya0f1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uxTckcb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pmcq4hm9eor7y2ya0f1f.png" alt="Delta Lake" width="880" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the heart of the Databricks Lakehouse is the Delta Lake format. Delta Lake provides the ability to build curated Data Lakes that add the reliability, performance, and governance you expect from Data Warehouses directly to your existing Data Lake. You gain reliability with ACID transactions: you can now be sure that all operations on the Data Lake either fully succeed or fail, with the ability to easily time travel backward to understand every change made to your data. Additionally, Delta Lake is underpinned by Apache Spark and utilizes advanced caching and indexing methods, which allows you to process and query data on your Data Lake extremely quickly at scale. And finally, Databricks provides support for fine-grained access control lists to give you much more control over who can access what data in your Data Lake. &lt;/p&gt;
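&lt;p&gt;To make the ACID and time-travel points concrete, here is a minimal sketch; it assumes a SparkSession already configured for Delta (e.g. via the delta-spark package), and the table path is a placeholder.&lt;/p&gt;

```python
# Sketch only: assumes Delta Lake is available on the cluster (e.g. delta-spark).
from pyspark.sql import SparkSession

TABLE_PATH = '/tmp/events_delta'  # placeholder path

spark = SparkSession.builder.appName('DeltaDemo').getOrCreate()

# Writes are ACID: a reader never sees a half-finished commit.
df = spark.range(100).withColumnRenamed('id', 'event_id')
df.write.format('delta').mode('overwrite').save(TABLE_PATH)

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format('delta').option('versionAsOf', 0).load(TABLE_PATH)
```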

&lt;p&gt;Now, with this foundation, let's look at the Lakehouse built on top of it. The Databricks Lakehouse platform is unique in three ways. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Mfy8eL8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hb9qi8ox5k1jjjmcmmad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Mfy8eL8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hb9qi8ox5k1jjjmcmmad.png" alt="Benefits of Lakehouse" width="880" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's simple: data only needs to exist once to support all of your data workloads on one common platform. &lt;/li&gt;
&lt;li&gt;It's open: it's based on open source and open standards, making it easy to work with existing tools and avoid proprietary formats. &lt;/li&gt;
&lt;li&gt;It's collaborative: your Data Engineers, Analysts, and Data Scientists can work together much more easily. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's explore this a little more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--whePPPuO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivvxin5c949bjbiiorya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--whePPPuO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivvxin5c949bjbiiorya.png" alt="Lakehouse: Simple" width="880" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a Lakehouse, much more of your data can remain in your Data Lake rather than having to be copied into other systems. You no longer need separate data architectures to support your workloads across Analytics, Data Engineering &amp;amp; Streaming, and Data Science; Databricks provides the capabilities and workspaces to handle all of these workloads. This gives you one common way to get things done across all of the major cloud providers. The unification of the Lakehouse extends beyond the architecture and workloads, as stated above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xtnci-fd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wfuawzfvkuz11n9de4jf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xtnci-fd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wfuawzfvkuz11n9de4jf.png" alt="Lakehouse: Open" width="880" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databricks invented some of the most successful open-source projects in the world. These innovations underpin everything we do, and they are born from our expertise in the space. A commitment to open source is why we believe your data should always be in your control. With the Lakehouse, there's no vendor lock-in on your data, because the Databricks Lakehouse is built on open-source technology. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1YAfxQe9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vgbbwr976t8bto56dkwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1YAfxQe9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vgbbwr976t8bto56dkwu.png" alt="Lakehouse: collaborative" width="880" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Databricks, all your data teams, from Data Engineers to Data Analysts and Scientists, can collaborate and share across all of your workloads. You can easily share data products like models, dashboards, notebooks, and datasets, and get more initiatives to production, so that you can be a data-native organization. &lt;/p&gt;

&lt;p&gt;I have read many blogs related to the Lakehouse, but most of them include a lot of marketing material. I tried to write to the point while explaining the core concepts &amp;amp; features that the Lakehouse offers. Hope you like it. If you're curious how to pass the Databricks Lakehouse Fundamentals accreditation, kindly follow the link:&lt;br&gt;
&lt;a href="https://dev.to/aiwithqasim/how-to-pass-the-databricks-lakehouse-accreditation-22a1"&gt;Databricks Lakehouse Exam Guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lakehouse</category>
      <category>databricks</category>
      <category>deltalake</category>
      <category>dataengineers</category>
    </item>
    <item>
      <title>How to pass the Databricks Lakehouse Accreditation?</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Tue, 04 Oct 2022 08:05:48 +0000</pubDate>
      <link>https://forem.com/aiwithqasim/how-to-pass-the-databricks-lakehouse-accreditation-22a1</link>
      <guid>https://forem.com/aiwithqasim/how-to-pass-the-databricks-lakehouse-accreditation-22a1</guid>
      <description>&lt;h2&gt;
  
  
  ABOUT THE ACCREDITATION
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Number of questions: 20&lt;/li&gt;
&lt;li&gt;Type of questions: Multiple choice&lt;/li&gt;
&lt;li&gt;Duration: 20 minutes&lt;/li&gt;
&lt;li&gt;Cost: Free&lt;/li&gt;
&lt;li&gt;Retries: Unlimited&lt;/li&gt;
&lt;li&gt;Passing score: 80%&lt;/li&gt;
&lt;li&gt;Expiration: 1 year&lt;/li&gt;
&lt;li&gt;Where to register for the accreditation: Databricks Academy

&lt;ul&gt;
&lt;li&gt;If you are a partner : &lt;a href="https://partner-academy.databricks.com/learn"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;If you are a customer : &lt;a href="https://customer-academy.databricks.com/lms/"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TOPICS COVERED OVERVIEW
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Databricks SQL&lt;/li&gt;
&lt;li&gt;Databricks Lakehouse platform&lt;/li&gt;
&lt;li&gt;Databricks Data Engineering and Data Science Workspace&lt;/li&gt;
&lt;li&gt;Databricks Machine Learning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  PREPARATION MATERIAL
&lt;/h2&gt;

&lt;p&gt;Some of the materials recommended by Databricks for clearing the accreditation are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's Databricks Machine Learning? : &lt;a href="https://www.youtube.com/embed/4lcQ77rei-k"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;What's Data Science and Engineering Workspace? : &lt;a href="https://www.youtube.com/embed/d8hTZhLlWQQ"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;What's Databricks SQL? : &lt;a href="https://www.youtube.com/embed/beHMFhoFzDc"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;What's the Databricks Lakehouse Platform? : &lt;a href="https://www.youtube.com/embed/BdAiu6CoLfA"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apart from those resources, if you're still curious, supporting material is attached below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed MLflow: &lt;a href="https://www.databricks.com/product/managed-mlflow"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Delta Lake: &lt;a href="https://docs.databricks.com/delta/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Workspace assets: &lt;a href="https://docs.databricks.com/workspace/workspace-assets.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Navigate the workspace: &lt;a href="https://docs.databricks.com/workspace/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Databricks Machine Learning: &lt;a href="https://docs.databricks.com/machine-learning/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Databricks runtime: &lt;a href="https://docs.databricks.com/runtime/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Introduction to Databricks AutoML: &lt;a href="https://docs.databricks.com/machine-learning/automl/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Introduction to Feature Store: &lt;a href="https://docs.databricks.com/machine-learning/feature-store/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Introduction to Databricks SQL: &lt;a href="https://docs.databricks.com/sql/index.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Introduction to the medallion architecture: &lt;a href="https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Add users and groups to a workspace: &lt;a href="https://docs.databricks.com/administration-guide/users-groups/users.html"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Databricks Compliance: &lt;a href="https://www.databricks.com/trust#compliance"&gt;Link&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ABOUT ME
&lt;/h2&gt;

&lt;p&gt;I make complex &lt;strong&gt;#dataEngineering&lt;/strong&gt; &amp;amp; &lt;strong&gt;#machinelearning&lt;/strong&gt; problems easy &amp;amp; pave the way for new learners to learn fast &amp;amp; smart. Follow me for more content &amp;amp; don't forget to share if you think the material is worthy.&lt;/p&gt;

</description>
      <category>lakehouse</category>
      <category>accreditation</category>
      <category>databricks</category>
      <category>exam</category>
    </item>
    <item>
      <title>Databricks Assets?</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Sun, 25 Sep 2022 12:13:01 +0000</pubDate>
      <link>https://forem.com/aiwithqasim/databricks-assets--3e96</link>
      <guid>https://forem.com/aiwithqasim/databricks-assets--3e96</guid>
      <description>&lt;p&gt;Talking about Data Science &amp;amp; Engineering workspace, the classic &lt;strong&gt;&lt;em&gt;Databricks&lt;/em&gt;&lt;/strong&gt; environment for collaboration among Data Scientists, Data Engineers, and Data Analysts. Just mentioning it is the backbone of Databrick’s machine learning environment and provides an environment for accessing and managing all your Databricks assets, you might be asking yourself what sorts of assets I'm talking about? The below figure shows the some of the Data assets &amp;amp; all three Personas assets in Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uXycEY_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q01hffwwrmvp96xrli2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uXycEY_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q01hffwwrmvp96xrli2y.png" alt="Databricks Assets" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sounds interesting, right? Let's address each of them in this blog, with a short description, so you can get to know them all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tClgGYC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u09hxvn8l3vsa6k543l0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tClgGYC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u09hxvn8l3vsa6k543l0.png" alt="Data Assets in Databricks" width="253" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data assets&lt;/em&gt;&lt;/strong&gt;, while accessible from the workspace, fall under the umbrella of Unity Catalog, i.e., Databricks’ data governance solution.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;These assets include &lt;em&gt;Databases&lt;/em&gt;, &lt;em&gt;Tables&lt;/em&gt;, and &lt;em&gt;Views&lt;/em&gt;. Also included under this umbrella are &lt;em&gt;catalogs&lt;/em&gt;, top-level containers that hold databases&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Storage credentials&lt;/em&gt; and &lt;em&gt;external locations&lt;/em&gt;, which manage credentials and paths for accessing files in external cloud storage containers&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Share recipients&lt;/em&gt;, which are special principals for use with Delta Sharing, the open-source, high-performance, secure data-sharing protocol developed by Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_3pvJkTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3n0pq8tazfn1mfh1x1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_3pvJkTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3n0pq8tazfn1mfh1x1p.png" alt=" Databricks SQL &amp;amp; ML Assests" width="862" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Assets like &lt;em&gt;Queries&lt;/em&gt;, &lt;em&gt;Dashboards&lt;/em&gt;, &lt;em&gt;Alerts&lt;/em&gt;, and &lt;em&gt;Endpoints&lt;/em&gt; also do not fall under the Data Science and Engineering workspace; rather, they are managed and accessed within Databricks SQL. Likewise, &lt;em&gt;Experiments&lt;/em&gt; and &lt;em&gt;Models&lt;/em&gt; are managed within Databricks Machine Learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Data Science &amp;amp; Engineering workspace is the classic Databricks environment for collaboration among Data Scientists, Data Engineers, and Data Analysts. As mentioned, it is the backbone of Databricks’ machine learning environment and provides an environment for accessing and managing all your Databricks assets. These assets are:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Notebooks&lt;/em&gt;&lt;/strong&gt;, arguably the centerpiece of the Data Science and Engineering workspace, provide a web-based interface to documents that contain a series of cells. These cells can contain: commands that operate on data in any of the languages supported by Databricks; visualizations that provide a visual interpretation of an output; narrative text that can be employed to document the notebook; and magic commands that allow you to perform higher-level operations, such as running other notebooks, invoking a database utility, and more. Notebooks are collaborative: they have a built-in history and users can exchange notes. They can be run interactively, or in batch mode as a job through Databricks Workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RzNfuHBm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdyi5bmiint31o87ycv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RzNfuHBm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdyi5bmiint31o87ycv1.png" alt="Notebook Assets" width="880" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Folders&lt;/em&gt;&lt;/strong&gt; provide a file-system-like construct within the workspace to organize workspace assets, such as notebooks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CQJZlEyo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9g680zr24fn79pvdcc0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CQJZlEyo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9g680zr24fn79pvdcc0l.png" alt="folder" width="880" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While notebooks provide a built-in history that can be very handy in many circumstances, that functionality is limited and does not go far toward fulfilling the requirements of a fully fledged CI/CD environment. &lt;strong&gt;&lt;em&gt;Repos&lt;/em&gt;&lt;/strong&gt; provide the ability to sync notebooks and files with remote Git repositories. Repos provide functionality for pushing and pulling, managing branches, viewing differences, &amp;amp; reverting changes. Repos also provide an API for integration with other CI/CD tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A5xE-LMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gdkz6ghw0aan63vrg6zs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A5xE-LMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gdkz6ghw0aan63vrg6zs.png" alt="repose" width="880" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databricks &lt;strong&gt;&lt;em&gt;Secrets&lt;/em&gt;&lt;/strong&gt; fulfill security best practices by providing secure key-value storage for sensitive information, allowing you to easily decouple such material from your code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V2eZc7B3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kveyhhqgaffa2omfqqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V2eZc7B3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kveyhhqgaffa2omfqqt.png" alt="secret" width="880" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Jobs&lt;/em&gt;&lt;/strong&gt; have the capability of automatically running tasks. Jobs can be simple, with a single task, or can be large multitask workflows with complex dependencies. Job tasks can be implemented with notebooks, pipelines, or applications written in Python, Scala, or Java, or using the spark-submit style.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ystfd-cu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2raeb0sgskd1lo2q0k56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ystfd-cu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2raeb0sgskd1lo2q0k56.png" alt="Jobs" width="880" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delta Live Tables is a framework that delivers a simplified approach for building reliable, maintainable, and testable data processing &lt;strong&gt;&lt;em&gt;pipelines&lt;/em&gt;&lt;/strong&gt;. The main unit of execution in Delta Live Tables is a pipeline, which is a directed acyclic graph linking data sources to target datasets. Pipelines are implemented using notebooks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oDsCjm5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlnk1oulk8ikbrj0ab2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oDsCjm5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tlnk1oulk8ikbrj0ab2n.png" alt="pipelines" width="880" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;&lt;em&gt;cluster&lt;/em&gt;&lt;/strong&gt; is a set of computation resources and configurations on which you run Data Engineering, Data Science, and Data Analytics workloads such as production ETL pipelines, streaming analytics, ad-hoc analytics, and Machine Learning. A workload can consist of a set of commands in a notebook or a workflow that runs jobs or pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MBvt7Fqv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lltzjherckjtqsg2i0rz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MBvt7Fqv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lltzjherckjtqsg2i0rz.png" alt="cluster" width="880" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Pools&lt;/em&gt;&lt;/strong&gt; reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use virtual machine instances. When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances, with new instances created as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_zljaJ1c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qo98agt2fxp2zdmw4a62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_zljaJ1c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qo98agt2fxp2zdmw4a62.png" alt="pools" width="880" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NOTE: Reference &amp;amp; Images in the blogs are taken from Databricks site &amp;amp; courses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is it for this blog; I hope you liked it. Kindly do follow.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>databricksassets</category>
      <category>datacatalog</category>
      <category>workspaceasset</category>
    </item>
    <item>
      <title>Using AWS for Text Classification Part-1</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Sat, 27 Aug 2022 13:42:23 +0000</pubDate>
      <link>https://forem.com/aws-builders/using-aws-for-text-classification-part-1-1hae</link>
      <guid>https://forem.com/aws-builders/using-aws-for-text-classification-part-1-1hae</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--imtYHl6j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4200/1%2APRFmXq4Hy52nUj8yPpsHdA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--imtYHl6j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4200/1%2APRFmXq4Hy52nUj8yPpsHdA.png" alt="Using AWS for Text Classification" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped develop state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for text analysis have also evolved. This necessitates a fully managed service that can be integrated into applications using API calls without the need for extensive machine learning (ML) expertise. AWS offers pre-trained AWS AI services like &lt;a href="https://aws.amazon.com/comprehend/"&gt;Amazon Comprehend&lt;/a&gt;, which can effectively handle NLP use cases involving classification, text summarization, entity recognition, and more to gather insights from text.&lt;/p&gt;

&lt;p&gt;Additionally, online conversations have led to a widespread phenomenon of non-traditional usage of language. Traditional NLP techniques often perform poorly on this text data due to the constantly evolving and domain-specific vocabularies within different platforms, as well as the significant lexical deviations of words from proper English, either by accident or intentionally as a form of adversarial attack.&lt;/p&gt;

&lt;p&gt;In this post, we describe multiple ML approaches for text classification of online conversations with tools and services available on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving deep into this use case, please complete the following prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Set up an &lt;a href="https://aws.amazon.com/getting-started/guides/setup-environment/module-one/"&gt;AWS account&lt;/a&gt; and &lt;a href="https://aws.amazon.com/getting-started/guides/setup-environment/module-two/"&gt;create an IAM user&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up the &lt;a href="https://aws.amazon.com/getting-started/guides/setup-environment/module-three/"&gt;AWS CLI&lt;/a&gt; and &lt;a href="https://aws.amazon.com/tools/"&gt;AWS SDKs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(Optional) Set up your &lt;a href="https://aws.amazon.com/getting-started/guides/setup-environment/module-four/"&gt;Cloud9 IDE environment&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;For this post, we use the &lt;a href="https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview"&gt;Jigsaw Unintended Bias in Toxicity Classification dataset&lt;/a&gt;, a benchmark for the specific problem of classification of toxicity in online conversations. The dataset provides toxicity labels as well as several subgroup attributes such as obscene, identity attack, insult, threat, and sexually explicit. Labels are provided as fractional values, which represent the proportion of human annotators who believed the attribute applied to a given piece of text, which are rarely unanimous. To generate binary labels (for example, toxic or non-toxic), a threshold of 0.5 is applied to the fractional values and comments with values greater than the threshold are treated as the positive class for that label.&lt;/p&gt;
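&lt;p&gt;That thresholding step can be sketched in plain Python (the 0.5 cutoff follows the description above; the fractional values are made up for illustration):&lt;/p&gt;

```python
# Convert fractional annotator agreement into binary labels.
# A comment is labeled positive when the fraction of annotators
# who flagged it exceeds the threshold.
THRESHOLD = 0.5

def binarize(fractions, threshold=THRESHOLD):
    # fractions: one float in [0, 1] per comment
    return [1 if f > threshold else 0 for f in fractions]

labels = binarize([0.83, 0.10, 0.50, 0.62])
print(labels)  # [1, 0, 0, 1] -- note 0.50 is NOT greater than the threshold
```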

&lt;h2&gt;
  
  
  Subword embedding and RNNs
&lt;/h2&gt;

&lt;p&gt;For our first modeling approach, we use a combination of subword embedding and recurrent neural networks (RNNs) to train text classification models. Subword embeddings were introduced by &lt;a href="https://arxiv.org/abs/1607.04606"&gt;Bojanowski et al. in 2017&lt;/a&gt; as an improvement upon previous word-level embedding methods. Traditional Word2Vec skip-gram models are trained to learn a static vector representation of a target word that optimally predicts that word’s context. Subword models, on the other hand, represent each target word as a bag of the character n-grams that make up the word, where an n-gram is composed of a set of n consecutive characters. This method allows for the embedding model to better represent the underlying morphology of related words in the corpus as well as the computation of embeddings for novel, out-of-vocabulary (OOV) words. This is particularly important in the context of online conversations, a problem space in which users often misspell words (sometimes intentionally to evade detection) and also use a unique, constantly evolving vocabulary that might not be captured by a general training corpus.&lt;/p&gt;
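&lt;p&gt;To make the n-gram idea concrete, here is a toy sketch (not fastText's actual implementation: fastText brackets words with angle-bracket boundary markers and hashes n-grams into buckets, whereas this toy version uses ^ and $ markers and a plain dictionary of n-gram vectors):&lt;/p&gt;

```python
def char_ngrams(word, n=3):
    # Bracket the word with boundary markers so prefix/suffix
    # n-grams are distinct from word-internal ones.
    marked = "^" + word + "$"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def embed(word, ngram_vectors, dim=4):
    # Average the vectors of the word's n-grams; n-grams missing
    # from the toy table are simply skipped. This is how an
    # out-of-vocabulary word can still receive a vector.
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    sums = [0.0] * dim
    for g in grams:
        for i, v in enumerate(ngram_vectors[g]):
            sums[i] += v
    return [s / len(grams) for s in sums]

print(char_ngrams("where"))  # ['^wh', 'whe', 'her', 'ere', 're$']
```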

&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt; makes it easy to train and optimize an unsupervised subword embedding model on your own corpus of domain-specific text data with the built-in &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html"&gt;BlazingText algorithm&lt;/a&gt;. We can also download existing general-purpose models trained on large datasets of online text, such as the following &lt;a href="https://fasttext.cc/docs/en/english-vectors.html"&gt;English language models available directly from fastText&lt;/a&gt;. From your SageMaker notebook instance, simply run the following to download a pre-trained fastText model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget -O vectors.zip [https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Whether you’ve trained your own embeddings with BlazingText or downloaded a pre-trained model, the result is a zipped model binary that you can use with the gensim library to embed a given target word as a vector based on its constituent subwords:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

&lt;p&gt;
After we preprocess a given segment of text, we can use this approach to generate a vector representation for each of the constituent words (as separated by spaces). We then use SageMaker and a deep learning framework such as PyTorch to train a customized RNN with a binary or multilabel classification objective to predict whether the text is toxic or not and the specific sub-type of toxicity based on labeled training examples.&lt;/p&gt;

&lt;p&gt;To upload your preprocessed text to &lt;a href="http://aws.amazon.com/s3"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3), use the following code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
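&lt;p&gt;A minimal sketch of such an upload using boto3 (the bucket, prefix, and file names below are hypothetical):&lt;/p&gt;

```python
import io

def make_key(prefix, split, filename):
    # S3 object keys are plain "/"-joined strings.
    return "/".join([prefix, split, filename])

def upload_text(s3_client, bucket, key, text):
    # Stream the preprocessed text straight to S3 without a temp file.
    s3_client.upload_fileobj(io.BytesIO(text.encode("utf-8")), bucket, key)

# Usage (requires AWS credentials):
# import boto3
# upload_text(boto3.client("s3"), "my-nlp-bucket",
#             make_key("toxicity", "train", "train.csv"), preprocessed_csv)
```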
&lt;br&gt;
To initiate scalable, multi-GPU model training with SageMaker, enter the following code:&lt;br&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
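&lt;p&gt;The launch code typically uses the SageMaker Python SDK's PyTorch estimator; the following is a sketch under assumed settings (the entry-point name, framework version, and instance choices are illustrative, not the original values):&lt;/p&gt;

```python
def make_estimator(role, bucket):
    # Lazy import so the sketch stays importable without the SDK installed.
    from sagemaker.pytorch import PyTorch
    return PyTorch(
        entry_point="train.py",           # your training script
        role=role,                        # IAM execution role
        framework_version="1.12",
        py_version="py38",
        instance_count=2,                 # data-parallel across instances
        instance_type="ml.p3.8xlarge",    # multiple GPUs per instance
        hyperparameters={"epochs": 3, "batch-size": 256},
        output_path="s3://{}/output".format(bucket),
    )

# estimator = make_estimator(execution_role, "my-nlp-bucket")
# estimator.fit({"train": train_s3_uri, "validation": val_s3_uri})
```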
&lt;br&gt;
Within &lt;em&gt;&lt;/em&gt;, we define a PyTorch Dataset that is used by train.py to prepare the text data for training and evaluation of the model:&lt;br&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;br&gt;
Note that this code anticipates that the vectors.zip file containing your fastText or BlazingText embeddings will be stored in &lt;em&gt;&lt;/em&gt;.
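&lt;p&gt;A map-style PyTorch Dataset only needs __len__ and __getitem__ methods; the following framework-free sketch illustrates the shape of such a class (the whitespace tokenizer and the embedding lookup are simplified stand-ins for the real preprocessing):&lt;/p&gt;

```python
class ToxicCommentsDataset:
    """Map-style dataset: raw texts plus binary labels, embedded per word."""

    def __init__(self, texts, labels, embed_fn):
        self.texts = texts
        self.labels = labels
        self.embed_fn = embed_fn  # word -> vector, e.g. a subword-embedding lookup

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        words = self.texts[idx].lower().split()        # simplistic tokenizer
        vectors = [self.embed_fn(w) for w in words]    # one vector per word
        return vectors, self.labels[idx]

# Toy embedding: each word maps to a 1-d vector of its length.
ds = ToxicCommentsDataset(["you are great", "bad comment"], [0, 1],
                          lambda w: [float(len(w))])
vectors, label = ds[1]
print(len(ds), label)  # 2 1
```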

&lt;p&gt;Additionally, you can easily deploy pretrained fastText models on their own to live SageMaker endpoints to compute embedding vectors on the fly for use in relevant word-level tasks. See the following &lt;a href="https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_hosting_pretrained_fasttext/blazingtext_hosting_pretrained_fasttext.ipynb"&gt;GitHub example&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;In the next part, I’ll explain how to use a &lt;strong&gt;Transformer with Hugging Face&lt;/strong&gt; for text classification on AWS. The link to Part 2 will be added soon.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>nlp</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data Science Process</title>
      <dc:creator>Qasim H. (aiwithqasim 🚀)</dc:creator>
      <pubDate>Sat, 27 Aug 2022 08:13:48 +0000</pubDate>
      <link>https://forem.com/aiwithqasim/data-science-process-5081</link>
      <guid>https://forem.com/aiwithqasim/data-science-process-5081</guid>
      <description>&lt;h2&gt;
  
  
  Data Science Process
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sXmIso-M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ALeoAkY_pldNp7d1iSQKgCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sXmIso-M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ALeoAkY_pldNp7d1iSQKgCQ.png" alt="Fig1: data science Process" width="635" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is it called Data Science?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, what exactly does it mean to &lt;em&gt;do&lt;/em&gt; data science?&lt;/p&gt;

&lt;p&gt;Professionals who do data science are driven by a desire to ask questions. Questions like, how many customers prefer website A to website B? Are wildfires in California getting bigger? &lt;a href="https://pudding.cool/2018/06/makeup-shades/"&gt;Is Fenty Makeup more inclusive than other foundations&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;To answer these questions, data scientists use a process of experimentation and exploration to find answers. Sound familiar? Like other scientific fields, data science follows the &lt;em&gt;scientific method&lt;/em&gt;. But the data science process also includes steps particular to working with large, digital datasets.&lt;/p&gt;

&lt;p&gt;The process generally goes as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ask a question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the necessary data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Get the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clean and organize the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communicate your findings&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this lesson, we’re going to dig into the data science process. While we present these steps as a linear process, it’s important to keep in mind that you’ll often repeat steps or do things out of order. Consider this to be a set of guidelines to get you started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1-Determine the Necessary Data
&lt;/h2&gt;

&lt;p&gt;After you have a question, you have to make an educated guess about what you think the answer might be. This educated guess is known as a &lt;em&gt;hypothesis&lt;/em&gt; and helps to determine the data you need to collect.&lt;/p&gt;

&lt;p&gt;It’s important to remember that the data you decide to collect can’t just be the data that you think is going to answer your question. We can’t carefully select for what we &lt;em&gt;know&lt;/em&gt; will prove the hypothesis is correct; rather, the data needs to be capable of &lt;em&gt;disproving&lt;/em&gt; the hypothesis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do we mean by that?
&lt;/h3&gt;

&lt;p&gt;In science, it’s actually impossible to &lt;em&gt;prove&lt;/em&gt; that something is true. Rather, we try and show that we’re &lt;em&gt;really, really&lt;/em&gt; confident that it’s not false. That’s because the only way we can say we’re 100% positive our hypothesis is correct is to collect all the data for an entire population — and that’s pretty much impossible!&lt;/p&gt;

&lt;p&gt;So, how do we determine what data is necessary to collect?&lt;/p&gt;

&lt;p&gt;First, we need to determine &lt;strong&gt;what&lt;/strong&gt; data could disprove our hypothesis. For example, if our hypothesis is that Sriracha is the most popular hot sauce in the entire world, we can’t just survey a group of 100 people and ask if they prefer Sriracha over Frank’s Red Hot. We would need to look at how Sriracha consumption compares to the consumption of multiple other brands of hot sauce.&lt;/p&gt;

&lt;p&gt;Next, we need to figure out &lt;strong&gt;how much&lt;/strong&gt; data to collect. While it’s preferable to get information for an entire population, that process is often difficult or impractical. So instead, we collect a &lt;em&gt;sample&lt;/em&gt; set of data, a smaller amount of data that is representative of the entire population.&lt;/p&gt;

&lt;p&gt;How do we ensure that our sample is representative? We figure out the necessary number of samples that have similar descriptive statistics to the entire population. For example, say that we want to know the length of oysters in Long Island. Our hypothesis is that they’re about three inches. If we collect five oyster shells, they may only measure, on average, two inches. However, if we collect 145 more, we’ll find that the average is closer to three inches.&lt;/p&gt;
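&lt;p&gt;The oyster example is easy to simulate with the standard library (the three-inch mean and half-inch spread are assumed values for illustration; a larger sample lands closer to the true average):&lt;/p&gt;

```python
import random
import statistics

random.seed(42)  # reproducible draws

def draw():
    # Pretend the true population of oyster lengths averages 3 inches.
    return random.gauss(3.0, 0.5)

small_sample = [draw() for _ in range(5)]
large_sample = small_sample + [draw() for _ in range(145)]

print(round(statistics.mean(small_sample), 2))  # noisy estimate from 5 shells
print(round(statistics.mean(large_sample), 2))  # 150 shells: close to 3 inches
```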

&lt;p&gt;Rule of thumb: the larger the sample size and the more diverse your dataset is, the more confident you’ll be in your results. We don’t want to go through the trouble of designing and running an experiment only to discover that we don’t have enough information to make a good decision!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2-Getting Data
&lt;/h2&gt;

&lt;p&gt;Once you’ve determined which data you need, it’s time to collect it!&lt;/p&gt;

&lt;p&gt;Data collection can be as simple as locating a file or as complicated as designing and running an experiment.&lt;/p&gt;

&lt;p&gt;Here are a couple of different ways to get data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Active data collection&lt;/strong&gt; — you’re setting up specific conditions in which to get data. You’re on the hunt. Examples include running experiments and surveys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Passive data collection&lt;/strong&gt; — you’re looking for data that already exists. You’re foraging for data. Examples include locating datasets and web scraping.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a couple of things to keep in mind when we collect data. An important one is the size of our dataset. Remember that we usually can’t get data from an entire population, so we need to have an appropriate sample that is representative of the larger population. If you’re ever unsure about the size of your dataset, use a sample size calculator.&lt;/p&gt;
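&lt;p&gt;Such calculators typically implement Cochran's formula for proportions, n = z^2 * p * (1 - p) / e^2, where z is the z-score for your confidence level, p the expected proportion, and e the margin of error. A small sketch:&lt;/p&gt;

```python
import math

def sample_size(confidence_z=1.96, p=0.5, margin=0.05):
    # Cochran's formula; p=0.5 is the worst case, maximizing the required n.
    n = (confidence_z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n)

# 95% confidence, 5% margin of error:
print(sample_size())  # 385 -- the classic "survey about 385 people" figure
```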

&lt;p&gt;Another thing to keep in mind is that errors and bias can occur in data collection. For example, early facial recognition software was trained on datasets that disproportionately contained portraits of white males. Other developers continued to use these datasets and for many years, facial recognition software was not adequate at recognizing other faces. Again, having a larger sample size is one way to mitigate errors and bias.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3-Cleaning Data
&lt;/h2&gt;

&lt;p&gt;As soon as you get your data, you’ll be excited to get started. But wait just one moment — we might be ready to go, but how can we be sure that our data is?&lt;/p&gt;

&lt;p&gt;Data is typically organized in columns and rows as you’d see in a spreadsheet. But raw data can actually come in a variety of file types and formats. This is especially true if you’re getting your data from elsewhere, like public datasets.&lt;/p&gt;

&lt;p&gt;We as humans may be able to understand the organizing logic of a dataset, but computers are very literal. A missing value or unlabeled column can completely throw them off their game. Even worse — your program could still run, but your outcomes would be incorrect. Ouch!&lt;/p&gt;

&lt;p&gt;An important part of the data science process is to clean and organize our datasets, sometimes referred to as &lt;em&gt;data wrangling&lt;/em&gt;. Processing a dataset could mean a few different things. For example, it may mean getting rid of invalid data or correctly labeling columns.&lt;/p&gt;

&lt;p&gt;The Python library &lt;a href="https://pandas.pydata.org/pandas-docs/stable/"&gt;Pandas&lt;/a&gt; is a great tool for importing and organizing datasets. You can use Pandas to convert a spreadsheet document, like a CSV, into easily readable tables and charts known as &lt;em&gt;DataFrames&lt;/em&gt;. We can also use libraries like Pandas to &lt;em&gt;transform&lt;/em&gt; our datasets by adding columns and rows to an existing table, or by merging multiple tables together!&lt;/p&gt;
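&lt;p&gt;A tiny sketch of that workflow, assuming pandas is installed (the file contents, column names, and second table are invented):&lt;/p&gt;

```python
import io
import pandas as pd  # assuming pandas is installed

# A CSV with one missing measurement, as raw data often arrives.
csv = io.StringIO("name,length_in\noyster_1,2.9\noyster_2,\noyster_3,3.2\n")
df = pd.read_csv(csv)        # spreadsheet-style text becomes a DataFrame
clean = df.dropna()          # drop rows with missing values

# Merge with a second table on a shared column.
sites = pd.DataFrame({"name": ["oyster_1", "oyster_3"],
                      "site": ["north", "south"]})
merged = clean.merge(sites, on="name")
print(merged.shape)  # (2, 3)
```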

&lt;h2&gt;
  
  
  Step 4-Explore the Data
&lt;/h2&gt;

&lt;p&gt;Now that our data is organized, we can begin looking for insights. But even though our data is all cleaned up, we still can't learn much by staring at tables. In order to truly explore our data, we’ll need to go a few steps further.&lt;/p&gt;

&lt;p&gt;Exploring a dataset is a key step because it will help us quickly understand the data we’re working with and allow us to determine if we need to make any changes before moving forward. Changes could include some additional dataset cleaning, collecting more data, or even modifying the initial question we’re trying to answer. Remember: data science is not necessarily a linear process.&lt;/p&gt;

&lt;p&gt;There are two strategies for exploring our data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Statistical calculations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data visualizations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  STATISTICAL CALCULATIONS
&lt;/h3&gt;

&lt;p&gt;When we first get a dataset, we can use descriptive statistics to get a sense of what it contains. Descriptive statistics summarize a given dataset using statistical calculations, such as the average (also known as the mean), the median, and the standard deviation. We can immediately learn what the common values in our dataset are and how spread out the dataset is (are most of the values the same, or are they wildly different?).&lt;/p&gt;

&lt;p&gt;We can use a Python module known as NumPy to calculate descriptive statistics values. NumPy (short for Numerical Python) supplies short commands to easily perform statistical calculations, like np.mean(), which calculates the mean of a dataset.&lt;/p&gt;
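&lt;p&gt;Here’s a small sketch of these calculations on a made-up set of daily visitor counts; note how the mean and median tell different stories when an outlier is present:&lt;/p&gt;

```python
import numpy as np

# Hypothetical daily visitor counts for a small website.
visitors = np.array([120, 135, 110, 150, 500, 125, 130])

print(np.mean(visitors))    # average value, pulled upward by the 500 outlier
print(np.median(visitors))  # middle value, much less affected by the outlier
print(np.std(visitors))     # standard deviation: how spread out the data is
```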

&lt;h3&gt;
  
  
  DATA VISUALIZATIONS
&lt;/h3&gt;

&lt;p&gt;Another way we can quickly get a sense of our data is by visualizing it. The practice of data visualization enables us to see patterns, relationships, and outliers, and how they relate to the entire dataset. Visualizations are particularly useful when working with large amounts of data.&lt;/p&gt;

&lt;p&gt;Python data visualization libraries like Matplotlib and Seaborn can display distributions and statistical summaries for easy comparison. The JavaScript library D3 enables the creation of interactive data visualizations, which are useful for modeling different scenarios.&lt;/p&gt;
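&lt;p&gt;As a quick illustration (the sales figures are invented), Matplotlib can turn a handful of numbers into a labeled chart in a few lines:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# Hypothetical toy-sales figures for five years.
years = [2015, 2016, 2017, 2018, 2019]
sales = [230, 255, 240, 290, 310]

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")  # a line chart makes the trend visible
ax.set_xlabel("Year")
ax.set_ylabel("Toy sales")
ax.set_title("Toy sales over time")
fig.savefig("toy_sales.png")       # export the chart as an image file
```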

&lt;h2&gt;
  
  
  Step 5-Modeling and Analysis
&lt;/h2&gt;

&lt;p&gt;Data in hand, we can begin to dig in and analyze what we have. To analyze our data, we’ll want to create a &lt;em&gt;model&lt;/em&gt;. Models are abstractions of reality, informed by real data, that allow us to understand situations and make guesses about how things might change given different variables.&lt;/p&gt;

&lt;p&gt;A model gives us the relationship between two or more variables. For example, you could build a model that relates the number of grade school children in a neighborhood with sales of toys, or a model that connects the number of trucks that travel certain roads with the amount of a city’s budget assigned to road maintenance.&lt;/p&gt;

&lt;p&gt;Models allow us to analyze our data because once we begin to understand the relationships between different variables, we can make inferences about certain circumstances. Why is it that the sales of toys increase as the number of grade school children grows? Well, maybe it’s because parents are buying their children more toys.&lt;/p&gt;

&lt;p&gt;Models are also useful for informing decisions, since they can be used to predict unknowns. Once we understand a relationship between variables, we can introduce unknown variables and calculate different possibilities. If the number of children in the neighborhood increases, what do we predict will happen to the sales of toys?&lt;/p&gt;

&lt;p&gt;As we collect more data, our model can change, allowing us to draw new insights and get a better idea of what might happen in different circumstances. Our models can tell us whether or not an observed variance is within reason, is due to an error, or possibly carries significance. How would our understanding of our model change if in 2016 we discovered that the sales of toys &lt;em&gt;did not&lt;/em&gt; increase, but instead decreased? We’d want to look for an explanation as to why this year did not fit the trend.&lt;/p&gt;

&lt;p&gt;Models can be expressed as mathematical equations, such as the equation for a line. You can use data visualization libraries like Matplotlib and Seaborn to visualize relationships. If you pursue machine learning, you can use the Python package scikit-learn to build predictive models, such as linear regressions.&lt;/p&gt;
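&lt;p&gt;As a sketch of the toy-sales model above (with invented numbers), we can fit a line with NumPy’s np.polyfit and then use it to predict an unknown; scikit-learn’s LinearRegression offers the same idea for predictive models through its fit and predict methods:&lt;/p&gt;

```python
import numpy as np

# Hypothetical observations: number of grade school children in a
# neighborhood, and yearly toy sales in that neighborhood.
children = np.array([50, 80, 110, 140, 170])
toy_sales = np.array([520, 790, 1110, 1390, 1700])

# Fit a line (degree-1 polynomial): toy_sales = slope * children + intercept.
slope, intercept = np.polyfit(children, toy_sales, 1)

# With the model in hand, we can predict an unknown: expected sales
# if the neighborhood grows to 200 children.
predicted = slope * 200 + intercept
print(round(predicted))
```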

&lt;h2&gt;
  
  
  Step 6-Communicating Findings
&lt;/h2&gt;

&lt;p&gt;After you’ve done your analyses, built your models, and gotten some results, it’s time to communicate your findings.&lt;/p&gt;

&lt;p&gt;Two important parts of communicating data are &lt;em&gt;visualizing&lt;/em&gt; and &lt;em&gt;storytelling&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  VISUALIZING
&lt;/h3&gt;

&lt;p&gt;As we saw earlier, visualizations are helpful for exploring and understanding our data; however, they are also very useful for communicating results. So, can you just reuse the ones you already have? Not necessarily. The answer depends on your audience. If you’re giving a presentation at a conference for fellow data scientists, then sure, they’ll probably be able to interpret your graphs. But if you’re writing an article for Medium, you’ll probably want to simplify your visualizations and style them so they’re easy to read and highly engaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  STORYTELLING
&lt;/h3&gt;

&lt;p&gt;It’s also important to remember that visualizations can’t always stand on their own — they need to tell a story. Storytelling is an important part of data science because it gives meaning to the data. In fact, data storytelling is an important practice in its own right. Contextualizing your findings and giving them a narrative draws people into your work and enables you to convince them as to why these results are important and why they should make a certain decision.&lt;br&gt;
To practice the steps above, follow the link below:&lt;br&gt;
GitHub: &lt;a href="https://github.com/qasim1020/medium-code/tree/main/data%20science%20process"&gt;medium-code/data science process at main · qasim1020/medium-code (github.com)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.codecademy.com/"&gt;https://www.codecademy.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
