Forem: Brendah Achieng

Introduction to Data Version Control

Brendah Achieng — Sat, 01 Apr 2023 20:22:28 +0000

Data Version Control (DVC) is a free, open-source system for managing data, machine learning experiments and machine learning automation. Data scientists no longer have to keep track by hand of which model uses which dataset, or of the steps that produced a given result, which makes their work easier. They can also manage large datasets with ease, which improves collaboration.

Data Version Control was first released in 2017 as a simple command-line tool. It builds on existing engineering tools and practices such as Git and CI (continuous integration). It tracks changing versions of data alongside the commits made to the code, so DVC is like Git for machine learning projects.

The .dvc file is lightweight, so it is stored with the code on GitHub and downloaded together with the code. The large datasets and model files are stored on the DVC remote storage, while the .dvc files that point to those data files are stored on GitHub.
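The pointer idea can be illustrated with a short sketch. This is not DVC's implementation: the function name `make_pointer` and the dictionary layout are assumptions made up for illustration. It only shows how a small metafile holding a content hash can stand in for a large data file, and how a change in the data produces a different hash.

```python
import hashlib
import os
import tempfile


def make_pointer(data_path):
    """Return a small dict describing the file (hypothetical layout, not DVC's)."""
    md5 = hashlib.md5()
    with open(data_path, "rb") as f:
        # Read in chunks so a large dataset does not have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return {
        "md5": md5.hexdigest(),
        "size": os.path.getsize(data_path),
        "path": os.path.basename(data_path),
    }


# Demo with a throwaway file standing in for a large dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "data.csv")
    with open(data_file, "w") as f:
        f.write("id,label\n1,positive\n2,negative\n")
    pointer_v1 = make_pointer(data_file)

    # Changing the data changes the hash, so the pointer records a new version.
    with open(data_file, "a") as f:
        f.write("3,neutral\n")
    pointer_v2 = make_pointer(data_file)

print(pointer_v1)
print(pointer_v2)
```

Only the small dictionary would need to live next to the code; the data file itself can sit in separate storage.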

DVC design principles

1. Codification: Define project aspects such as data and model versions or machine learning experiments in human-readable metafiles.

2. Versioning: Commit DVC metafiles to Git, which enables versioning and sharing of the entire project (datasets, source code, configuration, parameters and metrics) using Git.

3. Secure Collaboration: Control the access and permissions to the project.

Characteristics of DVC

Data Version Control takes advantage of existing technologies with the aim of bringing the best software engineering practices to the field of data science. Some of the characteristics of DVC include:

Easy to use and install:
DVC doesn't require special infrastructure or knowledge. Furthermore, it does not depend on any external services. DVC can be easily integrated with existing tools like Git.
Can work on top of a Git repo:
DVC sticks to the Git workflow: commit, branch, pull, push, clone and so on. It can also work on its own, without Git, but then without the versioning capabilities.
DVC doesn't depend on the platform:
It runs on all major operating systems and is independent of programming languages and machine learning libraries.

How to install Data Version Control on Windows
DVC can also be installed on Linux and macOS. However, this article covers only the Windows installation.
To use DVC as a Python library, you can install it with conda or with pip.

Installation with choco
To install from the command line with Chocolatey, use the choco command:
$ choco install dvc

Installation with conda:
Requires the Miniconda or Anaconda distribution. Use conda from the Anaconda Prompt.

$ conda install -c conda-forge mamba

$ mamba install -c conda-forge dvc

Installation using pip:
Creating a virtual environment, or using pipx, is recommended to isolate DVC from your local environment. Python 3.8+ is needed to get the latest version of DVC.

$ pip install dvc

Windows Installer:
Go to the https://dvc.org/ homepage and get the self-contained, executable installer, which is available from the Download button. You can also get it from the release page on GitHub.
To update DVC, download and run the installer again. Use the Windows uninstaller in case you want to remove the program from your machine.

Advantages of Data Version Control

Organized Machine learning data-
DVC uses the data pipeline concept to version data alongside Git. The pipelines are lightweight and make workflows organized and reproducible.
Share Models via Cloud Storage-
With centralized data storage, scientists find it easy to share data and models and to run experiments on a single shared machine, which leads to better resource utilization.
Reproducibility-
DVC repositories store the history and details of a project, such as what changes were made and when. It can also use no-code pulls to update requests with just one commit. The easy-to-use command-line interface allows scientists to reproduce and organize feature stores with the dvc get and dvc import commands.
Track & Visualize ML Models-
Versioning is achieved using Git workflows such as pull and push requests. DVC's built-in cache stores all the machine learning artifacts, which are then synchronized with remote cloud storage. DVC therefore allows data and models to be tracked for further versioning.

Disadvantages of DVC

a) Poor Performance in Sloppy Architecture
Data Version Control works alongside Git, so team members cannot enjoy the full benefits of this version control system if some information about the datasets for a given project is missing. Teams may have to manually develop extra features in DVC to meet certain demands of ML.
b) Redundancy
DVC includes its own pipeline management, so using a separate pipeline tool alongside it leads to redundancy.
c) Incorrect Configuration Risk
Should the team forget to add an output file, there is a risk that the pipeline is configured incorrectly. Furthermore, a version of a project produced with DVC last year may not behave the same way in today's environment.

Getting Started With Sentiment Analysis

Brendah Achieng — Tue, 28 Mar 2023 19:38:55 +0000

Sentiment analysis (opinion mining) is a natural language processing (NLP) technique that focuses on analyzing and finding the intent or emotion behind a given text or speech.
There is always a sentiment behind any written or spoken speech. It could be negative, positive or neutral.

Sentiment analysis helps automate the processing of large amounts of data in real time. It can be used to analyze customer feedback, survey responses and product reviews, and for social media monitoring, reputation management and customer experience. Business decisions can then be made after analyzing and understanding people's reactions towards a given commodity.

Sentiment analysis is fast becoming an essential tool for understanding the sentiment behind all types of data. Being able to automatically understand the responses of over 5000 customers to a given survey is a great gain for a business.

Importance of sentiment analysis

Sorting large amounts of data: Manually sorting through thousands of tweets or customer survey responses is very tedious. Sentiment analysis helps analyse large amounts of unstructured data within a short period of time.
Real-time analysis: With sentiment analysis models, urgent or critical issues can be detected in real time. For example, an angry customer who needs immediate attention can be identified at once and the situation dealt with.
Consistent criteria: Using a centralized sentiment analysis model helps keep the same standard whenever data is interpreted. Manual interpretations can be biased, because people are influenced by their own experiences, beliefs and thoughts.

How Does Sentiment Analysis Work?

Using machine learning and natural language processing, sentiment analysis can determine whether a text is neutral, positive or negative.

The main approaches to sentiment analysis are:

1. Rule-based sentiment analysis.

A set of manually created rules is used for the analysis. NLP techniques such as lexicons (lists of words), stemming, tokenization and parsing are used.

Lexicons- Lists of negative and positive words are created and later used to describe the sentiment.
Tokenization- Breaking a text or a sentence into smaller pieces called tokens.

Basic example of how a rule-based system works:

Two lists of polarized words are defined: negative words such as bad and ugly, and positive words such as best and beautiful.

The text is then prepared, processed and formatted so that the machine can analyze it easily. Tokenization and lemmatization occur here.

The computer then counts the number of words in the text that are classified as negative and as positive.

The overall sentiment score of the text is then calculated on a given scale, such as -100 to 100. If there are more positive words than negative words, the system returns a positive sentiment, and vice versa. Should the counts be equal, the system returns a neutral sentiment.
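The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production system: the word lists are tiny, the words beyond "bad", "ugly", "best" and "beautiful" are assumptions added for the demo, lemmatization is skipped, and the scaling to -100..100 is just one possible choice.

```python
import re

# "best", "beautiful", "bad" and "ugly" come from the text above;
# the remaining words are assumptions added for illustration.
POSITIVE_WORDS = {"best", "beautiful", "good", "great"}
NEGATIVE_WORDS = {"bad", "ugly", "worst", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return a score in [-100, 100] from the positive and negative word counts."""
    tokens = tokenize(text)
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    if pos + neg == 0:
        return 0  # no polarized words at all
    return round(100 * (pos - neg) / (pos + neg))


def classify(text):
    """Map the score to a positive, negative or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("The best and most beautiful phone"))
print(classify("Bad screen, ugly case"))
print(classify("Best camera but bad battery"))
```

The third sentence contains one positive and one negative word, so the counts are equal and the sketch returns a neutral label.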

Disadvantages of Rule-based sentiment analysis

It is limited because it considers individual words rather than whole sentences. Human language is complicated, and sometimes the real emotion can be missed.

2. Automated or Machine Learning Sentiment Analysis

Machine learning techniques are used. A model is trained on a given dataset to classify the sentiment based on the words in a text and their order. The quality of this approach depends on the quality of the training dataset used.

Step 1: Feature Extraction

The text data is prepared here. Techniques such as tokenization, lemmatization, vectorization and stopword removal are used to make the text ready for classification by the model. Vectorization turns the text into numbers, for example with deep learning embeddings.

Step 2: Training
A sentiment-labelled training dataset is used to train the algorithm. The dataset is created manually or generated from reviews.

Step 3: Prediction

New text is fed into the model. The trained model then predicts a label for this new data, classifying the text as positive, negative or neutral in sentiment. This eliminates the need for the pre-defined lexicon used in rule-based sentiment analysis.
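The three steps can be sketched with a very small Naive Bayes classifier written with only the standard library. Naive Bayes is one common choice and is used here purely as an assumption for illustration. The training sentences are made up, word counts stand in for vectorization, and only two labels are used to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Step 1 (feature extraction): lowercase and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Step 2 (training): count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Step 3 (prediction): pick the label with the highest log probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        logp = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            logp += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Made-up, hand-labelled training data (an assumption for this sketch).
TRAINING_DATA = [
    ("I love this product, it is great", "positive"),
    ("Great service and a beautiful design", "positive"),
    ("This is the best purchase I have made", "positive"),
    ("I hate this product, it is terrible", "negative"),
    ("Terrible service and an ugly design", "negative"),
    ("This is the worst purchase I have made", "negative"),
]

model = train(TRAINING_DATA)
print(predict(model, "great design, I love it"))
print(predict(model, "worst service, I hate it"))
```

No word list is written by hand here: the model learns which words signal which label from the labelled examples, so its quality depends directly on that data.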

N.B. A hybrid of the rule-based and automated approaches can sometimes be used. Although hybrid systems are very complex, they often provide the best results.

Building Sentiment Analysis Model

Pre-trained models are publicly available on the Hub, so they are the best place to get started. The available models use deep learning architectures such as transformers. For more accurate results, it is advisable to fine-tune the chosen model with your own data so that it better fits the case at hand.

Essential SQL Commands For Data Science

Brendah Achieng — Wed, 15 Mar 2023 16:24:25 +0000

Structured Query Language (SQL) is a simple, easy-to-write language used around the world to manipulate databases. Without data there is no Data Science, hence SQL is very important.
In this post we will talk about some of the important SQL commands used in Data Science.

Data Retrieval

Select Command
Together with other retrieval clauses, it is used to retrieve specific data from the database. The SELECT clause specifies a column or columns from a table. To retrieve more than one column, separate the column names with a comma and a space. To get all the columns in a given table, use an asterisk (*).

Syntax of Select Command:

Select * from table_name
Select column_name from table_name
Select column_name1, column_name2 from table_name
Select * from table_name where condition
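The four forms can be tried out with Python's built-in sqlite3 module and an in-memory database. This is a small sketch: the `customers` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical table and data, created in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Amina", "Nairobi"), (2, "Brian", "Kisumu"), (3, "Carol", "Nairobi")],
)

all_rows = conn.execute("SELECT * FROM customers").fetchall()        # every column
names = conn.execute("SELECT name FROM customers").fetchall()        # one column
name_city = conn.execute("SELECT name, city FROM customers").fetchall()  # two columns
nairobi = conn.execute(
    "SELECT * FROM customers WHERE city = 'Nairobi'"
).fetchall()                                                          # filtered rows

print(all_rows)
print(nairobi)
```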

Distinct Command
It is used with the SELECT command to display only the distinct (unique) values from a table that contains duplicate data.

Syntax of the distinct command:

Select distinct column_name from table_name
Select distinct column_name1, column_name2…….. from table_name
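A short sqlite3 sketch shows the effect of DISTINCT on one column and on a pair of columns. The `visits` table and its rows are assumptions made up for the demo.

```python
import sqlite3

# Hypothetical page-visit log with repeated rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("amina", "home"), ("amina", "home"), ("amina", "pricing"), ("brian", "home")],
)

# Unique values of a single column.
distinct_visitors = conn.execute(
    "SELECT DISTINCT visitor FROM visits ORDER BY visitor"
).fetchall()

# With several columns, DISTINCT applies to the combination of values.
distinct_pairs = conn.execute(
    "SELECT DISTINCT visitor, page FROM visits ORDER BY visitor, page"
).fetchall()

print(distinct_visitors)
print(distinct_pairs)
```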

Data Retrieval With Simple Conditions
where
This is used to display specific data that meets the given condition.

Syntax of where statement:

Select * from table_name where condition
Select column_name1, column_name2…. from table_name where condition

order by
Used to retrieve data from the database in a specific order, either ascending (the default) or descending.

Syntax of order by statement:

Select * from table_name where condition order by column_name
Select * from table_name where condition order by column_name DESC

limit
Used to get a limited number of entries, e.g. the top 10 records.

Syntax:
SELECT * FROM table_name where condition order by column_name desc limit 10
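The three clauses are usually combined, as in this sqlite3 sketch that returns the two most expensive items in one category. The `products` table and its prices are made up for illustration.

```python
import sqlite3

# Hypothetical product list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("pen", "stationery", 1.5),
        ("notebook", "stationery", 4.0),
        ("stapler", "stationery", 9.0),
        ("mug", "kitchen", 6.0),
    ],
)

# WHERE filters the rows, ORDER BY ... DESC sorts them, LIMIT keeps the first two.
top_two = conn.execute(
    "SELECT name, price FROM products "
    "WHERE category = ? ORDER BY price DESC LIMIT 2",
    ("stationery",),
).fetchall()

print(top_two)
```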

Aggregations
An aggregate function performs a calculation over multiple values and returns a single value. Aggregate functions in SQL include avg, count, sum, min, max and many others, and they are often used together with the group by clause. NULL values are ignored during the calculations, except by count(*), which counts every row.

GROUP BY
The group by clause groups rows that share the same values so that aggregate functions return one result per group.

Syntax of group by:
Select column_list from table_name where condition group by expression

COUNT()
Used to count the number of all or distinct values in an expression.

Syntax of count() function:
SELECT count(column_name) from table_name

SUM()
The sum() function returns the total of the values in a numeric column.

Syntax of sum() function:
SELECT sum (column_name) from table_name
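The sqlite3 sketch below runs count, sum and group by on a small made-up `sales` table. One amount is left as NULL to show how it is treated.

```python
import sqlite3

# Hypothetical sales data; the last amount is NULL (None in Python).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 50), ("west", 70), ("west", None)],
)

count_all = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]         # counts rows
count_amount = conn.execute("SELECT COUNT(amount) FROM sales").fetchone()[0]  # skips NULL
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]           # skips NULL

# One result row per region.
per_region = conn.execute(
    "SELECT region, COUNT(amount), SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

print(count_all, count_amount, total)
print(per_region)
```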

JOINS

SQL JOINs are used to combine rows from two or more tables based on a common field between them.
There are 4 different types of SQL joins:

SQL INNER JOIN (SIMPLE JOIN OR JOIN)
Returns rows from multiple tables where the join condition is met, that is, only the data common to both tables, where they intersect.

Syntax:
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
INNER JOIN table2
ON table1.matching_column = table2.matching_column;

SQL LEFT OUTER JOIN (LEFT JOIN)
Returns all rows from the left-hand table and only those rows from the other table where the join condition is met. Where there is no match, the right-hand columns are NULL.

Syntax:
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;

SQL RIGHT OUTER JOIN (RIGHT JOIN)
It returns all the rows of the table on the right side of the join and matching rows for the table on the left side of the join.

Syntax:
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;

SQL FULL OUTER JOIN (FULL JOIN)
Returns all the rows from both tables. For rows that have no match in the other table, the missing columns are returned as NULL.

Syntax:
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
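The difference between join types is easiest to see on a tiny example. The sqlite3 sketch below compares INNER JOIN and LEFT JOIN on made-up `customers` and `orders` tables. RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

# Hypothetical tables: Carol has no orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Amina"), (2, "Brian"), (3, "Carol")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "laptop"), (1, "mouse"), (2, "desk")],
)

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute(
    "SELECT customers.name, orders.item FROM customers "
    "INNER JOIN orders ON customers.id = orders.customer_id "
    "ORDER BY customers.name, orders.item"
).fetchall()

# LEFT JOIN: every customer; item is NULL (None) when there is no order.
left_rows = conn.execute(
    "SELECT customers.name, orders.item FROM customers "
    "LEFT JOIN orders ON customers.id = orders.customer_id "
    "ORDER BY customers.name, orders.item"
).fetchall()

print(inner_rows)
print(left_rows)
```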

UNION
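Combines the results of two queries into a single result set and removes duplicate rows. Both queries must return the same number of columns.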

Syntax:
SELECT column_name AS Name FROM table_name
UNION
SELECT column_name FROM table_name
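A quick sqlite3 sketch shows the duplicate removal. The `staff` and `contractors` tables are made up, and UNION ALL, which keeps duplicates, is included only for comparison.

```python
import sqlite3

# Hypothetical tables; "Brian" appears in both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT)")
conn.execute("CREATE TABLE contractors (name TEXT)")
conn.executemany("INSERT INTO staff VALUES (?)", [("Amina",), ("Brian",)])
conn.executemany("INSERT INTO contractors VALUES (?)", [("Brian",), ("Carol",)])

# UNION removes the duplicate "Brian".
union_rows = conn.execute(
    "SELECT name AS Name FROM staff UNION SELECT name FROM contractors ORDER BY Name"
).fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute(
    "SELECT name AS Name FROM staff UNION ALL SELECT name FROM contractors ORDER BY Name"
).fetchall()

print(union_rows)
print(union_all_rows)
```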

Complex Conditions

CASE Statement
This is the way SQL handles if/then logic. The CASE keyword is followed by one or more WHEN and THEN pairs, and the statement ends with END. The ELSE part is optional.

Syntax:
SELECT CASE Expression
When expression1 Then Result1
When expression2 Then Result2
...
ELSE Result
END
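As a small sketch, the sqlite3 example below uses CASE to turn a numeric rating into a label. The `reviews` table and the rating thresholds are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical review ratings on a 1 to 5 scale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", [(1, 5), (2, 3), (3, 1)])

# The first WHEN that matches wins; ELSE covers everything left over.
labelled = conn.execute(
    """
    SELECT id,
           CASE
               WHEN rating >= 4 THEN 'positive'
               WHEN rating <= 2 THEN 'negative'
               ELSE 'neutral'
           END AS sentiment
    FROM reviews
    ORDER BY id
    """
).fetchall()

print(labelled)
```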

Window Functions
A window function applies an aggregate or other function over a particular set of rows while still returning one result per row. The OVER clause defines the window.

Syntax:
SELECT column_name1,
window_function(column_name2)
OVER([PARTITION BY column_name1] [ORDER BY column_name3]) AS new_column
FROM table_name;
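The sqlite3 sketch below computes a running total per region with SUM as the window function. The `sales` table is made up, and the example assumes the SQLite library bundled with Python is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical monthly sales per region.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 50), ("west", 1, 70), ("west", 2, 30)],
)

# PARTITION BY restarts the total for each region; ORDER BY makes it cumulative.
running = conn.execute(
    """
    SELECT region, month,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
    """
).fetchall()

print(running)
```

Unlike group by, the query still returns one row per sale, each with its own running total.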

Introduction to SQL for Data Analysis

Brendah Achieng — Fri, 17 Feb 2023 08:19:22 +0000

Structured Query Language (SQL) is a standard programming language designed in the 1970s for accessing, manipulating and storing data in a relational database. As the name suggests, a relational database is a database composed of data organized in tables that relate to each other. The table rows and columns represent data characteristics and how the data values relate to each other.

Why is SQL so important?

SQL is very easy to learn since it uses common English keywords like "where" in its statements.
SQL is one of the most widely used languages in the world. It is used in almost all types of applications because it integrates so well with many programming languages.
It is the standard language for database management systems used in both extremely big and small businesses.
SQL is a powerful, fast, efficient and secure language. Many database systems that implement it, such as MySQL, are open source and inexpensive, and it can be used for almost any task related to a database.

How SQL works

When a query is run, it is processed by a query optimizer. Upon reaching the SQL server, the query is compiled in three stages:
a) parsing: syntax checking
b) binding: semantics checking
c) optimization: query execution plan creation

SQL Commands

Data Definition Language: the creation, design and modification of the database structure and objects, e.g. the CREATE command

Data Query Language: retrieval of data from the database, e.g. the SELECT command

Data Manipulation Language: insertion of new records and modification of existing ones, e.g. the INSERT command

Data Control Language: access authorization for the database, e.g. the GRANT command, which allows a given user to access a particular section of the database

Transaction Control Language: management of the changes made by transactions, e.g. the ROLLBACK command, which undoes uncommitted changes
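Most of these categories can be seen in one short sqlite3 session. This is only a sketch: the `accounts` table is made up, and the Data Control Language is not shown because SQLite has no GRANT command.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure (hypothetical table).
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")

# DML: insert a record, then make the change permanent (TCL: COMMIT).
conn.execute("INSERT INTO accounts VALUES ('Amina', 100)")
conn.commit()

# DML: modify the record, then undo the uncommitted change (TCL: ROLLBACK).
conn.execute("UPDATE accounts SET balance = 0 WHERE owner = 'Amina'")
conn.rollback()

# DQL: read the data back; the rolled-back update is not visible.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE owner = 'Amina'"
).fetchone()[0]

print(balance)
```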

SQL For Data Analysis

Data Analysis, Data Science, Business Intelligence, Big Data and similar fields all manipulate and process large amounts of data using different methods to gain useful insights.
As mentioned earlier, SQL is implemented in all kinds of database management systems, whether desktop (Access), open source (MySQL) or commercial (Oracle).
Data analysts use SQL to process, manipulate and generally interact with data stored in relational databases.
Businesses and organizations need data analysts to discover useful patterns and trends in their data.
Data analysis therefore involves collecting and organizing data to extract and retrieve useful information that can be used to make critical decisions.
SQL offers great ability to manipulate large amounts of data. It can efficiently build complex models and analyses in a very short time.

How to use SQL for Data Analysis

Because SQL can communicate complex instructions to the database and manipulate data in a very short time, it can be used with reporting tools to create useful dashboards that display data in many ways.
Furthermore, SQL can be used to design and build useful data warehouses.
SQL can be integrated with different data analytics frameworks and languages like Python, R, Scala etc.

Learning SQL for data analysis

SQL is easy to learn and use. Sometimes just an SQL cheat sheet is enough to get a data analyst going. However, to become a better data analyst, one needs to study SQL in depth and master all the skills.

Finally, data analysts do analyze data, but before that they need to retrieve it from the database, and that is where SQL comes in. Therefore, SQL is a critical language in data analysis.