<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Riri</title>
    <description>The latest articles on Forem by Riri (@njarambariri).</description>
    <link>https://forem.com/njarambariri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1007842%2F1e7d2083-4215-46c9-821b-7c249634d7b6.JPG</url>
      <title>Forem: Riri</title>
      <link>https://forem.com/njarambariri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/njarambariri"/>
    <language>en</language>
    <item>
      <title>Unfolding a Machine Learning Classification Problem: A Step by Step Guide.</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Sun, 30 Apr 2023 16:47:42 +0000</pubDate>
      <link>https://forem.com/njarambariri/unfolding-a-machine-learning-classification-problem-a-step-by-step-guide-cac</link>
      <guid>https://forem.com/njarambariri/unfolding-a-machine-learning-classification-problem-a-step-by-step-guide-cac</guid>
      <description>&lt;p&gt;A classification problem in machine learning involves forecasting the class or category of an input sample based on its features or attributes. For instance, you want to build a machine learning model that can be able to distinguish a cat from a dog, the goal of this model will be to accurately predict the classes of new and unseen images of cats and dogs and assign each image its respective class. As such, this problem is classified as a classification problem.&lt;/p&gt;

&lt;p&gt;Similar to other machine learning tasks, building a classification model involves a number of steps to achieve its primary objective. These steps include collecting and preprocessing the data, dividing it into training and testing sets, selecting a suitable model, training the model on the data, evaluating the performance of the model, optimizing the model, and finally deploying it in real-world applications such as web services, mobile apps, or APIs.&lt;/p&gt;

&lt;p&gt;Some common applications of machine learning classification that you've probably interacted with include spam detection, sentiment analysis, fraud detection, image classification, and medical diagnosis. This shows that classification techniques have a wide pool of practical applications across various domains.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of Classification Approaches.
&lt;/h2&gt;

&lt;p&gt;There are three main types of classification approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Binary Classification:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In binary classification, the goal is to classify instances into one of two classes or categories. These classes are assigned labels, either 0 or 1. Label 0 is usually associated with the "normal" state of the category and label 1 with the abnormal state.&lt;/p&gt;

&lt;p&gt;Examples include a classifier determining if an email is spam or not, or if a person has a disease or not.&lt;/p&gt;

&lt;p&gt;Some popular algorithms used in this type of classification problem include the SGD classifier (very powerful, especially when handling large datasets), K-Nearest Neighbors, and Support Vector Machines (SVM).&lt;/p&gt;
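&lt;p&gt;As a minimal sketch of binary classification in scikit-learn, here is an SGD classifier trained on a synthetic two-class dataset (the generated data and split sizes are purely illustrative):&lt;/p&gt;

```python
# Minimal binary-classification sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Two-class dataset: 1000 samples, 10 features, labels are 0 or 1
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SGDClassifier(random_state=42)  # linear model trained with stochastic gradient descent
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```

&lt;p&gt;Every prediction the model makes is one of the two class labels, 0 or 1.&lt;/p&gt;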

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Multi-class Classification:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In multiclass classification (also called multinomial classification), the goal is to classify instances into one of several possible classes or categories.&lt;/p&gt;

&lt;p&gt;Examples include classifying images of animals into different types of species, or classifying news articles into different topics.&lt;/p&gt;

&lt;p&gt;Random Forest and Naive Bayes classifiers usually perform exceptionally well in this scenario. However, some binary classifiers such as SVMs and linear classifiers can also be used; in that case, one of two strategies is used to train the classifier: either One-Versus-All or One-Versus-One.&lt;/p&gt;
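&lt;p&gt;To illustrate, here is a hedged sketch of both strategies using scikit-learn's wrappers around an SVM; the iris dataset stands in as a generic three-class problem:&lt;/p&gt;

```python
# Sketch of the two strategies for applying a binary classifier (an SVM here)
# to a multi-class problem; iris is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-versus-all (a.k.a. one-versus-rest): one binary classifier per class
ova = OneVsRestClassifier(SVC(gamma="auto")).fit(X, y)

# One-versus-one: one binary classifier per pair of classes
ovo = OneVsOneClassifier(SVC(gamma="auto")).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 3 and 3 for three classes
```

&lt;p&gt;With three classes the counts coincide (3 per-class models vs 3 pairwise models); with ten classes OvA would train 10 models and OvO would train 45.&lt;/p&gt;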

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Multi-label Classification:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, we have multi-label classification, where the goal is to assign one or more labels to each instance. This differs from multiclass classification, where each instance is assigned exactly one class.&lt;/p&gt;

&lt;p&gt;Examples include tagging documents with relevant keywords, or classifying images with multiple labels such as "sunset," "beach," and "ocean".&lt;/p&gt;

&lt;p&gt;A common application is a face-recognition classifier that attaches one tag per person in an image containing several people, yielding multiple labels for a single image.&lt;/p&gt;
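&lt;p&gt;As a quick illustration, here is a minimal multi-label sketch on synthetic data (the sample and label counts below are arbitrary assumptions):&lt;/p&gt;

```python
# Multi-label sketch: each sample can carry several labels at once.
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier

# y has shape (n_samples, n_classes); each row is a 0/1 indicator vector of labels
X, y = make_multilabel_classification(n_samples=200, n_labels=3, n_classes=5, random_state=42)

knn = KNeighborsClassifier()  # natively supports multi-label targets
knn.fit(X, y)
pred = knn.predict(X[:1])
print(pred.shape)  # one row, one column per possible label
```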




&lt;h2&gt;
  
  
  End-to-End Classification Task.
&lt;/h2&gt;

&lt;p&gt;In this section, we will focus on the core of machine learning and build classification models using the Titanic dataset, which can be found on this &lt;a href="https://www.kaggle.com/c/titanic"&gt;Kaggle competition&lt;/a&gt;. The goal is to predict whether or not a passenger survived the infamous 1912 ship disaster based on various features such as age, sex, class, and so on.&lt;/p&gt;

&lt;p&gt;Let's roll.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading dependencies and downloading the dataset.
&lt;/h3&gt;

&lt;p&gt;For the purposes of this article, we'll mainly be using the scikit-learn library (together with XGBoost), so let's import all the tools we'll need. Before that, make sure you've downloaded and unzipped the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All that is left is to load the dataset, let's go ahead and do that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train=pd.read_csv("titanic/train.csv")
test=pd.read_csv("titanic/test.csv")


print(f"Train Dataset shape: {train.shape}\n\nTest Dataset shape: {test.shape}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now that we've stored both the train and test .csv files in pandas dataframes, we can print the shape of each.
The train dataframe has 891 samples and 12 features; the test dataframe has 418 samples and 11 features. The train dataframe has one additional feature, which we will use as our target class.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's display the first five rows of both train and test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train.head()

test.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iyEANzUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4lqvbdq7hgzht4uy12lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iyEANzUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4lqvbdq7hgzht4uy12lr.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Data Analysis and EDA.
&lt;/h3&gt;

&lt;p&gt;To make things easier, let's join both the train and test dataframes to create a single dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset=pd.concat(objs=[train, test], axis=0)
dataset=dataset.set_index("PassengerId")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set_index&lt;/code&gt; method is called on dataset to set the index column as "PassengerId".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is important to note that test dataset should not be used in this way during training, as it is meant to be used as a separate dataset to evaluate the performance of the trained model on unseen data. In practice, we should not include the test data in the training dataset, and instead only use it for testing and evaluation purposes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's display the first and the last few rows of the concatenated dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.head()

dataset.tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fKkcvZb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81btrdiyf0ro2k3ephl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fKkcvZb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81btrdiyf0ro2k3ephl4.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's familiarize ourselves with the kind of dataset we're working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;From the output, the dataset has 1309 entries and 11 columns: 3 float columns, 3 integer columns, and 5 categorical columns of type object. Some columns have missing values. The Survived column has only 891 non-null values, meaning the rows that came from the test set are missing it; since this column was never present in the test dataset, we won't fill these missing values.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  About the features.
&lt;/h4&gt;

&lt;p&gt;The column attributes have the following meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PassengerId: a unique identifier for each passenger.&lt;/li&gt;
&lt;li&gt;Survived: that's the target, 0 means the passenger did not survive, while 1 means he/she survived.&lt;/li&gt;
&lt;li&gt;Pclass: passenger class.&lt;/li&gt;
&lt;li&gt;Name, Sex, Age: Self-explanatory.&lt;/li&gt;
&lt;li&gt;SibSp: number of siblings and spouses of the passenger aboard the Titanic.&lt;/li&gt;
&lt;li&gt;Parch: number of children &amp;amp; parents of the passenger aboard the Titanic.&lt;/li&gt;
&lt;li&gt;Ticket: ticket id.&lt;/li&gt;
&lt;li&gt;Fare: price paid (in pounds).&lt;/li&gt;
&lt;li&gt;Cabin: passenger's cabin number.&lt;/li&gt;
&lt;li&gt;Embarked: where the passenger boarded the Titanic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Next Up: statistical summary.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provides a statistical summary of the numerical columns in the dataset. The &lt;code&gt;.T&lt;/code&gt; attribute transposes the table to make it more readable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YUqfJcDt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s0jcp1btf7zc0426cgq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YUqfJcDt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s0jcp1btf7zc0426cgq7.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The mean age was 29.88 years, and the oldest passenger was 80.&lt;/li&gt;
&lt;li&gt;The mean fare was 33.30 pounds.&lt;/li&gt;
&lt;li&gt;The survival rate was only 38%, quite sad 😢 .&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Visualizations.
&lt;/h3&gt;

&lt;p&gt;Let's now visualize the distribution of some key features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Target class
sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Sex", height=5)
plt.title("Survival Distribution", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show(); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--501Ey1Ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtr5jr1qxuitafwzfkg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--501Ey1Ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtr5jr1qxuitafwzfkg5.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the generated categorical plot, we can see that most male passengers did not survive, while most women did. This is quite an interesting visual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's continue uncovering these relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Embarked", height=5)
plt.title("Survival Distribution by boarding Place", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KqcDnzMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71eht3bwk6h9bienakrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KqcDnzMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71eht3bwk6h9bienakrw.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Embarked column in this dataset indicates each passenger's port of embarkation. The values represent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C&lt;/code&gt;: Cherbourg.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q&lt;/code&gt;: Queenstown.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S&lt;/code&gt;: Southampton.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most passengers boarded at Southampton, so it dominates both outcome groups in the counts; proportionally, though, passengers who boarded at Cherbourg had the highest survival rate.&lt;/p&gt;

&lt;p&gt;Let's look at a few more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Pclass", height=5)
plt.title("Survival Distribution by Passenger Class", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jO3Ps7__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptl08uom4c6yb3xl94gq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jO3Ps7__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptl08uom4c6yb3xl94gq.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Third-class passengers account for the largest number of deaths, while first-class passengers were the most likely to survive the incident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To further understand the distribution of the survival outcomes, let's create a new column &lt;code&gt;age_dec&lt;/code&gt; that bins the Age column into decades.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset['age_dec']=dataset.Age.map(lambda Age:10*(Age//10))

sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="age_dec", height=5)
plt.title("Survival Distribution by Age Decade", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kB-4m_KE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/745qpdu4b1z3mvy4lmvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kB-4m_KE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/745qpdu4b1z3mvy4lmvz.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The plot shows that passengers in their 20s and 30s dominate both outcome groups, largely because those decades were the most numerous aboard; the counts taper off for the youngest and oldest decades.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, let's create a violin plot that shows the distribution of passengers by age decade and passenger class, with the hue representing the passenger's sex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.violinplot(x="age_dec", y="Pclass", hue="Sex",data=dataset,
               split=True, inner='quartile',
               palette=['lightpink', 'lightblue']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The plot shows the distribution of passenger ages by decade and class, with the violins representing the density of passengers at different ages. The split violins allow for easy comparison of the distributions for male and female passengers within each age and class group.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Preprocessing.
&lt;/h3&gt;

&lt;p&gt;Time to preprocess the data before feeding it into our models. But first, how many missing values do we have? While we're at it, let's also build a pandas dataframe containing the percentage of missing values in each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.isnull().sum()

#percentage missing
p_missing= dataset.isnull().sum()*100/len(dataset)
missing=pd.DataFrame({"columns ":dataset.columns,
                     "Missing Percentage": p_missing})
missing.reset_index(drop=True, inplace=True)
missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ySSofEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ui9n0vo1r2u0hsqfpkpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ySSofEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ui9n0vo1r2u0hsqfpkpv.png" alt="Image description" width="448" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;About 20% of the values are missing in both the Age and age_dec columns, while the Cabin column has 77% of its values missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next we need to identify which columns we will be dropping, and then split the columns into numerical and categorical columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;col_drop=['Name','Ticket', 'Cabin']
print(f"Dataset shape before dropping: {dataset.shape}")
dataset=dataset.drop(columns=col_drop)
print(f"Dataset shape after dropping: {dataset.shape}")

#Numerical and Categorical columns
cat=[col for col in dataset.select_dtypes('object').columns]
num=[col for col in dataset.select_dtypes(include=['int', 'float']).columns if col not in ['Survived']]
print(cat)
print(num)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The code is dropping the &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Ticket&lt;/code&gt;, and &lt;code&gt;Cabin&lt;/code&gt; columns from the dataset DataFrame, and then creating two lists cat and num.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cat&lt;/code&gt; list contains the names of categorical columns in the &lt;code&gt;dataset&lt;/code&gt; which are of &lt;code&gt;object&lt;/code&gt; datatype while &lt;code&gt;num&lt;/code&gt; list contains the names of numerical columns which are of &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;float&lt;/code&gt; datatype except &lt;code&gt;Survived&lt;/code&gt; column which is our target feature.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Split the dataset.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train=dataset[:891]
test=dataset[891:].drop("Survived", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;train&lt;/code&gt; dataframe contains the first 891 rows of the original &lt;code&gt;dataset&lt;/code&gt;, which corresponds to the training data for the Titanic survival prediction problem. The &lt;code&gt;test&lt;/code&gt; dataframe contains the remaining rows of the original dataset, which corresponds to the test data for the problem. The &lt;code&gt;Survived&lt;/code&gt; column is dropped from the &lt;code&gt;test&lt;/code&gt; dataframe because this is the target variable that we are trying to predict, and it is not present in the test data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Transformation Pipelines.
&lt;/h3&gt;

&lt;p&gt;To prepare the data for machine learning algorithms, we need to convert the data into a format that can be processed by these algorithms. One important aspect of this is to transform categorical data into numerical data, and to fill in any missing values in the data. To automate these tasks, we can create data pipelines that perform these operations for us. These pipelines will help to transform our data into a format that can be used by machine learning algorithms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_pipeline=Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipeline=Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder())
])

full_pipeline=ColumnTransformer([
    ("cat", cat_pipeline, cat),
    ("num", num_pipeline, num)
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;num_pipeline&lt;/code&gt; first applies a &lt;code&gt;SimpleImputer&lt;/code&gt; to fill in missing values with the median value of the column, and then standardizes the data using &lt;code&gt;StandardScaler()&lt;/code&gt;. The &lt;code&gt;cat_pipeline&lt;/code&gt; applies a &lt;code&gt;SimpleImputer&lt;/code&gt; to fill in missing values with the most frequent value of the column, and then encodes the categorical variables using &lt;code&gt;OneHotEncoder()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;ColumnTransformer&lt;/code&gt; is then used to apply these pipelines to the respective columns in the dataset. The &lt;code&gt;cat&lt;/code&gt; and &lt;code&gt;num&lt;/code&gt; lists created earlier are used to specify which columns belong to each pipeline.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Transform the train dataset
X_train = full_pipeline.fit_transform(train[cat+num])
X_train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The code above applies the &lt;code&gt;full_pipeline&lt;/code&gt; ColumnTransformer to the selected columns of the train dataset ( &lt;code&gt;train[cat+num]&lt;/code&gt; ), which contain both categorical ( &lt;code&gt;cat&lt;/code&gt; ) and numerical ( &lt;code&gt;num&lt;/code&gt; ) features. The output is a NumPy array containing the transformed features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's transform the test dataset with the pipeline that was fitted on the training data. Note the use of &lt;code&gt;transform&lt;/code&gt; rather than &lt;code&gt;fit_transform&lt;/code&gt;: the test set must be transformed using the statistics learned from the training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_test=full_pipeline.transform(test[cat+num])
X_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identify the target class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_train=train["Survived"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Classification Models.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Random Forest Classifier.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf=RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train )
y_pred=rf.predict(X_test)


scores=cross_val_score(rf, X_train, y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The above code fits the random forest classifier to the training dataset and predicts the target variable for the test dataset. It then performs 20-fold cross-validation on the training dataset, taking the random forest classifier, the predictor variables &lt;code&gt;X_train&lt;/code&gt;, the target variable &lt;code&gt;y_train&lt;/code&gt;, and the &lt;code&gt;cv=20&lt;/code&gt; parameter. It returns the accuracy score for each fold, and &lt;code&gt;scores.mean()&lt;/code&gt; returns the mean accuracy across all folds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model performed well: a mean cross-validation accuracy of 80.01% is quite satisfying for a first run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XGBoost Classifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's try out a gradient boosting classifier and see how well it performs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = XGBClassifier(
    n_estimators=100,
    max_depth=8,
    n_jobs=-1,
    random_state=42
)

model.fit(X_train, y_train)
y_pred=model.predict(X_test)


scores=cross_val_score(model, X_train, y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Not bad; 79% is slightly below the random forest, but remember this is our first time running these models, without any feature engineering or hyperparameter tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's try two more models, a support vector machine and a neural network, to see if our performance improves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support Vector Machine.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;svm=SVC(gamma="auto")
svm.fit(X_train, y_train)
y_pred=svm.predict(X_test)

scores=cross_val_score(svm, X_train,y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hooray!!! Our performance just improved by 1%; we're now at 81%. It might seem subtle, but it's quite a win considering the kind of task we're handling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This model seems promising and would be highly recommended for further development.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Perceptron Classifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let's create a neural network classifier to tackle the problem. &lt;code&gt;MLPClassifier&lt;/code&gt; is a multi-layer perceptron, a neural network algorithm mostly used for classification tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nn=MLPClassifier(hidden_layer_sizes=(20,15,25))
nn.fit(X_train, y_train)
nn.predict(X_test)

scores=cross_val_score(nn, X_train, y_train, cv=15)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Not disappointing at all; the model gave us 80.15% performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, our models performed well, averaging around 80% accuracy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PS: Your scores might be different from those herein but that's not something to worry about.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;Fun Fact: Looking at the leaderboard for the Titanic competition on Kaggle, our scores would've been among the top 3%. Isn't that mind-blowing? 🤩😎&lt;/p&gt;

&lt;p&gt;Some Kagglers even achieved a 100% score, yeah, you heard me right, 100%…that's nuts. Special respect to all of them. 👏&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tips to Improve performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hyperparameter tuning using grid search and cross-validation.&lt;/li&gt;
&lt;li&gt;Feature engineering, e.g., combining attributes such as &lt;code&gt;SibSp&lt;/code&gt; and &lt;code&gt;Parch&lt;/code&gt; into a single family-size feature.&lt;/li&gt;
&lt;li&gt;Identify parts of names that correlate well with the Survived attribute.&lt;/li&gt;
&lt;/ul&gt;
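&lt;p&gt;The first tip can be sketched as follows. The parameter grid is illustrative, not a tuned recommendation, and synthetic data stands in for the preprocessed Titanic features:&lt;/p&gt;

```python
# Hedged sketch of hyperparameter tuning with GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

&lt;p&gt;&lt;code&gt;best_params_&lt;/code&gt; holds the winning combination and &lt;code&gt;best_estimator_&lt;/code&gt; a model refitted with it on the full training data.&lt;/p&gt;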




&lt;p&gt;And with that, we're done with our classification model development. Let's connect on &lt;a href="https://www.linkedin.com/in/riri-njaramba/"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/icy_riri"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: The Ultimate Guide</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Mon, 27 Feb 2023 21:11:42 +0000</pubDate>
      <link>https://forem.com/njarambariri/exploratory-data-analysis-the-ultimate-guide-40ah</link>
      <guid>https://forem.com/njarambariri/exploratory-data-analysis-the-ultimate-guide-40ah</guid>
      <description>&lt;h3&gt;
  
  
  Definition:
&lt;/h3&gt;

&lt;p&gt;Exploratory Data Analysis (EDA), also referred to as data exploration, is the process of analyzing, investigating, and summarizing datasets to gain insight into the underlying patterns and relationships within them. This is done by employing data visualization techniques and graphical statistical methods like histograms, heatmaps, violin plots, and joint plots. Technically, EDA is all about 'understanding the dataset'.&lt;/p&gt;

&lt;p&gt;'Understanding' in this context might refer to quite a number of things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracting important variables, a process normally referred to as feature engineering.&lt;/li&gt;
&lt;li&gt;Identifying and dealing with outliers and missing values.&lt;/li&gt;
&lt;li&gt;Understanding the relationships between variables, whether linear or non-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By employing EDA techniques, you can turn a messy dataset into a clean one. Overall, EDA is a crucial part of any data analysis project, and it often guides the data analyst through further analysis and data modelling.&lt;/p&gt;

&lt;p&gt;In this article we will dive deeper into EDA and discuss several topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data cleaning and preparation &lt;/li&gt;
&lt;li&gt;Univariate Analysis&lt;/li&gt;
&lt;li&gt;Bivariate Analysis&lt;/li&gt;
&lt;li&gt;Multivariate Analysis&lt;/li&gt;
&lt;li&gt;Visualization Techniques&lt;/li&gt;
&lt;li&gt;Descriptive Statistics.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Cleaning and Preparation
&lt;/h3&gt;

&lt;p&gt;The first step in data analysis is to clean and prepare the data. This might involve identifying and correcting missing values, removing outliers, and transforming variables as necessary. For a successful data analysis process, the data needs to be as accurate and reliable as possible. &lt;/p&gt;

&lt;p&gt;Let's have a look at how you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries 
import pandas as pd

# Read in data
df = pd.read_csv('data.csv')

# Check for missing values
print(df.isnull().sum())

# Remove outliers
df = df[df['column_name'] &amp;lt; 100]

# Transform variables
df['new_column'] = df['column_name'] * 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Univariate Analysis
&lt;/h3&gt;

&lt;p&gt;Univariate analysis involves analyzing each variable in the dataset individually. Say you have a variable named &lt;code&gt;age&lt;/code&gt;; with univariate analysis, you can calculate its summary statistics, e.g., mean, median, mode, standard deviation, and variance. This step also involves visualizing the distribution of each variable using histograms, box plots, density plots, etc.&lt;/p&gt;

&lt;p&gt;We can use the Seaborn library to perform a univariate analysis on a dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns

# Load data
tips = sns.load_dataset('tips')

# Calculate summary statistics
print(tips.describe())

# Visualize distribution with histogram
sns.histplot(tips['total_bill'], kde=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bivariate Analysis
&lt;/h3&gt;

&lt;p&gt;On the other side of univariate analysis we have bivariate analysis, which, as the name suggests, involves analyzing the relationship between two variables in a dataset. Again, let's look at this from a practical point of view: you have two variables, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;weight&lt;/code&gt;, and you need to understand the relationship between them. Bivariate analysis lets you use graphical methods such as scatter plots, bar charts, and line plots to visualize this relationship. It also includes calculating correlation coefficients, cross-tabulations, and contingency tables. &lt;/p&gt;

&lt;p&gt;Let's use the Matplotlib library to illustrate bivariate analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
iris = sns.load_dataset('iris')

# Calculate correlation coefficients (numeric_only skips the non-numeric species column)
print(iris.corr(numeric_only=True))

# Visualize relationship with scatter plot
plt.scatter(iris['sepal_length'], iris['sepal_width'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multivariate Analysis
&lt;/h3&gt;

&lt;p&gt;This is the statistical procedure for analyzing the relationships between more than two variables at once. Alternatively, multivariate analysis can be used to analyze the relationship between dependent and independent variables. Its major applications include clustering, feature selection, dimensionality reduction, and hypothesis testing.&lt;/p&gt;
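&lt;p&gt;As a small illustration of multivariate analysis, we can compute a pairwise correlation matrix across several variables at once. The sketch below uses NumPy with made-up synthetic variables (&lt;code&gt;height&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;); the names and numbers are illustrative assumptions, not from a real dataset.&lt;/p&gt;

```python
import numpy as np

# Synthetic, made-up variables: weight is constructed to correlate with height,
# while age is generated independently of both
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
weight = 0.9 * height + rng.normal(0, 5, 100)
age = rng.uniform(20, 60, 100)

# One call gives every pairwise correlation: a 3x3 matrix with 1s on the diagonal
corr = np.corrcoef([height, weight, age])
print(corr.round(2))
```

&lt;p&gt;The height-weight entry comes out strongly positive while the age entries hover near zero, which is exactly the kind of at-a-glance summary of many relationships that multivariate analysis is after.&lt;/p&gt;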

&lt;h3&gt;
  
  
  Visualization Techniques
&lt;/h3&gt;

&lt;p&gt;Another very important component of EDA is data visualization, which gives the data analyst a chance to explore and understand the data visually. This step is crucial for any organization, as visualizations are easily understood by the non-technical people in the organization. Non-technical people sometimes have a hard time understanding the 'under-the-hood' variable relationships, but with data visualization they can easily grasp the relationships between different variables in the dataset. There are several techniques and tools used in this process. Some tools commonly used by data analysts to visualize data include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MS Power Bi&lt;/li&gt;
&lt;li&gt;MS Excel&lt;/li&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Google Data Studio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As for the techniques, there are numerous ways to visualize your data. We will discuss some of them using the pandas library:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Histograms&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Histograms are used to visualize the distribution of a continuous variable like &lt;code&gt;height&lt;/code&gt;. The &lt;code&gt;hist()&lt;/code&gt; method is used to generate a histogram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a histogram
df['column_name'].hist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Boxplots are used to visualize the distribution of continuous variables to detect outliers. The &lt;code&gt;boxplot()&lt;/code&gt; method is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a boxplot
df.boxplot(column='column_name')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Scatterplots&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This technique is used to visualize the relationship between two continuous variables, e.g, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;weight&lt;/code&gt;. The &lt;code&gt;plot.scatter()&lt;/code&gt; method is used to generate a scatterplot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a scatterplot
df.plot.scatter(x='column1', y='column2')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Bar Charts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bar charts are used in visualizing the distribution of categorical variables in a dataset. An example of a categorical variable is &lt;code&gt;gender&lt;/code&gt;, &lt;code&gt;race&lt;/code&gt;, &lt;code&gt;type of job&lt;/code&gt; etc. The &lt;code&gt;plot.bar()&lt;/code&gt; method is used to generate a bar chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a bar chart
df['column_name'].value_counts().plot.bar()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hands-on EDA
&lt;/h3&gt;

&lt;p&gt;We've talked a lot about the theoretical side of EDA; now let's get to the fun part, where we apply these techniques to a real-world dataset. Working with real-world data can at times be quite hard and frustrating, as it involves paying careful attention to data cleaning, exploration, handling outliers, dealing with missing data, and finally understanding the data. It's also good to keep in mind that the ultimate goal for any data scientist is an accurate, meaningful analysis that is relevant to the problem at hand. To get your hands on real-world data, there are various open-source and free data websites that provide a wide pool of datasets, including Data.gov, Kaggle, World Bank Open Data, OpenML, Datahub, etc.&lt;/p&gt;

&lt;p&gt;Enough said, let's now dive right into the nitty-gritty. For this particular article, I'll be using an East African dataset that can be found at: &lt;a href="https://www.kaggle.com/datasets/enockmokua/financial-dataset" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/enockmokua/financial-dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Importing Libraries&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries 
%matplotlib inline #for displaying plots directly below the code cell
import matplotlib.pyplot as plt #for creating plots
import pandas as pd #for data manipulation and analysis
import numpy as np # for working with arrays and matrices
import seaborn as sns #for complex visualizations that can't be achieved by plt
sns.set(); #sets the default parameters for seaborn

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Loading and Exploring the Data&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#load the data
df=pd.read_csv("Datasets/finance.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#check the dataset size and shape 
df.shape, df.size 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj95qv8f4oj5onhbfnzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj95qv8f4oj5onhbfnzr.png" alt=" " width="560" height="83"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the first 5 rows
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e1w3kyntybdzgfvhgv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e1w3kyntybdzgfvhgv6.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the last 5 rows 
df.tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b3fbx0yoha94ijzwl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b3fbx0yoha94ijzwl4.png" alt=" " width="800" height="265"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#View the column names
df.columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh9elu2benb4e9938au5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh9elu2benb4e9938au5.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#view the column data types
df.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknurjq65h4o0ru0plq6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknurjq65h4o0ru0plq6g.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#view the summary statistics of the numerical columns
df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7pkq4d2cie2dzbxwkys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7pkq4d2cie2dzbxwkys.png" alt=" " width="552" height="308"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the summary of the DataFrame
df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the total number of missing values in each column
df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmkbavt5dxvoyhea1ig0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmkbavt5dxvoyhea1ig0.png" alt=" " width="553" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create a mini dataframe to see the % of the missing values
missing=(df.isnull().sum()*100/len(df))
missing_df=pd.DataFrame({'Percentage missing': missing})
missing_df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn4nge813z7g9hn0odvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn4nge813z7g9hn0odvg.png" alt=" " width="550" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Cleaning&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Drop irrelevant columns, or columns with too many missing values
df.drop(['Unnamed: 0','year'], axis=1, inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgtjsvrzay9z4bclvlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgtjsvrzay9z4bclvlc.png" alt=" " width="553" height="35"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Fill the missing values with the mean or median
df['Respondent Age'].fillna(df['Respondent Age'].mean(), inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyppawglta6ao2z10u738.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyppawglta6ao2z10u738.png" alt=" " width="551" height="66"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Check for any duplicates and drop them if exists
df.drop_duplicates(inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuuycxgrzx8qj5wqk093.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuuycxgrzx8qj5wqk093.png" alt=" " width="549" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Creating visualizations&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will create histogram visualizations using the Matplotlib and Seaborn libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Using matplotlib
x=(df["Respondent Age"])
plt.hist(x,100,density=True, facecolor="green")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexssidhswcf3tw49fny9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexssidhswcf3tw49fny9.png" alt=" " width="563" height="419"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Using Seaborn
sns.histplot(df['Respondent Age']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2s3gw57klbb4heuzp62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2s3gw57klbb4heuzp62.png" alt=" " width="583" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the two visualizations, we can see the differences between the two libraries: Seaborn tends to produce clearer visualizations than matplotlib.pyplot.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Scatter plots for two numerical columns
sns.scatterplot(data=df, x='Respondent Age',y='household_size',hue='Has a Bank account');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwn20texu5wks95xkr5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwn20texu5wks95xkr5b.png" alt=" " width="558" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Boxplot of a numerical column by a categorical column
sns.boxplot(x='Respondent Age',y='country', data=df);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjep0qy2m58irr8oqici9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjep0qy2m58irr8oqici9.png" alt=" " width="610" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boxplots are mostly used to check for outliers in a dataset
&lt;/li&gt;
&lt;/ul&gt;
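&lt;p&gt;The outlier check a boxplot performs visually can also be done numerically with the 1.5 * IQR rule: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged. A minimal sketch with NumPy and a made-up &lt;code&gt;ages&lt;/code&gt; array:&lt;/p&gt;

```python
import numpy as np

# Made-up ages; 95 is an obvious outlier
ages = np.array([23, 25, 27, 29, 31, 33, 35, 37, 39, 95])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the whiskers, which is what a boxplot draws as points
mask = np.logical_or(np.less(ages, lower), np.greater(ages, upper))
outliers = ages[mask]
print(outliers)  # only 95 falls outside the whiskers
```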

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A heatmap of the correlation between columns
sns.heatmap(df.corr());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06dhod5232xy28gk1t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06dhod5232xy28gk1t4.png" alt=" " width="521" height="422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A bar chart of the headcount per country
df.country.value_counts().plot(kind='bar')
plt.xlabel("Country")
plt.ylabel("Count");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpep6rw3032zy7v2zood5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpep6rw3032zy7v2zood5.png" alt=" " width="583" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A pie chart for the number of respondents per country
counts=df.country.value_counts()
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3c2yi3qqfd1pum6o4x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3c2yi3qqfd1pum6o4x7.png" alt=" " width="453" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chi-Squared Test
&lt;/h3&gt;

&lt;p&gt;So far we've mostly described and visualized numerical variables, and you might be wondering, "What about the categorical variables?" That's where the chi-squared test comes in. The chi-squared test is a statistical test used to determine whether there is a significant association between two categorical variables. It compares observed data with expected data to determine whether the differences between them are large enough to reject the null hypothesis that there is no association between the variables.&lt;/p&gt;

&lt;p&gt;In Python, you can use the &lt;code&gt;scipy.stats&lt;/code&gt; module, which provides a function called &lt;code&gt;chi2_contingency()&lt;/code&gt; that calculates the chi-squared statistic, degrees of freedom, p-value, and expected frequencies for a contingency table. Let's try it out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from scipy.stats import chi2_contingency

# create a contingency table
table = np.array([[10, 20, 30], [15, 25, 35]])

# perform the chi-squared test
chi2, p, dof, expected = chi2_contingency(table)

# print the results
print('Chi-squared statistic:', chi2)
print('Degrees of freedom:', dof)
print('P-value:', p)
print('Expected frequencies:\n', expected)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The contingency table we created, with two rows and three columns, represents the frequencies of two categorical variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expected output will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chi-squared statistic: 0.27692307692307694
Degrees of freedom: 2
P-value: 0.870696738961232
Expected frequencies:
 [[11.11111111 20.         28.88888889]
 [13.88888889 25.         36.11111111]]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis and conclude that there is no significant association between the variables.&lt;/p&gt;
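&lt;p&gt;To see where those expected frequencies come from, we can recompute them by hand: each expected cell is (row total * column total) / grand total. A quick NumPy check against the output above:&lt;/p&gt;

```python
import numpy as np

# Same contingency table as before
table = np.array([[10, 20, 30], [15, 25, 35]])

row_totals = table.sum(axis=1)    # [60, 75]
col_totals = table.sum(axis=0)    # [25, 45, 65]
grand_total = table.sum()         # 135

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)  # matches the expected frequencies reported by chi2_contingency
```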

&lt;p&gt;Check out this &lt;a href="https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223" rel="noopener noreferrer"&gt;article&lt;/a&gt; for a deeper understanding of the chi-squared test.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>SQL 101: Introduction to SQL</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Sat, 18 Feb 2023 20:09:40 +0000</pubDate>
      <link>https://forem.com/njarambariri/sql-101-introduction-to-sql-i89</link>
      <guid>https://forem.com/njarambariri/sql-101-introduction-to-sql-i89</guid>
      <description>&lt;h2&gt;
  
  
  What is SQL?
&lt;/h2&gt;

&lt;p&gt;You've probably heard the acronym SQL many times, maybe from your friends, colleagues, or teachers, but what really is SQL? SQL stands for Structured Query Language, and it's the lingua franca used to create, manage, manipulate, and retrieve data from databases. It was developed in the 1970s by IBM computer scientists.&lt;/p&gt;

&lt;p&gt;Now that you know what SQL is, what does it really do? SQL can perform various tasks in a database, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute queries against databases.&lt;/li&gt;
&lt;li&gt;Create new tables in databases.&lt;/li&gt;
&lt;li&gt;Insert records into databases.&lt;/li&gt;
&lt;li&gt;Update records in a database.&lt;/li&gt;
&lt;li&gt;Create and maintain database users.&lt;/li&gt;
&lt;li&gt;Delete records.&lt;/li&gt;
&lt;li&gt;Retrieve data from databases, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL is a very effective language that is easy to learn and use. It is also functionally complete, thanks to its ability to let users define, retrieve, and manipulate data in tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Statements
&lt;/h2&gt;

&lt;p&gt;Statements in SQL are sets of instructions consisting of identifiers, parameters, variables, data types, and SQL reserved keywords. An SQL statement must compile successfully, e.g. &lt;code&gt;DROP TABLE Users&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. Data Manipulation Language (DML)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;SELECT&lt;/code&gt;- Retrieves certain records from one or more tables.&lt;br&gt;
&lt;code&gt;INSERT&lt;/code&gt;- Creates a new record.&lt;br&gt;
&lt;code&gt;UPDATE&lt;/code&gt;- Modifies an existing record.&lt;br&gt;
&lt;code&gt;DELETE&lt;/code&gt;- Deletes particular record(s).&lt;br&gt;
&lt;code&gt;MERGE&lt;/code&gt; - Combines the separate &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt; statements into a single SQL query.&lt;/p&gt;
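&lt;p&gt;To make the DML statements concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and an in-memory database. The &lt;code&gt;Users&lt;/code&gt; table and its rows are made up for illustration (note that SQLite itself does not implement &lt;code&gt;MERGE&lt;/code&gt;).&lt;/p&gt;

```python
import sqlite3

# In-memory database so nothing touches disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT)")

cur.execute("INSERT INTO Users (name) VALUES ('Alice')")           # INSERT: create records
cur.execute("INSERT INTO Users (name) VALUES ('Bob')")
cur.execute("UPDATE Users SET name = 'Bobby' WHERE name = 'Bob'")  # UPDATE: modify a record
cur.execute("DELETE FROM Users WHERE name = 'Alice'")              # DELETE: remove records

cur.execute("SELECT name FROM Users")                              # SELECT: retrieve records
rows = cur.fetchall()
print(rows)  # [('Bobby',)]
conn.close()
```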

&lt;p&gt;&lt;em&gt;2. Data Definition Language (DDL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;CREATE&lt;/code&gt;- Creates a new table, a view of a table, or other object in the database.&lt;br&gt;
&lt;code&gt;ALTER&lt;/code&gt; - Modifies an existing database object, e.g a table.&lt;br&gt;
&lt;code&gt;DROP&lt;/code&gt;  - Deletes an entire table, a view of a table or other objects in the database.&lt;br&gt;
&lt;code&gt;RENAME&lt;/code&gt;- Used together with &lt;code&gt;ALTER&lt;/code&gt; to modify objects in a database.&lt;br&gt;
&lt;code&gt;TRUNCATE&lt;/code&gt;- Deletes all data from the table.&lt;br&gt;
&lt;code&gt;COMMENT&lt;/code&gt; - Starts with /* and ends with */, this part of the code is not executed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Data Control Language (DCL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;GRANT&lt;/code&gt; - Gives a privilege to user(s).&lt;br&gt;
&lt;code&gt;REVOKE&lt;/code&gt;- Takes back privileges that were previously granted to the user.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;4. Transaction Control Language (TCL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;COMMIT&lt;/code&gt;  - Stores the changes made by a transaction in the database.&lt;br&gt;
&lt;code&gt;ROLLBACK&lt;/code&gt;- Reverts all the changes made since the last &lt;code&gt;COMMIT&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;SAVEPOINT&lt;/code&gt;- Marks a point within a transaction to which a later &lt;code&gt;ROLLBACK&lt;/code&gt; can return.&lt;/p&gt;
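&lt;p&gt;A small sketch of &lt;code&gt;COMMIT&lt;/code&gt; and &lt;code&gt;ROLLBACK&lt;/code&gt; using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the &lt;code&gt;accounts&lt;/code&gt; table and its values are made up):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
cur.execute("INSERT INTO accounts VALUES ('Ann', 100)")
conn.commit()      # COMMIT: the inserted row is now permanent

cur.execute("UPDATE accounts SET balance = 0 WHERE name = 'Ann'")
conn.rollback()    # ROLLBACK: reverts everything since the last COMMIT

cur.execute("SELECT balance FROM accounts WHERE name = 'Ann'")
balance = cur.fetchone()[0]
print(balance)  # 100, because the update was rolled back
conn.close()
```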

&lt;h4&gt;
  
  
  Writing SQL statements.
&lt;/h4&gt;

&lt;p&gt;While writing SQL statements, it's good to note:&lt;br&gt;
i. SQL statements are not case sensitive.&lt;br&gt;
ii. SQL can be entered on many lines.&lt;br&gt;
iii. Keywords cannot be split across lines.&lt;br&gt;
iv. Clauses are usually placed on separate lines for readability and ease of editing.&lt;br&gt;
v. Indents make it more readable.&lt;br&gt;
vi. Keywords may be entered in caps and all others in lowercase.&lt;/p&gt;
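&lt;p&gt;Putting those conventions together, here is what a well-formatted statement looks like in practice, run through Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the &lt;code&gt;students&lt;/code&gt; table and its rows are made up): keywords in caps, one clause per line, indented for readability, with the statement spread over several lines.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
cur.executemany("INSERT INTO students VALUES (?, ?)",
                [("Amina", 82), ("Brian", 67), ("Cheru", 91)])

# Keywords capitalized, each clause on its own separate line
cur.execute("""
    SELECT name, grade
    FROM students
    WHERE grade >= 80
    ORDER BY grade DESC
""")
rows = cur.fetchall()
print(rows)  # [('Cheru', 91), ('Amina', 82)]
conn.close()
```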

&lt;h3&gt;
  
  
  Why Learn SQL?
&lt;/h3&gt;

&lt;p&gt;If you're a professional in the software development domain, or a student who wants to become a software engineer, SQL is an essential query language to equip yourself with. In most application software, developers use SQL to store and manipulate data. Most Relational Database Management Systems (RDBMS), such as MySQL, Oracle, Postgres, Sybase, and MS Access, use SQL as their standard database language.  &lt;/p&gt;

&lt;h3&gt;
  
  
  How SQL works.
&lt;/h3&gt;

&lt;p&gt;When you execute SQL commands for a given task, the SQL engine interprets the code while the system on which you're running it determines the best way to carry out your request. &lt;/p&gt;

&lt;p&gt;This sounds like a complicated process, but it's not. Now let's see the steps involved in query processing: the process of translating high-level SQL queries into low-level expressions used at the physical level of the file system, along with query optimization and, of course, the actual execution of the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Parser
&lt;/h3&gt;

&lt;p&gt;During this stage, after converting the query into relational algebra, the database performs two checks: a syntax check and a semantic check.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Syntax Check:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;This involves checking whether the rules for writing an SQL command (its syntax) have been followed.&lt;br&gt;
e.g. &lt;code&gt;SELECT * FORM students&lt;/code&gt;&lt;br&gt;
The above command can't be executed and will result in an error, due to the misspelling of the keyword &lt;code&gt;FROM&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Semantic Check:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;During this check, the parser determines whether a statement is meaningful. For example, if you request a table named &lt;code&gt;Students&lt;/code&gt; that you haven't created yet, the table doesn't exist; the semantic check catches this.&lt;/p&gt;
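&lt;p&gt;Both checks can be observed from any SQL client. The following sketch, a hypothetical illustration using SQLite through Python's &lt;code&gt;sqlite3&lt;/code&gt; module, shows the misspelled-keyword query failing the syntax check and the missing-table query failing the semantic check:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Syntax check failure: FROM is misspelled, so the statement cannot be parsed.
try:
    cur.execute("SELECT * FORM students")
except sqlite3.OperationalError as e:
    syntax_error = str(e)

# Semantic check failure: the statement parses, but the table does not exist.
try:
    cur.execute("SELECT * FROM students")
except sqlite3.OperationalError as e:
    semantic_error = str(e)

print(syntax_error)   # a parse-level complaint mentioning "syntax error"
print(semantic_error) # no such table: students
```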

&lt;h3&gt;
  
  
  Step 2:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Optimizer:
&lt;/h3&gt;

&lt;p&gt;During the optimization stage, the database must perform a hard parse (re-loading the SQL command into the shared pool) for at least one unique DML (Data Manipulation Language) statement, and it performs optimization during that parse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Execution Engine:
&lt;/h3&gt;

&lt;p&gt;The query is finally executed and the output is displayed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hands-on SQL Practicals.
&lt;/h1&gt;

&lt;p&gt;Now let's get down to the nitty-gritty aspect of SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Creating Tables&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;In SQL you create tables using the &lt;code&gt;CREATE TABLE&lt;/code&gt; statement. When creating a table, you must provide three basic essentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table name.&lt;/li&gt;
&lt;li&gt;Column names.&lt;/li&gt;
&lt;li&gt;Data types for each column.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Guidelines for creating tables&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Table and column naming rules&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must start with a letter, followed by a sequence of letters, numbers, _, #, or $.&lt;/li&gt;
&lt;li&gt;Must be 1 to 30 characters long.&lt;/li&gt;
&lt;li&gt;Must not be an SQL reserved word. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;data types&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;VARCHAR2(n):&lt;/em&gt; Variable-length character string of up to n characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;CHAR(n):&lt;/em&gt; Fixed-length character string of n characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER(n):&lt;/em&gt; Integer number of up to n digits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER(precision, scale):&lt;/em&gt; Fixed-point decimal number. “precision” is the total number of digits; “scale” is the number of digits to the right of the decimal point. The decimal point is not counted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER:&lt;/em&gt; Floating-point decimal number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;DATE:&lt;/em&gt; DD-MON-YY (or YYYY) HH:MM:SS A.M. (or P.M.) form date-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;LONG:&lt;/em&gt; Variable-length character string up to 2 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NCHAR:&lt;/em&gt; Like LONG, but for national (international) character sets (2 bytes per character).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;CLOB:&lt;/em&gt; Single-byte ASCII character data up to 4 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;BLOB:&lt;/em&gt; Binary data (e.g., program, image, or sound) of up to 4 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;BFILE:&lt;/em&gt; Reference to a binary file that is external to the database (OS file).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;RAW(size) or LONG RAW:&lt;/em&gt; Raw binary data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;ROWID:&lt;/em&gt;  Unique row address in hexadecimal format.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees&lt;br&gt;
(&lt;br&gt;
employee_id   number(7) not null,&lt;br&gt;
first_name    varchar2(20),&lt;br&gt;
last_name     varchar2(20),&lt;br&gt;
cellphone     varchar2(12),&lt;br&gt;
email         varchar2(20),&lt;br&gt;
hire_date     date,&lt;br&gt;
job_id        varchar2(5),&lt;br&gt;
salary        number(12,2),&lt;br&gt;
manager_id    number(6),&lt;br&gt;
department_id number(4)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Adding data into tables&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;You use the keyword &lt;code&gt;INSERT&lt;/code&gt; to add data into any table in SQL.&lt;br&gt;
Example:&lt;br&gt;
&lt;code&gt;INSERT INTO employees&lt;br&gt;
VALUES(1000,'Simon','Otieno','0722456789','otieno@yahoo.com','01-jan-90','5500',32000,5000,10);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INSERT INTO employees&lt;br&gt;
VALUES(1001,'Alice','Mwangi','0720766659','alice@yahoo.com','02-feb-80','5600',42000,5000,10);&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Retrieving data from database objects using &lt;code&gt;SELECT&lt;/code&gt; Statement&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, &lt;code&gt;SELECT&lt;/code&gt; is used to retrieve data from the database. The &lt;code&gt;SELECT&lt;/code&gt; statement gives you the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Projection - choose columns/fields from a table through a query.&lt;/li&gt;
&lt;li&gt;Selection - choose rows in a table.&lt;/li&gt;
&lt;li&gt;Joining - bring together data stored in different tables by specifying the link between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT * FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id,first_name,last_name, email , job_id, salary FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Arithmetic Expressions in SQL
&lt;/h3&gt;

&lt;p&gt;Arithmetic expressions in SQL perform arithmetic operations on the numeric operands/values stored in database tables.&lt;/p&gt;

&lt;p&gt;Operators used include:&lt;br&gt;
&lt;code&gt;+&lt;/code&gt; - for Addition.&lt;br&gt;
&lt;code&gt;-&lt;/code&gt; - for Subtraction operations.&lt;br&gt;
&lt;code&gt;*&lt;/code&gt; - for multiplication.&lt;br&gt;
&lt;code&gt;/&lt;/code&gt; - for division.&lt;br&gt;
&lt;code&gt;%&lt;/code&gt; - for modulus.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;SELECT employee_id, first_name,last_name, salary, salary + 3000 FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* salary + 700 FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* (salary + 700) FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* (salary - 2000) FROM employees;&lt;/code&gt;&lt;/p&gt;
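&lt;p&gt;Note that parentheses change the result of the second and third queries above, because multiplication binds tighter than addition. The following sketch, assuming a throwaway one-row table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module, makes the difference concrete:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (salary INTEGER)")
cur.execute("INSERT INTO employees VALUES (1000)")

# Without parentheses: (12 * 1000) + 700
print(cur.execute("SELECT 12*salary + 700 FROM employees").fetchone()[0])    # 12700
# With parentheses: 12 * (1000 + 700)
print(cur.execute("SELECT 12*(salary + 700) FROM employees").fetchone()[0])  # 20400
```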

&lt;h4&gt;
  
  
  Restricting and sorting data using the &lt;code&gt;SELECT&lt;/code&gt; statement.
&lt;/h4&gt;

&lt;h6&gt;
  
  
  Use of the &lt;code&gt;WHERE&lt;/code&gt; clause
&lt;/h6&gt;

&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause is used to filter records. It only extracts those records that fulfill a specified condition.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison conditions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;=&lt;/code&gt; - Equal to.&lt;br&gt;
 &lt;code&gt;&amp;gt;&lt;/code&gt; - Greater than.&lt;br&gt;
 &lt;code&gt;&amp;gt;=&lt;/code&gt; - Greater than or equal to.&lt;br&gt;
 &lt;code&gt;&amp;lt;&lt;/code&gt; - Less than.&lt;br&gt;
 &lt;code&gt;&amp;lt;=&lt;/code&gt; - Less than or equal to.&lt;br&gt;
 &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt; - Not equal to.&lt;br&gt;
 &lt;code&gt;IS NULL&lt;/code&gt; - Is a null value.&lt;br&gt;
 &lt;code&gt;IN (set)&lt;/code&gt; - Matches any value in the list.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE salary &amp;gt; 10000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE salary IN(10000,20000,30000);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Logical Conditions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AND&lt;/code&gt; - Returns TRUE if both component conditions are true.&lt;br&gt;
 &lt;code&gt;OR&lt;/code&gt; - Returns TRUE if either component condition is true.&lt;br&gt;
 &lt;code&gt;NOT&lt;/code&gt; - Returns TRUE if the following condition is false.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT employee_id,last_name FROM employees WHERE salary &amp;gt;=10000 AND manager_id=5000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id,last_name FROM employees WHERE department_id NOT IN(90,60,30);&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Using the &lt;code&gt;ORDER BY&lt;/code&gt; Clause.
&lt;/h4&gt;

&lt;p&gt;This clause sorts the records in a particular order. By default, it sorts records in ascending order.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;SELECT last_name,job_id, department_id FROM employees ORDER BY hire_date DESC;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT last_name,job_id, department_id FROM employees ORDER BY hire_date ASC;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;NB: ASC = ascending, DESC = descending.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE&lt;/code&gt; COMMAND&lt;br&gt;
As mentioned earlier in the article, this command is used to update data in a given table.&lt;/p&gt;

&lt;p&gt;examples:&lt;br&gt;
&lt;code&gt;UPDATE employees&lt;br&gt;
SET salary= 50000&lt;br&gt;
WHERE employee_id=1001;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE employees&lt;br&gt;
SET last_name='Opiyo'&lt;br&gt;
WHERE employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DELETE&lt;/code&gt; COMMAND&lt;br&gt;
It is used to delete or remove records from tables in a database.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;DELETE from employees&lt;br&gt;
where employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ROLLBACK&lt;/code&gt; COMMAND.&lt;br&gt;
To undo transactions in the database, the &lt;code&gt;ROLLBACK&lt;/code&gt; command is used. It is used together with data manipulation language (DML) commands.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;ROLLBACK;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COMMIT&lt;/code&gt; COMMAND.&lt;br&gt;
It ensures that records are permanently saved. It is used with data manipulation language (DML) commands.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;COMMIT;&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  SQL Constraints
&lt;/h4&gt;

&lt;p&gt;Constraints are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the table. If there is any violation between the constraint and the data action, the action is aborted.&lt;br&gt;
They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NOT NULL&lt;/code&gt;: Ensures that the column contains no null or empty values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255) NOT NULL,&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;UNIQUE&lt;/code&gt;: Requires that every value in the column be unique.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees2&lt;br&gt;
(&lt;br&gt;
employee_id number(6),&lt;br&gt;
last_name varchar2(20) not null,&lt;br&gt;
email varchar2(20),&lt;br&gt;
salary number(10,2),&lt;br&gt;
hire_date date not null,&lt;br&gt;
constraint emp_email_uk unique(email)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PRIMARY KEY&lt;/code&gt;: Creates a primary key for the table. Only one primary key can be created for each table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255),&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int,&lt;br&gt;
    PRIMARY KEY (Sale_Id)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FOREIGN KEY&lt;/code&gt;: Designates a column or combination of columns as a foreign key and establishes a relationship with a primary key or a unique key in the same table or a different table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees3&lt;br&gt;
(&lt;br&gt;
employee_id number(6) constraint emp_id_pk primary key,&lt;br&gt;
last_name varchar2(20) not null,&lt;br&gt;
first_name varchar2(20),&lt;br&gt;
salary number(10,2),&lt;br&gt;
hire_date date not null,&lt;br&gt;
department_id number(4) constraint emp_dept_fk references department (department_id)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For the above SQL command to work, the table department must already exist with a primary key on department_id.&lt;/p&gt;
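&lt;p&gt;That dependency can be demonstrated. The sketch below is a simplified SQLite version of the same schema, run through Python's &lt;code&gt;sqlite3&lt;/code&gt; module (note that SQLite requires foreign-key enforcement to be switched on explicitly): a child insert succeeds when the parent row exists and fails when it doesn't.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

# The parent table must exist first, with a primary key on department_id.
cur.execute("CREATE TABLE department (department_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE employees3 (
    employee_id   INTEGER PRIMARY KEY,
    last_name     TEXT NOT NULL,
    department_id INTEGER REFERENCES department (department_id)
)""")

cur.execute("INSERT INTO department VALUES (10, 'Engineering')")
cur.execute("INSERT INTO employees3 VALUES (1, 'Otieno', 10)")      # parent row exists: OK

try:
    cur.execute("INSERT INTO employees3 VALUES (2, 'Mwangi', 99)")  # no department 99
except sqlite3.IntegrityError as e:
    fk_error = str(e)

print(fk_error)  # FOREIGN KEY constraint failed
```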

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CHECK&lt;/code&gt;:The &lt;code&gt;CHECK&lt;/code&gt; constraint is used to ensure that all the records in a certain column follow a specific rule.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL UNIQUE,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255) CHECK (Vendor_Name &amp;lt;&amp;gt; 'ABC'),&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When you have constraints in place on columns, an error is returned if you try to violate a constraint rule.&lt;/p&gt;
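&lt;p&gt;This behaviour is easy to observe. The following sketch uses a simplified version of the Sales table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite also accepts &lt;code&gt;!=&lt;/code&gt; for not-equal), and collects the error raised by each kind of violation:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Sales (
    Sale_Id     INTEGER NOT NULL UNIQUE,
    Sale_Amount INTEGER NOT NULL,
    Vendor_Name TEXT CHECK (Vendor_Name != 'ABC')
)""")

cur.execute("INSERT INTO Sales VALUES (1, 500, 'XYZ')")  # satisfies every constraint

errors = []
for row in ("(NULL, 500, 'XYZ')",   # violates NOT NULL on Sale_Id
            "(1, 900, 'XYZ')",      # violates UNIQUE on Sale_Id
            "(2, 700, 'ABC')"):     # violates the CHECK rule
    try:
        cur.execute("INSERT INTO Sales VALUES " + row)
    except sqlite3.IntegrityError as e:
        errors.append(str(e))

print(errors)  # one NOT NULL, one UNIQUE, and one CHECK failure message
```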

&lt;h3&gt;
  
  
  &lt;code&gt;ALTER TABLE&lt;/code&gt; Statements.
&lt;/h3&gt;

&lt;p&gt;This command is used to:&lt;br&gt;
i. Add a new column to a table.&lt;br&gt;
ii. Modify an existing column.&lt;br&gt;
iii. Define a default value for a new column.&lt;br&gt;
iv. Drop a column from a table.&lt;/p&gt;
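&lt;p&gt;The first and third uses above can be sketched together: adding a new column and defining its default value in one statement. This minimal, hypothetical example runs in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module; existing rows pick up the default automatically.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT)")
cur.execute("INSERT INTO employees VALUES (1000, 'Otieno')")

# Add a new column and define its default value in one statement.
cur.execute("ALTER TABLE employees ADD COLUMN salary INTEGER DEFAULT 0")

# The row inserted before the ALTER now reports the default salary.
print(cur.execute("SELECT salary FROM employees WHERE employee_id = 1000").fetchone()[0])  # 0
```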

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;ALTER TABLE employees&lt;br&gt;
ADD constraint emp_id_pk primary key (employee_id);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE employees2&lt;br&gt;
DROP constraint emp_dept_fk;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice the &lt;code&gt;ADD&lt;/code&gt; and &lt;code&gt;DROP&lt;/code&gt; constraint commands: the former creates a UNIQUE, PRIMARY KEY, FOREIGN KEY, or CHECK constraint, while the latter deletes one. Both are applied only after a table has already been created.&lt;/p&gt;

&lt;h4&gt;
  
  
  DATA OBJECTS
&lt;/h4&gt;

&lt;p&gt;Objects are used in databases to store or reference data. An object can only be accessed by using its identifier. SQL provides various data objects; they include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table - Basic unit of storage.&lt;/li&gt;
&lt;li&gt;View - Logically represents subsets of data from one or more tables.&lt;/li&gt;
&lt;li&gt;Sequence - Generates numeric values.&lt;/li&gt;
&lt;li&gt;Index - Improves the performance of some queries.&lt;/li&gt;
&lt;li&gt;Synonym - Gives an alternative name to an object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at &lt;code&gt;VIEW&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;VIEW&lt;/code&gt;.
&lt;/h4&gt;

&lt;p&gt;This is a logical table based on a table or another view.&lt;/p&gt;

&lt;h5&gt;
  
  
  Why use &lt;code&gt;VIEW&lt;/code&gt;?
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;To restrict data access.&lt;/li&gt;
&lt;li&gt;To make complex queries easy.&lt;/li&gt;
&lt;li&gt;To provide data independence.&lt;/li&gt;
&lt;li&gt;To present different views of the same data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Creating a view:&lt;br&gt;
&lt;code&gt;CREATE VIEW emp1&lt;br&gt;
AS SELECT * FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To remove a view use the &lt;code&gt;DROP&lt;/code&gt; command.&lt;br&gt;
e.g. &lt;code&gt;DROP VIEW emp1;&lt;/code&gt;&lt;/p&gt;
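&lt;p&gt;The view lifecycle above can be sketched end to end. This minimal example, assuming a small invented employees table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module, uses a view to restrict data access by hiding the salary column, then drops it:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT, salary INTEGER)")
cur.execute("INSERT INTO employees VALUES (1000, 'Otieno', 32000)")
cur.execute("INSERT INTO employees VALUES (1001, 'Mwangi', 42000)")

# A view that restricts data access: salary is not visible through it.
cur.execute("CREATE VIEW emp1 AS SELECT employee_id, last_name FROM employees")
rows = cur.execute("SELECT * FROM emp1").fetchall()
print(rows)  # [(1000, 'Otieno'), (1001, 'Mwangi')]

cur.execute("DROP VIEW emp1")  # removes the view; the base table is untouched
```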

</description>
      <category>hackathon</category>
      <category>learning</category>
      <category>career</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
