<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Riri</title>
    <description>The latest articles on Forem by Riri (@njarambariri).</description>
    <link>https://forem.com/njarambariri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1007842%2F1e7d2083-4215-46c9-821b-7c249634d7b6.JPG</url>
      <title>Forem: Riri</title>
      <link>https://forem.com/njarambariri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/njarambariri"/>
    <language>en</language>
    <item>
      <title>Unfolding a Machine Learning Classification Problem: A Step by Step Guide.</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Sun, 30 Apr 2023 16:47:42 +0000</pubDate>
      <link>https://forem.com/njarambariri/unfolding-a-machine-learning-classification-problem-a-step-by-step-guide-cac</link>
      <guid>https://forem.com/njarambariri/unfolding-a-machine-learning-classification-problem-a-step-by-step-guide-cac</guid>
      <description>&lt;p&gt;A classification problem in machine learning involves forecasting the class or category of an input sample based on its features or attributes. For instance, you want to build a machine learning model that can be able to distinguish a cat from a dog, the goal of this model will be to accurately predict the classes of new and unseen images of cats and dogs and assign each image its respective class. As such, this problem is classified as a classification problem.&lt;/p&gt;

&lt;p&gt;Similar to other machine learning tasks, building a classification model involves a number of steps to achieve its primary objective. These steps include collecting and preprocessing the data, dividing it into training and testing sets, selecting a suitable model, training the model on the data, evaluating the performance of the model, optimizing the model, and finally deploying it in real-world applications such as web services, mobile apps, or APIs.&lt;/p&gt;

&lt;p&gt;Some common applications of machine learning classification that you've probably interacted with include spam detection, sentiment analysis, fraud detection, image classification, and medical diagnosis. This shows that classification techniques have a wide pool of practical applications across various domains.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of Classification Approaches.
&lt;/h2&gt;

&lt;p&gt;There are three main types of classification approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Binary Classification:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In binary classification, the goal is to classify instances into one of two classes or categories. These classes are assigned labels, either 0 or 1. Label 0 is usually associated with the "normal" state of the category and label 1 with the abnormal state.&lt;/p&gt;

&lt;p&gt;Examples include a classifier determining if an email is spam or not, or if a person has a disease or not.&lt;/p&gt;

&lt;p&gt;Some popular algorithms used in this type of classification problem include the SGD classifier (very powerful, especially when handling large datasets), K-Nearest Neighbors, and Support Vector Machines (SVM).&lt;/p&gt;
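&lt;p&gt;As a minimal sketch of binary classification in scikit-learn, here is an SGD classifier trained on a synthetic two-class dataset (the generated data and split sizes are purely illustrative):&lt;/p&gt;

```python
# Minimal binary-classification sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Two-class dataset: 1000 samples, 10 features, labels are 0 or 1
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SGDClassifier(random_state=42)  # linear model trained with stochastic gradient descent
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```

&lt;p&gt;Every prediction the model makes is one of the two class labels, 0 or 1.&lt;/p&gt;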

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Multi-class Classification:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In multiclass classification (also called multinomial classification), the goal is to classify instances into one of several possible classes or categories.&lt;/p&gt;

&lt;p&gt;Examples include classifying images of animals into different types of species, or classifying news articles into different topics.&lt;/p&gt;

&lt;p&gt;Random Forest and Naive Bayes classifiers usually perform exceptionally well in this scenario. However, some binary classifiers such as SVMs and linear classifiers can also be used; in that case, one of two strategies is used to train the classifier: either One-Versus-All or One-Versus-One.&lt;/p&gt;
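&lt;p&gt;To illustrate, here is a hedged sketch of both strategies using scikit-learn's wrappers around an SVM; the iris dataset stands in as a generic three-class problem:&lt;/p&gt;

```python
# Sketch of the two strategies for applying a binary classifier (an SVM here)
# to a multi-class problem; iris is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-versus-all (a.k.a. one-versus-rest): one binary classifier per class
ova = OneVsRestClassifier(SVC(gamma="auto")).fit(X, y)

# One-versus-one: one binary classifier per pair of classes
ovo = OneVsOneClassifier(SVC(gamma="auto")).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 3 and 3 for three classes
```

&lt;p&gt;With three classes the counts coincide (3 per-class models vs 3 pairwise models); with ten classes OvA would train 10 models and OvO would train 45.&lt;/p&gt;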

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Multi-label Classification:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, we have multi-label classification, where the goal is to assign one or more labels to each instance. This differs from multiclass classification, where each instance is assigned exactly one class.&lt;/p&gt;

&lt;p&gt;Examples include tagging documents with relevant keywords, or classifying images with multiple labels such as "sunset," "beach," and "ocean".&lt;/p&gt;

&lt;p&gt;A common application is a face-recognition classifier that attaches one tag per person in an image containing several people, yielding multiple labels for a single image.&lt;/p&gt;
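&lt;p&gt;As a quick illustration, here is a minimal multi-label sketch on synthetic data (the sample and label counts below are arbitrary assumptions):&lt;/p&gt;

```python
# Multi-label sketch: each sample can carry several labels at once.
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier

# y has shape (n_samples, n_classes); each row is a 0/1 indicator vector of labels
X, y = make_multilabel_classification(n_samples=200, n_labels=3, n_classes=5, random_state=42)

knn = KNeighborsClassifier()  # natively supports multi-label targets
knn.fit(X, y)
pred = knn.predict(X[:1])
print(pred.shape)  # one row, one column per possible label
```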




&lt;h2&gt;
  
  
  End-to-End Classification Task.
&lt;/h2&gt;

&lt;p&gt;In this section, we will focus on the core of machine learning and build classification models using the Titanic dataset, which can be found on this &lt;a href="https://www.kaggle.com/c/titanic"&gt;Kaggle competition&lt;/a&gt;. The goal is to predict whether or not a passenger survived the infamous 1912 ship disaster based on various features such as age, sex, class, and so on.&lt;/p&gt;

&lt;p&gt;Let's roll.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading dependencies and downloading the dataset.
&lt;/h3&gt;

&lt;p&gt;For the purposes of this article, we'll mainly be using the scikit-learn library (together with XGBoost), so let's import all the tools we'll need. Before that, make sure you've downloaded and unzipped the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All that is left is to load the dataset, let's go ahead and do that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train=pd.read_csv("titanic/train.csv")
test=pd.read_csv("titanic/test.csv")


print(f"Train Dataset shape: {train.shape}\n\nTest Dataset shape: {test.shape}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now that we've stored both the train and test .csv files in pandas dataframes, we can print the shape of each.
The train dataframe has 891 samples and 12 features; the test dataframe has 418 samples and 11 features. The train dataframe has one additional feature, which we will use as our target class.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's display the first five rows of both train and test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train.head()

test.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iyEANzUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4lqvbdq7hgzht4uy12lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iyEANzUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4lqvbdq7hgzht4uy12lr.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Data Analysis and EDA.
&lt;/h3&gt;

&lt;p&gt;To make things easier, let's join both the train and test dataframes to create a single dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset=pd.concat(objs=[train, test], axis=0)
dataset=dataset.set_index("PassengerId")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set_index&lt;/code&gt; method is called on dataset to set the index column as "PassengerId".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is important to note that test dataset should not be used in this way during training, as it is meant to be used as a separate dataset to evaluate the performance of the trained model on unseen data. In practice, we should not include the test data in the training dataset, and instead only use it for testing and evaluation purposes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's display the first and the last few rows of the concatenated dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.head()

dataset.tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fKkcvZb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81btrdiyf0ro2k3ephl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fKkcvZb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81btrdiyf0ro2k3ephl4.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's familiarize ourselves with the kind of dataset we're working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;From the output, the dataset has 1309 entries and 11 columns: 3 float columns, 3 integer columns, and 5 categorical columns of type object. Some columns have missing values. The Survived column has only 891 non-null values, meaning the rows that came from the test set are missing it; since this column was never present in the test dataset, we won't fill these missing values.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  About the features.
&lt;/h4&gt;

&lt;p&gt;The column attributes have the following meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PassengerId: a unique identifier for each passenger.&lt;/li&gt;
&lt;li&gt;Survived: that's the target, 0 means the passenger did not survive, while 1 means he/she survived.&lt;/li&gt;
&lt;li&gt;Pclass: passenger class.&lt;/li&gt;
&lt;li&gt;Name, Sex, Age: Self-explanatory.&lt;/li&gt;
&lt;li&gt;SibSp: number of siblings and spouses of the passenger aboard the Titanic.&lt;/li&gt;
&lt;li&gt;Parch: number of children &amp;amp; parents of the passenger aboard the Titanic.&lt;/li&gt;
&lt;li&gt;Ticket: ticket id.&lt;/li&gt;
&lt;li&gt;Fare: price paid (in pounds).&lt;/li&gt;
&lt;li&gt;Cabin: passenger's cabin number.&lt;/li&gt;
&lt;li&gt;Embarked: where the passenger boarded the Titanic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Next Up: statistical summary.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provides a statistical summary of the numerical columns in the dataset. The &lt;code&gt;.T&lt;/code&gt; attribute transposes the table to make it more readable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YUqfJcDt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s0jcp1btf7zc0426cgq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YUqfJcDt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s0jcp1btf7zc0426cgq7.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The mean age was 29.88 years, and the oldest passenger was 80.&lt;/li&gt;
&lt;li&gt;The mean fare was 33.30 pounds.&lt;/li&gt;
&lt;li&gt;The survival rate was only 38%, quite sad 😢 .&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Visualizations.
&lt;/h3&gt;

&lt;p&gt;Let's now visualize the distribution of some key features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Target class
sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Sex", height=5)
plt.title("Survival Distribution", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show(); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--501Ey1Ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtr5jr1qxuitafwzfkg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--501Ey1Ym--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtr5jr1qxuitafwzfkg5.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the generated categorical plot, we can see that most male passengers did not survive, while most women did. This is quite an interesting visual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's continue uncovering these relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Embarked", height=5)
plt.title("Survival Distribution by boarding Place", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KqcDnzMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71eht3bwk6h9bienakrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KqcDnzMi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71eht3bwk6h9bienakrw.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Embarked column in this dataset indicates each passenger's port of embarkation. The values represent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C&lt;/code&gt;: Cherbourg.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q&lt;/code&gt;: Queenstown.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S&lt;/code&gt;: Southampton.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most passengers boarded at Southampton, so it dominates both outcome groups in the counts; proportionally, though, passengers who boarded at Cherbourg had the highest survival rate.&lt;/p&gt;

&lt;p&gt;Let's look at a few more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="Pclass", height=5)
plt.title("Survival Distribution by Passenger Class", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jO3Ps7__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptl08uom4c6yb3xl94gq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jO3Ps7__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptl08uom4c6yb3xl94gq.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Third-class passengers account for the largest number of deaths, while first-class passengers were the most likely to survive the incident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To further understand the distribution of the survival outcomes, let's create a new column &lt;code&gt;age_dec&lt;/code&gt; that bins the Age column into decades.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset['age_dec']=dataset.Age.map(lambda Age:10*(Age//10))

sns.set(style="white", color_codes=True)
ax=sns.catplot("Survived", data=dataset, kind="count", hue="age_dec", height=5)
plt.title("Survival Distribution by Age Decade", weight="bold")
plt.xlabel("Survival", weight="bold", size=14)
plt.ylabel("Head Count", size=14)
ax.set_xticklabels(["Survived", "Didn't Survive"], rotation=0)
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kB-4m_KE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/745qpdu4b1z3mvy4lmvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kB-4m_KE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/745qpdu4b1z3mvy4lmvz.png" alt="Image description" width="565" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The plot shows that passengers in their 20s and 30s dominate both outcome groups, largely because those decades were the most numerous aboard; the counts taper off for the youngest and oldest decades.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, let's create a violin plot that shows the distribution of passengers by age decade and passenger class, with the hue representing the passenger's sex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.violinplot(x="age_dec", y="Pclass", hue="Sex",data=dataset,
               split=True, inner='quartile',
               palette=['lightpink', 'lightblue']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The plot shows the distribution of passenger ages by decade and class, with the violins representing the density of passengers at different ages. The split violins allow for easy comparison of the distributions for male and female passengers within each age and class group.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Preprocessing.
&lt;/h3&gt;

&lt;p&gt;Time to preprocess the data before feeding it into our models. But first, how many missing values do we have? While we're at it, let's also build a pandas dataframe containing the percentage of missing values in each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset.isnull().sum()

#percentage missing
p_missing= dataset.isnull().sum()*100/len(dataset)
missing=pd.DataFrame({"columns ":dataset.columns,
                     "Missing Percentage": p_missing})
missing.reset_index(drop=True, inplace=True)
missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ySSofEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ui9n0vo1r2u0hsqfpkpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ySSofEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ui9n0vo1r2u0hsqfpkpv.png" alt="Image description" width="448" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;About 20% of the values are missing in both the Age and age_dec columns, while the Cabin column has 77% of its values missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next we need to identify which columns we will be dropping, and then split the columns into numerical and categorical columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;col_drop=['Name','Ticket', 'Cabin']
print(f"Dataset shape before dropping: {dataset.shape}")
dataset=dataset.drop(columns=col_drop)
print(f"Dataset shape after dropping: {dataset.shape}")

#Numerical and Categorical columns
cat=[col for col in dataset.select_dtypes('object').columns]
num=[col for col in dataset.select_dtypes(include=['int', 'float']).columns if col not in ['Survived']]
print(cat)
print(num)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The code is dropping the &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Ticket&lt;/code&gt;, and &lt;code&gt;Cabin&lt;/code&gt; columns from the dataset DataFrame, and then creating two lists cat and num.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cat&lt;/code&gt; list contains the names of categorical columns in the &lt;code&gt;dataset&lt;/code&gt; which are of &lt;code&gt;object&lt;/code&gt; datatype while &lt;code&gt;num&lt;/code&gt; list contains the names of numerical columns which are of &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;float&lt;/code&gt; datatype except &lt;code&gt;Survived&lt;/code&gt; column which is our target feature.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Split the dataset.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train=dataset[:891]
test=dataset[891:].drop("Survived", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;train&lt;/code&gt; dataframe contains the first 891 rows of the original &lt;code&gt;dataset&lt;/code&gt;, which corresponds to the training data for the Titanic survival prediction problem. The &lt;code&gt;test&lt;/code&gt; dataframe contains the remaining rows of the original dataset, which corresponds to the test data for the problem. The &lt;code&gt;Survived&lt;/code&gt; column is dropped from the &lt;code&gt;test&lt;/code&gt; dataframe because this is the target variable that we are trying to predict, and it is not present in the test data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Transformation Pipelines.
&lt;/h3&gt;

&lt;p&gt;To prepare the data for machine learning algorithms, we need to convert the data into a format that can be processed by these algorithms. One important aspect of this is to transform categorical data into numerical data, and to fill in any missing values in the data. To automate these tasks, we can create data pipelines that perform these operations for us. These pipelines will help to transform our data into a format that can be used by machine learning algorithms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_pipeline=Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipeline=Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder())
])

full_pipeline=ColumnTransformer([
    ("cat", cat_pipeline, cat),
    ("num", num_pipeline, num)
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;num_pipeline&lt;/code&gt; first applies a &lt;code&gt;SimpleImputer&lt;/code&gt; to fill in missing values with the median value of the column, and then standardizes the data using &lt;code&gt;StandardScaler()&lt;/code&gt;. The &lt;code&gt;cat_pipeline&lt;/code&gt; applies a &lt;code&gt;SimpleImputer&lt;/code&gt; to fill in missing values with the most frequent value of the column, and then encodes the categorical variables using &lt;code&gt;OneHotEncoder()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;ColumnTransformer&lt;/code&gt; is then used to apply these pipelines to the respective columns in the dataset. The &lt;code&gt;cat&lt;/code&gt; and &lt;code&gt;num&lt;/code&gt; lists created earlier are used to specify which columns belong to each pipeline.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Transform the train dataset
X_train = full_pipeline.fit_transform(train[cat+num])
X_train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The code above applies the &lt;code&gt;full_pipeline&lt;/code&gt; ColumnTransformer to the selected columns of the train dataset ( &lt;code&gt;train[cat+num]&lt;/code&gt; ), which contain both categorical ( &lt;code&gt;cat&lt;/code&gt; ) and numerical ( &lt;code&gt;num&lt;/code&gt; ) features. The output is a NumPy array containing the transformed features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's transform the test dataset with the pipeline that was fitted on the training data. Note the use of &lt;code&gt;transform&lt;/code&gt; rather than &lt;code&gt;fit_transform&lt;/code&gt;: the test set must be transformed using the statistics learned from the training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_test=full_pipeline.transform(test[cat+num])
X_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identify the target class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_train=train["Survived"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Classification Models.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Random Forest Classifier.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf=RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train )
y_pred=rf.predict(X_test)


scores=cross_val_score(rf, X_train, y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The above code fits the random forest classifier to the training dataset and predicts the target variable for the test dataset. It then performs 20-fold cross-validation on the training dataset, taking the random forest classifier, the predictor variables &lt;code&gt;X_train&lt;/code&gt;, the target variable &lt;code&gt;y_train&lt;/code&gt;, and the &lt;code&gt;cv=20&lt;/code&gt; parameter. It returns the accuracy score for each fold, and &lt;code&gt;scores.mean()&lt;/code&gt; returns the mean accuracy across all folds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model performed well: a mean cross-validation accuracy of 80.01% is quite satisfying for a first run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XGBoost Classifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's try out a gradient boosting classifier and see how well it performs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = XGBClassifier(
    n_estimators=100,
    max_depth=8,
    n_jobs=-1,
    random_state=42
)

model.fit(X_train, y_train)
y_pred=model.predict(X_test)


scores=cross_val_score(model, X_train, y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Not bad; 79% is slightly below the random forest, but remember this is our first time running these models, without any feature engineering or hyperparameter tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's try two more models, a support vector machine and a neural network, to see if our performance improves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support Vector Machine.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;svm=SVC(gamma="auto")
svm.fit(X_train, y_train)
y_pred=svm.predict(X_test)

scores=cross_val_score(svm, X_train,y_train, cv=20)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hooray!!! Our performance just improved by 1%; we're now at 81%. It might seem subtle, but it's quite a win considering the kind of task we're handling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This model seems promising and would be highly recommended for further development.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Perceptron Classifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let's create a neural network classifier to tackle the problem. &lt;code&gt;MLPClassifier&lt;/code&gt; is a multi-layer perceptron, a neural network algorithm mostly used for classification tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nn=MLPClassifier(hidden_layer_sizes=(20,15,25))
nn.fit(X_train, y_train)
nn.predict(X_test)

scores=cross_val_score(nn, X_train, y_train, cv=15)
scores.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Not disappointing at all; the model gave us 80.15% performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, our models performed well, averaging around 80% accuracy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PS: Your scores might be different from those herein but that's not something to worry about.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;Fun Fact: Looking at the leaderboard for the Titanic competition on Kaggle, our scores would've been among the top 3%. Isn't that mind-blowing? 🤩😎&lt;/p&gt;

&lt;p&gt;Some Kagglers even achieved a 100% score, yeah, you heard me right, 100%…that's nuts. Special respect to all of them. 👏&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tips to Improve performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hyperparameter tuning using grid search and cross-validation.&lt;/li&gt;
&lt;li&gt;Feature engineering, e.g., combining attributes such as &lt;code&gt;SibSp&lt;/code&gt; and &lt;code&gt;Parch&lt;/code&gt; into a single family-size feature.&lt;/li&gt;
&lt;li&gt;Identify parts of names that correlate well with the Survived attribute.&lt;/li&gt;
&lt;/ul&gt;
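&lt;p&gt;The first tip can be sketched as follows. The parameter grid is illustrative, not a tuned recommendation, and synthetic data stands in for the preprocessed Titanic features:&lt;/p&gt;

```python
# Hedged sketch of hyperparameter tuning with GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

&lt;p&gt;&lt;code&gt;best_params_&lt;/code&gt; holds the winning combination and &lt;code&gt;best_estimator_&lt;/code&gt; a model refitted with it on the full training data.&lt;/p&gt;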




&lt;p&gt;And with that, we're done with our classification model development. Let's connect on &lt;a href="https://www.linkedin.com/in/riri-njaramba/"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/icy_riri"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: The Ultimate Guide</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Mon, 27 Feb 2023 21:11:42 +0000</pubDate>
      <link>https://forem.com/njarambariri/exploratory-data-analysis-the-ultimate-guide-40ah</link>
      <guid>https://forem.com/njarambariri/exploratory-data-analysis-the-ultimate-guide-40ah</guid>
      <description>&lt;h3&gt;
  
  
  Definition:
&lt;/h3&gt;

&lt;p&gt;Exploratory Data Analysis (EDA), also referred to as data exploration, is the process of analyzing, investigating, and summarizing datasets to gain insight into the underlying patterns and relationships within them. This is done by employing data visualization techniques and graphical statistical methods like histograms, heatmaps, violin plots, and joint plots. Technically, EDA is all about 'understanding the dataset'.&lt;/p&gt;

&lt;p&gt;'Understanding' in this context might refer to quite a number of things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracting important variables, a process normally referred to as feature engineering.&lt;/li&gt;
&lt;li&gt;Identifying and dealing with outliers and missing values.&lt;/li&gt;
&lt;li&gt;Understanding the relationships between variables, whether linear or non-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By employing EDA techniques, you can turn a messy dataset into a clean one. Overall, EDA is a crucial part of any data analysis project, and it often guides the data analyst through further analysis and data modelling.&lt;/p&gt;

&lt;p&gt;In this article we will dive deeper into EDA and discuss several topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data cleaning and preparation &lt;/li&gt;
&lt;li&gt;Univariate Analysis&lt;/li&gt;
&lt;li&gt;Bivariate Analysis&lt;/li&gt;
&lt;li&gt;Multivariate Analysis&lt;/li&gt;
&lt;li&gt;Visualization Techniques&lt;/li&gt;
&lt;li&gt;Descriptive Statistics.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Cleaning and Preparation
&lt;/h3&gt;

&lt;p&gt;The first step in data analysis is to clean and prepare the data. This might involve identifying and correcting missing values, removing outliers, and transforming variables as necessary. For a successful data analysis process, the data needs to be as accurate and reliable as possible. &lt;/p&gt;

&lt;p&gt;Let's have a look at how you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries 
import pandas as pd

# Read in data
df = pd.read_csv('data.csv')

# Check for missing values
print(df.isnull().sum())

# Remove outliers
df = df[df['column_name'] &amp;lt; 100]

# Transform variables
df['new_column'] = df['column_name'] * 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Univariate Analysis
&lt;/h3&gt;

&lt;p&gt;Univariate analysis involves analyzing each variable in the dataset individually. Say you have a variable named &lt;code&gt;age&lt;/code&gt;; with univariate analysis, you can calculate its summary statistics, e.g., mean, median, mode, standard deviation, and variance. This step also involves visualizing the distribution of each variable using histograms, box plots, density plots, etc.&lt;/p&gt;

&lt;p&gt;We can use the Seaborn library to perform a univariate analysis on a dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns

# Load data
tips = sns.load_dataset('tips')

# Calculate summary statistics
print(tips.describe())

# Visualize distribution with histogram
sns.histplot(tips['total_bill'], kde=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bivariate Analysis
&lt;/h3&gt;

&lt;p&gt;On the other side of univariate analysis we have bivariate analysis, which, as the name suggests, involves analyzing the relationship between two variables in a dataset. Again, let's look at this from a practical point of view: you have two variables, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;weight&lt;/code&gt;, and you need to understand the relationship between them. Bivariate analysis lets you use graphical methods such as scatter plots, bar charts, and line plots to visualize this relationship. It also includes calculating correlation coefficients, cross-tabulations, and contingency tables. &lt;/p&gt;

&lt;p&gt;Let's use the Matplotlib library to illustrate bivariate analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
iris = sns.load_dataset('iris')

# Calculate correlation coefficients (numeric_only skips the non-numeric species column)
print(iris.corr(numeric_only=True))

# Visualize relationship with scatter plot
plt.scatter(iris['sepal_length'], iris['sepal_width'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multivariate Analysis
&lt;/h3&gt;

&lt;p&gt;This is the statistical procedure for analyzing the relationships between more than two variables at once. Alternatively, multivariate analysis can be used to analyze the relationship between dependent and independent variables. Its major applications include clustering, feature selection, dimensionality reduction, and hypothesis testing.&lt;/p&gt;
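&lt;p&gt;As a small illustration of multivariate analysis, we can compute a pairwise correlation matrix across several variables at once. The sketch below uses NumPy with made-up synthetic variables (&lt;code&gt;height&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;); the names and numbers are illustrative assumptions, not from a real dataset.&lt;/p&gt;

```python
import numpy as np

# Synthetic, made-up variables: weight is constructed to correlate with height,
# while age is generated independently of both
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
weight = 0.9 * height + rng.normal(0, 5, 100)
age = rng.uniform(20, 60, 100)

# One call gives every pairwise correlation: a 3x3 matrix with 1s on the diagonal
corr = np.corrcoef([height, weight, age])
print(corr.round(2))
```

&lt;p&gt;The height-weight entry comes out strongly positive while the age entries hover near zero, which is exactly the kind of at-a-glance summary of many relationships that multivariate analysis is after.&lt;/p&gt;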

&lt;h3&gt;
  
  
  Visualization Techniques
&lt;/h3&gt;

&lt;p&gt;Another very important component of EDA is data visualization, which gives the data analyst a chance to explore and understand the data visually. This step is crucial for any organization, as visualizations are easily understood by the non-technical people in the organization. Non-technical people sometimes have a hard time understanding the 'under-the-hood' variable relationships, but with data visualization they can easily grasp the relationships between different variables in the dataset. There are several techniques and tools used in this process. Some tools commonly used by data analysts to visualize data include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MS Power Bi&lt;/li&gt;
&lt;li&gt;MS Excel&lt;/li&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Google Data Studio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As for the techniques, there are numerous ways to visualize your data. We will discuss some of them using the pandas library:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Histograms&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Histograms are used to visualize the distribution of a continuous variable like &lt;code&gt;height&lt;/code&gt;. The &lt;code&gt;hist()&lt;/code&gt; method is used to generate a histogram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a histogram
df['column_name'].hist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Boxplots are used to visualize the distribution of continuous variables to detect outliers. The &lt;code&gt;boxplot()&lt;/code&gt; method is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a boxplot
df.boxplot(column='column_name')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Scatterplots&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This technique is used to visualize the relationship between two continuous variables, e.g, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;weight&lt;/code&gt;. The &lt;code&gt;plot.scatter()&lt;/code&gt; method is used to generate a scatterplot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a scatterplot
df.plot.scatter(x='column1', y='column2')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Bar Charts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bar charts are used in visualizing the distribution of categorical variables in a dataset. An example of a categorical variable is &lt;code&gt;gender&lt;/code&gt;, &lt;code&gt;race&lt;/code&gt;, &lt;code&gt;type of job&lt;/code&gt; etc. The &lt;code&gt;plot.bar()&lt;/code&gt; method is used to generate a bar chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate a bar chart
df['column_name'].value_counts().plot.bar()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hands-on EDA
&lt;/h3&gt;

&lt;p&gt;We've talked a lot about the theoretical side of EDA; now let's get to the fun part, where we apply these techniques to a real-world dataset. Working with real-world data can at times be quite hard and frustrating, as it involves paying careful attention to data cleaning, exploration, handling outliers, dealing with missing data, and finally understanding the data. It's also good to keep in mind that the ultimate goal for any data scientist is an accurate, meaningful analysis that is relevant to the problem at hand. To get your hands on real-world data, there are various open-source and free data websites that provide a wide pool of datasets, including Data.gov, Kaggle, World Bank Open Data, OpenML, Datahub, etc.&lt;/p&gt;

&lt;p&gt;Enough said, let's now dive right into the nitty-gritty. For this particular article, I'll be using an East African dataset that can be found at: &lt;a href="https://www.kaggle.com/datasets/enockmokua/financial-dataset" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/enockmokua/financial-dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Importing Libraries&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import the necessary libraries 
%matplotlib inline #for displaying plots directly below the code cell
import matplotlib.pyplot as plt #for creating plots
import pandas as pd #for data manipulation and analysis
import numpy as np # for working with arrays and matrices
import seaborn as sns #for complex visualizations that can't be achieved by plt
sns.set(); #sets the default parameters for seaborn

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Loading and Exploring the Data&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#load the data
df=pd.read_csv("Datasets/finance.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#check the dataset size and shape 
df.shape, df.size 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj95qv8f4oj5onhbfnzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj95qv8f4oj5onhbfnzr.png" alt=" " width="560" height="83"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the first 5 rows
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e1w3kyntybdzgfvhgv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e1w3kyntybdzgfvhgv6.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the last 5 rows 
df.tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b3fbx0yoha94ijzwl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b3fbx0yoha94ijzwl4.png" alt=" " width="800" height="265"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#View the column names
df.columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh9elu2benb4e9938au5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh9elu2benb4e9938au5.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#view the column data types
df.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknurjq65h4o0ru0plq6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknurjq65h4o0ru0plq6g.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#view the summary statistics of the numerical columns
df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7pkq4d2cie2dzbxwkys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7pkq4d2cie2dzbxwkys.png" alt=" " width="552" height="308"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the summary of the DataFrame
df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Display the total number of missing values in each column
df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmkbavt5dxvoyhea1ig0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmkbavt5dxvoyhea1ig0.png" alt=" " width="553" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create a mini dataframe to see the % of the missing values
missing=(df.isnull().sum()*100/len(df))
missing_df=pd.DataFrame({'Percentage missing': missing})
missing_df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn4nge813z7g9hn0odvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn4nge813z7g9hn0odvg.png" alt=" " width="550" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Cleaning&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Drop irrelevant columns, or columns with too many missing values
df.drop(['Unnamed: 0','year'], axis=1, inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgtjsvrzay9z4bclvlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgtjsvrzay9z4bclvlc.png" alt=" " width="553" height="35"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Fill the missing values with the mean or median
df['Respondent Age'].fillna(df['Respondent Age'].mean(), inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyppawglta6ao2z10u738.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyppawglta6ao2z10u738.png" alt=" " width="551" height="66"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Check for any duplicates and drop them if exists
df.drop_duplicates(inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuuycxgrzx8qj5wqk093.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuuycxgrzx8qj5wqk093.png" alt=" " width="549" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Creating visualizations&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will create histogram visualizations using the Matplotlib and Seaborn libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Using matplotlib
x=(df["Respondent Age"])
plt.hist(x,100,density=True, facecolor="green")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexssidhswcf3tw49fny9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexssidhswcf3tw49fny9.png" alt=" " width="563" height="419"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Using Seaborn
sns.histplot(df['Respondent Age']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2s3gw57klbb4heuzp62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2s3gw57klbb4heuzp62.png" alt=" " width="583" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the two visualizations, we can see the differences between the two libraries: Seaborn tends to produce clearer visualizations than matplotlib.pyplot.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Scatter plots for two numerical columns
sns.scatterplot(data=df, x='Respondent Age',y='household_size',hue='Has a Bank account');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwn20texu5wks95xkr5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwn20texu5wks95xkr5b.png" alt=" " width="558" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Boxplot of a numerical column by a categorical column
sns.boxplot(x='Respondent Age',y='country', data=df);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjep0qy2m58irr8oqici9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjep0qy2m58irr8oqici9.png" alt=" " width="610" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boxplots are mostly used to check for outliers in a dataset
&lt;/li&gt;
&lt;/ul&gt;
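&lt;p&gt;The outlier check a boxplot performs visually can also be done numerically with the 1.5 * IQR rule: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged. A minimal sketch with NumPy and a made-up &lt;code&gt;ages&lt;/code&gt; array:&lt;/p&gt;

```python
import numpy as np

# Made-up ages; 95 is an obvious outlier
ages = np.array([23, 25, 27, 29, 31, 33, 35, 37, 39, 95])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the whiskers, which is what a boxplot draws as points
mask = np.logical_or(np.less(ages, lower), np.greater(ages, upper))
outliers = ages[mask]
print(outliers)  # only 95 falls outside the whiskers
```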

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A heatmap of the correlation between columns
sns.heatmap(df.corr());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06dhod5232xy28gk1t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw06dhod5232xy28gk1t4.png" alt=" " width="521" height="422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A bar chart of the headcount per country
df.country.value_counts().plot(kind='bar')
plt.xlabel("Country")
plt.ylabel("Count");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpep6rw3032zy7v2zood5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpep6rw3032zy7v2zood5.png" alt=" " width="583" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#A pie chart for the number of respondents per country
counts=df.country.value_counts()
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3c2yi3qqfd1pum6o4x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3c2yi3qqfd1pum6o4x7.png" alt=" " width="453" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chi-Squared Test
&lt;/h3&gt;

&lt;p&gt;So far we've mostly described and visualized numerical variables, and you might be wondering, "What about the categorical variables?" That's where the chi-squared test comes in. The chi-squared test is a statistical test used to determine whether there is a significant association between two categorical variables. It compares observed data with expected data to determine whether the differences between them are large enough to reject the null hypothesis that there is no association between the variables.&lt;/p&gt;

&lt;p&gt;In Python, you can use the &lt;code&gt;scipy.stats&lt;/code&gt; module, which provides a function called &lt;code&gt;chi2_contingency()&lt;/code&gt; that calculates the chi-squared statistic, degrees of freedom, p-value, and expected frequencies for a contingency table. Let's try it out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from scipy.stats import chi2_contingency

# create a contingency table
table = np.array([[10, 20, 30], [15, 25, 35]])

# perform the chi-squared test
chi2, p, dof, expected = chi2_contingency(table)

# print the results
print('Chi-squared statistic:', chi2)
print('Degrees of freedom:', dof)
print('P-value:', p)
print('Expected frequencies:\n', expected)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The contingency table we created, with two rows and three columns, represents the frequencies of two categorical variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expected output will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chi-squared statistic: 0.27692307692307694
Degrees of freedom: 2
P-value: 0.870696738961232
Expected frequencies:
 [[11.11111111 20.         28.88888889]
 [13.88888889 25.         36.11111111]]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis and conclude that there is no significant association between the variables.&lt;/p&gt;
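&lt;p&gt;To see where those expected frequencies come from, we can recompute them by hand: each expected cell is (row total * column total) / grand total. A quick NumPy check against the output above:&lt;/p&gt;

```python
import numpy as np

# Same contingency table as before
table = np.array([[10, 20, 30], [15, 25, 35]])

row_totals = table.sum(axis=1)    # [60, 75]
col_totals = table.sum(axis=0)    # [25, 45, 65]
grand_total = table.sum()         # 135

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)  # matches the expected frequencies reported by chi2_contingency
```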

&lt;p&gt;Check out this &lt;a href="https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223" rel="noopener noreferrer"&gt;article&lt;/a&gt; for a deeper understanding of the chi-squared test.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>SQL 101: Introduction to SQL</title>
      <dc:creator>Riri</dc:creator>
      <pubDate>Sat, 18 Feb 2023 20:09:40 +0000</pubDate>
      <link>https://forem.com/njarambariri/sql-101-introduction-to-sql-i89</link>
      <guid>https://forem.com/njarambariri/sql-101-introduction-to-sql-i89</guid>
      <description>&lt;h2&gt;
  
  
  What is SQL?
&lt;/h2&gt;

&lt;p&gt;You've probably heard the acronym SQL many times, maybe from your friends, colleagues, or teachers, but what really is SQL? SQL stands for Structured Query Language, and it's the lingua franca used to create, manage, manipulate, and retrieve data from databases. It was developed in the 1970s by IBM computer scientists.&lt;/p&gt;

&lt;p&gt;Now that you know what SQL is, what does it really do? SQL can perform various tasks in a database, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute queries against databases.&lt;/li&gt;
&lt;li&gt;Create new tables in databases.&lt;/li&gt;
&lt;li&gt;Insert records into databases.&lt;/li&gt;
&lt;li&gt;Update records in a database.&lt;/li&gt;
&lt;li&gt;Create and maintain database users.&lt;/li&gt;
&lt;li&gt;Delete records.&lt;/li&gt;
&lt;li&gt;Retrieve data from databases, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL is a very effective language that is easy to learn and use. It is also functionally complete, thanks to its ability to let users define, retrieve, and manipulate data in tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Statements
&lt;/h2&gt;

&lt;p&gt;Statements in SQL are sets of instructions consisting of identifiers, parameters, variables, data types, and SQL reserved keywords. An SQL statement must compile successfully, e.g. &lt;code&gt;DROP TABLE Users&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. Data Manipulation Language (DML)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;SELECT&lt;/code&gt;- Retrieves certain records from one or more tables.&lt;br&gt;
&lt;code&gt;INSERT&lt;/code&gt;- Creates a new record.&lt;br&gt;
&lt;code&gt;UPDATE&lt;/code&gt;- Modifies an existing record.&lt;br&gt;
&lt;code&gt;DELETE&lt;/code&gt;- Deletes particular record(s).&lt;br&gt;
&lt;code&gt;MERGE&lt;/code&gt; - Combines the separate &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt; statements into a single SQL query.&lt;/p&gt;
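&lt;p&gt;To make the DML statements concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and an in-memory database. The &lt;code&gt;Users&lt;/code&gt; table and its rows are made up for illustration (note that SQLite itself does not implement &lt;code&gt;MERGE&lt;/code&gt;).&lt;/p&gt;

```python
import sqlite3

# In-memory database so nothing touches disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT)")

cur.execute("INSERT INTO Users (name) VALUES ('Alice')")           # INSERT: create records
cur.execute("INSERT INTO Users (name) VALUES ('Bob')")
cur.execute("UPDATE Users SET name = 'Bobby' WHERE name = 'Bob'")  # UPDATE: modify a record
cur.execute("DELETE FROM Users WHERE name = 'Alice'")              # DELETE: remove records

cur.execute("SELECT name FROM Users")                              # SELECT: retrieve records
rows = cur.fetchall()
print(rows)  # [('Bobby',)]
conn.close()
```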

&lt;p&gt;&lt;em&gt;2. Data Definition Language (DDL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;CREATE&lt;/code&gt;- Creates a new table, a view of a table, or other object in the database.&lt;br&gt;
&lt;code&gt;ALTER&lt;/code&gt; - Modifies an existing database object, e.g a table.&lt;br&gt;
&lt;code&gt;DROP&lt;/code&gt;  - Deletes an entire table, a view of a table or other objects in the database.&lt;br&gt;
&lt;code&gt;RENAME&lt;/code&gt;- Used together with &lt;code&gt;ALTER&lt;/code&gt; to modify objects in a database.&lt;br&gt;
&lt;code&gt;TRUNCATE&lt;/code&gt;- Deletes all data from the table.&lt;br&gt;
&lt;code&gt;COMMENT&lt;/code&gt; - Starts with /* and ends with */, this part of the code is not executed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Data Control Language (DCL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;GRANT&lt;/code&gt; - Gives a privilege to user(s).&lt;br&gt;
&lt;code&gt;REVOKE&lt;/code&gt;- Takes back privileges that were previously granted to the user.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;4. Transaction Control Language (TCL)&lt;/em&gt;&lt;br&gt;
Includes:&lt;br&gt;
&lt;code&gt;COMMIT&lt;/code&gt;  - Stores the changes made by a transaction in the database.&lt;br&gt;
&lt;code&gt;ROLLBACK&lt;/code&gt;- Reverts all the changes made since the last &lt;code&gt;COMMIT&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;SAVEPOINT&lt;/code&gt;- Marks a point within a transaction to which a later &lt;code&gt;ROLLBACK&lt;/code&gt; can return.&lt;/p&gt;
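&lt;p&gt;A small sketch of &lt;code&gt;COMMIT&lt;/code&gt; and &lt;code&gt;ROLLBACK&lt;/code&gt; using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the &lt;code&gt;accounts&lt;/code&gt; table and its values are made up):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
cur.execute("INSERT INTO accounts VALUES ('Ann', 100)")
conn.commit()      # COMMIT: the inserted row is now permanent

cur.execute("UPDATE accounts SET balance = 0 WHERE name = 'Ann'")
conn.rollback()    # ROLLBACK: reverts everything since the last COMMIT

cur.execute("SELECT balance FROM accounts WHERE name = 'Ann'")
balance = cur.fetchone()[0]
print(balance)  # 100, because the update was rolled back
conn.close()
```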

&lt;h4&gt;
  
  
  Writing SQL statements.
&lt;/h4&gt;

&lt;p&gt;While writing SQL statements, it's good to note:&lt;br&gt;
i. SQL statements are not case sensitive.&lt;br&gt;
ii. SQL can be entered on many lines.&lt;br&gt;
iii. Keywords cannot be split across lines.&lt;br&gt;
iv. Clauses are usually placed on separate lines for readability and ease of editing.&lt;br&gt;
v. Indents make it more readable.&lt;br&gt;
vi. Keywords may be entered in caps and all others in lowercase.&lt;/p&gt;
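&lt;p&gt;Putting those conventions together, here is what a well-formatted statement looks like in practice, run through Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the &lt;code&gt;students&lt;/code&gt; table and its rows are made up): keywords in caps, one clause per line, indented for readability, with the statement spread over several lines.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
cur.executemany("INSERT INTO students VALUES (?, ?)",
                [("Amina", 82), ("Brian", 67), ("Cheru", 91)])

# Keywords capitalized, each clause on its own separate line
cur.execute("""
    SELECT name, grade
    FROM students
    WHERE grade >= 80
    ORDER BY grade DESC
""")
rows = cur.fetchall()
print(rows)  # [('Cheru', 91), ('Amina', 82)]
conn.close()
```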

&lt;h3&gt;
  
  
  Why Learn SQL?
&lt;/h3&gt;

&lt;p&gt;If you're a professional in the software development domain, or a student who wants to become a software engineer, SQL is an essential query language to equip yourself with. In most application software, developers use SQL to store and manipulate data. Most Relational Database Management Systems (RDBMS), such as MySQL, Oracle, Postgres, Sybase, and MS Access, use SQL as their standard database language.  &lt;/p&gt;

&lt;h3&gt;
  
  
  How SQL works.
&lt;/h3&gt;

&lt;p&gt;When you execute SQL commands for a given task, the SQL engine interprets the code while the system on which you're running it determines the best way to carry out your request. &lt;/p&gt;

&lt;p&gt;This sounds like a complicated process, but it's not. Now let's see the steps involved in query processing: the process of translating high-level SQL queries into low-level expressions used at the physical level of the file system, along with query optimization and, of course, the actual execution of the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Parser
&lt;/h3&gt;

&lt;p&gt;During this stage, after converting the query into relational algebra, the database performs two checks: a syntax check and a semantic check.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Syntax Check:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;This involves checking whether the rules for writing an SQL command (its syntax) have been followed.&lt;br&gt;
e.g. &lt;code&gt;SELECT * FORM students&lt;/code&gt;&lt;br&gt;
The above command can't be executed and will result in an error, due to the misspelling of the keyword &lt;code&gt;FROM&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Semantic Check:&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;During this check, the parser determines whether a statement is meaningful. For example, if you request a table named &lt;code&gt;Students&lt;/code&gt; that you haven't created yet, the table doesn't exist; the semantic check catches this.&lt;/p&gt;
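&lt;p&gt;Both checks can be observed from any SQL client. The following sketch, a hypothetical illustration using SQLite through Python's &lt;code&gt;sqlite3&lt;/code&gt; module, shows the misspelled-keyword query failing the syntax check and the missing-table query failing the semantic check:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Syntax check failure: FROM is misspelled, so the statement cannot be parsed.
try:
    cur.execute("SELECT * FORM students")
except sqlite3.OperationalError as e:
    syntax_error = str(e)

# Semantic check failure: the statement parses, but the table does not exist.
try:
    cur.execute("SELECT * FROM students")
except sqlite3.OperationalError as e:
    semantic_error = str(e)

print(syntax_error)   # a parse-level complaint mentioning "syntax error"
print(semantic_error) # no such table: students
```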

&lt;h3&gt;
  
  
  Step 2:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Optimizer:
&lt;/h3&gt;

&lt;p&gt;During the optimization stage, the database must perform a hard parse (re-loading the SQL command into the shared pool) for at least one unique DML (Data Manipulation Language) statement, and it performs optimization during that parse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Execution Engine:
&lt;/h3&gt;

&lt;p&gt;The query is finally executed and the output is displayed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hands-on SQL Practicals.
&lt;/h1&gt;

&lt;p&gt;Now let's get down to the nitty-gritty aspect of SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Creating Tables&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;In SQL you create tables using the &lt;code&gt;CREATE TABLE&lt;/code&gt; statement. When creating a table, you must provide three basic essentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table name.&lt;/li&gt;
&lt;li&gt;Column names.&lt;/li&gt;
&lt;li&gt;Data types for each column.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Guidelines for creating tables&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Table and column naming rules&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must start with a letter, followed by a sequence of letters, numbers, _, #, or $.&lt;/li&gt;
&lt;li&gt;Must be 1 to 30 characters long.&lt;/li&gt;
&lt;li&gt;Must not be an SQL reserved word. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;data types&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;VARCHAR2(n):&lt;/em&gt; Variable-length character string of up to n characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;CHAR(n):&lt;/em&gt; Fixed-length character string of n characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER(n):&lt;/em&gt; Integer number of up to n digits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER(precision, scale):&lt;/em&gt; Fixed-point decimal number. “precision” is the total number of digits; “scale” is the number of digits to the right of the decimal point. The decimal point is not counted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NUMBER:&lt;/em&gt; Floating-point decimal number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;DATE:&lt;/em&gt; DD-MON-YY (or YYYY) HH:MM:SS A.M. (or P.M.) form date-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;LONG:&lt;/em&gt; Variable-length character string up to 2 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;NCHAR:&lt;/em&gt; Like LONG, but for national (international) character sets (2 bytes per character).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;CLOB:&lt;/em&gt; Single-byte ASCII character data up to 4 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;BLOB:&lt;/em&gt; Binary data (e.g., program, image, or sound) of up to 4 GB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;BFILE:&lt;/em&gt; Reference to a binary file that is external to the database (OS file).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;RAW(size) or LONG RAW:&lt;/em&gt; Raw binary data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;ROWID:&lt;/em&gt;  Unique row address in hexadecimal format.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees&lt;br&gt;
(&lt;br&gt;
employee_id   number(7) not null,&lt;br&gt;
first_name    varchar2(20),&lt;br&gt;
last_name     varchar2(20),&lt;br&gt;
cellphone     varchar2(12),&lt;br&gt;
email         varchar2(20),&lt;br&gt;
hire_date     date,&lt;br&gt;
job_id        varchar2(5),&lt;br&gt;
salary        number(12,2),&lt;br&gt;
manager_id    number(6),&lt;br&gt;
department_id number(4)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Adding data into tables&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;You use the keyword &lt;code&gt;INSERT&lt;/code&gt; to add data into any table in SQL.&lt;br&gt;
Example:&lt;br&gt;
&lt;code&gt;INSERT INTO employees&lt;br&gt;
VALUES(1000,'Simon','Otieno','0722456789','otieno@yahoo.com','01-jan-90','5500',32000,5000,10);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INSERT INTO employees&lt;br&gt;
VALUES(1001,'Alice','Mwangi','0720766659','alice@yahoo.com','02-feb-80','5600',42000,5000,10);&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Retrieving data from database objects using &lt;code&gt;SELECT&lt;/code&gt; Statement&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, &lt;code&gt;SELECT&lt;/code&gt; is used to retrieve data from the database. The &lt;code&gt;SELECT&lt;/code&gt; statement gives you the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Projection - choose columns/fields from a table through a query.&lt;/li&gt;
&lt;li&gt;Selection - choose rows in a table.&lt;/li&gt;
&lt;li&gt;Joining - bring together data stored in different tables by specifying the link between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT * FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id,first_name,last_name, email , job_id, salary FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Arithmetic Expressions in SQL
&lt;/h3&gt;

&lt;p&gt;Arithmetic expressions in SQL perform arithmetic operations on the numeric operands/values stored in database tables.&lt;/p&gt;

&lt;p&gt;Operators used include:&lt;br&gt;
&lt;code&gt;+&lt;/code&gt; - for Addition.&lt;br&gt;
&lt;code&gt;-&lt;/code&gt; - for Subtraction operations.&lt;br&gt;
&lt;code&gt;*&lt;/code&gt; - for multiplication.&lt;br&gt;
&lt;code&gt;/&lt;/code&gt; - for division.&lt;br&gt;
&lt;code&gt;%&lt;/code&gt; - for modulus.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;SELECT employee_id, first_name,last_name, salary, salary + 3000 FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* salary + 700 FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* (salary + 700) FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id, first_name,last_name, salary, 12* (salary - 2000) FROM employees;&lt;/code&gt;&lt;/p&gt;
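&lt;p&gt;Note that parentheses change the result of the second and third queries above, because multiplication binds tighter than addition. The following sketch, assuming a throwaway one-row table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module, makes the difference concrete:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (salary INTEGER)")
cur.execute("INSERT INTO employees VALUES (1000)")

# Without parentheses: (12 * 1000) + 700
print(cur.execute("SELECT 12*salary + 700 FROM employees").fetchone()[0])    # 12700
# With parentheses: 12 * (1000 + 700)
print(cur.execute("SELECT 12*(salary + 700) FROM employees").fetchone()[0])  # 20400
```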

&lt;h4&gt;
  
  
  Restricting and sorting data using the &lt;code&gt;SELECT&lt;/code&gt; statement.
&lt;/h4&gt;

&lt;h6&gt;
  
  
  Use of the &lt;code&gt;WHERE&lt;/code&gt; clause
&lt;/h6&gt;

&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause is used to filter records. It only extracts those records that fulfill a specified condition.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison conditions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;=&lt;/code&gt; - Equal to.&lt;br&gt;
 &lt;code&gt;&amp;gt;&lt;/code&gt; - Greater than.&lt;br&gt;
 &lt;code&gt;&amp;gt;=&lt;/code&gt; - Greater than or equal to.&lt;br&gt;
 &lt;code&gt;&amp;lt;&lt;/code&gt; - Less than.&lt;br&gt;
 &lt;code&gt;&amp;lt;=&lt;/code&gt; - Less than or equal to.&lt;br&gt;
 &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt; - Not equal to.&lt;br&gt;
 &lt;code&gt;IS NULL&lt;/code&gt; - Is a null value.&lt;br&gt;
 &lt;code&gt;IN (set)&lt;/code&gt; - Matches any value in the list.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE salary &amp;gt; 10000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id "Employee ID",first_name "First Name",last_name "Last Name", email "Email", job_id "Job ID",salary "Monthly Pay" FROM employees WHERE salary IN(10000,20000,30000);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Logical Conditions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AND&lt;/code&gt; - Returns TRUE if both component conditions are true.&lt;br&gt;
 &lt;code&gt;OR&lt;/code&gt; - Returns TRUE if either component condition is true.&lt;br&gt;
 &lt;code&gt;NOT&lt;/code&gt; - Returns TRUE if the following condition is false.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;code&gt;SELECT employee_id,last_name FROM employees WHERE salary &amp;gt;=10000 AND manager_id=5000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT employee_id,last_name FROM employees WHERE department_id NOT IN(90,60,30);&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Using the &lt;code&gt;ORDER BY&lt;/code&gt; Clause.
&lt;/h4&gt;

&lt;p&gt;This clause sorts the records in a particular order. By default, it sorts records in ascending order.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;SELECT last_name,job_id, department_id FROM employees ORDER BY hire_date DESC;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT last_name,job_id, department_id FROM employees ORDER BY hire_date ASC;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;NB: ASC = ascending, DESC = descending.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE&lt;/code&gt; COMMAND&lt;br&gt;
As mentioned earlier in the article, this command is used to update data in a given table.&lt;/p&gt;

&lt;p&gt;examples:&lt;br&gt;
&lt;code&gt;UPDATE employees&lt;br&gt;
SET salary= 50000&lt;br&gt;
WHERE employee_id=1001;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE employees&lt;br&gt;
SET last_name='Opiyo'&lt;br&gt;
WHERE employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DELETE&lt;/code&gt; COMMAND&lt;br&gt;
It is used to delete or remove records from tables in a database.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;DELETE from employees&lt;br&gt;
where employee_id=1000;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ROLLBACK&lt;/code&gt; COMMAND.&lt;br&gt;
To undo transactions in the database, the &lt;code&gt;ROLLBACK&lt;/code&gt; command is used. It is used together with data manipulation language (DML) commands.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;ROLLBACK;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COMMIT&lt;/code&gt; COMMAND.&lt;br&gt;
It ensures that records are permanently saved. It is used with data manipulation language (DML) commands.&lt;/p&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;COMMIT;&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  SQL Constraints
&lt;/h4&gt;

&lt;p&gt;Constraints are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the table. If there is any violation between the constraint and the data action, the action is aborted.&lt;br&gt;
They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NOT NULL&lt;/code&gt;: Ensures that the column contains no null or empty values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255) NOT NULL,&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;UNIQUE&lt;/code&gt;: Requires that every value in the column be unique.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees2&lt;br&gt;
(&lt;br&gt;
employee_id number(6),&lt;br&gt;
last_name varchar2(20) not null,&lt;br&gt;
email varchar2(20),&lt;br&gt;
salary number(10,2),&lt;br&gt;
hire_date date not null,&lt;br&gt;
constraint emp_email_uk unique(email)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PRIMARY KEY&lt;/code&gt;: Creates a primary key for the table. Only one primary key can be created for each table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255),&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int,&lt;br&gt;
    PRIMARY KEY (Sale_Id)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FOREIGN KEY&lt;/code&gt;: Designates a column or combination of columns as a foreign key and establishes a relationship with a primary key or a unique key in the same table or a different table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;example:&lt;br&gt;
&lt;code&gt;CREATE TABLE employees3&lt;br&gt;
(&lt;br&gt;
employee_id number(6) constraint emp_id_pk primary key,&lt;br&gt;
last_name varchar2(20) not null,&lt;br&gt;
first_name varchar2(20),&lt;br&gt;
salary number(10,2),&lt;br&gt;
hire_date date not null,&lt;br&gt;
department_id number(4) constraint emp_dept_fk references department (department_id)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For the above SQL command to work, the table department must already exist with a primary key on department_id.&lt;/p&gt;
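&lt;p&gt;That dependency can be demonstrated. The sketch below is a simplified SQLite version of the same schema, run through Python's &lt;code&gt;sqlite3&lt;/code&gt; module (note that SQLite requires foreign-key enforcement to be switched on explicitly): a child insert succeeds when the parent row exists and fails when it doesn't.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

# The parent table must exist first, with a primary key on department_id.
cur.execute("CREATE TABLE department (department_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE employees3 (
    employee_id   INTEGER PRIMARY KEY,
    last_name     TEXT NOT NULL,
    department_id INTEGER REFERENCES department (department_id)
)""")

cur.execute("INSERT INTO department VALUES (10, 'Engineering')")
cur.execute("INSERT INTO employees3 VALUES (1, 'Otieno', 10)")      # parent row exists: OK

try:
    cur.execute("INSERT INTO employees3 VALUES (2, 'Mwangi', 99)")  # no department 99
except sqlite3.IntegrityError as e:
    fk_error = str(e)

print(fk_error)  # FOREIGN KEY constraint failed
```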

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CHECK&lt;/code&gt;:The &lt;code&gt;CHECK&lt;/code&gt; constraint is used to ensure that all the records in a certain column follow a specific rule.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE Sales (&lt;br&gt;
    Sale_Id int NOT NULL UNIQUE,&lt;br&gt;
    Sale_Amount int NOT NULL,&lt;br&gt;
    Vendor_Name varchar(255) CHECK (Vendor_Name &amp;lt;&amp;gt; 'ABC'),&lt;br&gt;
    Sale_Date date,&lt;br&gt;
    Profit int&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When you have constraints in place on columns, an error is returned if you try to violate a constraint rule.&lt;/p&gt;
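&lt;p&gt;This behaviour is easy to observe. The following sketch uses a simplified version of the Sales table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite also accepts &lt;code&gt;!=&lt;/code&gt; for not-equal), and collects the error raised by each kind of violation:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Sales (
    Sale_Id     INTEGER NOT NULL UNIQUE,
    Sale_Amount INTEGER NOT NULL,
    Vendor_Name TEXT CHECK (Vendor_Name != 'ABC')
)""")

cur.execute("INSERT INTO Sales VALUES (1, 500, 'XYZ')")  # satisfies every constraint

errors = []
for row in ("(NULL, 500, 'XYZ')",   # violates NOT NULL on Sale_Id
            "(1, 900, 'XYZ')",      # violates UNIQUE on Sale_Id
            "(2, 700, 'ABC')"):     # violates the CHECK rule
    try:
        cur.execute("INSERT INTO Sales VALUES " + row)
    except sqlite3.IntegrityError as e:
        errors.append(str(e))

print(errors)  # one NOT NULL, one UNIQUE, and one CHECK failure message
```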

&lt;h3&gt;
  
  
  &lt;code&gt;ALTER TABLE&lt;/code&gt; Statements.
&lt;/h3&gt;

&lt;p&gt;This command is used to:&lt;br&gt;
i. Add a new column to a table.&lt;br&gt;
ii. Modify an existing column.&lt;br&gt;
iii. Define a default value for a new column.&lt;br&gt;
iv. Drop a column from a table.&lt;/p&gt;
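&lt;p&gt;The first and third uses above can be sketched together: adding a new column and defining its default value in one statement. This minimal, hypothetical example runs in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module; existing rows pick up the default automatically.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT)")
cur.execute("INSERT INTO employees VALUES (1000, 'Otieno')")

# Add a new column and define its default value in one statement.
cur.execute("ALTER TABLE employees ADD COLUMN salary INTEGER DEFAULT 0")

# The row inserted before the ALTER now reports the default salary.
print(cur.execute("SELECT salary FROM employees WHERE employee_id = 1000").fetchone()[0])  # 0
```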

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;ALTER TABLE employees&lt;br&gt;
ADD constraint emp_id_pk primary key (employee_id);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE employees2&lt;br&gt;
DROP constraint emp_dept_fk;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice the &lt;code&gt;ADD&lt;/code&gt; and &lt;code&gt;DROP&lt;/code&gt; constraint commands: the former creates a UNIQUE, PRIMARY KEY, FOREIGN KEY, or CHECK constraint, while the latter deletes one. Both are applied only after a table has already been created.&lt;/p&gt;

&lt;h4&gt;
  
  
  DATA OBJECTS
&lt;/h4&gt;

&lt;p&gt;Objects are used in databases to store or reference data. An object can only be accessed by using its identifier. SQL provides various data objects; they include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table - Basic unit of storage.&lt;/li&gt;
&lt;li&gt;View - Logically represents subsets of data from one or more tables.&lt;/li&gt;
&lt;li&gt;Sequence - Generates numeric values.&lt;/li&gt;
&lt;li&gt;Index - Improves the performance of some queries.&lt;/li&gt;
&lt;li&gt;Synonym - Gives an alternative name to an object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at &lt;code&gt;VIEW&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;VIEW&lt;/code&gt;.
&lt;/h4&gt;

&lt;p&gt;This is a logical table based on a table or another view.&lt;/p&gt;

&lt;h5&gt;
  
  
  Why use &lt;code&gt;VIEW&lt;/code&gt;?
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;To restrict data access.&lt;/li&gt;
&lt;li&gt;To make complex queries easy.&lt;/li&gt;
&lt;li&gt;To provide data independence.&lt;/li&gt;
&lt;li&gt;To present different views of the same data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Creating a view:&lt;br&gt;
&lt;code&gt;CREATE VIEW emp1&lt;br&gt;
AS SELECT * FROM employees;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To remove a view use the &lt;code&gt;DROP&lt;/code&gt; command.&lt;br&gt;
e.g. &lt;code&gt;DROP VIEW emp1;&lt;/code&gt;&lt;/p&gt;
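&lt;p&gt;The view lifecycle above can be sketched end to end. This minimal example, assuming a small invented employees table in SQLite via Python's &lt;code&gt;sqlite3&lt;/code&gt; module, uses a view to restrict data access by hiding the salary column, then drops it:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT, salary INTEGER)")
cur.execute("INSERT INTO employees VALUES (1000, 'Otieno', 32000)")
cur.execute("INSERT INTO employees VALUES (1001, 'Mwangi', 42000)")

# A view that restricts data access: salary is not visible through it.
cur.execute("CREATE VIEW emp1 AS SELECT employee_id, last_name FROM employees")
rows = cur.execute("SELECT * FROM emp1").fetchall()
print(rows)  # [(1000, 'Otieno'), (1001, 'Mwangi')]

cur.execute("DROP VIEW emp1")  # removes the view; the base table is untouched
```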

</description>
      <category>hackathon</category>
      <category>learning</category>
      <category>career</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
