Forem: Maiyo

Python 101 For Data Engineering

Maiyo — Fri, 25 Apr 2025 09:53:28 +0000

What is python

Python is a popular high-level programming language. A programming language is a set of rules for writing instructions that computers carry out to perform specific tasks. There are many programming languages, such as C, C++, Rust, Java, C# and, of course, Python. Most languages are categorized as low-level, mid-level or high-level. Python is a high-level language because it is human-readable and abstracts away many hardware details. At the other end, low-level languages are closer to what computers (the hardware) understand than to what humans read easily.

Python is general purpose and is used for web development, system administration, scientific and numerical analysis, and GUI development. This cross-field application and ease of learning have made Python one of the most popular languages.

Getting started with python

We are now getting close to writing Python code. Before that, we need to set up our computer's environment to write and run Python. When we write code, we need an interpreter to read and execute it.
On that note, we shall begin by installing the latest Python version on our PCs. The installation depends on which operating system your computer is using. Most Linux distributions, and many macOS setups, come with Python pre-installed. On Windows, we can install Python from the link and follow the instructions in the Python installer. I will be focusing on Linux, since my computer runs Ubuntu.

We already know that Python comes pre-installed on Linux distributions, but it is important to know which version is installed. Python has several versions, and the current series is 3.x.x. Having the right version matters because Python 3.x.x is not backward compatible with Python 2.x.x, so code written for one may not run on the other.

Checking Python version.

To check the Python version, we run the command python -V or python --version. On systems where only the python3 command exists, use python3 -V instead.
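The same check can also be made from inside Python itself. Here is a minimal sketch using the standard library's sys module; the helper name is_python3 is made up for this example.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or newer."""
    return version_info[0] >= 3

# Prints the running interpreter's version, e.g. major.minor.patch
print(".".join(str(part) for part in sys.version_info[:3]))
print("Running Python 3:", is_python3())
```

If the last line prints False, the interpreter you launched is a Python 2 one, and the python3 command is probably the one you want.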

After installing the right version, we need a code editor for writing code. Some editors come pre-installed, like Notepad or Vim, and they work fine.
In my opinion, using an editor like Vim as a beginner teaches you a lot about the actual syntax of Python, and you also learn to work from the command line interface. Vim does have a steep learning curve for basic editing, so feel free to use a different editor of your preference.
To write Python, one can use VS Code, PyCharm or Sublime Text. To install them, you can use the links below, and remember to get the right software for your operating system.

We now have everything set to start coding in Python. As you start learning, the Python documentation is really important and contains everything needed to learn Python. On top of it, there are various websites that teach well. Python for data engineering, here we go.

Python basics

As with any programming language, we can start by writing a 'Hello World' program. This is important because it teaches us how to run Python programs.
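A minimal version of such a program could look like this. The variable name message is only a choice for this sketch; a single print call with the text inside would work just as well.

```python
# hello.py: a first Python program
message = "Hello World"

# print() writes the text to the terminal, followed by a newline
print(message)
```

Saving these lines in a file called hello.py gives us something to run in the sections that follow.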

Writing Python on IDEs

When using IDEs (Integrated Development Environments), it is easy to run a python program. All you need is to select the file you want to run and click the play button and the program executes.

Writing Python using the terminal

Learning and mastering the terminal is important. When interacting with virtual machines in the cloud, one mostly works on a command line interface (CLI).
Step 1: Write the script in the terminal by creating a .py file. For this tutorial we are using the Vim editor. First, navigate to the folder you want using cd, then run vim hello.py

To save the file, press ESC, then type :wq and hit Enter. Run the ls command to list all files in the directory.

Step 2: Once the file is saved, run the script using either: python hello.py or python3 hello.py

As observed, we use the command python (or python3) followed by the file name to run the script. Before we all get confused, the two commands do the same job, but which one you use depends on the Python version installed. The python command traditionally points to Python 2.x.x, while python3 explicitly points to Python 3.x.x, which is the current and maintained version of Python.

Another way to run Python from the command shell is to make the script executable. This is achieved by adding a shebang at the top of the script. Note that this only works on Unix-like systems, e.g. Ubuntu.

A shebang

A shebang is the first line in a script and tells the operating system which interpreter to use to execute the file. Basically, a shebang enables us to run a script directly, like ./hello.py, instead of using the python3 command.

Shebang syntax:
#!/usr/bin/python3

Step 1: Add the shebang on the first line of the script. Save and quit Vim.
Step 2: Make the Python script executable. Run the command chmod u+x hello.py. The command gives the user (you) execute permission on the file hello.py
Step 3: Run the file directly: ./hello.py

To conclude on these two methods of running Python scripts, they are interchangeable. Either can be used, but I prefer shebangs for production scripts, especially those used for automation. The python3 command is easy and flexible to use (especially when switching between versions) while learning and developing.
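To see what chmod u+x actually changes, here is a rough Python equivalent of Step 2 using the standard os and stat modules. It is only a sketch for Unix-like systems; the function name make_user_executable is made up, and the script is written to a temporary folder so no real file is touched.

```python
import os
import stat
import tempfile

def make_user_executable(path):
    """Add the execute bit for the file's owner, like `chmod u+x path`."""
    current_mode = os.stat(path).st_mode
    os.chmod(path, current_mode | stat.S_IXUSR)

# Demonstrate on a throwaway copy of the script
with tempfile.TemporaryDirectory() as folder:
    script = os.path.join(folder, "hello.py")
    with open(script, "w") as handle:
        handle.write("#!/usr/bin/python3\nprint('Hello World')\n")
    make_user_executable(script)
    # Check that the owner's execute bit is now set
    is_executable = bool(os.stat(script).st_mode & stat.S_IXUSR)
    print("Owner can execute:", is_executable)
```

In day-to-day work the chmod command is the simpler route; the sketch is only meant to show that "executable" is a permission bit stored on the file.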

Popping the Web Stack

Maiyo — Sun, 16 Apr 2023 06:58:53 +0000

Humans are a species that continuously evolves. A while ago, the sole source of written information was the library. And by libraries I mean those huge buildings stacked with long shelves filled with books. To access any piece of information, you had to visit the library in person, look for your resource in a catalog, locate it among the many shelves and finally begin to look for information in it. Do not get it twisted, I am a fan of libraries. My point is that this very process necessitated one of the most brilliant inventions of humans: the internet.

The Internet

The internet is a global network of interconnected devices that work hand in hand to share information. These devices are mostly computers, together with servers and network devices such as routers and switches. In other words, the internet is a community of computers.
One lovely thing about the internet is that it is not owned by a single entity, which leaves it open to anyone with an internet connection. Isn't that lovely? Nonetheless, every community has to have rules on how everyone engages, and this is reflected on the internet too. Responsibility for the internet is spread across a network of organizations including internet service providers (ISPs), content creators, governments, and several agencies. These organizations are tasked with regulating the internet and keeping it safe.

The web

A lot of people think that the web and the internet are the same, but they are not. The web (or World Wide Web) is the part of the internet that consists of webpages, images, and videos that can be accessed from a web browser.
Then what is the difference between the two? The internet is the infrastructure upon which computers share information, while the web is an application or service that uses the internet to make hypertext documents accessible.

How does the web work?

The web works on a client-server model. This model differentiates the tasks between the client and the server. In this case, the server is the service or resource provider, while the client is the service or resource requestor.
This model is widely used because the data is centralized making it easy to maintain. Also, the capacities of the clients and servers are independent which brings flexibility to the model. Remember this makes it possible for you to access the internet using any device like a phone, tablet, or laptop which vary in terms of specifications.
This model forms the architecture which is the basis of our question, what happens when you type https://www.google.com in your browser and press Enter? It is the foundation of how the web works.

What happens when you type `https://www.google.com` in your browser and press `Enter`?

1. Type `https://www.google.com` on your web browser's address bar

Remember that the main application you use to access the web from your device is the web browser. A web browser is a software application that runs on your operating system and is used to access, view and interact with websites. There are numerous web browsers, including Google Chrome, Mozilla Firefox, Apple's Safari, and Microsoft Edge. Of course, we had Internet Explorer back in the day, which has since been retired.

The URL

What is keyed into the browser's address bar is known as the uniform resource locator (URL). We now understand that the internet is regulated and standardized, hence the use of URLs. A URL is a standardized way of specifying the location of a resource on the internet. It has the following format
protocol://host/path?query
The URL https://www.google.com does not include the path and query section of a standard URL, but here is its breakdown:

Protocol - Specifies the communication protocol to be used to access the resource. It can be HTTP, HTTPS, FTP among other protocols. The protocol for this URL is HTTPS.
Host - Specifies the domain name for the website. A domain name can be further split into the sub-domain, domain, and top-level domain:

www: This is the subdomain of the domain name google.com. The www subdomain is a common convention for web pages, but it is not strictly necessary for accessing the site.
google: This is the primary domain name of the site, which identifies the organization or entity that owns it.
.com: This is the top-level domain (TLD) of the site, which indicates the type of organization or domain category. In this case, the "com" TLD indicates a commercial organization.

The rest of the sections of the URL can be described as:

Path - Specifies the location of the resource
Query - Specifies additional parameters used to specify the resource being searched.
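The breakdown above can be tried out with Python's standard urllib.parse module. This is a small sketch; the helper name describe_url and the second URL (with a path and a query) are only illustrations.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into the parts discussed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "query": parse_qs(parts.query),
    }

# The URL from this article has no path or query
print(describe_url("https://www.google.com"))

# An illustrative URL that uses every part of the format
print(describe_url("https://www.google.com/search?q=python"))

# The host can be split further into subdomain, domain and TLD
print("www.google.com".split("."))
```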

2. DNS lookup

Computers only understand bits, that is 1s and 0s. For this reason, the host name in a URL needs to be translated to the IP address of the respective host. This is done by DNS servers.

What is DNS?

The Domain Name System (DNS) is the system that translates human-readable domain names into IP addresses. It is commonly referred to as the phonebook of the internet. Humans are poor at remembering numbers. This is why you have a phonebook that keeps names against cell numbers in your phone.
DNS is capable of doing IP mapping since it keeps records of hosts' domain names and respective IP addresses. These records include A, CNAME, MX, NS, SOA, and TXT records.

A (Address) - Used to point a domain name to the associated IPv4 address.
CNAME (Canonical Name) - Used for creating aliases of domain names.
MX (Mail Exchanger) - Used to specify the mail servers that accept email for a domain.
NS (Name Server) - Used to specify an authoritative name server for a given domain.
SOA (Start of Authority) - Used to determine key information about a DNS zone.
TXT - Used to verify domain ownership and hold SPF (Sender Policy Framework) data.
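To make the record types a bit more concrete, here is a toy sketch of how an alias (CNAME) leads to an address (A) record. The record table, the example.com names and the resolve_a function are all made up for illustration; real DNS servers store and serve records very differently.

```python
# Toy zone data: (name, record type) -> value. All values are made up.
RECORDS = {
    ("www.example.com", "CNAME"): "example.com",
    ("example.com", "A"): "192.0.2.10",
    ("example.com", "MX"): "mail.example.com",
}

def resolve_a(name, records=RECORDS, max_hops=5):
    """Follow CNAME aliases until an A record is found, or give up."""
    for _ in range(max_hops):
        if (name, "A") in records:
            return records[(name, "A")]
        alias = records.get((name, "CNAME"))
        if alias is None:
            return None  # no A record and no alias to follow
        name = alias
    return None  # too many aliases in a row

print(resolve_a("www.example.com"))
```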

What is an IP address?

Remember that computers only understand bits. An IP address consists of a series of numbers separated by dots or colons. It is a unique identifier assigned to every computer in a network. There are two versions of IP addresses, IPv4 and IPv6. IPv4 is 32 bits long with the digits separated by dots, while IPv6 is 128 bits long with colons as the separators.
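The two versions can be compared using Python's standard ipaddress module. The addresses below come from ranges reserved for documentation, so they are placeholders and not Google's real addresses.

```python
import ipaddress

# Placeholder addresses from the documentation-only ranges
v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

for address in (v4, v6):
    # max_prefixlen is the total number of bits in the address
    print(f"{address} is IPv{address.version} with {address.max_prefixlen} bits")
```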

How is the IP address resolved (DNS lookup)

To resolve or get an IP address for google.com, a forward DNS lookup is done. DNS data is heavily cached, so the lookup is first done locally. Let's say you've entered the URL using the Chrome browser. The browser, which in this case is the DNS client, looks up the address in its own cache; if it is not found there, it looks in the operating system cache.
If the address is still not resolved, the query goes to the local DNS server (the recursive resolver), which queries other DNS servers on the client's behalf until the name is resolved.
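The order of the lookup described above can be sketched as a chain of caches. Everything here is a simplification with made-up names: the caches are plain dictionaries, fake_dns_server stands in for the local DNS server, and the address is a placeholder.

```python
def lookup(hostname, browser_cache, os_cache, query_dns_server):
    """Check the local caches first, then fall back to the DNS server."""
    for cache in (browser_cache, os_cache):
        if hostname in cache:
            return cache[hostname]
    address = query_dns_server(hostname)
    if address is not None:
        # Remember the answer so the next lookup stays local
        os_cache[hostname] = address
        browser_cache[hostname] = address
    return address

queries = []  # records every question that reaches the server

def fake_dns_server(hostname):
    """Stand-in for the local DNS server; the address is a placeholder."""
    queries.append(hostname)
    return {"www.google.com": "192.0.2.20"}.get(hostname)

browser_cache, os_cache = {}, {}
first = lookup("www.google.com", browser_cache, os_cache, fake_dns_server)
second = lookup("www.google.com", browser_cache, os_cache, fake_dns_server)
print(first, second, "server queries:", len(queries))
```

The second lookup is answered from the browser cache, so the server is only asked once. Real caches also expire their entries after a while, which this sketch leaves out.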

3. TCP connection

Once the server's IP address is resolved, the browser initiates a TCP connection. The Transmission Control Protocol (TCP) is a transport layer protocol in the OSI model, responsible for how network devices exchange data. Data is transmitted over the internet as packets. TCP is a reliable protocol, meaning any data packets lost or duplicated in transit are detected and corrected. TCP works on top of the Internet Protocol (IP), which is responsible for routing data packets to the right address.
A TCP connection is established through a three-way handshake.
First, the client browser sends a SYN packet or segment to the server.
If the server has received the segment, it agrees to the connection by sending the SYN-ACK packet back to the client.
The client acknowledges the receipt of the SYN-ACK packet by sending its own ACK packet. At the same time, the client can already begin transmitting data packets to the server.
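The handshake can be simulated in a few lines. This sketch adds one detail not covered above: each side picks an initial sequence number, and an acknowledgement carries the other side's number plus one. The function name and the numbers 100 and 300 are made up; real initial sequence numbers are chosen unpredictably by the operating system.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three segments exchanged to open a TCP connection."""
    syn = {"from": "client", "flags": "SYN", "seq": client_isn}
    syn_ack = {
        "from": "server",
        "flags": "SYN-ACK",
        "seq": server_isn,
        "ack": syn["seq"] + 1,  # acknowledges the client's SYN
    }
    ack = {
        "from": "client",
        "flags": "ACK",
        "seq": syn["seq"] + 1,
        "ack": syn_ack["seq"] + 1,  # acknowledges the server's SYN
    }
    return [syn, syn_ack, ack]

segments = three_way_handshake(client_isn=100, server_isn=300)
for segment in segments:
    print(segment)
```

In practice you never write this yourself. The operating system performs the handshake whenever a program opens a TCP socket.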

4. HTTP Request.

When we request a resource from Google, the client browser sends an HTTP request to the web server. HTTP stands for the Hypertext Transfer Protocol. It is used to request resources from the web.
HTTP requests sent to the web server are a combination of the following:
Request methods - This indicates the HTTP method used to fetch the resources. There are several methods like GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, and TRACE. GET is used to retrieve data from the web server.
Request URL - This is the URL of the resource requested. Remember URLs carry the path to the resource requested.
Request headers - This is additional information such as the User-Agent header, which identifies the browser sending the HTTP request.
Request body - This is optional in an HTTP request and is used for methods that send data to the web server, like form data.
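Put together, these parts form a short piece of text. The sketch below assembles an HTTP/1.1 request by hand to show where each part goes; the function name build_http_request and the User-Agent value are made up, and a real browser sends many more headers.

```python
def build_http_request(method, path, host, headers=None, body=""):
    """Assemble the text of an HTTP/1.1 request from its parts."""
    # Request line: method, path of the resource, protocol version
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    # Request headers, one per line
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the optional body
    return "\r\n".join(lines) + "\r\n\r\n" + body

request = build_http_request(
    "GET", "/", "www.google.com", {"User-Agent": "example-browser/1.0"}
)
print(request)
```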

HTTP with 's' (HTTPS)

HTTPS stands for Hypertext Transfer Protocol Secure. Up to this stage, we have explored HTTP requests, which have no security mechanism within them. Dynamic websites are interactive and more often than not require users or clients to send data in forms for services such as authentication. Let's say you want to log in to your favorite blog: you have to key in your password. When plain HTTP is used, your data, in this case your username and password, is exposed to anyone sniffing your connection.
This has led to the use of a more secure protocol, HTTPS. This protocol encrypts data as it is being exchanged between the client and the web server, therefore providing secure communication. It achieves this by the use of SSL (Secure Sockets Layer) or its successor TLS (Transport Layer Security). SSL/TLS uses public-key cryptography to set up its encryption. Public-key cryptography relies on two keys: a public key and a private key.
When a client connects to an HTTPS website, the website provides a public key to the client. This public key is delivered inside what is normally referred to as an SSL or TLS certificate, depending on the protocol in use. In the classic form of the handshake, the client uses this public key to encrypt a symmetric encryption key and sends it over to the web server. After the server receives it, it uses its private key to decrypt the symmetric encryption key. (Newer TLS versions let both sides derive the shared key through a key exchange instead of sending it encrypted.) Once both the browser and the web server have the symmetric encryption key, they use it to encrypt data as it is transmitted. When the transmission is done, the symmetric encryption key is discarded.
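The key exchange itself happens inside the TLS library, so there is little to write by hand. What can be shown is the client-side setup: the sketch below uses Python's standard ssl module to create the default settings a client would use, and prints the checks it will apply to the server's certificate. It does not open any connection.

```python
import ssl

# Default client-side settings recommended by the ssl module
context = ssl.create_default_context()

# The server must present a certificate that can be verified
print("Certificate required:", context.verify_mode == ssl.CERT_REQUIRED)

# The certificate must match the host name we asked for
print("Hostname checked:", context.check_hostname)

# The oldest protocol version this client is willing to speak
print("Oldest protocol allowed:", context.minimum_version.name)
```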

5. Processing the requests and serving a response

Processing of HTTP requests is done by a web server. A web server is software that runs on a physical server and serves web pages. It hosts website files that include HTML documents, CSS stylesheets, JavaScript files, images, and videos. Normally the web stack includes a database, which can run on the same physical server as the web server or on a different one. The database is used to store user information for dynamic websites.
To fully process a request, the web server undergoes this series of steps before sending back a response.

Parsing the request: The server first parses the incoming request to determine what action it is being asked to perform and what resources are being requested.
Routing the request: Once the server has parsed the request, it determines which application or service should handle the request. This is typically done using a routing table or configuration file.
Authenticating the request: Depending on the security requirements of the application or service handling the request, the server may need to authenticate the client before processing the request. This may involve validating credentials or checking permissions.
Handling the request: The server then passes the request to the appropriate application or service to handle the requested resource. This may involve executing code, querying a database, or serving a static file.
Generating a response: Once the requested resource has been processed, the server generates a response and sends it back to the client. The response typically includes an HTTP status code, headers, and a message body.
Closing the connection: Finally, the server closes the connection with the client. Depending on the server configuration, the connection may be kept open for a certain period for subsequent requests or closed immediately.
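The steps above can be sketched as one small function. This is a toy model and not how any particular web server is built: the routing table, the protected path, the token and the function name handle_request are all made up for illustration.

```python
# Hypothetical routing table: path -> function that produces the page
ROUTES = {
    "/": lambda: "<h1>Home</h1>",
    "/account": lambda: "<h1>Your account</h1>",
}
PROTECTED_PATHS = {"/account"}  # made-up pages that need a login
VALID_TOKEN = "Bearer secret-token"  # made-up credential

def handle_request(request_line, headers):
    """Return (status code, body) for a simplified HTTP request."""
    # Parsing: split "GET /path HTTP/1.1" into its parts
    method, path, _version = request_line.split(" ")
    # Routing: find the handler responsible for this path
    handler = ROUTES.get(path)
    if handler is None:
        return 404, "Not Found"
    # Authenticating: protected paths need valid credentials
    if path in PROTECTED_PATHS and headers.get("Authorization") != VALID_TOKEN:
        return 401, "Unauthorized"
    # Handling: this sketch only serves GET requests
    if method != "GET":
        return 405, "Method Not Allowed"
    # Generating a response: status code plus message body
    return 200, handler()

print(handle_request("GET / HTTP/1.1", {}))
print(handle_request("GET /account HTTP/1.1", {}))
print(handle_request("GET /missing HTTP/1.1", {}))
```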

6. Rendering the content

Rendering of responses refers to the process of converting a server's response into a visual representation that can be displayed to the user. When a user requests a page from a server via their browser, the server sends a response containing HTML, CSS, and JavaScript files.

The browser then reads the response and begins to render the content. It starts by parsing the HTML and building the Document Object Model (DOM) tree, which represents the structure of the web page. Next, it parses the CSS to determine how the content should be styled, and it applies these styles to the appropriate elements in the DOM tree. Finally, it executes any JavaScript code included in the response to add interactivity and modify the content dynamically.
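The first of these steps, turning HTML text into a tree, can be imitated with Python's standard html.parser module. This is only a sketch of the idea: the class name OutlineParser is made up, it records tag names and nesting depth rather than a real DOM, and it assumes every tag in the input is properly closed.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent's level

parser = OutlineParser()
parser.feed("<html><body><h1>Hi</h1><p>Welcome</p></body></html>")
print("\n".join(parser.outline))
```

The printed outline shows body nested inside html, with h1 and p as siblings inside body, which is the same parent and child structure a DOM tree captures.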

Monitoring and Managing traffic

So far, we have looked at the cycle from when you enter a URL up to the point you receive a rendered page. This is a single instance of how that happens. The internet has made the whole world a village, and so many people across the globe interact with it. It is normal to have numerous requests for the same resource. One funny thing about software engineering, or engineering as a whole, is that you design for the worst-case scenario. Yes, worst-case scenarios; remember Big O notation for algorithms? It is the same when it comes to web development. While designing the web infrastructure, you have to assume that you will have many concurrent requests to the same resource, and also that someone will try to access your resources without authorization and maybe with malicious intent.

Load Balancing

Some of the world's leading corporations are web-based. Think of Meta, Twitter, and LinkedIn among others. Have you ever wondered how you get your Twitter feed refreshed in a matter of seconds? How you can always have access to your feed at any time of the day? When do they maintain their systems? A load balancer is a big part of the answer.
A load balancer is a traffic manager that manages incoming traffic to a pool of servers. This applies when a distributed server system is used, where several web servers with the same resources and services run on different machines. Load balancing is the technique of distributing incoming network traffic across those servers. The main objective is to prevent any single server from being overloaded with traffic, thereby ensuring the high availability and reliability of the web application.
When a user sends a request to access a web application, the request first goes to the load balancer, which examines the request and decides which server in the pool is best suited to handle it. The load balancer can use various algorithms to make this decision, such as round-robin, least connections, IP hash, or other custom algorithms. Once the load balancer has selected a server, it forwards the request to that server, which then processes the request and sends the response back to the user through the load balancer. The load balancer can also monitor the health and performance of the servers in the pool and redirect traffic away from any servers that are experiencing issues or are overloaded. This monitoring also supports predictive maintenance, which helps catch server problems before they turn into breakdowns.
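As an illustration, here is a minimal sketch of the round-robin algorithm with a very simple health check. The class name RoundRobinBalancer and the server names are made up, and real load balancers do far more (timeouts, weights, connection tracking).

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_round = [balancer.pick() for _ in range(4)]
balancer.mark_down("web-2")  # pretend a health check failed
second_round = [balancer.pick() for _ in range(4)]
print(first_round)
print(second_round)
```

Once web-2 is marked down, traffic keeps flowing to the remaining servers, which is how maintenance can happen without users noticing.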

Firewalls

Firewalls provide security in networks and are used in the web infrastructure to protect an application from unauthorized access or malicious attacks. Remember that a lot of sensitive data, like personal information, health records, and organization secrets, is stored or shared across the web. This makes firewalls a really important component in the web stack, ensuring data does not fall into the wrong hands. A firewall is a network security system that monitors both incoming and outgoing traffic based on predetermined security rules.
In a web infrastructure, firewalls are typically placed between the internet and the web servers to provide an additional layer of security. The firewall acts as a barrier between the internet and the web servers, filtering and blocking incoming traffic that does not comply with the predefined security rules.
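A rule-based filter of this kind can be sketched with the standard ipaddress module. The rules, addresses and the function name decide are all made up for illustration; real firewalls match on many more properties than source address and port.

```python
import ipaddress

# Hypothetical rules, checked from top to bottom; the first match wins
RULES = [
    {"action": "deny", "source": "203.0.113.0/24", "port": None},  # blocked range
    {"action": "allow", "source": "0.0.0.0/0", "port": 443},  # HTTPS from anywhere
    {"action": "allow", "source": "0.0.0.0/0", "port": 80},  # HTTP from anywhere
]

def decide(source_ip, port, rules=RULES):
    """Return 'allow' or 'deny' for an incoming connection."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        in_network = address in ipaddress.ip_network(rule["source"])
        port_matches = rule["port"] is None or rule["port"] == port
        if in_network and port_matches:
            return rule["action"]
    return "deny"  # anything not explicitly allowed is blocked

print(decide("198.51.100.7", 443))
print(decide("203.0.113.9", 443))
print(decide("198.51.100.7", 22))
```

Denying by default, as the last line of the function does, is a common design choice: traffic gets through only when a rule says it may.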

The Ultimate Exploratory Data Analysis

Maiyo — Tue, 28 Feb 2023 19:37:13 +0000

Introduction

Exploratory Data Analysis (EDA) is the process of examining and analyzing data sets to summarize their main characteristics and gain insights into their underlying structure, patterns, and relationships. EDA is typically used as a preliminary step before performing more complex statistical analysis or building predictive models.
For this article I will explore this dataset from Kaggle and list down findings from it. This dataset contains rich information about the salary patterns among the IT professionals in the EU region and offers some great insights.
I will carry out my data analysis using python on a Jupyter notebook. Now let's get down to it.

Importing Libraries and Loading Datasets

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

The initial step in analyzing data using Python is to import python libraries that provide data analysis functionalities. These modules include pandas, numpy, matplotlib and seaborn. Here is a breakdown for the respective use of each module:

pandas - This is a common python library used for data manipulation, analysis, and cleaning. It provides data structures and functions for efficiently handling and processing large and complex data sets.
numpy - It is a popular python library that is used for mathematical and scientific computing. It provides a powerful array computing feature that enables developers to perform complex mathematical operations on large arrays and matrices of numeric data.
matplotlib - The library is used for data visualization. It provides tools for creating a wide range of static, animated, and interactive visualizations in Python.
seaborn - This is a Python library that is built on top of Matplotlib and is primarily used for statistical data visualization. It provides a high-level interface for creating informative and attractive statistical graphics in Python.

With these Python libraries, it is possible to carry out an exploratory data analysis, which includes loading datasets, cleaning them, analysing them and finally visualizing the findings.

df = pd.read_csv('IT Salary Survey EU  2020.csv', sep=',')

To load the dataset, download it first from the link provided. Save the file in the same directory as the notebook being used. If the file is saved elsewhere, provide the path to its location when loading the dataset.
Once the data is downloaded, it is good to open it and find out its separator value. Separators can be single spaces, tabs or commas.
To load the dataset into a data frame, we use the pandas read_csv function. It takes a string giving the location of the downloaded dataset and the separator value as arguments.
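If you would rather not open the file by hand, the separator can usually be guessed with the standard csv module. This is a small sketch; the helper name detect_separator and the sample rows are made up, and for a real file you would pass in its first few lines instead.

```python
import csv

def detect_separator(sample, candidates=",;\t"):
    """Guess which of the candidate characters separates the columns."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Made-up rows standing in for the first lines of a file
comma_sample = "Age,Gender,City\n26,Male,Munich\n30,Female,Berlin\n"
semicolon_sample = "Age;Gender;City\n26;Male;Munich\n30;Female;Berlin\n"

print(repr(detect_separator(comma_sample)))
print(repr(detect_separator(semicolon_sample)))
```

The detected character is what goes into the sep argument of read_csv.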

Understanding the dataset

One of the most important steps in carrying out EDA is understanding the dataset you are analyzing. Since the dataset can be very large, the pandas library offers methods that simplify exploring and understanding it.

To know the size of the dataset, run df.shape. From the result, the dataset has 1253 rows and 23 columns.

To display the first five rows of the dataset, run df.head(5). Note that you can specify any number of rows by changing the integer value in the argument.
The image above shows the results of the first five rows. The image itself cannot capture all 23 columns, but you can show more columns by running pd.set_option('display.max_columns', 200) in the import section.
To display the last five rows, the method changes to df.tail(5). It works in the same manner as the head method.

To list all the column names, i.e. all 23 of them, run the code df.columns. This returns all the column names as an Index object.

To display the column types, we run the code df.dtypes. From the picture, the columns are of either object type or float type.

Finally, to get information and statistics on the numerical data in the dataset, use the code df.describe(). This method outputs statistical parameters such as count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles for the numerical data within the dataset. Note that only four columns in the dataset hold numerical data, in the form of floats.
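Here are the same calls on a tiny made-up data frame, so they can be tried without downloading anything. The values are invented and only three columns are included; on the real dataset the calls are identical, with df in place of sample.

```python
import pandas as pd

# Show up to 200 columns instead of truncating wide tables
pd.set_option("display.max_columns", 200)

# Made-up rows, only for trying the methods out
sample = pd.DataFrame({
    "Age": [26.0, 30.0, 35.0, 30.0],
    "City": ["Berlin", "Munich", "Berlin", "Hamburg"],
    "Total years of experience": ["5", "7", "12", "6"],
})

print(sample.shape)  # (rows, columns)
print(sample.head(2))  # first two rows
print(sample.tail(2))  # last two rows
print(list(sample.columns))  # column names
print(sample.dtypes)  # type of each column
print(sample.describe())  # statistics for the numerical columns only
```

Notice that describe() only reports on Age here, because the experience column holds text even though it looks numeric.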

Cleaning data

Most of the datasets are going to be very large and contain lots of redundant information that might not be needed in the analysis. On the other hand, new columns can be derived from the already existing ones and added to the data frame to be analyzed.
Cleaning of the data requires both a deep understanding of the dataset and the objective of the analysis. Cleaning of the data can involve:

Dropping columns that are not required for analysis.
Updating the column names to more appropriate names and removing whitespace within column names.
Updating the column data types to suit the type of data they hold, for seamless analysis using Python code.
Checking for null values within the dataset and understanding their distribution.
Checking for duplicate records in the dataset. When duplicates are found, they should be removed.

From the initial 23 columns, some are dropped for this analysis and only ten are retained. The columns to be retained are:

Age
Gender
City
Position
Total years of experience
Seniority level
Main technology/programming language
Company size
Company type
Contract duration

To clean the data, it is advisable to create a subset of the original data frame. This is useful when the dataset is very large and complex, and analyzing the entire dataset would be time-consuming or computationally challenging.
For this dataset, a subset of only the 10 selected columns has been created. The rest of the columns have been commented out just to clearly show which ones were dropped. When df.shape is run, we note that the number of rows remains the same as before but the number of columns reduces to ten.

The next step is to scrutinize the columns and check their data types and their naming conventions. For this dataset, the data types were not altered. On the naming of columns, it is good practice to use short and concise names without whitespace between words. Whitespace gets in the way when columns are accessed in code, for example with attribute-style access such as df.column_name. The columns were renamed as shown above.
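The subset and rename steps described above can be sketched on a small made-up data frame. The Timestamp column, the values and the new short names (Experience, Seniority) are assumptions for this example; the notebook's actual names may differ.

```python
import pandas as pd

# Made-up rows; "Timestamp" stands in for a column we do not need
raw = pd.DataFrame({
    "Age": [26.0, 30.0],
    "Total years of experience": ["5", "7"],
    "Seniority level": ["Middle", "Senior"],
    "Timestamp": ["24/11/2020", "24/11/2020"],
})

# Keep only the columns needed; copy() makes the subset independent
subset = raw[["Age", "Total years of experience", "Seniority level"]].copy()

# Replace long names containing spaces with short ones
subset = subset.rename(columns={
    "Total years of experience": "Experience",
    "Seniority level": "Seniority",
})

print(raw.shape, "->", subset.shape)
print(list(subset.columns))
```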

To check for null values, the isna method is used together with the sum method. From the results above, it is clear that all columns apart from City have null values.

Next is to check for duplicates in the dataset. I looked for duplicates but could not find conclusive evidence of duplicated records. This is because the data lacks a unique identifier for every person who filled in the form, so it is impossible to tell whether someone submitted the form twice with the exact same details. To be on the safe side, I looked for duplicates in the subset data frame using all the columns and did not find any.
At this point, the dataset is ready for analysis.
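Both checks can be tried on a small made-up data frame. The values below are invented so that one row is missing data and one row repeats; they are not results from the survey.

```python
import pandas as pd

# Made-up rows: one missing age, one missing seniority, one repeated row
subset = pd.DataFrame({
    "Age": [26.0, None, 35.0, 35.0],
    "City": ["Berlin", "Munich", "Berlin", "Berlin"],
    "Seniority": ["Middle", None, "Senior", "Senior"],
})

# isna() marks missing cells; sum() counts them per column
null_counts = subset.isna().sum()
print(null_counts)

# duplicated() marks rows that repeat an earlier row in every column
duplicate_count = int(subset.duplicated().sum())
print("Duplicate rows:", duplicate_count)

# drop_duplicates() keeps the first copy of each repeated row
deduplicated = subset.drop_duplicates()
print(deduplicated.shape)
```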

Analyzing the data.

To analyze the data subset, data visualization tools come in handy. Features of the data can be analyzed individually through univariate analysis or compared against each other through correlation.
This data frame subset has only a few numerical columns available for simple numerical analysis using histograms, boxplots, scatter diagrams and other plots.
For this tutorial I did an example numerical analysis of the age feature. Here are the findings:

The age feature was plotted on a histogram. From the histogram the following is gathered from the age of employees who undertook this survey:

Most of the survey respondents working in IT in Europe are around 30 years old.
From the histogram we can also note that ages range from 20 to almost 70 years.

This is one example of feature analysis. I would have wanted to explore every feature, but my current skill set limits my ability to analyze text-based data, which makes up the majority of this data subset. I might as well take it up as a challenge and do a dedicated article on it after learning how.
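For anyone who wants to see what a histogram computes under the hood, here is a sketch using numpy. The ages are made up and are not the survey data; on the real subset you would pass the age column instead, and plt.hist with the same bins would draw the bars.

```python
import numpy as np

# Made-up ages, only to illustrate the binning
ages = np.array([24, 27, 29, 30, 30, 31, 32, 33, 36, 41, 52, 66])

# Ten-year bins with edges at 20, 30, 40, 50, 60 and 70
edges = np.arange(20, 71, 10)
counts, edges = np.histogram(ages, bins=edges)

# Print a text version of the histogram, one row per bin
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left}-{right}: {'#' * int(count)}")
```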

References

Here's a link to the Jupyter notebook I used for this tutorial.

Ultimate_EDA.ipynb