<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: StevenMcGown</title>
    <description>The latest articles on Forem by StevenMcGown (@stevenmcgown).</description>
    <link>https://forem.com/stevenmcgown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F657912%2Faf317a8c-03a4-424f-a63c-9b52cb3e6ca2.png</url>
      <title>Forem: StevenMcGown</title>
      <link>https://forem.com/stevenmcgown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stevenmcgown"/>
    <language>en</language>
    <item>
      <title>Data Science Zero to Hero - 2.2: Data ETL (Extract, Transform, Load)</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Sat, 12 Aug 2023 14:46:22 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/data-science-zero-to-hero-22-data-etf-extract-transform-load-349l</link>
      <guid>https://forem.com/stevenmcgown/data-science-zero-to-hero-22-data-etf-extract-transform-load-349l</guid>
      <description>&lt;h2&gt;
  
  
  Where does data come from?
&lt;/h2&gt;

&lt;p&gt;Data can come from many different sources - it might be generated by users, collected from sensors, retrieved from databases, even scraped from websites. The methods of data collection may depend on the nature of the data source, and the process of managing this data and making it usable for analysis often involves ETL (Extract, Transform, Load). In many cases, &lt;em&gt;extracted&lt;/em&gt; raw data is not &lt;em&gt;loaded&lt;/em&gt; directly into a database; it has to be cleaned and &lt;em&gt;transformed&lt;/em&gt; before it is suitable for machine learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Data is pulled from various sources like databases, sensors, or online services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; The extracted data is then cleaned and transformed into a suitable format. This might include handling inconsistencies, converting data types, aggregating information, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Finally, the cleaned and structured data is loaded into a destination system such as a database where it can be accessed and analyzed.&lt;/li&gt;
&lt;/ul&gt;
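
&lt;p&gt;To make the flow concrete, here is a minimal sketch of all three steps in Python; the file, column names, and SQLite destination are assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
import pandas as pd

# Extract: pull raw data from a source (here, a hypothetical CSV file)
df = pd.read_csv('raw_events.csv')

# Transform: clean and normalize it into an analysis-ready format
df = df.dropna(subset=['user_id'])                   # drop rows missing a key field
df['event_time'] = pd.to_datetime(df['event_time'])  # convert data types

# Load: write the cleaned data into a destination database
conn = sqlite3.connect('warehouse.db')
df.to_sql('events', conn, if_exists='replace', index=False)
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;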

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh3gzmxk0czgenn73uu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh3gzmxk0czgenn73uu8.png" alt="etl"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Extraction Tools
&lt;/h2&gt;

&lt;p&gt;The choice of an ETL tool should be guided by the specific needs of a project. For example, Amazon Kinesis is a powerful tool for real-time data streaming and analytics, often used in the extraction phase of ETL in AWS environments. It's designed to ingest massive amounts of data like video, audio, and application logs at high speed. This real-time processing allows for immediate insights and responsiveness, distinguishing Kinesis from traditional batch-based ETL tools.&lt;/p&gt;

&lt;p&gt;Kinesis might be the best ETL tool in an AWS environment, but there are other options depending on your needs. To name a couple, you might consider Apache Kafka, a great fit for high-throughput real-time data processing, or Microsoft SQL Server Integration Services (SSIS), designed for businesses embedded in the Microsoft ecosystem. There are tons of tools out there, each with its own benefits.&lt;/p&gt;
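
&lt;p&gt;As a taste of the extraction side, here is a minimal boto3 sketch that pushes one record into a Kinesis stream; the stream name, record fields, and configured AWS credentials are all assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3

# Assumes AWS credentials are configured and the stream already exists
kinesis = boto3.client('kinesis')

record = {'sensor_id': 'sensor-42', 'temperature': 21.7}

# The partition key controls which shard the record lands on
kinesis.put_record(
    StreamName='my-data-stream',  # hypothetical stream name
    Data=json.dumps(record).encode('utf-8'),
    PartitionKey=record['sensor_id']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;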

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8da6mlj7ndmqwl21c5zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8da6mlj7ndmqwl21c5zl.png" alt="catmeme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Transformation Tools
&lt;/h2&gt;

&lt;p&gt;The choice of a transformation tool again depends on your needs.&lt;br&gt;
Apache Spark excels at rapid transformations with in-memory processing, while Microsoft's Power Query is favored for Excel and Power BI thanks to its graphical interface. AWS Glue is a fully managed, serverless data integration service that simplifies and automates the transformation process. Each comes with its own benefits and drawbacks, so the choice hinges on data size and complexity, processing speed, and user expertise.&lt;/p&gt;
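
&lt;p&gt;For a flavor of what a Spark transformation looks like in code, here is a minimal PySpark sketch; it assumes a local Spark installation and a hypothetical raw_events.csv file with an event_time column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('transform-demo').getOrCreate()

# Extract: read the raw file (column names here are hypothetical)
df = spark.read.csv('raw_events.csv', header=True, inferSchema=True)

# Transform: drop incomplete rows and normalize a timestamp column,
# with the work distributed in memory
cleaned = df.dropna().withColumn('event_time', F.to_timestamp('event_time'))

# Load: write the result out in a columnar format
cleaned.write.mode('overwrite').parquet('events_clean.parquet')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;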

&lt;h2&gt;
  
  
  File Formats and Storage Solutions
&lt;/h2&gt;

&lt;p&gt;So far we've covered extracting and transforming data, but what exactly are we transforming it into? Well, data comes in various formats, each catering to different requirements and use cases. The chosen file format often influences the way data is stored and managed in databases. Here's a look at some common file formats and corresponding storage solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  CSV, Excel, Parquet, and ORC Files
&lt;/h3&gt;

&lt;p&gt;For structured data, CSV, Excel, Parquet, and ORC files are popular choices. CSV and Excel are simple, human-readable, and can be easily manipulated using spreadsheet software. Parquet and ORC are columnar storage formats designed for efficiency. These file formats are often used for storing tabular data in &lt;strong&gt;Relational Databases&lt;/strong&gt; like MySQL and PostgreSQL, which provide robust querying capabilities, data integrity, and scalability.&lt;/p&gt;
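
&lt;p&gt;As a quick illustration, pandas can write the same dataframe to either format; note that to_parquet assumes a Parquet engine such as pyarrow is installed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({'city': ['New York', 'London'],
                   'population': [8400000, 8900000]})

# Row-oriented, human-readable text
df.to_csv('cities.csv', index=False)

# Column-oriented, compressed binary format (requires pyarrow or fastparquet)
df.to_parquet('cities.parquet')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;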

&lt;h3&gt;
  
  
  JSON, XML, Avro, and Protobuf
&lt;/h3&gt;

&lt;p&gt;JSON (JavaScript Object Notation), XML (eXtensible Markup Language), Avro, and Protobuf are commonly used for semi-structured data. These formats allow hierarchical data representation, making them suitable for complex data structures. They are widely used in web development, API responses, and configuration files, and can be stored in &lt;strong&gt;NoSQL Databases&lt;/strong&gt; like MongoDB, DynamoDB, or Cassandra, offering more flexibility in data models and allowing for horizontal scaling.&lt;/p&gt;
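
&lt;p&gt;For example, a single JSON document can nest related data that would take several relational tables to represent, and pandas can flatten it when you need a tabular view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import pandas as pd

doc = {
    'user': 'stevenmcgown',
    'profile': {'city': 'New York', 'followers': 120},
    'tags': ['ai', 'datascience']
}

# Serialize the hierarchical structure to a JSON string
print(json.dumps(doc))

# Flatten the nested fields into columns for tabular analysis
print(pd.json_normalize(doc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;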

&lt;h3&gt;
  
  
  Images, Audio, Video Files, and HDF5
&lt;/h3&gt;

&lt;p&gt;Unstructured data often takes the form of multimedia files like images, audio, videos, and large numerical data like HDF5. They may require specialized preprocessing techniques such as feature extraction, image recognition, or natural language processing (NLP) to be utilized in machine learning models. These types of files are typically stored in Object Storage Systems like AWS S3, Azure Blob Storage, or Google Cloud Storage. These systems are commonly used for large, &lt;strong&gt;unstructured datasets&lt;/strong&gt; and provide scalability, particularly useful when working with big data.&lt;/p&gt;
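
&lt;p&gt;Getting such files into object storage is usually a one-liner; here is a minimal boto3 sketch, assuming configured AWS credentials and a hypothetical bucket name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

s3 = boto3.client('s3')

# Upload a local file to an S3 bucket under a chosen key
s3.upload_file('cat_photo.png', 'my-ml-datasets', 'images/cat_photo.png')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;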

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The world of data isn't just about having it; it's about molding it in a way that extracts its maximum potential. This requires both a deep understanding of the available tools and an intricate knowledge of the data itself. Each transformation step can potentially unlock new insights or, if done incorrectly, can lead to misguided conclusions.&lt;/p&gt;

&lt;p&gt;We now know the basic flow of ETL and its place in the ML Cycle, but there are still questions to be answered. What kinds of transformations are most impactful for specific datasets? How do these transformations vary when dealing with structured versus unstructured data? And are there transformations that universally benefit every dataset, or are some uniquely tailored for specific contexts? These are all questions that I aim to answer in the following posts in section 2 of this series.&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Science Zero to Hero - 2.1: The Machine Learning Cycle</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Wed, 09 Aug 2023 21:15:40 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/data-science-zero-to-hero-21-the-machine-learning-cycle-13ag</link>
      <guid>https://forem.com/stevenmcgown/data-science-zero-to-hero-21-the-machine-learning-cycle-13ag</guid>
      <description>&lt;h1&gt;
  
  
  Data Collection and Preparation: ML Concepts
&lt;/h1&gt;

&lt;p&gt;Data, data, data! If there's one thing you should take away from this series, it's that data is &lt;strong&gt;super important&lt;/strong&gt; to data scientists and Machine Learning Engineers alike. In the previous posts, we talked about different ways of transforming and visualizing data with Python, and those libraries are certainly powerful tools, but where does this data come from anyway? How do we collect it? Where do we store it? What other steps might need to be taken? Furthermore, what do we do with that data once we have it? Questions like these are best answered with the steps in the &lt;em&gt;Machine Learning Cycle.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I8CC4xeL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/06lk5mkb7bmf0g5gpbxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I8CC4xeL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/06lk5mkb7bmf0g5gpbxu.png" alt="mlcycle" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You see, the process of machine learning is cyclical because its purpose is to improve a machine learning model's performance on a &lt;em&gt;specific task&lt;/em&gt; as new data becomes available. These models should be flexible enough to incorporate that new data so that the model doesn't become &lt;em&gt;biased&lt;/em&gt;. This includes understanding how the data was collected and ensuring that it represents the problem space without inherent biases related to factors like gender, ethnicity, or socioeconomic status.&lt;/p&gt;

&lt;p&gt;Depending on who you ask, the Machine Learning Cycle may vary, but the basic idea is this: get the data, train a model, and then evaluate the results before starting the process over again. Let's break down the graphic above step by step.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Collection:&lt;/strong&gt; This is the initial phase where all relevant data is gathered from various sources, whether structured or unstructured. It may include collecting data from databases, sensors, online sources, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation:&lt;/strong&gt; Here, the data is cleaned and transformed into a suitable format for training models. It might involve handling missing values, encoding categorical variables, or normalizing numerical features (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory Data Analysis (EDA):&lt;/strong&gt; Before diving into modeling, EDA helps in understanding the data's characteristics, distribution, and patterns. This step often includes visualizations, statistical tests, and preliminary insights. &lt;em&gt;Note: A preliminary or light EDA might be performed on the raw data before transformation, especially if you are dealing with an entirely new dataset, to get an initial sense of the data and to identify the transformations that might be needed.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training and Evaluation:&lt;/strong&gt; This phase includes selecting the appropriate algorithm, training the model on the training dataset, and evaluating its performance using techniques like cross-validation on a validation dataset. Adjustments may be made to optimize performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Deployment:&lt;/strong&gt; Once the model is trained and validated, it's deployed into a production environment where it can start making predictions or decisions based on new data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Monitoring:&lt;/strong&gt; Continuous monitoring ensures that the model performs well with the real-world data it encounters. Monitoring can detect issues like drifts in data or degraded performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Retraining:&lt;/strong&gt; Models are not static; they might need to be retrained as new data becomes available or if the underlying data patterns change. Retraining ensures that the model remains accurate and relevant.&lt;/li&gt;
&lt;/ol&gt;
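
&lt;p&gt;To ground step 2, here is a minimal pandas sketch of the three transformations named there; the dataframe and column names are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({'age': [25, None, 35],
                   'city': ['NY', 'London', 'NY'],
                   'income': [48000, 52000, 61000]})

# Handle missing values by imputing the column median
df['age'] = df['age'].fillna(df['age'].median())

# Encode the categorical variable as one-hot columns
df = pd.get_dummies(df, columns=['city'])

# Normalize the numerical feature to the [0, 1] range
income = df['income']
df['income'] = (income - income.min()) / (income.max() - income.min())
print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;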

&lt;p&gt;Each of these steps plays a critical role in creating a robust machine learning model. They collectively contribute to a cyclical process of continuous improvement and adaptation. In the next posts, we'll expand on each of these concepts in much more detail.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Science Zero to Hero - 1.3: Matplotlib, Seaborn &amp; Jupyter Notebooks</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Thu, 27 Jul 2023 23:05:48 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/python-for-mlai-13-matplotlib-seaborn-jupyter-notebooks-2a76</link>
      <guid>https://forem.com/stevenmcgown/python-for-mlai-13-matplotlib-seaborn-jupyter-notebooks-2a76</guid>
      <description>&lt;p&gt;&lt;em&gt;You like python programs, don't you Squidward?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dq0bu2dmljjder8lh8u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dq0bu2dmljjder8lh8u.jpg" alt="meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;*crickets*&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'll be here all week!&lt;/p&gt;

&lt;p&gt;Jokes aside, if you're like me, you're getting excited about learning new tools for your Data Science/ML/AI journey. So far we've covered Numpy and Pandas, where we learned how to manipulate, process, and analyze numerical and tabular data. These libraries gave us a solid foundation in handling and preparing data for further analysis or modeling, and as we delve into Matplotlib and Seaborn inside of Jupyter Notebooks, we're now stepping into the fascinating world of data visualization. Trust me, it only gets better from here!&lt;/p&gt;

&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Line Plot&lt;/li&gt;
&lt;li&gt;Using Pandas with Matplotlib&lt;/li&gt;
&lt;li&gt;Scatter Plot&lt;/li&gt;
&lt;li&gt;Bar Plot&lt;/li&gt;
&lt;li&gt;Pie Plot&lt;/li&gt;
&lt;li&gt;Histogram&lt;/li&gt;
&lt;li&gt;Box Plot&lt;/li&gt;
&lt;li&gt;Violin Plot&lt;/li&gt;
&lt;li&gt;Strip Plot&lt;/li&gt;
&lt;li&gt;Pair Plot&lt;/li&gt;
&lt;li&gt;Distribution Plot&lt;/li&gt;
&lt;li&gt;Count Plot&lt;/li&gt;
&lt;li&gt;Heat Map&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Data Visualization with Matplotlib and Seaborn in Jupyter Notebooks
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Matplotlib&lt;/strong&gt; is one of the most widely used libraries for creating static, animated, and interactive visualizations in Python. Its extensive functionality and versatility make it a powerful tool for any data scientist or analyst to perform &lt;strong&gt;Exploratory Data Analysis (EDA)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once imported, Matplotlib provides a wide variety of plots and charts to visualize data, from simple line and bar plots to more complex scatter plots and histograms. Whether you're trying to spot trends over time, distributions of data, or relationships between variables, Matplotlib has the flexibility to meet your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seaborn&lt;/strong&gt;, while built on Matplotlib, enhances its capabilities and introduces more sophisticated visualization tools. It's designed to work seamlessly with Pandas dataframes and makes creating complex plots from dataframes quite straightforward.&lt;/p&gt;

&lt;p&gt;With Seaborn, you can create a range of informative and attractive statistical graphics. Heat maps, violin plots, pair plots, and swarm plots are just a few of the more advanced visualizations available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3les5k1g0y4i4p8mvwx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3les5k1g0y4i4p8mvwx.jpg" alt="matplotlibseabornmeme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both Matplotlib and Seaborn work exceptionally well in &lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;, a popular open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Jupyter Notebooks provide an interactive and intuitive interface for conducting data analysis and visualization.&lt;/p&gt;

&lt;p&gt;To use Matplotlib or Seaborn in Jupyter Notebooks, you simply need to import the required libraries and execute your code. The outputs, including all graphs and plots, are then displayed directly under each code cell, making it easy to view and interpret your results in a structured and clear manner.&lt;/p&gt;

&lt;p&gt;Typically when people use these libraries, they do the imports with the following aliases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now for the fun. Let's see what these libraries can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Line Plot
&lt;/h2&gt;

&lt;p&gt;Let's start with a simple example. A line plot is used to display information as a series of data points connected by straight line segments. It's useful for visualizing data over time, also known as time series data. In this example, we show the unemployment rate each year. In the code, we have two lists with an equal number of values, and we plot each unemployment rate against its corresponding year.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 6.3]

plt.plot(year, unemployment_rate)
plt.title('unemployment rate vs year')
plt.xlabel('year')
plt.ylabel('unemployment rate')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1rtva1zkhxw7o6eat5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1rtva1zkhxw7o6eat5b.png" alt="lineplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Side note: don't ask me where this data came from; it could be wrong for all I know and should only be used for demonstration purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Pandas with Matplotlib
&lt;/h2&gt;

&lt;p&gt;Now that we have taken a stab at using matplotlib, let's load a dataset from a CSV file into a pandas dataframe. We can take the information from this dataframe and plot it with a variety of different methods. You can download the dataset I'm using from here:&lt;br&gt;
&lt;a href="https://www.kaggle.com/code/sanjanabasu/tips-dataset/input" rel="noopener noreferrer"&gt;https://www.kaggle.com/code/sanjanabasu/tips-dataset/input&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, recall that above we imported pandas as 'pd'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df=pd.read_csv('tips.csv')

# Print DataFrame
print(df.head())

# Outputs:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scatter Plot
&lt;/h2&gt;

&lt;p&gt;A scatter plot uses dots to represent values for two different numeric variables. The position of each dot represents the value of a data point, which is useful for visualizing the relationship between two variables. Many scatter plots simply use one color of dots to illustrate the relationship between two variables, but in this example I've shown that you can describe the characteristics of your data points better with a little bit of creativity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prepare df for plotting
total_bill = df['total_bill']
tip = df['tip']
sex = df['sex']
smoker = df['smoker']

# Create a scatter plot
plt.figure(figsize=(10, 6))
colors = {'Male': 'blue', 'Female': 'red'}
smoker_markers = {'Yes': 'x', 'No': 'o'}

for i in range(len(total_bill)):
    plt.scatter(total_bill[i], tip[i], c=colors[sex[i]], marker=smoker_markers[smoker[i]], s=100)

# Set plot labels and title
plt.xlabel('Amount Due')
plt.ylabel('Gratuity')
plt.title('Scatter Plot of Amount Due vs. Gratuity')

# Add legend for gender and smoker status
for gender_label, color in colors.items():
    plt.scatter([], [], c=color, label=gender_label)
for smoker_label, marker in smoker_markers.items():
    plt.scatter([], [], marker=marker, label='Smoker: ' + smoker_label)

plt.legend(loc='upper right')

plt.grid(True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g7gqbwuqpow06kdplh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g7gqbwuqpow06kdplh6.png" alt="scatterplot"&gt;&lt;/a&gt;&lt;br&gt;
We can see from the resulting plot that there is a positive correlation between the x and y variables: the higher the bill, the higher the tip tends to be. This makes sense if you think about how many people tip based on a percentage of the bill.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bar Plot
&lt;/h2&gt;

&lt;p&gt;Bar plots are used to display and compare the number, frequency, or other measure (like the mean) for different categories. Each bar's height is proportional to the value it represents. In this plot we can see that there are 4 days recorded: Friday, Saturday, Sunday, and Thursday. I suppose we can assume that the person who recorded this data only worked and recorded tips on those days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Group the data by 'day' and calculate the average 'total_bill' for each day
average_total_bill_by_day = df.groupby('day')['total_bill'].mean()

# Create the bar plot
plt.bar(average_total_bill_by_day.index, average_total_bill_by_day.values)
plt.xlabel('Day of the Week')
plt.ylabel('Average Total Bill')
plt.title('Average Total Bill by Day of the Week')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpolwjeb77o5bd1ua37lw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpolwjeb77o5bd1ua37lw.png" alt="barplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pie Plot
&lt;/h2&gt;

&lt;p&gt;Pie plots represent the size of items (out of 100%) in one data series, proportional to the sum of the items. We've all seen a pie plot before; it's useful when you want to visualize the percentage breakdown of categories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Group the data by 'sex' and calculate the total count for each category
sex_counts = df['sex'].value_counts()

# Create the pie plot
plt.figure(figsize=(6, 6))
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Sex')
plt.axis('equal')  # Equal aspect ratio ensures that the pie plot is circular.
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhfr0qldc0zswultbcrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhfr0qldc0zswultbcrl.png" alt="pieplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram
&lt;/h2&gt;

&lt;p&gt;Histograms show the distribution of numeric data by dividing the data into bins of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin. We'll get more into distributions in the future; they're important to understanding the nature of your data.&lt;/p&gt;

&lt;p&gt;You may also notice in the code that the argument kde is set to True. This stands for Kernel Density Estimation, and it allows us to visualize the data distribution in a smooth and continuous manner, avoiding the limitations of discrete binning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(8, 6))
sns.histplot(df['total_bill'], kde=True)
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.title('Total Bill Histogram')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjgh5d15ocmghbhvzjp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjgh5d15ocmghbhvzjp3.png" alt="histogram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Box Plot
&lt;/h2&gt;

&lt;p&gt;A box plot, also known as a box and whisker plot, shows the quartiles of the dataset and is useful for visualizing the distribution and skewness of your data. It also identifies outliers in your data. We'll go deeper into quartiles in future posts about distributions as well. For now, think of it this way:&lt;/p&gt;

&lt;p&gt;A box plot divides your data into four equal parts, with each part representing a &lt;em&gt;quarter&lt;/em&gt; of the data points. The "box" in the plot represents the middle 50% of the data, where the lower boundary of the box is the first quartile (Q1) and the upper boundary is the third quartile (Q3). The line inside the box represents the median (Q2), which is the middle value of the dataset.&lt;/p&gt;

&lt;p&gt;Additionally, the "whiskers" extend from the box and indicate the range of the data, excluding outliers. Typically, the whiskers encompass data within 1.5 times the interquartile range (IQR), which is the difference between Q3 and Q1. Data points outside this range are considered outliers and are represented as individual points beyond the whiskers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(8, 6))
sns.boxplot(data=df[['total_bill', 'tip']])  # numeric columns from the tips dataset
plt.title('Box Plot')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikph1bp2s66d5daohvn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikph1bp2s66d5daohvn1.png" alt="boxplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Violin Plot
&lt;/h2&gt;

&lt;p&gt;A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.&lt;/p&gt;

&lt;p&gt;The real value in using a violin plot is that it not only displays the quartile information like a box plot, but it also provides a more detailed view of the data distribution by showing the probability density of the data at different values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='sex', data=df, split=True, palette='muted')
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill')
plt.title('Total Bill Distribution by Day and Sex')
plt.legend(title='Sex', loc='upper right')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag1qpg3h4am53au6qvoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag1qpg3h4am53au6qvoh.png" alt="stripplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Strip Plot
&lt;/h2&gt;

&lt;p&gt;A strip plot is used to represent the distribution of data. It's a good complement to a box or violin plot in cases where all observations along each category can be shown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(8, 6))
sns.stripplot(data=df, x='day', y='total_bill', jitter=True)  # columns from the tips dataset
plt.title('Strip Plot')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91knawse6wdci0t749ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91knawse6wdci0t749ii.png" alt="stripplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pair Plot
&lt;/h2&gt;

&lt;p&gt;Pair plots visualize the pairwise relationships between the columns of a dataset. They are an effective way to quickly identify patterns, correlations, and trends, and a useful exploratory data analysis tool when dealing with datasets containing multiple numerical variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.pairplot(df, hue='D')
plt.title('Pair Plot')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl631ge3pzc222oh3mpjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl631ge3pzc222oh3mpjt.png" alt="pairplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Distribution Plot
&lt;/h2&gt;

&lt;p&gt;A distribution plot visualizes the distribution of a univariate set of observations. In Seaborn, this is mainly done through the histplot function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Distribution Plot
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='tip', kde=True)  # a numeric column from the tips dataset
plt.title('Distribution Plot')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2me9t3bl7m7nv66ra8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2me9t3bl7m7nv66ra8k.png" alt="distributionplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Count Plot
&lt;/h2&gt;

&lt;p&gt;A count plot can be thought of as a histogram over a categorical variable instead of a quantitative one. It shows the counts of observations in each categorical bin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count Plot
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='day')  # a categorical column from the tips dataset
plt.title('Count Plot')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7qfzy071c1rnxkc0k5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7qfzy071c1rnxkc0k5i.png" alt="countplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Heat Map
&lt;/h2&gt;

&lt;p&gt;A heat map is a two-dimensional representation of information with the help of colors. In the context of data visualization, it is used to represent the correlation between different features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Heatmap
# One-hot encode categorical columns (creating columns like day_Thur and
# time_Dinner) so they appear in the correlation matrix
corr = pd.get_dummies(df).corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Heat Map')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbm7z3yc66nwd3ex4bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbm7z3yc66nwd3ex4bv.png" alt="heatmap"&gt;&lt;/a&gt;&lt;br&gt;
Heat maps are exceptionally powerful as they provide an intuitive and visually striking representation of data. By using a color-coded system to display values on a 2D matrix, heat maps allow us to grasp complex patterns, trends, and relationships within the data at a glance.&lt;/p&gt;

&lt;p&gt;We can see a few different things in this plot, and the first thing that might stick out to you is the red diagonal line of 1's. These all indicate a 100% correlation, which makes sense because each variable is perfectly correlated with itself.&lt;/p&gt;

&lt;p&gt;If you look in the first column where total_bill is being compared with tip, we can see that there is a relatively strong correlation. This is in line with our assumption earlier that larger bill totals tend to garner larger tips. We can also see that there's a relatively strong correlation between large party size and tips, as well as total bill, which makes sense.&lt;/p&gt;

&lt;p&gt;On the flip side, there's the cool blue side of the spectrum, which indicates negative correlation. If we look where time_Dinner and day_Thur intersect, we can see there is a very strong negative correlation between the two variables. Saturday and Sunday seem to follow opposite trends.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Well, we've made it to the end. I hope you have enjoyed this post on data visualization with Matplotlib and Seaborn! I highly recommend using these plots for your data science projects, as they will not only make your analyses more insightful and compelling but also enable you to effectively communicate your findings to others.&lt;/p&gt;

&lt;p&gt;Happy visualizing and exploring the exciting world of data science! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Science Zero to Hero - 1.2: Pandas</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Wed, 26 Jul 2023 22:59:45 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/data-science-zero-to-hero-12-pandas-5hb0</link>
      <guid>https://forem.com/stevenmcgown/data-science-zero-to-hero-12-pandas-5hb0</guid>
      <description>&lt;p&gt;Pandas are nature's adorable, bamboo-munching- wait, not that kind of panda...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvtw7205qir18dzywdg7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvtw7205qir18dzywdg7.jpg" alt="panda" width="660" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Much less cute but much more useful for data science, Pandas is a popular open-source Python library that provides powerful data manipulation and analysis tools. The name "Pandas" is derived from "Panel Data," reflecting its original focus on handling and analyzing financial data with panel data structures. &lt;/p&gt;

&lt;p&gt;It is built on top of NumPy and offers easy-to-use data structures and data analysis functionalities. In this blog post, we will explore various features and capabilities of Pandas, including Series, DataFrames, reading data, manipulation techniques, merging and joining, reshaping data, pivot tables, duplication, mapping and replacing values, and grouping data.&lt;/p&gt;

&lt;p&gt;This one's going to be long, so put on your seat belts!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pandas Series: Foundations of Data Manipulation&lt;/li&gt;
&lt;li&gt;DataFrames: Tabular Data Made Easy&lt;/li&gt;
&lt;li&gt;Reading Data into Pandas&lt;/li&gt;
&lt;li&gt;Concatenation, Merge, and Joining: Combining DataFrames&lt;/li&gt;
&lt;li&gt;Reshaping Data: Pivoting and Melting&lt;/li&gt;
&lt;li&gt;Duplication: Identifying and Handling Duplicate Data&lt;/li&gt;
&lt;li&gt;Map and Replace: Modifying Values in DataFrames&lt;/li&gt;
&lt;li&gt;GroupBy: Aggregating and Analyzing Data&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Pandas Series: Foundations of Data Manipulation
&lt;/h2&gt;

&lt;p&gt;At the core of Pandas lies the concept of a &lt;strong&gt;Series&lt;/strong&gt;. A Series is a one-dimensional labeled array that can hold any data type, such as integers, strings, or even Python objects. It consists of two main components: the index and the data.&lt;/p&gt;

&lt;p&gt;The index is an array-like structure that holds labels for each element in the Series, allowing for fast and efficient data access. The data component contains the actual values associated with each index label.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a series
&lt;/h3&gt;

&lt;p&gt;We can create a series from a list like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Creating a Series from a list
my_list = [10, 20, 30, 40, 50]
my_series = pd.Series(my_list)
print(my_series)

# Outputs:
0    10
1    20
2    30
3    40
4    50
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Indexing and slicing a Series
&lt;/h3&gt;

&lt;p&gt;To index a series, we simply put the index of the element that we want to retrieve in brackets. As we have seen before in NumPy, we can slice the series by indexing with an inclusive start value and an exclusive end value separated by a colon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(my_series[2])
print(my_series[1:4])

# Outputs:
30
1    20
2    30
3    40
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filtering values in a Series
&lt;/h3&gt;

&lt;p&gt;...is as simple as this. Remember that our series contains the values 10, 20, 30, 40, and 50, so it makes sense that we only see the values greater than 30 here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(my_series[my_series &amp;gt; 30])

# Outputs:
3    40
4    50
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performing arithmetic operations on a Series
&lt;/h3&gt;

&lt;p&gt;In this example, we're multiplying all of the values in the series by 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(my_series * 2)

# Outputs:
0     20
1     40
2     60
3     80
4    100
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manipulating Series is a fundamental task in Pandas. You can create a Series from various data sources, such as Python lists or dictionaries. Additionally, you can perform operations like indexing, slicing, filtering, and arithmetic operations on Series, enabling powerful data transformations.&lt;/p&gt;

&lt;h2&gt;
  
  
  DataFrames: Tabular Data Made Easy
&lt;/h2&gt;

&lt;p&gt;It's worth noting at this point that Pandas isn't really great for manipulating &lt;em&gt;large&lt;/em&gt; datasets for a number of reasons (it doesn't parallelize operations, it loads the whole dataset into memory, etc.), so you may be waiting a while if you try these operations on a very large dataset. Other data manipulation libraries like Dask and Vaex may work better with larger datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys1qasil30kcdf7fvcba.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys1qasil30kcdf7fvcba.jpg" alt="skeleton" width="193" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have learned about series, which are one-dimensional, we will look at &lt;strong&gt;DataFrames&lt;/strong&gt;.&lt;br&gt;
DataFrames are two-dimensional labeled data structures in Pandas, inspired by the concept of tables in relational databases. They are essentially a collection of Series that share a common index, allowing for intuitive and efficient data handling.&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating a Pandas data frame from a python dictionary
&lt;/h3&gt;

&lt;p&gt;Pandas easily transforms Python dictionaries into data frames for efficient data manipulation and analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Emma', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Sydney']}
df = pd.DataFrame(data)
print(df)

# Outputs:
   Name  Age       City
0  John   25   New York
1  Emma   30     London
2  Ryan   35     Sydney
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Selecting specific columns in a DataFrame
&lt;/h3&gt;

&lt;p&gt;Just like you can index a series, you can index the columns of a data frame as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df['Name'])
print(df[['Name', 'Age']])

# Outputs:
0    John
1    Emma
2    Ryan
Name: Name, dtype: object

   Name  Age
0  John   25
1  Emma   30
2  Ryan   35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filtering rows based on conditions
&lt;/h3&gt;

&lt;p&gt;This works exactly like filtering on series.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df[df['Age'] &amp;gt; 28])

# Outputs:
   Name  Age      City
1  Emma   30    London
2  Ryan   35    Sydney
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sorting a DataFrame
&lt;/h3&gt;

&lt;p&gt;You can sort the rows of a dataframe by the values in one or more columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.sort_values('Age'))

# Outputs:
   Name  Age       City
0  John   25   New York
1  Emma   30     London
2  Ryan   35     Sydney

# Applying aggregate functions
print(df['Age'].mean())

# Outputs:
30.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manipulating DataFrames offers a wide range of possibilities for data analysis. You can create DataFrames from various data sources, including CSV files, Excel spreadsheets, and SQL databases. Once you have a DataFrame, you can perform operations like selecting specific columns, filtering rows based on conditions, sorting data, and applying aggregate functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading Data into Pandas
&lt;/h3&gt;

&lt;p&gt;Pandas provides several functions to read data from different file formats. Some commonly used methods include &lt;code&gt;read_csv()&lt;/code&gt;, &lt;code&gt;read_excel()&lt;/code&gt;, and &lt;code&gt;read_sql()&lt;/code&gt;. These functions allow you to load data into DataFrames, making it easy to analyze and manipulate the data using Pandas' powerful functionality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Reading data from a CSV file
df = pd.read_csv('data.csv')
print(df.head())

# Outputs:
   Column1  Column2  Column3
0        1        2        3
1        4        5        6
2        7        8        9
3       10       11       12
4       13       14       15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
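
&lt;p&gt;The other two readers work the same way; here is a minimal sketch with hypothetical file, database, and table names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Reading from an Excel sheet (requires an engine such as openpyxl)
df_xlsx = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Reading the result of a SQL query through a database connection
import sqlite3
conn = sqlite3.connect('data.db')
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;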



&lt;h2&gt;
  
  
  Concatenation, Merge, and Joining: Combining DataFrames
&lt;/h2&gt;

&lt;p&gt;Concatenation, merging, and joining are techniques used to combine multiple DataFrames into a single DataFrame, allowing for comprehensive data analysis. Concatenation is the process of stacking DataFrames vertically or horizontally. Merging involves combining DataFrames based on common columns, similar to SQL joins. Joining is the process of combining DataFrames based on their index.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concatenating Dataframes
&lt;/h3&gt;

&lt;p&gt;You can concatenate along two axes - axis 0 (rows) or axis 1 (columns). The default is axis 0. During concatenation, make sure the column names and index labels match correctly. If they don't, Pandas will create NaN values for non-matching elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Concatenating DataFrames vertically
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9],
                    'B': [10, 11, 12]})
result = pd.concat([df1, df2])
print(result)

# Outputs:
   A   B
0  1   4
1  2   5
2  3   6
0  7  10
1  8  11
2  9  12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Merging DataFrames based on a common column
&lt;/h3&gt;

&lt;p&gt;You need a common key or set of keys to merge DataFrames. These keys serve as the basis for matching and combining rows. Pandas supports different types of merges - 'inner', 'outer', 'left', and 'right'. Each type specifies how the rows are combined based on the keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df1 = pd.DataFrame({'Key': ['A', 'B', 'C'],
                    'Value': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['B', 'C', 'D'],
                    'Value': [4, 5, 6]})
result = pd.merge(df1, df2, on='Key')
print(result)

# Outputs:
  Key  Value_x  Value_y
0   B        2        4
1   C        3        5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These operations provide flexibility in data integration and enable the user to perform more complex analyses by leveraging data from multiple sources.&lt;/p&gt;
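
&lt;p&gt;The example above uses the default 'inner' merge, which keeps only the keys present in both frames. Here is a quick sketch of an 'outer' merge and an index-based join using the same two DataFrames:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# An outer merge keeps every key, filling gaps with NaN
print(pd.merge(df1, df2, on='Key', how='outer'))

# Joining combines DataFrames on their index instead of a column
left = df1.set_index('Key')
right = df2.set_index('Key')
print(left.join(right, lsuffix='_x', rsuffix='_y'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;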

&lt;h2&gt;
  
  
  Reshaping Data: Pivoting and Melting
&lt;/h2&gt;

&lt;p&gt;Reshaping data is a common task in data analysis, and Pandas provides functions to pivot and melt DataFrames. Pivoting involves transforming data from a "long" format to a "wide" format, creating new columns based on unique values in an existing column. Melt, on the other hand, transforms data from a "wide" format to a "long" format, unpivoting the data.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvgfs6xh87nxb7dfvj49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvgfs6xh87nxb7dfvj49.png" alt="pivot" width="698" height="536"&gt;&lt;/a&gt;&lt;br&gt;
So they're just the inverse operation of each other. Changing the structure from wide to long or long to wide is often necessary to adapt the data to different analysis, modeling, or visualization requirements. Each format has its advantages and is suitable for specific scenarios. For this section, I would highly suggest paying close attention to the outputs of these examples to see what they're doing. Not sure why, but it took me some time to understand these concepts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pivoting
&lt;/h3&gt;

&lt;p&gt;Again, pivoting is used to turn dataframes from long to wide. Here are some examples of when that might be useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redundancy Reduction:&lt;/strong&gt; Wide-format data can help reduce redundancy, especially when dealing with sparse data. By pivoting, you can consolidate related information into a more compact representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization:&lt;/strong&gt; In some cases, a wide-format presentation may be more intuitive or easier to understand for certain types of visualizations, especially when there are fewer categorical variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exporting Data:&lt;/strong&gt; For some specific use cases or external tools, a wide-format might be the preferred format for data export.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Pivoting a DataFrame
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
                   'City': ['New York', 'London', 'Sydney'],
                   'Temperature': [32, 28, 35]})
pivot_table = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_table)

# Outputs:
City        London  New York  Sydney
Date                               
2023-01-01    NaN      32.0     NaN
2023-01-02   28.0       NaN     NaN
2023-01-03    NaN       NaN    35.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Melting
&lt;/h3&gt;

&lt;p&gt;Once more, melting is for reshaping dataframes from wide to long.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation and Analysis:&lt;/strong&gt; When you need to perform aggregate functions or statistical analysis on multiple related columns, melting the data into a long format is often more convenient. It allows you to treat the column names as data values, making it easier to apply operations uniformly across different groups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization:&lt;/strong&gt; Certain visualization libraries or tools work better with data in a long format. For example, tools like Seaborn or Plotly often expect data in a long format for categorical data plotting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization:&lt;/strong&gt; Long-format data can help in data normalization and provide a standardized way to handle repeated measures.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({'Name': ['John', 'Emma', 'Tom'],
                   'Math': [95, 87, 92],
                   'Science': [88, 90, 85]})
melted_df = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
print(melted_df)

# Outputs:
   Name  Subject  Score
0  John     Math     95
1  Emma     Math     87
2   Tom     Math     92
3  John  Science     88
4  Emma  Science     90
5   Tom  Science     85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These reshaping techniques are particularly useful when working with datasets that require restructuring for better analysis and visualization.&lt;/p&gt;
&lt;h2&gt;
  
  
  Duplication: Identifying and Handling Duplicate Data
&lt;/h2&gt;

&lt;p&gt;Data duplication is a common issue in real-world datasets. Pandas offers functions to identify and handle duplicate data effectively. You can use methods like &lt;code&gt;duplicated()&lt;/code&gt; and &lt;code&gt;drop_duplicates()&lt;/code&gt; to detect and remove duplicate rows from DataFrames. By addressing duplication, you can ensure data integrity and obtain accurate insights from your analysis.&lt;/p&gt;
&lt;h3&gt;
  
  
  Duplicating
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Identifying duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'John'],
                   'Age': [25, 30, 25]})
duplicated_rows = df.duplicated()
print(duplicated_rows)

# Outputs:
0    False
1    False
2     True
dtype: bool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Dropping duplicate rows
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.drop_duplicates()
print(df)

# Outputs:
   Name  Age
0  John   25
1  Jane   30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Map and Replace: Modifying Values in DataFrames
&lt;/h2&gt;

&lt;p&gt;Pandas provides convenient methods for mapping and replacing values in DataFrames. The &lt;code&gt;map()&lt;/code&gt; function allows you to create new columns based on existing values or apply custom transformations to existing columns. The &lt;code&gt;replace()&lt;/code&gt; function is useful for substituting specific values or patterns in a DataFrame with new values.&lt;/p&gt;
&lt;h3&gt;
  
  
  Mapping Values
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Mapping values using a dictionary
df = pd.DataFrame({'Grade': ['A', 'B', 'C']})
grades_mapping = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
df['Grade'] = df['Grade'].map(grades_mapping)
print(df)

# Outputs:
       Grade
0  Excellent
1       Good
2    Average
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
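
&lt;p&gt;You can also pass a function to map() for custom transformations. Here is a small sketch (the 'Score' column is hypothetical, added just for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mapping values using a function instead of a dictionary
df = pd.DataFrame({'Score': [95, 87, 92]})
df['Score'] = df['Score'].map(lambda x: x / 100)
print(df)

# Outputs:
   Score
0   0.95
1   0.87
2   0.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;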

&lt;h3&gt;
  
  
  Replacing Values
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({'Age': [25, 30, 35, 40]})
df['Age'] = df['Age'].replace({30: 31, 35: 36})
print(df)

# Outputs:
   Age
0   25
1   31
2   36
3   40

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
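
&lt;p&gt;replace() can also match patterns when you pass regex=True. A small sketch that strips non-digit characters from phone numbers (the 'Phone' column is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replacing by regex pattern: remove every non-digit character
df = pd.DataFrame({'Phone': ['(555) 123-4567', '555.987.6543']})
df['Phone'] = df['Phone'].replace(r'\D', '', regex=True)
print(df)

# Outputs:
        Phone
0  5551234567
1  5559876543
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;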


&lt;p&gt;These operations help in cleaning and transforming data, enabling you to make data more consistent and suitable for further analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  GroupBy: Aggregating and Analyzing Data
&lt;/h2&gt;

&lt;p&gt;GroupBy operations in Pandas allow you to split data into groups based on specified criteria, apply aggregation functions to each group, and combine the results. This functionality is invaluable for statistical analysis, as it enables you to compute summary statistics, perform group-level calculations, and gain insights into the data distribution.&lt;/p&gt;
&lt;h3&gt;
  
  
  GroupBy
&lt;/h3&gt;

&lt;p&gt;In this example, we create a DataFrame 'df' with population data for cities 'A' and 'B.' We then group the data by city and calculate the mean population for each group, creating a new DataFrame 'grouped_df' with the results. The output shows the average population for cities 'A' and 'B.'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Grouping and calculating mean
df = pd.DataFrame({'City': ['A', 'B', 'A', 'B'],
                   'Population': [100000, 200000, 150000, 250000]})
grouped_df = df.groupby('City').mean()
print(grouped_df)

# Outputs:
      Population
City            
A       125000.0
B       225000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Applying multiple aggregations
&lt;/h3&gt;

&lt;p&gt;We can do multiple aggregations like so. This gives us both the sum and the mean of the population for cities A and B.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({'City': ['A', 'B', 'A', 'B'],
                   'Population': [100000, 200000, 150000, 250000]})
aggregations = {'Population': ['sum', 'mean']}
grouped_df = df.groupby('City').agg(aggregations)
print(grouped_df)

# Outputs:
     Population          
            sum      mean
City                     
A        250000  125000.0
B        450000  225000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GroupBy operations are particularly useful when working with large datasets, as they allow you to analyze data at different granularities and identify patterns or trends within each group.&lt;/p&gt;
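
&lt;p&gt;As a quick sketch of analyzing at a finer granularity, you can also group by more than one column. The 'Year' column below is hypothetical, added just for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Grouping by two keys gives one row per (City, Year) pair
df = pd.DataFrame({'City': ['A', 'A', 'B', 'B'],
                   'Year': [2020, 2021, 2020, 2021],
                   'Population': [100000, 150000, 200000, 250000]})
grouped_df = df.groupby(['City', 'Year']).mean()
print(grouped_df)

# Outputs:
           Population
City Year            
A    2020    100000.0
     2021    150000.0
B    2020    200000.0
     2021    250000.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;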

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whew! You made it!! You're ready to go out into the world and confidently manipulate DataFrames with Pandas 🐼&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8adln1xnzymfnjghyhwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8adln1xnzymfnjghyhwm.png" alt="Image description" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By leveraging Pandas' Series and DataFrame data structures, as well as various functions and operations, you can effectively handle, transform, and analyze data for a wide range of use cases.&lt;/p&gt;

&lt;p&gt;In this blog post, we explored key aspects of Pandas, including Series manipulation, DataFrame operations, data reading, concatenation, merging, reshaping, duplication handling, mapping and replacing values, and GroupBy analysis. Armed with this knowledge, you can confidently dive into data analysis tasks and unleash the full potential of Pandas in your Python projects!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Science Zero to Hero - 1.1: Numpy</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Mon, 24 Jul 2023 23:17:46 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/data-science-zero-to-hero-11-numpy-26p0</link>
      <guid>https://forem.com/stevenmcgown/data-science-zero-to-hero-11-numpy-26p0</guid>
      <description>&lt;p&gt;Numpy is a Python library used for working with arrays, linear algebra, matrices, and much more. It’s a fantastic tool for anyone who wants to work with numerical data in Python, particularly in the context of data science and machine learning. Numpy is used extensively in machine learning algorithms, so it’s good to have some experience with it if you want to be able to create any ML solutions.&lt;/p&gt;

&lt;p&gt;This post will assume that you already have some programming experience with Python. In the future I might put out some posts on Python programming, but there are plenty of great videos and websites that can teach you how to program with Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefdxbre05d686zqmn3li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefdxbre05d686zqmn3li.png" alt="Numpy Meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most basic thing to know about Numpy is the numpy array. You may be familiar with arrays already; in its most basic form, a 1-dimensional array is just a row of data. This can be anything, but most of the time we use these arrays for numbers, strings, or even a mix of both. &lt;/p&gt;

&lt;p&gt;For example, let's create a Numpy array with the following values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import numpy as np
mixed_array = np.array(['Apple', 10, 'B', 5, 'Banana', 7.5, 'D', 3])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see that we have an array that contains a mix of strings, characters, integers and floating point numbers. We can reference items in the array like so:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

mixed_array[0]
# Outputs "Apple"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will give us the value stored in the first compartment of the array, which is "Apple".&lt;/p&gt;

&lt;p&gt;Like an egg carton, a 2D array has compartments, or cells, that can hold values. An array with 2 dimensions is also called a matrix. Each cell is identified by an index, which is like the number of the compartment in the egg carton. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48e6wr7b4orr5i7dzrx5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48e6wr7b4orr5i7dzrx5.jpg" alt="eggs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like we did in the previous example, we can take an element out of the array. An important thing to notice with this is that because it is a 2D array, each element is an array. So when we index the first element in the array (0), we will get the first row in our array.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

egg_array = np.array([['egg1','egg2','egg3','egg4','egg5','egg6'],
                      ['egg7','egg8','egg9','egg10','egg11','egg12']])

print(egg_array[0])

# Outputs: ['egg1' 'egg2' 'egg3' 'egg4' 'egg5' 'egg6']


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can also slice a NumPy array, just like we can take out a row of eggs from an egg carton. For example, let's say we want to get the second and third values from the array:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

egg_array = np.array([['egg1','egg2','egg3','egg4','egg5','egg6'],
                      ['egg7','egg8','egg9','egg10','egg11','egg12']])


print(egg_array[0][1:3])

# Outputs: ['egg2' 'egg3']


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This outputs a new array with the values in the second and third compartments. Taking a closer look, we can see that we index the first array in the matrix with [0] and then the 2nd through 3rd elements in the array with [1:3].&lt;/p&gt;

&lt;p&gt;You might be wondering why is it that the 2nd and 3rd elements are selected with [1:3] and not [2:3]? In Python, when using slicing notation start:end, the start index is inclusive, while the end index is exclusive. It means that the range of elements selected includes the element at the start index but excludes the element at the end index.&lt;/p&gt;
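
&lt;p&gt;Here is a tiny, standalone illustration of that rule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;simple_array = np.array([10, 20, 30, 40])

print(simple_array[1:3])

# Outputs: [20 30]
# Index 1 (20) is included; index 3 (40) is excluded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;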

&lt;p&gt;Great! Now that we understand how to index and slice arrays, let’s simplify our array just a little bit, so that each egg is represented as a number instead of a string. &lt;/p&gt;

&lt;p&gt;Just like you can add or subtract eggs from an egg carton, you can perform operations on a NumPy array. For example, let's say we want to multiply all the values in our array by 2:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

egg_carton = np.array([2, 4, 6, 8])

print(egg_carton * 2) 
# Outputs: [ 4  8 12 16]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will give us a new array with the values doubled.&lt;/p&gt;

&lt;p&gt;You can also concatenate, or combine, two or more arrays together, just like you can stack multiple egg cartons on top of each other. For example, let's say we have two arrays:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;egg_carton1 = np.array([1, 2, 3])
egg_carton2 = np.array([4, 5, 6])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can concatenate them together like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

print(np.concatenate([egg_carton1, egg_carton2])) 
# Outputs: [1 2 3 4 5 6]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will give us a new array with all the values from both arrays.&lt;/p&gt;

&lt;p&gt;Finally, just like you can split an egg carton into two or more parts, you can split a NumPy array into smaller arrays. For example, let's say we have an array with six values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

egg_carton3 = np.array([1, 2, 3, 4, 5, 6])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can split it into two arrays of equal size like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

print(np.split(egg_carton3, 2)) 
# Outputs: [array([1, 2, 3]), array([4, 5, 6])]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will give us a list of two new arrays, each containing half of the values from the original array.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;There are plenty of other concepts to learn about numpy, but for the sake of brevity, I'm going to cover the essentials in this post. If you would like to know more, please explore the numpy documentation!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So there you have it – a NumPy array is like an egg carton for numbers, with compartments that can be indexed, sliced, operated on, concatenated, and split! There are many useful things that you can do with Numpy that are more advanced and won’t be covered in this post. After you read this, I encourage you to dive deeper into the documentation if you’re interested in learning more.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Science Zero to Hero - Foreword</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Mon, 24 Jul 2023 23:04:20 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/data-science-zero-to-hero-foreword-3c2p</link>
      <guid>https://forem.com/stevenmcgown/data-science-zero-to-hero-foreword-3c2p</guid>
      <description>&lt;p&gt;Wow, it's been such a long time since I made a post here. I've been working hard, learning a lot, and I'm pleased to share that I have made sizeable strides into the world of Data Science!&lt;/p&gt;

&lt;p&gt;Since I've been gone, I saw that according to my post statistics I have totaled over 100k views! That's so amazing. Speaking of statistics, they have been on my mind for the past year! You might be wondering though, why? Well, statistics is truly the heart of the scientific method, and plays an integral role in Artificial Intelligence, Machine Learning and Data Science.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F079ozabzallkr4316khb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F079ozabzallkr4316khb.png" alt="scoobydoomeme" width="544" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today I'll be starting a new series called "Data Science Zero to Hero" where I will be breaking down popular concepts so that those who are interested in learning about ML, AI and data science can learn in a guided way without any prior knowledge. It's also worth noting that these terms are often used interchangeably, but are not necessarily the same things.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u70yhw9z3ucb6m4m4pc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u70yhw9z3ucb6m4m4pc.png" alt="Image description" width="730" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Artificial Intelligence (AI) is a broad concept focused on creating machines that can mimic human thinking, reasoning, and behavior. On the other hand, Machine Learning (ML) is a subset of AI where computer systems learn from their environment, using these learnings to enhance experiences and processes. It's important to note that all ML is AI, but not all AI involves ML.&lt;/p&gt;

&lt;p&gt;Data Science, in contrast, involves processing, analyzing, and extracting relevant insights from data. Data Scientists utilize machine learning techniques to predict future events by uncovering hidden patterns in the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  A little bit about myself...
&lt;/h2&gt;

&lt;p&gt;Just a little bit of background into what I learned so far - I took MIT's online Applied Data Science course and learned tons. If anyone is interested in an introduction to learning Data Science, I would definitely recommend it.&lt;/p&gt;

&lt;p&gt;My goal is to learn by teaching, and to help those who are intimidated by complex technical concepts. I hope that you're as excited as I am! Thanks for reading!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Terraform for Dummies - Part 2: Getting Started</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Wed, 08 Jun 2022 23:17:35 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/terraform-for-dummies-part-2-2lfj</link>
      <guid>https://forem.com/stevenmcgown/terraform-for-dummies-part-2-2lfj</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlhxa6ysjc7ou7qly1vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlhxa6ysjc7ou7qly1vg.png" alt="terraformaws"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to the second part in the “Terraform for Dummies” series 😃 In the last post, we learned about infrastructure as code and got a high level overview of how terraform works in a DevOps environment. If you haven’t already, please learn the basics in the previous post before reading further. In this post, we will get our environment set up so we can start creating Terraform code!&lt;/p&gt;

&lt;p&gt;If you haven't already, please read &lt;a href="https://dev.to/stevenmcgown/terraform-for-dummies-part-1-eap"&gt;part 1&lt;/a&gt; of this series to have the best info going forward 👍&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Setting Up&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Before we go any further, we will need to do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have an active AWS account&lt;/li&gt;
&lt;li&gt;Install an IDE of your choice (I use VSCode)&lt;/li&gt;
&lt;li&gt;Install terraform&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create an AWS account:&lt;br&gt;
&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/" rel="noopener noreferrer"&gt;https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install VSCode:&lt;br&gt;
&lt;a href="https://code.visualstudio.com/download" rel="noopener noreferrer"&gt;https://code.visualstudio.com/download&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install Terraform:&lt;br&gt;
&lt;a href="https://learn.hashicorp.com/tutorials/terraform/install-cli" rel="noopener noreferrer"&gt;https://learn.hashicorp.com/tutorials/terraform/install-cli&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have done these three things, we will need to open up our IDE. I recommend using VSCode because of its wide array of extensions. One of the extensions I will be using is the Hashicorp Terraform extension, which will do things like syntax highlighting and autocompletion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdluochh35zzbfj3zs90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdluochh35zzbfj3zs90.png" alt="terraformextensions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open up a terminal in VSCode and type &lt;code&gt;$ terraform version&lt;/code&gt;. If you get an error message, you may have missed a step in the terraform installation step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1d1g4c8ayimkl7oodyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1d1g4c8ayimkl7oodyh.png" alt="terrformversion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Getting Started&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Create a project folder and a new file called &lt;code&gt;main.tf&lt;/code&gt;. Inside of this file, we are going to add our aws provider found on &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest" rel="noopener noreferrer"&gt;hashicorp’s website&lt;/a&gt;. If you have programmed before, you are probably familiar with ‘main’ as a special keyword for compilers to find as a starting point for code execution. This is actually not the case with Terraform; Terraform loads every file with the &lt;code&gt;.tf&lt;/code&gt; extension in the working directory and evaluates them as one single configuration, so any file names that you specify are just logical groups. It’s also a best practice to keep your naming conventions and file structure consistent across your projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdaue9taqsoowl8c6g58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdaue9taqsoowl8c6g58.png" alt="terraformmain"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that we have a terraform block which is our terraform settings configuration. We need at least one provider in this block, so in our case we have aws. Notice that the provider is actually maintained by Hashicorp and not AWS themselves, although it is the official aws provider. At the time I am creating this post, the current version of the aws provider is 4.16.0.&lt;/p&gt;
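
&lt;p&gt;For reference, here is a minimal sketch of that terraform settings block in text form, pinned to the provider version mentioned above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&amp;gt; 4.16"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;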

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej8l43yzb11q038i3ryd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej8l43yzb11q038i3ryd.png" alt="terraformkeys"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see we have a provider block named “aws”. In this provider block, you can specify your region, access key and secret key. &lt;b&gt;It is important to know that you should NEVER share these keys with ANYONE!&lt;/b&gt; &lt;/p&gt;

&lt;p&gt;While it is possible to configure these credentials directly in the terraform code, &lt;b&gt;you should not do this&lt;/b&gt;. The best practice would be to store these keys in a secret manager like Hashicorp’s Vault, but for the sake of simplicity and to keep the focus on Terraform itself, we will store them in an AWS profile for this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Generating AWS Keys&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
To generate our keys, we will first need to navigate to Identity Access Management in the AWS console.&lt;/p&gt;

&lt;p&gt;Once in IAM, create a User with any name that you want. In this case I will simply call it “Terraform.” Since we will only be using it from the terminal, only give it programmatic access and attach the &lt;b&gt;PowerUserAccess&lt;/b&gt; policy. This policy essentially gives the same permissions as the AdministratorAccess policy minus management of users and groups. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1j3rrvjxngcr6qlz6d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1j3rrvjxngcr6qlz6d8.png" alt="Power user"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have created your user, you should see your Access key ID and Secret access key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq1dp6r4qly3xnba106g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq1dp6r4qly3xnba106g.png" alt="keys"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again… &lt;b&gt;DO NOT SHOW YOUR access_key OR secret_key TO ANYONE OR PUSH THEM TO ANY REPOSITORY.&lt;/b&gt; Doing so could cause your AWS Account to be compromised!! That being said, I am not responsible for any wrongdoing that may come as a result of exposing secret keys.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Configuring an AWS Profile&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Depending on how experienced you are with AWS CLI, you may have already configured your default aws profile to have access to AWS. We can configure AWS CLI to have multiple profiles which will allow us to have an IAM user dedicated to this terraform tutorial.&lt;/p&gt;

&lt;p&gt;To configure these credentials, edit the .aws/credentials file using your favorite text editor:&lt;br&gt;
~/.aws/credentials (Linux &amp;amp; Mac) or %USERPROFILE%\.aws\credentials (Windows)&lt;/p&gt;

&lt;p&gt;In this file we can see the default credentials and we can add our own set of credentials to use. AWS specifies that you &lt;b&gt;cannot&lt;/b&gt; use the word “profile” when creating an entry in the credentials file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[terraform_tutorial]
aws_access_key_id=ACCESS_KEY
aws_secret_access_key=SECRET_ACCESS_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have configured your credentials, you can now reference that credentials profile in your terraform code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzz0pyz7tlu4ga2zswbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzz0pyz7tlu4ga2zswbj.png" alt="awsprovider"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Terraform Code&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Now that we have taken care of the credentials configuration, we can write a small piece of terraform code called a &lt;b&gt;resource block&lt;/b&gt;. A resource block describes one or more infrastructure objects, such as virtual networks, compute instances, or higher-level components such as DNS records. In our case, we can start by creating a sample S3 bucket so we don’t run up charges on our AWS account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2arcyfbfsv7h62kgyt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2arcyfbfsv7h62kgyt5.png" alt="s3bucket"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that this resource block declares &lt;b&gt;resource type&lt;/b&gt; &lt;code&gt;"aws_s3_bucket"&lt;/code&gt; with a given &lt;b&gt;local name&lt;/b&gt; &lt;code&gt;"test-terraform-bucket"&lt;/code&gt;. The local name is used to refer to the s3 bucket resource from elsewhere in the same Terraform module, but has no significance outside that module's scope.&lt;/p&gt;

&lt;p&gt;Within the block body (between the &lt;code&gt;{&lt;/code&gt; and &lt;code&gt;}&lt;/code&gt;) are the &lt;b&gt;configuration arguments&lt;/b&gt;, &lt;code&gt;bucket&lt;/code&gt; and &lt;code&gt;tags&lt;/code&gt;, for &lt;code&gt;aws_s3_bucket&lt;/code&gt;.&lt;br&gt;
For &lt;code&gt;bucket&lt;/code&gt;, we must give a string that will create an s3 bucket with a globally unique name.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tags&lt;/code&gt; are optional, but I have added this argument to demonstrate that many configuration arguments can also have arguments from within themselves.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Terraform init&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Recall from the previous post that “init” is the second stage of the terraform lifecycle, right after code. To recap, terraform init is used to initialize a working directory containing Terraform configuration files. This is the first command that should be run after writing a new Terraform configuration or cloning an existing one from version control.&lt;/p&gt;

&lt;p&gt;We can initialize terraform using &lt;code&gt;$ terraform init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Terraform plan&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
The next step in the terraform lifecycle is plan. This step creates an execution plan, like a blueprint for a house. Unless explicitly disabled, terraform does a refresh and then declaratively determines what needs to be done in order to reach the desired configuration state.&lt;/p&gt;

&lt;p&gt;Run the terraform plan for this with &lt;code&gt;$ terraform plan&lt;/code&gt;. If successful, the following output and more will print out: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnkolendp6kmzp7acs7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnkolendp6kmzp7acs7d.png" alt="plan"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the bucket and tags are listed in the output, and that everything else will be known after the apply. If you are receiving an error, you may have given invalid values or there might be a syntax error. At this point, terraform has not applied any changes to the AWS environment.&lt;/p&gt;

&lt;p&gt;You may have also noticed new files generated in your project folder after running init: a “.terraform” directory and a file called “.terraform.lock.hcl”&lt;/p&gt;

&lt;p&gt;The “.terraform” directory holds the aws provider binary (“terraform-provider-aws_v4.16.0_x5”) for provider version 4.16.0, and “.terraform.lock.hcl” is known as the dependency lock file, which allows terraform to “remember” which exact version of each provider you used before.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Terraform apply&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
Once your plan is successful, you can run &lt;code&gt;$ terraform apply&lt;/code&gt;. This will apply all of the configuration changes which were made in the plan stage.&lt;/p&gt;

&lt;p&gt;When you run this command, you will be greeted with the following message asking if you want to perform the changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mrzf0f9wpbogdaaaum9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mrzf0f9wpbogdaaaum9.png" alt="apply"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now your changes will be visible in the AWS console. Alternatively, you can run the following command to list your bucket. Make sure to use the profile that we configured earlier.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ aws s3 ls --profile &amp;lt;profile-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You will also see a new file called “terraform.tfstate”. This file unsurprisingly stores the state of your terraform configuration. While the state file’s format is just JSON, directly editing the file is discouraged. Terraform provides the terraform state command to perform basic modifications of the state using the CLI.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;b&gt;Terraform destroy&lt;/b&gt;&lt;/u&gt;&lt;br&gt;
At this point you have everything you need to deploy an s3 bucket solely using terraform. You might be seeing how powerful terraform can be now, since you can create and manage cloud infrastructure with some code and a few simple commands.&lt;/p&gt;

&lt;p&gt;Every lifecycle eventually comes to an end, and the terraform lifecycle is no exception. If you wish to tear down all of the infrastructure that you created, it’s an easy task for terraform. Please note that you should only do this if you truly wish to delete the infrastructure that you provisioned.&lt;/p&gt;

&lt;p&gt;Once again, we are greeted with a similar message upon running &lt;code&gt;$ terraform destroy&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rfwk3g4ifsxo8phu945.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rfwk3g4ifsxo8phu945.png" alt="destroy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we allow terraform to destroy our resources by typing “yes”, we will no longer be able to see our s3 bucket in the aws console. And of course, you can again list your buckets by using the command: &lt;br&gt;
&lt;code&gt;$ aws s3 ls --profile &amp;lt;profile-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeeh8phh4jjgnyqwmvhq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeeh8phh4jjgnyqwmvhq.png" alt="tf-lifecycle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This concludes part 2 of this series! Thank you for reading, and stay tuned for part 3 of this series. In the next post, we will be taking an even deeper dive into terraform.&lt;/p&gt;

&lt;p&gt;Please comment below if you’re enjoying these posts, and let me know if there’s anything I can improve!😊&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Terraform for Dummies - Part 1: What is Terraform?</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Tue, 31 May 2022 22:37:51 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/terraform-for-dummies-part-1-eap</link>
      <guid>https://forem.com/stevenmcgown/terraform-for-dummies-part-1-eap</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xxbvlvs531kq8by535n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xxbvlvs531kq8by535n.png" alt="terraform"&gt;&lt;/a&gt;&lt;br&gt;
Welcome to this first part of Terraform for Dummies! In this series, you will learn about what Terraform is and how DevOps engineers use it to provision, deploy and orchestrate both on-prem and cloud infrastructure in a reliable and easy way.&lt;/p&gt;

&lt;p&gt;If you’re a DevOps engineer, odds are you have at least heard of Terraform. In this series, we will use Terraform as an Infrastructure as Code (IaC) tool for provisioning Amazon Web Services (AWS) resources. We will also touch on important DevOps concepts along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Infrastructure as Code?
&lt;/h2&gt;

&lt;p&gt;IaC is pretty much exactly what it sounds like. Using IaC tools like terraform, you can do away with manual configuration altogether by defining your infrastructure in a code template. There are a few reasons why IaC wins out over manual configuration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;People are imperfect and get it wrong - misconfiguration becomes more likely as infrastructure scales due to human error.&lt;/li&gt;
&lt;li&gt;Transferring knowledge to teammates is difficult when environments are manually configured, especially across multiple projects. IaC allows any team member to read and edit the configuration with little knowledge transfer (KT).&lt;/li&gt;
&lt;li&gt;Configuration compliance is also difficult. Enforcing compliance according to customer/company standards is easier with IaC.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Terraform uses its own configuration language called HCL (HashiCorp Configuration Language), which we will dive into more in upcoming posts. You can use configuration scripts to automate creating, updating and destroying cloud infrastructure. Think of these configuration files as blueprints, much like a blueprint an architect would use to build a house.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz6v8mi8svg7btzb8r6p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz6v8mi8svg7btzb8r6p.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform supports a wide variety of providers beyond GCP, AWS and Azure, and for some platforms it is the only IaC tooling available. It’s also open-source and extendable, so any API can be used to create IaC tooling for any kind of cloud platform or technology, e.g. Heroku or Spotify playlists.&lt;/p&gt;

&lt;p&gt;Terraform is also &lt;strong&gt;cloud-agnostic&lt;/strong&gt;, meaning that it allows a single configuration to be used to manage multiple providers and even handle cross-cloud dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Declarative vs. Imperative
&lt;/h2&gt;

&lt;p&gt;Immutability, or the inability to change the object state, is an important concept to grasp for understanding declarative vs imperative programming. Immutable types are safer from bugs, easier to understand, and more ready for change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Declarative programming&lt;/em&gt;&lt;/strong&gt; is a paradigm describing WHAT the program does, without explicitly specifying its control flow. Because you only describe the desired end state, there is far less room for misconfiguration with declarative programming.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Declarative languages don't have looping control structures, e.g. for and while, because &lt;strong&gt;due to immutability&lt;/strong&gt;, the loop condition would never change.&lt;/li&gt;
&lt;li&gt;Declarative languages don't express control-flow other than nested function order (a.k.a logical dependencies), because &lt;strong&gt;due to immutability&lt;/strong&gt;, other choices of evaluation order do not change the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Declarative configurations are typically written in data formats such as JSON, YAML and XML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Imperative programming&lt;/em&gt;&lt;/strong&gt; is a paradigm describing HOW the program should do something by explicitly specifying each instruction (or statement) step by step, which &lt;strong&gt;mutate&lt;/strong&gt; the program's state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imperative programming is often less verbose, but it leaves more room for misconfiguration&lt;/li&gt;
&lt;li&gt;AWS CDK and Pulumi take the imperative approach &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imperative programming uses scripting languages such as Python, Ruby and Javascript.&lt;br&gt;
So is Terraform declarative or imperative? Well, Terraform is declarative but it has imperative-like features like for loops, dynamic blocks, locals and complex data structures like maps and collections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6h25pt6y4xivx9grlen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6h25pt6y4xivx9grlen.png" alt="declarativevsimperative"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotent vs non-idempotent
&lt;/h2&gt;

&lt;p&gt;Idempotency is a principle of IaC that refers to the state of a configuration after applying changes.&lt;/p&gt;

&lt;p&gt;Non-idempotent configurations will add the specified resources each time they are applied whereas idempotent configurations can be applied multiple times without changing the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-idempotent example:&lt;/strong&gt;&lt;br&gt;
Your configuration file specifies that you need three virtual machines. Each time you apply your configuration, you have 3 more virtual machines. So each time you apply, the number of VMs will increment by 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotent example:&lt;/strong&gt;&lt;br&gt;
Your configuration file specifies that you need three virtual machines. Each time you apply your configuration, the number of virtual machines is always 3 virtual machines. So no matter how many times you apply, you will always have the same number of VMs in your configuration.&lt;/p&gt;
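
&lt;p&gt;Terraform behaves like the idempotent example. As a minimal sketch, applying the configuration below any number of times always leaves you with exactly three instances; the AMI ID and instance type are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Applying this twice does not create six instances; the total stays at 3
resource "aws_instance" "web" {
  count         = 3
  ami           = "ami-12345678"  # placeholder AMI ID
  instance_type = "t2.micro"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;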

&lt;h2&gt;
  
  
  Configuration Drift
&lt;/h2&gt;

&lt;p&gt;Configuration drift is simply unexpected changes to your infrastructure. This can happen for a number of different reasons, including manual adjustments to configurations, side effects of SDKs, CLIs, or APIs or even from malicious actors.&lt;/p&gt;

&lt;p&gt;Terraform detects configuration drift with the state file (.tfstate), which records what your deployed infrastructure should look like. To correct the drift, you can utilize the terraform plan and refresh commands, which we will look at more in depth in the coming posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitOps
&lt;/h2&gt;

&lt;p&gt;Terraform goes hand-in-hand with GitOps. In collaborative environments, you can use a git repository as a formal process to review and accept changes to IaC. Once those changes are accepted, a deployment is triggered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkojvf512geafz2o7ho4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkojvf512geafz2o7ho4p.png" alt="gitops"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, Terraform code is pushed to a git repo such as Github or Bitbucket. The person who made the push can continue to commit until they are ready to create a pull request to the main branch. Once the pull request is approved by a reviewer, a CI/CD pipeline such as Jenkins, Concourse or Github Actions is triggered to deploy changes to your cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjzpkp9yzwc7yg2tmst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgjzpkp9yzwc7yg2tmst.png" alt="tfcloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform Cloud also offers its own version of this workflow: you connect a git repo, open pull requests, and Terraform Cloud takes care of the CI/CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform Lifecycle
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;At the start, you will update the &lt;strong&gt;code&lt;/strong&gt; of your terraform configuration file&lt;/li&gt;
&lt;li&gt;Then you will initialize your project or pull the latest providers using terraform &lt;strong&gt;init&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Next, the &lt;strong&gt;plan&lt;/strong&gt; allows you to speculate what your changes will be&lt;/li&gt;
&lt;li&gt;Validation happens automatically when you run plan, but you can also &lt;strong&gt;validate&lt;/strong&gt; manually&lt;/li&gt;
&lt;li&gt;Finally, you will execute the terraform plan to provision infrastructure using &lt;strong&gt;apply&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You can also &lt;strong&gt;destroy&lt;/strong&gt; terraform infrastructure using apply or the destroy command&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05he41m3bbpl7h2a6f27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05he41m3bbpl7h2a6f27.png" alt="tflifecycle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You now know the basics of Terraform! In the next post, we will set up a development environment for terraform and begin to start using terraform with AWS. &lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>OpenShift for Dummies - Part 2</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Wed, 04 Aug 2021 23:47:31 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/openshift-for-dummies-part-2-2eg4</link>
      <guid>https://forem.com/stevenmcgown/openshift-for-dummies-part-2-2eg4</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vomn5j6jtz8s58ki8kl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vomn5j6jtz8s58ki8kl.png" alt="openshiftlogo"&gt;&lt;/a&gt;&lt;br&gt;
Thank you for reading part two of OpenShift for Dummies! In this article, I will briefly outline advantages and use cases of OpenShift. Additionally, I will go into technical detail of how you can get started using OpenShift. As a reminder, OpenShift is open source and has a free tier intended for experimentation and development, which is perfect for beginners. If you haven’t read OpenShift for Dummies - Part 1, please read it here before continuing. In the last post, we talked about containers and their advantages over VMs, but these important questions still remain: why should you use OpenShift and how do you get started using it?&lt;/p&gt;

&lt;h1&gt;Why Should I Use OpenShift?&lt;/h1&gt;

&lt;p&gt;In Kubernetes for Dummies, we talked about the need for a container orchestration system. In 2015, there were many different orchestration systems in use, including Cloud Foundry, Mesosphere, Docker Swarm, and Kubernetes, to name a few. Today, the market has consolidated and Kubernetes has come out on top. Red Hat bet early on K8s and is now the second largest contributor to and influencer of its direction, behind only Google. K8s is the kernel of distributed systems, while OpenShift is a distribution of it. What this means for developers is that whenever a new version of Kubernetes becomes available, Red Hat can take K8s from upstream, secure it, test it and certify it with hardware and software vendors. In addition, Red Hat patches 97% of all security vulnerabilities within 24 hours and 99% within the first week, showing the difference between Red Hat and their competition.&lt;/p&gt;

&lt;h3&gt;OpenShift, the Platform of the Future&lt;/h3&gt;

&lt;p&gt;OpenShift is a platform that can run on premise, in a virtual environment, in a private cloud, or in a public cloud. You can migrate all of your traditional applications to OpenShift so you can get all of the advantages of containerization, as well as software from independent software vendors. You can also build cloud-native greenfield applications (greenfield describes a completely new project that has to be executed from scratch) as well as integrate Machine Learning and Artificial Intelligence functions. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8aae747uxd9rgumloub.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8aae747uxd9rgumloub.jpg" alt="openshiftfancy"&gt;&lt;/a&gt;&lt;br&gt;
OpenShift also provides automated operations, multi-tenancy, secure by default capabilities, network traffic control, and the option for chargeback and showback. OpenShift is also pluggable so you can introduce third party security vendors if you wish. Developers also get a self service provisioning portal so operations teams can define what is available for developers and developers can request controls as authorized by the operations team. The OpenShift platform is very versatile in that it runs on most public cloud services such as AWS, Azure, Google Cloud Platform, IBM Cloud, and of course it runs on-premises as well.&lt;/p&gt;

&lt;h1&gt;OpenShift Demo&lt;/h1&gt;

&lt;p&gt;You can use the trial version of OpenShift by visiting:&lt;br&gt;
&lt;a href="https://www.redhat.com/en/products/trials?products=hybrid-cloud" rel="noopener noreferrer"&gt;https://www.redhat.com/en/products/trials?products=hybrid-cloud&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this demo, you will need a Red Hat account. We will be selecting the option that plainly says ‘Red Hat OpenShift - An enterprise-ready Kubernetes container platform’&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttqv4q50uxghy21or7e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttqv4q50uxghy21or7e0.png" alt="tryit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select ‘Start your trial’ under ‘Developer Sandbox.’ The developer sandbox will suffice for this walkthrough. Please note that the account created will be active for 30 days. At the end of the active period, your access will be deactivated and all your data on the Developer Sandbox will be deleted. Upon logging in, you should be brought to this webpage:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z8r2nimglrqog0zlwjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z8r2nimglrqog0zlwjb.png" alt="devscreen"&gt;&lt;/a&gt;&lt;br&gt;
If you are not brought here, visit &lt;a href="https://developers.redhat.com/developer-sandbox" rel="noopener noreferrer"&gt;https://developers.redhat.com/developer-sandbox&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Click ‘Get started in the Sandbox’ and then ‘Launch your Developer Sandbox for Red Hat OpenShift’ and then ‘Start using your sandbox.’ You may also need to verify your email address to continue. &lt;/p&gt;

&lt;h3&gt;Welcome to OpenShift!&lt;/h3&gt;

&lt;p&gt;On the side bar you can see different options to select from...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp7cbyki55n5212xr3pr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp7cbyki55n5212xr3pr.png" alt="devview"&gt;&lt;/a&gt;&lt;br&gt;
&lt;b&gt;Perspective Switcher&lt;/b&gt;&lt;br&gt;
You can toggle between Developer and Administrator perspectives using the perspective switcher. &lt;/p&gt;

&lt;p&gt;The Administrator perspective can be used to manage workload storage, networking, cluster settings, and more. This may require additional user access.&lt;/p&gt;

&lt;p&gt;Use the Developer perspective to build applications and associated components and services, define how they work together, and monitor their health over time.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Add&lt;/b&gt;&lt;br&gt;
You can select a way to create an application component or service from one of the options.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Monitor&lt;/b&gt;&lt;br&gt;
The monitoring tab allows you to monitor application metrics, create custom metrics queries, and view &amp;amp; silence alerts in your project.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Search&lt;/b&gt;&lt;br&gt;
Search for resources in your Project by simply starting to type or by scrolling through a list of existing resources.&lt;/p&gt;

&lt;p&gt;Now, switch to the Administrator perspective and look under projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3sklqizyaih3fnpqckr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3sklqizyaih3fnpqckr.png" alt="adminview"&gt;&lt;/a&gt;&lt;br&gt;
Under &lt;b&gt;projects&lt;/b&gt;, you may see two different projects, one for development and one for staging. The projects section allows you to create projects based on domains within IT (Developers, Operations, Security, Network, Infrastructure, Storage, etc) and isolate their functions from one another. Normally these teams would have their own systems, but through OpenShift they all have one singular console where they can have control for their respective roles. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60qvv38eupu25slj0y9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60qvv38eupu25slj0y9v.png" alt="oneplatform"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, change back to the developer perspective. Under topology, we can see that we currently do not have any workloads. OpenShift gives us many options to create applications, components and services using the options listed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xqnzobinyb2647vv79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xqnzobinyb2647vv79.png" alt="catalog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s explore the catalog to see what we can choose from. Through the developer catalog, a developer does not need to ask the infrastructure team for a new development environment, database, runtime, and so on. Instead, the developer can choose from a list of pre-approved apps, services, and source-to-image builders. For our purposes, we will use Python to create a front end. I will use a sample random background color generator to demonstrate the use of Python in OpenShift. The app randomly picks a color and greets whoever opens the website with a welcome message. Simply type ‘Python’ into the developer catalog, or find it under Languages &amp;gt; Python, and click the option that plainly says ‘Python.’&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6iar6r1uoxjx50rvgkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6iar6r1uoxjx50rvgkz.png" alt="devcatalog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click ‘Create Application.’&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzise0any50k0f9noboh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzise0any50k0f9noboh.png" alt="createapplication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, we will paste the link from the github repository that holds the python script we will use for our webpage: &lt;a href="https://github.com/StevenMcGown/OpenShift_Demo" rel="noopener noreferrer"&gt;https://github.com/StevenMcGown/OpenShift_Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also change the name of the application if you wish. For our purposes, we will leave everything at the default settings. Once you click ‘Create’, OpenShift begins building the application. You can follow the build from the sidebar by navigating to Builds &amp;gt; open-shift-demo &amp;gt; Builds &amp;gt; open-shift-demo-1 &amp;gt; Logs. In this screenshot, we can see that OpenShift fetches the source code from its repository, analyzes it, and builds an application binary. Next, OpenShift generates a &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Dockerfile&lt;/a&gt; that installs all of the dependencies needed to run the application binary. The dependencies are layered into a container image, which is stored in OpenShift's built-in registry. Finally, the application is deployed from that registry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4piggcgwtstkqhzk5rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4piggcgwtstkqhzk5rv.png" alt="build"&gt;&lt;/a&gt;&lt;/p&gt;
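&lt;p&gt;If you prefer the terminal, the same build can be started with the &lt;code&gt;oc&lt;/code&gt; client. A minimal sketch, assuming you are logged in to the sandbox cluster and want the same &lt;code&gt;open-shift-demo&lt;/code&gt; name the console generates:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Create the app from the repo using the Python source-to-image builder
oc new-app python~https://github.com/StevenMcGown/OpenShift_Demo --name=open-shift-demo

# Stream the build log (the CLI equivalent of Builds &amp;gt; open-shift-demo-1 &amp;gt; Logs)
oc logs -f bc/open-shift-demo
&lt;/code&gt;&lt;/pre&gt;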

&lt;p&gt;Next, click on the Topology tab in the sidebar. Our Python application appears as a bubble with 3 smaller bubbles attached. The green check mark shows that the build succeeded, and clicking it opens the build log we just saw. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2uj5g0t6c0kon4vrfho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2uj5g0t6c0kon4vrfho.png" alt="withoutcoderw"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bubble on the bottom right with a red C lets us edit our source code with CodeReady Workspaces, which provides an IDE inside the browser. It takes a while to open, but once it does you should see an editor similar to VSCode. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxtwafjag418ym2kxkak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxtwafjag418ym2kxkak.png" alt="codereadyworkspaces"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking back at the Topology view of our application, we can see that a CodeReady Workspaces icon has been added to our project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uc4qj3l3m0762bgvuf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uc4qj3l3m0762bgvuf6.png" alt="withcoderw"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking the bubble on the top right of the Python icon opens the running application. In this instance, the app picked green as its random color, and we are welcomed with a message from open-shift-demo, served by the ‘hkqbv’ container under the ‘7c749ff559’ replica set. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpqp2183ocota25hh345.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpqp2183ocota25hh345.png" alt="green"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an administrator, we want to give the application high availability by scaling it, controlling routing, and so on. Let’s look at the application from an administrator’s perspective now. In the admin perspective, we can view our application pods by navigating to Workloads &amp;gt; Pods. Here we can see that only one pod is serving our application. To increase availability, navigate to Workloads &amp;gt; Deployments and increase the number of pods serving the application. As a reminder, a deployment is a set of pods that ensures a sufficient number of them are running at any one time to serve an application. If you need to brush up on &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes&lt;/a&gt; concepts such as deployments and pods, please read &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes for Dummies.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1v0u0voouti6iamdevo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1v0u0voouti6iamdevo.png" alt="increasepods"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, to increase the availability of your app you would have to create an additional VM, set up a load balancer, and install the application on the new machine. In OpenShift, increasing availability is as simple as incrementing or decrementing the pod counter under ‘Deployment Details,’ which takes seconds. Each pod runs its own copy of the application, so with 3 pods there are 3 independently chosen random colors.&lt;/p&gt;
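&lt;p&gt;The pod counter is backed by an ordinary Kubernetes deployment, so the same scaling works from the CLI. A sketch, assuming the deployment is named &lt;code&gt;open-shift-demo&lt;/code&gt; as in this demo:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Scale the deployment to 3 replicas (same effect as the console counter)
oc scale deployment/open-shift-demo --replicas=3

# Confirm that 3 pods are now running
oc get pods
&lt;/code&gt;&lt;/pre&gt;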

&lt;p&gt;After refreshing your page, you may notice that the app never changes color… What gives? From a networking perspective, the default configuration uses sticky sessions, meaning that once a user connects to the application, they are always served by the same container. To change this, navigate to Networking &amp;gt; Routes and click on the 3 dots to edit the route’s annotations.&lt;/p&gt;

&lt;p&gt;We will add these key-value pairs to our existing annotations:&lt;br&gt;
&lt;code&gt;haproxy.router.openshift.io/balance: roundrobin&lt;/code&gt;&lt;br&gt;
&lt;code&gt;haproxy.router.openshift.io/disable_cookies: true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjq9ifjlfqy2wglie2dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjq9ifjlfqy2wglie2dw.png" alt="edit annotations"&gt;&lt;/a&gt;&lt;/p&gt;
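&lt;p&gt;The same two annotations can also be applied from the terminal with &lt;code&gt;oc annotate&lt;/code&gt;; a sketch, assuming the route shares the application's name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Switch the route to round-robin balancing and stop pinning clients to one pod
oc annotate route/open-shift-demo \
  haproxy.router.openshift.io/balance=roundrobin \
  haproxy.router.openshift.io/disable_cookies=true \
  --overwrite
&lt;/code&gt;&lt;/pre&gt;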

&lt;p&gt;For more information on the round robin scheduling algorithm and cookies, visit these links:&lt;br&gt;
&lt;a href="https://en.wikipedia.org/wiki/Round-robin_scheduling#Network_packet_scheduling" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Round-robin_scheduling#Network_packet_scheduling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/HTTP_cookie" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/HTTP_cookie&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you refresh the page now, you will see a new message each time, indicating that a different pod is serving the application. The background color, however, might repeat across containers, since each instance picks its random color from an array of only 7 colors. &lt;/p&gt;

&lt;h3&gt;Simulating a Crash&lt;/h3&gt;

&lt;p&gt;Let’s simulate one of the pods crashing to test our availability. In the administrator view, navigate to Workloads &amp;gt; Pods. You should see 3 pods running under the same replica-set label, indicating that they were created from the same pod template. Deleting one of these pods simulates an immediate failure. When that happens, Kubernetes creates a replacement right away: the controller constantly compares how many pods are running against how many are desired, detects that only 2 of the 3 remain, and immediately creates a new pod to replace the failed one.&lt;/p&gt;
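&lt;p&gt;The same experiment can be run from the CLI; a sketch, with the pod name taken from the earlier screenshot (yours will differ):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Delete one pod to simulate an immediate crash
oc delete pod open-shift-demo-7c749ff559-hkqbv

# Watch the replica set spin up a replacement within seconds
oc get pods -w
&lt;/code&gt;&lt;/pre&gt;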

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w3qoddve75y4nua0plv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w3qoddve75y4nua0plv.png" alt="afterdelete"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the old pod was deleted, a new pod was created, with container ID ‘hxghh’ and a purple background.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftldv65vt6zwe05f95kh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftldv65vt6zwe05f95kh4.png" alt="purple"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Developer Updates&lt;/h3&gt;

&lt;p&gt;Let's suppose the developer updates the application's source code. When this happens, OpenShift needs to reflect those changes, so we build the project again: switch to the developer perspective, click on the Python icon, and click 'Start Build.' In this case, I added black to the array of colors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qwj84trp0i667m7pi9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qwj84trp0i667m7pi9q.png" alt="new build"&gt;&lt;/a&gt;&lt;/p&gt;
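&lt;p&gt;The rebuild can also be triggered from the terminal; a sketch using the build config created earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Start a new source-to-image build and stream its log until it finishes
oc start-build open-shift-demo --follow
&lt;/code&gt;&lt;/pre&gt;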

&lt;p&gt;One thing to note is an OpenShift feature called 'rolling updates,' which ensures a seamless transition from one version to the next. During a rolling update, new pods are commissioned while old ones are decommissioned, one at a time, until the rollout completes. This way, the end user never experiences a loss of service. With some luck, we can now see the new background color the developer added.&lt;/p&gt;
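&lt;p&gt;You can watch a rolling update drain old pods and admit new ones from the CLI; a sketch, again assuming the demo's deployment name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Blocks until the rollout completes, reporting pods as they are replaced
oc rollout status deployment/open-shift-demo
&lt;/code&gt;&lt;/pre&gt;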

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhkq9ul9oatlnkcdrr6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhkq9ul9oatlnkcdrr6a.png" alt="black"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;That's all I have for now! Thank you so much for reading part 2 of OpenShift for Dummies. I plan on making more of these in the future, but please let me know if you have any questions or concerns about these posts!&lt;/p&gt;

&lt;p&gt;I hope you have enjoyed reading. If you did, please leave a like and a comment! Also, follow me on LinkedIn at &lt;a href="https://www.linkedin.com/in/steven-mcgown/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/steven-mcgown/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>python</category>
    </item>
    <item>
      <title>OpenShift for Dummies - Part 1</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Sat, 24 Jul 2021 19:20:12 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/openshift-for-dummies-part-1-39f4</link>
      <guid>https://forem.com/stevenmcgown/openshift-for-dummies-part-1-39f4</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxw43n0qsszvu7myk9db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxw43n0qsszvu7myk9db.png" alt="OpenShift"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best way to bring value to an existing business is with the development of new applications, whether they be cloud-native applications, AI &amp;amp; machine learning, analytics, IoT, or any other innovative application. OpenShift, created by Red Hat, is the platform large enterprises use to deliver container-based applications. If you have read my previous posts about &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Docker&lt;/a&gt; and &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes&lt;/a&gt;, you understand the importance of containerization in the cloud. OpenShift claims to be “the industry's most secure and comprehensive enterprise-grade container platform based on industry standards.” To put it bluntly, OpenShift is like &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes&lt;/a&gt; on steroids.&lt;/p&gt;

&lt;p&gt;In this post, I will cover a brief history of IT infrastructure, summarize the role of a DevOps engineer and why they use OpenShift, and explain how OpenShift works; in part 2, I will show how to use OpenShift. Having a good grasp of the evolution of IT infrastructure is important to fully understand why DevOps engineers use OpenShift and to appreciate the impact OpenShift has had on the cloud computing industry.&lt;/p&gt;

&lt;p&gt;If you have already read part 1, please stay tuned for part 2!&lt;/p&gt;

&lt;h1&gt;How IT Infrastructure Has Changed Over Time&lt;/h1&gt;

&lt;p&gt;Moore’s law is the observation that the number of transistors on a chip doubles every two years. This means that we can expect the speed and capability of computers to increase every couple of years, and that we will pay less for them. Increasingly powerful microchips have consistently brought forth sweeping changes to business and life in general ever since the advent of the internet. It is hilarious how wrong some people were about how the internet would affect our lives.&lt;/p&gt;

&lt;p&gt;In 1995, scientist Clifford Stoll was promoting his book, “Silicon Snake Oil.” At the same time, he published an article in Newsweek titled, “The Internet? Bah!,” in which he stated that services like e-commerce would not be viable, and that “no online database will replace your daily newspaper.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdd1oduroxjkjry4lehne.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdd1oduroxjkjry4lehne.gif" alt="petergriffin"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perhaps it's just that hindsight is 20/20. After all, Clifford Stoll managed computers at Lawrence Berkeley National Laboratory in California, so it’s not like his opinion was coming from a place of ignorance. Today, most people know how ubiquitous internet technology is to some degree, but where did that process start, and where are we now? Let's begin with the development process. There are 3 pertinent development processes that you may be familiar with:&lt;/p&gt;

&lt;h3&gt;Development Processes&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Waterfall:&lt;/b&gt; The waterfall process is named for its one-way approach. Progress is typically made in only one direction, like a waterfall. A project is broken down into linear sequential phases, and each phase depends on the deliverables of the previous phase. The main issue with this is that if a client decides their needs have changed, it is difficult to go back and make changes. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Agile:&lt;/b&gt; From the waterfall method, the agile method was born. Instead of delivering 100% of the product at each stage, you might deliver about 20% of a functionality. At that point the client gives their feedback, and while you continue to work on that deliverable, you begin creating another functionality. After 5 iterations, the client has a product they are satisfied with and will continue to use. But what about IT operations, those who run existing servers, websites, and databases?&lt;/p&gt;

&lt;p&gt;&lt;b&gt;DevOps:&lt;/b&gt; Since development and operation teams can have different goals and skills, the division of these two teams often creates an environment where they do not trust each other. The DevOps approach combines these two teams so they have shared passion and common goals. I will touch more on this later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vibo1ht1pd6zeei1ahy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vibo1ht1pd6zeei1ahy.jpg" alt="devevolution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next aspect of IT evolution is application architecture, which refers to the software modules and components, internal and external systems, and the interactions between them.&lt;/p&gt;

&lt;h3&gt;Application Architecture&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Monolithic:&lt;/b&gt; A monolithic architecture is the traditional unified model for the design of a software program. Monoliths had a mainframe that held the entire application stack, so if the mainframe hardware crashed, the entire system went down. These were then broken down over the years into separate tiers.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;3-tier:&lt;/b&gt; Three-tier systems break the mainframe model down into a web tier, an application tier, and a database tier. This is known as service-oriented architecture (SOA). The reality remains, however, that if one of these tiers goes down, you have downtime. And still, all of the application logic remains in the app tier. The industry has since moved on from this architecture to microservices. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Microservices:&lt;/b&gt; In this architecture, you build your services not by the tier, but rather by the business functionality. I will go into more detail about microservices later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm3hg42wy16d3x34qkuy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm3hg42wy16d3x34qkuy.jpeg" alt="MonolithicvsMicroservices"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next aspect is application infrastructure, which is the software platform for the delivery of business applications.&lt;/p&gt;

&lt;h3&gt;Application Infrastructure&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Data Centers:&lt;/b&gt; These are giant rooms filled with large, powerful computers that make a lot of noise. To put it in perspective, many data centers are larger than 100,000 square feet and require specially designed air conditioning systems to keep them from overheating. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Hosted:&lt;/b&gt; Hosting providers, collections of organizations under one umbrella, offer their computing and storage capacity for other businesses to use. This was the precursor to cloud computing.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Cloud computing:&lt;/b&gt; Cloud computing uses a network of remote servers to take care of storing, managing, and processing data. There are many cloud providers out there, namely Amazon Web Services, Azure, and Google Cloud Platform, which dominate the global market. With cloud computing, hosting, availability, redundancy, etc. are taken care of by a cloud provider. Many businesses want to move to cloud computing, but there are logistical challenges they must consider and overcome to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhrcc72yoh4n8lmmki90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhrcc72yoh4n8lmmki90.png" alt="infrastrucure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How applications are delivered is just as important as how they are developed. The evolution of deployment and packaging is as follows:&lt;/p&gt;

&lt;h3&gt;Deployment and Packaging&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Physical servers:&lt;/b&gt; At one point, one physical server hosted one application. By today's standards, this is very inefficient. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Virtual servers:&lt;/b&gt; On one physical server you can have many virtual machines, which can host an application.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Containers:&lt;/b&gt; Containerization is the next step that businesses are adopting for application development. With containers, multiple applications with all their dependencies can be hosted on a single server without having the OS layer in between. &lt;/p&gt;

&lt;p&gt;You should be familiar with containers! If you are not, please read &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;“Docker for Dummies”&lt;/a&gt; and &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;"Kubernetes for Dummies"&lt;/a&gt;. The rest of this post assumes you understand the basics behind containerization and container orchestration.&lt;/p&gt;

&lt;h1&gt;DevOps Best Practices&lt;/h1&gt;

&lt;p&gt;We have seen that new technology gives rise to new ways of creating applications. As stated before, there are often issues between development and operation teams, which is why companies are adopting DevOps practices to streamline their development. DevOps engineers use OpenShift because it makes cloud deployments easy and enables them to follow these DevOps best practices:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Everything as Code:&lt;/b&gt; The practice of treating all parts of a system as code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Infrastructure as Code:&lt;/b&gt; Simple workflows to auto-provision infrastructure in minutes. (e.g. Terraform, AWS CloudFormation)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Environments as Code:&lt;/b&gt; Single workflows to build and deploy virtual machine environments in minutes. (e.g. Vagrant, &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Docker&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Configuration as Code:&lt;/b&gt; Simple, model-based workflows to scale app deployment and configuration management. (Ansible, Puppet, Chef)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Data Pipelines as Code:&lt;/b&gt; Programmatically author, schedule and monitor data pipeline workflows as code. (e.g. Apache Airflow, Jenkins)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Security Configuration as Code:&lt;/b&gt; Detect and remediate build &amp;amp; production security misconfigurations at scale. (e.g. Checkov)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Encryption Management as Code:&lt;/b&gt; Programmatically secure, store and tightly control access across cloud and data center (e.g. Vault, AWS KMS)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Application is always “releasable”:&lt;/b&gt; Because everything is code, it is always releasable at any point in time.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Rebuild vs. Repair:&lt;/b&gt; This is precisely the point of friction between developers and operations (AKA integration hell). For example, someone on the development team changes something in the application's development environment, or someone in operations changes something in staging or production. Either way, you end up with a product that reflects neither side. Instead of tweaking the end product, you should maintain a golden image that everyone modifies and that can be released at any time. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Continuous Monitoring:&lt;/b&gt; You should ensure your application stack is free of malware and security vulnerabilities, and that sensitive information like passwords or keys is not exposed to the public.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Automate Everything:&lt;/b&gt; Don’t do manually what can be done automatically, like configuration management and testing. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Rapid Feedback:&lt;/b&gt; Rapid feedback loops are what make good development teams. The goal behind having rapid feedback is to continuously remove bottlenecks. A simple example of a rapid feedback loop is a CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Delivery pipeline:&lt;/b&gt; A delivery pipeline automates the continuous deployment of a project. In a project's pipeline, sequences of stages retrieve input and run jobs, such as builds, tests, and deployments.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Continuous Integration/Continuous Delivery or Deployment(CI/CD):&lt;/b&gt; CI/CD is a method to frequently deliver apps to customers.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Continuous Integration:&lt;/b&gt; New code changes to an app are regularly built, tested, and merged into a shared repository. This solves the problem of having too many branches of an app in development at once that could conflict with one another.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Continuous Delivery:&lt;/b&gt; Applications are automatically bug tested and uploaded to a repository (e.g. GitHub, &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;DockerHub&lt;/a&gt;) where they can then be deployed to a live production environment.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Continuous Deployment:&lt;/b&gt; Automatically promoting a developer’s changes from the repository to production, where they are usable by customers. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps3l43tt6gehhln8dv3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps3l43tt6gehhln8dv3d.png" alt="devopsbestpractices"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Developing Applications With Monoliths vs. Microservices&lt;/h1&gt;

&lt;p&gt;Consider the following application: An airline company wants to create an application to book flights. There are three major areas of the application that should be created: Registration, Payment and Service Inquiry. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;Monolithic:&lt;/b&gt;&lt;br&gt;
With a monolithic architecture, there are some serious drawbacks. All three major areas must be tightly coupled in order to run.&lt;/p&gt;

&lt;p&gt;In the monolithic model, all of the modules and programs only work on one type of system and are dependent on one another. This makes new changes a challenge and drives up cost to scale. There is also low resilience in this system because if anything fails, the whole system fails, which can happen for any number of reasons. Maybe there is a hardware issue, or maybe the application receives more traffic than it is designed to handle. In both cases, the application crashes and the whole system is down.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Microservices:&lt;/b&gt;&lt;br&gt;
In a microservices architecture, you split your business function so each function has its own set of independent resources. Each function might have its own application and its own database. This means you can independently scale these services and that you are not limited to the technology that you use for your application stack. Autonomous services (microservices) make systems more resilient, flexible to changes, easy to scale and highly available. With cloud platforms like AWS, you can automate these processes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zlmlbq6nzve05jd7k56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zlmlbq6nzve05jd7k56.png" alt="microvsmacro"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Moving to the Cloud&lt;/h1&gt;

&lt;p&gt;A notable issue with using virtual machines in infrastructure is that they are not portable across hypervisors and thus do not provide portable packaging for applications. So in practice, there is no guarantee that applications will migrate smoothly from, say, an employee’s laptop to a virtual machine or to a public cloud. There are different OS layers and different stacks for each environment, so the portability is nonexistent. As explained before in &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;“Docker for Dummies”&lt;/a&gt;, containerization software like &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Docker&lt;/a&gt; offers a solution for this. OpenShift uses Red Hat Enterprise Linux, which has its own container daemon built in, eliminating the need for containerization applications like &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Docker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2thpkpdmbmfvg6vdcix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2thpkpdmbmfvg6vdcix.png" alt="rhelcoreos"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A conceptual way of thinking about it is in terms of microservices. If you have hundreds of microservices, you are not going to want to have hundreds of virtual machines for each service because there would be too much overhead. This is why containerization is necessary if businesses want to adopt a microservice architecture. Here is a simple view of the advantages of using containers vs. virtual machines:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo5ejsh7rlb30j575gse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo5ejsh7rlb30j575gse.png" alt="VMvsContainer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve talked about IT infrastructure, DevOps practices and how OpenShift is used to manage containers. We also discussed some advantages of using OpenShift, but how does OpenShift differ from &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes&lt;/a&gt;, and how do you get started?&lt;/p&gt;

&lt;p&gt;Please stay tuned for part 2 to learn more!&lt;/p&gt;

&lt;p&gt;If you are enjoying this series, please leave a like and a comment. Writing these posts takes a lot of work and I love to hear your feedback! Also, feel free to follow me here on &lt;a href="https://dev.to/stevenmcgown"&gt;Dev.to&lt;/a&gt; for more posts like these, and on &lt;a href="https://www.linkedin.com/in/steven-mcgown/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to get in contact with me!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>linux</category>
    </item>
    <item>
      <title>Kubernetes for Dummies</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Tue, 13 Jul 2021 14:58:08 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/kubernetes-for-dummies-5hmh</link>
      <guid>https://forem.com/stevenmcgown/kubernetes-for-dummies-5hmh</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtw4rwv0m4u30x3n1uwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtw4rwv0m4u30x3n1uwz.png" alt="k8slogo"&gt;&lt;/a&gt;&lt;br&gt;
After receiving much positive feedback on my post &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;“Docker for Dummies”&lt;/a&gt;, I wanted to create a post about Docker’s often-paired technology Kubernetes. If you haven’t read Docker for Dummies yet, please read it &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;here&lt;/a&gt;, and if you are already familiar with Kubernetes, consider reading my post about &lt;a href="https://dev.to/stevenmcgown/openshift-for-dummies-part-1-39f4"&gt;OpenShift&lt;/a&gt;. Understanding a container service like Docker is fundamental to having a good grasp of Kubernetes. In fact, Kubernetes is capable of managing other container runtimes that will not be covered in this post. In this post I will explain what Kubernetes is, what problems it solves with containers, and how you can get started using it today.&lt;/p&gt;

&lt;h3&gt;An Introduction to Kubernetes&lt;/h3&gt;

&lt;p&gt;Kubernetes is derived from the Greek word κυβερνήτης (kubernḗtēs), which means pilot or helmsman. The Kubernetes logo of a ship's steering wheel further reinforces the idea of piloting or managing, which is exactly what Kubernetes does with Docker containers. Kubernetes manages Docker containers in a variety of ways so it does not have to be done manually. Kubernetes is often referred to as K8s for simplicity because of the 8 letters between “K” and “s”. I will be referring to Kubernetes as K8s from here on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqa8y75sy1i4d8etqnw0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqa8y75sy1i4d8etqnw0.jpeg" alt="k8scomic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using K8s further abstracts machines, storage and networks from their physical implementation. As described in the last post, manually managing numerous containers can create similar issues to managing virtual machines. However, managing containers is especially important because cloud companies bill you for things like computing time and storage. You don’t want to have many running containers doing nothing for this reason. In addition, you also don’t want one container taking a network load it cannot handle by itself. K8s is designed to solve problems like these. &lt;/p&gt;

&lt;h3&gt;What services does K8s provide?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Service discovery and load balancing: &lt;/b&gt; K8s can locate a container using a DNS name or IP address and can distribute network traffic to other containers to stabilize deployments.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Storage orchestration: &lt;/b&gt;You can automatically mount a storage system of your choice, whether it be local, from a cloud provider such as AWS or GCP, or a network storage system such as NFS, iSCSI, Gluster, Ceph, Cinder, or Flocker.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Automated rollouts and rollbacks: &lt;/b&gt;You can define the desired state of deployed containers and change the state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources to the new container.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Automatic bin packing: &lt;/b&gt;You can provide K8s with a cluster of nodes to run containerized tasks and specify how much CPU and memory each container needs. Kubernetes can automatically fit containers onto nodes to make the best use of resources.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Self-healing: &lt;/b&gt;K8s restarts containers that fail, replaces containers, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Secret and configuration management: &lt;/b&gt;K8s lets you store sensitive information such as SSH keys, OAuth tokens and passwords. You can update these secrets and app configuration without rebuilding your container images and without exposing secrets in your stack configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will only scratch the surface of these features in this post.&lt;/p&gt;

&lt;h3&gt;Some Definitions&lt;/h3&gt;

&lt;p&gt;It is important to understand these basic K8s concepts. Again, you should also be familiar with container services such as &lt;a href="https://dev.to/stevenmcgown/docker-for-dummies-2bff"&gt;Docker&lt;/a&gt; before continuing.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Pods&lt;/b&gt; are groups of one or more containers. Pods have shared storage and network resources, with specifications on how to run the containers. They are the smallest deployable units of computing that you can create and manage using K8s. The containers in a Pod run together on a &lt;strong&gt;&lt;em&gt;node&lt;/em&gt;&lt;/strong&gt; as a logical unit, so they all share the same IP address but can reach each other via localhost. Pods can also share storage, but separate Pods do not have to run on the same machine, since a workload's containers can span multiple machines.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Nodes&lt;/b&gt; are physical or virtual machines that are not created by K8s. Typically, you would have several nodes in a cluster but you may have just one node in a learning or resource-limited environment. Nodes are created manually or with public cloud services such as AWS EC2 or OpenStack, so you need to have basic infrastructure laid down before you use K8s to deploy applications. From this point you can define virtual networks, storage, etc. One node can run multiple pods.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Deployments&lt;/b&gt; are a set of Pods. A Deployment ensures that a sufficient number of Pods are running at one time to service the app. Deployments can also shut down Pods that are not needed by looking at metrics such as CPU utilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx4juywf3ih7dajimdrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx4juywf3ih7dajimdrx.png" alt="kubernetesmodel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Let's Get Started Using K8s&lt;/h1&gt;

&lt;p&gt;To run K8s locally, I will be using Minikube and Kubectl. You can install the latest version of Minikube at &lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;https://minikube.sigs.k8s.io/docs/start/&lt;/a&gt; and Kubectl at &lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/tools/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please note that you must also have Docker installed to move forward with this tutorial.&lt;/p&gt;

&lt;h3&gt;1) Install prerequisites&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;docker --version&lt;/code&gt;, &lt;code&gt;minikube version&lt;/code&gt;, and &lt;code&gt;kubectl version&lt;/code&gt; to confirm each tool is installed. Don't worry about the message stating the connection to localhost:8080 was refused; we will address this later. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6zjri8deatfiq84551s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6zjri8deatfiq84551s.png" alt="version"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;2) Create nodes with Minikube&lt;/h3&gt;

&lt;p&gt;To create nodes, start up Minikube. For example, starting Minikube with 2 nodes:&lt;br&gt;
&lt;code&gt;minikube start --nodes=2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will take a while the first time around, so be patient. The first time I ran this it easily took 5-10 minutes to complete.&lt;/p&gt;

&lt;p&gt;We can check the status of our nodes using &lt;code&gt;minikube status&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzab7bdmf42qddwdqq0r9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzab7bdmf42qddwdqq0r9.png" alt="minikube status"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first node is the master node. You can see that it has the control plane, running host, kubelet, API, and kubeconfig configured. The second node is the worker node.&lt;/p&gt;

&lt;p&gt;We can see that we have 2 containers running if we run:&lt;br&gt;
&lt;code&gt;docker ps&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnghjs7gp9qdcv6s8hrkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnghjs7gp9qdcv6s8hrkt.png" alt="dockerps"&gt;&lt;/a&gt;&lt;br&gt;
This image shows the master node as ce3359246578 and the worker node as 44697ff120e4, with relevant information for both nodes.&lt;/p&gt;

&lt;p&gt;We can view our nodes using:&lt;br&gt;
&lt;code&gt;kubectl get nodes&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r7op750n5z6zeubn1x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r7op750n5z6zeubn1x5.png" alt="kubectlgetnodes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By running &lt;code&gt;kubectl get pods -A&lt;/code&gt;, we can retrieve all pods in all namespaces.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgren5pqjatamdf9uiddz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgren5pqjatamdf9uiddz.png" alt="allpods"&gt;&lt;/a&gt;&lt;br&gt;
All of these pods make up the control plane. For example, kube-apiserver-minikube is the API server exposed for external and internal communication; when we type a kubectl command, this is the server that handles the request. &lt;/p&gt;
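&lt;p&gt;A quick way to confirm where that API server is listening:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Prints the control plane endpoint and core cluster services
kubectl cluster-info
&lt;/code&gt;&lt;/pre&gt;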

&lt;h3&gt;3) Creating a deployment&lt;/h3&gt;

&lt;p&gt;We can check the Pods by typing &lt;code&gt;kubectl get pod&lt;/code&gt;. At this point, you should not have any Pods and it will read “No resources found in default namespace”&lt;/p&gt;

&lt;p&gt;Pods are the smallest unit of the K8s cluster, but in practice you don’t create Pods directly but rather deployments. To create a Kubernetes deployment, the usage is &lt;code&gt;kubectl create deployment &amp;lt;NAME&amp;gt; --image=&amp;lt;image&amp;gt;&lt;/code&gt;. For this particular deployment, we will create an nginx deployment:&lt;br&gt;
&lt;code&gt;kubectl create deployment nginx-depl --image=nginx&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For those who don't know, NGINX is an open-source web server that is commonly used to serve content and proxy traffic for server-side applications.&lt;/p&gt;

&lt;p&gt;Now when we run &lt;code&gt;kubectl get deployment&lt;/code&gt; and &lt;code&gt;kubectl get pod&lt;/code&gt; we get the following output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek3bwq8hlplrg438d8e1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek3bwq8hlplrg438d8e1.png" alt="getpodanddeployment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our usage of &lt;code&gt;kubectl create deployment &amp;lt;NAME&amp;gt; --image=&amp;lt;image&amp;gt;&lt;/code&gt; is the most minimalistic way to create a deployment; the rest of the deployment uses the default configuration. Between the Deployment and the Pod there is another layer, automatically managed by the K8s Deployment, called the &lt;strong&gt;&lt;em&gt;ReplicaSet&lt;/em&gt;&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;The ReplicaSet specifies how to identify Pods that it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a Pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.&lt;/p&gt;

&lt;p&gt;We can view the ReplicaSet with &lt;code&gt;kubectl get replicaset&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2corya2vobwofgnlwrd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2corya2vobwofgnlwrd0.png" alt="replicaset"&gt;&lt;/a&gt;&lt;br&gt;
We can see the ID for the ReplicaSet appended to the deployment name: 5c8bf76b5b. You may notice that the ReplicaSet ID is included in the ID for the Pod; as stated before, the ReplicaSet is a layer that sits between the Deployment and the Pod.&lt;/p&gt;

&lt;p&gt;So, altogether, this is how the layers of abstraction work: a Deployment manages a ReplicaSet, a ReplicaSet manages the replicas of a Pod, and a Pod is an abstraction of a container. &lt;/p&gt;
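&lt;p&gt;To watch these layers maintain state, scale the Deployment and let its ReplicaSet create Pods from the template; a short sketch using the deployment we just made:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Ask the ReplicaSet to maintain 3 replicas instead of 1
kubectl scale deployment nginx-depl --replicas=3

# Two new pods appear, all sharing the same ReplicaSet ID in their names
kubectl get replicaset
kubectl get pod
&lt;/code&gt;&lt;/pre&gt;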

&lt;h3&gt;4) Edit the Deployment&lt;/h3&gt;

&lt;p&gt;Edit your deployment using:&lt;br&gt;
&lt;code&gt;kubectl edit deployment nginx-depl&lt;/code&gt;&lt;br&gt;
This will show the auto-generated configuration file. Don't worry, you don't need to understand everything in the config file at this time. For the sake of this tutorial, we will only edit the image version, which can be found somewhere in the middle of the file.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9afk2q9rl8g3fd1chs5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9afk2q9rl8g3fd1chs5e.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
When you are finished editing, type :wq for write &amp;amp; quit. This will terminate the old Pod and create a new one with the updated image.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0022ap79ut2ff86qi362.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0022ap79ut2ff86qi362.png" alt="new"&gt;&lt;/a&gt;&lt;br&gt;
Upon invoking &lt;code&gt;kubectl get replicaset&lt;/code&gt;, we can see that the old one has no pods in it and a new one has been created as well.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0yq7n3moroyujwnqv6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0yq7n3moroyujwnqv6g.png" alt="newreplicaset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;5) Debugging Pods&lt;/h3&gt;

&lt;p&gt;Another useful command is &lt;code&gt;kubectl logs &amp;lt;Pod Name&amp;gt;&lt;/code&gt;.&lt;br&gt;
If you run this on nginx, you will get nothing because nginx has not logged anything yet. To demonstrate logs, we can use MongoDB, which is a document database.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl create deployment mongo-depl --image=mongo&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib5w6v7kqlkzjrzt4ui1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib5w6v7kqlkzjrzt4ui1.png" alt="created"&gt;&lt;/a&gt;&lt;br&gt;
Executing &lt;code&gt;kubectl logs mongo-depl-5fd6b7d4b4-vbf2t&lt;/code&gt; will produce concise logs and &lt;code&gt;kubectl describe pod mongo-depl-5fd6b7d4b4-vbf2t&lt;/code&gt; will produce a more verbose output.&lt;/p&gt;

&lt;p&gt;Logging will help with debugging if something goes wrong, and describe produces something a little more intelligible.&lt;/p&gt;

&lt;p&gt;Another useful command for seeing what's going on inside the Pod is &lt;code&gt;kubectl exec -it &amp;lt;Pod Name&amp;gt; -- /bin/bash&lt;/code&gt; (-it stands for interactive terminal).&lt;/p&gt;

&lt;p&gt;Suppose we want to use this to enter our MongoDB Pod:&lt;br&gt;
&lt;code&gt;kubectl exec -it mongo-depl-5fd6b7d4b4-vbf2t -- /bin/bash&lt;/code&gt;, and if we type &lt;code&gt;ls&lt;/code&gt; we can see our directories. To exit, simply type &lt;code&gt;exit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugafn2aw5h0yeyrcrcer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugafn2aw5h0yeyrcrcer.png" alt="root"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;6) Deleting deployments&lt;/h3&gt;

&lt;p&gt;Deleting a Deployment will delete all of the Pods inside of the Deployment. For example, to delete the MongoDB Deployment type:&lt;br&gt;
&lt;code&gt;kubectl delete deployment mongo-depl&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;The following commands are useful, but you should be careful not to delete anything important&lt;/h1&gt;

&lt;p&gt;You can delete all the pods in a single namespace with this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete --all pods --namespace=foo&lt;/code&gt;&lt;br&gt;
You can also delete all Deployments in a namespace, which will delete all of the Pods attached to those Deployments:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete --all deployments --namespace=foo&lt;/code&gt;&lt;br&gt;
You can delete all namespaces and every object in every namespace (but not un-namespaced objects, like nodes and some events) with this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete --all namespaces&lt;/code&gt;&lt;br&gt;
However, the latter command is probably not something you want to run, since it will delete things in the kube-system namespace, which will make your cluster unusable.&lt;/p&gt;

&lt;p&gt;This command will delete all the namespaces except kube-system, which might be useful (because the jsonpath output is a single space-separated line, we split it with &lt;code&gt;tr&lt;/code&gt; before filtering with &lt;code&gt;grep&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;for each in $(kubectl get ns -o jsonpath="{.items[*].metadata.name}" | tr ' ' '\n' | grep -v kube-system);&lt;br&gt;
do&lt;br&gt;
  kubectl delete ns "$each"&lt;br&gt;
done&lt;/code&gt;&lt;/p&gt;
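&lt;p&gt;If you want to preview what the loop would delete before actually running it, you can add a client-side dry run (a sketch; &lt;code&gt;--dry-run=client&lt;/code&gt; is available in kubectl 1.18 and later):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;for each in $(kubectl get ns -o jsonpath="{.items[*].metadata.name}" | tr ' ' '\n' | grep -v kube-system);&lt;br&gt;
do&lt;br&gt;
  # prints what would be deleted without touching the cluster&lt;br&gt;
  kubectl delete ns "$each" --dry-run=client&lt;br&gt;
done&lt;/code&gt;&lt;/p&gt;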

&lt;h3&gt;7) Apply configuration files&lt;/h3&gt;

&lt;p&gt;To apply a configuration file, we must first create one. In a directory you can refer back to, create a config file for the nginx Deployment.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;touch nginx-deployment.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, copy and paste this configuration to the file:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;apiVersion: apps/v1&lt;br&gt;
kind: Deployment&lt;br&gt;
metadata:&lt;br&gt;
  name: myapp&lt;br&gt;
  labels:&lt;br&gt;
    app: nginx&lt;br&gt;
spec:&lt;br&gt;
  replicas: 1&lt;br&gt;
  selector:&lt;br&gt;
    matchLabels:&lt;br&gt;
      app: nginx&lt;br&gt;
  template:&lt;br&gt;
    metadata:&lt;br&gt;
      labels:&lt;br&gt;
        app: nginx&lt;br&gt;
    spec:&lt;br&gt;
      containers:&lt;br&gt;
      - name: nginx&lt;br&gt;
        image: nginx:1.16&lt;br&gt;
        ports:&lt;br&gt;
        - containerPort: 80&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Everything under "template" is the blueprint for the Pods. The first spec tag is specification for Deployments, and the second spec tag is specification for the Pods.&lt;/p&gt;

&lt;p&gt;This config file states that we want one container inside the Pod running an NGINX image, with the container exposing port 80.&lt;/p&gt;

&lt;p&gt;When we use &lt;code&gt;kubectl apply -f nginx-deployment.yaml&lt;/code&gt;, it creates a Deployment using the configuration. Let's say that in the config file we changed the Deployment to create 4 replicas instead of 1.&lt;/p&gt;

&lt;p&gt;After typing &lt;code&gt;kubectl get pod&lt;/code&gt; and &lt;code&gt;kubectl get deployment&lt;/code&gt; we get the following output:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxttnadsogw1zlycjkwy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxttnadsogw1zlycjkwy6.png" alt="afterchange"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, K8s knows whether to create a new Deployment or update the existing one.&lt;/p&gt;
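&lt;p&gt;To try this yourself, edit the replica count and re-apply the same file (a sketch; the &lt;code&gt;sed&lt;/code&gt; one-liner assumes the file still contains &lt;code&gt;replicas: 1&lt;/code&gt; exactly as shown above):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# bump the Deployment from 1 replica to 4&lt;br&gt;
sed -i 's/replicas: 1/replicas: 4/' nginx-deployment.yaml&lt;br&gt;
kubectl apply -f nginx-deployment.yaml&lt;br&gt;
# three new Pods appear alongside the original&lt;br&gt;
kubectl get pod&lt;br&gt;
kubectl get deployment&lt;/code&gt;&lt;/p&gt;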

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;By following this article, you should have gained a good understanding of the fundamentals of Kubernetes. In summary, we learned what Kubernetes is and what it can do at a high level. We learned what services K8s provides as well as some important definitions. We then used Docker, Minikube and Kubectl to explore CRUD commands for Deployments. Finally, we learned how to debug Pods and use configuration files for Deployments. It should be mentioned that you can also use kubectl for Services, Volumes, and any other K8s component.&lt;/p&gt;
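&lt;p&gt;For example, the same kubectl verbs we used for Deployments work on those other resource types as well (a quick sketch):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get services&lt;br&gt;
kubectl get pv          # PersistentVolumes&lt;br&gt;
kubectl describe service &amp;lt;Service Name&amp;gt;&lt;/code&gt;&lt;/p&gt;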

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkplzmfqfc03jnqvklv9m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkplzmfqfc03jnqvklv9m.jpg" alt="k8scomic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Final thoughts&lt;/h3&gt;

&lt;p&gt;I hope you enjoyed following along, and I hope you learned something! Kubernetes is a very useful tool for managing containers, so I'm glad you made it this far. You can always find much more thorough documentation at &lt;a href="https://kubernetes.io/docs/home/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/home/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you would like me to expand on this lesson in the future, if you have any questions, or if I missed something, please let me know. I really value your feedback, so please leave me a comment!&lt;/p&gt;

&lt;p&gt;P.S.&lt;br&gt;
I am currently looking for a job in DevOps! If you know someone who is hiring entry-level DevOps engineers please send them my resume which can be found at &lt;a href="https://smcgown.com" rel="noopener noreferrer"&gt;https://smcgown.com&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/steven-mcgown/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/steven-mcgown/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>devops</category>
      <category>linux</category>
    </item>
    <item>
      <title>Docker for Dummies</title>
      <dc:creator>StevenMcGown</dc:creator>
      <pubDate>Fri, 09 Jul 2021 00:23:18 +0000</pubDate>
      <link>https://forem.com/stevenmcgown/docker-for-dummies-2bff</link>
      <guid>https://forem.com/stevenmcgown/docker-for-dummies-2bff</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4psm82g8sqb647mi83kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4psm82g8sqb647mi83kr.png" alt="dockerlogo"&gt;&lt;/a&gt;&lt;br&gt;
Docker is one of those services that you always hear about but may have never used. I never used Docker in college, and I had actually never heard of it until I began researching the field of DevOps. Knowing how to use Docker is an essential part of joining a modern development team. My goal for this post is to help the reader understand what Docker is, why enterprise teams are adopting it today, and how to get started using it. &lt;/p&gt;

&lt;p&gt;If you already know how to use Docker, consider reading my post about &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;managing containers using Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Questions to answer:&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;What problems does Docker solve?&lt;/li&gt;
&lt;li&gt;What are containers?&lt;/li&gt;
&lt;li&gt;What is the difference between a container and a VM?&lt;/li&gt;
&lt;li&gt;What is the difference between images and containers?&lt;/li&gt;
&lt;li&gt;How does Docker help create applications?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Why Docker?&lt;/h3&gt;

&lt;p&gt;The need for Docker arises from running virtual machines on servers at large scale. Take a large business, for example: one that uses hundreds of servers with a cluster of virtual machines for each of its platforms. Maintaining these machines is a full-time job. Each server has to have an OS installed, it needs upgrades and patches from time to time, and the dependencies for the applications each machine runs also have to be installed. &lt;/p&gt;

&lt;p&gt;You can see why this quickly becomes very complex. Manually configuring these servers is not feasible, so many companies keep a list of servers that they programmatically update. This can work; however, the list of servers is shared between a team of people, and it does not always stay up to date. Some servers never receive updates, and consequently errors may arise which impact system performance. Finding one faulty server in a room of hundreds can also be a troubleshooting nightmare. How does Docker solve this?&lt;/p&gt;

&lt;h3&gt;Docker to the rescue!&lt;/h3&gt;

&lt;p&gt;Rather than running applications on virtual machines, you can upload Docker images to your server. When an image fails, you just upload a new one. There is no need to worry about configuration because the image exists as an exact replica of the original configuration. In this way, you do not have to worry about installing application dependencies or OS patches because they have already been configured in your Docker image. The Docker setup frees you from treating servers as pets, constantly monitored and cared for, and lets you treat them as something more ephemeral; it is okay if the image fails, because you can just replace it. "What is an image, and why is it a better fit than a virtual machine?" you might ask. This term will make more sense as we move forward.&lt;/p&gt;

&lt;p&gt;Docker is also great for developers. It means no more "It works on my machine," since all the developers are developing against the same stack maintained in the Dockerfile.&lt;/p&gt;

&lt;h3&gt;How does Docker streamline the development process?&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;CI/CD:&lt;/b&gt; You can consistently test and deploy your code to different environments in the development process (staging, user acceptance testing, production) without the hassle of configuring various testing environments.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Versioning:&lt;/b&gt; Docker also helps with versioning, as you can save different versions of software on repositories and check them out later if needed. This eliminates the need for changing versions of software when running an older version of an application.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Roll Forward:&lt;/b&gt; When defects are found, there is no need to patch or update the application. You just need to use a new image.&lt;/p&gt;
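&lt;p&gt;As a concrete illustration of versioning and rolling forward, image tags let you pin or swap versions without touching the host (a sketch using public nginx tags on Docker Hub):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker pull nginx:1.16     # pin a specific version&lt;br&gt;
docker pull nginx:latest   # or take the newest image&lt;br&gt;
# rolling forward is just running a newer tag&lt;br&gt;
docker run -d nginx:1.17&lt;/code&gt;&lt;/p&gt;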

&lt;h3&gt;What is the difference between an image and a container?&lt;/h3&gt;

&lt;p&gt;Docker images and containers are closely related; however, they are distinct. Docker images are immutable, meaning they cannot be changed. I have explained previously that these images can be uploaded to servers in place of running applications directly on an OS. Images contain the source code, libraries, dependencies, tools and other files that the application needs to run. When using Docker, we start with a base image. Because images can become quite large, they are designed to be composed of layers of other images, so that only a minimal amount of data needs to be sent when transferring an image over the network.&lt;/p&gt;
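&lt;p&gt;You can inspect an image's layers yourself (a sketch; any locally pulled image works):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker pull nginx:1.16&lt;br&gt;
docker history nginx:1.16   # one row per layer, with sizes&lt;/code&gt;&lt;/p&gt;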

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jxn6fjons4ggrfrnxek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jxn6fjons4ggrfrnxek.png" alt="imageandcontainer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A running instance of an image is called a container. Containers are running instances with a writable top layer, and they run the actual applications. When a container is deleted, its writable layer is also deleted, but the underlying image remains the same. The main takeaway is that you can have many running containers off of the same image. A good way to think about images and containers is with this metaphor: images are the recipe for a cake, and containers are the cakes you bake. With one recipe you can bake as many cakes as your resources allow; with one image you can run as many containers as your resources allow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7fkz3s61nzd1t34px5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7fkz3s61nzd1t34px5.jpeg" alt="docker cake"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;What is the difference between virtual machines and containers?&lt;/h3&gt;

&lt;p&gt;Consider the layout of a typical VM fleet: Virtual machines are managed through a hypervisor, which runs on a host OS that is installed on server hardware. The hypervisor virtualizes hardware that virtual machines use to run their operating systems (Guest OS). So basically the server has a host OS, and the virtual machines themselves have a complete operating system installed.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabs51w3ej6bhe3hzxv1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabs51w3ej6bhe3hzxv1s.png" alt="helloworld"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes a container different is that the container does not have a Guest OS. Instead, the container virtualizes at the operating-system level, sharing the host's kernel. Inside this container you can build whatever you want. The advantages of using containers over virtual machines are fast boot times and portability. &lt;/p&gt;

&lt;h3&gt;Building images with Dockerfiles&lt;/h3&gt;

&lt;p&gt;As you can see, Docker helps ease the hassle of installation and configuration. Let's look at a sample Docker command:&lt;br&gt;
&lt;code&gt;sudo docker run docker/whalesay cowsay Hello-World!&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f426u44863qi5wy9hjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f426u44863qi5wy9hjf.png" alt="images"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the Docker image did not initially exist locally, so it had to be pulled from the docker/whalesay repository on Docker Hub. You can also see that the image consists of multiple layers: e190868d63f8, 909cd34c6fd7, etc. To create an image of our own, we can write a Dockerfile. Once this file is completed, we will use &lt;code&gt;docker build [OPTIONS] PATH | URL | -&lt;/code&gt; to create our image.&lt;/p&gt;

&lt;p&gt;A Dockerfile can be created using &lt;code&gt;touch Dockerfile&lt;/code&gt; and can be edited using your favorite text editor. Notice that this file is created without an extension; this is intentional.&lt;/p&gt;

&lt;p&gt;In your Dockerfile, type the following code:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cdf79op8zdso1wbt0zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cdf79op8zdso1wbt0zq.png" alt="dockerfile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The FROM statement declares what image your new image will be based on. For this sample project, I will be using the ubuntu image. However, if you want to create a Docker image from scratch, you can simply write &lt;code&gt;FROM scratch&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;LABEL is used to apply metadata to Docker objects. In this case, you can use LABEL to specify the maintainer of the Docker image. MAINTAINER was once used for this, but it has since been deprecated. &lt;/p&gt;

&lt;p&gt;RUN is used to execute commands during the building of the image, while CMD is executed only when a container is started from the image.&lt;/p&gt;
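&lt;p&gt;Putting these instructions together, a minimal Dockerfile along the lines described above would look like this (a sketch; the exact commands in the screenshot may differ, and the maintainer label and echo message here are placeholders):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;FROM ubuntu&lt;br&gt;
LABEL maintainer="you@example.com"&lt;br&gt;
# RUN executes while the image is being built&lt;br&gt;
RUN apt-get update&lt;br&gt;
# CMD executes when a container is started from the image&lt;br&gt;
CMD ["echo", "Hello-World!"]&lt;/code&gt;&lt;/p&gt;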

&lt;p&gt;In the directory of your Dockerfile, type &lt;code&gt;docker build .&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgovtflb64shwx535x4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfgovtflb64shwx535x4.png" alt="dockerbuild"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first time you build, every command in the Dockerfile is executed. Each command's result is cached as a layer, so if you edit the file, only the edited command (and the commands after it) need to be rebuilt. After editing the echo command of our Dockerfile, we will also give the Docker image a name and the 'latest' tag. &lt;br&gt;
&lt;code&gt;docker build -t helloworld:latest .&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkxhwrf7leudhcytr591.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkxhwrf7leudhcytr591.png" alt="built"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To run your image, first find the image name by running &lt;code&gt;docker images&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihsb09q9vjalops1fhuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihsb09q9vjalops1fhuk.png" alt="dockerimages"&gt;&lt;/a&gt;&lt;br&gt;
Note that you can run a Docker image by its image ID or by its name and tag. If you run by name only, Docker will default to the 'latest' tag.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run helloworld:latest&lt;/code&gt; and &lt;code&gt;docker run 4d6c8eea04c9&lt;/code&gt;&lt;br&gt;
produce the same output in this case.&lt;/p&gt;

&lt;p&gt;And there you have it! You have created your first Docker image. You can find other images on &lt;a href="https://hub.docker.com" rel="noopener noreferrer"&gt;https://hub.docker.com&lt;/a&gt; and documentation at &lt;a href="https://docs.docker.com/" rel="noopener noreferrer"&gt;https://docs.docker.com/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Extra Credit!&lt;/h3&gt;

&lt;p&gt;I suggest pushing your newly created docker image to DockerHub if you would like to share your Docker images. First, create a DockerHub account at &lt;a href="https://hub.docker.com" rel="noopener noreferrer"&gt;https://hub.docker.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in to your DockerHub account using the CLI:&lt;br&gt;
&lt;code&gt;docker login&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can also log out from your CLI using:&lt;br&gt;
&lt;code&gt;docker logout&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, you can push your newly created Docker image by first tagging it. &lt;code&gt;docker tag&lt;/code&gt; takes the source image and the target image as its two arguments:&lt;br&gt;
&lt;code&gt;docker tag helloworld:latest &amp;lt;DOCKER_HUB_USERNAME&amp;gt;/dockerhub:myfirstimagepush&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, push the docker image:&lt;br&gt;
&lt;code&gt;docker push &amp;lt;DOCKER_HUB_USERNAME&amp;gt;/dockerhub:myfirstimagepush&lt;/code&gt;&lt;br&gt;
You should receive a SHA-256 digest indicating the push was successful.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbqmblkzy4lvpu6wf4pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbqmblkzy4lvpu6wf4pf.png" alt="dockerpush"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you think you're ready to manage your Docker containers using Kubernetes, read my post about &lt;a href="https://dev.to/stevenmcgown/kubernetes-for-dummies-5hmh"&gt;Kubernetes&lt;/a&gt; and then read about &lt;a href="https://dev.to/stevenmcgown/openshift-for-dummies-part-1-39f4"&gt;OpenShift&lt;/a&gt;, the industry's most secure and comprehensive enterprise-grade container platform!&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;I hope that this post can help anyone who feels like they aren't ready to learn about Docker. Oftentimes, the most difficult part of getting something done is starting it. Please let me know if this helped or if I missed anything!&lt;/p&gt;

&lt;p&gt;P.S.&lt;br&gt;
I am currently looking for a job in DevOps! If you know someone who is hiring entry-level DevOps engineers please send them my resume which can be found at &lt;a href="https://smcgown.com" rel="noopener noreferrer"&gt;https://smcgown.com&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/steven-mcgown/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/steven-mcgown/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>beginners</category>
      <category>devops</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
