<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: maninekkalapudi</title>
    <description>The latest articles on Forem by maninekkalapudi (@maninekkalapudi).</description>
    <link>https://forem.com/maninekkalapudi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F86522%2Ff1196183-724b-486b-a22c-2433dc8f01b4.jpg</url>
      <title>Forem: maninekkalapudi</title>
      <link>https://forem.com/maninekkalapudi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maninekkalapudi"/>
    <language>en</language>
    <item>
      <title>Book Review- Fundamentals of Data Engineering</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Mon, 31 Jul 2023 17:02:49 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/book-review-fundamentals-of-data-engineering-3f98</link>
      <guid>https://forem.com/maninekkalapudi/book-review-fundamentals-of-data-engineering-3f98</guid>
      <description>&lt;p&gt;Hi! Hope youre doing well. Let me walk you through whats going on in my head when I need to explain What is Data Engineering? And What has been going on with it recently?.&lt;/p&gt;

&lt;p&gt;Where do I start? And how do I avoid killing an enthusiast or a friend with a tedious explanation? Should I start from the beginning of time? Or from the beginning of all data? Although I did not live through all the eras of data, I tend to wander off onto a lot of things in the field while explaining it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;After more than 4 years in the data field and two major industry shifts, I'm somewhat comfortable talking about my experience in the Big Data era (the Hadoop ecosystem) and in modern data engineering (Spark on Databricks, the cloud and data modelling).&lt;/p&gt;

&lt;p&gt;But how can all of this be contextualized by beginners and experienced folks alike, whether for skills or for where the industry is headed? A lot of buzzwords and fancy tools are thrown around without the right context, IMO.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/"&gt;The Fundamentals of Data Engineering book&lt;/a&gt; provides the rightful context about Data Engineering (DE) from the past, the present and the future. Lets dive in!&lt;/p&gt;

&lt;h1&gt;
  
  
  What is this book about?
&lt;/h1&gt;

&lt;p&gt;At the very beginning of the book, the authors mention that they are recovering data scientists turned data engineers. Much like many in the analytics industry, they started with data science before moving into data engineering roles.&lt;/p&gt;

&lt;p&gt;The book expresses the fundamentals of data engineering in a relatively unopinionated manner. Everything about data platforms, architectures, processes, tools and managed services is put into the context of the broader data engineering lifecycle, without arguing that one tool is better than another.&lt;/p&gt;

&lt;p&gt;The book also introduces some of the ideas central to the DE field, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Architecture is a living entity within the data engineering lifecycle and evolves based on the requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use standardized tools and services mostly and build custom tools and services for competitive advantage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data maturity of an organization, i.e., how well it utilizes data for business use cases, matters more than having the latest and greatest tools in the data architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Data Engineering team must collaborate with all the stakeholders from upstream systems to downstream data consumers to understand systems and automate the data serving&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and many more.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Engineering Lifecycle
&lt;/h1&gt;

&lt;p&gt;The bulk of the book covers the different phases of the data engineering lifecycle. As stated in the book, the data engineering lifecycle comprises stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KexrA-v8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822238904/be7e9cd7-cf60-4714-8bf5-8819a9507241.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KexrA-v8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822238904/be7e9cd7-cf60-4714-8bf5-8819a9507241.png" alt="" width="600" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stages of the data engineering lifecycle are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt; - How the data is generated, type of the system, frequency of data generation etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt; - How to store the generated data, choose the right data storage system etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt; - Types of ingestion, frequency, ETL vs ELT, CDC etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformation&lt;/strong&gt; - Transform the data to a required format, tools available for transformation, the role of SQL in data transformations and data modeling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serving&lt;/strong&gt; - Data served to different stakeholders, Reverse ETL etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
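
&lt;p&gt;As a rough illustration, the stages above can be strung together in a few lines of Python. This is a toy sketch of my own, not from the book; every function and record here is hypothetical:&lt;/p&gt;

```python
# A toy sketch of the data engineering lifecycle stages.
# All names and data are hypothetical, for illustration only.

def generate():
    # Generation: a source system emits raw events
    return [
        {"user": "a", "amount": 30},
        {"user": "b", "amount": 70},
        {"user": "a", "amount": 30},  # duplicate event
    ]

def ingest(events, raw_store):
    # Ingestion: land the events, as received, in raw storage
    raw_store.extend(events)

def transform(raw_store):
    # Transformation: deduplicate and shape the data
    seen = []
    for event in raw_store:
        if event not in seen:
            seen.append(event)
    return seen

def serve(clean):
    # Serving: expose an aggregate for analysts
    total = sum(event["amount"] for event in clean)
    return {"total_amount": total}

raw = []                       # Storage: the raw layer
ingest(generate(), raw)
report = serve(transform(raw))
print(report)                  # {'total_amount': 100}
```

&lt;p&gt;Real pipelines replace each of these functions with whole systems (queues, object stores, warehouses, BI tools), but the shape of the lifecycle stays the same.&lt;/p&gt;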

&lt;p&gt;While performing all the necessary activities to serve data to the stakeholders, data engineering teams must also take care of the undercurrents like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Security - Security of the data at rest, in transit, and in some cases during processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Management - How data is stored and exposed to various stakeholders&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DataOps - data quality, governance, and security&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Architecture - reflects the current and future state of the data systems that support an organization's long-term data needs and strategy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Orchestration - the central hub that coordinates workflows across various systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Software Engineering - Coding and DevOps practices. Writing production-grade code and scaling backend systems for data applications&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors mention a variety of tools for each topic in the data engineering lifecycle and explain their benefits and tradeoffs at a high level. This gives us enough context to understand the tools, but one should go much deeper than what's mentioned to understand them completely.&lt;/p&gt;

&lt;h1&gt;
  
  
  Future of Data Engineering
&lt;/h1&gt;

&lt;p&gt;The data industry has seen many small to large transformations thanks to managed services on the cloud. &lt;a href="https://aws.amazon.com/what-is/batch-processing/#:~:text=Batch%20processing%20is%20the%20method,run%20on%20individual%20data%20transactions."&gt;Batch processing&lt;/a&gt; and &lt;a href="https://aws.amazon.com/compare/the-difference-between-etl-and-elt"&gt;ETL and ELT&lt;/a&gt; have served the data industry well for a long time, but they are showing their age in recent times.&lt;/p&gt;

&lt;p&gt;The next major transformation around the corner is near &lt;a href="https://www.hpe.com/us/en/what-is/real-time-processing.html#:~:text=Real%2Dtime%20processing%20is%20a,to%20maintain%20real%2Dtime%20insights."&gt;real-time data processing&lt;/a&gt; and serving. The book discusses this at length and offers a practical view of the trend.&lt;/p&gt;

&lt;p&gt;One thing that won't change for the foreseeable future is the data engineering lifecycle itself. It is fundamental to how data is generated, stored, processed and served, and it will remain relevant in the streaming era as well.&lt;/p&gt;

&lt;p&gt;The picture below illustrates the essence and complexity of data processes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XQ5bAksX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822323601/1180e647-66af-4f6e-b792-11319d461dad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XQ5bAksX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822323601/1180e647-66af-4f6e-b792-11319d461dad.png" alt="" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The role of data engineers currently looks like the picture below. There are many definitions of what a data engineer is and what they should do; I think it depends highly on the organization, the team and who they primarily serve the data to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t5adH_uh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822342927/732caf41-4a1f-4d27-9d6d-3b6ddc1b6c83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t5adH_uh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1690822342927/732caf41-4a1f-4d27-9d6d-3b6ddc1b6c83.png" alt="" width="600" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expect the definitions and roles to change a bit as the industry finds new ways to produce and consume data in the future.&lt;/p&gt;

&lt;h1&gt;
  
  
  Who is this book for?
&lt;/h1&gt;

&lt;p&gt;This book is for data engineers, of course, and for any data professionals who would like to understand where the industry is headed.&lt;/p&gt;

&lt;p&gt;This book offers a very broad view of the data engineering field. If you are an experienced professional who would like to go in-depth on certain topics, it's better to pick books or material specific to those topics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;If you're a beginner or someone trying to understand what's going on in the data engineering field, this book helps you contextualize a lot of the jargon and trends in it. From there, you can expand your knowledge on the topics that are relevant and in demand.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Mani&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is Data Engineering?</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sat, 12 Nov 2022 17:21:13 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/what-is-data-engineering-209h</link>
      <guid>https://forem.com/maninekkalapudi/what-is-data-engineering-209h</guid>
      <description>&lt;h1&gt;
  
  
  Intro
&lt;/h1&gt;

&lt;p&gt;Hello all! Hope you are doing well.&lt;/p&gt;

&lt;p&gt;In the last couple of years, you might have heard a lot about data engineering. It has gained a lot of buzz in recent times, and every company wanted data engineers. Needless to say, the demand for data engineers was at an all-time high.&lt;/p&gt;

&lt;p&gt;But what is data engineering though? And why do we need it? Let's understand all of that in this post with the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Data Engineering? And why do we need it?&lt;/li&gt;
&lt;li&gt;Responsibilities of a Data Engineering Team&lt;/li&gt;
&lt;li&gt;Challenges&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  What is Data Engineering? And why do we need it?
&lt;/h1&gt;

&lt;p&gt;Simply put, data engineering deals with collecting, storing and processing data in a data warehouse, and serving that data to various stakeholders.&lt;/p&gt;

&lt;p&gt;Data is generated in a company by different teams in a variety of systems like databases, APIs, streaming events, file servers, etc. This is the data required by different teams to carry out various analyses.&lt;/p&gt;

&lt;p&gt;Generally, the incoming data arrives in different formats and sizes from different sources and is stored in an archival/analytics system like a data warehouse or a data lake. Once the data is in the warehouse, it is cleaned and transformed into a format mutually agreed upon by the stakeholders.&lt;/p&gt;

&lt;p&gt;The data engineering team builds and maintains the pipelines and processes, like ETL/ELT, for ingesting and transforming all the data received into the data warehouse.&lt;/p&gt;

&lt;p&gt;The end goal of data engineering efforts is analytics-ready, clean data.&lt;/p&gt;
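
&lt;p&gt;To make that flow concrete, here is a minimal, hypothetical sketch in Python: two sources in different formats are cleaned into one mutually agreed shape, and records that fail cleaning are dropped. The data and names are invented for illustration:&lt;/p&gt;

```python
# Hypothetical records from two different source systems
db_rows = [("2023-01-01", "a", "30"), ("2023-01-01", "b", "x")]   # "x" is a bad amount
api_rows = [{"date": "2023-01-02", "user": "c", "amount": 50}]

def to_common(row):
    # Agreed format: one dict shape for every source
    if isinstance(row, dict):
        return row
    date, user, amount = row
    return {"date": date, "user": user, "amount": int(amount)}

warehouse = []
for row in db_rows + api_rows:
    try:
        warehouse.append(to_common(row))
    except ValueError:
        pass   # drop records that fail cleaning

print(len(warehouse))   # 2 analytics-ready records
```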

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why a centralized system like data warehouse?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To ensure that all of the company's data is in one single system, so that any team looking for particular data can easily access it. This removes the overhead for any team of obtaining the required data in a common format used across the entire organization.&lt;/p&gt;

&lt;p&gt;This also means that there is no duplication of effort across teams when creating datasets from multiple sources.&lt;/p&gt;

&lt;h1&gt;
  
  
  Responsibilities of a Data Engineering Team
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Identify data sources, analyze the data and ingest it into the data warehouse&lt;/li&gt;
&lt;li&gt;Build and maintain the data pipelines for periodic ingestion and processing of the data&lt;/li&gt;
&lt;li&gt;Add resiliency to the pipelines against failures&lt;/li&gt;
&lt;li&gt;Build and maintain the data warehouse tables and specialized datasets&lt;/li&gt;
&lt;li&gt;Maintain data quality and integrity&lt;/li&gt;
&lt;li&gt;Last but not least, maintain and scale the data infrastructure&lt;/li&gt;
&lt;/ul&gt;
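
&lt;p&gt;As a small example of the resiliency point above, here is one possible (hypothetical) retry wrapper around a flaky ingestion step; real pipelines would typically rely on their orchestrator's retry support instead:&lt;/p&gt;

```python
# A minimal sketch of one way to add resiliency: retry a flaky
# ingestion step a few times before failing. Names are hypothetical.

def with_retries(task, attempts=3):
    for attempt in range(attempts):
        try:
            return task()
        except ConnectionError:
            if attempt == attempts - 1:
                raise          # give up after the last attempt

calls = {"n": 0}

def flaky_ingest():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("source unavailable")   # first try fails
    return "ingested"

result = with_retries(flaky_ingest)
print(result)   # succeeds on the second attempt
```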

&lt;h1&gt;
  
  
  Challenges
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Data engineering effort is mostly internal to a company. There is no customer interaction, nor any direct revenue generated, so there will be plenty of questions about its viability, credibility and ROI&lt;/li&gt;
&lt;li&gt;There can be a lot of incoming data requests from various teams across the entire company&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b"&gt;Context switching&lt;/a&gt;. The data engineering team handles a good number of pipelines, and they will be taking up further tasks collaborating with different teams to fulfill their data requests. Handling all of these things at once will require context switching and that might affect the quality in the long run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find more details, examples and a few important questions about data engineering in the video below.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/dEDc25k7Kck"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=VdR2WxQNnwg"&gt;The Harsh Reality of Being a Data Engineer - YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=goT7gN1lwBI"&gt;Why Are Data Teams Still Struggling to Answer Basic Questions - YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=4NXzeZYaZqQ"&gt;Bloomberg Doesn't Understand My Job (As An Ex-Meta Data Engineer) - Triggered Data Guy - YouTube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>A Typical Data Pipeline</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Tue, 01 Nov 2022 06:33:02 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/a-typical-data-pipeline-2717</link>
      <guid>https://forem.com/maninekkalapudi/a-typical-data-pipeline-2717</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Hello people, Hope you are doing well.&lt;/p&gt;

&lt;p&gt;As data engineers, we build data pipelines to collect data from different source systems and place it in an analytics system, i.e., a data warehouse/data lake. The data is usually sourced from systems like a database, web events or an API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;In this post, we will talk about&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is a data pipeline?&lt;/li&gt;
&lt;li&gt;Stages of a data pipeline&lt;/li&gt;
&lt;li&gt;What is ETL and ELT?&lt;/li&gt;
&lt;li&gt;What is the difference between data warehouse and data lake?&lt;/li&gt;
&lt;li&gt;Why do we need a data pipeline?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. &lt;strong&gt;&lt;em&gt;What is a data pipeline?&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We know that a data pipeline collects data from different source systems and moves it to the analytics system. Let's refine that definition a bit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A data pipeline is a series of interconnected systems that passes data in only one direction, i.e., from source to serving layer, with increasing clarity and value in the data.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data received from a source may contain duplicate records, test records and otherwise problematic records. A data pipeline should be designed to eliminate these issues; only then is the data moved from raw to staging to serving, increasing its clarity and value along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Stages of a data pipeline
&lt;/h2&gt;

&lt;p&gt;A typical data pipeline has the following stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sources&lt;/li&gt;
&lt;li&gt;Raw&lt;/li&gt;
&lt;li&gt;Staging&lt;/li&gt;
&lt;li&gt;Serving&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The below picture shows the different stages of a data pipeline&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ddcueQU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1667283797078/bcbdThrWT.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ddcueQU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1667283797078/bcbdThrWT.png" alt="Untitled.png" width="880" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data Sources&lt;/em&gt;&lt;/strong&gt; : Where data is generated, recorded or obtained for the data pipeline. For example, a database, an SFTP file server or an API. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Raw layer&lt;/em&gt;&lt;/strong&gt; : It is the first level of data storage in the data warehouse. This is an archival storage layer and the data stored in it will not be modified at any time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Staging/Transform layer&lt;/em&gt;&lt;/strong&gt; : The second level of data storage in the data warehouse. The data in this layer, sourced from the raw layer, is cleaned, transformed into a certain format and then stored in various staging tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Serving layer&lt;/em&gt;&lt;/strong&gt; : This is the aggregated data layer, and its data is sourced from different staging tables. For example, the average amount spent by customers over the years.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
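
&lt;p&gt;The layers above can be sketched in a few lines of Python, using the average-spend example from the serving layer. The table names and numbers are made up for illustration:&lt;/p&gt;

```python
# A toy walk through the raw, staging and serving layers.

raw_orders = [                      # Raw layer: stored as received, never modified
    {"customer": "a", "year": 2021, "spend": 100},
    {"customer": "a", "year": 2022, "spend": 300},
    {"customer": "a", "year": 2022, "spend": 300},   # duplicate record
]

# Staging layer: a cleaned copy of the raw data
staging_orders = []
for row in raw_orders:
    if row not in staging_orders:
        staging_orders.append(row)

# Serving layer: an aggregate sourced from staging
years = {row["year"] for row in staging_orders}
avg_spend_per_year = sum(r["spend"] for r in staging_orders) / len(years)
print(avg_spend_per_year)   # 200.0
```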

&lt;h2&gt;
  
  
  3. &lt;strong&gt;What is ETL and ELT?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ETL stands for Extract, Transform, Load; ELT for Extract, Load, Transform. Both are processes to get data into an analytical system like a data warehouse or data lake, preferably on a schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ETL&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In ETL, the data is extracted from a source, transformed into its final format and loaded into a data warehouse (DWH) table. This is well suited when the data is in a database and needs minimal changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ELT&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In ELT, the data is extracted from a source and loaded into the raw layer as-is. When transformations are defined later, the data is transformed and loaded into the final tables in the data warehouse.&lt;/p&gt;

&lt;p&gt;ELT is well suited when the data comes from different sources and the transformations are not yet completely defined. The data is transformed as transformations become available, and the corresponding staging tables are modified accordingly.&lt;/p&gt;
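
&lt;p&gt;A tiny, hypothetical sketch of the difference: in both cases the same work happens, but ETL transforms before loading while ELT loads the raw data first and transforms later. The functions below are stand-ins, not any real tool's API:&lt;/p&gt;

```python
# Contrast of ETL and ELT orderings with stand-in functions.

def extract():
    return ["  Alice ", "  Bob "]

def transform(rows):
    return [r.strip().lower() for r in rows]

# ETL: transform before the warehouse sees the data
etl_table = transform(extract())

# ELT: load raw first; transform later, once the rules are defined
raw_layer = extract()
elt_table = transform(raw_layer)      # runs whenever transformations are ready

print(etl_table == elt_table)   # True; only the ordering differs
```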

&lt;h2&gt;
  
  
  4. &lt;strong&gt;What is the difference between a data warehouse and a data lake?&lt;/strong&gt;
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Data Warehouse&lt;/th&gt;&lt;th&gt;Data Lake&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Stores processed data&lt;/td&gt;&lt;td&gt;Stores raw data&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Transformations are fully defined&lt;/td&gt;&lt;td&gt;Transformations are not fully defined&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Stores structured data in tabular format in data warehouse tables&lt;/td&gt;&lt;td&gt;Stores structured, semi-structured and unstructured data in raw form&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;More complicated and costly to make changes to the tables&lt;/td&gt;&lt;td&gt;Highly accessible and quick to update&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;File formats like Parquet, ORC and Delta are used&lt;/td&gt;&lt;td&gt;File formats like CSV, text files, Parquet files, PDFs etc. are used&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;5. Why do we need a data pipeline?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Large volumes of data come in from different sources (apps, web events, transactions, images/videos, telemetry data). These data are of different types and sizes.&lt;/p&gt;

&lt;p&gt;These characteristics, i.e., volume, variety and velocity, are the defining attributes of Big Data. Processing big data in a consistent manner while serving it to an entire organization is a huge challenge, and data engineers build data pipelines to solve this problem.&lt;/p&gt;

&lt;p&gt;Once the data is processed, it is stored in a centralized data repository like a data warehouse. This ensures that the data is not siloed and anyone with the right access can always access the data and perform the analysis.&lt;/p&gt;

&lt;p&gt;Data pipelines can also ensure data quality at scale. Any checks that need to be performed on the data can be applied in the transformation stage, ensuring quality.&lt;/p&gt;
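
&lt;p&gt;As an illustration of such checks, here is one possible (hypothetical) shape for a quality rule applied in the transformation stage, where failing rows are set aside instead of being served:&lt;/p&gt;

```python
# A hypothetical quality rule: every row needs a user and an amount.

def check(row):
    return row["amount"] is not None and row["user"] != ""

rows = [
    {"user": "a", "amount": 10},
    {"user": "", "amount": 5},        # fails: missing user
    {"user": "b", "amount": None},    # fails: missing amount
]

good = [r for r in rows if check(r)]       # moves on to serving
bad = [r for r in rows if not check(r)]    # quarantined for inspection
print(len(good), len(bad))   # 1 2
```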

&lt;p&gt;Since we also retain the raw data, pipelines can be made resilient to failures: any data loss or corruption can be remedied by reprocessing the existing raw data.&lt;/p&gt;

&lt;p&gt;Check out the video on the same topic:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0TYxyAqPZto"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.talend.com/resources/data-lake-vs-data-warehouse/#:~:text=Data%20lakes%20and%20data%20warehouses,processed%20for%20a%20specific%20purpose."&gt;Data Lake vs Data Warehouse: Key Differences | Talend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/databases/data-lake-vs-data-warehouse-vs-database"&gt;Databases Vs. Data Warehouses Vs. Data Lakes | MongoDB&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Process management with Linux CLI</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sat, 17 Sep 2022 12:27:56 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/process-management-with-linux-cli-2e0</link>
      <guid>https://forem.com/maninekkalapudi/process-management-with-linux-cli-2e0</guid>
      <description>&lt;p&gt;Hello! In my last post I have written about &lt;a href="https://maninekkalapudi.com/permissions-in-linux"&gt;permissions in linux&lt;/a&gt;. In this post we will explore about &lt;strong&gt;processes in linux&lt;/strong&gt; and how to manage them from cli. Let's go!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Intro to modern systems&lt;/li&gt;
&lt;li&gt;What is a process?&lt;/li&gt;
&lt;li&gt;Processes in Linux&lt;/li&gt;
&lt;li&gt;Interacting with processes in the CLI&lt;/li&gt;
&lt;li&gt;Signals&lt;/li&gt;
&lt;li&gt;Shutting down the system from the CLI&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Intro to modern systems
&lt;/h2&gt;

&lt;p&gt;Modern systems are usually multitasking, meaning they can perform more than one task at once, or at least they appear to. In reality, the kernel in the operating system rapidly switches from one process to another, giving the impression that the system is multitasking.&lt;/p&gt;

&lt;p&gt;This is true for Linux as well. When we switch from a text document to a terminal, the Linux kernel also switches the underlying processes. Switching to a process means allotting CPU execution time and resources like memory to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. What is a process?
&lt;/h2&gt;

&lt;p&gt;A program or a script is stored in a file, and a process is nothing but a program in motion. Whenever we execute a program, the Linux kernel has to allocate certain resources like CPU and memory to it. The kernel also tracks the execution by assigning an ID called the PID, or process ID.&lt;/p&gt;
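
&lt;p&gt;We can observe this from Python as well. A minimal sketch (assuming a system with Python installed): the running interpreter has a PID, and any child process we launch gets its own:&lt;/p&gt;

```python
import os
import subprocess
import sys

# Every running program gets a PID from the kernel, including
# child processes we launch ourselves.

print("this interpreter runs as PID", os.getpid())

# Launch a child process (another Python interpreter) and observe its PID
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getpid())"],
    capture_output=True, text=True,
)
print("the child process ran as PID", child.stdout.strip())
```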

&lt;p&gt;This is shown graphically in operating systems like Windows, as below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cu1YPJiP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416108171/F6OUQIN6D.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cu1YPJiP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416108171/F6OUQIN6D.png" alt="Untitled.png" width="880" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Processes in Linux
&lt;/h2&gt;

&lt;p&gt;When a linux system boots up, the kernel starts a few processes using &lt;code&gt;init&lt;/code&gt;. &lt;code&gt;init&lt;/code&gt; is the first program that is launched and it also launches a series of shell scripts (located in &lt;code&gt;/etc&lt;/code&gt;) called &lt;strong&gt;&lt;em&gt;init scripts&lt;/em&gt;&lt;/strong&gt; which start all the system services. The following image shows these init scripts in my system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9MFPPDOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416139531/SYygkliop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9MFPPDOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416139531/SYygkliop.png" alt="Untitled 1.png" width="880" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many of these services run as &lt;strong&gt;daemon programs&lt;/strong&gt;, programs that run in the background and generally don't have any UI or require user interaction.&lt;/p&gt;

&lt;p&gt;Even when we just log in to the system and don't perform any tasks, it runs a few processes to keep the system up.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;init&lt;/code&gt; process here becomes the &lt;strong&gt;&lt;em&gt;parent&lt;/em&gt;&lt;/strong&gt; or "&lt;strong&gt;&lt;em&gt;grandparent&lt;/em&gt;&lt;/strong&gt;" of all the processes launched through it. Processes launched by other processes are called &lt;strong&gt;&lt;em&gt;child processes&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Interacting with processes in cli
&lt;/h2&gt;

&lt;p&gt;The following commands are widely used for monitoring and managing processes in Linux.&lt;/p&gt;

&lt;p&gt;a. &lt;code&gt;ps&lt;/code&gt; - prints a snapshot of the running processes to the terminal&lt;/p&gt;

&lt;p&gt;b. &lt;code&gt;top&lt;/code&gt; - provides a dynamic view of all the processes running on the system within the terminal. The output looks similar to the GUI of a modern task manager&lt;/p&gt;

&lt;p&gt;c. &lt;code&gt;&amp;lt;command/script&amp;gt; &amp;amp;&lt;/code&gt; - runs the process in the background&lt;/p&gt;

&lt;p&gt;d. &lt;code&gt;jobs&lt;/code&gt; - shows the list of processes running in the background&lt;/p&gt;

&lt;p&gt;e. &lt;code&gt;fg %&amp;lt;job_number&amp;gt;&lt;/code&gt; - returns a background process to the foreground&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;&lt;code&gt;ps&lt;/code&gt; command:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is the most commonly used command to view processes. It shows a snapshot of the processes running on the system at that moment. The result looks like the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--99yOpAEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416160519/SlM7AED7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--99yOpAEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416160519/SlM7AED7m.png" alt="Untitled 2.png" width="378" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;PID&lt;/em&gt;&lt;/strong&gt;, or process ID, is the number assigned by the kernel to the process to track the resources allotted to it, e.g., CPU time (TIME) and memory. &lt;strong&gt;&lt;em&gt;TTY&lt;/em&gt;&lt;/strong&gt;, or teletype, refers to the controlling terminal of the process. &lt;strong&gt;&lt;em&gt;TIME&lt;/em&gt;&lt;/strong&gt; is the amount of CPU time consumed by the process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ps aux&lt;/code&gt; displays processes from all users (&lt;code&gt;a&lt;/code&gt;), including those not attached to any terminal (&lt;code&gt;x&lt;/code&gt;), in a user-oriented format with owner info (&lt;code&gt;u&lt;/code&gt;) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;b. &lt;strong&gt;&lt;code&gt;top&lt;/code&gt; (table of processes) command:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It displays a real-time view of the running processes, their resource consumption and also displays kernel-managed tasks. The output is refreshed every 3 seconds by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CdUmAdGN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416217439/V2uDFil9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CdUmAdGN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416217439/V2uDFil9a.png" alt="Untitled 4.png" width="880" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To exit the &lt;code&gt;top&lt;/code&gt; command output prompt and get back to the terminal, press &lt;code&gt;q&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;c. &lt;code&gt;&amp;lt;command/script&amp;gt; &amp;amp;&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;When we run a GUI program like a text editor (&lt;code&gt;gedit&lt;/code&gt;) from the CLI, it opens a window and the terminal stays busy until the window is closed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v1USXD-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416237725/XnUkPknVB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v1USXD-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416237725/XnUkPknVB.png" alt="Untitled 5.png" width="880" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any command/script suffixed with the &lt;code&gt;&amp;amp;&lt;/code&gt; operator will run in the background, leaving the terminal available to the user. For example, &lt;code&gt;gedit &amp;amp;&lt;/code&gt; will launch the program and return the prompt right after it.&lt;/p&gt;
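&lt;p&gt;The same idea works for any long-running command; here is a minimal sketch using &lt;code&gt;sleep&lt;/code&gt; as a GUI-free stand-in for gedit:&lt;/p&gt;

```shell
# Launch a long-running command in the background; the prompt returns
# immediately and the shell records the new process's PID in $!.
sleep 60 &
echo "background PID: $!"
```

&lt;p&gt;&lt;code&gt;$!&lt;/code&gt; always holds the PID of the most recently launched background process.&lt;/p&gt;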

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mditstwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416252454/ESZZfM4q7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mditstwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416252454/ESZZfM4q7.png" alt="Untitled 6.png" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the numbers printed on the terminal after the command. A shell feature called &lt;strong&gt;&lt;em&gt;job control&lt;/em&gt;&lt;/strong&gt; prints the job number in brackets, followed by the PID of the new background process.&lt;/p&gt;

&lt;p&gt;d. &lt;code&gt;jobs&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;It will show the list of background processes/jobs launched from the current terminal, as shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t1WEHmXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416282536/w9KSUPyEj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t1WEHmXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416282536/w9KSUPyEj.png" alt="Untitled 7.png" width="661" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ps&lt;/code&gt; command will also show info about the above process&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mOLHeDj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416298994/thYI05Mj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mOLHeDj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416298994/thYI05Mj3.png" alt="Untitled 8.png" width="600" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;e. &lt;code&gt;fg %&amp;lt;job_number&amp;gt;&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;Our process (job number 1) is running in the background, and any process running in the background is immune to terminal keyboard input, including any attempt to interrupt it with &lt;code&gt;Ctrl+C&lt;/code&gt;. The &lt;code&gt;fg %&amp;lt;job_number&amp;gt;&lt;/code&gt; command will bring the process to the foreground&lt;/p&gt;
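&lt;p&gt;The whole flow can be sketched in a script (job control is on by default in interactive shells; a script must enable it explicitly with &lt;code&gt;set -m&lt;/code&gt;):&lt;/p&gt;

```shell
set -m            # enable job control (on by default in interactive shells)
sleep 2 &
jobs              # lists the background job, e.g. "[1]+ Running  sleep 2 &"
fg %1             # bring job 1 to the foreground; the shell now waits for it
```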

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v0Hk9ouC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416339173/2qilrZv3Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v0Hk9ouC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416339173/2qilrZv3Q.png" alt="Untitled 9.png" width="640" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we press &lt;code&gt;Ctrl+C&lt;/code&gt;, the process will terminate&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0UF6bCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416357122/JMvkfPQGz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0UF6bCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416357122/JMvkfPQGz.png" alt="Untitled 10.png" width="626" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Signals
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kill&lt;/code&gt; command is used to terminate processes. Here's an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JrcbbFAT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416390172/rMdRnuCmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JrcbbFAT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1663416390172/rMdRnuCmy.png" alt="Untitled 11.png" width="719" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gedit &amp;amp;&lt;/code&gt; will launch the program in the background and print its PID on the terminal. Next, we use the &lt;code&gt;kill &amp;lt;PID&amp;gt;&lt;/code&gt; command to terminate the process.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kill&lt;/code&gt; command here doesn't actually kill the program; rather, it sends a signal asking the process to terminate. This gives the process a chance to save its work in progress, since processes can listen for and handle these signals.&lt;/p&gt;

&lt;p&gt;When a process is running in the foreground, &lt;code&gt;CTRL+C&lt;/code&gt; sends a signal to interrupt (terminate) it, while &lt;code&gt;CTRL+Z&lt;/code&gt; sends a signal to suspend it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;killall&lt;/code&gt; command will terminate multiple processes at once, matched by name or by username.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;killall -u &amp;lt;username&amp;gt;&lt;/code&gt;- will kill all the processes under the username provided&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;killall name&lt;/code&gt;- will kill all the processes which matches the provided name&lt;/li&gt;
&lt;/ul&gt;
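&lt;p&gt;A minimal sketch of terminating processes by name (assuming the &lt;code&gt;killall&lt;/code&gt; from the psmisc package, and using &lt;code&gt;sleep&lt;/code&gt; as a disposable stand-in process):&lt;/p&gt;

```shell
# Start two throwaway background processes with the same name,
# then terminate both at once by name.
sleep 300 &
sleep 300 &
killall sleep     # sends SIGTERM to every "sleep" process we own
```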

&lt;p&gt;In the above example, multiple instances of the &lt;code&gt;gedit&lt;/code&gt; program are launched in the background, and with the &lt;code&gt;killall&lt;/code&gt; command we can terminate all those instances at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If a user wants to terminate processes that don't belong to them, they need superuser privileges.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Shutting down the system with CLI
&lt;/h2&gt;

&lt;p&gt;Yes, we can do it! Shutting down a system involves the orderly termination of all processes on the system, and it requires admin privileges to perform this action.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sudo reboot&lt;/code&gt;- restart the system&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sudo shutdown -h now&lt;/code&gt;- shuts down the system without any delay&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sudo shutdown -r now&lt;/code&gt;- reboots the system without any delay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;shutdown&lt;/code&gt; command options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;-h&lt;/code&gt;- halt (power off) the system&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;now&lt;/code&gt;- the time argument. It may be a string in the format "hh:mm" (24h clock) specifying when to execute the shutdown, or &lt;code&gt;+m&lt;/code&gt; meaning m minutes from now. &lt;code&gt;now&lt;/code&gt; is an alias for &lt;code&gt;+0&lt;/code&gt;; when no time is specified in the command, the default is &lt;code&gt;+1&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Process management is a task usually handled by sysadmins and DevOps engineers, among others. It helps maintain the health of machines and servers and supports resource monitoring. Managing a linux system from the cli is effective and swift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redhat.com/sysadmin/linux-command-basics-7-commands-process-management"&gt;Linux Command Basics: 7 commands for process management | Enable Sysadmin (redhat.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unix.stackexchange.com/questions/106847/what-does-aux-mean-in-ps-aux"&gt;linux - What does aux mean in &lt;code&gt;ps aux&lt;/code&gt;? - Unix &amp;amp; Linux Stack Exchange&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://phoenixnap.com/kb/top-command-in-linux"&gt;How to Use the top Command in Linux (phoenixnap.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opensource.com/article/18/9/linux-commands-process-management"&gt;8 Linux commands for effective process management | Opensource.com&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Permissions in Linux</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Mon, 18 Jul 2022 04:54:48 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/permissions-in-linux-1k68</link>
      <guid>https://forem.com/maninekkalapudi/permissions-in-linux-1k68</guid>
      <description>&lt;p&gt;Hello! In my &lt;a href="https://dev.to/maninekkalapudi/redirecting-linux-command-output-6lj-temp-slug-9172606"&gt;last post&lt;/a&gt; I wrote about how I/O redirection works in linux. A command in linux is ultimately a file, and every file is assigned a set of permissions; only users with the right access will be able to run the command. All of this will be detailed in the following post. Let's go!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preface&lt;/li&gt;
&lt;li&gt;Permission Groups&lt;/li&gt;
&lt;li&gt;Permission Types&lt;/li&gt;
&lt;li&gt;Modifying Permissions&lt;/li&gt;
&lt;li&gt;Executing Commands with sudo&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Preface
&lt;/h2&gt;

&lt;p&gt;Linux, like any UNIX-like operating system, is built from the ground up with a multi-user model in mind. Before computers were personal, they filled entire rooms, as was often seen in universities. A practical way to utilize such a computer was to connect it to multiple &lt;a href="https://dev.to/maninekkalapudi/the-linux-command-line-experience-4ihg-temp-slug-4874114"&gt;terminals&lt;/a&gt;, say one per department.&lt;/p&gt;

&lt;p&gt;This is a multi-user model in a nutshell. Multiple users can connect to the same computer via terminals using relevant credentials and each user will have certain permissions for certain actions only.&lt;/p&gt;

&lt;p&gt;In a modern setting like the cloud, multiple users connect to a remote server using SSH (Secure SHell). Each user accessing the server has a separate account with permissions relevant to their role. For example, a developer may have SSH access to the dev server while a tester may not.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Permission Groups
&lt;/h2&gt;

&lt;p&gt;The way permissions are assigned can be grouped into 3 major categories.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Owner - Users may own files and directories and they have control over their access. The owner level access will not impact the actions of other users.&lt;/li&gt;
&lt;li&gt;Group - Users can be grouped based on their role, say developers or admins; and assign access to files and directories to all the users under the specific group.&lt;/li&gt;
&lt;li&gt;All Users - In addition to the above two groups, any user who can access the system will have access to some files and directories granted by the owner.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's see all of this in action. First, let's check what permissions are assigned to the current user in the cli. We use the &lt;code&gt;id&lt;/code&gt; command and it shows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BF-xFvb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118027105/IhTLUis2N.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BF-xFvb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118027105/IhTLUis2N.png" alt="Untitled.png" width="874" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uid - user ID. A number (1000) is assigned when user is created and it is mapped to the user ID.&lt;/li&gt;
&lt;li&gt;gid - primary group ID. User is assigned a primary group ID (gid) and may belong to additional groups&lt;/li&gt;
&lt;li&gt;groups- Different groups that the user is part of. Example: 27 is the sudo(root) user group&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; command shows how permissions are granted to a user through different permission groups. Access to any resource within the system can be assigned via a group that shares a common function. Ex: SSH access to a machine for developers only.&lt;/p&gt;
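&lt;p&gt;A few related &lt;code&gt;id&lt;/code&gt; invocations worth knowing:&lt;/p&gt;

```shell
id          # full summary: uid, gid and all group memberships
id -u       # numeric user ID only
id -un      # user name
id -nG      # names of all groups the current user belongs to
```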

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The uid and gid start at 1000 on ubuntu and might differ on other linux operating systems. Ex: Fedora starts at 500&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Permission Types
&lt;/h2&gt;

&lt;p&gt;Every file or a directory in linux has 3 basic permissions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read(&lt;code&gt;r&lt;/code&gt;)- To read the contents of a file or a directory (if execute(x) permission is also set for the directory)&lt;/li&gt;
&lt;li&gt;Write(&lt;code&gt;w&lt;/code&gt;)- Create, Edit, rename or delete the contents of a file or a directory (if execute(x) permission is also set for the directory)&lt;/li&gt;
&lt;li&gt;Execute(&lt;code&gt;x&lt;/code&gt;)- Run or execute a file or view the contents of a directory. This allows a file to be treated as a program and executed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take a look at these permissions in the cli. When we run the long list command i.e., &lt;code&gt;ls -l&lt;/code&gt; command on a file or a directory, the first column in the resulting list is the permission to that object (highlighted in the below image).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N9fE2V-c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118079766/UUmqU6HPp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N9fE2V-c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118079766/UUmqU6HPp.png" alt="Untitled 1.png" width="838" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first character in the permissions is the object indicator, directory is &lt;code&gt;d&lt;/code&gt; and file is &lt;code&gt;-&lt;/code&gt;. The rest of the characters represent permission groups.&lt;/p&gt;

&lt;p&gt;The first set of three characters after the object indicator are Owner permission, next set are Group permissions and last set are All User permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ekfn-Qsf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118091388/Mh9TALXFC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ekfn-Qsf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118091388/Mh9TALXFC.png" alt="Untitled 2.png" width="880" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, we can understand the permission in the above long list as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dir1&lt;/code&gt;- This is a directory(&lt;code&gt;d&lt;/code&gt;). The owner has all the permissions(&lt;code&gt;rwx&lt;/code&gt;), the group and all users have read and execute permissions(&lt;code&gt;r-x&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logs.txt&lt;/code&gt;- This is a file (&lt;code&gt;-&lt;/code&gt;). The owner has only read and write access(&lt;code&gt;rw-&lt;/code&gt;). The group and all users have only read permission(&lt;code&gt;r--&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
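&lt;p&gt;We can reproduce this reading exercise ourselves (&lt;code&gt;stat -c&lt;/code&gt; is a GNU coreutils option, assumed available here):&lt;/p&gt;

```shell
# Recreate the example objects and read their permission strings.
mkdir -p dir1
touch logs.txt
ls -ld dir1 logs.txt       # first column: object indicator + 3 permission sets
stat -c '%A %a %n' dir1    # symbolic and octal modes side by side
```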

&lt;h2&gt;
  
  
  4. Modifying Permissions
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;chmod&lt;/code&gt; command is used to change the permissions of a file or a directory. Only the file's owner or the superuser can change the mode of a file or directory.&lt;/p&gt;

&lt;p&gt;The permission groups for the &lt;code&gt;chmod&lt;/code&gt; command are denoted as&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;u&lt;/code&gt;- Owner&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;g&lt;/code&gt;- Group&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o&lt;/code&gt;- Others&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a&lt;/code&gt;- All users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, the &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; operators add or remove permissions respectively, and &lt;code&gt;=&lt;/code&gt; sets them exactly. Again, the permissions here are read, write and execute(&lt;code&gt;rwx&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The syntax for &lt;code&gt;chmod&lt;/code&gt; command is as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chmod &amp;lt;permission_group&amp;gt;&amp;lt;assignment_operator&amp;gt;&amp;lt;permission&amp;gt; &amp;lt;file_name/directory_name&amp;gt;&lt;/code&gt;&lt;/p&gt;
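&lt;p&gt;A short sketch of the syntax in action (the file name here is just an example):&lt;/p&gt;

```shell
touch notes.txt
chmod 644 notes.txt          # start from a known mode: rw-r--r--
chmod u+x notes.txt          # add execute for the owner
chmod g-r notes.txt          # remove read from the group
chmod a+r notes.txt          # add read back for everyone
chmod u+x,go=rx notes.txt    # multiple clauses separated by commas
ls -l notes.txt              # final mode: rwxr-xr-x
```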

&lt;p&gt;&lt;strong&gt;Scenario 1&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Remove the execution permission for all users for &lt;code&gt;dir1&lt;/code&gt; (from previous example).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jBaTcwkq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118126607/jusjxQiGm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jBaTcwkq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118126607/jusjxQiGm.png" alt="Untitled 3.png" width="732" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Assign execution permission to only group for &lt;code&gt;dir1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Zm-UZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118136740/92xiRwjnE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Zm-UZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118136740/92xiRwjnE.png" alt="Untitled 4.png" width="753" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Add execute permission for the owner and set the permissions for the group and others to read and execute. Multiple specifications may be separated by commas&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BppAnMEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118144866/rG3SdGJY4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BppAnMEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118144866/rG3SdGJY4.png" alt="Untitled 5.png" width="786" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note&lt;/em&gt;&lt;/strong&gt; : Assigning the right permissions to the relevant users is essential for maintaining good security practices. In general, the fewer permissions given to a user or a group, the better, to avoid any mishaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Executing Commands with sudo
&lt;/h2&gt;

&lt;p&gt;Sudo (su do) allows a system administrator to delegate authority to give certain users (or groups of users) the ability to run some (or all) commands as root or another user while providing an audit trail of the commands and their arguments.&lt;/p&gt;

&lt;p&gt;In linux, some resources that are fundamental to the system are managed by administrators only, to ensure its integrity. The simplest example requiring sudo is the update command.&lt;/p&gt;

&lt;p&gt;In ubuntu, &lt;code&gt;apt-get update&lt;/code&gt; refreshes the list of available packages. &lt;code&gt;apt&lt;/code&gt; is the package manager here and &lt;code&gt;update&lt;/code&gt; is the option that tells the system to fetch the latest package index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lRKnUJVf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118156916/nBx9CXr8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lRKnUJVf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1658118156916/nBx9CXr8t.png" alt="Untitled 6.png" width="880" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The current user does not have the necessary permission to run the command successfully. The update command touches system files (&lt;code&gt;/var/lib/apt/lists/lock&lt;/code&gt;) that only admin users have permission to modify.&lt;/p&gt;

&lt;p&gt;To execute the update command successfully, prefix it with &lt;code&gt;sudo&lt;/code&gt;. This temporarily allows us to execute the command as an admin. This is demonstrated in the second half of the example image above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thoughts on using &lt;code&gt;sudo&lt;/code&gt; command:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sudo&lt;/code&gt; command gives a user the privileges of a sysadmin. In a well-designed system, admin privileges are not handed out to every user. As a general rule of thumb, fewer privileges per user are better, to avoid any chance of full system compromise under an attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Multi-user systems like UNIX/linux are designed so that multiple users can share one machine. This raises interesting questions about who gets to access what data on that machine.&lt;/p&gt;

&lt;p&gt;Linux has always been developed with this in mind. Using the &lt;code&gt;chmod&lt;/code&gt; command in the cli to assign the right permissions is straightforward, especially in server environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxfoundation.org/blog/classic-sysadmin-understanding-linux-file-permissions/"&gt;Classic SysAdmin: Understanding Linux File Permissions - Linux Foundation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxize.com/post/how-to-create-users-in-linux-using-the-useradd-command/"&gt;How to Create Users in Linux (useradd Command) | Linuxize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxize.com/post/how-to-create-a-sudo-user-on-ubuntu/"&gt;How To Create a Sudo User on Ubuntu | Linuxize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://osr507doc.xinuos.com/en/OSUserG/_Changing_file_permissions.html"&gt;Changing file permissions (xinuos.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sudo.ws/"&gt;Sudo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redhat.com/sysadmin/sudo"&gt;Linux command line basics: sudo | Enable Sysadmin (redhat.com)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Redirecting linux command output</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Tue, 21 Jun 2022 03:55:24 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/redirecting-linux-command-output-38lg</link>
      <guid>https://forem.com/maninekkalapudi/redirecting-linux-command-output-38lg</guid>
      <description>&lt;p&gt;Hello! Hope you're doing great. In my &lt;a href="https://dev.to/maninekkalapudi/know-your-linux-commands-36pb-temp-slug-2353119"&gt;last post&lt;/a&gt;, I wrote about commands in the linux CLI. In this post we will understand how to play with any command's output, store it in files, or even connect multiple commands together into command pipelines. Let's go!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are Standard Input, Output and Error in Linux?&lt;/li&gt;
&lt;li&gt;File Descriptors&lt;/li&gt;
&lt;li&gt;Redirecting stdout&lt;/li&gt;
&lt;li&gt;Redirecting stderr&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cat&lt;/code&gt; command&lt;/li&gt;
&lt;li&gt;Command pipelines&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. What are Standard Input, Output and Error in Linux?
&lt;/h2&gt;

&lt;p&gt;Let's say we use a command like &lt;code&gt;ls&lt;/code&gt; in the linux cli. &lt;code&gt;ls&lt;/code&gt; will list all the files and directories in a given path, and if the path is non-existent or incorrect it will throw an error. This is shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--av29MS4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655778030942/oWjt0LoGZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--av29MS4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655778030942/oWjt0LoGZ.png" alt="Untitled.png" width="880" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, every linux command like &lt;code&gt;ls&lt;/code&gt; is designed to produce the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Command output&lt;/li&gt;
&lt;li&gt;Status messages and error messages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output and the error messages are displayed on the screen, and we should remember that &lt;a href="https://www.tecmint.com/explanation-of-everything-is-a-file-and-types-of-files-in-linux/"&gt;everything is a file in linux&lt;/a&gt;. This means that a command sends its output to a special file called &lt;strong&gt;&lt;em&gt;Standard Output&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;stdout&lt;/em&gt;) and its errors to &lt;strong&gt;&lt;em&gt;Standard Error&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;stderr&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;By default, both &lt;em&gt;stdout&lt;/em&gt; and &lt;em&gt;stderr&lt;/em&gt; are linked to the screen and not saved into a disk file. Also, many commands or programs take input from &lt;strong&gt;&lt;em&gt;Standard Input&lt;/em&gt;&lt;/strong&gt; (&lt;em&gt;stdin&lt;/em&gt;), which is by default attached to the keyboard.&lt;/p&gt;

&lt;p&gt;So stdout and stderr are files attached to the screen, and stdin is a file attached to the keyboard. When a command runs in the cli, these files receive its output, errors and input.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rFp7j8ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655782412571/LFqNe81-t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rFp7j8ti--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655782412571/LFqNe81-t.png" alt="image.png" width="880" height="537"&gt;&lt;/a&gt;Source: &lt;a href="https://linuxhint.com/redirect-stderr-stdout-bash/"&gt;https://linuxhint.com/redirect-stderr-stdout-bash/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. File Descriptors
&lt;/h2&gt;

&lt;p&gt;A file descriptor is a unique number that identifies an open file in an operating system. The operating system keeps a record of all open files and their locations in a global table, along with their permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mTBbzMS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655778149967/p9C8a3ZGb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mTBbzMS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655778149967/p9C8a3ZGb.png" alt="Untitled 1.png" width="225" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.computerhope.com/jargon/f/file-descriptor.htm#:~:text=A%20file%20descriptor%20is%20a,Grants%20access."&gt;What is a File Descriptor? (computerhope.com)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On a Unix-like operating system, the first three file descriptors, by default, are STDIN (standard input), STDOUT (standard output), and STDERR (standard error).&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Name&lt;/th&gt;&lt;th&gt;File descriptor&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Standard input (stdin)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;The default data stream for input. In the terminal, this defaults to keyboard input from the user&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Standard output (stdout)&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;The default data stream for output, for example when a command prints text. In the terminal, this defaults to the user's screen&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Standard error (stderr)&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;The default data stream for output that relates to an error occurring. In the terminal, this defaults to the user's screen&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The file descriptors are used for input/output (I/O) redirection in the linux cli. Let's discuss this next.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Redirecting stdout
&lt;/h2&gt;

&lt;p&gt;I/O redirection basically means that we can choose where a command's output finally ends up, using the redirection operator &lt;code&gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To redirect the stdout to a file, we will use &lt;code&gt;&amp;lt;command&amp;gt; &amp;gt; &amp;lt;output_file_name&amp;gt;&lt;/code&gt;. Let's take a look at this with an example:&lt;/p&gt;

&lt;p&gt;Here we use the command &lt;code&gt;ls -l &amp;gt; ls_op.txt&lt;/code&gt; to long-list the files and write the output to the &lt;code&gt;ls_op.txt&lt;/code&gt; file. The &lt;code&gt;cat&lt;/code&gt; command will then display its contents on the screen&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0HTpSDSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780473971/VVcnIs_Gf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0HTpSDSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780473971/VVcnIs_Gf.png" alt="Untitled 2.png" width="747" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What happens if we provide a non-existing path to the ls command?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6deORtZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780489646/v51_lZrhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6deORtZs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780489646/v51_lZrhh.png" alt="Untitled 3.png" width="671" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the &lt;code&gt;/bn/usr/&lt;/code&gt; path doesn't exist and we anticipated an error message written to the &lt;code&gt;ls_op.txt&lt;/code&gt; file. Instead, the error is shown on the screen. What about the output file?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4arGnSlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780508658/Hn9EvBFsF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4arGnSlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780508658/Hn9EvBFsF.png" alt="Untitled 4.png" width="629" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ls_op.txt&lt;/code&gt; file is empty and the error is displayed on the screen. What happened here?&lt;/p&gt;

&lt;p&gt;First, the non-existent path produced an error message, which is sent to &lt;em&gt;stderr&lt;/em&gt; instead of &lt;em&gt;stdout&lt;/em&gt;. Next, the redirection operator (&lt;code&gt;&amp;gt;&lt;/code&gt;) overwrites the output file. Since the command produced no output on stdout, the file was overwritten with nothing, and the error appeared on the screen because stderr was never redirected.&lt;/p&gt;
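&lt;p&gt;A minimal sketch of this behavior (the path &lt;code&gt;/no/such/path&lt;/code&gt; is a stand-in for any non-existent path):&lt;/p&gt;

```shell
printf 'stale contents\n' > ls_op.txt
# stdout is redirected into the file, but the error message travels on
# stderr, so it still appears on the screen while the file is truncated
ls /no/such/path > ls_op.txt || true   # || true: ignore the failing exit status
```

&lt;p&gt;Afterwards &lt;code&gt;ls_op.txt&lt;/code&gt; exists but is empty: the shell truncated it before running &lt;code&gt;ls&lt;/code&gt;, and no stdout ever arrived to fill it.&lt;/p&gt;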

&lt;p&gt;Interestingly, we can use &lt;code&gt;&amp;gt;&lt;/code&gt; to create a new file or even truncate an existing file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;gt; &amp;lt;filename&amp;gt;&lt;/code&gt; will create a new file; if the file already exists, its contents will be truncated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XwqAbkKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780527995/nlqfmZEu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XwqAbkKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780527995/nlqfmZEu6.png" alt="Untitled 5.png" width="880" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if we just want to append output to an existing file? &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; appends data to an existing file, and creates the file if it is not present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vXYpTwgc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780540075/EsHkIA292.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vXYpTwgc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780540075/EsHkIA292.png" alt="Untitled 6.png" width="663" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Redirecting stderr
&lt;/h2&gt;

&lt;p&gt;Redirecting stderr requires its file descriptor. As discussed above, the file descriptor for stderr is 2, and it is placed directly before the redirection operator.&lt;/p&gt;

&lt;p&gt;The stderr can be redirected to a file as mentioned below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l /bn/usr 2&amp;gt; ls-err.txt&lt;/code&gt;. Here 2 is the file descriptor for stderr, followed by the redirection operator. Let's see the output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ax7Tg70C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780554951/9y5AkcLxA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ax7Tg70C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780554951/9y5AkcLxA.png" alt="Untitled 7.png" width="713" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redirecting both stdout and stderr to a single file&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
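&lt;p&gt;To capture both streams in one file, redirect stdout first and then point stderr (fd 2) at stdout (fd 1). A small sketch, using one valid and one invalid path:&lt;/p&gt;

```shell
# stdout goes to the file, then fd 2 is duplicated onto fd 1
ls /usr/bin /no/such/path > ls_all.txt 2>&1 || true
# bash also offers the shorthand: ls ... &> ls_all.txt
```

&lt;p&gt;Now &lt;code&gt;ls_all.txt&lt;/code&gt; contains both the directory listing and the error message. The order matters: writing &lt;code&gt;2&amp;gt;&amp;amp;1&lt;/code&gt; before the file redirection would send stderr to the screen instead.&lt;/p&gt;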

&lt;h2&gt;
  
  
  5. cat command
&lt;/h2&gt;

&lt;p&gt;cat, which is short for concatenate, can perform multiple operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Display file contents&lt;/li&gt;
&lt;li&gt;Concatenate multiple files into a new file using the redirection operator&lt;/li&gt;
&lt;li&gt;Create a new file using the cat command and the redirection operator&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Display file contents&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;cat &amp;lt;filename&amp;gt;&lt;/code&gt; will display the contents of the given file. To display multiple files, pass the file names one after the other; adding the &lt;code&gt;-n&lt;/code&gt; option numbers the output lines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qhUAQQhv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780599100/OQmdPV5K1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qhUAQQhv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780599100/OQmdPV5K1.png" alt="Untitled 9.png" width="868" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concatenate multiple files into a new file using the redirection operator&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the &lt;code&gt;cat&lt;/code&gt; command essentially did in the above step is read the file contents and pass them to stdout, which is attached to the screen. We can use the I/O redirection technique to redirect the output of the &lt;code&gt;cat&lt;/code&gt; command to a file. Let's see this in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GjmPr0HT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780613132/6vTTQlZQq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GjmPr0HT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780613132/6vTTQlZQq.png" alt="Untitled 10.png" width="587" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cat testfile &amp;gt; catfile&lt;/code&gt; command wrote the contents of testfile to catfile using the &lt;code&gt;&amp;gt;&lt;/code&gt; operator. The catfile was also created by the command on the fly.&lt;/p&gt;

&lt;p&gt;This is handy when we want to write the contents of multiple files to a single file: &lt;code&gt;cat &amp;lt;input1&amp;gt; &amp;lt;input2&amp;gt; &amp;gt; &amp;lt;outputfile&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---nU1I0a_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780625131/9URRCDixH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---nU1I0a_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780625131/9URRCDixH.png" alt="Untitled 11.png" width="796" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Create a new file using cat command and redirection operator&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What happens when we don't pass a file name to the &lt;code&gt;cat&lt;/code&gt; command? &lt;code&gt;cat&lt;/code&gt; expects a file name and, when none is provided, it reads from stdin. This behavior is documented in the manual (&lt;code&gt;man cat&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lK2slWdS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780638901/uUJxfFFTa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lK2slWdS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780638901/uUJxfFFTa.png" alt="Untitled 12.png" width="880" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the &lt;code&gt;cat&lt;/code&gt; command continuously accepts input from the keyboard and displays it back on the screen, since stdout is still attached to the screen. Pressing &lt;code&gt;ctrl+d&lt;/code&gt; signals end of input and exits the prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F57jhh2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780648095/2R6gcQPkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F57jhh2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780648095/2R6gcQPkQ.png" alt="Untitled 13.png" width="621" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if we redirect the above example's output to a file? The command for this scenario is &lt;code&gt;cat &amp;gt; &amp;lt;filename&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FvgSPPrI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780658819/eBrXC6ejN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FvgSPPrI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1655780658819/eBrXC6ejN.png" alt="Untitled 14.png" width="880" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Input is accepted continuously from the keyboard and written to the file. Again, pressing &lt;code&gt;ctrl+d&lt;/code&gt; exits the prompt.&lt;/p&gt;
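&lt;p&gt;In a script we can't type at the keyboard, but piping into &lt;code&gt;cat&lt;/code&gt; stands in for stdin and shows the same mechanism:&lt;/p&gt;

```shell
# the pipe plays the role of the keyboard; at a real prompt,
# ctrl+d marks the end of input
printf 'Hi, there!\n' | cat > greeting.txt
cat greeting.txt
```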

&lt;h2&gt;
  
  
  6. Command Pipelines
&lt;/h2&gt;

&lt;p&gt;So far, we have run single commands in the CLI and played with their output (&lt;code&gt;&amp;gt;&lt;/code&gt;). Pipelines in the shell let the stdout of one command be fed into the stdin of another with the pipe (&lt;code&gt;|&lt;/code&gt;) operator.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l /usr/bin | less&lt;/code&gt;. The &lt;code&gt;ls -l&lt;/code&gt; command lists all the contents of the path &lt;code&gt;/usr/bin&lt;/code&gt;, and its output is passed to the &lt;code&gt;less&lt;/code&gt; command, which displays the contents one page (one screen) at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls -l /usr/bin | sort&lt;/code&gt; - This command provides the sorted listing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ls -l /usr/bin | sort | uniq | less&lt;/code&gt; - This command provides the sorted, de-duplicated list in pages. The &lt;code&gt;uniq&lt;/code&gt; command is often used in conjunction with the &lt;code&gt;sort&lt;/code&gt; command; &lt;code&gt;uniq&lt;/code&gt; accepts either stdin or a single filename argument.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ls -l | sort | uniq | grep zip&lt;/code&gt; - This command searches the sorted, unique list for a pattern using the &lt;code&gt;grep&lt;/code&gt; command. &lt;code&gt;grep&lt;/code&gt; accepts a pattern and searches a file or stdin; a pattern here means a word or a regex.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ls /usr/bin | tee ls.txt | grep zip&lt;/code&gt; - The &lt;code&gt;tee&lt;/code&gt; command reads stdin from the previous command and copies it both to stdout and to one or more files. This lets us capture the output of intermediate steps in the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
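&lt;p&gt;The filter chain above can be sketched end to end with a small sample file (contents chosen purely for illustration):&lt;/p&gt;

```shell
printf 'zip\nunzip\nzip\nbzip2\n' > tools.txt
# sort the lines, drop duplicates, save a copy with tee, then filter with grep
sort tools.txt | uniq | tee sorted.txt | grep zip
```

&lt;p&gt;&lt;code&gt;sorted.txt&lt;/code&gt; keeps the full de-duplicated list, while the screen shows only the lines matching &lt;code&gt;zip&lt;/code&gt;.&lt;/p&gt;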

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I/O redirection is a powerful technique, and pairing it with pipelines helps us build quick scripts or first versions of a data pipeline. Commands like &lt;code&gt;grep&lt;/code&gt; can help trigger other commands based on the search results, or simply help make sense of the output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.computerhope.com/jargon/f/file-descriptor.htm#:~:text=A%20file%20descriptor%20is%20a,Grants%20access."&gt;What is a File Descriptor? (computerhope.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://phoenixnap.com/kb/linux-cat-command"&gt;Cat Command in Linux {15 Commands with Examples} | phoenixNAP KB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/less-command-linux-examples/#:~:text=Less%20command%20is%20a%20Linux,accesses%20it%20page%20by%20page."&gt;less command in Linux with Examples - GeeksforGeeks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxhint.com/redirect-stderr-stdout-bash/"&gt;How to Redirect stderr to stdout in Bash (linuxhint.com)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Know Your Linux Commands</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sun, 08 May 2022 14:11:57 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/know-your-linux-commands-l34</link>
      <guid>https://forem.com/maninekkalapudi/know-your-linux-commands-l34</guid>
      <description>&lt;p&gt;Hello! Hope youre doing great. In my &lt;a href="https://dev.to/maninekkalapudi/working-with-files-and-directories-in-linux-cli-24g7-temp-slug-8707293"&gt;last post&lt;/a&gt; I have written about working with files and directories in linux CLI. In this post, lets discuss what actually is a command and how to create a command of our own.&lt;/p&gt;

&lt;p&gt;Topics covered in this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is a command?&lt;/li&gt;
&lt;li&gt;Identifying a command&lt;/li&gt;
&lt;li&gt;Know your commands via CLI&lt;/li&gt;
&lt;li&gt;Create your own command using alias&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. What is a command?
&lt;/h3&gt;

&lt;p&gt;A command, in general, is an instruction or a set of instructions given to a machine to perform an action. A command in the Linux world can be any of the following:&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;An executable program&lt;/strong&gt; - &lt;code&gt;/usr/bin&lt;/code&gt; in Linux holds the compiled binaries (installed programs). These are written in C, C++, Python, shell, etc.&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Shell built-in&lt;/strong&gt; - the bash shell supports a number of commands called &lt;em&gt;shell built-ins&lt;/em&gt;. Ex: the &lt;code&gt;cd&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;c. &lt;strong&gt;Shell function&lt;/strong&gt; - Shell scripts that are included in the environment.&lt;/p&gt;

&lt;p&gt;d. &lt;strong&gt;Alias&lt;/strong&gt; - as the name suggests, an alias is an alternate name we can give to a command or command string&lt;/p&gt;
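&lt;p&gt;The &lt;code&gt;type&lt;/code&gt; built-in (covered next) can tell these categories apart; a quick sketch in bash:&lt;/p&gt;

```shell
type cd           # reports: cd is a shell builtin
type ls           # in an interactive shell this is often an alias for 'ls --color=auto'
command -v ls     # prints the path of the executable, e.g. /usr/bin/ls
```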

&lt;h3&gt;
  
  
  2. Identifying a command
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;type&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZmRb_qd2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018553718/dg17-mK-L.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZmRb_qd2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018553718/dg17-mK-L.png" alt="Untitled.png" width="880" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;type ls&lt;/code&gt; shows that &lt;code&gt;ls&lt;/code&gt; is in fact an alias for the command &lt;code&gt;ls --color=auto&lt;/code&gt;. When we use the &lt;code&gt;ls&lt;/code&gt; command, the results are displayed with color coding, as above. An alias works just like any command; when we use it, it invokes the command it points to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;which&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GLg69YUO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018567838/qxTa9qirU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GLg69YUO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018567838/qxTa9qirU.png" alt="Untitled 1.png" width="813" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;which&lt;/code&gt; commands are two ways to determine the type of a command and where it is referenced (installed) from.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Know your command
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--help&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more about any command, we can use its &lt;code&gt;--help&lt;/code&gt; option. &lt;code&gt;&amp;lt;command&amp;gt; --help&lt;/code&gt; shows all the options for the command. In the example below, we see the documentation for the &lt;code&gt;mv&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PcKCnY8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018594863/VkL8DX-Eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PcKCnY8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018594863/VkL8DX-Eb.png" alt="Untitled 2.png" width="880" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each option adds functionality to the command. For example, the &lt;code&gt;mv&lt;/code&gt; command with the &lt;code&gt;-u&lt;/code&gt; option moves a file from the source directory only when it is newer than the file in the destination directory, or when the destination file is missing.&lt;/p&gt;
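&lt;p&gt;A small sketch of &lt;code&gt;-u&lt;/code&gt; in action, assuming GNU coreutils (the file and directory names are made up):&lt;/p&gt;

```shell
mkdir -p srcdir destdir
printf 'new version\n' > srcdir/report.txt
printf 'old version\n' > destdir/report.txt
touch -d '2020-01-01' destdir/report.txt   # backdate the destination copy
mv -u srcdir/report.txt destdir/           # moves, because the source is newer
```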

&lt;ul&gt;
&lt;li&gt;manual (&lt;code&gt;man&lt;/code&gt; command)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;man&lt;/em&gt;, short for &lt;em&gt;manual&lt;/em&gt;, provides the formal documentation for executable programs. The man page covers a command in sections such as name, synopsis, description and others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B0ZPne01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018609297/5CK-qPRbW.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B0ZPne01--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018609297/5CK-qPRbW.png" alt="Untitled 3.png" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;apropos command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;apropos &amp;lt;search_term&amp;gt;&lt;/code&gt; command will show the appropriate commands by scanning the man pages based on the search term.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VU-pbFBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018622143/d1SyxYnJA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VU-pbFBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018622143/d1SyxYnJA.png" alt="Untitled 4.png" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because apropos scans all man pages, the results cover a wide range of cases and can look very different from each other. Pick the command that suits your scenario; a brief description follows each command in the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Create your own command using alias
&lt;/h3&gt;

&lt;p&gt;So far, we have seen examples with only one command. We can place a semicolon (&lt;code&gt;;&lt;/code&gt;) between commands to run them one after another (&lt;code&gt;cmd1; cmd2; cmd3&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;For example: &lt;code&gt;echo "Hi, there!"; ls; ls destdir&lt;/code&gt;. The &lt;code&gt;echo&lt;/code&gt; command is the print statement of the Linux CLI, and the &lt;code&gt;ls&lt;/code&gt; command lists the files and directories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--22s48xLR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018637130/F9LZiZSGh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--22s48xLR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018637130/F9LZiZSGh.png" alt="Untitled 5.png" width="880" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can combine these commands into an alias and run the same sequence every time. Note that a user-defined alias is specific to the machine and, unless added to a startup file such as &lt;code&gt;.bashrc&lt;/code&gt;, lasts only for the current shell session.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alias &amp;lt;name&amp;gt;='&amp;lt;command_string&amp;gt;'&lt;/code&gt; creates an alias with the supplied name (note: no spaces around the &lt;code&gt;=&lt;/code&gt;). Now, let's create the alias and see it in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WDGqxPTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018649740/Gio47OMe2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WDGqxPTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018649740/Gio47OMe2.png" alt="Untitled 6.png" width="880" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the alias with the &lt;code&gt;alias mycommand="echo \"Hi, there\"; ls; ls destdir"&lt;/code&gt; command, we can invoke the alias like any Linux command, as shown above. When we check the type of the alias using &lt;code&gt;type mycommand&lt;/code&gt;, it shows &lt;code&gt;mycommand is aliased to `echo "Hi, there"; ls; ls destdir'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To list all the aliases currently defined, use the &lt;code&gt;alias&lt;/code&gt; command; to remove one, use &lt;code&gt;unalias &amp;lt;alias_name&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;unalias mycommand&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NcYCX53y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018665580/IvaMO3_dW.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NcYCX53y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1652018665580/IvaMO3_dW.png" alt="Untitled 7.png" width="880" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Working with Files and Directories in Linux CLI</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Mon, 21 Feb 2022 08:45:26 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/working-with-files-and-directories-in-linux-cli-3koh</link>
      <guid>https://forem.com/maninekkalapudi/working-with-files-and-directories-in-linux-cli-3koh</guid>
      <description>&lt;p&gt;Hello! Hope youre doing great. In my &lt;a href="https://dev.to/maninekkalapudi/the-linux-command-line-experience-4ihg-temp-slug-4874114"&gt;last post&lt;/a&gt; I have written about the how to get started with linux command line(cli) and terms like shell, terminal and etc. We also tried few basic commands to list the files and directories in a path. In this post, we will take this further and discuss about how we can interact with files and directories. Lets dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post&lt;/strong&gt; :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Files and directories in Linux&lt;/li&gt;
&lt;li&gt;Create and edit files in cli&lt;/li&gt;
&lt;li&gt;Create directories with cli&lt;/li&gt;
&lt;li&gt;File permissions in linux&lt;/li&gt;
&lt;li&gt;Manipulating files and directories in cli&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Files and directories in Linux
&lt;/h3&gt;

&lt;p&gt;Files are the basic entities in linux which can store some data, text or a script/program. Directories (folders in other operating systems) contain either files or other directories. Both files and directories are common among all the operating systems.&lt;/p&gt;

&lt;p&gt;The Linux filesystem, shown in the diagram below, has different files and directories for various operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ky0wno7s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424413545/Kt5j7r4gA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ky0wno7s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424413545/Kt5j7r4gA.png" alt="Untitled.png" width="550" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux Filesystem Source: &lt;a href="http://www2.hawaii.edu/~walbritt/ics240/materials/module2-session07.htm"&gt;ICS 240: Operating Systems by William McDaniel Albritton (hawaii.edu)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a user logs in, they land in &lt;code&gt;/home/&amp;lt;username&amp;gt;&lt;/code&gt; (shown as &lt;code&gt;~&lt;/code&gt; in the CLI). The user can create and delete files and directories within their home directory. Some files and directories there are created by the sysadmin (the &lt;code&gt;root&lt;/code&gt; or admin user) and cannot be modified or deleted.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ls&lt;/code&gt; command shows the contents of the current directory. When we run &lt;code&gt;ls&lt;/code&gt; with the all option (&lt;code&gt;-a&lt;/code&gt;), it also shows hidden files and directories (filenames starting with &lt;code&gt;.&lt;/code&gt;) along with the regular content. Hidden files include configuration files (&lt;code&gt;.bashrc&lt;/code&gt;), environment files (&lt;code&gt;.profile&lt;/code&gt;), etc. More on this in upcoming posts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LjxYr-zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424428038/qKyo7VHlV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LjxYr-zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424428038/qKyo7VHlV.png" alt="Untitled 1.png" width="880" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, Linux has no real concept of file extensions. When no extension is provided, the type of a file is determined by its contents (or file header). Operating systems like Windows, by contrast, rely on the file extension to determine the file type: for example, &lt;code&gt;.txt&lt;/code&gt; is a text file, &lt;code&gt;.exe&lt;/code&gt; an executable program, &lt;code&gt;.jpg&lt;/code&gt; an image file, etc.&lt;/p&gt;

&lt;p&gt;To check the type of a file, we can use the &lt;code&gt;file &amp;lt;filename&amp;gt;&lt;/code&gt; command. In the example below, we have a file named &lt;code&gt;OMENCity&lt;/code&gt; with no extension. When we run &lt;code&gt;file OMENCity&lt;/code&gt;, we get the file's metadata (file information).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iOoYGOOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424459275/DAmEjYeW7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iOoYGOOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424459275/DAmEjYeW7.png" alt="Untitled 2.png" width="858" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create and edit files in cli
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Creating a file with &lt;code&gt;touch&lt;/code&gt; command:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;touch &amp;lt;filename&amp;gt;&lt;/code&gt; will create a new file in the current working directory. Optionally, we can pass the &lt;code&gt;&amp;lt;path&amp;gt;/to/&amp;lt;file&amp;gt;/&amp;lt;filename&amp;gt;&lt;/code&gt; to the touch command to create a file in the specific location (shown in the next example).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3y2zifqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424481114/1dPaoedzN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3y2zifqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424481114/1dPaoedzN.png" alt="Untitled 3.png" width="880" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we didn't pass any file extension to the &lt;code&gt;touch&lt;/code&gt; command, like &lt;code&gt;touch &amp;lt;filename.extension&amp;gt;&lt;/code&gt;. We can do that as well; for example, &lt;code&gt;touch index.html&lt;/code&gt;. Let's try this example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5f7wT_df--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424493365/ojx6ylhpZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5f7wT_df--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424493365/ojx6ylhpZ.png" alt="Untitled 4.png" width="880" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Editing a file with Vim(&lt;code&gt;vi&lt;/code&gt;) editor:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we have created the files, let's edit one from the command line. We can use command-line editors like Vim (personal preference) or nano. The command to open a file in the Vim editor is &lt;code&gt;vim &amp;lt;filename&amp;gt;&lt;/code&gt; (&lt;code&gt;vi&lt;/code&gt; also works instead of &lt;code&gt;vim&lt;/code&gt;). The nano editor uses a similar command, &lt;code&gt;nano &amp;lt;filename&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's edit the &lt;code&gt;testfile&lt;/code&gt; in the home directory. Once we run the &lt;code&gt;vim testfile&lt;/code&gt; command and press the return (enter) key, the screen below is presented. We can't edit the file just yet. Alternatively, we can use &lt;code&gt;vim /path/to/&amp;lt;filename&amp;gt;&lt;/code&gt; to edit a file outside the current working directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W27KmiLI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424518535/eKEohTjbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W27KmiLI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424518535/eKEohTjbp.png" alt="Untitled 5.png" width="710" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, to edit the file we press the &lt;code&gt;esc&lt;/code&gt; key and then the &lt;code&gt;i&lt;/code&gt; key. This switches Vim to insert mode; we can observe INSERT at the bottom of the screen. Now we can write text to the file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4-JjzuV3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424539874/GKGREvyVe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4-JjzuV3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424539874/GKGREvyVe.png" alt="Untitled 6.png" width="880" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After entering the text, we need to save the latest changes. Press the &lt;code&gt;esc&lt;/code&gt; key, type &lt;code&gt;:wq&lt;/code&gt;, and press enter. &lt;code&gt;:wq&lt;/code&gt; is the command to write the file (&lt;code&gt;w&lt;/code&gt;) and quit the vim editor (&lt;code&gt;q&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c4cASbPq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424547246/oBVjNYT4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c4cASbPq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424547246/oBVjNYT4w.png" alt="Untitled 7.png" width="702" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Viewing the file content with &lt;code&gt;cat&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To view the contents of a file, we can use the &lt;code&gt;cat&lt;/code&gt; (concatenate) command. &lt;code&gt;cat &amp;lt;filename&amp;gt;&lt;/code&gt; prints the contents of the file directly to the command line.&lt;/p&gt;
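&lt;p&gt;The create-and-view workflow can also be sketched non-interactively; the file name &lt;code&gt;testfile&lt;/code&gt; below is just an example:&lt;/p&gt;

```shell
# Create a file and write a line to it without opening an editor
printf 'hello from the cli\n' > testfile

# Print the file's contents to the terminal
cat testfile
```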

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1u-w8HlE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424561293/bqyTHGALm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1u-w8HlE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424561293/bqyTHGALm.png" alt="Untitled 8.png" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating files with Vim editor:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have seen an example of creating a file with the &lt;code&gt;touch&lt;/code&gt; command. We can also use the vim editor for this and eliminate the separate file-creation step altogether. Let's see this with an example.&lt;/p&gt;

&lt;p&gt;Previously, we used the &lt;code&gt;vim &amp;lt;filename&amp;gt;&lt;/code&gt; or &lt;code&gt;vim /path/to/&amp;lt;filename&amp;gt;&lt;/code&gt; commands to edit an existing file in the vim editor. The same &lt;code&gt;vim &amp;lt;filename&amp;gt;&lt;/code&gt; command also works for a file that does not exist yet. Let's see this in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l7OVJ77L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424636237/pO4uffGQR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l7OVJ77L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424636237/pO4uffGQR.png" alt="Untitled 9.png" width="880" height="99"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p_jVLrDk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424612365/4EDeAWTWR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p_jVLrDk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424612365/4EDeAWTWR.png" alt="Untitled 10.png" width="706" height="154"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ahRI5D5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424655716/y1H8dLEU0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ahRI5D5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424655716/y1H8dLEU0.png" alt="Untitled 11.png" width="709" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following steps were taken in the above example:&lt;/p&gt;

&lt;p&gt;a. List files using &lt;code&gt;ls&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;b. Open vim editor for the new file &lt;code&gt;vi vinewfile&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;c. Change the vim editor to insert mode using the &lt;code&gt;esc&lt;/code&gt; key and then the &lt;code&gt;i&lt;/code&gt; key&lt;/p&gt;

&lt;p&gt;d. Edit the contents of the file and save it using &lt;code&gt;:wq&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;e. &lt;code&gt;cat&lt;/code&gt; command to view the contents of the file.&lt;/p&gt;

&lt;p&gt;We can use the above steps to create a hidden file. For example, &lt;code&gt;vim .vimhiddenfile&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; : While using the &lt;code&gt;vim&lt;/code&gt; command to create a file on the fly, we must save it (&lt;code&gt;:wq&lt;/code&gt;) for the file to appear in the path. Otherwise the file will not be created.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create directories with cli
&lt;/h3&gt;

&lt;p&gt;Creating a directory is pretty straightforward. &lt;code&gt;mkdir&lt;/code&gt;, short for make directory, is the command to create the directories. Let's see this in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H4nhSy8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424698499/yEXempBS5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H4nhSy8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424698499/yEXempBS5.png" alt="Untitled 12.png" width="862" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following steps were taken in the above example:&lt;/p&gt;

&lt;p&gt;a. &lt;code&gt;ls&lt;/code&gt; command to list all the files&lt;/p&gt;

&lt;p&gt;b. &lt;code&gt;mkdir &amp;lt;dirname&amp;gt;&lt;/code&gt;(&lt;code&gt;mkdir newdir&lt;/code&gt;) command to create the directory in the current working directory&lt;/p&gt;

&lt;p&gt;c. &lt;code&gt;mkdir ./relative/path/&amp;lt;dirname&amp;gt;&lt;/code&gt; command to create a directory in the specified path&lt;/p&gt;

&lt;p&gt;d. To create a nested directory path, i.e., a directory within a directory that does not exist yet, we should use the &lt;code&gt;-p&lt;/code&gt; option with the &lt;code&gt;mkdir&lt;/code&gt; command. Otherwise we get an error similar to &lt;code&gt;mkdir: cannot create directory ./newnewdir/dir1: No such file or directory&lt;/code&gt;. For example, &lt;code&gt;mkdir -p ./newnewdir/dir1&lt;/code&gt; creates both &lt;code&gt;newnewdir&lt;/code&gt; and &lt;code&gt;dir1&lt;/code&gt; within it&lt;/p&gt;

&lt;p&gt;e. To change the directory we use &lt;code&gt;cd /path/to/dir&lt;/code&gt; command&lt;/p&gt;
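&lt;p&gt;The steps above can be sketched as follows (the directory names are illustrative):&lt;/p&gt;

```shell
# Create a directory in the current working directory
mkdir newdir

# -p creates any missing parent directories in the path
mkdir -p ./newnewdir/dir1

# Change into the newly created path and confirm where we are
cd ./newnewdir/dir1
pwd
```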

&lt;h3&gt;
  
  
  4. File permissions in linux
&lt;/h3&gt;

&lt;p&gt;Now, let's run the &lt;code&gt;ls -al&lt;/code&gt; or &lt;code&gt;ll&lt;/code&gt; (long list) command in the home directory and check the output. Both commands produce similar output, displaying the following information in their columns:&lt;/p&gt;

&lt;p&gt;a. File permissions&lt;/p&gt;

&lt;p&gt;b. &lt;a href="https://www.redhat.com/sysadmin/linking-linux-explained"&gt;File's number of hard links&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;c. File owner username&lt;/p&gt;

&lt;p&gt;d. name of the group that owns the file&lt;/p&gt;

&lt;p&gt;e. Size of the file in bytes&lt;/p&gt;

&lt;p&gt;f. Date and time of the file's last modification&lt;/p&gt;

&lt;p&gt;g. Name of the file&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--24aAWNdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424722642/8Powtftwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--24aAWNdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424722642/8Powtftwz.png" alt="Untitled 13.png" width="834" height="919"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above list, directories are marked with &lt;code&gt;d&lt;/code&gt; as the first letter of the file permissions, while regular files are marked with &lt;code&gt;-&lt;/code&gt;. Each file and directory in linux has read(&lt;code&gt;r&lt;/code&gt;), write(&lt;code&gt;w&lt;/code&gt;) and execute(&lt;code&gt;x&lt;/code&gt;) permissions for the following categories of users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a. Owner - Who owns a file or directory
b. Group - A group of users with the same permissions provided by the owner
c. World - Any user who is granted some permissions provided by the owner

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2eFsuURp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424745209/l4qXZ0ef8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2eFsuURp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424745209/l4qXZ0ef8.png" alt="Untitled 14.png" width="880" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first 3 letters after the directory/file indicator are the permissions for the owner, followed by the group and finally the world. Let's look at two examples of the files/directories highlighted in the above picture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.bashrc&lt;/code&gt; file (&lt;code&gt;-rw-r--r--&lt;/code&gt;) - The owner has read and write permissions for the file, but no execute permission. The group and the world share the same permission, i.e., read only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;newdir&lt;/code&gt; directory (&lt;code&gt;drwxr-xr-x&lt;/code&gt;) - The owner has all 3 permissions for this directory, i.e., read, write and execute. The group and the world share the same permissions, i.e., read and execute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;File permissions, changing permissions, and user access in linux is a pretty interesting topic, and we have barely scratched the surface. We will dive deeper in an upcoming post.&lt;/p&gt;
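&lt;p&gt;A quick way to inspect the permission string for a single file (the file name &lt;code&gt;demo&lt;/code&gt; is illustrative):&lt;/p&gt;

```shell
# Create an empty file and look at its long listing
touch demo
ls -l demo
# The first column is the permission string, e.g. -rw-r--r--:
# '-' for a regular file, then rwx triplets for owner, group and world
```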

&lt;h3&gt;
  
  
  5. Manipulating files and directories in cli
&lt;/h3&gt;

&lt;p&gt;The basic operations we perform in any filesystem (graphical or cli based) are creating, copying, moving, deleting and renaming files and directories. Let's see how that works in a cli.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a file with &lt;code&gt;touch&lt;/code&gt; and &lt;code&gt;vim&lt;/code&gt; commands:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9hWbDjVw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424762989/UUYqI6QmF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9hWbDjVw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424762989/UUYqI6QmF.png" alt="Untitled 15.png" width="880" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy files and directories with &lt;code&gt;cp&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5AivTq2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424800755/0WbEQsw6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5AivTq2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424800755/0WbEQsw6j.png" alt="Untitled 16.png" width="732" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copying a file onto another file will overwrite the file in the destination path. For example, the command &lt;code&gt;cp /path/to/sourcefile /path/to/destinationfile&lt;/code&gt; will overwrite the contents of the destination file with the source file's contents&lt;/li&gt;
&lt;/ul&gt;
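&lt;p&gt;A minimal sketch of this overwrite behavior (the file names are illustrative):&lt;/p&gt;

```shell
printf 'source\n' > srcfile
printf 'destination\n' > destfile

# Copying onto an existing file silently replaces its contents
cp srcfile destfile
cat destfile    # now prints "source"
```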

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--59UHKzRY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424810965/jrbLoItq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--59UHKzRY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424810965/jrbLoItq1.png" alt="Untitled 17.png" width="659" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To copy a directory to a path, we need an additional option &lt;code&gt;-r&lt;/code&gt; which stands for recursive and it will allow us to copy a directory and its contents recursively. For example, &lt;code&gt;cp -r ./srcdir ./destdir&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xngs-D4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424821807/cvw3r_J5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xngs-D4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424821807/cvw3r_J5e.png" alt="Untitled 18.png" width="590" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point one might wonder why we need a cli to perform these simple tasks, which could be done very easily in a GUI, like dragging and dropping a file/directory to copy it. The answer is power and flexibility.&lt;/p&gt;

&lt;p&gt;Let's say we have thousands of files common between two directories (&lt;code&gt;src&lt;/code&gt; and &lt;code&gt;dest&lt;/code&gt;), and we need to copy only those files from &lt;code&gt;src&lt;/code&gt; that are not yet in &lt;code&gt;dest&lt;/code&gt;. Doing this in a GUI is tedious, and if we had to repeat it every day or even every hour, it would be nearly impossible to get consistent results. In the cli it is a single command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cp -u srcdir/* destdir/&lt;/code&gt; copies from &lt;code&gt;srcdir&lt;/code&gt; only the files that are missing from &lt;code&gt;destdir&lt;/code&gt;, plus any files whose copy in &lt;code&gt;srcdir&lt;/code&gt; is newer than the one in &lt;code&gt;destdir&lt;/code&gt;.&lt;/p&gt;
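&lt;p&gt;A small sketch of this sync-style copy, using illustrative directory names:&lt;/p&gt;

```shell
mkdir -p srcdir destdir
printf 'a\n' > srcdir/a
printf 'b\n' > srcdir/b

# destdir already has a copy of a
cp srcdir/a destdir/

# -u copies only files missing from destdir,
# or newer in srcdir than in destdir
cp -u srcdir/* destdir/
ls destdir    # now lists both a and b
```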

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VHHztmc0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424833378/c9PIL5j7Z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VHHztmc0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424833378/c9PIL5j7Z.png" alt="Untitled 19.png" width="657" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Move files/directories with &lt;code&gt;mv&lt;/code&gt; command:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-n5EbO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424845180/c4CHjS63m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-n5EbO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424845180/c4CHjS63m.png" alt="Untitled 20.png" width="787" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove/delete a file or directory with &lt;code&gt;rm&lt;/code&gt; command:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rm /path/to/file&lt;/code&gt; deletes the file at the given path. The following shows how the &lt;code&gt;rm&lt;/code&gt; command works&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4y91FXA1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424857119/_8dUU_XXa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4y91FXA1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424857119/_8dUU_XXa.png" alt="Untitled 21.png" width="798" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rm -d /path/to/dir&lt;/code&gt; deletes an empty directory at the path. If we try the same command on a non-empty directory, we see a &lt;code&gt;Directory not empty&lt;/code&gt; error. To remove a directory along with its contents, we use the &lt;code&gt;-r&lt;/code&gt; (recursive) option, which deletes the contents recursively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GYJItf4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424865705/YIt7bXmeG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GYJItf4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424865705/YIt7bXmeG.png" alt="Untitled 22.png" width="819" height="536"&gt;&lt;/a&gt;&lt;/p&gt;
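&lt;p&gt;The difference between &lt;code&gt;-d&lt;/code&gt; and &lt;code&gt;-r&lt;/code&gt; can be sketched as follows (the directory names are illustrative):&lt;/p&gt;

```shell
mkdir emptydir fulldir
touch fulldir/file1

rm -d emptydir    # succeeds: the directory is empty
rm -d fulldir     # fails with "Directory not empty"
rm -r fulldir     # removes the directory and its contents recursively
```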

&lt;ul&gt;
&lt;li&gt;To see how the delete process is carried out, we can use the &lt;code&gt;-i&lt;/code&gt; option with the &lt;code&gt;rm&lt;/code&gt; command, which shows which file or directory is being deleted. As in the example below, it prompts for confirmation (yes-&lt;code&gt;y&lt;/code&gt;, no-&lt;code&gt;n&lt;/code&gt;) for every file and subdirectory in the directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9HelzEL0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424874928/Tl7jF7ry4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9HelzEL0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1645424874928/Tl7jF7ry4.png" alt="Untitled 23.png" width="809" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; To perform copy, delete or move operations, the user must have the necessary permissions (rwx), as discussed in the section above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tutorialspoint.com/unix/unix-file-management.htm"&gt;Unix / Linux - File Management (tutorialspoint.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.freecodecamp.org/news/vim-editor-modes-explained/"&gt;Vim Editor Modes Explained (freecodecamp.org)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redhat.com/sysadmin/linking-linux-explained"&gt;Hard links and soft links in Linux explained | Enable Sysadmin (redhat.com)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>The Linux Command Line Experience</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Wed, 02 Feb 2022 05:33:13 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/the-linux-command-line-experience-f29</link>
      <guid>https://forem.com/maninekkalapudi/the-linux-command-line-experience-f29</guid>
      <description>&lt;p&gt;Hello! Hope youre doing well. In this post well talk about Command Line Interface(CLI) in Linux. In the &lt;a href="https://dev.to/maninekkalapudi/what-is-linux-2aki-temp-slug-7977297"&gt;last post&lt;/a&gt; we discussed about what Linux is. In this one we will get a taste of working with Linux command line. Lets go!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post&lt;/strong&gt; :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is a Shell?&lt;/li&gt;
&lt;li&gt;Terminal Emulators&lt;/li&gt;
&lt;li&gt;Linux Filesystem&lt;/li&gt;
&lt;li&gt;Navigating Linux filesystem in CLI&lt;/li&gt;
&lt;li&gt;Linux command behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. What is a Shell?
&lt;/h3&gt;

&lt;p&gt;When we refer to the command line what we really mean is shell. The shell is a program that takes keyboard commands and passes them to the operating system to carry out. Almost all Linux distributions supply a shell program from the GNU Project called &lt;strong&gt;bash&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The name bash is an acronym for Bourne Again SHell. It is an enhanced replacement for the Bourne shell (sh), the original Unix shell program written by Stephen Bourne.&lt;/p&gt;

&lt;p&gt;The popular shells used in linux are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;C Shell (csh)&lt;/li&gt;
&lt;li&gt;Korn Shell (ksh)&lt;/li&gt;
&lt;li&gt;Z Shell(zsh)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Linux shell offers a way to interact with the kernel through commands. Ex: &lt;code&gt;ls&lt;/code&gt; lists all the files and folders in a directory. Each command represents a task to be performed.&lt;/p&gt;

&lt;p&gt;Every shell, whether bash or zsh, offers similar functionality with mostly the same commands, along with some additional features of its own. One example is the difference between &lt;a href="https://askanydifference.com/difference-between-bash-and-shell/"&gt;bash and the original shell&lt;/a&gt;: the original shell didn't offer command history (a list of previously executed commands), while bash does.&lt;/p&gt;
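&lt;p&gt;To check which shell you are using (the exact output varies by system):&lt;/p&gt;

```shell
# The login shell recorded for the current user
echo "$SHELL"

# The version banner of bash, the default on most distributions
bash --version
```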

&lt;h3&gt;
  
  
  2. Terminal Emulators
&lt;/h3&gt;

&lt;p&gt;A terminal is a program that passes the user's commands to the shell and displays the shell's output back to the user. A number of terminal emulators are available for Linux, but they all basically do the same thing: give us access to the shell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ODr-mzMl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643778638093/6wXWORLLv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ODr-mzMl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643778638093/6wXWORLLv.png" alt="Untitled.png" width="880" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terminal offers a way to customize the appearance of the text(commands, progress bars, icons and output) displayed and much more. The possibilities for customization are endless. One such example is &lt;a href="https://itsfoss.com/customize-linux-terminal/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Linux Filesystem
&lt;/h3&gt;

&lt;p&gt;The Linux filesystem is a hierarchical directory structure: every directory can contain files and other directories. Represented pictorially, it looks like a tree (the data structure). The root directory is the first directory in the filesystem, and it contains various directories, each assigned a purpose, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2xzNnOdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779473147/Ym4S66ydi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2xzNnOdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779473147/Ym4S66ydi.png" alt="Linux Filesystem. Source:[Linux File Hierarchy Structure - GeeksforGeeks](https://www.geeksforgeeks.org/linux-file-hierarchy-structure/)" width="602" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linux Filesystem. Source:&lt;a href="https://www.geeksforgeeks.org/linux-file-hierarchy-structure/"&gt;Linux File Hierarchy Structure - GeeksforGeeks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we first log in to a Linux machine as a typical user, we land in the &lt;code&gt;/home/&amp;lt;username&amp;gt;&lt;/code&gt; directory. The root user is the exception: its home directory is &lt;code&gt;/root&lt;/code&gt;.&lt;/p&gt;
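&lt;p&gt;This is easy to verify from a fresh login (the output depends on your user):&lt;/p&gt;

```shell
whoami          # prints the current username
echo "$HOME"    # the user's home directory, typically /home/USERNAME
```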

&lt;h3&gt;
  
  
  4. Navigating Linux filesystem in CLI
&lt;/h3&gt;

&lt;p&gt;One of the fundamental actions we perform in any operating system is navigating the filesystem. We are familiar with the graphical file managers in Windows and macOS, and even in desktop linux distros. But how about a server with only a CLI? Let's dive in.&lt;/p&gt;

&lt;p&gt;As soon as you log in to a linux machine (ex: a server) as a user, you will be in the &lt;code&gt;~&lt;/code&gt; (&lt;code&gt;/home/&amp;lt;username&amp;gt;&lt;/code&gt;) directory. In the picture below, I've logged into ubuntu using &lt;a href="https://pureinfotech.com/install-windows-subsystem-linux-2-windows-10/"&gt;WSL on Windows&lt;/a&gt;. We will be using this setup going forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1c0AXcq0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779534665/3ajL2N0l1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1c0AXcq0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779534665/3ajL2N0l1.png" alt="Untitled 2.png" width="880" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At any given time, we are inside a single directory per terminal session. We can see the files contained in that directory, the pathway to the directory above us (called the parent directory), and any subdirectories below us. The directory we are standing in is called the &lt;em&gt;current working directory&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let's try a few basic commands after logging in to the linux terminal&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;whoami&lt;/code&gt;- shows the username&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pwd&lt;/code&gt;- prints the current working directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ls&lt;/code&gt;- gives the list of files and directories in the current directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w-AkZ2eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779632786/Ppu7TBbVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w-AkZ2eE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779632786/Ppu7TBbVA.png" alt="Untitled 3.png" width="880" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives us a pretty good understanding of who we are (&lt;code&gt;whoami&lt;/code&gt;), where we are (&lt;code&gt;pwd&lt;/code&gt;) and what we have in our current directory (&lt;code&gt;ls&lt;/code&gt;). To navigate the filesystem we use the &lt;code&gt;cd&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cd&lt;/code&gt; is short for change directory; it changes the current working directory. The &lt;code&gt;cd&lt;/code&gt; command expects a directory name or a path in the filesystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2vN8PWO9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779642273/Ev5brHvEQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2vN8PWO9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779642273/Ev5brHvEQ.png" alt="Untitled 4.png" width="880" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cd /usr/bin&lt;/code&gt;- change the current working directory to &lt;code&gt;/usr/bin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cd ~&lt;/code&gt;- &lt;code&gt;~&lt;/code&gt; is notation for the user's home directory, which is &lt;code&gt;/home/&amp;lt;username&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cd ./test&lt;/code&gt;- &lt;code&gt;.&lt;/code&gt; represents the current directory and &lt;code&gt;/test&lt;/code&gt; represents the test directory under it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that the path is provided in two different ways. There are two ways to define a path in linux:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Absolute path:&lt;/strong&gt; begins at the root directory (&lt;code&gt;/&lt;/code&gt;) and spells out the full location, e.g. &lt;code&gt;/usr/bin&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relative path:&lt;/strong&gt; starts from the current working directory, using &lt;code&gt;.&lt;/code&gt; (current directory) and &lt;code&gt;..&lt;/code&gt; (parent directory), e.g. &lt;code&gt;./test&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A few tips:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;While navigating subdirectories within the current working directory, the &lt;code&gt;.&lt;/code&gt; can be omitted from the &lt;code&gt;cd&lt;/code&gt; command. For example: we are in pdir, pdir contains dir1, and dir1 contains dir2. To navigate to dir2 from pdir, we can use &lt;code&gt;cd dir1/dir2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Just enter the &lt;code&gt;cd&lt;/code&gt; command and hit return to navigate to the user's home directory from any directory.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cd ~username&lt;/code&gt;- changes the working directory to the home directory of username&lt;/li&gt;
&lt;/ol&gt;
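&lt;p&gt;The tips above can be sketched as follows (the directory names are illustrative):&lt;/p&gt;

```shell
mkdir -p pdir/dir1/dir2
cd pdir

# Relative path without the leading ./
cd dir1/dir2
pwd    # ends in pdir/dir1/dir2

# cd with no argument returns to the home directory
cd
pwd    # the user's home directory
```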

&lt;h3&gt;
  
  
  5. Linux command behavior
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ls&lt;/code&gt; command gives the list of files and directories in the current working directory. But what if we want more detail in the output? This is where command options and arguments come in handy. The options for a command modify its behavior, giving different results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls -a&lt;/code&gt;- displays all files, including hidden ones, in the current working directory. Hidden files and directories start with &lt;code&gt;.&lt;/code&gt;. For example: &lt;code&gt;.profile&lt;/code&gt; is a hidden file and &lt;code&gt;.landscape&lt;/code&gt; is a hidden directory in the example below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRwQoSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779667495/0jWGu7T3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRwQoSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779667495/0jWGu7T3u.png" alt="Untitled 5.png" width="880" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls -altr&lt;/code&gt;- displays all the files in the current working directory(&lt;code&gt;a&lt;/code&gt;), in a long list(&lt;code&gt;l&lt;/code&gt;), sorted by modification time(&lt;code&gt;t&lt;/code&gt;) and in reverse order(&lt;code&gt;r&lt;/code&gt;). We can combine multiple options for the desired behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qq3XPv2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779680942/L97yK2IOp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qq3XPv2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779680942/L97yK2IOp.png" alt="Untitled 6.png" width="880" height="943"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls /usr . -altr&lt;/code&gt; command takes two paths as arguments (&lt;code&gt;/usr&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt;) and displays the results for both paths, as shown below&lt;/li&gt;
&lt;/ul&gt;
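&lt;p&gt;A small sketch of the multi-path behavior: when given more than one path, &lt;code&gt;ls&lt;/code&gt; prints each directory's listing under a header line ending with a colon.&lt;/p&gt;

```shell
mkdir -p /tmp/demo-a /tmp/demo-b
touch /tmp/demo-a/one /tmp/demo-b/two

# Each argument gets its own "path:" header, followed by its contents.
ls /tmp/demo-a /tmp/demo-b
```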

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pWWBtX1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779692385/kW62sBhyOZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pWWBtX1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779692385/kW62sBhyOZ.png" alt="Untitled 7.png" width="880" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An important question here is... do we have to remember all of the commands and their options available in Linux? No! You can't possibly do that, but there is a command for that as well.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;man&lt;/code&gt;, short for manual, gives the full documentation for almost any command in Linux. We can get the documentation for a command using &lt;code&gt;man &amp;lt;command&amp;gt;&lt;/code&gt;. Let's try it with the &lt;code&gt;ls&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe1fhmcr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779704367/ozZMf0cK5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe1fhmcr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643779704367/ozZMf0cK5.png" alt="Untitled 8.png" width="880" height="953"&gt;&lt;/a&gt;&lt;/p&gt;
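&lt;p&gt;Since &lt;code&gt;man&lt;/code&gt; opens an interactive pager, a quick non-interactive alternative on GNU/Linux systems is the &lt;code&gt;--help&lt;/code&gt; option that most commands support (this is a convention, not a guarantee, so treat it as a sketch):&lt;/p&gt;

```shell
# Full manual, in a pager (press q to quit):
#   man ls
# Quick usage summary, printed straight to the terminal:
ls --help | head -n 5
```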

&lt;p&gt;In the upcoming posts we will dive deep into working with files and directories in Linux. Stay tuned!&lt;/p&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.journaldev.com/39194/different-types-of-shells-in-linux"&gt;What are the Different Types of Shells in Linux? - JournalDev&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.dartmouth.edu/~campbell/cs50/shell.html#:~:text=The%20shell%20is%20the%20Linux,shell%20executes%20the%20ls%20command."&gt;https://www.cs.dartmouth.edu/~campbell/cs50/shell.html#:~:text=The shell is the Linux,shell executes the ls command.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tecmint.com/linux-terminal-emulators/"&gt;22 Useful Terminal Emulators for Linux Desktop (tecmint.com)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://askanydifference.com/difference-between-bash-and-shell/"&gt;Difference Between Bash and Shell (With Table) Ask Any Difference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/linux-file-hierarchy-structure/"&gt;Linux File Hierarchy Structure - GeeksforGeeks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pureinfotech.com/install-windows-subsystem-linux-2-windows-10/"&gt;How to install WSL2 (Windows Subsystem for Linux 2) on Windows 10 Pureinfotech&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>What is Linux?</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sun, 23 Jan 2022 06:33:50 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/what-is-linux-5aj5</link>
      <guid>https://forem.com/maninekkalapudi/what-is-linux-5aj5</guid>
      <description>&lt;p&gt;Hello! Hope youre doing well. In this post well talk about Linux. Linux is a free, opensource software that is and highly customizable and ubiquitous in the computing world. Large parts of the internet as we know it is runs on Linux based operating systems(OS). So, knowing Linux and working with Linux command line will take us a long way in the software industry and others as well.&lt;/p&gt;

&lt;p&gt;Have you ever wondered why there is no "Linux OS" out there? I'm sure you have heard of macOS, Windows and even Linux distros, but never a Linux OS. Let's find out in this blog post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics covered in this post&lt;/strong&gt; :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Types of software development and distribution&lt;/li&gt;
&lt;li&gt;What is Linux?&lt;/li&gt;
&lt;li&gt;What is a Linux distro?&lt;/li&gt;
&lt;li&gt;What is a Linux command line?&lt;/li&gt;
&lt;li&gt;Linux commands&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Types of software development and distribution
&lt;/h3&gt;

&lt;p&gt;Let's understand the following terms to get a better idea of how software is developed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Opensource- The source code of the software is available to view and modify. A community of developers contributes to building and maintaining it.&lt;/li&gt;
&lt;li&gt;Free- Software that is free to use for the individual. One may not be able to view the source code, and in some cases the code cannot be modified or distributed.&lt;/li&gt;
&lt;li&gt;Closed source- The source code is not visible, and the end user cannot modify anything in the product or redistribute it. Users have to pay to obtain it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. What is Linux?
&lt;/h3&gt;

&lt;p&gt;Linux is a free and opensource, unix-like operating system (actually a kernel) developed by Linus Torvalds as a free operating system &lt;a href="https://www.cs.cmu.edu/~awb/linux.history.html"&gt;in 1991&lt;/a&gt;. It was based on (but is not a copy of) the Unix operating system, which was developed by AT&amp;amp;T (Bell Labs) as a proprietary OS (in some versions).&lt;/p&gt;

&lt;p&gt;Linux, on the other hand, was developed as a free and opensource alternative to Unix. We can get the publicly available source code for Linux, modify it and even redistribute it without any cost involved. Developers across the globe also contribute to Linux development.&lt;/p&gt;

&lt;p&gt;The opensource nature of Linux has allowed it to be modified for all kinds of purposes, ranging from microcontrollers to massive supercomputers, and even the space vehicles on the Moon and Mars.&lt;/p&gt;

&lt;p&gt;Linux, contrary to what you probably think, is not a full-fledged OS but a kernel. A kernel is the part of the OS that talks to hardware components like the CPU, RAM etc., and to the other components of the OS, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s8RhK31q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642918933035/VldxwDIXk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s8RhK31q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642918933035/VldxwDIXk.png" alt="Linux Architecture" width="880" height="677"&gt;&lt;/a&gt; &lt;strong&gt;Linux Architecture. Source: &lt;a href="https://www.interviewbit.com/linux-interview-questions/"&gt;https://www.interviewbit.com/linux-interview-questions/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What is a Linux distro?
&lt;/h3&gt;

&lt;p&gt;When the first version of the Linux kernel was developed, it was distributed with a set of &lt;a href="https://en.wikipedia.org/wiki/GNU_project"&gt;GNU&lt;/a&gt; utilities and tools for setting up a file system, a Graphical User Interface (GUI) and apps like the terminal. This is where the name Linux distribution (Linux distro) comes from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uuNHvC-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919106362/1EZ9sIwb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uuNHvC-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919106362/1EZ9sIwb9.png" alt="Untitled 1.png" width="600" height="600"&gt;&lt;/a&gt;Linux distribution. Source: &lt;a href="https://www.suse.com/c/how-suse-builds-its-enterprise-linux-distribution-part-2/"&gt;How SUSE builds its Enterprise Linux distribution PART 2 | SUSE Communities&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is still the case to this day: there are &lt;a href="https://distrowatch.com/dwres.php?resource=popularity"&gt;hundreds of Linux distributions&lt;/a&gt; available, and every distro has the Linux kernel and GNU components. One can download them for free and even customize them to one's heart's content. There are also distros derived from other distros. Needless to say, trying out various distros and tinkering with them is a huge hobby among enthusiasts.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. What is a Linux command line?
&lt;/h3&gt;

&lt;p&gt;A Linux command line is a text interface that allows us to interact with the computer using commands. It is often referred to as the shell, terminal or console; these names and their definitions are given below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terminal&lt;/strong&gt; : A text-based environment where you input commands and see the output. A terminal passes the input commands to a shell for execution and displays the output from it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shell&lt;/strong&gt; : A shell is the program that the terminal sends user input to. The shell generates output and passes it back to the terminal for display.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Console&lt;/strong&gt; : A console is a physical device that housed a terminal with a screen and keyboard. In the software world, the terms terminal and console are used interchangeably.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
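&lt;p&gt;To see the distinction in practice, you can ask the system which shell program your terminal is actually talking to:&lt;/p&gt;

```shell
echo "$SHELL"        # the login shell configured for your user
ps -p $$ -o comm=    # the shell process executing this very command
```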

&lt;p&gt;Now you might ask, how is this useful? Well, in a normal desktop environment you get all the GUI components installed, which looks like the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l9NypWAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919220933/Fi3Fl2BhV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l9NypWAx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919220933/Fi3Fl2BhV.png" alt="Untitled 2.png" width="800" height="450"&gt;&lt;/a&gt;Ubuntu GUI&lt;/p&gt;

&lt;p&gt;This is great for personal use and everything in the GUI seems to be laid out perfectly. But when it comes to &lt;a href="https://en.wikipedia.org/wiki/Server_(computing)"&gt;servers&lt;/a&gt;, where Linux is a primary choice; there will be no GUI and all the work should be done through the terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5yeiacsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919241274/d6xPIcT6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5yeiacsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919241274/d6xPIcT6e.png" alt="Untitled 3.png" width="660" height="442"&gt;&lt;/a&gt;Ubuntu command line&lt;/p&gt;

&lt;p&gt;When you log in to a server or open the terminal app in your Linux distro, you'll see the above window. What does the text &lt;code&gt;me@linuxbox:~$&lt;/code&gt; mean?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;me&lt;/code&gt;: the username you logged in as&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;linuxbox&lt;/code&gt;: the name of the machine or server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~&lt;/code&gt;: the home directory (folder) of the user. At any point, a terminal session is in exactly one directory; likewise, multiple terminals can point to different directories&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$&lt;/code&gt;: indicates that you are a normal user. In most shells, the admin/root user sees &lt;code&gt;#&lt;/code&gt; instead of the &lt;code&gt;$&lt;/code&gt; sign; &lt;code&gt;#&lt;/code&gt; therefore signals elevated privileges on the system.&lt;/li&gt;
&lt;/ul&gt;
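&lt;p&gt;Each piece of a prompt like &lt;code&gt;me@linuxbox:~$&lt;/code&gt; can also be queried directly (a small sketch; the exact prompt format varies by shell configuration):&lt;/p&gt;

```shell
whoami         # the "me" part: the logged-in user
hostname       # the "linuxbox" part: the machine name
echo "$HOME"   # the directory that ~ abbreviates
```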

&lt;p&gt;You can enter any command after this prompt appears and hit the &lt;code&gt;Return&lt;/code&gt; key to display the output. We will discuss user types, permissions and other topics in future posts.&lt;/p&gt;

&lt;p&gt;I'm using &lt;a href="https://www.youtube.com/watch?v=Owrk9UxnMdI"&gt;WSL with Ubuntu&lt;/a&gt; from here on, and it looks like the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBOlw3cL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919255093/ulhynOYs1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBOlw3cL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919255093/ulhynOYs1.png" alt="Untitled 4.png" width="880" height="525"&gt;&lt;/a&gt;Ubuntu on Windows Terminal with WSL2&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/mnt/c/Users/manik&lt;/code&gt; is the home directory for the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Linux commands
&lt;/h3&gt;

&lt;p&gt;Commands are reserved keywords that signify an action in the system. Here are a few example commands that we can try out in a Linux command line.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;date&lt;/code&gt;- displays the current date and time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cal&lt;/code&gt;- displays a calendar for the current month, with the current date highlighted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ls&lt;/code&gt;- lists all the directories and files in the current folder&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pwd&lt;/code&gt;- prints the present working directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clear&lt;/code&gt;- clears (hides) all the contents on the terminal window&lt;/li&gt;
&lt;/ul&gt;
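&lt;p&gt;These can be tried in any terminal session; a minimal sketch (&lt;code&gt;cal&lt;/code&gt; may need an extra package such as &lt;code&gt;bsdmainutils&lt;/code&gt; on minimal installs):&lt;/p&gt;

```shell
date    # current date and time
pwd     # absolute path of the present working directory
ls      # contents of that directory
# clear wipes the visible screen; it produces no output to show here.
```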

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g1aY-YPw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919277926/BOv5JlJRs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g1aY-YPw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919277926/BOv5JlJRs.png" alt="Untitled 5.png" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, if you type some random gibberish into the command line, it will throw an error saying &lt;code&gt;command not found&lt;/code&gt;. We can use the arrow keys to go through the command history (up arrow key) and to navigate within a command (left and right arrow keys).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H-YnkUMF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919294265/itMXQivmK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H-YnkUMF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642919294265/itMXQivmK.png" alt="Untitled 6.png" width="880" height="58"&gt;&lt;/a&gt;&lt;/p&gt;
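&lt;p&gt;The shell also records this failure in the special &lt;code&gt;$?&lt;/code&gt; variable, which holds the exit status of the last command; in bash, "command not found" is reported as status 127:&lt;/p&gt;

```shell
# Run a command that does not exist, suppressing the error message,
# and capture the exit status the shell reports for it.
status=0
some-gibberish-command 2>/dev/null || status=$?
echo "$status"   # 127 means the shell could not find the command
```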

&lt;p&gt;We can perform almost any action within the operating system using commands. We will explore all the important commands further in future blog posts.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://linuxcommand.org/tlcl.php"&gt;Linux Command Line Books by William Shotts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ubuntu.com/tutorials/command-line-for-beginners#1-overview"&gt;The Linux command line for beginners | Ubuntu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.hanselman.com/blog/whats-the-difference-between-a-console-a-terminal-and-a-shell"&gt;What's the difference between a console, a terminal, and a shell? - Scott Hanselman's Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/difference-between-terminal-console-shell-and-command-line/"&gt;Difference between Terminal, Console, Shell, and Command Line - GeeksforGeeks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://askubuntu.com/questions/506510/what-is-the-difference-between-terminal-console-shell-and-command-line"&gt;What is the difference between Terminal, Console, Shell, and Command Line? - Ask Ubuntu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://linuxcommand.org/"&gt;LinuxCommand.org: Learn The Linux Command Line. Write Shell Scripts.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.suse.com/c/how-suse-builds-its-enterprise-linux-distribution-part-2/"&gt;How SUSE builds its Enterprise Linux distribution PART 2 | SUSE Communities&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>WordCount Example with MapReduce</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sat, 14 Aug 2021 05:10:29 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/wordcount-example-with-mapreduce-4ngf</link>
      <guid>https://forem.com/maninekkalapudi/wordcount-example-with-mapreduce-4ngf</guid>
      <description>&lt;p&gt;Hello! Hope you're doing well. In my last &lt;a href="https://dev.to/maninekkalapudi/hadoop-mapreduce-a-programming-paradigm-5h4g-temp-slug-1504943"&gt;post&lt;/a&gt; I've explained about internals of Hadoop MapReduce. As promised in that post, we will write and execute a MapReduce program in Java for a simple wordcount example. Let's dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Topics covered in this post
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Pre-requisites&lt;/li&gt;
&lt;li&gt;Hadoop cluster setup on local machine and on Cloud&lt;/li&gt;
&lt;li&gt;Writing a MapReduce program on Eclipse&lt;/li&gt;
&lt;li&gt;Create a JAR file for the MapReduce Program and Uploading to HDFS&lt;/li&gt;
&lt;li&gt;Executing the MapReduce Program on the Hadoop Cluster&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Pre-requisites
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Admin access to the machine (local preferably)&lt;/li&gt;
&lt;li&gt;Hadoop Cluster (Single/Multi node cluster) on local machine or on cloud&lt;/li&gt;
&lt;li&gt;Install &lt;a href="https://www.oracle.com/in/java/technologies/javase/javase-jdk8-downloads.html"&gt;JDK 1.8 or later&lt;/a&gt; on the local machine&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.eclipse.org/downloads/"&gt;Eclipse IDE&lt;/a&gt; or any Java IDE installed on the local machine&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Hadoop cluster setup on local machine and on Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;i. Single Node cluster setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we already discussed, the DataNodes store and process the data. We need at least a single node Hadoop cluster to run the MapReduce program and process the data.&lt;/p&gt;

&lt;p&gt;Setting up a single-node Hadoop cluster on a local machine is a somewhat lengthy process and can often lead to errors. Below I'm sharing the guides that I've used to set up the cluster on my local machine for testing, for both Windows and Linux.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows- &lt;a href="https://towardsdatascience.com/installing-hadoop-3-2-1-single-node-cluster-on-windows-10-ac258dd48aef"&gt;https://towardsdatascience.com/installing-hadoop-3-2-1-single-node-cluster-on-windows-10-ac258dd48aef&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Linux (Ubuntu)- &lt;a href="https://phoenixnap.com/kb/install-hadoop-ubuntu"&gt;https://phoenixnap.com/kb/install-hadoop-ubuntu&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ii. Multi node Cluster setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, we can use a cloud-based Hadoop cluster like &lt;a href="https://cloud.google.com/dataproc"&gt;DataProc&lt;/a&gt; on Google Cloud Platform (GCP), which doesn't require any setup other than selecting the configuration of the NameNode and the DataNodes. The GCP account setup is covered &lt;a href="https://www.youtube.com/watch?v=W5mPX1-015o"&gt;here&lt;/a&gt;. We'll see the cluster setup in the following steps.&lt;/p&gt;

&lt;p&gt;Before going any further you should consider two important steps while operating in any cloud environment.&lt;/p&gt;

&lt;p&gt;a. Setting up the &lt;a href="https://www.youtube.com/watch?v=F4omjjMZ54k"&gt;billing alerts&lt;/a&gt; to avoid any unexpected bills.&lt;/p&gt;

&lt;p&gt;b. Turn off/delete the resources soon after the work is done&lt;/p&gt;

&lt;p&gt;a. Sign up to the &lt;a href="https://cloud.google.com/"&gt;Google Cloud&lt;/a&gt; and login to your account&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qucTIds1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912768681/2_2-FDIWB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qucTIds1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912768681/2_2-FDIWB.png" alt="Untitled.png" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. Search for " &lt;strong&gt;DataProc&lt;/strong&gt;" and select the option with the same name in the results&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EPRk0htN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912893364/S69V9zPhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EPRk0htN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912893364/S69V9zPhp.png" alt="Untitled 1.png" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;c. Select the " &lt;strong&gt;Create Cluster&lt;/strong&gt;" option&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yglbQ_Is--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912915773/AqE0VVMnE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yglbQ_Is--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628912915773/AqE0VVMnE.png" alt="Untitled 2.png" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;d. Provide the following details in the create cluster page under "setup a cluster page"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;i. Cluster name - test-cluster

ii. Cluster region and Zone - us-central1, us-central1-a

iii. Cluster Type - Standard (1 master, N workers)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oTzAMSpV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913390141/DG7bLloJU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oTzAMSpV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913390141/DG7bLloJU.png" alt="Untitled 3.png" width="880" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iv. Autoscaling Policy - None

v. Image type and version - 2.0-debian10 (default)

vi. Select Enable Component gateway

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xKVU4qRX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913685987/yPy-I9sDR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xKVU4qRX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913685987/yPy-I9sDR.png" alt="Untitled 4.png" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S-3o86V---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913703498/C0h2AYaUx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S-3o86V---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913703498/C0h2AYaUx.png" alt="Untitled 5.png" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;e. Under " &lt;strong&gt;Configure nodes&lt;/strong&gt;" select the following for Master node&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;i. Machine family - General-Purpose (default)

ii. Series - N1 (default)

iii. Machine type - n1-standard-2 (2 vCPU, 7.5 GB memory)

iv. Primary disk size (min 15 GB) - 100GB

v. Primary disk type - Standard Persistent Disk

vi. Number of local SSDs - 0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jBYj9FcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913833256/cKcr_XIwO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jBYj9FcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913833256/cKcr_XIwO.png" alt="Untitled 6.png" width="880" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;f. Select the following for "Worker Nodes"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;i. Machine family - General-Purpose (default)

ii. Series - N1 (default)

iii. Machine type - n1-standard-2 (2 vCPU, 7.5 GB memory)

iv. Number of worker nodes - 2

v. Primary disk size (min 15 GB) - 100GB

vi. Primary disk type - Standard Persistent Disk

vii. Number of local SSDs - 0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ttx-J1qB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913858019/9Y_b3vHrixP.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ttx-J1qB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913858019/9Y_b3vHrixP.png" alt="Untitled 7.png" width="880" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;g. Leave the rest of the config as is and click "CREATE"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T9AML453--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913890948/aTYipyWQX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T9AML453--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913890948/aTYipyWQX.png" alt="Untitled 8.png" width="880" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;h. Click on the cluster name and select the "VM Instances" tab in the page&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c4nePUiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913933980/aE70UXDXp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c4nePUiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913933980/aE70UXDXp.png" alt="Untitled 9.png" width="880" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;i. Click on "SSH" for the master node and you'll be presented with a new browser window connected to the master node of our HDFS cluster. I've used a local terminal to connect to the master node for the rest of the post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TJb1mXXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913980494/qIVrZF_BI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TJb1mXXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628913980494/qIVrZF_BI.png" alt="Untitled 10.png" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In real-world scenarios, we would connect to the Hadoop cluster via a gateway node or edge node. We don't use the NameNode for connecting to the cluster, since it is busy managing the cluster.&lt;/p&gt;
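&lt;p&gt;For reference, the console steps above can also be expressed as a single gcloud CLI command. This is a sketch, not something run in this post: the flag names assume a recent Google Cloud SDK, so verify them with &lt;code&gt;gcloud dataproc clusters create --help&lt;/code&gt; before relying on it.&lt;/p&gt;

```shell
# Approximate gcloud equivalent of the console steps above. Verify the
# flags against your SDK version; running this will incur GCP charges.
gcloud dataproc clusters create test-cluster \
  --region=us-central1 \
  --zone=us-central1-a \
  --image-version=2.0-debian10 \
  --enable-component-gateway \
  --master-machine-type=n1-standard-2 \
  --master-boot-disk-size=100GB \
  --num-workers=2 \
  --worker-machine-type=n1-standard-2 \
  --worker-boot-disk-size=100GB
```

&lt;p&gt;Remember the billing caveat above: delete the cluster (&lt;code&gt;gcloud dataproc clusters delete test-cluster --region=us-central1&lt;/code&gt;) as soon as you are done.&lt;/p&gt;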

&lt;h3&gt;
  
  
  3. Writing a MapReduce program on Eclipse
&lt;/h3&gt;

&lt;p&gt;a. Create a new Java project called "wordcountmapreduce" in the Eclipse IDE on your local machine. Here I'm using a Linux (Ubuntu) machine to create the project; the rest of the steps should stay the same on a Windows machine as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNdLf-wF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914002133/RA0y2kZlC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNdLf-wF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914002133/RA0y2kZlC.png" alt="Untitled 11.png" width="880" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. Create a new class for the Map step by right-clicking on the project and selecting "Class". Enter the name of the Map class as "WordCountMapper" and hit Finish.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bjfj2BNq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914016333/ZlKA4zogp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bjfj2BNq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914016333/ZlKA4zogp.png" alt="Untitled 12.png" width="880" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;c. Once the &lt;code&gt;WordCountMapper&lt;/code&gt; class is created, use the mapper, reducer and partitioner implementations for the wordcount example from this &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/blob/main/Hadoop/MapReduce/eclipseprojects/wordcountmapreduce/src/wordcountpackage/WordCountMapper.java"&gt;GitHub link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yoLXkTRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914077782/BuL35ZKAT.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yoLXkTRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914077782/BuL35ZKAT.png" alt="Untitled 13.png" width="880" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;d. To remove the errors in the IDE, we must add the Hadoop libraries to the project build path. The following are the libraries (only the jar files) that should be added to the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt;/share/hadoop/mapreduce (&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt; is the path where you saved the hadoop distribution. Ex: /home/&lt;code&gt;&amp;lt;username&amp;gt;&lt;/code&gt;/hadoop-3.3.1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt;/share/hadoop/hdfs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt;/share/hadoop/client&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt;/share/hadoop/common&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;hadoop_dir&amp;gt;&lt;/code&gt;/share/hadoop/yarn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ncCA74Cb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914131283/3Aq2q4jYgI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ncCA74Cb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914131283/3Aq2q4jYgI.png" alt="Untitled 14.png" width="880" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on "Add External JARs" and navigate to the paths mentioned in the above list. After all the required JARs, click "Apply and Close"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TvNIEcnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914152194/-sP8awahw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TvNIEcnx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914152194/-sP8awahw.png" alt="Untitled 15.png" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;e. After adding the jars to the project build path, the errors in the IDE disappear, as shown in the image below. Use the code for the reducer (WordCountReducer.java), partitioner (WordCountPartitioner.java) and driver (WordCount.java) classes from the &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/eclipseprojects/wordcountmapreduce/src/wordcountpackage"&gt;GitHub link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tsKw-QKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914175506/-p_LXluEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tsKw-QKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914175506/-p_LXluEg.png" alt="Untitled 16.png" width="880" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;f. Once the project setup is done, we'll have a look at the "WordCount.java" class. This driver class executes the Map, Reduce, Combiner and Partitioner classes on the cluster. It includes configuration such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job name - &lt;code&gt;setJobName&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Driver class - &lt;code&gt;setJarByClass&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Mapper class - &lt;code&gt;setMapperClass&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Combiner class - &lt;code&gt;setCombinerClass&lt;/code&gt; (same as the Reducer class for the wordcount example)&lt;/li&gt;
&lt;li&gt;Reducer class - &lt;code&gt;setReducerClass&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Number of reducers - &lt;code&gt;setNumReduceTasks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Output data types from each class - &lt;code&gt;setOutputKeyClass&lt;/code&gt;, &lt;code&gt;setOutputValueClass&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Input and output paths - &lt;code&gt;addInputPath&lt;/code&gt;, &lt;code&gt;setOutputPath&lt;/code&gt; respectively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GDCr4RbW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914189286/bDF_Ajx4Ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GDCr4RbW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914189286/bDF_Ajx4Ys.png" alt="Untitled 17.png" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This completes the project and code setup required for the wordcount problem in MapReduce.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Create a JAR File for the MapReduce Program and Upload It to HDFS
&lt;/h3&gt;

&lt;p&gt;Once the project and the MapReduce code setup is done, there are two ways we could execute the MapReduce Java program:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the Java program within Eclipse. You can find a guide for this &lt;a href="https://projectgurukul.org/create-hadoop-mapreduce-project-eclipse/"&gt;here&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Package the Java program as a JAR file with all the dependencies and execute it on the Hadoop cluster. We'll follow this method in this guide.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps to package the wordcount MapReduce Java program as a JAR file:&lt;/p&gt;

&lt;p&gt;a. Right-click on the project and select the "Export" option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IvfHxPFG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914624601/YX6KLCX0L.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IvfHxPFG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914624601/YX6KLCX0L.png" alt="Untitled 18.png" width="880" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. Under Java, select "JAR" option and click Next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UkXmFFqO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914218157/Zq1YKqSVb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UkXmFFqO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914218157/Zq1YKqSVb.png" alt="Untitled 19.png" width="880" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;c. Select the path for saving the JAR file. Click Next until the final step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_vF7U3dd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914700912/fS9cJp-_N.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_vF7U3dd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914700912/fS9cJp-_N.png" alt="Untitled 20.png" width="613" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;d. Select the Main class as "WordCount" using the Browse window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gacj50m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914249227/mcdUibZtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gacj50m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914249227/mcdUibZtq.png" alt="Untitled 21.png" width="880" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;e. Click Finish to create the JAR file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--41FtpOUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914263820/lVsQh02LR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--41FtpOUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914263820/lVsQh02LR.png" alt="Untitled 22.png" width="610" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;f. The JAR file will be created as shown below. Once it is created, we'll upload it to the GCP Hadoop cluster and run it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FySVlYSI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914277414/f66rco09u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FySVlYSI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628914277414/f66rco09u.png" alt="Untitled 23.png" width="880" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;g. Now, we'll upload this to the master node of the Hadoop cluster using &lt;code&gt;scp&lt;/code&gt;. You can configure SSH to connect to the cluster instance on GCP using this &lt;a href="https://www.youtube.com/watch?v=2ibBF9YqveY"&gt;link&lt;/a&gt;. I've used Windows with Windows Terminal, following the same steps mentioned below. To copy the jar file(s) to the master node on the cluster, we use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SCP -i "`&amp;lt;Path/to/SSH/key/ssh-key&amp;gt;`" Path/to/jar/file/wordcountmapperonly.jar username@`&amp;lt;master-ip&amp;gt;`:/path/on/server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JnotHqvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915047434/nPBEsRKFz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JnotHqvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915047434/nPBEsRKFz.png" alt="Untitled 24.png" width="880" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;h. Once the jar file is available on the master node instance, we can use the following commands to copy it to the HDFS cluster. Note that the master node instance and the HDFS cluster are different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSH -i "`&amp;lt;Path/to/SSH/key/&amp;gt;`" username@`&amp;lt;master-ip&amp;gt;`
hadoop fs -put -f Path/to/jar/file/wordcountmapperonly.jar `&amp;lt;hdfs_path&amp;gt;`
hadoop fs -ls `&amp;lt;hdfs_path&amp;gt;`

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5wbusUuS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915383503/kTFe6Y4UY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5wbusUuS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915383503/kTFe6Y4UY.png" alt="Untitled 25.png" width="880" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we are copying the jar files &lt;code&gt;wordcountmapperonly.jar&lt;/code&gt;, &lt;code&gt;wordcountmapreduce.jar&lt;/code&gt; and &lt;code&gt;wordcountmapreducepartitioner.jar&lt;/code&gt;, along with the input data folder &lt;code&gt;HadoopInputFiles&lt;/code&gt;, to the Hadoop directory &lt;code&gt;'/'&lt;/code&gt;. The input folder contains 3 text files.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Executing the MapReduce Program on the Hadoop Cluster
&lt;/h3&gt;

&lt;p&gt;As we've seen already, the MapReduce driver class (WordCount.java) is configured to execute the Mapper, Combiner, Reducer and Partitioner. We'll run the MapReduce program with different configurations using the driver class:&lt;/p&gt;

&lt;p&gt;i. Only Mapper&lt;/p&gt;

&lt;p&gt;ii. Mapper and Reducer&lt;/p&gt;

&lt;p&gt;iii. Mapper, Reducer and Partitioner&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i. Only Mapper&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To run Mapper only, we need to comment out the Combiner, Reducer and Partitioner classes configured in the driver class and package the jar file as shown in the above step. The driver class should look like the below picture. The code for the same is &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/blob/main/Hadoop/MapReduce/eclipseprojects/wordcountmapperonly/src/wordcountpackage/WordCount.java"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rytryJt6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915418933/-c9HKnRfM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rytryJt6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915418933/-c9HKnRfM.png" alt="Untitled 26.png" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input files are in "/HadoopInputFiles", with the data split across three files as shown below. You can find the input files &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/HadoopInputFiles"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BYjrORud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915438667/hI3ABHCJJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BYjrORud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915438667/hI3ABHCJJ.png" alt="Untitled 27.png" width="880" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, run the jar file "wordcountmapperonly.jar" on the Hadoop cluster with the following command and the above input files. The steps to copy the jar file to the HDFS location are shown in the section above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hadoop jar `&amp;lt;hdfs_path&amp;gt;`/wordcountmapperonly.jar `&amp;lt;input_file_or_dir_path&amp;gt;` `&amp;lt;output_path&amp;gt;`

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following image shows how to run the MapReduce jars on the Hadoop cluster. The full output log of the run is &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/blob/main/Hadoop/MapReduce/logs/mapperonly-log.log"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S-6cLmWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915462139/ZoGafYArR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S-6cLmWx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915462139/ZoGafYArR.png" alt="Untitled 28.png" width="880" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of the mapper-only phase contains all the words with count 1, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RYby4ZLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915545193/-GjPz7VsJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RYby4ZLv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915545193/-GjPz7VsJ.png" alt="Untitled 29.png" width="880" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we run the MapReduce job, we can see the application tracked under YARN, the resource manager for the cluster. Every run gets an entry here. The default YARN URL is &lt;strong&gt;&lt;code&gt;&amp;lt;cluster-hostname&amp;gt;&lt;/code&gt;:8088&lt;/strong&gt;. For a Dataproc cluster, though, we need to go to the cluster details in the GCP console, select the "&lt;strong&gt;Web Interfaces&lt;/strong&gt;" tab and select "&lt;strong&gt;YARN ResourceManager&lt;/strong&gt;" to get to the YARN web interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1I7yGwl0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915560469/7f4jqAw71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1I7yGwl0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915560469/7f4jqAw71.png" alt="Untitled 30.png" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKFF0fGq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915573450/4_SWn0iL_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKFF0fGq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915573450/4_SWn0iL_.png" alt="Untitled 31.png" width="880" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case the output path in the &lt;code&gt;hadoop jar&lt;/code&gt; command already exists, the MapReduce framework throws an "&lt;strong&gt;Output directory already exists&lt;/strong&gt;" error as shown below. This prevents accidental overwriting of output data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3_bzsjge--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915587019/4Q5Pspg0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3_bzsjge--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915587019/4Q5Pspg0A.png" alt="Untitled 32.png" width="880" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;code&gt;-D mapred.reduce.tasks&lt;/code&gt; is set to 3 by default on this cluster, and we need only the map phase to run. We can force the reducer count to zero using this property.&lt;/p&gt;

&lt;p&gt;In the output path, we can see four different files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;_SUCCESS&lt;/strong&gt; - indicates the job completed successfully&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;part-m-00000 to part-m-00002&lt;/strong&gt; - one output file corresponding to each input file. Here, 'm' in the output filename indicates the 'mapper' phase. Since no reduce phase is configured for this run, each input file produces its own output file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHOCt9yJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915598287/X0LhV63YQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHOCt9yJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915598287/X0LhV63YQ.png" alt="Untitled 33.png" width="880" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we already know, each mapper emits the key-value pair &lt;code&gt;&amp;lt;word,1&amp;gt;&lt;/code&gt; for every word in the input, as shown in the output below.&lt;/p&gt;
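&lt;p&gt;As a quick illustration, the map phase can be simulated in a few lines of Python. This is a simplified sketch, not the actual Hadoop Java code (which lives in the GitHub repository linked above):&lt;/p&gt;

```python
# Simplified, non-Hadoop simulation of the wordcount map phase:
# for every word in a line of input, emit the key-value pair (word, 1).
def word_count_mapper(line):
    return [(word, 1) for word in line.split()]

print(word_count_mapper("big data is big"))
# [('big', 1), ('data', 1), ('is', 1), ('big', 1)]
```

&lt;p&gt;Note that the word "big" appears twice, each time with count 1; summing those counts is the job of the reduce phase, not the mapper.&lt;/p&gt;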

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qAtqAVxH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915612631/f-ihJ81jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qAtqAVxH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915612631/f-ihJ81jr.png" alt="Untitled 34.png" width="880" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ii. Mapper and Reducer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's run '&lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/jarfiles"&gt;wordcountmapreduce.jar&lt;/a&gt;' with the same input files and a different output path. This has both the map and reduce phases configured in the driver class. Logs for the run are &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/logs"&gt;here&lt;/a&gt; and the code is &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/eclipseprojects/wordcountmapreduce"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nA2Fd_wo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915634488/gd7Jl_TIg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nA2Fd_wo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915634488/gd7Jl_TIg.png" alt="Untitled 35.png" width="880" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reduce phase generates the output in a single file, since we have only one reducer by default in the cluster.&lt;/p&gt;
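&lt;p&gt;The shuffle and reduce behavior can also be sketched in Python (again a simplified simulation, not the actual Hadoop code):&lt;/p&gt;

```python
from collections import defaultdict

# Simplified, non-Hadoop simulation of the shuffle and reduce phases:
# group the mapper's (word, 1) pairs by key, then sum the counts per key.
def word_count_reduce(pairs):
    grouped = defaultdict(list)
    for word, count in pairs:  # shuffle: group values by key
        grouped[word].append(count)
    # reduce: sum the grouped counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

mapper_output = [("big", 1), ("data", 1), ("is", 1), ("big", 1)]
print(word_count_reduce(mapper_output))
# {'big': 2, 'data': 1, 'is': 1}
```

&lt;p&gt;With a single reducer, all grouped keys land in one place, which is why the job above produces one part-r-00000 output file.&lt;/p&gt;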

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jSMD0D6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915647512/d2IJte2vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jSMD0D6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915647512/d2IJte2vn.png" alt="Untitled 36.png" width="880" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GkmsT3z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915662432/RF4SRauMR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GkmsT3z4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915662432/RF4SRauMR.png" alt="Untitled 37.png" width="880" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iii. Mapper, Reducer and Partitioner&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's run '&lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/jarfiles"&gt;wordcountmapreducepartitioner.jar&lt;/a&gt;' with the same input files and a different output path. This has the map, partition and reduce phases configured in the driver class. Logs for the run are &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/logs"&gt;here&lt;/a&gt; and the code is &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/tree/main/Hadoop/MapReduce/eclipseprojects/wordcountmapreducepartitioner"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GPcQXzsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915695861/b6luheBzE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GPcQXzsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915695861/b6luheBzE.png" alt="Untitled 38.png" width="880" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of MapReduce with the partitioner is as follows. As per the partitioner logic &lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples/blob/main/Hadoop/MapReduce/eclipseprojects/wordcountmapreducepartitioner/src/wordcountpackage/WordCountPartitioner.java"&gt;here&lt;/a&gt;, a separate output file is created for each starting letter of a word. This means we are creating 26 partitions, and the same number of reducers process the records. For example, all the words starting with the letter 'a' end up in the &lt;code&gt;'part-r-00001'&lt;/code&gt; file along with their counts.&lt;/p&gt;
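&lt;p&gt;The partition assignment can be sketched in Python. This is a simplified, 0-based simulation; the exact partition numbering depends on the Java implementation linked above, and the fallback for non-alphabetic words is an assumption of this sketch:&lt;/p&gt;

```python
import string

# Simplified simulation of a letter-based partitioner: each word is
# assigned to one of 26 partitions based on its first letter, so all
# words sharing a first letter go to the same reducer.
def letter_partition(word):
    first = word.lower()[0]
    if first in string.ascii_lowercase:
        return ord(first) - ord("a")
    return 0  # assumption: non-alphabetic words fall back to partition 0

print(letter_partition("apple"))   # 0
print(letter_partition("banana"))  # 1
```

&lt;p&gt;Because the partition number is a pure function of the key, every occurrence of a word is guaranteed to reach the same reducer, which is what makes the per-letter output files possible.&lt;/p&gt;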

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---ajSpk0T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915707491/GZ7ZeN-dI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---ajSpk0T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915707491/GZ7ZeN-dI.png" alt="Untitled 39.png" width="880" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---4WU-EH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915714608/3pywbng6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---4WU-EH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1628915714608/3pywbng6r.png" alt="Untitled 40.png" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;We have seen a practical wordcount example with MapReduce, as promised in my last &lt;a href="https://dev.to/maninekkalapudi/hadoop-mapreduce-a-programming-paradigm-5h4g-temp-slug-1504943"&gt;post&lt;/a&gt;. This is an exhaustive guide capturing the best-known ways to create and execute MapReduce programs in Java.&lt;/p&gt;

&lt;p&gt;MapReduce as a compute framework has lost its edge to newer frameworks like Spark. But did you know that we can use MapReduce to ingest data into HDFS from an RDBMS source, or write SQL-like queries that execute as MapReduce jobs? We will discuss those in detail in my next blog posts. Stay tuned!&lt;/p&gt;

&lt;p&gt;For now though, I'll delete the cloud resources that I've spun up for this tutorial. If you did the same, please delete the resources you have created, or else you'll end up with something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/forrestbrazeal/status/1389622850567421952?s=20"&gt;https://twitter.com/forrestbrazeal/status/1389622850567421952?s=20&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://trendytech.in/"&gt;Big Data course&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/bigdatabysumit/"&gt;Sumit M&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maninekkalapudi/dataengineeringbyexamples"&gt;https://github.com/maninekkalapudi/dataengineeringbyexamples&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If my content helped you in any way and you'd like to contribute to my knowledge quest and sharing, you can contribute to me here.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Mani&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Hadoop MapReduce - A Programming Paradigm</title>
      <dc:creator>maninekkalapudi</dc:creator>
      <pubDate>Sun, 01 Aug 2021 12:43:33 +0000</pubDate>
      <link>https://forem.com/maninekkalapudi/hadoop-mapreduce-a-programming-paradigm-4cpd</link>
      <guid>https://forem.com/maninekkalapudi/hadoop-mapreduce-a-programming-paradigm-4cpd</guid>
      <description>&lt;p&gt;Hello! Hope you're doing well. In my last &lt;a href="https://dev.to/maninekkalapudi/hdfs-hadoop-distributed-filesystem-f26-temp-slug-182135"&gt;post&lt;/a&gt; I've explained about the internals of HDFS in detail with hands-on examples. In this post we will discuss about MapReduce, a big data processing framework. It is not a mere compute framework or a tool. It is a completely new programming paradigm that simplifies the big data processing in parallel with key-value pairs. We'll discuss everything in detail with examples in this post. Let's dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Topics covered in this post
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;What is MapReduce?&lt;/li&gt;
&lt;li&gt;Traditional Programming vs MapReduce&lt;/li&gt;
&lt;li&gt;Higher Order Functions &lt;/li&gt;
&lt;li&gt;MapReduce Framework Components&lt;/li&gt;
&lt;li&gt;MapReduce on Hadoop Cluster&lt;/li&gt;
&lt;li&gt;MapReduce with Combiner&lt;/li&gt;
&lt;li&gt;MapReduce with Partitioner&lt;/li&gt;
&lt;li&gt;Wordcount example in MapReduce&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. What is MapReduce?
&lt;/h3&gt;

&lt;p&gt;MapReduce is a distributed parallel compute framework developed by engineers at Google around 2004. This new framework addressed the challenges Google was facing at the time in processing large volumes of data to index websites for its search engine.&lt;/p&gt;

&lt;p&gt;Suppose a user searches for "shopping" on Google: they will receive the shopping websites or businesses most relevant to the term. To produce such relevant search results, Google must crawl through every website on the internet, understand what a user might be looking for in each website, and group similar websites.&lt;/p&gt;

&lt;p&gt;Of course, this is an oversimplification of how search works. But our focus is to understand how Google engineers came up with a solution to understand every website (search engine indexing) at planetary scale through big data processing.&lt;/p&gt;

&lt;p&gt;This is probably the first time a group of inexpensive computers was connected over a network (in the form of a cluster) to perform data processing in parallel. Distribution among the nodes alone was not a sufficient answer [2]. This distribution of work had to be performed in parallel for the following three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The processing must be able to expand and contract automatically&lt;/li&gt;
&lt;li&gt;The processing must be able to proceed regardless of failures in the network or the individual systems&lt;/li&gt;
&lt;li&gt;Developers leveraging this approach must be able to create services that are easy to leverage by other developers. Therefore, this approach must be independent of where the data and computations have executed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MapReduce solved all the above problems by abstracting away job orchestration, providing out-of-the-box APIs so the end user doesn't have to manage any of the steps it performs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Traditional Programming vs MapReduce
&lt;/h3&gt;

&lt;p&gt;In the traditional programming style, when we write a program in any programming language of our choice, the program runs on a machine where the data is also present. This is a very efficient way to process small-scale data, and it can scale up to a few GBs easily.&lt;/p&gt;

&lt;p&gt;However, in the MapReduce style, the data is present on a group of machines, and the program is moved to all the machines where the relevant data is present, so the data is processed locally on those machines. This avoids transferring data over the network (a precious resource in datacenters) to the machine that has the program, which matters especially when the data size is massive.&lt;/p&gt;

&lt;p&gt;In my last &lt;a href="https://dev.to/maninekkalapudi/hdfs-hadoop-distributed-filesystem-f26-temp-slug-182135"&gt;post&lt;/a&gt; we understood how a distributed filesystem works. A Hadoop cluster can also perform big data processing using the MapReduce framework on the same nodes that store the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Higher Order Functions
&lt;/h3&gt;

&lt;p&gt;Before we understand how MapReduce works, we need to understand a programming concept called higher-order functions. All modern programming languages support higher-order functions. A &lt;a href="https://en.wikipedia.org/wiki/Higher-order_function"&gt;higher order function&lt;/a&gt; is a function that does at least one of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;takes one or more functions as arguments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;returns a function as its result&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
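&lt;p&gt;The &lt;code&gt;map&lt;/code&gt; example below covers the first case. As a minimal sketch of the second case, here is a Python function that returns a function (the &lt;code&gt;multiplier&lt;/code&gt; name is just an illustration, not part of MapReduce):&lt;/p&gt;

```python
# A function that returns a function (the second kind of higher order function)
def multiplier(n):
    def multiply(a):
        return a * n
    return multiply

double = multiplier(2)   # 'double' is itself a function
print(double(5))         # 10
```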

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;map&lt;/code&gt; is a higher order function. It takes two parameters:

&lt;ol&gt;
&lt;li&gt;A function which performs a task (Ex: multiply number by 2)&lt;/li&gt;
&lt;li&gt;List of values (Ex: List of numbers)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;map&lt;/code&gt; function takes a list of values and returns the same number of values after applying the specified function to each one. Here is an example of the map function in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Map function - Python example:
# Define a function which takes a parameter(a) and returns 2*a
def double(a):
    return a*2

# Create a list with some numbers
lst = [1,3,5,7,9]

# map the double function to the list of values in list 'lst'
double_lst = map(double, lst)

# Printing the map object double_lst gives a lazy map object, not the values
print(double_lst)
# &amp;lt;map object at 0x000001756B5FACF8&amp;gt;

# Convert the map object to list and print the results
print(list(double_lst))

#[2, 6, 10, 14, 18] -&amp;gt; result of map function. Every number in the list is doubled

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;reduce&lt;/code&gt; is also a higher order function: it takes a list of values along with a function as parameters and returns a single value. Here is an example of the reduce function in Python:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import functools # This is a python library which has reduce function
# create a list of numbers
lst = [100, 353, 565, 976, 128, 232]

# Define a function which takes two numbers and returns the greater one
def greater(a,b):
    if a &amp;gt; b:
        return a
    else:
        return b

print(functools.reduce(greater, lst))

# 976 -&amp;gt; Output of reduce function. It returns the highest number in the list

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above examples we get an understanding of how a map and a reduce function work independently. This idea generalizes to many kinds of tasks, as we'll see with the wordcount problem.&lt;/p&gt;
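&lt;p&gt;As a hedged sketch (plain Python, not the Hadoop API), here is wordcount expressed with only the &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;reduce&lt;/code&gt; higher order functions:&lt;/p&gt;

```python
import functools

# Sample input lines (made up for illustration)
lines = ["big data is big", "data is data"]

# "Map": emit a (word, 1) pair for every word in every line
pairs = []
for words in map(str.split, lines):
    pairs.extend((word, 1) for word in words)

# "Reduce": fold the pairs into a dictionary of per-word counts
def add_pair(counts, pair):
    word, one = pair
    counts[word] = counts.get(word, 0) + one
    return counts

counts = functools.reduce(add_pair, pairs, {})
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```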

&lt;h3&gt;
  
  
  4. MapReduce Framework Components
&lt;/h3&gt;

&lt;p&gt;The three important components of MapReduce framework are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mapper&lt;/li&gt;
&lt;li&gt;Reducer&lt;/li&gt;
&lt;li&gt;Combiner&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The MapReduce library is written in Java, and the above components are Java classes. Every component in the MapReduce library works only with &lt;code&gt;&amp;lt;key-value&amp;gt;&lt;/code&gt; pairs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Mapper&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mappers/Maps are individual tasks that map the input &lt;code&gt;&amp;lt;key-value&amp;gt;&lt;/code&gt; pairs to intermediate &lt;code&gt;&amp;lt;key-value&amp;gt;&lt;/code&gt; pairs. The transformed intermediate records do not need to be of the same type as the input records. &lt;strong&gt;The output of the Mapper is not the final output; it will be passed to a Reducer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will understand this with a word count problem. For the Mapper, the input will be &lt;code&gt;&amp;lt;rk (randomkey), line&amp;gt;&lt;/code&gt; pairs, and the output will be &lt;code&gt;&amp;lt;word, 1&amp;gt;&lt;/code&gt; pairs. Every line in the input will be split into words, and each word will get a count of 1, even if the word repeats. The input key (randomkey) will be ignored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FD5b7Yqs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820806932/9RMxVrTDu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FD5b7Yqs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820806932/9RMxVrTDu.png" alt="Untitled 3.png" width="768" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our input to the Mapper or MapReduce program will be raw text for a wordcount problem. But how did we get the &lt;code&gt;&amp;lt;rk (randomkey), line&amp;gt;&lt;/code&gt; as input to the Mapper? We'll see that in the next section.&lt;/p&gt;
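&lt;p&gt;The Mapper logic described above can be sketched in Python as follows (an illustration only; the real Mapper is a Java class):&lt;/p&gt;

```python
# Sketch of the wordcount Mapper: input is (random_key, line);
# output is one (word, 1) pair per word, repeats included.
def mapper(random_key, line):
    # the input key is ignored; only the line is processed
    for word in line.split():
        yield (word, 1)

print(list(mapper(0, "this is a line this is")))
# [('this', 1), ('is', 1), ('a', 1), ('line', 1), ('this', 1), ('is', 1)]
```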

&lt;p&gt;&lt;strong&gt;2. Reducer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Reducer works on the intermediate output from the Mappers and aggregates the results&lt;/strong&gt;. &lt;strong&gt;The output of the Reducer is the final output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the below example, the output from the Mapper contains every word with count 1. The Reducer takes this input and aggregates (sums up) the counts for each distinct word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cfpJDaAK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820843244/JBIamRSbU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cfpJDaAK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820843244/JBIamRSbU.png" alt="Untitled 1.png" width="635" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Combiner&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Combiner will have the same aggregation logic as the Reducer (in most cases), and it runs along with the Mapper on the same machine. It performs local aggregations at the Mapper level before sending the data to a Reducer for the final aggregation, which decreases the amount of data transferred from Mapper to Reducer by a huge degree.&lt;/p&gt;

&lt;p&gt;Combiners work fine for aggregations like count and sum, but one must be careful when implementing an aggregation like average: an average of per-Mapper averages is not the overall average, so the Combiner and the Reducer should not both compute an average. Instead, the Combiner should emit partial sums and counts, and only the Reducer should divide.&lt;/p&gt;
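&lt;p&gt;A small Python sketch of why this matters (the numbers are made up for illustration):&lt;/p&gt;

```python
# Values seen by two different Mappers
mapper_outputs = [[10, 20], [30, 40, 50]]

# Wrong: averaging the per-Mapper averages
wrong = sum(sum(v) / len(v) for v in mapper_outputs) / len(mapper_outputs)

# Right: Combiners emit (sum, count) pairs; the Reducer adds them
# up and divides only once at the end
partials = [(sum(v), len(v)) for v in mapper_outputs]
total = sum(s for s, c in partials)
count = sum(c for s, c in partials)
print(wrong, total / count)  # 27.5 30.0
```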

&lt;h3&gt;
  
  
  5. MapReduce on Hadoop Cluster
&lt;/h3&gt;

&lt;p&gt;Typically, the compute nodes and the storage nodes are the same in an HDFS cluster; that is, the MapReduce framework and HDFS run on the same set of nodes.&lt;/p&gt;

&lt;p&gt;As shown below, a client will send the program (a jar file containing the Mapper, Reducer and Combiner classes along with all necessary libraries) in &lt;a href="https://en.wikipedia.org/wiki/Serialization"&gt;serialized format&lt;/a&gt; to each node where the relevant data is present. The computations will take place on those same machines. Now we need to focus on how the computations (map and reduce) are done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SEk9FUcg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820880430/h4Vc6x8_v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SEk9FUcg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820880430/h4Vc6x8_v.png" alt="Untitled 2.png" width="880" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the MapReduce program (Mapper and Reducer classes) is sent to the nodes, the framework will split the data into logical &lt;code&gt;InputSplit&lt;/code&gt;s and assign them to the Mappers using the &lt;code&gt;InputFormat&lt;/code&gt; class. So, each block (128 MB by default) will be the input to one Mapper, and the number of Mappers that run equals the number of input splits. When the default logical &lt;code&gt;InputSplit&lt;/code&gt;s are not enough, we can customize the &lt;code&gt;RecordReader&lt;/code&gt; class to split the input as our special case requires.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RecordReader&lt;/code&gt; class typically, converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view for the Mapper &amp;amp; Reducer tasks for processing.&lt;/li&gt;
&lt;li&gt;What are the byte-oriented and record-oriented views? HDFS splits files into blocks by byte offsets (the byte-oriented view). Since this is not a logical split, part of the last record may reside in one block while the rest of it is in the next block. This is fine for storage, but for processing, the partial record in a block cannot be processed as it is. So the record-oriented view comes into play: it fetches the remaining part of the last record from the other block to make a set of complete records. This is called an input split (the record-oriented view).[5]&lt;/li&gt;
&lt;li&gt;When the &lt;code&gt;RecordReader&lt;/code&gt; reads each line in the file, it converts the lines into key-value pairs; the key assigned to each line is generated by the framework (referred to as a random key here). This will be the input (&lt;code&gt;&amp;lt;rk (randomkey), line&amp;gt;&lt;/code&gt;) to the Mapper class in our MapReduce program. The above steps are taken care of by the MapReduce framework itself, and we can also customize the behavior of the &lt;code&gt;RecordReader&lt;/code&gt; as per our requirements. More on that &lt;a href="https://data-flair.training/blogs/hadoop-recordreader/"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
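&lt;p&gt;The two views can be simulated in a few lines of Python (a sketch with a made-up file and block size, not HDFS itself):&lt;/p&gt;

```python
# Byte-oriented view: a file is cut into fixed-size byte blocks,
# which can split a record (line) in the middle.
data = b"first line\nsecond line\nthird line\n"
block_size = 16
blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
print(blocks[0])  # b'first line\nsecon'  (second record cut mid-way)

# Record-oriented view: re-join the bytes and split on record boundaries,
# emitting (byte_offset, line) pairs as the key-value input for the Mapper
records, offset = [], 0
for line in b"".join(blocks).split(b"\n"):
    if line:
        records.append((offset, line.decode()))
    offset += len(line) + 1
print(records)
# [(0, 'first line'), (11, 'second line'), (23, 'third line')]
```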

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wOKvw65u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820985583/xAROPK_Db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wOKvw65u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627820985583/xAROPK_Db.png" alt="Untitled 3.png" width="768" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once the Mapper receives the output from the &lt;code&gt;RecordReader&lt;/code&gt;, the randomly generated keys in the input are ignored and only the values are considered by the Mapper for processing. In our wordcount example, we consider only the lines and ignore the random keys&lt;/li&gt;
&lt;li&gt;The output of each Mapper is intermediate output, and it is stored on the local disk once the Mapper finishes processing. Since Mappers run on each node of the HDFS cluster, the map stage executes in parallel and is monitored by the &lt;code&gt;JobTracker&lt;/code&gt; in the framework. The output of the map stage for the wordcount example is as mentioned below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h6ogvyHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821009620/j8RfsynlL.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h6ogvyHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821009620/j8RfsynlL.png" alt="Untitled.png" width="631" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After all the Mappers are done with their processing, the data is sorted on disk and sent to another node within the cluster (possibly one of the nodes that performed the map stage). This operation is called &lt;strong&gt;Sort and Shuffle,&lt;/strong&gt; and the node to which the data is sent is called the &lt;strong&gt;Reducer&lt;/strong&gt;. Without a Reducer, there is no sort-and-shuffle phase, and every Mapper simply writes its own output file.&lt;/li&gt;
&lt;/ul&gt;
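&lt;p&gt;The sort-and-shuffle step can be sketched in Python as grouping all the Mappers' &lt;code&gt;&amp;lt;word, 1&amp;gt;&lt;/code&gt; pairs by key (an illustration only):&lt;/p&gt;

```python
from collections import defaultdict

# Intermediate (word, 1) pairs produced by two Mappers (made-up data)
mapper_outputs = [[("big", 1), ("data", 1)], [("data", 1), ("big", 1)]]

# Shuffle: group all values by key so each word ends up at one Reducer
grouped = defaultdict(list)
for output in mapper_outputs:
    for word, one in output:
        grouped[word].append(one)

# Sort: keys arrive at the Reducer in sorted order
for word in sorted(grouped):
    print(word, grouped[word])
# big [1, 1]
# data [1, 1]
```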

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gfzjSxOz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821060643/uT2f5K8RD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gfzjSxOz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821060643/uT2f5K8RD.png" alt="Untitled 4.png" width="880" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Reducer aggregates the results it receives from all the Mappers as key-value pairs. The final output is stored in the location provided by the user. For the wordcount example, the output will be &lt;code&gt;&amp;lt;word, count of all the occurrences&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;As much work as possible should be done in the Mappers, because they run in parallel; only the final aggregation should take place at the Reducer. The output of the reduce phase is stored in a single output file per Reducer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. MapReduce with Combiner
&lt;/h3&gt;

&lt;p&gt;A Combiner is an optional class that operates by accepting the inputs from the Map and thereafter passing the output key-value pairs to the Reducer. The Combiner has the exact same logic (in most cases) as the Reducer, and it performs the local aggregations at each Map level.&lt;/p&gt;

&lt;p&gt;Since it runs along with the Mapper, the Combiner also runs in parallel. Having a Combiner with the Mappers will reduce the amount of data shuffled between Mappers and Reducer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--km-7GMaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821105669/x-KSueePz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--km-7GMaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821105669/x-KSueePz.png" alt="Untitled 5.png" width="880" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of the MapReduce with combiner looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gsTTh19B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821123204/E2vcDAKEz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gsTTh19B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821123204/E2vcDAKEz.png" alt="Untitled 6.png" width="880" height="494"&gt;&lt;/a&gt;Source: &lt;a href="https://tutorials.freshersnow.com/map-reduce-tutorial/combiner-in-hadoop-mapreduce/"&gt;https://tutorials.freshersnow.com/map-reduce-tutorial/combiner-in-hadoop-mapreduce/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  7. MapReduce with Partitioner
&lt;/h3&gt;

&lt;p&gt;A Partitioner partitions the intermediate output from the Mappers into different groups. The partitions are created by a user-defined function that works like a hash function.&lt;/p&gt;

&lt;p&gt;In a typical MapReduce application we have only one Reducer that aggregates all the data in the final stage, but with the addition of a Partitioner, each partition is aggregated by a separate Reducer. So, the number of partitions is equal to the number of Reducers. Now, let's discuss the user-defined function for a partition.&lt;/p&gt;

&lt;p&gt;In our wordcount example, let's suppose we want to group words alphabetically, i.e., all the words starting with the letter 'a' should be in one group, and likewise for every other letter. The partitioner will require 26 conditions (one per letter) to achieve this, and we'll have 26 output files, one from each of the 26 Reducers.&lt;/p&gt;
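&lt;p&gt;A minimal Python sketch of such a partition function (an illustration; a real Hadoop Partitioner is a Java class):&lt;/p&gt;

```python
import string

# Route each intermediate word to one of 26 partitions by its first
# letter, so each partition (and hence each Reducer) handles one letter.
def partition(word, num_partitions=26):
    return string.ascii_lowercase.index(word[0].lower()) % num_partitions

print(partition("apple"), partition("banana"), partition("zebra"))
# 0 1 25
```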

&lt;p&gt;MapReduce with Partitioner will look as mentioned below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J0pq-_0F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821187394/bB-Va-0UJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J0pq-_0F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1627821187394/bB-Va-0UJ.png" alt="Untitled 7.png" width="880" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Wordcount example in MapReduce
&lt;/h3&gt;

&lt;p&gt;Well, that's a lot of content for a single blog post. I will cover the Wordcount example with great detail in my next post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;As in the wordcount example, Google processes data from pages across the internet in a similar fashion and later uses it to create inverted indices for search engine indexing. More on that &lt;a href="https://www.deepcrawl.com/knowledge/technical-seo-library/search-engine-indexing/#:~:text=An%20inverted%20index%20is%20a,to%20store%20and%20retrieve%20data"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;MapReduce is a pioneering big data processing framework. Though it does most of its processing in parallel in the Mappers, it is certainly slow due to its frequent disk writes and reads at each stage. Newer compute engines like &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; do a better job with in-memory computing, but an understanding of the core functionality of distributed data processing comes from MapReduce.&lt;/p&gt;

&lt;p&gt;I've not covered the &lt;code&gt;JobTracker&lt;/code&gt; for MapReduce jobs in this post, since it would be incomplete to talk about it without mentioning YARN (an operating system for big data applications). We'll discuss YARN in a separate post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://trendytech.in/"&gt;Big Data course&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/bigdatabysumit/"&gt;Sumit M&lt;/a&gt;. All drawn pictures are courtesy of the course&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"&gt;MapReduce Paper by Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dummies.com/programming/big-data/engineering/big-data-and-the-origins-of-mapreduce/"&gt;https://www.dummies.com/programming/big-data/engineering/big-data-and-the-origins-of-mapreduce/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html"&gt;https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data-flair.training/blogs/hadoop-recordreader/"&gt;https://data-flair.training/blogs/hadoop-recordreader/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/34871351/hadoop-input-splits-and-record-reader"&gt;https://stackoverflow.com/questions/34871351/hadoop-input-splits-and-record-reader&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm"&gt;https://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm"&gt;https://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nareshit.com/mapreduce-online-training/"&gt;Logo image&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks, Mani&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
