Introduction
This article is written for Linux administrators. It teaches you how to create pipelines on a terminal using the sed and awk commands. Combining these commands allows you to filter and analyze data, troubleshoot log files, and streamline your day-to-day workflow.
sed and awk are essential for filtering and transforming text data. awk works well with columns, and sed excels at search-and-replace. The power of these tools lies in combining them into a pipeline, and that is the focus of this tutorial.
Prerequisites
To complete this tutorial, you will need:
- Experience operating a Linux terminal. DigitalOcean's A Linux Command Line Primer is a great place to start.
- Knowledge of regular expressions and how to interpret and create them. Read An Introduction to Regular Expressions to learn more.
- Experience using common command line tools like cut, head, and so on. Check out Sed Stream Editor to Manipulate Text in Linux and How To Use the AWK language to Manipulate Text in Linux.
Your First Pipe and Filter Using sed and awk
Let us walk through a basic example of filtering specific data from a file with awk and then formatting it for display with sed. You will use a pipeline to extract and then print the product names and prices for products with a price greater than ten dollars.
First, create a products.txt file in Vim using the following command:
vim products.txt
Note: You don't have to use Vim; you can use whichever editor works best for you.
Fill the file with the following contents:
123:T-Shirt:19.99
456:Coffee Mug:8.50
789:Headphones:49.95
Here is the full pipeline you are going to construct:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Primer: awk
Let's brush up on awk. awk uses the syntax of condition { action }. Here is an example of an awk script:
/^int/ { print "Found an integer." }
- condition: /^int/
- { action }: print "Found an integer."
This is how it works: for every line beginning with "int", awk prints the message, "Found an integer."
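If you want to see this pattern in isolation, you can try a quick one-liner; the sample text here is just a placeholder for illustration:
echo "int count = 0;" | awk '/^int/ { print "Found an integer." }'
Because the echoed line starts with "int", awk prints the message.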
Now, let's break down each part of this pipeline. Here is the awk portion:
awk -F: '$3 > 10 {print $2 "\t" $3}'
Here is how it works:
- awk matches lines where the condition holds, that is, where the price ($3) is greater than 10; the action prints the product name ($2), followed by a tab, followed by the price.
- The -F: argument sets the field delimiter to a colon (:), which separates the columns in products.txt.
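If you would like to confirm what this stage produces before adding sed, you can run the awk portion on its own against products.txt:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt
You should see the two products priced above ten dollars, each separated from its price by a tab:
T-Shirt	19.99
Headphones	49.95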
Let's look at the sed portion of our pipeline:
sed '1iName\tPrice'
Here is how it works:
- The 1i command inserts a header line before the first line of output: "Name", followed by a tab, then "Price".
Below is the full pipeline. Run it:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Here is the resulting output:
Name Price
T-Shirt 19.99
Headphones 49.95
Straightforward enough, right?
More Complex Filters and Transformations
In this section, we will create some more complex filters and transformations using sed, awk, and some other commands. As we walk through each example pipeline, go slow, be patient, run every command, observe the output, and make sure you grasp what's happening.
Filtering System Resource Usage by User
Let's create a pipeline that analyzes process information generated by the ps command. As a system administrator, it behooves you to monitor resource usage per user, allowing you to discover users who are consuming excessive memory, CPU time, and so on.
Here is the full pipeline you will construct, which filters resource usage by user:
# Get process information
ps -eo pid,user,rss |
# Filter by specific user and format output
awk '$2 == "root" {print $1, $3/1024, "MB"}' |
# Sort by memory usage and add a column heading
sort -nrk 2 | sed '1iPID RSS(MB)'
Step One -- Outputting Process Info With ps
Begin this pipeline by generating process information using ps. Here is the first part of the pipeline:
ps -eo pid,user,rss
Here is how it works:
- The -e argument displays all processes.
- The -o argument selects the output columns: pid, user, and rss (resident set size, in kilobytes).
Running this command line produces the following output:
PID USER RSS
1 codespa+ 640
7 codespa+ 1792
42 root 3480
322 codespa+ 1408
355 root 1664
509 codespa+ 1536
518 codespa+ 131588
560 codespa+ 54792
981 codespa+ 62928
Now you have your fields of interest: PID, USER, and RSS.
Step Two -- Filtering User Process Info With awk
Let's move on to the next portion of our pipeline, which uses awk to filter lines for the “root” user and calculate memory usage in megabytes:
awk '$2 == "root" {print $1, $3/1024, "MB"}'
Here is how it works:
- The condition $2 == "root" selects lines where the USER field is equal to "root".
- The action {print $1, $3/1024, "MB"} prints the PID, the RSS value converted from kilobytes to megabytes, and the literal string "MB".
Note: Dividing the RSS value by 1024 converts it from kilobytes to megabytes, and demonstrates how awk can perform calculations.
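If you prefer tidier numbers, awk's printf function lets you control the precision. This variant is an optional tweak rather than part of the tutorial's pipeline; it rounds the value to one decimal place:
ps -eo pid,user,rss | awk '$2 == "root" {printf "%s %.1f MB\n", $1, $3/1024}'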
Below is our updated pipeline. Run it:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}'
You should see the following output:
1 0.191406 MB
7 0.148438 MB
8 94.7695 MB
212 7.16406 MB
821 5.85938 MB
1883 1.55469 MB
1884 2.91016 MB
Step Three -- Sorting by Memory Usage
Let's add commands to our pipeline that sort the output by memory usage and label the columns. Here is the final portion of the pipeline:
sort -nrk 2 | sed '1iPID RSS(MB)'
Here is how it works:
- sort -nrk 2 sorts the text numerically (-n), reverses the result (-r), and sorts by the second column (-k 2), which effectively orders the output by memory usage, highest first.
- '1iPID RSS(MB)' uses the i command to insert the column heading "PID RSS(MB)" before the first line of the sorted output.
Note: The sort runs before the heading is inserted, so the heading stays at the top instead of being sorted along with the data. Sorting by memory usage (column 2) in descending order helps identify resource-intensive processes efficiently.
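If the sort flags are unfamiliar, here is a tiny standalone demonstration using made-up numbers:
printf '12 0.5\n3 42.0\n7 6.1\n' | sort -nrk 2
The lines come back ordered 42.0, 6.1, 0.5, regardless of the values in the first column.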
Here is the full pipeline. Go ahead and run it:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)'
This should produce the following output:
PID RSS(MB)
8 95.1914 MB
212 7.16406 MB
821 5.88672 MB
2006 3.14453 MB
2007 2.92969 MB
2009 1.09375 MB
2008 1.08594 MB
1 0.191406 MB
7 0.148438 MB
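If you want to reuse this pipeline for accounts other than root, you can pass the username in as an awk variable instead of hard-coding it. Here is one way to do that, using a hypothetical shell variable named target_user:
target_user=root
# Compare field 2 against the awk variable instead of a hard-coded name
ps -eo pid,user,rss | awk -v user="$target_user" '$2 == user {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)'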
Failed Login Counter
Let's create a pipeline that analyzes an authentication log and counts failed login attempts. Paying attention to events like these helps you protect your system and respond to potential threats. Here is the full pipeline we'll build:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Here is how it works:
- grep "Failed password": Filters the lines that contain “Failed password” from the authentication log.
- sed 's/invalid user //': Removes the “invalid user” text from the lines, if present.
- awk '{print $9}': Prints the ninth field, which is typically the username.
- sort: Sorts the usernames alphabetically.
- uniq -c: Counts the occurrences of each username.
- sort -nr: Sorts the counts in descending order.
Let's walk through it.
Step One -- Creating a Log File
Create a file named auth.log using the following command:
vim auth.log
Fill the file with the following contents:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47341]: Connection closed by authenticating user root 103.106.189.143 port 60824 [preauth]
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:12 ubuntu-lts sshd[47343]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh user= rhost=103.106.189.143 user=root
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
Feb 10 15:45:16 ubuntu-lts sshd[47343]: Connection closed by authenticating user root 103.106.189.143 port 33990 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Received disconnect from 180.101.88.228 port 11349:11: [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Disconnected from authenticating user root 180.101.88.228 port 11349 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.228
Step Two -- Finding Failed Passwords
Below is the first part of the pipeline, which uses grep to filter lines containing “Failed password”. Run it:
grep "Failed password" auth.log
You should see the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
You now have all the failed password entries.
Step Three -- Removing Invalid Users
Update the pipeline by adding a sed command, which removes any “invalid user” text:
grep "Failed password" auth.log | sed 's/invalid user //'
Running this pipeline should produce the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
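The output matches the previous step exactly because this sample log happens to contain no “invalid user” entries. To see what the sed command is there for, you can feed it a hypothetical line of that form:
echo "Feb 10 15:45:20 ubuntu-lts sshd[47350]: Failed password for invalid user admin from 203.0.113.5 port 2222 ssh2" | sed 's/invalid user //'
With "invalid user " stripped, the username lands in the ninth field, the same position it occupies for valid accounts, which is what the next step relies on.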
Step Four -- Extracting Username
Update the pipeline by adding an awk command to print the username field ($9):
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}'
Running this pipeline should produce the following output:
tedbell
root
rhomboidgoatcabin
root
You are making progress! Now you have all the usernames.
Step Five -- Sorting Usernames
Update the pipeline by adding the following sort command to sort the usernames alphabetically:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort
Running this pipeline should produce the following output:
rhomboidgoatcabin
root
root
tedbell
Now you have an alphabetical list of usernames.
Step Six -- Counting Usernames
Update the pipeline by adding a uniq command. Using uniq with the -c argument counts the occurrences of each username. Because uniq only collapses adjacent duplicate lines, it relies on the sort you added in the previous step (see the quick demonstration after the output below).
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c
Running this pipeline should produce the following output:
1 rhomboidgoatcabin
2 root
1 tedbell
Now you have a user count.
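Note that uniq only collapses adjacent duplicate lines, which is why the usernames are sorted first. A quick standalone check with placeholder names makes the difference clear:
printf 'root\nted\nroot\n' | uniq -c
printf 'root\nted\nroot\n' | sort | uniq -c
The first command reports root twice with a count of 1 each, because the two root lines are not adjacent; the second reports a single root entry with a count of 2.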
Step Seven -- Sorting Output Again
Finally, update the pipeline by adding another sort command. Using the -nr arguments, this sort orders the output by username count in descending order.
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Running the full pipeline should produce the following output:
2 root
1 rhomboidgoatcabin
1 tedbell
All done!
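As an optional extension, not part of the original pipeline, you can count attempts per source IP address instead of per username by printing field 11, which holds the client address in these log lines:
# Field 11 is the source IP address in these log lines
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $11}' | sort | uniq -c | sort -nr
Against the sample log, each of the two addresses appears with a count of 2.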
Disk Consumption Report
Let's construct a pipeline that finds the top disk space-consuming directories and sorts them in descending order. Monitoring disk usage is important; it helps you catch space problems before they affect users. Here is the complete pipeline you will construct:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here's how it works:
- cat disk_usage.log: Outputs the content of disk_usage.log.
- awk '{print $2, $1}': Swaps the columns so that the directory path comes first.
- sed 's/G/ GB/': Replaces the unit ‘G’ with ‘ GB’, adding a space and standardizing the unit.
- sort -k2 -nr: Sorts the output based on the second column (disk space) in descending numerical order.
Step One -- Creating Disk Usage File
Begin by creating an input file called disk_usage.log, and fill it with the following content:
2.4G /usr/local/bin
5.7G /home/user
1.2G /tmp
9.8G /var/log
Step Two -- Outputting Contents of File
Begin the pipeline by using the cat command to send the contents of the disk usage file to standard output (the screen):
cat disk_usage.log
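Running this command on its own simply echoes the file back, so you should see the four lines you just created:
2.4G /usr/local/bin
5.7G /home/user
1.2G /tmp
9.8G /var/log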
Step Three -- Swapping Columns
Update the pipeline by adding an awk command to rearrange the order of the columns, displaying the directory path first:
cat disk_usage.log | awk '{print $2, $1}'
Running this pipeline should produce the following output:
/usr/local/bin 2.4G
/home/user 5.7G
/tmp 1.2G
/var/log 9.8G
Step Four -- Changing File Size Column Format
Update the pipeline by adding a sed command that replaces the unit ‘G’ with ‘ GB’:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/'
Running this pipeline should produce the following output:
/usr/local/bin 2.4 GB
/home/user 5.7 GB
/tmp 1.2 GB
/var/log 9.8 GB
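One caveat: s/G/ GB/ replaces the first capital G on each line wherever it appears, so a directory path containing a capital G (say, a hypothetical /home/Games) would be altered instead of the size. If that is a concern, you can anchor the substitution to the end of the line, where the size now sits; this is a defensive variant rather than something the sample data requires:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G$/ GB/'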
Step Five -- Sorting Output on Second Column
Update the pipeline by adding a sort command to sort the output based on the second column (disk space) in descending numerical order:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here is an explanation of the sort options used above:
- -k2: This flag specifies the column (field) used for sorting. In this case, 2 indicates the second column.
- -nr:
  - -n: This flag tells sort to perform a numeric sort on the specified column (the second column in this case).
  - -r: This flag reverses the sorting order, so it sorts in descending order instead of the default ascending order.
Running this pipeline produces output sorted by disk space:
/var/log 9.8 GB
/home/user 5.7 GB
/usr/local/bin 2.4 GB
/tmp 1.2 GB
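In day-to-day work you would usually generate this data with du rather than a static file. A rough equivalent looks like the following; the directory list is just an example, the output will differ on your system, and du may report sizes in K or M as well, in which case the G substitution and numeric sort would need adjusting (sort -h is handy there):
# Summarize each directory with du, then reuse the same formatting stages
du -sh /usr/local/bin /home /tmp /var/log 2>/dev/null | awk '{print $2, $1}' | sed 's/G$/ GB/' | sort -k2 -nr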
Nice work, friend!
Conclusion
In this tutorial, you have learned how to create sophisticated pipelines using sed, awk, and other commands. Now you are ready to start experimenting, creating your own pipelines, and solving day-to-day system administration problems.
I hope this tutorial helped. Thanks for reading!