Introduction
This article is written for Linux administrators. It teaches you how to create pipelines on a terminal using the sed and awk commands. Combining these commands allows you to filter and analyze data, troubleshoot log files, and streamline your day-to-day workflow.
sed and awk are essential for filtering and transforming text data. awk works well with columns, and sed excels at search-and-replace. The power of these tools lies in combining them into a pipeline, and that is the focus of this tutorial.
Prerequisites
To complete this tutorial, you will need:
- Experience operating a Linux terminal. DigitalOcean's A Linux Command Line Primer is a great place to start.
- Knowledge of regular expressions and how to interpret and create them. Read An Introduction to Regular Expressions to learn more.
- Experience using common command line tools like cut, head, and so on. Check out Sed Stream Editor to Manipulate Text in Linux and How To Use the AWK language to Manipulate Text in Linux.
Your First Pipe and Filter Using sed and awk
Let us walk through a basic example of filtering specific data from a file with awk and then formatting it for display with sed. You will use a pipeline to extract and then print the product names and prices for products with a price greater than ten dollars.
First, create a products.txt file in Vim using the following command:
vim products.txt
Note: You don't have to use Vim; you can use whichever editor works best for you.
Fill the file with the following contents:
123:T-Shirt:19.99
456:Coffee Mug:8.50
789:Headphones:49.95
Here is the full pipeline you are going to construct:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Primer: awk
Let's brush up on awk. awk uses the syntax of condition { action }. Here is an example of an awk script:
/^int/ { print "Found an integer." }
- condition: /^int/
- { action }: print "Found an integer."
This is how it works: for every line beginning with "int", awk prints the message, "Found an integer."
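If you want to see this pattern in isolation, you can try a quick one-liner; the sample text here is just a placeholder for illustration:
echo "int count = 0;" | awk '/^int/ { print "Found an integer." }'
Because the echoed line starts with "int", awk prints the message.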
Now, let's break down each part of this pipeline. Here is the awk portion:
awk -F: '$3 > 10 {print $2 "\t" $3}'
Here is how it works:
- awk matches lines where the condition holds, that is, where the price ($3) is greater than 10; the action prints the product name ($2), followed by a tab, followed by the price.
- The -F: argument sets the field delimiter to a colon (:), which separates the columns in products.txt.
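If you would like to confirm what this stage produces before adding sed, you can run the awk portion on its own against products.txt:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt
You should see the two products priced above ten dollars, each separated from its price by a tab:
T-Shirt	19.99
Headphones	49.95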
Let's look at the sed portion of our pipeline:
sed '1iName\tPrice'
Here is how it works:
- The 1i command inserts a header line before the first line of output: "Name", followed by a tab, then "Price".
Below is the full pipeline. Run it:
awk -F: '$3 > 10 {print $2 "\t" $3}' products.txt | sed '1iName\tPrice'
Here is the resulting output:
Name Price
T-Shirt 19.99
Headphones 49.95
Straightforward enough, right?
More Complex Filters and Transformations
In this section, we will create some more complex filters and transformations using sed, awk, and some other commands. As we walk through each example pipeline, go slow, be patient, run every command, observe the output, and make sure you grasp what's happening.
Filtering System Resource Usage by User
Let's create a pipeline that analyzes process information generated by the ps command. As a system administrator, it behooves you to monitor resource usage per user, allowing you to discover users who are consuming excessive memory, CPU time, and so on.
Here is the full pipeline you will construct, which filters resource usage by user:
# Get process information
ps -eo pid,user,rss |
# Filter by specific user and format output
awk '$2 == "root" {print $1, $3/1024, "MB"}' |
# Sort by memory usage and add a column heading
sort -nrk 2 | sed '1iPID RSS(MB)'
Step One -- Outputting Process Info With ps
Begin this pipeline by generating process information using ps. Here is the first part of the pipeline:
ps -eo pid,user,rss
Here is how it works:
- The -e argument displays all processes.
- The -o argument selects the output columns: pid, user, and rss (resident set size, in kilobytes).
Running this command line produces the following output:
PID USER RSS
1 codespa+ 640
7 codespa+ 1792
42 root 3480
322 codespa+ 1408
355 root 1664
509 codespa+ 1536
518 codespa+ 131588
560 codespa+ 54792
981 codespa+ 62928
Now you have your fields of interest: PID, USER, and RSS.
Step Two -- Filtering User Process Info With awk
Let's move on to the next portion of our pipeline, which uses awk to filter lines for the “root” user and calculate memory usage in megabytes:
awk '$2 == "root" {print $1, $3/1024, "MB"}'
Here is how it works:
- The condition $2 == "root" selects lines where the USER field is equal to "root".
- The action {print $1, $3/1024, "MB"} prints the PID, the RSS value converted from kilobytes to megabytes, and the literal string "MB".
Note: Dividing the RSS value by 1024 converts it from kilobytes to megabytes, and demonstrates how awk can perform calculations.
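If you prefer tidier numbers, awk's printf function lets you control the precision. This variant is an optional tweak rather than part of the tutorial's pipeline; it rounds the value to one decimal place:
ps -eo pid,user,rss | awk '$2 == "root" {printf "%s %.1f MB\n", $1, $3/1024}'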
Below is our updated pipeline. Run it:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}'
You should see the following output:
1 0.191406 MB
7 0.148438 MB
8 94.7695 MB
212 7.16406 MB
821 5.85938 MB
1883 1.55469 MB
1884 2.91016 MB
Step Three -- Sorting by Memory Usage
Let's add commands to our pipeline that sort the output by memory usage and label the columns. Here is the final portion of the pipeline:
sort -nrk 2 | sed '1iPID RSS(MB)'
Here is how it works:
- sort -nrk 2 sorts the text numerically (-n), reverses the result (-r), and sorts by the second column (-k 2), which effectively orders the output by memory usage, highest first.
- '1iPID RSS(MB)' uses the i command to insert the column heading "PID RSS(MB)" before the first line of the sorted output.
Note: The sort runs before the heading is inserted, so the heading stays at the top instead of being sorted along with the data. Sorting by memory usage (column 2) in descending order helps identify resource-intensive processes efficiently.
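If the sort flags are unfamiliar, here is a tiny standalone demonstration using made-up numbers:
printf '12 0.5\n3 42.0\n7 6.1\n' | sort -nrk 2
The lines come back ordered 42.0, 6.1, 0.5, regardless of the values in the first column.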
Here is the full pipeline. Go ahead and run it:
ps -eo pid,user,rss | awk '$2 == "root" {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)'
This should produce the following output:
PID RSS(MB)
8 95.1914 MB
212 7.16406 MB
821 5.88672 MB
2006 3.14453 MB
2007 2.92969 MB
2009 1.09375 MB
2008 1.08594 MB
1 0.191406 MB
7 0.148438 MB
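If you want to reuse this pipeline for accounts other than root, you can pass the username in as an awk variable instead of hard-coding it. Here is one way to do that, using a hypothetical shell variable named target_user:
target_user=root
# Compare field 2 against the awk variable instead of a hard-coded name
ps -eo pid,user,rss | awk -v user="$target_user" '$2 == user {print $1, $3/1024, "MB"}' | sort -nrk 2 | sed '1iPID RSS(MB)'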
Failed Login Counter
Let's create a pipeline that analyzes an authentication log and counts failed login attempts. Paying attention to events like these helps you protect your system and respond to potential threats. Here is the full pipeline we'll build:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Here is how it works:
- grep "Failed password": Filters the lines that contain “Failed password” from the authentication log.
- sed 's/invalid user //': Removes the “invalid user” text from the lines, if present.
- awk '{print $9}': Prints the ninth field, which is typically the username.
- sort: Sorts the usernames alphabetically.
- uniq -c: Counts the occurrences of each username.
- sort -nr: Sorts the counts in descending order.
Let's walk through it.
Step One -- Creating a Log File
Create a file named auth.log using the following command:
vim auth.log
Fill the file with the following contents:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47341]: Connection closed by authenticating user root 103.106.189.143 port 60824 [preauth]
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:12 ubuntu-lts sshd[47343]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh user= rhost=103.106.189.143 user=root
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
Feb 10 15:45:16 ubuntu-lts sshd[47343]: Connection closed by authenticating user root 103.106.189.143 port 33990 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Received disconnect from 180.101.88.228 port 11349:11: [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: Disconnected from authenticating user root 180.101.88.228 port 11349 [preauth]
Feb 10 15:45:16 ubuntu-lts sshd[47339]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.228
Step Two -- Finding Failed Passwords
Below is the first part of the pipeline, which uses grep to filter lines containing “Failed password”. Run it:
grep "Failed password" auth.log
You should see the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
You now have all the failed password entries.
Step Three -- Removing Invalid Users
Update the pipeline by adding a sed command, which removes any “invalid user” text:
grep "Failed password" auth.log | sed 's/invalid user //'
Running this pipeline should produce the following output:
Feb 10 15:45:09 ubuntu-lts sshd[47341]: Failed password for tedbell from 103.106.189.143 port 60824 ssh2
Feb 10 15:45:11 ubuntu-lts sshd[47339]: Failed password for root from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47339]: Failed password for rhomboidgoatcabin from 180.101.88.228 port 11349 ssh2
Feb 10 15:45:14 ubuntu-lts sshd[47343]: Failed password for root from 103.106.189.143 port 33990 ssh2
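The output matches the previous step exactly because this sample log happens to contain no “invalid user” entries. To see what the sed command is there for, you can feed it a hypothetical line of that form:
echo "Feb 10 15:45:20 ubuntu-lts sshd[47350]: Failed password for invalid user admin from 203.0.113.5 port 2222 ssh2" | sed 's/invalid user //'
With "invalid user " stripped, the username lands in the ninth field, the same position it occupies for valid accounts, which is what the next step relies on.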
Step Four -- Extracting Username
Update the pipeline by adding an awk command to print the username field ($9):
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}'
Running this pipeline should produce the following output:
tedbell
root
rhomboidgoatcabin
root
You are making progress! Now you have all the usernames.
Step Five -- Sorting Usernames
Update the pipeline by adding the following sort command to sort the usernames alphabetically:
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort
Running this pipeline should produce the following output:
rhomboidgoatcabin
root
root
tedbell
Now you have an alphabetical list of usernames.
Step Six -- Counting Usernames
Update the pipeline by adding a uniq command. Using uniq with the -c argument counts the occurrences of each username. Because uniq only collapses adjacent duplicate lines, it relies on the sort you added in the previous step (see the quick demonstration after the output below).
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c
Running this pipeline should produce the following output:
1 rhomboidgoatcabin
2 root
1 tedbell
Now you have a user count.
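Note that uniq only collapses adjacent duplicate lines, which is why the usernames are sorted first. A quick standalone check with placeholder names makes the difference clear:
printf 'root\nted\nroot\n' | uniq -c
printf 'root\nted\nroot\n' | sort | uniq -c
The first command reports root twice with a count of 1 each, because the two root lines are not adjacent; the second reports a single root entry with a count of 2.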
Step Seven -- Sorting Output Again
Finally, update the pipeline by adding another sort command. Using the -nr arguments, this sort orders the output by username count in descending order.
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $9}' | sort | uniq -c | sort -nr
Running the full pipeline should produce the following output:
2 root
1 rhomboidgoatcabin
1 tedbell
All done!
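As an optional extension, not part of the original pipeline, you can count attempts per source IP address instead of per username by printing field 11, which holds the client address in these log lines:
# Field 11 is the source IP address in these log lines
grep "Failed password" auth.log | sed 's/invalid user //' | awk '{print $11}' | sort | uniq -c | sort -nr
Against the sample log, each of the two addresses appears with a count of 2.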
Disk Consumption Report
Let's construct a pipeline that finds the top disk space-consuming directories and sorts them in descending order. Monitoring disk usage is important; it helps you catch space problems before they affect users. Here is the complete pipeline you will construct:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here's how it works:
- cat disk_usage.log: Outputs the content of disk_usage.log.
- awk '{print $2, $1}': Swaps the columns so that the directory path comes first.
- sed 's/G/ GB/': Replaces the unit ‘G’ with ‘ GB’, adding a space and standardizing the unit.
- sort -k2 -nr: Sorts the output based on the second column (disk space) in descending numerical order.
Step One -- Creating Disk Usage File
Begin by creating an input file called disk_usage.log, and fill it with the following content:
2.4G /usr/local/bin
5.7G /home/user
1.2G /tmp
9.8G /var/log
Step Two -- Outputting Contents of File
Begin the pipeline by using the cat command to send the contents of the disk usage file to standard output (the screen):
cat disk_usage.log
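Running this command on its own simply echoes the file back, so you should see the four lines you just created:
2.4G /usr/local/bin
5.7G /home/user
1.2G /tmp
9.8G /var/log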
Step Three -- Swapping Columns
Update the pipeline by adding an awk command to rearrange the order of the columns, displaying the directory path first:
cat disk_usage.log | awk '{print $2, $1}'
Running this pipeline should produce the following output:
/usr/local/bin 2.4G
/home/user 5.7G
/tmp 1.2G
/var/log 9.8G
Step Four -- Changing File Size Column Format
Update the pipeline by adding a sed command that replaces the unit ‘G’ with ‘ GB’:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/'
Running this pipeline should produce the following output:
/usr/local/bin 2.4 GB
/home/user 5.7 GB
/tmp 1.2 GB
/var/log 9.8 GB
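One caveat: s/G/ GB/ replaces the first capital G on each line wherever it appears, so a directory path containing a capital G (say, a hypothetical /home/Games) would be altered instead of the size. If that is a concern, you can anchor the substitution to the end of the line, where the size now sits; this is a defensive variant rather than something the sample data requires:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G$/ GB/'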
Step Five -- Sorting Output on Second Column
Update the pipeline by adding a sort command to sort the output based on the second column (disk space) in descending numerical order:
cat disk_usage.log | awk '{print $2, $1}' | sed 's/G/ GB/' | sort -k2 -nr
Here is an explanation of the sort options used above:
- -k2: This flag specifies the column (field) used for sorting. In this case, 2 indicates the second column.
- -nr:
  - -n: This flag tells sort to perform a numeric sort on the specified column (the second column in this case).
  - -r: This flag reverses the sorting order, so it sorts in descending order instead of the default ascending order.
Running this pipeline produces output sorted by disk space:
/var/log 9.8 GB
/home/user 5.7 GB
/usr/local/bin 2.4 GB
/tmp 1.2 GB
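In day-to-day work you would usually generate this data with du rather than a static file. A rough equivalent looks like the following; the directory list is just an example, the output will differ on your system, and du may report sizes in K or M as well, in which case the G substitution and numeric sort would need adjusting (sort -h is handy there):
# Summarize each directory with du, then reuse the same formatting stages
du -sh /usr/local/bin /home /tmp /var/log 2>/dev/null | awk '{print $2, $1}' | sed 's/G$/ GB/' | sort -k2 -nr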
Nice work, friend!
Conclusion
In this tutorial, you have learned how to create sophisticated pipelines using sed, awk, and other commands. Now you are ready to start experimenting, creating your own pipelines, and solving day-to-day system administration problems.
I hope this tutorial helped. Thanks for reading!