Jens Båvenmark for AWS Community Builders

Posted on May 8 • Originally published at Medium

AWS Alert Validation - EC2

#aws #cloudwatch #ec2 #alarm

For monitoring, the golden rule (at least in my opinion) is that an untested alarm is not one you can trust.

In this blog series, I will describe different ways you can test your monitoring, both metric and service monitoring, to be sure that it works as you want.

So, what will that entail? Well, you will need to break stuff—at least enough so that your alarms will trigger.

Testing monitoring can be time-consuming, especially if you want to test real alarms. Usually, you don't want a notification for every brief spike on an EC2 instance, only when the condition persists over a set time, so a real test has to sustain the load for that long. Likewise, checking that scheduled tasks work requires waiting for the schedule to run.

Before we start going through different tests, I suggest you run these tests in a non-production account, and that all monitoring (and its dependencies) is deployed with IaC. That way, you can be sure that the monitoring you have tested in your Dev account will work in your production account as well.

In this first part of this blog series, we will examine testing EC2 alarms and ensuring that your CloudWatch alarm actions are triggered correctly.

Testing CloudWatch Actions

One common thing many want to test is whether they will receive a notification when their CloudWatch alarm is triggered, whether it triggers their Lambda as expected, and whether CloudWatch can trigger the action they have specified.

We usually want to test this without triggering the real alarm, as that can be time-consuming. We can easily do this with the AWS CLI.

With the CLI, we will change the state of the CloudWatch alarm to ALARM.

aws cloudwatch set-alarm-state --alarm-name "AlarmName" --state-reason "Testing alarm" --state-value ALARM

This will trigger the action you specified on your alarm when in state ALARM.
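To confirm that CloudWatch actually attempted the action, one option is to read the alarm's history. The following is a minimal sketch, assuming the AWS CLI is configured for the account; the function name `recent_alarm_actions` and the alarm name are placeholders of mine, not part of the original setup.

```shell
# Hypothetical helper: list the most recent action entries for an alarm.
# "$1" is the alarm name (placeholder, replace with your own).
recent_alarm_actions() {
    alarm_name="$1"
    aws cloudwatch describe-alarm-history \
        --alarm-name "$alarm_name" \
        --history-item-type Action \
        --max-records 5 \
        --query 'AlarmHistoryItems[].[Timestamp, HistorySummary]' \
        --output text
}

# Usage (not run automatically):
# recent_alarm_actions "AlarmName"
```

Each returned line should show a timestamp and a summary of the action CloudWatch tried to run, which helps when the notification itself never arrives.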

This also lets you test the action you set for the OK state. A state set this way is temporary: the next time CloudWatch evaluates the alarm against its metric and threshold, the alarm returns to its actual state. Since the metric should be at a healthy level compared to the threshold, that state is OK, and the OK action is triggered.

If you don’t want to wait for the next evaluation, you can also run the same CLI command again, but with the OK state.

aws cloudwatch set-alarm-state --alarm-name "AlarmName" --state-reason "Testing alarm" --state-value OK

If you want to test for actions for missing data (insufficient data), then set the state to INSUFFICIENT_DATA.

aws cloudwatch set-alarm-state --alarm-name "AlarmName" --state-reason "Testing alarm" --state-value INSUFFICIENT_DATA
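The three commands differ only in the state value, so they can be wrapped in one small function. This is a sketch under the assumption that the AWS CLI is configured; `force_alarm_state` is a hypothetical name, and the state read back may already have reverted if CloudWatch evaluated the alarm in between.

```shell
# Hypothetical helper: force an alarm into a state, then read the state back.
# "$1" is the alarm name, "$2" is ALARM, OK, or INSUFFICIENT_DATA.
force_alarm_state() {
    alarm_name="$1"
    state="$2"
    # Reject anything that is not a valid CloudWatch alarm state.
    case "$state" in
        ALARM|OK|INSUFFICIENT_DATA) ;;
        *) echo "unknown state: $state" >&2; return 1 ;;
    esac
    aws cloudwatch set-alarm-state \
        --alarm-name "$alarm_name" \
        --state-reason "Testing alarm" \
        --state-value "$state" || return 1
    # Print the state CloudWatch reports right after the change.
    aws cloudwatch describe-alarms \
        --alarm-names "$alarm_name" \
        --query 'MetricAlarms[0].StateValue' \
        --output text
}

# Usage (not run automatically):
# force_alarm_state "AlarmName" ALARM
# force_alarm_state "AlarmName" OK
```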

Testing EC2 Alarms

We will look at how to test the most common EC2 alarms by driving up the monitored metric with applications or commands that mimic usage on the instance. All tests are run on Linux instances.

The application we will use is called stress-ng.

Installing stress-ng

To install stress-ng, run the command for your distribution.

Amazon Linux/RHEL/CentOS/Fedora/Rocky

sudo dnf install stress-ng

Ubuntu/Debian

sudo apt install stress-ng

You don't need to do anything more than install the application. We will look into the commands when testing the different alarms.

Before you start testing the alarms, I suggest temporarily modifying the thresholds on your alarms to make them easier to trigger. If an alarm triggers at the adjusted threshold, it will trigger at the real threshold as well.

All stress-ng commands here run until you stop them, so remember to cancel them with Ctrl+C once your alarm triggers.
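If you would rather not cancel by hand, stress-ng has a `--timeout` option that stops a run after a set time. A minimal sketch, where `run_bounded` is a hypothetical wrapper name and the duration should be longer than your alarm's evaluation window:

```shell
# Hypothetical wrapper: run stress-ng with a time limit so it stops by itself.
# "$1" is the duration (for example 10m); the remaining arguments are passed
# to stress-ng unchanged.
run_bounded() {
    duration="$1"
    shift
    sudo stress-ng "$@" --timeout "$duration"
}

# Usage (not run automatically):
# run_bounded 10m --cpu 2 --cpu-load 75
```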

CPU

To test CPU alarms, we will mimic CPU load with the stress-ng application. In these examples, we will trigger a CPU usage alarm by running all cores on the EC2 instance at a set percentage.

sudo stress-ng --cpu {Number of cpus} --cpu-load {Load in percentage per cpu}

All tests are run on a burstable EC2 instance with two cores.

CPU Usage

To test CPU usage, I have lowered the alarm's threshold to 50%, so I will run the test at 75%.

sudo stress-ng --cpu 2 --cpu-load 75

CPU Load

To test CPU Load, we will run the test with more CPU workers than the instance has cores, so that processes queue for CPU time and the load average rises.

sudo stress-ng --cpu 4 --cpu-load 100
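The worker counts above are hardcoded for a two-core instance. On other instance sizes, the count can be derived from `nproc`. This is a sketch; `cpu_workers` is a hypothetical helper name, and the multiplier of 2 simply mirrors the "twice the cores" choice used for the load test.

```shell
# Hypothetical helper: print the core count times an optional multiplier.
# With no argument it prints the core count itself.
cpu_workers() {
    multiplier="${1:-1}"
    cores=$(nproc)
    echo $(( cores * multiplier ))
}

# Usage (not run automatically):
# sudo stress-ng --cpu "$(cpu_workers)" --cpu-load 75      # CPU usage test
# sudo stress-ng --cpu "$(cpu_workers 2)" --cpu-load 100   # CPU load test
```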

CPU Credits

If you have a burstable instance and want to test the CPU Credits alarm, run the CPU at a high load so that the credit balance drains. Remember to raise the alarm's threshold to limit the time you will need to wait until it triggers.

sudo stress-ng --cpu 2 --cpu-load 100
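While the test runs, it can help to watch the credit balance drop so you know roughly when to expect the alarm. The sketch below assumes the alarm is on the `CPUCreditBalance` metric in the `AWS/EC2` namespace; `credit_balance`, the instance ID, and the timestamps are placeholders.

```shell
# Hypothetical helper: print recent CPUCreditBalance datapoints for an instance.
# "$1" is the instance ID, "$2" and "$3" are ISO 8601 start and end times.
credit_balance() {
    instance_id="$1"
    start_time="$2"
    end_time="$3"
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 \
        --metric-name CPUCreditBalance \
        --dimensions "Name=InstanceId,Value=$instance_id" \
        --start-time "$start_time" \
        --end-time "$end_time" \
        --period 300 \
        --statistics Average \
        --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average]' \
        --output text
}

# Usage (not run automatically, values are placeholders):
# credit_balance "i-0123456789abcdef0" "2024-05-08T10:00:00Z" "2024-05-08T11:00:00Z"
```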

Memory

To test memory alarms, we will mimic memory usage with the stress-ng application. In this example, we will use the vm flag to make the workers consume a set percentage of the available memory.

sudo stress-ng --vm {Number of workers to use memory} --vm-bytes {Bytes or percent of available memory} --vm-keep

Using more memory than the instance has will result in OOM (Out Of Memory).

Memory Usage

To test Memory Usage, we will run two workers using 80% of the available memory.

sudo stress-ng --vm 2 --vm-bytes 80% --vm-keep
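To see on the instance itself whether usage has passed your threshold, you can compute a rough percentage from /proc/meminfo in a second terminal. This is a sketch; `mem_used_percent` is a hypothetical helper, and its formula (total minus available) is an assumption that may differ slightly from what your monitoring agent reports.

```shell
# Hypothetical helper: print used memory as a whole percentage.
# Reads /proc/meminfo by default; a different file can be passed as "$1".
mem_used_percent() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ {total = $2}
         /^MemAvailable:/ {avail = $2}
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' "$meminfo"
}

# Usage (not run automatically):
# mem_used_percent
```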

Swap Usage

To test Swap Usage, we will run one worker using 150% of total memory to get swap to be used quickly. The command will retrieve the total memory and multiply it by 1.5. Remember that this can cause OOM issues.

sudo stress-ng --vm 1 --vm-bytes $(awk '/MemTotal/ {print int($2 * 1.5) "k"}' /proc/meminfo) --vm-keep
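This test only makes sense if the instance has swap configured, which is not a given on every AMI. A quick check before running it could look like the sketch below; `has_swap` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the system reports a non-zero SwapTotal.
# Reads /proc/meminfo by default; a different file can be passed as "$1".
has_swap() {
    meminfo="${1:-/proc/meminfo}"
    swap_kb=$(awk '/^SwapTotal:/ {print $2}' "$meminfo")
    [ "${swap_kb:-0}" -gt 0 ]
}

# Usage (not run automatically):
# if has_swap; then echo "swap is configured"; else echo "no swap, expect OOM instead"; fi
```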

Disk

To test disk alarms, we will use fallocate or dd to create a dummy file of a specific size.

sudo fallocate -l {Size of file} {Path to file}

If fallocate doesn't work on your Linux distribution, you can use dd instead.

sudo dd if=/dev/zero of={Path to file} bs={Size of block} count={Number of blocks}

Disk Usage

To test disk usage, we will create a dummy file of a specific size on the disk you are monitoring, raising the disk usage above the threshold you have set for your alarm.

sudo fallocate -l 2G /var/filldisk.img

If fallocate doesn't work, use dd instead. This writes 2048 blocks of 1 MiB, which gives the same 2 GiB file.

sudo dd if=/dev/zero of=/var/filldisk.img bs=1M count=2048
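Rather than guessing a file size, you can estimate how much to add from the filesystem's total and used space. The sketch below is only an approximation, since reserved blocks and the way your agent calculates usage can shift the result, so aim a few percent above the alarm threshold. `fill_size_kb` is a hypothetical helper name.

```shell
# Hypothetical helper: print the size in KiB of the dummy file needed to bring
# a filesystem to a target usage percentage.
# "$1" is total KiB, "$2" is used KiB, "$3" is the target percentage.
fill_size_kb() {
    total_kb="$1"
    used_kb="$2"
    target_pct="$3"
    needed=$(( total_kb * target_pct / 100 - used_kb ))
    # Nothing to add if usage is already at or above the target.
    [ "$needed" -gt 0 ] || needed=0
    echo "$needed"
}

# Usage (not run automatically): fill the filesystem holding /var to about 85%.
# set -- $(df -Pk /var | awk 'NR==2 {print $2, $3}')
# sudo fallocate -l "$(fill_size_kb "$1" "$2" 85)K" /var/filldisk.img
# Remove the file once the alarm has triggered:
# sudo rm /var/filldisk.img
```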

Final Thoughts

Testing that your alarms work as you expect can save you a lot of headaches in the future. The tests we have done here are not unique to AWS since all are done with Linux tools.

This was the first post in this series, and in the upcoming posts, we will look at testing alarms for other AWS resources.
