<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Paul Singman</title>
    <description>The latest articles on Forem by Paul Singman (@peacing).</description>
    <link>https://forem.com/peacing</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F451418%2F2b4d511f-6396-49d6-a616-cac6c03f8876.png</url>
      <title>Forem: Paul Singman</title>
      <link>https://forem.com/peacing</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/peacing"/>
    <language>en</language>
    <item>
      <title>Your AWS Lambda Function Failed, What Now?</title>
      <dc:creator>Paul Singman</dc:creator>
      <pubDate>Tue, 10 Nov 2020 23:25:27 +0000</pubDate>
      <link>https://forem.com/aws-builders/your-aws-lambda-function-failed-what-now-4ik</link>
      <guid>https://forem.com/aws-builders/your-aws-lambda-function-failed-what-now-4ik</guid>
      <description>&lt;p&gt;On the analytics team at Equinox Media, we invoke thousands of Lambda functions daily to perform a variety of data processing tasks. Examples range from the mundane shuffling of files around on S3, to the more stimulating generation of real-time fitness content recommendations on the Variis app.&lt;/p&gt;

&lt;p&gt;Because of our reliance on Lambda, it’s critical to diagnose issues as quickly as possible.&lt;/p&gt;

&lt;p&gt;Here’s a diagram of the process we’ve set up to do so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu1getupg3biao9ls26gl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu1getupg3biao9ls26gl.png" alt="Serverless error handling architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are also a user of Lambda, what does your error alerting look like? If you find yourself struggling to figure out why a failure occurred, or worse — unaware one happened at all — we hope sharing our solution will help you become a more effective serverless practitioner!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step #1: Create An Error Metric-Based CloudWatch Alarm
&lt;/h2&gt;

&lt;p&gt;After every single run of a Lambda function, AWS sends a few metrics to the CloudWatch service by default. Per &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Invocation metrics are binary indicators of the outcome of an invocation. For example, if the function returns an error, Lambda sends the Errors metric with a value of 1. To get a count of the number of function errors that occurred each minute, view the Sum of the Errors metric with a period of one minute.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make us aware of any failures, we create a CloudWatch Alarm based on the Errors metric for a specific Lambda resource. The exact threshold of the alarm depends on how frequently a job runs and its criticality, but most commonly this value is set to trigger upon three* failures in a five-minute period.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;One for the original failure, plus two automatic retries.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For some, generic alerting of this sort is sufficient, and notifications are simply directed to a work email or perhaps a PagerDuty Service tied to an on-call schedule.&lt;/p&gt;

&lt;p&gt;However, we know in this scenario valuable information about the failed invocation is being ignored. To be most efficient, we strive to automate more of the debugging process.&lt;/p&gt;

&lt;p&gt;Our journey, eager Lambda user, is only beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step #2: With A Little Help From An SNS Topic + Lambda Friends
&lt;/h2&gt;

&lt;p&gt;Instead of sending straight to an alerting service, we send alarm notifications to a centralized SNS topic that handles failure events for all Lambda functions across our cloud data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fblrmito62f8ehk2gn4q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fblrmito62f8ehk2gn4q4.png" alt="Configuring a CloudWatch Alarm to send to SNS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What happens to an Alarm record sent to the topic? It triggers another Lambda function of course!&lt;/p&gt;

&lt;p&gt;We call this special Lambda function the Alerting Lambda and it performs three main steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sends a message to Slack with details about the failure.&lt;/li&gt;
&lt;li&gt;Creates an incident in PagerDuty, also populated with helpful details.&lt;/li&gt;
&lt;li&gt;Queries CloudWatch Logs for log messages related to the failure, and if found, sends to Slack.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two steps are relatively straightforward so we’ll quickly cover how they work before diving into the third.&lt;/p&gt;

&lt;p&gt;If you inspect the payload sent from CloudWatch Alarms to SNS, you’ll see it contains data related to the alarm itself like the name, trigger threshold, old and current alarm state, and relevant CloudWatch Metric.&lt;/p&gt;

&lt;p&gt;The Alerting Lambda takes this data and parses it into a super-helpful Slack message (via a webhook) that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8mxxnmt48ytgsep4zgsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8mxxnmt48ytgsep4zgsl.png" alt="Slack message from #data-alerts channel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, using the &lt;code&gt;pypd&lt;/code&gt; package we create a PagerDuty event populated with helpful custom details and an AWS console link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi0l5dlshfit4l5iizb5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi0l5dlshfit4l5iizb5c.png" alt="PagerDuty Incident with Alarm data populated as Custom Events"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both of these notifications help us instantly determine if an alert is legitimate or perhaps falls more into the “false alarm” category. When managing 100+ tasks, this provides a quality-of-life improvement for everyone on the team.&lt;/p&gt;

&lt;p&gt;The third step of the Alerting Lambda was recently implemented (inspired by this post on &lt;a href="https://towardsdatascience.com/why-you-should-never-ever-print-in-a-lambda-function-f997d684a705" rel="noopener noreferrer"&gt;effective Lambda logging&lt;/a&gt;) and has proven to be a beloved shortcut for Lambda debugging.&lt;/p&gt;

&lt;p&gt;The output is a message in Slack containing log messages from recent Lambda failures that looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frex0pvaisy0ewbockjqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frex0pvaisy0ewbockjqm.png" alt="CloudWatch logs automatically appear in Slack!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does this work exactly?&lt;/p&gt;

&lt;p&gt;The first step is to parse out the Lambda function name from the SNS event. This allows us to know which CloudWatch Log Group to query against for recent errors, shown in the code snippet below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
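&lt;p&gt;The gist embedded here did not survive. Below is a minimal sketch of what such code might look like: it assumes the alarm’s metric dimension carries the function name and that the log group follows the default &lt;code&gt;/aws/lambda/&lt;/code&gt; naming convention; the helper names are mine, not the original gist’s.&lt;/p&gt;

```python
# Hypothetical reconstruction, not the article's original gist.
import json
import time

def get_function_name(sns_event):
    # The SNS message body is the JSON-serialized CloudWatch Alarm record;
    # the alarmed Lambda's name rides along in the metric dimensions.
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return message["Trigger"]["Dimensions"][0]["value"]

def query_recent_errors(function_name, minutes=10):
    # Imported here so the parsing helper above stays dependency-free.
    import boto3
    logs = boto3.client("logs", region_name="us-east-1")
    end = int(time.time())
    query = logs.start_query(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=end - minutes * 60,
        endTime=end,
        queryString=(
            "fields @timestamp, @message, @requestId"
            " | filter @message like /ERROR/"
            " | sort @timestamp desc | limit 20"
        ),
    )
    # Insights queries are asynchronous: poll until the run completes.
    while True:
        results = logs.get_query_results(queryId=query["queryId"])
        if results["status"] == "Complete":
            return results["results"]
        time.sleep(1)
```

&lt;p&gt;The polling loop matters: &lt;code&gt;start_query&lt;/code&gt; only kicks the query off, and the results arrive asynchronously via &lt;code&gt;get_query_results&lt;/code&gt;.&lt;/p&gt;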


&lt;p&gt;And after parsing the query response for a &lt;code&gt;requestId&lt;/code&gt;, we run a second Insights query filtered on that &lt;code&gt;requestId&lt;/code&gt;, re-format the log messages returned in the response, and send the results to Slack.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
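&lt;p&gt;This gist is also gone; here is a sketch of the same idea under the same assumptions as above, with the Slack webhook URL assumed to live in an environment variable:&lt;/p&gt;

```python
# Hypothetical reconstruction; SLACK_WEBHOOK_URL is an assumed env var.
import json
import os
import time
import urllib.request

def query_request_logs(function_name, request_id, minutes=10):
    import boto3  # lazy import keeps the formatter below AWS-free
    logs = boto3.client("logs", region_name="us-east-1")
    end = int(time.time())
    query = logs.start_query(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=end - minutes * 60,
        endTime=end,
        queryString=(
            "fields @timestamp, @message"
            f' | filter @requestId = "{request_id}"'
            " | sort @timestamp asc"
        ),
    )
    while True:
        results = logs.get_query_results(queryId=query["queryId"])
        if results["status"] == "Complete":
            return results["results"]
        time.sleep(1)

def format_log_lines(results):
    # Each Insights result row is a list of field/value pairs.
    lines = []
    for row in results:
        fields = {col["field"]: col["value"] for col in row}
        lines.append(f"{fields.get('@timestamp', '')}  {fields.get('@message', '')}")
    return "\n".join(lines)

def send_to_slack(text):
    # Post the re-formatted log messages to a Slack incoming webhook.
    payload = json.dumps({"text": text}).encode()
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```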


&lt;p&gt;Place code like this in your Alerting Lambda and before you know it, you’ll be getting helpful log messages sent to Slack too!&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Though this solution has proven effective for our needs, there is room for improvement. Notably, while we query CloudWatch Logs when a Lambda errors, we don’t handle other Lambda failure modes (like timeouts or throttling).&lt;/p&gt;

&lt;p&gt;The idea to run an Insights query when a Lambda fails didn’t come to us in a “Eureka!” moment of inspiration… but rather from observing consistent, predictable actions we perform that could be automated. Maintaining an awareness of these situations will serve any developer well in his or her career.&lt;/p&gt;

&lt;p&gt;Another lesson for those getting started with serverless technologies is that you cannot be afraid of managing many, many cloud resources. Critically, the marginal cost of adding an additional Lambda function or SQS queue to your architecture should be near-zero.&lt;/p&gt;

&lt;p&gt;The idea of spinning up an additional SNS topic and Lambda for error handling was a turn-off to some. We hope we’ve shown the benefits of growing past that limiting mindset. If you want to read more on this topic, check out our post on &lt;a href="https://medium.com/whispering-data/tackling-fragmentation-in-serverless-data-pipelines-b4027f506ee5" rel="noopener noreferrer"&gt;painlessly deploying Lambda&lt;/a&gt; functions.&lt;/p&gt;

&lt;p&gt;One final thought: you may be wondering, if all other Lambdas are monitored by the Alerting Lambda, what then monitors the Alerting Lambda function?&lt;/p&gt;

&lt;p&gt;Hmmm.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Write Meaningful Tests for AWS Lambda Functions</title>
      <dc:creator>Paul Singman</dc:creator>
      <pubDate>Tue, 15 Sep 2020 03:07:13 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-i-write-meaningful-tests-for-aws-lambda-functions-2ifh</link>
      <guid>https://forem.com/aws-builders/how-i-write-meaningful-tests-for-aws-lambda-functions-2ifh</guid>
      <description>&lt;h4&gt;
  
  
  A Harsh Truth
&lt;/h4&gt;

&lt;p&gt;If you are going to write meaningless unit tests that are more likely to mask errors than expose them, you are better off skipping the exercise altogether.&lt;/p&gt;

&lt;p&gt;There, I said it.&lt;/p&gt;

&lt;p&gt;Your time is precious and could be spent on better things than achieving a hollow coverage percentage.&lt;/p&gt;

&lt;p&gt;Effective testing of code has long been a challenging problem in programming, and newer tools like AWS Lambda seem to bring out the worst in developers when it comes to writing tests.&lt;/p&gt;

&lt;p&gt;I think the main reason for this is that it’s more difficult (or at least less intuitive) to mirror the Lambda production environment locally. And as a result, some developers abstain from testing entirely.&lt;/p&gt;

&lt;p&gt;I know because I’ve done it myself, even for projects in production. Instead, testing was done integration-style only after code was already deployed to the cloud.&lt;/p&gt;

&lt;p&gt;This is extremely manual and wastes time in the long run. Another approach I’ve seen results in tests that look something like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
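&lt;p&gt;(The gist embedded here was lost; below is a plausible stand-in for the kind of file I mean, with invented names.)&lt;/p&gt;

```python
# Plausible stand-in for the lost gist: a "test" that exercises nothing.
def lambda_handler(event, context):
    return {"statusCode": 200}

def test_lambda_handler():
    # Runs the handler, checks nothing meaningful about its behavior,
    # and boosts the coverage number all the same.
    result = lambda_handler({}, None)
    assert result is not None
```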


&lt;p&gt;This is the unmistakable sign of an engineering team with a test coverage requirement but a lack of accountability. And no explanation is needed that the above is a no-no.&lt;/p&gt;

&lt;p&gt;So, how do we go about transforming the sad &lt;code&gt;test_lambda_function.py&lt;/code&gt; file above into something meaningful?&lt;/p&gt;

&lt;p&gt;Before we can dive right into testing our Lambda code, there are a couple of hurdles in the way. We’ll cover each of these individually and determine how to best handle them. Once dealt with, we are then free to test Lambdas to our heart’s content!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I’ll be including small snippets of code throughout the article for clarity. But at the end there will be a full working code example to reference.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hurdle #1: The Lambda Trigger Event
&lt;/h2&gt;

&lt;p&gt;Every Lambda function gets invoked in response to a pre-defined trigger that passes specific &lt;code&gt;event&lt;/code&gt; data into the default &lt;code&gt;lambda_handler()&lt;/code&gt; method. And your first task for effectively testing a Lambda function is to create a realistic input event to test with.&lt;/p&gt;

&lt;p&gt;The format of this event depends on the type of trigger. As of the time of writing there are 16 distinct AWS services that can act as the invocation trigger for Lambda.&lt;/p&gt;

&lt;p&gt;Below is a code snippet with several examples of inputs that I most commonly use:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
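&lt;p&gt;The gist embed here did not survive, so below is a hand-rolled stand-in: the event shapes follow the AWS sample-event documentation, but every ID, ARN, bucket, and payload value is invented.&lt;/p&gt;

```python
# Illustrative sample trigger events; all field values are made up.
import json

SQS_EVENT = {
    "Records": [{
        "messageId": "059f36b4-87a3-44ab-83d2-661975830a7d",
        "body": json.dumps({"user_id": 123}),
        "eventSource": "aws:sqs",
        "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
    }]
}

S3_PUT_EVENT = {
    "Records": [{
        "eventSource": "aws:s3",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-example-bucket"},
            "object": {"key": "uploads/data.csv"},
        },
    }]
}

KINESIS_EVENT = {
    "Records": [{
        "eventSource": "aws:kinesis",
        "kinesis": {
            "partitionKey": "partitionKey-03",
            # base64 encoding of '{"user_id": 123}'
            "data": "eyJ1c2VyX2lkIjogMTIzfQ==",
        },
    }]
}
```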


&lt;p&gt;The full list of sample input events can be found in the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;. Alternatively, you can also print the &lt;code&gt;event&lt;/code&gt; variable in your &lt;code&gt;lambda_handler&lt;/code&gt; code after deploying and view the payload in CloudWatch Logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjy37myhuigt4lhwe3d2z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjy37myhuigt4lhwe3d2z.jpeg" alt="Lambda event triggered from Kinesis in CloudWatch Logs example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have that example, simply hardcode it in your test file as shown above and we’re off to a fantastic start!&lt;/p&gt;

&lt;p&gt;Next up…&lt;/p&gt;

&lt;h2&gt;
  
  
  Hurdle #2: AWS Service Interactions
&lt;/h2&gt;

&lt;p&gt;Almost inevitably, a Lambda function interacts with other AWS services. Maybe you are writing data to a DynamoDB table. Or posting a message to an SNS topic. Or simply sending a metric to CloudWatch. Or a combination of all three!&lt;/p&gt;

&lt;p&gt;When testing it is not a good idea to send data or alter actual AWS resources used in production. To get around this problem, one approach is to set up and later tear down separate test resources.&lt;/p&gt;

&lt;p&gt;A cleaner approach, though, is to mock interactions with AWS services. Since this is such a common problem, a package has been developed to solve it, and what’s better is it does so in a super elegant way.&lt;/p&gt;

&lt;p&gt;Its name is moto (a portmanteau of mock &amp;amp; boto) and its elegance is derived from two main features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It patches and mocks boto clients in tests automatically.&lt;/li&gt;
&lt;li&gt;It maintains the state of pseudo AWS resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What does this look like? All that’s needed is some decorator magic and a little set up!&lt;/p&gt;

&lt;p&gt;Say we read data from S3 in our Lambda. Instead of creating and populating a test bucket in S3, we can use moto to create a fake S3 bucket — one that looks and behaves exactly like a real one — without actually touching AWS.&lt;/p&gt;

&lt;p&gt;And the best part is we can do this using standard boto3 syntax, as seen in the example below when calling the &lt;code&gt;create_bucket&lt;/code&gt; and &lt;code&gt;put_object&lt;/code&gt; methods:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Similarly, if we write data to DynamoDB, we could set up our test by creating a fake Dynamo table first:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;It requires a bit of trust, but if the test passes, you can be confident your code will work in production, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, but not everything is covered by moto…
&lt;/h2&gt;

&lt;p&gt;Yes, it is true that moto doesn’t maintain parity with every AWS API. For example, if your Lambda function interacts with AWS Glue, odds are moto will leave you high and dry since it is only 5% implemented for the Glue service.&lt;/p&gt;

&lt;p&gt;This is where we need to roll up our sleeves and do the dirty work of mocking calls ourselves by &lt;a href="https://stackoverflow.com/questions/5626193/what-is-monkey-patching" rel="noopener noreferrer"&gt;monkeypatching&lt;/a&gt;. This is true whether we’re talking about AWS-related calls or any external service your Lambda may touch, such as posting a message to Slack.&lt;/p&gt;

&lt;p&gt;Admittedly the terminology and concepts around this get dense, so it is best explained via an example. Let’s stick with AWS Glue and say we have a burning desire to list our account’s Glue crawlers with the following code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;session = boto3.session.Session()
glue_client = session.client("glue", region_name='us-east-1')
glue_client.list_crawlers()['CrawlerNames']
# {"CrawlerNames": ["crawler1", "crawler2", ...]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If we don’t want the success or failure of our illustrious test to depend on the &lt;code&gt;list_crawlers()&lt;/code&gt; response, we can hardcode a return value like so:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;By leveraging the &lt;code&gt;setattr&lt;/code&gt; method of the pytest monkeypatch fixture, we allow glue client code in a &lt;code&gt;lambda_handler&lt;/code&gt; to access, dynamically at runtime, the hardcoded &lt;code&gt;list_crawlers&lt;/code&gt; response from the MockBotoSession class.&lt;/p&gt;

&lt;p&gt;What’s nice about this solution is that it is flexible enough to work for any boto client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tidying Up with Fixtures
&lt;/h3&gt;

&lt;p&gt;We’ve already covered how to deal with event inputs and external dependencies in Lambda tests. Another tip I’d like to share involves the use of pytest &lt;a href="https://docs.pytest.org/en/stable/fixture.html" rel="noopener noreferrer"&gt;fixtures&lt;/a&gt; to maintain an organized testing file.&lt;/p&gt;

&lt;p&gt;The code examples thus far have shown set up code directly in the &lt;code&gt;test_lambda_handler&lt;/code&gt; method itself. A better pattern, however, is to create a separate &lt;code&gt;set_up&lt;/code&gt; function as a pytest fixture that gets passed into any test method that needs to use it.&lt;/p&gt;

&lt;p&gt;For the final code snippet, let’s show an example of this fixture structure using the &lt;code&gt;@pytest.fixture&lt;/code&gt; decorator and combine everything covered:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;We’ve come a long way from the empty test file at the beginning of the article, no?&lt;/p&gt;

&lt;p&gt;As a reminder, this code tests a Lambda function that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Triggers off an SQS message event&lt;/li&gt;
&lt;li&gt;Writes the message to a DynamoDB table&lt;/li&gt;
&lt;li&gt;Lists available Glue crawlers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By employing these strategies, you should feel confident testing a Lambda triggered from any event type, and one that interacts with any AWS service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;If you’ve been struggling to test your Lambda functions, my hope is this article showed you a few tips to help you do so in a useful way.&lt;/p&gt;

&lt;p&gt;While we spent a lot of time on common issues and how you &lt;em&gt;shouldn’t&lt;/em&gt; test a Lambda, we didn’t get a chance to cover the opposite, yet equally important, aspect of this topic: namely, what &lt;em&gt;should&lt;/em&gt; you test, and how can you structure your Lambda function’s code to make it easier to do so.&lt;/p&gt;

&lt;p&gt;I look forward to hearing from you and discussing how you test Lambda functions!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you to Vamshi Rapolu for inspiration and feedback on this article.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why You Should Never, Ever print() in a Lambda Function</title>
      <dc:creator>Paul Singman</dc:creator>
      <pubDate>Wed, 12 Aug 2020 15:23:24 +0000</pubDate>
      <link>https://forem.com/aws-builders/why-you-should-never-ever-print-in-a-lambda-function-3i37</link>
      <guid>https://forem.com/aws-builders/why-you-should-never-ever-print-in-a-lambda-function-3i37</guid>
      <description>&lt;h4&gt;
  
  
  How to spot an AWS Lambda novice, just like &lt;em&gt;that&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;a href="https://towardsdatascience.com/why-you-should-never-ever-print-in-a-lambda-function-f997d684a705" rel="noopener noreferrer"&gt;This article&lt;/a&gt; was originally published in the Towards Data Science Medium Publication on Aug. 4th.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Tale of Two Lambda Users
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tale #1: The Amateur
&lt;/h3&gt;

&lt;p&gt;One moment everything is fine, then … Bam! Your Lambda function raises an exception, you get alerted and everything changes instantly.&lt;/p&gt;

&lt;p&gt;Critical systems could be impacted, so it’s important that you understand the root cause quickly.&lt;/p&gt;

&lt;p&gt;Now, this isn’t some low-volume Lambda where you can scroll through CloudWatch Logs and find the error in purely manual fashion. So instead you head to CloudWatch Insights to run a query on the log group, filtering for the error:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmnibppxh9sc3hgmoishu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmnibppxh9sc3hgmoishu.jpeg" alt="CloudWatch Insights query filtering on errors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks like we found our error! While helpful, unfortunately it omits any other log messages associated with the failed invocation.&lt;/p&gt;

&lt;p&gt;With just the information shown above maybe — just maybe — you can figure out what the root cause is. But more likely than not, you won’t be confident.&lt;/p&gt;

&lt;p&gt;Do you tell people you aren’t sure what happened and that you’ll spend more time investigating if the issue happens again? As if!&lt;/p&gt;

&lt;p&gt;So instead you head to the CloudWatch Logs Log Stream, filter records down to the relevant timestamp, and begin manually scrolling through log messages to find the full details on the specific errored invocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution Time:&lt;/strong&gt; 1–2 hours&lt;br&gt;
&lt;strong&gt;Lambda Enjoyment Usage Index:&lt;/strong&gt; Low&lt;/p&gt;
&lt;h3&gt;
  
  
  Tale #2: The Professional
&lt;/h3&gt;

&lt;p&gt;Same Lambda function, same error. But this time the logging and error handling are improved. As the title implies, this involves replacing &lt;code&gt;print()&lt;/code&gt; statements with something a ‘lil better.&lt;/p&gt;

&lt;p&gt;What is that something and what does this Lambda function look like anyway? Let’s first go through what error debugging looks like for the professional, then take a look at code. Fair?&lt;/p&gt;

&lt;p&gt;Again, we start with an Insights query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fckr4fucyraz5l5hdf271.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fckr4fucyraz5l5hdf271.jpeg" alt="CloudWatch Insights query, again, filtering on errors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And again we find the error in the logs, but unlike last time, the log event now includes the &lt;code&gt;@requestId&lt;/code&gt; from the Lambda invocation. What this allows us to do is run a second Insights query, filtered on that requestId to see the full set of logs for the exact invocation we are interested in!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpg4246nbeilfedgjqu7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpg4246nbeilfedgjqu7g.png" alt="CloudWatch Insights query filtering on @requestId"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we get 5 results, which together paint the full crime scene picture of what happened for this request. Most helpfully, we immediately see the exact input passed to trigger the Lambda. From this we can either deduce what happened mentally, or run the Lambda code locally with the exact same input event to debug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution Time:&lt;/strong&gt; 10–20 minutes&lt;br&gt;
&lt;strong&gt;Lambda Enjoyment Usage Index:&lt;/strong&gt; High!&lt;/p&gt;
&lt;h2&gt;
  
  
  The Code Reveal
&lt;/h2&gt;

&lt;p&gt;I’d like to imagine my readers are on the edge of their seats, begging to know the difference between the Amateur and the Pro’s code from the tale above.&lt;/p&gt;

&lt;p&gt;Whether that’s true or not, here is the Amateur Lambda:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
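&lt;p&gt;The gist embed has not survived; this is a reconstruction from the details in the text (the &lt;code&gt;print()&lt;/code&gt; calls and the &lt;code&gt;artist&lt;/code&gt; key come from the article, the rest is invented):&lt;/p&gt;

```python
# Reconstruction of the "Amateur" handler; everything beyond print()
# and the 'artist' key lookup is invented for illustration.
import json

def lambda_handler(event, context):
    print(f"Received event: {json.dumps(event)}")
    artist = event['artist']  # raises KeyError when the key is missing
    print(f"Looking up paintings by {artist}")
    return {"statusCode": 200, "body": json.dumps({"artist": artist})}
```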


&lt;p&gt;It is, of course, intentionally simple for illustrative purposes. Errors were generated by simply passing an event dictionary without &lt;code&gt;artist&lt;/code&gt; as a key, for example: &lt;code&gt;event = {'artisans': 'Leonardo Da Vinci'}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now for the Professional Lambda, which performs the same basic function but improves upon the &lt;code&gt;print()&lt;/code&gt; statements and error handling.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
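&lt;p&gt;Again a reconstruction rather than the original gist: the &lt;code&gt;logging&lt;/code&gt; setup, &lt;code&gt;sys.exc_info&lt;/code&gt; handling, and json-dumped traceback follow the article’s description, while the handler body itself is invented.&lt;/p&gt;

```python
# Reconstruction of the "Professional" handler from the description.
import json
import logging
import sys
import traceback

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # In the Lambda runtime, the root logger's formatter stamps every
    # message with the aws_request_id.
    logger.info(json.dumps({"event": event}))
    try:
        artist = event["artist"]
        logger.info(f"Looking up paintings by {artist}")
        return {"statusCode": 200, "body": json.dumps({"artist": artist})}
    except Exception:
        exc_type, exc_value, exc_tb = sys.exc_info()
        # json-dump the pieces so the exception name, message, and
        # stacktrace land in one log line with parseable fields.
        logger.error(json.dumps({
            "errorType": exc_type.__name__,
            "errorMessage": str(exc_value),
            "stackTrace": traceback.format_exception(exc_type, exc_value, exc_tb),
        }))
        raise
```

&lt;p&gt;Note the re-raise at the end: the invocation still fails, so the Errors metric (and any alarm built on it) behaves exactly as before.&lt;/p&gt;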


&lt;p&gt;Interesting! So why are we using the logging module and formatting exception tracebacks?&lt;/p&gt;

&lt;h3&gt;
  
  
  Lovely Lambda Logging
&lt;/h3&gt;

&lt;p&gt;First, the Lambda runtime environment for python includes a &lt;a href="https://towardsdatascience.com/why-you-should-never-ever-print-in-a-lambda-function-f997d684a705" rel="noopener noreferrer"&gt;customized logger&lt;/a&gt; that it is smart to take advantage of.&lt;/p&gt;

&lt;p&gt;It features &lt;a href="https://www.denialof.services/lambda/" rel="noopener noreferrer"&gt;a formatter&lt;/a&gt; that, by default, includes the &lt;code&gt;aws_request_id&lt;/code&gt; in every log message. This is the critical feature that allows for an Insights query, like the one shown above that filters on an individual &lt;code&gt;@requestId&lt;/code&gt;, to show the full details of one Lambda invocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exceptional Exception Handling
&lt;/h3&gt;

&lt;p&gt;Next, you are probably noticing the fancy error handling. Although intimidating-looking, using &lt;code&gt;sys.exc_info&lt;/code&gt; is the &lt;a href="https://docs.python.org/3/library/logging.html#logging.Logger.debug" rel="noopener noreferrer"&gt;standard way&lt;/a&gt; to retrieve information about an exception in python.&lt;/p&gt;

&lt;p&gt;After retrieving the exception name, value, and stacktrace, we format it into a json-dumped string so all three appear in one log message, with the keys automatically parsed into fields. This also makes it easy to create custom metrics based off specific errors without requiring complex string parsing.&lt;/p&gt;

&lt;p&gt;Lastly, it is worth noting that, in contrast, logging an exception with the default Lambda logger without any formatting results in an unfortunate multi-line traceback that looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1okffr9q9vwpsrn5q1ni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1okffr9q9vwpsrn5q1ni.png" alt="Multi-line exception traceback example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I hope if your Lambda functions look more like the Amateur Lambda at the moment, this article inspires you to upgrade your dance and go Pro!&lt;/p&gt;

&lt;p&gt;Before I let you go, I should warn that the downside to replacing print statements with proper logging is that you lose any terminal output generated from local executions of your Lambda.&lt;/p&gt;

&lt;p&gt;There are clever ways around this involving either environment variables or some setup code in a &lt;code&gt;lambda_invoke_local.py&lt;/code&gt; type of file. If interested, let me know and I’ll be happy to go over the details in a future article.&lt;/p&gt;

&lt;p&gt;Lastly, as a final bit of inspiration: instead of needing to run CloudWatch Insights queries to debug yourself, it should be possible to set up an Alarm against the Lambda Errors metric that notifies an SNS topic when in state “In Alarm”. Another Lambda could then trigger off that SNS topic to run the same debugging Insights queries as the Pro automatically, and return the relevant logs in Slack or something.&lt;/p&gt;

&lt;p&gt;Would be cool, no?&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>python</category>
    </item>
  </channel>
</rss>
