<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dashbird</title>
    <description>The latest articles on Forem by Dashbird (@dashbird).</description>
    <link>https://forem.com/dashbird</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1209%2F234b9342-3bb8-41e9-acf3-2ed557ee5391.png</url>
      <title>Forem: Dashbird</title>
      <link>https://forem.com/dashbird</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dashbird"/>
    <language>en</language>
    <item>
      <title>Why and how to monitor Amazon API Gateway HTTP APIs</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 07 Jul 2022 11:52:50 +0000</pubDate>
      <link>https://forem.com/dashbird/why-and-how-to-monitor-amazon-api-gateway-http-apis-4pn1</link>
      <guid>https://forem.com/dashbird/why-and-how-to-monitor-amazon-api-gateway-http-apis-4pn1</guid>
      <description>&lt;p&gt;API gateways are part of every modern microservice architecture. As their name already suggests, they are the gateway into your system; everyone who wants to access your service has to go through a gateway.&lt;/p&gt;

&lt;p&gt;In 2019, AWS announced HTTP APIs for its API Gateway (APIG) service. This was a big step to add more flexibility and lower latency to APIG.&lt;/p&gt;

&lt;p&gt;Before this release, APIG only offered REST APIs, which was fine if you wanted an API built on the REST architecture. In every other case, it was a burden because you had to bend all the REST configuration to fit your architecture model.&lt;/p&gt;

&lt;p&gt;With HTTP APIs, AWS gave its customers a low-level tool. You can pick a specification like &lt;a href="https://www.asyncapi.com/"&gt;AsyncAPI&lt;/a&gt; and build your gateway around it. But the more you can do, the more you can do wrong.&lt;/p&gt;

&lt;p&gt;So, it's nice to know that &lt;a href="https://dashbird.io/blog/http-api-gateway/"&gt;&lt;strong&gt;Dashbird&lt;/strong&gt; &lt;strong&gt;now supports&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; the monitoring of HTTP APIs built with APIG&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article will look at why and how to monitor HTTP APIs and how Dashbird will help you do that. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Monitor Your HTTP APIs?
&lt;/h2&gt;

&lt;p&gt;As you already know, API gateways are the entry to your system. This means at least two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; An API gateway is a liability because everything bad that enters your system comes through it.&lt;/li&gt;
&lt;li&gt; Since all data goes through an API gateway, it's the perfect location to measure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a malicious user wants to attack your system, they need a way in, and a misconfigured APIG can provide exactly that entry point. This doesn't just mean allowing anonymous users to access routes they aren't supposed to use. It can also mean attaching confidential data to URLs, which then gets logged somewhere.&lt;/p&gt;

&lt;p&gt;The same goes for allowing all data to enter your system without validation or sanitization. If a user can send anything to your system and it gets processed without further checks, you can quickly end up with corrupt data or crashed services.&lt;/p&gt;
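&lt;p&gt;The validation point above can be sketched with a minimal payload check. The field names and limits here are made up for illustration, not part of any real API:&lt;/p&gt;

```python
def validate_order_payload(payload: dict) -> list:
    """Return a list of validation errors; an empty list means the payload is acceptable."""
    errors = []
    # Reject unexpected fields so arbitrary data never reaches downstream services.
    allowed = {"item_id", "quantity"}
    for key in payload:
        if key not in allowed:
            errors.append(f"unexpected field: {key}")
    # Required fields with basic type and range checks.
    item_id = payload.get("item_id")
    if not isinstance(item_id, str) or item_id == "":
        errors.append("item_id must be a non-empty string")
    quantity = payload.get("quantity")
    if not isinstance(quantity, int) or quantity not in range(1, 101):
        errors.append("quantity must be an integer from 1 to 100")
    return errors
```

&lt;p&gt;Rejecting a request at the gateway edge like this is far cheaper than letting a malformed payload crash a downstream service.&lt;/p&gt;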

&lt;p&gt;Monitoring your HTTP APIs helps you to get insights into such problems. If things go wrong in production, and your service is on the other side of the planet, you might not have the time to comb through hundreds of log lines; &lt;strong&gt;you need a sophisticated monitoring system that quickly tells you what happened&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Modern services also aren't static. They change over time, and your first implementation is never the best. You iterate with user feedback and technical feedback from your monitoring tools.&lt;/p&gt;

&lt;p&gt;If a user reports a bad experience because your system is slow, you need to know why it's slow. Pouring hours or days into optimizing your database queries when your HTTP API was the real laggard is time you don't want to waste.&lt;/p&gt;

&lt;p&gt;Costs are also a huge factor, especially in the AWS cloud, where you pay for all the traffic out of your system. If you always send massive JSON objects around but just use 10% of their data in the UI, you leave money on the table. &lt;/p&gt;

&lt;p&gt;Monitoring your HTTP APIs &lt;strong&gt;can transform your decision process with actionable information&lt;/strong&gt; instead of guesswork around user complaints and high bills. A gateway's central spot in your architecture makes it the perfect place to gather all the metrics you need for meaningful iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do You Monitor Your HTTP APIs?
&lt;/h2&gt;

&lt;p&gt;With its latest release, &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; added support for APIG's HTTP APIs. All your HTTP APIs are &lt;strong&gt;automatically monitored&lt;/strong&gt; after installing Dashbird into your AWS account. You need to deploy a CloudFormation template to set up Dashbird integration; it doesn't require any code changes!&lt;/p&gt;
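&lt;p&gt;Onboarding like this usually boils down to deploying a CloudFormation stack that grants the monitoring tool read access. A rough boto3 sketch; the stack name and template URL are hypothetical placeholders, not Dashbird's actual values (use the template from your onboarding page):&lt;/p&gt;

```python
# Hypothetical values for illustration only.
STACK_NAME = "dashbird-integration"
TEMPLATE_URL = "https://example.com/dashbird-template.yaml"

def build_create_stack_params(stack_name: str, template_url: str) -> dict:
    """Assemble the arguments for CloudFormation's CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Integration templates typically create an IAM role,
        # so IAM capabilities must be acknowledged explicitly.
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }

# To actually deploy (requires AWS credentials):
# import boto3
# boto3.client("cloudformation").create_stack(
#     **build_create_stack_params(STACK_NAME, TEMPLATE_URL))
```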

&lt;p&gt;&lt;a href="https://app.dashbird.io/auth/register"&gt;Sign up now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In figure 1, you can see what Dashbird monitors out of the box: requests per minute, errors, and latency for all your HTTP APIs. You also get a list of all endpoints, with details about their authentication mechanisms and whether they are redirects.&lt;/p&gt;

&lt;p&gt;Dashbird extracts all errors directly from CloudWatch Logs. This way, you can &lt;strong&gt;save a lot of time&lt;/strong&gt; combing through logs manually while ensuring nothing goes missing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5rPnn3Ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/lVKFNCn9FWYFbYitLTcSJ6zd-SsEiGskRtSr8pChoyf6iCU-vmlkUdXufpp44X7Y5LyKXAjI09IKPrB7QhOXXuHIsVaR85htX74_JlregM-uEw1RMFfESNAD6fLCrDhq_NNrMT5vETAg8EtOxA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5rPnn3Ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/lVKFNCn9FWYFbYitLTcSJ6zd-SsEiGskRtSr8pChoyf6iCU-vmlkUdXufpp44X7Y5LyKXAjI09IKPrB7QhOXXuHIsVaR85htX74_JlregM-uEw1RMFfESNAD6fLCrDhq_NNrMT5vETAg8EtOxA" alt="" width="880" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 1: Dashbird's HTTP API monitoring&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashbird Insights
&lt;/h3&gt;

&lt;p&gt;Dashbird also comes with a bunch of insights for your HTTP APIs. These are taken directly from the AWS Well-Architected Framework, so you know at one glance if you're following AWS best practices when building your serverless systems. Insights let you be proactive; &lt;strong&gt;they will notify you before a problem impacts your users&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Figure 2 shows the Insights category filtered by "http api." This filter helps you to &lt;strong&gt;focus on one service at a time&lt;/strong&gt;. Insights come with details about each issue and why you might want to fix it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CtzkF_TY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/qxO7hGLqIuoh8qBkLM7dDkRo7jybVz05HS3IsKYlqj4GCvQje8PGFTPw6vvOMTXbsvzHF2xz5krqo49P6XWSeiV7u9FViLTSA8vT7kqOOo3Xr_wDP8lQrYPz8qljVIsUKAwBO1FmgInpm9vvTA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CtzkF_TY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/qxO7hGLqIuoh8qBkLM7dDkRo7jybVz05HS3IsKYlqj4GCvQje8PGFTPw6vvOMTXbsvzHF2xz5krqo49P6XWSeiV7u9FViLTSA8vT7kqOOo3Xr_wDP8lQrYPz8qljVIsUKAwBO1FmgInpm9vvTA" alt="" width="880" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: Dashbird Insights for HTTP APIs&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashbird Alarms
&lt;/h3&gt;

&lt;p&gt;If you want to get notified when things don't go as planned, you can use Dashbird's trusty &lt;strong&gt;alarms feature&lt;/strong&gt;, which will send you notifications on a channel of your choice. If you look at figure 3, you see an HTTP API alarm that will trigger when too many requests are hitting your endpoints.&lt;/p&gt;

&lt;p&gt;Dashbird currently supports email, Amazon SNS, Slack, and webhooks as notification channels. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ozrtelyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/Pnij6hh7QIQSYArFX44MsmK38YQ767mxe8dzUlOq11MXDxhjzPP033AI-qFGnssE8iTyjJdH_yq47DBYtftzdVWhN-MNKIzsF_YpC8RukmTUU_OE-YT382_I9Pwu4R8CoM1P0siR-SgDOQJSJw" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ozrtelyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/Pnij6hh7QIQSYArFX44MsmK38YQ767mxe8dzUlOq11MXDxhjzPP033AI-qFGnssE8iTyjJdH_yq47DBYtftzdVWhN-MNKIzsF_YpC8RukmTUU_OE-YT382_I9Pwu4R8CoM1P0siR-SgDOQJSJw" alt="" width="880" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3: Dashbird alarms for HTTP APIs&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Dashbird's &lt;a href="https://dashbird.io/blog/http-api-gateway/"&gt;new API Gateway integration&lt;/a&gt; for HTTP APIs brings all the goodies you know from other services. &lt;/p&gt;

&lt;p&gt;Monitoring HTTP APIs is a breeze when you don't have to wade through thousands of log lines whenever you encounter problems with your architecture. With Dashbird's alarms, you can &lt;strong&gt;get notified right away&lt;/strong&gt;, and Insights even go a step further by &lt;strong&gt;helping you follow AWS best practices&lt;/strong&gt; and telling you when things might go wrong in the future.&lt;/p&gt;

&lt;p&gt;Since Dashbird is built on top of AWS services like CloudWatch Logs, you get all these supporting features &lt;strong&gt;without writing a single line of code or changing your system&lt;/strong&gt;. Just &lt;a href="https://dashbird.io/docs/quickstart/setting-up-dashbird/"&gt;deploy the CloudFormation template&lt;/a&gt;, and Dashbird will monitor all your existing systems for you.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>monitoring</category>
      <category>microservices</category>
      <category>api</category>
    </item>
    <item>
      <title>Getting down and dirty with metric-based alerting for AWS Lambda</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 02 Jun 2022 13:14:09 +0000</pubDate>
      <link>https://forem.com/dashbird/getting-down-and-dirty-with-metric-based-alerting-for-aws-lambda-19md</link>
      <guid>https://forem.com/dashbird/getting-down-and-dirty-with-metric-based-alerting-for-aws-lambda-19md</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally posted to &lt;a href="https://dashbird.io/blog/metric-based-alerting-for-aws-lambda/"&gt;Dashbird blog&lt;/a&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Traditionally in white-box monitoring, error reporting has been achieved with third-party libraries that catch failures, communicate them to external services, and notify developers whenever a problem occurs. I'm here to argue that for managed services this can be achieved with less effort, no agents, and no performance overhead.&lt;/p&gt;

&lt;p&gt;In fact, there are many reasons why you &lt;strong&gt;shouldn't&lt;/strong&gt; use classical error-reporting tools in AWS Lambda. The most critical is that error-handling libraries in the code are blind to Lambda-specific failures, such as timeouts, wrongly configured packages, and out-of-memory failures. In addition, there is an issue with coverage: implementing error reporting for each function is a lot of work. Whenever you add a service to your infrastructure, you must set up error tracking and monitoring for it, and forgetting to do so can leave blind spots in your system.&lt;/p&gt;

&lt;p&gt;Luckily, those problems can be solved quite easily, and in most cases it's just a matter of adopting new tooling and development practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the word "observability"
&lt;/h3&gt;

&lt;p&gt;Before getting into the details, it's important to understand the idea behind observability. It doesn't mean that you'll have visibility or that you can monitor your service right off the bat. It means that the system makes itself understandable by emitting data that enables developers to ask arbitrary questions about its current or past state. Fortunately, this data-emitting aspect is well implemented in AWS, and serverless users, for example, can get visibility without implementing anything extra in their code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GTcGhggB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/cloudwatch.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GTcGhggB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/cloudwatch.jpg" alt="" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from CloudWatch Logs, we can leverage AWS APIs for resource discovery, and X-Ray and CloudTrail for tracing and connecting execution flows.&lt;/p&gt;

&lt;h3&gt;
  
  
  We can make failure detection better today
&lt;/h3&gt;

&lt;p&gt;The ability to detect failures across all functions and connect them with specific invocations, view their logs, and pull X-Ray traces for them significantly reduces the mean time to resolution in failure scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KuYgYGcK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/failures-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KuYgYGcK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/failures-1.png" alt="" width="878" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's break it down
&lt;/h4&gt;

&lt;p&gt;The only prerequisite for log-based error detection, and visibility in general, is that logs are pushed to CloudWatch (in most cases, that is the default). From there, we can do some smart pattern matching and deduction to detect failure scenarios.&lt;/p&gt;
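&lt;p&gt;A toy version of that pattern matching, assuming a few well-known failure signatures that Lambda writes to CloudWatch Logs (the list is illustrative, not exhaustive):&lt;/p&gt;

```python
from typing import Optional

# Substrings that mark common Lambda failure modes in CloudWatch Logs.
# Out-of-memory kills, for example, surface as an abnormal runtime exit.
FAILURE_PATTERNS = {
    "timeout": "Task timed out after",
    "runtime_exit": "Runtime exited with error",
    "unhandled_error": "[ERROR]",
}

def classify_log_line(line: str) -> Optional[str]:
    """Return the failure category for a log line, or None if it looks healthy."""
    for category, pattern in FAILURE_PATTERNS.items():
        if pattern in line:
            return category
    return None
```

&lt;p&gt;In practice, a monitoring service runs rules like these over every log stream so that timeouts and crashes are caught even when the function code itself never had a chance to report them.&lt;/p&gt;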

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5o-uqBEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/error-aggregation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5o-uqBEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/error-aggregation.png" alt="" width="880" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On top of that, logs contain a lot of other data that indicates latency and memory usage and allows us to connect requests with AWS X-Ray and search for the trace report of a specific request. All of this gives us the context to understand what went wrong in a particular case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0hQhWuzq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/xray.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0hQhWuzq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/xray.jpg" alt="" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what an X-Ray trace contains when you look one up for a specific Lambda request. This lets you catch errors in the services your Lambda function touches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;With the emergence of managed and distributed services, the monitoring landscape will have to go through a significant change to keep up with modern cloud applications. Currently, DevOps overhead is one of the biggest obstacles for companies looking to use serverless in production and rely on it for mission-critical applications. Our team at Dashbird is hoping to solve that, one problem at a time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Dashbird takes about 5 minutes to set up, after which you will get full visibility into your serverless applications. Give it a try by signing up &lt;a href="https://dashbird.io/"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The phrase "better safe than sorry" gets thrown around whenever people talk about monitoring or observability for your AWS resources, but the truth is that you can't sit around and wait until a problem arises; you need to proactively look for opportunities to improve your application in order to stay one step ahead of the competition. Setting up alerts that go off whenever a particular event happens is a great way to keep tabs on what's going on behind the scenes of your serverless applications, and this is exactly what I'd like to tackle in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda Metrics
&lt;/h3&gt;

&lt;p&gt;AWS Lambda monitors your functions automatically and reports metrics through Amazon CloudWatch. These metrics include total invocations, throttles, duration, errors, DLQ errors, and more. Think of CloudWatch as a metrics repository: metrics are its basic concept, and each one represents a time-ordered set of data points. A metric is defined by a name, one or more dimensions, and a namespace. Every data point has a timestamp and an optional unit of measure.&lt;/p&gt;

&lt;p&gt;And while CloudWatch is a good tool for getting your functions' metrics, Dashbird takes it up a notch by providing the missing link you need to properly debug those pesky Lambda issues. It can detect any kind of failure in all programming languages supported by the platform, including crashes, configuration errors, timeouts, and early exits. Another valuable thing Dashbird offers is error aggregation, which gives you immediate metrics about errors, memory utilization, duration, invocations, and code execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda metrics explained
&lt;/h3&gt;

&lt;p&gt;Before we jump in, we should discuss the metrics themselves to make sure we all understand what every term means and what it refers to.&lt;/p&gt;

&lt;p&gt;From there, we'll take a peek at some of the metrics inside the AWS/Lambda namespace and explain how they operate. For example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocations&lt;/strong&gt; counts the number of times a function has been invoked in response to an invocation API call or an event; it replaces the RequestCount metric. This includes successful and failed invocations but not throttled attempts. Note that AWS Lambda sends this metric to CloudWatch only when its value is nonzero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt; measures the number of invocations that failed because of errors in the function itself; it replaces the ErrorCount metric. Failed invocations can trigger a retry attempt that may succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are limitations we must mention:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  It doesn't include invocations that failed because the invocation rate exceeded the default concurrency limits (429 error code).&lt;/li&gt;
&lt;li&gt;  It doesn't include failures caused by internal service errors (500 error code).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DeadLetterErrors&lt;/strong&gt; increments when Lambda is unable to write the failed event payload to your configured dead-letter queue. This can happen due to permission errors, misconfigured resources, timeouts, or throttling from downstream services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt; measures the wall-clock time from when the function code starts executing in response to an invocation until it stops. The duration is rounded up to the closest 100 milliseconds for billing. Note that AWS Lambda sends this metric to CloudWatch only when its value is nonzero.&lt;/p&gt;
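&lt;p&gt;The billing rounding described above (which applied at the time of writing; Lambda has since moved to 1 ms billing granularity) is simple to express:&lt;/p&gt;

```python
import math

def billed_duration_ms(actual_ms: float, granularity_ms: int = 100) -> int:
    """Round an invocation's duration up to the next billing increment."""
    return math.ceil(actual_ms / granularity_ms) * granularity_ms
```

&lt;p&gt;So a 101 ms invocation was billed as 200 ms under the old scheme, which is why shaving latency just under a billing boundary directly reduced cost.&lt;/p&gt;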

&lt;p&gt;&lt;strong&gt;Throttles&lt;/strong&gt; counts the invocation attempts that were throttled because the invocation rate exceeded the account's concurrency limit (429 error code). Be aware that failed invocations may automatically trigger retry attempts that can succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iterator Age&lt;/strong&gt; applies to stream-based invocations only, i.e. functions triggered by an Amazon DynamoDB or Kinesis stream. It measures the age of the last record in each batch of records processed: the difference between the time Lambda receives the batch and the time the last record in the batch was written to the stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent Executions&lt;/strong&gt; is an aggregate metric across all functions in the account, as well as for individual functions with a custom concurrency limit set. In essence, it measures the sum of concurrent executions at a given point in time, and it should be viewed as an average when aggregated across a time period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unreserved Concurrent Executions&lt;/strong&gt; is almost the same as Concurrent Executions, but it represents the sum of the concurrency of functions that don't have a custom concurrency limit specified. It applies only at the account level and should likewise be viewed as an average when aggregated over a period of time.&lt;/p&gt;
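&lt;p&gt;To pull any of the metrics above yourself, a CloudWatch GetMetricStatistics request against the AWS/Lambda namespace looks roughly like this. The function name is a placeholder, and the boto3 call is commented out so the sketch stays self-contained:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def lambda_metric_request(function_name: str, metric: str, minutes: int = 60) -> dict:
    """Build a GetMetricStatistics request for one Lambda function's metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": metric,   # e.g. "Invocations", "Errors", "Throttles", "Duration"
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],
    }

# import boto3
# cloudwatch = boto3.client("cloudwatch")
# stats = cloudwatch.get_metric_statistics(**lambda_metric_request("my-function", "Errors"))
```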

&lt;h3&gt;
  
  
  Where do you start?
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cloudwatch
&lt;/h4&gt;

&lt;p&gt;To access the metrics in the CloudWatch console, open the console and choose Metrics in the navigation pane. Then, in the CloudWatch Metrics by Category pane, select Lambda Metrics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dashbird
&lt;/h4&gt;

&lt;p&gt;To access your metrics, log in to the app; the first screen gives you a bird's-eye view of all the important stats of your functions: cost, invocations, memory utilization, function duration, and errors. Everything is conveniently packed onto a single screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fZrxVS6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aTVvhFFD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fZrxVS6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aTVvhFFD.png" alt="" width="880" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up Metric Based Alarms For Lambda Functions
&lt;/h3&gt;

&lt;p&gt;It is essential to set up alarms that notify you when a Lambda function ends up with an error, so you can react promptly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloudwatch
&lt;/h4&gt;

&lt;p&gt;To set up an alarm for a failed function (the failure can be anything from an outage of the entire website to an error in the code), go to the CloudWatch console, choose Alarms on the left, and click Create Alarm. Choose "Lambda Metrics" and look for your Lambda's name in the list. Check the box of the row where the metric name is "Errors," then click Next.&lt;/p&gt;

&lt;p&gt;Now you can give the alarm a name and a description. Set it to trigger whenever "Errors" is above 0 for one consecutive period. As the Statistic, select "Sum," and in the "Period" dropdown choose the number of minutes appropriate for your case.&lt;/p&gt;

&lt;p&gt;In the Notification box, choose "select notification list" from the dropdown menu and pick your SNS endpoint. The last step is to click the "Create Alarm" button.&lt;/p&gt;
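&lt;p&gt;The same console steps can be expressed as a CloudWatch PutMetricAlarm request; the alarm name and SNS topic ARN here are placeholders:&lt;/p&gt;

```python
def error_alarm_params(function_name: str, sns_topic_arn: str, period_minutes: int = 5) -> dict:
    """Alarm whenever a function reports any error within one evaluation period."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",             # the "Sum" statistic from the console steps
        "Period": period_minutes * 60,  # CloudWatch periods are in seconds
        "EvaluationPeriods": 1,         # one consecutive period
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # i.e. Errors above 0
        "AlarmActions": [sns_topic_arn],
    }

# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **error_alarm_params("my-function", "arn:aws:sns:eu-west-1:123456789012:my-topic"))
```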

&lt;h4&gt;
  
  
  Dashbird
&lt;/h4&gt;

&lt;p&gt;Setting up metric-based alerts with Dashbird is not as complicated; in fact, it's quite the opposite. In the app, go to the Alerts menu, click the add button on the right side of the screen, and give the alert a name. Then select the metric you are interested in: a cold start, retry, invocation, or, of course, an error. All you have to do is define the rules (e.g., "alert me whenever the number of cold starts exceeds 5 in a 10-minute window") and you are done.&lt;/p&gt;
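&lt;p&gt;The example rule ("alert when cold starts exceed 5 in a 10-minute window") is essentially a count over a trailing time window; a minimal sketch, not Dashbird's actual implementation:&lt;/p&gt;

```python
def should_alert(event_timestamps, now, window_minutes=10, threshold=5):
    """Trigger when more than `threshold` events fall inside the trailing window.

    `event_timestamps` and `now` are epoch seconds; timestamps are assumed
    not to lie in the future.
    """
    window_start = now - window_minutes * 60
    recent = [t for t in event_timestamps if t >= window_start]
    return len(recent) > threshold
```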

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nSRmQiU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dashbird.io/wp-content/uploads/2019/02/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nSRmQiU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dashbird.io/wp-content/uploads/2019/02/giphy.gif" alt="" width="480" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you pick the right solution for your metric-based alerts?
&lt;/h3&gt;

&lt;p&gt;Tough question. While CloudWatch is a great tool, the second you have more Lambdas in your system, you'll find it very hard to debug or even understand your errors due to the large volume of information. Dashbird, on the other hand, offers details about your invocations and errors that are simple and concise, with a lot more flexibility when it comes to customization. My colleague &lt;a href="https://twitter.com/ByrroRenato"&gt;Renato&lt;/a&gt; made a simple table that compares the two services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NlsH1Hb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aws-cloudwatch-vs-dashbird.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NlsH1Hb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aws-cloudwatch-vs-dashbird.png" alt="" width="880" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd be remiss not to make an observation: whenever a Lambda function is invoked, AWS spins up a micro-container to serve the request and opens a CloudWatch log stream for it. The same log stream is reused as long as the container remains alive, which means a single log stream collects logs from multiple invocations.&lt;/p&gt;

&lt;p&gt;This quickly gets very messy, and it's hard to debug issues because you need to open the latest log stream and scroll all the way down to the latest invocation's logs. Dashbird instead shows individual invocations ordered by time, which makes it a lot easier for developers to understand what's going on at any point in time.&lt;/p&gt;
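&lt;p&gt;Since one log stream mixes many invocations, Lambda's own START/REPORT markers are what let you split a stream back into per-invocation groups; a rough sketch:&lt;/p&gt;

```python
def split_invocations(log_lines):
    """Group a Lambda log stream's lines into one list per invocation.

    Lambda opens each invocation with a "START RequestId: ..." line, so a new
    group begins whenever one appears; following lines belong to that group.
    """
    invocations = []
    current = None
    for line in log_lines:
        if line.startswith("START RequestId:"):
            current = [line]
            invocations.append(current)
        elif current is not None:
            current.append(line)
    return invocations
```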

&lt;h3&gt;
  
  
  AWS Lambda Metrics
&lt;/h3&gt;

&lt;p&gt;AWS Lambda is monitoring functions for you automatically, while it reports metrics through the Amazon CloudWatch. The metrics we speak of consist of total invocations, throttles, duration, error, DLQ errors, etc. You should consider CloudWatch as a metrics repository, being that metrics are the basic concept in CloudWatch and they represent a set of data points which are time-ordered. Metrics are defined by name, one or even more dimensions, as well as a namespace. Every data point has an optional unit of measure and a time stamp.&lt;/p&gt;

&lt;p&gt;And while Cloudwatch is a good tool to get the metrics of your functions, Dashbird takes it up a notch by providing that missing link that you'd need in order to properly debug those pesky Lambda issues. It allows you to detect any kinds of failures within all programming languages supported by the platform. This includes crashes, configuration errors, timeouts, early exits, etc. Another quite valuable thing that Dashbird offers is Error Aggregation that allows you to see immediate metrics about errors, memory utilization, duration, invocations as well as code execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Lambda metrics explained
&lt;/h3&gt;

&lt;p&gt;Before we jump in I feel like we should discuss the metrics themselves to make sure we all understand and know what every term means or what they refer to.&lt;/p&gt;

&lt;p&gt;From there, we'll take a peek at some of the namespace metrics inside the AWS Lambda, and we'll explain how do they operate. For example&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocations&lt;/strong&gt; will calculate the number of times a function has been invoked in response to invocation API call or to an event which substitutes the RequestCount metric. All of this includes the successful and failed invocations, but it doesn't include the throttled attempts. You should note that AWS Lambda will send mentioned metrics to CloudWatch only if their value is at the point of nonzero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt; will primarily measure the number of failed invocations that happened because of the errors in the function itself which is a substitution for ErrorCount metric. Failed invocations are able to start a retry attempt which can be successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are limitations we must mention:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Doesn't include the invocations that have failed because of the invocation rates exceeded the concurrent limits which were set by default (429 error code).&lt;/li&gt;
&lt;li&gt;  Doesn't include failures that occurred because of the internal service errors (500 error code).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DeadLetterErrors&lt;/strong&gt; can start a discrete increase in numbers when Lambda is not able to write the failed payload event to your pre-configured DeadLetter lines. This incursion could happen due to permission errors, misconfigured resources, timeouts or even because of the throttles from downstream services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt; will measure the real-time beginning when the function code starts performing as a result of an invocation up until it stops executing. The duration will be rounded up to closest 100 milliseconds for billing. It's notable that AWS Lambda sends these metrics to CloudWatch only if the value is nonzero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throttles&lt;/strong&gt; counts the number of invocation attempts that were throttled because the invocation rate exceeded the account's concurrent execution limit (429 error code). Be aware that throttled invocations may trigger automatic retry attempts, which can succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iterator Age&lt;/strong&gt; is reported for stream-based invocations only, i.e. functions triggered by an Amazon DynamoDB stream or a Kinesis stream. It measures the age of the last record in each batch of records processed, where the age is simply the difference between the time Lambda received the batch and the time the last record in the batch was written to the stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent Executions&lt;/strong&gt; is an aggregate metric, emitted for all functions in the account combined as well as for each function with a custom concurrency limit set; it is not reported separately for versions or aliases of a function. It measures the number of executions of a function running concurrently at a given point in time, so when it is aggregated across a time period it should be viewed as an average.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unreserved Concurrent Executions&lt;/strong&gt; is much like Concurrent Executions, but it represents the total concurrency of the functions that don't have a custom concurrency limit set. It applies only at the account level and, likewise, should be viewed as an average when aggregated across a period of time.&lt;/p&gt;
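&lt;p&gt;As a hedged sketch of how these namespace metrics can be queried programmatically, the snippet below pulls the hourly sum of Invocations for a hypothetical function with boto3; the function name is a placeholder, and the call needs valid AWS credentials:&lt;/p&gt;

```python
import datetime
import boto3  # AWS SDK for Python; requires configured credentials

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

# Sum of Invocations for a hypothetical function over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)
for point in stats["Datapoints"]:
    print(point["Timestamp"], point["Sum"])
```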

&lt;h3&gt;
  
  
  Where do you start?
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cloudwatch
&lt;/h4&gt;

&lt;p&gt;To access the metrics in the CloudWatch console, open the console and choose Metrics in the navigation pane. Then, in the CloudWatch Metrics by Category pane, select Lambda Metrics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dashbird
&lt;/h4&gt;

&lt;p&gt;To access your metrics, log in to the app; the first screen shows you a bird's-eye view of all the important stats of your functions: cost, invocations, memory utilization, function duration, and errors. Everything is conveniently packed onto a single screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fZrxVS6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aTVvhFFD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fZrxVS6K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aTVvhFFD.png" alt="" width="880" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up Metric Based Alarms For Lambda Functions
&lt;/h3&gt;

&lt;p&gt;It is essential to set up alarms that notify you when a Lambda function ends up in an error state, so you can react promptly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloudwatch
&lt;/h4&gt;

&lt;p&gt;To set up an alarm for a failed function (which can be caused by an outage of the entire website or even an error in the code), go to the CloudWatch console, choose Alarms on the left, and click Create Alarm. Choose "Lambda Metrics" and look for your Lambda function's name in the list. Check the box of the row where the metric name is "Errors," then click Next.&lt;/p&gt;

&lt;p&gt;Now you can give the alarm a name and a description. From here, set the alarm to be triggered every time "Errors" is over 0 for one consecutive period. As the Statistic, select "Sum," and in the "Period" dropdown choose the number of minutes required for your particular case.&lt;/p&gt;

&lt;p&gt;In the Notification box, choose "Select notification list" from the dropdown menu and pick your SNS endpoint. The last step is to click the "Create Alarm" button.&lt;/p&gt;
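&lt;p&gt;The console steps above can also be sketched in code. The boto3 call below mirrors them; the function name, SNS topic ARN, and 5-minute period are placeholder assumptions, and the call needs valid AWS credentials:&lt;/p&gt;

```python
import boto3  # requires configured AWS credentials

cloudwatch = boto3.client("cloudwatch")

# Fire whenever the function reports any error within a 5-minute period.
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    AlarmDescription="Fires when my-function logs any error",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=300,  # seconds
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],
)
```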

&lt;h4&gt;
  
  
  Dashbird
&lt;/h4&gt;

&lt;p&gt;Setting metric-based alerts with Dashbird is not complicated; in fact, it's quite the opposite. In the app, go to the Alerts menu, click the add button on the right side of your screen, and give the alert a name. Then select the metric you are interested in, which can be a cold start, a retry, an invocation, or, of course, an error. All you have to do is define the rules (e.g., alert me whenever the number of cold starts is over 5 in a 10-minute window) and you are done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nSRmQiU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dashbird.io/wp-content/uploads/2019/02/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nSRmQiU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dashbird.io/wp-content/uploads/2019/02/giphy.gif" alt="" width="480" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you pick the right solution for your metric based alerts?
&lt;/h3&gt;

&lt;p&gt;Tough question. While CloudWatch is a great tool, the moment you have more Lambdas in your system you'll find it very hard to debug or even understand your errors because of the sheer volume of information. Dashbird, on the other hand, presents details about your invocations and errors in a simple, concise way and offers a lot more flexibility when it comes to customization. My colleague &lt;a href="https://twitter.com/ByrroRenato"&gt;Renato&lt;/a&gt; made a simple table that compares the two services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NlsH1Hb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aws-cloudwatch-vs-dashbird.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NlsH1Hb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2020/10/aws-cloudwatch-vs-dashbird.png" alt="" width="880" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd be remiss not to make an observation: whenever a Lambda function is invoked, AWS spins up a micro-container to serve the request and opens a log stream in CloudWatch for it. The same log stream is reused as long as this container remains alive, which means a single log stream collects logs from multiple invocations in one place.&lt;/p&gt;

&lt;p&gt;This quickly gets very messy, and it's hard to debug issues because you need to open the latest log stream and browse all the way down to the latest invocation's logs. Dashbird, in contrast, shows individual invocations ordered by time, which makes it a lot easier for developers to understand what's going on at any point in time.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>devops</category>
    </item>
    <item>
      <title>We can do better failure detection in serverless applications</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 02 Jun 2022 08:50:06 +0000</pubDate>
      <link>https://forem.com/dashbird/we-can-do-better-failure-detection-in-serverless-applications-4dn8</link>
      <guid>https://forem.com/dashbird/we-can-do-better-failure-detection-in-serverless-applications-4dn8</guid>
      <description>&lt;p&gt;Traditionally in white-box monitoring, error reporting has been achieved with third party libraries, that catch and communicate failures to external services and notify developers whenever a problem occurrs. I'm here to argue that for managed services this can be achieved with less effort, no agents and without performance overhead.&lt;/p&gt;

&lt;p&gt;In fact, there are a lot of reasons why you &lt;strong&gt;shouldn't&lt;/strong&gt; use classical error-reporting tools in AWS Lambda. The most critical is that error-handling libraries in the code are blind to Lambda-specific failures, such as timeouts, wrongly configured packages, and out-of-memory failures. In addition, there is an issue with coverage -- implementing error reporting for each function is a lot of work. Whenever you add a service to your infrastructure, you must set up error tracking and monitoring for it, and forgetting to do so can result in blind spots in your system.&lt;/p&gt;

&lt;p&gt;Luckily, those problems can be solved quite easily and in most cases it's just a matter of adopting new tooling and development practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the word "observability"
&lt;/h3&gt;

&lt;p&gt;Before getting into details, it's important to understand the idea behind observability. It doesn't mean that you'll have visibility or that you can monitor your service right off the bat. It means that the system makes itself understandable by outputting data that enables the developer to ask arbitrary questions about the current or past state of the system. Fortunately, the information-emitting aspect is well implemented in AWS, and serverless users, for example, have an opportunity to get visibility without implementing anything extra in their code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GTcGhggB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/cloudwatch.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GTcGhggB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/cloudwatch.jpg" alt="Cloudwatch" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from CloudWatch logs, we could leverage AWS APIs for resource discovery, and X-Ray and CloudTrail for tracing and connecting execution flows.&lt;/p&gt;

&lt;h3&gt;
  
  
  We can make failure detection better today
&lt;/h3&gt;

&lt;p&gt;The ability to detect failures across all functions, connect them with specific invocations, view logs, and pull X-Ray traces for them significantly reduces the mean time to resolution in failure scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KuYgYGcK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/failures-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KuYgYGcK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/failures-1.png" alt="Failures" width="878" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Let's break it down
&lt;/h4&gt;

&lt;p&gt;The only prerequisite for log-based error detection, and visibility in general, is that logs are pushed to CloudWatch (in most cases that is the default). From there, we can do some smart pattern matching and deduction to detect failure scenarios.&lt;/p&gt;
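&lt;p&gt;As an illustrative sketch of such pattern matching, the REPORT line that Lambda writes to CloudWatch after every invocation can be parsed for per-invocation stats. The regular expression below assumes the standard REPORT format and is not an official parser:&lt;/p&gt;

```python
import re

# Matches the standard REPORT line Lambda logs after each invocation, e.g.:
# REPORT RequestId: abc-123  Duration: 102.25 ms  Billed Duration: 200 ms
#   Memory Size: 128 MB  Max Memory Used: 35 MB
REPORT_RE = re.compile(
    r"REPORT RequestId: (\S+)\s+"
    r"Duration: ([\d.]+) ms\s+"
    r"Billed Duration: (\d+) ms\s+"
    r"Memory Size: (\d+) MB\s+"
    r"Max Memory Used: (\d+) MB"
)

def parse_report(line):
    """Extract per-invocation stats from a CloudWatch REPORT log line."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    request_id, duration, billed, mem_size, mem_used = match.groups()
    return {
        "request_id": request_id,
        "duration_ms": float(duration),
        "billed_duration_ms": int(billed),
        "memory_size_mb": int(mem_size),
        "max_memory_used_mb": int(mem_used),
    }
```

&lt;p&gt;Comparing max memory used against memory size, for instance, is one simple deduction that flags functions close to an out-of-memory failure.&lt;/p&gt;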

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5o-uqBEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/error-aggregation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5o-uqBEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/error-aggregation.png" alt="Error aggregation" width="880" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On top of that, logs contain a lot of other data that indicates latency and memory usage and allows us to connect requests with AWS X-Ray and look up the trace report for a specific request. All of this lets us gather a lot of context to understand what went wrong in a particular case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0hQhWuzq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/xray.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0hQhWuzq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2018/10/xray.jpg" alt="xray" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what an X-Ray trace contains when you look one up for a specific Lambda request. This enables you to catch errors in the services your Lambda function touches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;With the emergence of managed and distributed services, the monitoring landscape will have to go through a significant change to keep up with modern cloud applications. Currently, DevOps overhead is one of the biggest obstacles for companies looking to use serverless in production and rely on it for mission-critical applications. Our team at &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; is hoping to solve that, one problem at a time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; takes about 5 minutes to set up, after which you will get full visibility into your serverless applications. Give it a try by signing up &lt;a href="https://dashbird.io/"&gt;here.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>devops</category>
      <category>microservices</category>
      <category>programming</category>
    </item>
    <item>
      <title>5 Common Amazon Kinesis Issues</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 26 May 2022 10:30:08 +0000</pubDate>
      <link>https://forem.com/dashbird/5-common-amazon-kinesis-issues-4b86</link>
      <guid>https://forem.com/dashbird/5-common-amazon-kinesis-issues-4b86</guid>
      <description>&lt;p&gt;&lt;a href="https://aws.amazon.com/kinesis/"&gt;Amazon Kinesis&lt;/a&gt; is the real-time stream processing service of AWS. Whether you got video, audio, or IoT streaming data to handle, Kinesis is the way to go.&lt;/p&gt;

&lt;p&gt;Kinesis is a serverless managed service that &lt;strong&gt;integrates nicely with other services like Lambda or S3&lt;/strong&gt;. Often, you will use it when SQS or SNS is too low-level.  &lt;/p&gt;

&lt;p&gt;But as with all the other services on AWS, Kinesis is a professional tool that comes with its share of complications. This article will discuss the most common issues and explain how to fix them. So, let's get going!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article is written by &lt;strong&gt;Kay Plößer&lt;/strong&gt; and originally posted to &lt;strong&gt;&lt;a href="https://dashbird.io/blog/5-common-aws-kinesis-issues/"&gt;Dashbird blog&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. What Limits Apply when AWS Lambda is Subscribed to a Kinesis Stream?
&lt;/h2&gt;

&lt;p&gt;If your Kinesis stream only has one shard, the Lambda function won't be called in parallel even if multiple records are waiting in the stream. To scale up to numerous parallel invocations, you &lt;strong&gt;need to add more shards to a Kinesis Stream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kinesis will strictly serialize all your invocations per shard. This is a nice feature for controlling your parallel Lambda invocations. But it can slow down overall processing if the function takes too long to execute.&lt;/p&gt;

&lt;p&gt;If you aren't relying on previous events, you can use more shards, and Lambda will automatically scale up to more concurrent invocations. But keep in mind that Lambda itself has a soft limit of 1,000 concurrent executions. You can reach out to AWS to get this limit raised. There isn't an explicitly defined hard limit above that, but AWS mentions multiples of 10,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Data Loss with Kinesis Streams and Lambda
&lt;/h2&gt;

&lt;p&gt;If you call put_record in a loop to publish records from a Lambda function to a Kinesis stream, the loop can fail midway. To fix this, make sure you &lt;strong&gt;catch any errors the put_record method throws&lt;/strong&gt;; otherwise, your function will crash and publish only part of the list of records.&lt;/p&gt;

&lt;p&gt;If one Lambda invocation is responsible for publishing multiple records to a Kinesis stream, you have to make sure a crash of the Lambda function doesn't lose data. Depending on your use case, this could mean you need to use retries or another queue in front of your Lambda function. &lt;/p&gt;

&lt;p&gt;You can also try to catch any errors instead of crashing and then put the missing records somewhere else to ensure they don't get lost.&lt;/p&gt;
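&lt;p&gt;One way to sketch that defensive publishing loop is shown below; here put_record stands for any callable that publishes a single record and may raise, such as the boto3 Kinesis client's method wrapped with your stream name:&lt;/p&gt;

```python
def publish_records(records, put_record, retries=3):
    """Try to publish every record; collect records that still fail after retries."""
    failed = []
    for record in records:
        for attempt in range(retries):
            try:
                put_record(record)
                break
            except Exception:
                # Last attempt exhausted: remember the record instead of crashing.
                if attempt == retries - 1:
                    failed.append(record)
    # The caller can send `failed` somewhere safe, e.g. a dead-letter queue.
    return failed
```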

&lt;h2&gt;
  
  
  3. InvokeAccessDenied Error When Pushing Records from Firehose to Lambda
&lt;/h2&gt;

&lt;p&gt;You're trying to push a record from Kinesis Data Firehose to a Lambda function but get an error. This is usually a permission issue with IAM roles. To fix it, make sure to &lt;strong&gt;assign your Firehose delivery stream the correct IAM role&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the Resource section of your policy document, you need to make sure all your Lambda functions' ARNs are listed. You achieve this with either a wildcard in the ARN or an array of ARNs. &lt;/p&gt;

&lt;p&gt;But there can be many other permission problems that prevent invocation. Some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A missing "Action": ["lambda:InvokeFunction"]&lt;/li&gt;
&lt;li&gt;  An "Effect": "Deny" somewhere in the policy&lt;/li&gt;
&lt;li&gt;  The wrong role assigned to the Firehose delivery stream&lt;/li&gt;
&lt;/ul&gt;
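&lt;p&gt;For illustration, a minimal policy statement on the Firehose delivery stream's role might look like the sketch below; the region, account ID, and function name are placeholders for your own values:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["lambda:InvokeFunction", "lambda:GetFunctionConfiguration"],
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-transform-function*"
  }]
}
```

&lt;p&gt;The trailing wildcard covers all versions of the function; an array of explicit ARNs works as well.&lt;/p&gt;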

&lt;h2&gt;
  
  
  4. Error When Trying to Update the Shard Count
&lt;/h2&gt;

&lt;p&gt;You tried to update the shard count too often in a given period. The UpdateShardCount method has rather tight limits. To get around this issue, you can &lt;strong&gt;call other functions like &lt;a href="https://docs.aws.amazon.com/kinesis/latest/APIReference/API_SplitShard.html"&gt;SplitShard&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/kinesis/latest/APIReference/API_MergeShards.html"&gt;MergeShards&lt;/a&gt;&lt;/strong&gt;, with more generous quotas.&lt;/p&gt;

&lt;p&gt;Often, you don't know how many shards are sufficient to handle your load, so you have to &lt;strong&gt;update their numbers over time&lt;/strong&gt;. AWS limits how you meddle with the shard count. To quote the docs here, you can't&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Scale more than ten times per rolling 24-hour period per stream&lt;/li&gt;
&lt;li&gt;  Scale up to more than double your current shard count for a stream&lt;/li&gt;
&lt;li&gt;  Scale down below half your current shard count for a stream&lt;/li&gt;
&lt;li&gt;  Scale up to more than 10,000 shards in a stream&lt;/li&gt;
&lt;li&gt;  Scale a stream with more than 10,000 shards down unless the result is fewer than 10,000 shards&lt;/li&gt;
&lt;li&gt;  Scale up to more than the shard limit for your account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use those other methods, you can get around some of these limitations and gain more flexibility around sharding.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Shard is Not Closed
&lt;/h2&gt;

&lt;p&gt;You interacted with the stream too soon after creating it. Creating a new stream can take up to 10 minutes to complete. You can &lt;strong&gt;set timeouts&lt;/strong&gt; &lt;strong&gt;after creating a stream&lt;/strong&gt; or make sure you retry a few times to fix this.&lt;/p&gt;

&lt;p&gt;Creating new streams or shards isn't an instant action. It happens very quickly, but you might have to wait for minutes in the worst case. As with any distributed system, you have to keep latencies in mind. Otherwise, your logs will be littered with errors.&lt;/p&gt;
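&lt;p&gt;The wait-and-retry advice above can be sketched as a small polling helper; here describe_status stands for any callable returning the stream's current status string, e.g. a thin wrapper around the boto3 describe_stream call:&lt;/p&gt;

```python
import time

def wait_until_active(describe_status, delay_seconds=10, max_attempts=60):
    """Poll a stream's status until it becomes ACTIVE, or give up."""
    for _ in range(max_attempts):
        if describe_status() == "ACTIVE":
            return True
        time.sleep(delay_seconds)  # stream creation can take minutes
    return False
```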

&lt;h2&gt;
  
  
  Summary 
&lt;/h2&gt;

&lt;p&gt;If you have to &lt;strong&gt;process your data or media in real-time&lt;/strong&gt;, it's best to go for &lt;strong&gt;Kinesis on AWS&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Sadly, it's not as straightforward as SQS and SNS, but it's also more flexible than those services.&lt;/p&gt;

&lt;p&gt;Your best course of action is to learn about the limitations of the service so you aren't littered with avoidable error messages. Also, make sure to &lt;strong&gt;program your Lambda functions robustly&lt;/strong&gt; so they don't crash with half your data not processed yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Kinesis with Dashbird
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; will &lt;strong&gt;monitor all your Kinesis streams out of the box&lt;/strong&gt;. Additionally, Dashbird will evaluate all your Kinesis logs according to &lt;a href="https://dashbird.io/serverless-well-architected-reports/"&gt;the Well-Architected Framework&lt;/a&gt;. So, it's not just metrics and errors en masse, but &lt;strong&gt;actionable information&lt;/strong&gt; to improve your architecture with AWS best practices. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.dashbird.io/auth/register"&gt;Try Dashbird now for free&lt;/a&gt;, or check out &lt;a href="https://capture.navattic.com/cl14vhr451351309leqgtcxflb?g=cl14ti84e219109kzphnixjc9&amp;amp;s=0"&gt;our product tour&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;At Dashbird, we understand that serverless's core idea and value is to &lt;strong&gt;focus on the customer and the ability to avoid heavy lifting&lt;/strong&gt;. That's precisely what we provide. Finally, we allow developers to &lt;em&gt;think about the end-user&lt;/em&gt; again and not be distracted by debugging and alarm management or worry about whether something is working.&lt;/p&gt;




&lt;p&gt;Further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://dashbird.io/blog/kinesis-sqs-sns-comparison/"&gt;AWS Kinesis vs SNS vs SQS (with Python examples)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://dashbird.io/blog/lambda-kinesis-trigger/"&gt;6 Common Pitfalls of AWS Lambda with Kinesis Trigger&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://dashbird.io/blog/aws-lambda-cost-optimization-strategies/"&gt;AWS Lambda Cost Optimization Strategies That Work&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>serverless</category>
      <category>devops</category>
      <category>microservices</category>
      <category>aws</category>
    </item>
    <item>
      <title>5 Common Amazon SQS Issues</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 12 May 2022 12:40:21 +0000</pubDate>
      <link>https://forem.com/dashbird/5-common-amazon-sqs-issues-2cl9</link>
      <guid>https://forem.com/dashbird/5-common-amazon-sqs-issues-2cl9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally posted to &lt;a href="https://dashbird.io/blog/5-common-amazon-sqs-issues/"&gt;Dashbird blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Simple Queue Service (SQS) was one of the first services AWS offered. It's a managed queuing service that lets you take pressure off your downstream services. You put your items on the queue, and other services can pull them whenever they have the capacity to work on them.&lt;/p&gt;

&lt;p&gt;It's a managed service, so you don't have to install or maintain the software yourself; you just configure a queue and start pushing to and pulling from it. So SQS is very simple to get started with.&lt;/p&gt;

&lt;p&gt;As with all services on AWS, issues can crop up while using SQS because it's not always obvious what every service can and cannot do. But fear not, for this article aims to help you solve the most common ones as quickly as possible. Ready to fix your queues? Then let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Java SDK – Cannot Receive Message Attributes
&lt;/h2&gt;

&lt;p&gt;You tried to get messages with attributes without telling the SDK the attribute names. To fix this, you need to call receiveMessage with the following parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqs.receiveMessage(

receiveMessageRequest

.withMessageAttributeNames("myAttribute")

).getMessages()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, the SDK knows that it should include the attributes you care about.&lt;/p&gt;

&lt;p&gt;For background, AWS services in general, and SQS in particular, are optimized to lower traffic because you pay for all data you move out of AWS. If you use an SQS queue from outside of AWS, you want to make sure you only fetch the data you actually use, to save money and increase throughput.&lt;/p&gt;

&lt;p&gt;The AWS SDK lets you define which attributes of the message you need to get this optimization going. But later, when you access attributes, you have to keep in mind which ones you have fetched; otherwise, you will run into something akin to a null-pointer exception: accessing data that isn't there.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. SNS Topic Not Publishing to SQS
&lt;/h2&gt;

&lt;p&gt;You tried to publish to SQS from an SNS topic that doesn't have permission. To fix this, you need to allow the SNS service principal to perform the SendMessage action on your queue.&lt;/p&gt;

&lt;p&gt;Here is an example policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{

"Statement": [{

"Sid": "Allow-SNS-SendMessage",

"Effect": "Allow",

"Principal": {

"Service": "sns.amazonaws.com"

},

"Action": ["sqs:SendMessage"],

"Resource": "arn:aws:sqs:us-east-2:444455556666:MyQueue",

"Condition": {

"ArnEquals": {

"aws:SourceArn":

"arn:aws:sns:us-east-2:444455556666:MyTopic"

}

}

}]

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This problem is based on the misconception that SNS topics are deployed in your account. You might expect your account to act as the principal here, but the SNS service itself acts as the principal. So, you end up with your account getting permission it can't use and the actual principal being locked out.&lt;/p&gt;

&lt;p&gt;AWS preconfigures all services with minimal access rights, usually none at all. This helps with security because you don't accidentally deploy something the whole world can access.&lt;/p&gt;

&lt;p&gt;In day-to-day development work, this is often a bit cumbersome. You add a new feature, and suddenly you need to configure new permissions, so a simple one-liner requires you to write JSON documents in a completely different place. Well, that's the price we pay for safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Access Denied to SQS via AWS SDK
&lt;/h2&gt;

&lt;p&gt;You tried to access SQS with a role or user that doesn't have the required permissions. To solve this, initialize the SDK with an identity that has been granted the necessary SQS permissions. You can set this up in the IAM console.&lt;/p&gt;

&lt;p&gt;Again, AWS is all about keeping your data safe. So it will lock down all your services by default so that nobody can access anything. If you want to give access to a service, you have to assign a role to it and permit this role to do its work. &lt;/p&gt;

&lt;p&gt;You can simply use admin permissions for development purposes inside a development AWS account, so you don't have to change your roles when you need different permissions. &lt;/p&gt;

&lt;p&gt;But in a production environment, you should give every service minimal permission. This way, a compromised service won't expose your whole infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Prevent Adding Duplicate SQS Messages
&lt;/h2&gt;

&lt;p&gt;You're adding messages to a queue and don't know if you already added a message. This issue can't be solved with SQS; you would have to either use another service or filter messages before adding them to the queue.&lt;/p&gt;

&lt;p&gt;SQS isn't made for such a use case, so you can't prevent a queue from accepting items it already holds. If you have this constraint, you need to choose a more complex service like DynamoDB, which allows you to mark documents as unique. Step Functions might also be an option here, but this depends mainly on your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. In-Flight Messages
&lt;/h2&gt;

&lt;p&gt;Your consumers pulled messages without deleting them or extending the visibility timeout. To fix this, you need to call deleteMessage with the message's receipt handle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqs.deleteMessage({

QueueUrl: 'STRING_VALUE',

ReceiptHandle: 'STRING_VALUE'

});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea here is that a consumer deletes a message only once it has finished processing it. If a consumer crashes before completing its work, the message remains in the queue, and eventually another consumer can process it.&lt;/p&gt;

&lt;p&gt;So, the consumer pulls a message, processes it, deletes it, and pulls the following message.&lt;/p&gt;
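&lt;p&gt;That pull-process-delete cycle can be sketched as below; receive, handle, and delete stand for your own functions, e.g. thin wrappers around the SQS client's receiveMessage and deleteMessage calls:&lt;/p&gt;

```python
def drain_queue(receive, handle, delete):
    """Pull messages until the queue is empty, deleting each only after processing."""
    processed = 0
    while True:
        messages = receive()
        if not messages:
            return processed
        for message in messages:
            handle(message["Body"])
            # Deleting last means a crash in handle() leaves the message
            # in the queue for another consumer to retry.
            delete(message["ReceiptHandle"])
            processed += 1
```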

&lt;h2&gt;
  
  
  6. FIFO Blocks the Queue
&lt;/h2&gt;

&lt;p&gt;You configured a queue as FIFO, and one message blocks the whole queue. To mitigate this issue, group your messages with message group IDs; a failing message will then only block messages in the same group.&lt;/p&gt;

&lt;p&gt;If you have a queue with many different types of messages and all of them have the same group ID, this can become a maintenance nightmare quickly. Make sure you group messages into smaller batches to lower the likelihood of blocking the whole queue when things go wrong.&lt;/p&gt;
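&lt;p&gt;A hedged sketch of sending to a FIFO queue with per-customer message groups via boto3; the queue URL, group ID, and deduplication ID are placeholders, and the call needs valid AWS credentials:&lt;/p&gt;

```python
import boto3  # requires configured AWS credentials

sqs = boto3.client("sqs")

# Messages in different groups don't block each other when one gets stuck.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo",
    MessageBody='{"order_id": 42}',
    MessageGroupId="customer-7",           # ordering is only enforced per group
    MessageDeduplicationId="order-42-v1",  # or enable content-based deduplication
)
```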

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;SQS is a straightforward service. You can see it as a temporary database. If you need to put data somewhere to be processed later, SQS is the way to go.&lt;/p&gt;

&lt;p&gt;Like every service on AWS, an SQS queue, and all the other services in your stack that use it, will be configured with minimal permissions, which can lead to access issues. So, make sure you get your IAM policies set up correctly before deploying to production.&lt;/p&gt;

&lt;p&gt;Admin roles might help when developing, but they're a security nightmare when used in production, so you need to watch IAM policies when deploying!&lt;/p&gt;

&lt;p&gt;If you need more control over what gets into or out of your queue, Step Functions or DynamoDB might be a better bet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor SQS queues with Dashbird
&lt;/h2&gt;

&lt;p&gt;Dashbird &lt;strong&gt;automatically monitors SQS queues&lt;/strong&gt; in your AWS account; no extra config is needed. Everything will be evaluated according to &lt;a href="https://dashbird.io/serverless-well-architected-reports/"&gt;the Well-Architected Framework&lt;/a&gt;, so you see right away if you're following best practices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.dashbird.io/auth/register"&gt;You can try Dashbird now&lt;/a&gt;; no credit card is required. See our product tour &lt;a href="https://capture.navattic.com/cl14vhr451351309leqgtcxflb"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt;, we understand that serverless's core idea and value is to &lt;strong&gt;focus on the customer and the ability to avoid heavy lifting&lt;/strong&gt;. That's exactly what we provide. We give the focus back to developers to think only about the &lt;strong&gt;end customer&lt;/strong&gt;, and not be distracted by debugging and alarm management or worry about whether something is working or not.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>We can do better failure detection in serverless applications</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 21 Apr 2022 11:25:28 +0000</pubDate>
      <link>https://forem.com/dashbird/we-can-do-better-failure-detection-in-serverless-applications-4h5j</link>
      <guid>https://forem.com/dashbird/we-can-do-better-failure-detection-in-serverless-applications-4h5j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally posted to &lt;a href="https://dashbird.io/blog/simplest-way-to-monitor-failures-in-aws-lambda/" rel="noopener noreferrer"&gt;Dashbird blog&lt;/a&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Traditionally in white-box monitoring, error reporting has been achieved with third-party libraries that catch failures, communicate them to external services, and notify developers whenever a problem occurs. I'm here to argue that for managed services this can be achieved with less effort, no agents, and no performance overhead.&lt;/p&gt;

&lt;p&gt;In fact, there are many reasons why you &lt;strong&gt;shouldn't&lt;/strong&gt; use classical error-reporting tools in AWS Lambda. The most critical one is that error-handling libraries in the code are blind to Lambda-specific failures, such as timeouts, wrongly configured packages, and out-of-memory failures. In addition, there is an issue with coverage: implementing error reporting for each function is a lot of work. Whenever you add a service to your infrastructure, you must set up error tracking and monitoring for it, and forgetting to do so can result in blind spots in your system.&lt;/p&gt;

&lt;p&gt;Luckily, those problems can be solved quite easily and in most cases it's just a matter of adopting new tooling and development practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the word "observability"
&lt;/h2&gt;

&lt;p&gt;Before getting into details, it's important to understand the idea behind observability. It doesn't mean that you'll have visibility or that you can monitor your service right off the bat. It means that the system makes itself understandable by outputting data that enables the developer to ask arbitrary questions about the current or past state of the system. Fortunately, this data-emitting aspect is well implemented in AWS, and serverless users, for example, can get visibility without implementing anything extra in their code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Fcloudwatch.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Fcloudwatch.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from CloudWatch logs, we could leverage AWS APIs for resource discovery and X-ray and CloudTrail for tracing and connecting execution flows.&lt;/p&gt;

&lt;h2&gt;
  
  
  We can make failure detection better today
&lt;/h2&gt;

&lt;p&gt;The ability to detect failures across all functions and connect them with specific invocations, view logs, and pull X-ray traces for them significantly reduces the mean time to resolution in failure scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Ffailures-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Ffailures-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's break it down
&lt;/h3&gt;

&lt;p&gt;The only prerequisite for log-based error detection and visibility in general is that logs are pushed to CloudWatch (in most cases that is the default). From there on, we can do some smart pattern matching and deduction to detect failure scenarios.&lt;/p&gt;
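&lt;p&gt;A minimal sketch of that pattern matching, assuming plain CloudWatch log lines as input (the patterns are illustrative and vary by runtime):&lt;/p&gt;

```javascript
// Sketch: detect Lambda-specific failures by pattern matching raw
// CloudWatch log lines. The patterns are illustrative; real log formats
// vary by runtime and would need tuning.
const FAILURE_PATTERNS = [
  { type: 'timeout', regex: /Task timed out after/ },
  { type: 'out-of-memory', regex: /Runtime exited with error|out of memory/i },
  { type: 'unhandled-error', regex: /\[ERROR\]|Unhandled/ },
];

function detectFailures(logLines) {
  const failures = [];
  for (const line of logLines) {
    for (const { type, regex } of FAILURE_PATTERNS) {
      if (regex.test(line)) {
        failures.push({ type, line });
        break; // report the first matching pattern per line
      }
    }
  }
  return failures;
}
```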

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Ferror-aggregation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Ferror-aggregation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On top of that, logs contain a lot of other data that indicate latency and memory usage and allow us to connect requests with AWS X-ray and search for a trace report for a specific request. All this allows us to gather a lot of context in order to understand what went wrong in a particular case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Fxray.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2018%2F10%2Fxray.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what an X-ray trace contains when you search for it for a specific Lambda request. This enables you to catch errors in services your Lambda function touches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With the emergence of managed and distributed services, the monitoring landscape will have to go through a significant change to keep up with modern cloud applications. Currently, devops overhead is one of the biggest obstacles for companies looking to use serverless in production and rely on it for mission-critical applications. Our team at Dashbird is hoping to solve that one problem at a time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Dashbird takes about 5 minutes to set up, after which you will get full visibility into your serverless applications. Give it a try by signing up &lt;a href="https://dashbird.io/" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>5 Common Step Function Issues</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 14 Apr 2022 08:54:46 +0000</pubDate>
      <link>https://forem.com/dashbird/5-common-step-function-issues-16m4</link>
      <guid>https://forem.com/dashbird/5-common-step-function-issues-16m4</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://dashbird.io/blog/ultimate-guide-aws-step-functions/"&gt;Step Functions&lt;/a&gt;&lt;/strong&gt;, the serverless finite state machine service from AWS. With DynamoDB, Lambda, and API Gateway, it forms the core of serverless AWS services. If you have tasks with multiple steps and you want to ensure they will get executed in the proper order, Step Functions is your service of choice.&lt;/p&gt;

&lt;p&gt;It offers direct &lt;strong&gt;integrations with many AWS services&lt;/strong&gt;, so you don't need to use Lambda Functions as glue. This can &lt;em&gt;improve the performance&lt;/em&gt; of your state machine and lower its costs.&lt;/p&gt;

&lt;p&gt;But Lambda runs your code, making debugging much more straightforward than running a managed service that's essentially a black box. This is the reason why I'm writing this article. Here you will find the most common issues when working with Step Functions, especially when starting with the service.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Task Returned a Result with a Size Exceeding the Maximum Number of Characters Service Limit
&lt;/h2&gt;

&lt;p&gt;It's a rather unwieldy error message, but it means one of the &lt;strong&gt;payloads you're passing between states is over 256 KB&lt;/strong&gt;. Make sure you don't exceed this limit, which can quickly happen when merging multiple parallel states.&lt;/p&gt;

&lt;p&gt;Many services at AWS have some stringent limits on the data they can process. This allows AWS engineers to optimize these services and offer on-demand payment for serverless ones. But the downside for you, the user, is that those limits make the services unsuitable for many use-cases. &lt;/p&gt;

&lt;p&gt;In the best-case scenario, you get along by planning right around those limits. So, when implementing a workflow as a state machine, make sure you stay inside the 256 KB limit when passing payloads along.&lt;/p&gt;
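&lt;p&gt;One way to plan around the limit is to check the payload size in your task code before returning it, so the failure surfaces with a clear message instead of the generic service-limit error. A minimal sketch:&lt;/p&gt;

```javascript
// Sketch: guard a task result against the 256 KB inter-state payload limit.
const MAX_PAYLOAD_BYTES = 256 * 1024;

function assertPayloadWithinLimit(payload) {
  // Measure the serialized size, since that is what Step Functions passes along.
  const size = Buffer.byteLength(JSON.stringify(payload), 'utf8');
  if (size > MAX_PAYLOAD_BYTES) {
    throw new Error(`Payload is ${size} bytes, over the ${MAX_PAYLOAD_BYTES}-byte limit`);
  }
  return payload;
}
```

For results that can't be shrunk, a common workaround is to store the data in S3 and pass only the object key between states.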

&lt;h2&gt;
  
  
  2. State Machine Canceled without Error
&lt;/h2&gt;

&lt;p&gt;There are various reasons why a state machine could get canceled, and sometimes you won't get an error directly. Check the &lt;strong&gt;execution history&lt;/strong&gt; of your state machine, where you can find all outputs.&lt;/p&gt;

&lt;p&gt;Errors are a fickle topic, and sometimes things can go so wrong that the whole machine executing your code crashes. The logging works, but there is no time to respond with an error. But rest assured that AWS knows about that and keeps logging everything they can.&lt;/p&gt;

&lt;p&gt;The execution history is usually an excellent place to investigate Step Functions problems. Typical events that could lead to a cancelation are more than 25,000 entries in the execution history, invalid data types in your outputs (i.e., numbers instead of strings), or missing variables in choice states.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The Choice State's Condition Path References an Invalid Value
&lt;/h2&gt;

&lt;p&gt;You put an unresolvable path into the Variable field of your choice state. In the simplest case, it was just a typo, but it could also mean an object is incomplete or you're trying to call functions.&lt;/p&gt;

&lt;p&gt;State machine definitions aren't statically typed; especially wrong path definitions can lead to headaches when you have a typo. Make sure your state outputs always line up with the inputs the following states expect.&lt;/p&gt;

&lt;p&gt;Step Function state machines are simple systems; they can execute basic logic to branch or parallelize states, but they aren't computing engines. The paths you're writing inside state machine definitions aren't JavaScript; they're JSONPath expressions, so none of your usual JavaScript methods for objects or arrays are available here. You must calculate your path targets' value inside a Lambda function before checking it in a choice state. The value also has to be a boolean, number, string, or timestamp.&lt;/p&gt;

&lt;p&gt;Try to define your state machines with tools like the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/workflow-studio.html"&gt;AWS Step Functions Workflow Studio&lt;/a&gt; to minimize problems at definition time. &lt;/p&gt;
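&lt;p&gt;For illustration, here is a minimal choice state whose Variable path must resolve to a primitive value that actually exists in the input (state and field names are hypothetical):&lt;/p&gt;

```json
{
  "CheckOrderSize": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.order.itemCount",
        "NumericGreaterThan": 10,
        "Next": "BulkDiscount"
      }
    ],
    "Default": "StandardPricing"
  }
}
```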

&lt;h2&gt;
  
  
  4. State Machine Stops After 25,000 Executions
&lt;/h2&gt;

&lt;p&gt;You exceeded the state machines' execution history with a standard workflow. If you can, switch to express workflow; if that's not possible, you have to split up the state machine and &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-continue-new.html"&gt;start as a new execution&lt;/a&gt;.&lt;/p&gt;
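&lt;p&gt;If you go the split-up route, one way to kick off the continuation from inside the state machine is Step Functions' own startExecution service integration; the ARN and input fields below are placeholders:&lt;/p&gt;

```json
{
  "StartNewExecution": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine",
      "Input": {
        "iteration.$": "$.iteration"
      }
    },
    "End": true
  }
}
```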

&lt;p&gt;The Step Functions service logs all executions of your state machines into an execution history; this is nice for debugging. But this history is limited to 25,000 entries, so the moment your state machine would record its 25,001st event, the Step Functions service will shut it down.&lt;/p&gt;

&lt;p&gt;You can configure state machines as standard and express workflows. The express workflow comes with its limits on execution time, but it allows for more than 25,000 execution history entries. You can use the express workflow to get around this limit if you have many short-lived executions.&lt;/p&gt;

&lt;p&gt;If your state machine has long-living execution steps and more than 25,000 execution steps, you will have to split it into multiple state machines. This way, each of these state machines runs as a new execution and, in turn, gets a fresh execution history.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Wrong String Format in Parameters
&lt;/h2&gt;

&lt;p&gt;One of your states sends a result in the wrong format, and there is no alternative to select from. You have to use the &lt;code&gt;States.Format()&lt;/code&gt; function for string construction to get around this.&lt;/p&gt;

&lt;p&gt;Often you can simply select the right piece of data from your results to pass it into the parameters of the next state. And if not, you might at least be able to modify the target state so it accepts the structure you have available. But this might not always be the case.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;States.Format()&lt;/code&gt; function is globally available in your state machine definition. With this function, you can concatenate and reformat the data you have so it fits the parameters of the target state. &lt;/p&gt;

&lt;p&gt;Here is a simple example that builds a full name for a parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{

"Parameters": {

"foo.$":

"States.Format('{} {}', $.firstName, $.lastName)"

}

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't get away with this, you will have to plug in a Lambda function that handles the more complex reformatting. This function call costs extra and slows down the execution, but sometimes it's the last resort.&lt;/p&gt;
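&lt;p&gt;Such a glue function can stay tiny. A hypothetical reshaping handler (field names are made up for illustration) might look like this:&lt;/p&gt;

```javascript
// Hypothetical reshaping Lambda: splits a full name into the fields the
// next state expects. Field names are made up for illustration.
function reshape(event) {
  const [firstName = '', ...rest] = (event.fullName || '').split(' ');
  return { firstName, lastName: rest.join(' '), requestId: event.requestId };
}

// The Lambda entry point just delegates to the pure function above,
// which keeps the reshaping logic easy to unit-test outside AWS.
exports.handler = async (event) => reshape(event);
```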

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Step Functions is a powerful service that helps coordinate your serverless architecture's more complex tasks. But as with all serverless services, it comes with severe limitations you have to keep in mind when building.&lt;/p&gt;

&lt;p&gt;As with all technology, make sure to modularize your stack. Small state machines are more manageable than a single monolithic one that might exceed limits here and there.&lt;/p&gt;

&lt;p&gt;Also, as with all serverless systems built in the AWS cloud, Lambda is here to help. If things don't fit 100%, you can always throw in a function here and there to smooth things out. But keep in mind that they aren't free.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Dashbird can Help
&lt;/h2&gt;

&lt;p&gt;As with the last article about common DynamoDB issues, Dashbird can also give you &lt;strong&gt;insights&lt;/strong&gt; into state machines and their executions. It will &lt;em&gt;automatically monitor all state machines&lt;/em&gt; in your AWS account; no extra config is needed.&lt;/p&gt;

&lt;p&gt;All your state machines will be evaluated according to &lt;a href="https://dashbird.io/serverless-well-architected-reports/"&gt;the Well-Architected Framework&lt;/a&gt;, so you see right away if everything is following best practices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.dashbird.io/auth/register"&gt;You can try Dashbird now&lt;/a&gt;; no credit card is required. &lt;strong&gt;The first one million Lambda invocations are on us!&lt;/strong&gt; See our product tour &lt;a href="https://capture.navattic.com/cl14vhr451351309leqgtcxflb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>stepfunction</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>6 Common DynamoDB Issues</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 07 Apr 2022 12:19:14 +0000</pubDate>
      <link>https://forem.com/dashbird/6-common-dynamodb-issues-3la0</link>
      <guid>https://forem.com/dashbird/6-common-dynamodb-issues-3la0</guid>
      <description>&lt;p&gt;DynamoDB, the &lt;strong&gt;primary NoSQL database service offered by AWS&lt;/strong&gt;, is a versatile tool. It's fast, scales without much effort, and best of all, it's &lt;strong&gt;billed on-demand&lt;/strong&gt;! These things make DynamoDB the first choice when a datastore is needed for a new project.&lt;/p&gt;

&lt;p&gt;But as with all technology, it's not all roses. You can feel a little lost if you're coming from years of working with relational databases. Your SQL and normalization know-how doesn't bring you much gain.&lt;/p&gt;

&lt;p&gt;It's expected that developers face &lt;strong&gt;many of the same issues when starting their NoSQL journey&lt;/strong&gt; with DynamoDB. This article might clear things up a bit.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Query Condition Missed Key Schema Element
&lt;/h2&gt;

&lt;p&gt;You're trying to run a query on a field that wasn't indexed for querying. Add a secondary index for the field to your table to fix it.&lt;/p&gt;

&lt;p&gt;DynamoDB is a document store, which means each table can hold documents with an arbitrary number of fields, in contrast to relational databases that usually have fixed columns in each table. This feature often leads developers to think DynamoDB is a very flexible database, which is not 100% true.&lt;/p&gt;

&lt;p&gt;To efficiently query a table, you have to create the proper indexes right from the beginning or add global secondary indexes later. Otherwise, you will need to use the scan command to get your data, which will check every item in your table. The query command uses the indexes you created and, in turn, is much quicker than using scan.&lt;/p&gt;
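&lt;p&gt;As a sketch, the input for a query against a hypothetical global secondary index only differs from a plain key lookup in the IndexName and key condition (table, index, and attribute names below are made up):&lt;/p&gt;

```javascript
// Sketch: building query input for the AWS SDK, targeting a hypothetical
// global secondary index that makes the email attribute queryable.
function buildQueryByEmail(email) {
  return {
    TableName: 'Users',
    IndexName: 'email-index', // the GSI; without it, only scan could filter on email
    KeyConditionExpression: 'email = :email',
    ExpressionAttributeValues: { ':email': { S: email } },
  };
}
```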

&lt;h2&gt;
  
  
  2. Resource Not Found Exception
&lt;/h2&gt;

&lt;p&gt;The AWS SDK can't find your table. The most common reasons for this are a typo in the table name, using the wrong account credentials, or operating in the wrong region.&lt;/p&gt;

&lt;p&gt;If you're using CloudFormation or Terraform, it can also be that these tools are still deploying your table, and you're trying to access it before they have finished.&lt;/p&gt;

&lt;p&gt;Companies using AWS in many different teams tend to have multiple AWS accounts, and if they operate globally, chances are good they deploy tables close to their customers. If you have many different combinations of accounts and regions, things can get confusing.&lt;/p&gt;

&lt;p&gt;That's why you should always double-check your AWS SDK configurations and credentials. &lt;/p&gt;

&lt;h2&gt;
  
  
  3. Cannot do Operations on a Non-Existent Table
&lt;/h2&gt;

&lt;p&gt;You have errors in your credentials for DynamoDB Local or the AWS SDK. Use the same accounts and regions in your credentials, or configure DynamoDB Local with the &lt;code&gt;-sharedDb&lt;/code&gt; flag so that multiple accounts can use one table.&lt;/p&gt;

&lt;p&gt;DynamoDB Local is a tool that allows you to run DynamoDB on your local computer for development and testing purposes. It has to be configured just like the regular DynamoDB to ensure the system behaves like your production deployment.&lt;/p&gt;

&lt;p&gt;While this makes DynamoDB Local susceptible to the same errors we saw above in point two, it also makes sure you catch them at development time.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Provided Key Element Does Not Match the Schema
&lt;/h2&gt;

&lt;p&gt;Your table configuration for hash and sort keys is different from the one you use in GetItem. Make sure you use the same key config at definition and access time to fix this.&lt;/p&gt;

&lt;p&gt;Again, this is about the difference between the perceived and actual flexibility of DynamoDB. You need to correctly set up your key schema for a table when creating it and later keep on using the same keys when loading data from the table.&lt;/p&gt;
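&lt;p&gt;In other words, the key you pass to GetItem has to mirror the table's key schema exactly. A small sketch with hypothetical table and attribute names:&lt;/p&gt;

```javascript
// Sketch: the Key passed to GetItem must use exactly the hash and sort
// key names the table was created with. All names here are hypothetical.
const KEY_SCHEMA = { hashKey: 'userId', sortKey: 'createdAt' };

function buildGetItemInput(userId, createdAt) {
  return {
    TableName: 'Orders',
    Key: {
      [KEY_SCHEMA.hashKey]: { S: userId },
      [KEY_SCHEMA.sortKey]: { S: createdAt },
    },
  };
}
```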

&lt;p&gt;&lt;a href="https://dashbird.io/knowledge-base/dynamodb/overview-and-main-concepts-of-amazon-dynamodb/"&gt;Check out our knowledge base&lt;/a&gt; if you want to learn more about setting up DynamoDB key schemas correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. User is Not Authorized to Perform: DynamoDB PutItem on Resource
&lt;/h2&gt;

&lt;p&gt;This is an IAM policy issue: the user doesn't have access to the PutItem command. To fix it, either use the managed policy AmazonDynamoDBFullAccess or, if you don't want to give the user access to all commands, explicitly add PutItem to their IAM role.&lt;/p&gt;

&lt;p&gt;AWS tries to keep you safe from security problems; that's why services deployed on AWS are always as closed as possible for any access. The downside is that accessing services deployed into a production environment can become a boilerplate-heavy task.&lt;/p&gt;

&lt;p&gt;To get around this, you can try the AWS CDK. It's an infrastructure as code tool built around CloudFormation that comes with constructs for all AWS services. These constructs help with all that permission management that is easy to get wrong when using AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Expected params.Item to be a Structure
&lt;/h2&gt;

&lt;p&gt;You're using the low-level DynamoDB client of the AWS SDK, which expects you to specify runtime types for your fields. Instead of using &lt;code&gt;pid: ''&lt;/code&gt;, you have to use an object that includes the type: &lt;code&gt;pid: {S: ''}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The AWS SDK for JavaScript includes two clients. The low-level client requires you to set all your types manually, while the higher-level client, called DynamoDB Document Client, infers your types from the values you use as arguments.&lt;/p&gt;

&lt;p&gt;I would recommend &lt;a href="https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/dynamodb-example-document-client.html"&gt;using the DynamoDB Document Client&lt;/a&gt; if you don't have good reasons to use the low-level one instead.&lt;/p&gt;
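&lt;p&gt;The difference between the two clients boils down to attribute-value wrapping. A simplified sketch of what the Document Client does for you under the hood:&lt;/p&gt;

```javascript
// Simplified sketch of DynamoDB attribute-value marshalling; the real
// Document Client covers many more cases (sets, binary, undefined, ...).
function marshall(value) {
  if (typeof value === 'string') return { S: value };
  if (typeof value === 'number') return { N: String(value) }; // numbers travel as strings
  if (typeof value === 'boolean') return { BOOL: value };
  if (value === null) return { NULL: true };
  if (Array.isArray(value)) return { L: value.map(marshall) };
  if (typeof value === 'object') {
    const attrs = {};
    for (const [key, val] of Object.entries(value)) attrs[key] = marshall(val);
    return { M: attrs };
  }
  throw new Error(`Unsupported type: ${typeof value}`);
}
```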

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using DynamoDB for your projects &lt;strong&gt;can save you from many issues with relational databases&lt;/strong&gt;. Most importantly, the costs associated with them. It &lt;strong&gt;scales well&lt;/strong&gt; in terms of developer experience, from single developers to big teams, and performance, from single deployments to replicas worldwide.&lt;/p&gt;

&lt;p&gt;But using DynamoDB and NoSQL databases requires you to relearn what you know from relational databases. If you don't understand what you're getting into and just follow the hype, you might find yourself at a dead end later because DynamoDB won't deliver on the promises you read about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Dashbird Help
&lt;/h2&gt;

&lt;p&gt;If you're new to serverless or DynamoDB, &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; can help you get up to speed. After you add Dashbird to your AWS account, it &lt;strong&gt;automatically monitors all deployed DynamoDB tables&lt;/strong&gt;, so no problem that might occur is lost on you.&lt;/p&gt;

&lt;p&gt;And best of all, &lt;a href="https://dashbird.io/serverless-well-architected-reports/"&gt;Dashbird comes with AWS Well-Architected best practices&lt;/a&gt; out of the box. This way, you can make sure you're using DynamoDB in a safe and performant way without learning everything about it up-front.&lt;br&gt;
&lt;a href="https://app.dashbird.io/auth/register"&gt;You can try Dashbird now&lt;/a&gt;; no credit card is required. &lt;strong&gt;The first one million Lambda invocations are on us!&lt;/strong&gt; See our product tour &lt;a href="https://capture.navattic.com/cl14vhr451351309leqgtcxflb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>database</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>[Infographic] OpenSearch from a serverless perspective</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 31 Mar 2022 08:15:10 +0000</pubDate>
      <link>https://forem.com/dashbird/infographic-opensearch-from-a-serverless-perspective-l3e</link>
      <guid>https://forem.com/dashbird/infographic-opensearch-from-a-serverless-perspective-l3e</guid>
      <description>&lt;p&gt;&lt;a href="https://dashbird.io/" rel="noopener noreferrer"&gt;Dashbird&lt;/a&gt; got an update, and you can now &lt;strong&gt;monitor the OpenSearch clusters&lt;/strong&gt; you set up with Amazon OpenSearch Service. But what does this even mean? Many people don't even know what OpenSearch is! Wasn't there an Elasticsearch service before? So, let's have some explanation before we check out Dashbirds OpenSearch support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elastic and AWS Disagreed on Fair Use of Elasticsearch
&lt;/h2&gt;

&lt;p&gt;There was a heated argument between Elastic, the main contributor to the Elasticsearch (ES) project, and AWS. Elastic then changed the ES license to prohibit AWS from offering their Amazon Elasticsearch Service. Since the ES versions prior to that change were still under the old Apache 2.0 license, AWS created a fork from these and called it OpenSearch (OS).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To put it simply:&lt;/strong&gt; Two companies disagreed on some software they implemented, and each one went on with their vision for that project. So, OS is merely a fork of ES, and these two open-source projects will probably diverge in the future in terms of compatibility and features.&lt;/p&gt;

&lt;p&gt;In terms of managed search services, AWS deprecated Amazon Elasticsearch Service, and Amazon OpenSearch Service (AOS) is the way to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenSearch?
&lt;/h2&gt;

&lt;p&gt;OS is a search engine, just like Google or Bing, but for your own data. It will index your data and make it searchable via an HTTP interface.&lt;/p&gt;
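&lt;p&gt;For illustration, indexing and searching a document are plain HTTP calls against the cluster; the index and field names here are hypothetical:&lt;/p&gt;

```
PUT /articles/_doc/1
{ "title": "Monitoring serverless apps", "tags": ["aws", "observability"] }

GET /articles/_search?q=title:serverless
```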

&lt;p&gt;You can download OS and set it up on whatever computer you like. Install it on your local machine, set up an EC2 instance on AWS and install it there, or let AWS run it for you with AOS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2022%2F03%2Fdashbird_opensearch-716x1024.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashbird.io%2Fwp-content%2Fuploads%2F2022%2F03%2Fdashbird_opensearch-716x1024.png" alt="Infographic of OpenSearch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get the full size infographic from &lt;a href="https://dashbird.io/blog/opensearch/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenSearch vs. Elasticsearch
&lt;/h2&gt;

&lt;p&gt;They are pretty similar, but each one is on a different trajectory. Some developers aren't happy with AWS treating Elastic the way they did. Others say switching to a less permissive license isn't the open-source way, and while they don't like how AWS might have treated Elastic, they think the Apache 2.0 license is the way to go if things should stay open.&lt;/p&gt;

&lt;p&gt;If you want to use a managed, ES-like system on AWS, you either have to use older versions of ES or OS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon OpenSearch Service?
&lt;/h2&gt;

&lt;p&gt;First, while AOS is a managed service, it's not serverless; you have to pay for idle servers.&lt;/p&gt;

&lt;p&gt;But what is AOS actually? &lt;/p&gt;

&lt;p&gt;AOS is the service on AWS that manages OS deployments for you. Like RDS is for relational databases, EKS is for Kubernetes clusters, and Elasticsearch Service is for ES, so is AOS for OS.&lt;/p&gt;

&lt;p&gt;While AOS is an excellent help for getting OS running in the cloud, ES and, in turn, OS are notoriously hard to set up and scale. So hard, in fact, that no serverless versions of them exist.&lt;/p&gt;

&lt;p&gt;While AOS isn't that easy to use, it's still simpler than managing your OS instances manually. AOS helps with scaling clusters, deploying VMs, and managing the underlying network and storage for you. It will also install bug fixes.&lt;/p&gt;

&lt;p&gt;If you need a cloud-based search engine for your project, you might not get around messing with OS and AOS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon OpenSearch Service vs. CloudWatch Logs Insights
&lt;/h2&gt;

&lt;p&gt;When you deploy your infrastructure to the cloud, chances are most of the data you generate is operational logs. If you want to ease access to these logs for the ops team, you might be inclined to pump them into AOS. But this can get costly.&lt;/p&gt;

&lt;p&gt;As mentioned above, AOS isn't a serverless system, so you pay for every hour it's deployed, even if you don't use it. For operational monitoring data, this unused capacity can lead to a massive bill for nothing.&lt;/p&gt;

&lt;p&gt;CloudWatch Logs Insights is a managed service on top of CloudWatch Logs and, in turn, optimized for this exact use case. It scales automatically, and you only pay for the data you analyze, so it can be considered a serverless service.&lt;/p&gt;
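&lt;p&gt;As an example of that use case, a typical Logs Insights query for surfacing recent errors from a log group might look like this (the filter pattern is just an illustration):&lt;/p&gt;

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
```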

&lt;p&gt;AOS is a more low-level service you use if you need maximum flexibility. It costs more and isn't easy to scale, but it gives you all the freedom you need for a customized search engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon OpenSearch Service in Serverless Architectures
&lt;/h2&gt;

&lt;p&gt;As explained before, AOS isn't a serverless service. It can help with scaling, but it doesn't go down to zero, so you will pay for the underlying EC2 instances idling.&lt;/p&gt;

&lt;p&gt;This makes integrating AOS into your architecture a hard choice. But a hard choice between which options?&lt;/p&gt;

&lt;p&gt;If money is tight, or you believe that a system should scale to zero for whatever reason, you have to ask yourself what your search requirements are.&lt;/p&gt;

&lt;p&gt;AOS is the nuclear option here; it's a Lucene-based distributed search engine that finds everything you indexed with Google-like features like fuzzy matching and such.&lt;/p&gt;

&lt;p&gt;But if your search requirements aren't that strong, &lt;a href="https://www.scavasoft.com/how-to-filter-dynamodb-records-effectively-using-query-or-scan-and-implement-a-full-text-search/" rel="noopener noreferrer"&gt;you might get away with DynamoDB's query features&lt;/a&gt;. Your results may vary and won't be on OpenSearch levels, but it might be just enough for a simple search feature. But most importantly, you don't pay for idle EC2 instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up OpenSearch on AWS
&lt;/h2&gt;

&lt;p&gt;As said before, there are different methods of setting up OS on AWS, but we will go with the easiest here. Using the CDK, we will set up an AOS domain in just a few lines of code.&lt;/p&gt;

&lt;p&gt;Keep in mind this example taken from the CDK docs will deploy a production-ready OS cluster containing 25 &lt;code&gt;m3.medium.search&lt;/code&gt; instances, which can get expensive quickly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const aosDomain = new opensearch.Domain(this, 'AosDomain', {

version: opensearch.EngineVersion.OPENSEARCH_1_0,

capacity: {masterNodes: 5, dataNodes: 20},

ebs: {volumeSize: 20},

zoneAwareness: {availabilityZoneCount: 3},

logging: {

slowSearchLogEnabled: true,

appLogEnabled: true,

slowIndexLogEnabled: true,

},

});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coming from ES, you might have heard about a cluster and how it's made up of one or more nodes. An AOS domain is such a cluster, plus all the things around running such a cluster that AOS takes care of for you. Like networking, updates, resource allocation, etc.&lt;/p&gt;

&lt;p&gt;We must use the correct engine version to get an OS cluster because AOS also supports legacy ES clusters.&lt;/p&gt;

&lt;p&gt;The nodes are EC2 instances of the default instance type &lt;code&gt;m3.medium.search&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;We use five master nodes, which means there will be one elected master and four more master-eligible nodes that can take its place if something goes wrong.&lt;/p&gt;

&lt;p&gt;We also use twenty data nodes, which store all the index shards (i.e., the actual data) of our cluster. Each of these data nodes gets a 20 GiB Elastic Block Store volume.&lt;/p&gt;

&lt;p&gt;The zone awareness makes sure AOS will deploy our 25 EC2 instances across three availability zones. This way, the cluster won't go down if one zone becomes unavailable.&lt;/p&gt;

&lt;p&gt;The logging config makes sure that poorly performing queries and indices are logged, so after the cluster has been running for a while, you get a feel for its bottlenecks.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenSearch Monitoring with Dashbird
&lt;/h2&gt;

&lt;p&gt;With the newest update to &lt;a href="https://dashbird.io/" rel="noopener noreferrer"&gt;Dashbird&lt;/a&gt;, all AOS domains will be automatically picked up in your Dashbird inventory.&lt;/p&gt;

&lt;p&gt;This update &lt;strong&gt;gives you a dashboard out of the box&lt;/strong&gt;, with all the &lt;strong&gt;health metrics&lt;/strong&gt; related to your domains, including, but not limited to, the utilization of the different node types and the query latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FUkGvvXO0JRYhOU5Xqj-skHLvvrPBwm6HqtKWhyoXnnssXAUhZqH5lkJs8m28nmCdjCxa6Z3BnPYIWlCHrXNikKSlwPJVPdkP5yC0mEPupe4FLOZ9rgW3xdinL91k4_wQgQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FUkGvvXO0JRYhOU5Xqj-skHLvvrPBwm6HqtKWhyoXnnssXAUhZqH5lkJs8m28nmCdjCxa6Z3BnPYIWlCHrXNikKSlwPJVPdkP5yC0mEPupe4FLOZ9rgW3xdinL91k4_wQgQ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this is an excellent start to assess the status of your existing clusters in the context of their history, more interesting are the &lt;strong&gt;well-architected Dashbird insights&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2Fi7MfreZ5cpGJdtNYVPgWbnpSVTz5KvIAgcNtI2Tj4JWd5i21-o4ujuShs4Zi2cqjvsCoC0JggiDMSRsn2Cxa84m34u6B52sH1uOM83FZB3ZIuVAxnp-nANEvicKmkYW7-Q" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2Fi7MfreZ5cpGJdtNYVPgWbnpSVTz5KvIAgcNtI2Tj4JWd5i21-o4ujuShs4Zi2cqjvsCoC0JggiDMSRsn2Cxa84m34u6B52sH1uOM83FZB3ZIuVAxnp-nANEvicKmkYW7-Q"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The insights are periodical checks based on the AWS Well-Architected Framework. This means they test your AOS domains for best practices and notify you if you're not following them. &lt;/p&gt;

&lt;p&gt;The insights also include some recurring checks for the availability and usage of your AOS domains. After all, you have to pay for them if they're deployed, and if they are never used, you might as well decommission them to save a few dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;OS is the search engine software that AWS will use from now on. If you want to stay with ES, you either have to use an older version or deploy your cluster manually with EC2, ECS, or EKS.&lt;/p&gt;

&lt;p&gt;But rest assured that Dashbird has your back with this new ES fork: you will stay on top of things with monitoring metrics and well-architected insights for your AOS domains.&lt;/p&gt;




&lt;p&gt;Further reading:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/aws-application-load-balancer/" rel="noopener noreferrer"&gt;AWS Elastic Load Balancing from a Serverless perspective&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/dashbird-now-integrates-with-5-new-aws-services/" rel="noopener noreferrer"&gt;Dashbird now integrates with 5 new AWS services&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/triggering-lambda-with-sns-messaging/" rel="noopener noreferrer"&gt;Triggering AWS Lambda with SNS&lt;/a&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>[Infographic] HTTP API Gateway from a serverless perspective</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 10 Mar 2022 08:42:31 +0000</pubDate>
      <link>https://forem.com/dashbird/infographic-http-api-gateway-from-a-serverless-perspective-1k7f</link>
      <guid>https://forem.com/dashbird/infographic-http-api-gateway-from-a-serverless-perspective-1k7f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally posted to &lt;a href="https://dashbird.io/blog/http-api-gateway/"&gt;Dashbird blog&lt;/a&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since our &lt;a href="https://dashbird.io/blog/dashbird-now-integrates-with-5-new-aws-services/"&gt;latest update&lt;/a&gt;, Dashbird also gives you insights into &lt;strong&gt;HTTP API Gateway&lt;/strong&gt;. Let's look at the differences between REST vs. HTTP; HTTP API Gateway pricing, integrations and monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is HTTP API Gateway?
&lt;/h2&gt;

&lt;p&gt;Back in the day, when AWS released the first version of API Gateway, they positioned it as a higher-level offering than the fleet of load balancers they already had. It's serverless and follows a RESTful approach to API modeling, allowing users to define their APIs with the OpenAPI spec.&lt;/p&gt;

&lt;p&gt;While REST is an excellent approach for most APIs, it isn't right for everyone, and WebSockets were missing from the service. This meant people either had to go with REST, and without WebSockets, when they wanted a serverless proxy, or they had to go for the Application Load Balancer, which was more low-level but not serverless.&lt;/p&gt;

&lt;p&gt;These drawbacks were why they built a new version of API Gateway, called HTTP API Gateway, or API Gateway V2. It comes with the same serverless goodies, such as &lt;strong&gt;automatic scaling&lt;/strong&gt; and &lt;strong&gt;on-demand pricing&lt;/strong&gt;, but offers &lt;strong&gt;WebSocket support&lt;/strong&gt; and &lt;strong&gt;isn't bound to REST API design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--in8n8hlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/dashbird_http-gw-1-1432x2048.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--in8n8hlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/dashbird_http-gw-1-1432x2048.png" alt="" width="880" height="1259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS HTTP Gateway Infographic by Dashbird&lt;/p&gt;

&lt;h2&gt;
  
  
  REST vs. HTTP
&lt;/h2&gt;

&lt;p&gt;REST is an API design paradigm created by Roy Fielding. His goal was to give people a tool to design APIs on top of HTTP without all the issues other paradigms like SOAP had in the past. So, REST tries to formalize HTTP to make it more usable for API use-cases.&lt;/p&gt;

&lt;p&gt;HTTP itself is more flexible: there are other ways to build an API on top of HTTP that don't follow the REST approach, such as AsyncAPI and GraphQL. In contrast to REST, they have different opinions on endpoint design and use WebSockets to allow real-time interactions.&lt;/p&gt;

&lt;p&gt;So, if you're building a simple CRUD API, you can save much time by just going with the REST API Gateway. But if you need to be more flexible, use the HTTP version, but keep in mind that it comes with less help out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  HTTP API Gateway Pricing
&lt;/h2&gt;

&lt;p&gt;API Gateway has a 12-month free tier. That includes these limits for every month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  one million REST API calls&lt;/li&gt;
&lt;li&gt;  one million HTTP API calls&lt;/li&gt;
&lt;li&gt;  one million WebSocket API messages&lt;/li&gt;
&lt;li&gt;  750,000 WebSocket API connection minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the first 300 million requests after the free tier, you pay $1 per million; after that, $0.90. &lt;/p&gt;

&lt;p&gt;For the first 1 billion (yes, billion) WebSocket messages, you also pay $1 per million; after that, $0.80 per million.&lt;/p&gt;

&lt;p&gt;So, the HTTP API Gateway is quite a bit cheaper than the REST version, which costs about $3.50 per one million requests.&lt;/p&gt;
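&lt;p&gt;To make the tiers above concrete, here is a minimal sketch of a request-cost estimator. The tier boundaries and prices are the ones quoted in this article, so double-check them against the current AWS pricing page before relying on them.&lt;/p&gt;

```typescript
// Sketch: estimated monthly HTTP API request cost, using the tiers above.
function httpApiRequestCost(requests: number, freeTier = 1_000_000): number {
  const billable = Math.max(0, requests - freeTier);
  const firstTier = Math.min(billable, 300_000_000); // first 300M at $1.00/M
  const rest = billable - firstTier;                 // beyond that at $0.90/M
  return (firstTier / 1_000_000) * 1.0 + (rest / 1_000_000) * 0.9;
}

console.log(httpApiRequestCost(1_000_000));   // free tier covers it: 0
console.log(httpApiRequestCost(101_000_000)); // 100M billable: 100
```

&lt;p&gt;At these rates, even a hundred million billable requests costs around a hundred dollars, which is why the HTTP flavor is attractive for high-traffic APIs.&lt;/p&gt;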

&lt;h2&gt;
  
  
  Service Integrations
&lt;/h2&gt;

&lt;p&gt;The integration options of the HTTP API Gateway are also a bit limited. While the REST API offers direct integrations with dozens of AWS services, the HTTP version only offers Lambda and HTTP integration. This means if you don't want to integrate with Lambda, your downstream service has to provide an HTTP API.&lt;/p&gt;

&lt;p&gt;The Lambda integration is a one-click solution to map a route to a Lambda function; it will automatically forward headers, body, query parameters, etc. Lambda is seen as the default backend for serverless applications, so AWS made its configuration as straightforward as possible.&lt;/p&gt;

&lt;p&gt;The HTTP is the catch-all solution. If you want to pipe your requests to another service, HTTP API Gateway will simply proxy it via HTTP. Luckily, you can ensure the downstream service gets the information it needs with &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-parameter-mapping.html"&gt;parameter mappings&lt;/a&gt;.&lt;/p&gt;
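&lt;p&gt;Conceptually, a parameter mapping copies pieces of the incoming request or its context onto the proxied request. The real feature is declarative configuration, not code; the TypeScript below only illustrates the idea, and the header name is a made-up example.&lt;/p&gt;

```typescript
// Illustrative only: mimics what a mapping that appends a header from the
// request context conceptually does before the request is proxied downstream.
interface ProxiedRequest {
  headers: { [name: string]: string };
  body: string;
}

function applyUserIdMapping(
  request: ProxiedRequest,
  authorizerUserId: string
): ProxiedRequest {
  return {
    ...request,
    headers: { ...request.headers, "x-user-id": authorizerUserId },
  };
}
```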

&lt;h2&gt;
  
  
  Authorization
&lt;/h2&gt;

&lt;p&gt;HTTP API Gateway lets you authorize users for your APIs via JSON Web Tokens or an authorizer Lambda function, which is called alongside every request. This can get quite expensive, which is why the Lambda response can be cached. All of this is configurable: for some APIs, authorization caching isn't an option for security reasons; for others, it might not pose an issue.&lt;/p&gt;

&lt;p&gt;The Lambda function can return either a simple boolean value, true for "has access" and false for "doesn't have access," or an IAM document that goes into more detail about what the client is allowed to do. This lets you add authorization features to all your HTTP backends with just a few clicks.&lt;/p&gt;
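&lt;p&gt;As a minimal sketch of the simple boolean variant, an authorizer can look like this; the hard-coded token check stands in for whatever real authorization logic you'd use.&lt;/p&gt;

```typescript
// Sketch of a "simple response" Lambda authorizer: a boolean plus optional
// context. The token comparison below is a stand-in for real auth logic.
interface AuthorizerEvent {
  headers?: { [name: string]: string };
}

const authorizer = async (event: AuthorizerEvent) => {
  const token = (event.headers ?? {})["authorization"] ?? "";
  // true means "has access", false means "doesn't have access"
  const isAuthorized = token === "Bearer my-secret-token";
  return { isAuthorized, context: { check: "token-match" } };
};
```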

&lt;h2&gt;
  
  
  Migrating from REST to HTTP
&lt;/h2&gt;

&lt;p&gt;The whole HTTP approach is more flexible and, what's probably most interesting, it's much cheaper than REST. That alone is perhaps enough to migrate away from the REST version.&lt;/p&gt;

&lt;p&gt;But things are different enough between the two that you should consult a migration guide before doing so. The web console looks different; CloudFormation requires you to use other resources, the Lambda integration works differently, etc. &lt;/p&gt;

&lt;p&gt;So, sadly, it isn't as simple as incrementing a version number, flipping an HTTP switch, and being done. But people have already written several articles about this migration and its gotchas. Check out this excellent &lt;a href="https://www.serverlessguru.com/blog/migrating-from-api-gateway-v1-rest-to-api-gateway-v2-http"&gt;migration guide from Serverless Guru&lt;/a&gt; to get things sorted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring HTTP API Gateway with &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Last month, Dashbird &lt;a href="https://dashbird.io/blog/dashbird-now-integrates-with-5-new-aws-services/"&gt;released a big update&lt;/a&gt; that adds more services to its monitoring, HTTP API Gateway among them. Historically, &lt;a href="https://dashbird.io/blog/resolve-all-api-gateway-errors/"&gt;API Gateway&lt;/a&gt; was one of the more expensive parts of serverless architectures, and with Dashbird's new update, you can now get those sweet savings while still maintaining insights.&lt;/p&gt;

&lt;p&gt;With the new integration, you can ensure that your gateways won't go down without notice. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; gives you the current errors and alerts you when the error percentage rises to a critical level. This includes 4XX and 5XX HTTP errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lDCCoq-H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lDCCoq-H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/image2.png" alt="" width="817" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You also get insights into your latency over time and notifications if it increases or is about to hit a timeout.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DS1AQhyD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DS1AQhyD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dashbird.io/wp-content/uploads/2022/03/image1.png" alt="" width="807" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, you get notified of abandoned gateways. That is, gateways you deployed to the cloud, but that aren't used anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;HTTP API Gateway is more flexible for non-REST API design and cheaper than the REST version. If you only have a few routes that need to be handled by a Lambda function, it is your best bet.&lt;/p&gt;

&lt;p&gt;But HTTP API Gateway integrations are limited to HTTP services, which takes away from that flexibility again. If you see yourself putting Lambda functions between your gateway and your services by the dozen, you might not save any money while needlessly raising the complexity of your stack. In that case, you may be better served by a REST API Gateway, which offers cheaper integrations with other AWS services.&lt;/p&gt;

&lt;p&gt;But independent of HTTP and REST, &lt;a href="https://dashbird.io/"&gt;Dashbird&lt;/a&gt; has you covered.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/kinesis-sqs-sns-comparison/"&gt;AWS Kinesis vs SNS vs SQS (with Python examples)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/dashbird-now-integrates-with-5-new-aws-services/"&gt;Dashbird now integrates with 5 new AWS services&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/triggering-lambda-with-sns-messaging/"&gt;Triggering AWS Lambda with SNS&lt;/a&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>[Infographic] AWS SNS from a serverless perspective</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Tue, 22 Feb 2022 13:04:11 +0000</pubDate>
      <link>https://forem.com/dashbird/infographic-aws-sns-from-a-serverless-perspective-24h9</link>
      <guid>https://forem.com/dashbird/infographic-aws-sns-from-a-serverless-perspective-24h9</guid>
      <description>

&lt;p&gt;The &lt;strong&gt;Simple Notification Service&lt;/strong&gt;, or SNS for short, is one of the central services to build serverless architectures in the AWS cloud. SNS itself is a serverless messaging service that can distribute massive numbers of messages to different recipients. These include mobile end-user devices, like smartphones and tablets, but also other services inside the AWS ecosystem.&lt;/p&gt;

&lt;p&gt;SNS' ability to target AWS services makes it the perfect companion for &lt;a href="https://dashbird.io/knowledge-base/aws-lambda/introduction-to-aws-lambda/"&gt;AWS Lambda&lt;/a&gt;. If you need custom logic, go for Lambda; if you need to fan out messages to multiple other services in parallel, SNS is the place to be.&lt;/p&gt;

&lt;p&gt;But you can also use it for SMS or email delivery; it's a versatile service with many possible use-cases in a modern serverless system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z1a46gL8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AGDXl81cBJINTd0bS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z1a46gL8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AGDXl81cBJINTd0bS.png" alt="" width="712" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See the original article for copyable code examples:&lt;/em&gt; &lt;a href="https://dashbird.io/blog/aws-sns/"&gt;&lt;em&gt;https://dashbird.io/blog/aws-sns/&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SNS pricing
&lt;/h3&gt;

&lt;p&gt;SNS is a serverless service, which means it comes with pay-as-you-go billing. You pay about 50 cents per 1 million messages. If a message is bigger than 64 KB, each additional 64 KB chunk is billed as if it were a whole message.&lt;/p&gt;

&lt;p&gt;AWS also offers the first 1 million requests each month for free; this includes 100,000 deliveries via HTTP subscriptions.&lt;/p&gt;
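&lt;p&gt;The 64 KB chunk rule boils down to a ceiling division, sketched here with the numbers from this section:&lt;/p&gt;

```typescript
// Sketch: how many billable "messages" one SNS publish counts as under the
// 64 KB chunk rule described above.
const CHUNK_BYTES = 64 * 1024; // 64 KB

function billableMessages(payloadBytes: number): number {
  // Every publish is billed as at least one message.
  return Math.max(1, Math.ceil(payloadBytes / CHUNK_BYTES));
}

console.log(billableMessages(10 * 1024));  // 10 KB  -> 1
console.log(billableMessages(130 * 1024)); // 130 KB -> 3
```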

&lt;h3&gt;
  
  
  SNS vs. SQS
&lt;/h3&gt;

&lt;p&gt;SNS' parallel delivery is different from &lt;a href="https://aws.amazon.com/sqs/"&gt;SQS&lt;/a&gt;, the serverless queuing service of AWS. If you need to buffer events to remove pressure from a downstream service, then SQS is a better solution.&lt;/p&gt;

&lt;p&gt;Another difference is that SQS is pull-based, so a service needs to actively grab events from the queue, while SNS is push-based: it calls a service, like Lambda, that waits passively for an event.&lt;/p&gt;

&lt;h3&gt;
  
  
  SNS vs. EventBridge
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/eventbridge/"&gt;EventBridge&lt;/a&gt; has similar use-cases as SNS but operates on a higher level. EventBridge can archive messages and target more services than SNS. SNS' only targets are email addresses, phone numbers, HTTP endpoints, Lambda functions, and SQS queues. This means if you want to give your data to another AWS service, you need to put some glue logic in-between. At least a Lambda function, and it will cost extra money.&lt;/p&gt;

&lt;p&gt;But SNS allows configuring a topic as FIFO, which guarantees exactly-once message delivery. This lowers the throughput from about 9,000 msgs/sec to about 3,000 msgs/sec but can reduce the complexity of your Lambda code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't call Lambda from another Lambda
&lt;/h3&gt;

&lt;p&gt;One rule when building serverless systems is: "Don't call a Lambda directly from another Lambda." Events from direct calls can get lost when one of the functions crashes, and one function may end up waiting for the other to finish, which means paying double.&lt;/p&gt;

&lt;p&gt;This direct call rule means you always should put another service between your Lambda function calls. Sometimes these services follow from your use-cases, but when they don't, and you're about to make a direct call, you can grab SNS, EventBridge, or SQS to get around this issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using SNS from Lambda
&lt;/h3&gt;

&lt;p&gt;There are two ways SNS interacts with AWS Lambda: First, Lambda can send an event to an SNS topic, and second, a Lambda can subscribe to an SNS topic and receive events from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sending Events from Lambda to an SNS Topic
&lt;/h3&gt;

&lt;p&gt;To send a message to an SNS topic from your Lambda function, you need the SNS client from the AWS SDK and the ARN of your SNS topic.&lt;/p&gt;

&lt;p&gt;Let's look at an example Lambda that handles API Gateway events:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--67cWnv3Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AgOaJhMleNOiQ20efHqmRhQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--67cWnv3Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AgOaJhMleNOiQ20efHqmRhQ.png" alt="" width="880" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Lambda uses the AWS SDK v3, which is better modularized than v2, leaving more space for your custom code inside a Lambda.&lt;/p&gt;

&lt;p&gt;It's good practice to store the SNS topic ARN in an environment variable, so you can change it without changing the code. Also, you should initialize the SNS client outside of the function handler, so it only happens on a cold start.&lt;/p&gt;

&lt;p&gt;To publish messages, you call the &lt;code&gt;send&lt;/code&gt; method with a &lt;code&gt;PublishCommand&lt;/code&gt; object. The object requires a &lt;code&gt;Message&lt;/code&gt; string, which we get from our API Gateway event body, and the &lt;code&gt;TargetArn&lt;/code&gt;, which we get from an environment variable.&lt;/p&gt;
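&lt;p&gt;Since the code example above is only available as an image, here is a rough sketch of the same flow. The SNS client is injected so the handler stays testable; in a real Lambda you would create an &lt;code&gt;SNSClient&lt;/code&gt; from &lt;code&gt;@aws-sdk/client-sns&lt;/code&gt; outside the handler, wrap the input in a &lt;code&gt;PublishCommand&lt;/code&gt;, and read the topic ARN from an environment variable.&lt;/p&gt;

```typescript
// Sketch of the publish flow described above, with the client injected.
interface PublishInput {
  Message: string;
  TargetArn: string;
}

const makeHandler = (
  snsClient: { send: (input: PublishInput) => any },
  topicArn: string
) =>
  async (apiGatewayEvent: { body: string }) => {
    const result = await snsClient.send({
      Message: apiGatewayEvent.body, // the raw API Gateway request body
      TargetArn: topicArn,           // in real code: process.env.TOPIC_ARN
    });
    return { statusCode: 200, snsMessageId: result.MessageId };
  };
```

&lt;p&gt;Injecting the client like this also makes it easy to unit-test the handler with a fake client instead of a live AWS account.&lt;/p&gt;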

&lt;h3&gt;
  
  
  Receiving SNS Events with Lambda
&lt;/h3&gt;

&lt;p&gt;To receive an SNS event with a Lambda, you need to subscribe your Lambda to an SNS topic. This way, the event that invokes the Lambda will be an SNS message.&lt;/p&gt;

&lt;p&gt;Let's look at how to set things up with the CDK:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mJ210Rv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AvmcSJj34dfSs3Z4u1T_nMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mJ210Rv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AvmcSJj34dfSs3Z4u1T_nMw.png" alt="" width="880" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first crucial part here is that you need to wrap the Lambda function into a subscription so that the CDK can link it up with an SNS topic.&lt;/p&gt;

&lt;p&gt;The second part is that the event your Lambda function receives has its data inside a Records array, so you need to iterate it to get every record.&lt;/p&gt;
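&lt;p&gt;A minimal sketch of that iteration, assuming the standard SNS-to-Lambda event shape with the payload in &lt;code&gt;Sns.Message&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch of the Records iteration described above: each record in an
// SNS-triggered Lambda event carries its payload in Sns.Message.
interface SnsEvent {
  Records: { Sns: { Message: string } }[];
}

function extractMessages(event: SnsEvent): string[] {
  const messages: string[] = [];
  for (const record of event.Records) {
    messages.push(record.Sns.Message);
  }
  return messages;
}

const snsHandler = async (event: SnsEvent) => {
  for (const message of extractMessages(event)) {
    console.log("received:", message);
  }
};
```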

&lt;h3&gt;
  
  
  Piping API Gateway Events to SNS
&lt;/h3&gt;

&lt;p&gt;Using AWS Lambda to glue things together is pretty straightforward but adds complexity and latency and costs extra money. That's why you should do simple integrations directly between services like API Gateway and SNS.&lt;/p&gt;

&lt;p&gt;Let's look at another CDK example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4qNYKiNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AHsVdpl9X97291pAZklBD_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4qNYKiNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2AHsVdpl9X97291pAZklBD_A.png" alt="" width="880" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example uses a third-party library that takes care of the event transformations. Usually, you would use the CDK's &lt;code&gt;AwsIntegration&lt;/code&gt; construct, which requires you to write VTL code that transforms the API Gateway event into something SNS understands. The library comes with some transformations out of the box.&lt;/p&gt;

&lt;p&gt;If you send JSON via a POST request to the &lt;code&gt;/emails&lt;/code&gt; resource of this REST API, API Gateway will pipe that data directly to the SNS topic; no Lambda needed!&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashbird now supports SNS
&lt;/h3&gt;

&lt;p&gt;Now that you've learned that AWS SNS is a crucial part of many serverless systems, you'll be happy to hear that, with its latest update, Dashbird gives you insights into your SNS topics too!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G8W_7YW9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2Ag8L3Ey13_qjB6-fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G8W_7YW9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2Ag8L3Ey13_qjB6-fd.png" alt="" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With its ability to run custom code, Lambda was low-hanging fruit for debugging; you could push all you wanted to know to a monitoring service. But all the other services on AWS are a bit trickier.&lt;/p&gt;

&lt;p&gt;Usually, you would learn about the issues inside other services when Lambda was calling them. But Lambda costs money, and some services, like API Gateway or EventBridge, are perfectly able to transform and distribute events directly to the services where they're needed. No Lambda needed, and that's how it should be! Only use Lambda if it simplifies something or if the direct integrations lack features.&lt;/p&gt;

&lt;p&gt;With Dashbird's new AWS SNS integration, you can now discover what is happening inside your architecture without the need to sprinkle Lambda functions all around the integration points. This saves you money, latency, and complexity!&lt;/p&gt;




&lt;p&gt;Further reading:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/kinesis-sqs-sns-comparison/"&gt;AWS Kinesis vs SNS vs SQS (with Python examples)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/dashbird-now-integrates-with-5-new-aws-services/"&gt;Dashbird now integrates with 5 new AWS services&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashbird.io/blog/triggering-lambda-with-sns-messaging/"&gt;Triggering AWS Lambda with SNS&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>You’ve been thinking of Serverless all wrong!</title>
      <dc:creator>Taavi Rehemägi</dc:creator>
      <pubDate>Thu, 17 Feb 2022 15:52:37 +0000</pubDate>
      <link>https://forem.com/dashbird/youve-been-thinking-of-serverless-all-wrong-1g2o</link>
      <guid>https://forem.com/dashbird/youve-been-thinking-of-serverless-all-wrong-1g2o</guid>
      <description>&lt;p&gt;While working for &lt;a href="https://dashbird.io/"&gt;Dashbird.io&lt;/a&gt; I've had to pleasure to come in contact with a number of serverless early adopters that included both small companies working on apps or just testing ideas as well as fortune 500 companies with an already established user base. What I've found is that a lot of the people I speak to think of serverless as a shortcut to developing software but that's just not the case.&lt;/p&gt;

&lt;p&gt;And yes, you'll hear people tell you things about serverless like: "You don't have to run your own server." Great! "You don't need to worry about maintenance, and security is easier to handle." Even better! "It's also cheaper?" OMG, where has this been all my life? Every small business could benefit from serverless and will cite cost as a key driver toward making the switch, but here's the thing: serverless was not created for small businesses, and cost is only a small reason behind its growing popularity.&lt;/p&gt;

&lt;p&gt;The benefits of serverless come from understanding the technology and knowing where to utilize its many features. Done right, it can prove to be a great asset, especially to enterprises.&lt;/p&gt;

&lt;p&gt;Enterprise companies look for more than just a cut in cost or an easier deployment procedure for their applications. In order to make the jump to a new technology, they look at it not just from a business or technical perspective but from a combination of the two: a perspective that shows the agility it provides for their development team, data privacy, and security, while taking into account all the cost benefits they might get.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coca-Cola North America&lt;/strong&gt; mentioned that before they made the switch to serverless, they talked to their executives about virtualization, scalability, and all the benefits that serverless brings to the table, and the executives couldn't have cared less about any of that. What they cared about were these four things.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agility
&lt;/h3&gt;

&lt;p&gt;The ability to quickly implement and test ideas or deploy new features was a big factor for Coca-Cola.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stability
&lt;/h3&gt;

&lt;p&gt;Their system has to stay up and running regardless of load, changes in infrastructure, or any other third-party factors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;With so much information about their customers, data privacy and security were very high on the list of important things to keep in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost effectiveness
&lt;/h3&gt;

&lt;p&gt;As you could imagine, it needs to make financial sense to invest time and money into switching to a different platform. To read more about how Coca-Cola is using serverless, check out this &lt;a href="https://dashbird.io/blog/serverless-case-study-coca-cola/"&gt;case study&lt;/a&gt;. The research and development department at Coca-Cola HQ liked serverless so much that they made it a rule: any new idea an employee comes forward with has to revolve around serverless or use it over other alternatives.&lt;/p&gt;

&lt;p&gt;As a Fortune 100 company, &lt;strong&gt;Verizon&lt;/strong&gt; switched to serverless almost entirely, but before making the switch, they were looking for a few key features this new technology would cross off their list. To quote &lt;a href="https://twitter.com/_rajdeepsaha"&gt;Rajdeep Saha&lt;/a&gt;, Cloud Architect @Verizon:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most large enterprises have a significant number of stakeholders that must be considered when driving the adoption of new tools and processes. In today's environment, the chief concerns of enterprises include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  adhering to security standards that protect data and customers' privacy&lt;/li&gt;
&lt;li&gt;  improving the velocity of application development to respond quickly to customer needs&lt;/li&gt;
&lt;li&gt;  and promoting the reusability of code to drive operational efficiencies&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not to say small businesses can't benefit tremendously from serverless technology. In fact, small businesses will often see the more obvious benefits, like lower costs and easier security, even faster than big enterprise applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking a dive in the deep end
&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://moonmail.io/"&gt;Moonmail&lt;/a&gt; is a pretty awesome emailing service used by companies like Nespresso, Amazon, Circle, and many more. They switched from using a traditional approach to servers, from using RoR on EC2 to serverless and it's best explained in a &lt;a href="https://dashbird.io/blog/serverless-case-study-coca-cola/"&gt;case study&lt;/a&gt; by &lt;a href="https://twitter.com/ccverak"&gt;Carlos Castellanos&lt;/a&gt; where he details the journey from EC2 to AWS Lambda.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We realized that just wasn't possible with the architecture we were using at the time since it relied too heavily on EC2 and Ruby on Rails. So, we decided to embark on a never-ending journey in perfecting the art and science of sending emails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Serverless was created to fill a void left by traditional servers, a void that became obvious only after we saw how big a difference serverless architecture makes in the apps we use every day. As more people experiment with the technology, they are finding it easier to provide scalable solutions while keeping costs tied more directly to usage, so it shouldn't come as a surprise that enterprises have been enthusiastically switching to serverless. At &lt;a href="https://dashbird.io/"&gt;Dashbird.io&lt;/a&gt;, we help enterprises get observability into their applications, providing actionable insights, alerting them in case of failures, and keeping an eye on costs.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>enterprises</category>
      <category>cloud</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
