Eyal Estrin for AWS Community Builders

Posted on Jun 10 • Originally published at eyal-estrin.Medium

Introduction to Day 2 Serverless Operations – Part 2

#aws #cloud #serverless #devops

In part 1 of this series, I introduced some of the most common Day 2 serverless operations, focusing on Function as a Service.

In this part, I will focus on serverless application integration services commonly used in event-driven architectures.

Specifically, I will look into message queue services, event routing services, and workflow orchestration services.

Message queue services

Message queues enable asynchronous communication between different components in an event-driven architecture (EDA). This means that producers (systems or services generating events) can send messages to the queue and continue their operations without waiting for consumers (systems or services processing events) to respond or be available.

Security and Access Control

Security should always be the priority, as it protects your data, controls access, and ensures compliance from the outset. This includes data protection, limiting permissions, and enforcing least privilege policies.

When using Amazon SQS, manage permissions using AWS IAM policies to restrict access to queues and follow the principle of least privilege, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-examples-of-iam-policies.html#security_iam_id-based-policy-examples
When using Amazon SQS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-server-side-encryption.html
When using Amazon SNS, manage topic policies and IAM roles to control who can publish or subscribe, as explained here: https://docs.aws.amazon.com/sns/latest/dg/security_iam_id-based-policy-examples.html
When using Amazon SNS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-server-side-encryption.html
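As a rough illustration of the SQS items above, the sketch below builds an identity-based policy that only lets a consumer read from one queue, plus the queue attributes that turn on SSE. The queue ARN is hypothetical, and using SQS-managed keys rather than a KMS key is an assumption made to keep the example short.

```python
import json


def consumer_policy(queue_arn: str) -> dict:
    """Identity-based IAM policy: the consumer may only read from one queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": queue_arn,
            }
        ],
    }


def sse_attributes() -> dict:
    """Queue attributes that enable SSE with SQS-managed keys (SSE-SQS)."""
    return {"SqsManagedSseEnabled": "true"}


def enable_sse(sqs_client, queue_url: str) -> None:
    """Apply the SSE attributes; sqs_client is expected to be boto3.client("sqs")."""
    sqs_client.set_queue_attributes(QueueUrl=queue_url, Attributes=sse_attributes())


# Hypothetical ARN, for illustration only.
policy_json = json.dumps(
    consumer_policy("arn:aws:sqs:us-east-1:111122223333:orders"), indent=2
)
```

The policy deliberately leaves out sqs:SendMessage, so a compromised consumer cannot inject messages into the queue.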

Monitoring and Observability

Once security is in place, implement comprehensive monitoring and observability to gain visibility into system health, performance, and failures. This enables proactive detection and response to issues.

When using Amazon SQS, use Amazon CloudWatch to monitor queue metrics such as the number of visible messages (ApproximateNumberOfMessagesVisible), the number of in-flight messages (ApproximateNumberOfMessagesNotVisible), and the age of the oldest message (ApproximateAgeOfOldestMessage), as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/monitoring-using-cloudwatch.html and here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
When using Amazon SQS, set up CloudWatch alarms for thresholds (e.g., high message backlog or processing latency), as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/set-cloudwatch-alarms-for-metrics.html
When using Amazon SNS, use CloudWatch to track message delivery status, failure rates, and subscription metrics, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-monitoring-using-cloudwatch.html
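To make the alarm recommendation concrete, here is a minimal sketch of a CloudWatch alarm on the age of the oldest message in a queue. The alarm name, the 5-minute period, the two evaluation periods, and the notification topic are all assumptions to adjust for your workload.

```python
def backlog_age_alarm(queue_name: str, max_age_seconds: int, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a stale-backlog alarm on one queue."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # assumed 5-minute evaluation window
        "EvaluationPeriods": 2,   # assumed: two consecutive breaches before alarming
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, alarm: dict) -> None:
    """Create or update the alarm; expects boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_alarm(**alarm)
```

A growing message age usually means consumers are falling behind or failing, which makes it a useful early signal alongside the backlog size.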

Error Handling

With monitoring established, set up robust error handling mechanisms, including alerts, retries, and dead-letter queues, to ensure reliability and rapid remediation of failures.

When using Amazon SQS, configure Dead Letter Queues (DLQs) to capture messages that fail processing repeatedly for later analysis and remediation, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
When using Amazon SNS, integrate with DLQs (using SQS as a DLQ) for messages that cannot be delivered to endpoints, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-dead-letter-queues.html
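The two DLQ setups above can be sketched as follows. Both use a RedrivePolicy attribute whose value is a JSON string. The default of five receive attempts is an assumption, not a recommendation from the AWS documentation.

```python
import json


def sqs_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Source-queue attributes: move a message to the DLQ after N failed receives."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    return {"RedrivePolicy": json.dumps(policy)}


def sns_redrive_attribute(dlq_arn: str) -> str:
    """Subscription attribute value: send undeliverable messages to the SQS DLQ."""
    return json.dumps({"deadLetterTargetArn": dlq_arn})


def attach_dlqs(sqs_client, sns_client, queue_url: str, subscription_arn: str, dlq_arn: str) -> None:
    """Apply both policies; clients are expected to be boto3 "sqs" and "sns" clients."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url, Attributes=sqs_redrive_attributes(dlq_arn)
    )
    sns_client.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName="RedrivePolicy",
        AttributeValue=sns_redrive_attribute(dlq_arn),
    )
```

For the SNS case, the DLQ also needs a queue policy that allows the SNS service to send messages to it; that policy is not shown here.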

Scaling and Performance

After ensuring security, visibility, and error resilience, focus on scaling and performance. Monitor throughput, latency, and resource utilization, and configure auto-scaling to match demand efficiently.

When using Amazon SQS, adjust queue settings or consumer concurrency as traffic patterns change, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/best-practices-message-processing.html
When using Amazon SNS, monitor usage for unexpected spikes, as explained here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Best_Practice_Recommended_Alarms_AWS_Services.html#SNS
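On the SQS consumer side, one polling pass that combines long polling with batch deletes might look like the sketch below. The handler callback and the function name drain_once are hypothetical; the batch size of 10 and the 20-second wait are the maximum values SQS accepts for those parameters.

```python
from typing import Callable


def drain_once(sqs_client, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive one batch with long polling, process it, delete only the successes."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # largest batch a single receive call can return
        WaitTimeSeconds=20,      # long polling reduces empty responses
    )
    processed = []
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Leave the message in the queue; it becomes visible again after
            # the visibility timeout and can eventually reach a DLQ.
            continue
        processed.append(
            {"Id": message["MessageId"], "ReceiptHandle": message["ReceiptHandle"]}
        )
    if processed:
        sqs_client.delete_message_batch(QueueUrl=queue_url, Entries=processed)
    return len(processed)
```

Running more copies of this loop in parallel is the simplest way to raise consumer concurrency as traffic grows.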

Maintenance

Finally, establish ongoing maintenance routines such as regular reviews, updates, cost optimization, and compliance audits to sustain operational excellence and adapt to evolving needs.

When using Amazon SQS, purge queues as needed and archive messages if required for compliance, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-using-purge-queue.html
When using Amazon SNS, review and clean up unused topics and subscriptions, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-delete-subscription-topic.html
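For the SNS cleanup item, a minimal sketch of a review helper is shown below. It lists topics that currently have no subscriptions at all. Treating "no subscriptions" as "unused" is an assumption; a topic may still be published to, so the result is a list of candidates to review, not to delete automatically.

```python
def topics_without_subscriptions(sns_client) -> list:
    """Return the ARNs of topics that have no subscriptions, sorted for stable output."""
    topic_arns = set()
    for page in sns_client.get_paginator("list_topics").paginate():
        for topic in page["Topics"]:
            topic_arns.add(topic["TopicArn"])

    subscribed_arns = set()
    for page in sns_client.get_paginator("list_subscriptions").paginate():
        for subscription in page["Subscriptions"]:
            subscribed_arns.add(subscription["TopicArn"])

    return sorted(topic_arns - subscribed_arns)
```

The sns_client argument is expected to be boto3.client("sns"); the function only reads, so it is safe to run on a schedule.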

Event routing services

Event routing services act as the central hub in event-driven architectures, receiving events from producers and distributing them to the appropriate consumers. This decouples producers from consumers, allowing each to operate, scale, and fail independently without direct awareness of each other.

Monitoring and Observability

Serverless event routing services require robust monitoring and observability to track event flows, detect anomalies, and ensure system health. This is typically achieved through metrics, logs, and dashboards that provide real-time visibility into event processing and failures.

When using Amazon EventBridge, set up CloudWatch metrics and logs to monitor event throughput, failures, latency, and rule matches. Use CloudWatch Alarms to alert on anomalies or failures in event delivery, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-monitoring.html

Error Handling and Dead-Letter Management

Effective error handling uses mechanisms like retries and circuit breakers to manage transient failures, while dead-letter queues (DLQs) capture undelivered or failed events for later analysis and remediation, preventing data loss and supporting troubleshooting.

When using Amazon EventBridge, configure a dead-letter queue (DLQ) for failed event deliveries. Set retry policies and monitor the DLQ for undelivered events so that no data is lost, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
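In EventBridge, the DLQ and retry policy are set per rule target. The sketch below builds such a target and attaches it to a rule. The defaults of three retries and a one-hour maximum event age are assumptions chosen for illustration.

```python
def target_with_dlq(
    target_id: str,
    target_arn: str,
    dlq_arn: str,
    max_retries: int = 3,
    max_event_age_seconds: int = 3600,
) -> dict:
    """Build one put_targets entry with a dead-letter queue and a retry policy."""
    if not 0 <= max_retries <= 185:
        raise ValueError("max_retries must be between 0 and 185")
    if not 60 <= max_event_age_seconds <= 86400:
        raise ValueError("max_event_age_seconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "DeadLetterConfig": {"Arn": dlq_arn},
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_event_age_seconds,
        },
    }


def attach_target(events_client, rule_name: str, target: dict, event_bus_name: str = "default") -> None:
    """Attach the target; expects boto3.client("events"). Raises if any entry fails."""
    response = events_client.put_targets(
        Rule=rule_name, EventBusName=event_bus_name, Targets=[target]
    )
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"put_targets failed: {response.get('FailedEntries')}")
```

Checking FailedEntryCount matters because put_targets can return successfully even when an individual target was rejected.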

Security and Access Management

Security and access management involve configuring fine-grained permissions to control which users and services can publish, consume, or manage events, ensuring that only authorized entities interact with event routing resources and that sensitive data remains protected.

When using Amazon EventBridge, review and update IAM policies for event buses, rules, and targets. Use resource-based policies to restrict who can publish or subscribe to events, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-manage-iam-access.html https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-use-resource-based.html
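As an example of a resource-based policy, the sketch below allows exactly one other AWS account to publish events to a custom event bus. The account ID, the bus ARN, and the statement ID format are hypothetical.

```python
import json


def cross_account_put_events_policy(bus_arn: str, account_id: str) -> str:
    """Resource-based policy: one named account may call PutEvents on this bus."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"AllowPutEventsFrom{account_id}",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
        ],
    }
    return json.dumps(policy)


def apply_policy(events_client, bus_name: str, policy_json: str) -> None:
    """Attach the policy to the bus; expects boto3.client("events")."""
    events_client.put_permission(EventBusName=bus_name, Policy=policy_json)
```

Naming the publishing account explicitly, instead of using a wildcard principal, keeps the bus closed to everyone else by default.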

Scaling and Performance

Serverless platforms automatically scale event routing services in response to workload changes, adding capacity during spikes and releasing it during lulls. Performance optimization remains your responsibility: tune event patterns, batching, and concurrency settings to minimize latency and maximize throughput.

When using Amazon EventBridge, monitor event throughput and adjust quotas or request service limit increases as needed. Optimize event patterns and rules for efficiency, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html
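One way to read "optimize event patterns" is to make each pattern as narrow as the consumer allows, so that fewer events match and fewer targets are invoked. The sketch below shows such a pattern; the source name, detail type, and status field are hypothetical.

```python
import json


def paid_order_pattern() -> str:
    """Event pattern that matches only paid-order events from one producer."""
    pattern = {
        "source": ["com.example.orders"],   # hypothetical producer name
        "detail-type": ["OrderCreated"],    # hypothetical event type
        "detail": {"status": ["PAID"]},     # filter on a field inside the event body
    }
    return json.dumps(pattern)


def create_rule(events_client, rule_name: str, pattern_json: str, event_bus_name: str = "default") -> None:
    """Create or update the rule; expects boto3.client("events")."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=pattern_json,
        EventBusName=event_bus_name,
        State="ENABLED",
    )
```

Filtering on the detail field in the rule, rather than inside the target, avoids paying for invocations that would be discarded anyway.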

Workflow orchestration services

Workflow services are designed to coordinate and manage complex sequences of tasks or business processes that involve multiple steps and services. They act as orchestrators, ensuring each step in a process is executed in the correct order, handling transitions, and managing dependencies between steps.

Monitoring and Observability

Set up and review monitoring dashboards, logs, and alerts to ensure workflows are running correctly and to quickly detect anomalies or failures.

When using AWS Step Functions, monitor executions, check logs, and set up CloudWatch metrics and alarms to ensure workflows run as expected, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/monitoring-logging.html
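A minimal sketch of turning on execution logging for an existing state machine is shown below. Logging only errors and excluding execution data are assumptions that keep log volume and data exposure low; the log group ARN in the comment is hypothetical.

```python
def error_logging_configuration(log_group_arn: str) -> dict:
    """Logging configuration that records failed steps without payload data."""
    return {
        "level": "ERROR",               # other levels: ALL, FATAL, OFF
        "includeExecutionData": False,  # keep input/output out of the logs
        "destinations": [
            {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}
        ],
    }


def enable_logging(sfn_client, state_machine_arn: str, log_group_arn: str) -> None:
    """Apply the configuration; expects boto3.client("stepfunctions")."""
    # Example (hypothetical) log group ARN:
    # arn:aws:logs:us-east-1:111122223333:log-group:/sfn/orders:*
    sfn_client.update_state_machine(
        stateMachineArn=state_machine_arn,
        loggingConfiguration=error_logging_configuration(log_group_arn),
    )
```

The state machine's execution role also needs permission to deliver logs to CloudWatch Logs, which is not covered in this sketch.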

Error Handling and Retry

Investigate failed workflow executions, enhance error handling logic (such as retries and catch blocks), and resubmit failed runs where appropriate. This is crucial for maintaining workflow reliability and minimizing manual intervention.

When using AWS Step Functions, review failed executions, configure retry/catch logic, and update workflows to handle errors gracefully, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
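Retry and catch logic lives in the state machine definition itself, written in Amazon States Language. The sketch below generates a two-state definition: a task that retries with exponential backoff, and a fallback state that runs once retries are exhausted. The state names, error name, and retry values are hypothetical.

```python
import json


def definition_with_retry(task_arn: str) -> str:
    """Return a state machine definition whose task retries, then falls back."""
    definition = {
        "StartAt": "ProcessOrder",
        "States": {
            "ProcessOrder": {
                "Type": "Task",
                "Resource": task_arn,
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 2,   # first wait before retrying
                        "MaxAttempts": 3,       # retries after the initial attempt
                        "BackoffRate": 2.0,     # waits of 2s, 4s, 8s
                    }
                ],
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",  # keep the input, add error details
                        "Next": "OrderFailed",
                    }
                ],
                "End": True,
            },
            "OrderFailed": {
                "Type": "Fail",
                "Error": "OrderProcessingFailed",
                "Cause": "Task failed after all retries",
            },
        },
    }
    return json.dumps(definition, indent=2)
```

The retrier handles transient failures, while the catcher makes sure that any error left over ends in a named failure state that is easy to alarm on.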

Security and Access Management

Workflow orchestration services require continuous enforcement of granular access controls and the principle of least privilege, ensuring that each function and workflow has only the permissions necessary for its specific tasks.

When using AWS Step Functions, use AWS Identity and Access Management (IAM) for fine-grained control over who can access and manage workflows, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/auth-and-access-control-sfn.html
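As a small example of least privilege for workflows, the sketch below builds an identity-based policy that lets a caller start executions of a single state machine and nothing else. The state machine ARN is hypothetical.

```python
import json


def start_execution_policy(state_machine_arn: str) -> dict:
    """Identity-based IAM policy: the caller may only start this one workflow."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "states:StartExecution",
                "Resource": state_machine_arn,
            }
        ],
    }


# Hypothetical ARN, for illustration only.
policy_json = json.dumps(
    start_execution_policy(
        "arn:aws:states:us-east-1:111122223333:stateMachine:orders"
    ),
    indent=2,
)
```

Permissions to update or delete the state machine are left out on purpose, so operators and callers can be given separate policies.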

Versioning and Updates

Workflow orchestration services use versioning to track and manage different iterations of workflows or services, allowing multiple versions to coexist and enabling users to select, test, or revert to specific versions as needed.

When using AWS Step Functions, update state machines, manage versions, and test changes before deploying to production, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-state-machine-version.html
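Versions become most useful when combined with an alias that splits traffic between them. The sketch below computes a routing configuration that sends a small share of executions to a new version. The function names and the idea of starting at a low percentage are assumptions, not a prescribed rollout process.

```python
def canary_routing(stable_version_arn: str, new_version_arn: str, new_weight: int) -> list:
    """Split alias traffic between two versions; weights always sum to 100."""
    if not 0 < new_weight < 100:
        raise ValueError("new_weight must be between 1 and 99")
    return [
        {"stateMachineVersionArn": stable_version_arn, "weight": 100 - new_weight},
        {"stateMachineVersionArn": new_version_arn, "weight": new_weight},
    ]


def shift_traffic(sfn_client, alias_arn: str, routing: list) -> None:
    """Point the alias at the new split; expects boto3.client("stepfunctions")."""
    sfn_client.update_state_machine_alias(
        stateMachineAliasArn=alias_arn, routingConfiguration=routing
    )
```

If the new version misbehaves, reverting is a single alias update that routes all executions back to the stable version.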

Cost Optimization

Regularly review usage and billing data, optimize workflow design (e.g., reduce unnecessary steps or external calls), and adjust resource allocation to control operational costs.

When using AWS Step Functions, analyze usage and optimize workflow design to reduce execution and resource costs, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/sfn-best-practices.html#cost-opt-exp-workflows
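Because Standard Workflows are billed per state transition, the effect of removing steps can be estimated with simple arithmetic. The sketch below does that; the price is passed in as a parameter and must be taken from the current AWS pricing page for your region, since no real price is assumed here.

```python
def standard_workflow_cost(
    executions: int, transitions_per_execution: int, price_per_transition: float
) -> float:
    """Estimated transition cost for a Standard Workflow over some period."""
    return executions * transitions_per_execution * price_per_transition


def savings_from_fewer_steps(
    executions: int, transitions_before: int, transitions_after: int, price_per_transition: float
) -> float:
    """Cost difference when a workflow is redesigned to use fewer transitions."""
    before = standard_workflow_cost(executions, transitions_before, price_per_transition)
    after = standard_workflow_cost(executions, transitions_after, price_per_transition)
    return before - after
```

For high-volume, short-lived workflows, the same comparison is worth repeating against Express Workflows, which are billed by request count and duration instead.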

Summary

In this blog post, I presented the most common Day 2 serverless operations when using application integration services (message queue services, event routing services, and workflow orchestration services) to build modern applications.

I looked at aspects such as security, observability, error handling, scaling and performance, versioning, and cost optimization.

Building event-driven architectures takes time, since you need to learn which services best support this approach. A foundational understanding of the key areas covered here is essential for effective Day 2 serverless operations.

About the author

Eyal Estrin is a cloud and information security architect, an AWS Community Builder, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 25 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.
