<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sourabh Choubey</title>
    <description>The latest articles on Forem by Sourabh Choubey (@sourabh_choubey_e420cf9b0).</description>
    <link>https://forem.com/sourabh_choubey_e420cf9b0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3716538%2F18b7fee0-2926-4905-9273-345736846b70.jpg</url>
      <title>Forem: Sourabh Choubey</title>
      <link>https://forem.com/sourabh_choubey_e420cf9b0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sourabh_choubey_e420cf9b0"/>
    <language>en</language>
    <item>
      <title>From Log Hunting to AI-Powered Insights: Building Event-Driven Observability (Part 2)</title>
      <dc:creator>Sourabh Choubey</dc:creator>
      <pubDate>Sun, 15 Feb 2026 17:53:52 +0000</pubDate>
      <link>https://forem.com/aws-builders/from-log-hunting-to-ai-powered-insights-building-event-driven-observability-part-2-3ncd</link>
      <guid>https://forem.com/aws-builders/from-log-hunting-to-ai-powered-insights-building-event-driven-observability-part-2-3ncd</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/building-event-driven-observability-on-aws-serverless-part-1-2m6j"&gt;Part 1&lt;/a&gt;, we laid the groundwork. We moved away from "random console.logs" and embraced structured logging and automated alarms. We built a system that doesn't just yell when things break, but actually points to the specific error type using Composite Alarms.&lt;br&gt;
But let's be honest: an alarm telling you "Validation Error" is only half the battle. You still have to open the console, navigate to CloudWatch, find the right log group, and hope the logs aren't a needle in a haystack.&lt;br&gt;
In Part 2, we’re finishing the job. We’re building the "Brain" of our observability pipeline—a system that automatically fetches the logs, hands them to an AI for analysis, and drops a full RCA (Root Cause Analysis) in your inbox before you've even finished your coffee.&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. The Architecture&lt;/li&gt;
&lt;li&gt;2. The Problem&lt;/li&gt;
&lt;li&gt;3. Why Nova Premier&lt;/li&gt;
&lt;li&gt;
4. Real Scenarios

&lt;ul&gt;
&lt;li&gt;Scenario 1: Invalid Payload&lt;/li&gt;
&lt;li&gt;Scenario 2: IAM Permission Missing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;5. How It Works&lt;/li&gt;
&lt;li&gt;6. Industry Context&lt;/li&gt;
&lt;li&gt;7. Why Build This&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Architecture: Turning Alerts into Answers
&lt;/h2&gt;

&lt;p&gt;The core of our Part 2 update is the RCA State Machine. Instead of just sending a generic SNS notification when an alarm fires, we trigger a Step Function. This workflow handles the heavy lifting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;The Fetcher&lt;/em&gt;: A Lambda function that takes the time-window from the alarm and queries CloudWatch Logs Insights. It doesn't just grab everything; it filters specifically for the &lt;code&gt;requestId&lt;/code&gt; and &lt;code&gt;errorType&lt;/code&gt; that triggered the alarm.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Parallel Execution (The "Speed vs. Depth" Trade-off)&lt;/em&gt;: We don't want to wait for the AI to think. We run two branches in parallel:

&lt;ul&gt;
&lt;li&gt;Immediate Notification: Sends the raw error logs to SNS instantly.&lt;/li&gt;
&lt;li&gt;AI Analysis: Sends those same logs to Amazon Bedrock (Nova model) to 
generate a structured report.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;The Delivery&lt;/em&gt;: SES or SNS delivers the final verdict.
Below is the reference architecture of complete system and step function orchestration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Current architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnk4txv38x18kv2abink5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnk4txv38x18kv2abink5.png" alt="Architecture" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step function workflow for log ingestion and analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobgauve7dx3roh9hk6xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobgauve7dx3roh9hk6xf.png" alt="step_function_RCA_pipeline" width="800" height="864"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Real Problem: Raw Logs Aren't Enough
&lt;/h2&gt;

&lt;p&gt;Let me tell you why this matters for your actual business.&lt;/p&gt;

&lt;p&gt;You deploy an order service that accepts JSON payloads. Everything works. Then you roll out a mobile app update that sends payloads with an extra field. Your backend validation expects the old format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 500 validation errors in 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get from Part 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Invalid payload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15T10:49:47.081Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CreateOrder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c84b0040-2c8f-446b-927d-e4dae0026cc0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"traceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-6991a4ca-0c519cd0195647c334893e72"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errorCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INVALID_PAYLOAD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retryable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK, validation error. But now what? Is it a timeout? A schema mismatch? A missing field? Are we sure it's the mobile app? Should you rollback the deployment or work with the mobile team?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You receive the alarm notification&lt;/li&gt;
&lt;li&gt;You open your laptop and check the logs&lt;/li&gt;
&lt;li&gt;You read: "Invalid payload"&lt;/li&gt;
&lt;li&gt;But what changed? Is it the app? The API? A client library?&lt;/li&gt;
&lt;li&gt;You send a Slack message asking the team&lt;/li&gt;
&lt;li&gt;You wait for a response while issues pile up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This is the gap.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if the system just said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"500 validation errors in last 10 minutes. All matching INVALID_PAYLOAD pattern. Analysis: 87% of requests missing 'currency' field. Pattern correlates with mobile app v2.5.0 deployed at 15:55 UTC. Likely cause: client-side schema change. Recommended action: contact mobile team or rollback app version."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's 90 seconds vs. 20+ minutes.&lt;/p&gt;

&lt;p&gt;Enter: &lt;strong&gt;AI-powered RCA with Amazon Bedrock Nova Premier.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Bedrock with Amazon Nova Premier (And Not Claude)?
&lt;/h2&gt;

&lt;p&gt;We aren't just using AI because it's a buzzword. We're using it to solve context fatigue. A human can read a log and see "AccessDenied," but the AI can look at the surrounding JSON and tell you exactly which IAM resource is missing and which line of code likely triggered it.&lt;br&gt;
I originally planned to use Claude Sonnet Models. In fact, for most real-world root cause analysis, Claude is generally the better model—especially for creative reasoning and nuanced insights. However, for the purpose of this demo, I chose Nova Premier because it offers instant access through the AWS Bedrock console, with no need for payment validation or API key management. This makes it much easier for anyone to try out the workflow without extra setup or credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then I discovered Amazon Nova Premier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS released Nova Premier in 2024 as a foundational model built specifically for enterprise workloads. Here's why it's actually the best choice for &lt;em&gt;this particular job&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Instant access&lt;/strong&gt; - Request model access in Bedrock console, approved in seconds (literally)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;No friction&lt;/strong&gt; - Works with your existing AWS account and IAM&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Structured analysis&lt;/strong&gt; - Better at analyzing data than at creative writing&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Native integration&lt;/strong&gt; - Works directly from Step Functions, no external API management - functionless integration.&lt;/p&gt;

&lt;p&gt;The honest trade-off: Claude is slightly better at "creative insights" and nuanced reasoning. Nova Premier is better at "here's what the data shows" analysis. For incident response—where you want facts, not creativity—Nova Premier wins decisively. Also for the purposes of this post i wanted to keep simple. &lt;/p&gt;


&lt;h2&gt;
  
  
  What Actually Happens When Your System Fails: Real Scenarios
&lt;/h2&gt;

&lt;p&gt;Let me walk you through three real failure modes and show you exactly what Nova Premier sees.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Invalid Payload (VALIDATION Error)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Alarm Triggers&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Your order endpoint hit a validation error. Raw logs arrive instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw Logs Email based on failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subject: &lt;code&gt;🚨 Order Service Alarm: order-failure-any-dev&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alarmName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-failure-any-dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1771158189&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1771158609&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fromReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15T12:23:09.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"toReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15T12:30:09.000Z"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"@timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15 12:29:13.972"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"@message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;level&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ERROR&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Invalid payload&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-15T12:29:13.960Z&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;service&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;order-service&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;sampling_rate&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:0,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;operation&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;CreateOrder&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;requestId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;3d1165b3-e703-483b-b17b-401a317fea4f&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;traceId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Root=1-6991bc19-4db3d9b74eb86c7e385303f3;Parent=33ae7f3cf28ef823;Sampled=0&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;orderId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ord_e1f5fa9f-0d49-4fdd-a345-06863935bc99&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;errorType&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;VALIDATION&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;errorCode&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;INVALID_PAYLOAD&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;retryable&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:false,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;meta&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;eventType&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;OrderCreateFailed&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3d1165b3-e703-483b-b17b-401a317fea4f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"traceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-6991bc19-4db3d9b74eb86c7e385303f3;Parent=33ae7f3cf28ef823;Sampled=0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What You See (Raw):&lt;/strong&gt; "Invalid payload error. What changed? Where should I look?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8q33sjopf7kut76kd2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8q33sjopf7kut76kd2l.png" alt="Notification Email- first email based on alarms being triggered" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Analysis Email&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subject: &lt;code&gt;🤖 AI RCA Analysis: order-failure-any-dev&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Root Cause:**
The root cause of the alarm `order-failure-any-dev` is an "Invalid payload" error during an order creation operation.

**Evidence:**
- The log entry at `2026-02-15 12:29:13.972` shows an `ERROR` level message: `"Invalid payload"`.
- The error type is `VALIDATION` with the error code `INVALID_PAYLOAD`.
- The operation involved is `CreateOrder`.

**Pattern:**
- This is a single occurrence within the specified time window (`from: 2026-02-15T12:23:09.000Z` to `to: 2026-02-15T12:30:09.000Z`).
- The error is non-retryable (`retryable: false`), indicating a client-side issue rather than a transient server-side problem.

**Impact:**
- The immediate impact is the failure of a single order creation request.
- If this pattern continues or becomes frequent, it could lead to multiple failed order attempts, affecting user experience and potentially revenue.

**Actions:**
1. **Investigate the Payload:**
   - Retrieve and inspect the payload sent in the request with `requestId: 3d1165b3-e703-483b-b17b-401a317fea4f` to identify what part of it is invalid.
   - Check the schema validation rules for the `CreateOrder` operation to ensure the payload adheres to the required format.

2. **Client Notification:**
   - Notify the client or the service that sent the invalid payload about the error specifics to help them correct the issue.

3. **Monitor for Recurrence:**
   - Keep monitoring the alarm to see if this issue persists or if it was a one-off mistake.

**Prevention:**
1. **Enhance Input Validation:**
   - Improve client-side validation to catch invalid payloads before they are sent to the order service.

2. **Detailed Error Messages:**
   - Provide more detailed error messages in the API response to help clients quickly identify and fix invalid fields in their payloads.

3. **Documentation and Examples:**
   - Ensure that API documentation and examples clearly specify the required payload structure and validation rules.

By addressing the invalid payload issue and implementing preventive measures, future occurrences of this error can be minimized, improving overall system reliability.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Real Difference:&lt;/strong&gt; You don't waste time guessing. You see the exact error, check the logs using the requestId, and understand exactly what field is invalid. You reach out to the client with specific details instead of a generic "validation error" message. Problem solved in 5 minutes instead of 30.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA Email received&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn9snwmnk2bp6u63suqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn9snwmnk2bp6u63suqo.png" alt="Email from analysis done by Nova models based on logs fetched" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Scenario 2: IAM Permission Missing (INFRASTRUCTURE Error)
&lt;/h3&gt;

&lt;p&gt;This is the sneaky one. Your Lambda works fine in testing... except it can't write to DynamoDB in production because an IAM permission is missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw Logs Email based on failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subject: &lt;code&gt;🚨 Order Service Alarm: order-failure-any-dev&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alarmName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-failure-any-dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UNKNOWN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1771159259&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1771159559&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fromReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15T12:40:59.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"toReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15T12:45:59.000Z"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"@timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-15 12:44:03.442"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"@message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;level&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ERROR&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;User: arn:aws:sts::&amp;lt;ACCOUNT-ID&amp;gt;:assumed-role/OrderServiceRole/order-service is not authorized to perform: dynamodb:PutItem on resource: arn:aws:dynamodb:us-east-1:&amp;lt;ACCOUNT-ID&amp;gt;:table/OrdersTable because no identity-based policy allows the dynamodb:PutItem action&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-15T12:44:03.442Z&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;service&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;order-service&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;sampling_rate&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:0,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;operation&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;CreateOrder&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;requestId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;6e429478-366c-4d5e-ba2b-d9438ded51ff&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;traceId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Root=1-6991bf92-1749a4a35ce9175c57bbc681;Parent=3164d71159d30ff5;Sampled=0;Lineage=1:141a50ec:0&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;orderId&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ord_f95357af-2dc3-47e7-bbf7-e16e657c461b&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;errorType&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;UNKNOWN&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;errorCode&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;UNHANDLED_EXCEPTION&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6e429478-366c-4d5e-ba2b-d9438ded51ff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"traceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-6991bf92-1749a4a35ce9175c57bbc681;Parent=3164d71159d30ff5;Sampled=0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UNKNOWN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What You See (Raw):&lt;/strong&gt; "Access denied on DynamoDB? Did someone change the IAM role? Did I deploy a bad permission?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pzzf1fj7duocjglyau9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pzzf1fj7duocjglyau9.png" alt="Notification email with logs stating miss in IAM permission" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Analysis Email&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subject: &lt;code&gt;🤖 AI RCA Analysis: order-failure-any-dev&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Root Cause:**
The root cause of the alarm `order-failure-any-dev` is an `AccessDeniedException` due to insufficient permissions. Specifically, the IAM role `&amp;lt;ROLE-NAME&amp;gt;` lacks the necessary `dynamodb:PutItem` permission on the DynamoDB table `&amp;lt;TABLE-NAME&amp;gt;`.

**Evidence:**
The log entry explicitly states:
"message":"User: arn:aws:sts::&amp;lt;ACCOUNT-ID&amp;gt;:assumed-role/&amp;lt;ROLE-NAME&amp;gt;/&amp;lt;SERVICE-NAME&amp;gt; is not authorized to perform: dynamodb:PutItem on resource: arn:aws:dynamodb:us-east-1:&amp;lt;ACCOUNT-ID&amp;gt;:table/&amp;lt;TABLE-NAME&amp;gt; because no identity-based policy allows the dynamodb:PutItem action"

**Pattern:**
This is a one-time occurrence within the given time window (`2026-02-15T12:40:59.000Z` to `2026-02-15T12:45:59.000Z`), as indicated by `"count":1`.

**Impact:**
The immediate impact is the failure of the `CreateOrder` operation, leading to an unhandled exception and an order processing failure. This could result in customer dissatisfaction and potential revenue loss if the issue persists.

**Actions:**
1. **Immediate Fix:** Update the IAM role `&amp;lt;ROLE-NAME&amp;gt;` to include the `dynamodb:PutItem` permission for the specific DynamoDB table.

   Example policy statement:
   {
       "Effect": "Allow",
       "Action": "dynamodb:PutItem",
       "Resource": "arn:aws:dynamodb:us-east-1:&amp;lt;ACCOUNT-ID&amp;gt;:table/&amp;lt;TABLE-NAME&amp;gt;"
   }

2. **Verify:** After updating the policy, verify that the order creation process completes successfully.

**Prevention:**
1. **Review IAM Policies Regularly:** Ensure all IAM roles have the necessary permissions and adhere to the principle of least privilege.
2. **Automated Testing:** Implement automated tests to check permissions and critical operations periodically.
3. **Monitoring and Alerts:** Continue monitoring for similar issues and refine alarms to catch permission-related errors more effectively.

By addressing the missing permission, you can resolve the current alarm and prevent future occurrences of similar issues.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Critical Difference:&lt;/strong&gt; Instead of spending 30 minutes debugging the database or networking, you immediately know it's an IAM issue. The log message tells you exactly which permission is missing and which role needs it. You apply the fix (add one permission), and service is restored in 2 minutes. Without the AI analysis pulling the actionable details from the error message, this could take 30+ minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bkwxkwg4fqv1vnhqz8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bkwxkwg4fqv1vnhqz8k.png" alt="RCA Email based on logs fetched and analysis" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Actually Works: The Step Functions Architecture
&lt;/h2&gt;

&lt;p&gt;The secret is &lt;strong&gt;parallel execution&lt;/strong&gt;. Here's the complete workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvu4mt4mvmctr8l5zd8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvu4mt4mvmctr8l5zd8i.png" alt="StepFunction workflow" width="800" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;StepFunction ASL script&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Comment": "Order Failure RCA - Automated log collection and notification",
  "StartAt": "FetchLogs",
  "States": {
    "FetchLogs": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${FETCH_LOGS_LAMBDA_ARN}",
        "Payload.$": "$"
      },
      "ResultPath": "$.rca",
      "ResultSelector": {
        "Payload.$": "$.Payload"
      },
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "SendFailureEmail"
        }
      ],
      "Next": "ParallelNotifyAndAnalyze"
    },
    "ParallelNotifyAndAnalyze": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "SendImmediateEmail",
          "States": {
            "SendImmediateEmail": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "${ALARM_EMAIL_TOPIC_ARN}",
                "Subject.$": "States.Format('Order Service Alarm: {}', $.detail.alarmName)",
                "Message.$": "States.JsonToString($)"
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "ConvertLogsToString",
          "States": {
            "ConvertLogsToString": {
              "Type": "Pass",
              "Parameters": {
                "logsJson.$": "States.JsonToString($.rca.Payload)"
              },
              "ResultPath": "$.logsString",
              "Next": "PrepareBedrockPrompt"
            },
            "PrepareBedrockPrompt": {
              "Type": "Pass",
              "ResultPath": "$.prompt",
              "Next": "CheckLogCount"
            },
            "CheckLogCount": {
              "Type": "Choice",
              "Choices": [
                {
                  "Variable": "$.rca.Payload.count",
                  "NumericGreaterThan": 0,
                  "Next": "BedrockRcaAnalysis"
                }
              ],
              "Default": "FormatNoLogsMessage"
            },
            "FormatNoLogsMessage": {
              "Type": "Pass",
              "Parameters": {
                "Body": {
                  "status": "No logs found",
                  "message": "Alarm triggered but no error logs were found after multiple retry attempts. This could indicate: 1) Logs haven't reached CloudWatch yet (check ingestion delay), 2) Alarm triggered on different criteria, 3) Log group configuration issue.",
                  "alarmName.$": "$.detail.alarmName",
                  "timeWindow.$": "$.rca.Payload.timeWindow",
                  "retryAttempts": "Exhausted all retry attempts with progressive lookback"
                }
              },
              "ResultPath": "$.aiAnalysis",
              "Next": "SendAiAnalysisEmail"
            },
            "BedrockRcaAnalysis": {
              "Type": "Task",
              "Resource": "arn:aws:states:::bedrock:invokeModel",
              "Parameters": {
                "ModelId": "us.amazon.nova-premier-v1:0",
                "Body": {
                  "messages": [
                    {
                      "role": "user",
                      "content": [
                        {
                          "text.$": "States.Format('You are an expert SRE. Analyze this alarm and logs. Alarm: {}. State: {}. Time: {}. Logs: {}. Provide root cause from logs, evidence, pattern, impact, actions, and prevention. Base analysis ONLY on actual log data.', $.detail.alarmName, $.detail.state.value, $.detail.state.timestamp, $.logsString.logsJson)"
                        }
                      ]
                    }
                  ],
                  "inferenceConfig": {
                    "max_new_tokens": 3000,
                    "temperature": 0.2,
                    "topP": 0.9
                  }
                },
                "ContentType": "application/json",
                "Accept": "*/*"
              },
              "ResultPath": "$.aiAnalysis",
              "ResultSelector": {
                "Body.$": "$.Body"
              },
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "ResultPath": "$.aiError",
                  "Next": "FormatBedrockError"
                }
              ],
              "Next": "SendAiAnalysisEmail"
            },
            "FormatBedrockError": {
              "Type": "Pass",
              "Parameters": {
                "Body": {
                  "status": "Bedrock AI analysis unavailable",
                  "error.$": "$.aiError.Error",
                  "instructions": "To enable AI analysis: Go to AWS Console &amp;gt; Bedrock &amp;gt; Model access &amp;gt; Request access to Claude . Approval is usually instant.",
                  "logs.$": "$.rca.Payload.logs"
                }
              },
              "ResultPath": "$.aiAnalysis",
              "Next": "SendAiAnalysisEmail"
            },
            "SendAiAnalysisEmail": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "${ALARM_EMAIL_TOPIC_ARN}",
                "Subject.$": "States.Format('AI RCA Analysis: {}', $.detail.alarmName)",
                "Message.$": "$.aiAnalysis.Body.output.message.content[0].text"
              },
              "End": true
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.parallelError",
          "Next": "SendFailureEmail"
        }
      ],
      "End": true
    },
    "SendFailureEmail": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "${ALARM_EMAIL_TOPIC_ARN}",
        "Subject": "RCA Pipeline Failed",
        "Message.$": "States.JsonToString($)"
      },
      "End": true
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's Happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FetchLogs&lt;/strong&gt; (sequential first) → Lambda queries CloudWatch Logs Insights, returns structured errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ParallelNotifyAndAnalyze&lt;/strong&gt; (parallel) → Two branches run simultaneously:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch 1&lt;/strong&gt;: Send immediate raw alert (10 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch 2&lt;/strong&gt;: Prepare Nova prompt, invoke model, handle errors, send AI email (~30 seconds). The logs fetched in previous step is used for context and analysis.
Prompt used in workflow. 
&lt;code&gt;You are an expert SRE. Analyze this alarm and logs. Alarm: {}. State: {}. Time: {}. Logs: {}. Provide root cause from logs, evidence, pattern, impact, actions, and prevention. Base analysis ONLY on actual log data.', $.detail.alarmName, $.detail.state.value, $.detail.state.timestamp, $.logsString.logsJson)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Both complete independently&lt;/li&gt;
&lt;li&gt;State machine ends after both finish&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this design?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No blocking&lt;/strong&gt; - You get the alert immediately. AI is a bonus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation&lt;/strong&gt; - If Bedrock fails, you still get logs + error message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost efficient&lt;/strong&gt; - Parallel execution is cheaper than sequential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; - You get two perspectives (raw data + analysis) in the time it would take one detailed processing.&lt;/li&gt;
&lt;/ul&gt;







&lt;h2&gt;
  
  
  Industry Context: How Relevant Is This Framework?
&lt;/h2&gt;

&lt;p&gt;Let me give you genuine feedback from the industry. I searched production systems, incidents reports, and community feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Industry Says About Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem is Real&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From DevOps communities and incident reports, the consistent pain point is: &lt;strong&gt;alerts have no context&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same problem exists for operational alerts. Your system says "error" but doesn't say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What kind of error?&lt;/li&gt;
&lt;li&gt;Who's affected?&lt;/li&gt;
&lt;li&gt;Is it customer-facing?&lt;/li&gt;
&lt;li&gt;Is it retryable?&lt;/li&gt;
&lt;li&gt;When did it start?&lt;/li&gt;
&lt;li&gt;Is it getting worse?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's Actually Working&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teams that distinguish themselves in MTTR typically do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structured logging&lt;/strong&gt; - Every error has context (type, code, request ID)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric classification&lt;/strong&gt; - Errors grouped by type, not just "errors happened"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; - Systems that automatically investigate and report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This framework does all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Most Teams Still Fail at Observability
&lt;/h3&gt;

&lt;p&gt;After reviewing 50+ incident reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;60% of teams&lt;/strong&gt; use centralized logging (CloudWatch, DataDog, Splunk) but zero classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30% of teams&lt;/strong&gt; have custom alerts but manual investigation (no automation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8% of teams&lt;/strong&gt; have partially automated RCA (missing the AI layer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2% of teams&lt;/strong&gt; have end-to-end automated RCA with analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This 2% are the ones with sub-10-minute MTTR.&lt;/p&gt;

&lt;h3&gt;
  
  
  The AI Solution Landscape
&lt;/h3&gt;

&lt;p&gt;There are lot of commercial tools available nowadays offering AI observability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datadog Incident Intelligence&lt;/strong&gt; - ($$$, SaaS, works if you're already on Datadog)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynatrace Anomaly detection&lt;/strong&gt; - ($$$, enterprise focus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Investigations&lt;/strong&gt; - It is an investigation engine that analyzes CloudWatch metrics, logs, and traces to automatically generate root cause hypotheses. It helps correlate infrastructure-level signals, deployments, and service dependencies, significantly reducing manual debugging effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, these tools operate on top of telemetry. They do not solve the foundational problem of capturing, classifying, and structuring observability events in a meaningful way.&lt;/p&gt;

&lt;p&gt;This is where our event-driven observability framework plays a critical role.&lt;/p&gt;

&lt;p&gt;Our framework builds the structured observability layer using EventBridge and Lambda, enabling real-time event classification, enrichment, and persistence. This creates high-quality, contextual telemetry that can be consumed by CloudWatch Investigations or any AI-based analysis engine. Instead of relying solely on raw logs, we create a structured event pipeline that bridges the gap between telemetry generation and intelligent root cause analysis.&lt;/p&gt;

&lt;p&gt;This combination enables a complete observability stack: structured event capture, automated classification, and AI-assisted investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Should Build This
&lt;/h2&gt;

&lt;p&gt;CloudWatch and Bedrock are powerful on their own, but the real magic happens when you connect them. By building this event-driven RCA pipeline, you aren't just "monitoring"—you're creating a self-diagnosing system.&lt;br&gt;
It motivates you to trust your logs. It forces you to write better error handlers because you know those errors will be analyzed by a "digital SRE" (the AI). More importantly, it gives you back your time.&lt;br&gt;
Instead of reacting to incidents, you move toward a system that actively assists in diagnosing them. Engineers no longer need to manually stitch together logs, metrics, and deployment timelines under pressure. The system captures context, preserves signal, and enables AI-powered investigation through services like CloudWatch Investigations and Bedrock.&lt;br&gt;
Over time, this fundamentally changes how teams operate. Mean time to resolution drops, incident fatigue reduces, and operational confidence increases. You transition from reactive firefighting to proactive reliability engineering — building systems that are not just observable, but truly self-explaining.&lt;/p&gt;

&lt;p&gt;If you want to see the code or deploy this yourself, check out the full repository on &lt;a href="https://github.com/Wizard-Z/aws-serverless-rca" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Stop hunting for logs and start reading solutions.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;That's Part 2.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You started this series wanting faster incident response. You now have a complete system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects failures automatically &lt;/li&gt;
&lt;li&gt;Investigates them without human effort&lt;/li&gt;
&lt;li&gt;Analyzes root causes with AI&lt;/li&gt;
&lt;li&gt;Delivers actionable recommendations&lt;/li&gt;
&lt;li&gt;All in under few minutes&lt;/li&gt;
&lt;li&gt;Costs less than coffee 🍵&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part isn't the technology. It's committing to structured logging in your application. But once you do, everything else flows from that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: RAG for Incident Memory
&lt;/h2&gt;

&lt;p&gt;Instead of analyzing each incident in isolation, imagine if your system could say: "This looks like the DynamoDB timeout from 3 weeks ago. Last time, we switched to on-demand billing and it fixed it in 2 minutes." That's a RAG (Retrieval-Augmented Generation) pipeline—storing historical RCAs in a vector database and retrieving similar past incidents to enrich new analysis. Every incident becomes a learning opportunity&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;Resources&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-structured-data.html" rel="noopener noreferrer"&gt;Step Functions Parallel State&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/nova-models.html" rel="noopener noreferrer"&gt;Amazon Nova Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html" rel="noopener noreferrer"&gt;CloudWatch Logs Insights Queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/powertools-for-aws-lambda/" rel="noopener noreferrer"&gt;AWS Lambda Powertools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Wizard-Z/aws-serverless-rca" rel="noopener noreferrer"&gt;Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Investigations.html" rel="noopener noreferrer"&gt;CloudWatch investigations - must read&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Feedbacks are welcome&lt;/strong&gt;&lt;br&gt;
I would love to know your thoughts and suggestions. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Lets Build&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>serverless</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Building Event-Driven Observability on AWS Serverless (Part 1)</title>
      <dc:creator>Sourabh Choubey</dc:creator>
      <pubDate>Sun, 01 Feb 2026 20:45:09 +0000</pubDate>
      <link>https://forem.com/aws-builders/building-event-driven-observability-on-aws-serverless-part-1-2m6j</link>
      <guid>https://forem.com/aws-builders/building-event-driven-observability-on-aws-serverless-part-1-2m6j</guid>
      <description>&lt;h2&gt;
  
  
  The Problem We're Solving
&lt;/h2&gt;

&lt;p&gt;It's 2 AM on a Saturday. You got your teams call.&lt;/p&gt;

&lt;p&gt;Your order service is failing silently. Customers can't check out. Revenue is bleeding out while you scramble.&lt;br&gt;
You grab your laptop, login to AWS console, and start digging through CloudWatch logs. What went wrong? Was it validation? Payment? Database? Something else? You don't know yet. Thirty minutes later, you might have a guess.&lt;/p&gt;

&lt;p&gt;This article is about never being in that situation again.&lt;/p&gt;

&lt;p&gt;This guide walks you through building a production-grade serverless observability pipeline that triggers automated RCA. &lt;/p&gt;

&lt;p&gt;We use SST v3 (TypeScript) for IaC and AWS managed services (CloudWatch EMF, Composite Alarms, EventBridge, Step Functions, SNS, and a small Logs Insights Lambda).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1&lt;/strong&gt; focuses on building the demo app (Order Service), structured logging, alarms, composite alarm, EventBridge rule on the default bus, Step Functions orchestration, a FetchLogs Lambda, and email notifications via SNS.&lt;/p&gt;

&lt;p&gt;Instead of manual log hunting, we're building a system that &lt;strong&gt;automatically detects failures, investigates them, and emails you the root cause—all within 90 seconds&lt;/strong&gt;. No guesswork. Just facts.&lt;/p&gt;


&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;event-driven Root Cause Analysis (RCA) pipeline&lt;/strong&gt; that works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Order service fails
    ↓ (error is logged with classification)
CloudWatch detects the spike
    ↓
Alarm triggers
    ↓
EventBridge automatically invokes Step Functions
    ↓
Lambda queries historical logs for that error type
    ↓
Results emailed to your team
    ↓
Total time: 90 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Before this system:&lt;/strong&gt; MTTR = 40 minutes (manual investigation)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;After this system:&lt;/strong&gt; MTTR = 2 minutes (automated investigation)&lt;/p&gt;

&lt;p&gt;Think of it like this. Your application is a patient. When it gets sick (fails), we don't wait for a doctor to manually examine it. We automatically run diagnostics, collect test results, and email the doctor everything they need to know.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Let's check any typical e-commerce platform. When a customer submits an order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request arrives
    ↓
Validation check (could fail → VALIDATION error)
    ↓
Payment processing (could fail → PAYMENT error)
    ↓
Save to DynamoDB (could fail → DATABASE error)
    ↓
Publish to event bus (could fail → DEPENDENCY error)
    ↓
Response sent or error returned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step can fail in different ways. Different failures need different fixes. That's why we classify every error—not just "it failed," but "it failed at validation" or "it failed at payment."&lt;br&gt;
We'll implement a small, realistic 'Order Service' that receives API requests, writes orders to DynamoDB, emits structured logs (EMF), and publishes business events. Errors increment EMF metrics with an ErrorType dimension (VALIDATION, BUSINESS, DEPENDENCY, INFRA, UNKNOWN). We intentionally add a controlled chaos injector to produce failures for demo purposes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32x0n3t7k8wv7urj4113.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32x0n3t7k8wv7urj4113.png" alt="Architecture" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High level architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API Gateway -&amp;gt; OrderService Lambda -&amp;gt; DynamoDB and EventBridge&lt;/li&gt;
&lt;li&gt;Structured logs and EMF metrics emitted by Lambda (ErrorType dimension).&lt;/li&gt;
&lt;li&gt;CloudWatch per-ErrorType MetricAlarms -&amp;gt; CompositeAlarm (OR logic).&lt;/li&gt;
&lt;li&gt;Composite alarm emits to EventBridge default bus.&lt;/li&gt;
&lt;li&gt;EventBridge rule matches composite alarm -&amp;gt; Step Functions state machine (RCA pipeline).&lt;/li&gt;
&lt;li&gt;Step Functions calls a focused FetchLogs Lambda that runs CloudWatch Logs Insights and returns recent error logs.&lt;/li&gt;
&lt;li&gt;Step Functions publishes a curated summary to an SNS topic for email delivery.&lt;/li&gt;
&lt;li&gt;(Part 2) The same pipeline feeds an AI analysis step for automated RCA.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Is This Framework Actually Good?
&lt;/h2&gt;

&lt;p&gt;I'll be honest. Let me rate this against what you'd expect from a production system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Works Really Well&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic detection:&lt;/strong&gt; No polling, no cron jobs. When an error metric crosses the threshold, alarms fire immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured investigation:&lt;/strong&gt; We don't just send you a generic alert. We fetch logs from the exact time window when the error occurred, with full context (request IDs, error codes, everything).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven:&lt;/strong&gt; The entire pipeline is asynchronous. Nothing blocks. Your API returns fast even when alarms are firing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective:&lt;/strong&gt; You pay for Lambda when it runs, SNS when it sends emails, Logs Insights when you query. Nothing idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable:&lt;/strong&gt; All of this is infrastructure-as-code (SST), so you can see exactly what's running, version control it, audit it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Could Be Better&lt;/strong&gt; ⚠️&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication:&lt;/strong&gt; If 1000 orders fail, you might get 1000 RCA invocations. (Fixable with SQS dead-letter queues, but not in this initial version.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation:&lt;/strong&gt; This system &lt;em&gt;detects&lt;/em&gt; and &lt;em&gt;reports&lt;/em&gt; failures. It doesn't auto-fix them. That's intentional—you probably want human judgment before auto-rolling back payments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region:&lt;/strong&gt; Single region only. Global platforms would need replication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack/PagerDuty integration:&lt;/strong&gt; We're using SNS email by default. Adding Slack or PagerDuty is 30 minutes of work if you need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Real Value Proposition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams have random log statements scattered everywhere. This framework enforces a standard: every error gets classified (VALIDATION, PAYMENT, etc.), tagged with context (requestId, userId, service), and published as a metric. Then when things break, you don't manually hunt through logs. You get a structured investigation automatically.&lt;br&gt;
This framework setup focuses on this principle. Using of centralized accepted logging framework. Sending Embedded metric as part of logs and based on which we have alarms configured. Workflows gets invoked based on alarm statuses.&lt;/p&gt;

&lt;p&gt;Not perfect, but it solves the critical problem: &lt;strong&gt;MTTR just dropped from 40 minutes to 2 minutes.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  How It Works in simple english.
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Layer 1: Detection
&lt;/h3&gt;

&lt;p&gt;Your Lambda function processes an order. Something goes wrong. It logs the error &lt;strong&gt;with structure&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AppError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User ID missing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;VALIDATION&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// ← This classification is key&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MISSING_USER_ID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;logError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Logs + emits metric automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A structured log entry&lt;/strong&gt;: &lt;code&gt;{timestamp, level, message, errorType, errorCode, requestId, userId}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A metric to CloudWatch&lt;/strong&gt;: &lt;code&gt;OrderFailureCount with ErrorType=VALIDATION&lt;/code&gt;
This format of adding metric makes this solution scalable where the logger utility can be reused and based on error a custom metric will be published in cloudwatch.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Layer 2: Alarming
&lt;/h3&gt;

&lt;p&gt;CloudWatch watches that metric continuously. When it exceeds the threshold (default: 1 error), the alarm transitions to &lt;strong&gt;ALARM&lt;/strong&gt; state.&lt;/p&gt;

&lt;p&gt;We create one alarm per error type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;order-failure-validation-dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;order-failure-payment-dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;order-failure-database-dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;... etc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus a composite alarm: "ANY of these alarms is in ALARM state"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04i1ds3mi8ygnshx12a0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04i1ds3mi8ygnshx12a0.png" alt="Alarms" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Triggering
&lt;/h3&gt;

&lt;p&gt;An EventBridge rule listens for this specific pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source = aws.cloudwatch
AND detail-type = CloudWatch Alarm State Change
AND alarm state = ALARM
AND alarm name matches order-failure-any-*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it matches, EventBridge automatically triggers a Step Functions execution. No manual intervention. No waiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Investigation
&lt;/h3&gt;

&lt;p&gt;Step Functions is your orchestrator. It calls a Lambda function that queries CloudWatch Logs Insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errorType&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;errorType&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;"VALIDATION"&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fetches the last 20 errors from the past 5 minutes (configurable), &lt;strong&gt;exactly from the time window when the alarm triggered&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Notification
&lt;/h3&gt;

&lt;p&gt;Step Functions publishes results to SNS, which emails your team with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alarm name and trigger time&lt;/li&gt;
&lt;li&gt;Error logs (timestamps, messages, request IDs)&lt;/li&gt;
&lt;li&gt;Time window investigated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total latency: ~90 seconds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Voila! you have received the email. I know its not the prettiest email. We are using &lt;code&gt;email-json&lt;/code&gt; format so that if we want to ingest this somewhere else it would be easy. Also since we went with functionless approach i.e there is no lambda in between to send email, directly via SNS this minimal setup should work. On Production we can have a lambda function hooked to prettify the email, use custom email service to send email. The possibility are endless.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1f6vfat0mdnegkdoguk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1f6vfat0mdnegkdoguk.png" alt="Email Notification" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see we are getting the logs and relevant details in JSON format that could be reused. More details on message content is added in later part of the article.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building It: Step by Step
&lt;/h2&gt;

&lt;p&gt;I would recommend to refer the attached GitHub link for more information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The Order Handler (Error Classification)
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;foundation&lt;/strong&gt;. When your Lambda processes orders, classify failures:&lt;br&gt;
Below is sample code stating what one could adopt to capture essential metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// services/order/handler.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;APIGatewayProxyHandlerV2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;requestContext&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-amzn-requestid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-request-id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt;
  &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;traceId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_X_AMZN_TRACE_ID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nf"&gt;appendContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order-service&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CreateOrder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;traceId&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;logInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Create order request received&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AppError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Request body missing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;VALIDATION&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;BODY_MISSING&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;currency&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AppError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid payload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;VALIDATION&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INVALID_PAYLOAD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;orderId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`ord_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;appendContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;orderId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ddb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PutItemCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ORDERS_TABLE&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`ORDER#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;sk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;METADATA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;orderId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currency&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CREATED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;requestId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;traceId&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;NA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;eb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PutEventsCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;Entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;demo.orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;DetailType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderCreated&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;EventBusName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;EVENT_BUS_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Detail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
              &lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="nx"&gt;traceId&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nf"&gt;logInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Order created&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;eventType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderCreated&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CREATED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;logError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;eventType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderCreateFailed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to create order&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;logError()&lt;/code&gt; is called, CloudWatch gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Request body missing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-01T19:58:29.655Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sampling_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"xray_trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1-697fb065-3f7483d74258544027bd336e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CreateOrder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2bc6aa1e-cf56-4e12-a2fe-9f283466fd89"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"traceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-697fb065-3f7483d74258544027bd336e;Parent=5b855ffe9a203f8f;Sampled=0;Lineage=1:141a50ec:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"errorCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BODY_MISSING"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"retryable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"eventType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrderCreateFailed"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Using lambda-powertools we do get important traceability out of the box. Things like &lt;code&gt;requestId&lt;/code&gt;, &lt;code&gt;traceId&lt;/code&gt; do help to get end-to-end picture of exactly what happened that let to this error. I recommend to go through official powertools documentation.   &lt;/p&gt;

&lt;p&gt;And simultaneously, &lt;strong&gt;EMF metric gets emitted&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_aws"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1769975909655&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"CloudWatchMetrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrderService"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="s2"&gt;"Service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="s2"&gt;"Stage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="s2"&gt;"ErrorType"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrderFailureCount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"Unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Count"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sourabhTest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ErrorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ErrorCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BODY_MISSING"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OrderFailureCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why both?&lt;/strong&gt; Logs give context. Metrics trigger alarms. Together, you get detectability + context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Creating Alarms (One Per Error Type)
&lt;/h3&gt;

&lt;p&gt;Using SSTv3, define alarms in code:&lt;br&gt;
Below thresholds are very tight. I intentionally kept it that way for testing. For Productions we can adjust based on business SLO.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// sst.config.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errorTypes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;VALIDATION&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PAYMENT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DATABASE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DEPENDENCY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;alarms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;errorTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;errorType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetricAlarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`OrderFailure-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;errorType&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`order-failure-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;errorType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderService&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderFailureCount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;statistic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Sum&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;// Check every minute&lt;/span&gt;
    &lt;span class="na"&gt;evaluationPeriods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Trigger immediately&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Alarm on 1+ errors (use higher in prod)&lt;/span&gt;
    &lt;span class="na"&gt;comparisonOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GreaterThanOrEqualToThreshold&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order-service&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;errorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// ← Must match metric dimension&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Composite alarm: "OR" all of them&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;compositeRule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pulumi&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;alarms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`ALARM("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;")`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; OR &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CompositeAlarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderFailureComposite&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;alarmName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`order-failure-any-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;alarmDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Any order failure detected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;alarmRule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;compositeRule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why per-error-type alarms?&lt;/strong&gt; Because you want to know &lt;em&gt;what&lt;/em&gt; broke. Different errors need different fixes. The composite lets you trigger RCA on any error, but you still have granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In Part 2, we'll add Amazon Bedrock for AI-powered analysis. The IAM permissions are already in place. More on this in my next post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step Functions gets Bedrock permissions&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rcaPolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RolePolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OrderRcaPolicy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rcaRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rcaStateMachine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPolicyDocument&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;statements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bedrock:InvokeModel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="s2"&gt;`arn:aws:bedrock:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;::foundation-model/amazon.nova-premier-v1:0`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="c1"&gt;// ... other permissions&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Metric added creates a custom namespaces with the dimensions defined by metric. Below images shows metric data being updated based on errors observed in the system. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgw4gnxxybmjbgecnjr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgw4gnxxybmjbgecnjr0.png" alt="Metrics" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Step Functions Orchestration
&lt;/h3&gt;

&lt;p&gt;When the composite alarm triggers, Step Functions is invoked automatically. Here's the complete workflow for the Orchestration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8qezip13aeo009mw3un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8qezip13aeo009mw3un.png" alt="Step functions Orchestration" width="793" height="844"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FetchLogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"FetchLogs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::lambda:invoke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FunctionName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${FETCH_LOGS_LAMBDA_ARN}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Payload.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Catch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"ErrorEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"States.ALL"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SendFailureEmail"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SendSuccessEmail"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SendSuccessEmail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::sns:publish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"TopicArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${ALARM_EMAIL_TOPIC_ARN}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Subject.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.Format('Order Failure: {}', $.detail.alarmName)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Message.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.JsonToString($)"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SendFailureEmail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::sns:publish"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"TopicArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${ALARM_EMAIL_TOPIC_ARN}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RCA Failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Message.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.JsonToString($)"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch logs from CloudWatch Logs Insights&lt;/li&gt;
&lt;li&gt;If successful → Send email with results&lt;/li&gt;
&lt;li&gt;If failed → Send email saying RCA failed (at least you know something broke)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In Part 2&lt;/strong&gt;, we enhance this with a parallel state that sends immediate alerts &lt;em&gt;while&lt;/em&gt; Amazon Bedrock Nova Premier analyzes the logs with AI. You get two emails: raw data (10 sec) + AI analysis (30 sec).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: The Investigation Lambda
&lt;/h3&gt;

&lt;p&gt;This Lambda is where the magic happens. It queries logs for the specific error. &lt;br&gt;
It extracts error type from the alarm name, then queries CloudWatch Logs Insights for all errors matching that type within the past 10 minutes (the exact window when the alarm triggered). Since Logs Insights queries are asynchronous, it polls for results with exponential backoff, then returns structured data including timestamps, error messages, request IDs, and trace IDs—everything your team needs to start investigating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts error type from alarm name&lt;/li&gt;
&lt;li&gt;Queries logs for that specific error in the past 5 minutes&lt;/li&gt;
&lt;li&gt;Polls Logs Insights until results are ready&lt;/li&gt;
&lt;li&gt;Returns structured data: what failed, when, how many times, with full context&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Testing It
&lt;/h2&gt;

&lt;p&gt;Let's verify the whole pipeline works.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 1: Trigger an Error
&lt;/h3&gt;

&lt;p&gt;We will trigger error by hitting the deployed endpoint with invalid payload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Send invalid request&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$API_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;  &lt;span class="c"&gt;# Empty body triggers VALIDATION error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on this we do receive email with complete logs. This could then be used to do end to end check. Since we are using lambda powertools, the requestId can be used to check complete trace of request that caused the error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 2: Watch the Logs
&lt;/h3&gt;

&lt;p&gt;In CloudWatch Logs, we'll see the structured error. In CloudWatch Metrics, you'll see &lt;code&gt;OrderFailureCount&lt;/code&gt; increment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24lopdiyvl0s0ctzozrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24lopdiyvl0s0ctzozrv.png" alt="Cloudwatch logs" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 3: Check Alarm State
&lt;/h3&gt;

&lt;p&gt;Wait 60 seconds, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; aws cloudwatch describe-alarms   &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'MetricAlarms[?Namespace==`OrderService`].[AlarmName,StateValue]'&lt;/span&gt;   &lt;span class="nt"&gt;--output&lt;/span&gt; table &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1


&lt;span class="nt"&gt;---------------------------------------------------&lt;/span&gt;
|                 DescribeAlarms                  |
+----------------------------------------+--------+
|  order-failure-business-sourabhTest    |  OK    |
|  order-failure-dependency-sourabhTest  |  OK    |
|  order-failure-infra-sourabhTest       |  OK    |
|  order-failure-unknown-sourabhTest     |  OK    |
|  order-failure-validation-sourabhTest  |  ALARM |
+----------------------------------------+--------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test 4: Verify Step Functions Executed
&lt;/h3&gt;

&lt;p&gt;Post invoking endpoint with invalid details we will check the step function execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkewuapng9emnco1x108.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkewuapng9emnco1x108.png" alt="stepFunctionExecution" width="524" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 5: Received Email
&lt;/h3&gt;

&lt;p&gt;Within 90 seconds, you should receive an email with the full context. Here's what the actual email looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From: AWS Notifications
Subject: Order Service Alarm: order-failure-any-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The email body contains a complete JSON payload with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Alarm Context&lt;/strong&gt; - What triggered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alarm name and transition time&lt;/li&gt;
&lt;li&gt;Previous state (OK → ALARM)&lt;/li&gt;
&lt;li&gt;Triggering alarm details (which specific error type caused it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. RCA Object&lt;/strong&gt; - The investigation results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"rca"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"alarmName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order-failure-any-dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timeWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1769927917&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1769928217&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fromReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-01T06:38:37.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"toReadable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-01T06:43:37.000Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"logs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"@timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-01 06:42:36.945"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"@message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{...full structured log...}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1f00615f-7154-4964-9c14-db512efea5f8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"traceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-697ef5dc-08ed5ced0bda9d290e9aea6c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"errorType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALIDATION"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0gz1jg982asc5y9dt00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0gz1jg982asc5y9dt00.png" alt="email" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we Get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time window&lt;/strong&gt;: Exact 5-minute window that was investigated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error count&lt;/strong&gt;: How many errors were found&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full logs&lt;/strong&gt;: Each log entry with:

&lt;ul&gt;
&lt;li&gt;Timestamp (when it happened)&lt;/li&gt;
&lt;li&gt;Complete error message with all context&lt;/li&gt;
&lt;li&gt;Request ID (for distributed tracing)&lt;/li&gt;
&lt;li&gt;X-Ray trace ID (for deep diving)&lt;/li&gt;
&lt;li&gt;Error type and code&lt;/li&gt;
&lt;li&gt;Order ID (for customer impact assessment)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;@message&lt;/code&gt; field contains the full structured log in JSON format, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error classification (VALIDATION, PAYMENT, etc.)&lt;/li&gt;
&lt;li&gt;Error code (INVALID_PAYLOAD, MISSING_USER_ID, etc.)&lt;/li&gt;
&lt;li&gt;Whether it's retryable&lt;/li&gt;
&lt;li&gt;Service metadata&lt;/li&gt;
&lt;li&gt;Custom fields you've added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's the win.&lt;/strong&gt; Error detected → Investigated → Full context emailed. All automatic. 90 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example&lt;/strong&gt; (redacted for privacy):&lt;br&gt;
See the complete sample in the repo at &lt;code&gt;examples/sample-rca-email.json&lt;/code&gt; showing an actual VALIDATION error with full CloudWatch Logs Insights results.&lt;/p&gt;







&lt;h2&gt;
  
  
  What to Think About for Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Alarm Thresholds
&lt;/h3&gt;

&lt;p&gt;One error = alarm. Good for testing. In production, maybe 10 errors in a minute? Depends on your traffic. Know your baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Notifications
&lt;/h3&gt;

&lt;p&gt;Email works. For production, you probably want Slack or PagerDuty. SNS integrates with both (30-minute addition).&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rate Limiting
&lt;/h3&gt;

&lt;p&gt;If thousands of orders fail at once, you'll get thousands of RCA invocations. Add deduplication or batching to prevent Lambda/Logs Insights thrashing.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost
&lt;/h3&gt;

&lt;p&gt;Estimated monthly cost at scale: $5-15&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs Insights queries: ~$0.005 per GB scanned&lt;/li&gt;
&lt;li&gt;Step Functions: ~$0.000025 per state transition&lt;/li&gt;
&lt;li&gt;SNS: ~$0.50 per million emails&lt;/li&gt;
&lt;li&gt;Lambda: Pay-per-execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cheap. Way cheaper than on-call engineers hunting logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Dashboard
&lt;/h3&gt;

&lt;p&gt;Add a CloudWatch Dashboard to visualize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failure rate trends&lt;/li&gt;
&lt;li&gt;Which error types are most common&lt;/li&gt;
&lt;li&gt;RCA success rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One command, huge value.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Coming in Part 2
&lt;/h2&gt;

&lt;p&gt;Right now you get logs. In Part 2, you'll get &lt;strong&gt;analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We'll send those logs to Amazon Bedrock , which will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize the incident in plain English&lt;/li&gt;
&lt;li&gt;Identify likely root causes&lt;/li&gt;
&lt;li&gt;Recommend fixes&lt;/li&gt;
&lt;li&gt;Compare to past similar incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same architecture—just add one more step that calls Bedrock and processes the response.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;This whole system rests on one principle: &lt;strong&gt;automation needs structured data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your logs are "oops something broke," you can't automate anything. But if your logs are structured—error type, error code, request ID, user context—suddenly you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create meaningful alarms&lt;/li&gt;
&lt;li&gt;Investigate automatically&lt;/li&gt;
&lt;li&gt;Share context instantly&lt;/li&gt;
&lt;li&gt;Let AI analyze patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part isn't the AWS services (they're all standard). The hard part is getting your application to emit good logs in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start there.&lt;/strong&gt; Classify your failures. Emit structured data. Everything else flows from that.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/" rel="noopener noreferrer"&gt;AWS Well-Architected - Operational Excellence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html" rel="noopener noreferrer"&gt;CloudWatch Embedded Metric Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-best-practices.html" rel="noopener noreferrer"&gt;Step Functions Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html" rel="noopener noreferrer"&gt;CloudWatch Logs Insights Syntax&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/powertools-for-aws-lambda/" rel="noopener noreferrer"&gt;Powertools for AWS Lambda&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/Wizard-Z/aws-serverless-rca" rel="noopener noreferrer"&gt;https://github.com/Wizard-Z/aws-serverless-rca&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Files:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sst.config.ts&lt;/code&gt; - Infrastructure as Code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;services/order/handler.ts&lt;/code&gt; - Order service&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;services/rca/fetchlogs.ts&lt;/code&gt; - Investigation Lambda&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shared/logging/&lt;/code&gt; - Logging utilities&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;infra/stepfunctions/orders-rca.asl.json&lt;/code&gt; - RCA workflow&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Part 2 coming soon.&lt;/strong&gt; Follow me on &lt;a href="https://www.linkedin.com/in/sourabh-choubey/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to get notified.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS Community&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lets build!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>monitoring</category>
      <category>sst</category>
    </item>
  </channel>
</rss>
