<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Håkon Eriksen Drange</title>
    <description>The latest articles on Forem by Håkon Eriksen Drange (@haakoned).</description>
    <link>https://forem.com/haakoned</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2938543%2Fa2dc2bdf-a5af-42f7-949a-a8b5834861a9.jpg</url>
      <title>Forem: Håkon Eriksen Drange</title>
      <link>https://forem.com/haakoned</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/haakoned"/>
    <language>en</language>
    <item>
      <title>AWS Immersion Day talk: Web Application Firewall and AWS Shield Advanced</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Tue, 25 Nov 2025 20:17:00 +0000</pubDate>
      <link>https://forem.com/haakoned/aws-immersion-day-talk-web-application-firewall-and-aws-shield-advanced-38fk</link>
      <guid>https://forem.com/haakoned/aws-immersion-day-talk-web-application-firewall-and-aws-shield-advanced-38fk</guid>
<description>&lt;p&gt;I was invited by AWS Norway to contribute to their AWS Immersion Day on Zero Trust, which took place on Tuesday, November 25th, 2025.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tbidmt2qdi6t6i7vrli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tbidmt2qdi6t6i7vrli.png" width="648" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Elshan Hasanov from AWS started with an introduction to the concept of Zero Trust and how Amazon Verified Access and Amazon VPC Lattice can be used for this purpose. Next up, I presented approaches for application and infrastructure security with AWS Web Application Firewall and AWS Shield Advanced.&lt;/p&gt;

&lt;p&gt;You can find the slides below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hedrange.com/wp-content/uploads/2025/12/2025-11-25-aws-immersion-day-zero-trust-waf-shield-haakon-eriksen-drange.pdf" rel="noopener noreferrer"&gt;Download the slides (PDF)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About the event&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
In today’s dynamic and distributed cloud environments, traditional perimeter security alone is no longer sufficient to protect against advanced threats. This session explores the paradigm shift towards Zero Trust architecture and its implementation within AWS ecosystems, covering the use of AWS services and features to adopt Zero Trust principles alongside traditional security approaches for fine-grained access control, network segmentation, secure data access, and logging. Join us to learn how to transition to a more secure, resilient, and comprehensive security approach tailored for your AWS environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Expect from the Event&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This session and hands-on workshop provide an in-depth exploration of Zero Trust security principles and their practical application within AWS environments using AWS Verified Access and VPC Lattice, complemented by modern perimeter security strategies including WAF and firewall inspection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Focus Areas&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identity-aware, least-privilege access for both human users and microservices&lt;/li&gt;
&lt;li&gt;Integration of Zero Trust with strategic perimeter controls including WAF&lt;/li&gt;
&lt;li&gt;Practical implementation using AWS Verified Access and Amazon VPC Lattice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hands-on Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure AWS Verified Access for secure remote user access&lt;/li&gt;
&lt;li&gt;Implement Amazon VPC Lattice for service-to-service communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Original AWS Experience North event page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hedrange.com/wp-content/uploads/2025/12/2025-11-25-zero-trust-event-page.pdf" rel="noopener noreferrer"&gt;Download the event page (PDF)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/11/25/aws-immersion-day-talk-web-application-firewall-and-aws-shield-advanced/" rel="noopener noreferrer"&gt;AWS Immersion Day talk: Web Application Firewall and AWS Shield Advanced&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
    </item>
    <item>
      <title>AWS User Group Oslo talk: From Vibe Coding to Spec-Driven Development with Kiro</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Wed, 19 Nov 2025 13:39:13 +0000</pubDate>
      <link>https://forem.com/haakoned/aws-user-group-oslo-talk-from-vibe-coding-to-spec-driven-development-with-kiro-3n7p</link>
      <guid>https://forem.com/haakoned/aws-user-group-oslo-talk-from-vibe-coding-to-spec-driven-development-with-kiro-3n7p</guid>
      <description>&lt;p&gt;On &lt;a href="https://www.meetup.com/aws-user-group-norway/events/311318914/" rel="noopener noreferrer"&gt;Tuesday November 18th at the AWS User Group Oslo Meetup&lt;/a&gt; I shared my views on why builders should pivot from Vibe Coding to Spec-Driven Development with Kiro.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How AI assistants are transforming the Software Development Lifecycle (all code: not only frontend and backend, but Infrastructure-as-Code as well)&lt;/li&gt;
&lt;li&gt;Introduction to Vibe Coding, the good and the not so good parts&lt;/li&gt;
&lt;li&gt;How traditional and established software craftsmanship is becoming more relevant than ever&lt;/li&gt;
&lt;li&gt;How Kiro and the Spec-Driven Development workflow helps bring more structure, quality and speed&lt;/li&gt;
&lt;li&gt;A practical demonstration: adding a new feature to a serverless weather forecast application, deployed on AWS with Terraform and GitHub Actions&lt;/li&gt;
&lt;li&gt;Best practices for specs, agent steering context, MCP and token optimization based on my experience&lt;/li&gt;
&lt;li&gt;Predictions for the future&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the slides below. Thanks to AWS Community Builders and the Amazon Kiro team for valuable insights and supporting material.&lt;/p&gt;

&lt;p&gt;For a detailed walk-through, check out &lt;a href="https://hedrange.com/2025/08/11/how-to-use-kiro-for-ai-assisted-spec-driven-development/" rel="noopener noreferrer"&gt;https://hedrange.com/2025/08/11/how-to-use-kiro-for-ai-assisted-spec-driven-development/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hedrange.com/wp-content/uploads/2025/11/2025-11-18-aws-user-group-oslo-from-vibe-coding-to-spec-driven-development-with-kiro.pdf" rel="noopener noreferrer"&gt;Download the slides (PDF)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcph4d6sbalserxv11o0m.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcph4d6sbalserxv11o0m.webp" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgmxkpdarbxt4l6mhigz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgmxkpdarbxt4l6mhigz.jpeg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl8d5h0c4v3e0koh885o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl8d5h0c4v3e0koh885o.jpeg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyoh86jvgbcglqz3zqoy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyoh86jvgbcglqz3zqoy.webp" width="768" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;https://kiro.dev/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kiro.dev/blog/general-availability/" rel="noopener noreferrer"&gt;https://kiro.dev/blog/general-availability/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/11/19/aws-user-group-oslo-talk-going-from-vibe-coding-to-spec-driven-development-with-kiro/" rel="noopener noreferrer"&gt;AWS User Group Oslo talk: From Vibe Coding to Spec-Driven Development with Kiro&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>kiro</category>
      <category>specdrivendevelopmen</category>
    </item>
    <item>
      <title>Demonstrate practical AWS skills with new microcredentials</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Fri, 14 Nov 2025 07:58:42 +0000</pubDate>
      <link>https://forem.com/haakoned/demonstrate-practical-aws-skills-with-new-microcredentials-3918</link>
      <guid>https://forem.com/haakoned/demonstrate-practical-aws-skills-with-new-microcredentials-3918</guid>
<description>&lt;p&gt;AWS has announced a new skills validation program called &lt;a href="https://skillbuilder.aws/category/type/microcredentials" rel="noopener noreferrer"&gt;Microcredentials&lt;/a&gt;. These are more practical and lightweight approaches to validating knowledge than a full, comprehensive certification exam. It’s rewarding to be able to go beyond theoretical knowledge and prove what you’ve actually learned. AWS Certifications and Microcredentials complement each other, validating both your deep technical knowledge and your hands-on skills.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i5455v8l0o65wcefo3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i5455v8l0o65wcefo3o.png" alt="Image demonstrating AWS microcredential lab overview" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image credit: AWS – &lt;a href="https://www.aboutamazon.com/news/aws/aws-ai-certification-learning-tools-skills-development" rel="noopener noreferrer"&gt;https://www.aboutamazon.com/news/aws/aws-ai-certification-learning-tools-skills-development&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  This is how it works
&lt;/h2&gt;

&lt;p&gt;The exam labs run in a live AWS-provisioned environment, similar to the regular lab/SimuLearn tasks on AWS Skill Builder. Over the course of 90 minutes, candidates are presented with a set of challenges to be solved by practical implementation in the AWS Console (ClickOps, no IaC). You have to diagnose issues and implement solutions on your own; no hints or guidance are provided. The exam lab cannot be paused or restarted, so if you have to quit you need to start over again, not much different from an actual certification exam. Candidates who fail to meet the passing score can retake the exam after 25 days.&lt;/p&gt;

&lt;p&gt;As of November 2025 the available training options are:&lt;/p&gt;

&lt;h3&gt;
  
  
  Microcredential Preview Experience
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;This microcredential validates your ability to configure and connect basic AWS services in hands-on scenarios. Key focus areas include S3 static hosting, API Gateway connections, Lambda integrations, and DynamoDB storage. This is not a real microcredential exam lab. This is a trial version you can use to familiarize yourself with the interface.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I recommend starting here to become familiar with the format and set yourself up for success in the actual exam lab.&lt;/p&gt;

&lt;p&gt;AWS Skill Builder link: &lt;a href="https://skillbuilder.aws/learn/JBTFY8M6S8/microcredential-preview-experience/YJFR7KHKR3" rel="noopener noreferrer"&gt;https://skillbuilder.aws/learn/JBTFY8M6S8/microcredential-preview-experience/YJFR7KHKR3&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Agentic AI Demonstrated
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;AWS Agentic AI Demonstrated&lt;/strong&gt; is a hands-on exam lab designed to help you validate your AWS skills in the &lt;strong&gt;Agentic AI&lt;/strong&gt; domain.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objectives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Troubleshoot and repair supervisor and specialist Bedrock Agents&lt;/li&gt;
&lt;li&gt;Troubleshoot and repair an Amazon Bedrock knowledge base&lt;/li&gt;
&lt;li&gt;Integrate and fix a Bedrock Agent&lt;/li&gt;
&lt;li&gt;Enhance Bedrock Agent capabilities&lt;/li&gt;
&lt;li&gt;Integrate Bedrock Guardrails with a Bedrock Agent&lt;/li&gt;
&lt;li&gt;Connect a web application chat client with a Bedrock Agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Skill Builder link: &lt;a href="https://skillbuilder.aws/learn/32Y249P272/aws-agentic-ai-demonstrated/TTAJ5WKYTS" rel="noopener noreferrer"&gt;https://skillbuilder.aws/learn/32Y249P272/aws-agentic-ai-demonstrated/TTAJ5WKYTS&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Serverless Demonstrated
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AWS Serverless Demonstrated&lt;/strong&gt; is a hands-on exam lab designed to help you validate your AWS skills in the &lt;strong&gt;Serverless&lt;/strong&gt; domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objectives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure an AWS Lambda function&lt;/li&gt;
&lt;li&gt;Deploy a REST API&lt;/li&gt;
&lt;li&gt;Configure a Step Functions state machine&lt;/li&gt;
&lt;li&gt;Design and implement an event-driven system&lt;/li&gt;
&lt;li&gt;Optimize AWS Lambda functions for various scenarios&lt;/li&gt;
&lt;li&gt;Configure a CI/CD pipeline for serverless applications&lt;/li&gt;
&lt;li&gt;Configure and analyze monitoring and telemetry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Skill Builder link: &lt;a href="https://skillbuilder.aws/learn/XV3B4RGA8Q/aws-serverless-demonstrated/BYD5SH8R5C" rel="noopener noreferrer"&gt;https://skillbuilder.aws/learn/XV3B4RGA8Q/aws-serverless-demonstrated/BYD5SH8R5C&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tip: Completing the &lt;a href="https://skillbuilder.aws/learning-plan/VD4SU58H3W/serverless--knowledge-badge-readiness-path-includes-labs/W62PSCYRZF" rel="noopener noreferrer"&gt;Serverless Knowledge Badge Readiness Path course&lt;/a&gt; first can be helpful (you will also get a badge upon completing a multiple-choice assessment).&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;Passing the microcredential lab exams will unlock some nice new Credly badges you can share with your manager and colleagues, and on social media, to prove the skills you’ve acquired.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nnpsbilllpcbldktb5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nnpsbilllpcbldktb5j.png" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My reference: &lt;a href="https://www.credly.com/badges/bc214808-641d-4fb2-b196-e78b530af563" rel="noopener noreferrer"&gt;https://www.credly.com/badges/bc214808-641d-4fb2-b196-e78b530af563&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2r66tgi0ox5kvjssi7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2r66tgi0ox5kvjssi7v.png" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My reference: &lt;a href="https://www.credly.com/badges/80b6282b-026e-482e-bc14-9aa801536435" rel="noopener noreferrer"&gt;https://www.credly.com/badges/80b6282b-026e-482e-bc14-9aa801536435&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Good luck!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/11/14/demonstrate-practical-aws-skills-with-new-microcredentials/" rel="noopener noreferrer"&gt;Demonstrate practical AWS skills with new microcredentials&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>microcredentials</category>
    </item>
    <item>
      <title>How to use Kiro for AI assisted spec-driven development</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Mon, 11 Aug 2025 13:35:31 +0000</pubDate>
      <link>https://forem.com/haakoned/how-to-use-kiro-for-ai-assisted-spec-driven-development-2mpa</link>
      <guid>https://forem.com/haakoned/how-to-use-kiro-for-ai-assisted-spec-driven-development-2mpa</guid>
      <description>&lt;p&gt;Read on to learn how to use Kiro for AI assisted spec-driven development of a serverless weather forecasting app, using Terraform for deployment to AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Kiro introduces the AI assisted spec-driven development workflow&lt;/li&gt;
&lt;li&gt;
Kiro core concepts

&lt;ul&gt;
&lt;li&gt;Core capabilities&lt;/li&gt;
&lt;li&gt;Specs: Plan and build features using structured specifications&lt;/li&gt;
&lt;li&gt;Hooks: Automate repetitive tasks with intelligent triggers&lt;/li&gt;
&lt;li&gt;Agentic chat: Build features through natural conversation with AI&lt;/li&gt;
&lt;li&gt;Steering: Guide AI with custom rules and context&lt;/li&gt;
&lt;li&gt;MCP Servers: Connect external tools and data sources&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Getting Kiro&lt;/li&gt;

&lt;li&gt;

Starting your first Kiro project

&lt;ul&gt;
&lt;li&gt;Vibe&lt;/li&gt;
&lt;li&gt;Spec&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Setting up Kiro

&lt;ul&gt;
&lt;li&gt;Model Context Protocol (MCP) servers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Setting up agent steering context&lt;/li&gt;

&lt;li&gt;

Writing your product spec

&lt;ul&gt;
&lt;li&gt;Step 1: Define requirements&lt;/li&gt;
&lt;li&gt;Step 2: Define design&lt;/li&gt;
&lt;li&gt;Step 3: Implement&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Deploying the final solution

&lt;ul&gt;
&lt;li&gt;Workflow for adding a new feature&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Learnings and key takeaways

&lt;ul&gt;
&lt;li&gt;Reflections on context&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Kiro pricing&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Resources&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Since the advent of Generative AI, coding assistants and their evolution have been a topic of much discussion. As Large Language Models became increasingly precise, AI-based coding assistants or companions have been able to produce increasingly relevant suggestions, which have been incorporated into extensions and new IDEs such as Cursor, in addition to the CLI. We’ve evolved from simple suggestions of a few lines to transformation of complete codebases. The focus is pivoting from &lt;em&gt;assistance&lt;/em&gt; to &lt;em&gt;resolution&lt;/em&gt; of a particular problem and &lt;em&gt;outcomes&lt;/em&gt;: what we instruct the software to achieve. &lt;a href="https://en.wikipedia.org/wiki/Vibe_coding" rel="noopener noreferrer"&gt;Vibe Coding&lt;/a&gt; will probably go down in history as one of the defining terms of 2025. It can be efficient for prototypes and proofs of concept, but how can we know what assumptions and decisions the agent made to get to that result?&lt;/p&gt;

&lt;p&gt;Traditional software development processes are based on initial specification. We need to know the purpose and functionality of what to build before we start building.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/context-project-rules.html" rel="noopener noreferrer"&gt;Project Rules&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/customizations.html" rel="noopener noreferrer"&gt;Customizations&lt;/a&gt; in &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt; were a step in the right direction, but these were some key challenges I still experienced with AI coding assistants earlier in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They don’t remember context or state: if you shut down your laptop and continue tomorrow, the context may be lost (this improved with Q Developer)&lt;/li&gt;
&lt;li&gt;How to share context across multiple developers in a team&lt;/li&gt;
&lt;li&gt;How to get more valuable output according to personal/company preference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://kiro.dev/blog/understanding-kiro-pricing-specs-vibes-usage-tracking/" rel="noopener noreferrer"&gt;Research referenced by AWS&lt;/a&gt; shows that addressing issues during the development phase is &lt;a href="https://www.cs.cmu.edu/afs/cs/academic/class/17654-f01/www/refs/BB.pdf" rel="noopener noreferrer"&gt;5&lt;/a&gt; to &lt;a href="https://www.researchgate.net/figure/BM-System-Science-Institute-Relative-Cost-of-Fixing-Defects_fig1_255965523" rel="noopener noreferrer"&gt;7&lt;/a&gt; times more costly than resolving them during the planning phase of the software development lifecycle. Similarly, it’s less complex and costly to change a system before going to production.&lt;/p&gt;

&lt;p&gt;This principle holds true even with AI coding assistants. When you take the time to discuss requirements and design with Kiro during the planning phase, a single specification request will often accomplish what would otherwise require multiple vibe iterations during implementation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out" rel="noopener noreferrer"&gt;Garbage in, garbage out&lt;/a&gt;, you know.&lt;/p&gt;

&lt;p&gt;Luckily, Kiro can now help incorporate that well-known structure into AI assistant coding, in a consistent manner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kiro introduces the AI assisted spec-driven development workflow
&lt;/h2&gt;

&lt;p&gt;Kiro is a new software development IDE based on Visual Studio Code that turns prompts into clear requirements, structured designs and implementation tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9lfb3fa7dkrux3xhi80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9lfb3fa7dkrux3xhi80.png" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole process is validated by tests. Code is generated by AI agents utilizing current Large Language Models (&lt;a href="https://en.wikipedia.org/wiki/Large_language_model" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;s).&lt;/p&gt;

&lt;p&gt;Kiro leverages &lt;a href="https://www.anthropic.com/claude/sonnet" rel="noopener noreferrer"&gt;Anthropic’s Claude Sonnet 4&lt;/a&gt; under the hood, with the option to fall back to Sonnet 3.7 (prefer the newer model). These models are specialized in agentic coding and in tasks across the entire software development lifecycle, from initial planning and implementation to bug fixing, maintenance and refactoring.&lt;/p&gt;

&lt;p&gt;I recommend reading &lt;a href="https://kiro.dev/blog/introducing-kiro/" rel="noopener noreferrer"&gt;Introducing Kiro&lt;/a&gt; and &lt;a href="https://kiro.dev/blog/kiro-and-the-future-of-software-development/" rel="noopener noreferrer"&gt;Kiro and the future of AI spec-driven software development&lt;/a&gt; to get up to speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kiro core concepts
&lt;/h2&gt;

&lt;p&gt;Kiro introduces two main modes: Vibe and Spec.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgi2ge5p7h5tctlwe2vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgi2ge5p7h5tctlwe2vn.png" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image courtesy of Kiro&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Core capabilities
&lt;/h3&gt;

&lt;p&gt;You can read more about my experiences with these capabilities in the next chapter. Let me introduce the concepts first.&lt;/p&gt;
&lt;h4&gt;
  
  
  Specs: Plan and build features using structured specifications
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Specs or specifications are structured artifacts that formalize the development process for complex features in your application. They provide a systematic approach to transform high-level ideas into detailed implementation plans with clear tracking and accountability.&lt;/p&gt;

&lt;p&gt;With Kiro’s specs, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Break down requirements&lt;/strong&gt;  into user stories with acceptance criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build design docs&lt;/strong&gt;  with sequence diagrams and architecture plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track implementation progress&lt;/strong&gt;  across discrete tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate effectively&lt;/strong&gt;  between product and engineering teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;a href="https://kiro.dev/docs/specs/" rel="noopener noreferrer"&gt;https://kiro.dev/docs/specs/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Hooks: Automate repetitive tasks with intelligent triggers
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Agent Hooks are automated triggers that execute predefined agent actions when specific events occur in the Kiro IDE. When files are created, saved or deleted you can configure hooks to be run for common tasks to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain consistent code quality&lt;/li&gt;
&lt;li&gt;Prevent security vulnerabilities&lt;/li&gt;
&lt;li&gt;Reduce manual overhead&lt;/li&gt;
&lt;li&gt;Standardize team processes&lt;/li&gt;
&lt;li&gt;Create faster development cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;a href="https://kiro.dev/docs/hooks/" rel="noopener noreferrer"&gt;https://kiro.dev/docs/hooks/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is an area I have not explored in detail yet.&lt;/p&gt;
&lt;h4&gt;
  
  
  Agentic chat: Build features through natural conversation with AI
&lt;/h4&gt;

&lt;p&gt;Kiro offers a &lt;a href="https://kiro.dev/docs/chat/" rel="noopener noreferrer"&gt;chat&lt;/a&gt; panel where you can interact with your code through natural language conversations. Just tell Kiro what you need. Ask questions about your codebase, request explanations for complex logic, generate new features, debug tricky issues, and automate repetitive tasks—all while Kiro maintains complete context of your project.&lt;/p&gt;
&lt;h4&gt;
  
  
  Steering: Guide AI with custom rules and context
&lt;/h4&gt;

&lt;p&gt;I believe this is one of the most powerful capabilities Kiro introduces. To quote the &lt;a href="https://kiro.dev/docs/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Steering gives Kiro persistent knowledge about your project through markdown files in &lt;code&gt;.kiro/steering/&lt;/code&gt;. Instead of explaining your conventions in every chat, steering files ensure Kiro consistently follows your established patterns, libraries, and standards.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Code Generation&lt;/strong&gt;  – Every component, API endpoint, or test follows your team’s established patterns and conventions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Repetition&lt;/strong&gt;  – No need to explain project standards in each conversation. Kiro remembers your preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Alignment&lt;/strong&gt;  – All developers work with the same standards, whether they’re new to the project or seasoned contributors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Project Knowledge&lt;/strong&gt;  – Documentation that grows with your codebase, capturing decisions and patterns as your project evolves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;a href="https://kiro.dev/docs/steering/" rel="noopener noreferrer"&gt;https://kiro.dev/docs/steering/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
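
&lt;p&gt;To make this concrete, a minimal steering file could look something like the sketch below. The file name and the rules are my own hypothetical example for a project like the weather app in this post, not taken from the Kiro docs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .kiro/steering/infrastructure.md (hypothetical example)

- All infrastructure is defined with Terraform; do not suggest CloudFormation or CDK.
- Lambda functions use Python 3.12 and live under src/.
- Every new feature must include unit tests and pass terraform validate in GitHub Actions.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once a file like this is in place, every chat and spec session starts from these conventions instead of you restating them.&lt;/p&gt;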
&lt;h4&gt;
  
  
  MCP Servers: Connect external tools and data sources
&lt;/h4&gt;

&lt;p&gt;In my opinion, the main input for valuable and tailored results is your combination of Steering and Model Context Protocol servers. MCP extends Kiro’s capabilities by connecting to specialized servers that provide additional tools and context, tailored to your environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;MCP is a protocol that allows Kiro to communicate with external servers to access specialized tools and information. For example, the AWS Documentation MCP server provides tools to search, read, and get recommendations from AWS documentation directly within Kiro.&lt;/p&gt;

&lt;p&gt;With MCP, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access specialized knowledge bases and documentation&lt;/li&gt;
&lt;li&gt;Integrate with external services and APIs&lt;/li&gt;
&lt;li&gt;Extend Kiro’s capabilities with domain-specific tools&lt;/li&gt;
&lt;li&gt;Create custom tools for your specific workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;a href="https://kiro.dev/docs/mcp/" rel="noopener noreferrer"&gt;https://kiro.dev/docs/mcp/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Getting Kiro
&lt;/h2&gt;

&lt;p&gt;As of early August 2025, Kiro is still in public preview with limited availability and a waiting list.&lt;/p&gt;

&lt;p&gt;Assuming you have been able to get Kiro from &lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;https://kiro.dev/&lt;/a&gt;, go through their &lt;a href="https://kiro.dev/docs/getting-started/" rel="noopener noreferrer"&gt;Get started guide&lt;/a&gt; to learn more about the basic concepts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Starting your first Kiro project
&lt;/h2&gt;

&lt;p&gt;Open a new folder to start a new project, and you are presented with the option to build in Vibe or Spec style.&lt;/p&gt;
&lt;h4&gt;
  
  
  Vibe
&lt;/h4&gt;

&lt;p&gt;Chat first, then build. Explore ideas and iterate as you discover needs.&lt;/p&gt;

&lt;p&gt;Great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid exploration and testing&lt;/li&gt;
&lt;li&gt;Building when requirements are unclear&lt;/li&gt;
&lt;li&gt;Implementing a task&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Spec
&lt;/h4&gt;

&lt;p&gt;Plan first, then build. Create requirements and design before coding starts.&lt;/p&gt;

&lt;p&gt;Great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thinking through features in depth&lt;/li&gt;
&lt;li&gt;Projects needing upfront planning&lt;/li&gt;
&lt;li&gt;Building features in a structured way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tpghs4yhl3rw1dw77ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tpghs4yhl3rw1dw77ia.png" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case I’ve used Amazon Q Developer in VS Code and CLI quite a lot for coding acceleration, so I went directly for Spec. Let’s get back to working with specifications after we have configured the remaining parts of our Kiro workspace.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up Kiro
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Model Context Protocol (MCP) servers
&lt;/h3&gt;

&lt;p&gt;Enriching your environment with relevant MCP servers can be a massive boost. Take a look at the official MCP servers from AWS at &lt;a href="https://github.com/awslabs/mcp" rel="noopener noreferrer"&gt;https://github.com/awslabs/mcp&lt;/a&gt;; there is already a ton available.&lt;/p&gt;

&lt;p&gt;The ones I currently enjoy in Kiro and Amazon Q Developer, with a focus on Terraform development and AWS, are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqyyfkpdftuqjcc7koc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqyyfkpdftuqjcc7koc4.png" width="348" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/core-mcp-server" rel="noopener noreferrer"&gt;AWS Core&lt;/a&gt; provides tools for prompt understanding and translation to AWS services&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/aws-documentation-mcp-server" rel="noopener noreferrer"&gt;AWS Docs&lt;/a&gt; and AWS Knowledge can read, search and recommend from the official, up-to-date AWS Documentation. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/aws-api-mcp-server" rel="noopener noreferrer"&gt;AWS API&lt;/a&gt; can suggest for you and call AWS CLI commands on your behalf.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/cdk-mcp-server" rel="noopener noreferrer"&gt;AWS CDK&lt;/a&gt; can provide guidance and generate CDK stacks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/terraform-mcp-server" rel="noopener noreferrer"&gt;AWS Terraform&lt;/a&gt; can search AWS and AWSCC provider docs, Execute Terraform and Terragrunt Commands, run Checkov scans and search user provided Terraform registry modules. Super valuable!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/aws-serverless-mcp-server" rel="noopener noreferrer"&gt;AWS Serverless&lt;/a&gt; can provide guidance, search schemas, deploy serverless applications, get metrics and so on.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/aws-diagram-mcp-server" rel="noopener noreferrer"&gt;AWS Diagram&lt;/a&gt; get generate diagrams, get diagram examples and list icons. You can generate architecture diagrams with official AWS icons, flow and sequence charts and so on. Content can be provided as a static image or in Draw.IO XML format, so that you can finishing up the final touches and corrections yourself.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/awslabs/mcp/blob/main/src/aws-pricing-mcp-server" rel="noopener noreferrer"&gt;AWS Pricing&lt;/a&gt; can analyze CDK and Terraform projects, query the official pricing API and generate cost reports, much more efficient than manually working with the AWS Cost Calculator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MCP configuration feature in Kiro supports two modes: User Config (global) and Workspace Config.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopyg6t449p54el7i1rby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopyg6t449p54el7i1rby.png" width="255" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;User Config&lt;/th&gt;&lt;th&gt;Workspace Config&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Config file location&lt;/td&gt;&lt;td&gt;~/.kiro/settings/mcp.json&lt;/td&gt;&lt;td&gt;my-kiro-project/.kiro/settings/mcp.json&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Application&lt;/td&gt;&lt;td&gt;Global, across all your Kiro projects&lt;/td&gt;&lt;td&gt;Local, in the current Kiro project&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Usage guidance&lt;/td&gt;&lt;td&gt;Keep common Kiro context configuration on your global system.&lt;/td&gt;&lt;td&gt;Keep all Kiro project context configuration within the project.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Personally I prefer Workspace Config, keeping all context in the project. This makes behavior easier and more predictable for colleagues, since everyone shares the same MCP server configuration. It also yields more consistent outputs: if team members have different context settings, results are not guaranteed to be consistent, which can lead to implementation differences and bugs.&lt;/p&gt;

&lt;p&gt;Here is my current Workspace MCP Config, which I have added to a common agent-steering-bootstrap Git repo:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;{
  "mcpServers": {
    "fetch": {
      "command": "uvx",
      "args": [
        "mcp-server-fetch"
      ],
      "env": {},
      "disabled": false,
      "autoApprove": []
    },
    "aws-docs": {
      "command": "uvx",
      "args": [
        "awslabs.aws-documentation-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "search_documentation",
        "read_documentation"
      ]
    },
    "aws-core": {
      "command": "uvx",
      "args": [
        "awslabs.core-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-api": {
      "command": "uvx",
      "args": [
        "awslabs.aws-api-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-knowledge-mcp-server": {
      "command": "uvx",
      "args": [
        "mcp-proxy",
        "--transport",
        "streamablehttp",
        "https://knowledge-mcp.global.api.aws"
      ],
      "disabled": false,
      "autoApprove": []
    },
    "aws-cdk": {
      "command": "uvx",
      "args": [
        "awslabs.cdk-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-terraform": {
      "command": "uvx",
      "args": [
        "awslabs.terraform-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-serverless": {
      "command": "uvx",
      "args": [
        "awslabs.aws-serverless-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "awslabs-diagram": {
      "command": "uvx",
      "args": [
        "awslabs.aws-diagram-mcp-server"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "get_diagram_examples",
        "generate_diagram"
      ]
    },
    "awslabs-pricing": {
      "command": "uvx",
      "args": [
        "awslabs.aws-pricing-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR",
        "AWS_PROFILE": "default",
        "AWS_REGION": "eu-west-1"
      },
      "disabled": false,
      "autoApprove": [
        "get_pricing_service_codes",
        "get_pricing_service_attributes",
        "get_pricing_attribute_values",
        "get_pricing"
      ]
    }
  }
}&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "fetch": {
      "command": "uvx",
      "args": [
        "mcp-server-fetch"
      ],
      "env": {},
      "disabled": false,
      "autoApprove": []
    },
    "aws-docs": {
      "command": "uvx",
      "args": [
        "awslabs.aws-documentation-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "search_documentation",
        "read_documentation"
      ]
    },
    "aws-core": {
      "command": "uvx",
      "args": [
        "awslabs.core-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-api": {
      "command": "uvx",
      "args": [
        "awslabs.aws-api-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-knowledge-mcp-server": {
      "command": "uvx",
      "args": [
        "mcp-proxy",
        "--transport",
        "streamablehttp",
        "https://knowledge-mcp.global.api.aws"
      ],
      "disabled": false,
      "autoApprove": []
    },
    "aws-cdk": {
      "command": "uvx",
      "args": [
        "awslabs.cdk-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-terraform": {
      "command": "uvx",
      "args": [
        "awslabs.terraform-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "aws-serverless": {
      "command": "uvx",
      "args": [
        "awslabs.aws-serverless-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    },
    "awslabs-diagram": {
      "command": "uvx",
      "args": [
        "awslabs.aws-diagram-mcp-server"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "get_diagram_examples",
        "generate_diagram"
      ]
    },
    "awslabs-pricing": {
      "command": "uvx",
      "args": [
        "awslabs.aws-pricing-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR",
        "AWS_PROFILE": "default",
        "AWS_REGION": "eu-west-1"
      },
      "disabled": false,
      "autoApprove": [
        "get_pricing_service_codes",
        "get_pricing_service_attributes",
        "get_pricing_attribute_values",
        "get_pricing"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
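&lt;p&gt;A typo in this JSON is easy to make. As a quick sanity check before committing, here is a small illustrative Python sketch (my own helper, not part of Kiro) that parses such a config and verifies each server entry defines a launcher command and an argument list:&lt;/p&gt;

```python
import json

# A trimmed sample of the workspace MCP configuration shown above
sample = '''{
  "mcpServers": {
    "aws-docs": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "disabled": false,
      "autoApprove": ["search_documentation", "read_documentation"]
    }
  }
}'''

def validate_mcp_config(text):
    """Return the names of server entries that define a command and an args list."""
    config = json.loads(text)
    names = []
    for name, server in config["mcpServers"].items():
        if "command" not in server:
            raise ValueError(f"{name}: missing 'command'")
        if not isinstance(server.get("args", []), list):
            raise ValueError(f"{name}: 'args' must be a list")
        names.append(name)
    return names

print(validate_mcp_config(sample))  # ['aws-docs']
```

&lt;p&gt;Running the same check against the real file before each push catches malformed entries early, instead of debugging why a server never appears in Kiro.&lt;/p&gt;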



&lt;h2&gt;
  
  
  Setting up agent steering context
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Steering gives Kiro persistent knowledge about your project through markdown files in &lt;code&gt;.kiro/steering/&lt;/code&gt;. Instead of explaining your conventions in every chat, steering files ensure Kiro consistently follows your established patterns, libraries, and standards.&lt;/p&gt;

&lt;p&gt;Key benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Code Generation&lt;/strong&gt;  – Every component, API endpoint, or test follows your team’s established patterns and conventions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Repetition&lt;/strong&gt;  – No need to explain project standards in each conversation. Kiro remembers your preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Alignment&lt;/strong&gt;  – All developers work with the same standards, whether they’re new to the project or seasoned contributors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Project Knowledge&lt;/strong&gt;  – Documentation that grows with your codebase, capturing decisions and patterns as your project evolves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;a href="https://kiro.dev/docs/steering/" rel="noopener noreferrer"&gt;https://kiro.dev/docs/steering/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs0n0uyoqiveqgudy5fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs0n0uyoqiveqgudy5fx.png" width="338" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqc2yx07n6afcab7hp2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqc2yx07n6afcab7hp2z.png" width="377" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The MCP steering file contains background information and guidance about how Kiro can leverage the active MCP servers.&lt;/p&gt;

&lt;p&gt;The Product file focuses on common product development context and principles.&lt;/p&gt;

&lt;p&gt;The Structure file focuses on how the files in your codebase are structured, to align with company standards and preferences.&lt;/p&gt;

&lt;p&gt;In the Tech file I define general patterns and principles for approaching solutions deployed to AWS with Terraform.&lt;/p&gt;

&lt;p&gt;Most companies have product development principles, architecture and software guidelines documented in their internal wikis. Getting this context into Kiro’s Agent Steering files is crucial for outputs that match company and team preferences.&lt;/p&gt;
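&lt;p&gt;To summarize, a Kiro workspace configured along the lines of this post ends up with a layout roughly like this (illustrative; your steering file names may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.kiro/
├── settings/
│   └── mcp.json          # Workspace MCP server configuration
├── steering/
│   ├── mcp.md            # How Kiro should leverage the active MCP servers
│   ├── product.md        # Product development context and principles
│   ├── structure.md      # Codebase structure conventions
│   └── tech.md           # Technology stack and patterns
└── specs/
    └── weather-forecast-app/
        └── requirements.md
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;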

&lt;p&gt;Example from tech.md:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;# Technology Stack

This document outlines the technical foundation and tooling for the project.

## Build System &amp;amp;amp; Tools
- CI/CD is out of scope of this Terraform module.

## Application Tech Stack
- Python3, Boto3, Jinja templating etc.
- Unit testing of core functionality
- Basic testing of Terraform code
- Provide /health endpoint for REST APIs
- Python operations should take place in a virtual environment where optimal Python version is installed with pyenv

## Infrastructure Tech Stack
- Terraform for Infrastructure-as-Code
- Terraform providers aws and awscc, if necessary
- Leverage community modules from https://github.com/terraform-aws-modules as relevant
- AWS Serverless architecture options are preferred for minimal operational overhead
- The AWS infrastructure is Well-Architected
- The AWS infrastructure is secure as per the latest CIS AWS Security Hub control standard
- Terraform code is unit tested Terraform's native testing framework, HCL-based tests.

## Observability
- For serverless components, AWS X-Ray is leveraged for tracing
- Logs are directed to AWS CloudWatch Logs. CloudWatch Logs groups have a retention period of 180 days. 
- A solution specific AWS Cloudwatch Dashboard which includes relevant CloudWatch metrics for reliability, performance and cost, in addition to a list over the last failing requests

### Pre-commit for Terraform
- Pre-commit is installed and leveraged for validation and formatting. 
  - terraform_fmt
  - terraform_docs in main README.md
  - check-merge-conflict
  - trailing-whitespace
  - mixed-line-ending

Example .pre-commit-config.yaml located in the root directory:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repo: &lt;a href="https://github.com/antonbabenko/pre-commit-terraform" rel="noopener noreferrer"&gt;https://github.com/antonbabenko/pre-commit-terraform&lt;/a&gt;
rev: v1.77.3
hooks:

&lt;ul&gt;
&lt;li&gt;id: terraform_fmt&lt;/li&gt;
&lt;li&gt;id: terraform_docs
args: ["--args=--sort-by required"]&lt;/li&gt;
&lt;li&gt;id: terraform_checkov
args:

&lt;ul&gt;
&lt;li&gt;--args=--quiet&lt;/li&gt;
&lt;li&gt;--args=--download-external-modules false&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;repo: &lt;a href="https://github.com/pre-commit/pre-commit-hooks" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://github.com/pre-commit/pre-commit-hooks" rel="noopener noreferrer"&gt;https://github.com/pre-commit/pre-commit-hooks&lt;/a&gt;
rev: v4.4.0
hooks:

&lt;ul&gt;
&lt;li&gt;id: check-merge-conflict&lt;/li&gt;
&lt;li&gt;id: trailing-whitespace
args: [--markdown-linebreak-ext=md]&lt;/li&gt;
&lt;li&gt;id: mixed-line-ending
args: ["--fix=lf"]
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Documentation
- AWS Labs Diagram MCP server is used to produce relevant architecture, flow and sequence diagrams, included in main README.md
- AWS Labs Pricing MCP server is used to perform a basic cost calculation of the solution, included in the main README.md
- Every task should also ensure relevant and clear documentation is created or up to date. Prefer simple and user friendly documentation, don't overcomplicate.
- All documentation follows markdown format and is stored in the `docs/` directory
- Architecture diagrams are generated programmatically using the AWS diagram MCP server
- Cost analysis documentation includes detailed breakdowns, usage projections, and optimization recommendations
- Documentation includes deployment guides, troubleshooting guides, and operational runbooks
- There should be an examples folder with README.md explaining how to include the Terraform module call in an existing CI/CD codebase.
- In documentation, provide TL;DR to make it easy and quick for developers to get up to speed. 
- In high level project documentation, include an executive summary for target group project owners, to articulate functionality and the value the solution provides.

## Cost Management
- AWS Labs Pricing MCP server provides accurate cost calculations for the infrastructure components of the solution.
- Cost analysis should include environment-specific projections (staging and production).
- Cost analysis should include AWS region comparison of eu-west-1, eu-central-1 and eu-north-1 for the production environment.
- CloudWatch cost metrics and dashboards provide real-time cost monitoring.
- A solution specific AWS Budget is deployed, based on infrastructure tag Key Service. Budget alerts prevents unexpected charges.
- Guidance is provided for the top three cost items that may increase with heavy production load. 
- Cost documentation is included in the main `README.md`

## Principles
- Favor KISS over complexity, simplicity over comprehensibility
- Respect and adopt well-known cloud based architecture and integration patterns

&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Technology Stack

This document outlines the technical foundation and tooling for the project.

## Build System &amp;amp; Tools
- CI/CD is out of scope of this Terraform module.

## Application Tech Stack
- Python3, Boto3, Jinja templating etc.
- Unit testing of core functionality
- Basic testing of Terraform code
- Provide /health endpoint for REST APIs
- Python operations should take place in a virtual environment where optimal Python version is installed with pyenv

## Infrastructure Tech Stack
- Terraform for Infrastructure-as-Code
- Terraform providers aws and awscc, if necessary
- Leverage community modules from https://github.com/terraform-aws-modules as relevant
- AWS Serverless architecture options are preferred for minimal operational overhead
- The AWS infrastructure is Well-Architected
- The AWS infrastructure is secure as per the latest CIS AWS Security Hub control standard
- Terraform code is unit tested with Terraform's native testing framework (HCL-based tests).

## Observability
- For serverless components, AWS X-Ray is leveraged for tracing
- Logs are directed to AWS CloudWatch Logs. CloudWatch Logs groups have a retention period of 180 days. 
- A solution specific AWS CloudWatch Dashboard which includes relevant CloudWatch metrics for reliability, performance and cost, in addition to a list of the most recent failing requests

### Pre-commit for Terraform
- Pre-commit is installed and leveraged for validation and formatting. 
  - terraform_fmt
  - terraform_docs in main README.md
  - check-merge-conflict
  - trailing-whitespace
  - mixed-line-ending

Example .pre-commit-config.yaml located in the root directory:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.3
    hooks:
      - id: terraform_fmt
      - id: terraform_docs
        args: ["--args=--sort-by required"]
      - id: terraform_checkov
        args:
          - --args=--quiet
          - --args=--download-external-modules false
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-merge-conflict
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
      - id: mixed-line-ending
        args: ["--fix=lf"]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
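&lt;p&gt;With the configuration file in place, the hooks are installed and exercised like this (assuming pre-commit itself is already installed, for example via pip or pipx):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install the Git hook scripts defined in .pre-commit-config.yaml
pre-commit install

# Run all hooks against all files once, e.g. before the first commit
pre-commit run --all-files
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;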

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Documentation
- AWS Labs Diagram MCP server is used to produce relevant architecture, flow and sequence diagrams, included in main README.md
- AWS Labs Pricing MCP server is used to perform a basic cost calculation of the solution, included in the main README.md
- Every task should also ensure relevant and clear documentation is created or up to date. Prefer simple and user friendly documentation, don't overcomplicate.
- All documentation follows markdown format and is stored in the `docs/` directory
- Architecture diagrams are generated programmatically using the AWS diagram MCP server
- Cost analysis documentation includes detailed breakdowns, usage projections, and optimization recommendations
- Documentation includes deployment guides, troubleshooting guides, and operational runbooks
- There should be an examples folder with README.md explaining how to include the Terraform module call in an existing CI/CD codebase.
- In documentation, provide TL;DR to make it easy and quick for developers to get up to speed. 
- In high level project documentation, include an executive summary for target group project owners, to articulate functionality and the value the solution provides.

## Cost Management
- AWS Labs Pricing MCP server provides accurate cost calculations for the infrastructure components of the solution.
- Cost analysis should include environment-specific projections (staging and production).
- Cost analysis should include AWS region comparison of eu-west-1, eu-central-1 and eu-north-1 for the production environment.
- CloudWatch cost metrics and dashboards provide real-time cost monitoring.
- A solution specific AWS Budget is deployed, based on infrastructure tag Key Service. Budget alerts prevent unexpected charges.
- Guidance is provided for the top three cost items that may increase with heavy production load. 
- Cost documentation is included in the main `README.md`

## Principles
- Favor KISS over complexity, simplicity over comprehensibility
- Respect and adopt well-known cloud based architecture and integration patterns

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing your product spec
&lt;/h2&gt;

&lt;p&gt;Product specs, or specifications, are structured artifacts that formalize the development process. They provide a systematic approach to transform high-level ideas into detailed implementation plans with clear tracking and accountability.&lt;/p&gt;

&lt;p&gt;The workflow is illustrated below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8sk2x4xwbhvpqyyzhnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8sk2x4xwbhvpqyyzhnn.png" width="584" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the Kiro pane, click the &lt;code&gt;+&lt;/code&gt; button under  &lt;strong&gt;Specs&lt;/strong&gt;. Alternatively, choose  &lt;strong&gt;Spec&lt;/strong&gt;  from the chat pane.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvntspi3diwkberyqo5to.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvntspi3diwkberyqo5to.png" width="261" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Describe your project idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l3tksc2v8eolt0c0dy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l3tksc2v8eolt0c0dy1.png" width="647" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A requirements Markdown file is created in a folder named after the spec, &lt;code&gt;weather-forecast-app&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyqwep1j4xgxm4t0ebb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyqwep1j4xgxm4t0ebb.png" width="800" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Define requirements
&lt;/h4&gt;

&lt;p&gt;The requirements.md file should define user stories with acceptance criteria in &lt;a href="https://alistairmavin.com/ears/" rel="noopener noreferrer"&gt;EARS&lt;/a&gt; notation, similar to common agile development practice. Define what we would like to achieve and which problems we propose to solve. How we plan to solve it comes afterwards, in the Design phase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://alistairmavin.com/ears/" rel="noopener noreferrer"&gt;EARS&lt;/a&gt;, which stands for Easy Approach to Requirements Syntax, is a method for writing clear and unambiguous requirements using a structured set of rules and keywords. Alistair Mavin and colleagues at Rolls-Royce PLC developed EARS whilst analysing the airworthiness regulations for a jet engine’s control system. The structured format makes it easy to understand what is expected, reducing misinterpretations. Clearer requirements lead to better test cases and easier verification of application functionality.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;WHEN [condition/event]
THE SYSTEM SHALL [expected behavior]&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHEN [condition/event]
THE SYSTEM SHALL [expected behavior]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
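&lt;p&gt;The WHEN pattern shown above is the event-driven EARS template. EARS defines a small set of such templates; paraphrased from the EARS material, the remaining common ones are:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THE SYSTEM SHALL [expected behavior]                          (ubiquitous)
WHILE [state] THE SYSTEM SHALL [expected behavior]            (state-driven)
IF [unwanted trigger], THEN THE SYSTEM SHALL [response]       (unwanted behavior)
WHERE [feature is included] THE SYSTEM SHALL [behavior]       (optional feature)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;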



&lt;p&gt;Start writing your user stories as acceptance criteria in EARS format. Remember, as described in the Introduction: the more complete the context you provide, the better. Be as specific as you can, and include the conventions your team carries in their heads and has learned by experience. Investing more time in producing well-crafted specifications can reduce time spent on modifications and troubleshooting.&lt;/p&gt;

&lt;p&gt;Remember, common principles and guidelines are defined as Agent Steering Context. Product Spec focuses on the functionality of the application. As the system grows you can create additional specifications and manage requirements in logical separation.&lt;/p&gt;

&lt;p&gt;This is how the &lt;a href="https://github.com/haakond/terraform-aws-weather-forecast/blob/main/.kiro/specs/weather-forecast-app/requirements.md" rel="noopener noreferrer"&gt;.kiro/specs/weather-forecast-app/requirements.md&lt;/a&gt; for the example weather forecast application looks like:&lt;/p&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requirements Document

## Introduction

This specification will create a weather forecast application, to be deployed with Terraform on AWS serverless infrastructure.

### Requirement 1

**User Story:** As an end user, I will access a web site which compares the weather forecast for tomorrow for the European cities Oslo (Norway), Paris (France), London (United Kingdom) and Barcelona (Spain).

#### Acceptance Criteria

1. WHEN an end user is accessing the service THEN the system SHALL display a simple web site with a fancy design for the weather forecast for the cities as described in the User Story
2. WHEN an end user is accessing the service THEN the system SHALL be snappy and respond fast
3. WHEN an end user is accessing the service on a mobile device THEN the design SHALL be optimized for display on a small screen
4. WHEN static content is served to end users THEN the system SHALL set Cache-Control headers with Max-Age of 900 seconds (15 minutes) to optimize performance and reduce server load

### Requirement 2

**User Story:** As a developer, my application requirements are as follows:

#### Acceptance Criteria

1. WHEN the weather-forecast-app is generated THEN the system SHALL provide a modern front-end application
2. WHEN the weather-forecast-app is generated THEN the system SHALL look up weather forecasts from https://api.met.no/weatherapi/locationforecast/2.0/documentation and cache the results for 1 hour
3. WHEN the weather-forecast-app is generated THEN the system SHALL respect the Terms of Service defined at https://developer.yr.no/doc/TermsOfService/
4. WHEN the weather-forecast-app is generated THEN the system SHALL be tested
5. WHEN the Lambda function successfully retrieves weather data from the backend API THEN the system SHALL set cache-control: max-age=60 on the HTTP response
6. WHEN the Lambda function fails to retrieve weather data from the backend API THEN the system SHALL set cache-control: max-age=0 on the HTTP response
7. WHEN the frontend displays weather data THEN the system SHALL show the Last updated timestamp from the lastUpdated field in the API response
8. WHEN weather data is cached in DynamoDB and the weather API does not provide timestamp information THEN the system SHALL use the DynamoDB cache timestamp as the lastUpdated value in the API response

### Requirement 3

**User Story:** As a developer, my cloud infrastructure requirements are as follows:

#### Acceptance Criteria

1. WHEN the infrastructure is generated THEN the codebase SHALL be organized as one, self-contained Terraform module
2. WHEN the infrastructure is generated THEN the system SHALL require basic unit tests and infrastructure-as-code validation to be successful
3. WHEN the infrastructure is deployed THEN the system SHALL create AWS resources with appropriate tags like Service:weather-forecast-app.
4. WHEN the infrastructure is deployed THEN the system SHALL package and deploy the weather-forecast-app code
5. WHEN the infrastructure is deployed THEN the system SHALL provide accessible endpoints for testing
6. WHEN the infrastructure is deployed THEN the system SHALL include required IAM roles and permissions
7. WHEN the infrastructure is deployed THEN the system SHALL output relevant URLs or connection information
8. WHEN the infrastructure is deployed THEN the system SHALL be configured for high availability
9. WHEN the CloudFront distribution is deployed THEN the system SHALL use price class 100 to optimize costs while covering Europe and the United States edge locations
10. WHEN the CloudFront distribution is deployed THEN the system SHALL allow only GET, HEAD, and OPTIONS HTTP methods for security and performance optimization
11. WHEN the CloudFront distribution is deployed THEN the system SHALL configure caching policies based on query parameters to optimize cache efficiency
12. WHEN the CloudFront distribution is deployed THEN the system SHALL set the default TTL to 900 seconds (15 minutes) to align with static content caching requirements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
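&lt;p&gt;Acceptance criteria written this way translate almost mechanically into code and tests. As a hypothetical illustration (not the article’s actual implementation), criteria 2.5 and 2.6 above could reduce to something like:&lt;/p&gt;

```python
# Hypothetical sketch of acceptance criteria 2.5 and 2.6:
# successful weather lookups are cacheable for 60 seconds,
# failed lookups must not be cached.
def cache_control_header(lookup_succeeded):
    return "max-age=60" if lookup_succeeded else "max-age=0"

def build_response(status_code, body, lookup_succeeded):
    # Shape loosely follows a Lambda proxy integration response.
    return {
        "statusCode": status_code,
        "headers": {"cache-control": cache_control_header(lookup_succeeded)},
        "body": body,
    }

print(build_response(200, "{}", True)["headers"]["cache-control"])   # max-age=60
print(build_response(502, "{}", False)["headers"]["cache-control"])  # max-age=0
```

&lt;p&gt;Each EARS criterion then maps one-to-one to an assertion in a unit test, which is exactly what makes the format verifiable.&lt;/p&gt;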



&lt;p&gt;If you prefer, you can write your requirements the way you are used to and then click &lt;em&gt;Refine&lt;/em&gt; to have Kiro reformat them into EARS, but I would say it’s good team practice to align on EARS from the start.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Define design
&lt;/h4&gt;

&lt;p&gt;Switch to the design tab and click &lt;em&gt;Refine&lt;/em&gt; to generate a design specification based on the defined requirements, merged with the Agent Steering context configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7i1zp91rya4350ppurc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7i1zp91rya4350ppurc.png" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Kiro starts working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxumgk0hxkde1udq557f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxumgk0hxkde1udq557f1.png" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kiro has now populated design.md. When I take a closer look, I don’t see anything addressing the Weather API Terms of Service rule: you must identify yourself by setting a custom HTTP User-Agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshlyxznld20l0sxri1b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshlyxznld20l0sxri1b1.png" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can have Kiro help refine and add more requirements as you go along.&lt;/p&gt;
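&lt;p&gt;For context, the identification rule boils down to sending a custom User-Agent on every request. A minimal, hypothetical Python sketch using urllib, as the implementation plan suggests (the application name and COMPANY_WEBSITE value are my placeholders, not the article’s code):&lt;/p&gt;

```python
import urllib.parse
import urllib.request

# met.no's Terms of Service require clients to identify themselves.
# COMPANY_WEBSITE stands in for the configurable value from the plan.
COMPANY_WEBSITE = "example.com"

def build_forecast_request(lat, lon):
    query = urllib.parse.urlencode({"lat": lat, "lon": lon})
    url = "https://api.met.no/weatherapi/locationforecast/2.0/compact?" + query
    # The custom User-Agent identifies the application and its operator.
    return urllib.request.Request(
        url,
        headers={"User-Agent": "weather-forecast-app " + COMPANY_WEBSITE},
    )

req = build_forecast_request(59.91, 10.75)  # Oslo
print(req.get_header("User-agent"))  # weather-forecast-app example.com
```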

&lt;p&gt;Now the design looks good to me, so let’s move on to generate the implementation plan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b3sl9qma2c7ddnpnr4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b3sl9qma2c7ddnpnr4x.png" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Weather Forecast App Design Document

## Overview

The weather forecast application is a serverless web application that displays tomorrow's weather forecast for four European cities: Oslo (Norway), Paris (France), London (United Kingdom), and Barcelona (Spain). The application will be deployed on AWS using Terraform infrastructure-as-code and will integrate with the Norwegian Meteorological Institute's weather API.

### Key Design Principles
- **Serverless-first architecture** for minimal operational overhead
- **Mobile-responsive design** for optimal user experience across devices
- **Fast response times** through efficient caching and CDN distribution
- **Well-architected AWS infrastructure** following security and reliability best practices

## Architecture

### High-Level Architecture
The application follows a serverless architecture pattern with the following components:

1. **Frontend** : Static web application hosted on S3 with CloudFront distribution
2. **Backend API** : AWS Lambda functions for weather data processing
3. **Data Layer** : DynamoDB for caching weather data and API rate limiting
4. **External Integration** : Norwegian Meteorological Institute API (api.met.no)

### Architecture Rationale
- **Static hosting with S3/CloudFront** : Provides fast global content delivery and handles traffic spikes efficiently
- **Lambda functions** : Serverless compute eliminates server management and scales automatically
- **DynamoDB** : NoSQL database perfect for caching weather data with TTL capabilities
- **API Gateway** : Provides managed API endpoints with built-in throttling and monitoring

## SNIP END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The design grew to more than 300 lines, so I’m only including a snippet here. You can review the complete file here: &lt;a href="https://github.com/haakond/terraform-aws-weather-forecast/blob/main/.kiro/specs/weather-forecast-app/design.md" rel="noopener noreferrer"&gt;.kiro/specs/weather-forecast-app/design.md&lt;/a&gt;&lt;/p&gt;
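&lt;p&gt;One detail worth noting in the design is the DynamoDB cache with TTL capabilities. A minimal, hypothetical sketch of what such a cache item could look like (the attribute names are my assumptions, not the article’s code):&lt;/p&gt;

```python
import time

# Hypothetical sketch of the design's 1-hour DynamoDB cache: each item
# carries a "ttl" attribute (epoch seconds) that DynamoDB's TTL feature
# uses to expire stale forecasts in the background.
CACHE_TTL_SECONDS = 3600

def make_cache_item(city_id, forecast, now=None):
    now = int(time.time() if now is None else now)
    return {
        "city_id": city_id,              # partition key (assumed name)
        "forecast": forecast,
        "cached_at": now,                # fallback for the lastUpdated field
        "ttl": now + CACHE_TTL_SECONDS,  # DynamoDB TTL attribute
    }

item = make_cache_item("oslo", {"temp_c": 4}, now=1700000000)
print(item["ttl"] - item["cached_at"])  # 3600
```

&lt;p&gt;With TTL enabled on the ttl attribute, DynamoDB removes expired items automatically, so the Lambda handler only has to decide between a cache hit and a fresh API call.&lt;/p&gt;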

&lt;p&gt;This looks good to me, so let’s proceed to generate the implementation plan.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Implement
&lt;/h4&gt;

&lt;p&gt;Kiro has now generated a task list for implementation based on the previous steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7cgjflshii708d1rq3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7cgjflshii708d1rq3a.png" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/haakond/terraform-aws-weather-forecast/blob/main/.kiro/specs/weather-forecast-app/tasks.md" rel="noopener noreferrer"&gt;.kiro/specs/weather-forecast-app/tasks.md&lt;/a&gt; now looks like this:&lt;/p&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Implementation Plan

- [] 1. Set up project structure and configuration
  - Create Terraform module directory structure following best practices
  - Set up Python virtual environment with pyenv for application development
  - Configure pre-commit hooks for Terraform validation and formatting
  - Create basic project documentation structure with docs/ directory
  - _Requirements: 3.1, 3.2_

- [] 2. Implement simplified Lambda weather service
  - [] 2.1 Create embedded weather service in Lambda handler
    - Implement weather data fetching directly in lambda_handler.py using urllib
    - Define city configuration with coordinates for Oslo, Paris, London, Barcelona
    - Create weather data processing and transformation logic embedded in handler
    - Implement proper User-Agent header with configurable company website
    - Add rate limiting with simple delays between API calls
    - _Requirements: 2.2, 2.3, 1.1_

  - [] 2.2 Implement weather API integration and processing
    - Create fetch_weather_data function for met.no API calls
    - Implement extract_tomorrow_forecast for parsing API responses
    - Add weather condition mapping and error handling
    - Create process_city_weather for individual city processing
    - Implement get_weather_summary for all cities with delay between calls
    - _Requirements: 2.2, 1.2_

- [] 3. Build Lambda function infrastructure
  - [] 3.1 Create simplified Lambda function handler
    - Implement main Lambda handler with embedded weather service
    - Add environment variable configuration for company website
    - Implement proper error handling and logging with standardized responses
    - Create /health endpoint for monitoring with environment information
    - Add CORS support and OPTIONS request handling
    - _Requirements: 2.1, 2.4, 3.6_

  - [] 3.2 Add DynamoDB caching to simplified Lambda handler
    - Implement DynamoDB caching directly in the lambda_handler.py file
    - Add cache check before API calls and cache storage after successful API responses
    - Implement 1-hour TTL (3600 seconds) for cached weather data
    - Add error handling for DynamoDB operations with fallback to API calls
    - Use boto3 client for DynamoDB operations embedded in the handler
    - _Requirements: 2.2, 1.2, 3.6_

- [] 4. Update Terraform infrastructure for simplified approach
  - [] 4.1 Maintain DynamoDB table configuration for caching
    - Keep existing DynamoDB table from Terraform backend module
    - Maintain DynamoDB-related IAM permissions for Lambda role
    - Ensure DynamoDB table name is passed to Lambda via environment variable
    - Keep TTL configuration for 1-hour cache expiration
    - Maintain existing tests for DynamoDB validation
    - _Requirements: 3.1, 3.6, 3.8_

  - [] 4.2 Update Lambda function Terraform module for simplified deployment
    - Update Terraform configuration for simplified Lambda function
    - Maintain DynamoDB environment variables (table name) and permissions
    - Keep COMPANY_WEBSITE environment variable configuration
    - Maintain X-Ray tracing and CloudWatch logging
    - Keep IAM role with DynamoDB permissions for caching
    - _Requirements: 3.1, 3.4, 3.6, 3.8_

  - [] 4.3 Maintain API Gateway configuration
    - Keep existing API Gateway REST API configuration
    - Maintain CORS settings and rate limiting
    - Keep Lambda integration with proper error handling
    - Maintain CloudWatch logging configuration
    - Keep existing tests for the API Gateway setup
    - _Requirements: 3.1, 3.5, 3.6_

- [] 5. Create frontend application
  - [] 5.1 Build responsive weather display components
    - Create React components for weather card display
    - Implement responsive grid layout for four cities
    - Add loading states and error handling UI
    - Ensure mobile-optimized design with proper breakpoints
    - _Requirements: 1.1, 1.2, 1.3, 2.1_

  - [] 5.2 Implement API integration and state management
    - Create API client for backend weather service
    - Implement data fetching with error handling and retries
    - Add browser-side caching strategy respecting 1-hour backend cache
    - Create loading and error state management
    - _Requirements: 1.2, 2.1, 2.2_

  - [] 5.3 Add weather icons and styling
    - Implement weather condition icon mapping
    - Create CSS styling for responsive design
    - Add animations and transitions for better UX
    - Ensure accessibility compliance (WCAG)
    - _Requirements: 1.1, 1.3, 2.1_

  - [] 5.4 Optimize frontend build for caching
    - Configure build process to generate static assets optimized for 15-minute caching
    - Ensure proper file naming and versioning for cache busting when needed
    - Validate that all static assets (HTML, CSS, JS, images) are properly configured
    - _Requirements: 1.2, 1.4_

- [] 6. Configure static hosting infrastructure
  - [] 6.1 Create S3 bucket for static hosting
    - Implement Terraform module for S3 bucket configuration
    - Configure bucket policies for static website hosting
    - Set up versioning and lifecycle policies
    - Add proper IAM permissions for deployment
    - Write basic tests for the S3 configuration
    - _Requirements: 3.1, 3.4, 3.6_

  - [] 6.2 Set up CloudFront distribution
    - Create Terraform module for CloudFront CDN
    - Configure cache behaviors and TTL settings
    - Set up origin failover for high availability
    - Add security headers and HTTPS redirection
    - Write basic tests for the CloudFront configuration
    - _Requirements: 1.2, 3.1, 3.8_

  - [] 6.3 Configure Cache-Control headers for static content
    - Configure S3 bucket metadata to set Cache-Control: max-age=900 for all static assets
    - Update CloudFront cache behaviors to respect and forward Cache-Control headers
    - Ensure consistent 15-minute caching for HTML, CSS, JavaScript, and image files
    - _Requirements: 1.2, 1.4_

  - [] 6.4 Configure CloudFront price class and optimization settings
    - Update CloudFront distribution to use price class 100 (PriceClass_100)
    - Configure allowed HTTP methods to GET, HEAD, and OPTIONS only
    - Set up caching policy configuration based on query parameters
    - Configure default TTL to 900 seconds (15 minutes)
    - Ensure coverage includes Europe and United States edge locations
    - Validate cost optimization while maintaining performance for target regions
    - Update Terraform configuration with appropriate price_class, allowed_methods, and caching parameters
    - Test CloudFront distribution functionality with new configuration
    - _Requirements: 3.9, 3.10, 3.11, 3.12_

- [] 7. Implement monitoring and observability
  - [] 7.1 Create simple and intuitive CloudWatch dashboard and alarms
    - Implement Terraform module for CloudWatch dashboard
    - Configure the most important alarms for Lambda errors, API Gateway 5xx, and DynamoDB throttling
    - Set up custom metrics for weather API success rates
    - Add log retention policies (180 days)
    - _Requirements: 3.6, 3.7_

  - [] 7.2 Set up AWS Budget and cost monitoring
    - Create Terraform module for AWS Budget with Service tag filter
    - Configure budget alerts for cost thresholds
    - Implement simple and intuitive cost monitoring CloudWatch dashboard
    - _Requirements: 3.3, 3.7_

- [] 8. Create deployment and testing automation
  - [] 8.1 Implement Terraform module packaging
    - Create main Terraform module with all sub-modules
    - Configure variable definitions and outputs
    - Add module documentation with terraform-docs
    - Create examples/ directory with usage examples
    - _Requirements: 3.1, 3.2, 3.7_

  - [] 8.2 Add basic integration and end-to-end tests
    - Create integration tests for complete weather data flow
    - Implement end-to-end tests for user journey with CloudWatch synthetics
    - Create basic infrastructure deployment tests
    - Write basic test automation scripts with cleanup
    - _Requirements: 2.4, 3.2_

  - [] 8.3 Add cache header validation tests
    - Create automated tests to verify Cache-Control headers are properly set
    - Test that static assets return max-age=900 in response headers
    - Validate cache behavior across different asset types (HTML, CSS, JS, images)
    - _Requirements: 1.4, 2.4_

  - [] 8.4 Fix CI/CD deployment path issues
    - Resolve frontend build path problems in CI/CD environments where working directory structure differs
    - Update Terraform frontend module to handle different working directory structures and missing directories
    - Add proper error handling and path validation for frontend build process
    - Ensure frontend directory and package.json are found correctly in CI/CD pipelines
    - Test build process works in both local development and CI/CD environments
    - _Requirements: 3.1, 3.4_

  - [] 8.5 Update unit tests for simplified Lambda implementation
    - Update existing unit tests to work with the simplified embedded Lambda handler
    - Remove tests for separate weather service modules (api_client, cache, processor, etc.)
    - Create focused tests for the main Lambda handler functions
    - Test weather data fetching, processing, response formatting, and DynamoDB caching
    - Ensure tests cover error handling, cache hits/misses, and edge cases
    - _Requirements: 2.4, 3.2_

  - [] 8.6 Add frontend error loop prevention safeguards
    - Implement circuit breaker pattern in useWeatherData hook to prevent infinite retry loops
    - Add exponential backoff with maximum delay caps for failed requests
    - Implement request rate limiting to prevent rapid successive API calls on errors
    - Add error threshold detection to disable auto-retry after consecutive failures
    - Create user-friendly error states that prevent automatic retry loops
    - _Requirements: 1.2, 2.1_

  - [] 8.7 Configure reasonable Lambda concurrency limits
    - Set Lambda reserved concurrency to 5 concurrent executions (reasonable for weather API)
    - Update backend module variables to reflect appropriate concurrency limits
    - Add documentation explaining concurrency limits and cost implications
    - Ensure concurrency limits prevent runaway costs while maintaining service availability
    - Test concurrency limits under load to ensure proper throttling behavior
    - _Requirements: 3.6, 3.8_

  - [] 8.8 Implement dynamic cache-control headers in Lambda function
    - Update Lambda handler to set cache-control: max-age=60 for successful weather API responses
    - Set cache-control: max-age=0 for failed weather API responses or error conditions
    - Ensure cache-control headers are properly included in HTTP response headers
    - Test cache-control behavior for both success and failure scenarios
    - _Requirements: 2.5, 2.6_

  - [] 8.9 Implement lastUpdated timestamp handling in Lambda function
    - Update Lambda handler to include lastUpdated timestamp in all API responses
    - Use weather API timestamp when available in the met.no API response
    - Fall back to DynamoDB cache timestamp when weather API timestamp is not provided
    - Ensure timestamp is in ISO 8601 format for consistent frontend display
    - Test timestamp handling for both fresh API calls and cached responses
    - _Requirements: 2.7, 2.8_

  - [] 8.10 Update frontend to display lastUpdated timestamp
    - Modify weather display components to show the lastUpdated timestamp from API responses
    - Format timestamp for user-friendly display (e.g., "Last updated: 2 minutes ago")
    - Handle cases where lastUpdated is null or missing
    - Ensure timestamp display is responsive and accessible
    - _Requirements: 2.7_

- [] 9. Generate documentation and cost analysis
  - [] 9.1 Create architecture diagrams
    - Generate AWS architecture diagram using MCP diagram server
    - Create sequence diagrams for weather data flow
    - Add deployment flow diagrams
    - Include diagrams in main README.md
    - _Requirements: 3.7_

  - [] 9.2 Perform cost analysis and optimization
    - Use AWS Labs Pricing MCP server for cost calculations
    - Compare costs across eu-west-1, eu-central-1, eu-north-1 regions
    - Create cost projections for staging and production environments
    - Document top three cost optimization opportunities
    - Include cost analysis in main README.md
    - _Requirements: 3.7_

- [] 10. Finalize project documentation
  - Create crisp and clear README.md with TL;DR section
  - Add executive summary for project stakeholders
  - Create basic deployment guide and troubleshooting documentation
  - Write operational runbooks for maintenance
  - Add basic examples for CI/CD integration and how to configure relevant variables
  - _Requirem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
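&lt;p&gt;As a rough illustration of what task 2.2 above asks for, a forecast-extraction helper could look like this. This is a hypothetical sketch, not the generated code: the function name and the response field names are assumptions based on met.no’s compact locationforecast format, so verify them against the actual API response.&lt;/p&gt;

```python
from datetime import date, timedelta

def extract_tomorrow_forecast(payload, tomorrow=None):
    """Pick the first forecast entry for tomorrow from a met.no-style
    compact payload (field names are assumptions; verify against the API)."""
    if tomorrow is None:
        tomorrow = (date.today() + timedelta(days=1)).isoformat()
    for entry in payload.get("properties", {}).get("timeseries", []):
        # Timestamps are ISO 8601, so a date prefix match finds the day.
        if entry.get("time", "").startswith(tomorrow):
            details = entry["data"]["instant"]["details"]
            summary = entry["data"].get("next_6_hours", {}).get("summary", {})
            return {
                "temperature": details.get("air_temperature"),
                "condition": summary.get("symbol_code", "unknown"),
            }
    return None  # no entries for tomorrow in the payload

# Minimal synthetic payload for demonstration:
sample = {"properties": {"timeseries": [
    {"time": "2025-08-12T12:00:00Z",
     "data": {"instant": {"details": {"air_temperature": 18.3}},
              "next_6_hours": {"summary": {"symbol_code": "cloudy"}}}},
]}}
forecast = extract_tomorrow_forecast(sample, tomorrow="2025-08-12")
```

&lt;p&gt;The real handler would additionally fetch the payload with urllib, send the configurable User-Agent header, and sleep briefly between cities for rate limiting, per tasks 2.1 and 2.2.&lt;/p&gt;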



&lt;p&gt;In the IDE you will see an option to start the tasks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd3y2917fuu9v12sxtu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd3y2917fuu9v12sxtu5.png" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can either trigger them one by one or ask Kiro in the chat to get started.&lt;/p&gt;

&lt;p&gt;Let’s start the first task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyj4qwp5srqb1qce6v1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyj4qwp5srqb1qce6v1j.png" width="596" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vw5vf7jx51plvdthapr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vw5vf7jx51plvdthapr.png" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that Kiro is setting up the structure according to what’s stated on the external website &lt;a href="http://terraform-best-practices.com" rel="noopener noreferrer"&gt;terraform-best-practices.com&lt;/a&gt;, as requested. Nice.&lt;/p&gt;

&lt;p&gt;Like with Amazon Q Developer, you can approve or let Kiro trust specific tools and commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx69mhg425ajrf0q0acy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx69mhg425ajrf0q0acy.png" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a short while the first task is complete!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm9crsv1mftobi5fqe33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm9crsv1mftobi5fqe33.png" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I did note that Terraform AWS provider major version 5 was installed, while the latest release is 6.7.0, so I updated the Tech Steering specification and asked Kiro to refresh.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1866g5zkgurt0r3vsg9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1866g5zkgurt0r3vsg9a.png" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kiro refreshed the context guidelines, found what to update, performed the changes, and ran checks and pre-commit to verify everything worked as expected.&lt;/p&gt;

&lt;p&gt;Then we move on to Task 2: Implement core Python service, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz05e3eddyravi0lawny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz05e3eddyravi0lawny.png" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what the implementation summary looks like for task 7.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favbgn0jxpx7f2mtx4jho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favbgn0jxpx7f2mtx4jho.png" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the main difference from the traditional vibe/CLI approach is that the spec-driven workflow keeps the steering and requirements up to date, persisting the context. This makes the process a lot more predictable and possible to collaborate on in a team.&lt;/p&gt;

&lt;p&gt;Keeping the specifications version controlled along with the application codebase makes it easy to track changes as they are committed.&lt;/p&gt;

&lt;p&gt;Depending on your team development workflow, each feature could be organized as a Spec, containing relevant user stories.&lt;/p&gt;

&lt;p&gt;Kiro generated a comprehensive local testing suite. It turned out to be more complex than I think is necessary, with some tests being flaky. I asked Kiro to focus on testing the core functionality and remove brittle and complex tests. A reflection here is that I did not specify the desired test approach in detail in my steering context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9svkb8prnonms4tmctgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9svkb8prnonms4tmctgd.png" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the diagrams the AWS Diagrams MCP server helped create:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnpqla0bzjo6soyyxrdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnpqla0bzjo6soyyxrdz.png" width="800" height="731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xh1l36p8olch65nbui0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xh1l36p8olch65nbui0.png" width="290" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the final solution
&lt;/h2&gt;

&lt;p&gt;I included the module definition in my existing Github Actions CI/CD Terraform codebase:&lt;/p&gt;








&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "weather_forecast_app" {
  source = "git::https://github.com/haakond/terraform-aws-weather-forecast.git?ref=COMMIT-SHA"
  project_name = "weather-forecast-app"
  environment = "prod"
  aws_region = "eu-west-1"
  weather_service_identification_domain = "youramazingwebsite.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I experienced a few Terraform errors that Kiro wasn’t able to catch before terraform plan and apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Logs groups defined with the same name in two sub-modules&lt;/li&gt;
&lt;li&gt;Missing API Gateway Account configuration for CloudWatch Logs&lt;/li&gt;
&lt;li&gt;Missing deploy process for the React frontend to Amazon S3

&lt;ul&gt;
&lt;li&gt;Since this is a fairly static app, I decided to keep it with the infrastructure code for simplicity.&lt;/li&gt;
&lt;li&gt;For production frontend applications I would set up a dedicated CI/CD pipeline to deploy only the frontend codebase. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;An overly complex React frontend application, which I asked Kiro to simplify.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;With some additional prompting assistance from Kiro I was able to resolve the issues and end up with a fully working deployment!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawkmuazmrkrljtb9s7ch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawkmuazmrkrljtb9s7ch.png" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncxh8z9zi5sqyl44q21p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncxh8z9zi5sqyl44q21p.png" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;End result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q2ykfggwix67y8t6ipt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q2ykfggwix67y8t6ipt.png" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow for adding a new feature
&lt;/h3&gt;

&lt;p&gt;Here’s one suggested approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;git clone into a new feature branch&lt;/li&gt;
&lt;li&gt;Specify

&lt;ul&gt;
&lt;li&gt;For a major new feature: create a new Specification (requirements, design, tasks)&lt;/li&gt;
&lt;li&gt;For a minor improvement: Incorporate into an existing Specification&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Implement requirements&lt;/li&gt;

&lt;li&gt;Create pull request&lt;/li&gt;

&lt;li&gt;Review and merge to main&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learnings and key takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtqlaxy492og335cwwje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtqlaxy492og335cwwje.png" width="538" height="34"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you make manual changes outside of Kiro, the specs (Requirements, Design, Tasks) will deviate and confuse Kiro. Stick to the Kiro workflow and ask Kiro to refresh what you did; Kiro will backport the changes into the specs.

&lt;ul&gt;
&lt;li&gt;If during the preview period you keep hitting Kiro’s usage limit, a workaround can be to have Amazon Q Developer help you when troubleshooting, to save interaction tokens. Just make sure to tell Kiro which areas have changed so the specs stay up to date.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Separate product feature specifications from common steering context.&lt;/li&gt;
&lt;li&gt;Kiro has a tendency to add more comprehensive and complex testing procedures and documentation than a human would appreciate. Be explicit about keeping things simple and focus on the core functionality.&lt;/li&gt;
&lt;li&gt;To ensure the suggested tests are relevant and follow your company practices, add a detailed Agent Steering document for testing.&lt;/li&gt;
&lt;li&gt;Organize specs by feature, to be able to work independently without conflicts or affecting other areas. This can also reduce the blast radius in case of unexpected changes, plus it reduces the context size for the agent. &lt;/li&gt;
&lt;li&gt;Keep specs and user stories in version control along with your application. If you perform traditional manual changes, give Kiro a hint to have the design and requirements updated accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Reflections on context
&lt;/h3&gt;

&lt;p&gt;Traditionally, the environment-specific context is a company’s tech stack, policies, and guidelines, combined with the knowledge of experienced software engineers. This information is known and acquired in your setting and is normally not included in (Jira) user stories. However, coding agents by default do not possess this context. The closest thing may be the ability to configure Amazon Q Developer with custom repositories, so that it can learn about company-specific coding standards, libraries, and so on.&lt;/p&gt;

&lt;p&gt;Spec-driven development with Kiro now forces this context to be defined as Agent Steering resources. Teams can organize workshops to document their guiding tenets and principles, organized in a Git repo and iterated on as the information evolves. Perhaps your team already has something similar documented in a company wiki?&lt;/p&gt;

&lt;p&gt;I think we need to help AI coding companions build the same mental framework we’d give to a human colleague during onboarding and code review. Kiro solves this by checking in your specifications to Git. Consider creating a Kiro app bootstrap repository or include common steering as a Git submodule, package reference or similar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kiro pricing
&lt;/h2&gt;

&lt;p&gt;When Kiro becomes Generally Available, there will be different tiers available to match your level of usage.&lt;/p&gt;

&lt;p&gt;Vibe Requests cover any agentic operation in Kiro that does not involve execution of a Spec Task.&lt;/p&gt;

&lt;p&gt;You start with Vibe Requests to create requirements, design documents, and tasks.&lt;/p&gt;

&lt;p&gt;One Vibe Request typically equals one message or prompt, while one Spec Request equals executing a single Spec Task.&lt;/p&gt;

&lt;p&gt;For more information see &lt;a href="https://kiro.dev/blog/understanding-kiro-pricing-specs-vibes-usage-tracking/" rel="noopener noreferrer"&gt;https://kiro.dev/blog/understanding-kiro-pricing-specs-vibes-usage-tracking/&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;My personal experience is that I really appreciate the spec-driven development process. Kiro is a game-changer that can significantly boost what builders are able to produce. There are no more valid excuses for skipping sufficient test coverage, or for struggling to build even a nice-looking frontend app!&lt;/p&gt;

&lt;p&gt;Kiro is not just a new tool, it is a new workflow: AI-assisted, spec-driven development that can incorporate mature engineering practices. My initial evaluation leaves me thinking that AI is starting to grow up and become more professional. As a bonus, Kiro forces you to write better specifications, which can make it easier to establish common alignment in a team and while onboarding new team members.&lt;/p&gt;

&lt;p&gt;This approach is a giant leap forward compared to coding assistants pre H2 2025. &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/context-project-rules.html" rel="noopener noreferrer"&gt;Project Rules&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/customizations.html" rel="noopener noreferrer"&gt;Customizations&lt;/a&gt; in &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt; were a step in the right direction, but Kiro brings a sought-after structure and consistency that, in my opinion, is not only nice but necessary in a professional context. Yes, there are still some bugs and quirks (at the time of writing, Kiro is still in limited preview), but I am optimistic that this technology and the underlying Large Language Models will mature and produce results of increasingly higher quality and predictability over the coming months. Earlier, my experience was that code assistant suggestions could get me around ~70% of the way, with the remaining ~30% requiring traditional authoring for precision. Kiro boosts this to maybe ~85%+. I would claim that return on investment on a Kiro license can be achieved pretty fast.&lt;/p&gt;

&lt;p&gt;I would still prefer a knowledgeable human in the loop, reviewing pull requests and making the final decision on changes, at least until agentic AI has matured a bit more. It will be exciting to see how this field evolves over the next couple of years.&lt;/p&gt;

&lt;p&gt;I encourage you to try out spec-driven development with your colleagues and warmly welcome Kiro as your new team member.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;https://kiro.dev/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kiro.dev/blog/introducing-kiro/" rel="noopener noreferrer"&gt;https://kiro.dev/blog/introducing-kiro/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kiro.dev/blog/kiro-and-the-future-of-software-development/" rel="noopener noreferrer"&gt;https://kiro.dev/blog/kiro-and-the-future-of-software-development/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/claude/sonnet" rel="noopener noreferrer"&gt;Anthropic Claude Sonnet&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://builder.aws.com/content/30HonEOE2KzaYCYlyDRdbE25yxz/ai-driven-development-life-cycle-reimagining-software-engineering" rel="noopener noreferrer"&gt;AWS Builder Center blog: AI-Driven Development Life Cycle: Reimagining Software Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Header image courtesy of &lt;a href="https://aws.amazon.com/ai/generative-ai/nova/" rel="noopener noreferrer"&gt;Amazon Nova Canvas 1.0&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/08/11/how-to-use-kiro-for-ai-assisted-spec-driven-development/" rel="noopener noreferrer"&gt;How to use Kiro for AI assisted spec-driven development&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
    </item>
    <item>
      <title>Extensive reporting of Well-Architected Maturity</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Sun, 20 Jul 2025 07:59:45 +0000</pubDate>
      <link>https://forem.com/haakoned/extensive-reporting-of-well-architected-maturity-of4</link>
      <guid>https://forem.com/haakoned/extensive-reporting-of-well-architected-maturity-of4</guid>
      <description>&lt;h5&gt;
  
  
  Table of contents
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;How to generate a Well-Architected Compliance report&lt;/li&gt;
&lt;li&gt;What a Well-Architected Framework Compliance report looks like&lt;/li&gt;
&lt;li&gt;How the reporting feature works&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Feedback and contributions&lt;/li&gt;
&lt;li&gt;Resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the post &lt;a href="https://hedrange.com/2025/04/15/how-to-measure-well-architected-maturity/" rel="noopener noreferrer"&gt;How to measure Well-Architected maturity&lt;/a&gt; we explored how extensive insight into cloud infrastructure posture could help accelerate the &lt;em&gt;Measure&lt;/em&gt; phase, leaving more time to &lt;em&gt;Learn&lt;/em&gt; and discuss opportunities for &lt;em&gt;Improvement&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydvlnkdryn7v3vbu962.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydvlnkdryn7v3vbu962.png" alt="AWS WAFR process" width="300" height="292"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure – Well-Architected Framework review cycle courtesy of AWS&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A key factor was to make the data available while performing a review in the Well-Architected Tool. After discussing the solution with colleagues and customers, I realized that the valuable data points weren’t being used to their full potential when exposed only through the Notes field in the Well-Architected Tool. The key constraints are that the Notes field is limited to plain text and a maximum of 2000 characters. Valuable resource identifiers and tags had to be truncated, which made it harder to identify the applicable resources. Could there be a better way?&lt;/p&gt;
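To illustrate the constraint, here is a minimal sketch of trimming a compliance summary so it fits within the 2000-character Notes limit before it is written back via boto3 (the truncation marker is my own convention, not part of any AWS API):

```python
NOTES_LIMIT = 2000  # the Well-Architected Tool Notes field caps at 2000 characters


def truncate_notes(text: str, limit: int = NOTES_LIMIT) -> str:
    """Trim a compliance summary so it fits within the Notes field limit."""
    if len(text) > limit:
        marker = " [truncated]"  # illustrative marker, not an API convention
        return text[: limit - len(marker)] + marker
    return text
```

The result could then be passed as the `Notes` parameter of the Well-Architected Tool's `update_answer` API through boto3.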

&lt;p&gt;I decided to develop an additional AWS Lambda function called &lt;code&gt;well_architected_report_generator&lt;/code&gt;. The main purpose of this Lambda function is to collect all the available data points from various sources and generate a report in HTML format stored in Amazon Simple Storage Service (S3). When performing Well-Architected Framework Reviews, you are most likely already logged in to the AWS Console to access the Well-Architected Tool, so opening an additional tab with S3 could be useful. Thanks to Amazon Q Developer the HTML and CSS came out pretty good, too!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current data sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Config Conformance Packs&lt;/li&gt;
&lt;li&gt;AWS Trusted Advisor checks (available checks depend on the active AWS Support Plan)&lt;/li&gt;
&lt;li&gt;Resource Tag: Name&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How to generate a Well-Architected Compliance report
&lt;/h2&gt;

&lt;p&gt;Deploy the Terraform module as described in &lt;a href="https://hedrange.com/2025/04/15/how-to-measure-well-architected-maturity/" rel="noopener noreferrer"&gt;How to measure Well-Architected Maturity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please note that three new (optional) Terraform module variables have been introduced:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;variable "deploy_aws_config_recorder" {
  description = "Set to true to deploy an AWS Config Recorder. If you already have a customer managed AWS Config recorder in the desired region, set to false. AWS supports only one customer managed configuration recorder for each account for each AWS Region."
  type = bool
  default = true
}

variable "reports_bucket_name_prefix" {
  description = "Prefix for the S3 bucket name that stores Well-Architected compliance reports"
  type = string
  default = "well-architected-compliance-reports"
}

variable "reports_retention_days" {
  description = "Number of days to retain non-current versions of reports in the S3 bucket"
  type = number
  default = 90
}&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "deploy_aws_config_recorder" {
  description = "Set to true to deploy an AWS Config Recorder. If you already have a customer managed AWS Config recorder in the desired region, set to false. AWS supports only one customer managed configuration recorder for each account for each AWS Region."
  type = bool
  default = true
}

variable "reports_bucket_name_prefix" {
  description = "Prefix for the S3 bucket name that stores Well-Architected compliance reports"
  type = string
  default = "well-architected-compliance-reports"
}

variable "reports_retention_days" {
  description = "Number of days to retain non-current versions of reports in the S3 bucket"
  type = number
  default = 90
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the time of writing, AWS supports only one customer managed AWS Config recorder per account per region. If you already have one, set &lt;code&gt;deploy_aws_config_recorder&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; to re-use it.&lt;/p&gt;

&lt;p&gt;If you would like to retain the reports for longer than the default of 90 days, adjust &lt;code&gt;reports_retention_days&lt;/code&gt; accordingly.&lt;/p&gt;

&lt;p&gt;A minimal example module call if you already have a customer managed AWS Config recorder in your region and you prefer to retain the report files for 400 days may look like this:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;module "well_architected_config_conformance_pack" {
  source = "git::https://github.com/soprasteria/terraform-aws-wellarchitected-conformance.git?ref=&amp;amp;lt;DESIRED-COMMIT-SHA&amp;amp;gt;"
  deploy_aws_config_recorder = false
  reports_retention_days = 400
}&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "well_architected_config_conformance_pack" {
  source = "git::https://github.com/soprasteria/terraform-aws-wellarchitected-conformance.git?ref=&amp;lt;DESIRED-COMMIT-SHA&amp;gt;"
  deploy_aws_config_recorder = false
  reports_retention_days = 400
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait 24 hours for data to be collected and aggregated.&lt;/p&gt;

&lt;p&gt;Then, in the AWS Console, go to AWS Lambda and locate the function &lt;code&gt;well_architected_report_generator&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
 &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff18rgg3ri86mfvm8kmkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff18rgg3ri86mfvm8kmkb.png" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a new Test event with a JSON payload containing &lt;code&gt;workload_id&lt;/code&gt; for the Well-Architected Tool workload (not the complete ARN, just the last part) and &lt;code&gt;dry_run&lt;/code&gt;. Example payload:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;textarea tabindex="-1" aria-hidden="true" readonly&amp;gt;{
  "workload_id": "141970ea95fd5b4329cyh05502659f39",
  "dry_run": 0
}&amp;lt;/textarea&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "workload_id": "141970ea95fd5b4329cyh05502659f39",
  "dry_run": 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit Test. Execution may take a minute or two, depending on the number of AWS resources deployed in the AWS account.&lt;/p&gt;
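If you prefer to trigger the function programmatically instead of through the console Test button, a minimal boto3 sketch could look like this (function name and payload keys are taken from the example above; assumes credentials for the target account and region are configured):

```python
import json


def build_payload(workload_id: str, dry_run: int = 0) -> str:
    """Build the JSON event payload expected by the report generator."""
    return json.dumps({"workload_id": workload_id, "dry_run": dry_run})


def invoke_report_generator(workload_id: str, dry_run: int = 0) -> dict:
    """Invoke the Lambda function synchronously and decode its response."""
    import boto3  # imported lazily so build_payload stays usable offline

    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="well_architected_report_generator",
        InvocationType="RequestResponse",
        Payload=build_payload(workload_id, dry_run),
    )
    return json.loads(response["Payload"].read())
```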

&lt;p&gt;The CloudWatch Logs output will let you know which AWS Support plan is detected and where you can find the produced report.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Starting Well-Architected Report Generator&lt;br&gt;Collecting compliance data from AWS Config&lt;br&gt;Trusted Advisor compliance status mapping&lt;br&gt;AWS Business or Enterprise Support is/is not enabled&lt;br&gt;Successfully uploaded report to s3://well-architected-compliance-reports-123456789012/Reports/well_architected_compliance_report_%timestamp%.html&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Navigate to Amazon S3, find the most recent report by sorting on the Last modified column, check the box to the left of Name and click Open to access the report in a new browser tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbpa7awwl0iff5v7xu7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbpa7awwl0iff5v7xu7s.png" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;
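The most recent report can also be located programmatically instead of sorting in the console. A sketch, assuming the default bucket naming from the module (prefix plus account ID, as seen in the log excerpt above):

```python
def latest_report(objects: list) -> dict:
    """Pick the most recently modified object from an S3 listing."""
    return max(objects, key=lambda o: o["LastModified"])


def latest_report_key(bucket: str, prefix: str = "Reports/") -> str:
    """Return the key of the newest report in the given bucket."""
    import boto3  # imported lazily so latest_report stays usable offline

    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return latest_report(listing.get("Contents", []))["Key"]
```

For example, `latest_report_key("well-architected-compliance-reports-123456789012")` would return the key of the newest report, which can then be opened or downloaded.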

&lt;h2&gt;
  
  
  What a Well-Architected Framework Compliance report looks like
&lt;/h2&gt;

&lt;p&gt;The produced report contains information about the time of generation, the AWS Account ID, region and detected AWS Support plan.&lt;/p&gt;

&lt;p&gt;Example report 1, an AWS account with Basic Support:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzk95j5y2nr8aeivul8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzk95j5y2nr8aeivul8u.png" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example report 2, an AWS account with Enterprise Support:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo1h1zble6joaxy9hjos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo1h1zble6joaxy9hjos.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report may contain a lot of detailed information, so an Executive Summary is provided as a high-level overview.&lt;/p&gt;

&lt;p&gt;Example report 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w7dbn7o6mge2sx6mqaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w7dbn7o6mge2sx6mqaq.png" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example report 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7996xelhntozr5ty7zjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7996xelhntozr5ty7zjy.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, a Table of Contents is provided, with an overview of the available Well-Architected Framework pillars and questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwio2wz1pwio2j5de1ml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwio2wz1pwio2j5de1ml.png" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each question includes relevant context information from the Well-Architected Framework and a link to further guidance.&lt;/p&gt;

&lt;p&gt;Here is the second question in the Security pillar: “How do you manage identities for people and machines?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9218687xhophbmpmtmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9218687xhophbmpmtmf.png" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AWS Accounts with available AWS Premium Support, relevant Trusted Advisor check information is included.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjt06ciftyaxq85mnlou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjt06ciftyaxq85mnlou.png" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Relevant AWS Config checks and their status are displayed, along with the detected Resource Type, Resource ID and Tag Name, if available, for easy identification and discussion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtrckn1143w8vf85adxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtrckn1143w8vf85adxn.png" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Reliability pillar question number 9, Trusted Advisor checks indicate that RDS backups are enabled for all clusters, but there is at least one S3 bucket where replication is not configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkzqu4w802xe4k83rvhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkzqu4w802xe4k83rvhz.png" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next section lists all detected AWS resources, including Resource ID, Status and Tag Name, as available (certain information has been obfuscated on purpose).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foklgxw93lnrylmzlbpl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foklgxw93lnrylmzlbpl3.png" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moving to Cost Optimization pillar question number three, “How do you monitor cost and usage?”: this question actually has no official Trusted Advisor checks associated with it, but the Terraform module fills the gap with custom AWS Config checks.&lt;/p&gt;

&lt;p&gt;Based on this we can easily see that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost Anomaly Detection is configured with a Cost monitor.&lt;/li&gt;
&lt;li&gt;There is at least one AWS Budget configured with alert subscriptions.&lt;/li&gt;
&lt;li&gt;There are no EC2 instances outside of Auto Scaling Groups.&lt;/li&gt;
&lt;li&gt;The AWS account is a member of AWS Organizations and a Tag Policy is in effect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11m5x6m3ty2mhhi8s4ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11m5x6m3ty2mhhi8s4ze.png" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the reporting feature works
&lt;/h2&gt;

&lt;p&gt;Let’s take a closer look into the underlying logic of the &lt;code&gt;well_architected_report_generator&lt;/code&gt; Lambda function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uzoxbqcksobnme9omjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uzoxbqcksobnme9omjo.png" width="800" height="807"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram created with the help of the AWS Diagram and Documentation MCP Servers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the Terraform module is deployed, AWS Config needs 24 hours to collect and aggregate all data points. When the Lambda function is then triggered, the following happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The starting point is a &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/userguide/workloads.html" rel="noopener noreferrer"&gt;Workload&lt;/a&gt; in the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html" rel="noopener noreferrer"&gt;Well-Architected Tool&lt;/a&gt;. It’s not necessary to answer all questions first; you can do that with your team after the initial report has been generated. Every question in each pillar is then mapped to relevant AWS Config and Trusted Advisor checks. Note: there is currently no complete one-to-one mapping between questions and AWS Config and Trusted Advisor checks, but coverage may increase going forward as more custom checks are added to the Terraform module (contributions are welcome) and to AWS Trusted Advisor.&lt;/li&gt;
&lt;li&gt;Compliance status and information for all mapped checks are collected and aggregated.
Resources in scope are fetched along with Tag Name, as available.
Contextual information and further guidance is retrieved.&lt;/li&gt;
&lt;li&gt;Scores and percentages are calculated.&lt;/li&gt;
&lt;li&gt;Data and information are grouped by pillar.&lt;/li&gt;
&lt;li&gt;The Executive Summary is generated. &lt;/li&gt;
&lt;li&gt;The report is produced based on Python Jinja2 templating functionality and uploaded to a dedicated bucket on Amazon S3. &lt;/li&gt;
&lt;/ol&gt;
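The scoring step above can be sketched as a small aggregation over mapped check results. This is a minimal illustration only; the field names (`pillar`, `status`) are assumptions, not the module's actual data model:

```python
from collections import defaultdict


def pillar_scores(check_results: list) -> dict:
    """Group check results by pillar and compute the percentage of COMPLIANT checks."""
    # pillar -> [compliant count, total count]
    totals = defaultdict(lambda: [0, 0])
    for check in check_results:
        counts = totals[check["pillar"]]
        counts[1] += 1
        if check["status"] == "COMPLIANT":
            counts[0] += 1
    return {pillar: round(100 * c / t, 1) for pillar, (c, t) in totals.items()}
```

The resulting per-pillar percentages could then be fed into a Jinja2 template to render the Executive Summary and per-pillar sections.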

&lt;p&gt;This functionality is part of the Terraform module in the file &lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/blob/main/wa_report_generator.tf" rel="noopener noreferrer"&gt;wa_report_generator.tf&lt;/a&gt;, which includes the AWS Lambda function, a dedicated AWS KMS key for encrypting the compliance reports (as resource and account information may be considered sensitive) and a dedicated Amazon S3 bucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The solution now provides even more valuable insight to accelerate Well-Architected Framework Review conversations, leaving more time to discuss opportunities for improvement. It fills some gaps if you don’t have AWS Premium Support available, and adds additional value otherwise. It can easily be deployed in existing Terraform pipelines. Please note that all resources will be cleaned up and deleted when the module call is removed, including the reports, so remember &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel-09.html" rel="noopener noreferrer"&gt;REL09&lt;/a&gt; and back up relevant reports accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback and contributions
&lt;/h2&gt;

&lt;p&gt;If you have any feedback &lt;a href="https://hedrange.com/about" rel="noopener noreferrer"&gt;please let me know&lt;/a&gt; through your preferred medium of contact.&lt;/p&gt;

&lt;p&gt;If you would like to contribute with bugfixes, additional functionality or check coverage, &lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/pulls" rel="noopener noreferrer"&gt;pull requests&lt;/a&gt; are welcome!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/conformance-packs.html" rel="noopener noreferrer"&gt;AWS Config Conformance Packs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/well-architected-tool/" rel="noopener noreferrer"&gt;AWS Well-Architected Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/" rel="noopener noreferrer"&gt;GitHub: Terraform module terraform-aws-wellarchitected-conformance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Development was accelerated by &lt;a href="https://aws.amazon.com/q/developer/" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://awslabs.github.io/mcp/servers/aws-diagram-mcp-server/" rel="noopener noreferrer"&gt;AWS Diagram MCP Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://awslabs.github.io/mcp/servers/aws-documentation-mcp-server/" rel="noopener noreferrer"&gt;AWS Documentation MCP Server&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/07/20/extensive-reporting-of-well-architected-maturity/" rel="noopener noreferrer"&gt;Extensive reporting of Well-Architected Maturity&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>wellarchitected</category>
    </item>
    <item>
      <title>How to measure Well-Architected maturity?</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Tue, 15 Apr 2025 08:14:40 +0000</pubDate>
      <link>https://forem.com/haakoned/how-to-measure-well-architected-maturity-ikp</link>
      <guid>https://forem.com/haakoned/how-to-measure-well-architected-maturity-ikp</guid>
      <description>&lt;p&gt;&lt;strong&gt;TABLE OF CONTENTS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Challenge&lt;/li&gt;
&lt;li&gt;
Measuring Well-Architected maturity with Terraform and AWS Config

&lt;ul&gt;
&lt;li&gt;Functional flow of the solution&lt;/li&gt;
&lt;li&gt;Conceptual AWS architecture diagram&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

How to deploy and utilize

&lt;ul&gt;
&lt;li&gt;Viewing measurement insights in AWS Console&lt;/li&gt;
&lt;li&gt;Well-Architected Tool integration&lt;/li&gt;
&lt;li&gt;Event JSON examples for dry_run/live mode&lt;/li&gt;
&lt;li&gt;Event JSON for cleaning notes fields for all questions&lt;/li&gt;
&lt;li&gt;Notice about compliance checks and automation&lt;/li&gt;
&lt;li&gt;Cost of AWS Config evaluations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;How to remove and decommission after use&lt;/li&gt;

&lt;li&gt;

Behind the scenes

&lt;ul&gt;
&lt;li&gt;AWS Config resources&lt;/li&gt;
&lt;li&gt;AWS Config Conformance Packs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feedback and contributions&lt;/li&gt;

&lt;li&gt;Resources&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=femopq3JWJg&amp;amp;t=5537" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hogfr1n65z745flxjdr.png" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“I always think that you should be asking yourself:&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Are you Well-Architected?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=femopq3JWJg&amp;amp;t=5537" rel="noopener noreferrer"&gt;Dr. Werner Vogels, AWS re:Invent keynote 2018&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Being Well-Architected means that you have taken care of the basics and your foundation is solid, so that you can move fast and focus on business requirements, with decreased risk of surprises. But how can we answer Werner’s question with confidence? How can we &lt;em&gt;measure&lt;/em&gt; Well-Architected maturity in a tangible manner? And what is Well-Architected &lt;em&gt;enough&lt;/em&gt; in your project phase?&lt;/p&gt;

&lt;p&gt;As the first part of the journey I would suggest that you &lt;a href="https://hedrange.com/2023/12/18/move-fast-and-avoid-surprises-be-well-architected/" rel="noopener noreferrer"&gt;plan and conduct a Well-Architected Framework Review&lt;/a&gt;: have a conversation about your solution architecture and how the different best practices from AWS could apply in your context (Measure).&lt;/p&gt;

&lt;p&gt;In most review conversations I observe teams spending a substantial amount of time trying to understand whether the designed or provisioned resources meet the AWS recommended best practice configurations. “Did we set up alerting for this scenario?”, “Did we configure encryption at rest for the database cluster?”, “Did anyone get any alerts about spiking costs?”, “Is our documentation still up to date?” and so on.&lt;/p&gt;

&lt;p&gt;During a conversation, spending less effort on &lt;em&gt;Measure&lt;/em&gt; gives us more time to &lt;em&gt;Learn&lt;/em&gt; about best practices and align on opportunities for &lt;em&gt;Improvement&lt;/em&gt;. Personally, I don’t believe that automation and AI capabilities will fully replace the WAFR lifecycle, but they may help &lt;em&gt;accelerate&lt;/em&gt; it, reducing Mean-Time To Deployed Improvement for your users (MTTDI) [yes, I just invented that term].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydvlnkdryn7v3vbu962.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydvlnkdryn7v3vbu962.png" alt="AWS WAFR process" width="300" height="292"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure – Well-Architected Framework review cycle courtesy of AWS&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenge
&lt;/h2&gt;

&lt;p&gt;You could use a variety of options for measuring whether cloud resources meet AWS best practices. Most commonly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/security-hub/" rel="noopener noreferrer"&gt;AWS Security Hub&lt;/a&gt; with the &lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/fsbp-standard.html" rel="noopener noreferrer"&gt;AWS Foundational Security Best Practices&lt;/a&gt; control reference (highly recommended).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/" rel="noopener noreferrer"&gt;AWS Trusted Advisor&lt;/a&gt; (full set of checks requires Business or Enterprise Support from AWS).&lt;/li&gt;
&lt;li&gt;3rd party open-source tools such as &lt;a href="https://prowler.com/" rel="noopener noreferrer"&gt;Prowler&lt;/a&gt; and &lt;a href="https://steampipe.io/" rel="noopener noreferrer"&gt;Steampipe&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;3rd party SaaS vendors offering similar functionality in APM/Observability services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these options may not be available during a Well-Architected Framework Review due to company policies on changes potentially affecting the entire AWS Organization, cost development, security validations or procurement.&lt;/p&gt;

&lt;p&gt;But what if AWS native technology, provisioned for a limited time period, would be acceptable? In this article I will share a possible approach to measuring Well-Architected maturity in the form of AWS Config Conformance Packs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Measuring Well-Architected maturity with Terraform and AWS Config
&lt;/h2&gt;

&lt;p&gt;Recently I have been working on a Terraform module which can be utilized in scenarios with constraints as described above. A particular reflection I’ve made is that most tools focus primarily on security and reliability. Some dedicated offerings focus solely on Cloud Financial Management and Cost Optimization (also called FinOps), but finding one complete &lt;a href="https://en.wikipedia.org/wiki/Commercial_off-the-shelf" rel="noopener noreferrer"&gt;COTS&lt;/a&gt; Solution To Rule Them All (that doesn’t charge a premium) is unlikely.&lt;/p&gt;

&lt;p&gt;If we roll up our sleeves and develop our own solution supporting our own custom logic, we can also cover other aspects such as Cost Optimization.&lt;/p&gt;

&lt;p&gt;This Terraform module deploys AWS Config Conformance Packs mapped to pillars in the Well-Architected Framework.&lt;/p&gt;

&lt;p&gt;For relevant pillars in the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;, each best practice that is specific enough to be detected will report as COMPLIANT or NON_COMPLIANT. Some best practices are harder to measure, or depend on subjective judgment about whether a team is happy with how things are or sees room for improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How a team evaluates culture and priorities.&lt;/li&gt;
&lt;li&gt;How satisfied a team is with insight into their workload(s) or business continuity and disaster recovery planning.&lt;/li&gt;
&lt;li&gt;How to practice cloud financial management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best practices in Operational Excellence are not straightforward to detect, as observability implementations may involve subjective opinions on room for improvement or may be built with 3rd party tools. The main outcome of this module is to accelerate the Well-Architected Framework Review conversation, not to replace it with automation. Our hope is to shift the focus from “how did we configure this?” to “this is where we are today, what could we do to improve?”, thus freeing up valuable time for busy teams.&lt;/p&gt;

&lt;p&gt;In addition, the Notes field in the Well-Architected Tool can be populated directly with AWS Config resource compliance check results, leaving you with more insight to discuss improvement actions.&lt;/p&gt;

&lt;p&gt;| &lt;strong&gt;Well-Architected Framework Pillar&lt;/strong&gt; | &lt;strong&gt;Status as of April 2025&lt;/strong&gt; |&lt;br&gt;
| Operational Excellence | 0 checks |&lt;br&gt;
| Security (majority of checks) | 128 checks |&lt;br&gt;
| Reliabilit | 69 checks |&lt;br&gt;
| Performance Efficiency | 0 checks |&lt;br&gt;
| Cost Optimization | 6 checks |&lt;br&gt;
| Sustainability | 0 checks |&lt;/p&gt;
&lt;h4&gt;
  
  
  Functional flow of the solution
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5xg9fve1di19twzlxv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5xg9fve1di19twzlxv.jpg" width="500" height="1024"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure – Flow sequence accelerating WAFR conversations&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Conceptual AWS architecture diagram
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;AWS Config Configuration Recorder
• Records configuration changes for resources in your local AWS account (no impact or dependencies on AWS Organizations Config recorder(s))
• Set to record either daily or continuously (configurable)
• Stores configuration snapshots in a dedicated Amazon S3 bucket&lt;/li&gt;
&lt;li&gt;Amazon S3 Bucket
• Stores AWS Config configuration snapshots
• Stores CloudFormation templates for conformance packs
• Encrypted with a dedicated KMS key&lt;/li&gt;
&lt;li&gt;AWS Config Conformance Packs
• Well-Architected-Security
• Well-Architected-Reliability
• Well-Architected-Cost-Optimization
• Well-Architected-IAM (optional, subset of Security checks)&lt;/li&gt;
&lt;li&gt;Custom Lambda Functions
• Cost Optimization checks:
• Account structure implementation
• AWS Budgets configuration
• AWS Cost Anomaly Detection
• Organization information in cost and usage
• EC2 instances without Auto Scaling Groups&lt;/li&gt;
&lt;li&gt;Well-Architected Tool Updater Lambda function
• Retrieves compliance data from AWS Config
• Maps compliance results to specific Well-Architected Framework best practices
• Updates Notes fields in Well-Architected Tool&lt;/li&gt;
&lt;/ol&gt;
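&lt;p&gt;The core loop of the updater can be sketched with boto3 (a simplified illustration, not the module’s actual source; the helper names are hypothetical):&lt;/p&gt;

```python
NOTES_LIMIT = 2084  # Well-Architected Tool Notes field character limit


def format_compliance_lines(evaluation_results):
    """Render AWS Config evaluation results as 'ResourceType ResourceId: STATUS' lines."""
    lines = []
    for result in evaluation_results:
        qualifier = result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
        lines.append(
            f"{qualifier['ResourceType']} {qualifier['ResourceId']}: {result['ComplianceType']}"
        )
    return "\n".join(lines)[:NOTES_LIMIT]


def update_notes(workload_id, question_id, rule_name):
    """Copy the compliance status of one AWS Config rule into a WA Tool Notes field."""
    import boto3  # available in the Lambda runtime

    config = boto3.client("config")
    results = []
    pages = config.get_paginator("get_compliance_details_by_config_rule").paginate(
        ConfigRuleName=rule_name
    )
    for page in pages:
        results.extend(page["EvaluationResults"])
    boto3.client("wellarchitected").update_answer(
        WorkloadId=workload_id,
        LensAlias="wellarchitected",
        QuestionId=question_id,
        Notes=format_compliance_lines(results),
    )
```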

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08ob11xxha4o3bhe4s5c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08ob11xxha4o3bhe4s5c.jpg" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How to deploy and utilize
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;At least two days before your planned review, deploy the module as suggested in &lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/blob/main/examples/main.tf" rel="noopener noreferrer"&gt;examples/main.tf&lt;/a&gt; and described below. Compliance checks update on a daily basis to keep AWS Config evaluation costs down.&lt;/li&gt;
&lt;li&gt;Right before the review, trigger the Lambda function well_architected_tool_updater to populate the Notes sections of your Well-Architected Tool workload with the compliance status from the AWS Config conformance packs.&lt;/li&gt;
&lt;li&gt;Run the review and use the data in the Notes fields to drive the discussion. No checked/answered questions will be modified; answering them remains a matter of subjective evaluation.&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = "eu-west-1" # Change to your preferred region
}

module "well_architected_conformance" {
  source = "git::https://github.com/soprasteria/terraform-aws-wellarchitected-conformance.git?ref=c006f439fc07d2e898cc7f67c5e7bcad1dcbd2e8"

  # AWS Config recording configuration
  recording_frequency = "DAILY" # Use DAILY to reduce costs

  # Deploy conformance packs
  deploy_security_conformance_pack = true
  deploy_reliability_conformance_pack = true
  deploy_cost_optimization_conformance_pack = true
  deploy_iam_conformance_pack = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Viewing measurement insights in AWS Console
&lt;/h3&gt;

&lt;p&gt;Navigating to AWS Config – Conformance packs will present a dashboard with packs for the Security, Reliability and Cost Optimization Pillars by default, plus IAM for Identity and Access Management, if enabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flopyc7a1lz1qn9avl4d0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flopyc7a1lz1qn9avl4d0.png" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can view the compliance score trend for each pillar/pack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcyaii03gkk8g55xf2xw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcyaii03gkk8g55xf2xw.png" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also view the compliance status for each check, prefixed with the related best practice question, mapped to the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/the-pillars-of-the-framework.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework whitepaper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l1ioleossacimi0nrec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l1ioleossacimi0nrec.png" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Well-Architected Tool integration
&lt;/h3&gt;

&lt;p&gt;This module can also automatically update Well-Architected Tool workloads with compliance data from the AWS Config Conformance Packs.&lt;/p&gt;

&lt;p&gt;The Lambda function well_architected_tool_updater will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Process each conformance pack (Security, Reliability, Cost Optimization).&lt;/li&gt;
&lt;li&gt;Loop through all rules in sequence (SEC01, SEC02, REL01, REL02, COST01, etc.).&lt;/li&gt;
&lt;li&gt;For each rule, list the resource type, resource ID, and compliance status in the Notes field of the corresponding best practice question of your Well-Architected Tool workload.

&lt;ul&gt;
&lt;li&gt;The Notes field is limited to a maximum of 2084 characters. When more resources are discovered than there is room for, the results are summarized.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Overwrite old data if triggered more than once.&lt;/li&gt;
&lt;li&gt;If you would like to erase the contents of all Notes fields, set the &lt;code&gt;clean_notes&lt;/code&gt; input parameter to 1.&lt;/li&gt;
&lt;/ol&gt;
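&lt;p&gt;The summarization step when the 2084-character limit is exceeded might look roughly like this (an illustrative sketch; the module’s actual output format may differ):&lt;/p&gt;

```python
NOTES_LIMIT = 2084  # Well-Architected Tool Notes field character limit


def summarize_if_needed(lines):
    """Return full per-resource detail if it fits in the Notes field,
    otherwise collapse to counts per compliance status."""
    detail = "\n".join(lines)
    if len(detail) > NOTES_LIMIT:
        counts = {}
        for line in lines:
            status = line.rsplit(": ", 1)[-1]
            counts[status] = counts.get(status, 0) + 1
        return "\n".join(
            f"{status}: {count} resource(s)" for status, count in sorted(counts.items())
        )
    return detail
```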

&lt;p&gt;The source code for the Lambda function is located in the &lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/blob/main/src/wa_tool_updater" rel="noopener noreferrer"&gt;src/wa_tool_updater&lt;/a&gt; directory.&lt;/p&gt;

&lt;p&gt;To trigger the Well-Architected Tool updater, go to Well-Architected Tool and extract the Workload ID (not the full resource ARN).&lt;/p&gt;
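&lt;p&gt;If you only have the full workload ARN at hand, the Workload ID is its final path segment; a small illustrative helper:&lt;/p&gt;

```python
def workload_id_from_arn(arn):
    """Extract the Workload ID from a Well-Architected workload ARN, e.g.
    arn:aws:wellarchitected:eu-west-1:123456789012:workload/141970ea95fd5b4329cea05202659f39
    """
    return arn.rsplit("/", 1)[-1]
```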

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mbzc13p6keyrg9k9mxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mbzc13p6keyrg9k9mxg.png" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then go to AWS Lambda and find the function well_architected_tool_updater. Create a test event JSON definition as follows (Console or CLI):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9uiow9a72x51iypi1x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9uiow9a72x51iypi1x5.png" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Event JSON examples for dry_run/live mode
&lt;/h4&gt;

&lt;p&gt;Extract the Well-Architected Tool Workload ID from Properties – ARN. This example with &lt;code&gt;dry_run = 1&lt;/code&gt; will find relevant compliance data and log to CloudWatch Logs. No changes or updates will be performed.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "workload_id": "141970ea95fd5b4329cea05202659f39",
  "dry_run": 1,
  "clean_notes": 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;dry_run = 0&lt;/code&gt; will update the Notes fields. No checked/answered questions will be modified.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "workload_id": "141970ea95fd5b4329cea05202659f39",
  "dry_run": 0,
  "clean_notes": 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
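&lt;p&gt;The same invocation can also be scripted instead of using a Console test event; a hedged sketch with boto3, using the function name as deployed by the module:&lt;/p&gt;

```python
import json


def build_event(workload_id, dry_run=1, clean_notes=0):
    """Build the test-event JSON for the updater Lambda."""
    return json.dumps(
        {"workload_id": workload_id, "dry_run": dry_run, "clean_notes": clean_notes}
    )


def invoke_updater(workload_id, dry_run=1, clean_notes=0):
    """Synchronously invoke the updater Lambda and return its parsed response."""
    import boto3  # assumes AWS credentials and region are configured

    response = boto3.client("lambda").invoke(
        FunctionName="well_architected_tool_updater",
        Payload=build_event(workload_id, dry_run, clean_notes),
    )
    return json.load(response["Payload"])
```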



&lt;h4&gt;
  
  
  Event JSON for cleaning notes fields for all questions
&lt;/h4&gt;

&lt;p&gt;If you end up with a mess and would like a fresh start, setting clean_notes to 1 will clear the Notes fields for all questions and return. No further changes to checked/answered questions or compliance data updates will be performed.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "workload_id": "141970ea95fd5b4329cea05202659f39",
  "dry_run": 1,
  "clean_notes": 1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output is as follows. Full log output is available in CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ox7fpmiomjgt36if5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ox7fpmiomjgt36if5v.png" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back in Well-Architected Tool, the notes field will now be updated with detected compliance for &lt;em&gt;SEC 4. How do you detect and investigate security events?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsomoyc7bwbmvpi960x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsomoyc7bwbmvpi960x1.png" width="796" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foapvlk59oczetg83aey2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foapvlk59oczetg83aey2.png" width="773" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Notice about compliance checks and automation
&lt;/h4&gt;

&lt;p&gt;Check data is based on all resources in the current AWS account; tag-based filtering is currently not supported. Keep this in mind if you have multiple workloads in the same AWS account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost of AWS Config evaluations
&lt;/h3&gt;

&lt;p&gt;According to the &lt;a href="https://aws.amazon.com/config/pricing/" rel="noopener noreferrer"&gt;AWS Config pricing page&lt;/a&gt;: &lt;em&gt;“With AWS Config, you are charged based on the number of configuration items recorded, the number of active AWS Config rule evaluations, and the number of conformance pack evaluations in your account. A configuration item is a record of the configuration state of a resource in your AWS account. An AWS Config rule evaluation is a compliance state evaluation of a resource by an AWS Config rule in your AWS account. A conformance pack evaluation is the evaluation of a resource by an AWS Config rule within the conformance pack”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AWS Config supports &lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/select-resources.html#select-resources-recording-frequency" rel="noopener noreferrer"&gt;Continuous recording and Daily recording&lt;/a&gt;. You can choose between Daily or Continuous by setting the desired value for the variable &lt;code&gt;recording_frequency&lt;/code&gt;, which defaults to DAILY.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to remove and decommission after use
&lt;/h2&gt;

&lt;p&gt;Some might see this solution as valuable long-term; others might have incoming tools that overlap with it.&lt;/p&gt;

&lt;p&gt;As this Terraform module deploys an S3 bucket for storing Config evaluations, the bucket must be emptied before it can be deleted.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Empty the bucket in the AWS Console.&lt;/li&gt;
&lt;li&gt;Remove the Terraform module call declaration from your code base.&lt;/li&gt;
&lt;li&gt;Trigger your CI/CD pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Behind the scenes
&lt;/h2&gt;

&lt;p&gt;Below are some Terraform snippets showing how to deploy an AWS Config Conformance Pack and how to write a custom AWS Config check.&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Config resources
&lt;/h4&gt;

&lt;p&gt;To avoid dependencies or conflicts with existing AWS Organization based AWS Config, this module deploys a dedicated AWS Config Recorder, which has to be started after provisioning.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Excerpts for illustration, not complete example, see main.tf

# AWS Config Delivery Channel to S3
resource "aws_config_delivery_channel" "well_architected" {
  name = "well_architected_config_delivery_channel"
  s3_bucket_name = module.aws_config_well_architected_recorder_s3_bucket.s3_bucket_id
  depends_on = [aws_config_configuration_recorder.well_architected]
}

# AWS Config Configuration Recorder with recording_frequency set by input variable
resource "aws_config_configuration_recorder" "well_architected" {
  name = "well-architected"
  role_arn = aws_iam_role.config_role.arn

  recording_group {
    all_supported = true
    include_global_resource_types = true
  }

  recording_mode {
    recording_frequency = var.recording_frequency
  }
}

# AWS Config retention configuration: Number of days AWS Config stores your historical information.
resource "aws_config_retention_configuration" "example" {
  retention_period_in_days = 400
}

# Manages status (recording / stopped) of an AWS Config Configuration Recorder.
resource "aws_config_configuration_recorder_status" "well_architected" {
  name = aws_config_configuration_recorder.well_architected.name
  is_enabled = true
  depends_on = [aws_config_delivery_channel.well_architected]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS Config Conformance Packs
&lt;/h4&gt;

&lt;p&gt;Security, Reliability and IAM conformance packs are based on AWS’ library of &lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/conformancepack-sample-templates.html" rel="noopener noreferrer"&gt;Conformance Pack Sample Templates for AWS Config&lt;/a&gt; (in CloudFormation format):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/operational-best-practices-for-wa-Security-Pillar.html" rel="noopener noreferrer"&gt;Operational Best Practices for AWS Well-Architected Framework Security Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/operational-best-practices-for-wa-Reliability-Pillar.html" rel="noopener noreferrer"&gt;Operational Best Practices for AWS Well-Architected Framework Reliability Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/operational-best-practices-for-aws-identity-and-access-management.html" rel="noopener noreferrer"&gt;Operational Best Practices for AWS Identity And Access Management&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The underlying checks are &lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/managed-rules-by-aws-config.html" rel="noopener noreferrer"&gt;AWS Config Managed Rules&lt;/a&gt; and cannot be edited. The CloudFormation templates are imported in Terraform as data objects. ConfigRuleNames are replaced to suit the particular Well-Architected Framework Pillar and best practice.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Excerpts for illustration, not complete example
locals {
  url_template_body_wa_security_pillar = "https://raw.githubusercontent.com/awslabs/aws-config-rules/refs/heads/master/aws-config-conformance-packs/Operational-Best-Practices-for-AWS-Well-Architected-Security-Pillar.yaml"
}

data "http" "template_body_wa_security_pillar" {
  url = local.url_template_body_wa_security_pillar
}

data "util_replace" "transformed_wa_security_pillar" {
  content = data.http.template_body_wa_security_pillar.response_body
  replacements = {
    "account-part-of-organizations" : "SEC01-securely-operate_bp_account-part-of-organizations",
    "ec2-instance-managed-by-systems-manager" : "SEC01-securely-operate_bp_ec2-instance-managed-by-systems-manager",
    "codebuild-project-envvar-awscred-check" : "SEC01-securely-operate_bp_codebuild-project-envvar-awscred-check",
    "mfa-enabled-for-iam-console-access" : "SEC02-identities_bp_mfa-enabled-for-iam-console-access"
     # .. and so on
    }
}

# Render templates to file on S3 to avoid template_body file limitation of 51,200 bytes
resource "aws_s3_object" "cloudformation_wa_config_security_template" {
  bucket = module.aws_config_well_architected_recorder_s3_bucket.s3_bucket_id
  key = "Cloudformation/wa-config-security.yaml"
  content = data.util_replace.transformed_wa_security_pillar.replaced
  content_type = "application/yaml"
}

# Takes the source Cloudformation file from S3, generates an AWS Config Conformance pack which behind the scenes creates an AWS managed Cloudformation stack. 
resource "aws_config_conformance_pack" "well_architected_conformance_pack_security" {
  count = var.deploy_security_conformance_pack ? 1 : 0
  name = "Well-Architected-Security"
  template_s3_uri = "s3://${module.aws_config_well_architected_recorder_s3_bucket.s3_bucket_id}/${aws_s3_object.cloudformation_wa_config_security_template.key}"
  depends_on = [aws_config_configuration_recorder.well_architected]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Cost Optimization Conformance Pack is built from scratch. &lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/evaluate-config_develop-rules.html" rel="noopener noreferrer"&gt;Custom Lambda Rules&lt;/a&gt; may be implemented like this:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS Lambda function based on module from terraform-aws-modules
module "lambda_function_wa_conformance_cost_03_aws_budgets" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-lambda.git?ref=f7866811bc1429ce224bf6a35448cb44aa5155e7"
  trigger_on_package_timestamp = false
  function_name = "WA-COST03-BP05-AWS-Budgets"
  description = "AWS Config Custom Rule which checks for AWS Budgets setup according to WAF COST03-BP05."
  handler = "index.lambda_handler"
  runtime = var.lambda_python_runtime
  source_path = "../../local-modules/wa-config-conformance/src/cost03_aws_budgets/index.py"
  attach_policy_statements = true
  timeout = var.lambda_timeout
  cloudwatch_logs_retention_in_days = var.lambda_cloudwatch_logs_retention_in_days
  policy_statements = {
    statement = {
      effect = "Allow"
      actions = [
        "budgets:DescribeBudgets",
        "budgets:ViewBudget",
        "config:PutEvaluations"
      ]
      resources = ["*"]
    }
  }

  tags = {
    Name = "Well-Architected-Conformance-COST03-BP05-AWS-Budgets"
  }
}

resource "aws_config_config_rule" "cost_01_aws_budgets" {
  name = "cost01-cloud-financial-management_bp_aws-budgets"
  description = "Checks for AWS Budgets setup according to WAF COST01-BP05 Report and notify on cost optimization."

  source {
    owner = "CUSTOM_LAMBDA"
    source_identifier = module.lambda_function_wa_conformance_cost_03_aws_budgets.lambda_function_arn

    source_detail {
      message_type = "ScheduledNotification"
      maximum_execution_frequency = var.scheduled_config_custom_lambda_periodic_trigger_interval
    }
  }

  depends_on = [module.lambda_function_wa_conformance_cost_03_aws_budgets]
}

# Lambda permissions for all AWS Config Custom Lambda Rules
resource "aws_lambda_permission" "config_permissions" {
  for_each = toset([
    module.lambda_function_wa_conformance_cost_02_account_structure_implemented.lambda_function_name,
    module.lambda_function_wa_conformance_cost_03_aws_budgets.lambda_function_name,
    module.lambda_function_wa_conformance_cost_03_aws_cost_anomaly_detection.lambda_function_name,
    module.lambda_function_wa_conformance_cost_03_add_organization_information_to_cost_and_usage.lambda_function_name,
    module.lambda_function_wa_conformance_cost_04_ec2_instances_without_auto_scaling.lambda_function_name
  ])

  statement_id = "AllowConfigInvoke"
  action = "lambda:InvokeFunction"
  function_name = each.value
  principal = "config.amazonaws.com"
  source_account = local.aws_account_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
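&lt;p&gt;For context, the handler such a rule invokes follows the standard contract for AWS Config custom Lambda rules: evaluate, then report back with PutEvaluations. A simplified sketch of what an index.py for the Budgets check could look like (illustrative only, not the module’s actual source):&lt;/p&gt;

```python
import json


def evaluate_budgets(budgets):
    """Pure decision logic: compliant when at least one AWS Budget exists."""
    if budgets:
        return "COMPLIANT", f"{len(budgets)} AWS Budget(s) configured"
    return "NON_COMPLIANT", "No AWS Budgets configured"


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    account_id = context.invoked_function_arn.split(":")[4]
    budgets = boto3.client("budgets").describe_budgets(AccountId=account_id).get("Budgets", [])
    compliance_type, annotation = evaluate_budgets(budgets)
    # Report the result back to AWS Config for this scheduled evaluation.
    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": "AWS::::Account",
                "ComplianceResourceId": account_id,
                "ComplianceType": compliance_type,
                "Annotation": annotation,
                "OrderingTimestamp": json.loads(event["invokingEvent"])[
                    "notificationCreationTime"
                ],
            }
        ],
        ResultToken=event["resultToken"],
    )
```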



&lt;h2&gt;
  
  
  Feedback and contributions
&lt;/h2&gt;

&lt;p&gt;If you have any feedback, &lt;a href="https://dev.to/about"&gt;please let me know&lt;/a&gt; through your preferred medium of contact.&lt;/p&gt;

&lt;p&gt;If you would like to contribute with additional functionality and check coverage, &lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/pulls" rel="noopener noreferrer"&gt;pull requests&lt;/a&gt; are welcome!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/conformance-packs.html" rel="noopener noreferrer"&gt;AWS Config Conformance Packs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/well-architected-tool/" rel="noopener noreferrer"&gt;AWS Well-Architected Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/soprasteria/terraform-aws-wellarchitected-conformance/" rel="noopener noreferrer"&gt;GitHub: Terraform module terraform-aws-wellarchitected-conformance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/04/15/how-to-measure-well-architected-maturity/" rel="noopener noreferrer"&gt;How to measure Well-Architected maturity?&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com/" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>wellarchitected</category>
      <category>measuring</category>
    </item>
    <item>
      <title>AWS re:Invent re:Cap talk – Simplifying developer experience with new features in AWS Step Functions</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Fri, 17 Jan 2025 07:43:58 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-reinvent-recap-talk-simplifying-developer-experience-with-new-features-in-aws-step-functions-5042</link>
      <guid>https://forem.com/aws-builders/aws-reinvent-recap-talk-simplifying-developer-experience-with-new-features-in-aws-step-functions-5042</guid>
      <description>&lt;p&gt;Monday January 13th I was invited by AWS User Group Oslo to participate in the traditional &lt;a href="https://www.meetup.com/AWS-User-Group-Norway/events/305162785" rel="noopener noreferrer"&gt;AWS re:Invent re:Cap meetup&lt;/a&gt;. My re:Cap contribution was a reflection on recent and valuable features in AWS Step Functions which can make the developer experience simpler and more efficient, reducing the time from idea to business value.&lt;/p&gt;

&lt;h4&gt;
  
  
  Main topics of my re:Cap talk
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;How to manage AWS Step Functions configuration with Infrastructure-as-Code&lt;/li&gt;
&lt;li&gt;Replacing application (Lambda) code with Step Functions workflow configuration&lt;/li&gt;
&lt;li&gt;Breaking apart a “Lambda-lith”&lt;/li&gt;
&lt;li&gt;Step Functions intrinsic functions&lt;/li&gt;
&lt;li&gt;New support for JSONata, in addition to JSONPath&lt;/li&gt;
&lt;li&gt;Walkthrough of JSONata native functionality&lt;/li&gt;
&lt;li&gt;New support for Variables&lt;/li&gt;
&lt;li&gt;Step Functions Distributed Map state&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“&lt;em&gt;Customers choose Step Functions to build complex workflows that involve multiple services such as &lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;, &lt;a href="https://aws.amazon.com/fargate/" rel="noopener noreferrer"&gt;AWS Fargate&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt;, and HTTP API integrations. Within these workflows, you build states to interface with these various services, passing input data and receiving responses as output. While you can use Lambda functions for date, time, and number manipulations beyond Step Functions’ intrinsic capabilities, these methods struggle with increasing complexity, leading to payload restrictions, data conversion burdens, and more state changes. This affects the overall cost of the solution. You use variables and JSONata to address this.&lt;/em&gt;“&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://aws.amazon.com/blogs/compute/simplifying-developer-experience-with-variables-and-jsonata-in-aws-step-functions/" rel="noopener noreferrer"&gt;AWS Compute Blog: Simplifying developer experience with variables and JSONata in AWS Step Functions&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The source material this re:Cap talk was based on is listed below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Event highlights
&lt;/h4&gt;

&lt;p&gt;Gunnar Grosch from AWS kicked off the event with highlights of new announcements from the pre:Invent and actual re:Invent period with a deeper dive on &lt;a href="https://aws.amazon.com/blogs/database/introducing-amazon-aurora-dsql/" rel="noopener noreferrer"&gt;Amazon Aurora DSQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyawyru9jbyhfyzqqt6r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyawyru9jbyhfyzqqt6r.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then it was my turn, before Colin from Capra Consulting shared his perspective on the state of Platform Engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmah1322o81olb24ceint.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmah1322o81olb24ceint.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyed3rg84kb7wml1m622a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyed3rg84kb7wml1m622a.jpg" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42w5a672cds293h1e6qa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42w5a672cds293h1e6qa.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjzcodsyp8ktujb9m84i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjzcodsyp8ktujb9m84i.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymxta1781lx6i71qbms4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymxta1781lx6i71qbms4.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The meetup wrapped up with an interesting panel discussion with Martin from Sopra Steria, Gunnar from AWS, Anders from Webstep and Erlend from Capra.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj27lawortuaxprztqnw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj27lawortuaxprztqnw.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many thanks to the organizers for a great event.&lt;/p&gt;

&lt;h4&gt;
  
  
  Slides from my re:Cap talk
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://hedrange.com/wp-content/uploads/2025/01/2025-01-13_-_aws_user_group_oslo_reinvent_recap_-_simplifying_developer_experience_with_new_features_in_aws_step_functions_hed.pdf" rel="noopener noreferrer"&gt;Simplifying developer experience with new features in AWS Step Functions&lt;/a&gt;&lt;a href="https://hedrange.com/wp-content/uploads/2025/01/2025-01-13_-_aws_user_group_oslo_reinvent_recap_-_simplifying_developer_experience_with_new_features_in_aws_step_functions_hed.pdf" rel="noopener noreferrer"&gt;Download&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Reference material
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/step-functions/" rel="noopener noreferrer"&gt;AWS Step Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/compute/simplifying-developer-experience-with-variables-and-jsonata-in-aws-step-functions/" rel="noopener noreferrer"&gt;AWS Compute Blog: Simplifying developer experience with variables and JSONata in AWS Step Functions/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://serverlessland.com/reinvent2024/api402" rel="noopener noreferrer"&gt;https://serverlessland.com/reinvent2024/api402&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://serverlessland.com/reinvent2024/svs401" rel="noopener noreferrer"&gt;https://serverlessland.com/reinvent2024/svs401&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/playlist?list=PL2yQDdvlhXf_Ezjnq7A7LfHBgCYSqzrZS" rel="noopener noreferrer"&gt;https://www.youtube.com/playlist?list=PL2yQDdvlhXf_Ezjnq7A7LfHBgCYSqzrZS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://community.aws/recaps" rel="noopener noreferrer"&gt;AWS Community re:Caps incl. full deck download&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2025/01/17/aws-reinvent-recap-talk-simplifying-developer-experience-with-new-features-in-aws-step-functions/" rel="noopener noreferrer"&gt;AWS re:Invent re:Cap talk – Simplifying developer experience with new features in AWS Step Functions&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>stepfunctions</category>
    </item>
    <item>
      <title>Increase system reliability with Immutable Infrastructure – Move fast and avoid surprises</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Fri, 09 Aug 2024 17:52:53 +0000</pubDate>
      <link>https://forem.com/haakoned/increase-system-reliability-with-immutable-infrastructure-move-fast-and-avoid-surprises-130n</link>
      <guid>https://forem.com/haakoned/increase-system-reliability-with-immutable-infrastructure-move-fast-and-avoid-surprises-130n</guid>
      <description>&lt;p&gt;&lt;strong&gt;Table of contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The largest outage in history of IT (so far)&lt;/li&gt;
&lt;li&gt;Configuration drift&lt;/li&gt;
&lt;li&gt;The concept of Immutable Infrastructure&lt;/li&gt;
&lt;li&gt;
Immutable Infrastructure in the AWS Cloud

&lt;ul&gt;
&lt;li&gt;Virtual machine based workloads on EC2&lt;/li&gt;
&lt;li&gt;Container based workloads&lt;/li&gt;
&lt;li&gt;Immutability for Docker and local testing&lt;/li&gt;
&lt;li&gt;Immutability for workloads provisioned with AWS Elastic Container Service (ECS based on EC2 or Fargate)&lt;/li&gt;
&lt;li&gt;Immutability for workloads provisioned with AWS Elastic Kubernetes Service&lt;/li&gt;
&lt;li&gt;About Docker labels/image tags and deployment&lt;/li&gt;
&lt;li&gt;Immutability for serverless workloads provisioned with AWS Lambda&lt;/li&gt;
&lt;li&gt;Data handling in the immutable infrastructure model&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  The largest outage in history of IT (so far)
&lt;/h2&gt;

&lt;p&gt;On 19 July 2024, the cybersecurity company CrowdStrike distributed a faulty update to its Falcon Sensor security software that caused widespread problems with Microsoft Windows computers running the software. As a result, roughly 8.5 million systems crashed and were unable to properly restart in what has been called the &lt;a href="https://www.cnbc.com/2024/07/19/latest-live-updates-on-a-major-it-outage-spreading-worldwide.html" rel="noopener noreferrer"&gt;largest outage in the history&lt;/a&gt; of information technology and “historic in scale”.&lt;/p&gt;

&lt;p&gt;The outage disrupted daily life, businesses, and governments around the world. Many industries were affected; airlines, airports, banks, hotels, hospitals, manufacturing, stock markets, broadcasting, gas stations, retail stores, emergency services and governmental websites. The worldwide financial damage has been estimated to be at least $10 billion.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happened?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The CrowdStrike Falcon software suite consists, highly simplified, of a software application (agent), which is versioned, and Rapid Response Content configuration updates, which are not versioned, to facilitate rapid deployment. However, this particular update was not properly quality assured before widespread deployment. In their &lt;a href="https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/" rel="noopener noreferrer"&gt;Preliminary Post Incident Review&lt;/a&gt;, CrowdStrike promised to improve their quality assurance process and provide customers with “greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.”&lt;/p&gt;

&lt;p&gt;I would like to highlight that there are certain nuances to this case (rapid distribution of security detection/mitigation mechanisms, client machines vs. servers etc.), but my personal stance is that any system providing mission-critical services should have a level of quality assurance that matches its Service Level Agreement/Objective and risk tolerance.&lt;/p&gt;

&lt;p&gt;Outages like this are nothing new. They can happen to any system where changes are applied live or in-place without proper testing in advance. Service disruptions can also occur during regular operating system patching, when a bug or a security issue is introduced, or with any application or server process, depending on how privileged the process is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration drift
&lt;/h2&gt;

&lt;p&gt;Long-running systems will experience configuration drift over time. A clean server started from a golden image has a high degree of certainty about its current state, but as time goes by that certainty decreases. In a traditional (legacy) deployment model, software packages are updated, and cache and log files accumulate on disk. Perhaps not all log file locations are configured for rotation, or an application stores temp files in a non-standard location and not all of them are cleaned up. Over time, disk space utilization increases, and at some point a critical partition such as /boot may fill up without anyone noticing. The server’s state has now deviated significantly from its initial state, and it may be hard to reproduce that same state.&lt;/p&gt;

&lt;p&gt;A software update that passed QA in staging is not guaranteed to apply successfully in production, because of a lack of disk space, junk that has accumulated over time, or custom manual configurations that were supposed to “improve” something in the live environment.&lt;/p&gt;

&lt;p&gt;Let’s say you start out comparing a green Granny Smith against a green Granny Smith, but over the course of time one of them evolves into a Red Delicious apple, and perhaps eventually into an orange Asian pear!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61f1ux2mj6kr6jdbbqez.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61f1ux2mj6kr6jdbbqez.jpg" width="739" height="1352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For production systems, especially mission-critical ones, risk management becomes increasingly important. An operating model based on long running servers is prone to introduce risk and uncertainty as time goes by. Staging and production environments can over the course of time drift and not be identical anymore. You end up with a &lt;a href="https://martinfowler.com/bliki/SnowflakeServer.html" rel="noopener noreferrer"&gt;Snowflake Server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Can the new state be predicted? Can the previously known state be reproduced? If not, you may have a challenge with rollbacks and disaster recovery.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It is a good idea to virtually burn down your servers at regular intervals. A server should be like a phoenix, regularly rising from the ashes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://martinfowler.com/bliki/PhoenixServer.html" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://martinfowler.com/bliki/PhoenixServer.html" rel="noopener noreferrer"&gt;https://martinfowler.com/bliki/PhoenixServer.html&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The concept of Immutable Infrastructure
&lt;/h2&gt;

&lt;p&gt;The Cambridge dictionary definition of Immutability is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The state of not changing, or being unable to be changed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://dictionary.cambridge.org/dictionary/english/immutability" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://dictionary.cambridge.org/dictionary/english/immutability" rel="noopener noreferrer"&gt;https://dictionary.cambridge.org/dictionary/english/immutability&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the advent of cloud and more capabilities for automation at our disposal, an alternative pattern called Immutable Infrastructure has gained popularity. The concept is simply about starting from a clean, well-known slate, every time a change is performed.&lt;/p&gt;

&lt;p&gt;The concept of the Immutable Server, or disposable servers, was introduced around 2012 by actors such as Thoughtworks, Netflix and Google. Instead of using configuration management to try to keep systems in compliance, they advocated for using configuration management to create base images for servers that could be torn down and rebuilt at will.&lt;/p&gt;

&lt;p&gt;An Immutable Server is the logical conclusion of this approach, a server that once deployed, is never modified or changed. No software or operating system updates, security patches, application releases or configuration changes are being performed in-place on live production servers. “&lt;a href="https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/" rel="noopener noreferrer"&gt;Treat your servers like cattle, not like pets&lt;/a&gt;” gained traction. If there was a problem with a live server, it would be terminated and replaced by a new one, from a known state.&lt;/p&gt;

&lt;p&gt;Not even new application releases/software artifacts are deployed to existing servers. The running servers are replaced with new instances that have the software artifacts built in. With load balancing, automatic health checks and blue/green or canary capabilities, deploying new servers or rolling back to previous versions can be done without end-user impact (if done correctly).&lt;/p&gt;

&lt;p&gt;This concept is also described in the AWS Well-Architected Framework – Reliability Pillar – &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_tracking_change_management_immutable_infrastructure.html" rel="noopener noreferrer"&gt;REL08-BP04 Deploy using immutable infrastructure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Implementing in-place changes to running infrastructure resources&lt;/em&gt;, the common approach, is actually stated as an anti-pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Benefits of establishing this best practice:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Increased consistency across environments:&lt;/strong&gt; Since there are no differences in infrastructure resources across environments, consistency is increased and testing is simplified.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Reduction in configuration drifts:&lt;/strong&gt; By replacing infrastructure resources with a known and version-controlled configuration, the infrastructure is set to a known, tested, and trusted state, avoiding configuration drifts.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Reliable atomic deployments:&lt;/strong&gt; Deployments either complete successfully or nothing changes, increasing consistency and reliability in the deployment process.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Simplified deployments:&lt;/strong&gt; Deployments are simplified because they don’t need to support upgrades. Upgrades are just new deployments.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Safer deployments with fast rollback and recovery processes:&lt;/strong&gt; Deployments are safer because the previous working version is not changed. You can roll back to it if errors are detected.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Enhanced security posture:&lt;/strong&gt; By not allowing changes to infrastructure, remote access mechanisms (such as SSH) can be disabled. This reduces the attack vector, improving your organization’s security posture.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Immutable Infrastructure in the AWS Cloud
&lt;/h2&gt;

&lt;p&gt;This practice can be achieved regardless of compute option, but to reduce time to market and operational overhead it mandates a high degree of automation with CI/CD tools such as AWS CodePipeline and GitHub Actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual machine based workloads on EC2
&lt;/h3&gt;

&lt;p&gt;In this scenario an automated routine is established that produces virtual machine images, called Amazon Machine Images (AMIs). This “golden image” includes the respective version of the application source code and its dependencies, such as operating system services, at the expected versions. AMIs can be produced without environment-specific configuration built in; configuration is fetched from &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html" rel="noopener noreferrer"&gt;AWS Systems Manager Parameter Store&lt;/a&gt; and/or &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt;, so that the same image can be deployed and verified in dev/test, staging and production environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.packer.io/" rel="noopener noreferrer"&gt;Hashicorp Packer&lt;/a&gt; has been available for quite some time. AWS also provides the native service &lt;a href="https://aws.amazon.com/image-builder/" rel="noopener noreferrer"&gt;EC2 Image Builder&lt;/a&gt; for this purpose.&lt;/p&gt;

&lt;p&gt;Amazon provides managed AMIs for both Linux (&lt;a href="https://docs.aws.amazon.com/linux/al2023/ug/ec2.html" rel="noopener noreferrer"&gt;Amazon Linux&lt;/a&gt; based on Fedora) and &lt;a href="https://docs.aws.amazon.com/ec2/latest/windows-ami-reference/windows-ami-versions.html" rel="noopener noreferrer"&gt;Windows&lt;/a&gt; workloads that are tailored for security and performance in AWS.&lt;/p&gt;

&lt;p&gt;With Immutable Infrastructure, Windows Updates and Linux unattended-upgrades are disabled. Every time AWS releases a new officially supported base AMI version, or ad-hoc updates are required, the EC2 Image Builder Pipeline is triggered, which produces a new artifact and validates each change. Every new application version fetches the latest quality assured base AMI and produces a self-contained application AMI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fapx499c9q77eaelmbd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fapx499c9q77eaelmbd.jpg" width="422" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find a Terraform example below, with inline comments explaining the EC2 Image Builder resources.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# EC2 Image Builder component that installs Git and Nginx, clones a sample app repo from GitHub and starts the Nginx web server
resource "aws_imagebuilder_component" "hello_world" {
  name = "hello-world-component"
  platform = "Linux"
  version = "1.0.0"
  description = "Hello World application component"

  data {
    name = "hello-world-app-script"
    type = "AWS_LAMBDA"
    content = &amp;lt;&amp;lt;-EOF
      # Install Git and Nginx
      yum install -y git nginx

      # Clone sample Hello World app from GitHub
      mkdir app-repo
      git clone https://github.com/aws-samples/aws-codepipeline-s3-codedeploy-linux app-repo

      # Copy cloned files to Nginx public HTML directory
      cp -r app-repo/* /usr/share/nginx/html/

      # Start Nginx
      systemctl start nginx
    EOF
  }
}

# Recipe named nginx-recipe which includes the nginx-component.
# parent_image is set to Amazon Linux 2 AMI version 2024.7.20. This can be parameterized. 
resource "aws_imagebuilder_image_recipe" "nginx_recipe" {
  name = "nginx-recipe"
  parent_image = "arn:aws:imagebuilder:${var.region}:aws:image/amazon-linux-2-x86/2024.7.20"
  version = "1.0.0"

  component {
    component_arn = aws_imagebuilder_component.nginx.arn
  }
}

# Defines an infrastructure configuration, including the instance profile, security group, and subnet (intentionally not included)
resource "aws_imagebuilder_infrastructure_configuration" "nginx_infra" {
  name = "nginx-infra"
  instance_profile_name = aws_iam_instance_profile.image_builder_instance_profile.name
  security_group_ids = [aws_security_group.image_builder_sg.id]
  subnet_id = aws_subnet.image_builder_subnet.id
  terminate_instance_on_failure = true
}

# Pipeline output should be an Amazon Machine Image (AMI) with a name based on the build date.
resource "aws_imagebuilder_distribution_configuration" "nginx_distribution" {
  name = "nginx-distribution"

  distribution {
    ami_distribution_configuration {
      name = "nginx-ami-{{ imagebuilder:buildDate }}"
    }
    region = var.region
  }
}

# Defines the Image Builder pipeline, ties together the recipe, infrastructure configuration, and distribution configuration.
resource "aws_imagebuilder_image_pipeline" "nginx_pipeline" {
  name = "nginx-pipeline"
  image_recipe_arn = aws_imagebuilder_image_recipe.nginx_recipe.arn
  infrastructure_configuration_arn = aws_imagebuilder_infrastructure_configuration.nginx_infra.arn
  distribution_configuration_arn = aws_imagebuilder_distribution_configuration.nginx_distribution.arn
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Container based workloads
&lt;/h3&gt;

&lt;p&gt;In your CI/CD tool of choice, container images are produced and pushed to a container repository such as AWS Elastic Container Registry (ECR). Even though immutability is one of the foundations behind containers, it is not enforced; it depends on how the container configuration is specified and launched. Container root filesystems are usually writable by default.&lt;/p&gt;

&lt;p&gt;In many cases software packages are updated on container launch, but in the immutability model this is an anti-pattern. A step like “update installed packages and install Apache” can yield different results when executed at different points in time, as new upstream software package updates become available. We need to ensure that we test the exact same artifact in staging that is deployed to production, so the recommended approach is to update all packages at container build time and then launch the container with its root storage partition in read-only mode.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The container’s root filesystem should be treated as a ‘golden image’ by using Docker run’s &lt;code&gt;--read-only&lt;/code&gt; option. This prevents any writes to the container’s root filesystem at container runtime and enforces the principle of immutable infrastructure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://docs.datadoghq.com/security/default_rules/cis-docker-1.2.0-5.12/" title="" rel="noopener noreferrer"&gt;&lt;em&gt;CIS Docker Benchmark control 5.12 – Datadog&lt;/em&gt;&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Immutability for Docker and local testing
&lt;/h4&gt;

&lt;p&gt;Adding the &lt;code&gt;--read-only&lt;/code&gt; flag at the container’s runtime mounts the container’s root filesystem as read-only. With the &lt;code&gt;--tmpfs&lt;/code&gt; option it is possible to mount a temporary file system for non-persistent data/cache.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run &amp;lt;Run arguments&amp;gt; --read-only &amp;lt;Container Image Name or ID&amp;gt; &amp;lt;Command&amp;gt;

# Example with the --tmpfs option to mount a temporary file system for non-persistent data/cache
docker run --interactive --tty --read-only --tmpfs "/run" --tmpfs "/tmp" ubuntu /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Immutability for workloads provisioned with AWS Elastic Container Service (ECS based on EC2 or Fargate)
&lt;/h4&gt;

&lt;p&gt;Configure the &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definitions" rel="noopener noreferrer"&gt;Amazon ECS Task Definition file&lt;/a&gt; to set the parameter &lt;code&gt;readonlyRootFilesystem&lt;/code&gt; (in the Storage and logging section) to &lt;code&gt;true&lt;/code&gt;, as the default value is &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ecs_task_definition" rel="noopener noreferrer"&gt;Terraform resource definition&lt;/a&gt;, see line 27:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecs_task_definition" "hardened_task_definition" {
  family = "hardened-task-definition"
  network_mode = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu = 256
  memory = 512

  volume {
    name = "efs-volume"
    efs_volume_configuration {
      file_system_id = "fs-0123456789abcdef" # Replace with your EFS file system ID
      root_directory = "/tmp"
    }
  }

  container_definitions = jsonencode([
    {
      name = "hardened-container"
      image = "nginx:latest"
      essential = true
      portMappings = [
        {
          containerPort = 8080
          hostPort = 8080
        }
      ]
      readonlyRootFilesystem = true
      volumesFrom = [
        {
          sourceContainer = "efs-volume-container"
        }
      ]
    },
    {
      name = "efs-volume-container"
      image = "amazon/amazon-efs-utils:latest"
      essential = true
      volumeMounts = [
        {
          name = "efs-volume"
          mountPath = "/tmp"
        }
      ]
    }
  ])
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this configuration, the /tmp directory inside the hardened-container will be mounted to the specified EFS file system, allowing temporary files to be stored on the persistent EFS file system instead of the read-only root filesystem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Immutability for workloads provisioned with AWS Elastic Kubernetes Service
&lt;/h4&gt;

&lt;p&gt;This also applies to Kubernetes in general. In the manifest file, specify &lt;code&gt;securityContext&lt;/code&gt;: &lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example Kubernetes manifest below, which demonstrates the read-only configuration under &lt;code&gt;securityContext&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  containers:
    - image: nginx
      name: hardened-container
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: cache-volume
          mountPath: /var/cache/nginx
        - name: runtime-volume
          mountPath: /var/run
        - name: efs-volume
          mountPath: /tmp
  volumes:
    - name: cache-volume
      emptyDir: {}
    - name: runtime-volume
      emptyDir: {}
    - name: efs-volume
      nfs:
        server: efs-server.default.svc.cluster.local
        path: "/efs-share"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After applying this updated manifest, the Nginx container will have an Amazon EFS volume mounted at /tmp, allowing it to use the persistent storage provided by Amazon EFS for temporary files while still maintaining a read-only root filesystem.&lt;/p&gt;

&lt;p&gt;Leverage Kubernetes’ Deployment strategies for blue/green, &lt;a href="https://kubernetes.io/docs/concepts/workloads/management/#canary-deployments" rel="noopener noreferrer"&gt;canary&lt;/a&gt; etc. as best suits your use case.&lt;/p&gt;

&lt;h4&gt;
  
  
  About Docker labels/image tags and deployment
&lt;/h4&gt;

&lt;p&gt;When new Docker images are produced and pushed to a container repository, a common approach is to add the label/tag “latest” (similar to HEAD in Git), and this is also what CI/CD pipelines reference. With this approach, however, it is hard to determine exactly which Docker image was in production at any given time, since the reference to the actual image is lost.&lt;/p&gt;

&lt;p&gt;To follow through on immutability it is recommended to set a unique tag on every new image. You can reference the actual image digest (checksum) or some other application-specific tag like hello-world-v1.2.3 for &lt;a href="https://www.cncf.io/blog/2021/09/28/gitops-101-whats-it-all-about/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt; and CI/CD deployment. If you also add a tag with the Git commit ID, debugging becomes much easier.&lt;/p&gt;

&lt;p&gt;Rollbacks also become more transparent: you know hello-world-v1.2.3 was live in production, hello-world-v1.2.4 failed the canary health checks, the automatic procedure rolled back to hello-world-v1.2.3, and you can quickly look up the Git commit of hello-world-v1.2.4 to find the change.&lt;/p&gt;
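&lt;p&gt;As a rough illustration, the tagging scheme can be sketched in a few lines of Python; the application name, version format and commit ID below are made-up examples, not a prescribed convention:&lt;/p&gt;

```python
# Build a set of immutable tags for a new container image build,
# instead of pushing a mutable "latest" tag.
def immutable_tags(app: str, version: str, git_commit: str) -> list[str]:
    """Return unique, human-readable tags for one image build."""
    return [
        f"{app}-v{version}",         # referenced by GitOps manifests / pipelines
        f"{app}-{git_commit[:12]}",  # short commit ID for easy debugging
    ]

tags = immutable_tags("hello-world", "1.2.3",
                      "9fceb02d0ae598e95dc970b74767f19372d61af8")
print(tags)  # ['hello-world-v1.2.3', 'hello-world-9fceb02d0ae5']
```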

&lt;p&gt;In Amazon Elastic Container Registry you can &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html" rel="noopener noreferrer"&gt;prevent image tags from being overwritten&lt;/a&gt; by enabling the property Tag immutability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Immutability for serverless workloads provisioned with AWS Lambda
&lt;/h3&gt;

&lt;p&gt;AWS Lambda is immutable by design. Lambda creates a new version of your function each time you publish it. Its code, runtime, architecture, memory, layers, and most other configuration settings remain unchanged.&lt;/p&gt;

&lt;p&gt;Versions can be used to control function deployment. Instead of deploying a change to all production users at once, you can publish a new version for beta testing, or canary testing with a small share of users, by using a specific version reference (a &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html#versioning-versions-using" rel="noopener noreferrer"&gt;Qualified ARN&lt;/a&gt; instead of $LATEST) in Amazon API Gateway or other services that route traffic to your Lambda functions.&lt;/p&gt;
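&lt;p&gt;As a minimal sketch, a qualified ARN is simply the unqualified function ARN with a version or alias suffix; the region, account ID and function name below are placeholders:&lt;/p&gt;

```python
# A qualified ARN pins a consumer (for example an API Gateway integration)
# to an immutable, published Lambda version instead of the mutable $LATEST.
BASE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:helloworld"

def qualified_arn(base_arn: str, qualifier: str) -> str:
    """Append a version number or alias name to an unqualified function ARN."""
    return f"{base_arn}:{qualifier}"

print(qualified_arn(BASE_ARN, "42"))    # pinned to published version 42
print(qualified_arn(BASE_ARN, "prod"))  # routed through the prod alias
```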

&lt;p&gt;In the example below we define three &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-aliases.html" rel="noopener noreferrer"&gt;aliases&lt;/a&gt;: staging, canary-prod and prod, which refers to relevant Lambda function versions. A &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuring-alias-routing.html" rel="noopener noreferrer"&gt;Lambda routing configuration&lt;/a&gt; is defined which directs 90% of the traffic to alias &lt;code&gt;prod&lt;/code&gt; (v41), and 10% to alias &lt;code&gt;canary-prod&lt;/code&gt; (v42).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Procedure&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Test version 42 in staging.&lt;/li&gt;
&lt;li&gt;Roll out version 42 to 10% of all production users.&lt;/li&gt;
&lt;li&gt;CloudWatch Metrics for Lambda functions are observed and grouped by a combination of alias and executed version.&lt;/li&gt;
&lt;li&gt;If application monitoring and health checks stay healthy over a period of e.g. 30 minutes, with no increase in error rates, update the prod alias to increase function_version from 41 to 42 and reset routing_config. If not, roll back.&lt;/li&gt;
&lt;li&gt;If health checks are still OK, end. If not, roll back.&lt;/li&gt;
&lt;/ol&gt;
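&lt;p&gt;The promotion/rollback decision in steps 4 and 5 can be sketched as plain logic; the error-rate tolerance here is an illustrative assumption, and in practice the alias update itself would be an AWS API call or Terraform change:&lt;/p&gt;

```python
# Sketch of the canary decision: compare the error rate of the canary
# version against the current production version over the observation
# window. The tolerance value is an illustrative assumption.
def canary_decision(prod_error_rate: float,
                    canary_error_rate: float,
                    tolerance: float = 0.001) -> str:
    """Return 'promote' when the canary is no worse than prod
    (within tolerance), otherwise 'rollback'."""
    if canary_error_rate <= prod_error_rate + tolerance:
        return "promote"   # shift 100% of traffic to the new version
    return "rollback"      # reset routing to the previous version

print(canary_decision(prod_error_rate=0.002, canary_error_rate=0.002))
print(canary_decision(prod_error_rate=0.002, canary_error_rate=0.050))
```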

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u9m0ebcm0ipb4ylgmn0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u9m0ebcm0ipb4ylgmn0.jpg" width="768" height="1300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform example of AWS Lambda aliases and routing configuration for canary deployment:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lambda_alias" "staging" {
    name = "staging"
    function_name = "arn:aws:lambda:aws-region:123456789012:function:helloworld"
    function_version = "42"
    description = "Canary alias for production environment"
}

resource "aws_lambda_alias" "canary_prod" {
    name = "canary-prod"
    function_name = "arn:aws:lambda:aws-region:123456789012:function:helloworld"
    function_version = "42"
    description = "Canary alias for production environment"
  }

resource "aws_lambda_alias" "prod" {
  name = "prod"
  function_name = "arn:aws:lambda:aws-region:123456789012:function:helloworld"
  function_version = "41"
  description = "Alias for main production audience"
}

resource "aws_lambda_update_alias" "prod_routing_with_canary" {
  name = aws_lambda_alias.prod.name
  function_name = aws_lambda_alias.prod.function_name
  function_version = aws_lambda_alias.prod.function_version

  routing_config {
    additional_version_weights = {
      aws_lambda_alias.canary_prod.function_version = 0.1
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Data handling in the immutable infrastructure model
&lt;/h4&gt;

&lt;p&gt;As the compute resources themselves can come and go, data that needs to be persisted, including session state, has to be moved off of the compute tier. Depending on the type of data there are different AWS services at your disposal:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Data type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recommended AWS service&lt;/strong&gt; (as starting point)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Application session data/state&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/elasticache/" rel="noopener noreferrer"&gt;Amazon ElastiCache Redis/Memcached&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application cache and temporary data&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/efs/" rel="noopener noreferrer"&gt;Amazon Elastic File System (EFS) [“NFS”]&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relational data&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/rds/aurora/" rel="noopener noreferrer"&gt;Amazon Aurora&lt;/a&gt; etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-relational data&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt; etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By adhering to the principle of never changing a running system, a higher degree of predictability and reliability can be achieved.&lt;/p&gt;

&lt;p&gt;Many teams view rollbacks as a pain and avoid spending time on rollback and disaster recovery testing. “We don’t do rollbacks, we prefer to try to fix the problem and roll forward” is a warning signal of manual procedures and lacking automation.&lt;/p&gt;

&lt;p&gt;With the immutable infrastructure approach, rollbacks become a natural part of change management. Database changes can be managed with the &lt;a href="https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern" rel="noopener noreferrer"&gt;Expand-Contract pattern&lt;/a&gt;. If automated checks fail or error rates increase, the deployment should automatically roll back to the previous known working version, which the database schema still supports. By adopting canary releases and/or blue-green deployments, rollbacks can be performed quickly and with reduced end user impact.&lt;/p&gt;
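
&lt;p&gt;Automatic rollback on failing checks can be wired up with, for example, AWS CodeDeploy. The following is a rough Terraform sketch, with hypothetical application and role names, of a deployment group that rolls back when a CloudWatch alarm on the error rate fires:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical example: alarm on elevated 5XX responses behind an ALB.
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "app-5xx-error-rate"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
}

resource "aws_codedeploy_deployment_group" "app" {
  app_name              = "my-app" # hypothetical
  deployment_group_name = "prod"
  service_role_arn      = aws_iam_role.codedeploy.arn # assumed to exist

  # Roll back automatically on failure or when the alarm fires.
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }

  alarm_configuration {
    enabled = true
    alarms  = [aws_cloudwatch_metric_alarm.error_rate.alarm_name]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;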

&lt;p&gt;Another benefit of the immutable infrastructure approach is that failed releases become far less dramatic. By releasing small changes frequently, a failed new release shouldn’t be a big issue. Instead of being up at night or feeling the pressure of having to fight fires to resolve issues impacting end users, teams can investigate in peace and quiet, produce a new artifact version with the bug-fix, and ship it quickly through fully automated and quality assured CI/CD pipelines, to master the art of &lt;a href="https://continuousdelivery.com/principles/" rel="noopener noreferrer"&gt;Continuous Delivery&lt;/a&gt; and realize business value &lt;em&gt;faster&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://repost.aws/knowledge-center/ec2-instance-crowdstrike-agent" rel="noopener noreferrer"&gt;https://repost.aws/knowledge-center/ec2-instance-crowdstrike-agent&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/bliki/ImmutableServer.html" rel="noopener noreferrer"&gt;https://martinfowler.com/bliki/ImmutableServer.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_tracking_change_management_immutable_infrastructure.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework REL08-BP04 Deploy using immutable infrastructure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_tracking_change_management_automated_changemgmt.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework REL08-BP05 Deploy changes with automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_mit_deploy_risks_auto_testing_and_rollback.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework OPS06-BP04 Automate testing and rollback&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_auto_integ_deploy.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework OPS05-BP09 Make frequent, small, reversible changes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://12factor.net/processes%20" rel="noopener noreferrer"&gt;https://12factor.net/processes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/ami-creation-with-aminator-98d627ca37b0%20" rel="noopener noreferrer"&gt;https://netflixtechblog.com/ami-creation-with-aminator-98d627ca37b0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html" rel="noopener noreferrer"&gt;AWS EC2 Image Builder User Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2024/08/09/move-fast-and-avoid-surprises-increase-system-reliability-with-immutable-infrastructure/" rel="noopener noreferrer"&gt;Increase system reliability with Immutable Infrastructure – Move fast and avoid surprises&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>immutability</category>
      <category>wellarchitected</category>
      <category>bastion</category>
    </item>
    <item>
      <title>Bye bye Bastion!</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Wed, 03 Jul 2024 17:27:38 +0000</pubDate>
      <link>https://forem.com/haakoned/bye-bye-bastion-3dj2</link>
      <guid>https://forem.com/haakoned/bye-bye-bastion-3dj2</guid>
      <description>&lt;h5&gt;
  
  
  &lt;strong&gt;Table of contents&lt;/strong&gt;
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Exploring alternative workflows

&lt;ul&gt;
&lt;li&gt;Deploy sample infrastructure&lt;/li&gt;
&lt;li&gt;Well-Architected Virtual Private Cloud (VPC)&lt;/li&gt;
&lt;li&gt;RDS Aurora MySQL Multi-AZ cluster&lt;/li&gt;
&lt;li&gt;EC2 Amazon Linux instance and security group&lt;/li&gt;
&lt;li&gt;AWS Cloud9 SSM Managed Instance&lt;/li&gt;
&lt;li&gt;Alternative 1: AWS Systems Manager – Session Manager&lt;/li&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Connecting to an EC2 instance in a private subnet from the AWS Console&lt;/li&gt;
&lt;li&gt;Connecting to an EC2 instance in a private subnet from your local workstation with the AWS CLI and AWS IAM Identity Center&lt;/li&gt;
&lt;li&gt;Connecting to an RDS cluster in a private database subnet with local port forwarding&lt;/li&gt;
&lt;li&gt;Alternative 2: AWS CloudShell VPC Environment&lt;/li&gt;
&lt;li&gt;Alternative 3: AWS Cloud9 IDE&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion and feature comparison&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://aws.amazon.com/solutions/implementations/linux-bastion/" rel="noopener noreferrer"&gt;Bastion Host&lt;/a&gt;, or Jump Host, has historically been a traditional pattern for providing system administrators with external access to internal compute resources on distinct networks. An actor connects to a dedicated host in a DMZ, most commonly over Secure Shell (SSH) or Remote Desktop Protocol (RDP), and from there gains access to compute resources on internal networks to perform system maintenance, apply patches, check content in a database, update schemas and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuctxfx4vy6h1mip2iolb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuctxfx4vy6h1mip2iolb.jpg" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Bastion Hosts must be highly secured to withstand attacks. Measures include hardened operating systems and server configurations, removing non-required services and libraries, firewalling and audit logging, but the reality is that many instances do not meet the recommended standards, either through lack of knowledge or configuration mistakes. One of the most common attack vectors is simply not locking down the firewall rules/security group rules to only permit access on relevant ports from trusted IP ranges, leaving SSH/RDP open to anyone (0.0.0.0/0) instead of your corporate gateway or VPN.&lt;/p&gt;

&lt;p&gt;As mentioned in my post &lt;a href="https://hedrange.com/2024/05/23/protect-your-webapps-from-malicious-traffic-with-aws-web-application-firewall/" rel="noopener noreferrer"&gt;Protect your webapps from malicious traffic with AWS Web Application Firewall&lt;/a&gt;: &lt;em&gt;&lt;strong&gt;7 percent of EC2 instances&lt;/strong&gt;, &lt;strong&gt;3 percent of Azure VMs&lt;/strong&gt;, and &lt;strong&gt;13 percent of Google Cloud VMs&lt;/strong&gt; are publicly exposed to the internet. Among instances that are publicly exposed, HTTP and HTTPS are the most commonly exposed ports, and are not considered risky in general. After these, SSH and RDP remote access protocols are common.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On July 2nd 2024 the critical vulnerability &lt;a href="https://www.qualys.com/regresshion-cve-2024-6387/" rel="noopener noreferrer"&gt;CVE-2024-6387, labeled regreSSHion&lt;/a&gt;, was announced, where an unauthenticated remote code execution in OpenSSH’s server (sshd) could grant full root access. With a vulnerability score of 9.8/10 this is one of the most serious bugs in OpenSSH in years.&lt;/p&gt;

&lt;p&gt;The first thing to do is to stop exposing SSH/RDP, and then find alternative, more modern workflows for accessing cloud resources. Even if access is whitelisted from trusted IP address ranges, it’s a bad practice in 2024 to have direct access into production environments.&lt;/p&gt;

&lt;p&gt;As stated in the AWS Well-Architected Framework – Security Pillar:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Use automation to perform deployment, configuration, maintenance, and investigative tasks wherever possible. Consider manual access to compute resources in cases of emergency procedures or in safe (sandbox) environments, when automation is not available.&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Common anti-patterns&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Interactive access to Amazon EC2 instances with protocols such as SSH or RDP.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Maintaining individual user logins such as &lt;code&gt;/etc/passwd&lt;/code&gt; or Windows local users.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Sharing a password or private key to access an instance among multiple users.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Manually installing software and creating or updating configuration files.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Manually updating or patching software.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Logging into an instance to troubleshoot problems.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Removing the use of Secure Shell (SSH) and Remote Desktop Protocol (RDP) for interactive access reduces the scope of access to your compute resources. This takes away a common path for unauthorized actions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_protect_compute_reduce_manual_management.html" title="" rel="noopener noreferrer"&gt;SEC06-BP03 Reduce manual management and interactive access&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For reference, the &lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/cis-aws-foundations-benchmark.html" rel="noopener noreferrer"&gt;CIS AWS Foundations Benchmark&lt;/a&gt; has multiple controls for detecting public exposure of SSH/RDP. &lt;a href="https://aws.amazon.com/security-hub/" rel="noopener noreferrer"&gt;AWS Security Hub&lt;/a&gt;, &lt;a href="https://aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt; and &lt;a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/" rel="noopener noreferrer"&gt;AWS Trusted Advisor&lt;/a&gt; can help you here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/ec2-controls.html#ec2-53" rel="noopener noreferrer"&gt;[EC2.53]&lt;/a&gt; EC2 security groups should not allow ingress from 0.0.0.0/0 to remote server administration ports

&lt;ul&gt;
&lt;li&gt;This control checks whether an Amazon EC2 security group allows ingress from 0.0.0.0/0 to remote server administration ports (ports 22 and 3389). The control fails if the security group allows ingress from 0.0.0.0/0 to port 22 or 3389.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/ec2-controls.html#ec2-13" rel="noopener noreferrer"&gt;[EC2.13]&lt;/a&gt; Security groups should not allow ingress from 0.0.0.0/0 or ::/0 to port 22

&lt;ul&gt;
&lt;li&gt;This control checks whether an Amazon EC2 security group allows ingress from 0.0.0.0/0 or ::/0 to port 22. The control fails if the security group allows ingress from 0.0.0.0/0 or ::/0 to port 22.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/ec2-controls.html#ec2-21" rel="noopener noreferrer"&gt;[EC2.21]&lt;/a&gt; Network ACLs should not allow ingress from 0.0.0.0/0 to port 22 or port 3389

&lt;ul&gt;
&lt;li&gt;This control checks whether a network access control list (network ACL) allows unrestricted access to the default TCP ports for SSH/RDP ingress traffic. The control fails if the network ACL inbound entry allows a source CIDR block of ‘0.0.0.0/0’ or ‘::/0’ for TCP ports 22 or 3389.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
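
&lt;p&gt;For continuous detection, controls like these can also be complemented with AWS Config managed rules. A minimal Terraform sketch, assuming an AWS Config recorder is already enabled in the account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_config_config_rule" "restricted_ssh" {
  name = "restricted-ssh"

  source {
    owner = "AWS"
    # Managed rule: flags security groups allowing 0.0.0.0/0 ingress on port 22.
    source_identifier = "INCOMING_SSH_DISABLED"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;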

&lt;h2&gt;
  
  
  Exploring alternative workflows
&lt;/h2&gt;

&lt;p&gt;It’s encouraged to pivot from a classical systems administration approach and substitute interactive access with &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager&lt;/a&gt; capabilities.&lt;/p&gt;

&lt;p&gt;Look into how you can automate runbooks and trigger maintenance tasks with &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html" rel="noopener noreferrer"&gt;AWS Systems Manager Automation documents&lt;/a&gt;.&lt;/p&gt;
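
&lt;p&gt;As one example of automating maintenance instead of logging in, patching can be scheduled entirely through Systems Manager. A rough Terraform sketch, with hypothetical names, schedule and tag targeting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssm_maintenance_window" "patching" {
  name     = "weekly-patching"
  schedule = "cron(0 3 ? * SUN *)" # Sundays at 03:00
  duration = 2
  cutoff   = 1
}

resource "aws_ssm_maintenance_window_target" "all_managed" {
  window_id     = aws_ssm_maintenance_window.patching.id
  resource_type = "INSTANCE"

  targets {
    key    = "tag:Environment"
    values = ["dev"] # hypothetical tag selection
  }
}

resource "aws_ssm_maintenance_window_task" "patch" {
  window_id       = aws_ssm_maintenance_window.patching.id
  task_type       = "RUN_COMMAND"
  task_arn        = "AWS-RunPatchBaseline"
  max_concurrency = "2"
  max_errors      = "1"

  targets {
    key    = "WindowTargetIds"
    values = [aws_ssm_maintenance_window_target.all_managed.id]
  }

  task_invocation_parameters {
    run_command_parameters {
      parameter {
        name   = "Operation"
        values = ["Install"]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;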

&lt;p&gt;Don’t perform changes in live systems. Deploy EC2 compute resources using the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_tracking_change_management_immutable_infrastructure.html" rel="noopener noreferrer"&gt;immutable infrastructure pattern&lt;/a&gt;, just as you would for container based workloads. Or, even better, containerize and move to AWS ECS Fargate/EKS.&lt;/p&gt;

&lt;p&gt;If your use-case dictates interactive access; disable security group ingress rules for port 22/tcp (SSH) or port 3389/tcp (RDP) and leverage AWS SSM – &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" rel="noopener noreferrer"&gt;Session Manager&lt;/a&gt; agent based access to EC2. You can also configure activity logging to Amazon CloudWatch Logs for a full audit trail. We will explore this workflow in the coming chapters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy sample infrastructure
&lt;/h3&gt;

&lt;p&gt;For testing the described alternatives I have developed a sample Terraform module which deploys the following resources.&lt;/p&gt;

&lt;h5&gt;
  
  
  Well-Architected Virtual Private Cloud (VPC)
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets&lt;/li&gt;
&lt;li&gt;Private subnets with VPC endpoints&lt;/li&gt;
&lt;li&gt;Database subnets&lt;/li&gt;
&lt;li&gt;NAT Gateways&lt;/li&gt;
&lt;li&gt;VPC endpoints&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 2)
  aws_account_id = data.aws_caller_identity.current.account_id
}

module "vpc" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=25322b6b6be69db6cca7f167d7b0e5327156a595"

  name = var.name_prefix
  cidr = var.vpc_cidr
  azs = local.azs
  private_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 8, k)]
  public_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 8, k + 4)]
  database_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 8, k + 8)]

  create_database_subnet_group = true
  create_database_subnet_route_table = true
  create_database_internet_gateway_route = false

  manage_default_network_acl = true
  manage_default_route_table = true
  manage_default_security_group = true

  enable_dns_hostnames = true
  enable_dns_support = true
  enable_nat_gateway = true
  single_nat_gateway = false
  one_nat_gateway_per_az = true

  enable_flow_log = true
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role = true
  flow_log_max_aggregation_interval = 60

  vpc_tags = {
    Name = var.name_prefix
  }
}

module "vpc_endpoints" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git//modules/vpc-endpoints?ref=4a2809c673afa13097af98c2e3c553da8db766a9"

  vpc_id = module.vpc.vpc_id

  create_security_group = true
  security_group_name_prefix = "${var.name_prefix}-vpc-endpoints-"
  security_group_description = "VPC endpoint security group"
  security_group_rules = {
    ingress_https = {
      description = "HTTPS from VPC"
      cidr_blocks = [module.vpc.vpc_cidr_block]
    }
  }

  endpoints = {
    dynamodb = {
      service = "dynamodb"
      service_type = "Gateway"
      route_table_ids = flatten([module.vpc.intra_route_table_ids, module.vpc.private_route_table_ids, module.vpc.public_route_table_ids])
      policy = data.aws_iam_policy_document.dynamodb_endpoint_policy.json
      tags = { Name = "dynamodb-vpc-endpoint" }
    },
    ecs = {
      service = "ecs"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
    },
    ecs_telemetry = {
      create = false
      service = "ecs-telemetry"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
    },
    ecr_api = {
      service = "ecr.api"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
      policy = data.aws_iam_policy_document.generic_endpoint_policy.json
    },
    ecr_dkr = {
      service = "ecr.dkr"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
      policy = data.aws_iam_policy_document.generic_endpoint_policy.json
    },
    rds = {
      service = "rds"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
      security_group_ids = [module.db.security_group_id]
    },
    kms = {
      service = "kms"
      private_dns_enabled = true
      subnet_ids = module.vpc.database_subnets
    },
    ssm = {
      service = "ssm"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
    },
    ssmmessages = {
      service = "ssmmessages"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
    },
    ec2messages = {
      service = "ec2messages"
      private_dns_enabled = true
      subnet_ids = module.vpc.private_subnets
    }
  }

  tags = {
    Name = var.name_prefix
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  RDS Aurora MySQL Multi-AZ cluster
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "db" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-rds-aurora.git?ref=7d46e900b31322fd7a0ab0d7f67006ba4836c995"

  name = "${var.name_prefix}-rds"
  engine = "aurora-mysql"
  engine_version = "8.0"
  master_username = "root"
  instances = {
    1 = {
      instance_class = "db.t3.medium"
    }
    2 = {
      instance_class = "db.t3.medium"
    }
  }
  vpc_id = module.vpc.vpc_id
  db_subnet_group_name = module.vpc.database_subnet_group_name
  security_group_rules = {
    # Map keys must be unique; using "ingress" twice would silently
    # drop one of the rules.
    ingress_private_access = {
      source_security_group_id = aws_security_group.private_access.id
    }
    ingress_cloud9 = {
      source_security_group_id = data.aws_security_group.cloud9_security_group.id
    }
    kms_vpc_endpoint = {
      type = "egress"
      from_port = 443
      to_port = 443
      source_security_group_id = module.vpc_endpoints.security_group_id
    }
  }

  tags = {
    Name = var.name_prefix
    Environment = "dev"
    Classification = "internal"
  }

  manage_master_user_password_rotation = true
  master_user_password_rotation_schedule_expression = "rate(7 days)"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  EC2 Amazon Linux instance and security group
&lt;/h5&gt;

&lt;p&gt;The Security Group for RDS is provisioned within the module. One placeholder security group labeled “private_access” is defined for EC2 and CloudShell purposes, which only permits egress traffic. It is referenced in the RDS cluster security group to permit incoming connections on port 3306 for MySQL. This is called &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html#security-group-referencing" rel="noopener noreferrer"&gt;security group referencing&lt;/a&gt; and allows for dynamic configurations instead of specifying static CIDR ranges, which are often too permissive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky7530bxir8v5a2f56y8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky7530bxir8v5a2f56y8.jpg" width="800" height="676"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_ami" "amazon_linux_23" {
  most_recent = true
  owners = ["amazon"]

  filter {
    name = "name"
    values = ["al2023-ami-2023*-x86_64"]
  }
}

module "ec2_instance" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-ec2-instance.git?ref=4f8387d0925510a83ee3cb88c541beb77ce4bad6"

  name = "${var.name_prefix}-ec2"
  ami = data.aws_ami.amazon_linux_23.id
  create_iam_instance_profile = true
  iam_role_description = "IAM role for EC2 instance and SSM access"
  iam_role_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }
  instance_type = "t2.micro"
  vpc_security_group_ids = [aws_security_group.private_access.id]
  subnet_id = element(module.vpc.private_subnets, 0)

  # Enforces IMDSv2
  metadata_options = {
    "http_endpoint" : "enabled",
    "http_put_response_hop_limit" : 1,
    "http_tokens" : "required"
  }

  tags = {
    Name = "${var.name_prefix}-ec2"
    Environment = "dev"
  }
}

resource "aws_security_group" "private_access" {
  #checkov:skip=CKV2_AWS_5: Placeholder security group, to be assigned to applicable resources, but beyond scope of this module.
  name_prefix = "${var.name_prefix}-private-access"
  description = "Security group for private access from local resources. Permits egress traffic."
  vpc_id = module.vpc.vpc_id

  egress {
    description = "Permit egress TCP"
    from_port = 0
    to_port = 65535
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    description = "Permit egress UDP"
    from_port = 0
    to_port = 65535
    protocol = "udp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    description = "Permit egress ICMP"
    from_port = -1
    to_port = -1
    protocol = "icmp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.name_prefix}-sg-private-access"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  AWS Cloud9 SSM Managed Instance
&lt;/h5&gt;

&lt;p&gt;The data source for the security group makes it possible to identify the security group provisioned by AWS Cloud9. This is added in an ingress rule for the RDS cluster, as previously defined.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_cloud9_environment_ec2" "cloud9_ssm_instance" {
  name = "${var.name_prefix}-cloud9"
  instance_type = "t2.micro"
  automatic_stop_time_minutes = 30
  image_id = "amazonlinux-2023-x86_64"
  connection_type = "CONNECT_SSM"
  subnet_id = element(module.vpc.private_subnets, 0)
  owner_arn = length(var.cloud9_instance_owner_arn) &amp;gt; 0 ? var.cloud9_instance_owner_arn : null
}

data "aws_security_group" "cloud9_security_group" {
  filter {
    name = "tag:aws:cloud9:environment"
    values = [
      aws_cloud9_environment_ec2.cloud9_ssm_instance.id
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a fully working code repository along with setup instructions, see &lt;a href="https://github.com/haakond/terraform-aws-bastion-host-alternatives/blob/main/examples/main.tf" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-bastion-host-alternatives/blob/main/examples/main.tf&lt;/a&gt; and &lt;a href="https://github.com/haakond/terraform-aws-bastion-host-alternatives/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-bastion-host-alternatives/blob/main/README.md&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative 1: AWS Systems Manager – Session Manager
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager – Session Manager&lt;/a&gt;, you can manage your Amazon Elastic Compute Cloud (Amazon EC2) instances, edge devices, on-premises servers, and virtual machines (VMs). Port forwarding is also supported to connect to remote hosts in private subnets.&lt;/p&gt;

&lt;p&gt;You can use either an interactive one-click browser-based shell or the AWS Command Line Interface (AWS CLI). Session Manager provides secure and auditable node management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys. Session Manager supports Linux, Windows and macOS and session activity can be logged with AWS CloudTrail and Amazon CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;Session Manager can be configured to &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-logging.html" rel="noopener noreferrer"&gt;log entered commands and their output during a session&lt;/a&gt;, which can be used for generating reports or audit situations.&lt;/p&gt;
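
&lt;p&gt;Session logging is controlled through the Session Manager preferences document. A rough Terraform sketch that sends session activity to a CloudWatch Logs group follows; the log group name is a hypothetical placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_cloudwatch_log_group" "ssm_sessions" {
  name              = "/ssm/session-logs" # hypothetical name
  retention_in_days = 90
}

# Session Manager reads its preferences from the regional document
# named "SSM-SessionManagerRunShell".
resource "aws_ssm_document" "session_preferences" {
  name            = "SSM-SessionManagerRunShell"
  document_type   = "Session"
  document_format = "JSON"

  content = jsonencode({
    schemaVersion = "1.0"
    description   = "Session Manager preferences with CloudWatch logging"
    sessionType   = "Standard_Stream"
    inputs = {
      cloudWatchLogGroupName      = aws_cloudwatch_log_group.ssm_sessions.name
      cloudWatchEncryptionEnabled = false
    }
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;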

&lt;p&gt;Note: AWS also provides an option with &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-with-ec2-instance-connect-endpoint.html" rel="noopener noreferrer"&gt;EC2 Instance Connect Endpoints&lt;/a&gt;, but this is based on SSH.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22dlc3hpqa661hscdoau.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22dlc3hpqa661hscdoau.jpg" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The coming two examples are based on this access pattern.&lt;/p&gt;

&lt;h4&gt;
  
  
  Prerequisites
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/operating-systems-and-machine-types.html" rel="noopener noreferrer"&gt;Supported operating system&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html" rel="noopener noreferrer"&gt;AWS Systems Manager SSM agent installed&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/ami-preinstalled-agent.html" rel="noopener noreferrer"&gt;List of AMIs with the SSM Agent preinstalled&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html" rel="noopener noreferrer"&gt;Connectivity to endpoints&lt;/a&gt; ec2messages, ssm and ssmmessages in the current region&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-instance-profile.html" rel="noopener noreferrer"&gt;IAM service role permissions&lt;/a&gt; AmazonSSMManagedInstanceCore or equivalent&lt;/li&gt;

&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html" rel="noopener noreferrer"&gt;Optional: Install the Session Manager plugin for the AWS CLI&lt;/a&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Connecting to an EC2 instance in a private subnet from the AWS Console
&lt;/h4&gt;

&lt;p&gt;There are two possible starting points in the AWS Console: AWS Systems Manager – Session Manager, or the EC2 instances list directly. The EC2 approach is usually the fastest and most convenient. With the prerequisites in order, find the instance you want to connect to and hit Connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyu2o8q22e64l0azsg2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyu2o8q22e64l0azsg2m.png" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ensure the tab with the option Session Manager is chosen and click Connect again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ivnp0tdtijw6f90jrwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ivnp0tdtijw6f90jrwn.png" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am now logged in and authenticated with my Federated AWS IAM Identity Center method and we have full traceability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7ly1rymvoarl4pulr31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7ly1rymvoarl4pulr31.png" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Connecting to an EC2 instance in a private subnet from your local workstation with the AWS CLI and AWS IAM Identity Center
&lt;/h4&gt;

&lt;p&gt;In this example Microsoft Entra ID is the identity provider and &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers.html" rel="noopener noreferrer"&gt;federation&lt;/a&gt; is configured with AWS IAM Identity Center for modern user management. Do yourself a favor and get rid of those IAM users with static access keys.&lt;/p&gt;

&lt;p&gt;To authorize your workstation based on Linux, macOS or Windows Subsystem for Linux, ensure you have the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;latest version of the AWS CLI installed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;aws configure sso&lt;/code&gt; and follow the instructions to obtain a valid session based on a browser where you’re currently logged in. For a full step by step guide see &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should now be logged in and have chosen the relevant AWS account and role to assume.&lt;/p&gt;

&lt;p&gt;To confirm, I run the following AWS CLI command to list EC2 instances including the Name tag value. One instance is returned.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId, Tags[?Key==Name].Value[]]' --output=json

[
    [
        [
            "i-0b448a908727eec7d",
            [
                "bastion-alternative-demo-ec2"
            ]
        ]
    ]
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connecting to the instance is as simple as:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session --target i-0b448a908727eec7d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qh9tjngmspttmb5j3hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qh9tjngmspttmb5j3hb.png" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Type &lt;code&gt;exit&lt;/code&gt; to terminate the session. There you go: CLI or Console based access with a security group with no inbound rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jjvuxw1whuw6oe9ontc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jjvuxw1whuw6oe9ontc.png" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Connecting to an RDS cluster in a private database subnet with local port forwarding
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cfrgbapmt7vu9631alw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cfrgbapmt7vu9631alw.jpg" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As with the previous example, establish a session with &lt;code&gt;aws configure sso&lt;/code&gt; and identify the EC2 instance you would like to use as a proxy. In this example we use the same one. To find the RDS cluster writer endpoint name, you can issue the following command:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws rds describe-db-clusters \
  --query 'DBClusters[?starts_with(DBClusterIdentifier, `bastion-alternative-demo`)].DBClusterIdentifier' \
  --output text | xargs -I {} aws rds describe-db-cluster-endpoints \
  --db-cluster-identifier {} \
  --query 'DBClusterEndpoints[?EndpointType==`WRITER`].Endpoint' \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, in the AWS Console, navigate to RDS and copy the writer endpoint name (an FQDN).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn9bpqxlqhju2214l2n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn9bpqxlqhju2214l2n9.png" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execute the following command to start a port forwarding session which should provide MySQL connectivity to port 3306 on your local machine:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session --target i-0b448a908727eec7d --document-name AWS-StartPortForwardingSessionToRemoteHost --parameters '{"portNumber":["3306"],"localPortNumber":["3306"],"host":["bastion-alternative-demo-rds.cluster-cmikipz5ncly.eu-west-1.rds.amazonaws.com"]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main difference is that this time the &lt;code&gt;aws ssm start-session&lt;/code&gt; command triggers an AWS Systems Manager document called “AWS-StartPortForwardingSessionToRemoteHost”, to which we supply the desired parameters. For demonstration purposes, the Terraform module has provisioned an RDS Aurora MySQL cluster in private database subnets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvxvajxr31iqm7hf2x7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvxvajxr31iqm7hf2x7s.png" width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The command outputs “Starting session with SessionId (..) Port 3306 opened, Waiting for connections.”&lt;/p&gt;

&lt;p&gt;In another terminal I run &lt;code&gt;mysql -u root -p -h 127.0.0.1 --port 3306&lt;/code&gt; and take the opportunity to create a new privileged MySQL user called chuck_norris.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yl5a58o4b8t66er8jen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1yl5a58o4b8t66er8jen.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q06dtvdnp8bd7htvio5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q06dtvdnp8bd7htvio5.png" width="558" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Verified to be working as expected.&lt;/p&gt;

&lt;p&gt;If you prefer to use a GUI client like &lt;a href="https://www.mysql.com/products/workbench/" rel="noopener noreferrer"&gt;MySQL Workbench&lt;/a&gt; or &lt;a href="https://www.heidisql.com/" rel="noopener noreferrer"&gt;HeidiSQL&lt;/a&gt;, configure it to connect to localhost on port 3306. If you run a development database server locally, you will probably want to forward to a different local port.&lt;/p&gt;
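
&lt;p&gt;For example, to forward to local port 13306 instead, reusing the instance ID and writer endpoint from the example above, the command could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session --target i-0b448a908727eec7d --document-name AWS-StartPortForwardingSessionToRemoteHost --parameters '{"portNumber":["3306"],"localPortNumber":["13306"],"host":["bastion-alternative-demo-rds.cluster-cmikipz5ncly.eu-west-1.rds.amazonaws.com"]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your GUI client would then connect to 127.0.0.1 on port 13306, leaving your local database server’s port 3306 untouched.&lt;/p&gt;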

&lt;h3&gt;
  
  
  Alternative 2: AWS CloudShell VPC Environment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymj9ncitz3k8ebdarlij.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymj9ncitz3k8ebdarlij.jpg" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS CloudShell is a browser-based shell that is pre-authenticated with your console credentials which makes it easy to securely manage, explore and interact with AWS resources. Common development tools related to AWS are also pre-installed.&lt;/p&gt;

&lt;p&gt;AWS &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/06/aws-cloudshell-amazon-virtual-private-cloud/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; VPC environment support for AWS CloudShell on June 26th 2024. This makes it possible to use CloudShell securely within the same subnet as other resources in your VPCs, without additional network configuration. Before this, there was no way to control the network flow.&lt;/p&gt;

&lt;p&gt;One caveat is that the AWS CloudShell VPC environment does not support persistent storage, unlike the regular CloudShell feature. Storage is ephemeral: data and home directories are deleted when an active environment ends, so ensure any data you care about is saved in Amazon S3 or another relevant persistent store. In my opinion, from a security perspective, auto-cleanup is a positive thing.&lt;/p&gt;

&lt;p&gt;Open CloudShell from the main AWS Services search box or the logo icon:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o5mpg2owuszxca8skfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o5mpg2owuszxca8skfj.png" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We choose the pre-provisioned VPC, subnet and security group:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy65k278dik1rhwn4tevz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy65k278dik1rhwn4tevz.png" width="522" height="793"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s verify outbound connectivity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t0uj27vkwpgmvgrj8w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t0uj27vkwpgmvgrj8w7.png" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx17epgmy5j44lkgdcpio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx17epgmy5j44lkgdcpio.png" width="460" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The /home partition has about 12GB free space. If you need more scratch space, look into mounting &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEFS.html" rel="noopener noreferrer"&gt;Amazon Elastic File System&lt;/a&gt; or dump stuff on &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;. Keep in mind that the CloudShell volume is ephemeral and data will be gone after your session ends.&lt;/p&gt;
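
&lt;p&gt;Since the volume is ephemeral, a quick way to persist your work before the session ends is to sync it to S3. The bucket name below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 sync ~/work s3://my-example-bucket/cloudshell-backup/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;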

&lt;p&gt;Let’s also check that CloudShell’s Elastic Network Interface is in the expected private subnet. Thank you &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/q-in-IDE-setup.html" rel="noopener noreferrer"&gt;Amazon Q&lt;/a&gt; for the creative query suggestion.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 describe-network-interfaces \
  --filters "Name=private-ip-address,Values=$(ip addr show ens6 | grep -oE '((1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.){3}(1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])' | head -n 1)" \
  --query 'NetworkInterfaces[*].{SubnetId:SubnetId}' \
  --output text | xargs -I {} aws ec2 describe-subnets --subnet-ids {} \
  --query 'Subnets[*].{SubnetId:SubnetId,Ipv4CidrBlock:CidrBlock,TagName:Tags[?Key==`Name`].Value|[0]}' --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjengc0gzcybhfhd2f89f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjengc0gzcybhfhd2f89f.png" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CloudShell has many development tools pre-installed, including a MySQL client. To verify that the endpoint hostname resolves to a private IP address in the private subnet, install dig from bind-utils:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum install bind-utils -y
dig +short &amp;lt;hostname&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffehpblnv7dnsewtio4em.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffehpblnv7dnsewtio4em.png" width="800" height="41"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheviw11gc6akc4pke5n5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheviw11gc6akc4pke5n5.png" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We successfully connected to the RDS Aurora MySQL cluster and created a new database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative 3: AWS Cloud9 IDE
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pyhkb1cr5xz0jxyvhx6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pyhkb1cr5xz0jxyvhx6.jpg" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html" rel="noopener noreferrer"&gt;AWS Cloud9&lt;/a&gt; is a cloud based integrated development environment (IDE) that can be accessed through the AWS Console. It could be an alternative if you prefer a lightweight client/workstation environment, where your development environment will be the configured the same across client devices.&lt;/p&gt;

&lt;p&gt;Cloud9 is a fully managed service based on EC2, with EBS for persistent data storage. Instance hibernation takes place after a configurable period of inactivity (starting at 30 minutes) to save costs when not in use. A Cloud9 &lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/environments.html" rel="noopener noreferrer"&gt;environment&lt;/a&gt; can be deployed into both public and private subnets, in either &lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/create-environment-main.html" rel="noopener noreferrer"&gt;EC2&lt;/a&gt; (recommended) or &lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/create-environment-ssh.html" rel="noopener noreferrer"&gt;SSH&lt;/a&gt; (discouraged) mode. The EC2 mode supports the “&lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/ec2-ssm.html" rel="noopener noreferrer"&gt;no-ingress instance&lt;/a&gt;” pattern, which requires no open inbound ports, by leveraging AWS Systems Manager. This is the procedure we will explore further.&lt;/p&gt;

&lt;p&gt;A sample Cloud9 instance has been provisioned as part of the Terraform sample module. Navigate to AWS Cloud9 in the AWS Console, locate “bastion-alternative-demo-cloud9” and click Open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy0ql6xp81darx1djjmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy0ql6xp81darx1djjmd.png" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The EC2 environment type comes with &lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/security-iam.html#auth-and-access-control-temporary-managed-credentials" rel="noopener noreferrer"&gt;AWS managed temporary credentials&lt;/a&gt; activated by default, which manages AWS access credentials on the user’s behalf, with certain restrictions. To ensure you get all privileges granted by the IAM policies of your active role session, disable this in the AWS Cloud9 main window under Preferences, AWS Settings, Credentials.&lt;/p&gt;

&lt;p&gt;Open a Terminal tab and issue &lt;code&gt;aws configure sso&lt;/code&gt; as in the previous example, and set the SSO session name “default”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pwm2mj9xdhh9vwpdarr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pwm2mj9xdhh9vwpdarr.png" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws s3 ls&lt;/code&gt; verifies the AWS session. Technically, this could have been executed from “anywhere”, but the main benefits are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Cloud9 environment maintains the configuration regardless of client device. &lt;/li&gt;
&lt;li&gt;Other resources on private subnets may be configured to be available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Verify database connectivity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0ty73b2hbhf2eaxmqgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0ty73b2hbhf2eaxmqgm.png" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and feature comparison
&lt;/h2&gt;

&lt;p&gt;In this blog post we explored alternative workflows that can replace the concept of a traditional bastion host accessed over SSH or RDP. AWS provides several alternatives, so you can choose the one that best fits your use case.&lt;/p&gt;

&lt;p&gt;| &lt;strong&gt;Alternative&lt;/strong&gt; | &lt;strong&gt;Good&lt;/strong&gt; | &lt;strong&gt;Not so good&lt;/strong&gt; | &lt;strong&gt;Pricing&lt;/strong&gt; |&lt;br&gt;
| &lt;strong&gt;AWS Systems Manager – Session Manager&lt;/strong&gt; | Provides easy access to any compute resource from the AWS Console or the AWS CLI with federation/AWS IAM Identity Center support.  &lt;/p&gt;

&lt;p&gt;Supports port forwarding and local GUI clients.  &lt;/p&gt;

&lt;p&gt;Full integration with Cloudtrail and CloudWatch Logs for auditing and activity tracking | All prerequisites for SSM Managed Instances may be seen as a hurdle, but can be solved with IaC. | No additional charges for accessing Amazon EC2 instances.  &lt;/p&gt;

&lt;p&gt;The advanced on-premises instance tier is required for using Session Manager to interactively access on-premises instances.  |&lt;br&gt;
| &lt;strong&gt;AWS CloudShell VPC Environments&lt;/strong&gt; | Available anywhere in the AWS Console.  &lt;/p&gt;

&lt;p&gt;Does not require extensive configuration.  &lt;/p&gt;

&lt;p&gt;Easy to use for quick commands or lookups.  &lt;/p&gt;

&lt;p&gt;Ephemeral storage, auto-cleanup after session inactivity.&lt;br&gt;&lt;br&gt;
 | Ephemeral storage, auto-cleanup after session inactivity.  &lt;/p&gt;

&lt;p&gt;Due to possible session timeout issues consider other options for long running commands, database imports/exports etc.  &lt;/p&gt;

&lt;p&gt;Audit and activity logging capabilities not matching SSM Session Manager. | No additional charges, minimum fees nor commitments. You only pay for other AWS resources you use with CloudShell to create and run your applications. |&lt;br&gt;
| &lt;strong&gt;AWS Cloud9&lt;/strong&gt; | Appealing if you’re also working on code development (application/IaC).&lt;br&gt;&lt;br&gt;
Same IDE experience regardless of client device.  &lt;/p&gt;

&lt;p&gt;Terraform/Cloudformation deployments can be triggered triggered from the same terminal.  &lt;/p&gt;

&lt;p&gt;Data is persisted on EBS until environment termination. | Session/permission management can be cumbersome.  &lt;/p&gt;

&lt;p&gt;Audit and activity logging capabilities not matching SSM Session Manager. | No additional charges, minimum fees nor commitments for the service itself. You pay only for the compute (EC2) and storage resources (EBS) for the environment.   &lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;&lt;br&gt;
t2.micro Linux instance at $0.0116/hour x 90 total hours used per month = $1.05&lt;br&gt;&lt;br&gt;
$0.10 per GB-month of provisioned storage x 10-GB storage volume = $1.00  &lt;/p&gt;

&lt;p&gt;Total monthly fees: $2.05 |&lt;/p&gt;

&lt;p&gt;My personal preference is to pivot to immutability with containers and automation, but AWS Systems Manager – Session Manager would be the most viable alternative workflow for any remaining EC2 based workloads.&lt;/p&gt;

&lt;p&gt;Instead of querying the database directly for troubleshooting or support requests, &lt;a href="https://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/sec_protect_data_rest_use_people_away.html" rel="noopener noreferrer"&gt;develop a support microservice or dashboard for the purpose&lt;/a&gt;. It could be as simple as a SELECT * from the relevant tables that returns the data in JSON format, with standard employee authentication and authorization mechanisms built in.&lt;/p&gt;

&lt;p&gt;Full Terraform sample code is available at &lt;a href="https://github.com/haakond/terraform-aws-bastion-host-alternatives" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-bastion-host-alternatives&lt;/a&gt;, feel free to grab anything you need.&lt;/p&gt;
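
&lt;p&gt;As an illustration of that idea, here is a minimal, hypothetical sketch of such a read-only support endpoint in Python. It uses SQLite purely as a stand-in for the Aurora MySQL cluster; a real implementation would use a MySQL driver against the writer endpoint, with your standard employee authentication and authorization in front:&lt;/p&gt;

```python
import json
import sqlite3

# Hypothetical allow-list of tables support staff may read.
ALLOWED_TABLES = {"orders"}

def support_snapshot(conn, table):
    """Return all rows of an allow-listed table as a list of dicts."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {table}")
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [dict(row) for row in rows]

def handler(event, context=None):
    """Lambda-style entry point returning the relevant data as JSON."""
    # In-memory stand-in database; a real service would connect to RDS.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "shipped"), (2, "pending")])
    body = {"orders": support_snapshot(conn, event.get("table", "orders"))}
    return {"statusCode": 200, "body": json.dumps(body)}
```

&lt;p&gt;Support staff would call this behind the company’s standard SSO instead of ever opening a database session.&lt;/p&gt;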

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/sec_protect_data_rest_use_people_away.html" rel="noopener noreferrer"&gt;SEC08-BP05 Use mechanisms to keep people away from data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager Session Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager Patch Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/run-command.html" rel="noopener noreferrer"&gt;AWS Systems Manager Run Command&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with.html" rel="noopener noreferrer"&gt;Working with Session Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cloudshell/latest/userguide/using-cshell-in-vpc.html" rel="noopener noreferrer"&gt;AWS CloudShell – Using AWS CloudShell in Amazon VPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html" rel="noopener noreferrer"&gt;AWS Cloud9 User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/bliki/ImmutableServer.html" rel="noopener noreferrer"&gt;Martin Fowler – Immutable server&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2024/07/03/bye-bye-bastion/" rel="noopener noreferrer"&gt;Bye bye Bastion!&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>bastion</category>
      <category>security</category>
      <category>aws</category>
    </item>
    <item>
      <title>Never miss an alert with AWS Chatbot and AWS SSM – Incident Manager</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Fri, 31 May 2024 07:40:23 +0000</pubDate>
      <link>https://forem.com/haakoned/never-miss-an-alert-with-aws-chatbot-and-aws-ssm-incident-manager-46da</link>
      <guid>https://forem.com/haakoned/never-miss-an-alert-with-aws-chatbot-and-aws-ssm-incident-manager-46da</guid>
      <description>&lt;p&gt;&lt;strong&gt;Table of contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
AWS Chatbot

&lt;ul&gt;
&lt;li&gt;AWS Chatbot Pricing&lt;/li&gt;
&lt;li&gt;Setting up AWS Chatbot for Slack&lt;/li&gt;
&lt;li&gt;Step 1: Add the AWS Chatbot app in Slack Automations to your desired Slack workspace&lt;/li&gt;
&lt;li&gt;
Step 2: Configure a Slack channel by inviting the AWS Chatbot app to your desired Slack channel

&lt;ul&gt;
&lt;li&gt;Manual procedure&lt;/li&gt;
&lt;li&gt;Automated deployment with Terraform&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Step 3: Test notifications&lt;/li&gt;

&lt;li&gt;Sending CloudWatch alarms to Slack&lt;/li&gt;

&lt;li&gt;Sending AWS Health notifications and Security Hub findings to Slack&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

AWS Systems Manager – Incident Manager

&lt;ul&gt;
&lt;li&gt;Response plans and incident severity classification&lt;/li&gt;
&lt;li&gt;Notification deduplication&lt;/li&gt;
&lt;li&gt;Setting up AWS Systems Manager – Incident Manager&lt;/li&gt;
&lt;li&gt;Configure AWS Systems Manager for your applicable regions with region failover replication set&lt;/li&gt;
&lt;li&gt;Configure contacts for your on-call team&lt;/li&gt;
&lt;li&gt;Configure response plan – Example for type Critical Incident Process&lt;/li&gt;
&lt;li&gt;Configure on-call schedule&lt;/li&gt;
&lt;li&gt;Configure Systems Manager Runbook to be executed for the Critical Incident Process&lt;/li&gt;
&lt;li&gt;Deploying AWS Systems Manager – Incident Manager with Terraform&lt;/li&gt;
&lt;li&gt;Using AWS SSM Incident Manager&lt;/li&gt;
&lt;li&gt;Activate contact channels&lt;/li&gt;
&lt;li&gt;On-call schedule – calendar overview&lt;/li&gt;
&lt;li&gt;Fire drill&lt;/li&gt;
&lt;li&gt;Incident Manager Pricing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Having a clear understanding of operational service level indicators like service latency and availability is paramount to ensure you can deliver the expected quality of service to your end users, customers and your company. By expected I mean exactly that: not less, not far above, but right on target.&lt;/p&gt;

&lt;p&gt;Lower quality of service can manifest as unhappy customers and end users. Higher quality of service can be seen as a good thing, but too high a level can lead to over-engineering, increased complexity, higher cost of over-provisioned capacity and not daring to innovate. Identifying the right level based on conversations with customer stakeholders, and aligning expectations through Service Level Agreements and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-ServiceLevelObjectives.html#CloudWatch-ServiceLevelObjectives-concepts" rel="noopener noreferrer"&gt;Service Level Objectives&lt;/a&gt;, can ensure all involved parties share the same understanding of where the bar is. If a stakeholder says “the website must be up and running at all times”, people can have different understandings of what this means in practice. Does the stakeholder mean 99.99%? Or even 100%? Or is 43 minutes of downtime over the course of 30 days acceptable? This can lead to very interesting conversations based on operational metrics data instead of subjective opinion.&lt;/p&gt;

&lt;p&gt;The first pillar in the AWS Well-Architected Framework is called Operational Excellence, and these are a few of the best practice recommendations that can get you very far, in addition to the Reliability Pillar.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_observability_identify_kpis.html" rel="noopener noreferrer"&gt;OPS04-BP01 Identify key performance indicators&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Service Level Indicators such as availability and/or latency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_observability_application_telemetry.html" rel="noopener noreferrer"&gt;OPS04-BP02 Implement application telemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_observability_customer_telemetry.html" rel="noopener noreferrer"&gt;OPS04-BP03 Implement user experience telemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_event_response_event_incident_problem_process.html" rel="noopener noreferrer"&gt;OPS10-BP01 Use a process for event, incident, and problem management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_event_response_prioritize_events.html" rel="noopener noreferrer"&gt;OPS10-BP03 Prioritize operational events based on business impact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_workload_observability_create_alerts.html" rel="noopener noreferrer"&gt;OPS08-BP04 Create actionable alerts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_event_response_define_escalation_paths.html" rel="noopener noreferrer"&gt;OPS10-BP04 Define escalation paths&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_event_response_dashboards.html" rel="noopener noreferrer"&gt;OPS10-BP06 Communicate status through dashboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/rel-13.html" rel="noopener noreferrer"&gt;REL 13. How do you plan for disaster recovery (DR)?&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider that you have deployed a set of workloads, defined &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_observability_identify_kpis.html" rel="noopener noreferrer"&gt;KPIs&lt;/a&gt; for the five most important end-user workflows in the systems, configured CloudWatch alarms, and defined &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_ready_to_support_use_playbooks.html" rel="noopener noreferrer"&gt;playbooks&lt;/a&gt; in an &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_evolve_ops_knowledge_management.html" rel="noopener noreferrer"&gt;operational wiki&lt;/a&gt; describing how to handle the events.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is your alerting strategy? Many start with CloudWatch Alarms =&amp;gt; Simple Notification Service (SNS) =&amp;gt; email/Slack, but what happens when multiple alarms trigger at once? How do you ensure the right personnel are notified so that critical alerts aren’t missed, and how do they know which one(s) to prioritize first?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog post we will explore one possible solution, which leverages AWS Systems Manager – Incident Manager to efficiently manage situations where a workload has become unavailable or is severely impacted. We will also look into how AWS Chatbot can increase insight and visibility into operational metrics and events by getting data out of the AWS Console and into Slack, with examples for events from AWS Health, AWS Security Hub and CloudWatch alarms for a container-based workload. I will also demonstrate how you can provision the solution with Terraform.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Chatbot
&lt;/h2&gt;

&lt;p&gt;This is a managed service from AWS which &lt;em&gt;enables ChatOps for AWS&lt;/em&gt;. Operational tasks and visibility can be shifted from the AWS Console to Amazon Chime, Microsoft Teams and Slack. You can receive notifications for operational alarms, security alarms, budget deviations &lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/related-services.html" rel="noopener noreferrer"&gt;and so on&lt;/a&gt;. The service eliminates the need for self-managed AWS Lambda functions for these types of integrations, and if your organization is using Slack or Microsoft Teams, I highly recommend checking whether you can replace any custom integration logic with AWS Chatbot.&lt;/p&gt;

&lt;p&gt;Another useful aspect of AWS Chatbot is the possibility to search and discover AWS information and ask service questions to Amazon Q, without needing to dig through official documentation sources or search the internet. The answers are visible to your team so that everyone is kept in the loop. You can also ask Q in VS Code if you have a topic you prefer to keep to yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Chatbot Pricing
&lt;/h3&gt;

&lt;p&gt;AWS Chatbot is &lt;a href="https://aws.amazon.com/chatbot/pricing/" rel="noopener noreferrer"&gt;free to use&lt;/a&gt;. You only pay for the underlying services (SNS, CloudWatch, GuardDuty, EventBridge, Security Hub), just as you would when calling them via the CLI or Console. Slack/Microsoft Teams licensing applies as relevant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up AWS Chatbot for Slack
&lt;/h3&gt;

&lt;p&gt;AWS Chatbot uses Amazon Simple Notification Service (SNS) topics to send event and alarm notifications from AWS services to the chat channels you configure. The initial configuration of the service, by authorizing Slack, is only possible through the AWS Console, but the remainder of the solution can be provisioned with a combination of Terraform providers for AWS. I will take you through the relevant code snippets. Full sample code is available on my GitHub page and is further explained in the Conclusion section.&lt;/p&gt;

&lt;p&gt;Different organizations have different operational models, depending on whether they run centralized platform services or a more distributed DevOps style.&lt;/p&gt;

&lt;p&gt;If you prefer to centralize the configuration, you can do so in one AWS account and have other workload accounts publish events to the central SNS topic. However, after some experience with AWS Chatbot, and with recent resource support in the AWS Cloud Control provider for Terraform, I have landed on deploying AWS Chatbot in each production workload account: this can be automated to a high degree, and it reduces the complexity of resource access across accounts. One example is that AWS Chatbot can automatically include CloudWatch metric graphs and other useful information.&lt;/p&gt;

&lt;p&gt;Each DevOps team can then take full responsibility for their workloads by sending CloudWatch Alarms and EventBridge events from services like AWS Health, AWS Security Hub and Amazon GuardDuty to a team specific Slack/Microsoft Teams channel. A centralized platform team could do the same for aggregated insights for Landing Zone governance as a safety net.&lt;/p&gt;

&lt;p&gt;In this case, &lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/slack-setup.html" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; will be demonstrated. After the Terraform resources have been provisioned, take note of the output sns_topic_for_aws_chatbot_arn; it will be referenced in later configuration.&lt;/p&gt;
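&lt;p&gt;As a minimal sketch, such an output declaration could look like the following (assuming the SNS topic resource name used in the module further down; adjust to your own naming):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "sns_topic_for_aws_chatbot_arn" {
  description = "ARN of the SNS topic configured for AWS Chatbot in the primary region"
  value       = aws_sns_topic.sns_topic_for_aws_chatbot_primary_region.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;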

&lt;h4&gt;
  
  
  Step 1: Add the AWS Chatbot app in Slack Automations to your desired Slack workspace
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;As a Slack workspace administrator, add AWS Chatbot to the Slack workspace.&lt;/li&gt;
&lt;li&gt;Log in to the AWS Console in your respective workload account&lt;/li&gt;
&lt;li&gt;Go to AWS Chatbot console&lt;/li&gt;
&lt;li&gt;Configure a chat client&lt;/li&gt;
&lt;li&gt;Choose Slack, Configure&lt;/li&gt;
&lt;li&gt;Choose the Slack workspace you prefer to use&lt;/li&gt;
&lt;li&gt;Allow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxiw26d1keqxnm9ttply.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxiw26d1keqxnm9ttply.png" width="662" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Official reference: &lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/slack-setup.html#slack-client-setup" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/chatbot/latest/adminguide/slack-setup.html#slack-client-setup&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Configure a Slack channel by inviting the AWS Chatbot app to your desired Slack channel
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Manual procedure
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83oscx8gm9oqsanp774u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83oscx8gm9oqsanp774u.png" width="447" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case I name the Configuration “hed-aws-monitoring” and configure event logging to Amazon CloudWatch Logs, both to verify that the setup is working as expected and to aid troubleshooting. AWS Chatbot creates this CloudWatch Logs log group in us-east-1 as part of the provisioning phase; it is &lt;em&gt;not&lt;/em&gt; possible (as of June 2024) to configure an existing CloudWatch Logs log group you may already have provisioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifwj92z54alovglhwev0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifwj92z54alovglhwev0.png" width="800" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Role settings I choose to let AWS Chatbot generate the desired IAM configuration. You can also define this yourself. One approach is to start with the AWS Chatbot-generated resources and then replace them with your own Terraform resource definitions afterwards.&lt;/p&gt;

&lt;p&gt;I choose the Notification, Incident Manager, Resource Explorer and Amazon Q permissions. I do not expect to run read-only commands, invoke Lambda functions or call AWS Support commands directly from Slack, so I leave these unchecked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl93v37fvit2uc9yi6wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl93v37fvit2uc9yi6wb.png" width="800" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Slack channel is now configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7djl0ywpunj2we1gr8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7djl0ywpunj2we1gr8r.png" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to add support for notifications from global services that produce CloudWatch metrics in us-east-1, such as Route 53 Health Checks, you can configure an additional SNS topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs6br0bdx94ol36fvdyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs6br0bdx94ol36fvdyv.png" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Automated deployment with Terraform
&lt;/h5&gt;

&lt;p&gt;A fully working Terraform module is available at &lt;a href="https://github.com/haakond/terraform-aws-chatbot/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-chatbot/blob/main/README.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AWS Chatbot channel configuration for Slack is only supported in the AWS Cloud Control Provider for Terraform, so first we start by declaring the necessary providers:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "~&amp;gt; 5.53.0"
      configuration_aliases = [aws, aws.us-east-1]
    }
    awscc = {
      source = "hashicorp/awscc"
      version = "~&amp;gt; 1.2.0"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first resource is the Slack channel configuration, with the relevant input variables for the Slack channel ID, the Slack workspace ID, and the SNS topics for the primary region and for us-east-1 (global service endpoints).&lt;/p&gt;

&lt;p&gt;We then declare the IAM role for this purpose, with the relevant managed policies to also be able to manage Incident Manager incidents and Security Hub findings directly from Slack. Feel free to adjust this to your use case. Lastly, the relevant SNS topics are configured in each region.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "awscc_chatbot_slack_channel_configuration" "chatbot_slack" {
  configuration_name = var.slack_channel_configuration_name
  iam_role_arn = awscc_iam_role.chatbot_channel_role.arn
  slack_channel_id = var.slack_channel_id
  slack_workspace_id = var.slack_workspace_id
  logging_level = var.logging_level
  sns_topic_arns = [aws_sns_topic.sns_topic_for_aws_chatbot_primary_region.arn, aws_sns_topic.sns_topic_for_aws_chatbot_us_east_1.arn]
  guardrail_policies = [
    "arn:aws:iam::aws:policy/PowerUserAccess"
  ]
}

resource "awscc_iam_role" "chatbot_channel_role" {
  role_name = "aws-chatbot-channel-role"
  assume_role_policy_document = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid = "Chatbot"
        Principal = {
          Service = "chatbot.amazonaws.com"
        }
      },
    ]
  })
  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AWSResourceExplorerReadOnlyAccess",
    "arn:aws:iam::aws:policy/AWSIncidentManagerResolverAccess",
    "arn:aws:iam::aws:policy/AmazonQFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess",
    "arn:aws:iam::aws:policy/AWSSecurityHubFullAccess",
    "arn:aws:iam::aws:policy/AWSSupportAccess"
  ]
}

resource "aws_sns_topic" "sns_topic_for_aws_chatbot_primary_region" {
  #checkov:skip=CKV_AWS_26:
  name = "aws-chatbot-notifications"
  http_success_feedback_role_arn = aws_iam_role.delivery_status_logging_for_sns_topic.arn
  http_failure_feedback_role_arn = aws_iam_role.delivery_status_logging_for_sns_topic.arn
  tags = {
    Name = "aws_chatbot_notifications"
    Service = "monitoring"
  }
}

resource "aws_sns_topic" "sns_topic_for_aws_chatbot_us_east_1" {
  #checkov:skip=CKV_AWS_26:
  provider = aws.us-east-1
  name = "aws-chatbot-notifications"
  http_success_feedback_role_arn = aws_iam_role.delivery_status_logging_for_sns_topic.arn
  http_failure_feedback_role_arn = aws_iam_role.delivery_status_logging_for_sns_topic.arn
  tags = {
    Name = "aws_chatbot_notifications"
    Service = "monitoring"
  }
}

# Define SNS topic policy primary region
resource "aws_sns_topic_policy" "sns_topic_policy_for_aws_chatbot_primary_region" {
  arn = aws_sns_topic.sns_topic_for_aws_chatbot_primary_region.arn
  policy = data.aws_iam_policy_document.sns_topic_policy_for_aws_chatbot_primary_region.json
}

# Define SNS topic policy for us-east-1
resource "aws_sns_topic_policy" "sns_topic_policy_for_aws_chatbot_us_east_1" {
  provider = aws.us-east-1
  arn = aws_sns_topic.sns_topic_for_aws_chatbot_us_east_1.arn
  policy = data.aws_iam_policy_document.sns_topic_policy_for_aws_chatbot_us_east_1.json
}

# IAM role for delivery_status_logging_for_sns_topic
resource "aws_iam_role" "delivery_status_logging_for_sns_topic" {
  name = "aws-chatbot-delivery-status-logging"
  assume_role_policy = data.aws_iam_policy_document.sns_to_cw_logs_assume_role_policy.json
}

resource "aws_iam_policy" "delivery_status_logging_for_sns_topic_policy" {
  policy = data.aws_iam_policy_document.sns_to_cw_logs_policy.json
}

resource "aws_iam_role_policy_attachment" "delivery_status_logging_for_sns_topic_attachment" {
  role = aws_iam_role.delivery_status_logging_for_sns_topic.name
  policy_arn = aws_iam_policy.delivery_status_logging_for_sns_topic_policy.arn
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data resources in data.tf for IAM policies etc. are intentionally left out of this blog post, but can be viewed at &lt;a href="https://github.com/haakond/terraform-aws-chatbot/blob/main/data.tf" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-chatbot/blob/main/data.tf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To deploy this module in your workload account include the following snippets, as explained in &lt;a href="https://github.com/haakond/terraform-aws-chatbot/blob/main/examples/main.tf" rel="noopener noreferrer"&gt;examples/main.tf&lt;/a&gt; and &lt;a href="https://github.com/haakond/terraform-aws-chatbot/blob/main/examples/provider.tf" rel="noopener noreferrer"&gt;examples/provider.tf&lt;/a&gt;.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "aws_chatbot_slack" {
  source = "git::https://github.com/haakond/terraform-aws-chatbot.git"
  providers = {
    aws = aws
    aws.us-east-1 = aws.us-east-1
    awscc = awscc
  }
  slack_channel_configuration_name = "slack-hed-aws-monitoring"
  slack_channel_id = "AABBCC001122DD88"
  slack_workspace_id = "ABCD1234EFGH5678"
  logging_level = "INFO"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = var.aws_region
  profile = var.profile_cicd
  assume_role {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}

provider "awscc" {
  region = var.aws_region
  profile = var.profile_cicd
  assume_role = {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}

provider "aws" {
  alias = "us-east-1"
  region = "us-east-1"
  profile = var.profile_cicd
  assume_role {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Test notifications
&lt;/h4&gt;

&lt;p&gt;In the Configure channels overview, select the applicable channel and click Send test message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmyp7ein9y59yxmu81xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmyp7ein9y59yxmu81xl.png" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expected result: Two messages, one for each SNS topic in us-east-1 and eu-central-1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfboe0ge0egixq48yimj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfboe0ge0egixq48yimj.png" width="676" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ask Amazon Q via AWS Chatbot about relevant CloudWatch metrics for ECS Fargate container services and get some helpful links in return.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4jg6s2tdifrjf9m1pbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4jg6s2tdifrjf9m1pbr.png" width="692" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sending CloudWatch alarms to Slack
&lt;/h4&gt;

&lt;p&gt;You can set up notifications to Slack for any CloudWatch metric and alarm that you care about: CPU and memory utilization, free disk space, database query latency and so on. I set up a Route 53 Health Check to monitor my blog’s availability, which ensures I am notified regardless of the actual root cause.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define an additional provider for the us-east-1 region.
provider "aws" {
  alias = "us-east-1"
  region = "us-east-1"
  profile = var.profile_cicd
  assume_role {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}

# AWS Route 53 Health Checks and corresponding metrics reside in us-east-1.
resource "aws_route53_health_check" "hedrange_com_about" {
  provider = aws.us-east-1
  fqdn = "hedrange.com"
  port = 443
  type = "HTTPS"
  resource_path = "/about/"
  failure_threshold = "3"
  request_interval = "30"
  measure_latency = true
  invert_healthcheck = false
  regions = ["us-east-1", "us-west-1", "eu-west-1"]
  tags = {
    Name = "health-check-blog"
  }
}

# AWS Route 53 Health Checks and corresponding metrics reside in us-east-1, so the alarm has to be provisioned there as well.
resource "aws_cloudwatch_metric_alarm" "healthcheck_hedrange_com_about" {
  provider = aws.us-east-1
  alarm_name = "health-check-blog-about"
  comparison_operator = "LessThanThreshold"
  evaluation_periods = "3"
  metric_name = "HealthCheckStatus"
  namespace = "AWS/Route53"
  period = "60"
  statistic = "Minimum"
  threshold = "1"
  alarm_description = "CRITICAL - https://hedrange.com/about is unavailable!"
  dimensions = {
    HealthCheckId = aws_route53_health_check.hedrange_com_about.id
  }
  alarm_actions = [local.chatbot_sns_topic_arn_us_east_1]
  ok_actions = [local.chatbot_sns_topic_arn_us_east_1]
  tags = {
    Name = "alarm-health-check-blog",
    Severity = "CRITICAL"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test the notification let’s amend &lt;code&gt;invert_healthcheck = true&lt;/code&gt; and re-provision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejjevtgvtv65yyg9jfnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejjevtgvtv65yyg9jfnh.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CloudWatch alarm changed state from OK to In alarm, sent a notification to the SNS topic in us-east-1 configured for AWS Chatbot which then dispatched the following message to Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcpvpa6jbl6zznqhumn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcpvpa6jbl6zznqhumn9.png" width="785" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recovery notification:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28ib91ta614fcqyd3mq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28ib91ta614fcqyd3mq.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sending AWS Health notifications and Security Hub findings to Slack
&lt;/h4&gt;

&lt;p&gt;To ensure we are also notified about important AWS Health events and Security Hub findings, we can set up EventBridge rules accordingly.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Eventbridge rule with event pattern to catch high severity Security Hub findings, regardless of product.
resource "aws_cloudwatch_event_rule" "main_securityhub_event_rule" {
  name = "aws-securityhub-rule"
  description = "Capture AWS Security Hub events"

  event_pattern = &amp;lt;&amp;lt;EOF
{
    "source": [
        "aws.securityhub"
     ],
    "detail-type": [
        "Security Hub Findings - Imported"
    ],
    "detail": {
        "findings": {
            "Severity": {
                "Label": ["CRITICAL", "HIGH"]
            }
        }
    }
}
EOF
}

resource "aws_cloudwatch_event_target" "main_securityhub_rule_target_sns_topic_for_aws_chatbot" {
  rule = aws_cloudwatch_event_rule.main_securityhub_event_rule.name
  target_id = "SendToSNS"
  arn = aws_sns_topic.sns_topic_for_aws_chatbot.arn
}

# Eventbridge rule with event pattern to AWS Health notifications
resource "aws_cloudwatch_event_rule" "health_event_rule" {
  name = "aws-health-rule"
  description = "Capture AWS Health events"

  event_pattern = &amp;lt;&amp;lt;EOF
{
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"]
}
EOF
}

# EventBridge rule target to the SNS topic for AWS Chatbot
resource "aws_cloudwatch_event_target" "health_event_rule_target_sns_topic_for_aws_chatbot" {
  rule = aws_cloudwatch_event_rule.health_event_rule.name
  target_id = "SendToSNS"
  arn = aws_sns_topic.sns_topic_for_aws_chatbot.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what a Slack message looks like for a High severity finding in Security Hub about IAM Access Analyzer not being enabled in region eu-north-1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtd0gpuykfraji8iew4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtd0gpuykfraji8iew4.png" width="710" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what a notification from AWS Health looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07qt7970jcm7n0sopts8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07qt7970jcm7n0sopts8.png" width="776" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Systems Manager – Incident Manager
&lt;/h2&gt;

&lt;p&gt;Incident Manager is a service that has flown a bit under the radar. Teams adopting this service directly in their AWS environments should be able to reduce their Recovery Time Objective and the consequences outages may have for customer applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html" rel="noopener noreferrer"&gt;AWS explains that&lt;/a&gt; Incident Manager helps reduce the time to resolve incidents by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing automated plans for efficiently engaging the people responsible for responding to the incidents.&lt;/li&gt;
&lt;li&gt;Providing relevant troubleshooting data.&lt;/li&gt;
&lt;li&gt;Enabling automated response actions by using predefined Automation runbooks.&lt;/li&gt;
&lt;li&gt;Providing methods to collaborate and communicate with all stakeholders.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Response plans and incident severity classification
&lt;/h3&gt;

&lt;p&gt;Events on the AWS platform can trigger Incidents using pre-defined Response Plans to get the attention of first responders, to quickly start troubleshooting while communicating efficiently with the AWS Chatbot integration for Slack and Microsoft Teams.&lt;/p&gt;

&lt;p&gt;Based on impact and scope, alarms and notifications can be differentiated by urgency, escalation and resolution procedures.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Impact code&lt;/th&gt;
&lt;th&gt;Impact name&lt;/th&gt;
&lt;th&gt;Sample defined scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Critical&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full application failure that impacts most customers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;High&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full application failure that impacts a subset of customers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Partial application failure that is customer-impacting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Low&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Intermittent failures that have limited impact on customers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;No Impact&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Customers aren’t currently impacted but urgent action is needed to avoid impact.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Example triage table ref. &lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-lifecycle.html#triage" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-lifecycle.html#triage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
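&lt;p&gt;As a sketch of how a response plan tied to such an impact level could be expressed with the awscc provider: the snippet below uses the awscc_ssmincidents_response_plan resource, and the values (name, title, dedupe string) are illustrative assumptions, while the SNS topic and contact references point at resources defined elsewhere in this post. Verify the attribute schema against the awscc provider documentation for your version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "awscc_ssmincidents_response_plan" "critical" {
  name = "critical-workload-outage"

  incident_template = {
    title  = "Critical workload outage"
    # Impact 1 = Critical, ref. the triage table above.
    impact = 1
    # Incidents created from the same source and dedupe string are grouped
    # into one incident instead of flooding responders with duplicates.
    dedupe_string = "critical-workload-outage"
  }

  # Post incident updates to Slack via the SNS topic configured for AWS Chatbot.
  chat_channel = {
    chatbot_sns = [aws_sns_topic.sns_topic_for_aws_chatbot_primary_region.arn]
  }

  # Engage the primary on-call contact defined further below.
  engagements = [aws_ssmcontacts_contact.primary_contact.arn]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;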

&lt;h3&gt;
  
  
  Notification deduplication
&lt;/h3&gt;

&lt;p&gt;A key feature of AWS SSM Incident Manager is the incident &lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html" rel="noopener noreferrer"&gt;deduplication feature&lt;/a&gt;, which groups similar notifications together, as opposed to direct notifications via CloudWatch =&amp;gt; SNS =&amp;gt; Slack. Incident Manager automatically deduplicates multiple incidents created by the same Amazon CloudWatch alarm or Amazon EventBridge event. This can reduce alert fatigue and ensure critical notifications aren’t missed.&lt;/p&gt;

&lt;p&gt;The purpose of this blog post is not to deep dive into the service itself, but to demonstrate how deployment can be automated with Terraform. For more information read &lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-lifecycle.html" rel="noopener noreferrer"&gt;The incident lifecycle in Incident Manager&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlj6b9tp5kjhhi2h42vx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlj6b9tp5kjhhi2h42vx.png" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram courtesy of AWS. Source: &lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-lifecycle.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-lifecycle.html&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Setting up AWS Systems Manager – Incident Manager
&lt;/h3&gt;

&lt;p&gt;At first glance, most of the configuration options were not available in the official &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ssmincidents_replication_set" rel="noopener noreferrer"&gt;aws&lt;/a&gt; Terraform provider, so many steps had to be configured manually. However, after further searching I realized that the AWS Cloud Control Terraform provider, &lt;a href="https://registry.terraform.io/providers/hashicorp/awscc/latest/docs/" rel="noopener noreferrer"&gt;awscc&lt;/a&gt;, supports the resources in question, so I succeeded in defining a fully working Terraform module for this purpose. I have only parameterized the most relevant values; there are many configuration options to tweak, so I did not make a fully customizable module at this point in time. One approach could be to start here to become familiar, and then optimize later on.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure AWS Systems Manager for your applicable regions with region failover replication set
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssmincidents_replication_set" "default" {
  region {
    name = local.current_region
  }
  region {
    name = var.replication_set_fallback_region
  }

  tags = {
    Name = "default"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure contacts for your on-call team
&lt;/h4&gt;

&lt;p&gt;I have one primary contact for myself with contact methods email, SMS and voice/phone.&lt;/p&gt;

&lt;p&gt;Do note that the resource &lt;a href="https://registry.terraform.io/providers/hashicorp/awscc/latest/docs/resources/ssmcontacts_contact" rel="noopener noreferrer"&gt;awscc_ssmcontacts_contact&lt;/a&gt; is based on the &lt;a href="https://github.com/hashicorp/terraform-provider-awscc" rel="noopener noreferrer"&gt;AWS Cloud Control provider&lt;/a&gt;. It’s not that intuitive that this configuration results in an on-call schedule, so I spent quite some time figuring this out.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssmcontacts_contact" "primary_contact" {
  alias = var.primary_contact_alias
  display_name = var.primary_contact_display_name
  type = "PERSONAL"

  tags = {
    key = "primary-contact"
  }
  depends_on = [aws_ssmincidents_replication_set.default]
}

resource "aws_ssmcontacts_contact_channel" "primary_contact_email" {
  contact_id = aws_ssmcontacts_contact.primary_contact.arn

  delivery_address {
    simple_address = var.primary_contact_email_address
  }

  name = "primary-contact-email"
  type = "EMAIL"
}

resource "aws_ssmcontacts_contact_channel" "primary_contact_sms" {
  contact_id = aws_ssmcontacts_contact.primary_contact.arn

  delivery_address {
    simple_address = var.primary_contact_phone_number
  }

  name = "primary-contact-sms"
  type = "SMS"
}

resource "aws_ssmcontacts_contact_channel" "primary_contact_voice" {
  contact_id = aws_ssmcontacts_contact.primary_contact.arn

  delivery_address {
    simple_address = var.primary_contact_phone_number
  }

  name = "primary-contact-voice"
  type = "VOICE"
}

resource "aws_ssmcontacts_plan" "primary_contact" {
  contact_id = aws_ssmcontacts_contact.primary_contact.arn

  stage {
    duration_in_minutes = 1

    target {
      channel_target_info {
        retry_interval_in_minutes = 5
        contact_channel_id = aws_ssmcontacts_contact_channel.primary_contact_email.arn
      }
    }
  }
  stage {
    duration_in_minutes = 5

    target {
      channel_target_info {
        retry_interval_in_minutes = 5
        contact_channel_id = aws_ssmcontacts_contact_channel.primary_contact_sms.arn
      }
    }
  }
  stage {
    duration_in_minutes = 10

    target {
      channel_target_info {
        retry_interval_in_minutes = 5
        contact_channel_id = aws_ssmcontacts_contact_channel.primary_contact_voice.arn
      }
    }
  }
}

resource "awscc_ssmcontacts_contact" "oncall_schedule" {

  alias = "default-schedule"
  display_name = "default-schedule"
  type = "ONCALL_SCHEDULE"
  plan = [{
    rotation_ids = [aws_ssmcontacts_rotation.business_hours.id]
  }]
  depends_on = [aws_ssmincidents_replication_set.default]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure response plan – Example for type Critical Incident Process
&lt;/h4&gt;

&lt;p&gt;You can have as many response plans as you like, but as of June 2024 each of them is priced at &lt;a href="https://aws.amazon.com/systems-manager/pricing/" rel="noopener noreferrer"&gt;$7 per month&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One consideration could be to have Critical and High response plans as a starting point and dispatch alarms accordingly.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssmincidents_response_plan" "critical_incident" {
  name = "CRITICAL-INCIDENT"

  incident_template {
    title = "CRITICAL-INCIDENT"
    impact = "1"
    incident_tags = {
      Name = "CRITICAL-INCIDENT"
    }

    summary = "Follow CRITICAL INCIDENT process."
  }

  display_name = "CRITICAL-INCIDENT"
  chat_channel = [var.chatbot_sns_topic_notification_arn]
  engagements = [awscc_ssmcontacts_contact.oncall_schedule.arn]

  action {
    ssm_automation {
      document_name = aws_ssm_document.critical_incident_runbook.arn
      role_arn = aws_iam_role.service_role_for_ssm_incident_manager.arn
      document_version = "$LATEST"
      target_account = "RESPONSE_PLAN_OWNER_ACCOUNT"
      parameter {
        name = "Environment"
        values = ["Production"]
      }
      dynamic_parameters = {
        resources = "INVOLVED_RESOURCES"
        incidentARN = "INCIDENT_RECORD_ARN"
      }
    }
  }

  tags = {
    Name = "critical-incident-response-plan"
  }

  depends_on = [aws_ssmincidents_replication_set.default]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
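&lt;p&gt;As a sketch of the Critical/High split mentioned above, a second response plan could mirror the critical one with a lower impact rating. The &lt;code&gt;HIGH-INCIDENT&lt;/code&gt; name and summary below are illustrative and not part of the module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative example: a separate response plan for HIGH severity,
# using impact "2" instead of "1". Attributes follow the same
# aws_ssmincidents_response_plan resource as the critical plan.
resource "aws_ssmincidents_response_plan" "high_incident" {
  name = "HIGH-INCIDENT"

  incident_template {
    title   = "HIGH-INCIDENT"
    impact  = "2"
    summary = "Follow HIGH INCIDENT process."
  }

  display_name = "HIGH-INCIDENT"
  chat_channel = [var.chatbot_sns_topic_notification_arn]
  engagements  = [awscc_ssmcontacts_contact.oncall_schedule.arn]

  depends_on = [aws_ssmincidents_replication_set.default]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;CloudWatch alarms can then dispatch to either plan depending on severity.&lt;/p&gt;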



&lt;h4&gt;
  
  
  Configure on-call schedule
&lt;/h4&gt;

&lt;p&gt;In this example the on-call rotation schedule consists of only the primary contact. In a full deployment an on-call team normally consists of several people on a weekly schedule, with escalation/fallback. Feel free to adjust to your use-case.&lt;/p&gt;

&lt;p&gt;Each shift starts on Mondays at 09:00, and people are notified between 08:30 and 16:00, during business hours. You can choose any time period of the day, for instance one on-call schedule during business hours and another one outside of business hours.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssmcontacts_rotation" "business_hours" {
  contact_ids = [
    aws_ssmcontacts_contact.primary_contact.arn
  ]

  name = "business-hours"

  recurrence {
    number_of_on_calls = 1
    recurrence_multiplier = 1
    weekly_settings {
      day_of_week = "MON"
      hand_off_time {
        hour_of_day = 09
        minute_of_hour = 00
      }
    }

    weekly_settings {
      day_of_week = "FRI"
      hand_off_time {
        hour_of_day = 15
        minute_of_hour = 55
      }
    }

    shift_coverages {
      map_block_key = "MON"
      coverage_times {
        start {
          hour_of_day = 08
          minute_of_hour = 30
        }
        end {
          hour_of_day = 16
          minute_of_hour = 00
        }
      }
    }
    shift_coverages {
      map_block_key = "TUE"
      coverage_times {
        start {
          hour_of_day = 08
          minute_of_hour = 30
        }
        end {
          hour_of_day = 16
          minute_of_hour = 00
        }
      }
    }
    shift_coverages {
      map_block_key = "WED"
      coverage_times {
        start {
          hour_of_day = 08
          minute_of_hour = 30
        }
        end {
          hour_of_day = 16
          minute_of_hour = 00
        }
      }
    }
    shift_coverages {
      map_block_key = "THU"
      coverage_times {
        start {
          hour_of_day = 08
          minute_of_hour = 30
        }
        end {
          hour_of_day = 16
          minute_of_hour = 00
        }
      }
    }
    shift_coverages {
      map_block_key = "FRI"
      coverage_times {
        start {
          hour_of_day = 08
          minute_of_hour = 30
        }
        end {
          hour_of_day = 16
          minute_of_hour = 00
        }
      }
    }
  }

  start_time = var.rotation_start_time
  time_zone_id = "Europe/Oslo"
  depends_on = [aws_ssmincidents_replication_set.default]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure Systems Manager Runbook to be executed for the Critical Incident Process
&lt;/h4&gt;

&lt;p&gt;The AWS Systems Manager Runbook AWSIncidents-CriticalIncidentRunbookTemplate can be used as a starting point for this use-case. For more information see &lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/runbooks.html" rel="noopener noreferrer"&gt;Working with Systems Manager Automation runbooks in Incident Manager&lt;/a&gt;.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ssm_document" "critical_incident_runbook" {
  name = "critical_incident_runbook"
  document_type = "Automation"
  document_format = "YAML"
  content = &amp;lt;&amp;lt;DOC
#
# Original source: AWSIncidents-CriticalIncidentRunbookTemplate
#
# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
---
description: "This document is intended as a template for an incident response runbook in [Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/index.html).\n\nFor optimal use, create your own automation document by copying the contents of this runbook template and customizing it for your scenario. Then, navigate to your [Response Plan](https://console.aws.amazon.com/systems-manager/incidents/response-plans/home) and associate it with your new automation document; your runbook is automatically started when an incident is created with the associated response plan. For more information, see [Incident Manager - Runbooks](https://docs.aws.amazon.com/incident-manager/latest/userguide/runbooks.html). \v\n\nSuggested customizations include:\n* Updating the text in each step to provide specific guidance and instructions, such as commands to run or links to relevant dashboards\n* Automating actions before triage or diagnosis to gather additional telemetry or diagnostics using aws:executeAwsApi\n* Automating actions in mitigation using aws:executeAutomation, aws:executeScript, or aws:invokeLambdaFunction\n"
schemaVersion: '0.3'
parameters:
  Environment:
    type: String
  incidentARN:
    type: String
  resources:
    type: String
mainSteps:
  - name: Triage
    action: 'aws:pause'
    inputs: {}
    description: |-
      **Determine customer impact**

      * View the **Metrics** tab of the incident or navigate to your [CloudWatch Dashboards](https://console.aws.amazon.com/cloudwatch/home#dashboards:) to find key performance indicators (KPIs) that show the extent of customer impact.
      * Use [CloudWatch Synthetics](https://console.aws.amazon.com/cloudwatch/home#synthetics:) and [Contributor Insights](https://console.aws.amazon.com/cloudwatch/home#contributorinsights:) to identify real-time failures in customer workflows.

      **Communicate customer impact**

      Update the following fields to accurately describe the incident:
      * **Title** - The title should be quickly recognizable by the team and specific to the particular incident.
      * **Summary** - The summary should contain the most important and up-to-date information to quickly onboard new responders to the incident.
      * **Impact** - Select one of the following impact ratings to describe the incident:
        * 1 – Critical impact, full application failure that impacts many to all customers.
        * 2 – High impact, partial application failure with impact to many customers.
        * 3 – Medium impact, the application is providing reduced service to many customers.
        * 4 – Low impact, the application is providing reduced service to few customers.
        * 5 – No impact, customers are not currently impacted but urgent action is needed to avoid impact.
  - name: Diagnosis
    action: 'aws:pause'
    inputs: {}
    description: |
      **Rollback**

      * Look for recent changes to the production environment that might have caused the incident. Engage the responsible team using the **Contacts** tab of the incident.
      * Rollback these changes if possible.

      **Locate failures**
      * Review metrics and alarms related to your [Application](https://console.aws.amazon.com/systems-manager/appmanager/applications). Add any related metrics and alarms to the **Metrics** tab of the incident.
      * Use [CloudWatch ServiceLens](https://console.aws.amazon.com/cloudwatch/home#servicelens:) to troubleshoot issues across multiple services.
      * Investigate the possibility of ongoing incidents across your organization. Check for known incidents and issues in AWS using [Personal Health Dashboard](https://console.aws.amazon.com/systems-manager/insights). Add related links to the **Related Items** tab of the incident.
      * Avoid going too deep in diagnosing the failure and focus on how to mitigate the customer impact. Update the **Timeline** tab of the incident when a possible diagnosis is identified.
  - name: Mitigation
    action: 'aws:pause'
    description: |-
      **Collaborate**
      * Communicate any changes or important information from the previous step to the members of the associated chat channel for this incident. Ask for input on possible ways to mitigate customer impact.
      * Engage additional contacts or teams using their escalation plan from the **Contacts** tab.
      * If necessary, prepare an emergency change request in [Change Manager](https://console.aws.amazon.com/systems-manager/change-manager).

      **Implement mitigation**
      * Consider re-routing customer traffic or throttling incoming requests to reduce customer impact.
      * Look for common runbooks in [Automation](https://console.aws.amazon.com/systems-manager/automation) or run commands using [Run Command](https://console.aws.amazon.com/systems-manager/run-command).
      * Update the **Timeline** tab of the incident when a possible mitigation is identified. If needed, review the mitigation with others in the associated chat channel before proceeding.
    inputs: {}
  - name: Recovery
    action: 'aws:pause'
    inputs: {}
    description: |-
      **Monitor customer impact**
      * View the **Metrics** tab of the incident to monitor for recovery of your key performance indicators (KPIs).
      * Update the **Impact** field in the incident when customer impact has been reduced or resolved.

      **Identify action items**
      * Add entries in the **Timeline** tab of the incident to record key decisions and actions taken, including temporary mitigations that might have been implemented.
      * Create a **Post-Incident Analysis** when the incident is closed in order to identify and track action items in [OpsCenter](https://console.aws.amazon.com/systems-manager/opsitems).

DOC
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploying AWS Systems Manager – Incident Manager with Terraform
&lt;/h4&gt;

&lt;p&gt;In addition to the snippets above, there are resource configurations for IAM and Terraform data sources. To keep the length of this blog post manageable, I only refer to them in the full module code repository.&lt;/p&gt;

&lt;p&gt;This is how I configure the module call in my workload provisioning pipeline. In a production environment in a DevOps model I would provision this module in every production workload account a DevOps team is responsible for, along with the applications and CloudWatch monitoring data. Feel free to adjust to how your organization is set up.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use aws and awscc providers to provision SSM Incident Manager
# Ref. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/using-aws-with-awscc-provider
# Refers to output from the previous module provision for AWS Chatbot. 
module "ssm_incident_manager" {
  source = "git::https://github.com/haakond/terraform-aws-ssm-incident-manager.git?ref=dev"
  providers = {
    aws = aws
    awscc = awscc
  }
  primary_contact_alias = "primary-contact"
  primary_contact_display_name = "Håkon Eriksen Drange"
  primary_contact_email_address = "alpha.bravo@charlie-company.com"
  primary_contact_phone_number = "+4799887766"
  chatbot_sns_topic_notification_arn = module.aws_chatbot_slack.chatbot_sns_topic_arn_primary_region
  rotation_start_time = "2024-06-24T07:00:00+00:00"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we had to use a combination of the aws and awscc Terraform providers, remember to include something similar to this in your provider.tf:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = var.aws_region
  profile = var.profile_cicd
  assume_role {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}

provider "awscc" {
  region = var.aws_region
  profile = var.profile_cicd
  assume_role = {
    role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.profile_cicd}"
    session_name = "SESSION_NAME"
    external_id = "EXTERNAL_ID"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
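&lt;p&gt;Since both providers are in play, the module’s provider requirements should also be declared, along these lines (the version constraints are illustrative; pin to what you have tested):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Declare both providers so Terraform pulls them from the registry.
# Version constraints are illustrative only.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    awscc = {
      source  = "hashicorp/awscc"
      version = "~> 1.0"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;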



&lt;p&gt;Module reference with example: &lt;a href="https://github.com/haakond/terraform-aws-ssm-incident-manager/blob/main/examples/main.tf" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-ssm-incident-manager/blob/main/examples/main.tf&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using AWS SSM Incident Manager
&lt;/h3&gt;

&lt;p&gt;Provisioning this module should yield the following results.&lt;/p&gt;

&lt;h4&gt;
  
  
  Activate contact channels
&lt;/h4&gt;

&lt;p&gt;The first step is to go to Contacts and activate the configured channels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqr45jf90z039g8yupc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqr45jf90z039g8yupc.png" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will receive a one-time passcode on each of the configured channels. Each of them needs to be activated before the contact is enabled.&lt;/p&gt;
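&lt;p&gt;If you prefer the CLI over the console, a channel can be activated with the one-time code, roughly as sketched below. The channel ARN and code are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Activate a contact channel with the one-time code received on it.
# Replace the ARN and code with your own values.
aws ssm-contacts activate-contact-channel \
  --contact-channel-id "arn:aws:ssm-contacts:eu-west-1:111122223333:contact-channel/primary-contact/example" \
  --activation-code "123456"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;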

&lt;h4&gt;
  
  
  On-call schedule – calendar overview
&lt;/h4&gt;

&lt;p&gt;The configured on-call schedule looks like this. The team will only be notified through the configured contact channels during business hours. I would also set up an out-of-business-hours schedule with the appropriate configuration.&lt;/p&gt;

&lt;p&gt;It’s also possible to create shift overrides, in case a team member is asked to cover for a sick colleague etc.&lt;/p&gt;
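&lt;p&gt;As a sketch, such an override can be created with the AWS CLI; the rotation ARN, contact ARN and timestamps below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Temporarily replace who is on call for a given time window.
# All identifiers and timestamps are placeholders.
aws ssm-contacts create-rotation-override \
  --rotation-id "arn:aws:ssm-contacts:eu-west-1:111122223333:rotation/business-hours" \
  --new-contact-ids "arn:aws:ssm-contacts:eu-west-1:111122223333:contact/backup-contact" \
  --start-time "2024-07-01T06:30:00Z" \
  --end-time "2024-07-01T14:00:00Z"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;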

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fika2tl5y550f95r5dmwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fika2tl5y550f95r5dmwz.png" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Fire drill
&lt;/h3&gt;

&lt;p&gt;I provisioned a temporary Amazon Linux 2 EC2 instance and set up a CloudWatch alarm to simulate unusually high CPU utilization over an extended period of time, where Auto Scaling was not able to provision enough capacity for the load spike. This can be any CloudWatch alarm. Route 53 Health Checks and CloudWatch Synthetics monitors are also good candidates.&lt;/p&gt;

&lt;p&gt;Alarm and OK actions are configured with the relevant SNS topics for AWS Chatbot provisioned in the previous module.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_cloudwatch_metric_alarm" "demo_ec2_instance_cpu_utilization" {
  alarm_name = "CRITICAL-EC2-CPU_i-04f7af9ea5374c4d8"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "3"
  metric_name = "CPUUtilization"
  namespace = "AWS/EC2"
  period = "300"
  statistic = "Average"
  threshold = "80"
  alarm_description = "CRITICAL CPU utilization for instance ID i-04f7af9ea5374c4d8!"
  dimensions = {
    InstanceId = "i-04f7af9ea5374c4d8"
  }
  alarm_actions = [module.aws_chatbot_slack.chatbot_sns_topic_arn_primary_region, module.ssm_incident_manager.critical_incident_response_plan_arn]
  ok_actions = [module.aws_chatbot_slack.chatbot_sns_topic_arn_primary_region]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I log in to the demo EC2 instance through AWS Systems Manager – Session Manager and run the following commands to generate high CPU load for 30 minutes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo amazon-linux-extras install epel -y
sudo yum install stress -y
stress --cpu 4 --timeout 30m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The CloudWatch Alarm dispatches to SNS for AWS Chatbot and the result is the following message on Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20r3zz8jehtxyiicqpge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20r3zz8jehtxyiicqpge.png" width="784" height="825"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CloudWatch Alarm’s secondary action is to trigger an AWS Systems Manager – Incident Manager Response Plan, which is also set up to post to Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtnrnjohbq5r69733g0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtnrnjohbq5r69733g0z.png" width="758" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After 1 minute, as per the contact plan configuration, I receive this email:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydrgqg71814vjz0z5gzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydrgqg71814vjz0z5gzc.png" width="645" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shortly after I receive this SMS:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1kf2zbeah6w9cyb33oc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1kf2zbeah6w9cyb33oc.jpg" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then follow the instructions to acknowledge the incident:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtpqbpb0bfkxovqwt5kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtpqbpb0bfkxovqwt5kj.png" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this was during the night and I was asleep, the next phase would be an automated phone call.&lt;/p&gt;

&lt;p&gt;My team-mates can see on Slack that I have assumed ownership of the incident.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqqgl4js6vdkuuvh7bj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqqgl4js6vdkuuvh7bj3.png" width="602" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how the incident is tracked in AWS Systems Manager – Incident Manager. The SSM runbook deployed with Terraform is triggered, providing guidance on how to handle the Critical Incident process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1yx9yj2al1bwodovzlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1yx9yj2al1bwodovzlk.png" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runbook lays out the process for us and the first step is to examine the customer impact, called Triage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkdwcrqii1lp2it0tgne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkdwcrqii1lp2it0tgne.png" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metric for the CloudWatch Alarm which triggered the Incident is automatically included in the Incident Overview.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8ksf4eexmpo6783y7pr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8ksf4eexmpo6783y7pr.png" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clear instructions are provided for the next phases as well. You can customize all this in your own company runbook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybozq1x45fvgcyem1vzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybozq1x45fvgcyem1vzt.png" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The situation has recovered, so we complete the Recovery section of the runbook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r6ck9v8yru6v0lun068.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r6ck9v8yru6v0lun068.png" width="749" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Incident Manager provides a full timeline overview of all events, for documentation and further forensics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x2evt9j5q3izyzc7o8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x2evt9j5q3izyzc7o8f.png" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then create a post-incident analysis / post-mortem analysis using a recommended template (which also can be customized).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydhnv0s2kp1c9ljjg7tj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydhnv0s2kp1c9ljjg7tj.png" width="609" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd67aqeyjm5lqr9vtsrk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd67aqeyjm5lqr9vtsrk.png" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can pull in relevant CloudWatch metrics to put the event into perspective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8adycylkkzs5cfidj7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8adycylkkzs5cfidj7o.png" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sample questions for a blameless, constructive review about areas that could be improved are also provided:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhedrange.com%2Fwp-content%2Fuploads%2F2024%2F06%2Fimage-28-1024x584-640x480.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhedrange.com%2Fwp-content%2Fuploads%2F2024%2F06%2Fimage-28-1024x584-640x480.png" title="image-28" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow-up action items can be defined as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdvenaqwzyngyveowsw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdvenaqwzyngyveowsw9.png" width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, AWS Systems Manager – Incident Manager provides functionality for handling the entire lifecycle of critical events.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Manager Pricing
&lt;/h3&gt;

&lt;p&gt;AWS SSM Incident Manager pricing is &lt;a href="https://aws.amazon.com/systems-manager/pricing/" rel="noopener noreferrer"&gt;$7&lt;/a&gt; per Response Plan per month. 100 SMS &amp;amp; Voice messages are included free of charge. Destination country rates can be found here: &lt;a href="https://aws.amazon.com/systems-manager/pricing/country-rates/" rel="noopener noreferrer"&gt;https://aws.amazon.com/systems-manager/pricing/country-rates/&lt;/a&gt;.&lt;/p&gt;
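&lt;p&gt;As a rough illustration of what provisioning looks like, the Terraform sketch below creates a minimal Response Plan with the &lt;code&gt;aws_ssmincidents_response_plan&lt;/code&gt; resource. The names, region and impact level are placeholders, and Incident Manager requires a replication set to exist before any response plan can be created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Incident Manager requires a replication set before any response plan.
resource "aws_ssmincidents_replication_set" "main" {
  region {
    name = "eu-west-1" # placeholder region
  }
}

resource "aws_ssmincidents_response_plan" "webapp" {
  name = "webapp-critical" # placeholder name

  incident_template {
    title  = "Webapp service degradation"
    impact = 2 # 1 = critical ... 5 = no impact
  }

  depends_on = [aws_ssmincidents_replication_set.main]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;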

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post we explored a solution for handling operational events efficiently, with the primary objective of restoring quality of service and end user experience as quickly as possible. We saw how AWS Chatbot and AWS SSM Incident Manager integrate nicely with AWS services such as AWS Health, AWS Security Hub and any AWS CloudWatch alarm. Making operational information and CloudWatch metrics available in Slack/Microsoft Teams, where most of the daily interaction takes place, is something that I personally appreciate. Most people have Slack/Teams on their mobile devices, so this can really increase the quality of internal communication by removing the need to log in to the AWS Console or 3rd party systems.&lt;/p&gt;

&lt;p&gt;By following the described steps, rooted in the AWS Well-Architected Framework, and deploying the provided Terraform sample code, organizations can improve their operational procedures, increase resiliency and work smarter.&lt;/p&gt;

&lt;p&gt;Terraform module for AWS Chatbot: &lt;a href="https://github.com/haakond/terraform-aws-chatbot/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-chatbot/blob/main/README.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Terraform module for AWS Systems Manager Incident Manager: &lt;a href="https://github.com/haakond/terraform-aws-ssm-incident-manager/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-ssm-incident-manager/blob/main/README.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework – Operational Excellence Pillar whitepaper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html?ref=wellarchitected-wp" rel="noopener noreferrer"&gt;AWS Well-Architected Framework – Reliability Pillar whitepaper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/what-is.html" rel="noopener noreferrer"&gt;What is AWS Chatbot?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/chatbot/latest/adminguide/performing-actions.html" rel="noopener noreferrer"&gt;AWS Chatbot – Performing actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager Incident Manager introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-response.html" rel="noopener noreferrer"&gt;Preparing for incidents in Incident Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/analysis.html" rel="noopener noreferrer"&gt;Performing a post-incident analysis in Incident Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/incident-manager/latest/userguide/integration.html" rel="noopener noreferrer"&gt;Product and service integrations with Incident Manager&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2024/05/31/never-miss-an-alert-with-aws-chatbot-and-aws-ssm-incident-manager/" rel="noopener noreferrer"&gt;Never miss an alert with AWS Chatbot and AWS SSM – Incident Manager&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>chatbot</category>
      <category>slack</category>
    </item>
    <item>
      <title>Protect your webapps from malicious traffic with AWS Web Application Firewall</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Thu, 23 May 2024 12:56:54 +0000</pubDate>
      <link>https://forem.com/haakoned/protect-your-webapps-from-malicious-traffic-with-aws-web-application-firewall-39o</link>
      <guid>https://forem.com/haakoned/protect-your-webapps-from-malicious-traffic-with-aws-web-application-firewall-39o</guid>
      <description>&lt;p&gt;&lt;strong&gt;Table of contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Introduction

&lt;ul&gt;
&lt;li&gt;The Web Application Firewall concept&lt;/li&gt;
&lt;li&gt;Deployment options&lt;/li&gt;
&lt;li&gt;Option #1 – Application layer&lt;/li&gt;
&lt;li&gt;Option #2 – Webserver module&lt;/li&gt;
&lt;li&gt;Option #3 – Virtual appliance&lt;/li&gt;
&lt;li&gt;
Option #4 – AWS native service – Web Application Firewall

&lt;ul&gt;
&lt;li&gt;AWS native service – WAF – Regional deployment&lt;/li&gt;
&lt;li&gt;AWS native service – WAF – Global edge network&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Introduction to how AWS WAF works&lt;/li&gt;

&lt;li&gt;AWS WAF Traffic dashboard insight&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;AWS WAF provisioning with Terraform&lt;/li&gt;

&lt;li&gt;AWS WAF pricing&lt;/li&gt;

&lt;li&gt;Conclusion and recommendations&lt;/li&gt;

&lt;li&gt;References and additional resources&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a rapidly increasing number of companies move workloads to the cloud and extend their footprint through refactoring and modernization, the possible attack vectors keep expanding. According to &lt;a href="https://www.datadoghq.com/state-of-cloud-security/" rel="noopener noreferrer"&gt;Datadog's State of Cloud Security report&lt;/a&gt;, a substantial portion of cloud workloads are excessively privileged [&lt;a href="https://www.datadoghq.com/state-of-cloud-security/#5" rel="noopener noreferrer"&gt;FACT 5&lt;/a&gt;] and many virtual machines are publicly exposed to the internet [&lt;a href="https://www.datadoghq.com/state-of-cloud-security/#6" rel="noopener noreferrer"&gt;FACT 6&lt;/a&gt;].&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In AWS, only a small number (1.5 percent) of Amazon EC2 instances have full administrator privileges. Overall, nearly one in four EC2 instances (23 percent) have administrator or highly sensitive permissions to the AWS account they run in. An attacker does not need full administrator privileges to have a substantial impact—there are other, more common and challenging-to-uncover types of permissions they can leverage. We found that:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;5.4 percent of EC2 instances have risky permissions that allow lateral movement&lt;/strong&gt; in the account, such as connecting to other instances using SSM Sessions Manager.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;7.2 percent allow an attacker to gain full administrative access&lt;/strong&gt; to the account by privilege escalation, such as permissions to create a new IAM user with administrator privileges.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;20 percent have excessive data access&lt;/strong&gt;, such as listing and accessing data from all S3 buckets in the account.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Note that these conditions are not mutually exclusive—a specific instance can fall into several of these categories.)&lt;/em&gt;&lt;br&gt;
&lt;cite&gt;&lt;em&gt;&lt;a href="https://www.datadoghq.com/state-of-cloud-security/#5" rel="noopener noreferrer"&gt;FACT 5&lt;/a&gt;: A substantial portion of cloud workloads are excessively privileged – DataDog State of Cloud Security &lt;/em&gt;&lt;/cite&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;7 percent of EC2 instances&lt;/strong&gt;, &lt;strong&gt;3 percent of Azure VMs&lt;/strong&gt;, and &lt;strong&gt;13 percent of Google Cloud VMs&lt;/strong&gt; are publicly exposed to the internet. Among instances that are publicly exposed, HTTP and HTTPS are the most commonly exposed ports, and are not considered risky in general. After these, SSH and RDP remote access protocols are common.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;em&gt;&lt;a href="https://www.datadoghq.com/state-of-cloud-security/#6" rel="noopener noreferrer"&gt;FACT 6&lt;/a&gt;: Many virtual machines remain publicly exposed to the internet – DataDog State of Cloud Security&lt;/em&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scanning of public internet resources is happening all the time while malicious actors are becoming more and more sophisticated. Companies can have a tight perimeter for employee IAM-credentials with access to the AWS Console/CLI, but still have the back door wide open with unsecured web applications.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/security.html" rel="noopener noreferrer"&gt;Security Pillar&lt;/a&gt; of the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/sec-design.html" rel="noopener noreferrer"&gt;a key design principle&lt;/a&gt; is to apply security at all layers with a defense in depth approach for edge of network, VPC, load balancing, instance/compute, operating system, application and code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Web Application Firewall concept
&lt;/h3&gt;

&lt;p&gt;You may be familiar with traditional firewalls that function on Layer 3/4 by defining rules for protocols and port ranges. As the HTTP protocol is a &lt;a href="https://en.wikipedia.org/wiki/OSI_model#Layer_architecture" rel="noopener noreferrer"&gt;Layer 7 construct&lt;/a&gt; we need something a bit more advanced to inspect, monitor, filter/block unwanted requests to and from a web service.&lt;/p&gt;

&lt;p&gt;A Web Application Firewall can help by preventing attacks exploiting a web application’s known vulnerabilities. The &lt;a href="https://owasp.org/Top10/#welcome-to-the-owasp-top-10-2021" rel="noopener noreferrer"&gt;OWASP Top Ten list&lt;/a&gt; defines the most common attack vectors and is updated on a regular basis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Broken Access Control

&lt;ul&gt;
&lt;li&gt;Violation of principle of least privilege/elevation of privilege&lt;/li&gt;
&lt;li&gt;Bypassing access control checks/viewing someone else’s account&lt;/li&gt;
&lt;li&gt;Lack of multi-factor authentication&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cryptographic Failures

&lt;ul&gt;
&lt;li&gt;Lack of proper encryption at rest/in transit&lt;/li&gt;
&lt;li&gt;Old or weak cryptographic algorithms (not modern TLS)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Injection

&lt;ul&gt;
&lt;li&gt;Lack of proper input validation/sanitation&lt;/li&gt;
&lt;li&gt;SQL injection&lt;/li&gt;
&lt;li&gt;Cross-site scripting (XSS)&lt;/li&gt;
&lt;li&gt;OS command, ORM, LDAP injection&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Insecure Design&lt;/li&gt;
&lt;li&gt;Security Misconfiguration&lt;/li&gt;
&lt;li&gt;Vulnerable and Outdated Components

&lt;ul&gt;
&lt;li&gt;Operating system, web/application server, DBMS etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Identification and Authentication Failures

&lt;ul&gt;
&lt;li&gt;Brute force or automated attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Software and Data Integrity Failures&lt;/li&gt;
&lt;li&gt;Security Logging and Monitoring Failures

&lt;ul&gt;
&lt;li&gt;Insufficient audit logging&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Server-Side Request Forgery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One caveat with a WAF is that, depending on the deployment model, it may be resource intensive. All this inspection and filtering generates CPU load, which in a traditional datacenter environment means the web server has less capacity left for serving legitimate traffic, leading to performance degradation, especially under heavy load or attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment options
&lt;/h3&gt;

&lt;p&gt;A number of Web Application Firewall solutions are available, each with its pros and cons.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option #1 – Application layer
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlmjpovu48s4edr8a13m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlmjpovu48s4edr8a13m.png" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first option is to implement WAF-like capabilities either in your own code base or by including a framework component such as &lt;a href="https://shieldon.io/en" rel="noopener noreferrer"&gt;ShieldON for PHP&lt;/a&gt;. Although it may seem handy, this approach is tightly coupled to your application code, and since it's deployed on the same compute option (VM or container), it comes with the most severe performance impact.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option #2 – Webserver module
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1u2vlw039cf6a4rnybo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1u2vlw039cf6a4rnybo.png" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By moving up one level in the stack and configuring WAF as a module in your web server of choice you can decouple from your application code base.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/owasp-modsecurity/ModSecurity" rel="noopener noreferrer"&gt;ModSecurity&lt;/a&gt;is a traditional option for &lt;a href="https://www.linode.com/docs/guides/securing-apache2-with-modsecurity/" rel="noopener noreferrer"&gt;Apache&lt;/a&gt; and &lt;a href="https://www.linode.com/docs/guides/securing-nginx-with-modsecurity/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt;web servers. If deployed on the same instance both your application and the WAF would compete for CPU resources. In some situations WAF can be extracted and deployed as a separate proxy tier, as further described below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option #3 – Virtual appliance
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxzhu0zewsoycjj4pioh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxzhu0zewsoycjj4pioh.png" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this model the WAF component is separated from the application layer so that it can be managed and scaled independently. The main advantage is that malicious traffic can be blocked before reaching the application compute resources, ensuring maximum performance for legitimate requests. This also opens up the possibility to share the WAF component across application workloads in the same region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/marketplace/solutions/security/web-application-firewall?aws-marketplace-cards.sort-by=item.additionalFields.sortOrder&amp;amp;aws-marketplace-cards.sort-order=asc" rel="noopener noreferrer"&gt;AWS Marketplace has multiple options for Cloud WAF-as-a-Service virtual appliances&lt;/a&gt; and most premium solutions can be integrated with existing enterprise management, logging and reporting tools. Some also deploy HAproxy or Nginx with WAF like capabilities like the aforementioned &lt;a href="https://github.com/owasp-modsecurity/ModSecurity" rel="noopener noreferrer"&gt;ModSecurity&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option #4 – AWS native service – Web Application Firewall
&lt;/h4&gt;

&lt;p&gt;Our fourth option is AWS Web Application Firewall, a fully managed service. With this option you pay no license fees, only for what you use, and the WAF component can also be scaled and managed as with option #3, decoupled from the application layer. Some of the main benefits are that AWS WAF is tightly integrated with AWS services such as Amazon Cloudfront, AWS Application Load Balancer, AWS API Gateway, AWS Appsync and AWS Shield for DDoS protection. It’s relatively straightforward to set up and deploy and builders already familiar with AWS won’t have to learn something new (or relate to a 3rd party provider with possibly sub-optimal licensing agreements).&lt;/p&gt;

&lt;p&gt;AWS WAF also supports Bot Control, which provides visibility and control over common and pervasive bot traffic that can consume excess resources or lead to downtime, and Fraud Control, which can protect login and sign-up pages against attacks such as credential stuffing, credential cracking and fake account creation.&lt;/p&gt;

&lt;p&gt;Relevant rules are configured in ordered priority, like traditional firewall rules, and you can choose from ALLOW, BLOCK, COUNT or CAPTCHA/Challenge actions to achieve the desired behaviour.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALLOW all requests except the ones that you specify&lt;/li&gt;
&lt;li&gt;BLOCK all requests except the ones that you specify&lt;/li&gt;
&lt;li&gt;COUNT requests that match certain criteria&lt;/li&gt;
&lt;li&gt;Run CAPTCHA or challenge checks against requests that match certain criteria&lt;/li&gt;
&lt;/ul&gt;
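
&lt;p&gt;To make this concrete, here is a minimal Terraform sketch of a Web ACL that allows all traffic by default and BLOCKs requests matching a geo rule. All names are placeholders, as is the ISO country code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_wafv2_web_acl" "example" {
  name  = "example-web-acl" # placeholder name
  scope = "REGIONAL"        # use "CLOUDFRONT" (in us-east-1) for the edge network

  # Requests matching no rule fall through to the default action.
  default_action {
    allow {}
  }

  rule {
    name     = "block-selected-countries"
    priority = 1

    action {
      block {}
    }

    statement {
      geo_match_statement {
        country_codes = ["XX"] # placeholder ISO country code
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "block-selected-countries"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "example-web-acl"
    sampled_requests_enabled   = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;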

&lt;h5&gt;
  
  
  AWS native service – WAF – Regional deployment
&lt;/h5&gt;

&lt;p&gt;In this scenario AWS WAF is configured for Application Load Balancers, Amazon API Gateways or AWS AppSync at the regional level.&lt;/p&gt;

&lt;p&gt;AWS Shield L3+L4 standard protection is included without additional charges, but AWS Shield Advanced (L7 protection) is not supported.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl3b3098l95q1j1py84u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffl3b3098l95q1j1py84u.png" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  AWS native service – WAF – Global edge network
&lt;/h5&gt;

&lt;p&gt;In this scenario AWS WAF is configured for Amazon Cloudfront at the global edge network level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futnj15xtdeojhmvmnyt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futnj15xtdeojhmvmnyt5.png" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mitigation of large scale attacks is most efficient the further “out” you get, because the combined global network capacity is larger than regional capacity, so moving from a regional perimeter to the AWS global edge network is highly recommended. By adopting WAF with Cloudfront you can get full &lt;a href="https://aws.amazon.com/shield/" rel="noopener noreferrer"&gt;AWS Shield&lt;/a&gt; DDoS protection (Standard for L3/4, or Advanced for L3/4+L7 for mission critical workloads) and give AWS the optimal preconditions for mitigation. Malicious requests can be blocked before they even reach the region.&lt;/p&gt;
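
&lt;p&gt;For the global edge network scenario, the Web ACL must be created with CLOUDFRONT scope in the us-east-1 region. A minimal Terraform sketch (names are placeholders) could look like this, with the ACL then attached to a distribution via its &lt;code&gt;web_acl_id&lt;/code&gt; argument:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1" # WAF for Cloudfront must live in us-east-1
}

resource "aws_wafv2_web_acl" "edge" {
  provider = aws.us_east_1
  name     = "edge-web-acl" # placeholder name
  scope    = "CLOUDFRONT"

  default_action {
    allow {}
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "edge-web-acl"
    sampled_requests_enabled   = true
  }
}

# In the distribution: web_acl_id = aws_wafv2_web_acl.edge.arn
# (WAFv2 ACLs are referenced by ARN, not by ID.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;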

&lt;h3&gt;
  
  
  Introduction to how AWS WAF works
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ih0f3yrmbgdquripib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ih0f3yrmbgdquripib.png" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define a Web Access Control List (Web ACL) configured to protect a set of AWS resources (such as Amazon Cloudfront or AWS Application Load Balancer).&lt;/li&gt;
&lt;li&gt;Specify your desired actions as rule statements. These can be custom and specified by you, managed by the AWS Threat Research Team or a 3rd party vendor from the AWS Marketplace.

&lt;ul&gt;
&lt;li&gt;Each rule consists of a condition and an action. Example: if request origin country is this value, then BLOCK the request.&lt;/li&gt;
&lt;li&gt;A rule can be rate based, IP allow/deny, geoblocking at country level and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Organize re-usable rules in Rule Groups that can be attached to multiple WebACLs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rules are evaluated starting from the lowest numeric priority (1) upwards, until a matching rule terminates the evaluation, or all rules have been evaluated without a match.&lt;/p&gt;

&lt;p&gt;To account for the complexity and evaluation cost of the total combination of rules and rule groups, each Web ACL has a concept called &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-waf-capacity-units.html" rel="noopener noreferrer"&gt;web ACL capacity units (WCU)&lt;/a&gt;, limited to a maximum of 5,000 WCUs per Web ACL or Rule Group.&lt;/p&gt;

&lt;p&gt;If your company has subscribed to AWS Shield Advanced, the service will add an additional Rule Group managed by the AWS Shield Response Team for tailored mitigation.&lt;/p&gt;

&lt;p&gt;Here is an overview of some of the Managed Rule Groups available for configuration in the AWS Console:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Managed rule groups&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs8sgbr4jh2aqk7sig6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs8sgbr4jh2aqk7sig6d.png" width="705" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1xlev1ewmn9nhnkiwd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1xlev1ewmn9nhnkiwd6.png" width="800" height="847"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS WAF Traffic dashboard insight
&lt;/h4&gt;

&lt;p&gt;Having visibility into incoming traffic is paramount for successful mitigation and for ensuring valid traffic is not impacted by mistake. AWS WAF WebACL logs can be shipped to an Amazon CloudWatch Logs log group or an Amazon S3 bucket, which enables easy querying with CloudWatch Logs Insights and/or Amazon Athena. In addition, an Amazon Data Firehose delivery stream can be set up as a destination for further processing or shipping to a 3rd party log analysis solution such as Splunk.&lt;/p&gt;
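
&lt;p&gt;As a sketch, shipping WebACL logs to CloudWatch Logs can be provisioned with Terraform roughly as follows, assuming a Web ACL resource named &lt;code&gt;example&lt;/code&gt; already exists; note that the destination log group name must start with the &lt;code&gt;aws-waf-logs-&lt;/code&gt; prefix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Destination log group name must start with "aws-waf-logs-".
resource "aws_cloudwatch_log_group" "waf" {
  name              = "aws-waf-logs-example" # placeholder name
  retention_in_days = 30
}

resource "aws_wafv2_web_acl_logging_configuration" "example" {
  resource_arn            = aws_wafv2_web_acl.example.arn
  log_destination_configs = [aws_cloudwatch_log_group.waf.arn]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;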

&lt;p&gt;AWS Web Application Firewall is of course set up for my blog, so I have the opportunity to share some recent insights from the last seven days as of publication of this post.&lt;/p&gt;

&lt;p&gt;The graph below illustrates the distribution between Allowed and Blocked requests. At this point in time I have no Count actions configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F596b6x495igad5pr1bbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F596b6x495igad5pr1bbg.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph below illustrates the types of attacks identified in the requests. The majority are of type NoUserAgent, and there are just a few BadBots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0iuy3jzy9d2g2wwbgwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0iuy3jzy9d2g2wwbgwm.png" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph below illustrates the ten most common rule labels added to incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytv4cv9fhsuebvqwumn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytv4cv9fhsuebvqwumn4.png" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the useful graphs on the Traffic overview tab, you can also easily query the WAF logs directly from the CloudWatch Log Insights tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvoixve0r0c4n8fw7ko7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvoixve0r0c4n8fw7ko7.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a sample of some of the information available in the request logs, which you can base your rule logic on, with some details redacted or modified. For a full description of all available log fields, see the &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/logging-fields.html" rel="noopener noreferrer"&gt;AWS WAF Developer Guide – Log fields&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can see that the action is ALLOW and the terminating rule ID is Default_Action, so the request was permitted and passed on to the backend.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;action: ALLOW
httpRequest.clientIp: 123.45.67.809
httpRequest.country: NO
httpRequest.headers.10.value: https://hedrange.com/
httpRequest.headers.4.value: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0
httpRequest.uri: /wp-content/uploads/2023/10/2023-10-oidc-feat.png
ruleGroupList.0.ruleGroupId: AWS#AWSManagedRulesAmazonIpReputationList
ruleGroupList.1.ruleGroupId: AWS#AWSManagedRulesWordPressRuleSet
ruleGroupList.2.ruleGroupId: AWS#AWSManagedRulesCommonRuleSet
terminatingRuleId: Default_Action
terminatingRuleType: REGULAR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CloudWatch Logs Insights query below searches the selected log groups containing the WAF logs for the country, action, URI and terminating rule ID of BLOCKed requests during the last week, limited to 50 results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  fields httpRequest.country, action, httpRequest.uri, terminatingRuleId
| filter action = "BLOCK"
| sort @timestamp desc
| limit 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;httpRequest.country&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;httpRequest.uri&lt;/th&gt;
&lt;th&gt;terminatingRuleId&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UA&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/archivarix.cms.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesAmazonIpReputationList&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FR&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesAmazonIpReputationList&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UA&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/themes/sketch/404.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PH&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PH&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CA&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/robots.txt&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CN&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/xmlrpc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesWordPressRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/plugins/index.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/css/sgd.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/revision.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/.well-known/admin.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/install.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/Requests/dropdown.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/pomo/fgertreyersd.php.suspected&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/sts.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/google.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/uploads/error_log.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/db.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/pomo/wp-login.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-admin/js/privacy-tools.min.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/autoload_classmap.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/link.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/ws.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/doc.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-admin/js/widgets/cong.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/rest-api/endpoints/html.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/uploads/wp-login.php.suspected&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/01.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-content/uploads/cong.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/.well-known//.well-known/owlmailer.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/js/tinymce/skins/wordpress/images/index.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/worm0.PhP7&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/user.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/edit.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/wp-includes/js/tinymce/skins/lightgray/img/index.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IE&lt;/td&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;/options.php&lt;/td&gt;
&lt;td&gt;AWS-AWSManagedRulesCommonRuleSet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As we can see, there are many good examples here of potentially malicious URIs which are blocked before reaching the application. Even though these resources do not exist on the web server in question, this is simply random HTTP request guessing from malicious actors.&lt;/p&gt;
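&lt;p&gt;The table above is derived from AWS WAF logs. As a minimal sketch of how those four columns can be extracted from a single log record (the field names follow the AWS WAF log schema; the sample record itself is fabricated for illustration):&lt;/p&gt;

```python
import json

# A trimmed, fabricated sample of an AWS WAF log record.
# Real records contain many more fields.
sample = json.dumps({
    "action": "BLOCK",
    "terminatingRuleId": "AWS-AWSManagedRulesWordPressRuleSet",
    "httpRequest": {
        "country": "US",
        "uri": "/xmlrpc.php",
    },
})

def summarize(record_json):
    """Extract the four columns shown in the table from one WAF log record."""
    record = json.loads(record_json)
    http = record.get("httpRequest", {})
    return {
        "country": http.get("country"),
        "action": record.get("action"),
        "uri": http.get("uri"),
        "terminatingRuleId": record.get("terminatingRuleId"),
    }

print(summarize(sample))
```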

&lt;h2&gt;
  
  
  AWS WAF provisioning with Terraform
&lt;/h2&gt;

&lt;p&gt;For demonstration purposes, reference is made to my blog post &lt;a href="https://hedrange.com/2024/05/03/develop-lightweight-and-secure-rest-apis-with-aws-lambda-function-url-and-terraform/" rel="noopener noreferrer"&gt;Develop lightweight and secure REST APIs with AWS Lambda Function URL and Terraform&lt;/a&gt;, which includes Terraform resource configurations for AWS WAF in front of Amazon Cloudfront.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As a first step we define a WAFv2 WebACL with a default action of ALLOW.&lt;/li&gt;
&lt;li&gt;The first rule is the AWS Managed Rule &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-ip-rep.html" rel="noopener noreferrer"&gt;AWSManagedRulesAmazonIpReputationList&lt;/a&gt; (WCU: 25) with priority 1.

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;The Amazon IP reputation list rule group contains rules that are based on Amazon internal threat intelligence. This is useful if you would like to block IP addresses typically associated with bots or other threats. Blocking these IP addresses can help mitigate bots and reduce the risk of a malicious actor discovering a vulnerable application (&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-ip-rep.html" rel="noopener noreferrer"&gt;reference&lt;/a&gt;).&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The second rule is the AWS Managed Rule &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-use-case.html#aws-managed-rule-groups-use-case-wordpress-app" rel="noopener noreferrer"&gt;AWSManagedRulesWordPressRuleSet&lt;/a&gt; (WCU: 100) with priority 2.

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;The WordPress application rule group contains rules that block request patterns associated with the exploitation of vulnerabilities specific to WordPress sites. You should evaluate this rule group if you are running WordPress. This rule group should be used in conjunction with the &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-use-case.html#aws-managed-rule-groups-use-case-sql-db" rel="noopener noreferrer"&gt;SQL database&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-use-case.html#aws-managed-rule-groups-use-case-php-app" rel="noopener noreferrer"&gt;PHP application&lt;/a&gt; rule groups (&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-use-case.html#aws-managed-rule-groups-use-case-wordpress-app" rel="noopener noreferrer"&gt;reference)&lt;/a&gt;.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The third rule is the AWS Managed Rule &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-known-bad-inputs" rel="noopener noreferrer"&gt;AWSManagedRulesKnownBadInputsRuleSet&lt;/a&gt; (WCU: 200) with priority 3.

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;The Known bad inputs rule group contains rules to block request patterns that are known to be invalid and are associated with exploitation or discovery of vulnerabilities. This can help reduce the risk of a malicious actor discovering a vulnerable application (&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-known-bad-inputs" rel="noopener noreferrer"&gt;reference&lt;/a&gt;).&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The fourth rule is the AWS Managed Rule “&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-crs" rel="noopener noreferrer"&gt;AWSManagedRulesCommonRuleSet&lt;/a&gt;” (WCU: 700) with priority 4.

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;The core rule set (CRS) rule group contains rules that are generally applicable to web applications. This provides protection against exploitation of a wide range of vulnerabilities, including some of the high risk and commonly occurring vulnerabilities described in OWASP publications such as &lt;a href="https://owasp.org/www-project-top-ten/" rel="noopener noreferrer"&gt;OWASP Top 10&lt;/a&gt;. Consider using this rule group for any AWS WAF use case (&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-crs" rel="noopener noreferrer"&gt;reference&lt;/a&gt;).&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Then we define a Cloudfront distribution.&lt;/li&gt;
&lt;li&gt;Configure AWS WAF WebACL for Cloudfront.&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Step #1 - Create a Web ACL
resource "aws_wafv2_web_acl" "lambda_function_url_demo" {
  #checkov:skip=CKV2_AWS_31: WAF2 logging configuration not necessary for this use-case.
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  name = "lambda_function_url_demo"
  description = "Web ACL with managed rule groups for lambda_function_url_demo"
  scope = "CLOUDFRONT"

  default_action {
    allow {}
  }
# Step 2 - First rule
  rule {
    name = "AWSManagedRulesAmazonIpReputationList"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesAmazonIpReputationList"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesAmazonIpReputationList"
      sampled_requests_enabled = false
    }
  }
# Step 3 - Second rule
  rule {
    name = "AWSManagedRulesWordPressRuleSet"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesWordPressRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesWordPressRuleSet"
      sampled_requests_enabled = false
    }
  }
# Step 4 - Third rule
  rule {
    name = "AWSManagedRulesKnownBadInputsRuleSet"
    priority = 3

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesKnownBadInputsRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesKnownBadInputsRuleSet"
      sampled_requests_enabled = false
    }
  }
# Step 5 - Fourth rule  
  rule {
    name = "AWSManagedRulesCommonRuleSet"
    priority = 4

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled = false
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name = "web-acl-lambda-function-url-demo"
    sampled_requests_enabled = true
  }
}
# Step 6 - Define Cloudfront distribution resource
resource "aws_cloudfront_distribution" "lambda_function_url_demo" {
  #checkov:skip=CKV_AWS_310: Origin failover is not required for this use-case.
  #checkov:skip=CKV2_AWS_42: Custom SSL certificate is not required for this use-case.
  #checkov:skip=CKV2_AWS_32: Response headers policy not required.
  #checkov:skip=CKV_AWS_68: WAF to come
  #checkov:skip=CKV_AWS_111: WAF to come
  #checkov:skip=CKV2_AWS_47: WAF to come
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  origin {
    domain_name = local.lambda_function_url_demo_domain_name
    origin_access_control_id = aws_cloudfront_origin_access_control.cloudfront_oac_lambda_url[0].id
    origin_id = local.lambda_function_origin_id

    custom_origin_config {
      http_port = 80
      https_port = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols = ["TLSv1.2"]
      origin_keepalive_timeout = 5
      origin_read_timeout = 30
    }
  }

  enabled = true
  is_ipv6_enabled = true
  default_root_object = "index.html"
  price_class = "PriceClass_200"

  logging_config {
    include_cookies = false
    bucket = module.cloudfront_logs[0].s3_bucket_bucket_domain_name
    prefix = "lambda_function_url_demo"
  }

  default_cache_behavior {
    allowed_methods = ["HEAD", "DELETE", "POST", "GET", "OPTIONS", "PUT", "PATCH"]
    cached_methods = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = local.lambda_function_origin_id

    forwarded_values {
      query_string = true

      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl = 0
    default_ttl = 0
    max_ttl = 86400
    compress = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
      locations = []
    }
  }

  tags = {
    Name = "LambdaFunctionUrlDemo"
  }

  viewer_certificate {
    cloudfront_default_certificate = true
    minimum_protocol_version = "TLSv1.2_2018"
  }
# Step 7 - Configure AWS WAF WebACL for Cloudfront
  web_acl_id = aws_wafv2_web_acl.lambda_function_url_demo[0].arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full sample code including logging configuration is available at &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/waf.tf" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/waf.tf&lt;/a&gt; and &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/cloudfront.tf" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/cloudfront.tf&lt;/a&gt;. Check out the &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/README.md" rel="noopener noreferrer"&gt;README.md&lt;/a&gt; for how to deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS WAF pricing
&lt;/h2&gt;

&lt;p&gt;AWS charges for 1) each Web ACL, 2) the number of rules configured and 3) the number of requests processed. The more WCUs your configuration consumes, the higher the cost.&lt;/p&gt;

&lt;p&gt;More advanced capabilities such as Bot Control and Fraud Control have additional subscription and processing costs.&lt;/p&gt;

&lt;p&gt;You can also subscribe to Managed Rules from third-party providers on the AWS Marketplace, which are billed separately.&lt;/p&gt;

&lt;p&gt;For full insight and scenario examples study &lt;a href="https://aws.amazon.com/waf/pricing/" rel="noopener noreferrer"&gt;https://aws.amazon.com/waf/pricing/&lt;/a&gt;.&lt;/p&gt;
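&lt;p&gt;To make the pricing dimensions concrete, here is a rough back-of-the-envelope estimate. The unit prices below are illustrative assumptions, not authoritative figures; always check the pricing page linked above for current numbers.&lt;/p&gt;

```python
# Rough monthly AWS WAF cost estimate across the three pricing dimensions.
# The unit prices are ASSUMPTIONS for illustration only - verify against
# the AWS WAF pricing page before relying on them.
PRICE_PER_WEB_ACL = 5.00      # USD per Web ACL per month (assumed)
PRICE_PER_RULE = 1.00         # USD per rule per month (assumed)
PRICE_PER_MILLION_REQ = 0.60  # USD per million requests (assumed)

def estimate_monthly_cost(web_acls, rules, requests_millions):
    """Sum the three per-month cost components."""
    return (web_acls * PRICE_PER_WEB_ACL
            + rules * PRICE_PER_RULE
            + requests_millions * PRICE_PER_MILLION_REQ)

# One Web ACL with the four managed rule groups, 10 million requests/month:
print(round(estimate_monthly_cost(1, 4, 10), 2))
```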

&lt;h2&gt;
  
  
  Conclusion and recommendations
&lt;/h2&gt;

&lt;p&gt;In this blog post we explored Web Application Firewall as a concept and considered different implementation options. We reviewed AWS WAF as a managed service and explored relevant rules, traffic analysis and logging.&lt;/p&gt;

&lt;p&gt;To get up and running with WAF, I recommend starting simple and choosing only the rules relevant to your type of workload: application, operating system and compute option.&lt;/p&gt;

&lt;p&gt;Align with your security department on the applicable policies and protection mechanisms to adhere to. The more complexity you add to WAF, the more intensive the traffic analysis becomes, which in turn increases costs.&lt;/p&gt;

&lt;p&gt;To reduce rule evaluation (and cost), place the widest and most probable rules (lowest WCU) first and the most narrow or heavy ones (highest WCU) last. The base price for a Web ACL includes up to 1500 WCUs, so try to stay below that to avoid extra charges.&lt;/p&gt;
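&lt;p&gt;A quick sanity check of the WCU budget for the four managed rule groups used in this post, based on the WCU values listed earlier:&lt;/p&gt;

```python
# WCU budget check for the managed rule groups used in this post.
# The WCU values come from the rule descriptions above; 1500 WCUs is the
# capacity included in a Web ACL's base price.
rule_group_wcus = {
    "AWSManagedRulesAmazonIpReputationList": 25,
    "AWSManagedRulesWordPressRuleSet": 100,
    "AWSManagedRulesKnownBadInputsRuleSet": 200,
    "AWSManagedRulesCommonRuleSet": 700,
}

total = sum(rule_group_wcus.values())
print(total)         # 1025
print(total > 1500)  # False: well within the included 1500 WCUs
```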

&lt;p&gt;When adding new rules, choose action type COUNT first and observe the WAF logs for a reasonable period of time to ensure valid traffic is not impacted, before switching to BLOCK.&lt;/p&gt;
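&lt;p&gt;In Terraform this maps to the rule's &lt;code&gt;override_action&lt;/code&gt;: setting &lt;code&gt;count&lt;/code&gt; instead of &lt;code&gt;none&lt;/code&gt; makes the managed rule group only count matches. A minimal sketch against the Web ACL above (the rule shown is illustrative):&lt;/p&gt;

```hcl
  rule {
    name = "AWSManagedRulesCommonRuleSet"
    priority = 4

    # COUNT mode: matches are recorded in logs and metrics but not blocked.
    # Switch back to none {} (use the rule group's own actions) once the
    # logs confirm legitimate traffic is unaffected.
    override_action {
      count {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled = true
    }
  }
```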

&lt;h2&gt;
  
  
  References and additional resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/infrastructure-protection.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework – Security Pillar whitepaper – Infrastructure protection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/wafv2_web_acl" rel="noopener noreferrer"&gt;Terraform AWS provider – wafv2_web_acl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/what-is-aws-waf.html" rel="noopener noreferrer"&gt;AWS WAF, AWS Firewall Manager, and AWS Shield Advanced – Developer Guide&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-list.html" rel="noopener noreferrer"&gt;AWS Managed Rules rule groups list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/logging.html" rel="noopener noreferrer"&gt;Logging AWS WAF web ACL traffic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-testing.html" rel="noopener noreferrer"&gt;Testing and tuning your AWS WAF protections&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://www.youtube.com/watch?v=PCqx7MQsmwY" rel="noopener noreferrer"&gt;AWS re:Inforce 2023 – Building a secure perimeter with AWS WAF (NIS224)&lt;/a&gt; (15 minutes)&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://www.youtube.com/watch?v=KpAao6ox-cM" rel="noopener noreferrer"&gt;AWS re:Invent 2023 – Safeguarding infrastructure from DDoS attacks with AWS edge services (NET201)&lt;/a&gt; (53 minutes)&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2024/05/23/protect-your-webapps-from-malicious-traffic-with-aws-web-application-firewall/" rel="noopener noreferrer"&gt;Protect your webapps from malicious traffic with AWS Web Application Firewall&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>ddos</category>
      <category>security</category>
    </item>
    <item>
      <title>Develop lightweight and secure REST APIs with AWS Lambda Function URL and Terraform</title>
      <dc:creator>Håkon Eriksen Drange</dc:creator>
      <pubDate>Fri, 03 May 2024 12:55:52 +0000</pubDate>
      <link>https://forem.com/haakoned/develop-lightweight-and-secure-rest-apis-with-aws-lambda-function-url-and-terraform-3i0o</link>
      <guid>https://forem.com/haakoned/develop-lightweight-and-secure-rest-apis-with-aws-lambda-function-url-and-terraform-3i0o</guid>
      <description>&lt;p&gt;Table of contents&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
What an AWS Lambda Function URL is and how it differs from a regular AWS Lambda Function

&lt;ul&gt;
&lt;li&gt;Caveats&lt;/li&gt;
&lt;li&gt;Verification of Origin Access Control&lt;/li&gt;
&lt;li&gt;Additional protection with AWS Web Application Firewall&lt;/li&gt;
&lt;li&gt;Full code example&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When it comes to developing REST APIs on AWS, there are many options. A traditional approach is to handle the application layer with in-house business logic, based on company tech stack preferences for programming languages and frameworks, deployed on compute resources like EC2 or ECS/EKS behind Application Load Balancers.&lt;/p&gt;

&lt;p&gt;Another approach in a more distributed world is to offload the routing logic to a managed service such as Amazon API Gateway. This can remove a lot of heavy lifting and logic in your application layer so that developers can focus more on core business logic and modularization.&lt;/p&gt;

&lt;p&gt;But sometimes developers only need to expose very simple functionality through an HTTPS endpoint. API Gateway might seem too complex, and perhaps more advanced functionality like authentication, routing and throttling is not necessary to get something up and running quickly. For these situations, AWS Lambda Function URL could be a feature worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AWS Lambda Function URL is and how it differs from a regular AWS Lambda Function
&lt;/h2&gt;

&lt;p&gt;Lambda functions can be invoked from a number of AWS services such as DynamoDB Streams, SQS, Kinesis and so on.&lt;/p&gt;

&lt;p&gt;Current &lt;em&gt;direct&lt;/em&gt; &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-invocation.html" rel="noopener noreferrer"&gt;Lambda invocation methods&lt;/a&gt; are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Lambda console&lt;/li&gt;
&lt;li&gt;The AWS SDK&lt;/li&gt;
&lt;li&gt;The Invoke API&lt;/li&gt;
&lt;li&gt;AWS CLI&lt;/li&gt;
&lt;li&gt;Function URL HTTPS endpoint&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The method Function URL enables a Lambda function to be invoked by an HTTPS endpoint in the format of &lt;code&gt;https://&amp;lt;url-id&amp;gt;.lambda-url.&amp;lt;region&amp;gt;.on.aws&lt;/code&gt;, in addition to the traditional invocation methods.&lt;/p&gt;
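&lt;p&gt;The URL format above can be illustrated with a small sketch that pulls the url-id and region out of a Function URL (the regular expression is an assumption based on the format shown, not an official parser):&lt;/p&gt;

```python
import re

# Matches the Function URL format: https://{url-id}.lambda-url.{region}.on.aws
# Group 1 captures the url-id, group 2 the region.
FUNCTION_URL_RE = re.compile(
    r"https://([a-z0-9]+)\.lambda-url\.([a-z0-9-]+)\.on\.aws/?"
)

def parse_function_url(url):
    """Return (url_id, region) from a Lambda Function URL."""
    match = FUNCTION_URL_RE.fullmatch(url)
    if match is None:
        raise ValueError("not a Lambda Function URL: " + url)
    return match.group(1), match.group(2)

print(parse_function_url(
    "https://ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg.lambda-url.eu-west-1.on.aws/"))
```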

&lt;p&gt;Access can be controlled with the AuthType parameter combined with resource-based policies.&lt;/p&gt;

&lt;p&gt;To only provide access to authenticated users and roles, developers can configure &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/urls-auth.html#urls-auth-iam" rel="noopener noreferrer"&gt;AuthType AWS_IAM&lt;/a&gt;. Each HTTP request is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/urls-invocation.html#urls-invocation-basics" rel="noopener noreferrer"&gt;signed&lt;/a&gt; using &lt;a href="https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html" rel="noopener noreferrer"&gt;AWS Signature Version 4 (SigV4)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For unauthenticated access to anyone, specify &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/urls-auth.html#urls-auth-none" rel="noopener noreferrer"&gt;&lt;code&gt;AuthType NONE&lt;/code&gt;&lt;/a&gt;. Do take into consideration that the Lambda Function URL itself does not provide throttling or protection capabilities (described in more detail below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk3bibt4kosyywuom9ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk3bibt4kosyywuom9ce.png" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now follows some Terraform code for demonstration purposes. To keep it as simple as possible, we deploy an AWS Lambda function based on &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-lambda" rel="noopener noreferrer"&gt;terraform-aws-modules/terraform-aws-lambda&lt;/a&gt;. &lt;code&gt;create_lambda_function_url&lt;/code&gt; is set to &lt;code&gt;true&lt;/code&gt; and &lt;code&gt;authorization_type&lt;/code&gt; to &lt;code&gt;NONE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Python source code is intentionally left out at this stage; a fully working code example is referenced in the conclusion.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS Lambda Function with endpoint URL
module "lambda_function_url_demo" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-lambda.git?ref=f7866811bc1429ce224bf6a35448cb44aa5155e7"

  function_name = "lambda-function-url-demo"
  description = "Lambda Function URL Demo"
  handler = "index.lambda_handler"
  runtime = "python3.12"
  source_path = "./src/lambda-function-url-demo/index.py"
  create_lambda_function_url = true
  authorization_type = "NONE"
  timeout = 30
  cors = {
    allow_credentials = true
    allow_origins = ["*"]
    allow_methods = ["*"]
    allow_headers = ["date", "keep-alive"]
    expose_headers = ["keep-alive", "date"]
    max_age = 60
  }

  tags = {
    Name = "LambdaFunctionUrlDemo"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "lambda_function_url_demo_arn" {
  value = module.lambda_function_url_demo.lambda_function_arn
  description = "Lambda Function URL Demo ARN"
  sensitive = false
}

output "lambda_function_url_demo_url" {
  value = module.lambda_function_url_demo.lambda_function_url
  description = "Lambda Function URL Demo URL"
  sensitive = false
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;lambda_function_url_demo_arn = “arn:aws:lambda:eu-west-1:1234567890:function:lambda-function-url-demo”&lt;/p&gt;

&lt;p&gt;lambda_function_url_demo_url = “&lt;a href="https://ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg.lambda-url.eu-west-1.on.aws/" rel="noopener noreferrer"&gt;https://ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg.lambda-url.eu-west-1.on.aws/&lt;/a&gt;“&lt;/p&gt;

&lt;p&gt;The AWS Lambda Function can now be accessed by the endpoint URL output from &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;Lambda Function URLs are designed to be a simple building block and by themselves do not support throttling, API token authentication and management, Web Application Firewall (WAF) or DDoS protection.&lt;/p&gt;

&lt;p&gt;However, this is where Amazon Cloudfront and friends come to assist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k1y5lepav4m8v63whi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k1y5lepav4m8v63whi5.png" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this approach the perimeter is moved from regional to global endpoints. This means we can benefit from the global scale and network acceleration of Cloudfront which includes AWS Shield Standard for L3/L4 DDoS Protection (you can subscribe to Shield Advanced for L7 protection, automated mitigation and Shield Response Team support). In combination with Web Application Firewall malicious traffic can be mitigated and dropped at the edge to protect our origin from unwanted invocations. This is not only beneficial from a security point of view but it also keeps costs under control.&lt;/p&gt;

&lt;p&gt;On April 11th 2024, AWS announced support for Origin Access Control for Lambda Function URL origins. The Terraform AWS Provider added support for this in &lt;a href="https://github.com/hashicorp/terraform-provider-aws/releases/tag/v5.46.0" rel="noopener noreferrer"&gt;v5.46.0&lt;/a&gt;, which was released April 19th 2024.&lt;/p&gt;

&lt;p&gt;Changelog: resource/aws_cloudfront_origin_access_control: Add &lt;code&gt;lambda&lt;/code&gt; and &lt;code&gt;mediapackagev2&lt;/code&gt; as valid values for &lt;code&gt;origin_access_control_origin_type&lt;/code&gt; (&lt;a href="https://github.com/hashicorp/terraform-provider-aws/issues/34362" rel="noopener noreferrer"&gt;#34362&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Our Lambda function will now look like this. On line 11, &lt;code&gt;authorization_type&lt;/code&gt; is changed from “NONE” to “AWS_IAM”. From line 27, a Lambda permission resource is added which grants the Cloudfront distribution permission to invoke the function. Line 35 and onwards defines the basic properties of an AWS Cloudfront distribution.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS Lambda Function
module "lambda_function_url_demo" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-lambda.git?ref=f7866811bc1429ce224bf6a35448cb44aa5155e7"

  function_name = "lambda-function-url-demo"
  description = "Lambda Function URL Demo"
  handler = "index.lambda_handler"
  runtime = "python3.12"
  source_path = "./src/lambda-function-url-demo/index.py"
  create_lambda_function_url = true
  authorization_type = "AWS_IAM"
  timeout = 30
  cors = {
    allow_credentials = true
    allow_origins = ["*"]
    allow_methods = ["*"]
    allow_headers = ["date", "keep-alive"]
    expose_headers = ["keep-alive", "date"]
    max_age = 60
  }

  tags = {
    Name = "LambdaFunctionUrlDemo"
  }
}

resource "aws_lambda_permission" "allow_cloudfront" {
  statement_id = "AllowCloudFrontServicePrincipal"
  action = "lambda:InvokeFunctionUrl"
  function_name = module.lambda_function_url_demo.lambda_function_name
  principal = "cloudfront.amazonaws.com"
  source_arn = aws_cloudfront_distribution.lambda_function_url_demo[0].arn
}

resource "aws_cloudfront_distribution" "lambda_function_url_demo" {
  provider = aws.us-east-1
  origin {
    domain_name = local.lambda_function_url_demo_domain_name
    origin_access_control_id = aws_cloudfront_origin_access_control.cloudfront_oac_lambda_url[0].id
    origin_id = local.lambda_function_origin_id

    custom_origin_config {
      http_port = 80
      https_port = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols = ["TLSv1.2"]
      origin_keepalive_timeout = 5
      origin_read_timeout = 30
    }
  }

  enabled = true
  is_ipv6_enabled = true
  default_root_object = "index.html"
  price_class = "PriceClass_200"

  logging_config {
    include_cookies = false
    bucket = module.cloudfront_logs[0].s3_bucket_bucket_domain_name
    prefix = "lambda_function_url_demo"
  }

  default_cache_behavior {
    allowed_methods = ["HEAD", "DELETE", "POST", "GET", "OPTIONS", "PUT", "PATCH"]
    cached_methods = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = local.lambda_function_origin_id

    forwarded_values {
      query_string = true

      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl = 0
    default_ttl = 0
    max_ttl = 86400
    compress = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
      locations = []
    }
  }

  tags = {
    Name = "LambdaFunctionUrlDemo"
  }

  viewer_certificate {
    cloudfront_default_certificate = true
    minimum_protocol_version = "TLSv1.2_2018"
  }
  web_acl_id = aws_wafv2_web_acl.lambda_function_url_demo[0].arn
}

# Amazon Cloudfront distribution OAC
resource "aws_cloudfront_origin_access_control" "cloudfront_oac_lambda_url" {
  name = "cloudfront_oac_lambda_url"
  description = "Policy for Lambda Function URL origins"
  origin_access_control_origin_type = "lambda"
  signing_behavior = "always"
  signing_protocol = "sigv4"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Verification of Origin Access Control
&lt;/h4&gt;

&lt;p&gt;For verification we can observe incoming requests in CloudWatch Logs. The signed request below, forwarded through CloudFront, carries the &lt;code&gt;x-amz-content-sha256&lt;/code&gt; and &lt;code&gt;x-amz-security-token&lt;/code&gt; headers; for unauthenticated requests sent directly to the Lambda Function URL, these headers are absent.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO]  2024-04-26T13:35:48.597Z    4f4c2779-05cb-439b-ba55-b50415679622    {
    "version": "2.0",
    "routeKey": "$default",
    "rawPath": "/index.html",
    "rawQueryString": "input1=YES",
    "headers": {
        "x-amz-content-sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "x-amzn-tls-version": "TLSv1.2",
        "sec-fetch-site": "same-origin",
        "x-amz-source-account": "602472554111",
        "x-forwarded-port": "443",
        "sec-fetch-user": "?1",
        "x-amz-security-token": "IQoJb3JpZ2luX2VjEHYaCXVzLWVhc3QtMSJHMEUCIA2gGi8zMAO+5h2fGL65nksRPq51Ks3y3RL5o/SfleOiAiEA6om0TmJJe3uFRO54DCL9DV6/eGxcr+KGEzNiWblc/r0qlwIIv///////////ARAAGgw4NTYzNjkwNTMxODEiDM9OfrD8gawO+6OZ0irrAegKU5LkXCbnheQAeURiaTIUv9BTjdYeGa7p1gjpPw0N51w/2BB/c0eUae11ONdgdcyK6hCLthfyOay96rx7YTbXhvtVSTWkk7Nz6eAAyttffUv+n5c5+CY2M0bLntNkuisImgKh0RRl1rsTYILXOTqqVlT5+Ipd/yZNPtgXf0NsPJNEOsAtWvSZf4PScxYd9Xlk0CvNDUCk1BZ5afUkLXlhO/T1F2Tu0oaYbIwFxLZngmDc+KEMo82HkocD4VG/fmUtp7x2ln9BwCINkLtg7P4REgsJ9WdNUG647hrkJMcBRHaYePATqPhdYtIwttqusQY6jwFudLv6XtcxTs+Yi8NuweNYVXOvyR9N28zX6OasvJh4p3JseSxXr1Ejsgnhcb9rc40uhHlvwqvuNFdgeXiB+xEiVkDV2KtOEULzVd+bO1Nf4va6WTuob1wPG4W73TAUO+xLaDedVcpp+kQQYmr3I3Dh2m31XUiL7unsacWGVi+6DZiVWLoBb6ZJ3zCabR5jOg==",
        "via": "2.0 fc5e625db631bc657fc73f189d53fa14.cloudfront.net (CloudFront)",
        "x-amzn-tls-cipher-suite": "ECDHE-RSA-AES128-GCM-SHA256",
        "sec-ch-ua-mobile": "?0",
        "upgrade-insecure-requests": "1",
        "host": "ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg.lambda-url.eu-west-1.on.aws",
        "sec-fetch-mode": "navigate",
        "x-amz-date": "20240426T133548Z",
        "x-forwarded-proto": "https",
        "x-forwarded-for": "81.166.192.92",
        "priority": "u=0, i",
        "x-amz-source-arn": "arn:aws:cloudfront::602472554111:distribution/E3RIXKOQDC23IE",
        "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
        "x-amzn-trace-id": "Root=1-662badb4-7b849839607a46814fd5c1db",
        "sec-ch-ua-platform": "\"Windows\"",
        "accept-encoding": "gzip",
        "x-amz-cf-id": "TEFeqAECknBq5wwXW6rdwZR6os03LoZJyFZjEbjD-VW7ImAl1BdpYg==",
        "user-agent": "Amazon CloudFront",
        "sec-fetch-dest": "document"
    },
    "queryStringParameters": {
        "input1": "YES"
    },
    "requestContext": {
        "accountId": "anonymous",
        "apiId": "ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg",
        "domainName": "ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg.lambda-url.eu-west-1.on.aws",
        "domainPrefix": "ytnvqv4vyv5jdhaj4xumgtgd4e0ggowg",
        "http": {
            "method": "GET",
            "path": "/index.html",
            "protocol": "HTTP/1.1",
            "sourceIp": "64.252.86.126",
            "userAgent": "Amazon CloudFront"
        },
        "requestId": "4f4c2779-05cb-439b-ba55-b50415679622",
        "routeKey": "$default",
        "stage": "$default",
        "time": "26/Apr/2024:13:35:48 +0000",
        "timeEpoch": 1714138548591
    },
    "isBase64Encoded": false
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
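As a small sanity check on the log entry above (a sketch; the values are copied from the event), the &lt;code&gt;x-amz-content-sha256&lt;/code&gt; header for this GET request is simply the SHA-256 digest of an empty request body, and the millisecond &lt;code&gt;timeEpoch&lt;/code&gt; value lines up with the human-readable &lt;code&gt;time&lt;/code&gt; field:

```python
import hashlib
from datetime import datetime, timezone

# A GET request has no body, so SigV4 signs the digest of the empty string.
empty_body_digest = hashlib.sha256(b"").hexdigest()
print(empty_body_digest)
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# timeEpoch is in milliseconds; converting it to UTC matches the "time" field.
event_time = datetime.fromtimestamp(1714138548591 / 1000, tz=timezone.utc)
print(event_time.strftime("%d/%b/%Y:%H:%M:%S +0000"))
# 26/Apr/2024:13:35:48 +0000
```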



&lt;p&gt;Direct unauthorized access to the AWS Lambda Function URL is now denied, and our lightweight API can only be accessed through CloudFront with its built-in L3/L4 DDoS protection.&lt;/p&gt;

&lt;h4&gt;
  
  
  Additional protection with AWS Web Application Firewall
&lt;/h4&gt;

&lt;p&gt;To stop malicious requests such as botnet traffic, SQL injection and cross-site scripting (XSS), we associate a &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/how-aws-waf-works.html" rel="noopener noreferrer"&gt;Web Application Firewall Access Control List&lt;/a&gt; with the CloudFront distribution. Read more about AWS WAF, its core functionality and aspects to take into consideration in &lt;a href="https://hedrange.com/2024/05/23/protect-your-webapps-from-malicious-traffic-with-aws-web-application-firewall/" rel="noopener noreferrer"&gt;Protect your webapps from malicious traffic with AWS Web Application Firewall&lt;/a&gt;.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Web Application Firewall resources

# Common CloudWatch log group for WAF logs
resource "aws_cloudwatch_log_group" "waf_cloudwatch_logs" {
  #checkov:skip=CKV_AWS_158: KMS encryption unnecessary for this use-case.
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  name = "aws-waf-logs-lambda-function-url-demo"
  retention_in_days = 365
}

resource "aws_wafv2_web_acl_logging_configuration" "waf_cloudwatch_logs_config" {
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  log_destination_configs = [aws_cloudwatch_log_group.waf_cloudwatch_logs[0].arn]
  resource_arn = aws_wafv2_web_acl.lambda_function_url_demo[0].arn
}

resource "aws_cloudwatch_log_resource_policy" "waf_cloudwatch_logs_resource_policy" {
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  policy_document = data.aws_iam_policy_document.waf_logging[0].json
  policy_name = "webacl-policy-waf-lambda-function-url-demo"
}

# Create a Web ACL
resource "aws_wafv2_web_acl" "lambda_function_url_demo" {
  #checkov:skip=CKV2_AWS_31: WAF logging is configured separately via aws_wafv2_web_acl_logging_configuration above.
  count = var.provision_cloudfront == true ? 1 : 0
  provider = aws.us-east-1
  name = "lambda_function_url_demo"
  description = "Web ACL with managed rule groups for lambda_function_url_demo"
  scope = "CLOUDFRONT"

  default_action {
    allow {}
  }

  rule {
    name = "AWSManagedRulesAmazonIpReputationList"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesAmazonIpReputationList"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesAmazonIpReputationList"
      sampled_requests_enabled = false
    }
  }

  rule {
    name = "AWSManagedRulesWordPressRuleSet"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesWordPressRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesWordPressRuleSet"
      sampled_requests_enabled = false
    }
  }

  rule {
    name = "AWSManagedRulesKnownBadInputsRuleSet"
    priority = 3

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesKnownBadInputsRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesKnownBadInputsRuleSet"
      sampled_requests_enabled = false
    }
  }

  rule {
    name = "AWSManagedRulesCommonRuleSet"
    priority = 4

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled = false
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name = "web-acl-lambda-function-url-demo"
    sampled_requests_enabled = true
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Full code example
&lt;/h3&gt;

&lt;p&gt;To tie together all the bits and pieces, I have developed a sample Terraform module which you can inspect further: &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-lambda-function-url&lt;/a&gt;. Study the &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/README.md" rel="noopener noreferrer"&gt;README.md&lt;/a&gt; and &lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/blob/main/examples/main.tf" rel="noopener noreferrer"&gt;examples/main.tf&lt;/a&gt; for complete documentation on how to get up and running, including Lambda function code in Python. If you find it useful, feel free to fork and adjust to your needs.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "lambda_function_url_demo" {
  source = "git::https://github.com/haakond/terraform-aws-lambda-function-url.git?ref=e3c72cb76d4a1d5b5b56e4a56a117f0949002a9d"

  # As global resources related to CloudFront and WAF need to be provisioned in us-east-1, we pass in two different providers.
  # Reference: https://developer.hashicorp.com/terraform/language/modules/develop/providers#passing-providers-explicitly
  provision_cloudfront = false # Set to false on the first run and to true on the second run, because of a circular dependency between the Lambda and CloudFront resources.
  providers = {
    aws = aws
    aws.us-east-1 = aws.us-east-1
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article we explored how AWS Lambda Function URLs can be a compelling alternative to self-hosted APIs behind Application Load Balancers or Amazon API Gateway for lightweight and simple REST API use-cases. We looked at how to secure the solution by adopting Amazon CloudFront and AWS Web Application Firewall. As a bonus, custom domain names are also possible with AWS Certificate Manager support in CloudFront. The solution is fully serverless, with no servers or containers to patch or manage.&lt;/p&gt;
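For reference, a custom domain could be wired in roughly like this (a hedged sketch: the alias domain and the ACM certificate resource name are illustrative, and the certificate must be issued in us-east-1 for CloudFront):

```hcl
resource "aws_cloudfront_distribution" "lambda_function_url_demo" {
  # ... origin, cache behavior and other arguments as shown earlier ...

  aliases = ["api.example.com"] # illustrative alternate domain name

  viewer_certificate {
    # Hypothetical ACM certificate resource, provisioned in us-east-1
    acm_certificate_arn      = aws_acm_certificate.api_example_com.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}
```

This replaces the `cloudfront_default_certificate = true` block from the earlier example; the DNS record for the alias must also point at the distribution's domain name.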

&lt;p&gt;&lt;strong&gt;Further reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon CloudFront now supports Origin Access Control (OAC) for Lambda function URL origins&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-lambda.html" rel="noopener noreferrer"&gt;Amazon CloudFront Developer Guide – Restricting access to an AWS Lambda Function URL origin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-urls.html" rel="noopener noreferrer"&gt;AWS Lambda Developer Guide – Lambda Function URLs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/lambda/aws/latest" rel="noopener noreferrer"&gt;AWS Lambda Terraform module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-awswaf.html" rel="noopener noreferrer"&gt;Amazon Cloudfront – Developer Guide – Using AWS WAF protections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudfront_origin_access_control" rel="noopener noreferrer"&gt;Terraform registry – Resource: aws_cloudfront_origin_access_control&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/haakond/terraform-aws-lambda-function-url/" rel="noopener noreferrer"&gt;https://github.com/haakond/terraform-aws-lambda-function-url/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://hedrange.com/2024/05/03/develop-lightweight-and-secure-rest-apis-with-aws-lambda-function-url-and-terraform/" rel="noopener noreferrer"&gt;Develop lightweight and secure REST APIs with AWS Lambda Function URL and Terraform&lt;/a&gt; first appeared on &lt;a href="https://hedrange.com" rel="noopener noreferrer"&gt;Håkon Eriksen Drange&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>articles</category>
      <category>aws</category>
      <category>cloudfront</category>
      <category>lambda</category>
    </item>
  </channel>
</rss>
