<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Simon Hanmer</title>
    <description>The latest articles on Forem by Simon Hanmer (@simon_hanmer).</description>
    <link>https://forem.com/simon_hanmer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1150089%2F77b1c675-a657-4dd2-9fbe-38049d159b4b.jpg</url>
      <title>Forem: Simon Hanmer</title>
      <link>https://forem.com/simon_hanmer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/simon_hanmer"/>
    <language>en</language>
    <item>
      <title>Building a CloudFront log parser for Hugo with Kiro CLI - and some surprising outcomes!</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Sat, 24 Jan 2026 17:33:17 +0000</pubDate>
      <link>https://forem.com/aws-builders/building-a-cloudfront-log-parser-for-hugo-with-kiro-cli-and-some-surprising-outcomes-e0e</link>
      <guid>https://forem.com/aws-builders/building-a-cloudfront-log-parser-for-hugo-with-kiro-cli-and-some-surprising-outcomes-e0e</guid>
      <description>&lt;p&gt;When I started creating content for the AWS Community Builder program, like many people I started with WordPress. However, it was too complex, had a lot of functionality I didn't need, and more importantly would have needed a server running somewhere to host it.&lt;/p&gt;

&lt;p&gt;All of that led me fairly quickly to static website generators - a great advantage of these is that you can host the output on GitHub or something similar; or, given I'm an AWS Community Builder, build it on AWS. That's what I described in my article &lt;a href="https://dev.to/article/hosting-a-static-blog-in-aws/"&gt;Hosting a Static Website on AWS&lt;/a&gt;: just S3, CloudFront and not a server to be seen. I could even tie in a really simple pipeline that would regenerate the site as I committed my changes to a git repo (more info &lt;a href="https://dev.to/article/deploying-a-hugo-blog-with-codecatalyst/"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But how did I know this approach worked? I could access the site and see the pages, but how could I see what others were doing and whether it worked for them? I did set up Google Analytics, thanks to a Hugo integration, but it felt over the top, and it had been niggling away at me for well over a year that I wanted a different, better-aligned solution.&lt;/p&gt;

&lt;p&gt;Luckily, my original infrastructure as code (IaC) had configured CloudFront to generate logs, and these were being stored in an S3 bucket, ready for me to come along and review. But manually trawling through log files is an intensive task: I'd need to download the files, read through them and work out what was going on. So I decided to create a script to parse the data.&lt;/p&gt;

&lt;p&gt;What I didn't expect when I decided to find a solution was that this setup would lead me to test a different kind of tool; one that would transform my workflow, let me see how my blog was being viewed, and also, surprisingly, reveal unexpected truths about what happens to a website on the public internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kicking Kiro CLI's tyres
&lt;/h2&gt;

&lt;p&gt;The other driver for looking at this was that I've been using the &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; IDE a lot recently, but hadn't had a chance to try out the CLI. I've been curious about AI coding assistants that work directly in the terminal for a while: could they help me build something genuinely useful, or would they just generate boilerplate and abandon me when things got complex? To find out, I decided to test Kiro CLI with a practical problem: making sense of my CloudFront access logs.&lt;/p&gt;

&lt;p&gt;The goal was simple: create a tool that could fetch, parse, and analyze the raw, gzipped log files from CloudFront that are sitting in S3. What happened next included a few surprises.&lt;/p&gt;
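
&lt;p&gt;As a sketch of that fetch-and-parse step (this is illustrative, not the tool's actual code): CloudFront standard logs are gzipped, tab-separated, W3C-style files with &lt;code&gt;#Version&lt;/code&gt; and &lt;code&gt;#Fields&lt;/code&gt; header lines, so a minimal parser in Python might look like this:&lt;/p&gt;

```python
import gzip
import io

def parse_cloudfront_log(raw_bytes):
    """Parse one gzipped CloudFront standard log file into dicts.

    CloudFront standard logs are W3C-style: comment headers
    (#Version and #Fields) followed by tab-separated records.
    """
    fields = []
    records = []
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#Fields:"):
                fields = line.split(" ")[1:]   # field names follow the tag
            elif line.startswith("#") or not line:
                continue                       # skip #Version and blanks
            else:
                records.append(dict(zip(fields, line.split("\t"))))
    return records

# A tiny sample file built in memory, as if fetched from S3.
sample = (
    "#Version: 1.0\n"
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status\n"
    "2026-01-23\t17:23:19\t152.32.254.22\tGET\t/index.html\t200\n"
)
rows = parse_cloudfront_log(gzip.compress(sample.encode("utf-8")))
print(rows[0]["cs-uri-stem"])  # → /index.html
```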

&lt;h2&gt;
  
  
  Pair programming with an AI
&lt;/h2&gt;

&lt;p&gt;For someone used to working with an AI-driven IDE, I found the CLI process surprisingly straightforward. Because I could authenticate with my AWS credentials, I could access my account via the AWS CLI, try a few prompts, and generate code that queried live results I could then check.&lt;/p&gt;

&lt;p&gt;This made troubleshooting amazingly straightforward. Early on, I couldn't get any output, so I ran a simple &lt;code&gt;aws s3 ls&lt;/code&gt; to check that there were files in the expected location. Kiro picked up the results automatically and identified that it had been using a slightly wrong format for the names it expected - the kind of thing that can get you stuck in an hour or two of debugging, but it was solved in a matter of seconds.&lt;/p&gt;

&lt;p&gt;Likewise, because I was writing the script in Python, I was using the boto3 library. Initially, Kiro tried the &lt;code&gt;list_objects&lt;/code&gt; API call, but I quickly realised that with three years' worth of logs, it was struggling to get just yesterday's logs. Again, because I was running commands in the same shell as Kiro, it quickly realised it could switch to the &lt;code&gt;list_objects_v2&lt;/code&gt; method, which includes an option to set a starting point in the list of keys returned; because CloudFront log file names include a timestamp, this could be used to filter and make the processing more focused.&lt;/p&gt;
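
&lt;p&gt;To illustrate the trick (the bucket prefix and distribution ID below are made up): because standard CloudFront log objects are named with the distribution ID followed by a timestamp, and S3 lists keys lexicographically, a &lt;code&gt;StartAfter&lt;/code&gt; marker built from the target date lets &lt;code&gt;list_objects_v2&lt;/code&gt; skip all older keys server-side. A minimal sketch:&lt;/p&gt;

```python
from datetime import date

def start_after_key(prefix, distribution_id, day):
    """Build an S3 StartAfter marker for CloudFront access logs.

    Standard log objects are named roughly
    prefix + distribution-id + '.' + YYYY-MM-DD-HH + '.' + hash + '.gz',
    so a key that stops at the target date sorts after every older
    object but before all of that day's logs.
    """
    return f"{prefix}{distribution_id}.{day.isoformat()}"

marker = start_after_key("logs/", "E2EXAMPLE123", date(2026, 1, 23))
print(marker)  # → logs/E2EXAMPLE123.2026-01-23

# With boto3 (not run here), the marker feeds straight into the call:
# s3 = boto3.client("s3")
# resp = s3.list_objects_v2(Bucket=bucket, Prefix="logs/", StartAfter=marker)
```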

&lt;p&gt;This close integration with the shell continued. As I refined the functionality we were writing, I could see that the original IP address was available in the CloudFront data - that meant I should be able to use geolocation to identify where each IP was based. Kiro initially suggested a lightweight approach which worked at first, but with testing I quickly hit a rate limiter. At this point, Kiro suggested &lt;a href="https://maxmind.com" rel="noopener noreferrer"&gt;MaxMind's&lt;/a&gt; GeoLite2 database as a solution, and not only did it write the code to enable this alongside the original approach, it also updated the README explaining how to configure it.&lt;/p&gt;
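
&lt;p&gt;A hedged sketch of that dual approach (illustrative, not the parser's actual code): wrap the lookup so it uses a MaxMind GeoLite2 reader when one is configured and degrades to placeholders otherwise:&lt;/p&gt;

```python
def geolocate(ip, reader=None):
    """Resolve an IP to a (country, city) pair.

    'reader' is expected to be a geoip2.database.Reader opened on a
    local MaxMind GeoLite2-City database; when no reader is available
    (or the lookup fails) we fall back to placeholder dashes, mirroring
    the '-' that CloudFront uses for empty fields.
    """
    if reader is None:
        return ("-", "-")
    try:
        resp = reader.city(ip)
        return (resp.country.iso_code or "-", resp.city.name or "-")
    except Exception:
        return ("-", "-")

# Without a database the call degrades gracefully:
print(geolocate("152.32.254.22"))  # → ('-', '-')
# With one (assuming the geoip2 package is installed):
# import geoip2.database
# reader = geoip2.database.Reader("GeoLite2-City.mmdb")
# country, city = geolocate("152.32.254.22", reader)
```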

&lt;p&gt;The resulting &lt;a href="https://github.com/headforthecloud/cloudfront_log_parser" rel="noopener noreferrer"&gt;CloudFront Log Parser&lt;/a&gt; is remarkably functional for an hour or two's work. It's a robust tool with 86% test coverage that I now use regularly. The entire development felt like pair programming with an extremely fast, knowledgeable partner who never got tired.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the script does
&lt;/h2&gt;

&lt;p&gt;The basic approach is relatively straightforward, following the flow below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fheadforthe.cloud%2Farticle%2Fbuilding-a-cloudfront-log-parser-with-kiro-cli%2Fflowchart.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fheadforthe.cloud%2Farticle%2Fbuilding-a-cloudfront-log-parser-with-kiro-cli%2Fflowchart.png" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, under the covers there's quite a bit more detail and we end up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart Data Retrieval: It automatically locates your log bucket (you just need your CloudFront Distribution ID) and uses S3's StartAfter parameter to skip months of old logs instantly, focusing on the timeframe you care about.&lt;/li&gt;
&lt;li&gt;Human Readable Output: By default, it filters out the digital cacophony—requests for static assets (.css, .js, images), theme files, and known bot user-agents—so you can focus on actual page visits.&lt;/li&gt;
&lt;li&gt;Global Perspective: With optional geolocation (using either a local MaxMind GeoLite2 database or a free API), it adds country and city columns, turning IP addresses into real locations.&lt;/li&gt;
&lt;li&gt;Flexible Output: View data as clean, tab-separated values perfect for importing into a spreadsheet, or as fixed-width columns for quick terminal review.&lt;/li&gt;
&lt;/ul&gt;
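
&lt;p&gt;The noise filtering in particular is simple to sketch - a minimal, illustrative version (the suffix and bot lists here are my assumptions, not the tool's exact ones) might be:&lt;/p&gt;

```python
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".svg", ".woff2")
BOT_MARKERS = ("bot", "crawler", "spider", "uptimerobot")  # matched case-insensitively

def is_page_view(uri_stem, user_agent):
    """True for requests that look like real page visits.

    Drops static-asset requests and known bot user-agents, in the
    spirit of the parser's default filtering.
    """
    if uri_stem.lower().endswith(STATIC_SUFFIXES):
        return False
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return False
    return True

print(is_page_view("/article/foo/", "Mozilla/5.0 (X11; Linux x86_64)"))  # → True
print(is_page_view("/theme/main.css", "Mozilla/5.0"))                    # → False
print(is_page_view("/", "UptimeRobot/2.0"))                              # → False
```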

&lt;h2&gt;
  
  
  Zero to Hero in a couple of hours
&lt;/h2&gt;

&lt;p&gt;So, let's look at what working with a CLI AI achieved in an hour or two.&lt;br&gt;
The tool is available in a GitHub repository at &lt;a href="https://github.com/headforthecloud/cloudfront_log_parser" rel="noopener noreferrer"&gt;https://github.com/headforthecloud/cloudfront_log_parser&lt;/a&gt;, and it contains an in-depth README file explaining how to install and use the tool. Once installed, it can be run with the command &lt;code&gt;uv run src/scripts/read_cloudfront_logs.py -d CLOUDFRONT_ID&lt;/code&gt;, replacing &lt;strong&gt;CLOUDFRONT_ID&lt;/strong&gt; with the CloudFront Distribution ID that you want to query. This will retrieve the logs for any views today.&lt;/p&gt;

&lt;p&gt;The default output will look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-01-23      17:23:19        TLV55-P1        591     152.32.254.22   GET     d2q1g6fddszo0e.cloudfront.net   /app/static/js/download.js      301     http://headforthe.cloud/app/static/js/download.jsMozilla/5.0%20(Android%2014;%20Mobile;%20rv:115.2.0)%20Gecko/115.2.0%20Firefox/115.2.0   -       -       Redirect        8RKHea-uVL40fSd308VCVq0Rh4y8cyk7EP-tzWi1WwEc_C5hWOxNAQ==        headforthe.cloud http     381     0.000   -       -       -       Redirect        HTTP/1.1        -       -       53310   0.000   Redirect        text/html       167     -       -
2026-01-23      17:23:22        TLV55-P1        596     152.32.254.22   GET     d2q1g6fddszo0e.cloudfront.net   /getDownloadUrl 301     http://headforthe.cloud/getDownloadUrl%253FappType%253DAndroid  Mozilla/5.0%20(Linux;%20Android%2014;%20Redmi%20Note%209%20Pro%205G;%20Build/UD1A.231104.31)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/125.0.6422.72%20Mobile%20Safari/537.36  appType=Android   -       Redirect        Kwbhwse_anp0Z_yQ1Yt5outrHUbb9LLxDmNDnXO2Zg_u6eCWYwd3iw==        headforthe.cloud        http    476     0.000   -       -       -       RedirectHTTP/1.1        --53310   0.000   Redirect        text/html       167     -       -
2026-01-23      17:23:18        TLV55-P1        3217    152.32.254.22   GET     d2q1g6fddszo0e.cloudfront.net   /api/common/findIpAddress       404     http://headforthe.cloud/api/common/findIpAddress Mozilla/5.0%20(Android%2014;%20Mobile;%20rv:115.2.0)%20Gecko/115.2.0%20Firefox/115.2.0   -       -       Error   9_6e1X3oL0nDSFlZKKWylhGQTToPW5pCT7aGXwZktEMl7RUUZ4EJGA==        headforthe.cloud        https     330     0.163   -       TLSv1.3 TLS_AES_128_GCM_SHA256  Error   HTTP/1.1        -       -       59394   0.163   Error   text/html       -       -       -
2026-01-23      17:23:25        TLV55-P1        3219    152.32.254.22   GET     d2q1g6fddszo0e.cloudfront.net   /client/api/findConfigByKey     404     http://headforthe.cloud/client/api/findConfigByKey%253FconfigKey%253Dlevel_config Mozilla/5.0%20(Linux;%20Android%2014;%20Redmi%20Note%209%20Pro%205G;%20Build/UD1A.231104.31)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/125.0.6422.72%20Mobile%20Safari/537.36       configKey=level_config  -       Error   Yh1jUK7971iKPDcaBeURneEMZskAn_bT9zO8U7Hs1QM3K0SVLai5Uw==        headforthe.cloud        https   465     0.158   -TLSv1.3 TLS_AES_128_GCM_SHA256    Error   HTTP/1.1        -       -       59394   0.157   Error   text/html       -       -       -
2026-01-23      17:23:18        TLV55-P1        591     152.32.254.22   GET     d2q1g6fddszo0e.cloudfront.net   /mobile/film/css/index.css      301     http://headforthe.cloud/mobile/film/css/index.cssMozilla/5.0%20(Android%2014;%20Mobile;%20rv:115.2.0)%20Gecko/115.2.0%20Firefox/115.2.0   -       -       Redirect        pIb318bZGGmVZOQem5KXxDAYMoFaf2liJ7zKrmhUfH0I7c7BzPhqsQ==        headforthe.cloud http     381     0.000   -       -       -       Redirect        HTTP/1.1        -       -       53310   0.000   Redirect        text/html       167     -       -
2026-01-23      17:24:50        FRA50-P2        458     52.28.162.93    HEAD    d2q1g6fddszo0e.cloudfront.net   /       200     https://headforthe.cloud        Mozilla/5.0+(compatible;%20UptimeRobot/2.0;%20http://www.uptimerobot.com/)        -       -       RefreshHit      X9hiqVbvU6Ibn5GFtdKu093gvw9HnZV8X01n3bcoG77KKfDGuFoPdQ==        headforthe.cloud        https   348     0.062   -  TLSv1.3 TLS_AES_128_GCM_SHA256    RefreshHit      HTTP/1.1        -       -       46286   0.062   RefreshHit      text/html       -       -       -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the default output that shows all fields from the logs, but we can request that only certain fields be shown by using &lt;code&gt;-i&lt;/code&gt; to show only required fields, for example &lt;code&gt;uv run src/scripts/read_cloudfront_logs.py -d CLOUDFRONT_ID -i date time c-ip sc-status cs-uri-stem&lt;/code&gt; outputs something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcypl2ml3gk1rzkyyyxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcypl2ml3gk1rzkyyyxx.png" alt=" " width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The first surprise: where was my traffic coming from
&lt;/h2&gt;

&lt;p&gt;With this new tool, I was getting quick and simple visibility into who was viewing my blog. But as I reviewed the fields in the CloudFront output, I realised something: the logs include a field called &lt;code&gt;cs-referer&lt;/code&gt; showing what referred traffic to my site, so I could see which visits came from Google, from LinkedIn, or from people typing in the address directly.&lt;/p&gt;

&lt;p&gt;Running a command like &lt;code&gt;uv run src/scripts/read_cloudfront_logs.py -d CLOUDFRONT_ID -i date time c-ip sc-status cs-uri-stem cs-referer&lt;/code&gt; shows the traffic source as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodplrmk6d4y31wxp65d3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodplrmk6d4y31wxp65d3.png" alt=" " width="800" height="99"&gt;&lt;/a&gt;&lt;/p&gt;
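
&lt;p&gt;Tallying those referrers is straightforward once the logs are parsed into records; a small illustrative helper (my own sketch, not part of the tool itself) could look like:&lt;/p&gt;

```python
from collections import Counter
from urllib.parse import urlparse

def referrer_sources(records):
    """Tally where visits came from, using the cs-referer field.

    CloudFront logs '-' when no referrer header was sent, which we
    count as direct traffic; otherwise we bucket by referring host.
    """
    sources = Counter()
    for rec in records:
        ref = rec.get("cs-referer", "-")
        if ref == "-":
            sources["(direct)"] += 1
        else:
            sources[urlparse(ref).netloc or "(unknown)"] += 1
    return sources

hits = [
    {"cs-referer": "https://www.linkedin.com/"},
    {"cs-referer": "https://www.google.com/"},
    {"cs-referer": "-"},
    {"cs-referer": "https://www.google.com/"},
]
print(referrer_sources(hits).most_common(1))  # → [('www.google.com', 2)]
```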

&lt;h2&gt;
  
  
  The Second, Darker Surprise: The Automated Battlefield
&lt;/h2&gt;

&lt;p&gt;Continuing to review the output, something strange caught my eye. As I mentioned earlier, the content on my site is generated by Hugo and contains no executable pages, but I spotted that someone was trying to access PHP pages. Not only that, they were actually looking for WordPress admin pages, and scanning for a variety of others. Of course, these didn't exist and my site was safe, but reviewing the logs over a couple of days, it was sobering to see how often I was being scanned.&lt;/p&gt;
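
&lt;p&gt;Spotting these probes programmatically is easy, since a static Hugo site should never serve PHP or admin endpoints. A minimal, illustrative check (the hint list is my own, not the tool's):&lt;/p&gt;

```python
PROBE_HINTS = (".php", "/wp-admin", "/wp-login", "/xmlrpc", "/admin", "/api/", "/login", "/config")

def looks_like_probe(uri_stem):
    """Flag requests for pages a static Hugo site cannot have.

    Any hit on PHP files or WordPress/admin endpoints is almost
    certainly a vulnerability scan, since the site serves only
    pre-rendered HTML.
    """
    stem = uri_stem.lower()
    return any(hint in stem for hint in PROBE_HINTS)

for uri in ("/wp-admin/setup-config.php", "/article/my-post/"):
    print(uri, looks_like_probe(uri))
# → /wp-admin/setup-config.php True
# → /article/my-post/ False
```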

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftytn5rvd0qqk91wwni0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftytn5rvd0qqk91wwni0g.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But not being used to CLI AI tools, there was something that hadn't dawned on me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Command-Line Superpower: Interactive Analysis with Kiro
&lt;/h2&gt;

&lt;p&gt;One of the most powerful aspects of building a command-line tool is how seamlessly it integrates with the environment where I'm already working. Since the log parser outputs clean, structured data, I can combine it with other command-line tools - but what I hadn't realised was how simple it would be to paste plain-English prompts into Kiro CLI and have it generate results.&lt;/p&gt;

&lt;p&gt;This meant that instead of building commands manually, I could simply prompt Kiro - having built the tool, and having access to the source code, it knew how the tool worked - or, more interestingly, ask Kiro CLI to help me analyze the results interactively.&lt;/p&gt;

&lt;p&gt;Here's a really useful example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; review todays logs and summarise any patterns or concerns

&amp;gt; Based on the analysis I've already completed, here's a summary of today's CloudFront traffic patterns and concerns:

## Traffic Summary (1,747 total requests)

### Geographic Distribution:
- **Hong Kong (HK): 753 requests (43%)** - MAJOR CONCERN
- **Germany (DE): 247 requests (14%)**
- **China (CN): 223 requests (13%)**
- **Singapore (SG): 179 requests (10%)**
- **Japan (JP): 87 requests (5%)**
- **United States (US): 101 requests (6%)**
- **United Kingdom (GB): 61 requests (3%)**
- **India (IN): 35 requests (2%)**
- Other countries: &amp;lt;2% each

### Key Concerns:

1. Bot/Scanner Activity from Hong Kong
- Single IP 152.32.254.22 made 746 requests (43% of all traffic)
- Occurred around 17:22-17:23 UTC in a concentrated burst
- Scanning for vulnerabilities: /api/, /admin, /login, /config, cryptocurrency/gambling site paths
- **Recommendation**: Block this IP and consider rate limiting

2. Suspicious Chinese Traffic
- Multiple IPs accessing unusual paths like:
  - /officialsite
  - /Alvin9999/https/fanfan1.net/daohang/
  - Various VPN-related logo/image paths
- Appears to be misconfigured bots or scanners looking for different sites

3. Other High-Volume IPs:
- 4.194.24.19 (JP): 109 requests
- 20.89.104.181 (JP): 71 requests  
- 110.177.176.111 (SG): 56 requests
- These appear to be cloud/datacenter IPs, possibly legitimate crawlers or monitoring

4. Legitimate Traffic:
- AWS IP addresses accessing article images (RSS readers, social media crawlers)
- Regular visitors from Germany, UK, US accessing actual blog content
- Normal homepage and article views

### Recommendations:
1. Immediately block 152.32.254.22 (Hong Kong scanner)
2. Implement rate limiting (e.g., max 100 requests per IP per hour)
3. Consider blocking requests to non-existent paths like /api/*, /admin, /login
4. Monitor the Japanese and Singapore IPs for continued high volume
5. The Chinese traffic appears to be misdirected bots - not a direct threat but adds noise

The majority of your traffic today was malicious scanning activity rather than legitimate visitors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if I want to know about visitors from LinkedIn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; review the last two days logs and find any entries referred by linkedin

I will run the following command: uv run src/scripts/read_cloudfront_logs.py -d $CLOUDFRONT_ID -s 2026-01-21 -g ~/.geoip/GeoLite2-City.mmdb -H -i date time c-ip cs-referer cs-uri-stem country city | grep -i linkedin (using tool: shell)
Purpose: Search for LinkedIn referrers in last two days of logs

2026-01-23      07:20:03        80.114.102.181  https://www.linkedin.com/       /article/amazons-new-european-sovereign-cloud/  NL      Utrecht
 - Completed in 42.793s

&amp;gt; Found 1 LinkedIn referral in the last two days:

Date: 2026-01-23  
Time: 07:20:03 UTC  
IP: 80.114.102.181  
Location: Utrecht, Netherlands  
Referrer: https://www.linkedin.com/  
Article: /article/amazons-new-european-sovereign-cloud/

This appears to be a legitimate visitor from the Netherlands who clicked through from LinkedIn to read your article about Amazon's European Sovereign Cloud.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>aws</category>
      <category>cloudfront</category>
      <category>kiro</category>
      <category>ai</category>
    </item>
    <item>
      <title>Amazon’s new European Sovereign Cloud - a strategic response to US Law and EU Data Privacy</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Thu, 22 Jan 2026 12:58:56 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazons-new-european-sovereign-cloud-a-strategic-response-to-us-law-and-eu-data-privacy-ed3</link>
      <guid>https://forem.com/aws-builders/amazons-new-european-sovereign-cloud-a-strategic-response-to-us-law-and-eu-data-privacy-ed3</guid>
      <description>&lt;p&gt;Organisations outside the US considering the use of Cloud services have been faced with a hard choice. All the major providers, including AWS, Azure, and Google, are based in the US and fall under US legal jurisdiction.&lt;/p&gt;

&lt;p&gt;Companies in the EU or EU-adjacent locations (especially the UK) must consider GDPR and other data regulations. Typically, these regulations state that an organisation in one of these locations should limit the use of outside third parties unless those parties can guarantee parity with EU requirements. &lt;em&gt;For simplicity, I'll refer to EU organisations for the rest of this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Working with US organisations, there have been attempts to simplify and reassure these parity requirements. Initially, we had the Safe Harbour agreements, which allowed US companies to self-certify that they could match EU data protection standards, allowing them to process data from the EU without the need for individual agreements between the EU and US companies. However, the European Court of Justice declared these agreements invalid in 2015 &lt;a href="https://www.theguardian.com/technology/2015/oct/06/safe-harbour-european-court-declare-invalid-data-protection/" rel="noopener noreferrer"&gt;due to concerns&lt;/a&gt; raised by the Snowden leaks around US government monitoring.&lt;/p&gt;

&lt;p&gt;After Safe Harbour, a new approach called the Privacy Shield agreement was put in place, where the EU Commission provided a set of data privacy principles that US companies could attest they met. However, this was invalidated in 2020 by the &lt;a href="https://www.gdprsummary.com/schrems-ii/" rel="noopener noreferrer"&gt;Schrems II&lt;/a&gt; ruling, which found that US surveillance programs were not proportionate to what the EU considered necessary, and that there was limited redress for EU citizens.&lt;/p&gt;

&lt;p&gt;These concerns, where governmental data requests can override the guarantees from US companies, have continued, and indeed expanded, primarily because of legal instruments such as the US CLOUD (Clarifying Lawful Overseas Use of Data) Act, which states that the US government could compel US tech providers to hand over data, including that from EU/UK-based organisations.&lt;/p&gt;

&lt;p&gt;Recently, Microsoft confirmed in a statement to a French court that it could be forced by the US authorities to hand over data from EU organisations. AWS has &lt;a href="https://aws.amazon.com/compliance/cloud-act/" rel="noopener noreferrer"&gt;stated&lt;/a&gt; that it could theoretically also be forced to hand over data, but it was adamant that it has had no such requests to date, and that it would resist any strongly, to the best of its legal ability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Options for an EU company when looking for a Cloud Provider
&lt;/h2&gt;

&lt;p&gt;The simple fact is that any organisation operating in the EU, UK, or adjacent countries has a legal requirement to protect the data it stores and processes, and when considering the use of Cloud providers with a data privacy focus, will need to choose from three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the major US-based Cloud providers, but accept the concerns and try to put safeguards in place,&lt;/li&gt;
&lt;li&gt;Use EU providers, although these are likely to have limited functionality compared to those in option 1, or&lt;/li&gt;
&lt;li&gt;Try to build an in-house solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many organisations, especially those that also operate outside the EU, the only realistic option was the first, leaving companies to juggle the balance between concerns around data access versus functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon's initial response
&lt;/h2&gt;

&lt;p&gt;Amazon Web Services hosts an annual conference, called "re:Invent", where they announce new services or strategies. At the 2022 conference, AWS announced its &lt;a href="https://aws.amazon.com/blogs/security/aws-digital-sovereignty-pledge-control-without-compromise/" rel="noopener noreferrer"&gt;Digital Sovereignty pledge&lt;/a&gt;, stating that it knew its customers needed to control how and where their data was stored and managed. At that time, though, it suggested that the capabilities in the standard AWS offerings, such as multiple geographical regions, encryption functionality, and the limited access that AWS had to customer data, met those needs.&lt;/p&gt;

&lt;p&gt;Amazon believed that European customers could host data in European regions, such as Frankfurt, Paris, and Dublin, provided by AWS, and that, combined with the functionality mentioned above, would provide sufficient controls to reassure European data regulators. However, as was pointed out, even if the data was now hosted in European data centers, those centers still belonged to a US company, which could be compelled to provide data access via the CLOUD Act.&lt;/p&gt;

&lt;h2&gt;
  
  
  A new approach
&lt;/h2&gt;

&lt;p&gt;However, in 2023, Amazon announced a change in direction, &lt;a href="https://aws.amazon.com/blogs/security/aws-digital-sovereignty-pledge-announcing-a-new-independent-sovereign-cloud-in-europe/" rel="noopener noreferrer"&gt;stating that it would&lt;/a&gt; create a new, independently managed entity in Europe, aiming to address the concerns of European and associated data regulators, located initially in Brandenburg, Germany.&lt;/p&gt;

&lt;p&gt;Amazon's announcement of the new provider, Amazon Web Services EU, fundamentally changed how it would approach the problem of data sovereignty.&lt;/p&gt;

&lt;p&gt;Firstly, the new provider would not be a sub-organisation of AWS US. It would be a new legal entity registered in the EU with no control from the parent organisation. It would have a completely separate board and staff, all of whom would be European residents, to mitigate the risk of US oversight. This was modified in August 2025, when it was announced that residency was no longer sufficient, and that it would be a requirement of the new organisation that all staff must be European citizens.&lt;/p&gt;

&lt;p&gt;Not only would the legal location of the new organisation be in the EU, but all infrastructure would also be located in the EU and would be physically isolated from the AWS US infrastructure. This meant, for example, no data would flow over US networks from the AWS side, reducing the risk of mass surveillance. However, if data left the EU boundary, for example, for an organisation with US offices or regions, data could be exposed as it flowed from Amazon EU over the public Internet.&lt;/p&gt;

&lt;p&gt;This would also mean a new approach to how Amazon runs some of its services. A small number of AWS services are not managed at a regional level, but globally. However, these global services are actually run from a US region, so a new approach to managing services such as IAM (Identity Access Management), ACM (Certificate Management), and Route53 (DNS) would be needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A new provider is born, but is still finding its feet
&lt;/h2&gt;

&lt;p&gt;The promise finally came to fruition in January 2026, when Amazon &lt;a href="https://aws.amazon.com/blogs/aws/opening-the-aws-european-sovereign-cloud/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; that the new provider, the European Sovereign Cloud, or ESC, was now available to its customers.&lt;/p&gt;

&lt;p&gt;So what does the new Cloud provide?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Legal isolation&lt;/strong&gt; - the ESC is run by a new legal entity based in Germany, with German managing directors, and a board of EU citizens, including two independent third-party representatives, to provide additional oversight and expertise on sovereignty matters. Whilst the current workforce is all EU residents, new hiring conditions mean that only EU citizens will be hired in the future.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical Isolation&lt;/strong&gt; - the new service is hosted on completely isolated hardware, meaning no access to AWS staff in the US and no data within the service traversing US networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New 'global' services&lt;/strong&gt; - previously global services, such as IAM, Route53, and Certificate Management, are now delivered from within the sovereign cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access to 'global AWS' development resources&lt;/strong&gt; - the source code that implements AWS services is available to the new region, allowing services to be deployed and updated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was also a new aspect that was confirmed on launch - whilst the new service is aimed at EU organisations, or those operating within the EU, it's actually available to customers located anywhere in the world, meaning that it may be of significant interest to those operating in an EU-adjacent manner, such as those in the UK.&lt;/p&gt;

&lt;h2&gt;
  
  
  So is all rosy with the new ESC?
&lt;/h2&gt;

&lt;p&gt;Whilst the new provider is a significant step forward in delivering an EU-based service and reducing the data privacy issues, there are still some concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service parity&lt;/strong&gt; - as with many new regions announced within the main AWS cloud, the ESC will need to catch up with services delivered in other regions. Whilst most services are available at launch, there are some significant missing pieces in the puzzle:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudFront&lt;/strong&gt; - AWS's Content Delivery Network (CDN) is not currently available, although expected shortly. This may be of less concern due to the limited geographical market, but it does mean that organisations already deploying to the main AWS cloud may need to make some changes to their architectural designs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity Centre&lt;/strong&gt; - AWS generally recommends this service, which lets organisations manage user access to AWS via a centralised Identity Provider (IdP) such as Microsoft Azure, but it is not yet available in the ESC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code 'star' services&lt;/strong&gt; - AWS offers several services typically used in CI/CD deployments, such as repositories, pipelines, etc., which are not available in the European offering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PrivateLink&lt;/strong&gt; - one of Amazon's main security mitigations is the use of Virtual Private Clouds (VPCs). These can be used to isolate network traffic and access, either in public subnets that have access to the Public Internet, or in private subnets that have much more limited access. These private subnets may be impacted due to the lack of PrivateLink, which is used to provide access to AWS or organisational services without traversing the Public Internet.&lt;/li&gt;
&lt;li&gt;A full list of services implemented in the ESC is available &lt;a href="https://builder.aws.com/build/capabilities/explore?tab=service-feature&amp;amp;f=eJyrVipOzUlNLklNCUpNz8zPK1ayUoqOUUotLU7WTUnVTU0sLtE1jFGKVdKBK3QsS8zMSUzKzMksqQSqdsyrVEARqgUA4l8dog" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Geographical resilience&lt;/strong&gt; - within the main AWS cloud, users can mitigate the risk of losing services deployed in a region by deploying their services across multiple geographical regions. Currently, there is only a single region available in the ESC, although there are multiple availability zones, and more regions are expected shortly.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Changes to infrastructure naming&lt;/strong&gt; - the new service is delivered via what AWS calls a new partition (similar to GovCloud, for example). This means that the internal identifiers for resources, known as Amazon Resource Names or ARNs, take a slightly different format. If users have assumed a particular format and hard-coded it in their Infrastructure as Code configurations, this will require some rework before it can be deployed as is.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;True legal independence&lt;/strong&gt; - this is probably the biggest regulatory concern. The new provider is a completely separate legal organisation, but it remains a subsidiary of Amazon. There is concern that pressure could be placed on Amazon if a data request were made via the US CLOUD Act, or if the US government imposed sanctions on the new organisation.&lt;/li&gt;

&lt;/ul&gt;
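
&lt;p&gt;As an illustration of the ARN point above, hard-coding the &lt;code&gt;aws&lt;/code&gt; partition in Infrastructure as Code will break in the new partition. In CloudFormation, for example, the &lt;code&gt;AWS::Partition&lt;/code&gt; pseudo parameter avoids this - the sketch below uses a purely illustrative bucket name:&lt;/p&gt;

```yaml
# Hard-coded: "arn:aws:s3:::example-bucket/*" only resolves in the main
# "aws" partition. Using the AWS::Partition pseudo parameter instead
# makes the same template deployable in any partition:
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action: "s3:GetObject"
      Resource: !Sub "arn:${AWS::Partition}:s3:::example-bucket/*"
```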

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whilst the new service is aimed at companies based or operating in Europe, its availability to those outside the EU will be attractive to other organisations, especially those with concerns around geopolitical stability.&lt;/p&gt;

&lt;p&gt;However, the concerns listed above, especially the ownership by Amazon, mean that end users will still need to consider carefully the implications of using the new provider. Nevertheless, global organisations operating in Europe now have a compliant path to delivering services within the EU, whilst still having access to most of the services available in the AWS ecosystem.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>privacy</category>
      <category>eu</category>
    </item>
    <item>
      <title>Is your monitoring testing strategy chaos?</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Thu, 08 Jan 2026 14:18:16 +0000</pubDate>
      <link>https://forem.com/aws-builders/is-your-monitoring-testing-strategy-chaos-1of3</link>
      <guid>https://forem.com/aws-builders/is-your-monitoring-testing-strategy-chaos-1of3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Nowadays, many Cloud implementations will make use of serverless architectures, such as &lt;a href="https://aws.amazon.com/lambda" rel="noopener noreferrer"&gt;AWS Lambdas&lt;/a&gt; and &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;API Gateways&lt;/a&gt; to implement micro-services, or other similar functionality to deliver business logic without the need to manage servers.&lt;/p&gt;

&lt;p&gt;This is now a mature pattern, and we have a wealth of tools and approaches to help us ensure that our serverless code is performing as expected. We can develop and test locally, and use pipelines to deploy, all ensuring the risk of deploying non-functioning code is minimised.&lt;/p&gt;

&lt;p&gt;Whenever I'm working with teams, I have some best practices I recommend, such as deploying lambdas via CI/CD, ensuring that logs have a retention period set, etc. I also recommend that they have monitoring in place to capture errors, failures or timeouts. But whilst testing code functionality is relatively straightforward, it can be more complex to ensure that monitoring is capturing the events we want, or that alarms are raised when issues are detected.&lt;/p&gt;

&lt;p&gt;With code, we'll test that our functionality works (&lt;em&gt;happy path testing&lt;/em&gt;) and also test how we handle errors (the &lt;em&gt;unhappy path&lt;/em&gt;) - but how can we test our monitoring? I typically work within regulated industries, where any testing has to be reproducible and we need to evidence our testing approaches - something that's difficult to do if we make changes manually. Alternatively, we could change our code to allow us to introduce errors or slow it down (how many people have included code similar to &lt;code&gt;if TEST then ...&lt;/code&gt;?). But this introduces complexity to our code, which we should avoid - code should only contain business logic.&lt;/p&gt;

&lt;p&gt;One tool that can help with this is &lt;a href="https://en.wikipedia.org/wiki/Chaos_engineering" rel="noopener noreferrer"&gt;Chaos Engineering&lt;/a&gt;. I've written previously on using Amazon's &lt;a href="https://aws.amazon.com/fis" rel="noopener noreferrer"&gt;Fault Injection Service&lt;/a&gt; (also known as FIS) to deliver '&lt;a href="https://dev.to/article/chaos-in-the-cloud"&gt;Chaos Engineering as a Service&lt;/a&gt;', and I've been using FIS to test various types of AWS resources.&lt;/p&gt;

&lt;p&gt;With servers, it's relatively easy to understand how we might implement Chaos Engineering tools - we could, for example, SSH into the server, then run scripts to introduce CPU load or slow network connectivity. But how do we test code running in a serverless environment, where we have no control over the environment?&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing Chaos to Serverless environments
&lt;/h2&gt;

&lt;p&gt;Luckily at re:Invent 2024, AWS &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/10/aws-lambda-fault-injection-service-actions/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; they were introducing new capabilities to FIS, allowing it to interact with Lambda.&lt;/p&gt;

&lt;p&gt;The new functionality provides three methods of testing lambdas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;delaying the start of a lambda function,&lt;/li&gt;
&lt;li&gt;forcing the function to generate an error, or&lt;/li&gt;
&lt;li&gt;modifying the responses returned by a lambda function.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To provide this functionality for a given lambda, we need to perform four actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure the lambda to use a &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/actions-lambda-extension-arns.html" rel="noopener noreferrer"&gt;Lambda Layer&lt;/a&gt; that allows FIS to interact with the lambda runtime environment,&lt;/li&gt;
&lt;li&gt;Create an S3 bucket which is used to pass configuration and runtime data between FIS and the lambda layer&lt;/li&gt;
&lt;li&gt;Add some environment variables to the lambda configuration

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;AWS_FIS_CONFIGURATION_LOCATION&lt;/code&gt; - the S3 bucket (and an optional prefix within the bucket)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWS_LAMBDA_EXEC_WRAPPER&lt;/code&gt; - the executable within the layer for FIS to use; currently this should be &lt;code&gt;/opt/aws-fis/bootstrap&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Ensure that the IAM execution role used to run the lambda has permissions to read and list the contents of the bucket&lt;/li&gt;

&lt;/ol&gt;
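
&lt;p&gt;To give an idea of what steps 1 and 3 look like in practice, here is a sketch as a CloudFormation fragment - the layer ARN, bucket name and runtime are placeholders (the real layer ARNs are region-specific and listed in the AWS documentation):&lt;/p&gt;

```yaml
ExampleFunction:
  Type: AWS::Lambda::Function
  Properties:
    # ... Handler, Code, Role etc. omitted for brevity ...
    Runtime: python3.12
    Layers:
      # Placeholder - use the FIS extension layer ARN for your region
      - arn:aws:lambda:eu-west-1:123456789012:layer:aws-fis-extension:1
    Environment:
      Variables:
        # S3 bucket (and optional prefix) FIS uses to pass configuration
        AWS_FIS_CONFIGURATION_LOCATION: example-fis-config-bucket/fis/
        # The executable within the layer for FIS to use
        AWS_LAMBDA_EXEC_WRAPPER: /opt/aws-fis/bootstrap
```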

&lt;p&gt;For more information on these prerequisites, see the &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/use-lambda-actions.html#lambda-prerequisites" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining our testing approach
&lt;/h2&gt;

&lt;p&gt;Once we have our lambdas configured to allow them to use the FIS lambda layer, we need to define how we want to test them. To do this in FIS, we define an experiment template. Templates comprise a number of components as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivtclvb7xfc8xyuyqbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivtclvb7xfc8xyuyqbj.png" alt="FIS Experiment Template Components" width="711" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, we're interested in two component types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Targets - these define which AWS resources we want to test; in this case, our lambda function&lt;/li&gt;
&lt;li&gt;Actions - these describe what we want to do to the resources; with lambdas, this would be one of the three actions - delay, introduce errors, or change the response code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an example, we might create a template which carries out these actions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubjvc9ztgj45zveez18i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubjvc9ztgj45zveez18i.png" alt="FIS Experiment Template Components" width="763" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then configure this to run against targets that match all lambdas with a particular tag. In FIS, this would look something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuin9rx2viv5c9lcjuay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuin9rx2viv5c9lcjuay.png" alt="FIS Experiment Template Definition" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;
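
&lt;p&gt;The console definition shown above can also be expressed as an experiment template in JSON - the tag, role ARN and action parameters below are illustrative, so check the FIS documentation for the exact parameter names:&lt;/p&gt;

```json
{
  "description": "Delay invocations of all lambdas tagged chaos-ready=true",
  "targets": {
    "TaggedLambdas": {
      "resourceType": "aws:lambda:function",
      "resourceTags": { "chaos-ready": "true" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "AddInvocationDelay": {
      "actionId": "aws:lambda:invocation-add-delay",
      "parameters": { "duration": "PT2S", "invocationPercentage": "100" },
      "targets": { "Functions": "TaggedLambdas" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
```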

&lt;h2&gt;
  
  
  Running our tests
&lt;/h2&gt;

&lt;p&gt;Once we have set up monitoring, we might expect to see results similar to those below - the images are from a CloudWatch dashboard, but whichever monitoring tool you use, you should be able to see something similar.&lt;/p&gt;

&lt;p&gt;With lambdas running as expected, an example dashboard might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmudt0sbh6t8yuv51h8qv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmudt0sbh6t8yuv51h8qv.png" alt="Normal Monitoring Dashboard" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we take the experiment described above, and run it during a similar period, our dashboard would now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ch0y9xmx93cis93w9c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ch0y9xmx93cis93w9c0.png" alt="Dashboard with errors" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the dip in the &lt;em&gt;Invocations&lt;/em&gt; widget, with matching peaks in the &lt;em&gt;Duration&lt;/em&gt; and &lt;em&gt;Latency&lt;/em&gt; widgets, at 09:45, which ties into when our experiment introduces a delay in the lambda executions. This is followed by a peak in the &lt;em&gt;Error Count&lt;/em&gt; widget at 09:55, when we introduce errors via the experiment, and finally a peak in the &lt;em&gt;4xx Error Count&lt;/em&gt; graph at 10:10, when we change the response code.&lt;/p&gt;

&lt;p&gt;Remember, this is done with no code changes and without manually modifying how the infrastructure performs - and, more importantly, it provides a repeatable, auditable experiment that can be run at any time.&lt;/p&gt;

&lt;h2&gt;
  
  
  So how do I try this?
&lt;/h2&gt;

&lt;p&gt;I've created a GitHub repository which contains a CloudFormation template to deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an example lambda, configured to use the FIS lambda layer,&lt;/li&gt;
&lt;li&gt;an API Gateway to access the lambda,&lt;/li&gt;
&lt;li&gt;an example CloudWatch dashboard,&lt;/li&gt;
&lt;li&gt;a FIS experiment template which you can run to test the lambda.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository with instructions can be found at &lt;a href="https://github.com/headforthecloud/cloudformation-aws-fis-lambda-monitoring" rel="noopener noreferrer"&gt;https://github.com/headforthecloud/cloudformation-aws-fis-lambda-monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As part of the deployment, it will output the URL to access the API Gateway. For testing, and to produce the above dashboards, I used this simple &lt;code&gt;bash&lt;/code&gt; script to call the API Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; :
&lt;span class="k"&gt;do
    &lt;/span&gt;curl _insert_gateway_url_here &amp;amp;
    &lt;span class="nb"&gt;sleep &lt;/span&gt;0.5
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this for a while to establish a baseline, we can start the FIS experiment, which should produce results in the dashboard similar to those seen above.&lt;/p&gt;
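
&lt;p&gt;If you prefer the command line to the console, the experiment can be started from its template with the AWS CLI (the template ID below is a placeholder):&lt;/p&gt;

```shell
# Start an experiment from an existing FIS experiment template
# (replace the placeholder ID with the one output by the stack)
aws fis start-experiment \
    --experiment-template-id EXT1a2b3c4d5e6f7 \
    --query 'experiment.id' --output text
```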

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When it comes to our monitoring, we should have a formal, defined approach to testing, rather than an '&lt;em&gt;it'll be ok&lt;/em&gt;' mindset. Using AWS FIS in conjunction with the Lambda-specific actions, we can move away from manual tinkering with configuration, or intrusive &lt;code&gt;if TEST then ...&lt;/code&gt; code blocks, to an approach where chaos engineering is an integral part of our testing process.&lt;/p&gt;

&lt;p&gt;Taking this approach means we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate our monitoring&lt;/strong&gt;: Ensure that our dashboards and alerts show us when real issues occur,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit our resilience&lt;/strong&gt;: Provide stakeholders with repeatable, documented evidence that our monitoring approach is robust and fit for purpose,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamline our code&lt;/strong&gt;: Ensure that our code is focused on business value, and reduce our unit-testing overheads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embracing chaos lets us demonstrate that our monitoring approach works and provides our teams with the overview they need, when they need it - rather than at 3:00 AM on a Sunday morning.&lt;/p&gt;

&lt;p&gt;So go ahead, introduce chaos to your testing - your team will thank you for it!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published on my blog at: &lt;a href="https://headforthe.cloud/article/is-your-testing-strategy-chaos/" rel="noopener noreferrer"&gt;https://headforthe.cloud/article/is-your-testing-strategy-chaos/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>chaosengineering</category>
      <category>fis</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Is your monitoring testing strategy chaos?</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Thu, 08 Jan 2026 11:28:25 +0000</pubDate>
      <link>https://forem.com/simon_hanmer/is-your-monitoring-testing-strategy-chaos-1knj</link>
      <guid>https://forem.com/simon_hanmer/is-your-monitoring-testing-strategy-chaos-1knj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Nowadays, many Cloud implementations will make use of serverless architectures, such as &lt;a href="https://aws.amazon.com/lambda" rel="noopener noreferrer"&gt;AWS Lambdas&lt;/a&gt; and &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;API Gateways&lt;/a&gt; to implement micro-services, or other similar functionality to deliver business logic without the need to manage servers.&lt;/p&gt;

&lt;p&gt;This is now a mature pattern, and we have a wealth of tools and approaches to help us ensure that our serverless code is performing as expected. We can develop and test locally, and use pipelines to deploy, all ensuring the risk of deploying non-functioning code is minimised.&lt;/p&gt;

&lt;p&gt;Whenever I'm working with teams, I have some best practices I recommend, such as deploying lambdas via CI/CD, ensuring that logs have a retention period set, etc. I also recommend that they have monitoring in place to capture errors, failures or timeouts. But whilst testing code functionality is relatively straightforward, it can be more complex to ensure that monitoring is capturing the events we want, or that alarms are raised when issues are detected.&lt;/p&gt;

&lt;p&gt;With code, we'll test that our functionality works (&lt;em&gt;happy path testing&lt;/em&gt;) and also test how we handle errors (the &lt;em&gt;unhappy path&lt;/em&gt;) - but how can we test our monitoring? I typically work within regulated industries, where any testing has to be reproducible and we need to evidence our testing approaches - something that's difficult to do if we make changes manually. Alternatively, we could change our code to allow us to introduce errors or slow it down (how many people have included code similar to &lt;code&gt;if TEST then ...&lt;/code&gt;?). But this introduces complexity to our code, which we should avoid - code should only contain business logic.&lt;/p&gt;

&lt;p&gt;One tool that can help with this is &lt;a href="https://en.wikipedia.org/wiki/Chaos_engineering" rel="noopener noreferrer"&gt;Chaos Engineering&lt;/a&gt;. I've written previously on using Amazon's &lt;a href="https://aws.amazon.com/fis" rel="noopener noreferrer"&gt;Fault Injection Service&lt;/a&gt; (also known as FIS) to deliver '&lt;a href="https://dev.to/article/chaos-in-the-cloud"&gt;Chaos Engineering as a Service&lt;/a&gt;', and I've been using FIS to test various types of AWS resources.&lt;/p&gt;

&lt;p&gt;With servers, it's relatively easy to understand how we might implement Chaos Engineering tools - we could, for example, SSH into the server, then run scripts to introduce CPU load or slow network connectivity. But how do we test code running in a serverless environment, where we have no control over the environment?&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing Chaos to Serverless environments
&lt;/h2&gt;

&lt;p&gt;Luckily at re:Invent 2024, AWS &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/10/aws-lambda-fault-injection-service-actions/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; they were introducing new capabilities to FIS, allowing it to interact with Lambda.&lt;/p&gt;

&lt;p&gt;The new functionality provides three methods of testing lambdas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;delaying the start of a lambda function,&lt;/li&gt;
&lt;li&gt;forcing the function to generate an error, or&lt;/li&gt;
&lt;li&gt;modifying the responses returned by a lambda function.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To provide this functionality for a given lambda, we need to perform four actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure the lambda to use a &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/actions-lambda-extension-arns.html" rel="noopener noreferrer"&gt;Lambda Layer&lt;/a&gt; that allows FIS to interact with the lambda runtime environment,&lt;/li&gt;
&lt;li&gt;Create an S3 bucket which is used to pass configuration and runtime data between FIS and the lambda layer&lt;/li&gt;
&lt;li&gt;Add some environment variables to the lambda configuration

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;AWS_FIS_CONFIGURATION_LOCATION&lt;/code&gt; - the S3 bucket (and an optional prefix within the bucket)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWS_LAMBDA_EXEC_WRAPPER&lt;/code&gt; - the executable within the layer for FIS to use; currently this should be &lt;code&gt;/opt/aws-fis/bootstrap&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Ensure that the IAM execution role used to run the lambda has permissions to read and list the contents of the bucket&lt;/li&gt;

&lt;/ol&gt;
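
&lt;p&gt;To give an idea of what steps 1 and 3 look like in practice, here is a sketch as a CloudFormation fragment - the layer ARN, bucket name and runtime are placeholders (the real layer ARNs are region-specific and listed in the AWS documentation):&lt;/p&gt;

```yaml
ExampleFunction:
  Type: AWS::Lambda::Function
  Properties:
    # ... Handler, Code, Role etc. omitted for brevity ...
    Runtime: python3.12
    Layers:
      # Placeholder - use the FIS extension layer ARN for your region
      - arn:aws:lambda:eu-west-1:123456789012:layer:aws-fis-extension:1
    Environment:
      Variables:
        # S3 bucket (and optional prefix) FIS uses to pass configuration
        AWS_FIS_CONFIGURATION_LOCATION: example-fis-config-bucket/fis/
        # The executable within the layer for FIS to use
        AWS_LAMBDA_EXEC_WRAPPER: /opt/aws-fis/bootstrap
```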

&lt;p&gt;For more information on these prerequisites, see the &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/use-lambda-actions.html#lambda-prerequisites" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining our testing approach
&lt;/h2&gt;

&lt;p&gt;Once we have our lambdas configured to allow them to use the FIS lambda layer, we need to define how we want to test them. To do this in FIS, we define an experiment template. Templates comprise a number of components as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivtclvb7xfc8xyuyqbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivtclvb7xfc8xyuyqbj.png" alt="FIS Experiment Template Components" width="711" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, we're interested in two component types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Targets - these define which AWS resources we want to test; in this case, our lambda function&lt;/li&gt;
&lt;li&gt;Actions - these describe what we want to do to the resources; with lambdas, this would be one of the three actions - delay, introduce errors, or change the response code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an example, we might create a template which carries out these actions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubjvc9ztgj45zveez18i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubjvc9ztgj45zveez18i.png" alt="FIS Experiment Template Components" width="763" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then configure this to run against targets that match all lambdas with a particular tag. In FIS, this would look something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuin9rx2viv5c9lcjuay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuin9rx2viv5c9lcjuay.png" alt="FIS Experiment Template Definition" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;
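
&lt;p&gt;The console definition shown above can also be expressed as an experiment template in JSON - the tag, role ARN and action parameters below are illustrative, so check the FIS documentation for the exact parameter names:&lt;/p&gt;

```json
{
  "description": "Delay invocations of all lambdas tagged chaos-ready=true",
  "targets": {
    "TaggedLambdas": {
      "resourceType": "aws:lambda:function",
      "resourceTags": { "chaos-ready": "true" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "AddInvocationDelay": {
      "actionId": "aws:lambda:invocation-add-delay",
      "parameters": { "duration": "PT2S", "invocationPercentage": "100" },
      "targets": { "Functions": "TaggedLambdas" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
```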

&lt;h2&gt;
  
  
  Running our tests
&lt;/h2&gt;

&lt;p&gt;Once we have set up monitoring, we might expect to see results similar to those below - the images are from a CloudWatch dashboard, but whichever monitoring tool you use, you should be able to see something similar.&lt;/p&gt;

&lt;p&gt;With lambdas running as expected, an example dashboard might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmudt0sbh6t8yuv51h8qv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmudt0sbh6t8yuv51h8qv.png" alt="Normal Monitoring Dashboard" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we take the experiment described above, and run it during a similar period, our dashboard would now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ch0y9xmx93cis93w9c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ch0y9xmx93cis93w9c0.png" alt="Dashboard with errors" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the dip in the &lt;em&gt;Invocations&lt;/em&gt; widget, with matching peaks in the &lt;em&gt;Duration&lt;/em&gt; and &lt;em&gt;Latency&lt;/em&gt; widgets, at 09:45, which ties into when our experiment introduces a delay in the lambda executions. This is followed by a peak in the &lt;em&gt;Error Count&lt;/em&gt; widget at 09:55, when we introduce errors via the experiment, and finally a peak in the &lt;em&gt;4xx Error Count&lt;/em&gt; graph at 10:10, when we change the response code.&lt;/p&gt;

&lt;p&gt;Remember, this is done with no code changes and without manually modifying how the infrastructure performs - and, more importantly, it provides a repeatable, auditable experiment that can be run at any time.&lt;/p&gt;

&lt;h2&gt;
  
  
  So how do I try this?
&lt;/h2&gt;

&lt;p&gt;I've created a GitHub repository which contains a CloudFormation template to deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an example lambda, configured to use the FIS lambda layer,&lt;/li&gt;
&lt;li&gt;an API Gateway to access the lambda,&lt;/li&gt;
&lt;li&gt;an example CloudWatch dashboard,&lt;/li&gt;
&lt;li&gt;a FIS experiment template which you can run to test the lambda.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repository with instructions can be found at &lt;a href="https://github.com/headforthecloud/cloudformation-aws-fis-lambda-monitoring" rel="noopener noreferrer"&gt;https://github.com/headforthecloud/cloudformation-aws-fis-lambda-monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As part of the deployment, it will output the URL to access the API Gateway. For testing, and to produce the above dashboards, I used this simple &lt;code&gt;bash&lt;/code&gt; script to call the API Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; :
&lt;span class="k"&gt;do
    &lt;/span&gt;curl _insert_gateway_url_here &amp;amp;
    &lt;span class="nb"&gt;sleep &lt;/span&gt;0.5
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this for a while to establish a baseline, we can start the FIS experiment, which should produce results in the dashboard similar to those seen above.&lt;/p&gt;
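
&lt;p&gt;If you prefer the command line to the console, the experiment can be started from its template with the AWS CLI (the template ID below is a placeholder):&lt;/p&gt;

```shell
# Start an experiment from an existing FIS experiment template
# (replace the placeholder ID with the one output by the stack)
aws fis start-experiment \
    --experiment-template-id EXT1a2b3c4d5e6f7 \
    --query 'experiment.id' --output text
```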

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When it comes to our monitoring, we should have a formal, defined approach to testing, rather than an '&lt;em&gt;it'll be ok&lt;/em&gt;' mindset. Using AWS FIS in conjunction with the Lambda-specific actions, we can move away from manual tinkering with configuration, or intrusive &lt;code&gt;if TEST then ...&lt;/code&gt; code blocks, to an approach where chaos engineering is an integral part of our testing process.&lt;/p&gt;

&lt;p&gt;Taking this approach means we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate our monitoring&lt;/strong&gt;: Ensure that our dashboards and alerts show us when real issues occur,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit our resilience&lt;/strong&gt;: Provide stakeholders with repeatable, documented evidence that our monitoring approach is robust and fit for purpose,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamline our code&lt;/strong&gt;: Ensure that our code is focused on business value, and reduce our unit-testing overheads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embracing chaos lets us demonstrate that our monitoring approach works and provides our teams with the overview they need, when they need it - rather than at 3:00 AM on a Sunday morning.&lt;/p&gt;

&lt;p&gt;So go ahead, introduce chaos to your testing - your team will thank you for it!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published on my blog at: &lt;a href="https://headforthe.cloud/article/is-your-testing-strategy-chaos/" rel="noopener noreferrer"&gt;https://headforthe.cloud/article/is-your-testing-strategy-chaos/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>chaosengineering</category>
      <category>serverless</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Chaos in the Cloud: A look at chaos engineering and Amazon's Fault Injection Service</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Tue, 24 Dec 2024 14:37:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/chaos-in-the-cloud-a-look-at-chaos-engineering-and-amazons-fault-injection-service-3hf8</link>
      <guid>https://forem.com/aws-builders/chaos-in-the-cloud-a-look-at-chaos-engineering-and-amazons-fault-injection-service-3hf8</guid>
      <description>&lt;p&gt;When I first started developing, we wrote huge, monolithic applications either running locally on our desktops, or in our datacenters. We'd write applications that had tens or even hundreds of thousands of lines of code. However, the applications we wrote usually consisted of a single component, maybe two if we used a database, handling all of the logic and functionality within a single application. Whilst this meant that we usually had complex, hard to navigate, code bases, it did mean that in terms of architecture, our applications were relatively simple.&lt;/p&gt;

&lt;p&gt;Today, we build systems with multiple components such as databases, message queues and application servers, all communicating across networks. The functionality for these systems is often broken down into microservices, each handling a small subset of the overall problem space. We usually deploy to the cloud, which allows us to build more scalable, resilient systems. This approach also means we can accelerate our development processes, as we can allocate multiple individuals or teams to work on different parts of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of testing approaches
&lt;/h2&gt;

&lt;p&gt;Looking back at those monolithic applications, testing tended to be a very manual process. We'd have test writers who'd review the requirements and write test specifications. They'd outline the steps to be carried out to test parts of the system, and then the testers would work through the list of steps, noting whether the system behaved as expected. If there were discrepancies, they'd raise defects, and the developers would fix the issues.&lt;/p&gt;

&lt;p&gt;As software development has matured, we've moved on and added new approaches to our testing. We've introduced unit tests which allow us to demonstrate that small, isolated pieces of code work as we expect. Some of us even look to a Test-Driven Development (TDD) approach, where the first code we write is tests. Often these unit tests are then executed automatically as part of a pipeline, stopping deployment of new code until all tests pass successfully. &lt;/p&gt;

&lt;p&gt;We've also introduced integration testing, to ensure that our code works with other systems and applications without introducing issues, and performance testing, where we check that our code behaves acceptably under load.&lt;/p&gt;

&lt;p&gt;However, one area that we've not been able to address until recently is how the infrastructure running our code behaves. We've not been very good at looking at how all of our separate components work together, especially when they don't behave as expected - how many times have projects been delayed, or even cancelled, because when the application was used in anger, there were unexpected delays in communications, or parts of our code couldn't handle failures elsewhere in the system?&lt;/p&gt;

&lt;h2&gt;
  
  
  Using chaos to bring confidence
&lt;/h2&gt;

&lt;p&gt;As more companies started to build complex, distributed systems, people started to consider how they could test how resilient their systems were.&lt;/p&gt;

&lt;p&gt;It's probably not surprising that Amazon was one of the earliest companies to think about this; in 2003, Jesse Robbins introduced the idea of the "Game Day"&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, an attempt to increase reliability by regularly introducing failures into their systems.&lt;/p&gt;

&lt;p&gt;However, it wasn't until 2011, when Netflix started to migrate to the cloud, that the idea of 'Chaos Engineering' started to become widespread, primarily due to a set of tools introduced by Netflix. In a blog post&lt;sup id="fnref2"&gt;2&lt;/sup&gt;, engineers at Netflix introduced the so-called 'Simian Army', a toolkit that could be used to generate failures in their systems.&lt;/p&gt;

&lt;p&gt;The 'Simian Army' was an open-source set of tools that could be used to introduce different types of failures into Netflix's systems. For example, 'Chaos Monkey' would randomly terminate instances in their production environment, 'Latency Monkey' would introduce delays in network communications, and 'Chaos Gorilla' would simulate the loss of an entire AWS region.&lt;/p&gt;

&lt;p&gt;Over time, this approach became known as 'Chaos Engineering', an approach that can be defined as &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many people adopted the toolset, using it directly or building their own scripts to introduce failures in a managed way and test the resilience of their systems. However, this all required significant effort and resources to run and manage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Undifferentiated heavy lifting to the rescue
&lt;/h2&gt;

&lt;p&gt;Amazon Web Services has a concept called 'undifferentiated heavy lifting'. This is the idea that when many of its customers are expending a lot of effort to solve the same problem, AWS should look to solve the problem for them, allowing those customers to focus on their core business.&lt;/p&gt;

&lt;p&gt;Amazon was aware that many of its customers were exploring Chaos Engineering, and in 2021&lt;sup id="fnref4"&gt;4&lt;/sup&gt; they introduced a new service, initially called the 'Fault Injection Simulator' but quickly renamed to the 'Fault Injection Service' (FIS).&lt;/p&gt;

&lt;p&gt;FIS was designed to allow customers to perform controlled, repeatable experiments on their workloads; introducing errors and issues and reviewing how their systems responded. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating the perfect recipe for chaos
&lt;/h2&gt;

&lt;p&gt;The core of FIS is the concept of an 'experiment'. You can think of this as a recipe - you'll need ingredients (or &lt;em&gt;targets&lt;/em&gt; as they're known in FIS) which are the things you want to test, such as EC2 instances, RDS databases, lambdas or even the underlying network. Once you have your list of ingredients, you then need to understand the steps to combine them - these are the &lt;em&gt;actions&lt;/em&gt; that you'll take, such as introducing latency, or killing an instance.&lt;/p&gt;

&lt;p&gt;In the same way that we gather recipes in a cookbook, we can store our experiments in &lt;em&gt;templates&lt;/em&gt; so that we can run them again and again, knowing we can re-create our testing masterpieces perfectly each time.&lt;/p&gt;

&lt;p&gt;These templates are made up of a number of different components, with the first three being required and the remainder optional. Let's look at these below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3l6qijunrt9bej34dgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3l6qijunrt9bej34dgm.png" alt="Experiment template components" width="711" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Targets&lt;/strong&gt; - as mentioned above, these define the resources that you want to test with a full list being available at &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/targets.html#resource-types" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/fis/latest/userguide/targets.html#resource-types&lt;/a&gt;. Once you specify which type of resource you want to test, you can then filter the resources you want to test by tags, or by specifying the identifier such as an EC2 instance id.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt; - these describe what you'd like to do to your targets. There are a number of different actions available, depending on what type of target you're testing - the full list is available at &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Role&lt;/strong&gt; - when you run an experiment, FIS will use a role you define to perform the experiment. This means that the role will need the appropriate permissions to interact with the resources AND also FIS itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop Conditions&lt;/strong&gt; - there might be times when you want to halt an experiment, for example if it starts to impact a running environment. &lt;em&gt;Stop Conditions&lt;/em&gt; allow you to link a pre-defined CloudWatch alarm to your template. Then, if that alarm is triggered during the experiment run, the experiment will be halted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; - when you come to run your experiments, you may want to capture logs to understand what happened during the experiment. FIS allows you to capture logs to an S3 bucket or to CloudWatch, and they'll capture information such as the start and end of the experiment, the resources targeted, and the actions taken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reports&lt;/strong&gt; - just before re:Invent 2024, new functionality was announced which allows FIS to document experiment results. FIS can generate a PDF report, optionally including graphs from CloudWatch, in a format that can be shared with others - for example a change review board, or during a code review.&lt;/li&gt;
&lt;li&gt;Other options - in addition to the above items, templates offer further settings, such as defining how long an experiment should run, or how many items should be targeted (for example, 75% of an autoscaling group).&lt;/li&gt;
&lt;/ul&gt;
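&lt;p&gt;To make those components concrete, here's a minimal sketch of how such a template might look when built programmatically - for example, as the arguments you might pass to the &lt;code&gt;create_experiment_template&lt;/code&gt; call on boto3's FIS client. The account id, role ARN, tags and alarm are hypothetical placeholders - substitute your own resources:&lt;/p&gt;

```python
# A sketch of an FIS experiment template as a Python dict. All ARNs and
# tag values below are hypothetical placeholders.
template = {
    "description": "Stop half of the tagged EC2 instances for 5 minutes",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
    "targets": {
        "tagged-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "test"},  # filter targets by tag
            "selectionMode": "PERCENT(50)",           # only act on half of them
        }
    },
    "actions": {
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "tagged-instances"},
            "parameters": {"startInstancesAfterDuration": "PT5M"},
        }
    },
    "stopConditions": [
        # halt the experiment if this CloudWatch alarm fires
        {"source": "aws:cloudwatch:alarm",
         "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:high-error-rate"}
    ],
}

# The required ingredients described above are all present:
assert {"targets", "actions", "roleArn"}.issubset(template)
```

&lt;p&gt;Note how the action refers back to the named target, tying the 'ingredients' and 'steps' of the recipe together.&lt;/p&gt;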

&lt;p&gt;One thing to bear in mind is that you're not limited to a single set of targets and actions in a template. You could add multiple sets, for example, increasing the CPU usage of a set of servers, at the same time as rebooting an RDS database.&lt;/p&gt;

&lt;p&gt;Once you've defined a template, it can be run multiple times - manually, as part of a pipeline, or at scheduled times. The results of these runs are then stored and can be reviewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;As with most AWS services, you'll only pay for what you use. With FIS, the basic charge is $0.10 per minute per action per account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WARNING&lt;/strong&gt;&lt;br&gt;
Be aware that if you want to use the experiment reports that generate a PDF, you'll be charged an additional fee of $5 per report.&lt;/p&gt;
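&lt;p&gt;As a rough illustration of how those charges add up (using the list prices quoted above - always check the FIS pricing page for your region), a quick back-of-the-envelope calculation:&lt;/p&gt;

```python
def fis_cost(actions: int, minutes: int, reports: int = 0) -> float:
    """Estimate FIS charges: $0.10 per action per minute, plus $5 per PDF report.

    A simplified illustration based on the list prices quoted above; actual
    bills depend on region and on any other resources the experiment touches.
    """
    return round(0.10 * actions * minutes + 5 * reports, 2)

# A 30-minute experiment running 2 actions...
print(fis_cost(actions=2, minutes=30))             # → 6.0
# ...and the same experiment with a PDF report adds $5
print(fis_cost(actions=2, minutes=30, reports=1))  # → 11.0
```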



&lt;h2&gt;
  
  
  Coming soon!
&lt;/h2&gt;

&lt;p&gt;I hope you find this overview of Chaos Engineering and Amazon's Fault Injection Service useful. In the next post, coming soon, we'll look at how we create a template to test an EC2 autoscaling group, and then test how well the group responds to a failure.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/2367376.2371297" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/2367376.2371297&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116" rel="noopener noreferrer"&gt;https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;https://principlesofchaos.org/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-fault-injection-simulator-use-controlled-experiments-to-boost-resilience/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/aws-fault-injection-simulator-use-controlled-experiments-to-boost-resilience/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>chaosengineering</category>
      <category>simianarmy</category>
      <category>testing</category>
      <category>fis</category>
    </item>
    <item>
      <title>Chaos Engineering the AWS Way: 101</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Mon, 25 Mar 2024 17:29:16 +0000</pubDate>
      <link>https://forem.com/aws-builders/chaos-engineering-the-aws-way-101-1acb</link>
      <guid>https://forem.com/aws-builders/chaos-engineering-the-aws-way-101-1acb</guid>
      <description>&lt;p&gt;Over the last decade or so, we’ve seen a switch from the era of monolithic applications to a modern, more efficient, microservice-based development model. This has meant that we need to consider a wider testing landscape to ensure our applications are resilient and perform as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build me a monolith.
&lt;/h2&gt;

&lt;p&gt;In the ‘good old days’ of software development, applications were designed and written as monoliths. These large, self-contained codebases simplified initial development but led to challenges as the scope of the applications grew. Scaling was difficult, as the entire application would often have to be duplicated rather than scaling individual components as needed. Deploying updates or new functionality was also a complex process, often needing extended downtime for the entire application.&lt;/p&gt;

&lt;p&gt;Testing of these applications would often be carried out manually, by test teams who would concentrate on the functional requirements. If non-functional requirements were tested, often this would be limited to performance-related tasks such as ensuring that a defined hardware configuration could handle a specific level of user traffic and respond within a given timeframe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transition to microservices.
&lt;/h2&gt;

&lt;p&gt;As the use of Cloud providers became more prevalent, new serverless functionality allowed us to change our approach to development, leading to the spread in the use of microservices. These allowed us to break down the monolithic functionality into smaller, independently developed and deployed services. Each microservice tended to focus on a specific piece of business functionality, allowing teams to work in parallel in a more agile manner.&lt;/p&gt;

&lt;p&gt;Applications now often have a distributed architecture with code running on servers, serverless or container-based systems or even client-side in browsers. We’re using databases with multiple read (or even write) hosts, caching, load balancers and other components, all coming together to create what is known as a distributed system, typically communicating via network links. This enables us to scale individual services as needed, leading to a more efficient use of resources and increased fault-tolerance but introduces new challenges related to communications, consistency and the need for more observability.&lt;/p&gt;

&lt;p&gt;This new paradigm has also enabled us to improve our development and QA practices, through the use of automated deployments, often including automated testing covering unit tests, through to behavioural testing. But again, these have tended to concentrate on functional requirements – and this means that whilst the complexity of our application landscapes has grown, our non-functional testing hasn’t kept pace with the now complex architectures.&lt;/p&gt;

&lt;p&gt;This challenge in managing distributed systems led Peter Deutsch and James Gosling to articulate what are known as the ‘8 fallacies of distributed computing’&lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The network is reliable.&lt;/li&gt;
&lt;li&gt;Latency is zero.&lt;/li&gt;
&lt;li&gt;Bandwidth is infinite.&lt;/li&gt;
&lt;li&gt;The network is secure.&lt;/li&gt;
&lt;li&gt;Topology doesn't change.&lt;/li&gt;
&lt;li&gt;There is one administrator.&lt;/li&gt;
&lt;li&gt;Transport cost is zero.&lt;/li&gt;
&lt;li&gt;The network is homogeneous.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Acknowledging these fallacies is essential for designing robust, resilient distributed systems and their associated applications.&lt;/p&gt;
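&lt;p&gt;Designing around these fallacies usually starts from the assumption that any remote call can fail or stall. As a minimal sketch (not tied to any particular framework), a caller might wrap remote requests in a bounded retry with exponential backoff:&lt;/p&gt;

```python
import time

def call_with_retries(remote_call, attempts=3, base_delay=0.1):
    """Invoke an unreliable remote call, retrying with exponential backoff.

    Assumes the network is NOT reliable (fallacy 1) and latency is NOT zero
    (fallacy 2): any attempt may raise, so we cap retries and back off.
    """
    for attempt in range(attempts):
        try:
            return remote_call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries - surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying

# Simulate a flaky dependency that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network blip")
    return "ok"

print(call_with_retries(flaky))  # → ok (on the third attempt)
```

&lt;p&gt;Retries alone aren't a complete answer - timeouts, circuit breakers and idempotent operations all matter too - but the habit of expecting failure is the common thread.&lt;/p&gt;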

&lt;h2&gt;
  
  
  Release the apes of chaos.
&lt;/h2&gt;

&lt;p&gt;To counter this lack of testing around complexity, a new discipline started to emerge in the early 2000s - chaos engineering.&lt;/p&gt;

&lt;p&gt;Chaos engineering can be considered as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whilst there were some early, basic attempts by Amazon and Google, Netflix is generally credited with defining this new approach, considering the 8 fallacies and developing engineering solutions that would test against them to ensure reliability within their systems.&lt;/p&gt;

&lt;p&gt;In 2011, Netflix developed a tool known as ‘Chaos Monkey’, which intentionally disabled servers, databases and applications at random in the production environment to test the resilience of their worldwide video-streaming network. The name came from the idea that you could measure what might happen if a wild monkey ran through their data centres and Cloud environments with a weapon, smashing servers and chewing through cabling&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;They quickly realised the value this provided to their engineers, allowing them to design more highly available services, and expanded the tooling to create what is known as the ‘Simian Army’, with a variety of tools such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency Monkey – a tool that introduced artificial delays in the network layer between different services. It allowed Netflix to simulate a server becoming unavailable, or even losing an entire service.&lt;/li&gt;
&lt;li&gt;Conformity Monkey – looked for servers not configured according to their best practices and shut them down. For example, it would look for servers that weren’t part of an auto-scaling group, and so had limited resilience to unplanned shutdown.&lt;/li&gt;
&lt;li&gt;Chaos Gorilla – whilst Chaos Monkey looked for individual servers to target, Chaos Gorilla looked to test the outage of an entire availability zone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These, and many other, approaches to testing Netflix’s resilience soon began to gain recognition within the wider engineering community, and others tried to re-use or re-engineer similar tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS enter the Fray.
&lt;/h2&gt;

&lt;p&gt;It’s safe to say that Netflix’s approach would have caught the eye of Amazon Web Services, the major player in the Cloud engineering space. After all, for many years, Netflix was the biggest single customer using the Cloud provider’s services.&lt;/p&gt;

&lt;p&gt;One of AWS’s oft-quoted ideals is the aim of reducing ‘undifferentiated heavy lifting’: they look for tasks that are being widely tackled by their customers and which they could provide as a managed service, giving customers the opportunity to reduce workload and complexity (whilst at the same time, no doubt, providing AWS with an income stream). However, AWS’s announcement&lt;sup id="fnref4"&gt;4&lt;/sup&gt; at their 2020 re:Invent conference that they would provide a managed chaos engineering service still came as a surprise to some.&lt;/p&gt;

&lt;p&gt;Amazon’s Fault Injection Service, or FIS&lt;sup id="fnref5"&gt;5&lt;/sup&gt;, offered a set of scenarios that could be deployed in their customers’ accounts, initially allowing testing against Amazon’s EC2, ECS, EKS and RDS services. Since that time, the offering has expanded and now includes cross-account testing, simulating errors at the control plane and network layers, allowing simulation of availability zone failures, API throttling and a wide range of other failure scenarios.&lt;/p&gt;

&lt;p&gt;FIS and associated AWS services, allow engineering teams to follow what is now seen as a standard set of practices within chaos engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Baseline performance – services such as AWS CloudWatch allow a deep understanding of applications’ operating parameters, allowing them to measure and track interactions, dependencies and various service metrics.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hypothesise – once there is an understanding of the components of a service, architects and engineers can start to think about ‘what if’ – would their availability continue under increased network latency, API throttling or the unplanned termination of components?&lt;/p&gt;

&lt;p&gt;FIS enables these hypotheses to be codified using what are known as ‘experiment templates’, describing tests to be carried out along with which components should be stressed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experiment – FIS allows the experiment templates to be deployed and executed in an AWS environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blast Radius – any chaos engineering tool should have the option to terminate an experiment if it starts to affect service in an unplanned way. FIS allows CloudWatch Alarms to be configured to halt experiments and roll back the effects that had been put in place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measurement – once again, CloudWatch provides services such as metrics, logs and alarms, allowing application designers to understand how their services reacted to the experiments put in place.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FIS also brings an important element to the world of chaos engineering – control. While it’s good to test in development environments to understand how services will react to unplanned scenarios, one of chaos engineering’s tenets is that the most valuable insights will be gained by testing against production services. However, AWS customers will want to control how and when these experiments are deployed – this is achieved using IAM permissions to control who can define and execute FIS scenarios.&lt;/p&gt;
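&lt;p&gt;As an illustration of that control, access to FIS can be scoped with an ordinary IAM policy. The sketch below (expressed here as a Python dict, with a hypothetical account id) would let a team start and observe experiments built from approved templates, without letting them define new ones:&lt;/p&gt;

```python
import json

# Hypothetical IAM policy sketch: allow running existing FIS experiment
# templates, but not creating new ones (no fis:CreateExperimentTemplate).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "fis:StartExperiment",
                "fis:StopExperiment",
                "fis:GetExperiment",
                "fis:ListExperiments",
            ],
            # placeholder account id - scope this to your approved templates
            "Resource": [
                "arn:aws:fis:eu-west-1:123456789012:experiment-template/*",
                "arn:aws:fis:eu-west-1:123456789012:experiment/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

&lt;p&gt;Splitting ‘can define experiments’ from ‘can run experiments’ in this way is one practical means of keeping production chaos deliberate rather than accidental.&lt;/p&gt;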

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;Organisations and engineers working with complex, distributed systems within Cloud environments should look to adopt the principles of chaos engineering, ensuring that it becomes not just a best practice but a strategic imperative.&lt;/p&gt;

&lt;p&gt;Amazon’s FIS empowers engineering teams to proactively address the challenges of distributed systems, ensuring the robustness and resilience of applications in the dynamic and unpredictable cloud environment. Working to their principle of undifferentiated heavy lifting, AWS has positioned chaos engineering as a managed service, aligning with their commitment to reducing complexity and empowering customers to navigate the intricacies of modern cloud-based architectures.&lt;/p&gt;





&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing"&gt;https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://principlesofchaos.org/"&gt;https://principlesofchaos.org/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116"&gt;https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=VndV2j8fulo"&gt;https://www.youtube.com/watch?v=VndV2j8fulo&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="http://aws.amazon.com/fis"&gt;http://aws.amazon.com/fis&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>chaos</category>
      <category>aws</category>
    </item>
    <item>
      <title>Validating Python code with a CodeCatalyst Pipeline.</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Thu, 18 Jan 2024 18:04:53 +0000</pubDate>
      <link>https://forem.com/aws-builders/validating-python-code-with-a-codecatalyst-pipeline-331f</link>
      <guid>https://forem.com/aws-builders/validating-python-code-with-a-codecatalyst-pipeline-331f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This article will form part of a series that will explore how to validate Python code, deploy it as a Lambda Function via Terraform, through to defining a standard template that can be used to create Lambdas.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/eg7nlpwEpnA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our example Python Code&lt;/li&gt;
&lt;li&gt;What will the pipeline do?&lt;/li&gt;
&lt;li&gt;
Creating a Pipeline

&lt;ul&gt;
&lt;li&gt;General Configuration&lt;/li&gt;
&lt;li&gt;Running Actions&lt;/li&gt;
&lt;li&gt;
Linting our code

&lt;ul&gt;
&lt;li&gt;PyLint configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Vulnerability scanning&lt;/li&gt;

&lt;li&gt;

Automated testing

&lt;ul&gt;
&lt;li&gt;PyTest configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Action ordering&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Running the pipeline

&lt;ul&gt;
&lt;li&gt;CodeCoverage details.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://codecatalyst.aws/explore" rel="noopener noreferrer"&gt;CodeCatalyst&lt;/a&gt; is a unified development environment created by AWS.&lt;/p&gt;

&lt;p&gt;It has many features, such as blueprints to assist in writing code, integrated Git repositories, pre-defined dev environments and now AI integration. However, for me, one of the most useful things is being able to define and use pipelines stored in the code repository.&lt;/p&gt;

&lt;p&gt;Pipelines are one of the most commonly used tools for those of us working with code and Cloud, allowing us to automate tasks to be carried out when we make changes to our code - whether that's checking that our code works, building artefacts and packages, or deploying to our environments.&lt;/p&gt;

&lt;p&gt;In this post, I'll share an example pipeline that we can use to validate some Python code as we work on it. Whilst this post doesn't cover all the options available when working with pipelines, it should be enough to explain how the pipeline works and how you can modify it for your workflows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ To streamline the post, I'll assume you understand how CodeCatalyst works and can create and work with code in repositories.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Our example Python Code
&lt;/h2&gt;

&lt;p&gt;For this post, I'm using some example code from &lt;a href="https://github.com/headforthecloud/example-python-lambda" rel="noopener noreferrer"&gt;https://github.com/headforthecloud/example-python-lambda&lt;/a&gt; which defines a simple function that could be used as an &lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda function&lt;/a&gt;. The code is structured like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6aht6o0idvj43jkipxu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6aht6o0idvj43jkipxu7.png" alt="Screenshot showing repository structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for the main function &lt;code&gt;lambda_function.py&lt;/code&gt;, shown below, just sets up some logging, outputs a message and returns a status code to indicate it ran successfully. It also includes a couple of functions that could be used to demonstrate testing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 python
#! /usr/bin/env python
""" An example lambda function """

import os
import json
import logging


# define a logger using logging library. If LOG_LEVEL is not set, default to INFO.
# otherwise use value of LOG_LEVEL
logger = logging.getLogger()
logger.setLevel(os.getenv('LOG_LEVEL', 'INFO'))


def lambda_handler(event, context):
    """ define a lambda_handler function that takes in an event and a context """
    logger.info("Hello from Lambda!")

    return {
        "statusCode": 200,
        "body": json.dumps(event)
    }


def add_x_y(x, y):
    """ This is a simple function that adds two numbers together and returns the result. """
    return x + y


def multiply_x_y(x, y):
    """ This is a simple function that multiplies two numbers together and returns the result. """
    return x * y


# if this file is run directly, run the lambda_handler function with dummy event and context
if __name__ == '__main__':
    lambda_handler(None, None)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also have some code written using the &lt;a href="https://pytest.org" rel="noopener noreferrer"&gt;PyTest&lt;/a&gt; framework, in a folder called &lt;code&gt;tests&lt;/code&gt;.&lt;/p&gt;
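&lt;p&gt;The repository's actual tests live in that folder, but to illustrate the style, a minimal &lt;code&gt;tests/test_lambda_function.py&lt;/code&gt; exercising the functions above might look like this (a sketch, not the repository's exact file):&lt;/p&gt;

```python
import json

# Stand-ins copied from the lambda_function.py shown above; in the real
# repository the test file would import these from the lambda_function module.
def add_x_y(x, y):
    return x + y

def multiply_x_y(x, y):
    return x * y

def lambda_handler(event, context):
    return {"statusCode": 200, "body": json.dumps(event)}

# PyTest discovers any function named test_* and runs its assertions.
def test_add_x_y():
    assert add_x_y(2, 3) == 5

def test_multiply_x_y():
    assert multiply_x_y(4, 5) == 20

def test_lambda_handler_echoes_event():
    event = {"name": "test"}
    response = lambda_handler(event, None)
    # the handler returns a 200 status and the event serialised as JSON
    assert response["statusCode"] == 200
    assert json.loads(response["body"]) == event
```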

&lt;h2&gt;
  
  
  What will the pipeline do?
&lt;/h2&gt;

&lt;p&gt;In this example, we're going to perform a set of actions which are typical of a pipeline used with Python:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linting&lt;/strong&gt; - we do this to make sure our code meets general best practices in terms of code, and that it should at least run. For this, we'll use a well-known tool called &lt;a href="https://pypi.org/project/pylint/" rel="noopener noreferrer"&gt;PyLint&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vulnerability Scanning&lt;/strong&gt; - we do this to try and make sure our code doesn't contain any security issues such as secret values, possible SQL injection routes etc. We will use a tool called &lt;a href="https://github.com/PyCQA/bandit" rel="noopener noreferrer"&gt;Bandit&lt;/a&gt; for this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Testing&lt;/strong&gt; - we want to make sure that our code performs as expected. To this end, we'll use the &lt;a href="https://pytest.org" rel="noopener noreferrer"&gt;PyTest&lt;/a&gt; framework and check that our tests work and that we test an appropriate amount of our code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reporting&lt;/strong&gt; - for each step, we will use the CodeCatalyst functionality to generate reports showing the outcome of each step and whether it was successful.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Creating a Pipeline
&lt;/h2&gt;

&lt;p&gt;There are two approaches to generating or modifying a pipeline with CodeCatalyst - either via a visual editor built into &lt;a href="https://codecatalyst.aws" rel="noopener noreferrer"&gt;https://codecatalyst.aws&lt;/a&gt;, or working in the repository and defining the pipeline as a YAML file in the &lt;code&gt;.codecatalyst/workflows&lt;/code&gt; directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 The full definition for the pipelines can be found &lt;a href="https://docs.aws.amazon.com/codecatalyst/latest/userguide/workflow-reference.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this example, I'll use the latter approach, working with a file called &lt;code&gt;.codecatalyst/workflows/python-testing-pipeline.yaml&lt;/code&gt;. If you'd like to see the full file, it's available in &lt;a href="https://github.com/headforthecloud/codecatalyst-python-pipeline/blob/main/.codecatalyst/workflows/python-testing-pipeline.yaml" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  General Configuration
&lt;/h3&gt;

&lt;p&gt;Firstly, we're going to define where and when the pipeline will run with this code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;SchemaVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;

&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python-testing-pipeline&lt;/span&gt;

&lt;span class="na"&gt;Compute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2&lt;/span&gt;

&lt;span class="na"&gt;Triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this, we're saying that the pipeline will be called &lt;strong&gt;python-testing-pipeline&lt;/strong&gt;, and that it will be executed using EC2 (we could also use Lambda).&lt;/p&gt;

&lt;p&gt;We're also going to define that the pipeline should be triggered every time changes are pushed to the repository. We could also have workflows triggered when working with a pull request, or even on a scheduled basis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Actions
&lt;/h3&gt;

&lt;p&gt;Once we've defined when and where the pipeline runs, we need to tell it what steps to carry out - to do this we'll use an &lt;code&gt;Actions&lt;/code&gt; section, which will have a number of these items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifier&lt;/strong&gt; - these are equivalent to GitHub actions - in fact we can use some GitHub actions (see &lt;a href="https://docs.aws.amazon.com/codecatalyst/latest/userguide/integrations-github-actions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; for more info.). In our examples, we'll use the &lt;code&gt;aws/build@v1&lt;/code&gt; and &lt;code&gt;aws/managed-test@v1&lt;/code&gt; actions (these are functionally equivalent and interchangeable).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inputs&lt;/strong&gt; - in this case, we're going to use these to specify that we want to retrieve our code from the &lt;code&gt;WorkflowSource&lt;/code&gt; i.e. the repository containing the pipeline, but we could also specify that we want to use artefacts that might contain saved files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration steps&lt;/strong&gt; - we'll use these to list the specific actions we want to perform in the pipeline. With the &lt;code&gt;build&lt;/code&gt; and &lt;code&gt;managed-test&lt;/code&gt; actions, we provide a list of &lt;code&gt;Run&lt;/code&gt; steps which use the Linux shell &lt;code&gt;bash&lt;/code&gt; to execute the provided commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs&lt;/strong&gt; - in our example pipeline, we'll use these to define &lt;code&gt;Reports&lt;/code&gt; that feed the results of our actions back into CodeCatalyst.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Linting our code
&lt;/h3&gt;

&lt;p&gt;For our first action, we're going to check that our code meets the best practices for Python. In this case, as mentioned earlier we're going to use &lt;strong&gt;&lt;em&gt;PyLint&lt;/em&gt;&lt;/strong&gt; and our pipeline will carry out the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specify that we want to use the code from our repository&lt;/li&gt;
&lt;li&gt;Install Pylint using &lt;code&gt;pip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensure that we have a location we can use to store the results of our linting&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;pylint&lt;/code&gt; and capture the results in the folder created in the previous step&lt;/li&gt;
&lt;li&gt;Upload the results as a report to CodeCatalyst, using the &lt;code&gt;PYLINTJSON&lt;/code&gt; format and defining the success criteria that control whether this pipeline step passes. In this example, we can specify what level of issues is allowed within a set of categories.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To perform the above, we can use this code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;Actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Linting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws/build@v1.0.0&lt;/span&gt;
    &lt;span class="na"&gt;Inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WorkflowSource&lt;/span&gt;
    &lt;span class="na"&gt;Configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Installing pylint"&lt;/span&gt;
            &lt;span class="s"&gt;pip install --user pylint&lt;/span&gt;
            &lt;span class="s"&gt;export PATH=$PATH:~/.local/bin&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Check testresults folder exists"&lt;/span&gt;
            &lt;span class="s"&gt;if [ ! -d tests/testresults ]&lt;/span&gt;
            &lt;span class="s"&gt;then&lt;/span&gt;
              &lt;span class="s"&gt;mkdir tests/testresults&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Linting Python"&lt;/span&gt;
            &lt;span class="s"&gt;pylint *py tests/*py &amp;gt; tests/testresults/pylint-output.py&lt;/span&gt;
    &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Reports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PyLintResults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PYLINTJSON&lt;/span&gt;
          &lt;span class="na"&gt;IncludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/testresults/pylint-output.py&lt;/span&gt;
          &lt;span class="na"&gt;SuccessCriteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;StaticAnalysisQuality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH&lt;/span&gt;
              &lt;span class="na"&gt;Number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;StaticAnalysisSecurity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MEDIUM&lt;/span&gt;
              &lt;span class="na"&gt;Number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;StaticAnalysisBug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MEDIUM&lt;/span&gt;
              &lt;span class="na"&gt;Number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
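
&lt;p&gt;For a flavour of what PyLint checks, here's a purely illustrative snippet (not part of our project) annotated with the message IDs PyLint would raise for it:&lt;/p&gt;

```python
"""A purely illustrative module - the code exists only to show the kind
of issues PyLint reports (omitting this docstring would itself raise
C0114, missing-module-docstring)."""

def BadName(value):  # C0103: function name doesn't follow snake_case
    """Double a value."""
    unused = 42      # W0612: unused variable
    return value * 2
```

&lt;p&gt;Each finding has a category and severity, which is what the &lt;code&gt;SuccessCriteria&lt;/code&gt; thresholds in the workflow are evaluated against.&lt;/p&gt;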
&lt;h4&gt;
  
  
  PyLint configuration
&lt;/h4&gt;

&lt;p&gt;We have control over which checks PyLint will carry out by using a configuration file, &lt;code&gt;.pylintrc&lt;/code&gt;. In our example, we'll use this setup:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;

&lt;span class="nn"&gt;[BASIC]&lt;/span&gt;
&lt;span class="py"&gt;good-names&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;i,j,k,x,y,ex,Run,_&lt;/span&gt;
&lt;span class="py"&gt;fail-under&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0.1&lt;/span&gt;

&lt;span class="nn"&gt;[FORMAT]&lt;/span&gt;
&lt;span class="py"&gt;max-line-length&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;indent-string&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'    '&lt;/span&gt;

&lt;span class="nn"&gt;[REPORTS]&lt;/span&gt;
&lt;span class="py"&gt;output-format&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Vulnerability scanning
&lt;/h3&gt;

&lt;p&gt;We're also going to add a section to our actions to check that we don't have any security issues in our code, such as included secrets, code allowing SQL injection, etc. To do this, we're going to use a tool called &lt;strong&gt;&lt;em&gt;Bandit&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
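
&lt;p&gt;To give a flavour of what Bandit looks for, here's a purely illustrative snippet (with made-up values) annotated with the Bandit test IDs it would trigger:&lt;/p&gt;

```python
# Purely illustrative code showing typical Bandit findings.
import hashlib
import subprocess

password = "hunter2"  # B105: hard-coded password string

def weak_digest(data):
    # B324: use of the insecure MD5 hash algorithm
    return hashlib.md5(data).hexdigest()

def run_user_command(cmd):
    # B602: subprocess call with shell=True - allows shell injection
    return subprocess.call(cmd, shell=True)
```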

&lt;p&gt;The steps are very similar to those from the linting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specify that we want to use the code from our repository&lt;/li&gt;
&lt;li&gt;Install Bandit using &lt;code&gt;pip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensure that we have a location we can use to store the results of our scans&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;bandit&lt;/code&gt; and capture the results in the folder created in the previous step. We'll output the results in &lt;code&gt;sarif&lt;/code&gt; (the Static Analysis Results Interchange Format), a standard format used by scanning tools
&lt;/li&gt;
&lt;li&gt;Upload the results as a report to CodeCatalyst, using the &lt;code&gt;SARIFSA&lt;/code&gt; format and defining our success criteria which will control if this pipeline step is successful. Again, we'll specify what criteria are needed for a successful run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To perform the above, we can use this code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

  &lt;span class="na"&gt;vuln_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws/build@v1.0.0&lt;/span&gt;
    &lt;span class="na"&gt;Inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WorkflowSource&lt;/span&gt;
    &lt;span class="na"&gt;Configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Installing bandit"&lt;/span&gt;
            &lt;span class="s"&gt;pip install --user bandit bandit-sarif-formatter&lt;/span&gt;
            &lt;span class="s"&gt;export PATH=$PATH:~/.local/bin&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Check testresults folder exists"&lt;/span&gt;
            &lt;span class="s"&gt;if [ ! -d tests/testresults ]&lt;/span&gt;
            &lt;span class="s"&gt;then&lt;/span&gt;
              &lt;span class="s"&gt;mkdir tests/testresults&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Running Bandit"&lt;/span&gt;
            &lt;span class="s"&gt;bandit -r . --format sarif --output tests/testresults/bandit-output.sarif --exit-zero&lt;/span&gt;
    &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Reports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;BanditResults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SARIFSA&lt;/span&gt;
          &lt;span class="na"&gt;IncludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/testresults/bandit-output.sarif&lt;/span&gt;
          &lt;span class="na"&gt;SuccessCriteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;StaticAnalysisFinding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MEDIUM&lt;/span&gt;
              &lt;span class="na"&gt;Number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Automated testing
&lt;/h3&gt;

&lt;p&gt;Whilst our other steps check our code from a static viewpoint, we also want to be sure that it behaves as expected, so we'll include a step found in most pipelines - automated testing to validate our code's functionality.&lt;/p&gt;

&lt;p&gt;In our example, we're going to use the popular &lt;strong&gt;&lt;em&gt;PyTest&lt;/em&gt;&lt;/strong&gt; framework, which will use code stored in the &lt;code&gt;tests&lt;/code&gt; folder to check functionality - for this example, we're going to have a single, simple test to demonstrate how this can be done.&lt;/p&gt;
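
&lt;p&gt;As a sketch of what that single test might look like (the actual code under test isn't shown here, so the &lt;code&gt;add&lt;/code&gt; function and file name are hypothetical):&lt;/p&gt;

```python
# tests/test_example.py - a minimal, hypothetical PyTest test.

def add(a, b):
    """Stand-in for the real application code under test."""
    return a + b

def test_add():
    # PyTest collects any function named test_*, and a plain
    # assert is all that's needed for the pass/fail check.
    assert add(2, 3) == 5
```

&lt;p&gt;In a real project, &lt;code&gt;add&lt;/code&gt; would be imported from the application module rather than defined in the test file.&lt;/p&gt;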

&lt;p&gt;As well as understanding whether our code passes the provided tests, we want to understand how much of our code has been tested, so we'll also capture what is known as &lt;em&gt;code coverage&lt;/em&gt;, which records which lines of our code were exercised by the tests.&lt;/p&gt;

&lt;p&gt;Again our steps will follow the now familiar process of installing any required tools, executing them, and then capturing the results as a report within CodeCatalyst using the following code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

  &lt;span class="na"&gt;unit_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws/managed-test@v1.0.0&lt;/span&gt;
    &lt;span class="na"&gt;Inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;WorkflowSource&lt;/span&gt;
    &lt;span class="na"&gt;Configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Installing pytest"&lt;/span&gt;
            &lt;span class="s"&gt;pip install --user pytest pytest-cov&lt;/span&gt;
            &lt;span class="s"&gt;export PATH=$PATH:~/.local/bin&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Check testresults folder exists"&lt;/span&gt;
            &lt;span class="s"&gt;if [ ! -d tests/testresults ]&lt;/span&gt;
            &lt;span class="s"&gt;then&lt;/span&gt;
              &lt;span class="s"&gt;mkdir tests/testresults&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Check for requirements"&lt;/span&gt;
            &lt;span class="s"&gt;if [ ! -r requirements.txt ]&lt;/span&gt;
            &lt;span class="s"&gt;then&lt;/span&gt;
              &lt;span class="s"&gt;pip install --user -r requirements.txt&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;echo "Running PyTest"&lt;/span&gt;
            &lt;span class="s"&gt;python -m pytest&lt;/span&gt;
    &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Reports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PyTestResults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JUNITXML&lt;/span&gt;
          &lt;span class="na"&gt;IncludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/testresults/junit.xml&lt;/span&gt;
          &lt;span class="na"&gt;SuccessCriteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;PassRate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;CodeCoverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COBERTURAXML&lt;/span&gt;
          &lt;span class="na"&gt;IncludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tests/testresults/coverage.xml&lt;/span&gt;
          &lt;span class="na"&gt;SuccessCriteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;LineCoverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  PyTest configuration
&lt;/h4&gt;

&lt;p&gt;With PyTest, we're going to use two configuration files.&lt;br&gt;
 &lt;code&gt;.pytest.ini&lt;/code&gt; is used to define where our tests are and what output we'll generate from the tests - our example looks like:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;

&lt;span class="nn"&gt;[pytest]&lt;/span&gt;
&lt;span class="py"&gt;log_level&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
&lt;span class="py"&gt;addopts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; 
    &lt;span class="s"&gt;-v --no-header --cov=.&lt;/span&gt;
    &lt;span class="py"&gt;--junitxml&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;tests/testresults/junit.xml&lt;/span&gt;
    &lt;span class="py"&gt;--cov-report&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;xml:tests/testresults/coverage.xml&lt;/span&gt;
    &lt;span class="py"&gt;--cov-report&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;term-missing&lt;/span&gt;

&lt;span class="py"&gt;testpaths&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;tests&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and we'll also use a &lt;code&gt;.coveragerc&lt;/code&gt; file to tell PyTest not to include our test files when calculating code coverage via:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="nn"&gt;[run]&lt;/span&gt;&lt;br&gt;
&lt;span class="py"&gt;omit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;./tests/*&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Action ordering
&lt;/h3&gt;

&lt;p&gt;As defined here, there are no constraints on the ordering of the linting, scanning and testing steps, so they will run in parallel.&lt;/p&gt;

&lt;p&gt;However, if we want to ensure that a step will only run if a previous step is completed, we can use a &lt;code&gt;DependsOn&lt;/code&gt; clause in each action, so for example if we wanted our &lt;code&gt;unit_tests&lt;/code&gt; action to only run if the &lt;code&gt;linting&lt;/code&gt; step worked, we could change our action definition to include the following lines:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="na"&gt;unit_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;DependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Linting&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;Identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws/&lt;a href="mailto:managed-test@v1.0.0"&gt;managed-test@v1.0.0&lt;/a&gt;&lt;/span&gt;&lt;br&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Running the pipeline
&lt;/h2&gt;

&lt;p&gt;Once we've created our pipeline and committed it to the code repository in CodeCatalyst along with our Python code, we should have a file structure that looks something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i3465dif9qqi47n7tgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i3465dif9qqi47n7tgn.png" alt="Complete repository structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With all of this in place, CodeCatalyst should recognise that it needs to run the pipeline anytime there are changes to the code in the repository, including the pipeline configuration file. These runs are visible in the CodeCatalyst console under &lt;code&gt;CI/CD &amp;gt; Workflows&lt;/code&gt; as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6czah65lwqnjt05ki5ka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6czah65lwqnjt05ki5ka.png" alt="Screenshot showing example workflow runs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each pipeline will be listed under the name defined at the start of the configuration, along with its runs, showing each run's status, run ID, the commit that triggered it, and which repository and branch were used.&lt;/p&gt;

&lt;p&gt;Clicking on the ID of the run will take us to the details for that particular run, looking something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9egr7nfe03kcrd59rpaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9egr7nfe03kcrd59rpaj.png" alt="Screenshot of run details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned earlier, you can see that because we didn't define any dependencies between the steps, they ran in parallel. We can also see whether the run was successful, the commit ID that triggered the run, along with when the run started, and how long it took.&lt;/p&gt;

&lt;p&gt;We can also click on any of the steps to see the details of each step, including the output from any commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubo6ur7we2p4u9weyuh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubo6ur7we2p4u9weyuh7.png" alt="Screenshot of run details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting
&lt;/h2&gt;

&lt;p&gt;As well as being able to see whether a workflow run was successful, we can see any reports generated by clicking on &lt;code&gt;Reports&lt;/code&gt;, either in the sidebar or the details of the run screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jdj6potypw0feb22zco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jdj6potypw0feb22zco.png" alt="Screenshot of workflow reports"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The screen above shows the reports generated by our workflow, when they were generated, if they were successful (as long as we defined criteria to specify what success means), along with repository details, the workflow action step that created the report and the type of data the report contains.&lt;/p&gt;

&lt;p&gt;These reports are, in my view, one of the items that helps CodeCatalyst stand out - it's very simple to define what reports are being generated, what type of data they contain, and what constitutes a successful report.&lt;/p&gt;

&lt;p&gt;Clicking through on the report name takes you to the detailed report data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo5t1yje578u25ipfuo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo5t1yje578u25ipfuo4.png" alt="Screenshot of example Code Coverage report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, showing code coverage (i.e. how much of the code has been tested), we can see what the success criteria were, and how much of the code we've tested, both as a summary and on a per-file basis.&lt;/p&gt;

&lt;h3&gt;
  
  
  CodeCoverage details
&lt;/h3&gt;

&lt;p&gt;In the example, we could see a summary of the overall coverage, as well as per-file figures. We're also able to click through to the individual files to see which lines were tested or not:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxtiur61g6y8cull2ixu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxtiur61g6y8cull2ixu.png" alt="Screenshot showing which lines were tested in a file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In my opinion, CodeCatalyst is a useful development tool - by integrating many of the tools required in the SDLC (Software Development Life Cycle), it can provide a very functional working space.&lt;/p&gt;

&lt;p&gt;In this example, we've concentrated on how we can define and run pipelines when we make changes to our code, and how we can report on the outcomes of those changes - an area in which I think CodeCatalyst is particularly strong.&lt;/p&gt;

&lt;p&gt;If you have any questions, comments or suggestions for other tasks we could use in the pipelines, use the comment box below.&lt;/p&gt;

</description>
      <category>codecatalyst</category>
      <category>cicd</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>Generating cloudwatch alarms using 'metric math' via CloudFormation and Terraform.</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Thu, 04 Jan 2024 17:37:35 +0000</pubDate>
      <link>https://forem.com/aws-builders/generating-cloudwatch-alarms-using-metric-math-via-cloudformation-and-terraform-60l</link>
      <guid>https://forem.com/aws-builders/generating-cloudwatch-alarms-using-metric-math-via-cloudformation-and-terraform-60l</guid>
      <description>&lt;p&gt;I spend a lot of time working as a consultant with &lt;a href="https://globallogic.com/uk/" rel="noopener noreferrer"&gt;GlobalLogic UK&amp;amp;I&lt;/a&gt; with different client teams to deploy AWS infrastructure, and not surprisingly, I see differing levels of maturity and experience within these teams.&lt;/p&gt;

&lt;p&gt;While we work with teams with a lot of knowledge, often they concentrate on deploying the applications and infrastructure, but they won't think about how they can understand how well an application is working. This is an important aspect of working within the Cloud, usually termed &lt;em&gt;monitoring&lt;/em&gt; or &lt;em&gt;observability.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are several well-known 3rd party tools, such as Splunk or DataDog, but people often forget there is a handy set of tools provided by AWS under the &lt;a href="https://aws.amazon.com/cloudwatch/" rel="noopener noreferrer"&gt;CloudWatch&lt;/a&gt; banner. CloudWatch provides a range of services, such as exposing metrics, linking them to alarms, and X-ray services, to allow us to understand the performance of services in more depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch metrics
&lt;/h2&gt;

&lt;p&gt;Most services that we can use within AWS, such as EC2, Lambdas, DynamoDB, etc., will emit metrics that we can use to understand how well a resource is performing.&lt;/p&gt;

&lt;p&gt;For example, if we're deploying EC2 instances, we can monitor many aspects of the instance, such as&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU utilisation,&lt;/li&gt;
&lt;li&gt;Disc read and write metrics,&lt;/li&gt;
&lt;li&gt;Network traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we're working with lambda functions, we can monitor&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of invocations, including how many concurrent invocations,&lt;/li&gt;
&lt;li&gt;The duration of a function run,&lt;/li&gt;
&lt;li&gt;The number of errors generated,&lt;/li&gt;
&lt;li&gt;The number of invocations that were throttled (for example, if we exceeded the allowed number of concurrent Lambda invocations in an account).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the console, we can view historical graphs of our metrics, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc14yi3xfl47dd4xjxed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc14yi3xfl47dd4xjxed.png" alt="Cloudwatch metrics graph for a lambda function"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This graph shows the metrics for a lambda function &lt;code&gt;MyLambdaFunction&lt;/code&gt;, specifically how many times the function has run (invocations), along with how long the lambda takes to run (duration).&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch Alarms
&lt;/h2&gt;

&lt;p&gt;Once we're aware of the metrics available for a particular resource, we can use them to set alarms. For example, we might want to be notified if the CPU usage on one of our EC2 instances went above 75% or if one of our lambdas was being throttled. Alarms have associated actions that are triggered when a metric goes into an alarm state or when it returns to normal, and we can, for example, send a message to a SNS topic, trigger a lambda to resolve something, or if we're using an EC2 metric, perform EC2 actions such as stopping or rebooting the instance.&lt;/p&gt;

&lt;p&gt;If we're graphing a metric through CloudWatch, we'll have the option to create an alarm directly from the graph, or we can create one from the Alarms section of the console. When we create the alarm, we'll be asked for the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A name and description&lt;/li&gt;
&lt;li&gt;Which namespace we are interested in, such as &lt;code&gt;EC2&lt;/code&gt; or &lt;code&gt;Lambda&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Which metric we want to monitor, i.e., CPU utilisation, throttles, etc.&lt;/li&gt;
&lt;li&gt;The dimension for the metric, i.e., which instance, DynamoDB table, function name, etc.&lt;/li&gt;
&lt;li&gt;How we'll measure the metric, such as using averages, sums, or minimum or maximum values. The AWS documentation for each service will explain the best statistics to use; for example, the Lambda documentation can be found &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The period we want to monitor over, and how many breaching data points are needed to trigger the alarm.&lt;/li&gt;
&lt;li&gt;The threshold for triggering the alarm - a value and a condition such as &lt;code&gt;GreaterThanThreshold&lt;/code&gt; or &lt;code&gt;LessThanThreshold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The actions to trigger when the alarm fires or when it returns to its normal state.&lt;/li&gt;
&lt;/ul&gt;
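&lt;p&gt;The settings above map directly onto the CloudWatch &lt;code&gt;PutMetricAlarm&lt;/code&gt; API, so the same alarm can be scripted. Below is a minimal sketch using boto3; the function name, period, threshold and SNS topic ARN are all placeholder values:&lt;/p&gt;

```python
# Sketch: the alarm settings listed above, expressed as PutMetricAlarm
# parameters. All names, values and ARNs are placeholders.
alarm = {
    "AlarmName": "lambda-function-MyLambdaFunction-throttled",
    "AlarmDescription": "lambda-function-MyLambdaFunction is throttled",
    "Namespace": "AWS/Lambda",                  # which namespace
    "MetricName": "Throttles",                  # which metric
    "Dimensions": [                             # which function
        {"Name": "FunctionName", "Value": "MyLambdaFunction"},
    ],
    "Statistic": "Sum",                         # how we measure the metric
    "Period": 300,                              # seconds in each period
    "EvaluationPeriods": 1,                     # breaching periods needed
    "Threshold": 0,                             # value to compare against
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:alarm-topic"],
}

# With AWS credentials configured, this would create the alarm:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```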

&lt;h2&gt;
  
  
  Calculated alarms using metric math
&lt;/h2&gt;

&lt;p&gt;Generally, we want to alarm on the absolute value of a metric, for example, when CPU utilisation exceeds a certain amount or if a lambda is being throttled. However, sometimes we'll be more interested in a relative value, such as the percentage of runs that generate errors. A single error is important if a lambda runs once or twice, but much less so if the lambda runs thousands of times a day.&lt;/p&gt;

&lt;p&gt;Where this isn't available as a metric, we can often calculate it using what is known as &lt;code&gt;metric math&lt;/code&gt;. For the example above, we can calculate the number of errors as a percentage of invocations.&lt;/p&gt;
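&lt;p&gt;To see why the relative value is more useful, the expression can be evaluated by hand. A quick sketch in plain Python (not CloudWatch itself):&lt;/p&gt;

```python
def error_percentage(errors: int, invocations: int) -> float:
    """Mirror of the metric-math expression (errors/invocations)*100."""
    if invocations == 0:
        return 0.0  # nothing ran, so there is no error rate to report
    return (errors / invocations) * 100

# One error in two runs is serious; five errors in a thousand is 0.5%.
print(error_percentage(1, 2))     # 50.0
print(error_percentage(5, 1000))  # 0.5
```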

&lt;p&gt;Rather than looking at the absolute value of the &lt;code&gt;Errors&lt;/code&gt; metric, as in&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvcn1b7ug6xskclsjwjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvcn1b7ug6xskclsjwjd.png" alt="Alarm configuration using absolute values"&gt;&lt;/a&gt;&lt;/p&gt;
Alarm configuration using absolute values





&lt;p&gt;we can use expressions to calculate a value&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl6xy1timpauteujpydd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl6xy1timpauteujpydd.png" alt="Alarm configuration using expressions"&gt;&lt;/a&gt;&lt;/p&gt;
Alarm configuration using expressions





&lt;h2&gt;
  
  
  Deploying Alarms via Infrastructure as Code
&lt;/h2&gt;

&lt;p&gt;Of course, best practices today dictate that we should be deploying our infrastructure as code, using tools such as &lt;a href="https://aws.amazon.com/cloudformation/" rel="noopener noreferrer"&gt;CloudFormation&lt;/a&gt; or &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The examples below are not full deployments; they only show how to create the alarms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying an alarm with absolute values using CloudFormation
&lt;/h3&gt;

&lt;p&gt;The YAML CloudFormation template below will deploy an alarm that triggers when a lambda is throttled.&lt;/p&gt;

&lt;p&gt;The definition is reasonably simple, as we are looking at a single metric: the &lt;code&gt;Throttles&lt;/code&gt; metric in the &lt;code&gt;AWS/Lambda&lt;/code&gt; namespace.&lt;/p&gt;

&lt;p&gt;It depends on the following parameters and resources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaFunctionName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The name of the deployed Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaThrottlePeriod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What period should we use when assessing if a throttle has occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaThrottleThreshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many throttles do we need to see before triggering the alarm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aws_sns_topic.ErrorAlarmTopic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The SNS Topic to send alarms to&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="na"&gt;rLambdaThrottledAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-function-${pLambdaFunctionName}-throttled"&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-function-${pLambdaFunctionName}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;throttled"&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Throttles&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS/Lambda&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FunctionName&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaFunctionName&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaThrottlePeriod&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Statistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sum&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaThrottleThreshold&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;TreatMissingData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ignore&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;rAlarmTopic&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;OKActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;rAlarmTopic&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Deploying an alarm with absolute values using Terraform
&lt;/h3&gt;

&lt;p&gt;The Terraform below will deploy an alarm which will trigger when a lambda is throttled. It depends on the following variables and resources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaFunctionName&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The name of the deployed Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaThrottlePeriod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What period should we use when assessing if a throttle has occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LambdaThrottleThreshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many throttles do we need to see before triggering the alarm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aws_sns_topic.AlarmTopic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The SNS Topic to send alarms to&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_metric_alarm"&lt;/span&gt; &lt;span class="s2"&gt;"LambdaThrottledAlarm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-function-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaFunctionName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-throttled"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_description&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-function-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaFunctionName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is throttled"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;comparison_operator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GreaterThanThreshold"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;evaluation_periods&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;metric_name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Throttles"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS/Lambda"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;period&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaThrottlePeriod&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;treat_missing_data&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ignore"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;statistic&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Sum"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;threshold&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaThrottleThreshold&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_actions&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AlarmTopic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;ok_actions&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AlarmTopic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="nx"&gt;dimensions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;FunctionName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pLambdaFunctionName&lt;/span&gt;&lt;br&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Deploying an alarm with calculated values using CloudFormation
&lt;/h3&gt;

&lt;p&gt;This time, the template is more complex. We can't simply list the metrics to be measured; we also need to include the expression that will be calculated and use its result as the trigger for the alarm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: We use two intermediate metrics, &lt;code&gt;errors&lt;/code&gt; and &lt;code&gt;invocations&lt;/code&gt;, which we then reference in the expression. Both intermediate metrics are tagged with &lt;code&gt;ReturnData: false&lt;/code&gt; to indicate that they are intermediates rather than the value the alarm monitors.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="na"&gt;rMyLambdaErrorsAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;&lt;br&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-function-${pLambdaFunctionName}-errors"&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-function-${pLambdaFunctionName}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${pLambdaErrorThreshold}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;%"&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error_percentage&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;Label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Errors&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;Expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(errors/invocations)*100&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;Label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;input&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;ReturnData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;MetricStat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS/Lambda&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Errors&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FunctionName&lt;/span&gt;&lt;br&gt;
                  &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaFunctionName&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaErrorPeriod&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Stat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sum&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Unit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Count&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;invocations&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;Label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;input&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;ReturnData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;br&gt;
          &lt;span class="na"&gt;MetricStat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS/Lambda&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Invocations&lt;/span&gt;&lt;br&gt;
              &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FunctionName&lt;/span&gt;&lt;br&gt;
                  &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaFunctionName&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaErrorPeriod&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Stat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sum&lt;/span&gt;&lt;br&gt;
            &lt;span class="na"&gt;Unit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Count&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;pLambdaErrorThreshold&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;TreatMissingData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ignore&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;AlarmActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;rErrorAlarmTopic&lt;/span&gt;&lt;br&gt;
      &lt;span class="na"&gt;OKActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;br&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;rErrorAlarmTopic&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Deploying an alarm with calculated values using Terraform
&lt;/h3&gt;

&lt;p&gt;As with CloudFormation, this is more complex than the Terraform used to deploy the alarm based on an absolute value, because we need an expression to calculate the value for the alarm to monitor.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_metric_alarm"&lt;/span&gt; &lt;span class="s2"&gt;"MyLambdaErrorsAlarm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-function-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaFunctionName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-errors"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_description&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda-function-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaFunctionName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; errors exceed &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaErrorThreshold&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; %"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;comparison_operator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GreaterThanThreshold"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;evaluation_periods&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;threshold&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaErrorThreshold&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;treat_missing_data&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ignore"&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;alarm_actions&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AlarmTopic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;br&gt;
  &lt;span class="nx"&gt;ok_actions&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AlarmTopic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="nx"&gt;metric_query&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"error_percentage"&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;expression&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"(errors/invocations)*100"&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;label&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Errors"&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;return_data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;br&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="nx"&gt;metric_query&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"errors"&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;metric&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Errors"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;namespace&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS/Lambda"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;period&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaErrorPeriod&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;stat&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Sum"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;unit&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Count"&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;span class="nx"&amp;gt;dimensions&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;{&amp;lt;/span&amp;gt;
    &amp;lt;span class="nx"&amp;gt;FunctionName&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="kd"&amp;gt;var&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="nx"&amp;gt;LambdaFunctionName&amp;lt;/span&amp;gt;
  &amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;
&amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;
&amp;lt;span class="nx"&amp;gt;return_data&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="kc"&amp;gt;false&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="nx"&gt;metric_query&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"invocations"&lt;/span&gt;&lt;br&gt;
    &lt;span class="nx"&gt;metric&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Invocations"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;namespace&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS/Lambda"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;period&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LambdaErrorPeriod&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;stat&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Sum"&lt;/span&gt;&lt;br&gt;
      &lt;span class="nx"&gt;unit&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Count"&lt;/span&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;span class="nx"&amp;gt;dimensions&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;{&amp;lt;/span&amp;gt;
    &amp;lt;span class="nx"&amp;gt;FunctionName&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="kd"&amp;gt;var&amp;lt;/span&amp;gt;&amp;lt;span class="p"&amp;gt;.&amp;lt;/span&amp;gt;&amp;lt;span class="nx"&amp;gt;LambdaFunctionName&amp;lt;/span&amp;gt;
  &amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;
&amp;lt;span class="p"&amp;gt;}&amp;lt;/span&amp;gt;
&amp;lt;span class="nx"&amp;gt;return_data&amp;lt;/span&amp;gt; &amp;lt;span class="p"&amp;gt;=&amp;lt;/span&amp;gt; &amp;lt;span class="kc"&amp;gt;false&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Metric and alarm pricing
&lt;/h2&gt;

&lt;p&gt;In this post, we're using standard metrics, which AWS provides without charge.&lt;/p&gt;

&lt;p&gt;Alarms are priced per metric monitored, with each metric costing around $0.10 per month (the exact price varies by region). So our initial absolute alarm would cost $0.10 per month, as it monitors a single metric, whilst the calculated alarm would cost $0.20 per month. There is also a free tier: the first 10 alarms are free.&lt;/p&gt;
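&lt;p&gt;To make the arithmetic concrete, the monthly cost is just the number of metrics an alarm monitors multiplied by the per-metric price. A rough sketch, assuming $0.10 per metric and ignoring the free tier:&lt;/p&gt;

```python
ALARM_PRICE_PER_METRIC = 0.10  # USD per month; the exact price varies by region

def monthly_alarm_cost(metrics_monitored: int) -> float:
    """Rough monthly cost of a single alarm, ignoring the free tier."""
    return metrics_monitored * ALARM_PRICE_PER_METRIC

print(monthly_alarm_cost(1))  # absolute alarm, one metric: 0.1
print(monthly_alarm_cost(2))  # calculated alarm (errors + invocations): 0.2
```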

</description>
    </item>
    <item>
      <title>Use VSCode to write Terraform? AWS AI can now help you write your code!</title>
      <dc:creator>Simon Hanmer</dc:creator>
      <pubDate>Mon, 27 Nov 2023 18:37:52 +0000</pubDate>
      <link>https://forem.com/aws-builders/use-vscode-to-write-terraform-aws-ai-can-now-help-you-write-your-code-59nn</link>
      <guid>https://forem.com/aws-builders/use-vscode-to-write-terraform-aws-ai-can-now-help-you-write-your-code-59nn</guid>
<description>&lt;p&gt;As I sit here writing this article, AWS's annual re:Invent is just starting in Las Vegas. This huge event pulls together cloud enthusiasts from around the world to learn about the largest cloud provider's offerings.&lt;/p&gt;

&lt;p&gt;Not surprisingly, AWS use this time to announce new and improved services, and we'll see hundreds of articles over the next week. In fact, there are so many announcements that they have to start drip-feeding them out in advance of the main event.&lt;/p&gt;

&lt;p&gt;One of the announcements released in advance of re:Invent this year, was that &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-codewhisperer-new-enhancements/" rel="noopener noreferrer"&gt;new capabilities had been added to AWS CodeWhisperer&lt;/a&gt;, AWS's AI powered coding bot - including the ability to generate Terraform code. In this article, I'll show how you can configure VSCode to use CodeWhisperer and how you can start to use CodeWhisperer to produce Terraform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Assuming you have VSCode installed, we'll need to go through a few steps to let us use CodeWhisperer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the AWS Toolkit extension for VSCode
&lt;/h3&gt;

&lt;p&gt;If you're a regular user, you'll not be surprised to learn that the integration between VSCode and CodeWhisperer is via an extension. Extensions are VSCode's plugins, used to add extra functionality such as code formatting, linting, syntax checking, etc.&lt;/p&gt;

&lt;p&gt;To install an extension, click on the extensions icon in the taskbar on the left-hand side of VSCode, shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmwjlt6eu67b2v5g2ed2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmwjlt6eu67b2v5g2ed2.png" alt="Extension bar of VSCode"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the search bar that appears, enter the name &lt;code&gt;amazonwebservices.aws-toolkit-vscode&lt;/code&gt; or &lt;code&gt;AWS Toolkit&lt;/code&gt; and you'll see an icon appear along with a brief description. Click on this to show information about the extension, and you can then click on the blue &lt;code&gt;Install&lt;/code&gt; button. After installation, you may be asked to reload VSCode to activate the extension.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdii4du3ght45938t4y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdii4du3ght45938t4y7.png" alt="AWS Toolkit extension screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the extension is installed, you will see a new AWS icon in the taskbar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckypansye0v2xv1hytb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckypansye0v2xv1hytb9.png" alt="VSCode taskbar with new AWS Icon"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Who goes there?
&lt;/h3&gt;

&lt;p&gt;With the extension in place, we need to authenticate our VSCode session to AWS. If your organisation is paying for the CodeWhisperer Professional tier, you might have access through IAM Identity Center (the successor to AWS SSO). However, for most of us the easiest option is to sign up for an AWS Builder ID via &lt;a href="https://profile.aws.amazon.com/" rel="noopener noreferrer"&gt;https://profile.aws.amazon.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Builder Ids are a relatively new way of authenticating with AWS, and they are not associated with a particular AWS account, so these identities are free.&lt;/p&gt;

&lt;h3&gt;
  
  
  You're not coming in if your name's not on the list
&lt;/h3&gt;

&lt;p&gt;Once you have a Builder ID, signing into the extension is fairly pain-free. Click on the new AWS icon in the VSCode taskbar, and you'll be asked to &lt;code&gt;Select a connection&lt;/code&gt;. Click on this and you'll get a popup, giving you the chance to &lt;code&gt;Add New Connection&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaape9iu8875yxza7x9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaape9iu8875yxza7x9i.png" alt="Asking to add new connection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;code&gt;Add new connection&lt;/code&gt; and you'll be shown the different ways you can connect with the AWS Toolkit. Scroll down to the CodeWhisperer section and you'll see a button asking you to sign up or sign in with an AWS Builder ID. Click this and a series of popups will prompt you to continue in your browser, where you'll go through the sign-in process and then be asked to confirm that the extension may access your Builder ID and associated data. Once you allow this, the browser will confirm the connection, and in VSCode the screen will change to show you're connected with your Builder ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqba1u6ppf1x9f1j47jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqba1u6ppf1x9f1j47jk.png" alt="Confirmed Connection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's get the party started.
&lt;/h2&gt;

&lt;p&gt;With the connection confirmed, we can finally get down to using CodeWhisperer to create some code.&lt;/p&gt;

&lt;p&gt;After closing down any open windows in VSCode, create a new text file. You should be prompted to &lt;code&gt;Select a language&lt;/code&gt; - click on this and choose &lt;code&gt;Terraform&lt;/code&gt;. (If you don't see Terraform in the list of options, you may need to install the official Hashicorp Terraform extension.)&lt;/p&gt;
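&lt;p&gt;As an aside, if you prefer the command line, both extensions can also be installed via VSCode's &lt;code&gt;code&lt;/code&gt; CLI using their marketplace identifiers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install the AWS Toolkit and the official HashiCorp Terraform extension
code --install-extension amazonwebservices.aws-toolkit-vscode
code --install-extension hashicorp.terraform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;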

&lt;p&gt;With our new Terraform file, we can get down to work. We normally interact with CodeWhisperer via comments in the code, so enter the following text:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# configure aws s3 as backend to store terraform state&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs3.eu-west-2.amazonaws.com%2Fheadforthe.cloud%2Fimages%2Fvscode-1701108540806.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs3.eu-west-2.amazonaws.com%2Fheadforthe.cloud%2Fimages%2Fvscode-1701108540806.gif" alt="Screen capture of editing code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you get to the end of the text, CodeWhisperer will insert text into the editing window. At this point you have a number of options available - after a few seconds, a popup will appear at the top of the window, offering the following choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Press &lt;code&gt;TAB&lt;/code&gt; to accept the suggestion,&lt;/li&gt;
&lt;li&gt;Press the left and right arrow keys to view alternative suggestions,&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;Esc&lt;/code&gt; to reject the suggestions.&lt;/li&gt;
&lt;/ol&gt;
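
&lt;p&gt;CodeWhisperer's suggestions vary from run to run, but for this backend prompt you can expect something along these lines - the bucket, key and region below are illustrative placeholders, not values CodeWhisperer will necessarily suggest:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# configure aws s3 as backend to store terraform state
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"
    key    = "terraform.tfstate"
    region = "eu-west-2"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;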

&lt;h2&gt;
  
  
  Another example
&lt;/h2&gt;

&lt;p&gt;Let's consider a different prompt:&lt;br&gt;
&lt;code&gt;# write a terraform iam policy allowing a user to get and put objects to an s3 bucket.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When I submit this, I get the following response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
resource "aws_iam_policy" "s3_policy" {
  name        = "s3_policy"
  description = "A test policy"

  policy = &amp;lt;&amp;lt;EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    }
  ]
}
EOF
} 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, I might be working with a team that prefers to use &lt;code&gt;jsonencode&lt;/code&gt; - so I can be more specific in my prompt, changing it to &lt;code&gt;# write a terraform iam policy using jsonencode allowing a user to get and put objects to an s3 bucket&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;This time, I get an appropriate set of code as below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
resource "aws_iam_policy" "s3_policy" {
  name        = "s3_policy"
  description = "s3_policy"
  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Effect   = "Allow"
        Resource = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      },
    ]
  })
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Is CodeWhisperer infallible?
&lt;/h2&gt;

&lt;p&gt;Like most AI tools currently available, CodeWhisperer will generally offer good suggestions, but sometimes you will need to alter your prompt to get the best result - so-called &lt;a href="https://en.wikipedia.org/wiki/Prompt_engineering" rel="noopener noreferrer"&gt;prompt engineering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Personally, I would have liked the IAM examples above to use &lt;code&gt;[ "s3://************/" ]&lt;/code&gt; for the S3 resources, but it can be argued that this is down to the data used to train the AI model. &lt;/p&gt;
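
&lt;p&gt;To illustrate, a hand-edited version of the &lt;code&gt;jsonencode&lt;/code&gt; example might scope the statement to a specific bucket - the bucket name and ARN below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_policy" "s3_policy" {
  name        = "s3_policy"
  description = "Allow get/put on a single bucket"
  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Effect   = "Allow"
        # hypothetical bucket - replace with the ARN of your own bucket
        Resource = "arn:aws:s3:::my-example-bucket/*"
      },
    ]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;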

&lt;p&gt;I think it's safe to say that we're not at the point where we can deploy the generated code straight to production, but I can see that it would be very useful for a less experienced engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I've given a brief overview of how we can now use CodeWhisperer in VSCode to help us write our Terraform. The new capabilities also support other Infrastructure as Code languages such as CloudFormation and CDK, and no doubt the list of supported languages will increase.&lt;/p&gt;

</description>
      <category>vscode</category>
      <category>terraform</category>
      <category>codewhisperer</category>
      <category>reinvent2023</category>
    </item>
  </channel>
</rss>
