<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Niklas Westerstråhle</title>
    <description>The latest articles on Forem by Niklas Westerstråhle (@niklaswesterstrahle).</description>
    <link>https://forem.com/niklaswesterstrahle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F726389%2F109413c7-cffa-4f70-8f01-0197a69d360b.png</url>
      <title>Forem: Niklas Westerstråhle</title>
      <link>https://forem.com/niklaswesterstrahle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/niklaswesterstrahle"/>
    <language>en</language>
    <item>
      <title>Preparing for re:Invent 2025</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Mon, 15 Sep 2025 11:45:27 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/preparing-for-reinvent-2025-3o1h</link>
      <guid>https://forem.com/niklaswesterstrahle/preparing-for-reinvent-2025-3o1h</guid>
      <description>&lt;h2&gt;
  
  
  It's time for re:Invent 2025
&lt;/h2&gt;

&lt;p&gt;Seventh time is the charm. I’ve been lucky enough to attend re:Invent every year since 2017, with the exception of 2018. Over the years my priorities have changed a lot: in the first years I had a hugely packed daily planner and ran from one session to another like a headless chicken, my calendar looking like Tetris on hard mode. In recent years I’ve focused on what matters most: hands‑on first (workshops, builder sessions), gamified learning second (Jams, GameDays), and real conversations with people I only ever get the opportunity to meet in Vegas.&lt;/p&gt;

&lt;p&gt;I plan my week with AWS Hero Raphael Manke’s &lt;a href="https://reinvent-planner.cloud"&gt;Unofficial re:Invent Session Planner&lt;/a&gt;. I highly recommend taking a look at it and planning your agenda before reserved seating opens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workshops first, talks second
&lt;/h2&gt;

&lt;p&gt;Breakouts are great for awareness, but workshops are where the learning sticks. A lot of the breakouts are available on YouTube after the event, so I start my day in a workshop or builder session. If a keynote or a flashy breakout conflicts, it becomes a nice-to-have. The non‑negotiables are the seat‑limited, hands‑on blocks: you can’t stream yourself into a lab, and standby lines are not a strategy.&lt;/p&gt;

&lt;p&gt;I block time on my calendar for the reservation release, go in with a preselected shortlist of workshops, and when reservations open, click as fast as I can in the portal. If I miss the first wave, I’m ruthless with alternates and fully expect to use standby a few times. It’s fine if I planned for it; it’s pain if I didn’t. For standby you need to queue 30-60 minutes before the session - if you really want in, be there early, really early.&lt;/p&gt;

&lt;p&gt;The campus is “walking or shuttle distance,” which sounds reasonable until you’re sprinting between the Venetian and Mandalay Bay in afternoon traffic. I group sessions by venue: for example Venetian/Forum in the morning, Mandalay/MGM in the afternoon. I leave 45-60 minutes of buffer across venues and accept that one shiny session will die so the rest of the day lives. The buses between venues will kill your enthusiasm: on paper they're good, but they're slower than you'd expect. I've walked from the Venetian to MGM and back in almost the same time the shuttles take at their worst (especially if you start at the far side of the venue). If you have comfortable shoes, try walking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jams &amp;amp; GameDays
&lt;/h2&gt;

&lt;p&gt;These are my happy place, if you don't count the hallways and community lounges etc. Jams are structured challenge boards with scoring and hints; GameDays are narrative scenarios where a team keeps a live-ish environment healthy while requirements change. Both reward calm teamwork, triage, and observability over trivia.&lt;/p&gt;

&lt;p&gt;You can go with friends, or just join a group of random people. I've done both; winning is a lot easier with non-random people. And I'm competitive as hell, so I always go there to win.&lt;/p&gt;

&lt;p&gt;But I've also done events seated with random people, met new and interesting folks, and had great discussions with them. It depends on who you end up seated with.&lt;/p&gt;

&lt;p&gt;My advice is to talk to the people at the table and learn while you share your experiences.&lt;/p&gt;

&lt;p&gt;In the game itself, start all the tasks immediately. Assign tasks to every team member, and discuss when you hit any blockers - help each other out. Starting all the activities also makes sense because some have a long wait before the environment is up and running in the background. So unless a task is timed (you have a limited amount of time to finish), start all of them.&lt;/p&gt;

&lt;p&gt;If you hit a blocker, read the task again - and then one more time. More than once I've skipped an important detail ("give this resource this exact name" - for monitoring purposes). When I know what I'm doing, I tend to skip naming items exactly as the game expects and just use my own naming convention for whatever I build.&lt;/p&gt;

&lt;p&gt;Communication within the team is the key to victory. And sometimes asking the AWS staff at the event a few stupid questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  People &amp;gt; sessions (on purpose)
&lt;/h2&gt;

&lt;p&gt;The best outcomes usually begin with “you free near Venetian at 14:00?” I schedule two intentional micro‑meetups per day - AWS folks, partners, community friends, clients - and I stack them near my venue block. I go in with a plan but leave room for hallway-track conversations; that’s where roadmap comments, napkin architectures, and “we tried this and it broke” stories surface. Treat these as first-class sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Daily template I’m using this year
&lt;/h2&gt;

&lt;p&gt;Morning: workshop/builder session while the brain is freshest. If a keynote overlaps, I’ll watch a recap later and spend the time asking the service team pointed questions at their booth. Lightning demos at the expo are great for validating whether a headline matches a use case.&lt;/p&gt;

&lt;p&gt;Midday: buffer + expo + short meetups. Early in the week is best for energy and quick demos; by Thursday everything calms down and the deeper conversations get easier.&lt;/p&gt;

&lt;p&gt;Afternoon: Jam/GameDay or a second workshop as the anchor. If nothing looks strong, I’ll run “office hours”: meeting fellow community builders, or whiteboard time with PMs or partner engineers about a specific client edge case. Those 30 minutes routinely outperform any breakout.&lt;/p&gt;

&lt;p&gt;Evening: one social thing, or none. Comfortable shoes, hydration, chargers, and using the shuttle are not optional.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keynotes (how I handle them)
&lt;/h2&gt;

&lt;p&gt;I generally skip attending keynotes live, with one huge exception: Werner Vogels’ keynote. I want to attend that and feel the atmosphere in the room. The others I usually watch remotely so I can keep the day flexible. I’m not anti‑keynote; if I’ve got the time and the venue lines up, I’ll go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3m0jcdatc7ly1x52n5w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3m0jcdatc7ly1x52n5w.jpg" alt="Werners keynote" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the last two years I’ve had the Cloud Track sticker. It’s been a lifesaver for avoiding the big queues at keynotes. My reality is showing up last minute and walking straight in via the Cloud Track entrance, both times entering just as the first folks from the regular line are pouring into the room. Not proud of the timing - I’d prefer a little bit of time to spare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the unofficial planner (how I actually do it)
&lt;/h2&gt;

&lt;p&gt;First, allocate time for this. Really - this makes or breaks your week in Vegas.&lt;/p&gt;

&lt;p&gt;I pick three themes for the week. Then I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search and favorite anchors (workshops, builder sessions, Jam/GameDay).&lt;/li&gt;
&lt;li&gt;Add alternates for every anchor in the same venue block.&lt;/li&gt;
&lt;li&gt;Color‑code by venue. The goal is zero sprints across town.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then when reserved seating opens I reserve the hands‑on sessions (not recorded), keep breakouts as filler, and protect white space so the spontaneous value can happen. It’s boring advice because it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tactics that reduce friction
&lt;/h2&gt;

&lt;p&gt;Travel &amp;amp; campus: Hotels in the portal fill early; shuttles connect venues; maps show up in the event app closer to the week. “Walking or shuttle distance” is technically true, but the Strip is deceptively long between back‑to‑backs. Budget buffer. And opt for an Uber over the shuttle bus if that puts you in a better position.&lt;/p&gt;

&lt;p&gt;Connectivity: If roaming is ugly, get a local eSIM or prepaid SIM on arrival. Put this next to “comfy shoes” and “external battery” under “things you wish you sorted on Sunday.”&lt;/p&gt;

&lt;p&gt;Food: Yes, there’s breakfast and lunch. If you care about a proper meal, go early—popular stations run thin at peaks. &lt;/p&gt;

&lt;p&gt;Replay: re:Play has dedicated shuttles and a monorail - use them. I personally opt not to go at all anymore; I'd rather end my week relaxing a bit and preparing for the trip home on Friday.&lt;/p&gt;

&lt;p&gt;Expo strategy: Early week for breadth (lay of the land, swag if that’s your thing), late week for depth (quieter booths, longer technical chats). This has held true for me every year.&lt;/p&gt;

&lt;p&gt;Footwear is strategy. The Strip + conference floors will eat your steps. Wear shoes you already trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find me at the event
&lt;/h2&gt;

&lt;p&gt;I'm one of the guys with a Golden Jacket roaming the halls. Come have a chat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrly22wq8xkjvbyuy3in.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrly22wq8xkjvbyuy3in.jpg" alt="Guess which one is me" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>Some thoughts running Perforce P4 on AWS</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Mon, 07 Jul 2025 13:20:48 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/some-thoughts-running-perforce-p4-on-aws-4hm6</link>
      <guid>https://forem.com/niklaswesterstrahle/some-thoughts-running-perforce-p4-on-aws-4hm6</guid>
      <description>&lt;p&gt;We needed a cloud deployment of Perforce P4 (Helix Core + Swarm).&lt;/p&gt;

&lt;p&gt;Perforce is the industry standard version control system for game development—used by studios like Epic, EA, and Ubisoft—because it handles huge binary assets, massive repos, and global teams better than anything else.&lt;/p&gt;

&lt;p&gt;AWS offered the basic building blocks - but making everything production-ready meant hitting some sharp edges and doing some deep dives.&lt;/p&gt;

&lt;p&gt;Here’s the story from my perspective, including the gotchas and fixes I wish someone had told me about earlier. The view is more from an administrative perspective than an end-user one.&lt;/p&gt;

&lt;p&gt;For deployment we used the &lt;a href="https://aws-games.github.io/cloud-game-development-toolkit/latest/docs/index.html" rel="noopener noreferrer"&gt;Cloud Game Development Toolkit&lt;/a&gt; (CGD) - which was enriched with some additional pieces here and there to fit our use case. &lt;/p&gt;

&lt;p&gt;I deployed with CGD v1.1.2-alpha, and also contributed by flagging future improvements to the framework. At the time of writing, some of the issues we ran into have already been tackled in newer releases.&lt;/p&gt;

&lt;p&gt;For the purposes of this blog you can refer to the example deployment from the toolkit - you can also test it out yourself if you want.&lt;/p&gt;

&lt;p&gt;Perforce does let you use the product with fewer than five users without a license.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing the Swarm Docker Image
&lt;/h2&gt;

&lt;p&gt;After initially deploying the environment we noticed some issues: every time we triggered a redeploy of the Swarm container, Swarm stopped working - its extension configuration changed at the core.&lt;/p&gt;

&lt;p&gt;It's a feature: it's meant to update the configuration to make sure everything works after the container is recreated - tokens are shared and requests are pointed to the correct place. Or at least they should be.&lt;/p&gt;

&lt;p&gt;The official &lt;code&gt;perforce/helix-swarm&lt;/code&gt; Docker image has a hardcoded http scheme in its configure-swarm.sh script. Since our setup used an AWS Network Load Balancer (NLB) + AWS Application Load Balancer (ALB) with an SSL certificate terminating TLS in front of the Fargate service, every time the Swarm container configured the Swarm extension on the commit server it set the Swarm URL with the wrong scheme.&lt;/p&gt;

&lt;p&gt;The container only lets you configure the hostname part of the Swarm URL - http:// is hardcoded. Instead of using the Perforce provided container image, we had to create our own and host it on Amazon Elastic Container Registry (ECR). &lt;/p&gt;

&lt;p&gt;Here's a short snippet of what to put into your Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM perforce/helix-swarm

USER root

# Change hardcoded http -&amp;gt; https for SWARM_URL
RUN sed -i -E 's/http(:\/\/\$SWARM_HOST)/https\1/g' /opt/perforce/swarm/sbin/configure-swarm.sh

# Make sure image defined entry point won't interfere
ENTRYPOINT []

# Ensure the container starts as the original image would
CMD ["/bin/sh", "-c", "/opt/perforce/swarm/sbin/swarm-docker-setup.sh"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll assume here that you're familiar with Docker, and won't go into details of setting that up to build your own image.&lt;/p&gt;

&lt;p&gt;After you've built a custom image on top of perforce/helix-swarm, push it to Amazon ECR and deploy it from there (update your Terraform to point to it).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set these for your environment (example values)
PLATFORM="linux/amd64"
REGION="eu-west-1"
ACCOUNT_ID="123456789012"
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/sc-helix-swarm:latest"

echo "Building Docker image..."
docker buildx build --platform "${PLATFORM}" -t sc-helix-swarm:latest .

echo "Logging in to ECR..."
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

echo "Tagging local image as ${IMAGE_URI}..."
docker tag sc-helix-swarm:latest "${IMAGE_URI}"

echo "Pushing image to ECR..."
docker push "${IMAGE_URI}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Attempting to scale Swarm containers
&lt;/h2&gt;

&lt;p&gt;Could we run into performance issues with Swarm? When we have 500+ users, will it work? Will user experience be good?&lt;/p&gt;

&lt;p&gt;Out of the box, Swarm comes with 3 workers and runs as a rather small container on Fargate. Our estimate was that this would at some point become a bottleneck - we haven't hit issues yet.&lt;/p&gt;

&lt;p&gt;And since Swarm is behind an AWS Application Load Balancer (ALB), I assumed it would support high availability and automatic scaling. That made total sense to me. Who would release something that doesn't?&lt;/p&gt;

&lt;p&gt;But no - not the case.&lt;/p&gt;

&lt;p&gt;The template back then exposed a container count variable.&lt;/p&gt;

&lt;p&gt;So I tried running multiple Swarm containers by increasing the container count, running multiple tasks in Fargate. Everything came up looking fine, but only one in three requests actually went through - the rest got an &lt;code&gt;ERROR: Swarm communication error (Missing or invalid token)&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;Why is that? You give the container rights to update its own extension configuration on the core. So it calls home and updates both where it resides and the secret token the core uses to talk to the Swarm server. Since each container updates the central extension config with its own token, the last one to register wins and the others get invalidated. In our case one of the three Swarm tasks had a valid token while the other two had invalid ones.&lt;/p&gt;

&lt;p&gt;I thought for a moment about circumventing this - maybe I could create a container where the tokens are always the same. But in the end the containers would need to talk to each other in some way for the user experience to stay sane, and the task became too cumbersome. I hope Perforce takes another look at their setup and comes up with a more HA-friendly solution.&lt;/p&gt;

&lt;p&gt;When I asked about the topic, the official answer from Perforce support was: "Swarm does not scale though, you can only have a single Swarm service."&lt;/p&gt;

&lt;p&gt;So, more ideas: maybe I could make Swarm perform better, to avoid the errors we forecast might come.&lt;/p&gt;

&lt;p&gt;To do that, I planned to make the container fork more efficiently by replacing mpm_prefork with mpm_worker and forwarding PHP requests through PHP-FPM (php8.1-fpm).&lt;/p&gt;

&lt;p&gt;It didn't take too long to write a Dockerfile that replaced these in the default setup - after all, it's just a basic Apache+PHP configuration. I spent a few hours on it and was happy it deployed nice and clean.&lt;/p&gt;

&lt;p&gt;Turns out: threading and Swarm’s PHP stack aren’t friends.&lt;/p&gt;

&lt;p&gt;The end result was that it didn't work at all - the current Swarm PHP setup doesn't support threading.&lt;/p&gt;

&lt;p&gt;So unless you're rewriting Swarm, don't bother trying to make the container run better/faster/harder. You can give it memory/CPU and change the worker count, but beyond that, just live with it.&lt;/p&gt;
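
&lt;p&gt;For reference, the worker count lives in Swarm's config.php. A fragment sketch - the value here is illustrative, check the Swarm documentation for your version:&lt;br&gt;
&lt;/p&gt;

```php
// SWARM_ROOT/data/config.php - queue tuning fragment (illustrative value)
'queue' => array(
    'workers' => 6,   // default is 3
),
```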

&lt;h2&gt;
  
  
  Using SES to send emails from Swarm
&lt;/h2&gt;

&lt;p&gt;A request came in to have Swarm send out emails.&lt;/p&gt;

&lt;p&gt;You can configure Swarm to send emails through SES, here's the correct configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        'transport' =&amp;gt; array(
            'host' =&amp;gt; 'email-smtp.&amp;lt;region&amp;gt;.amazonaws.com',
            'port' =&amp;gt; 587,
            'connection_class' =&amp;gt; 'login',
            'connection_config' =&amp;gt; array(
                'username' =&amp;gt; '&amp;lt;SES USER KEY&amp;gt;',
                'password' =&amp;gt; '&amp;lt;SES USER SECRET&amp;gt;',
                'ssl' =&amp;gt; 'tls',
            ),
        ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would have thought someone else had done this earlier, but I was unable to find clear instructions for what needs to be set - so it took a bit of trial and error. The documentation I could find via Google wasn't too clear on what the connection_class needs to be.&lt;/p&gt;

&lt;p&gt;Again, issues with how the container in the CGD toolkit is set up: the configuration works, but it is stored on ephemeral storage and gets recreated when the container starts. So every restart - an automatic one due to an error, for example - causes email to stop going out.&lt;/p&gt;

&lt;p&gt;The setup script doesn't support passing any more detail than -e (--email-host), which ends up in the config as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    'transport' =&amp;gt; array(
        'host' =&amp;gt; '$EMAIL_HOST',
    ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no persistent storage in the Fargate container - a clear misunderstanding on our part. We (wrongly) assumed a separate volume would only be needed for user data; it turns out the config was getting wiped on every restart too.&lt;/p&gt;

&lt;p&gt;I ended up writing code to mount EFS into the Fargate container, which let the configuration persist.&lt;/p&gt;

&lt;p&gt;In case you need to add EFS to your Swarm module, here you go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define EFS file system
resource "aws_efs_file_system" "swarm" {
  creation_token = "helix-swarm-efs"
  lifecycle_policy {
    transition_to_ia = "AFTER_7_DAYS"
  }
  encrypted = true
}

# Create a mount target in the appropriate subnet
resource "aws_efs_mount_target" "swarm" {
  for_each = toset(var.helix_swarm_service_subnets)

  file_system_id  = aws_efs_file_system.swarm.id
  subnet_id       = each.key
  security_groups = [aws_security_group.swarm_efs.id]
}

# Security group to allow access to EFS from ECS tasks
resource "aws_security_group" "swarm_efs" {
  name        = "swarm-efs-sg"
  description = "Allow ECS tasks to connect to EFS"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    security_groups = [aws_security_group.helix_swarm_service_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EFS Access Point
resource "aws_efs_access_point" "swarm" {
  file_system_id = aws_efs_file_system.swarm.id

  root_directory {
    path = "/swarm"
    creation_info {
      owner_gid   = 0
      owner_uid   = 0
      permissions = "0777"
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
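
&lt;p&gt;The EFS resources above aren't enough on their own - the ECS task definition also needs a volume that references the access point, plus a matching mountPoints entry in the container definition. A hedged sketch; the task definition resource and its names are assumptions, adapt them to your module:&lt;br&gt;
&lt;/p&gt;

```terraform
# Hypothetical wiring of the EFS access point into the Swarm task
# definition - resource names are assumptions, not the toolkit's own.
resource "aws_ecs_task_definition" "helix_swarm" {
  # ... family, cpu, memory, container_definitions as before ...
  # In container_definitions, mount the volume at /opt/perforce/swarm/data:
  #   "mountPoints": [{ "sourceVolume": "swarm-data",
  #                     "containerPath": "/opt/perforce/swarm/data" }]

  volume {
    name = "swarm-data"

    efs_volume_configuration {
      file_system_id     = aws_efs_file_system.swarm.id
      transit_encryption = "ENABLED"

      authorization_config {
        access_point_id = aws_efs_access_point.swarm.id
        iam             = "ENABLED"
      }
    }
  }
}
```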



&lt;p&gt;With persistent storage in place, the email setup via SES survives restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EDIT: Note that the above example actually breaks Swarm - we noticed this after writing the blog, when it went live in production. The queue data in /opt/perforce/swarm/data/queue also ends up on the NFS when the data folder is placed there for the config. The queue cannot live on NFS, so you need to move it elsewhere - or mount an EBS volume at that folder.&lt;/strong&gt;&lt;/p&gt;
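
&lt;p&gt;A hedged sketch of that workaround - relocating the queue off NFS onto local storage and symlinking it back. The paths are assumptions based on a default Swarm install; stop Swarm before moving anything:&lt;br&gt;
&lt;/p&gt;

```shell
# Assumed paths - adjust to your install. Run while Swarm is stopped.
SWARM_DATA=/opt/perforce/swarm/data     # this directory sits on EFS/NFS
LOCAL_QUEUE=/var/perforce/swarm-queue   # local (EBS/ephemeral) location

mkdir -p "$LOCAL_QUEUE"
# Preserve any queued tasks, then replace the NFS directory with a symlink
cp -a "$SWARM_DATA/queue/." "$LOCAL_QUEUE/" 2>/dev/null || true
rm -rf "$SWARM_DATA/queue"
ln -s "$LOCAL_QUEUE" "$SWARM_DATA/queue"
```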

&lt;p&gt;Just make sure you take a look at how SSO is configured - if you're using SSO and have the parameter enabled in Terraform. (You'll see.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Install P4Prometheus Early
&lt;/h2&gt;

&lt;p&gt;Perforce can be really memory hungry in certain situations. When you have a lot of files, tags, and branches, lists of files tend to grow, and if you run commands that make Perforce look at a whole depot's storage, it can easily OOM itself. We ran into memory issues early on while testing and developing the environment, and figuring out why and where took a lot of time - and some questions back and forth with people smarter than ourselves.&lt;/p&gt;

&lt;p&gt;A tip: limit users from accessing everything, and educate them on how their workspaces need to be set up.&lt;/p&gt;

&lt;p&gt;What Perforce does suggest is to use p4prometheus, which we then did - even though it's not our monitoring tool of choice for all other environments (and still isn't).&lt;/p&gt;

&lt;p&gt;Installing p4prometheus helped pinpoint bottlenecks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahh4tkv12usrxyz7brae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahh4tkv12usrxyz7brae.png" alt="Grafana dashboard for P4prometheus" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It ships with ready-made Grafana dashboards and gives real visibility into performance - without the effort of building everything yourself. And the installation is straightforward.&lt;/p&gt;

&lt;p&gt;Highly recommended to do early on. We opted for the EC2 route - running it on a small Graviton instance. Where would you run it in your environment?&lt;/p&gt;

&lt;h2&gt;
  
  
  FSx for NetApp Ontap
&lt;/h2&gt;

&lt;p&gt;When the storage requirement goes over 16 TB, you can't run on a single Amazon Elastic Block Store (EBS) volume anymore - maximum size is maximum size. Hard limit.&lt;/p&gt;

&lt;p&gt;So what to do? Well, you could run software RAID on your instance, or use separate volumes for separate depots - but that would introduce more complexity. So rather than doing that, we opted for &lt;a href="https://aws.amazon.com/fsx/netapp-ontap/" rel="noopener noreferrer"&gt;Amazon FSx for NetApp ONTAP&lt;/a&gt; (FSxN).&lt;/p&gt;

&lt;p&gt;It scales up to 72 GB/s of throughput, up to 2.4 million IOPS, and up to 1 PiB of SSD storage.&lt;/p&gt;

&lt;p&gt;Insane numbers if you ask me, and overkill for most, but perfect when you need it. It does the trick, does everything one might think of needing.&lt;/p&gt;

&lt;p&gt;Just don't test it with the Terraform provider's default example when creating it (at the time I was building this, the CGD toolkit didn't support creating FSxN - so I learned and did it myself).&lt;/p&gt;

&lt;p&gt;Provider example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_fsx_ontap_volume" "snaplock_volume" {
  name                       = "snaplock-vol"
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.example.id
  size_in_megabytes          = 102400
  junction_path              = "/snaplock-vol"
  ontap_volume_type          = "RW"
  security_style             = "UNIX"
  tiering_policy {
    name = "SNAPSHOT_ONLY"
  }

  snaplock_configuration {
    snaplock_type = "COMPLIANCE"

    retention_period {
      default_retention {
        type  = "MONTHS"
        value = 6
      }
      minimum_retention {
        type  = "MONTHS"
        value = 6
      }
      maximum_retention {
        type  = "MONTHS"
        value = 6
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course I tried it out before I understood not to create SnapLock volumes if you don't really need them - like us, in a dev environment while testing out the architecture.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Cannot delete the volume because it contains unexpired log files.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We ended up keeping that FSxN volume for 6 months before being able to delete it - luckily it was just a tiny single-AZ deployment. Even AWS can't help you remove it; you just have to wait until the SnapLock expires.&lt;/p&gt;

&lt;p&gt;So be careful out there. &lt;/p&gt;
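
&lt;p&gt;For a dev environment, a minimal volume without a snaplock_configuration block stays deletable. A sketch mirroring the provider example - the names here are assumptions:&lt;br&gt;
&lt;/p&gt;

```terraform
# Same shape as the provider example, minus SnapLock - safe to delete
resource "aws_fsx_ontap_volume" "depot_volume" {
  name                       = "depot-vol"
  storage_virtual_machine_id = aws_fsx_ontap_storage_virtual_machine.example.id
  size_in_megabytes          = 102400
  junction_path              = "/depot-vol"
  ontap_volume_type          = "RW"
  security_style             = "UNIX"

  tiering_policy {
    name = "NONE"
  }
}
```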

&lt;h3&gt;
  
  
  Hidden pitfalls that we fell into
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Free space must exist in the Storage Virtual Machine, not just in the filesystem itself. When the SVM ran near full, Perforce started failing writes - despite what the top-level volume metrics showed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We had allocated too much of the SVM's storage to the iSCSI block volume mounted on the server. iSCSI just stopped writing at times, with no clear reason why. Storage looked like it had room - the documented percentage of free space was left - but it was in the wrong place.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Don't forget _netdev in /etc/fstab for iSCSI mounts. Missing it caused the server to hang on reboot. We rebuilt our instance a few times before catching this.&lt;/li&gt;
&lt;/ol&gt;
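
&lt;p&gt;For illustration, here's what such an fstab entry might look like - the device path, mount point, and filesystem type are assumptions from our setup, not universal values:&lt;br&gt;
&lt;/p&gt;

```
# /etc/fstab - _netdev marks the mount as network-dependent, so the OS
# won't try to mount it (or hang on it) before the iSCSI session is up
/dev/mapper/p4depots  /hxdepots  xfs  defaults,_netdev,noatime  0 0
```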

&lt;p&gt;You become blind to your own tiny mistakes - a second pair of eyes might have helped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use SDP Tools - replicating data
&lt;/h2&gt;

&lt;p&gt;Perforce's Server Deployment Package (SDP) is a gift. It gives you structure, scripts, backups, rotations—and a documented best-practices baseline.&lt;/p&gt;

&lt;p&gt;Yes, you could do it your own way. But unless you're a masochist or need a one-off snowflake deployment, just use SDP. &lt;/p&gt;

&lt;p&gt;We tried googling how to set up a Perforce edge server, ending up in documentation that really wasn't for our use case - but we tried it anyway.&lt;/p&gt;

&lt;p&gt;After some trial and error we were pointed to the &lt;code&gt;mkrep.sh&lt;/code&gt; script. It does all the required magic under the hood and outputs the remaining manual steps to get replication up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph depot replication bug
&lt;/h2&gt;

&lt;p&gt;When users started using a graph depot through edge servers, we began getting reports of &lt;code&gt;Blob data not found in archives for sha &amp;lt;sha&amp;gt;&lt;/code&gt; and of people not being able to work. For some reason, files were missing from the edge server.&lt;/p&gt;

&lt;p&gt;We quickly identified that manually copying the files over worked. OK, initial fire put out. But it lit up again after the next updates to the graph depot.&lt;/p&gt;

&lt;p&gt;I spent hours on the phone with Perforce support, walking through logs and packet traces. It became a bit of a detective story.&lt;/p&gt;

&lt;p&gt;We tcpdumped the network traffic between the edge and the commit server, and everything looked fine. The edge server sent a request upstream to the core server, and the core server dutifully responded with the blob in question. We literally saw the data leave the core and arrive at the edge — but somehow it never made it back to the client.&lt;/p&gt;

&lt;p&gt;Here's how the flow looked:&lt;br&gt;
Client -&amp;gt; Edge -&amp;gt; Core -&amp;gt; Answer to Edge -&amp;gt; Error to client&lt;/p&gt;

&lt;p&gt;What made this harder was that everything seemed healthy — no logs complained (other than the error sent to client), and replication said it succeeded.&lt;/p&gt;

&lt;p&gt;Eventually, Perforce support traced the issue to a bug in the system. Instead of writing to the mounted depot volume, the data was being written to /p4/1/root on the local file system.&lt;/p&gt;

&lt;p&gt;So if you experience this error message - take a look if those Blobs are actually written into the root folder.&lt;/p&gt;

&lt;p&gt;A temporary fix is to create a symlink to the depot volume.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ln -s /p4/1/depots/&amp;lt;graphdepot&amp;gt; /p4/1/root/&amp;lt;graphdepot&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This causes the data to be written to the correct place, from which p4 then serves it to the client. I expect Perforce to fix this in a future release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Running Perforce on AWS can absolutely be done - but sometimes you'll need to look under the hood and adapt. Remember to set up observability early, and don't hesitate to call Perforce support if/when you hit weird replication issues. And remember: Swarm doesn't scale horizontally.&lt;/p&gt;

</description>
      <category>perforce</category>
      <category>aws</category>
      <category>docker</category>
    </item>
    <item>
      <title>Automating Well-Architected reviews</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Tue, 18 Jun 2024 12:29:43 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/automating-well-architected-reviews-2m6c</link>
      <guid>https://forem.com/niklaswesterstrahle/automating-well-architected-reviews-2m6c</guid>
      <description>&lt;p&gt;I've been leading our Well-Architected partnership at Knowit for some years now. &lt;/p&gt;

&lt;p&gt;Last year, AWS introduced a workshop for using automation in conducting reviews, and it piqued my interest. Automating reviews can significantly enhance efficiency and accuracy, so I decided to delve into it further.&lt;/p&gt;

&lt;p&gt;To gain deeper insights, I traveled to the AWS offices in Munich to join the Well-Architected team for a day-long workshop, doing hands-on labs against prebuilt environments to see what's possible. The experience was enlightening and reinforced the potential benefits of automating reviews wherever possible.&lt;/p&gt;

&lt;p&gt;Integrating this into our reviews at Knowit required minimal discussion.&lt;/p&gt;

&lt;p&gt;For the rest of this blog post, I assume you're familiar with the Well-Architected Framework by AWS. If you're not, take a look at the link below before reading on:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/architecture/well-architected/"&gt;https://aws.amazon.com/architecture/well-architected/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Can Well-Architected be automated?
&lt;/h2&gt;

&lt;p&gt;Not all aspects of the Well-Architected Framework can be automated. For example, understanding how people operate, assessing the client’s processes, and their perspective on their workload are inherently human tasks. So be prepared to still talk to people.&lt;/p&gt;

&lt;p&gt;However, technical aspects, especially those within the Security pillar, along with some Cost Optimization and Sustainability items, can be automated as they align closely with your system’s usage and data.&lt;/p&gt;

&lt;p&gt;If you can formulate a request to AWS APIs that will provide the answers you need, it can be automated.&lt;/p&gt;
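&lt;p&gt;To make that concrete, here's a sketch (not Prowler's actual implementation) of turning an API response into findings. The thresholds are illustrative, not official Well-Architected values; in practice the dict would come from boto3's &lt;code&gt;iam.get_account_password_policy()&lt;/code&gt;.&lt;/p&gt;

```python
# Sketch: evaluate an IAM password policy against an illustrative baseline.
# In practice the dict would come from
# boto3.client("iam").get_account_password_policy()["PasswordPolicy"];
# the thresholds here are examples, not official Well-Architected values.
def check_password_policy(policy: dict) -> list:
    findings = []
    if policy.get("MinimumPasswordLength", 0) < 14:
        findings.append("Minimum password length is below 14")
    if not policy.get("RequireSymbols", False):
        findings.append("Symbols are not required")
    max_age = policy.get("MaxPasswordAge")
    if not max_age or max_age > 90:
        findings.append("Passwords do not expire within 90 days")
    return findings
```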

&lt;h2&gt;
  
  
  What did we do?
&lt;/h2&gt;

&lt;p&gt;In our quest to make the Well-Architected practitioner’s (the person conducting the review) work easier, we took a look at multiple approaches. We found that integrating Prowler and Steampipe into our workflow was particularly effective.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Collection with Prowler: We used Prowler, a security tool that performs AWS security best practices assessments, audits, incident response, continuous monitoring, and hardening. Prowler crawled the AWS environment to gather insights into the client's security posture.&lt;/li&gt;
&lt;li&gt;Data Aggregation with Steampipe: Steampipe allowed us to query cloud resources using SQL. We used it to aggregate data across multiple AWS accounts, making it easier to gather comprehensive insights from accounts in AWS organizations.&lt;/li&gt;
&lt;li&gt;Centralized Reporting: The data collected through Prowler was fed into the AWS Well-Architected Tool in a centralized account. This allowed us to consolidate our findings and generate a comprehensive report.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Running these automatically every month gives you insight into how your environment is progressing.&lt;/p&gt;

&lt;p&gt;Another tool worth mentioning is &lt;a href="https://former2.com"&gt;Former2&lt;/a&gt;; I've used it a few times to create visualizations for the review report, in the form of architectural diagrams generated from the environment in the target account.&lt;/p&gt;

&lt;p&gt;By running these scans as a prerequisite to the review workshop, the practitioner can gain a better understanding of the client’s environment before the workshop even begins. This allows for a more informed discussion of the findings during the workshop, moving beyond assumptions to having an automated, data-driven view into the account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to scan your own account for findings?
&lt;/h2&gt;

&lt;p&gt;I recommend you try Prowler and see what kind of findings are in your AWS account. Try the following on your Mac (if you're on Windows, consider running this in AWS Cloud9). The example assumes you already have the AWS CLI installed and read-only access working to your AWS account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install prowler
brew install prowler

# run the scan
prowler aws -f &amp;lt;insert your aws region here&amp;gt; --compliance aws_well_architected_framework_security_pillar_aws -p &amp;lt;aws-cli-profile&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll get something along these lines (sensitive information redacted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; _ __  _ __ _____      _| | ___ _ __
| '_ \| '__/ _ \ \ /\ / / |/ _ \ '__|
| |_) | | | (_) \ V  V /| |  __/ |
| .__/|_|  \___/ \_/\_/ |_|\___|_|v4.2.4
|_| the handy multi cloud security tool

Date: today, right now

-&amp;gt; Using the AWS credentials below:
  · AWS-CLI Profile: your-profile
  · AWS Regions: eu-west-1
  · AWS Account: account-id

-&amp;gt; Using the following configuration:
  · Config File: config-yaml
  · Scanning unused services and resources: False

Executing 227 checks, please wait...
-&amp;gt; Scan completed! |▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉| 227/227 [100%] in 1:34.9 

Overview Results:
╭─────────────────────┬─────────────────────┬────────────────╮
│ 37.91% (105) Failed │ 60.29% (167) Passed │ 0.0% (0) Muted │
╰─────────────────────┴─────────────────────┴────────────────╯

Account account-id Scan Results (severity columns are for fails only):
╭────────────┬───────────────┬───────────┬────────────┬────────┬──────────┬───────┬─────────╮
│ Provider   │ Service       │ Status    │   Critical │   High │   Medium │   Low │   Muted │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ account       │ PASS (0)  │          0 │      0 │        0 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ cloudtrail    │ FAIL (4)  │          0 │      0 │        1 │     3 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ cloudwatch    │ FAIL (15) │          0 │      0 │       15 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ config        │ FAIL (1)  │          0 │      0 │        1 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ ec2           │ FAIL (12) │          0 │      1 │        8 │     3 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ emr           │ PASS (1)  │          0 │      0 │        0 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ guardduty     │ FAIL (1)  │          0 │      0 │        1 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ iam           │ FAIL (56) │          1 │     27 │       21 │     7 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ macie         │ FAIL (1)  │          0 │      0 │        0 │     1 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ organizations │ FAIL (1)  │          0 │      0 │        1 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ route53       │ FAIL (3)  │          0 │      0 │        3 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ s3            │ FAIL (9)  │          0 │      1 │        8 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ securityhub   │ FAIL (1)  │          0 │      0 │        1 │     0 │       0 │
├────────────┼───────────────┼───────────┼────────────┼────────┼──────────┼───────┼─────────┤
│ aws        │ vpc           │ FAIL (1)  │          0 │      0 │        1 │     0 │       0 │
╰────────────┴───────────────┴───────────┴────────────┴────────┴──────────┴───────┴─────────╯
* You only see here those services that contains resources.

Detailed results are in:
 - JSON-OCSF: output/prowler-output-accountid-datetime.ocsf.json
 - CSV: output/prowler-output-accountid-datetime.csv
 - HTML: output/prowler-output-accountid-datetime.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now take a look at the output files to see where your environment is failing, and whether you can remediate the findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://steampipe.io/"&gt;https://steampipe.io/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/prowler-cloud/prowler"&gt;https://github.com/prowler-cloud/prowler&lt;/a&gt;&lt;br&gt;
&lt;a href="https://former2.com"&gt;https://former2.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The pictures in this blog post are AI generated, and have clear mistakes in them. They're for visual illustration only.&lt;/p&gt;

</description>
      <category>wellarchitectedframework</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Shared VPC: The Good, the Bad, and the Ugly</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Wed, 03 Jan 2024 11:07:02 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/aws-shared-vpc-the-good-the-bad-and-the-ugly-332d</link>
      <guid>https://forem.com/niklaswesterstrahle/aws-shared-vpc-the-good-the-bad-and-the-ugly-332d</guid>
      <description>&lt;p&gt;I've been waiting to find the time to write on the shared VPC model that AWS offers as a possibility. It's now been close to 3 years since we built the setup for a client, and been running it ever since.&lt;/p&gt;

&lt;p&gt;Key challenges that we wanted to tackle with the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low cost network setup&lt;/li&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;Secure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a few years, I can say the low cost is there, but simple to maintain? Far from it. And from a security perspective, limiting traffic is a lot harder than one would expect, since by default AWS allows and routes traffic within a VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  The more traditional way
&lt;/h2&gt;

&lt;p&gt;We're all familiar with using a transit VPC or a Transit Gateway solution to connect VPCs from accounts in the organisation together. I won't dive into that here; if you were looking for one of those solutions, do another Google search :)&lt;/p&gt;

&lt;p&gt;The picture below is just a reminder of how best practices would advise you to build connectivity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--19m69fVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3lzz6mho21qatqpnsfdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--19m69fVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3lzz6mho21qatqpnsfdx.png" alt="Traditional multiaccount networking" width="594" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What if we just use one VPC, shared to multiple accounts?
&lt;/h2&gt;

&lt;p&gt;We build an organisation with several accounts: for networking purposes we have account A, and for simplicity let's assume we have accounts B, C and D for various environments.&lt;/p&gt;

&lt;p&gt;I borrowed a picture to illustrate this.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bSFMd29A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luaqhkd2u275ug7d9j08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bSFMd29A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luaqhkd2u275ug7d9j08.png" alt="Shared VPC model" width="612" height="476"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Sharing a subnet within the organization
&lt;/h2&gt;

&lt;p&gt;Sharing a subnet is actually really simple; the only prerequisite is that resource sharing must be enabled in AWS Organizations. After it's enabled, you just specify the target principals and the subnets to share.&lt;/p&gt;

&lt;p&gt;Here's an example in CloudFormation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ResourceShareSubnets:
    Type: 'AWS::RAM::ResourceShare'
    Properties:
      AllowExternalPrincipals: false
      Name: subnet-share
      Principals:
        - "&amp;lt;target account id&amp;gt;"
      ResourceArns:
        - !Sub 'arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:subnet/${Subnet1}'
        - !Sub 'arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:subnet/${Subnet2}'
        - !Sub 'arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:subnet/${Subnet3}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you share a subnet, it becomes visible in the targeted principal's account, and resources can be created in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Good: Cost aspects and latency
&lt;/h2&gt;

&lt;p&gt;Instead of paying the current list price for Transit Gateway attachments ($0.05/hour per attachment) plus $0.02 per GB of data processed, the network traffic between accounts costs only the $0.01 per GB that traffic costs within the same region.&lt;/p&gt;

&lt;p&gt;So for high traffic volumes between environments, having them run in the same VPC but in their own accounts saves on traffic costs.&lt;/p&gt;
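&lt;p&gt;A back-of-envelope comparison, using the list prices above with a hypothetical 10 TB of monthly inter-account traffic and four attachments:&lt;/p&gt;

```python
# Back-of-envelope monthly comparison using the list prices quoted above.
# Traffic volume (10 TB) and attachment count (4) are hypothetical.
HOURS_PER_MONTH = 730
gb_per_month = 10 * 1024

# Transit Gateway: hourly attachment fees plus per-GB data processing
tgw_cost = 4 * 0.05 * HOURS_PER_MONTH + gb_per_month * 0.02

# Shared VPC: only the intra-region per-GB transfer cost
shared_vpc_cost = gb_per_month * 0.01

print(round(tgw_cost, 2), round(shared_vpc_cost, 2))
```

&lt;p&gt;Roughly $350 vs. $102 per month in that scenario; your numbers will vary with attachment count and traffic volume.&lt;/p&gt;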

&lt;p&gt;Latency is also a bit lower:&lt;br&gt;
rtt min/avg/max/mdev = 0.572/0.614/0.675/0.031 ms&lt;br&gt;
compared to going through a Transit Gateway in the same AZ:&lt;br&gt;
rtt min/avg/max/mdev = 0.877/0.994/1.400/0.150 ms&lt;/p&gt;

&lt;p&gt;If you're chasing the lowest latency, take a look at sharing the VPC.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Bad: Operating and managing the network
&lt;/h2&gt;

&lt;p&gt;Segmentation and limiting traffic is a bit painful to put it mildly.&lt;/p&gt;

&lt;p&gt;You can use NACLs around the subnets to limit traffic, but that's about it. From a network engineer's perspective that may be enough in some cases, but in others it's not; it depends on who you talk to, and what kind of architecture and security they require.&lt;/p&gt;

&lt;p&gt;Roughly put - we blocked traffic between environments B, C and D, allowed them just to talk to A. Traffic goes in and out to internet from A through the appliances running there.&lt;/p&gt;

&lt;p&gt;The appliances also route traffic between the subnets, as the subnets themselves can't talk to each other directly.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Ugly: Tags are not copied over
&lt;/h2&gt;

&lt;p&gt;1) The most annoying feature of the shared VPC model is that RAM only shares the subnet resources to the target account. It doesn't share resource tags, which means you see the resource IDs in the target account but nothing else.&lt;/p&gt;

&lt;p&gt;Let's say you have an environment that has multiple subnets shared to an account, for different purposes. Identifying those subnets becomes a pain. You might be running a Kubernetes cluster, that wants networks to be tagged in certain ways for it to know where to place resources.&lt;/p&gt;

&lt;p&gt;We solved this by deploying a role across the whole organisation for copying tags, and triggering a Lambda function from EventBridge events associated with network tags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Type: AWS::Events::Rule
    Properties: 
      Description: "Rule for matching Network TAG changes"
      EventPattern:
        source: 
          - "aws.tag"
        detail-type:
          - "Tag Change on Shared Network Resource"
        detail:
          service: 
            - "ec2"
          resource-type:
            - "vpc"
            - "subnet"
            - "route-table"
            - "network-acl"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rights you'll need are below, for a solution that first checks which resources have been shared and how they are tagged, and then only creates/deletes the tags that need to change.&lt;/p&gt;

&lt;p&gt;Note that the example below is not least-privilege: it would allow tagging of all EC2 resources. Tighten it to allow only the resources you need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - ec2:CreateTags
                  - ec2:DeleteTags
                  - ec2:DescribeTags
                  - ec2:DescribeSubnets
                  - ec2:DescribeVpcs
                  - ec2:DescribeRouteTables
                  - ec2:DescribeNetworkAcls
                Resource: "*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To limit the number of requests on update, our flow is to check which tags are already copied, which have changed, and which even exist in the target account.&lt;/p&gt;

&lt;p&gt;I wish I could share the code for that, but sadly the IPR belongs to the client. Still, you get the idea of what you need to accomplish, so go build.&lt;/p&gt;
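&lt;p&gt;The core diff step is simple enough to sketch, though. The names below are mine, not the client's code:&lt;/p&gt;

```python
# Minimal sketch of the tag-copy diff step: compare the tags on the shared
# resource in the owner account with the tags already present in the target
# account, and compute only the changes needed. Names are illustrative.
def diff_tags(source_tags: dict, target_tags: dict):
    to_create = {k: v for k, v in source_tags.items() if target_tags.get(k) != v}
    to_delete = [k for k in target_tags if k not in source_tags]
    return to_create, to_delete
```

&lt;p&gt;The input dicts would be built from ec2:DescribeTags results, and the output fed to ec2:CreateTags and ec2:DeleteTags.&lt;/p&gt;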

&lt;p&gt;2) Another ugly duckling is GuardDuty false positives.&lt;/p&gt;

&lt;p&gt;In our setup, internet traffic routes through a VPN to the client's datacenter and goes out from there, since everything is required to go through the firewalls there. In the future, the firewalls will be extended to AWS.&lt;/p&gt;

&lt;p&gt;I'm not 100% sure whether this would happen whenever an instance role is used from the wrong account, or only because the credentials appear to be used from outside AWS itself.&lt;/p&gt;

&lt;p&gt;GuardDuty flags instance role credentials as a possible security issue when they're used outside the environment they reside in.&lt;/p&gt;

&lt;p&gt;So our instances calling services to make changes pop up in GuardDuty and raise a flag.&lt;/p&gt;

&lt;p&gt;Which makes sense: those credentials shouldn't be used anywhere but within the account in question, so routing traffic through other accounts or out through an on-site VPN connection should raise a flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Afterthoughts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Would I do this again?&lt;br&gt;
Short answer: no, and you shouldn't either, unless you're sure it'll work perfectly for your use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Would I have done this in the first place?&lt;br&gt;
No, I wouldn't. However, the client insisted on building the network this way.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having done it, though, I do see the ingenuity in it. But it still came with far too hard operability: no one really has full insight into how everything works.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>vpc</category>
      <category>resourceaccessmanager</category>
    </item>
    <item>
      <title>Deepracer racing in real life (Q2/23)</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Tue, 20 Jun 2023 12:00:23 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/deepracer-racing-in-real-life-q223-n5k</link>
      <guid>https://forem.com/niklaswesterstrahle/deepracer-racing-in-real-life-q223-n5k</guid>
      <description>&lt;p&gt;Continuing from where I left off, if you haven't read my earlier Deepracer blog post, read it first - &lt;a href="https://dev.to/niklaswesterstrahle/my-experience-starting-out-with-deepracer-q422-4bjh"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running on On-Demand instances
&lt;/h2&gt;

&lt;p&gt;I updated my AWS account's limits by making a limit increase request at &lt;a href="https://console.aws.amazon.com/servicequotas"&gt;Service Quotas&lt;/a&gt;, and after a day or two I was able to run the G/P instance families on demand, if only at the minimum of 4 cores. However, that's all I need: a GPU and a few cores.&lt;/p&gt;

&lt;p&gt;Why On-Demand, you might ask: the cost of running training in the DeepRacer console for extended periods of time is just too much for my wallet, at 3.5 USD/hour compared to 0.558 USD/hour (g4dn.xlarge).&lt;/p&gt;

&lt;p&gt;So I take on a little more hassle setting up the environment, but I get faster iterations (at least it feels like it) at a cheaper price.&lt;/p&gt;

&lt;p&gt;With Spot Instances you can go even cheaper, but I like being in control of when the instance dies. Just remember not to leave it running.&lt;/p&gt;
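&lt;p&gt;The savings add up quickly. A rough comparison for a hypothetical 40 hours of training at the rates above:&lt;/p&gt;

```python
# Rough comparison for a hypothetical 40 hours of training,
# using the hourly rates quoted above.
hours = 40
console_cost = 3.5 * hours    # DeepRacer console
ec2_cost = 0.558 * hours      # g4dn.xlarge on-demand

print(console_cost, round(ec2_cost, 2))
```

&lt;p&gt;That's 140 USD vs. roughly 22 USD, around a 6x difference.&lt;/p&gt;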

&lt;h2&gt;
  
  
  Getting started fast with training on instance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/aws-deepracer-community/deepracer-for-cloud
cd deepracer-for-cloud/bin/; ./prepare.sh
sudo reboot

cd deepracer-for-cloud/bin/; ./init.sh -c aws -a gpu
source activate.sh; cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit system.env and run.env.&lt;br&gt;
Edit your reward function and training parameters.&lt;br&gt;
Start training.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dr-start-training -q -w&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  First tips I got from the pros (re:Invent)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Only train at slow speed&lt;/li&gt;
&lt;li&gt;Train multiple tracks&lt;/li&gt;
&lt;li&gt;Don't overtrain&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I didn't really understand why slow; don't we want it to drive as fast as possible? But here's where the real world and the virtual one differ. The car only has a throttle from 0 to 100, and training at lower speeds in the virtual environment gets the model around the track much sooner. And in the real world, if you give the car 100% throttle, it'll drive at the car's max speed anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training my models
&lt;/h2&gt;

&lt;p&gt;I started training the models with reward functions similar to what I used for the real-world track: setting a race line and having the car follow it.&lt;/p&gt;

&lt;p&gt;I did that for 2-3 tracks: a short training cycle, then hop to the next track. Maybe 8-12 hours per model.&lt;/p&gt;

&lt;p&gt;They looked like they'd make it around the track, maybe around 11s on the virtual track.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting hands on - AWS Summits 2023
&lt;/h2&gt;

&lt;p&gt;I had the pleasure of taking the trip to Berlin and Stockholm, neither for DeepRacer racing: I presented in Berlin, and manned our company stand in Stockholm.&lt;/p&gt;

&lt;p&gt;In Berlin there was a track and I had my models prepared so I could shine in the real world. Or so I thought.&lt;/p&gt;

&lt;p&gt;The track was the original one that DeepRacer racing started off from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3lUtwL3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1jibnn1d77bfdidkhy7y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3lUtwL3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1jibnn1d77bfdidkhy7y.jpg" alt="Deepracer track Berlin Summit" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I tried out multiple models, but in the end had to accept that my models were nowhere close to making it around the track. Maybe too sophisticated an approach.&lt;/p&gt;

&lt;p&gt;Thanks to DBro for the short but enlightening talk trackside, and for some additional hints on which way to continue. Networking is the best part of these events.&lt;/p&gt;

&lt;p&gt;I had hoped there would be a track at AWS Summit Stockholm as well - but sadly this year there wasn't one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips for next try
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Have high entropy for your models&lt;/li&gt;
&lt;li&gt;Discrete action space&lt;/li&gt;
&lt;li&gt;Alternate driving directions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;First of all, entropy plays a significant role in balancing exploration and exploitation during training. When referring to setting "high entropy" for real-world models, it means encouraging more exploration during the learning process. In real-world scenarios, track conditions, lighting, and obstacles can vary, and exploration allows the model to learn robust policies that can adapt to different situations.&lt;/p&gt;

&lt;p&gt;By encouraging exploration, high entropy helps the model generalize its learned policies to handle novel or unseen situations. It prevents the model from becoming overly specialized to a specific track or set of conditions, making it more adaptable to different environments.&lt;/p&gt;

&lt;p&gt;This is also where running alternate tracks and directions comes into play: it generalizes the model, so it knows more than just one optimally lit virtual track where it needs to drive the exact race line.&lt;/p&gt;

&lt;p&gt;I also now understand how the action space affects the model, and why a discrete action space makes more sense for real-world models. I need to look into how to make an optimal one for real-world applications, as it's not just math for the perfect line on one track.&lt;/p&gt;

&lt;p&gt;Discrete Action Space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finite set of predefined actions.&lt;/li&gt;
&lt;li&gt;Precise control and interpretability.&lt;/li&gt;
&lt;li&gt;Larger action space and sparse rewards, requiring more training samples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous Action Space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous range of actions.&lt;/li&gt;
&lt;li&gt;Smooth control and flexibility.&lt;/li&gt;
&lt;li&gt;Requires fewer training samples, but challenges in exploration.&lt;/li&gt;
&lt;li&gt;Infinite possibilities for modeling complex behaviors.&lt;/li&gt;
&lt;/ul&gt;
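&lt;p&gt;To illustrate, a hypothetical discrete action space could look like this. The values are illustrative, with a deliberately low top speed per the tips above:&lt;/p&gt;

```python
# Hypothetical discrete action space: each action is a fixed
# (steering_angle, speed) pair. Values are illustrative, with a
# deliberately low top speed for real-world driving.
steering_angles = [-30, -15, 0, 15, 30]  # degrees
speeds = [0.8, 1.2]                      # m/s

action_space = [
    {"steering_angle": angle, "speed": speed}
    for angle in steering_angles
    for speed in speeds
]
print(len(action_space))
```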

&lt;p&gt;Driving both ways just gives the model more input and more possibilities to learn. And you never know which way you'll need to run the tracks in the future.&lt;/p&gt;

&lt;p&gt;Then just train a bit on the track in question, to make sure your model is working on that exact track.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thoughts for the future
&lt;/h2&gt;

&lt;p&gt;To really ace models in the physical world, you need to have a track, period. It's really hard to show up with multiple pretrained models and go through them one by one hoping to find one that drives OK.&lt;/p&gt;

&lt;p&gt;I feel I didn't have enough time to train my models, even though for a physical car I'm told you should only train briefly; it's not a multi-day training session on one track like the virtual circuits are.&lt;/p&gt;

&lt;p&gt;I do have a track and a car available at our company office, so I'm one of the lucky ones with the possibility of putting in loads more time: setting it up, calibrating the car, testing it live. That's what I'll write more about before re:Invent this year, where I aim to participate with some new models.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>deepracer</category>
    </item>
    <item>
      <title>My experience starting out with Deepracer (Q4/22)</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Mon, 02 Jan 2023 13:19:44 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/my-experience-starting-out-with-deepracer-q422-4bjh</link>
      <guid>https://forem.com/niklaswesterstrahle/my-experience-starting-out-with-deepracer-q422-4bjh</guid>
      <description>&lt;h2&gt;
  
  
  What is Deepracer
&lt;/h2&gt;

&lt;p&gt;I don't think I'll spend too much time writing about the history of deepracer, or what it is. You can read up on it on AWS website &lt;a href="https://aws.amazon.com/deepracer/" rel="noopener noreferrer"&gt;https://aws.amazon.com/deepracer/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Put really briefly: it's gamified reinforcement learning, and my stepping stone into learning machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why on earth did I start now? Or didn't start earlier?
&lt;/h2&gt;

&lt;p&gt;In short, I have to thank a colleague for getting me started. He told me he's organising an event at our company where we're going to race live on a real track during H1/23: the Knowit League. So I figured it's time to start learning how to do this.&lt;/p&gt;

&lt;p&gt;Earlier I thought this was just too hard. I saw Jouni Luoma (an ex-colleague) competing at prior events at the Stockholm Summit and going on to compete at re:Invent. As I hold him to be an experienced AI and machine learning guru, I thought DeepRacer racing was beyond me. Boy, how mistaken I was.&lt;/p&gt;

&lt;p&gt;The only thing I blame Victor for is me losing loads of credits running DeepRacer training. And myself, for not picking this up earlier: it's lots of fun.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I started learning
&lt;/h2&gt;

&lt;p&gt;First I took a look at the console and started training an example model, just to see what happens. That'll get you driving around the track quite fast.&lt;/p&gt;

&lt;p&gt;But I wanted to do more and learn more. So some googling followed, and I found blog posts with code snippets that others had used. I copied, pasted, and put some thought into what these pieces do and how they would help me.&lt;/p&gt;

&lt;p&gt;I tried things like: are the wheels pointing towards where the track is going?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate the difference between the track direction and the heading direction of the car
   direction_diff = abs(track_direction - params['heading'])
   if direction_diff &amp;gt; 180:
        direction_diff = 360 - direction_diff

   abs_heading_reward = 1 - (direction_diff / 180.0)
   heading_reward = abs_heading_reward * heading_weight

# Reward if steering angle is aligned with direction difference
   abs_steering_reward = 1 - (abs(params['steering_angle'] - direction_diff) / 180.0)
   steering_reward = abs_steering_reward * steering_weight

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
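&lt;p&gt;The snippet above assumes track_direction is already available; it's typically derived from the two waypoints closest to the car, something like this:&lt;/p&gt;

```python
import math

# Derive the track direction (degrees) from the two waypoints closest to the
# car. params['waypoints'] and params['closest_waypoints'] are standard
# DeepRacer reward-function inputs.
def track_direction_deg(params: dict) -> float:
    waypoints = params["waypoints"]
    prev_point = waypoints[params["closest_waypoints"][0]]
    next_point = waypoints[params["closest_waypoints"][1]]
    return math.degrees(
        math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0])
    )
```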



&lt;p&gt;Or, if we're on a straight: more points for going as fast as the car can go.&lt;/p&gt;

&lt;p&gt;I trained a model that does the re:Invent 2017 track consistently in about 11s on the virtual track. But now I need to wait to get to try it on a real track.&lt;/p&gt;

&lt;p&gt;Luckily I have access to a track; I just need to find a space for it and borrow a car. I'll write another blog post with experiences from that after I have it set up.&lt;/p&gt;

&lt;p&gt;I have since learned that they're most likely not going to work at all. :D Real-world models are different, but more on that later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Joining the October Qualifier 2022 - Open Division
&lt;/h2&gt;

&lt;p&gt;My competitive side took over after I got the model "done" for our internal race, and I decided to try my luck in the October Open, which the DeepRacer League runs on the AWS Virtual Circuit in the console; you can submit your models to race automatically once they've completed training.&lt;/p&gt;

&lt;p&gt;Watching YouTube videos of people discussing hyperparameters and other ways to make your DeepRacer train better, one topic was the idea of a race line: the fastest DeepRacer teams seem to follow a race line instead of following the track itself.&lt;/p&gt;

&lt;p&gt;It's like an F1 car: it drives around the track on the shortest, fastest possible route. &lt;/p&gt;

&lt;p&gt;DeepRacer should be able to do this as well - even if you just tell it to "run as fast as you can, get more reward for a faster time". Given enough time to train, it would find that route.&lt;/p&gt;

&lt;p&gt;But to train it a bit faster, I went on to learn how to calculate a race line, using &lt;a href="https://github.com/dgnzlz/Capstone_AWS_DeepRacer" rel="noopener noreferrer"&gt;https://github.com/dgnzlz/Capstone_AWS_DeepRacer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I started up an Amazon SageMaker notebook instance, downloaded the GitHub repo and followed the notebook. I ran into some errors and needed to tweak a few things, but in the end I had a nice-looking route both for our internal race on the 2017 track (the only one we have as a physical track) and for the October qualifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzvqnxfumtv9lnueugb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzvqnxfumtv9lnueugb5.png" alt="reinvent_base race line"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the most part I used their reward function as a base - honestly, I'm not even 100% sure which one I ended up using. There were a few additions from other snippets; it might have worked without them as well. I tend to do too many things at once. :)&lt;/p&gt;
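&lt;p&gt;The core idea is to reward the car for hugging the precomputed race line instead of the centerline. A minimal sketch of how that can look (the &lt;em&gt;RACE_LINE&lt;/em&gt; points here are made up - in reality you'd paste in the list the notebook produces):&lt;/p&gt;

```python
import math

# Hypothetical race line: a list of (x, y) points; a real one from the
# notebook is dense enough that nearest-point distance is a good proxy.
RACE_LINE = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 1.5)]

def race_line_reward(params, race_line=RACE_LINE):
    """Reward staying close to the precomputed race line."""
    x, y = params["x"], params["y"]
    # Distance to the nearest race-line point.
    dist = min(math.hypot(x - px, y - py) for px, py in race_line)
    half_width = params["track_width"] / 2.0
    # 1.0 on the line, decaying to near zero half a track width away.
    return float(max(1e-3, 1.0 - dist / half_width))
```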

&lt;p&gt;Then days of training - multiple days. I think my model ran for 3-5 days in total, in 12-18 hour runs.&lt;/p&gt;

&lt;p&gt;It still drove off track sometimes, but I resubmitted the model, and on the second try it got around without driving off track. I stopped there.&lt;/p&gt;

&lt;p&gt;I ended up #10 out of 3940 models submitted - which I think is an excellent position. Sure, it's 9.600s slower than the winner, but those folks have been doing this for years, and I'd been at it mere days. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefb3d7j7apql0pu6ym1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefb3d7j7apql0pu6ym1s.png" alt="October Qualifier 2022 leaderboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hear people have automation that submits the same model multiple times for evaluation, since the same model can drive faster in ideal conditions. Next time I'll know to submit a few extra times - it might save a second or a few from my time.&lt;/p&gt;

&lt;h2&gt;
  
  
  500 USD later
&lt;/h2&gt;

&lt;p&gt;I kept the AWS bill in mind and took a look at what running DeepRacer in the console costs: at 3.50 USD per training hour, it racks up quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnr0w125qxmwbbxqvyvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffnr0w125qxmwbbxqvyvd.png" alt="October bill"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My aim was to unlock the Pro division for the next races, so I could take part in the monthly races that can get you invited to a future re:Invent as a competitor in that year's championship. And that was easily done.&lt;/p&gt;

&lt;p&gt;On the positive side, I got some nice swag for finishing the October Open in the top 10% - so you could say it's an "expensive hoodie and a cap". &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F363dwot83awhazygowbk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F363dwot83awhazygowbk.jpg" alt="SWAG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it cheaper?
&lt;/h2&gt;

&lt;p&gt;As I can't keep putting 500 USD a month into this hobby, I needed to figure out how to do this more cheaply in the future. It turns out you can train a model on your own laptop, or run training on an EC2 instance - spot or on-demand. You could even join the dark side and run it on Azure - but let's not go there.&lt;/p&gt;

&lt;p&gt;I decided to run training on spot instances - which is actually more annoying than one would think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5x6uqz3umu6foysfmpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5x6uqz3umu6foysfmpu.png" alt="Spot training costs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the price per training hour is a lot lower. And my feeling is that training actually goes faster when I run multiple training sessions in parallel.&lt;/p&gt;

&lt;p&gt;You can find more details on how to do this at &lt;a href="https://github.com/aws-deepracer-community/deepracer-for-cloud" rel="noopener noreferrer"&gt;https://github.com/aws-deepracer-community/deepracer-for-cloud&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think I simply ran out of time to get it running on spot properly. Spot instances kept dying, and continuing a model from a pre-existing one seemed to make it do worse than training from scratch - at least according to some graphs (I'm still a newbie at understanding what all the graphs tell me).&lt;/p&gt;
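&lt;p&gt;One thing that would have helped with the dying instances: EC2 publishes a two-minute interruption warning on the instance metadata endpoint, so training could checkpoint before the instance goes away. A sketch of the parsing side (fetching http://169.254.169.254/latest/meta-data/spot/instance-action is left out - it returns 404 until a notice is pending):&lt;/p&gt;

```python
import json

def parse_instance_action(body):
    """Return the pending interruption action ('terminate', 'stop' or
    'hibernate') from the metadata response body, or None if there is none."""
    if not body:
        return None
    doc = json.loads(body)
    return doc.get("action")

def should_checkpoint(body):
    """Snapshot the model as soon as any interruption notice appears."""
    return parse_instance_action(body) is not None
```

A small loop polling that endpoint every few seconds and uploading the latest checkpoint to S3 when `should_checkpoint` flips to True would have saved me a few restarts from scratch.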

&lt;p&gt;I ended up trying to get a model to converge on the re:Invent 2022 Championship track, but mostly my model just kept turning right, never getting around the track. I'm not sure which parts contributed to that. I need to try again for the next races and tracks.&lt;/p&gt;

&lt;p&gt;I also don't yet know how to analyse training logs, or what they would tell me. I can get some details out of a training run - but what should I be looking for, and how should that feed back into improving my reward function?&lt;/p&gt;
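&lt;p&gt;For what it's worth, a first step I've seen in community tooling is summing the reward per episode from the &lt;em&gt;SIM_TRACE_LOG&lt;/em&gt; lines in the training logs - a rising trend suggests the model is converging. A sketch, with the field positions as assumptions you should check against a real log line from your own job:&lt;/p&gt;

```python
from collections import defaultdict

# Field positions are an assumption based on community tooling -- check one real
# SIM_TRACE_LOG line from your own training job and adjust the indexes to match.
EPISODE_IDX, REWARD_IDX = 0, 8

def reward_per_episode(lines):
    """Sum the per-step reward for each episode found in the log lines."""
    totals = defaultdict(float)
    for line in lines:
        if "SIM_TRACE_LOG:" not in line:
            continue
        fields = line.split("SIM_TRACE_LOG:", 1)[1].split(",")
        totals[int(fields[EPISODE_IDX])] += float(fields[REWARD_IDX])
    return dict(totals)
```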

&lt;p&gt;I'm also going to request access to on-demand P instances, as I'd rather pay the on-demand price than fight with dying spot instances. It will still be a fraction of the cost compared to the DeepRacer console.&lt;/p&gt;

&lt;p&gt;In case you weren't aware: to use these instance types on AWS you need to request a quota increase from 0, explain your use case, and go through an approval process. The request is also made per region. &lt;/p&gt;

&lt;p&gt;I initially requested us-east-1, and later eu-north-1 (as it's closer to me). I can now run, I think, two spot instances per region for machine learning training.&lt;/p&gt;

&lt;h2&gt;
  
  
  re:Invent 2022 - the reality of physical cars hits me
&lt;/h2&gt;

&lt;p&gt;I had the pleasure of attending re:Invent 2022, and since there were tracks on site, I tried some of my mostly-failing models on the Championship track. Just to watch them fail gloriously: stop in the middle of the track and simply not move.&lt;/p&gt;

&lt;p&gt;This was mind-blowing - the cars don't behave at all like the models in the virtual environment.&lt;/p&gt;

&lt;p&gt;I talked with some people who have been doing this for longer, and got a lot of tips on how to train a model for real-world conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next up on my path
&lt;/h2&gt;

&lt;p&gt;In the end, I'm like a newborn baby in this reinforcement learning world. Let's see what the next year of learning brings.&lt;/p&gt;

&lt;p&gt;I need to relearn everything for the physical car - see you in 2023 on the virtual and live tracks. I'm aiming to join the Stockholm Summit race.&lt;/p&gt;

&lt;p&gt;Analysing training logs better is a must-learn.&lt;/p&gt;

&lt;p&gt;And I'm thinking of talking about DeepRacer at some community event or meetup - maybe I'll get a few more people interested in racing.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>deepracer</category>
    </item>
    <item>
      <title>Operation Merry Christmas</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Tue, 20 Dec 2022 10:00:00 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/operation-merry-christmas-3ppb</link>
      <guid>https://forem.com/niklaswesterstrahle/operation-merry-christmas-3ppb</guid>
      <description>&lt;h2&gt;
  
  
  What surprises you most about the community builders program?
&lt;/h2&gt;

&lt;p&gt;The people. I've had the pleasure of meeting Community Builders live at two re:Invents now, and both times they amazed me. We're all givers, not takers.&lt;/p&gt;

&lt;p&gt;The warm-hearted welcome and inclusion, and the crazy ideas we come up with when put together.&lt;/p&gt;

&lt;p&gt;It keeps surprising me time after time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s your background and your experience with AWS?
&lt;/h2&gt;

&lt;p&gt;I've got some 23+ years of IT career behind me: I started out at a helpdesk at an ISP, moved on to Linux system administration and datacenter management at the next ISP, and after cloud became the thing, left old-style datacenters to work with &lt;strong&gt;someone else's datacenter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I have 8+ years of AWS experience, five and a half of them working purely in the cloud.&lt;/p&gt;

&lt;p&gt;I love working with datacenters, especially someone else's. I no longer have to care whether the A/C works - someone does that for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the biggest benefit you see from the program?
&lt;/h2&gt;

&lt;p&gt;This pretty much circles back to my answer to the first question: the people, the community. I hope to see more of you all. I won't mention anyone by name, but you know who you are - everyone taking part in this operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the next swag item that you would like to get?
&lt;/h2&gt;

&lt;p&gt;I'm not a big fan of swag, as it often tends to be something not very usable. So it took me a while to think of what would be useful: a travel power adapter would be great - something that lets me power my devices from all the sockets around the world.&lt;/p&gt;

&lt;p&gt;Or maybe an invite to some live community event. Maybe one on every continent?&lt;/p&gt;

&lt;h2&gt;
  
  
  What are you eating for dinner today? Share the recipe!
&lt;/h2&gt;

&lt;p&gt;Dinner tonight? Impossible to say - but I know I'll be baking before Christmas, so I want to share my favourite pie at the moment: my version of Key lime pie.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingredients:
&lt;/h3&gt;

&lt;p&gt;Base:&lt;br&gt;
200g digestive cookies&lt;br&gt;
75g melted butter&lt;/p&gt;

&lt;p&gt;Filling:&lt;br&gt;
4-7 fresh limes&lt;br&gt;
5 egg yolks&lt;br&gt;
1 can condensed milk&lt;/p&gt;

&lt;p&gt;On top:&lt;br&gt;
2dl whipping cream&lt;br&gt;
1 1/2 teaspoons sugar&lt;br&gt;
1 1/2 teaspoons vanilla sugar&lt;br&gt;
1 lime&lt;/p&gt;

&lt;h3&gt;
  
  
  What to do:
&lt;/h3&gt;

&lt;p&gt;Base:&lt;br&gt;
Crush the cookies in a blender, melt the butter and combine it with the cookie crumbs. Spread the mixture into a cake pan, covering the bottom and about 1.5 cm up the sides.&lt;br&gt;
Pre-bake at 150°C for 10 minutes while preparing the filling.&lt;/p&gt;

&lt;p&gt;Filling:&lt;br&gt;
Wash the limes and grate the peel. Squeeze the juice into a cup (you'll need about 1 dl in total; a bit more doesn't hurt).&lt;br&gt;
Separate the egg yolks.&lt;br&gt;
Combine all the ingredients in a bowl and whisk them together. &lt;br&gt;
Pour the filling into the cake pan.&lt;/p&gt;

&lt;p&gt;Bake at 150°C for 15 minutes, let the pie cool at room temperature, then put it in the refrigerator for 3 hours or overnight.&lt;/p&gt;

&lt;p&gt;Topping:&lt;br&gt;
Whip the cream with the sugars, and grate lime peel on top of the whipped cream for a nice visual when plating.&lt;/p&gt;

&lt;p&gt;Enjoy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is there anything else you would like to say about the community builders program in 2022?
&lt;/h2&gt;

&lt;p&gt;I hope it'll continue in 2023 and the years after, and that I'll have a chance to participate even more.&lt;/p&gt;

&lt;p&gt;Big thanks to the AWS team working with us. &lt;/p&gt;

&lt;p&gt;And Merry Christmas.&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>ai</category>
      <category>learning</category>
      <category>gratitude</category>
    </item>
    <item>
      <title>Running deployment scripts on Cisco routers @ AWS from a Private Github repository</title>
      <dc:creator>Niklas Westerstråhle</dc:creator>
      <pubDate>Tue, 19 Oct 2021 12:36:02 +0000</pubDate>
      <link>https://forem.com/niklaswesterstrahle/running-deployment-scripts-on-cisco-routers-aws-from-a-private-github-repository-1gga</link>
      <guid>https://forem.com/niklaswesterstrahle/running-deployment-scripts-on-cisco-routers-aws-from-a-private-github-repository-1gga</guid>
      <description>&lt;h2&gt;
  
  
  Background:
&lt;/h2&gt;

&lt;p&gt;I've been building a landing zone with one of our clients, and Cisco routers were selected for the network connectivity part, as they connect nicely to the client's existing on-premises network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Ok, sounds easy enough. Let's automate this fully."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That last sentence gave us a challenge: the documentation is non-existent, and in the background Cisco does its own automation - which lacks features and doesn't really work as one would expect.&lt;/p&gt;

&lt;p&gt;I worked closely on this with Christofer - weeks of bashing our heads against the wall, utilising support channels, submitting a bug report, waiting. &lt;/p&gt;

&lt;p&gt;So I'll share our experience here and hope it helps someone building a similar solution in the future. &lt;strong&gt;If you just want the copy-paste for your user data, head to the Solution section at the bottom.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Disclaimer: I'm not a Cisco specialist; all my thoughts here come from a usability perspective, utilising the AWS cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting situation:
&lt;/h2&gt;

&lt;p&gt;1) Deployment script (deploy.sh) for setting up the routers is in Github &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This script generates everything needed to configure the router, tunnels, VRFs, so forth. It utilizes instance metadata for required information. I won't go into detail on this, this was written by clients network engineer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) A token has been created to access GitHub&lt;br&gt;
3) The BYOL model is used for the licenses; the AMI we used from the Marketplace is aws-marketplace/Cisco-C8K-17.06.01a &lt;br&gt;
4) Secrets are stored in Secrets Manager&lt;/p&gt;

&lt;h2&gt;
  
  
  The story:
&lt;/h2&gt;

&lt;p&gt;CloudFormation templates were written to build the basic infrastructure required. Nothing special there.&lt;/p&gt;

&lt;p&gt;I'll focus on the instance's user data - that was the main pain point.&lt;/p&gt;

&lt;p&gt;For licensing and installation of AWS CLI and HA package, added to user data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      Section: license
      TechPackage:appx

      Section: Python package
      csr_aws_ha 3.1.0
      awscli 1.20.40 sudo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The license command fails and tells you that the correct options are ‘vacs’, ‘lite’, ‘ipbase’, ‘ax’, ‘security’ or ‘appx’ - wait, what? This is a bug that Cisco will fix in a later AMI. &lt;/p&gt;

&lt;p&gt;On the CSR1000v those were the correct options, but on the C8000v the options are ‘network-premier’, ‘network-essentials’ or ‘network-advantage’.&lt;/p&gt;

&lt;p&gt;You'll also need to configure an IAM role for the instance - check the requirements for the HA script from Cisco.&lt;/p&gt;

&lt;p&gt;Then came running the deployment script. :)&lt;/p&gt;

&lt;p&gt;We tried to get the script from GitHub using the built-in options.&lt;/p&gt;

&lt;p&gt;A few screenshots of Cisco's bootstrapping manual:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iSWMgGrJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3zek0ngcqyo2dwm0epl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iSWMgGrJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3zek0ngcqyo2dwm0epl1.png" alt="Vendor C manual" width="701" height="137"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S839PKmd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7v7cypndq637w91ftxvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S839PKmd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7v7cypndq637w91ftxvk.png" alt="Vendor C manual" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we did what was asked, tested with curl that our URL works, and set user data as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      Section: scripts
https://token@raw.githubusercontent.com/Owner/Repository/main/deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you guessed, this didn't work. Cisco actually uses &lt;em&gt;wget&lt;/em&gt; in the background for all HTTPS requests - and wget doesn't support tokens in the URL. Curl is only used for FTP.&lt;/p&gt;

&lt;p&gt;We tested: wget against GitHub works if you give it --user (any value) and --password (the token).&lt;/p&gt;

&lt;p&gt;Let's try this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nmtEahm4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnlpjz2v0m0khk27vrs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nmtEahm4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnlpjz2v0m0khk27vrs0.png" alt="Vendor C manual" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we added credentials - no luck; that only works for FTP, not HTTPS.&lt;/p&gt;

&lt;p&gt;We went back and forth, verifying that everything works when we host our deployment code on a public website. &lt;/p&gt;

&lt;p&gt;We considered using a publicly hosted wrapper script, which would do three things:&lt;br&gt;
1) Get the token from Secrets Manager&lt;br&gt;
2) Download the deploy script with curl&lt;br&gt;
3) Run the deploy script&lt;/p&gt;
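&lt;p&gt;The wrapper could have been sketched roughly like this (hypothetical - the secret id and repo path are the placeholders from this post, and it assumes the AWS CLI and Python are available in guestshell):&lt;/p&gt;

```python
import subprocess

REPO_FILE = "raw.githubusercontent.com/Owner/Repository/main/deploy.sh"

def get_token_cmd(secret_id="github/access-token", region="eu-west-1"):
    # Step 1: read the GitHub token from Secrets Manager via the AWS CLI.
    return ["aws", "secretsmanager", "get-secret-value",
            "--secret-id", secret_id, "--region", region,
            "--query", "SecretString", "--output", "text"]

def download_cmd(token, dest="/home/guestshell/deploy.sh"):
    # Step 2: fetch the private deploy script with curl, token in the URL.
    return ["curl", "-fsS", f"https://{token}@{REPO_FILE}", "-o", dest]

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

if __name__ == "__main__":
    token = run(get_token_cmd())
    run(download_cmd(token))
    run(["bash", "/home/guestshell/deploy.sh"])  # Step 3: run the deploy script.
```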

&lt;p&gt;We tried running that through Section: scripts, and I think it actually worked, but since we were of two minds about it, the idea was dropped when we found another solution that works.&lt;/p&gt;

&lt;p&gt;We noticed from the logs that Cisco themselves use an event manager applet at boot to run the user data - so why not do the same ourselves?&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Section: IOS configuration&lt;/em&gt; works nicely - we can run our commands there. Utilising an event manager applet, we ran the deploy script.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      Section: IOS configuration
      event manager applet Deploy authorization bypass
      event timer watchdog time 180 maxrun 360
      action 0010 cli command "enable"
      action 0015 syslog msg "Getting the secret"
      action 0020 cli command "conf t"
      action 0021 cli command "do guestshell run aws secretsmanager get-secret-value --secret-id github/access-token --region eu-west-1 --query SecretString --output text"
      action 0022 cli command "event manager environment _secret $_cli_result"
      action 0023 cli command "end"
      action 0030 syslog msg "Downloading the deploy-code"
      action 0031 cli command "guestshell run curl https://$_secret@raw.githubusercontent.com/Owner/Repository/main/deploy.sh -o /home/guestshell/deploy.sh"
      action 0035 syslog msg "Running deploy.sh"
      action 0040 cli command "guestshell run bash /home/guestshell/deploy.sh"
      action 0100 cli command "conf t"
      action 0110 cli command "no event manager applet Deploy"
      action 0115 cli command "end"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So that's it: we got the key from Secrets Manager, downloaded our config-generating script, ran it, and removed the applet afterwards.&lt;/p&gt;

&lt;p&gt;The deploy script also has built-in checks so it won't run twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  One last note on running scripts within Guestshell
&lt;/h2&gt;

&lt;p&gt;One note on scripts you run in guestshell: while &lt;em&gt;#!/usr/bin/env python&lt;/em&gt; is a valid script interpreter, &lt;em&gt;#!/usr/bin/env bash&lt;/em&gt; is not - you have to use &lt;em&gt;#!/bin/bash&lt;/em&gt; there. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>cisco</category>
      <category>cloudformation</category>
      <category>github</category>
    </item>
  </channel>
</rss>
