<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Glenn Gray</title>
    <description>The latest articles on Forem by Glenn Gray (@tallgray1).</description>
    <link>https://forem.com/tallgray1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817657%2F22cc7f4e-c345-484f-89b0-07068c02c9c7.png</url>
      <title>Forem: Glenn Gray</title>
      <link>https://forem.com/tallgray1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tallgray1"/>
    <language>en</language>
    <item>
      <title>Stop Manually Updating Jira After Every PR Merge</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 05 May 2026 12:31:48 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-5gpe</link>
      <guid>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-5gpe</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/automate-jira-github-actions/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You just merged a PR. Now you open Jira, find the ticket, paste the PR link in a comment, transition the status to Done, and update the deployed field. Five minutes each time, a few hundred merges a year: call it 1,700 minutes per year per engineer, nearly 30 hours of pure mechanical overhead.&lt;/p&gt;

&lt;p&gt;And that's assuming you remember. On one team I worked with, we audited the last three months of merged PRs. Thirty percent of tickets had no update after merge. No comment, no transition, no link. The ticket just sat in In Dev until someone noticed during sprint review.&lt;/p&gt;

&lt;p&gt;The fix is two GitHub Actions workflows and a shared composite action. Here's exactly how to build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Two workflows, one shared extraction layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow 1&lt;/strong&gt;: Fires on PR creation — posts a Jira link comment to the PR so reviewers can navigate directly to the ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow 2&lt;/strong&gt;: Fires on PR merge to &lt;code&gt;main&lt;/code&gt; — posts a comment to the Jira ticket with the PR URL, commit SHA, and who merged it, then transitions the ticket to Done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both workflows need to find the Jira ticket ID. Instead of duplicating that logic, we extract it into a composite action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Composite Action for Ticket Extraction
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/actions/extract-jira-ticket/action.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The action checks its sources in priority order, easiest to fix first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PR title (simplest for the developer to correct)&lt;/li&gt;
&lt;li&gt;Branch name: a single regex matches both the bare &lt;code&gt;PROJECT-123-description&lt;/code&gt; format and prefixed names like &lt;code&gt;feat/PROJECT-123-description&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Commit messages are a natural third fallback if you want one; the version below sticks to the two sources that cover nearly every PR.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract Jira Ticket&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Jira&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PR&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;title,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;commits,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;branch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;

&lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.jira_key }}&lt;/span&gt;
  &lt;span class="na"&gt;found&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.found }}&lt;/span&gt;

&lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;using&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;composite&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract ticket ID&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;JIRA_KEY=""&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 1: PR title&lt;/span&gt;
        &lt;span class="s"&gt;if [[ "${{ github.event.pull_request.title }}" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
          &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 2: Branch name&lt;/span&gt;
        &lt;span class="s"&gt;if [ -z "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;BRANCH="${{ github.head_ref }}"&lt;/span&gt;
          &lt;span class="s"&gt;if [[ "$BRANCH" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
            &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;if [ -n "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "jira_key=$JIRA_KEY" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=true" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=false" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The regex &lt;code&gt;[A-Z]+-[0-9]+&lt;/code&gt; matches any Jira ticket format: &lt;code&gt;PROJ-1&lt;/code&gt;, &lt;code&gt;IN-89&lt;/code&gt;, &lt;code&gt;INFRA-1234&lt;/code&gt;. If you have tickets with lowercase project keys, adjust accordingly.&lt;/p&gt;
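
&lt;p&gt;You can sanity-check the pattern locally with the same bash construct the action uses. The branch names here are made up for illustration; the loop prints &lt;code&gt;PROJ-123&lt;/code&gt;, then &lt;code&gt;INFRA-42&lt;/code&gt;, then a miss for the last one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;for ref in "PROJ-123-fix-login" "feat/INFRA-42-tighten-sg" "no-ticket-here"; do
  if [[ "$ref" =~ ([A-Z]+-[0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "no ticket in: $ref"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;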

&lt;h2&gt;
  
  
  Step 2: PR Creation Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/link-jira-on-pr.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is opened and posts a formatted comment with the Jira ticket link. If no ticket is found, it posts a warning so the author knows to add one — before review, not after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Link Jira on PR&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;link-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post Jira link comment&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: `📋 Jira: [${{ steps.jira.outputs.jira-key }}](${{ secrets.JIRA_BASE_URL }}/browse/${{ steps.jira.outputs.jira-key }})`&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warn if no ticket found&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'false'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: '⚠️ No Jira ticket found. Add a ticket ID to the PR title (e.g., `PROJ-123: Your title`).'&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The warning step matters. It creates a feedback loop that trains the team to include ticket IDs upfront. Within a few weeks, the warning fires rarely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: PR Merge Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/update-jira-on-merge.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is closed against &lt;code&gt;main&lt;/code&gt;. The &lt;code&gt;if: github.event.pull_request.merged == true&lt;/code&gt; guard is important — the &lt;code&gt;closed&lt;/code&gt; event also fires for PRs that are closed without merging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Jira on Merge&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;closed&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;update-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.event.pull_request.merged == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post merge comment to Jira&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \&lt;/span&gt;
            &lt;span class="s"&gt;-u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json" \&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/comment" \&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"body\": \"PR merged: #${{ github.event.pull_request.number }} ${{ github.event.pull_request.html_url }}\nCommit: ${{ github.sha }}\nBy: ${{ github.event.pull_request.merged_by.login }}\"}")&lt;/span&gt;

          &lt;span class="s"&gt;echo "Jira comment HTTP status: $HTTP_STATUS"&lt;/span&gt;
          &lt;span class="s"&gt;[ "$HTTP_STATUS" -eq 201 ] &amp;amp;&amp;amp; echo "✅ Comment posted" || echo "⚠️ Comment failed (non-critical)"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Transition ticket to Done&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;TRANSITION_ID="${{ secrets.JIRA_DONE_TRANSITION_ID }}"&lt;/span&gt;
          &lt;span class="s"&gt;[ -z "$TRANSITION_ID" ] &amp;amp;&amp;amp; echo "No transition ID configured, skipping" &amp;amp;&amp;amp; exit 0&lt;/span&gt;

          &lt;span class="s"&gt;curl -s -u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json" \&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/transitions" \&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"transition\": {\"id\": \"$TRANSITION_ID\"}}"&lt;/span&gt;
          &lt;span class="s"&gt;echo "✅ Transitioned to Done"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The comment step uses an HTTP status check rather than relying on curl's exit code. A failed comment doesn't fail the job: the PR already merged, and a missing notification shouldn't generate noise in CI. The transition step is fully optional: if &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; isn't set, it logs a skip and exits cleanly. This lets you start with just comments and add transitions once you've verified the workflow runs reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding Your Transition IDs
&lt;/h2&gt;

&lt;p&gt;Transition IDs are project-specific. There's no universal "Done" ID. Run this against any ticket in your project to find yours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_EMAIL&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/rest/api/2/issue/&lt;/span&gt;&lt;span class="nv"&gt;$TICKET_KEY&lt;/span&gt;&lt;span class="s2"&gt;/transitions"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.transitions[] | "ID: \(.id) | \(.name)"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID: 91 | Done
ID: 31 | In Review
ID: 21 | In Progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the Done ID as &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; in your repository secrets.&lt;/p&gt;
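
&lt;p&gt;If you prefer the CLI to the repository settings UI, &lt;code&gt;gh&lt;/code&gt; can set it directly (using the Done ID from the example output above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh secret set JIRA_DONE_TRANSITION_ID --body "91"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;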

&lt;h2&gt;
  
  
  A Note on Jira API Versions
&lt;/h2&gt;

&lt;p&gt;Use the v2 API: &lt;code&gt;/rest/api/2/&lt;/code&gt;. Some teams try v3 and get empty-looking responses, &lt;code&gt;{"errorMessages":[],"errors":{}}&lt;/code&gt;, that look exactly like auth failures. It's not auth. The v3 comment endpoint expects the body as an Atlassian Document Format (ADF) object rather than a plain string, and it doesn't tell you clearly when you send the wrong shape. v2 accepts plain-text bodies, is well-documented, and works consistently.&lt;/p&gt;
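
&lt;p&gt;For context, here's roughly what the same comment body has to look like on v3: a structured Atlassian Document Format (ADF) document rather than a string (the text is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "body": {
    "type": "doc",
    "version": 1,
    "content": [
      {
        "type": "paragraph",
        "content": [
          { "type": "text", "text": "PR merged: ..." }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;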

&lt;h2&gt;
  
  
  Required Secrets
&lt;/h2&gt;

&lt;p&gt;Add these to your GitHub repository secrets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_BASE_URL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://yourorg.atlassian.net&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_USER_EMAIL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The email address tied to your API token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_API_TOKEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate at id.atlassian.com → Security → API tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Optional — from the transitions API call above&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For org-wide rollout, set these as organization secrets and restrict to relevant repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After rolling this out across a team of eight engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero manual Jira updates after merge&lt;/li&gt;
&lt;li&gt;Forgotten ticket updates dropped from 30% to 0%&lt;/li&gt;
&lt;li&gt;Roughly 1,700 minutes per year recovered per engineer&lt;/li&gt;
&lt;li&gt;Every merged PR has a complete audit trail: PR number, URL, commit SHA, who merged it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The composite action pattern also means when you need to extend this — adding a Slack notification on merge, posting to Confluence — you extend one file, not two.&lt;/p&gt;
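
&lt;p&gt;As a sketch of what that extension looks like: one more step in the merge workflow, posting to a Slack incoming webhook. The &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt; secret is an assumption here, not part of the setup above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      - name: Notify Slack
        if: steps.jira.outputs.found == 'true'
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK_URL }}" \
            -H "Content-Type: application/json" \
            -d "{\"text\": \"Merged ${{ steps.jira.outputs.jira-key }}: ${{ github.event.pull_request.html_url }}\"}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;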




&lt;p&gt;If you're rolling this out and hitting edge cases — multi-project Jira setups, tickets that span repos, or teams that don't follow branch naming conventions — &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt;. The extraction logic and composite action pattern are straightforward to extend once the baseline is working.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>jira</category>
      <category>cicd</category>
      <category>automation</category>
    </item>
    <item>
      <title>What the first 24 hours of production CloudWatch data told us</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 04 May 2026 18:43:32 +0000</pubDate>
      <link>https://forem.com/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</link>
      <guid>https://forem.com/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/cloudwatch-go-live-24h/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.&lt;/p&gt;

&lt;p&gt;We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy: you can't tune a policy for a workload you haven't seen yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.&lt;/p&gt;

&lt;p&gt;Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 99.8% CPU means at 0.5 vCPU
&lt;/h2&gt;

&lt;p&gt;The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.&lt;/p&gt;

&lt;p&gt;The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.&lt;/p&gt;

&lt;p&gt;The signal wasn't about count. It was about the unit itself.&lt;/p&gt;

&lt;p&gt;At 99.8% utilization, any burst in per-request processing time — a downstream API call that's slow, a cache miss, a spike in concurrent requests — queues. The task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that tasks are scheduled tightly enough that requests wait before they even start processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right-sizing the task before configuring the autoscaler
&lt;/h2&gt;

&lt;p&gt;We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.&lt;/p&gt;

&lt;p&gt;Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task will land around 50% utilization — below the target, room to absorb variance before triggering a scale-out, and enough signal for scale-in to be meaningful rather than noise.&lt;/p&gt;

&lt;p&gt;The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.&lt;/p&gt;

&lt;p&gt;Two other services in the platform were running the opposite problem — allocated more memory than they'd ever used. Those went the other direction: overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.&lt;/p&gt;

&lt;p&gt;Sequencing matters: &lt;strong&gt;right-size the task before you configure the autoscaler.&lt;/strong&gt; Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it does is scale out to a floor that's still running on undersized tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we chose CPU tracking instead of request count
&lt;/h2&gt;

&lt;p&gt;The obvious autoscaling metric for an HTTP service is &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt;. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.&lt;/p&gt;

&lt;p&gt;We couldn't use it.&lt;/p&gt;

&lt;p&gt;The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with &lt;code&gt;target_group_arn = null&lt;/code&gt; — the target group lives in a different account, and the service module doesn't know its ARN. &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt; requires the target group ARN to be known to the Application Auto Scaling policy. Without it, there's no way to wire the metric across accounts without building additional dependency plumbing.&lt;/p&gt;

&lt;p&gt;CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.&lt;/p&gt;

&lt;p&gt;One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observability gap app logs can't fill
&lt;/h2&gt;

&lt;p&gt;Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.&lt;/p&gt;

&lt;p&gt;What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: &lt;code&gt;HTTPCode_ELB_5XX_Count&lt;/code&gt; for errors the load balancer generates before a request reaches a container, &lt;code&gt;RejectedConnectionCount&lt;/code&gt; for connections refused at the ALB layer when backend capacity is exhausted, &lt;code&gt;ActiveConnectionCount&lt;/code&gt; as a proxy for in-flight load per target group. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.&lt;/p&gt;

&lt;p&gt;CloudWatch had the data. The gap was getting it into the same place as everything else.&lt;/p&gt;

&lt;p&gt;A 60-second Lambda in the infrastructure account — where the ALB lives — calls &lt;code&gt;GetMetricData&lt;/code&gt; and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost (one CloudWatch API call per minute against Lambda's free tier). The metrics land alongside the application data and show the ALB layer that the app logs are blind to.&lt;/p&gt;

&lt;p&gt;The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.&lt;/p&gt;
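
&lt;p&gt;A minimal sketch of that Lambda in Python, to make the shape concrete. The environment variable names (&lt;code&gt;ALB_ID&lt;/code&gt;, &lt;code&gt;BETTERSTACK_URL&lt;/code&gt;, &lt;code&gt;BETTERSTACK_TOKEN&lt;/code&gt;) are assumptions, and only one metric is shown; the real version queries all three ALB metrics:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# One query per ALB metric; RejectedConnectionCount and
# ActiveConnectionCount follow the same shape.
QUERIES = [
    {
        "Id": "elb_5xx",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApplicationELB",
                "MetricName": "HTTPCode_ELB_5XX_Count",
                "Dimensions": [
                    {"Name": "LoadBalancer", "Value": os.environ["ALB_ID"]}
                ],
            },
            "Period": 60,
            "Stat": "Sum",
        },
    },
]


def handler(event, context):
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=QUERIES,
        StartTime=now - timedelta(minutes=2),
        EndTime=now,
    )
    # Ship structured JSON so the metrics land alongside the app logs.
    lines = [
        {
            "metric": r["Id"],
            "timestamps": [t.isoformat() for t in r["Timestamps"]],
            "values": r["Values"],
        }
        for r in resp["MetricDataResults"]
    ]
    req = urllib.request.Request(
        os.environ["BETTERSTACK_URL"],
        data=json.dumps(lines).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["BETTERSTACK_TOKEN"],
        },
    )
    urllib.request.urlopen(req)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;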

&lt;h2&gt;
  
  
  Autoscaling parameters worth explaining
&lt;/h2&gt;

&lt;p&gt;For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;min_capacity&lt;/code&gt; to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.&lt;/p&gt;

&lt;p&gt;The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.&lt;/p&gt;
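
&lt;p&gt;For reference, the same configuration expressed as AWS CLI calls (the cluster and service names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/delivery-service \
  --min-capacity 9 \
  --max-capacity 20

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/delivery-service \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;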

&lt;h2&gt;
  
  
  What 24 hours of data drove
&lt;/h2&gt;

&lt;p&gt;We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.&lt;/p&gt;

&lt;p&gt;All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.&lt;/p&gt;

&lt;p&gt;The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt;. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ecs</category>
      <category>cloudwatch</category>
      <category>autoscaling</category>
      <category>rightsizing</category>
    </item>
    <item>
      <title>Stop Managing EKS Add-ons by Hand</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sun, 05 Apr 2026 16:53:40 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-2a7o</link>
      <guid>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-2a7o</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/eks-addons-terraform/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I was preparing to upgrade a production EKS cluster to version 1.32 when I discovered a problem.&lt;/p&gt;

&lt;p&gt;Four of our core cluster components—VPC CNI, CoreDNS, kube-proxy, and Metrics Server—were all running versions incompatible with EKS 1.32. I needed to update them before upgrading.&lt;/p&gt;

&lt;p&gt;And I had no easy way to do it.&lt;/p&gt;

&lt;p&gt;VPC CNI, CoreDNS, and kube-proxy had been installed automatically when the cluster was created, running in "self-managed" mode. Metrics Server was installed with &lt;code&gt;kubectl apply -f metrics-server.yaml&lt;/code&gt; from some GitHub release page, months ago, by someone who is no longer on the team.&lt;/p&gt;

&lt;p&gt;No version pinning. No history of what changed or when. No way to test the upgrade before applying it to production.&lt;/p&gt;

&lt;p&gt;That's when I decided to stop managing EKS add-ons by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Self-Managed Add-ons
&lt;/h2&gt;

&lt;p&gt;There are two categories of EKS add-ons, and most teams don't think about the distinction until they're stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-managed&lt;/strong&gt;: You're responsible for installation, updates, and compatibility. AWS won't help you troubleshoot them. When EKS releases a new version, you need to manually verify your add-ons still work, find compatible versions, and update them yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed&lt;/strong&gt;: AWS handles the lifecycle. Compatible versions are tested and published for each EKS release. AWS Support can troubleshoot them. Security patches are available without you tracking CVEs.&lt;/p&gt;

&lt;p&gt;If you created an EKS cluster without explicitly enabling managed add-ons, then VPC CNI, CoreDNS, and kube-proxy are running in self-managed mode right now.&lt;/p&gt;

&lt;p&gt;The fix is straightforward—migrate them to EKS-managed. But if you're also running kubectl-installed tools like Metrics Server, you have a second problem: those aren't managed by anything at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: One Terraform Module for All Six Add-ons
&lt;/h2&gt;

&lt;p&gt;I built a single &lt;code&gt;eks-addons&lt;/code&gt; Terraform module that manages everything:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed (4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC CNI — pod networking&lt;/li&gt;
&lt;li&gt;EBS CSI Driver — persistent volumes (added this one while I was at it)&lt;/li&gt;
&lt;li&gt;CoreDNS — DNS resolution&lt;/li&gt;
&lt;li&gt;kube-proxy — network proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Helm-managed (2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics Server — resource metrics for &lt;code&gt;kubectl top&lt;/code&gt; and HPA&lt;/li&gt;
&lt;li&gt;Reloader — auto-restart pods when ConfigMaps or Secrets change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why one module instead of six separate ones? All of these share the same dependency: the EKS cluster. Consolidating them means one &lt;code&gt;terragrunt apply&lt;/code&gt; deploys everything, one &lt;code&gt;terraform plan&lt;/code&gt; shows drift across all add-ons, and one PR updates any version.&lt;/p&gt;

&lt;p&gt;The core Terraform for an EKS-managed add-on is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_addon"&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cni"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_vpc_cni&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;addon_name&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-cni"&lt;/span&gt;
  &lt;span class="nx"&gt;addon_version&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cni_version&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_create&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_update&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;preserve&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth explaining:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;resolve_conflicts_on_create&lt;/code&gt; and &lt;code&gt;resolve_conflicts_on_update&lt;/code&gt; set to &lt;code&gt;"OVERWRITE"&lt;/code&gt; tell Terraform it's the source of truth. Any manual changes in the cluster get overwritten on the next apply. This is what you want.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preserve = true&lt;/code&gt; means if you remove the resource from Terraform, the add-on stays in the cluster. Safety net during refactoring—you won't accidentally delete a running add-on.&lt;/p&gt;

&lt;h2&gt;
  
  
  EBS CSI Driver Needs an IAM Role
&lt;/h2&gt;

&lt;p&gt;The EBS CSI Driver is the one add-on that requires extra work: it needs IAM permissions to create and attach EBS volumes. The right way to handle this is IRSA (IAM Roles for Service Accounts).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-ebs-csi-driver"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"system:serviceaccount:kube-system:ebs-csi-controller-sa"&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:aud"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ebs_csi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail. IRSA is the correct pattern for any AWS service that needs to call AWS APIs from inside Kubernetes.&lt;/p&gt;
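&lt;p&gt;The role is then wired into the managed add-on itself via &lt;code&gt;service_account_role_arn&lt;/code&gt;. A sketch following the same pattern as the module's other add-on resources:&lt;/p&gt;

```hcl
resource "aws_eks_addon" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "aws-ebs-csi-driver"
  addon_version = var.ebs_csi_version

  # EKS annotates the ebs-csi-controller-sa service account with this role
  service_account_role_arn = aws_iam_role.ebs_csi[0].arn

  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
  preserve                    = true
}
```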

&lt;h2&gt;
  
  
  Migrating Metrics Server from kubectl to Helm
&lt;/h2&gt;

&lt;p&gt;This is the one step that requires manual cleanup before Terraform can take over.&lt;/p&gt;

&lt;p&gt;The existing kubectl-installed Metrics Server needs to go first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete deployment metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
kubectl delete service metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then Terraform installs the Helm-managed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"metrics_server"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_metrics_server&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://kubernetes-sigs.github.io/metrics-server/"&lt;/span&gt;
  &lt;span class="nx"&gt;chart&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metrics_server_chart_version&lt;/span&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kube-system"&lt;/span&gt;

  &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;yamlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;replicas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-preferred-address-types=InternalIP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-insecure-tls"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;podDisruptionBudget&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;enabled&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;minAvailable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected downtime: 2-3 minutes. Only &lt;code&gt;kubectl top&lt;/code&gt; is unavailable during the transition. Running applications are not affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying It
&lt;/h2&gt;

&lt;p&gt;One thing that bit me: CI/CD doesn't pick up module changes automatically.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions workflow detects changes by looking for modified &lt;code&gt;terragrunt.hcl&lt;/code&gt; files. When I changed files in &lt;code&gt;common/modules/eks-addons/&lt;/code&gt;, the workflow ran but found no stacks to deploy (no &lt;code&gt;terragrunt.hcl&lt;/code&gt; had changed), so it deployed nothing.&lt;/p&gt;

&lt;p&gt;Module changes require a manual deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan   &lt;span class="c"&gt;# Review: should show ~10 resources to add&lt;/span&gt;
terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After apply, verify everything is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check EKS-managed add-on status&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;addon &lt;span class="k"&gt;in &lt;/span&gt;vpc-cni aws-ebs-csi-driver coredns kube-proxy&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;aws eks describe-addon &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; &amp;lt;cluster&amp;gt; &lt;span class="nt"&gt;--addon-name&lt;/span&gt; &lt;span class="nv"&gt;$addon&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'addon.[addonName,status]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="c"&gt;# All should show: ACTIVE&lt;/span&gt;

&lt;span class="c"&gt;# Verify Metrics Server&lt;/span&gt;
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
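&lt;p&gt;A longer-term fix is widening the workflow trigger so module edits queue a run as well. A sketch, assuming a GitHub Actions &lt;code&gt;push&lt;/code&gt; trigger and the &lt;code&gt;common/modules/&lt;/code&gt; layout described above:&lt;/p&gt;

```yaml
on:
  push:
    paths:
      - "**/terragrunt.hcl"
      - "common/modules/**"   # run when shared module code changes too
```

&lt;p&gt;The trigger alone doesn't tell the pipeline which stacks consume the changed module, so some mapping logic is still needed before this replaces the manual deploy.&lt;/p&gt;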



&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before: four add-ons running in self-managed mode, one installed by kubectl, no version history, no drift detection.&lt;/p&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All six add-ons defined in code with pinned versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt; shows immediately if anything drifts from the declared state&lt;/li&gt;
&lt;li&gt;Rollback is &lt;code&gt;git revert&lt;/code&gt; + &lt;code&gt;terragrunt apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;EKS cluster upgrade checklist is now: update four version strings in the Terragrunt config, open a PR, done&lt;/li&gt;
&lt;/ul&gt;
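&lt;p&gt;Concretely, that checklist edit is a handful of pinned strings in the Terragrunt inputs. The version numbers below are illustrative only, not a compatibility recommendation; check &lt;code&gt;aws eks describe-addon-versions&lt;/code&gt; for your target release:&lt;/p&gt;

```hcl
inputs = {
  vpc_cni_version    = "v1.19.0-eksbuild.1"   # illustrative versions only;
  coredns_version    = "v1.11.4-eksbuild.2"   # verify each against
  kube_proxy_version = "v1.32.0-eksbuild.2"   # `aws eks describe-addon-versions`
  ebs_csi_version    = "v1.37.0-eksbuild.1"
}
```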

&lt;p&gt;The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.&lt;/p&gt;




&lt;p&gt;Running into EKS add-on management problems? &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt;—this is the kind of operational work I do for platform teams.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>terragrunt</category>
    </item>
    <item>
      <title>Zero-Downtime AWS Transit Gateway Hub-Spoke Migration</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 23:36:57 +0000</pubDate>
      <link>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-36h</link>
      <guid>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-36h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/transit-gateway-hub-spoke-migration/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The request came from the security team: they needed network-level access from the nonprod account to the dev account so a vulnerability scanner could reach internal services. Simple enough on the surface. In practice, it exposed a gap we'd been living with for months — and forced us to fix the network architecture we'd been deferring.&lt;/p&gt;

&lt;p&gt;We had three standalone Transit Gateways, one in each workload account (dev, nonprod, and prod), completely isolated from each other. No cross-account connectivity at all. The security scanner couldn't reach its targets, and adding point-to-point peering connections to fix it would have made everything worse.&lt;/p&gt;

&lt;p&gt;But the TGW isolation was only part of the problem. We also had no inspection of traffic crossing our network boundary. Egress from workload pods went straight to the internet with no filtering. Ingress came through per-account load balancers with no centralized enforcement point. As the platform scaled toward additional workload accounts, this pattern was going to get expensive and hard to reason about.&lt;/p&gt;

&lt;p&gt;So we didn't just fix the TGW. We rebuilt the network foundation: a centralized Inspection VPC with a Network Firewall inline, a single hub Transit Gateway shared across all accounts, and centralized security tooling (GuardDuty, CloudTrail, Security Hub) aggregated in a dedicated Security account. Two maintenance windows, a few weeks of module work, and the platform went from fragmented per-account networking to a coherent hub-spoke design with full traffic inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture We Were Replacing
&lt;/h2&gt;

&lt;p&gt;Before the migration, each workload account was self-contained. It had its own TGW, its own internet gateway, its own NAT gateways. Security tooling ran independently in each account with no aggregation. The management account had no single-pane visibility into what was happening across the environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-before.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-before.png" alt="Before: Three isolated workload accounts — each with its own IGW, NAT Gateway, and standalone Transit Gateway, no cross-account connectivity" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost of running this way was about $150/month in TGW charges plus duplicated NAT gateway charges in each account. Every new workload account would add the same charges again, along with another independent security configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Target: Inspection VPC + Hub Transit Gateway
&lt;/h2&gt;

&lt;p&gt;The target was AWS Security Reference Architecture Pattern B: an Inspection VPC that sits between the internet and all workload VPCs. All internet traffic — ingress and egress — flows through this VPC and through a Network Firewall before reaching any workload account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-after.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-after.png" alt="After: Centralized hub with inline Network Firewall inspection — all traffic flows through the Infrastructure Account's Inspection VPC before reaching any workload" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Egress path: workload pod → TGW → Inspection VPC TGW subnets → Network Firewall → NAT Gateway → IGW → internet.&lt;/p&gt;

&lt;p&gt;Ingress path: internet → IGW → centralized ALB (public subnet) → Network Firewall → TGW → workload VPC → pod.&lt;/p&gt;

&lt;p&gt;Nothing crosses the network boundary without passing through the firewall. Workload accounts carry no internet-facing infrastructure at all — no IGW, no NAT gateways, no public load balancers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Module Changes
&lt;/h2&gt;

&lt;p&gt;All Terraform work happened before scheduling any maintenance. The goal was to reach a state where the migration itself was just running pre-staged plan files in a specific sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transit Gateway: add a conditional create flag
&lt;/h3&gt;

&lt;p&gt;The existing network module always created a TGW. We needed spoke accounts to declare the same module without spinning up their own gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"create_transit_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Whether to create a Transit Gateway (false for hub-spoke spokes)"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_transit_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tgw_description&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"transit_gateway_id"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_transit_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;aws_ec2_transit_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;default = true&lt;/code&gt; means existing configurations need no changes. The flag only flips to &lt;code&gt;false&lt;/code&gt; after the spoke attachment is confirmed working.&lt;/p&gt;
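&lt;p&gt;Once a spoke's hub attachment is confirmed working, decommissioning its standalone TGW is a one-line flip in that account's Terragrunt inputs (a sketch):&lt;/p&gt;

```hcl
inputs = {
  create_transit_gateway = false  # flip only after the hub attachment passes verification
}
```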

&lt;h3&gt;
  
  
  New module: vpc-attachment
&lt;/h3&gt;

&lt;p&gt;The vpc-attachment module handles the spoke side of the hub relationship: create the TGW attachment, associate it to the hub's route table, and add routes to every private route table in the spoke VPC pointing at the hub TGW.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway_vpc_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-hub-attachment"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_attachment_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ec2_transit_gateway_vpc_attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route"&lt;/span&gt; &lt;span class="s2"&gt;"to_hub_tgw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_route_table_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;destination_cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/8"&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;10.0.0.0/8&lt;/code&gt; supernet covers all workload VPC CIDRs without maintaining per-prefix route entries. It also covers the Inspection VPC CIDR (&lt;code&gt;10.100.0.0/20&lt;/code&gt;), which is how return traffic from the centralized ALB finds its way back to pods in workload VPCs.&lt;/p&gt;

&lt;p&gt;The Terragrunt config for a spoke account reads VPC details from the existing network dependency and hardcodes the hub TGW identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"network"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../network"&lt;/span&gt;
  &lt;span class="nx"&gt;mock_outputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-mockid"&lt;/span&gt;
    &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"subnet-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;private_route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"rtb-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-xxxxx"&lt;/span&gt;   &lt;span class="c1"&gt;# hub TGW, documented in runbook&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-rtb-xxxxx"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We hardcoded the hub TGW and route table IDs rather than using cross-account data sources. The alternative — reading TGW details from the Infrastructure account at plan time — requires cross-account state access and adds complexity that isn't worth it for values that change maybe once in the platform's lifetime.&lt;/p&gt;
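&lt;p&gt;For contrast, here's roughly what the rejected cross-account lookup would require — a sketch only, with an illustrative provider alias and role ARN rather than our actual configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical: read the hub TGW from the Infrastructure account at plan time.
# Requires an assumable read role in the hub account and a second provider block.
provider "aws" {
  alias = "infrastructure"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/network-read" # illustrative
  }
}

data "aws_ec2_transit_gateway" "hub" {
  provider = aws.infrastructure

  filter {
    name   = "state"
    values = ["available"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every spoke plan would then depend on cross-account credentials being valid just to resolve a value that is effectively a constant.&lt;/p&gt;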

&lt;h3&gt;
  
  
  Hub route tables: workload isolation by default
&lt;/h3&gt;

&lt;p&gt;A key design decision: workload accounts should not route to each other directly. Dev should not reach nonprod; nonprod should not reach prod. The hub TGW enforces this through route table structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;default-association-rt&lt;/strong&gt;: all workload attachments associate here. The only route is &lt;code&gt;0.0.0.0/0 → inspection attachment&lt;/code&gt;. Workloads can reach the internet via the Inspection VPC, but cannot reach other workload VPCs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;default-propagation-rt&lt;/strong&gt;: the inspection attachment propagates workload CIDRs here for return traffic routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inter-account communication is opt-in: you add an explicit route table entry for a specific attachment pair. By default, the architecture prevents lateral movement across workload accounts.&lt;/p&gt;
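&lt;p&gt;The opt-in itself is a static TGW route pointing one attachment's CIDR at the other. A minimal sketch — the CIDR, attachment ID, and route table ID here are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Opt-in: let attachments associated with this route table reach nonprod
resource "aws_ec2_transit_gateway_route" "to_nonprod" {
  destination_cidr_block         = "10.20.0.0/16"       # nonprod CIDR, illustrative
  transit_gateway_attachment_id  = "tgw-attach-nonprod" # illustrative
  transit_gateway_route_table_id = "tgw-rtb-xxxxx"      # illustrative
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;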

&lt;h3&gt;
  
  
  Inspection VPC subnet layout
&lt;/h3&gt;

&lt;p&gt;The Inspection VPC has three tiers with carefully constructed route tables that force traffic through the firewall in both directions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-subnets.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-subnets.png" alt="Inspection VPC subnet layout — three tiers (public, firewall, TGW) with asymmetric route tables that force all traffic through Network Firewall endpoints in both directions" width="800" height="1255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The asymmetric route table design ensures the firewall sees every packet crossing the network boundary, regardless of direction. Traffic entering from the internet hits the firewall before reaching workloads. Traffic from workloads hits the firewall before reaching the internet.&lt;/p&gt;
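&lt;p&gt;In Terraform terms, forcing both directions through the firewall means routing to the firewall's VPC endpoint instead of straight to the IGW or TGW. A hedged sketch of the two key routes — the route table names and endpoint ID are illustrative, not our actual resource names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# TGW-subnet route table: workload-originated traffic exits via the firewall
resource "aws_route" "tgw_subnet_egress" {
  route_table_id         = aws_route_table.tgw_subnets.id # illustrative name
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = "vpce-xxxxx" # Network Firewall endpoint, illustrative
}

# Public-subnet route table: inbound traffic bound for workloads re-enters the firewall
resource "aws_route" "public_to_workloads" {
  route_table_id         = aws_route_table.public.id # illustrative name
  destination_cidr_block = "10.0.0.0/8"
  vpc_endpoint_id        = "vpce-xxxxx" # illustrative
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;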

&lt;h3&gt;
  
  
  Security baseline: convert to delegated admin model
&lt;/h3&gt;

&lt;p&gt;GuardDuty and CloudTrail were running independently per account. We added &lt;code&gt;enable_guardduty&lt;/code&gt; and &lt;code&gt;enable_cloudtrail&lt;/code&gt; boolean variables to the security-baseline module so workload accounts could switch from standalone to member without touching the module invocation itself.&lt;/p&gt;
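&lt;p&gt;The toggle is the standard conditional-&lt;code&gt;count&lt;/code&gt; idiom — a sketch of the shape, not our exact module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "enable_guardduty" {
  type    = bool
  default = true
}

# Standalone detector only when the account has not joined the org as a member
resource "aws_guardduty_detector" "this" {
  count  = var.enable_guardduty ? 1 : 0
  enable = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;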

&lt;p&gt;In the Security account, we deployed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GuardDuty&lt;/strong&gt; as delegated admin with organization-level auto-enrollment. EKS Protection and S3 Protection enabled. All findings from all accounts visible in a single dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail&lt;/strong&gt; organization trail writing to a cross-account S3 bucket. Log file validation and KMS encryption enabled. Per-account trails archived after the cutover — not deleted, in case historical log formats differed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Hub&lt;/strong&gt; with CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices enabled across the full organization.&lt;/li&gt;
&lt;/ul&gt;
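&lt;p&gt;The GuardDuty delegated-admin wiring, sketched in Terraform (this assumes a detector already exists in the Security account; exact arguments vary by provider version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# In the management account: delegate GuardDuty admin to the Security account
resource "aws_guardduty_organization_admin_account" "this" {
  admin_account_id = var.security_account_id
}

# In the Security account: auto-enroll every current and future member account
resource "aws_guardduty_organization_configuration" "this" {
  auto_enable_organization_members = "ALL"
  detector_id                      = aws_guardduty_detector.this.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;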

&lt;h2&gt;
  
  
  Phase 2: Two Maintenance Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window 1: Deploy the hub (~45 minutes, low risk)
&lt;/h3&gt;

&lt;p&gt;With no existing attachments and no workload traffic, deploying the hub infrastructure carried minimal risk. We applied the Infrastructure account TGW and Inspection VPC in a single window. The Network Firewall takes 5–10 minutes to reach READY state after creation — account for that in your timing.&lt;/p&gt;

&lt;p&gt;At the end of this window: hub TGW running, Inspection VPC active, Network Firewall endpoints healthy in both AZs, centralized ALB deployed. Nothing attached yet. We documented the TGW ID and route table IDs in the runbook before scheduling window 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 2: Spoke cutover (~2 hours)
&lt;/h3&gt;

&lt;p&gt;The key insight for keeping applications running: &lt;strong&gt;create the hub attachment before destroying the standalone TGW&lt;/strong&gt;. While both exist simultaneously, traffic continues flowing through the standalone path. The actual cutover is updating routes to point at the hub — that's a single &lt;code&gt;terragrunt apply&lt;/code&gt;, not the destruction of the old TGW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0 — Accept RAM share.&lt;/strong&gt; Infrastructure account shares the hub TGW via Resource Access Manager. Workload accounts accept the share invitation. Pure metadata operation; zero network impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+15 — Deploy VPC attachments.&lt;/strong&gt; Apply the &lt;code&gt;vpc-attachment&lt;/code&gt; module in each workload account. At this point each spoke VPC has two routes for &lt;code&gt;10.0.0.0/8&lt;/code&gt;: the existing one pointing at the standalone TGW, and the new one pointing at the hub. With identical prefix lengths, traffic still flows through the standalone path. Rollback at this stage is &lt;code&gt;terragrunt destroy&lt;/code&gt; on the attachment module — under five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+30 — Verify routes and test cross-account connectivity.&lt;/strong&gt; Confirm hub routes are present in every private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-route-tables &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=vpc-id,Values=vpc-xxxxx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'RouteTables[*].Routes[?DestinationCidrBlock==`10.0.0.0/8`]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then test actual cross-account traffic: connect from a dev instance to a service in the nonprod VPC. The hub TGW and Inspection VPC should route it correctly. This also validates that the firewall rule groups are permitting expected traffic — catch any rule issues here, before cutting over production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+45 — Migrate security tooling.&lt;/strong&gt; Apply the updated security-baseline to each workload account. GuardDuty converts from standalone admin to member; findings flow to the Security account delegated admin. CloudTrail local trail disabled; organization trail confirmed logging events from the account. Zero network impact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify GuardDuty membership&lt;/span&gt;
aws guardduty get-administrator-account &lt;span class="nt"&gt;--detector-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;# Returns the Security account as administrator&lt;/span&gt;

&lt;span class="c"&gt;# Verify organization trail is capturing events&lt;/span&gt;
&lt;span class="c"&gt;# Make an API call, wait ~15 minutes, check the Security account's S3 bucket&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://&amp;lt;org-trail-bucket&amp;gt;/AWSLogs/&amp;lt;account-id&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;T+60 — Set &lt;code&gt;create_transit_gateway = false&lt;/code&gt; in each spoke.&lt;/strong&gt; This is the cutover. Run &lt;code&gt;terraform plan&lt;/code&gt; first and confirm it shows only the TGW and its attached resources being destroyed — nothing else. Apply dev first, watch the destruction complete, confirm application traffic is flowing through the hub. Then apply nonprod. About 3 minutes per account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+90 — Health checks and close.&lt;/strong&gt; Spot-check API endpoints, database connectivity, anything that traverses the network. Confirm egress traffic is hitting the firewall logs in the Infrastructure account. The maintenance window closed at the 90-minute mark; actual work was done by T+75. We kept the window open for the last 15 minutes as a buffer.&lt;/p&gt;

&lt;p&gt;The parallel attachment approach ensured there was never a moment where a workload account had no routing path. Even if the hub TGW had been misconfigured, traffic would have continued flowing through the standalone gateway until we chose to destroy it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Ended Up With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One TGW&lt;/strong&gt; in the Infrastructure account with three spoke attachments. Route tables that allow workload→internet traffic while preventing workload→workload lateral movement by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One Inspection VPC&lt;/strong&gt; with Network Firewall endpoints in two AZs. All egress inspected against stateful domain filter rules and stateless port rules. All ingress from the centralized ALB inspected. Firewall policy updates apply to all workload accounts simultaneously — no per-account changes needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One centralized ALB&lt;/strong&gt; in the Infrastructure account, routing to EKS target groups in workload accounts via cross-account IAM role assumption. Workload accounts carry no public-facing load balancers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One security console&lt;/strong&gt; in the Security account. GuardDuty findings from all accounts in a single dashboard. CloudTrail logs from every account in one S3 bucket. Security Hub compliance posture for the full organization visible in one place.&lt;/p&gt;

&lt;p&gt;Cost went from roughly $150–200/month (standalone TGWs, per-account NAT, independent security tooling) to approximately $50/month (single hub TGW plus attachment hours, shared NAT in the Inspection VPC, delegated security services). Cost savings validated against AWS Cost Explorer after 30 days.&lt;/p&gt;

&lt;p&gt;The original security scanner request — cross-account access from nonprod to dev — was live the same day. The compliance team had a single GuardDuty and Security Hub dashboard the same week.&lt;/p&gt;

&lt;p&gt;More importantly: adding a new workload account to this architecture now takes about an hour. Create the VPC, deploy the vpc-attachment module pointing at the documented hub TGW ID, invite the new account as a GuardDuty and Security Hub member, apply the security-baseline with &lt;code&gt;enable_guardduty = false&lt;/code&gt;. Every new account inherits the full inspection and security posture without any per-account configuration. That's the actual value of a hub-spoke design — not the one-time cost savings, but the fact that account seven is as well-secured and as easy to audit as account two.&lt;/p&gt;




&lt;p&gt;Working through a multi-account network redesign, or building the inspection layer on top of an existing Transit Gateway setup? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; — this is the kind of platform architecture I work on regularly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>transitgateway</category>
      <category>networking</category>
    </item>
    <item>
      <title>DNS Validation: From 15 Steps to Zero</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:30 +0000</pubDate>
      <link>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</link>
      <guid>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/dns-hell-to-automated/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You know what's the worst part of launching a new site?&lt;/p&gt;

&lt;p&gt;SSL certificate validation.&lt;/p&gt;

&lt;p&gt;Not creating the cert—that's one click in AWS ACM. It's the validation dance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS gives you a CNAME record: &lt;code&gt;_abc123extremely-long-string-here.graycloudarch.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The value is equally ridiculous: &lt;code&gt;_xyz789another-massive-string.acm-validations.aws.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You copy it (pray you don't miss a character)&lt;/li&gt;
&lt;li&gt;Switch to Cloudflare (or Route 53, or wherever)&lt;/li&gt;
&lt;li&gt;Paste it in&lt;/li&gt;
&lt;li&gt;Wait 5-10 minutes&lt;/li&gt;
&lt;li&gt;Refresh AWS console&lt;/li&gt;
&lt;li&gt;Still pending...&lt;/li&gt;
&lt;li&gt;Refresh again&lt;/li&gt;
&lt;li&gt;Finally validated!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now do it again for &lt;code&gt;www.graycloudarch.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And then repeat the whole thing for your second domain.&lt;/p&gt;

&lt;p&gt;This is "DNS hell."&lt;/p&gt;

&lt;h2&gt;
  
  
  There's a Better Way
&lt;/h2&gt;

&lt;p&gt;Terraform can read AWS validation records and create them in Cloudflare automatically.&lt;/p&gt;

&lt;p&gt;Zero copy-paste. Zero browser tab switching. Zero waiting and refreshing.&lt;/p&gt;

&lt;p&gt;Here's the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Request certificate&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.graycloudarch.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create validation records in Cloudflare&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
      &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
  &lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Critical - ACM validation breaks with proxy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for validation&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;validation_record_fqdns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cloudflare_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cert_validation&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;. Go make coffee. Come back to a validated certificate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Magic: for_each
&lt;/h2&gt;

&lt;p&gt;The key is this part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS generates validation records dynamically (one for apex domain, one for www). Terraform reads them, loops over them, and creates each one in Cloudflare.&lt;/p&gt;

&lt;p&gt;You never see the records. You never copy anything. It just works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Screwed Up
&lt;/h2&gt;

&lt;p&gt;First time I ran this, ACM validation timed out after 30 minutes.&lt;/p&gt;

&lt;p&gt;The problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the proxy on, Cloudflare answers DNS queries for the record with its own anycast IPs instead of returning the CNAME value — so ACM's validation servers never see the record they're checking for.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS-only mode. No proxy. ACM validation works.&lt;/p&gt;

&lt;p&gt;Cost me 30 minutes of debugging. Now it's in code so I never hit it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm running two brands: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Manual approach: 15 steps per domain = 30 steps total. 30 minutes minimum. High chance of typos.&lt;/p&gt;

&lt;p&gt;Terraform approach: One &lt;code&gt;terraform apply&lt;/code&gt;. 5 minutes to write the code (once), 10 minutes for AWS to validate. Then copy-paste the pattern for the second domain.&lt;/p&gt;

&lt;p&gt;When I launch my third brand (and I will), it'll take 5 minutes and one &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront automation for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part People Miss
&lt;/h2&gt;

&lt;p&gt;Most Terraform tutorials stop at requesting the certificate. They don't show you the validation loop or the waiting resource.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;aws_acm_certificate_validation&lt;/code&gt;, Terraform exits immediately after creating the cert. It's still "Pending Validation" in AWS. When you try to use it in CloudFront, it fails.&lt;/p&gt;

&lt;p&gt;You'd have to run &lt;code&gt;terraform apply&lt;/code&gt; again later, after manually checking that validation completed.&lt;/p&gt;

&lt;p&gt;That's not automation—that's just documentation.&lt;/p&gt;

&lt;p&gt;The waiting resource makes it truly hands-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling It
&lt;/h2&gt;

&lt;p&gt;Adding a second domain is 10 lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.cloudpatterns.io"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* same pattern */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern, different names. No clicking. No switching between consoles. No remembering which validation record goes where.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the time savings (though 30 minutes per deployment adds up).&lt;/p&gt;

&lt;p&gt;It's the mental overhead.&lt;/p&gt;

&lt;p&gt;Manual DNS configuration requires focus. "Did I copy the whole string? Did I add the trailing dot? Is it DNS-only mode?"&lt;/p&gt;

&lt;p&gt;Terraform requires running one command. That's it.&lt;/p&gt;

&lt;p&gt;I get my focus back. I can write this blog post while Terraform validates certificates.&lt;/p&gt;

&lt;p&gt;Want the full code? It's not open source (yet), but if you're building something similar and want to talk through it, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm overthinking this and should've clicked through Cloudflare like a normal person, that's cool too.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>cloudflare</category>
      <category>dns</category>
    </item>
    <item>
      <title>Building Multi-Account AWS Infrastructure with Terraform and ECP</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:25 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</link>
      <guid>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/multi-account-aws-ecp/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;After years of building AWS infrastructure at scale, I've learned that multi-account strategy isn't just about security—it's about organizational clarity and cost management.&lt;/p&gt;

&lt;p&gt;At a large podcast hosting platform, we implemented an Enterprise Control Plane (ECP) pattern using Terraform to manage 20+ AWS accounts. Here's what I learned:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Account AWS
&lt;/h2&gt;

&lt;p&gt;Most companies start with one AWS account. Everything lives together: dev, staging, prod, data pipelines, security tools. It works... until it doesn't.&lt;/p&gt;

&lt;p&gt;Problems emerge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius:&lt;/strong&gt; A misconfigured dev resource can affect production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM complexity:&lt;/strong&gt; Permission boundaries become impossible to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost allocation:&lt;/strong&gt; Finance can't track spending by team or project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Auditors want logical separation between environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The ECP Pattern
&lt;/h2&gt;

&lt;p&gt;Enterprise Control Plane is an architectural pattern for managing multiple AWS accounts as a unified platform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Organization Structure:&lt;/strong&gt; AWS Organizations with OUs (Organizational Units) for different environments and teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Networking:&lt;/strong&gt; Transit Gateway connecting all accounts through hub-and-spoke model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Baseline:&lt;/strong&gt; Service Control Policies (SCPs) enforcing guardrails at the organization level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code:&lt;/strong&gt; Terraform/Terragrunt managing everything from a central repository&lt;/li&gt;
&lt;/ol&gt;
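&lt;p&gt;The OU skeleton in Terraform is small — a sketch with illustrative OU names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_organizations_organization" "this" {
  feature_set = "ALL" # required for SCPs
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads" # illustrative
  parent_id = aws_organizations_organization.this.roots[0].id
}

resource "aws_organizations_organizational_unit" "platform" {
  name      = "Platform" # illustrative
  parent_id = aws_organizations_organization.this.roots[0].id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;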

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Account Boundaries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production accounts: Isolated per application/team&lt;/li&gt;
&lt;li&gt;Non-prod accounts: Shared dev/staging to reduce overhead&lt;/li&gt;
&lt;li&gt;Platform accounts: Separate accounts for logging, monitoring, security tools&lt;/li&gt;
&lt;li&gt;Data accounts: Isolated for compliance and access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Network Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hub account with Transit Gateway&lt;/li&gt;
&lt;li&gt;VPC peering only where absolutely necessary&lt;/li&gt;
&lt;li&gt;Private subnet defaults for everything&lt;/li&gt;
&lt;li&gt;Centralized egress through NAT Gateway in hub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCPs prevent account-level misconfigurations&lt;/li&gt;
&lt;li&gt;IAM roles for cross-account access (no shared credentials)&lt;/li&gt;
&lt;li&gt;CloudTrail logs aggregated to security account&lt;/li&gt;
&lt;li&gt;GuardDuty and Security Hub in every account&lt;/li&gt;
&lt;/ul&gt;
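&lt;p&gt;A representative SCP guardrail, sketched in Terraform — the policy and OU target are illustrative; pick guardrails that match your own risk model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_organizations_policy" "deny_leave_org" {
  name = "deny-leave-organization" # illustrative guardrail

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyLeaveOrg"
      Effect   = "Deny"
      Action   = "organizations:LeaveOrganization"
      Resource = "*"
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_leave_org.id
  target_id = var.workloads_ou_id # illustrative
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;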

&lt;h2&gt;
  
  
  Terraform Structure
&lt;/h2&gt;

&lt;p&gt;We use Terragrunt to manage configurations across accounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ecp-ou-structure/&lt;/span&gt;     &lt;span class="c1"&gt;# Organization and account management&lt;/span&gt;
&lt;span class="s"&gt;ecp-network/&lt;/span&gt;          &lt;span class="c1"&gt;# Transit Gateway, VPCs, networking&lt;/span&gt;
&lt;span class="s"&gt;ecp-security/&lt;/span&gt;         &lt;span class="c1"&gt;# Security baseline, SCPs, IAM&lt;/span&gt;
&lt;span class="s"&gt;tf-live-aws-*/&lt;/span&gt;        &lt;span class="c1"&gt;# Application-specific infrastructure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with security:&lt;/strong&gt; SCPs first, then networking, then workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate account creation:&lt;/strong&gt; Manual account provisioning doesn't scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the why:&lt;/strong&gt; Every architectural decision needs context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for day 2:&lt;/strong&gt; Operations matter more than initial setup&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing ECP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced security incident blast radius by 90%&lt;/li&gt;
&lt;li&gt;Finance can now track costs by team and project&lt;/li&gt;
&lt;li&gt;New environments deploy in hours, not days&lt;/li&gt;
&lt;li&gt;Passed SOC2 audit with zero infrastructure findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-account AWS isn't just best practice—it's how you scale infrastructure beyond the startup phase.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>multiaccount</category>
      <category>ecp</category>
    </item>
    <item>
      <title>Stop Manually Updating Jira After Every PR Merge</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:10:48 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</link>
      <guid>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/automate-jira-github-actions/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You just merged a PR. Now you open Jira, find the ticket, paste the PR link in a comment, transition the status to Done, and update the deployed field. Five minutes each time, several times a week. Call it 1,700 minutes per year per engineer --- nearly 30 hours of pure mechanical overhead.&lt;/p&gt;

&lt;p&gt;And that's assuming you remember. On one team I worked with, we audited the last three months of merged PRs. Thirty percent of tickets had no update after merge. No comment, no transition, no link. The ticket just sat in In Dev until someone noticed during sprint review.&lt;/p&gt;

&lt;p&gt;The fix is two GitHub Actions workflows and a shared composite action. Here's exactly how to build it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Two workflows, one shared extraction layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 1&lt;/strong&gt;: Fires on PR creation --- posts a Jira
link comment to the PR so reviewers can navigate directly to the
ticket.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 2&lt;/strong&gt;: Fires on PR merge to &lt;code&gt;main&lt;/code&gt;
--- posts a comment to the Jira ticket with the PR URL, commit SHA, and
who merged it, then transitions the ticket to Done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both workflows need to find the Jira ticket ID. Instead of duplicating that logic, we extract it into a composite action.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Composite Action for Ticket Extraction
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/actions/extract-jira-ticket/action.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The action checks four sources in priority order --- easiest to fix first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; PR title (simplest for the developer to correct)&lt;/li&gt;
&lt;li&gt; Commit messages&lt;/li&gt;
&lt;li&gt; Branch name in standard format:
&lt;code&gt;PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Branch name with prefix:
&lt;code&gt;feat/PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract Jira Ticket&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extracts Jira ticket from PR title, commits, or branch name&lt;/span&gt;

&lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.jira_key }}&lt;/span&gt;
  &lt;span class="na"&gt;found&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.found }}&lt;/span&gt;

&lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;using&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;composite&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract ticket ID&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;JIRA_KEY=""&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 1: PR title&lt;/span&gt;
        &lt;span class="s"&gt;if [[ "${{ github.event.pull_request.title }}" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
          &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 2: Branch name&lt;/span&gt;
        &lt;span class="s"&gt;if [ -z "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;BRANCH="${{ github.head_ref }}"&lt;/span&gt;
          &lt;span class="s"&gt;if [[ "$BRANCH" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
            &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;if [ -n "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "jira_key=$JIRA_KEY" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=true" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=false" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The regex &lt;code&gt;[A-Z]+-[0-9]+&lt;/code&gt; matches any Jira ticket format: &lt;code&gt;PROJ-1&lt;/code&gt;, &lt;code&gt;IN-89&lt;/code&gt;, &lt;code&gt;INFRA-1234&lt;/code&gt;. If you have tickets with lowercase project keys, adjust accordingly.&lt;/p&gt;
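&lt;p&gt;To sanity-check the pattern locally before wiring it into CI, here is a small bash sketch (the sample strings are hypothetical) using the same &lt;code&gt;=~&lt;/code&gt; match the action uses:&lt;/p&gt;

```shell
# Extract the first Jira-style key (PROJECT-123) from a string,
# mirroring the regex used in the composite action.
extract_ticket() {
  if [[ "$1" =~ ([A-Z]+-[0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo ""
  fi
}

extract_ticket "INFRA-1234: Add VPC flow logs"   # -> INFRA-1234
extract_ticket "feat/PROJ-123-description"       # -> PROJ-123
extract_ticket "no ticket here"                  # -> (empty)
```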
&lt;h2&gt;
  
  
  Step 2: PR Creation Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/link-jira-on-pr.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is opened and posts a formatted comment with the Jira ticket link. If no ticket is found, it posts a warning so the author knows to add one --- before review, not after.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Link Jira on PR&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;link-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post Jira link comment&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: `📋 Jira: [${{ steps.jira.outputs.jira-key }}](${{ secrets.JIRA_BASE_URL }}/browse/${{ steps.jira.outputs.jira-key }})`&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warn if no ticket found&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'false'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: '⚠️ No Jira ticket found. Add a ticket ID to the PR title (e.g., `PROJ-123: Your title`).'&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The warning step matters. It creates a feedback loop that trains the team to include ticket IDs upfront. Within a few weeks, the warning fires rarely.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: PR Merge Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/update-jira-on-merge.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is closed against &lt;code&gt;main&lt;/code&gt;. The &lt;code&gt;if: github.event.pull_request.merged == true&lt;/code&gt; guard is important --- the &lt;code&gt;closed&lt;/code&gt; event also fires for PRs that are closed without merging.&lt;/p&gt;
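&lt;p&gt;You can see the distinction in the event payload itself. A quick sketch (these payloads are heavily simplified --- real webhook events carry many more fields):&lt;/p&gt;

```shell
# The `closed` event fires for both merged and abandoned PRs; only the
# `merged` flag distinguishes them. Simplified payloads for illustration:
merged_event='{"action": "closed", "pull_request": {"merged": true}}'
abandoned_event='{"action": "closed", "pull_request": {"merged": false}}'

should_update_jira() {
  # Crude stand-in for the workflow-level `if:` guard.
  case "$1" in
    *'"merged": true'*) echo "run" ;;
    *) echo "skip" ;;
  esac
}

should_update_jira "$merged_event"      # -> run
should_update_jira "$abandoned_event"   # -> skip
```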


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Jira on Merge&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;closed&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;update-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.event.pull_request.merged == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post merge comment to Jira&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}"&lt;/span&gt;
            &lt;span class="s"&gt;-u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json"&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/comment"&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"body\": \"PR merged: #${{ github.event.pull_request.number }} ${{ github.event.pull_request.html_url }}\nCommit: ${{ github.sha }}\nBy: ${{ github.event.pull_request.merged_by.login }}\"}")&lt;/span&gt;

          &lt;span class="s"&gt;echo "Jira comment HTTP status: $HTTP_STATUS"&lt;/span&gt;
          &lt;span class="s"&gt;[ "$HTTP_STATUS" -eq 201 ] &amp;amp;&amp;amp; echo "✅ Comment posted" || echo "⚠️ Comment failed (non-critical)"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Transition ticket to Done&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;TRANSITION_ID="${{ secrets.JIRA_DONE_TRANSITION_ID }}"&lt;/span&gt;
          &lt;span class="s"&gt;[ -z "$TRANSITION_ID" ] &amp;amp;&amp;amp; echo "No transition ID configured, skipping" &amp;amp;&amp;amp; exit 0&lt;/span&gt;

          &lt;span class="s"&gt;curl -s -u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json"&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/transitions"&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"transition\": {\"id\": \"$TRANSITION_ID\"}}"&lt;/span&gt;
          &lt;span class="s"&gt;echo "✅ Transitioned to Done"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The comment step uses an HTTP status check rather than relying on curl's exit code. A failed comment doesn't fail the job --- the PR already merged, and a missing notification shouldn't generate noise in CI. The transition step is fully optional: if &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; isn't set, it skips silently. This lets you start with just comments and add transitions once you've verified the workflow runs cleanly.&lt;/p&gt;
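&lt;p&gt;The "log but never fail" pattern is worth isolating. A minimal sketch (the messages are illustrative, not the exact workflow output):&lt;/p&gt;

```shell
# Report the outcome of the Jira comment call without ever failing
# the job: a missed notification is noise, not a broken merge.
report_jira_status() {
  local status="$1"
  echo "Jira comment HTTP status: $status"
  if [ "$status" -eq 201 ]; then
    echo "Comment posted"
  else
    echo "Comment failed (non-critical)"
  fi
  return 0   # always succeed so the workflow stays green
}

report_jira_status 201   # -> Comment posted
report_jira_status 401   # -> Comment failed (non-critical)
```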
&lt;h2&gt;
  
  
  Finding Your Transition IDs
&lt;/h2&gt;

&lt;p&gt;Transition IDs are project-specific. There's no universal "Done" ID. Run this against any ticket in your project to find yours:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_EMAIL&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/rest/api/2/issue/&lt;/span&gt;&lt;span class="nv"&gt;$TICKET_KEY&lt;/span&gt;&lt;span class="s2"&gt;/transitions"&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.transitions[] | "ID: \(.id) | \(.name)"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Example output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID: 91 | Done
ID: 31 | In Review
ID: 21 | In Progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set the Done ID as &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; in your repository secrets.&lt;/p&gt;
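&lt;p&gt;If you want to check the &lt;code&gt;jq&lt;/code&gt; filter before pointing it at your instance, run it against a canned response (the JSON below is sample data, not from a real project):&lt;/p&gt;

```shell
# Canned transitions response (sample data for illustration)
printf '%s\n' '{
  "transitions": [
    {"id": "21", "name": "In Progress"},
    {"id": "31", "name": "In Review"},
    {"id": "91", "name": "Done"}
  ]
}' > transitions.json

# Same filter as the live call above, minus the network
jq -r '.transitions[] | "ID: \(.id) | \(.name)"' transitions.json
# -> ID: 21 | In Progress
#    ID: 31 | In Review
#    ID: 91 | Done
```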

&lt;h2&gt;
  
  
  A Note on Jira API Versions
&lt;/h2&gt;

&lt;p&gt;Use the v2 API: &lt;code&gt;/rest/api/2/&lt;/code&gt;. Some teams try v3 and get silent empty responses --- &lt;code&gt;{"errorMessages":[],"errors":{}}&lt;/code&gt; --- that look exactly like auth failures. It's not auth. The v3 request body format changed, and error handling is poor. v2 is stable, well-documented, and works consistently.&lt;/p&gt;
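&lt;p&gt;The practical difference is the shape of the comment body. In v2 it's a plain string; in v3 it must be an Atlassian Document Format (ADF) object. A sketch of the two payloads --- verify the exact ADF shape against Atlassian's docs before relying on it:&lt;/p&gt;

```shell
# v2: the comment body is a plain string
v2_payload='{"body": "PR merged: #42"}'

# v3: the body must be an ADF document. Get this shape slightly wrong
# and the API returns the empty {"errorMessages":[],"errors":{}}
# response described above.
v3_payload='{"body": {"type": "doc", "version": 1, "content": [
  {"type": "paragraph", "content": [{"type": "text", "text": "PR merged: #42"}]}
]}}'

echo "$v2_payload"
echo "$v3_payload"
```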

&lt;h2&gt;
  
  
  Required Secrets
&lt;/h2&gt;

&lt;p&gt;Add these to your GitHub repository secrets:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Secret&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_BASE_URL&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;https://yourorg.atlassian.net&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_USER_EMAIL&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The email address tied to your API token&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_API_TOKEN&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Generate at id.atlassian.com → Security → API tokens&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Optional --- from the transitions API call above&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For org-wide rollout, set these as organization secrets and restrict them to the relevant repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After rolling this out across a team of eight engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Zero manual Jira updates after merge&lt;/li&gt;
&lt;li&gt;  Forgotten ticket updates dropped from 30% to 0%&lt;/li&gt;
&lt;li&gt;  Roughly 1,700 minutes per year recovered per engineer&lt;/li&gt;
&lt;li&gt;  Every merged PR has a complete audit trail: PR number, URL, commit
SHA, who merged it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The composite action pattern also means when you need to extend this --- adding a Slack notification on merge, posting to Confluence --- you extend one file, not two.&lt;/p&gt;

&lt;p&gt;If you're building out automation like this across your engineering platform and want a second opinion on the design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I'm available for advisory engagements&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cicd</category>
      <category>devops</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>How I Manage Claude Code Context Across 20+ Repositories</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Fri, 20 Mar 2026 06:57:13 +0000</pubDate>
      <link>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</link>
      <guid>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/blog/managing-claude-code-context-multi-repo/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Three months ago I was re-explaining my Terragrunt state backend to Claude for the third time in a week. A different session, but the same repo I'd worked in the session before. Claude had no idea I was even in the same project.&lt;/p&gt;

&lt;p&gt;I run Claude Code daily across a 6-account AWS platform monorepo, a personal consulting site, homelab infrastructure, and a handful of side projects. Every session started with the same five minutes of "here's the project, here are the conventions, here's the Jira workflow" --- and still ended with Claude suggesting patterns that didn't fit the environment, because I'd inevitably forgotten to mention something.&lt;/p&gt;

&lt;p&gt;After three months of broken symlinks and abandoned experiments, I landed on a three-tier context hierarchy that loads the right context automatically depending on which directory I'm working in --- and I manage all of it from a single dotfiles repo.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Single-File Context
&lt;/h2&gt;

&lt;p&gt;Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from the current directory (and parent directories, walking up to &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;). Most teams start with one file and put everything in it.&lt;/p&gt;

&lt;p&gt;That breaks down quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global preferences get mixed with project
specifics.&lt;/strong&gt; Your "use snake_case for variable names" preference
shouldn't live next to your Terraform state bucket configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials and account IDs end up in files you accidentally
commit.&lt;/strong&gt; Put AWS account IDs in a shared CLAUDE.md, and someone
will eventually push it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't share patterns across repos without
duplication.&lt;/strong&gt; Every new repo gets a fresh copy of the same
conventions, and updates never propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-employer context creates conflicts.&lt;/strong&gt; Your
consulting client's Jira workflow shouldn't contaminate your personal
project sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first attempt at fixing this was a shared scripts directory. The three-tier hierarchy came later, after I figured out what was actually wrong with the simpler approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  My First Attempt: A Shared Scripts Directory
&lt;/h2&gt;

&lt;p&gt;Before landing on the three-tier system, I built something more obvious: a &lt;code&gt;~/shared-claude-infra/&lt;/code&gt; directory containing a &lt;code&gt;setup-project.sh&lt;/code&gt; script that initialized &lt;code&gt;.claude/&lt;/code&gt; context for each new repo.&lt;/p&gt;

&lt;p&gt;The script created the directory structure and symlinked a &lt;code&gt;rules/shared/&lt;/code&gt; folder back to the shared repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p "$PROJECT_DIR/.claude/rules"
ln -s ~/shared-claude-infra/rules "$PROJECT_DIR/.claude/rules/shared"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked for the first two repos I configured. Then the problems compounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual per-project setup.&lt;/strong&gt; Every new repo required
running the script. Miss one, and that repo has no shared context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two repos to maintain.&lt;/strong&gt; The shared infrastructure
lived in its own git repo, separate from dotfiles. Two places to update
when conventions changed, and they drifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested symlinks instead of directory-level
symlinks.&lt;/strong&gt; The &lt;code&gt;rules/shared&lt;/code&gt; symlink lived deep
inside the project's &lt;code&gt;.claude/&lt;/code&gt; tree. When the target moved,
every project that had run the script got a broken symlink ---
silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded paths that drifted.&lt;/strong&gt; The script referenced
workspace paths from three months earlier. My actual directory layout
had changed; the script still pointed at the old locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I eventually deleted the shared directory, a quick &lt;code&gt;find&lt;/code&gt; confirmed broken symlinks scattered across every repo that had run the setup script. The approach was inherently fragile because it depended on every machine, every repo, and every workspace path staying synchronized manually.&lt;/p&gt;
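&lt;p&gt;If you've inherited a setup like this, dangling symlinks are easy to hunt down with GNU find's &lt;code&gt;-xtype l&lt;/code&gt;, which matches symlinks whose target no longer exists. A reproducible sketch in a scratch directory:&lt;/p&gt;

```shell
# Reproduce the failure mode in a throwaway directory:
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch real-target
ln -s real-target good-link      # healthy symlink
ln -s missing-target bad-link    # dangling symlink

# GNU find: -xtype l matches symlinks whose target does not exist
find . -xtype l
# -> ./bad-link
```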

&lt;p&gt;The fix isn't a smarter script. It's inverting the relationship: instead of a script that runs once per project, use dotfiles that wire context automatically based on what directories exist.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Three-Tier Hierarchy
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/CLAUDE.md              ← Global: preferences, style, git workflow
~/work/{employer}/.claude/       ← Org: team structure, AWS accounts, Jira workflow
~/work/{employer}/{repo}/.claude/ ← Project: repo architecture, active tickets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Claude Code walks up the directory tree loading &lt;code&gt;CLAUDE.md&lt;/code&gt; files at each level. Each tier handles a specific scope:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global tier&lt;/strong&gt; (&lt;code&gt;~/.claude/&lt;/code&gt;): Everything that applies across all work --- communication style, git commit format, PR description templates, universal infrastructure patterns. No credentials, no account IDs, nothing employer-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Org tier&lt;/strong&gt; (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;): Team structure, Jira project keys, AWS account layout, CI/CD pipeline conventions. Sensitive patterns (account IDs, VPC IDs, state bucket names) go in gitignored files within this directory. Reusable patterns (CI/CD templates, AWS patterns without specifics) go in committed files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project tier&lt;/strong&gt; (&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;): Architecture decisions for this specific repo, active tickets, ongoing work state. Always gitignored --- this is ephemeral working context that changes frequently.&lt;/p&gt;
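&lt;p&gt;The committed/gitignored split is enforced with ordinary gitignore rules inside the org-tier directory. A sketch (file names match the org-level layout shown below; adjust to yours):&lt;/p&gt;

```text
# ~/work/{employer}/.claude/.gitignore
# Sensitive, machine-local context -- never committed
rules/aws-patterns.md
rules/terraform-patterns.md
```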
&lt;h2&gt;
  
  
  Implementation: Symlinks from Dotfiles
&lt;/h2&gt;

&lt;p&gt;The hierarchy only works if it's consistent across machines. I manage all context files from a dotfiles repo using symlinks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotfiles/claude/
├── global/          → symlinked to ~/.claude/
├── {employer}/      → symlinked to ~/work/{employer}/.claude/
└── {personal}/      → symlinked to ~/personal/.claude/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; wires these automatically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Global context
ln -sf "$DOTFILES/claude/global" "$HOME/.claude"

# Per-employer context
for employer in "${EMPLOYERS[@]}"; do
  WORK_DIR="$HOME/work/$employer"
  if [ -d "$WORK_DIR" ]; then
    ln -sf "$DOTFILES/claude/$employer" "$WORK_DIR/.claude"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any machine that runs &lt;code&gt;install.sh&lt;/code&gt; gets the same context hierarchy. Changes committed to dotfiles propagate immediately.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Each Level Contains
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Global (&lt;code&gt;~/.claude/&lt;/code&gt;)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/
├── CLAUDE.md          # Preferences, active work summary
└── rules/
    ├── git-workflow.md
    ├── pr-patterns.md
    ├── infrastructure.md
    └── context-management.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is short --- preferences and a pointer to where&lt;br&gt;
active work lives. The heavy lifting goes in &lt;code&gt;rules/&lt;/code&gt; files&lt;br&gt;
that Claude loads as supplementary context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Communication Style
- Be direct and technical — I understand infrastructure concepts
- Explain the "why" behind decisions
- Provide specific file paths and line numbers

## Git Workflow
- Branch format: feat/TICKET-123-description
- Commit format: [TICKET-123] Brief summary\n\nWhy this change...
- Never add Co-Authored-By trailers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Org Level (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/work/{employer}/.claude/
├── {EMPLOYER}.md              # Team structure, Jira workflow — committed
└── rules/
    ├── cicd-patterns.md       # CI/CD conventions — committed
    ├── aws-patterns.md        # Account IDs, VPC IDs — GITIGNORED
    └── terraform-patterns.md  # State config, module paths — GITIGNORED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The privacy split matters. &lt;code&gt;cicd-patterns.md&lt;/code&gt; contains&lt;br&gt;
reusable GitHub Actions patterns --- fine to commit.&lt;br&gt;
&lt;code&gt;aws-patterns.md&lt;/code&gt; contains actual account IDs --- stays&lt;br&gt;
local.&lt;/p&gt;

&lt;p&gt;A typical employer context file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Team Structure
- Platform team: 5 engineers, all in Jira project IN
- AWS accounts: dev, nonprod, prod (+ 3 infra accounts network, security, management)
- Monorepo: ~/work/{employer}/iac — Terragrunt, 78 components

## Jira Workflow
- IN project (infrastructure): transition IDs 3=In Dev, 8=Needs Review, 9=Done
- Prefix all commits: [IT-XXX]
- API: REST v2 only — v3 silently returns empty responses

## CI/CD
- GitHub Actions with OIDC to AWS (no long-lived credentials)
- PR requires: terraform plan output posted as comment
- Merge to main triggers auto-deploy to nonprod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gitignored &lt;code&gt;aws-patterns.md&lt;/code&gt; contains account IDs and&lt;br&gt;
specific ARNs that Claude needs for generating Terraform configurations&lt;br&gt;
accurately but shouldn't be committed anywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Level (&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Project context is ephemeral and always gitignored. It's the working&lt;br&gt;
memory for an ongoing effort:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Current State
- Branch: feat/IT-89-my-app-dev-ecr
- Active ticket: IT-89 — ECR in dev
- Next: IT-90 — ECS task definition

## Architecture Decisions
- ECR in dev only; cross-account pull policies for nonprod and prod
- Mutable tags in dev, immutable in nonprod/prod
- KMS key per environment, not per repository

## Blockers
- Waiting on network DNS zone creation before cutover can proceed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I update this file at the end of each session with current state so&lt;br&gt;
the next session loads instantly without re-explaining where things&lt;br&gt;
are.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Privacy Model
&lt;/h2&gt;

&lt;p&gt;The critical insight is that context files need two categories:&lt;br&gt;
committed (shareable) and local-only (sensitive).&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Content&lt;/th&gt;&lt;th&gt;Location&lt;/th&gt;&lt;th&gt;Committed?&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Personal preferences&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Git workflow rules&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/rules/git-workflow.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Team structure&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/{EMPLOYER}.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅ sanitized&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CI/CD patterns&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/cicd-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AWS account IDs&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/aws-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;VPC IDs, state config&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/terraform-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Active ticket state&lt;/td&gt;&lt;td&gt;&lt;code&gt;{repo}/.claude/OVERRIDES.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; at the dotfiles level handles this&lt;br&gt;
automatically by ignoring &lt;code&gt;**/aws-patterns.md&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;**/terraform-patterns.md&lt;/code&gt; across all employer&lt;br&gt;
directories.&lt;/p&gt;
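The split is easy to verify locally with `git check-ignore`. Here's a self-contained sketch; the `acme` employer name and the throwaway temp repo exist only for the demonstration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo standing in for the dotfiles repo (paths are illustrative)
repo="$(mktemp -d)"
git -C "$repo" init -q

# The dotfiles-level ignore rules described above
cat > "$repo/.gitignore" <<'EOF'
**/aws-patterns.md
**/terraform-patterns.md
EOF

mkdir -p "$repo/claude/acme/rules"
touch "$repo/claude/acme/rules/aws-patterns.md" \
      "$repo/claude/acme/rules/cicd-patterns.md"

# check-ignore exits 0 when a path matches an ignore rule
git -C "$repo" check-ignore -q claude/acme/rules/aws-patterns.md \
  && echo "aws-patterns.md: stays local"
git -C "$repo" check-ignore -q claude/acme/rules/cicd-patterns.md \
  || echo "cicd-patterns.md: safe to commit"
```

`check-ignore -q` exits 0 only when a path matches, which also makes it usable as a pre-commit guard.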
&lt;h2&gt;
  
  
  Custom Commands
&lt;/h2&gt;

&lt;p&gt;Beyond context files, Claude Code supports custom&lt;br&gt;
&lt;code&gt;/commands&lt;/code&gt; --- reusable prompts stored as markdown files:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/commands/
├── checkpoint.md       # Create context snapshot
├── sync-work.md        # Update active work status
└── pr-ready.md         # Generate PR description
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;A command file is just the prompt Claude should execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# checkpoint.md
Create a context checkpoint. Read the current git status across active repos,
summarize open PRs and their status, list active tickets with their current
state, and write a structured summary to ~/.claude/local.md. Include any
blocking issues and the next planned action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commands at the global level are available everywhere. Org-level&lt;br&gt;
commands handle employer-specific workflows like Jira transitions.&lt;/p&gt;
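An org-level command is the same shape as a global one. Here's a hypothetical sketch of a Jira transition command (the file name and wording are invented; the transition id follows the employer context example above):

```markdown
# jira-done.md (hypothetical org-level command)
Transition the Jira ticket for the current branch to Done. Parse the
ticket key from the branch name (feat/IT-123-description), call the
Jira REST v2 transitions endpoint with transition id 9, and add a
comment linking the merged PR.
```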

&lt;h2&gt;
  
  
  What This Solves in Practice
&lt;/h2&gt;

&lt;p&gt;Before this system: every new Claude session started with "here's the&lt;br&gt;
project, here are the conventions, here's where things are." Five&lt;br&gt;
minutes of ramp-up, inconsistent outputs because I'd forget to mention&lt;br&gt;
something.&lt;/p&gt;

&lt;p&gt;After: I &lt;code&gt;cd&lt;/code&gt; into a repo and Claude already knows the&lt;br&gt;
Jira workflow, the AWS account structure, the naming conventions, and&lt;br&gt;
where the active work stands. When I start a session mid-ticket, the&lt;br&gt;
project-level context tells Claude exactly what was in progress.&lt;/p&gt;

&lt;p&gt;The bigger payoff is consistency. When Claude generates Terraform, it&lt;br&gt;
generates it with the correct state backend. When it writes commit&lt;br&gt;
messages, they follow the format reviewers expect. When it suggests&lt;br&gt;
architecture, it fits the actual account model rather than a generic AWS&lt;br&gt;
example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you're starting from scratch, work tier by tier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with your communication
preferences and git conventions.&lt;/li&gt;
&lt;li&gt; Add a &lt;code&gt;rules/&lt;/code&gt; directory with patterns you want loaded
consistently.&lt;/li&gt;
&lt;li&gt; Create an org-level directory when you start working with a specific
employer or major project.&lt;/li&gt;
&lt;li&gt; Add project-level context when you start a multi-session
effort.&lt;/li&gt;
&lt;/ol&gt;
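Steps 1 and 2 amount to a couple of commands. A minimal bootstrap sketch; the `CLAUDE_HOME` override and the seeded content are illustrative, not a prescribed format:

```shell
#!/usr/bin/env bash
# Bootstrap steps 1 and 2 of the global tier.
# CLAUDE_HOME is an illustrative override (defaults to $HOME).
set -euo pipefail

base="${CLAUDE_HOME:-$HOME}/.claude"
mkdir -p "$base/rules"

# Seed CLAUDE.md only if one doesn't already exist
if [ ! -f "$base/CLAUDE.md" ]; then
  cat > "$base/CLAUDE.md" <<'EOF'
## Communication Style
- Be direct and technical

## Git Workflow
- Branch format: feat/TICKET-123-description
EOF
fi

touch "$base/rules/git-workflow.md"
echo "Global tier ready at $base"
```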

&lt;p&gt;Don't try to build the whole system at once. The global tier alone&lt;br&gt;
eliminates most of the per-session ramp-up. The org and project tiers&lt;br&gt;
pay off as work gets more complex.&lt;/p&gt;

&lt;p&gt;The thing that surprised me most wasn't the time saved on ramp-up. It&lt;br&gt;
was how much the output quality improved. When Claude knows the actual&lt;br&gt;
state backend, the actual account IDs, the actual PR format your&lt;br&gt;
reviewers expect --- the suggestions it makes fit your environment. That&lt;br&gt;
gap between "technically correct" and "actually usable" is where most of&lt;br&gt;
the friction in AI-assisted infrastructure work lives. The context&lt;br&gt;
hierarchy is mostly just closing that gap.&lt;/p&gt;

&lt;p&gt;If you're setting up Claude Code for a platform team and want to talk&lt;br&gt;
through the context design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I do advisory&lt;br&gt;
engagements&lt;/a&gt; for teams getting serious about AI tooling in their&lt;br&gt;
infrastructure workflow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>developertooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>Great design, and easy to follow.</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:23:46 +0000</pubDate>
      <link>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</link>
      <guid>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/cbecerra" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3545716%2Fa8cbf641-51dd-4f99-ad6b-abe0f714fa3b.jpeg" alt="cbecerra"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/cbecerra/how-to-implement-aws-network-firewall-in-a-multi-account-architecture-using-transit-gateway-2nam" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Implement AWS Network Firewall in a Multi-Account Architecture Using Transit Gateway&lt;/h2&gt;
      &lt;h3&gt;Cristhian Becerra ・ Oct 13 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#english&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#networking&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cybersecurity&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>english</category>
      <category>aws</category>
      <category>networking</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Building Automated AWS Permission Testing Infrastructure for CI/CD</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:14:50 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</link>
      <guid>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/aws-permission-testing-cicd/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I deployed a permission set for our data engineers five times before&lt;br&gt;
it worked correctly.&lt;/p&gt;

&lt;p&gt;The first deployment: S3 reads worked, Glue Data Catalog reads&lt;br&gt;
worked. Athena queries failed --- the query engine needs KMS decrypt&lt;br&gt;
through a service principal, and I'd missed the&lt;br&gt;
&lt;code&gt;kms:ViaService&lt;/code&gt; condition. Second deployment: Athena worked.&lt;br&gt;
EMR Serverless job submission failed --- missing&lt;br&gt;
&lt;code&gt;iam:PassRole&lt;/code&gt;. Third deployment: EMR submission worked. Job&lt;br&gt;
execution failed --- missing permissions on the EMR Serverless execution&lt;br&gt;
role boundary. I kept deploying, engineers kept getting blocked, I kept&lt;br&gt;
opening tickets.&lt;/p&gt;

&lt;p&gt;Five iterations. Two weeks. Every failure meant a data engineer&lt;br&gt;
opened a ticket instead of running their job.&lt;/p&gt;

&lt;p&gt;The problem wasn't that IAM is complicated --- it is, but that's&lt;br&gt;
expected. The problem was that I had no way to catch these issues before&lt;br&gt;
deploying to the account where real engineers were trying to do real&lt;br&gt;
work. Every bug was a production bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  The "Access Denied" Debugging Loop
&lt;/h2&gt;

&lt;p&gt;Here's what the reactive debugging cycle looks like from the&lt;br&gt;
inside.&lt;/p&gt;

&lt;p&gt;Engineer opens a ticket:&lt;br&gt;
&lt;code&gt;AccessDeniedException: User is not authorized to perform: s3:GetObject&lt;/code&gt;.&lt;br&gt;
I add &lt;code&gt;s3:GetObject&lt;/code&gt; to the permission set. Next day:&lt;br&gt;
&lt;code&gt;AccessDeniedException: s3:PutObject&lt;/code&gt;. I add&lt;br&gt;
&lt;code&gt;s3:PutObject&lt;/code&gt;. Day after: write succeeds but cleanup fails ---&lt;br&gt;
&lt;code&gt;s3:DeleteObject&lt;/code&gt;. At this point I've done four deployment&lt;br&gt;
cycles and two days of work to get S3 read/write/delete working. If I'd&lt;br&gt;
just added &lt;code&gt;s3:*&lt;/code&gt; I'd be done, but that violates&lt;br&gt;
least-privilege and opens the raw zone to write access, which we&lt;br&gt;
explicitly don't want.&lt;/p&gt;

&lt;p&gt;The deeper issue is that individual services don't fail atomically.&lt;br&gt;
Athena requires &lt;code&gt;athena:StartQueryExecution&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryResults&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryExecution&lt;/code&gt;, but it also requires KMS decrypt&lt;br&gt;
through the Athena service principal to read encrypted S3 results. That&lt;br&gt;
last piece isn't in the Athena docs --- you find it by failing in&lt;br&gt;
production.&lt;/p&gt;

&lt;p&gt;I wanted a way to find it before deploying.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The testing framework has four components: per-persona permission set&lt;br&gt;
templates, a Bash test library, per-service test scripts, and a GitHub&lt;br&gt;
Actions workflow that runs everything on pull requests.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  GitHub Pull Request (Permission Set Changes)   │
└───────────────────┬─────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │  CI/CD Workflow     │
         │  (GitHub Actions)   │
         └──────────┬──────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌───────┐      ┌──────────┐   ┌──────────┐
│ S3    │      │  Glue    │   │ Athena   │
│ Tests │      │  Tests   │   │  Tests   │
└───────┘      └──────────┘   └──────────┘
                    │
         ┌──────────▼──────────┐
         │  Test Report        │
         │  (Posted to PR)     │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow triggers on any pull request that modifies the&lt;br&gt;
identity-center Terraform directory. Tests run against real AWS accounts&lt;br&gt;
--- dev and nonprod --- using test credentials provisioned for that purpose.&lt;br&gt;
Results post as a PR comment before anyone approves the change.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 1: Pre-Validated Templates
&lt;/h2&gt;

&lt;p&gt;Before I wrote a single test, I needed a starting point for&lt;br&gt;
permission sets that captured the patterns I'd learned the hard way.&lt;br&gt;
Templates that handle the non-obvious pieces --- zone-scoped S3 access,&lt;br&gt;
KMS conditions tied to specific services, explicit denies for&lt;br&gt;
destructive operations.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AnalystAccess&lt;/code&gt; template is representative. Analysts&lt;br&gt;
get read-only access to the curated zone of the data lake, Athena query&lt;br&gt;
execution in the primary workgroup, and KMS decrypt --- but only when the&lt;br&gt;
decrypt request originates from S3 or Athena, not from arbitrary API&lt;br&gt;
calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;inline_policy&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
  &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GlueCatalogReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetPartitions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:SearchTables"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:database/curated_*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:table/curated_*/*"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3CuratedReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*/curated/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"s3:prefix"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curated/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AthenaQueryExecution"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"athena:StartQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryResults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:StopQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:athena:*:*:workgroup/primary"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMSDecryptViaSvc"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:*:*:key/*"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"kms:ViaService"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DenyDestructiveOps"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Deny"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:DeleteObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:DeleteBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteTable"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kms:ViaService&lt;/code&gt; condition is the piece that took five&lt;br&gt;
production failures to discover. KMS decrypt without that condition&lt;br&gt;
allows an analyst to call &lt;code&gt;kms:Decrypt&lt;/code&gt; directly from their&lt;br&gt;
shell, which is not what we want. The condition locks decrypt to&lt;br&gt;
requests that pass through S3 or Athena specifically.&lt;/p&gt;

&lt;p&gt;The explicit deny block matters too. Without it, if someone later&lt;br&gt;
grants broader S3 permissions to this persona for a different reason,&lt;br&gt;
the curated zone protection evaporates. The deny creates a hard floor&lt;br&gt;
regardless of what else gets added.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 2: The Test Framework
&lt;/h2&gt;

&lt;p&gt;I chose Bash over Python or a proper test framework deliberately. The&lt;br&gt;
tests run in CI with no dependencies beyond the AWS CLI --- no package&lt;br&gt;
installs, no virtual environments, no version pinning of test libraries.&lt;br&gt;
The machines running these tests already have the AWS CLI.&lt;/p&gt;

&lt;p&gt;The core library in &lt;code&gt;lib/test-framework.sh&lt;/code&gt;:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"

  if eval "$test_command" &amp;amp;&amp;gt;/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ ${#TESTS_FAILED[@]} -gt 0 ] &amp;amp;&amp;amp; printf '  - %s\n' "${TESTS_FAILED[@]}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



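The library has no AWS dependency of its own, so it can be smoke-tested locally. Here the two functions above are inlined and driven by trivial commands instead of AWS CLI calls:

```shell
#!/usr/bin/env bash
# Local smoke test of the framework: library functions inlined from above,
# driven by trivial commands so no AWS credentials are needed.
declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"
  if eval "$test_command" &>/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  FAIL: $test_name"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
}

run_test "always-true"  "true"  "sanity: passing test"
run_test "always-false" "false" "sanity: failing test"
generate_text_report
```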

&lt;p&gt;The most important design decision in the test scripts is testing&lt;br&gt;
denials as carefully as allowances. Testing only what should succeed&lt;br&gt;
tells you the permission set isn't obviously broken. Testing what should&lt;br&gt;
fail tells you it's not accidentally too permissive.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Test what should succeed
run_test "s3-list-curated" \
  "aws s3 ls s3://lake-bucket-dev/curated/" \
  "Analyst can list curated zone"

# Test what should fail: the denied call's output must name AccessDenied
run_test "s3-write-denied" \
  "aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'" \
  "Analyst cannot write to curated zone"

run_test "s3-raw-zone-denied" \
  "aws s3 ls s3://lake-bucket-dev/raw/ 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'" \
  "Analyst cannot access raw zone"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Beyond service-level tests, I run persona tests that simulate&lt;br&gt;
end-to-end workflows. An analyst's workflow isn't "call S3, then call&lt;br&gt;
Athena separately" --- it's "run an Athena query that reads encrypted S3&lt;br&gt;
data and writes results to the query results bucket." That integration&lt;br&gt;
test catches failures that individual service tests miss. The original&lt;br&gt;
five-iteration DataPlatformAccess failure? An individual S3 test would&lt;br&gt;
have passed. A persona test running an actual Athena query against the&lt;br&gt;
encrypted lake would have caught the KMS gap.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 3: CI/CD Integration
&lt;/h2&gt;

&lt;p&gt;The GitHub Actions workflow triggers on pull requests that touch the&lt;br&gt;
identity-center Terraform directory, runs tests in a matrix against dev&lt;br&gt;
and nonprod, and posts a summary comment to the PR.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    strategy:
      matrix:
        include:
          - environment: workloads-dev
            account: "111111111111"   # placeholder account ID
          - environment: workloads-nonprod
            account: "222222222222"   # placeholder account ID
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.account }}:role/github-actions-role
          aws-region: us-east-1
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;id-token: write&lt;/code&gt; permission is required for OIDC&lt;br&gt;
authentication to AWS --- the workflow assumes a role in each account&lt;br&gt;
rather than using long-lived credentials in GitHub Secrets. This is the&lt;br&gt;
right pattern: credentials rotate automatically, and there's no secret&lt;br&gt;
to rotate manually or accidentally expose.&lt;/p&gt;
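The other half of the OIDC setup lives in AWS: the role's trust policy is what scopes the grant to this repository. A representative sketch, with the account ID and the `my-org/iac` repo name as placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::111111111111:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:my-org/iac:*"
      }
    }
  }]
}
```

The `sub` condition is the important line: the OIDC provider is account-wide, so a trust policy without it would let workflows from any GitHub repository assume the role.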

&lt;p&gt;The PR comment posts the full test output with pass/fail counts per&lt;br&gt;
persona per account. A reviewer can look at the comment and immediately&lt;br&gt;
see whether the permission change has test coverage and whether the&lt;br&gt;
tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Things I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;First: test KMS decryption through each service separately.&lt;br&gt;
&lt;code&gt;kms:Decrypt&lt;/code&gt; via S3 and &lt;code&gt;kms:Decrypt&lt;/code&gt; via Athena&lt;br&gt;
are different IAM evaluation paths even though they're the same API&lt;br&gt;
call. A test that puts an object and gets it back via S3 directly won't&lt;br&gt;
catch a broken Athena KMS path.&lt;/p&gt;

&lt;p&gt;Second: negative tests matter as much as positive ones. Before I had&lt;br&gt;
the test framework, every permission set I wrote was tested only for&lt;br&gt;
what it should allow. I had no systematic check that it didn't allow&lt;br&gt;
more. The denial tests are what give security reviewers confidence.&lt;/p&gt;

&lt;p&gt;Third: persona tests catch failures that service tests miss.&lt;br&gt;
Individual service tests are fast to write and good for regression&lt;br&gt;
coverage, but they test permissions in isolation. Real workflows cross&lt;br&gt;
service boundaries. Build both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before the framework: five iterations to get one permission set&lt;br&gt;
right, every iteration a production impact. After: 95% of permission&lt;br&gt;
issues caught at PR review time. Zero production impacts from permission&lt;br&gt;
bugs since we shipped it. The templates reduced new permission set&lt;br&gt;
creation time by about 70% --- instead of starting from scratch with the&lt;br&gt;
IAM documentation, we start from a pre-validated base and modify from&lt;br&gt;
there.&lt;/p&gt;

&lt;p&gt;The time investment was about a week: two days for templates, two&lt;br&gt;
days for the test framework and scripts, one day for CI/CD integration&lt;br&gt;
and documentation. That investment paid back in the first sprint when&lt;br&gt;
the analyst permission set for a new hire went out correct on the first&lt;br&gt;
deployment.&lt;/p&gt;

&lt;p&gt;Running into IAM permission debugging loops on your team? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; --- permission testing infrastructure is&lt;br&gt;
one of the first things I build when joining a new platform team.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>iam</category>
    </item>
    <item>
      <title>Zero-Downtime AWS Transit Gateway Hub-Spoke Migration</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:05:59 +0000</pubDate>
      <link>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-nii</link>
      <guid>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-nii</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/transit-gateway-hub-spoke-migration/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The request came from the security team: they needed network-level access from the nonprod account to the dev account so a vulnerability scanner could reach internal services. Simple enough on the surface. In practice, it exposed a gap we'd been living with for months, and forced us to fix the network architecture we'd been deferring.&lt;/p&gt;

&lt;p&gt;We had three standalone Transit Gateways, one in each workload account: dev, nonprod, and prod. Completely isolated from each other. No cross-account connectivity at all. The security scanner couldn't reach its targets, and adding more point-to-point peering connections to fix it would have made everything worse.&lt;/p&gt;

&lt;p&gt;But the TGW isolation was only part of the problem. We also had no inspection of traffic crossing our network boundary. Egress from workload pods went straight to the internet with no filtering. Ingress came through per-account load balancers with no centralized enforcement point. As the platform scaled toward additional workload accounts, this pattern was going to get expensive and hard to reason about.&lt;/p&gt;

&lt;p&gt;So we didn't just fix the TGW. We rebuilt the network foundation: a centralized Inspection VPC with a Network Firewall inline, a single hub Transit Gateway shared across all accounts, and centralized security tooling (GuardDuty, CloudTrail, Security Hub) aggregated in a dedicated Security account. Two maintenance windows, a few weeks of module work, and the platform went from fragmented per-account networking to a coherent hub-spoke design with full traffic inspection.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture We Were Replacing
&lt;/h2&gt;

&lt;p&gt;Before the migration, each workload account was self-contained. It had its own TGW, its own internet gateway, its own NAT gateways. Security tooling ran independently in each account with no aggregation. The management account had no single-pane visibility into what was happening across the environment.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-before.png" alt="Before: Three isolated workload accounts — each with its own IGW, NAT Gateway, and standalone Transit Gateway, no cross-account connectivity" width="800" height="258"&gt;&lt;br&gt;



&lt;p&gt;The cost of running this way was about $150/month in TGW charges plus duplicated NAT gateway charges in each account. Every new workload account would multiply this cost again and add another independent security configuration.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Target: Inspection VPC + Hub Transit Gateway
&lt;/h2&gt;

&lt;p&gt;The target was AWS Security Reference Architecture Pattern B: an Inspection VPC that sits between the internet and all workload VPCs. All internet traffic, ingress and egress, flows through this VPC and through a Network Firewall before reaching any workload account.&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-after.png" alt="After: Centralized hub with inline Network Firewall inspection — all traffic flows through the Infrastructure Account's Inspection VPC before reaching any workload" width="800" height="552"&gt;&lt;br&gt;



&lt;p&gt;Egress path: workload pod → TGW → Inspection VPC TGW subnets → Network Firewall → NAT Gateway → IGW → internet.&lt;/p&gt;

&lt;p&gt;Ingress path: internet → IGW → centralized ALB (public subnet) → Network Firewall → TGW → workload VPC → pod.&lt;/p&gt;

&lt;p&gt;Nothing crosses the network boundary without passing through the firewall. Workload accounts carry no internet-facing infrastructure at all: no IGW, no NAT gateways, no public load balancers.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 1: Module Changes
&lt;/h2&gt;

&lt;p&gt;All Terraform work happened before scheduling any maintenance. The goal was to reach a state where the migration itself was just running pre-staged plan files in a specific sequence.&lt;/p&gt;
&lt;h3&gt;
  
  
  Transit Gateway: add a conditional create flag
&lt;/h3&gt;

&lt;p&gt;The existing network module always created a TGW. We needed spoke accounts to declare the same module without spinning up their own gateway:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "create_transit_gateway" {
  description = "Whether to create a Transit Gateway (false for hub-spoke spokes)"
  type        = bool
  default     = true
}

resource "aws_ec2_transit_gateway" "this" {
  count       = var.create_transit_gateway ? 1 : 0
  description = var.tgw_description
}

output "transit_gateway_id" {
  value = var.create_transit_gateway ? aws_ec2_transit_gateway.this[0].id : null
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;code&gt;default = true&lt;/code&gt; means existing configurations need no changes. The flag only flips to &lt;code&gt;false&lt;/code&gt; after the spoke attachment is confirmed working.&lt;/p&gt;
&lt;h3&gt;
  
  
  New module: vpc-attachment
&lt;/h3&gt;

&lt;p&gt;The vpc-attachment module handles the spoke side of the hub relationship: create the TGW attachment, associate it to the hub's route table, and add routes to every private route table in the spoke VPC pointing at the hub TGW.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
  subnet_ids         = var.subnet_ids

  tags = merge(var.tags, {
    Name = "${var.name}-hub-attachment"
  })
}

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.this.id
  transit_gateway_route_table_id = var.transit_gateway_route_table_id
}

resource "aws_route" "to_hub_tgw" {
  for_each               = toset(var.vpc_route_table_ids)
  route_table_id         = each.value
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;10.0.0.0/8&lt;/code&gt; supernet covers all workload VPC CIDRs without maintaining per-prefix route entries. It also covers the Inspection VPC CIDR (&lt;code&gt;10.100.0.0/20&lt;/code&gt;); that's how return traffic from the centralized ALB finds its way back to pods in workload VPCs.&lt;/p&gt;

&lt;p&gt;The Terragrunt config for a spoke account reads VPC details from the existing network dependency and hardcodes the hub TGW identifiers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"network"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../network"&lt;/span&gt;
  &lt;span class="nx"&gt;mock_outputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-mockid"&lt;/span&gt;
    &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"subnet-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;private_route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"rtb-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-xxxxx"&lt;/span&gt;   &lt;span class="c1"&gt;# hub TGW, documented in runbook&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-rtb-xxxxx"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We hardcoded the hub TGW and route table IDs rather than using cross-account data sources. The alternative, reading TGW details from the Infrastructure account at plan time, requires cross-account state access and adds complexity that isn't worth it for values that change maybe once in the platform's lifetime.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hub route tables: workload isolation by default
&lt;/h3&gt;

&lt;p&gt;A key design decision: workload accounts should not route to each other directly. Dev should not reach nonprod; nonprod should not reach prod. The hub TGW enforces this through route table structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;default-association-rt&lt;/strong&gt;: all workload attachments associate here. The only route is &lt;code&gt;0.0.0.0/0 → inspection attachment&lt;/code&gt;. Workloads can reach the internet via the Inspection VPC, but cannot reach other workload VPCs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;default-propagation-rt&lt;/strong&gt;: the inspection attachment propagates workload CIDRs here for return traffic routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inter-account communication is opt-in: you add an explicit route table entry for a specific attachment pair. By default, the architecture prevents lateral movement across workload accounts.&lt;/p&gt;
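&lt;p&gt;A minimal Terraform sketch of that route table structure, under assumptions: the hub TGW and attachment resources exist under the names shown, and the workload attachment IDs arrive as a variable. Names here are illustrative, not our actual module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Workload attachments associate here; the only route leads to inspection.
resource "aws_ec2_transit_gateway_route_table" "association" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-association-rt" }
}

resource "aws_ec2_transit_gateway_route" "to_inspection" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.association.id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
}

# Workload CIDRs propagate into the table used for return traffic routing.
resource "aws_ec2_transit_gateway_route_table" "propagation" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-propagation-rt" }
}

resource "aws_ec2_transit_gateway_route_table_propagation" "workloads" {
  for_each                       = toset(var.workload_attachment_ids)
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.propagation.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because workload attachments never propagate into the association table, no workload ever learns a route to another workload; that's what makes the isolation a default rather than a policy to remember.&lt;/p&gt;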
&lt;h3&gt;
  
  
  Inspection VPC subnet layout
&lt;/h3&gt;

&lt;p&gt;The Inspection VPC has three tiers with carefully constructed route tables that force traffic through the firewall in both directions:&lt;/p&gt;


&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-subnets.png" alt="Inspection VPC subnet layout — three tiers (public, firewall, TGW) with asymmetric route tables that force all traffic through Network Firewall endpoints in both directions" width="800" height="1254"&gt;&lt;br&gt;



&lt;p&gt;The asymmetric route table design ensures the firewall sees every packet crossing the network boundary, regardless of direction. Traffic entering from the internet hits the firewall before reaching workloads. Traffic from workloads hits the firewall before reaching the internet.&lt;/p&gt;
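&lt;p&gt;In route terms, the asymmetry looks roughly like this. The route table names and the firewall endpoint reference are illustrative assumptions; &lt;code&gt;aws_route&lt;/code&gt; does accept a &lt;code&gt;vpc_endpoint_id&lt;/code&gt; target for Network Firewall endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TGW subnet tier: egress from workloads is sent to the firewall first.
resource "aws_route" "tgw_to_firewall" {
  route_table_id         = aws_route_table.tgw.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = local.firewall_endpoint_id
}

# Firewall subnet tier: inspected egress continues to the NAT gateway.
resource "aws_route" "firewall_to_nat" {
  route_table_id         = aws_route_table.firewall.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id
}

# Public subnet tier: traffic bound for workloads re-enters the firewall.
resource "aws_route" "public_to_firewall" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "10.0.0.0/8"
  vpc_endpoint_id        = local.firewall_endpoint_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;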
&lt;h3&gt;
  
  
  Security baseline: convert to delegated admin model
&lt;/h3&gt;

&lt;p&gt;GuardDuty and CloudTrail were running independently per account. We added &lt;code&gt;enable_guardduty&lt;/code&gt; and &lt;code&gt;enable_cloudtrail&lt;/code&gt; boolean variables to the security-baseline module so workload accounts could switch from standalone to member without touching the module invocation itself.&lt;/p&gt;

&lt;p&gt;In the Security account, we deployed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GuardDuty&lt;/strong&gt; as delegated admin with organization-level auto-enrollment. EKS Protection and S3 Protection enabled. All findings from all accounts visible in a single dashboard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CloudTrail&lt;/strong&gt; organization trail writing to a cross-account S3 bucket. Log file validation and KMS encryption enabled. Per-account trails archived after the cutover, not deleted, in case historical log formats differed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security Hub&lt;/strong&gt; with CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices enabled across the full organization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Phase 2: Two Maintenance Windows
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Window 1: Deploy the hub (~45 minutes, low risk)
&lt;/h3&gt;

&lt;p&gt;With no existing attachments and no workload traffic, deploying the hub infrastructure carried minimal risk. We applied the Infrastructure account TGW and Inspection VPC in a single window. The Network Firewall takes 5–10 minutes to reach READY state after creation; account for that in your timing.&lt;/p&gt;

&lt;p&gt;At the end of this window: hub TGW running, Inspection VPC active, Network Firewall endpoints healthy in both AZs, centralized ALB deployed. Nothing attached yet. We documented the TGW ID and route table IDs in the runbook before scheduling window 2.&lt;/p&gt;
&lt;h3&gt;
  
  
  Window 2: Spoke cutover (~2 hours)
&lt;/h3&gt;

&lt;p&gt;The key insight for keeping applications running: &lt;strong&gt;create the hub attachment before destroying the standalone TGW&lt;/strong&gt;. While both exist simultaneously, traffic continues flowing through the standalone path. The actual cutover is updating routes to point at the hub; that's a single &lt;code&gt;terragrunt apply&lt;/code&gt;, not the destruction of the old TGW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0: Accept RAM share.&lt;/strong&gt; The Infrastructure account shares the hub TGW via Resource Access Manager. Workload accounts accept the share invitation. Pure metadata operation; zero network impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+15: Deploy VPC attachments.&lt;/strong&gt; Apply the &lt;code&gt;vpc-attachment&lt;/code&gt; module in each workload account. At this point each spoke VPC has two routes for &lt;code&gt;10.0.0.0/8&lt;/code&gt;: the existing one pointing at the standalone TGW, and the new one pointing at the hub. With identical prefix lengths, traffic still flows through the standalone path. Rollback at this stage is &lt;code&gt;terragrunt destroy&lt;/code&gt; on the attachment module, under five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+30: Verify routes and test cross-account connectivity.&lt;/strong&gt; Confirm hub routes are present in every private route table:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxxxx" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`10.0.0.0/8`]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Then test actual cross-account traffic: connect from a dev instance to a service in the nonprod VPC. The hub TGW and Inspection VPC should route it correctly. This also validates that the firewall rule groups are permitting expected traffic; catch any rule issues here, before cutting over production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+45: Migrate security tooling.&lt;/strong&gt; Apply the updated security-baseline to each workload account. GuardDuty converts from standalone admin to member; findings flow to the delegated admin in the Security account. The local CloudTrail trail is disabled, and the organization trail is confirmed to be logging events from the account. Zero network impact.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify GuardDuty membership
aws guardduty get-administrator-account --detector-id &amp;lt;id&amp;gt;
# Returns the Security account as administrator

# Verify organization trail is capturing events
# Make an API call, wait ~15 minutes, check the Security account's S3 bucket
aws s3 ls s3://&amp;lt;org-trail-bucket&amp;gt;/AWSLogs/&amp;lt;account-id&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;T+60: Set &lt;code&gt;create_transit_gateway = false&lt;/code&gt; in each spoke.&lt;/strong&gt; This is the cutover. Run &lt;code&gt;terraform plan&lt;/code&gt; first and confirm it shows only the TGW and its attached resources being destroyed, nothing else. Apply dev first, watch the destruction complete, confirm application traffic is flowing through the hub. Then apply nonprod. About 3 minutes per account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+90: Health checks and close.&lt;/strong&gt; Spot-check API endpoints, database connectivity, anything that traverses the network. Confirm egress traffic is hitting the firewall logs in the Infrastructure account. The maintenance window closed at the 90-minute mark; actual work was done by T+75. We kept the window open for the last 15 minutes as a buffer.&lt;/p&gt;

&lt;p&gt;The parallel attachment approach ensured there was never a moment when a workload account had no routing path. Even if the hub TGW had been misconfigured, traffic would have continued flowing through the standalone gateway until we chose to destroy it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Ended Up With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One TGW&lt;/strong&gt; in the Infrastructure account with three spoke attachments. Route tables that allow workload→internet traffic while preventing workload→workload lateral movement by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One Inspection VPC&lt;/strong&gt; with Network Firewall endpoints in two AZs. All egress inspected against stateful domain filter rules and stateless port rules. All ingress from the centralized ALB inspected. Firewall policy updates apply to all workload accounts simultaneously; no per-account changes needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One centralized ALB&lt;/strong&gt; in the Infrastructure account, routing to EKS target groups in workload accounts via cross-account IAM role assumption. Workload accounts carry no public-facing load balancers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One security console&lt;/strong&gt; in the Security account. GuardDuty findings from all accounts in a single dashboard. CloudTrail logs from every account in one S3 bucket. Security Hub compliance posture for the full organization visible in one place.&lt;/p&gt;

&lt;p&gt;Cost went from roughly $150–200/month (standalone TGWs, per-account NAT, independent security tooling) to approximately $50/month (single hub TGW plus attachment hours, shared NAT in the Inspection VPC, delegated security services). We validated the savings against AWS Cost Explorer after 30 days.&lt;/p&gt;

&lt;p&gt;The original security scanner request, cross-account access from nonprod to dev, was live the same day. The compliance team had a single GuardDuty and Security Hub dashboard the same week.&lt;/p&gt;

&lt;p&gt;More importantly: adding a new workload account to this architecture now takes about an hour. Create the VPC, deploy the vpc-attachment module pointing at the documented hub TGW ID, invite the new account as a GuardDuty and Security Hub member, apply the security-baseline with &lt;code&gt;enable_guardduty = false&lt;/code&gt;. Every new account inherits the full inspection and security posture without any per-account configuration. That's the actual value of a hub-spoke design: not the one-time cost savings, but the fact that account seven is as well-secured and as easy to audit as account two.&lt;/p&gt;

&lt;p&gt;Working through a multi-account network redesign, or building the inspection layer on top of an existing Transit Gateway setup? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt;; this is the kind of platform architecture I work on regularly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>networkfirewall</category>
      <category>networking</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Stop Managing EKS Add-ons by Hand</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 11 Mar 2026 02:20:36 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-1cc4</link>
      <guid>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-1cc4</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/eks-addons-terraform/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I was preparing to upgrade a production EKS cluster to version 1.32 when I discovered a problem.&lt;/p&gt;

&lt;p&gt;Four of our core cluster components (VPC CNI, CoreDNS, kube-proxy, and Metrics Server) were all running versions incompatible with EKS 1.32. I needed to update them before upgrading.&lt;/p&gt;

&lt;p&gt;And I had no easy way to do it.&lt;/p&gt;

&lt;p&gt;VPC CNI, CoreDNS, and kube-proxy had been installed automatically when the cluster was created, running in "self-managed" mode. Metrics Server was installed with &lt;code&gt;kubectl apply -f metrics-server.yaml&lt;/code&gt; from some GitHub release page, months ago, by someone who is no longer on the team.&lt;/p&gt;

&lt;p&gt;No version pinning. No history of what changed or when. No way to test the upgrade before applying it to production.&lt;/p&gt;

&lt;p&gt;That's when I decided to stop managing EKS add-ons by hand.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Self-Managed Add-ons
&lt;/h2&gt;

&lt;p&gt;There are two categories of EKS add-ons, and most teams don't think about the distinction until they're stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-managed&lt;/strong&gt;: You're responsible for installation, updates, and compatibility. AWS won't help you troubleshoot them. When EKS releases a new version, you need to manually verify your add-ons still work, find compatible versions, and update them yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed&lt;/strong&gt;: AWS handles the lifecycle. Compatible versions are tested and published for each EKS release. AWS Support can troubleshoot them. Security patches are available without you tracking CVEs.&lt;/p&gt;

&lt;p&gt;If you created an EKS cluster without explicitly enabling managed add-ons, VPC CNI, CoreDNS, and kube-proxy are running in self-managed mode right now.&lt;/p&gt;

&lt;p&gt;The fix is straightforward: migrate them to EKS-managed. But if you're also running kubectl-installed tools like Metrics Server, you have a second problem: those aren't managed by anything at all.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: One Terraform Module for All Six Add-ons
&lt;/h2&gt;

&lt;p&gt;I built a single &lt;code&gt;eks-addons&lt;/code&gt; Terraform module that manages everything:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed (4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC CNI: pod networking&lt;/li&gt;
&lt;li&gt;EBS CSI Driver: persistent volumes (added this one while I was at it)&lt;/li&gt;
&lt;li&gt;CoreDNS: DNS resolution&lt;/li&gt;
&lt;li&gt;kube-proxy: network proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Helm-managed (2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics Server: resource metrics for &lt;code&gt;kubectl top&lt;/code&gt; and HPA&lt;/li&gt;
&lt;li&gt;Reloader: auto-restart pods when ConfigMaps or Secrets change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why one module instead of six separate ones? All of these share the same dependency: the EKS cluster. Consolidating them means one &lt;code&gt;terragrunt apply&lt;/code&gt; deploys everything, one &lt;code&gt;terraform plan&lt;/code&gt; shows drift across all add-ons, and one PR updates any version.&lt;/p&gt;

&lt;p&gt;The core Terraform for an EKS-managed add-on is minimal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_addon"&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cni"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_vpc_cni&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;addon_name&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-cni"&lt;/span&gt;
  &lt;span class="nx"&gt;addon_version&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cni_version&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_create&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_update&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;preserve&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth explaining:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;resolve_conflicts_on_create&lt;/code&gt; and &lt;code&gt;resolve_conflicts_on_update&lt;/code&gt; set to &lt;code&gt;"OVERWRITE"&lt;/code&gt; tell Terraform it's the source of truth. Any manual changes in the cluster get overwritten on the next apply. This is what you want.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preserve = true&lt;/code&gt; means if you remove the resource from Terraform, the add-on stays in the cluster. It's a safety net during refactoring: you won't accidentally delete a running add-on.&lt;/p&gt;
&lt;h2&gt;
  
  
  EBS CSI Driver Needs an IAM Role
&lt;/h2&gt;

&lt;p&gt;The EBS CSI Driver is the one add-on that requires extra work: it needs IAM permissions to create and attach EBS volumes. The right way to handle this is IRSA (IAM Roles for Service Accounts).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-ebs-csi-driver"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"system:serviceaccount:kube-system:ebs-csi-controller-sa"&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:aud"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ebs_csi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail. IRSA is the correct pattern for any workload that needs to call AWS APIs from inside Kubernetes.&lt;/p&gt;
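
&lt;p&gt;For completeness, the role gets handed to the driver through the managed add-on itself. A minimal sketch, assuming an &lt;code&gt;aws_eks_addon&lt;/code&gt; resource and &lt;code&gt;cluster_name&lt;/code&gt; / &lt;code&gt;ebs_csi_version&lt;/code&gt; variables not shown above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_addon" "ebs_csi" {
  count         = var.enable_ebs_csi ? 1 : 0
  cluster_name  = var.cluster_name
  addon_name    = "aws-ebs-csi-driver"
  addon_version = var.ebs_csi_version

  # Hands the IRSA role to the driver's service account
  service_account_role_arn = aws_iam_role.ebs_csi[0].arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
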
&lt;h2&gt;
  
  
  Migrating Metrics Server from kubectl to Helm
&lt;/h2&gt;

&lt;p&gt;This is the one step that requires manual cleanup before Terraform can take over.&lt;/p&gt;

&lt;p&gt;The existing kubectl-installed Metrics Server needs to go first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete deployment metrics-server -n kube-system
kubectl delete service metrics-server -n kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then Terraform installs the Helm-managed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"metrics_server"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_metrics_server&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://kubernetes-sigs.github.io/metrics-server/"&lt;/span&gt;
  &lt;span class="nx"&gt;chart&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metrics_server_chart_version&lt;/span&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kube-system"&lt;/span&gt;

  &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;yamlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;replicas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-preferred-address-types=InternalIP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-insecure-tls"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;podDisruptionBudget&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;enabled&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;minAvailable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected downtime: 2-3 minutes. Only the metrics API is unavailable during the transition, so &lt;code&gt;kubectl top&lt;/code&gt; fails and HPA scaling decisions pause briefly. Running applications are not affected.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying It
&lt;/h2&gt;

&lt;p&gt;One thing that bit me: CI/CD doesn't pick up module changes automatically.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions workflow detects changes by looking for modified &lt;code&gt;terragrunt.hcl&lt;/code&gt; files. When I changed files in &lt;code&gt;common/modules/eks-addons/&lt;/code&gt;, the workflow triggered but found no stacks to deploy (no &lt;code&gt;terragrunt.hcl&lt;/code&gt; changed), so nothing ran.&lt;/p&gt;

&lt;p&gt;Module changes require a manual deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan   # Review: should show ~10 resources to add
terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
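

&lt;p&gt;To avoid the manual step next time, the workflow trigger can also watch the module tree. A sketch, assuming the workflow uses a push &lt;code&gt;paths&lt;/code&gt; filter (your detection logic may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  push:
    paths:
      - '**/terragrunt.hcl'
      - 'common/modules/**'   # rerun when shared modules change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mapping a module change back to the stacks that consume it still needs custom logic; the filter only guarantees the workflow runs at all.&lt;/p&gt;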



&lt;p&gt;After apply, verify everything is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check EKS-managed add-on status
for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
  aws eks describe-addon --cluster-name &amp;lt;cluster&amp;gt; --addon-name $addon \
    --query 'addon.[addonName,status]' --output text
done
# All should show: ACTIVE

# Verify Metrics Server
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before: four add-ons running in self-managed mode, one installed by kubectl, no version history, no drift detection.&lt;/p&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All six add-ons defined in code with pinned versions&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;terraform plan&lt;/code&gt; shows immediately if anything drifts from the declared state&lt;/li&gt;
  &lt;li&gt;Rollback is &lt;code&gt;git revert&lt;/code&gt; + &lt;code&gt;terragrunt apply&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;EKS cluster upgrade checklist is now: update four version strings in the Terragrunt config, open a PR, done&lt;/li&gt;
&lt;/ul&gt;
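
&lt;p&gt;The rollback path is worth spelling out. A sketch, assuming the bad change landed as a merge commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git revert -m 1 &amp;lt;merge-commit&amp;gt;   # plain git revert for a squash merge
cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt plan    # should show only the reverted versions changing
terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;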

&lt;p&gt;The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.&lt;/p&gt;

&lt;p&gt;Running into EKS add-on management problems? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt;; this is the kind of operational work I do for platform teams.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>infrastructureascode</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
