<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Glenn Gray</title>
    <description>The latest articles on Forem by Glenn Gray (@tallgray1).</description>
    <link>https://forem.com/tallgray1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817657%2F22cc7f4e-c345-484f-89b0-07068c02c9c7.png</url>
      <title>Forem: Glenn Gray</title>
      <link>https://forem.com/tallgray1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tallgray1"/>
    <language>en</language>
    <item>
      <title>Building Automated AWS Permission Testing Infrastructure for CI/CD</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sun, 05 Apr 2026 16:53:43 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-414o</link>
      <guid>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-414o</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/aws-permission-testing-cicd/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I deployed a permission set for our data engineers five times before it worked correctly.&lt;/p&gt;

&lt;p&gt;The first deployment: S3 reads worked, Glue Data Catalog reads worked. Athena queries failed — the query engine needs KMS decrypt through a service principal, and I'd missed the &lt;code&gt;kms:ViaService&lt;/code&gt; condition. Second deployment: Athena worked. EMR Serverless job submission failed — missing &lt;code&gt;iam:PassRole&lt;/code&gt;. Third deployment: EMR submission worked. Job execution failed — missing permissions on the EMR Serverless execution role boundary. I kept deploying, engineers kept getting blocked, I kept opening tickets.&lt;/p&gt;

&lt;p&gt;Five iterations. Two weeks. Every failure meant a data engineer opened a ticket instead of running their job.&lt;/p&gt;

&lt;p&gt;The problem wasn't that IAM is complicated — it is, but that's expected. The problem was that I had no way to catch these issues before deploying to the account where real engineers were trying to do real work. Every bug was a production bug.&lt;/p&gt;

&lt;h2&gt;The "Access Denied" Debugging Loop&lt;/h2&gt;

&lt;p&gt;Here's what the reactive debugging cycle looks like from the inside.&lt;/p&gt;

&lt;p&gt;Engineer opens a ticket: &lt;code&gt;AccessDeniedException: User is not authorized to perform: s3:GetObject&lt;/code&gt;. I add &lt;code&gt;s3:GetObject&lt;/code&gt; to the permission set. Next day: &lt;code&gt;AccessDeniedException: s3:PutObject&lt;/code&gt;. I add &lt;code&gt;s3:PutObject&lt;/code&gt;. Day after: write succeeds but cleanup fails — &lt;code&gt;s3:DeleteObject&lt;/code&gt;. At this point I've done four deployment cycles and two days of work to get S3 read/write/delete working. If I'd just added &lt;code&gt;s3:*&lt;/code&gt; I'd be done, but that violates least-privilege and opens the raw zone to write access, which we explicitly don't want.&lt;/p&gt;

&lt;p&gt;The deeper issue is that individual services don't fail atomically. Athena requires &lt;code&gt;athena:StartQueryExecution&lt;/code&gt; and &lt;code&gt;athena:GetQueryResults&lt;/code&gt; and &lt;code&gt;athena:GetQueryExecution&lt;/code&gt;, but it also requires KMS decrypt through the Athena service principal to read encrypted S3 results. That last piece isn't in the Athena docs — you find it by failing in production.&lt;/p&gt;

&lt;p&gt;I wanted a way to find it before deploying.&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;The testing framework has four components: per-persona permission set templates, a Bash test library, per-service test scripts, and a GitHub Actions workflow that runs everything on pull requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-permission-testing-cicd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-permission-testing-cicd.png" alt="Permission testing CI/CD architecture: GitHub Pull Request triggers a GitHub Actions CI/CD Workflow, which fans out to S3 Tests, Glue Tests, and Athena Tests in parallel, with all results aggregated into a Test Report posted back to the PR" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow triggers on any pull request that modifies the identity-center Terraform directory. Tests run against real AWS accounts — dev and nonprod — using test credentials provisioned for that purpose. Results post as a PR comment before anyone approves the change.&lt;/p&gt;

&lt;h2&gt;Phase 1: Pre-Validated Templates&lt;/h2&gt;

&lt;p&gt;Before I wrote a single test, I needed a starting point for permission sets that captured the patterns I'd learned the hard way. Templates that handle the non-obvious pieces — zone-scoped S3 access, KMS conditions tied to specific services, explicit denies for destructive operations.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AnalystAccess&lt;/code&gt; template is representative. Analysts get read-only access to the curated zone of the data lake, Athena query execution in the primary workgroup, and KMS decrypt — but only when the decrypt request originates from S3 or Athena, not from arbitrary API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;inline_policy&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
  &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GlueCatalogReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetPartitions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:SearchTables"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:database/curated_*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:table/curated_*/*"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3CuratedReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*/curated/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"s3:prefix"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curated/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AthenaQueryExecution"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"athena:StartQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryResults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:StopQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:athena:*:*:workgroup/primary"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMSDecryptViaSvc"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:*:*:key/*"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"kms:ViaService"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DenyDestructiveOps"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Deny"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:DeleteObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:DeleteBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteTable"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kms:ViaService&lt;/code&gt; condition is the piece that took five production failures to discover. KMS decrypt without that condition allows an analyst to call &lt;code&gt;kms:Decrypt&lt;/code&gt; directly from their shell, which is not what we want. The condition locks decrypt to requests that pass through S3 or Athena specifically.&lt;/p&gt;

&lt;p&gt;The explicit deny block matters too. Without it, if someone later grants broader S3 permissions to this persona for a different reason, the curated zone protection evaporates. The deny creates a hard floor regardless of what else gets added.&lt;/p&gt;
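You can probe that hard floor without a deployment: the IAM policy simulator evaluates a principal's policies server-side and reports the decision per action. A sketch of the check (the role and object ARNs are placeholders, and the `aws` shell function at the top is an offline stub standing in for the real CLI so the snippet runs without credentials):

```shell
#!/usr/bin/env bash
# Offline stub: echoes the decision we expect so the sketch runs without AWS
# credentials. In a real session, delete this function and use the actual CLI.
aws() { echo "explicitDeny"; }

# Ask the simulator how the AnalystAccess policy evaluates a destructive call.
# The role and resource ARNs below are placeholders.
decision=$(aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::123456789012:role/AnalystAccess" \
  --action-names "s3:DeleteObject" \
  --resource-arns "arn:aws:s3:::lake-bucket-dev/curated/report.parquet" \
  --query 'EvaluationResults[0].EvalDecision' --output text)
echo "s3:DeleteObject -> $decision"
```

One caveat: the simulator evaluates identity policies only, so it cannot exercise service-to-service paths like `kms:ViaService`; that gap is exactly what the live CI tests cover.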

&lt;h2&gt;Phase 2: The Test Framework&lt;/h2&gt;

&lt;p&gt;I chose Bash over Python or a proper test framework deliberately. The tests run in CI with no dependencies beyond the AWS CLI — no package installs, no virtual environments, no version pinning of test libraries. The machines running these tests already have the AWS CLI.&lt;/p&gt;

&lt;p&gt;The core library in &lt;code&gt;lib/test-framework.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;declare&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nv"&gt;TESTS_PASSED&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
&lt;span class="nb"&gt;declare&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nv"&gt;TESTS_FAILED&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;

run_test&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;test_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;test_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$3&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$test_command&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;TESTS_PASSED+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$test_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  ✅ PASS: &lt;/span&gt;&lt;span class="nv"&gt;$test_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;TESTS_FAILED+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$test_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  ❌ FAIL: &lt;/span&gt;&lt;span class="nv"&gt;$test_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

generate_text_report&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Total: &lt;/span&gt;&lt;span class="k"&gt;$((${#&lt;/span&gt;&lt;span class="nv"&gt;TESTS_PASSED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;TESTS_FAILED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}))&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Passed: &lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;TESTS_PASSED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Failed: &lt;/span&gt;&lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;TESTS_FAILED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;TESTS_FAILED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'  - %s\n'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TESTS_FAILED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
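One gap worth closing for CI: as shown, the library prints a report but never sets an exit status, and the job only goes red on a nonzero exit. A small wrapper, my addition rather than part of the original library, handles it:

```shell
# Report, then return nonzero when any test failed so the CI job itself is
# marked failed. TESTS_FAILED is the array run_test populates.
declare -a TESTS_FAILED=()

finish() {
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ "${#TESTS_FAILED[@]}" -eq 0 ]
}
```

Calling `finish` as the last line of each test script is enough; `set -e` alone would not help, because `run_test` deliberately swallows command failures.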



&lt;p&gt;The most important design decision in the test scripts is testing denials as carefully as allowances. Testing only what should succeed tells you the permission set isn't obviously broken. Testing what should fail tells you it's not accidentally too permissive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test what should succeed&lt;/span&gt;
run_test &lt;span class="s2"&gt;"s3-list-curated"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"aws s3 ls s3://lake-bucket-dev/curated/"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Analyst can list curated zone"&lt;/span&gt;

&lt;span class="c"&gt;# Test what should fail (negative test)&lt;/span&gt;
run_test &lt;span class="s2"&gt;"s3-write-denied"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"! aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Analyst cannot write to curated zone"&lt;/span&gt;

run_test &lt;span class="s2"&gt;"s3-raw-zone-denied"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"! aws s3 ls s3://lake-bucket-dev/raw/ 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Analyst cannot access raw zone"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beyond service-level tests, I run persona tests that simulate end-to-end workflows. An analyst's workflow isn't "call S3, then call Athena separately" — it's "run an Athena query that reads encrypted S3 data and writes results to the query results bucket." That integration test catches failures that individual service tests miss. The original five-iteration DataPlatformAccess failure? An individual S3 test would have passed. A persona test running an actual Athena query against the encrypted lake would have caught the KMS gap.&lt;/p&gt;
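Sketched in the same Bash style, a persona test looks roughly like this (the database name, workgroup, and polling numbers are illustrative, not taken from the original scripts):

```shell
# End-to-end persona check: submit an Athena query as the persona under test
# and poll it to completion. A KMS gap surfaces here as a FAILED query state
# rather than an immediate API error.
persona_test_athena_roundtrip() {
  local qid state
  qid=$(aws athena start-query-execution \
    --query-string "SELECT COUNT(*) FROM curated_sales.orders" \
    --work-group primary \
    --query 'QueryExecutionId' --output text) || return 1
  for _ in $(seq 1 30); do
    state=$(aws athena get-query-execution --query-execution-id "$qid" \
      --query 'QueryExecution.Status.State' --output text)
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|CANCELLED) return 1 ;;
    esac
    sleep 2
  done
  return 1  # timed out
}
```

This assumes the primary workgroup has a query results location configured; if not, pass `--result-configuration` explicitly.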

&lt;h2&gt;Phase 3: CI/CD Integration&lt;/h2&gt;

&lt;p&gt;The GitHub Actions workflow triggers on pull requests that touch the identity-center Terraform directory, runs tests in a matrix against dev and nonprod, and posts a summary comment to the PR.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;common/modules/identity-center/**/*.tf'&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test-permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;workloads-dev&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;workloads-nonprod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::${{ matrix.environment.account }}:role/github-actions-role&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./scripts/test-permissions/run-permission-tests.sh --persona analyst&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id-token: write&lt;/code&gt; permission is required for OIDC authentication to AWS — the workflow assumes a role in each account rather than using long-lived credentials in GitHub Secrets. This is the right pattern: credentials rotate automatically, and there's no secret to rotate manually or accidentally expose.&lt;/p&gt;
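For completeness, the AWS side of that exchange is a trust policy on `github-actions-role` that accepts GitHub's OIDC provider. A hedged sketch, with the account ID and `my-org/infra-repo` as placeholders since the original values are not shown:

```shell
# Trust policy accepting GitHub's OIDC provider, restricted to one repository.
# The account ID and org/repo below are placeholders.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
      "StringLike":   { "token.actions.githubusercontent.com:sub": "repo:my-org/infra-repo:*" }
    }
  }]
}
EOF
```

The `sub` condition is the piece that stops any other repository from assuming the role; attach the document with `aws iam update-assume-role-policy`.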

&lt;p&gt;The PR comment posts the full test output with pass/fail counts per persona per account. A reviewer can look at the comment and immediately see whether the permission change has test coverage and whether the tests pass.&lt;/p&gt;
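The posting step itself can be small; a sketch of how I'd wire it (the `gh` CLI ships on GitHub-hosted runners, `generate_text_report` is the library function shown earlier, and the `PR_NUMBER` variable name is an assumption about the workflow context):

```shell
# Assemble a Markdown report and post it to the PR via the GitHub CLI.
# Relies on GH_TOKEN and PR_NUMBER being provided by the workflow.
post_report() {
  local environment="${1:?usage: post_report <environment>}"
  {
    echo "## Permission test results: ${environment}"
    generate_text_report
  } > report.md
  gh pr comment "${PR_NUMBER:?PR_NUMBER not set}" --body-file report.md
}
```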

&lt;h2&gt;Three Things I Learned the Hard Way&lt;/h2&gt;

&lt;p&gt;First: test KMS decryption through each service separately. &lt;code&gt;kms:Decrypt&lt;/code&gt; via S3 and &lt;code&gt;kms:Decrypt&lt;/code&gt; via Athena are different IAM evaluation paths even though they're the same API call. A test that puts an object and gets it back via S3 directly won't catch a broken Athena KMS path.&lt;/p&gt;

&lt;p&gt;Second: negative tests matter as much as positive ones. Before I had the test framework, every permission set I wrote was tested only for what it should allow. I had no systematic check that it didn't allow more. The denial tests are what give security reviewers confidence.&lt;/p&gt;

&lt;p&gt;Third: persona tests catch failures that service tests miss. Individual service tests are fast to write and good for regression coverage, but they test permissions in isolation. Real workflows cross service boundaries. Build both.&lt;/p&gt;

&lt;h2&gt;What Changed&lt;/h2&gt;

&lt;p&gt;Before the framework: five iterations to get one permission set right, every iteration a production impact. After: 95% of permission issues caught at PR review time. Zero production impacts from permission bugs since we shipped it. The templates reduced new permission set creation time by about 70% — instead of starting from scratch with the IAM documentation, we start from a pre-validated base and modify from there.&lt;/p&gt;

&lt;p&gt;The time investment was about a week: two days for templates, two days for the test framework and scripts, one day for CI/CD integration and documentation. That investment paid back in the first sprint when the analyst permission set for a new hire went out correct on the first deployment.&lt;/p&gt;




&lt;p&gt;Running into IAM permission debugging loops on your team? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; — permission testing infrastructure is one of the first things I build when joining a new platform team.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>iam</category>
      <category>security</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Stop Managing EKS Add-ons by Hand</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sun, 05 Apr 2026 16:53:40 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-2a7o</link>
      <guid>https://forem.com/tallgray1/stop-managing-eks-add-ons-by-hand-2a7o</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/eks-addons-terraform/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I was preparing to upgrade a production EKS cluster to version 1.32 when I discovered a problem.&lt;/p&gt;

&lt;p&gt;Four of our core cluster components—VPC CNI, CoreDNS, kube-proxy, and Metrics Server—were all running versions incompatible with EKS 1.32. I needed to update them before upgrading.&lt;/p&gt;

&lt;p&gt;And I had no easy way to do it.&lt;/p&gt;

&lt;p&gt;VPC CNI, CoreDNS, and kube-proxy had been installed automatically when the cluster was created, running in "self-managed" mode. Metrics Server was installed with &lt;code&gt;kubectl apply -f metrics-server.yaml&lt;/code&gt; from some GitHub release page, months ago, by someone who is no longer on the team.&lt;/p&gt;

&lt;p&gt;No version pinning. No history of what changed or when. No way to test the upgrade before applying it to production.&lt;/p&gt;

&lt;p&gt;That's when I decided to stop managing EKS add-ons by hand.&lt;/p&gt;

&lt;h2&gt;The Problem with Self-Managed Add-ons&lt;/h2&gt;

&lt;p&gt;There are two categories of EKS add-ons, and most teams don't think about the distinction until they're stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-managed&lt;/strong&gt;: You're responsible for installation, updates, and compatibility. AWS won't help you troubleshoot them. When EKS releases a new version, you need to manually verify your add-ons still work, find compatible versions, and update them yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed&lt;/strong&gt;: AWS handles the lifecycle. Compatible versions are tested and published for each EKS release. AWS Support can troubleshoot them. Security patches are available without you tracking CVEs.&lt;/p&gt;

&lt;p&gt;If you created an EKS cluster without explicitly enabling managed add-ons, VPC CNI, CoreDNS, and kube-proxy are running in self-managed mode right now.&lt;/p&gt;

&lt;p&gt;The fix is straightforward—migrate them to EKS-managed. But if you're also running kubectl-installed tools like Metrics Server, you have a second problem: those aren't managed by anything at all.&lt;/p&gt;
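Auditing your own cluster takes one call: `aws eks list-addons` returns only the EKS-managed add-ons, so any core component missing from that list is self-managed. A sketch (the cluster name is a placeholder):

```shell
# Flag core components that are not under EKS management: list-addons returns
# only managed add-ons, so anything absent from it is running self-managed.
check_managed() {
  local cluster="$1" managed
  managed=$(aws eks list-addons --cluster-name "$cluster" \
    --query 'addons' --output text) || return 1
  for addon in vpc-cni coredns kube-proxy aws-ebs-csi-driver; do
    if grep -qw -- "$addon" <<<"$managed"; then
      echo "managed:      $addon"
    else
      echo "self-managed: $addon"
    fi
  done
}
```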

&lt;h2&gt;The Solution: One Terraform Module for All Six Add-ons&lt;/h2&gt;

&lt;p&gt;I built a single &lt;code&gt;eks-addons&lt;/code&gt; Terraform module that manages everything:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS-managed (4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC CNI — pod networking&lt;/li&gt;
&lt;li&gt;EBS CSI Driver — persistent volumes (added this one while I was at it)&lt;/li&gt;
&lt;li&gt;CoreDNS — DNS resolution&lt;/li&gt;
&lt;li&gt;kube-proxy — network proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Helm-managed (2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics Server — resource metrics for &lt;code&gt;kubectl top&lt;/code&gt; and HPA&lt;/li&gt;
&lt;li&gt;Reloader — auto-restart pods when ConfigMaps or Secrets change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why one module instead of six separate ones? All of these share the same dependency: the EKS cluster. Consolidating them means one &lt;code&gt;terragrunt apply&lt;/code&gt; deploys everything, one &lt;code&gt;terraform plan&lt;/code&gt; shows drift across all add-ons, and one PR updates any version.&lt;/p&gt;

&lt;p&gt;The core Terraform for an EKS-managed add-on is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_addon"&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cni"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_vpc_cni&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;addon_name&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-cni"&lt;/span&gt;
  &lt;span class="nx"&gt;addon_version&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cni_version&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_create&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;resolve_conflicts_on_update&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE"&lt;/span&gt;
  &lt;span class="nx"&gt;preserve&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth explaining:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;resolve_conflicts_on_create&lt;/code&gt; and &lt;code&gt;resolve_conflicts_on_update&lt;/code&gt; set to &lt;code&gt;"OVERWRITE"&lt;/code&gt; tell Terraform it's the source of truth: any manual changes in the cluster get overwritten on the next apply. This is what you want.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preserve = true&lt;/code&gt; means if you remove the resource from Terraform, the add-on stays in the cluster. Safety net during refactoring—you won't accidentally delete a running add-on.&lt;/p&gt;
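Where do the `addon_version` values come from? `aws eks describe-addon-versions` lists the versions published for a given Kubernetes version. A sketch of the lookup I run before bumping a pin (the JMESPath takes the first entry, which in practice is the newest, but verify before trusting it for production):

```shell
# Look up a published add-on version for a target Kubernetes version.
# The first entry is typically the newest; treat it as a starting point.
latest_addon_version() {
  local addon="$1" k8s_version="$2"
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s_version" \
    --query 'addons[0].addonVersions[0].addonVersion' \
    --output text
}
```

Usage: `latest_addon_version vpc-cni 1.32` prints a version string such as `v1.19.0-eksbuild.1`.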

&lt;h2&gt;
  
  
  EBS CSI Driver Needs an IAM Role
&lt;/h2&gt;

&lt;p&gt;The EBS CSI Driver is the one add-on that requires extra work: it needs IAM permissions to create and attach EBS volumes. The right way to handle this is IRSA (IAM Roles for Service Accounts).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-ebs-csi-driver"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"system:serviceaccount:kube-system:ebs-csi-controller-sa"&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:aud"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_ebs_csi&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ebs_csi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail. IRSA is the correct pattern for any AWS service that needs to call AWS APIs from inside Kubernetes.&lt;/p&gt;
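&lt;p&gt;The last piece is pointing the managed add-on at that role. A sketch, assuming variable names that mirror the vpc-cni example (&lt;code&gt;var.ebs_csi_version&lt;/code&gt; is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_eks_addon" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "aws-ebs-csi-driver"
  addon_version = var.ebs_csi_version

  # Hands the IRSA role to the driver's ebs-csi-controller-sa service account
  service_account_role_arn    = aws_iam_role.ebs_csi[0].arn
  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;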

&lt;h2&gt;
  
  
  Migrating Metrics Server from kubectl to Helm
&lt;/h2&gt;

&lt;p&gt;This is the one step that requires manual cleanup before Terraform can take over.&lt;/p&gt;

&lt;p&gt;The existing kubectl-installed Metrics Server needs to go first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete deployment metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
kubectl delete service metrics-server &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then Terraform installs the Helm-managed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"metrics_server"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_metrics_server&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://kubernetes-sigs.github.io/metrics-server/"&lt;/span&gt;
  &lt;span class="nx"&gt;chart&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"metrics-server"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metrics_server_chart_version&lt;/span&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kube-system"&lt;/span&gt;

  &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;yamlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;replicas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-preferred-address-types=InternalIP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"--kubelet-insecure-tls"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;podDisruptionBudget&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;enabled&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;minAvailable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected downtime: 2-3 minutes. Only &lt;code&gt;kubectl top&lt;/code&gt; is unavailable during the transition. Running applications are not affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying It
&lt;/h2&gt;

&lt;p&gt;One thing that bit me: CI/CD doesn't pick up module changes automatically.&lt;/p&gt;

&lt;p&gt;Our GitHub Actions workflow detects changes by looking for modified &lt;code&gt;terragrunt.hcl&lt;/code&gt; files. When I changed files in &lt;code&gt;common/modules/eks-addons/&lt;/code&gt;, the workflow triggered but found no stacks to deploy (no &lt;code&gt;terragrunt.hcl&lt;/code&gt; changed), so nothing ran.&lt;/p&gt;

&lt;p&gt;Module changes require a manual deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan   &lt;span class="c"&gt;# Review: should show ~10 resources to add&lt;/span&gt;
terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After apply, verify everything is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check EKS-managed add-on status&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;addon &lt;span class="k"&gt;in &lt;/span&gt;vpc-cni aws-ebs-csi-driver coredns kube-proxy&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;aws eks describe-addon &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; &amp;lt;cluster&amp;gt; &lt;span class="nt"&gt;--addon-name&lt;/span&gt; &lt;span class="nv"&gt;$addon&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'addon.[addonName,status]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="c"&gt;# All should show: ACTIVE&lt;/span&gt;

&lt;span class="c"&gt;# Verify Metrics Server&lt;/span&gt;
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before: four add-ons running in self-managed mode, one installed by kubectl, no version history, no drift detection.&lt;/p&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All six add-ons defined in code with pinned versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt; shows immediately if anything drifts from the declared state&lt;/li&gt;
&lt;li&gt;Rollback is &lt;code&gt;git revert&lt;/code&gt; + &lt;code&gt;terragrunt apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;EKS cluster upgrade checklist is now: update four version strings in the Terragrunt config, open a PR, done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.&lt;/p&gt;




&lt;p&gt;Running into EKS add-on management problems? &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt;—this is the kind of operational work I do for platform teams.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>terragrunt</category>
    </item>
    <item>
      <title>Building Apache Iceberg Lakehouse Storage with S3 Table Buckets</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 23:37:02 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-3jj8</link>
      <guid>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-3jj8</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.&lt;/p&gt;

&lt;p&gt;The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.&lt;/p&gt;

&lt;p&gt;We built the storage layer as a three-zone medallion architecture, fully managed with Terraform, with Intelligent-Tiering configured from day one. Here's how we did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;Three zones, each with a clear contract about what data lives there and who owns it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-apache-iceberg-medallion.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-apache-iceberg-medallion.png" alt="Medallion architecture — three-zone lakehouse: Source Systems flow into Raw Zone (immutable landing), then ETL into Clean Zone (normalized), then aggregation into Curated Zone (analytics-ready), consumed by Athena, QuickSight, and Tableau" width="552" height="1404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, owned by data engineering. Curated is the analytics layer that BI tools, Athena queries, and QuickSight dashboards read from.&lt;/p&gt;

&lt;p&gt;The naming convention we landed on was &lt;code&gt;{zone}_{domain}&lt;/code&gt; for Glue databases — &lt;code&gt;raw_crm&lt;/code&gt;, &lt;code&gt;clean_customer&lt;/code&gt;, &lt;code&gt;curated_sales_metrics&lt;/code&gt;. It looks minor, but it matters. When you're looking at a table in Athena or debugging a failed Glue job, the database name tells you exactly what tier you're in and what domain you're touching. Namespace collisions become impossible because the zone prefix scopes every domain. Data lineage is readable from table names alone.&lt;/p&gt;
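&lt;p&gt;The convention is also easy to enforce in code. A minimal sketch (not our module verbatim) that derives all twelve database names from the zone and domain lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;locals {
  zones   = ["raw", "clean", "curated"]
  domains = ["crm", "customer", "sales", "operations"]

  # Cartesian product: "raw_crm", "raw_customer", ..., "curated_operations"
  database_names = [
    for pair in setproduct(local.zones, local.domains) : "${pair[0]}_${pair[1]}"
  ]
}

resource "aws_glue_catalog_database" "zone_domain" {
  for_each = toset(local.database_names)
  name     = each.value
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;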

&lt;h2&gt;
  
  
  Why Two Modules Instead of One
&lt;/h2&gt;

&lt;p&gt;The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.&lt;/p&gt;

&lt;p&gt;The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue DataBrew for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.&lt;/p&gt;

&lt;p&gt;The KMS module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kms-key/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_key"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;
  &lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enable IAM User Permissions"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kms:*"&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow Service Access"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_principals&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:GenerateDataKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:CreateGrant"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
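&lt;p&gt;The module's whole contract with its consumers is a single output, something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# kms-key/outputs.tf
output "key_arn" {
  description = "ARN of the lake KMS key, consumed by downstream Terragrunt stacks"
  value       = aws_kms_key.this.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;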



&lt;p&gt;The &lt;code&gt;service_principals&lt;/code&gt; variable takes a list of service principal strings — &lt;code&gt;["athena.amazonaws.com", "glue.amazonaws.com"]&lt;/code&gt; and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.&lt;/p&gt;
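&lt;p&gt;For illustration, the Terragrunt side might look like the following. It's a sketch: the exact principal strings are assumptions worth verifying against AWS documentation for each service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# kms-key/terragrunt.hcl (principal names are illustrative)
inputs = {
  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "elasticmapreduce.amazonaws.com",
    "kinesis.amazonaws.com",
    "airflow.amazonaws.com",
  ]
  # Granting a new service key access is one more line in this list.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;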

&lt;h2&gt;
  
  
  The S3 Table Bucket Module
&lt;/h2&gt;

&lt;p&gt;The table bucket itself is straightforward. The interesting part is Intelligent-Tiering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# s3-table-bucket/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3tables_table_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket_intelligent_tiering_configuration"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_intelligent_tiering&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3tables_table_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EntireBucket"&lt;/span&gt;

  &lt;span class="nx"&gt;tiering&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;access_tier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ARCHIVE_ACCESS"&lt;/span&gt;
    &lt;span class="nx"&gt;days&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;tiering&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;access_tier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DEEP_ARCHIVE_ACCESS"&lt;/span&gt;
    &lt;span class="nx"&gt;days&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We enable Intelligent-Tiering on the entire bucket from the start. The 90-day threshold for Archive Access and 180-day threshold for Deep Archive weren't arbitrary — they match the typical access patterns for a data lake: raw data is queried heavily during initial load and validation, then access drops off sharply once the clean layer is populated.&lt;/p&gt;

&lt;p&gt;The reason Intelligent-Tiering beats manual lifecycle policies here is subtle but important. A manual lifecycle policy moves data based on age. Intelligent-Tiering moves data based on actual access patterns. If a dataset from eight months ago suddenly becomes relevant for a compliance audit, Intelligent-Tiering keeps it in a more accessible tier automatically. A manual policy would have moved it to Deep Archive on day 180 regardless. For a data lake, where access patterns are genuinely unpredictable, letting AWS monitor actual usage is worth the small monitoring fee.&lt;/p&gt;

&lt;p&gt;The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lake-storage/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms-key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"company-lake-${local.environment}"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;We provisioned 12 Glue databases across the three zones — four domains per zone (CRM, customer, sales, operations). The Terraform for each database includes the Iceberg metadata parameters that set the Iceberg table format as the default for tables created in that database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_catalog_database"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"raw_crm"&lt;/span&gt;
  &lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"iceberg_enabled"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
    &lt;span class="s2"&gt;"table_type"&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ICEBERG"&lt;/span&gt;
    &lt;span class="s2"&gt;"format-version"&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Format version 2 is the current Iceberg spec. It unlocks row-level deletes, which is required for GDPR compliance — when a user requests deletion, you can execute a targeted delete on the Iceberg table rather than rewriting entire Parquet partitions.&lt;/p&gt;

&lt;p&gt;One thing that's easy to miss: Glue databases with Iceberg parameters set don't automatically create Iceberg tables. The database parameters act as defaults and metadata; actual table creation still happens via your ETL tooling (Glue jobs, Spark, Flink). What you get from Terraform is the catalog structure and the governance layer — databases, permissions, encryption settings — so that when the data engineering team writes their first Glue job, the infrastructure is already in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Model
&lt;/h2&gt;

&lt;p&gt;When I presented this to the platform team's tech lead, the cost projection was what turned a "nice to have" into a "let's do this now."&lt;/p&gt;

&lt;p&gt;For a 100TB lake:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timeframe&lt;/th&gt;
&lt;th&gt;Storage Tier&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First 90 days&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;~$2,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After 90 days&lt;/td&gt;
&lt;td&gt;Archive Access&lt;/td&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After 180 days&lt;/td&gt;
&lt;td&gt;Deep Archive&lt;/td&gt;
&lt;td&gt;~$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 80% savings once the bulk of the data ages past 90 days, and 95% savings at 180 days. The Intelligent-Tiering monitoring cost is $0.0025 per 1,000 objects — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. Negligible.&lt;/p&gt;

&lt;p&gt;The S3 Table Bucket metadata performance improvement compounds this. Faster query planning means less Athena scan time, which means lower query costs and faster results for analysts. The platform pays for itself in reduced query costs as the data volume grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Sequence
&lt;/h2&gt;

&lt;p&gt;The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before Glue (catalog databases reference the bucket location).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-apache-iceberg-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-apache-iceberg-pipeline.png" alt="Deployment sequence: KMS Key must be created first, then S3 Table Bucket (which uses the key ARN), then Glue Data Catalog" width="800" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was Terragrunt apply time — the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.&lt;/p&gt;

&lt;p&gt;One deployment note: the first time you run &lt;code&gt;terragrunt plan&lt;/code&gt; on the Glue module in an account that hasn't had Glue configured before, you'll get an error about the Glue service-linked role not existing. Fix it by running &lt;code&gt;aws iam create-service-linked-role --aws-service-name glue.amazonaws.com&lt;/code&gt; before the apply. It only needs to happen once per account.&lt;/p&gt;
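&lt;p&gt;If you'd rather not depend on a one-off CLI step, the service-linked role can be managed in Terraform too. A sketch; if the role already exists in the account, import it instead of creating it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_iam_service_linked_role" "glue" {
  aws_service_name = "glue.amazonaws.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;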

&lt;h2&gt;
  
  
  What the Data Team Inherited
&lt;/h2&gt;

&lt;p&gt;When we handed this over to the data engineering team, they had a fully provisioned catalog — three zones, twelve databases, Iceberg metadata configured, encryption enabled, Intelligent-Tiering active. They could start writing Glue jobs and creating tables immediately without worrying about storage configuration, access patterns, or cost optimization after the fact.&lt;/p&gt;

&lt;p&gt;The Terraform modules are reusable. Adding a new domain (say, a &lt;code&gt;finance&lt;/code&gt; domain across all three zones) is three database resource declarations and one pull request. The KMS key, bucket, and Intelligent-Tiering configuration don't change.&lt;/p&gt;
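&lt;p&gt;Sketched out, the &lt;code&gt;finance&lt;/code&gt; addition is just this (the Iceberg parameters are elided; they mirror the &lt;code&gt;raw_crm&lt;/code&gt; example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_glue_catalog_database" "raw_finance" {
  name = "raw_finance"
}

resource "aws_glue_catalog_database" "clean_finance" {
  name = "clean_finance"
}

resource "aws_glue_catalog_database" "curated_finance" {
  name = "curated_finance"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;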

&lt;p&gt;S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains and the cost trajectory make a strong case for starting there rather than retrofitting later.&lt;/p&gt;




&lt;p&gt;Building out a data platform and figuring out the storage and catalog architecture? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>apacheiceberg</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Zero-Downtime AWS Transit Gateway Hub-Spoke Migration</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 23:36:57 +0000</pubDate>
      <link>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-36h</link>
      <guid>https://forem.com/tallgray1/zero-downtime-aws-transit-gateway-hub-spoke-migration-36h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/transit-gateway-hub-spoke-migration/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The request came from the security team: they needed network-level access from the nonprod account to the dev account so a vulnerability scanner could reach internal services. Simple enough on the surface. In practice, it exposed a gap we'd been living with for months — and forced us to fix the network architecture we'd been deferring.&lt;/p&gt;

&lt;p&gt;We had three standalone Transit Gateways: one in each workload account (dev, nonprod, and prod). Completely isolated from each other. No cross-account connectivity at all. The security scanner couldn't reach its targets, and bolting on point-to-point peering to fix it would have made the architecture worse.&lt;/p&gt;

&lt;p&gt;But the TGW isolation was only part of the problem. We also had no inspection of traffic crossing our network boundary. Egress from workload pods went straight to the internet with no filtering. Ingress came through per-account load balancers with no centralized enforcement point. As the platform scaled toward additional workload accounts, this pattern was going to get expensive and hard to reason about.&lt;/p&gt;

&lt;p&gt;So we didn't just fix the TGW. We rebuilt the network foundation: a centralized Inspection VPC with a Network Firewall inline, a single hub Transit Gateway shared across all accounts, and centralized security tooling (GuardDuty, CloudTrail, Security Hub) aggregated in a dedicated Security account. Two maintenance windows, a few weeks of module work, and the platform went from fragmented per-account networking to a coherent hub-spoke design with full traffic inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture We Were Replacing
&lt;/h2&gt;

&lt;p&gt;Before the migration, each workload account was self-contained. It had its own TGW, its own internet gateway, its own NAT gateways. Security tooling ran independently in each account with no aggregation. The management account had no single-pane visibility into what was happening across the environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-before.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-before.png" alt="Before: Three isolated workload accounts — each with its own IGW, NAT Gateway, and standalone Transit Gateway, no cross-account connectivity" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost of running this way was about $150/month in TGW charges plus duplicated NAT gateway charges in each account. Every new workload account would add another TGW, another set of NAT gateways, and another independent security configuration to keep in sync.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Target: Inspection VPC + Hub Transit Gateway
&lt;/h2&gt;

&lt;p&gt;The target was AWS Security Reference Architecture Pattern B: an Inspection VPC that sits between the internet and all workload VPCs. All internet traffic — ingress and egress — flows through this VPC and through a Network Firewall before reaching any workload account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-after.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-after.png" alt="After: Centralized hub with inline Network Firewall inspection — all traffic flows through the Infrastructure Account's Inspection VPC before reaching any workload" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Egress path: workload pod → TGW → Inspection VPC TGW subnets → Network Firewall → NAT Gateway → IGW → internet.&lt;/p&gt;

&lt;p&gt;Ingress path: internet → IGW → centralized ALB (public subnet) → Network Firewall → TGW → workload VPC → pod.&lt;/p&gt;

&lt;p&gt;Nothing crosses the network boundary without passing through the firewall. Workload accounts carry no internet-facing infrastructure at all — no IGW, no NAT gateways, no public load balancers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Module Changes
&lt;/h2&gt;

&lt;p&gt;All Terraform work happened before scheduling any maintenance. The goal was to reach a state where the migration itself was just running pre-staged plan files in a specific sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transit Gateway: add a conditional create flag
&lt;/h3&gt;

&lt;p&gt;The existing network module always created a TGW. We needed spoke accounts to declare the same module without spinning up their own gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"create_transit_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Whether to create a Transit Gateway (false for hub-spoke spokes)"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_transit_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tgw_description&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"transit_gateway_id"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_transit_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;aws_ec2_transit_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;default = true&lt;/code&gt; means existing configurations need no changes. The flag only flips to &lt;code&gt;false&lt;/code&gt; after the spoke attachment is confirmed working.&lt;/p&gt;

&lt;h3&gt;
  
  
  New module: vpc-attachment
&lt;/h3&gt;

&lt;p&gt;The vpc-attachment module handles the spoke side of the hub relationship: create the TGW attachment, associate it to the hub's route table, and add routes to every private route table in the spoke VPC pointing at the hub TGW.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway_vpc_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-hub-attachment"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ec2_transit_gateway_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_attachment_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ec2_transit_gateway_vpc_attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route"&lt;/span&gt; &lt;span class="s2"&gt;"to_hub_tgw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_route_table_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;destination_cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/8"&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;10.0.0.0/8&lt;/code&gt; supernet covers all workload and Inspection VPC CIDRs without maintaining per-prefix route entries. It also covers the Inspection VPC CIDR (&lt;code&gt;10.100.0.0/20&lt;/code&gt;) — that's how return traffic from the centralized ALB finds its way back to pods in workload VPCs.&lt;/p&gt;

&lt;p&gt;The Terragrunt config for a spoke account reads VPC details from the existing network dependency and hardcodes the hub TGW identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"network"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../network"&lt;/span&gt;
  &lt;span class="nx"&gt;mock_outputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-mockid"&lt;/span&gt;
    &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"subnet-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;private_route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"rtb-mock1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-xxxxx"&lt;/span&gt;   &lt;span class="c1"&gt;# hub TGW, documented in runbook&lt;/span&gt;
  &lt;span class="nx"&gt;transit_gateway_route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tgw-rtb-xxxxx"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We hardcoded the hub TGW and route table IDs rather than using cross-account data sources. The alternative — reading TGW details from the Infrastructure account at plan time — requires cross-account state access and adds complexity that isn't worth it for values that change maybe once in the platform's lifetime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hub route tables: workload isolation by default
&lt;/h3&gt;

&lt;p&gt;A key design decision: workload accounts should not route to each other directly. Dev should not reach nonprod; nonprod should not reach prod. The hub TGW enforces this through route table structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;default-association-rt&lt;/strong&gt;: all workload attachments associate here. The only route is &lt;code&gt;0.0.0.0/0 → inspection attachment&lt;/code&gt;. Workloads can reach the internet via the Inspection VPC, but cannot reach other workload VPCs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;default-propagation-rt&lt;/strong&gt;: the inspection attachment propagates workload CIDRs here for return traffic routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inter-account communication is opt-in: you add an explicit route table entry for a specific attachment pair. By default, the architecture prevents lateral movement across workload accounts.&lt;/p&gt;
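&lt;p&gt;A minimal sketch of that route table structure (resource names and the &lt;code&gt;hub&lt;/code&gt;, &lt;code&gt;inspection&lt;/code&gt;, and &lt;code&gt;workload&lt;/code&gt; references are assumptions about naming, not our exact module code):&lt;/p&gt;

```hcl
resource "aws_ec2_transit_gateway_route_table" "default_association" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-association-rt" }
}

resource "aws_ec2_transit_gateway_route_table" "default_propagation" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-propagation-rt" }
}

# The only route workloads see: everything goes to the inspection attachment.
resource "aws_ec2_transit_gateway_route" "to_inspection" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_association.id
}

# Workload CIDRs propagate into the table the inspection attachment uses,
# so return traffic can find its way back.
resource "aws_ec2_transit_gateway_route_table_propagation" "workloads" {
  for_each                       = aws_ec2_transit_gateway_vpc_attachment.workload
  transit_gateway_attachment_id  = each.value.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_propagation.id
}
```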

&lt;h3&gt;
  
  
  Inspection VPC subnet layout
&lt;/h3&gt;

&lt;p&gt;The Inspection VPC has three tiers with carefully constructed route tables that force traffic through the firewall in both directions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-subnets.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fwp-content%2Fuploads%2F2026%2F03%2Fdiag-subnets.png" alt="Inspection VPC subnet layout — three tiers (public, firewall, TGW) with asymmetric route tables that force all traffic through Network Firewall endpoints in both directions" width="800" height="1255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The asymmetric route table design ensures the firewall sees every packet crossing the network boundary, regardless of direction. Traffic entering from the internet hits the firewall before reaching workloads. Traffic from workloads hits the firewall before reaching the internet.&lt;/p&gt;
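&lt;p&gt;Reduced to a sketch, the route tables look like this. The variables stand in for values the real module derives from the firewall's &lt;code&gt;sync_states&lt;/code&gt; and its subnet resources:&lt;/p&gt;

```hcl
variable "firewall_endpoint_id" { type = string }    # per-AZ firewall VPC endpoint
variable "transit_gateway_id" { type = string }
variable "tgw_route_table_id" { type = string }
variable "public_route_table_id" { type = string }
variable "firewall_route_table_id" { type = string }

# TGW subnets: egress from workloads hits the firewall before anything else.
resource "aws_route" "tgw_egress" {
  route_table_id         = var.tgw_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = var.firewall_endpoint_id
}

# Public subnets: traffic bound for workload CIDRs also goes through
# the firewall rather than straight to the TGW.
resource "aws_route" "public_ingress" {
  route_table_id         = var.public_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  vpc_endpoint_id        = var.firewall_endpoint_id
}

# Firewall subnets: post-inspection, workload-bound traffic heads to the TGW.
resource "aws_route" "firewall_to_workloads" {
  route_table_id         = var.firewall_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}
```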

&lt;h3&gt;
  
  
  Security baseline: convert to delegated admin model
&lt;/h3&gt;

&lt;p&gt;GuardDuty and CloudTrail were running independently per account. We added &lt;code&gt;enable_guardduty&lt;/code&gt; and &lt;code&gt;enable_cloudtrail&lt;/code&gt; boolean variables to the security-baseline module so workload accounts could switch from standalone to member without touching the module invocation itself.&lt;/p&gt;

&lt;p&gt;In the Security account, we deployed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GuardDuty&lt;/strong&gt; as delegated admin with organization-level auto-enrollment. EKS Protection and S3 Protection enabled. All findings from all accounts visible in a single dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail&lt;/strong&gt; organization trail writing to a cross-account S3 bucket. Log file validation and KMS encryption enabled. Per-account trails archived after the cutover — not deleted, in case historical log formats differed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Hub&lt;/strong&gt; with CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices enabled across the full organization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 2: Two Maintenance Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window 1: Deploy the hub (~45 minutes, low risk)
&lt;/h3&gt;

&lt;p&gt;With no existing attachments and no workload traffic, deploying the hub infrastructure carried minimal risk. We applied the Infrastructure account TGW and Inspection VPC in a single window. The Network Firewall takes 5–10 minutes to reach READY state after creation — account for that in your timing.&lt;/p&gt;

&lt;p&gt;At the end of this window: hub TGW running, Inspection VPC active, Network Firewall endpoints healthy in both AZs, centralized ALB deployed. Nothing attached yet. We documented the TGW ID and route table IDs in the runbook before scheduling window 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 2: Spoke cutover (~2 hours)
&lt;/h3&gt;

&lt;p&gt;The key insight for keeping applications running: &lt;strong&gt;create the hub attachment before destroying the standalone TGW&lt;/strong&gt;. While both exist simultaneously, traffic continues flowing through the standalone path. The actual cutover is updating routes to point at the hub — that's a single &lt;code&gt;terragrunt apply&lt;/code&gt;, not the destruction of the old TGW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0 — Accept RAM share.&lt;/strong&gt; Infrastructure account shares the hub TGW via Resource Access Manager. Workload accounts accept the share invitation. Pure metadata operation; zero network impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+15 — Deploy VPC attachments.&lt;/strong&gt; Apply the &lt;code&gt;vpc-attachment&lt;/code&gt; module in each workload account. At this point each spoke VPC routes to both gateways: the existing, more specific routes point at the standalone TGW, and the new &lt;code&gt;10.0.0.0/8&lt;/code&gt; route points at the hub. Longest-prefix match keeps traffic flowing through the standalone path (a VPC route table won't accept two routes with the same destination CIDR, so the old routes are necessarily more specific than the &lt;code&gt;/8&lt;/code&gt;). Rollback at this stage is &lt;code&gt;terragrunt destroy&lt;/code&gt; on the attachment module — under five minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+30 — Verify routes and test cross-account connectivity.&lt;/strong&gt; Confirm hub routes are present in every private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-route-tables &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=vpc-id,Values=vpc-xxxxx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'RouteTables[*].Routes[?DestinationCidrBlock==`10.0.0.0/8`]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then test actual cross-account traffic: connect from a dev instance to a service in the nonprod VPC. The hub TGW and Inspection VPC should route it correctly. This also validates that the firewall rule groups are permitting expected traffic — catch any rule issues here, before cutting over production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+45 — Migrate security tooling.&lt;/strong&gt; Apply the updated security-baseline to each workload account. GuardDuty converts from standalone admin to member; findings flow to the Security account delegated admin. CloudTrail local trail disabled; organization trail confirmed logging events from the account. Zero network impact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify GuardDuty membership&lt;/span&gt;
aws guardduty get-administrator-account &lt;span class="nt"&gt;--detector-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;# Returns the Security account as administrator&lt;/span&gt;

&lt;span class="c"&gt;# Verify organization trail is capturing events&lt;/span&gt;
&lt;span class="c"&gt;# Make an API call, wait ~15 minutes, check the Security account's S3 bucket&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://&amp;lt;org-trail-bucket&amp;gt;/AWSLogs/&amp;lt;account-id&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;T+60 — Set &lt;code&gt;create_transit_gateway = false&lt;/code&gt; in each spoke.&lt;/strong&gt; This is the cutover. Run &lt;code&gt;terragrunt plan&lt;/code&gt; first and confirm it shows only the TGW and its attached resources being destroyed — nothing else. Apply dev first, watch the destruction complete, confirm application traffic is flowing through the hub. Then apply nonprod. About 3 minutes per account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+90 — Health checks and close.&lt;/strong&gt; Spot-check API endpoints, database connectivity, anything that traverses the network. Confirm egress traffic is hitting the firewall logs in the Infrastructure account. The maintenance window closed at the 90-minute mark; actual work was done by T+75. We kept the window open for the last 15 minutes as a buffer.&lt;/p&gt;

&lt;p&gt;The parallel attachment approach ensured there was never a moment where a workload account had no routing path. Even if the hub TGW had been misconfigured, traffic would have continued flowing through the standalone gateway until we chose to destroy it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Ended Up With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One TGW&lt;/strong&gt; in the Infrastructure account with three spoke attachments. Route tables that allow workload→internet traffic while preventing workload→workload lateral movement by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One Inspection VPC&lt;/strong&gt; with Network Firewall endpoints in two AZs. All egress inspected against stateful domain filter rules and stateless port rules. All ingress from the centralized ALB inspected. Firewall policy updates apply to all workload accounts simultaneously — no per-account changes needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One centralized ALB&lt;/strong&gt; in the Infrastructure account, routing to EKS target groups in workload accounts via cross-account IAM role assumption. Workload accounts carry no public-facing load balancers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One security console&lt;/strong&gt; in the Security account. GuardDuty findings from all accounts in a single dashboard. CloudTrail logs from every account in one S3 bucket. Security Hub compliance posture for the full organization visible in one place.&lt;/p&gt;

&lt;p&gt;Cost went from roughly $150–200/month (standalone TGWs, per-account NAT, independent security tooling) to approximately $50/month (single hub TGW plus attachment hours, shared NAT in the Inspection VPC, delegated security services). Cost savings validated against AWS Cost Explorer after 30 days.&lt;/p&gt;

&lt;p&gt;The original security scanner request — cross-account access from nonprod to dev — was live the same day. The compliance team had a single GuardDuty and Security Hub dashboard the same week.&lt;/p&gt;

&lt;p&gt;More importantly: adding a new workload account to this architecture now takes about an hour. Create the VPC, deploy the vpc-attachment module pointing at the documented hub TGW ID, invite the new account as a GuardDuty and Security Hub member, apply the security-baseline with &lt;code&gt;enable_guardduty = false&lt;/code&gt;. Every new account inherits the full inspection and security posture without any per-account configuration. That's the actual value of a hub-spoke design — not the one-time cost savings, but the fact that account seven is as well-secured and as easy to audit as account two.&lt;/p&gt;




&lt;p&gt;Working through a multi-account network redesign, or building the inspection layer on top of an existing Transit Gateway setup? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; — this is the kind of platform architecture I work on regularly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>transitgateway</category>
      <category>networking</category>
    </item>
    <item>
      <title>DNS Validation: From 15 Steps to Zero</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:30 +0000</pubDate>
      <link>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</link>
      <guid>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/dns-hell-to-automated/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You know what's the worst part of launching a new site?&lt;/p&gt;

&lt;p&gt;SSL certificate validation.&lt;/p&gt;

&lt;p&gt;Not creating the cert—that's one click in AWS ACM. It's the validation dance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS gives you a CNAME record: &lt;code&gt;_abc123extremely-long-string-here.graycloudarch.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The value is equally ridiculous: &lt;code&gt;_xyz789another-massive-string.acm-validations.aws.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You copy it (pray you don't miss a character)&lt;/li&gt;
&lt;li&gt;Switch to Cloudflare (or Route 53, or wherever)&lt;/li&gt;
&lt;li&gt;Paste it in&lt;/li&gt;
&lt;li&gt;Wait 5-10 minutes&lt;/li&gt;
&lt;li&gt;Refresh AWS console&lt;/li&gt;
&lt;li&gt;Still pending...&lt;/li&gt;
&lt;li&gt;Refresh again&lt;/li&gt;
&lt;li&gt;Finally validated!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now do it again for &lt;code&gt;www.graycloudarch.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And then repeat the whole thing for your second domain.&lt;/p&gt;

&lt;p&gt;This is "DNS hell."&lt;/p&gt;

&lt;h2&gt;
  
  
  There's a Better Way
&lt;/h2&gt;

&lt;p&gt;Terraform can read AWS validation records and create them in Cloudflare automatically.&lt;/p&gt;

&lt;p&gt;Zero copy-paste. Zero browser tab switching. Zero waiting and refreshing.&lt;/p&gt;

&lt;p&gt;Here's the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Request certificate&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.graycloudarch.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create validation records in Cloudflare&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
      &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
  &lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Critical - ACM validation breaks with proxy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for validation&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;validation_record_fqdns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cloudflare_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cert_validation&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;. Go make coffee. Come back to a validated certificate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Magic: for_each
&lt;/h2&gt;

&lt;p&gt;The key is this part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS generates validation records dynamically (one for the apex domain, one for www). Terraform reads them, loops over them, and creates each one in Cloudflare.&lt;/p&gt;

&lt;p&gt;You never see the records. You never copy anything. It just works.&lt;/p&gt;
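&lt;p&gt;If the comprehension syntax is new, here's the shape of the map it builds, sketched in plain bash with made-up record values (AWS generates the real ones):&lt;/p&gt;

```shell
# A stand-in for the map the for_each expression produces: one entry
# per domain, each holding the DNS record ACM wants to see.
# Values are illustrative, not real ACM output.
declare -A validation_records
validation_records["example.com"]="_abc.example.com. CNAME _xyz.acm-validations.aws."
validation_records["www.example.com"]="_def.www.example.com. CNAME _uvw.acm-validations.aws."

# Terraform creates one cloudflare_record per entry, much like this loop would.
for domain in "${!validation_records[@]}"; do
  echo "record for $domain: ${validation_records[$domain]}"
done
```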

&lt;h2&gt;
  
  
  What I Screwed Up
&lt;/h2&gt;

&lt;p&gt;First time I ran this, ACM validation timed out after 30 minutes.&lt;/p&gt;

&lt;p&gt;The problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloudflare's proxy rewrites DNS responses. ACM's validation servers hit Cloudflare's IP instead of seeing your validation record.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS-only mode. No proxy. ACM validation works.&lt;/p&gt;

&lt;p&gt;Cost me 30 minutes of debugging. Now it's in code so I never hit it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm running two brands: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Manual approach: 15 steps per domain = 30 steps total. 30 minutes minimum. High chance of typos.&lt;/p&gt;

&lt;p&gt;Terraform approach: One &lt;code&gt;terraform apply&lt;/code&gt;. 5 minutes to write the code (once), 10 minutes for AWS to validate. Then copy-paste the pattern for the second domain.&lt;/p&gt;

&lt;p&gt;When I launch my third brand (and I will), it'll take 5 minutes and one &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront automation for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part People Miss
&lt;/h2&gt;

&lt;p&gt;Most Terraform tutorials stop at requesting the certificate. They don't show you the validation loop or the waiting resource.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;aws_acm_certificate_validation&lt;/code&gt;, Terraform exits immediately after creating the cert. It's still "Pending Validation" in AWS. When you try to use it in CloudFront, it fails.&lt;/p&gt;

&lt;p&gt;You'd have to run &lt;code&gt;terraform apply&lt;/code&gt; again later, after manually checking that validation completed.&lt;/p&gt;

&lt;p&gt;That's not automation—that's just documentation.&lt;/p&gt;

&lt;p&gt;The waiting resource makes it truly hands-off.&lt;/p&gt;
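&lt;p&gt;Conceptually, the waiting resource is just a polling loop on the certificate status. Here's a bash sketch with the ACM call stubbed out (a real loop would re-query &lt;code&gt;aws acm describe-certificate&lt;/code&gt; each pass):&lt;/p&gt;

```shell
# Roughly what aws_acm_certificate_validation does: poll until the
# certificate status reaches ISSUED. The real check would be:
#   aws acm describe-certificate --certificate-arn "$ARN" \
#     --query Certificate.Status --output text
# Stubbed here so the loop is runnable without AWS credentials.
attempt=0
status="PENDING_VALIDATION"
while [ "$status" != "ISSUED" ]; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 3 ]; then
    status="ISSUED"   # stub: pretend validation finished on the third poll
  fi
  echo "poll $attempt: $status"
  # sleep 30          # a real loop would pause between polls
done
```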

&lt;h2&gt;
  
  
  Scaling It
&lt;/h2&gt;

&lt;p&gt;Adding a second domain is 10 lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.cloudpatterns.io"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* same pattern */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern, different names. No clicking. No switching between consoles. No remembering which validation record goes where.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the time savings (though 30 minutes per deployment adds up).&lt;/p&gt;

&lt;p&gt;It's the mental overhead.&lt;/p&gt;

&lt;p&gt;Manual DNS configuration requires focus. "Did I copy the whole string? Did I add the trailing dot? Is it DNS-only mode?"&lt;/p&gt;

&lt;p&gt;Terraform requires running one command. That's it.&lt;/p&gt;

&lt;p&gt;I get my focus back. I can write this blog post while Terraform validates certificates.&lt;/p&gt;

&lt;p&gt;Want the full code? It's not open source (yet), but if you're building something similar and want to talk through it, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm overthinking this and should've clicked through Cloudflare like a normal person, that's cool too.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>cloudflare</category>
      <category>dns</category>
    </item>
    <item>
      <title>Building Multi-Account AWS Infrastructure with Terraform and ECP</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:25 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</link>
      <guid>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/multi-account-aws-ecp/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;After years of building AWS infrastructure at scale, I've learned that multi-account strategy isn't just about security—it's about organizational clarity and cost management.&lt;/p&gt;

&lt;p&gt;At a large podcast hosting platform, we implemented an Enterprise Control Plane (ECP) pattern using Terraform to manage 20+ AWS accounts. Here's what I learned:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Account AWS
&lt;/h2&gt;

&lt;p&gt;Most companies start with one AWS account. Everything lives together: dev, staging, prod, data pipelines, security tools. It works... until it doesn't.&lt;/p&gt;

&lt;p&gt;Problems emerge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius:&lt;/strong&gt; A misconfigured dev resource can affect production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM complexity:&lt;/strong&gt; Permission boundaries become impossible to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost allocation:&lt;/strong&gt; Finance can't track spending by team or project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Auditors want logical separation between environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The ECP Pattern
&lt;/h2&gt;

&lt;p&gt;Enterprise Control Plane is an architectural pattern for managing multiple AWS accounts as a unified platform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Organization Structure:&lt;/strong&gt; AWS Organizations with OUs (Organizational Units) for different environments and teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Networking:&lt;/strong&gt; Transit Gateway connecting all accounts through hub-and-spoke model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Baseline:&lt;/strong&gt; Service Control Policies (SCPs) enforcing guardrails at the organization level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code:&lt;/strong&gt; Terraform/Terragrunt managing everything from a central repository&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Account Boundaries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production accounts: Isolated per application/team&lt;/li&gt;
&lt;li&gt;Non-prod accounts: Shared dev/staging to reduce overhead&lt;/li&gt;
&lt;li&gt;Platform accounts: Separate accounts for logging, monitoring, security tools&lt;/li&gt;
&lt;li&gt;Data accounts: Isolated for compliance and access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Network Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hub account with Transit Gateway&lt;/li&gt;
&lt;li&gt;VPC peering only where absolutely necessary&lt;/li&gt;
&lt;li&gt;Private subnet defaults for everything&lt;/li&gt;
&lt;li&gt;Centralized egress through NAT Gateway in hub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCPs prevent account-level misconfigurations&lt;/li&gt;
&lt;li&gt;IAM roles for cross-account access (no shared credentials)&lt;/li&gt;
&lt;li&gt;CloudTrail logs aggregated to security account&lt;/li&gt;
&lt;li&gt;GuardDuty and Security Hub in every account&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Terraform Structure
&lt;/h2&gt;

&lt;p&gt;We use Terragrunt to manage configurations across accounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ecp-ou-structure/&lt;/span&gt;     &lt;span class="c1"&gt;# Organization and account management&lt;/span&gt;
&lt;span class="s"&gt;ecp-network/&lt;/span&gt;          &lt;span class="c1"&gt;# Transit Gateway, VPCs, networking&lt;/span&gt;
&lt;span class="s"&gt;ecp-security/&lt;/span&gt;         &lt;span class="c1"&gt;# Security baseline, SCPs, IAM&lt;/span&gt;
&lt;span class="s"&gt;tf-live-aws-*/&lt;/span&gt;        &lt;span class="c1"&gt;# Application-specific infrastructure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with security:&lt;/strong&gt; SCPs first, then networking, then workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate account creation:&lt;/strong&gt; Manual account provisioning doesn't scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the why:&lt;/strong&gt; Every architectural decision needs context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for day 2:&lt;/strong&gt; Operations matter more than initial setup&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing ECP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced security incident blast radius by 90%&lt;/li&gt;
&lt;li&gt;Finance can now track costs by team and project&lt;/li&gt;
&lt;li&gt;New environments deploy in hours, not days&lt;/li&gt;
&lt;li&gt;Passed SOC2 audit with zero infrastructure findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-account AWS isn't just best practice—it's how you scale infrastructure beyond the startup phase.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>multiaccount</category>
      <category>ecp</category>
    </item>
    <item>
      <title>Stop Manually Updating Jira After Every PR Merge</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:10:48 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</link>
      <guid>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/automate-jira-github-actions/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You just merged a PR. Now you open Jira, find the ticket, paste the PR link in a comment, transition the status to Done, and update the deployed field. Five minutes. Twenty times a week. That's over 5,000 minutes per year per engineer --- more than 80 hours of pure mechanical overhead.&lt;/p&gt;

&lt;p&gt;And that's assuming you remember. On one team I worked with, we audited the last three months of merged PRs. Thirty percent of tickets had no update after merge. No comment, no transition, no link. The ticket just sat in In Dev until someone noticed during sprint review.&lt;/p&gt;

&lt;p&gt;The fix is two GitHub Actions workflows and a shared composite action. Here's exactly how to build it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Two workflows, one shared extraction layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 1&lt;/strong&gt;: Fires on PR creation --- posts a Jira
link comment to the PR so reviewers can navigate directly to the
ticket.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 2&lt;/strong&gt;: Fires on PR merge to &lt;code&gt;main&lt;/code&gt;
--- posts a comment to the Jira ticket with the PR URL, commit SHA, and
who merged it, then transitions the ticket to Done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both workflows need to find the Jira ticket ID. Instead of duplicating that logic, we extract it into a composite action.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Composite Action for Ticket Extraction
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/actions/extract-jira-ticket/action.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The action checks the sources in priority order --- easiest for the developer to fix first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; PR title (simplest to correct)&lt;/li&gt;
&lt;li&gt; Branch name in standard format:
&lt;code&gt;PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Branch name with prefix:
&lt;code&gt;feat/PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract Jira Ticket&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extracts Jira ticket from PR title, commits, or branch name&lt;/span&gt;

&lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.jira_key }}&lt;/span&gt;
  &lt;span class="na"&gt;found&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.found }}&lt;/span&gt;

&lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;using&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;composite&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract ticket ID&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;JIRA_KEY=""&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 1: PR title&lt;/span&gt;
        &lt;span class="s"&gt;if [[ "${{ github.event.pull_request.title }}" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
          &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 2: Branch name&lt;/span&gt;
        &lt;span class="s"&gt;if [ -z "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;BRANCH="${{ github.head_ref }}"&lt;/span&gt;
          &lt;span class="s"&gt;if [[ "$BRANCH" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
            &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;if [ -n "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "jira_key=$JIRA_KEY" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=true" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=false" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The regex &lt;code&gt;[A-Z]+-[0-9]+&lt;/code&gt; matches any Jira ticket format: &lt;code&gt;PROJ-1&lt;/code&gt;, &lt;code&gt;IN-89&lt;/code&gt;, &lt;code&gt;INFRA-1234&lt;/code&gt;. If you have tickets with lowercase project keys, adjust accordingly.&lt;/p&gt;
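&lt;p&gt;You can sanity-check the regex locally before wiring it into the action. The &lt;code&gt;extract&lt;/code&gt; helper below is just for the demo, and the sample inputs are made up:&lt;/p&gt;

```shell
# Exercise the same [A-Z]+-[0-9]+ extraction the composite action uses.
extract() {
  if [[ "$1" =~ ([A-Z]+-[0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "none"
  fi
}

extract "PROJ-123: Fix flaky deploy"    # prints PROJ-123
extract "feat/INFRA-42-tighten-scps"    # prints INFRA-42
extract "no ticket anywhere"            # prints none
```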
&lt;h2&gt;
  
  
  Step 2: PR Creation Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/link-jira-on-pr.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is opened and posts a formatted comment with the Jira ticket link. If no ticket is found, it posts a warning so the author knows to add one --- before review, not after.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Link Jira on PR&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;link-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post Jira link comment&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: `📋 Jira: [${{ steps.jira.outputs.jira-key }}](${{ secrets.JIRA_BASE_URL }}/browse/${{ steps.jira.outputs.jira-key }})`&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warn if no ticket found&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'false'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: '⚠️ No Jira ticket found. Add a ticket ID to the PR title (e.g., `PROJ-123: Your title`).'&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The warning step matters. It creates a feedback loop that trains the team to include ticket IDs upfront. Within a few weeks, the warning rarely fires.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: PR Merge Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/update-jira-on-merge.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is closed against &lt;code&gt;main&lt;/code&gt;. The &lt;code&gt;if: github.event.pull_request.merged == true&lt;/code&gt; guard is important --- the &lt;code&gt;closed&lt;/code&gt; event also fires for PRs that are closed without merging.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Jira on Merge&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;closed&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;update-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.event.pull_request.merged == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post merge comment to Jira&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}"&lt;/span&gt;
            &lt;span class="s"&gt;-u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json"&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/comment"&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"body\": \"PR merged: #${{ github.event.pull_request.number }} ${{ github.event.pull_request.html_url }}\nCommit: ${{ github.sha }}\nBy: ${{ github.event.pull_request.merged_by.login }}\"}")&lt;/span&gt;

          &lt;span class="s"&gt;echo "Jira comment HTTP status: $HTTP_STATUS"&lt;/span&gt;
          &lt;span class="s"&gt;[ "$HTTP_STATUS" -eq 201 ] &amp;amp;&amp;amp; echo "✅ Comment posted" || echo "⚠️ Comment failed (non-critical)"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Transition ticket to Done&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;TRANSITION_ID="${{ secrets.JIRA_DONE_TRANSITION_ID }}"&lt;/span&gt;
          &lt;span class="s"&gt;[ -z "$TRANSITION_ID" ] &amp;amp;&amp;amp; echo "No transition ID configured, skipping" &amp;amp;&amp;amp; exit 0&lt;/span&gt;

          &lt;span class="s"&gt;curl -s -u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json"&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/transitions"&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"transition\": {\"id\": \"$TRANSITION_ID\"}}"&lt;/span&gt;
          &lt;span class="s"&gt;echo "✅ Transitioned to Done"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The comment step uses an HTTP status check rather than relying on&lt;br&gt;
curl's exit code. A failed comment doesn't fail the job --- the PR already&lt;br&gt;
merged, and a missing notification shouldn't generate noise in CI. The&lt;br&gt;
transition step is fully optional: if&lt;br&gt;
&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; isn't set, it skips silently. This&lt;br&gt;
lets you start with just comments and add transitions once you've&lt;br&gt;
verified the workflow runs cleanly.&lt;/p&gt;
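&lt;p&gt;Outside of Actions, the same non-fatal pattern can be sketched as a small shell function --- the endpoint and payload here are placeholders, not the workflow above:&lt;/p&gt;

```shell
# Post a JSON payload but never fail the caller: branch on the HTTP
# status captured by -w instead of relying on curl's exit code.
post_comment() {
  url=$1; payload=$2
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "Content-Type: application/json" \
    -X POST "$url" -d "$payload")
  if [ "$status" -eq 201 ]; then
    echo "comment posted"
  else
    echo "comment failed with HTTP $status (non-critical)"
  fi
  return 0  # always succeed so the surrounding job stays green
}
```

The `return 0` is the important part: the notification is best-effort, so a Jira outage never turns a merged PR's pipeline red.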
&lt;h2&gt;
  
  
  Finding Your Transition IDs
&lt;/h2&gt;

&lt;p&gt;Transition IDs are project-specific. There's no universal "Done" ID.&lt;br&gt;
Run this against any ticket in your project to find yours:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_EMAIL&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/rest/api/2/issue/&lt;/span&gt;&lt;span class="nv"&gt;$TICKET_KEY&lt;/span&gt;&lt;span class="s2"&gt;/transitions"&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.transitions[] | "ID: \(.id) | \(.name)"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Example output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID: 91 | Done
ID: 31 | In Review
ID: 21 | In Progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set the Done ID as &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; in your&lt;br&gt;
repository secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Jira API Versions
&lt;/h2&gt;

&lt;p&gt;Use the v2 API: &lt;code&gt;/rest/api/2/&lt;/code&gt;. Some teams try v3 and get&lt;br&gt;
silent empty responses --- &lt;code&gt;{"errorMessages":[],"errors":{}}&lt;/code&gt; ---&lt;br&gt;
that look exactly like auth failures. It's not auth. v3 requires&lt;br&gt;
Atlassian Document Format (ADF) for rich-text fields like comment&lt;br&gt;
bodies, and its error reporting gives you nothing to debug with. v2&lt;br&gt;
accepts plain strings, is well-documented, and works consistently.&lt;/p&gt;
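&lt;p&gt;To make the difference concrete, here is a sketch of the two comment payloads --- the ADF shape is what v3 expects for rich-text fields; the values are placeholders:&lt;/p&gt;

```shell
# v2 takes a plain string for the comment body:
V2_BODY='{"body": "PR merged: #123"}'

# v3 expects Atlassian Document Format (ADF) for the same field:
V3_BODY='{"body": {"type": "doc", "version": 1, "content": [
  {"type": "paragraph", "content": [
    {"type": "text", "text": "PR merged: #123"}]}]}}'

# Sending the v2 string body to /rest/api/3/ is what produces the
# empty-looking {"errorMessages":[],"errors":{}} response.
```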

&lt;h2&gt;
  
  
  Required Secrets
&lt;/h2&gt;

&lt;p&gt;Add these to your GitHub repository secrets:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Secret&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_BASE_URL&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;https://yourorg.atlassian.net&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_USER_EMAIL&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The email address tied to your API token&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_API_TOKEN&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Generate at id.atlassian.com → Security → API tokens&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Optional --- from the transitions API call above&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For org-wide rollout, set these as organization secrets and restrict&lt;br&gt;
to relevant repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After rolling this out across a team of eight engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Zero manual Jira updates after merge&lt;/li&gt;
&lt;li&gt;  Forgotten ticket updates dropped from 30% to 0%&lt;/li&gt;
&lt;li&gt;  Roughly 1,700 minutes per year recovered per engineer&lt;/li&gt;
&lt;li&gt;  Every merged PR has a complete audit trail: PR number, URL, commit
SHA, who merged it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The composite action pattern also means when you need to extend this&lt;br&gt;
--- adding a Slack notification on merge, posting to Confluence --- you&lt;br&gt;
extend one file, not two.&lt;/p&gt;

&lt;p&gt;If you're building out automation like this across your engineering&lt;br&gt;
platform and want a second opinion on the design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I'm available for advisory engagements&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cicd</category>
      <category>devops</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>How I Manage Claude Code Context Across 20+ Repositories</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Fri, 20 Mar 2026 14:52:42 +0000</pubDate>
      <link>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-100p</link>
      <guid>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-100p</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/managing-claude-code-context-multi-repo/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Three months ago I was re-explaining my Terragrunt state backend to&lt;br&gt;
Claude for the third time in a week. Different session, but the same&lt;br&gt;
repo I'd worked in the session before --- and Claude had no idea I was&lt;br&gt;
even in the same project.&lt;/p&gt;

&lt;p&gt;I run Claude Code daily across a 6-account AWS platform monorepo, a&lt;br&gt;
personal consulting site, homelab infrastructure, and a handful of side&lt;br&gt;
projects. Every session started with the same five minutes of "here's&lt;br&gt;
the project, here are the conventions, here's the Jira workflow" --- and&lt;br&gt;
still ended with Claude suggesting patterns that didn't fit the&lt;br&gt;
environment, because I'd inevitably forgotten to mention something.&lt;/p&gt;

&lt;p&gt;After three months of broken symlinks and abandoned experiments, I&lt;br&gt;
landed on a three-tier context hierarchy that loads the right context&lt;br&gt;
automatically depending on which directory I'm working in --- and I manage&lt;br&gt;
all of it from a single dotfiles repo.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Single-File Context
&lt;/h2&gt;

&lt;p&gt;Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from the current directory&lt;br&gt;
(and parent directories, walking up to&lt;br&gt;
&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;). Most teams start with one file and&lt;br&gt;
put everything in it.&lt;/p&gt;

&lt;p&gt;That breaks down quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Global preferences get mixed with project
specifics.&lt;/strong&gt; Your "use snake_case for variable names" preference
shouldn't live next to your Terraform state bucket configuration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Credentials and account IDs end up in files you accidentally
commit.&lt;/strong&gt; Put AWS account IDs in a shared CLAUDE.md, and someone
will eventually push it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You can't share patterns across repos without
duplication.&lt;/strong&gt; Every new repo gets a fresh copy of the same
conventions, and updates never propagate.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-employer context creates conflicts.&lt;/strong&gt; Your
consulting client's Jira workflow shouldn't contaminate your personal
project sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first attempt at fixing this was a shared scripts directory. The&lt;br&gt;
three-tier hierarchy came later, after I figured out what was actually&lt;br&gt;
wrong with the simpler approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  My First Attempt: A Shared Scripts Directory
&lt;/h2&gt;

&lt;p&gt;Before landing on the three-tier system, I built something more&lt;br&gt;
obvious: a &lt;code&gt;~/shared-claude-infra/&lt;/code&gt; directory containing a&lt;br&gt;
&lt;code&gt;setup-project.sh&lt;/code&gt; script that initialized&lt;br&gt;
&lt;code&gt;.claude/&lt;/code&gt; context for each new repo.&lt;/p&gt;

&lt;p&gt;The script created the directory structure and symlinked a&lt;br&gt;
&lt;code&gt;rules/shared/&lt;/code&gt; folder back to the shared repo:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p "$PROJECT_DIR/.claude/rules"
ln -s ~/shared-claude-infra/rules "$PROJECT_DIR/.claude/rules/shared"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;This worked for the first two repos I configured. Then the problems&lt;br&gt;
compounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Manual per-project setup.&lt;/strong&gt; Every new repo required
running the script. Miss one, and that repo has no shared context.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Two repos to maintain.&lt;/strong&gt; The shared infrastructure
lived in its own git repo, separate from dotfiles. Two places to update
when conventions changed, and they drifted.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Nested symlinks instead of directory-level
symlinks.&lt;/strong&gt; The &lt;code&gt;rules/shared&lt;/code&gt; symlink lived deep
inside the project's &lt;code&gt;.claude/&lt;/code&gt; tree. When the target moved,
every project that had run the script got a broken symlink ---
silently.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hardcoded paths that drifted.&lt;/strong&gt; The script referenced
workspace paths from three months earlier. My actual directory layout
had changed; the script still pointed at the old locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I eventually deleted the shared directory, a quick&lt;br&gt;
&lt;code&gt;find&lt;/code&gt; confirmed broken symlinks scattered across every repo&lt;br&gt;
that had run the setup script. The approach was inherently fragile&lt;br&gt;
because it depended on every machine, every repo, and every workspace&lt;br&gt;
path staying synchronized manually.&lt;/p&gt;
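&lt;p&gt;If you inherit a setup like this, the damage is easy to enumerate. A minimal sketch --- the three-level depth and the &lt;code&gt;~/work&lt;/code&gt; path are assumptions about your workspace layout:&lt;/p&gt;

```shell
# Print symlinks under a directory whose targets no longer resolve.
# -type l matches symlinks; `! -exec test -e {} \;` keeps only the
# ones whose target is missing.
find_broken_links() {
  find "$1" -maxdepth 3 -type l ! -exec test -e {} \; -print
}

# Usage: find_broken_links ~/work
```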

&lt;p&gt;The fix isn't a smarter script. It's inverting the relationship:&lt;br&gt;
instead of a script that runs once per project, use dotfiles that wire&lt;br&gt;
context automatically based on what directories exist.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Working through something similar?&lt;/strong&gt; I advise platform teams on AWS infrastructure --- multi-account architecture, Transit Gateway, EKS, and Terraform IaC. &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Three-Tier Hierarchy
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/CLAUDE.md              ← Global: preferences, style, git workflow
~/work/{employer}/.claude/       ← Org: team structure, AWS accounts, Jira workflow
~/work/{employer}/{repo}/.claude/ ← Project: repo architecture, active tickets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Claude Code walks up the directory tree loading&lt;br&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; files at each level. Each tier handles a specific&lt;br&gt;
scope:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global tier&lt;/strong&gt; (&lt;code&gt;~/.claude/&lt;/code&gt;): Everything&lt;br&gt;
that applies across all work --- communication style, git commit format,&lt;br&gt;
PR description templates, universal infrastructure patterns. No&lt;br&gt;
credentials, no account IDs, nothing employer-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Org tier&lt;/strong&gt; (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;):&lt;br&gt;
Team structure, Jira project keys, AWS account layout, CI/CD pipeline&lt;br&gt;
conventions. Sensitive patterns (account IDs, VPC IDs, state bucket&lt;br&gt;
names) go in gitignored files within this directory. Reusable patterns&lt;br&gt;
(CI/CD templates, AWS patterns without specifics) go in committed&lt;br&gt;
files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project tier&lt;/strong&gt;&lt;br&gt;
(&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;): Architecture decisions&lt;br&gt;
for this specific repo, active tickets, ongoing work state. Always&lt;br&gt;
gitignored --- this is ephemeral working context that changes&lt;br&gt;
frequently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation: Symlinks from Dotfiles
&lt;/h2&gt;

&lt;p&gt;The hierarchy only works if it's consistent across machines. I manage&lt;br&gt;
all context files from a dotfiles repo using symlinks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotfiles/claude/
├── global/          → symlinked to ~/.claude/
├── {employer}/      → symlinked to ~/work/{employer}/.claude/
└── {personal}/      → symlinked to ~/personal/.claude/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; wires these automatically:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Global context
ln -sf "$DOTFILES/claude/global" "$HOME/.claude"

# Per-employer context
for employer in "${EMPLOYERS[@]}"; do
  WORK_DIR="$HOME/work/$employer"
  if [ -d "$WORK_DIR" ]; then
    ln -sf "$DOTFILES/claude/$employer" "$WORK_DIR/.claude"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Any machine that runs &lt;code&gt;install.sh&lt;/code&gt; gets the same context&lt;br&gt;
hierarchy. Changes committed to dotfiles propagate immediately.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Each Level Contains
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Global (&lt;code&gt;~/.claude/&lt;/code&gt;)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/
├── CLAUDE.md          # Preferences, active work summary
└── rules/
    ├── git-workflow.md
    ├── pr-patterns.md
    ├── infrastructure.md
    └── context-management.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is short --- preferences and a pointer to where&lt;br&gt;
active work lives. The heavy lifting goes in &lt;code&gt;rules/&lt;/code&gt; files&lt;br&gt;
that Claude loads as supplementary context.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Communication Style
- Be direct and technical — I understand infrastructure concepts
- Explain the "why" behind decisions
- Provide specific file paths and line numbers

## Git Workflow
- Branch format: feat/TICKET-123-description
- Commit format: [TICKET-123] Brief summary\n\nWhy this change...
- Never add Co-Authored-By trailers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h3&gt;
  
  
  Org Level (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/work/{employer}/.claude/
├── {EMPLOYER}.md              # Team structure, Jira workflow — committed
└── rules/
    ├── cicd-patterns.md       # CI/CD conventions — committed
    ├── aws-patterns.md        # Account IDs, VPC IDs — GITIGNORED
    └── terraform-patterns.md  # State config, module paths — GITIGNORED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The privacy split matters. &lt;code&gt;cicd-patterns.md&lt;/code&gt; contains&lt;br&gt;
reusable GitHub Actions patterns --- fine to commit.&lt;br&gt;
&lt;code&gt;aws-patterns.md&lt;/code&gt; contains actual account IDs --- stays&lt;br&gt;
local.&lt;/p&gt;

&lt;p&gt;A typical employer context file:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Team Structure
- Platform team: 5 engineers, all in Jira project IN
- AWS accounts: dev, nonprod, prod (+ 3 infra accounts network, security, management)
- Monorepo: ~/work/{employer}/iac — Terragrunt, 78 components

## Jira Workflow
- IN project (infrastructure): transition IDs 3=In Dev, 8=Needs Review, 9=Done
- Prefix all commits: [IT-XXX]
- API: REST v2 only — v3 silently returns empty responses

## CI/CD
- GitHub Actions with OIDC to AWS (no long-lived credentials)
- PR requires: terraform plan output posted as comment
- Merge to main triggers auto-deploy to nonprod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The gitignored &lt;code&gt;aws-patterns.md&lt;/code&gt; contains account IDs and&lt;br&gt;
specific ARNs that Claude needs for generating Terraform configurations&lt;br&gt;
accurately but shouldn't be committed anywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Level (&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Project context is ephemeral and always gitignored. It's the working&lt;br&gt;
memory for an ongoing effort:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Current State
- Branch: feat/IT-89-my-app-dev-ecr
- Active ticket: IT-89 — ECR in dev
- Next: IT-90 — ECS task definition

## Architecture Decisions
- ECR in dev only; cross-account pull policies for nonprod and prod
- Mutable tags in dev, immutable in nonprod/prod
- KMS key per environment, not per repository

## Blockers
- Waiting on network DNS zone creation before cutover can proceed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;I update this file at the end of each session with current state so&lt;br&gt;
the next session loads instantly without re-explaining where things&lt;br&gt;
are.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Privacy Model
&lt;/h2&gt;

&lt;p&gt;The critical insight is that context files need two categories:&lt;br&gt;
committed (shareable) and local-only (sensitive).&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Content&lt;/th&gt;&lt;th&gt;Location&lt;/th&gt;&lt;th&gt;Committed?&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Personal preferences&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Git workflow rules&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/rules/git-workflow.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Team structure&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/{EMPLOYER}.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅ sanitized&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CI/CD patterns&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/cicd-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AWS account IDs&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/aws-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;VPC IDs, state config&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/terraform-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Active ticket state&lt;/td&gt;&lt;td&gt;&lt;code&gt;{repo}/.claude/OVERRIDES.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; at the dotfiles level handles this&lt;br&gt;
automatically by ignoring &lt;code&gt;**/aws-patterns.md&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;**/terraform-patterns.md&lt;/code&gt; across all employer&lt;br&gt;
directories.&lt;/p&gt;
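&lt;p&gt;The relevant entries are short --- a sketch of the dotfiles-level &lt;code&gt;.gitignore&lt;/code&gt;, assuming the file names used above:&lt;/p&gt;

```gitignore
# Sensitive org-level context stays local in every employer directory
**/aws-patterns.md
**/terraform-patterns.md
```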
&lt;h2&gt;
  
  
  Custom Commands
&lt;/h2&gt;

&lt;p&gt;Beyond context files, Claude Code supports custom&lt;br&gt;
&lt;code&gt;/commands&lt;/code&gt; --- reusable prompts stored as markdown files:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/commands/
├── checkpoint.md       # Create context snapshot
├── sync-work.md        # Update active work status
└── pr-ready.md         # Generate PR description
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;A command file is just the prompt Claude should execute:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# checkpoint.md
Create a context checkpoint. Read the current git status across active repos,
summarize open PRs and their status, list active tickets with their current
state, and write a structured summary to ~/.claude/local.md. Include any
blocking issues and the next planned action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Commands at the global level are available everywhere. Org-level&lt;br&gt;
commands handle employer-specific workflows like Jira transitions.&lt;/p&gt;
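&lt;p&gt;A hypothetical org-level command file --- the file name, branch convention, and project key here are illustrative, not from my actual setup:&lt;/p&gt;

```markdown
# jira-in-review.md
Move the ticket for the current branch to "In Review". Extract the
ticket key from the branch name (feat/IT-123-description), look up the
transition ID in the org rules file, call the v2 transitions endpoint,
and report the resulting status.
```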

&lt;h2&gt;
  
  
  What This Solves in Practice
&lt;/h2&gt;

&lt;p&gt;Before this system: every new Claude session started with "here's the&lt;br&gt;
project, here are the conventions, here's where things are." Five&lt;br&gt;
minutes of ramp-up, inconsistent outputs because I'd forget to mention&lt;br&gt;
something.&lt;/p&gt;

&lt;p&gt;After: I &lt;code&gt;cd&lt;/code&gt; into a repo and Claude already knows the&lt;br&gt;
Jira workflow, the AWS account structure, the naming conventions, and&lt;br&gt;
where the active work stands. When I start a session mid-ticket, the&lt;br&gt;
project-level context tells Claude exactly what was in progress.&lt;/p&gt;

&lt;p&gt;The bigger payoff is consistency. When Claude generates Terraform, it&lt;br&gt;
generates it with the correct state backend. When it writes commit&lt;br&gt;
messages, they follow the format reviewers expect. When it suggests&lt;br&gt;
architecture, it fits the actual account model rather than a generic AWS&lt;br&gt;
example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you're starting from scratch, work tier by tier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with your communication
preferences and git conventions.&lt;/li&gt;
&lt;li&gt; Add a &lt;code&gt;rules/&lt;/code&gt; directory with patterns you want loaded
consistently.&lt;/li&gt;
&lt;li&gt; Create an org-level directory when you start working with a specific
employer or major project.&lt;/li&gt;
&lt;li&gt; Add project-level context when you start a multi-session
effort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't try to build the whole system at once. The global tier alone&lt;br&gt;
eliminates most of the per-session ramp-up. The org and project tiers&lt;br&gt;
pay off as work gets more complex.&lt;/p&gt;

&lt;p&gt;The thing that surprised me most wasn't the time saved on ramp-up. It&lt;br&gt;
was how much the output quality improved. When Claude knows the actual&lt;br&gt;
state backend, the actual account IDs, the actual PR format your&lt;br&gt;
reviewers expect --- the suggestions it makes fit your environment. That&lt;br&gt;
gap between "technically correct" and "actually usable" is where most of&lt;br&gt;
the friction in AI-assisted infrastructure work lives. The context&lt;br&gt;
hierarchy is mostly just closing that gap.&lt;/p&gt;

&lt;p&gt;If you're setting up Claude Code for a platform team and want to talk&lt;br&gt;
through the context design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I do advisory&lt;br&gt;
engagements&lt;/a&gt; for teams getting serious about AI tooling in their&lt;br&gt;
infrastructure workflow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>developertooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Manage Claude Code Context Across 20+ Repositories</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Fri, 20 Mar 2026 06:57:13 +0000</pubDate>
      <link>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</link>
      <guid>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/blog/managing-claude-code-context-multi-repo/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Three months ago I was re-explaining my Terragrunt state backend to&lt;br&gt;
Claude for the third time in a week. Different session, but the same&lt;br&gt;
repo I'd worked in the session before --- and Claude had no idea I was&lt;br&gt;
even in the same project.&lt;/p&gt;

&lt;p&gt;I run Claude Code daily across a 6-account AWS platform monorepo, a&lt;br&gt;
personal consulting site, homelab infrastructure, and a handful of side&lt;br&gt;
projects. Every session started with the same five minutes of "here's&lt;br&gt;
the project, here are the conventions, here's the Jira workflow" --- and&lt;br&gt;
still ended with Claude suggesting patterns that didn't fit the&lt;br&gt;
environment, because I'd inevitably forgotten to mention something.&lt;/p&gt;

&lt;p&gt;After three months of broken symlinks and abandoned experiments, I&lt;br&gt;
landed on a three-tier context hierarchy that loads the right context&lt;br&gt;
automatically depending on which directory I'm working in --- and I manage&lt;br&gt;
all of it from a single dotfiles repo.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Single-File Context
&lt;/h2&gt;

&lt;p&gt;Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from the current directory&lt;br&gt;
(and parent directories, walking up to&lt;br&gt;
&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;). Most teams start with one file and&lt;br&gt;
put everything in it.&lt;/p&gt;

&lt;p&gt;That breaks down quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global preferences get mixed with project
specifics.&lt;/strong&gt; Your "use snake_case for variable names" preference
shouldn't live next to your Terraform state bucket configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials and account IDs end up in files you accidentally
commit.&lt;/strong&gt; Put AWS account IDs in a shared CLAUDE.md, and someone
will eventually push it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't share patterns across repos without
duplication.&lt;/strong&gt; Every new repo gets a fresh copy of the same
conventions, and updates never propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-employer context creates conflicts.&lt;/strong&gt; Your
consulting client's Jira workflow shouldn't contaminate your personal
project sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first attempt at fixing this was a shared scripts directory. The&lt;br&gt;
three-tier hierarchy came later, after I figured out what was actually&lt;br&gt;
wrong with the simpler approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  My First Attempt: A Shared Scripts Directory
&lt;/h2&gt;

&lt;p&gt;Before landing on the three-tier system, I built something more&lt;br&gt;
obvious: a &lt;code&gt;~/shared-claude-infra/&lt;/code&gt; directory containing a&lt;br&gt;
&lt;code&gt;setup-project.sh&lt;/code&gt; script that initialized&lt;br&gt;
&lt;code&gt;.claude/&lt;/code&gt; context for each new repo.&lt;/p&gt;

&lt;p&gt;The script created the directory structure and symlinked a&lt;br&gt;
&lt;code&gt;rules/shared/&lt;/code&gt; folder back to the shared repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p "$PROJECT_DIR/.claude/rules"
ln -s ~/shared-claude-infra/rules "$PROJECT_DIR/.claude/rules/shared"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked for the first two repos I configured. Then the problems&lt;br&gt;
compounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual per-project setup.&lt;/strong&gt; Every new repo required
running the script. Miss one, and that repo has no shared context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two repos to maintain.&lt;/strong&gt; The shared infrastructure
lived in its own git repo, separate from dotfiles. Two places to update
when conventions changed, and they drifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested symlinks instead of directory-level
symlinks.&lt;/strong&gt; The &lt;code&gt;rules/shared&lt;/code&gt; symlink lived deep
inside the project's &lt;code&gt;.claude/&lt;/code&gt; tree. When the target moved,
every project that had run the script got a broken symlink ---
silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded paths that drifted.&lt;/strong&gt; The script referenced
workspace paths from three months earlier. My actual directory layout
had changed; the script still pointed at the old locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I eventually deleted the shared directory, a quick&lt;br&gt;
&lt;code&gt;find&lt;/code&gt; confirmed broken symlinks scattered across every repo&lt;br&gt;
that had run the setup script. The approach was inherently fragile&lt;br&gt;
because it depended on every machine, every repo, and every workspace&lt;br&gt;
path staying synchronized manually.&lt;/p&gt;
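&lt;p&gt;If you inherit a setup like this, the damage is easy to enumerate. A minimal sketch --- the three-level depth and the &lt;code&gt;~/work&lt;/code&gt; path are assumptions about your workspace layout:&lt;/p&gt;

```shell
# Print symlinks under a directory whose targets no longer resolve.
# -type l matches symlinks; `! -exec test -e {} \;` keeps only the
# ones whose target is missing.
find_broken_links() {
  find "$1" -maxdepth 3 -type l ! -exec test -e {} \; -print
}

# Usage: find_broken_links ~/work
```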

&lt;p&gt;The fix isn't a smarter script. It's inverting the relationship:&lt;br&gt;
instead of a script that runs once per project, use dotfiles that wire&lt;br&gt;
context automatically based on what directories exist.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Three-Tier Hierarchy
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/CLAUDE.md              ← Global: preferences, style, git workflow
~/work/{employer}/.claude/       ← Org: team structure, AWS accounts, Jira workflow
~/work/{employer}/{repo}/.claude/ ← Project: repo architecture, active tickets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Claude Code walks up the directory tree loading&lt;br&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; files at each level. Each tier handles a specific&lt;br&gt;
scope:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global tier&lt;/strong&gt; (&lt;code&gt;~/.claude/&lt;/code&gt;): Everything&lt;br&gt;
that applies across all work --- communication style, git commit format,&lt;br&gt;
PR description templates, universal infrastructure patterns. No&lt;br&gt;
credentials, no account IDs, nothing employer-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Org tier&lt;/strong&gt; (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;):&lt;br&gt;
Team structure, Jira project keys, AWS account layout, CI/CD pipeline&lt;br&gt;
conventions. Sensitive patterns (account IDs, VPC IDs, state bucket&lt;br&gt;
names) go in gitignored files within this directory. Reusable patterns&lt;br&gt;
(CI/CD templates, AWS patterns without specifics) go in committed&lt;br&gt;
files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project tier&lt;/strong&gt;&lt;br&gt;
(&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;): Architecture decisions&lt;br&gt;
for this specific repo, active tickets, ongoing work state. Always&lt;br&gt;
gitignored --- this is ephemeral working context that changes&lt;br&gt;
frequently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation: Symlinks from Dotfiles
&lt;/h2&gt;

&lt;p&gt;The hierarchy only works if it's consistent across machines. I manage&lt;br&gt;
all context files from a dotfiles repo using symlinks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotfiles/claude/
├── global/          → symlinked to ~/.claude/
├── {employer}/      → symlinked to ~/work/{employer}/.claude/
└── {personal}/      → symlinked to ~/personal/.claude/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; wires these automatically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Global context
ln -sf "$DOTFILES/claude/global" "$HOME/.claude"

# Per-employer context
for employer in "${EMPLOYERS[@]}"; do
  WORK_DIR="$HOME/work/$employer"
  if [ -d "$WORK_DIR" ]; then
    ln -sf "$DOTFILES/claude/$employer" "$WORK_DIR/.claude"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any machine that runs &lt;code&gt;install.sh&lt;/code&gt; gets the same context&lt;br&gt;
hierarchy. Changes committed to dotfiles propagate immediately.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Each Level Contains
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Global (&lt;code&gt;~/.claude/&lt;/code&gt;)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/
├── CLAUDE.md          # Preferences, active work summary
└── rules/
    ├── git-workflow.md
    ├── pr-patterns.md
    ├── infrastructure.md
    └── context-management.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is short --- preferences and a pointer to where&lt;br&gt;
active work lives. The heavy lifting goes in &lt;code&gt;rules/&lt;/code&gt; files&lt;br&gt;
that Claude loads as supplementary context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Communication Style
- Be direct and technical — I understand infrastructure concepts
- Explain the "why" behind decisions
- Provide specific file paths and line numbers

## Git Workflow
- Branch format: feat/TICKET-123-description
- Commit format: [TICKET-123] Brief summary\n\nWhy this change...
- Never add Co-Authored-By trailers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Org Level (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/work/{employer}/.claude/
├── {EMPLOYER}.md              # Team structure, Jira workflow — committed
└── rules/
    ├── cicd-patterns.md       # CI/CD conventions — committed
    ├── aws-patterns.md        # Account IDs, VPC IDs — GITIGNORED
    └── terraform-patterns.md  # State config, module paths — GITIGNORED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The privacy split matters. &lt;code&gt;cicd-patterns.md&lt;/code&gt; contains&lt;br&gt;
reusable GitHub Actions patterns --- fine to commit.&lt;br&gt;
&lt;code&gt;aws-patterns.md&lt;/code&gt; contains actual account IDs --- stays&lt;br&gt;
local.&lt;/p&gt;

&lt;p&gt;A typical employer context file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Team Structure
- Platform team: 5 engineers, all in Jira project IN
- AWS accounts: dev, nonprod, prod (+ 3 infra accounts network, security, management)
- Monorepo: ~/work/{employer}/iac — Terragrunt, 78 components

## Jira Workflow
- IN project (infrastructure): transition IDs 3=In Dev, 8=Needs Review, 9=Done
- Prefix all commits: [IT-XXX]
- API: REST v2 only — v3 silently returns empty responses

## CI/CD
- GitHub Actions with OIDC to AWS (no long-lived credentials)
- PR requires: terraform plan output posted as comment
- Merge to main triggers auto-deploy to nonprod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gitignored &lt;code&gt;aws-patterns.md&lt;/code&gt; contains account IDs and&lt;br&gt;
specific ARNs that Claude needs for generating Terraform configurations&lt;br&gt;
accurately but shouldn't be committed anywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Level (&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Project context is ephemeral and always gitignored. It's the working&lt;br&gt;
memory for an ongoing effort:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Current State
- Branch: feat/IT-89-my-app-dev-ecr
- Active ticket: IT-89 — ECR in dev
- Next: IT-90 — ECS task definition

## Architecture Decisions
- ECR in dev only; cross-account pull policies for nonprod and prod
- Mutable tags in dev, immutable in nonprod/prod
- KMS key per environment, not per repository

## Blockers
- Waiting on network DNS zone creation before cutover can proceed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I update this file at the end of each session with current state so&lt;br&gt;
the next session loads instantly without re-explaining where things&lt;br&gt;
are.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Privacy Model
&lt;/h2&gt;

&lt;p&gt;The critical insight is that context files need two categories:&lt;br&gt;
committed (shareable) and local-only (sensitive).&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Content&lt;/th&gt;&lt;th&gt;Location&lt;/th&gt;&lt;th&gt;Committed?&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Personal preferences&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Git workflow rules&lt;/td&gt;&lt;td&gt;&lt;code&gt;~/.claude/rules/git-workflow.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Team structure&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/{EMPLOYER}.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅ sanitized&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CI/CD patterns&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/cicd-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AWS account IDs&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/aws-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;VPC IDs, state config&lt;/td&gt;&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/terraform-patterns.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Active ticket state&lt;/td&gt;&lt;td&gt;&lt;code&gt;{repo}/.claude/OVERRIDES.md&lt;/code&gt;&lt;/td&gt;&lt;td&gt;❌ gitignored&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; at the dotfiles level handles this&lt;br&gt;
automatically by ignoring &lt;code&gt;**/aws-patterns.md&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;**/terraform-patterns.md&lt;/code&gt; across all employer&lt;br&gt;
directories.&lt;/p&gt;
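&lt;p&gt;The entries are just two glob patterns at the root of the dotfiles repo (a&lt;br&gt;
sketch; your sensitive-file names may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dotfiles/.gitignore: keep sensitive context local on every machine
**/aws-patterns.md
**/terraform-patterns.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;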
&lt;h2&gt;
  
  
  Custom Commands
&lt;/h2&gt;

&lt;p&gt;Beyond context files, Claude Code supports custom&lt;br&gt;
&lt;code&gt;/commands&lt;/code&gt; --- reusable prompts stored as markdown files:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/commands/
├── checkpoint.md       # Create context snapshot
├── sync-work.md        # Update active work status
└── pr-ready.md         # Generate PR description
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;A command file is just the prompt Claude should execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# checkpoint.md
Create a context checkpoint. Read the current git status across active repos,
summarize open PRs and their status, list active tickets with their current
state, and write a structured summary to ~/.claude/local.md. Include any
blocking issues and the next planned action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commands at the global level are available everywhere. Org-level&lt;br&gt;
commands handle employer-specific workflows like Jira transitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Solves in Practice
&lt;/h2&gt;

&lt;p&gt;Before this system: every new Claude session started with "here's the&lt;br&gt;
project, here are the conventions, here's where things are." Five&lt;br&gt;
minutes of ramp-up, inconsistent outputs because I'd forget to mention&lt;br&gt;
something.&lt;/p&gt;

&lt;p&gt;After: I &lt;code&gt;cd&lt;/code&gt; into a repo and Claude already knows the&lt;br&gt;
Jira workflow, the AWS account structure, the naming conventions, and&lt;br&gt;
where the active work stands. When I start a session mid-ticket, the&lt;br&gt;
project-level context tells Claude exactly what was in progress.&lt;/p&gt;

&lt;p&gt;The bigger payoff is consistency. When Claude generates Terraform, it&lt;br&gt;
generates it with the correct state backend. When it writes commit&lt;br&gt;
messages, they follow the format reviewers expect. When it suggests&lt;br&gt;
architecture, it fits the actual account model rather than a generic AWS&lt;br&gt;
example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you're starting from scratch, work tier by tier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with your communication
preferences and git conventions.&lt;/li&gt;
&lt;li&gt; Add a &lt;code&gt;rules/&lt;/code&gt; directory with patterns you want loaded
consistently.&lt;/li&gt;
&lt;li&gt; Create an org-level directory when you start working with a specific
employer or major project.&lt;/li&gt;
&lt;li&gt; Add project-level context when you start a multi-session
effort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't try to build the whole system at once. The global tier alone&lt;br&gt;
eliminates most of the per-session ramp-up. The org and project tiers&lt;br&gt;
pay off as work gets more complex.&lt;/p&gt;

&lt;p&gt;The thing that surprised me most wasn't the time saved on ramp-up. It&lt;br&gt;
was how much the output quality improved. When Claude knows the actual&lt;br&gt;
state backend, the actual account IDs, the actual PR format your&lt;br&gt;
reviewers expect --- the suggestions it makes fit your environment. That&lt;br&gt;
gap between "technically correct" and "actually usable" is where most of&lt;br&gt;
the friction in AI-assisted infrastructure work lives. The context&lt;br&gt;
hierarchy is mostly just closing that gap.&lt;/p&gt;

&lt;p&gt;If you're setting up Claude Code for a platform team and want to talk&lt;br&gt;
through the context design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I do advisory&lt;br&gt;
engagements&lt;/a&gt; for teams getting serious about AI tooling in their&lt;br&gt;
infrastructure workflow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>developertooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>Great design, and easy to follow.</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:23:46 +0000</pubDate>
      <link>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</link>
      <guid>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/cbecerra" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3545716%2Fa8cbf641-51dd-4f99-ad6b-abe0f714fa3b.jpeg" alt="cbecerra"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/cbecerra/how-to-implement-aws-network-firewall-in-a-multi-account-architecture-using-transit-gateway-2nam" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Implement AWS Network Firewall in a Multi-Account Architecture Using Transit Gateway&lt;/h2&gt;
      &lt;h3&gt;Cristhian Becerra ・ Oct 13 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#english&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#networking&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cybersecurity&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>english</category>
      <category>aws</category>
      <category>networking</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Building Automated AWS Permission Testing Infrastructure for CI/CD</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:14:50 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</link>
      <guid>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/aws-permission-testing-cicd/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I deployed a permission set for our data engineers five times before&lt;br&gt;
it worked correctly.&lt;/p&gt;

&lt;p&gt;The first deployment: S3 reads worked, Glue Data Catalog reads&lt;br&gt;
worked. Athena queries failed --- the query engine needs KMS decrypt&lt;br&gt;
through a service principal, and I'd missed the&lt;br&gt;
&lt;code&gt;kms:ViaService&lt;/code&gt; condition. Second deployment: Athena worked.&lt;br&gt;
EMR Serverless job submission failed --- missing&lt;br&gt;
&lt;code&gt;iam:PassRole&lt;/code&gt;. Third deployment: EMR submission worked. Job&lt;br&gt;
execution failed --- missing permissions on the EMR Serverless execution&lt;br&gt;
role boundary. I kept deploying, engineers kept getting blocked, I kept&lt;br&gt;
opening tickets.&lt;/p&gt;

&lt;p&gt;Five iterations. Two weeks. Every failure meant a data engineer&lt;br&gt;
opened a ticket instead of running their job.&lt;/p&gt;

&lt;p&gt;The problem wasn't that IAM is complicated --- it is, but that's&lt;br&gt;
expected. The problem was that I had no way to catch these issues before&lt;br&gt;
deploying to the account where real engineers were trying to do real&lt;br&gt;
work. Every bug was a production bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  The "Access Denied" Debugging Loop
&lt;/h2&gt;

&lt;p&gt;Here's what the reactive debugging cycle looks like from the&lt;br&gt;
inside.&lt;/p&gt;

&lt;p&gt;Engineer opens a ticket:&lt;br&gt;
&lt;code&gt;AccessDeniedException: User is not authorized to perform: s3:GetObject&lt;/code&gt;.&lt;br&gt;
I add &lt;code&gt;s3:GetObject&lt;/code&gt; to the permission set. Next day:&lt;br&gt;
&lt;code&gt;AccessDeniedException: s3:PutObject&lt;/code&gt;. I add&lt;br&gt;
&lt;code&gt;s3:PutObject&lt;/code&gt;. Day after: write succeeds but cleanup fails ---&lt;br&gt;
&lt;code&gt;s3:DeleteObject&lt;/code&gt;. At this point I've done four deployment&lt;br&gt;
cycles and two days of work to get S3 read/write/delete working. If I'd&lt;br&gt;
just added &lt;code&gt;s3:*&lt;/code&gt; I'd be done, but that violates&lt;br&gt;
least-privilege and opens the raw zone to write access, which we&lt;br&gt;
explicitly don't want.&lt;/p&gt;

&lt;p&gt;The deeper issue is that individual services don't fail atomically.&lt;br&gt;
Athena requires &lt;code&gt;athena:StartQueryExecution&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryResults&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryExecution&lt;/code&gt;, but it also requires KMS decrypt&lt;br&gt;
through the Athena service principal to read encrypted S3 results. That&lt;br&gt;
last piece isn't in the Athena docs --- you find it by failing in&lt;br&gt;
production.&lt;/p&gt;

&lt;p&gt;I wanted a way to find it before deploying.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The testing framework has four components: per-persona permission set&lt;br&gt;
templates, a Bash test library, per-service test scripts, and a GitHub&lt;br&gt;
Actions workflow that runs everything on pull requests.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  GitHub Pull Request (Permission Set Changes)   │
└───────────────────┬─────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │  CI/CD Workflow     │
         │  (GitHub Actions)   │
         └──────────┬──────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌───────┐      ┌──────────┐   ┌──────────┐
│ S3    │      │  Glue    │   │ Athena   │
│ Tests │      │  Tests   │   │  Tests   │
└───────┘      └──────────┘   └──────────┘
                    │
         ┌──────────▼──────────┐
         │  Test Report        │
         │  (Posted to PR)     │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow triggers on any pull request that modifies the&lt;br&gt;
identity-center Terraform directory. Tests run against real AWS accounts&lt;br&gt;
--- dev and nonprod --- using test credentials provisioned for that purpose.&lt;br&gt;
Results post as a PR comment before anyone approves the change.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 1: Pre-Validated Templates
&lt;/h2&gt;

&lt;p&gt;Before I wrote a single test, I needed a starting point for&lt;br&gt;
permission sets that captured the patterns I'd learned the hard way.&lt;br&gt;
Templates that handle the non-obvious pieces --- zone-scoped S3 access,&lt;br&gt;
KMS conditions tied to specific services, explicit denies for&lt;br&gt;
destructive operations.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AnalystAccess&lt;/code&gt; template is representative. Analysts&lt;br&gt;
get read-only access to the curated zone of the data lake, Athena query&lt;br&gt;
execution in the primary workgroup, and KMS decrypt --- but only when the&lt;br&gt;
decrypt request originates from S3 or Athena, not from arbitrary API&lt;br&gt;
calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;inline_policy&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
  &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GlueCatalogReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetPartitions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:SearchTables"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:database/curated_*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:table/curated_*/*"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3CuratedReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*/curated/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"s3:prefix"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curated/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AthenaQueryExecution"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"athena:StartQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryResults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:StopQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:athena:*:*:workgroup/primary"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMSDecryptViaSvc"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:*:*:key/*"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"kms:ViaService"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DenyDestructiveOps"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Deny"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:DeleteObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:DeleteBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteTable"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kms:ViaService&lt;/code&gt; condition is the piece that took five&lt;br&gt;
production failures to discover. KMS decrypt without that condition&lt;br&gt;
allows an analyst to call &lt;code&gt;kms:Decrypt&lt;/code&gt; directly from their&lt;br&gt;
shell, which is not what we want. The condition locks decrypt to&lt;br&gt;
requests that pass through S3 or Athena specifically.&lt;/p&gt;

&lt;p&gt;The explicit deny block matters too. Without it, if someone later&lt;br&gt;
grants broader S3 permissions to this persona for a different reason,&lt;br&gt;
the curated zone protection evaporates. The deny creates a hard floor&lt;br&gt;
regardless of what else gets added.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 2: The Test Framework
&lt;/h2&gt;

&lt;p&gt;I chose Bash over Python or a proper test framework deliberately. The&lt;br&gt;
tests run in CI with no dependencies beyond the AWS CLI --- no package&lt;br&gt;
installs, no virtual environments, no version pinning of test libraries.&lt;br&gt;
The machines running these tests already have the AWS CLI.&lt;/p&gt;

&lt;p&gt;The core library in &lt;code&gt;lib/test-framework.sh&lt;/code&gt;:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"

  if eval "$test_command" &amp;&gt;/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ ${#TESTS_FAILED[@]} -gt 0 ] &amp;&amp; printf '  - %s\n' "${TESTS_FAILED[@]}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



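&lt;p&gt;Because the library has no dependencies, the harness itself can be&lt;br&gt;
smoke-tested locally before any real AWS calls are wired in. A minimal sketch&lt;br&gt;
of the same pattern, with &lt;code&gt;true&lt;/code&gt; and &lt;code&gt;false&lt;/code&gt; standing&lt;br&gt;
in for AWS CLI commands:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Local smoke test of the run_test pattern: true/false stand in for
# real AWS CLI commands, so this runs anywhere.
declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local name="$1" cmd="$2"
  if eval "$cmd"; then
    TESTS_PASSED+=("$name")
    echo "PASS: $name"
  else
    TESTS_FAILED+=("$name")
    echo "FAIL: $name"
  fi
}

run_test "always-passes" "true"
run_test "always-fails"  "false"
echo "Passed: ${#TESTS_PASSED[@]} Failed: ${#TESTS_FAILED[@]}"
```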

&lt;p&gt;The most important design decision in the test scripts is testing&lt;br&gt;
denials as carefully as allowances. Testing only what should succeed&lt;br&gt;
tells you the permission set isn't obviously broken. Testing what should&lt;br&gt;
fail tells you it's not accidentally too permissive.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Test what should succeed
run_test "s3-list-curated"
  "aws s3 ls s3://lake-bucket-dev/curated/"
  "Analyst can list curated zone"

# Test what should fail (negative test)
run_test "s3-write-denied"
  "! aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'"
  "Analyst cannot write to curated zone"

run_test "s3-raw-zone-denied"
  "! aws s3 ls s3://lake-bucket-dev/raw/ 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'"
  "Analyst cannot access raw zone"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Beyond service-level tests, I run persona tests that simulate&lt;br&gt;
end-to-end workflows. An analyst's workflow isn't "call S3, then call&lt;br&gt;
Athena separately" --- it's "run an Athena query that reads encrypted S3&lt;br&gt;
data and writes results to the query results bucket." That integration&lt;br&gt;
test catches failures that individual service tests miss. The original&lt;br&gt;
five-iteration DataPlatformAccess failure? An individual S3 test would&lt;br&gt;
have passed. A persona test running an actual Athena query against the&lt;br&gt;
encrypted lake would have caught the KMS gap.&lt;/p&gt;
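&lt;p&gt;A persona test is just a small script that walks the workflow end to end&lt;br&gt;
and asserts on the final state. The skeleton below is a sketch of that shape:&lt;br&gt;
the stub functions are placeholders for the real&lt;br&gt;
&lt;code&gt;aws athena start-query-execution&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;aws athena get-query-execution&lt;/code&gt; calls (table and query names are&lt;br&gt;
hypothetical), so the control flow itself runs anywhere:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Persona-test skeleton: run a query end to end and assert on the final
# state. Stubs replace the real AWS CLI calls so this runs locally;
# in the real script each stub is an aws athena invocation.
start_query() { echo "qid-123"; }       # stub for: aws athena start-query-execution
wait_for_query() { echo "SUCCEEDED"; }  # stub for: polling aws athena get-query-execution

qid=$(start_query "SELECT count(*) FROM curated_sales.orders")
state=$(wait_for_query "$qid")

if [ "$state" = "SUCCEEDED" ]; then
  echo "persona-analyst-athena: PASS"
else
  echo "persona-analyst-athena: FAIL ($state)"
fi
```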
&lt;h2&gt;
  
  
  Phase 3: CI/CD Integration
&lt;/h2&gt;

&lt;p&gt;The GitHub Actions workflow triggers on pull requests that touch the&lt;br&gt;
identity-center Terraform directory, runs tests in a matrix against dev&lt;br&gt;
and nonprod, and posts a summary comment to the PR.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    strategy:
      matrix:
        environment: [workloads-dev, workloads-nonprod]
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.environment.account }}:role/github-actions-role
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;id-token: write&lt;/code&gt; permission is required for OIDC&lt;br&gt;
authentication to AWS --- the workflow assumes a role in each account&lt;br&gt;
rather than using long-lived credentials in GitHub Secrets. This is the&lt;br&gt;
right pattern: credentials rotate automatically, and there's no secret&lt;br&gt;
to rotate manually or accidentally expose.&lt;/p&gt;
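&lt;p&gt;For reference, the OIDC pattern hinges on the trust policy of the role in&lt;br&gt;
each account. A representative sketch (the account ID and the&lt;br&gt;
&lt;code&gt;repo:my-org/iac&lt;/code&gt; subject condition are placeholders, not taken&lt;br&gt;
from the original setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::111111111111:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
      "StringLike":   { "token.actions.githubusercontent.com:sub": "repo:my-org/iac:*" }
    }
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;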

&lt;p&gt;The PR comment posts the full test output with pass/fail counts per&lt;br&gt;
persona per account. A reviewer can look at the comment and immediately&lt;br&gt;
see whether the permission change has test coverage and whether the&lt;br&gt;
tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Things I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;First: test KMS decryption through each service separately.&lt;br&gt;
&lt;code&gt;kms:Decrypt&lt;/code&gt; via S3 and &lt;code&gt;kms:Decrypt&lt;/code&gt; via Athena&lt;br&gt;
are different IAM evaluation paths even though they're the same API&lt;br&gt;
call. A test that puts an object and gets it back via S3 directly won't&lt;br&gt;
catch a broken Athena KMS path.&lt;/p&gt;

&lt;p&gt;Second: negative tests matter as much as positive ones. Before I had&lt;br&gt;
the test framework, every permission set I wrote was tested only for&lt;br&gt;
what it should allow. I had no systematic check that it didn't allow&lt;br&gt;
more. The denial tests are what give security reviewers confidence.&lt;/p&gt;

&lt;p&gt;Third: persona tests catch failures that service tests miss.&lt;br&gt;
Individual service tests are fast to write and good for regression&lt;br&gt;
coverage, but they test permissions in isolation. Real workflows cross&lt;br&gt;
service boundaries. Build both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before the framework: five iterations to get one permission set&lt;br&gt;
right, every iteration a production impact. After: 95% of permission&lt;br&gt;
issues caught at PR review time. Zero production impacts from permission&lt;br&gt;
bugs since we shipped it. The templates reduced new permission set&lt;br&gt;
creation time by about 70% --- instead of starting from scratch with the&lt;br&gt;
IAM documentation, we start from a pre-validated base and modify from&lt;br&gt;
there.&lt;/p&gt;

&lt;p&gt;The time investment was about a week: two days for templates, two&lt;br&gt;
days for the test framework and scripts, one day for CI/CD integration&lt;br&gt;
and documentation. That investment paid back in the first sprint when&lt;br&gt;
the analyst permission set for a new hire went out correct on the first&lt;br&gt;
deployment.&lt;/p&gt;

&lt;p&gt;Running into IAM permission debugging loops on your team? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; --- permission testing infrastructure is&lt;br&gt;
one of the first things I build when joining a new platform team.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>iam</category>
    </item>
    <item>
      <title>Building Apache Iceberg Lakehouse Storage with S3 Table Buckets</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 16 Mar 2026 23:28:31 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-4dio</link>
      <guid>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-4dio</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The data platform team had a deadline and a storage decision to make.&lt;br&gt;
They'd committed to Apache Iceberg as the table format --- open standard,&lt;br&gt;
time travel, schema evolution, the usual reasons. What they hadn't&lt;br&gt;
locked down was where the data was actually going to live, and whether&lt;br&gt;
the storage layer would hold up under the metadata-heavy access patterns&lt;br&gt;
Iceberg requires.&lt;/p&gt;

&lt;p&gt;The default answer is regular S3. It works. Most Iceberg deployments&lt;br&gt;
run on it. But AWS launched S3 Table Buckets in late 2024, and they're&lt;br&gt;
purpose-built for exactly this workload: Iceberg metadata operations.&lt;br&gt;
The numbers made the decision easy --- 10x faster metadata queries, 50% or&lt;br&gt;
more improvement in query planning time compared to standard S3. The&lt;br&gt;
gotcha worth knowing upfront: S3 Table Bucket support requires AWS&lt;br&gt;
Provider 5.70 or later. If your Terraform modules are pinned to an older&lt;br&gt;
provider version, that's your first upgrade.&lt;/p&gt;

&lt;p&gt;We built the storage layer as a three-zone medallion architecture,&lt;br&gt;
fully managed with Terraform, with Intelligent-Tiering configured from&lt;br&gt;
day one. Here's how we did it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;Three zones, each with a clear contract about what data lives there&lt;br&gt;
and who owns it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp2jugju16x8egbj3q4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjp2jugju16x8egbj3q4j.png" alt="Medallion architecture --- three-zone lakehouse: Source Systems flow into Raw Zone (immutable landing), then ETL into Clean Zone (normalized), then aggregation into Curated Zone (analytics-ready), consumed by Athena, QuickSight, and Tableau" width="552" height="1404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw is immutable. Once data lands there, it doesn't change --- ETL&lt;br&gt;
failures don't corrupt the source record because the source record is&lt;br&gt;
untouched. Clean is normalized and domain-aligned, owned by data&lt;br&gt;
engineering. Curated is the analytics layer that BI tools, Athena&lt;br&gt;
queries, and QuickSight dashboards read from.&lt;/p&gt;

&lt;p&gt;The naming convention we landed on was &lt;code&gt;{zone}_{domain}&lt;/code&gt;&lt;br&gt;
for Glue databases --- &lt;code&gt;raw_crm&lt;/code&gt;, &lt;code&gt;clean_customer&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;curated_sales_metrics&lt;/code&gt;. It looks minor, but it matters. When&lt;br&gt;
you're looking at a table in Athena or debugging a failed Glue job, the&lt;br&gt;
database name tells you exactly what tier you're in and what domain&lt;br&gt;
you're touching. Namespace collisions become impossible because the zone&lt;br&gt;
prefix scopes every domain. Data lineage is readable from table names&lt;br&gt;
alone.&lt;/p&gt;
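&lt;p&gt;If you want the convention enforced rather than just documented, a Terraform variable validation can reject names that break the pattern. This is a hypothetical sketch, not part of our modules --- the variable name and regex are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch --- enforce the {zone}_{domain} convention at plan time
variable "database_name" {
  type = string

  validation {
    # Zone prefix must be one of the three medallion tiers
    condition     = can(regex("^(raw|clean|curated)_[a-z0-9_]+$", var.database_name))
    error_message = "Database names must follow {zone}_{domain}, e.g. raw_crm."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that in place, a misnamed database fails &lt;code&gt;terraform plan&lt;/code&gt; instead of surfacing later as a confusing catalog entry.&lt;/p&gt;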
&lt;h2&gt;
  
  
  Why Two Modules Instead of One
&lt;/h2&gt;

&lt;p&gt;The first design question was whether to build a single composite&lt;br&gt;
module that creates the KMS key and the S3 Table Bucket together, or&lt;br&gt;
split them into separate modules. We split them.&lt;/p&gt;

&lt;p&gt;The KMS key isn't just for the lake. It's used by five downstream&lt;br&gt;
services: Athena for query results, EMR for cluster encryption, MWAA for&lt;br&gt;
DAG storage, Kinesis for stream encryption, and Glue DataBrew for&lt;br&gt;
transform outputs. If we bundled the key into the lake storage module,&lt;br&gt;
every one of those services would need a dependency chain that&lt;br&gt;
eventually resolves back through lake storage just to get a KMS key ARN.&lt;br&gt;
Separate modules mean the key has one owner, and everything else&lt;br&gt;
declares a dependency on it independently.&lt;/p&gt;

&lt;p&gt;The KMS module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kms-key/main.tf
# Resolves the account ID referenced in the key policy below
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
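&lt;p&gt;The module exposes the key ARN as an output so each consumer declares its own dependency. The file layout here is a sketch, but the output name matches the &lt;code&gt;dependency.kms.outputs.key_arn&lt;/code&gt; reference in the Terragrunt config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kms-key/outputs.tf
output "key_arn" {
  description = "ARN of the lake KMS key, consumed by dependent modules"
  value       = aws_kms_key.this.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;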



&lt;p&gt;The &lt;code&gt;service_principals&lt;/code&gt; variable takes a list of service&lt;br&gt;
principal strings ---&lt;br&gt;
&lt;code&gt;["athena.amazonaws.com", "glue.amazonaws.com"]&lt;/code&gt; and so on.&lt;br&gt;
Adding a new service that needs key access is one line in the Terragrunt&lt;br&gt;
config, no module change required.&lt;/p&gt;
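&lt;p&gt;A minimal sketch of what that looks like on both sides --- the variable declaration and the Terragrunt input (the exact service list is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kms-key/variables.tf --- sketch
variable "service_principals" {
  description = "Service principals granted Decrypt/GenerateDataKey/CreateGrant"
  type        = list(string)
  default     = []
}

# kms-key/terragrunt.hcl --- sketch
inputs = {
  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "kinesis.amazonaws.com",  # new service: one line, no module change
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;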

&lt;p&gt;&lt;strong&gt;Working through something similar?&lt;/strong&gt; I advise platform teams on AWS infrastructure --- multi-account architecture, Transit Gateway, EKS, and Terraform IaC. &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The S3 Table Bucket Module
&lt;/h2&gt;

&lt;p&gt;The table bucket itself is straightforward. The interesting part is&lt;br&gt;
Intelligent-Tiering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name
}

resource "aws_s3_bucket_intelligent_tiering_configuration" "this" {
  count  = var.enable_intelligent_tiering ? 1 : 0
  bucket = aws_s3tables_table_bucket.this.name
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We enable Intelligent-Tiering on the entire bucket from the start.&lt;br&gt;
The 90-day threshold for Archive Access and 180-day threshold for Deep&lt;br&gt;
Archive weren't arbitrary --- they match the typical access patterns for a&lt;br&gt;
data lake: raw data is queried heavily during initial load and&lt;br&gt;
validation, then access drops off sharply once the clean layer is&lt;br&gt;
populated.&lt;/p&gt;

&lt;p&gt;The reason Intelligent-Tiering beats manual lifecycle policies here&lt;br&gt;
is subtle but important. A manual lifecycle policy moves data based on&lt;br&gt;
age. Intelligent-Tiering moves data based on actual access patterns. If&lt;br&gt;
a dataset from eight months ago suddenly becomes relevant for a&lt;br&gt;
compliance audit, Intelligent-Tiering keeps it in a more accessible tier&lt;br&gt;
automatically. A manual policy would have moved it to Deep Archive on&lt;br&gt;
day 180 regardless. For a data lake, where access patterns are genuinely&lt;br&gt;
unpredictable, letting AWS monitor actual usage is worth the small&lt;br&gt;
monitoring fee.&lt;/p&gt;

&lt;p&gt;The Terragrunt dependency chain wires the KMS key ARN into the table&lt;br&gt;
bucket configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lake-storage/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms-key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"company-lake-${local.environment}"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Glue Data Catalog
&lt;/h2&gt;

&lt;p&gt;We provisioned 12 Glue databases across the three zones --- four&lt;br&gt;
domains per zone (CRM, customer, sales, operations). The Terraform for&lt;br&gt;
each database includes the Iceberg metadata parameters that enable&lt;br&gt;
Iceberg table format for all tables created in that database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_glue_catalog_database" "this" {
  name = "raw_crm"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
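&lt;p&gt;The resource above hardcodes &lt;code&gt;raw_crm&lt;/code&gt; for illustration. One way to stamp out all twelve databases from the zone and domain lists --- a sketch, not necessarily how our modules are factored --- is a &lt;code&gt;for_each&lt;/code&gt; over the zone/domain product:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch --- twelve databases from a zone x domain product
resource "aws_glue_catalog_database" "zone_domain" {
  for_each = toset([
    for pair in setproduct(["raw", "clean", "curated"],
                           ["crm", "customer", "sales", "operations"]) :
    "${pair[0]}_${pair[1]}"
  ])

  name = each.value
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;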



&lt;p&gt;Format version 2 is the current Iceberg spec. It unlocks row-level&lt;br&gt;
deletes, which is required for GDPR compliance --- when a user requests&lt;br&gt;
deletion, you can execute a targeted delete on the Iceberg table rather&lt;br&gt;
than rewriting entire Parquet partitions.&lt;/p&gt;

&lt;p&gt;One thing that's easy to miss: Glue databases with Iceberg parameters&lt;br&gt;
set don't automatically create Iceberg tables. The database parameters&lt;br&gt;
act as defaults and metadata; actual table creation still happens via&lt;br&gt;
your ETL tooling (Glue jobs, Spark, Flink). What you get from Terraform&lt;br&gt;
is the catalog structure and the governance layer --- databases,&lt;br&gt;
permissions, encryption settings --- so that when the data engineering&lt;br&gt;
team writes their first Glue job, the infrastructure is already in&lt;br&gt;
place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Model
&lt;/h2&gt;

&lt;p&gt;When I presented this to the platform team's tech lead, the cost&lt;br&gt;
projection was what turned a "nice to have" into a "let's do this&lt;br&gt;
now."&lt;/p&gt;

&lt;p&gt;For a 100TB lake:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Timeframe&lt;/th&gt;&lt;th&gt;Storage Tier&lt;/th&gt;&lt;th&gt;Monthly Cost&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;First 90 days&lt;/td&gt;&lt;td&gt;Standard&lt;/td&gt;&lt;td&gt;~$2,300&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;After 90 days&lt;/td&gt;&lt;td&gt;Archive Access&lt;/td&gt;&lt;td&gt;~$400&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;After 180 days&lt;/td&gt;&lt;td&gt;Deep Archive&lt;/td&gt;&lt;td&gt;~$100&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;That's roughly 80% savings once the bulk of the data ages past 90&lt;br&gt;
days, and 95% savings at 180 days. The Intelligent-Tiering monitoring&lt;br&gt;
cost is $0.0025 per 1,000 objects --- on a 100TB lake with typical Iceberg&lt;br&gt;
file sizes, that's a few dollars a month. Negligible.&lt;/p&gt;
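&lt;p&gt;The back-of-envelope arithmetic behind those figures can be sketched as Terraform locals. The per-GB rates here are my approximations of us-east-1 list pricing, not numbers from the article's billing data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch --- rough cost model; per-GB rates are approximate assumptions
locals {
  lake_gb = 100 * 1024 # 100 TB

  standard_rate     = 0.023 # $/GB-month, S3 Standard (approx.)
  archive_rate      = 0.004 # $/GB-month, Archive Access tier (approx.)
  deep_archive_rate = 0.001 # $/GB-month, Deep Archive Access tier (approx.)

  standard_monthly     = local.lake_gb * local.standard_rate     # ~2,355
  archive_monthly      = local.lake_gb * local.archive_rate      # ~410
  deep_archive_monthly = local.lake_gb * local.deep_archive_rate # ~102
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the monitoring fee: at an assumed 100MB average Iceberg file size, 100TB is roughly a million objects, which at $0.0025 per 1,000 objects works out to about $2.50 a month.&lt;/p&gt;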

&lt;p&gt;The S3 Table Bucket metadata performance improvement compounds this.&lt;br&gt;
Faster query planning means less Athena scan time, which means lower&lt;br&gt;
query costs and faster results for analysts. The platform pays for&lt;br&gt;
itself in reduced query costs as the data volume grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Sequence
&lt;/h2&gt;

&lt;p&gt;The deployment order is driven by dependencies: KMS must exist before&lt;br&gt;
S3 (bucket encryption needs the key ARN), and both must exist before&lt;br&gt;
Glue (catalog databases reference the bucket location).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzybx88gd1rhxz7dvaxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzybx88gd1rhxz7dvaxu.png" alt="Deployment sequence: KMS Key must be created first, then S3 Table Bucket (which uses the key ARN), then Glue Data Catalog" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, across three environments (dev, nonprod, prod), the full&lt;br&gt;
deployment took about four hours. Most of that was Terragrunt apply time&lt;br&gt;
--- the actual resource creation for each component is fast, but we ran&lt;br&gt;
plan, reviewed, applied, and verified before moving to the next&lt;br&gt;
environment.&lt;/p&gt;

&lt;p&gt;One deployment note: the first time you run&lt;br&gt;
&lt;code&gt;terragrunt plan&lt;/code&gt; on the Glue module in an account that&lt;br&gt;
hasn't had Glue configured before, you'll get an error about the Glue&lt;br&gt;
service-linked role not existing. Fix it by running&lt;br&gt;
&lt;code&gt;aws iam create-service-linked-role --aws-service-name glue.amazonaws.com&lt;/code&gt;&lt;br&gt;
before the apply. It only needs to happen once per account.&lt;/p&gt;
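&lt;p&gt;If you'd rather keep that step in code than as a one-off CLI command, Terraform can manage the service-linked role directly --- a sketch; if the role already exists in the account, you'd import it rather than create it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch --- create the Glue service-linked role once per account
resource "aws_iam_service_linked_role" "glue" {
  aws_service_name = "glue.amazonaws.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;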

&lt;h2&gt;
  
  
  What the Data Team Inherited
&lt;/h2&gt;

&lt;p&gt;When we handed this over to the data engineering team, they had a&lt;br&gt;
fully provisioned catalog --- three zones, twelve databases, Iceberg&lt;br&gt;
metadata configured, encryption enabled, Intelligent-Tiering active.&lt;br&gt;
They could start writing Glue jobs and creating tables immediately&lt;br&gt;
without worrying about storage configuration, access patterns, or cost&lt;br&gt;
optimization after the fact.&lt;/p&gt;

&lt;p&gt;The Terraform modules are reusable. Adding a new domain (say, a&lt;br&gt;
&lt;code&gt;finance&lt;/code&gt; domain across all three zones) is three database&lt;br&gt;
resource declarations and one pull request. The KMS key, bucket, and&lt;br&gt;
Intelligent-Tiering configuration don't change.&lt;/p&gt;
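&lt;p&gt;As a sketch, the finance addition is just this repeated once per zone (parameters matching the existing databases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch --- one of the three declarations in the finance PR
resource "aws_glue_catalog_database" "raw_finance" {
  name = "raw_finance"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}

# clean_finance and curated_finance follow the same pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;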

&lt;p&gt;S3 Table Buckets are still relatively new, and the Terraform provider&lt;br&gt;
support came together in late 2024. If your team is planning an Iceberg&lt;br&gt;
migration and hasn't evaluated Table Buckets yet, the metadata&lt;br&gt;
performance gains and the cost trajectory make a strong case for&lt;br&gt;
starting there rather than retrofitting later.&lt;/p&gt;

&lt;p&gt;Building out a data platform and figuring out the storage and catalog&lt;br&gt;
architecture? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; --- this kind of&lt;br&gt;
infrastructure design work is something I do regularly, whether you're&lt;br&gt;
starting from scratch or migrating an existing lake.&lt;/p&gt;

</description>
      <category>apacheiceberg</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>s3</category>
    </item>
  </channel>
</rss>
