<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Glenn Gray</title>
    <description>The latest articles on Forem by Glenn Gray (@tallgray1).</description>
    <link>https://forem.com/tallgray1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817657%2F22cc7f4e-c345-484f-89b0-07068c02c9c7.png</url>
      <title>Forem: Glenn Gray</title>
      <link>https://forem.com/tallgray1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tallgray1"/>
    <language>en</language>
    <item>
      <title>ECS vs EKS in 2026: The Decision Framework—Including ECS Anywhere</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 19 May 2026 14:19:34 +0000</pubDate>
      <link>https://forem.com/tallgray1/ecs-vs-eks-in-2026-the-decision-framework-including-ecs-anywhere-47op</link>
      <guid>https://forem.com/tallgray1/ecs-vs-eks-in-2026-the-decision-framework-including-ecs-anywhere-47op</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/ecs-vs-eks-decision-framework/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The CTO wanted to know why the platform team had picked EKS for their new environment. They'd been running ECS for two years without issues. The team lead explained they needed GitOps, better autoscaling, and "industry-standard tooling."&lt;/p&gt;

&lt;p&gt;Three months later, they were debugging a cert-manager webhook failure at 11am. Two engineers had spent 30 hours the previous month on cluster operations. They hadn't shipped a net-new feature in six weeks.&lt;/p&gt;

&lt;p&gt;EKS wasn't wrong for them. The timing was. They had three engineers, twelve services, and no one who'd operated a Kubernetes cluster in production before. The ecosystem they wanted required them to operate it first.&lt;/p&gt;

&lt;p&gt;This is the ECS vs EKS conversation most teams don't have until after they've made the choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Decision Axis
&lt;/h2&gt;

&lt;p&gt;Feature comparisons miss the point. Both ECS and EKS run containers reliably. The real question is: what does your team have to operate to make that happen — and what's the cost of getting it wrong?&lt;/p&gt;

&lt;p&gt;Two axes matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational capacity&lt;/strong&gt;: How much complexity can your team absorb while still shipping product? A 3-engineer platform team and a 15-engineer platform team are not playing the same game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes maturity&lt;/strong&gt;: Have your engineers operated k8s in production under pressure? "We've done some k8s" and "we've debugged etcd under load" are not the same thing.&lt;/p&gt;

&lt;p&gt;The answer to which one you should use today often changes in 18 months. A team that's right for ECS now may be right for EKS after their platform engineers have shipped 6 months of Kubernetes work. Building with that arc in mind matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ECS Actually Gives You
&lt;/h2&gt;

&lt;p&gt;No control plane. That's the headline. With Fargate, there are no nodes to patch, no node groups to right-size, no kubelet to troubleshoot. AWS manages the underlying compute entirely.&lt;/p&gt;

&lt;p&gt;The IAM model is simpler by design. Task roles attach directly to task definitions — no service accounts, no IRSA, no Web Identity tokens to wire up. For engineers coming from EC2-era IAM, this maps cleanly to what they already know.&lt;/p&gt;
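
&lt;p&gt;As a rough illustration of that model, here is a minimal Terraform sketch of a Fargate task definition with its task role attached directly. The service name &lt;code&gt;orders-api&lt;/code&gt; and the two input variables are hypothetical, not taken from a real module.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: "orders-api" and both variables are hypothetical names.
variable "execution_role_arn" {
  type = string
}

variable "image" {
  type = string
}

resource "aws_iam_role" "task" {
  name = "orders-api-task"

  # ECS tasks assume the role through the ecs-tasks service principal.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_ecs_task_definition" "orders_api" {
  family                   = "orders-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  # The task role is set on the task definition itself:
  # no service account or OIDC provider involved.
  task_role_arn      = aws_iam_role.task.arn
  execution_role_arn = var.execution_role_arn

  container_definitions = jsonencode([{
    name      = "orders-api"
    image     = var.image
    essential = true
  }])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;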

&lt;p&gt;ECS Fargate has no cluster fixed cost. EKS charges $0.10/hr per cluster — $72/month whether you're running one service or fifty. At low service counts or in non-production environments, that difference is real.&lt;/p&gt;

&lt;p&gt;AWS integrations are first-class rather than plugged in. ALB target group registration, CloudMap service discovery, Secrets Manager injection via ECS container secrets — these work without Helm charts or CRDs. The AWS API surface and the ECS API surface are the same surface.&lt;/p&gt;

&lt;p&gt;The internal tools team: 3 engineers, zero Kubernetes background, 8 services. ECS Fargate with a shared Terraform module got them to production in three weeks. No platform team required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What EKS Actually Gives You
&lt;/h2&gt;

&lt;p&gt;Ecosystem depth that ECS simply doesn't have. Karpenter for bin-packing and just-in-time node provisioning. KEDA for event-driven autoscaling off SQS, Kafka, or custom metrics. Argo CD or Flux for GitOps with real reconciliation loops. External Secrets Operator, Cert-manager, Prometheus Operator — the tooling is mature, battle-tested, and actively maintained.&lt;/p&gt;

&lt;p&gt;ECS has no equivalent. The closest alternatives are either AWS-native (EventBridge Pipes, Application Auto Scaling) and less flexible, or custom-built and unmaintained after the engineer who wrote them leaves.&lt;/p&gt;

&lt;p&gt;Karpenter in particular changes the EC2 cost math at scale. Intelligent bin-packing and spot interruption handling can cut compute costs 30-50% compared to fixed node groups. Below 20-30 nodes the savings often don't justify the operational overhead. Above that, it's hard to ignore.&lt;/p&gt;

&lt;p&gt;Multi-cloud portability is real if you actually need it. Kubernetes manifests transfer to GKE or AKS. ECS task definitions do not. If "running this workload outside AWS" is a real scenario — not just theoretical — that matters.&lt;/p&gt;

&lt;p&gt;The data platform I worked on: mixed batch and streaming workloads, KEDA scaling on SQS queue depth. ECS autoscaling would have required custom CloudWatch metrics and polling-based triggers. KEDA handled it natively in 20 lines of YAML. That alone settled the decision.&lt;/p&gt;
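
&lt;p&gt;For comparison, the ECS side of that trade-off could look roughly like the sketch below: Application Auto Scaling target tracking against a custom CloudWatch metric. The metric namespace &lt;code&gt;Custom/Workers&lt;/code&gt;, the metric name &lt;code&gt;BacklogPerTask&lt;/code&gt;, the capacity limits, and the variable names are all assumptions for illustration. Something outside this snippet (a scheduled Lambda, for example) would still have to compute and publish that metric.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: metric names, limits, and variables are hypothetical.
variable "cluster_name" {
  type = string
}

variable "service_name" {
  type = string
}

resource "aws_appautoscaling_target" "worker" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = 1
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "backlog" {
  name               = "sqs-backlog-per-task"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.worker.service_namespace
  scalable_dimension = aws_appautoscaling_target.worker.scalable_dimension
  resource_id        = aws_appautoscaling_target.worker.resource_id

  target_tracking_scaling_policy_configuration {
    # Aim for roughly 100 queued messages per running task.
    target_value       = 100
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    # The custom metric must be published separately;
    # Application Auto Scaling only reads it.
    customized_metric_specification {
      namespace   = "Custom/Workers"
      metric_name = "BacklogPerTask"
      statistic   = "Average"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;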

&lt;h2&gt;
  
  
  The Decision Tree
&lt;/h2&gt;

&lt;p&gt;Walk through these in order. First yes wins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3zch5zvciylkyaq6lyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3zch5zvciylkyaq6lyc.png" alt="ECS vs EKS decision framework" width="800" height="1911"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Kubernetes experience on the team?&lt;/strong&gt; → ECS. The operational cost of learning k8s while building product is real and usually underestimated. The 30-hour/month cluster ops tax from the story above was paid by a team that had some k8s experience. Zero experience is worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrating from an existing ECS platform?&lt;/strong&gt; → ECS. Rewriting and replatforming at the same time fails more often than it succeeds. Stabilize on ECS, then migrate later when the workload is boring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need KEDA, custom-metric HPA, or Karpenter?&lt;/strong&gt; → EKS. ECS autoscaling is Application Auto Scaling against CloudWatch metrics. It works, but the ceiling is lower and the custom metric path is significantly more work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need GitOps with Argo CD or Flux?&lt;/strong&gt; → EKS. ECS has no native GitOps story. You can build one — CodePipeline + ECS deployment, Terraform-driven deployments — but you're building it. The operational difference is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Five or more services sharing infrastructure?&lt;/strong&gt; → EKS. The fixed cost justifies it; shared node pools improve utilization; the per-service overhead of ECS task definitions multiplies fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default&lt;/strong&gt; → ECS Fargate. Simpler, cheaper to start, and the migration path to EKS is well-understood.&lt;/p&gt;

&lt;h2&gt;
  
  
  ECS Anywhere: The Third Option
&lt;/h2&gt;

&lt;p&gt;ECS Anywhere gets overlooked in most comparisons because it doesn't fit neatly into a "cloud vs cloud" framing. It should be in the decision tree.&lt;/p&gt;

&lt;p&gt;ECS Anywhere lets you register non-AWS compute — on-premises servers, VMs in other clouds, edge devices — as ECS external instances. Your task definitions, IAM roles, and tooling stay the same. The ECS control plane in AWS manages scheduling. The compute runs wherever you've registered it.&lt;/p&gt;

&lt;p&gt;Where this actually wins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulated environments with data residency requirements.&lt;/strong&gt; If certain workloads must stay on-premises for compliance, ECS Anywhere lets you run them with the same tooling as your AWS workloads. On the GovCloud platform I built, we had ground system software that had to process flight data on local hardware before transmission. ECS Anywhere would have let us manage those workloads from the same ECS cluster as our cloud services — same Terraform modules, same IAM patterns, same observability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brownfield migration.&lt;/strong&gt; If you're moving workloads from on-premises to AWS and want a consistent deployment target during the migration, ECS Anywhere gives you that. Register the on-prem servers, migrate task by task, deregister when done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge compute.&lt;/strong&gt; Consistent deployment tooling across dozens of edge nodes without running a k8s control plane at each site.&lt;/p&gt;

&lt;p&gt;The constraint: ECS Anywhere instances are external infrastructure you own and patch. Fargate's "no nodes to manage" advantage disappears. The tradeoff is deliberate — you're accepting node management in exchange for placement control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration Path
&lt;/h2&gt;

&lt;p&gt;ECS → EKS migration is well-understood and not particularly risky if the IaC is clean.&lt;/p&gt;

&lt;p&gt;Container images move without changes. The two meaningful changes are IAM (task roles → IRSA service accounts, which is mechanical rather than complex) and networking (ALB target group registration → Ingress or Service, also mechanical).&lt;/p&gt;
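
&lt;p&gt;To make the IAM half of that concrete, here is a minimal sketch of the IRSA side: the same kind of role, but trusted through the cluster's OIDC provider instead of the ECS tasks service principal. The namespace &lt;code&gt;orders&lt;/code&gt;, the service account &lt;code&gt;orders-api&lt;/code&gt;, and both variables are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: namespace, service account, and variables are hypothetical.
variable "oidc_provider_arn" {
  type = string
}

# Issuer of the cluster's OIDC provider, without the https:// prefix.
variable "oidc_provider_url" {
  type = string
}

data "aws_iam_policy_document" "irsa_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only this one Kubernetes service account may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:sub"
      values   = ["system:serviceaccount:orders:orders-api"]
    }
  }
}

resource "aws_iam_role" "orders_api" {
  name               = "orders-api-irsa"
  assume_role_policy = data.aws_iam_policy_document.irsa_trust.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Kubernetes service account then carries an &lt;code&gt;eks.amazonaws.com/role-arn&lt;/code&gt; annotation pointing at this role. The permissions policies attached to the role can usually stay as they were on the ECS task role.&lt;/p&gt;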

&lt;p&gt;What breaks the migration is task definitions in CloudFormation or hand-managed console resources. If your ECS deployment is 100% Terraform with a module per service, the migration is boring. If it's six engineers' worth of one-off console configurations, it's archaeology.&lt;/p&gt;

&lt;p&gt;Build ECS as if you'll migrate it. Keep task definitions in Terraform modules, service definitions composable, networking configuration explicit. The Jira ticket for "migrate from ECS to EKS" should feel like plumbing work, not a project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes I See Repeatedly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choosing EKS because it's "industry standard."&lt;/strong&gt; Industry standard at Stripe is not industry standard at a 40-person SaaS company. The operational tax is the same either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing ECS without accounting for the autoscaling ceiling.&lt;/strong&gt; For workloads with bursty, event-driven traffic patterns, ECS autoscaling requires CloudWatch custom metrics and Application Auto Scaling policies that are genuinely annoying to tune. Know the ceiling before you hit it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-cluster EKS for two services.&lt;/strong&gt; The fixed cost of the control plane ($72/month), the operational overhead of running Kubernetes, and the learning curve are all real. For two or three services, this almost never makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Underestimating the Helm/CRD surface area.&lt;/strong&gt; When a Helm-managed CRD conflicts with another controller at 2am, you need someone on the team who can debug it. "We'll figure it out" is not a plan.&lt;/p&gt;

&lt;p&gt;Building a new platform or rearchitecting an existing container environment? The choice between ECS, EKS, and ECS Anywhere usually comes down to where your team is on the Kubernetes maturity curve and what your autoscaling requirements actually are — not which technology is more capable. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; if you're working through this decision — it's a conversation I have with platform teams regularly, and the right answer depends on specifics that don't fit in a blog post.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>eks</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building Apache Iceberg Lakehouse Storage with S3 Table Buckets</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:54:46 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-42oo</link>
      <guid>https://forem.com/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-42oo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.&lt;/p&gt;

&lt;p&gt;The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.&lt;/p&gt;

&lt;p&gt;We built the storage layer as a three-zone medallion architecture, fully managed with Terraform. Here's how we did it — including a few things about Table Buckets that don't show up in most writeups.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;One table bucket per environment. Zones are namespaces inside the bucket — not separate buckets, not separate Glue databases in the legacy sense:&lt;/p&gt;

&lt;p&gt;&lt;a href="/diagrams/diag-apache-iceberg-medallion.png" class="article-body-image-wrapper"&gt;&lt;img src="/diagrams/diag-apache-iceberg-medallion.png" alt="Medallion architecture — one S3 Table Bucket per environment with raw, clean, and curated namespaces inside. DMS ingests from source systems into raw. EMR Serverless Spark transforms raw to clean and clean to curated. Glue exposes a federated s3tablescatalog integration layer. Athena queries through Glue. BI layer (Superset) sits on top of Athena."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, produced by Spark transforms. Curated is the analytics layer that Athena queries and BI dashboards read from.&lt;/p&gt;

&lt;p&gt;The namespace naming convention we used was &lt;code&gt;{zone}_{domain}&lt;/code&gt; — &lt;code&gt;raw_crm&lt;/code&gt;, &lt;code&gt;clean_customer&lt;/code&gt;, &lt;code&gt;curated_sales_metrics&lt;/code&gt;. When you're looking at a table in Athena or debugging a failed transform job, the namespace name tells you exactly what tier you're in and what domain you're touching. Data lineage is readable from table names alone.&lt;/p&gt;
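
&lt;p&gt;One way to encode that convention is to generate the namespace names from a zone-to-domain map, so a name cannot drift from the pattern. This is a sketch under assumptions: the variable names and the default map are illustrative, not the module we shipped.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: variable names and defaults are hypothetical.
variable "table_bucket_arn" {
  type = string
}

variable "domains_by_zone" {
  type = map(list(string))
  default = {
    raw     = ["crm"]
    clean   = ["customer"]
    curated = ["sales_metrics"]
  }
}

locals {
  # Build {zone}_{domain} names: raw_crm, clean_customer, curated_sales_metrics.
  namespaces = toset(flatten([
    for zone, domains in var.domains_by_zone : [
      for domain in domains : "${zone}_${domain}"
    ]
  ]))
}

resource "aws_s3tables_namespace" "this" {
  for_each = local.namespaces

  namespace        = each.value
  table_bucket_arn = var.table_bucket_arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;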

&lt;h2&gt;
  
  
  Why Two Modules Instead of One
&lt;/h2&gt;

&lt;p&gt;The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.&lt;/p&gt;

&lt;p&gt;The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.&lt;/p&gt;

&lt;p&gt;The KMS module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kms-key/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_key"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;
  &lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enable IAM User Permissions"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kms:*"&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow Service Access"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_principals&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:GenerateDataKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:CreateGrant"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;service_principals&lt;/code&gt; variable takes a list of service principal strings — &lt;code&gt;["athena.amazonaws.com", "glue.amazonaws.com", "emr-serverless.amazonaws.com"]&lt;/code&gt; and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.&lt;/p&gt;
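
&lt;p&gt;A possible declaration for that variable is sketched below. The validation rules are an addition for illustration, not part of the module shown above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# kms-key/variables.tf (sketch; the validation blocks are an assumption)
variable "service_principals" {
  description = "AWS service principals allowed to use the key"
  type        = list(string)

  # An empty Service principal list would produce an invalid key policy.
  validation {
    condition     = length(var.service_principals) &gt; 0
    error_message = "Provide at least one service principal."
  }

  # Catch typos such as a bare service name without the domain suffix.
  validation {
    condition = alltrue([
      for p in var.service_principals : can(regex("\\.amazonaws\\.com$", p))
    ])
    error_message = "Each service principal must end in .amazonaws.com."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;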

&lt;h2&gt;
  
  
  The S3 Table Bucket Module
&lt;/h2&gt;

&lt;p&gt;The table bucket itself is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# s3-table-bucket/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3tables_table_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important thing that trips people up: &lt;strong&gt;S3 Table Buckets are not standard S3 buckets.&lt;/strong&gt; They use the S3 Tables API, not the standard S3 API. Several standard S3 resources will fail with &lt;code&gt;NoSuchBucket (404)&lt;/code&gt; if you try to attach them to a Table Bucket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_versioning&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_server_side_encryption_configuration&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_public_access_block&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_intelligent_tiering_configuration&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption is managed internally — AES256 is applied on creation automatically. You'll want &lt;code&gt;ignore_changes = [encryption_configuration]&lt;/code&gt; in your lifecycle block or Terraform will constantly detect drift.&lt;/p&gt;
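
&lt;p&gt;A minimal sketch of that lifecycle block, assuming a provider version in which &lt;code&gt;aws_s3tables_table_bucket&lt;/code&gt; exposes an &lt;code&gt;encryption_configuration&lt;/code&gt; attribute. On versions without it, Terraform rejects the reference, so check the provider documentation for the version you have pinned.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# s3-table-bucket/main.tf (sketch)
variable "bucket_name" {
  type = string
}

resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  lifecycle {
    # Encryption is applied by the service on creation; ignoring the
    # attribute stops Terraform from reporting it as drift on every plan.
    ignore_changes = [encryption_configuration]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;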

&lt;p&gt;The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lake-storage/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms-key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"company-lake-${local.environment}"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Glue Is Not the Catalog
&lt;/h2&gt;

&lt;p&gt;This is the part that most S3 Table Bucket writeups get wrong, and it matters for how you structure the rest of your Terraform.&lt;/p&gt;

&lt;p&gt;S3 Tables is the metadata source of truth. Glue is the integration layer. When you enable the S3 Tables analytics integration, AWS creates a federated catalog named &lt;code&gt;s3tablescatalog&lt;/code&gt; in your Glue Data Catalog. Table buckets, namespaces, and tables are surfaced through that catalog hierarchy — Athena and EMR see them through Glue, but Glue doesn't own them.&lt;/p&gt;

&lt;p&gt;This means you should not be creating &lt;code&gt;aws_glue_catalog_database&lt;/code&gt; resources with &lt;code&gt;location_uri&lt;/code&gt; S3 paths and trying to wire Iceberg metadata parameters onto them. That's the legacy Glue-over-S3-prefixes model. For S3 Tables, the catalog structure comes from the table bucket integration, not from manual Glue database provisioning.&lt;/p&gt;

&lt;p&gt;In Terraform, the integration resource is &lt;code&gt;aws_s3tables_table_bucket_policy&lt;/code&gt; (for access control) and the analytics integration is enabled at the account level. Once enabled, Athena queries S3 Tables through the &lt;code&gt;s3tablescatalog&lt;/code&gt; namespace automatically.&lt;/p&gt;

&lt;p&gt;The namespace naming convention (&lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt;, &lt;code&gt;curated&lt;/code&gt; with domain suffixes) is defined in the table bucket itself, not in Glue. Glue reflects it — it doesn't own it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Model
&lt;/h2&gt;

&lt;p&gt;For a 100TB lake, the comparison against standard S3 holds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Class&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Active data&lt;/td&gt;
&lt;td&gt;~$2,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard-IA equivalent&lt;/td&gt;
&lt;td&gt;Less-accessed data&lt;/td&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glacier equivalent&lt;/td&gt;
&lt;td&gt;Archive&lt;/td&gt;
&lt;td&gt;~$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The metadata acceleration charge for Table Buckets is $0.00025 per 1,000 requests. On a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. The performance improvement compounds the cost picture: 10x faster metadata queries mean faster query planning, which means less Athena query time and lower query costs as data volume grows.&lt;/p&gt;

&lt;p&gt;One note: you cannot attach &lt;code&gt;aws_s3_bucket_intelligent_tiering_configuration&lt;/code&gt; to a Table Bucket — it's a standard S3 resource and will fail. Storage cost optimization for Table Buckets happens through compaction and retention maintenance jobs (typically run on a schedule via MWAA or EMR), not through lifecycle policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Sequence
&lt;/h2&gt;

&lt;p&gt;The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before the S3 Tables analytics integration (which creates the federated Glue catalog surface).&lt;/p&gt;

&lt;p&gt;&lt;a href="/diagrams/diag-apache-iceberg-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="/diagrams/diag-apache-iceberg-pipeline.png" alt="Deployment sequence: KMS Key must be created first, then S3 Table Bucket (which uses the key ARN), then S3 Tables Analytics Integration which creates the s3tablescatalog federated view in Glue Data Catalog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Little of that was Terragrunt apply time: resource creation for each component is fast, but we ran plan, reviewed, applied, and verified in each environment before moving to the next.&lt;/p&gt;

&lt;p&gt;One deployment note: if you're using Athena and haven't enabled S3 Tables analytics integration in the account before, do that before the apply. Athena queries S3 Tables only after the integration is enabled and the &lt;code&gt;s3tablescatalog&lt;/code&gt; namespace is visible in the Glue Data Catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Team Inherited
&lt;/h2&gt;

&lt;p&gt;When we handed this over to the data engineering team, they had a fully provisioned storage foundation — one table bucket per environment, three namespaces per bucket, encryption enabled, and Athena wired to query through the &lt;code&gt;s3tablescatalog&lt;/code&gt; integration. They could start writing Spark jobs and creating tables immediately without worrying about storage configuration or catalog wiring after the fact.&lt;/p&gt;

&lt;p&gt;The Terraform modules are reusable. Adding a new environment is one Terragrunt leaf config. Adding a new domain namespace is a namespace declaration on the existing bucket. The KMS key and integration configuration don't change.&lt;/p&gt;

&lt;p&gt;S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains make a strong case for starting there rather than retrofitting later — just go in knowing they're a different API surface than standard S3, and structure your modules accordingly.&lt;/p&gt;




&lt;p&gt;Building out a data platform and figuring out the storage and catalog architecture? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>apacheiceberg</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The 5-Minute Tax I Killed With GitHub Actions</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:43:31 +0000</pubDate>
      <link>https://forem.com/tallgray1/the-5-minute-tax-i-killed-with-github-actions-1gpe</link>
      <guid>https://forem.com/tallgray1/the-5-minute-tax-i-killed-with-github-actions-1gpe</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/zero-touch-deployments/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every time I finished writing a blog post, I had to do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;sites/graycloudarch
hugo &lt;span class="nt"&gt;--minify&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;sync &lt;/span&gt;public/ s3://graycloudarch-website &lt;span class="nt"&gt;--delete&lt;/span&gt;
aws cloudfront create-invalidation &lt;span class="nt"&gt;--distribution-id&lt;/span&gt; E1234ABCDEF &lt;span class="nt"&gt;--paths&lt;/span&gt; &lt;span class="s2"&gt;"/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes. Doesn't sound like much.&lt;/p&gt;

&lt;p&gt;But when you're trying to publish 2-3 posts per week while working full-time, those 5 minutes add up. Not just in time—in &lt;em&gt;friction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;"I just finished writing. Now I need to context-switch to deployment mode. What was that CloudFront ID again?"&lt;/p&gt;

&lt;p&gt;Friction kills momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Wanted
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git push&lt;/code&gt; → site updates automatically → I move on to the next thing.&lt;/p&gt;

&lt;p&gt;Zero thinking. Zero context switching. Zero "oh crap, I forgot to invalidate CloudFront."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: GitHub Actions
&lt;/h2&gt;

&lt;p&gt;GitHub Actions can build and deploy your site every time you push to &lt;code&gt;main&lt;/code&gt;. It's free for public repositories, and private repositories get a monthly allowance of free minutes.&lt;/p&gt;

&lt;p&gt;Here's the whole workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy graycloudarch.com&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sites/graycloudarch/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content/graycloudarch/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;peaceiris/actions-hugo@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;hugo-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latest'&lt;/span&gt;
          &lt;span class="na"&gt;extended&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build site&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sites/graycloudarch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hugo --minify&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;aws-access-key-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-secret-access-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sites/graycloudarch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws s3 sync public/ s3://graycloudarch-website --delete&lt;/span&gt;
          &lt;span class="s"&gt;aws cloudfront create-invalidation \&lt;/span&gt;
            &lt;span class="s"&gt;--distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION }} \&lt;/span&gt;
            &lt;span class="s"&gt;--paths "/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Push to &lt;code&gt;main&lt;/code&gt;, GitHub Actions handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part That Tripped Me Up
&lt;/h2&gt;

&lt;p&gt;Hugo themes are often installed as Git submodules. If the checkout step doesn't fetch them, the theme directory is empty and the build breaks with cryptic errors about missing layouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Don't forget this&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost me 20 minutes of debugging before I realized. Now it's documented in code, not lost in my bash history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path Filtering: The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;I run two sites in one repo: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Without path filtering, every push rebuilds &lt;em&gt;both&lt;/em&gt; sites, even if I only changed one. Wasted build minutes, unnecessary CloudFront invalidations, slower feedback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sites/graycloudarch/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content/graycloudarch/**'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now GitHub Actions only runs when files for &lt;em&gt;that site&lt;/em&gt; change. Fast, efficient, no waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm trying to hit $3K/month by March 31. That's 9 weeks.&lt;/p&gt;

&lt;p&gt;Every minute I spend deploying is a minute I'm not writing, not reaching out to clients, not building the course I want to sell.&lt;/p&gt;

&lt;p&gt;Manual deployments are a tax on my time. This workflow eliminated that tax.&lt;/p&gt;

&lt;p&gt;Now when I finish writing, I commit and push. Two minutes later, it's live. I'm already working on the next post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the 5 minutes per deployment.&lt;/p&gt;

&lt;p&gt;It's the &lt;em&gt;mental overhead&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Before: "Okay, post is done. Now I need to switch gears, build Hugo, sync to S3, remember that CloudFront command..."&lt;/p&gt;

&lt;p&gt;After: "Post is done. &lt;code&gt;git push&lt;/code&gt;. What's next?"&lt;/p&gt;

&lt;p&gt;No context switch. No friction. Just ship and move on.&lt;/p&gt;

&lt;p&gt;That's worth way more than 5 minutes.&lt;/p&gt;

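&lt;p&gt;Want to set this up for your site? The workflow above works for any Hugo + S3 + CloudFront setup. Swap in your own bucket name and site directory, then store your AWS credentials and CloudFront distribution ID as GitHub Secrets (&lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;CLOUDFRONT_DISTRIBUTION&lt;/code&gt;).&lt;/p&gt;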

&lt;p&gt;Or &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt; if you want help automating your deployments. I do this for a living.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>cicd</category>
      <category>hugo</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Spent 6 Hours Automating a 30-Minute Task (And I'd Do It Again)</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:43:30 +0000</pubDate>
      <link>https://forem.com/tallgray1/i-spent-6-hours-automating-a-30-minute-task-and-id-do-it-again-14ee</link>
      <guid>https://forem.com/tallgray1/i-spent-6-hours-automating-a-30-minute-task-and-id-do-it-again-14ee</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/automated-infrastructure/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Look, I know what you're thinking. "Glenn, you could've just clicked through the AWS console and had both sites live in an hour."&lt;/p&gt;

&lt;p&gt;You're not wrong.&lt;/p&gt;

&lt;p&gt;But here's the thing: I'm allergic to clicking through consoles. It's an occupational hazard from spending the last 5 years building enterprise platforms where "just do it manually" gets you fired.&lt;/p&gt;

&lt;p&gt;So when I sat down to launch graycloudarch.com and cloudpatterns.io, I did what any reasonable person would do: I spent 6 hours writing Terraform to automate a 30-minute task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manual Way (aka Hell)
&lt;/h2&gt;

&lt;p&gt;If I'd done this the normal way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Console → ACM → Request Certificate&lt;/li&gt;
&lt;li&gt;Copy the DNS validation CNAME&lt;/li&gt;
&lt;li&gt;Cloudflare → Add DNS record&lt;/li&gt;
&lt;li&gt;Wait. Refresh. Wait more.&lt;/li&gt;
&lt;li&gt;AWS Console → CloudFront → Create Distribution&lt;/li&gt;
&lt;li&gt;Copy CloudFront domain&lt;/li&gt;
&lt;li&gt;Cloudflare → Add another DNS record&lt;/li&gt;
&lt;li&gt;Test. Find typo. Fix typo. Test again.&lt;/li&gt;
&lt;li&gt;Repeat for second domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time: 40 minutes if nothing breaks (it always breaks).&lt;/p&gt;

&lt;p&gt;Chance I'd screw up a DNS record: 80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Automated Way (aka Overkill)
&lt;/h2&gt;

&lt;p&gt;One Terraform apply. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply
&lt;span class="c"&gt;# Go make coffee&lt;/span&gt;
&lt;span class="c"&gt;# Come back to two working sites&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the real magic isn't the deployment—it's what happens when AWS generates those ACM validation records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
        &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
        &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform reads the validation records from AWS, creates them in Cloudflare, and waits for validation to complete. Zero copy-paste. Zero switching between browser tabs. Zero forgetting which CNAME goes where.&lt;/p&gt;
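
&lt;p&gt;The waiting step isn't visible in the snippet above. Here is a minimal sketch of how it is commonly wired, assuming the AWS provider's &lt;code&gt;aws_acm_certificate_validation&lt;/code&gt; resource and the &lt;code&gt;hostname&lt;/code&gt; attribute exported by &lt;code&gt;cloudflare_record&lt;/code&gt;. The resource name &lt;code&gt;site&lt;/code&gt; is illustrative, and the actual module may be structured differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: blocks the apply until ACM reports the certificate as issued.
resource "aws_acm_certificate_validation" "site" {
  certificate_arn = aws_acm_certificate.site.arn

  # One FQDN per validation record created in Cloudflare
  validation_record_fqdns = [
    for record in cloudflare_record.cert_validation : record.hostname
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pointing the CloudFront distribution at &lt;code&gt;aws_acm_certificate_validation.site.certificate_arn&lt;/code&gt; instead of the raw certificate ARN should keep Terraform from creating the distribution before the certificate has been issued.&lt;/p&gt;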

&lt;p&gt;I don't touch Cloudflare. I don't touch the AWS console. I just run &lt;code&gt;terraform apply&lt;/code&gt; and go do something useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters (Spoiler: It's Not About Terraform)
&lt;/h2&gt;

&lt;p&gt;I'm trying to hit $3K/month by March 31. That's 9 weeks away.&lt;/p&gt;

&lt;p&gt;Every hour I spend clicking through AWS is an hour I'm not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing blog posts&lt;/li&gt;
&lt;li&gt;Reaching out to potential clients on LinkedIn&lt;/li&gt;
&lt;li&gt;Building the course I want to sell&lt;/li&gt;
&lt;li&gt;Actually making money&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual infrastructure doesn't generate revenue. Published content generates revenue.&lt;/p&gt;

&lt;p&gt;So yeah, I spent 6 hours automating something I could've done in 30 minutes. But now when I launch my third brand (and I will), it takes 10 minutes and one &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront investment for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;The module is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACM certificate with DNS validation&lt;/li&gt;
&lt;li&gt;S3 bucket for static hosting&lt;/li&gt;
&lt;li&gt;CloudFront distribution&lt;/li&gt;
&lt;li&gt;Cloudflare DNS records (both root and www)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Call it twice (once per brand), different inputs, same code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/static-site"&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch-website"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/static-site"&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns-website"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No duplication. No drift. No "wait, which CloudFront ID goes with which domain?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part Where I Screwed Up
&lt;/h2&gt;

&lt;p&gt;Of course it didn't work perfectly the first time.&lt;/p&gt;

&lt;p&gt;Turns out when you register a domain through Cloudflare, they helpfully create a default parking page DNS record. When Terraform tried to create my root CNAME, it failed with "record already exists."&lt;/p&gt;

&lt;p&gt;Took me 20 minutes to figure out I needed &lt;code&gt;allow_overwrite = true&lt;/code&gt; in the Cloudflare resource.&lt;/p&gt;
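
&lt;p&gt;Roughly what that looks like, as a sketch assuming the v3/v4 &lt;code&gt;cloudflare_record&lt;/code&gt; schema. The resource names and the &lt;code&gt;var.domain_name&lt;/code&gt; input are illustrative, and how an existing record gets matched may vary by provider version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: root record pointing the apex domain at CloudFront.
resource "cloudflare_record" "root" {
  zone_id = data.cloudflare_zone.site.id
  name    = var.domain_name
  type    = "CNAME"
  value   = aws_cloudfront_distribution.site.domain_name

  # Let Terraform replace a record that already exists instead of failing
  allow_overwrite = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The flag only affects creation. Once the record is in state, Terraform manages it like any other resource.&lt;/p&gt;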

&lt;p&gt;20 minutes I'll never get back. But at least it's documented in Git now, not lost in my bash history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Would I Do This Again?
&lt;/h2&gt;

&lt;p&gt;Absolutely.&lt;/p&gt;

&lt;p&gt;Not because it's faster (it's not, the first time).&lt;/p&gt;

&lt;p&gt;Not because it's easier (it's definitely not).&lt;/p&gt;

&lt;p&gt;Because when I'm sitting at 2am writing my fifth blog post of the week and I realize I need to spin up a third site for a new product line, I can do it in 10 minutes instead of canceling my writing session to spend 45 minutes in the AWS console.&lt;/p&gt;

&lt;p&gt;Automation is a bet on future you. I'm betting future Glenn will appreciate not having to remember how SSL validation works.&lt;/p&gt;

&lt;p&gt;Want the code? It's not open source (yet), but if you're building something similar and want to talk through the architecture, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;hit me up&lt;/a&gt;. I'm always down to talk Terraform.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm insane for spending 6 hours on this, that's cool too. My DMs are open.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>cloudfront</category>
      <category>automation</category>
    </item>
    <item>
      <title>The IAM Trust Policy Chicken-and-Egg (That Isn't)</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 13 May 2026 17:55:53 +0000</pubDate>
      <link>https://forem.com/tallgray1/the-iam-trust-policy-chicken-and-egg-that-isnt-2ba5</link>
      <guid>https://forem.com/tallgray1/the-iam-trust-policy-chicken-and-egg-that-isnt-2ba5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/iam-trust-policy-chicken-and-egg/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The pipeline role needed to trust the deployment role. The deployment role needed to trust the pipeline role. When I wrote both in Terraform and ran plan, it stopped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Cycle: module.pipeline.aws_iam_role.exec → module.deploy.aws_iam_role.target → module.pipeline.aws_iam_role.exec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instinct is to create one role first, then go back and edit the trust policy of the other after it exists. A manual bootstrap step. It works. It also means you can't &lt;code&gt;terraform apply&lt;/code&gt; from a clean state and get a working result — someone has to remember the second pass. The IaC tells half the story.&lt;/p&gt;

&lt;p&gt;There's a better answer. IAM trust policies don't validate that the ARNs they reference actually exist. AWS stores the JSON document and moves on. The cycle Terraform sees is real — it's a real edge in its dependency graph. The underlying constraint that dependency represents is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  ARNs are deterministic before creation
&lt;/h2&gt;

&lt;p&gt;IAM role ARNs follow a fixed format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:iam::&amp;lt;account-id&amp;gt;:role/&amp;lt;role-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The account ID is fixed. The role name is chosen at definition time. Which means the full ARN is computable before &lt;code&gt;terraform apply&lt;/code&gt; runs — before the resource exists — as long as the name is stable.&lt;/p&gt;

&lt;p&gt;AWS does not validate that a referenced principal ARN exists when you create or update a trust policy. It stores the JSON. The role becomes assumable once both sides exist, regardless of which one was created first.&lt;/p&gt;

&lt;p&gt;This is different from a configuration error like referencing a nonexistent IAM role in an &lt;code&gt;aws_iam_role_policy_attachment&lt;/code&gt; — that fails at apply time because Terraform tries to call the API and gets an error. A trust policy is just a JSON document stored against the role. If the ARN in the &lt;code&gt;Principal&lt;/code&gt; field doesn't resolve to an existing entity yet, IAM doesn't complain. It just doesn't match anything. Yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cycle Terraform sees
&lt;/h2&gt;

&lt;p&gt;The dependency graph problem is real. Here's the code that creates it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# depends on role_b&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_b"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# depends on role_a&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform resolves: &lt;code&gt;role_a&lt;/code&gt; needs &lt;code&gt;role_b&lt;/code&gt;'s ARN before creation → &lt;code&gt;role_b&lt;/code&gt; needs &lt;code&gt;role_a&lt;/code&gt;'s ARN before creation → cycle. It stops before creating either resource.&lt;/p&gt;

&lt;p&gt;The fix removes the dependency by computing what you already know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt; &lt;span class="s2"&gt;"current"&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;

  &lt;span class="nx"&gt;role_a_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/${var.role_a_name}"&lt;/span&gt;
  &lt;span class="nx"&gt;role_b_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/${var.role_b_name}"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a_name&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# string, no Terraform dependency&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_b"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b_name&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# string, no Terraform dependency&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No cycle. Both roles are created in a single apply. The trust relationship is live as soon as both resources exist, which they will once that apply completes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fdiagrams%2Fdiag-iam-chicken-egg-cross-account.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fdiagrams%2Fdiag-iam-chicken-egg-cross-account.png" alt="Two-account cross-account IAM trust relationship. Both role ARNs are constructed from known values at plan time — no deploy order required."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this pattern appears in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cross-account deployment pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CodePipeline execution role in account A assumes a deployment role in account B. The deployment role's trust policy needs to reference the pipeline role's ARN. Each Terraform root manages its own account's roles. The ARN construction pattern resolves the cross-account dependency: each module constructs the other account's role ARN from &lt;code&gt;var.pipeline_account_id&lt;/code&gt; and a known role name, both passed in at plan time from tfvars or remote state outputs.&lt;/p&gt;
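
&lt;p&gt;A minimal sketch of the account B side. The variable &lt;code&gt;pipeline_role_name&lt;/code&gt; and the role name &lt;code&gt;deploy-target&lt;/code&gt; are hypothetical, and it's worth running this once from a clean state in a sandbox account to confirm IAM accepts the principal in your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "pipeline_account_id" {
  type = string
}

# Hypothetical input: the agreed-upon name of the pipeline role in account A
variable "pipeline_role_name" {
  type = string
}

locals {
  # Plain string, so this root has no Terraform dependency on account A
  pipeline_role_arn = "arn:aws:iam::${var.pipeline_account_id}:role/${var.pipeline_role_name}"
}

resource "aws_iam_role" "deploy_target" {
  name = "deploy-target"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = local.pipeline_role_arn }
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The pipeline root in account A would mirror this with its own account ID variable and the name &lt;code&gt;deploy-target&lt;/code&gt;.&lt;/p&gt;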

&lt;p&gt;&lt;strong&gt;ECS task role and execution role&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whoever registers the task definition or launches the task needs &lt;code&gt;iam:PassRole&lt;/code&gt; to hand the task role and the execution role to ECS. Some teams want the task role's trust policy to explicitly list the execution role's ARN as the allowed principal. You don't need to: &lt;code&gt;ecs-tasks.amazonaws.com&lt;/code&gt; as the service principal removes the dependency entirely. But if your security posture requires explicit principal ARNs rather than the service principal, ARN construction handles it without a two-pass apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions boundary bootstrap with an SCP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An SCP requires that all new IAM roles include a specific permissions boundary policy. The boundary is a managed policy that must exist before any roles referencing it can be created. This isn't a circular dependency; it's a sequential one. The boundary policy must be applied first, separately. Construct its ARN deterministically (&lt;code&gt;arn:aws:iam::${var.account_id}:policy/${var.boundary_name}&lt;/code&gt;) and pass it in wherever roles are created. Document the bootstrap order with a Terraform &lt;code&gt;precondition&lt;/code&gt; block or a clear README section. Different problem, different fix.&lt;/p&gt;
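
&lt;p&gt;One way to make that ordering fail loudly, sketched here with an &lt;code&gt;aws_iam_policy&lt;/code&gt; data source lookup rather than a &lt;code&gt;precondition&lt;/code&gt; block. The role name and the EC2 service principal are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "account_id" {
  type = string
}

variable "boundary_name" {
  type = string
}

locals {
  boundary_arn = "arn:aws:iam::${var.account_id}:policy/${var.boundary_name}"
}

# The lookup should error during plan if the boundary has not been bootstrapped yet
data "aws_iam_policy" "boundary" {
  arn = local.boundary_arn
}

resource "aws_iam_role" "app" {
  name                 = "app-role" # placeholder name
  permissions_boundary = data.aws_iam_policy.boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trade-off is that the plan now needs read access to the boundary policy, which may or may not be acceptable in a locked-down account.&lt;/p&gt;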

&lt;h2&gt;
  
  
  When the dependency is genuine
&lt;/h2&gt;

&lt;p&gt;There's a scenario that looks identical to this but isn't: when a Terraform provisioner or data source needs to actually &lt;em&gt;call&lt;/em&gt; a role — not just reference its ARN — during resource creation.&lt;/p&gt;

&lt;p&gt;Example: a &lt;code&gt;null_resource&lt;/code&gt; provisioner that runs &lt;code&gt;aws sts assume-role&lt;/code&gt; and then operates in the target account. Here you need the role to exist and be assumable before the provisioner fires. ARN construction doesn't help — you need the resource active at execution time, not just its string value known at plan time. The correct fix is explicit &lt;code&gt;depends_on&lt;/code&gt;, not local string construction.&lt;/p&gt;

&lt;p&gt;The distinction: static JSON referencing an ARN string (solvable with ARN construction) vs. a runtime API call that needs the resource actually live (solvable with &lt;code&gt;depends_on&lt;/code&gt;). If your code needs to &lt;em&gt;assume&lt;/em&gt; the role during apply, you need ordering. If it just needs to &lt;em&gt;name&lt;/em&gt; the role in a policy document, you don't.&lt;/p&gt;
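
&lt;p&gt;A sketch of the runtime case, assuming the AWS CLI is available where Terraform runs. The role name and session name are hypothetical. Because the command uses a constructed ARN string, Terraform sees no implicit edge to the role, and &lt;code&gt;depends_on&lt;/code&gt; has to supply the ordering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_caller_identity" "current" {}

locals {
  target_role_name = "example-target" # hypothetical stable name
  target_role_arn  = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.target_role_name}"
}

data "aws_iam_policy_document" "target_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      # Account root: identities in this account may assume the role,
      # subject to their own IAM permissions.
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }
}

resource "aws_iam_role" "target" {
  name               = local.target_role_name
  assume_role_policy = data.aws_iam_policy_document.target_trust.json
}

resource "null_resource" "bootstrap" {
  # The provisioner calls STS during apply, so the role must already exist.
  depends_on = [aws_iam_role.target]

  provisioner "local-exec" {
    command = "aws sts assume-role --role-arn ${local.target_role_arn} --role-session-name tf-bootstrap"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A freshly created role can take a few seconds to become assumable, so a retry around the command may be worth adding.&lt;/p&gt;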

&lt;h2&gt;
  
  
  The trap in the fix
&lt;/h2&gt;

&lt;p&gt;Once you've internalized "construct ARNs deterministically," the next failure mode is &lt;strong&gt;role names that include Terraform-generated suffixes&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.prefix}-role-${random_id.suffix.hex}"&lt;/span&gt;  &lt;span class="c1"&gt;# ARN not deterministic until random_id exists&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the role name includes &lt;code&gt;random_id.suffix.hex&lt;/code&gt;, the ARN can't be computed until the &lt;code&gt;random_id&lt;/code&gt; resource is created. That reintroduces the dependency: you need a resource output to construct the name, and the cycle re-forms if any of those names are referenced in another role's trust policy.&lt;/p&gt;

&lt;p&gt;The fix is stable, predictable role names: &lt;code&gt;"${var.prefix}-${var.env}-pipeline"&lt;/code&gt; rather than generated suffixes. IAM role names are unique per account, not globally. The habit of appending random suffixes comes from S3 bucket naming, where global uniqueness is required. IAM doesn't have that constraint. There's no reason to make the name unpredictable.&lt;/p&gt;

&lt;p&gt;If you have existing roles with generated names and need their ARNs, they're deterministic &lt;em&gt;after&lt;/em&gt; the first apply — stored in state and readable via &lt;code&gt;aws_iam_role.role_a.arn&lt;/code&gt;. The construction approach is for cases where you control the naming and are defining the role name yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What generalizes
&lt;/h2&gt;

&lt;p&gt;The IAM trust policy deadlock is the most common place engineers hit this pattern, but it's not the only one. Wherever you encounter a Terraform circular dependency involving a predictable string — ARNs, resource names, account IDs, region names — ask whether you actually need the resource output or whether you can compute the value from what you already know.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.aws_caller_identity.current.account_id&lt;/code&gt; gives you the account without creating a dependency on any resource. A stable name gives you the ARN. The dependency graph edge exists only because you referenced the resource — remove the reference by computing the value directly, and the cycle disappears.&lt;/p&gt;
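
&lt;p&gt;Put together, a compact sketch of computing the value instead of referencing the resource. It reuses the &lt;code&gt;var.prefix&lt;/code&gt; and &lt;code&gt;var.env&lt;/code&gt; naming from earlier; the &lt;code&gt;aws_partition&lt;/code&gt; lookup is an optional extra added here for non-commercial partitions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "prefix" {
  type = string
}

variable "env" {
  type = string
}

data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}

locals {
  # Stable, predictable name: no generated suffix.
  pipeline_role_name = "${var.prefix}-${var.env}-pipeline"

  # Computed from known values; creates no graph edge to any aws_iam_role resource.
  pipeline_role_arn = "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/${local.pipeline_role_name}"
}

output "pipeline_role_arn" {
  value = local.pipeline_role_arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;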

&lt;p&gt;The broader principle: Terraform's graph is built from references. References that aren't necessary are constraints that aren't necessary.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Untangling IAM architecture across multiple accounts — trust policies, permission boundaries, SCPs, cross-account assume-role chains — is where subtle errors compound quietly and the blast radius is real. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;I work on this regularly&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iam</category>
      <category>terraform</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>What the first 24 hours of production CloudWatch data told us</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 04 May 2026 18:43:32 +0000</pubDate>
      <link>https://forem.com/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</link>
      <guid>https://forem.com/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/cloudwatch-go-live-24h/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.&lt;/p&gt;

&lt;p&gt;We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy, because you can't tune a policy for a workload whose demand you haven't seen yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.&lt;/p&gt;

&lt;p&gt;Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 99.8% CPU means at 0.5 vCPU
&lt;/h2&gt;

&lt;p&gt;The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.&lt;/p&gt;

&lt;p&gt;The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.&lt;/p&gt;

&lt;p&gt;The signal wasn't about count. It was about the unit itself.&lt;/p&gt;

&lt;p&gt;At 99.8% utilization, any burst in per-request processing time — a downstream API call that's slow, a cache miss, a spike in concurrent requests — queues. The task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that tasks are scheduled tightly enough that requests wait before they even start processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right-sizing the task before configuring the autoscaler
&lt;/h2&gt;

&lt;p&gt;We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.&lt;/p&gt;

&lt;p&gt;Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task should land near 50% utilization, assuming the old task's real demand wasn't far above its ceiling. That sits below the target, leaves room to absorb variance before triggering a scale-out, and gives scale-in enough signal to be meaningful rather than noise.&lt;/p&gt;

&lt;p&gt;The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.&lt;/p&gt;

&lt;p&gt;Two other services in the platform were running the opposite problem — allocated more memory than they'd ever used. Those went the other direction: overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.&lt;/p&gt;

&lt;p&gt;Sequencing matters: &lt;strong&gt;right-size the task before you configure the autoscaler.&lt;/strong&gt; Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it does is scale out to a floor that's still running on undersized tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we chose CPU tracking instead of request count
&lt;/h2&gt;

&lt;p&gt;The obvious autoscaling metric for an HTTP service is &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt;. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.&lt;/p&gt;

&lt;p&gt;We couldn't use it.&lt;/p&gt;

&lt;p&gt;The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with &lt;code&gt;target_group_arn = null&lt;/code&gt;: the target group lives in a different account, and the service module doesn't know its ARN. &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt; requires the Application Auto Scaling policy to carry a resource label built from the load balancer and target group ARNs. Without those values, there's no way to wire the metric across accounts without building additional dependency plumbing.&lt;/p&gt;

&lt;p&gt;CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.&lt;/p&gt;

&lt;p&gt;One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observability gap app logs can't fill
&lt;/h2&gt;

&lt;p&gt;Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.&lt;/p&gt;

&lt;p&gt;What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: &lt;code&gt;HTTPCode_ELB_5XX_Count&lt;/code&gt; for errors the load balancer generates before a request reaches a container, &lt;code&gt;RejectedConnectionCount&lt;/code&gt; for connections refused at the ALB layer when backend capacity is exhausted, &lt;code&gt;ActiveConnectionCount&lt;/code&gt; as a proxy for in-flight load per target group. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.&lt;/p&gt;

&lt;p&gt;CloudWatch had the data. The gap was getting it into the same place as everything else.&lt;/p&gt;

&lt;p&gt;A Lambda in the infrastructure account (where the ALB lives) runs every 60 seconds, calls &lt;code&gt;GetMetricData&lt;/code&gt;, and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost (one CloudWatch API call per minute, with the invocations well inside Lambda's free tier). The metrics land alongside the application data and show the ALB layer that the app logs are blind to.&lt;/p&gt;

&lt;p&gt;The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.&lt;/p&gt;
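
&lt;p&gt;The scheduling side of that is small. A sketch, assuming the exporter function already exists and its ARN and name are passed in; the variable and resource names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "metrics_lambda_arn" {
  type = string
}

variable "metrics_lambda_name" {
  type = string
}

# Fire once per minute.
resource "aws_cloudwatch_event_rule" "every_minute" {
  name                = "alb-metrics-export"
  schedule_expression = "rate(1 minute)"
}

# Point the rule at the exporter Lambda.
resource "aws_cloudwatch_event_target" "invoke_exporter" {
  rule = aws_cloudwatch_event_rule.every_minute.name
  arn  = var.metrics_lambda_arn
}

# Let EventBridge invoke the function, scoped to this one rule.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.metrics_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.every_minute.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;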

&lt;h2&gt;
  
  
  Autoscaling parameters worth explaining
&lt;/h2&gt;

&lt;p&gt;For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;min_capacity&lt;/code&gt; to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.&lt;/p&gt;

&lt;p&gt;The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.&lt;/p&gt;
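
&lt;p&gt;In Terraform, those parameters map onto an Application Auto Scaling target and a target tracking policy. A sketch with the numbers above; the cluster and service variables and the resource names are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "ecs_cluster_name" {
  type = string
}

variable "ecs_service_name" {
  type = string
}

resource "aws_appautoscaling_target" "delivery" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"

  min_capacity = 9  # floor anchored to the observed stable task count
  max_capacity = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-70"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.delivery.service_namespace
  resource_id        = aws_appautoscaling_target.delivery.resource_id
  scalable_dimension = aws_appautoscaling_target.delivery.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 70
    scale_out_cooldown = 60  # seconds: react quickly to load
    scale_in_cooldown  = 300 # seconds: shed capacity conservatively

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;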

&lt;h2&gt;
  
  
  What 24 hours of data drove
&lt;/h2&gt;

&lt;p&gt;We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.&lt;/p&gt;

&lt;p&gt;All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.&lt;/p&gt;

&lt;p&gt;The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt;. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ecs</category>
      <category>cloudwatch</category>
      <category>autoscaling</category>
      <category>rightsizing</category>
    </item>
    <item>
      <title>DNS Validation: From 15 Steps to Zero</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:30 +0000</pubDate>
      <link>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</link>
      <guid>https://forem.com/tallgray1/dns-validation-from-15-steps-to-zero-1nng</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/dns-hell-to-automated/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You know what's the worst part of launching a new site?&lt;/p&gt;

&lt;p&gt;SSL certificate validation.&lt;/p&gt;

&lt;p&gt;Not creating the cert—that's one click in AWS ACM. It's the validation dance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS gives you a CNAME record: &lt;code&gt;_abc123extremely-long-string-here.graycloudarch.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The value is equally ridiculous: &lt;code&gt;_xyz789another-massive-string.acm-validations.aws.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You copy it (pray you don't miss a character)&lt;/li&gt;
&lt;li&gt;Switch to Cloudflare (or Route 53, or wherever)&lt;/li&gt;
&lt;li&gt;Paste it in&lt;/li&gt;
&lt;li&gt;Wait 5-10 minutes&lt;/li&gt;
&lt;li&gt;Refresh AWS console&lt;/li&gt;
&lt;li&gt;Still pending...&lt;/li&gt;
&lt;li&gt;Refresh again&lt;/li&gt;
&lt;li&gt;Finally validated!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now do it again for &lt;code&gt;www.graycloudarch.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And then repeat the whole thing for your second domain.&lt;/p&gt;

&lt;p&gt;This is "DNS hell."&lt;/p&gt;

&lt;h2&gt;
  
  
  There's a Better Way
&lt;/h2&gt;

&lt;p&gt;Terraform can read AWS validation records and create them in Cloudflare automatically.&lt;/p&gt;

&lt;p&gt;Zero copy-paste. Zero browser tab switching. Zero waiting and refreshing.&lt;/p&gt;

&lt;p&gt;Here's the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Request certificate&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.graycloudarch.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create validation records in Cloudflare&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
      &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
  &lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Critical - ACM validation breaks with proxy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for validation&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;validation_record_fqdns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cloudflare_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cert_validation&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;. Go make coffee. Come back to a validated certificate.&lt;/p&gt;
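
&lt;p&gt;The snippet references &lt;code&gt;data.cloudflare_zone.site&lt;/code&gt; without defining it. A sketch of the missing piece, assuming a 4.x Cloudflare provider (the &lt;code&gt;cloudflare_record&lt;/code&gt; resource and its &lt;code&gt;hostname&lt;/code&gt; attribute used above belong to that generation; newer major versions rename things):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~&amp;gt; 4.0" # assumption: pin to the 4.x line
    }
  }
}

# Look up the zone by name so the zone ID never has to be hard-coded.
data "cloudflare_zone" "site" {
  name = "graycloudarch.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;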

&lt;h2&gt;
  
  
  The Magic: for_each
&lt;/h2&gt;

&lt;p&gt;The key is this part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS generates validation records dynamically (one for apex domain, one for www). Terraform reads them, loops over them, and creates each one in Cloudflare.&lt;/p&gt;

&lt;p&gt;You never see the records. You never copy anything. It just works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Screwed Up
&lt;/h2&gt;

&lt;p&gt;First time I ran this, ACM validation timed out after 30 minutes.&lt;/p&gt;

&lt;p&gt;The problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A proxied record doesn't return what you created. Cloudflare answers with its own proxy addresses instead of the CNAME, so ACM's validation servers never see your validation record.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS-only mode. No proxy. ACM validation works.&lt;/p&gt;

&lt;p&gt;Cost me 30 minutes of debugging. Now it's in code so I never hit it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm running two brands: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Manual approach: 15 steps per domain = 30 steps total. 30 minutes minimum. High chance of typos.&lt;/p&gt;

&lt;p&gt;Terraform approach: One &lt;code&gt;terraform apply&lt;/code&gt;. 5 minutes to write the code (once), 10 minutes for AWS to validate. Then copy-paste the pattern for the second domain.&lt;/p&gt;

&lt;p&gt;When I launch my third brand (and I will), it'll take 5 minutes and one &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront automation for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part People Miss
&lt;/h2&gt;

&lt;p&gt;Most Terraform tutorials stop at requesting the certificate. They don't show you the validation loop or the waiting resource.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;aws_acm_certificate_validation&lt;/code&gt;, Terraform exits immediately after creating the cert. It's still "Pending Validation" in AWS. When you try to use it in CloudFront, it fails.&lt;/p&gt;

&lt;p&gt;You'd have to run &lt;code&gt;terraform apply&lt;/code&gt; again later, after manually checking that validation completed.&lt;/p&gt;

&lt;p&gt;That's not automation—that's just documentation.&lt;/p&gt;

&lt;p&gt;The waiting resource makes it truly hands-off.&lt;/p&gt;
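
&lt;p&gt;The waiting resource only helps if downstream resources reference it. A sketch of a CloudFront distribution that takes its certificate ARN from &lt;code&gt;aws_acm_certificate_validation.site&lt;/code&gt; rather than from the certificate directly, so Terraform creates it only after validation completes. The origin variable and the managed cache policy lookup are assumptions for illustration. CloudFront only accepts ACM certificates from us-east-1, so this also assumes the certificate resources use a provider configured for that region:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "origin_domain_name" {
  type = string # hypothetical origin, e.g. a load balancer or static host
}

data "aws_cloudfront_cache_policy" "optimized" {
  name = "Managed-CachingOptimized"
}

resource "aws_cloudfront_distribution" "site" {
  enabled = true
  aliases = ["graycloudarch.com", "www.graycloudarch.com"]

  origin {
    domain_name = var.origin_domain_name
    origin_id   = "site-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "site-origin"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = data.aws_cloudfront_cache_policy.optimized.id
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    # Referencing the validation resource makes the distribution wait
    # until the certificate has actually been issued.
    acm_certificate_arn      = aws_acm_certificate_validation.site.certificate_arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;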

&lt;h2&gt;
  
  
  Scaling It
&lt;/h2&gt;

&lt;p&gt;Adding a second domain is 10 lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.cloudpatterns.io"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* same pattern */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern, different names. No clicking. No switching between consoles. No remembering which validation record goes where.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the time savings (though 30 minutes per deployment adds up).&lt;/p&gt;

&lt;p&gt;It's the mental overhead.&lt;/p&gt;

&lt;p&gt;Manual DNS configuration requires focus. "Did I copy the whole string? Did I add the trailing dot? Is it DNS-only mode?"&lt;/p&gt;

&lt;p&gt;Terraform requires running one command. That's it.&lt;/p&gt;

&lt;p&gt;I get my focus back. I can write this blog post while Terraform validates certificates.&lt;/p&gt;

&lt;p&gt;Want the full code? It's not open source (yet), but if you're building something similar and want to talk through it, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm overthinking this and should've clicked through Cloudflare like a normal person, that's cool too.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>cloudflare</category>
      <category>dns</category>
    </item>
    <item>
      <title>Building Multi-Account AWS Infrastructure with Terraform and ECP</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:25 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</link>
      <guid>https://forem.com/tallgray1/building-multi-account-aws-infrastructure-with-terraform-and-ecp-49an</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/multi-account-aws-ecp/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;After years of building AWS infrastructure at scale, I've learned that multi-account strategy isn't just about security—it's about organizational clarity and cost management.&lt;/p&gt;

&lt;p&gt;At a large podcast hosting platform, we implemented an Enterprise Control Plane (ECP) pattern using Terraform to manage 20+ AWS accounts. Here's what I learned:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Account AWS
&lt;/h2&gt;

&lt;p&gt;Most companies start with one AWS account. Everything lives together: dev, staging, prod, data pipelines, security tools. It works... until it doesn't.&lt;/p&gt;

&lt;p&gt;Problems emerge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius:&lt;/strong&gt; A misconfigured dev resource can affect production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM complexity:&lt;/strong&gt; Permission boundaries become impossible to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost allocation:&lt;/strong&gt; Finance can't track spending by team or project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Auditors want logical separation between environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The ECP Pattern
&lt;/h2&gt;

&lt;p&gt;Enterprise Control Plane is an architectural pattern for managing multiple AWS accounts as a unified platform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Organization Structure:&lt;/strong&gt; AWS Organizations with OUs (Organizational Units) for different environments and teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Networking:&lt;/strong&gt; Transit Gateway connecting all accounts through hub-and-spoke model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Baseline:&lt;/strong&gt; Service Control Policies (SCPs) enforcing guardrails at the organization level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code:&lt;/strong&gt; Terraform/Terragrunt managing everything from a central repository&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Account Boundaries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production accounts: Isolated per application/team&lt;/li&gt;
&lt;li&gt;Non-prod accounts: Shared dev/staging to reduce overhead&lt;/li&gt;
&lt;li&gt;Platform accounts: Separate accounts for logging, monitoring, security tools&lt;/li&gt;
&lt;li&gt;Data accounts: Isolated for compliance and access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Network Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hub account with Transit Gateway&lt;/li&gt;
&lt;li&gt;VPC peering only where absolutely necessary&lt;/li&gt;
&lt;li&gt;Private subnet defaults for everything&lt;/li&gt;
&lt;li&gt;Centralized egress through NAT Gateway in hub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCPs prevent account-level misconfigurations&lt;/li&gt;
&lt;li&gt;IAM roles for cross-account access (no shared credentials)&lt;/li&gt;
&lt;li&gt;CloudTrail logs aggregated to security account&lt;/li&gt;
&lt;li&gt;GuardDuty and Security Hub in every account&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Terraform Structure
&lt;/h2&gt;

&lt;p&gt;We use Terragrunt to manage configurations across accounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ecp-ou-structure/&lt;/span&gt;     &lt;span class="c1"&gt;# Organization and account management&lt;/span&gt;
&lt;span class="s"&gt;ecp-network/&lt;/span&gt;          &lt;span class="c1"&gt;# Transit Gateway, VPCs, networking&lt;/span&gt;
&lt;span class="s"&gt;ecp-security/&lt;/span&gt;         &lt;span class="c1"&gt;# Security baseline, SCPs, IAM&lt;/span&gt;
&lt;span class="s"&gt;tf-live-aws-*/&lt;/span&gt;        &lt;span class="c1"&gt;# Application-specific infrastructure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with security:&lt;/strong&gt; SCPs first, then networking, then workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate account creation:&lt;/strong&gt; Manual account provisioning doesn't scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the why:&lt;/strong&gt; Every architectural decision needs context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for day 2:&lt;/strong&gt; Operations matter more than initial setup&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing ECP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced security incident blast radius by 90%&lt;/li&gt;
&lt;li&gt;Finance can now track costs by team and project&lt;/li&gt;
&lt;li&gt;New environments deploy in hours, not days&lt;/li&gt;
&lt;li&gt;Passed a SOC 2 audit with zero infrastructure findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-account AWS isn't just best practice—it's how you scale infrastructure beyond the startup phase.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>multiaccount</category>
      <category>ecp</category>
    </item>
    <item>
      <title>Stop Manually Updating Jira After Every PR Merge</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:10:48 +0000</pubDate>
      <link>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</link>
      <guid>https://forem.com/tallgray1/stop-manually-updating-jira-after-every-pr-merge-1c9p</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/automate-jira-github-actions/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You just merged a PR. Now you open Jira, find the ticket, paste the&lt;br&gt;
PR link in a comment, transition the status to Done, and update the&lt;br&gt;
deployed field. Five minutes. Twenty times a week. That's 1,700 minutes&lt;br&gt;
per year per engineer --- nearly 30 hours of pure mechanical overhead.&lt;/p&gt;

&lt;p&gt;And that's assuming you remember. On one team I worked with, we&lt;br&gt;
audited the last three months of merged PRs. Thirty percent of tickets&lt;br&gt;
had no update after merge. No comment, no transition, no link. The&lt;br&gt;
ticket just sat in In Dev until someone noticed during sprint&lt;br&gt;
review.&lt;/p&gt;

&lt;p&gt;The fix is two GitHub Actions workflows and a shared composite&lt;br&gt;
action. Here's exactly how to build it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Two workflows, one shared extraction layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 1&lt;/strong&gt;: Fires on PR creation --- posts a Jira
link comment to the PR so reviewers can navigate directly to the
ticket.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workflow 2&lt;/strong&gt;: Fires on PR merge to &lt;code&gt;main&lt;/code&gt;
--- posts a comment to the Jira ticket with the PR URL, commit SHA, and
who merged it, then transitions the ticket to Done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both workflows need to find the Jira ticket ID. Instead of&lt;br&gt;
duplicating that logic, we extract it into a composite action.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Composite Action for Ticket Extraction
&lt;/h2&gt;

&lt;p&gt;Create&lt;br&gt;
&lt;code&gt;.github/actions/extract-jira-ticket/action.yml&lt;/code&gt;.&lt;/p&gt;

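&lt;p&gt;The action looks for a ticket ID in the four places listed here, in priority order with the easiest to fix first. The code in this step implements the PR title and branch name checks; the commit message check is not included, and one unanchored regex covers both branch formats.&lt;/p&gt;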

&lt;ol&gt;
&lt;li&gt; PR title (simplest for the developer to correct)&lt;/li&gt;
&lt;li&gt; Commit messages&lt;/li&gt;
&lt;li&gt; Branch name in standard format:
&lt;code&gt;PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Branch name with prefix:
&lt;code&gt;feat/PROJECT-123-description&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract Jira Ticket&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extracts Jira ticket from PR title, commits, or branch name&lt;/span&gt;

&lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jira-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.jira_key }}&lt;/span&gt;
  &lt;span class="na"&gt;found&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.extract.outputs.found }}&lt;/span&gt;

&lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;using&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;composite&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract ticket ID&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
      &lt;span class="c1"&gt;# Pass PR fields through env so untrusted text is never pasted into the script&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PR_TITLE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.pull_request.title }}&lt;/span&gt;
        &lt;span class="na"&gt;BRANCH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.head_ref }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;JIRA_KEY=""&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 1: PR title&lt;/span&gt;
        &lt;span class="s"&gt;if [[ "$PR_TITLE" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
          &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;# Priority 2: Branch name (with or without a prefix such as feat/)&lt;/span&gt;
        &lt;span class="s"&gt;if [ -z "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;if [[ "$BRANCH" =~ ([A-Z]+-[0-9]+) ]]; then&lt;/span&gt;
            &lt;span class="s"&gt;JIRA_KEY="${BASH_REMATCH[1]}"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

        &lt;span class="s"&gt;if [ -n "$JIRA_KEY" ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "jira_key=$JIRA_KEY" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=true" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "found=false" &amp;gt;&amp;gt; "$GITHUB_OUTPUT"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




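&lt;p&gt;The regex &lt;code&gt;[A-Z]+-[0-9]+&lt;/code&gt; matches the common Jira key format: &lt;code&gt;PROJ-1&lt;/code&gt;, &lt;code&gt;IN-89&lt;/code&gt;, &lt;code&gt;INFRA-1234&lt;/code&gt;. It takes the first match, so put the ticket ID ahead of any other uppercase token that looks like one (&lt;code&gt;UTF-8&lt;/code&gt;, for example). If your project keys contain digits or lowercase letters, adjust the pattern accordingly.&lt;/p&gt;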
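&lt;p&gt;To sanity-check the pattern before wiring it into a workflow, here is a minimal sketch that runs the same regex against a few sample strings. The function name &lt;code&gt;extract_jira_key&lt;/code&gt; and the sample inputs are hypothetical; they are not part of the action.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical helper: prints the first Jira-style key found in its argument.
extract_jira_key() {
  local text="$1"
  # Same pattern as the composite action: uppercase letters, a dash, digits.
  if [[ "$text" =~ ([A-Z]+-[0-9]+) ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

extract_jira_key "PROJ-1: Fix login redirect"    # PR title
extract_jira_key "INFRA-1234-rotate-keys"        # standard branch name
extract_jira_key "feat/IN-89-add-healthcheck"    # prefixed branch name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The three calls print &lt;code&gt;PROJ-1&lt;/code&gt;, &lt;code&gt;INFRA-1234&lt;/code&gt;, and &lt;code&gt;IN-89&lt;/code&gt;. A string with no key makes the function return a non-zero status and print nothing.&lt;/p&gt;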
&lt;h2&gt;
  
  
  Step 2: PR Creation Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/link-jira-on-pr.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is opened and posts a formatted comment with the&lt;br&gt;
Jira ticket link. If no ticket is found, it posts a warning so the&lt;br&gt;
author knows to add one --- before review, not after.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Link Jira on PR&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;link-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post Jira link comment&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: `📋 Jira: [${{ steps.jira.outputs.jira-key }}](${{ secrets.JIRA_BASE_URL }}/browse/${{ steps.jira.outputs.jira-key }})`&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Warn if no ticket found&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'false'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: '⚠️ No Jira ticket found. Add a ticket ID to the PR title (e.g., `PROJ-123: Your title`).'&lt;/span&gt;
            &lt;span class="s"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The warning step matters. It creates a feedback loop that trains the&lt;br&gt;
team to include ticket IDs upfront. Within a few weeks, the warning&lt;br&gt;
fires rarely.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: PR Merge Workflow
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/update-jira-on-merge.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This fires when a PR is closed against &lt;code&gt;main&lt;/code&gt;. The&lt;br&gt;
&lt;code&gt;if: github.event.pull_request.merged == true&lt;/code&gt; guard is&lt;br&gt;
important --- the &lt;code&gt;closed&lt;/code&gt; event also fires for PRs that are&lt;br&gt;
closed without merging.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Jira on Merge&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;closed&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;update-jira&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.event.pull_request.merged == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/extract-jira-ticket&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jira&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jira-base-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_BASE_URL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-user-email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_USER_EMAIL }}&lt;/span&gt;
          &lt;span class="na"&gt;jira-api-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JIRA_API_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post merge comment to Jira&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \&lt;/span&gt;
            &lt;span class="s"&gt;-u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json" \&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/comment" \&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"body\": \"PR merged: #${{ github.event.pull_request.number }} ${{ github.event.pull_request.html_url }}\nCommit: ${{ github.sha }}\nBy: ${{ github.event.pull_request.merged_by.login }}\"}")&lt;/span&gt;

          &lt;span class="s"&gt;echo "Jira comment HTTP status: $HTTP_STATUS"&lt;/span&gt;
          &lt;span class="s"&gt;[ "$HTTP_STATUS" -eq 201 ] &amp;amp;&amp;amp; echo "✅ Comment posted" || echo "⚠️ Comment failed (non-critical)"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Transition ticket to Done&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;steps.jira.outputs.found == 'true'&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;TRANSITION_ID="${{ secrets.JIRA_DONE_TRANSITION_ID }}"&lt;/span&gt;
          &lt;span class="s"&gt;[ -z "$TRANSITION_ID" ] &amp;amp;&amp;amp; echo "No transition ID configured, skipping" &amp;amp;&amp;amp; exit 0&lt;/span&gt;

          &lt;span class="s"&gt;HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \&lt;/span&gt;
            &lt;span class="s"&gt;-u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Content-Type: application/json" \&lt;/span&gt;
            &lt;span class="s"&gt;-X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/transitions" \&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"transition\": {\"id\": \"$TRANSITION_ID\"}}")&lt;/span&gt;
          &lt;span class="s"&gt;[ "$HTTP_STATUS" -eq 204 ] &amp;amp;&amp;amp; echo "✅ Transitioned to Done" || echo "⚠️ Transition failed (HTTP $HTTP_STATUS, non-critical)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The comment step uses an HTTP status check rather than relying on&lt;br&gt;
curl's exit code. A failed comment doesn't fail the job --- the PR already&lt;br&gt;
merged, and a missing notification shouldn't generate noise in CI. The&lt;br&gt;
transition step is fully optional: if&lt;br&gt;
&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; isn't set, it skips silently. This&lt;br&gt;
lets you start with just comments and add transitions once you've&lt;br&gt;
verified the workflow runs cleanly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Finding Your Transition IDs
&lt;/h2&gt;

&lt;p&gt;Transition IDs are project-specific. There's no universal "Done" ID.&lt;br&gt;
Run this against any ticket in your project to find yours:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_EMAIL&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$JIRA_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/rest/api/2/issue/&lt;/span&gt;&lt;span class="nv"&gt;$TICKET_KEY&lt;/span&gt;&lt;span class="s2"&gt;/transitions"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.transitions[] | "ID: \(.id) | \(.name)"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Example output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ID: 91 | Done
ID: 31 | In Review
ID: 21 | In Progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set the Done ID as &lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt; in your&lt;br&gt;
repository secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Jira API Versions
&lt;/h2&gt;

&lt;p&gt;Use the v2 API: &lt;code&gt;/rest/api/2/&lt;/code&gt;. Some teams try v3 and get empty error responses, &lt;code&gt;{"errorMessages":[],"errors":{}}&lt;/code&gt;, that look exactly like auth failures. It's not auth. The v3 API expects rich-text fields such as comment bodies in Atlassian Document Format (ADF) rather than as plain strings, and the error returned for a malformed body says little. v2 accepts plain strings, is well-documented, and works consistently.&lt;/p&gt;
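&lt;p&gt;For comparison, here is a rough sketch of the two comment payloads, built with &lt;code&gt;jq&lt;/code&gt; so the text is JSON-escaped instead of hand-quoted. The function names are hypothetical, and the v3 body shows only the minimal ADF nesting (document, paragraph, text); treat it as an illustration of the shape rather than a tested integration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Sketch only: hypothetical helpers that print Jira comment payloads.

# v2 (/rest/api/2/issue/KEY/comment): the body is a plain string.
v2_comment_body() {
  jq -n --arg text "$1" '{body: $text}'
}

# v3 (/rest/api/3/issue/KEY/comment): the body is an ADF document.
v3_comment_body() {
  jq -n --arg text "$1" '{
    body: {
      type: "doc",
      version: 1,
      content: [
        {type: "paragraph", content: [{type: "text", text: $text}]}
      ]
    }
  }'
}

v2_comment_body "PR merged: #42"
v3_comment_body "PR merged: #42"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Either payload can be piped to &lt;code&gt;curl&lt;/code&gt; with &lt;code&gt;--data-binary @-&lt;/code&gt;, which also avoids escaping quotes by hand inside the workflow.&lt;/p&gt;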

&lt;h2&gt;
  
  
  Required Secrets
&lt;/h2&gt;

&lt;p&gt;Add these to your GitHub repository secrets:&lt;/p&gt;

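&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_BASE_URL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://yourorg.atlassian.net&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_USER_EMAIL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The email address tied to your API token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_API_TOKEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate at id.atlassian.com → Security → API tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JIRA_DONE_TRANSITION_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Optional; from the transitions API call above&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;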

&lt;p&gt;For org-wide rollout, set these as organization secrets and restrict&lt;br&gt;
to relevant repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After rolling this out across a team of eight engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Zero manual Jira updates after merge&lt;/li&gt;
&lt;li&gt;  Forgotten ticket updates dropped from 30% to 0%&lt;/li&gt;
&lt;li&gt;  Roughly 1,700 minutes per year recovered per engineer&lt;/li&gt;
&lt;li&gt;  Every merged PR has a complete audit trail: PR number, URL, commit
SHA, who merged it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The composite action pattern also means when you need to extend this&lt;br&gt;
--- adding a Slack notification on merge, posting to Confluence --- you&lt;br&gt;
extend one file, not two.&lt;/p&gt;

&lt;p&gt;If you're building out automation like this across your engineering&lt;br&gt;
platform and want a second opinion on the design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I'm available for advisory engagements&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cicd</category>
      <category>devops</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>How I Manage Claude Code Context Across 20+ Repositories</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Fri, 20 Mar 2026 06:57:13 +0000</pubDate>
      <link>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</link>
      <guid>https://forem.com/tallgray1/how-i-manage-claude-code-context-across-20-repositories-5b16</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/blog/managing-claude-code-context-multi-repo/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Three months ago I was re-explaining my Terragrunt state backend to Claude for the third time in a week. Different session, same repo I'd worked in the session before. Claude had no idea I was even in the same project.&lt;/p&gt;

&lt;p&gt;I run Claude Code daily across a 6-account AWS platform monorepo, a&lt;br&gt;
personal consulting site, homelab infrastructure, and a handful of side&lt;br&gt;
projects. Every session started with the same five minutes of "here's&lt;br&gt;
the project, here are the conventions, here's the Jira workflow" --- and&lt;br&gt;
still ended with Claude suggesting patterns that didn't fit the&lt;br&gt;
environment, because I'd inevitably forgotten to mention something.&lt;/p&gt;

&lt;p&gt;After three months of broken symlinks and abandoned experiments, I&lt;br&gt;
landed on a three-tier context hierarchy that loads the right context&lt;br&gt;
automatically depending on which directory I'm working in --- and I manage&lt;br&gt;
all of it from a single dotfiles repo.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem with Single-File Context
&lt;/h2&gt;

&lt;p&gt;Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from the current directory and from each parent directory above it, along with the user-level &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;. Most teams start with one file and put everything in it.&lt;/p&gt;

&lt;p&gt;That breaks down quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global preferences get mixed with project
specifics.&lt;/strong&gt; Your "use snake_case for variable names" preference
shouldn't live next to your Terraform state bucket configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials and account IDs end up in files you accidentally
commit.&lt;/strong&gt; Put AWS account IDs in a shared CLAUDE.md, and someone
will eventually push it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't share patterns across repos without
duplication.&lt;/strong&gt; Every new repo gets a fresh copy of the same
conventions, and updates never propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-employer context creates conflicts.&lt;/strong&gt; Your
consulting client's Jira workflow shouldn't contaminate your personal
project sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first attempt at fixing this was a shared scripts directory. The&lt;br&gt;
three-tier hierarchy came later, after I figured out what was actually&lt;br&gt;
wrong with the simpler approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  My First Attempt: A Shared Scripts Directory
&lt;/h2&gt;

&lt;p&gt;Before landing on the three-tier system, I built something more&lt;br&gt;
obvious: a &lt;code&gt;~/shared-claude-infra/&lt;/code&gt; directory containing a&lt;br&gt;
&lt;code&gt;setup-project.sh&lt;/code&gt; script that initialized&lt;br&gt;
&lt;code&gt;.claude/&lt;/code&gt; context for each new repo.&lt;/p&gt;

&lt;p&gt;The script created the directory structure and symlinked a&lt;br&gt;
&lt;code&gt;rules/shared/&lt;/code&gt; folder back to the shared repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p "$PROJECT_DIR/.claude/rules"
ln -s ~/shared-claude-infra/rules "$PROJECT_DIR/.claude/rules/shared"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked for the first two repos I configured. Then the problems&lt;br&gt;
compounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual per-project setup.&lt;/strong&gt; Every new repo required
running the script. Miss one, and that repo has no shared context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two repos to maintain.&lt;/strong&gt; The shared infrastructure
lived in its own git repo, separate from dotfiles. Two places to update
when conventions changed, and they drifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested symlinks instead of directory-level
symlinks.&lt;/strong&gt; The &lt;code&gt;rules/shared&lt;/code&gt; symlink lived deep
inside the project's &lt;code&gt;.claude/&lt;/code&gt; tree. When the target moved,
every project that had run the script got a broken symlink ---
silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded paths that drifted.&lt;/strong&gt; The script referenced
workspace paths from three months earlier. My actual directory layout
had changed; the script still pointed at the old locations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I eventually deleted the shared directory, a quick&lt;br&gt;
&lt;code&gt;find&lt;/code&gt; confirmed broken symlinks scattered across every repo&lt;br&gt;
that had run the setup script. The approach was inherently fragile&lt;br&gt;
because it depended on every machine, every repo, and every workspace&lt;br&gt;
path staying synchronized manually.&lt;/p&gt;
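&lt;p&gt;A check along these lines is enough to surface the damage. This is a minimal sketch, not the exact command I ran; the function name is hypothetical, and it sticks to &lt;code&gt;find&lt;/code&gt; options that behave the same on GNU and BSD systems.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical helper: prints every symlink under a directory whose target is missing.
list_broken_symlinks() {
  local root="${1:-.}"
  # -type l selects symlinks; test -e follows the link, so it fails when the target is gone.
  find "$root" -type l ! -exec test -e {} \; -print
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;list_broken_symlinks "$HOME/work"&lt;/code&gt; prints one path per dangling link and nothing when every link resolves.&lt;/p&gt;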

&lt;p&gt;The fix isn't a smarter script. It's inverting the relationship:&lt;br&gt;
instead of a script that runs once per project, use dotfiles that wire&lt;br&gt;
context automatically based on what directories exist.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Three-Tier Hierarchy
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/CLAUDE.md              ← Global: preferences, style, git workflow
~/work/{employer}/.claude/       ← Org: team structure, AWS accounts, Jira workflow
~/work/{employer}/{repo}/.claude/ ← Project: repo architecture, active tickets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Claude Code walks up the directory tree loading&lt;br&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; files at each level. Each tier handles a specific&lt;br&gt;
scope:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global tier&lt;/strong&gt; (&lt;code&gt;~/.claude/&lt;/code&gt;): Everything&lt;br&gt;
that applies across all work --- communication style, git commit format,&lt;br&gt;
PR description templates, universal infrastructure patterns. No&lt;br&gt;
credentials, no account IDs, nothing employer-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Org tier&lt;/strong&gt; (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;):&lt;br&gt;
Team structure, Jira project keys, AWS account layout, CI/CD pipeline&lt;br&gt;
conventions. Sensitive patterns (account IDs, VPC IDs, state bucket&lt;br&gt;
names) go in gitignored files within this directory. Reusable patterns&lt;br&gt;
(CI/CD templates, AWS patterns without specifics) go in committed&lt;br&gt;
files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project tier&lt;/strong&gt;&lt;br&gt;
(&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;): Architecture decisions&lt;br&gt;
for this specific repo, active tickets, ongoing work state. Always&lt;br&gt;
gitignored --- this is ephemeral working context that changes&lt;br&gt;
frequently.&lt;/p&gt;
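&lt;p&gt;To see which tiers apply to a given directory, a small audit helper can walk the same path upward. This is only a sketch for inspecting your own layout, not Claude Code's actual loading logic, and &lt;code&gt;list_context_dirs&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

```shell
# Print every .claude entry found between the given directory and /,
# nearest first (project, then org, then anything higher up).
list_context_dirs() {
  dir=$(cd "$1" || exit 1; pwd) || return 1
  while :; do
    if [ -e "$dir/.claude" ]; then
      printf '%s\n' "$dir/.claude"
    fi
    if [ "$dir" = "/" ]; then
      break
    fi
    dir=$(dirname "$dir")
  done
}
```

&lt;p&gt;Running &lt;code&gt;list_context_dirs ~/work/{employer}/{repo}&lt;/code&gt; should list the project and org directories in that order; the global &lt;code&gt;~/.claude/&lt;/code&gt; appears too whenever the work tree sits under your home directory.&lt;/p&gt;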
&lt;h2&gt;
  
  
  Implementation: Symlinks from Dotfiles
&lt;/h2&gt;

&lt;p&gt;The hierarchy only works if it's consistent across machines. I manage&lt;br&gt;
all context files from a dotfiles repo using symlinks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotfiles/claude/
├── global/          → symlinked to ~/.claude/
├── {employer}/      → symlinked to ~/work/{employer}/.claude/
└── {personal}/      → symlinked to ~/personal/.claude/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; wires these automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Global context (-n replaces an existing link instead of nesting a new one inside it)
ln -sfn "$DOTFILES/claude/global" "$HOME/.claude"

# Per-employer context
for employer in "${EMPLOYERS[@]}"; do
  WORK_DIR="$HOME/work/$employer"
  if [ -d "$WORK_DIR" ]; then
    ln -sfn "$DOTFILES/claude/$employer" "$WORK_DIR/.claude"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



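&lt;p&gt;Any machine that runs &lt;code&gt;install.sh&lt;/code&gt; gets the same context&lt;br&gt;
hierarchy. Because the targets are symlinks, edits in the dotfiles checkout&lt;br&gt;
take effect immediately on that machine; other machines pick them up on&lt;br&gt;
their next pull.&lt;/p&gt;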
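&lt;p&gt;One edge case worth guarding against is a real &lt;code&gt;.claude&lt;/code&gt; directory already sitting at the destination. A hedged sketch of a wrapper that replaces old links but refuses to touch real directories; &lt;code&gt;link_context&lt;/code&gt; is a hypothetical helper, not part of the script above:&lt;/p&gt;

```shell
# Point DEST at SRC. An existing symlink (even a dangling one) is replaced;
# a real file or directory is left alone and reported.
link_context() {
  src=$1
  dest=$2
  if [ -L "$dest" ]; then
    rm "$dest"
  elif [ -e "$dest" ]; then
    printf 'skip: %s exists and is not a symlink\n' "$dest"
    return 1
  fi
  ln -s "$src" "$dest"
}
```

&lt;p&gt;Used as &lt;code&gt;link_context "$DOTFILES/claude/global" "$HOME/.claude"&lt;/code&gt;, it stays safe to re-run and never silently buries a link inside an existing directory.&lt;/p&gt;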
&lt;h2&gt;
  
  
  What Each Level Contains
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Global (&lt;code&gt;~/.claude/&lt;/code&gt;)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/
├── CLAUDE.md          # Preferences, active work summary
└── rules/
    ├── git-workflow.md
    ├── pr-patterns.md
    ├── infrastructure.md
    └── context-management.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is short --- preferences and a pointer to where&lt;br&gt;
active work lives. The heavy lifting goes in &lt;code&gt;rules/&lt;/code&gt; files&lt;br&gt;
that Claude loads as supplementary context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Communication Style
- Be direct and technical — I understand infrastructure concepts
- Explain the "why" behind decisions
- Provide specific file paths and line numbers

## Git Workflow
- Branch format: feat/TICKET-123-description
- Commit format: [TICKET-123] Brief summary\n\nWhy this change...
- Never add Co-Authored-By trailers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Org Level (&lt;code&gt;~/work/{employer}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/work/{employer}/.claude/
├── {EMPLOYER}.md              # Team structure, Jira workflow — committed
└── rules/
    ├── cicd-patterns.md       # CI/CD conventions — committed
    ├── aws-patterns.md        # Account IDs, VPC IDs — GITIGNORED
    └── terraform-patterns.md  # State config, module paths — GITIGNORED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The privacy split matters. &lt;code&gt;cicd-patterns.md&lt;/code&gt; contains&lt;br&gt;
reusable GitHub Actions patterns --- fine to commit.&lt;br&gt;
&lt;code&gt;aws-patterns.md&lt;/code&gt; contains actual account IDs --- stays&lt;br&gt;
local.&lt;/p&gt;

&lt;p&gt;A typical employer context file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Team Structure
- Platform team: 5 engineers, all in Jira project IT
- AWS accounts: dev, nonprod, prod (+ 3 infra accounts: network, security, management)
- Monorepo: ~/work/{employer}/iac — Terragrunt, 78 components

## Jira Workflow
- IT project (infrastructure): transition IDs 3=In Dev, 8=Needs Review, 9=Done
- Prefix all commits: [IT-XXX]
- API: REST v2 only — v3 silently returns empty responses

## CI/CD
- GitHub Actions with OIDC to AWS (no long-lived credentials)
- PR requires: terraform plan output posted as comment
- Merge to main triggers auto-deploy to nonprod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gitignored &lt;code&gt;aws-patterns.md&lt;/code&gt; contains account IDs and&lt;br&gt;
specific ARNs that Claude needs for generating Terraform configurations&lt;br&gt;
accurately but shouldn't be committed anywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Level (&lt;code&gt;~/work/{employer}/{repo}/.claude/&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Project context is ephemeral and always gitignored. It's the working&lt;br&gt;
memory for an ongoing effort:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Current State
- Branch: feat/IT-89-my-app-dev-ecr
- Active ticket: IT-89 — ECR in dev
- Next: IT-90 — ECS task definition

## Architecture Decisions
- ECR in dev only; cross-account pull policies for nonprod and prod
- Mutable tags in dev, immutable in nonprod/prod
- KMS key per environment, not per repository

## Blockers
- Waiting on network DNS zone creation before cutover can proceed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I update this file at the end of each session with current state so&lt;br&gt;
the next session loads instantly without re-explaining where things&lt;br&gt;
are.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Privacy Model
&lt;/h2&gt;

&lt;p&gt;The critical insight is that context files need two categories:&lt;br&gt;
committed (shareable) and local-only (sensitive).&lt;/p&gt;

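&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Committed?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Personal preferences&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git workflow rules&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.claude/rules/git-workflow.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team structure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{employer}/.claude/{EMPLOYER}.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ sanitized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD patterns&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/cicd-patterns.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS account IDs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/aws-patterns.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ gitignored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC IDs, state config&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{employer}/.claude/rules/terraform-patterns.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ gitignored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active ticket state&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{repo}/.claude/OVERRIDES.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ gitignored&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;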

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; at the dotfiles level handles this&lt;br&gt;
automatically by ignoring &lt;code&gt;**/aws-patterns.md&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;**/terraform-patterns.md&lt;/code&gt; across all employer&lt;br&gt;
directories.&lt;/p&gt;
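&lt;p&gt;Ignore rules only protect files with the expected names, so a second check on the committed files is cheap insurance. A minimal sketch, assuming the thing to catch is a 12-digit AWS account ID; &lt;code&gt;scan_for_account_ids&lt;/code&gt; is a hypothetical helper and the pattern will also flag any other standalone 12-digit number:&lt;/p&gt;

```shell
# Print file:line:text for anything that looks like a 12-digit account ID.
# The surrounding groups keep longer digit runs from matching.
# Exit status follows grep: 0 when something was found, 1 when clean.
scan_for_account_ids() {
  grep -rnE '(^|[^0-9])[0-9]{12}([^0-9]|$)' "$@"
}
```

&lt;p&gt;Pointing it at the committed part of the dotfiles repo before a push gives a quick list of lines to sanitize or move into a gitignored file.&lt;/p&gt;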
&lt;h2&gt;
  
  
  Custom Commands
&lt;/h2&gt;

&lt;p&gt;Beyond context files, Claude Code supports custom&lt;br&gt;
&lt;code&gt;/commands&lt;/code&gt; --- reusable prompts stored as markdown files:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/commands/
├── checkpoint.md       # Create context snapshot
├── sync-work.md        # Update active work status
└── pr-ready.md         # Generate PR description
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;A command file is just the prompt Claude should execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# checkpoint.md
Create a context checkpoint. Read the current git status across active repos,
summarize open PRs and their status, list active tickets with their current
state, and write a structured summary to ~/.claude/local.md. Include any
blocking issues and the next planned action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commands at the global level are available everywhere. Org-level&lt;br&gt;
commands handle employer-specific workflows like Jira transitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Solves in Practice
&lt;/h2&gt;

&lt;p&gt;Before this system: every new Claude session started with "here's the&lt;br&gt;
project, here are the conventions, here's where things are." Five&lt;br&gt;
minutes of ramp-up, inconsistent outputs because I'd forget to mention&lt;br&gt;
something.&lt;/p&gt;

&lt;p&gt;After: I &lt;code&gt;cd&lt;/code&gt; into a repo and Claude already knows the&lt;br&gt;
Jira workflow, the AWS account structure, the naming conventions, and&lt;br&gt;
where the active work stands. When I start a session mid-ticket, the&lt;br&gt;
project-level context tells Claude exactly what was in progress.&lt;/p&gt;

&lt;p&gt;The bigger payoff is consistency. When Claude generates Terraform, it&lt;br&gt;
generates it with the correct state backend. When it writes commit&lt;br&gt;
messages, they follow the format reviewers expect. When it suggests&lt;br&gt;
architecture, it fits the actual account model rather than a generic AWS&lt;br&gt;
example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you're starting from scratch, work tier by tier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; with your communication
preferences and git conventions.&lt;/li&gt;
&lt;li&gt; Add a &lt;code&gt;rules/&lt;/code&gt; directory with patterns you want loaded
consistently.&lt;/li&gt;
&lt;li&gt; Create an org-level directory when you start working with a specific
employer or major project.&lt;/li&gt;
&lt;li&gt; Add project-level context when you start a multi-session
effort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't try to build the whole system at once. The global tier alone&lt;br&gt;
eliminates most of the per-session ramp-up. The org and project tiers&lt;br&gt;
pay off as work gets more complex.&lt;/p&gt;

&lt;p&gt;The thing that surprised me most wasn't the time saved on ramp-up. It&lt;br&gt;
was how much the output quality improved. When Claude knows the actual&lt;br&gt;
state backend, the actual account IDs, the actual PR format your&lt;br&gt;
reviewers expect --- the suggestions it makes fit your environment. That&lt;br&gt;
gap between "technically correct" and "actually usable" is where most of&lt;br&gt;
the friction in AI-assisted infrastructure work lives. The context&lt;br&gt;
hierarchy is mostly just closing that gap.&lt;/p&gt;

&lt;p&gt;If you're setting up Claude Code for a platform team and want to talk&lt;br&gt;
through the context design, &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;I do advisory&lt;br&gt;
engagements&lt;/a&gt; for teams getting serious about AI tooling in their&lt;br&gt;
infrastructure workflow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>developertooling</category>
      <category>devops</category>
    </item>
    <item>
      <title>Great design, and easy to follow.</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:23:46 +0000</pubDate>
      <link>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</link>
      <guid>https://forem.com/tallgray1/great-design-and-easy-to-follow-109b</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/cbecerra" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3545716%2Fa8cbf641-51dd-4f99-ad6b-abe0f714fa3b.jpeg" alt="cbecerra"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/cbecerra/how-to-implement-aws-network-firewall-in-a-multi-account-architecture-using-transit-gateway-2nam" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Implement AWS Network Firewall in a Multi-Account Architecture Using Transit Gateway&lt;/h2&gt;
      &lt;h3&gt;Cristhian Becerra ・ Oct 13 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#english&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#networking&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cybersecurity&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>english</category>
      <category>aws</category>
      <category>networking</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Building Automated AWS Permission Testing Infrastructure for CI/CD</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:14:50 +0000</pubDate>
      <link>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</link>
      <guid>https://forem.com/tallgray1/building-automated-aws-permission-testing-infrastructure-for-cicd-42pk</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://graycloudarch.com/aws-permission-testing-cicd/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I deployed a permission set for our data engineers five times before&lt;br&gt;
it worked correctly.&lt;/p&gt;

&lt;p&gt;The first deployment: S3 reads worked, Glue Data Catalog reads&lt;br&gt;
worked. Athena queries failed --- the query engine needs KMS decrypt&lt;br&gt;
through a service principal, and I'd missed the&lt;br&gt;
&lt;code&gt;kms:ViaService&lt;/code&gt; condition. Second deployment: Athena worked.&lt;br&gt;
EMR Serverless job submission failed --- missing&lt;br&gt;
&lt;code&gt;iam:PassRole&lt;/code&gt;. Third deployment: EMR submission worked. Job&lt;br&gt;
execution failed --- missing permissions on the EMR Serverless execution&lt;br&gt;
role boundary. I kept deploying, engineers kept getting blocked, I kept&lt;br&gt;
opening tickets.&lt;/p&gt;

&lt;p&gt;Five iterations. Two weeks. Every failure meant a data engineer&lt;br&gt;
opened a ticket instead of running their job.&lt;/p&gt;

&lt;p&gt;The problem wasn't that IAM is complicated --- it is, but that's&lt;br&gt;
expected. The problem was that I had no way to catch these issues before&lt;br&gt;
deploying to the account where real engineers were trying to do real&lt;br&gt;
work. Every bug was a production bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  The "Access Denied" Debugging Loop
&lt;/h2&gt;

&lt;p&gt;Here's what the reactive debugging cycle looks like from the&lt;br&gt;
inside.&lt;/p&gt;

&lt;p&gt;Engineer opens a ticket:&lt;br&gt;
&lt;code&gt;AccessDeniedException: User is not authorized to perform: s3:GetObject&lt;/code&gt;.&lt;br&gt;
I add &lt;code&gt;s3:GetObject&lt;/code&gt; to the permission set. Next day:&lt;br&gt;
&lt;code&gt;AccessDeniedException: s3:PutObject&lt;/code&gt;. I add&lt;br&gt;
&lt;code&gt;s3:PutObject&lt;/code&gt;. Day after: write succeeds but cleanup fails ---&lt;br&gt;
&lt;code&gt;s3:DeleteObject&lt;/code&gt;. At this point I've done four deployment&lt;br&gt;
cycles and two days of work to get S3 read/write/delete working. If I'd&lt;br&gt;
just added &lt;code&gt;s3:*&lt;/code&gt; I'd be done, but that violates&lt;br&gt;
least-privilege and opens the raw zone to write access, which we&lt;br&gt;
explicitly don't want.&lt;/p&gt;

&lt;p&gt;The deeper issue is that individual services don't fail atomically.&lt;br&gt;
Athena requires &lt;code&gt;athena:StartQueryExecution&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryResults&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;athena:GetQueryExecution&lt;/code&gt;, but it also requires KMS decrypt&lt;br&gt;
through the Athena service principal to read encrypted S3 results. That&lt;br&gt;
last piece isn't in the Athena docs --- you find it by failing in&lt;br&gt;
production.&lt;/p&gt;

&lt;p&gt;I wanted a way to find it before deploying.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The testing framework has four components: per-persona permission set&lt;br&gt;
templates, a Bash test library, per-service test scripts, and a GitHub&lt;br&gt;
Actions workflow that runs everything on pull requests.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  GitHub Pull Request (Permission Set Changes)   │
└───────────────────┬─────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │  CI/CD Workflow     │
         │  (GitHub Actions)   │
         └──────────┬──────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌───────┐      ┌──────────┐   ┌──────────┐
│ S3    │      │  Glue    │   │ Athena   │
│ Tests │      │  Tests   │   │  Tests   │
└───────┘      └──────────┘   └──────────┘
                    │
         ┌──────────▼──────────┐
         │  Test Report        │
         │  (Posted to PR)     │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow triggers on any pull request that modifies the&lt;br&gt;
identity-center Terraform directory. Tests run against real AWS accounts&lt;br&gt;
--- dev and nonprod --- using test credentials provisioned for that purpose.&lt;br&gt;
Results post as a PR comment before anyone approves the change.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 1: Pre-Validated Templates
&lt;/h2&gt;

&lt;p&gt;Before I wrote a single test, I needed a starting point for&lt;br&gt;
permission sets that captured the patterns I'd learned the hard way.&lt;br&gt;
Templates that handle the non-obvious pieces --- zone-scoped S3 access,&lt;br&gt;
KMS conditions tied to specific services, explicit denies for&lt;br&gt;
destructive operations.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;AnalystAccess&lt;/code&gt; template is representative. Analysts&lt;br&gt;
get read-only access to the curated zone of the data lake, Athena query&lt;br&gt;
execution in the primary workgroup, and KMS decrypt --- but only when the&lt;br&gt;
decrypt request originates from S3 or Athena, not from arbitrary API&lt;br&gt;
calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;inline_policy&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
  &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GlueCatalogReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:GetPartitions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:SearchTables"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:database/curated_*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"arn:aws:glue:*:*:table/curated_*/*"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;# s3:prefix only exists on ListBucket requests, so the condition lives here&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3CuratedListOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"s3:prefix"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curated/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3CuratedReadOnly"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::lake-bucket-*/curated/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AthenaQueryExecution"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"athena:StartQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:GetQueryResults"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena:StopQueryExecution"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:athena:*:*:workgroup/primary"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMSDecryptViaSvc"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:*:*:key/*"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"kms:ViaService"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"athena.us-east-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DenyDestructiveOps"&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Deny"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:DeleteObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:DeleteBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"glue:DeleteTable"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kms:ViaService&lt;/code&gt; condition is the piece that took five&lt;br&gt;
production failures to discover. KMS decrypt without that condition&lt;br&gt;
allows an analyst to call &lt;code&gt;kms:Decrypt&lt;/code&gt; directly from their&lt;br&gt;
shell, which is not what we want. The condition locks decrypt to&lt;br&gt;
requests that pass through S3 or Athena specifically.&lt;/p&gt;

&lt;p&gt;The explicit deny block matters too. Without it, if someone later&lt;br&gt;
grants broader S3 permissions to this persona for a different reason,&lt;br&gt;
the curated zone protection evaporates. The deny creates a hard floor&lt;br&gt;
regardless of what else gets added.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 2: The Test Framework
&lt;/h2&gt;

&lt;p&gt;I chose Bash over Python or a proper test framework deliberately. The&lt;br&gt;
tests run in CI with no dependencies beyond the AWS CLI --- no package&lt;br&gt;
installs, no virtual environments, no version pinning of test libraries.&lt;br&gt;
The machines running these tests already have the AWS CLI.&lt;/p&gt;

&lt;p&gt;The core library in &lt;code&gt;lib/test-framework.sh&lt;/code&gt;:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"

  if eval "$test_command" &amp;amp;&amp;gt;/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name - $description"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  # Use an if block so the function still returns 0 when nothing failed
  if [ ${#TESTS_FAILED[@]} -gt 0 ]; then
    printf '  - %s\n' "${TESTS_FAILED[@]}"
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The most important design decision in the test scripts is testing&lt;br&gt;
denials as carefully as allowances. Testing only what should succeed&lt;br&gt;
tells you the permission set isn't obviously broken. Testing what should&lt;br&gt;
fail tells you it's not accidentally too permissive.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Test what should succeed
run_test "s3-list-curated" \
  "aws s3 ls s3://lake-bucket-dev/curated/" \
  "Analyst can list curated zone"

# Test what should fail (negative test): passes only when the call is denied.
# Assumes /tmp/test.txt exists; otherwise cp fails before it reaches S3.
run_test "s3-write-denied" \
  "aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'" \
  "Analyst cannot write to curated zone"

run_test "s3-raw-zone-denied" \
  "aws s3 ls s3://lake-bucket-dev/raw/ 2&amp;gt;&amp;amp;1 | grep -q 'AccessDenied'" \
  "Analyst cannot access raw zone"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
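&lt;p&gt;A denial test is only trustworthy if it separates "denied" from "failed for some other reason", such as a network error or a typo in the bucket name. A minimal sketch of a helper that makes that distinction; &lt;code&gt;expect_denied&lt;/code&gt; is a hypothetical name, and it assumes the CLI writes an error containing &lt;code&gt;AccessDenied&lt;/code&gt; to stderr:&lt;/p&gt;

```shell
# Succeed only when the command fails AND its stderr names an access-denied error.
expect_denied() {
  errfile=$(mktemp)
  if "$@" >/dev/null 2>"$errfile"; then
    rm -f "$errfile"
    return 1                # the call was allowed: the policy is too permissive
  fi
  if grep -q 'AccessDenied' "$errfile"; then
    rm -f "$errfile"
    return 0                # denied for the right reason
  fi
  rm -f "$errfile"
  return 1                  # failed, but not because of permissions
}
```

&lt;p&gt;For example, &lt;code&gt;expect_denied aws s3 ls s3://lake-bucket-dev/raw/&lt;/code&gt; passes only when the raw zone is actually blocked, and the whole call can be handed to &lt;code&gt;run_test&lt;/code&gt; as its command string.&lt;/p&gt;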




&lt;p&gt;Beyond service-level tests, I run persona tests that simulate&lt;br&gt;
end-to-end workflows. An analyst's workflow isn't "call S3, then call&lt;br&gt;
Athena separately" --- it's "run an Athena query that reads encrypted S3&lt;br&gt;
data and writes results to the query results bucket." That integration&lt;br&gt;
test catches failures that individual service tests miss. The original&lt;br&gt;
five-iteration DataPlatformAccess failure? An individual S3 test would&lt;br&gt;
have passed. A persona test running an actual Athena query against the&lt;br&gt;
encrypted lake would have caught the KMS gap.&lt;/p&gt;
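&lt;p&gt;A persona test of that kind can be sketched with two AWS CLI calls: start the query, then poll until it settles. This is an illustration, not the script from my repo. The function name, the &lt;code&gt;POLL_SECONDS&lt;/code&gt; variable, and the 30-attempt limit are assumptions, and it presumes the workgroup already has a query result location configured:&lt;/p&gt;

```shell
# Run one Athena query end to end; succeed only if it reaches SUCCEEDED.
# A KMS or S3 permission gap surfaces here as a FAILED state.
athena_persona_test() {
  sql=$1
  workgroup=${2:-primary}
  qid=$(aws athena start-query-execution \
    --query-string "$sql" \
    --work-group "$workgroup" \
    --query 'QueryExecutionId' --output text) || return 1
  tries=0
  while [ "$tries" -lt 30 ]; do
    state=$(aws athena get-query-execution \
      --query-execution-id "$qid" \
      --query 'QueryExecution.Status.State' --output text) || return 1
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|CANCELLED) return 1 ;;
    esac
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-2}"
  done
  return 1   # still queued or running after the last attempt
}
```

&lt;p&gt;Pointing it at a small table in the encrypted curated zone exercises Glue, S3, Athena, and KMS in one pass, which is exactly the path the service-level tests skip.&lt;/p&gt;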
&lt;h2&gt;
  
  
  Phase 3: CI/CD Integration
&lt;/h2&gt;

&lt;p&gt;The GitHub Actions workflow triggers on pull requests that touch the&lt;br&gt;
identity-center Terraform directory, runs tests in a matrix against dev&lt;br&gt;
and nonprod, and posts a summary comment to the PR.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          # Account IDs below are placeholders; substitute your own
          - environment: workloads-dev
            account: "111111111111"
          - environment: workloads-nonprod
            account: "222222222222"
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.account }}:role/github-actions-role
          aws-region: us-east-1  # assumed region; match your accounts
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;id-token: write&lt;/code&gt; permission is required for OIDC&lt;br&gt;
authentication to AWS --- the workflow assumes a role in each account&lt;br&gt;
rather than using long-lived credentials in GitHub Secrets. This is the&lt;br&gt;
right pattern: each run gets short-lived credentials that expire on their&lt;br&gt;
own, so there's no secret to rotate manually or accidentally expose.&lt;/p&gt;

&lt;p&gt;The PR comment posts the full test output with pass/fail counts per&lt;br&gt;
persona per account. A reviewer can look at the comment and immediately&lt;br&gt;
see whether the permission change has test coverage and whether the&lt;br&gt;
tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Things I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;First: test KMS decryption through each service separately.&lt;br&gt;
&lt;code&gt;kms:Decrypt&lt;/code&gt; via S3 and &lt;code&gt;kms:Decrypt&lt;/code&gt; via Athena&lt;br&gt;
are different IAM evaluation paths even though they're the same API&lt;br&gt;
call. A test that puts an object and gets it back via S3 directly won't&lt;br&gt;
catch a broken Athena KMS path.&lt;/p&gt;

&lt;p&gt;Second: negative tests matter as much as positive ones. Before I had&lt;br&gt;
the test framework, every permission set I wrote was tested only for&lt;br&gt;
what it should allow. I had no systematic check that it didn't allow&lt;br&gt;
more. The denial tests are what give security reviewers confidence.&lt;/p&gt;

&lt;p&gt;Third: persona tests catch failures that service tests miss.&lt;br&gt;
Individual service tests are fast to write and good for regression&lt;br&gt;
coverage, but they test permissions in isolation. Real workflows cross&lt;br&gt;
service boundaries. Build both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Before the framework: five iterations to get one permission set&lt;br&gt;
right, every iteration a production impact. After: 95% of permission&lt;br&gt;
issues caught at PR review time. Zero production impacts from permission&lt;br&gt;
bugs since we shipped it. The templates reduced new permission set&lt;br&gt;
creation time by about 70% --- instead of starting from scratch with the&lt;br&gt;
IAM documentation, we start from a pre-validated base and modify from&lt;br&gt;
there.&lt;/p&gt;

&lt;p&gt;The time investment was about a week: two days for templates, two&lt;br&gt;
days for the test framework and scripts, one day for CI/CD integration&lt;br&gt;
and documentation. That investment paid back in the first sprint when&lt;br&gt;
the analyst permission set for a new hire went out correct on the first&lt;br&gt;
deployment.&lt;/p&gt;

&lt;p&gt;Running into IAM permission debugging loops on your team? &lt;a href="https://graycloudarch.com/#contact" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; --- permission testing infrastructure is&lt;br&gt;
one of the first things I build when joining a new platform team.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>iam</category>
    </item>
  </channel>
</rss>
