Forem: Glenn Gray

ECS vs EKS in 2026: The Decision Framework—Including ECS Anywhere

Glenn Gray — Tue, 19 May 2026 14:19:34 +0000

Originally published on graycloudarch.com.

The CTO wanted to know why the platform team had picked EKS for their new environment. They'd been running ECS for two years without issues. The team lead explained they needed GitOps, better autoscaling, and "industry-standard tooling."

Three months later, they were debugging a cert-manager webhook failure at 11am. Two engineers had spent 30 hours the previous month on cluster operations. They hadn't shipped a net-new feature in six weeks.

EKS wasn't wrong for them. The timing was. They had three engineers, twelve services, and no one who'd operated a Kubernetes cluster in production before. The ecosystem they wanted required them to operate it first.

This is the ECS vs EKS conversation most teams don't have until after they've made the choice.

The Actual Decision Axis

Feature comparisons miss the point. Both ECS and EKS run containers reliably. The real question is: what does your team have to operate to make that happen — and what's the cost of getting it wrong?

Two axes matter:

Operational capacity: How much complexity can your team absorb while still shipping product? A 3-engineer platform team and a 15-engineer platform team are not playing the same game.

Kubernetes maturity: Have your engineers operated k8s in production under pressure? "We've done some k8s" and "we've debugged etcd under load" are not the same thing.

The answer to which one you should use today often changes in 18 months. A team that's right for ECS now may be right for EKS after their platform engineers have shipped 6 months of Kubernetes work. Building with that arc in mind matters.

What ECS Actually Gives You

No control plane. That's the headline. With Fargate, there are no nodes to patch, no node groups to right-size, no kubelet to troubleshoot. AWS manages the underlying compute entirely.

The IAM model is simpler by design. Task roles attach directly to task definitions — no service accounts, no IRSA, no Web Identity tokens to wire up. For engineers coming from EC2-era IAM, this maps cleanly to what they already know.

ECS Fargate has no cluster fixed cost. EKS charges $0.10/hr per cluster — $72/month whether you're running one service or fifty. At low service counts or in non-production environments, that difference is real.

AWS integrations are first-class rather than plugged in. ALB target group registration, CloudMap service discovery, Secrets Manager injection via ECS container secrets — these work without Helm charts or CRDs. The AWS API surface and the ECS API surface are the same surface.

The internal tools team: 3 engineers, zero Kubernetes background, 8 services. ECS Fargate with a shared Terraform module got them to production in three weeks. No platform team required.

What EKS Actually Gives You

Ecosystem depth that ECS simply doesn't have. Karpenter for bin-packing and just-in-time node provisioning. KEDA for event-driven autoscaling off SQS, Kafka, or custom metrics. Argo CD or Flux for GitOps with real reconciliation loops. External Secrets Operator, Cert-manager, Prometheus Operator — the tooling is mature, battle-tested, and actively maintained.

ECS has no equivalent. The closest alternatives are either AWS-native (EventBridge Pipes, Application Auto Scaling) and less flexible, or custom-built and unmaintained after the engineer who wrote them leaves.

Karpenter in particular changes the EC2 cost math at scale. Intelligent bin-packing and spot interruption handling can cut compute costs 30-50% compared to fixed node groups. Below 20-30 nodes the savings often don't justify the operational overhead. Above that, it's hard to ignore.

Multi-cloud portability is real if you actually need it. Kubernetes manifests transfer to GKE or AKS. ECS task definitions do not. If "running this workload outside AWS" is a real scenario — not just theoretical — that matters.

The data platform I worked on: mixed batch and streaming workloads, KEDA scaling on SQS queue depth. ECS autoscaling would have required custom CloudWatch metrics and polling-based triggers. KEDA handled it natively in 20 lines of YAML. That alone settled the decision.

The Decision Tree

Walk through these in order. First yes wins.

Zero Kubernetes experience on the team? → ECS. The operational cost of learning k8s while building product is real and usually underestimated. The 40-hour/month cluster ops tax from the story above was paid by a team that had some k8s experience. Zero experience is worse.

Migrating from an existing ECS platform? → ECS. Rewrite and replatform simultaneously fails more often than it succeeds. Stabilize on ECS, migrate later when the workload is boring.

Need KEDA, custom-metric HPA, or Karpenter? → EKS. ECS autoscaling is Application Auto Scaling against CloudWatch metrics. It works, but the ceiling is lower and the custom metric path is significantly more work.

Need GitOps with Argo CD or Flux? → EKS. ECS has no native GitOps story. You can build one — CodePipeline + ECS deployment, Terraform-driven deployments — but you're building it. The operational difference is significant.

Five or more services sharing infrastructure? → EKS. The fixed cost justifies it; shared node pools improve utilization; the per-service overhead of ECS task definitions multiplies fast.

Default → ECS Fargate. Simpler, cheaper to start, and the migration path to EKS is well-understood.

ECS Anywhere: The Third Option

ECS Anywhere gets overlooked in most comparisons because it doesn't fit neatly into "cloud vs cloud" comparisons. It should be in the decision tree.

ECS Anywhere lets you register non-AWS compute — on-premises servers, VMs in other clouds, edge devices — as ECS external instances. Your task definitions, IAM roles, and tooling stay the same. The ECS control plane in AWS manages scheduling. The compute runs wherever you've registered it.

Where this actually wins:

Regulated environments with data residency requirements. If certain workloads must stay on-premises for compliance, ECS Anywhere lets you run them with the same tooling as your AWS workloads. On the GovCloud platform I built, we had ground system software that had to process flight data on local hardware before transmission. ECS Anywhere would have let us manage those workloads from the same ECS cluster as our cloud services — same Terraform modules, same IAM patterns, same observability pipeline.

Brownfield migration. If you're moving workloads from on-premises to AWS and want a consistent deployment target during the migration, ECS Anywhere gives you that. Register the on-prem servers, migrate task by task, deregister when done.

Edge compute. Consistent deployment tooling across dozens of edge nodes without running a k8s control plane at each site.

The constraint: ECS Anywhere instances are external infrastructure you own and patch. Fargate's "no nodes to manage" advantage disappears. The tradeoff is deliberate — you're accepting node management in exchange for placement control.

The Migration Path

ECS → EKS migration is well-understood and not particularly risky if the IaC is clean.

Containerized workloads move without changes. The two meaningful changes are IAM (task roles → IRSA service accounts — mechanical, not complex) and networking (ALB target group registration → Ingress or Service — also mechanical).

What breaks the migration is task definitions in CloudFormation or hand-managed console resources. If your ECS deployment is 100% Terraform with a module per service, the migration is boring. If it's six engineers' worth of one-off console configurations, it's archaeology.

Build ECS as if you'll migrate it. Keep task definitions in Terraform modules, service definitions composable, networking configuration explicit. The Jira ticket for "migrate from ECS to EKS" should feel like plumbing work, not a project.

Mistakes I See Repeatedly

Choosing EKS because it's "industry standard." Industry standard at Stripe is not industry standard at a 40-person SaaS company. The operational tax is the same either way.

Choosing ECS without accounting for the autoscaling ceiling. For workloads with bursty, event-driven traffic patterns, ECS autoscaling requires CloudWatch custom metrics and Application Auto Scaling policies that are genuinely annoying to tune. Know the ceiling before you hit it.

Single-cluster EKS for two services. The fixed cost of the control plane ($72/month), the operational overhead of running Kubernetes, and the learning curve are all real. For two or three services, this almost never makes sense.

Underestimating the Helm/CRD surface area. When a Helm-managed CRD conflicts with another controller at 2am, you need someone on the team who can debug it. "We'll figure it out" is not a plan.

Building a new platform or rearchitecting an existing container environment? The choice between ECS, EKS, and ECS Anywhere usually comes down to where your team is on the Kubernetes maturity curve and what your autoscaling requirements actually are — not which technology is more capable. Get in touch if you're working through this decision — it's a conversation I have with platform teams regularly, and the right answer depends on specifics that don't fit in a blog post.

Building Apache Iceberg Lakehouse Storage with S3 Table Buckets

Glenn Gray — Mon, 18 May 2026 17:54:46 +0000

Originally published on graycloudarch.com.

The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.

The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.

We built the storage layer as a three-zone medallion architecture, fully managed with Terraform. Here's how we did it — including a few things about Table Buckets that don't show up in most writeups.

The Medallion Architecture

One table bucket per environment. Zones are namespaces inside the bucket — not separate buckets, not separate Glue databases in the legacy sense:

Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, produced by Spark transforms. Curated is the analytics layer that Athena queries and BI dashboards read from.

The namespace naming convention we used was {zone}_{domain} — raw_crm, clean_customer, curated_sales_metrics. When you're looking at a table in Athena or debugging a failed transform job, the namespace name tells you exactly what tier you're in and what domain you're touching. Data lineage is readable from table names alone.

Why Two Modules Instead of One

The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.

The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.

The KMS module:

# kms-key/main.tf
resource "aws_kms_key" "this" {
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}

The service_principals variable takes a list of service principal strings — ["athena.amazonaws.com", "glue.amazonaws.com", "emr-serverless.amazonaws.com"] and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.

The S3 Table Bucket Module

The table bucket itself is straightforward:

# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name
}

One important thing that trips people up: S3 Table Buckets are not standard S3 buckets. They use the S3 Tables API, not the standard S3 API. Several standard S3 resources will fail with NoSuchBucket (404) if you try to attach them to a Table Bucket:

aws_s3_bucket_versioning
aws_s3_bucket_server_side_encryption_configuration
aws_s3_bucket_public_access_block
aws_s3_bucket_intelligent_tiering_configuration

Encryption is managed internally — AES256 is applied on creation automatically. You'll want ignore_changes = [encryption_configuration] in your lifecycle block or Terraform will constantly detect drift.

The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:

# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}

Glue Is Not the Catalog

This is the part that most S3 Table Bucket writeups get wrong, and it matters for how you structure the rest of your Terraform.

S3 Tables is the metadata source of truth. Glue is the integration layer. When you enable the S3 Tables analytics integration, AWS creates a federated catalog named s3tablescatalog in your Glue Data Catalog. Table buckets, namespaces, and tables are surfaced through that catalog hierarchy — Athena and EMR see them through Glue, but Glue doesn't own them.

This means you should not be creating aws_glue_catalog_database resources with location_uri S3 paths and trying to wire Iceberg metadata parameters onto them. That's the legacy Glue-over-S3-prefixes model. For S3 Tables, the catalog structure comes from the table bucket integration, not from manual Glue database provisioning.

In Terraform, the integration resource is aws_s3tables_table_bucket_policy (for access control) and the analytics integration is enabled at the account level. Once enabled, Athena queries S3 Tables through the s3tablescatalog namespace automatically.

The namespace naming convention (raw, clean, curated with domain suffixes) is defined in the table bucket itself, not in Glue. Glue reflects it — it doesn't own it.

The Cost Model

For a 100TB lake, the comparison against standard S3 holds:

Storage Class	When	Monthly Cost
Standard	Active data	~$2,300
Standard-IA equivalent	Less-accessed data	~$400
Glacier equivalent	Archive	~$100

The metadata acceleration charge for Table Buckets is $0.00025 per 1,000 requests — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. The performance improvement compounds the cost picture: 10x faster query planning means less Athena scan time, which means lower query costs as data volume grows.

One note: you cannot attach aws_s3_bucket_intelligent_tiering_configuration to a Table Bucket — it's a standard S3 resource and will fail. Storage cost optimization for Table Buckets happens through compaction and retention maintenance jobs (typically run on a schedule via MWAA or EMR), not through lifecycle policies.

Deployment Sequence

The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before the S3 Tables analytics integration (which creates the federated Glue catalog surface).

In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was Terragrunt apply time — the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.

One deployment note: if you're using Athena and haven't enabled S3 Tables analytics integration in the account before, do that before the apply. Athena queries S3 Tables only after the integration is enabled and the s3tablescatalog namespace is visible in the Glue Data Catalog.

What the Data Team Inherited

When we handed this over to the data engineering team, they had a fully provisioned storage foundation — one table bucket per environment, three namespaces per bucket, encryption enabled, and Athena wired to query through the s3tablescatalog integration. They could start writing Spark jobs and creating tables immediately without worrying about storage configuration or catalog wiring after the fact.

The Terraform modules are reusable. Adding a new environment is one Terragrunt leaf config. Adding a new domain namespace is a namespace declaration on the existing bucket. The KMS key and integration configuration don't change.

S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains make a strong case for starting there rather than retrofitting later — just go in knowing they're a different API surface than standard S3, and structure your modules accordingly.

Building out a data platform and figuring out the storage and catalog architecture? Get in touch — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.

The 5-Minute Tax I Killed With GitHub Actions

Glenn Gray — Mon, 18 May 2026 17:43:31 +0000

Originally published on graycloudarch.com.

Every time I finished writing a blog post, I had to do this:

cd sites/graycloudarch
hugo --minify
aws s3 sync public/ s3://graycloudarch-website --delete
aws cloudfront create-invalidation --distribution-id E1234ABCDEF --paths "/*"

Five minutes. Doesn't sound like much.

But when you're trying to publish 2-3 posts per week while working full-time, those 5 minutes add up. Not just in time—in friction.

"I just finished writing. Now I need to context-switch to deployment mode. What was that CloudFront ID again?"

Friction kills momentum.

What I Wanted

git push → site updates automatically → I move on to the next thing.

Zero thinking. Zero context switching. Zero "oh crap, I forgot to invalidate CloudFront."

The Solution: GitHub Actions

GitHub Actions can build and deploy your site every time you push to main. For free.

Here's the whole workflow:

name: Deploy graycloudarch.com

on:
  push:
    branches: [main]
    paths:
      - 'sites/graycloudarch/**'
      - 'content/graycloudarch/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true

      - uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
          extended: true

      - name: Build site
        working-directory: ./sites/graycloudarch
        run: hugo --minify

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy
        working-directory: ./sites/graycloudarch
        run: |
          aws s3 sync public/ s3://graycloudarch-website --delete
          aws cloudfront create-invalidation \
            --distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION }} \
            --paths "/*"

That's it. Push to main, GitHub Actions handles the rest.

The Part That Tripped Me Up

Hugo themes are usually Git submodules. If you don't check them out, your build fails with cryptic errors about missing layouts.

- uses: actions/checkout@v4
  with:
    submodules: true  # Don't forget this

Cost me 20 minutes of debugging before I realized. Now it's documented in code, not lost in my bash history.

Path Filtering: The Secret Sauce

I run two sites in one repo: graycloudarch.com and cloudpatterns.io.

Without path filtering, every push rebuilds both sites, even if I only changed one. Wasted build minutes, unnecessary CloudFront invalidations, slower feedback.

on:
  push:
    paths:
      - 'sites/graycloudarch/**'
      - 'content/graycloudarch/**'

Now GitHub Actions only runs when files for that site change. Fast, efficient, no waste.

Why This Matters

I'm trying to hit $3K/month by March 31. That's 9 weeks.

Every minute I spend deploying is a minute I'm not writing, not reaching out to clients, not building the course I want to sell.

Manual deployments are a tax on my time. This workflow eliminated that tax.

Now when I finish writing, I commit and push. Two minutes later, it's live. I'm already working on the next post.

The Real Win

It's not the 5 minutes per deployment.

It's the mental overhead.

Before: "Okay, post is done. Now I need to switch gears, build Hugo, sync to S3, remember that CloudFront command..."

After: "Post is done. git push. What's next?"

No context switch. No friction. Just ship and move on.

That's worth way more than 5 minutes.

Want to set this up for your site? The workflow above works for any Hugo + S3 + CloudFront setup. Just plug in your bucket names and distribution IDs in GitHub Secrets.

Or reach out if you want help automating your deployments. I do this for a living.

I Spent 6 Hours Automating a 30-Minute Task (And I'd Do It Again)

Glenn Gray — Mon, 18 May 2026 17:43:30 +0000

Originally published on graycloudarch.com.

Look, I know what you're thinking. "Glenn, you could've just clicked through the AWS console and had both sites live in an hour."

You're not wrong.

But here's the thing—I'm allergic to clicking through consoles. It's a professional hazard from spending the last 5 years building enterprise platforms where "just do it manually" gets you fired.

So when I sat down to launch graycloudarch.com and cloudpatterns.io, I did what any reasonable person would do: I spent 6 hours writing Terraform to automate a 30-minute task.

The Manual Way (aka Hell)

If I'd done this the normal way:

AWS Console → ACM → Request Certificate
Copy the DNS validation CNAME
Cloudflare → Add DNS record
Wait. Refresh. Wait more.
AWS Console → CloudFront → Create Distribution
Copy CloudFront domain
Cloudflare → Add another DNS record
Test. Find typo. Fix typo. Test again.
Repeat for second domain.

Time: 40 minutes if nothing breaks (it always breaks).

Chance I'd screw up a DNS record: 80%.

The Automated Way (aka Overkill)

One Terraform apply. That's it.

terraform apply
# Go make coffee
# Come back to two working sites

But the real magic isn't the deployment—it's what happens when AWS generates those ACM validation records:

resource "cloudflare_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.site.domain_validation_options :
      dvo.domain_name => {
        name   = dvo.resource_record_name
        record = dvo.resource_record_value
        type   = dvo.resource_record_type
      }
  }

  zone_id = data.cloudflare_zone.site.id
  name    = each.value.name
  value   = each.value.record
  type    = each.value.type
}

Terraform reads the validation records from AWS, creates them in Cloudflare, and waits for validation to complete. Zero copy-paste. Zero switching between browser tabs. Zero forgetting which CNAME goes where.

I don't touch Cloudflare. I don't touch AWS Console. I just run terraform apply and go do something useful.

Why This Matters (Spoiler: It's Not About Terraform)

I'm trying to hit $3K/month by March 31. That's 9 weeks away.

Every hour I spend clicking through AWS is an hour I'm not:

Writing blog posts
Reaching out to potential clients on LinkedIn
Building the course I want to sell
Actually making money

Manual infrastructure doesn't generate revenue. Published content generates revenue.

So yeah, I spent 6 hours automating something I could've done in 30 minutes. But now when I launch my third brand (and I will), it takes 10 minutes and one terraform apply.

That's the bet: upfront investment for long-term velocity.

What I Actually Built

The module is dead simple:

ACM certificate with DNS validation
S3 bucket for static hosting
CloudFront distribution
Cloudflare DNS records (both root and www)

Call it twice (once per brand), different inputs, same code:

module "graycloudarch" {
  source      = "../../modules/static-site"
  domain_name = "graycloudarch.com"
  bucket_name = "graycloudarch-website"
}

module "cloudpatterns" {
  source      = "../../modules/static-site"
  domain_name = "cloudpatterns.io"
  bucket_name = "cloudpatterns-website"
}

That's it. No duplication. No drift. No "wait, which CloudFront ID goes with which domain?"

The Part Where I Screwed Up

Of course it didn't work perfectly the first time.

Turns out when you register a domain through Cloudflare, they helpfully create a default parking page DNS record. When Terraform tried to create my root CNAME, it failed with "record already exists."

Took me 20 minutes to figure out I needed allow_overwrite = true in the Cloudflare resource.

20 minutes I'll never get back. But at least it's documented in Git now, not lost in my bash history.

Would I Do This Again?

Absolutely.

Not because it's faster (it's not, the first time).

Not because it's easier (it's definitely not).

Because when I'm sitting at 2am writing my fifth blog post of the week and I realize I need to spin up a third site for a new product line, I can do it in 10 minutes instead of canceling my writing session to spend 45 minutes in AWS console.

Automation is a bet on future you. I'm betting future Glenn will appreciate not having to remember how SSL validation works.

Want the code? It's not open source (yet), but if you're building something similar and want to talk through the architecture, hit me up. I'm always down to talk Terraform.

Or if you just want to tell me I'm insane for spending 6 hours on this, that's cool too. My DMs are open.

The IAM Trust Policy Chicken-and-Egg (That Isn't)

Glenn Gray — Wed, 13 May 2026 17:55:53 +0000

Originally published on graycloudarch.com.

The pipeline role needed to trust the deployment role. The deployment role needed to trust the pipeline role. When I wrote both in Terraform and ran plan, it stopped:

Error: Cycle: module.pipeline.aws_iam_role.exec → module.deploy.aws_iam_role.target → module.pipeline.aws_iam_role.exec

The instinct is to create one role first, then go back and edit the trust policy of the other after it exists. A manual bootstrap step. It works. It also means you can't terraform apply from a clean state and get a working result — someone has to remember the second pass. The IaC tells half the story.

There's a better answer. IAM trust policies don't validate that the ARNs they reference actually exist. AWS stores the JSON document and moves on. The cycle Terraform sees is real — it's a real edge in its dependency graph. The underlying constraint that dependency represents is not.

ARNs are deterministic before creation

IAM role ARNs follow a fixed format:

arn:aws:iam::<account-id>:role/<role-name>

The account ID is fixed. The role name is chosen at definition time. Which means the full ARN is computable before terraform apply runs — before the resource exists — as long as the name is stable.

AWS does not validate that a referenced principal ARN exists when you create or update a trust policy. It stores the JSON. The role becomes assumable once both sides exist, regardless of which one was created first.

This is different from a configuration error like referencing a nonexistent IAM role in an aws_iam_role_policy_attachment — that fails at apply time because Terraform tries to call the API and gets an error. A trust policy is just a JSON document stored against the role. If the ARN in the Principal field doesn't resolve to an existing entity yet, IAM doesn't complain. It just doesn't match anything. Yet.

The cycle Terraform sees

The dependency graph problem is real. Here's the code that creates it:

resource "aws_iam_role" "role_a" {
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = aws_iam_role.role_b.arn }  # depends on role_b
    }]
  })
}

resource "aws_iam_role" "role_b" {
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = aws_iam_role.role_a.arn }  # depends on role_a
    }]
  })
}

Terraform resolves: role_a needs role_b's ARN before creation → role_b needs role_a's ARN before creation → cycle. It stops before creating either resource.

The fix removes the dependency by computing what you already know:

data "aws_caller_identity" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id

  role_a_arn = "arn:aws:iam::${local.account_id}:role/${var.role_a_name}"
  role_b_arn = "arn:aws:iam::${local.account_id}:role/${var.role_b_name}"
}

resource "aws_iam_role" "role_a" {
  name = var.role_a_name
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = local.role_b_arn }  # string, no Terraform dependency
    }]
  })
}

resource "aws_iam_role" "role_b" {
  name = var.role_b_name
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = local.role_a_arn }  # string, no Terraform dependency
    }]
  })
}

No cycle. Both roles are created in a single apply. The trust relationship is live as soon as both resources exist — which they will be, after the same plan.

Where this pattern appears in practice

Cross-account deployment pipeline

CodePipeline execution role in account A assumes a deployment role in account B. The deployment role's trust policy needs to reference the pipeline role's ARN. Each Terraform root manages its own account's roles. The ARN construction pattern resolves the cross-account dependency: each module constructs the other account's role ARN from var.pipeline_account_id and a known role name — values passed in at plan time from tfvars or remote state outputs.

ECS task role and execution role

The ECS task execution role needs iam:PassRole to hand the task role to ECS at launch. Some teams want the task role's trust policy to explicitly list the execution role's ARN as the allowed principal. You don't need to — ecs-tasks.amazonaws.com as the service principal removes the dependency entirely. But if your security posture requires explicit principal ARNs rather than the service principal, ARN construction handles it without a two-pass apply.

Permission boundary bootstrap with an SCP

An SCP requires that all new IAM roles include a specific permission boundary policy. The boundary is a managed policy that must exist before any roles referencing it can be created. This isn't a circular dependency — it's a sequential one. The boundary policy must be applied first, separately. Construct its ARN deterministically (arn:aws:iam::${var.account_id}:policy/${var.boundary_name}) and pass it in wherever roles are created. Document the bootstrap order with a Terraform precondition block or a clear README section. Different problem, different fix.

When the dependency is genuine

There's a scenario that looks identical to this but isn't: when a Terraform provisioner or data source needs to actually call a role — not just reference its ARN — during resource creation.

Example: a null_resource provisioner that runs aws sts assume-role and then operates in the target account. Here you need the role to exist and be assumable before the provisioner fires. ARN construction doesn't help — you need the resource active at execution time, not just its string value known at plan time. The correct fix is explicit depends_on, not local string construction.

The distinction: static JSON referencing an ARN string (solvable with ARN construction) vs. a runtime API call that needs the resource actually live (solvable with depends_on). If your code needs to assume the role during apply, you need ordering. If it just needs to name the role in a policy document, you don't.

The trap in the fix

Once you've internalized "construct ARNs deterministically," the next failure mode is role names that include Terraform-generated suffixes.

resource "aws_iam_role" "role_a" {
  name = "${var.prefix}-role-${random_id.suffix.hex}"  # ARN not deterministic until random_id exists
  ...
}

If the role name includes random_id.suffix.hex, the ARN can't be computed until the random_id resource is created. That brings the dependency back — you're back to needing a resource output to construct the name, and the cycle re-forms if any of those names are referenced in another role's trust policy.

The fix is stable, predictable role names: "${var.prefix}-${var.env}-pipeline" rather than generated suffixes. IAM role names are unique per account, not globally. The habit of appending random suffixes comes from S3 bucket naming, where global uniqueness is required. IAM doesn't have that constraint. There's no reason to make the name unpredictable.

If you have existing roles with generated names and need their ARNs, they're deterministic after the first apply — stored in state and readable via aws_iam_role.role_a.arn. The construction approach is for cases where you control the naming and are defining the role name yourself.

What generalizes

The IAM trust policy deadlock is the most common place engineers hit this pattern, but it's not the only one. Wherever you encounter a Terraform circular dependency involving a predictable string — ARNs, resource names, account IDs, region names — ask whether you actually need the resource output or whether you can compute the value from what you already know.

data.aws_caller_identity.current.account_id gives you the account without creating a dependency on any resource. A stable name gives you the ARN. The dependency graph edge exists only because you referenced the resource — remove the reference by computing the value directly, and the cycle disappears.

The broader principle: Terraform's graph is built from references. References that aren't necessary are constraints that aren't necessary.

Untangling IAM architecture across multiple accounts — trust policies, permission boundaries, SCPs, cross-account assume-role chains — is where subtle errors compound quietly and the blast radius is real. I work on this regularly.

What the first 24 hours of production CloudWatch data told us

Glenn Gray — Mon, 04 May 2026 18:43:32 +0000

Originally published on graycloudarch.com.

The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.

We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy — you can't tune a policy you haven't seen the workload demand yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.

Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.

What 99.8% CPU means at 0.5 vCPU

The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.

The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.

The signal wasn't about count. It was about the unit itself.

At 99.8% utilization, any burst in per-request processing time — a downstream API call that's slow, a cache miss, a spike in concurrent requests — queues. The task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that tasks are scheduled tightly enough that requests wait before they even start processing.

Right-sizing the task before configuring the autoscaler

We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.

Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task will land around 50% utilization — below the target, room to absorb variance before triggering a scale-out, and enough signal for scale-in to be meaningful rather than noise.

The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.

Two other services in the platform were running the opposite problem — allocated more memory than they'd ever used. Those went the other direction: overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.

Sequencing matters: right-size the task before you configure the autoscaler. Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it does is scale out to a floor that's still running on undersized tasks.

Why we chose CPU tracking instead of request count

The obvious autoscaling metric for an HTTP service is ALBRequestCountPerTarget. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.

We couldn't use it.

The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with target_group_arn = null — the target group lives in a different account, and the service module doesn't know its ARN. ALBRequestCountPerTarget requires the target group ARN to be known to the Application Auto Scaling policy. Without it, there's no way to wire the metric across accounts without building additional dependency plumbing.

CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.

One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.

The observability gap app logs can't fill

Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.

What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: HTTPCode_ELB_5XX_Count for errors the load balancer generates before a request reaches a container, RejectedConnectionCount for connections refused at the ALB layer when backend capacity is exhausted, ActiveConnectionCount as a proxy for in-flight load per target group. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.

CloudWatch had the data. The gap was getting it into the same place as everything else.

A 60-second Lambda in the infrastructure account — where the ALB lives — calls GetMetricData and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost (one CloudWatch API call per minute against Lambda's free tier). The metrics land alongside the application data and show the ALB layer that the app logs are blind to.

The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.

Autoscaling parameters worth explaining

For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.

Setting min_capacity to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.

The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.

What 24 hours of data drove

We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.

All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.

The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.

If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — get in touch. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.

DNS Validation: From 15 Steps to Zero

Glenn Gray — Sat, 04 Apr 2026 22:30:30 +0000

Originally published on graycloudarch.com.

You know what's the worst part of launching a new site?

SSL certificate validation.

Not creating the cert—that's one click in AWS ACM. It's the validation dance:

AWS gives you a CNAME record: _abc123extremely-long-string-here.graycloudarch.com
The value is equally ridiculous: _xyz789another-massive-string.acm-validations.aws.
You copy it (pray you don't miss a character)
Switch to Cloudflare (or Route 53, or wherever)
Paste it in
Wait 5-10 minutes
Refresh AWS console
Still pending...
Refresh again
Finally validated!

Now do it again for www.graycloudarch.com.

And then repeat the whole thing for your second domain.

This is "DNS hell."

There's a Better Way

Terraform can read AWS validation records and create them in Cloudflare automatically.

Zero copy-paste. Zero browser tab switching. Zero waiting and refreshing.

Here's the whole thing:

# Request certificate
resource "aws_acm_certificate" "site" {
  domain_name       = "graycloudarch.com"
  validation_method = "DNS"
  subject_alternative_names = ["www.graycloudarch.com"]
}

# Create validation records in Cloudflare
resource "cloudflare_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.site.domain_validation_options :
    dvo.domain_name => {
      name   = dvo.resource_record_name
      value  = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  zone_id = data.cloudflare_zone.site.id
  name    = each.value.name
  value   = each.value.value
  type    = each.value.type
  proxied = false  # Critical - ACM validation breaks with proxy
}

# Wait for validation
resource "aws_acm_certificate_validation" "site" {
  certificate_arn = aws_acm_certificate.site.arn
  validation_record_fqdns = [
    for record in cloudflare_record.cert_validation : record.hostname
  ]
}

Run terraform apply. Go make coffee. Come back to a validated certificate.

The Magic: for_each

The key is this part:

for_each = {
  for dvo in aws_acm_certificate.site.domain_validation_options :
  dvo.domain_name => { name = dvo.resource_record_name, ... }
}

AWS generates validation records dynamically (one for apex domain, one for www). Terraform reads them, loops over them, and creates each one in Cloudflare.

You never see the records. You never copy anything. It just works.

What I Screwed Up

First time I ran this, ACM validation timed out after 30 minutes.

The problem:

proxied = true  # Wrong!

Cloudflare's proxy rewrites DNS responses. ACM's validation servers hit Cloudflare's IP instead of seeing your validation record.

The fix:

proxied = false  # Correct

DNS-only mode. No proxy. ACM validation works.

Cost me 30 minutes of debugging. Now it's in code so I never hit it again.

Why This Matters

I'm running two brands: graycloudarch.com and cloudpatterns.io.

Manual approach: 15 steps per domain = 30 steps total. 30 minutes minimum. High chance of typos.

Terraform approach: One terraform apply. 5 minutes to write the code (once), 10 minutes for AWS to validate. Then copy-paste the pattern for the second domain.

When I launch my third brand (and I will), it'll take 5 minutes and one terraform apply.

That's the bet: upfront automation for long-term velocity.

The Part People Miss

Most Terraform tutorials stop at requesting the certificate. They don't show you the validation loop or the waiting resource.

Without aws_acm_certificate_validation, Terraform exits immediately after creating the cert. It's still "Pending Validation" in AWS. When you try to use it in CloudFront, it fails.

You'd have to run terraform apply again later, after manually checking that validation completed.

That's not automation—that's just documentation.

The waiting resource makes it truly hands-off.

Scaling It

Adding a second domain is 10 lines of code:

resource "aws_acm_certificate" "cloudpatterns" {
  domain_name       = "cloudpatterns.io"
  validation_method = "DNS"
  subject_alternative_names = ["www.cloudpatterns.io"]
}

resource "cloudflare_record" "cloudpatterns_validation" {
  for_each = { /* same pattern */ }
  # ...
}

resource "aws_acm_certificate_validation" "cloudpatterns" {
  # ...
}

Same pattern, different names. No clicking. No switching between consoles. No remembering which validation record goes where.

The Real Win

It's not the time savings (though 30 minutes per deployment adds up).

It's the mental overhead.

Manual DNS configuration requires focus. "Did I copy the whole string? Did I add the trailing dot? Is it DNS-only mode?"

Terraform requires running one command. That's it.

I get my focus back. I can write this blog post while Terraform validates certificates.

Want the full code? It's not open source (yet), but if you're building something similar and want to talk through it, reach out.

Or if you just want to tell me I'm overthinking this and should've clicked through Cloudflare like a normal person, that's cool too.

Building Multi-Account AWS Infrastructure with Terraform and ECP

Glenn Gray — Sat, 04 Apr 2026 22:30:25 +0000

Originally published on graycloudarch.com.

After years of building AWS infrastructure at scale, I've learned that multi-account strategy isn't just about security—it's about organizational clarity and cost management.

At a large podcast hosting platform, we implemented an Enterprise Control Plane (ECP) pattern using Terraform to manage 20+ AWS accounts. Here's what I learned:

The Problem with Single-Account AWS

Most companies start with one AWS account. Everything lives together: dev, staging, prod, data pipelines, security tools. It works... until it doesn't.

Problems emerge:

Blast radius: A misconfigured dev resource can affect production
IAM complexity: Permission boundaries become impossible to manage
Cost allocation: Finance can't track spending by team or project
Compliance: Auditors want logical separation between environments

The ECP Pattern

Enterprise Control Plane is an architectural pattern for managing multiple AWS accounts as a unified platform:

Organization Structure: AWS Organizations with OUs (Organizational Units) for different environments and teams
Centralized Networking: Transit Gateway connecting all accounts through hub-and-spoke model
Security Baseline: Service Control Policies (SCPs) enforcing guardrails at the organization level
Infrastructure as Code: Terraform/Terragrunt managing everything from a central repository

Key Design Decisions

Account Boundaries:

Production accounts: Isolated per application/team
Non-prod accounts: Shared dev/staging to reduce overhead
Platform accounts: Separate accounts for logging, monitoring, security tools
Data accounts: Isolated for compliance and access control

Network Architecture:

Hub account with Transit Gateway
VPC peering only where absolutely necessary
Private subnet defaults for everything
Centralized egress through NAT Gateway in hub

Security Model:

SCPs prevent account-level misconfigurations
IAM roles for cross-account access (no shared credentials)
CloudTrail logs aggregated to security account
GuardDuty and Security Hub in every account

Terraform Structure

We use Terragrunt to manage configurations across accounts:

ecp-ou-structure/     # Organization and account management
ecp-network/          # Transit Gateway, VPCs, networking
ecp-security/         # Security baseline, SCPs, IAM
tf-live-aws-*/        # Application-specific infrastructure

Lessons Learned

Start with security: SCPs first, then networking, then workloads
Automate account creation: Manual account provisioning doesn't scale
Document the why: Every architectural decision needs context
Plan for day 2: Operations matter more than initial setup

Results

After implementing ECP:

Reduced security incident blast radius by 90%
Finance can now track costs by team and project
New environments deploy in hours, not days
Passed SOC2 audit with zero infrastructure findings

Multi-account AWS isn't just best practice—it's how you scale infrastructure beyond the startup phase.

Stop Manually Updating Jira After Every PR Merge

Glenn Gray — Wed, 25 Mar 2026 07:10:48 +0000

This post was originally published on graycloudarch.com.

You just merged a PR. Now you open Jira, find the ticket, paste the
PR link in a comment, transition the status to Done, and update the
deployed field. Five minutes. Twenty times a week. That's 1,700 minutes
per year per engineer --- nearly 30 hours of pure mechanical overhead.

And that's assuming you remember. On one team I worked with, we
audited the last three months of merged PRs. Thirty percent of tickets
had no update after merge. No comment, no transition, no link. The
ticket just sat in In Dev until someone noticed during sprint
review.

The fix is two GitHub Actions workflows and a shared composite
action. Here's exactly how to build it.

The Architecture

Two workflows, one shared extraction layer:

Workflow 1: Fires on PR creation --- posts a Jira link comment to the PR so reviewers can navigate directly to the ticket.
Workflow 2: Fires on PR merge to main --- posts a comment to the Jira ticket with the PR URL, commit SHA, and who merged it, then transitions the ticket to Done.

Both workflows need to find the Jira ticket ID. Instead of
duplicating that logic, we extract it into a composite action.

Step 1: Composite Action for Ticket Extraction

Create
.github/actions/extract-jira-ticket/action.yml.

The action checks four sources in priority order --- easiest to fix
first:

PR title (simplest for the developer to correct)
Commit messages
Branch name in standard format: PROJECT-123-description
Branch name with prefix: feat/PROJECT-123-description

::: {#cb1 .sourceCode}

name: Extract Jira Ticket
description: Extracts Jira ticket from PR title, commits, or branch name

inputs:
  jira-base-url:
    required: true
  jira-user-email:
    required: true
  jira-api-token:
    required: true

outputs:
  jira-key:
    value: ${{ steps.extract.outputs.jira_key }}
  found:
    value: ${{ steps.extract.outputs.found }}

runs:
  using: composite
  steps:
    - name: Extract ticket ID
      id: extract
      shell: bash
      run: |
        JIRA_KEY=""

        # Priority 1: PR title
        if [[ "${{ github.event.pull_request.title }}" =~ ([A-Z]+-[0-9]+) ]]; then
          JIRA_KEY="${BASH_REMATCH[1]}"
        fi

        # Priority 2: Branch name
        if [ -z "$JIRA_KEY" ]; then
          BRANCH="${{ github.head_ref }}"
          if [[ "$BRANCH" =~ ([A-Z]+-[0-9]+) ]]; then
            JIRA_KEY="${BASH_REMATCH[1]}"
          fi
        fi

        if [ -n "$JIRA_KEY" ]; then
          echo "jira_key=$JIRA_KEY" >> $GITHUB_OUTPUT
          echo "found=true" >> $GITHUB_OUTPUT
        else
          echo "found=false" >> $GITHUB_OUTPUT
        fi

:::

The regex [A-Z]+-[0-9]+ matches any Jira ticket format:
PROJ-1, IN-89, INFRA-1234. If you
have tickets with lowercase project keys, adjust accordingly.

Step 2: PR Creation Workflow

Create .github/workflows/link-jira-on-pr.yml.

This fires when a PR is opened and posts a formatted comment with the
Jira ticket link. If no ticket is found, it posts a warning so the
author knows to add one --- before review, not after.

::: {#cb2 .sourceCode}

name: Link Jira on PR
on:
  pull_request:
    types: [opened]

permissions:
  pull-requests: write

jobs:
  link-jira:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/extract-jira-ticket
        id: jira
        with:
          jira-base-url: ${{ secrets.JIRA_BASE_URL }}
          jira-user-email: ${{ secrets.JIRA_USER_EMAIL }}
          jira-api-token: ${{ secrets.JIRA_API_TOKEN }}

      - name: Post Jira link comment
        if: steps.jira.outputs.found == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `📋 Jira: [${{ steps.jira.outputs.jira-key }}](${{ secrets.JIRA_BASE_URL }}/browse/${{ steps.jira.outputs.jira-key }})`
            })

      - name: Warn if no ticket found
        if: steps.jira.outputs.found == 'false'
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ No Jira ticket found. Add a ticket ID to the PR title (e.g., `PROJ-123: Your title`).'
            })

:::

The warning step matters. It creates a feedback loop that trains the
team to include ticket IDs upfront. Within a few weeks, the warning
fires rarely.

Step 3: PR Merge Workflow

Create .github/workflows/update-jira-on-merge.yml.

This fires when a PR is closed against main. The
if: github.event.pull_request.merged == true guard is
important --- the closed event also fires for PRs that are
closed without merging.

::: {#cb3 .sourceCode}

name: Update Jira on Merge
on:
  pull_request:
    types: [closed]
    branches: [main]

jobs:
  update-jira:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/extract-jira-ticket
        id: jira
        with:
          jira-base-url: ${{ secrets.JIRA_BASE_URL }}
          jira-user-email: ${{ secrets.JIRA_USER_EMAIL }}
          jira-api-token: ${{ secrets.JIRA_API_TOKEN }}

      - name: Post merge comment to Jira
        if: steps.jira.outputs.found == 'true'
        run: |
          HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}"
            -u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"
            -H "Content-Type: application/json"
            -X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/comment"
            -d "{\"body\": \"PR merged: #${{ github.event.pull_request.number }} ${{ github.event.pull_request.html_url }}\nCommit: ${{ github.sha }}\nBy: ${{ github.event.pull_request.merged_by.login }}\"}")

          echo "Jira comment HTTP status: $HTTP_STATUS"
          [ "$HTTP_STATUS" -eq 201 ] && echo "✅ Comment posted" || echo "⚠️ Comment failed (non-critical)"

      - name: Transition ticket to Done
        if: steps.jira.outputs.found == 'true'
        run: |
          TRANSITION_ID="${{ secrets.JIRA_DONE_TRANSITION_ID }}"
          [ -z "$TRANSITION_ID" ] && echo "No transition ID configured, skipping" && exit 0

          curl -s -u "${{ secrets.JIRA_USER_EMAIL }}:${{ secrets.JIRA_API_TOKEN }}"
            -H "Content-Type: application/json"
            -X POST "${{ secrets.JIRA_BASE_URL }}/rest/api/2/issue/${{ steps.jira.outputs.jira-key }}/transitions"
            -d "{\"transition\": {\"id\": \"$TRANSITION_ID\"}}"
          echo "✅ Transitioned to Done"

:::

The comment step uses an HTTP status check rather than relying on
curl's exit code. A failed comment doesn't fail the job --- the PR already
merged, and a missing notification shouldn't generate noise in CI. The
transition step is fully optional: if
JIRA_DONE_TRANSITION_ID isn't set, it skips silently. This
lets you start with just comments and add transitions once you've
verified the workflow runs cleanly.

Finding Your Transition IDs

Transition IDs are project-specific. There's no universal "Done" ID.
Run this against any ticket in your project to find yours:

::: {#cb4 .sourceCode}

curl -u "$JIRA_EMAIL:$JIRA_API_TOKEN"
  "$JIRA_BASE_URL/rest/api/2/issue/$TICKET_KEY/transitions"
  | jq -r '.transitions[] | "ID: \(.id) | \(.name)"'

:::

Example output:

ID: 91 | Done
ID: 31 | In Review
ID: 21 | In Progress

Set the Done ID as JIRA_DONE_TRANSITION_ID in your
repository secrets.

A Note on Jira API Versions

Use the v2 API: /rest/api/2/. Some teams try v3 and get
silent empty responses --- {"errorMessages":[],"errors":{}} ---
that look exactly like auth failures. It's not auth. The v3 request body
format changed, and error handling is poor. v2 is stable,
well-documented, and works consistently.

Required Secrets

Add these to your GitHub repository secrets:

Secret Value

JIRA_BASE_URL https://yourorg.atlassian.net
JIRA_USER_EMAIL The email address tied to your API token
JIRA_API_TOKEN Generate at id.atlassian.com → Security → API tokens
JIRA_DONE_TRANSITION_ID Optional --- from the transitions API call above

For org-wide rollout, set these as organization secrets and restrict
to relevant repositories.

Results

After rolling this out across a team of eight engineers:

Zero manual Jira updates after merge
Forgotten ticket updates dropped from 30% to 0%
Roughly 1,700 minutes per year recovered per engineer
Every merged PR has a complete audit trail: PR number, URL, commit SHA, who merged it

The composite action pattern also means when you need to extend this
--- adding a Slack notification on merge, posting to Confluence --- you
extend one file, not two.

If you're building out automation like this across your engineering
platform and want a second opinion on the design, I'm available for advisory engagements.

How I Manage Claude Code Context Across 20+ Repositories

Glenn Gray — Fri, 20 Mar 2026 06:57:13 +0000

This post was originally published on graycloudarch.com.

Three months ago I was re-explaining my Terragrunt state backend to
Claude for the third time in a week. Different session, same repo, same
repo I'd worked in the session before. Claude had no idea I was even in
the same project.

I run Claude Code daily across a 6-account AWS platform monorepo, a
personal consulting site, homelab infrastructure, and a handful of side
projects. Every session started with the same five minutes of "here's
the project, here are the conventions, here's the Jira workflow" --- and
still ended with Claude suggesting patterns that didn't fit the
environment, because I'd inevitably forgotten to mention something.

After three months of broken symlinks and abandoned experiments, I
landed on a three-tier context hierarchy that loads the right context
automatically depending on which directory I'm working in --- and I manage
all of it from a single dotfiles repo.

The Problem with Single-File Context

Claude Code loads CLAUDE.md from the current directory
(and parent directories, walking up to
~/.claude/CLAUDE.md). Most teams start with one file and
put everything in it.

That breaks down quickly:

Global preferences get mixed with project specifics. Your "use snake_case for variable names" preference shouldn't live next to your Terraform state bucket configuration.
Credentials and account IDs end up in files you accidentally commit. Put AWS account IDs in a shared CLAUDE.md, and someone will eventually push it.
You can't share patterns across repos without duplication. Every new repo gets a fresh copy of the same conventions, and updates never propagate.
Multi-employer context creates conflicts. Your consulting client's Jira workflow shouldn't contaminate your personal project sessions.

My first attempt at fixing this was a shared scripts directory. The
three-tier hierarchy came later, after I figured out what was actually
wrong with the simpler approach.

My First Attempt: A Shared Scripts Directory

Before landing on the three-tier system, I built something more
obvious: a ~/shared-claude-infra/ directory containing a
setup-project.sh script that initialized
.claude/ context for each new repo.

The script created the directory structure and symlinked a
rules/shared/ folder back to the shared repo:

mkdir -p "$PROJECT_DIR/.claude/rules"
ln -s ~/shared-claude-infra/rules "$PROJECT_DIR/.claude/rules/shared"

This worked for the first two repos I configured. Then the problems
compounded:

Manual per-project setup. Every new repo required running the script. Miss one, and that repo has no shared context.
Two repos to maintain. The shared infrastructure lived in its own git repo, separate from dotfiles. Two places to update when conventions changed, and they drifted.
Nested symlinks instead of directory-level symlinks. The rules/shared symlink lived deep inside the project's .claude/ tree. When the target moved, every project that had run the script got a broken symlink --- silently.
Hardcoded paths that drifted. The script referenced workspace paths from three months earlier. My actual directory layout had changed; the script still pointed at the old locations.

When I eventually deleted the shared directory, a quick
find confirmed broken symlinks scattered across every repo
that had run the setup script. The approach was inherently fragile
because it depended on every machine, every repo, and every workspace
path staying synchronized manually.

The fix isn't a smarter script. It's inverting the relationship:
instead of a script that runs once per project, use dotfiles that wire
context automatically based on what directories exist.

The Three-Tier Hierarchy

~/.claude/CLAUDE.md              ← Global: preferences, style, git workflow
~/work/{employer}/.claude/       ← Org: team structure, AWS accounts, Jira workflow
~/work/{employer}/{repo}/.claude/ ← Project: repo architecture, active tickets

Claude Code walks up the directory tree loading
CLAUDE.md files at each level. Each tier handles a specific
scope:

Global tier (~/.claude/): Everything
that applies across all work --- communication style, git commit format,
PR description templates, universal infrastructure patterns. No
credentials, no account IDs, nothing employer-specific.

Org tier (~/work/{employer}/.claude/):
Team structure, Jira project keys, AWS account layout, CI/CD pipeline
conventions. Sensitive patterns (account IDs, VPC IDs, state bucket
names) go in gitignored files within this directory. Reusable patterns
(CI/CD templates, AWS patterns without specifics) go in committed
files.

Project tier
(~/work/{employer}/{repo}/.claude/): Architecture decisions
for this specific repo, active tickets, ongoing work state. Always
gitignored --- this is ephemeral working context that changes
frequently.

Implementation: Symlinks from Dotfiles

The hierarchy only works if it's consistent across machines. I manage
all context files from a dotfiles repo using symlinks:

dotfiles/claude/
├── global/          → symlinked to ~/.claude/
├── {employer}/      → symlinked to ~/work/{employer}/.claude/
└── {personal}/      → symlinked to ~/personal/.claude/

install.sh wires these automatically:

# Global context
ln -sf "$DOTFILES/claude/global" "$HOME/.claude"

# Per-employer context
for employer in "${EMPLOYERS[@]}"; do
  WORK_DIR="$HOME/work/$employer"
  if [ -d "$WORK_DIR" ]; then
    ln -sf "$DOTFILES/claude/$employer" "$WORK_DIR/.claude"
  fi
done

Any machine that runs install.sh gets the same context
hierarchy. Changes committed to dotfiles propagate immediately.

What Each Level Contains

Global (`~/.claude/`)

~/.claude/
├── CLAUDE.md          # Preferences, active work summary
└── rules/
    ├── git-workflow.md
    ├── pr-patterns.md
    ├── infrastructure.md
    └── context-management.md

CLAUDE.md is short --- preferences and a pointer to where
active work lives. The heavy lifting goes in rules/ files
that Claude loads as supplementary context.

## Communication Style
- Be direct and technical — I understand infrastructure concepts
- Explain the "why" behind decisions
- Provide specific file paths and line numbers

## Git Workflow
- Branch format: feat/TICKET-123-description
- Commit format: [TICKET-123] Brief summary\n\nWhy this change...
- Never add Co-Authored-By trailers

Org Level (`~/work/{employer}/.claude/`)

~/work/{employer}/.claude/
├── {EMPLOYER}.md              # Team structure, Jira workflow — committed
└── rules/
    ├── cicd-patterns.md       # CI/CD conventions — committed
    ├── aws-patterns.md        # Account IDs, VPC IDs — GITIGNORED
    └── terraform-patterns.md  # State config, module paths — GITIGNORED

The privacy split matters. cicd-patterns.md contains
reusable GitHub Actions patterns --- fine to commit.
aws-patterns.md contains actual account IDs --- stays
local.

A typical employer context file:

## Team Structure
- Platform team: 5 engineers, all in Jira project IN
- AWS accounts: dev, nonprod, prod (+ 3 infra accounts network, security, management)
- Monorepo: ~/work/{employer}/iac — Terragrunt, 78 components

## Jira Workflow
- IN project (infrastructure): transition IDs 3=In Dev, 8=Needs Review, 9=Done
- Prefix all commits: [IT-XXX]
- API: REST v2 only — v3 silently returns empty responses

## CI/CD
- GitHub Actions with OIDC to AWS (no long-lived credentials)
- PR requires: terraform plan output posted as comment
- Merge to main triggers auto-deploy to nonprod

The gitignored aws-patterns.md contains account IDs and
specific ARNs that Claude needs for generating Terraform configurations
accurately but shouldn't be committed anywhere.

Project Level (`~/work/{employer}/{repo}/.claude/`)

Project context is ephemeral and always gitignored. It's the working
memory for an ongoing effort:

## Current State
- Branch: feat/IT-89-my-app-dev-ecr
- Active ticket: IT-89 — ECR in dev
- Next: IT-90 — ECS task definition

## Architecture Decisions
- ECR in dev only; cross-account pull policies for nonprod and prod
- Mutable tags in dev, immutable in nonprod/prod
- KMS key per environment, not per repository

## Blockers
- Waiting on network DNS zone creation before cutover can proceed

I update this file at the end of each session with current state so
the next session loads instantly without re-explaining where things
are.

The Privacy Model

The critical insight is that context files need two categories:
committed (shareable) and local-only (sensitive).

Content Location Committed?

Personal preferences ~/.claude/CLAUDE.md ✅
Git workflow rules ~/.claude/rules/git-workflow.md ✅
Team structure {employer}/.claude/{EMPLOYER}.md ✅ sanitized
CI/CD patterns {employer}/.claude/rules/cicd-patterns.md ✅
AWS account IDs {employer}/.claude/rules/aws-patterns.md ❌ gitignored
VPC IDs, state config {employer}/.claude/rules/terraform-patterns.md ❌ gitignored
Active ticket state {repo}/.claude/OVERRIDES.md ❌ gitignored

The .gitignore at the dotfiles level handles this
automatically by ignoring **/aws-patterns.md and
**/terraform-patterns.md across all employer
directories.

Custom Commands

Beyond context files, Claude Code supports custom
/commands --- reusable prompts stored as markdown files:

~/.claude/commands/
├── checkpoint.md       # Create context snapshot
├── sync-work.md        # Update active work status
└── pr-ready.md         # Generate PR description

A command file is just the prompt Claude should execute:

# checkpoint.md
Create a context checkpoint. Read the current git status across active repos,
summarize open PRs and their status, list active tickets with their current
state, and write a structured summary to ~/.claude/local.md. Include any
blocking issues and the next planned action.

Commands at the global level are available everywhere. Org-level
commands handle employer-specific workflows like Jira transitions.

What This Solves in Practice

Before this system: every new Claude session started with "here's the
project, here are the conventions, here's where things are." Five
minutes of ramp-up, inconsistent outputs because I'd forget to mention
something.

After: I cd into a repo and Claude already knows the
Jira workflow, the AWS account structure, the naming conventions, and
where the active work stands. When I start a session mid-ticket, the
project-level context tells Claude exactly what was in progress.

The bigger payoff is consistency. When Claude generates Terraform, it
generates it with the correct state backend. When it writes commit
messages, they follow the format reviewers expect. When it suggests
architecture, it fits the actual account model rather than a generic AWS
example.

Starting Point

If you're starting from scratch, work tier by tier:

Create ~/.claude/CLAUDE.md with your communication preferences and git conventions.
Add a rules/ directory with patterns you want loaded consistently.
Create an org-level directory when you start working with a specific employer or major project.
Add project-level context when you start a multi-session effort.

Don't try to build the whole system at once. The global tier alone
eliminates most of the per-session ramp-up. The org and project tiers
pay off as work gets more complex.

The thing that surprised me most wasn't the time saved on ramp-up. It
was how much the output quality improved. When Claude knows the actual
state backend, the actual account IDs, the actual PR format your
reviewers expect --- the suggestions it makes fit your environment. That
gap between "technically correct" and "actually usable" is where most of
the friction in AI-assisted infrastructure work lives. The context
hierarchy is mostly just closing that gap.

If you're setting up Claude Code for a platform team and want to talk
through the context design, I do advisory
engagements for teams getting serious about AI tooling in their
infrastructure workflow.

Great design, and easy to follow.

Glenn Gray — Wed, 18 Mar 2026 16:23:46 +0000

How to Implement AWS Network Firewall in a Multi-Account Architecture Using Transit Gateway

Cristhian Becerra ・ Oct 13 '25

#english #aws #networking #cybersecurity

Building Automated AWS Permission Testing Infrastructure for CI/CD

Glenn Gray — Wed, 18 Mar 2026 07:14:50 +0000

This post was originally published on graycloudarch.com.

I deployed a permission set for our data engineers five times before
it worked correctly.

The first deployment: S3 reads worked, Glue Data Catalog reads
worked. Athena queries failed --- the query engine needs KMS decrypt
through a service principal, and I'd missed the
kms:ViaService condition. Second deployment: Athena worked.
EMR Serverless job submission failed --- missing
iam:PassRole. Third deployment: EMR submission worked. Job
execution failed --- missing permissions on the EMR Serverless execution
role boundary. I kept deploying, engineers kept getting blocked, I kept
opening tickets.

Five iterations. Two weeks. Every failure meant a data engineer
opened a ticket instead of running their job.

The problem wasn't that IAM is complicated --- it is, but that's
expected. The problem was that I had no way to catch these issues before
deploying to the account where real engineers were trying to do real
work. Every bug was a production bug.

The "Access Denied" Debugging Loop

Here's what the reactive debugging cycle looks like from the
inside.

Engineer opens a ticket:
AccessDeniedException: User is not authorized to perform: s3:GetObject.
I add s3:GetObject to the permission set. Next day:
AccessDeniedException: s3:PutObject. I add
s3:PutObject. Day after: write succeeds but cleanup fails ---
s3:DeleteObject. At this point I've done four deployment
cycles and two days of work to get S3 read/write/delete working. If I'd
just added s3:* I'd be done, but that violates
least-privilege and opens the raw zone to write access, which we
explicitly don't want.

The deeper issue is that individual services don't fail atomically.
Athena requires athena:StartQueryExecution and
athena:GetQueryResults and
athena:GetQueryExecution, but it also requires KMS decrypt
through the Athena service principal to read encrypted S3 results. That
last piece isn't in the Athena docs --- you find it by failing in
production.

I wanted a way to find it before deploying.

What I Built

The testing framework has four components: per-persona permission set
templates, a Bash test library, per-service test scripts, and a GitHub
Actions workflow that runs everything on pull requests.

┌─────────────────────────────────────────────────┐
│  GitHub Pull Request (Permission Set Changes)   │
└───────────────────┬─────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │  CI/CD Workflow     │
         │  (GitHub Actions)   │
         └──────────┬──────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌───────┐      ┌──────────┐   ┌──────────┐
│ S3    │      │  Glue    │   │ Athena   │
│ Tests │      │  Tests   │   │  Tests   │
└───────┘      └──────────┘   └──────────┘
                    │
         ┌──────────▼──────────┐
         │  Test Report        │
         │  (Posted to PR)     │
         └─────────────────────┘

The workflow triggers on any pull request that modifies the
identity-center Terraform directory. Tests run against real AWS accounts
--- dev and nonprod --- using test credentials provisioned for that purpose.
Results post as a PR comment before anyone approves the change.

Phase 1: Pre-Validated Templates

Before I wrote a single test, I needed a starting point for
permission sets that captured the patterns I'd learned the hard way.
Templates that handle the non-obvious pieces --- zone-scoped S3 access,
KMS conditions tied to specific services, explicit denies for
destructive operations.

The AnalystAccess template is representative. Analysts
get read-only access to the curated zone of the data lake, Athena query
execution in the primary workgroup, and KMS decrypt --- but only when the
decrypt request originates from S3 or Athena, not from arbitrary API
calls:

inline_policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Sid    = "GlueCatalogReadOnly"
      Effect = "Allow"
      Action = ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions", "glue:SearchTables"]
      Resource = [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/curated_*",
        "arn:aws:glue:*:*:table/curated_*/*"
      ]
    },
    {
      Sid    = "S3CuratedReadOnly"
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = ["arn:aws:s3:::lake-bucket-*/curated/*", "arn:aws:s3:::lake-bucket-*"]
      Condition = { StringLike = { "s3:prefix" = ["curated/*"] } }
    },
    {
      Sid    = "AthenaQueryExecution"
      Effect = "Allow"
      Action = ["athena:StartQueryExecution", "athena:GetQueryExecution", "athena:GetQueryResults", "athena:StopQueryExecution"]
      Resource = "arn:aws:athena:*:*:workgroup/primary"
    },
    {
      Sid    = "KMSDecryptViaSvc"
      Effect = "Allow"
      Action = ["kms:Decrypt", "kms:DescribeKey"]
      Resource = "arn:aws:kms:*:*:key/*"
      Condition = {
        StringEquals = { "kms:ViaService" = ["s3.us-east-1.amazonaws.com", "athena.us-east-1.amazonaws.com"] }
      }
    },
    {
      Sid    = "DenyDestructiveOps"
      Effect = "Deny"
      Action = ["s3:DeleteObject", "s3:DeleteBucket", "glue:DeleteDatabase", "glue:DeleteTable"]
      Resource = "*"
    }
  ]
})

The kms:ViaService condition is the piece that took five
production failures to discover. KMS decrypt without that condition
allows an analyst to call kms:Decrypt directly from their
shell, which is not what we want. The condition locks decrypt to
requests that pass through S3 or Athena specifically.

The explicit deny block matters too. Without it, if someone later
grants broader S3 permissions to this persona for a different reason,
the curated zone protection evaporates. The deny creates a hard floor
regardless of what else gets added.

Phase 2: The Test Framework

I chose Bash over Python or a proper test framework deliberately. The
tests run in CI with no dependencies beyond the AWS CLI --- no package
installs, no virtual environments, no version pinning of test libraries.
The machines running these tests already have the AWS CLI.

The core library in lib/test-framework.sh:

::: {#cb3 .sourceCode}

declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"

  if eval "$test_command" &>/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ ${#TESTS_FAILED[@]} -gt 0 ] && printf '  - %s\n' "${TESTS_FAILED[@]}"
}

:::

The most important design decision in the test scripts is testing
denials as carefully as allowances. Testing only what should succeed
tells you the permission set isn't obviously broken. Testing what should
fail tells you it's not accidentally too permissive.

::: {#cb4 .sourceCode}

# Test what should succeed
run_test "s3-list-curated"
  "aws s3 ls s3://lake-bucket-dev/curated/"
  "Analyst can list curated zone"

# Test what should fail (negative test)
run_test "s3-write-denied"
  "! aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2>&1 | grep -q 'AccessDenied'"
  "Analyst cannot write to curated zone"

run_test "s3-raw-zone-denied"
  "! aws s3 ls s3://lake-bucket-dev/raw/ 2>&1 | grep -q 'AccessDenied'"
  "Analyst cannot access raw zone"

:::

Beyond service-level tests, I run persona tests that simulate
end-to-end workflows. An analyst's workflow isn't "call S3, then call
Athena separately" --- it's "run an Athena query that reads encrypted S3
data and writes results to the query results bucket." That integration
test catches failures that individual service tests miss. The original
five-iteration DataPlatformAccess failure? An individual S3 test would
have passed. A persona test running an actual Athena query against the
encrypted lake would have caught the KMS gap.

Phase 3: CI/CD Integration

The GitHub Actions workflow triggers on pull requests that touch the
identity-center Terraform directory, runs tests in a matrix against dev
and nonprod, and posts a summary comment to the PR.

::: {#cb5 .sourceCode}

on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    strategy:
      matrix:
        environment: [workloads-dev, workloads-nonprod]
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.environment.account }}:role/github-actions-role
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst

:::

The id-token: write permission is required for OIDC
authentication to AWS --- the workflow assumes a role in each account
rather than using long-lived credentials in GitHub Secrets. This is the
right pattern: credentials rotate automatically, and there's no secret
to rotate manually or accidentally expose.

The PR comment posts the full test output with pass/fail counts per
persona per account. A reviewer can look at the comment and immediately
see whether the permission change has test coverage and whether the
tests pass.

Three Things I Learned the Hard Way

First: test KMS decryption through each service separately.
kms:Decrypt via S3 and kms:Decrypt via Athena
are different IAM evaluation paths even though they're the same API
call. A test that puts an object and gets it back via S3 directly won't
catch a broken Athena KMS path.

Second: negative tests matter as much as positive ones. Before I had
the test framework, every permission set I wrote was tested only for
what it should allow. I had no systematic check that it didn't allow
more. The denial tests are what give security reviewers confidence.

Third: persona tests catch failures that service tests miss.
Individual service tests are fast to write and good for regression
coverage, but they test permissions in isolation. Real workflows cross
service boundaries. Build both.

What Changed

Before the framework: five iterations to get one permission set
right, every iteration a production impact. After: 95% of permission
issues caught at PR review time. Zero production impacts from permission
bugs since we shipped it. The templates reduced new permission set
creation time by about 70% --- instead of starting from scratch with the
IAM documentation, we start from a pre-validated base and modify from
there.

The time investment was about a week: two days for templates, two
days for the test framework and scripts, one day for CI/CD integration
and documentation. That investment paid back in the first sprint when
the analyst permission set for a new hire went out correct on the first
deployment.

Running into IAM permission debugging loops on your team? Reach out --- permission testing infrastructure is
one of the first things I build when joining a new platform team.

Forem: Glenn Gray

ECS vs EKS in 2026: The Decision Framework—Including ECS Anywhere

The Actual Decision Axis

What ECS Actually Gives You

What EKS Actually Gives You

The Decision Tree

ECS Anywhere: The Third Option

The Migration Path

Mistakes I See Repeatedly

Building Apache Iceberg Lakehouse Storage with S3 Table Buckets

The Medallion Architecture

Why Two Modules Instead of One

The S3 Table Bucket Module

Glue Is Not the Catalog

The Cost Model

Deployment Sequence

What the Data Team Inherited

The 5-Minute Tax I Killed With GitHub Actions

What I Wanted

The Solution: GitHub Actions

The Part That Tripped Me Up

Path Filtering: The Secret Sauce

Why This Matters

The Real Win

I Spent 6 Hours Automating a 30-Minute Task (And I'd Do It Again)

The Manual Way (aka Hell)

The Automated Way (aka Overkill)

Why This Matters (Spoiler: It's Not About Terraform)

What I Actually Built

The Part Where I Screwed Up

Would I Do This Again?

The IAM Trust Policy Chicken-and-Egg (That Isn't)

ARNs are deterministic before creation

The cycle Terraform sees

Where this pattern appears in practice

When the dependency is genuine

The trap in the fix

What generalizes

What the first 24 hours of production CloudWatch data told us

What 99.8% CPU means at 0.5 vCPU

Right-sizing the task before configuring the autoscaler

Why we chose CPU tracking instead of request count

The observability gap app logs can't fill

Autoscaling parameters worth explaining

What 24 hours of data drove

DNS Validation: From 15 Steps to Zero

There's a Better Way

The Magic: for_each

What I Screwed Up

Why This Matters

The Part People Miss

Scaling It

The Real Win

Building Multi-Account AWS Infrastructure with Terraform and ECP

The Problem with Single-Account AWS

The ECP Pattern

Key Design Decisions

Terraform Structure

Lessons Learned

Results

Stop Manually Updating Jira After Every PR Merge

The Architecture

Step 1: Composite Action for Ticket Extraction

Step 2: PR Creation Workflow

Step 3: PR Merge Workflow

Finding Your Transition IDs

A Note on Jira API Versions

Required Secrets

Results

How I Manage Claude Code Context Across 20+ Repositories

The Problem with Single-File Context

My First Attempt: A Shared Scripts Directory

The Three-Tier Hierarchy

Implementation: Symlinks from Dotfiles

What Each Level Contains

Global (~/.claude/)

Org Level (~/work/{employer}/.claude/)

Project Level (~/work/{employer}/{repo}/.claude/)

The Privacy Model

Custom Commands

Global (`~/.claude/`)

Org Level (`~/work/{employer}/.claude/`)

Project Level (`~/work/{employer}/{repo}/.claude/`)