<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Luca</title>
    <description>The latest articles on Forem by Luca (@luca29373).</description>
    <link>https://forem.com/luca29373</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828946%2F0cfaf249-d0cf-48fe-8d66-25cdc86cdc1e.png</url>
      <title>Forem: Luca</title>
      <link>https://forem.com/luca29373</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/luca29373"/>
    <language>en</language>
    <item>
      <title>Our Terraform Drift Went Undetected for Four Months. Here Is How We Found It.</title>
      <dc:creator>Luca</dc:creator>
      <pubDate>Sat, 18 Apr 2026 16:16:52 +0000</pubDate>
      <link>https://forem.com/luca29373/our-terraform-drift-went-undetected-for-four-months-here-is-how-we-found-it-2ana</link>
      <guid>https://forem.com/luca29373/our-terraform-drift-went-undetected-for-four-months-here-is-how-we-found-it-2ana</guid>
      <description>&lt;p&gt;In my last post I talked about state file corruption and the mess that comes with it. This one is about a quieter problem that lived alongside it for months without us noticing. Infrastructure drift.&lt;/p&gt;

&lt;p&gt;Drift is what happens when the actual state of your cloud resources stops matching what your Terraform code says they should be. It can happen for obvious reasons like a developer tweaking a security group rule in the console during an incident. But it can also accumulate slowly and invisibly over weeks through automated processes, manual patches, and one-off fixes that never made it back into code. That second kind is what got us.&lt;/p&gt;

&lt;h2&gt;How It Started&lt;/h2&gt;

&lt;p&gt;We had been running a mix of production and staging workloads across three AWS accounts since early 2024. Our Terraform setup was the standard arrangement: S3 remote state, DynamoDB locking, workspaces for environment separation. The team was disciplined about infrastructure changes for the most part. PRs went through review, applies were done through the pipeline, and we had a rule that console changes were for emergencies only.&lt;/p&gt;

&lt;p&gt;The rule held about 80% of the time. The other 20% was what caused the problem.&lt;/p&gt;

&lt;p&gt;During incidents people make console changes because it is faster than writing Terraform, opening a PR and waiting for a pipeline. A security group rule gets widened to unblock something at 11pm. An EC2 instance gets a tag added so billing can track a cost spike. A CloudWatch alarm threshold gets bumped because it was firing too often. None of these feel significant in the moment. None of them get turned into Terraform code afterwards because the incident is over and there are other things to do.&lt;/p&gt;

&lt;p&gt;Over four months that 20% added up to 47 individual resources across our three AWS accounts that had diverged from what our Terraform code described. We had no idea until January 2026.&lt;/p&gt;

&lt;h2&gt;How We Found It&lt;/h2&gt;

&lt;p&gt;We found it by accident. We were running a terraform plan on our staging environment to test a new module we were writing. The plan came back with a wall of changes that had nothing to do with the module we were adding. Resources we hadn't touched in months were showing up as needing modifications. Some showed diffs on fields we didn't remember ever setting. One showed a resource that Terraform thought existed but was actually gone.&lt;/p&gt;

&lt;p&gt;We ran terraform plan across our other workspaces and found the same pattern everywhere. The cumulative drift from four months of incident-time console changes was sitting there waiting for the next apply to either overwrite it or explode trying.&lt;/p&gt;
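
&lt;p&gt;That ad hoc check is easy to script. Here is a rough sketch, not our actual tooling (the helper names are made up and the flags assume a recent Terraform CLI): terraform plan with -detailed-exitcode exits 0 when the plan is clean, 2 when there are pending changes, and 1 on error, which is what makes drift machine-detectable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Illustrative sketch: run a plan in every workspace and report drift.
# Exit codes from -detailed-exitcode: 0 = clean, 2 = changes, else error.

classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

check_all_workspaces() {
  # `terraform workspace list` marks the current workspace with '*'; strip it
  for ws in $(terraform workspace list | tr -d ' *'); do
    terraform workspace select "$ws"
    rc=0
    terraform plan -detailed-exitcode -input=false -lock=false || rc=$?
    echo "$ws: $(classify_plan_exit "$rc")"
  done
}

# check_all_workspaces  # uncomment to run against a real backend
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;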

&lt;p&gt;The dangerous part was not the drift itself. It was the fact that some of the drifted state was intentional. The security group rule that was widened during that 11pm incident had stayed widened because widening it had actually fixed the underlying problem. If we had blindly run terraform apply it would have put that rule back to its original tighter setting and broken production again.&lt;/p&gt;

&lt;h2&gt;What We Did to Untangle It&lt;/h2&gt;

&lt;p&gt;Reconciling four months of drift manually was unpleasant. We went resource by resource through the plan output and asked two questions of each differing resource: is the current live state correct, or is the Terraform code? And who changed it, and when?&lt;/p&gt;

&lt;p&gt;For about 60% of the drifted resources the live state was correct and the Terraform code needed updating to match it. For 30% the Terraform code was correct and the live change was stale and could be overwritten. For the remaining 10% we couldn't tell either way and had to track down the people involved to get context.&lt;/p&gt;
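
&lt;p&gt;The mechanics for each bucket were standard Terraform CLI moves. Roughly, with invented resource addresses standing in for the real ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Live state correct (~60%): update the HCL to match reality, then fold
# the live values into state without touching any infrastructure.
terraform plan -refresh-only    # review what Terraform would accept
terraform apply -refresh-only   # accept the live values into state

# Code correct, live change stale (~30%): a targeted apply puts the
# resource back to what the code says.
terraform apply -target=aws_security_group_rule.app_ingress

# Resource deleted out-of-band: remove it from state so Terraform stops
# tracking it, or run a plain apply to re-create it.
terraform state rm aws_s3_bucket.scratch
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;One caveat: -refresh-only only updates the state file. The HCL still has to be edited to match the live values, or the very next plan shows the same diff again.&lt;/p&gt;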

&lt;p&gt;The reconciliation took three engineers two full days. At the end of it we ran a clean plan across all environments and saw no unexpected changes for the first time in months.&lt;/p&gt;

&lt;h2&gt;What We Put In Place After&lt;/h2&gt;

&lt;p&gt;We came out of this with three changes that have stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection on a schedule&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We set up a GitHub Actions workflow that runs terraform plan, which never modifies anything, across all our workspaces every night and posts a summary to a Slack channel. If the plan is clean it posts a green check. If there are unexpected diffs it posts a list of the affected resources.&lt;/p&gt;

&lt;p&gt;The first week it ran we caught two new drifts within 24 hours of them happening instead of four months later. A feedback loop that tight changes how people think about console changes, because they know the change will show up in Slack the next morning.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Drift Detection&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;detect-drift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Terraform&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Plan&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;terraform init&lt;/span&gt;
          &lt;span class="s"&gt;terraform plan -detailed-exitcode 2&amp;gt;&amp;amp;1 | tee plan-output.txt&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Notify Slack&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slackapi/slack-github-action@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;{"text": "Terraform drift detected in ${{ github.workflow }}. Check the Actions run for details."}&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;SLACK_WEBHOOK_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SLACK_WEBHOOK_URL }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A proper incident infrastructure runbook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The root cause of most of our drift was that engineers had no fast, legitimate path for making temporary infrastructure changes during incidents. The console was faster than Terraform, so the console won.&lt;/p&gt;

&lt;p&gt;We wrote a short runbook for incident-time infrastructure changes. The runbook says: if you make a console change during an incident, you open a follow-up ticket before the incident is closed, and that ticket must include a Terraform PR to codify the change within 48 hours. The on-call engineer who closes the incident is responsible for making sure the ticket exists.&lt;/p&gt;

&lt;p&gt;It is not a technical solution. It is a process one. But it has worked better than we expected because it does not add friction to the incident itself; it just requires a small step at close-out, when the pressure is off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tagging console-created resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We turned on an AWS Config rule that flags any resource created without a specific terraform-managed tag. Resources created through our pipelines get the tag automatically. Resources created through the console do not. This gives us a live view in Config of anything in our accounts that has no corresponding Terraform management.&lt;/p&gt;

&lt;p&gt;It does not prevent console changes but it makes them visible immediately rather than discoverable only when someone runs a plan.&lt;/p&gt;
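
&lt;p&gt;For anyone wanting to reproduce this, the rule we used is AWS Config's managed REQUIRED_TAGS rule pointed at the tag key our pipelines apply. A minimal Terraform sketch, assuming a Config recorder is already running in the account (the rule name and the resource types in scope here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_config_config_rule" "terraform_managed_tag" {
  name = "terraform-managed-tag-present"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # Non-compliant = the resource exists without the tag our pipelines add.
  input_parameters = jsonencode({
    tag1Key = "terraform-managed"
  })

  scope {
    compliance_resource_types = [
      "AWS::EC2::Instance",
      "AWS::EC2::SecurityGroup",
      "AWS::S3::Bucket",
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;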

&lt;h2&gt;Where We Are Now&lt;/h2&gt;

&lt;p&gt;Four months after putting these in place, no drift has sat undetected for weeks the way it did before. The nightly plan check has caught eight individual drifts. Six of those were small tag changes. One was a security group modification. One was a manually created S3 bucket that a developer had spun up to test something and forgotten about.&lt;/p&gt;

&lt;p&gt;None of them had time to compound into the four-month backlog we had to deal with in January.&lt;/p&gt;

&lt;p&gt;The honest truth is that drift is not a Terraform problem. It is a team behaviour problem that Terraform exposes. The technical tooling helps but what actually fixed it for us was making drift visible quickly enough that it stayed a small problem instead of growing into a large one.&lt;/p&gt;

&lt;p&gt;If you are not running regular plan checks in your environments you are almost certainly accumulating drift right now. The question is just whether you will find it in days or months.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructureascode</category>
    </item>
    <item>
      <title>How MechCloud Changed the Way We Manage Cloud Infrastructure at Our Startup</title>
      <dc:creator>Luca</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:14:12 +0000</pubDate>
      <link>https://forem.com/luca29373/how-mechcloud-changed-the-way-we-manage-cloud-infrastructure-at-our-startup-39ba</link>
      <guid>https://forem.com/luca29373/how-mechcloud-changed-the-way-we-manage-cloud-infrastructure-at-our-startup-39ba</guid>
      <description>&lt;p&gt;I've been working as a DevOps engineer at a fast-growing startup for just over two years now. Like many engineers in my position, I inherited a Terraform-heavy setup - S3 backends, DynamoDB state locking, remote workspaces, the whole nine yards. For a while, it worked. Then it didn't.&lt;/p&gt;

&lt;h2&gt;The State File Problem Was Real&lt;/h2&gt;

&lt;p&gt;Our team had grown from 3 to 12 engineers over 18 months. With more people touching infrastructure simultaneously, we started hitting a wall of issues that are all too familiar in the DevOps world: corrupted state files, state lock contention, and the dreaded "state drift" where what Terraform thinks exists and what actually exists in AWS start diverging.&lt;/p&gt;

&lt;p&gt;One particularly bad week, two of our engineers were running terraform apply at the same time on overlapping modules. The state lock failed to engage properly, the state file got corrupted, and we spent an entire afternoon manually reconciling what was supposed to be deployed versus what was actually running. That was the moment I started seriously looking for alternatives.&lt;/p&gt;
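
&lt;p&gt;For anyone who has not run this arrangement, it looks roughly like the following backend block (bucket and table names here are placeholders). The DynamoDB table is what provides the locking that failed us that week.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket         = "example-co-terraform-state"  # placeholder name
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-locks"       # lock table, placeholder
    encrypt        = true
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;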

&lt;h2&gt;Enter MechCloud&lt;/h2&gt;

&lt;p&gt;A colleague mentioned &lt;a href="https://mechcloud.io" rel="noopener noreferrer"&gt;MechCloud&lt;/a&gt; in our DevOps Slack channel. The pitch was simple but bold: a platform that lets you provision and manage cloud infrastructure without ever touching a state file. I was skeptical at first. IaC without state files? How does it know what's already deployed?&lt;/p&gt;

&lt;p&gt;The answer, it turns out, is elegant: MechCloud treats the live cloud environment itself as the single source of truth. Instead of reconciling against a stored state file, it queries your actual cloud accounts in real time. No state, no drift, no locking issues by design.&lt;/p&gt;

&lt;h2&gt;What Stateless IaC Actually Means in Practice&lt;/h2&gt;

&lt;p&gt;MechCloud uses YAML-based blueprints to define your desired infrastructure state. You write what you want, and MechCloud figures out how to get there by comparing your definition against what's actually running in your cloud account. In practice, a few things stood out right away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No backend setup:&lt;/strong&gt; no S3 bucket, no DynamoDB table, no remote workspace configuration to wrangle. Connecting our AWS account took minutes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No state drift:&lt;/strong&gt; because MechCloud reads your live cloud on every operation, it always reflects reality. If someone manually changed a security group rule in the console, MechCloud sees it immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel deployments:&lt;/strong&gt; multiple engineers can deploy simultaneously to different parts of the infrastructure without worrying about locking each other out.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No import headaches:&lt;/strong&gt; existing infrastructure doesn't need to be imported before you can manage it, which made onboarding our existing AWS resources much less painful than I expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Real-Time Pricing - Surprisingly Useful&lt;/h2&gt;

&lt;p&gt;One feature I didn't expect to care about was real-time pricing. As you design your infrastructure in MechCloud, it shows you live pricing from the cloud providers right alongside your resources. Before you provision an EC2 instance, you can see exactly what it'll cost per hour and per month.&lt;/p&gt;

&lt;p&gt;We've caught a few over-provisioning mistakes at the design stage this way - instances that were sized higher than they needed to be. It's not a feature I'd have thought to ask for, but now that I've used it I'd find it hard to go back to designing infrastructure blind on cost.&lt;/p&gt;

&lt;h2&gt;Deep AWS Integration Out of the Box&lt;/h2&gt;

&lt;p&gt;We're a pure AWS shop, and MechCloud fits well into our workflow. Connecting our AWS accounts was straightforward, and the platform gave us immediate visibility into our entire AWS footprint from a single interface. The visualization layer in particular has made architecture review conversations a lot easier - having an interactive diagram of your live infrastructure is something I didn't realize I was missing until it was in front of me.&lt;/p&gt;

&lt;h2&gt;AI Agents: The Unexpected Bonus&lt;/h2&gt;

&lt;p&gt;I wasn't expecting to use MechCloud's AI agent capabilities much - I figured it was a nice-to-have. But during a late-night incident when I needed to quickly query the state of EC2 instances across multiple accounts, I typed a plain English question into the AWS Agent and got back exactly what I needed in seconds.&lt;/p&gt;

&lt;p&gt;It runs entirely in the cloud with no local setup and no API keys stored anywhere on our end. For quick incident queries it's saved me a fair bit of time.&lt;/p&gt;

&lt;h2&gt;Does It Replace Terraform Entirely?&lt;/h2&gt;

&lt;p&gt;Not yet, at least not for us. We're still running Terraform for some production workloads, and we've been testing MechCloud on our dev and QA instances since January 2026. The experience so far has been solid - faster deployments, zero state-related headaches, and the real-time pricing has already paid for itself in avoided over-provisioning. We're cautiously moving toward expanding its use, but I wouldn't want to overstate where we are.&lt;/p&gt;

&lt;p&gt;What I can say is that for AWS-focused teams, MechCloud covers the core infrastructure work well - VPCs, subnets, EC2, S3, RDS, EKS. If your pain is mostly state management and you're all-in on AWS, it's genuinely worth evaluating.&lt;/p&gt;

&lt;h2&gt;My Verdict&lt;/h2&gt;

&lt;p&gt;If you're an AWS-focused DevOps engineer who's spent too many hours dealing with state file corruption, lock contention, or drift - MechCloud is worth a look. It's not a silver bullet, and we haven't fully replaced Terraform with it yet, but the stateless approach does solve real problems in a way that actually holds up day-to-day.&lt;/p&gt;

&lt;p&gt;The pricing is reasonable for what you get, and the free tier is usable enough to form a real opinion before committing. I'm cautiously optimistic about where we'll be with it six months from now.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
