Forem: Matt

Why I Moved from Terraform to a Stateless IaC Platform and What I Learned

Matt — Fri, 01 May 2026 09:32:00 +0000

I was a Terraform believer.

I had written hundreds of .tf files. I had debugged state corruption at 2 AM. I had carefully designed remote backends in S3 with DynamoDB locking so my team wouldn't step on each other. I had given internal talks about Terraform best practices. If you had asked me two years ago whether I'd ever move away from it, I would have laughed.

And then I started using MechCloud — a stateless Infrastructure-as-Code platform that treats live cloud APIs as the source of truth instead of a state file — and I had to rethink a lot of things I took for granted.

This is not a hit piece on Terraform. It is still an exceptional tool with a massive ecosystem. But the experience of working without a state file made me realise how much mental overhead I had quietly normalised over the years, without even questioning it.

Here is what I learned.

What "stateless" actually means

Before I explain why I moved, I need to clarify what "stateless IaC" means — because it sounds paradoxical at first.

Every IaC tool needs to know what currently exists in your cloud account before it can figure out what changes to make. Terraform solves this by maintaining a state file: a record, stored locally or in a remote backend, of every resource it manages. When you run terraform plan, it compares your .tf files against this state file and checks the state file against the real cloud.

A stateless platform eliminates the state file entirely. Instead, it queries live cloud APIs at runtime to discover what actually exists. No cached record. No reconciliation between "what Terraform thinks exists" and "what actually exists." The cloud itself is the source of truth.

MechCloud does this by querying your AWS account directly — reading the real state of your resources on every run — and then using your YAML templates to determine what changes are needed.

Simple idea. Profound implications.
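
To make the idea concrete, here is a minimal Python sketch of a stateless plan. It is not MechCloud's implementation: the function name plan_changes, the choice to match resources by name, and the sample data are all assumptions for illustration. In practice the live inventory would come from cloud API calls; here it is a hard-coded dictionary.

```python
def plan_changes(desired, live):
    """Diff desired resources (from templates) against live resources (from the cloud API).

    Both arguments map a resource name to a dict of properties.
    Returns sorted lists of resource names to create, update, and delete.
    """
    create = sorted(name for name in desired if name not in live)
    delete = sorted(name for name in live if name not in desired)
    update = sorted(
        name for name in desired
        if name in live and desired[name] != live[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical data: what the template declares vs. what the cloud API returned.
desired = {
    "vpc1": {"cidr_block": "10.0.0.0/16"},
    "subnet1": {"cidr_block": "10.0.1.0/24"},
}
live = {
    "vpc1": {"cidr_block": "10.0.0.0/16"},
    "subnet1": {"cidr_block": "10.0.2.0/24"},  # edited by hand in the console
    "subnet-old": {"cidr_block": "10.0.9.0/24"},  # no longer in the template
}

print(plan_changes(desired, live))
```

The sketch sidesteps the hard part, which is identity: deciding which live resource corresponds to which template entry. It simply assumes names are unique and stable, which a real platform would have to guarantee some other way.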

The problems I didn't know I had

1. State drift was a constant background anxiety

With Terraform, state drift happens whenever someone changes infrastructure outside of Terraform: through the AWS Console, a CLI command, a Lambda function, a deployment pipeline you didn't write. Until the next refresh, Terraform has no idea these changes happened, and resources created entirely outside Terraform never show up at all. Its state file is now lying to you.

The result? terraform plan gives you a diff that is subtly wrong. You apply it, and either something breaks, or Terraform "fixes" the manual change and rolls it back, or you get a conflict error and spend an hour debugging.

I had built habits around this: periodic terraform refresh, strict policies against console access, "import" commands for resources added manually. These were all reasonable coping mechanisms. But they were all coping mechanisms for a problem that, I now realise, was architectural.

When the live cloud API is your source of truth, drift simply doesn't exist as a concept. What is in your cloud account is the truth. Every plan is based on reality, not on a cached snapshot of reality.

2. State file management was its own infrastructure problem

Where do you store the state file? S3 — but then you need bucket versioning, encryption, access policies, and lifecycle rules. You need a DynamoDB table for locking. You need IAM roles that can access both. You need to make sure the S3 bucket itself is in a region you trust.

I once spent a full day debugging a production incident that traced back to state file corruption caused by two CI/CD pipelines running terraform apply simultaneously, despite DynamoDB locking. The lock had been acquired, but a network timeout caused one pipeline to retry without properly releasing it first.

State file management is real infrastructure that you have to design, secure, maintain, and occasionally rescue. With a stateless platform, this problem category simply vanishes.

3. Onboarding a new team member was always an event

Any time a new DevOps engineer joined my team, their first week inevitably included:

Getting access to the S3 backend
Pulling the state file
Running terraform init with the right backend config
Discovering that someone had checked in a local .tfstate file by accident
Learning which modules were "safe" to plan vs. which ones would cause a cascade

With a stateless platform, there is nothing to initialise and no state file to synchronise. The new team member connects to the cloud account, and what they see in the YAML templates is exactly what they'll get. The learning curve shifts from "how does this tool manage state" to "what does this infrastructure actually do" — which is a much better thing to be learning.

4. Importing existing resources was painful

If you have resources that weren't created by Terraform (and in any real company, you always do), you have to use terraform import. This requires you to:

Write the resource configuration by hand
Know the exact resource ID in AWS
Run terraform import <resource> <id>
Verify that the state now matches reality
Run terraform plan and hope it shows no diff

This is tedious and error-prone. One mismatched attribute and your next terraform apply will attempt to modify or destroy a production resource.

A stateless platform that queries live APIs can discover your existing resources automatically. There is no import step. What exists in AWS is what the platform works with.

What the transition actually looked like

I want to be honest here: the transition was not instant, and it required a mindset shift.

The biggest adjustment was moving from HCL (Terraform's language) to YAML templates. MechCloud uses a hierarchical YAML structure where resource nesting mirrors API dependency. For example, a subnet lives under a VPC in the template because it depends on the VPC ID — and that hierarchy makes the dependency explicit and readable.

Here is a simplified example of what a VPC with a subnet looks like in MechCloud's YAML template format:

resources:
  - type: aws_ec2_vpc
    name: vpc1
    props:
      cidr_block: "10.0.0.0/16"
      enable_dns_support: true
      enable_dns_hostnames: true
    resources:
      - type: aws_ec2_subnet
        name: subnet1
        props:
          cidr_block: "10.0.1.0/24"

The subnet is nested under the VPC in the resources block. MechCloud automatically injects the VPC ID into the subnet creation call. You never manually write vpc_id = aws_vpc.main.id the way you would in Terraform.

At first this felt like "magic I can't see." Over time it felt like "removing boilerplate I was always writing."
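
One way to picture the injection step is the Python sketch below. It is a guess at the general technique, not MechCloud's actual code: the PARENT_ID_PROP table, the fake_create stand-in, and the vpc_id property name are all hypothetical, and the template is written as a Python literal to avoid a YAML dependency.

```python
import itertools

# Assumption: which property a parent type supplies to its children.
# Illustrative only; not MechCloud's real schema.
PARENT_ID_PROP = {"aws_ec2_vpc": "vpc_id"}

_counter = itertools.count(1)


def fake_create(resource_type, props):
    """Stand-in for a cloud API call; returns a made-up resource ID."""
    return f"{resource_type}-{next(_counter)}"


def provision(resources, parent=None, calls=None):
    """Walk the nested template depth-first, creating parents before children."""
    if calls is None:
        calls = []
    for res in resources:
        props = dict(res.get("props", {}))
        if parent is not None:
            parent_type, parent_id = parent
            prop_name = PARENT_ID_PROP.get(parent_type)
            if prop_name:
                props[prop_name] = parent_id  # inject the parent's ID
        new_id = fake_create(res["type"], props)
        calls.append((res["type"], res["name"], props))
        provision(res.get("resources", []), parent=(res["type"], new_id), calls=calls)
    return calls


template = [
    {
        "type": "aws_ec2_vpc",
        "name": "vpc1",
        "props": {"cidr_block": "10.0.0.0/16"},
        "resources": [
            {
                "type": "aws_ec2_subnet",
                "name": "subnet1",
                "props": {"cidr_block": "10.0.1.0/24"},
            },
        ],
    }
]

for call in provision(template):
    print(call)
```

The nesting does two jobs at once here: it fixes the creation order (parent first) and it tells the walker where the child's missing ID comes from.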

The other adjustment was psychological. When you've relied on terraform plan to show you a diff, working without a state file means trusting that the platform is accurately reading your live infrastructure. That trust took a few weeks to build — mostly through watching it work correctly, repeatedly, on real accounts.

What I won't go back to

State file management. Not for a second.

The cognitive overhead of ensuring state is accurate, backed up, locked, and shared correctly across a team adds up quietly over months and years. You stop noticing it the way you stop noticing background noise — until it's gone, and you realise how much quieter everything is.

Beyond that: the ability to discover existing infrastructure without import commands, the confidence that every plan reflects actual cloud state, and the simpler onboarding story are things I now consider table stakes.

Who should consider making this move

This approach is particularly compelling if:

Your team has experienced state drift, state corruption, or state locking incidents
You are managing infrastructure across AWS, Azure, and GCP and want a unified interface
You want to reduce the infrastructure-for-your-infrastructure overhead
You are exploring AI-assisted infrastructure provisioning (MechCloud has a PromptOps layer that lets you describe infrastructure in natural language)
You are starting a new project and can design without Terraform lock-in

It is less immediately compelling if:

You have a large existing Terraform codebase with dozens of modules (migration effort is real)
You depend on niche Terraform providers for non-cloud services
Your team is deeply invested in the HCL ecosystem and tooling

What I learned about learning

This experience taught me something beyond the technical.

I had spent years becoming an expert in Terraform's way of solving a problem, so much so that I had stopped questioning whether the problem itself was necessary. State files felt like a law of nature, like something IaC simply had to have. It took using a tool that eliminated them entirely to realise they were a design choice, not a requirement.

That's the kind of learning that doesn't show up on a resume but changes how you evaluate tools for the rest of your career.

What's next

In my next post, I'll walk through a hands-on tutorial: provisioning a complete AWS VPC setup with MechCloud's YAML templates — covering the resource hierarchy, the ref: syntax for cross-resource references, and how it compares to writing the equivalent in Terraform HCL.

Beyond the Console: The Modern DevOps Guide to Architecting on AWS

Matt — Sat, 11 Apr 2026 18:30:00 +0000

The cloud landscape has changed dramatically over the last few development cycles. When I first started working with AWS, a lot of my day was spent clicking through the Management Console to provision resources or troubleshoot misconfigurations. Today, the role of a DevOps engineer looks completely different. We are no longer just the gatekeepers of infrastructure; we are the architects of internal developer platforms.

Building on AWS today requires a mindset shift. It is about creating resilient, scalable systems that empower development teams to move faster without breaking things.

The Shift to Platform Engineering

Cloud engineering on AWS has evolved significantly from traditional sysadmin tasks. The days of logging into a terminal to manually tweak an EC2 instance or configure a database are long gone. Today, our focus is on building automated, self-healing systems.

As DevOps engineers, we increasingly act as product managers for internal infrastructure. Our goal is to provide a reliable foundation that abstracts away the underlying complexity of AWS services. This shift toward platform engineering changes how we design, deploy, and maintain our cloud environments.

Infrastructure as Code Maturity

Writing Infrastructure as Code (IaC) is the absolute baseline for any serious cloud environment. Modern tools like Terraform, Pulumi, and AWS Cloud Development Kit (CDK) allow us to treat our VPCs, IAM roles, and EKS clusters exactly like application code. We use version control, require peer reviews, and run automated tests before infrastructure changes ever hit production.

Consider a scenario where a company needs to duplicate their entire production environment in a new AWS region for disaster recovery. If the original infrastructure was built via manual console clicks, this process takes weeks of painful discovery. With a mature IaC setup, deploying a complete replica to a new region is often as simple as updating a region variable and triggering a CI/CD pipeline.

This approach also introduces the power of automated security testing. We can run policy checks before a pull request is merged to catch misconfigurations early. This ensures that no one accidentally exposes an S3 bucket to the public internet or provisions an unencrypted RDS instance.
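
A pre-merge policy check of this kind can be sketched in a few lines of Python. The sketch assumes the pipeline has already produced a machine-readable plan with terraform show -json; the two rules (public S3 bucket ACLs and RDS storage encryption) and the sample data are illustrative choices, not a complete policy set.

```python
PUBLIC_ACLS = {"public-read", "public-read-write"}


def find_violations(plan):
    """Return (address, reason) pairs for risky resources being created or updated."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # skip no-op and delete-only changes
        after = change.get("after") or {}
        if rc["type"] == "aws_s3_bucket_acl" and after.get("acl") in PUBLIC_ACLS:
            violations.append((rc["address"], "public bucket ACL"))
        # A missing or unknown value is treated as a violation, to stay conservative.
        if rc["type"] == "aws_db_instance" and not after.get("storage_encrypted"):
            violations.append((rc["address"], "storage not encrypted"))
    return violations


# Hand-written stand-in for the JSON plan; in CI this would be loaded from a file.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket_acl.logs", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"storage_encrypted": False}}},
        {"address": "aws_db_instance.reports", "type": "aws_db_instance",
         "change": {"actions": ["no-op"], "after": {"storage_encrypted": False}}},
    ]
}

for address, reason in find_violations(sample_plan):
    print(f"{address}: {reason}")
```

A pipeline step would exit non-zero when the list is non-empty, which blocks the merge. Dedicated policy tools cover far more cases; the point here is only the shape of the check.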

Scaling With Multiple Accounts

A single AWS account works fine for a new project, but it quickly becomes a tangled web of permissions as a company grows. Moving to a multi-account strategy using AWS Organizations and AWS Control Tower is a massive operational leap.

Structuring your AWS environment across multiple accounts provides several concrete advantages:

Workloads are strictly isolated to limit the blast radius of security incidents.
Service Control Policies (SCPs) enforce baseline security rules across the entire organization.
Identity and Access Management (IAM) permissions become much easier to scope down to least privilege.
Finance teams gain precise cost attribution based on account-level billing.

Instead of writing complex resource policies to prevent one team from modifying another team's Lambda functions, the account boundary provides strict isolation by default. This makes compliance audits much smoother and gives developers safe sandboxes to experiment in without risking production data.

Embedding Security and Compliance

Security in AWS works best when it is embedded into every layer of the delivery process. Relying on manual security reviews at the end of a release cycle slows down development and frustrates engineers. Instead, security should be automated and invisible where possible.

One major shift is moving away from static IAM access keys. By using OpenID Connect (OIDC) for CI/CD pipelines, tools like GitHub Actions can assume temporary IAM roles to deploy infrastructure. This eliminates the risk of long-lived AWS credentials being leaked in source code.
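
The role that a pipeline assumes needs a trust policy that names the OIDC provider and restricts which repository and branch may use it. Below is a small Python sketch that builds such a policy document for GitHub Actions. The account ID, repository name, and the helper name github_oidc_trust_policy are placeholders I made up; adapt the conditions to your own setup.

```python
import json


def github_oidc_trust_policy(account_id, repo, branch="main"):
    """Build an IAM trust policy that lets one GitHub repo branch assume a role."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used when the token is requested for AWS STS.
                        f"{provider}:aud": "sts.amazonaws.com",
                        # Pin the role to a single repository and branch.
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder account and repository, for illustration only.
policy = github_oidc_trust_policy("123456789012", "example-org/infra")
print(json.dumps(policy, indent=2))
```

The sub condition is the part that matters most: without it, any repository able to obtain a token from the same provider could assume the role.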

Additionally, continuously checking your security posture with AWS Security Hub and Amazon GuardDuty provides automated threat detection. These tools act as an ever-watchful set of eyes, alerting the team to anomalous behavior like an EC2 instance communicating with a known malicious IP address.

Making Cost an Engineering Metric

AWS provides incredible flexibility, but leaving the meter running on unoptimized resources can quickly destroy an IT budget. Cloud cost optimization must be integrated directly into the engineering lifecycle rather than treated as an afterthought.

Small architectural decisions compound heavily over time on AWS. For example, sending all traffic from private subnets to AWS services such as S3 or DynamoDB through a NAT Gateway can rack up thousands of dollars in data processing fees. Switching to VPC Endpoints keeps that traffic on the AWS network, drastically reducing the monthly bill while improving security.

Embracing managed services and compute optimization also drives down costs. Migrating workloads from standard x86 instances to AWS Graviton processors often yields immediate price-performance benefits. By setting strict tagging policies and auditing compliance with AWS Config, teams can accurately trace these costs back to specific products or environments.

Observability Beyond the Basics

Traditional monitoring relies on answering whether a server is up or down, but modern cloud-native applications require much deeper observability. Knowing that an API Gateway is returning 500 errors is only the first step in debugging an outage. Engineers need to know exactly which microservice, database query, or third-party API caused the failure.

Implementing tools like AWS X-Ray or OpenTelemetry allows teams to trace a single user request across the entire system. You can watch a request travel through an Application Load Balancer, trigger a container in ECS, and query an Aurora database. When an alert fires in the middle of the night, having this deep context readily available reduces the mean time to recovery drastically.

Building for Developer Experience

Ultimately, the goal of a modern DevOps practice on AWS is to get out of the developers' way safely. Infrastructure teams should not be a bottleneck for application deployments. We achieve this by focusing heavily on Developer Experience (DevEx) and creating "golden paths."

Golden paths are pre-approved, standardized templates for common architectures. If a developer needs to deploy a serverless application, they shouldn't need to become an expert in API Gateway integrations and IAM execution roles. They should be able to consume a self-service module that handles the heavy lifting.

By wrapping these self-service tools in automated guardrails, we ensure that every new deployment is secure, tagged correctly, and highly available by default. This approach keeps development velocity high while maintaining the strict reliability that enterprise environments demand.

Terraform State Files Explained: What They Are, Why They Exist, and Why They Scare Everyone

Matt — Thu, 09 Apr 2026 13:00:51 +0000

If you have been using Terraform for more than a few months you have almost certainly done one of these things:

Accidentally committed a terraform.tfstate file to git
Seen a terraform plan output that made no sense because the state was out of sync
Watched a colleague delete a resource that Terraform then tried to recreate on the next apply
Had a pipeline fail with "Error acquiring the state lock"

These are not beginner mistakes. They happen to experienced engineers who understand Terraform's syntax but have not fully internalised how state actually works under the hood.

This post fixes that. By the end you will have a clear mental model of what the state file is, why it has to exist, what it actually contains, and where it breaks down.

Why does Terraform need a state file at all?

This is the question most tutorials skip and it is the most important one to answer.

When you write a Terraform resource block like this:

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

Terraform needs to answer three questions every time you run terraform plan:

Does this VPC already exist in AWS?
If it exists, does it match what I have declared?
If it does not match, what needs to change?

To answer question 1, Terraform has two options. It could query the AWS API on every run and scan your entire account looking for a VPC with matching attributes. Or it could maintain a record of what it has already created and use that as a reference point.

Querying the full AWS API on every run sounds appealing but it does not scale. AWS does not expose a universal "find me the resource that matches these attributes" API. Every resource type has a different API shape. Some resources require multiple API calls to fully describe. And many attributes look identical across resources (two VPCs can have the same CIDR block). Terraform would have no reliable way to know which VPC it created versus which one already existed before it ran.

So Terraform keeps a state file. It is a record that maps every resource block in your configuration to a specific real-world resource ID in your cloud account. When Terraform creates your VPC it records the VPC ID (vpc-0a1b2c3d4e5f) in the state file alongside every attribute it set. On the next run it reads the state file, calls the AWS API to fetch the current attributes of vpc-0a1b2c3d4e5f specifically, and diffs the result against your configuration.

The state file is not a cache of your cloud. It is a mapping between your Terraform configuration and real infrastructure. That distinction matters a lot.

What is actually inside a state file?

The state file is a JSON document. You should open one at least once in your career. Here is a condensed version of what a single VPC resource looks like inside it:

{
  "version": 4,
  "terraform_version": "1.7.0",
  "resources": [
    {
      "mode": "managed",
      "type": "aws_vpc",
      "name": "main",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 1,
          "attributes": {
            "arn": "arn:aws:ec2:ap-south-1:123456789012:vpc/vpc-0a1b2c3d4e5f",
            "cidr_block": "10.0.0.0/16",
            "enable_dns_hostnames": true,
            "enable_dns_support": true,
            "id": "vpc-0a1b2c3d4e5f",
            "owner_id": "123456789012",
            "tags": {
              "Name": "main"
            }
          }
        }
      ]
    }
  ]
}

A few things worth noting here:

The id field is the anchor. This is the AWS resource ID that Terraform uses to look up the real resource on every subsequent run. Without it Terraform cannot find the resource.

All attributes are stored. Not just the ones you declared. Terraform stores every attribute the provider returned after creation including computed attributes like arn and owner_id that you never wrote in your config. This is how it can detect drift on attributes you did not explicitly set.

The schema version is tracked. When a provider upgrades and changes its resource schema Terraform uses this to run state migrations automatically. This is why upgrading provider versions sometimes triggers state changes even when your infrastructure has not changed.

There is no history. The state file is a snapshot of right now. There is no audit log of what changed or when. If you want history you need to enable versioning on your remote backend.
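
Since the file is plain JSON, the mapping it encodes is easy to pull out yourself. The Python sketch below reads a trimmed copy of the example above and prints each resource address with its cloud ID. The helper name resource_ids is mine, and the sketch ignores resources inside modules, so treat it as a reading aid, not a replacement for terraform state list.

```python
import json

# Trimmed copy of the example state shown above.
STATE_JSON = """
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_vpc",
      "name": "main",
      "instances": [
        {"schema_version": 1,
         "attributes": {"id": "vpc-0a1b2c3d4e5f", "cidr_block": "10.0.0.0/16"}}
      ]
    }
  ]
}
"""


def resource_ids(state):
    """Map each managed resource address to the cloud ID recorded in state."""
    mapping = {}
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # data sources are recorded too, but are not managed resources
        address = f'{res["type"]}.{res["name"]}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")  # present when count or for_each is used
            full = address if key is None else f"{address}[{json.dumps(key)}]"
            mapping[full] = inst["attributes"].get("id")
    return mapping


for address, cloud_id in resource_ids(json.loads(STATE_JSON)).items():
    print(address, "->", cloud_id)
```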

The three-way diff that terraform plan runs

Understanding the state file properly means understanding the three-way diff that terraform plan performs. Most people think of plan as a simple "config vs cloud" comparison. It is not.

It is actually:

Your .tf files  <-->  State file  <-->  Real AWS resources

Step 1: Config vs State
Terraform compares your .tf files against the state file. This tells it which resources were added, removed, or changed in your configuration since the last apply.

Step 2: State vs Cloud
For every resource that exists in the state file Terraform calls the AWS API to fetch its current real-world attributes. It compares these against what the state file recorded. Differences here indicate drift. Someone changed something outside of Terraform.

Step 3: Produce a plan
Terraform combines both diffs to produce the final plan. A resource might need to change because you edited the config, because it drifted from the state, or both.

This is why a terraform plan that shows unexpected changes is almost always one of two things: you changed the config intentionally, or something changed your infrastructure outside of Terraform. The less common third cause is a provider upgrade that changes a resource schema or default.
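
The two comparisons can be modelled in a few lines of Python. This is a simplified picture of the steps as described in this post, not Terraform's actual algorithm, and the attribute dictionaries are made-up sample data for a single VPC.

```python
def three_way_diff(config, state, cloud):
    """Return (config_changes, drift) for one resource.

    config_changes: declared attributes whose value differs from state,
                    as {attr: (recorded, declared)}.
    drift:          recorded attributes whose live value differs from state,
                    as {attr: (recorded, live)}.
    """
    config_changes = {
        attr: (state.get(attr), declared)
        for attr, declared in config.items()
        if state.get(attr) != declared
    }
    drift = {
        attr: (recorded, cloud.get(attr))
        for attr, recorded in state.items()
        if cloud.get(attr) != recorded
    }
    return config_changes, drift


# Sample data: the tag was edited in the .tf file, DNS hostnames in the console.
config = {"cidr_block": "10.0.0.0/16", "enable_dns_hostnames": True,
          "tags": {"Name": "core"}}
state = {"id": "vpc-0a1b2c3d4e5f", "cidr_block": "10.0.0.0/16",
         "enable_dns_hostnames": True, "tags": {"Name": "main"}}
cloud = {"id": "vpc-0a1b2c3d4e5f", "cidr_block": "10.0.0.0/16",
         "enable_dns_hostnames": False, "tags": {"Name": "main"}}

config_changes, drift = three_way_diff(config, state, cloud)
print("config changes:", config_changes)
print("drift:", drift)
```

Note that only declared attributes take part in the first comparison, while every recorded attribute takes part in the second. That asymmetry is why computed values such as id never show up as config changes.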

Where the state file breaks down

The state file model works well when everything goes through Terraform. It breaks down at the edges.

Drift from out-of-band changes

The state file only knows what Terraform did. If someone opens the AWS Console and changes a security group rule, adds a tag to an EC2 instance, or resizes an RDS instance, the state file has no idea. The next terraform plan will either flag it as drift (if Terraform manages that attribute) or silently ignore it (if it does not).

This is not a bug. It is a fundamental consequence of the state-file architecture. The state file is not a real-time reflection of your cloud. It is a record of what Terraform last did.

The import problem

If you have existing AWS resources that were not created by Terraform you cannot just write a resource block and run terraform apply. Terraform will try to create a new resource because it has no record of the existing one in its state.

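You have to use terraform import to manually associate the existing resource ID with the state file. This works but it is tedious, it requires you to know the exact resource ID, and you still have to write a perfectly matching configuration block or the next terraform plan will show a diff and potentially modify your resource. Terraform 1.5 added declarative import blocks and configuration generation, which reduce the typing, but you still have to supply the resource ID and review the generated result.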

State file as a single point of failure

If you are using local state (the default) and you lose your state file your Terraform configuration is now disconnected from your real infrastructure. Terraform does not know any of those resources exist. It will try to create duplicates on the next apply which will either fail (for resources that enforce uniqueness) or succeed and create a mess.

This is why local state is appropriate only for learning and why every production Terraform setup needs a remote backend. But that is a topic for the next post.

Sensitive values in state

Terraform stores sensitive values in the state file in plain text. If your configuration creates an RDS instance with a password or a Secrets Manager secret the state file will contain those values. This is a well-known issue with no perfect solution. Encrypting the remote backend and tightly controlling access to it are the minimum baseline.
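
You can check your own exposure with a quick scan. The Python sketch below walks a state document and reports attribute names that look like secrets and hold a value. The keyword list and the sample state are my own assumptions; it is a crude name-based heuristic that will produce both false positives and misses.

```python
# Assumption: substrings that suggest an attribute holds a secret.
SUSPECT_WORDS = ("password", "secret", "token", "private_key")


def find_plaintext_secrets(state):
    """Return addresses of attributes that look sensitive and are non-empty."""
    hits = []
    for res in state.get("resources", []):
        address = f'{res["type"]}.{res["name"]}'
        for inst in res.get("instances", []):
            for key, value in inst.get("attributes", {}).items():
                if value and any(word in key.lower() for word in SUSPECT_WORDS):
                    hits.append(f"{address}.{key}")  # report the name, never the value
    return hits


# Made-up state fragment for illustration.
sample_state = {
    "resources": [
        {"type": "aws_db_instance", "name": "main",
         "instances": [{"attributes": {"id": "db-1",
                                       "password": "example-only",
                                       "username": "admin"}}]},
        {"type": "aws_secretsmanager_secret_version", "name": "api",
         "instances": [{"attributes": {"id": "ver-1",
                                       "secret_string": "example-only"}}]},
    ]
}

for hit in find_plaintext_secrets(sample_state):
    print(hit)
```

The function deliberately returns attribute addresses instead of values, so its output is safe to paste into a ticket or a chat channel.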

The commands that touch state directly

Most engineers learn terraform plan and terraform apply early. Fewer learn the state management commands that become essential when things go wrong.

terraform state list
Lists every resource tracked in the current state file. Useful for a quick audit of what Terraform knows about.

terraform state list
aws_vpc.main
aws_subnet.public
aws_internet_gateway.main

terraform state show <resource>
Shows the full recorded attributes of a specific resource. Useful for debugging drift or checking what Terraform thinks a resource looks like.

terraform state show aws_vpc.main

terraform state rm <resource>
Removes a resource from the state file without destroying the real resource. Use this when you want Terraform to stop managing a resource. The resource continues to exist in AWS but Terraform forgets about it.

terraform state mv <source> <destination>
Moves a resource within the state file. The most common use case is renaming a resource or moving it into a module without destroying and recreating it.

terraform import <resource> <id>
Adds an existing AWS resource to the state file. This does not modify the resource. It just tells Terraform "this resource block in my config corresponds to this resource ID in AWS."

terraform refresh
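Updates the state file to match the real state of your infrastructure. This is essentially step 2 of the three-way diff run in isolation. Useful when you suspect drift but do not want to run a full plan. Note that this command is deprecated in current Terraform releases in favour of terraform plan -refresh-only and terraform apply -refresh-only, which show you the detected changes before anything is written to state.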

What most engineers get wrong about state

Treating the state file as a source of truth. The state file is a reference point, not truth. The real state of your infrastructure is in AWS. The state file can be stale, partial, or corrupted. Any serious workflow accounts for that.

Never opening the state file. The state file is a plain JSON document. Reading it when something goes wrong is one of the fastest ways to understand what Terraform actually knows. Do not treat it as a black box.

Ignoring drift until it causes an incident. Drift accumulates quietly. A security group rule changed here, a tag modified there. None of it breaks anything immediately. Then someone runs terraform apply in a pipeline and Terraform "corrects" the drift at the worst possible time.

Storing state locally in any shared environment. The moment more than one person or one pipeline touches the same infrastructure, local state becomes a race condition waiting to happen.

The key mental model

Think of the state file as a marriage certificate between your Terraform configuration and your real AWS resources.

The certificate does not prove the marriage is in good shape right now. It just records that the marriage happened and identifies both parties. You still have to do the work of keeping the relationship healthy. And if you lose the certificate things get complicated fast.

Everything about Terraform state management flows from this: the need for remote backends, the import problem, drift detection, the sensitivity around who can access state and when. Once you have this mental model the rest of Terraform's state behaviour starts to make sense.