<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mian Zubair</title>
    <description>The latest articles on Forem by Mian Zubair (@mianzubair).</description>
    <link>https://forem.com/mianzubair</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1660940%2Fe42fb382-c879-451f-b798-30cb18e1d319.jpeg</url>
      <title>Forem: Mian Zubair</title>
      <link>https://forem.com/mianzubair</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mianzubair"/>
    <language>en</language>
    <item>
      <title>Your Production Code Is Training AI Models Right Now (And How to Audit Your Stack)</title>
      <dc:creator>Mian Zubair</dc:creator>
      <pubDate>Wed, 01 Apr 2026 03:00:17 +0000</pubDate>
      <link>https://forem.com/mianzubair/your-production-code-is-training-ai-models-right-now-and-how-to-audit-your-stack-1io</link>
      <guid>https://forem.com/mianzubair/your-production-code-is-training-ai-models-right-now-and-how-to-audit-your-stack-1io</guid>
      <description>&lt;p&gt;Every AI coding tool you use needs access to your code to function. Copilot reads your files for completions. Cursor indexes your project for context. LangChain traces log your prompts and outputs for observability.&lt;/p&gt;

&lt;p&gt;The problem is not that these tools access your code. The problem is that most engineers never ask what happens to that code after the tool processes it. Where does the telemetry go? Who trains on it? Is your proprietary logic ending up in a foundation model's training set?&lt;/p&gt;

&lt;p&gt;This week, GitHub's decision to opt all users into AI model training by default made this question impossible to ignore. But GitHub is not the only platform doing this. It is the default pattern across the entire AI tooling stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Default Is Always "Opted In"
&lt;/h2&gt;

&lt;p&gt;Here is how it works at almost every AI tool company: ship the feature, opt everyone in, bury the toggle three levels deep in settings, and wait for someone to notice.&lt;/p&gt;

&lt;p&gt;GitHub opted users into training data collection. The setting is under Settings, Privacy, and you have to manually disable it. Cursor uploads your project files for cloud-based indexing to power its AI features. LangSmith, the observability layer for LangChain, logs your prompts, model outputs, and even API keys that appear in traces by default.&lt;/p&gt;

&lt;p&gt;None of this is hidden exactly. It is documented if you know where to look. But documentation is not consent. And the default matters more than the documentation, because most engineers never change the defaults.&lt;/p&gt;

&lt;p&gt;The real issue is compounding exposure. Each tool on its own seems manageable. But when you stack Copilot, Cursor, LangSmith, and your CI/CD telemetry together, your entire codebase is being transmitted to four different cloud providers simultaneously. None of them coordinate on data handling. Each has its own retention policy, its own training pipeline, its own definition of "anonymous".&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Production Systems
&lt;/h2&gt;

&lt;p&gt;If you are building AI systems in production, your codebase contains things that should never leave your organization: proprietary algorithms, customer data handling logic, API keys in commit history, infrastructure patterns that reveal your architecture.&lt;/p&gt;

&lt;p&gt;When I was building Menthera, our voice AI system handled sensitive mental health conversations. The architecture included multi-LLM orchestration across Claude, GPT, and Gemini, persistent memory via Mem0, and real-time voice processing through WebRTC. If any of that codebase had ended up in a training set, it would have exposed not just our code but the design decisions that gave us our technical edge.&lt;/p&gt;

&lt;p&gt;This is the reality for every team shipping AI features in production. Your code is not just code. It is your competitive advantage, your security surface, and your liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Point Audit Every Team Should Run This Week
&lt;/h2&gt;

&lt;p&gt;Here is what I recommend for any team using AI coding tools in production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Inventory every AI tool touching your codebase
&lt;/h3&gt;

&lt;p&gt;List them all: IDE extensions, AI coding assistants, observability platforms, CI/CD integrations. If it processes your code, it goes on the list. Most teams are surprised to find they have 5 or more AI tools with code access.&lt;/p&gt;
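&lt;p&gt;As a starting point, the IDE-extension slice of that inventory can be scripted. The sketch below uses VS Code's own CLI; the extension-name patterns are illustrative, not exhaustive:&lt;/p&gt;

```shell
# List installed VS Code extensions and flag the AI ones.
# The guard keeps the script from failing on machines without the CLI.
if command -v code > /dev/null; then
  ai_exts=$(code --list-extensions | grep -iE "copilot|tabnine|codeium|cody" || true)
else
  ai_exts="VS Code CLI not found; check your IDE's extension manager instead"
fi
echo "${ai_exts:-no matching AI extensions installed}"
```

&lt;p&gt;Repeat the same pass for CI/CD integrations and observability SDKs; the IDE is only the most visible layer.&lt;/p&gt;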

&lt;h3&gt;
  
  
  2. Check telemetry and data sharing settings for each tool
&lt;/h3&gt;

&lt;p&gt;Go into settings for every tool on your list. Look for "telemetry", "data sharing", "model training", and "usage analytics". Disable anything that sends code content upstream. This takes 20 minutes and could save you from a data leak you never knew was happening.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scan your commit history for secrets
&lt;/h3&gt;

&lt;p&gt;Run truffleHog or gitleaks against your repository. Secrets in commit history are the first thing that leaks when your code ends up in a training pipeline. Even if you rotated the key, the old one is still in git history. And git history is exactly the kind of data that gets bulk-ingested for training.&lt;/p&gt;
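&lt;p&gt;A minimal history scan looks like this, assuming at least one of the two scanners is installed (the flags shown are their documented history-scanning modes):&lt;/p&gt;

```shell
# Scan the repository's full git history for committed secrets.
# Each scanner is skipped gracefully if it is not installed.
found_scanner=0
if command -v gitleaks > /dev/null; then
  gitleaks detect --source . --log-opts="--all" || true
  found_scanner=1
fi
if command -v trufflehog > /dev/null; then
  trufflehog git "file://$PWD" --only-verified || true
  found_scanner=1
fi
if [ "$found_scanner" -eq 0 ]; then
  echo "install gitleaks or trufflehog first"
fi
```

&lt;p&gt;Run this in CI as well, so a newly committed secret fails the build instead of sitting quietly in history.&lt;/p&gt;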

&lt;h3&gt;
  
  
  4. Add ignore files for sensitive paths
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.cursorignore&lt;/code&gt; file to prevent Cursor from indexing sensitive directories. Configure Copilot content exclusions (in your repository or organization settings) to block Copilot from reading specific paths. These take minutes to set up and permanently reduce your exposure surface.&lt;/p&gt;
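&lt;p&gt;A minimal &lt;code&gt;.cursorignore&lt;/code&gt; sketch (it uses &lt;code&gt;.gitignore&lt;/code&gt; syntax; the paths below are placeholders for whatever is sensitive in your repo):&lt;/p&gt;

```gitignore
# .cursorignore -- keep these out of Cursor's index
.env
.env.*
secrets/
infra/
internal/billing/
```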

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The model powering your AI feature is replaceable. You can swap Claude for GPT for Gemini and your system keeps working. But your proprietary code appearing in someone else's training set is permanent. There is no "undo" for training data.&lt;/p&gt;

&lt;p&gt;The engineers who treat their code as a data liability, not just a product, will build more defensible systems in the long run.&lt;/p&gt;

&lt;p&gt;Have you ever audited what data your AI coding tools send home? Most engineers I talk to have not. The tools are too useful to question and too convenient to distrust. But convenience is exactly how data leaks become invisible.&lt;/p&gt;

&lt;p&gt;This week is a good time to start. Run the audit. Check the settings. Treat your code like the liability it is. The 20 minutes you spend now could prevent a data exposure you would never be able to reverse.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Difference Between Junior and Senior Engineers Isn't the Code They Write</title>
      <dc:creator>Mian Zubair</dc:creator>
      <pubDate>Sat, 28 Mar 2026 18:20:27 +0000</pubDate>
      <link>https://forem.com/mianzubair/the-difference-between-junior-and-senior-engineers-isnt-the-code-they-write-153j</link>
      <guid>https://forem.com/mianzubair/the-difference-between-junior-and-senior-engineers-isnt-the-code-they-write-153j</guid>
      <description>&lt;p&gt;After 4 years of shipping production systems across AI platforms, mobile apps, and AWS serverless backends, I've noticed a pattern. The engineers who ship the fastest and break the least aren't the ones writing the cleverest code. They're the ones who set up the system before writing any code at all.&lt;/p&gt;

&lt;p&gt;Here are the four habits I've seen consistently separate senior engineers from juniors in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Seniors Design for Failure First
&lt;/h2&gt;

&lt;p&gt;A junior engineer builds the happy path. User signs up, data saves, response returns. Everything works in development. Everything breaks in production.&lt;/p&gt;

&lt;p&gt;A senior engineer starts with the question: "What happens when this fails?" They add circuit breakers on external API calls so one downstream timeout doesn't cascade into a full system outage. They configure DynamoDB TTLs to auto-expire stale data instead of letting tables grow unbounded. They wire up SQS dead letter queues on day one so failed messages don't silently disappear.&lt;/p&gt;

&lt;p&gt;The difference isn't paranoia. It's experience. Once you've been woken up at 2am because a third-party API went down and took your entire service with it, you never skip failure handling again.&lt;/p&gt;
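&lt;p&gt;The circuit-breaker idea fits in a few lines. This is an illustrative sketch, not a specific library; production code would add metrics and jittered cooldowns:&lt;/p&gt;

```typescript
// Minimal circuit-breaker sketch. After `threshold` consecutive failures
// the breaker opens and fails fast for `cooldownMs`, instead of hammering
// a dependency that is already down.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  async call(fn: () => any) {
    if (this.openedAt !== 0) {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed >= this.cooldownMs) {
        this.openedAt = 0; // half-open: let one trial call through
      } else {
        throw new Error("circuit open: failing fast");
      }
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

&lt;p&gt;Wrap every outbound call to a third-party API in something like this, and one downstream timeout stays a downstream timeout instead of becoming your outage.&lt;/p&gt;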

&lt;h2&gt;
  
  
  2. Seniors Read Code More Than They Write It
&lt;/h2&gt;

&lt;p&gt;When a junior engineer gets a new task, they open a blank file and start coding. When a senior engineer gets the same task, they spend the first 30 minutes reading the existing codebase.&lt;/p&gt;

&lt;p&gt;This isn't slowness. It's precision. They're looking for existing patterns, shared utilities, naming conventions, and architectural decisions that already exist. They want to understand what's there before adding anything new.&lt;/p&gt;

&lt;p&gt;In a NestJS monorepo I work on for a US client, there are shared modules, custom decorators, and TypeORM repository patterns that took weeks to establish. A junior who skips the reading phase will duplicate logic, break conventions, and create tech debt that someone else has to clean up in the next sprint.&lt;/p&gt;

&lt;p&gt;Reading code is the most underrated engineering skill. The best engineers I've worked with spend more time reading than writing. They know the codebase well enough to reuse existing patterns rather than reinvent them. That alone can cut delivery time dramatically and prevent whole classes of duplication bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Seniors Debug the System, Not the Symptom
&lt;/h2&gt;

&lt;p&gt;A junior engineer sees a 500 error, finds the line that threw, adds a try-catch, and moves on. The bug appears fixed. It isn't.&lt;/p&gt;

&lt;p&gt;A senior engineer traces the request through CloudWatch Logs, checks the Lambda cold start latency, examines the DynamoDB consumed capacity, and discovers the real root cause is two services upstream: a misconfigured SQS visibility timeout that causes duplicate processing under load.&lt;/p&gt;

&lt;p&gt;The symptom was a 500 error. The cause was a queue configuration that nobody looked at since it was deployed 6 months ago. Seniors don't fix symptoms. They trace the full path and fix the system.&lt;/p&gt;

&lt;p&gt;This is where observability tools earn their cost. Without structured logging, distributed tracing, and CloudWatch dashboards, you're debugging by guessing. Seniors set up observability before they need it.&lt;/p&gt;
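&lt;p&gt;"Structured logging" just means emitting one machine-parseable object per event. A minimal sketch (the field names are illustrative, not a required schema):&lt;/p&gt;

```typescript
// Emit one JSON object per log line so CloudWatch Logs Insights can
// filter on fields (level, service, requestId) instead of grepping
// free-form strings.
function formatLogEvent(fields: object): string {
  return JSON.stringify({ timestamp: new Date().toISOString(), ...fields });
}

console.log(formatLogEvent({
  level: "error",
  service: "orders",
  requestId: "abc-123",
  message: "upstream timeout after 3 retries",
}));
```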

&lt;h2&gt;
  
  
  4. Seniors Ask "What Happens at 10x?"
&lt;/h2&gt;

&lt;p&gt;Juniors design for the traffic they have today. Seniors design for the traffic they expect in 6 months.&lt;/p&gt;

&lt;p&gt;This doesn't mean premature optimization. It means asking one question before every architecture decision: "What happens when this gets 10x the current load?"&lt;/p&gt;

&lt;p&gt;When I built the real-time voice AI system for Menthera, I added a Redis cache layer between the application and DynamoDB from the start. Not because we had scale problems on day one, but because I knew that concurrent WebRTC sessions would hammer the database if we didn't have a read cache.&lt;/p&gt;

&lt;p&gt;Seniors add a partition strategy, a connection pool, a cache layer, and a rate limiter before the spike hits. Not after the outage. The cost of adding these later is always higher than adding them from the start.&lt;/p&gt;
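&lt;p&gt;The read cache from the Menthera example is the classic cache-aside pattern. Here is a sketch with a plain object standing in for Redis and an async loader standing in for DynamoDB (the names are illustrative):&lt;/p&gt;

```typescript
// Cache-aside: check the cache first, fall back to the source of truth
// on a miss, then write the value back for the next reader.
function makeCachedGetter(load: (key: string) => any) {
  const cache: { [key: string]: string } = {};
  return async (key: string) => {
    if (key in cache) {
      return cache[key]; // hit: no database read
    }
    const value = await load(key); // miss: read the source of truth
    cache[key] = value; // write back (with Redis, set a TTL here)
    return value;
  };
}
```

&lt;p&gt;The same shape works with a Redis client in front of DynamoDB; the only additions are a TTL on the write-back and an invalidation path for updates.&lt;/p&gt;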

&lt;h2&gt;
  
  
  The Real Difference
&lt;/h2&gt;

&lt;p&gt;Most engineers optimize for writing code. Senior engineers optimize for not writing code.&lt;/p&gt;

&lt;p&gt;They read before they write. They design for failure before they design for features. They trace before they patch. They plan for scale before they need it.&lt;/p&gt;

&lt;p&gt;The code itself is the least important part of the job. The system around the code is everything.&lt;/p&gt;

&lt;p&gt;If you take one thing from this: the next time you sit down to build something, spend the first hour on everything except writing code. Read the existing codebase. Add failure handling. Set up observability. Ask what happens at 10x. Then write the code. You'll ship faster and sleep better.&lt;/p&gt;

&lt;p&gt;What's the one habit that changed how you approach engineering?&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>career</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>4 pgvector Mistakes That Silently Break Your RAG Pipeline in Production</title>
      <dc:creator>Mian Zubair</dc:creator>
      <pubDate>Fri, 27 Mar 2026 16:43:39 +0000</pubDate>
      <link>https://forem.com/mianzubair/4-pgvector-mistakes-that-silently-break-your-rag-pipeline-in-production-4e0p</link>
      <guid>https://forem.com/mianzubair/4-pgvector-mistakes-that-silently-break-your-rag-pipeline-in-production-4e0p</guid>
      <description>&lt;p&gt;pgvector is the fastest way to add vector search to an existing PostgreSQL database. One extension, a few SQL commands, and you have similarity search running alongside your relational data. No new infrastructure. No new SDK. No vendor lock-in.&lt;/p&gt;

&lt;p&gt;That simplicity is also its trap. Most teams add pgvector in a day and spend the next six months debugging performance issues that have nothing to do with the extension itself. The problems are almost always configuration mistakes that tutorials skip over.&lt;/p&gt;

&lt;p&gt;Here are four I have seen break RAG pipelines in production, and how to fix each one before your team starts debating a migration to Pinecone.&lt;/p&gt;

&lt;h2&gt;
  
  
  No HNSW Index Means Full Table Scans
&lt;/h2&gt;

&lt;p&gt;By default, pgvector performs exact nearest neighbor search. That means it scans every single row in the table on every query. For a prototype with 10,000 vectors, this is invisible. At 500,000 vectors, queries start crossing 800 milliseconds. At a million, you are looking at multi-second response times that make your RAG pipeline feel broken.&lt;/p&gt;

&lt;p&gt;The fix is a single SQL statement: create an HNSW index on your vector column. HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor algorithm that trades a tiny amount of accuracy for massive speed improvements. After adding the index, the same 500K-vector query drops to under 50 milliseconds.&lt;/p&gt;
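&lt;p&gt;The statement itself, assuming a &lt;code&gt;documents&lt;/code&gt; table with an &lt;code&gt;embedding&lt;/code&gt; column (both names are placeholders):&lt;/p&gt;

```sql
-- Build an approximate-nearest-neighbor index for cosine distance.
-- Use vector_l2_ops or vector_ip_ops instead if your queries use a
-- different distance operator.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Optional: trade a little recall for speed (or vice versa) per session.
SET hnsw.ef_search = 100;
```

&lt;p&gt;The opclass must match the distance operator your queries use, or the planner will fall back to a sequential scan.&lt;/p&gt;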

&lt;p&gt;The reason this catches teams off guard is that pgvector works perfectly without the index. There is no warning, no error, no degradation signal. It just gets slower as data grows, and most teams blame the embedding model or the LLM before they check the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensionality Is Not Free
&lt;/h2&gt;

&lt;p&gt;OpenAI's ada-002 embedding model outputs vectors with 1,536 dimensions. Each vector row in PostgreSQL consumes roughly 6 kilobytes of storage. Scale that to one million documents and you are looking at 6 gigabytes just for the embeddings column, before accounting for the HNSW index overhead, which can double or triple the total.&lt;/p&gt;

&lt;p&gt;This matters because your AWS or cloud bill is not driven by the LLM API calls most teams obsess over. It is driven by the RDS instance size and storage needed to hold and index those vectors. A db.r6g.xlarge running pgvector with a million high-dimensional vectors costs real money every month.&lt;/p&gt;

&lt;p&gt;The alternative is a smaller embedding model. Cohere's embed-english-light-v3.0 outputs 384 dimensions and performs competitively on most retrieval benchmarks. That cuts storage by 75 percent and proportionally reduces index build time, memory usage, and query latency. Unless your use case specifically requires the nuance of 1,536 dimensions, smaller is almost always the right production choice.&lt;/p&gt;
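&lt;p&gt;The arithmetic is worth sanity-checking yourself. pgvector stores each dimension as a 4-byte float, plus a small per-vector header (approximated as 8 bytes in this sketch):&lt;/p&gt;

```typescript
// Back-of-envelope storage math: 4 bytes per dimension plus a small
// per-vector header, before any HNSW index overhead.
function embeddingStorageGb(rows: number, dims: number): number {
  const bytesPerVector = dims * 4 + 8;
  return (rows * bytesPerVector) / (1024 ** 3);
}

console.log(embeddingStorageGb(1_000_000, 1536).toFixed(2)); // roughly 5.7 GB
console.log(embeddingStorageGb(1_000_000, 384).toFixed(2));  // roughly 1.4 GB
```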

&lt;h2&gt;
  
  
  Wrong Distance Function, Wrong Results
&lt;/h2&gt;

&lt;p&gt;Most tutorials use cosine similarity as the default distance function, and most teams never question it. But pgvector supports three core distance functions: cosine distance, inner product, and L2 (Euclidean) distance. Each one measures "similarity" differently, and the choice directly affects which documents appear in your top-K results.&lt;/p&gt;

&lt;p&gt;Cosine similarity measures the angle between vectors, ignoring magnitude. Inner product considers both direction and magnitude; for embeddings already normalized to unit length (which most modern embedding models produce), it ranks results identically to cosine similarity while being cheaper to compute. L2 distance measures the straight-line distance between vector endpoints, which works best when magnitude carries meaningful information.&lt;/p&gt;

&lt;p&gt;The practical impact is real. I have seen cases where switching from cosine to inner product on the same dataset changed three of the top five results. If your RAG pipeline returns mediocre answers and you have already tuned your chunking strategy and prompt, check the distance function before anything else. It is a one-line configuration change that can transform result quality.&lt;/p&gt;
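&lt;p&gt;In pgvector, the choice is made per query through the distance operator, shown here on a hypothetical &lt;code&gt;documents&lt;/code&gt; table (the &lt;code&gt;'[...]'&lt;/code&gt; placeholder stands for your query vector literal):&lt;/p&gt;

```sql
-- cosine distance
SELECT id FROM documents ORDER BY embedding &lt;=&gt; '[...]' LIMIT 5;
-- negative inner product
SELECT id FROM documents ORDER BY embedding &lt;#&gt; '[...]' LIMIT 5;
-- L2 (Euclidean) distance
SELECT id FROM documents ORDER BY embedding &lt;-&gt; '[...]' LIMIT 5;
```

&lt;p&gt;Whichever operator you standardize on, make sure your HNSW index was built with the matching opclass, or the index will not be used.&lt;/p&gt;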

&lt;h2&gt;
  
  
  Know the Scaling Ceiling
&lt;/h2&gt;

&lt;p&gt;pgvector is not a dedicated vector database. It is an extension that adds vector operations to PostgreSQL, and PostgreSQL was not designed to be a vector search engine at scale. In practice, pgvector handles up to about five million vectors comfortably on a db.r6g.xlarge instance with proper HNSW indexing. Past ten million vectors, expect query performance to degrade under concurrent load, and index build times to become a deployment bottleneck.&lt;/p&gt;

&lt;p&gt;For most teams, this ceiling is not a problem. The majority of production RAG systems index fewer than five million documents. If you are in that range and already running PostgreSQL, adding pgvector is the right call. You avoid the operational complexity of a separate vector database, keep your data in one place, and eliminate an entire category of infrastructure to manage.&lt;/p&gt;

&lt;p&gt;If you are genuinely approaching the ten million mark, look at pgvectorscale (Timescale's companion extension, which adds a disk-backed StreamingDiskANN index and vector compression) or evaluate a dedicated solution like Pinecone or Weaviate. But make that decision based on actual data volume, not on anxiety about future scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Config Is the Bottleneck
&lt;/h2&gt;

&lt;p&gt;The pattern I see repeated is predictable. Week one, a team adds pgvector and it works great. By month two, queries slow down and nobody thinks to check the index. By month four, someone proposes migrating to a managed vector database. By month six, a senior engineer adds one HNSW index and the problem disappears.&lt;/p&gt;

&lt;p&gt;pgvector is a genuinely excellent tool for most production RAG systems. The mistakes that break it are not bugs or limitations. They are configuration gaps that tutorials gloss over and documentation buries. Fix the index, right-size the dimensions, pick the correct distance function, and know your scaling ceiling. That is the entire playbook.&lt;/p&gt;

&lt;p&gt;What vector store is your team running in production right now?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What Really Happens When You Deploy with AWS CDK?</title>
      <dc:creator>Mian Zubair</dc:creator>
      <pubDate>Tue, 10 Mar 2026 15:26:08 +0000</pubDate>
      <link>https://forem.com/mianzubair/what-really-happens-when-you-deploy-with-aws-cdk-4kli</link>
      <guid>https://forem.com/mianzubair/what-really-happens-when-you-deploy-with-aws-cdk-4kli</guid>
      <description>&lt;p&gt;A behind-the-scenes guide to CDK internals — Logical IDs, synthesis, bootstrap trust chains, and the replacement logic that most teams learn the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Behind-the-Scenes Guide That Most Teams Learn the Hard Way
&lt;/h2&gt;




&lt;p&gt;It was a Friday afternoon. The kind of Friday where everything had gone smoothly — too smoothly.&lt;/p&gt;

&lt;p&gt;A senior engineer on a team pushed what he called a "cleanup refactor." No business logic changed. No new features. Just reorganizing CDK constructs into cleaner modules. The kind of work that gets approved in code review in minutes because &lt;em&gt;nothing functional changed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;He ran &lt;code&gt;cdk deploy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;CloudFormation accepted the changeset. Thirty seconds later, Slack lit up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The production DynamoDB table — 4 million user records — was gone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not corrupted. Not locked. &lt;em&gt;Deleted and recreated empty.&lt;/em&gt; Because CloudFormation saw a different Logical ID and concluded the old table should be removed and a new one created in its place.&lt;/p&gt;

&lt;p&gt;The engineer didn't change a single schema property. He moved a construct from one file to another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That one action cost the team 11 hours of downtime, a backup restoration, and a very uncomfortable conversation with their CTO.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This guide exists so that conversation never happens on your team.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Guide Is For
&lt;/h2&gt;

&lt;p&gt;If you lead an engineering team that uses — or is adopting — AWS CDK, this guide is for you.&lt;/p&gt;

&lt;p&gt;This is not a getting-started tutorial. You won't find "how to create your first S3 bucket" here.&lt;/p&gt;

&lt;p&gt;Instead, this is the guide that explains &lt;strong&gt;what is actually happening&lt;/strong&gt; beneath the surface when your team runs &lt;code&gt;cdk deploy&lt;/code&gt;. The mental model that separates teams who use CDK confidently from teams who are one refactor away from a production incident.&lt;/p&gt;

&lt;p&gt;We'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why CDK is a &lt;strong&gt;compiler&lt;/strong&gt;, not a provisioning tool&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Logical ID&lt;/strong&gt; system that silently controls resource identity&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;context caching&lt;/strong&gt; creates invisible divergence between local and CI&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;bootstrap trust chain&lt;/strong&gt; that most teams never fully understand&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;replacement logic&lt;/strong&gt; that CloudFormation uses — and CDK does not control&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;production playbook&lt;/strong&gt; for teams managing real infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every section connects back to a single question: &lt;em&gt;How does this knowledge protect production?&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: CDK Is a Compiler — Not What You Think It Is
&lt;/h2&gt;

&lt;p&gt;Here's the first mental model shift that changes everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDK does not provision infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Read that again. The tool your team writes infrastructure code in — it never talks to EC2, S3, DynamoDB, or any AWS service directly. Not once.&lt;/p&gt;

&lt;p&gt;Here's what CDK actually does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Executes your code&lt;/strong&gt; — your TypeScript, Python, or Java runs like any normal program&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds an in-memory construct tree&lt;/strong&gt; — a hierarchy of objects representing your infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizes a CloudFormation template&lt;/strong&gt; — translates that tree into a JSON/YAML template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hands everything off to CloudFormation&lt;/strong&gt; — and exits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. CDK's job is done before a single resource is created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudFormation&lt;/strong&gt; is the engine that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores the current state of your stack&lt;/li&gt;
&lt;li&gt;Calculates the diff between old and new templates&lt;/li&gt;
&lt;li&gt;Determines which resources need updating, replacing, or deleting&lt;/li&gt;
&lt;li&gt;Calls the actual AWS service APIs&lt;/li&gt;
&lt;li&gt;Handles rollback if something fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CDK is the compiler. CloudFormation is the runtime.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This distinction matters enormously. When something goes wrong during deployment — a resource gets replaced, a permission fails, a rollback triggers — the answer almost never lives in your CDK code. It lives in the relationship between your synthesized template and CloudFormation's state machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for your team:&lt;/strong&gt; When an engineer says "CDK deleted my table," that's technically wrong. CDK produced a template. CloudFormation decided to delete the table based on that template. Understanding this boundary is the first step to debugging infrastructure issues effectively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: The Construct Tree — CDK's Object Model
&lt;/h2&gt;

&lt;p&gt;Before we can understand why refactoring causes replacements, we need to understand how CDK organizes infrastructure internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Everything Is a Construct
&lt;/h3&gt;

&lt;p&gt;Every piece of infrastructure in CDK — a bucket, a Lambda function, an IAM role — is a &lt;strong&gt;Construct&lt;/strong&gt;. Constructs are nested inside other constructs, forming a tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App
 └── Stack (e.g., ProdStack)
      ├── Construct (e.g., ApiService)
      │    ├── Lambda Function
      │    └── API Gateway
      └── Construct (e.g., DataLayer)
           ├── DynamoDB Table
           └── S3 Bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tree is the &lt;strong&gt;source of truth&lt;/strong&gt; for template generation. Every construct has a &lt;strong&gt;path&lt;/strong&gt; determined by its position in the tree — and that path has consequences we'll explore in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Levels of Abstraction
&lt;/h3&gt;

&lt;p&gt;CDK constructs operate at three levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw CloudFormation resource — a 1:1 mapping. Prefixed with &lt;code&gt;Cfn&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CfnBucket&lt;/code&gt;, &lt;code&gt;CfnTable&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An opinionated abstraction with sensible defaults and helper methods.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Bucket&lt;/code&gt;, &lt;code&gt;Table&lt;/code&gt;, &lt;code&gt;Function&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-wired patterns that compose multiple resources together.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;LambdaRestApi&lt;/code&gt;, &lt;code&gt;ApplicationLoadBalancedFargateService&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams work at L2. It's the sweet spot — enough abstraction to move fast, enough control to customize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The critical thing to understand:&lt;/strong&gt; When you write CDK code, you are building an object tree in memory. No AWS API calls happen during this phase. No infrastructure is queried or created. You're constructing a blueprint.&lt;/p&gt;

&lt;p&gt;The moment that blueprint becomes real is during &lt;strong&gt;synthesis&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Synthesis and Logical IDs — Where Refactoring Becomes Dangerous
&lt;/h2&gt;

&lt;p&gt;This is the section that explains the Friday afternoon disaster from our opening story.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens During &lt;code&gt;cdk synth&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;cdk synth&lt;/code&gt;, your CDK application executes as a normal program. The construct tree is built, and then CDK walks that tree to produce a CloudFormation template.&lt;/p&gt;

&lt;p&gt;During this walk, four things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Each construct is visited&lt;/strong&gt; — its properties are collected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical IDs are generated&lt;/strong&gt; — a unique identifier for each resource&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens are resolved&lt;/strong&gt; — cross-references between resources are wired up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A template is written&lt;/strong&gt; to the &lt;code&gt;cdk.out&lt;/code&gt; directory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No infrastructure exists yet. This is pure compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logical IDs — The Hidden Identity System
&lt;/h3&gt;

&lt;p&gt;This is the single most important concept in CDK that most engineers never fully grasp.&lt;/p&gt;

&lt;p&gt;Every resource in a CloudFormation template has a &lt;strong&gt;Logical ID&lt;/strong&gt;. It looks something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UsersTableA1B2C3D4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ID is &lt;strong&gt;generated from the construct's path in the tree&lt;/strong&gt; plus a hash. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path: App/ProdStack/UsersTable
Logical ID: UsersTableA1B2C3D4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudFormation uses this Logical ID as the &lt;strong&gt;primary key&lt;/strong&gt; for tracking resources. It's how CloudFormation knows that the &lt;code&gt;UsersTable&lt;/code&gt; in today's deployment is the &lt;em&gt;same&lt;/em&gt; &lt;code&gt;UsersTable&lt;/code&gt; from yesterday's deployment.&lt;/p&gt;
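&lt;p&gt;An illustrative sketch of the principle (this is &lt;em&gt;not&lt;/em&gt; CDK's exact algorithm, but like CDK it derives the ID from the construct path plus a hash of that path):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Toy version of path-based Logical IDs: concatenated path components
// plus a short hash of the full path. The takeaway: the PATH is the
// identity, so moving a construct changes the ID of an unchanged resource.
function logicalIdSketch(path: string): string {
  const hash = createHash("md5").update(path).digest("hex").slice(0, 8).toUpperCase();
  const name = path.split("/").slice(1).join("").replace(/[^A-Za-z0-9]/g, "");
  return name + hash;
}

console.log(logicalIdSketch("App/ProdStack/UsersTable"));
console.log(logicalIdSketch("App/ProdStack/Storage/UsersTable")); // different ID!
```

&lt;p&gt;Same table, same properties, different path: CloudFormation sees a delete plus a create.&lt;/p&gt;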

&lt;h3&gt;
  
  
  The Refactoring Trap
&lt;/h3&gt;

&lt;p&gt;Now watch what happens when an engineer "cleans up" the code by moving the table into a nested construct:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Table is directly in the stack&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UsersTable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;partitionKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path: App/ProdStack/UsersTable
Logical ID: UsersTableA1B2C3D4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After the refactor:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Table is now inside a "Storage" construct&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Storage&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UsersTable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;partitionKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path: App/ProdStack/Storage/UsersTable
Logical ID: StorageUsersTableE5F6G7H8  ← DIFFERENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema didn't change. The table configuration didn't change. But the &lt;strong&gt;Logical ID changed&lt;/strong&gt; because the construct path changed.&lt;/p&gt;

&lt;p&gt;CloudFormation receives the new template and sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A resource with Logical ID &lt;code&gt;UsersTableA1B2C3D4&lt;/code&gt; — &lt;strong&gt;no longer present&lt;/strong&gt; → delete it&lt;/li&gt;
&lt;li&gt;A resource with Logical ID &lt;code&gt;StorageUsersTableE5F6G7H8&lt;/code&gt; — &lt;strong&gt;new&lt;/strong&gt; → create it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's a delete and recreate. Your data is gone.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is exactly what happened in our opening story. A "cleanup refactor" changed the construct tree, which changed Logical IDs, which CloudFormation interpreted as resource replacement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for your team:&lt;/strong&gt; Your engineers need to understand that infrastructure code is not like application code. In application code, moving a function between files changes nothing about runtime behavior. In CDK, moving a construct between parent constructs changes the resource's &lt;strong&gt;identity&lt;/strong&gt;. Refactoring infrastructure requires a fundamentally different discipline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Context Caching — The Silent Divergence
&lt;/h2&gt;

&lt;p&gt;There's a common misconception I encounter repeatedly: teams believe that &lt;code&gt;cdk.context.json&lt;/code&gt; has something to do with drift detection. It doesn't. But what it &lt;em&gt;does&lt;/em&gt; do is equally dangerous if misunderstood.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Context Works
&lt;/h3&gt;

&lt;p&gt;Some CDK constructs need to query AWS during synthesis. The most common example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MainVpc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;vpcId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vpc-0123456789abcdef0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this runs during &lt;code&gt;cdk synth&lt;/code&gt;, CDK actually calls the AWS API to look up VPC details — availability zones, subnets, route tables. It then &lt;strong&gt;caches the result&lt;/strong&gt; in &lt;code&gt;cdk.context.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On subsequent runs, CDK reads from the cache instead of calling AWS again.&lt;/p&gt;
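&lt;p&gt;The cache is keyed by provider and query parameters. The exact key and field names vary across CDK versions and providers, but an entry for a VPC lookup looks roughly like this (illustrative values only):&lt;/p&gt;

```json
{
  "vpc-provider:account=123456789012:filter.vpc-id=vpc-0123456789abcdef0:region=us-east-1": {
    "vpcId": "vpc-0123456789abcdef0",
    "availabilityZones": ["us-east-1a", "us-east-1b"],
    "subnetGroups": [
      {
        "name": "Public",
        "type": "Public",
        "subnets": [
          { "subnetId": "subnet-111", "availabilityZone": "us-east-1a" }
        ]
      }
    ]
  }
}
```

&lt;p&gt;Everything your synthesized template says about that VPC comes from this file, not from AWS, until the entry is cleared.&lt;/p&gt;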

&lt;h3&gt;
  
  
  The Divergence Problem
&lt;/h3&gt;

&lt;p&gt;Here's where teams get burned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Developer A&lt;/strong&gt; runs &lt;code&gt;cdk synth&lt;/code&gt; locally. Context is cached with current VPC state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The VPC changes&lt;/strong&gt; — a new subnet is added, an AZ is modified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipeline&lt;/strong&gt; runs &lt;code&gt;cdk synth&lt;/code&gt; — but &lt;code&gt;cdk.context.json&lt;/code&gt; wasn't committed to git. CI performs a fresh lookup and gets &lt;em&gt;different&lt;/em&gt; VPC data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The template generated in CI differs from local.&lt;/strong&gt; Resources reference different subnets. Deployment behaves unexpectedly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The engineer stares at the diff and thinks: "I didn't change anything."&lt;/p&gt;

&lt;p&gt;They're right — they didn't. The &lt;em&gt;environment&lt;/em&gt; changed, and the lack of committed context allowed that change to silently propagate into the template.&lt;/p&gt;
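&lt;p&gt;The failure mode is easy to model. Here's a toy sketch (hypothetical names, not CDK's actual API) of a lookup cache: wherever the cache file exists, everyone synthesizes from the same value; wherever it's missing, each environment gets whatever the account looks like at that moment:&lt;/p&gt;

```typescript
// Toy model of context caching -- hypothetical names, not CDK's API.
// "cache" stands in for cdk.context.json; "lookup" stands in for a
// live AWS API call whose answer depends on current account state.
function cachedLookup(
  cache: { [key: string]: string },
  key: string,
  lookup: () => string
): string {
  if (cache[key] === undefined) {
    cache[key] = lookup(); // no cache entry: fresh, environment-dependent
  }
  return cache[key]; // cached: deterministic everywhere the file exists
}

// Developer A synthesized earlier, but cdk.context.json was never committed:
const localCache = { "vpc:MainVpc": "subnet-111" };
const ciCache: { [key: string]: string } = {};

// Meanwhile the VPC changed; a live lookup now returns different data:
const liveAccount = () => "subnet-999";

console.log(cachedLookup(localCache, "vpc:MainVpc", liveAccount)); // "subnet-111"
console.log(cachedLookup(ciCache, "vpc:MainVpc", liveAccount));    // "subnet-999"
```

&lt;p&gt;Same code, two different templates. Committing the cache file is what collapses the two branches back into one.&lt;/p&gt;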

&lt;h3&gt;
  
  
  What Context Is and Isn't
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context...&lt;/th&gt;
&lt;th&gt;Does&lt;/th&gt;
&lt;th&gt;Doesn't&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Affects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Template generation during synthesis&lt;/td&gt;
&lt;td&gt;Deployed resource state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relates to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lookup values cached locally&lt;/td&gt;
&lt;td&gt;CloudFormation drift detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deleting it&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forces fresh AWS lookups&lt;/td&gt;
&lt;td&gt;Fix or prevent stack drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Drift detection&lt;/strong&gt; — comparing what's actually deployed vs. what the template says — is handled entirely by CloudFormation. The context file has no role in that process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for your team:&lt;/strong&gt; Commit &lt;code&gt;cdk.context.json&lt;/code&gt; to version control. Treat it as part of your infrastructure definition. When the context file is committed, every developer and every CI pipeline synthesizes the same template from the same cached data. When you &lt;em&gt;want&lt;/em&gt; to pick up environment changes, delete the context file deliberately and re-synthesize — as a conscious decision, not an accident.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: Bootstrap — The Trust Chain Nobody Explains
&lt;/h2&gt;

&lt;p&gt;Every CDK tutorial tells you to run &lt;code&gt;cdk bootstrap&lt;/code&gt;. Almost none of them explain what it actually creates or why it matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Bootstrap Creates
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;cdk bootstrap&lt;/code&gt;, it deploys a CloudFormation stack (called &lt;code&gt;CDKToolkit&lt;/code&gt;) into your target account and region. This stack contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3 Bucket&lt;/strong&gt; — stores file assets (Lambda code bundles, zipped source directories, large CloudFormation templates)&lt;/li&gt;

&lt;li&gt;
&lt;strong&gt;ECR Repository&lt;/strong&gt; — stores Docker image assets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Role&lt;/strong&gt; — an IAM role that CDK assumes to initiate deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudFormation Execution Role&lt;/strong&gt; — the role CloudFormation assumes to create/modify resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Publishing Role&lt;/strong&gt; — for uploading assets to S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Publishing Role&lt;/strong&gt; — for pushing images to ECR&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Trust Chain
&lt;/h3&gt;

&lt;p&gt;Deployment flows through a specific chain of trust:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi02dset6g9ekaknbpp01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi02dset6g9ekaknbpp01.png" alt="CDK Bootstrap Trust Chain" width="595" height="590"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your credentials (local or CI)
        ↓ assumes
   Deploy Role
        ↓ passes to
   CloudFormation
        ↓ assumes
   Execution Role
        ↓ calls
   AWS Service APIs (EC2, S3, DynamoDB, etc.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each arrow is an &lt;strong&gt;IAM trust relationship&lt;/strong&gt;. If any link in this chain is misconfigured — a missing trust policy, an incorrect principal, an account ID mismatch — deployment fails. And the error messages are often cryptic enough to send engineers down the wrong debugging path for hours.&lt;/p&gt;
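&lt;p&gt;When you're debugging, it pays to check each link in isolation rather than rerunning the whole deploy and staring at the final error. As a toy illustration only (hypothetical data structures, not an AWS API), the exercise is: walk the chain and find the first role that doesn't trust its caller:&lt;/p&gt;

```typescript
// Toy trust-chain walker -- illustrative only. Real debugging means
// inspecting each role's IAM trust policy in the console or CLI.
interface Link {
  name: string;
  trusts: string[]; // principals this link's trust policy accepts
}

// Returns the name of the first link that does not trust its caller,
// or null if the whole chain is intact.
function firstBrokenLink(caller: string, chain: Link[]): string | null {
  let current = caller;
  for (const link of chain) {
    if (!link.trusts.includes(current)) return link.name;
    current = link.name;
  }
  return null;
}

const chain: Link[] = [
  { name: "DeployRole", trusts: ["ci-account"] },
  { name: "CloudFormation", trusts: ["DeployRole"] },
  { name: "ExecutionRole", trusts: ["CloudFormation"] },
];

console.log(firstBrokenLink("ci-account", chain)); // null -- chain intact
console.log(firstBrokenLink("dev-laptop", chain)); // "DeployRole" -- not trusted
```

&lt;p&gt;The second call is the classic symptom: a deploy that works from CI but fails from a laptop (or vice versa) almost always means the &lt;em&gt;first&lt;/em&gt; link rejected the caller, even if the error surfaces later in the chain.&lt;/p&gt;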

&lt;h3&gt;
  
  
  Why This Matters for Multi-Account Setups
&lt;/h3&gt;

&lt;p&gt;In production environments, most teams use multiple AWS accounts — development, staging, production, shared services. CDK's bootstrap model is designed for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each target account needs to be bootstrapped&lt;/li&gt;
&lt;li&gt;The bootstrap roles in each account must trust the &lt;strong&gt;deploying account&lt;/strong&gt; (often a CI/CD account)&lt;/li&gt;
&lt;li&gt;The execution role in each account determines what CloudFormation can actually create&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where security teams get involved — and rightfully so. The execution role in your production account determines the blast radius of a deployment. An overly permissive execution role means a bad template can create or modify &lt;em&gt;anything&lt;/em&gt; in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for your team:&lt;/strong&gt; Bootstrap is not a one-time setup command you run and forget. It's the &lt;strong&gt;security boundary&lt;/strong&gt; of your deployment pipeline. Review the execution role's permissions. Understand which accounts trust which. In mature organizations, the bootstrap template is customized to enforce least-privilege — restricting what CloudFormation can do, even if the CDK code asks for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: The Deploy Lifecycle — What Actually Happens
&lt;/h2&gt;

&lt;p&gt;Now that we understand all the components, let's trace the full lifecycle of &lt;code&gt;cdk deploy&lt;/code&gt; from start to finish.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mdb5gomy1sr8j513ys3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mdb5gomy1sr8j513ys3.png" alt="CDK Deploy Lifecycle" width="565" height="990"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Synthesis
&lt;/h3&gt;

&lt;p&gt;Your CDK app executes. The construct tree is built. Logical IDs are generated. A CloudFormation template is written to &lt;code&gt;cdk.out/&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Asset Upload
&lt;/h3&gt;

&lt;p&gt;If your stack includes file assets (Lambda code) or Docker images, CDK uploads them to the S3 bucket and ECR repository created during bootstrap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: ChangeSet Creation
&lt;/h3&gt;

&lt;p&gt;CDK submits the synthesized template to CloudFormation as a &lt;strong&gt;ChangeSet&lt;/strong&gt;. A ChangeSet is CloudFormation's way of previewing what will happen — it's a diff between the currently deployed template and the new one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: CloudFormation Diff Calculation
&lt;/h3&gt;

&lt;p&gt;CloudFormation compares the new template against its stored state. For each resource, it determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No change&lt;/strong&gt; — resource definition is identical, skip it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update&lt;/strong&gt; — a mutable property changed, update in-place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace&lt;/strong&gt; — an immutable property or Logical ID changed, delete and recreate&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Dependency Graph Execution
&lt;/h3&gt;

&lt;p&gt;CloudFormation doesn't execute changes randomly. It builds a dependency graph and processes resources in the correct order — creating dependencies before dependents, deleting dependents before dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: API Execution
&lt;/h3&gt;

&lt;p&gt;CloudFormation calls the actual AWS service APIs — &lt;code&gt;CreateTable&lt;/code&gt;, &lt;code&gt;PutBucketPolicy&lt;/code&gt;, &lt;code&gt;CreateFunction&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: State Update
&lt;/h3&gt;

&lt;p&gt;Once all changes are applied (or rolled back on failure), CloudFormation updates its internal state to reflect the new reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; CDK hands off after Step 3. Once the ChangeSet is submitted, its work is done; everything from Step 4 onward is CloudFormation operating independently. When you watch your terminal during &lt;code&gt;cdk deploy&lt;/code&gt;, CDK is merely &lt;em&gt;polling&lt;/em&gt; CloudFormation for status updates, not controlling the process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: Replacement Logic — Who Decides, and How
&lt;/h2&gt;

&lt;p&gt;This is the question I get asked most often: "Why did CloudFormation replace my resource?"&lt;/p&gt;

&lt;p&gt;The answer is never CDK. It's always CloudFormation, and it follows a specific decision tree:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 1: Logical ID Changed
&lt;/h3&gt;

&lt;p&gt;As we covered in Part 3, if the construct path changes, the Logical ID changes. CloudFormation interprets this as "old resource removed, new resource added." This is the most common cause of unintended replacements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 2: Immutable Property Changed
&lt;/h3&gt;

&lt;p&gt;Some resource properties can only be set at creation time. Changing them requires replacement. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DynamoDB &lt;strong&gt;partition key&lt;/strong&gt; or &lt;strong&gt;sort key&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;RDS &lt;strong&gt;engine type&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;EC2 &lt;strong&gt;instance type&lt;/strong&gt; in some configurations&lt;/li&gt;
&lt;li&gt;S3 &lt;strong&gt;bucket name&lt;/strong&gt; (if explicitly set)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CloudFormation knows which properties are immutable for each resource type. When one changes, replacement is the only option.&lt;/p&gt;
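&lt;p&gt;The decision can be sketched as a pure function over the old and new property sets, given which properties are create-only. This is a toy model; the real rules come from each resource type's CloudFormation specification:&lt;/p&gt;

```typescript
// Toy model of CloudFormation's per-resource decision. The "immutable"
// list comes from the resource spec (e.g. a DynamoDB partition key).
type Props = { [name: string]: string };

function changeAction(
  oldProps: Props,
  newProps: Props,
  immutable: string[]
): "no-change" | "update" | "replace" {
  const keys = Object.keys({ ...oldProps, ...newProps });
  const changed = keys.filter((k) => oldProps[k] !== newProps[k]);
  if (changed.length === 0) return "no-change";
  return changed.some((k) => immutable.includes(k)) ? "replace" : "update";
}

const immutable = ["partitionKey"];

console.log(changeAction(
  { partitionKey: "id", billingMode: "PROVISIONED" },
  { partitionKey: "id", billingMode: "PAY_PER_REQUEST" },
  immutable
)); // "update" -- mutable property changed in place

console.log(changeAction(
  { partitionKey: "id" },
  { partitionKey: "userId" },
  immutable
)); // "replace" -- create-only property changed
```

&lt;p&gt;Note what's absent from the function's inputs: your CDK source code. By the time this decision is made, only the templates exist.&lt;/p&gt;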

&lt;h3&gt;
  
  
  Reason 3: Resource Type Changed
&lt;/h3&gt;

&lt;p&gt;If you change a resource from one type to another (rare, but it happens during refactors), CloudFormation treats it as a deletion and creation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Protect Against Unintended Replacement
&lt;/h3&gt;

&lt;p&gt;Always review the ChangeSet before executing. CDK provides a built-in tool for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows you exactly what CloudFormation will do — including which resources will be &lt;strong&gt;replaced&lt;/strong&gt;. Make this a mandatory step in your deployment process. In CI/CD pipelines, generate the diff as a review artifact before applying changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8: The Questions Your Team Will Ask — Answered
&lt;/h2&gt;

&lt;p&gt;These are the questions that come up in every CDK engagement I've been part of. Having clear answers to these saves hours of debugging and prevents production incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Why did my resource get replaced when I didn't change anything?"
&lt;/h3&gt;

&lt;p&gt;You changed the construct path. The Logical ID shifted. CloudFormation interpreted this as a new resource. Check the ChangeSet — it will show the old and new Logical IDs. The fix: revert the path change, pin the original ID in code with &lt;code&gt;overrideLogicalId&lt;/code&gt; on the underlying Cfn resource, or migrate the resource using CloudFormation's resource import feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Does deleting cdk.context.json fix drift?"
&lt;/h3&gt;

&lt;p&gt;No. Drift detection compares &lt;em&gt;deployed&lt;/em&gt; resources against &lt;em&gt;CloudFormation's stored state&lt;/em&gt;. The context file only affects synthesis. Deleting it forces fresh lookups, which may change your &lt;em&gt;template&lt;/em&gt; — but it tells you nothing about drift. Use &lt;code&gt;aws cloudformation detect-stack-drift&lt;/code&gt; for that.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Why does the CI template differ from my local template?"
&lt;/h3&gt;

&lt;p&gt;Because context wasn't committed. Your local machine has cached lookup results. CI performed fresh lookups and got different data. Commit &lt;code&gt;cdk.context.json&lt;/code&gt;. If you deliberately want fresh lookups, delete the file and re-run &lt;code&gt;cdk synth&lt;/code&gt; locally, then commit the updated cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Why do I get permission errors during deployment?"
&lt;/h3&gt;

&lt;p&gt;The trust chain is broken. Remember: your credentials → Deploy Role → CloudFormation → Execution Role → AWS APIs. Check each link. Common issues: the deploy role doesn't trust your CI account, the execution role lacks permission for a specific service, or the bootstrap stack is out of date.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Why does refactoring break production?"
&lt;/h3&gt;

&lt;p&gt;Because infrastructure identity depends on construct path stability. In application code, moving a class between packages is a safe operation. In CDK, moving a construct between parents changes the Logical ID, which changes the resource identity. &lt;strong&gt;Infrastructure code requires architectural discipline that application code does not.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Production Playbook
&lt;/h2&gt;

&lt;p&gt;These five practices are what separate teams that deploy CDK with confidence from teams that deploy with crossed fingers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Separate Stateful and Stateless Stacks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NetworkStack      → VPCs, Subnets, NAT Gateways
DatabaseStack     → DynamoDB Tables, RDS Instances, ElastiCache
ApplicationStack  → Lambdas, API Gateways, ECS Services
MonitoringStack   → Alarms, Dashboards, SNS Topics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stateful resources (databases, storage) live in stacks that change &lt;strong&gt;rarely&lt;/strong&gt;. Stateless resources (compute, APIs) live in stacks that change &lt;strong&gt;frequently&lt;/strong&gt;. This separation limits the blast radius of any single deployment. Your database stack should be boring — deployed once, modified almost never.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Apply Removal Policies to Stateful Resources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UsersTable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;partitionKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RemovalPolicy.RETAIN&lt;/code&gt; tells CloudFormation: "Even if you think this resource should be deleted, don't." If a Logical ID change causes CloudFormation to attempt replacement, the old resource will be &lt;strong&gt;retained&lt;/strong&gt; instead of deleted. You'll have an orphaned resource to clean up, but you won't have data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply this to every DynamoDB table, every RDS instance, every S3 bucket that holds data you cannot afford to lose.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Avoid Volatile Lookups
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;fromLookup()&lt;/code&gt; call introduces non-determinism into your synthesis. The template you get depends on the state of your AWS account at synthesis time.&lt;/p&gt;

&lt;p&gt;Prefer explicit configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of this:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Vpc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;vpcId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vpc-abc123&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Consider this — explicit, deterministic:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromVpcAttributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Vpc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;vpcId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vpc-abc123&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;availabilityZones&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;us-east-1a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;us-east-1b&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;publicSubnetIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subnet-111&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subnet-222&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deterministic synthesis means the same code always produces the same template, regardless of when or where it runs. That's a property worth protecting.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Refactor Using a Migration Strategy
&lt;/h3&gt;

&lt;p&gt;Never restructure constructs and deploy in one step. Use a phased approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1:&lt;/strong&gt; Add the new construct alongside the old one. Deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2:&lt;/strong&gt; Migrate data or traffic from old to new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3:&lt;/strong&gt; Switch references to point to the new resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4:&lt;/strong&gt; Remove the old construct (with &lt;code&gt;RETAIN&lt;/code&gt; policy, so the underlying resource persists until you manually clean up).&lt;/p&gt;

&lt;p&gt;This is slower than a single refactor-and-deploy. It's also the only approach that doesn't risk data loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Review ChangeSets — Always
&lt;/h3&gt;

&lt;p&gt;Make this a non-negotiable rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before every production deployment:&lt;/span&gt;
cdk diff

&lt;span class="c"&gt;# In CI/CD pipelines:&lt;/span&gt;
cdk deploy &lt;span class="nt"&gt;--require-approval&lt;/span&gt; broadening
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No engineer should deploy to production without reading the ChangeSet. No CI pipeline should apply changes without human approval for anything that modifies IAM or deletes resources.&lt;/p&gt;

&lt;p&gt;The five minutes spent reviewing a ChangeSet can save you the 11 hours it takes to restore a database from backup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Lifecycle — At a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv53lznwyl208rdbpajqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv53lznwyl208rdbpajqq.png" alt="Refactoring Danger Chain" width="575" height="630"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CDK Code Written]
       ↓
[cdk synth runs your program]
       ↓
[Construct Tree built in memory]
       ↓
[Logical IDs generated from construct paths]
       ↓
[CloudFormation template written to cdk.out/]
       ↓
[Assets uploaded to S3/ECR via bootstrap roles]
       ↓
[ChangeSet submitted to CloudFormation]
       ↓
[CloudFormation diffs old vs new template]
       ↓
[Update / Replace / Delete decided per resource]
       ↓
[AWS APIs called in dependency order]
       ↓
[Stack state updated — deployment complete]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CDK is a compiler.&lt;/strong&gt; It produces CloudFormation templates. It does not manage infrastructure directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical IDs are resource identity.&lt;/strong&gt; They're derived from construct paths. Change the path, change the identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring is not free.&lt;/strong&gt; Moving constructs is an infrastructure operation, not a code cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context caching affects templates, not drift.&lt;/strong&gt; Commit &lt;code&gt;cdk.context.json&lt;/code&gt; to version control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrap is your security boundary.&lt;/strong&gt; The execution role determines what CloudFormation can do in each account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudFormation decides replacement, not CDK.&lt;/strong&gt; Immutable property changes and Logical ID changes trigger replacement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of stateful and stateless is non-negotiable.&lt;/strong&gt; Your database stack should be the most boring stack in your codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic synthesis prevents surprises.&lt;/strong&gt; Same code, same template, every time.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This guide covers the foundation — the mental model every team needs before they can use CDK safely at scale. But there's more ground to cover: multi-account deployment strategies, custom constructs, pipeline architecture, and testing infrastructure code.&lt;/p&gt;

&lt;p&gt;I write about cloud architecture, AWS patterns, and the hard-won lessons from building production infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If this guide saved you from a future production incident — or explained something you've been struggling with — follow me on LinkedIn. I publish in-depth guides like this regularly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me know in the comments: &lt;em&gt;What's the most painful CDK lesson your team has learned?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
