<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhishek Gupta</title>
    <description>The latest articles on Forem by Abhishek Gupta (@abhishek_gupta_pinpo).</description>
    <link>https://forem.com/abhishek_gupta_pinpo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888783%2Fddd00119-fad5-440b-8a81-734215d9c447.png</url>
      <title>Forem: Abhishek Gupta</title>
      <link>https://forem.com/abhishek_gupta_pinpo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abhishek_gupta_pinpo"/>
    <language>en</language>
    <item>
      <title>DynamoDB vs RDS at 10K, 100K, and 1M RPS: a pre-deployment simulation comparison</title>
      <dc:creator>Abhishek Gupta</dc:creator>
      <pubDate>Thu, 23 Apr 2026 23:41:00 +0000</pubDate>
      <link>https://forem.com/abhishek_gupta_pinpo/dynamodb-vs-rds-at-10k-100k-and-1m-rps-a-pre-deployment-simulation-comparison-3eco</link>
      <guid>https://forem.com/abhishek_gupta_pinpo/dynamodb-vs-rds-at-10k-100k-and-1m-rps-a-pre-deployment-simulation-comparison-3eco</guid>
      <description>&lt;p&gt;I have made this mistake exactly once. About three years into my AWS career, I inherited a Lambda-based API with DynamoDB on the backend and was tasked with migrating it to Aurora PostgreSQL - the data model had grown relational and the team wanted proper foreign key constraints.&lt;/p&gt;

&lt;p&gt;The migration went smoothly in UAT. We promoted to production on a Tuesday night. By Thursday morning, Lambda concurrency was exhausted, Aurora was throwing connection pool errors, and I was sitting in a war room with the CTO trying to explain why a database migration - not a code change - had caused a full API outage at 80K RPS.&lt;/p&gt;

&lt;p&gt;I had never tested what the connection behaviour would look like at production load. I had assumed it would be fine.&lt;/p&gt;

&lt;p&gt;It was not fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You cannot assume your way through database selection at scale.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The methodology&lt;/h2&gt;

&lt;p&gt;I ran a structured simulation comparison using pinpole's pre-deployment canvas. Three separate canvases: one for DynamoDB, one for RDS/Aurora, and a third for Aurora Serverless v2 as a middle-ground comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canvas topology (all three configurations):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route 53 → CloudFront → API Gateway → Lambda → [DynamoDB | RDS + Proxy]
WAF (in front of CloudFront) · SQS (write decoupling path) · ElastiCache (RDS scenario)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on RDS Proxy:&lt;/strong&gt; If you're running Lambda against RDS at any significant scale without RDS Proxy managing the connection pool, you will exhaust database connections under burst load. This is essentially the architecture bug that caused my Tuesday night disaster. RDS Proxy is always present in the RDS/Aurora configurations below.&lt;/p&gt;
&lt;/blockquote&gt;
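&lt;p&gt;The arithmetic behind that failure mode is worth making explicit. A rough sketch in Python - the numbers are illustrative, not from the incident, and connections-per-environment behaviour varies by driver:&lt;/p&gt;

```python
# Rough sketch of why Lambda + RDS without a proxy exhausts connections.
# Each concurrent Lambda environment typically holds its own database
# connection, and Postgres max_connections is a hard ceiling.
# All numbers below are illustrative assumptions.

def concurrent_executions(rps, avg_duration_s):
    """Little's law: concurrency is roughly arrival rate times duration."""
    return rps * avg_duration_s

def pool_exhausted(rps, avg_duration_s, max_connections):
    return concurrent_executions(rps, avg_duration_s) > max_connections

# 80K RPS at a 50ms average duration implies ~4,000 concurrent
# environments, each wanting its own connection - far beyond a typical
# Postgres max_connections in the low thousands.
print(concurrent_executions(80_000, 0.05))  # 4000.0
print(pool_exhausted(80_000, 0.05, 2000))   # True
```

&lt;p&gt;RDS Proxy multiplexes those thousands of client connections over a small shared pool of database connections, which is why it is treated as non-negotiable in the RDS configurations here.&lt;/p&gt;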

&lt;p&gt;All configurations at production-realistic specs: Lambda at 1,769 MB (1 vCPU equivalent), 30-second timeout; API Gateway at 10K RPS burst limit; DynamoDB in both on-demand and provisioned modes; RDS PostgreSQL on db.r6g instances; Aurora Serverless v2 with ACU limits appropriate to each tier.&lt;/p&gt;

&lt;p&gt;I ran four traffic patterns at each RPS level: &lt;strong&gt;Constant&lt;/strong&gt; (steady baseline), &lt;strong&gt;Ramp&lt;/strong&gt; (linear growth to peak over 10 minutes), &lt;strong&gt;Spike&lt;/strong&gt; (sudden 10× burst), and &lt;strong&gt;Wave&lt;/strong&gt; (oscillating between 30% and 100% of peak). Each run saved to execution history for comparison.&lt;/p&gt;
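&lt;p&gt;For reproducibility, the four patterns can be described as simple RPS-over-time functions. This is my sketch of the shapes as defined above - the simulator's exact curves are an assumption on my part:&lt;/p&gt;

```python
import math

# The four traffic patterns as RPS-over-time generators. Shapes follow
# the definitions in the text; exact curve parameters are assumptions.

def constant(peak, t, duration):
    return peak

def ramp(peak, t, duration):
    """Linear growth from zero to peak over the run."""
    return peak * min(1.0, t / duration)

def spike(peak, t, duration, baseline_frac=0.1):
    """Steady baseline, then a sudden 10x burst halfway through."""
    if t >= duration / 2:
        return peak
    return peak * baseline_frac

def wave(peak, t, duration, cycles=4):
    """Oscillates between 30% and 100% of peak."""
    phase = math.sin(2 * math.pi * cycles * t / duration)
    return peak * (0.65 + 0.35 * phase)
```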

&lt;h2&gt;Results at 10K RPS&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;th&gt;Monthly Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB on-demand&lt;/td&gt;
&lt;td&gt;2ms&lt;/td&gt;
&lt;td&gt;7ms&lt;/td&gt;
&lt;td&gt;~$8,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB provisioned (9K RCU / 1K WCU)&lt;/td&gt;
&lt;td&gt;2ms&lt;/td&gt;
&lt;td&gt;7ms&lt;/td&gt;
&lt;td&gt;~$1,150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS PostgreSQL db.r6g.2xlarge + Proxy&lt;/td&gt;
&lt;td&gt;3ms&lt;/td&gt;
&lt;td&gt;11ms&lt;/td&gt;
&lt;td&gt;~$870&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora MySQL db.r6g.2xlarge + Proxy&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;13ms&lt;/td&gt;
&lt;td&gt;~$980&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora Serverless v2 (avg 4 ACU)&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;15ms&lt;/td&gt;
&lt;td&gt;~$720&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest surprise at 10K RPS: &lt;strong&gt;DynamoDB on-demand is nearly 10× more expensive than a well-configured RDS instance for sustained, predictable traffic.&lt;/strong&gt; DynamoDB's reputation as the "serverless database" leads engineers to assume it is cheap at modest scales. For a product with consistent diurnal load patterns, provisioned capacity is rarely the wrong answer, and on-demand is rarely the right one.&lt;/p&gt;
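&lt;p&gt;The gap is easy to sanity-check with back-of-envelope arithmetic. A sketch using historical us-east-1 list prices - treat the rates as assumptions and verify them against the current pricing page:&lt;/p&gt;

```python
# Back-of-envelope DynamoDB cost check for ~10K RPS of sustained traffic.
# List prices are assumptions based on historical us-east-1 pricing;
# verify against the current pricing page before relying on them.

HOURS_PER_MONTH = 730

def on_demand_monthly(read_rps, write_rps,
                      price_per_m_reads=0.25, price_per_m_writes=1.25):
    reads = read_rps * 3600 * HOURS_PER_MONTH
    writes = write_rps * 3600 * HOURS_PER_MONTH
    return (reads / 1e6) * price_per_m_reads + (writes / 1e6) * price_per_m_writes

def provisioned_monthly(rcu, wcu,
                        rcu_hourly=0.00013, wcu_hourly=0.00065):
    return (rcu * rcu_hourly + wcu * wcu_hourly) * HOURS_PER_MONTH

# ~9K read RPS / ~1K write RPS, mirroring the provisioned config above
print(round(on_demand_monthly(9_000, 1_000)))    # 9198 - same order as the table
print(round(provisioned_monthly(9_000, 1_000)))  # 1329 - roughly 7x cheaper
```

&lt;p&gt;The exact figures differ from the simulated estimates (which account for item sizes and request mix), but the order-of-magnitude gap between on-demand and provisioned at sustained load falls straight out of the pricing model.&lt;/p&gt;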

&lt;p&gt;Under the Spike pattern (10K → 100K instantaneous), DynamoDB on-demand absorbed the spike without configuration changes. RDS PostgreSQL with a fixed instance showed connection pool pressure - p99 climbed to 38ms for about 90 seconds.&lt;/p&gt;

&lt;h2&gt;Results at 100K RPS&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;th&gt;Monthly Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB provisioned (auto-scaling)&lt;/td&gt;
&lt;td&gt;2ms&lt;/td&gt;
&lt;td&gt;9ms&lt;/td&gt;
&lt;td&gt;~$4,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS PostgreSQL db.r6g.8xlarge + Proxy&lt;/td&gt;
&lt;td&gt;3ms&lt;/td&gt;
&lt;td&gt;14ms&lt;/td&gt;
&lt;td&gt;~$3,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora MySQL db.r6g.8xlarge + Proxy&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;16ms&lt;/td&gt;
&lt;td&gt;~$3,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora Serverless v2 (avg 18 ACU)&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;19ms&lt;/td&gt;
&lt;td&gt;~$2,900&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100K RPS, DynamoDB on-demand becomes structurally expensive - which is why it is absent from this table. The per-request pricing model that looks benign at 10K RPS scales linearly, so the on-demand bill simply multiplies by ten. Provisioned DynamoDB with auto-scaling changes the picture significantly. RDS remains cost-competitive because the fixed instance overhead is now amortised across far more requests.&lt;/p&gt;

&lt;p&gt;The Spike pattern at this tier produced the most diagnostic information. DynamoDB auto-scaling took 3-7 minutes to fully respond - during which pinpole flagged elevated p99 and recommended more aggressive scale-out settings. This is behaviour you need to know before deployment.&lt;/p&gt;
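&lt;p&gt;That lag is inherent to target-tracking auto-scaling: CloudWatch alarms must breach for consecutive periods before capacity steps toward consumed divided by target utilisation. A simplified model of the mechanism - the alarm windows and step behaviour here are approximations, not DynamoDB's exact algorithm:&lt;/p&gt;

```python
# Simplified model of target-tracking scale-out under a spike. Real
# CloudWatch alarm windows and step sizes differ; this only shows why
# capacity lags demand by minutes rather than seconds.

def scale_out(provisioned, consumed_per_min, target=0.70, alarm_minutes=2):
    """Yields provisioned capacity minute by minute."""
    breaches = 0
    for consumed in consumed_per_min:
        if consumed > provisioned * target:
            breaches += 1
        else:
            breaches = 0
        if breaches >= alarm_minutes:
            # Alarm fires: scale toward consumed / target utilisation
            provisioned = max(provisioned, consumed / target)
            breaches = 0
        yield provisioned

demand = [1_000] * 2 + [10_000] * 8   # 10x spike arriving at minute 2
capacity = list(scale_out(1_500, demand))
# Capacity holds at 1,500 through the alarm window, then jumps - and
# during that window the table throttles and p99 climbs, exactly the
# behaviour flagged at peak.
```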

&lt;h2&gt;Results at 1M RPS&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;th&gt;Monthly Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB provisioned (high WCU, DAX)&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;~$28,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS PostgreSQL read replicas + Proxy&lt;/td&gt;
&lt;td&gt;4ms&lt;/td&gt;
&lt;td&gt;22ms&lt;/td&gt;
&lt;td&gt;~$18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora Global + Proxy&lt;/td&gt;
&lt;td&gt;3ms&lt;/td&gt;
&lt;td&gt;15ms&lt;/td&gt;
&lt;td&gt;~$24,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 1M RPS, DynamoDB with provisioned capacity and DAX caching is competitive on cost and substantially superior on latency. The operational complexity of the RDS path has increased materially - you now need read replicas, connection pooling strategy, and careful instance sizing - while the cost gap has narrowed.&lt;/p&gt;

&lt;h2&gt;The actual decision framework&lt;/h2&gt;

&lt;p&gt;Database selection at scale cannot be made responsibly without running the numbers at your anticipated traffic volume: the right answer at 10K RPS is sometimes the wrong answer at 100K RPS. The three factors that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Access pattern complexity.&lt;/strong&gt; If your queries require joins, complex filtering, or ad-hoc analytical access, RDS is the correct starting point regardless of the cost model. DynamoDB's cost advantage evaporates if you are engineering around its access pattern constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Traffic predictability.&lt;/strong&gt; Predictable diurnal load → provisioned DynamoDB or fixed RDS instance. Genuinely unpredictable or bursty traffic → DynamoDB on-demand or Aurora Serverless v2. Do not pay on-demand pricing for predictable traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scale trajectory.&lt;/strong&gt; If you are at 10K RPS today and heading for 1M RPS in 18 months, the database you choose now needs to perform well at that scale. Running the 1M RPS simulation before making the 10K RPS decision is an hour of canvas work, not a Thursday morning war room.&lt;/p&gt;
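&lt;p&gt;The three factors reduce to a small decision table. This is my own encoding of the framework - not an official rule, and always subject to a simulation run at the target scale rather than today's:&lt;/p&gt;

```python
# The access-pattern and predictability factors above, encoded as a
# lookup. A simplification of the framework in the text, not a rule:
# the scale-trajectory factor still requires running the simulation.

def pick_database(relational_queries: bool, predictable_traffic: bool) -> str:
    if relational_queries:
        # Factor 1 dominates: do not fight DynamoDB's access model.
        if predictable_traffic:
            return "RDS/Aurora + Proxy"
        return "Aurora Serverless v2 + Proxy"
    # Key-value access patterns
    if predictable_traffic:
        return "DynamoDB provisioned (auto-scaling)"
    return "DynamoDB on-demand"

print(pick_database(relational_queries=False, predictable_traffic=True))
# DynamoDB provisioned (auto-scaling)
```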

&lt;p&gt;&lt;em&gt;Full comparison with complete per-node metrics at each tier, Aurora Serverless v2 deep-dive, and the access pattern decision matrix →&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>rds</category>
      <category>cloudarchitecture</category>
    </item>
    <item>
      <title>How to model Lambda cold-start behaviour under spike traffic before you deploy</title>
      <dc:creator>Abhishek Gupta</dc:creator>
      <pubDate>Thu, 23 Apr 2026 01:41:00 +0000</pubDate>
      <link>https://forem.com/abhishek_gupta_pinpo/how-to-model-lambda-cold-start-behaviour-under-spike-traffic-before-you-deploy-1g9c</link>
      <guid>https://forem.com/abhishek_gupta_pinpo/how-to-model-lambda-cold-start-behaviour-under-spike-traffic-before-you-deploy-1g9c</guid>
      <description>&lt;p&gt;There is a class of AWS incident I have started calling the "everything looked fine in testing" failure.&lt;/p&gt;

&lt;p&gt;The pattern is consistent. You design a serverless API. Lambda function with sensible defaults, wired through API Gateway, pointing at DynamoDB. You test it in dev throughout the week. Latency is acceptable. Costs track to plan. Then a campaign drops, or a new enterprise customer brings their three thousand users on day one, and your traffic goes from 300 RPS to 3,000 RPS in under a minute.&lt;/p&gt;

&lt;p&gt;Lambda, which has never had to spin up more than a dozen concurrent environments at once, is now being asked to handle a hundred. Cold starts accumulate. p99 latency goes from 80ms to 2,400ms. API Gateway timeout windows close on in-flight requests. Customers see errors. The Slack channel fires. You spend a Saturday explaining to your CTO why the architecture that "passed all our tests" just fell over under a load it should have anticipated.&lt;/p&gt;

&lt;p&gt;I have been in this situation. More than once.&lt;/p&gt;

&lt;p&gt;The second time is when I stopped treating load testing as a post-deployment activity.&lt;/p&gt;

&lt;h2&gt;The cold start problem, precisely stated&lt;/h2&gt;

&lt;p&gt;Lambda's execution model does not maintain persistent servers. When an invocation arrives and no warm execution environment exists, Lambda must provision one: select a host, initialise the execution environment, load the runtime, execute your module-level initialisation code.&lt;/p&gt;

&lt;p&gt;That sequence is the cold start. And its duration varies along several dimensions that are non-obvious:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime matters.&lt;/strong&gt; Node.js 20 with V8: typically under 100ms for lightweight functions. Python: comparable. Java with JVM class-loading: 300ms to well over a second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory allocation matters.&lt;/strong&gt; Lambda allocates CPU proportionally to memory. A function at 1,024 MB gets significantly more CPU than one at 128 MB. Counterintuitively, increasing memory can reduce cold start latency and total cost simultaneously - the faster initialisation more than compensates for the higher per-GB-second rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The spike dynamic is what kills you.&lt;/strong&gt; Cold starts at steady state are manageable. The problem is spike behaviour. Under rapid traffic increase, Lambda must provision new environments in parallel. You can hit dozens or hundreds of concurrent cold starts at the exact moment your users' experience is most consequential. A steady-state load test does not expose this.&lt;/p&gt;
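&lt;p&gt;The memory/cost interaction can be made concrete with a toy model, assuming the CPU-bound part of execution shrinks roughly inversely with allocated memory down to a floor. Real functions need profiling - AWS Lambda Power Tuning exists for exactly this - and the numbers here are illustrative:&lt;/p&gt;

```python
# Toy model of the Lambda memory/cost trade-off. Assumes duration
# shrinks inversely with memory (CPU scales with memory) down to a
# floor; real workloads need profiling. Rate is the historical x86
# list price per GB-second - verify against current pricing.

GB_SECOND_RATE = 0.0000166667

def invocation_cost(memory_mb, duration_ms):
    return (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND_RATE

def modeled_duration(memory_mb, work_ms_at_128=800, floor_ms=60):
    """CPU-bound work shrinks as memory (hence CPU) grows."""
    return max(floor_ms, work_ms_at_128 * 128 / memory_mb)

for mem in (128, 512, 1024):
    d = modeled_duration(mem)
    print(mem, round(d), f"{invocation_cost(mem, d):.9f}")
# Under these assumptions, 1024 MB runs 8x faster than 128 MB at the
# same per-invocation cost - the counterintuitive result above.
```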

&lt;h2&gt;The pre-deployment simulation model&lt;/h2&gt;

&lt;p&gt;For the simulation, I built the architecture on a pinpole canvas:&lt;/p&gt;

&lt;p&gt;Route 53 → CloudFront → API Gateway → Lambda (Node.js 20, 512MB) → DynamoDB&lt;/p&gt;

&lt;p&gt;Lambda was configured without provisioned concurrency - the common default for a new service with uncertain traffic. Reserved concurrency was set explicitly rather than left at the account-shared default.&lt;/p&gt;

&lt;p&gt;The Lambda config panel exposes the parameters that directly affect cold-start modelling:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Cold Start Relevance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node.js 20.x&lt;/td&gt;
&lt;td&gt;High - directly factors into latency model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;512 MB&lt;/td&gt;
&lt;td&gt;High - CPU allocation, init speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reserved concurrency&lt;/td&gt;
&lt;td&gt;Set explicitly&lt;/td&gt;
&lt;td&gt;Critical - defines throttle ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provisioned concurrency&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;The variable under test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I ran a Spike traffic pattern: 300 RPS baseline → 3,000 RPS over 60 seconds. The concurrency graph showed cold-start accumulation in real time. At peak: 90 concurrent cold starts, 2,400ms p99 latency.&lt;/p&gt;

&lt;h2&gt;What the simulation surfaced that a load test could not&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The burst scaling limit.&lt;/strong&gt; Lambda's initial burst quota is 500–3,000 concurrent executions depending on region, then 500 new environments per minute thereafter. This is not visible in the Lambda console until you hit it. The simulation reflects these constraints - the concurrency graph under spike traffic is not a smooth ramp. It shows the actual burst behaviour, including the plateau and the recovery slope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Timeout alignment.&lt;/strong&gt; The simulation flagged a configuration where API Gateway's integration timeout and Lambda's execution timeout were both set to 29 seconds. Under concurrency pressure, invocations that queue before executing can consume their window before execution even begins. Surface this in a canvas session: costs nothing. Discover it in a 2 AM incident: costs considerably more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The provisioned concurrency trade-off, quantified.&lt;/strong&gt; I accepted the recommendation to enable provisioned concurrency, reran the simulation, and compared in the execution history view. p99 at peak: 80ms. The cost of provisioned concurrency was visible in the live estimate alongside the latency improvement. The trade-off was explicit and quantified before any IaC was written.&lt;/p&gt;

&lt;h2&gt;The reproducibility argument&lt;/h2&gt;

&lt;p&gt;The result I value most is not the simulation output itself - it is that the output is reproducible and shareable. When I shared this analysis with my team, I shared the simulation run: the exact canvas configuration, the traffic pattern, the concurrency graph, the before-and-after comparison. Not an assertion about expected behaviour. A versioned record of what the model showed.&lt;/p&gt;

&lt;p&gt;That is a materially different quality of architectural evidence.&lt;/p&gt;

&lt;p&gt;Full post with complete simulation methodology, burst scaling model details, provisioned concurrency cost analysis, and the pre-deployment Lambda checklist →&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>cloudarchitecture</category>
    </item>
    <item>
      <title>FinOps at design time: I found $3,840/month in avoidable spend before writing a line of Terraform</title>
      <dc:creator>Abhishek Gupta</dc:creator>
      <pubDate>Mon, 20 Apr 2026 10:51:00 +0000</pubDate>
      <link>https://forem.com/abhishek_gupta_pinpo/finops-at-design-time-i-found-3840month-in-avoidable-spend-before-writing-a-line-of-terraform-oip</link>
      <guid>https://forem.com/abhishek_gupta_pinpo/finops-at-design-time-i-found-3840month-in-avoidable-spend-before-writing-a-line-of-terraform-oip</guid>
      <description>&lt;p&gt;FinOps is almost entirely retrospective. AWS Cost Explorer tells you what happened last billing cycle. Trusted Advisor tells you which resources are underutilised right now. Cost anomaly alerts fire after the anomaly has already run for hours.&lt;/p&gt;

&lt;p&gt;Every tool in the standard FinOps stack analyses infrastructure that already exists. Which means by the time any of them are useful, the structural decisions that determine 80% of your architecture's lifetime cost have already been made, deployed, and are now expensive to reverse.&lt;/p&gt;

&lt;p&gt;I have been an AWS solutions architect for nine years. The pattern is consistent, and I have been complicit in it: design the architecture, write the IaC, deploy, and then discover the cost. The Pricing Calculator gives you a static estimate that assumes steady-state traffic and correct configuration. Neither assumption holds under a real workload.&lt;/p&gt;

&lt;p&gt;This post is about a session where I broke that pattern - and caught $3,840 per month in avoidable spend before a single resource was provisioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Event processing pipeline for a Series B SaaS product. Customer activity events ingested via API, processed asynchronously, stored for downstream analytics. Expected baseline: 1,200 RPS, with a 6× spike on campaign days.&lt;/p&gt;

&lt;p&gt;Canvas topology in pinpole:&lt;/p&gt;

&lt;p&gt;Route 53 → API Gateway → Lambda (ingest) → SQS → Lambda (processor) → DynamoDB&lt;/p&gt;

&lt;p&gt;Lambda configured at 512 MB, reserved concurrency 200. DynamoDB in on-demand capacity mode. The AWS Pricing Calculator estimate at steady-state baseline: ~$4,100/month.&lt;/p&gt;

&lt;p&gt;Under a Constant simulation at 1,200 RPS, everything looked healthy. Cost settled at $4,230/month - close to the Pricing Calculator number, which felt like a good sign.&lt;/p&gt;

&lt;p&gt;My old workflow would have stopped there. Steady state is fine, cost is in range, proceed to deploy. pinpole's workflow does not stop there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 1: DynamoDB on-demand at spike load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I ran a Spike pattern at 7,200 RPS - the 6× campaign day load. The AI recommendations panel updated within seconds.&lt;/p&gt;

&lt;p&gt;The finding: DynamoDB on-demand at 7,200 RPS ingest, with 1.4× write amplification to a secondary index, was going to produce approximately $2,890/month in DynamoDB write costs alone on campaign days. Provisioned capacity with auto-scaling - minimum 1,500 WCU, maximum 12,000 WCU, target utilisation 70% - would bring that to approximately $740/month.&lt;/p&gt;

&lt;p&gt;The Pricing Calculator estimate had modelled DynamoDB at steady-state write volume. It had not accounted for the spike multiplier. The difference: $2,150/month from one configuration decision.&lt;/p&gt;
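&lt;p&gt;Those auto-scaling bounds can be sanity-checked with the target-tracking formula (desired capacity = consumed capacity / target utilisation). A quick check, assuming standard writes of up to 1 KB:&lt;/p&gt;

```python
# Sanity check on the recommended auto-scaling bounds. Assumes standard
# writes of up to 1 KB (1 WCU each); larger items multiply consumption.

def desired_wcu(write_rps, amplification=1.4, target_util=0.70,
                min_wcu=1_500, max_wcu=12_000):
    consumed = write_rps * amplification   # WCU actually consumed
    desired = consumed / target_util       # target-tracking goal
    return max(min_wcu, min(max_wcu, desired))

print(round(desired_wcu(1_200)))   # 2400 - baseline sits between the bounds
print(round(desired_wcu(7_200)))   # 12000 - campaign spike pins the ceiling
```

&lt;p&gt;At the ceiling the table runs at roughly 84% utilisation - above the 70% target but below capacity - which is tolerable for a short campaign spike and is part of why the provisioned figure comes in so far under on-demand.&lt;/p&gt;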

&lt;p&gt;&lt;strong&gt;Finding 2: Lambda memory allocation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI recommendation engine flagged that both Lambda functions at 512 MB were likely operating in a region of the memory/cost curve where increasing memory allocation reduces total compute cost despite the higher per-GB-second rate. The reason: execution duration drops non-linearly as CPU increases, because Lambda allocates CPU proportionally to memory.&lt;/p&gt;

&lt;p&gt;I accepted the recommendation to raise memory to 1,024 MB and reran the simulation. Projected Lambda cost dropped. The configuration that performs better under load also costs less to run - a counterintuitive result that does not surface in any static calculator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding 3: No distribution layer in front of API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Under spike load, API Gateway was absorbing the full request volume directly. Adding CloudFront to the canvas and rerunning showed that cacheable responses no longer hit the origin - API Gateway RPS at the ingest layer dropped meaningfully at peak, and the monthly API Gateway cost reduction offset the CloudFront cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (campaign day)&lt;/td&gt;
&lt;td&gt;$2,890/mo&lt;/td&gt;
&lt;td&gt;$740/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (both functions)&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;Reduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway + CloudFront&lt;/td&gt;
&lt;td&gt;$X&lt;/td&gt;
&lt;td&gt;$X − delta&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total identified saving&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,840/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three findings identified before a deployment pipeline was touched. The post-deployment validation on the optimised configuration came in at $30 under the simulation projection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The broader point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dollar figure matters less than the mechanism. These are not obscure optimisations. DynamoDB capacity mode, Lambda memory right-sizing, and distribution layer decisions exist in almost every event-driven AWS architecture. They are routinely not caught until the first billing cycle - not because engineers are negligent, but because the tools required to catch them have historically required deployed infrastructure.&lt;/p&gt;

&lt;p&gt;That constraint is removable. The feedback loop that FinOps typically operates in - deploy, observe, optimise, redeploy - now has a step zero.&lt;/p&gt;

&lt;p&gt;Full post with simulation methodology, execution history, and the design-time FinOps checklist I now run on every new service →&lt;/p&gt;

&lt;p&gt;14-day Pro trial, no credit card. Free tier available at &lt;a href="https://app.pinpole.cloud"&gt;app.pinpole.cloud&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloudarchitecture</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
