Forem: Harsh Thakkar

Why Your AWS CI/CD Pipeline May Be Slower Than It Should Be (Mine Was Too)

Harsh Thakkar — Mon, 13 Apr 2026 06:39:30 +0000

It was one of those days where nothing was technically broken… but everything felt off.

Deployments were going through. Pipelines were green. No alarms screaming.
And yet every push took forever.

I remember staring at the screen after triggering a simple change. A tiny config tweak. Something that should’ve gone through in a couple of minutes. Instead, I watched my AWS pipeline crawl… step by step… like it had all the time in the world.

For a YAML change.

At the time, I told myself, “Yeah, CI/CD pipelines are just slow sometimes.”
That was my first mistake.

The lie we tell ourselves

If your pipeline works, you stop questioning it.

That’s what I did.

I had a pretty standard setup:

Code pushed → CodePipeline triggers
CodeBuild runs tests + build
Artifacts go to S3
Deploy via CodeDeploy

Nothing exotic. No weird hacks. It looked clean.

But under the surface, it was quietly inefficient in ways I didn’t notice for months.

The moment it clicked

One afternoon, I had to deploy 5 times in a row (😅).
Same pipeline. Same steps. Same wait… every time.

That’s when it hit me:

I was spending more time waiting for my pipeline than actually coding.

And worse… I had accepted it.

Where the time was actually going

I finally sat down and traced a single run end-to-end. Not casually. Properly.

And yeah… it was uncomfortable.

1. CodeBuild was doing way too much

I had bundled everything into one build phase:

install dependencies
run tests
build artifacts
package everything

Seemed efficient, right?

Except… every single run started from scratch.

No caching.

So even if I changed one line, it:

reinstalled node modules
rebuilt layers
redid everything like it had never seen my project before

That alone was eating 6–8 minutes.

What I didn’t realize at the time:

Stateless builds are great… until they’re unnecessarily stateless.

2. I ignored caching because it felt “optional”

AWS makes caching in CodeBuild possible, but not exactly obvious.

I skipped it initially because:

It adds config complexity
Cache invalidation is annoying
“It’s fine for now”

Classic.

When I finally enabled caching for dependencies (node_modules, pip, etc.), build times dropped almost immediately.

Not dramatically. But noticeably.

Still… caching comes with trade-offs:

Sometimes stale dependencies sneak in
Debugging weird build issues becomes harder
You need to think about cache keys (which I initially didn’t 😅)
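
To make the cache-key point concrete, here is a minimal Python sketch. It is not how CodeBuild stores or looks up its cache; it only illustrates the general idea of deriving the key from the dependency lockfile, so the cache is reused until the dependencies actually change. The `cache_key` helper, the `node-modules` prefix, and the inlined lockfile contents are all assumptions made up for illustration.

```python
import hashlib


def cache_key(prefix: str, lockfile_contents: bytes) -> str:
    """Build a cache key that changes only when the lockfile changes."""
    digest = hashlib.sha256(lockfile_contents).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Hypothetical lockfile contents, inlined so the sketch runs on its own.
old_lock = b'{"lodash": "4.17.20"}'
new_lock = b'{"lodash": "4.17.21"}'

key_before = cache_key("node-modules", old_lock)
key_after = cache_key("node-modules", new_lock)

# Same lockfile -> same key (cache hit); changed lockfile -> new key (cache miss).
print(key_before == cache_key("node-modules", old_lock))  # True
print(key_before == key_after)  # False
```

Tying the key to the lockfile is one way to reduce the "stale dependencies sneak in" problem: a dependency bump produces a new key, so the old cache is simply never matched.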

3. Serial execution everywhere

This one hurt a bit.

My pipeline stages were strictly linear:

Build → Test → Package → Deploy

No parallelism. No optimization.

Even independent steps were waiting on each other.

Looking back, I could’ve:

Run tests in parallel with certain build steps
Split pipelines by service instead of monolith builds
Avoided blocking everything for one slow task

But I didn’t. Because linear pipelines are easy to reason about.

And sometimes… we choose simplicity over speed without realizing the cost.
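
As a rough illustration of what "independent steps shouldn't wait on each other" means, here is a small Python sketch. It is not a CodePipeline configuration; the step names are hypothetical and `time.sleep` stands in for real work. It only shows that three steps with no dependencies between them finish in roughly the time of the slowest one when run concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(name: str, seconds: float) -> str:
    """Stand-in for a real pipeline step; sleeps instead of doing work."""
    time.sleep(seconds)
    return name


# Hypothetical independent steps: none of them needs another's output.
independent_steps = [("unit-tests", 0.2), ("lint", 0.2), ("build-assets", 0.2)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(independent_steps)) as pool:
    futures = [pool.submit(run_step, name, secs) for name, secs in independent_steps]
    finished = [f.result() for f in futures]  # re-raises if any step failed
elapsed = time.perf_counter() - start

print(finished)
print(f"wall time: {elapsed:.2f}s")
```

Run serially, the same three steps would take at least 0.6 seconds. The cost is the one mentioned above: the flow is no longer a straight line you can read top to bottom.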

4. Artifact handling was... lazy

I was passing around large artifacts between stages.
Bigger than they needed to be.

Stuff that didn’t even change between runs was getting repackaged and uploaded again.

It wasn’t obvious at first. But S3 upload + download latency adds up.

Especially when:

You compress everything every time
You don’t separate static vs dynamic assets
You treat artifacts like a dumping ground

In hindsight, this was just… sloppy engineering.
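
One way to stop re-uploading things that didn't change is to compare content hashes against a manifest from the previous run. The sketch below is an assumption-heavy illustration, not the setup described in this post: `changed_files`, the file paths, and the manifest format are all hypothetical, and the actual upload step is left out.

```python
import hashlib


def changed_files(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return paths whose content hash differs from the previous run's manifest."""
    changed = []
    for path, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        if manifest.get(path) != digest:
            changed.append(path)
    return changed


# Hypothetical artifact contents, inlined so the sketch runs on its own.
previous_run = {"static/logo.svg": b"<svg/>", "app/bundle.js": b"console.log(1)"}
current_run = {"static/logo.svg": b"<svg/>", "app/bundle.js": b"console.log(2)"}

# Manifest recorded after the previous run: path -> SHA-256 of its content.
manifest = {p: hashlib.sha256(c).hexdigest() for p, c in previous_run.items()}

print(changed_files(current_run, manifest))  # ['app/bundle.js']
```

Only `app/bundle.js` would need to be packaged and uploaded again; the static asset is untouched and can be skipped.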

5. Over-triggering pipelines

This one was subtle.

Every push triggered the full pipeline even for:

README changes
minor config tweaks
non-deployable updates

So I was burning compute time (and patience) on changes that didn’t need deployment.

A simple filter or conditional trigger would’ve helped.

But I didn’t add it until much later.
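
For a sense of what such a filter could look like, here is a minimal Python sketch. It is not a CodePipeline feature and not the filter I ended up with; `SKIP_PATTERNS` and `needs_deploy` are hypothetical names, and the patterns are just examples. The idea: given the list of changed paths in a push, deploy unless every one of them matches a "safe to skip" pattern.

```python
from fnmatch import fnmatch

# Hypothetical patterns for changes that never need a deployment.
SKIP_PATTERNS = ["README*", "docs/*", "*.md", ".github/*"]


def needs_deploy(changed_paths: list[str]) -> bool:
    """True if at least one changed file falls outside the skip patterns."""
    return any(
        not any(fnmatch(path, pattern) for pattern in SKIP_PATTERNS)
        for path in changed_paths
    )


print(needs_deploy(["README.md", "docs/setup.md"]))  # False
print(needs_deploy(["README.md", "src/handler.py"]))  # True
```

Note the direction of the default: anything not explicitly listed as skippable still triggers a deploy. That keeps the risk of silently missing a deployment low, at the price of a few unnecessary runs.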

What changed after all this

Not overnight. And not perfectly.

But gradually:

I split heavy builds into smaller, more focused steps
Added caching (carefully… and with some regret during debugging 😅)
Introduced conditional triggers
Reduced artifact size and duplication
Parallelized what I could without making things unreadable

And the result?

My pipeline dropped from ~18 minutes to around 6–8 minutes on average.

Still not blazing fast. But acceptable.

More importantly, it felt under control.

The part nobody talks about

Faster pipelines aren’t free.

Every optimization introduces trade-offs:

Caching → faster builds, harder debugging
Parallelism → speed, but more complexity
Smaller artifacts → better performance, but more structure required
Conditional triggers → efficiency, but risk of missing deployments

There’s no perfect setup.

Just… intentional ones.

What I’d do differently now

If I were starting fresh:

I wouldn’t aim for the perfect pipeline.

I’d aim for visibility first.

Measure each stage early
Understand where time goes
Optimize only what actually hurts

Because honestly…

Most pipelines aren’t slow because of AWS.
They’re slow because of decisions we stopped questioning.
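
"Measure each stage" can start very small. The Python sketch below is one possible shape for it, under stated assumptions: the `timed` helper and the stage names are hypothetical, and the sleeps stand in for real install, test, and package work. It records how long each stage takes and prints the slowest first.

```python
import time
from contextlib import contextmanager

durations: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[stage] = time.perf_counter() - start


# Hypothetical stages; the sleeps stand in for real work.
with timed("install"):
    time.sleep(0.05)
with timed("test"):
    time.sleep(0.02)
with timed("package"):
    time.sleep(0.01)

# Slowest stage first, so the biggest cost is the first thing you see.
for stage, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<10} {secs:.3f}s")
```

Even a crude table like this is usually enough to tell which stage deserves attention and which ones are not worth touching.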

Final thought

If your pipeline feels slow, it probably is.

And if you’ve gotten used to it… that’s the real problem.

I did too.

Until one day I couldn’t ignore it anymore.

Why Most AWS-Based Developer Toolchains Fail After 6 Months (And What I Changed)

Harsh Thakkar — Sun, 05 Apr 2026 06:37:18 +0000

This is the story of an internal organizational project that taught me more than any practical roadmap could ever...

The first time it broke, it wasn’t even during a deploy.⚠️

It was a random Wednesday afternoon. No traffic spike, no big release, nothing dramatic. Just a Slack message from a backend team:
“It seems builds are taking like 25 minutes now. Did something change?”

Nothing had changed. That was the problem.😐

Six months earlier, I had proudly stitched together what I thought was a clean AWS-native developer toolchain. Code went into GitHub, triggered a pipeline, and flowed through build, test, and deploy, everything nicely wired with managed services. Minimal servers, maximum “cloud-native elegance.”

It felt… modern.✨

For about three months.

Then things started getting weird in small ways. Not failures. Friction.

Build times creeping up. Logs harder to trace. Random permission errors that fixed themselves if you retried. Nobody panicked, because individually, each issue was… tolerable.😬

But collectively, the system was rotting!

The illusion I bought into

At the time, I genuinely believed this:

“If we use more managed services, we’ll have less to worry about.”💡

In hindsight, that wasn’t wrong. It was just incomplete.

What I didn’t realize was that I was trading operational burden for cognitive burden. And the latter is sneakier.

Because when something breaks in a traditional setup, you at least know where to look.

When it breaks across five AWS services glued together by IAM roles and implicit triggers… good luck 😅

The day it actually failed 💥

We had a hotfix that needed to go out quickly. Nothing major, just a small patch to fix a data validation issue.

Pipeline triggered. Build started.▶️

Then it hung.⛔

No error. Just… stuck.😶

We checked logs. Partial logs. Because the logs were split across services. One part in build logs, one part in deployment logs, some events in CloudWatch, some not showing up at all.

After 40 minutes, another team member manually redeployed from their own system.

It worked. ✅

That was the moment I knew the system had failed, not because it crashed, but because no one trusted it anymore.

Where things actually went wrong

It wasn’t a single bad decision. It was a series of reasonable ones.

That’s what makes this tricky.

1. We optimized for setup, not longevity 🏗️

Early on, everything was fast to set up. Click here, configure that, connect this trigger.

In hindsight, we built something that was easy to create but hard to understand.

There’s a difference.

After a few months, nobody remembered:

which service triggered what
why certain permissions existed
what would break if we changed something small

Including me.🙋‍♂️

2. We let IAM complexity spiral 🌀

At first, permissions were tight. Thoughtful.

Then came edge cases:

“just add this permission for now”
“we’ll clean this up later”
“it’s blocking the pipeline”

We never cleaned it up.

Six months in, we had roles that nobody fully understood. Some were over-permissive, others randomly failed due to missing access.⚠️

The worst part? Failures weren’t consistent.

Retrying sometimes “fixed” things. That’s dangerous: it teaches people to ignore root causes.🚫

3. Debugging became archaeology

This one hurt the most.😓

To debug a single pipeline run, we had to:

jump between multiple dashboards
correlate timestamps manually
guess which service dropped the signal

There was no single narrative of “what happened.”

Just fragments.

I remember thinking: why is this harder than debugging a monolith on a single server?

That question stuck with me.
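
The missing "single narrative" is essentially a merge problem. As a minimal sketch, assuming each service's events can be exported with an ISO 8601 timestamp, the Python below merges them into one ordered timeline. The event shapes, source names, and messages are invented for illustration; they are not real CloudWatch or CodeBuild log formats.

```python
from datetime import datetime

# Hypothetical events from three separate log sources for the same run.
build_logs = [
    {"ts": "2026-04-01T10:00:05+00:00", "source": "build", "msg": "install started"},
    {"ts": "2026-04-01T10:04:40+00:00", "source": "build", "msg": "artifact uploaded"},
]
deploy_logs = [
    {"ts": "2026-04-01T10:05:10+00:00", "source": "deploy", "msg": "deployment created"},
]
trigger_logs = [
    {"ts": "2026-04-01T10:00:01+00:00", "source": "trigger", "msg": "push received"},
]


def timeline(*sources: list[dict]) -> list[str]:
    """Merge events from all sources into one list ordered by timestamp."""
    events = [event for source in sources for event in source]
    events.sort(key=lambda e: datetime.fromisoformat(e["ts"]))
    return [f'{e["ts"]} [{e["source"]}] {e["msg"]}' for e in events]


for line in timeline(build_logs, deploy_logs, trigger_logs):
    print(line)
```

A gap in the merged output (say, an upload with no deployment after it) points at the boundary where the signal was dropped, which is exactly the thing we were guessing at by hand.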

4. We overcomposed the system 🧱

At some point, we crossed a line from modular to fragmented.

Every small concern became its own piece:

build
test
artifact storage
deploy orchestration
notifications

Individually, each piece made sense.👍

Together, they formed a system that had too many moving parts to reason about.

What I didn’t realize at the time:
Every boundary you introduce is also a failure point.⚠️

What I changed (and what felt wrong at first)

The fixes weren’t glamorous. Some even felt like a step backward.

I reduced the number of services

This was controversial internally.

Instead of chaining multiple AWS services, I consolidated parts of the pipeline into fewer components, even if that meant slightly more responsibility in one place.

Less “cloud-native purity.”
More predictability.

And honestly? Things got easier to debug almost immediately.

I started designing for failure visibility

Not prevention. Visibility.🔍

We added:

clearer, centralized logging (not perfect, just better)
explicit failure points instead of silent retries
fewer “magic” triggers

I stopped trying to make everything seamless.

Because seamless systems are hard to inspect.
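
To show the difference between a silent retry and an explicit failure point, here is a small Python sketch. It is an illustration under assumptions, not our actual tooling: `run_with_visible_retries` and `flaky_deploy` are hypothetical, and the permission error is simulated. Retries still happen, but every failed attempt is logged and the last error is re-raised instead of being swallowed.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def run_with_visible_retries(step, name: str, attempts: int = 3):
    """Retry a step, but log every failure and re-raise once attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("step=%s attempt=%d/%d failed: %r", name, attempt, attempts, exc)
            if attempt == attempts:
                raise  # explicit failure point: never swallow the last error


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"count": 0}


def flaky_deploy() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise PermissionError("AccessDenied (simulated)")
    return "deployed"


print(run_with_visible_retries(flaky_deploy, "deploy"))  # deployed
```

The run above still ends in `deployed`, but the two warnings stay in the log, so "it worked after a retry" leaves a trail someone can follow back to the root cause.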

I treated IAM as code, not configuration 💻

This was a big shift.

Instead of tweaking permissions ad hoc, we started defining them more explicitly and reviewing changes like actual code.

It slowed us down in the short term.

But it removed that creeping uncertainty of “who can do what anymore?”
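
Once policies live in a repository, they can be checked like any other code. The Python sketch below is one hypothetical review check, not the process we actually ran: it scans a policy document in the standard IAM JSON shape and flags `Allow` statements whose `Action` or `Resource` is a bare `*`. The `wildcard_findings` name and the example policy (including the `example-artifacts` bucket) are assumptions for illustration.

```python
def wildcard_findings(policy: dict) -> list[str]:
    """Flag Allow statements whose Action or Resource is a bare wildcard."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            values = [values] if isinstance(values, str) else values
            if "*" in values:
                findings.append(f"Statement[{index}].{field} allows '*'")
    return findings


# Hypothetical policy of the "just add this permission for now" kind.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-artifacts/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

print(wildcard_findings(policy))
```

A check this simple only catches the crudest over-permissive grants, but running it on every change makes "we'll clean this up later" visible at review time instead of six months in.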

I accepted a bit of duplication📄

Earlier, I tried to DRY everything out across pipelines and environments.

Now? Some duplication stays.

Why?

Because over-abstraction in infrastructure makes things harder to reason about when something breaks.

Clarity > cleverness.

Every time.✔️

The uncomfortable truth😬

Most AWS-based developer toolchains don’t fail because AWS is unreliable.

They fail because:

they become too abstract
too distributed
too “smart”

And nobody owns the full picture anymore.

It’s not a tooling problem. It’s a design mindset problem.

What I’d do differently from day one

If I had to rebuild everything:

I’d start with a simple question:

“When this breaks at 2 AM, how quickly can someone understand what happened?”

Not “how scalable is it”
Not “how serverless is it”
Not “how elegant is it”

Just that.🎯

Because six months in, that’s the only thing that really matters.

One last thing

People love to say:
“Use managed services so you can focus on business logic.”

I still agree with that.

But there’s a hidden cost.

You’re not eliminating complexity.
You’re relocating it.

And if you’re not careful…
you’ll end up debugging a system that nobody fully understands.

Including the person who built it.🙃