<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Harsh Thakkar</title>
    <description>The latest articles on Forem by Harsh Thakkar (@harsh0369).</description>
    <link>https://forem.com/harsh0369</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856978%2Ff9074d18-9097-4004-80d1-1af7ea7b70e4.jpg</url>
      <title>Forem: Harsh Thakkar</title>
      <link>https://forem.com/harsh0369</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/harsh0369"/>
    <language>en</language>
    <item>
      <title>Why Most AWS-Based Developer Toolchains Fail After 6 Months (And What I Changed)</title>
      <dc:creator>Harsh Thakkar</dc:creator>
      <pubDate>Sun, 05 Apr 2026 06:37:18 +0000</pubDate>
      <link>https://forem.com/harsh0369/why-most-aws-based-developer-toolchains-fail-after-6-months-and-what-i-changed-2j2g</link>
      <guid>https://forem.com/harsh0369/why-most-aws-based-developer-toolchains-fail-after-6-months-and-what-i-changed-2j2g</guid>
      <description>&lt;p&gt;This is the story of an internal organizational project that taught me more than any practical roadmap ever could...&lt;/p&gt;

&lt;p&gt;The first time it broke, it wasn’t even during a deploy.⚠️&lt;/p&gt;

&lt;p&gt;It was a random Wednesday afternoon. No traffic spike, no big release, nothing dramatic. Just a Slack message from a backend team:&lt;br&gt;
“It seems builds are taking like 25 minutes now. Did something change?”&lt;/p&gt;

&lt;p&gt;Nothing had changed. That was the problem.😐&lt;/p&gt;




&lt;p&gt;Six months earlier, I had proudly stitched together what I thought was a clean AWS-native developer toolchain. Code went into GitHub, triggered a pipeline, and flowed through build, test, and deploy, everything nicely wired with managed services. Minimal servers, maximum “cloud-native elegance.”&lt;/p&gt;

&lt;p&gt;It felt… modern.✨&lt;/p&gt;

&lt;p&gt;For about three months.&lt;/p&gt;

&lt;p&gt;Then things started getting weird in small ways. Not failures. Friction.&lt;/p&gt;

&lt;p&gt;Build times creeping up. Logs harder to trace. Random permission errors that fixed themselves if you retried. Nobody panicked, because individually, each issue was… tolerable.😬&lt;/p&gt;

&lt;p&gt;But collectively, the system was rotting.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The illusion I bought into&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the time, I genuinely believed this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If we use more managed services, we’ll have less to worry about.”💡&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In hindsight, that wasn’t wrong. It was just incomplete.&lt;/p&gt;

&lt;p&gt;What I didn’t realize was that I was trading operational burden for cognitive burden. And the latter is sneakier.&lt;/p&gt;

&lt;p&gt;Because when something breaks in a traditional setup, you at least know where to look.&lt;/p&gt;

&lt;p&gt;When it breaks across five AWS services glued together by IAM roles and implicit triggers… good luck 😅&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The day it actually failed 💥&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had a hotfix that needed to go out quickly. Nothing major, just a small patch to fix a data validation issue.&lt;/p&gt;

&lt;p&gt;Pipeline triggered. Build started.▶️&lt;/p&gt;

&lt;p&gt;Then it hung.⛔&lt;/p&gt;

&lt;p&gt;No error. Just… stuck.😶&lt;/p&gt;

&lt;p&gt;We checked logs. Partial logs. Because the logs were split across services. One part in build logs, one part in deployment logs, some events in CloudWatch, some not showing up at all.&lt;/p&gt;

&lt;p&gt;After 40 minutes, another team member manually redeployed from their own system.&lt;/p&gt;

&lt;p&gt;It worked. ✅&lt;/p&gt;

&lt;p&gt;That was the moment I knew the system had failed, not because it crashed, but because no one trusted it anymore.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Where things actually went wrong&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It wasn’t a single bad decision. It was a series of reasonable ones.&lt;/p&gt;

&lt;p&gt;That’s what makes this tricky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. We optimized for setup, not longevity 🏗️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Early on, everything was fast to set up. Click here, configure that, connect this trigger.&lt;/p&gt;

&lt;p&gt;In hindsight, we built something that was easy to create but hard to understand.&lt;/p&gt;

&lt;p&gt;There’s a difference.&lt;/p&gt;

&lt;p&gt;After a few months, nobody remembered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which service triggered what&lt;/li&gt;
&lt;li&gt;why certain permissions existed&lt;/li&gt;
&lt;li&gt;what would break if we changed something small&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Including me.🙋‍♂️&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. We let IAM complexity spiral 🌀&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At first, permissions were tight. Thoughtful.&lt;/p&gt;

&lt;p&gt;Then came edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“just add this permission for now”&lt;/li&gt;
&lt;li&gt;“we’ll clean this up later”&lt;/li&gt;
&lt;li&gt;“it’s blocking the pipeline”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We never cleaned it up.&lt;/p&gt;

&lt;p&gt;Six months in, we had roles that nobody fully understood. Some were over-permissive, others randomly failed due to missing access.⚠️&lt;/p&gt;

&lt;p&gt;The worst part? Failures weren’t consistent.&lt;/p&gt;

&lt;p&gt;Retrying sometimes “fixed” things. That’s dangerous: it teaches people to ignore root causes.🚫&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Debugging became archaeology&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one hurt the most.😓&lt;/p&gt;

&lt;p&gt;To debug a single pipeline run, we had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;jump between multiple dashboards&lt;/li&gt;
&lt;li&gt;correlate timestamps manually&lt;/li&gt;
&lt;li&gt;guess which service dropped the signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was no single narrative of “what happened.”&lt;/p&gt;

&lt;p&gt;Just fragments.&lt;/p&gt;

&lt;p&gt;I remember thinking: why is this harder than debugging a monolith on a single server?&lt;/p&gt;

&lt;p&gt;That question stuck with me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. We overcomposed the system 🧱&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At some point, we crossed a line from modular to fragmented.&lt;/p&gt;

&lt;p&gt;Every small concern became its own piece:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;build&lt;/li&gt;
&lt;li&gt;test&lt;/li&gt;
&lt;li&gt;artifact storage&lt;/li&gt;
&lt;li&gt;deploy orchestration&lt;/li&gt;
&lt;li&gt;notifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, each piece made sense.👍&lt;/p&gt;

&lt;p&gt;Together, they formed a system that had too many moving parts to reason about.&lt;/p&gt;

&lt;p&gt;What I didn’t realize at the time:&lt;br&gt;
&lt;strong&gt;Every boundary you introduce is also a failure point.⚠️&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I changed (and what felt wrong at first)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fixes weren’t glamorous. Some even felt like a step backward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I reduced the number of services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was controversial internally.&lt;/p&gt;

&lt;p&gt;Instead of chaining multiple AWS services, I consolidated parts of the pipeline into fewer components, even if that meant slightly more responsibility in one place.&lt;/p&gt;

&lt;p&gt;Less “cloud-native purity.”&lt;br&gt;
More predictability.&lt;/p&gt;

&lt;p&gt;And honestly? Things got easier to debug almost immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I started designing for failure visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not prevention. Visibility.🔍&lt;/p&gt;

&lt;p&gt;We added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clearer, centralized logging (not perfect, just better)&lt;/li&gt;
&lt;li&gt;explicit failure points instead of silent retries&lt;/li&gt;
&lt;li&gt;fewer “magic” triggers&lt;/li&gt;
&lt;/ul&gt;
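
&lt;p&gt;To make “explicit failure points instead of silent retries” a bit more concrete, here is a minimal Python sketch. It is not our actual pipeline code: &lt;code&gt;run_step&lt;/code&gt;, &lt;code&gt;StepFailed&lt;/code&gt;, &lt;code&gt;new_run_id&lt;/code&gt;, and the log field names are all hypothetical. The idea is simply that every attempt gets logged under one shared run ID, and a step that keeps failing raises loudly instead of quietly retrying.&lt;/p&gt;

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")


class StepFailed(Exception):
    """Raised when a step uses up its attempts; the message carries the run ID."""


def new_run_id():
    """Short random ID shared by every log line of one pipeline run."""
    return uuid.uuid4().hex[:8]


def run_step(name, action, run_id, max_attempts=2):
    """Run one step, logging every attempt, and fail loudly when attempts run out."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
            log.info("run=%s step=%s attempt=%d status=ok", run_id, name, attempt)
            return result
        except Exception as exc:  # assumption: narrow this to known transient errors
            errors.append(repr(exc))
            log.warning(
                "run=%s step=%s attempt=%d status=error error=%r",
                run_id, name, attempt, exc,
            )
    # No silent retry loop: surface every collected error in one place.
    raise StepFailed(
        f"run={run_id} step={name} failed after {max_attempts} attempts: {errors}"
    )
```

&lt;p&gt;A caller would create one ID with &lt;code&gt;new_run_id()&lt;/code&gt; and pass it to every &lt;code&gt;run_step(...)&lt;/code&gt; call, so searching the logs for that ID gives a single narrative of the run. A retry that “fixes” things still leaves a warning line behind, which keeps the root cause visible.&lt;/p&gt;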

&lt;p&gt;I stopped trying to make everything seamless.&lt;/p&gt;

&lt;p&gt;Because seamless systems are hard to inspect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I treated IAM as code, not configuration 💻&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was a big shift.&lt;/p&gt;

&lt;p&gt;Instead of tweaking permissions ad hoc, we started defining them more explicitly and reviewing changes like actual code.&lt;/p&gt;

&lt;p&gt;It slowed us down in the short term.&lt;/p&gt;

&lt;p&gt;But it removed that creeping uncertainty of “who can do what anymore?”&lt;/p&gt;
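
&lt;p&gt;As a rough illustration of reviewing permissions like code, here is a small Python sketch. The policy, the bucket name &lt;code&gt;example-build-artifacts&lt;/code&gt;, and the helper &lt;code&gt;find_wildcards&lt;/code&gt; are assumptions made up for this example, not what we actually ran. It keeps a policy document in version control as plain data and flags wildcard grants so they show up during review.&lt;/p&gt;

```python
import json

# Hypothetical policy for a build role. The bucket name and the actions are
# illustrative assumptions, not the original project's permissions.
BUILD_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadWriteBuildArtifacts",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-build-artifacts/*",
        }
    ],
}


def as_list(value):
    """IAM allows a single value or a list in these fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcards(policy):
    """Describe every Allow statement that grants '*' or a service-wide 'svc:*'."""
    problems = []
    for stmt in as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", "(no Sid)")
        for action in as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                problems.append(f"{sid}: wildcard action {action}")
        for resource in as_list(stmt.get("Resource", [])):
            if resource == "*":
                problems.append(f"{sid}: wildcard resource")
    return problems


if __name__ == "__main__":
    print(json.dumps(BUILD_ROLE_POLICY, indent=2))
    print(find_wildcards(BUILD_ROLE_POLICY) or "no wildcards found")
```

&lt;p&gt;A check like this could run in CI on every permissions change. It only catches the most obvious over-permissive grants, so it complements human review rather than replacing it.&lt;/p&gt;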

&lt;p&gt;&lt;strong&gt;I accepted a bit of duplication📄&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Earlier, I tried to DRY everything out across pipelines and environments.&lt;/p&gt;

&lt;p&gt;Now? Some duplication stays.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because over-abstraction in infrastructure makes things harder to reason about when something breaks.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Clarity &amp;gt; cleverness.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Every time.✔️&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable truth😬&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most AWS-based developer toolchains don’t fail because AWS is unreliable.&lt;/p&gt;

&lt;p&gt;They fail because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;they become too abstract&lt;/li&gt;
&lt;li&gt;too distributed&lt;/li&gt;
&lt;li&gt;too “smart”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And nobody owns the full picture anymore.&lt;/p&gt;

&lt;p&gt;It’s not a tooling problem. It’s a design mindset problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I’d do differently from day one&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I had to rebuild everything:&lt;/p&gt;

&lt;p&gt;I’d start with a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When this breaks at 2 AM, how quickly can someone understand what happened?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not “how scalable is it”&lt;br&gt;
Not “how serverless is it”&lt;br&gt;
Not “how elegant is it”&lt;/p&gt;

&lt;p&gt;Just that.🎯&lt;/p&gt;

&lt;p&gt;Because six months in, that’s the only thing that really matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One last thing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;People love to say:&lt;br&gt;
“Use managed services so you can focus on business logic.”&lt;/p&gt;

&lt;p&gt;I still agree with that.&lt;/p&gt;

&lt;p&gt;But there’s a hidden cost.&lt;/p&gt;

&lt;p&gt;You’re not eliminating complexity.&lt;br&gt;
You’re relocating it.&lt;/p&gt;

&lt;p&gt;And if you’re not careful…&lt;br&gt;
you’ll end up debugging a system that nobody fully understands.&lt;/p&gt;

&lt;p&gt;Including the person who built it.🙃&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
