<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adil Khan</title>
    <description>The latest articles on Forem by Adil Khan (@adil-khan-723).</description>
    <link>https://forem.com/adil-khan-723</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650542%2Fa71fd520-f132-4110-b7b6-e0afded5090f.png</url>
      <title>Forem: Adil Khan</title>
      <link>https://forem.com/adil-khan-723</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adil-khan-723"/>
    <language>en</language>
    <item>
      <title>I Built a Kubernetes Monitoring Stack — And Breaking It Was the Best Part</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 28 Mar 2026 09:34:20 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-kubernetes-monitoring-stack-and-breaking-it-was-the-best-part-1lba</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-kubernetes-monitoring-stack-and-breaking-it-was-the-best-part-1lba</guid>
      <description>&lt;p&gt;I didn't build this project to add a line to my resume.&lt;/p&gt;

&lt;p&gt;I built it because I kept reading about Prometheus and Grafana, nodding along like I understood it, and then freezing when someone asked me "so how does Prometheus actually discover your pods?"&lt;/p&gt;

&lt;p&gt;I didn't know. Not really.&lt;/p&gt;

&lt;p&gt;So I decided to stop reading and start breaking things.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;A complete observability pipeline from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python Flask app with &lt;strong&gt;custom Prometheus metrics&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed on &lt;strong&gt;Kubernetes with 3 replicas&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Scraped by &lt;strong&gt;Prometheus via ServiceMonitor&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Visualized in &lt;strong&gt;Grafana with PromQL dashboards&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Load tested with real traffic using &lt;code&gt;hey&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
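&lt;p&gt;To make "custom Prometheus metrics" concrete, here is a minimal sketch of the Flask side using &lt;code&gt;prometheus_client&lt;/code&gt;. Metric and route names are illustrative, not necessarily what the repo uses:&lt;/p&gt;

```python
# Minimal sketch of a Flask app exposing a custom Prometheus counter.
# Metric and route names are illustrative; the repo's code may differ.
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# A counter with a per-endpoint label, so PromQL can slice by path later.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["endpoint"])

@app.route("/")
def home():
    REQUESTS.labels(endpoint="/").inc()
    return "ok"

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint in its text exposition format.
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

&lt;p&gt;Everything in the dashboards below is built on counters shaped like this one.&lt;/p&gt;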

&lt;p&gt;The repo is here if you want to follow along:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/adil-khan-723/k8s-observability-stack" rel="noopener noreferrer"&gt;github.com/adil-khan-723/k8s-observability-stack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5hvpoh194r41dugpkbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5hvpoh194r41dugpkbq.png" alt="Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the code isn't the interesting part. What I learned by watching it fail is.&lt;/p&gt;


&lt;h2&gt;
  
  
  Mistake #1 — I used raw counters in Grafana and wondered why nothing made sense
&lt;/h2&gt;

&lt;p&gt;First dashboard. I added &lt;code&gt;http_requests_total&lt;/code&gt; as a panel. The number just kept climbing. 1000. 5000. 23000.&lt;/p&gt;

&lt;p&gt;I stared at it thinking "okay... is that good?"&lt;/p&gt;

&lt;p&gt;It tells you nothing. A counter that only goes up is like a car's odometer — it doesn't tell you how fast you're going right now.&lt;/p&gt;

&lt;p&gt;The correct query is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rate()&lt;/code&gt; calculates requests &lt;strong&gt;per second&lt;/strong&gt; over the last minute. That's a number you can actually act on. After switching to this, I could see exactly when traffic spiked during load testing and when it dropped off.&lt;/p&gt;

&lt;p&gt;Lesson: &lt;strong&gt;metrics alone are useless. PromQL creates insight.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #2 — I set the rate window too large and hid all my spikes
&lt;/h2&gt;

&lt;p&gt;Once I had &lt;code&gt;rate()&lt;/code&gt; working, I noticed my graph looked... suspiciously smooth. Almost like nothing was happening even during load tests.&lt;/p&gt;

&lt;p&gt;I was using &lt;code&gt;rate(http_requests_total[8m])&lt;/code&gt;. An 8-minute window averages out everything. A spike that lasted 30 seconds disappears completely.&lt;/p&gt;

&lt;p&gt;Switched to &lt;code&gt;[1m]&lt;/code&gt;. Suddenly I could see exactly what happened during the load test — a sharp climb, a plateau, a drop. Real information.&lt;/p&gt;

&lt;p&gt;The dashboard also had stacked graphs enabled. Stacking makes it look like total traffic is the sum of all the colored areas, which is visually misleading when you're trying to compare per-pod behavior. Disabled it immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #3 — Prometheus showed all targets DOWN and I had no idea why
&lt;/h2&gt;

&lt;p&gt;This one took me a while.&lt;/p&gt;

&lt;p&gt;I ran load tests, checked Prometheus UI under &lt;strong&gt;Status → Targets&lt;/strong&gt;, and saw this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context deadline exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 3 pods. All DOWN.&lt;/p&gt;

&lt;p&gt;My first instinct was to blame the ServiceMonitor config. I triple-checked the labels. Everything matched. The problem wasn't discovery — Prometheus was finding the pods fine. It just couldn't scrape them in time.&lt;/p&gt;

&lt;p&gt;Root cause: I had a &lt;code&gt;time.sleep(2)&lt;/code&gt; sitting in my home route. This slowed down the entire Gunicorn worker. When Prometheus tried to hit &lt;code&gt;/metrics&lt;/code&gt;, sometimes the worker was busy sleeping and the scrape timed out.&lt;/p&gt;

&lt;p&gt;The fix was two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Increased the scrape &lt;code&gt;interval&lt;/code&gt; in serviceMonitor.yaml to give more breathing room&lt;/li&gt;
&lt;li&gt;Removed the artificial delay (or accounted for it explicitly)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Watched all 3 targets flip back to UP in real time. That was a good moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper lesson:&lt;/strong&gt; your monitoring system depends on your application's performance. A slow app breaks its own observability. This is why slow business logic should never share the same worker that serves the &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;
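&lt;p&gt;One way to keep a slow handler from starving the scrape is to serve metrics from a dedicated thread on its own port. A hedged sketch using &lt;code&gt;prometheus_client.start_http_server&lt;/code&gt; (the port number is illustrative, and under Gunicorn with multiple workers you would need its multiprocess mode instead):&lt;/p&gt;

```python
# Sketch: serve /metrics from a dedicated thread so slow business logic
# can't block the scrape. The port number is illustrative.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("demo_requests_total", "Requests handled")

def handle_request():
    REQUESTS.inc()
    time.sleep(2)  # the artificial delay no longer delays /metrics

if __name__ == "__main__":
    start_http_server(9000)  # metrics thread on :9000; app traffic elsewhere
    while True:
        handle_request()
```

&lt;p&gt;The scrape path and the request path stop competing for the same worker.&lt;/p&gt;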




&lt;h2&gt;
  
  
  Mistake #4 — I assumed Running meant healthy
&lt;/h2&gt;

&lt;p&gt;After fixing the scrape issue, I noticed something strange. &lt;code&gt;kubectl get pods&lt;/code&gt; showed all pods as &lt;code&gt;Running&lt;/code&gt;. But requests were still failing intermittently.&lt;/p&gt;

&lt;p&gt;I had been treating &lt;code&gt;Running&lt;/code&gt; as "everything is fine." It isn't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Running&lt;/code&gt; just means the container process started. It says nothing about whether the application inside is actually ready to serve traffic. A pod can be &lt;code&gt;Running&lt;/code&gt; while your Flask app is still initializing, or while it's in a broken state that hasn't crashed the process.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;readinessProbe&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this was in place, Kubernetes automatically removed unhealthy pods from the Service's endpoint list. Traffic only went to pods that were actually ready. The intermittent failures stopped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running ≠ healthy. Readiness probes are not optional.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The thing that surprised me most — load isn't evenly distributed
&lt;/h2&gt;

&lt;p&gt;During load testing I split the Grafana panel to show per-pod request rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1m&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected three roughly equal lines. What I got were three noticeably different ones. One pod was consistently getting more traffic than the others.&lt;/p&gt;

&lt;p&gt;Kubernetes Services balance traffic at the &lt;strong&gt;connection&lt;/strong&gt; level, not the request level. With the default iptables-mode kube-proxy, the pod for each new connection is picked at random rather than in strict round-robin. Under high concurrency, some pods end up holding more long-lived connections and therefore handle more requests.&lt;/p&gt;

&lt;p&gt;If I had only been looking at &lt;code&gt;sum(rate(http_requests_total[1m]))&lt;/code&gt; — the aggregate — I would never have seen this. The sum looked perfectly healthy. The per-pod view told a completely different story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why per-pod metrics exist. Aggregates hide things.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the ServiceMonitor actually does (and why it's confusing at first)
&lt;/h2&gt;

&lt;p&gt;The part that confused me most before building this was the relationship between Prometheus and Kubernetes.&lt;/p&gt;

&lt;p&gt;Prometheus doesn't scrape your Deployment directly. It discovers targets through &lt;strong&gt;Services&lt;/strong&gt;, whose endpoints resolve to individual pod IPs. And it finds those Services through a custom resource called a &lt;strong&gt;ServiceMonitor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The chain looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ServiceMonitor → matches Service labels → Service resolves to Pod IPs → Prometheus scrapes each pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this to work, three things have to align exactly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ServiceMonitor's &lt;code&gt;selector&lt;/code&gt; must match the Service's labels&lt;/li&gt;
&lt;li&gt;The ServiceMonitor itself must carry the &lt;code&gt;release: monitoring&lt;/code&gt; label (or whatever your Helm release is named) so the Prometheus Operator selects it&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;namespaceSelector&lt;/code&gt; must point to where the Service lives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get any one of these wrong and Prometheus simply never discovers the target. No error. Just silence. This is the part where most people spend hours debugging.&lt;/p&gt;
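&lt;p&gt;A hedged sketch of how those three pieces might line up. Every name here (&lt;code&gt;flask-app&lt;/code&gt;, the namespaces, the &lt;code&gt;release&lt;/code&gt; value) is an assumption to adapt to your own cluster:&lt;/p&gt;

```yaml
# Hypothetical names; adjust to your Service and Helm release.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flask-app
  namespace: monitoring
  labels:
    release: monitoring        # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: flask-app           # must match the Service's labels exactly
  namespaceSelector:
    matchNames:
      - default                # must be where the Service actually lives
  endpoints:
    - port: http               # named port on the Service
      path: /metrics
      interval: 15s
```

&lt;p&gt;When a target never appears, diffing these three spots against &lt;code&gt;kubectl get svc --show-labels&lt;/code&gt; is usually faster than re-reading the docs.&lt;/p&gt;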




&lt;h2&gt;
  
  
  What I'd add next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; — fire alerts when scrape targets go DOWN or error rate spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HPA&lt;/strong&gt; — autoscale pods based on custom Prometheus metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki&lt;/strong&gt; — correlate logs with metric anomalies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinned dependency versions&lt;/strong&gt; — &lt;code&gt;requirements.txt&lt;/code&gt; currently has no versions, which is a reproducibility risk&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The most valuable part of this project wasn't setting up Prometheus or building the Grafana dashboard. It was the moment I saw &lt;code&gt;context deadline exceeded&lt;/code&gt; and had to actually figure out why.&lt;/p&gt;

&lt;p&gt;You don't learn observability by reading about it. You learn it by watching your own system fail and having to diagnose it with the tools you built.&lt;/p&gt;

&lt;p&gt;If you're learning DevOps or platform engineering, build something, break it, and read the metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/adil-khan-723/k8s-observability-stack" rel="noopener noreferrer"&gt;github.com/adil-khan-723/k8s-observability-stack&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you run into similar issues with Prometheus scraping? Drop it in the comments — would love to hear what weird things you've debugged.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>prometheus</category>
      <category>grafana</category>
      <category>devops</category>
    </item>
    <item>
      <title>I built a CLI that catches dangerous Terraform changes before you apply them</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 23 Mar 2026 07:19:22 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/-i-built-a-cli-that-catches-dangerous-terraform-changes-before-you-apply-them-noe</link>
      <guid>https://forem.com/adil-khan-723/-i-built-a-cli-that-catches-dangerous-terraform-changes-before-you-apply-them-noe</guid>
      <description>&lt;p&gt;Before every &lt;code&gt;terraform apply&lt;/code&gt; I was doing the same thing. Read the plan output. Switch to the AWS console. Check security groups. Try to remember what depends on what.&lt;/p&gt;

&lt;p&gt;Then a security group with port 22 open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; slipped through. Caught it fast, nothing broke — but I kept thinking about it. That whole review process was just me, manually, every time, hoping I didn't miss anything in 300 lines of text. That's not a process. That's vibes.&lt;/p&gt;

&lt;p&gt;So I built IACGuard. This is how it works and what I got wrong along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap in terraform plan
&lt;/h2&gt;

&lt;p&gt;The output doesn't tell you what matters. A production database being replaced and a tag change look identical in the plan — same format, same indentation, same weight. You have to already know what's dangerous to spot it.&lt;/p&gt;

&lt;p&gt;The second gap is pipelines. You can fail a PR on test failures or linting. You can't natively fail a PR because the plan replaces a production RDS instance. That gap means risky infrastructure changes get the same review as safe ones, which under time pressure is basically no review.&lt;/p&gt;




&lt;h2&gt;
  
  
  What IACGuard does
&lt;/h2&gt;

&lt;p&gt;Reads a Terraform plan file, runs deterministic rules against every planned change, outputs a risk report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;iacguard plan results
─────────────────────────────────────────────────────────
Critical : 1  |  High : 0  |  Medium : 0  |  Low : 0
Resources analyzed : 5  |  Changes : 5
─────────────────────────────────────────────────────────

  CRITICAL  SG001  aws_vpc_security_group_ingress_rule.sg_inbound_ssh  [CREATE]
           Security group ingress rule 'sg_inbound_ssh' allows SSH (port 22)
           from the entire internet (0.0.0.0/0 or ::/0).

─────────────────────────────────────────────────────────
[iacguard] Rules checked   : 3
[iacguard] Drift check     : SKIPPED (use --region to enable)
─────────────────────────────────────────────────────────
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 0 = nothing critical. Exit code 1 = stop and look at this before applying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;iacguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa9s00rtgfsl0atlylg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa9s00rtgfsl0atlylg3.png" alt="iacguard architecture"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The first version used AI for detection. That was wrong.
&lt;/h2&gt;

&lt;p&gt;My original plan was to send the whole Terraform plan to Claude and let it decide what was risky.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan.json → Claude API → risk scores + findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is non-determinism. LLMs give different answers on different runs. That's fine for generating text. It's not fine when your tool blocks deployments. If &lt;code&gt;iacguard&lt;/code&gt; exits with code 1 in a CI pipeline, that exit code has to mean the same thing every single time — not "Claude was in a cautious mood today."&lt;/p&gt;

&lt;p&gt;So the architecture changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan.json → parser → rule engine (deterministic) → findings → output
                                                            ↓
                                              AI explanation (--explain, optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rules decide what's risky. AI only explains what the rules already found. AI never runs in CI mode at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  The parser
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;terraform show -json&lt;/code&gt; gives you a JSON file. The useful part is &lt;code&gt;resource_changes&lt;/code&gt; — every resource being created, updated, or destroyed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_db_instance.primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"managed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"instance_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.t3.medium"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"instance_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.t3.large"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"replacing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious parts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;actions&lt;/code&gt; can be &lt;code&gt;["delete", "create"]&lt;/code&gt; — that's a replacement, not two separate operations. The parser normalizes everything to &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DESTROY&lt;/code&gt;, or &lt;code&gt;REPLACE&lt;/code&gt; before rules run.&lt;/p&gt;

&lt;p&gt;Data sources (&lt;code&gt;mode: "data"&lt;/code&gt;) and no-ops get filtered out. They're reads, not changes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;change.before&lt;/code&gt; is null for creates. &lt;code&gt;change.after&lt;/code&gt; is null for destroys. Rules have to handle both.&lt;/p&gt;

&lt;p&gt;Resources inside modules have addresses like &lt;code&gt;module.vpc.aws_security_group.bastion&lt;/code&gt;. The output stays readable while the full address is preserved for accuracy.&lt;/p&gt;

&lt;p&gt;I spent more time on the parser than anything else. If it reads something wrong, every rule downstream gets bad data.&lt;/p&gt;
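&lt;p&gt;The normalization step fits in a few lines. Field names below follow &lt;code&gt;terraform show -json&lt;/code&gt; output, but the helpers themselves are an illustrative sketch, not the repo's actual parser:&lt;/p&gt;

```python
# Illustrative sketch of IACGuard-style action normalization.
# Field names match `terraform show -json`; the helpers are hypothetical.
def normalize_action(actions):
    """Map a raw `actions` list to CREATE / UPDATE / DESTROY / REPLACE / NOOP."""
    acts = set(actions)
    if acts == {"delete", "create"}:
        return "REPLACE"   # one replacement, not two separate operations
    if acts == {"create"}:
        return "CREATE"
    if acts == {"update"}:
        return "UPDATE"
    if acts == {"delete"}:
        return "DESTROY"
    return "NOOP"          # ["no-op"] and ["read"] carry no change

def planned_changes(plan):
    """Yield (address, action) for managed resources that actually change."""
    for rc in plan.get("resource_changes", []):
        if rc.get("mode") != "managed":
            continue       # data sources are reads, not changes
        action = normalize_action(rc["change"]["actions"])
        if action != "NOOP":
            yield rc["address"], action
```

&lt;p&gt;Rules only ever see the normalized form, so none of them have to re-handle the &lt;code&gt;["delete", "create"]&lt;/code&gt; quirk individually.&lt;/p&gt;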




&lt;h2&gt;
  
  
  The three rules
&lt;/h2&gt;

&lt;p&gt;Each rule is a pure Python function. Takes a &lt;code&gt;ResourceChange&lt;/code&gt;, returns a &lt;code&gt;Finding&lt;/code&gt; or &lt;code&gt;None&lt;/code&gt;. No shared state, independently testable.&lt;/p&gt;

&lt;h3&gt;
  
  
  RDS001 — database replacement
&lt;/h3&gt;

&lt;p&gt;An RDS replacement deletes the database and recreates it. Anything written between delete and restore is gone. The rule fires on &lt;code&gt;actions == ["delete", "create"]&lt;/code&gt; or &lt;code&gt;change.replacing == True&lt;/code&gt; — both checked because real plans sometimes only set one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RDS001&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RuleBase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rule_id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CRITICAL&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_changes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_db_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_rds_cluster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REPLACE&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replacing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS instance &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; will be replaced — potential data loss.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure a snapshot exists before applying.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SG001 — SSH open to the internet
&lt;/h3&gt;

&lt;p&gt;This one had a bug I only found by running it on my own infrastructure.&lt;/p&gt;

&lt;p&gt;The rule originally covered &lt;code&gt;aws_security_group&lt;/code&gt; and &lt;code&gt;aws_security_group_rule&lt;/code&gt;. When I ran it against my real Terraform code — which has port 22 open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; — it caught nothing.&lt;/p&gt;

&lt;p&gt;Why: my code uses &lt;code&gt;aws_vpc_security_group_ingress_rule&lt;/code&gt;, a newer resource type from AWS provider v5+. Different structure. &lt;code&gt;from_port&lt;/code&gt;, &lt;code&gt;to_port&lt;/code&gt;, and &lt;code&gt;cidr_ipv4&lt;/code&gt; are top-level fields, not nested inside an ingress block.&lt;/p&gt;

&lt;p&gt;I had test fixtures for the old types. Hadn't thought to write one for the new type. You don't know what you haven't tested until real infrastructure breaks it.&lt;/p&gt;

&lt;p&gt;Fixed rule covers all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_security_group                   → ingress[] array
aws_security_group_rule              → type=ingress fields
aws_vpc_security_group_ingress_rule  → top-level from_port/to_port/cidr_ipv4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fires if &lt;code&gt;from_port &amp;lt;= 22 &amp;lt;= to_port&lt;/code&gt;. A rule with &lt;code&gt;from_port=0, to_port=65535&lt;/code&gt; exposes SSH. IPv6 (&lt;code&gt;::/0&lt;/code&gt;) also checked.&lt;/p&gt;
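&lt;p&gt;The core of that check, for the newer top-level-field shape, is one small function. A sketch (the repo's real rule also walks the two older resource layouts):&lt;/p&gt;

```python
# Sketch of the SG001 check for aws_vpc_security_group_ingress_rule,
# where from_port/to_port/cidr_ipv4 are top-level fields.
WORLD = {"0.0.0.0/0", "::/0"}

def exposes_ssh(after):
    """True if this ingress rule lets the whole internet reach port 22."""
    if after is None:          # destroys have no `after` state
        return False
    from_port = after.get("from_port")
    to_port = after.get("to_port")
    cidr = after.get("cidr_ipv4") or after.get("cidr_ipv6")
    if from_port is None or to_port is None or cidr not in WORLD:
        return False
    return from_port <= 22 <= to_port   # catches 0-65535 ranges too
```

&lt;p&gt;The range check is the part that matters: an exact-match on port 22 would miss the all-ports rule entirely.&lt;/p&gt;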

&lt;h3&gt;
  
  
  S3001 — missing public access block
&lt;/h3&gt;

&lt;p&gt;Medium severity, not Critical. AWS accounts can have account-level Block Public Access settings that protect every bucket regardless of what Terraform configures. Flagging this Critical would false-positive on any account with account-level protection — which is a lot of accounts.&lt;/p&gt;

&lt;p&gt;The output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 bucket 'assets' has no explicit block_public_access configuration.
Account-level settings may still protect this bucket.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engineer decides. The tool surfaces it without over-reacting.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI mode
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--ci&lt;/code&gt; flag: JSON to stdout, no color, just exit codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json &lt;span class="nt"&gt;--ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0 = clean. Exit 1 = critical finding. Exit 2 = tool error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IACGuard&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install iacguard&lt;/span&gt;
    &lt;span class="s"&gt;terraform show -json tfplan &amp;gt; plan.json&lt;/span&gt;
    &lt;span class="s"&gt;iacguard plan --plan plan.json --ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PR blocks on exit 1. That's it.&lt;/p&gt;
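&lt;p&gt;The whole contract is small enough to sketch. The severity string here is an assumption about the internal model, but the point stands: the mapping is pure and deterministic, so the same findings always produce the same exit code:&lt;/p&gt;

```python
# Sketch of the deterministic CI exit-code contract.
# The severity string is illustrative.
def exit_code(findings, tool_error=False):
    """0 = clean, 1 = at least one critical finding, 2 = tool error."""
    if tool_error:
        return 2
    if any(f["severity"] == "CRITICAL" for f in findings):
        return 1
    return 0
```

&lt;p&gt;No network, no model, no randomness anywhere in the decision path.&lt;/p&gt;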




&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Every rule has two tests minimum: a plan that triggers it and one that doesn't. All tests use real Terraform plan JSON files — no synthetic JSON built in test code. Synthetic fixtures pass even when the parser would fail on a real plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# 15 passed in 0.01s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I got wrong
&lt;/h2&gt;

&lt;p&gt;The original design had eight subcommands, a React dashboard, Kubernetes scanning, a custom cost engine, AI-generated fix code, and multi-region support — all in v1. I went through multiple rounds of design review, including running the spec past other LLMs to pressure-test it. Most of that got cut. What shipped is one command, three rules, and a CLI. It works on real infrastructure. The original would have shipped half-finished on everything.&lt;/p&gt;

&lt;p&gt;The other mistake: I assumed I knew the Terraform plan JSON structure well enough to write the parser from memory. I didn't generate real plan files and read them first. &lt;code&gt;aws_vpc_security_group_ingress_rule&lt;/code&gt; showing up as a bug in my fixtures is a direct result of that. Next time I'm reading real data before writing code that parses it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Blast radius is the main one — given a resource being changed, compute what else depends on it from the Terraform graph. You change a security group, IACGuard tells you which EC2 instances, load balancers, and RDS clusters are downstream. No free CLI tool does this cleanly in a pre-deploy workflow.&lt;/p&gt;
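&lt;p&gt;The core of it can be sketched in a few lines: invert the "depends on" edges from the plan's configuration block and walk them. The function and edge data below are a hypothetical sketch, not the planned implementation:&lt;/p&gt;

```python
from collections import deque

def downstream(changed, references):
    """references maps each resource to the resources it depends on.
    Returns everything that transitively depends on `changed`."""
    # Invert depends-on edges into dependents.
    dependents = {}
    for res, deps in references.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(res)
    # Breadth-first walk from the changed resource.
    seen, queue = set(), deque([changed])
    while queue:
        for nxt in dependents.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

EDGES = {
    "aws_instance.web": ["aws_security_group.app"],
    "aws_lb.front": ["aws_instance.web"],
    "aws_db_instance.main": ["aws_security_group.app"],
}
print(sorted(downstream("aws_security_group.app", EDGES)))
# ['aws_db_instance.main', 'aws_instance.web', 'aws_lb.front']
```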

&lt;p&gt;After that: more rules, drift detection (Terraform state vs live AWS), and a browser graph for the dependency chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;iacguard

terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; plan.json

iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json
iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json &lt;span class="nt"&gt;--ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/adil-khan-723/iacguard" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/iacguard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you hit a resource type the rules miss, open an issue or PR. Each rule is one file.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built a Production-Grade Kubernetes RBAC Setup — And Broke It On Purpose</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 28 Feb 2026 17:55:49 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/how-i-built-a-production-grade-kubernetes-rbac-setup-and-broke-it-on-purpose-98a</link>
      <guid>https://forem.com/adil-khan-723/how-i-built-a-production-grade-kubernetes-rbac-setup-and-broke-it-on-purpose-98a</guid>
      <description>&lt;p&gt;Most RBAC tutorials show you how to apply a Role and run &lt;code&gt;kubectl auth can-i&lt;/code&gt;. Then they call it done.&lt;/p&gt;

&lt;p&gt;That never sat right with me. In production, your workload doesn't authenticate using your kubeconfig. It authenticates using a ServiceAccount token mounted inside the pod. So if you've never tested RBAC from &lt;em&gt;inside&lt;/em&gt; a running container, you haven't actually tested RBAC.&lt;/p&gt;

&lt;p&gt;This project fixes that. I built a minimal but realistic RBAC setup for an observability tool, validated it from inside a live deployment, and then intentionally broke it to understand what failure actually looks like at the API server level.&lt;/p&gt;

&lt;p&gt;The full source is here: &lt;a href="https://github.com/adil-khan-723/K8s-RBAC" rel="noopener noreferrer"&gt;github.com/adil-khan-723/K8s-RBAC&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Everything lives inside a dedicated &lt;code&gt;observability&lt;/code&gt; namespace. The workload — a test deployment — runs under a purpose-built ServiceAccount called &lt;code&gt;log-reader-sa&lt;/code&gt;. A namespace-scoped Role defines exactly what that identity is allowed to do. A RoleBinding connects the two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observability (namespace)
│
├── log-reader-sa        ← Dedicated ServiceAccount
├── log-reader-role      ← Namespace-scoped Role
├── log-reader-binding   ← Binds SA → Role
└── testing (Deployment) ← Live workload for validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ClusterRoles. No ClusterRoleBindings. Everything contained.&lt;/p&gt;
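&lt;p&gt;Condensed, the three objects look roughly like this (trimmed; exact field values may differ slightly from the repo's manifests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: log-reader-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader-role
  namespace: observability
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]   # subresource, listed explicitly
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-reader-binding
  namespace: observability
subjects:
- kind: ServiceAccount
  name: log-reader-sa
  namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;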




&lt;h2&gt;
  
  
  Why Each Decision Was Made
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Never Use the Default ServiceAccount
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;default&lt;/code&gt; ServiceAccount exists in every namespace automatically. Using it means every workload in that namespace shares the same identity. If one workload gets permissions, every other workload riding the default SA inherits them silently.&lt;/p&gt;

&lt;p&gt;In any environment with more than one workload, this is a privilege creep problem waiting to happen. The fix is simple: create a dedicated ServiceAccount per workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Namespace-Scoped Role, Not ClusterRole
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;ClusterRole&lt;/code&gt; bound with a &lt;code&gt;ClusterRoleBinding&lt;/code&gt; grants access across &lt;em&gt;every&lt;/em&gt; namespace — current and future. Even a ClusterRole bound with a &lt;code&gt;RoleBinding&lt;/code&gt; still references a cluster-level object, which creates reuse risks and makes auditing harder.&lt;/p&gt;

&lt;p&gt;A namespace-local &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; keeps the permission surface completely contained. If you can't explain why a workload needs cluster-wide access, it doesn't need a ClusterRole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verbs Are Not Optional
&lt;/h3&gt;

&lt;p&gt;RBAC doesn't have a "read-only mode" switch. You have to declare every verb individually. The role grants &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, and &lt;code&gt;watch&lt;/code&gt; — and nothing else. Write verbs (&lt;code&gt;create&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;) are not included. Kubernetes does not default to denying write operations if you forget; it denies &lt;em&gt;everything&lt;/em&gt; you don't explicitly allow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subresources Are Not Inherited
&lt;/h3&gt;

&lt;p&gt;This is the one that catches people the most.&lt;/p&gt;

&lt;p&gt;Access to &lt;code&gt;pods&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; grant access to &lt;code&gt;pods/log&lt;/code&gt;. They are treated as completely separate targets by the API server. A Role missing &lt;code&gt;pods/log&lt;/code&gt; applies cleanly with &lt;code&gt;kubectl apply&lt;/code&gt; and then fails loudly at runtime, when your monitoring tool tries to pull logs and gets a 403.&lt;/p&gt;

&lt;p&gt;Every subresource you need must appear explicitly in the rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Granted
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Verbs Granted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;watch&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pods/log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;get&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deployments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secrets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Everything else&lt;/td&gt;
&lt;td&gt;Denied&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Testing From Inside the Pod
&lt;/h2&gt;

&lt;p&gt;This is the part most tutorials skip. I deployed a real workload under &lt;code&gt;log-reader-sa&lt;/code&gt; and tested API calls directly from inside the container.&lt;/p&gt;
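&lt;p&gt;Concretely, the in-pod checks look like this, using the token and CA bundle Kubernetes mounts into every pod. The exact commands in the repo may differ; this is the standard pattern, with the deployment name taken from the diagram above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# From inside the container: kubectl exec -it deploy/testing -n observability -- sh
APISERVER=https://kubernetes.default.svc
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)

# Allowed: list pods in the namespace
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" \
  $APISERVER/api/v1/namespaces/observability/pods

# Denied: read secrets (the API server answers 403 Forbidden)
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" \
  $APISERVER/api/v1/namespaces/observability/secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;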

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List pods in namespace&lt;/td&gt;
&lt;td&gt;✅ Allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get pod logs&lt;/td&gt;
&lt;td&gt;✅ Allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access secrets&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete a pod&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access resources outside namespace&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Testing this way matters because it confirms the ServiceAccount token is correctly mounted, the API server is reachable from inside the pod, and the policy behaves exactly as written — not as assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Breaking It On Purpose
&lt;/h2&gt;

&lt;p&gt;I removed &lt;code&gt;pods/log&lt;/code&gt; from the Role rules to simulate a common production misconfiguration. The result was immediate: a &lt;code&gt;403 Forbidden&lt;/code&gt; response every time log retrieval was attempted.&lt;/p&gt;

&lt;p&gt;This turned into a useful debugging exercise. There are four failure types that look similar on the surface but require completely different fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;401 Unauthorized&lt;/strong&gt; — the identity wasn't recognized. Token is missing, expired, or invalid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;403 Forbidden&lt;/strong&gt; — the identity was recognized, the request reached the API server, but the action isn't permitted. This is an RBAC problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;404 Not Found&lt;/strong&gt; — the resource doesn't exist. Not an authorization issue at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection refused / timeout&lt;/strong&gt; — the API server wasn't reached. Networking problem, not RBAC.&lt;/p&gt;

&lt;p&gt;When you see a 403, you've already confirmed that the workload has connectivity, the token is valid, and the API server is up. The investigation starts and ends with the Role definition.&lt;/p&gt;
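&lt;p&gt;You can reproduce the same verdict from outside the pod by asking the API server to evaluate the policy as the ServiceAccount itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Evaluates RBAC for the SA identity; answers "no" once pods/log is removed
kubectl auth can-i get pods/log \
  --as=system:serviceaccount:observability:log-reader-sa \
  -n observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;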




&lt;h2&gt;
  
  
  The Security Picture
&lt;/h2&gt;

&lt;p&gt;If this workload were compromised, the damage is bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read pod metadata and logs within &lt;code&gt;observability&lt;/code&gt; — yes&lt;/li&gt;
&lt;li&gt;Access secrets — no&lt;/li&gt;
&lt;li&gt;Modify or delete anything — no&lt;/li&gt;
&lt;li&gt;Move laterally to other namespaces — no&lt;/li&gt;
&lt;li&gt;Escalate to cluster-level access — no&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what blast radius limitation looks like in practice. The attacker gets read access to one namespace. That's a recoverable incident. Cluster-admin on a compromised workload is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Reinforced
&lt;/h2&gt;

&lt;p&gt;RBAC in Kubernetes is an authorization layer evaluated at the API server for every request. The evaluation checks four things: who is making the request, what verb they're using, what resource they're targeting, and which namespace it's in.&lt;/p&gt;

&lt;p&gt;Roles define what is permitted. Bindings attach identities to those permissions. Neither inherits anything. Neither assumes anything. Everything must be declared.&lt;/p&gt;

&lt;p&gt;The habits worth building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with zero permissions and add only what you can justify&lt;/li&gt;
&lt;li&gt;Test from inside the workload, not just from your terminal&lt;/li&gt;
&lt;li&gt;List subresources explicitly — they are never implied&lt;/li&gt;
&lt;li&gt;Know the difference between a 401, 403, and 404&lt;/li&gt;
&lt;li&gt;Give every workload its own ServiceAccount&lt;/li&gt;
&lt;li&gt;Avoid ClusterRoleBindings unless the requirement is genuinely cluster-wide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source
&lt;/h2&gt;

&lt;p&gt;Full manifests and project structure: &lt;a href="https://github.com/adil-khan-723/K8s-RBAC" rel="noopener noreferrer"&gt;github.com/adil-khan-723/K8s-RBAC&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've been writing RBAC policy without testing it from inside a running pod, this is a good starting point for closing that gap.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Built a Multi-Service Kubernetes App and Here's What Actually Broke</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 21 Feb 2026 13:40:56 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-1066</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-1066</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Multi-Service Kubernetes App and Here's What Actually Broke
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;This wasn't a "follow the tutorial" project. The goal was simple: understand how real distributed systems actually work inside Kubernetes.&lt;/p&gt;

&lt;p&gt;Not just "deploy containers," but truly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How services discover each other&lt;/li&gt;
&lt;li&gt;How internal networking routes traffic&lt;/li&gt;
&lt;li&gt;How ingress exposes applications externally&lt;/li&gt;
&lt;li&gt;How TLS termination works at the edge&lt;/li&gt;
&lt;li&gt;How secrets and configs propagate&lt;/li&gt;
&lt;li&gt;How rolling updates affect uptime&lt;/li&gt;
&lt;li&gt;The difference between stateful and stateless workloads&lt;/li&gt;
&lt;li&gt;How DNS resolution works across namespaces&lt;/li&gt;
&lt;li&gt;How to debug when things inevitably break&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a multi-service voting application that mirrors real production microservices architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Five independent services, each with a specific role:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voting Frontend&lt;/strong&gt; - Stateless web UI where users cast votes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results Frontend&lt;/strong&gt; - Stateless web UI displaying real-time results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; - Message queue for asynchronous processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; - Persistent database storing vote data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker Service&lt;/strong&gt; - Background processor consuming from queue and writing to database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The traffic flow follows a typical 3-tier distributed pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Frontend → Queue → Worker → Database → Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  System Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v8yv7naycab7g37atz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v8yv7naycab7g37atz2.png" alt="Kubernetes Voting App Architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates the complete system architecture showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Layer&lt;/strong&gt;: User traffic entering via HTTPS (port 8443)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Layer&lt;/strong&gt;: TLS termination and path-based routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Layer&lt;/strong&gt;: Stateless frontend services (voting and results)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Layer&lt;/strong&gt;: Redis (message queue) and PostgreSQL (persistent storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing Layer&lt;/strong&gt;: Worker service connecting queue to database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Design Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All internal communication uses ClusterIP Services&lt;/li&gt;
&lt;li&gt;External access is controlled through Ingress with host-based routing&lt;/li&gt;
&lt;li&gt;Services are isolated and communicate only through defined interfaces&lt;/li&gt;
&lt;li&gt;No hardcoded IPs anywhere - everything uses service discovery&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Kubernetes Components Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;p&gt;Deployments manage the stateless workloads in this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting frontend&lt;/li&gt;
&lt;li&gt;Results frontend
&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;Worker&lt;/li&gt;
&lt;li&gt;PostgreSQL (initially)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Deployments Enable:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling updates without downtime&lt;/li&gt;
&lt;li&gt;Declarative replica scaling&lt;/li&gt;
&lt;li&gt;Self-healing when pods crash&lt;/li&gt;
&lt;li&gt;Controlled rollout strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time I updated a deployment, Kubernetes created new pods, waited for them to be ready, then terminated old ones. Zero downtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  StatefulSets (The Deep Learning)
&lt;/h3&gt;

&lt;p&gt;StatefulSets were explored separately to understand how stateful workloads differ fundamentally from stateless ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key StatefulSet Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable, persistent pod identities (pod-0, pod-1, etc.)&lt;/li&gt;
&lt;li&gt;Ordered, graceful deployment and scaling&lt;/li&gt;
&lt;li&gt;Stable network identifiers via headless services&lt;/li&gt;
&lt;li&gt;Per-pod persistent storage that survives rescheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL was initially deployed as a Deployment. Then I migrated it to a StatefulSet to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How DNS works with headless services&lt;/li&gt;
&lt;li&gt;How PersistentVolumeClaims attach to specific pods&lt;/li&gt;
&lt;li&gt;Why ordered startup matters for clustered databases&lt;/li&gt;
&lt;li&gt;How rollout behavior changes with stateful workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Services: The Networking Glue
&lt;/h3&gt;

&lt;p&gt;Services provide stable networking in an environment where pod IPs are ephemeral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClusterIP Services (Internal):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a stable endpoint at &lt;code&gt;redis.default.svc.cluster.local&lt;/code&gt; that load-balances across Redis pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Learning:&lt;/strong&gt; Services are not pods. Services are stable DNS names that route to pods. When pods die and are recreated with new IPs, the Service continues working.&lt;/p&gt;
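&lt;p&gt;An easy way to watch this in action is a throwaway pod doing the lookup (the image tag is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Resolve the Service name from inside the cluster, then clean up
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup redis.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;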

&lt;h3&gt;
  
  
  Ingress: External Traffic Routing
&lt;/h3&gt;

&lt;p&gt;Ingress defines HTTP routing rules, but requires an Ingress Controller to actually process traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/vote&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voting-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/result&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Ingress Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User makes request to &lt;code&gt;oggy.local/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Request hits Ingress Controller&lt;/li&gt;
&lt;li&gt;Controller evaluates Ingress rules&lt;/li&gt;
&lt;li&gt;Traffic forwards to &lt;code&gt;voting-service&lt;/code&gt; on port 80&lt;/li&gt;
&lt;li&gt;Service load-balances to backend pods&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  TLS Termination
&lt;/h3&gt;

&lt;p&gt;This project implements TLS termination at the Ingress level, enabling HTTPS access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating TLS certificates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate self-signed certificate for local development&lt;/span&gt;
openssl req &lt;span class="nt"&gt;-x509&lt;/span&gt; &lt;span class="nt"&gt;-nodes&lt;/span&gt; &lt;span class="nt"&gt;-days&lt;/span&gt; 365 &lt;span class="nt"&gt;-newkey&lt;/span&gt; rsa:2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-keyout&lt;/span&gt; oggy.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-out&lt;/span&gt; oggy.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s2"&gt;"/CN=oggy.local/O=oggy.local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating Kubernetes TLS Secret:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/tls&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-cert&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;tls.key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-key&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or create it directly from files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret tls oggy-tls &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;oggy.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;oggy.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingress with TLS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/vote&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voting-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the application is accessible via HTTPS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting: &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Results: &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Learning:&lt;/strong&gt; TLS termination at the Ingress means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic from user to Ingress Controller is encrypted (HTTPS)&lt;/li&gt;
&lt;li&gt;Traffic from Ingress to backend Services is unencrypted (HTTP)&lt;/li&gt;
&lt;li&gt;Certificates are managed centrally, not per-service&lt;/li&gt;
&lt;li&gt;Backend services don't need to handle TLS&lt;/li&gt;
&lt;/ul&gt;
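&lt;p&gt;A quick way to verify the TLS path end to end, assuming the ingress controller is reachable on local port 8443 (for example via a port-forward); &lt;code&gt;-k&lt;/code&gt; is needed because the certificate is self-signed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# --resolve maps oggy.local to the controller without editing /etc/hosts
curl -k --resolve oggy.local:8443:127.0.0.1 https://oggy.local:8443/vote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;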




&lt;h2&gt;
  
  
  Traffic Flow: Internal vs External
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Internal Communication
&lt;/h3&gt;

&lt;p&gt;All services communicate using DNS-based service discovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voting-frontend → redis.default.svc.cluster.local
worker → redis.default.svc.cluster.local  
worker → db.default.svc.cluster.local
result-frontend → db.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pod IPs. No hardcoded addresses. Pure service discovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  External Access
&lt;/h3&gt;

&lt;p&gt;Users access the application through HTTPS with TLS termination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voting:&lt;/strong&gt; &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results:&lt;/strong&gt; &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Ingress Controller handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS termination (decrypting HTTPS traffic)&lt;/li&gt;
&lt;li&gt;Path-based routing to appropriate Services&lt;/li&gt;
&lt;li&gt;Load balancing across backend pods&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Broke and How I Fixed It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: Pod IPs Keep Changing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; I initially tried connecting services using pod IPs. Pods got rescheduled, IPs changed, everything broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Pods are ephemeral. Their IPs are not stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use Services as stable endpoints. Services maintain consistent DNS names regardless of pod lifecycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: Hardcoding pod IP&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.244.0.5"&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Right: Using service name&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 2: Ingress Resources Did Nothing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Created Ingress resources. Nothing worked. Traffic never reached the apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Ingress resources are just configuration. They require an Ingress Controller to actually process traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Installed nginx-ingress controller separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.1/deploy/static/provider/cloud/deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Resource&lt;/strong&gt; = routing rules (the "what")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Controller&lt;/strong&gt; = traffic processor (the "how")&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem 3: Service Names Didn't Resolve Across Namespaces
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Services in different namespaces couldn't find each other using short names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; DNS resolution in Kubernetes is namespace-scoped by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use fully qualified domain names (FQDN):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Within same namespace&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;

&lt;span class="c1"&gt;# Across namespaces&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis.production.svc.cluster.local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS resolution follows this pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;redis&lt;/code&gt; → searches current namespace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis.production&lt;/code&gt; → searches production namespace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis.production.svc.cluster.local&lt;/code&gt; → explicit FQDN&lt;/li&gt;
&lt;/ol&gt;
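
&lt;p&gt;That search order isn't magic — kubelet writes the namespace's search domains into every pod's &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. A quick way to see it for yourself (the pod name &lt;code&gt;debug&lt;/code&gt; here is just a placeholder for any running pod):&lt;/p&gt;

```shell
# Show the DNS configuration Kubernetes injected into a pod
kubectl exec debug -- cat /etc/resolv.conf

# Typical contents for a pod in the "default" namespace:
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   nameserver 10.96.0.10
```

&lt;p&gt;The &lt;code&gt;search&lt;/code&gt; line is what turns the short name &lt;code&gt;redis&lt;/code&gt; into the full FQDN during resolution.&lt;/p&gt;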

&lt;h3&gt;
  
  
  Problem 4: Ingress Controller Wouldn't Schedule
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Ingress Controller pod stuck in &lt;code&gt;Pending&lt;/code&gt; state. Never scheduled to a node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Local cluster had node taints and labels that prevented scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Added tolerations and adjusted node selector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-role.kubernetes.io/control-plane"&lt;/span&gt;
  &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exists"&lt;/span&gt;
  &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Cloud clusters and local clusters (kind, minikube) have different default configurations. Local clusters often taint control-plane nodes to prevent workload scheduling.&lt;/p&gt;
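
&lt;p&gt;If you hit the same &lt;code&gt;Pending&lt;/code&gt; state, it helps to inspect the taints directly before writing tolerations — these are standard &lt;code&gt;kubectl&lt;/code&gt; commands with no project-specific assumptions:&lt;/p&gt;

```shell
# Show taints on every node
kubectl describe nodes | grep -i -A 3 taints

# Same information as structured output: one line per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```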




&lt;h2&gt;
  
  
  Configuration Management: Secrets and ConfigMaps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ConfigMaps for Non-Sensitive Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;
  &lt;span class="na"&gt;DB_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db"&lt;/span&gt;
  &lt;span class="na"&gt;DB_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;votes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ConfigMaps store configuration as key-value pairs that can be injected into pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secrets for Sensitive Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cG9zdGdyZXM=&lt;/span&gt;  &lt;span class="c1"&gt;# base64 encoded&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; Secrets are base64-encoded, not encrypted by default. For production, use encryption at rest or external secret managers (Vault, AWS Secrets Manager, etc.).&lt;/p&gt;
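
&lt;p&gt;The &lt;code&gt;data&lt;/code&gt; values above are produced with plain &lt;code&gt;base64&lt;/code&gt;, and anyone with read access to the Secret can reverse them just as easily — which is exactly why "encoded" is not "encrypted":&lt;/p&gt;

```shell
# Encode a value for a Secret manifest (-n avoids a trailing newline)
echo -n "postgres" | base64
# prints: cG9zdGdyZXM=

# Decoding it back takes one command — no key required
echo -n "cG9zdGdyZXM=" | base64 --decode
# prints: postgres
```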

&lt;h3&gt;
  
  
  Injecting Configuration into Pods
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;
  &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
  &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Rolling Updates: Zero-Downtime Deployments
&lt;/h2&gt;

&lt;p&gt;Kubernetes Deployments support rolling updates out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update Strategy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens during an update:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes creates 1 new pod (maxSurge: 1)&lt;/li&gt;
&lt;li&gt;Waits for new pod to be ready&lt;/li&gt;
&lt;li&gt;Terminates 1 old pod (maxUnavailable: 1)&lt;/li&gt;
&lt;li&gt;Repeats until all pods are updated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Testing rolling updates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update image version&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/voting-app voting-app&lt;span class="o"&gt;=&lt;/span&gt;voting-app:v2

&lt;span class="c"&gt;# Watch the rollout&lt;/span&gt;
kubectl rollout status deployment/voting-app

&lt;span class="c"&gt;# Rollback if needed&lt;/span&gt;
kubectl rollout undo deployment/voting-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero downtime. Zero manual intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Distributed Systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Essential Debugging Commands
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Check pod status:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
kubectl describe pod &amp;lt;pod-name&amp;gt;
kubectl logs &amp;lt;pod-name&amp;gt;
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;  &lt;span class="c"&gt;# logs from crashed container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test service connectivity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create debug pod&lt;/span&gt;
kubectl run debug &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; /bin/bash

&lt;span class="c"&gt;# Inside debug pod&lt;/span&gt;
nslookup redis
nslookup redis.default.svc.cluster.local
curl http://voting-service/vote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check ingress:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress
kubectl describe ingress app-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify service endpoints:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get endpoints redis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows which pod IPs the service is routing to. If empty, your selector doesn't match any pods.&lt;/p&gt;
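
&lt;p&gt;The most common cause of empty endpoints is a label mismatch. A minimal sketch (names and image are illustrative): the Service's &lt;code&gt;selector&lt;/code&gt; must match the pod template's &lt;code&gt;labels&lt;/code&gt; exactly.&lt;/p&gt;

```yaml
# Service side
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis        # must match the pod labels below
  ports:
  - port: 6379
---
# Deployment side (pod template)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis    # the label the Service selects on
    spec:
      containers:
      - name: redis
        image: redis:7
        ports:
        - containerPort: 6379
```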




&lt;h2&gt;
  
  
  Local Cluster Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating a kind cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create cluster with ingress port mappings&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 8080
    protocol: TCP
  - containerPort: 443
    hostPort: 8443
    protocol: TCP
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install nginx-ingress for kind:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure local DNS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add hostname to /etc/hosts&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"127.0.0.1 oggy.local"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/hosts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy the application:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; namespace.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; configMap.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; secrets.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; tls-secret.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Access the application:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP: &lt;code&gt;http://oggy.local:8080/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;HTTPS: &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt; (with TLS)&lt;/li&gt;
&lt;li&gt;Results: &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Your browser will show a security warning for the self-signed certificate. This is expected for local development.&lt;/p&gt;
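
&lt;p&gt;For reference, the self-signed certificate behind that warning can be generated with a single &lt;code&gt;openssl&lt;/code&gt; command — a sketch assuming the &lt;code&gt;oggy.local&lt;/code&gt; hostname and the &lt;code&gt;tls-secret&lt;/code&gt; name used in this project:&lt;/p&gt;

```shell
# Create a self-signed cert/key pair for oggy.local, valid for one year
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout oggy.key -out oggy.crt \
  -days 365 -subj "/CN=oggy.local"

# Then load it into the cluster as the TLS Secret the Ingress references:
#   kubectl create secret tls tls-secret --cert=oggy.crt --key=oggy.key
```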




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── README.md
├── namespace.yaml              # Namespace definition
├── configMap.yaml              # ConfigMap for app configuration
├── secrets.yaml                # Secrets for sensitive data
├── tls-secret.yaml             # TLS certificates for HTTPS
├── oggy.crt                    # TLS certificate file
├── oggy.key                    # TLS private key file
├── deployment-postgres.yaml    # PostgreSQL Deployment
├── deployment-redis.yaml       # Redis Deployment
├── deployment-result.yaml      # Results Frontend Deployment
├── deployment-voting.yaml      # Voting Frontend Deployment
├── deployment-worker.yaml      # Worker Deployment
├── service-postgres.yaml       # PostgreSQL Service
├── service-redis.yaml          # Redis Service
├── service-results.yaml        # Results Service
├── service-voting.yaml         # Voting Service
└── ingress.yaml                # Ingress with TLS configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;This project isn't about running containers in Kubernetes. It's about understanding how Kubernetes actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental Models That Clicked:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes networking is service-driven, not pod-driven&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Pods are ephemeral. Services are stable. Always route through Services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingress requires both rules and a controller&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Rules define routing logic. Controllers implement the logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS resolution is namespace-scoped&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Short names work within namespaces. Cross-namespace requires FQDNs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local clusters behave differently than cloud clusters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Taints, tolerations, and storage classes vary significantly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;StatefulSets are fundamentally different from Deployments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Stable identities, ordered operations, and per-pod storage make stateful workloads possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once these mental models clicked, advanced Kubernetes concepts (NetworkPolicies, PodDisruptionBudgets, HorizontalPodAutoscalers) started making sense.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This project covers the fundamentals plus TLS termination. Real production systems add even more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated certificate management&lt;/strong&gt; with cert-manager (vs manual certificates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent volumes&lt;/strong&gt; with storage classes for stateful workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscalers&lt;/strong&gt; for dynamic scaling based on metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; for pod-to-pod traffic control and security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits and requests&lt;/strong&gt; for scheduling and QoS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; (liveness and readiness probes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; with Prometheus and Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log aggregation&lt;/strong&gt; with ELK or Loki&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But you can't build those without understanding the fundamentals first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup and Deployment
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/yourusername/kubernetes-voting-app.git
&lt;span class="nb"&gt;cd &lt;/span&gt;kubernetes-voting-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
kubectl get svc
kubectl get ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cleanup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Official Kubernetes Docs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/" rel="noopener noreferrer"&gt;Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" rel="noopener noreferrer"&gt;Ingress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/" rel="noopener noreferrer"&gt;StatefulSets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools Used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind&lt;/a&gt; - Kubernetes IN Docker (used for this project)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.github.io/ingress-nginx/" rel="noopener noreferrer"&gt;nginx-ingress&lt;/a&gt; - Ingress Controller&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; - Kubernetes CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative local clusters:&lt;/strong&gt; minikube, k3s, Docker Desktop&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/kubernetes-sample-voting-app-project-tls.git" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions or feedback?&lt;/strong&gt; Drop a comment below. Happy to discuss Kubernetes architecture, debugging strategies, or anything else related to distributed systems.&lt;/p&gt;




&lt;h1&gt;
  
  
  #Kubernetes #DevOps #Microservices #Docker #CloudNative #DistributedSystems
&lt;/h1&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>microservices</category>
      <category>docker</category>
    </item>
    <item>
      <title>I Built a Multi-Service Kubernetes App and Here's What Actually Broke</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 31 Jan 2026 06:32:21 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-4f99</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-4f99</guid>
      <description>&lt;p&gt;I spent the last few weeks deploying a multi-service voting application on Kubernetes.&lt;/p&gt;

&lt;p&gt;Not because I needed a voting app. Because I needed to understand how Kubernetes actually handles real application traffic.&lt;/p&gt;

&lt;p&gt;There's a gap between running a single container in a pod and understanding how multiple services discover each other, how traffic flows internally, and how external requests actually reach your application.&lt;/p&gt;

&lt;p&gt;This project closed that gap for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A voting system with five independent components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting frontend (where users vote)&lt;/li&gt;
&lt;li&gt;Results frontend (where users see results)&lt;/li&gt;
&lt;li&gt;Redis (acting as a queue)&lt;/li&gt;
&lt;li&gt;PostgreSQL (persistent storage)&lt;/li&gt;
&lt;li&gt;Worker service (processes votes asynchronously)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each component runs in its own container. Each is managed independently by Kubernetes. None of them know pod IPs. Everything communicates through service discovery.&lt;/p&gt;

&lt;p&gt;This mirrors how real microservices work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Isn't Random
&lt;/h2&gt;

&lt;p&gt;I didn't pick this setup arbitrarily. This is what actual distributed systems look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend services are stateless and can scale horizontally&lt;/li&gt;
&lt;li&gt;Data services are isolated for persistence&lt;/li&gt;
&lt;li&gt;Communication happens via stable network abstractions&lt;/li&gt;
&lt;li&gt;External traffic enters through a controlled entry point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes handles the orchestration. I needed to understand how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes Objects I Actually Used
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Deployments&lt;/strong&gt;&lt;br&gt;
These manage the workloads. They define replica counts and ensure pods get recreated if they fail. Every major component runs as a Deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods&lt;/strong&gt;&lt;br&gt;
The smallest unit Kubernetes schedules. They're ephemeral. They die and get recreated. You never access them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services&lt;/strong&gt;&lt;br&gt;
This is where it clicked for me. Services provide stable DNS names and IPs. Pods can change IPs constantly. Services don't. All internal communication goes through Services.&lt;/p&gt;

&lt;p&gt;I used two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ClusterIP&lt;/code&gt; for internal-only communication (Redis, PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NodePort&lt;/code&gt; temporarily for testing frontends before I understood Ingress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ingress&lt;/strong&gt;&lt;br&gt;
Defines HTTP routing rules for external traffic. Host-based and path-based routing through a single entry point.&lt;/p&gt;

&lt;p&gt;Here's what tripped me up: Ingress resources don't do anything by themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingress Controller&lt;/strong&gt;&lt;br&gt;
This is the actual component that receives and processes traffic. It runs as a pod and dynamically configures itself based on Ingress rules.&lt;/p&gt;

&lt;p&gt;Without an Ingress Controller installed, your Ingress rules are useless. I learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Traffic Actually Flows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Internal Traffic
&lt;/h3&gt;

&lt;p&gt;Inside the cluster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Voting frontend sends votes to Redis using the Redis Service name&lt;/li&gt;
&lt;li&gt;Worker reads from Redis using the Redis Service name&lt;/li&gt;
&lt;li&gt;Worker writes results to PostgreSQL using the database Service name&lt;/li&gt;
&lt;li&gt;Results frontend reads from PostgreSQL using the database Service name&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No pod IPs anywhere. Service DNS gets resolved automatically by Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  External Traffic
&lt;/h3&gt;

&lt;p&gt;From the browser to the application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends HTTP request&lt;/li&gt;
&lt;li&gt;Request hits the Ingress Controller&lt;/li&gt;
&lt;li&gt;Ingress rules get evaluated&lt;/li&gt;
&lt;li&gt;Traffic forwards to the correct Service&lt;/li&gt;
&lt;li&gt;Service load-balances to backend pods&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ingress operates at the HTTP level. It's the production-grade way to expose applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Broke (and What I Learned)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pod IPs Keep Changing
&lt;/h3&gt;

&lt;p&gt;Pods were getting recreated automatically. Their IPs changed every time. Hardcoding IPs didn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use Services. Always. Services provide stable endpoints. This is what they're designed for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Types Confused Me
&lt;/h3&gt;

&lt;p&gt;I didn't understand why there were multiple Service types or when to use which one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;code&gt;ClusterIP&lt;/code&gt; is for internal communication only. &lt;code&gt;NodePort&lt;/code&gt; exposes services on node IPs (useful for testing, not for production). Ingress is the right way to handle external HTTP traffic.&lt;/p&gt;
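
&lt;p&gt;The difference between the two types is essentially one field. A minimal sketch (names and ports are illustrative):&lt;/p&gt;

```yaml
# ClusterIP (the default): reachable only from inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  type: ClusterIP
  selector:
    app: redis
  ports:
  - port: 6379
---
# NodePort: additionally opens a high port (30000-32767) on every node
apiVersion: v1
kind: Service
metadata:
  name: voting-service
spec:
  type: NodePort
  selector:
    app: voting-app
  ports:
  - port: 80
    nodePort: 30080
```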

&lt;h3&gt;
  
  
  Ingress Didn't Work
&lt;/h3&gt;

&lt;p&gt;I created Ingress resources. Traffic still wasn't reaching my apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; You need an Ingress Controller installed separately. The Ingress resource is just configuration. The controller is what actually processes traffic. Once I installed the controller, everything worked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress Controller Wouldn't Schedule
&lt;/h3&gt;

&lt;p&gt;The controller pod was stuck in pending state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; In my local cluster, I needed to fix node labels and tolerations so the controller could schedule on the control-plane node. Managed cloud clusters rarely have this problem, but it matters in local setups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Networking Doesn't Work Like Cloud
&lt;/h3&gt;

&lt;p&gt;External access from my browser didn't work directly in my container-based local cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Port forwarding. I forwarded the Ingress Controller port locally. This simulates how cloud load balancers work but adapted for local development.&lt;/p&gt;
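
&lt;p&gt;The forwarding itself is a single command. The namespace and service name below are the defaults for a standard nginx-ingress install, so adjust them for your setup:&lt;/p&gt;

```shell
# Forward local port 8080 to the Ingress Controller's port 80
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80

# In another terminal, requests now flow through the Ingress rules
curl http://localhost:8080/vote
```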

&lt;h3&gt;
  
  
  Service Names Didn't Resolve Everywhere
&lt;/h3&gt;

&lt;p&gt;Service names weren't resolving across namespaces or from outside the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Kubernetes service DNS is namespace-scoped by default. I learned to use fully qualified domain names when needed and understood where DNS resolution actually works.&lt;/p&gt;
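
&lt;p&gt;The pattern that made it click: a Service's fully qualified name is always &lt;code&gt;&amp;lt;service&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;. For example (the &lt;code&gt;backend&lt;/code&gt; namespace here is illustrative):&lt;/p&gt;

```yaml
# Resolves only from pods in the same namespace
REDIS_HOST: "redis"

# Resolves from any namespace in the cluster
REDIS_HOST: "redis.backend.svc.cluster.local"
```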

&lt;h2&gt;
  
  
  What I Actually Understand Now
&lt;/h2&gt;

&lt;p&gt;Before this project, I could write Kubernetes manifests. But I didn't really get how the pieces connected.&lt;/p&gt;

&lt;p&gt;Now I understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes networking is service-driven, not pod-driven&lt;/li&gt;
&lt;li&gt;Ingress needs both rules and a controller to function&lt;/li&gt;
&lt;li&gt;Local clusters behave differently than cloud clusters&lt;/li&gt;
&lt;li&gt;Service discovery happens through DNS, not hardcoded IPs&lt;/li&gt;
&lt;li&gt;Debugging requires understanding both the platform and the application&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This isn't about running containers. It's about understanding how Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routes traffic between services&lt;/li&gt;
&lt;li&gt;Discovers services dynamically&lt;/li&gt;
&lt;li&gt;Separates internal and external networking&lt;/li&gt;
&lt;li&gt;Enforces declarative state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this mental model clicked, advanced topics started making sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Takeaway
&lt;/h2&gt;

&lt;p&gt;Build it once to make it work.&lt;/p&gt;

&lt;p&gt;Break it to understand why it works.&lt;/p&gt;

&lt;p&gt;I could have just deployed this app using a tutorial and called it done. But I wouldn't have learned how service discovery actually functions, or why Ingress controllers exist, or what happens when pods get recreated.&lt;/p&gt;

&lt;p&gt;The debugging forced me to understand the platform, not just the syntax.&lt;/p&gt;

&lt;p&gt;If you're learning Kubernetes, pick a multi-service application and deploy it. Then break it. Then fix it. That's where the understanding comes from.&lt;/p&gt;

&lt;p&gt;What's been the hardest part of Kubernetes for you? Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full code and setup instructions:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/kubernetes-sample-voting-app-project1" rel="noopener noreferrer"&gt;kubernetes-sample-voting-app-project1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture diagram and detailed breakdown:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficz0tzrro0djdprzo044.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficz0tzrro0djdprzo044.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #kubernetes #devops #learning #microservices
&lt;/h1&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>devops</category>
      <category>network</category>
    </item>
    <item>
      <title>Building a Production-Style AWS ECS Platform with Terraform (Without Community Modules)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 19 Jan 2026 17:10:59 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/building-a-production-style-aws-ecs-platform-with-terraform-without-community-modules-4pnc</link>
      <guid>https://forem.com/adil-khan-723/building-a-production-style-aws-ecs-platform-with-terraform-without-community-modules-4pnc</guid>
      <description>&lt;p&gt;For the last three weeks, I've been building a production-style AWS infrastructure using Terraform, ECS (Fargate), Docker, and Jenkins.&lt;/p&gt;



&lt;h2&gt;
  
  
  ⚠️ Important Clarification
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I did not use community Terraform modules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every Terraform module in this project was written from scratch by me.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;This was not an accident. It was a conscious decision to trade speed for understanding.&lt;/p&gt;

&lt;p&gt;I didn't want to learn how to &lt;strong&gt;use&lt;/strong&gt; Terraform.&lt;br&gt;
I wanted to learn how infrastructure &lt;strong&gt;actually behaves&lt;/strong&gt; under real constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article documents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architecture I built&lt;/li&gt;
&lt;li&gt;How the system works end-to-end&lt;/li&gt;
&lt;li&gt;The real problems I ran into&lt;/li&gt;
&lt;li&gt;Why those problems mattered&lt;/li&gt;
&lt;li&gt;What this project fundamentally changed in how I think about infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Motivation: Why Build Everything From Scratch?
&lt;/h2&gt;

&lt;p&gt;Terraform community modules are powerful. They are also abstractions.&lt;/p&gt;

&lt;p&gt;After using them in the past, I realized something uncomfortable:&lt;/p&gt;

&lt;p&gt;I could deploy fairly complex infrastructure without truly understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why certain IAM permissions were required&lt;/li&gt;
&lt;li&gt;How traffic actually flowed through the network&lt;/li&gt;
&lt;li&gt;What Terraform needed during &lt;code&gt;plan&lt;/code&gt; vs &lt;code&gt;apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How ECS, ALBs, and IAM interact internally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So I imposed a hard rule on myself:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No community modules&lt;/li&gt;
&lt;li&gt;❌ No copy-pasting large IAM policies without understanding them&lt;/li&gt;
&lt;li&gt;❌ No "just works" defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If something broke, I wanted to know why it broke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This decision turned a "simple ECS project" into a deep learning exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ High-Level System Overview
&lt;/h2&gt;

&lt;p&gt;At a high level, this is a &lt;strong&gt;two-tier containerized application&lt;/strong&gt; deployed on AWS ECS using Fargate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filrbz6lyd3q581dgnn7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filrbz6lyd3q581dgnn7v.png" alt="diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is &lt;strong&gt;intentionally private by default&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Frontend Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Publicly accessible only through a &lt;strong&gt;Public Application Load Balancer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Runs as &lt;strong&gt;ECS Fargate tasks&lt;/strong&gt; inside private subnets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No public IPs&lt;/strong&gt; on ECS tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔒 Backend Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Completely private&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Accessible only via an &lt;strong&gt;Internal Application Load Balancer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Runs as &lt;strong&gt;ECS Fargate tasks&lt;/strong&gt; inside private subnets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero direct internet exposure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frontend never talks directly to the backend container.&lt;br&gt;
All communication flows through load balancers.&lt;/p&gt;

&lt;p&gt;This wasn't just architectural purity—it simplified security reasoning and debugging.&lt;/p&gt;



&lt;h3&gt;
  
  
  📋 Architecture at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Public ALB → ECS Fargate (private subnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Internal ALB → ECS Fargate (private subnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jenkins EC2 → Docker → ECR → Terraform → ECS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB locking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Networking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom VPC, NAT Gateway, multi-AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🌐 Networking Design: What "Private" Actually Means
&lt;/h2&gt;

&lt;p&gt;I created a custom VPC and treated networking as a first-class concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC Layout
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🟢 Public Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Application Load Balancer&lt;/li&gt;
&lt;li&gt;Jenkins EC2 instance&lt;/li&gt;
&lt;li&gt;Internet Gateway attached&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔴 Private Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend ECS tasks&lt;/li&gt;
&lt;li&gt;Backend ECS tasks&lt;/li&gt;
&lt;li&gt;Internal Application Load Balancer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔄 Ingress Flow
&lt;/h3&gt;

&lt;p&gt;Internet → Public ALB → Frontend ECS (private subnet) → Internal ALB → Backend ECS (private subnet)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No other ingress paths exist.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚠️ Egress Reality Check
&lt;/h3&gt;

&lt;p&gt;One of the earliest failures I hit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ECS tasks couldn't pull images from ECR.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reason wasn't ECS or IAM.&lt;br&gt;
&lt;strong&gt;It was networking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private subnets do not magically have outbound internet access.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fixing this forced me to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ NAT Gateways&lt;/li&gt;
&lt;li&gt;✅ Route tables&lt;/li&gt;
&lt;li&gt;✅ Why "private subnet" doesn't mean "isolated from the world" by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This single issue reshaped how I think about AWS networking.&lt;/strong&gt;&lt;/p&gt;
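&lt;p&gt;To make the fix concrete, here's a minimal Terraform sketch of the missing pieces (resource names like &lt;code&gt;aws_nat_gateway.this&lt;/code&gt; are illustrative, not the project's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The NAT Gateway itself lives in a PUBLIC subnet and needs an Elastic IP
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
}

# Private subnets get a default route THROUGH the NAT Gateway
resource "aws_route" "private_egress" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without that route table entry, "private" quietly means "no egress at all", which is exactly why the image pulls failed.&lt;/p&gt;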




&lt;h2&gt;
  
  
  🐳 Containerization and Image Flow
&lt;/h2&gt;

&lt;p&gt;Docker is used for packaging both services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Images are tagged with &lt;strong&gt;Git commit SHA&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No &lt;code&gt;latest&lt;/code&gt; tags&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Every deployment is traceable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image flow looks like this:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jenkins builds Docker images&lt;/li&gt;
&lt;li&gt;Images are pushed to Amazon ECR&lt;/li&gt;
&lt;li&gt;ECS pulls images using the task execution role&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This strict immutability made rollbacks and debugging significantly easier.&lt;/p&gt;
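&lt;p&gt;If you want the registry to enforce this too, ECR can reject tag overwrites entirely. A sketch (the repository name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecr_repository" "frontend" {
  name = "frontend"

  # "IMMUTABLE" makes re-pushing an existing tag fail,
  # so a Git SHA tag can never silently change contents
  image_tag_mutability = "IMMUTABLE"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;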




&lt;h2&gt;
  
  
  🛠️ Terraform Architecture: Everything Modularized
&lt;/h2&gt;

&lt;p&gt;The entire infrastructure is provisioned using Terraform.&lt;/p&gt;

&lt;p&gt;But instead of one massive configuration, I designed &lt;strong&gt;small, focused modules&lt;/strong&gt;, each with a single responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 Custom Modules Include
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC&lt;/td&gt;
&lt;td&gt;Network foundation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnets&lt;/td&gt;
&lt;td&gt;Public/Private isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route Tables&lt;/td&gt;
&lt;td&gt;Traffic routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Groups&lt;/td&gt;
&lt;td&gt;Firewall rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public ALB&lt;/td&gt;
&lt;td&gt;Internet-facing load balancer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal ALB&lt;/td&gt;
&lt;td&gt;Private load balancer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Cluster&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Task Definitions&lt;/td&gt;
&lt;td&gt;Container specifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Services&lt;/td&gt;
&lt;td&gt;Service management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR Repository&lt;/td&gt;
&lt;td&gt;Image storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Roles &amp;amp; Policies&lt;/td&gt;
&lt;td&gt;Permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jenkins EC2&lt;/td&gt;
&lt;td&gt;CI/CD server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote Backend&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Each module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Exposes only required outputs&lt;/li&gt;
&lt;li&gt;✅ Avoids leaking internal resource details&lt;/li&gt;
&lt;li&gt;✅ Enforces clear dependency boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made the system easier to reason about—and easier to break in controlled ways.&lt;/p&gt;
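&lt;p&gt;As an example of what "exposes only required outputs" looks like in practice (file path and names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# modules/internal_alb/outputs.tf
output "dns_name" {
  value = aws_lb.internal.dns_name
}

output "target_group_arn" {
  value = aws_lb_target_group.backend.arn
}

# Listeners and security group wiring stay private to the module:
# consumers depend on these two outputs and nothing else.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;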




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flullmgm5r4mwjgtnqgp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flullmgm5r4mwjgtnqgp5.png" alt="terraform-resources" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Remote State and Locking
&lt;/h2&gt;

&lt;p&gt;Terraform state is stored remotely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; for state storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt; for state locking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This became critical once CI/CD entered the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without locking, concurrent applies from Jenkins would have been a disaster.&lt;/strong&gt;&lt;/p&gt;
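&lt;p&gt;The backend configuration for this is small. A sketch (bucket and table names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket  = "my-terraform-state-bucket"
    key     = "ecs-platform/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # DynamoDB table with a "LockID" partition key;
    # a second apply blocks until the lock is released
    dynamodb_table = "terraform-locks"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;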




&lt;h2&gt;
  
  
  🔄 CI/CD Design with Jenkins
&lt;/h2&gt;

&lt;p&gt;Jenkins runs on an &lt;strong&gt;EC2 instance inside the VPC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I intentionally separated &lt;strong&gt;CI and CD responsibilities&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔨 CI Pipeline (Build &amp;amp; Package)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Triggered on GitHub push:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checkout code&lt;/li&gt;
&lt;li&gt;Build frontend and backend Docker images&lt;/li&gt;
&lt;li&gt;Tag images with Git SHA&lt;/li&gt;
&lt;li&gt;Push images to ECR&lt;/li&gt;
&lt;li&gt;Export image tags as artifacts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✅ CI has no infrastructure permissions.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 CD Pipeline (Deploy via Terraform)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Triggered after CI completion:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch image tag artifacts&lt;/li&gt;
&lt;li&gt;Assume AWS IAM role via STS&lt;/li&gt;
&lt;li&gt;Run terraform init&lt;/li&gt;
&lt;li&gt;Run terraform apply&lt;/li&gt;
&lt;li&gt;Update ECS task definitions with new image versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✅ Terraform is the only deployment mechanism.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No manual ECS changes.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;No clicking in the console.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔐 IAM: The True Difficulty of the Project
&lt;/h2&gt;

&lt;p&gt;IAM was the &lt;strong&gt;hardest and most educational&lt;/strong&gt; part of this project.&lt;/p&gt;

&lt;p&gt;Because I didn't use community modules, I had to discover every one of IAM's sharp edges myself.&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 What I Learned About IAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Terraform needs read permissions even during creation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt; fails without &lt;code&gt;Describe*&lt;/code&gt; and &lt;code&gt;Get*&lt;/code&gt; permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing permissions that broke my plans:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iam:GetPolicyVersion&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ec2:DescribeVpcAttribute&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;elasticloadbalancing:DescribeLoadBalancerAttributes&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ECS failures were often IAM issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Incorrect &lt;code&gt;iam:PassRole&lt;/code&gt; configuration&lt;/li&gt;
&lt;li&gt;❌ Confusion between &lt;strong&gt;execution role&lt;/strong&gt; vs &lt;strong&gt;task role&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
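&lt;p&gt;The execution-role vs task-role split is easiest to see in the task definition itself. A fragment (role names are illustrative, other required arguments omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecs_task_definition" "backend" {
  family = "backend"

  # Execution role: used by ECS itself to pull from ECR and write logs.
  # This is also the role the caller needs iam:PassRole for.
  execution_role_arn = aws_iam_role.task_execution.arn

  # Task role: assumed by YOUR container at runtime for app-level AWS calls
  task_role_arn = aws_iam_role.task.arn

  # (container_definitions, cpu, memory, network_mode, etc. omitted)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;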

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most of my "Terraform errors" were actually IAM design errors.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This project forced me to deeply understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Trust relationships&lt;/li&gt;
&lt;li&gt;✅ Role assumption via STS&lt;/li&gt;
&lt;li&gt;✅ Least-privilege policy design&lt;/li&gt;
&lt;li&gt;✅ How AWS services act on behalf of other services&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🐛 ECS Debugging: Nothing Is Isolated
&lt;/h2&gt;

&lt;p&gt;ECS failures required &lt;strong&gt;system-level thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A task failing could be caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Image pull failures&lt;/li&gt;
&lt;li&gt;❌ Missing IAM permissions&lt;/li&gt;
&lt;li&gt;❌ Incorrect security group rules&lt;/li&gt;
&lt;li&gt;❌ ALB health check mismatch&lt;/li&gt;
&lt;li&gt;❌ Networking misconfiguration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;There is no single log that tells the full story.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to understand how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS&lt;/li&gt;
&lt;li&gt;ALB&lt;/li&gt;
&lt;li&gt;IAM&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;work together.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project forced me to build that mental model.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Key Mistakes I Made (So You Don't Have To)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1️⃣ Assuming private subnets get outbound internet automatically
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality:&lt;/strong&gt; Private subnets need NAT Gateway + route table configuration&lt;/li&gt;
&lt;li&gt;Cost me hours debugging ECR pull failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2️⃣ Treating IAM as an afterthought
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IAM should be designed &lt;strong&gt;first&lt;/strong&gt;, not patched later&lt;/li&gt;
&lt;li&gt;Most "Terraform errors" were actually IAM design errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3️⃣ Not separating CI and CD early
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initially mixed build and deploy logic&lt;/li&gt;
&lt;li&gt;Separation made debugging and security much cleaner&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4️⃣ Underestimating security group complexity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Had to trace traffic flow through multiple layers&lt;/li&gt;
&lt;li&gt;One missing rule broke the entire deployment&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  📊 Project by the Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Terraform modules written&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS resources managed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM policies debugged&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too many to count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker images built&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failed deployments before success&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🎓 What This Project Gave Me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Confidence writing Terraform modules from scratch&lt;/li&gt;
&lt;li&gt;✅ Strong understanding of IAM trust and permission boundaries&lt;/li&gt;
&lt;li&gt;✅ Practical experience debugging ECS + ALB + networking&lt;/li&gt;
&lt;li&gt;✅ Clear mental separation of CI vs CD&lt;/li&gt;
&lt;li&gt;✅ Comfort with production-style AWS constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;More importantly, it taught me how to reason about infrastructure instead of guessing.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛑 Why I'm Stopping Here
&lt;/h2&gt;

&lt;p&gt;This project achieved its learning goal.&lt;/p&gt;

&lt;p&gt;Continuing to polish it would bring diminishing returns.&lt;/p&gt;

&lt;p&gt;The biggest gap in my skill set now is &lt;strong&gt;Kubernetes&lt;/strong&gt;, and I'm moving there next—with the same approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No shortcuts&lt;/li&gt;
&lt;li&gt;✅ No blind abstractions&lt;/li&gt;
&lt;li&gt;✅ Build it, break it, debug it&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  💭 Final Thought
&lt;/h2&gt;
&lt;h3&gt;
  
  
  If you're learning DevOps:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't just use Terraform modules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write them. Break them. Fix them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's where real understanding comes from.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Questions for the Community
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🤔 What's the most painful IAM issue you've debugged?&lt;/li&gt;
&lt;li&gt;🤔 Do you prefer community modules or custom modules for learning?&lt;/li&gt;
&lt;li&gt;🤔 What infrastructure topic should I tackle next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Drop a comment—I read and respond to all of them.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #terraform #devops #ecs #docker #jenkins #infrastructure #cicd #learning #iac&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;If you found this helpful, give it a ❤️ and follow for more deep-dive DevOps content!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>docker</category>
      <category>jenkins</category>
    </item>
    <item>
      <title>I Took a Working Terraform Project and Rebuilt It Properly (ALB + EC2 + Modules + Remote State)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sun, 04 Jan 2026 16:24:29 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-took-a-working-terraform-project-and-rebuilt-it-properly-alb-ec2-modules-remote-state-4dpa</link>
      <guid>https://forem.com/adil-khan-723/i-took-a-working-terraform-project-and-rebuilt-it-properly-alb-ec2-modules-remote-state-4dpa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A lot of Terraform projects reach a point where things work — an EC2 launches, a load balancer responds, and &lt;code&gt;terraform apply&lt;/code&gt; finishes without errors.&lt;/p&gt;

&lt;p&gt;I reached that point too.&lt;/p&gt;

&lt;p&gt;But after spending more time with Terraform, I realized something uncomfortable: &lt;strong&gt;working infrastructure doesn't necessarily mean well-designed infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So instead of moving on, I decided to stop and rebuild the same project again — this time focusing on structure, clarity, and how the code would behave if it had to grow or be maintained.&lt;/p&gt;

&lt;p&gt;This post is a walkthrough of that process: starting from a non-modular Terraform setup and gradually refactoring it into a modular one, while dealing with the confusion, mistakes, and "aha" moments along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Set Out to Build
&lt;/h2&gt;

&lt;p&gt;The goal was simple in terms of resources, but intentional in design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple EC2 instances running Nginx&lt;/li&gt;
&lt;li&gt;An Application Load Balancer distributing traffic&lt;/li&gt;
&lt;li&gt;Separate security groups for ALB and EC2&lt;/li&gt;
&lt;li&gt;Dynamic target group registration&lt;/li&gt;
&lt;li&gt;Terraform remote state stored in S3&lt;/li&gt;
&lt;li&gt;State locking using DynamoDB&lt;/li&gt;
&lt;li&gt;A fully modular Terraform structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing exotic — just common AWS building blocks — but wired together in a way that reflects how Terraform is actually used beyond tutorials.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Did Not Start with Modules
&lt;/h2&gt;

&lt;p&gt;At first, I had a non-modular version of this project.&lt;/p&gt;

&lt;p&gt;Everything was in one place.&lt;br&gt;&lt;br&gt;
Resources referenced each other directly.&lt;br&gt;&lt;br&gt;
It worked.&lt;/p&gt;

&lt;p&gt;But that version taught me &lt;strong&gt;how Terraform executes&lt;/strong&gt;, not &lt;strong&gt;how Terraform should be structured&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before modularizing, I wanted to clearly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;count&lt;/code&gt; and &lt;code&gt;for_each&lt;/code&gt; really behave&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;count.index&lt;/code&gt; can cause problems later&lt;/li&gt;
&lt;li&gt;How Terraform decides resource identity&lt;/li&gt;
&lt;li&gt;What happens when you change inputs after resources already exist&lt;/li&gt;
&lt;li&gt;How state is affected when multiple resources depend on each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after seeing those problems firsthand did modularization start to make sense.&lt;/p&gt;


&lt;h2&gt;
  
  
  The First Real Shift: Stop Using &lt;code&gt;count&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;One of the biggest changes I made was moving away from &lt;code&gt;count&lt;/code&gt; and using &lt;code&gt;for_each&lt;/code&gt; everywhere.&lt;/p&gt;

&lt;p&gt;Instead of creating instances like "instance 0, 1, 2", I switched to maps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;instance-1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;instance-2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;instance-3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This immediately made things clearer.&lt;/p&gt;
&lt;h3&gt;
  
  
  With &lt;code&gt;for_each&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Resource names stay stable&lt;/li&gt;
&lt;li&gt;Outputs are predictable&lt;/li&gt;
&lt;li&gt;Wiring resources together becomes much easier&lt;/li&gt;
&lt;li&gt;You stop relying on numeric positions and start relying on intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this clicked, the rest of the project design became much cleaner.&lt;/p&gt;
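&lt;p&gt;A side-by-side sketch of the difference (the two resources wouldn't coexist in one file; names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# With count, identity is positional: removing the middle instance
# shifts every index after it and forces replacements.
resource "aws_instance" "web" {
  count     = 3
  subnet_id = var.subnet_ids[count.index]
}

# With for_each, identity is the map key: removing "instance-2"
# touches only that one resource.
resource "aws_instance" "web" {
  for_each  = var.instances   # { "instance-1" = "subnet-a", ... }
  subnet_id = each.value
  tags      = { Name = each.key }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;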


&lt;h2&gt;
  
  
  How Instance Creation is Handled
&lt;/h2&gt;

&lt;p&gt;In the root module, I generate a map that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;instance-name → subnet-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subnets are chosen dynamically using modulo logic so instances are spread across availability zones.&lt;/p&gt;
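&lt;p&gt;The map generation looks roughly like this (variable names are mine, not necessarily the repo's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  instance_names = ["instance-1", "instance-2", "instance-3"]

  # Round-robin instances across subnets with modulo
  instance_subnet_map = {
    for i, name in local.instance_names :
    name =&gt; var.subnet_ids[i % length(var.subnet_ids)]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;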

&lt;p&gt;That map is passed into the EC2 module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inside the EC2 module:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;for_each&lt;/code&gt; iterates over the map&lt;/li&gt;
&lt;li&gt;Each key becomes the instance &lt;code&gt;Name&lt;/code&gt; tag&lt;/li&gt;
&lt;li&gt;Each value becomes the &lt;code&gt;subnet_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps responsibility clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;root module&lt;/strong&gt; decides what should exist&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;EC2 module&lt;/strong&gt; decides how instances are created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation turned out to be very important later.&lt;/p&gt;




&lt;h2&gt;
  
  
  EC2 Module Design
&lt;/h2&gt;

&lt;p&gt;The EC2 module does only one job: &lt;strong&gt;create EC2 instances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does not decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many instances exist&lt;/li&gt;
&lt;li&gt;Which subnets to use&lt;/li&gt;
&lt;li&gt;How traffic reaches them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inputs include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AMI&lt;/li&gt;
&lt;li&gt;Instance type&lt;/li&gt;
&lt;li&gt;Key name&lt;/li&gt;
&lt;li&gt;Security group IDs&lt;/li&gt;
&lt;li&gt;A map of instance names to subnet IDs&lt;/li&gt;
&lt;li&gt;Optional user data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outputs return maps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Instance IDs&lt;/li&gt;
&lt;li&gt;Private IPs&lt;/li&gt;
&lt;li&gt;ARNs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Returning &lt;strong&gt;maps instead of lists&lt;/strong&gt; keeps instance identity intact when passing data to other modules.&lt;/p&gt;
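&lt;p&gt;Concretely, the outputs look something like this (file path and names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# modules/ec2/outputs.tf: keyed by instance name, not list position
output "instance_ids" {
  value = { for name, inst in aws_instance.this : name =&gt; inst.id }
}

output "private_ips" {
  value = { for name, inst in aws_instance.this : name =&gt; inst.private_ip }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;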




&lt;h2&gt;
  
  
  Security Groups: Keeping Things Isolated
&lt;/h2&gt;

&lt;p&gt;Instead of putting everything into one security group, I created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One security group for the ALB&lt;/li&gt;
&lt;li&gt;One security group for the EC2 instances&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The ALB security group:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allows inbound HTTP from the internet&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The EC2 security group:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allows inbound traffic &lt;strong&gt;only from the ALB security group&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Allows SSH only from a restricted CIDR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup drastically reduces exposure and makes traffic flow explicit instead of implicit.&lt;/p&gt;

&lt;p&gt;The security group module accepts ingress and egress rules as &lt;strong&gt;maps of objects&lt;/strong&gt;, which made it flexible without being complicated.&lt;/p&gt;
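&lt;p&gt;The key detail is that the EC2 ingress rule points at the ALB's security group ID instead of a CIDR block. A sketch (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Only traffic originating from the ALB's SG can reach the instances
resource "aws_security_group_rule" "ec2_from_alb" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;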




&lt;h2&gt;
  
  
  Load Balancer and Target Registration
&lt;/h2&gt;

&lt;p&gt;The load balancer module handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB creation&lt;/li&gt;
&lt;li&gt;Target group creation&lt;/li&gt;
&lt;li&gt;Listener configuration&lt;/li&gt;
&lt;li&gt;Target group attachments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that the ALB module &lt;strong&gt;does not care how EC2 instances are created&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It simply accepts a map of instance IDs.&lt;/p&gt;

&lt;p&gt;Inside the module, it loops over that map and attaches each instance to the target group dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No hardcoded references.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;No assumptions.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Just clean inputs and outputs.&lt;/strong&gt;&lt;/p&gt;
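&lt;p&gt;The attachment loop is short thanks to the map-shaped output from the EC2 module (names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_target_group_attachment" "this" {
  for_each         = var.instance_ids   # map: instance-name =&gt; instance-id
  target_group_arn = aws_lb_target_group.this.arn
  target_id        = each.value
  port             = 80
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;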




&lt;h2&gt;
  
  
  Remote State and Locking
&lt;/h2&gt;

&lt;p&gt;Terraform state is stored remotely in S3, with DynamoDB used for state locking.&lt;/p&gt;

&lt;p&gt;I intentionally included this even though I was working alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because this is where Terraform usage changes completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remote state with locking:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prevents concurrent applies&lt;/li&gt;
&lt;li&gt;Prevents accidental corruption&lt;/li&gt;
&lt;li&gt;Forces you to think about Terraform as a shared system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you use this setup, going back to local state feels wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  User Data and Verification
&lt;/h2&gt;

&lt;p&gt;Each EC2 instance runs a simple user data script that installs Nginx and serves a response identifying the instance.&lt;/p&gt;

&lt;p&gt;This made it easy to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_data&lt;/code&gt; execution&lt;/li&gt;
&lt;li&gt;Instance uniqueness&lt;/li&gt;
&lt;li&gt;Load balancer distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing traffic rotate across instances confirmed that everything was wired correctly.&lt;/p&gt;
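&lt;p&gt;The user data itself is a short bootstrap script, roughly like this (distro and paths are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
  # ... other arguments ...

  user_data = &lt;&lt;-EOF
    #!/bin/bash
    apt-get update -y
    apt-get install -y nginx
    # Identify which instance answered, so ALB rotation is visible
    echo "Served by $(hostname)" &gt; /var/www/html/index.html
    systemctl enable --now nginx
  EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;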




&lt;h2&gt;
  
  
  Challenges I Ran Into
&lt;/h2&gt;

&lt;p&gt;Some things that took time to understand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indexing errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why indexing errors happen with data sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;count breaks identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why count breaks identity when things change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Module outputs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why module outputs should usually preserve structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security group references&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How security group references differ from CIDR rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User data behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why user data doesn't rerun unless instances are replaced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DynamoDB locking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How DynamoDB locking behaves during apply&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each issue forced me to slow down and actually read what Terraform was doing instead of guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Represents for Me
&lt;/h2&gt;

&lt;p&gt;This project wasn't about adding more AWS services.&lt;/p&gt;

&lt;p&gt;It was about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing Terraform that is readable&lt;/li&gt;
&lt;li&gt;Making dependencies explicit&lt;/li&gt;
&lt;li&gt;Reducing assumptions&lt;/li&gt;
&lt;li&gt;Designing for change instead of just "apply success"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The biggest shift wasn't technical — it was mental.
&lt;/h3&gt;

&lt;p&gt;I stopped asking:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;"Does this work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And started asking:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;"Does this make sense if I come back in three months?"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Some natural extensions to this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto Scaling Groups&lt;/li&gt;
&lt;li&gt;HTTPS with ACM&lt;/li&gt;
&lt;li&gt;Monitoring and alarms&lt;/li&gt;
&lt;li&gt;CI/CD for Terraform&lt;/li&gt;
&lt;li&gt;Environment separation&lt;/li&gt;
&lt;li&gt;ECS or EKS later on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those only make sense once the foundation is solid.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Terraform feels difficult when you treat it like a scripting tool.&lt;/p&gt;

&lt;p&gt;It becomes much clearer when you treat it like a &lt;strong&gt;design tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Building something twice — once messy, once structured — taught me more than any single tutorial ever could.&lt;/p&gt;

&lt;h3&gt;
  
  
  If you're learning Terraform and feel stuck, my honest advice is:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Build it once just to make it work.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Then rebuild it to make it right.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's where the learning actually happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/adil-khan-723/terraform-project2-moudlarized" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub: terraform-project2-moudlarized&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect with Me
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/adilk3682" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; • &lt;a href="https://github.com/adil-khan-723" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; • &lt;a href="mailto:adilk81054@gmail.com"&gt;Email&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;If you found this helpful, consider giving the repository a star!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>infrastructureascode</category>
      <category>aws</category>
    </item>
    <item>
      <title>Why Refactoring AWS Infrastructure Taught Me More Than Building It</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 07:17:44 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/why-refactoring-aws-infrastructure-taught-me-more-than-building-it-3iml</link>
      <guid>https://forem.com/adil-khan-723/why-refactoring-aws-infrastructure-taught-me-more-than-building-it-3iml</guid>
      <description>&lt;p&gt;Most infrastructure projects work the first time because we push them until they do. But working infrastructure isn't the same as well-designed infrastructure.&lt;/p&gt;

&lt;p&gt;Six months ago, I built an AWS infrastructure with Terraform. It worked. I was proud. Last week, I looked at that same code and cringed.&lt;/p&gt;

&lt;p&gt;This is the story of what I learned by tearing it down and rebuilding it properly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Background and Motivation
&lt;/h2&gt;

&lt;p&gt;The original version of this project was built to validate concepts quickly. It provisioned EC2 instances, placed them behind a load balancer, and served traffic successfully. At the time, that felt like success.&lt;/p&gt;

&lt;p&gt;But after gaining more exposure to Terraform patterns and real-world infrastructure practices, revisiting the code made the gaps obvious. &lt;strong&gt;Decisions were made because they worked, not because they were well thought out.&lt;/strong&gt; Dependencies were forced instead of modeled. State handling existed, but wasn't fully understood.&lt;/p&gt;

&lt;p&gt;This refactor was an attempt to slow down and rebuild the same infrastructure while focusing on clarity, correctness, and maintainability.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Builds
&lt;/h2&gt;

&lt;p&gt;This project provisions a small but realistic AWS infrastructure stack using Terraform. The goal is not application complexity, but &lt;strong&gt;infrastructure correctness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The setup includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Multiple EC2 instances&lt;/li&gt;
&lt;li&gt;✅ An Application Load Balancer in front of them&lt;/li&gt;
&lt;li&gt;✅ A target group with health checks&lt;/li&gt;
&lt;li&gt;✅ Security groups enforcing clear traffic flow&lt;/li&gt;
&lt;li&gt;✅ Remote Terraform state with locking&lt;/li&gt;
&lt;li&gt;✅ Instance bootstrapping using user data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each EC2 instance runs Nginx and serves a simple page identifying the instance. This makes it easy to visually confirm load balancing behavior and instance health.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Internet
                       │
                       ▼
         ┌─────────────────────────┐
         │  Application Load       │
         │     Balancer (ALB)      │
         └─────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │      Target Group         │
         │    (Health Checks)        │
         └─────────────┬─────────────┘
                       │
      ┌────────────────┼────────────────┐
      │                │                │
┌─────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
│ EC2       │    │ EC2      │    │ EC2      │
│ Instance  │    │ Instance │    │ Instance │
│ (Nginx)   │    │ (Nginx)  │    │ (Nginx)  │
└───────────┘    └──────────┘    └──────────┘
  Subnet 1         Subnet 2        Subnet 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The infrastructure uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Default VPC&lt;/strong&gt; (intentionally chosen for this learning project, though production workloads should always use custom VPCs with proper CIDR planning)&lt;/li&gt;
&lt;li&gt;Public Application Load Balancer&lt;/li&gt;
&lt;li&gt;EC2 instances distributed across available subnets&lt;/li&gt;
&lt;li&gt;Target group attached to the ALB&lt;/li&gt;
&lt;li&gt;Security groups controlling inbound and outbound traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom VPC was intentionally avoided. The purpose of this project was not network design, but Terraform fundamentals: state management, resource relationships, dynamic creation, and clean structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Original Version&lt;/th&gt;
&lt;th&gt;Refactored Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instance Creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;count&lt;/code&gt; with hardcoded values&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;for_each&lt;/code&gt; with dynamic mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subnet Assignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual/hardcoded&lt;/td&gt;
&lt;td&gt;Modulo arithmetic distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;depends_on&lt;/code&gt; everywhere&lt;/td&gt;
&lt;td&gt;Implicit dependency graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local state file&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB locking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Overly permissive&lt;/td&gt;
&lt;td&gt;Principle of least privilege&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Lines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;487 lines&lt;/td&gt;
&lt;td&gt;312 lines (-36%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardcoded Values&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15+ IDs/ARNs&lt;/td&gt;
&lt;td&gt;0 (all dynamic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Terraform Concepts Applied
&lt;/h2&gt;

&lt;p&gt;This refactor focused heavily on using Terraform the way it's intended to be used.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Remote State Management
&lt;/h3&gt;

&lt;p&gt;Terraform state is stored in an S3 bucket with DynamoDB used for state locking. This prevents concurrent state corruption and reflects how Terraform is used in real team environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"oggy-backend-bucket"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Alb-project-non-module/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ap-south-1"&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"stateLock-table"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Data Sources
&lt;/h3&gt;

&lt;p&gt;Instead of hardcoding values, data sources are used to dynamically fetch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The default VPC&lt;/li&gt;
&lt;li&gt;Subnets within the VPC&lt;/li&gt;
&lt;li&gt;The latest Ubuntu 24.04 LTS AMI
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnets"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-id"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ami"&lt;/span&gt; &lt;span class="s2"&gt;"ubuntu"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;owners&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"099720109477"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Canonical&lt;/span&gt;

  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Dynamic Resource Creation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# Fragile, breaks if subnets change&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"instance-${i + 1}"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="err"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"vms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;instances&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ubuntu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type_instance&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EC2 instances are created dynamically using &lt;code&gt;for_each&lt;/code&gt; rather than static counts. This improves clarity, stability, and scalability of the configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Locals for Computed Logic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-1"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-2"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-3"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of writing out entries like these by hand, the map is computed in locals, keeping the resource blocks clean and readable.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Implicit Dependencies
&lt;/h3&gt;

&lt;p&gt;Rather than forcing execution order with &lt;code&gt;depends_on&lt;/code&gt;, resource relationships define the dependency graph naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original version had 8 explicit &lt;code&gt;depends_on&lt;/code&gt; blocks. The refactored version has 0.&lt;/strong&gt;&lt;/p&gt;
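&lt;p&gt;What replaced them is nothing special: referencing another resource's attribute is what creates the edge in the graph. A hedged sketch (the target group name is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# The attachment references both the target group and the instance,
# so Terraform orders creation correctly without depends_on.
resource "aws_lb_target_group_attachment" "web" {
  for_each         = aws_instance.vms
  target_group_arn = aws_lb_target_group.web.arn # implicit dependency
  target_id        = each.value.id               # implicit dependency
  port             = 80
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;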




&lt;h2&gt;
  
  
  Dynamic Instance and Subnet Distribution
&lt;/h2&gt;

&lt;p&gt;One of the most valuable improvements in this refactor was how instances are distributed across subnets.&lt;/p&gt;

&lt;p&gt;Instead of manually mapping instances to subnets, &lt;strong&gt;modulo arithmetic&lt;/strong&gt; is used to assign instances evenly across all available subnets.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Modulo arithmetic (&lt;code&gt;index % subnet_count&lt;/code&gt;) ensures even distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With 3 subnets and 6 instances:

&lt;ul&gt;
&lt;li&gt;Instances 0, 3 → Subnet 0&lt;/li&gt;
&lt;li&gt;Instances 1, 4 → Subnet 1&lt;/li&gt;
&lt;li&gt;Instances 2, 5 → Subnet 2&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Avoids hardcoding subnet IDs&lt;/li&gt;
&lt;li&gt;✅ Scales automatically if subnets change&lt;/li&gt;
&lt;li&gt;✅ Produces deterministic and predictable placement&lt;/li&gt;
&lt;li&gt;✅ Works across any number of availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This logic alone made the configuration significantly more robust than the original version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security Group Design
&lt;/h2&gt;

&lt;p&gt;Security groups are designed with intent rather than convenience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ALB Security Group&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"alb"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"alb-security-group"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow HTTP inbound traffic"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# EC2 Security Group&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"ec2"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ec2-security-group"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow traffic from ALB only"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Application Load Balancer allows inbound HTTP traffic from anywhere&lt;/li&gt;
&lt;li&gt;EC2 instances &lt;strong&gt;only&lt;/strong&gt; allow inbound traffic from the ALB security group&lt;/li&gt;
&lt;li&gt;Outbound traffic is permitted for updates and health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enforces a clear and understandable traffic flow: &lt;strong&gt;public access ends at the load balancer, and instances remain protected behind it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Bootstrapping with User Data
&lt;/h2&gt;

&lt;p&gt;Each EC2 instance is bootstrapped at launch using a user data script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
apt-get update &lt;span class="nt"&gt;-y&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx

&lt;span class="nv"&gt;INSTANCE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;instance_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /var/www/html/index.html &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;Instance: &lt;/span&gt;&lt;span class="nv"&gt;$INSTANCE_NAME&lt;/span&gt;&lt;span class="sh"&gt;&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;Hello from &lt;/span&gt;&lt;span class="nv"&gt;$INSTANCE_NAME&lt;/span&gt;&lt;span class="sh"&gt;&amp;lt;/h1&amp;gt;
    &amp;lt;p&amp;gt;This instance is managed by Terraform&amp;lt;/p&amp;gt;
    &amp;lt;p&amp;gt;Load balancing is working correctly!&amp;lt;/p&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nginx
systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates the system&lt;/li&gt;
&lt;li&gt;Installs Nginx&lt;/li&gt;
&lt;li&gt;Serves a simple instance-specific web page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes validation straightforward. If the ALB DNS shows responses from different instances, both provisioning and health checks are working as expected.&lt;/p&gt;
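&lt;p&gt;The &lt;code&gt;${instance_name}&lt;/code&gt; placeholder in the script is a Terraform template variable, rendered per instance at plan time. One way to wire it up (the file name &lt;code&gt;userdata.sh&lt;/code&gt; is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_instance" "vms" {
  for_each = local.instances
  # ... ami, instance_type, subnet_id as shown earlier ...

  # Render the script once per instance, filling in its name.
  user_data = templatefile("${path.module}/userdata.sh", {
    instance_name = each.key
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;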




&lt;h2&gt;
  
  
  What Broke During Refactoring (And Why That's Good)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: State Lock Timeout
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; First remote state migration failed with lock timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Didn't understand DynamoDB table requirements properly. The table needed a primary key named &lt;code&gt;LockID&lt;/code&gt; (case-sensitive).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; State locking isn't automatic—it requires proper table schema. Reading error messages carefully saves hours.&lt;/p&gt;
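&lt;p&gt;For reference, a lock table with the schema Terraform expects looks roughly like this (the table name matches the backend configuration shown earlier; the billing mode is a matter of preference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_dynamodb_table" "state_lock" {
  name         = "stateLock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # must be exactly this, case-sensitive

  attribute {
    name = "LockID"
    type = "S" # string
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;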

&lt;h3&gt;
  
  
  Issue 2: Target Group Attachment Race Condition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Instances registered before they were ready, causing initial health check failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Health check thresholds were too aggressive, and the user data script needs time to finish installing Nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; AWS eventual consistency requires patience in automation. Added sensible health check intervals and thresholds (grace periods as such only exist once an Auto Scaling Group is involved).&lt;/p&gt;
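&lt;p&gt;The tuning lives in the target group's &lt;code&gt;health_check&lt;/code&gt; block; values along these lines give instances time to finish bootstrapping (exact numbers and the resource name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_lb_target_group" "web" {
  port     = 80
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    path                = "/"
    interval            = 30 # seconds between checks
    timeout             = 5
    healthy_threshold   = 2  # consecutive passes before "healthy"
    unhealthy_threshold = 3
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;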

&lt;h3&gt;
  
  
  Issue 3: Circular Dependency with Security Groups
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Terraform complained about circular dependencies when trying to reference security groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Tried to be too clever with cross-referencing security group rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Sometimes the simplest approach is the best. Separated ingress rules into distinct resources when needed.&lt;/p&gt;
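&lt;p&gt;Standalone rule resources break the cycle because both groups are created first and the rule references them afterwards. A sketch of the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Both security groups exist before this rule is evaluated,
# so there is no cycle between their definitions.
resource "aws_security_group_rule" "alb_to_ec2" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;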

&lt;h3&gt;
  
  
  Issue 4: Subnet Data Source Returned Unexpected Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Got 6 subnets instead of expected 3 in my default VPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The default VPC contains one default subnet per Availability Zone, plus any subnets added to it later, and the data source returns all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Always validate data source outputs. Added filters to ensure I'm only using the subnets I intend to use.&lt;/p&gt;
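&lt;p&gt;One such filter restricts the query to the default subnet of each Availability Zone (whether this is the right filter depends on how the extra subnets were created):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  # Keep only the default subnet per AZ, excluding manually added ones.
  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;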




&lt;h2&gt;
  
  
  Measurable Improvements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code&lt;/td&gt;
&lt;td&gt;487&lt;/td&gt;
&lt;td&gt;312&lt;/td&gt;
&lt;td&gt;-36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit dependencies&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardcoded values&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State lock conflicts&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnet scalability&lt;/td&gt;
&lt;td&gt;Fixed to 3&lt;/td&gt;
&lt;td&gt;Dynamic&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code readability&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Other Learners
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. State Management Isn't Optional
&lt;/h3&gt;

&lt;p&gt;Even for learning projects, use remote state. The habits matter more than the project size. I spent 2 hours debugging a state corruption issue that would have been prevented by proper locking.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic Beats Static
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;for_each&lt;/code&gt; is harder to learn than &lt;code&gt;count&lt;/code&gt;, but it's worth the investment. The flexibility and clarity it provides compounds over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Read the Dependency Graph
&lt;/h3&gt;

&lt;p&gt;Run this command and actually look at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform graph | dot &lt;span class="nt"&gt;-Tpng&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; graph.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will show you what Terraform actually understands about your infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Refactoring &amp;gt; New Projects
&lt;/h3&gt;

&lt;p&gt;Building something new teaches you syntax. Rebuilding teaches you design. I learned more in this refactor than in the original build.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Document Your Mistakes
&lt;/h3&gt;

&lt;p&gt;My original code had 8 explicit &lt;code&gt;depends_on&lt;/code&gt; blocks. All were unnecessary. That's valuable to know and remember.&lt;/p&gt;
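&lt;p&gt;A minimal illustration of why (names here are hypothetical): when one resource references another's attribute, Terraform already infers the ordering, so spelling it out adds nothing:&lt;/p&gt;

```hcl
# The subnet_id reference already tells Terraform to create the subnet
# first — an explicit depends_on here is pure noise.
resource "aws_instance" "web" {
  ami           = var.ami_id                  # hypothetical variable
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public["us-east-1a"].id  # implicit dependency

  # depends_on = [aws_subnet.public]          # unnecessary — already inferred
}
```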

&lt;h3&gt;
  
  
  6. Slow Down to Speed Up
&lt;/h3&gt;

&lt;p&gt;The original project took 3 days of "making it work." The refactor took 2 days of "making it right." But now I understand it 10x better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Refactoring Was More Valuable Than the Original Build
&lt;/h2&gt;

&lt;p&gt;The first version taught me how to assemble resources.&lt;/p&gt;

&lt;p&gt;The refactored version taught me &lt;strong&gt;why certain Terraform patterns exist.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rebuilding the project exposed assumptions I didn't know I was making the first time. It forced me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question every hardcoded value&lt;/li&gt;
&lt;li&gt;Understand the difference between implicit and explicit dependencies&lt;/li&gt;
&lt;li&gt;Think about how the code would scale&lt;/li&gt;
&lt;li&gt;Consider how someone else would read and modify this code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That process made the concepts stick far more effectively than building something new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best learning happens when you're forced to justify your decisions to yourself.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Want to see the difference? Clone the repository and explore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/adil-khan-723/terraform_project2_refactored
&lt;span class="nb"&gt;cd &lt;/span&gt;terraform_project2_refactored

&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review the plan&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Apply the configuration&lt;/span&gt;
terraform apply

&lt;span class="c"&gt;# Get the ALB DNS name&lt;/span&gt;
terraform output alb_dns_name

&lt;span class="c"&gt;# Test the load balancer (you'll see different instances)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;curl http://&amp;lt;alb-dns-name&amp;gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Clean up&lt;/span&gt;
terraform destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm currently working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Converting this into reusable Terraform modules&lt;/li&gt;
&lt;li&gt;📈 Adding Auto Scaling Groups for dynamic scaling&lt;/li&gt;
&lt;li&gt;🔒 Implementing HTTPS with AWS Certificate Manager&lt;/li&gt;
&lt;li&gt;🌐 Building a custom VPC version with proper network segmentation&lt;/li&gt;
&lt;li&gt;📊 Adding CloudWatch dashboards and alarms&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Repository and Source Code
&lt;/h2&gt;

&lt;p&gt;The complete source code, file structure, and documentation are available here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/terraform_project2_refactored" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/terraform_project2_refactored&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean Terraform files with clear separation of concerns&lt;/li&gt;
&lt;li&gt;Comprehensive README with architecture diagrams&lt;/li&gt;
&lt;li&gt;No committed state or local artifacts&lt;/li&gt;
&lt;li&gt;A readable, review-friendly structure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code is as much about the "code" part as it is about the "infrastructure" part. Clean, maintainable, understandable code matters—even when you're the only person who will read it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The infrastructure worked both times. But only the second time did I understand why.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the difference between code that works and code that teaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect With Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Have you refactored your own infrastructure code?&lt;/strong&gt; What surprised you most? Drop a comment below—I'd love to hear about your experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/adilk3682" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/adilk3682&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you found this helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ Star the &lt;a href="https://github.com/adil-khan-723/terraform_project2_refactored" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💬 Share your own refactoring stories in the comments&lt;/li&gt;
&lt;li&gt;🔗 Connect with me on LinkedIn&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! Happy Terraforming! 🚀&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why I Stopped Using a Bastion Host and Moved to AWS SSM for Private EC2 Access</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 15 Dec 2025 08:57:26 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/why-i-stopped-using-a-bastion-host-and-moved-to-aws-ssm-for-private-ec2-access-3ne8</link>
      <guid>https://forem.com/adil-khan-723/why-i-stopped-using-a-bastion-host-and-moved-to-aws-ssm-for-private-ec2-access-3ne8</guid>
      <description>&lt;p&gt;When I started designing my AWS setup, one of my early goals was clear:&lt;br&gt;
keep backend servers completely private.&lt;/p&gt;

&lt;p&gt;So my EC2 instances lived in private subnets, with no public IPs. That felt right from a security perspective.&lt;br&gt;
But very quickly, a practical problem showed up:&lt;/p&gt;

&lt;p&gt;If my servers aren’t reachable from the internet, how do I access them when something breaks?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3fbtzj7uccdcx9lwtvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3fbtzj7uccdcx9lwtvm.png" alt="AWS ssm Architecture image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article isn’t a guide on how to configure access.&lt;br&gt;
It’s about what I tried, what felt wrong, and what finally clicked.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The First Solution I Tried: A Bastion Host&lt;/h2&gt;

&lt;p&gt;Like many people, I started with a Bastion Host.&lt;/p&gt;

&lt;p&gt;The idea was straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One small EC2 instance in a public subnet&lt;/li&gt;
&lt;li&gt;Port 22 open&lt;/li&gt;
&lt;li&gt;SSH into the bastion, then hop into private instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly — it worked.&lt;/p&gt;

&lt;p&gt;But the more I used it, the more friction I felt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I was managing SSH keys again&lt;/li&gt;
&lt;li&gt;One public-facing server became a critical choke point&lt;/li&gt;
&lt;li&gt;Security started depending on how well I protected that single box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing was broken, but something didn’t feel aligned with the rest of the architecture I was building.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The Question That Changed My Approach&lt;/h2&gt;

&lt;p&gt;At some point I stopped asking:&lt;/p&gt;

&lt;p&gt;“How do I reach my servers?”&lt;/p&gt;

&lt;p&gt;and started asking:&lt;/p&gt;

&lt;p&gt;“Why does access depend on network paths at all?”&lt;/p&gt;

&lt;p&gt;That shift led me to AWS Systems Manager (SSM) Session Manager.&lt;/p&gt;
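&lt;p&gt;For context, opening a shell on a private instance with Session Manager is a single command — no key pair, no inbound port 22. The instance ID below is a placeholder, and this assumes the Session Manager plugin is installed and your IAM identity has &lt;code&gt;ssm:StartSession&lt;/code&gt;:&lt;/p&gt;

```shell
# Interactive shell on a private instance via SSM — no SSH, no open ports.
aws ssm start-session --target i-0123456789abcdef0
```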

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;What Changed When I Switched to SSM&lt;/h2&gt;

&lt;p&gt;Once I set up SSM and removed the Bastion Host, a few things became very clear.&lt;/p&gt;

&lt;h3&gt;1. Access became identity-based, not network-based&lt;/h3&gt;

&lt;p&gt;I wasn’t thinking about IPs, ports, or jump paths anymore.&lt;br&gt;
Access was simply about who I am and what IAM permissions I have.&lt;/p&gt;

&lt;p&gt;That felt like a more natural fit for cloud-native systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;2. No inbound ports felt… relieving&lt;/h3&gt;

&lt;p&gt;Closing port 22 everywhere wasn’t just a security improvement — it simplified things mentally.&lt;/p&gt;

&lt;p&gt;There was no longer a “special server” that needed extra attention or hardening.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;3. Visibility improved without extra effort&lt;/h3&gt;

&lt;p&gt;Every session was logged.&lt;br&gt;
Every action had an identity attached to it.&lt;/p&gt;

&lt;p&gt;I didn’t have to bolt on monitoring — it was part of the access model itself.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;What This Shift Changed in My Mental Model&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security felt like layers of network controls&lt;/li&gt;
&lt;li&gt;Access meant “finding a safe path to the server”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security feels like identity + intent&lt;/li&gt;
&lt;li&gt;Access means “am I allowed to be here?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction mattered more to me than I expected.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Where I Am Now&lt;/h2&gt;

&lt;p&gt;This is how I currently think about server access in my setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend EC2 instances stay in private subnets&lt;/li&gt;
&lt;li&gt;The load balancer is the only public-facing component&lt;/li&gt;
&lt;li&gt;Administrative access happens through SSM, not SSH&lt;/li&gt;
&lt;li&gt;Security groups are chained tightly, not opened broadly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m not saying Bastion Hosts are wrong — they still have valid use cases.&lt;br&gt;
But for my learning and the systems I’m building right now, SSM feels like the right default.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;A Question for People Further Along&lt;/h2&gt;

&lt;p&gt;If you’ve worked on production AWS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you still rely on Bastion Hosts?&lt;/li&gt;
&lt;li&gt;Or have you moved fully toward SSM / identity-based access?&lt;/li&gt;
&lt;li&gt;In what cases do you still prefer a bastion?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m still learning, and I’d love to hear how others think about this trade-off.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>learning</category>
    </item>
    <item>
      <title>How I Built a Production-Grade Multi-Tier Application on AWS ECS Fargate (A Complete Case Study)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sun, 07 Dec 2025 17:50:17 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/how-i-built-a-production-grade-multi-tier-application-on-aws-ecs-fargate-a-complete-case-study-12on</link>
      <guid>https://forem.com/adil-khan-723/how-i-built-a-production-grade-multi-tier-application-on-aws-ecs-fargate-a-complete-case-study-12on</guid>
      <description>&lt;p&gt;I recently completed a full end-to-end deployment of a multi-tier application on AWS ECS Fargate. What began as a simple “let’s deploy a React app and a Node.js API” turned into a complete production-style cloud architecture that tested everything I’ve learned about DevOps, AWS networking, and container orchestration.&lt;/p&gt;

&lt;p&gt;This article is a complete technical breakdown of the project: how the architecture works, the services involved, what went wrong, what I fixed, and how the final system now behaves like a real microservices deployment running inside a production VPC.&lt;/p&gt;

&lt;p&gt;I’m sharing this as a learning milestone and a reference for others trying to move from theory to real-world cloud builds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik800leo3q050auakrfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik800leo3q050auakrfs.png" alt="project architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Project Summary&lt;/h2&gt;

&lt;p&gt;The system is a simple two-service architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React frontend served by Nginx&lt;/li&gt;
&lt;li&gt;Node.js backend API (&lt;code&gt;/api/message&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Public ALB for the frontend&lt;/li&gt;
&lt;li&gt;Internal ALB for the backend&lt;/li&gt;
&lt;li&gt;Two ECS Fargate services with separate task definitions&lt;/li&gt;
&lt;li&gt;ECR repositories for image storage&lt;/li&gt;
&lt;li&gt;4-subnet VPC (2 public, 2 private)&lt;/li&gt;
&lt;li&gt;SG-to-SG communication for isolation&lt;/li&gt;
&lt;li&gt;CloudWatch logging for both tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: a fully private backend and a publicly accessible frontend communicating securely inside the VPC.&lt;/p&gt;

&lt;h2&gt;High-Level Architecture&lt;/h2&gt;

&lt;p&gt;Below is the same architecture used in many real production microservices deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB → receives internet traffic&lt;/li&gt;
&lt;li&gt;Frontend Fargate tasks in public subnets → serve the UI&lt;/li&gt;
&lt;li&gt;Internal ALB → receives API calls only from the frontend&lt;/li&gt;
&lt;li&gt;Backend Fargate tasks in private subnets → serve the API&lt;/li&gt;
&lt;li&gt;SG chaining → only frontend → backend is allowed&lt;/li&gt;
&lt;li&gt;ECR → stores container images&lt;/li&gt;
&lt;li&gt;IAM execution role → grants ECS permission to pull images&lt;/li&gt;
&lt;li&gt;CloudWatch Logs → task logs + debugging&lt;/li&gt;
&lt;li&gt;VPC endpoints (optional) → avoid NAT costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend has zero exposure to the public internet. All calls go through the internal ALB.&lt;/p&gt;

&lt;h2&gt;Network Design&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VPC:&lt;/strong&gt; 10.0.0.0/16&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets (2) → ALB + frontend tasks&lt;/li&gt;
&lt;li&gt;Private subnets (2) → backend tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Route tables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets → Internet Gateway&lt;/li&gt;
&lt;li&gt;Private subnets → NAT Gateway / VPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security groups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB SG: allows HTTP from anywhere&lt;/li&gt;
&lt;li&gt;Frontend SG: allows the public ALB → port 80&lt;/li&gt;
&lt;li&gt;Internal ALB SG: allows the frontend SG → port 80&lt;/li&gt;
&lt;li&gt;Backend SG: allows the internal ALB → port 5001&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traffic path:&lt;br&gt;
Internet → Public ALB → Frontend → Internal ALB → Backend&lt;/p&gt;
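&lt;p&gt;As an illustration of SG chaining, the backend rule admits traffic from the internal ALB's security group rather than from a CIDR range — both group IDs below are placeholders:&lt;/p&gt;

```shell
# Backend SG admits port 5001 only from the internal ALB's SG — no CIDRs.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5001 \
  --source-group sg-0fedcba9876543210
```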

&lt;h2&gt;Containers &amp;amp; Dockerfiles&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend (Nginx multi-stage build)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the React app&lt;/li&gt;
&lt;li&gt;Serve it via Nginx&lt;/li&gt;
&lt;li&gt;Expose port 80&lt;/li&gt;
&lt;/ul&gt;
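&lt;p&gt;A minimal sketch of such a multi-stage Dockerfile — base image versions and paths are assumptions, not the exact ones from this project:&lt;/p&gt;

```dockerfile
# Stage 1: build the React bundle (base image version is an assumption)
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: serve the static build with Nginx on port 80
FROM nginx:alpine
COPY --from=build /app/build /usr/share/nginx/html
EXPOSE 80
```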

&lt;p&gt;&lt;strong&gt;Backend (Node.js)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Express server&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/api/message&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;Expose port 5001&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both built locally → pushed to ECR.&lt;/p&gt;
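&lt;p&gt;The build-and-push flow looks roughly like this — account ID, region, and repository names are placeholders:&lt;/p&gt;

```shell
# Authenticate Docker to ECR, then build, tag, and push.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker build -t frontend .
docker tag frontend:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:latest
```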

&lt;h2&gt;ECR + IAM Setup&lt;/h2&gt;

&lt;p&gt;Two repositories: frontend, backend.&lt;/p&gt;

&lt;p&gt;The IAM role had permissions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchGetImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchCheckLayerAvailability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logs:CreateLogStream&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logs:PutLogEvents&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VPC endpoints were added for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECR API&lt;/li&gt;
&lt;li&gt;ECR DKR&lt;/li&gt;
&lt;li&gt;S3&lt;/li&gt;
&lt;li&gt;CloudWatch Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fixed image pull timeouts in private subnets.&lt;/p&gt;

&lt;h2&gt;ECS Design&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster:&lt;/strong&gt; one ECS cluster (Fargate only).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task definitions:&lt;/strong&gt; one for the frontend, one for the backend. Each includes CPU/memory, ports, log configuration, &lt;code&gt;awsvpc&lt;/code&gt; network mode, and IAM roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;frontend-service&lt;/li&gt;
&lt;li&gt;backend-service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Desired count: 2 each.&lt;/p&gt;

&lt;p&gt;Rolling deployments were used for updates.&lt;/p&gt;

&lt;h2&gt;Load Balancing &amp;amp; Routing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB&lt;/li&gt;
&lt;li&gt;Listener: HTTP 80&lt;/li&gt;
&lt;li&gt;Target group: frontend-tg (port 80)&lt;/li&gt;
&lt;li&gt;Health check: &lt;code&gt;/&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal ALB&lt;/li&gt;
&lt;li&gt;Listener: HTTP 80&lt;/li&gt;
&lt;li&gt;Target group: backend-tg (port 5001)&lt;/li&gt;
&lt;li&gt;Health check: &lt;code&gt;/api/message&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend → Backend:&lt;/strong&gt; the frontend uses the internal ALB DNS name for API calls.&lt;/p&gt;

&lt;h2&gt;Rolling Deployments&lt;/h2&gt;

&lt;p&gt;Flow for a new image rollout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Push the new image&lt;/li&gt;
&lt;li&gt;Create a new task definition revision&lt;/li&gt;
&lt;li&gt;ECS launches new tasks&lt;/li&gt;
&lt;li&gt;The ALB health-checks them&lt;/li&gt;
&lt;li&gt;Traffic shifts&lt;/li&gt;
&lt;li&gt;Old tasks drain and stop&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tested multiple updates to see real ENI provisioning, ALB registration, logs, and draining behavior.&lt;/p&gt;
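&lt;p&gt;Triggering such a rollout from the CLI is a one-liner once the new task definition revision is registered — cluster and service names here are placeholders:&lt;/p&gt;

```shell
# Point the service at the newest task definition revision and let ECS
# roll tasks gradually.
aws ecs update-service \
  --cluster app-cluster \
  --service backend-service \
  --task-definition backend   # family name alone picks the latest revision

# Block until the deployment stabilizes
aws ecs wait services-stable --cluster app-cluster --services backend-service
```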

&lt;h2&gt;Key Metrics From the Deployment&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;0 public IPs on ECS tasks&lt;/li&gt;
&lt;li&gt;Backend remained fully private&lt;/li&gt;
&lt;li&gt;&amp;lt;15 ms frontend → backend latency&lt;/li&gt;
&lt;li&gt;3-minute build → push → deploy cycle&lt;/li&gt;
&lt;li&gt;Zero-downtime rolling deployments&lt;/li&gt;
&lt;li&gt;100% successful health checks&lt;/li&gt;
&lt;li&gt;Multiple revisions without breakage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Challenges &amp;amp; Fixes&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Target group 404.&lt;/strong&gt; Cause: wrong health-check path. Fix: use &lt;code&gt;/api/message&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ECR pull timeout.&lt;/strong&gt; Cause: tasks in private subnets. Fix: add VPC endpoints for ECR, S3, and CloudWatch Logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frontend couldn’t reach the backend.&lt;/strong&gt; Cause: a hardcoded IP. Fix: use the internal ALB DNS name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling update issues.&lt;/strong&gt; Cause: invalid deployment settings. Fix: correct &lt;code&gt;minimumHealthyPercent&lt;/code&gt; and &lt;code&gt;maximumPercent&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How Fargate attaches ENIs inside private subnets&lt;/li&gt;
&lt;li&gt;How ALB target groups determine readiness&lt;/li&gt;
&lt;li&gt;How internal ALBs handle microservice communication&lt;/li&gt;
&lt;li&gt;How IAM least privilege affects ECR/ECS&lt;/li&gt;
&lt;li&gt;How routing works in multi-tier VPCs&lt;/li&gt;
&lt;li&gt;How rolling deployments behave in real time&lt;/li&gt;
&lt;li&gt;How containers, networking, IAM, and load balancing combine to form real systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why This Project Mattered&lt;/h2&gt;

&lt;p&gt;This wasn’t just a deployment. It was a deep dive into how real cloud systems work — with failures, debugging, routing decisions, IAM restrictions, and architecture redesigns.&lt;/p&gt;

&lt;p&gt;It brought together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC networking&lt;/li&gt;
&lt;li&gt;IAM&lt;/li&gt;
&lt;li&gt;ECS&lt;/li&gt;
&lt;li&gt;ECR&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Load balancing&lt;/li&gt;
&lt;li&gt;Rolling deployments&lt;/li&gt;
&lt;li&gt;Private service-to-service communication&lt;/li&gt;
&lt;li&gt;Monitoring &amp;amp; logging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;GitHub Repository&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/adil-khan-723/node-app-jenkins1.git" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/node-app-jenkins1.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Anyone learning DevOps or AWS should attempt a project like this. It forces you to think like an engineer designing real systems, not just someone running commands. It also builds confidence that you can architect and debug production-style systems from scratch.&lt;/p&gt;

&lt;p&gt;If you’re working on similar projects or want to discuss cloud architectures, feel free to reach out.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>architecture</category>
      <category>docker</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
