<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adil Khan</title>
    <description>The latest articles on Forem by Adil Khan (@adil-khan-723).</description>
    <link>https://forem.com/adil-khan-723</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650542%2Fa71fd520-f132-4110-b7b6-e0afded5090f.png</url>
      <title>Forem: Adil Khan</title>
      <link>https://forem.com/adil-khan-723</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adil-khan-723"/>
    <language>en</language>
    <item>
      <title>I Built a Kubernetes Monitoring Stack — And Breaking It Was the Best Part</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 28 Mar 2026 09:34:20 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-kubernetes-monitoring-stack-and-breaking-it-was-the-best-part-1lba</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-kubernetes-monitoring-stack-and-breaking-it-was-the-best-part-1lba</guid>
      <description>&lt;p&gt;I didn't build this project to add a line to my resume.&lt;/p&gt;

&lt;p&gt;I built it because I kept reading about Prometheus and Grafana, nodding along like I understood it, and then freezing when someone asked me "so how does Prometheus actually discover your pods?"&lt;/p&gt;

&lt;p&gt;I didn't know. Not really.&lt;/p&gt;

&lt;p&gt;So I decided to stop reading and start breaking things.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;A complete observability pipeline from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python Flask app with &lt;strong&gt;custom Prometheus metrics&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deployed on &lt;strong&gt;Kubernetes with 3 replicas&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Scraped by &lt;strong&gt;Prometheus via ServiceMonitor&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Visualized in &lt;strong&gt;Grafana with PromQL dashboards&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Load tested with real traffic using &lt;code&gt;hey&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
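&lt;p&gt;To make "custom Prometheus metrics" concrete, here is a minimal sketch of the Flask side using &lt;code&gt;prometheus_client&lt;/code&gt;. Metric and route names are illustrative, not necessarily what the repo uses:&lt;/p&gt;

```python
# Minimal sketch of a Flask app exposing a custom Prometheus counter.
# Metric and route names are illustrative; the repo's code may differ.
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# A counter with a per-endpoint label, so PromQL can slice by path later.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["endpoint"])

@app.route("/")
def home():
    REQUESTS.labels(endpoint="/").inc()
    return "ok"

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint in its text exposition format.
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

&lt;p&gt;Everything in the dashboards below is built on counters shaped like this one.&lt;/p&gt;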

&lt;p&gt;The repo is here if you want to follow along:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/adil-khan-723/k8s-observability-stack" rel="noopener noreferrer"&gt;github.com/adil-khan-723/k8s-observability-stack&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5hvpoh194r41dugpkbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5hvpoh194r41dugpkbq.png" alt="Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the code isn't the interesting part. What I learned by watching it fail is.&lt;/p&gt;


&lt;h2&gt;
  
  
  Mistake #1 — I used raw counters in Grafana and wondered why nothing made sense
&lt;/h2&gt;

&lt;p&gt;First dashboard. I added &lt;code&gt;http_requests_total&lt;/code&gt; as a panel. The number just kept climbing. 1000. 5000. 23000.&lt;/p&gt;

&lt;p&gt;I stared at it thinking "okay... is that good?"&lt;/p&gt;

&lt;p&gt;It tells you nothing. A counter that only goes up is like a car's odometer — it doesn't tell you how fast you're going right now.&lt;/p&gt;

&lt;p&gt;The correct query is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rate()&lt;/code&gt; calculates requests &lt;strong&gt;per second&lt;/strong&gt; over the last minute. That's a number you can actually act on. After switching to this, I could see exactly when traffic spiked during load testing and when it dropped off.&lt;/p&gt;

&lt;p&gt;Lesson: &lt;strong&gt;metrics alone are useless. PromQL creates insight.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #2 — I set the rate window too large and hid all my spikes
&lt;/h2&gt;

&lt;p&gt;Once I had &lt;code&gt;rate()&lt;/code&gt; working, I noticed my graph looked... suspiciously smooth. Almost like nothing was happening even during load tests.&lt;/p&gt;

&lt;p&gt;I was using &lt;code&gt;rate(http_requests_total[8m])&lt;/code&gt;. An 8-minute window averages out everything. A spike that lasted 30 seconds disappears completely.&lt;/p&gt;

&lt;p&gt;Switched to &lt;code&gt;[1m]&lt;/code&gt;. Suddenly I could see exactly what happened during the load test — a sharp climb, a plateau, a drop. Real information.&lt;/p&gt;

&lt;p&gt;The dashboard also had stacked graphs enabled. Stacking makes it look like total traffic is the sum of all the colored areas, which is visually misleading when you're trying to compare per-pod behavior. Disabled it immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #3 — Prometheus showed all targets DOWN and I had no idea why
&lt;/h2&gt;

&lt;p&gt;This one took me a while.&lt;/p&gt;

&lt;p&gt;I ran load tests, checked Prometheus UI under &lt;strong&gt;Status → Targets&lt;/strong&gt;, and saw this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context deadline exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 3 pods. All DOWN.&lt;/p&gt;

&lt;p&gt;My first instinct was to blame the ServiceMonitor config. I triple-checked the labels. Everything matched. The problem wasn't discovery — Prometheus was finding the pods fine. It just couldn't scrape them in time.&lt;/p&gt;

&lt;p&gt;Root cause: I had a &lt;code&gt;time.sleep(2)&lt;/code&gt; sitting in my home route. This slowed down the entire Gunicorn worker. When Prometheus tried to hit &lt;code&gt;/metrics&lt;/code&gt;, sometimes the worker was busy sleeping and the scrape timed out.&lt;/p&gt;

&lt;p&gt;The fix was two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Increased the scrape &lt;code&gt;interval&lt;/code&gt; in serviceMonitor.yaml to give more breathing room&lt;/li&gt;
&lt;li&gt;Removed the artificial delay (or accounted for it explicitly)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Watched all 3 targets flip back to UP in real time. That was a good moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper lesson:&lt;/strong&gt; your monitoring system depends on your application's performance. A slow app breaks its own observability. This is why slow business logic should never share the same worker that serves the &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;
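&lt;p&gt;One way to keep a slow handler from starving the scrape is to serve metrics from a dedicated thread on its own port. A hedged sketch using &lt;code&gt;prometheus_client.start_http_server&lt;/code&gt; (the port number is illustrative, and under Gunicorn with multiple workers you would need its multiprocess mode instead):&lt;/p&gt;

```python
# Sketch: serve /metrics from a dedicated thread so slow business logic
# can't block the scrape. The port number is illustrative.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("demo_requests_total", "Requests handled")

def handle_request():
    REQUESTS.inc()
    time.sleep(2)  # the artificial delay no longer delays /metrics

if __name__ == "__main__":
    start_http_server(9000)  # metrics thread on :9000; app traffic elsewhere
    while True:
        handle_request()
```

&lt;p&gt;The scrape path and the request path stop competing for the same worker.&lt;/p&gt;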




&lt;h2&gt;
  
  
  Mistake #4 — I assumed Running meant healthy
&lt;/h2&gt;

&lt;p&gt;After fixing the scrape issue, I noticed something strange. &lt;code&gt;kubectl get pods&lt;/code&gt; showed all pods as &lt;code&gt;Running&lt;/code&gt;. But requests were still failing intermittently.&lt;/p&gt;

&lt;p&gt;I had been treating &lt;code&gt;Running&lt;/code&gt; as "everything is fine." It isn't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Running&lt;/code&gt; just means the container process started. It says nothing about whether the application inside is actually ready to serve traffic. A pod can be &lt;code&gt;Running&lt;/code&gt; while your Flask app is still initializing, or while it's in a broken state that hasn't crashed the process.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;readinessProbe&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this was in place, Kubernetes automatically removed unhealthy pods from the Service's endpoint list. Traffic only went to pods that were actually ready. The intermittent failures stopped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running ≠ healthy. Readiness probes are not optional.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The thing that surprised me most — load isn't evenly distributed
&lt;/h2&gt;

&lt;p&gt;During load testing I split the Grafana panel to show per-pod request rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1m&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected three roughly equal lines. What I got were three noticeably different ones. One pod was consistently getting more traffic than the others.&lt;/p&gt;

&lt;p&gt;Kubernetes Services balance traffic at the &lt;strong&gt;connection&lt;/strong&gt; level, not the request level. With the default iptables-mode kube-proxy, the pod for each new connection is picked at random rather than in strict round-robin. Under high concurrency, some pods end up holding more long-lived connections and therefore handle more requests.&lt;/p&gt;

&lt;p&gt;If I had only been looking at &lt;code&gt;sum(rate(http_requests_total[1m]))&lt;/code&gt; — the aggregate — I would never have seen this. The sum looked perfectly healthy. The per-pod view told a completely different story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why per-pod metrics exist. Aggregates hide things.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the ServiceMonitor actually does (and why it's confusing at first)
&lt;/h2&gt;

&lt;p&gt;The part that confused me most before building this was the relationship between Prometheus and Kubernetes.&lt;/p&gt;

&lt;p&gt;Prometheus doesn't scrape your Deployment directly. It discovers targets through &lt;strong&gt;Services&lt;/strong&gt;, whose endpoints resolve to individual pod IPs. And it finds those Services through a custom resource called a &lt;strong&gt;ServiceMonitor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The chain looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ServiceMonitor → matches Service labels → Service resolves to Pod IPs → Prometheus scrapes each pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this to work, three things have to align exactly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ServiceMonitor's &lt;code&gt;selector&lt;/code&gt; must match the Service's labels&lt;/li&gt;
&lt;li&gt;The ServiceMonitor itself must carry the &lt;code&gt;release: monitoring&lt;/code&gt; label (or whatever your Helm release is named) so the Prometheus Operator selects it&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;namespaceSelector&lt;/code&gt; must point to where the Service lives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get any one of these wrong and Prometheus simply never discovers the target. No error. Just silence. This is the part where most people spend hours debugging.&lt;/p&gt;
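&lt;p&gt;A hedged sketch of how those three pieces might line up. Every name here (&lt;code&gt;flask-app&lt;/code&gt;, the namespaces, the &lt;code&gt;release&lt;/code&gt; value) is an assumption to adapt to your own cluster:&lt;/p&gt;

```yaml
# Hypothetical names; adjust to your Service and Helm release.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flask-app
  namespace: monitoring
  labels:
    release: monitoring        # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: flask-app           # must match the Service's labels exactly
  namespaceSelector:
    matchNames:
      - default                # must be where the Service actually lives
  endpoints:
    - port: http               # named port on the Service
      path: /metrics
      interval: 15s
```

&lt;p&gt;When a target never appears, diffing these three spots against &lt;code&gt;kubectl get svc --show-labels&lt;/code&gt; is usually faster than re-reading the docs.&lt;/p&gt;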




&lt;h2&gt;
  
  
  What I'd add next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; — fire alerts when scrape targets go DOWN or error rate spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HPA&lt;/strong&gt; — autoscale pods based on custom Prometheus metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki&lt;/strong&gt; — correlate logs with metric anomalies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinned dependency versions&lt;/strong&gt; — &lt;code&gt;requirements.txt&lt;/code&gt; currently has no versions, which is a reproducibility risk&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The most valuable part of this project wasn't setting up Prometheus or building the Grafana dashboard. It was the moment I saw &lt;code&gt;context deadline exceeded&lt;/code&gt; and had to actually figure out why.&lt;/p&gt;

&lt;p&gt;You don't learn observability by reading about it. You learn it by watching your own system fail and having to diagnose it with the tools you built.&lt;/p&gt;

&lt;p&gt;If you're learning DevOps or platform engineering, build something, break it, and read the metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/adil-khan-723/k8s-observability-stack" rel="noopener noreferrer"&gt;github.com/adil-khan-723/k8s-observability-stack&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you run into similar issues with Prometheus scraping? Drop it in the comments — would love to hear what weird things you've debugged.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>prometheus</category>
      <category>grafana</category>
      <category>devops</category>
    </item>
    <item>
      <title>I built a CLI that catches dangerous Terraform changes before you apply them</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 23 Mar 2026 07:19:22 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/-i-built-a-cli-that-catches-dangerous-terraform-changes-before-you-apply-them-noe</link>
      <guid>https://forem.com/adil-khan-723/-i-built-a-cli-that-catches-dangerous-terraform-changes-before-you-apply-them-noe</guid>
      <description>&lt;p&gt;Before every &lt;code&gt;terraform apply&lt;/code&gt; I was doing the same thing. Read the plan output. Switch to the AWS console. Check security groups. Try to remember what depends on what.&lt;/p&gt;

&lt;p&gt;Then a security group with port 22 open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; slipped through. Caught it fast, nothing broke — but I kept thinking about it. That whole review process was just me, manually, every time, hoping I didn't miss anything in 300 lines of text. That's not a process. That's vibes.&lt;/p&gt;

&lt;p&gt;So I built IACGuard. This is how it works and what I got wrong along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The gap in terraform plan
&lt;/h2&gt;

&lt;p&gt;The output doesn't tell you what matters. A production database being replaced and a tag change look identical in the plan — same format, same indentation, same weight. You have to already know what's dangerous to spot it.&lt;/p&gt;

&lt;p&gt;The second gap is pipelines. You can fail a PR on test failures or linting. You can't natively fail a PR because the plan replaces a production RDS instance. That gap means risky infrastructure changes get the same review as safe ones, which under time pressure is basically no review.&lt;/p&gt;




&lt;h2&gt;
  
  
  What IACGuard does
&lt;/h2&gt;

&lt;p&gt;Reads a Terraform plan file, runs deterministic rules against every planned change, outputs a risk report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;iacguard plan results
─────────────────────────────────────────────────────────
Critical : 1  |  High : 0  |  Medium : 0  |  Low : 0
Resources analyzed : 5  |  Changes : 5
─────────────────────────────────────────────────────────

  CRITICAL  SG001  aws_vpc_security_group_ingress_rule.sg_inbound_ssh  [CREATE]
           Security group ingress rule 'sg_inbound_ssh' allows SSH (port 22)
           from the entire internet (0.0.0.0/0 or ::/0).

─────────────────────────────────────────────────────────
[iacguard] Rules checked   : 3
[iacguard] Drift check     : SKIPPED (use --region to enable)
─────────────────────────────────────────────────────────
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 0 = nothing critical. Exit code 1 = stop and look at this before applying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;iacguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa9s00rtgfsl0atlylg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa9s00rtgfsl0atlylg3.png" alt="iacguard architecture"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The first version used AI for detection. That was wrong.
&lt;/h2&gt;

&lt;p&gt;My original plan was to send the whole Terraform plan to Claude and let it decide what was risky.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan.json → Claude API → risk scores + findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is non-determinism. LLMs give different answers on different runs. That's fine for generating text. It's not fine when your tool blocks deployments. If &lt;code&gt;iacguard&lt;/code&gt; exits with code 1 in a CI pipeline, that exit code has to mean the same thing every single time — not "Claude was in a cautious mood today."&lt;/p&gt;

&lt;p&gt;So the architecture changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan.json → parser → rule engine (deterministic) → findings → output
                                                            ↓
                                              AI explanation (--explain, optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rules decide what's risky. AI only explains what the rules already found. AI never runs in CI mode at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  The parser
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;terraform show -json&lt;/code&gt; gives you a JSON file. The useful part is &lt;code&gt;resource_changes&lt;/code&gt; — every resource being created, updated, or destroyed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_db_instance.primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"managed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"instance_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.t3.medium"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"instance_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.t3.large"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"replacing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious parts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;actions&lt;/code&gt; can be &lt;code&gt;["delete", "create"]&lt;/code&gt; — that's a replacement, not two separate operations. The parser normalizes everything to &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DESTROY&lt;/code&gt;, or &lt;code&gt;REPLACE&lt;/code&gt; before rules run.&lt;/p&gt;

&lt;p&gt;Data sources (&lt;code&gt;mode: "data"&lt;/code&gt;) and no-ops get filtered out. They're reads, not changes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;change.before&lt;/code&gt; is null for creates. &lt;code&gt;change.after&lt;/code&gt; is null for destroys. Rules have to handle both.&lt;/p&gt;

&lt;p&gt;Resources inside modules have addresses like &lt;code&gt;module.vpc.aws_security_group.bastion&lt;/code&gt;. The output stays readable while the full address is preserved for accuracy.&lt;/p&gt;

&lt;p&gt;I spent more time on the parser than anything else. If it reads something wrong, every rule downstream gets bad data.&lt;/p&gt;
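&lt;p&gt;The normalization step fits in a few lines. Field names below follow &lt;code&gt;terraform show -json&lt;/code&gt; output, but the helpers themselves are an illustrative sketch, not the repo's actual parser:&lt;/p&gt;

```python
# Illustrative sketch of IACGuard-style action normalization.
# Field names match `terraform show -json`; the helpers are hypothetical.
def normalize_action(actions):
    """Map a raw `actions` list to CREATE / UPDATE / DESTROY / REPLACE / NOOP."""
    acts = set(actions)
    if acts == {"delete", "create"}:
        return "REPLACE"   # one replacement, not two separate operations
    if acts == {"create"}:
        return "CREATE"
    if acts == {"update"}:
        return "UPDATE"
    if acts == {"delete"}:
        return "DESTROY"
    return "NOOP"          # ["no-op"] and ["read"] carry no change

def planned_changes(plan):
    """Yield (address, action) for managed resources that actually change."""
    for rc in plan.get("resource_changes", []):
        if rc.get("mode") != "managed":
            continue       # data sources are reads, not changes
        action = normalize_action(rc["change"]["actions"])
        if action != "NOOP":
            yield rc["address"], action
```

&lt;p&gt;Rules only ever see the normalized form, so none of them have to re-handle the &lt;code&gt;["delete", "create"]&lt;/code&gt; quirk individually.&lt;/p&gt;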




&lt;h2&gt;
  
  
  The three rules
&lt;/h2&gt;

&lt;p&gt;Each rule is a pure Python function. Takes a &lt;code&gt;ResourceChange&lt;/code&gt;, returns a &lt;code&gt;Finding&lt;/code&gt; or &lt;code&gt;None&lt;/code&gt;. No shared state, independently testable.&lt;/p&gt;

&lt;h3&gt;
  
  
  RDS001 — database replacement
&lt;/h3&gt;

&lt;p&gt;An RDS replacement deletes the database and recreates it. Anything written between delete and restore is gone. The rule fires on &lt;code&gt;actions == ["delete", "create"]&lt;/code&gt; or &lt;code&gt;change.replacing == True&lt;/code&gt; — both checked because real plans sometimes only set one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RDS001&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RuleBase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rule_id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CRITICAL&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_changes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_db_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_rds_cluster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REPLACE&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replacing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RDS instance &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; will be replaced — potential data loss.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure a snapshot exists before applying.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SG001 — SSH open to the internet
&lt;/h3&gt;

&lt;p&gt;This one had a bug I only found by running it on my own infrastructure.&lt;/p&gt;

&lt;p&gt;The rule originally covered &lt;code&gt;aws_security_group&lt;/code&gt; and &lt;code&gt;aws_security_group_rule&lt;/code&gt;. When I ran it against my real Terraform code — which has port 22 open to &lt;code&gt;0.0.0.0/0&lt;/code&gt; — it caught nothing.&lt;/p&gt;

&lt;p&gt;Why: my code uses &lt;code&gt;aws_vpc_security_group_ingress_rule&lt;/code&gt;, a newer resource type from AWS provider v5+. Different structure. &lt;code&gt;from_port&lt;/code&gt;, &lt;code&gt;to_port&lt;/code&gt;, and &lt;code&gt;cidr_ipv4&lt;/code&gt; are top-level fields, not nested inside an ingress block.&lt;/p&gt;

&lt;p&gt;I had test fixtures for the old types. Hadn't thought to write one for the new type. You don't know what you haven't tested until real infrastructure breaks it.&lt;/p&gt;

&lt;p&gt;Fixed rule covers all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_security_group                   → ingress[] array
aws_security_group_rule              → type=ingress fields
aws_vpc_security_group_ingress_rule  → top-level from_port/to_port/cidr_ipv4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fires if &lt;code&gt;from_port &amp;lt;= 22 &amp;lt;= to_port&lt;/code&gt;. A rule with &lt;code&gt;from_port=0, to_port=65535&lt;/code&gt; exposes SSH. IPv6 (&lt;code&gt;::/0&lt;/code&gt;) also checked.&lt;/p&gt;
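&lt;p&gt;The core of that check, for the newer top-level-field shape, is one small function. A sketch (the repo's real rule also walks the two older resource layouts):&lt;/p&gt;

```python
# Sketch of the SG001 check for aws_vpc_security_group_ingress_rule,
# where from_port/to_port/cidr_ipv4 are top-level fields.
WORLD = {"0.0.0.0/0", "::/0"}

def exposes_ssh(after):
    """True if this ingress rule lets the whole internet reach port 22."""
    if after is None:          # destroys have no `after` state
        return False
    from_port = after.get("from_port")
    to_port = after.get("to_port")
    cidr = after.get("cidr_ipv4") or after.get("cidr_ipv6")
    if from_port is None or to_port is None or cidr not in WORLD:
        return False
    return from_port <= 22 <= to_port   # catches 0-65535 ranges too
```

&lt;p&gt;The range check is the part that matters: an exact-match on port 22 would miss the all-ports rule entirely.&lt;/p&gt;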

&lt;h3&gt;
  
  
  S3001 — missing public access block
&lt;/h3&gt;

&lt;p&gt;Medium severity, not Critical. AWS accounts can have account-level Block Public Access settings that protect every bucket regardless of what Terraform configures. Flagging this Critical would false-positive on any account with account-level protection — which is a lot of accounts.&lt;/p&gt;

&lt;p&gt;The output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 bucket 'assets' has no explicit block_public_access configuration.
Account-level settings may still protect this bucket.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engineer decides. The tool surfaces it without over-reacting.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI mode
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--ci&lt;/code&gt; flag: JSON to stdout, no color, just exit codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json &lt;span class="nt"&gt;--ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0 = clean. Exit 1 = critical finding. Exit 2 = tool error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IACGuard&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install iacguard&lt;/span&gt;
    &lt;span class="s"&gt;terraform show -json tfplan &amp;gt; plan.json&lt;/span&gt;
    &lt;span class="s"&gt;iacguard plan --plan plan.json --ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PR blocks on exit 1. That's it.&lt;/p&gt;
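&lt;p&gt;The whole contract is small enough to sketch. The severity string here is an assumption about the internal model, but the point stands: the mapping is pure and deterministic, so the same findings always produce the same exit code:&lt;/p&gt;

```python
# Sketch of the deterministic CI exit-code contract.
# The severity string is illustrative.
def exit_code(findings, tool_error=False):
    """0 = clean, 1 = at least one critical finding, 2 = tool error."""
    if tool_error:
        return 2
    if any(f["severity"] == "CRITICAL" for f in findings):
        return 1
    return 0
```

&lt;p&gt;No network, no model, no randomness anywhere in the decision path.&lt;/p&gt;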




&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Every rule has two tests minimum: a plan that triggers it and one that doesn't. All tests use real Terraform plan JSON files — no synthetic JSON built in test code. Synthetic fixtures pass even when the parser would fail on a real plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# 15 passed in 0.01s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I got wrong
&lt;/h2&gt;

&lt;p&gt;The original design had eight subcommands, a React dashboard, Kubernetes scanning, a custom cost engine, AI-generated fix code, and multi-region support — all in v1. I went through multiple rounds of design review, including running the spec past other LLMs to pressure-test it. Most of that got cut. What shipped is one command, three rules, and a CLI. It works on real infrastructure. The original would have shipped half-finished on everything.&lt;/p&gt;

&lt;p&gt;The other mistake: I assumed I knew the Terraform plan JSON structure well enough to write the parser from memory. I didn't generate real plan files and read them first. &lt;code&gt;aws_vpc_security_group_ingress_rule&lt;/code&gt; showing up as a bug in my fixtures is a direct result of that. Next time I'm reading real data before writing code that parses it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Blast radius is the main one — given a resource being changed, compute what else depends on it from the Terraform graph. You change a security group, IACGuard tells you which EC2 instances, load balancers, and RDS clusters are downstream. No free CLI tool does this cleanly in a pre-deploy workflow.&lt;/p&gt;
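&lt;p&gt;The core of it can be sketched in a few lines: invert the "depends on" edges from the plan's configuration block and walk them. The function and edge data below are a hypothetical sketch, not the planned implementation:&lt;/p&gt;

```python
from collections import deque

def downstream(changed, references):
    """references maps each resource to the resources it depends on.
    Returns everything that transitively depends on `changed`."""
    # Invert depends-on edges into dependents.
    dependents = {}
    for res, deps in references.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(res)
    # Breadth-first walk from the changed resource.
    seen, queue = set(), deque([changed])
    while queue:
        for nxt in dependents.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

EDGES = {
    "aws_instance.web": ["aws_security_group.app"],
    "aws_lb.front": ["aws_instance.web"],
    "aws_db_instance.main": ["aws_security_group.app"],
}
print(sorted(downstream("aws_security_group.app", EDGES)))
# ['aws_db_instance.main', 'aws_instance.web', 'aws_lb.front']
```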

&lt;p&gt;After that: more rules, drift detection (Terraform state vs live AWS), and a browser graph for the dependency chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;iacguard

terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; plan.json

iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json
iacguard plan &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.json &lt;span class="nt"&gt;--ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/adil-khan-723/iacguard" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/iacguard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you hit a resource type the rules miss, open an issue or PR. Each rule is one file.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built a Production-Grade Kubernetes RBAC Setup — And Broke It On Purpose</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 28 Feb 2026 17:55:49 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/how-i-built-a-production-grade-kubernetes-rbac-setup-and-broke-it-on-purpose-98a</link>
      <guid>https://forem.com/adil-khan-723/how-i-built-a-production-grade-kubernetes-rbac-setup-and-broke-it-on-purpose-98a</guid>
      <description>&lt;p&gt;Most RBAC tutorials show you how to apply a Role and run &lt;code&gt;kubectl auth can-i&lt;/code&gt;. Then they call it done.&lt;/p&gt;

&lt;p&gt;That never sat right with me. In production, your workload doesn't authenticate using your kubeconfig. It authenticates using a ServiceAccount token mounted inside the pod. So if you've never tested RBAC from &lt;em&gt;inside&lt;/em&gt; a running container, you haven't actually tested RBAC.&lt;/p&gt;

&lt;p&gt;This project fixes that. I built a minimal but realistic RBAC setup for an observability tool, validated it from inside a live deployment, and then intentionally broke it to understand what failure actually looks like at the API server level.&lt;/p&gt;

&lt;p&gt;The full source is here: &lt;a href="https://github.com/adil-khan-723/K8s-RBAC" rel="noopener noreferrer"&gt;github.com/adil-khan-723/K8s-RBAC&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Everything lives inside a dedicated &lt;code&gt;observability&lt;/code&gt; namespace. The workload — a test deployment — runs under a purpose-built ServiceAccount called &lt;code&gt;log-reader-sa&lt;/code&gt;. A namespace-scoped Role defines exactly what that identity is allowed to do. A RoleBinding connects the two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observability (namespace)
│
├── log-reader-sa        ← Dedicated ServiceAccount
├── log-reader-role      ← Namespace-scoped Role
├── log-reader-binding   ← Binds SA → Role
└── testing (Deployment) ← Live workload for validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ClusterRoles. No ClusterRoleBindings. Everything contained.&lt;/p&gt;
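&lt;p&gt;Condensed, the three objects look roughly like this (trimmed; exact field values may differ slightly from the repo's manifests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: log-reader-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader-role
  namespace: observability
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]   # subresource, listed explicitly
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-reader-binding
  namespace: observability
subjects:
- kind: ServiceAccount
  name: log-reader-sa
  namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;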




&lt;h2&gt;
  
  
  Why Each Decision Was Made
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Never Use the Default ServiceAccount
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;default&lt;/code&gt; ServiceAccount exists in every namespace automatically. Using it means every workload in that namespace shares the same identity. If one workload gets permissions, every other workload riding the default SA inherits them silently.&lt;/p&gt;

&lt;p&gt;In any environment with more than one workload, this is a privilege creep problem waiting to happen. The fix is simple: create a dedicated ServiceAccount per workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Namespace-Scoped Role, Not ClusterRole
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;ClusterRole&lt;/code&gt; bound with a &lt;code&gt;ClusterRoleBinding&lt;/code&gt; grants access across &lt;em&gt;every&lt;/em&gt; namespace — current and future. Even a ClusterRole bound with a &lt;code&gt;RoleBinding&lt;/code&gt; still references a cluster-level object, which creates reuse risks and makes auditing harder.&lt;/p&gt;

&lt;p&gt;A namespace-local &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; keeps the permission surface completely contained. If you can't explain why a workload needs cluster-wide access, it doesn't need a ClusterRole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verbs Are Not Optional
&lt;/h3&gt;

&lt;p&gt;RBAC doesn't have a "read-only mode" switch. You have to declare every verb individually. The role grants &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, and &lt;code&gt;watch&lt;/code&gt; — and nothing else. Write verbs (&lt;code&gt;create&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;) are not included. Kubernetes does not default to denying write operations if you forget; it denies &lt;em&gt;everything&lt;/em&gt; you don't explicitly allow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subresources Are Not Inherited
&lt;/h3&gt;

&lt;p&gt;This is the one that catches people the most.&lt;/p&gt;

&lt;p&gt;Access to &lt;code&gt;pods&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; grant access to &lt;code&gt;pods/log&lt;/code&gt;. They are treated as completely separate targets by the API server. A Role missing &lt;code&gt;pods/log&lt;/code&gt; applies cleanly with &lt;code&gt;kubectl apply&lt;/code&gt; and then fails loudly at runtime, when your monitoring tool tries to pull logs and gets a 403.&lt;/p&gt;

&lt;p&gt;Every subresource you need must appear explicitly in the rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Granted
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Verbs Granted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;watch&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pods/log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;get&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deployments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secrets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Everything else&lt;/td&gt;
&lt;td&gt;Denied&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Testing From Inside the Pod
&lt;/h2&gt;

&lt;p&gt;This is the part most tutorials skip. I deployed a real workload under &lt;code&gt;log-reader-sa&lt;/code&gt; and tested API calls directly from inside the container.&lt;/p&gt;
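&lt;p&gt;Concretely, the in-pod checks look like this, using the token and CA bundle Kubernetes mounts into every pod. The exact commands in the repo may differ; this is the standard pattern, with the deployment name taken from the diagram above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# From inside the container: kubectl exec -it deploy/testing -n observability -- sh
APISERVER=https://kubernetes.default.svc
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)

# Allowed: list pods in the namespace
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" \
  $APISERVER/api/v1/namespaces/observability/pods

# Denied: read secrets (the API server answers 403 Forbidden)
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" \
  $APISERVER/api/v1/namespaces/observability/secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;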

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List pods in namespace&lt;/td&gt;
&lt;td&gt;✅ Allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get pod logs&lt;/td&gt;
&lt;td&gt;✅ Allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access secrets&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete a pod&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access resources outside namespace&lt;/td&gt;
&lt;td&gt;❌ 403 Forbidden&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Testing this way matters because it confirms the ServiceAccount token is correctly mounted, the API server is reachable from inside the pod, and the policy behaves exactly as written — not as assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Breaking It On Purpose
&lt;/h2&gt;

&lt;p&gt;I removed &lt;code&gt;pods/log&lt;/code&gt; from the Role rules to simulate a common production misconfiguration. The result was immediate: a &lt;code&gt;403 Forbidden&lt;/code&gt; response every time log retrieval was attempted.&lt;/p&gt;

&lt;p&gt;This turned into a useful debugging exercise. There are four failure types that look similar on the surface but require completely different fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;401 Unauthorized&lt;/strong&gt; — the identity wasn't recognized. Token is missing, expired, or invalid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;403 Forbidden&lt;/strong&gt; — the identity was recognized, the request reached the API server, but the action isn't permitted. This is an RBAC problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;404 Not Found&lt;/strong&gt; — the resource doesn't exist. Not an authorization issue at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection refused / timeout&lt;/strong&gt; — the API server wasn't reached. Networking problem, not RBAC.&lt;/p&gt;

&lt;p&gt;When you see a 403, you've already confirmed that the workload has connectivity, the token is valid, and the API server is up. The investigation starts and ends with the Role definition.&lt;/p&gt;
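&lt;p&gt;You can reproduce the same verdict from outside the pod by asking the API server to evaluate the policy as the ServiceAccount itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Evaluates RBAC for the SA identity; answers "no" once pods/log is removed
kubectl auth can-i get pods/log \
  --as=system:serviceaccount:observability:log-reader-sa \
  -n observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;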




&lt;h2&gt;
  
  
  The Security Picture
&lt;/h2&gt;

&lt;p&gt;If this workload were compromised, the damage is bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read pod metadata and logs within &lt;code&gt;observability&lt;/code&gt; — yes&lt;/li&gt;
&lt;li&gt;Access secrets — no&lt;/li&gt;
&lt;li&gt;Modify or delete anything — no&lt;/li&gt;
&lt;li&gt;Move laterally to other namespaces — no&lt;/li&gt;
&lt;li&gt;Escalate to cluster-level access — no&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what blast radius limitation looks like in practice. The attacker gets read access to one namespace. That's a recoverable incident. Cluster-admin on a compromised workload is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Reinforced
&lt;/h2&gt;

&lt;p&gt;RBAC in Kubernetes is an authorization layer evaluated at the API server for every request. The evaluation checks four things: who is making the request, what verb they're using, what resource they're targeting, and which namespace it's in.&lt;/p&gt;

&lt;p&gt;Roles define what is permitted. Bindings attach identities to those permissions. Neither inherits anything. Neither assumes anything. Everything must be declared.&lt;/p&gt;

&lt;p&gt;The habits worth building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with zero permissions and add only what you can justify&lt;/li&gt;
&lt;li&gt;Test from inside the workload, not just from your terminal&lt;/li&gt;
&lt;li&gt;List subresources explicitly — they are never implied&lt;/li&gt;
&lt;li&gt;Know the difference between a 401, 403, and 404&lt;/li&gt;
&lt;li&gt;Give every workload its own ServiceAccount&lt;/li&gt;
&lt;li&gt;Avoid ClusterRoleBindings unless the requirement is genuinely cluster-wide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source
&lt;/h2&gt;

&lt;p&gt;Full manifests and project structure: &lt;a href="https://github.com/adil-khan-723/K8s-RBAC" rel="noopener noreferrer"&gt;github.com/adil-khan-723/K8s-RBAC&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've been writing RBAC policy without testing it from inside a running pod, this is a good starting point for closing that gap.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Built a Multi-Service Kubernetes App and Here's What Actually Broke</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 21 Feb 2026 13:40:56 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-1066</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-1066</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Multi-Service Kubernetes App and Here's What Actually Broke
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;This wasn't a "follow the tutorial" project. The goal was simple: understand how real distributed systems actually work inside Kubernetes.&lt;/p&gt;

&lt;p&gt;Not just "deploy containers," but truly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How services discover each other&lt;/li&gt;
&lt;li&gt;How internal networking routes traffic&lt;/li&gt;
&lt;li&gt;How ingress exposes applications externally&lt;/li&gt;
&lt;li&gt;How TLS termination works at the edge&lt;/li&gt;
&lt;li&gt;How secrets and configs propagate&lt;/li&gt;
&lt;li&gt;How rolling updates affect uptime&lt;/li&gt;
&lt;li&gt;The difference between stateful and stateless workloads&lt;/li&gt;
&lt;li&gt;How DNS resolution works across namespaces&lt;/li&gt;
&lt;li&gt;How to debug when things inevitably break&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a multi-service voting application that mirrors real production microservices architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Five independent services, each with a specific role:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voting Frontend&lt;/strong&gt; - Stateless web UI where users cast votes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results Frontend&lt;/strong&gt; - Stateless web UI displaying real-time results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; - Message queue for asynchronous processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; - Persistent database storing vote data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker Service&lt;/strong&gt; - Background processor consuming from queue and writing to database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The traffic flow follows a typical 3-tier distributed pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Frontend → Queue → Worker → Database → Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  System Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v8yv7naycab7g37atz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v8yv7naycab7g37atz2.png" alt="Kubernetes Voting App Architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates the complete system architecture showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Layer&lt;/strong&gt;: User traffic entering via HTTPS (port 8443)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Layer&lt;/strong&gt;: TLS termination and path-based routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Layer&lt;/strong&gt;: Stateless frontend services (voting and results)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Layer&lt;/strong&gt;: Redis (message queue) and PostgreSQL (persistent storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing Layer&lt;/strong&gt;: Worker service connecting queue to database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Design Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All internal communication uses ClusterIP Services&lt;/li&gt;
&lt;li&gt;External access is controlled through Ingress with host-based routing&lt;/li&gt;
&lt;li&gt;Services are isolated and communicate only through defined interfaces&lt;/li&gt;
&lt;li&gt;No hardcoded IPs anywhere - everything uses service discovery&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Kubernetes Components Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;p&gt;Deployments manage the stateless workloads in this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting frontend&lt;/li&gt;
&lt;li&gt;Results frontend
&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;Worker&lt;/li&gt;
&lt;li&gt;PostgreSQL (initially)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Deployments Enable:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling updates without downtime&lt;/li&gt;
&lt;li&gt;Declarative replica scaling&lt;/li&gt;
&lt;li&gt;Self-healing when pods crash&lt;/li&gt;
&lt;li&gt;Controlled rollout strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time I updated a deployment, Kubernetes created new pods, waited for them to be ready, then terminated old ones. Zero downtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  StatefulSets (The Deep Learning)
&lt;/h3&gt;

&lt;p&gt;StatefulSets were explored separately to understand how stateful workloads differ fundamentally from stateless ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key StatefulSet Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable, persistent pod identities (pod-0, pod-1, etc.)&lt;/li&gt;
&lt;li&gt;Ordered, graceful deployment and scaling&lt;/li&gt;
&lt;li&gt;Stable network identifiers via headless services&lt;/li&gt;
&lt;li&gt;Per-pod persistent storage that survives rescheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL was initially deployed as a Deployment. Then I migrated it to a StatefulSet to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How DNS works with headless services&lt;/li&gt;
&lt;li&gt;How PersistentVolumeClaims attach to specific pods&lt;/li&gt;
&lt;li&gt;Why ordered startup matters for clustered databases&lt;/li&gt;
&lt;li&gt;How rollout behavior changes with stateful workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Services: The Networking Glue
&lt;/h3&gt;

&lt;p&gt;Services provide stable networking in an environment where pod IPs are ephemeral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClusterIP Services (Internal):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a stable endpoint at &lt;code&gt;redis.default.svc.cluster.local&lt;/code&gt; that load-balances across Redis pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Learning:&lt;/strong&gt; Services are not pods. Services are stable DNS names that route to pods. When pods die and are recreated with new IPs, the Service continues working.&lt;/p&gt;
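&lt;p&gt;An easy way to watch this in action is a throwaway pod doing the lookup (the image tag is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Resolve the Service name from inside the cluster, then clean up
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup redis.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;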

&lt;h3&gt;
  
  
  Ingress: External Traffic Routing
&lt;/h3&gt;

&lt;p&gt;Ingress defines HTTP routing rules, but requires an Ingress Controller to actually process traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/vote&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voting-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/result&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Ingress Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User makes request to &lt;code&gt;oggy.local/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Request hits Ingress Controller&lt;/li&gt;
&lt;li&gt;Controller evaluates Ingress rules&lt;/li&gt;
&lt;li&gt;Traffic forwards to &lt;code&gt;voting-service&lt;/code&gt; on port 80&lt;/li&gt;
&lt;li&gt;Service load-balances to backend pods&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  TLS Termination
&lt;/h3&gt;

&lt;p&gt;This project implements TLS termination at the Ingress level, enabling HTTPS access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating TLS certificates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate self-signed certificate for local development&lt;/span&gt;
openssl req &lt;span class="nt"&gt;-x509&lt;/span&gt; &lt;span class="nt"&gt;-nodes&lt;/span&gt; &lt;span class="nt"&gt;-days&lt;/span&gt; 365 &lt;span class="nt"&gt;-newkey&lt;/span&gt; rsa:2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-keyout&lt;/span&gt; oggy.key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-out&lt;/span&gt; oggy.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s2"&gt;"/CN=oggy.local/O=oggy.local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating Kubernetes TLS Secret:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/tls&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-cert&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;tls.key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;base64-encoded-key&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or create it directly from files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret tls oggy-tls &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;oggy.crt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;oggy.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingress with TLS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oggy.local&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/vote&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voting-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the application is accessible via HTTPS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting: &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Results: &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Learning:&lt;/strong&gt; TLS termination at the Ingress means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic from user to Ingress Controller is encrypted (HTTPS)&lt;/li&gt;
&lt;li&gt;Traffic from Ingress to backend Services is unencrypted (HTTP)&lt;/li&gt;
&lt;li&gt;Certificates are managed centrally, not per-service&lt;/li&gt;
&lt;li&gt;Backend services don't need to handle TLS&lt;/li&gt;
&lt;/ul&gt;
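&lt;p&gt;A quick way to verify the TLS path end to end, assuming the ingress controller is reachable on local port 8443 (for example via a port-forward); &lt;code&gt;-k&lt;/code&gt; is needed because the certificate is self-signed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# --resolve maps oggy.local to the controller without editing /etc/hosts
curl -k --resolve oggy.local:8443:127.0.0.1 https://oggy.local:8443/vote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;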




&lt;h2&gt;
  
  
  Traffic Flow: Internal vs External
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Internal Communication
&lt;/h3&gt;

&lt;p&gt;All services communicate using DNS-based service discovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voting-frontend → redis.default.svc.cluster.local
worker → redis.default.svc.cluster.local  
worker → db.default.svc.cluster.local
result-frontend → db.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pod IPs. No hardcoded addresses. Pure service discovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  External Access
&lt;/h3&gt;

&lt;p&gt;Users access the application through HTTPS with TLS termination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voting:&lt;/strong&gt; &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results:&lt;/strong&gt; &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Ingress Controller handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS termination (decrypting HTTPS traffic)&lt;/li&gt;
&lt;li&gt;Path-based routing to appropriate Services&lt;/li&gt;
&lt;li&gt;Load balancing across backend pods&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Broke and How I Fixed It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: Pod IPs Keep Changing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; I initially tried connecting services using pod IPs. Pods got rescheduled, IPs changed, everything broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Pods are ephemeral. Their IPs are not stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use Services as stable endpoints. Services maintain consistent DNS names regardless of pod lifecycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: Hardcoding pod IP&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.244.0.5"&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Right: Using service name&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 2: Ingress Resources Did Nothing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Created Ingress resources. Nothing worked. Traffic never reached the apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Ingress resources are just configuration. They require an Ingress Controller to actually process traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Installed nginx-ingress controller separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.1/deploy/static/provider/cloud/deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Resource&lt;/strong&gt; = routing rules (the "what")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Controller&lt;/strong&gt; = traffic processor (the "how")&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem 3: Service Names Didn't Resolve Across Namespaces
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Services in different namespaces couldn't find each other using short names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; DNS resolution in Kubernetes is namespace-scoped by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use fully qualified domain names (FQDN):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Within same namespace&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;

&lt;span class="c1"&gt;# Across namespaces&lt;/span&gt;
&lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis.production.svc.cluster.local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS resolution follows this pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;redis&lt;/code&gt; → searches current namespace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis.production&lt;/code&gt; → searches production namespace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis.production.svc.cluster.local&lt;/code&gt; → explicit FQDN&lt;/li&gt;
&lt;/ol&gt;
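
&lt;p&gt;That search order isn't magic — kubelet writes the namespace's search domains into every pod's &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. A quick way to see it for yourself (the pod name &lt;code&gt;debug&lt;/code&gt; here is just a placeholder for any running pod):&lt;/p&gt;

```shell
# Show the DNS configuration Kubernetes injected into a pod
kubectl exec debug -- cat /etc/resolv.conf

# Typical contents for a pod in the "default" namespace:
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   nameserver 10.96.0.10
```

&lt;p&gt;The &lt;code&gt;search&lt;/code&gt; line is what turns the short name &lt;code&gt;redis&lt;/code&gt; into the full FQDN during resolution.&lt;/p&gt;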

&lt;h3&gt;
  
  
  Problem 4: Ingress Controller Wouldn't Schedule
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Ingress Controller pod stuck in &lt;code&gt;Pending&lt;/code&gt; state. Never scheduled to a node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Local cluster had node taints and labels that prevented scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Added tolerations and adjusted node selector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node-role.kubernetes.io/control-plane"&lt;/span&gt;
  &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exists"&lt;/span&gt;
  &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Cloud clusters and local clusters (kind, minikube) have different default configurations. Local clusters often taint control-plane nodes to prevent workload scheduling.&lt;/p&gt;
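
&lt;p&gt;If you hit the same &lt;code&gt;Pending&lt;/code&gt; state, it helps to inspect the taints directly before writing tolerations — these are standard &lt;code&gt;kubectl&lt;/code&gt; commands with no project-specific assumptions:&lt;/p&gt;

```shell
# Show taints on every node
kubectl describe nodes | grep -i -A 3 taints

# Same information as structured output: one line per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```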




&lt;h2&gt;
  
  
  Configuration Management: Secrets and ConfigMaps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ConfigMaps for Non-Sensitive Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis"&lt;/span&gt;
  &lt;span class="na"&gt;DB_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db"&lt;/span&gt;
  &lt;span class="na"&gt;DB_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;votes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ConfigMaps store configuration as key-value pairs that can be injected into pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secrets for Sensitive Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cG9zdGdyZXM=&lt;/span&gt;  &lt;span class="c1"&gt;# base64 encoded&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; Secrets are base64-encoded, not encrypted by default. For production, use encryption at rest or external secret managers (Vault, AWS Secrets Manager, etc.).&lt;/p&gt;
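
&lt;p&gt;The &lt;code&gt;data&lt;/code&gt; values above are produced with plain &lt;code&gt;base64&lt;/code&gt;, and anyone with read access to the Secret can reverse them just as easily — which is exactly why "encoded" is not "encrypted":&lt;/p&gt;

```shell
# Encode a value for a Secret manifest (-n avoids a trailing newline)
echo -n "postgres" | base64
# prints: cG9zdGdyZXM=

# Decoding it back takes one command — no key required
echo -n "cG9zdGdyZXM=" | base64 --decode
# prints: postgres
```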

&lt;h3&gt;
  
  
  Injecting Configuration into Pods
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;
  &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
  &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Rolling Updates: Zero-Downtime Deployments
&lt;/h2&gt;

&lt;p&gt;Kubernetes Deployments support rolling updates out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update Strategy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens during an update:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes creates 1 new pod (maxSurge: 1)&lt;/li&gt;
&lt;li&gt;Waits for new pod to be ready&lt;/li&gt;
&lt;li&gt;Terminates 1 old pod (maxUnavailable: 1)&lt;/li&gt;
&lt;li&gt;Repeats until all pods are updated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Testing rolling updates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update image version&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/voting-app voting-app&lt;span class="o"&gt;=&lt;/span&gt;voting-app:v2

&lt;span class="c"&gt;# Watch the rollout&lt;/span&gt;
kubectl rollout status deployment/voting-app

&lt;span class="c"&gt;# Rollback if needed&lt;/span&gt;
kubectl rollout undo deployment/voting-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero downtime. Zero manual intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Distributed Systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Essential Debugging Commands
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Check pod status:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
kubectl describe pod &amp;lt;pod-name&amp;gt;
kubectl logs &amp;lt;pod-name&amp;gt;
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;  &lt;span class="c"&gt;# logs from crashed container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test service connectivity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create debug pod&lt;/span&gt;
kubectl run debug &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; /bin/bash

&lt;span class="c"&gt;# Inside debug pod&lt;/span&gt;
nslookup redis
nslookup redis.default.svc.cluster.local
curl http://voting-service/vote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check ingress:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress
kubectl describe ingress app-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify service endpoints:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get endpoints redis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows which pod IPs the service is routing to. If empty, your selector doesn't match any pods.&lt;/p&gt;
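
&lt;p&gt;The most common cause of empty endpoints is a label mismatch. A minimal sketch (names and image are illustrative): the Service's &lt;code&gt;selector&lt;/code&gt; must match the pod template's &lt;code&gt;labels&lt;/code&gt; exactly.&lt;/p&gt;

```yaml
# Service side
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis        # must match the pod labels below
  ports:
  - port: 6379
---
# Deployment side (pod template)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis    # the label the Service selects on
    spec:
      containers:
      - name: redis
        image: redis:7
        ports:
        - containerPort: 6379
```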




&lt;h2&gt;
  
  
  Local Cluster Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating a kind cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create cluster with ingress port mappings&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 8080
    protocol: TCP
  - containerPort: 443
    hostPort: 8443
    protocol: TCP
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install nginx-ingress for kind:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure local DNS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add hostname to /etc/hosts&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"127.0.0.1 oggy.local"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/hosts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy the application:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; namespace.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; configMap.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; secrets.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; tls-secret.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Access the application:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP: &lt;code&gt;http://oggy.local:8080/vote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;HTTPS: &lt;code&gt;https://oggy.local:8443/vote&lt;/code&gt; (with TLS)&lt;/li&gt;
&lt;li&gt;Results: &lt;code&gt;https://oggy.local:8443/result&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Your browser will show a security warning for the self-signed certificate. This is expected for local development.&lt;/p&gt;
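
&lt;p&gt;For reference, the self-signed certificate behind that warning can be generated with a single &lt;code&gt;openssl&lt;/code&gt; command — a sketch assuming the &lt;code&gt;oggy.local&lt;/code&gt; hostname and the &lt;code&gt;tls-secret&lt;/code&gt; name used in this project:&lt;/p&gt;

```shell
# Create a self-signed cert/key pair for oggy.local, valid for one year
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout oggy.key -out oggy.crt \
  -days 365 -subj "/CN=oggy.local"

# Then load it into the cluster as the TLS Secret the Ingress references:
#   kubectl create secret tls tls-secret --cert=oggy.crt --key=oggy.key
```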




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── README.md
├── namespace.yaml              # Namespace definition
├── configMap.yaml              # ConfigMap for app configuration
├── secrets.yaml                # Secrets for sensitive data
├── tls-secret.yaml             # TLS certificates for HTTPS
├── oggy.crt                    # TLS certificate file
├── oggy.key                    # TLS private key file
├── deployment-postgres.yaml    # PostgreSQL Deployment
├── deployment-redis.yaml       # Redis Deployment
├── deployment-result.yaml      # Results Frontend Deployment
├── deployment-voting.yaml      # Voting Frontend Deployment
├── deployment-worker.yaml      # Worker Deployment
├── service-postgres.yaml       # PostgreSQL Service
├── service-redis.yaml          # Redis Service
├── service-results.yaml        # Results Service
├── service-voting.yaml         # Voting Service
└── ingress.yaml                # Ingress with TLS configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;This project isn't about running containers in Kubernetes. It's about understanding how Kubernetes actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental Models That Clicked:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes networking is service-driven, not pod-driven&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Pods are ephemeral. Services are stable. Always route through Services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingress requires both rules and a controller&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Rules define routing logic. Controllers implement the logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS resolution is namespace-scoped&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Short names work within namespaces. Cross-namespace requires FQDNs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local clusters behave differently than cloud clusters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Taints, tolerations, and storage classes vary significantly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;StatefulSets are fundamentally different from Deployments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Stable identities, ordered operations, and per-pod storage make stateful workloads possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once these mental models clicked, advanced Kubernetes concepts (NetworkPolicies, PodDisruptionBudgets, HorizontalPodAutoscalers) started making sense.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This project covers the fundamentals plus TLS termination. Real production systems add even more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated certificate management&lt;/strong&gt; with cert-manager (vs manual certificates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent volumes&lt;/strong&gt; with storage classes for stateful workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscalers&lt;/strong&gt; for dynamic scaling based on metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Policies&lt;/strong&gt; for pod-to-pod traffic control and security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits and requests&lt;/strong&gt; for scheduling and QoS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; (liveness and readiness probes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; with Prometheus and Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log aggregation&lt;/strong&gt; with ELK or Loki&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But you can't build those without understanding the fundamentals first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup and Deployment
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/yourusername/kubernetes-voting-app.git
&lt;span class="nb"&gt;cd &lt;/span&gt;kubernetes-voting-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
kubectl get svc
kubectl get ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cleanup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Official Kubernetes Docs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/" rel="noopener noreferrer"&gt;Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" rel="noopener noreferrer"&gt;Ingress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/" rel="noopener noreferrer"&gt;StatefulSets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools Used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind&lt;/a&gt; - Kubernetes IN Docker (used for this project)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.github.io/ingress-nginx/" rel="noopener noreferrer"&gt;nginx-ingress&lt;/a&gt; - Ingress Controller&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; - Kubernetes CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative local clusters:&lt;/strong&gt; minikube, k3s, Docker Desktop&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/kubernetes-sample-voting-app-project-tls.git" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions or feedback?&lt;/strong&gt; Drop a comment below. Happy to discuss Kubernetes architecture, debugging strategies, or anything else related to distributed systems.&lt;/p&gt;




&lt;h1&gt;
  
  
  #Kubernetes #DevOps #Microservices #Docker #CloudNative #DistributedSystems
&lt;/h1&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>microservices</category>
      <category>docker</category>
    </item>
    <item>
      <title>I Built a Multi-Service Kubernetes App and Here's What Actually Broke</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sat, 31 Jan 2026 06:32:21 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-4f99</link>
      <guid>https://forem.com/adil-khan-723/i-built-a-multi-service-kubernetes-app-and-heres-what-actually-broke-4f99</guid>
      <description>&lt;p&gt;I spent the last few weeks deploying a multi-service voting application on Kubernetes.&lt;/p&gt;

&lt;p&gt;Not because I needed a voting app. Because I needed to understand how Kubernetes actually handles real application traffic.&lt;/p&gt;

&lt;p&gt;There's a gap between running a single container in a pod and understanding how multiple services discover each other, how traffic flows internally, and how external requests actually reach your application.&lt;/p&gt;

&lt;p&gt;This project closed that gap for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A voting system with five independent components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voting frontend (where users vote)&lt;/li&gt;
&lt;li&gt;Results frontend (where users see results)&lt;/li&gt;
&lt;li&gt;Redis (acting as a queue)&lt;/li&gt;
&lt;li&gt;PostgreSQL (persistent storage)&lt;/li&gt;
&lt;li&gt;Worker service (processes votes asynchronously)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each component runs in its own container. Each is managed independently by Kubernetes. None of them know pod IPs. Everything communicates through service discovery.&lt;/p&gt;

&lt;p&gt;This mirrors how real microservices work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Isn't Random
&lt;/h2&gt;

&lt;p&gt;I didn't pick this setup arbitrarily. This is what actual distributed systems look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend services are stateless and can scale horizontally&lt;/li&gt;
&lt;li&gt;Data services are isolated for persistence&lt;/li&gt;
&lt;li&gt;Communication happens via stable network abstractions&lt;/li&gt;
&lt;li&gt;External traffic enters through a controlled entry point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes handles the orchestration. I needed to understand how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes Objects I Actually Used
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Deployments&lt;/strong&gt;&lt;br&gt;
These manage the workloads. They define replica counts and ensure pods get recreated if they fail. Every major component runs as a Deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods&lt;/strong&gt;&lt;br&gt;
The smallest unit Kubernetes schedules. They're ephemeral. They die and get recreated. You never access them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services&lt;/strong&gt;&lt;br&gt;
This is where it clicked for me. Services provide stable DNS names and IPs. Pods can change IPs constantly. Services don't. All internal communication goes through Services.&lt;/p&gt;

&lt;p&gt;I used two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ClusterIP&lt;/code&gt; for internal-only communication (Redis, PostgreSQL)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NodePort&lt;/code&gt; temporarily for testing frontends before I understood Ingress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ingress&lt;/strong&gt;&lt;br&gt;
Defines HTTP routing rules for external traffic. Host-based and path-based routing through a single entry point.&lt;/p&gt;

&lt;p&gt;Here's what tripped me up: Ingress resources don't do anything by themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingress Controller&lt;/strong&gt;&lt;br&gt;
This is the actual component that receives and processes traffic. It runs as a pod and dynamically configures itself based on Ingress rules.&lt;/p&gt;

&lt;p&gt;Without an Ingress Controller installed, your Ingress rules are useless. I learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Traffic Actually Flows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Internal Traffic
&lt;/h3&gt;

&lt;p&gt;Inside the cluster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Voting frontend sends votes to Redis using the Redis Service name&lt;/li&gt;
&lt;li&gt;Worker reads from Redis using the Redis Service name&lt;/li&gt;
&lt;li&gt;Worker writes results to PostgreSQL using the database Service name&lt;/li&gt;
&lt;li&gt;Results frontend reads from PostgreSQL using the database Service name&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No pod IPs anywhere. Service DNS gets resolved automatically by Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  External Traffic
&lt;/h3&gt;

&lt;p&gt;From the browser to the application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends HTTP request&lt;/li&gt;
&lt;li&gt;Request hits the Ingress Controller&lt;/li&gt;
&lt;li&gt;Ingress rules get evaluated&lt;/li&gt;
&lt;li&gt;Traffic forwards to the correct Service&lt;/li&gt;
&lt;li&gt;Service load-balances to backend pods&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ingress operates at the HTTP level. It's the production-grade way to expose applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Broke (and What I Learned)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pod IPs Keep Changing
&lt;/h3&gt;

&lt;p&gt;Pods were getting recreated automatically. Their IPs changed every time. Hardcoding IPs didn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use Services. Always. Services provide stable endpoints. This is what they're designed for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Types Confused Me
&lt;/h3&gt;

&lt;p&gt;I didn't understand why there were multiple Service types or when to use which one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;code&gt;ClusterIP&lt;/code&gt; is for internal communication only. &lt;code&gt;NodePort&lt;/code&gt; exposes services on node IPs (useful for testing, not for production). Ingress is the right way to handle external HTTP traffic.&lt;/p&gt;
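
&lt;p&gt;The difference between the two types is essentially one field. A minimal sketch (names and ports are illustrative):&lt;/p&gt;

```yaml
# ClusterIP (the default): reachable only from inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  type: ClusterIP
  selector:
    app: redis
  ports:
  - port: 6379
---
# NodePort: additionally opens a high port (30000-32767) on every node
apiVersion: v1
kind: Service
metadata:
  name: voting-service
spec:
  type: NodePort
  selector:
    app: voting-app
  ports:
  - port: 80
    nodePort: 30080
```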

&lt;h3&gt;
  
  
  Ingress Didn't Work
&lt;/h3&gt;

&lt;p&gt;I created Ingress resources. Traffic still wasn't reaching my apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; You need an Ingress Controller installed separately. The Ingress resource is just configuration. The controller is what actually processes traffic. Once I installed the controller, everything worked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress Controller Wouldn't Schedule
&lt;/h3&gt;

&lt;p&gt;The controller pod was stuck in pending state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; In my local cluster, I needed to fix node labels and tolerations so the controller could schedule on the control-plane node. Managed cloud clusters rarely have this problem, but it matters in local setups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Networking Doesn't Work Like Cloud
&lt;/h3&gt;

&lt;p&gt;External access from my browser didn't work directly in my container-based local cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Port forwarding. I forwarded the Ingress Controller port locally. This simulates how cloud load balancers work but adapted for local development.&lt;/p&gt;
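
&lt;p&gt;The forwarding itself is a single command. The namespace and service name below are the defaults for a standard nginx-ingress install, so adjust them for your setup:&lt;/p&gt;

```shell
# Forward local port 8080 to the Ingress Controller's port 80
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80

# In another terminal, requests now flow through the Ingress rules
curl http://localhost:8080/vote
```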

&lt;h3&gt;
  
  
  Service Names Didn't Resolve Everywhere
&lt;/h3&gt;

&lt;p&gt;Service names weren't resolving across namespaces or from outside the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Kubernetes service DNS is namespace-scoped by default. I learned to use fully qualified domain names when needed and understood where DNS resolution actually works.&lt;/p&gt;
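
&lt;p&gt;The pattern that made it click: a Service's fully qualified name is always &lt;code&gt;&amp;lt;service&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;. For example (the &lt;code&gt;backend&lt;/code&gt; namespace here is illustrative):&lt;/p&gt;

```yaml
# Resolves only from pods in the same namespace
REDIS_HOST: "redis"

# Resolves from any namespace in the cluster
REDIS_HOST: "redis.backend.svc.cluster.local"
```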

&lt;h2&gt;
  
  
  What I Actually Understand Now
&lt;/h2&gt;

&lt;p&gt;Before this project, I could write Kubernetes manifests. But I didn't really get how the pieces connected.&lt;/p&gt;

&lt;p&gt;Now I understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes networking is service-driven, not pod-driven&lt;/li&gt;
&lt;li&gt;Ingress needs both rules and a controller to function&lt;/li&gt;
&lt;li&gt;Local clusters behave differently than cloud clusters&lt;/li&gt;
&lt;li&gt;Service discovery happens through DNS, not hardcoded IPs&lt;/li&gt;
&lt;li&gt;Debugging requires understanding both the platform and the application&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This isn't about running containers. It's about understanding how Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routes traffic between services&lt;/li&gt;
&lt;li&gt;Discovers services dynamically&lt;/li&gt;
&lt;li&gt;Separates internal and external networking&lt;/li&gt;
&lt;li&gt;Enforces declarative state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this mental model clicked, advanced topics started making sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Takeaway
&lt;/h2&gt;

&lt;p&gt;Build it once to make it work.&lt;/p&gt;

&lt;p&gt;Break it to understand why it works.&lt;/p&gt;

&lt;p&gt;I could have just deployed this app using a tutorial and called it done. But I wouldn't have learned how service discovery actually functions, or why Ingress controllers exist, or what happens when pods get recreated.&lt;/p&gt;

&lt;p&gt;The debugging forced me to understand the platform, not just the syntax.&lt;/p&gt;

&lt;p&gt;If you're learning Kubernetes, pick a multi-service application and deploy it. Then break it. Then fix it. That's where the understanding comes from.&lt;/p&gt;

&lt;p&gt;What's been the hardest part of Kubernetes for you? Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full code and setup instructions:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/kubernetes-sample-voting-app-project1" rel="noopener noreferrer"&gt;kubernetes-sample-voting-app-project1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture diagram and detailed breakdown:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficz0tzrro0djdprzo044.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficz0tzrro0djdprzo044.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #kubernetes #devops #learning #microservices
&lt;/h1&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>devops</category>
      <category>network</category>
    </item>
    <item>
      <title>Building a Production-Style AWS ECS Platform with Terraform (Without Community Modules)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 19 Jan 2026 17:10:59 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/building-a-production-style-aws-ecs-platform-with-terraform-without-community-modules-4pnc</link>
      <guid>https://forem.com/adil-khan-723/building-a-production-style-aws-ecs-platform-with-terraform-without-community-modules-4pnc</guid>
      <description>&lt;p&gt;For the last three weeks, I've been building a production-style AWS infrastructure using Terraform, ECS (Fargate), Docker, and Jenkins.&lt;/p&gt;



&lt;h2&gt;
  
  
  ⚠️ Important Clarification
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I did not use community Terraform modules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every Terraform module in this project was written from scratch by me.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;This was not an accident. It was a conscious decision to trade speed for understanding.&lt;/p&gt;

&lt;p&gt;I didn't want to learn how to &lt;strong&gt;use&lt;/strong&gt; Terraform.&lt;br&gt;
I wanted to learn how infrastructure &lt;strong&gt;actually behaves&lt;/strong&gt; under real constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article documents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architecture I built&lt;/li&gt;
&lt;li&gt;How the system works end-to-end&lt;/li&gt;
&lt;li&gt;The real problems I ran into&lt;/li&gt;
&lt;li&gt;Why those problems mattered&lt;/li&gt;
&lt;li&gt;What this project fundamentally changed in how I think about infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Motivation: Why Build Everything From Scratch?
&lt;/h2&gt;

&lt;p&gt;Terraform community modules are powerful. They are also abstractions.&lt;/p&gt;

&lt;p&gt;After using them in the past, I realized something uncomfortable:&lt;/p&gt;

&lt;p&gt;I could deploy fairly complex infrastructure without truly understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why certain IAM permissions were required&lt;/li&gt;
&lt;li&gt;How traffic actually flowed through the network&lt;/li&gt;
&lt;li&gt;What Terraform needed during &lt;code&gt;plan&lt;/code&gt; vs &lt;code&gt;apply&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How ECS, ALBs, and IAM interact internally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So I imposed a hard rule on myself:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No community modules&lt;/li&gt;
&lt;li&gt;❌ No copy-pasting large IAM policies without understanding them&lt;/li&gt;
&lt;li&gt;❌ No "just works" defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If something broke, I wanted to know why it broke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This decision turned a "simple ECS project" into a deep learning exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ High-Level System Overview
&lt;/h2&gt;

&lt;p&gt;At a high level, this is a &lt;strong&gt;two-tier containerized application&lt;/strong&gt; deployed on AWS ECS using Fargate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filrbz6lyd3q581dgnn7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filrbz6lyd3q581dgnn7v.png" alt="diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is &lt;strong&gt;intentionally private by default&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Frontend Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Publicly accessible only through a &lt;strong&gt;Public Application Load Balancer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Runs as &lt;strong&gt;ECS Fargate tasks&lt;/strong&gt; inside private subnets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No public IPs&lt;/strong&gt; on ECS tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔒 Backend Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Completely private&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Accessible only via an &lt;strong&gt;Internal Application Load Balancer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Runs as &lt;strong&gt;ECS Fargate tasks&lt;/strong&gt; inside private subnets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero direct internet exposure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frontend never talks directly to the backend container.&lt;br&gt;
All communication flows through load balancers.&lt;/p&gt;

&lt;p&gt;This wasn't just architectural purity—it simplified security reasoning and debugging.&lt;/p&gt;



&lt;h3&gt;
  
  
  📋 Architecture at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Public ALB → ECS Fargate (private subnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Internal ALB → ECS Fargate (private subnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jenkins EC2 → Docker → ECR → Terraform → ECS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB locking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Networking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom VPC, NAT Gateway, multi-AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🌐 Networking Design: What "Private" Actually Means
&lt;/h2&gt;

&lt;p&gt;I created a custom VPC and treated networking as a first-class concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPC Layout
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🟢 Public Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Application Load Balancer&lt;/li&gt;
&lt;li&gt;Jenkins EC2 instance&lt;/li&gt;
&lt;li&gt;Internet Gateway attached&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔴 Private Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend ECS tasks&lt;/li&gt;
&lt;li&gt;Backend ECS tasks&lt;/li&gt;
&lt;li&gt;Internal Application Load Balancer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔄 Ingress Flow
&lt;/h3&gt;

&lt;p&gt;Internet → Public ALB → Frontend ECS (private subnet) → Internal ALB → Backend ECS (private subnet)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No other ingress paths exist.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚠️ Egress Reality Check
&lt;/h3&gt;

&lt;p&gt;One of the earliest failures I hit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ECS tasks couldn't pull images from ECR.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reason wasn't ECS or IAM.&lt;br&gt;
&lt;strong&gt;It was networking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private subnets do not magically have outbound internet access.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fixing this forced me to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ NAT Gateways&lt;/li&gt;
&lt;li&gt;✅ Route tables&lt;/li&gt;
&lt;li&gt;✅ Why "private subnet" doesn't mean "isolated from the world" by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This single issue reshaped how I think about AWS networking.&lt;/strong&gt;&lt;/p&gt;
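&lt;p&gt;To make the fix concrete, here's a minimal Terraform sketch of the missing pieces (resource names like &lt;code&gt;aws_nat_gateway.this&lt;/code&gt; are illustrative, not the project's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The NAT Gateway itself lives in a PUBLIC subnet and needs an Elastic IP
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
}

# Private subnets get a default route THROUGH the NAT Gateway
resource "aws_route" "private_egress" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without that route table entry, "private" quietly means "no egress at all", which is exactly why the image pulls failed.&lt;/p&gt;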




&lt;h2&gt;
  
  
  🐳 Containerization and Image Flow
&lt;/h2&gt;

&lt;p&gt;Docker is used for packaging both services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Images are tagged with &lt;strong&gt;Git commit SHA&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No &lt;code&gt;latest&lt;/code&gt; tags&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Every deployment is traceable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image flow looks like this:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jenkins builds Docker images&lt;/li&gt;
&lt;li&gt;Images are pushed to Amazon ECR&lt;/li&gt;
&lt;li&gt;ECS pulls images using the task execution role&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This strict immutability made rollbacks and debugging significantly easier.&lt;/p&gt;
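&lt;p&gt;If you want the registry to enforce this too, ECR can reject tag overwrites entirely. A sketch (the repository name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecr_repository" "frontend" {
  name = "frontend"

  # "IMMUTABLE" makes re-pushing an existing tag fail,
  # so a Git SHA tag can never silently change contents
  image_tag_mutability = "IMMUTABLE"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;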




&lt;h2&gt;
  
  
  🛠️ Terraform Architecture: Everything Modularized
&lt;/h2&gt;

&lt;p&gt;The entire infrastructure is provisioned using Terraform.&lt;/p&gt;

&lt;p&gt;But instead of one massive configuration, I designed &lt;strong&gt;small, focused modules&lt;/strong&gt;, each with a single responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 Custom Modules Include
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC&lt;/td&gt;
&lt;td&gt;Network foundation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnets&lt;/td&gt;
&lt;td&gt;Public/Private isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route Tables&lt;/td&gt;
&lt;td&gt;Traffic routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Groups&lt;/td&gt;
&lt;td&gt;Firewall rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public ALB&lt;/td&gt;
&lt;td&gt;Internet-facing load balancer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal ALB&lt;/td&gt;
&lt;td&gt;Private load balancer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Cluster&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Task Definitions&lt;/td&gt;
&lt;td&gt;Container specifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Services&lt;/td&gt;
&lt;td&gt;Service management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR Repository&lt;/td&gt;
&lt;td&gt;Image storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Roles &amp;amp; Policies&lt;/td&gt;
&lt;td&gt;Permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jenkins EC2&lt;/td&gt;
&lt;td&gt;CI/CD server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote Backend&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Each module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Exposes only required outputs&lt;/li&gt;
&lt;li&gt;✅ Avoids leaking internal resource details&lt;/li&gt;
&lt;li&gt;✅ Enforces clear dependency boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made the system easier to reason about—and easier to break in controlled ways.&lt;/p&gt;
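&lt;p&gt;As an example of what "exposes only required outputs" looks like in practice (file path and names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# modules/internal_alb/outputs.tf
output "dns_name" {
  value = aws_lb.internal.dns_name
}

output "target_group_arn" {
  value = aws_lb_target_group.backend.arn
}

# Listeners and security group wiring stay private to the module:
# consumers depend on these two outputs and nothing else.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;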




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flullmgm5r4mwjgtnqgp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flullmgm5r4mwjgtnqgp5.png" alt="terraform-resources" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Remote State and Locking
&lt;/h2&gt;

&lt;p&gt;Terraform state is stored remotely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; for state storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt; for state locking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This became critical once CI/CD entered the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without locking, concurrent applies from Jenkins would have been a disaster.&lt;/strong&gt;&lt;/p&gt;
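&lt;p&gt;The backend configuration for this is small. A sketch (bucket and table names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket  = "my-terraform-state-bucket"
    key     = "ecs-platform/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # DynamoDB table with a "LockID" partition key;
    # a second apply blocks until the lock is released
    dynamodb_table = "terraform-locks"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;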




&lt;h2&gt;
  
  
  🔄 CI/CD Design with Jenkins
&lt;/h2&gt;

&lt;p&gt;Jenkins runs on an &lt;strong&gt;EC2 instance inside the VPC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I intentionally separated &lt;strong&gt;CI and CD responsibilities&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔨 CI Pipeline (Build &amp;amp; Package)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Triggered on GitHub push:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checkout code&lt;/li&gt;
&lt;li&gt;Build frontend and backend Docker images&lt;/li&gt;
&lt;li&gt;Tag images with Git SHA&lt;/li&gt;
&lt;li&gt;Push images to ECR&lt;/li&gt;
&lt;li&gt;Export image tags as artifacts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✅ CI has no infrastructure permissions.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 CD Pipeline (Deploy via Terraform)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Triggered after CI completion:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch image tag artifacts&lt;/li&gt;
&lt;li&gt;Assume AWS IAM role via STS&lt;/li&gt;
&lt;li&gt;Run terraform init&lt;/li&gt;
&lt;li&gt;Run terraform apply&lt;/li&gt;
&lt;li&gt;Update ECS task definitions with new image versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✅ Terraform is the only deployment mechanism.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No manual ECS changes.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;No clicking in the console.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔐 IAM: The True Difficulty of the Project
&lt;/h2&gt;

&lt;p&gt;IAM was the &lt;strong&gt;hardest and most educational&lt;/strong&gt; part of this project.&lt;/p&gt;

&lt;p&gt;Because I didn't use community modules, I had to discover every one of IAM's sharp edges myself.&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 What I Learned About IAM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Terraform needs read permissions even during creation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt; fails without &lt;code&gt;Describe*&lt;/code&gt; and &lt;code&gt;Get*&lt;/code&gt; permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing permissions that broke my plans:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iam:GetPolicyVersion&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ec2:DescribeVpcAttribute&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;elasticloadbalancing:DescribeLoadBalancerAttributes&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ECS failures were often IAM issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Incorrect &lt;code&gt;iam:PassRole&lt;/code&gt; configuration&lt;/li&gt;
&lt;li&gt;❌ Confusion between &lt;strong&gt;execution role&lt;/strong&gt; vs &lt;strong&gt;task role&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
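&lt;p&gt;The execution-role vs task-role split is easiest to see in the task definition itself. A fragment (role names are illustrative, other required arguments omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecs_task_definition" "backend" {
  family = "backend"

  # Execution role: used by ECS itself to pull from ECR and write logs.
  # This is also the role the caller needs iam:PassRole for.
  execution_role_arn = aws_iam_role.task_execution.arn

  # Task role: assumed by YOUR container at runtime for app-level AWS calls
  task_role_arn = aws_iam_role.task.arn

  # (container_definitions, cpu, memory, network_mode, etc. omitted)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;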

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most of my "Terraform errors" were actually IAM design errors.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This project forced me to deeply understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Trust relationships&lt;/li&gt;
&lt;li&gt;✅ Role assumption via STS&lt;/li&gt;
&lt;li&gt;✅ Least-privilege policy design&lt;/li&gt;
&lt;li&gt;✅ How AWS services act on behalf of other services&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🐛 ECS Debugging: Nothing Is Isolated
&lt;/h2&gt;

&lt;p&gt;ECS failures required &lt;strong&gt;system-level thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A task failing could be caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Image pull failures&lt;/li&gt;
&lt;li&gt;❌ Missing IAM permissions&lt;/li&gt;
&lt;li&gt;❌ Incorrect security group rules&lt;/li&gt;
&lt;li&gt;❌ ALB health check mismatch&lt;/li&gt;
&lt;li&gt;❌ Networking misconfiguration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;There is no single log that tells the full story.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to understand how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS&lt;/li&gt;
&lt;li&gt;ALB&lt;/li&gt;
&lt;li&gt;IAM&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;work together.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project forced me to build that mental model.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Key Mistakes I Made (So You Don't Have To)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1️⃣ Assuming private subnets get outbound internet automatically
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality:&lt;/strong&gt; Private subnets need NAT Gateway + route table configuration&lt;/li&gt;
&lt;li&gt;Cost me hours debugging ECR pull failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2️⃣ Treating IAM as an afterthought
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IAM should be designed &lt;strong&gt;first&lt;/strong&gt;, not patched later&lt;/li&gt;
&lt;li&gt;Most "Terraform errors" were actually IAM design errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3️⃣ Not separating CI and CD early
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initially mixed build and deploy logic&lt;/li&gt;
&lt;li&gt;Separation made debugging and security much cleaner&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4️⃣ Underestimating security group complexity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Had to trace traffic flow through multiple layers&lt;/li&gt;
&lt;li&gt;One missing rule broke the entire deployment&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  📊 Project by the Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Terraform modules written&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS resources managed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM policies debugged&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too many to count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker images built&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failed deployments before success&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🎓 What This Project Gave Me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Confidence writing Terraform modules from scratch&lt;/li&gt;
&lt;li&gt;✅ Strong understanding of IAM trust and permission boundaries&lt;/li&gt;
&lt;li&gt;✅ Practical experience debugging ECS + ALB + networking&lt;/li&gt;
&lt;li&gt;✅ Clear mental separation of CI vs CD&lt;/li&gt;
&lt;li&gt;✅ Comfort with production-style AWS constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;More importantly, it taught me how to reason about infrastructure instead of guessing.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛑 Why I'm Stopping Here
&lt;/h2&gt;

&lt;p&gt;This project achieved its learning goal.&lt;/p&gt;

&lt;p&gt;Continuing to polish it would bring diminishing returns.&lt;/p&gt;

&lt;p&gt;The biggest gap in my skill set now is &lt;strong&gt;Kubernetes&lt;/strong&gt;, and I'm moving there next—with the same approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No shortcuts&lt;/li&gt;
&lt;li&gt;✅ No blind abstractions&lt;/li&gt;
&lt;li&gt;✅ Build it, break it, debug it&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  💭 Final Thought
&lt;/h2&gt;
&lt;h3&gt;
  
  
  If you're learning DevOps:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't just use Terraform modules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write them. Break them. Fix them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's where real understanding comes from.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Questions for the Community
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🤔 What's the most painful IAM issue you've debugged?&lt;/li&gt;
&lt;li&gt;🤔 Do you prefer community modules or custom modules for learning?&lt;/li&gt;
&lt;li&gt;🤔 What infrastructure topic should I tackle next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Drop a comment—I read and respond to all of them.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #terraform #devops #ecs #docker #jenkins #infrastructure #cicd #learning #iac&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;If you found this helpful, give it a ❤️ and follow for more deep-dive DevOps content!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>docker</category>
      <category>jenkins</category>
    </item>
    <item>
      <title>I Took a Working Terraform Project and Rebuilt It Properly (ALB + EC2 + Modules + Remote State)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sun, 04 Jan 2026 16:24:29 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/i-took-a-working-terraform-project-and-rebuilt-it-properly-alb-ec2-modules-remote-state-4dpa</link>
      <guid>https://forem.com/adil-khan-723/i-took-a-working-terraform-project-and-rebuilt-it-properly-alb-ec2-modules-remote-state-4dpa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A lot of Terraform projects reach a point where things work — an EC2 launches, a load balancer responds, and &lt;code&gt;terraform apply&lt;/code&gt; finishes without errors.&lt;/p&gt;

&lt;p&gt;I reached that point too.&lt;/p&gt;

&lt;p&gt;But after spending more time with Terraform, I realized something uncomfortable: &lt;strong&gt;working infrastructure doesn't necessarily mean well-designed infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So instead of moving on, I decided to stop and rebuild the same project again — this time focusing on structure, clarity, and how the code would behave if it had to grow or be maintained.&lt;/p&gt;

&lt;p&gt;This post is a walkthrough of that process: starting from a non-modular Terraform setup and gradually refactoring it into a modular one, while dealing with the confusion, mistakes, and "aha" moments along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Set Out to Build
&lt;/h2&gt;

&lt;p&gt;The goal was simple in terms of resources, but intentional in design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple EC2 instances running Nginx&lt;/li&gt;
&lt;li&gt;An Application Load Balancer distributing traffic&lt;/li&gt;
&lt;li&gt;Separate security groups for ALB and EC2&lt;/li&gt;
&lt;li&gt;Dynamic target group registration&lt;/li&gt;
&lt;li&gt;Terraform remote state stored in S3&lt;/li&gt;
&lt;li&gt;State locking using DynamoDB&lt;/li&gt;
&lt;li&gt;A fully modular Terraform structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing exotic — just common AWS building blocks — but wired together in a way that reflects how Terraform is actually used beyond tutorials.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Did Not Start with Modules
&lt;/h2&gt;

&lt;p&gt;At first, I had a non-modular version of this project.&lt;/p&gt;

&lt;p&gt;Everything was in one place.&lt;br&gt;&lt;br&gt;
Resources referenced each other directly.&lt;br&gt;&lt;br&gt;
It worked.&lt;/p&gt;

&lt;p&gt;But that version taught me &lt;strong&gt;how Terraform executes&lt;/strong&gt;, not &lt;strong&gt;how Terraform should be structured&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before modularizing, I wanted to clearly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;count&lt;/code&gt; and &lt;code&gt;for_each&lt;/code&gt; really behave&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;count.index&lt;/code&gt; can cause problems later&lt;/li&gt;
&lt;li&gt;How Terraform decides resource identity&lt;/li&gt;
&lt;li&gt;What happens when you change inputs after resources already exist&lt;/li&gt;
&lt;li&gt;How state is affected when multiple resources depend on each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after seeing those problems firsthand did modularization start to make sense.&lt;/p&gt;


&lt;h2&gt;
  
  
  The First Real Shift: Stop Using &lt;code&gt;count&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;One of the biggest changes I made was moving away from &lt;code&gt;count&lt;/code&gt; and using &lt;code&gt;for_each&lt;/code&gt; everywhere.&lt;/p&gt;

&lt;p&gt;Instead of creating instances like "instance 0, 1, 2", I switched to maps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;instance-1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;instance-2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;instance-3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This immediately made things clearer.&lt;/p&gt;
&lt;h3&gt;
  
  
  With &lt;code&gt;for_each&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Resource names stay stable&lt;/li&gt;
&lt;li&gt;Outputs are predictable&lt;/li&gt;
&lt;li&gt;Wiring resources together becomes much easier&lt;/li&gt;
&lt;li&gt;You stop relying on numeric positions and start relying on intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this clicked, the rest of the project design became much cleaner.&lt;/p&gt;
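&lt;p&gt;A side-by-side sketch of the difference (the two resources wouldn't coexist in one file; names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# With count, identity is positional: removing the middle instance
# shifts every index after it and forces replacements.
resource "aws_instance" "web" {
  count     = 3
  subnet_id = var.subnet_ids[count.index]
}

# With for_each, identity is the map key: removing "instance-2"
# touches only that one resource.
resource "aws_instance" "web" {
  for_each  = var.instances   # { "instance-1" = "subnet-a", ... }
  subnet_id = each.value
  tags      = { Name = each.key }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;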


&lt;h2&gt;
  
  
  How Instance Creation is Handled
&lt;/h2&gt;

&lt;p&gt;In the root module, I generate a map that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;instance-name → subnet-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subnets are chosen dynamically using modulo logic so instances are spread across availability zones.&lt;/p&gt;
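&lt;p&gt;The map generation looks roughly like this (variable names are mine, not necessarily the repo's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  instance_names = ["instance-1", "instance-2", "instance-3"]

  # Round-robin instances across subnets with modulo
  instance_subnet_map = {
    for i, name in local.instance_names :
    name =&gt; var.subnet_ids[i % length(var.subnet_ids)]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;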

&lt;p&gt;That map is passed into the EC2 module.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inside the EC2 module:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;for_each&lt;/code&gt; iterates over the map&lt;/li&gt;
&lt;li&gt;Each key becomes the instance &lt;code&gt;Name&lt;/code&gt; tag&lt;/li&gt;
&lt;li&gt;Each value becomes the &lt;code&gt;subnet_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps responsibility clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;root module&lt;/strong&gt; decides what should exist&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;EC2 module&lt;/strong&gt; decides how instances are created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation turned out to be very important later.&lt;/p&gt;




&lt;h2&gt;
  
  
  EC2 Module Design
&lt;/h2&gt;

&lt;p&gt;The EC2 module does only one job: &lt;strong&gt;create EC2 instances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does not decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many instances exist&lt;/li&gt;
&lt;li&gt;Which subnets to use&lt;/li&gt;
&lt;li&gt;How traffic reaches them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Inputs include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AMI&lt;/li&gt;
&lt;li&gt;Instance type&lt;/li&gt;
&lt;li&gt;Key name&lt;/li&gt;
&lt;li&gt;Security group IDs&lt;/li&gt;
&lt;li&gt;A map of instance names to subnet IDs&lt;/li&gt;
&lt;li&gt;Optional user data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outputs return maps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Instance IDs&lt;/li&gt;
&lt;li&gt;Private IPs&lt;/li&gt;
&lt;li&gt;ARNs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Returning &lt;strong&gt;maps instead of lists&lt;/strong&gt; keeps instance identity intact when passing data to other modules.&lt;/p&gt;
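&lt;p&gt;Concretely, the outputs look something like this (file path and names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# modules/ec2/outputs.tf: keyed by instance name, not list position
output "instance_ids" {
  value = { for name, inst in aws_instance.this : name =&gt; inst.id }
}

output "private_ips" {
  value = { for name, inst in aws_instance.this : name =&gt; inst.private_ip }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;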




&lt;h2&gt;
  
  
  Security Groups: Keeping Things Isolated
&lt;/h2&gt;

&lt;p&gt;Instead of putting everything into one security group, I created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One security group for the ALB&lt;/li&gt;
&lt;li&gt;One security group for the EC2 instances&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The ALB security group:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allows inbound HTTP from the internet&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The EC2 security group:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allows inbound traffic &lt;strong&gt;only from the ALB security group&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Allows SSH only from a restricted CIDR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup drastically reduces exposure and makes traffic flow explicit instead of implicit.&lt;/p&gt;

&lt;p&gt;The security group module accepts ingress and egress rules as &lt;strong&gt;maps of objects&lt;/strong&gt;, which made it flexible without being complicated.&lt;/p&gt;
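&lt;p&gt;The key detail is that the EC2 ingress rule points at the ALB's security group ID instead of a CIDR block. A sketch (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Only traffic originating from the ALB's SG can reach the instances
resource "aws_security_group_rule" "ec2_from_alb" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;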




&lt;h2&gt;
  
  
  Load Balancer and Target Registration
&lt;/h2&gt;

&lt;p&gt;The load balancer module handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB creation&lt;/li&gt;
&lt;li&gt;Target group creation&lt;/li&gt;
&lt;li&gt;Listener configuration&lt;/li&gt;
&lt;li&gt;Target group attachments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that the ALB module &lt;strong&gt;does not care how EC2 instances are created&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It simply accepts a map of instance IDs.&lt;/p&gt;

&lt;p&gt;Inside the module, it loops over that map and attaches each instance to the target group dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No hardcoded references.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;No assumptions.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Just clean inputs and outputs.&lt;/strong&gt;&lt;/p&gt;
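&lt;p&gt;The attachment loop is short thanks to the map-shaped output from the EC2 module (names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_target_group_attachment" "this" {
  for_each         = var.instance_ids   # map: instance-name =&gt; instance-id
  target_group_arn = aws_lb_target_group.this.arn
  target_id        = each.value
  port             = 80
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;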




&lt;h2&gt;
  
  
  Remote State and Locking
&lt;/h2&gt;

&lt;p&gt;Terraform state is stored remotely in S3, with DynamoDB used for state locking.&lt;/p&gt;

&lt;p&gt;I intentionally included this even though I was working alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because this is where Terraform usage changes completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remote state with locking:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prevents concurrent applies&lt;/li&gt;
&lt;li&gt;Prevents accidental corruption&lt;/li&gt;
&lt;li&gt;Forces you to think about Terraform as a shared system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you use this setup, going back to local state feels wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  User Data and Verification
&lt;/h2&gt;

&lt;p&gt;Each EC2 instance runs a simple user data script that installs Nginx and serves a response identifying the instance.&lt;/p&gt;

&lt;p&gt;This made it easy to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_data&lt;/code&gt; execution&lt;/li&gt;
&lt;li&gt;Instance uniqueness&lt;/li&gt;
&lt;li&gt;Load balancer distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing traffic rotate across instances confirmed that everything was wired correctly.&lt;/p&gt;
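&lt;p&gt;The user data itself is a short bootstrap script, roughly like this (distro and paths are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
  # ... other arguments ...

  user_data = &lt;&lt;-EOF
    #!/bin/bash
    apt-get update -y
    apt-get install -y nginx
    # Identify which instance answered, so ALB rotation is visible
    echo "Served by $(hostname)" &gt; /var/www/html/index.html
    systemctl enable --now nginx
  EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;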




&lt;h2&gt;
  
  
  Challenges I Ran Into
&lt;/h2&gt;

&lt;p&gt;Some things that took time to understand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indexing errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why indexing errors happen with data sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;count breaks identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why count breaks identity when things change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Module outputs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why module outputs should usually preserve structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security group references&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How security group references differ from CIDR rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User data behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Why user data doesn't rerun unless instances are replaced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DynamoDB locking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How DynamoDB locking behaves during apply&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each issue forced me to slow down and actually read what Terraform was doing instead of guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Represents for Me
&lt;/h2&gt;

&lt;p&gt;This project wasn't about adding more AWS services.&lt;/p&gt;

&lt;p&gt;It was about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing Terraform that is readable&lt;/li&gt;
&lt;li&gt;Making dependencies explicit&lt;/li&gt;
&lt;li&gt;Reducing assumptions&lt;/li&gt;
&lt;li&gt;Designing for change instead of just "apply success"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The biggest shift wasn't technical — it was mental.
&lt;/h3&gt;

&lt;p&gt;I stopped asking:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;"Does this work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And started asking:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;"Does this make sense if I come back in three months?"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Some natural extensions to this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto Scaling Groups&lt;/li&gt;
&lt;li&gt;HTTPS with ACM&lt;/li&gt;
&lt;li&gt;Monitoring and alarms&lt;/li&gt;
&lt;li&gt;CI/CD for Terraform&lt;/li&gt;
&lt;li&gt;Environment separation&lt;/li&gt;
&lt;li&gt;ECS or EKS later on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those only make sense once the foundation is solid.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Terraform feels difficult when you treat it like a scripting tool.&lt;/p&gt;

&lt;p&gt;It becomes much clearer when you treat it like a &lt;strong&gt;design tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Building something twice — once messy, once structured — taught me more than any single tutorial ever could.&lt;/p&gt;

&lt;h3&gt;
  
  
  If you're learning Terraform and feel stuck, my honest advice is:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Build it once just to make it work.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Then rebuild it to make it right.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's where the learning actually happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/adil-khan-723/terraform-project2-moudlarized" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub: terraform-project2-moudlarized&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect with Me
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/adilk3682" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; • &lt;a href="https://github.com/adil-khan-723" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; • &lt;a href="mailto:adilk81054@gmail.com"&gt;Email&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;If you found this helpful, consider giving the repository a star!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>infrastructureascode</category>
      <category>aws</category>
    </item>
    <item>
      <title>Why Refactoring AWS Infrastructure Taught Me More Than Building It</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 07:17:44 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/why-refactoring-aws-infrastructure-taught-me-more-than-building-it-3iml</link>
      <guid>https://forem.com/adil-khan-723/why-refactoring-aws-infrastructure-taught-me-more-than-building-it-3iml</guid>
      <description>&lt;p&gt;Most infrastructure projects work the first time because we push them until they do. But working infrastructure isn't the same as well-designed infrastructure.&lt;/p&gt;

&lt;p&gt;Six months ago, I built an AWS infrastructure with Terraform. It worked. I was proud. Last week, I looked at that same code and cringed.&lt;/p&gt;

&lt;p&gt;This is the story of what I learned by tearing it down and rebuilding it properly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Background and Motivation
&lt;/h2&gt;

&lt;p&gt;The original version of this project was built to validate concepts quickly. It provisioned EC2 instances, placed them behind a load balancer, and served traffic successfully. At the time, that felt like success.&lt;/p&gt;

&lt;p&gt;But after gaining more exposure to Terraform patterns and real-world infrastructure practices, revisiting the code made the gaps obvious. &lt;strong&gt;Decisions were made because they worked, not because they were well thought out.&lt;/strong&gt; Dependencies were forced instead of modeled. State handling existed, but wasn't fully understood.&lt;/p&gt;

&lt;p&gt;This refactor was an attempt to slow down and rebuild the same infrastructure while focusing on clarity, correctness, and maintainability.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Project Builds
&lt;/h2&gt;

&lt;p&gt;This project provisions a small but realistic AWS infrastructure stack using Terraform. The goal is not application complexity, but &lt;strong&gt;infrastructure correctness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The setup includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Multiple EC2 instances&lt;/li&gt;
&lt;li&gt;✅ An Application Load Balancer in front of them&lt;/li&gt;
&lt;li&gt;✅ A target group with health checks&lt;/li&gt;
&lt;li&gt;✅ Security groups enforcing clear traffic flow&lt;/li&gt;
&lt;li&gt;✅ Remote Terraform state with locking&lt;/li&gt;
&lt;li&gt;✅ Instance bootstrapping using user data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each EC2 instance runs Nginx and serves a simple page identifying the instance. This makes it easy to visually confirm load balancing behavior and instance health.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Internet
                       │
                       ▼
         ┌─────────────────────────┐
         │  Application Load       │
         │     Balancer (ALB)      │
         └─────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │      Target Group         │
         │    (Health Checks)        │
         └─────────────┬─────────────┘
                       │
      ┌────────────────┼────────────────┐
      │                │                │
┌─────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
│ EC2       │    │ EC2      │    │ EC2      │
│ Instance  │    │ Instance │    │ Instance │
│ (Nginx)   │    │ (Nginx)  │    │ (Nginx)  │
└───────────┘    └──────────┘    └──────────┘
  Subnet 1         Subnet 2        Subnet 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The infrastructure uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Default VPC&lt;/strong&gt; (intentionally chosen for this learning project, though production workloads should always use custom VPCs with proper CIDR planning)&lt;/li&gt;
&lt;li&gt;Public Application Load Balancer&lt;/li&gt;
&lt;li&gt;EC2 instances distributed across available subnets&lt;/li&gt;
&lt;li&gt;Target group attached to the ALB&lt;/li&gt;
&lt;li&gt;Security groups controlling inbound and outbound traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom VPC was intentionally avoided. The purpose of this project was not network design, but Terraform fundamentals: state management, resource relationships, dynamic creation, and clean structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Original Version&lt;/th&gt;
&lt;th&gt;Refactored Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instance Creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;count&lt;/code&gt; with hardcoded values&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;for_each&lt;/code&gt; with dynamic mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subnet Assignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual/hardcoded&lt;/td&gt;
&lt;td&gt;Modulo arithmetic distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;depends_on&lt;/code&gt; everywhere&lt;/td&gt;
&lt;td&gt;Implicit dependency graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local state file&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB locking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Overly permissive&lt;/td&gt;
&lt;td&gt;Principle of least privilege&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Lines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;487 lines&lt;/td&gt;
&lt;td&gt;312 lines (-36%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardcoded Values&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15+ IDs/ARNs&lt;/td&gt;
&lt;td&gt;0 (all dynamic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Terraform Concepts Applied
&lt;/h2&gt;

&lt;p&gt;This refactor focused heavily on using Terraform the way it's intended to be used.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Remote State Management
&lt;/h3&gt;

&lt;p&gt;Terraform state is stored in an S3 bucket with DynamoDB used for state locking. This prevents concurrent state corruption and reflects how Terraform is used in real team environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"oggy-backend-bucket"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Alb-project-non-module/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ap-south-1"&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"stateLock-table"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Data Sources
&lt;/h3&gt;

&lt;p&gt;Instead of hardcoding values, data sources are used to dynamically fetch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The default VPC&lt;/li&gt;
&lt;li&gt;Subnets within the VPC&lt;/li&gt;
&lt;li&gt;The latest Ubuntu 24.04 LTS AMI
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnets"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-id"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ami"&lt;/span&gt; &lt;span class="s2"&gt;"ubuntu"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;owners&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"099720109477"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Canonical&lt;/span&gt;

  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Dynamic Resource Creation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# Fragile, breaks if subnets change&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"instance-${i + 1}"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="err"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default_subnets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"vms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;instances&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ubuntu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type_instance&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EC2 instances are created dynamically using &lt;code&gt;for_each&lt;/code&gt; rather than static counts. This improves clarity, stability, and scalability of the configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Locals for Computed Logic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instances&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-1"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-2"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="s2"&gt;"web-instance-3"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of writing out entries like these by hand, the map is computed in locals, keeping the resource blocks clean and readable.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Implicit Dependencies
&lt;/h3&gt;

&lt;p&gt;Rather than forcing execution order with &lt;code&gt;depends_on&lt;/code&gt;, resource relationships define the dependency graph naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original version had 8 explicit &lt;code&gt;depends_on&lt;/code&gt; blocks. The refactored version has 0.&lt;/strong&gt;&lt;/p&gt;
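&lt;p&gt;What replaced them is nothing special: referencing another resource's attribute is what creates the edge in the graph. A hedged sketch (the target group name is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# The attachment references both the target group and the instance,
# so Terraform orders creation correctly without depends_on.
resource "aws_lb_target_group_attachment" "web" {
  for_each         = aws_instance.vms
  target_group_arn = aws_lb_target_group.web.arn # implicit dependency
  target_id        = each.value.id               # implicit dependency
  port             = 80
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;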




&lt;h2&gt;
  
  
  Dynamic Instance and Subnet Distribution
&lt;/h2&gt;

&lt;p&gt;One of the most valuable improvements in this refactor was how instances are distributed across subnets.&lt;/p&gt;

&lt;p&gt;Instead of manually mapping instances to subnets, &lt;strong&gt;modulo arithmetic&lt;/strong&gt; is used to assign instances evenly across all available subnets.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Modulo arithmetic (&lt;code&gt;index % subnet_count&lt;/code&gt;) ensures even distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With 3 subnets and 6 instances:

&lt;ul&gt;
&lt;li&gt;Instances 0, 3 → Subnet 0&lt;/li&gt;
&lt;li&gt;Instances 1, 4 → Subnet 1&lt;/li&gt;
&lt;li&gt;Instances 2, 5 → Subnet 2&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Avoids hardcoding subnet IDs&lt;/li&gt;
&lt;li&gt;✅ Scales automatically if subnets change&lt;/li&gt;
&lt;li&gt;✅ Produces deterministic and predictable placement&lt;/li&gt;
&lt;li&gt;✅ Works across any number of availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This logic alone made the configuration significantly more robust than the original version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security Group Design
&lt;/h2&gt;

&lt;p&gt;Security groups are designed with intent rather than convenience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ALB Security Group&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"alb"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"alb-security-group"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow HTTP inbound traffic"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# EC2 Security Group&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"ec2"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ec2-security-group"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow traffic from ALB only"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Application Load Balancer allows inbound HTTP traffic from anywhere&lt;/li&gt;
&lt;li&gt;EC2 instances &lt;strong&gt;only&lt;/strong&gt; allow inbound traffic from the ALB security group&lt;/li&gt;
&lt;li&gt;Outbound traffic is permitted for updates and health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enforces a clear and understandable traffic flow: &lt;strong&gt;public access ends at the load balancer, and instances remain protected behind it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Bootstrapping with User Data
&lt;/h2&gt;

&lt;p&gt;Each EC2 instance is bootstrapped at launch using a user data script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
apt-get update &lt;span class="nt"&gt;-y&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx

&lt;span class="nv"&gt;INSTANCE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;instance_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /var/www/html/index.html &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;Instance: &lt;/span&gt;&lt;span class="nv"&gt;$INSTANCE_NAME&lt;/span&gt;&lt;span class="sh"&gt;&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;Hello from &lt;/span&gt;&lt;span class="nv"&gt;$INSTANCE_NAME&lt;/span&gt;&lt;span class="sh"&gt;&amp;lt;/h1&amp;gt;
    &amp;lt;p&amp;gt;This instance is managed by Terraform&amp;lt;/p&amp;gt;
    &amp;lt;p&amp;gt;Load balancing is working correctly!&amp;lt;/p&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nginx
systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates the system&lt;/li&gt;
&lt;li&gt;Installs Nginx&lt;/li&gt;
&lt;li&gt;Serves a simple instance-specific web page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes validation straightforward. If the ALB DNS shows responses from different instances, both provisioning and health checks are working as expected.&lt;/p&gt;
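&lt;p&gt;The &lt;code&gt;${instance_name}&lt;/code&gt; placeholder in the script is a Terraform template variable, rendered per instance at plan time. One way to wire it up (the file name &lt;code&gt;userdata.sh&lt;/code&gt; is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_instance" "vms" {
  for_each = local.instances
  # ... ami, instance_type, subnet_id as shown earlier ...

  # Render the script once per instance, filling in its name.
  user_data = templatefile("${path.module}/userdata.sh", {
    instance_name = each.key
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;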




&lt;h2&gt;
  
  
  What Broke During Refactoring (And Why That's Good)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: State Lock Timeout
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; First remote state migration failed with lock timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Didn't understand DynamoDB table requirements properly. The table needed a primary key named &lt;code&gt;LockID&lt;/code&gt; (case-sensitive).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; State locking isn't automatic—it requires proper table schema. Reading error messages carefully saves hours.&lt;/p&gt;
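&lt;p&gt;For reference, a lock table with the schema Terraform expects looks roughly like this (the table name matches the backend configuration shown earlier; the billing mode is a matter of preference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_dynamodb_table" "state_lock" {
  name         = "stateLock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # must be exactly this, case-sensitive

  attribute {
    name = "LockID"
    type = "S" # string
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;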

&lt;h3&gt;
  
  
  Issue 2: Target Group Attachment Race Condition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Instances registered before they were ready, causing initial health check failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Health check thresholds were too aggressive, and the user data script needs time to finish installing Nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; AWS eventual consistency requires patience in automation. Added sensible health check intervals and thresholds (grace periods as such only exist once an Auto Scaling Group is involved).&lt;/p&gt;
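&lt;p&gt;The tuning lives in the target group's &lt;code&gt;health_check&lt;/code&gt; block; values along these lines give instances time to finish bootstrapping (exact numbers and the resource name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_lb_target_group" "web" {
  port     = 80
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    path                = "/"
    interval            = 30 # seconds between checks
    timeout             = 5
    healthy_threshold   = 2  # consecutive passes before "healthy"
    unhealthy_threshold = 3
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;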

&lt;h3&gt;
  
  
  Issue 3: Circular Dependency with Security Groups
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Terraform complained about circular dependencies when trying to reference security groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Tried to be too clever with cross-referencing security group rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Sometimes the simplest approach is the best. Separated ingress rules into distinct resources when needed.&lt;/p&gt;
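&lt;p&gt;Standalone rule resources break the cycle because both groups are created first and the rule references them afterwards. A sketch of the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Both security groups exist before this rule is evaluated,
# so there is no cycle between their definitions.
resource "aws_security_group_rule" "alb_to_ec2" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;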

&lt;h3&gt;
  
  
  Issue 4: Subnet Data Source Returned Unexpected Results
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Got 6 subnets instead of expected 3 in my default VPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The default VPC contains one default subnet per Availability Zone, plus any subnets added to it later, and the data source returns all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Always validate data source outputs. Added filters to ensure I'm only using the subnets I intend to use.&lt;/p&gt;
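&lt;p&gt;One such filter restricts the query to the default subnet of each Availability Zone (whether this is the right filter depends on how the extra subnets were created):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  # Keep only the default subnet per AZ, excluding manually added ones.
  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;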




&lt;h2&gt;
  
  
  Measurable Improvements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code&lt;/td&gt;
&lt;td&gt;487&lt;/td&gt;
&lt;td&gt;312&lt;/td&gt;
&lt;td&gt;-36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit dependencies&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardcoded values&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State lock conflicts&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnet scalability&lt;/td&gt;
&lt;td&gt;Fixed to 3&lt;/td&gt;
&lt;td&gt;Dynamic&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code readability&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Other Learners
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. State Management Isn't Optional
&lt;/h3&gt;

&lt;p&gt;Even for learning projects, use remote state. The habits matter more than the project size. I spent 2 hours debugging a state corruption issue that would have been prevented by proper locking.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic Beats Static
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;for_each&lt;/code&gt; is harder to learn than &lt;code&gt;count&lt;/code&gt;, but it's worth the investment. The flexibility and clarity it provides compounds over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Read the Dependency Graph
&lt;/h3&gt;

&lt;p&gt;Run this command and actually look at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform graph | dot &lt;span class="nt"&gt;-Tpng&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; graph.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will show you what Terraform actually understands about your infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Refactoring &amp;gt; New Projects
&lt;/h3&gt;

&lt;p&gt;Building something new teaches you syntax. Rebuilding teaches you design. I learned more in this refactor than in the original build.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Document Your Mistakes
&lt;/h3&gt;

&lt;p&gt;My original code had 8 explicit &lt;code&gt;depends_on&lt;/code&gt; blocks. All were unnecessary. That's valuable to know and remember.&lt;/p&gt;
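&lt;p&gt;A minimal illustration of why (names here are hypothetical): when one resource references another's attribute, Terraform already infers the ordering, so spelling it out adds nothing:&lt;/p&gt;

```hcl
# The subnet_id reference already tells Terraform to create the subnet
# first — an explicit depends_on here is pure noise.
resource "aws_instance" "web" {
  ami           = var.ami_id                  # hypothetical variable
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public["us-east-1a"].id  # implicit dependency

  # depends_on = [aws_subnet.public]          # unnecessary — already inferred
}
```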

&lt;h3&gt;
  
  
  6. Slow Down to Speed Up
&lt;/h3&gt;

&lt;p&gt;The original project took 3 days of "making it work." The refactor took 2 days of "making it right." But now I understand it 10x better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Refactoring Was More Valuable Than the Original Build
&lt;/h2&gt;

&lt;p&gt;The first version taught me how to assemble resources.&lt;/p&gt;

&lt;p&gt;The refactored version taught me &lt;strong&gt;why certain Terraform patterns exist.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rebuilding the project exposed assumptions I didn't know I was making the first time. It forced me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question every hardcoded value&lt;/li&gt;
&lt;li&gt;Understand the difference between implicit and explicit dependencies&lt;/li&gt;
&lt;li&gt;Think about how the code would scale&lt;/li&gt;
&lt;li&gt;Consider how someone else would read and modify this code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That process made the concepts stick far more effectively than building something new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best learning happens when you're forced to justify your decisions to yourself.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Want to see the difference? Clone the repository and explore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/adil-khan-723/terraform_project2_refactored
&lt;span class="nb"&gt;cd &lt;/span&gt;terraform_project2_refactored

&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review the plan&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Apply the configuration&lt;/span&gt;
terraform apply

&lt;span class="c"&gt;# Get the ALB DNS name&lt;/span&gt;
terraform output alb_dns_name

&lt;span class="c"&gt;# Test the load balancer (you'll see different instances)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;curl http://&amp;lt;alb-dns-name&amp;gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Clean up&lt;/span&gt;
terraform destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm currently working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Converting this into reusable Terraform modules&lt;/li&gt;
&lt;li&gt;📈 Adding Auto Scaling Groups for dynamic scaling&lt;/li&gt;
&lt;li&gt;🔒 Implementing HTTPS with AWS Certificate Manager&lt;/li&gt;
&lt;li&gt;🌐 Building a custom VPC version with proper network segmentation&lt;/li&gt;
&lt;li&gt;📊 Adding CloudWatch dashboards and alarms&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Repository and Source Code
&lt;/h2&gt;

&lt;p&gt;The complete source code, file structure, and documentation are available here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/adil-khan-723/terraform_project2_refactored" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/terraform_project2_refactored&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean Terraform files with clear separation of concerns&lt;/li&gt;
&lt;li&gt;Comprehensive README with architecture diagrams&lt;/li&gt;
&lt;li&gt;No committed state or local artifacts&lt;/li&gt;
&lt;li&gt;A readable, review-friendly structure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code is as much about the "code" part as it is about the "infrastructure" part. Clean, maintainable, understandable code matters—even when you're the only person who will read it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The infrastructure worked both times. But only the second time did I understand why.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the difference between code that works and code that teaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect With Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Have you refactored your own infrastructure code?&lt;/strong&gt; What surprised you most? Drop a comment below—I'd love to hear about your experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/adilk3682" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/adilk3682&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you found this helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ Star the &lt;a href="https://github.com/adil-khan-723/terraform_project2_refactored" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💬 Share your own refactoring stories in the comments&lt;/li&gt;
&lt;li&gt;🔗 Connect with me on LinkedIn&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! Happy Terraforming! 🚀&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why I Stopped Using a Bastion Host and Moved to AWS SSM for Private EC2 Access</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Mon, 15 Dec 2025 08:57:26 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/why-i-stopped-using-a-bastion-host-and-moved-to-aws-ssm-for-private-ec2-access-3ne8</link>
      <guid>https://forem.com/adil-khan-723/why-i-stopped-using-a-bastion-host-and-moved-to-aws-ssm-for-private-ec2-access-3ne8</guid>
      <description>&lt;p&gt;When I started designing my AWS setup, one of my early goals was clear:&lt;br&gt;
keep backend servers completely private.&lt;/p&gt;

&lt;p&gt;So my EC2 instances lived in private subnets, with no public IPs. That felt right from a security perspective.&lt;br&gt;
But very quickly, a practical problem showed up:&lt;/p&gt;

&lt;p&gt;If my servers aren’t reachable from the internet, how do I access them when something breaks?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3fbtzj7uccdcx9lwtvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3fbtzj7uccdcx9lwtvm.png" alt="AWS ssm Architecture image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article isn’t a guide on how to configure access.&lt;br&gt;
It’s about what I tried, what felt wrong, and what finally clicked.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The First Solution I Tried: A Bastion Host&lt;/h2&gt;

&lt;p&gt;Like many people, I started with a Bastion Host.&lt;/p&gt;

&lt;p&gt;The idea was straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One small EC2 instance in a public subnet&lt;/li&gt;
&lt;li&gt;Port 22 open&lt;/li&gt;
&lt;li&gt;SSH into the bastion, then hop into private instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly — it worked.&lt;/p&gt;

&lt;p&gt;But the more I used it, the more friction I felt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I was managing SSH keys again&lt;/li&gt;
&lt;li&gt;One public-facing server became a critical choke point&lt;/li&gt;
&lt;li&gt;Security started depending on how well I protected that single box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing was broken, but something didn’t feel aligned with the rest of the architecture I was building.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The Question That Changed My Approach&lt;/h2&gt;

&lt;p&gt;At some point I stopped asking:&lt;/p&gt;

&lt;p&gt;“How do I reach my servers?”&lt;/p&gt;

&lt;p&gt;and started asking:&lt;/p&gt;

&lt;p&gt;“Why does access depend on network paths at all?”&lt;/p&gt;

&lt;p&gt;That shift led me to AWS Systems Manager (SSM) Session Manager.&lt;/p&gt;
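&lt;p&gt;For context, opening a shell on a private instance with Session Manager is a single command — no key pair, no inbound port 22. The instance ID below is a placeholder, and this assumes the Session Manager plugin is installed and your IAM identity has &lt;code&gt;ssm:StartSession&lt;/code&gt;:&lt;/p&gt;

```shell
# Interactive shell on a private instance via SSM — no SSH, no open ports.
aws ssm start-session --target i-0123456789abcdef0
```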

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;What Changed When I Switched to SSM&lt;/h2&gt;

&lt;p&gt;Once I set up SSM and removed the Bastion Host, a few things became very clear.&lt;/p&gt;

&lt;h3&gt;1. Access became identity-based, not network-based&lt;/h3&gt;

&lt;p&gt;I wasn’t thinking about IPs, ports, or jump paths anymore.&lt;br&gt;
Access was simply about who I am and what IAM permissions I have.&lt;/p&gt;

&lt;p&gt;That felt like a more natural fit for cloud-native systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;2. No inbound ports felt… relieving&lt;/h3&gt;

&lt;p&gt;Closing port 22 everywhere wasn’t just a security improvement — it simplified things mentally.&lt;/p&gt;

&lt;p&gt;There was no longer a “special server” that needed extra attention or hardening.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;3. Visibility improved without extra effort&lt;/h3&gt;

&lt;p&gt;Every session was logged.&lt;br&gt;
Every action had an identity attached to it.&lt;/p&gt;

&lt;p&gt;I didn’t have to bolt on monitoring — it was part of the access model itself.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;What This Shift Changed in My Mental Model&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security felt like layers of network controls&lt;/li&gt;
&lt;li&gt;Access meant “finding a safe path to the server”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security feels like identity + intent&lt;/li&gt;
&lt;li&gt;Access means “am I allowed to be here?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction mattered more to me than I expected.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Where I Am Now&lt;/h2&gt;

&lt;p&gt;This is how I currently think about server access in my setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend EC2 instances stay in private subnets&lt;/li&gt;
&lt;li&gt;The load balancer is the only public-facing component&lt;/li&gt;
&lt;li&gt;Administrative access happens through SSM, not SSH&lt;/li&gt;
&lt;li&gt;Security groups are chained tightly, not opened broadly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m not saying Bastion Hosts are wrong — they still have valid use cases.&lt;br&gt;
But for my learning and the systems I’m building right now, SSM feels like the right default.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;A Question for People Further Along&lt;/h2&gt;

&lt;p&gt;If you’ve worked on production AWS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you still rely on Bastion Hosts?&lt;/li&gt;
&lt;li&gt;Or have you moved fully toward SSM / identity-based access?&lt;/li&gt;
&lt;li&gt;In what cases do you still prefer a bastion?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m still learning, and I’d love to hear how others think about this trade-off.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>learning</category>
    </item>
    <item>
      <title>How I Built a Production-Grade Multi-Tier Application on AWS ECS Fargate (A Complete Case Study)</title>
      <dc:creator>Adil Khan</dc:creator>
      <pubDate>Sun, 07 Dec 2025 17:50:17 +0000</pubDate>
      <link>https://forem.com/adil-khan-723/how-i-built-a-production-grade-multi-tier-application-on-aws-ecs-fargate-a-complete-case-study-12on</link>
      <guid>https://forem.com/adil-khan-723/how-i-built-a-production-grade-multi-tier-application-on-aws-ecs-fargate-a-complete-case-study-12on</guid>
      <description>&lt;p&gt;I recently completed a full end-to-end deployment of a multi-tier application on AWS ECS Fargate. What began as a simple “let’s deploy a React app and a Node.js API” turned into a complete production-style cloud architecture that tested everything I’ve learned about DevOps, AWS networking, and container orchestration.&lt;/p&gt;

&lt;p&gt;This article is a complete technical breakdown of the project: how the architecture works, the services involved, what went wrong, what I fixed, and how the final system now behaves like a real microservices deployment running inside a production VPC.&lt;/p&gt;

&lt;p&gt;I’m sharing this as a learning milestone and a reference for others trying to move from theory to real-world cloud builds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik800leo3q050auakrfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik800leo3q050auakrfs.png" alt="project architecture" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Project Summary&lt;/h2&gt;

&lt;p&gt;The system is a simple two-service architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React frontend served by Nginx&lt;/li&gt;
&lt;li&gt;Node.js backend API (&lt;code&gt;/api/message&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Public ALB for the frontend&lt;/li&gt;
&lt;li&gt;Internal ALB for the backend&lt;/li&gt;
&lt;li&gt;Two ECS Fargate services with separate task definitions&lt;/li&gt;
&lt;li&gt;ECR repositories for image storage&lt;/li&gt;
&lt;li&gt;4-subnet VPC (2 public, 2 private)&lt;/li&gt;
&lt;li&gt;SG-to-SG communication for isolation&lt;/li&gt;
&lt;li&gt;CloudWatch logging for both tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: a fully private backend and a publicly accessible frontend communicating securely inside the VPC.&lt;/p&gt;

&lt;h2&gt;High-Level Architecture&lt;/h2&gt;

&lt;p&gt;Below is the same architecture used in many real production microservices deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB → receives internet traffic&lt;/li&gt;
&lt;li&gt;Frontend Fargate tasks in public subnets → serve the UI&lt;/li&gt;
&lt;li&gt;Internal ALB → receives API calls only from the frontend&lt;/li&gt;
&lt;li&gt;Backend Fargate tasks in private subnets → serve the API&lt;/li&gt;
&lt;li&gt;SG chaining → only frontend → backend is allowed&lt;/li&gt;
&lt;li&gt;ECR → stores container images&lt;/li&gt;
&lt;li&gt;IAM execution role → grants ECS permission to pull images&lt;/li&gt;
&lt;li&gt;CloudWatch Logs → task logs + debugging&lt;/li&gt;
&lt;li&gt;VPC endpoints (optional) → avoid NAT costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend has zero exposure to the public internet. All calls go through the internal ALB.&lt;/p&gt;

&lt;h2&gt;Network Design&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VPC:&lt;/strong&gt; 10.0.0.0/16&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subnets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets (2) → ALB + frontend tasks&lt;/li&gt;
&lt;li&gt;Private subnets (2) → backend tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Route tables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets → Internet Gateway&lt;/li&gt;
&lt;li&gt;Private subnets → NAT Gateway / VPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security groups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB SG: allows HTTP from anywhere&lt;/li&gt;
&lt;li&gt;Frontend SG: allows the public ALB → port 80&lt;/li&gt;
&lt;li&gt;Internal ALB SG: allows the frontend SG → port 80&lt;/li&gt;
&lt;li&gt;Backend SG: allows the internal ALB → port 5001&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traffic path:&lt;br&gt;
Internet → Public ALB → Frontend → Internal ALB → Backend&lt;/p&gt;
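&lt;p&gt;As an illustration of SG chaining, the backend rule admits traffic from the internal ALB's security group rather than from a CIDR range — both group IDs below are placeholders:&lt;/p&gt;

```shell
# Backend SG admits port 5001 only from the internal ALB's SG — no CIDRs.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5001 \
  --source-group sg-0fedcba9876543210
```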

&lt;h2&gt;Containers &amp;amp; Dockerfiles&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend (Nginx multi-stage build)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the React app&lt;/li&gt;
&lt;li&gt;Serve it via Nginx&lt;/li&gt;
&lt;li&gt;Expose port 80&lt;/li&gt;
&lt;/ul&gt;
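&lt;p&gt;A minimal sketch of such a multi-stage Dockerfile — base image versions and paths are assumptions, not the exact ones from this project:&lt;/p&gt;

```dockerfile
# Stage 1: build the React bundle (base image version is an assumption)
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: serve the static build with Nginx on port 80
FROM nginx:alpine
COPY --from=build /app/build /usr/share/nginx/html
EXPOSE 80
```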

&lt;p&gt;&lt;strong&gt;Backend (Node.js)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Express server&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/api/message&lt;/code&gt; endpoint&lt;/li&gt;
&lt;li&gt;Expose port 5001&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both built locally → pushed to ECR.&lt;/p&gt;
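&lt;p&gt;The build-and-push flow looks roughly like this — account ID, region, and repository names are placeholders:&lt;/p&gt;

```shell
# Authenticate Docker to ECR, then build, tag, and push.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker build -t frontend .
docker tag frontend:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/frontend:latest
```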

&lt;h2&gt;ECR + IAM Setup&lt;/h2&gt;

&lt;p&gt;Two repositories: frontend, backend.&lt;/p&gt;

&lt;p&gt;The IAM role had permissions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchGetImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchCheckLayerAvailability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logs:CreateLogStream&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logs:PutLogEvents&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VPC endpoints were added for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECR API&lt;/li&gt;
&lt;li&gt;ECR DKR&lt;/li&gt;
&lt;li&gt;S3&lt;/li&gt;
&lt;li&gt;CloudWatch Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fixed image pull timeouts in private subnets.&lt;/p&gt;

&lt;h2&gt;ECS Design&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster:&lt;/strong&gt; one ECS cluster (Fargate only).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task definitions:&lt;/strong&gt; one for the frontend, one for the backend. Each includes CPU/memory, ports, log configuration, &lt;code&gt;awsvpc&lt;/code&gt; network mode, and IAM roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;frontend-service&lt;/li&gt;
&lt;li&gt;backend-service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Desired count: 2 each.&lt;/p&gt;

&lt;p&gt;Rolling deployments were used for updates.&lt;/p&gt;

&lt;h2&gt;Load Balancing &amp;amp; Routing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public ALB&lt;/li&gt;
&lt;li&gt;Listener: HTTP 80&lt;/li&gt;
&lt;li&gt;Target group: frontend-tg (port 80)&lt;/li&gt;
&lt;li&gt;Health check: &lt;code&gt;/&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal ALB&lt;/li&gt;
&lt;li&gt;Listener: HTTP 80&lt;/li&gt;
&lt;li&gt;Target group: backend-tg (port 5001)&lt;/li&gt;
&lt;li&gt;Health check: &lt;code&gt;/api/message&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend → Backend:&lt;/strong&gt; the frontend uses the internal ALB DNS name for API calls.&lt;/p&gt;

&lt;h2&gt;Rolling Deployments&lt;/h2&gt;

&lt;p&gt;Flow for a new image rollout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Push the new image&lt;/li&gt;
&lt;li&gt;Create a new task definition revision&lt;/li&gt;
&lt;li&gt;ECS launches new tasks&lt;/li&gt;
&lt;li&gt;The ALB health-checks them&lt;/li&gt;
&lt;li&gt;Traffic shifts&lt;/li&gt;
&lt;li&gt;Old tasks drain and stop&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tested multiple updates to see real ENI provisioning, ALB registration, logs, and draining behavior.&lt;/p&gt;
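&lt;p&gt;Triggering such a rollout from the CLI is a one-liner once the new task definition revision is registered — cluster and service names here are placeholders:&lt;/p&gt;

```shell
# Point the service at the newest task definition revision and let ECS
# roll tasks gradually.
aws ecs update-service \
  --cluster app-cluster \
  --service backend-service \
  --task-definition backend   # family name alone picks the latest revision

# Block until the deployment stabilizes
aws ecs wait services-stable --cluster app-cluster --services backend-service
```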

&lt;h2&gt;Key Metrics From the Deployment&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;0 public IPs on ECS tasks&lt;/li&gt;
&lt;li&gt;Backend remained fully private&lt;/li&gt;
&lt;li&gt;&amp;lt;15 ms frontend → backend latency&lt;/li&gt;
&lt;li&gt;3-minute build → push → deploy cycle&lt;/li&gt;
&lt;li&gt;Zero-downtime rolling deployments&lt;/li&gt;
&lt;li&gt;100% successful health checks&lt;/li&gt;
&lt;li&gt;Multiple revisions without breakage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Challenges &amp;amp; Fixes&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Target group 404.&lt;/strong&gt; Cause: wrong health-check path. Fix: use &lt;code&gt;/api/message&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ECR pull timeout.&lt;/strong&gt; Cause: tasks in private subnets. Fix: add VPC endpoints for ECR, S3, and CloudWatch Logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frontend couldn’t reach the backend.&lt;/strong&gt; Cause: a hardcoded IP. Fix: use the internal ALB DNS name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling update issues.&lt;/strong&gt; Cause: invalid deployment settings. Fix: correct &lt;code&gt;minimumHealthyPercent&lt;/code&gt; and &lt;code&gt;maximumPercent&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How Fargate attaches ENIs inside private subnets&lt;/li&gt;
&lt;li&gt;How ALB target groups determine readiness&lt;/li&gt;
&lt;li&gt;How internal ALBs handle microservice communication&lt;/li&gt;
&lt;li&gt;How IAM least privilege affects ECR/ECS&lt;/li&gt;
&lt;li&gt;How routing works in multi-tier VPCs&lt;/li&gt;
&lt;li&gt;How rolling deployments behave in real time&lt;/li&gt;
&lt;li&gt;How containers, networking, IAM, and load balancing combine to form real systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why This Project Mattered&lt;/h2&gt;

&lt;p&gt;This wasn’t just a deployment. It was a deep dive into how real cloud systems work — with failures, debugging, routing decisions, IAM restrictions, and architecture redesigns.&lt;/p&gt;

&lt;p&gt;It brought together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC networking&lt;/li&gt;
&lt;li&gt;IAM&lt;/li&gt;
&lt;li&gt;ECS&lt;/li&gt;
&lt;li&gt;ECR&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Load balancing&lt;/li&gt;
&lt;li&gt;Rolling deployments&lt;/li&gt;
&lt;li&gt;Private service-to-service communication&lt;/li&gt;
&lt;li&gt;Monitoring &amp;amp; logging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;GitHub Repository&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/adil-khan-723/node-app-jenkins1.git" rel="noopener noreferrer"&gt;https://github.com/adil-khan-723/node-app-jenkins1.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Anyone learning DevOps or AWS should attempt a project like this. It forces you to think like an engineer designing real systems, not just someone running commands. It also builds confidence that you can architect and debug production-style systems from scratch.&lt;/p&gt;

&lt;p&gt;If you’re working on similar projects or want to discuss cloud architectures, feel free to reach out.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>architecture</category>
      <category>docker</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
