<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: cypher682</title>
    <description>The latest articles on Forem by cypher682 (@cypher682).</description>
    <link>https://forem.com/cypher682</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F610894%2F6245b0e9-d707-4351-b69c-159b40badb08.png</url>
      <title>Forem: cypher682</title>
      <link>https://forem.com/cypher682</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cypher682"/>
    <language>en</language>
    <item>
      <title>Designing a Production-Grade Blue-Green ECS Platform on AWS with Terraform</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Tue, 24 Feb 2026 15:02:49 +0000</pubDate>
      <link>https://forem.com/cypher682/designing-a-production-grade-blue-green-ecs-platform-on-aws-with-terraform-2n5h</link>
      <guid>https://forem.com/cypher682/designing-a-production-grade-blue-green-ecs-platform-on-aws-with-terraform-2n5h</guid>
      <description>&lt;p&gt;Most AWS tutorials stop at "it works."&lt;/p&gt;

&lt;p&gt;I wanted to build something closer to what a real engineering team would operate: network isolation, IAM least privilege, blue-green deployments, secrets management, and clean teardown—all defined as code.&lt;/p&gt;

&lt;p&gt;This article walks through the architecture, design decisions, tradeoffs, and &lt;strong&gt;the 8 real issues I encountered&lt;/strong&gt; along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/ecs-production-platform" rel="noopener noreferrer"&gt;ecs-production-platform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total Cost:&lt;/strong&gt; $0.12 for complete validation (4-hour session)&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What Was Built&lt;/li&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;Terraform Module Structure&lt;/li&gt;
&lt;li&gt;Security Design&lt;/li&gt;
&lt;li&gt;Blue-Green Deployment Mechanics&lt;/li&gt;
&lt;li&gt;Data Persistence Validation&lt;/li&gt;
&lt;li&gt;What Broke (8 Issues)&lt;/li&gt;
&lt;li&gt;Production Tradeoffs&lt;/li&gt;
&lt;li&gt;Cost Analysis&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Was Built
&lt;/h2&gt;

&lt;p&gt;A production-aligned ECS Fargate platform running a Flask API backed by PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom VPC (&lt;code&gt;10.0.0.0/16&lt;/code&gt;) across 2 Availability Zones&lt;/li&gt;
&lt;li&gt;Public subnets for ALB and ECS tasks&lt;/li&gt;
&lt;li&gt;Private subnets for RDS (no internet route)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS Fargate services (no EC2 instance management)&lt;/li&gt;
&lt;li&gt;Application Load Balancer with HTTPS (ACM certificate, TLS 1.3)&lt;/li&gt;
&lt;li&gt;Blue-green target groups for zero-downtime deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS PostgreSQL 15.12 (single-AZ for free tier)&lt;/li&gt;
&lt;li&gt;Private subnet only, no public endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM role separation (execution vs task)&lt;/li&gt;
&lt;li&gt;SSM Parameter Store for secrets (KMS encrypted)&lt;/li&gt;
&lt;li&gt;Security group layering (internet → ALB → ECS → RDS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure as Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% Terraform (modular design)&lt;/li&gt;
&lt;li&gt;Remote state in S3 with DynamoDB locking&lt;/li&gt;
&lt;li&gt;Reusable modules (networking, IAM, ALB, ECS, RDS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;36 AWS resources deployed, tested, and cleanly destroyed.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│              Internet                        │
└──────────────────┬──────────────────────────┘
                   │
         ┌─────────▼─────────┐
         │    Route 53 DNS    │
         └─────────┬──────────┘
                   │
         ┌─────────▼──────────┐
         │  ACM Certificate    │
         │    (TLS 1.3)        │
         └─────────┬───────────┘
                   │
    ┌──────────────▼──────────────┐
    │  Application Load Balancer  │
    │   (Public Subnets)          │
    └──────┬──────────────┬───────┘
           │              │
    ┌──────▼─────┐ ┌─────▼──────┐
    │ Blue TG    │ │ Green TG   │
    │ (Weight:   │ │ (Weight:   │
    │  100%)     │ │   0%)      │
    └──────┬─────┘ └─────┬──────┘
           │              │
    ┌──────▼──────────────▼──────┐
    │   ECS Fargate Services      │
    │   (Public Subnets)          │
    │   • 2 tasks (blue)          │
    │   • 0 tasks (green standby) │
    └──────────────┬──────────────┘
                   │
         ┌─────────▼──────────┐
         │  RDS PostgreSQL     │
         │  (Private Subnet)   │
         │  • No public IP     │
         │  • Port 5432 only   │
         └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ECS in Public Subnets&lt;/strong&gt;: Cost optimization—saves $33/month on NAT Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-AZ RDS&lt;/strong&gt;: Free tier constraint—production would use Multi-AZ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Groups&lt;/strong&gt;: Each layer accepts connections only from the layer directly in front of it&lt;/li&gt;
&lt;/ul&gt;
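
&lt;p&gt;The $33/month figure can be reproduced with quick arithmetic. This sketch assumes the us-east-1 list price of $0.045 per NAT Gateway hour and a 730-hour average month. Both numbers are assumptions to check against current AWS pricing, and per-GB data processing charges are left out.&lt;/p&gt;

```python
# Assumed us-east-1 list price for one NAT Gateway; verify against current
# AWS pricing. Per-GB data processing charges are not included.
NAT_HOURLY_USD = 0.045
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12

nat_monthly_usd = NAT_HOURLY_USD * HOURS_PER_MONTH
print(f"NAT Gateway baseline: ${nat_monthly_usd:.2f}/month")
```

&lt;p&gt;A Multi-AZ setup with one NAT Gateway per Availability Zone would multiply this baseline by the number of zones.&lt;/p&gt;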




&lt;h2&gt;
  
  
  Terraform Module Structure
&lt;/h2&gt;

&lt;p&gt;Instead of one monolithic configuration, concerns are separated into focused modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform/
├── modules/
│   ├── networking/     # VPC, subnets, security groups, DB subnet group
│   ├── iam/            # Task execution role, task role, policies
│   ├── alb/            # Load balancer, target groups, listeners
│   ├── ecs/            # Cluster, services, task definitions
│   ├── rds/            # PostgreSQL instance, parameter group
│   └── cicd/           # GitHub Actions IAM role (design only)
└── environments/
    └── prod/
        ├── main.tf         # Module composition
        ├── variables.tf    # Input variables
        ├── outputs.tf      # Stack outputs
        ├── backend.tf      # S3 + DynamoDB state
        └── versions.tf     # Provider versions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Module Communication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit inputs and outputs only&lt;/li&gt;
&lt;li&gt;No cross-module resource references&lt;/li&gt;
&lt;li&gt;No hidden dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Module Call:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"ecs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/ecs"&lt;/span&gt;

  &lt;span class="nx"&gt;project_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_name&lt;/span&gt;
  &lt;span class="nx"&gt;container_image&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;container_image&lt;/span&gt;
  &lt;span class="nx"&gt;ecs_task_execution_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task_execution_role_arn&lt;/span&gt;
  &lt;span class="nx"&gt;ecs_task_role_arn&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task_role_arn&lt;/span&gt;
  &lt;span class="nx"&gt;public_subnet_ids&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;ecs_security_group_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_tasks_security_group_id&lt;/span&gt;
  &lt;span class="nx"&gt;target_group_arn&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;blue_target_group_arn&lt;/span&gt;

  &lt;span class="c1"&gt;# Database connection&lt;/span&gt;
  &lt;span class="nx"&gt;db_host&lt;/span&gt;                     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db_address&lt;/span&gt;
  &lt;span class="nx"&gt;db_password_ssm_param&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/ecs-prod/db/password"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the architecture &lt;strong&gt;composable&lt;/strong&gt; and prevents &lt;strong&gt;circular dependency hell&lt;/strong&gt;.&lt;/p&gt;
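
&lt;p&gt;One way to see why explicit inputs and outputs help is to treat the modules as a dependency graph. The sketch below is illustrative Python, not part of the repository. The &lt;code&gt;ecs&lt;/code&gt; entry mirrors the module call above, while the edges for &lt;code&gt;alb&lt;/code&gt; and &lt;code&gt;rds&lt;/code&gt; are assumptions. A topological sort succeeds only when the graph has no cycle.&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter

# Module name mapped to the modules whose outputs it consumes.
# The "ecs" entry follows the example module call; the "alb" and "rds"
# entries are assumptions made for illustration.
MODULE_INPUTS = {
    "networking": set(),
    "iam": set(),
    "alb": {"networking"},
    "rds": {"networking"},
    "ecs": {"iam", "networking", "alb", "rds"},
}


def apply_order(graph):
    """Return one valid creation order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular module dependency: {err.args[1]}") from err


order = apply_order(MODULE_INPUTS)
print(order)
```

&lt;p&gt;If two modules ever needed each other's outputs, &lt;code&gt;apply_order&lt;/code&gt; would raise, which is the same class of problem Terraform reports as a dependency cycle.&lt;/p&gt;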




&lt;h2&gt;
  
  
  Security Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Network Isolation (Security Groups)
&lt;/h3&gt;

&lt;p&gt;Connections are initiated in &lt;strong&gt;one direction only&lt;/strong&gt;, and each layer accepts them solely from the layer above it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet (0.0.0.0/0)
    ↓ Port 443/80
┌───────────────────┐
│  ALB SG           │
│  sg-0373513dd...  │
└────────┬──────────┘
         ↓ Port 8000 (from ALB SG only)
┌────────────────────┐
│  ECS Tasks SG      │
│  sg-09a3082e31...  │
└────────┬───────────┘
         ↓ Port 5432 (from ECS SG only)
┌────────────────────┐
│  RDS SG            │
│  sg-07a5aae1f9...  │
└────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security Group Rules:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ALB Security Group:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"HTTPS from internet"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"To ECS tasks only"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ECS Tasks Security Group:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"From ALB only"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"To RDS only"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RDS Security Group:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PostgreSQL from ECS only"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# No egress rules - database doesn't need outbound&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
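&lt;li&gt;No direct inbound internet access to ECS tasks (port 8000 accepts connections from the ALB security group only)&lt;/li&gt;
&lt;li&gt;No public database endpoint&lt;/li&gt;
&lt;li&gt;No bypass paths around the ALB or the ECS tasks&lt;/li&gt;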
&lt;/ul&gt;
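
&lt;p&gt;The layering can be checked with a small model. This is a minimal sketch in Python rather than anything AWS evaluates: the labels &lt;code&gt;internet&lt;/code&gt;, &lt;code&gt;alb&lt;/code&gt;, &lt;code&gt;ecs&lt;/code&gt;, and &lt;code&gt;rds&lt;/code&gt; are illustrative stand-ins for the security groups, and only the ingress rules shown in the snippets above are modeled.&lt;/p&gt;

```python
# Allowed (source, destination, port) hops, taken from the ingress rules above.
# The labels are illustrative stand-ins for the three security groups.
ALLOWED_HOPS = {
    ("internet", "alb", 443),
    ("alb", "ecs", 8000),
    ("ecs", "rds", 5432),
}


def can_connect(source, destination, port):
    """True only if the destination has an ingress rule admitting the source."""
    return (source, destination, port) in ALLOWED_HOPS


def allowed_sources(destination):
    """Sorted list of sources that may open a connection to destination."""
    return sorted(src for src, dst, _port in ALLOWED_HOPS if dst == destination)


# Bypass attempts that the layering is meant to rule out.
for attempt in [("internet", "ecs", 8000), ("internet", "rds", 5432), ("alb", "rds", 5432)]:
    print(attempt, "allowed" if can_connect(*attempt) else "blocked")
```

&lt;p&gt;Each destination ends up with exactly one permitted source, which is what makes the chain easy to audit.&lt;/p&gt;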




&lt;h3&gt;
  
  
  2. IAM Role Separation
&lt;/h3&gt;

&lt;p&gt;Two distinct roles keep infrastructure permissions out of reach of application code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Execution Role&lt;/strong&gt; (infrastructure operations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:GetAuthorizationToken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ssm:GetParameters"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Task Role&lt;/strong&gt; (application runtime):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ssm:GetParameter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a container is compromised, the attacker inherits &lt;strong&gt;only the task role&lt;/strong&gt;, not the execution role. With those credentials they can't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull images from ECR&lt;/li&gt;
&lt;li&gt;Call the CloudWatch Logs API (the task role grants no &lt;code&gt;logs:*&lt;/code&gt; actions)&lt;/li&gt;
&lt;li&gt;Read SSM parameters outside &lt;code&gt;/ecs-prod/*&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
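
&lt;p&gt;To make the scoping concrete, here is a deliberately simplified Python check of the task role policy. It looks only at Allow statements with wildcard matching and ignores explicit Deny, conditions, and every other policy type, so treat it as a teaching aid and not a reimplementation of IAM. The account ID &lt;code&gt;123456789012&lt;/code&gt; is a placeholder.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

# The task role policy shown above.
TASK_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "ssm:GetParameter",
        "Resource": "arn:aws:ssm:us-east-1:*:parameter/ecs-prod/*",
    }],
}


def as_list(value):
    """IAM lets Action and Resource be a string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Simplified check: does any Allow statement match action and resource?

    Matching is case-sensitive here, unlike real IAM action names.
    """
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, p) for p in as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# Placeholder account ID; only the parameter path differs between the two ARNs.
IN_SCOPE = "arn:aws:ssm:us-east-1:123456789012:parameter/ecs-prod/db/password"
OUT_OF_SCOPE = "arn:aws:ssm:us-east-1:123456789012:parameter/other-app/db/password"

print(is_allowed(TASK_ROLE_POLICY, "ssm:GetParameter", IN_SCOPE))
print(is_allowed(TASK_ROLE_POLICY, "ssm:GetParameter", OUT_OF_SCOPE))
print(is_allowed(TASK_ROLE_POLICY, "ecr:BatchGetImage", "*"))
```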




&lt;h3&gt;
  
  
  3. Secrets Management Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Generate password&lt;/span&gt;
&lt;span class="nv"&gt;PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 32 | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"=+/"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c1-25&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 2. Store in SSM (KMS encrypted)&lt;/span&gt;
aws ssm put-parameter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"/ecs-prod/db/password"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--value&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; &lt;span class="s2"&gt;"SecureString"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Reference in task definition&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"DB_PASSWORD_SSM_PARAM"&lt;/span&gt;,
  &lt;span class="s2"&gt;"value"&lt;/span&gt;: &lt;span class="s2"&gt;"/ecs-prod/db/password"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# 4. Application fetches at runtime&lt;/span&gt;
import os
import boto3
ssm &lt;span class="o"&gt;=&lt;/span&gt; boto3.client&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ssm'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
password &lt;span class="o"&gt;=&lt;/span&gt; ssm.get_parameter&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;os.environ[&lt;span class="s1"&gt;'DB_PASSWORD_SSM_PARAM'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
    &lt;span class="nv"&gt;WithDecryption&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True
&lt;span class="o"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;'Parameter'&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'Value'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Never in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Source control&lt;/li&gt;
&lt;li&gt; Docker image&lt;/li&gt;
&lt;li&gt; Environment variables (plaintext)&lt;/li&gt;
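&lt;li&gt; Terraform plan or apply output (marked &lt;code&gt;sensitive&lt;/code&gt;; note that this flag only redacts CLI output and does not keep a value out of the state file)&lt;/li&gt;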
&lt;/ul&gt;
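
&lt;p&gt;The runtime fetch in step 4 can be wrapped so the secret is read once and held only in process memory. The following is a minimal sketch under stated assumptions: &lt;code&gt;SecretLoader&lt;/code&gt; is a hypothetical helper, not code from the repository, and the SSM client is passed in so the class can be exercised without AWS access.&lt;/p&gt;

```python
import os


class SecretLoader:
    """Fetch a SecureString once and keep it in process memory only."""

    def __init__(self, ssm_client, environ=os.environ):
        self._ssm = ssm_client
        self._environ = environ
        self._cache = {}

    def get(self, env_var="DB_PASSWORD_SSM_PARAM"):
        # The environment variable holds the parameter name, never the secret.
        name = self._environ[env_var]
        if name not in self._cache:
            response = self._ssm.get_parameter(Name=name, WithDecryption=True)
            self._cache[name] = response["Parameter"]["Value"]
        return self._cache[name]


# In the application (assumes boto3 is installed and the task role applies):
#   import boto3
#   loader = SecretLoader(boto3.client("ssm"))
#   password = loader.get()
```

&lt;p&gt;Caching avoids one SSM call per database connection. The tradeoff is that a rotated password is picked up only when the task restarts.&lt;/p&gt;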




&lt;h2&gt;
  
  
  Blue-Green Deployment Mechanics
&lt;/h2&gt;

&lt;p&gt;The ALB HTTPS listener can route traffic to &lt;strong&gt;two separate target groups&lt;/strong&gt; with configurable weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Blue Target Group:  ████████████████████ 100% (2 healthy tasks)
Green Target Group: -------------------- 0%  (0 tasks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
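
&lt;p&gt;For a gradual shift instead of an all-or-nothing switch, a forward action can list both target groups with weights. The helper below is a hedged sketch: &lt;code&gt;weighted_forward_action&lt;/code&gt; and the ARNs are hypothetical, and it only builds the JSON structure that the listener's default action accepts.&lt;/p&gt;

```python
import json


def weighted_forward_action(blue_arn, green_arn, green_weight):
    """Build a default action that splits traffic between two target groups."""
    if green_weight not in range(0, 101):
        raise ValueError("green_weight must be an integer from 0 to 100")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_arn, "Weight": green_weight},
            ]
        },
    }]


# Placeholder ARNs; real values would come from the Terraform outputs.
BLUE = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue/0000000000000000"
GREEN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green/1111111111111111"

# Send 10% of requests to green as a canary step.
print(json.dumps(weighted_forward_action(BLUE, GREEN, 10), indent=2))
```

&lt;p&gt;The printed JSON can be supplied to &lt;code&gt;aws elbv2 modify-listener --default-actions&lt;/code&gt;. The initial state shown above corresponds to &lt;code&gt;green_weight=0&lt;/code&gt;.&lt;/p&gt;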



&lt;h3&gt;
  
  
  Deployment Process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Build New Version&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update app version in app.py&lt;/span&gt;
&lt;span class="c"&gt;# Build Docker image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2 &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Push to ECR&lt;/span&gt;
docker push 758620460011.dkr.ecr.us-east-1.amazonaws.com/ecs-prod/flask-app:v2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Deploy to Green&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scale up green service&lt;/span&gt;
aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; ecs-prod-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service&lt;/span&gt; ecs-prod-service-green &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--desired-count&lt;/span&gt; 2

&lt;span class="c"&gt;# Wait for health checks (90 seconds)&lt;/span&gt;
aws ecs &lt;span class="nb"&gt;wait &lt;/span&gt;services-stable &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; ecs-prod-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--services&lt;/span&gt; ecs-prod-service-green
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Switch Traffic (&amp;lt; 1 Second)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get listener and target group ARNs&lt;/span&gt;
&lt;span class="nv"&gt;LISTENER_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; alb_listener_arn&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;GREEN_TG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; green_target_group_arn&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Instant traffic switch&lt;/span&gt;
aws elbv2 modify-listener &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--listener-arn&lt;/span&gt; &lt;span class="nv"&gt;$LISTENER_ARN&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;forward,TargetGroupArn&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GREEN_TG&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Validate and Cleanup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Monitor green deployment&lt;/span&gt;
curl https://app.cipherpol.xyz/health
&lt;span class="c"&gt;# {"version":"2.0.0","deployment":"green","status":"healthy"}&lt;/span&gt;

&lt;span class="c"&gt;# After 15 minutes of monitoring, scale down blue&lt;/span&gt;
aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; ecs-prod-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service&lt;/span&gt; ecs-prod-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--desired-count&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
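
&lt;p&gt;The manual &lt;code&gt;curl&lt;/code&gt; check can be turned into a rollback decision. This sketch assumes the &lt;code&gt;/health&lt;/code&gt; payload has the three fields shown in the comment above. The function names are hypothetical, and collecting the response bodies over the monitoring window is left to the caller.&lt;/p&gt;

```python
import json


def matches_release(body, version, deployment):
    """True if a /health response reports the expected release and is healthy."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("version") == version
        and payload.get("deployment") == deployment
        and payload.get("status") == "healthy"
    )


def should_roll_back(bodies, version, deployment):
    """Roll back if any collected response fails the release check."""
    return any(not matches_release(body, version, deployment) for body in bodies)


# Sample bodies: one good response and one from the old blue version.
samples = [
    '{"version":"2.0.0","deployment":"green","status":"healthy"}',
    '{"version":"1.0.0","deployment":"blue","status":"healthy"}',
]
print(should_roll_back(samples[:1], "2.0.0", "green"))
print(should_roll_back(samples, "2.0.0", "green"))
```

&lt;p&gt;Scaling blue to zero only after this check passes keeps the rollback path open for the whole monitoring window.&lt;/p&gt;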



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No DNS propagation delays&lt;/strong&gt;: traffic switches at the ALB layer&lt;br&gt;&lt;br&gt;
 &lt;strong&gt;No container restarts&lt;/strong&gt;: only the listener's default forward action changes&lt;br&gt;&lt;br&gt;
 &lt;strong&gt;Instant rollback&lt;/strong&gt;: reverse the listener modification&lt;br&gt;&lt;br&gt;
 &lt;strong&gt;No downtime&lt;/strong&gt;: the ALB handles connection draining&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback Command (Same as Deploy):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get blue target group&lt;/span&gt;
&lt;span class="nv"&gt;BLUE_TG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;terraform output &lt;span class="nt"&gt;-raw&lt;/span&gt; blue_target_group_arn&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Instant rollback&lt;/span&gt;
aws elbv2 modify-listener &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--listener-arn&lt;/span&gt; &lt;span class="nv"&gt;$LISTENER_ARN&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;forward,TargetGroupArn&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BLUE_TG&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Data Persistence Validation
&lt;/h2&gt;

&lt;p&gt;Both blue and green services connect to the &lt;strong&gt;same RDS instance&lt;/strong&gt;. I validated this explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create items while blue is active&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://app.cipherpol.xyz/items &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"Test Item 1"}'&lt;/span&gt;

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://app.cipherpol.xyz/items &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"Test Item 2"}'&lt;/span&gt;

&lt;span class="c"&gt;# 2. Switch traffic to green&lt;/span&gt;
aws elbv2 modify-listener &lt;span class="nt"&gt;--listener-arn&lt;/span&gt; &lt;span class="nv"&gt;$LISTENER_ARN&lt;/span&gt; &lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;forward,TargetGroupArn&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GREEN_TG&lt;/span&gt;

&lt;span class="c"&gt;# 3. Verify data persists&lt;/span&gt;
curl https://app.cipherpol.xyz/items | jq
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"count"&lt;/span&gt;: 2,
  &lt;span class="s2"&gt;"items"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;: 2, &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"Test Item 2"&lt;/span&gt;, &lt;span class="s2"&gt;"created_at"&lt;/span&gt;: &lt;span class="s2"&gt;"2026-02-19T10:54:03"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;: 1, &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"Test Item 1"&lt;/span&gt;, &lt;span class="s2"&gt;"created_at"&lt;/span&gt;: &lt;span class="s2"&gt;"2026-02-19T10:53:02"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# 4. Create new item on green&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://app.cipherpol.xyz/items &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"Created on Green v2.0"}'&lt;/span&gt;

&lt;span class="c"&gt;# 5. Rollback to blue&lt;/span&gt;
aws elbv2 modify-listener &lt;span class="nt"&gt;--listener-arn&lt;/span&gt; &lt;span class="nv"&gt;$LISTENER_ARN&lt;/span&gt; &lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;forward,TargetGroupArn&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BLUE_TG&lt;/span&gt;

&lt;span class="c"&gt;# 6. Confirm all items still present&lt;/span&gt;
curl https://app.cipherpol.xyz/items | jq &lt;span class="s1"&gt;'.count'&lt;/span&gt;
&lt;span class="c"&gt;# 3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;5 items persisted across 3 deployment cycles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The deployment layer is stateless&lt;/li&gt;
&lt;li&gt;The database is the single source of truth&lt;/li&gt;
&lt;/ul&gt;
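&lt;p&gt;The manual comparison in steps 3 and 6 can also be scripted. The following is a minimal sketch, assuming the &lt;code&gt;/items&lt;/code&gt; responses have been saved as parsed JSON; the helper name &lt;code&gt;missing_after_switch&lt;/code&gt; and the contents of the second response (including &lt;code&gt;id&lt;/code&gt; 3) are hypothetical and only illustrate the check.&lt;/p&gt;

```python
def missing_after_switch(before, after):
    """Return ids that existed before a traffic switch but are absent afterwards."""
    before_ids = {item["id"] for item in before["items"]}
    after_ids = {item["id"] for item in after["items"]}
    return sorted(before_ids - after_ids)


# Response shape taken from step 3 (green active), timestamps omitted.
on_green = {
    "count": 2,
    "items": [
        {"id": 2, "name": "Test Item 2"},
        {"id": 1, "name": "Test Item 1"},
    ],
}

# Hypothetical response after the rollback to blue; id 3 is an assumption.
on_blue = {
    "count": 3,
    "items": [
        {"id": 3, "name": "Created on Green v2.0"},
        {"id": 2, "name": "Test Item 2"},
        {"id": 1, "name": "Test Item 1"},
    ],
}

lost = missing_after_switch(on_green, on_blue)
print("lost ids:", lost)
```

&lt;p&gt;An empty list means every record visible before the switch is still visible after it, which is what a shared database should guarantee.&lt;/p&gt;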


&lt;h2&gt;
  
  
  What Broke (And What I Learned)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Issue 1: RDS Connection Delay After "Available" Status
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; ECS tasks failed health checks immediately after &lt;code&gt;terraform apply&lt;/code&gt; completed RDS creation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ecs-prod-service: unhealthy targets: 2/2
Task stopped reason: Task failed container health checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
RDS reports &lt;code&gt;available&lt;/code&gt; status when the instance is running, but doesn't accept connections for another 60-90 seconds while:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Background processes initialize&lt;/li&gt;
&lt;li&gt;Performance schema loads&lt;/li&gt;
&lt;li&gt;Cache warms up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Health check grace period (60s) absorbed the delay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_grace_period_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="c1"&gt;# ECS ignores failing health checks for 60s after a task starts&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; AWS resource statuses don't always mean "ready for traffic." Plan for initialization time.&lt;/p&gt;
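&lt;p&gt;Planning for initialization time can also happen inside the application. The sketch below is one way to do it, assuming the app opens its database connection through a callable; &lt;code&gt;wait_until_ready&lt;/code&gt;, its parameters, and the default exception type are hypothetical and would need adapting to the actual driver (most PostgreSQL drivers raise their own error class).&lt;/p&gt;

```python
import time


def wait_until_ready(connect, attempts=10, delay=1.0, backoff=2.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. Re-raises the last error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the function easy to test without real waiting.&lt;/p&gt;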




&lt;h3&gt;
  
  
  Issue 2: Docker HEALTHCHECK Missing Dependency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; 100+ ECS task restarts. ALB target health showed &lt;code&gt;healthy&lt;/code&gt;, but ECS kept replacing tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ALB target group
Target health: healthy (2/2)

# ECS service events
Unhealthy container: flask-app
Task stopped, starting replacement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Original (broken)&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; CMD curl -f http://localhost:8000/health || exit 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



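&lt;p&gt;But &lt;code&gt;curl&lt;/code&gt; isn't installed in the &lt;code&gt;python:3.11-slim&lt;/code&gt; base image, so the health check command itself failed on every run, regardless of whether the app was healthy.&lt;/p&gt;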

&lt;p&gt;&lt;strong&gt;Why ALB Passed but Container Failed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB health check: HTTP request from outside the container (port 8000)&lt;/li&gt;
&lt;li&gt;Container health check: Command executed &lt;strong&gt;inside&lt;/strong&gt; the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use Python instead of curl&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
ALB health checks and container health checks are &lt;strong&gt;independent control loops&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB health: Determines if task receives traffic&lt;/li&gt;
&lt;li&gt;Container health: Determines if ECS replaces the task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ECS replaces a task when either check fails, so a healthy ALB target does not protect a task whose container health check keeps failing.&lt;/p&gt;
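&lt;p&gt;The one-line Python health check can be moved into a small script, which is easier to read and test than an inline &lt;code&gt;python -c&lt;/code&gt; string. This is a sketch under assumptions: the file name &lt;code&gt;healthcheck.py&lt;/code&gt; and the functions &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;main&lt;/code&gt; are hypothetical, and it uses only the standard library so nothing extra needs installing in the slim image.&lt;/p&gt;

```python
import urllib.request


def check(url, timeout=2.0):
    """Return 0 if the endpoint answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status // 100 == 2 else 1
    except Exception:
        # Connection refused, timeout, HTTP 4xx/5xx, malformed URL:
        # all of these count as unhealthy.
        return 1


def main(argv):
    """Pick the URL from argv, defaulting to the app's health endpoint."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return check(url)


# In the image, the file would end with:
#   import sys; raise SystemExit(main(sys.argv))
```

&lt;p&gt;The Dockerfile would then call it with something like &lt;code&gt;HEALTHCHECK CMD python /app/healthcheck.py&lt;/code&gt;, where the path is an assumption about the image layout.&lt;/p&gt;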


&lt;h3&gt;
  
  
  Issue 3: Git Bash Path Conversion on Windows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; SSM parameter &lt;code&gt;/ecs-prod/db/password&lt;/code&gt; became &lt;code&gt;C:\Program Files\Git\ecs-prod\db\password&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ParameterNotFound: /C:/Program Files/Git/ecs-prod/db/password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



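&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; Git Bash on Windows rewrites any argument that looks like a Unix absolute path into a Windows path before passing it to a native program. An SSM parameter name that starts with &lt;code&gt;/&lt;/code&gt; looks like a path, so it gets rewritten too.&lt;/p&gt;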

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable path conversion&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Then run AWS CLI&lt;/span&gt;
aws ssm get-parameter &lt;span class="nt"&gt;--name&lt;/span&gt; /ecs-prod/db/password &lt;span class="nt"&gt;--with-decryption&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alternative:&lt;/strong&gt; Use Windows Command Prompt or PowerShell for AWS CLI commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Git Bash is great for Unix tools, but any AWS CLI argument that starts with &lt;code&gt;/&lt;/code&gt; (such as an SSM parameter name) needs path conversion disabled on Windows.&lt;/p&gt;




&lt;h3&gt;
  
  
  Issue 4: ALB Listener Syntax Constraint
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; After configuring blue-green with weighted routing, subsequent updates using simple syntax failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;An error occurred (ValidationError): Cannot use both TargetGroupArn and ForwardConfig in the same action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; Once you use &lt;code&gt;ForwardConfig&lt;/code&gt; (weighted routing), the ALB API remembers this and requires full JSON syntax for all future updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple syntax (stopped working):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;forward,TargetGroupArn&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Required syntax:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--default-actions&lt;/span&gt; &lt;span class="s1"&gt;'[{
  "Type": "forward",
  "ForwardConfig": {
    "TargetGroups": [{
      "TargetGroupArn": "arn:aws:...",
      "Weight": 100
    }]
  }
}]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; In my testing, once the listener used &lt;code&gt;ForwardConfig&lt;/code&gt;, the shorthand syntax was rejected on later updates. Use the full JSON form for every listener change and document it in runbooks.&lt;/p&gt;
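&lt;p&gt;Hand-writing the JSON for every switch is error-prone, so it can help to generate it. The sketch below builds the same structure as the "Required syntax" example; the function name &lt;code&gt;forward_action&lt;/code&gt; and the placeholder ARNs are hypothetical, and the 0 to 999 weight range is the limit documented for ALB target group weights.&lt;/p&gt;

```python
import json


def forward_action(weights):
    """Build the --default-actions JSON for a weighted forward action.

    `weights` maps a target group ARN to an integer weight (0 to 999).
    """
    if not weights:
        raise ValueError("at least one target group is required")
    target_groups = []
    for arn, weight in weights.items():
        if weight not in range(0, 1000):
            raise ValueError(f"weight out of range for {arn}: {weight}")
        target_groups.append({"TargetGroupArn": arn, "Weight": weight})
    action = {
        "Type": "forward",
        "ForwardConfig": {"TargetGroups": target_groups},
    }
    return json.dumps([action])


# Placeholder ARNs; real ones come from the target groups Terraform created.
print(forward_action({"BLUE_TG_ARN": 0, "GREEN_TG_ARN": 100}))
```

&lt;p&gt;The printed string can be passed directly as the value of &lt;code&gt;--default-actions&lt;/code&gt;, and the same helper covers both a full cutover (0/100) and a weighted canary split.&lt;/p&gt;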




&lt;h3&gt;
  
  
  Issue 5: PostgreSQL Minor Version Retirement
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Terraform apply failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InvalidParameterValue: Cannot find version 15.4 for postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; AWS had retired PostgreSQL 15.4 on RDS, so it could no longer be used for new instances. The latest available 15.x minor version was 15.12.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before&lt;/span&gt;
&lt;span class="nx"&gt;engine_version&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15.4"&lt;/span&gt;

&lt;span class="c1"&gt;# After&lt;/span&gt;
&lt;span class="nx"&gt;engine_version&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15.12"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Better Fix (Production):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pin major version, allow minor updates&lt;/span&gt;
&lt;span class="nx"&gt;engine_version&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; AWS manages the minor version lifecycle. Pin major versions intentionally, but expect the available minor versions to change over time.&lt;/p&gt;
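&lt;p&gt;When choosing a replacement version from a list of available ones, the comparison has to be numeric: as strings, &lt;code&gt;"15.4"&lt;/code&gt; sorts after &lt;code&gt;"15.12"&lt;/code&gt;. A minimal sketch, assuming the list of versions has already been fetched (for example from &lt;code&gt;aws rds describe-db-engine-versions&lt;/code&gt;); the helper &lt;code&gt;latest_minor&lt;/code&gt; and the sample list are hypothetical.&lt;/p&gt;

```python
def latest_minor(major, available):
    """Return the highest 'major.minor' version string for the given major."""
    candidates = []
    for version in available:
        parts = version.split(".")
        if len(parts) == 2 and parts[0] == str(major) and parts[1].isdigit():
            # Compare the minor part as an integer, not as text.
            candidates.append((int(parts[1]), version))
    if not candidates:
        raise LookupError(f"no available versions for major {major}")
    return max(candidates)[1]


# Hypothetical list of versions offered by the service.
available = ["15.10", "15.12", "16.1", "16.3"]
print(latest_minor(15, available))
```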




&lt;h3&gt;
  
  
  Issue 6: Terraform State Lock Timeout
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; &lt;code&gt;terraform plan&lt;/code&gt; hung for 5 minutes, then failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error acquiring state lock: timeout waiting for lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; DynamoDB lock table had wrong key schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong&lt;/span&gt;
&lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;,KeyType&lt;span class="o"&gt;=&lt;/span&gt;HASH

&lt;span class="c"&gt;# Correct&lt;/span&gt;
&lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LockID,KeyType&lt;span class="o"&gt;=&lt;/span&gt;HASH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete wrong table&lt;/span&gt;
aws dynamodb delete-table &lt;span class="nt"&gt;--table-name&lt;/span&gt; terraform-state-lock

&lt;span class="c"&gt;# Recreate with correct schema&lt;/span&gt;
aws dynamodb create-table &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; terraform-state-lock &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attribute-definitions&lt;/span&gt; &lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LockID,AttributeType&lt;span class="o"&gt;=&lt;/span&gt;S &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key-schema&lt;/span&gt; &lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LockID,KeyType&lt;span class="o"&gt;=&lt;/span&gt;HASH &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--billing-mode&lt;/span&gt; PAY_PER_REQUEST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Terraform's DynamoDB lock table requires &lt;strong&gt;exactly&lt;/strong&gt; &lt;code&gt;LockID&lt;/code&gt; as the partition key. The name is case-sensitive, and the attribute type must be string (&lt;code&gt;S&lt;/code&gt;).&lt;/p&gt;
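&lt;p&gt;This kind of mistake is cheap to catch before running &lt;code&gt;terraform init&lt;/code&gt;. The sketch below checks a table description in the shape returned by &lt;code&gt;aws dynamodb describe-table&lt;/code&gt;; the function name &lt;code&gt;has_terraform_lock_key&lt;/code&gt; and the sample dictionaries are hypothetical.&lt;/p&gt;

```python
def has_terraform_lock_key(description):
    """True if the table's only key is a string HASH key named exactly 'LockID'."""
    table = description.get("Table", {})
    key_ok = table.get("KeySchema") == [
        {"AttributeName": "LockID", "KeyType": "HASH"}
    ]
    type_ok = {"AttributeName": "LockID", "AttributeType": "S"} in table.get(
        "AttributeDefinitions", []
    )
    return key_ok and type_ok


# Trimmed-down, hypothetical describe-table outputs.
wrong = {"Table": {
    "AttributeDefinitions": [{"AttributeName": "id", "AttributeType": "S"}],
    "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
}}
correct = {"Table": {
    "AttributeDefinitions": [{"AttributeName": "LockID", "AttributeType": "S"}],
    "KeySchema": [{"AttributeName": "LockID", "KeyType": "HASH"}],
}}

print(has_terraform_lock_key(wrong), has_terraform_lock_key(correct))
```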




&lt;h3&gt;
  
  
  Issue 7: ECR Image Pull Failures (Intermittent)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Some task launches failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CannotPullContainerError: API error: manifest unknown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; Task execution role was missing &lt;code&gt;ecr:BatchGetImage&lt;/code&gt; permission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Attached AWS managed policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_task_execution"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task_execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Custom IAM policies are error-prone. Use AWS managed policies where possible, then restrict with conditions if needed.&lt;/p&gt;
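&lt;p&gt;If a custom policy is unavoidable, a quick check for the ECR pull actions can catch the gap before a deploy does. This is a rough sketch, not a policy evaluator: it ignores &lt;code&gt;Deny&lt;/code&gt; statements, &lt;code&gt;NotAction&lt;/code&gt;, resources, and conditions. The helper &lt;code&gt;missing_actions&lt;/code&gt; and the sample policy are hypothetical; the four actions listed are the ECR permissions included in &lt;code&gt;AmazonECSTaskExecutionRolePolicy&lt;/code&gt;.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

REQUIRED_ECR_ACTIONS = (
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
)


def missing_actions(policy_document, required=REQUIRED_ECR_ACTIONS):
    """Return the required actions that no Allow statement grants."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    allowed = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allowed.extend(actions)
    # IAM action names are case-insensitive and may contain wildcards.
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern.lower()) for pattern in allowed)
    ]


# Hypothetical custom policy that forgot ecr:BatchGetImage.
custom_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
        ],
        "Resource": "*",
    }],
}

print(missing_actions(custom_policy))
```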




&lt;h3&gt;
  
  
  Issue 8: Terraform vs Manual Scaling Conflict
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Terraform tried to update blue service's &lt;code&gt;desired_count&lt;/code&gt; while I was manually scaling for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: concurrent modification detected
Service is being modified by another operation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Added lifecycle rule to ignore runtime changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other config&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ignore_changes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;desired_count&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; When testing blue-green manually, let Terraform manage infrastructure but ignore runtime scaling changes. Use &lt;code&gt;ignore_changes&lt;/code&gt; selectively.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Change in Production
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;This Project&lt;/th&gt;
&lt;th&gt;Production Standard&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;th&gt;Cost Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ECS in public subnets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Private subnets + NAT Gateway&lt;/td&gt;
&lt;td&gt;Defense-in-depth, reduced attack surface&lt;/td&gt;
&lt;td&gt;+$33/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single-AZ RDS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-AZ RDS&lt;/td&gt;
&lt;td&gt;99.95% SLA vs 99.5%, automatic failover&lt;/td&gt;
&lt;td&gt;+$15/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual ALB switch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS CodeDeploy blue-green&lt;/td&gt;
&lt;td&gt;Automated rollback based on CloudWatch alarms&lt;/td&gt;
&lt;td&gt;$0 (free)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No autoscaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ECS Service Auto Scaling&lt;/td&gt;
&lt;td&gt;Handle traffic spikes, reduce idle costs&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSM Parameter Store&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Secrets Manager&lt;/td&gt;
&lt;td&gt;Automatic rotation, better audit&lt;/td&gt;
&lt;td&gt;+$0.40/secret&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No read replicas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RDS read replica&lt;/td&gt;
&lt;td&gt;Offload read traffic from primary&lt;/td&gt;
&lt;td&gt;+$15/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7-day log retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-90 day retention&lt;/td&gt;
&lt;td&gt;Compliance, longer incident investigation&lt;/td&gt;
&lt;td&gt;+$2/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total Production Cost:&lt;/strong&gt; ~$80-100/month&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This Project Cost:&lt;/strong&gt; $0.12 for validation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Savings Breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT Gateway: $33 saved&lt;/li&gt;
&lt;li&gt;Multi-AZ RDS: $15 saved&lt;/li&gt;
&lt;li&gt;Secrets Manager: $0.40 saved&lt;/li&gt;
&lt;li&gt;Read replica: $15 saved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $63.40/month saved&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Cost Breakdown (Actual)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  4-Hour Validation Session
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Hourly Rate&lt;/th&gt;
&lt;th&gt;Hours&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ALB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0225&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0675&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ECS Fargate&lt;/strong&gt; (2 tasks × 0.25 vCPU)&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;RDS db.t3.micro&lt;/strong&gt; (single-AZ)&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Route 53 Hosted Zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.50/month prorated&lt;/td&gt;
&lt;td&gt;3 days&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.05&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 State Bucket&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;$0.01&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudWatch Logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier (5GB)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier (100GB)&lt;/td&gt;
&lt;td&gt;&amp;lt;1GB&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
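
&lt;p&gt;The $0.12 total is the sum of the two non-zero rows, rounded to the cent. A short check of that arithmetic, using only the figures from the table (the variable names are illustrative):&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP

alb = Decimal("0.0225") * 3   # hourly rate x billed hours
route53 = Decimal("0.05")     # prorated hosted-zone charge from the table

total = alb + route53
rounded = total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(alb, total, rounded)    # prints: 0.0675 0.1175 0.12
```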
&lt;h3&gt;
  
  
  Monthly Cost If Kept Running
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;$16.20&lt;/td&gt;
&lt;td&gt;720 hours × $0.0225&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Within 400 vCPU-hour free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS db.t3.micro&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Within 750-hour free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route 53&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;Hosted zone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;&amp;lt;$1&lt;/td&gt;
&lt;td&gt;S3, CloudWatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$17/month&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
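
&lt;p&gt;The monthly figures follow from a 720-hour month. The sketch below reproduces the ALB line and checks the Fargate usage against the 400 vCPU-hour allowance quoted in the table; the task count and vCPU size come from the validation setup (2 tasks at 0.25 vCPU), and the variable names are illustrative.&lt;/p&gt;

```python
from decimal import Decimal

HOURS_PER_MONTH = 720

alb_monthly = Decimal("0.0225") * HOURS_PER_MONTH          # 16.20
fargate_vcpu_hours = 2 * Decimal("0.25") * HOURS_PER_MONTH  # 360 vCPU-hours
within_allowance = 400 >= fargate_vcpu_hours                # allowance from the table

# ALB plus the Route 53 hosted zone; storage adds less than $1 on top.
fixed_monthly = alb_monthly + Decimal("0.50")

print(alb_monthly, fargate_vcpu_hours, within_allowance, fixed_monthly)
```

&lt;p&gt;Two always-on tasks use 360 of the 400 vCPU-hours, so a third task of the same size would exceed the allowance.&lt;/p&gt;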

&lt;p&gt;&lt;strong&gt;Comparison to EKS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS control plane: $72/month&lt;/li&gt;
&lt;li&gt;ECS Fargate control plane: &lt;strong&gt;$0&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings: $72/month&lt;/strong&gt; for equivalent compute&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What Worked Well
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modular Terraform:&lt;/strong&gt; Each module had a single responsibility, which made debugging easier&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blue-green switching:&lt;/strong&gt; True zero downtime, instant rollback capability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security group layering:&lt;/strong&gt; Network isolation without complexity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSM secrets:&lt;/strong&gt; No credentials in code, images, or state files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free tier optimization:&lt;/strong&gt; Validated production patterns for $0.12&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation:&lt;/strong&gt; 8 real issues documented with root causes and fixes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What I'd Improve
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Terragrunt for DRY:&lt;/strong&gt; Multi-environment deployments without duplication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated testing:&lt;/strong&gt; Pre-deployment health checks in CI/CD&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CodeDeploy integration:&lt;/strong&gt; Production should automate blue-green&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability:&lt;/strong&gt; CloudWatch dashboards for latency, errors, saturation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database migrations:&lt;/strong&gt; Flyway or Liquibase for schema versioning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chaos engineering:&lt;/strong&gt; Terminate random tasks to test resilience&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Repository &amp;amp; Evidence
&lt;/h2&gt;

&lt;p&gt;Full source code with detailed documentation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/ecs-production-platform" rel="noopener noreferrer"&gt;cypher682/ecs-production-platform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's included:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Complete Terraform modules (networking, IAM, ALB, ECS, RDS)&lt;/li&gt;
&lt;li&gt; Flask application with Dockerfile and health checks&lt;/li&gt;
&lt;li&gt; GitHub Actions workflow (OIDC design, not tested live)&lt;/li&gt;
&lt;li&gt; Operational runbooks (deployment failure, database connection, rollback)&lt;/li&gt;
&lt;li&gt; Lessons learned documentation (8 issues with root cause analysis)&lt;/li&gt;
&lt;li&gt; Cost analysis (actual vs production projection)&lt;/li&gt;
&lt;li&gt; Evidence files (Terraform outputs, CloudWatch logs, test results)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Documentation structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/
├── ARCHITECTURE.md            # Design decisions and network diagrams
├── 01_IMPLEMENTATION.md       # Phase-by-phase build log
├── 02_LESSONS_LEARNED.md      # 8 issues with fixes
├── 03_COST_ANALYSIS.md        # Detailed cost breakdown
├── 04_SECURITY.md             # IAM policies, secrets flow
├── 05_CICD_DESIGN.md          # GitHub Actions workflow design
└── runbooks/
    ├── deployment-failure.md  # What to do when deploy fails
    ├── database-connection.md # Troubleshooting RDS connectivity
    └── rollback-procedure.md  # Step-by-step rollback guide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Questions or Feedback?
&lt;/h2&gt;

&lt;p&gt;If you're building something similar or have questions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Blue-green deployment patterns&lt;/li&gt;
&lt;li&gt; IAM least privilege design&lt;/li&gt;
&lt;li&gt; AWS cost optimization strategies&lt;/li&gt;
&lt;li&gt; Terraform module architecture&lt;/li&gt;
&lt;li&gt; Debugging ECS task failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Drop a comment below!&lt;/strong&gt; I'll respond with specific examples from this build.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project was built as a portfolio sprint to demonstrate production-ready AWS skills. The platform was deployed, validated with 5 CRUD operations, and destroyed within 4 hours—total cost $0.12. All code and documentation available on GitHub.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>aws</category>
      <category>terraform</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building a Production-Grade AWS Cost &amp; Security Auditor</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Thu, 12 Feb 2026 08:36:28 +0000</pubDate>
      <link>https://forem.com/cypher682/building-a-production-grade-aws-cost-security-auditor-nb1</link>
      <guid>https://forem.com/cypher682/building-a-production-grade-aws-cost-security-auditor-nb1</guid>
      <description>&lt;p&gt;Cloud environments naturally drift. Costs creep up. Security posture degrades. Manual audits do not scale, and periodic reviews miss issues that emerge between checks.&lt;/p&gt;

&lt;p&gt;I needed an auditing tool that could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify cost waste&lt;/strong&gt; — idle EC2 instances, unattached EBS volumes, orphaned snapshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect security misconfigurations&lt;/strong&gt; — public S3 buckets, overly permissive security groups, weak IAM hygiene&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map findings to a known framework&lt;/strong&gt; — CIS AWS Foundations Benchmark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operate safely&lt;/strong&gt; — strictly read-only, no automated deletion or remediation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This article walks through the key design decisions, trade-offs, and lessons learned from building it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Read-Only by Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; No auto-remediation. No destructive permissions.&lt;/p&gt;

&lt;p&gt;Automatically deleting or modifying cloud resources is risky, especially in production. An instance that appears idle may be a disaster-recovery standby, a scheduled batch worker, or part of a failover strategy.&lt;/p&gt;

&lt;p&gt;The tool’s role is to &lt;strong&gt;surface risk and waste&lt;/strong&gt;, not to make irreversible decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM policy scope:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"ec2:Describe*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"s3:Get*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"s3:List*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"iam:List*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"iam:Get*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;Delete&lt;/code&gt;, &lt;code&gt;Terminate&lt;/code&gt;, or &lt;code&gt;Modify&lt;/code&gt; permissions. The blast radius is limited to discovery only.&lt;/p&gt;

&lt;p&gt;This constraint shaped the entire architecture and made the tool safe to run against live accounts.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Defining “Idle” Using CloudWatch Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; “Idle” is ambiguous in cloud systems.&lt;/p&gt;

&lt;p&gt;CPU utilization is an imperfect signal, but it is widely available and easy to reason about. I defined idle EC2 instances as those with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Average CPU utilization &amp;lt; 5%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observed over a 7-day window&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cpu_utilization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_metric_statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS/EC2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MetricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CPUUtilization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InstanceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;StartTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;EndTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 1 day
&lt;/span&gt;        &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datapoints&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Misclassifies bursty or scheduled workloads (batch jobs, ML training) as idle, because a 7-day average hides short spikes&lt;/li&gt;
&lt;li&gt;Flags instances that are intentionally dormant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than hiding these limitations, I document them explicitly. Transparency builds trust in tooling.&lt;/p&gt;
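
&lt;p&gt;To make the first trade-off concrete, here is a minimal sketch (not the auditor's actual code) that labels an instance from its daily CPU averages. The function name &lt;code&gt;classify_instance&lt;/code&gt; and the 20% peak threshold are assumptions for illustration; only the 5% average comes from the definition above.&lt;/p&gt;

```python
from statistics import mean


def classify_instance(daily_averages, avg_threshold=5.0, peak_threshold=20.0):
    """Label an instance from its daily average CPU values (percent)."""
    if not daily_averages:
        return "unknown"  # no datapoints, so make no claim either way
    if mean(daily_averages) >= avg_threshold:
        return "active"
    # The weekly mean is low, but one busy day suggests a scheduled or bursty job.
    if max(daily_averages) >= peak_threshold:
        return "bursty"
    return "idle"
```

&lt;p&gt;A week of &lt;code&gt;[1, 1, 1, 1, 1, 1, 20]&lt;/code&gt; averages below 5% but comes back as &lt;code&gt;bursty&lt;/code&gt; rather than &lt;code&gt;idle&lt;/code&gt;. Daily averages still hide spikes shorter than a day, so this narrows the gap rather than closing it.&lt;/p&gt;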




&lt;h3&gt;
  
  
  3. Aligning Findings with CIS Benchmarks
&lt;/h3&gt;

&lt;p&gt;Raw findings are less useful without context. Mapping issues to the &lt;strong&gt;CIS AWS Foundations Benchmark&lt;/strong&gt; adds structure and credibility.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Public access enabled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cis_control&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.1.5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remediation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Enable S3 Block Public Access&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Makes findings actionable&lt;/li&gt;
&lt;li&gt;Aligns with how security teams think&lt;/li&gt;
&lt;li&gt;Signals familiarity with compliance-driven environments&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results from a Real AWS Account
&lt;/h2&gt;

&lt;p&gt;Running the auditor against my own AWS account produced the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECURITY FINDINGS: 32
  CRITICAL: 11  (SSH/RDP open to 0.0.0.0/0)
  HIGH:      9  (IAM users without MFA, public S3 buckets)
  MEDIUM:   12  (stale access keys, permissive policies)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even a relatively small account accumulated meaningful security drift. Continuous auditing is not optional at scale.&lt;/p&gt;
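
&lt;p&gt;A summary like the one above can be produced from the findings list with a simple counter. This is a sketch under the assumption that every finding carries a &lt;code&gt;severity&lt;/code&gt; key, as in the CIS example earlier; the function name &lt;code&gt;summarize&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM"]


def summarize(findings):
    """Return report lines: a total, then one line per severity that occurs."""
    counts = Counter(finding["severity"] for finding in findings)
    lines = [f"SECURITY FINDINGS: {len(findings)}"]
    for severity in SEVERITY_ORDER:
        if counts[severity]:
            lines.append(f"  {severity}: {counts[severity]}")
    return lines
```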




&lt;h2&gt;
  
  
  Handling False Positives
&lt;/h2&gt;

&lt;p&gt;One flagged issue was my own &lt;code&gt;AuditorToolReadOnly&lt;/code&gt; IAM policy using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this looks overly permissive. In practice, it is required: discovery calls such as &lt;code&gt;List*&lt;/code&gt; and &lt;code&gt;Describe*&lt;/code&gt; enumerate resources whose ARNs are not known in advance, and most EC2 &lt;code&gt;Describe*&lt;/code&gt; actions do not support resource-level permissions at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all flagged issues are actionable&lt;/li&gt;
&lt;li&gt;False positives should be documented, not ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction is critical in real operational environments.&lt;/p&gt;
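
&lt;p&gt;One way to document a false positive instead of ignoring it is a suppression table that records why each accepted finding is acceptable. The sketch below is an assumption about how that could look, not the tool's implementation; the &lt;code&gt;resource&lt;/code&gt; key, the issue text, and the &lt;code&gt;triage&lt;/code&gt; function are hypothetical names.&lt;/p&gt;

```python
# Reviewed (resource, issue) pairs, each with a written justification.
SUPPRESSIONS = {
    ("AuditorToolReadOnly", "Wildcard resource in policy"):
        "Read-only Describe*/List* discovery calls require Resource '*'.",
}


def triage(findings, suppressions=SUPPRESSIONS):
    """Split findings into actionable ones and documented false positives."""
    actionable, accepted = [], []
    for finding in findings:
        key = (finding["resource"], finding["issue"])
        if key in suppressions:
            # Keep the finding visible, but attach the recorded reason.
            accepted.append({**finding, "justification": suppressions[key]})
        else:
            actionable.append(finding)
    return actionable, accepted
```

&lt;p&gt;Accepted findings stay in the report with their justification attached, so a reviewer can see that the wildcard was examined rather than overlooked.&lt;/p&gt;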




&lt;h2&gt;
  
  
  What I Would Improve for a Production Deployment
&lt;/h2&gt;

&lt;p&gt;If this were moving beyond a portfolio project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region scanning&lt;/strong&gt; instead of single-region execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical persistence&lt;/strong&gt; using DynamoDB for trend analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Cost Explorer integration&lt;/strong&gt; for real billing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt; via Slack or SNS for critical findings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Access Analyzer integration&lt;/strong&gt; for deeper policy analysis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The current scope keeps the project realistic without overengineering it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Read-only audits reduce risk and build trust when running against live environments&lt;/li&gt;
&lt;li&gt;Cost and security signals are more useful when tied to metrics and known frameworks&lt;/li&gt;
&lt;li&gt;Not every finding should be auto-remediated; judgment still matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These principles guided the design choices throughout this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Repository:&lt;br&gt;
&lt;a href="https://github.com/cypher682/aws-cost-security-auditor" rel="noopener noreferrer"&gt;https://github.com/cypher682/aws-cost-security-auditor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cypher682/aws-cost-security-auditor
&lt;span class="nb"&gt;cd &lt;/span&gt;aws-cost-security-auditor
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python src/full_audit.py &lt;span class="nt"&gt;--profile&lt;/span&gt; auditor-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the remediation guidance in &lt;code&gt;docs/REMEDIATION_PLAYBOOK.md&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Planned extensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multi-account support (AWS Organizations)&lt;/li&gt;
&lt;li&gt;RDS idle detection&lt;/li&gt;
&lt;li&gt;Lambda cost analysis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X: &lt;a href="https://twitter.com/cypher682" rel="noopener noreferrer"&gt;https://twitter.com/cypher682&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://linkedin.com/in/suleiman-abdulrahman-dev" rel="noopener noreferrer"&gt;https://linkedin.com/in/suleiman-abdulrahman-dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This project is part of my portfolio focused on production-grade cloud platform engineering.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Production-Ready Microservices Platform with CI/CD on AWS Free Tier</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Thu, 11 Dec 2025 14:37:20 +0000</pubDate>
      <link>https://forem.com/cypher682/building-a-production-ready-microservices-platform-with-cicd-on-aws-free-tier-2la0</link>
      <guid>https://forem.com/cypher682/building-a-production-ready-microservices-platform-with-cicd-on-aws-free-tier-2la0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Building a complete microservices architecture with professional DevOps practices can be intimidating. This guide walks you through creating a production-grade system using AWS free tier, demonstrating real-world patterns that mid-level engineers can apply immediately.&lt;/p&gt;

&lt;p&gt;We'll build a polyglot microservices platform with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three microservices in different languages (Node.js, Python, Go)&lt;/li&gt;
&lt;li&gt;Complete CI/CD pipeline with automated testing and security scanning&lt;/li&gt;
&lt;li&gt;Infrastructure as Code with Terraform&lt;/li&gt;
&lt;li&gt;Monitoring with Prometheus and Grafana&lt;/li&gt;
&lt;li&gt;All running on AWS free tier for under $2/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Our platform consists of three microservices: an API Gateway that fronts a User Service and a Product Service, all running on a single EC2 instance with Docker Compose. This setup doesn't scale like Kubernetes, but it demonstrates the core concepts without the complexity or cost.&lt;/p&gt;

&lt;p&gt;The API Gateway handles routing and authentication, forwarding requests to the User Service (Python/FastAPI) for user operations and Product Service (Go/Gin) for product management. Both services interact with separate DynamoDB tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Stack?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Polyglot Architecture&lt;/strong&gt;: Using multiple languages demonstrates how microservices enable technology diversity. Each service uses the best tool for its job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Compose on EC2&lt;/strong&gt;: Kubernetes is powerful but complex. Docker Compose provides enough orchestration for small to medium deployments while staying within the free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: Pay-per-request pricing means a demo that sits idle costs nothing for reads and writes, and the free tier covers 25GB of storage. (The free tier's 25 RCU/WCU allowance applies to provisioned capacity, not to on-demand tables.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions&lt;/strong&gt;: Native GitHub integration means no external CI/CD service needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure Setup
&lt;/h2&gt;

&lt;p&gt;We use Terraform to create all AWS resources. The infrastructure includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC with public subnet&lt;/li&gt;
&lt;li&gt;EC2 t2.micro instance&lt;/li&gt;
&lt;li&gt;DynamoDB tables for users and products&lt;/li&gt;
&lt;li&gt;Route53 hosted zone&lt;/li&gt;
&lt;li&gt;IAM roles with least-privilege policies&lt;/li&gt;
&lt;li&gt;Security groups allowing only necessary ports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The EC2 instance is configured via a user data script that installs Docker Engine at launch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ubuntu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t2.micro"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;iam_instance_profile&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_instance_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;user_data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
              #!/bin/bash
              apt-get update
              curl -fsSL https://get.docker.com | sh
              usermod -aG docker ubuntu
&lt;/span&gt;&lt;span class="no"&gt;              EOF
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DynamoDB tables use on-demand billing, so an idle demo incurs no read or write charges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_dynamodb_table"&lt;/span&gt; &lt;span class="s2"&gt;"users"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"aws-microservices-cicd-users"&lt;/span&gt;
  &lt;span class="nx"&gt;billing_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PAY_PER_REQUEST"&lt;/span&gt;
  &lt;span class="nx"&gt;hash_key&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"userId"&lt;/span&gt;

  &lt;span class="nx"&gt;attribute&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"userId"&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Microservices Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API Gateway (Node.js)
&lt;/h3&gt;

&lt;p&gt;The API Gateway handles authentication via API keys stored in AWS Systems Manager Parameter Store and routes requests to appropriate services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authMiddleware&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

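&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;authMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;USER_SERVICE_URL&lt;/span&gt;&lt;span class="p"&gt;}${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// axios rejects on non-2xx responses; relay the upstream status when there is one&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Upstream request failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;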
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User Service (Python)
&lt;/h3&gt;

&lt;p&gt;FastAPI provides automatic API documentation and Pydantic validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;UserResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Build the item once so the stored and returned data match
&lt;/span&gt;    &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;userId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Product Service (Go)
&lt;/h3&gt;

&lt;p&gt;Go's performance makes it ideal for high-throughput services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;createProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;
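    &lt;span class="c"&gt;// Reject malformed request bodies instead of storing a zero-value product&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldBindJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;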

    &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProductID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreatedAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RFC3339&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

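    &lt;span class="n"&gt;av&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dynamodbattribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarshalMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PutItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PutItemInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;TableName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;av&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"failed to store product"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;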

    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The GitHub Actions pipeline has two workflows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI Pipeline&lt;/strong&gt; (on pull requests):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lint code for all three services&lt;/li&gt;
&lt;li&gt;Run unit tests&lt;/li&gt;
&lt;li&gt;Build Docker images&lt;/li&gt;
&lt;li&gt;Scan containers with Trivy for vulnerabilities&lt;/li&gt;
&lt;li&gt;Upload security findings to GitHub Security&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;CD Pipeline&lt;/strong&gt; (on merge to main):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build Docker images for all services&lt;/li&gt;
&lt;li&gt;Push to Docker Hub with &lt;code&gt;latest&lt;/code&gt; and commit SHA tags&lt;/li&gt;
&lt;li&gt;SSH to EC2 instance&lt;/li&gt;
&lt;li&gt;Pull new images&lt;/li&gt;
&lt;li&gt;Restart containers with docker-compose&lt;/li&gt;
&lt;li&gt;Run health checks&lt;/li&gt;
&lt;li&gt;Roll back on failure
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to EC2&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;ssh ubuntu@${{ steps.get-ip.outputs.instance_ip }} &amp;lt;&amp;lt; 'EOF'&lt;/span&gt;
      &lt;span class="s"&gt;cd /home/ubuntu/app&lt;/span&gt;
      &lt;span class="s"&gt;docker-compose pull&lt;/span&gt;
      &lt;span class="s"&gt;docker-compose up -d&lt;/span&gt;
    &lt;span class="s"&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
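
&lt;p&gt;Steps 6 and 7 of the CD pipeline are not shown in the snippet above. A minimal Python sketch of the health gate could look like the following. The function names, retry counts, and injectable &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; parameters are illustrative assumptions, not code from the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import urllib.request


def http_ok(url, timeout_sec=3.0):
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        return False


def wait_until_healthy(urls, check=http_ok, attempts=5, delay_sec=2.0, sleep=time.sleep):
    """Poll every URL until all of them pass, giving up after `attempts` rounds."""
    pending = list(urls)
    for attempt in range(attempts):
        # Keep only the endpoints that are still failing
        pending = [url for url in pending if not check(url)]
        if not pending:
            return True
        if attempt != attempts - 1:
            sleep(delay_sec)
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A deploy job could exit with a non-zero status when &lt;code&gt;wait_until_healthy&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt; and treat that as the signal to restart the previous image tags.&lt;/p&gt;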



&lt;h2&gt;
  
  
  Security Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Container Scanning&lt;/strong&gt;: Trivy scans images for known vulnerabilities before deployment. Critical and high severity findings block the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets Management&lt;/strong&gt;: API keys and sensitive data live in AWS Systems Manager Parameter Store, not in environment variables or code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSL/TLS&lt;/strong&gt;: Let's Encrypt provides free SSL certificates, configured via Ansible during initial setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM Policies&lt;/strong&gt;: The EC2 instance role has minimal permissions and can access only specific DynamoDB tables and Parameter Store paths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy"&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"dynamodb:PutItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"dynamodb:GetItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"dynamodb:Query"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nx"&gt;aws_dynamodb_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;aws_dynamodb_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring Stack
&lt;/h2&gt;

&lt;p&gt;Prometheus scrapes metrics from all three services every 15 seconds. Each service exposes a &lt;code&gt;/metrics&lt;/code&gt; endpoint with custom business metrics plus standard HTTP metrics.&lt;/p&gt;

&lt;p&gt;Grafana visualizes the data with dashboards showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request rates and latencies per service&lt;/li&gt;
&lt;li&gt;Error rates (4xx, 5xx responses)&lt;/li&gt;
&lt;li&gt;Container resource usage (CPU, memory)&lt;/li&gt;
&lt;li&gt;DynamoDB operation metrics&lt;/li&gt;
&lt;li&gt;Custom business metrics (user registrations, product creates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The monitoring stack runs alongside the application services in Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus:latest&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090:9090"&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:latest&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=admin&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3001:3000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration Management with Ansible
&lt;/h2&gt;

&lt;p&gt;Ansible handles EC2 setup tasks that are complex or stateful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing Docker and Docker Compose&lt;/li&gt;
&lt;li&gt;Configuring UFW firewall&lt;/li&gt;
&lt;li&gt;Setting up Nginx with SSL&lt;/li&gt;
&lt;li&gt;Obtaining Let's Encrypt certificates&lt;/li&gt;
&lt;li&gt;Creating application directories&lt;/li&gt;
&lt;li&gt;Installing Prometheus Node Exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup playbook runs once after Terraform creates the infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Docker&lt;/span&gt;
  &lt;span class="na"&gt;apt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker-ce&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docker-ce-cli&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;containerd.io&lt;/span&gt;
    &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;present&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Obtain SSL certificate&lt;/span&gt;
  &lt;span class="c1"&gt;# certbot_email is a placeholder variable; non-interactive registration needs an email address&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;certbot --nginx -d api.cipherpol.xyz&lt;/span&gt;
    &lt;span class="s"&gt;--non-interactive --agree-tos&lt;/span&gt;
    &lt;span class="s"&gt;--email {{ certbot_email }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;p&gt;This architecture costs $0.50-2.00/month:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route53&lt;/strong&gt;: $0.50/month for hosted zone (only unavoidable cost)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 t2.micro&lt;/strong&gt;: Free tier provides 750 hours/month, enough for one instance running 24/7&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: Free tier includes 25GB storage and 25 RCU/WCU, sufficient for development and small production workloads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Transfer&lt;/strong&gt;: First 1GB/month free, typically sufficient for API traffic&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3&lt;/strong&gt;: Used for Terraform state, negligible cost&lt;/p&gt;

&lt;p&gt;The key is using on-demand pricing for DynamoDB (no provisioned capacity charges) and staying within EC2 free tier limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Keep it Simple&lt;/strong&gt;: Docker Compose on one instance is simpler than ECS/EKS and adequate for many workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security First&lt;/strong&gt;: Even demo projects should implement proper security. Trivy scanning and least-privilege IAM policies cost nothing and catch problems before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Matters&lt;/strong&gt;: Prometheus and Grafana add minimal overhead but provide invaluable visibility into system behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate Everything&lt;/strong&gt;: Manual deployments are error-prone. GitHub Actions makes automation straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-Aware Design&lt;/strong&gt;: Understanding free tier limits enables building production-quality systems for minimal cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Challenges Faced&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker build failures due to deprecated npm flags and missing dependency files&lt;/li&gt;
&lt;li&gt;IAM permission mismatches between parameter paths and policies&lt;/li&gt;
&lt;li&gt;Port conflicts between services requiring careful docker-compose configuration&lt;/li&gt;
&lt;li&gt;Nginx SSL configuration needing manual proxy_pass setup after certbot&lt;/li&gt;
&lt;li&gt;Multi-stage Docker builds requiring proper user permission handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single Point of Failure&lt;/strong&gt;: One EC2 instance means no high availability. For production, consider auto-scaling groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited Scalability&lt;/strong&gt;: Docker Compose doesn't provide automatic scaling. When traffic grows, consider ECS or EKS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared Resources&lt;/strong&gt;: All services share EC2 resources. A memory leak in one service affects others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Managed SSL Renewal&lt;/strong&gt;: Certificate renewal is automated via cron, but that is less robust than the managed renewal AWS Certificate Manager provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;p&gt;This architecture pattern works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal tools and admin dashboards&lt;/li&gt;
&lt;li&gt;API backends for mobile apps&lt;/li&gt;
&lt;li&gt;Prototypes and MVPs&lt;/li&gt;
&lt;li&gt;Side projects and personal applications&lt;/li&gt;
&lt;li&gt;Learning and portfolio projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not suitable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-traffic consumer applications&lt;/li&gt;
&lt;li&gt;Systems requiring 99.99% uptime&lt;/li&gt;
&lt;li&gt;Workloads with unpredictable traffic spikes&lt;/li&gt;
&lt;li&gt;Applications requiring compliance certifications&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;To evolve this architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add Authentication Service&lt;/strong&gt;: Implement OAuth2/JWT for user authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Caching&lt;/strong&gt;: Add Redis for frequently accessed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue System&lt;/strong&gt;: Use SQS for asynchronous processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway Replacement&lt;/strong&gt;: Consider AWS API Gateway for rate limiting and caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Region&lt;/strong&gt;: Deploy in multiple regions for disaster recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Migration&lt;/strong&gt;: When scaling requirements justify complexity&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a complete microservices platform doesn't require expensive infrastructure or complex orchestration. This project demonstrates that professional DevOps practices are accessible and affordable.&lt;/p&gt;

&lt;p&gt;The skills developed here—Infrastructure as Code, CI/CD pipelines, container orchestration, and monitoring—translate directly to enterprise environments. The architecture patterns scale from side projects to production systems.&lt;/p&gt;

&lt;p&gt;Most importantly, this hands-on experience with real tools and services is more valuable than theoretical knowledge. You now have a working platform to experiment with, break, fix, and improve.&lt;/p&gt;

&lt;p&gt;The complete source code is available on GitHub, along with detailed setup instructions and troubleshooting guides. Fork it, modify it, and make it your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Repository: &lt;a href="https://github.com/cypher682/aws-microservices-cicd" rel="noopener noreferrer"&gt;aws-microservices-cicd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Free Tier: &lt;a href="https://aws.amazon.com/free" rel="noopener noreferrer"&gt;https://aws.amazon.com/free&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docker Compose Documentation: &lt;a href="https://docs.docker.com/compose" rel="noopener noreferrer"&gt;https://docs.docker.com/compose&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Terraform AWS Provider: &lt;a href="https://registry.terraform.io/providers/hashicorp/aws" rel="noopener noreferrer"&gt;https://registry.terraform.io/providers/hashicorp/aws&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Prometheus Monitoring: &lt;a href="https://prometheus.io/docs" rel="noopener noreferrer"&gt;https://prometheus.io/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Actions: &lt;a href="https://docs.github.com/actions" rel="noopener noreferrer"&gt;https://docs.github.com/actions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>microservices</category>
      <category>aws</category>
    </item>
    <item>
      <title>Complete Beginner's Guide to Blue-Green Deployment with Nginx and Real-Time Alerting</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Tue, 09 Dec 2025 14:49:42 +0000</pubDate>
      <link>https://forem.com/cypher682/complete-beginners-guide-to-blue-green-deployment-with-nginx-and-real-time-alerting-268c</link>
      <guid>https://forem.com/cypher682/complete-beginners-guide-to-blue-green-deployment-with-nginx-and-real-time-alerting-268c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to this comprehensive guide on &lt;strong&gt;Blue-Green Deployment&lt;/strong&gt; - a powerful deployment strategy used by companies like Netflix, Amazon, and Facebook to achieve &lt;strong&gt;zero-downtime deployments&lt;/strong&gt;. This project demonstrates how to implement a production-ready blue-green deployment system with automatic failover and real-time Slack alerting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/hng13-stage-3-devops" rel="noopener noreferrer"&gt;HNG DevOps on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What blue-green deployment is and why it matters&lt;/li&gt;
&lt;li&gt;How to implement automatic failover with Nginx&lt;/li&gt;
&lt;li&gt;How to build a real-time monitoring and alerting system&lt;/li&gt;
&lt;li&gt;How to achieve zero-downtime deployments&lt;/li&gt;
&lt;li&gt;How to integrate Slack notifications for DevOps alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Docker&lt;/li&gt;
&lt;li&gt;Familiarity with command line&lt;/li&gt;
&lt;li&gt;Basic knowledge of web servers (helpful but not required)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is Blue-Green Deployment?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Traditional Deployments
&lt;/h3&gt;

&lt;p&gt;Imagine you're running a website. When you deploy a new version:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You stop the old version&lt;/li&gt;
&lt;li&gt;Deploy the new version&lt;/li&gt;
&lt;li&gt;Start the new version&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; During steps 1-3, your website is &lt;strong&gt;DOWN&lt;/strong&gt;. Users see errors. You lose money. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Blue-Green Deployment
&lt;/h3&gt;

&lt;p&gt;Instead of having one environment, you have &lt;strong&gt;TWO identical environments&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BLUE&lt;/strong&gt; (Production) - Currently serving users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GREEN&lt;/strong&gt; (Staging) - New version waiting to go live&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you're ready to deploy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy new version to GREEN&lt;/li&gt;
&lt;li&gt;Test GREEN thoroughly&lt;/li&gt;
&lt;li&gt;Switch traffic from BLUE to GREEN &lt;strong&gt;instantly&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If something goes wrong, switch back to BLUE &lt;strong&gt;instantly&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; &lt;strong&gt;ZERO DOWNTIME&lt;/strong&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Analogy
&lt;/h3&gt;

&lt;p&gt;Think of it like having two stages at a concert:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 (Blue)&lt;/strong&gt;: Band is performing, audience is watching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 (Green)&lt;/strong&gt;: Next band is setting up and sound-checking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it's time to switch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotate the stage 180°&lt;/li&gt;
&lt;li&gt;Audience now sees Stage 2 (Green)&lt;/li&gt;
&lt;li&gt;Stage 1 (Blue) becomes the setup area for the next act&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the new band has technical issues, rotate back to Stage 1 instantly!&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;This project implements a &lt;strong&gt;production-ready blue-green deployment system&lt;/strong&gt; with:&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automatic Failover&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nginx detects when Blue instance fails&lt;/li&gt;
&lt;li&gt;Automatically routes all traffic to Green&lt;/li&gt;
&lt;li&gt;Zero failed requests to users&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Alerting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python watcher monitors Nginx logs&lt;/li&gt;
&lt;li&gt;Detects failover events instantly&lt;/li&gt;
&lt;li&gt;Sends alerts to Slack&lt;/li&gt;
&lt;li&gt;Monitors error rates&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Zero-Downtime Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy new version to inactive instance&lt;/li&gt;
&lt;li&gt;Switch traffic instantly&lt;/li&gt;
&lt;li&gt;Rollback in seconds if needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Structured Logging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every request logged with metadata&lt;/li&gt;
&lt;li&gt;Pool information (blue/green)&lt;/li&gt;
&lt;li&gt;Release version&lt;/li&gt;
&lt;li&gt;Response times&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Understanding the Core Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Blue-Green vs. Load Balancing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Load Balancing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1 → Blue
Request 2 → Green
Request 3 → Blue
Request 4 → Green
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic is &lt;strong&gt;distributed&lt;/strong&gt; between instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue-Green (This Project):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All Requests → Blue (Primary)
              Green (Backup, standby)

If Blue fails:
All Requests → Green (Backup becomes active)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic goes to &lt;strong&gt;ONE instance&lt;/strong&gt; at a time. The other is a hot standby.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Nginx Upstream Configuration
&lt;/h3&gt;

&lt;p&gt;Nginx can route traffic to multiple backend servers (upstreams). This project uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;app_blue:3000&lt;/span&gt; &lt;span class="s"&gt;max_fails=1&lt;/span&gt; &lt;span class="s"&gt;fail_timeout=5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;app_green:3000&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt; &lt;span class="s"&gt;max_fails=1&lt;/span&gt; &lt;span class="s"&gt;fail_timeout=5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Directives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;server app_blue:3000&lt;/code&gt; - Primary server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup&lt;/code&gt; - Only use if primary fails&lt;/li&gt;
&lt;li&gt;
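&lt;code&gt;max_fails=1&lt;/code&gt; - Mark the server as failed after 1 failed attempt within the &lt;code&gt;fail_timeout&lt;/code&gt; window&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fail_timeout=5s&lt;/code&gt; - Keep a failed server out of rotation for 5 seconds, then try it again&lt;/li&gt;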
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Failover Mechanism
&lt;/h3&gt;

&lt;p&gt;When a request fails:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Nginx tries Blue instance&lt;/li&gt;
&lt;li&gt;Blue returns 5xx error or times out&lt;/li&gt;
&lt;li&gt;Nginx marks Blue as failed&lt;/li&gt;
&lt;li&gt;Nginx retries request to Green (backup)&lt;/li&gt;
&lt;li&gt;User receives successful response from Green&lt;/li&gt;
&lt;li&gt;Subsequent requests go to Green until &lt;code&gt;fail_timeout&lt;/code&gt; expires and Nginx tries Blue again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; User never sees an error!&lt;/p&gt;
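
&lt;p&gt;The steps above can be sketched as a toy Python model. It is a simplification, not how Nginx is implemented: the class and method names are made up for illustration, the retryable status codes mirror the &lt;code&gt;proxy_next_upstream&lt;/code&gt; directive in the project's Nginx template, and the expiry of &lt;code&gt;fail_timeout&lt;/code&gt; is modeled as an explicit &lt;code&gt;recover()&lt;/code&gt; call instead of a timer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class FailoverProxy:
    """Toy model of an upstream with one primary (Blue) and one backup (Green)."""

    def __init__(self, primary, backup, retry_on=(502, 503, 504)):
        # Each backend is a callable that takes a request and returns a status
        # code, or None to represent a connection error or timeout.
        self.primary = primary
        self.backup = backup
        self.retry_on = retry_on
        self.primary_failed = False

    def handle(self, request):
        """Return (pool, status) for one request."""
        if not self.primary_failed:
            status = self.primary(request)
            if status is not None and status not in self.retry_on:
                return "blue", status
            # max_fails=1: a single failed attempt marks Blue as unavailable
            self.primary_failed = True
        # The same request is retried on Green, so the client gets Green's response
        return "green", self.backup(request)

    def recover(self):
        """Model fail_timeout expiring: Blue becomes eligible again."""
        self.primary_failed = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this model a request that hits a failing Blue still returns Green's status, which is why the client does not see the error.&lt;/p&gt;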




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  System Architecture Diagram
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                Internet
                   |
                   v
            +--------------+
            |    Nginx     |
            |  (Port 8080) |
            +--------------+
                   |
    +--------------+--------------+
    |                             |
    v                             v
+----------+                +----------+
| App Blue | (Primary)      |App Green | (Backup)
|Port 3000 |                |Port 3000 |
+----------+                +----------+
    |                             |
    +-------------+---------------+
                  |
                  v
          +--------------+
          | Nginx Logs   |
          +--------------+
                  |
                  v
        +------------------+
        | Alert Watcher    |
        | (Python)         |
        +------------------+
                  |
                  v
          +--------------+
          |    Slack     |
          +--------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Breakdown
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Nginx Proxy
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Reverse proxy and traffic router&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route all incoming requests&lt;/li&gt;
&lt;li&gt;Detect backend failures&lt;/li&gt;
&lt;li&gt;Perform automatic failover&lt;/li&gt;
&lt;li&gt;Log all requests with metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. App Blue (Primary Instance)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Primary application server&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APP_POOL=blue
RELEASE_ID=blue-release-1.0.0
PORT=3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. App Green (Backup Instance)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Backup application server (hot standby)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APP_POOL=green
RELEASE_ID=green-release-1.0.0
PORT=3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Alert Watcher (Python)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Real-time log monitoring and alerting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tail Nginx access logs&lt;/li&gt;
&lt;li&gt;Parse structured log entries&lt;/li&gt;
&lt;li&gt;Detect failover events&lt;/li&gt;
&lt;li&gt;Monitor error rates&lt;/li&gt;
&lt;li&gt;Send Slack alerts&lt;/li&gt;
&lt;/ul&gt;
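
&lt;p&gt;As a rough illustration of the failover detection step, the sketch below scans log lines for the serving pool and reports each change. The &lt;code&gt;pool=...&lt;/code&gt; key-value log format and the function name are assumptions made for this example; the real watcher may parse a different format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Assumed log field, e.g. "... pool=blue release=blue-release-1.0.0 ..."
POOL_FIELD = re.compile(r"\bpool=(\w+)")


def detect_failovers(lines):
    """Yield (previous_pool, new_pool) each time the serving pool changes."""
    previous = None
    for line in lines:
        match = POOL_FIELD.search(line)
        if match is None:
            continue  # skip lines without pool metadata
        current = match.group(1)
        if previous is not None and current != previous:
            yield previous, current
        previous = current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because it is a generator, the same function works on a list of lines in a test and on a file object that is being tailed.&lt;/p&gt;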




&lt;h2&gt;
  
  
  Technology Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reverse Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nginx (Alpine)&lt;/td&gt;
&lt;td&gt;Traffic routing &amp;amp; failover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python/Flask&lt;/td&gt;
&lt;td&gt;Demo web application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.11&lt;/td&gt;
&lt;td&gt;Log watcher &amp;amp; alerting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack Webhooks&lt;/td&gt;
&lt;td&gt;Real-time notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containerization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Package all services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;td&gt;Manage multi-container setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Setting Up the Project
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Clone the Repository
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cypher682/hng13-stage-3-devops.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hng13-stage-3-devops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Understand the Project Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hng13-stage-3-devops/
├── docker-compose.yml       # Container orchestration
├── nginx.conf.template      # Nginx configuration template
├── entrypoint.sh            # Nginx startup script
├── watcher.py               # Python log monitoring script
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── test-failover.sh         # Failover testing script
└── public/                  # Static HTML files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure Environment Variables
&lt;/h3&gt;

&lt;p&gt;Copy the example environment file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit &lt;code&gt;.env&lt;/code&gt; with your configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Application Configuration
PORT=3000
ACTIVE_POOL=blue

# Docker Images
BLUE_IMAGE=yimikaade/wonderful:latest
GREEN_IMAGE=yimikaade/wonderful:latest

# Release Identifiers
RELEASE_ID_BLUE=blue-release-1.0.0
RELEASE_ID_GREEN=green-release-1.0.0

# Slack Integration
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# Alert Configuration
ERROR_RATE_THRESHOLD=2          # Percentage
WINDOW_SIZE=200                 # Number of requests
ALERT_COOLDOWN_SEC=300          # 5 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
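
&lt;p&gt;To show how the three alert settings could interact, here is a minimal sliding-window sketch. The class name, the rule of counting only 5xx responses as errors, and the choice to stay silent until the window is full are assumptions for illustration; the defaults mirror the values in &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque


class ErrorRateMonitor:
    """Track the 5xx rate over the last N requests and rate-limit alerts."""

    def __init__(self, threshold_pct=2.0, window_size=200, cooldown_sec=300.0):
        self.threshold_pct = threshold_pct       # ERROR_RATE_THRESHOLD
        self.cooldown_sec = cooldown_sec         # ALERT_COOLDOWN_SEC
        self.window = deque(maxlen=window_size)  # WINDOW_SIZE; True marks a 5xx
        self.last_alert_at = None

    def record(self, status, now):
        """Record one response; return True when an alert should be sent."""
        self.window.append(status > 499)
        if len(self.window) != self.window.maxlen:
            return False  # assumption: stay quiet until the window is full
        rate_pct = 100.0 * sum(self.window) / len(self.window)
        cooling_down = (
            self.last_alert_at is not None
            and self.cooldown_sec > now - self.last_alert_at
        )
        if cooling_down or self.threshold_pct >= rate_pct:
            return False
        self.last_alert_at = now
        return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the defaults, an alert fires once more than 4 of the last 200 requests have failed, and no further alert is sent for the next 5 minutes.&lt;/p&gt;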



&lt;h3&gt;
  
  
  Step 4: Set Up Slack Webhook
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://api.slack.com/messaging/webhooks" rel="noopener noreferrer"&gt;Slack API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create a new Slack App&lt;/li&gt;
&lt;li&gt;Enable Incoming Webhooks&lt;/li&gt;
&lt;li&gt;Create a webhook for your channel&lt;/li&gt;
&lt;li&gt;Copy the webhook URL to &lt;code&gt;.env&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
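
&lt;p&gt;Once the webhook URL is in place, posting a message is a single JSON request with a &lt;code&gt;text&lt;/code&gt; field. The sketch below uses only the Python standard library; the function names and the timeout are illustrative choices rather than the project's actual watcher code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, text, timeout_sec=5.0):
    """Send the alert and return True when Slack answers with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    try:
        with urllib.request.urlopen(request, timeout=timeout_sec) as response:
            return response.status == 200
    except OSError:
        # Network failures should not crash the watcher loop
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping request construction separate from sending makes the payload easy to check without calling Slack.&lt;/p&gt;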

&lt;h3&gt;
  
  
  Step 5: Build and Start Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build all containers&lt;/span&gt;
docker compose build

&lt;span class="c"&gt;# Start all services in detached mode&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Verify all services are running&lt;/span&gt;
docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME              STATUS    PORTS
nginx_proxy       Up        0.0.0.0:8080-&amp;gt;80/tcp
app_blue          Up        3000/tcp
app_green         Up        3000/tcp
alert_watcher     Up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Verify the Deployment
&lt;/h3&gt;

&lt;p&gt;Open your browser and navigate to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8080/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"release"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blue-release-1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Understanding the Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Nginx Configuration Template
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;nginx.conf.template&lt;/code&gt; uses environment variable substitution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;app_blue:&lt;/span&gt;$&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="kn"&gt;PORT&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt; &lt;span class="s"&gt;max_fails=1&lt;/span&gt; &lt;span class="s"&gt;fail_timeout=5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;app_green:&lt;/span&gt;$&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="kn"&gt;PORT&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt; &lt;span class="s"&gt;max_fails=1&lt;/span&gt; &lt;span class="s"&gt;fail_timeout=5s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://app&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_next_upstream&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt; &lt;span class="s"&gt;timeout&lt;/span&gt; &lt;span class="s"&gt;http_502&lt;/span&gt; &lt;span class="s"&gt;http_503&lt;/span&gt; &lt;span class="s"&gt;http_504&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Timeouts for fast failure detection&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_connect_timeout&lt;/span&gt; &lt;span class="s"&gt;2s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_send_timeout&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Configuration Explained
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Proxy Next Upstream
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;proxy_next_upstream&lt;/span&gt; &lt;span class="s"&gt;error&lt;/span&gt; &lt;span class="s"&gt;timeout&lt;/span&gt; &lt;span class="s"&gt;http_502&lt;/span&gt; &lt;span class="s"&gt;http_503&lt;/span&gt; &lt;span class="s"&gt;http_504&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Automatically retry the request to the next upstream (Green) if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection error occurs&lt;/li&gt;
&lt;li&gt;Request times out&lt;/li&gt;
&lt;li&gt;Upstream returns 502, 503, or 504&lt;/li&gt;
&lt;/ul&gt;

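&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; As long as Green is healthy, the client does not see these errors. By default Nginx does not retry non-idempotent requests (&lt;code&gt;POST&lt;/code&gt;, &lt;code&gt;LOCK&lt;/code&gt;, &lt;code&gt;PATCH&lt;/code&gt;) that have already been sent to a server, so those can still surface a failure.&lt;/p&gt;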
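&lt;p&gt;The retry rule can be sketched as a toy model in Python. This is not Nginx code: &lt;code&gt;classify&lt;/code&gt; and &lt;code&gt;proxy&lt;/code&gt; are hypothetical names, and the outcomes are supplied by hand rather than measured. It only illustrates which results move the request on to the next server and which are returned to the client as-is.&lt;/p&gt;

```python
# Toy model of the proxy_next_upstream decision; illustrative only, not Nginx code.
RETRYABLE = {"error", "timeout", "http_502", "http_503", "http_504"}


def classify(outcome):
    """Map an upstream outcome to the proxy_next_upstream condition it matches, or None."""
    if outcome in ("error", "timeout"):
        return outcome
    if isinstance(outcome, int) and outcome in (502, 503, 504):
        return f"http_{outcome}"
    return None  # e.g. 200, 404 or 500: not listed, so not retried


def proxy(upstreams, outcomes):
    """Try servers in order and return (server, outcome) for the result the client sees."""
    last = None
    for server in upstreams:
        outcome = outcomes[server]  # assumed outcome for this server
        last = (server, outcome)
        if classify(outcome) not in RETRYABLE:
            return last
    return last  # every server failed, so the last failure is what remains


print(proxy(["app_blue", "app_green"], {"app_blue": "error", "app_green": 200}))
print(proxy(["app_blue", "app_green"], {"app_blue": 500, "app_green": 200}))
```

&lt;p&gt;In this model a plain 500 from Blue is returned to the client, because &lt;code&gt;http_500&lt;/code&gt; is not in the configured list.&lt;/p&gt;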
&lt;h4&gt;
  
  
  2. Aggressive Timeouts
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;proxy_connect_timeout&lt;/span&gt; &lt;span class="s"&gt;2s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;proxy_send_timeout&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why so short?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect failures &lt;strong&gt;fast&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Failover happens in &lt;strong&gt;seconds&lt;/strong&gt;, not minutes&lt;/li&gt;
&lt;li&gt;Better user experience during a failure, at the cost of treating any upstream pause longer than 3s as a failure&lt;/li&gt;
&lt;/ul&gt;
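&lt;p&gt;As a rough back-of-the-envelope check, the timeouts above bound how long a request can be stuck on Blue before Nginx gives up on it. The sketch below is only arithmetic under simplifying assumptions: the failure-mode names are made up for illustration, send time is ignored, and &lt;code&gt;proxy_read_timeout&lt;/code&gt; applies between two successive reads rather than to the whole response.&lt;/p&gt;

```python
# Values copied from the location block in this article.
PROXY_CONNECT_TIMEOUT = 2.0  # seconds allowed to establish a connection
PROXY_READ_TIMEOUT = 3.0     # seconds allowed between two successive reads


def added_delay(failure_mode):
    """Rough upper bound on seconds spent on Blue before Nginx stops waiting for it."""
    bounds = {
        # Connection is rejected immediately.
        "refused": 0.0,
        # Host never completes the connection.
        "connect_hang": PROXY_CONNECT_TIMEOUT,
        # Host accepts the connection, then goes silent.
        "read_hang": PROXY_CONNECT_TIMEOUT + PROXY_READ_TIMEOUT,
    }
    return bounds[failure_mode]


for mode in ("refused", "connect_hang", "read_hang"):
    print(f"{mode}: up to {added_delay(mode):.1f}s spent on Blue")
```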
&lt;h4&gt;
  
  
  3. Structured Logging
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;log_format&lt;/span&gt; &lt;span class="s"&gt;detailed&lt;/span&gt; 
    &lt;span class="s"&gt;'pool=&lt;/span&gt;&lt;span class="nv"&gt;$upstream_http_x_app_pool&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="s"&gt;'release=&lt;/span&gt;&lt;span class="nv"&gt;$upstream_http_x_release_id&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="s"&gt;'upstream_status=&lt;/span&gt;&lt;span class="nv"&gt;$upstream_status&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="s"&gt;'latency=&lt;/span&gt;&lt;span class="nv"&gt;$request_time&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;access_log&lt;/span&gt; &lt;span class="n"&gt;/var/log/nginx/access.log&lt;/span&gt; &lt;span class="s"&gt;detailed&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Log Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
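&lt;p&gt;A line in this format is straightforward to parse. The following is a minimal sketch, not the parser from &lt;code&gt;watcher.py&lt;/code&gt;: &lt;code&gt;FIELD_RE&lt;/code&gt; and &lt;code&gt;parse_log_line&lt;/code&gt; are hypothetical names. It assumes the four fields shown above and allows for &lt;code&gt;upstream_status&lt;/code&gt; holding several values (for example &lt;code&gt;502, 200&lt;/code&gt;) when a request was retried.&lt;/p&gt;

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line,
# so "upstream_status=502, 200" is kept together as one value.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)")


def parse_log_line(line):
    """Parse one access-log line in the 'detailed' format into a dict."""
    fields = dict(FIELD_RE.findall(line.strip()))
    raw_statuses = fields.get("upstream_status", "")
    # Nginx separates the statuses of successive attempts with commas or colons.
    statuses = [int(s) for s in re.split(r"[,:\s]+", raw_statuses) if s.isdigit()]
    return {
        "pool": fields.get("pool"),
        "release": fields.get("release"),
        "statuses": statuses,
        "latency": float(fields["latency"]) if "latency" in fields else None,
    }


print(parse_log_line("pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045"))
```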






&lt;h2&gt;
  
  
  How Failover Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Normal Operation (Blue Active)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request → Nginx → App Blue → Success (200 OK)
                       ↓
                   Nginx Logs: pool=blue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Failure Scenario
&lt;/h3&gt;

&lt;p&gt;Let's trace what happens when Blue crashes:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: User Makes Request
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → GET http://localhost:8080/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Nginx Tries Blue
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nginx → app_blue:3000
        ↓
    Connection Refused (Blue is down)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Nginx Detects Failure
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nginx marks app_blue as unavailable for 5s
(max_fails=1 reached within fail_timeout=5s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
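&lt;p&gt;The interaction between &lt;code&gt;max_fails&lt;/code&gt; and &lt;code&gt;fail_timeout&lt;/code&gt; can be sketched as a small state machine. This is a simplified model for intuition, not how Nginx is implemented; &lt;code&gt;UpstreamState&lt;/code&gt; is a hypothetical class and timestamps are passed in by hand.&lt;/p&gt;

```python
class UpstreamState:
    """Toy model of max_fails / fail_timeout for a single server; not Nginx code."""

    def __init__(self, max_fails=1, fail_timeout=5.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = []     # timestamps of recent failed attempts
        self.down_until = 0.0  # the server is skipped until this time

    def record_failure(self, now):
        # Only failures within the last fail_timeout seconds count.
        self.failures = [t for t in self.failures if t > now - self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            # Threshold reached: skip the server for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def is_available(self, now):
        return now >= self.down_until


blue = UpstreamState(max_fails=1, fail_timeout=5.0)
blue.record_failure(now=10.0)
print(blue.is_available(now=12.0))  # still inside the 5s window
print(blue.is_available(now=15.0))  # window over, Blue is tried again
```

&lt;p&gt;With &lt;code&gt;max_fails=1&lt;/code&gt;, a single failed attempt is enough to take Blue out of rotation, and it becomes eligible again once the 5s window has passed.&lt;/p&gt;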



&lt;h4&gt;
  
  
  Step 4: Nginx Retries to Green
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nginx → app_green:3000
        ↓
    Success! (200 OK)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 5: User Receives Response
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User ← 200 OK from Green
(User never knew Blue failed!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 6: Alert Watcher Detects Failover
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# watcher.py detects pool change
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_alerted_pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;send_slack_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failover detected: blue → green&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 7: Slack Alert Sent
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-Time Alerting System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Alert Watcher Architecture
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;watcher.py&lt;/code&gt; script monitors Nginx logs in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AlertWatcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Track current state
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ACTIVE_POOL&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_alerted_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ACTIVE_POOL&lt;/span&gt;

        &lt;span class="c1"&gt;# Rolling window for error rate
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WINDOW_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cooldown timers
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failover_alert_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_error_alert_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Log Tailing
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tail_log_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOG_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Go to end of file
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_log_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
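&lt;li&gt;Opens the log file in read mode&lt;/li&gt;
&lt;li&gt;Seeks to the end of the file (like &lt;code&gt;tail -f&lt;/code&gt;), so existing entries are skipped&lt;/li&gt;
&lt;li&gt;Polls for new lines, sleeping 0.1s whenever none are available&lt;/li&gt;
&lt;li&gt;Passes each new line to &lt;code&gt;process_log_line&lt;/code&gt; as soon as it is read&lt;/li&gt;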
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Failover Detection
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_failover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_alerted_pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# No change
&lt;/span&gt;
    &lt;span class="c1"&gt;# Determine alert type
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ACTIVE_POOL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recovery detected: back to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failover detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_alerted_pool&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

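    &lt;span class="c1"&gt;# Send alert, then remember the pool so the same change is not re-alerted on every log line
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_slack_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_alerted_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;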
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
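&lt;p&gt;The watcher keeps cooldown timers, which suggests alerts are rate-limited. How that is wired in is not shown here, so the following is only a sketch under assumptions: &lt;code&gt;FailoverDetector&lt;/code&gt;, the 300s default cooldown, and the injectable &lt;code&gt;clock&lt;/code&gt; are all hypothetical. It returns an alert title instead of sending anything, which keeps it easy to test.&lt;/p&gt;

```python
import time


class FailoverDetector:
    """Sketch of pool-change detection with a cooldown; names and defaults are assumptions."""

    def __init__(self, active_pool, cooldown_seconds=300.0, clock=time.monotonic):
        self.active_pool = active_pool
        self.last_alerted_pool = active_pool
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.last_alert_time = None

    def check(self, pool):
        """Return an alert title when the serving pool changes, else None."""
        if pool == self.last_alerted_pool:
            return None  # no change since the last alert
        now = self.clock()
        in_cooldown = (
            self.last_alert_time is not None
            and self.cooldown_seconds > now - self.last_alert_time
        )
        if in_cooldown:
            return None  # suppress flapping alerts
        if pool == self.active_pool:
            title = f"Recovery detected: back to {pool}"
        else:
            title = f"Failover detected: {self.last_alerted_pool} → {pool}"
        # Record the new state so the same change is reported only once.
        self.last_alerted_pool = pool
        self.last_alert_time = now
        return title


detector = FailoverDetector("blue")
print(detector.check("blue"))
print(detector.check("green"))
```

&lt;p&gt;One consequence of this design is that a recovery occurring inside the cooldown is not reported until a matching log line arrives after the cooldown has expired.&lt;/p&gt;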



&lt;h4&gt;
  
  
  3. Error Rate Monitoring
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_error_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;all_statuses&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if any status is 5xx
&lt;/span&gt;    &lt;span class="n"&gt;is_5xx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_statuses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add to sliding window
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_5xx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate error rate
&lt;/span&gt;    &lt;span class="n"&gt;error_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

    &lt;span class="c1"&gt;# Check threshold
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ERROR_RATE_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_slack_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High upstream error rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
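&lt;p&gt;One thing to watch with a sliding window is the warm-up period: while the window holds only a handful of requests, a single 5xx produces a very high percentage. A minimal sketch of one way to guard against that follows. &lt;code&gt;ErrorRateMonitor&lt;/code&gt; and &lt;code&gt;min_samples&lt;/code&gt; are assumptions for illustration; the window size of 200 and the 2% threshold mirror the example alert in this article.&lt;/p&gt;

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window 5xx rate with a minimum sample size; a sketch, not watcher.py."""

    def __init__(self, window_size=200, threshold_percent=2.0, min_samples=50):
        self.window = deque(maxlen=window_size)
        self.threshold_percent = threshold_percent
        self.min_samples = min_samples

    def record(self, statuses):
        """Record one request's upstream statuses.

        Returns the error rate in percent if it exceeds the threshold, else None.
        """
        # A request counts as an error if any upstream attempt returned 5xx.
        self.window.append(any(s >= 500 and 600 > s for s in statuses))
        if self.min_samples > len(self.window):
            return None  # not enough data yet to judge
        rate = 100.0 * sum(self.window) / len(self.window)
        return rate if rate > self.threshold_percent else None


monitor = ErrorRateMonitor(window_size=10, threshold_percent=20.0, min_samples=5)
for statuses in ([200], [200], [200], [200], [502, 200], [500]):
    print(statuses, monitor.record(statuses))
```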



&lt;h3&gt;
  
  
  Alert Types
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Failover Alert
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Primary instance fails, traffic switches to backup&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Recovery Alert
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recovery detected: back to blue (primary)
Release: blue-release-1.0.0
Upstream: 172.18.0.3:3000
Request time: 0.03s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Primary instance recovers, traffic returns to primary&lt;/p&gt;

&lt;h4&gt;
  
  
  3. High Error Rate Alert
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;High upstream error rate
5xx in upstream attempts: 5.50% over last 200 requests (threshold 2%).
Current pool: green
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When:&lt;/strong&gt; Error rate exceeds threshold in sliding window&lt;/p&gt;
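&lt;p&gt;Each of these alerts is ultimately a short text message posted to a Slack incoming webhook. The sketch below shows one way to assemble such a request with the standard library. &lt;code&gt;build_slack_request&lt;/code&gt; is a hypothetical helper, not the function from &lt;code&gt;watcher.py&lt;/code&gt;, and the URL is a placeholder. It builds the request without sending it.&lt;/p&gt;

```python
import json
import urllib.request


def build_slack_request(webhook_url, title, details):
    """Build (but do not send) a POST request carrying a Slack incoming-webhook payload."""
    # Title on the first line, then one "Key: value" line per detail.
    lines = [title] + [f"{key}: {value}" for key, value in details.items()]
    body = json.dumps({"text": "\n".join(lines)}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_slack_request(
    "https://hooks.slack.com/services/EXAMPLE",  # placeholder, not a real webhook
    "Failover detected: blue → green",
    {"Release": "green-release-1.0.0", "Request time": "0.05s"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

&lt;p&gt;To actually deliver the message, the request would be passed to &lt;code&gt;urllib.request.urlopen(request, timeout=5)&lt;/code&gt;, ideally inside a &lt;code&gt;try&lt;/code&gt; block so that a Slack outage does not crash the watcher.&lt;/p&gt;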




&lt;h2&gt;
  
  
  Testing the Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test 1: Verify Normal Operation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check which pool is active&lt;/span&gt;
curl http://localhost:8080/version

&lt;span class="c"&gt;# Expected output:&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"pool"&lt;/span&gt;: &lt;span class="s2"&gt;"blue"&lt;/span&gt;,
  &lt;span class="s2"&gt;"release"&lt;/span&gt;: &lt;span class="s2"&gt;"blue-release-1.0.0"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test 2: Trigger Failover
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Manual Failover Test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop Blue instance&lt;/span&gt;
docker compose stop app_blue

&lt;span class="c"&gt;# Make requests&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl http://localhost:8080/version
  &lt;span class="nb"&gt;sleep &lt;/span&gt;1
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# You should see responses from Green&lt;/span&gt;
&lt;span class="c"&gt;# Check Slack for failover alert&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test 3: Verify Zero Downtime
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In one terminal, continuously make requests&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/version | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.pool'&lt;/span&gt;
  &lt;span class="nb"&gt;sleep &lt;/span&gt;0.5
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# In another terminal, stop Blue&lt;/span&gt;
docker compose stop app_blue

&lt;span class="c"&gt;# Observe: No failed requests!&lt;/span&gt;
&lt;span class="c"&gt;# Output switches from "blue" to "green" seamlessly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
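&lt;p&gt;Rather than judging "no failed requests" by eye, the observed sequence can be summarized mechanically. The helper below is a hypothetical sketch: it assumes you have collected the pool name printed for each request in order, with &lt;code&gt;None&lt;/code&gt; standing in for any request that failed.&lt;/p&gt;

```python
def summarize(observations):
    """Summarize pool names seen per request; None marks a failed request."""
    failures = sum(1 for pool in observations if pool is None)
    transitions = []
    previous = None
    for pool in observations:
        if pool is None:
            continue  # failed requests do not tell us which pool served
        if previous is not None and pool != previous:
            transitions.append((previous, pool))
        previous = pool
    return {
        "total": len(observations),
        "failures": failures,
        "transitions": transitions,
    }


print(summarize(["blue", "blue", "green", "green"]))
```

&lt;p&gt;A clean failover shows zero failures and exactly one transition from &lt;code&gt;blue&lt;/code&gt; to &lt;code&gt;green&lt;/code&gt;.&lt;/p&gt;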



&lt;h3&gt;
  
  
  Test 4: Test Recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Restart Blue instance&lt;/span&gt;
docker compose start app_blue

&lt;span class="c"&gt;# Wait for health check to pass (10-15 seconds)&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;15

&lt;span class="c"&gt;# Make requests&lt;/span&gt;
curl http://localhost:8080/version

&lt;span class="c"&gt;# Should show Blue is active again&lt;/span&gt;
&lt;span class="c"&gt;# Check Slack for recovery alert&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test 5: View Nginx Logs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View real-time logs&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;nginx_proxy &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/nginx/access.log

&lt;span class="c"&gt;# Example output:&lt;/span&gt;
&lt;span class="nv"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blue &lt;span class="nv"&gt;release&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blue-release-1.0.0 &lt;span class="nv"&gt;upstream_status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200 &lt;span class="nv"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.023
&lt;span class="nv"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;green &lt;span class="nv"&gt;release&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;green-release-1.0.0 &lt;span class="nv"&gt;upstream_status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;502, 200 &lt;span class="nv"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.045
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Common Issues and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: Failover Not Happening
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Blue fails but traffic doesn't switch to Green&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Nginx config&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;nginx_proxy &lt;span class="nb"&gt;cat&lt;/span&gt; /etc/nginx/nginx.conf.processed

&lt;span class="c"&gt;# Restart Nginx&lt;/span&gt;
docker compose restart nginx_proxy

&lt;span class="c"&gt;# Verify Green is healthy&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;app_green wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; http://localhost:3000/healthz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue 2: Alerts Not Sending to Slack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Failover happens but no Slack notification&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify Webhook URL:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check .env file&lt;/span&gt;
&lt;span class="nb"&gt;grep &lt;/span&gt;SLACK_WEBHOOK_URL .env

&lt;span class="c"&gt;# Test webhook manually&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST YOUR_WEBHOOK_URL &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"text":"Test alert"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Restart Watcher:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose restart alert_watcher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue 3: Both Instances Receiving Traffic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Requests alternate between Blue and Green&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ensure Green has "backup" directive&lt;/span&gt;
&lt;span class="c"&gt;# Check nginx.conf.template&lt;/span&gt;
&lt;span class="c"&gt;# Rebuild and restart&lt;/span&gt;
docker compose down
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Debugging Commands Cheat Sheet
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View all container status&lt;/span&gt;
docker compose ps

&lt;span class="c"&gt;# View all logs&lt;/span&gt;
docker compose logs

&lt;span class="c"&gt;# View specific service logs&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; nginx_proxy

&lt;span class="c"&gt;# Execute command in container&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;nginx_proxy sh

&lt;span class="c"&gt;# Restart specific service&lt;/span&gt;
docker compose restart app_blue

&lt;span class="c"&gt;# Clean up everything&lt;/span&gt;
docker compose down &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What You've Learned
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Blue-Green Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to implement zero-downtime deployments&lt;/li&gt;
&lt;li&gt;Difference between blue-green and load balancing&lt;/li&gt;
&lt;li&gt;When to use blue-green vs. other strategies&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nginx as a Reverse Proxy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upstream configuration&lt;/li&gt;
&lt;li&gt;Failover mechanisms&lt;/li&gt;
&lt;li&gt;Health checks and timeouts&lt;/li&gt;
&lt;li&gt;Structured logging&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log tailing and parsing&lt;/li&gt;
&lt;li&gt;Event detection&lt;/li&gt;
&lt;li&gt;Sliding window calculations&lt;/li&gt;
&lt;li&gt;Alert deduplication&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Slack Integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Webhook setup&lt;/li&gt;
&lt;li&gt;Alert formatting&lt;/li&gt;
&lt;li&gt;Error handling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Docker Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-container applications&lt;/li&gt;
&lt;li&gt;Service dependencies&lt;/li&gt;
&lt;li&gt;Volume management&lt;/li&gt;
&lt;li&gt;Health checks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;p&gt;Blue-green and closely related release strategies, such as canary deployments, are commonly associated with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Netflix&lt;/strong&gt; - Canary deployments with instant rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon&lt;/strong&gt; - Blue-green for critical services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heroku&lt;/strong&gt; - Platform-level blue-green deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; - Zero-downtime deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhance the System&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add database with replication&lt;/li&gt;
&lt;li&gt;Implement canary deployments&lt;/li&gt;
&lt;li&gt;Add A/B testing capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improve Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add Prometheus metrics&lt;/li&gt;
&lt;li&gt;Create Grafana dashboards&lt;/li&gt;
&lt;li&gt;Implement distributed tracing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scale the Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy to Kubernetes&lt;/li&gt;
&lt;li&gt;Use managed load balancers&lt;/li&gt;
&lt;li&gt;Implement auto-scaling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://nginx.org/en/docs/http/ngx_http_upstream_module.html" rel="noopener noreferrer"&gt;Nginx Upstream Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api.slack.com/messaging/webhooks" rel="noopener noreferrer"&gt;Slack Webhooks Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/bliki/BlueGreenDeployment.html" rel="noopener noreferrer"&gt;Blue-Green Deployment Pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://nginx.org/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; - High-performance web server&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; - Containerization platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://slack.com/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; - Team communication&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations! You've just learned how to implement a &lt;strong&gt;production-ready blue-green deployment system&lt;/strong&gt; with automatic failover and real-time alerting. This is a critical skill for modern DevOps engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test thoroughly&lt;/strong&gt; before deploying to production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor continuously&lt;/strong&gt; - you can't fix what you can't see&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate everything&lt;/strong&gt; - manual processes lead to errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document your decisions&lt;/strong&gt; - future you will thank you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blue-green deployment is just one piece of the DevOps puzzle. The principles you've learned here - automation, monitoring, resilience, and rapid recovery - apply to all aspects of modern infrastructure.&lt;/p&gt;

&lt;p&gt;Keep experimenting, keep learning, and most importantly, keep building!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Project Repository:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/hng13-stage-3-devops" rel="noopener noreferrer"&gt;HNG DevOps on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy deploying!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>architecture</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Complete Beginner's Guide to Building a Microservices TODO Application with Docker and Traefik</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Tue, 09 Dec 2025 14:35:15 +0000</pubDate>
      <link>https://forem.com/cypher682/complete-beginners-guide-to-building-a-microservices-todo-application-with-docker-and-traefik-5gok</link>
      <guid>https://forem.com/cypher682/complete-beginners-guide-to-building-a-microservices-todo-application-with-docker-and-traefik-5gok</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome! This guide will walk you through a real-world microservices application - a TODO app that demonstrates how different services written in different programming languages can work together seamlessly. If you're new to DevOps or microservices, don't worry - I'll explain everything step by step!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/DevOps-Stage-6" rel="noopener noreferrer"&gt;HNG DevOps on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How microservices architecture works in practice&lt;/li&gt;
&lt;li&gt;How to use Docker to containerize multiple services&lt;/li&gt;
&lt;li&gt;How to set up Traefik as a reverse proxy&lt;/li&gt;
&lt;li&gt;How different programming languages work together&lt;/li&gt;
&lt;li&gt;How services communicate with each other&lt;/li&gt;
&lt;li&gt;How to implement authentication across microservices&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is a Microservices Architecture?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Traditional Way (Monolithic)
&lt;/h3&gt;

&lt;p&gt;Imagine building a house where everything - kitchen, bedroom, bathroom - is one giant room. If you want to renovate the kitchen, you might affect the bedroom too. That's a &lt;strong&gt;monolithic application&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Microservices Way
&lt;/h3&gt;

&lt;p&gt;Now imagine a house where each room is a separate, self-contained unit. You can renovate the kitchen without touching the bedroom. Each room has its own entrance and can be accessed independently. That's &lt;strong&gt;microservices architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Microservices?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Independent Development&lt;/strong&gt;: Different teams can work on different services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technology Flexibility&lt;/strong&gt;: Use the best language/framework for each service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Scale only the services that need it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt;: If one service fails, others keep running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier Maintenance&lt;/strong&gt;: Smaller codebases are easier to understand&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;This project is a &lt;strong&gt;TODO application&lt;/strong&gt; that allows users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in with existing test credentials (Authentication)&lt;/li&gt;
&lt;li&gt;Create, read, update, and delete TODO items&lt;/li&gt;
&lt;li&gt;View user profiles&lt;/li&gt;
&lt;li&gt;Track operations through logging&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  The Magic Behind It
&lt;/h3&gt;

&lt;p&gt;The application is split into &lt;strong&gt;5 independent microservices&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt; (Vue.js) - The user interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth API&lt;/strong&gt; (Go) - Handles user authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TODOs API&lt;/strong&gt; (Node.js) - Manages TODO items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users API&lt;/strong&gt; (Java/Spring Boot) - Manages user profiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Message Processor&lt;/strong&gt; (Python) - Processes and logs operations&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Understanding the Components
&lt;/h2&gt;

&lt;p&gt;Let's break down each component in simple terms:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Frontend (Vue.js - JavaScript)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; This is what users see and interact with - the web interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The cashier at a restaurant who takes your order and brings you food.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Vue.js?&lt;/strong&gt; It's a modern, lightweight JavaScript framework perfect for building interactive user interfaces.&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Auth API (Go)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Handles user authentication and generates JWT (JSON Web Tokens) for secure access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The security guard who checks your ID and gives you a badge to enter the building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How JWT Works:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. User logs in with username/password
2. Auth API verifies credentials
3. Auth API generates a JWT token (like a temporary pass)
4. User includes this token in future requests
5. Other services verify the token to confirm identity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
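&lt;p&gt;To make steps 3 and 5 concrete, here is a minimal Python sketch of signing and verifying an HS256 token using only the standard library. It is an illustration, not the Go code the Auth API actually runs: the function names and the &lt;code&gt;username&lt;/code&gt; claim are assumptions, and a real service should use a maintained JWT library and also check the &lt;code&gt;alg&lt;/code&gt; header and token expiry.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json


def b64url(data):
    """Base64url-encode bytes without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Decode base64url text, restoring the padding that JWT strips."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def _signature(signing_input, secret):
    """HMAC-SHA256 over 'header.payload' with the shared secret."""
    return hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()


def sign_token(claims, secret):
    """Build an HS256-signed token of the form header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + b64url(_signature(signing_input, secret))


def verify_token(token, secret):
    """Return the claims if the signature matches, otherwise None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    signing_input = parts[0] + "." + parts[1]
    try:
        provided = b64url_decode(parts[2])
    except ValueError:
        return None
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(_signature(signing_input, secret), provided):
        return None
    return json.loads(b64url_decode(parts[1]))
```

&lt;p&gt;Because every service shares the same secret, any of them can call something like &lt;code&gt;verify_token&lt;/code&gt; on an incoming request without contacting the Auth API again.&lt;/p&gt;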



&lt;p&gt;&lt;strong&gt;Why Go?&lt;/strong&gt; Go is fast, efficient, and excellent for building high-performance APIs.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. TODOs API (Node.js)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Manages all TODO operations - create, read, update, delete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The kitchen that prepares your food orders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis Integration:&lt;/strong&gt;&lt;br&gt;
When you create or delete a TODO, the API publishes a message to Redis (an in-memory data store used here as a lightweight message queue), which the Log Message Processor picks up and logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Node.js?&lt;/strong&gt; Node.js is great for I/O-heavy operations and has a rich ecosystem of packages.&lt;/p&gt;


&lt;h3&gt;
  
  
  4. Users API (Java/Spring Boot)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Manages user profile information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The HR department that maintains employee records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Java/Spring Boot?&lt;/strong&gt; Spring Boot is enterprise-grade, robust, and widely used in production environments.&lt;/p&gt;


&lt;h3&gt;
  
  
  5. Log Message Processor (Python)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Listens to the Redis queue and logs TODO operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The security camera system that records all activities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Python?&lt;/strong&gt; Python is simple, readable, and perfect for quick scripting tasks.&lt;/p&gt;
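&lt;p&gt;As a rough sketch of what "logs TODO operations" can look like, the function below turns one JSON message into a log line. The field names (&lt;code&gt;operation&lt;/code&gt;, &lt;code&gt;username&lt;/code&gt;, &lt;code&gt;todo_id&lt;/code&gt;) are hypothetical, since the real message schema is not shown here, and the Redis subscription loop that would feed it is left out so the example runs on its own.&lt;/p&gt;

```python
import json


def format_log_line(raw_message):
    """Turn one JSON message from the channel into a readable log line."""
    try:
        event = json.loads(raw_message)
    except (TypeError, ValueError):
        return "skipped malformed message: " + repr(raw_message)
    if not isinstance(event, dict):
        return "skipped malformed message: " + repr(raw_message)
    # Hypothetical field names; the real message schema may differ.
    operation = event.get("operation", "unknown")
    username = event.get("username", "unknown")
    todo_id = event.get("todo_id", "unknown")
    return f"operation={operation} user={username} todo_id={todo_id}"


def process_messages(messages):
    """Format a batch of messages; a real processor would read them from Redis."""
    return [format_log_line(message) for message in messages]
```

&lt;p&gt;Returning a line for malformed input, rather than raising, keeps one bad message from stopping the processor.&lt;/p&gt;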


&lt;h3&gt;
  
  
  6. Traefik (Reverse Proxy)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Routes incoming requests to the correct service and handles SSL/TLS certificates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it as:&lt;/strong&gt; The receptionist who directs visitors to the right department.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic service discovery&lt;/li&gt;
&lt;li&gt;SSL/TLS certificate management (Let's Encrypt)&lt;/li&gt;
&lt;li&gt;HTTP to HTTPS redirection&lt;/li&gt;
&lt;li&gt;Path-based routing&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Technology Stack
&lt;/h2&gt;

&lt;p&gt;Here's a summary of all technologies used:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vue.js&lt;/td&gt;
&lt;td&gt;User interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Authentication service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TODOs API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Node.js&lt;/td&gt;
&lt;td&gt;TODO management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Users API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Java/Spring Boot&lt;/td&gt;
&lt;td&gt;User profile management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Processor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Logging service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Message Queue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Inter-service communication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reverse Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traefik&lt;/td&gt;
&lt;td&gt;Request routing &amp;amp; SSL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containerization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Package services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;td&gt;Manage multiple containers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Visual Architecture
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Internet
                       |
                       v
              +----------------+
              |    Traefik     |
              | (Port 80/443)  |
              +----------------+
                       |
    +------------------+------------------+
    |                  |                  |
    v                  v                  v
+----------+    +----------+    +----------+
| Frontend |    | Auth API |    |TODOs API |
| (Vue.js) |    |   (Go)   |    | (Node.js)|
+----------+    +----------+    +----------+
    |                  |                  |
    |                  v                  |
    |          +-------------+            |
    +---------&amp;gt;|  Users API  |&amp;lt;-----------+
               | (Java)      |
               +-------------+


+-------------+              +-------------+
| Log Message |&amp;lt;--Redis------|  TODOs API  |
| Processor   |              +-------------+
| (Python)    |
+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Request Flow Example
&lt;/h3&gt;

&lt;p&gt;Let's trace what happens when a user creates a TODO:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User Action&lt;/strong&gt;: User fills out the TODO form and clicks "Create"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Sends POST request to &lt;code&gt;https://example.com/api/todos&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traefik&lt;/strong&gt;: Receives request, checks routing rules, forwards to TODOs API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TODOs API&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;Validates JWT token&lt;/li&gt;
&lt;li&gt;Creates the TODO in its data store&lt;/li&gt;
&lt;li&gt;Publishes "TODO created" message to Redis&lt;/li&gt;
&lt;li&gt;Returns success response&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt;: Stores message in queue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Message Processor&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;Reads message from Redis&lt;/li&gt;
&lt;li&gt;Logs the operation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Receives success response, updates UI&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Setting Up the Project
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; installed (&lt;a href="https://www.docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;Download here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; installed (usually comes with Docker Desktop)&lt;/li&gt;
&lt;li&gt;Basic command line knowledge&lt;/li&gt;
&lt;li&gt;A text editor (VS Code, Sublime, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 1: Clone the Repository
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cypher682/DevOps-Stage-6.git
&lt;span class="nb"&gt;cd &lt;/span&gt;DevOps-Stage-6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Understand the Project Structure
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DevOps-Stage-6/
├── frontend/              # Vue.js application
│   ├── Dockerfile        # Instructions to build frontend container
│   └── src/              # Source code
├── auth-api/             # Go authentication service
│   ├── Dockerfile
│   └── main.go
├── todos-api/            # Node.js TODO service
│   ├── Dockerfile
│   └── server.js
├── users-api/            # Java Spring Boot service
│   ├── Dockerfile
│   └── src/
├── log-message-processor/ # Python logging service
│   ├── Dockerfile
│   └── processor.py
├── docker-compose.yml    # Orchestration configuration
└── .env                  # Environment variables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Configure Environment Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file contains configuration for all services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# User credentials (for testing)
USER1_USERNAME=user1
USER1_PASSWORD=password1

# JWT Secret (for token generation)
JWT_SECRET=myfancysecret

# Redis Configuration
REDIS_HOST=redis-queue
REDIS_PORT=6379
REDIS_CHANNEL=log_channel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; In production, never commit &lt;code&gt;.env&lt;/code&gt; files with real credentials!&lt;/p&gt;
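&lt;p&gt;If you want to sanity-check the file before starting the stack, a small script along these lines can help. It is a simplified sketch: it handles only plain &lt;code&gt;KEY=VALUE&lt;/code&gt; lines and comments (not quoting or variable expansion), and the list of required keys is an assumption based on the sample above.&lt;/p&gt;

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Assumed list of keys the services need; adjust to match your own .env file.
REQUIRED_KEYS = ("JWT_SECRET", "REDIS_HOST", "REDIS_PORT", "REDIS_CHANNEL")


def missing_keys(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

&lt;p&gt;For example, &lt;code&gt;missing_keys(parse_env(open(".env").read()))&lt;/code&gt; should return an empty list when the file is complete.&lt;/p&gt;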

&lt;h3&gt;
  
  
  Step 4: Build and Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build all services&lt;/span&gt;
docker compose build

&lt;span class="c"&gt;# Start all services&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Check if all services are running&lt;/span&gt;
docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What &lt;code&gt;-d&lt;/code&gt; means:&lt;/strong&gt; Runs containers in "detached" mode (in the background).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Verify Services
&lt;/h3&gt;

&lt;p&gt;Check that all containers are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see all services with status "Up" or "Running":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;traefik&lt;/li&gt;
&lt;li&gt;frontend&lt;/li&gt;
&lt;li&gt;auth-api&lt;/li&gt;
&lt;li&gt;todos-api&lt;/li&gt;
&lt;li&gt;users-api&lt;/li&gt;
&lt;li&gt;log-message-processor&lt;/li&gt;
&lt;li&gt;redis-queue&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Understanding Docker Compose
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Docker Compose?
&lt;/h3&gt;

&lt;p&gt;Docker Compose is a tool for defining and running multi-container Docker applications. Instead of running each container manually, you define everything in a &lt;code&gt;docker-compose.yml&lt;/code&gt; file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Sections Explained
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Services Definition
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traefik&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;services:&lt;/code&gt; - Defines all containers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;traefik:&lt;/code&gt; - Service name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;image:&lt;/code&gt; - Docker image to use&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;container_name:&lt;/code&gt; - Name for the container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;restart:&lt;/code&gt; - Restart policy&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Port Mapping
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;80:80"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;443:443"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Format:&lt;/strong&gt; &lt;code&gt;HOST_PORT:CONTAINER_PORT&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maps ports 80 (HTTP) and 443 (HTTPS) on your computer to the same ports in the container&lt;/li&gt;
&lt;li&gt;Allows you to reach Traefik, and through it the other services, from your browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Networks
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;app-network&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All services on the same network can communicate with each other&lt;/li&gt;
&lt;li&gt;Services can reference each other by service name (e.g., &lt;code&gt;http://auth-api:8081&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Traefik Labels (Routing)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.enable=true"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.frontend.rule=Host(`example.com`)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.frontend.entrypoints=websecure"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
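&lt;li&gt;Tells Traefik to route requests for &lt;code&gt;example.com&lt;/code&gt; to this service&lt;/li&gt;
&lt;li&gt;Attaches the router to the &lt;code&gt;websecure&lt;/code&gt; entrypoint (by convention, HTTPS on port 443); automatic certificate generation also requires a certificate resolver, such as Let's Encrypt, to be configured in Traefik&lt;/li&gt;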
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How Services Communicate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Internal Communication (Service-to-Service)
&lt;/h3&gt;

&lt;p&gt;Services communicate using &lt;strong&gt;service names&lt;/strong&gt; as hostnames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In Frontend (Vue.js)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;AUTH_API_ADDRESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://auth-api:8081&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TODOS_API_ADDRESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://todos-api:8082&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
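&lt;li&gt;Docker Compose creates a network with built-in DNS, so services can find each other by name&lt;/li&gt;
&lt;li&gt;No need for hard-coded IP addresses&lt;/li&gt;
&lt;li&gt;These names resolve only inside the Docker network; requests from a user's browser go through Traefik instead&lt;/li&gt;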
&lt;/ul&gt;

&lt;h3&gt;
  
  
  External Communication (User-to-Service)
&lt;/h3&gt;

&lt;p&gt;Users access services through Traefik:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → https://example.com → Traefik → Frontend
User → https://example.com/api/auth → Traefik → Auth API
User → https://example.com/api/todos → Traefik → TODOs API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
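&lt;p&gt;One way to picture this path-based routing is as a prefix lookup. The sketch below is only a mental model, not how Traefik is configured (that happens through labels): the route table is an assumption taken from the three paths above, and picking the longest matching prefix mirrors Traefik's default behaviour of giving longer rules higher priority.&lt;/p&gt;

```python
# Assumed route table, derived from the example paths in this section.
ROUTES = [
    ("/api/auth", "auth-api"),
    ("/api/todos", "todos-api"),
    ("/", "frontend"),
]


def pick_service(path):
    """Return the service with the longest prefix matching the path, or None."""
    matches = [(prefix, service) for prefix, service in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    # The longest prefix wins, so /api/todos beats the catch-all / route.
    return max(matches, key=lambda match: len(match[0]))[1]
```

&lt;p&gt;With this table, &lt;code&gt;pick_service("/api/todos/42")&lt;/code&gt; resolves to &lt;code&gt;todos-api&lt;/code&gt;, while any other path falls through to the frontend.&lt;/p&gt;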



&lt;h3&gt;
  
  
  Authentication Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. User submits login form
   ↓
2. Frontend sends credentials to Auth API
   ↓
3. Auth API verifies with Users API
   ↓
4. Auth API generates JWT token
   ↓
5. Frontend stores token
   ↓
6. Frontend includes token in all future requests
   ↓
7. Each API validates token before processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Testing the Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Access the Frontend
&lt;/h3&gt;

&lt;p&gt;Open your browser and navigate to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Test Authentication
&lt;/h3&gt;

&lt;p&gt;Login with test credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Username: user1
Password: password1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Create a TODO
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click "Add TODO"&lt;/li&gt;
&lt;li&gt;Enter a title and description&lt;/li&gt;
&lt;li&gt;Click "Save"&lt;/li&gt;
&lt;li&gt;Check that the TODO appears in the list&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  4. Check Logs
&lt;/h3&gt;

&lt;p&gt;View logs from the Log Message Processor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; log-message-processor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see log entries for TODO creation.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Test API Endpoints Directly
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Get all users:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/api/users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Login (get JWT token):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/auth/login &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"username":"user1","password":"password1"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
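&lt;p&gt;If you would rather script the login call than type it, the same request can be built with Python's standard library. This sketch only constructs the request, mirroring the curl command above; the helper name is made up for illustration, and the shape of the response is not shown because it depends on the Auth API.&lt;/p&gt;

```python
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build (but do not send) the POST request that the curl command issues."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

&lt;p&gt;To actually send it, pass the result to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; while the stack is running.&lt;/p&gt;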






&lt;h2&gt;
  
  
  Common Issues and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: Containers Won't Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Containers show as "Exited" or "Restarting"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check logs for specific service&lt;/span&gt;
docker compose logs auth-api

&lt;span class="c"&gt;# Rebuild containers&lt;/span&gt;
docker compose down
docker compose build &lt;span class="nt"&gt;--no-cache&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue 2: Port Already in Use
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Error:&lt;/strong&gt; &lt;code&gt;bind: address already in use&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find what's using the port (Windows)&lt;/span&gt;
netstat &lt;span class="nt"&gt;-ano&lt;/span&gt; | findstr :80

&lt;span class="c"&gt;# Change port in docker-compose.yml&lt;/span&gt;
ports:
  - &lt;span class="s2"&gt;"8000:80"&lt;/span&gt;  &lt;span class="c"&gt;# Use port 8000 instead&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
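&lt;p&gt;For a check that works the same on Windows, macOS, and Linux, you can also probe the port from Python. This is a rough sketch with a hypothetical helper name: it simply tries to bind the port, so a permission error on a low port is reported the same way as a port that is genuinely taken.&lt;/p&gt;

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding host:port fails, meaning something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Covers "address already in use" and permission errors alike.
            return True
    return False
```

&lt;p&gt;Running &lt;code&gt;port_in_use(80)&lt;/code&gt; before &lt;code&gt;docker compose up&lt;/code&gt; tells you whether to remap the port first.&lt;/p&gt;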



&lt;h3&gt;
  
  
  Issue 3: Services Can't Communicate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Frontend can't reach APIs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify all services are on the same network&lt;/span&gt;
docker network inspect devops-stage-6_app-network

&lt;span class="c"&gt;# Restart networking&lt;/span&gt;
docker compose down
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Debugging Commands Cheat Sheet
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View all running containers&lt;/span&gt;
docker compose ps

&lt;span class="c"&gt;# View logs for all services&lt;/span&gt;
docker compose logs

&lt;span class="c"&gt;# View logs for specific service&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; frontend

&lt;span class="c"&gt;# Restart specific service&lt;/span&gt;
docker compose restart todos-api

&lt;span class="c"&gt;# Stop all services&lt;/span&gt;
docker compose down

&lt;span class="c"&gt;# View resource usage&lt;/span&gt;
docker stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What You've Learned
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Microservices Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to split an application into independent services&lt;/li&gt;
&lt;li&gt;Benefits of using different languages for different services&lt;/li&gt;
&lt;li&gt;How services communicate with each other&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Docker &amp;amp; Containerization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to containerize applications&lt;/li&gt;
&lt;li&gt;How to use Docker Compose for multi-container applications&lt;/li&gt;
&lt;li&gt;How to manage container networking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reverse Proxy with Traefik&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to route requests to different services&lt;/li&gt;
&lt;li&gt;How to automatically manage SSL certificates&lt;/li&gt;
&lt;li&gt;How to use labels for service discovery&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inter-Service Communication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous communication (HTTP/REST)&lt;/li&gt;
&lt;li&gt;Asynchronous communication (Message Queues)&lt;/li&gt;
&lt;li&gt;Authentication across services (JWT)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;p&gt;This architecture pattern is used by companies like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Netflix&lt;/strong&gt; - Hundreds of microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon&lt;/strong&gt; - Service-oriented architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uber&lt;/strong&gt; - Microservices for different features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spotify&lt;/strong&gt; - Independent teams, independent services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;To deepen your understanding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Modify the Application&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a new microservice (e.g., Comments API)&lt;/li&gt;
&lt;li&gt;Implement a database for persistent storage&lt;/li&gt;
&lt;li&gt;Add user registration functionality&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improve the Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add monitoring with Prometheus and Grafana&lt;/li&gt;
&lt;li&gt;Implement centralized logging&lt;/li&gt;
&lt;li&gt;Add CI/CD pipeline with GitHub Actions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Security Enhancements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement rate limiting&lt;/li&gt;
&lt;li&gt;Add API gateway&lt;/li&gt;
&lt;li&gt;Use secrets management&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/" rel="noopener noreferrer"&gt;Docker Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://doc.traefik.io/traefik/" rel="noopener noreferrer"&gt;Traefik Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://microservices.io/patterns/" rel="noopener noreferrer"&gt;Microservices Patterns&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;Docker Desktop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;Postman&lt;/a&gt; - API testing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.portainer.io/" rel="noopener noreferrer"&gt;Portainer&lt;/a&gt; - Docker management UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations! You've just learned how to build, deploy, and manage a complete microservices application. This project demonstrates real-world DevOps practices used by major tech companies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start small&lt;/strong&gt; - Don't try to build everything at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand each component&lt;/strong&gt; - Know what each service does&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experiment&lt;/strong&gt; - Break things and fix them (that's how you learn!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read logs&lt;/strong&gt; - They're your best friend when debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep learning&lt;/strong&gt; - DevOps is constantly evolving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The skills you've gained here - Docker, microservices, reverse proxies, and service orchestration - are highly valuable in the industry. Keep practicing, keep building, and most importantly, keep learning!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Project Repository:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/DevOps-Stage-6" rel="noopener noreferrer"&gt;HNG DevOps on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>beginners</category>
      <category>microservices</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building Your Own Virtual Private Cloud on Linux: A Deep Dive into Network Namespaces</title>
      <dc:creator>cypher682</dc:creator>
      <pubDate>Sun, 09 Nov 2025 17:07:03 +0000</pubDate>
      <link>https://forem.com/cypher682/building-your-own-virtual-private-cloud-on-linux-a-deep-dive-into-network-namespaces-1l3e</link>
      <guid>https://forem.com/cypher682/building-your-own-virtual-private-cloud-on-linux-a-deep-dive-into-network-namespaces-1l3e</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Why Build a VPC from Scratch?
&lt;/h2&gt;

&lt;p&gt;Amazon Web Services revolutionized cloud computing with Virtual Private Clouds (VPCs), allowing users to create isolated network environments in the cloud. But have you ever wondered &lt;strong&gt;how VPCs actually work under the hood&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;In this project, I recreated AWS VPC functionality on a single Linux machine using native networking primitives. No Docker, no Kubernetes—just pure Linux networking: &lt;strong&gt;network namespaces, bridges, veth pairs, and iptables&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How network isolation works at the kernel level&lt;/li&gt;
&lt;li&gt;Linux network namespaces as a building block for containers&lt;/li&gt;
&lt;li&gt;Bridging and routing fundamentals&lt;/li&gt;
&lt;li&gt;NAT implementation with iptables&lt;/li&gt;
&lt;li&gt;Building infrastructure automation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding cloud provider networking internals&lt;/strong&gt; - See how AWS/Azure/GCP implement VPCs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building custom network isolation&lt;/strong&gt; for multi-tenant systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps and infrastructure automation skills&lt;/strong&gt; - Create your own networking tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging complex network issues&lt;/strong&gt; - Deep knowledge of Linux networking stack&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;A VPC in AWS provides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolated network space&lt;/strong&gt; with your own IP range (CIDR)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subnets&lt;/strong&gt; that partition your VPC into smaller networks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet Gateway&lt;/strong&gt; for public subnet internet access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT Gateway&lt;/strong&gt; for private subnet outbound access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Groups&lt;/strong&gt; for firewall rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Peering&lt;/strong&gt; for cross-VPC communication&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  My Implementation Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│              Linux Host System              │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │  VPC1 (10.0.0.0/16)                   │  │
│  │                                       │  │
│  │  br-vpc1 (Linux Bridge = VPC Router)  │  │
│  │       │                   │           │  │
│  │   veth-pair           veth-pair       │  │
│  │       │                   │           │  │
│  │  ┌────▼─────┐        ┌────▼─────┐     │  │
│  │  │ Public   │        │ Private  │     │  │
│  │  │ Subnet   │        │ Subnet   │     │  │
│  │  │ Namespace│        │ Namespace│     │  │
│  │  │ 10.0.1.2 │        │ 10.0.2.2 │     │  │
│  │  └────┬─────┘        └────┬─────┘     │  │
│  │       │                   X           │  │
│  │ NAT (iptables)       No Internet      │  │
│  └───────┼───────────────────────────────┘  │
│          │                                  │
│        [eth0] ──► Internet                  │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Mapping
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AWS VPC Concept&lt;/th&gt;
&lt;th&gt;Linux Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC&lt;/td&gt;
&lt;td&gt;Linux Bridge (br-vpc1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnet&lt;/td&gt;
&lt;td&gt;Network Namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnet Connection&lt;/td&gt;
&lt;td&gt;veth pair&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet Gateway&lt;/td&gt;
&lt;td&gt;iptables NAT MASQUERADE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route Table&lt;/td&gt;
&lt;td&gt;ip route commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Group&lt;/td&gt;
&lt;td&gt;iptables INPUT rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Peering&lt;/td&gt;
&lt;td&gt;veth pair between bridges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Part 1: Understanding Network Namespaces
&lt;/h2&gt;

&lt;p&gt;Network namespaces are Linux's way of creating isolated network stacks. Each namespace has its own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network interfaces&lt;/li&gt;
&lt;li&gt;IP addresses&lt;/li&gt;
&lt;li&gt;Routing tables&lt;/li&gt;
&lt;li&gt;iptables rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt; This is the same technology Docker uses for container networking. Understanding this gives you deep insights into containerization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In vpcctl.py - SubnetManager.add()
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Create isolated namespace
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you create a namespace, the Linux kernel creates a &lt;strong&gt;completely separate network stack&lt;/strong&gt;. Processes inside can't see or access the host's network—perfect isolation!&lt;/p&gt;

&lt;h3&gt;
  
  
  Test It Yourself
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create namespace&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns add test-ns

&lt;span class="c"&gt;# List interfaces in host&lt;/span&gt;
ip &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# List interfaces in namespace (only loopback!)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;test-ns ip &lt;span class="nb"&gt;link&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice the namespace starts with only a loopback interface. It's completely isolated!&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Connecting Namespaces - The veth Pair Magic
&lt;/h2&gt;

&lt;p&gt;Network namespaces are isolated, so we need a way to connect them. Enter &lt;strong&gt;veth pairs&lt;/strong&gt;—virtual ethernet cables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conceptual Model
&lt;/h3&gt;

&lt;p&gt;Think of a veth pair as a virtual ethernet cable with two ends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One end plugs into the namespace&lt;/li&gt;
&lt;li&gt;Other end plugs into the host or bridge
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating the connection
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; type veth peer name &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth_ns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link set &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; master &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Connect to bridge
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link set &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth_ns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; netns &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Move to namespace
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why This Works:&lt;/strong&gt; Packets entering one end of the veth pair automatically come out the other end—like a wormhole for network traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  IP Address Assignment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;net_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NetworkUtils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_network_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Assign IP inside namespace
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ip addr add &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;net_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first_host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;net_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; dev &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth_ns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; The namespace gets the first address after the gateway (.2), and the bridge gets the gateway IP (.1). This loosely mirrors AWS, where the .1 address of each subnet is reserved for the VPC router.&lt;/p&gt;
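
&lt;p&gt;The addressing scheme can be sketched with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. This is a minimal stand-in, not the project's actual &lt;code&gt;NetworkUtils.get_network_info&lt;/code&gt;: the function name &lt;code&gt;sketch_network_info&lt;/code&gt; and the returned keys are assumptions inferred from the snippets in this article.&lt;/p&gt;

```python
import ipaddress


def sketch_network_info(cidr):
    """Hypothetical helper: derive gateway, first host, and prefix from a CIDR."""
    network = ipaddress.ip_network(cidr, strict=True)
    hosts = iter(network.hosts())
    gateway = next(hosts)      # first usable address: assigned to the bridge
    first_host = next(hosts)   # second usable address: assigned in the namespace
    return {
        "gateway": str(gateway),
        "first_host": str(first_host),
        "prefix": network.prefixlen,
    }


info = sketch_network_info("10.0.1.0/24")
print(info)
```

&lt;p&gt;For &lt;code&gt;10.0.1.0/24&lt;/code&gt; this yields a gateway of 10.0.1.1 and a first host of 10.0.1.2, matching the addresses in the diagram.&lt;/p&gt;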




&lt;h2&gt;
  
  
  Part 3: The Bridge - Your VPC Router
&lt;/h2&gt;

&lt;p&gt;A Linux bridge is like a virtual network switch. It forwards packets between connected interfaces at Layer 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bridge Creation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bridge_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;br-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; type bridge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip addr add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;net_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gateway&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;net_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; dev &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link set &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Critical Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Disable bridge netfilter - allows direct L2 forwarding
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sysctl -w net.bridge.bridge-nf-call-iptables=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt; When the &lt;code&gt;br_netfilter&lt;/code&gt; kernel module is loaded, Linux passes bridged traffic through iptables by default. For intra-VPC communication, we want direct Layer 2 switching for performance, just like a real network switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routing Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside namespace: route everything through bridge gateway
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ip route add default via &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gateway_ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the bridge's IP address the &lt;strong&gt;default gateway&lt;/strong&gt;: any traffic a namespace sends to a destination outside its own subnet goes to the bridge first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: NAT Gateway - Internet Access
&lt;/h2&gt;

&lt;p&gt;Private networks use RFC 1918 addresses (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16) that aren't routable on the internet. NAT (Network Address Translation) solves this by rewriting the source address of outbound packets to a public one.&lt;/p&gt;
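
&lt;p&gt;A quick way to check whether an address falls inside one of the RFC 1918 blocks is Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. The sketch below is illustrative only; &lt;code&gt;is_rfc1918&lt;/code&gt; is a hypothetical helper and not part of &lt;code&gt;vpcctl.py&lt;/code&gt;.&lt;/p&gt;

```python
import ipaddress

# The three private blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address belongs to an RFC 1918 private block."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


for candidate in ["10.0.1.2", "172.31.255.1", "172.32.0.1", "8.8.8.8"]:
    print(candidate, is_rfc1918(candidate))
```

&lt;p&gt;Note that the middle block is a /12, so it covers 172.16.x.x through 172.31.x.x, while 172.32.0.1 is already a public address.&lt;/p&gt;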

&lt;h3&gt;
  
  
  NAT Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable IP forwarding (routing between interfaces)
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sysctl -w net.ipv4.ip_forward=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# MASQUERADE: Replace source IP with host's public IP
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables -t nat -A POSTROUTING -s &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -j MASQUERADE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Allow forwarding from VPC to internet
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables -A FORWARD -s &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -i &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -j ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Allow return traffic
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables -A FORWARD -d &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -i &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m state --state RELATED,ESTABLISHED -j ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How MASQUERADE Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Packet leaves namespace with source IP &lt;code&gt;10.0.1.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reaches host via bridge&lt;/li&gt;
&lt;li&gt;iptables MASQUERADE rewrites source to host's public IP&lt;/li&gt;
&lt;li&gt;Internet sees request from host, not internal IP&lt;/li&gt;
&lt;li&gt;Response comes back, iptables rewrites destination back to &lt;code&gt;10.0.1.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Packet forwarded to namespace&lt;/li&gt;
&lt;/ol&gt;
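
&lt;p&gt;The steps above can be modelled with a toy translation table. This is a deliberately simplified sketch, not how the kernel's connection tracking is implemented: the class name &lt;code&gt;ToyMasquerade&lt;/code&gt;, the starting port, and the documentation-range addresses (203.0.113.10 and 198.51.100.7) are all assumptions for illustration.&lt;/p&gt;

```python
class ToyMasquerade:
    """Toy model of source NAT; real conntrack tracks far more state."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.mappings = {}  # public port: (private ip, private port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        # Simplification: always allocate a fresh public port. The real
        # MASQUERADE target usually keeps the original port when it is free.
        public_port = self.next_port
        self.next_port += 1
        self.mappings[public_port] = (src_ip, src_port)
        return (self.public_ip, public_port, dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        original = self.mappings.get(dst_port)
        if original is None:
            return None  # no matching connection, nothing to translate
        private_ip, private_port = original
        return (src_ip, src_port, private_ip, private_port)


nat = ToyMasquerade("203.0.113.10")
sent = nat.outbound("10.0.1.2", 51000, "198.51.100.7", 80)
reply = nat.inbound("198.51.100.7", 80, sent[0], sent[1])
print(sent)
print(reply)
```

&lt;p&gt;The outbound packet leaves with the host's address as its source, and the reply is mapped back to 10.0.1.2 only because a matching entry exists in the table. Unsolicited inbound packets have no entry and are not translated.&lt;/p&gt;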

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; We only NAT &lt;strong&gt;public&lt;/strong&gt; subnets. Private subnets remain isolated—they can reach other subnets within the VPC but not the internet.&lt;/p&gt;
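
&lt;p&gt;One way to express the public-only rule in code is to build the iptables commands conditionally. The helper below is a hypothetical sketch: &lt;code&gt;nat_commands&lt;/code&gt; and the &lt;code&gt;subnet_type&lt;/code&gt; parameter are assumed names, and the command strings are the ones shown in the NAT implementation above. It only returns the strings, so it can be inspected without root privileges.&lt;/p&gt;

```python
def nat_commands(cidr, bridge, interface, subnet_type):
    """Hypothetical helper: return the iptables commands for one subnet."""
    if subnet_type != "public":
        return []  # private subnets get no NAT rule, so no internet access
    return [
        f"iptables -t nat -A POSTROUTING -s {cidr} -o {interface} -j MASQUERADE",
        f"iptables -A FORWARD -s {cidr} -i {bridge} -o {interface} -j ACCEPT",
        f"iptables -A FORWARD -d {cidr} -i {interface} -o {bridge} "
        "-m state --state RELATED,ESTABLISHED -j ACCEPT",
    ]


public_rules = nat_commands("10.0.1.0/24", "br-vpc1", "eth0", "public")
private_rules = nat_commands("10.0.2.0/24", "br-vpc1", "eth0", "private")
for command in public_rules:
    print(command)
print(private_rules)
```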




&lt;h2&gt;
  
  
  Part 5: VPC Isolation &amp;amp; Peering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Default Isolation
&lt;/h3&gt;

&lt;p&gt;Without configuration, VPCs can't communicate. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each VPC has its own bridge&lt;/li&gt;
&lt;li&gt;No routes exist between bridges&lt;/li&gt;
&lt;li&gt;The iptables FORWARD policy is often DROP (the kernel default is ACCEPT, but Docker, for example, sets it to DROP)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Testing Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From vpc1 namespace, try to reach vpc2&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping 172.16.1.2
&lt;span class="c"&gt;# Result: Network unreachable (no route to 172.16.0.0/16)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  VPC Peering Implementation
&lt;/h3&gt;

&lt;p&gt;The tricky part: packets from a namespace need to reach another VPC's bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create veth pair between bridges
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; type veth peer name &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link set &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; master &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bridge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip link set &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;veth2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; master &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bridge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CRITICAL: Add routes in EACH namespace
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vpc1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subnets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ns_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ip route add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cidr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;via &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gateway&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Namespace sends packet to VPC2 CIDR&lt;/li&gt;
&lt;li&gt;Route points to its own gateway (bridge)&lt;/li&gt;
&lt;li&gt;Bridge forwards to peering veth pair&lt;/li&gt;
&lt;li&gt;Packet arrives at VPC2's bridge&lt;/li&gt;
&lt;li&gt;VPC2 bridge forwards to destination namespace&lt;/li&gt;
&lt;/ol&gt;
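
&lt;p&gt;Because replies need a path back, each side of the peering needs a route to the other side's CIDR. The sketch below generates the route commands for both directions. It assumes the VPC dictionaries have the &lt;code&gt;cidr&lt;/code&gt;, &lt;code&gt;gateway&lt;/code&gt;, and &lt;code&gt;subnets&lt;/code&gt; keys seen in the snippet above; the function name &lt;code&gt;peering_route_commands&lt;/code&gt; and the example gateway addresses are placeholders.&lt;/p&gt;

```python
def peering_route_commands(vpc_a, vpc_b):
    """Hypothetical helper: route commands for every namespace on both sides."""
    commands = []
    for local, remote in ((vpc_a, vpc_b), (vpc_b, vpc_a)):
        for subnet in local["subnets"]:
            commands.append(
                f"ip netns exec {subnet['namespace']} "
                f"ip route add {remote['cidr']} via {local['gateway']}"
            )
    return commands


# Example values only; the gateway addresses are assumptions.
vpc1 = {
    "cidr": "10.0.0.0/16",
    "gateway": "10.0.0.1",
    "subnets": [{"namespace": "vpc1-public"}, {"namespace": "vpc1-private"}],
}
vpc2 = {
    "cidr": "172.16.0.0/16",
    "gateway": "172.16.0.1",
    "subnets": [{"namespace": "vpc2-public"}],
}

route_commands = peering_route_commands(vpc1, vpc2)
for command in route_commands:
    print(command)
```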

&lt;p&gt;&lt;strong&gt;Common Mistake I Made:&lt;/strong&gt; Initially, I added routes on the host routing table. This doesn't work because packets originate from &lt;strong&gt;inside namespaces&lt;/strong&gt;, which have their own routing tables!&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: Firewall Rules (Security Groups)
&lt;/h2&gt;

&lt;p&gt;AWS Security Groups are stateful firewalls. We simulate this with iptables inside namespaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON Policy Design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subnet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ingress"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deny"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Essential: Allow established connections (stateful behavior)
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iptables -A INPUT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m state --state ESTABLISHED,RELATED -j ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply custom rules
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;protocol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iptables -A INPUT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-p &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --dport &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -j &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why ESTABLISHED,RELATED Matters:&lt;/strong&gt; Without this, responses to outbound connections would be blocked. The stateful rule tracks connections and allows return traffic automatically.&lt;/p&gt;
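
&lt;p&gt;To make the rule ordering concrete, here is a minimal sketch that turns the policy document shown earlier into an ordered list of commands. The function name &lt;code&gt;policy_to_rules&lt;/code&gt; is hypothetical and not part of vpcctl; it only builds strings, so it can be run without root.&lt;/p&gt;

```python
def policy_to_rules(ns_name, policy):
    """Translate a policy dict into an ordered list of iptables commands."""
    prefix = f"ip netns exec {ns_name} iptables -A INPUT"
    # The stateful rule goes first: iptables stops at the first match, so
    # reply traffic is accepted before any later DROP rule is evaluated.
    commands = [f"{prefix} -m state --state ESTABLISHED,RELATED -j ACCEPT"]
    for rule in policy.get("ingress", []):
        target = "ACCEPT" if rule["action"] == "allow" else "DROP"
        commands.append(
            f"{prefix} -p {rule['protocol']} --dport {rule['port']} -j {target}"
        )
    return commands


# Same shape as the JSON policy in the article.
policy = {
    "subnet": "10.0.1.0/24",
    "ingress": [
        {"port": 80, "protocol": "tcp", "action": "allow"},
        {"port": 22, "protocol": "tcp", "action": "deny"},
    ],
}

for command in policy_to_rules("vpc1-public", policy):
    print(command)
```

&lt;p&gt;Each resulting string could then be passed to &lt;code&gt;run_command&lt;/code&gt;, as in the implementation above.&lt;/p&gt;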




&lt;h2&gt;
  
  
  The CLI Tool: vpcctl
&lt;/h2&gt;

&lt;p&gt;I built a Python CLI tool to automate all VPC operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cypher682/hng13-stage-4-devops.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hng13-stage-4-devops
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x vpcctl.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage Examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Create VPC:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc create &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add Subnets:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Public subnet&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.1.0/24 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; public

&lt;span class="c"&gt;# Private subnet&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; private &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.2.0/24 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; private
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
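
&lt;p&gt;A subnet only makes sense if its CIDR falls inside the VPC range and does not collide with a sibling. The following is a sketch of how such a check could look using Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. The helper &lt;code&gt;validate_subnet&lt;/code&gt; is an assumption for illustration; I am not claiming vpcctl validates input this way.&lt;/p&gt;

```python
import ipaddress


def validate_subnet(vpc_cidr, subnet_cidr, existing_subnets):
    """Return an error string, or None if the subnet is acceptable."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    # The subnet must be fully contained in the VPC range.
    if not subnet.subnet_of(vpc):
        return f"{subnet} is outside VPC range {vpc}"
    # It must not share addresses with any subnet already in the VPC.
    for other in existing_subnets:
        if subnet.overlaps(ipaddress.ip_network(other)):
            return f"{subnet} overlaps existing subnet {other}"
    return None


print(validate_subnet("10.0.0.0/16", "10.0.2.0/24", ["10.0.1.0/24"]))
print(validate_subnet("10.0.0.0/16", "172.16.1.0/24", []))
```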



&lt;p&gt;&lt;strong&gt;Enable NAT Gateway:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py nat &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--interface&lt;/span&gt; eth0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy Web Server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py deploy webserver &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create VPC Peering:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py peer create &lt;span class="nt"&gt;--vpc1&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--vpc2&lt;/span&gt; vpc2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Apply Firewall Rules:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py firewall apply &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy&lt;/span&gt; policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List All VPCs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete VPC:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc delete &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Testing &amp;amp; Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test 1: Intra-VPC Communication
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create VPC with two subnets&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc create &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.0.0/16
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--name&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.1.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; public
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--name&lt;/span&gt; private &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.2.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; private

&lt;span class="c"&gt;# Test connectivity&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 10.0.2.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Expected:&lt;/strong&gt; Success (both subnets attach to the same VPC bridge, which forwards the traffic)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Packet leaves public namespace (10.0.1.2)&lt;/li&gt;
&lt;li&gt;Goes through veth to bridge&lt;/li&gt;
&lt;li&gt;Bridge forwards to private subnet's veth&lt;/li&gt;
&lt;li&gt;Arrives at private namespace (10.0.2.2)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test 2: NAT Gateway
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable NAT&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py nat &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--interface&lt;/span&gt; eth0

&lt;span class="c"&gt;# Test public subnet internet access&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Expected:&lt;/strong&gt; Success&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test private subnet (should fail)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-private ping &lt;span class="nt"&gt;-c&lt;/span&gt; 2 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Expected:&lt;/strong&gt; Timeout (no NAT rule for private subnet)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check NAT rules&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; POSTROUTING &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# Should show MASQUERADE rule for 10.0.1.0/24 only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
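
&lt;p&gt;The "public only" behaviour comes down to which source CIDRs get a MASQUERADE rule. Below is a rough sketch of that selection, assuming each subnet record carries &lt;code&gt;cidr&lt;/code&gt; and &lt;code&gt;type&lt;/code&gt; fields. The function &lt;code&gt;nat_commands&lt;/code&gt; and the record layout are hypothetical; the iptables syntax itself is standard.&lt;/p&gt;

```python
def nat_commands(subnets, out_iface):
    """Build one MASQUERADE rule per public subnet; private subnets get none."""
    commands = []
    for subnet in subnets:
        if subnet["type"] != "public":
            # No rule means packets leave with a private source address
            # and replies never find their way back.
            continue
        commands.append(
            f"iptables -t nat -A POSTROUTING -s {subnet['cidr']} "
            f"-o {out_iface} -j MASQUERADE"
        )
    return commands


subnets = [
    {"name": "public", "cidr": "10.0.1.0/24", "type": "public"},
    {"name": "private", "cidr": "10.0.2.0/24", "type": "private"},
]
print(nat_commands(subnets, "eth0"))
```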



&lt;h3&gt;
  
  
  Test 3: VPC Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create second VPC&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc create &lt;span class="nt"&gt;--name&lt;/span&gt; vpc2 &lt;span class="nt"&gt;--cidr&lt;/span&gt; 172.16.0.0/16
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc2 &lt;span class="nt"&gt;--name&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 172.16.1.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; public

&lt;span class="c"&gt;# Try to communicate (should fail)&lt;/span&gt;
&lt;span class="nb"&gt;sudo timeout &lt;/span&gt;3 ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping 172.16.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Expected:&lt;/strong&gt; No reply ("Network unreachable", or a timeout if the namespace has a default route)&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 4: VPC Peering
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create peering&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py peer create &lt;span class="nt"&gt;--vpc1&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--vpc2&lt;/span&gt; vpc2

&lt;span class="c"&gt;# Now communication should work&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 172.16.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Expected:&lt;/strong&gt; Success&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify routes were added&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ip route
&lt;span class="c"&gt;# Should show: 172.16.0.0/16 via 10.0.0.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test 5: Firewall Rules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy web server&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py deploy webserver &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="nt"&gt;--port&lt;/span&gt; 8080

&lt;span class="c"&gt;# Test before firewall&lt;/span&gt;
curl http://10.0.1.2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Works&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Apply restrictive policy&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py firewall apply &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy&lt;/span&gt; test-policy.json

&lt;span class="c"&gt;# Test allowed port&lt;/span&gt;
curl http://10.0.1.2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Still works (port 8080 is allowed in policy)&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Complete Demo Script
&lt;/h2&gt;

&lt;p&gt;Here's the full automated demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# 1. Create VPC&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc create &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.0.0/16

&lt;span class="c"&gt;# 2. Add subnets&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--name&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.1.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; public
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--name&lt;/span&gt; private &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 10.0.2.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; private

&lt;span class="c"&gt;# 3. Enable NAT&lt;/span&gt;
&lt;span class="nv"&gt;IFACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ip route | &lt;span class="nb"&gt;grep &lt;/span&gt;default | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $5}'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py nat &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--interface&lt;/span&gt; &lt;span class="nv"&gt;$IFACE&lt;/span&gt;

&lt;span class="c"&gt;# 4. Deploy web servers&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py deploy webserver &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py deploy webserver &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--subnet&lt;/span&gt; private &lt;span class="nt"&gt;--port&lt;/span&gt; 8081

&lt;span class="c"&gt;# 5. Test intra-VPC communication&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 10.0.2.2

&lt;span class="c"&gt;# 6. Test NAT gateway&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 8.8.8.8
&lt;span class="nb"&gt;sudo timeout &lt;/span&gt;3 ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-private ping &lt;span class="nt"&gt;-c&lt;/span&gt; 2 8.8.8.8

&lt;span class="c"&gt;# 7. Create second VPC&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc create &lt;span class="nt"&gt;--name&lt;/span&gt; vpc2 &lt;span class="nt"&gt;--cidr&lt;/span&gt; 172.16.0.0/16
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py subnet add &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc2 &lt;span class="nt"&gt;--name&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cidr&lt;/span&gt; 172.16.1.0/24 &lt;span class="nt"&gt;--type&lt;/span&gt; public
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py nat &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc2 &lt;span class="nt"&gt;--interface&lt;/span&gt; &lt;span class="nv"&gt;$IFACE&lt;/span&gt;

&lt;span class="c"&gt;# 8. Test VPC isolation&lt;/span&gt;
&lt;span class="nb"&gt;sudo timeout &lt;/span&gt;3 ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 2 172.16.1.2

&lt;span class="c"&gt;# 9. Create VPC peering&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py peer create &lt;span class="nt"&gt;--vpc1&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--vpc2&lt;/span&gt; vpc2

&lt;span class="c"&gt;# 10. Test after peering&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;vpc1-public ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 172.16.1.2

&lt;span class="c"&gt;# 11. Apply firewall rules&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py firewall apply &lt;span class="nt"&gt;--vpc&lt;/span&gt; vpc1 &lt;span class="nt"&gt;--subnet&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy&lt;/span&gt; test-policy.json

&lt;span class="c"&gt;# 12. View logs&lt;/span&gt;
&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 30 /var/lib/vpcctl/vpcctl.log

&lt;span class="c"&gt;# 13. List resources&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc list

&lt;span class="c"&gt;# 14. Cleanup&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc delete &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc delete &lt;span class="nt"&gt;--name&lt;/span&gt; vpc2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Cleanup Process
&lt;/h2&gt;

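&lt;p&gt;Proper cleanup is critical. Orphaned namespaces, bridges, and iptables rules persist until they are deleted or the host reboots, and they can break the next run through name collisions or stale NAT rules.&lt;br&gt;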
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete VPC (automated cleanup)&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./vpcctl.py vpc delete &lt;span class="nt"&gt;--name&lt;/span&gt; vpc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens internally:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kill all processes in namespaces&lt;/li&gt;
&lt;li&gt;Delete namespaces&lt;/li&gt;
&lt;li&gt;Remove veth pairs&lt;/li&gt;
&lt;li&gt;Delete iptables rules&lt;/li&gt;
&lt;li&gt;Remove bridge&lt;/li&gt;
&lt;li&gt;Clean up state file&lt;/li&gt;
&lt;/ol&gt;
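
&lt;p&gt;The ordering matters because later steps depend on earlier ones: a namespace is not fully released while processes still run in it. The steps above can be sketched as a function that emits the commands in order. The state layout (&lt;code&gt;bridge&lt;/code&gt;, &lt;code&gt;iptables_rules&lt;/code&gt;) and the name &lt;code&gt;teardown_plan&lt;/code&gt; are assumptions for illustration, not vpcctl's actual state file format.&lt;/p&gt;

```python
def teardown_plan(vpc):
    """Return cleanup commands in dependency order for one VPC."""
    plan = []
    for subnet in vpc["subnets"]:
        ns = subnet["namespace"]
        # Stop anything still running inside the namespace first.
        plan.append(f"ip netns pids {ns} | xargs -r kill")
        # Deleting the namespace destroys the veth end inside it,
        # which also removes the peer end attached to the bridge.
        plan.append(f"ip netns delete {ns}")
    for rule in vpc.get("iptables_rules", []):
        # Each recorded "-A ..." rule is removed with the matching "-D ...".
        plan.append("iptables " + rule.replace("-A ", "-D ", 1))
    # The bridge goes last, once nothing is attached to it.
    plan.append(f"ip link delete {vpc['bridge']}")
    return plan


vpc = {
    "bridge": "br-vpc1",
    "subnets": [{"namespace": "vpc1-public"}, {"namespace": "vpc1-private"}],
    "iptables_rules": ["-t nat -A POSTROUTING -s 10.0.1.0/24 -o eth0 -j MASQUERADE"],
}
for step in teardown_plan(vpc):
    print(step)
```

&lt;p&gt;The first command of each pair contains a pipe, so it would need a shell to execute. Removing the state file is left out of the sketch.&lt;/p&gt;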

&lt;h3&gt;
  
  
  Emergency Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./cleanup.sh
&lt;span class="c"&gt;# Removes ALL network resources created by vpcctl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Should be empty&lt;/span&gt;
ip netns list
ip &lt;span class="nb"&gt;link &lt;/span&gt;show &lt;span class="nb"&gt;type &lt;/span&gt;bridge | &lt;span class="nb"&gt;grep &lt;/span&gt;br-

&lt;span class="c"&gt;# Check iptables&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-L&lt;/span&gt; FORWARD &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; POSTROUTING &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
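
&lt;p&gt;The same check can be scripted. This sketch takes the text output of the two &lt;code&gt;ip&lt;/code&gt; commands above and reports anything that still looks like it belongs to a VPC. It assumes the naming used throughout this article (&lt;code&gt;vpc1-public&lt;/code&gt;, &lt;code&gt;br-vpc1&lt;/code&gt;). The helper &lt;code&gt;leftover_resources&lt;/code&gt; is hypothetical, and the sample output is abbreviated.&lt;/p&gt;

```python
def leftover_resources(netns_output, link_output, vpc_names):
    """Find namespaces and bridges that still belong to the given VPCs."""
    leftovers = []
    for line in netns_output.splitlines():
        if not line.strip():
            continue
        # `ip netns list` prints e.g. "vpc1-public (id: 0)"; keep the name only.
        name = line.split()[0]
        if any(name.startswith(f"{vpc}-") for vpc in vpc_names):
            leftovers.append(name)
    for vpc in vpc_names:
        # Simple substring match against the `ip link show type bridge` text.
        if f"br-{vpc}" in link_output:
            leftovers.append(f"br-{vpc}")
    return leftovers


netns_text = "vpc1-public (id: 0)\nunrelated-ns\n"
link_text = "7: br-vpc2: mtu 1500 state UP"
print(leftover_resources(netns_text, link_text, ["vpc1", "vpc2"]))
```

&lt;p&gt;An empty list means the cleanup left nothing behind for those VPC names.&lt;/p&gt;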






&lt;h2&gt;
  
  
  Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge 1: VPC Peering Not Working
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; I added the peering routes to the host routing table, but packets originate inside the namespaces, and each namespace consults only its own routing table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Add the routes inside every namespace of both VPCs, each pointing to its own VPC's gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong approach
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip route add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cidr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; via &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc2_peer_ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Correct approach
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vpc1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subnets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ns_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ip route add &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cidr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;via &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gateway&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
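
&lt;p&gt;The snippet above covers vpc1's namespaces. Replies need the mirror-image routes in vpc2, so a fuller version would loop over both directions. Here is a sketch that only builds the command strings, assuming the same &lt;code&gt;cidr&lt;/code&gt;, &lt;code&gt;gateway&lt;/code&gt;, and &lt;code&gt;subnets&lt;/code&gt; fields as above. The function name &lt;code&gt;peering_routes&lt;/code&gt; and the vpc2 gateway &lt;code&gt;172.16.0.1&lt;/code&gt; are assumptions.&lt;/p&gt;

```python
def peering_routes(vpc_a, vpc_b):
    """Build per-namespace routes so each VPC can reach the other's CIDR."""
    commands = []
    # Walk both directions: a's namespaces learn b's CIDR, and vice versa.
    for local, remote in ((vpc_a, vpc_b), (vpc_b, vpc_a)):
        for subnet in local["subnets"]:
            commands.append(
                f"ip netns exec {subnet['namespace']} "
                f"ip route add {remote['cidr']} via {local['gateway']}"
            )
    return commands


vpc1 = {
    "cidr": "10.0.0.0/16",
    "gateway": "10.0.0.1",
    "subnets": [{"namespace": "vpc1-public"}, {"namespace": "vpc1-private"}],
}
vpc2 = {
    "cidr": "172.16.0.0/16",
    "gateway": "172.16.0.1",  # assumed by analogy with vpc1
    "subnets": [{"namespace": "vpc2-public"}],
}
for command in peering_routes(vpc1, vpc2):
    print(command)
```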



&lt;h3&gt;
  
  
  Challenge 2: Firewall Blocking Everything
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; I applied the policy rules but forgot to allow established connections, so reply packets were dropped along with the traffic I meant to block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Always add stateful rules first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ns_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iptables -A INPUT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m state --state ESTABLISHED,RELATED -j ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenge 3: Bridge Netfilter Interference
&lt;/h3&gt;

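&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; With bridge netfilter enabled, frames crossing the bridge were also passed through the host's iptables chains, which interfered with intra-VPC traffic and added overhead.&lt;/p&gt;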

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Disable bridge netfilter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl &lt;span class="nt"&gt;-w&lt;/span&gt; net.bridge.bridge-nf-call-iptables&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network Namespaces&lt;/strong&gt; give each environment its own network stack, which is the foundation of container networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;veth Pairs&lt;/strong&gt; are the glue connecting isolated environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridges&lt;/strong&gt; act as virtual switches for Layer 2 forwarding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables&lt;/strong&gt; handles NAT and packet filtering, while the kernel routing tables decide where packets go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proper cleanup&lt;/strong&gt; is essential for infrastructure automation&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Container Networking:&lt;/strong&gt; Docker/Kubernetes use the same primitives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant Systems:&lt;/strong&gt; Isolate customer workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Provider Internals:&lt;/strong&gt; A mental model for the concepts behind AWS VPC (the production implementation differs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Network segmentation and isolation strategies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Deep Dive
&lt;/h2&gt;

&lt;p&gt;Let me explain how packets flow through the system:&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Public Subnet → Internet
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Namespace] 10.0.1.2
     ↓ (veth pair)
[Bridge] br-vpc1 (10.0.0.1)
     ↓ (routing decision)
[iptables NAT] MASQUERADE (rewrites source IP)
     ↓
[eth0] → Internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Namespace → Namespace (Same VPC)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Namespace A] 10.0.1.2
     ↓ (veth pair)
[Bridge] br-vpc1 (L2 switching)
     ↓ (veth pair)
[Namespace B] 10.0.2.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: VPC Peering
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[VPC1 Namespace] 10.0.1.2
     ↓ (veth pair)
[VPC1 Bridge] br-vpc1
     ↓ (peering veth pair)
[VPC2 Bridge] br-vpc2
     ↓ (veth pair)
[VPC2 Namespace] 172.16.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
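
&lt;p&gt;The three scenarios differ only in where the destination address lives relative to the source. As a rough illustration, the decision can be modelled with the standard &lt;code&gt;ipaddress&lt;/code&gt; module. This is a simplified model, not how the kernel decides: the real path is determined by routing tables and iptables rules, and &lt;code&gt;classify_flow&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import ipaddress


def classify_flow(src, dst, vpcs, peerings):
    """Name the scenario a packet falls into, given VPC CIDRs and peerings."""

    def owner(address):
        ip = ipaddress.ip_address(address)
        for name, cidr in vpcs.items():
            if ip in ipaddress.ip_network(cidr):
                return name
        return None

    src_vpc, dst_vpc = owner(src), owner(dst)
    if dst_vpc is None:
        return "internet"    # Scenario 1: leaves through NAT on the host
    if src_vpc == dst_vpc:
        return "same-vpc"    # Scenario 2: stays on one bridge
    if frozenset((src_vpc, dst_vpc)) in peerings:
        return "peered"      # Scenario 3: crosses the peering veth pair
    return "isolated"        # No peering: traffic should not get through


vpcs = {"vpc1": "10.0.0.0/16", "vpc2": "172.16.0.0/16"}
peerings = {frozenset(("vpc1", "vpc2"))}

print(classify_flow("10.0.1.2", "8.8.8.8", vpcs, peerings))
print(classify_flow("10.0.1.2", "10.0.2.2", vpcs, peerings))
print(classify_flow("10.0.1.2", "172.16.1.2", vpcs, peerings))
```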






&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bridge vs Router
&lt;/h3&gt;

&lt;p&gt;Using bridges instead of routing gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower latency&lt;/strong&gt; - Layer 2 switching is faster than Layer 3 routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher throughput&lt;/strong&gt; - No routing table lookups for intra-VPC traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler configuration&lt;/strong&gt; - Bridges handle MAC learning automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Namespace Overhead
&lt;/h3&gt;

&lt;p&gt;Network namespaces are lightweight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small memory footprint&lt;/strong&gt; per namespace (the exact cost depends on the kernel version and what is configured inside it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negligible CPU overhead&lt;/strong&gt; for namespace switching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Near-native performance&lt;/strong&gt; for network operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability Limits
&lt;/h3&gt;

&lt;p&gt;On a typical Linux system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~100,000+ namespaces&lt;/strong&gt; possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited by file descriptors&lt;/strong&gt; and memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables rules&lt;/strong&gt; become the bottleneck at scale (~10,000+ rules), because each chain is evaluated linearly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation Guarantees
&lt;/h3&gt;

&lt;p&gt;Network namespaces provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Complete network stack isolation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate iptables rules&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Independent routing tables&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated sockets and ports&lt;/strong&gt; (hiding processes from each other additionally requires a PID namespace)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Attack Surface
&lt;/h3&gt;

&lt;p&gt;Potential security concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host compromise&lt;/strong&gt; affects all VPCs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridge vulnerabilities&lt;/strong&gt; (MAC flooding, ARP spoofing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables misconfigurations&lt;/strong&gt; can leak traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Least privilege&lt;/strong&gt; - Only enable NAT for public subnets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default deny&lt;/strong&gt; - Block all traffic, then allow specific flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; - Log all iptables rules and changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular cleanup&lt;/strong&gt; - Remove unused resources&lt;/li&gt;
&lt;/ol&gt;
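
&lt;p&gt;The default-deny and audit-logging practices above can be sketched as a small rule generator. This is a minimal sketch under stated assumptions: the function name, the flow tuple layout, and the log prefix are hypothetical, and it assumes inter-subnet traffic traverses the host's &lt;code&gt;FORWARD&lt;/code&gt; chain. It only builds the command strings and runs nothing.&lt;/p&gt;

```python
def default_deny_rules(allowed_flows, log_prefix="vpc-drop: "):
    """Build (not run) iptables commands for a default-deny FORWARD chain.

    allowed_flows: iterable of (src_cidr, dst_cidr, protocol, dst_port),
    where protocol is "tcp" or "udp" (required for --dport to be valid).
    """
    rules = [
        # Default deny: drop anything not explicitly allowed below
        "iptables -P FORWARD DROP",
        # Let replies to already-allowed connections through
        "iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for src, dst, proto, port in allowed_flows:
        rules.append(
            f"iptables -A FORWARD -s {src} -d {dst} -p {proto} --dport {port} -j ACCEPT"
        )
    # Audit logging: record whatever is about to hit the DROP policy
    rules.append(f'iptables -A FORWARD -j LOG --log-prefix "{log_prefix}"')
    return rules


for rule in default_deny_rules([("10.0.1.0/24", "10.0.2.0/24", "tcp", 5432)]):
    print(rule)
```

&lt;p&gt;Order matters here: the LOG rule is appended last so that only packets matching none of the allow rules are logged before the chain policy drops them.&lt;/p&gt;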




&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's Missing?
&lt;/h3&gt;

&lt;p&gt;Compared to AWS VPC, this implementation lacks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;IPv6 Support&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: IPv4 only&lt;/li&gt;
&lt;li&gt;Enhancement: Dual-stack networking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DNS Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: No internal DNS&lt;/li&gt;
&lt;li&gt;Enhancement: dnsmasq in each VPC&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DHCP&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Static IP assignment&lt;/li&gt;
&lt;li&gt;Enhancement: Dynamic IP allocation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Network ACLs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Security groups only&lt;/li&gt;
&lt;li&gt;Enhancement: Subnet-level firewalls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;VPC Flow Logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Basic logging&lt;/li&gt;
&lt;li&gt;Enhancement: Detailed traffic logging&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Elastic IPs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: No persistent public IPs&lt;/li&gt;
&lt;li&gt;Enhancement: Static IP mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Implementation Ideas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DNS Server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_dns_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vpc_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Start dnsmasq in namespace
&lt;/span&gt;    &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-dns &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnsmasq --interface=lo --bind-interfaces &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--listen-address=10.0.0.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DHCP Server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
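&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ipaddress&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_dhcp_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vpc_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Lease range: skip the network address and the first host (assumed gateway)
&lt;/span&gt;    &lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ipaddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_network&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cidr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;start_ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Configure dnsmasq for DHCP
&lt;/span&gt;    &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip netns exec &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vpc_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-dhcp &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnsmasq --dhcp-range=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;end_ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,12h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;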
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
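
&lt;p&gt;&lt;strong&gt;Elastic IPs:&lt;/strong&gt; a static IP mapping could plausibly be built from a DNAT/SNAT pair in the &lt;code&gt;nat&lt;/code&gt; table. This is a minimal sketch rather than part of the project: the function name is hypothetical, the addresses in the usage line are placeholders, and the commands are only built, not executed.&lt;/p&gt;

```python
import ipaddress


def elastic_ip_rules(public_ip, private_ip):
    """Build (not run) a 1:1 NAT mapping between two addresses."""
    # Validate both addresses early; raises ValueError on bad input
    public = ipaddress.ip_address(public_ip)
    private = ipaddress.ip_address(private_ip)
    return [
        # Inbound: rewrite the public destination to the private address
        f"iptables -t nat -A PREROUTING -d {public} -j DNAT --to-destination {private}",
        # Outbound: present the private source as the public address
        f"iptables -t nat -A POSTROUTING -s {private} -j SNAT --to-source {public}",
    ]


for rule in elastic_ip_rules("203.0.113.10", "10.0.1.2"):
    print(rule)
```

&lt;p&gt;Unlike MASQUERADE, which picks whatever address the outgoing interface has, an explicit SNAT/DNAT pair keeps the public address stable, which is the property an Elastic IP is meant to provide.&lt;/p&gt;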






&lt;h2&gt;
  
  
  Comparison with Real Cloud VPCs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;My Implementation&lt;/th&gt;
&lt;th&gt;AWS VPC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Isolation&lt;/td&gt;
&lt;td&gt;✅ Network namespaces&lt;/td&gt;
&lt;td&gt;✅ Hypervisor-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnets&lt;/td&gt;
&lt;td&gt;✅ Multiple per VPC&lt;/td&gt;
&lt;td&gt;✅ Multiple per VPC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;✅ iptables MASQUERADE&lt;/td&gt;
&lt;td&gt;✅ Managed NAT service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Peering&lt;/td&gt;
&lt;td&gt;✅ veth pairs&lt;/td&gt;
&lt;td&gt;✅ Software-defined networking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Groups&lt;/td&gt;
&lt;td&gt;✅ iptables rules&lt;/td&gt;
&lt;td&gt;✅ Stateful firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network ACLs&lt;/td&gt;
&lt;td&gt;❌ Not implemented&lt;/td&gt;
&lt;td&gt;✅ Subnet-level firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS&lt;/td&gt;
&lt;td&gt;❌ Not implemented&lt;/td&gt;
&lt;td&gt;✅ Route 53 integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DHCP&lt;/td&gt;
&lt;td&gt;❌ Static IPs only&lt;/td&gt;
&lt;td&gt;✅ DHCP options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flow Logs&lt;/td&gt;
&lt;td&gt;⚠️ Basic logging&lt;/td&gt;
&lt;td&gt;✅ Detailed flow logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPv6&lt;/td&gt;
&lt;td&gt;❌ IPv4 only&lt;/td&gt;
&lt;td&gt;✅ Dual-stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-region&lt;/td&gt;
&lt;td&gt;❌ Single host&lt;/td&gt;
&lt;td&gt;✅ Global infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HA/Redundancy&lt;/td&gt;
&lt;td&gt;❌ Single point of failure&lt;/td&gt;
&lt;td&gt;✅ Multi-AZ redundancy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Learning Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://man7.org/linux/man-pages/man8/ip-netns.8.html" rel="noopener noreferrer"&gt;Linux Network Namespaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.netfilter.org/documentation/" rel="noopener noreferrer"&gt;iptables Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiki.linuxfoundation.org/networking/bridge" rel="noopener noreferrer"&gt;Linux Bridge Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/" rel="noopener noreferrer"&gt;AWS VPC Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Books
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Linux Networking Cookbook&lt;/em&gt; by Carla Schroder&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;TCP/IP Illustrated&lt;/em&gt; by W. Richard Stevens&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Linux Kernel Networking&lt;/em&gt; by Rami Rosen&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Online Courses
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Linux Foundation: Linux Networking and Administration&lt;/li&gt;
&lt;li&gt;Pluralsight: Linux Networking Fundamentals&lt;/li&gt;
&lt;li&gt;Udemy: Linux Networking Masterclass&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/cypher682/vpcctl-linux-networking" rel="noopener noreferrer"&gt;https://github.com/cypher682/vpcctl-linux-networking&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video Demo:&lt;/strong&gt; &lt;a href="https://youtu.be/7LYUl3hc3xE" rel="noopener noreferrer"&gt;https://youtu.be/7LYUl3hc3xE&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Repository Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vpcctl-project/
├── README.md              # Complete documentation
├── vpcctl                 # Main CLI tool
├── demo.sh                # Automated demo
├── cleanup.sh             # Emergency cleanup
├── docs/
│   ├── architecture-diagram.png

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a VPC from scratch taught me more about networking than reading documentation ever could. Understanding these primitives—&lt;strong&gt;namespaces, bridges, veth pairs, and iptables&lt;/strong&gt;—gives you superpowers when debugging container networking, understanding cloud provider internals, or designing multi-tenant systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Lessons:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Linux networking is powerful&lt;/strong&gt; - You don't need specialized tools for complex networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud abstractions are implementations&lt;/strong&gt; - The building blocks of a cloud VPC (isolation, routing, NAT, filtering) can be reproduced with ordinary Linux networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation is achievable&lt;/strong&gt; - Network namespaces provide true isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation is essential&lt;/strong&gt; - Infrastructure as code makes everything reproducible&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Got questions?&lt;/strong&gt; Drop them in the comments below! 👇&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Found this useful?&lt;/strong&gt; Star the repo and share with fellow DevOps engineers! ⭐&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/cypher682" rel="noopener noreferrer"&gt;https://github.com/cypher682&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>vpc</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
