<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edith Asante</title>
    <description>The latest articles on Forem by Edith Asante (@edithasante).</description>
    <link>https://forem.com/edithasante</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901913%2F06551fa0-ec1f-49ea-86f1-97f983a7aad3.jpg</url>
      <title>Forem: Edith Asante</title>
      <link>https://forem.com/edithasante</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/edithasante"/>
    <language>en</language>
    <item>
      <title>I Spent 6 Hours Debugging AWS Before Realising the Bug Was a Capital Letter</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Wed, 13 May 2026 17:07:10 +0000</pubDate>
      <link>https://forem.com/edithasante/-i-spent-6-hours-debugging-aws-before-realising-the-bug-was-a-capital-letter-5369</link>
      <guid>https://forem.com/edithasante/-i-spent-6-hours-debugging-aws-before-realising-the-bug-was-a-capital-letter-5369</guid>
      <description>&lt;h1&gt;
  
  
  I stared at my screen for 6 hours. The API kept returning 404. I checked the Lambda code line by line. I tested the DynamoDB table. I redeployed the API three times. Everything looked right. Then I noticed it. The resource was named &lt;code&gt;/Students&lt;/code&gt; — capital S. My frontend was calling &lt;code&gt;/students&lt;/code&gt; — lowercase. That was it. Six hours. One capital letter.
&lt;/h1&gt;




&lt;p&gt;This is the story of building my first serverless app on AWS — a Student Record Management System — as part of my AWS Cloud Practitioner journey. I'll walk through the full architecture, how I built it, and every AWS configuration bug I hit along the way. Spoiler: 5 out of 8 bugs had nothing to do with my code.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Student Record Management System&lt;/strong&gt; that allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create new student records&lt;/li&gt;
&lt;li&gt;View all students in a table&lt;/li&gt;
&lt;li&gt;Search by Student ID&lt;/li&gt;
&lt;li&gt;Edit student information&lt;/li&gt;
&lt;li&gt;Delete students&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a clean UI showing live stats — total students, average GPA, and number of unique majors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live site:&lt;/strong&gt; &lt;a href="http://student-records-edith-321.s3-website-us-east-1.amazonaws.com" rel="noopener noreferrer"&gt;http://student-records-edith-321.s3-website-us-east-1.amazonaws.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/asanteedith/student-record-system" rel="noopener noreferrer"&gt;https://github.com/asanteedith/student-record-system&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The entire application is serverless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Browser (S3 Static Website)
        ↓
API Gateway (REST API)
        ↓
Lambda Functions (Python 3.12)
        ↓
DynamoDB (StudentRecords table)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No EC2, no servers to manage, no infrastructure to maintain. Everything scales automatically and stays within the AWS Free Tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS Services Used
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NoSQL database to store student records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 serverless functions for CRUD operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API connecting frontend to backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hosts the static frontend website&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Permissions and security roles&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;student-record-system/
├── README.md
├── BUGS.md
├── frontend/
│   ├── index.html
│   ├── styles.css
│   └── app.js
└── lambda/
    ├── GetAllStudents/
    │   └── lambda_function.py
    ├── GetStudent/
    │   └── lambda_function.py
    ├── CreateStudent/
    │   └── lambda_function.py
    ├── UpdateStudent/
    │   └── lambda_function.py
    └── DeleteStudent/
        └── lambda_function.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Phase 1: DynamoDB Setup
&lt;/h2&gt;

&lt;p&gt;DynamoDB is AWS's managed NoSQL database. I created a table called &lt;code&gt;StudentRecords&lt;/code&gt; with &lt;code&gt;studentId&lt;/code&gt; as the partition key (think primary key).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key settings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-demand capacity&lt;/strong&gt; — you only pay for what you use, perfect for a project like this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition key:&lt;/strong&gt; &lt;code&gt;studentId&lt;/code&gt; (String) — every student needs a unique ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each record stores: &lt;code&gt;studentId&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;major&lt;/code&gt; and &lt;code&gt;gpa&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One thing I learned early — DynamoDB stores numbers as Python's &lt;code&gt;Decimal&lt;/code&gt; type, not a regular float. This caused a JSON serialization bug later (more on that below).&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Lambda Functions
&lt;/h2&gt;

&lt;p&gt;I created 5 Lambda functions in Python 3.12, one for each CRUD operation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GetAllStudents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Scan entire table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GetStudent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GET&lt;/td&gt;
&lt;td&gt;Get one student by ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CreateStudent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;POST&lt;/td&gt;
&lt;td&gt;Add new student&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UpdateStudent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PUT&lt;/td&gt;
&lt;td&gt;Update student fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DeleteStudent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DELETE&lt;/td&gt;
&lt;td&gt;Remove student&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's the &lt;code&gt;GetAllStudents&lt;/code&gt; function — it scans the entire DynamoDB table and handles pagination for large datasets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;

&lt;span class="n"&gt;dynamodb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;StudentRecords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;students&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LastEvaluatedKey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExclusiveStartKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LastEvaluatedKey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;students&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Access-Control-Allow-Origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;students&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Access-Control-Allow-Origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; After creating each function, I had to manually attach &lt;code&gt;AmazonDynamoDBFullAccess&lt;/code&gt; to the Lambda execution role in IAM. Lambda has no DynamoDB access by default — this tripped me up more than once.&lt;/p&gt;
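
&lt;p&gt;&lt;code&gt;AmazonDynamoDBFullAccess&lt;/code&gt; is the quick fix, but it grants far more than these functions need. A least-privilege alternative is an inline policy scoped to the one table (a sketch; the region and account ID are placeholders you would replace with your own):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/StudentRecords"
    }
  ]
}
```

&lt;p&gt;Attaching this instead of the managed full-access policy follows the same IAM lesson: grant exactly what the function needs, nothing more.&lt;/p&gt;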




&lt;h2&gt;
  
  
  Phase 3: API Gateway
&lt;/h2&gt;

&lt;p&gt;API Gateway is what connects the frontend to the Lambda functions. I created a REST API with this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/students
  GET     → GetAllStudents
  POST    → CreateStudent
  /{studentid}
    GET    → GetStudent
    PUT    → UpdateStudent
    DELETE → DeleteStudent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key settings for each method:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration type: Lambda Function&lt;/li&gt;
&lt;li&gt;Lambda Proxy integration: ✅ ON (passes the full request to Lambda)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CORS&lt;/strong&gt; must be enabled on both resources — without it the browser blocks every API call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Frontend
&lt;/h2&gt;

&lt;p&gt;The frontend is pure HTML, CSS and vanilla JavaScript — no React, no frameworks. It's hosted as a static website on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stats bar showing live total students, average GPA and unique majors&lt;/li&gt;
&lt;li&gt;Color-coded avatar initials per student&lt;/li&gt;
&lt;li&gt;Major badges with different colors per field&lt;/li&gt;
&lt;li&gt;Actions dropdown (View / Edit / Delete) per row&lt;/li&gt;
&lt;li&gt;Smooth animated modals for all operations&lt;/li&gt;
&lt;li&gt;Toast notifications for success and error feedback&lt;/li&gt;
&lt;li&gt;Fully responsive on mobile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire API connection is in &lt;code&gt;app.js&lt;/code&gt; — one file handles all 5 CRUD operations by calling the API Gateway endpoints.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bugs — This Is Where I Actually Learned AWS
&lt;/h2&gt;

&lt;p&gt;This section is the most valuable part. Every one of these bugs taught me something important about how AWS services work together.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bug 1 — Missing &lt;code&gt;GET /students&lt;/code&gt; Endpoint
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Table was always empty on page load&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; I set up routes for individual student operations but completely forgot to create a &lt;code&gt;GET /students&lt;/code&gt; endpoint to fetch all students. The frontend called it on load and got a 404.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Created a new &lt;code&gt;GetAllStudents&lt;/code&gt; Lambda function and added &lt;code&gt;GET /students&lt;/code&gt; → &lt;code&gt;GetAllStudents&lt;/code&gt; in API Gateway.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Always map out ALL your API routes before you start building.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bug 2 — &lt;code&gt;/Students&lt;/code&gt; vs &lt;code&gt;/students&lt;/code&gt; (Case Sensitivity)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; View, Edit and Delete all failed silently&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; I accidentally created the resource as &lt;code&gt;/Students&lt;/code&gt; (capital S) instead of &lt;code&gt;/students&lt;/code&gt;. AWS API Gateway is &lt;strong&gt;case-sensitive&lt;/strong&gt; — these are completely different paths.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Deleted &lt;code&gt;/Students&lt;/code&gt;, recreated as &lt;code&gt;/{studentid}&lt;/code&gt; under &lt;code&gt;/students&lt;/code&gt; (lowercase), re-added all methods.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; API Gateway resource paths are case-sensitive. Always double-check before adding methods.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bug 3 — Path Parameter Case Mismatch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; &lt;code&gt;Error: 'studentId'&lt;/code&gt; on every individual student operation&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; My Lambda functions read &lt;code&gt;event['pathParameters']['studentId']&lt;/code&gt; (camelCase) but the API Gateway resource was named &lt;code&gt;/{studentid}&lt;/code&gt; (all lowercase). AWS passes the &lt;strong&gt;exact&lt;/strong&gt; parameter name — no automatic case conversion.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Updated all 3 Lambda functions to use &lt;code&gt;event['pathParameters']['studentid']&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; The path parameter name in your Lambda code must match exactly what's in the API Gateway resource path.&lt;/p&gt;
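
&lt;p&gt;A small guard in the handler turns this silent failure into an obvious one. This is my own defensive sketch, not code from the repo:&lt;/p&gt;

```python
def get_path_param(event, name):
    """Fetch a path parameter, tolerating case differences in its name.

    API Gateway passes the parameter under the exact name used in the
    resource path (e.g. {studentid}), so look it up case-insensitively
    and fail with a clear message instead of a bare KeyError.
    """
    params = event.get('pathParameters') or {}
    for key, value in params.items():
        if key.lower() == name.lower():
            return value
    raise KeyError(f"path parameter '{name}' not found; got: {sorted(params)}")

# Works whether the resource was defined as {studentid} or {studentId}:
event = {'pathParameters': {'studentid': 'S001'}}
print(get_path_param(event, 'studentId'))  # S001
```

&lt;p&gt;The better fix is still to keep the names identical end to end, but a clear error beats six hours of guessing.&lt;/p&gt;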




&lt;h3&gt;
  
  
  Bug 4 — CORS Not Re-enabled After Changes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; &lt;code&gt;CORS policy blocked&lt;/code&gt; errors in browser console&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Every time I modified a resource or added a method in API Gateway, CORS got reset. I forgot to re-enable it after making fixes.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; After any resource change — Enable CORS on both resources → replace existing → redeploy.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; CORS must be re-enabled every time you modify API Gateway resources.&lt;/p&gt;
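
&lt;p&gt;With Lambda proxy integration, every Lambda response must carry the CORS headers itself, not just the OPTIONS preflight. A helper like this (my own sketch, not code from the repo) keeps them consistent across all five functions:&lt;/p&gt;

```python
import json

# Headers every proxy-integration response needs so the browser accepts it.
# '*' is fine for a demo; lock it to your S3 site URL in anything real.
CORS_HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE,OPTIONS',
}

def respond(status_code, body):
    """Build an API Gateway proxy response with CORS headers attached."""
    return {
        'statusCode': status_code,
        'headers': dict(CORS_HEADERS),
        'body': json.dumps(body),
    }

print(respond(200, {'ok': True}))
```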




&lt;h3&gt;
  
  
  Bug 5 — API Gateway Changes Not Going Live
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Fixed things in API Gateway but nothing changed on the live site&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; API Gateway uses a staging system. Changes are saved as drafts until you explicitly deploy them to a stage.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always: API Actions → Deploy API → Stage: prod → Deploy after every change.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Unlike Lambda (where Deploy is instant), API Gateway changes are never live until deployed to a stage.&lt;/p&gt;
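
&lt;p&gt;If you script your deployments with the AWS CLI instead of clicking through the console, the deploy step is one command (the REST API ID here is a placeholder):&lt;/p&gt;

```shell
# Push the current API Gateway configuration live to the prod stage.
aws apigateway create-deployment \
    --rest-api-id abc123defg \
    --stage-name prod \
    --description "redeploy after CORS fix"
```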




&lt;h3&gt;
  
  
  Bug 6 — Lambda Missing DynamoDB Permissions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; &lt;code&gt;User is not authorized to perform: dynamodb:PutItem&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; Lambda functions are created with a minimal execution role that only has CloudWatch logging permissions. They have no DynamoDB access by default.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; For each Lambda function: Configuration → Permissions → click role → Attach &lt;code&gt;AmazonDynamoDBFullAccess&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; In AWS, no service has access to another service by default. IAM permissions must always be explicitly granted.&lt;/p&gt;




&lt;h3&gt;
  
  
  Bug 7 — DynamoDB Decimal Serialization Error
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Lambda returned 500 when reading student data&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; DynamoDB stores numbers as Python's &lt;code&gt;Decimal&lt;/code&gt; type. Python's &lt;code&gt;json.dumps()&lt;/code&gt; can't serialize &lt;code&gt;Decimal&lt;/code&gt; objects.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Added a custom &lt;code&gt;DecimalEncoder&lt;/code&gt; class to convert &lt;code&gt;Decimal&lt;/code&gt; to &lt;code&gt;float&lt;/code&gt; during JSON serialization.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Always handle the &lt;code&gt;Decimal&lt;/code&gt; ↔ &lt;code&gt;float&lt;/code&gt; conversion when working with DynamoDB numbers in Python.&lt;/p&gt;
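
&lt;p&gt;You can reproduce the failure, and the fix, in a few lines of plain Python with no AWS involved:&lt;/p&gt;

```python
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    """Convert DynamoDB's Decimal numbers to floats during serialization."""
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)

student = {'studentId': 'S001', 'gpa': Decimal('3.8')}  # shaped like a boto3 item

try:
    json.dumps(student)                      # raises: Decimal is not serializable
except TypeError as e:
    print('without encoder:', e)

print(json.dumps(student, cls=DecimalEncoder))  # {"studentId": "S001", "gpa": 3.8}
```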




&lt;h3&gt;
  
  
  Bug 8 — S3 Serving Old Cached Files
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Updated files uploaded to S3 but site still showed old version&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; The browser cached the old files aggressively.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Hard refresh with &lt;code&gt;Ctrl + Shift + R&lt;/code&gt;, or test in an incognito window.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Always hard refresh or use incognito after deploying new files to S3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Bug&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Missing GET /students endpoint&lt;/td&gt;
&lt;td&gt;API Gateway + Lambda&lt;/td&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;/Students vs /students case mismatch&lt;/td&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Path parameter case mismatch&lt;/td&gt;
&lt;td&gt;Lambda + API Gateway&lt;/td&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;CORS not re-enabled after changes&lt;/td&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Changes not live without redeployment&lt;/td&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;🟠 Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Lambda missing DynamoDB permissions&lt;/td&gt;
&lt;td&gt;Lambda + IAM&lt;/td&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;DynamoDB Decimal serialization error&lt;/td&gt;
&lt;td&gt;Lambda + DynamoDB&lt;/td&gt;
&lt;td&gt;🟠 Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;S3 browser caching old files&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;🟡 Minor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;5 of the 8 bugs were Critical — and most of them were &lt;strong&gt;AWS configuration issues&lt;/strong&gt;, not application code bugs. That's the biggest takeaway from this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. IAM permissions are everything&lt;/strong&gt;&lt;br&gt;
No AWS service can talk to another without explicit IAM permissions. Check permissions first when something isn't working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. API Gateway requires redeployment&lt;/strong&gt;&lt;br&gt;
Every change to API Gateway — methods, CORS, integrations — must be deployed to a stage before it goes live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Case sensitivity matters in AWS&lt;/strong&gt;&lt;br&gt;
Resource paths, parameter names, table names — most AWS identifiers are case-sensitive. Be consistent and stick to lowercase wherever you can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CORS needs to be re-enabled after every change&lt;/strong&gt;&lt;br&gt;
Don't just enable it once and forget about it. Any resource modification resets it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. DynamoDB numbers are Decimal, not float&lt;/strong&gt;&lt;br&gt;
Always use a &lt;code&gt;DecimalEncoder&lt;/code&gt; when returning DynamoDB data as JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live Site:&lt;/strong&gt; &lt;a href="http://student-records-edith-321.s3-website-us-east-1.amazonaws.com" rel="noopener noreferrer"&gt;http://student-records-edith-321.s3-website-us-east-1.amazonaws.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/asanteedith/student-record-system" rel="noopener noreferrer"&gt;https://github.com/asanteedith/student-record-system&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Free Tier:&lt;/strong&gt; &lt;a href="https://aws.amazon.com/free" rel="noopener noreferrer"&gt;https://aws.amazon.com/free&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're working through your AWS Cloud Practitioner certification and want a hands-on project that touches DynamoDB, Lambda, API Gateway, S3 and IAM all at once — this is a great one to build. The bugs you'll hit will teach you more than any documentation.&lt;/p&gt;

&lt;p&gt;Feel free to fork the repo, ask questions in the comments, or connect with me!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: #aws #serverless #python #javascript #cloudpractitioner #beginners&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>aws</category>
      <category>beginners</category>
      <category>serverless</category>
    </item>
    <item>
      <title>I Built a Tool That Watches Your Server, Learns Your Traffic, and Blocks Attackers Automatically</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Tue, 12 May 2026 06:27:34 +0000</pubDate>
      <link>https://forem.com/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</link>
      <guid>https://forem.com/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Most developers deploy servers. Few think about what happens when someone tries to take them down. I did. I built ShieldDaemon — a tool that watches every request hitting your server, learns your normal traffic patterns, and automatically blocks attackers the moment something looks wrong. No manual intervention. No hardcoded rules. Just a daemon that never sleeps. Here is exactly how I built it.&lt;/strong&gt;
&lt;/h2&gt;




&lt;h2&gt;
  
  
  What Is This Project About?
&lt;/h2&gt;

&lt;p&gt;Imagine you run an online shop. Everything is working fine until one day thousands of fake requests flood your website all at once. Your server crashes. Real customers can't access your shop. You lose money and trust.&lt;/p&gt;

&lt;p&gt;That is called a &lt;strong&gt;DDoS attack&lt;/strong&gt; — Distributed Denial of Service. It is one of the most common ways attackers take down websites.&lt;/p&gt;

&lt;p&gt;In this project I built &lt;strong&gt;ShieldDaemon&lt;/strong&gt; — a tool that watches every request coming into a server, learns what normal traffic looks like, and automatically blocks any IP address that starts behaving suspiciously.&lt;/p&gt;

&lt;p&gt;The best part? It does all of this in real time, without any human intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — the detection daemon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; — reverse proxy that logs all traffic in JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; — the application being protected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; — runs everything together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables&lt;/strong&gt; — Linux firewall used to block bad IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flask&lt;/strong&gt; — powers the live dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — receives instant alert notifications&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How the System Works — In Plain English
&lt;/h2&gt;

&lt;p&gt;Think of it like a security camera system at a shopping mall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Camera (Nginx)&lt;/strong&gt;&lt;br&gt;
Every person who walks through the mall entrance gets recorded. Their face, the time they arrived, which shop they visited, and whether they were let in or turned away. Nginx does the same thing — it records every request that hits your server in JSON format and saves it to a shared log file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Recording (JSON Log File)&lt;/strong&gt;&lt;br&gt;
All that information is saved to a log file in real time. Every single request — who made it, when, what they asked for, and what happened. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.33.32.156"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-11T22:07:28+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6674&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. The Security Guard (ShieldDaemon)&lt;/strong&gt;&lt;br&gt;
There is a guard watching that recording live. Not checking it hours later — watching it as it happens. The guard has been watching long enough to know what a normal busy day looks like versus something suspicious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Pattern Recognition&lt;/strong&gt;&lt;br&gt;
If one person walks past the same shop 300 times in one minute, the guard knows that is not normal. ShieldDaemon does the same — it compares current traffic against what it has learned is normal and raises an alarm when something is off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The Bouncer (iptables)&lt;/strong&gt;&lt;br&gt;
When the alarm is raised, the bouncer steps in. The suspicious visitor is blocked at the door — they cannot get back in. This happens automatically within 10 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The Radio (Slack)&lt;/strong&gt;&lt;br&gt;
Every time someone is blocked or unblocked, a message is sent to the security team instantly via Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The Monitor Screen (Dashboard)&lt;/strong&gt;&lt;br&gt;
A live screen shows everything happening in real time — who is visiting, how fast, who is blocked, and how the system is performing.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 1 — Watching the Logs
&lt;/h2&gt;

&lt;p&gt;The first thing ShieldDaemon does is read the Nginx access log line by line as new requests come in. This is called &lt;strong&gt;tailing&lt;/strong&gt; a file.&lt;/p&gt;

&lt;p&gt;Nginx is configured to write logs in the JSON format shown earlier, one object per line.&lt;/p&gt;



&lt;p&gt;Every line tells us exactly who made a request, when, what they requested, and whether it succeeded.&lt;/p&gt;

&lt;p&gt;My monitor script tails this file and passes each line to the detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tail_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# start at end of file
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_log_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
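&lt;p&gt;The &lt;code&gt;parse_log_line&lt;/code&gt; helper referenced above is essentially a guarded &lt;code&gt;json.loads&lt;/code&gt;. A minimal sketch (my reconstruction, not the exact production code):&lt;/p&gt;

```python
import json

def parse_log_line(line):
    """Parse one JSON log line; return None for blank or garbled lines."""
    line = line.strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Nginx may be mid-write when we read the line; skip it and
        # pick up the complete line on the next readline() pass
        return None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; instead of raising keeps the tail loop alive when a line is torn across two reads.&lt;/p&gt;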






&lt;h2&gt;
  
  
  Part 2 — The Sliding Window
&lt;/h2&gt;

&lt;p&gt;Now that we can see every request, we need to measure how fast they are coming.&lt;/p&gt;

&lt;p&gt;I use a &lt;strong&gt;sliding window&lt;/strong&gt; — a structure that tracks requests over the last 60 seconds. I use Python's &lt;code&gt;deque&lt;/code&gt; (double-ended queue) for this.&lt;/p&gt;

&lt;p&gt;Here is how it works in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine a conveyor belt that is 60 seconds long. Every new request gets placed on the right end. Any request older than 60 seconds falls off the left end automatically. The number of items on the belt at any moment is the current request rate.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove entries older than 60 seconds
&lt;/span&gt;    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Current rate = items on belt / belt length
&lt;/span&gt;    &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us an accurate requests-per-second value for every IP at any moment.&lt;/p&gt;
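&lt;p&gt;The snippet above tracks a single window; to get a rate per IP as described, one deque per source address works. A sketch, with names of my choosing:&lt;/p&gt;

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60

# One deque of request timestamps per source IP
windows = defaultdict(deque)

def record(ip, timestamp):
    """Record a request and return that IP's current requests/second."""
    w = windows[ip]
    w.append(timestamp)
    cutoff = timestamp - WINDOW_SECONDS
    while w and w[0] < cutoff:
        w.popleft()  # old requests fall off the left end of the belt
    return len(w) / WINDOW_SECONDS
```

&lt;p&gt;Because each call prunes as it records, memory stays bounded by one minute of traffic per IP.&lt;/p&gt;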




&lt;h2&gt;
  
  
  Part 3 — The Rolling Baseline
&lt;/h2&gt;

&lt;p&gt;Knowing the current rate is not enough. We need to know whether that rate is &lt;strong&gt;normal or not&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, 10 requests per second might be completely normal for a busy website during the day. But at 3am the same rate might be a sign of an attack.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;rolling baseline&lt;/strong&gt; comes in. It learns what normal traffic looks like over the last 30 minutes.&lt;/p&gt;

&lt;p&gt;Every second we record how many requests came in. Every 60 seconds we calculate the &lt;strong&gt;mean&lt;/strong&gt; (average) and &lt;strong&gt;standard deviation&lt;/strong&gt; (how much it varies) of those counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline also maintains &lt;strong&gt;per-hour slots&lt;/strong&gt; — so it learns that traffic during business hours is higher than traffic at night, and adjusts accordingly.&lt;/p&gt;

&lt;p&gt;Floor values of 0.1 are applied to both mean and standard deviation to prevent false positives when there is zero traffic.&lt;/p&gt;
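&lt;p&gt;Putting the floor values together with the mean and standard deviation calculation, the baseline update looks roughly like this (the helper name is mine):&lt;/p&gt;

```python
import math

MEAN_FLOOR = 0.1  # prevents "0.3 req/s is infinitely above normal"
STD_FLOOR = 0.1   # prevents division by a near-zero deviation

def update_baseline(counts):
    """Return (mean, std) of per-second request counts, floored."""
    mean = sum(counts) / len(counts)
    variance = sum((x - mean) ** 2 for x in counts) / len(counts)
    std = math.sqrt(variance)
    return max(mean, MEAN_FLOOR), max(std, STD_FLOOR)
```

&lt;p&gt;With an all-zero window the floors kick in and both values become 0.1, so a single stray request no longer produces an enormous z-score.&lt;/p&gt;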




&lt;h2&gt;
  
  
  Part 4 — Detecting Anomalies
&lt;/h2&gt;

&lt;p&gt;Now we have two things: the current rate and the baseline. We compare them using two methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 1 — Z-Score
&lt;/h3&gt;

&lt;p&gt;The z-score tells us how many standard deviations the current rate is above normal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;baseline_std&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the z-score is above 2.0, something is unusual: assuming roughly normal traffic, a rate that far above the mean would occur naturally only about 2% of the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 2 — Rate Multiplier
&lt;/h3&gt;

&lt;p&gt;We also check if the rate is simply more than 2 times the baseline mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# anomaly detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Either check firing is enough to trigger the response.&lt;/strong&gt; This gives us two layers of protection.&lt;/p&gt;

&lt;p&gt;If an IP also has a high rate of error responses (4xx and 5xx), the thresholds tighten automatically to catch it sooner.&lt;/p&gt;
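&lt;p&gt;Combining both checks with the error-rate tightening gives a single decision function. A sketch; the tightening factor here is illustrative, not the exact production value:&lt;/p&gt;

```python
Z_THRESHOLD = 2.0
RATE_MULTIPLIER = 2.0

def is_anomalous(current_rate, baseline_mean, baseline_std, error_ratio=0.0):
    """Return True if either detection method fires for this IP.

    error_ratio is the fraction of 4xx/5xx responses from the IP;
    a high ratio tightens both thresholds (25% here, illustrative).
    """
    z_limit, mult_limit = Z_THRESHOLD, RATE_MULTIPLIER
    if error_ratio > 0.5:
        z_limit *= 0.75
        mult_limit *= 0.75
    z_score = (current_rate - baseline_mean) / baseline_std
    return z_score > z_limit or current_rate > mult_limit * baseline_mean
```

&lt;p&gt;An IP hammering the server with 404s gets caught at a lower rate than one serving normal pages.&lt;/p&gt;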




&lt;h2&gt;
  
  
  Part 5 — Blocking with iptables
&lt;/h2&gt;

&lt;p&gt;When an anomaly is detected the IP gets blocked at the &lt;strong&gt;firewall level&lt;/strong&gt; using iptables. This means the server stops accepting any traffic from that IP before it even reaches Nginx or Nextcloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens within 10 seconds of detection.&lt;/p&gt;

&lt;p&gt;Here is what a blocked IP looks like in iptables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  45.33.32.156         0.0.0.0/0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 6 — Auto-Unban with Backoff Schedule
&lt;/h2&gt;

&lt;p&gt;Blocking an IP forever for a first offence is too harsh — it might be a false positive. But being too lenient encourages repeat attacks.&lt;/p&gt;

&lt;p&gt;I implemented a &lt;strong&gt;progressive backoff schedule&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offence&lt;/th&gt;
&lt;th&gt;Ban Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st ban&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd ban&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd ban&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th+ ban&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each ban is scheduled using a Python timer thread that fires after the duration and removes the iptables rule automatically. A Slack notification is sent every time an IP is unbanned.&lt;/p&gt;
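&lt;p&gt;A sketch of that timer mechanism: the durations follow the table, and the &lt;code&gt;iptables -D&lt;/code&gt; call deletes the matching DROP rule that the ban inserted. Function names are mine.&lt;/p&gt;

```python
import subprocess
import threading

BACKOFF = [600, 1800, 7200, None]  # seconds; None means permanent
offences = {}

def ban_duration(offence_count):
    """Map an IP's offence count to a ban length (None = permanent)."""
    return BACKOFF[min(offence_count, len(BACKOFF)) - 1]

def schedule_unban(ip):
    offences[ip] = offences.get(ip, 0) + 1
    duration = ban_duration(offences[ip])
    if duration is None:
        return  # 4th offence and beyond: no timer, the ban stays
    timer = threading.Timer(duration, unban, args=[ip])
    timer.daemon = True  # don't block daemon shutdown
    timer.start()

def unban(ip):
    # -D deletes the rule that "-I INPUT -s ip -j DROP" inserted
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
```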




&lt;h2&gt;
  
  
  Part 7 — Slack Alerts
&lt;/h2&gt;

&lt;p&gt;Every significant event sends an alert to Slack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ban alert example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; IP BANNED
• IP: 45.33.32.156
• Condition: z-score=5.43 &amp;gt; threshold=2.0
• Current rate: 3.72 req/s
• Baseline: 0.10 req/s
• Ban duration: 600 seconds
• Timestamp: 2026-05-11T22:07:33Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Global anomaly alert:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; GLOBAL TRAFFIC ANOMALY
• Condition: Global request rate spike
• Current rate: 3.10 req/s
• Baseline: 0.10 req/s
• Action: No IP ban — monitoring closely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
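&lt;p&gt;Posting these alerts is a single HTTP request to a Slack incoming webhook. A sketch using only the standard library; the webhook URL is a placeholder and the formatter mirrors the ban alert above:&lt;/p&gt;

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_ban_alert(ip, z_score, rate, baseline, duration):
    """Build the ban-alert text shown above."""
    return (
        "IP BANNED\n"
        f"• IP: {ip}\n"
        f"• Condition: z-score={z_score:.2f} > threshold=2.0\n"
        f"• Current rate: {rate:.2f} req/s\n"
        f"• Baseline: {baseline:.2f} req/s\n"
        f"• Ban duration: {duration} seconds"
    )

def send_alert(text):
    payload = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

&lt;p&gt;Worth sending off the hot path, so a slow webhook cannot stall detection.&lt;/p&gt;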






&lt;h2&gt;
  
  
  Part 8 — The Live Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard at port 8080 refreshes every 3 seconds and shows everything happening in real time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global request rate&lt;/li&gt;
&lt;li&gt;Baseline mean and standard deviation&lt;/li&gt;
&lt;li&gt;Blocked IPs with ban count&lt;/li&gt;
&lt;li&gt;CPU and memory usage&lt;/li&gt;
&lt;li&gt;System uptime&lt;/li&gt;
&lt;li&gt;Top 10 source IPs&lt;/li&gt;
&lt;li&gt;Live traffic chart vs baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is built with Flask and Chart.js with a dark blue security-themed design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The baseline kept adapting to attack traffic.&lt;/strong&gt; When I injected test requests the baseline learned those high rates as normal and stopped flagging them. The fix was to restart the daemon with a clean baseline before testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency calculation was wrong.&lt;/strong&gt; My first attempt used &lt;code&gt;date +%s%N&lt;/code&gt;, which not every &lt;code&gt;date&lt;/code&gt; implementation supports. I switched to curl's built-in &lt;code&gt;%{time_total}&lt;/code&gt; timing instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Slack webhook was accidentally exposed.&lt;/strong&gt; I committed the webhook URL to GitHub and GitHub's secret scanning blocked the push. I revoked the token immediately and used a placeholder in the config file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker volume mounting.&lt;/strong&gt; The detector container needed to read the Nginx log file through a shared Docker volume called &lt;code&gt;HNG-nginx-logs&lt;/code&gt;. Getting the volume permissions right took some debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building ShieldDaemon taught me that &lt;strong&gt;real security tools are statistical, not rule-based&lt;/strong&gt;. A fixed threshold of "block anyone who sends more than 100 requests per minute" would block legitimate users during a product launch. A statistical baseline that learns from actual traffic patterns is far more accurate.&lt;/p&gt;

&lt;p&gt;I also learned that &lt;strong&gt;the order of operations matters in security&lt;/strong&gt;. You must detect before you block. You must verify before you unban. You must log everything so you can audit what happened.&lt;/p&gt;

&lt;p&gt;Most importantly, I learned that &lt;strong&gt;security is a continuous process&lt;/strong&gt;. ShieldDaemon runs forever, constantly learning and adapting. There is no finish line — only a daemon that never sleeps.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A fully working DDoS detection engine that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches Nginx logs in real time&lt;/li&gt;
&lt;li&gt;Learns normal traffic patterns automatically&lt;/li&gt;
&lt;li&gt;Detects attacks within seconds using z-scores&lt;/li&gt;
&lt;li&gt;Blocks malicious IPs with iptables&lt;/li&gt;
&lt;li&gt;Unbans automatically on a backoff schedule&lt;/li&gt;
&lt;li&gt;Alerts the team via Slack&lt;/li&gt;
&lt;li&gt;Shows everything on a live dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see it running at &lt;strong&gt;&lt;a href="http://13.60.224.73:8080" rel="noopener noreferrer"&gt;http://13.60.224.73:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full source code is at &lt;strong&gt;&lt;a href="https://github.com/asanteedith/Shield-Daemon-Detection-Engine" rel="noopener noreferrer"&gt;https://github.com/asanteedith/Shield-Daemon-Detection-Engine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Written by Edith Asante — Cloud &amp;amp; DevOps Engineer. Find me on GitHub | Dev.to.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building a Self-Service Sandbox Platform from Scratch</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 16:31:41 +0000</pubDate>
      <link>https://forem.com/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</link>
      <guid>https://forem.com/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. Follow along as I document every stage.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Recap
&lt;/h2&gt;

&lt;p&gt;Stage 0 was about securing a Linux server. Stage 1 was deploying an API behind Nginx. Stage 2 was containerizing a microservices app. Stage 3 was building a DDoS detection engine. Stage 4 was writing a declarative deployment tool. Stage 5 is the most ambitious yet.&lt;/p&gt;

&lt;p&gt;This time there was no starter code. No bugs to fix. No existing app to containerize. I had to build the entire platform from scratch — a self-service system where users can spin up isolated temporary environments, deploy apps into them, simulate outages, monitor health, and have everything auto-destroyed when the lifetime expires. Think of it as a miniature internal Heroku with a chaos engineering toggle.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;The platform had to do all of this on a single Linux VM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Environment Lifecycle&lt;/strong&gt; — create and destroy isolated Docker environments on demand with a configurable TTL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Cleanup Daemon&lt;/strong&gt; — a background process that scans every 60 seconds and destroys expired environments automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Nginx Routing&lt;/strong&gt; — every new environment gets its own Nginx config written and reloaded automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Shipping&lt;/strong&gt; — container logs captured and queryable by environment ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health Monitoring&lt;/strong&gt; — a poller that hits every environment's &lt;code&gt;/health&lt;/code&gt; endpoint every 30 seconds and marks environments as degraded after 3 consecutive failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outage Simulation&lt;/strong&gt; — a script that can crash, pause, disconnect, or stress-test any environment on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control API&lt;/strong&gt; — a REST API with 6 endpoints wrapping all the scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makefile&lt;/strong&gt; — every action available as a make target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack was Docker, Docker Compose, Nginx, Bash, Python 3, and Flask. Everything had to spin up with one command.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Repo Structure and Scaffold
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of logic I set up the repo structure exactly as specified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;devops-sandbox/
├── platform/
│   ├── create_env.sh
│   ├── destroy_env.sh
│   ├── cleanup_daemon.sh
│   ├── simulate_outage.sh
│   └── api.py
├── nginx/
│   ├── nginx.conf
│   └── conf.d/
├── monitor/
│   └── health_poller.sh
├── logs/
├── envs/
├── Makefile
├── docker-compose.yml
├── README.md
├── .env.example
└── .gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Getting this right first saved a lot of headaches later. Every script references paths relative to the project root, and if those paths don't exist at runtime the scripts fail silently. I also set &lt;code&gt;chmod +x&lt;/code&gt; on all shell scripts immediately — forgetting this causes confusing permission errors later.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; was set up to exclude &lt;code&gt;envs/&lt;/code&gt;, &lt;code&gt;logs/&lt;/code&gt;, and &lt;code&gt;.env&lt;/code&gt; from the start. These directories contain runtime state and secrets that should never be committed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: The Demo App
&lt;/h2&gt;

&lt;p&gt;The platform needed something to run inside each environment. The task made it clear that the demo app is not the project — the platform is. So I kept it simple: a Flask app with two routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from the sandbox!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/health&lt;/code&gt; route is the critical one. The health poller depends on it. Every environment container gets its &lt;code&gt;ENV_ID&lt;/code&gt; injected as an environment variable so you can always tell which container you are talking to.&lt;/p&gt;
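&lt;p&gt;Inside the app, that variable is read once at startup. A sketch; the fallback value and the example ID are my choices:&lt;/p&gt;

```python
import os

# create_env.sh injects this, e.g. `docker run -e ENV_ID=myenv-1715443200`;
# the fallback only shows up if the container is started by hand
ENV_ID = os.environ.get("ENV_ID", "unknown")
```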

&lt;p&gt;The app binds to &lt;code&gt;0.0.0.0&lt;/code&gt; not &lt;code&gt;127.0.0.1&lt;/code&gt;. This is a mistake I see constantly. If you bind to localhost inside a container, nothing outside the container can reach it — including Nginx.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Nginx Dynamic Routing
&lt;/h2&gt;

&lt;p&gt;Nginx is the front door for every environment. The key insight is that &lt;code&gt;nginx.conf&lt;/code&gt; never needs to change. It just includes everything in &lt;code&gt;conf.d/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;include&lt;/span&gt; &lt;span class="n"&gt;/etc/nginx/conf.d/*.conf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="s"&gt;default_server&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt; &lt;span class="s"&gt;"No&lt;/span&gt; &lt;span class="s"&gt;environment&lt;/span&gt; &lt;span class="s"&gt;found&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;n"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;create_env.sh&lt;/code&gt; runs, it writes a new file to &lt;code&gt;nginx/conf.d/$ENV_ID.conf&lt;/code&gt; and reloads Nginx. When &lt;code&gt;destroy_env.sh&lt;/code&gt; runs, it deletes that file and reloads Nginx again. No manual config editing ever.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;conf.d/&lt;/code&gt; directory is mounted as a Docker volume into the Nginx container. This means files written to &lt;code&gt;nginx/conf.d/&lt;/code&gt; on the host appear immediately inside the container. Only a reload is needed, not a rebuild.&lt;/p&gt;

&lt;p&gt;One critical mistake to avoid: never write the Nginx config before the container is running. Nginx validates upstream hostnames on reload. If you write a config pointing to a container that doesn't exist yet, the reload fails and Nginx goes down. The order matters — start the container first, then write the config.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Environment Lifecycle
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;create_env.sh&lt;/code&gt; is the heart of the platform. It has to do six things in the right order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a unique env ID from the name and a timestamp suffix&lt;/li&gt;
&lt;li&gt;Create a dedicated Docker network for the environment&lt;/li&gt;
&lt;li&gt;Connect the Nginx container to that network&lt;/li&gt;
&lt;li&gt;Start the app container on that network with a &lt;code&gt;sandbox.env=$ENV_ID&lt;/code&gt; label&lt;/li&gt;
&lt;li&gt;Write the Nginx config and reload&lt;/li&gt;
&lt;li&gt;Write the state file to &lt;code&gt;envs/$ENV_ID.json&lt;/code&gt; atomically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The atomic write is important. The cleanup daemon reads these state files in a loop. If a write crashes halfway, the daemon reads garbage and fails. The fix is to write to a temp file first and then &lt;code&gt;mv&lt;/code&gt; it into place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TEMP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/.tmp.XXXXXX"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;
{
  "id": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="sh"&gt;",
  "name": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "container": "&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "network": "&lt;/span&gt;&lt;span class="nv"&gt;$NETWORK_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "created_at": "&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="sh"&gt;",
  "ttl": &lt;/span&gt;&lt;span class="nv"&gt;$TTL&lt;/span&gt;&lt;span class="sh"&gt;,
  "status": "running"
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mv&lt;/code&gt; is atomic on Linux when source and destination are on the same filesystem. The daemon either reads the complete file or nothing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;destroy_env.sh&lt;/code&gt; reverses all of this in the correct order — kill the log shipper first, stop and remove containers, disconnect Nginx from the network, remove the network, delete the Nginx config, reload Nginx, archive logs, delete the state file. Order matters here too. You cannot remove a network while containers are still connected to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: The Cleanup Daemon
&lt;/h2&gt;

&lt;p&gt;The daemon runs in an infinite loop with a 60 second sleep. On each iteration it reads every file in &lt;code&gt;envs/&lt;/code&gt;, computes how much time has passed since &lt;code&gt;created_at&lt;/code&gt;, and calls &lt;code&gt;destroy_env.sh&lt;/code&gt; if the TTL has been exceeded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CREATED_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;NOW_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;EXPIRES_AT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;CREATED_EPOCH &lt;span class="o"&gt;+&lt;/span&gt; TTL&lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NOW_EPOCH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ge&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXPIRES_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;bash &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DESTROY_SCRIPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that breaks this: forgetting &lt;code&gt;nullglob&lt;/code&gt;. If &lt;code&gt;envs/&lt;/code&gt; is empty, &lt;code&gt;*.json&lt;/code&gt; expands to the literal string &lt;code&gt;*.json&lt;/code&gt;, and the loop tries to process a file by that name that doesn't exist. With &lt;code&gt;nullglob&lt;/code&gt; set, the pattern expands to nothing and the loop body simply never runs. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; nullglob
&lt;span class="nv"&gt;STATE_FILES&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; nullglob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every action is timestamped and written to &lt;code&gt;logs/cleanup.log&lt;/code&gt;. The daemon runs in the background with &lt;code&gt;nohup&lt;/code&gt; and its PID is saved so &lt;code&gt;make down&lt;/code&gt; can stop it cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Health Monitoring
&lt;/h2&gt;

&lt;p&gt;The health poller runs every 30 seconds. For each active environment it finds the container's IP address, hits &lt;code&gt;GET /health&lt;/code&gt;, measures the latency, and writes the result to &lt;code&gt;logs/$ENV_ID/health.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Getting latency right was harder than expected. My first approach used &lt;code&gt;date +%s%N&lt;/code&gt; for nanosecond timestamps. This failed because &lt;code&gt;%N&lt;/code&gt; is a GNU extension that the &lt;code&gt;date&lt;/code&gt; implementation on the VM does not support, so the arithmetic produced garbage: latencies like &lt;code&gt;14209454ms&lt;/code&gt; for a request that obviously took under a second.&lt;/p&gt;

&lt;p&gt;The fix was to use curl's own built-in timing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code} %{time_total}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_IP&lt;/span&gt;&lt;span class="s2"&gt;:5000/health"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;HTTP_STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TIME_SEC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;LATENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TIME_SEC&lt;/span&gt;&lt;span class="s2"&gt; * 1000"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{printf "%d", $1 * 1000}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;curl&lt;/code&gt;'s &lt;code&gt;%{time_total}&lt;/code&gt; gives you wall clock time in seconds as a decimal. Multiply by 1000 and you have milliseconds. Accurate and reliable.&lt;/p&gt;

&lt;p&gt;After 3 consecutive failures the poller marks the environment as degraded by updating the state file. It also resets the fail counter and restores the status to running when checks pass again. The status update uses the same atomic write pattern as the lifecycle scripts.&lt;/p&gt;
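&lt;p&gt;That status flip can be sketched in a few lines of Python. This is a hedged illustration of the pattern rather than the poller's actual code; the field names match the state file shown earlier, but &lt;code&gt;update_status&lt;/code&gt; is a name I made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import tempfile

def update_status(state_path, status):
    # read the current state, change one field
    with open(state_path) as f:
        state = json.load(f)
    state["status"] = status
    # write to a temp file in the same directory, then rename into place.
    # os.replace is atomic when both paths are on the same filesystem,
    # so a concurrent reader never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(state_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, state_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;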




&lt;h2&gt;
  
  
  Step 7: Outage Simulation
&lt;/h2&gt;

&lt;p&gt;The simulation script accepts &lt;code&gt;--env&lt;/code&gt; and &lt;code&gt;--mode&lt;/code&gt; flags. The modes map directly to Docker commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;crash&lt;/code&gt; → &lt;code&gt;docker kill&lt;/code&gt; (SIGKILL, not graceful)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pause&lt;/code&gt; → &lt;code&gt;docker pause&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;network&lt;/code&gt; → &lt;code&gt;docker network disconnect&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recover&lt;/code&gt; → inspects current state and reverses whichever mode is active&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stress&lt;/code&gt; → &lt;code&gt;stress-ng&lt;/code&gt; inside the container for 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The guard at the top of the script is not optional. It checks whether the target container name matches any protected service names and refuses to run if it does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"sandbox-nginx"&lt;/span&gt; &lt;span class="s2"&gt;"cleanup_daemon"&lt;/span&gt; &lt;span class="s2"&gt;"sandbox-api"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;PROTECTED_NAME &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROTECTED_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Refusing to simulate outage against protected container"&lt;/span&gt;
        &lt;span class="nb"&gt;exit &lt;/span&gt;1
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this guard, nothing stops someone from passing the Nginx container ID and taking down the entire platform.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;recover&lt;/code&gt; mode was the most interesting to write. It does not know which mode caused the problem — it just inspects the current state and fixes whatever is wrong. Paused? Unpause. Exited? Restart. Network disconnected? Reconnect. This makes recover genuinely useful rather than just a wrapper around one specific undo.&lt;/p&gt;
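&lt;p&gt;The decision logic reduces to a small pure function. A hedged sketch — the inputs would come from &lt;code&gt;docker inspect&lt;/code&gt;, and the function and return names are illustrative, not the script's actual interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def recover_action(status, paused, on_network):
    # order matters: a paused container still counts as "running" for
    # some checks, so test the pause flag first, then exit, then network
    if paused:
        return "unpause"
    if status == "exited":
        return "restart"
    if not on_network:
        return "reconnect"
    return "noop"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;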




&lt;h2&gt;
  
  
  Step 8: The Control API
&lt;/h2&gt;

&lt;p&gt;The Flask API wraps all the scripts via &lt;code&gt;subprocess.run&lt;/code&gt;. It has 6 endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST   /envs              → create env
GET    /envs              → list active envs + TTL remaining
DELETE /envs/:id          → destroy env
GET    /envs/:id/logs     → last 100 lines of app.log
GET    /envs/:id/health   → last 10 health check results
POST   /envs/:id/outage   → trigger simulation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
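&lt;p&gt;Each handler shells out to the scripts the same way, so the wrapper is the interesting part. A sketch with error handling simplified — the helper name and timeout value are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run_script(script, *args):
    # run a platform script and surface its stderr as the error message
    result = subprocess.run(
        ["bash", script, *args],
        capture_output=True, text=True, timeout=120,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;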



&lt;p&gt;The TTL remaining calculation happens in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ttl_remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+00:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API runs inside a Docker container with the project directory mounted as a volume and the Docker socket mounted so it can execute Docker commands. This is the standard pattern for tools that need to manage Docker from inside Docker, with one caveat: access to the socket is effectively root on the host, so the API should never be exposed to untrusted users.&lt;/p&gt;
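&lt;p&gt;In Compose terms this is two volume entries on the API service. A sketch — the service name and mount paths are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  api:
    build: ./api
    volumes:
      - ./:/sandbox                                # project scripts and state
      - /var/run/docker.sock:/var/run/docker.sock  # lets the API drive Docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;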




&lt;h2&gt;
  
  
  Step 9: The Makefile
&lt;/h2&gt;

&lt;p&gt;Every action has a make target. The two most important ones are &lt;code&gt;up&lt;/code&gt; and &lt;code&gt;down&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;make up&lt;/code&gt; starts Nginx and the API via Docker Compose, then starts the cleanup daemon and health poller as background processes with &lt;code&gt;nohup&lt;/code&gt;, saving their PIDs to files. One subtlety: make runs each recipe line in a separate shell, so &lt;code&gt;echo $$!&lt;/code&gt; only captures the background PID if it sits on the same line as the &lt;code&gt;nohup&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;up&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash platform/cleanup_daemon.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup.log 2&amp;gt;&amp;amp;1 &amp;amp; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup_daemon.pid
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash monitor/health_poller.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/poller.log 2&amp;gt;&amp;amp;1 &amp;amp; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/health_poller.pid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;make down&lt;/code&gt; reads those PID files and kills the processes cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;down&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="p"&gt;$$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;logs/cleanup_daemon.pid&lt;span class="p"&gt;)&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Makefile syntax has one rule that catches everyone: recipe lines must be indented with tabs, not spaces. If you use spaces, make throws a cryptic &lt;code&gt;missing separator&lt;/code&gt; error; the "separator" it wants is the tab that must begin each recipe line.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Docker permission denied on a fresh VM&lt;/strong&gt; — The ubuntu user is not in the docker group by default. Fix: &lt;code&gt;sudo usermod -aG docker $USER&lt;/code&gt; followed by &lt;code&gt;newgrp docker&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx crashing on startup&lt;/strong&gt; — I left a sample &lt;code&gt;example.conf&lt;/code&gt; file in &lt;code&gt;nginx/conf.d/&lt;/code&gt; as a reference. Nginx tried to resolve the upstream hostname &lt;code&gt;example:5000&lt;/code&gt; on startup, failed, and crashed. The fix was obvious in hindsight: delete the sample file before starting Nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk full during Docker build&lt;/strong&gt; — &lt;code&gt;docker system prune -af&lt;/code&gt; recovered the space. The build cache had accumulated several GB from previous builds and test runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;demo-app:latest&lt;/code&gt; image lost after prune&lt;/strong&gt; — Docker prune removes all images not referenced by a running container. After cleaning disk space the demo app image was gone. Always rebuild the demo app image after a prune: &lt;code&gt;docker build -t demo-app:latest ./demo-app&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health log latency showing 14 million milliseconds&lt;/strong&gt; — Caused by &lt;code&gt;date +%s%N&lt;/code&gt; not being supported. Fixed by switching to curl's &lt;code&gt;%{time_total}&lt;/code&gt; timing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What we built&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated Docker network per environment&lt;/td&gt;
&lt;td&gt;Complete isolation — environments cannot interfere with each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomic state file writes&lt;/td&gt;
&lt;td&gt;Prevents corruption when daemon and scripts write concurrently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nginx config as code&lt;/td&gt;
&lt;td&gt;Dynamic routing without touching the main config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log shipper PID tracking&lt;/td&gt;
&lt;td&gt;Prevents zombie processes on destroy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guard in simulation script&lt;/td&gt;
&lt;td&gt;Prevents accidental destruction of platform infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health-based degraded detection&lt;/td&gt;
&lt;td&gt;Automated observability without external tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API over raw scripts&lt;/td&gt;
&lt;td&gt;Makes the platform programmable and integratable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hardest part of this task was not any single script. It was understanding the correct order of operations. Create the container before writing the Nginx config. Kill the log shipper before removing the container. Disconnect the network before removing it. Write state files atomically. These ordering constraints are not obvious until something breaks, and when they break they break in confusing ways.&lt;/p&gt;

&lt;p&gt;That is the difference between infrastructure that works in a demo and infrastructure that works at 3am when something goes wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 5 complete. Find me on Dev.to | &lt;a href="https://github.com/asanteedith/devops-sandbox" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>bash</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Containerizing a Broken Microservices App and Shipping It with a Full CI/CD Pipeline</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 01:03:21 +0000</pubDate>
      <link>https://forem.com/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</link>
      <guid>https://forem.com/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. In Stage 1 I deployed a personal API behind Nginx on a live server. Stage 2 is where things got serious.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;We were handed a broken codebase and told to make it production-ready. No hints about what was wrong. No list of bugs. Just the code and the instruction: &lt;em&gt;"Finding them is part of the task."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The application was a distributed job processing system made up of four services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;frontend&lt;/strong&gt; (Node.js/Express) where users submit and track jobs&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;API&lt;/strong&gt; (Python/FastAPI) that creates jobs and serves status updates&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;worker&lt;/strong&gt; (Python) that picks up and processes jobs from a queue&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Redis&lt;/strong&gt; instance shared between the API and worker as a message broker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My job was to find every bug, fix every misconfiguration, containerize the three custom services with production-quality Dockerfiles (Redis runs from the official image), wire everything together with Docker Compose, and build a full CI/CD pipeline that runs lint, tests, security scanning, integration tests, and rolling deployment, all in strict order.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the Code Before Touching Anything
&lt;/h2&gt;

&lt;p&gt;The first thing I did was read every file carefully before writing a single line of infrastructure. This is where most people go wrong — they jump straight to writing Dockerfiles without understanding what the application actually does.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Redis hostname problem
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;api/main.py&lt;/code&gt; and &lt;code&gt;frontend/app.js&lt;/code&gt; had hardcoded &lt;code&gt;localhost&lt;/code&gt; as the Redis and API hostname respectively. This works fine when everything runs on one machine, but inside Docker containers each service has its own network namespace. &lt;code&gt;localhost&lt;/code&gt; inside the API container points to the API container itself, not Redis.&lt;/p&gt;

&lt;p&gt;The fix was straightforward — use environment variables and Docker's built-in DNS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker Compose automatically creates DNS entries for each service using the service name. So &lt;code&gt;redis&lt;/code&gt; resolves to the Redis container's IP address inside the network.&lt;/p&gt;
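&lt;p&gt;In other words, the service name in the Compose file &lt;em&gt;is&lt;/em&gt; the hostname. A minimal excerpt of what that wiring looks like — the image tag and service layout here are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  api:
    build: ./api
    environment:
      REDIS_HOST: redis    # resolves via Compose's built-in DNS
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;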

&lt;h3&gt;
  
  
  The silent queue mismatch
&lt;/h3&gt;

&lt;p&gt;This one was subtle. The API was pushing job IDs to a Redis list called &lt;code&gt;job_queue&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the worker was polling a completely different list called &lt;code&gt;job&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blpop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every job submitted through the API went into &lt;code&gt;job_queue&lt;/code&gt;. The worker was watching &lt;code&gt;job&lt;/code&gt;. Jobs piled up forever in &lt;code&gt;pending&lt;/code&gt; state and nobody ever processed them. The fix was one word — change &lt;code&gt;job&lt;/code&gt; to &lt;code&gt;job_queue&lt;/code&gt; in the worker.&lt;/p&gt;
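&lt;p&gt;The failure mode is easy to reproduce without Redis at all. A toy stand-in — plain Python lists in place of Redis lists, with made-up helper names — shows why nothing is ever consumed when the key names differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;queues = {}

def lpush(key, value):
    # like Redis LPUSH: newest item goes to the head of the list
    queues.setdefault(key, []).insert(0, value)

def lpop_tail(key):
    # like popping from the tail: oldest item first (FIFO)
    items = queues.get(key, [])
    return items.pop() if items else None

lpush("job_queue", "job-1")              # what the API did
assert lpop_tail("job") is None          # what the worker polled: nothing, ever
assert lpop_tail("job_queue") == "job-1" # same key: the job is consumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;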

&lt;h3&gt;
  
  
  The Python magic variable typo
&lt;/h3&gt;

&lt;p&gt;The worker file ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;process_redis_jobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note &lt;code&gt;name&lt;/code&gt; instead of &lt;code&gt;__name__&lt;/code&gt;. This means the main function never ran. The container started, did nothing, and sat there silently. Changed to &lt;code&gt;if __name__ == "__main__":&lt;/code&gt; and the worker came to life.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing CORS headers
&lt;/h3&gt;

&lt;p&gt;The frontend was making HTTP requests to the API from a browser. Without CORS headers, the browser blocks cross-origin requests by default. Added &lt;code&gt;CORSMiddleware&lt;/code&gt; to the FastAPI app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.middleware.cors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Redis byte strings
&lt;/h3&gt;

&lt;p&gt;The Redis client was returning raw bytes instead of strings, so &lt;code&gt;job_id&lt;/code&gt; would come back as &lt;code&gt;b'abc-123'&lt;/code&gt; instead of &lt;code&gt;abc-123&lt;/code&gt;. Added &lt;code&gt;decode_responses=True&lt;/code&gt; to the Redis connection to get UTF-8 strings automatically.&lt;/p&gt;
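&lt;p&gt;The difference is easy to see in isolation. The connection line in the final comment is an assumption about a typical compose setup, not the project's exact code:&lt;/p&gt;

```python
# Without decode_responses=True, redis-py hands back raw bytes:
raw = b"abc-123"

# ...so every caller has to decode by hand:
job_id = raw.decode("utf-8")

# With decode_responses=True set once on the connection, the client decodes
# for you. Host and port here are assumptions for a typical compose setup:
# r = redis.Redis(host="redis", port=6379, decode_responses=True)
```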




&lt;h2&gt;
  
  
  Writing Production Dockerfiles
&lt;/h2&gt;

&lt;p&gt;Once I understood the application, I wrote Dockerfiles for all three services. The two rules I followed strictly: multi-stage builds and non-root users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-stage builds
&lt;/h3&gt;

&lt;p&gt;A naive Dockerfile copies all your source code and runs &lt;code&gt;pip install&lt;/code&gt;. The resulting image contains your build tools, pip cache, compiler output — everything the build needed but the runtime doesn't. Multi-stage builds fix this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stage 1: install dependencies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Stage 2: copy only what's needed to run&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /root/.local /home/edith/.local&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final image only contains the installed packages and source code. Build tools never make it in. Image size reduced by over 70%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-root users
&lt;/h3&gt;

&lt;p&gt;Every image creates a dedicated user called &lt;code&gt;edith&lt;/code&gt; and runs as that user instead of root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; edith
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; edith:edith /home/edith /app
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; edith&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If someone finds a vulnerability in your application and gets code execution, they get a restricted user with no special privileges — not root access to the container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health checks
&lt;/h3&gt;

&lt;p&gt;Every Dockerfile includes a &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction so Docker knows whether the service is actually working, not just running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# API&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD curl -f http://127.0.0.1:8000/health || exit 1

&lt;span class="c"&gt;# Worker — no HTTP port, so use a filesystem heartbeat&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD test -f /tmp/worker_healthy || exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker writes a timestamp to &lt;code&gt;/tmp/worker_healthy&lt;/code&gt; on every loop. The health check verifies that file exists. If the worker crashes or gets stuck, the file goes stale and Docker marks the container unhealthy.&lt;/p&gt;
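&lt;p&gt;The worker side of that heartbeat can be as small as this sketch. The path matches the &lt;code&gt;HEALTHCHECK&lt;/code&gt; above, and &lt;code&gt;beat&lt;/code&gt; is a hypothetical helper called once per loop iteration:&lt;/p&gt;

```python
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/worker_healthy")

def beat():
    # Record the current time. The Docker HEALTHCHECK only tests that the
    # file exists, but writing a timestamp also lets you detect staleness.
    HEARTBEAT.write_text(str(time.time()))
```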




&lt;h2&gt;
  
  
  Docker Compose Orchestration
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;docker-compose.yml&lt;/code&gt; file ties everything together. The key decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startup order with health checks.&lt;/strong&gt; Using &lt;code&gt;depends_on&lt;/code&gt; with just a service name only waits for the container to start, not for the application inside to be ready. Using &lt;code&gt;condition: service_healthy&lt;/code&gt; waits for the health check to pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminated the race condition where the API would crash on startup because Redis wasn't ready yet.&lt;/p&gt;
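&lt;p&gt;Even with &lt;code&gt;service_healthy&lt;/code&gt;, a small retry loop inside the application is a common belt-and-braces complement, because Redis can also restart after the API is already up. A hedged sketch, not the project's actual code:&lt;/p&gt;

```python
import time

def connect_with_retry(connect, attempts=5, delay=1.0):
    # Call a zero-argument connection factory, retrying a few times
    # before giving up and re-raising the last error.
    last_err = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as err:
            last_err = err
            time.sleep(delay)
    raise last_err
```

&lt;p&gt;Pass it any connection factory, for example &lt;code&gt;connect_with_retry(lambda: redis.Redis(host="redis"))&lt;/code&gt;.&lt;/p&gt;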

&lt;p&gt;&lt;strong&gt;Redis not exposed on the host.&lt;/strong&gt; Redis uses &lt;code&gt;expose&lt;/code&gt; instead of &lt;code&gt;ports&lt;/code&gt;. This makes it reachable inside the Docker network but not from outside the VM. No reason to expose a database to the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource limits on every service.&lt;/strong&gt; Without limits, one misbehaving service can starve the entire host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.50'&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Named internal network.&lt;/strong&gt; All services communicate over &lt;code&gt;hng_network&lt;/code&gt; — an isolated bridge network managed by Docker Compose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The task specified 6 stages in strict order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lint → test → build → security scan → integration test → deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failure in any stage must prevent all subsequent stages from running. GitHub Actions handles this with &lt;code&gt;needs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lint&lt;/span&gt;
&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lint stage
&lt;/h3&gt;

&lt;p&gt;Three linters run in sequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;flake8&lt;/code&gt; for Python — catches style violations, unused imports, undefined names&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eslint&lt;/code&gt; for JavaScript — catches syntax errors and bad patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hadolint&lt;/code&gt; for Dockerfiles — catches common Dockerfile mistakes like missing &lt;code&gt;--no-install-recommends&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting Python files to pass flake8 was the most tedious part. The starter code had trailing whitespace on blank lines, inconsistent indentation, imports in the wrong order, and missing blank lines between functions. Every line had to be cleaned up manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test stage
&lt;/h3&gt;

&lt;p&gt;Three unit tests with pytest and coverage reporting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_redis_connection_mocked&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_health_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_math_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coverage report uploaded as a pipeline artifact so you can see exactly which lines are tested.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build stage
&lt;/h3&gt;

&lt;p&gt;This stage runs a local Docker registry as a GitHub Actions service container, builds all three images, tags each with the git SHA and &lt;code&gt;latest&lt;/code&gt;, and pushes them to the local registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry:2&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5000:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:latest ./api
docker push localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt;
docker push localhost:5000/hng-api:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tagging with the git SHA means every image is traceable back to the exact commit that built it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security scan stage
&lt;/h3&gt;

&lt;p&gt;Trivy scans all three images for known vulnerabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hng-api:latest'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sarif'&lt;/span&gt;
    &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trivy-api.sarif'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results are uploaded as SARIF artifacts, which GitHub renders in the Security tab. I set &lt;code&gt;exit-code: '0'&lt;/code&gt; so the pipeline continues even if vulnerabilities are found; they are still reported and visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration test stage
&lt;/h3&gt;

&lt;p&gt;This is the most valuable stage. It starts the complete stack inside the GitHub Actions runner, submits a real job, and polls until it completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Submit a job&lt;/span&gt;
&lt;span class="nv"&gt;JOB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/jobs &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;JOB_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$JOB&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin)['job_id'])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Poll until completed&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 20&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8000/jobs/&lt;span class="nv"&gt;$JOB_ID&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin).get('status',''))"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STATUS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"completed"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the job doesn't complete within 100 seconds, the pipeline fails. The stack tears down cleanly regardless of the outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy stage
&lt;/h3&gt;

&lt;p&gt;The deploy stage only runs on pushes to &lt;code&gt;main&lt;/code&gt;. It SSHs into the production VM and performs a rolling update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy the API first&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; api

&lt;span class="c"&gt;# Wait up to 60 seconds for the health check to pass&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 12&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if &lt;/span&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; api python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="c"&gt;# Health check passed — deploy the rest&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; worker frontend
    &lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Health check failed — abort, leave old container running&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old container keeps serving traffic until the new one passes its health check. If the new version is broken, nothing goes down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YAML duplicate jobs.&lt;/strong&gt; I accidentally appended the &lt;code&gt;integration-test&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; stages to the ci.yml file twice using &lt;code&gt;cat &amp;gt;&amp;gt;&lt;/code&gt;. GitHub rejected the workflow because job names were duplicated. Fixed by rewriting the entire file from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinned apt package version not found.&lt;/strong&gt; Hadolint flagged &lt;code&gt;apt-get install curl&lt;/code&gt; without a pinned version (DL3008). I tried to pin it as &lt;code&gt;curl=7.88.1-10+deb12u5&lt;/code&gt; but that exact version didn't exist in the GitHub Actions runner's package index, breaking the Docker build. Fixed by ignoring DL3008 with &lt;code&gt;hadolint --ignore DL3008&lt;/code&gt; — a pragmatic tradeoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows CRLF line endings.&lt;/strong&gt; Editing files on Windows and pushing to a Linux CI environment caused flake8 to report phantom whitespace errors. Every blank line showed as &lt;code&gt;W293 blank line contains whitespace&lt;/code&gt; because of the carriage return character. Fixed by setting &lt;code&gt;git config core.autocrlf false&lt;/code&gt; and converting the affected files to LF line endings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token scope too narrow.&lt;/strong&gt; Pushing changes to the workflow file required a GitHub token with the &lt;code&gt;workflow&lt;/code&gt; scope, not just &lt;code&gt;repo&lt;/code&gt;. Generated a new token with both scopes to resolve the 403 error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH key missing on VM.&lt;/strong&gt; The deploy stage needed to SSH into the production server but no SSH key existed on the VM. Generated one with &lt;code&gt;ssh-keygen -t ed25519&lt;/code&gt;, added the public key to &lt;code&gt;authorized_keys&lt;/code&gt;, and stored the private key as a GitHub Actions secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Pipeline
&lt;/h2&gt;

&lt;p&gt;After all of that, the pipeline looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ lint          — 16s
✅ test          — 12s
✅ build         — 1m 4s
✅ security      — 46s
✅ integration-test — 1m 33s
✅ deploy        — 8s

Status: Success — Total duration: 2m 37s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 6 stages green. Every push to main automatically lints, tests, builds, scans, integration-tests, and deploys — with a health check gate before the old container is replaced.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The most important lesson from Stage 2 is that reading code before writing infrastructure is not optional. Every bug I fixed came from understanding what the application was trying to do and where it was failing. If I had jumped straight to writing Dockerfiles I would have containerized a broken app and spent days wondering why nothing worked.&lt;/p&gt;

&lt;p&gt;The second lesson is that CI/CD is not just automation — it is documentation. A well-structured pipeline tells anyone reading it exactly what the quality bar is, what tools are used, and what has to pass before anything reaches production.&lt;/p&gt;

&lt;p&gt;The third lesson is that container security is not complicated but it is easy to skip. Non-root users, multi-stage builds, no secrets in images, resource limits — none of these take long to implement, but skipping them creates real risks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 2 complete. Find the repo at &lt;a href="https://github.com/asanteedith/Containerized_MicroService" rel="noopener noreferrer"&gt;github.com/asanteedith/Containerized_MicroService&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>cicd</category>
      <category>beginners</category>
    </item>
    <item>
<title>I Built a Deployment CLI That Says No — And Here's the Policy Engine Behind It</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Wed, 06 May 2026 19:59:23 +0000</pubDate>
      <link>https://forem.com/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</link>
      <guid>https://forem.com/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</guid>
      <description>&lt;p&gt;&lt;em&gt;Most deployment tools ask you to configure infrastructure manually. This one writes it for you — and refuses to deploy if it is not safe.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem I Set Out to Solve
&lt;/h2&gt;

&lt;p&gt;Every time I deployed a new service I found myself doing the same things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing a Docker Compose file&lt;/li&gt;
&lt;li&gt;Writing an Nginx config&lt;/li&gt;
&lt;li&gt;Hoping both were consistent with each other&lt;/li&gt;
&lt;li&gt;Manually checking if the server had enough resources&lt;/li&gt;
&lt;li&gt;Deploying and hoping for the best&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There had to be a better way. What if a single file described everything — and a tool generated all the configs, checked all the policies, and deployed the stack automatically?&lt;/p&gt;

&lt;p&gt;That is what SwiftDeploy does.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is SwiftDeploy?
&lt;/h2&gt;

&lt;p&gt;SwiftDeploy is a CLI tool built in Python that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads a single &lt;code&gt;manifest.yaml&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Generates &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; from templates&lt;/li&gt;
&lt;li&gt;Asks OPA (Open Policy Agent) if it is safe to deploy&lt;/li&gt;
&lt;li&gt;Brings up the stack and waits for health checks&lt;/li&gt;
&lt;li&gt;Lets you promote between stable and canary modes — but only if the canary is healthy&lt;/li&gt;
&lt;li&gt;Records every decision in an audit trail&lt;/li&gt;
&lt;li&gt;Shows you a live dashboard of what is happening&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The manifest is the only file you ever edit. Everything else is generated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1 — The Design: A Tool That Writes Its Own Infrastructure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Manifest
&lt;/h3&gt;

&lt;p&gt;Here is what &lt;code&gt;manifest.yaml&lt;/code&gt; looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swift-deploy-1-node:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire configuration. One file. Everything else is derived from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Templates
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;init&lt;/code&gt; command reads the manifest and fills in template files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_manifest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;templates/docker-compose.yml.tpl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;compose_tpl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;compose_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compose_tpl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ app_image }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;compose_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compose_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ mode }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker-compose.yml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compose_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you delete your generated configs and run &lt;code&gt;init&lt;/code&gt; again, you get exactly the same stack back. No guessing. No inconsistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;In most projects configs drift over time. Someone edits &lt;code&gt;docker-compose.yml&lt;/code&gt; directly. Someone else edits &lt;code&gt;nginx.conf&lt;/code&gt;. After six months nobody knows what the source of truth is.&lt;/p&gt;

&lt;p&gt;With SwiftDeploy the source of truth is always &lt;code&gt;manifest.yaml&lt;/code&gt;. If it is not in the manifest it does not exist.&lt;/p&gt;
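&lt;p&gt;For reference, a minimal &lt;code&gt;manifest.yaml&lt;/code&gt; consistent with the template code above might look like this (the field names are inferred from the substitution code, not copied from the project, so your actual schema may differ):&lt;/p&gt;

```yaml
services:
  image: swiftdeploy-app:1.0   # substituted for {{ app_image }}
  mode: stable                 # substituted for {{ mode }}; defaults to "stable"
```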




&lt;h2&gt;
  
  
  Part 2 — The Guardrails: Policy Enforcement with OPA
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why OPA?
&lt;/h3&gt;

&lt;p&gt;I could have written the policy checks directly in Python. But the task required something more important — &lt;strong&gt;separation of concerns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The CLI should not decide what is safe. That decision should live in a separate system that can be updated independently. That system is OPA — Open Policy Agent.&lt;/p&gt;

&lt;p&gt;OPA runs as a separate container. The CLI sends data to OPA and OPA sends back a decision. The CLI just follows orders.&lt;/p&gt;
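&lt;p&gt;The query itself is a plain HTTP POST against OPA's Data API. A minimal sketch of that round trip (the port is OPA's default and the helper names are illustrative, not the actual SwiftDeploy code):&lt;/p&gt;

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # OPA's default port; adjust to your setup


def ask_opa(package, payload):
    """POST input data to OPA's Data API and return the decision document."""
    req = urllib.request.Request(
        f"{OPA_URL}/v1/data/{package}",
        data=json.dumps({"input": payload}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("result", {})


def decision(result):
    """Normalise OPA's answer into an (allow, reason) pair."""
    return result.get("allow", False), result.get("reason", "")
```

&lt;p&gt;A missing &lt;code&gt;allow&lt;/code&gt; key is treated as a denial: the CLI fails closed, not open.&lt;/p&gt;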

&lt;h3&gt;
  
  
  Infrastructure Policy
&lt;/h3&gt;

&lt;p&gt;Before deploying, the CLI collects host statistics and sends them to OPA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_host_stats&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;disk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disk_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;disk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cpu_load&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disk_free_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OPA evaluates the infrastructure policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;infra&lt;/span&gt;

&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu_load&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"Disk space too low"&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If free disk space is below 10GB or CPU load is above 2.0, the deployment is blocked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Running pre-deploy policy check...
   Disk free: 5.2GB | CPU: 0.3 | Memory: 45%
Infrastructure policy: BLOCKED
   Reason: Disk space too low
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary Safety Policy
&lt;/h3&gt;

&lt;p&gt;Before promoting to canary mode, the CLI scrapes the &lt;code&gt;/metrics&lt;/code&gt; endpoint and calculates the error rate and P99 latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc_error_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_requests_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_code=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
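&lt;p&gt;The P99 side is analogous. One simple way to get it from raw latency samples, shown here as a nearest-rank sketch (SwiftDeploy may instead read a histogram metric):&lt;/p&gt;

```python
def calc_p99(samples):
    """Return the 99th-percentile latency from a list of samples in ms."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    # Nearest-rank index: the value below which ~99% of samples fall.
    idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return round(float(ordered[idx]), 2)
```

&lt;p&gt;Nearest-rank is crude but cheap; a production system would more likely derive P99 from a Prometheus histogram.&lt;/p&gt;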



&lt;p&gt;OPA evaluates the canary safety policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;canary&lt;/span&gt;

&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p99_latency_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"P99 latency too high (must be &amp;lt;= 500ms)"&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p99_latency_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the canary is unhealthy the promotion is blocked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Running pre-promote policy check...
   Error rate: 0.0% | P99 latency: 2100.0ms
Canary safety policy: BLOCKED
   Reason: P99 latency too high (must be &amp;lt;= 500ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Isolation Matters
&lt;/h3&gt;

&lt;p&gt;OPA runs as a separate container and is only reachable by the CLI — not through Nginx. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external actor can query or manipulate policy decisions&lt;/li&gt;
&lt;li&gt;Policies can be updated without touching the CLI code&lt;/li&gt;
&lt;li&gt;Each domain (infrastructure, canary) owns exactly one question&lt;/li&gt;
&lt;/ul&gt;
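&lt;p&gt;In docker-compose terms, the isolation is simply the absence of a published port and of a reverse-proxy route. A hypothetical fragment (service layout illustrative, not the generated file):&lt;/p&gt;

```yaml
services:
  opa:
    image: openpolicyagent/opa:latest
    command: ["run", "--server", "/policies"]
    volumes:
      - ./policies:/policies
    # No "ports:" mapping and no nginx upstream entry:
    # OPA is reachable only on the internal Docker network.
```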




&lt;h2&gt;
  
  
  Part 3 — The Chaos: What Happened When Things Broke
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Injecting Slow Chaos
&lt;/h3&gt;

&lt;p&gt;The API exposes a &lt;code&gt;/chaos&lt;/code&gt; endpoint that simulates degraded behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "slow", "duration": 2}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes every request sleep for 2 seconds before responding. The metrics immediately reflect the change — P99 latency spikes.&lt;/p&gt;
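&lt;p&gt;Under the hood, an endpoint like this is typically just shared state plus a sleep before each response. A minimal sketch, assuming a Flask-style before-request hook (the names here are illustrative, not the actual app code):&lt;/p&gt;

```python
import time

# Shared chaos state, toggled by POST /chaos.
chaos = {"mode": None, "duration": 0}


def set_chaos(mode, duration=0):
    """Handler body for POST /chaos: 'slow' injects latency, 'recover' clears it."""
    if mode == "recover":
        chaos.update(mode=None, duration=0)
    else:
        chaos.update(mode=mode, duration=duration)


def before_request():
    """Called before every request; sleeps while slow chaos is active."""
    if chaos["mode"] == "slow":
        time.sleep(chaos["duration"])
```

&lt;p&gt;&lt;code&gt;recover&lt;/code&gt; simply clears the state, which is why latency returns to normal as soon as in-flight requests drain.&lt;/p&gt;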

&lt;h3&gt;
  
  
  The Status View Catches It
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;swiftdeploy status&lt;/code&gt; shows the live state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Scrape @ Fri May 15 12:38:05 2026 ---
  Mode:        canary
  Uptime:      115s
  Error rate:  0.0%
  P99 latency: 2100.0ms
  Chaos:       active

  Policy Compliance:
    Infrastructure: PASS
    Canary safety:  FAIL - P99 latency too high
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Promotion Is Blocked
&lt;/h3&gt;

&lt;p&gt;When I tried to promote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Running pre-promote policy check...
   Error rate: 0.0% | P99 latency: 2100.0ms
Canary safety policy: BLOCKED
   Reason: P99 latency too high (must be &amp;lt;= 500ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system worked exactly as designed. The broken canary could not be promoted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "recover"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Latency dropped back to normal and the next promote attempt passed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4 — The Audit Trail
&lt;/h2&gt;

&lt;p&gt;Every action is recorded in &lt;code&gt;history.jsonl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1778794519.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pre_promote_check"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"P99 latency too high"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1778799306.5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"promote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1778799535.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
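&lt;p&gt;Appending to a JSONL audit log is one line per event: build the entry, serialise it, write it with a newline. A sketch of the recording side (function names illustrative):&lt;/p&gt;

```python
import json
import time


def make_entry(event, **fields):
    """Build one structured audit event with a unix timestamp."""
    return {"event": event, **fields, "timestamp": round(time.time(), 1)}


def record_event(path, event, **fields):
    """Append one event to the JSONL history file, one JSON object per line."""
    entry = make_entry(event, **fields)
    with open(path, "a") as f:
        f.write(json.dumps(entry))
        f.write("\n")
    return entry
```

&lt;p&gt;Append-only JSONL means a crash can at worst truncate the final line; every earlier event survives intact.&lt;/p&gt;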



&lt;p&gt;Running &lt;code&gt;swiftdeploy audit&lt;/code&gt; generates &lt;code&gt;audit_report.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Timeline&lt;/span&gt;

| Time | Event | Details |
|---|---|---|
| Fri May 15 12:36:48 | deploy | status=success |
| Fri May 15 12:40:17 | pre_promote_check | BLOCKED reason=P99 latency too high |
| Fri May 15 12:44:50 | promote | mode=canary status=success |

&lt;span class="gu"&gt;## Policy Violations&lt;/span&gt;

| Time | Check | Reason |
|---|---|---|
| Fri May 15 12:40:17 | pre_promote_check | P99 latency too high |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can always answer the question "what happened and when" with a single command.&lt;/p&gt;
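&lt;p&gt;Generating the Markdown report is then a single pass over the parsed events. A sketch of the timeline table, with the column layout taken from the report above (helper name illustrative):&lt;/p&gt;

```python
import time


def timeline_table(events):
    """Render history events as a GitHub-flavored Markdown timeline table."""
    rows = ["| Time | Event | Details |", "|---|---|---|"]
    for e in events:
        when = time.ctime(e["timestamp"])
        details = " ".join(
            f"{k}={v}" for k, v in e.items() if k not in ("event", "timestamp")
        )
        rows.append(f"| {when} | {e['event']} | {details} |")
    return "\n".join(rows)
```

&lt;p&gt;The policy-violations table is the same idea: filter the events down to blocked checks before rendering.&lt;/p&gt;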




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Declarative infrastructure is worth the investment&lt;/strong&gt;&lt;br&gt;
Writing templates takes time upfront, but it saves far more time later. When something breaks, you regenerate from the manifest and know the configs are correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Policies should be external&lt;/strong&gt;&lt;br&gt;
Keeping policy logic in OPA means you can update thresholds without touching the CLI code. This is how real production systems work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Metrics drive decisions — not just monitoring&lt;/strong&gt;&lt;br&gt;
I used to think metrics were for dashboards. Now I use them to gate deployments. If the canary is unhealthy, the metrics prove it and the policy enforces the consequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Audit trails matter more than you think&lt;/strong&gt;&lt;br&gt;
During debugging I could look at &lt;code&gt;history.jsonl&lt;/code&gt; and see exactly what happened and in what order. Without it I would have been guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The CLI is just an orchestrator&lt;/strong&gt;&lt;br&gt;
SwiftDeploy does not make decisions. It collects data, asks OPA, and follows the answer. This separation makes the system trustworthy and testable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Result
&lt;/h2&gt;

&lt;p&gt;A complete declarative deployment system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates infrastructure from a single manifest&lt;/li&gt;
&lt;li&gt;Validates pre-flight conditions before deploying&lt;/li&gt;
&lt;li&gt;Enforces infrastructure and canary safety policies via OPA&lt;/li&gt;
&lt;li&gt;Tracks metrics in Prometheus format&lt;/li&gt;
&lt;li&gt;Shows a live dashboard of system state and policy compliance&lt;/li&gt;
&lt;li&gt;Records every decision in a structured audit trail&lt;/li&gt;
&lt;li&gt;Generates a clean audit report in GitHub-flavored Markdown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full source code: &lt;strong&gt;&lt;a href="https://github.com/asanteedith/swiftdeploy-project" rel="noopener noreferrer"&gt;https://github.com/asanteedith/swiftdeploy-project&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Edith Asante — Cloud &amp;amp; DevOps Engineer&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>python</category>
      <category>security</category>
    </item>
  </channel>
</rss>
