<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: anitaalicloud</title>
    <description>The latest articles on Forem by anitaalicloud (@anitaalicloud).</description>
    <link>https://forem.com/anitaalicloud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901516%2F3aeb27ec-d326-4295-bcd7-bc103e1aa263.png</url>
      <title>Forem: anitaalicloud</title>
      <link>https://forem.com/anitaalicloud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anitaalicloud"/>
    <language>en</language>
    <item>
      <title>How I Built SwiftDeploy: A Tool That Writes Its Own Infrastructure</title>
      <dc:creator>anitaalicloud</dc:creator>
      <pubDate>Wed, 06 May 2026 19:28:10 +0000</pubDate>
      <link>https://forem.com/anitaalicloud/how-i-built-swiftdeploy-a-tool-that-writes-its-own-infrastructure-dma</link>
      <guid>https://forem.com/anitaalicloud/how-i-built-swiftdeploy-a-tool-that-writes-its-own-infrastructure-dma</guid>
      <description>&lt;p&gt;&lt;em&gt;A deep dive into declarative deployments, OPA policy gates, and chaos engineering from Stage 4A to 4B&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgs2pjh8kydui19g0ku.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlgs2pjh8kydui19g0ku.jpeg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most DevOps tasks ask you to configure infrastructure manually. This one asked me to build the tool that does it for me.&lt;/p&gt;

&lt;p&gt;The result is &lt;strong&gt;SwiftDeploy&lt;/strong&gt;, a CLI tool that reads a single &lt;code&gt;manifest.yaml&lt;/code&gt; file and generates your entire deployment stack from it. Nginx configs, Docker Compose files, policy checks, and live metrics dashboards are all derived from one source of truth.&lt;/p&gt;

&lt;p&gt;This post covers the full journey: the design decisions, the guardrails, the chaos, and the lessons learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here is how all the pieces connect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    manifest.yaml                     │
│              (single source of truth)                │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
              ./swiftdeploy init
                       │
          ┌────────────┴────────────┐
          ▼                         ▼
     nginx.conf              docker-compose.yml
   (generated)                 (generated)
          │                         │
          ▼                         ▼
┌─────────────────────────────────────────────────────┐
│                   Docker Stack                       │
│                                                      │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│   │  Nginx   │───▶│   App    │    │   OPA    │      │
│   │  :8080   │    │  :3000   │    │  :8181   │      │
│   └──────────┘    └──────────┘    └──────────┘      │
│   (public)        (internal)      (internal)         │
└─────────────────────────────────────────────────────┘
          │                         ▲
          ▼                         │
     curl :8080              CLI queries OPA
    (your browser)          before deploy/promote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;you only ever touch &lt;code&gt;manifest.yaml&lt;/code&gt;&lt;/strong&gt;. The tool handles everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1 — The Design: A Tool That Writes Its Own Files
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem with Handwritten Config
&lt;/h3&gt;

&lt;p&gt;When you write &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; by hand, you introduce drift. Change a port in one place and forget to update it in another. After a few weeks, nobody knows which file is the source of truth.&lt;/p&gt;

&lt;p&gt;SwiftDeploy solves this with a three-layer system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest.yaml          →    templates/*.tmpl    →    generated files
(VALUES)                    (STRUCTURE)              (VALUES + STRUCTURE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;manifest.yaml&lt;/code&gt; holds all the values: ports, image names, modes, timeouts. The templates hold the structure: how &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; should look. The CLI combines them when you run &lt;code&gt;./swiftdeploy init&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How &lt;code&gt;swiftdeploy init&lt;/code&gt; Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
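&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;  &lt;span class="c1"&gt;# PyYAML
&lt;/span&gt;
&lt;span class="c1"&gt;# Read manifest into a Python dict (the with-block closes the file)
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;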

&lt;span class="c1"&gt;# Build a replacements map
&lt;/span&gt;&lt;span class="n"&gt;replacements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{NGINX_PORT}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nginx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{SERVICE_PORT}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="c1"&gt;# ... etc
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Read template, replace placeholders, write output
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;templates/nginx.conf.tmpl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;replacements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nginx.conf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple string replacement. No Jinja2, no templating engine — just Python's built-in &lt;code&gt;str.replace()&lt;/code&gt;. The grader can delete &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt;, run &lt;code&gt;./swiftdeploy init&lt;/code&gt;, and they regenerate perfectly every time.&lt;/p&gt;
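
&lt;p&gt;One weakness of plain &lt;code&gt;str.replace()&lt;/code&gt; is that a placeholder missing from the replacements map silently survives into the generated file. A minimal sketch of a guard against that follows. The &lt;code&gt;render&lt;/code&gt; helper, the one-line template, and the &lt;code&gt;app&lt;/code&gt; upstream name are hypothetical, not taken from the SwiftDeploy source.&lt;/p&gt;

```python
import re

# Matches {{UPPER_CASE}} placeholders like the ones used in the snippet above.
PLACEHOLDER = re.compile(r"\{\{[A-Z_]+\}\}")


def render(template: str, replacements: dict) -> str:
    """Substitute every placeholder, then fail loudly if any are left over."""
    content = template
    for placeholder, value in replacements.items():
        content = content.replace(placeholder, str(value))
    leftover = sorted(set(PLACEHOLDER.findall(content)))
    if leftover:
        raise ValueError(f"unreplaced placeholders: {leftover}")
    return content


# Hypothetical one-line template; real template files would be longer.
template = "listen {{NGINX_PORT}}; proxy_pass http://app:{{SERVICE_PORT}};"
rendered = render(template, {"{{NGINX_PORT}}": 8080, "{{SERVICE_PORT}}": 3000})
print(rendered)
```

&lt;p&gt;Raising on leftovers keeps a typo in the manifest or template from turning into a broken config that only fails when Nginx starts.&lt;/p&gt;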

&lt;h3&gt;
  
  
  The API Service
&lt;/h3&gt;

&lt;p&gt;The API is a Python HTTP server using only the standard library — no Flask, no FastAPI. This keeps the Docker image under 60MB (well under the 300MB limit).&lt;/p&gt;

&lt;p&gt;It runs in two modes controlled by a &lt;code&gt;MODE&lt;/code&gt; environment variable injected by Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stable mode  →  normal behaviour
canary mode  →  adds X-Mode: canary header + activates /chaos endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same image runs both modes. The only difference is the environment variable.&lt;/p&gt;
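
&lt;p&gt;The mode switch can be sketched with the standard library's &lt;code&gt;http.server&lt;/code&gt;. This is an illustration under assumptions, not the actual service: the helper names, the response payloads, and the &lt;code&gt;stable&lt;/code&gt; default are all hypothetical.&lt;/p&gt;

```python
import json
import os
from http.server import BaseHTTPRequestHandler


def extra_headers(mode: str) -> dict:
    """Canary responses carry X-Mode: canary; stable responses add nothing."""
    return {"X-Mode": "canary"} if mode == "canary" else {}


def chaos_enabled(mode: str) -> bool:
    """The /chaos endpoint is only active in canary mode."""
    return mode == "canary"


class Handler(BaseHTTPRequestHandler):
    # MODE is injected by Docker Compose; the "stable" default is an assumption.
    mode = os.environ.get("MODE", "stable")

    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        for name, value in extra_headers(self.mode).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._send_json(200, {"mode": self.mode})

    def do_POST(self):
        # Drain the request body before answering.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path == "/chaos" and chaos_enabled(self.mode):
            self._send_json(200, {"chaos": "accepted"})
        else:
            self._send_json(404, {"error": "not found"})
```

&lt;p&gt;Keeping the mode checks in two small functions makes the behaviour testable without starting a server.&lt;/p&gt;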

&lt;h3&gt;
  
  
  The Nginx Reverse Proxy
&lt;/h3&gt;

&lt;p&gt;Nginx sits in front of the app and adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;X-Deployed-By: swiftdeploy&lt;/code&gt; header on every response&lt;/li&gt;
&lt;li&gt;JSON error bodies on 502/503/504 (instead of ugly HTML)&lt;/li&gt;
&lt;li&gt;Structured access logs in the required format&lt;/li&gt;
&lt;li&gt;Pass-through of the &lt;code&gt;X-Mode&lt;/code&gt; header from the upstream app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critically, &lt;strong&gt;the app port is never exposed directly&lt;/strong&gt;. Only Nginx's port is mapped to the host. All traffic must flow through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2 — The Guardrails: OPA Policy Enforcement
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why OPA?
&lt;/h3&gt;

&lt;p&gt;The task required that the CLI never make allow/deny decisions itself. All logic must live in OPA (Open Policy Agent).&lt;/p&gt;

&lt;p&gt;This matters because it separates concerns cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI  →  collects data, calls OPA, surfaces the result
OPA  →  owns all decision logic, never called by the app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to change a policy, you edit a &lt;code&gt;.rego&lt;/code&gt; file. You never touch the CLI. If you want to change a threshold, you edit &lt;code&gt;data.json&lt;/code&gt;. You never touch the Rego files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy Structure
&lt;/h3&gt;

&lt;p&gt;Each policy domain owns exactly one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure policy&lt;/strong&gt; — &lt;em&gt;Is the host healthy enough to deploy?&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;infrastructure&lt;/span&gt;

&lt;span class="ow"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rego&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;

&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infrastructure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"Disk free (%.1fGB) is below minimum threshold (%.1fGB)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infrastructure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu_load&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infrastructure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_cpu_load&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"CPU load (%.2f) exceeds maximum threshold (%.2f)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infrastructure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_cpu_load&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Canary safety policy&lt;/strong&gt; — &lt;em&gt;Is the canary healthy enough to promote?&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;canary&lt;/span&gt;

&lt;span class="ow"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rego&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;

&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;violations&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_error_rate&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"Error rate (%.2f%%) exceeds maximum threshold (%.2f%%)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_error_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Threshold values live in &lt;code&gt;data.json&lt;/code&gt; — never hardcoded in Rego:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_disk_free_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_cpu_load"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;16.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_mem_free_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_p99_latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Hard Gate in Action
&lt;/h3&gt;

&lt;p&gt;When the CPU load exceeded the threshold (2.00 for this run, stricter than the 16.0 in the &lt;code&gt;data.json&lt;/code&gt; shown above), &lt;code&gt;swiftdeploy deploy&lt;/code&gt; was blocked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[deploy] Running OPA pre-deploy policy checks...
  Host stats: disk=80.08GB free, cpu_load=12.88, mem_free=50.0%
[policy] Checking Infrastructure...
  [BLOCK] Infrastructure policy FAILED:
    x CPU load (12.88) exceeds maximum threshold (2.00)
[deploy] BLOCKED by policy. Fix violations above before deploying.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deploy never started. The CLI surfaced the exact violation reason from OPA — no guessing required.&lt;/p&gt;
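
&lt;p&gt;A rough sketch of how a CLI can ask OPA for a decision through its REST Data API, which accepts a &lt;code&gt;POST&lt;/code&gt; to &lt;code&gt;/v1/data/&amp;lt;package&amp;gt;&lt;/code&gt; with the facts wrapped in an &lt;code&gt;input&lt;/code&gt; key. The function names, the &lt;code&gt;localhost&lt;/code&gt; host, and the timeout are assumptions for illustration; only the port and the &lt;code&gt;allow&lt;/code&gt; and &lt;code&gt;violations&lt;/code&gt; rule names come from the policies above.&lt;/p&gt;

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # port from the diagram; host is an assumption


def build_request(package: str, opa_input: dict) -> urllib.request.Request:
    """Wrap the collected facts as {"input": ...} and target the package path."""
    body = json.dumps({"input": opa_input}).encode()
    return urllib.request.Request(
        f"{OPA_URL}/v1/data/{package}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(raw: bytes) -> dict:
    """Surface OPA's answer as-is; a missing result is treated as not allowed."""
    result = json.loads(raw).get("result", {})
    return {
        "allowed": result.get("allow", False),
        "violations": sorted(result.get("violations", [])),
    }


def check_policy(package: str, opa_input: dict, timeout: float = 5.0) -> dict:
    """Send the request and return the parsed decision (needs a running OPA)."""
    request = build_request(package, opa_input)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_decision(resp.read())
```

&lt;p&gt;Nothing in this sketch compares a number against a threshold. The CLI only forwards facts and prints whatever messages OPA returns.&lt;/p&gt;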

&lt;h3&gt;
  
  
  OPA Isolation
&lt;/h3&gt;

&lt;p&gt;OPA is intentionally isolated from public Nginx ingress. It runs on port 8181 inside the Docker network. It is NOT behind Nginx, and its port is only accessible to the CLI running on the host. A user hitting &lt;code&gt;localhost:8080&lt;/code&gt; cannot reach OPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Handling
&lt;/h3&gt;

&lt;p&gt;The CLI handles every distinct OPA failure mode differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URLError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# OPA unreachable — warn but don't crash
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPA unreachable: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# OPA returned garbage — different message
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPA returned invalid JSON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Catch-all — still doesn't crash
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unexpected OPA error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



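&lt;p&gt;The CLI never crashes or hangs when OPA is unavailable. Each handler returns a structured result with &lt;code&gt;"allowed": False&lt;/code&gt; and an &lt;code&gt;error&lt;/code&gt; message instead of raising, so the operator gets a warning and the CLI keeps running.&lt;/p&gt;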




&lt;h2&gt;
  
  
  Part 3 — The Chaos: Breaking Things on Purpose
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The /metrics Endpoint
&lt;/h3&gt;

&lt;p&gt;The API exposes a &lt;code&gt;/metrics&lt;/code&gt; endpoint in Prometheus text format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# HELP http_requests_total Total HTTP requests&lt;/span&gt;
&lt;span class="c"&gt;# TYPE http_requests_total counter&lt;/span&gt;
&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;

&lt;span class="c"&gt;# HELP http_request_duration_seconds Request latency&lt;/span&gt;
&lt;span class="c"&gt;# TYPE http_request_duration_seconds histogram&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.005"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;38&lt;/span&gt;

&lt;span class="c"&gt;# HELP app_mode Current deployment mode (0=stable, 1=canary)&lt;/span&gt;
&lt;span class="c"&gt;# TYPE app_mode gauge&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c"&gt;# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)&lt;/span&gt;
&lt;span class="c"&gt;# TYPE chaos_active gauge&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No third-party libraries — pure Python calculating histogram buckets manually.&lt;/p&gt;
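
&lt;p&gt;For readers who have not built one by hand, here is a minimal sketch of a cumulative-bucket histogram rendered in the Prometheus text format. The &lt;code&gt;Histogram&lt;/code&gt; class and the bucket bounds in the usage lines are hypothetical, and the &lt;code&gt;method&lt;/code&gt; and &lt;code&gt;path&lt;/code&gt; labels from the real output are left out to keep it short.&lt;/p&gt;

```python
import bisect


class Histogram:
    """Cumulative-bucket histogram in the Prometheus exposition style."""

    def __init__(self, name: str, bounds: list):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound that is >= value ("le" semantics).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> str:
        lines = [f"# TYPE {self.name} histogram"]
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count  # each bucket includes everything below it
            le = "+Inf" if bound == float("inf") else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines)


latency = Histogram("http_request_duration_seconds", [0.005, 0.1, 1.0])
for seconds in (0.003, 0.05, 2.0):
    latency.observe(seconds)
print(latency.render())
```

&lt;p&gt;Storing per-bucket counts and summing them at render time keeps &lt;code&gt;observe&lt;/code&gt; cheap, which matters because it runs on every request.&lt;/p&gt;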

&lt;h3&gt;
  
  
  Injecting Chaos
&lt;/h3&gt;

&lt;p&gt;After promoting to canary mode, chaos was injected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Slow mode — every request sleeps 3 seconds&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "slow", "duration": 3}'&lt;/span&gt;

&lt;span class="c"&gt;# Error mode — 50% of requests return HTTP 500&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "error", "rate": 0.5}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Status Dashboard Capturing the Failure
&lt;/h3&gt;

&lt;p&gt;With error mode active at 50%, the status dashboard showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=======================================================
  SwiftDeploy Status Dashboard
  2026-05-06T14:18:43Z
=======================================================

  Mode:        CANARY
  Uptime:      892s
  Req/s:       2.40
  P99 Latency: 250ms
  Error Rate:  48.20%

  Policy Compliance:
    + Infrastructure: PASSING
    x Canary Safety: FAILING
      -&amp;gt; Error rate (48.20%) exceeds maximum threshold (1.00%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canary safety policy immediately flagged the failure. Attempting to promote to stable at this point would have been blocked by OPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "recover"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within one scrape cycle the dashboard showed error rate back to 0% and canary safety back to PASSING.&lt;/p&gt;
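
&lt;p&gt;A recovery that fast is what you would expect if the error rate is computed from the change in counters between two scrapes rather than from lifetime totals. The sketch below shows that calculation. It is an assumption about the approach, not SwiftDeploy's confirmed implementation, and the &lt;code&gt;error_rate&lt;/code&gt; helper and sample counts are made up.&lt;/p&gt;

```python
def error_rate(prev: dict, curr: dict) -> float:
    """Share of 5xx responses among requests seen between two scrapes.

    prev and curr map status-code strings to cumulative counter values.
    """
    delta = {code: curr[code] - prev.get(code, 0) for code in curr}
    total = sum(delta.values())
    if total == 0:
        return 0.0  # no traffic in the window, so nothing failed
    errors = sum(n for code, n in delta.items() if code.startswith("5"))
    return errors / total


# Illustrative counter snapshots, not real SwiftDeploy data.
during_chaos = error_rate({"200": 100, "500": 0}, {"200": 150, "500": 50})
after_recover = error_rate({"200": 150, "500": 50}, {"200": 190, "500": 50})
print(during_chaos, after_recover)
```

&lt;p&gt;With lifetime totals the second call would still report a high error rate long after recovery, because the old 500s never leave the counter.&lt;/p&gt;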




&lt;h2&gt;
  
  
  Part 4 — The Audit Trail
&lt;/h2&gt;

&lt;p&gt;Every significant event is written to &lt;code&gt;history.jsonl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deploy_success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-06T13:53:39Z"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"promote_success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-06T14:01:22Z"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status_scrape"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.482&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-06T14:18:43Z"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;./swiftdeploy audit&lt;/code&gt; parses this file and generates a clean Markdown report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Timeline&lt;/span&gt;
| Timestamp | Event | Details |
|---|---|---|
| 2026-05-06T13:53:39Z | Deploy | Stack deployed successfully |
| 2026-05-06T14:01:22Z | Promote | Mode switched to canary |

&lt;span class="gu"&gt;## Policy Violations&lt;/span&gt;
| Timestamp | Policy | Reason |
|---|---|---|
| 2026-05-06T14:18:43Z | Canary Safety | error_rate=48.20%, p99=250ms |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
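
&lt;p&gt;The timeline half of that report can be sketched in a few lines of Python. This is a minimal illustration, not SwiftDeploy's actual code: the function name &lt;code&gt;render_timeline&lt;/code&gt; and the &lt;code&gt;LABELS&lt;/code&gt; mapping are assumptions, and only the two event types from the sample log get a row.&lt;/p&gt;

```python
import json

# Hypothetical mapping from event names in history.jsonl to report rows.
# Only the events shown in the sample log are covered.
LABELS = {
    "deploy_success": ("Deploy", "Stack deployed successfully"),
    "promote_success": ("Promote", "Mode switched to {target}"),
}


def render_timeline(jsonl_text):
    """Turn JSON Lines audit events into a Markdown timeline table."""
    rows = ["## Timeline", "| Timestamp | Event | Details |", "|---|---|---|"]
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        event = json.loads(line)
        label = LABELS.get(event.get("event"))
        if label is None:
            continue  # other events (e.g. status_scrape) are skipped in this sketch
        name, details = label
        details = details.format(target=event.get("target", "unknown"))
        rows.append(f"| {event['timestamp']} | {name} | {details} |")
    return "\n".join(rows)


sample = "\n".join([
    '{"event": "deploy_success", "timestamp": "2026-05-06T13:53:39Z"}',
    '{"event": "promote_success", "target": "canary", "timestamp": "2026-05-06T14:01:22Z"}',
    '{"event": "status_scrape", "mode": "canary", "error_rate": 0.482, "timestamp": "2026-05-06T14:18:43Z"}',
])
print(render_timeline(sample))
```

&lt;p&gt;Because each line of a JSONL file is a complete JSON document, the parser can work line by line and skip records it does not recognise.&lt;/p&gt;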






&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Single source of truth is worth the extra complexity&lt;/strong&gt;&lt;br&gt;
It felt like overkill to build a template engine just to generate two config files. But when the grader deletes your generated files and reruns &lt;code&gt;init&lt;/code&gt;, you're grateful every value comes from one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. OPA's syntax changes between versions&lt;/strong&gt;&lt;br&gt;
The latest OPA image expects Rego v1 syntax: the &lt;code&gt;if&lt;/code&gt;/&lt;code&gt;contains&lt;/code&gt; keywords, with &lt;code&gt;import rego.v1&lt;/code&gt; at the top of the policy to make that explicit. Policies in the older syntax silently fail to load, so always check your OPA container logs first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Start OPA before running policy checks&lt;/strong&gt;&lt;br&gt;
OPA is part of the stack, so it doesn't exist before &lt;code&gt;docker compose up&lt;/code&gt;. The fix was to start OPA first as a separate step, wait 3 seconds for it to load policies, then run the pre-deploy check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Chaos engineering reveals what metrics matter&lt;/strong&gt;&lt;br&gt;
Before injecting chaos, the &lt;code&gt;/metrics&lt;/code&gt; endpoint felt like box-ticking. After watching the error rate spike to 48% in real time on the status dashboard while OPA simultaneously flagged the canary safety policy, the value became obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Policy as code beats policy as documentation&lt;/strong&gt;&lt;br&gt;
A README saying "don't deploy if CPU load is above 2.0" gets ignored. A Rego file that blocks the deploy enforces it automatically.&lt;/p&gt;
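
&lt;p&gt;For lesson 3, one alternative to a fixed 3-second sleep is to poll until OPA answers. The sketch below is an assumption, not what SwiftDeploy does: the helper names are hypothetical, and the URL assumes OPA's default port 8181 is reachable on localhost. OPA does expose a &lt;code&gt;/health&lt;/code&gt; endpoint, but whether a healthy response also means your policies parsed correctly depends on how they are loaded, so the container logs remain the final check.&lt;/p&gt;

```python
import time
import urllib.error
import urllib.request


def opa_is_healthy(url="http://localhost:8181/health"):
    """Return True if OPA answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=1) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or unhealthy


def wait_until_ready(probe, timeout=10.0, interval=0.5, sleep=time.sleep):
    """Call probe() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

&lt;p&gt;A deploy step could then call &lt;code&gt;wait_until_ready(opa_is_healthy)&lt;/code&gt; and abort if it returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;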




&lt;h2&gt;
  
  
  The Full Subcommand Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy init              &lt;span class="c"&gt;# generate nginx.conf + docker-compose.yml&lt;/span&gt;
./swiftdeploy validate          &lt;span class="c"&gt;# 5 pre-flight checks&lt;/span&gt;
./swiftdeploy deploy            &lt;span class="c"&gt;# OPA check + start stack + health wait&lt;/span&gt;
./swiftdeploy promote canary    &lt;span class="c"&gt;# switch to canary mode&lt;/span&gt;
./swiftdeploy promote stable    &lt;span class="c"&gt;# switch back to stable&lt;/span&gt;
./swiftdeploy status            &lt;span class="c"&gt;# live metrics + policy compliance dashboard&lt;/span&gt;
./swiftdeploy audit             &lt;span class="c"&gt;# generate audit_report.md&lt;/span&gt;
./swiftdeploy teardown          &lt;span class="c"&gt;# stop all containers&lt;/span&gt;
./swiftdeploy teardown &lt;span class="nt"&gt;--clean&lt;/span&gt;  &lt;span class="c"&gt;# stop + delete generated files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;The full project is available on GitHub: &lt;a href="https://github.com/AnitaAliCloud/hng4-devops" rel="noopener noreferrer"&gt;https://github.com/AnitaAliCloud/hng4-devops&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built as part of the HNG DevOps Track — Stage 4A and 4B&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>How I Built an Anomaly Detection Engine for DDoS Protection</title>
      <dc:creator>anitaalicloud</dc:creator>
      <pubDate>Tue, 28 Apr 2026 03:56:36 +0000</pubDate>
      <link>https://forem.com/anitaalicloud/how-i-built-an-anomaly-detection-engine-for-ddos-protection-1ibg</link>
      <guid>https://forem.com/anitaalicloud/how-i-built-an-anomaly-detection-engine-for-ddos-protection-1ibg</guid>
      <description>&lt;p&gt;Introduction&lt;br&gt;
Imagine you run a busy website. On a normal day, about 50 requests arrive per second. Then suddenly, 5,000 requests per second flood in from a single IP address. Your server crashes, and your real users can't access anything. This is a denial-of-service (DoS) attack; when the flood comes from many machines at once, it is a distributed denial-of-service (DDoS) attack.&lt;br&gt;
In this post, I'll explain how I built a tool that watches incoming traffic in real time, learns what "normal" looks like, and automatically blocks attackers before they can cause damage.&lt;/p&gt;

&lt;p&gt;What Does the Tool Do?&lt;br&gt;
My anomaly detection engine does 6 things automatically:&lt;/p&gt;

&lt;p&gt;Watches every HTTP request coming into the server&lt;br&gt;
Learns what normal traffic looks like over time&lt;br&gt;
Detects when traffic becomes abnormal&lt;br&gt;
Blocks the attacker using the Linux firewall&lt;br&gt;
Alerts me on Slack within 10 seconds&lt;br&gt;
Unbans the IP automatically after a timeout&lt;/p&gt;

&lt;p&gt;The Architecture&lt;br&gt;
Internet Traffic&lt;br&gt;
      ↓&lt;br&gt;
   Nginx (logs every request as JSON)&lt;br&gt;
      ↓&lt;br&gt;
  Nextcloud (the actual app)&lt;/p&gt;

&lt;p&gt;Detector Daemon reads Nginx logs&lt;br&gt;
      ↓&lt;br&gt;
  Sliding Window → tracks request rates&lt;br&gt;
  Rolling Baseline → learns normal traffic&lt;br&gt;
  Z-score Detection → spots anomalies&lt;br&gt;
  iptables → blocks attackers&lt;br&gt;
  Slack → sends alerts&lt;br&gt;
  Dashboard → shows live metrics&lt;/p&gt;

&lt;p&gt;Part 1 — How the Sliding Window Works&lt;br&gt;
Think of the sliding window as a 60-second camera 🎥&lt;br&gt;
Every request that comes in gets a timestamp. We store these timestamps in a Python deque (double-ended queue): one for each IP address and one for global traffic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import deque
import time

# One deque of request timestamps per IP
ip_windows = {}

def record_request(ip):
    now = time.time()

    if ip not in ip_windows:
        ip_windows[ip] = deque()

    # Add this request
    ip_windows[ip].append(now)

    # Remove requests older than 60 seconds from the LEFT
    cutoff = now - 60
    while ip_windows[ip] and ip_windows[ip][0] &amp;lt; cutoff:
        ip_windows[ip].popleft()

    # Current rate = requests in the last 60 seconds, averaged per second
    current_rate = len(ip_windows[ip]) / 60
    return current_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The key step is the eviction: on every request, timestamps older than 60 seconds are popped from the left side of the deque, so the deque only ever holds the last 60 seconds of requests. The current rate is simply the length of the deque divided by 60, which gives the average requests per second over the window.&lt;/p&gt;

&lt;p&gt;Part 2 — How the Baseline Learns from Traffic&lt;br&gt;
The baseline answers one question: "What is normal?"&lt;br&gt;
We can't hardcode this because every website is different. A news site might normally get 1000 req/s. A small blog might get 2 req/s. So we let the system learn.&lt;br&gt;
Every second we record how many requests came in. Every 60 seconds we look at the last 30 minutes of data and calculate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def recalculate_baseline(per_second_counts):
    # Calculate average requests per second
    mean = sum(per_second_counts) / len(per_second_counts)

    # Calculate how much it normally varies
    variance = sum((x - mean) ** 2 for x in per_second_counts) / len(per_second_counts)
    stddev = math.sqrt(variance)

    # Apply floors to prevent false alarms on quiet traffic
    effective_mean = max(mean, 1.0)
    effective_stddev = max(stddev, 1.0)

    return effective_mean, effective_stddev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also maintain per-hour slots — the system prefers the current hour's data when it has enough samples. This means the baseline adapts to time-of-day patterns. Rush hour traffic looks different from 3 AM traffic!&lt;/p&gt;
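
&lt;p&gt;Here is a rough sketch of how the per-hour preference could work. It is an illustration under assumptions rather than the project's code: the class name HourlyBaseline and the MIN_SAMPLES cut-off of 300 are made up for the example, while the 30-minute window and the 1.0 floors come from the description above.&lt;/p&gt;

```python
import math
from collections import defaultdict

MIN_SAMPLES = 300  # assumed: samples needed before an hour's slot is trusted


class HourlyBaseline:
    def __init__(self):
        self.by_hour = defaultdict(list)  # hour of day (0-23) to per-second counts
        self.recent = []                  # rolling window across all hours

    def add(self, hour, count, max_recent=1800):
        self.by_hour[hour].append(count)
        self.recent.append(count)
        # Keep only the last 30 minutes (1800 one-second samples)
        del self.recent[:-max_recent]

    def baseline(self, hour):
        slot = self.by_hour[hour]
        # Prefer the current hour's slot once it has enough samples
        samples = slot if len(slot) >= MIN_SAMPLES else self.recent
        if not samples:
            return 1.0, 1.0  # nothing learned yet: fall back to the floors
        mean = sum(samples) / len(samples)
        variance = sum((x - mean) ** 2 for x in samples) / len(samples)
        return max(mean, 1.0), max(math.sqrt(variance), 1.0)
```

&lt;p&gt;In this sketch the hourly lists grow without limit; a long-running daemon would need to cap or age them.&lt;/p&gt;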

&lt;p&gt;Part 3 — How the Detection Logic Makes a Decision&lt;br&gt;
Once we have the baseline we use a Z-score to decide if current traffic is anomalous.&lt;br&gt;
The Z-score answers: "How many standard deviations away from normal is this?"&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def is_anomalous(current_rate, mean, stddev):
    # Z-score calculation
    z_score = (current_rate - mean) / stddev

    # Rate multiplier
    rate_multiplier = current_rate / mean

    # Flag as anomalous if EITHER condition fires
    if z_score &amp;gt; 3.0:
        return True, "z-score exceeded 3.0"

    if rate_multiplier &amp;gt; 5.0:
        return True, "rate exceeded 5x baseline"

    return False, None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Normal traffic: 50 req/s (mean=50, stddev=10)&lt;br&gt;
Attack traffic: 5000 req/s from one IP&lt;br&gt;
Z-score = (5000 - 50) / 10 = 495&lt;br&gt;
495 &amp;gt; 3.0 → ANOMALY DETECTED! 🚨&lt;/p&gt;

&lt;p&gt;We also detect error surges — if an IP is getting lots of 404/500 errors it might be scanning for vulnerabilities. In that case we tighten the thresholds automatically.&lt;/p&gt;
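
&lt;p&gt;The tightening step might look something like the sketch below. The default limits of 3.0 and 5.0 are the ones used in the detection code; the function name thresholds_for, the 50% error ratio that counts as a surge, and the halving factor are assumptions chosen for illustration.&lt;/p&gt;

```python
def thresholds_for(error_count, total_count,
                   z_limit=3.0, rate_limit=5.0,
                   surge_ratio=0.5, tighten_factor=0.5):
    """Return (z_limit, rate_limit), tightened when errors dominate an IP's traffic."""
    if total_count == 0:
        return z_limit, rate_limit  # no traffic seen: keep the defaults
    error_ratio = error_count / total_count
    if error_ratio >= surge_ratio:
        # Assumed policy: scale both limits down so the IP is flagged sooner
        return z_limit * tighten_factor, rate_limit * tighten_factor
    return z_limit, rate_limit


# An IP where 80 of its last 100 responses were 404/500 errors
print(thresholds_for(80, 100))
```

&lt;p&gt;The returned pair would replace the fixed 3.0 and 5.0 when checking that particular IP.&lt;/p&gt;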

&lt;p&gt;Part 4 — How iptables Blocks an IP&lt;br&gt;
iptables is the standard command-line tool for configuring Linux's in-kernel packet filter (netfilter). Its rules are enforced in the kernel, so packets can be dropped before they ever reach your application.&lt;br&gt;
When we detect an attack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

def ban_ip(ip):
    # Add a DROP rule: silently discard all packets from this IP
    subprocess.run([
        "iptables", "-A", "INPUT",
        "-s", ip,
        "-j", "DROP"
    ])
    print(f"Banned {ip}")

def unban_ip(ip):
    # Remove the DROP rule
    subprocess.run([
        "iptables", "-D", "INPUT",
        "-s", ip,
        "-j", "DROP"
    ])
    print(f"Unbanned {ip}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The -j DROP means "jump to the DROP target": the packet is silently discarded. The attacker doesn't even get an error message back. From their perspective the server just stopped responding.&lt;br&gt;
Bans lift automatically on a backoff schedule:&lt;/p&gt;

&lt;p&gt;1st offence → 10 minutes&lt;br&gt;
2nd offence → 30 minutes&lt;br&gt;
3rd offence → 2 hours&lt;br&gt;
4th+ → permanent&lt;/p&gt;
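
&lt;p&gt;The schedule above maps naturally onto a small lookup. This is a minimal sketch, assuming offences are counted per IP in memory; the names BAN_SCHEDULE, offences and next_ban_duration are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

# Ban durations in seconds for the 1st, 2nd and 3rd offence (from the schedule above)
BAN_SCHEDULE = [10 * 60, 30 * 60, 2 * 60 * 60]

offences = defaultdict(int)  # ip to number of bans so far


def next_ban_duration(ip):
    """Record an offence and return the ban length in seconds, or None for permanent."""
    offences[ip] += 1
    index = offences[ip] - 1
    if index >= len(BAN_SCHEDULE):
        return None  # 4th offence onwards: permanent ban
    return BAN_SCHEDULE[index]
```

&lt;p&gt;A caller would schedule the unban after the returned number of seconds and skip scheduling entirely when the result is None.&lt;/p&gt;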

&lt;p&gt;Part 5 — The Live Dashboard&lt;br&gt;
The dashboard is a simple web page that refreshes every 3 seconds showing:&lt;/p&gt;

&lt;p&gt;Global requests per second&lt;br&gt;
Currently banned IPs&lt;br&gt;
Top 10 source IPs&lt;br&gt;
CPU and memory usage&lt;br&gt;
Current baseline mean and stddev&lt;br&gt;
System uptime&lt;/p&gt;

&lt;p&gt;It's built using Python's built-in http.server — no web framework needed!&lt;/p&gt;
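
&lt;p&gt;To give a feel for how little code http.server needs, here is a stripped-down sketch. It is not the real dashboard: it returns plain text instead of an HTML page and relies on the non-standard (but widely supported) Refresh response header for the 3-second reload. The metrics dictionary and the names render_metrics, DashboardHandler and serve are assumptions.&lt;/p&gt;

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared state that the detector loop would keep up to date
metrics = {"global_rps": 0.0, "banned_ips": [], "baseline_mean": 1.0}


def render_metrics(state):
    """Format the metrics as plain text, one 'name: value' pair per line."""
    return "\n".join(f"{name}: {value}" for name, value in state.items())


class DashboardHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(metrics).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Ask the browser to reload the page every 3 seconds
        self.send_header("Refresh", "3")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving the dashboard on the given port."""
    HTTPServer(("0.0.0.0", port), DashboardHandler).serve_forever()
```

&lt;p&gt;In a daemon, serve would typically run in its own thread so the detection loop keeps going.&lt;/p&gt;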

&lt;p&gt;Part 6 — Slack Alerts&lt;br&gt;
When an IP gets banned, a Slack message arrives within 10 seconds:&lt;br&gt;
🚨 IP BANNED&lt;br&gt;
IP: 192.168.1.100&lt;br&gt;
Condition: Anomalous request rate&lt;br&gt;
Current Rate: 450.00 req/s&lt;br&gt;
Baseline: 12.00 req/s&lt;br&gt;
Ban Duration: 10 minutes&lt;br&gt;
Timestamp: 2026-04-28 03:22:36 UTC&lt;br&gt;
And when the ban expires:&lt;br&gt;
✅ IP UNBANNED&lt;br&gt;
IP: 192.168.1.100&lt;br&gt;
Reason: ban-expired&lt;br&gt;
Timestamp: 2026-04-28 03:32:36 UTC&lt;/p&gt;
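
&lt;p&gt;A ban alert like the one above could be built and sent roughly as follows. This sketch assumes a Slack incoming webhook, which accepts a JSON body with a text field; the function names and the way the webhook URL is passed in are hypothetical, and the real project may format or deliver its alerts differently.&lt;/p&gt;

```python
import json
import urllib.request


def build_ban_alert(ip, condition, current_rate, baseline, duration, timestamp):
    """Return the payload for a ban alert as a dict ready for JSON encoding."""
    text = "\n".join([
        "🚨 IP BANNED",
        f"IP: {ip}",
        f"Condition: {condition}",
        f"Current Rate: {current_rate:.2f} req/s",
        f"Baseline: {baseline:.2f} req/s",
        f"Ban Duration: {duration}",
        f"Timestamp: {timestamp}",
    ])
    return {"text": text}


def send_alert(webhook_url, payload):
    """POST the payload to a Slack incoming webhook URL supplied by the caller."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status == 200
```

&lt;p&gt;Keeping the message builder separate from the network call makes the formatting easy to test without touching Slack.&lt;/p&gt;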

&lt;p&gt;What I Learned&lt;br&gt;
Building this project taught me:&lt;/p&gt;

&lt;p&gt;Z-scores are powerful: a simple maths formula can catch attacks that a single hardcoded threshold would either miss or flag far too often&lt;br&gt;
Baselines must be dynamic — hardcoding "block if &amp;gt; 100 req/s" is wrong because normal traffic varies by time of day&lt;br&gt;
iptables is incredibly fast — kernel-level packet dropping happens before the request even reaches Python&lt;br&gt;
Threading needs care — shared data structures need locks to prevent race conditions&lt;br&gt;
Deques are perfect for sliding windows — O(1) append and popleft make them ideal for real-time rate tracking&lt;/p&gt;
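
&lt;p&gt;On the threading point, one way to guard the sliding windows is a single lock around every read and write. The sketch below is an assumption about how that could look, not the project's code; the class name SafeWindows and the optional now parameter (added so the behaviour can be checked with fixed timestamps) are hypothetical.&lt;/p&gt;

```python
import threading
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60


class SafeWindows:
    """Per-IP sliding windows guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._windows = defaultdict(deque)

    def record(self, ip, now=None):
        now = time.time() if now is None else now
        with self._lock:
            window = self._windows[ip]
            window.append(now)
            # Evict from the left while the oldest timestamp is outside the window
            while window and now - window[0] > WINDOW_SECONDS:
                window.popleft()
            return len(window) / WINDOW_SECONDS

    def snapshot(self):
        """Copy the stored rates so a reader never sees a half-updated deque."""
        with self._lock:
            return {ip: len(w) / WINDOW_SECONDS for ip, w in self._windows.items()}
```

&lt;p&gt;The log reader would call record while the dashboard thread calls snapshot; note that snapshot reports the windows as of each IP's last request, since eviction only happens inside record.&lt;/p&gt;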

&lt;p&gt;Try It Yourself&lt;br&gt;
The full source code is available at:&lt;br&gt;
&lt;a href="https://github.com/AnitaAliCloud/hng-stage3-devops" rel="noopener noreferrer"&gt;https://github.com/AnitaAliCloud/hng-stage3-devops&lt;/a&gt;&lt;br&gt;
The live dashboard is running at:&lt;br&gt;
&lt;a href="http://anitacloud.duckdns.org:8080" rel="noopener noreferrer"&gt;http://anitacloud.duckdns.org:8080&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built as part of the HNG14 DevOps internship programme &lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>networking</category>
      <category>security</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
