<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Felix Gogodae</title>
    <description>The latest articles on Forem by Felix Gogodae (@trojanhorse7).</description>
    <link>https://forem.com/trojanhorse7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1086841%2Ffad40ab3-fb26-4f9f-ae7c-19c06a58cec5.jpeg</url>
      <title>Forem: Felix Gogodae</title>
      <link>https://forem.com/trojanhorse7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/trojanhorse7"/>
    <language>en</language>
    <item>
      <title>SwiftDeploy: Building a Self-Governing Deployment Tool with OPA, Prometheus, and a Single YAML File</title>
      <dc:creator>Felix Gogodae</dc:creator>
      <pubDate>Wed, 06 May 2026 18:39:44 +0000</pubDate>
      <link>https://forem.com/trojanhorse7/swiftdeploy-building-a-self-governing-deployment-tool-with-opa-prometheus-and-a-single-yaml-file-50cp</link>
      <guid>https://forem.com/trojanhorse7/swiftdeploy-building-a-self-governing-deployment-tool-with-opa-prometheus-and-a-single-yaml-file-50cp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; First, I built the engine (manifest → rendered nginx + compose, gated lifecycle). Then I added the eyes (Prometheus &lt;code&gt;/metrics&lt;/code&gt;) and the brain (OPA policy sidecar). This post covers the complete journey.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What problem are we solving?&lt;/li&gt;
&lt;li&gt;The single source of truth&lt;/li&gt;
&lt;li&gt;Architecture overview&lt;/li&gt;
&lt;li&gt;The engine: writing its own infrastructure&lt;/li&gt;
&lt;li&gt;The eyes: Prometheus instrumentation&lt;/li&gt;
&lt;li&gt;The brain: OPA policy sidecar&lt;/li&gt;
&lt;li&gt;Gated lifecycle: deploy and promote&lt;/li&gt;
&lt;li&gt;The live dashboard: &lt;code&gt;swiftdeploy status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The memory: &lt;code&gt;swiftdeploy audit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Injecting chaos and watching the gates fire&lt;/li&gt;
&lt;li&gt;Lessons learned&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What problem are we solving?
&lt;/h2&gt;

&lt;p&gt;Most deployment tooling separates three concerns that should be tightly coupled:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Typical state&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scattered across Compose files, env files, CI yamls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implicit — "the person who ran the deploy knew it was safe"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A separate system bolted on afterward&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SwiftDeploy collapses all three into one loop: a &lt;strong&gt;single &lt;code&gt;manifest.yaml&lt;/code&gt;&lt;/strong&gt; drives rendered infrastructure, feeds thresholds to OPA at deploy/promote time, and tells the API how long to keep rolling-window metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The single source of truth
&lt;/h2&gt;

&lt;p&gt;Every value that changes between environments lives in one file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-hng14-api:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;           &lt;span class="c1"&gt;# swiftdeploy promote rewrites this in-place&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.27-alpine&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;service_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-api&lt;/span&gt;
  &lt;span class="na"&gt;contact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you@example.com"&lt;/span&gt;
  &lt;span class="na"&gt;deployed_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;swiftdeploy"&lt;/span&gt;

&lt;span class="na"&gt;compose_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy&lt;/span&gt;

&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# fed to OPA as input.thresholds — never hardcoded in .rego&lt;/span&gt;
    &lt;span class="na"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;min_mem_available_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;max_cpu_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;
    &lt;span class="na"&gt;max_error_rate_percent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;max_p99_latency_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;metrics_window_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;opa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openpolicyagent/opa:0.69.0&lt;/span&gt;
    &lt;span class="na"&gt;host_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9182&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI (&lt;code&gt;swiftdeploy&lt;/code&gt;) reads this file, renders &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; via Jinja2, and never asks you to edit either generated file.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanq53hl3zeqt0zjk3jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanq53hl3zeqt0zjk3jz.png" alt="Architecture Diagram" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key isolation properties
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;How it is enforced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OPA not reachable via Nginx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OPA bound to &lt;code&gt;127.0.0.1:9182&lt;/code&gt; only; Nginx only proxies to &lt;code&gt;api&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API port not public&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;expose:&lt;/code&gt; only — no &lt;code&gt;ports:&lt;/code&gt; mapping on the &lt;code&gt;api&lt;/code&gt; service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No decision logic in CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI POSTs context, reads back &lt;code&gt;allowed&lt;/code&gt; + &lt;code&gt;checks[]&lt;/code&gt;; all logic lives in Rego&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thresholds not in Rego&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rego&lt;/code&gt; files reference only &lt;code&gt;input.thresholds.*&lt;/code&gt; — values come from &lt;code&gt;manifest.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. The engine: writing its own infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;swiftdeploy init&lt;/code&gt; parses &lt;code&gt;manifest.yaml&lt;/code&gt; with PyYAML and feeds the result into two Jinja2 templates (the render step is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates/nginx.conf.j2&lt;/code&gt;&lt;/strong&gt; — upstream block, proxy timeouts, error pages, access log format, &lt;code&gt;X-Deployed-By&lt;/code&gt; header, temp paths under &lt;code&gt;/tmp&lt;/code&gt; so the &lt;code&gt;nginx&lt;/code&gt; user can write&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates/docker-compose.yml.j2&lt;/code&gt;&lt;/strong&gt; — three services (&lt;code&gt;api&lt;/code&gt;, &lt;code&gt;nginx&lt;/code&gt;, &lt;code&gt;opa&lt;/code&gt;), security hardening on &lt;code&gt;api&lt;/code&gt; (&lt;code&gt;cap_drop: ALL&lt;/code&gt;, &lt;code&gt;no-new-privileges&lt;/code&gt;, &lt;code&gt;user: 1000:1000&lt;/code&gt;), healthcheck, named volume&lt;/li&gt;
&lt;/ul&gt;
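&lt;p&gt;A minimal sketch of that render step, assuming the template layout above (illustrative glue code, not the actual CLI source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative: load the manifest once, render both generated files.
import yaml
from jinja2 import Environment, FileSystemLoader

def render_all(manifest_path="manifest.yaml"):
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)              # the single source of truth

    env = Environment(loader=FileSystemLoader("templates"))
    pairs = [("nginx.conf.j2", "nginx.conf"),
             ("docker-compose.yml.j2", "docker-compose.yml")]
    for template_name, output_name in pairs:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_name, "w") as out:
            out.write(rendered)                   # generated files, never hand-edited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;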

&lt;p&gt;The &lt;code&gt;METRICS_WINDOW_SECONDS&lt;/code&gt; env var is written from &lt;code&gt;policy.thresholds.metrics_window_seconds&lt;/code&gt; — the same value that OPA uses as the SLO window — so the API's rolling gauge and the Rego rule are always in sync.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;swiftdeploy validate&lt;/code&gt; runs five pre-flight checks before any container starts (two of them are sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;manifest.yaml&lt;/code&gt; exists and parses&lt;/li&gt;
&lt;li&gt;All required fields are non-empty (including the full &lt;code&gt;policy&lt;/code&gt; block)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker image inspect &amp;lt;services.image&amp;gt;&lt;/code&gt; succeeds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nginx.port&lt;/code&gt; is free on the host&lt;/li&gt;
&lt;li&gt;Rendered &lt;code&gt;nginx.conf&lt;/code&gt; passes &lt;code&gt;nginx -t&lt;/code&gt; inside a throwaway container&lt;/li&gt;
&lt;/ol&gt;
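&lt;p&gt;Checks 4 and 5 are easy to sketch. The function names and the exact &lt;code&gt;docker run&lt;/code&gt; invocation below are assumptions, not the CLI's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative pre-flight checks: host port free, rendered config valid.
import socket
import subprocess

def port_is_free(port: int) -&amp;gt; bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))    # bind succeeds only if nothing holds the port
            return True
        except OSError:
            return False

def nginx_conf_valid(conf_path: str) -&amp;gt; bool:
    # Mount the rendered file into a throwaway container and run nginx -t
    # (conf_path must be absolute for the volume mount).
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{conf_path}:/etc/nginx/nginx.conf:ro",
         "nginx:1.27-alpine", "nginx", "-t"],
        capture_output=True,
    )
    return result.returncode == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;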




&lt;h2&gt;
  
  
  5. The eyes: Prometheus instrumentation
&lt;/h2&gt;

&lt;p&gt;The FastAPI app exposes &lt;code&gt;GET /metrics&lt;/code&gt; in Prometheus text format. There are two layers of middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;request in
    │
    ▼
[chaos middleware]       &amp;lt;- injects slow/error in canary mode (skipped on POST /chaos)
    │
    ▼
[prometheus middleware]  &amp;lt;- times the full stack including chaos delay
    │
    ▼
route handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
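&lt;p&gt;In Starlette/FastAPI the stacking comes from registration order: the middleware registered last sits outermost and sees the request first. A minimal sketch of the timing layer using &lt;code&gt;prometheus_client&lt;/code&gt; (illustrative; the real app's middleware internals may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram

app = FastAPI()

REQUESTS = Counter("http_requests_total", "Total requests",
                   ["method", "path", "status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["method", "path"])

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)       # everything stacked inside runs here
    duration = time.perf_counter() - start
    REQUESTS.labels(request.method, request.url.path,
                    str(response.status_code)).inc()
    LATENCY.labels(request.method, request.url.path).observe(duration)
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;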



&lt;p&gt;&lt;strong&gt;Standard metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standard&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
&lt;span class="n"&gt;app_uptime_seconds&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt;                                       &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;canary&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt;                                   &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;none&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;slow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rolling-window gauges&lt;/strong&gt; (what OPA queries for canary SLOs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;swiftdeploy_window_requests_total       &amp;lt;- count of requests in last N seconds
swiftdeploy_window_errors_total         &amp;lt;- 5xx count in window
swiftdeploy_window_p99_latency_seconds  &amp;lt;- in-process P99 over same window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The window is a &lt;code&gt;collections.deque&lt;/code&gt;. On every request, a &lt;code&gt;(timestamp, duration, is_error)&lt;/code&gt; tuple is appended, stale entries are evicted, and the three gauges are recomputed — P99 via sorted index. No external TSDB needed; the gauge values are always current when scraped.&lt;/p&gt;
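&lt;p&gt;A condensed sketch of that bookkeeping (illustrative; the wiring into the Prometheus gauge objects is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from collections import deque

WINDOW_SECONDS = 30          # from METRICS_WINDOW_SECONDS
window = deque()             # (timestamp, duration_seconds, is_error)

def record(duration: float, is_error: bool):
    now = time.time()
    window.append((now, duration, is_error))
    while window and window[0][0] &amp;lt; now - WINDOW_SECONDS:
        window.popleft()                     # evict stale entries

def window_stats():
    total = len(window)
    errors = sum(1 for _, _, err in window if err)
    durations = sorted(d for _, d, _ in window)
    # P99 via sorted index, as described above
    p99 = durations[int(0.99 * (total - 1))] if durations else 0.0
    return total, errors, p99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;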




&lt;h2&gt;
  
  
  6. The brain: OPA policy sidecar
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why OPA instead of if-statements in the CLI?
&lt;/h3&gt;

&lt;p&gt;The key constraint: &lt;strong&gt;the CLI must not make any allow/deny decision itself&lt;/strong&gt;. With if-statements in Python, the logic and the thresholds are co-located with the operator tool. With OPA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thresholds live only in &lt;code&gt;manifest.yaml&lt;/code&gt; (one place to change for all environments)&lt;/li&gt;
&lt;li&gt;Policy logic lives only in &lt;code&gt;.rego&lt;/code&gt; (auditable, testable with &lt;code&gt;opa test&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The CLI is a &lt;strong&gt;dumb messenger&lt;/strong&gt; — it assembles context, posts it, and reads back a decision object (sketched below)&lt;/li&gt;
&lt;/ul&gt;
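&lt;p&gt;That last bullet is a single POST to OPA's data API. A sketch with &lt;code&gt;requests&lt;/code&gt;, using the input shape from the table below (illustrative, not the CLI's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def ask_opa(context: dict) -&amp;gt; dict:
    # OPA's data API wraps the payload in {"input": ...} and the reply in {"result": ...}.
    resp = requests.post(
        "http://127.0.0.1:9182/v1/data/swiftdeploy/infrastructure/decision",
        json={"input": context},
        timeout=5,
    )
    return resp.json()["result"]    # {"allowed": ..., "checks": [...], ...}

decision = ask_opa({
    "phase": "pre-deploy",
    "host": {"disk_free_gb": 66.6, "cpu_load_1m": 0.9, "mem_available_gb": 11.4},
    "thresholds": {"min_disk_free_gb": 10, "max_cpu_load": 2.0,
                   "min_mem_available_gb": 1},
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;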

&lt;h3&gt;
  
  
  Domain isolation
&lt;/h3&gt;

&lt;p&gt;Each policy domain owns exactly one question and one data shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Input shape&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;swiftdeploy.infrastructure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is the host healthy enough to deploy?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{phase, host: {disk_free_gb, cpu_load_1m, mem_available_gb}, thresholds}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;swiftdeploy.canary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is the canary safe enough to promote?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{phase, promotion_target, metrics: {error_rate_percent, p99_latency_ms, window_seconds}, thresholds}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A change to the infrastructure rules never touches &lt;code&gt;canary/policy.rego&lt;/code&gt; and vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision structure (never a bare boolean)
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;decision&lt;/code&gt; document carries per-rule &lt;code&gt;checks&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"allowed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reasons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"domain"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"phase"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"reasons"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;reasons&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;]]),&lt;/span&gt;
    &lt;span class="s2"&gt;"checks"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_disk_free_minimum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;disk_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;disk_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_cpu_load_maximum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cpu_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cpu_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_memory_available_minimum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mem_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mem_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI iterates &lt;code&gt;checks[]&lt;/code&gt; directly for the live status display — it never infers pass/fail itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure handling
&lt;/h3&gt;

&lt;p&gt;Every distinct failure mode has a unique &lt;code&gt;failure_kind&lt;/code&gt; and a human-readable message:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;&lt;code&gt;failure_kind&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Message shown to operator&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OPA container not started&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_connection_refused&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Start with: docker compose up -d opa"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA slow to respond&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"OPA request timed out (read)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA returns non-JSON&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_bad_json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;includes raw snippet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA returns no &lt;code&gt;result&lt;/code&gt; key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_no_result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;includes raw snippet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;psutil&lt;/code&gt; not installed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;host_stats_unavailable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;install instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these paths crash or hang the CLI.&lt;/p&gt;
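&lt;p&gt;One way to produce those distinct kinds, assuming a &lt;code&gt;requests&lt;/code&gt;-style HTTP client (a sketch; the actual exception handling may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def safe_ask_opa(url: str, context: dict) -&amp;gt; dict:
    try:
        resp = requests.post(url, json={"input": context}, timeout=(3, 10))
        body = resp.json()
    except requests.exceptions.ConnectionError:
        return {"failure_kind": "opa_connection_refused",
                "hint": "Start with: docker compose up -d opa"}
    except requests.exceptions.Timeout:
        return {"failure_kind": "opa_timeout"}
    except ValueError:                           # resp.json() failed to parse
        return {"failure_kind": "opa_bad_json", "snippet": resp.text[:200]}
    if "result" not in body:
        return {"failure_kind": "opa_no_result", "snippet": str(body)[:200]}
    return body["result"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;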




&lt;h2&gt;
  
  
  7. Gated lifecycle: deploy and promote
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;swiftdeploy deploy&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;init (render nginx.conf + docker-compose.yml)
    |
    v
docker compose up -d opa
    |
    v
wait_opa_ready (polls /health, up to 75s)
    |
    v
collect_host_stats --&amp;gt; POST /v1/data/swiftdeploy/infrastructure/decision
                                |
                      +---------+-----------+
                      |                     |
                 allowed: false        allowed: true
                      |                     |
               print FAIL checks    docker compose up --build -d
               exit(1)                      |
               (no stack up)                v
                                   poll GET /healthz via nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output on a day when CPU spiked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Policy compliance (infrastructure (pre-deploy)):
  [PASS] infra_disk_free_minimum: PASS: disk free 66.57 GB meets minimum 10.00 GB.
  [FAIL] infra_cpu_load_maximum: FAIL: CPU load 2.52 exceeds maximum 2.00.
  [PASS] infra_memory_available_minimum: PASS: memory available 8.10 GB meets minimum 1.00 GB.
[swiftdeploy] POLICY VIOLATION - deploy blocked (infrastructure).
  - Policy violation: CPU load (2.52) exceeds maximum allowed (2.00).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stack never started; no &lt;code&gt;compose up&lt;/code&gt; ran. The OPA sidecar is the only container that exists at this point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;swiftdeploy promote canary&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Before rewriting &lt;code&gt;manifest.yaml&lt;/code&gt;, the CLI runs four steps (step 2 is sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapes &lt;code&gt;GET /metrics&lt;/code&gt; via Nginx&lt;/li&gt;
&lt;li&gt;Derives &lt;code&gt;error_rate_percent&lt;/code&gt; and &lt;code&gt;p99_latency_ms&lt;/code&gt; from the rolling-window gauges&lt;/li&gt;
&lt;li&gt;Posts to &lt;code&gt;swiftdeploy/canary/decision&lt;/code&gt; with &lt;code&gt;promotion_target: "canary"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;On &lt;code&gt;allowed: false&lt;/code&gt; — exits without touching &lt;code&gt;manifest.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
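&lt;p&gt;Step 2 is plain-text parsing of the three gauges. A minimal sketch (the real CLI may use a proper Prometheus text-format parser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def derive_slo_inputs(metrics_text: str) -&amp;gt; dict:
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("swiftdeploy_window_"):
            name, value = line.rsplit(" ", 1)
            values[name] = float(value)
    requests_total = values["swiftdeploy_window_requests_total"]
    errors_total = values["swiftdeploy_window_errors_total"]
    p99_seconds = values["swiftdeploy_window_p99_latency_seconds"]
    return {
        "error_rate_percent": 100.0 * errors_total / max(requests_total, 1.0),
        "p99_latency_ms": p99_seconds * 1000.0,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;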

&lt;p&gt;Promoting to &lt;strong&gt;stable&lt;/strong&gt; takes a different Rego branch that skips SLO evaluation entirely (there are no "canary metrics" to check when moving away from canary).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The live dashboard: &lt;code&gt;swiftdeploy status&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python swiftdeploy status &lt;span class="nt"&gt;--interval&lt;/span&gt; 2 &lt;span class="nt"&gt;-n&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each sample scrapes &lt;code&gt;/healthz&lt;/code&gt;, &lt;code&gt;/metrics&lt;/code&gt;, and both OPA domains independently, then prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== 2026-05-06T18:16:17Z  mode='stable'  req/s~=3.2100 ===
  window(30s): errors=2/41 err_rate=4.8780% p99=312.45ms
  chaos_active: 2 (error)
  Policy compliance (infrastructure (pre-deploy)):
    [PASS] infra_disk_free_minimum: PASS: disk free 66.62 GB meets minimum 10.00 GB.
    [PASS] infra_cpu_load_maximum: PASS: CPU load 0.89 is within maximum 2.00.
    [PASS] infra_memory_available_minimum: PASS: memory available 11.38 GB meets minimum 1.00 GB.
  OPA [infrastructure (pre-deploy)] aggregate: ALLOW
  Policy compliance (canary (hypothetical promote-&amp;gt;canary)):
    [FAIL] canary_error_rate_window: FAIL: error rate 4.8780% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 312.45 ms within maximum 500.00 ms over 30 s window.
  OPA [canary (hypothetical promote-&amp;gt;canary)] aggregate: DENY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every sample is appended as one JSON line to &lt;code&gt;history.jsonl&lt;/code&gt;, including &lt;code&gt;chaos_active&lt;/code&gt;, window metrics, and both OPA snapshots with their &lt;code&gt;checks[]&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The memory: &lt;code&gt;swiftdeploy audit&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python swiftdeploy audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;audit_report.md&lt;/code&gt; is generated from &lt;code&gt;history.jsonl&lt;/code&gt; with four sections (the event diffing is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; — sample count, denial count, transport error count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline events&lt;/strong&gt; — mode transitions and chaos transitions detected by diffing consecutive records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Violations&lt;/strong&gt; — every &lt;code&gt;allowed: false&lt;/code&gt; from any domain, with reasons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent timeline&lt;/strong&gt; — last 25 samples in a table with Chaos column and per-domain OPA status&lt;/li&gt;
&lt;/ul&gt;
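&lt;p&gt;The event diffing is a small fold over consecutive records. A sketch (&lt;code&gt;mode&lt;/code&gt; and &lt;code&gt;chaos_active&lt;/code&gt; appear in the post; the &lt;code&gt;ts&lt;/code&gt; field name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def timeline_events(path="history.jsonl"):
    events, prev = [], None
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if prev is not None:
                if rec["mode"] != prev["mode"]:
                    events.append((rec["ts"], "mode_change",
                                   f"{prev['mode']} -&amp;gt; {rec['mode']}"))
                if rec["chaos_active"] != prev["chaos_active"]:
                    events.append((rec["ts"], "chaos_change",
                                   f"{prev['chaos_active']} -&amp;gt; {rec['chaos_active']}"))
            prev = rec
    return events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;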

&lt;p&gt;Example timeline events table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:17:02Z&lt;/td&gt;
&lt;td&gt;chaos_change&lt;/td&gt;
&lt;td&gt;none -&amp;gt; error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:20:14Z&lt;/td&gt;
&lt;td&gt;mode_change&lt;/td&gt;
&lt;td&gt;stable -&amp;gt; canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:23:41Z&lt;/td&gt;
&lt;td&gt;chaos_change&lt;/td&gt;
&lt;td&gt;error -&amp;gt; none&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Injecting chaos and watching the gates fire
&lt;/h2&gt;

&lt;p&gt;In canary mode, &lt;code&gt;POST /chaos&lt;/code&gt; arms the process-global chaos state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# arm 40% error rate&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "error", "rate": 0.40}'&lt;/span&gt;

&lt;span class="c"&gt;# arm 2-second slow response on every request&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "slow", "duration": 2.0}'&lt;/span&gt;

&lt;span class="c"&gt;# recover&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "recover"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 40% error rate active and traffic flowing, the &lt;code&gt;status&lt;/code&gt; dashboard shows &lt;code&gt;canary_error_rate_window&lt;/code&gt; &lt;strong&gt;FAIL&lt;/strong&gt; within one 30-second window. Attempting &lt;code&gt;swiftdeploy promote canary&lt;/code&gt; while this is true produces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9q3gw4kxtlitin46jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9q3gw4kxtlitin46jd.png" alt="Swiftdeploy promote canary blocked by OPA canary safety policy" width="800" height="210"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Policy compliance (canary (pre-promote)):
    [FAIL] canary_error_rate_window: FAIL: error rate 50.8772% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 1.96 ms within maximum 500.00 ms over 30 s window.
[swiftdeploy] POLICY VIOLATION - promote blocked (canary safety policy).
  - Policy violation: error rate (50.8772%) exceeds maximum (1.0000%) over last 30 seconds.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;manifest.yaml&lt;/code&gt; is &lt;strong&gt;not modified&lt;/strong&gt;. After recovering and waiting for the window to clear, the same command succeeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Lessons learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. One source of truth is a forcing function, not a convenience.&lt;/strong&gt;&lt;br&gt;
When thresholds are only in &lt;code&gt;manifest.yaml&lt;/code&gt; and nowhere else, you cannot accidentally have a tighter limit in the Rego file than in your runbook. The manifest &lt;em&gt;is&lt;/em&gt; the runbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. OPA's value is in the separation, not the language.&lt;/strong&gt;&lt;br&gt;
Rego has a learning curve. The real benefit is that a policy change is a PR to a &lt;code&gt;.rego&lt;/code&gt; file with a clear audit trail, not a diff buried inside deployment tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rolling-window gauges beat querying a TSDB for CLI gates.&lt;/strong&gt;&lt;br&gt;
The alternative — running Prometheus Server just to evaluate a PromQL expression at deploy time — adds infrastructure for something the app can compute in-process with a deque. The CLI scrapes the gauge, not the raw counter buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Failure modes are the real API.&lt;/strong&gt;&lt;br&gt;
The most useful work in this project was not the happy path. It was giving every OPA transport failure a distinct &lt;code&gt;failure_kind&lt;/code&gt; and message so an operator at 2am knows immediately whether OPA is down, slow, returning bad JSON, or returning a policy decision that says no.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Windows CPU approximation is not Linux load average.&lt;/strong&gt;&lt;br&gt;
The infrastructure policy uses the 1-minute load average on Linux. On Windows, the approximation &lt;code&gt;psutil.cpu_percent() × logical_cpus&lt;/code&gt; spikes aggressively during container start. The gate working correctly the first time it fired was both the most satisfying and most annoying moment of the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Trojanhorse7/swift-deploy" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/swift-deploy&lt;/a&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>docker</category>
      <category>opa</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Building a Rolling-Baseline HTTP Anomaly Detector (No Fail2Ban)</title>
      <dc:creator>Felix Gogodae</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:46:39 +0000</pubDate>
      <link>https://forem.com/trojanhorse7/building-a-rolling-baseline-http-anomaly-detector-no-fail2ban-4kf2</link>
      <guid>https://forem.com/trojanhorse7/building-a-rolling-baseline-http-anomaly-detector-no-fail2ban-4kf2</guid>
      <description>&lt;p&gt;Every VPS running a public web app gets hit with traffic it didn't ask for, from scrapers, brute-force login attempts, or just someone's misconfigured bot hammering the same endpoint every second. Most tutorials say "install Fail2Ban and move on." But what if you want to &lt;em&gt;understand&lt;/em&gt; the traffic before you block it? What if you need thresholds that adapt to your actual load instead of a hardcoded "5 failures in 10 minutes"?&lt;/p&gt;

&lt;p&gt;That's what I built for the HNG DevOps track: a Python daemon that tails Nginx access logs, compares live request rates to a &lt;strong&gt;rolling 30-minute baseline&lt;/strong&gt;, and reacts — Slack alerts for global spikes, &lt;code&gt;iptables DROP&lt;/code&gt; for abusive individual IPs, with tiered auto-unban so a single bad minute doesn't permanently lock someone out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Trojanhorse7/hng-anomaly-detector" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/hng-anomaly-detector&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx70lp3oqfibdd60dyqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx70lp3oqfibdd60dyqp.png" alt="Detector daemon running" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole system runs on a single Linux VPS with Docker Compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; — the upstream &lt;a href="https://hub.docker.com/r/kefaslungu/hng-nextcloud" rel="noopener noreferrer"&gt;&lt;code&gt;kefaslungu/hng-nextcloud&lt;/code&gt;&lt;/a&gt; image, unmodified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; — reverse proxy in front of Nextcloud, configured to write &lt;strong&gt;JSON-formatted access logs&lt;/strong&gt; (not the default combined format). This is critical — structured logs let the detector parse fields reliably instead of regex-guessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detector&lt;/strong&gt; — a Python 3.12 container that tails the shared log volume, runs the detection logic, calls Slack, and executes &lt;code&gt;iptables&lt;/code&gt; commands on the host.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared volume&lt;/strong&gt; — a named Docker volume (&lt;code&gt;HNG-nginx-logs&lt;/code&gt;) that Nginx writes to and the detector reads from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s5pb38do5vkfk4ojdrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s5pb38do5vkfk4ojdrg.png" alt="Architecture diagram" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The detector container runs with &lt;code&gt;network_mode: host&lt;/code&gt; and &lt;code&gt;cap_add: NET_ADMIN&lt;/code&gt; so its &lt;code&gt;iptables&lt;/code&gt; calls affect the actual host firewall — not an isolated container network.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Detection Works
&lt;/h2&gt;

&lt;p&gt;The detection pipeline has three layers: &lt;strong&gt;sliding windows&lt;/strong&gt;, &lt;strong&gt;rolling baseline&lt;/strong&gt;, and &lt;strong&gt;anomaly evaluation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Sliding Windows (60 seconds)
&lt;/h3&gt;

&lt;p&gt;Every parsed log line feeds into &lt;code&gt;collections.deque&lt;/code&gt; structures — one global deque for all requests, and one per source IP. Timestamps older than 60 seconds are continuously evicted from the left side. At any moment, &lt;strong&gt;RPS = count / 60&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There's no "bucket per minute" approximation. Every request is tracked individually and aged out precisely. Parallel deques track 4xx/5xx errors separately for the error-surge path (more on that below).&lt;/p&gt;
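&lt;p&gt;A minimal sketch of the per-IP window (illustrative; the error deques work the same way):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
ip_windows = defaultdict(deque)      # ip -&amp;gt; deque of request timestamps

def observe(ip: str):
    now = time.time()
    dq = ip_windows[ip]
    dq.append(now)
    while dq and dq[0] &amp;lt; now - WINDOW_SECONDS:
        dq.popleft()                 # age out precisely, no minute buckets

def rps(ip: str) -&amp;gt; float:
    return len(ip_windows[ip]) / WINDOW_SECONDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;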

&lt;h3&gt;
  
  
  Layer 2: Rolling Baseline (30 minutes)
&lt;/h3&gt;

&lt;p&gt;A background thread recomputes the baseline every 60 seconds. It builds a dense vector of per-second request counts over the last 1,800 seconds (30 minutes) and calculates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;effective_mean&lt;/code&gt;&lt;/strong&gt; — average requests per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;effective_std&lt;/code&gt;&lt;/strong&gt; — standard deviation of per-second counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's an important twist: if enough samples exist in the &lt;strong&gt;current UTC hour&lt;/strong&gt;, the baseline uses only that hour's data instead of the full 30-minute window. This matters because traffic patterns shift — 2 AM is different from 2 PM, and the baseline should reflect &lt;em&gt;current&lt;/em&gt; conditions, not a blend of quiet and busy periods.&lt;/p&gt;

&lt;p&gt;Floor values prevent divide-by-zero edge cases in z-score calculations. Every recompute is &lt;strong&gt;audited&lt;/strong&gt; to a structured log file with the timestamp, source (hourly vs full window), and the computed mean/std.&lt;/p&gt;
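&lt;p&gt;A sketch of the recompute; the hourly-sample cutoff and the floor value here are assumptions standing in for the configured ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

def compute_baseline(full_window: list, current_hour: list,
                     min_hourly_samples: int = 300):
    # Prefer the current UTC hour's per-second counts when there are enough of them.
    samples = current_hour if len(current_hour) &amp;gt;= min_hourly_samples else full_window
    effective_mean = statistics.fmean(samples)
    effective_std = statistics.pstdev(samples)
    # Floors prevent divide-by-zero in the z-score.
    return max(effective_mean, 0.1), max(effective_std, 0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;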
&lt;h3&gt;
  
  
  Layer 3: Anomaly Evaluation
&lt;/h3&gt;

&lt;p&gt;For each incoming request, the detector compares current RPS to the baseline. An anomaly fires if &lt;strong&gt;either&lt;/strong&gt; condition is true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Z-score&lt;/strong&gt; &amp;gt; threshold (default &lt;strong&gt;3.0&lt;/strong&gt;) — the current rate is more than 3 standard deviations above the baseline mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate&lt;/strong&gt; &amp;gt; &lt;strong&gt;multiplier × baseline mean&lt;/strong&gt; (default &lt;strong&gt;5×&lt;/strong&gt;) — the current rate is more than 5 times the average&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Error surge tightening:&lt;/strong&gt; if an IP's error RPS (4xx/5xx responses) exceeds 3× the baseline error mean, thresholds tighten automatically — z-score drops to &lt;strong&gt;2.0&lt;/strong&gt; and the rate multiplier drops to &lt;strong&gt;3×&lt;/strong&gt;. This means an IP generating lots of failed requests gets scrutinized more aggressively, which is exactly what you want for brute-force login attempts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal:      z &amp;gt; 3.0  OR  rate &amp;gt; 5 × mean  →  anomaly
Error surge: z &amp;gt; 2.0  OR  rate &amp;gt; 3 × mean  →  anomaly (tighter)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
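&lt;p&gt;The whole evaluation fits in a few lines (defaults mirror the post; real values come from &lt;code&gt;detector/config.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def is_anomalous(rate: float, mean: float, std: float, error_surge: bool) -&amp;gt; bool:
    z_threshold = 2.0 if error_surge else 3.0    # tighter under an error surge
    multiplier = 3.0 if error_surge else 5.0
    z = (rate - mean) / std                      # std is floored, so never zero
    return z &amp;gt; z_threshold or rate &amp;gt; multiplier * mean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;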






&lt;h2&gt;
  
  
  What Happens When an Anomaly Fires
&lt;/h2&gt;

&lt;p&gt;The system distinguishes between &lt;strong&gt;global&lt;/strong&gt; and &lt;strong&gt;per-IP&lt;/strong&gt; anomalies, and they trigger different responses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Anomaly → Slack Only
&lt;/h3&gt;

&lt;p&gt;If the aggregate RPS across all IPs spikes above the baseline, the detector sends a Slack notification. It does &lt;strong&gt;not&lt;/strong&gt; apply iptables rules — blocking all traffic would take the service down. Global alerts are informational: "your server is seeing unusual load right now."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls8nfh42pv0rjy6zn8oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls8nfh42pv0rjy6zn8oy.png" alt="Global anomaly Slack alert" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A cooldown (default 120 seconds) prevents Slack spam if the global anomaly persists for minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-IP Anomaly → iptables DROP + Slack + Audit
&lt;/h3&gt;

&lt;p&gt;If a single IP is responsible for anomalous traffic, the detector does three things (the &lt;code&gt;iptables&lt;/code&gt; calls are sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adds an &lt;code&gt;iptables -I INPUT -s &amp;lt;IP&amp;gt; -j DROP&lt;/code&gt; rule&lt;/strong&gt; — the IP is immediately blocked at the kernel level, before Nginx even sees the packets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sends a Slack notification&lt;/strong&gt; with the IP, the detection condition (z-score or rate multiplier), the current rate, and the baseline stats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes a structured audit log entry&lt;/strong&gt; with all the same details plus the ban duration.&lt;/li&gt;
&lt;/ol&gt;
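&lt;p&gt;The block and its removal are two &lt;code&gt;iptables&lt;/code&gt; invocations (illustrative; they only reach the real firewall because of &lt;code&gt;network_mode: host&lt;/code&gt; and &lt;code&gt;NET_ADMIN&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def ban(ip: str):
    # Insert at the top of INPUT so the DROP wins over existing accept rules.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def unban(ip: str):
    # -D deletes the first rule that matches the same specification.
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;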

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadji0h0baclhdeejsr2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadji0h0baclhdeejsr2c.png" alt="Ban Slack notification" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1wsoqrtrlyrn31lenu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1wsoqrtrlyrn31lenu6.png" alt="iptables showing DROP rule" width="800" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiered Auto-Unban
&lt;/h3&gt;

&lt;p&gt;Permanently banning IPs from a single spike is too aggressive. The system uses &lt;strong&gt;escalating timeouts&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strike&lt;/th&gt;
&lt;th&gt;Ban Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th+&lt;/td&gt;
&lt;td&gt;Permanent (no auto-unban)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A background thread checks every 3 seconds for IPs whose ban has expired, removes the iptables rule, and sends an unban Slack notification. The strike counter persists across container restarts via a JSON file (&lt;code&gt;ban_state.json&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gxunbxazp8plvft3fa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gxunbxazp8plvft3fa9.png" alt="Unban Slack notification" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means a legitimate user who triggered a false positive gets unblocked in 10 minutes. A repeat offender escalates through the tiers. By the 4th strike, they're gone for good.&lt;/p&gt;
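&lt;p&gt;The escalation itself is a small lookup plus an expiry predicate (a sketch; the exact shape persisted in &lt;code&gt;ban_state.json&lt;/code&gt; is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

TIERS = [600, 1800, 7200, None]      # 10 min, 30 min, 2 h, permanent

def ban_duration(strikes: int):
    return TIERS[min(strikes, len(TIERS)) - 1]

def ban_expired(banned_at: float, strikes: int) -&amp;gt; bool:
    duration = ban_duration(strikes)
    return duration is not None and time.time() - banned_at &amp;gt;= duration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;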




&lt;h2&gt;
  
  
  The Audit Trail
&lt;/h2&gt;

&lt;p&gt;Every significant event is appended to a structured log file at &lt;code&gt;data/audit.log&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BASELINE_RECALC&lt;/code&gt;&lt;/strong&gt; — every 60 seconds, with source (hourly vs full), mean, std&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BAN&lt;/code&gt;&lt;/strong&gt; — IP, condition, rate, baseline stats, duration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNBAN&lt;/code&gt;&lt;/strong&gt; — IP, reason, historical ban count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58t8mhs28kiyfqob80ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58t8mhs28kiyfqob80ma.png" alt="Structured audit log" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file is the source of truth for debugging, compliance, and the baseline graph (more below).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;A FastAPI server on port 8080 serves a single-page dashboard with live metrics via &lt;strong&gt;WebSocket push&lt;/strong&gt; (every 2.5 seconds). If WebSocket fails (e.g., behind a proxy without Upgrade support), the page falls back to HTTP polling automatically.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/state&lt;/code&gt; JSON endpoint returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uptime, event count, CPU/memory&lt;/li&gt;
&lt;li&gt;Current global RPS and baseline &lt;code&gt;effective_mean&lt;/code&gt; / &lt;code&gt;effective_std&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;List of currently banned IPs with tier info&lt;/li&gt;
&lt;li&gt;Top 10 source IPs by request count in the current window&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Baseline Over Time
&lt;/h2&gt;

&lt;p&gt;One of the requirements was demonstrating that the baseline actually adapts. By parsing &lt;code&gt;BASELINE_RECALC&lt;/code&gt; lines from the audit log and plotting &lt;code&gt;effective_mean&lt;/code&gt; over time, you can see the baseline shift as traffic patterns change between UTC hours.&lt;/p&gt;

&lt;p&gt;During a busy period, &lt;code&gt;effective_mean&lt;/code&gt; climbs. When traffic drops, it falls. The hourly-slice preference means the baseline reacts to the &lt;em&gt;current&lt;/em&gt; hour's pattern rather than being dragged by stale data from 25 minutes ago.&lt;/p&gt;
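&lt;p&gt;A sketch of that parse-and-plot step, assuming one JSON object per audit line (the &lt;code&gt;timestamp&lt;/code&gt; field name is an assumption; &lt;code&gt;effective_mean&lt;/code&gt; is the value described above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import matplotlib.pyplot as plt

times, means = [], []
with open("data/audit.log") as f:
    for line in f:
        if "BASELINE_RECALC" in line:
            rec = json.loads(line)
            times.append(rec["timestamp"])
            means.append(rec["effective_mean"])

plt.plot(times, means)
plt.xlabel("time (UTC)")
plt.ylabel("effective_mean (req/s)")
plt.savefig("baseline_over_time.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;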




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. JSON logs are non-negotiable.&lt;/strong&gt; Parsing regex against Nginx's default combined log format is fragile. One unusual user-agent string with spaces and quotes breaks your parser. JSON logs with &lt;code&gt;escape=json&lt;/code&gt; in the Nginx config give you reliable field extraction every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Host networking in Docker is powerful but surprising.&lt;/strong&gt; &lt;code&gt;network_mode: host&lt;/code&gt; means the container shares the host's network stack — &lt;code&gt;iptables&lt;/code&gt; rules apply to the actual server, not a virtual bridge. This is exactly what you want for blocking IPs, but it also means port conflicts are your problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hardcoded thresholds are the enemy.&lt;/strong&gt; "Block after 100 requests per minute" sounds reasonable until your app legitimately serves 200 req/s during peak hours. A rolling baseline that adapts to actual traffic means your thresholds stay meaningful whether you're serving 2 req/s at 3 AM or 50 req/s at noon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Tiered responses prevent self-inflicted outages.&lt;/strong&gt; The first time I tested with aggressive thresholds, my own monitoring IP got permanently banned. Escalating tiers (10m → 30m → 2h → permanent) give false positives a way to recover while still catching persistent abuse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Audit everything.&lt;/strong&gt; When something goes wrong — a legitimate user gets blocked, or an attack slips through — the audit log tells you exactly what the baseline was, what the detector saw, and why it made the decision it did. Without that, you're guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Trojanhorse7/hng-anomaly-detector
&lt;span class="nb"&gt;cd &lt;/span&gt;hng-anomaly-detector
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Set SLACK_WEBHOOK_URL in .env&lt;/span&gt;
docker compose build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nextcloud at &lt;code&gt;http://&amp;lt;VPS_IP&amp;gt;/&lt;/code&gt;, dashboard at &lt;code&gt;http://&amp;lt;VPS_IP&amp;gt;:8080/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thresholds, window sizes, and ban durations are all in &lt;code&gt;detector/config.yaml&lt;/code&gt; — no code changes needed to tune the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Improve
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-IP baselines&lt;/strong&gt; — currently all IPs are compared against the global baseline. High-traffic legitimate IPs (like a CDN edge) could benefit from their own rolling stats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTPS on the dashboard&lt;/strong&gt; — right now it's plain HTTP on 8080. A reverse proxy with TLS would be better for production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus/Grafana&lt;/strong&gt; — the audit log works, but a proper time-series database would make baseline visualization trivial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6&lt;/strong&gt; — the current implementation only handles IPv4 in iptables rules.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built for the &lt;a href="https://hng.tech/internship" rel="noopener noreferrer"&gt;HNG DevOps track&lt;/a&gt;. The full source is at &lt;a href="https://github.com/Trojanhorse7/hng-anomaly-detector" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/hng-anomaly-detector&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>docker</category>
      <category>security</category>
    </item>
  </channel>
</rss>
