<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Felix Gogodae</title>
    <description>The latest articles on Forem by Felix Gogodae (@trojanhorse7).</description>
    <link>https://forem.com/trojanhorse7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1086841%2Ffad40ab3-fb26-4f9f-ae7c-19c06a58cec5.jpeg</url>
      <title>Forem: Felix Gogodae</title>
      <link>https://forem.com/trojanhorse7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/trojanhorse7"/>
    <language>en</language>
    <item>
      <title>SwiftDeploy: Building a Self-Governing Deployment Tool with OPA, Prometheus, and a Single YAML File</title>
      <dc:creator>Felix Gogodae</dc:creator>
      <pubDate>Wed, 06 May 2026 18:39:44 +0000</pubDate>
      <link>https://forem.com/trojanhorse7/swiftdeploy-building-a-self-governing-deployment-tool-with-opa-prometheus-and-a-single-yaml-file-50cp</link>
      <guid>https://forem.com/trojanhorse7/swiftdeploy-building-a-self-governing-deployment-tool-with-opa-prometheus-and-a-single-yaml-file-50cp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series:&lt;/strong&gt; First, I built the engine (manifest → rendered nginx + compose, gated lifecycle). Then I added the eyes (Prometheus &lt;code&gt;/metrics&lt;/code&gt;) and the brain (OPA policy sidecar). This post covers the complete journey.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What problem are we solving?&lt;/li&gt;
&lt;li&gt;The single source of truth&lt;/li&gt;
&lt;li&gt;Architecture overview&lt;/li&gt;
&lt;li&gt;The engine: writing its own infrastructure&lt;/li&gt;
&lt;li&gt;The eyes: Prometheus instrumentation&lt;/li&gt;
&lt;li&gt;The brain: OPA policy sidecar&lt;/li&gt;
&lt;li&gt;Gated lifecycle: deploy and promote&lt;/li&gt;
&lt;li&gt;The live dashboard: &lt;code&gt;swiftdeploy status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The memory: &lt;code&gt;swiftdeploy audit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Injecting chaos and watching the gates fire&lt;/li&gt;
&lt;li&gt;Lessons learned&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What problem are we solving?
&lt;/h2&gt;

&lt;p&gt;Most deployment tooling separates three concerns that should be tightly coupled:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Typical state&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scattered across Compose files, env files, CI yamls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implicit — "the person who ran the deploy knew it was safe"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A separate system bolted on afterward&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SwiftDeploy collapses all three into one loop: a &lt;strong&gt;single &lt;code&gt;manifest.yaml&lt;/code&gt;&lt;/strong&gt; drives rendered infrastructure, feeds thresholds to OPA at deploy/promote time, and tells the API how long to keep rolling-window metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The single source of truth
&lt;/h2&gt;

&lt;p&gt;Every value that changes between environments lives in one file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-hng14-api:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;           &lt;span class="c1"&gt;# swiftdeploy promote rewrites this in-place&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.27-alpine&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;service_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-api&lt;/span&gt;
  &lt;span class="na"&gt;contact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you@example.com"&lt;/span&gt;
  &lt;span class="na"&gt;deployed_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;swiftdeploy"&lt;/span&gt;

&lt;span class="na"&gt;compose_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy&lt;/span&gt;

&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# fed to OPA as input.thresholds — never hardcoded in .rego&lt;/span&gt;
    &lt;span class="na"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;min_mem_available_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;max_cpu_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;
    &lt;span class="na"&gt;max_error_rate_percent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;max_p99_latency_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;metrics_window_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;opa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openpolicyagent/opa:0.69.0&lt;/span&gt;
    &lt;span class="na"&gt;host_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9182&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI (&lt;code&gt;swiftdeploy&lt;/code&gt;) reads this file, renders &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; via Jinja2, and never asks you to edit either generated file.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanq53hl3zeqt0zjk3jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanq53hl3zeqt0zjk3jz.png" alt="Architecture Diagram" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key isolation properties
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;How it is enforced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OPA not reachable via Nginx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OPA bound to &lt;code&gt;127.0.0.1:9182&lt;/code&gt; only; Nginx only proxies to &lt;code&gt;api&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API port not public&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;expose:&lt;/code&gt; only — no &lt;code&gt;ports:&lt;/code&gt; mapping on the &lt;code&gt;api&lt;/code&gt; service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No decision logic in CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI POSTs context, reads back &lt;code&gt;allowed&lt;/code&gt; + &lt;code&gt;checks[]&lt;/code&gt;; all logic lives in Rego&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thresholds not in Rego&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rego&lt;/code&gt; files reference only &lt;code&gt;input.thresholds.*&lt;/code&gt; — values come from &lt;code&gt;manifest.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. The engine: writing its own infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;swiftdeploy init&lt;/code&gt; parses &lt;code&gt;manifest.yaml&lt;/code&gt; with PyYAML and feeds the result into two Jinja2 templates (the render step is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates/nginx.conf.j2&lt;/code&gt;&lt;/strong&gt; — upstream block, proxy timeouts, error pages, access log format, &lt;code&gt;X-Deployed-By&lt;/code&gt; header, temp paths under &lt;code&gt;/tmp&lt;/code&gt; so the &lt;code&gt;nginx&lt;/code&gt; user can write&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;templates/docker-compose.yml.j2&lt;/code&gt;&lt;/strong&gt; — three services (&lt;code&gt;api&lt;/code&gt;, &lt;code&gt;nginx&lt;/code&gt;, &lt;code&gt;opa&lt;/code&gt;), security hardening on &lt;code&gt;api&lt;/code&gt; (&lt;code&gt;cap_drop: ALL&lt;/code&gt;, &lt;code&gt;no-new-privileges&lt;/code&gt;, &lt;code&gt;user: 1000:1000&lt;/code&gt;), healthcheck, named volume&lt;/li&gt;
&lt;/ul&gt;
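&lt;p&gt;A minimal sketch of that render step, assuming the template layout above (illustrative glue code, not the actual CLI source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative: load the manifest once, render both generated files.
import yaml
from jinja2 import Environment, FileSystemLoader

def render_all(manifest_path="manifest.yaml"):
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)              # the single source of truth

    env = Environment(loader=FileSystemLoader("templates"))
    pairs = [("nginx.conf.j2", "nginx.conf"),
             ("docker-compose.yml.j2", "docker-compose.yml")]
    for template_name, output_name in pairs:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_name, "w") as out:
            out.write(rendered)                   # generated files, never hand-edited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;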

&lt;p&gt;The &lt;code&gt;METRICS_WINDOW_SECONDS&lt;/code&gt; env var is written from &lt;code&gt;policy.thresholds.metrics_window_seconds&lt;/code&gt; — the same value that OPA uses as the SLO window — so the API's rolling gauge and the Rego rule are always in sync.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;swiftdeploy validate&lt;/code&gt; runs five pre-flight checks before any container starts (two of them are sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;manifest.yaml&lt;/code&gt; exists and parses&lt;/li&gt;
&lt;li&gt;All required fields are non-empty (including the full &lt;code&gt;policy&lt;/code&gt; block)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker image inspect &amp;lt;services.image&amp;gt;&lt;/code&gt; succeeds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nginx.port&lt;/code&gt; is free on the host&lt;/li&gt;
&lt;li&gt;Rendered &lt;code&gt;nginx.conf&lt;/code&gt; passes &lt;code&gt;nginx -t&lt;/code&gt; inside a throwaway container&lt;/li&gt;
&lt;/ol&gt;
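&lt;p&gt;Checks 4 and 5 are easy to sketch. The function names and the exact &lt;code&gt;docker run&lt;/code&gt; invocation below are assumptions, not the CLI's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative pre-flight checks: host port free, rendered config valid.
import socket
import subprocess

def port_is_free(port: int) -&amp;gt; bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))    # bind succeeds only if nothing holds the port
            return True
        except OSError:
            return False

def nginx_conf_valid(conf_path: str) -&amp;gt; bool:
    # Mount the rendered file into a throwaway container and run nginx -t
    # (conf_path must be absolute for the volume mount).
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{conf_path}:/etc/nginx/nginx.conf:ro",
         "nginx:1.27-alpine", "nginx", "-t"],
        capture_output=True,
    )
    return result.returncode == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;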




&lt;h2&gt;
  
  
  5. The eyes: Prometheus instrumentation
&lt;/h2&gt;

&lt;p&gt;The FastAPI app exposes &lt;code&gt;GET /metrics&lt;/code&gt; in Prometheus text format. There are two layers of middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;request in
    │
    ▼
[chaos middleware]       &amp;lt;- injects slow/error in canary mode (skipped on POST /chaos)
    │
    ▼
[prometheus middleware]  &amp;lt;- times the full stack including chaos delay
    │
    ▼
route handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
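&lt;p&gt;In Starlette/FastAPI the stacking comes from registration order: the middleware registered last sits outermost and sees the request first. A minimal sketch of the timing layer using &lt;code&gt;prometheus_client&lt;/code&gt; (illustrative; the real app's middleware internals may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram

app = FastAPI()

REQUESTS = Counter("http_requests_total", "Total requests",
                   ["method", "path", "status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["method", "path"])

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)       # everything stacked inside runs here
    duration = time.perf_counter() - start
    REQUESTS.labels(request.method, request.url.path,
                    str(response.status_code)).inc()
    LATENCY.labels(request.method, request.url.path).observe(duration)
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;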



&lt;p&gt;&lt;strong&gt;Standard metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standard&lt;/span&gt; &lt;span class="n"&gt;buckets&lt;/span&gt;
&lt;span class="n"&gt;app_uptime_seconds&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt;                                       &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;canary&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt;                                   &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;none&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;slow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rolling-window gauges&lt;/strong&gt; (what OPA queries for canary SLOs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;swiftdeploy_window_requests_total       &amp;lt;- count of requests in last N seconds
swiftdeploy_window_errors_total         &amp;lt;- 5xx count in window
swiftdeploy_window_p99_latency_seconds  &amp;lt;- in-process P99 over same window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The window is a &lt;code&gt;collections.deque&lt;/code&gt;. On every request, a &lt;code&gt;(timestamp, duration, is_error)&lt;/code&gt; tuple is appended, stale entries are evicted, and the three gauges are recomputed — P99 via sorted index. No external TSDB needed; the gauge values are always current when scraped.&lt;/p&gt;
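&lt;p&gt;A condensed sketch of that bookkeeping (illustrative; the wiring into the Prometheus gauge objects is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from collections import deque

WINDOW_SECONDS = 30          # from METRICS_WINDOW_SECONDS
window = deque()             # (timestamp, duration_seconds, is_error)

def record(duration: float, is_error: bool):
    now = time.time()
    window.append((now, duration, is_error))
    while window and window[0][0] &amp;lt; now - WINDOW_SECONDS:
        window.popleft()                     # evict stale entries

def window_stats():
    total = len(window)
    errors = sum(1 for _, _, err in window if err)
    durations = sorted(d for _, d, _ in window)
    # P99 via sorted index, as described above
    p99 = durations[int(0.99 * (total - 1))] if durations else 0.0
    return total, errors, p99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;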




&lt;h2&gt;
  
  
  6. The brain: OPA policy sidecar
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why OPA instead of if-statements in the CLI?
&lt;/h3&gt;

&lt;p&gt;The key constraint: &lt;strong&gt;the CLI must not make any allow/deny decision itself&lt;/strong&gt;. With if-statements in Python, the logic and the thresholds are co-located with the operator tool. With OPA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thresholds live only in &lt;code&gt;manifest.yaml&lt;/code&gt; (one place to change for all environments)&lt;/li&gt;
&lt;li&gt;Policy logic lives only in &lt;code&gt;.rego&lt;/code&gt; (auditable, testable with &lt;code&gt;opa test&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The CLI is a &lt;strong&gt;dumb messenger&lt;/strong&gt; — it assembles context, posts it, and reads back a decision object (sketched below)&lt;/li&gt;
&lt;/ul&gt;
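&lt;p&gt;That last bullet is a single POST to OPA's data API. A sketch with &lt;code&gt;requests&lt;/code&gt;, using the input shape from the table below (illustrative, not the CLI's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def ask_opa(context: dict) -&amp;gt; dict:
    # OPA's data API wraps the payload in {"input": ...} and the reply in {"result": ...}.
    resp = requests.post(
        "http://127.0.0.1:9182/v1/data/swiftdeploy/infrastructure/decision",
        json={"input": context},
        timeout=5,
    )
    return resp.json()["result"]    # {"allowed": ..., "checks": [...], ...}

decision = ask_opa({
    "phase": "pre-deploy",
    "host": {"disk_free_gb": 66.6, "cpu_load_1m": 0.9, "mem_available_gb": 11.4},
    "thresholds": {"min_disk_free_gb": 10, "max_cpu_load": 2.0,
                   "min_mem_available_gb": 1},
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;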

&lt;h3&gt;
  
  
  Domain isolation
&lt;/h3&gt;

&lt;p&gt;Each policy domain owns exactly one question and one data shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Input shape&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;swiftdeploy.infrastructure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is the host healthy enough to deploy?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{phase, host: {disk_free_gb, cpu_load_1m, mem_available_gb}, thresholds}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;swiftdeploy.canary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is the canary safe enough to promote?&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{phase, promotion_target, metrics: {error_rate_percent, p99_latency_ms, window_seconds}, thresholds}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A change to the infrastructure rules never touches &lt;code&gt;canary/policy.rego&lt;/code&gt; and vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision structure (never a bare boolean)
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;decision&lt;/code&gt; document carries per-rule &lt;code&gt;checks&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"allowed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reasons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"domain"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"phase"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"reasons"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;reasons&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;]]),&lt;/span&gt;
    &lt;span class="s2"&gt;"checks"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_disk_free_minimum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;disk_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;disk_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_cpu_load_maximum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cpu_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cpu_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"rule_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"infra_memory_available_minimum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mem_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mem_detail&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI iterates &lt;code&gt;checks[]&lt;/code&gt; directly for the live status display — it never infers pass/fail itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure handling
&lt;/h3&gt;

&lt;p&gt;Every distinct failure mode has a unique &lt;code&gt;failure_kind&lt;/code&gt; and a human-readable message:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;&lt;code&gt;failure_kind&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Message shown to operator&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OPA container not started&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_connection_refused&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Start with: docker compose up -d opa"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA slow to respond&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"OPA request timed out (read)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA returns non-JSON&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_bad_json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;includes raw snippet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OPA returns no &lt;code&gt;result&lt;/code&gt; key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opa_no_result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;includes raw snippet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;psutil&lt;/code&gt; not installed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;host_stats_unavailable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;install instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these paths crash or hang the CLI.&lt;/p&gt;
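&lt;p&gt;One way to produce those distinct kinds, assuming a &lt;code&gt;requests&lt;/code&gt;-style HTTP client (a sketch; the actual exception handling may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def safe_ask_opa(url: str, context: dict) -&amp;gt; dict:
    try:
        resp = requests.post(url, json={"input": context}, timeout=(3, 10))
        body = resp.json()
    except requests.exceptions.ConnectionError:
        return {"failure_kind": "opa_connection_refused",
                "hint": "Start with: docker compose up -d opa"}
    except requests.exceptions.Timeout:
        return {"failure_kind": "opa_timeout"}
    except ValueError:                           # resp.json() failed to parse
        return {"failure_kind": "opa_bad_json", "snippet": resp.text[:200]}
    if "result" not in body:
        return {"failure_kind": "opa_no_result", "snippet": str(body)[:200]}
    return body["result"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;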




&lt;h2&gt;
  
  
  7. Gated lifecycle: deploy and promote
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;swiftdeploy deploy&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;init (render nginx.conf + docker-compose.yml)
    |
    v
docker compose up -d opa
    |
    v
wait_opa_ready (polls /health, up to 75s)
    |
    v
collect_host_stats --&amp;gt; POST /v1/data/swiftdeploy/infrastructure/decision
                                |
                      +---------+-----------+
                      |                     |
                 allowed: false        allowed: true
                      |                     |
               print FAIL checks    docker compose up --build -d
               exit(1)                      |
               (no stack up)                v
                                   poll GET /healthz via nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output on a day when CPU spiked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Policy compliance (infrastructure (pre-deploy)):
  [PASS] infra_disk_free_minimum: PASS: disk free 66.57 GB meets minimum 10.00 GB.
  [FAIL] infra_cpu_load_maximum: FAIL: CPU load 2.52 exceeds maximum 2.00.
  [PASS] infra_memory_available_minimum: PASS: memory available 8.10 GB meets minimum 1.00 GB.
[swiftdeploy] POLICY VIOLATION - deploy blocked (infrastructure).
  - Policy violation: CPU load (2.52) exceeds maximum allowed (2.00).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stack never started; no &lt;code&gt;compose up&lt;/code&gt; ran. The OPA sidecar is the only container that exists at this point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;swiftdeploy promote canary&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Before rewriting &lt;code&gt;manifest.yaml&lt;/code&gt;, the CLI runs four steps (step 2 is sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapes &lt;code&gt;GET /metrics&lt;/code&gt; via Nginx&lt;/li&gt;
&lt;li&gt;Derives &lt;code&gt;error_rate_percent&lt;/code&gt; and &lt;code&gt;p99_latency_ms&lt;/code&gt; from the rolling-window gauges&lt;/li&gt;
&lt;li&gt;Posts to &lt;code&gt;swiftdeploy/canary/decision&lt;/code&gt; with &lt;code&gt;promotion_target: "canary"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;On &lt;code&gt;allowed: false&lt;/code&gt; — exits without touching &lt;code&gt;manifest.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
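&lt;p&gt;Step 2 is plain-text parsing of the three gauges. A minimal sketch (the real CLI may use a proper Prometheus text-format parser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def derive_slo_inputs(metrics_text: str) -&amp;gt; dict:
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("swiftdeploy_window_"):
            name, value = line.rsplit(" ", 1)
            values[name] = float(value)
    requests_total = values["swiftdeploy_window_requests_total"]
    errors_total = values["swiftdeploy_window_errors_total"]
    p99_seconds = values["swiftdeploy_window_p99_latency_seconds"]
    return {
        "error_rate_percent": 100.0 * errors_total / max(requests_total, 1.0),
        "p99_latency_ms": p99_seconds * 1000.0,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;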

&lt;p&gt;Promoting to &lt;strong&gt;stable&lt;/strong&gt; takes a different Rego branch that skips SLO evaluation entirely (there are no "canary metrics" to check when moving away from canary).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The live dashboard: &lt;code&gt;swiftdeploy status&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python swiftdeploy status &lt;span class="nt"&gt;--interval&lt;/span&gt; 2 &lt;span class="nt"&gt;-n&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each sample scrapes &lt;code&gt;/healthz&lt;/code&gt;, &lt;code&gt;/metrics&lt;/code&gt;, and both OPA domains independently, then prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== 2026-05-06T18:16:17Z  mode='stable'  req/s~=3.2100 ===
  window(30s): errors=2/41 err_rate=4.8780% p99=312.45ms
  chaos_active: 2 (error)
  Policy compliance (infrastructure (pre-deploy)):
    [PASS] infra_disk_free_minimum: PASS: disk free 66.62 GB meets minimum 10.00 GB.
    [PASS] infra_cpu_load_maximum: PASS: CPU load 0.89 is within maximum 2.00.
    [PASS] infra_memory_available_minimum: PASS: memory available 11.38 GB meets minimum 1.00 GB.
  OPA [infrastructure (pre-deploy)] aggregate: ALLOW
  Policy compliance (canary (hypothetical promote-&amp;gt;canary)):
    [FAIL] canary_error_rate_window: FAIL: error rate 4.8780% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 312.45 ms within maximum 500.00 ms over 30 s window.
  OPA [canary (hypothetical promote-&amp;gt;canary)] aggregate: DENY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every sample is appended as one JSON line to &lt;code&gt;history.jsonl&lt;/code&gt;, including &lt;code&gt;chaos_active&lt;/code&gt;, window metrics, and both OPA snapshots with their &lt;code&gt;checks[]&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The memory: &lt;code&gt;swiftdeploy audit&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python swiftdeploy audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;audit_report.md&lt;/code&gt; is generated from &lt;code&gt;history.jsonl&lt;/code&gt; with four sections (the event diffing is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; — sample count, denial count, transport error count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline events&lt;/strong&gt; — mode transitions and chaos transitions detected by diffing consecutive records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Violations&lt;/strong&gt; — every &lt;code&gt;allowed: false&lt;/code&gt; from any domain, with reasons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent timeline&lt;/strong&gt; — last 25 samples in a table with Chaos column and per-domain OPA status&lt;/li&gt;
&lt;/ul&gt;
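&lt;p&gt;The event diffing is a small fold over consecutive records. A sketch (&lt;code&gt;mode&lt;/code&gt; and &lt;code&gt;chaos_active&lt;/code&gt; appear in the post; the &lt;code&gt;ts&lt;/code&gt; field name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def timeline_events(path="history.jsonl"):
    events, prev = [], None
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if prev is not None:
                if rec["mode"] != prev["mode"]:
                    events.append((rec["ts"], "mode_change",
                                   f"{prev['mode']} -&amp;gt; {rec['mode']}"))
                if rec["chaos_active"] != prev["chaos_active"]:
                    events.append((rec["ts"], "chaos_change",
                                   f"{prev['chaos_active']} -&amp;gt; {rec['chaos_active']}"))
            prev = rec
    return events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;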

&lt;p&gt;Example timeline events table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:17:02Z&lt;/td&gt;
&lt;td&gt;chaos_change&lt;/td&gt;
&lt;td&gt;none -&amp;gt; error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:20:14Z&lt;/td&gt;
&lt;td&gt;mode_change&lt;/td&gt;
&lt;td&gt;stable -&amp;gt; canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:23:41Z&lt;/td&gt;
&lt;td&gt;chaos_change&lt;/td&gt;
&lt;td&gt;error -&amp;gt; none&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Injecting chaos and watching the gates fire
&lt;/h2&gt;

&lt;p&gt;In canary mode, &lt;code&gt;POST /chaos&lt;/code&gt; arms the process-global chaos state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# arm 40% error rate&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "error", "rate": 0.40}'&lt;/span&gt;

&lt;span class="c"&gt;# arm 2-second slow response on every request&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "slow", "duration": 2.0}'&lt;/span&gt;

&lt;span class="c"&gt;# recover&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:8080/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "recover"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 40% error rate active and traffic flowing, the &lt;code&gt;status&lt;/code&gt; dashboard shows &lt;code&gt;canary_error_rate_window&lt;/code&gt; &lt;strong&gt;FAIL&lt;/strong&gt; within one 30-second window. Attempting &lt;code&gt;swiftdeploy promote canary&lt;/code&gt; while this is true produces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9q3gw4kxtlitin46jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9q3gw4kxtlitin46jd.png" alt="Swiftdeploy promote canary blocked by OPA canary safety policy" width="800" height="210"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Policy compliance (canary (pre-promote)):
    [FAIL] canary_error_rate_window: FAIL: error rate 50.8772% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 1.96 ms within maximum 500.00 ms over 30 s window.
[swiftdeploy] POLICY VIOLATION - promote blocked (canary safety policy).
  - Policy violation: error rate (50.8772%) exceeds maximum (1.0000%) over last 30 seconds.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;manifest.yaml&lt;/code&gt; is &lt;strong&gt;not modified&lt;/strong&gt;. After recovering and waiting for the window to clear, the same command succeeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Lessons learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. One source of truth is a forcing function, not a convenience.&lt;/strong&gt;&lt;br&gt;
When thresholds are only in &lt;code&gt;manifest.yaml&lt;/code&gt; and nowhere else, you cannot accidentally have a tighter limit in the Rego file than in your runbook. The manifest &lt;em&gt;is&lt;/em&gt; the runbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. OPA's value is in the separation, not the language.&lt;/strong&gt;&lt;br&gt;
Rego has a learning curve. The real benefit is that a policy change is a PR to a &lt;code&gt;.rego&lt;/code&gt; file with a clear audit trail, not a diff buried inside deployment tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rolling-window gauges beat querying a TSDB for CLI gates.&lt;/strong&gt;&lt;br&gt;
The alternative — running Prometheus Server just to evaluate a PromQL expression at deploy time — adds infrastructure for something the app can compute in-process with a deque. The CLI scrapes the gauge, not the raw counter buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Failure modes are the real API.&lt;/strong&gt;&lt;br&gt;
The most useful work in this project was not the happy path. It was giving every OPA transport failure a distinct &lt;code&gt;failure_kind&lt;/code&gt; and message so an operator at 2am knows immediately whether OPA is down, slow, returning bad JSON, or returning a policy decision that says no.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Windows CPU approximation is not Linux load average.&lt;/strong&gt;&lt;br&gt;
The infrastructure policy uses the 1-minute load average on Linux. On Windows, the approximation &lt;code&gt;psutil.cpu_percent() × logical_cpus&lt;/code&gt; spikes aggressively during container start. The gate working correctly the first time it fired was both the most satisfying and most annoying moment of the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Trojanhorse7/swift-deploy" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/swift-deploy&lt;/a&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>docker</category>
      <category>opa</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Building a Rolling-Baseline HTTP Anomaly Detector (No Fail2Ban)</title>
      <dc:creator>Felix Gogodae</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:46:39 +0000</pubDate>
      <link>https://forem.com/trojanhorse7/building-a-rolling-baseline-http-anomaly-detector-no-fail2ban-4kf2</link>
      <guid>https://forem.com/trojanhorse7/building-a-rolling-baseline-http-anomaly-detector-no-fail2ban-4kf2</guid>
      <description>&lt;p&gt;Every VPS running a public web app gets hit with traffic it didn't ask for, from scrapers, brute-force login attempts, or just someone's misconfigured bot hammering the same endpoint every second. Most tutorials say "install Fail2Ban and move on." But what if you want to &lt;em&gt;understand&lt;/em&gt; the traffic before you block it? What if you need thresholds that adapt to your actual load instead of a hardcoded "5 failures in 10 minutes"?&lt;/p&gt;

&lt;p&gt;That's what I built for the HNG DevOps track: a Python daemon that tails Nginx access logs, compares live request rates to a &lt;strong&gt;rolling 30-minute baseline&lt;/strong&gt;, and reacts — Slack alerts for global spikes, &lt;code&gt;iptables DROP&lt;/code&gt; for abusive individual IPs, with tiered auto-unban so a single bad minute doesn't permanently lock someone out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Trojanhorse7/hng-anomaly-detector" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/hng-anomaly-detector&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx70lp3oqfibdd60dyqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx70lp3oqfibdd60dyqp.png" alt="Detector daemon running" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole system runs on a single Linux VPS with Docker Compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; — the upstream &lt;a href="https://hub.docker.com/r/kefaslungu/hng-nextcloud" rel="noopener noreferrer"&gt;&lt;code&gt;kefaslungu/hng-nextcloud&lt;/code&gt;&lt;/a&gt; image, unmodified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; — reverse proxy in front of Nextcloud, configured to write &lt;strong&gt;JSON-formatted access logs&lt;/strong&gt; (not the default combined format). This is critical — structured logs let the detector parse fields reliably instead of regex-guessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detector&lt;/strong&gt; — a Python 3.12 container that tails the shared log volume, runs the detection logic, calls Slack, and executes &lt;code&gt;iptables&lt;/code&gt; commands on the host.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared volume&lt;/strong&gt; — a named Docker volume (&lt;code&gt;HNG-nginx-logs&lt;/code&gt;) that Nginx writes to and the detector reads from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s5pb38do5vkfk4ojdrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s5pb38do5vkfk4ojdrg.png" alt="Architecture diagram" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The detector container runs with &lt;code&gt;network_mode: host&lt;/code&gt; and &lt;code&gt;cap_add: NET_ADMIN&lt;/code&gt; so its &lt;code&gt;iptables&lt;/code&gt; calls affect the actual host firewall — not an isolated container network.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Detection Works
&lt;/h2&gt;

&lt;p&gt;The detection pipeline has three layers: &lt;strong&gt;sliding windows&lt;/strong&gt;, &lt;strong&gt;rolling baseline&lt;/strong&gt;, and &lt;strong&gt;anomaly evaluation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Sliding Windows (60 seconds)
&lt;/h3&gt;

&lt;p&gt;Every parsed log line feeds into &lt;code&gt;collections.deque&lt;/code&gt; structures — one global deque for all requests, and one per source IP. Timestamps older than 60 seconds are continuously evicted from the left side. At any moment, &lt;strong&gt;RPS = count / 60&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There's no "bucket per minute" approximation. Every request is tracked individually and aged out precisely. Parallel deques track 4xx/5xx errors separately for the error-surge path (more on that below).&lt;/p&gt;
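&lt;p&gt;A minimal sketch of the per-IP window (illustrative; the error deques work the same way):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
ip_windows = defaultdict(deque)      # ip -&amp;gt; deque of request timestamps

def observe(ip: str):
    now = time.time()
    dq = ip_windows[ip]
    dq.append(now)
    while dq and dq[0] &amp;lt; now - WINDOW_SECONDS:
        dq.popleft()                 # age out precisely, no minute buckets

def rps(ip: str) -&amp;gt; float:
    return len(ip_windows[ip]) / WINDOW_SECONDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;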

&lt;h3&gt;
  
  
  Layer 2: Rolling Baseline (30 minutes)
&lt;/h3&gt;

&lt;p&gt;A background thread recomputes the baseline every 60 seconds. It builds a dense vector of per-second request counts over the last 1,800 seconds (30 minutes) and calculates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;effective_mean&lt;/code&gt;&lt;/strong&gt; — average requests per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;effective_std&lt;/code&gt;&lt;/strong&gt; — standard deviation of per-second counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's an important twist: if enough samples exist in the &lt;strong&gt;current UTC hour&lt;/strong&gt;, the baseline uses only that hour's data instead of the full 30-minute window. This matters because traffic patterns shift — 2 AM is different from 2 PM, and the baseline should reflect &lt;em&gt;current&lt;/em&gt; conditions, not a blend of quiet and busy periods.&lt;/p&gt;

&lt;p&gt;Floor values prevent divide-by-zero edge cases in z-score calculations. Every recompute is &lt;strong&gt;audited&lt;/strong&gt; to a structured log file with the timestamp, source (hourly vs full window), and the computed mean/std.&lt;/p&gt;
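&lt;p&gt;A sketch of the recompute; the hourly-sample cutoff and the floor value here are assumptions standing in for the configured ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

def compute_baseline(full_window: list, current_hour: list,
                     min_hourly_samples: int = 300):
    # Prefer the current UTC hour's per-second counts when there are enough of them.
    samples = current_hour if len(current_hour) &amp;gt;= min_hourly_samples else full_window
    effective_mean = statistics.fmean(samples)
    effective_std = statistics.pstdev(samples)
    # Floors prevent divide-by-zero in the z-score.
    return max(effective_mean, 0.1), max(effective_std, 0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;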
&lt;h3&gt;
  
  
  Layer 3: Anomaly Evaluation
&lt;/h3&gt;

&lt;p&gt;For each incoming request, the detector compares current RPS to the baseline. An anomaly fires if &lt;strong&gt;either&lt;/strong&gt; condition is true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Z-score&lt;/strong&gt; &amp;gt; threshold (default &lt;strong&gt;3.0&lt;/strong&gt;) — the current rate is more than 3 standard deviations above the baseline mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate&lt;/strong&gt; &amp;gt; &lt;strong&gt;multiplier × baseline mean&lt;/strong&gt; (default &lt;strong&gt;5×&lt;/strong&gt;) — the current rate is more than 5 times the average&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Error surge tightening:&lt;/strong&gt; if an IP's error RPS (4xx/5xx responses) exceeds 3× the baseline error mean, thresholds tighten automatically — z-score drops to &lt;strong&gt;2.0&lt;/strong&gt; and the rate multiplier drops to &lt;strong&gt;3×&lt;/strong&gt;. This means an IP generating lots of failed requests gets scrutinized more aggressively, which is exactly what you want for brute-force login attempts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal:      z &amp;gt; 3.0  OR  rate &amp;gt; 5 × mean  →  anomaly
Error surge: z &amp;gt; 2.0  OR  rate &amp;gt; 3 × mean  →  anomaly (tighter)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
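&lt;p&gt;The whole evaluation fits in a few lines (defaults mirror the post; real values come from &lt;code&gt;detector/config.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def is_anomalous(rate: float, mean: float, std: float, error_surge: bool) -&amp;gt; bool:
    z_threshold = 2.0 if error_surge else 3.0    # tighter under an error surge
    multiplier = 3.0 if error_surge else 5.0
    z = (rate - mean) / std                      # std is floored, so never zero
    return z &amp;gt; z_threshold or rate &amp;gt; multiplier * mean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;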






&lt;h2&gt;
  
  
  What Happens When an Anomaly Fires
&lt;/h2&gt;

&lt;p&gt;The system distinguishes between &lt;strong&gt;global&lt;/strong&gt; and &lt;strong&gt;per-IP&lt;/strong&gt; anomalies, and they trigger different responses:&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Anomaly → Slack Only
&lt;/h3&gt;

&lt;p&gt;If the aggregate RPS across all IPs spikes above the baseline, the detector sends a Slack notification. It does &lt;strong&gt;not&lt;/strong&gt; apply iptables rules — blocking all traffic would take the service down. Global alerts are informational: "your server is seeing unusual load right now."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls8nfh42pv0rjy6zn8oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls8nfh42pv0rjy6zn8oy.png" alt="Global anomaly Slack alert" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A cooldown (default 120 seconds) prevents Slack spam if the global anomaly persists for minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-IP Anomaly → iptables DROP + Slack + Audit
&lt;/h3&gt;

&lt;p&gt;If a single IP is responsible for anomalous traffic, the detector does three things (the &lt;code&gt;iptables&lt;/code&gt; calls are sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adds an &lt;code&gt;iptables -I INPUT -s &amp;lt;IP&amp;gt; -j DROP&lt;/code&gt; rule&lt;/strong&gt; — the IP is immediately blocked at the kernel level, before Nginx even sees the packets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sends a Slack notification&lt;/strong&gt; with the IP, the detection condition (z-score or rate multiplier), the current rate, and the baseline stats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes a structured audit log entry&lt;/strong&gt; with all the same details plus the ban duration.&lt;/li&gt;
&lt;/ol&gt;
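&lt;p&gt;The block and its removal are two &lt;code&gt;iptables&lt;/code&gt; invocations (illustrative; they only reach the real firewall because of &lt;code&gt;network_mode: host&lt;/code&gt; and &lt;code&gt;NET_ADMIN&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def ban(ip: str):
    # Insert at the top of INPUT so the DROP wins over existing accept rules.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def unban(ip: str):
    # -D deletes the first rule that matches the same specification.
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;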

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadji0h0baclhdeejsr2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadji0h0baclhdeejsr2c.png" alt="Ban Slack notification" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1wsoqrtrlyrn31lenu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1wsoqrtrlyrn31lenu6.png" alt="iptables showing DROP rule" width="800" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiered Auto-Unban
&lt;/h3&gt;

&lt;p&gt;Permanently banning IPs from a single spike is too aggressive. The system uses &lt;strong&gt;escalating timeouts&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strike&lt;/th&gt;
&lt;th&gt;Ban Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th+&lt;/td&gt;
&lt;td&gt;Permanent (no auto-unban)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A background thread checks every 3 seconds for IPs whose ban has expired, removes the iptables rule, and sends an unban Slack notification. The strike counter persists across container restarts via a JSON file (&lt;code&gt;ban_state.json&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gxunbxazp8plvft3fa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gxunbxazp8plvft3fa9.png" alt="Unban Slack notification" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means a legitimate user who triggered a false positive gets unblocked in 10 minutes. A repeat offender escalates through the tiers. By the 4th strike, they're gone for good.&lt;/p&gt;
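&lt;p&gt;The escalation itself is a small lookup plus an expiry predicate (a sketch; the exact shape persisted in &lt;code&gt;ban_state.json&lt;/code&gt; is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

TIERS = [600, 1800, 7200, None]      # 10 min, 30 min, 2 h, permanent

def ban_duration(strikes: int):
    return TIERS[min(strikes, len(TIERS)) - 1]

def ban_expired(banned_at: float, strikes: int) -&amp;gt; bool:
    duration = ban_duration(strikes)
    return duration is not None and time.time() - banned_at &amp;gt;= duration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;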




&lt;h2&gt;
  
  
  The Audit Trail
&lt;/h2&gt;

&lt;p&gt;Every significant event is appended to a structured log file at &lt;code&gt;data/audit.log&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BASELINE_RECALC&lt;/code&gt;&lt;/strong&gt; — every 60 seconds, with source (hourly vs full), mean, std&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BAN&lt;/code&gt;&lt;/strong&gt; — IP, condition, rate, baseline stats, duration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNBAN&lt;/code&gt;&lt;/strong&gt; — IP, reason, historical ban count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58t8mhs28kiyfqob80ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58t8mhs28kiyfqob80ma.png" alt="Structured audit log" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file is the source of truth for debugging, compliance, and the baseline graph (more below).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;A FastAPI server on port 8080 serves a single-page dashboard with live metrics via &lt;strong&gt;WebSocket push&lt;/strong&gt; (every 2.5 seconds). If WebSocket fails (e.g., behind a proxy without Upgrade support), the page falls back to HTTP polling automatically.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/state&lt;/code&gt; JSON endpoint returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uptime, event count, CPU/memory&lt;/li&gt;
&lt;li&gt;Current global RPS and baseline &lt;code&gt;effective_mean&lt;/code&gt; / &lt;code&gt;effective_std&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;List of currently banned IPs with tier info&lt;/li&gt;
&lt;li&gt;Top 10 source IPs by request count in the current window&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Baseline Over Time
&lt;/h2&gt;

&lt;p&gt;One of the requirements was demonstrating that the baseline actually adapts. By parsing &lt;code&gt;BASELINE_RECALC&lt;/code&gt; lines from the audit log and plotting &lt;code&gt;effective_mean&lt;/code&gt; over time, you can see the baseline shift as traffic patterns change between UTC hours.&lt;/p&gt;

&lt;p&gt;During a busy period, &lt;code&gt;effective_mean&lt;/code&gt; climbs. When traffic drops, it falls. The hourly-slice preference means the baseline reacts to the &lt;em&gt;current&lt;/em&gt; hour's pattern rather than being dragged by stale data from 25 minutes ago.&lt;/p&gt;
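&lt;p&gt;A sketch of that parse-and-plot step, assuming one JSON object per audit line (the &lt;code&gt;timestamp&lt;/code&gt; field name is an assumption; &lt;code&gt;effective_mean&lt;/code&gt; is the value described above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import matplotlib.pyplot as plt

times, means = [], []
with open("data/audit.log") as f:
    for line in f:
        if "BASELINE_RECALC" in line:
            rec = json.loads(line)
            times.append(rec["timestamp"])
            means.append(rec["effective_mean"])

plt.plot(times, means)
plt.xlabel("time (UTC)")
plt.ylabel("effective_mean (req/s)")
plt.savefig("baseline_over_time.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;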




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. JSON logs are non-negotiable.&lt;/strong&gt; Parsing regex against Nginx's default combined log format is fragile. One unusual user-agent string with spaces and quotes breaks your parser. JSON logs with &lt;code&gt;escape=json&lt;/code&gt; in the Nginx config give you reliable field extraction every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Host networking in Docker is powerful but surprising.&lt;/strong&gt; &lt;code&gt;network_mode: host&lt;/code&gt; means the container shares the host's network stack — &lt;code&gt;iptables&lt;/code&gt; rules apply to the actual server, not a virtual bridge. This is exactly what you want for blocking IPs, but it also means port conflicts are your problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hardcoded thresholds are the enemy.&lt;/strong&gt; "Block after 100 requests per minute" sounds reasonable until your app legitimately serves 200 req/s during peak hours. A rolling baseline that adapts to actual traffic means your thresholds stay meaningful whether you're serving 2 req/s at 3 AM or 50 req/s at noon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Tiered responses prevent self-inflicted outages.&lt;/strong&gt; The first time I tested with aggressive thresholds, my own monitoring IP got permanently banned. Escalating tiers (10m → 30m → 2h → permanent) give false positives a way to recover while still catching persistent abuse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Audit everything.&lt;/strong&gt; When something goes wrong — a legitimate user gets blocked, or an attack slips through — the audit log tells you exactly what the baseline was, what the detector saw, and why it made the decision it did. Without that, you're guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Trojanhorse7/hng-anomaly-detector
&lt;span class="nb"&gt;cd &lt;/span&gt;hng-anomaly-detector
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Set SLACK_WEBHOOK_URL in .env&lt;/span&gt;
docker compose build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nextcloud at &lt;code&gt;http://&amp;lt;VPS_IP&amp;gt;/&lt;/code&gt;, dashboard at &lt;code&gt;http://&amp;lt;VPS_IP&amp;gt;:8080/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thresholds, window sizes, and ban durations are all in &lt;code&gt;detector/config.yaml&lt;/code&gt; — no code changes needed to tune the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Improve
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-IP baselines&lt;/strong&gt; — currently all IPs are compared against the global baseline. High-traffic legitimate IPs (like a CDN edge) could benefit from their own rolling stats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTPS on the dashboard&lt;/strong&gt; — right now it's plain HTTP on 8080. A reverse proxy with TLS would be better for production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus/Grafana&lt;/strong&gt; — the audit log works, but a proper time-series database would make baseline visualization trivial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6&lt;/strong&gt; — the current implementation only handles IPv4 in iptables rules.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built for the &lt;a href="https://hng.tech/internship" rel="noopener noreferrer"&gt;HNG DevOps track&lt;/a&gt;. The full source is at &lt;a href="https://github.com/Trojanhorse7/hng-anomaly-detector" rel="noopener noreferrer"&gt;github.com/Trojanhorse7/hng-anomaly-detector&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>docker</category>
      <category>security</category>
    </item>
  </channel>
</rss>
