<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Okeoghene Akwerigbe</title>
    <description>The latest articles on Forem by Okeoghene Akwerigbe (@okeoghene_akwerigbe_a07a5).</description>
    <link>https://forem.com/okeoghene_akwerigbe_a07a5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903163%2Faca72602-9ff7-43c7-b286-a725297498f3.png</url>
      <title>Forem: Okeoghene Akwerigbe</title>
      <link>https://forem.com/okeoghene_akwerigbe_a07a5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/okeoghene_akwerigbe_a07a5"/>
    <language>en</language>
    <item>
      <title>SwiftDeploy: Building a Declarative Deployment CLI with Observability and OPA Policy Gates</title>
      <dc:creator>Okeoghene Akwerigbe</dc:creator>
      <pubDate>Wed, 06 May 2026 20:56:38 +0000</pubDate>
      <link>https://forem.com/okeoghene_akwerigbe_a07a5/swiftdeploy-building-a-declarative-deployment-cli-with-observability-and-opa-policy-gates-4f10</link>
      <guid>https://forem.com/okeoghene_akwerigbe_a07a5/swiftdeploy-building-a-declarative-deployment-cli-with-observability-and-opa-policy-gates-4f10</guid>
      <description>&lt;p&gt;The idea of SwiftDeploy is really simple: What if I could describe my deployment once, then let a tool generate the infrastructure files for me?&lt;/p&gt;

&lt;p&gt;Most beginner DevOps projects teach you to write a docker-compose.yml, configure Nginx, wire up containers, add health checks, and then remember to keep all of those files in sync. That is useful practice, but in real projects it can become messy very quickly.&lt;/p&gt;

&lt;p&gt;You change a port in one file and forget another. You switch a service from stable to canary in Docker Compose but forget the value in your notes. You edit Nginx by hand and now nobody knows whether the generated config still matches the intended deployment.&lt;/p&gt;

&lt;p&gt;SwiftDeploy is an attempt to solve that problem in a small, understandable way.&lt;/p&gt;

&lt;p&gt;It is a Python CLI tool that reads one file, manifest.yaml, then generates the infrastructure files, deploys the stack, checks policies, exposes metrics, and keeps an audit trail.&lt;/p&gt;

&lt;p&gt;In this post, I will walk through how SwiftDeploy works and how you can rebuild the same idea yourself.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/AkwerigbeO/swiftdeploy" rel="noopener noreferrer"&gt;https://github.com/AkwerigbeO/swiftdeploy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What SwiftDeploy Does&lt;/strong&gt;&lt;br&gt;
SwiftDeploy is a small deployment tool with five main responsibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read a deployment manifest&lt;/li&gt;
&lt;li&gt;Generate Docker Compose and Nginx config files&lt;/li&gt;
&lt;li&gt;Deploy a FastAPI app behind Nginx&lt;/li&gt;
&lt;li&gt;Use OPA policies as safety gates before deploys and promotions&lt;/li&gt;
&lt;li&gt;Expose metrics, show status, and generate an audit report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
|-- app/
|   |-- main.py
|   `-- requirements.txt
|-- policies/
|   |-- infra.rego
|   `-- canary.rego
|-- templates/
|   |-- docker-compose.yml.tpl
|   `-- nginx.conf.tpl
|-- Dockerfile
|-- manifest.yaml
|-- swiftdeploy
`-- README.md

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important thing is that manifest.yaml is the source of truth. The generated files are not meant to be edited directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Idea: One Manifest Controls Everything&lt;/strong&gt;&lt;br&gt;
Here is an example manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swift-odysia:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;canary&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
  &lt;span class="na"&gt;restart_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8844&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;contact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;you@example.com&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;

&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;opa_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://127.0.0.1:8181&lt;/span&gt;
  &lt;span class="na"&gt;infra_min_disk_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;infra_max_cpu_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;
  &lt;span class="na"&gt;canary_max_error_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
  &lt;span class="na"&gt;canary_max_p99_latency_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;canary_window_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file answers the basic deployment questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What image should run?&lt;/li&gt;
&lt;li&gt;What port does the app use?&lt;/li&gt;
&lt;li&gt;Should the app run as stable or canary?&lt;/li&gt;
&lt;li&gt;What port should Nginx expose?&lt;/li&gt;
&lt;li&gt;What Docker network should the containers use?&lt;/li&gt;
&lt;li&gt;What policy limits should be enforced?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the manifest exists, SwiftDeploy can generate the rest.&lt;/p&gt;
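&lt;p&gt;To make that concrete, here is a minimal sketch of the kind of fail-fast check a manifest loader can run before generating anything. This is illustrative, not SwiftDeploy's actual loader; it assumes the YAML has already been parsed into a Python dict, and the required keys simply follow the example manifest above.&lt;/p&gt;

```python
# A minimal sketch of a fail-fast manifest check (illustrative, not
# SwiftDeploy's actual loader). Assumes the YAML has already been
# parsed into a dict; the required keys follow the example manifest.

REQUIRED = {
    "services": ("image", "port", "mode"),
    "nginx": ("image", "port"),
    "network": ("name",),
}

def validate_manifest(manifest):
    """Return a list of problems; an empty list means it is usable."""
    problems = []
    for section, keys in REQUIRED.items():
        body = manifest.get(section)
        if body is None:
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in body:
                problems.append(f"missing key: {section}.{key}")
    mode = manifest.get("services", {}).get("mode")
    if mode is not None and mode not in ("stable", "canary"):
        problems.append(f"mode must be stable or canary, got {mode!r}")
    return problems
```

&lt;p&gt;Validating early means a typo in manifest.yaml fails at init time instead of surfacing later as a broken generated file.&lt;/p&gt;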

&lt;p&gt;&lt;strong&gt;The Design: A Tool That Writes Its Own Infrastructure Files&lt;/strong&gt;&lt;br&gt;
The first command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows PowerShell, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python swiftdeploy init

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command reads manifest.yaml, then renders two template files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;templates/docker-compose.yml.tpl -&amp;gt; docker-compose.yml
templates/nginx.conf.tpl         -&amp;gt; nginx.conf

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template contains placeholders like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;SERVICE_IMAGE&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI replaces that with the value given in the manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swift-odysia:latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole trick. SwiftDeploy is not magically inventing infrastructure. It is using a manifest plus templates to generate consistent config files.&lt;/p&gt;
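&lt;p&gt;The rendering step boils down to placeholder substitution, which can be sketched with plain string replacement. The real templates and placeholder names may differ; this is just the idea.&lt;/p&gt;

```python
# Placeholder substitution sketch: replace every {{NAME}} marker in a
# template with its value from the manifest. Illustrative, not the
# tool's exact rendering code.

def render(template, values):
    """Replace every {{NAME}} placeholder with its manifest value."""
    out = template
    for name, value in values.items():
        out = out.replace("{{" + name + "}}", str(value))
    return out
```

&lt;p&gt;Calling render("image: {{SERVICE_IMAGE}}", {"SERVICE_IMAGE": "swift-odysia:latest"}) produces exactly the generated line shown above.&lt;/p&gt;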

&lt;p&gt;The generated Docker Compose file creates three main containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app: the FastAPI service&lt;/li&gt;
&lt;li&gt;nginx: the public reverse proxy&lt;/li&gt;
&lt;li&gt;opa: the policy engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app container does not publish its port directly. It only uses expose, so traffic must go through Nginx.&lt;/p&gt;

&lt;p&gt;Nginx is the public entry point. It listens on the port from the manifest, forwards traffic to the app, adds useful headers, and returns JSON error bodies for gateway failures.&lt;/p&gt;

&lt;p&gt;OPA runs as a sidecar. The CLI talks to OPA when it needs policy decisions.&lt;/p&gt;

&lt;p&gt;The architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89aageoksz95g065qvhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89aageoksz95g065qvhs.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building the App&lt;/strong&gt;&lt;br&gt;
The application is a small FastAPI service. It supports two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable&lt;/li&gt;
&lt;li&gt;canary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mode comes from an environment variable, MODE=stable or MODE=canary.&lt;/p&gt;

&lt;p&gt;The app exposes four endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET /&lt;/li&gt;
&lt;li&gt;GET /healthz&lt;/li&gt;
&lt;li&gt;GET /metrics&lt;/li&gt;
&lt;li&gt;POST /chaos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root endpoint returns a welcome response with the mode, version, and timestamp.&lt;/p&gt;

&lt;p&gt;The health endpoint returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uptime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canary"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the app is running in canary mode, it adds this header to responses:&lt;br&gt;
X-Mode: canary&lt;/p&gt;

&lt;p&gt;That makes it easy to confirm which version of the service is responding.&lt;/p&gt;
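&lt;p&gt;The health payload and the canary-only header are small enough to sketch as plain functions. This is illustrative, not the app's real FastAPI handlers; the field names mirror the /healthz response shown above.&lt;/p&gt;

```python
# Sketch of the /healthz body and the canary-only X-Mode header,
# written as plain functions rather than real FastAPI handlers.
import time

START = time.monotonic()

def healthz(mode):
    """Return the /healthz body plus any extra response headers."""
    body = {
        "status": "ok",
        "uptime": int(time.monotonic() - START),
        "mode": mode,
    }
    # Only the canary build advertises itself in a header.
    headers = {"X-Mode": "canary"} if mode == "canary" else {}
    return body, headers
```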

&lt;p&gt;&lt;strong&gt;Adding Observability with /metrics&lt;/strong&gt;&lt;br&gt;
Deploying a service is only half the story. You also need to see what it is doing.&lt;/p&gt;

&lt;p&gt;SwiftDeploy exposes a Prometheus-compatible /metrics endpoint. The app tracks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;
&lt;span class="n"&gt;http_request_duration_seconds&lt;/span&gt;
&lt;span class="n"&gt;app_uptime_seconds&lt;/span&gt;
&lt;span class="n"&gt;app_mode&lt;/span&gt;
&lt;span class="n"&gt;chaos_active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request counter uses labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;method
path
status_code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means you can see how many requests went to /healthz, how many hit /, and how many returned 500.&lt;/p&gt;

&lt;p&gt;The latency histogram lets the CLI estimate P99 latency. That becomes important later when deciding whether a canary is healthy enough to promote.&lt;/p&gt;
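&lt;p&gt;For readers who have not worked with Prometheus histograms: each bucket stores a cumulative count of observations at or below an upper bound, and P99 is estimated by interpolating inside the bucket that crosses the 99% mark. A sketch of that estimation, using the same linear-interpolation idea as PromQL's histogram_quantile (the bucket bounds and counts here are illustrative):&lt;/p&gt;

```python
# Estimate P99 from Prometheus histogram buckets, given as
# (upper_bound_seconds, cumulative_count) pairs in ascending order.
# Uses linear interpolation inside the bucket that crosses 99%.

def p99(buckets):
    """Estimate the 99th-percentile latency from cumulative buckets."""
    total = buckets[-1][1]
    if total == 0:
        return 0.0
    target = 0.99 * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # The target observation falls in this bucket: interpolate.
            span = count - prev_count
            frac = (target - prev_count) / span if span else 1.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```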

&lt;p&gt;You can check metrics with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8844/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or in PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-UseBasicParsing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://localhost:8844/metrics&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;Select-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ExpandProperty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Content&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Guardrails (OPA)&lt;/strong&gt;&lt;br&gt;
A deployment tool should not just run commands blindly. It should ask: “Is this safe?”&lt;/p&gt;

&lt;p&gt;That is where OPA, Open Policy Agent, comes in.&lt;/p&gt;

&lt;p&gt;OPA lets you write policy rules in Rego. Instead of hardcoding all safety checks inside the CLI, SwiftDeploy sends facts to OPA and lets OPA decide whether an action is allowed.&lt;/p&gt;

&lt;p&gt;This is an important design choice:&lt;/p&gt;

&lt;p&gt;The CLI gathers information. OPA makes the policy decision.&lt;/p&gt;
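&lt;p&gt;That split can be sketched as two small helpers: one shapes the facts the CLI gathered into an OPA input document, the other interprets OPA's answer. The HTTP POST itself (to OPA's data API) is omitted here so the decision handling stands on its own; the field names match the simplified Rego policy shown later in this post.&lt;/p&gt;

```python
# Sketch of the CLI side of an OPA policy check: package the gathered
# facts as an input document, then read the allow/deny answer. The
# actual HTTP POST to opa_url is omitted; this is illustrative.
import json

def build_opa_input(disk_free_gb, cpu_load, min_disk, max_cpu):
    """Package host facts and manifest thresholds as OPA input."""
    return {"input": {
        "disk_free": disk_free_gb,
        "cpu_load": cpu_load,
        "min_disk": min_disk,
        "max_cpu": max_cpu,
    }}

def decision_allows(response_body):
    """OPA replies like {"result": true}; treat anything else as deny."""
    return json.loads(response_body).get("result") is True
```

&lt;p&gt;Treating a missing result as a deny is the safe default: if the policy cannot be evaluated, the deploy should not proceed.&lt;/p&gt;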

&lt;p&gt;SwiftDeploy has two policy files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;policies/infra.rego&lt;/li&gt;
&lt;li&gt;policies/canary.rego&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They answer different questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Policy&lt;/strong&gt;&lt;br&gt;
The infrastructure policy answers:&lt;/p&gt;

&lt;p&gt;Is the host safe enough for deployment?&lt;/p&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free disk space&lt;/li&gt;
&lt;li&gt;CPU load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The policy denies deployment if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;disk free is less than 10 GB&lt;/li&gt;
&lt;li&gt;CPU load is greater than 2.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those limits come from manifest.yaml, not from the Rego file.&lt;/p&gt;

&lt;p&gt;That matters because policy logic and environment configuration are different things. The Rego file defines the rule. The manifest defines the threshold.&lt;/p&gt;

&lt;p&gt;A simplified version of the logic is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;infra&lt;/span&gt;

&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu_load&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_cpu&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SwiftDeploy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Starts OPA&lt;/li&gt;
&lt;li&gt;Collects host disk and CPU data&lt;/li&gt;
&lt;li&gt;Sends that data to OPA&lt;/li&gt;
&lt;li&gt;Blocks the deploy if OPA denies it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example failure looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEPLOY BLOCKED by infra policy:
- Disk space below minimum threshold
FAIL: Deployment aborted due to policy violations.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is the hard gate. If the environment is unsafe, the deploy stops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canary Safety Policy&lt;/strong&gt;&lt;br&gt;
The canary policy answers:&lt;/p&gt;

&lt;p&gt;Is the canary healthy enough to promote?&lt;/p&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;P99 latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The policy denies promotion if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;error rate is greater than 1%&lt;/li&gt;
&lt;li&gt;P99 latency is greater than 500ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before promoting, SwiftDeploy scrapes /metrics, samples a configured window, calculates the error rate and P99 latency, then sends those facts to OPA.&lt;/p&gt;

&lt;p&gt;The CLI does not decide whether 1% is good or bad. OPA does.&lt;/p&gt;

&lt;p&gt;That keeps the deployment flow flexible. If I want to make the policy stricter later, I can change the policy threshold in the manifest without rewriting the deployment logic.&lt;/p&gt;
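&lt;p&gt;One detail worth spelling out: Prometheus counters only increase, so the error rate over the sampling window comes from the deltas between two /metrics scrapes, not from the raw totals. A sketch of that calculation (illustrative, not SwiftDeploy's exact code):&lt;/p&gt;

```python
# Turn two /metrics scrapes, taken at the edges of the sampling
# window, into an error rate. Counters are cumulative, so only the
# deltas matter.

def error_rate(first, second):
    """first/second: cumulative counts scraped at the window edges."""
    requests = second["total"] - first["total"]
    if requests == 0:
        # No traffic in the window: nothing to judge the canary on.
        return 0.0
    errors = second["errors"] - first["errors"]
    return errors / requests
```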

&lt;p&gt;&lt;strong&gt;Why OPA Isolation Matters&lt;/strong&gt;&lt;br&gt;
OPA is powerful because it answers policy questions. That also means it should not be exposed publicly.&lt;/p&gt;

&lt;p&gt;In SwiftDeploy, Nginx is the public ingress. OPA is bound to 127.0.0.1:8181.&lt;/p&gt;

&lt;p&gt;The CLI can reach it from the host machine, but public traffic through Nginx cannot reach the OPA API.&lt;/p&gt;

&lt;p&gt;That separation matters because users should access the application, not the policy engine.&lt;/p&gt;

&lt;p&gt;You can test that OPA is not leaking through Nginx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-UseBasicParsing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://localhost:8844/v1/data/infra&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That should not return an OPA policy response through the public app port.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying the Stack&lt;/strong&gt;&lt;br&gt;
To replicate the project locally, start by building the app image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; swift-odysia:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate infrastructure files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validate the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy validate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8844/healthz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uptime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canary"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Status View&lt;/strong&gt;&lt;br&gt;
SwiftDeploy includes a status command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a single snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy status &lt;span class="nt"&gt;--count&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status view shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mode&lt;/li&gt;
&lt;li&gt;uptime&lt;/li&gt;
&lt;li&gt;chaos state&lt;/li&gt;
&lt;li&gt;throughput&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;P99 latency&lt;/li&gt;
&lt;li&gt;policy compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;========================================================================
SWIFTDEPLOY STATUS
========================================================================
Time          : 2026-05-06T17:39:30.794723+00:00
Mode          : canary
Uptime        : 10s
Chaos         : 0 (0=none, 1=slow, 2=error)
Throughput    : 0.00 req/s
Error rate    : 0.00%
P99 latency   : 0.100s

Policy Compliance
- infra: PASS - policy allowed
- canary: PASS - policy allowed
========================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every status scrape is appended to history.jsonl. That file becomes the raw audit trail.&lt;/p&gt;
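&lt;p&gt;The JSONL format is worth a quick sketch: one JSON object per line, appended after every scrape, so the file can be grepped, tailed, or replayed later. The record fields here are illustrative, not the tool's exact schema.&lt;/p&gt;

```python
# Sketch of a JSONL audit trail like history.jsonl: one JSON object
# per status scrape, appended as a single line. Field names are
# illustrative.
import io
import json
import datetime

def append_status(fp, mode, error_rate, p99_latency):
    """Append one status record to an open text file or buffer."""
    record = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "mode": mode,
        "error_rate": error_rate,
        "p99_latency": p99_latency,
    }
    fp.write(json.dumps(record) + "\n")

# In the real tool fp would be open("history.jsonl", "a"); a StringIO
# behaves the same way for demonstration.
buf = io.StringIO()
append_status(buf, "canary", 0.0, 0.1)
```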

&lt;p&gt;&lt;strong&gt;The Chaos: Injecting Slow Responses&lt;/strong&gt;&lt;br&gt;
The app has a /chaos endpoint that only works in canary mode.&lt;/p&gt;

&lt;p&gt;To inject slow responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8844/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"slow","duration":2}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then generate traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8844/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ForEach-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-UseBasicParsing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://localhost:8844/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-Null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./swiftdeploy status &lt;span class="nt"&gt;--count&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Chaos: Injecting Errors&lt;/strong&gt;&lt;br&gt;
To inject errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8844/chaos &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"error","rate":0.5}'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..20&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8844/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;1..20 | ForEach-Object {
  try {
    Invoke-WebRequest -UseBasicParsing http://localhost:8844/ | Out-Null
  } catch {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the metrics endpoint records more 500 responses. The status view should show the error rate climbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;br&gt;
The first lesson is that a deployment tool is more than a wrapper around docker compose up.&lt;/p&gt;

&lt;p&gt;A good deployment tool needs to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what should exist&lt;/li&gt;
&lt;li&gt;whether the environment is safe&lt;/li&gt;
&lt;li&gt;whether the app is healthy&lt;/li&gt;
&lt;li&gt;what changed over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second lesson is that generated files are powerful when there is a clear source of truth. By making manifest.yaml the only file operators need to edit, the system becomes easier to reason about.&lt;/p&gt;

&lt;p&gt;The third lesson is that policy belongs in its own layer. The CLI should collect facts, but OPA should make the allow/deny decision. That separation makes the system easier to test and safer to extend.&lt;/p&gt;

&lt;p&gt;The fourth lesson is that canary deployments need metrics. A container can be running and still be a bad candidate for promotion. Error rate and latency tell a better story than health checks alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
SwiftDeploy is not a replacement for Kubernetes, Terraform, or a production deployment platform. It is a learning project that shows the core ideas behind those tools in a smaller package:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;declare desired state&lt;/li&gt;
&lt;li&gt;generate infrastructure&lt;/li&gt;
&lt;li&gt;observe runtime behavior&lt;/li&gt;
&lt;li&gt;enforce safety policies&lt;/li&gt;
&lt;li&gt;keep an audit trail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are learning DevOps, this kind of project is a good way to understand how deployment automation, observability, policy, and reliability fit together.&lt;/p&gt;

&lt;p&gt;The best part is that the idea is simple enough to rebuild:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a manifest&lt;/li&gt;
&lt;li&gt;Add templates&lt;/li&gt;
&lt;li&gt;Write a CLI to render them&lt;/li&gt;
&lt;li&gt;Add health checks&lt;/li&gt;
&lt;li&gt;Add metrics&lt;/li&gt;
&lt;li&gt;Add OPA policy gates&lt;/li&gt;
&lt;li&gt;Add history and audit reporting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is SwiftDeploy in one sentence:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A small deployment CLI that turns one manifest into a running, observable, policy-protected stack.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cli</category>
      <category>devops</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Multi-Layered Defense: Building an Intelligent, Dual-Trigger Firewall</title>
      <dc:creator>Okeoghene Akwerigbe</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:40:45 +0000</pubDate>
      <link>https://forem.com/okeoghene_akwerigbe_a07a5/multi-layered-defense-building-an-intelligent-dual-trigger-firewall-27kc</link>
      <guid>https://forem.com/okeoghene_akwerigbe_a07a5/multi-layered-defense-building-an-intelligent-dual-trigger-firewall-27kc</guid>
      <description>&lt;p&gt;In modern DevOps, a simple rate-limiter isn't enough. If you set a hard limit of "10 requests per second," what happens when your product goes viral and legitimate customers hit that limit? You end up blocking the exact people you want to serve. Professional security requires intelligence: systems that adapt to your traffic patterns and can tell the difference between a busy hour and a malicious botnet.&lt;/p&gt;

&lt;p&gt;For my &lt;strong&gt;HNG Stage 3&lt;/strong&gt; project, I decided to move beyond basic firewalls. I built a Python-based &lt;strong&gt;Anomaly Detection Engine&lt;/strong&gt; that uses a &lt;strong&gt;Dual-Trigger System&lt;/strong&gt; to protect a server in real-time.&lt;/p&gt;

&lt;p&gt;Here is a deep dive into the architecture, the math, and the automation behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Architecture: Real-Time Log Tailing&lt;/strong&gt;&lt;br&gt;
The foundation of the engine is a continuous loop that monitors the server's pulse. Instead of acting as a proxy that traffic must pass through (which can slow things down), this engine sits entirely out of the way.&lt;/p&gt;

&lt;p&gt;It uses a Python script to "tail" Nginx JSON access logs asynchronously. The moment a request hits Nginx, my script reads the log entry, extracts the IP address, and immediately begins evaluating it. This means the engine is lightweight and doesn't add any latency to the actual web application.&lt;/p&gt;
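&lt;p&gt;A rough sketch of that tailing loop might look like this. It assumes the Nginx &lt;code&gt;log_format&lt;/code&gt; emits JSON with a &lt;code&gt;remote_addr&lt;/code&gt; field; the field name and file path are illustrative assumptions, not the exact production code.&lt;/p&gt;

```python
import json
import time

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # jump to the end: only new entries matter
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)  # nothing new yet; back off briefly
                continue
            yield line

def extract_ip(line):
    """Pull the client IP out of one JSON-format access-log entry."""
    try:
        return json.loads(line).get("remote_addr")
    except json.JSONDecodeError:
        return None  # skip partial or malformed lines

# Typical wiring (path is a placeholder):
# for line in follow("/var/log/nginx/access.json.log"):
#     ip = extract_ip(line)
```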

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wdsonq5gen1lxlxro5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wdsonq5gen1lxlxro5k.png" alt="Architecture Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzudjtlmgtseh9s8e66yv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzudjtlmgtseh9s8e66yv.png" alt="Python Script tailing logs running"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the Sliding Window Works&lt;/strong&gt;&lt;br&gt;
To detect attacks, the engine needs to know how many requests are happening &lt;strong&gt;right now&lt;/strong&gt;, not just how many happened in the last full minute. A normal per-minute counter would be too slow because an attacker could send a burst of traffic and disappear before the next minute ends.&lt;/p&gt;

&lt;p&gt;To solve this, I used a sliding window with Python’s &lt;code&gt;deque&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Think of a &lt;code&gt;deque&lt;/code&gt; like a queue of timestamps. Every time a request comes in, the detector stores the current time inside the queue. Before calculating the request rate, it removes every timestamp older than 60 seconds. Whatever remains in the queue represents traffic from the last 60 seconds only.&lt;/p&gt;

&lt;p&gt;Here is the basic idea:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detector keeps two types of sliding windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One global window for all requests hitting the server.&lt;/li&gt;
&lt;li&gt;One per-IP window for each source IP address.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps the engine detect two different situations. If one IP is sending too many requests, it can be blocked directly. If traffic from many IPs rises at the same time, the engine treats it as a global traffic spike and sends a Slack alert without banning everyone.&lt;/p&gt;
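&lt;p&gt;The two kinds of windows can be sketched together like this; a minimal illustration of the idea, not the exact engine code:&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60

class SlidingWindows:
    """One global request window plus one window per source IP."""

    def __init__(self):
        self.global_window = deque()
        self.ip_windows = defaultdict(deque)

    def _prune(self, window, now):
        # Drop timestamps that have slid out of the 60-second window.
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()

    def record(self, ip, now=None):
        """Record one request; return (global_rate, per_ip_rate) in req/s."""
        now = time.time() if now is None else now
        for window in (self.global_window, self.ip_windows[ip]):
            self._prune(window, now)
            window.append(now)
        return (len(self.global_window) / WINDOW_SECONDS,
                len(self.ip_windows[ip]) / WINDOW_SECONDS)
```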

&lt;p&gt;&lt;strong&gt;2. The Dual-Trigger System&lt;/strong&gt;&lt;br&gt;
What makes this engine robust is that it doesn't rely on a single detection method. It evaluates every IP address through two distinct logic paths simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger A: The Statistical "Brain" (Z-Score)&lt;/strong&gt;&lt;br&gt;
This trigger is designed to catch "stealthy" attacks—bots that scrape your site slowly to avoid triggering basic alarms. It works by calculating a &lt;strong&gt;Baseline Mean&lt;/strong&gt; (your historical average traffic) and a &lt;strong&gt;Standard Deviation&lt;/strong&gt; (the normal "wobble" or fluctuation of your traffic).&lt;/p&gt;

&lt;p&gt;The engine evaluates incoming traffic using the Z-Score formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0al4zgdprkddqvcaxzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0al4zgdprkddqvcaxzg.png" alt="Z-Score formula"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the Baseline Learns From Traffic&lt;/strong&gt;&lt;br&gt;
The baseline is what tells the detector what “normal” looks like. Instead of hardcoding a fixed number like “10 requests per second,” the engine watches real traffic and learns from it.&lt;/p&gt;

&lt;p&gt;It stores per-second request counts over a rolling 30-minute window. Every 60 seconds, it recalculates the average request rate and the standard deviation. This means the baseline keeps adjusting as traffic changes throughout the day.&lt;/p&gt;

&lt;p&gt;For example, if the server normally gets 1 request per second at night but 8 requests per second during a busy hour, the detector should not treat both periods the same way. A fixed threshold would either be too strict during busy hours or too weak during quiet hours.&lt;/p&gt;

&lt;p&gt;The engine also keeps hourly traffic slots. When the current hour has enough data, it prefers that hour’s baseline because traffic patterns can change depending on the time of day. This makes the detector more adaptive and reduces false positives.&lt;/p&gt;

&lt;p&gt;In a normal distribution, roughly 99.7% of all activity falls within a Z-Score of 3.0. So when an IP's traffic generates a Z-Score of 3.0 or more (which my engine frequently caught during testing), it is statistically almost certain this isn't a normal user, it's an anomaly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger B: The Volumetric "Shield" (Rate Multiplier)&lt;/strong&gt;&lt;br&gt;
While the Z-Score is brilliant for complex patterns, it takes a few seconds of data to calculate. What if an attacker tries to crash the server instantly with a massive flood of traffic?&lt;/p&gt;

&lt;p&gt;That is where the &lt;strong&gt;Rate Multiplier&lt;/strong&gt; comes in. This is a fail-safe configured with a multiplier of 5. It constantly compares the current traffic to the baseline. If your normal traffic is 1.0 request per second, and an IP suddenly spikes to over 5.0 requests per second, this trigger trips immediately. It acts as an emergency brake before the Z-Score even has time to finish its math.&lt;/p&gt;
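&lt;p&gt;Put together, the dual-trigger check is short. The baseline value and function name here are illustrative; the multiplier of 5 and the 3.0 Z-Score threshold are the ones described above.&lt;/p&gt;

```python
BASELINE_MEAN = 1.0   # learned req/s (assumed value for this sketch)
RATE_MULTIPLIER = 5   # volumetric fail-safe
Z_THRESHOLD = 3.0     # statistical trigger

def check_triggers(current_rate, z):
    """Return which trigger fired for this IP, or None."""
    if current_rate > RATE_MULTIPLIER * BASELINE_MEAN:
        return "rate-multiplier"  # emergency brake trips first
    if z >= Z_THRESHOLD:
        return "z-score"          # statistical anomaly
    return None
```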

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvmzmhqqgmjdw4vv4sa6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvmzmhqqgmjdw4vv4sa6.png" alt="Screenshot showing Rate Multiplier and Z-score"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Automated Response: iptables and Docker Routing&lt;/strong&gt;&lt;br&gt;
Detecting a threat is only half the battle; the system must neutralize it without human intervention.&lt;/p&gt;

&lt;p&gt;When a trigger fires, the Python engine communicates directly with the Linux kernel's firewall (iptables). However, because modern applications run inside Docker containers, standard firewall rules often fail. Docker aggressively rewrites network rules, which means blocking an IP on the standard INPUT chain won't work—the traffic will slip right past it.&lt;/p&gt;

&lt;p&gt;To solve this, my engine targets the DOCKER-USER chain. By inserting a DROP rule here, the malicious IP is blocked at the lowest possible kernel level before the traffic is even allowed to route toward the Docker container.&lt;/p&gt;
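&lt;p&gt;A minimal version of that ban step is below. It inserts a DROP rule at position 1 of the DOCKER-USER chain so it is evaluated before Docker's own rules; running it for real requires root, so this sketch defaults to a dry run.&lt;/p&gt;

```python
import subprocess

def ban_ip(ip, dry_run=True):
    """Build (and optionally apply) a DROP rule in the DOCKER-USER chain."""
    cmd = ["iptables", "-I", "DOCKER-USER", "1", "-s", ip, "-j", "DROP"]
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root privileges
    return cmd
```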

&lt;p&gt;&lt;strong&gt;The Escalation Ladder&lt;/strong&gt;&lt;br&gt;
Security shouldn't be entirely unforgiving. I programmed an automated background worker (the "Unbanner") to manage a backoff schedule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First Offense:&lt;/strong&gt; The IP is banned for exactly 10 minutes to cool off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat Offenses:&lt;/strong&gt; If the IP returns and attacks again, the penalty increases to 30 minutes, and then to 2 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Final Strike:&lt;/strong&gt; On the fourth offense, the engine flags the IP as purely hostile and issues a PERMANENT ban (999,999 minutes).&lt;/p&gt;
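&lt;p&gt;The ladder itself is just a lookup over the schedule described above:&lt;/p&gt;

```python
BAN_MINUTES = [10, 30, 120, 999_999]  # 4th offense onward is effectively permanent

def ban_duration(offense_count):
    """Minutes to ban for the nth offense (1-indexed)."""
    index = min(offense_count, len(BAN_MINUTES)) - 1
    return BAN_MINUTES[index]
```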

&lt;p&gt;&lt;strong&gt;4. Full Visibility: Live Dashboards &amp;amp; Instant Alerts&lt;/strong&gt;&lt;br&gt;
You can't secure what you can't see. To ensure the engine was performing correctly, I built a full observability suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live Metrics:&lt;/strong&gt; A web dashboard that plots a sliding window of the Current Requests Per Second against the Baseline Mean, making it incredibly easy to visualize traffic spikes in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active Threats:&lt;/strong&gt; A "Currently Banned" panel that lists the IPs currently sitting in the iptables penalty box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slack Webhooks:&lt;/strong&gt; Every time a block occurs, the engine constructs a JSON payload and fires it to a Slack channel. The alert details the offending IP, the exact trigger that caught them (e.g., Rate Multiplier), and how long they are banned for.&lt;/p&gt;
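&lt;p&gt;A minimal sender for such an alert could look like this, using Slack's incoming-webhook format. The webhook URL is a placeholder and the message wording is an assumption, not the engine's exact payload.&lt;/p&gt;

```python
import json
import urllib.request

def build_alert(ip, trigger, ban_minutes):
    """JSON payload describing a block, in Slack incoming-webhook format."""
    return {
        "text": (f":rotating_light: Banned {ip} "
                 f"(trigger: {trigger}, duration: {ban_minutes} min)")
    }

def send_alert(webhook_url, payload):
    # webhook_url is a placeholder for a real Slack incoming-webhook URL
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```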

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5pvlt5x6ih8e19i25e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5pvlt5x6ih8e19i25e8.png" alt="Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Building this engine fundamentally changed how I view server security. Moving from static, hard-coded rules to dynamic, math-driven logic allows us to build systems that scale securely. The &lt;strong&gt;Dual-Trigger System&lt;/strong&gt; covers both bases: the Z-Score outsmarts the slow, sneaky attacks, while the Rate Multiplier stops brute-force floods dead in their tracks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the live metrics dashboard at:&lt;/strong&gt; metrics.okeakwerigbe.name.ng&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>python</category>
      <category>security</category>
    </item>
  </channel>
</rss>
