<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adewumi Victor</title>
    <description>The latest articles on Forem by Adewumi Victor (@adewumicrown).</description>
    <link>https://forem.com/adewumicrown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904453%2F4a0c8b14-8f53-4721-8f5d-660aeca00624.jpg</url>
      <title>Forem: Adewumi Victor</title>
      <link>https://forem.com/adewumicrown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adewumicrown"/>
    <language>en</language>
    <item>
      <title>Beyond Automation: Building a Policy-Gated Deployment Engine with OPA and Prometheus</title>
      <dc:creator>Adewumi Victor</dc:creator>
      <pubDate>Wed, 06 May 2026 21:10:44 +0000</pubDate>
      <link>https://forem.com/adewumicrown/beyond-automation-building-a-policy-gated-deployment-engine-with-opa-and-prometheus-c1l</link>
      <guid>https://forem.com/adewumicrown/beyond-automation-building-a-policy-gated-deployment-engine-with-opa-and-prometheus-c1l</guid>
      <description>&lt;p&gt;When you first start in DevOps, you think the goal is a "Green Pipeline." You want the code to build, the container to start, and the URL to load. But as systems scale, a "Green Pipeline" can actually be a disaster.&lt;/p&gt;

&lt;p&gt;What if your container starts, but the server is out of disk space? What if the code runs, but it’s so slow that users give up? In this project, I moved beyond simple automation to Governance. I built SwiftDeploy: a deployment tool that doesn't just act; it thinks.&lt;/p&gt;

&lt;p&gt;The Core Problem: The "Blind" Deployment&lt;br&gt;
Most basic CI/CD setups are "blind." They push code and hope for the best. If the environment is unhealthy, the deployment crashes. If the code is buggy, the users suffer.&lt;/p&gt;

&lt;p&gt;To solve this, I added three human-like qualities to my deployment script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eyes (Observability): The ability to see how the app is performing in real time.&lt;/li&gt;
&lt;li&gt;A Brain (Policy Enforcement): A central logic engine to decide if a deployment is safe.&lt;/li&gt;
&lt;li&gt;A Memory (Auditing): A permanent record of every decision made.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 1: The Architecture of Safety&lt;br&gt;
I designed SwiftDeploy as a multi-container ecosystem. Instead of one big app, I used a Sidecar Pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CLI: My Python-based orchestrator.&lt;/li&gt;
&lt;li&gt;The App: Instrumented with Prometheus metrics.&lt;/li&gt;
&lt;li&gt;Open Policy Agent (OPA): Our "Supreme Court." It holds the laws (policies) and gives the final verdict on deployments.&lt;/li&gt;
&lt;li&gt;Nginx: The traffic controller that handles the switch from a "Canary" (test version) to "Stable" (live version).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 2: Implementing the "Eyes" (Instrumentation)&lt;br&gt;
To make the app observable, I used the Prometheus text format. I modified my API to expose a /metrics endpoint. This isn't just a status page; it’s a high-speed stream of data tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput: How many people are using the app?&lt;/li&gt;
&lt;li&gt;Error Rate: Is the app failing?&lt;/li&gt;
&lt;li&gt;Latency (P99): Is the app slow for the unluckiest 1% of users?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why P99? Average latency is a lie. If 99 users get a 1ms response and 1 user gets a 10-second response, the average is only about 101ms, which looks fine on a dashboard, yet you’re still losing 1% of your customers. SwiftDeploy looks at the P99 to ensure everyone has a good experience.&lt;/p&gt;
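
&lt;p&gt;To make this concrete, here is a minimal sketch of that kind of instrumentation using the prometheus_client library. The metric names and port are illustrative, not necessarily the exact ones SwiftDeploy uses:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: expose a /metrics endpoint with a request counter and a
# latency histogram (the histogram is what P99 is computed from).
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

@LATENCY.time()               # records each call's duration in the histogram
def handle_request():
    # ... real request handling goes here ...
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on port 8000
&lt;/code&gt;&lt;/pre&gt;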

&lt;p&gt;Step 3: Implementing the "Brain" (OPA &amp;amp; Rego)&lt;br&gt;
This was the most exciting part. I used Open Policy Agent (OPA), which evaluates policies written in a declarative language called Rego.&lt;/p&gt;

&lt;p&gt;The magic here is Decoupling. My CLI doesn't have hardcoded rules like if disk &amp;lt; 10GB. Instead, the CLI asks OPA: "Here is the current disk space and the user's requirements. Should I deploy?"&lt;/p&gt;
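
&lt;p&gt;That conversation happens over OPA's standard REST Data API. Here is a sketch of the "ask" from the CLI's side; the policy path (swiftdeploy/infra) and the allow/reason fields are assumptions that depend on how the Rego package is written:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

def ask_opa(disk_free_gb, required_gb):
    # POST the facts to OPA's Data API: /v1/data/&amp;lt;package path&amp;gt;
    resp = requests.post(
        "http://localhost:8181/v1/data/swiftdeploy/infra",
        json={"input": {"disk_free_gb": disk_free_gb,
                        "required_gb": required_gb}},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json().get("result", {})

decision = ask_opa(disk_free_gb=8.2, required_gb=10)
if not decision.get("allow", False):
    raise SystemExit("Deployment denied: " + str(decision.get("reason")))
&lt;/code&gt;&lt;/pre&gt;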

&lt;p&gt;I wrote two distinct policy domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure Policy: Guards the "physical" health (Disk, CPU, RAM).&lt;/li&gt;
&lt;li&gt;Canary Policy: Guards the "performance" health (Error rates, Latency).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 4: The Gated Lifecycle in Action&lt;br&gt;
When I run ./swiftdeploy promote, a complex dance happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrape: The CLI hits the Canary’s /metrics endpoint.&lt;/li&gt;
&lt;li&gt;Calculate: It computes the current Error Rate and P99 Latency.&lt;/li&gt;
&lt;li&gt;Consult: It sends this data to OPA.&lt;/li&gt;
&lt;li&gt;Decision: If OPA sees that the Error Rate is &amp;gt; 1%, it returns a DENY with a reason.&lt;/li&gt;
&lt;li&gt;Stop: The CLI aborts the promotion, keeping the "Stable" version safe and sound.&lt;/li&gt;
&lt;/ol&gt;
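
&lt;p&gt;Condensed into code, the scrape-calculate-consult loop looks roughly like this. The URLs, metric name, and OPA package path are illustrative, and the P99 computation is omitted for brevity:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests
from prometheus_client.parser import text_string_to_metric_families

def scrape_error_rate(metrics_url):
    # Scrape: pull the canary's raw Prometheus text and tally requests.
    total, errors = 0.0, 0.0
    body = requests.get(metrics_url, timeout=5).text
    for family in text_string_to_metric_families(body):
        for sample in family.samples:
            if sample.name == "http_requests_total":
                total += sample.value
                if sample.labels.get("status", "").startswith("5"):
                    errors += sample.value
    return errors / total if total else 0.0

# Calculate, then Consult: hand the number to OPA and obey the verdict.
error_rate = scrape_error_rate("http://canary:8000/metrics")
verdict = requests.post(
    "http://localhost:8181/v1/data/swiftdeploy/canary",
    json={"input": {"error_rate": error_rate}}, timeout=5,
).json().get("result", {})
if not verdict.get("allow", False):
    print("Promotion aborted:", verdict.get("reason"))   # Stop
&lt;/code&gt;&lt;/pre&gt;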

&lt;p&gt;Step 5: Chaos Engineering (Testing the Guardrails)&lt;br&gt;
A safety system is only good if it’s tested. I intentionally "broke" my environment to see if SwiftDeploy would catch it.&lt;/p&gt;

&lt;p&gt;Scenario A: I manually lowered the disk threshold in my config. Result: SwiftDeploy blocked the deployment immediately with a clear error message.&lt;/p&gt;

&lt;p&gt;Scenario B: I injected "Chaos" into the Canary, forcing it to return 500 errors. Result: The CLI refused to promote the Canary, saving the production environment from a faulty update.&lt;/p&gt;
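
&lt;p&gt;The "Chaos" in Scenario B can be as simple as a failure flag in the canary's request handler. A hypothetical sketch; the CHAOS variable and the 50% failure rate are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os, random

def handle_request():
    # When CHAOS=1, roughly half of all requests fail with a 500,
    # which pushes the error rate far past the 1% policy ceiling.
    if os.environ.get("CHAOS") == "1" and random.random() &amp;lt; 0.5:
        return "Internal Server Error", 500
    return "OK", 200
&lt;/code&gt;&lt;/pre&gt;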

&lt;p&gt;Step 6: The Audit Trail (The "Memory")&lt;br&gt;
In a professional setting, you need to know why a deployment failed three weeks ago. SwiftDeploy solves this by generating an audit_report.md.&lt;br&gt;
Every time the CLI checks a policy, it logs the input, the decision, and the reasoning into a history.jsonl file. This creates a transparent, append-only timeline of the system's health.&lt;/p&gt;
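
&lt;p&gt;The logging itself can be as simple as appending one JSON object per line; a sketch, with the record fields chosen for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, time

def audit(input_doc, decision, path="history.jsonl"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "input": input_doc,       # what we told OPA
        "decision": decision,     # what OPA said, and why
    }
    with open(path, "a") as f:    # append-only: never rewrite history
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;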

&lt;p&gt;Conclusion: What I Learned&lt;br&gt;
Building SwiftDeploy taught me that DevOps is about trust.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You trust your metrics to tell the truth.&lt;/li&gt;
&lt;li&gt;You trust your policies to enforce the rules.&lt;/li&gt;
&lt;li&gt;You trust your automation to stay stopped when things go wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By separating the "How" (Docker/Nginx) from the "Why" (OPA/Rego), I’ve built a tool that is ready for the complexities of modern cloud-native engineering.&lt;br&gt;
You can check out all the project code in my repo at &lt;a href="https://github.com/Adewumicrown/swiftdeploy" rel="noopener noreferrer"&gt;https://github.com/Adewumicrown/swiftdeploy&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cicd</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Building a Self-Learning DDoS Guard</title>
      <dc:creator>Adewumi Victor</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:40:22 +0000</pubDate>
      <link>https://forem.com/adewumicrown/building-a-self-learning-ddos-guard-4jd4</link>
      <guid>https://forem.com/adewumicrown/building-a-self-learning-ddos-guard-4jd4</guid>
<description>&lt;p&gt;Real-Time Anomaly Detection with Python&lt;br&gt;
By Victor • HNG DevSecOps Project Case Study&lt;/p&gt;

&lt;p&gt;In the modern web landscape, static rate limiting is often a blunt instrument. While it can stop basic brute-force attacks, it struggles with sophisticated, low-and-slow DDoS attacks or sudden legitimate traffic spikes. For my latest HNG DevSecOps project, I built a dynamic Anomaly Detection &amp;amp; DDoS Engine that learns from your traffic patterns and defends your AWS infrastructure in real time.&lt;/p&gt;

&lt;p&gt;The Problem: Why Static Limits Fail&lt;br&gt;
Most developers set a hard limit: "Allow 100 requests per minute." But what happens at 2:00 AM when your server is usually empty? A sudden burst of 90 requests per minute from a single IP might be an attack, yet it passes under the radar. Conversely, during a Black Friday sale, 150 requests might be perfectly normal. I needed a system that understood context.&lt;/p&gt;

&lt;p&gt;The Solution: Statistical Learning&lt;br&gt;
The heart of this engine is a Python-based daemon that "learns" what normal traffic looks like for every hour of the day. It uses two key mathematical concepts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Rolling Baseline&lt;br&gt;
Instead of hardcoded numbers, the engine maintains a 30-minute rolling window of traffic metrics. It calculates the mean and standard deviation for every hour slot. This allows the system to distinguish between a busy Monday afternoon and a quiet Sunday night.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Z-Score&lt;br&gt;
To identify an anomaly, we calculate the Z-Score of incoming traffic. The formula is:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;z = (x - μ) / σ&lt;/p&gt;

&lt;p&gt;Where x is the current traffic rate, μ is the learned mean, and σ is the standard deviation. If z exceeds 3.0, the system flags the IP as an anomaly.&lt;/p&gt;
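
&lt;p&gt;A minimal sketch of that check in Python, using a rolling window. The per-hour baseline slots are omitted for brevity; the 30-sample window and 3.0 threshold follow the design above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import deque
from statistics import mean, stdev

WINDOW = deque(maxlen=30)   # last 30 one-minute samples (~30-minute window)

def is_anomaly(current_rate, threshold=3.0):
    if len(WINDOW) &amp;gt;= 2:
        mu, sigma = mean(WINDOW), stdev(WINDOW)
        z = (current_rate - mu) / sigma if sigma else 0.0
    else:
        z = 0.0                  # not enough history to judge yet
    WINDOW.append(current_rate)  # keep learning from every sample
    return z &amp;gt; threshold
&lt;/code&gt;&lt;/pre&gt;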

&lt;p&gt;The Architecture&lt;br&gt;
The project is deployed on AWS EC2 using a Dockerized stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nginx: Acts as the frontline, logging every request in a structured JSON format.&lt;/li&gt;
&lt;li&gt;Nextcloud: Our sample application being protected.&lt;/li&gt;
&lt;li&gt;Python Detector: The "Brain." It tails the Nginx logs, performs statistical analysis, and makes decisions.&lt;/li&gt;
&lt;/ul&gt;
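
&lt;p&gt;The "tailing" part of the detector can be done with a small generator that follows the log file as Nginx writes it. The log path and field name below are assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json, time

def tail(path):
    # Follow a growing log file, yielding one parsed JSON entry per line.
    with open(path) as f:
        f.seek(0, 2)              # jump to the end; only read new lines
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)   # wait for Nginx to write more
                continue
            yield json.loads(line)

for entry in tail("/var/log/nginx/access.json"):
    ip = entry.get("remote_addr")
    # ...feed this request into the per-IP rolling counters...
&lt;/code&gt;&lt;/pre&gt;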

&lt;p&gt;Active Defense with Iptables&lt;br&gt;
Detection is useless without action. When an IP is flagged, the engine doesn't just send an alert; it executes a system-level command using iptables to DROP all traffic from that IP. To ensure we don't block legitimate users forever, I implemented an Unbanner module. It follows an exponential backoff schedule: 10 minutes, then 30 minutes, then 2 hours, before finally issuing a permanent ban for repeat offenders.&lt;/p&gt;
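
&lt;p&gt;The enforcement side reduces to two iptables calls plus the backoff schedule. The rule syntax is standard iptables; the helper names are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import subprocess

BACKOFF = [600, 1800, 7200]   # 10 min, 30 min, 2 h (in seconds)

def ban(ip):
    # Insert a rule at the top of the INPUT chain to drop all traffic.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                   check=True)

def unban(ip):
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"],
                   check=True)

def ban_duration(offense_count):
    # None means the schedule is exhausted: the ban becomes permanent.
    if offense_count &amp;lt; len(BACKOFF):
        return BACKOFF[offense_count]
    return None
&lt;/code&gt;&lt;/pre&gt;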

&lt;p&gt;Real-Time Visibility&lt;br&gt;
I integrated a Slack notification system to keep the DevOps team informed. Whether it’s a specific IP being banned, a global traffic surge, or an automatic unban, the team receives a formatted alert within seconds. Additionally, a Flask-based dashboard provides a live look at current metrics and system health.&lt;/p&gt;
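
&lt;p&gt;The Slack side uses a standard incoming webhook, which accepts a simple JSON payload. The webhook URL below is a placeholder you would load from configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

def notify(text, webhook_url):
    # Incoming webhooks accept {"text": ...} and return "ok" on success.
    resp = requests.post(webhook_url, json={"text": text}, timeout=5)
    resp.raise_for_status()

notify(":no_entry: Banned 203.0.113.7 (z-score 5.2)",
       webhook_url="https://hooks.slack.com/services/...")
&lt;/code&gt;&lt;/pre&gt;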

&lt;p&gt;Conclusion&lt;br&gt;
Securing infrastructure is not just about building walls; it's about building systems that can think. By combining Python’s data processing power with Linux’s networking tools, I've created a resilient, self-correcting defense mechanism that scales its sensitivity based on actual usage patterns.&lt;br&gt;
The code for this project is open-source and available on GitHub at &lt;a href="https://github.com/Adewumicrown/hng-anomaly-detector" rel="noopener noreferrer"&gt;https://github.com/Adewumicrown/hng-anomaly-detector&lt;/a&gt; for anyone looking to try it on their own.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>security</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
