<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Feyisayo Lasisi</title>
    <description>The latest articles on Forem by Feyisayo Lasisi (@feyisayo).</description>
    <link>https://forem.com/feyisayo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3591806%2F30f705bd-37af-4006-82b4-0808abd83a56.png</url>
      <title>Forem: Feyisayo Lasisi</title>
      <link>https://forem.com/feyisayo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/feyisayo"/>
    <language>en</language>
    <item>
      <title>Building a PCI-DSS Compliant DevSecOps CI/CD Pipeline for a Fintech Using .NET</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:21:13 +0000</pubDate>
      <link>https://forem.com/feyisayo/building-a-pci-dss-compliant-devsecops-cicd-pipeline-for-a-fintech-using-net-35bn</link>
      <guid>https://forem.com/feyisayo/building-a-pci-dss-compliant-devsecops-cicd-pipeline-for-a-fintech-using-net-35bn</guid>
      <description>&lt;p&gt;A deep dive into the security scanning pipeline that runs on every pull request before a single line of code reaches staging or production&lt;br&gt;
In my last article, I described how I designed a smart CI pipeline trigger for a .NET monorepo containing 11 APIs. The core idea was to make the pipeline aware of what actually changed before deciding how much of the test suite to run. That article focused on efficiency. This one focuses on security.&lt;br&gt;
Here, I will walk through what actually happens once the pipeline is triggered: how the application is built, how it is attacked, how its dependencies are interrogated, and how its code quality is evaluated before a single line of code is allowed to reach the staging or production environment.&lt;br&gt;
The pipeline is designed to be PCI-DSS compliant, which means security is not an afterthought bolted onto the end of the process. It is embedded into every stage.&lt;br&gt;
&lt;strong&gt;The Pipeline Trigger&lt;/strong&gt;&lt;br&gt;
The CI pipeline is triggered by a pull request to either the staging or main branch. This is a deliberate design decision. The goal is to ensure that before any merge to these critical branches, the incoming commit passes the full suite of security and quality checks. At no point should a vulnerability or a failing test be allowed to land on a protected branch undetected. The main and staging branches are kept clean at all times.&lt;br&gt;
For this article, I will focus on the Layer 2 scenario from the smart detection step: a change is detected in a shared dependency directory such as Core or Persistence. This is the most exhaustive scenario because it triggers the pipeline to build and test the entire suite of 11 APIs.&lt;br&gt;
&lt;strong&gt;Stage 1: Building the Application&lt;/strong&gt;&lt;br&gt;
Once the detect-changes job determines that the full suite needs to run, the pipeline proceeds to the build-app job. This job depends on detect-changes completing successfully before it can start.&lt;br&gt;
In this stage, all 11 APIs are built in parallel using the GitHub Actions matrix strategy, along with the Core and Persistence projects. The build process follows four steps for each application: it restores all NuGet dependencies, compiles and builds the application, publishes the build output, and finally uploads the compiled artifact to GitHub Actions artifact storage. Each artifact is tagged with the name of the application it belongs to, which serves as the identifier the matrix uses to route subsequent jobs to the correct artifact.&lt;br&gt;
The reason for uploading build artifacts at this stage rather than rebuilding at each subsequent step is significant. Without this, every security scan job would need to compile the entire application from scratch before it could run its tests. Across 11 APIs and multiple scan types, that would add a considerable amount of redundant build time to every pipeline run. By building once and sharing the artifact, all downstream jobs can skip straight to their core function.&lt;br&gt;
If any one of the 11 applications fails to build, the entire pipeline fails immediately. There is no point running security scans on code that does not compile.&lt;/p&gt;
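&lt;p&gt;As a rough sketch (not the production file), the build-app job described above could look like the following. The project paths, the .NET version, and the API names are illustrative assumptions; the matrix is consumed from the detect-changes output.&lt;/p&gt;

```yaml
# Sketch of the build-app job; the matrix is assumed to arrive from
# detect-changes as a JSON list of project names (names are illustrative).
build-app:
  needs: detect-changes
  runs-on: ubuntu-latest
  strategy:
    matrix:
      api: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-dotnet@v4
      with:
        dotnet-version: '8.0.x'
    - name: Restore NuGet dependencies
      run: dotnet restore src/${{ matrix.api }}
    - name: Build
      run: dotnet build src/${{ matrix.api }} --no-restore -c Release
    - name: Publish
      run: dotnet publish src/${{ matrix.api }} -c Release -o publish/${{ matrix.api }}
    - name: Upload the artifact, tagged with the API name
      uses: actions/upload-artifact@v4
      with:
        name: ${{ matrix.api }}
        path: publish/${{ matrix.api }}
```

&lt;p&gt;Uploading each artifact under its API's name is what lets every downstream scan job download exactly the artifact it needs.&lt;/p&gt;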

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhscejp70pz903dwt7bqp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhscejp70pz903dwt7bqp.jpg" alt=" " width="800" height="240"&gt;&lt;/a&gt;&lt;strong&gt;Stage 2&lt;/strong&gt;: Dynamic Application Security Testing (DAST) with OWASP ZAP&lt;br&gt;
Once the build artifacts are available, the DAST job runs. This job depends on both detect-changes and build-app completing successfully.&lt;br&gt;
DAST is a category of security testing that evaluates a running application from the outside, simulating the behaviour of a real attacker. Unlike static analysis, which reads source code, DAST probes the application while it is actually executing and looks for vulnerabilities that only manifest at runtime. The types of issues it targets include SQL Injection, Cross-Site Scripting (XSS), broken authentication flows, security misconfigurations, and sensitive data exposure.&lt;br&gt;
For this pipeline, I use OWASP ZAP (Zed Attack Proxy) to carry out the DAST scan. ZAP is an open source tool maintained by the Open Worldwide Application Security Project and is widely used in both manual penetration testing and automated security pipelines.&lt;br&gt;
The job works as follows: it first downloads the build artifact for each API in the matrix, extracts the compiled .dll, and starts the application directly on the GitHub Actions runner VM. A separate step then verifies that the application is reachable and responding before ZAP begins its scan, because scanning an unreachable application would produce meaningless results.&lt;br&gt;
The scan that runs within the pipeline is a ZAP baseline scan rather than a full active scan. The reason for this is time. A full active scan on 11 APIs within a CI pipeline would be prohibitively slow and would block engineers from merging for an unacceptable amount of time. The baseline scan is fast, catches the most critical runtime vulnerabilities, and is appropriate for a PR gate.&lt;br&gt;
The full deep scan is handled separately by a scheduled cron job that runs routinely against the codebase outside of the normal PR flow. This deep scan generates a detailed report and triggers an alert if any application fails.&lt;br&gt;
After each baseline scan, ZAP produces a report that is uploaded as an artifact to GitHub Actions for review. The pipeline evaluates the results using ZAP's risk code system. A risk code of 2 (medium severity) or 3 (high severity) causes the pipeline to return an exit code of 1, which fails the job. Low severity findings are logged but do not block the merge.&lt;/p&gt;
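&lt;p&gt;A minimal sketch of the DAST job, assuming the zaproxy/action-baseline action is used for the baseline scan; the port, the /health endpoint, and the version pins are illustrative assumptions:&lt;/p&gt;

```yaml
# Illustrative DAST job: download the artifact, start the API, verify it
# responds, then run the ZAP baseline scan against it.
dast-scan:
  needs: [detect-changes, build-app]
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      api: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
  steps:
    - name: Download the build artifact
      uses: actions/download-artifact@v4
      with:
        name: ${{ matrix.api }}
        path: app
    - name: Start the API on the runner
      # setsid -f detaches the process so the step can return immediately
      run: setsid -f dotnet app/${{ matrix.api }}.dll --urls http://localhost:5000
    - name: Verify the app is reachable before scanning
      run: timeout 60 bash -c 'until curl -sf http://localhost:5000/health; do sleep 2; done'
    - name: ZAP baseline scan
      uses: zaproxy/action-baseline@v0.12.0
      with:
        target: 'http://localhost:5000'
```

&lt;p&gt;The action can upload the ZAP report as a workflow artifact; the risk-code gating described above then sits on top of that report.&lt;/p&gt;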

&lt;p&gt;&lt;strong&gt;Stage 3:&lt;/strong&gt; Software Composition Analysis (SCA) with Trivy&lt;br&gt;
The SCA job also depends on both detect-changes and build-app. It runs concurrently with the DAST job rather than waiting for it to finish.&lt;br&gt;
Software Composition Analysis is the practice of examining an application's third-party dependencies for known vulnerabilities. Modern applications rarely consist entirely of first-party code. They rely on dozens, sometimes hundreds, of external packages and libraries. Each of those packages is a potential attack surface. A vulnerability in a single dependency can compromise the security of the entire application that uses it.&lt;br&gt;
Developers working day to day on feature delivery cannot reasonably be expected to track every CVE published against every package in their dependency tree. SCA automates this responsibility and moves it into the pipeline where it runs on every commit.&lt;br&gt;
For this pipeline, I use Trivy, an open source vulnerability scanner developed by Aqua Security. Trivy scans the project's dependency manifest and checks every package against multiple vulnerability databases simultaneously, including the National Vulnerability Database (NVD), the GitHub Advisory Database, the OSS Index, and the broader CVE database ecosystem.&lt;br&gt;
The job downloads the build artifact produced by build-app, runs Trivy against it, and evaluates the results. Any dependency with a vulnerability rated High or Critical severity causes the pipeline to fail. Medium and low severity findings are surfaced in the scan output but do not block the merge on their own, giving the engineering team visibility without creating unnecessary friction for lower-risk issues.&lt;/p&gt;
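&lt;p&gt;The SCA job could be sketched as follows, assuming the aquasecurity/trivy-action GitHub Action; the version pin and artifact layout are assumptions:&lt;/p&gt;

```yaml
# Illustrative SCA job: scan the published output with Trivy and gate
# only on High/Critical findings, as described above.
sca-scan:
  needs: [detect-changes, build-app]
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      api: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
  steps:
    - name: Download the build artifact
      uses: actions/download-artifact@v4
      with:
        name: ${{ matrix.api }}
        path: app
    - name: Scan dependencies with Trivy
      uses: aquasecurity/trivy-action@0.28.0
      with:
        scan-type: fs
        scan-ref: app
        severity: HIGH,CRITICAL   # only these severities gate the merge
        exit-code: '1'            # fail the job when a match is found
```

&lt;p&gt;With severity restricted to HIGH,CRITICAL and exit-code set to 1, only those findings fail the job; medium and low findings still appear in the scan output.&lt;/p&gt;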

&lt;p&gt;&lt;strong&gt;Stage 4:&lt;/strong&gt; Static Application Security Testing (SAST) with SonarQube&lt;br&gt;
The final security stage is the SAST scan, which runs using SonarQube. This job is slightly different in its design compared to the DAST and SCA jobs because SonarQube does not consume the pre-built artifact from the build-app stage.&lt;br&gt;
The reason for this is architectural. SonarQube performs its analysis by instrumenting the build process itself. It needs to observe the compilation as it happens in order to collect the full range of metrics it analyses. It cannot do this by inspecting a finished artifact after the fact. So for SonarQube, the application is built again as part of this job, and SonarQube hooks into that build to perform its analysis.&lt;br&gt;
SAST tests the application at rest, meaning it analyses the source code and compiled output without running the application. This is complementary to DAST rather than a replacement for it. Where DAST catches vulnerabilities that appear at runtime, SAST catches issues that are visible in the code itself before the application is ever started.&lt;br&gt;
SonarQube analyses a broad range of concerns: bugs, technical debt, security hotspots and exposed secrets, code smells, code coverage, code duplication, and reliability, security, and maintainability ratings.&lt;br&gt;
A key part of the SonarQube setup is the quality gate. Before the pipeline was deployed, the DevOps team and the software engineering team agreed on a set of coding standards and quality thresholds. These standards were encoded as a quality gate within SonarQube. Any commit that falls short of this agreed standard causes the pipeline to fail.&lt;br&gt;
The SAST scan runs across all 11 APIs. Every developer on the team has access to the SonarQube dashboard, where they can see exactly which checks their commit failed and what aspect of the quality gate was not met. This creates a feedback loop that goes beyond simply failing a pipeline: it tells the developer precisely what needs to be fixed and why.&lt;br&gt;
&lt;strong&gt;Failure Handling: Why fail-fast Is Set to False&lt;/strong&gt;&lt;br&gt;
One important design decision in this pipeline is that fail-fast is set to false for all jobs that run concurrently, specifically the DAST, SCA, and SAST stages.&lt;br&gt;
By default, GitHub Actions matrix jobs will cancel all in-progress jobs the moment one of them fails. This is efficient in terms of minutes but produces incomplete information. If the DAST scan fails on API number 3, you want to know whether APIs 4 through 11 would also have failed, or whether the issue is isolated. Cancelling everything at the first failure obscures that picture.&lt;br&gt;
With fail-fast set to false, all jobs run to completion even if one fails. The pipeline still fails at the end if any job returned a failure, but the team receives the full diagnostic picture from every scan across every API. This makes root cause analysis significantly faster and reduces the number of subsequent pipeline runs needed to surface all issues.&lt;br&gt;
&lt;strong&gt;Monitoring and Alerting: Slack Integration&lt;/strong&gt;&lt;br&gt;
Every job in the pipeline is connected to a Slack notification step. When any job fails, a notification is sent to the team's Slack channel immediately. The notification identifies which job failed and which repository it belongs to, so the responsible team can investigate without needing to manually check the GitHub Actions dashboard.&lt;br&gt;
A successful pipeline run also sends a notification, which serves as a positive signal that the commit is clean and the merge can proceed. This keeps the whole team informed of pipeline health without requiring anyone to actively monitor it.&lt;br&gt;
&lt;strong&gt;The CD Pipeline: Security as a Deployment Gateway&lt;/strong&gt;&lt;br&gt;
The Continuous Deployment pipeline is triggered by a push to either the main or staging branch, which occurs after a pull request is merged.&lt;br&gt;
Before any deployment begins, the full CI pipeline is retriggered. This acts as a final gateway. Even though the code passed all checks during the PR, this second run ensures that nothing has changed in the environment or the dependencies between the PR being raised and the actual merge. Only after the CI pipeline passes in full does the deployment proceed.&lt;br&gt;
Both successful deployments and failed deployments trigger a Slack notification. In the event of a failed deployment, a rollback is initiated and that too triggers a notification. Every deployment event is logged through Slack for audit purposes, which is a requirement under PCI-DSS compliance.&lt;br&gt;
&lt;strong&gt;A Note on Pipeline Maintainability&lt;/strong&gt;&lt;br&gt;
This is a long pipeline. The full script currently exceeds 500 lines of YAML. At that length it becomes difficult to read, review, and maintain as a single file.&lt;br&gt;
To address this, my team is in the process of refactoring the pipeline to split each job into its own separate reusable workflow file. The build job, DAST job, SCA job, and SAST job will each live in their own file and be referenced from the main pipeline file. This makes the overall structure significantly easier to navigate and allows individual jobs to be updated, tested, and reviewed in isolation without touching the rest of the pipeline.&lt;/p&gt;
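&lt;p&gt;The target structure is roughly the following, using GitHub Actions reusable workflows. The file names are illustrative, and each referenced file would declare an on: workflow_call trigger with the inputs it needs.&lt;/p&gt;

```yaml
# Main pipeline file after the split: each job lives in its own
# reusable workflow and is referenced from here.
jobs:
  detect-changes:
    uses: ./.github/workflows/detect-changes.yml
  build-app:
    needs: detect-changes
    uses: ./.github/workflows/build-app.yml
    with:
      matrix: ${{ needs.detect-changes.outputs.matrix }}
  dast:
    needs: [detect-changes, build-app]
    uses: ./.github/workflows/dast.yml
  sca:
    needs: [detect-changes, build-app]
    uses: ./.github/workflows/sca.yml
  sast:
    needs: detect-changes
    uses: ./.github/workflows/sast.yml
```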

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt; The Full Pipeline Flow&lt;br&gt;
To bring it all together, here is how a pull request to staging or main moves through the pipeline from start to finish.&lt;br&gt;
The detect-changes job runs first and determines the scope of the test suite based on what changed. The build-app job runs next, compiling all affected APIs and uploading their artifacts. The DAST, SCA, and SAST jobs then run concurrently against those artifacts, each applying a different category of security and quality analysis. All three must pass before the pipeline is considered successful. If any one of them fails, the full diagnostic output from all three is preserved and a Slack notification is sent to the team. On a merge, the CI pipeline runs again as a final deployment gate before the CD pipeline takes over and pushes to the environment.&lt;br&gt;
Security is not a checkpoint at the end of this process. It is the process.&lt;/p&gt;
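&lt;p&gt;For completeness, the CD gateway described earlier can be sketched like this; the workflow and job names are illustrative, and the deploy step itself is elided:&lt;/p&gt;

```yaml
# Sketch of the CD workflow: a push to a protected branch re-runs the
# full CI suite, and deployment proceeds only if it passes in full.
on:
  push:
    branches: [main, staging]

jobs:
  ci:
    # Final gateway: the complete CI pipeline runs again before deploying
    uses: ./.github/workflows/ci.yml
  deploy:
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: echo "deployment step elided"
```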


</description>
      <category>cicd</category>
      <category>devops</category>
      <category>dotnet</category>
      <category>security</category>
    </item>
    <item>
      <title>How I Cut GitHub Actions Usage in Half by Making the CI Pipeline Smarter</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:55:03 +0000</pubDate>
      <link>https://forem.com/feyisayo/how-i-cut-github-actions-usage-in-half-by-making-the-ci-pipeline-smarter-2d1m</link>
      <guid>https://forem.com/feyisayo/how-i-cut-github-actions-usage-in-half-by-making-the-ci-pipeline-smarter-2d1m</guid>
<description>&lt;p&gt;&lt;em&gt;A story about runaway build minutes, a blocked engineer, and a smarter approach to CI triggers&lt;/em&gt;&lt;br&gt;
A Slack notification landed in my feed. A CI pipeline had failed. My first instinct was the usual suspects: the Static Application Security Testing (SAST) tool flagged a vulnerability, the Software Composition Analysis (SCA) scan picked up a bad dependency, or the Dynamic Application Security Testing (DAST) tool caught something in the running application. I opened the GitHub Actions run to investigate.&lt;br&gt;
The pipeline had not even started. The error message read:&lt;br&gt;
"The job was not started because recent account payments have failed or your spending limit needs to be increased."&lt;/p&gt;

&lt;p&gt;That was unexpected. This was an organization account with a private repository. GitHub gives a healthy allocation of free minutes per month. We should not have been anywhere near the limit.&lt;br&gt;
I pulled up the GitHub Actions usage metrics. We had consumed over twice our normal monthly minutes across all repositories, and we were not even at the end of the month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How We Got Here&lt;/strong&gt;&lt;br&gt;
A few weeks earlier, I had been working on evolving our existing CI setup into a full DevSecOps pipeline. That work introduced several new jobs, each running on its own dedicated VM, to execute security scans in parallel rather than sequentially on a single runner. Running everything on one VM would have meant pipeline runs stretching to 30 minutes or more per push, which was unacceptable.&lt;br&gt;
To make things more complex, a single repository in our system could contain as many as 8 to 10 APIs. To avoid testing them one after another, I implemented a GitHub Actions matrix strategy, which spun up a separate VM per API and tested them concurrently. This dramatically reduced wall-clock time per run but multiplied the VM-minutes consumed per run by the same factor.&lt;br&gt;
Multiply that across several backend repositories (each with multiple APIs) and multiple frontend repositories, and the numbers compounded fast. Every push or pull request to any branch triggered the full pipeline: every job, every VM, every API, regardless of what actually changed.&lt;br&gt;
The result was that we burned through our free GitHub Actions minutes before the month was out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Immediate Fix&lt;/strong&gt;&lt;br&gt;
By the time I had traced the root cause, the engineer whose push triggered the failed run had already reached out. They were blocked and could not deploy.&lt;br&gt;
The immediate fix was straightforward. I increased the spending limit on GitHub Actions. That unblocked the pipeline within minutes and deployments resumed. But that was a patch, not a solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fundamental Flaw&lt;/strong&gt;&lt;br&gt;
Unblocking the team gave me time to step back and examine the underlying design problem. The issue was simple but costly: the pipeline had no awareness of what actually changed.&lt;br&gt;
Every push, whether it was a one-line README update, a config file tweak, or a core business logic change, triggered the full suite. Every API was built, tested, and scanned. Every VM was spun up. Every minute was consumed.&lt;br&gt;
This was not just inefficient. It was architecturally blind.&lt;br&gt;
&lt;strong&gt;Redesigning the Trigger Strategy&lt;/strong&gt;&lt;br&gt;
To fix this properly, I needed the pipeline to make intelligent decisions based on what changed, not just that something changed. The strategy I settled on had three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Ignore Non-Functional Changes Entirely&lt;/strong&gt;&lt;br&gt;
Some changes simply do not affect application behaviour. Edits to .github/ workflow files during development, updates to README.md, changes to documentation: none of these should consume a single build minute. GitHub Actions supports this natively via the paths-ignore keyword. When a push only touches these paths, the workflow does not start at all. Zero minutes consumed.&lt;/p&gt;
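&lt;p&gt;In workflow syntax, Layer 1 is a few lines at the top of the file; the ignored paths below are illustrative:&lt;/p&gt;

```yaml
# Pushes and PRs that only touch these paths never start the workflow.
on:
  push:
    paths-ignore:
      - '**/*.md'       # documentation-only changes
      - 'docs/**'
      - '.github/**'    # workflow edits during development
  pull_request:
    paths-ignore:
      - '**/*.md'
      - 'docs/**'
      - '.github/**'
```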

&lt;p&gt;&lt;strong&gt;Layer 2: Detect Shared Dependency Changes&lt;/strong&gt;&lt;br&gt;
Our application is built on a shared Core and Persistence layer. Every API in the repository depends on these projects. A bug introduced into Core is not isolated. It is a ripple that can break every single API downstream.&lt;br&gt;
This makes Core and Persistence special. Any commit that touches these directories must trigger tests across the entire API suite, not just one API. There is no safe shortcut here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Isolate Changes to Leaf APIs&lt;/strong&gt;&lt;br&gt;
On the other end of the dependency tree are the individual API projects. These are leaf nodes: a change to one API cannot affect another, because the APIs do not depend on each other.&lt;br&gt;
If a push only touches a single API directory, only that API needs to be built and tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Implementation:&lt;/strong&gt; A detect-changes Job&lt;br&gt;
I introduced a dedicated detect-changes job as the first stage of the pipeline. Every subsequent job, whether building, scanning, or testing, depends on its output before it can proceed.&lt;br&gt;
The job works in four steps. First, it checks out the full git history so it can compare the current commit against the base branch and accurately identify what changed. Second, it runs a file diff specifically against the Core and Persistence directories. If any files in those directories were modified, a flag is set that tells the rest of the pipeline to run everything. Third, it runs a separate file diff across all individual API project directories to capture which specific APIs were touched. Fourth, it uses the results of both checks to dynamically construct a JSON matrix.&lt;br&gt;
The matrix is the key output. If Core or Persistence changed, the matrix contains every API in the repository. If only specific APIs changed, the matrix contains only those APIs. If nothing relevant changed, the matrix is empty and a has-changes flag is set to false, which causes all downstream jobs to skip entirely.&lt;br&gt;
Every downstream job, including the build job, the SAST scan, the SCA scan, and the DAST scan, reads this matrix as its input. Each job only processes the APIs the matrix tells it to. This means the entire security scanning suite becomes scoped to the actual change, not the entire repository.&lt;/p&gt;
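&lt;p&gt;A condensed sketch of those four steps, with illustrative directory and API names (the real repository has 11 APIs):&lt;/p&gt;

```yaml
# Sketch of the detect-changes job: diff against the base branch, check
# the shared layers, then emit a JSON matrix plus a has-changes flag.
detect-changes:
  runs-on: ubuntu-latest
  outputs:
    matrix: ${{ steps.scope.outputs.matrix }}
    has-changes: ${{ steps.scope.outputs.has-changes }}
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0   # full history, so the diff against the base branch works
    - id: scope
      run: |
        APIS="Api.One Api.Two Api.Three"   # illustrative list
        git fetch origin "${{ github.base_ref }}"
        CHANGED=$(git diff --name-only "origin/${{ github.base_ref }}"...HEAD)
        # Layer 2: a shared-layer change means the entire suite must run
        if echo "$CHANGED" | grep -qE '(^|/)(Core|Persistence)/'; then
          SCOPE="$APIS"
        else
          # Layer 3: otherwise keep only the APIs whose directories changed
          SCOPE=""
          for a in $APIS; do
            if echo "$CHANGED" | grep -q "$a/"; then SCOPE="$SCOPE $a"; fi
          done
        fi
        MATRIX=$(echo $SCOPE | tr ' ' '\n' | sed '/^$/d' | jq -R . | jq -cs .)
        echo "matrix=$MATRIX" >> "$GITHUB_OUTPUT"
        if [ "$MATRIX" = "[]" ]; then
          echo "has-changes=false" >> "$GITHUB_OUTPUT"
        else
          echo "has-changes=true" >> "$GITHUB_OUTPUT"
        fi
```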

&lt;p&gt;&lt;strong&gt;Decision Flow&lt;/strong&gt;&lt;br&gt;
The logic the pipeline now follows on every push looks like this:&lt;br&gt;
Did the push only touch ignored paths like README or workflow config files? The workflow does not start. Zero minutes consumed.&lt;br&gt;
Did Core or Persistence change? The full matrix is built. Every API is tested.&lt;br&gt;
Did only individual API directories change? A scoped matrix is built containing only the affected APIs.&lt;br&gt;
Did no API directories change at all? The has-changes flag is set to false and all downstream jobs are skipped automatically.&lt;/p&gt;
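&lt;p&gt;This is how a downstream job consumes that output: it is skipped outright when has-changes is false, and otherwise fans out only over the scoped matrix. The job body here is a placeholder.&lt;/p&gt;

```yaml
# Downstream job gated on the detect-changes output.
build-app:
  needs: detect-changes
  if: needs.detect-changes.outputs.has-changes == 'true'
  runs-on: ubuntu-latest
  strategy:
    matrix:
      api: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
  steps:
    - run: echo "Building ${{ matrix.api }}"
```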

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;&lt;br&gt;
The change in consumption was immediate. Pushes that previously triggered 8 to 10 parallel VM instances now trigger only the ones that matter. Documentation updates consume nothing. A fix to a single API spins up one VM, not ten.&lt;br&gt;
More importantly, the pipeline did not become less safe. The Core and Persistence guard ensures that changes to shared dependencies still trigger a full suite run. The security guarantees of the DevSecOps pipeline remained intact. We just stopped paying for work that did not need to be done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CI pipelines need dependency awareness. A flat "run everything on every push" approach does not scale. Model your repository's dependency graph and let the pipeline reflect it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shared layers deserve special treatment. Core and Persistence are not just directories. They are the foundation every other component builds on. Treat them that way in your pipeline logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dynamic matrices are powerful. GitHub Actions matrix strategy is commonly used with static values. Building the matrix dynamically at runtime unlocks a level of precision that static configurations cannot achieve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimization and security are not in conflict. A well-designed pipeline can be both efficient and thorough. The goal is not to skip checks. It is to run the right checks on the right code.&lt;br&gt;
In a follow-up article, I will walk through the full DevSecOps pipeline design, covering how SAST, SCA, and DAST are integrated and how security gates are enforced without becoming a bottleneck to developer velocity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>When Monitoring Saves the Day: How We Optimized Our Production Database Without Increasing Costs</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 16 Mar 2026 10:59:49 +0000</pubDate>
      <link>https://forem.com/feyisayo/when-monitoring-saves-the-day-how-we-optimized-our-production-database-without-increasing-costs-1713</link>
      <guid>https://forem.com/feyisayo/when-monitoring-saves-the-day-how-we-optimized-our-production-database-without-increasing-costs-1713</guid>
<description>&lt;p&gt;It started like any other day, until our monitoring system triggered an alert on the production database.&lt;br&gt;
The alarm indicated that the storage utilization on the production database had crossed the configured threshold, meaning the storage was approaching a critical level and required immediate attention.&lt;br&gt;
As part of the investigation, I reviewed other key database metrics to better understand the overall system health. That’s when I noticed something interesting.&lt;br&gt;
The database had been experiencing frequent spikes in CPU utilization, occasionally reaching as high as 89%. The spikes were brief and intermittent, which is why they never breached the CPU alarm threshold.&lt;br&gt;
Memory usage was also relatively high: up to about 80% of memory was in use, though not to the point where the system had to fall back on swap. Swap usage sat at roughly 1 MB, indicating that memory pressure had not yet begun to affect database performance.&lt;br&gt;
To ensure performance was not already degraded, I checked additional metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read latency&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write latency&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both metrics were well within acceptable ranges, confirming that despite the high CPU and memory spikes, database performance remained stable.&lt;br&gt;
&lt;strong&gt;Defining the Action Plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After assessing the situation, I outlined the actions to present to the CTO.&lt;br&gt;
&lt;strong&gt;1. Immediate Storage Increase&lt;/strong&gt;&lt;br&gt;
The first step was straightforward: increase the database storage to prevent reaching a critical capacity limit.&lt;br&gt;
One advantage of using Amazon RDS is that storage scaling can occur without downtime, as the process runs in the background. After reviewing the cost implications, the increase was approved since the cost impact was minimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Investigate CPU and Memory Spikes&lt;/strong&gt;&lt;br&gt;
The next challenge was the recurring spikes in CPU and memory utilization.&lt;br&gt;
There were two possible approaches:&lt;br&gt;
&lt;strong&gt;Option A:&lt;/strong&gt; Increase the instance size&lt;br&gt;
Upgrading the database instance to the next size would increase compute and memory capacity. However, because our architecture included both a primary database and a read replica, upgrading both instances would nearly double our RDS cost.&lt;br&gt;
&lt;strong&gt;Option B:&lt;/strong&gt; Optimize database queries&lt;br&gt;
Since performance metrics were still healthy, I recommended first working with the engineering team to optimize database queries that might be inefficient or resource-intensive.&lt;br&gt;
This approach would allow us to improve performance without immediately increasing infrastructure costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safe Execution:&lt;/strong&gt; &lt;strong&gt;Scaling Storage&lt;/strong&gt;&lt;br&gt;
Before applying the storage increase to the production database, I followed a cautious approach.&lt;br&gt;
As a rule of thumb, I scaled the read replica first, ensuring that the operation completed successfully without unexpected side effects.&lt;br&gt;
Once that was completed successfully, I proceeded to scale the primary database storage, which also completed without downtime or disruption.&lt;br&gt;
This was performed during a period of low traffic, allowing us to confirm that the operation did not impact uptime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collaboration with Engineering&lt;/strong&gt;&lt;br&gt;
Following this infrastructure adjustment, I connected with the Software Engineering Team Lead to discuss the observed resource spikes.&lt;br&gt;
A Jira ticket was created and assigned to a developer to investigate and optimize the database queries responsible for the load.&lt;br&gt;
The optimization work was first implemented and tested extensively in the staging environment. Once the improvements showed the desired results, the optimized queries were deployed to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;&lt;br&gt;
After deploying the query optimizations, we observed a significant reduction in database resource usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CPU utilization dropped to approximately 60%&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory consumption stabilized&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Database performance remained strong&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
This experience reinforced an important principle in engineering:&lt;br&gt;
Not every performance issue should be solved by scaling infrastructure.&lt;br&gt;
Instead of immediately upgrading the database instance, which would have significantly increased our AWS costs, we focused on observability, analysis, and optimization.&lt;br&gt;
By combining infrastructure scaling where necessary (storage) with application-level improvements (query optimization), we resolved the issue efficiently while keeping operational costs under control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;br&gt;
Good engineering is not just about making systems work.&lt;br&gt;
It’s about making them work efficiently, reliably, and sustainably.&lt;br&gt;
Sometimes the best solution isn’t scaling up, it's understanding the system deeply enough to optimize what already exists.&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building Hyperscale Capabilities for B2B Workloads - Without a Hyperscaler</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Tue, 10 Mar 2026 12:36:30 +0000</pubDate>
      <link>https://forem.com/feyisayo/building-hyperscale-capabilities-for-b2b-workloads-without-a-hyperscaler-472e</link>
      <guid>https://forem.com/feyisayo/building-hyperscale-capabilities-for-b2b-workloads-without-a-hyperscaler-472e</guid>
<description>&lt;p&gt;&lt;strong&gt;From Vertical Scaling to True Elasticity&lt;/strong&gt;&lt;br&gt;
At a pivotal point in my career as a Consulting Cloud Solutions / Technical Product Engineer, I was handed a mandate that sounded simple on paper but was far from trivial in execution:&lt;br&gt;
"Enable horizontal autoscaling for our B2B customers."&lt;br&gt;
At the time, customers could only scale vertically, resizing their virtual machines whenever they needed more compute. It worked, but only up to a point. Beyond that, the limitations became obvious.&lt;br&gt;
What made it even more interesting was the environment: this wasn't on any of the well-known hyperscalers. It was within a startup cloud compute company, where elasticity wasn't a built-in convenience; it had to be intentionally engineered.&lt;br&gt;
They couldn't:&lt;br&gt;
Automatically scale out based on load&lt;br&gt;
Scale back in when utilization dropped&lt;br&gt;
Distribute traffic intelligently&lt;br&gt;
Optimize cost through elasticity&lt;br&gt;
Maintain performance during unpredictable spikes&lt;/p&gt;

&lt;p&gt;For production B2B workloads, that model was limiting.&lt;br&gt;
We needed real elasticity.&lt;br&gt;
&lt;strong&gt;The Real Problem: Vertical Scaling Isn't Elastic&lt;/strong&gt;&lt;br&gt;
Vertical scaling solves capacity constraints temporarily. But it introduces structural issues:&lt;br&gt;
Downtime during resize operations&lt;br&gt;
Hard upper limits on instance sizes&lt;br&gt;
Paying peak cost even during off-peak hours&lt;br&gt;
No automated response to traffic bursts&lt;/p&gt;

&lt;p&gt;For APIs and transactional B2B systems, that's fragile.&lt;br&gt;
What customers needed wasn't "bigger machines." They needed adaptive infrastructure.&lt;br&gt;
&lt;strong&gt;The Architecture I Designed&lt;/strong&gt;&lt;br&gt;
The solution introduced:&lt;br&gt;
Autoscaling groups&lt;br&gt;
Metric-driven scaling policies (CPU + RAM)&lt;br&gt;
Application-aware health monitoring&lt;br&gt;
Layer 7 traffic distribution&lt;br&gt;
Conditional TLS termination&lt;br&gt;
Fully parameterized Infrastructure-as-Code&lt;/p&gt;

&lt;p&gt;The core of the design was an autoscaling group resource that dynamically created and managed instances based on scaling signals.&lt;br&gt;
But more importantly, it was built to be reusable and controlled.&lt;br&gt;
&lt;strong&gt;Design Philosophy: Parameterization Over Hardcoding&lt;/strong&gt;&lt;br&gt;
Every meaningful variable was externalized:&lt;br&gt;
Image&lt;br&gt;
Instance size&lt;br&gt;
Network and subnet&lt;br&gt;
Security rules&lt;br&gt;
Scaling thresholds&lt;br&gt;
Cooldowns&lt;br&gt;
Load balancing algorithm&lt;br&gt;
TLS configuration&lt;/p&gt;

&lt;p&gt;This allowed the platform team to provide a blueprint, while tenants retained safe tuning control.&lt;br&gt;
Guardrails prevented runaway scaling. Flexibility enabled elasticity.&lt;br&gt;
&lt;strong&gt;Dual-Metric Scaling: CPU + RAM&lt;/strong&gt;&lt;br&gt;
Most autoscaling implementations rely solely on CPU.&lt;br&gt;
That's incomplete.&lt;br&gt;
&lt;strong&gt;CPU-Based Scaling&lt;/strong&gt;&lt;br&gt;
CPU scaling ensured instance-size awareness:&lt;br&gt;
High CPU → scale out&lt;br&gt;
Sustained low CPU → scale in&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAM-Based Scaling&lt;/strong&gt;&lt;br&gt;
Why memory?&lt;br&gt;
Because CPU-only logic fails for:&lt;br&gt;
Memory-heavy APIs&lt;br&gt;
Cache-intensive services&lt;br&gt;
Background workers&lt;br&gt;
Stateful processes&lt;/p&gt;

&lt;p&gt;Memory pressure doesn't always spike CPU. But it degrades performance quietly.&lt;br&gt;
By combining both signals, scaling decisions became more accurate and less reactive.&lt;br&gt;
&lt;strong&gt;Controlled Scaling Policies&lt;/strong&gt;&lt;br&gt;
Scaling adjustments were incremental:&lt;br&gt;
Scale up by +1&lt;br&gt;
Scale down by -1&lt;br&gt;
Cooldown period enforced&lt;/p&gt;

&lt;p&gt;Cooldowns were non-negotiable.&lt;br&gt;
Without them, you get:&lt;br&gt;
Oscillation&lt;br&gt;
Thrashing&lt;br&gt;
Cost spikes&lt;br&gt;
Instability&lt;/p&gt;
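
&lt;p&gt;A minimal sketch of that dual-metric decision with an enforced cooldown, assuming illustrative thresholds and a monotonic clock (the real policies lived in the orchestration template, not application code):&lt;/p&gt;

```python
import time

class AutoscalePolicy:
    """Sketch of a dual-metric (CPU + RAM) scaling decision with a cooldown.

    Thresholds and the cooldown window are illustrative placeholders,
    not the production values.
    """

    def __init__(self, scale_out_pct=80.0, scale_in_pct=30.0, cooldown_s=300):
        self.scale_out_pct = scale_out_pct
        self.scale_in_pct = scale_in_pct
        self.cooldown_s = cooldown_s
        self.last_action_at = 0.0

    def decide(self, cpu_pct, ram_pct, now=None):
        """Return +1 (scale out), -1 (scale in), or 0 (hold)."""
        now = time.monotonic() if now is None else now
        # Cooldown: ignore all signals until the window since the last action elapses.
        if self.cooldown_s > now - self.last_action_at:
            return 0
        # Scale out if EITHER signal is hot: memory pressure alone must be able
        # to trigger scale-out, since it rarely spikes CPU.
        if cpu_pct >= self.scale_out_pct or ram_pct >= self.scale_out_pct:
            self.last_action_at = now
            return 1
        # Scale in only when BOTH signals are cold, one instance at a time.
        if self.scale_in_pct > cpu_pct and self.scale_in_pct > ram_pct:
            self.last_action_at = now
            return -1
        return 0
```

&lt;p&gt;Note how the cooldown gate sits before any metric check: an oscillating signal cannot produce more than one action per window.&lt;/p&gt;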

&lt;p&gt;Elasticity must be disciplined.&lt;br&gt;
&lt;strong&gt;Intelligent Traffic Distribution&lt;/strong&gt;&lt;br&gt;
Horizontal scaling without intelligent routing is incomplete.&lt;br&gt;
Each instance was automatically registered behind a managed Layer 7 load balancer configured with:&lt;br&gt;
Listener configuration&lt;br&gt;
Traffic pools&lt;br&gt;
Algorithm selection (round robin, least connections, source affinity)&lt;br&gt;
Health checks&lt;/p&gt;

&lt;p&gt;This ensured:&lt;br&gt;
Even traffic spread&lt;br&gt;
Automatic removal of unhealthy instances&lt;br&gt;
Seamless scale-out integration&lt;/p&gt;
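
&lt;p&gt;For illustration only, the algorithm selection can be sketched as a picker over pool members; the member tuples and function names here are hypothetical, not the load balancer's actual API:&lt;/p&gt;

```python
from itertools import count

def make_picker(algorithm="round_robin"):
    """Sketch of pool algorithm selection (round robin vs least connections).

    Pool entries are (name, active_connections) pairs; unhealthy members
    are assumed to have been removed from the pool already.
    """
    counter = count()  # shared round-robin cursor for this picker

    def pick(pool):
        if algorithm == "least_connections":
            # The member with the fewest active connections wins.
            return min(pool, key=lambda member: member[1])[0]
        # Default: round robin, cycling through the pool in order.
        return pool[next(counter) % len(pool)][0]

    return pick
```

&lt;p&gt;Source affinity would add a hash of the client address on top of this, pinning a source to a member.&lt;/p&gt;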

&lt;p&gt;New nodes joined the pool automatically. Terminated nodes exited cleanly. No manual intervention.&lt;br&gt;
&lt;strong&gt;Conditional TLS Termination&lt;/strong&gt;&lt;br&gt;
Security couldn't be an afterthought.&lt;br&gt;
The template supported conditional TLS termination at the load balancer:&lt;br&gt;
Secure edge termination when required&lt;br&gt;
Certificate injection via parameter&lt;br&gt;
HTTP communication retained internally&lt;/p&gt;

&lt;p&gt;This allowed secure external exposure without unnecessary internal overhead.&lt;br&gt;
One template. Multiple behaviors. Clean logic.&lt;br&gt;
&lt;strong&gt;Application-Aware Health Monitoring&lt;/strong&gt;&lt;br&gt;
Metrics alone are not enough.&lt;br&gt;
Health checks were configurable:&lt;br&gt;
Protocol type (HTTP, HTTPS, TCP, PING)&lt;br&gt;
Custom endpoint paths&lt;br&gt;
Expected response codes&lt;br&gt;
Retry logic&lt;/p&gt;
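
&lt;p&gt;As a rough sketch of that validation logic, with a hypothetical &lt;code&gt;probe&lt;/code&gt; callable standing in for the HTTP request, and the path, expected codes, and retry count as parameters:&lt;/p&gt;

```python
def is_healthy(probe, path="/health", expected_codes=(200,), retries=3):
    """Sketch of an application-aware health check.

    `probe` is any callable returning an HTTP status code for a path;
    in production this would be an HTTP GET against the instance.
    """
    for _ in range(retries):
        try:
            if probe(path) in expected_codes:
                return True
        except OSError:
            pass  # a connection failure counts as an unhealthy attempt
        # otherwise fall through and retry
    return False
```

&lt;p&gt;An instance failing this check is removed from rotation regardless of what its CPU and memory metrics say.&lt;/p&gt;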

&lt;p&gt;Scaling decisions were therefore not just metric-driven. They were application-aware.&lt;br&gt;
If an instance failed health validation, it was removed from rotation, even if CPU looked fine.&lt;br&gt;
&lt;strong&gt;Clean Separation of Concerns&lt;/strong&gt;&lt;br&gt;
Compute provisioning lived in a nested template.&lt;br&gt;
Each instance:&lt;br&gt;
Booted with the required configuration&lt;br&gt;
Attached to the correct network&lt;br&gt;
Applied security policies&lt;br&gt;
Registered automatically as a pool member&lt;/p&gt;

&lt;p&gt;This separation improved:&lt;br&gt;
Maintainability&lt;br&gt;
Debuggability&lt;br&gt;
Reusability&lt;/p&gt;

&lt;p&gt;Scaling logic stayed isolated from compute logic.&lt;br&gt;
Clean architecture scales better than tangled configuration.&lt;br&gt;
&lt;strong&gt;Observability by Design&lt;/strong&gt;&lt;br&gt;
The stack exposed:&lt;br&gt;
Load balancer endpoint&lt;br&gt;
Active instance IPs&lt;br&gt;
Current autoscaling group size&lt;br&gt;
Scaling thresholds&lt;br&gt;
TLS state&lt;/p&gt;

&lt;p&gt;This made validation straightforward and troubleshooting faster.&lt;br&gt;
Elasticity without visibility is chaos.&lt;br&gt;
&lt;strong&gt;How the Feature Was Exposed to Customers&lt;/strong&gt;&lt;br&gt;
Internally, the autoscaling logic was implemented as infrastructure orchestration.&lt;br&gt;
But for customers, it had to feel simple.&lt;br&gt;
The orchestration layer was triggered by a backend service written in Python. Customer-facing APIs were exposed via a Node.js frontend layer.&lt;br&gt;
This separation allowed:&lt;br&gt;
Clean API contracts for B2B customers&lt;br&gt;
Secure backend orchestration execution&lt;br&gt;
Controlled feature consumption&lt;br&gt;
Clear abstraction of infrastructure complexity&lt;/p&gt;

&lt;p&gt;Customers interacted with an API. Behind the scenes, the platform translated that into autoscaling behavior.&lt;br&gt;
No platform details exposed. Just capability.&lt;br&gt;
&lt;strong&gt;What This Enabled for B2B Customers&lt;/strong&gt;&lt;br&gt;
After deployment, customers could:&lt;br&gt;
Automatically scale from x → x + n instances&lt;br&gt;
Scale down during low traffic&lt;br&gt;
Avoid manual resizing&lt;br&gt;
Maintain SLA during demand spikes&lt;br&gt;
Secure endpoints with TLS&lt;br&gt;
Optimize infrastructure cost&lt;/p&gt;

&lt;p&gt;Infrastructure shifted from static capacity to adaptive elasticity.&lt;br&gt;
&lt;strong&gt;Key Lessons&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Autoscaling is not just CPU-based.
Memory pressure breaks systems quietly.&lt;/li&gt;
&lt;li&gt;Cooldowns prevent instability.
Elastic does not mean erratic.&lt;/li&gt;
&lt;li&gt;Parameterization is platform maturity.
Reusable blueprints reduce operational risk.&lt;/li&gt;
&lt;li&gt;Health checks are mandatory.
Metrics without application validation are incomplete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
This wasn't just about writing an Infrastructure-as-Code template.&lt;br&gt;
It was about:&lt;br&gt;
Platform enablement&lt;br&gt;
Elastic workload design&lt;br&gt;
Cost optimization&lt;br&gt;
SLA resilience&lt;br&gt;
Production-grade automation&lt;/p&gt;

&lt;p&gt;Horizontal autoscaling isn't a feature toggle.&lt;br&gt;
It's an operational philosophy.&lt;br&gt;
And once B2B customers experience real elasticity, going back to vertical-only scaling isn't an option anymore.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>cloudcomputing</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>If your pipelines deploy cleanly but never clean up, you’re shipping future problems.</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 02 Feb 2026 11:06:16 +0000</pubDate>
      <link>https://forem.com/feyisayo/if-your-pipelines-deploy-cleanly-but-never-clean-up-youre-shipping-future-problems-4cde</link>
      <guid>https://forem.com/feyisayo/if-your-pipelines-deploy-cleanly-but-never-clean-up-youre-shipping-future-problems-4cde</guid>
      <description>&lt;p&gt;Owning the Long Tail of Automation: Designing CI/CD Systems That Clean Up After Themselves&lt;br&gt;
A few years ago, while optimizing our AWS deployment workflows, I identified a systemic issue that had both reliability and cost implications if left unchecked.&lt;br&gt;
At the time, I was deploying a .NET Core application using build artifacts directly, no Docker, no container orchestration. The CI pipeline ran tests and validations, and once those passed, the CD pipeline built the application and pushed the artifacts to Amazon S3. Artifacts were tagged and separated by environment (staging and production), and S3 also served as our rollback mechanism, allowing us to redeploy a previous version quickly if a release failed. Nexus was considered, but S3 was chosen for its simplicity and tight integration with AWS, without additional infrastructure overhead.&lt;br&gt;
The system worked, but like many early-stage automation decisions, it had a long-term side effect.&lt;br&gt;
We had 14 applications at the time, each with both staging and production environments. Over a few months, every environment accumulated 20+ artifacts. That meant well over 560 artifacts stored already, and growing linearly with every deployment. There was no retention policy, no cleanup mechanism, and no visibility into how fast this was scaling. While S3 storage is relatively inexpensive, at this rate we were heading toward hundreds to thousands of stale artifacts within a year, introducing unnecessary cost, clutter, and operational risk during incident response or rollbacks.&lt;br&gt;
From an SRE perspective, this violated two principles I care deeply about:&lt;br&gt;
Automation should not introduce long-term operational debt&lt;/p&gt;

&lt;p&gt;Systems must be self-maintaining, not just self-deploying&lt;/p&gt;

&lt;p&gt;Rather than relying on manual cleanup or tribal knowledge, I designed an automated, low-risk solution.&lt;br&gt;
I implemented an AWS Lambda function with least-privilege IAM access, scoped strictly to artifact buckets. The function was triggered monthly on a cron schedule; on each execution, it:&lt;br&gt;
Enumerated all artifact S3 buckets across the account.&lt;/p&gt;

&lt;p&gt;Detected whether new artifacts had been added since the last run, exiting early if no changes were found.&lt;/p&gt;

&lt;p&gt;Sorted artifacts by creation timestamp.&lt;/p&gt;

&lt;p&gt;Retained only the latest 10 artifacts per environment, deleting all older ones.&lt;/p&gt;
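
&lt;p&gt;The retention step can be sketched as a pure function; the helper name and the (key, timestamp) pair format are illustrative, and the real Lambda worked from boto3 object listings rather than in-memory pairs:&lt;/p&gt;

```python
def stale_artifacts(objects, keep=10):
    """Sketch of the retention rule: given (key, created_at) pairs for one
    environment's bucket, return the keys to delete, keeping only the
    `keep` most recent artifacts.
    """
    # Newest first, so everything after the first `keep` entries is stale.
    ordered = sorted(objects, key=lambda obj: obj[1], reverse=True)
    return [key for key, _ in ordered[keep:]]
```

&lt;p&gt;Keeping the rule pure made it trivial to test: with ten or fewer artifacts, the slice is empty and nothing is deleted.&lt;/p&gt;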

&lt;p&gt;This approach preserved rollback safety while enforcing a clear retention policy. It also ensured deterministic behavior, no deletions unless a newer artifact existed, and no assumptions baked into deployment pipelines.&lt;br&gt;
The impact was immediate and measurable:&lt;br&gt;
~70–80% reduction in stored artifacts across environments&lt;/p&gt;

&lt;p&gt;Predictable, bounded S3 storage growth&lt;/p&gt;

&lt;p&gt;Elimination of manual cleanup tasks&lt;/p&gt;

&lt;p&gt;Cleaner rollback workflows during incidents&lt;/p&gt;

&lt;p&gt;Long-term cost savings with zero developer involvement&lt;/p&gt;

&lt;p&gt;Just as important as the code was the leadership follow-through. I documented:&lt;br&gt;
The original problem and risk assessment&lt;/p&gt;

&lt;p&gt;The retention logic and safeguards&lt;/p&gt;

&lt;p&gt;IAM design decisions&lt;/p&gt;

&lt;p&gt;Operational behavior under different deployment scenarios&lt;/p&gt;

&lt;p&gt;This ensured the solution was understandable, auditable, and maintainable by the wider team.&lt;br&gt;
From an SRE and leadership standpoint, this wasn’t just about cleaning up S3 buckets, it was about owning the full lifecycle of automation, anticipating second-order effects, and leaving systems better than I found them. Once deployed, this became another class of operational concern that engineers simply didn’t have to think about again, which, to me, is the mark of effective DevOps and SRE work.&lt;br&gt;
Image credit: &lt;a href="http://www.sonassystems.com" rel="noopener noreferrer"&gt;www.sonassystems.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>SRE Leadership Case Study: Establishing Reliability and Observability in a FinTech Platform</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 19 Jan 2026 11:04:46 +0000</pubDate>
      <link>https://forem.com/feyisayo/sre-leadership-case-study-establishing-reliability-and-observability-in-a-fintech-platform-1f5j</link>
      <guid>https://forem.com/feyisayo/sre-leadership-case-study-establishing-reliability-and-observability-in-a-fintech-platform-1f5j</guid>
      <description>&lt;p&gt;SRE Leadership Case Study: Establishing Reliability and Observability in a FinTech Platform&lt;br&gt;
While working with a FinTech startup operating in a high-growth, fast-paced environment, my primary focus during my first week was not feature delivery, but platform stability and operational reliability. The company was running a microservices architecture with over a dozen production services, yet had no effective observability across infrastructure or applications.&lt;br&gt;
At the time, outages were typically detected by customers reporting issues to customer support, not by engineering. This resulted in delayed incident response, extended downtime, and reputational damage. The situation was further compounded by the fact that several APIs were consumed by third-party partners. When these APIs failed, the impact propagated beyond our platform, disrupting collections and payment workflows across multiple external businesses and their customers.&lt;br&gt;
It became clear that reliability was not just a technical concern but a direct business risk.&lt;br&gt;
&lt;strong&gt;Phase 1: Infrastructure Observability and Alerting&lt;/strong&gt;&lt;br&gt;
The first decision was selecting an appropriate monitoring stack. Given that the platform was fully deployed on AWS, I chose Amazon CloudWatch over Prometheus/Grafana to leverage native integrations, centralized visibility, and streamlined alerting.&lt;br&gt;
I deployed and configured the CloudWatch Agent across all production servers to collect critical host metrics beyond the default CPU-only view:&lt;br&gt;
CPU utilization&lt;/p&gt;

&lt;p&gt;Memory utilization&lt;/p&gt;

&lt;p&gt;Disk usage&lt;/p&gt;

&lt;p&gt;This immediately expanded our observability footprint from single-metric visibility to full host-level telemetry.&lt;br&gt;
I then designed a centralized dashboard providing real-time visibility across all production infrastructure. Based on historical behavior and service tolerances, I defined alert thresholds and configured alarms. For example:&lt;br&gt;
Memory utilization ≥ 80% sustained for 5 minutes triggered immediate alerts to the DevOps/SRE team.&lt;/p&gt;

&lt;p&gt;These alerts were enabled across all production servers.&lt;br&gt;
Outcome:&lt;br&gt;
Mean Time to Detect (MTTD) was reduced from customer-reported incidents to sub-5 minutes.&lt;/p&gt;

&lt;p&gt;Infrastructure-related incidents were identified before customer impact in over 70% of cases within the first month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Third-Party Connectivity Reliability&lt;/strong&gt;&lt;br&gt;
A major source of downtime was unstable VPN connectivity to third-party service providers. Prior to my involvement:&lt;br&gt;
Only a single primary VPN tunnel existed.&lt;/p&gt;

&lt;p&gt;There was no monitoring of tunnel health.&lt;/p&gt;

&lt;p&gt;Failures often went unnoticed until services were already disrupted.&lt;/p&gt;

&lt;p&gt;I configured CloudWatch to monitor VPN tunnel status and set alarms to trigger when a tunnel remained down for three consecutive minutes. This provided immediate visibility into connectivity degradation and allowed rapid response.&lt;br&gt;
Outcome:&lt;br&gt;
VPN-related downtime was reduced by over 60%.&lt;/p&gt;

&lt;p&gt;Incident response became proactive rather than reactive, with engineers alerted before full service disruption occurred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Application-Level Health Monitoring and Alerting&lt;/strong&gt;&lt;br&gt;
Infrastructure visibility alone was insufficient without insight into application health. To address this, I implemented Route 53 health checks for all publicly accessible API endpoints.&lt;br&gt;
Health checks validated endpoint availability and correctness, and alarms were configured as follows:&lt;br&gt;
Production: Alert on non-200 responses for 3 consecutive minutes&lt;/p&gt;

&lt;p&gt;Non-production: Alert on non-200 responses for 15 consecutive minutes to reduce noise&lt;/p&gt;
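
&lt;p&gt;A hedged sketch of that consecutive-failure rule, treating the health history as one sample per minute (function and parameter names are mine, not Route 53's API):&lt;/p&gt;

```python
def should_alert(status_history, threshold_minutes):
    """Sketch of the alerting rule: alarm only when the most recent
    `threshold_minutes` one-per-minute health-check results are ALL
    non-200 (3 for production, 15 for non-production, per the text).
    """
    if threshold_minutes > len(status_history):
        return False  # not enough consecutive samples yet
    recent = status_history[-threshold_minutes:]
    return all(code != 200 for code in recent)
```

&lt;p&gt;A single 200 inside the window resets the condition, which is what keeps transient blips from paging anyone.&lt;/p&gt;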

&lt;p&gt;This approach ensured high signal-to-noise alerts while maintaining strict availability standards in production.&lt;br&gt;
Outcome:&lt;br&gt;
Mean Time to Recovery (MTTR) improved by approximately 40%.&lt;/p&gt;

&lt;p&gt;Prolonged outages were virtually eliminated.&lt;/p&gt;

&lt;p&gt;Engineering teams gained clear, actionable insight into service health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leadership and SRE Impact&lt;/strong&gt;&lt;br&gt;
Beyond the technical implementation, this initiative established a reliability-first culture. Monitoring and alerting shifted from being an afterthought to a core engineering concern. Engineers no longer relied on customer complaints to detect failures, and incident response became faster, data-driven, and more coordinated.&lt;br&gt;
Overall impact included:&lt;br&gt;
Reduced customer-visible downtime&lt;/p&gt;

&lt;p&gt;Improved third-party trust and SLA adherence&lt;/p&gt;

&lt;p&gt;Faster incident detection and resolution&lt;/p&gt;

&lt;p&gt;A scalable observability foundation to support future growth&lt;/p&gt;

&lt;p&gt;This work laid the groundwork for sustained reliability and positioned the platform to scale without sacrificing stability, aligning engineering outcomes directly with business continuity and customer trust.&lt;/p&gt;

&lt;p&gt;PS: Ignore the numbers in the attached image, they’re just from a server stress test while I was showing my students how monitoring and alerting works. Chaos was intentional 😅&lt;/p&gt;

</description>
      <category>devops</category>
      <category>leadership</category>
      <category>microservices</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>What happens when you have 50+ production servers to patch — and zero room for mistakes?</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 05 Jan 2026 10:59:38 +0000</pubDate>
      <link>https://forem.com/feyisayo/what-happens-when-you-have-50-production-servers-to-patch-and-zero-room-for-mistakes-4bjd</link>
      <guid>https://forem.com/feyisayo/what-happens-when-you-have-50-production-servers-to-patch-and-zero-room-for-mistakes-4bjd</guid>
      <description>&lt;p&gt;Manual security updates stop being an option very quickly.&lt;br&gt;
Automating Security Patching Across Production Servers with Ansible&lt;br&gt;
Security patching is critical, especially when production servers have known vulnerabilities that can be exploited. Beyond the risk itself, there’s also the compliance requirement: being able to prove that security updates are applied regularly.&lt;br&gt;
This becomes challenging at scale.&lt;br&gt;
In our case, we had 50+ production servers that required routine security-only patching. Manual updates were no longer reliable or sustainable, so I automated the entire process using Ansible.&lt;br&gt;
I set up an Ansible control node with an inventory of all production servers and wrote a playbook that applies security updates only, avoiding full system upgrades and breaking changes. The playbook is idempotent, auditable, and production-safe, with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Timestamped logs per server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Centralized reporting for visibility and compliance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zero service disruption during patching&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once a new production server is added to the inventory, security patching becomes automated from that point onward.&lt;br&gt;
Because the control node is highly sensitive, I also hardened it by blocking all external SSH access; a compromise there would be catastrophic.&lt;br&gt;
To complete the setup, I scheduled a cron job to run the playbook at midnight every night. From a single trigger, Ansible securely patches all production servers in a consistent and repeatable manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reduced security risk by quickly closing known vulnerabilities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintained production stability by avoiding breaking upgrades&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedded security into operations (DevSecOps in practice)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eliminated manual patching across dozens of servers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimized downtime-related costs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automation like this turns security from a reactive task into a built-in operational standard.&lt;/p&gt;


</description>
      <category>automation</category>
      <category>devops</category>
      <category>security</category>
      <category>tooling</category>
    </item>
    <item>
<title>Reflection</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 29 Dec 2025 10:58:26 +0000</pubDate>
      <link>https://forem.com/feyisayo/refection-1hk4</link>
      <guid>https://forem.com/feyisayo/refection-1hk4</guid>
      <description>&lt;p&gt;As I prepared for the end of the year strategy session, I began taking inventory of my achievements and outlining plans for my department and team members for the coming year. I also reflected on both the wins and the challenges that initially seemed insurmountable, as well as the progress my students, interns, and mentees have made. In doing so, I realized something important. You do not need to have everything figured out from the beginning. What truly matters is having a clear sense of direction, then showing up consistently every day and putting in the work.&lt;/p&gt;

&lt;p&gt;This year reinforced the value of patience, resilience, and intentional effort. Many of the outcomes I am most proud of, personally, professionally, and the development of those under my guidance, did not happen overnight. They were the result of small, sometimes uncomfortable steps taken repeatedly, learning from mistakes, adapting quickly, and staying committed even when progress felt slow. Each challenge became an opportunity to grow, both individually and as a team.&lt;/p&gt;

&lt;p&gt;I also learned that progress is rarely linear. There were moments of doubt, setbacks, and recalibration, but those moments shaped better decisions and stronger execution. Looking back, the obstacles that once felt overwhelming now serve as proof that consistency and focus can turn uncertainty into measurable results.&lt;/p&gt;

&lt;p&gt;As I look ahead to next year, the goal is not just to do more, but to do better. To build on the foundations already laid, invest more intentionally in people, improve processes, and pursue outcomes that are sustainable and impactful. The plan is to remain disciplined, stay curious, and continue showing up with purpose.&lt;/p&gt;

&lt;p&gt;This reflection is a reminder to trust the process, commit to steady progress, and believe that clarity often comes through action, not before it.&lt;/p&gt;

&lt;p&gt;Cheers to a wonderful year ahead, folks!!!&lt;/p&gt;


</description>
      <category>motivation</category>
      <category>leadership</category>
      <category>devjournal</category>
      <category>career</category>
    </item>
    <item>
      <title>Implementing Automated Storage Utilization Monitoring on Windows Servers (Part 2)</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 15 Dec 2025 10:58:46 +0000</pubDate>
      <link>https://forem.com/feyisayo/implementing-automated-storage-utilization-monitoring-on-windows-servers-part-2-126a</link>
      <guid>https://forem.com/feyisayo/implementing-automated-storage-utilization-monitoring-on-windows-servers-part-2-126a</guid>
      <description>&lt;p&gt;Implementing Automated Storage Utilization Monitoring on Windows Servers (Part 2)&lt;br&gt;
After successfully deploying automated storage-utilization monitoring on our Linux-based instances, I was tasked with extending the same capability to Windows servers within our private cloud. While the objective remained the same, collect and report storage metrics automatically, the Windows implementation introduced unique challenges due to platform differences in automation and system behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Requirements&lt;/strong&gt;&lt;br&gt;
Just like the Linux solution, the Windows-based approach needed to:&lt;/p&gt;

&lt;p&gt;Collect total storage capacity, used space, and free space&lt;/p&gt;

&lt;p&gt;Identify each server using its unique Server ID&lt;/p&gt;

&lt;p&gt;Post aggregated metrics to our monitoring endpoint&lt;/p&gt;

&lt;p&gt;Run reliably and autonomously without human intervention&lt;/p&gt;

&lt;p&gt;The logic for gathering and transmitting metrics was simple. The challenge was ensuring that the PowerShell script executed automatically and consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing the Storage Collection Script&lt;/strong&gt;&lt;br&gt;
I built a PowerShell script that captured:&lt;/p&gt;

&lt;p&gt;Disk size&lt;/p&gt;

&lt;p&gt;Storage used&lt;/p&gt;

&lt;p&gt;Available free space&lt;/p&gt;

&lt;p&gt;Server ID (retrieved through a custom API endpoint)&lt;/p&gt;

&lt;p&gt;The script then packaged the data and posted it to our monitoring API, mirroring the behavior of the Linux version. But unlike Linux, automating recurring execution wasn’t as straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tackling Windows Automation Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Replacing Cron with Windows Scheduled Tasks&lt;/strong&gt;&lt;br&gt;
Linux relies on cron jobs, which run indefinitely once configured. Windows uses Scheduled Tasks, so I automated the script with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;New-ScheduledTaskAction&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;New-ScheduledTaskTrigger&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Register-ScheduledTask&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything appeared correct during testing, until I noticed an unexpected issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Hidden Three-Day Trigger Limitation&lt;/strong&gt;&lt;br&gt;
By default, Windows Scheduled Tasks without an explicit repetition duration expire after three days. After that, the trigger simply stops firing, even though the task still exists.&lt;/p&gt;

&lt;p&gt;This behavior is very different from cron.&lt;/p&gt;

&lt;p&gt;To fix it, I explicitly set:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-RepetitionDuration ([TimeSpan]::MaxValue)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This ensured the task ran indefinitely, just like a cron job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling a More Subtle Issue: Server Restarts&lt;/strong&gt;&lt;br&gt;
After resolving the trigger-duration issue, another problem surfaced:&lt;/p&gt;

&lt;p&gt;The task didn’t reliably restart after server reboots.&lt;/p&gt;

&lt;p&gt;Cron automatically recovers after a restart. Windows Scheduled Tasks do not, depending on how the task is defined. I attempted multiple configurations, including delayed-start options and different trigger types, but the results were inconsistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementing a Bootstrap Mechanism&lt;/strong&gt;&lt;br&gt;
To guarantee reliability, I introduced a lightweight bootstrap script triggered at system startup. Its job was simple:&lt;/p&gt;

&lt;p&gt;Check if the monitoring task was running&lt;/p&gt;

&lt;p&gt;Restart the task if necessary&lt;/p&gt;

&lt;p&gt;This ensured that no matter how often a server rebooted, storage monitoring resumed automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outcome &amp;amp; Impact&lt;/strong&gt;&lt;br&gt;
Despite the additional complexity, the Windows implementation ultimately delivered the same level of automation and reliability as the Linux version.&lt;/p&gt;

&lt;p&gt;Windows servers now report real-time storage metrics automatically.&lt;/p&gt;

&lt;p&gt;Server-specific identification allows accurate metric attribution.&lt;/p&gt;

&lt;p&gt;The bootstrap mechanism ensures resilience across reboots.&lt;/p&gt;

&lt;p&gt;Combined with the Linux solution, our monitoring became fully unified across the private cloud.&lt;/p&gt;

&lt;p&gt;This enhancement has improved infrastructure visibility, enabled early detection of storage issues, and contributed significantly to operational efficiency by reducing Mean Time to Detect (MTTD).&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Automating Storage Utilization Monitoring on a Private Cloud - Part 1</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 01 Dec 2025 11:13:58 +0000</pubDate>
      <link>https://forem.com/feyisayo/automating-storage-utilization-monitoring-on-a-private-cloud-part-1-1ngd</link>
      <guid>https://forem.com/feyisayo/automating-storage-utilization-monitoring-on-a-private-cloud-part-1-1ngd</guid>
      <description>&lt;p&gt;&lt;strong&gt;Automating Storage Utilization Monitoring on a Private Cloud - Part 1&lt;/strong&gt;&lt;br&gt;
 Monitoring is one of those things that’s easy to take for granted until it fails, and in this case, it was the missing piece in our private cloud setup.&lt;br&gt;
 A few years ago, I worked with a team that managed a private cloud infrastructure. One of our goals was to monitor the core performance metrics of each server: CPU, RAM, and storage utilization.&lt;br&gt;
 We already had reliable API endpoints that returned CPU and memory usage, but there was a gap: no API existed to track storage usage per instance. And that was a serious issue, because storage metrics were crucial for capacity planning and incident response.&lt;br&gt;
 So, I had to find a way to automatically collect and report disk usage for every server that was provisioned without relying on manual checks or post-deployment scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying the Challenge&lt;/strong&gt;&lt;br&gt;
 Without an API for disk metrics, we had partial visibility at best. The team could see how much compute power was being used, but not how much storage was left on each server.&lt;br&gt;
 We needed a lightweight, secure, and automated way to collect disk usage data and feed it back into our monitoring dashboard.&lt;br&gt;
 After considering a few options, I decided to design a solution that would hook directly into the instance provisioning process itself, using Cloud-Init.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution Design and Implementation&lt;/strong&gt;&lt;br&gt;
 The final solution was implemented through a Cloud-Init script and structured into three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Disk_Usage User&lt;br&gt;
The Cloud-Init script began by creating a dedicated user account named disk_usage. This user handled the background process responsible for collecting disk utilization data.&lt;br&gt;
To maintain system security and comply with the principle of least privilege, I ensured this user had only the minimal permissions required to execute the monitoring task. This isolated the disk usage process from other system operations, maintaining both security and stability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disk Usage Collection Script&lt;br&gt;
Next, I wrote a Bash script that gathered real-time disk utilization metrics from each server.&lt;br&gt;
Initially, this script worked, but there was a challenge: it collected data without identifying which server the metrics came from. We couldn’t map the usage data back to the specific instance that generated it, which made it useless for visualization or analysis.&lt;br&gt;
To solve this, I collaborated with our backend team to expose a custom API endpoint that returned the unique server identifier (Server ID).&lt;br&gt;
I then modified the script to include this Server ID in its output, ensuring every metric could be accurately traced back to its source instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Ingestion and Automation&lt;br&gt;
With data now correctly attributed, we needed a reliable way to push it to our monitoring dashboard.&lt;br&gt;
We created an API endpoint that accepted POST requests from the Bash script. Each server would send its latest disk usage metrics, along with its Server ID, at regular intervals. The backend then processed this data and made it available for visualization on the dashboard.&lt;br&gt;
To ensure the process ran automatically, I configured a cron job on each instance to execute the Bash script periodically. This eliminated manual intervention, keeping the data flow consistent and close to real time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Results and Impact&lt;br&gt;
By the end of the project, we had built a robust, automated monitoring solution that ran seamlessly in the background.&lt;br&gt;
The system continuously collected disk usage data from every instance and displayed it on the monitoring dashboard, giving the team full visibility into storage health across the private cloud.&lt;br&gt;
This implementation led to measurable improvements:&lt;/p&gt;

&lt;p&gt;Reduced Mean Time to Detect (MTTD) issues by 25%, thanks to early detection of disk saturation.&lt;/p&gt;

&lt;p&gt;Eliminated manual storage checks, freeing up the team’s time for higher-priority work.&lt;/p&gt;

&lt;p&gt;Improved infrastructure observability, allowing proactive scaling and faster incident resolution.&lt;/p&gt;
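&lt;p&gt;The collection-and-ingestion flow described above can be sketched in Bash. The metadata and ingestion endpoints, the payload shape, and the script name are illustrative assumptions, not the production code:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of the disk-usage collector. The two endpoint URLs and the JSON
# payload layout are hypothetical placeholders for the internal APIs
# described in the article.
set -euo pipefail

METADATA_API="${METADATA_API:-http://monitoring.internal/api/server-id}"   # assumed endpoint
INGEST_API="${INGEST_API:-http://monitoring.internal/api/disk-usage}"      # assumed endpoint

# Used-percentage of the root filesystem, e.g. "42" (POSIX-portable df).
disk_used_percent() {
  df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# Build the JSON payload from a server ID and a usage figure, with a
# UTC timestamp so the dashboard can order the samples.
build_payload() {
  printf '{"server_id":"%s","disk_used_percent":%s,"timestamp":"%s"}' \
    "$1" "$2" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

main() {
  # Fetch this instance's unique Server ID so every metric is attributable.
  server_id="$(curl -fsS "$METADATA_API")"
  usage="$(disk_used_percent)"
  # Report the sample to the monitoring backend.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(build_payload "$server_id" "$usage")" "$INGEST_API"
}

# Only contact the network when explicitly enabled, so the helper
# functions above can be exercised on their own.
if [[ "${RUN_COLLECTOR:-0}" == "1" ]]; then
  main "$@"
fi
```

&lt;p&gt;In the real setup this script was installed at provisioning time, owned by the disk_usage user, and driven by a cron entry along the lines of &lt;code&gt;*/5 * * * * /usr/local/bin/disk_usage.sh&lt;/code&gt; (path and interval illustrative).&lt;/p&gt;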

&lt;p&gt;What Came Next&lt;br&gt;
 Following the success of this implementation, I adapted the same solution for Windows servers using PowerShell.&lt;br&gt;
 That version was more complex due to Windows’ different automation and permission models, but the same core principle applied: collect and report system metrics automatically, securely, and in real time.&lt;br&gt;
 I’ll share that implementation in my next post.&lt;br&gt;
Key Takeaway&lt;br&gt;
 Monitoring shouldn’t be reactive. By embedding telemetry collection directly into the provisioning workflow, you can design observability into your infrastructure from day one, and save your team hours of troubleshooting later.&lt;/p&gt;

&lt;p&gt;Photo credit: &lt;a href="http://www.e-spincorp.com" rel="noopener noreferrer"&gt;www.e-spincorp.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>cloud</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>Case Study: Troubleshooting a Site-to-Site VPN Connection Between AWS and an On-Premises Data Center</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Mon, 17 Nov 2025 10:58:56 +0000</pubDate>
      <link>https://forem.com/feyisayo/case-study-troubleshooting-a-site-to-site-vpn-connection-between-aws-and-an-on-premises-data-5eef</link>
      <guid>https://forem.com/feyisayo/case-study-troubleshooting-a-site-to-site-vpn-connection-between-aws-and-an-on-premises-data-5eef</guid>
      <description>&lt;p&gt;Overview&lt;br&gt;
This case study details the investigation and resolution of a routing issue that affected the high availability of a site-to-site VPN connection between an AWS-hosted infrastructure and an on-premises data center. The VPN was established to facilitate secure integration with a third-party API service.&lt;br&gt;
Background&lt;br&gt;
As part of a feature implementation project, my team needed to integrate with a third-party API service that required communication over a VPN tunnel. The third party maintained an on-premises environment, while our infrastructure was hosted on Amazon Web Services (AWS). The integration was to be achieved through a site-to-site VPN connection, ensuring a secure and private communication channel between both networks.&lt;br&gt;
The service provider supplied the necessary VPN configuration details, and I was responsible for setting up the AWS side of the connection. The configuration involved creating a Customer Gateway (CGW), a Virtual Private Gateway (VGW), and establishing the site-to-site VPN connection using the parameters provided. Additionally, I updated the route tables to ensure all traffic destined for the on-prem network was directed through the VPN tunnel.&lt;br&gt;
To ensure a high availability setup, I configured two VPN tunnels. The secondary tunnel was intended to take over automatically in the event of a failure or maintenance on the primary tunnel.&lt;br&gt;
Implementation and Monitoring&lt;br&gt;
After configuring the tunnels and setting the pre-shared keys, both tunnels came up successfully. Connectivity was validated by initiating a telnet connection to the on-premises server’s private IP and port, which confirmed successful communication.&lt;br&gt;
To maintain visibility and proactive alerting, I enabled Amazon CloudWatch metrics for both VPN tunnels and configured alarms to notify the team whenever either tunnel went down. With the setup complete and all tests passed, the system was moved into production.&lt;br&gt;
Incident&lt;br&gt;
Some time after deployment, the primary VPN tunnel (Tunnel 1) was taken down for routine maintenance. Shortly after, our monitoring system triggered an alert indicating Tunnel 1 was down. However, despite having Tunnel 2 configured for redundancy, transaction failures began to occur.&lt;br&gt;
Initial troubleshooting revealed that although both tunnels were in an “up” state, communication through Tunnel 2 was not successful. It became evident that, until that point, all network traffic had been routed exclusively through Tunnel 1, and failover was not functioning as expected.&lt;br&gt;
To restore services and minimize downtime, I reactivated Tunnel 1. This restored connectivity and resumed transactions, but it confirmed that the failover mechanism was not working correctly.&lt;br&gt;
Root Cause Analysis&lt;br&gt;
Following service restoration, I initiated a root cause analysis (RCA) session with the third-party service provider. Preliminary checks confirmed that both VPN tunnels were established and active. To gather more insight, I enabled VPN logging and reviewed the CloudWatch log group, as well as the TunnelDataIn and TunnelDataOut metrics for both tunnels.&lt;br&gt;
The metrics revealed a key pattern:&lt;br&gt;
Tunnel 1: Active inbound and outbound data flow.&lt;/p&gt;

&lt;p&gt;Tunnel 2: Outbound data during failover attempts, but no inbound responses.&lt;/p&gt;

&lt;p&gt;This indicated that traffic from our side was reaching the on-premises network through Tunnel 2, but the responses were not being routed back correctly.&lt;br&gt;
Upon further investigation with the provider, we identified the root cause: the issue was route-based. The customer’s VPN configuration was correctly receiving traffic from our secondary tunnel, but due to an incorrect internal routing setup, return traffic was being sent to the wrong internal server. This misconfiguration prevented successful bidirectional communication during failover events.&lt;br&gt;
Resolution&lt;br&gt;
Once the routing issue on the customer’s side was corrected, we simulated another maintenance event by intentionally bringing down Tunnel 1. This time, failover occurred seamlessly: traffic transitioned to the secondary tunnel without service interruption. Monitoring metrics confirmed continuous bidirectional data flow through the secondary tunnel.&lt;br&gt;
Outcome&lt;br&gt;
After the corrective action, VPN failover worked as intended, significantly improving service reliability. The resolution reduced downtime incidents by 45% and minimized business impact, ensuring a stable and redundant communication channel between AWS and the on-premises infrastructure.&lt;br&gt;
Key Takeaways&lt;br&gt;
High Availability Validation: Regular failover testing is essential even when VPN tunnels report as active.&lt;/p&gt;

&lt;p&gt;Comprehensive Monitoring: CloudWatch metrics provided critical visibility that aided in identifying asymmetric traffic flow.&lt;/p&gt;

&lt;p&gt;Collaborative Troubleshooting: Effective communication between AWS and on-prem network teams was key to resolving the issue quickly.&lt;/p&gt;

&lt;p&gt;Route Verification: Route-based VPNs require careful verification on both ends to ensure symmetrical routing.&lt;/p&gt;
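&lt;p&gt;As a sketch of the diagnosis step: the CloudWatch query in the comment mirrors the TunnelDataIn/TunnelDataOut review described above, and the small helper classifies a tunnel from those summed counters. The VPN ID, tunnel IP, and time window are placeholders:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch: interpret per-tunnel traffic sums to spot asymmetric flow.
set -euo pipefail

# Example metric pull (requires AWS credentials; all values are placeholders).
# Repeat for TunnelDataOut and for the second tunnel's outside IP:
#   aws cloudwatch get-metric-statistics \
#     --namespace AWS/VPN --metric-name TunnelDataIn \
#     --dimensions Name=VpnId,Value=vpn-0123456789abcdef0 \
#                  Name=TunnelIpAddress,Value=203.0.113.10 \
#     --start-time 2025-11-17T00:00:00Z --end-time 2025-11-17T01:00:00Z \
#     --period 300 --statistics Sum

# Classify a tunnel from its summed byte counters: outbound traffic with
# no inbound responses is the signature seen on Tunnel 2 during failover.
classify_tunnel() {
  local data_in="$1" data_out="$2"
  if (( data_in == 0 )); then
    if (( data_out > 0 )); then
      echo "asymmetric: outbound traffic but no inbound responses"
    else
      echo "idle: no traffic observed"
    fi
  else
    echo "healthy: bidirectional flow"
  fi
}
```

&lt;p&gt;Running the classifier over both tunnels’ sums makes the failure mode explicit: a tunnel that is “up” yet asymmetric points at a return-path routing problem rather than an IPsec problem.&lt;/p&gt;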

&lt;p&gt;All rights reserved. This article was written by Feyisayo Lasisi and is protected by copyright.&lt;/p&gt;

</description>
      <category>login</category>
    </item>
    <item>
      <title>Hello Dev Community</title>
      <dc:creator>Feyisayo Lasisi</dc:creator>
      <pubDate>Wed, 05 Nov 2025 06:27:04 +0000</pubDate>
      <link>https://forem.com/feyisayo/hello-dev-community-180i</link>
      <guid>https://forem.com/feyisayo/hello-dev-community-180i</guid>
      <description>&lt;p&gt;Hey Dev Community!&lt;/p&gt;

&lt;p&gt;I’m Feyisayo Taiwo Lasisi, a DevOps Engineer with over 5 years of experience working across Linux systems, CI/CD pipelines, and cloud infrastructure.&lt;/p&gt;

&lt;p&gt;I’ve spent most of my career building and maintaining scalable systems on AWS, automating infrastructure with Terraform and Ansible, and managing containerized workloads with Docker and Kubernetes (EKS/AKS).&lt;/p&gt;

&lt;p&gt;In my recent role, I worked on implementing real-time log monitoring and alerting using AWS Lambda, DynamoDB, and CloudWatch — which helped reduce incident detection and recovery times by almost half. I’m also passionate about DevSecOps, so I often integrate tools like SonarQube, Trivy, and Gitleaks into pipelines to keep things secure and compliant.&lt;/p&gt;

&lt;p&gt;Right now, I’m exploring more around Database Migration as a Service (DMaaS), automation strategies for complex cloud setups, and the best ways to build resilient infrastructure that scales gracefully.&lt;/p&gt;

&lt;p&gt;I’ll be sharing my learnings, experiments, and maybe a few war stories from production environments here on Dev.to — from IaC tips to CI/CD tricks and security-first automation.&lt;/p&gt;

&lt;p&gt;If that sounds like your kind of thing, follow along. Let’s build better systems together!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
