<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ijeawele Divine Nkwocha</title>
    <description>The latest articles on Forem by Ijeawele Divine Nkwocha (@ijeawele).</description>
    <link>https://forem.com/ijeawele</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3015627%2Fa551e86e-f40d-4771-a62e-c97bc16e8a95.jpg</url>
      <title>Forem: Ijeawele Divine Nkwocha</title>
      <link>https://forem.com/ijeawele</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ijeawele"/>
    <language>en</language>
    <item>
      <title>Deploying a Production-Grade Microservices Platform on AWS EKS, Every Decision, Every Error, Every Lesson</title>
      <dc:creator>Ijeawele Divine Nkwocha</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:12:53 +0000</pubDate>
      <link>https://forem.com/ijeawele/deploying-a-production-grade-microservices-platform-on-aws-eks-every-decision-every-error-every-29mm</link>
      <guid>https://forem.com/ijeawele/deploying-a-production-grade-microservices-platform-on-aws-eks-every-decision-every-error-every-29mm</guid>
      <description>&lt;p&gt;Most Kubernetes tutorials stop at "your pod is running." That's not production.&lt;/p&gt;

&lt;p&gt;Production is secrets management, autoscaling, TLS automation, persistent storage across availability zones, and an ingress layer that handles real traffic patterns. This article walks through a full microservices deployment on AWS EKS: the architecture decisions, the errors that will humble you if you skip the fundamentals, and the things worth doing differently on the next project.&lt;/p&gt;

&lt;p&gt;The platform is RideShare Pro. Six independent services, a centralised ingress layer, managed data stores, and a live domain. &lt;a href="https://github.com/ijeawele-divine/rideshare-app/tree/master" rel="noopener noreferrer"&gt;GitHub repo here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a DevOps engineer working toward production-grade Kubernetes, this is the kind of breakdown you won't find in a quickstart guide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh338onqommlq7en378el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh338onqommlq7en378el.png" alt="Application Deployed with Custom Domain" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Being Built
&lt;/h2&gt;

&lt;p&gt;RideShare Pro is a microservices-based rideshare application where each business capability lives in its own independent service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rider Service&lt;/strong&gt;, rider profiles, ride requests, status tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver Service&lt;/strong&gt;, driver profiles, vehicle management, availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trips Service&lt;/strong&gt;, trip lifecycle from creation to completion, trip history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matching Service&lt;/strong&gt;, real-time matching of riders with the nearest available driver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email Service&lt;/strong&gt;, transactional emails triggered by events from other services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;, the user-facing web application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each service communicates over HTTP APIs. All external traffic is routed through a centralised NGINX Ingress Controller acting as the API gateway.&lt;/p&gt;

&lt;p&gt;The goal: deploy this on AWS EKS in a way that is scalable, highly available, and genuinely production-ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxvykap455ytjm1rwg9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxvykap455ytjm1rwg9q.png" alt="Rideshare Pro Architecture Diagram" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1, Deploy Locally Before Touching the Cluster
&lt;/h2&gt;

&lt;p&gt;Before writing a single Kubernetes manifest, deploy the entire application locally.&lt;/p&gt;

&lt;p&gt;This one decision saves hours.&lt;/p&gt;

&lt;p&gt;Local deployment surfaces errors in the codebase itself, bugs that would later appear as &lt;code&gt;CrashLoopBackOff&lt;/code&gt; pods with no obvious cause. Catching them at the local stage means you're not debugging application code and infrastructure configuration simultaneously. That combination is one of the most frustrating debugging scenarios in DevOps work.&lt;/p&gt;

&lt;p&gt;The cluster will surface problems. It won't always tell you clearly why. Local validation first means the only thing the cluster is testing is the infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2, Build Docker Images and Push to ECR
&lt;/h2&gt;

&lt;p&gt;Once the application is running locally, build Docker images for each service and store them in Amazon ECR (Elastic Container Registry).&lt;/p&gt;

&lt;p&gt;ECR is the right choice when you're already in the AWS ecosystem. It integrates natively with EKS, uses IAM-based access control, and there's no friction pulling images at deploy time. No separate registry credentials to manage. No external dependency at cluster startup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3, Kubernetes Manifest Structure
&lt;/h2&gt;

&lt;p&gt;The manifests are organised into four directories: &lt;code&gt;aws/&lt;/code&gt;, &lt;code&gt;platform/&lt;/code&gt;, &lt;code&gt;stateful/&lt;/code&gt;, and &lt;code&gt;applications/&lt;/code&gt;. Each has a clear separation of concerns.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;aws/&lt;/code&gt;, Cluster-Level Infrastructure
&lt;/h3&gt;

&lt;p&gt;These manifests interact directly with the AWS API to provision cluster-level resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;node-groups.yaml&lt;/strong&gt; creates managed node groups, collections of EC2 instances that Kubernetes schedules pods onto. Managed node groups mean AWS handles provisioning, scaling, and lifecycle management. &lt;code&gt;t3.medium&lt;/code&gt; instances across multiple Availability Zones cover general-purpose workloads well without over-provisioning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iam-roles.yaml&lt;/strong&gt; sets up IAM Roles for Service Accounts (IRSA). IRSA is how specific Kubernetes service accounts get permission to call AWS APIs, in this case, permission to create EBS volumes for persistent storage. This is the correct approach. Giving broad IAM permissions to nodes is a security anti-pattern. IRSA scopes permissions to exactly what each service account needs and nothing more.&lt;/p&gt;
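
&lt;p&gt;As a minimal sketch, an IRSA-enabled service account is just a regular ServiceAccount carrying a role annotation; the account name, namespace, and role ARN below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebs-csi-controller-sa
  namespace: kube-system
  annotations:
    # IAM role trusted through the cluster's OIDC provider (ARN is a placeholder)
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ebs-csi-driver-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;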

&lt;p&gt;&lt;strong&gt;storage-classes.yaml&lt;/strong&gt; creates an EBS gp3 StorageClass using the IRSA role above. The critical setting here is &lt;code&gt;volumeBindingMode: WaitForFirstConsumer&lt;/code&gt;. More on why this matters in the errors section.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;platform/&lt;/code&gt;, Shared Cluster-Wide Components
&lt;/h3&gt;

&lt;p&gt;This directory sets up everything that makes the cluster secure and scalable, autoscaling, ingress, secrets, and security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autoscaling&lt;/strong&gt;, Two autoscaling mechanisms work together here. The Horizontal Pod Autoscaler (HPA) scales pods when CPU or memory thresholds are hit. When all nodes are full and more pods need scheduling, the Cluster Autoscaler adds nodes to accommodate them. Both are necessary. HPA without the Cluster Autoscaler means pod scaling stalls when node capacity runs out.&lt;/p&gt;

&lt;p&gt;One thing worth knowing about HPA: it depends on resource &lt;strong&gt;requests&lt;/strong&gt; being set on containers. HPA measures CPU utilisation as a percentage of the &lt;em&gt;requested&lt;/em&gt; CPU, not the limit. Without requests, it has no baseline and effectively does nothing. Set requests everywhere, and set &lt;strong&gt;limits&lt;/strong&gt; alongside them as good practice.&lt;/p&gt;
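
&lt;p&gt;For illustration, a container spec with both set might look like this (values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    cpu: 250m        # HPA utilisation is computed against this value
    memory: 256Mi
  limits:
    cpu: 500m        # hard ceiling enforced at runtime
    memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a 70% CPU target, HPA would scale out once average usage per pod crosses roughly 175m.&lt;/p&gt;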

&lt;p&gt;&lt;strong&gt;Ingress&lt;/strong&gt;, NGINX Ingress Controller with path-based routing. The backend ingress rules (&lt;code&gt;ingress-api.yaml&lt;/code&gt;) are separated from the frontend (&lt;code&gt;ingress-frontend.yaml&lt;/code&gt;) deliberately. API paths need specific annotations, rate limiting, and authentication headers that shouldn't bleed over to the frontend. Separating them gives cleaner, more targeted control and makes future changes safer.&lt;/p&gt;

&lt;p&gt;For HTTPS, cert-manager with a ClusterIssuer pointing to Let's Encrypt handles certificate provisioning and renewal automatically. Production deployments need HTTPS. This is the cleanest way to handle it without any manual certificate management.&lt;/p&gt;
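
&lt;p&gt;A ClusterIssuer for this setup looks roughly like the following; the email and account-key secret name are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Let's Encrypt production endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;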

&lt;p&gt;&lt;strong&gt;Secrets&lt;/strong&gt;, This is where most engineers either get it right or create a future security incident.&lt;/p&gt;

&lt;p&gt;Native Kubernetes Secrets are base64 encoded, not encrypted. Anyone with cluster access can decode them. The production-grade approach is the &lt;strong&gt;External Secrets Operator (ESO)&lt;/strong&gt; with &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's how it works: secrets, database URLs, Redis connection strings, JWT keys and other sensitive credentials are stored in AWS Secrets Manager. ESO creates a SecretStore pointing to that service. ExternalSecret resources reference the store and map specific secrets into pods as environment variables. ESO syncs on a configurable schedule, so rotating a secret in AWS Secrets Manager propagates to the cluster automatically.&lt;/p&gt;
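
&lt;p&gt;A sketch of that wiring, with illustrative names and a hypothetical Secrets Manager key:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: trips-db-credentials
  namespace: rideshare-app
spec:
  refreshInterval: 1h          # re-sync cadence; rotations propagate automatically
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: trips-db-credentials # the Kubernetes Secret ESO creates and keeps in sync
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: rideshare/trips/database-url   # key in AWS Secrets Manager (illustrative)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;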

&lt;p&gt;This eliminates an entire class of security risk. The alternative, base64-encoded secrets committed to YAML files, leaves your credentials one public repository search away from exposure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fnd0bz5zz255qapw2te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fnd0bz5zz255qapw2te.png" alt="Secret Store for Trips Service in AWS Secrets Manager" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;, Pod Disruption Budgets (PDB) for all critical services. A PDB ensures that during node maintenance or cluster upgrades, Kubernetes cannot take down more than a defined number of pods simultaneously. Setting &lt;code&gt;minAvailable: 2&lt;/code&gt; means regardless of what's happening at the node level, at least 2 pods of that service stay running. This is the difference between a cluster that survives a rolling upgrade and one that causes an outage during routine maintenance.&lt;/p&gt;
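
&lt;p&gt;A PDB of that shape, assuming a service labelled &lt;code&gt;app: trips-service&lt;/code&gt; (the names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trips-service-pdb
spec:
  minAvailable: 2          # voluntary disruptions can never drop this service below 2 pods
  selector:
    matchLabels:
      app: trips-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;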

&lt;h3&gt;
  
  
  &lt;code&gt;stateful/&lt;/code&gt;, Persistent Data
&lt;/h3&gt;

&lt;p&gt;This is the most significant architectural decision in the project, and it's a deliberate departure from the typical tutorial approach.&lt;/p&gt;

&lt;p&gt;A standard spec would call for PostgreSQL and Redis deployed as StatefulSets inside the cluster. Here, both are replaced with managed AWS services: &lt;strong&gt;Aurora RDS&lt;/strong&gt; for PostgreSQL and &lt;strong&gt;Amazon ElastiCache&lt;/strong&gt; for Redis.&lt;/p&gt;

&lt;p&gt;The reasoning is operational reality.&lt;/p&gt;

&lt;p&gt;StatefulSets in Kubernetes are powerful but come with real overhead. Database replication, node failure recovery, volume reattachment, version upgrades, all of that falls on the team. For most production systems, that's engineering time that isn't being spent on the product.&lt;/p&gt;

&lt;p&gt;Aurora RDS changes the equation. Replication across Availability Zones is automatic. Storage scales without intervention. Automated backups, failover, and read replicas are built in. ElastiCache gives the same model for Redis, managed, highly available, secure, with automatic failover and no operational burden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65grjms8lwgd554d9dew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65grjms8lwgd554d9dew.png" alt="Details of RDS on AWS Portal" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tradeoff is cost and cloud portability. Managed services cost more than self-hosted, and you're tied to AWS. For a production system where reliability and engineering time both matter, this is the right call. Know the tradeoff, make the decision consciously.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;applications/&lt;/code&gt;, The Microservices
&lt;/h3&gt;

&lt;p&gt;Each service directory contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deployment.yaml&lt;/code&gt;, pod specs, container definitions, resource requests/limits, environment variables pulled from ExternalSecrets&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;service.yaml&lt;/code&gt;, ClusterIP service exposing the deployment internally&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hpa.yaml&lt;/code&gt;, Horizontal Pod Autoscaler targeting CPU and memory thresholds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;configmap.yaml&lt;/code&gt;, non-sensitive configuration like service URLs and feature flags&lt;/li&gt;
&lt;/ul&gt;
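
&lt;p&gt;As an example of the &lt;code&gt;hpa.yaml&lt;/code&gt; shape (names and thresholds are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trips-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trips-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the container's CPU *request*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;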

&lt;p&gt;All services are ClusterIP. External traffic flows through NGINX Ingress only. Exposing individual services directly to the internet through &lt;code&gt;LoadBalancer&lt;/code&gt;-type Services is both a cost and a security problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4, Real Domain, Real HTTPS
&lt;/h2&gt;

&lt;p&gt;Deploying to a cluster is one thing. A live URL that external traffic can hit is another, and it's what makes a portfolio project credible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding the Load Balancer IP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When NGINX Ingress Controller deploys on EKS, it provisions an AWS Load Balancer automatically. To get the external address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; ingress-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;EXTERNAL-IP&lt;/code&gt; column on the &lt;code&gt;ingress-nginx-controller&lt;/code&gt; service is the cluster's entry point; on AWS it shows the load balancer's DNS hostname rather than a raw IP. That's what DNS points at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In GoDaddy (or whichever registrar), add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type: &lt;code&gt;CNAME&lt;/code&gt; (or &lt;code&gt;A&lt;/code&gt; with the resolved IP; AWS load balancer IPs can change, so a CNAME to the load balancer hostname is the safer choice)
&lt;/li&gt;
&lt;li&gt;Name: &lt;code&gt;rideshare&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Value: the load balancer address from above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;rideshare.ijeaweledivine.online&lt;/code&gt; now routes to the NGINX Ingress Controller, which applies path rules to reach the correct service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLS Automation with cert-manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As covered earlier, cert-manager with a Let's Encrypt ClusterIssuer handles certificate provisioning and renewal automatically.&lt;/p&gt;

&lt;p&gt;The annotations that drive this on the Ingress manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nginx.ingress.kubernetes.io/use-regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;cert-manager.io/cluster-issuer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;letsencrypt-prod&lt;/span&gt;
  &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
  &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-send-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
  &lt;span class="na"&gt;nginx.ingress.kubernetes.io/websocket-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trip-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of these annotations is worth understanding:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;use-regex: "true"&lt;/code&gt;, enables regex path matching. Without this, path rules are basic prefix matching only. With it, you can write precise rules like &lt;code&gt;/api/trips/.*&lt;/code&gt; to catch all trip-related routes cleanly.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cluster-issuer: letsencrypt-prod&lt;/code&gt;, the key annotation. cert-manager sees this, creates a CertificateRequest, runs the ACME challenge with Let's Encrypt, gets a signed certificate, stores it as a Kubernetes Secret, and handles renewal before expiry. One annotation. Permanent HTTPS.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;proxy-read-timeout&lt;/code&gt; and &lt;code&gt;proxy-send-timeout: "3600"&lt;/code&gt;, set both timeouts to 1 hour. The default NGINX timeout is 60 seconds. For a rideshare platform where an active trip can last 45 minutes, 60 seconds kills live connections mid-trip. Match your timeout values to your actual usage patterns.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;websocket-services: "trip-service"&lt;/code&gt;, the trips service uses WebSockets for real-time communication: live trip status updates, driver location tracking. Standard HTTP is request-response and closes. WebSockets stay open. Without this annotation, NGINX doesn't handle the connection upgrade correctly, and real-time features fail silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cert-manager Automation Flow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the Ingress is applied with the cert-manager annotation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cert-manager detects the annotation and creates a &lt;code&gt;CertificateRequest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Let's Encrypt issues an ACME challenge, proof that the domain is under your control&lt;/li&gt;
&lt;li&gt;cert-manager creates a temporary pod and Ingress rule to respond to the challenge&lt;/li&gt;
&lt;li&gt;Let's Encrypt verifies, issues the certificate&lt;/li&gt;
&lt;li&gt;cert-manager stores it as a Kubernetes Secret and mounts it into the Ingress&lt;/li&gt;
&lt;li&gt;NGINX serves HTTPS traffic&lt;/li&gt;
&lt;/ol&gt;
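
&lt;p&gt;The flow above is triggered by an Ingress along these lines; the host matches this project, but the service name and TLS secret name are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-frontend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - rideshare.ijeaweledivine.online
      secretName: rideshare-tls   # cert-manager writes the issued certificate here
  rules:
    - host: rideshare.ijeaweledivine.online
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;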

&lt;p&gt;Watch it happen in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get certificaterequest &lt;span class="nt"&gt;-n&lt;/span&gt; your-namespace
kubectl describe certificaterequest &amp;lt;name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; your-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Status shows &lt;code&gt;Approved&lt;/code&gt; and &lt;code&gt;Issued&lt;/code&gt; in about 2 minutes. The whole process is hands-off after the initial annotation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Errors That Will Stress You Out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Error 1, &lt;code&gt;no topology key found for node&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The EBS CSI Driver couldn't identify which Availability Zone the worker node was in, so it couldn't safely create a persistent volume. EBS volumes are AZ-specific; a volume in &lt;code&gt;eu-north-1a&lt;/code&gt; cannot be mounted by a pod in &lt;code&gt;eu-north-1b&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Two fixes:&lt;/p&gt;

&lt;p&gt;EKS nodes need the label &lt;code&gt;topology.ebs.csi.aws.com/zone&lt;/code&gt; for the storage driver to identify the AZ. Apply this to your node groups.&lt;/p&gt;

&lt;p&gt;More importantly: set &lt;code&gt;volumeBindingMode: WaitForFirstConsumer&lt;/code&gt; in your StorageClass. Without this, Kubernetes creates the EBS volume before it knows which node the pod will land on. &lt;code&gt;WaitForFirstConsumer&lt;/code&gt; delays volume creation until the pod is scheduled to a node, then creates the volume in the same AZ. This single setting eliminates an entire class of storage scheduling problems.&lt;/p&gt;
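
&lt;p&gt;A gp3 StorageClass with the fix applied (the class name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # create the volume in the scheduled pod's AZ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;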

&lt;h3&gt;
  
  
  Error 2, &lt;code&gt;secret "postgres-credentials" not found&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Namespace isolation. Kubernetes secrets are namespace-scoped. A pod in the &lt;code&gt;rideshare-app&lt;/code&gt; namespace cannot access a secret created in the &lt;code&gt;default&lt;/code&gt; namespace. The credentials existed; they were just invisible from where the pod was looking.&lt;/p&gt;

&lt;p&gt;When pods hit &lt;code&gt;CreateContainerConfigError&lt;/code&gt; or &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and the image is confirmed healthy, check namespace alignment before anything else. It's almost always either a namespace mismatch or a missing secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do Differently
&lt;/h2&gt;

&lt;p&gt;Two gaps from the project review are worth calling out explicitly, because they're easy to miss and high-leverage to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health probes&lt;/strong&gt;, &lt;code&gt;livenessProbe&lt;/code&gt; and &lt;code&gt;readinessProbe&lt;/code&gt; tell Kubernetes whether a pod is healthy and ready to receive traffic. Without them, Kubernetes has no mechanism to automatically restart a stuck pod or remove it from rotation when it's not ready. The result: a broken pod silently receives live traffic and returns errors until someone notices manually. Adding probes is a small amount of YAML that meaningfully improves reliability. Don't skip them.&lt;/p&gt;
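
&lt;p&gt;A minimal pair of probes, assuming the service exposes &lt;code&gt;/healthz&lt;/code&gt; and &lt;code&gt;/ready&lt;/code&gt; endpoints on port 8080 (both are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # give the app time to boot before liveness checks start
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5          # failing pods are pulled from Service endpoints automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;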

&lt;p&gt;&lt;strong&gt;Resource requests on HPA&lt;/strong&gt;, As covered above: HPA uses requests as the baseline for utilisation calculations. Limits without requests give the autoscaler nothing to measure against. Set both, always.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The infrastructure setup is the visible part. The real depth is in the operational decisions, why managed services over StatefulSets, why External Secrets over native Kubernetes Secrets, why separate ingress manifests for API and frontend.&lt;/p&gt;

&lt;p&gt;Those decisions are what separate someone who knows Kubernetes syntax from someone who can design a system that holds up under real conditions.&lt;/p&gt;

&lt;p&gt;For anyone working through something similar: deploy locally first, understand service communication before touching the cluster, and don't treat health probes and resource requests as optional polish. They're not. The cluster runs without them until it doesn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ijeawele is a DevOps Engineer building production-grade infrastructure and writing about it in plain terms. More projects coming.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Reach out to me with questions or opportunities on &lt;a href="https://linkedin.com/in/ijeawele-nkwocha" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
Check out my other projects on &lt;a href="https://github.com/ijeawele-divine" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>eks</category>
      <category>microservices</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Azure Disaster Recovery with Terraform: Complete RTO/RPO Guide (2025)</title>
      <dc:creator>Ijeawele Divine Nkwocha</dc:creator>
      <pubDate>Thu, 27 Nov 2025 00:27:40 +0000</pubDate>
      <link>https://forem.com/ijeawele/building-resilient-cloud-infrastructure-with-terraform-and-azure-4637</link>
      <guid>https://forem.com/ijeawele/building-resilient-cloud-infrastructure-with-terraform-and-azure-4637</guid>
      <description>&lt;p&gt;I recently architected a dual-variant infrastructure testing environment using Terraform. I built both resilient and non-resilient cloud resources to benchmark a cloud resilience monitoring tool. The objective was to calculate precise Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics for individual Azure resources across 12 service types.&lt;/p&gt;

&lt;p&gt;As an AWS-native DevOps engineer, I approached this Azure implementation strategically. Rather than relying solely on documentation, I took time to understand Azure's disaster recovery model in depth, since it differs significantly from AWS's approach. The result was a production-grade infrastructure that demonstrates how DR strategies must be tailored to each cloud provider's architecture, not simply translated across platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Disaster Recovery: RTO and RPO
&lt;/h2&gt;

&lt;p&gt;The main differentiators between resilient and non-resilient resources came down to two key factors: disaster recovery configuration and the chosen tier (Basic, Standard, or Premium). The configuration a resource is deployed with determines how much it can withstand during a disaster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery Time Objective (RTO)&lt;/strong&gt; is how long it takes to get operations back to normal and restore systems after a disruption. It's the maximum acceptable time within which recovery must be achieved to avoid significant business impact. For example, a system with an RTO of 5 hours must be back online within 5 hours of a failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery Point Objective (RPO)&lt;/strong&gt; defines the maximum amount of data loss tolerable, measured in time. It indicates how frequently data backups or replication occur. For example, an RPO of 15 minutes means data is backed up or replicated at least every 15 minutes. It shows how much data an organisation can afford to lose if systems go down.&lt;/p&gt;

&lt;p&gt;RTO and RPO differ for every organisation, and the goal of disaster recovery is to ensure these objectives are met in case of any disaster. How do you ensure this? By provisioning resources with failure in mind. &lt;em&gt;You can't build a perfect system, but you can build a perfect recovery system&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;How long will it take to bounce back after downtime, and how much data would you lose? Those are things within your control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Cloud Resources Resilient
&lt;/h2&gt;

&lt;p&gt;One key thing that makes cloud resources resilient is redundancy: how many copies of a resource exist and where those copies live. You can have resources replicated within the same zone, but that alone doesn't make them resilient, because when something happens in that zone, every copy there is affected.&lt;/p&gt;

&lt;p&gt;Azure offers a range of redundancy options with different availability and durability guarantees, from 99.9% (three 9s) up to 99.999999999% (eleven 9s) and beyond. When it comes to storage, it's critical to replicate your data across different data centres, zones, and regions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Storage Redundancy Options
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locally Redundant Storage (LRS):&lt;/strong&gt; Data is replicated three times within a single data centre, with 99.999999999% (eleven 9s) durability. It protects against local hardware failures but not data centre-level faults.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zone-Redundant Storage (ZRS):&lt;/strong&gt; Data is synchronously replicated across three separate Availability Zones within a region, enhancing resilience against zone or data centre failures. Your data remains accessible even if one zone fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-Redundant Storage (GRS):&lt;/strong&gt; Combines LRS in the primary region with asynchronous replication to a secondary region to guard against entire regional failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-Zone-Redundant Storage (GZRS):&lt;/strong&gt; Combines ZRS in the primary region with asynchronous replication to a secondary region for both zonal and regional fault tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The kind of storage you choose determines how resilient the whole system is. It's literally how your data is stored and protected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience Is Component-Specific
&lt;/h2&gt;

&lt;p&gt;Each component has different ways to be resilient, so how do you compare them? You don't. Resilience is assessed per component, based on that component's own failure modes and recovery mechanisms.&lt;/p&gt;

&lt;p&gt;The resources I deployed include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Virtual Machines&lt;/li&gt;
&lt;li&gt;SQL Database&lt;/li&gt;
&lt;li&gt;Azure Kubernetes Service (AKS)&lt;/li&gt;
&lt;li&gt;Azure Container Registry (ACR)&lt;/li&gt;
&lt;li&gt;Azure Managed Disks&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage (ADLS)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ensuring they're all resilient is different for each and must be worked on individually. Each service has its own disaster recovery mechanism: SQL failover groups, ADLS native replication, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code with Terraform
&lt;/h2&gt;

&lt;p&gt;I provisioned each component using Terraform and used modules to set up individual resources. I configured a remote backend using Azure Blob Storage and implemented state locking with Azure storage to prevent concurrent modifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disaster Recovery Solutions by Resource
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Virtual Machines&lt;/strong&gt;&lt;br&gt;
I configured Azure Site Recovery (ASR) for the resilient VM with replication to a paired region, set up a backup policy with scheduled snapshots in a Recovery Services Vault, used Premium SSDs, and deployed across multiple Availability Zones.&lt;br&gt;
&lt;strong&gt;Azure Kubernetes Service (AKS)&lt;/strong&gt;&lt;br&gt;
The disaster recovery solution for AKS included multiple node pools spread across Availability Zones and a backup solution using Azure Backup. Interestingly, the backup vault couldn't be deleted for about a week even after running &lt;code&gt;terraform destroy&lt;/code&gt;, a safety feature to prevent accidental data loss.&lt;br&gt;
&lt;strong&gt;Managed Disks&lt;/strong&gt;&lt;br&gt;
I configured GRS storage with Premium SSD, automated snapshot policies to back up the disks, and cross-region replication for both the managed disk and snapshots.&lt;br&gt;
&lt;strong&gt;Azure Container Registry (ACR)&lt;/strong&gt;&lt;br&gt;
For ACR, I set up geo-replication to a secondary region (from westus2 to centralus), ensuring container images and tags are automatically copied and synchronised.&lt;br&gt;
&lt;strong&gt;Azure Data Lake Storage Gen2&lt;/strong&gt;&lt;br&gt;
ADLS was configured with GRS replication and cross-region failover readiness.&lt;br&gt;
&lt;strong&gt;SQL Database&lt;/strong&gt;&lt;br&gt;
The SQL database was set up with zone redundancy, active geo-replication to a paired region, and failover groups for automatic coordinated failover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive into Key DR Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure Site Recovery (ASR)&lt;/strong&gt;&lt;br&gt;
Azure Site Recovery is Azure's Disaster Recovery as a Service (DRaaS) offering. I configured it separately from the main resources and referenced it as a data source in the root module, creating a separate directory named asr-setup-vault with its own state file and provider block.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ASR setup involves configuring the ASR fabric, the protection container, the replication policy, and the protection container mapping. A shared vault serves both the managed disk and the VM, with a backup policy attached to that vault. The expected outcome is cross-region replicated VMs and managed disks.&lt;/p&gt;
&lt;/blockquote&gt;
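&lt;p&gt;The four ASR building blocks map directly onto azurerm provider resources. A minimal sketch (resource names are illustrative, and a real setup declares a fabric and container for the secondary region too, referenced here but not shown):&lt;/p&gt;

```hcl
# Fabric: represents the primary region inside the Recovery Services vault
resource "azurerm_site_recovery_fabric" "primary" {
  name                = "fabric-primary"
  resource_group_name = azurerm_resource_group.dr.name
  recovery_vault_name = azurerm_recovery_services_vault.shared.name
  location            = "westus2"
}

# Protection container: groups protected items within the fabric
resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "container-primary"
  resource_group_name  = azurerm_resource_group.dr.name
  recovery_vault_name  = azurerm_recovery_services_vault.shared.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

# Replication policy: recovery point retention and snapshot frequency
resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "replication-policy"
  resource_group_name                                  = azurerm_resource_group.dr.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.shared.name
  recovery_point_retention_in_minutes                  = 24 * 60
  application_consistent_snapshot_frequency_in_minutes = 4 * 60
}

# Mapping: ties the primary container to its secondary-region counterpart
resource "azurerm_site_recovery_protection_container_mapping" "mapping" {
  name                                      = "container-mapping"
  resource_group_name                       = azurerm_resource_group.dr.name
  recovery_vault_name                       = azurerm_recovery_services_vault.shared.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
}
```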

&lt;p&gt;&lt;strong&gt;Data Protection Backup Vault&lt;/strong&gt;&lt;br&gt;
I used a Data Protection backup vault to store snapshots and AKS backups. A data protection backup vault is a secure, centralised storage entity designed to store and manage backup data and recovery points over time. It acts as a container for backups, providing protection through encryption, data isolation, and access control mechanisms to ensure the integrity and availability of backup data even if production systems are compromised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple Node Pools Across Availability Zones&lt;/strong&gt;&lt;br&gt;
When multiple node pools in an AKS cluster are spread across Availability Zones, it means that each node pool's virtual machines are distributed across different isolated physical locations within the same Azure region. This configuration boosts the cluster's resilience and availability because even if one AZ experiences an outage, nodes in other AZs remain functional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geo-Replication for Container Registry&lt;/strong&gt;&lt;br&gt;
Geo-replication for ACR means that the container registry's contents, including container images and tags, are automatically copied and synchronized from the primary region to one or more secondary Azure regions. This ensures high availability and reduces latency for pulling images from different geographic locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL Database Failover Groups&lt;/strong&gt;&lt;br&gt;
Failover groups for Azure SQL Database enable automatic and coordinated failover of a group of databases from a primary server in one Azure region to a secondary server in another region. This ensures high availability and disaster recovery by replicating databases geo-redundantly and allowing seamless switching to the secondary region if the primary becomes unavailable due to an outage or disaster.&lt;/p&gt;

&lt;h2&gt;Decision Framework for Disaster Recovery&lt;/h2&gt;

&lt;p&gt;Here's a simple decision tree I used to determine the right DR strategy:&lt;/p&gt;

&lt;p&gt;Is this data critical to business operations?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;YES:&lt;/strong&gt; can you afford to lose any data?
&lt;ul&gt;
&lt;li&gt;No data loss acceptable: active geo-replication (SQL failover groups, GZRS storage)&lt;/li&gt;
&lt;li&gt;Under 1 hour of data loss acceptable: ASR for VMs, GZRS for storage&lt;/li&gt;
&lt;li&gt;1-24 hours of data loss acceptable: daily backups, GRS storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NO&lt;/strong&gt; (data not critical): use the most cost-effective option: LRS storage, Basic SKUs, no ASR&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;p&gt;Building resilient infrastructure isn't about making everything highly available; it's about understanding your business requirements and making informed decisions about where to invest in redundancy and disaster recovery. Each Azure service has its own DR mechanisms, and you need to understand them individually to build a truly resilient system.&lt;/p&gt;

&lt;p&gt;This project taught me that moving from AWS to Azure isn't just about learning new service names; it's about understanding fundamentally different approaches to disaster recovery and resilience. The skills are transferable, but the implementation details matter.&lt;/p&gt;

&lt;p&gt;Got questions about building resilient infrastructure, or want to talk disaster recovery strategies? I'm always happy to chat!&lt;/p&gt;

&lt;p&gt;I'm a DevOps engineer and technical writer currently open to new opportunities. &lt;em&gt;If you're hiring or want to connect&lt;/em&gt;, reach out on &lt;a href="https://linkedin.com/in/ijeawele-nkwocha" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or drop me an email at &lt;a href="mailto:nkwochaijeawele@gmail.com"&gt;nkwochaijeawele@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>terraform</category>
      <category>devops</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
