<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: abidaslam892</title>
    <description>The latest articles on Forem by abidaslam892 (@abidaslam892).</description>
    <link>https://forem.com/abidaslam892</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3579881%2Ffa5cf121-fd9d-422e-a2a6-d52aef77a8b2.png</url>
      <title>Forem: abidaslam892</title>
      <link>https://forem.com/abidaslam892</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abidaslam892"/>
    <language>en</language>
    <item>
      <title>Building a Production-Multi-Cloud DevOps Platform: A Complete Journey from Zero to Hero</title>
      <dc:creator>abidaslam892</dc:creator>
      <pubDate>Sun, 23 Nov 2025 09:58:32 +0000</pubDate>
      <link>https://forem.com/abidaslam892/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-29g0</link>
      <guid>https://forem.com/abidaslam892/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-29g0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Building a Production-Multi-Cloud DevOps Platform: A Complete Journey from Zero to Hero&lt;/strong&gt;&lt;br&gt;
Abidaslam&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The original article is available on Medium:&lt;br&gt;
&lt;a href="https://medium.com/design-bootcamp/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-ef292ff0f0c6" rel="noopener noreferrer"&gt;https://medium.com/design-bootcamp/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-ef292ff0f0c6&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How I Built and Deployed a FastAPI Application Across AWS EKS and Azure AKS with Full CI/CD, Security Scanning, and Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A comprehensive guide to building enterprise-grade cloud infrastructure with security-first principles&lt;br&gt;
I built a complete multi-cloud DevOps platform that deploys a Python FastAPI application to both AWS EKS and Azure AKS with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure as Code (Terraform) for AWS and Azure&lt;br&gt;
CI/CD Pipelines (GitHub Actions) with automated testing and security scanning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Container Security with Trivy and Checkov&lt;br&gt;
Full Observability with Prometheus, Grafana, and Loki&lt;br&gt;
Cost Optimization achieving 96% cost reduction ($141/month → $5/month)&lt;br&gt;
Production-ready Kubernetes deployments with Helm&lt;br&gt;
Project Repository: &lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops" rel="noopener noreferrer"&gt;github.com/abidaslam892/multi-cloud-devsecops&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Challenge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Architecture Overview&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tech Stack&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implementation Journey&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure as Code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD Pipeline&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security Implementation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring &amp;amp; Observability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost Optimization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Results &amp;amp; Metrics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lessons Learned&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What’s Next&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a DevOps engineer, I wanted to build a project that demonstrates real-world enterprise practices. The goal wasn’t just to deploy an application to the cloud, but to create a production-grade platform that showcases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multi-cloud expertise (AWS + Azure)&lt;/li&gt;
&lt;li&gt;Infrastructure automation&lt;/li&gt;
&lt;li&gt;Security-first approach&lt;/li&gt;
&lt;li&gt;Cost-conscious architecture&lt;/li&gt;
&lt;li&gt;Observability and monitoring&lt;/li&gt;
&lt;li&gt;GitOps principles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most tutorials show you how to deploy to ONE cloud. But what about multi-cloud? What about security scanning? What about cost optimization? This project answers all those questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;


&lt;p&gt;Infrastructure Components&lt;br&gt;
AWS Environment&lt;br&gt;
EKS Cluster (Kubernetes 1.28)&lt;/p&gt;


&lt;ol&gt;
&lt;li&gt;2x t3.medium SPOT instances (cost-optimized nodes)&lt;/li&gt;
&lt;li&gt;VPC with public/private subnets across 3 AZs&lt;/li&gt;
&lt;li&gt;NAT Gateway for private subnet internet access&lt;/li&gt;
&lt;li&gt;ECR for container registry&lt;/li&gt;
&lt;li&gt;Application Load Balancer for ingress&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Azure Environment&lt;br&gt;
AKS Cluster (Kubernetes 1.31)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;1x Standard_D2s_v3 VM (auto-scaling enabled)&lt;/li&gt;
&lt;li&gt;VNet with subnet configuration&lt;/li&gt;
&lt;li&gt;ACR for container registry&lt;/li&gt;
&lt;li&gt;Azure Load Balancer for service exposure&lt;/li&gt;
&lt;li&gt;Network Security Groups for traffic control&lt;/li&gt;
&lt;/ol&gt;
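&lt;p&gt;The VPC layout above (public/private subnets spread across 3 AZs) can be sketched numerically. The CIDR ranges and region names below are illustrative assumptions, not taken from the actual Terraform code:&lt;/p&gt;

```python
# Carve one /20 subnet per availability zone out of an assumed 10.0.0.0/16 VPC.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=20))[:3]  # one per AZ
for az, net in zip(["a", "b", "c"], subnets):
    print(f"us-east-1{az}: {net}")
```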
&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;Core Technologies&lt;/p&gt;
&lt;h2&gt;
  
  
  Why These Choices?
&lt;/h2&gt;

&lt;p&gt;FastAPI: Modern, fast, async-capable Python framework with automatic API documentation.&lt;br&gt;
Terraform: Cloud-agnostic IaC tool allowing consistent infrastructure patterns across AWS and Azure.&lt;br&gt;
Helm: Templating and versioning for Kubernetes deployments, enabling environment-specific configurations.&lt;br&gt;
GitHub Actions: Native to GitHub, no additional CI/CD tools needed, excellent integration with cloud providers.&lt;br&gt;
Spot Instances: 70% cost savings on AWS compute while maintaining high availability across multiple AZs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Local Development&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Started with a simple FastAPI application:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="multi-cloud-devsecops-sample")

class Item(BaseModel):
    id: int
    name: str

@app.get("/", tags=["root"])
async def read_root():
    return {"status": "ok", "message": "Hello from Multi-Cloud DevSecOps sample"}

@app.get("/health", tags=["health"])
async def health_check():
    return {"status": "healthy"}

@app.get("/metrics", tags=["metrics"])
async def metrics():
    return {"requests_total": 0, "errors_total": 0}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key Features Implemented&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Health check endpoint for Kubernetes probes&lt;br&gt;
Metrics endpoint for Prometheus&lt;br&gt;
RESTful CRUD operations&lt;br&gt;
Input validation with Pydantic&lt;br&gt;
Comprehensive unit tests with pytest&lt;/p&gt;
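&lt;p&gt;Stripped of the framework, the three endpoints above reduce to a tiny routing table. A framework-free sketch of the same contract (the 404 body shape is an assumption, not taken from the repo):&lt;/p&gt;

```python
import json

# Map each route to the JSON body the FastAPI app above returns.
ROUTES = {
    "/": {"status": "ok", "message": "Hello from Multi-Cloud DevSecOps sample"},
    "/health": {"status": "healthy"},
    "/metrics": {"requests_total": 0, "errors_total": 0},
}

def handle(path: str):
    """Return (status_code, json_body) for a GET request."""
    if path not in ROUTES:
        return 404, json.dumps({"detail": "Not Found"})
    return 200, json.dumps(ROUTES[path])

print(handle("/health"))  # (200, '{"status": "healthy"}')
```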

&lt;p&gt;&lt;strong&gt;Phase 2: Containerization&lt;/strong&gt;&lt;br&gt;
Created a multi-stage Dockerfile for optimized builds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Builder stage&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; — no-cache-dir — user &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Runtime stage&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Security: Non-root user&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;groupadd &lt;span class="nt"&gt;-r&lt;/span&gt; appuser &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; useradd &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; appuser appuser

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;

&lt;span class="c"&gt;# Copy dependencies from builder&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; — from=builder — chown=appuser:appuser /root/.local /home/appuser/.local&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; — chown=appuser:appuser ./src ./src&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH=/home/appuser/.local/bin:$PATH&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; [“uvicorn”, “src.main:app”, “ — host”, “0.0.0.0”, “ — port”, “8080”]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Security Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multi-stage build reduces image size by 60%&lt;/li&gt;
&lt;li&gt;Non-root user (UID 1000)&lt;/li&gt;
&lt;li&gt;Minimal base image (python:3.11-slim)&lt;/li&gt;
&lt;li&gt;No unnecessary packages&lt;/li&gt;
&lt;li&gt;Specific version pinning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result: image size reduced from 1.2GB to ~200MB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Infrastructure as Code&lt;/strong&gt;&lt;br&gt;
Built complete Terraform modules for both clouds.&lt;/p&gt;

&lt;p&gt;AWS Infrastructure (&lt;code&gt;terraform/aws/main.tf&lt;/code&gt;); for the full scripts and configuration, see the GitHub repository.&lt;/p&gt;

&lt;p&gt;Remote state management (S3 for AWS, Blob Storage for Azure)&lt;br&gt;
Modular design for reusability&lt;br&gt;
Environment-specific variables&lt;br&gt;
Consistent tagging strategy&lt;br&gt;
Security groups/NSGs with least privilege&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: CI/CD Pipeline&lt;/strong&gt;&lt;br&gt;
Built three GitHub Actions workflows:&lt;/p&gt;

&lt;p&gt;CI Pipeline (&lt;code&gt;.github/workflows/ci.yaml&lt;/code&gt;)&lt;br&gt;
CD Pipeline, AWS (&lt;code&gt;.github/workflows/cd-aws.yaml&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline Features&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automated testing on every commit&lt;br&gt;
Security scanning before deployment&lt;br&gt;
Separate workflows for AWS and Azure&lt;br&gt;
Manual deployment approval capability&lt;br&gt;
Rollback support via Helm&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 5: Kubernetes Deployment&lt;/strong&gt;&lt;br&gt;
Created Helm charts for flexible deployments:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm/chart/
├── Chart.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── servicemonitor.yaml
│   └── ingress.yaml   (optional)
└── values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Phase 6: Monitoring &amp;amp; Observability&lt;/strong&gt;&lt;br&gt;
Deployed the full observability stack using Helm (Prometheus/Grafana installation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Add the Prometheus community chart repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus-values.yaml \
  --namespace monitoring --create-namespace
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
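&lt;p&gt;A post-deploy smoke test fits naturally at the end of a CD pipeline like this one. A minimal framework-free sketch; the injectable &lt;code&gt;fetch&lt;/code&gt; callable and this pipeline step are illustrative assumptions, not part of the repo:&lt;/p&gt;

```python
import json

def smoke_test(fetch) -> bool:
    """fetch() returns (status_code, body_bytes) for a GET on /health."""
    status, body = fetch()
    return status == 200 and json.loads(body).get("status") == "healthy"

# Simulated healthy response, i.e. what the deployed service would return:
print(smoke_test(lambda: (200, b'{"status": "healthy"}')))  # True
```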



&lt;p&gt;&lt;strong&gt;Grafana Dashboard — Custom dashboard tracking:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request rate and latency&lt;/li&gt;
&lt;li&gt;Error rates (4xx, 5xx)&lt;/li&gt;
&lt;li&gt;Pod CPU and memory usage&lt;/li&gt;
&lt;li&gt;Kubernetes health metrics&lt;/li&gt;
&lt;li&gt;Container restart count&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Security Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-Layer Security Approach:&lt;br&gt;
Container Security&lt;br&gt;
Infrastructure Security&lt;br&gt;
Pod Security Context&lt;br&gt;
Network Security&lt;/p&gt;

&lt;p&gt;AWS Security Groups with minimal ingress rules&lt;br&gt;
Azure Network Security Groups&lt;br&gt;
Private subnets for worker nodes&lt;br&gt;
NAT Gateway for controlled egress&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets Management&lt;/strong&gt;&lt;br&gt;
GitHub Secrets for credentials&lt;br&gt;
Kubernetes Service Accounts with RBAC&lt;br&gt;
ACR/ECR authentication via managed identities&lt;br&gt;
No hardcoded secrets in code&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics Collection&lt;/strong&gt;: Prometheus targets&lt;br&gt;
Kubernetes API server&lt;br&gt;
Kubelet metrics&lt;br&gt;
Node Exporter (system metrics)&lt;br&gt;
kube-state-metrics (K8s object states)&lt;br&gt;
Application &lt;code&gt;/metrics&lt;/code&gt; endpoint&lt;/p&gt;
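&lt;p&gt;The "no hardcoded secrets" rule above usually means reading credentials from the environment at runtime. A minimal sketch; the variable names &lt;code&gt;REGISTRY_USERNAME&lt;/code&gt;/&lt;code&gt;REGISTRY_TOKEN&lt;/code&gt; are illustrative, not from the repo:&lt;/p&gt;

```python
import os

def get_registry_credentials():
    # Credentials are injected into the environment (e.g. from GitHub Secrets),
    # never committed to source control.
    user = os.environ.get("REGISTRY_USERNAME")
    token = os.environ.get("REGISTRY_TOKEN")
    if not user or not token:
        raise RuntimeError("registry credentials not configured")
    return user, token
```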

&lt;p&gt;&lt;strong&gt;Grafana Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application Dashboard&lt;br&gt;
Request rate (requests/sec)&lt;br&gt;
Average latency (ms)&lt;br&gt;
Error rate percentage&lt;br&gt;
Top endpoints by traffic&lt;br&gt;
Response time distribution (P50, P95, P99)&lt;/li&gt;
&lt;li&gt;Infrastructure Dashboard&lt;br&gt;
Cluster resource utilization&lt;br&gt;
Node CPU/Memory/Disk usage&lt;br&gt;
Pod distribution across nodes&lt;br&gt;
Network I/O&lt;br&gt;
Persistent volume usage&lt;/li&gt;
&lt;li&gt;Kubernetes Dashboard&lt;br&gt;
Pod status overview&lt;br&gt;
Deployment health&lt;br&gt;
Container restart trends&lt;br&gt;
Resource quota usage&lt;br&gt;
Namespace metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Monitoring Access:&lt;br&gt;
Azure Grafana: xxxxxx&lt;br&gt;
Credentials: xxxx&lt;br&gt;
Retention: 7 days of metrics&lt;/p&gt;
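&lt;p&gt;The P50/P95/P99 panels above boil down to a percentile computation. A small nearest-rank sketch over illustrative latency samples (the numbers are made up for the example):&lt;/p&gt;

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

latencies_ms = [40, 42, 45, 47, 50, 52, 60, 75, 89, 120]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # 50 120
```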


&lt;p&gt;&lt;strong&gt;Cost Optimization&lt;/strong&gt;&lt;br&gt;
The Cost Challenge&lt;/p&gt;

&lt;p&gt;Initial deployment costs were running at $253/month:&lt;br&gt;
AWS: $136.45/month&lt;br&gt;
Azure: $97/month&lt;br&gt;
S3/Blob state: $0.04/month&lt;br&gt;
This was too high for a learning project. Here’s how I optimized:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Reduction Strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Spot Instances (AWS)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;eks_managed_node_groups = {
  main = {
    capacity_type  = "SPOT"        # 70% savings vs On-Demand
    instance_types = ["t3.medium"]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Savings: $21/month (from $51 to $30)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Single NAT Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;enable_nat_gateway = true
single_nat_gateway = true  # Instead of one per AZ
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Savings: $64/month (from $96 to $32)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Right-Sized VMs&lt;/strong&gt;&lt;br&gt;
AWS: t3.medium (2 vCPU, 4GB RAM), adequate for dev&lt;br&gt;
Azure: Standard_D2s_v3 (2 vCPU, 8GB RAM)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Auto-Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;autoscaling:
  minReplicas: 1  # Scale down to 1 during low traffic
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
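&lt;p&gt;Under the hood, the HPA picks a replica count with desired = ceil(current × currentUtilization / target), clamped to the min/max bounds above. A small sketch of that rule:&lt;/p&gt;

```python
import math

def hpa_desired_replicas(current_replicas, current_cpu_pct,
                         target_cpu_pct=80, min_replicas=1, max_replicas=4):
    # desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
    # clamped to the configured bounds.
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

print(hpa_desired_replicas(2, 120))  # load above target -> 3 replicas
```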

&lt;p&gt;&lt;strong&gt;5. Destroy When Not in Use&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Stop everything at the end of the day
./scripts/destroy-aws-infrastructure.sh
./scripts/destroy-azure-infrastructure.sh

# Recreate the next morning (~30 minutes)
./scripts/deploy-aws-infrastructure.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Final Cost Breakdown&lt;/strong&gt;&lt;br&gt;
Current state (infrastructure destroyed, state only):&lt;/p&gt;

&lt;p&gt;AWS: $0.02/month (S3 state storage)&lt;br&gt;
Azure: $5.02/month (ACR Basic + Blob state)&lt;br&gt;
Total: $5.04/month (96% reduction!)&lt;/p&gt;

&lt;p&gt;Active development (when needed):&lt;br&gt;
AWS (8 hours/day): ~$1.50/day = $45/month&lt;br&gt;
Azure (24/7 minimal): $5.02/month&lt;br&gt;
Total: ~$50/month for active development&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Monthly Cost&lt;/th&gt;&lt;th&gt;Best For&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;24/7 Production&lt;/td&gt;&lt;td&gt;$253&lt;/td&gt;&lt;td&gt;Always-on production&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8hr/day Dev&lt;/td&gt;&lt;td&gt;$50&lt;/td&gt;&lt;td&gt;Active development&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Weekly Demos&lt;/td&gt;&lt;td&gt;$5–10&lt;/td&gt;&lt;td&gt;Portfolio/interviews&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Destroyed (Current)&lt;/td&gt;&lt;td&gt;$5&lt;/td&gt;&lt;td&gt;Learning/Idle&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
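&lt;p&gt;A quick sanity check on the headline "96% reduction" figure, using the $141 → $5.04 monthly numbers quoted in the article (arithmetic only):&lt;/p&gt;

```python
baseline = 141.0  # $/month before optimization
current = 5.04    # $/month with infrastructure destroyed (state only)
reduction_pct = (baseline - current) / baseline * 100
print(f"{reduction_pct:.0f}% reduction")  # 96% reduction
```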

&lt;p&gt;&lt;strong&gt;ROI on Cost Optimization&lt;/strong&gt;&lt;br&gt;
Annual cost: $2,976/year (24/7) vs $60/year (destroyed)&lt;br&gt;
Time to recreate: 30 minutes&lt;br&gt;
Infrastructure is code: can rebuild anytime&lt;br&gt;
Key lesson: don’t pay for idle infrastructure!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results &amp;amp; Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deployment Success Metrics&lt;br&gt;
Infrastructure provisioning:&lt;br&gt;
AWS EKS: 28 minutes (fully automated)&lt;br&gt;
Azure AKS: 22 minutes (fully automated)&lt;br&gt;
Success rate: 100% (reproducible builds)&lt;/p&gt;

&lt;p&gt;Application deployment:&lt;br&gt;
Build time: 3–5 minutes (multi-stage Docker build)&lt;br&gt;
Push to registry: 1 minute&lt;br&gt;
Helm deployment: 2 minutes&lt;br&gt;
Total CI/CD duration: 8–10 minutes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Performance&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;AWS EKS&lt;/th&gt;&lt;th&gt;Azure AKS&lt;/th&gt;&lt;th&gt;Target&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Availability&lt;/td&gt;&lt;td&gt;99.9%&lt;/td&gt;&lt;td&gt;99.9%&lt;/td&gt;&lt;td&gt;99.5%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avg Response Time&lt;/td&gt;&lt;td&gt;45ms&lt;/td&gt;&lt;td&gt;52ms&lt;/td&gt;&lt;td&gt;&amp;lt;100ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;P95 Latency&lt;/td&gt;&lt;td&gt;89ms&lt;/td&gt;&lt;td&gt;95ms&lt;/td&gt;&lt;td&gt;&amp;lt;200ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Throughput&lt;/td&gt;&lt;td&gt;1000 req/s&lt;/td&gt;&lt;td&gt;950 req/s&lt;/td&gt;&lt;td&gt;500 req/s&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Error Rate&lt;/td&gt;&lt;td&gt;0.01%&lt;/td&gt;&lt;td&gt;0.01%&lt;/td&gt;&lt;td&gt;&amp;lt;1%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Resource Utilization:&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Resource&lt;/th&gt;&lt;th&gt;Requested&lt;/th&gt;&lt;th&gt;Used (Avg)&lt;/th&gt;&lt;th&gt;Efficiency&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CPU&lt;/td&gt;&lt;td&gt;250m&lt;/td&gt;&lt;td&gt;45m&lt;/td&gt;&lt;td&gt;18%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Memory&lt;/td&gt;&lt;td&gt;256Mi&lt;/td&gt;&lt;td&gt;128Mi&lt;/td&gt;&lt;td&gt;50%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note: Low utilization is expected for this demo app. Production apps would scale based on actual load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;0 critical vulnerabilities in production images&lt;br&gt;
0 high-severity IaC issues&lt;br&gt;
100% secret coverage (no hardcoded credentials)&lt;br&gt;
Pod Security standards enforced&lt;br&gt;
Network Policies implemented&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Coverage&lt;/strong&gt;&lt;br&gt;
Total tests: 12&lt;br&gt;
Passed: 12&lt;br&gt;
Failed: 0&lt;br&gt;
Coverage: 85%&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Metrics&lt;/strong&gt;&lt;br&gt;
Build success rate: 98% (2 failures due to flaky tests)&lt;br&gt;
Average build time: 8 minutes&lt;br&gt;
Deployment frequency: on-demand (GitOps ready)&lt;br&gt;
Lead time: &amp;lt;15 minutes (code to production)&lt;br&gt;
MTTR: &amp;lt;30 minutes (rollback capability)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Worked Well&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Infrastructure as Code
Terraform modules made multi-environment deployments trivial
Remote state management prevented conflicts
Destroy/recreate workflow enabled cost savings&lt;/li&gt;
&lt;li&gt;Helm for Kubernetes
Environment-specific values files simplified configuration
Version control for deployments
Easy rollback capabilities&lt;/li&gt;
&lt;li&gt;Multi-Stage Docker Builds
60% reduction in image size
Faster deployments
Better security (minimal attack surface)&lt;/li&gt;
&lt;li&gt;GitHub Actions
Native integration with GitHub
No additional CI/CD infrastructure needed
Secrets management built-in&lt;/li&gt;
&lt;li&gt;Spot Instances&lt;br&gt;
70% cost savings on AWS compute&lt;br&gt;
No noticeable impact on availability (for dev/test)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Challenges Faced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform state lock&lt;br&gt;
Lesson: always clean up failed applies and use a DynamoDB lock table.&lt;/p&gt;

&lt;p&gt;EKS node group deletion:&lt;br&gt;
&lt;code&gt;aws eks delete-nodegroup --cluster-name &amp;lt;cluster-name&amp;gt; --nodegroup-name &amp;lt;nodegroup-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Lesson: understand resource dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ACR Naming Restrictions&lt;/strong&gt;&lt;br&gt;
Azure Container Registry names must be lowercase alphanumeric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I’d Do Differently&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Start with Local Kubernetes&lt;br&gt;
Use kind/minikube for initial development&lt;br&gt;
Only move to cloud for integration testing&lt;br&gt;
Would have saved 2 weeks of cloud costs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement GitOps Sooner&lt;br&gt;
ArgoCD or Flux for declarative deployments&lt;br&gt;
Better visibility into deployment state&lt;br&gt;
Automatic sync from Git&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add Service Mesh Earlier&lt;br&gt;
Better traffic management&lt;br&gt;
Enhanced observability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More Comprehensive Monitoring&lt;br&gt;
Log aggregation with Loki from day 1&lt;br&gt;
Distributed tracing with Jaeger&lt;br&gt;
Custom application metrics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated Cost Tracking&lt;br&gt;
Daily cost reports via AWS Cost Explorer API&lt;br&gt;
Budget alerts in Slack&lt;br&gt;
Dashboard showing spend by service&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways for DevOps Engineers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure as Code is Essential&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Version control your infrastructure&lt;br&gt;
Make it reproducible&lt;br&gt;
Destroy and recreate confidently&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security is Not Optional&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scan early and often&lt;br&gt;
Implement least privilege&lt;br&gt;
No secrets in code, ever&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Awareness Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitor spending from day 1&lt;br&gt;
Use spot instances for non-critical workloads&lt;br&gt;
Destroy what you don’t use&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability from the Start&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logs, metrics, and traces&lt;br&gt;
You can’t improve what you can’t measure&lt;br&gt;
Dashboards tell stories&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation Saves Time&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;30 minutes to recreate infrastructure&lt;/li&gt;
&lt;li&gt;Consistent, repeatable deployments&lt;/li&gt;
&lt;li&gt;Focus on building, not clicking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For Job Seekers&lt;/strong&gt;&lt;br&gt;
This project demonstrates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Real-world DevOps practices&lt;/li&gt;
&lt;li&gt;Multi-cloud expertise&lt;/li&gt;
&lt;li&gt;Security-first mindset&lt;/li&gt;
&lt;li&gt;Cost optimization skills&lt;/li&gt;
&lt;li&gt;Problem-solving ability&lt;/li&gt;
&lt;li&gt;Documentation skills&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Portfolio Value: Shows you can build production-grade infrastructure, not just follow tutorials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources &amp;amp; Documentation&lt;/strong&gt;&lt;br&gt;
Project Repository&lt;br&gt;
🔗 &lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops" rel="noopener noreferrer"&gt;github.com/abidaslam892/multi-cloud-devsecops&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Documentation Files&lt;br&gt;
&lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops/blob/main/SETUP.md" rel="noopener noreferrer"&gt;Setup Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops/blob/main/DEPLOY.md" rel="noopener noreferrer"&gt;Deployment Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops/blob/main/ACCESS-GUIDE.md" rel="noopener noreferrer"&gt;Access Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops/blob/main/COST-OPTIMIZATION.md" rel="noopener noreferrer"&gt;Cost Optimization&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops/blob/main/docs/monitoring-setup.md" rel="noopener noreferrer"&gt;Monitoring Setup&lt;/a&gt;&lt;br&gt;
Technologies Used&lt;br&gt;
&lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs" rel="noopener noreferrer"&gt;Terraform AWS Provider&lt;/a&gt;&lt;br&gt;
&lt;a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs" rel="noopener noreferrer"&gt;Terraform Azure Provider&lt;/a&gt;&lt;br&gt;
&lt;a href="https://helm.sh/docs/" rel="noopener noreferrer"&gt;Helm Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/" rel="noopener noreferrer"&gt;Kubernetes Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://prometheus.io/docs/" rel="noopener noreferrer"&gt;Prometheus Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://grafana.com/docs/" rel="noopener noreferrer"&gt;Grafana Documentation&lt;/a&gt;&lt;br&gt;
Tools &amp;amp; Security&lt;br&gt;
&lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;Trivy Scanner&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.checkov.io/" rel="noopener noreferrer"&gt;Checkov IaC Scanner&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;&lt;br&gt;
Connect With Me&lt;br&gt;
I’d love to hear your feedback, questions, or suggestions!&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/abidaslam892" rel="noopener noreferrer"&gt;@abidaslam892&lt;/a&gt;&lt;br&gt;
Repository: &lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops" rel="noopener noreferrer"&gt;multi-cloud-devsecops&lt;/a&gt;&lt;br&gt;
Email: &lt;a href="mailto:abidaslam.123@gmail.com"&gt;abidaslam.123@gmail.com&lt;/a&gt;&lt;br&gt;
LinkedIn: linkedin.com/in/abid-aslam-75520330&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence &amp;amp; Screenshots&lt;/strong&gt;&lt;br&gt;
See the &lt;code&gt;blog-materials/evidence&lt;/code&gt; folder in the repository for:&lt;/p&gt;

&lt;p&gt;AWS Console screenshots (EKS, ECR, VPC)&lt;br&gt;
Azure Portal screenshots (AKS, ACR)&lt;br&gt;
Grafana dashboards&lt;br&gt;
CI/CD pipeline runs&lt;br&gt;
Cost reports&lt;br&gt;
Security scan results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;br&gt;
The open-source community for amazing tools&lt;br&gt;
Terraform AWS/Azure module maintainers&lt;br&gt;
The GitHub Actions team&lt;br&gt;
Everyone who contributed to the technologies used&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
Building this project taught me that &lt;strong&gt;DevOps is not about tools; it’s about culture and practices&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Automate everything you can&lt;br&gt;
Treat infrastructure as code&lt;br&gt;
Security is everyone’s responsibility&lt;br&gt;
Monitor, measure, improve&lt;br&gt;
Share knowledge (hence this blog!)&lt;br&gt;
If you’re learning DevOps, I encourage you to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Build something real (not just tutorials)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make mistakes and learn from them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document your journey&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Share with the community&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember: The best way to learn is by doing. Start small, iterate, and keep building!&lt;/p&gt;

&lt;p&gt;If this article helped you, please give it a ⭐ star on &lt;a href="https://github.com/abidaslam892/multi-cloud-devsecops" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and share it with others!&lt;/p&gt;

&lt;p&gt;#DevOps #AWS #Azure #Kubernetes #Terraform #CI/CD #CloudNative #Security #DevSecOps #MultiCloud #Docker #Helm #Prometheus #Grafana #Python #FastAPI #Infrastructure #Automation&lt;/p&gt;


</description>
      <category>webdev</category>
      <category>linux</category>
      <category>github</category>
    </item>
    <item>
      <title>Production Monitoring Made Easy: Prometheus, Grafana, and Docker Explained</title>
      <dc:creator>abidaslam892</dc:creator>
      <pubDate>Sat, 15 Nov 2025 15:42:56 +0000</pubDate>
      <link>https://forem.com/abidaslam892/production-monitoring-made-easy-prometheus-grafana-and-docker-explained-mj4</link>
      <guid>https://forem.com/abidaslam892/production-monitoring-made-easy-prometheus-grafana-and-docker-explained-mj4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Original Post:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/design-bootcamp/production-monitoring-made-easy-prometheus-grafana-and-docker-explained-f373607102ed" rel="noopener noreferrer"&gt;https://medium.com/design-bootcamp/production-monitoring-made-easy-prometheus-grafana-and-docker-explained-f373607102ed&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Zero to Observability: Building a Production-Grade Monitoring Stack with Prometheus &amp;amp; Grafana&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In today’s cloud-native world, monitoring isn’t optional — it’s essential. Whether you’re running a small side project or managing enterprise infrastructure, you need visibility into your systems. But setting up monitoring shouldn’t require weeks of configuration and a PhD in DevOps.&lt;/p&gt;

&lt;p&gt;In this comprehensive guide, I’ll walk you through building a production-ready monitoring stack using three powerful open-source tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Docker for containerization&lt;/li&gt;
&lt;li&gt;Prometheus for metrics collection&lt;/li&gt;
&lt;li&gt;Grafana for visualization&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  By the end of this tutorial, you’ll have:
&lt;/h2&gt;

&lt;p&gt;A fully functional monitoring stack running in containers&lt;br&gt;
Real-time system metrics from your infrastructure&lt;br&gt;
Beautiful, interactive dashboards&lt;br&gt;
Knowledge to extend and customize for your needs&lt;/p&gt;

&lt;p&gt;Time to complete: 30 minutes&lt;br&gt;
Skill level: Beginner to Intermediate&lt;br&gt;
Prerequisites: Basic command-line knowledge, Docker installed&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Stack?&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring setups are often:&lt;/p&gt;

&lt;p&gt;Complex: multiple services, complicated configurations&lt;br&gt;
Expensive: enterprise solutions cost thousands per month&lt;br&gt;
Inflexible: vendor lock-in limits customization&lt;br&gt;
Hard to scale: difficult to add new metrics or exporters&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;&lt;br&gt;
Our stack solves these problems:&lt;/p&gt;

&lt;p&gt;Simple: deploy everything with one command&lt;br&gt;
Free &amp;amp; Open Source: no licensing costs&lt;br&gt;
Highly customizable: full control over metrics and dashboards&lt;br&gt;
Scalable: easy to add exporters and federate Prometheus&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Here's what we're building:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;br&gt;
Prometheus: collects and stores time-series metrics&lt;br&gt;
Grafana: creates beautiful dashboards and visualizations&lt;br&gt;
Node Exporter: exposes system-level metrics (CPU, RAM, disk)&lt;br&gt;
Application Exporter: custom metrics from your applications&lt;/p&gt;
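&lt;p&gt;Node Exporter and the other components publish metrics in Prometheus' plain-text exposition format. A tiny illustrative parser for pulling one sample out of such a payload (the metric name is just an example):&lt;/p&gt;

```python
def parse_metric(payload: str, name: str) -> float:
    """Return the value of the first sample whose metric name matches."""
    for line in payload.splitlines():
        if line.startswith(name + " "):
            return float(line.split()[1])
    raise KeyError(name)

sample = "# TYPE node_load1 gauge\nnode_load1 0.42\n"
print(parse_metric(sample, "node_load1"))  # 0.42
```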
&lt;h2&gt;
  
  
  Part 1: Setting Up the Foundation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Prepare Your Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, ensure you have Docker and Docker Compose installed:&lt;/p&gt;

&lt;p&gt;Check Docker version&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Docker version 20.10.0 or higher required&lt;/p&gt;

&lt;p&gt;Check Docker Compose version&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Docker Compose version 2.20.0 or higher recommended&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Create Project Structure
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create project directory
mkdir monitoring-stack &amp;amp;&amp;amp; cd monitoring-stack

# Create necessary directories
mkdir -p prometheus grafana src
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 3: Configure Prometheus
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;prometheus/prometheus.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter — system metrics
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
    scrape_interval: 15s

  # Custom application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['host.docker.internal:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scrape_interval&lt;/code&gt;: How often Prometheus collects metrics&lt;br&gt;
&lt;code&gt;job_name&lt;/code&gt;: Logical grouping for targets&lt;br&gt;
&lt;code&gt;targets&lt;/code&gt;: Where to find metrics endpoints&lt;br&gt;
&lt;code&gt;host.docker.internal&lt;/code&gt;: Allows containers to reach the host machine&lt;/p&gt;

&lt;h2&gt;Part 2: Docker Compose Configuration&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;docker-compose.yml&lt;/code&gt; in your project root:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;Key Configuration Details&lt;/h2&gt;

&lt;p&gt;1. Ports: Prometheus on 9091, Grafana on 3000&lt;br&gt;
2. Volumes: Persist data even if containers restart&lt;br&gt;
3. Networks: Isolated bridge network for service communication&lt;br&gt;
4. Retention: Keep metrics for 30 days&lt;br&gt;
5. Restart Policy: Automatically restart on failure&lt;/p&gt;

&lt;h2&gt;Part 3: Installing Node Exporter&lt;/h2&gt;

&lt;p&gt;Node Exporter provides system-level metrics. Install it on your host machine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download Node Exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Create a systemd service at &lt;code&gt;/etc/systemd/system/node_exporter.service&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Start the service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify it's running
curl http://localhost:9100/metrics | head -20
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You should see metrics output like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 890.12
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
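&lt;p&gt;If you ever need to script against a &lt;code&gt;/metrics&lt;/code&gt; endpoint, the text exposition format is easy to parse. Here is a minimal sketch using only the standard library, with a sample line taken from the output above (for real work, prefer the parser that ships with &lt;code&gt;prometheus_client&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def parse_sample(line):
    # Split "name{labels} value" into its three parts
    name_part, value = line.rsplit(' ', 1)
    labels = {}
    if '{' in name_part:
        name, label_blob = name_part.split('{', 1)
        for pair in label_blob.rstrip('}').split(','):
            key, val = pair.split('=', 1)
            labels[key] = val.strip('"')
    else:
        name = name_part
    return name, labels, float(value)

name, labels, value = parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67')
print(name, labels['mode'], value)  # node_cpu_seconds_total idle 12345.67
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;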

&lt;h2&gt;Part 4: Creating a Custom Metrics Exporter&lt;/h2&gt;

&lt;p&gt;Let's create a simple Python application that exposes custom metrics.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;src/metrics_exporter.py&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
"""
Simple Prometheus Metrics Exporter
Demonstrates how to instrument your applications
"""
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import psutil
import random
import time

# Define application metrics
request_count = Counter(
    'app_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

active_users = Gauge(
    'app_active_users',
    'Number of active users'
)

response_time = Histogram(
    'app_response_time_seconds',
    'Response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

# System metrics
cpu_gauge = Gauge('system_cpu_percent', 'CPU usage percentage')
memory_gauge = Gauge('system_memory_percent', 'Memory usage percentage')
disk_gauge = Gauge('system_disk_percent', 'Disk usage percentage')


def collect_system_metrics():
    # Collect system metrics using psutil
    cpu_gauge.set(psutil.cpu_percent(interval=1))
    memory_gauge.set(psutil.virtual_memory().percent)
    disk_gauge.set(psutil.disk_usage('/').percent)


def simulate_application_activity():
    # Simulate application metrics for demo purposes
    methods = ['GET', 'POST', 'PUT', 'DELETE']
    endpoints = ['/api/users', '/api/orders', '/api/products']
    statuses = [200, 201, 400, 404, 500]

    # Simulate a request
    method = random.choice(methods)
    endpoint = random.choice(endpoints)
    status = random.choices(statuses, weights=[85, 10, 3, 1, 1])[0]
    request_count.labels(method=method, endpoint=endpoint, status=status).inc()

    # Simulate response time
    response_time.observe(random.uniform(0.05, 2.0))

    # Update active users
    active_users.set(random.randint(10, 100))


def main():
    """Main exporter loop"""
    # Start metrics server on port 8000
    PORT = 8000
    start_http_server(PORT)
    print(f"Metrics server started on port {PORT}")
    print(f"Metrics available at http://localhost:{PORT}/metrics")

    # Collect real and simulated metrics forever
    while True:
        collect_system_metrics()
        simulate_application_activity()
        time.sleep(5)


if __name__ == '__main__':
    main()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Create &lt;code&gt;requirements.txt&lt;/code&gt; listing the two dependencies the exporter imports: &lt;code&gt;prometheus-client&lt;/code&gt; and &lt;code&gt;psutil&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;start_exporter.sh&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# Check if Python is installed
if ! command -v python3 &amp;amp;&amp;gt; /dev/null; then
  echo "Python 3 is not installed"
  exit 1
fi

# Install dependencies
pip3 install -r requirements.txt

# Start the exporter
python3 src/metrics_exporter.py
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;Part 5: Launching the Stack&lt;/h2&gt;

&lt;p&gt;Now we're ready to start everything:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start Prometheus and Grafana
docker-compose up -d

# Check if containers are running
docker-compose ps
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You should see both containers up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME         IMAGE                    STATUS
grafana      grafana/grafana:latest   Up
prometheus   prom/prometheus:latest   Up
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;Access your services&lt;/h2&gt;

&lt;p&gt;Prometheus: http://localhost:9091&lt;br&gt;
Grafana: http://localhost:3000 (admin/admin)&lt;br&gt;
Node Exporter metrics: http://localhost:9100/metrics&lt;br&gt;
Application metrics: http://localhost:8000/metrics&lt;/p&gt;

&lt;h2&gt;Part 6: Configuring Grafana&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Add Prometheus as a Data Source&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open Grafana at http://localhost:3000&lt;br&gt;
Log in with &lt;code&gt;admin&lt;/code&gt; / &lt;code&gt;admin&lt;/code&gt; (change the password when prompted)&lt;br&gt;
Go to Configuration → Data Sources&lt;br&gt;
Click Add data source&lt;br&gt;
Select Prometheus&lt;br&gt;
Set URL: &lt;code&gt;http://prometheus:9090&lt;/code&gt;&lt;br&gt;
Click Save &amp;amp; Test&lt;/p&gt;

&lt;p&gt;You should see: "Data source is working"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Import a Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to Dashboards → Import&lt;br&gt;
Enter dashboard ID 1860 (Node Exporter Full)&lt;br&gt;
Click Load&lt;br&gt;
Select Prometheus as the data source&lt;br&gt;
Click Import&lt;/p&gt;

&lt;p&gt;You now have a dashboard showing:&lt;br&gt;
CPU usage across all cores&lt;br&gt;
Memory utilization&lt;br&gt;
Disk space and I/O&lt;br&gt;
Network traffic&lt;br&gt;
System load&lt;/p&gt;

&lt;h2&gt;Part 7: Creating Custom Dashboards&lt;/h2&gt;

&lt;p&gt;Let's create a custom dashboard for our application metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create a New Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;+&lt;/strong&gt; → &lt;strong&gt;Create Dashboard&lt;/strong&gt;, then click &lt;strong&gt;Add new panel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Add a Request Rate Panel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(app_requests_total[5m])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
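&lt;p&gt;&lt;code&gt;rate()&lt;/code&gt; turns an ever-growing counter into a per-second rate over the window. Conceptually it is the increase divided by the elapsed time; the real implementation also handles counter resets and extrapolation. An illustrative sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def simple_rate(old_sample, new_sample):
    # Each sample is a (timestamp_seconds, counter_value) pair
    (t1, v1), (t2, v2) = old_sample, new_sample
    return (v2 - v1) / (t2 - t1)

# Counter grew from 100 to 160 requests in 30 seconds
print(simple_rate((0, 100), (30, 160)))  # 2.0 requests per second
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;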

&lt;p&gt;Panel Settings:&lt;br&gt;
Title: "HTTP Request Rate"&lt;br&gt;
Visualization: Time series&lt;br&gt;
Legend: &lt;code&gt;{{method}} {{endpoint}}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Add Active Users Panel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app_active_users
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Panel Settings:&lt;br&gt;
Title: "Active Users"&lt;br&gt;
Visualization: Stat&lt;br&gt;
Color: Based on thresholds (green &amp;lt; 50, yellow &amp;lt; 80, red &amp;gt;= 80)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Add Response Time Panel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, rate(app_response_time_seconds_bucket[5m]))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
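&lt;p&gt;&lt;code&gt;histogram_quantile&lt;/code&gt; estimates a quantile from cumulative bucket counts, interpolating linearly inside the bucket where the target rank falls. Here is a simplified, illustrative sketch (not the exact Prometheus implementation) with made-up bucket counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def bucket_quantile(q, buckets):
    # buckets: ascending (upper_bound, cumulative_count) pairs
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 observations; the 95th percentile lands in the 1.0-2.0s bucket
print(bucket_quantile(0.95, [(0.1, 20), (0.5, 60), (1.0, 90), (2.0, 100)]))  # 1.5
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;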

&lt;p&gt;Panel Settings:&lt;br&gt;
Title: "95th Percentile Response Time"&lt;br&gt;
Visualization: Gauge&lt;br&gt;
Unit: seconds&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Add CPU Usage Panel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Panel Settings:&lt;br&gt;
Title: "CPU Usage %"&lt;br&gt;
Visualization: Graph&lt;br&gt;
Thresholds: Yellow at 60%, Red at 80%&lt;/p&gt;

&lt;p&gt;Click Save dashboard and give it a name like "Application Monitoring".&lt;/p&gt;

&lt;h2&gt;Part 8: Understanding PromQL&lt;/h2&gt;

&lt;p&gt;Prometheus Query Language (PromQL) is powerful. Here are essential queries:&lt;/p&gt;

&lt;h2&gt;Basic Queries&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get current value
node_memory_MemTotal_bytes

# Rate of change over 5 minutes
rate(node_cpu_seconds_total[5m])

# Average across all instances
avg(node_load1)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h2&gt;
  
  
  Part 9: Setting Up Alerts
&lt;/h2&gt;

&lt;p&gt;Alerts notify you when things go wrong. Let’s configure some.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;prometheus/alerts.yml&lt;/code&gt;:&lt;/p&gt;

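&lt;p&gt;The rules file appears only as an image in the original post; a typical &lt;code&gt;alerts.yml&lt;/code&gt; for this stack might look like the following (alert names and thresholds are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for 5 minutes"
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;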

&lt;p&gt;Update &lt;code&gt;prometheus/prometheus.yml&lt;/code&gt; to include alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;

&lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;

&lt;span class="s"&gt;Load alert rules&lt;/span&gt;

&lt;span class="na"&gt;rule_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;‘/etc/prometheus/alerts.yml’&lt;/span&gt;

&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="c1"&gt;# … (existing scrape configs)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update &lt;code&gt;docker-compose.yml&lt;/code&gt; to mount the alerts file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="c1"&gt;# … (existing config)&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus/alerts.yml:/etc/prometheus/alerts.yml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;prometheus_data:/prometheus&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Prometheus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose restart prometheus
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Check alerts at http://localhost:9091/alerts&lt;/p&gt;

&lt;h2&gt;Part 10: Production Best Practices&lt;/h2&gt;

&lt;h2&gt;Security&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Change Default Passwords&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Update &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;grafana:
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Create `.env` file:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GRAFANA_PASSWORD=your_secure_password_here
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
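Docker Compose picks up the `.env` file automatically; for illustration, here is a minimal sketch of how its KEY=VALUE format can be parsed (the helper name is ours, and this skips Compose's quoting rules):

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines in docker-compose's .env format.

    Blank lines and lines starting with '#' are ignored; the value is
    everything after the first '='.
    """
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # skip malformed lines with no '='
            env[key.strip()] = value.strip()
    return env

# The secret referenced by GF_SECURITY_ADMIN_PASSWORD above:
sample = "# grafana secrets\nGRAFANA_PASSWORD=your_secure_password_here\n"
print(parse_env_file(sample)["GRAFANA_PASSWORD"])  # your_secure_password_here
```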

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. Use Read-Only Volumes**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;volumes:
  - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**3. Run as Non-Root User**

Resource Limits


Backup Strategy

Backup Prometheus data

docker run — rm \

-v prometheus_data:/data \

-v $(pwd)/backups:/backup \

alpine tar czf /backup/prometheus-$(date +%Y%m%d).tar.gz /data

Backup Grafana data

docker run — rm \

-v grafana_data:/data \

-v $(pwd)/backups:/backup \

alpine tar czf /backup/grafana-$(date +%Y%m%d).tar.gz /data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  High Availability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For production, consider:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prometheus Federation&lt;/strong&gt;: Multiple Prometheus instances&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thanos&lt;/strong&gt;: Long-term storage and global view&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grafana HA&lt;/strong&gt;: Multiple Grafana instances behind a load balancer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Part 11: Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue 1: Container Won’t Start&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Check logs&lt;/span&gt;

docker-compose logs prometheus

docker-compose logs grafana

&lt;span class="c"&gt;# Common causes:&lt;/span&gt;

&lt;span class="c"&gt;# — Port already in use&lt;/span&gt;

&lt;span class="c"&gt;# — Configuration file syntax error&lt;/span&gt;

&lt;span class="c"&gt;# — Insufficient permissions&lt;/span&gt;

&lt;span class="k"&gt;**&lt;/span&gt;Issue 2: Grafana Can’t Connect to Prometheus&lt;span class="k"&gt;**&lt;/span&gt;

&lt;span class="k"&gt;**&lt;/span&gt;Problem: Data &lt;span class="nb"&gt;source test &lt;/span&gt;fails&lt;span class="k"&gt;**&lt;/span&gt;

Solution: Use container name, not localhost:

URL: http://prometheus:9090
URL: http://localhost:9091
Issue 3: No Metrics Showing

&lt;span class="c"&gt;# Check Prometheus targets&lt;/span&gt;

curl http://localhost:9091/api/v1/targets | jq

&lt;span class="c"&gt;# Verify exporters are reachable&lt;/span&gt;

curl http://localhost:9100/metrics

curl http://localhost:8000/metrics

&lt;span class="k"&gt;**&lt;/span&gt;Issue 4: Data Not Persisting&lt;span class="k"&gt;**&lt;/span&gt;

&lt;span class="c"&gt;# Check volume mounts&lt;/span&gt;

docker inspect prometheus | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 Mounts

&lt;span class="c"&gt;# Fix permissions (Prometheus runs as UID 65534)&lt;/span&gt;

&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 65534:65534 prometheus_data/

&lt;span class="c"&gt;## Part 12: Extending Your Stack&lt;/span&gt;

Add MySQL Monitoring

Add Nginx Monitoring

Add Redis Monitoring

&lt;span class="c"&gt;## Part 13: Real-World Use Cases&lt;/span&gt;

&lt;span class="k"&gt;**&lt;/span&gt;Use Case 1: E-commerce Platform&lt;span class="k"&gt;**&lt;/span&gt;

Metrics to track:

1. Order processing rate
2. Payment gateway latency
3. Inventory stock levels
4. User cart abandonment rate

&lt;span class="k"&gt;**&lt;/span&gt;Sample custom metrics:&lt;span class="k"&gt;**&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from prometheus_client import Counter, Gauge, Histogram

orders_total = Counter('orders_total', 'Total orders', ['status'])
payment_duration = Histogram('payment_duration_seconds', 'Payment processing time')
inventory_stock = Gauge('inventory_stock', 'Product stock level', ['product_id'])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Use Case 2: API Service
Metrics to track:

Request rate per endpoint
Response time percentiles
Error rates by status code
Rate limiting hits
PromQL Queries:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requests per second by endpoint
sum by (endpoint) (rate(api_requests_total[1m]))

# 99th percentile latency
histogram_quantile(0.99, rate(api_duration_seconds_bucket[5m]))

# Error rate
sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
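The error-rate expression above is just a ratio of rates. The same arithmetic in plain Python, with 5xx matching done by the same regex the PromQL selector uses (the sample numbers are made up):

```python
import re

def error_rate(rates_by_status):
    """Plain-Python mirror of the PromQL error-rate ratio:
    errors-per-second divided by total requests-per-second."""
    total = sum(rates_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(v for s, v in rates_by_status.items() if re.fullmatch(r"5..", s))
    return errors / total

# req/s by HTTP status over the last 5m (illustrative numbers)
rates = {"200": 95.0, "404": 2.0, "500": 2.0, "503": 1.0}
print(error_rate(rates))  # 0.03
```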

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Use Case 3: Batch Processing Pipeline
Metrics to track:

Job completion time
Records processed per minute
Failed jobs count
Queue depth
Part 14: Performance Optimization
Optimize Prometheus Storage

Optimize Scrape Intervals

Use Recording Rules for Expensive Queries
Press enter or click to view image in full size

Then use the pre-computed metrics:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Instead of this expensive query:
rate(api_requests_total[5m])

# Use this:
job:api_request_rate:5m
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
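A recording rule simply evaluates the expensive expression on Prometheus's schedule and stores the result under the new name. To show what is being precomputed, here is a simplified sketch of the core of `rate()` over raw counter samples (real Prometheus also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second increase across (timestamp, value) counter samples.

    Simplified sketch: assumes the counter never resets and skips
    Prometheus's extrapolation to the window boundaries.
    """
    if len(samples) in (0, 1):
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# api_requests_total sampled every 15s over a 5m window (illustrative)
samples = [(t, 1000 + 10 * t) for t in range(0, 301, 15)]
print(simple_rate(samples))  # the counter grows by 10/s, so this prints 10.0
```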



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


Conclusion
You’ve built a complete monitoring stack from scratch. Here’s what you’ve accomplished:

Deployed a containerized monitoring infrastructure
Configured Prometheus to collect metrics
Created beautiful Grafana dashboards
Instrumented a custom application
Set up alerts for critical issues
Learned PromQL for advanced queries
Applied production best practices
Key Takeaways
Docker makes deployment simple — One command starts everything
Prometheus is powerful — Time-series data with flexible querying
Grafana is beautiful — Create stunning, informative dashboards
Monitoring is essential — Know what’s happening in your systems
Start simple, extend gradually — Add exporters as you need them
Next Steps
Deploy to production — Use Docker Swarm or Kubernetes
Add more exporters — Monitor databases, message queues, etc.
Implement alerting — Connect to Slack, PagerDuty, or email
Long-term storage — Integrate Thanos for infinite retention
Advanced dashboards — Create business-specific metrics
Resources
GitHub Repository: (https://github.com/abidaslam892/Grafana-Prometheus-Monitoring-Deployment-)
Prometheus Docs: https://prometheus.io/docs/
Grafana Dashboards: https://grafana.com/grafana/dashboards/
PromQL Guide: https://prometheus.io/docs/prometheus/latest/querying/basics/
Docker Docs: https://docs.docker.com/
Questions?
Feel free to reach out in the comments below! I’d love to hear:

What are you monitoring?
What challenges did you face?
What metrics matter most to your business?
If this guide helped you, please:
- ⭐ Star the GitHub repository

- 👏 Clap for this article

- 🔗 Share with your team

- 💬 Leave a comment

Happy monitoring! 📊
#Docker #Prometheus #Grafana #DevOps #Monitoring #Kubernetes #CloudNative #SRE #Infrastructure #Tutorial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>webdev</category>
      <category>productivity</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps Engineer to Cloud Architect</title>
      <dc:creator>abidaslam892</dc:creator>
      <pubDate>Fri, 14 Nov 2025 06:49:06 +0000</pubDate>
      <link>https://forem.com/abidaslam892/devops-engineer-to-cloud-architect-57cb</link>
      <guid>https://forem.com/abidaslam892/devops-engineer-to-cloud-architect-57cb</guid>
      <description>&lt;h2&gt;
  
  
  Originally published on Medium:
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://medium.com/towards-artificial-intelligence/devops-engineer-to-cloud-architect-8590efd51089" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Hello! I’m Abid Aslam, a DevOps Engineer and Cloud Solutions Architect with over 15 years of experience in telecom operations, infrastructure automation, and cloud computing. My journey through the Azure Resume Challenge wasn’t just about building a resume website — it was about showcasing the evolution of modern cloud architecture and demonstrating how traditional infrastructure expertise translates to cutting-edge cloud solutions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this comprehensive article, I’ll walk you through my complete Azure Resume Challenge experience, the unique approaches I took, and the advanced blog system I built to share my technical expertise with the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Motivated Me to Take the Azure Resume Challenge
&lt;/h2&gt;

&lt;p&gt;As someone who has spent over a decade managing critical telecom BSS (Business Support Systems) and building enterprise-grade infrastructure, I wanted to demonstrate how traditional IT operations expertise translates to modern cloud architecture. The Azure Resume Challenge provided the perfect platform to showcase:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern Cloud Architecture&lt;/strong&gt;: Moving from traditional server management to serverless, scalable solutions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps Best Practices&lt;/strong&gt;: Implementing CI/CD pipelines, Infrastructure as Code, and automated testing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-Stack Development:&lt;/strong&gt; Combining backend engineering with modern frontend experiences&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Writing&lt;/strong&gt;: Creating comprehensive documentation and sharing knowledge with the community&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Architecture Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My implementation goes beyond the basic challenge requirements, incorporating enterprise-grade patterns and advanced features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static Website Hosting&lt;/strong&gt;: Azure Storage with $web container&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Domain&lt;/strong&gt;: Professional domain with SSL/TLS encryption&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Delivery&lt;/strong&gt;: Azure Front Door for global performance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsive Design&lt;/strong&gt;: Mobile-first approach with modern CSS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless Computing&lt;/strong&gt;: Azure Functions with Python 3.11&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: CosmosDB Table API for visitor counter persistence&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Design&lt;/strong&gt;: RESTful endpoints with proper CORS handling&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Built-in health checks and comprehensive logging&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Control&lt;/strong&gt;: GitHub with organized repository structure&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;: GitHub Actions for automated deployment&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure as Code&lt;/strong&gt;: ARM templates for reproducible deployments&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Automated testing and validation workflows&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Building the Professional Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Resume Website&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;Abid Aslam&lt;br&gt;
CMPAK (Zong Pakistan Ltd) - Islamabad Lead SA &amp;amp; Project Manager for CRM / OCS / Rating / Mediation and Billing systems…&lt;br&gt;
&lt;a href="http://www.abidaslam.online" rel="noopener noreferrer"&gt;www.abidaslam.online&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I started with creating a clean, professional resume website that reflects modern design principles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt; &lt;span class="na"&gt;features&lt;/span&gt; &lt;span class="na"&gt;of&lt;/span&gt; &lt;span class="na"&gt;my&lt;/span&gt; &lt;span class="na"&gt;resume&lt;/span&gt; &lt;span class="na"&gt;design&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt;

&lt;span class="na"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Responsive&lt;/span&gt; &lt;span class="na"&gt;layout&lt;/span&gt; &lt;span class="na"&gt;optimized&lt;/span&gt; &lt;span class="na"&gt;for&lt;/span&gt; &lt;span class="na"&gt;all&lt;/span&gt; &lt;span class="na"&gt;devices&lt;/span&gt;

&lt;span class="na"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Professional&lt;/span&gt; &lt;span class="na"&gt;typography&lt;/span&gt; &lt;span class="na"&gt;using&lt;/span&gt; &lt;span class="na"&gt;Inter&lt;/span&gt; &lt;span class="na"&gt;font&lt;/span&gt; &lt;span class="na"&gt;family&lt;/span&gt;

&lt;span class="na"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Semantic&lt;/span&gt; &lt;span class="na"&gt;HTML&lt;/span&gt; &lt;span class="na"&gt;structure&lt;/span&gt; &lt;span class="na"&gt;for&lt;/span&gt; &lt;span class="na"&gt;accessibility&lt;/span&gt;

&lt;span class="na"&gt;-&lt;/span&gt; &lt;span class="na"&gt;CSS&lt;/span&gt; &lt;span class="na"&gt;Grid&lt;/span&gt; &lt;span class="na"&gt;and&lt;/span&gt; &lt;span class="na"&gt;Flexbox&lt;/span&gt; &lt;span class="na"&gt;for&lt;/span&gt; &lt;span class="na"&gt;modern&lt;/span&gt; &lt;span class="na"&gt;layouts&lt;/span&gt;

&lt;span class="na"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Smooth&lt;/span&gt; &lt;span class="na"&gt;animations&lt;/span&gt; &lt;span class="na"&gt;and&lt;/span&gt; &lt;span class="na"&gt;transitions&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The website showcases my 15+ years of experience in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Telecom Operations and BSS Systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DevOps and Infrastructure Automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Architecture and Migration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes and Container Orchestration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring and Observability Solutions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Blog System
&lt;/h2&gt;

&lt;p&gt;What sets my implementation apart is the comprehensive technical blog I built as part of the project. The blog features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;9 In-Depth Technical Articles&lt;/strong&gt; covering:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure Resume Challenge complete walkthrough&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DevOps automation with GitHub Actions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes security and RBAC implementation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure Functions best practices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terraform infrastructure patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evolution from traditional monitoring to modern observability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Legacy application containerization strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced troubleshooting methodologies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enterprise-grade deployment patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interactive Modal System&lt;/strong&gt;: Professional reading experience with JavaScript-powered modals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Technical Code Examples&lt;/strong&gt;: Real-world code snippets and architecture diagrams&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Responsive Design&lt;/strong&gt;: Optimized for desktop and mobile reading&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 2: Serverless Backend Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure Functions Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The backend implementation uses Azure Functions with several advanced features:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Core visitor counter implementation
&lt;/span&gt;
&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="n"&gt;visitor&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;OPTIONS&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

&lt;span class="n"&gt;auth_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AuthLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ANONYMOUS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;visitor_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="err"&gt;“””&lt;/span&gt;

&lt;span class="n"&gt;Azure&lt;/span&gt; &lt;span class="n"&gt;Function&lt;/span&gt; &lt;span class="n"&gt;HTTP&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;visitor&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;

&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Returns&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="n"&gt;visitor&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Increments&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;visitor&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CORS&lt;/span&gt; &lt;span class="n"&gt;preflight&lt;/span&gt; &lt;span class="n"&gt;support&lt;/span&gt;

&lt;span class="err"&gt;“””&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
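The handler body (elided above) reduces to an upsert on a single table entity. A storage-free sketch of that logic, with a dict standing in for the CosmosDB table; the key names follow the Table API's PartitionKey/RowKey convention, but everything else here is illustrative:

```python
table = {}  # stands in for the CosmosDB Table API table

def visitor_counter(method):
    """GET returns the count; POST increments and returns it, as above."""
    key = ("visitors", "counter")  # (PartitionKey, RowKey)
    entity = table.setdefault(key, {"count": 0})
    if method == "POST":
        entity["count"] += 1
    return {"count": entity["count"]}

print(visitor_counter("POST"))  # {'count': 1}
print(visitor_counter("GET"))   # {'count': 1}
```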



&lt;p&gt;&lt;strong&gt;Database Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using CosmosDB Table API for reliable, scalable data persistence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Advanced table storage management
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TableStorageManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;span class="c1"&gt;# Connection string and key-based authentication
&lt;/span&gt;
&lt;span class="c1"&gt;# Proper error handling and logging
&lt;/span&gt;
&lt;span class="c1"&gt;# Retry logic and connection pooling
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;API Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three comprehensive endpoints providing different functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/api/visitor-counter&lt;/code&gt;: Main counter functionality&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/api/visitor-stats&lt;/code&gt;: Detailed analytics and metadata&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/api/health&lt;/code&gt;: Health monitoring and diagnostics&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 3: Infrastructure as Code
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Azure Resource Management
&lt;/h3&gt;

&lt;p&gt;Implemented using ARM templates for complete infrastructure automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“parameters”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“storageAccountName”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“type”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“string”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“metadata”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“description”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;storage&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;account&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;website”&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“functionAppName”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“type”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“string”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“metadata”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“description”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“Name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Azure&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;App”&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
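Before running a deployment, a template like this can be sanity-checked with nothing but the standard json module. The required-parameter set below is this project's own choice, not an ARM rule:

```python
import json

# parameters this project expects (our choice, not an ARM requirement)
REQUIRED_PARAMETERS = {"storageAccountName", "functionAppName"}

def missing_parameters(template_text):
    """Return required parameters absent from an ARM template's parameters block."""
    template = json.loads(template_text)
    declared = set(template.get("parameters", {}))
    return REQUIRED_PARAMETERS - declared

template = '{"parameters": {"storageAccountName": {"type": "string"}}}'
print(missing_parameters(template))  # {'functionAppName'}
```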



&lt;h2&gt;
  
  
  Resource Organization
&lt;/h2&gt;


&lt;p&gt;&lt;strong&gt;Resource Groups&lt;/strong&gt;: Logical organization of all Azure resources&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage Accounts&lt;/strong&gt;: Static website hosting with CDN integration&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function Apps&lt;/strong&gt;: Serverless compute with auto-scaling&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CosmosDB&lt;/strong&gt;: Global distribution with Table API&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Insights&lt;/strong&gt;: Comprehensive monitoring and analytics&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Advanced CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implemented separate workflows for frontend and backend deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Frontend deployment workflow&lt;/span&gt;

&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Frontend&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;‘*.html’&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;‘*.css’&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;‘*.js’&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload to Azure Storage&lt;/span&gt;

&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/ — — -&lt;/span&gt;

&lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;azcliversion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.30.0&lt;/span&gt;

&lt;span class="na"&gt;inlineScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;

&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="s"&gt;z storage blob upload-batch \&lt;/span&gt;

&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="s"&gt; account-name ${{ secrets.AZURE_STORAGE_ACCOUNT }} \&lt;/span&gt;

&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="s"&gt; destination ‘$web’ \&lt;/span&gt;

&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="s"&gt; source . \&lt;/span&gt;

&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="s"&gt; auth-mode key&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deployment Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Blue-Green Deployments: Zero-downtime updates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated Testing: Integration and unit tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Environment Management: Separate dev/staging/production environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rollback Capabilities: Automated rollback on failure&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 5: Advanced Monitoring and Observability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Application Insights Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Comprehensive monitoring covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application Performance&lt;/strong&gt;: Response times, throughput, error rates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure Metrics&lt;/strong&gt;: CPU, memory, storage utilization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Metrics&lt;/strong&gt;: Business-specific KPIs and analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed Tracing&lt;/strong&gt;: End-to-end request tracking&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built-in health checks providing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“status”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“healthy”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“timestamp”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="err"&gt;–&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;–&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="err"&gt;T&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="mf"&gt;9.286391&lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“services”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“table_storage”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“connected”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;“function_app”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“running”&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
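A monitoring script can consume this payload directly. A small sketch (the healthy-state strings mirror the sample response above; the function name is ours):

```python
import json

def is_healthy(payload):
    """True only if the overall status and every listed service look good."""
    if payload.get("status") != "healthy":
        return False
    ok_states = {"connected", "running"}
    return all(state in ok_states for state in payload.get("services", {}).values())

sample = json.loads('{"status": "healthy", "services": '
                    '{"table_storage": "connected", "function_app": "running"}}')
print(is_healthy(sample))  # True
```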



&lt;h2&gt;
  
  
  Challenges Overcome
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Azure Functions Deployment Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Challenge: Initial deployment conflicts with WEBSITE_RUN_FROM_PACKAGE settings&lt;/p&gt;

&lt;p&gt;Solution: Implemented proper deployment configuration management and environment variable handling&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;CORS Configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Challenge: Cross-origin requests between static website and Azure Functions&lt;/p&gt;

&lt;p&gt;Solution: Comprehensive CORS handling in function code with proper preflight support&lt;/p&gt;
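As a rough sketch of that preflight handling, framework-agnostic so it is easy to test; the allowed origin and function names are assumptions, not the production values:

```python
# Origins allowed to call the API - assumed value for illustration.
ALLOWED_ORIGINS = {"https://www.abidaslam.online"}

def cors_headers(origin: str) -> dict:
    """Build CORS response headers, echoing the origin only if allowed."""
    headers = {
        "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Max-Age": "86400",  # let browsers cache the preflight
    }
    if origin in ALLOWED_ORIGINS:
        headers["Access-Control-Allow-Origin"] = origin
    return headers

def handle_request(method: str, origin: str):
    """Return (status, headers); an OPTIONS preflight gets 204 and no body."""
    headers = cors_headers(origin)
    if method == "OPTIONS":
        return 204, headers
    return 200, headers
```

In the actual Azure Function, these headers go into the `headers=` argument of the `HttpResponse`.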

&lt;ol start="3"&gt;
&lt;li&gt;CosmosDB Integration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Challenge: Connecting Azure Functions to the CosmosDB Table API with proper authentication&lt;/p&gt;

&lt;p&gt;Solution: Implemented multiple authentication methods with fallback strategies&lt;/p&gt;
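The fallback order can be expressed as a small selector; the environment-variable names below are assumptions for illustration, not the production configuration:

```python
def pick_table_auth(env: dict) -> str:
    """Decide how to authenticate to the Table API, most specific first.

    Tries, in order: a full connection string, an endpoint plus account
    key, then an endpoint alone (managed identity). Raises if nothing
    usable is configured.
    """
    if env.get("COSMOS_CONNECTION_STRING"):
        return "connection_string"
    if env.get("COSMOS_ENDPOINT") and env.get("COSMOS_KEY"):
        return "shared_key"
    if env.get("COSMOS_ENDPOINT"):
        return "managed_identity"
    raise RuntimeError("no Table API credentials configured")
```

Each strategy then maps onto the `azure-data-tables` client: `TableServiceClient.from_connection_string(...)` for the first, or `TableServiceClient(endpoint, credential=...)` with an `AzureNamedKeyCredential` or `DefaultAzureCredential` for the others.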

&lt;ol start="4"&gt;
&lt;li&gt;Performance Optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Challenge: Ensuring fast loading times and smooth user experience&lt;/p&gt;

&lt;p&gt;Solution: CDN integration, image optimization, and efficient caching strategies&lt;/p&gt;
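One way to sketch the caching side: choose a Cache-Control value per asset type before uploading to storage, long-lived for static assets and short for HTML so content updates propagate quickly. The exact max-age values here are assumptions:

```python
import os

# Cache lifetimes by file extension - illustrative values, not the
# production configuration.
CACHE_RULES = {
    ".html": "public, max-age=300",                      # HTML: 5 minutes
    ".css":  "public, max-age=31536000, immutable",      # assets: 1 year
    ".js":   "public, max-age=31536000, immutable",
    ".png":  "public, max-age=604800",                   # images: 1 week
}

def cache_control_for(path: str) -> str:
    """Return the Cache-Control header for a file, with a 1-hour default."""
    _, ext = os.path.splitext(path)
    return CACHE_RULES.get(ext.lower(), "public, max-age=3600")
```

When uploading with `azure-storage-blob`, the chosen value would be passed via `ContentSettings(cache_control=...)` so both the CDN edge and the browser honor it.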

&lt;p&gt;Technical Skills Demonstrated&lt;br&gt;
Through this project, I’ve showcased expertise in:&lt;/p&gt;

&lt;p&gt;Cloud Architecture&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Serverless Computing: Azure Functions for scalable backend services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage Solutions: Azure Storage for static hosting and CosmosDB for data persistence&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CDN Integration: Azure Front Door for global content delivery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: SSL/TLS, CORS, and proper authentication mechanisms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DevOps Practices&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Infrastructure as Code: ARM templates for reproducible deployments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD Pipelines: GitHub Actions with comprehensive testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring: Application Insights and custom health checks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation: Comprehensive README files and technical documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Development Excellence&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full-Stack Development: Modern HTML/CSS/JavaScript with Python backend&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Design: RESTful services with proper error handling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Database Management: NoSQL design patterns and data modeling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance Optimization: Efficient caching and content delivery&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advanced Blog Content Creation&lt;br&gt;
One of the unique aspects of my implementation is the comprehensive technical blog featuring 9 detailed articles:&lt;/p&gt;

&lt;p&gt;Featured Articles&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“Building the Perfect Azure Resume Challenge” — Complete walkthrough with architecture decisions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“DevOps Mastery: GitHub Actions for Azure Deployment” — Advanced CI/CD patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Kubernetes Security Hardening” — Enterprise RBAC and security best practices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Azure Functions Best Practices” — Performance, security, and scalability patterns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Infrastructure as Code with Terraform” — AWS enterprise deployment strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Monitoring Evolution: From Zabbix to Modern Observability” — 15+ years of monitoring experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Containerizing Legacy Telecom Applications” — Real-world modernization strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Advanced Troubleshooting Methodologies” — Systematic problem-solving approaches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Enterprise Deployment Patterns” — Production-ready architecture patterns&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-world code examples&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Architecture diagrams&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best practices and lessons learned&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Common pitfalls and solutions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance optimization techniques&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Phase 6: Advanced Analytics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;User Behavior Tracking: Detailed analytics on resume page interactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A/B Testing: Continuous optimization of user experience&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 7: Multi-Cloud Integration
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Integration: Cross-cloud deployment strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hybrid Architecture: On-premises integration patterns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Technical Insights&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Serverless Architecture: Azure Functions provide excellent scalability and cost-effectiveness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Static Site Hosting: Azure Storage offers robust, high-performance hosting for static content&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD Integration: GitHub Actions seamlessly integrates with Azure services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring Importance: Comprehensive observability is crucial for production systems&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Professional Growth&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Full-Stack Skills: Combining backend expertise with modern frontend development&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud Architecture: Designing scalable, resilient cloud-native solutions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Technical Writing: Sharing knowledge through comprehensive documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Community Engagement: Contributing to the broader DevOps and cloud communities&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Project Results and Impact&lt;br&gt;
Website Performance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Performance Metrics: Sub-second load times globally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: Zero security incidents with proper SSL/TLS implementation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Professional Impact&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Portfolio Showcase: Comprehensive demonstration of cloud and DevOps expertise&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Knowledge Sharing: Technical blog serving the community&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Career Development: Enhanced profile for cloud architecture opportunities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Industry Recognition: Contributing to Azure Resume Challenge community&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion&lt;br&gt;
The Azure Resume Challenge has been far more than just building a resume website — it’s been a comprehensive journey through modern cloud architecture, DevOps best practices, and technical leadership. Through this project, I’ve demonstrated how 15+ years of traditional IT operations expertise translates to cutting-edge cloud solutions.&lt;/p&gt;

&lt;p&gt;The combination of professional resume presentation, advanced technical blog content, and enterprise-grade architecture showcases the evolution from traditional infrastructure management to modern cloud-native solutions.&lt;/p&gt;

&lt;p&gt;Key Achievements&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Complete Azure Architecture: Serverless backend, CDN-enabled frontend, and NoSQL database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced CI/CD Pipeline: Automated deployment with comprehensive testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Professional Presentation: Modern, responsive design optimized for all devices&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise Patterns: Scalable, maintainable, and secure architecture&lt;/p&gt;

&lt;p&gt;What’s Next&lt;br&gt;
I’m excited to continue evolving this platform, adding advanced analytics, multi-cloud integration, and AI-powered features. The Azure Resume Challenge has provided an excellent foundation for showcasing cloud expertise and contributing to the broader DevOps community.&lt;/p&gt;

&lt;p&gt;For fellow engineers considering the Azure Resume Challenge, I highly recommend it as a comprehensive way to demonstrate your cloud skills while building something genuinely useful for your career.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Visit my live Azure Resume Challenge implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resume Website: &lt;a href="https://www.abidaslam.online/" rel="noopener noreferrer"&gt;https://www.abidaslam.online/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/abid-aslam-75520330/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/abid-aslam-75520330/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Email: &lt;a href="mailto:abidaslam.123@gmail.com"&gt;abidaslam.123@gmail.com&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub: abidaslam892&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thank you for joining me on this cloud journey! I look forward to connecting with fellow cloud enthusiasts and sharing more advanced technical content.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This article represents my personal experience with the Azure Resume Challenge and includes advanced patterns developed through 15+ years of enterprise IT operations and cloud architecture experience.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>devops</category>
      <category>azure</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
