Forem: Femi

Scaling Observability: Designing a Resilient Multi-Node Monitoring Stack with Docker, Prometheus & Grafana

Femi — Fri, 15 May 2026 07:27:49 +0000

Building a monitoring environment on a local machine is a great weekend project, but scaling it up to look after a live fleet of remote servers requires shifts in how you handle configuration stability, dashboard variables, and hardware persistence.
In this post, I want to walk through how I configured and optimized a multi-node monitoring stack utilizing Prometheus, Node Exporter, and Grafana deployed entirely via Docker Compose.

The Deployment Architecture
To keep things clean and modular, the entire monitoring core runs as separate containerized microservices. The telemetry relies on bind-mounts to guarantee that if a container is wiped or updated, the custom target definitions stay safe on disk. Here is the structural framework of the modern docker-compose.yml layout used to spin it up:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"

Solving the High-Availability Problem
A common issue with basic Docker deployments is that if the physical or virtual host undergoes a sudden reboot or power failure, your containers stay offline in an Exited state.
By applying the restart: always policy to each service, the Docker daemon relaunches the containers automatically as soon as it starts after boot. No manual SSH intervention is required.
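
The same restart policy can be applied to Node Exporter on each monitored host. The block below is a minimal sketch, assuming every remote node runs its own small Compose file. The service name, the host mounts, and the path flags are illustrative assumptions, not the exact file from this setup.

```yaml
# docker-compose.yml on each monitored node (illustrative sketch)
services:
  node-exporter:                      # hypothetical service name
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: always                   # same restart policy as the core stack
    ports:
      - "9100:9100"                   # Node Exporter's default port
    volumes:
      # Read-only views of the host, so the exporter reports host metrics
      # rather than the container's own
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
```

Publishing port 9100 is what lets the central Prometheus container reach the exporter over the network.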

Scraping Multiple Remote Targets
Inside the prometheus.yml target profile, I pooled our infrastructure assets into distinct target blocks. Rather than hardcoding a separate job for every server, grouping identical server profiles under a single targets array makes filtering much cleaner:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'remote_ubuntu_nodes'
    static_configs:
      - targets:
          - '192.168.23.87:9110'
          - '192.168.23.88:9100'
          - '192.168.23.89:9100'
          - '192.168.23.90:9100'

Transitioning to a Fleet View in Grafana
Standard configurations for public dashboards (like the classic Node Exporter Full) default to strict single-select filters. When checking on multiple nodes like load balancers or app-services, clicking through an endless dropdown isn't sustainable. To move to a comprehensive fleet view, we can open Dashboard Settings (press d, then s, in Grafana) and adjust the query variables:
Multi-value selection: Enabled.
Include All option: Enabled.

To prevent the gauges from blending the metrics into a confusing average, you can open the row settings for your graphs and toggle Repeat For: Instance. Grafana will then dynamically duplicate that entire row of health metrics for every selected machine.

Clean Code for DevOps: Refactoring my Ansible Lab into Roles

Femi — Thu, 23 Apr 2026 16:33:51 +0000

As my Ansible project grew, my single master playbook started to get crowded. Today, I decided to 'graduate' my automation by implementing Ansible Roles.
I’ve moved from a linear script to a modular directory structure:
/roles/web_servers
/roles/workstations
/roles/db_servers
/roles/file_servers

This refactor allows me to treat my infrastructure like LEGO blocks. Need a new web server? Just call the role. Want to update my workstation? The logic is isolated and safe.

The biggest challenge? Managing file paths and ensuring the tasks/main.yml in each role was correctly mapped. It takes a bit more setup time initially, but long-term maintenance is now far easier.
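
As a rough illustration of "just call the role", here is a minimal sketch of a top-level playbook. The file name site.yml and the inventory group names are assumptions; only the role names come from the directory list above.

```yaml
# Assumed layout; tasks/main.yml is the entry point Ansible loads for each role:
#   roles/web_servers/tasks/main.yml
#   roles/workstations/tasks/main.yml
#   roles/db_servers/tasks/main.yml
#   roles/file_servers/tasks/main.yml
---
# site.yml (hypothetical top-level playbook)
- name: Configure web servers
  hosts: web_servers        # assumed inventory group name
  become: true
  roles:
    - web_servers

- name: Configure database servers
  hosts: db_servers         # assumed inventory group name
  become: true
  roles:
    - db_servers
```

Each play stays a few lines long because the task logic lives inside the role directory.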

Tool-Chain Automation: Using Ansible to Deploy Terraform and Web Content

Femi — Tue, 14 Apr 2026 20:45:15 +0000

Automation doesn't stop at OS updates. Today, I expanded my Ansible master playbook to handle two very different, but equally important, tasks:

Software Provisioning: Used the unarchive module to fetch, unzip, and install Terraform from a remote URL directly into /usr/local/bin. No manual downloads, no mess.
Content Orchestration: Deployed a custom HTML site across my web tier using the copy module, ensuring strict Linux permissions (0644) were applied automatically.

By combining package management (apt/dnf), remote resource fetching, and file distribution, I've created a single source of truth for my entire workstation and server fleet.
IaC isn't just about the servers; it's about the tools we use to build them! 🛠️
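
The two tasks described above might look roughly like this. It is a sketch under assumptions: the group names, the Terraform version, the release URL pattern, and the source file path are all illustrative, and the unarchive module needs unzip installed on the managed node to handle a zip archive.

```yaml
---
- name: Install Terraform on workstations (illustrative)
  hosts: workstations            # assumed inventory group name
  become: true
  vars:
    terraform_version: "1.8.5"   # assumed example; pin the release you need
  tasks:
    - name: Download and unpack Terraform into /usr/local/bin
      ansible.builtin.unarchive:
        src: "https://releases.hashicorp.com/terraform/{{ terraform_version }}/terraform_{{ terraform_version }}_linux_amd64.zip"
        dest: /usr/local/bin
        remote_src: true         # download on the managed node, not the control node
        include:
          - terraform            # extract only the binary from the archive
        mode: "0755"

- name: Deploy the static site to the web tier (illustrative)
  hosts: web_servers             # assumed inventory group name
  become: true
  tasks:
    - name: Copy the index page with explicit permissions
      ansible.builtin.copy:
        src: files/index.html            # hypothetical path next to the playbook
        dest: /var/www/html/index.html   # default Apache docroot on Ubuntu and CentOS
        owner: root
        group: root
        mode: "0644"
```

Quoting the mode values keeps YAML from reading them as plain integers.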

Surgical Automation: Mastering Ansible Tags for Multi-Tier Deployments

Femi — Tue, 14 Apr 2026 15:57:38 +0000

Running the whole master playbook for every small change was slow, so today I implemented Ansible Tags to fix that.
By tagging my tasks (tags: apache, tags: db), I now have 'surgical' control over my infrastructure. I can:
Update just the Web tier: ansible-playbook master.yml --tags "apache"
Deploy only Database changes: ansible-playbook master.yml --tags "db"
Skip the long update processes: ansible-playbook master.yml --skip-tags "always"

I also maintained my Multi-OS logic, ensuring my tags work seamlessly across both Ubuntu and CentOS nodes.

This taught me a valuable lesson in 'Developer Experience' (DX): automation tools should be easy and fast to use.
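
To make the tag behavior concrete, here is a minimal sketch of tagged tasks. The play layout and task names are assumptions; only the tag names (apache, db, always) come from the commands above.

```yaml
---
- name: Tagged multi-tier tasks (illustrative)
  hosts: all                 # simplified; a real playbook would target tier groups
  become: true
  tasks:
    - name: Upgrade all packages (the slow step)
      ansible.builtin.package:
        name: "*"
        state: latest
      tags: always           # runs on every invocation unless explicitly skipped

    - name: Install Apache on Ubuntu
      ansible.builtin.apt:
        name: apache2
        state: present
      when: ansible_distribution == "Ubuntu"
      tags: apache

    - name: Install Apache on CentOS
      ansible.builtin.dnf:
        name: httpd
        state: present
      when: ansible_distribution == "CentOS"
      tags: apache

    - name: Install MariaDB
      ansible.builtin.package:
        name: mariadb-server
        state: present
      tags: db
```

Because the special always tag still runs under --tags "apache", combining it with --skip-tags "always" is what actually avoids the slow upgrade step.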

From Scripts to Infrastructure-as-Code: Building a Multi-Tier Ansible Playbook

Femi — Sun, 12 Apr 2026 17:35:15 +0000

There is a moment in every DevOps journey where it just 'clicks.' For me, it was today.
I’ve spent the last week moving away from manual configuration to a fully automated, role-based infrastructure. Instead of one long list of tasks, I’ve organized my environment into logical groups:
Web Tier: Automated Apache + PHP setup (handling Ubuntu/CentOS differences).
Database Tier: MariaDB provisioning.
File Tier: Samba deployment using the OS-agnostic package module.

The best part? The playbook adapts to each host. Using Ansible facts, it detects the OS and adjusts package names and managers (apt vs. dnf) on the fly. No more 'permission denied' errors or broken dependencies, just clean, idempotent automation.

Key Lesson Learned: Formatting is everything in YAML! Managing multiple 'plays' in one file requires strict attention to indentation and host grouping.
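
A minimal sketch of several plays in one file, one per tier. The group names, the web_packages variable, and the exact package lists are assumptions for illustration. Note how each play starts at column zero with a dash and everything beneath it is indented consistently.

```yaml
---
- name: Web tier
  hosts: web_servers               # assumed inventory group name
  become: true
  vars:
    web_packages:                  # hypothetical lookup keyed by the distribution fact
      Ubuntu: [apache2, php, libapache2-mod-php]
      CentOS: [httpd, php]
  tasks:
    - name: Install Apache and PHP
      ansible.builtin.package:
        name: "{{ web_packages[ansible_distribution] }}"
        state: present

- name: Database tier
  hosts: db_servers                # assumed inventory group name
  become: true
  tasks:
    - name: Install MariaDB
      ansible.builtin.package:
        name: mariadb-server
        state: present

- name: File tier
  hosts: file_servers              # assumed inventory group name
  become: true
  tasks:
    - name: Install Samba
      ansible.builtin.package:
        name: samba
        state: present
```

The package module delegates to apt or dnf depending on the host, so only the package names need a per-OS lookup.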

Smart Playbooks: Handling Ubuntu and CentOS in one go with Ansible

Femi — Sat, 11 Apr 2026 21:57:55 +0000

One of the biggest hurdles in automation is environmental drift—specifically when your fleet isn't running the same OS.
I recently tackled this in an enterprise environment. I wanted to deploy a web stack, but my nodes are a mix of Ubuntu 24.04 and CentOS. Since Ubuntu uses apt and Apache is called apache2, while CentOS uses dnf and Apache is called httpd, a simple script wouldn't cut it.

Enter Ansible Conditionals.
By using the when statement tied to the ansible_distribution fact, I built a 'smart' playbook that:
Detects the OS automatically.
Runs apt tasks for Ubuntu and dnf for CentOS.
Installs the correct package names for each.

It’s a small logic jump, but it’s the difference between a playbook that works on one machine and a playbook that works on a thousand.
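
A minimal sketch of the conditional pattern described above, assuming an inventory group named web_servers. Each task runs only on hosts whose ansible_distribution fact matches.

```yaml
---
- name: Install Apache on a mixed fleet (illustrative)
  hosts: web_servers               # assumed inventory group name
  become: true
  tasks:
    - name: Install Apache on Ubuntu
      ansible.builtin.apt:
        name: apache2
        state: present
        update_cache: true
      when: ansible_distribution == "Ubuntu"

    - name: Install Apache on CentOS
      ansible.builtin.dnf:
        name: httpd
        state: present
      when: ansible_distribution == "CentOS"
```

On each host, the task whose condition is false is reported as skipped rather than failed.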

Next up on my journey: Implementing Handlers to make these updates even more efficient! 🚀

The proof-of-work with Ansible

Femi — Fri, 10 Apr 2026 21:11:19 +0000

Seeing "changed=1" is the most satisfying feeling in DevOps. 🚀
I just finished building a playbook that handles the full onboarding of a new admin user across multiple Ubuntu servers simultaneously.

The stack:
Control Node: Ubuntu 24.04 (Brown)
Managed Nodes: 3x Ubuntu VMs
Orchestration: Ansible
Security: SHA-512 Hashing & SSH Key injection

By moving away from manual useradd commands, I’m ensuring that my infrastructure is consistent, secure, and easily reproducible. Every error today was just a lesson in how the Linux shadow file works under the hood.
Moving one step closer to full automation 🐧💻
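
For reference, an onboarding play along these lines could be sketched as follows. The username, the key path, and the password prompt are assumptions, and the authorized_key module comes from the ansible.posix collection, which may need to be installed separately.

```yaml
---
- name: Onboard an admin user (illustrative)
  hosts: all
  become: true
  vars:
    admin_user: opsadmin                         # hypothetical username
    admin_pubkey_path: files/admin_ed25519.pub   # hypothetical key file next to the playbook
  vars_prompt:
    - name: admin_password
      prompt: Password for the new admin
      private: true
  tasks:
    - name: Create the user with a SHA-512 password hash
      ansible.builtin.user:
        name: "{{ admin_user }}"
        # The hashed value is what ends up in /etc/shadow
        password: "{{ admin_password | password_hash('sha512') }}"
        update_password: on_create   # avoid rewriting the hash on every run
        groups: sudo
        append: true
        shell: /bin/bash

    - name: Install the SSH public key
      ansible.posix.authorized_key:
        user: "{{ admin_user }}"
        key: "{{ lookup('file', admin_pubkey_path) }}"
        state: present
```

Without update_password: on_create, a freshly salted hash would differ on each run and the task would report a change every time.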

#Ansible #DevOps #SysAdmin #Tech
#Infrastructure