Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Monitoring Tools Comparison 2026: VigilOps vs Zabbix vs Prometheus vs Datadog


2
Comments
2 min read
5 CI/CD Pipeline Disasters I Caused (And How I Fixed Them)


Comments
8 min read
A 10% traffic spike took down a stable system in 3 minutes and 47 seconds.


Comments
3 min read
How blue/green deployments saved us from out of hours changes and downtime

Comments
2 min read
Sentrix: An AI SRE Copilot That Debates Its Own Scaling Decisions

Comments
2 min read
Alert Fatigue Is Real — Here's What It's Actually Costing Your Team


1
Comments
5 min read
The 15-minute problem: how to decide whether to rollback after deploy


2
Comments
4 min read
Background Jobs in Production: The Problems Queues Don’t Solve

1
Comments
3 min read
How a 2% Latency Spike Collapses a 20-Service System and How to Prevent It

1
Comments
3 min read
Throw a Prompt at your IDE and see it get done!


2
Comments
1 min read
Chapter 4: GitOps with Terraform + ArgoCD — Self-Hosting LLMs as a Platform Product


1
Comments
28 min read
Observability and Failure Recovery in Distributed Financial Systems: When Correct Systems Still Break

1
Comments
5 min read
PostgreSQL High Availability: Patroni, Replication and Failover Patterns

1
Comments
12 min read
The Technology You Never See Is Often What Breaks First


1
Comments
5 min read
Topology-Aware AI Agents for Observability: Automating SLO Breach Root Cause Analysis


1
Comments
5 min read