Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Tech Horror Codex: Vendor Lock‑In

Comments
2 min read
CloudWatch Investigations: Your AI-Powered Troubleshooting Sidekick

1
Comments
4 min read
How We Architected Context: The Connect-Link-Query Pattern

1
Comments
2 min read
Beyond Dashboards: How FinOps and AI-Driven Observability are Reshaping SRE in 2026

Comments
3 min read
AI Meets DevOps and SRE: The Ultimate Power Trio for Building Unbreakable Systems

Comments
4 min read
🚨 How We Rescued a Dead Azure Linux VM After SSH, Agent, and OS Disk All Broke (A Real Production War Story)

5
Comments
3 min read
Why your system can be 100% up and still completely broken

3
Comments 2
2 min read
Reliability vs Uptime: Why Availability Fails at Scale

5
Comments 1
3 min read
Operability First: Policy, Not Hope

Comments
8 min read
SRE is the BEST Thing Ever

1
Comments
4 min read
How I Reduced Production Incidents as a Senior SRE (Without Slowing Releases)

1
Comments
2 min read
AI-Assisted Incident Triage in Large-Scale Cloud Systems: A Human-Centered Reliability Framework

1
Comments
3 min read
Datadog + AWS: Observability Maturity Model 2026

3
Comments
8 min read
Fallback and Resilient Degradation in APIs with Redis and Circuit Breaker

Comments
8 min read
EP 6 - Don't Kill Flaky APIs: The Art of Resilient Retries

Comments
1 min read