Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Claude Code for the Outer Loop: An AI SRE Playbook to Reduce On-Call Toil

3 comments
18 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules

Comments
3 min read
If You Were a Server: How to Detect Issues and Keep Things Running Smoothly

Comments
10 min read
Prometheus at Scale: Surviving the Cardinality Cliff

Comments
2 min read
Database Reliability: The SRE Approach to Keeping Data Safe

Comments
3 min read
SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

Comments
4 min read
DORA metrics for the CFO: making engineering velocity legible

Comments
5 min read
Opsgenie 2026: Features, Pricing, EOL & Alternatives

1 comment
15 min read
Developer autonomy and the work that repeats after ship

2 comments
3 min read
The Incident Commander Role: Running Incidents Without Chaos

1 comment
2 min read
I got tired of writing runbooks after incidents. So I'm building something.

Comments
1 min read
Why Your Microservices Need Circuit Breakers (And How to Add Them)

Comments
2 min read
The On-Call Handoff That Prevents Dropped Incidents

Comments
2 min read
How I Troubleshoot Kubernetes in Production

3 comments
6 min read
SLOs That Product Managers Actually Understand

Comments
2 min read