Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

A hard-earned rule from incident retrospectives:

1 min read
One insight that changed how I design systems:

1 min read
The Nines Are Lying to You: What 99.9% Uptime Actually Costs

4 min read
I built an AI tool for incident investigation (looking for honest feedback)

2 min read
Determinism Series: Siliconizing Decision-Making (Index)

4 min read
FROM ALERTS TO ACTION: Autonomous Incident Response Agent | Engineering & Devops

6 min read
Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring & Lessons from Production

7 min read
Go Context Timeouts That Save Real Money

9 min read
From ticket to PR with agents: how to use Claude to automate platform changes without breaking SLOs

3 min read
SRE Explained: Because 'It Works on My Machine' is Not an SLO 🎯

9 min read
Public status page guide for SaaS teams selling to enterprise

4 min read
Why Your Monitoring Is Failing in Microservices (And What Actually Works)

3 min read
SLI/SLO Framework

4 min read
Capacity Planning Toolkit

3 min read
On-Call Management Kit

4 min read