Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Chapter 2 — RML-1 (Closed World): Build a Room Where Failure Is Safe

7 min read
Building Reliable Software: The Trap of Convenience

7 min read
When Systems Fail, Trust Is the Real Incident: A Practical Guide to Communication for Engineers and Founders

5 min read
The Silent Process

3 min read
When Everything Is On Fire: Incident Communication That Engineers (and Users) Can Trust

5 min read
Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

6 min read
The Worlds of Distributed Systems — Align Your Team’s Mental Model

5 min read
Chapter 1 — Thinking About Rollback in Distributed Systems Through Three Worlds (RML-1/2/3)

6 min read
How we fixed Real Kubernetes Production Incidents

3 min read
The Real Reason AI Agents “Work” in Software

6 min read
OpenSRM: An Open Specification for Service Reliability

6 min read
Why is Infrastructure-as-Code so important? Hint: It's correctness

2 min read
Why My Server Became Slow: The Case of the SMR Disk

2 min read
OpenTelemetry vs Logstash - Which Logging Tool Is Right for You?

9 min read
OpenTelemetry vs Loki - Choosing the Right Observability Tool

13 min read