Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

I Got Lost in Canary Wharf for 30 Minutes, But I Found the Future of SRE

24
Comments 23
4 min read
Why Platform Engineering Is the Next Big Shift (and How Ops Teams Win)

Comments 2
3 min read
When Your System Is Up But Users Still Don’t Trust It

1
Comments
5 min read
Trust Is a Technical Feature: How Engineers Can Communicate Reliability Without Hype

Comments
5 min read
Trust Is an Engineered Outcome: How Tech Teams Can Communicate Through Failure Without Losing Their Future

Comments
5 min read
Zero-Downtime Schema Changes in SQL Server: The Reality Behind “Just Run the Migration”

Comments
6 min read
Noisy Alerts Burn Out On-Call: Designing Alerts Around SLOs (Fewer but Better)

1
Comments
3 min read
Inside the Agentic Loop: How 5 AI Agents Autonomously Investigate IT Incidents

Comments
3 min read
Designing Systems That Don’t Lie: How to Build Software That Fails Loudly, Recovers Fast, and Keeps User Trust

Comments
6 min read
The Quiet Skill That Prevents Repeat Outages: Writing Incidents Like an Engineer, Not a Courtroom

Comments
5 min read
Chapter 2 — RML-1 (Closed World): Build a Room Where Failure Is Safe

Comments
7 min read
Building Reliable Software: The Trap of Convenience

Comments
7 min read
Debugging & Production Incidents with AI

Comments
4 min read
When Systems Fail, Trust Is the Real Incident: A Practical Guide to Communication for Engineers and Founders

Comments
5 min read
AI Agents in Production: The Future of SRE and DevOps

4
Comments 1
3 min read