Forem

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The Silent Process

1
Comments
3 min read
How We Stopped Fighting Enterprise Auth and Read Calendars With a URL

1
Comments
8 min read
Beyond the Hype: The Realignment of AI Power and the Rise of Model-Agnostic Infrastructure

8
Comments 2
7 min read
When Everything Is On Fire: Incident Communication That Engineers (and Users) Can Trust

Comments
5 min read
Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

Comments
6 min read
Monitoring Tools Comparison 2026: VigilOps vs Zabbix vs Prometheus vs Datadog

3
Comments
2 min read
The Worlds of Distributed Systems — Align Your Team’s Mental Model

Comments
5 min read
Why LeetCode Habits Get Senior Engineers Rejected in Google SRE Coding Rounds

1
Comments
4 min read
Chapter 1 — Thinking About Rollback in Distributed Systems Through Three Worlds (RML-1/2/3)

Comments
6 min read
Most Kubernetes Clusters Are Over-Engineered

Comments 2
4 min read
A 10% traffic spike took down a stable system in 3 minutes and 47 seconds.

3
Comments
3 min read
The Big Tech Reality Check: Why "Senior" Architecture Fails at Global Scale

Comments 1
3 min read
What Actually Happens When You Put an AI Agent on Call

9
Comments 2
3 min read
Background Jobs in Production: The Problems Queues Don’t Solve

2
Comments 1
3 min read
Why is Infrastructure-as-Code so important? Hint: It's correctness

Comments
2 min read