<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Samar Santosh Patil</title>
    <description>The latest articles on Forem by Samar Santosh Patil (@samaarr).</description>
    <link>https://forem.com/samaarr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810818%2Fa4873496-4ad4-4dfd-a9eb-aba2671568a6.png</url>
      <title>Forem: Samar Santosh Patil</title>
      <link>https://forem.com/samaarr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samaarr"/>
    <language>en</language>
    <item>
      <title>Building a Production-Grade Observability Stack from Scratch (and What I Learned)</title>
      <dc:creator>Samar Santosh Patil</dc:creator>
      <pubDate>Sat, 07 Mar 2026 02:21:27 +0000</pubDate>
      <link>https://forem.com/samaarr/building-a-production-grade-observability-stack-from-scratch-and-what-i-learned-226o</link>
      <guid>https://forem.com/samaarr/building-a-production-grade-observability-stack-from-scratch-and-what-i-learned-226o</guid>
      <description>&lt;p&gt;&lt;em&gt;A student's two-week dive into OpenTelemetry, SLOs, and Go&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I'm a recent grad preparing for Google Dublin SRE/infrastructure interviews. One thing I kept reading was that the best way to demonstrate systems thinking is to build something real, not just study flashcards. So I spent two weeks building &lt;code&gt;otel-slo-guard-demo&lt;/code&gt;: a failure-injected microservices stack with full observability, SLO enforcement, and burn-rate alerting.&lt;/p&gt;

&lt;p&gt;This post is an honest account of what I built, what broke, and what I'd do differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Trying to Learn
&lt;/h2&gt;

&lt;p&gt;Google SRE interviews test whether you understand &lt;em&gt;reliability as an engineering discipline&lt;/em&gt;, not just as a vague goal. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you define "reliable enough"? (SLOs)&lt;/li&gt;
&lt;li&gt;How do you know when you're burning through reliability too fast? (error budgets + burn-rate alerts)&lt;/li&gt;
&lt;li&gt;How do you instrument code so you can answer those questions? (OpenTelemetry)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted to build something that forced me to answer all three questions with real code, not theory.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Eight containers, one compose command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;service_a (Python/FastAPI) ─── HTTP ──► service_b (Go/FastAPI)
      │                                       │
      └──── OTLP gRPC ──► otel-collector ◄────┘
                               │
                    ┌──────────┴──────────┐
                    ▼                     ▼
                 Jaeger               Prometheus
                (traces)              (metrics)
                                         │
                              ┌──────────┴──────────┐
                              ▼                     ▼
                         Alertmanager           Grafana
                              │
                              ▼
                       webhook-receiver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;service_a&lt;/code&gt; is the gateway — it receives requests and calls &lt;code&gt;service_b&lt;/code&gt;. &lt;code&gt;service_b&lt;/code&gt; has a &lt;code&gt;/admin/failmode&lt;/code&gt; endpoint that lets you inject errors, latency, or both at runtime without restarting anything.&lt;/p&gt;

&lt;p&gt;The OTel Collector receives traces over gRPC (port 4317) and exports them to Jaeger. Prometheus scrapes &lt;code&gt;/metrics&lt;/code&gt; from both services. Alertmanager handles routing. Grafana visualises everything.&lt;/p&gt;
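
&lt;p&gt;For orientation, here's the shape of that collector pipeline as config. This is a minimal sketch, not the repo's exact file — in particular, how traces reach Jaeger depends on your collector version (newer collectors send OTLP straight to Jaeger's own 4317 port rather than using a dedicated Jaeger exporter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # services export OTLP gRPC here

exporters:
  # assumption: Jaeger ingesting OTLP directly; exporter naming varies by collector version
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;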




&lt;h2&gt;
  
  
  SLO Design: The Math Matters
&lt;/h2&gt;

&lt;p&gt;I should have designed the SLO before writing any instrumentation code. I didn't — I wrote metrics first and designed the SLO after, which meant re-labelling some counters. Lesson learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SLO:&lt;/strong&gt; 99% success rate on &lt;code&gt;service_a_requests_total&lt;/code&gt;. That's a 1% error budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What "1% error budget" actually means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over a 30-day window: 30 × 24 h × 1% = 7.2 hours of acceptable downtime&lt;/li&gt;
&lt;li&gt;Over a 7-day window: 7 × 24 h × 1% ≈ 1.68 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-window burn-rate alerting&lt;/strong&gt; is how you catch both fast and slow failures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alert&lt;/th&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;th&gt;Burn Rate&lt;/th&gt;
&lt;th&gt;Error Budget Gone In&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SLOBurnFast (page)&lt;/td&gt;
&lt;td&gt;5m + 1h&lt;/td&gt;
&lt;td&gt;&amp;gt;14.4×&lt;/td&gt;
&lt;td&gt;~2 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLOBurnSlow (ticket)&lt;/td&gt;
&lt;td&gt;30m + 6h&lt;/td&gt;
&lt;td&gt;&amp;gt;6×&lt;/td&gt;
&lt;td&gt;~5 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The logic: at a 14.4× burn rate you're consuming about 2% of the monthly budget every hour (0.02 × 30 × 24 = 14.4), so the whole 30-day budget is gone in roughly two days (720 h ÷ 14.4 ≈ 50 h). That's a 3am page. At 6×, the budget lasts about five days — worth a ticket, not a wake-up call.&lt;/p&gt;

&lt;p&gt;In Prometheus recording rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_a:burn_rate:5m&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;rate(service_a_requests_total{status="error"}[5m])&lt;/span&gt;
      &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="s"&gt;rate(service_a_requests_total[5m])&lt;/span&gt;
    &lt;span class="s"&gt;) / 0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
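
&lt;p&gt;The fast alert below also checks a 1-hour burn rate, so the rules file needs a second record of exactly the same shape (sketched here — and the slow alert needs 30m and 6h variants built the same way):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- record: service_a:burn_rate:1h
  expr: |
    (
      rate(service_a_requests_total{status="error"}[1h])
      /
      rate(service_a_requests_total[1h])
    ) / 0.01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;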



&lt;p&gt;Then the alert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SLOServiceAErrorBudgetBurnFast&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;service_a:burn_rate:5m &amp;gt; 14.4&lt;/span&gt;
    &lt;span class="s"&gt;AND&lt;/span&gt;
    &lt;span class="s"&gt;service_a:burn_rate:1h &amp;gt; 14.4&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dual-window requirement (both the 5m and the 1h burn rate must exceed the threshold) keeps short spikes from paging — a brief burst won't lift the 1h average over the line — while the 5m window lets the alert clear quickly once the burn stops.&lt;/p&gt;
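
&lt;p&gt;Once the alert fires, Alertmanager routes on the &lt;code&gt;severity&lt;/code&gt; label. A sketch of what that routing looks like — the receiver name, port, and path here are assumptions, not the repo's exact config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;route:
  receiver: demo-webhook
  group_by: ['alertname', 'severity']
  routes:
    # a real setup would page on severity=page and open a ticket on severity=ticket;
    # the demo sends both to the same webhook receiver
    - matchers:
        - severity="page"
      receiver: demo-webhook

receivers:
  - name: demo-webhook
    webhook_configs:
      - url: http://webhook-receiver:9000/alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;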




&lt;h2&gt;
  
  
  OpenTelemetry: What Actually Tripped Me Up
&lt;/h2&gt;

&lt;p&gt;OTel has a lot of moving parts and the Python docs assume you already know the right setup order. I didn't, so I hit this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spans were silently no-ops.&lt;/strong&gt; The instrumentation appeared to work — no errors — but Jaeger showed nothing. The root cause: I was calling &lt;code&gt;FastAPIInstrumentor.instrument_app(app)&lt;/code&gt; before configuring the &lt;code&gt;TracerProvider&lt;/code&gt;. OTel falls back to a no-op provider if none is set, and it does so silently.&lt;/p&gt;

&lt;p&gt;Correct order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Configure TracerProvider
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(...)))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Create FastAPI app
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Instrument
&lt;/span&gt;&lt;span class="n"&gt;FastAPIInstrumentor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Define routes
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/work&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;work&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gap in the docs is what led to my upstream PR to the &lt;code&gt;opentelemetry-python-contrib&lt;/code&gt; repo — the FastAPI instrumentation README had no usage example at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Porting service_b to Go
&lt;/h2&gt;

&lt;p&gt;Halfway through I decided to port &lt;code&gt;service_b&lt;/code&gt; to Go. Reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Google's backend stack is heavily Go&lt;/li&gt;
&lt;li&gt;I wanted to see how OTel maps across languages&lt;/li&gt;
&lt;li&gt;Go's static binary made the Docker image ~10MB vs ~200MB for Python slim&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Go OTel setup is more explicit than Python's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTransportCredentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCredentials&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otlptracegrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otlptracegrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithGRPCConn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewWithAttributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;semconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SchemaURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;semconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServiceName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"service_b_go"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More boilerplate, but you can't accidentally forget to set the provider — the compiler will complain if you don't use the variable.&lt;/p&gt;

&lt;p&gt;HTTP handlers get automatic span creation via &lt;code&gt;otelhttp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/compute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otelhttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"compute"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same Prometheus metrics shape as the Python service — same label names, same counter/histogram pattern — so the existing recording rules and alerts worked without changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Injection Demo
&lt;/h2&gt;

&lt;p&gt;This is where things get fun to demo. With the stack running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inject 100% error rate&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8002/admin/failmode &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "error", "error_rate": 1.0}'&lt;/span&gt;

&lt;span class="c"&gt;# Hammer the endpoint to burn the error budget fast&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 100&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8001/work &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Watch Prometheus: burn_rate:5m spikes to &amp;gt;&amp;gt;14.4&lt;/span&gt;
&lt;span class="c"&gt;# Alertmanager fires SLOBurnFast within 2-3 minutes&lt;/span&gt;
&lt;span class="c"&gt;# Webhook receiver logs the alert payload&lt;/span&gt;

&lt;span class="c"&gt;# Reset&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8002/admin/failmode &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode": "none"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watching the burn rate metric spike in Grafana, then seeing the alert fire in Alertmanager, then seeing the webhook payload — that's when the SLO math stops being abstract.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI and Repo Hygiene
&lt;/h2&gt;

&lt;p&gt;Since this is a portfolio project, I wanted it to look like something I'd be comfortable running in production. That meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.github/workflows/ci.yml&lt;/code&gt; — spins up the full stack on push to main, hits &lt;code&gt;/healthz&lt;/code&gt; on all services, tears it down (sketched below)&lt;/li&gt;
&lt;li&gt;Single &lt;code&gt;docker-compose.yml&lt;/code&gt; with a single override file (I started with three separate overrides — consolidating them was a good exercise in compose file hygiene)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.venv&lt;/code&gt; gitignored and untracked&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;README.md&lt;/code&gt; with quickstart, ports table, failure injection commands, and troubleshooting section&lt;/li&gt;
&lt;/ul&gt;
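
&lt;p&gt;A sketch of what that workflow boils down to — the compose files and service ports come from elsewhere in this post, but the job name, timings, and exact steps here are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: ci
on:
  push:
    branches: [main]

jobs:
  stack-smoke-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start the stack
        run: docker compose -f docker-compose.yml -f docker-compose.prometheus.override.yml up -d --build
      - name: Health-check the services
        run: |
          sleep 30   # give the containers time to come up
          curl --fail http://localhost:8001/healthz
          curl --fail http://localhost:8002/healthz
      - name: Tear down
        run: docker compose -f docker-compose.yml -f docker-compose.prometheus.override.yml down -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;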




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Design the SLO before writing code.&lt;/strong&gt; Knowing your error budget and burn-rate thresholds upfront changes how you label your metrics. I had to relabel counters mid-project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with the collector config.&lt;/strong&gt; I spent hours debugging missing spans because the OTel Collector pipeline wasn't wired up correctly. It should have been the first thing I validated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Write the Go service first.&lt;/strong&gt; The Python service was a comfortable starting point, but Go forced me to be more explicit about everything — which built better understanding.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources That Actually Helped
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sre.google/sre-book/eliminating-toil/" rel="noopener noreferrer"&gt;Google SRE Book — Chapter 5: Eliminating Toil&lt;/a&gt; — the burn-rate math comes from here&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-python-contrib" rel="noopener noreferrer"&gt;OpenTelemetry Python contrib&lt;/a&gt; — read the source, not just the docs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://prometheus.io/docs/practices/alerting/" rel="noopener noreferrer"&gt;Prometheus Alerting on SLOs&lt;/a&gt; — the multi-window rationale&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Repo
&lt;/h2&gt;

&lt;p&gt;Everything is at: &lt;strong&gt;&lt;a href="https://github.com/samaarr/otel-slo-guard-demo" rel="noopener noreferrer"&gt;https://github.com/samaarr/otel-slo-guard-demo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One command to run the full stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.prometheus.override.yml &lt;span class="se"&gt;\&lt;/span&gt;
  up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're preparing for SRE or infrastructure interviews and want to talk through any of this, I'm learning in public and happy to compare notes.&lt;/p&gt;

</description>
      <category>career</category>
      <category>go</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
