Forem: Samson Tanimawo

How We Killed Our Worst Alert (And What We Learned)

Samson Tanimawo — Mon, 25 May 2026 17:30:36 +0000

For two years, one alert dominated our on-call pages. It fired roughly 40% of all pages. Nobody had fixed it because 'it was important.' Here's how we finally killed it.

The alert

'Kafka consumer lag > 100k messages on topic X.'

Sounds reasonable. Actually fired 3-4 times a day, always between 8am-9am, almost never corresponding to a real problem. The 'fix' was to wait 10 minutes and let it catch up.

Why nobody killed it

Everyone was afraid to. 'What if the one time we ignore it, it's real?' Classic alert fatigue excuse.

What we did

We spent two days doing nothing but watching this alert. We checked every fire over 6 months of history.

Findings:

98% of fires were auto-resolved within 15 minutes
The 2% that weren't were correlated with a specific upstream service's cold start
The 'lag > 100k' threshold was arbitrary, set 18 months ago when traffic was 1/10 of current

The fix

Raised the threshold to 500k (still safely above real problems)
Added a 20-minute duration requirement (the alert only fires if lag stays elevated for 20 minutes)
Added a second alert for 'upstream service Y cold start detected' so we catch the real underlying cause directly

Result: alert volume on this one metric dropped from 1,200/month to 3/month. The 3 that remain are always real problems.

The lesson

Every noisy alert is a threshold problem, a duration problem, or a cause problem. Not 'we need to care more.'

Fix the system, not the attitude.

The bonus lesson

After killing this alert, three engineers independently told me they sleep better now. That alert had been waking one of them at 5 AM twice a week for a year.

Reliability work is human work. Never forget the human on the other end of the page.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

The Reliability Roadmap: A 90-Day Plan for New SRE Teams

Samson Tanimawo — Sun, 24 May 2026 17:39:53 +0000

New SRE team at your company? Here's a 90-day plan I've used twice. It works because it balances 'show immediate value' with 'build for the long term.'

Days 1-14: Observe

Resist the urge to change things. Watch the current system, read existing post-mortems, shadow on-call, talk to engineers about their pain.

Output: a list of the top 5 reliability problems ranked by 'engineering time lost per week.'

Days 15-30: Quick wins

Pick the top 2 from your list. Fix them. Make the fixes visible announce them in eng all-hands.

Good quick wins: delete a flaky alert, automate a repetitive runbook, fix a broken dashboard everyone complains about.

Bad quick wins: rewrite the deployment pipeline. Too big, takes 90 days alone.

Output: visible reliability improvements + trust from engineering teams.

Days 31-60: Foundations

Now use your trust. Introduce the boring stuff:

Define SLOs for the top 3 critical services
Set up an error budget dashboard
Establish a weekly reliability review (10 minutes, not an hour)
Write an incident response runbook template

Output: measurable reliability targets that engineering can rally around.

Days 61-90: Programmatic change

Start turning the reliability work into ongoing programs:

Post-mortem process with action-item tracking
Monthly toil survey (what did engineers do this month that could've been automated?)
Quarterly reliability review with leadership
A clear hand-off process: when does reliability work become product engineering work?

Output: processes that continue working when you take a vacation.

The trap

'This is fine, we can do all this in the first week.' You cannot. Every team I've seen that tried it got burned out or resented. 90 days is the minimum. More is normal.

The real goal

At day 91, the engineering team should be able to describe your team's value in one sentence. If they can't, you spent 90 days on the wrong things.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Scaling On-Call When You Only Have 5 Engineers

Samson Tanimawo — Sat, 23 May 2026 17:18:22 +0000

On-call is brutal at small scale. Every engineer takes 1 week in 5. You get woken up once a week. Burnout is weeks away.

Here's what works at 5 engineers, from someone who's been there.

Accept the reality

You cannot build a 'rested, follow-the-sun, healthy' on-call rotation with 5 people. Stop trying to mimic Google. Build for small-team reality.

The 3 things that help

1. Aggressively reduce alerts. When you have 5 engineers, you cannot afford 50 alerts/day. Cut mercilessly. Target: 1-2 pages per week per on-call. Yes, you might miss things. You'll miss more by being exhausted.

2. Kill pager fatigue with business hours routing. Non-urgent alerts go to a ticket, not a page. Only 'user-facing impact right now' alerts wake someone up. Everything else waits for morning.

3. Pay for on-call. $500-$1000/week for primary. Yes, you can afford it. If you can't, your company is too small for 24/7 on-call just accept overnight delays.

What doesn't help

'Just be better at triage' (not a system fix)
Bringing in contractors for on-call (they don't know your system)
Unplanned time off after a rough week (too late, damage done)

The emotional side

The hardest part of small-team on-call isn't the pages. It's the feeling that the company rests on you personally. Fight that narrative.

Take real vacations. Block the week. No Slack.
Rotate the 'primary' role explicitly so nobody becomes the default expert
Document everything so anyone can handle anything

The growth path

As you hire, protect the on-call ratio. Don't add 3 engineers and immediately expand the services they're responsible for. Use growth to shrink individual load first. Then expand scope.

5-engineer on-call is survivable. 7-engineer with the same scope is comfortable. Plan for the second, suffer the first.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

TLS Certificate Management Without Tears

Samson Tanimawo — Fri, 22 May 2026 17:59:07 +0000

Expired certificates cause more outages than they should. Every time, the post-mortem says 'we'll monitor expiry dates.' Every time, six months later, someone forgets.

Here's how to actually solve it.

The two rules

Rule 1: Don't manage certs manually. If a human has to remember to renew, the system is broken. Use Let's Encrypt + cert-manager (or your cloud's equivalent) and let the machines handle it.

Rule 2: Monitor expiry as an SLI. 'Days until cert expires' is a metric. Alert at 14 days and at 7 days. Actually page at 3 days.

The gotchas

Certs you didn't know about. Internal services with self-signed certs that someone deployed in 2019 and nobody has touched since. Scan your infrastructure. Inventory everything.

Client certs. mTLS clients can have expired certs too. These are harder to find because they're often distributed across devices.

Third-party APIs. You don't manage their certs, but you break when they expire without notice. Monitor outbound connections with TLS validation turned on.

The renewal that silently fails. Automated renewal fails because of a config change. Nobody notices because nothing changed visibly until the old cert expires. Alert on renewal failures, not just expiry dates.

The quarterly audit

Once a quarter:

List every domain/service that uses TLS
Verify the renewal automation is working
Check monitoring is actually firing (test alert on a staging cert)
Delete certs that belong to services that no longer exist

The emotional truth

Nobody wants to work on cert management. That's why it breaks. Make it someone's explicit quarterly responsibility and reward them for boring success. You'll never have another cert outage.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

DNS: The SRE's Most Underrated Skill

Samson Tanimawo — Thu, 21 May 2026 17:57:49 +0000

I've seen more outages caused by DNS than by code. And it's always the same story: the team shipped, something broke, and three hours into debugging someone said, 'wait, is it DNS?'

It's always DNS.

Why DNS bites SREs specifically

DNS is invisible until it breaks. It caches at every layer (OS, resolver, app, CDN). TTLs are rarely what you expect. And it's usually owned by 'the networking team' who are actually just one guy who left the company in 2022.

The debugging mindset

When something weird happens, especially 'works from my laptop, broken in prod,' check DNS before you check code.

dig +short the hostname from the affected host
Check the TTL: dig HOSTNAME. Short TTL (60s)? Probably fine. Long TTL (86400)? You have a problem during rollout.
Is the resolver returning stale records? Try dig @8.8.8.8 HOSTNAME to bypass local cache.

The 3 DNS setups I've seen break

1. Split-horizon DNS with cached results. Internal resolver returns one IP, external returns another. Your service caches the wrong one. Mysterious connection failures ensue.

2. Short TTL during migration, long TTL in resolver. You set the TTL to 60s for a cutover. Your downstream service's resolver has its own cache that respects the record's initial TTL, which was 86400. Your cutover doesn't propagate for a day.

3. DNS-based health checks with slow propagation. You remove a bad host from DNS. Clients keep hitting it because of cache. Outage continues for the length of the TTL.

The rule

Lower your TTL before you need to. Not during the outage. A long TTL on production records is a loaded gun.

DNS deserves respect. Learn it. Love it. Debug it first.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

The Silent Outage: Monitoring What You Can't See

Samson Tanimawo — Wed, 20 May 2026 17:02:15 +0000

The worst kind of outage is one nobody notices. Your metrics are green. Your dashboards are fine. Your users are quietly getting a broken experience.

I've been burned by three silent outages in my career. Here's how I catch them now.

How silent outages happen

Frontend caching the error. Your API returned a 500. Your CDN cached it. Now all users get the cached error for 10 minutes, but your API health check passes because the CDN never re-asks.

Partial feature breakage. Login works. Checkout works. The search bar silently returns empty results. Your dashboards don't track 'zero-result searches' so you don't see anything wrong.

Stale data pipelines. The data pipeline stopped running 3 hours ago. Your dashboards are showing frozen numbers but the backend looks fine.

What to monitor

Synthetic user journeys from the outside. A test user clicks login, search, checkout every 5 minutes. If any step fails, alert.
Data freshness, not just data availability. Alert on 'last data write > X minutes ago,' not just 'database is up.'
Business metrics, not just tech metrics. 'Checkouts per hour' as an alert. If it drops 50% unexpectedly, something is wrong even if all your infra is green.
Error budget burn rate. Sudden burn rate spike = something silent is happening even if individual alerts aren't firing.

The harder problem

The truly silent outages are the ones where your users go quiet because they've given up on you. No complaints, just churn. You only find out weeks later from a usage graph.

Business metric monitoring is the only defense against this. Treat conversion rate, daily active users, and session length as SLIs.

Your real job isn't to keep the servers up. It's to keep users succeeding. Monitor that.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Why Every SRE Should Learn a Little Rust

Samson Tanimawo — Tue, 19 May 2026 17:01:58 +0000

I'm not saying rewrite your stack in Rust. I'm saying: learn enough to read it.

Here's why, from someone who dragged their feet for years and finally gave in.

Rust is showing up everywhere in infra

Observability: Vector, OpenTelemetry collector components, Tempo, Loki's ingesters in some paths
Proxies: Linkerd's data plane, parts of Istio's future, newer eBPF-based tools
Databases: TiKV, SurrealDB, and bits of Postgres extensions
CLIs: Half the infra tools you install now are Rust binaries

If your stack has observability or networking components, you're going to be reading Rust code in an incident sooner or later. Better to be able to follow along.

You don't need to be fluent

You need to:

Read Rust well enough to follow a function call
Understand what ownership and borrowing mean (you won't debug them, but you need to read the code)
Compile a small program
Read a panic stack trace

That's enough. That's maybe 2 weekends of work.

The side benefit

Rust teaches you to think about state, concurrency, and error handling more carefully. Those skills show up in whatever language you actually write in. I write less broken Python and Go after learning Rust, even though I rarely write Rust.

The starting point

The Rust book (free online) + one small project (I did a log parser). Skip the async stuff for now. Come back to it when you need it.

You're not becoming a Rust engineer. You're becoming an SRE who can read the tools your stack depends on. That's a real edge.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

How We Built Our Own Incident Management System

Samson Tanimawo — Mon, 18 May 2026 17:09:08 +0000

A couple of years ago we built our own incident management system instead of buying one. I'd do it again. Here's why, and the pieces that mattered.

Why not buy?

We looked at PagerDuty, Incident.io, FireHydrant, and a couple of others. Good tools. Each was $40-80/user/month. For 40 engineers, that's $20-40k/year.

The real problem: none of them fit our workflow exactly. We'd pay $30k/year and still have to work around the tool.

What we built

A small Slack-first tool. Total: ~3000 lines of Go. Took one engineer 3 weeks.

Features:

/incident start [title] creates a channel, pings on-call, assigns a commander
/incident update [message] appends to a timeline that gets used in the retro
/incident severity [sev-1..sev-5] routes escalation based on severity
/incident close triggers post-mortem doc auto-generation from the timeline
Integrations with our monitoring, Jira, and status page

That's it. No 50-feature bloat.

What we skipped

Most of the fancy features in commercial tools go unused. We skipped:

Custom roles and permissions
Auto-generated stakeholder updates (we write them by hand better)
Post-mortem templates beyond the one we chose
Runbook hosting (we use our docs repo)

Would I buy instead today?

If you're under 50 engineers, probably yes buy. Your engineering time is more valuable than the tool cost.

If you're bigger and have specific workflow needs, build. A focused in-house tool beats a feature-bloated commercial one every time.

The worst option is buying a tool and then fighting it. Pick the fit, not the feature list.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

The Role of Platform Engineering in a Startup

Samson Tanimawo — Sun, 17 May 2026 17:08:42 +0000

Platform engineering sounds like a big-company thing. But I think every startup past 20 engineers needs a small platform function. Here's why.

The problem platform engineering solves

At 5 engineers, everybody knows how to deploy. At 20, they don't. People start copy-pasting deployment configs, breaking things, and asking the same questions in Slack every day.

You need one person who owns 'how we ship software,' even part-time. That person is a platform engineer whether you call them that or not.

What a startup platform function does

1. Owns the deployment path. One golden path from git push to production. Documented, maintained, and defended from one-off exceptions.

2. Owns the dev environment. Laptop setup, local testing, shared services. New hire productive in days, not weeks.

3. Owns the shared services. Auth, logging, tracing, secrets management. Used by everyone, owned by no one until you assign it.

4. Owns the developer experience. CI speed. Local/prod parity. Error messages. The stuff that's no one's job but costs everyone.

What it doesn't do

It doesn't build a K8s abstraction layer that rivals AWS. Startups can't afford that. Use off-the-shelf. Customize lightly.

When to start

At 15-20 engineers, if you're still asking 'why is my build failing' in Slack 3 times a week, you're ready.

Before 15, just have engineers fix each other's stuff and take turns being the 'ops person.' It's not efficient, but it's cheaper than a full function.

The hire

Hire someone who loves developer experience. Not the best infrastructure engineer on the team the one who genuinely cares about making everyone else faster. That's a different skill set.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Building Dashboards People Actually Use

Samson Tanimawo — Sat, 16 May 2026 17:08:15 +0000

I've built dozens of dashboards. Most have been ignored. A few have been used constantly. The difference isn't the graphs. It's the design.

The 3-second test

A useful dashboard answers 'is everything OK?' in 3 seconds. Not 'let me scroll through 40 graphs to find out.'

Big colored header at the top: green = healthy, yellow = watching, red = broken. That's the 3-second answer. Everything else is drill-down.

The hierarchy rule

Three layers, no more:

Overview one line per service, status color, key SLI
Service detail one dashboard per service, 6-12 graphs max
Deep dive triggered from service detail, domain-specific

Anything beyond 3 layers is 'please get lost in my dashboard tree.'

The on-call test

Imagine you're on-call at 3 AM. You get paged for 'service X is slow.' Can you, in 30 seconds, use this dashboard to tell if the problem is the service itself, its database, its upstream dependency, or its downstream consumers?

If yes, the dashboard works.
If no, redesign.

What to cut

Graphs with no baseline (flat line or spiky forever how do you know if it's bad?)
Metrics you've never used in an actual incident
Vanity metrics (total requests ever)
Graphs where the y-axis is in units nobody understands

The hidden metric

The real measure of a dashboard's value: does the on-call engineer open it before or after the paging tool?

If they open it first it's their compass.
If they open it only after being paged it's a reference, not a dashboard.

Aim for the first.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

SRE Maturity Models: Where Is Your Team?

Samson Tanimawo — Fri, 15 May 2026 17:59:16 +0000

Where is your SRE team on the maturity curve? I've worked with teams at every stage. Here's a rough map.

Stage 0: Reactive

The site goes down, someone scrambles to fix it, the cycle repeats. No on-call rotation. No dashboards. Alerts are emails nobody reads.

Characteristic phrase: 'We'll look at that after launch.'

Stage 1: Foundation

On-call rotation exists. Alerts route to a paging tool. Basic dashboards for CPU, memory, error rate. Post-mortems happen sometimes.

Characteristic phrase: 'Did anyone see that spike last night?'

Stage 2: Measured

SLOs defined for critical services. Error budgets tracked. Alert volume is monitored and pruned. Post-mortems are written and reviewed.

Characteristic phrase: 'We're at 80% of our error budget for the quarter.'

Stage 3: Automated

Runbooks exist for top alerts. Toil is measured and reduced. Deployment pipeline has automatic rollback. Chaos engineering is practiced.

Characteristic phrase: 'The auto-rollback caught it.'

Stage 4: Predictive

Anomaly detection catches issues before alerts fire. Capacity planning is data-driven. New services have SLOs and dashboards at launch, not after. AI/ML assists incident response.

Characteristic phrase: 'We caught that before customers noticed.'

Where most teams are

Most teams I've worked with are at Stage 1 or Stage 2, trying to get to Stage 3. The jump from 2 to 3 is the hardest it requires sustained investment with no immediate crisis to justify it.

The trap

Don't try to skip stages. Teams that install ML anomaly detection at Stage 0 just have prettier chaos. Get the foundation right first. Then automate. Then predict.

The highest maturity team I've seen was boring. Almost nothing broke. The engineers had time to work on interesting problems. That's the goal.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

The Art of Writing a Good Post-Mortem

Samson Tanimawo — Thu, 14 May 2026 17:59:00 +0000

A good post-mortem is a piece of technical writing. It should be readable by someone who wasn't there, convey the timeline clearly, and suggest concrete changes.

Most are none of these things. They're a wall of Slack screenshots.

Here's how to write one that people will read.

The structure

1. Summary (3 sentences). What happened. Impact. Duration. That's it.

2. Timeline. Precise times, in UTC. Not 'around 3pm.' 14:02 UTC. Every event gets one line.

3. Impact. Quantified. 'X% of checkout traffic failed between 14:02 and 14:17.' Not 'some users affected.'

4. Root cause. What broke and why. Not who. If the answer is 'human error,' keep going why did the system allow human error to reach production?

5. Action items. Concrete. Owner. Due date. 'Add validation to config pipeline @alice 2 weeks.' Not 'be more careful.'

The tone

Write it like you're explaining to a curious outsider. No inside jokes. No 'as you all know.' Future readers don't know.

Keep it honest. If the cause was something embarrassing, write it down. Post-mortems that hide the ugly parts are worthless to the reader.

The distribution

The best post-mortems get read by people outside the team. Share them. Put them in a searchable archive. Reference them in design reviews. ('We tried this in 2024, see post-mortem 42.')

Institutional memory lives in good post-mortems. Bad ones evaporate the day they're written.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com