<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bosun Sogeke</title>
    <description>The latest articles on Forem by Bosun Sogeke (@bosunsogeke).</description>
    <link>https://forem.com/bosunsogeke</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817381%2F182262c3-3d1f-4ceb-809b-de0d57b2d4ff.png</url>
      <title>Forem: Bosun Sogeke</title>
      <link>https://forem.com/bosunsogeke</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bosunsogeke"/>
    <language>en</language>
    <item>
      <title>Incident Debugging in Production Systems (Part 2)</title>
      <dc:creator>Bosun Sogeke</dc:creator>
      <pubDate>Tue, 17 Mar 2026 19:11:37 +0000</pubDate>
      <link>https://forem.com/bosunsogeke/incident-debugging-in-production-systems-part-2-3g3h</link>
      <guid>https://forem.com/bosunsogeke/incident-debugging-in-production-systems-part-2-3g3h</guid>
      <description>&lt;h2&gt;
  
  
  Why Logs Alone Don’t Explain Production Incidents
&lt;/h2&gt;

&lt;p&gt;Logs tell you what happened.&lt;br&gt;
They rarely tell you what matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The False Sense of Confidence
&lt;/h2&gt;

&lt;p&gt;Most engineers are taught:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When something breaks, check the logs&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not wrong, but it’s incomplete, because during a real production incident, logs do not behave like a helpful timeline.&lt;/p&gt;

&lt;p&gt;They behave like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thousands of entries per second&lt;/li&gt;
&lt;li&gt;Repeated noise&lt;/li&gt;
&lt;li&gt;Partial truths&lt;/li&gt;
&lt;li&gt;Missing context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t get clarity, you get volume.&lt;/p&gt;
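&lt;p&gt;One way to turn that volume into something scannable is to collapse near-identical lines into counted templates. A minimal sketch (the log lines and the digit-masking rule are illustrative, not from any particular tool):&lt;/p&gt;

```python
from collections import Counter
import re

def collapse_noise(lines):
    """Group near-identical log lines so repeats become one row with a count.

    Numbers (ids, durations, ports) are masked so entries differing only
    by a value collapse into the same template.
    """
    templates = Counter()
    for line in lines:
        templates[re.sub(r"\d+", "N", line)] += 1
    # Most frequent templates first: the repeated noise floats to the top.
    return templates.most_common()

logs = [
    "TimeoutError: downstream request exceeded 3000ms",
    "TimeoutError: downstream request exceeded 3104ms",
    "TimeoutError: downstream request exceeded 2998ms",
    "Connection pool exhausted (32/32 in use)",
]
for template, count in collapse_noise(logs):
    print(count, template)
```

&lt;p&gt;Three timeout lines become one template with a count of three; the rare line stays visible instead of drowning.&lt;/p&gt;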

&lt;h2&gt;
  
  
  What Logs Actually Are (and What They Aren’t)
&lt;/h2&gt;

&lt;p&gt;Logs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw system outputs&lt;/li&gt;
&lt;li&gt;Event-level signals&lt;/li&gt;
&lt;li&gt;Localised observations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logs are not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Root cause explanations&lt;/li&gt;
&lt;li&gt;System-wide context&lt;/li&gt;
&lt;li&gt;Decision-ready insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gap is where most incident delays happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Scenario (You’ve Probably Seen This)
&lt;/h2&gt;

&lt;p&gt;A production alert fires:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗ API latency spike (p95 &amp;gt; 4s)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You open logs and immediately see:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TimeoutError&lt;/strong&gt;: downstream request exceeded 3000ms&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the natural conclusion is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The downstream service is slow&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But here is what the logs don’t show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was the downstream actually slow?&lt;/li&gt;
&lt;li&gt;Or was it never reached?&lt;/li&gt;
&lt;li&gt;Or were retries amplifying load?&lt;/li&gt;
&lt;li&gt;Or was there a connection pool exhaustion upstream?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The log entry is technically correct but operationally misleading.&lt;/p&gt;
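&lt;p&gt;Those four questions can be made checkable. The metric names and thresholds below are invented for illustration; the point is only that each hypothesis maps to a measurable condition rather than a guess:&lt;/p&gt;

```python
def timeout_hypotheses(metrics):
    """Rank plausible explanations for a timeout from a few system metrics.

    All metric names and thresholds here are illustrative, not a standard.
    """
    hypotheses = []
    if metrics["downstream_p95_ms"] > 3000:
        hypotheses.append("downstream actually slow")
    if metrics["pool_in_use"] >= metrics["pool_size"]:
        hypotheses.append("connection pool exhausted upstream (downstream never reached)")
    if metrics["retry_rate"] > 2 * metrics["base_request_rate"]:
        hypotheses.append("retries amplifying load")
    return hypotheses or ["no obvious match: widen the investigation"]

# A snapshot where the downstream is healthy but the pool is saturated:
snapshot = {
    "downstream_p95_ms": 180,
    "pool_in_use": 32,
    "pool_size": 32,
    "retry_rate": 40,
    "base_request_rate": 50,
}
print(timeout_hypotheses(snapshot))
```

&lt;p&gt;Here the same &lt;em&gt;TimeoutError&lt;/em&gt; log line points at the downstream, while the metrics say the request never left the upstream service.&lt;/p&gt;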

&lt;h2&gt;
  
  
  The Core Problem: Logs Lack Context
&lt;/h2&gt;

&lt;p&gt;Logs operate at the event level.&lt;/p&gt;

&lt;p&gt;Incidents happen at the system level.&lt;/p&gt;

&lt;p&gt;That mismatch is the root of the issue.&lt;/p&gt;

&lt;p&gt;Logs tell you:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This request timed out&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But you need to know:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why is the system behaving this way right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are not the same question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Engineers Get Trapped in Logs
&lt;/h2&gt;

&lt;p&gt;During incidents, engineers often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the first error&lt;/li&gt;
&lt;li&gt;Assume causation&lt;/li&gt;
&lt;li&gt;Follow that thread&lt;/li&gt;
&lt;li&gt;Lose 20–40 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a skill issue; it is a model issue.&lt;/p&gt;

&lt;p&gt;We are trained to debug code.&lt;/p&gt;

&lt;p&gt;But incidents require you to debug systems under stress.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Logs → Signals → Patterns
&lt;/h2&gt;

&lt;p&gt;To actually debug incidents effectively, you need to move up levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Logs (Raw Data)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Individual events&lt;/li&gt;
&lt;li&gt;High volume&lt;/li&gt;
&lt;li&gt;Low context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Signals (Filtered Meaning)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency spikes&lt;/li&gt;
&lt;li&gt;Error rate changes&lt;/li&gt;
&lt;li&gt;Deployment correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Patterns (Recognisable Failure Shapes)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry amplification&lt;/li&gt;
&lt;li&gt;Dependency timeouts&lt;/li&gt;
&lt;li&gt;Queue backlogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logs live at the bottom.&lt;/p&gt;

&lt;p&gt;Decisions happen at the top.&lt;/p&gt;
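&lt;p&gt;Moving from level 1 to level 2 is mechanical: reduce event-level entries to one number per time window. A sketch over synthetic events (the window size and event shape are assumptions for illustration):&lt;/p&gt;

```python
from collections import defaultdict

def error_rate_signal(events, window_s=60):
    """Reduce raw (timestamp, level) log events to one signal per window:
    the fraction of events in that window that were errors."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, level in events:
        bucket = ts // window_s
        totals[bucket] += 1
        if level == "ERROR":
            errors[bucket] += 1
    return {b: errors[b] / totals[b] for b in sorted(totals)}

# Three minutes of synthetic events: the error rate jumps in the last window.
events = [(t, "INFO") for t in range(0, 120)]
events += [(t, "ERROR") for t in range(120, 180, 2)]
events += [(t, "INFO") for t in range(121, 180, 2)]
print(error_rate_signal(events))
```

&lt;p&gt;Hundreds of entries collapse into three numbers, and only one of them matters: the window where the rate changed.&lt;/p&gt;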

&lt;h2&gt;
  
  
  The Shift That Changes Everything
&lt;/h2&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What do the logs say?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What failure pattern does this resemble?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This small shift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces noise chasing&lt;/li&gt;
&lt;li&gt;Improves classification accuracy&lt;/li&gt;
&lt;li&gt;Speeds up triage decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Most Tooling Falls Short
&lt;/h2&gt;

&lt;p&gt;Most observability tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate logs&lt;/li&gt;
&lt;li&gt;Add search&lt;/li&gt;
&lt;li&gt;Add dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they still leave you with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interpretation responsibility during peak pressure&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which is exactly when humans perform worst.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Layer: Structured Judgement
&lt;/h2&gt;

&lt;p&gt;What is needed is a layer that sits above logs and answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What kind of failure is this?&lt;/li&gt;
&lt;li&gt;How confident are we?&lt;/li&gt;
&lt;li&gt;What action should follow?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not raw data.&lt;/p&gt;

&lt;p&gt;Not dashboards.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Judgement.&lt;/strong&gt;&lt;/p&gt;
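&lt;p&gt;As a sketch of what such a judgement layer might emit (the rule set here is invented for illustration; a real system would curate or learn it):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    """The three answers an incident responder actually needs."""
    failure_kind: str   # what kind of failure is this?
    confidence: float   # how confident are we? (0.0 to 1.0)
    action: str         # what action should follow?

def judge(signal):
    # Illustrative rules only; the mapping itself is the product.
    if signal == "retry_storm":
        return Judgement("retry amplification", 0.8, "disable retries, add jitter")
    if signal == "dependency_timeout":
        return Judgement("dependency timeout", 0.7, "check upstream health first")
    return Judgement("unknown", 0.2, "escalate to human triage")

print(judge("retry_storm"))
```

&lt;p&gt;The output is small on purpose: a classification, a confidence, and an action are what a responder can act on at 3am.&lt;/p&gt;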

&lt;h2&gt;
  
  
  How This Connects to the Bigger Picture
&lt;/h2&gt;

&lt;p&gt;This is the model we’ve been building toward:&lt;/p&gt;

&lt;p&gt;Production Incident&lt;br&gt;
        ↓&lt;br&gt;
Incident Engineering Patterns&lt;br&gt;
        ↓&lt;br&gt;
AWS Log Search Recipes&lt;br&gt;
        ↓&lt;br&gt;
ExplainError (structured judgement)&lt;br&gt;
        ↓&lt;br&gt;
Faster decisions&lt;/p&gt;

&lt;p&gt;Logs are just one piece. Without structure, they slow you down; with the right layers, they become powerful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Logs are necessary—but not sufficient&lt;/li&gt;
&lt;li&gt;Errors ≠ root cause&lt;/li&gt;
&lt;li&gt;Context is everything during incidents&lt;/li&gt;
&lt;li&gt;Pattern recognition beats raw log reading&lt;/li&gt;
&lt;li&gt;Decision support is the missing piece&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Next?
&lt;/h2&gt;

&lt;p&gt;In Part 3, I go deeper into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident Engineering Patterns: How to Recognise Failure Before You Debug&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because once you can recognise the pattern, you stop chasing noise entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  If You’re Curious
&lt;/h2&gt;

&lt;p&gt;I am currently building a system that turns raw errors into structured outputs with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence scoring&lt;/li&gt;
&lt;li&gt;Failure classification&lt;/li&gt;
&lt;li&gt;Action signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Live:&lt;br&gt;
&lt;a href="https://bernalo-lab.github.io/explain-error/" rel="noopener noreferrer"&gt;https://bernalo-lab.github.io/explain-error/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 Docs:&lt;br&gt;
&lt;a href="https://explain-error-api.onrender.com/docs/" rel="noopener noreferrer"&gt;https://explain-error-api.onrender.com/docs/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 Dataset (real incidents):&lt;br&gt;
&lt;a href="https://incident-dataset.onrender.com/dataset/" rel="noopener noreferrer"&gt;https://incident-dataset.onrender.com/dataset/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Logs don’t fail you.&lt;/p&gt;

&lt;p&gt;They were never designed to guide decisions.&lt;/p&gt;

&lt;p&gt;📌 &lt;strong&gt;Part of the series: Incident Debugging in Production Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: The 5 Error Patterns Engineers Misclassify During Production Incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: (this post)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>softwareengineering</category>
      <category>observability</category>
    </item>
    <item>
      <title>The 5 Error Patterns Engineers Misclassify During Production Incidents</title>
      <dc:creator>Bosun Sogeke</dc:creator>
      <pubDate>Tue, 10 Mar 2026 19:23:28 +0000</pubDate>
      <link>https://forem.com/bosunsogeke/the-5-error-patterns-engineers-misclassify-during-production-incidents-50jb</link>
      <guid>https://forem.com/bosunsogeke/the-5-error-patterns-engineers-misclassify-during-production-incidents-50jb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifx04798glo3pm04w8qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifx04798glo3pm04w8qw.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Production Incident&lt;br&gt;
        ↓&lt;br&gt;
   Error Appears&lt;br&gt;
        ↓&lt;br&gt;
 Misleading Signal&lt;br&gt;
        ↓&lt;br&gt;
 Investigation&lt;br&gt;
        ↓&lt;br&gt;
  Real Root Cause&lt;/p&gt;

&lt;p&gt;Production incidents rarely fail in the way engineers expect.&lt;br&gt;
The error message often points in the wrong direction.&lt;/p&gt;

&lt;p&gt;During high-pressure debugging sessions, this leads to one of the most common reliability problems in distributed systems:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;error misclassification&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An engineer sees a message that looks like the root cause, reacts quickly, and begins investigating the wrong system.&lt;/p&gt;

&lt;p&gt;Meanwhile the real failure continues spreading.&lt;/p&gt;

&lt;p&gt;After investigating many production incidents across cloud platforms and distributed architectures, certain patterns appear repeatedly.&lt;/p&gt;

&lt;p&gt;In this article, I will explore &lt;strong&gt;five error patterns engineers frequently misinterpret during incidents&lt;/strong&gt; and how to recognise them faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Dependency Failures That Look Like Application Bugs
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes is assuming an error originates from the application itself.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;System.Net.Http.HttpRequestException:&lt;br&gt;
The server returned an invalid or unrecognized response.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At first glance this appears to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;application logic failure&lt;/li&gt;
&lt;li&gt;serialization problem&lt;/li&gt;
&lt;li&gt;malformed API response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In reality, these errors are often caused by dependency outages.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upstream API downtime&lt;/li&gt;
&lt;li&gt;load balancer failures&lt;/li&gt;
&lt;li&gt;service mesh routing problems&lt;/li&gt;
&lt;li&gt;transient network interruptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application simply reports what it received.&lt;br&gt;
The real issue exists &lt;strong&gt;one layer deeper&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Experienced engineers immediately ask:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What upstream dependency could cause this behaviour?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This mindset shift often reduces debugging time dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. HTTP 500 Errors That Aren't Real Failures
&lt;/h2&gt;

&lt;p&gt;HTTP 500 responses look severe.&lt;br&gt;
But in modern distributed systems they are sometimes intentional behaviour.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;circuit breaker protection&lt;/li&gt;
&lt;li&gt;controlled fail-fast responses&lt;/li&gt;
&lt;li&gt;fallback service logic&lt;/li&gt;
&lt;li&gt;rate limiting protection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A system may deliberately return HTTP 500 in order to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevent cascading failures&lt;/li&gt;
&lt;li&gt;shed load&lt;/li&gt;
&lt;li&gt;protect dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engineers investigating incidents often treat these as primary failures, when they are actually protective mechanisms.&lt;/p&gt;

&lt;p&gt;Understanding the architecture behind the system is critical.&lt;/p&gt;

&lt;p&gt;The question becomes:&lt;br&gt;
&lt;em&gt;Is this error the cause — or the system protecting itself?&lt;/em&gt;&lt;/p&gt;
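&lt;p&gt;A circuit breaker in miniature shows why a deliberate 500 is protection rather than failure. This sketch is simplified; real breakers also track a recovery window before allowing traffic through again:&lt;/p&gt;

```python
class CircuitBreaker:
    """After repeated failures, stop calling the dependency and fail fast.

    Thresholds and status codes here are illustrative.
    """
    def __init__(self, threshold=3):
        self.failures = 0
        self.threshold = threshold

    def call(self, fn):
        if self.failures >= self.threshold:
            return 500  # deliberate fail-fast: the 500 protects the dependency
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            return 502  # the attempt was made and genuinely failed

def flaky():
    raise RuntimeError("downstream down")

breaker = CircuitBreaker()
for _ in range(5):
    print(breaker.call(flaky))
```

&lt;p&gt;The first three 502s are real failures; the 500s that follow are the system refusing to pile on. Same status family, opposite meaning.&lt;/p&gt;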

&lt;h2&gt;
  
  
  3. Timeout Errors That Hide the Real Bottleneck
&lt;/h2&gt;

&lt;p&gt;Timeout messages are among the most misleading signals during incidents.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request timed out after 30 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The immediate assumption is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow database&lt;/li&gt;
&lt;li&gt;inefficient query&lt;/li&gt;
&lt;li&gt;overloaded application server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, timeouts often originate from queue congestion or resource exhaustion elsewhere.&lt;/p&gt;

&lt;p&gt;Typical hidden causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;thread pool exhaustion&lt;/li&gt;
&lt;li&gt;dependency latency spikes&lt;/li&gt;
&lt;li&gt;message queue backlog&lt;/li&gt;
&lt;li&gt;retry storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The timeout is simply where the failure becomes visible.&lt;br&gt;
The real problem occurred earlier in the request path.&lt;/p&gt;

&lt;p&gt;When engineers see timeouts during an incident, the real investigation question should be:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happened before the timeout occurred?&lt;/em&gt;&lt;/p&gt;
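&lt;p&gt;One way to answer that question is to record a timestamp at each hop and ask which segment of the request path actually consumed the time. The trace below is invented for illustration:&lt;/p&gt;

```python
def dominant_segment(checkpoints):
    """Given (name, timestamp_s) checkpoints along a request path,
    return the segment where most of the wall time was spent."""
    segments = []
    for i in range(1, len(checkpoints)):
        prev_name, prev_ts = checkpoints[i - 1]
        name, ts = checkpoints[i]
        segments.append((prev_name + " -> " + name, ts - prev_ts))
    return max(segments, key=lambda s: s[1])

# Hypothetical trace: the timeout fired at the DB call, but the wait
# actually accumulated in the queue before the handler ever ran.
trace = [
    ("enqueued", 0.00),
    ("handler_start", 27.40),
    ("db_call", 27.55),
    ("timeout", 30.00),
]
print(dominant_segment(trace))
```

&lt;p&gt;The log would blame the database; the checkpoints show the request spent 27 seconds queued before any work began.&lt;/p&gt;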

&lt;h2&gt;
  
  
  4. Connection Errors That Look Like Network Problems
&lt;/h2&gt;

&lt;p&gt;Errors such as these frequently trigger network investigations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection reset by peer&lt;/li&gt;
&lt;li&gt;Connection refused&lt;/li&gt;
&lt;li&gt;Unexpected EOF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these messages appear to indicate networking issues, they are often symptoms of something else.&lt;/p&gt;

&lt;p&gt;Common hidden causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service crashes&lt;/li&gt;
&lt;li&gt;container restarts&lt;/li&gt;
&lt;li&gt;dependency overload&lt;/li&gt;
&lt;li&gt;load balancer health check failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these scenarios the network behaved correctly.&lt;/p&gt;

&lt;p&gt;The connection was reset because the service stopped responding properly.&lt;/p&gt;

&lt;p&gt;Investigating the network layer first can waste valuable time.&lt;/p&gt;

&lt;p&gt;Instead, engineers should verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;container health&lt;/li&gt;
&lt;li&gt;service restarts&lt;/li&gt;
&lt;li&gt;CPU or memory spikes&lt;/li&gt;
&lt;li&gt;upstream saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The network error is often just the messenger.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Retry Amplification That Looks Like a Traffic Surge
&lt;/h2&gt;

&lt;p&gt;One of the most dangerous patterns in distributed systems is retry amplification.&lt;/p&gt;

&lt;p&gt;Imagine the following scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A dependency becomes slow.&lt;/li&gt;
&lt;li&gt;Clients begin retrying requests.&lt;/li&gt;
&lt;li&gt;Retry traffic multiplies.&lt;/li&gt;
&lt;li&gt;The dependency becomes overloaded.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Soon the system experiences a traffic pattern that looks like a &lt;strong&gt;sudden surge in demand&lt;/strong&gt;, but the traffic is actually &lt;strong&gt;self-generated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This pattern is particularly common in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;microservice architectures&lt;/li&gt;
&lt;li&gt;payment processing systems&lt;/li&gt;
&lt;li&gt;API gateway layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The misleading signal is that monitoring dashboards show:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;traffic spike&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But the root cause is actually:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;retry amplification&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Identifying this pattern quickly can prevent large-scale outages.&lt;/p&gt;
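&lt;p&gt;The amplification is easy to quantify. If every failed attempt is retried and failures are independent with probability p, each logical request generates on average 1 + p + p^2 + ... attempts, up to the retry limit:&lt;/p&gt;

```python
def expected_attempts(p_fail, max_retries):
    """Average attempts per logical request when every failure is retried.

    Attempt k happens only if the previous k-1 attempts all failed, so the
    expectation is the geometric sum 1 + p + p^2 + ..., truncated at the
    retry limit. Assumes independent failures and retry on every failure.
    """
    return sum(p_fail ** k for k in range(max_retries + 1))

# A dependency failing 90% of the time, clients retrying up to 3 times:
# offered load is nearly 3.5x organic traffic. The "surge" is self-made.
print(expected_attempts(0.9, 3))
```

&lt;p&gt;This is why dashboards read the event as a demand spike: the extra traffic is real, it just is not coming from users.&lt;/p&gt;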

&lt;h2&gt;
  
  
  Why Misclassification Happens
&lt;/h2&gt;

&lt;p&gt;During incidents, engineers operate under pressure.&lt;/p&gt;

&lt;p&gt;They must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret logs quickly&lt;/li&gt;
&lt;li&gt;analyse unfamiliar errors&lt;/li&gt;
&lt;li&gt;make rapid decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Human intuition tends to favour the &lt;strong&gt;most obvious explanation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But distributed systems rarely fail in obvious ways.&lt;/p&gt;

&lt;p&gt;Failures often appear &lt;strong&gt;far from their origin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding common misclassification patterns helps engineers avoid chasing the wrong signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Investigation Approach
&lt;/h2&gt;

&lt;p&gt;When encountering a confusing error during an incident, a simple investigation model can help.&lt;/p&gt;

&lt;p&gt;Signal spike&lt;br&gt;
↓&lt;br&gt;
first observable error&lt;br&gt;
↓&lt;br&gt;
dependency investigation&lt;br&gt;
↓&lt;br&gt;
request path tracing&lt;br&gt;
↓&lt;br&gt;
root signal&lt;/p&gt;

&lt;p&gt;Instead of assuming the error message is the cause, engineers should treat it as a clue in a larger system investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The most difficult part of incident debugging is rarely fixing the problem.&lt;/p&gt;

&lt;p&gt;It is finding the correct signal among many misleading ones.&lt;/p&gt;

&lt;p&gt;Errors such as timeouts, connection failures, and HTTP status codes often represent symptoms rather than causes.&lt;/p&gt;

&lt;p&gt;Recognising common misclassification patterns allows engineers to navigate incidents faster and reduce investigation time.&lt;/p&gt;

&lt;p&gt;In the next article of this series, I will explore how engineers investigate AWS CloudWatch logs during production incidents, including practical techniques for locating the first meaningful signal in large log streams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part of the series: Incident Debugging in Production Systems&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>backend</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
