<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: wheresthelag</title>
    <description>The latest articles on Forem by wheresthelag (@wheresthelag).</description>
    <link>https://forem.com/wheresthelag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1105786%2Fdd6736f3-da5e-4004-bc79-af8331ea4d0b.png</url>
      <title>Forem: wheresthelag</title>
      <link>https://forem.com/wheresthelag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wheresthelag"/>
    <language>en</language>
    <item>
      <title>Tracing a 2s Latency Spike to a Single SQL Query</title>
      <dc:creator>wheresthelag</dc:creator>
      <pubDate>Wed, 06 May 2026 08:45:05 +0000</pubDate>
      <link>https://forem.com/wheresthelag/tracing-a-2s-latency-spike-to-a-single-sql-query-me2</link>
      <guid>https://forem.com/wheresthelag/tracing-a-2s-latency-spike-to-a-single-sql-query-me2</guid>
      <description>&lt;p&gt;We received an alert around 4 AM indicating that our checkout service's p99 latency had jumped from its usual 50ms to over 2 seconds. There were no errors, CPU usage was normal, and the database appeared healthy. Despite everything looking fine in the logs, users were clearly experiencing delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial Checks&lt;/strong&gt;&lt;br&gt;
We started with the usual suspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application metrics: CPU and memory utilization were stable.&lt;/li&gt;
&lt;li&gt;Database health: PostgreSQL showed no signs of stress.&lt;/li&gt;
&lt;li&gt;Slow query logs: No entries, even with the threshold set to 1 second (how that threshold is configured is sketched after this list).&lt;/li&gt;
&lt;li&gt;Redis/cache layer: Operating as expected.&lt;/li&gt;
&lt;/ul&gt;
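
&lt;p&gt;For reference, PostgreSQL's slow query threshold is controlled by &lt;code&gt;log_min_duration_statement&lt;/code&gt;. A minimal sketch of checking and setting it (the values here are illustrative, not our exact production config):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SHOW log_min_duration_statement;&lt;br&gt;
-- log any statement slower than 1 second&lt;br&gt;
ALTER SYSTEM SET log_min_duration_statement = '1s';&lt;br&gt;
SELECT pg_reload_conf();&lt;/code&gt;&lt;/p&gt;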

&lt;p&gt;&lt;strong&gt;Why Logs Weren’t Enough&lt;/strong&gt;&lt;br&gt;
Our logs provided detailed information about individual events such as HTTP requests, SQL executions, and service interactions. However, they lacked the context needed to understand how time was spent across the entire request lifecycle. Logs answered what happened, but not where the time went.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request Trace Overview&lt;/strong&gt;&lt;br&gt;
To gain better visibility, we examined the transaction trace using opmanager nexus:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Incoming Request (~2.1s total)&lt;br&gt;
├── Auth Service (~120ms)&lt;br&gt;
├── Business Logic (~150ms)&lt;br&gt;
└── Database Call (~35ms execution + ~2.05s data transfer)&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The trace revealed that while the database executed the query quickly, the application thread spent significant time waiting to read the response from the network buffer (&lt;code&gt;SocketInputStream.read()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying the Root Cause&lt;/strong&gt;&lt;br&gt;
The SQL query involved was straightforward:&lt;br&gt;
&lt;code&gt;SELECT * FROM inventory_logs WHERE item_id = ?;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A recent schema update had introduced a JSONB column storing detailed audit information. For frequently updated items, this column had grown to more than 15MB per row. Because of the &lt;code&gt;SELECT *&lt;/code&gt; statement, the application fetched this entire payload, leading to significant network transfer and deserialization overhead.&lt;/p&gt;
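
&lt;p&gt;A quick way to confirm this kind of row bloat is to measure the stored size of the column directly. A minimal sketch, assuming the JSONB column is named &lt;code&gt;audit_data&lt;/code&gt; (the real column name may differ):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT item_id,&lt;br&gt;
       pg_size_pretty(pg_column_size(audit_data)::bigint) AS audit_size&lt;br&gt;
FROM inventory_logs&lt;br&gt;
ORDER BY pg_column_size(audit_data) DESC&lt;br&gt;
LIMIT 10;&lt;/code&gt;&lt;/p&gt;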

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;&lt;br&gt;
We updated the query to retrieve only the necessary columns:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT status, last_updated&lt;br&gt;
FROM inventory_logs&lt;br&gt;
WHERE item_id = ?;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The impact was immediate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database time: Reduced from ~35ms to ~12ms.&lt;/li&gt;
&lt;li&gt;Result processing time: Dropped from ~2.05s to a negligible level.&lt;/li&gt;
&lt;li&gt;End-to-end latency: Improved from ~2.1 seconds to ~12ms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Not all slow requests are caused by slow queries.&lt;/li&gt;
&lt;li&gt;Avoid &lt;code&gt;SELECT *&lt;/code&gt; in production systems.&lt;/li&gt;
&lt;li&gt;Distinguish between query execution and data transfer time (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Logs provide events, not end-to-end context.&lt;/li&gt;
&lt;li&gt;Distributed tracing is essential for accurate root cause analysis.&lt;/li&gt;
&lt;/ol&gt;
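
&lt;p&gt;On the third point: &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; executes the query but discards the result rows, so its reported execution time excludes sending data to the client. Comparing it with the client-observed time of the plain query (for example via psql's &lt;code&gt;\timing&lt;/code&gt;) separates the two. A rough sketch, with an illustrative &lt;code&gt;item_id&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-- server-side execution time only (no rows are sent to the client)&lt;br&gt;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM inventory_logs WHERE item_id = 42;&lt;br&gt;
&lt;br&gt;
-- client-observed time, including network transfer of the rows&lt;br&gt;
\timing on&lt;br&gt;
SELECT * FROM inventory_logs WHERE item_id = 42;&lt;/code&gt;&lt;/p&gt;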

&lt;p&gt;&lt;strong&gt;The Invisible Bottleneck&lt;/strong&gt;&lt;br&gt;
This incident highlighted how performance bottlenecks can arise from subtle changes in data access patterns. Even well-indexed queries can introduce latency if they return more data than necessary. Differentiating between query execution and data transfer is essential for accurate root cause analysis.&lt;/p&gt;

</description>
      <category>apm</category>
      <category>postgres</category>
      <category>webdev</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
