<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Saby_Explain</title>
    <description>The latest articles on Forem by Saby_Explain (@saby_explain).</description>
    <link>https://forem.com/saby_explain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1981092%2Fec36684a-2ba6-4f63-acc6-aaa63c0c047f.png</url>
      <title>Forem: Saby_Explain</title>
      <link>https://forem.com/saby_explain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/saby_explain"/>
    <language>en</language>
    <item>
      <title>Reading Parallel Plans Correctly</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Mon, 23 Dec 2024 13:56:28 +0000</pubDate>
      <link>https://forem.com/saby_explain/reading-parallel-plans-correctly-1ogh</link>
      <guid>https://forem.com/saby_explain/reading-parallel-plans-correctly-1ogh</guid>
      <description>&lt;p&gt;Historically, the PostgreSQL server operating model looks like many independent processes with partially shared memory. Each process serves only &lt;strong&gt;one client connection and handles one query&lt;/strong&gt; at a time; there is no multithreading involved.&lt;br&gt;
That’s why, within each individual process, you won’t find any of the usual hazards of parallel code execution: locks, race conditions, and so on. This makes developing the DBMS itself a pleasant and simple task.&lt;br&gt;
But this very simplicity also brings a significant limitation. Since each process has only one worker thread, it can use &lt;strong&gt;no more than one CPU core to execute a query&lt;/strong&gt;, which means the server's speed depends directly on the frequency and architecture of an individual core.&lt;br&gt;
In today's world, where multicore and multiprocessor systems have stolen the limelight, such behavior is an unacceptable luxury and a waste of resources. For this reason, starting with PostgreSQL 9.6, &lt;strong&gt;a number of operations&lt;/strong&gt; in a query can be parallelized, i.e. &lt;strong&gt;performed by several processes simultaneously&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscxal4hsb7ym7mmms41l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscxal4hsb7ym7mmms41l.png" alt="Image description" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
You can find a number of cases of parallelism described in the article &lt;a href="https://www.percona.com/blog/parallelism-in-postgresql/" rel="noopener noreferrer"&gt;“Parallelism in PostgreSQL”&lt;/a&gt; by Ibrar Ahmed, from where this image is taken.&lt;br&gt;
If you're using one of the latest versions of PostgreSQL, there is a good chance you'll see Parallel… in the plan when dealing with &lt;a href="https://www.percona.com/blog/parallel-queries-in-postgresql/" rel="noopener noreferrer"&gt;Seq Scan, Join, Aggregate, Append&lt;/a&gt;, etc.&lt;br&gt;
Parallelism brings certain complications, but don't worry — we've already studied these peculiarities and have you covered.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As we keep sharing more features of our PostgreSQL query analysis service &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_ParallelPlans_2312" rel="noopener noreferrer"&gt;&lt;strong&gt;Saby Explain&lt;/strong&gt;&lt;/a&gt;, below you’ll find our experience of addressing these challenges.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Time oddities
&lt;/h2&gt;

&lt;p&gt;Let’s have a look at a plan from PostgreSQL 9.6:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wczzr9qumpvdta3dpk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wczzr9qumpvdta3dpk.jpg" alt="Image description" width="800" height="301"&gt;&lt;/a&gt;&lt;br&gt;
Parallel Seq Scan alone accounts for 153.621 ms within its subtree, while Gather with all of its subnodes took only 104.867 ms.&lt;br&gt;
Why does the parent take less time than its children combined?&lt;br&gt;
Let's have a closer look at the Gather node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gather (actual time=0.969..104.867 rows=333333 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=4425
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Workers Launched: 2 shows that two additional processes were started below this node to help the main (leader) process. So everything in the Gather subtree was executed by 3 processes in total.&lt;br&gt;
Now let's see what’s happening in Parallel Seq Scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parallel Seq Scan on tst (actual time=0.024..51.207 rows=111111 loops=3)
  Filter: ((i % 3) = 0)
  Rows Removed by Filter: 222222
  Buffers: shared hit=4425
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;loops=3 indicates that all 3 processes executed this node, and the reported actual time is the average per loop. Each loop took 51.207 ms on average, meaning the server needed 51.207 x 3 = 153.621 milliseconds of CPU time to process this node in total. This is exactly the number we need to get a real picture of what the server was busy with.&lt;br&gt;
Note that to understand the real (wall-clock) execution time, we need to divide the total time by the number of processes involved, workers plus the leader: &lt;strong&gt;[actual time] x [loops] / ([Workers Launched] + 1)&lt;/strong&gt;.&lt;br&gt;
In our example, each process executed only one loop of this node, so &lt;strong&gt;153.621 / 3 = 51.207&lt;/strong&gt;. It's no longer strange that Gather in the parent process appears to finish in less time.&lt;br&gt;
The main takeaway is that you actually need the total execution time of the node (all processes summed up) to find what was keeping your server busy and identify the query part for further optimization.&lt;br&gt;
From this perspective, &lt;strong&gt;explain.depesz.com&lt;/strong&gt;, for example, showing the “averaged real” time, looks less useful for debugging purposes:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8ic6wlj1r3ab2u70mwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8ic6wlj1r3ab2u70mwm.png" alt="Image description" width="800" height="253"&gt;&lt;/a&gt;&lt;br&gt;
Disagree? Feel free to comment below.&lt;/p&gt;
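&lt;p&gt;The arithmetic above is easy to check for yourself. Here's a minimal sketch (plain Python, purely illustrative) using the numbers from our example plan:&lt;/p&gt;

```python
# Figures taken from the Parallel Seq Scan node of the example plan:
# EXPLAIN ANALYZE reports actual time as an average per loop.
actual_time_ms = 51.207   # average per loop
loops = 3                 # leader + 2 workers each ran the node once
workers_launched = 2

# Total CPU time the server spent on this node across all processes.
total_cpu_ms = actual_time_ms * loops

# Approximate wall-clock share per process: divide by workers + leader.
per_process_ms = total_cpu_ms / (workers_launched + 1)

print(round(total_cpu_ms, 3))    # 153.621
print(round(per_process_ms, 3))  # 51.207
```

The same two formulas apply to any parallel node: multiply by loops for the total CPU time, divide by workers plus leader for the per-process share.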
&lt;h2&gt;
  
  
  Gather Merge loses it all
&lt;/h2&gt;

&lt;p&gt;Let’s run the same query with PostgreSQL 10:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi67gnkeiuuacz1knwbi3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi67gnkeiuuacz1knwbi3.jpg" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;br&gt;
Note that we now have the Gather Merge node instead of Gather in the plan. Here is what the manual says about it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the node at the top of the parallel portion of the plan is Gather Merge rather than Gather, it indicates that each process executing the parallel portion of the plan is producing tuples in sorted order, and that the leader is performing an order-preserving merge. In contrast, Gather reads tuples from the workers in whatever order is convenient, destroying any sort order that may have existed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it seems something is rotten in the state of Denmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit (actual time=110.740..113.138 rows=10000 loops=1)
  Buffers: shared hit=888 read=801, temp read=18 written=218
  I/O Timings: read=9.709
  -&amp;gt;  Gather Merge (actual time=110.739..117.654 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=2943 read=1578, temp read=24 written=571
        I/O Timings: read=17.156
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
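&lt;p&gt;Comparing the parent and child counters makes the discrepancy concrete. A quick back-of-the-envelope check (plain Python, just for illustration):&lt;/p&gt;

```python
# Buffer counters taken from the PostgreSQL 10 plan above.
limit_buffers = 888 + 801           # shared hit + read reported by Limit
gather_merge_buffers = 2943 + 1578  # shared hit + read reported by Gather Merge

# Share of buffer statistics that never made it above Gather Merge.
lost_fraction = 1 - limit_buffers / gather_merge_buffers
print(round(lost_fraction, 2))  # 0.63, i.e. roughly 2/3
```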



&lt;p&gt;When the Buffers and I/O Timings attributes were passed up the tree, some data was lost: the parent Limit node reports fewer buffers than its Gather Merge child. We can estimate the loss at about 2/3 of the total, the share generated by the worker processes.&lt;br&gt;
Unfortunately, there is no way to recover this information from the plan itself, which is why we see negative values for the top node. And if we look at the further evolution of this plan in PostgreSQL 12, it remains essentially unchanged, except for per-worker stats added to the Sort node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit (actual time=77.063..80.480 rows=10000 loops=1)
  Buffers: shared hit=1764, temp read=223 written=355
  -&amp;gt;  Gather Merge (actual time=77.060..81.892 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=4519, temp read=575 written=856
        -&amp;gt;  Sort (actual time=72.630..73.252 rows=4278 loops=3)
              Sort Key: i
              Sort Method: external merge  Disk: 1832kB
              Worker 0:  Sort Method: external merge  Disk: 1512kB
              Worker 1:  Sort Method: external merge  Disk: 1248kB
              Buffers: shared hit=4519, temp read=575 written=856
              -&amp;gt;  Parallel Seq Scan on tst (actual time=0.014..44.970 rows=111111 loops=3)
                    Filter: ((i % 3) = 0)
                    Rows Removed by Filter: 222222
                    Buffers: shared hit=4425
Planning Time: 0.142 ms
Execution Time: 83.884 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;There is one more takeaway&lt;/strong&gt; I’d like to leave you with. Do not rely blindly on the data you see above the Gather Merge node. &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_ParallelPlans_2312" rel="noopener noreferrer"&gt;Get Saby Explain&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>postgressql</category>
      <category>query</category>
    </item>
    <item>
      <title>Bringing PostgreSQL Query Issues to Light with Insightful Visuals</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Thu, 19 Dec 2024 09:57:14 +0000</pubDate>
      <link>https://forem.com/saby_explain/bringing-postgresql-query-issues-to-light-with-insightful-visuals-4b5f</link>
      <guid>https://forem.com/saby_explain/bringing-postgresql-query-issues-to-light-with-insightful-visuals-4b5f</guid>
      <description>&lt;p&gt;As we keep sharing more features of our PostgreSQL query analysis service &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912" rel="noopener noreferrer"&gt;Saby Explain&lt;/a&gt;, today we'll give you a quick overview of how to spot issues in large and complex query plans by just taking a quick look at their visualization.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l1jvwb8bjz23j8op90y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l1jvwb8bjz23j8op90y.png" alt="Image description" width="450" height="263"&gt;&lt;/a&gt;&lt;br&gt;
You’ll see that our visualization tools may really come in handy for this. To analyze your plan click &lt;strong&gt;DEMO&lt;/strong&gt; on the &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912" rel="noopener noreferrer"&gt;Saby page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simplified visualization
&lt;/h2&gt;

&lt;p&gt;Reading a query plan as plain text can be quite challenging, even if it's an easy one:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk47palpkorecbvw0okk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzk47palpkorecbvw0okk.png" alt="Image description" width="800" height="367"&gt;&lt;/a&gt;&lt;br&gt;
That’s exactly why our developers normally deal with a simplified version of the plan. It shows all the key data on execution time and buffers used for each node separately, making it easier to pinpoint extreme values:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopdkhrn6s4rg6phv48af.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopdkhrn6s4rg6phv48af.png" alt="Image description" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pie chart
&lt;/h2&gt;

&lt;p&gt;Just figuring out the most critical issues can sometimes be a real hassle, especially with so many nodes to look through and a simplified visualization that spans more than two screens.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkasf1dd8uhupznwrhp5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkasf1dd8uhupznwrhp5r.png" alt="Image description" width="800" height="731"&gt;&lt;/a&gt;&lt;br&gt;
In this case, you can hardly do without a pie chart.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb51z7i8t8m6aowgwtbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb51z7i8t8m6aowgwtbe.png" alt="Image description" width="800" height="390"&gt;&lt;/a&gt;&lt;br&gt;
Glance over the chart to quickly assess the approximate share of resource usage for each node. When hovering over a specific sector, an icon appears to help you locate the corresponding node in the plan’s text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tile view
&lt;/h2&gt;

&lt;p&gt;The pie chart doesn’t give us the whole story. If you need to see the relationship between nodes or identify critical issues, you’d better use a tile visualization.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzql5xj47lwdnb7en2qki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzql5xj47lwdnb7en2qki.png" alt="Image description" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing flow diagram
&lt;/h2&gt;

&lt;p&gt;But neither of the visualization options above shows the full hierarchy of nested &lt;code&gt;CTE&lt;/code&gt;, &lt;code&gt;InitPlan&lt;/code&gt;, and &lt;code&gt;SubPlan&lt;/code&gt; nodes. If you want to get the whole picture, take a look at the diagram:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu0jlxagnvevvirvdwi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu0jlxagnvevvirvdwi3.png" alt="Image description" width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  More metrics? Say no more!
&lt;/h2&gt;

&lt;p&gt;If you examine a query with the &lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; command alone, you'll only see the &lt;strong&gt;time spent&lt;/strong&gt;. But we need more than just that to come to an accurate conclusion.&lt;br&gt;
Here’s the thing: when you run a query on a cold cache, the time you actually see includes the time it takes to fetch data from storage, not just the time spent executing the query itself (even though it might not be obvious).&lt;br&gt;
Building on that, here are a couple of tips to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; command to see how many data pages are being read. This metric doesn’t depend much on the server workload, so you can use it as a reliable benchmark for query optimization.&lt;/li&gt;
&lt;li&gt;To measure the time spent on disk operations, use &lt;code&gt;track_io_timing&lt;/code&gt;.
And since your execution plan shows not just the time spent but also things like buffers and I/O timings, you can easily get to the analysis of all these metrics from any type of diagram. This can sometimes reveal the unpleasant truth that more than half of all the reads go to just one problematic node:
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p8t66toul56q9mhg7gc.png" alt="Image description" width="800" height="380"&gt;
&lt;/li&gt;
&lt;/ul&gt;
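&lt;p&gt;Since buffer counters are reported in 8 kB pages (the default PostgreSQL block size), a tiny helper can translate a Buffers line into bytes. This is just an illustrative sketch, not part of Saby Explain:&lt;/p&gt;

```python
import re

PAGE_SIZE = 8192  # default PostgreSQL block size, in bytes

def buffers_to_bytes(line: str) -> dict:
    """Parse an EXPLAIN 'Buffers:' line into byte counts per counter."""
    return {key: int(pages) * PAGE_SIZE
            for key, pages in re.findall(r"(\w+)=(\d+)", line)}

print(buffers_to_bytes("Buffers: shared hit=4425 read=1578"))
# {'hit': 36249600, 'read': 12926976}
```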

&lt;p&gt;More articles if you're interested in the topic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/saby_explain/reading-postgresql-query-plans-brought-to-a-new-level-5ac2/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912"&gt;Reading PostgreSQL Query Plans Brought to a New Level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/saby_explain/remedy-for-poor-performing-sql-queries-7lf/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912"&gt;Remedy for Poor-Performing SQL Queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/saby_explain/decoding-explain-how-to-uncover-its-hidden-messages-174e/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912"&gt;Decoding EXPLAIN: How to Uncover Its Hidden Messages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_InsightfulVisuals_1912" rel="noopener noreferrer"&gt;&lt;strong&gt;Get Saby Explain&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>sql</category>
      <category>query</category>
    </item>
    <item>
      <title>Reading PostgreSQL Query Plans Brought to a New Level</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Tue, 17 Dec 2024 12:21:10 +0000</pubDate>
      <link>https://forem.com/saby_explain/reading-postgresql-query-plans-brought-to-a-new-level-5ac2</link>
      <guid>https://forem.com/saby_explain/reading-postgresql-query-plans-brought-to-a-new-level-5ac2</guid>
      <description>&lt;p&gt;We've already introduced you to &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_ReadingQueryPlans_1612" rel="noopener noreferrer"&gt;Saby Explain&lt;/a&gt;, an open service designed to help you read and analyze query plans in PostgreSQL, and shared the &lt;a href="https://dev.to/saby_explain/remedy-for-poor-performing-sql-queries-7lf/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_ReadingQueryPlans_1612"&gt;insights on improving poorly performing queries&lt;/a&gt; using Explain recommendations.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk8k0havimmf4uxruk00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk8k0havimmf4uxruk00.png" alt="Image description" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
In this post, we’ll highlight the key features that we believe will greatly improve your experience with our solution. Click DEMO on the Saby page to analyze your plan.&lt;/p&gt;
&lt;h2&gt;
  
  
  All plan formats supported 
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get query plans analyzed using only log data &lt;/strong&gt;&lt;br&gt;
In the developer console, just copy the data as it is, starting from the Query Text line and including leading spaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Query Text: INSERT INTO  dicquery_20200604  VALUES ($1.*) ON CONFLICT (query)
                           DO NOTHING;
        Insert on dicquery_20200604  (cost=0.00..0.05 rows=1 width=52) (actual time=40.376..40.376 rows=0 loops=1)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: dicquery_20200604_pkey
          Tuples Inserted: 1
          Conflicting Tuples: 0
          Buffers: shared hit=9 read=1 dirtied=1
          -&amp;gt;  Result  (cost=0.00..0.05 rows=1 width=52) (actual time=0.001..0.001 rows=1 loops=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, paste it all into the field for query analysis, no refining needed: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwgujbmik4m0rjttli5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwgujbmik4m0rjttli5w.png" alt="Image description" width="800" height="262"&gt;&lt;/a&gt;&lt;br&gt;
Finally, along with the analyzed query plan, you'll find all the query execution details in the Context tab: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu2lkfbvs8401sr1hls5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu2lkfbvs8401sr1hls5.png" alt="Image description" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON and YAML files &lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM pg_class;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
"[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "pg_class",
      "Alias": "pg_class",
      "Startup Cost": 0.00,
      "Total Cost": 1336.20,
      "Plan Rows": 13804,
      "Plan Width": 539,
      "Actual Startup Time": 0.006,
      "Actual Total Time": 1.838,
      "Actual Rows": 10266,
      "Actual Loops": 1,
      "Shared Hit Blocks": 646,
      "Shared Read Blocks": 0,
      "Shared Dirtied Blocks": 0,
      "Shared Written Blocks": 0,
      "Local Hit Blocks": 0,
      "Local Read Blocks": 0,
      "Local Dirtied Blocks": 0,
      "Local Written Blocks": 0,
      "Temp Read Blocks": 0,
      "Temp Written Blocks": 0
    },
    "Planning Time": 5.135,
    "Triggers": [
    ],
    "Execution Time": 2.389
  }
]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't matter whether the data contains outer quotes (the way pgAdmin copies them) or not: you can paste your file's contents into the same field for analysis, and the result won't take long: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8bqr9kxl9eurgdw9u4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8bqr9kxl9eurgdw9u4.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced plan visualization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Planning time vs Execution time &lt;/strong&gt;&lt;br&gt;
Easily track what's taking extra time when you're running the query: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytvwdyhm6olh0vcjm8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytvwdyhm6olh0vcjm8c.png" alt="Image description" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O timing stats &lt;/strong&gt;&lt;br&gt;
Sometimes, even though a plan shows few buffers read or written, it can still take ages to execute.  &lt;br&gt;
We could blame the server disk overload for that, but is it the real root cause? Well, if you want to be sure about this, all you need is &lt;a href="https://postgrespro.com/docs/postgresql/12/runtime-config-statistics" rel="noopener noreferrer"&gt;track_io_timing&lt;/a&gt;. You can find out more about this config parameter in the documentation: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Enables timing of database I/O calls. This parameter is off by default because it will repeatedly query the operating system for the current time, which may cause significant overhead on some platforms. You can use the pg_test_timing tool to measure the overhead of timing on your system. I/O timing information is displayed in pg_stat_database, in the output of EXPLAIN when the BUFFERS option is used, and by pg_stat_statements. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you need to, you can also enable this parameter in a local session using the following command: &lt;br&gt;
&lt;code&gt;SET track_io_timing = TRUE;&lt;/code&gt;&lt;br&gt;
The great thing about our solution is that we can read and visualize IO timing information: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1vzk1uu8pyblsnrlgx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1vzk1uu8pyblsnrlgx0.png" alt="Image description" width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;
So, the total execution time (0.790 ms) breaks down like this: it took 0.718 ms to read one page of data and 0.044 ms to write one. That leaves just 0.028 ms for all the other work. &lt;/p&gt;
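&lt;p&gt;That breakdown is simple arithmetic on the plan's numbers; as a sanity check (illustrative Python):&lt;/p&gt;

```python
# I/O timing breakdown from the example above (all values in ms).
total_ms = 0.790
io_read_ms = 0.718
io_write_ms = 0.044

# Whatever is left after I/O is the time spent on actual query processing.
other_ms = total_ms - io_read_ms - io_write_ms
print(round(other_ms, 3))  # 0.028
```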

&lt;h2&gt;
  
  
  Keeping up with PostgreSQL updates
&lt;/h2&gt;

&lt;p&gt;Over time, we’ve worked hard to keep abreast of PostgreSQL development. Let me focus on the key changes we've made to ensure that our solution remains up-to-date and effective. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buffers usage while planning &lt;/strong&gt;&lt;br&gt;
There's a different way to monitor the resources used during query planning, which has nothing to do with pg_stat_statements. To see how many buffers were used during the planning stage, run the EXPLAIN statement with BUFFERS enabled:&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on pg_class (actual rows=386 loops=1)
   Buffers: shared hit=9 read=4
 Planning Time: 0.782 ms
   Buffers: shared hit=103 read=11
 Execution Time: 0.219 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06zr0df3p9eie60yvbp7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06zr0df3p9eie60yvbp7.png" alt="Image description" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental sorting &lt;/strong&gt;&lt;br&gt;
In cases where you need to sort by multiple keys (k1, k2, k3, ...), the optimizer can now use the fact that the data is already sorted by some of the initial keys (for example, k1 and k2). So, instead of sorting all the data from scratch, the data can be sequentially grouped by the k1 and k2 values, and then “incrementally sorted” by the next key, k3.  &lt;br&gt;
This approach breaks the whole process down into several smaller sorts, which uses less memory. It also lets you see the first results before the entire sort is complete.&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incremental Sort (actual rows=2949857 loops=1)
   Sort Key: ticket_no, passenger_id
   Presorted Key: ticket_no
   Full-sort Groups: 92184 Sort Method: quicksort Memory: avg=31kB peak=31kB
   -&amp;gt;  Index Scan using tickets_pkey on tickets (actual rows=2949857 loops=1)
 Planning Time: 2.137 ms
 Execution Time: 2230.019 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u1opyavevjctimtholh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u1opyavevjctimtholh.png" alt="Image description" width="567" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtvqh02dyprm7h5a9jcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtvqh02dyprm7h5a9jcq.png" alt="Image description" width="274" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  UI/UX improvements 
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Screenshots everywhere!&lt;/strong&gt; &lt;br&gt;
You can quickly take a screenshot of each tab and copy it to the clipboard. The screenshot will capture everything that's on the tab. Just click the corresponding button in the upper-right corner: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6p3nlygqqihujehifev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6p3nlygqqihujehifev.png" alt="Image description" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
This is exactly how we got most of the images for this post. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even more suggestions for query improvement &lt;/strong&gt;&lt;br&gt;
We've not only added more recommendations but also &lt;a href="https://dev.to/saby_explain/remedy-for-poor-performing-sql-queries-7lf/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_ReadingQueryPlans_1612"&gt;gone into more detail about them&lt;/a&gt;. Just follow the link to find out more about your issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleting plans from the archive &lt;/strong&gt;&lt;br&gt;
Some users asked if there's a way to permanently delete plans from the archive, even if they're not published. We've heard you and made this possible: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l3v8g85cgfxkk2i8n28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l3v8g85cgfxkk2i8n28.png" alt="Image description" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any comments or suggestions, please feel free to contact us. &lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>queryplan</category>
      <category>dba</category>
    </item>
    <item>
      <title>Remedy for Poor-Performing SQL Queries</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Tue, 22 Oct 2024 13:53:02 +0000</pubDate>
      <link>https://forem.com/saby_explain/remedy-for-poor-performing-sql-queries-7lf</link>
      <guid>https://forem.com/saby_explain/remedy-for-poor-performing-sql-queries-7lf</guid>
      <description>&lt;p&gt;In previous posts, we introduced you to &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_remedy_2210" rel="noopener noreferrer"&gt;Saby Explain&lt;/a&gt;, a public service for the analysis and visualization of PostgreSQL query plans. Several months after the launch we've reached the milestone of 6,000 usages, but one of the helpful features sometimes falls under the radar — it is the query structure suggestions, which look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ej9uuhxxh61znjuf68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02ej9uuhxxh61znjuf68.png" alt="Image description" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just adhere to the suggestions to boost the efficiency of your queries. Actually, many of the situations that slow down the query and make it resource-hungry &lt;em&gt;are typical and can be detected based on the query plan structure and data&lt;/em&gt;. We’ve worked out such patterns with the reasons and suggestions for improvement explained. They’re meant to free our developers from the mundane task of searching for &lt;em&gt;ways to optimize queries&lt;/em&gt; from scratch.&lt;br&gt;
Let's take a closer look at these typical situations — the issues at the heart of them and the ways they can be solved.&lt;/p&gt;
&lt;h2&gt;
  
  
  1: index “undersorting”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Show the last invoice for the client Bluebell Ltd.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Limit
   -&amp;gt; Sort
      -&amp;gt; Index [Only] Scan [Backward] | Bitmap Heap Scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Widen the index by adding columns for sorting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE tbl AS
SELECT
 generate_series(1, 100000) pk  -- 100K "facts"
, (random() * 1000)::integer fk_cli; -- 1K different foreign keys


CREATE INDEX ON tbl(fk_cli); -- index for foreign key


SELECT
 *
FROM
 tbl
WHERE
 fk_cli = 1 -- selection based on a specific link
ORDER BY
 pk DESC -- only one "last" record required
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6uj2ifs9wp9uqcgo80h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6uj2ifs9wp9uqcgo80h.png" alt="Image description" width="800" height="207"&gt;&lt;/a&gt;&lt;br&gt;
Click &lt;strong&gt;DEMO&lt;/strong&gt; on the &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_remedy_2210" rel="noopener noreferrer"&gt;Saby Explain page&lt;/a&gt;.&lt;br&gt;
You can easily observe that more than 100 records were read based on the index, and then all those records were sorted and only one record was left. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP INDEX tbl_fk_cli_idx;
CREATE INDEX ON tbl(fk_cli, pk DESC); -- sorting key added
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31jcbu7pttp28b2i8eik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31jcbu7pttp28b2i8eik.png" alt="Image description" width="800" height="151"&gt;&lt;/a&gt;&lt;br&gt;
Even for this simple data selection, the query &lt;em&gt;becomes 8.5 times faster with 33 times fewer reads&lt;/em&gt;. And the effect will be even more pronounced if you have more “facts” for every fk value.&lt;br&gt;
Note that such a “prefix” index will also work for other queries that filter by fk without sorting by pk. Moreover, it will properly &lt;em&gt;support an explicit foreign key&lt;/em&gt; for this column.&lt;/p&gt;
&lt;h2&gt;
  
  
  2: index intersection (BitmapAnd)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Show all contracts for the client Bluebell Ltd. concluded on behalf of Buttercup LLC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; BitmapAnd
   -&amp;gt; Bitmap Index Scan
   -&amp;gt; Bitmap Index Scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Create a composite index based on columns of both source indexes or widen one of the existing indexes by adding columns from the second one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE tbl AS
SELECT
 generate_series(1, 100000) pk      -- 100K "facts"
, (random() *  100)::integer fk_org  -- 100 different foreign keys
, (random() * 1000)::integer fk_cli; -- 1K different foreign keys

CREATE INDEX ON tbl(fk_org); -- index for foreign key
CREATE INDEX ON tbl(fk_cli); -- index for foreign key

SELECT
 *
FROM
 tbl
WHERE
 (fk_org, fk_cli) = (1, 999); -- selection based on a specific pair
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F499rs7v67n5vndvhzyjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F499rs7v67n5vndvhzyjo.png" alt="Image description" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP INDEX tbl_fk_org_idx;
CREATE INDEX ON tbl(fk_org, fk_cli);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k79quyaxc46j4o2k02t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k79quyaxc46j4o2k02t.png" alt="Image description" width="800" height="148"&gt;&lt;/a&gt;&lt;br&gt;
The win is not as great here because Bitmap Heap Scan is quite efficient on its own. Still, we get a &lt;em&gt;7 times faster query with 2.5 times fewer reads&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  3: index combination (BitmapOr)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Show the first 20 oldest tickets to be processed, either assigned to me or unallocated, with the tickets assigned to me taking priority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; BitmapOr
   -&amp;gt; Bitmap Index Scan
   -&amp;gt; Bitmap Index Scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Use &lt;code&gt;UNION [ALL]&lt;/code&gt; to join subqueries, one for each of the OR-condition blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE tbl AS
SELECT
 generate_series(1, 100000) pk  -- 100K "facts"
, CASE
   WHEN random() &amp;lt; 1::real/16 THEN NULL -- with the probability of 1:16, the record is unallocated
   ELSE (random() * 100)::integer -- 100 different foreign keys
 END fk_own;


CREATE INDEX ON tbl(fk_own, pk); -- index with sorting that "seems to be suitable"


SELECT
 *
FROM
 tbl
WHERE
 fk_own = 1 OR -- assigned to me
 fk_own IS NULL -- ... or unallocated
ORDER BY
 pk
, (fk_own = 1) DESC -- assigned to me first
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7l7r5g3s4dtpz3njaza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7l7r5g3s4dtpz3njaza.png" alt="Image description" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(
 SELECT
   *
 FROM
   tbl
 WHERE
   fk_own = 1 -- first, 20 tickets assigned to me
 ORDER BY
   pk
 LIMIT 20
)
UNION ALL
(
 SELECT
   *
 FROM
   tbl
 WHERE
   fk_own IS NULL -- then, 20 unallocated tickets
 ORDER BY
   pk
 LIMIT 20
)
LIMIT 20; -- but the total quantity is 20, and we don’t need more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38p7vz4d6iwvq4dqc6o9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38p7vz4d6iwvq4dqc6o9.png" alt="Image description" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We benefited from the fact that all 20 records we needed were retrieved by the first block, so the second block, with its more resource-consuming Bitmap Heap Scan, was never run at all. As a result, we got a &lt;em&gt;22 times faster query with 44 times fewer reads&lt;/em&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  4: excessive reading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
As a rule, this antipattern appears when you add another filter to an existing query. For example, you may want to modify the task described above and get the first 20 of the oldest and most critical tickets to be processed, no matter whether they are assigned to somebody or unallocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Seq Scan | Bitmap Heap Scan | Index [Only] Scan [Backward]
   &amp;amp;&amp;amp; 5 × rows &amp;lt; RRbF -- &amp;gt;80% of the read data is filtered out
   &amp;amp;&amp;amp; loops × RRbF &amp;gt; 100 -- and the total number of records is over 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Create a [more] specific partial index with a &lt;code&gt;WHERE&lt;/code&gt; condition, or add more columns to the index. If the filtering condition is “static” for your tasks, i.e. the list of its values won't grow in the future, a partial &lt;code&gt;WHERE&lt;/code&gt; index is the better choice. It is perfect for various boolean/enum statuses.&lt;br&gt;
If the filtering condition may take on different values, it's better to widen the index by adding those columns, as in the BitmapAnd case described above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE tbl AS
SELECT
 generate_series(1, 100000) pk -- 100K "facts"
, CASE
   WHEN random() &amp;lt; 1::real/16 THEN NULL
   ELSE (random() * 100)::integer -- 100 different foreign keys
 END fk_own
, (random() &amp;lt; 1::real/50) critical; -- with the probability of 1:50, the ticket is "critical"

CREATE INDEX ON tbl(pk);
CREATE INDEX ON tbl(fk_own, pk);

SELECT
 *
FROM
 tbl
WHERE
 critical
ORDER BY
 pk
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nylbpuubp3o8go17m3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nylbpuubp3o8go17m3h.png" alt="Image description" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX ON tbl(pk)
 WHERE critical; -- a "static" filtering condition added
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpirdz9bfa4xe69ztui0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpirdz9bfa4xe69ztui0.png" alt="Image description" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the filtering step is gone from the plan, and the query &lt;em&gt;becomes 5 times faster&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5: sparse table
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Various attempts to implement a queue for task processing, where frequent updates/deletions leave many “dead” records behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Seq Scan | Bitmap Heap Scan | Index [Only] Scan [Backward]
   &amp;amp;&amp;amp; loops × (rows + RRbF) &amp;lt; (shared hit + shared read) × 8
      -- over 1KB read for each record
   &amp;amp;&amp;amp; shared hit + shared read &amp;gt; 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Regularly run &lt;code&gt;VACUUM [FULL]&lt;/code&gt; manually, or make autovacuum run often enough by fine-tuning its parameters, including on a per-table basis. &lt;br&gt;
In most cases, such problems result from poorly arranged queries called from business logic. Keep in mind, however, that sometimes even &lt;code&gt;VACUUM FULL&lt;/code&gt; may be of no help.&lt;/p&gt;
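&lt;p&gt;As a sketch, per-table autovacuum tuning might look like this (the table name and the threshold values are purely illustrative, not recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- vacuum the queue table once ~1% of its rows are "dead",
-- instead of the default 20%
ALTER TABLE task_queue SET (
  autovacuum_vacuum_scale_factor = 0.01
, autovacuum_vacuum_threshold = 100
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;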
&lt;h2&gt;
  
  
  6: reading from the “middle” of the index
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Even though not much data has been read, relevant indexes are applied, and no excessive filtering is performed, the number of pages read is still larger than you would expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Index [Only] Scan [Backward]
   &amp;amp;&amp;amp; loops × (rows + RRbF) &amp;lt; (shared hit + shared read) × 8
      -- more than 1 KB is read per each record
   &amp;amp;&amp;amp; shared hit + shared read &amp;gt; 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Carefully examine the structure of the index used and the key columns specified in the query: it’s most likely that some part of the index wasn’t specified. You’ll probably have to create a similar index without prefix columns or learn how to iterate their values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE tbl AS
SELECT
 generate_series(1, 100000) pk      -- 100K "facts"
, (random() *  100)::integer fk_org  -- 100 different foreign keys
, (random() * 1000)::integer fk_cli; -- 1K different foreign keys


CREATE INDEX ON tbl(fk_org, fk_cli); -- everything is almost like in #2
-- only that we’ve considered a separate index for fk_cli unnecessary and deleted it


SELECT
 *
FROM
 tbl
WHERE
 fk_cli = 999 -- and fk_org is not specified, though it appears in the index earlier
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3gn9dj81feobolf6p8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3gn9dj81feobolf6p8t.png" alt="Image description" width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything seems to be OK, and the index was even used successfully, but it is suspicious that reading each of the 20 records required reading 4 pages of data. Isn’t 32 KB per record too much? And the index name &lt;code&gt;tbl_fk_org_fk_cli_idx&lt;/code&gt; gives us food for thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX ON tbl(fk_cli);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqht41vkyuryl1k8azpfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqht41vkyuryl1k8azpfr.png" alt="Image description" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bingo! We got a &lt;em&gt;10 times faster query with 4 times fewer reads&lt;/em&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  7: CTE × CTE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
We filled the query with large CTEs over different tables and then decided to &lt;code&gt;JOIN&lt;/code&gt; them. &lt;br&gt;
The case is relevant for versions before PostgreSQL 12 or for queries using &lt;code&gt;WITH ... AS MATERIALIZED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; CTE Scan
   &amp;amp;&amp;amp; loops &amp;gt; 10
   &amp;amp;&amp;amp; loops × (rows + RRbF) &amp;gt; 10000
      -- the Cartesian product for the CTEs used is too big
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Analyze the query carefully and decide whether you need the CTEs here at all. If you do, replace the repeated CTE scans with dictionary lookups in hstore/json.&lt;/p&gt;
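&lt;p&gt;A minimal sketch of the “dictionary” approach (all table and column names here are hypothetical): collapse the small CTE into a single jsonb value once, then do key lookups instead of scanning the CTE for every row:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH dict AS MATERIALIZED (
  SELECT jsonb_object_agg(id::text, name) js -- the whole small table as one "dictionary"
  FROM small_tbl
)
SELECT
  t.*
, d.js -&amp;gt;&amp;gt; t.ref_id::text ref_name -- key lookup instead of CTE Scan × CTE Scan
FROM
  big_tbl t
, dict d;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;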
&lt;h2&gt;
  
  
  8: swapping to disk (temp written)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
Simultaneous processing (sorting or getting unique values) of a large number of records requires more memory than allocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; *
   &amp;amp;&amp;amp; temp written &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
If the memory used by the operation only slightly exceeds the &lt;code&gt;work_mem&lt;/code&gt; setting, it is reasonable to raise that value. You can do it either in the configuration file for all queries or via &lt;code&gt;SET [LOCAL]&lt;/code&gt; for a specific query/transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHOW work_mem;
-- "16MB"

SELECT
 random()
FROM
 generate_series(1, 1000000)
ORDER BY
 1; -- sort by the only column, random()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg06763u0pndw9h2qtvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg06763u0pndw9h2qtvj.png" alt="Image description" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET work_mem = '128MB'; -- before running the query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcyseq85l56wucwzx10u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcyseq85l56wucwzx10u.png" alt="Image description" width="800" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For obvious reasons, the query runs much faster if it stays in memory and never touches the disk; it also takes some load off the disk. Keep in mind, though, that you can’t always allocate a lot of memory: there won’t be enough for every concurrent query.&lt;/p&gt;

&lt;h2&gt;
  
  
  9: outdated stats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
A lot of data has been added to the database at once, but we haven’t run &lt;code&gt;ANALYZE&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Seq Scan | Bitmap Heap Scan | Index [Only] Scan [Backward]
   &amp;amp;&amp;amp; ratio &amp;gt;&amp;gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Do run &lt;code&gt;ANALYZE&lt;/code&gt;.&lt;/p&gt;
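&lt;p&gt;A minimal sketch (the table name is illustrative): refresh the statistics right after a bulk load so the planner sees the new row counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO tbl
SELECT generate_series(1, 1000000); -- 1M rows added at once
ANALYZE tbl; -- refresh the planner statistics right away
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;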
&lt;h2&gt;
  
  
  10: “something went wrong”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When it takes place&lt;/strong&gt;&lt;br&gt;
A lock wait has occurred due to a concurrent query, or CPU/hypervisor hardware resources are in short supply. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to recognize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; *
   &amp;amp;&amp;amp; (shared hit / 8K) + (shared read / 1K) &amp;lt; time / 1000
      -- RAM hit = 64MB/s, HDD read = 8MB/s
   &amp;amp;&amp;amp; time &amp;gt; 100ms -- not much was read but it took a lot of time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommendations&lt;/strong&gt;&lt;br&gt;
Use an external system for monitoring the server to detect locks or abnormal consumption of resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mt8v4suzg8rekq1svmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mt8v4suzg8rekq1svmr.png" alt="Image description" width="616" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaknaue2lualdmv9m50z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaknaue2lualdmv9m50z.png" alt="Image description" width="671" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>dba</category>
      <category>explain</category>
    </item>
    <item>
      <title>Best Practices for Bulk Optimization of Queries in PostgreSQL</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Mon, 21 Oct 2024 12:01:58 +0000</pubDate>
      <link>https://forem.com/saby_explain/best-practices-for-bulk-optimization-of-queries-in-postgresql-5hi0</link>
      <guid>https://forem.com/saby_explain/best-practices-for-bulk-optimization-of-queries-in-postgresql-5hi0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;After reading this article, you’ll learn about the techniques we use to analyze SQL query performance when dealing with millions of queries per day and monitoring hundreds of PostgreSQL servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll talk about &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_bestpractices_2110" rel="noopener noreferrer"&gt;Saby Explain&lt;/a&gt;, a tool that helps us handle this amount of data and make life much easier for the average developer. &lt;br&gt;
SQL is a declarative programming language where you basically describe what you want to achieve, rather than how it should be done. DBMSs are better at working out how to join tables, which conditions to apply, and whether to use indexes…&lt;br&gt;
Some DBMSs accept hints, such as "Join these two tables in this particular order", but this doesn’t work in PostgreSQL. The core developers believe it’s better to keep improving the query planner than to allow such hints.&lt;br&gt;
While PostgreSQL doesn’t offer this kind of external control, you can still see what's going on internally while queries run and when issues come up. What are the most common issues that developers bring to DBAs? Usually queries running slowly, operations hanging, and so on. In other words, queries that aren't working properly.&lt;br&gt;
The reasons are pretty much always the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inefficient query algorithms.&lt;/strong&gt; When developers attempt to join 10 tables at once in SQL, hoping that all their conditions will be resolved efficiently and the result won't take long, they need to remember that miracles don't exist. For a query of that complexity (10 tables in a single FROM clause), the planner's cost estimates will inevitably be off.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outdated stats.&lt;/strong&gt; This is mostly applicable to PostgreSQL. If you upload a large dataset to the server and then run a query, PostgreSQL might do a sequential scan on the table because it’s only aware of 10 records as of yesterday, but now there are 10 million records. In this case, we need to let PostgreSQL know about this change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shortage of resources.&lt;/strong&gt; You’ve got a large, heavy, and loaded database hosted on a weak server with insufficient disk space, RAM, and CPU power. And that's it—there's the performance limit you just can't go beyond.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Locked queries.&lt;/strong&gt; It’s a complex issue which is especially relevant for various modifying queries, such as INSERT, UPDATE, and DELETE.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
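&lt;p&gt;For the stale-statistics case above, the fix is simply to refresh the planner's statistics. A minimal sketch (the table name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- refresh statistics for one table after a bulk load
ANALYZE big_table;

-- or refresh statistics for every table in the current database
ANALYZE;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Autovacuum normally keeps statistics current in the background, but after a massive one-off load it's worth running ANALYZE by hand before querying the new data.&lt;/p&gt;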

&lt;h2&gt;
  
  
  Getting a query plan
&lt;/h2&gt;

&lt;p&gt;To resolve all the other issues, we need to look at the query plans. That will give us a clear picture of what’s going on inside the server.&lt;br&gt;
In PostgreSQL, a query execution plan is basically a tree-structured algorithm that shows how a query is executed. The query plan is displayed only for the algorithm which the optimizer considers the most efficient.&lt;br&gt;
Each node in the tree is an operation, such as fetching data from a table or index, creating a bitmap, joining two tables, or merging, overlapping, or excluding result sets. The whole query execution means going through all the tree nodes in the plan.&lt;br&gt;
You can quickly view the query execution plan using the EXPLAIN statement. If you need a detailed plan with actual execution statistics, use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS) SELECT...&lt;/code&gt;&lt;br&gt;
One of the drawbacks is that the EXPLAIN plan is based on the current state of the database, so it’s only useful for local debugging.&lt;br&gt;
Imagine that you're working on a high-load server with lots of data changes and notice a slow query. By the time you retrieve the query from log files and run it again, the dataset and statistics will be completely different. So when you finally execute the query for debugging, it might run quickly this time, which makes you wonder why it was slow before.&lt;br&gt;
It was a clever idea to create the auto_explain module that gives us the ability to see what’s going on at the exact moment when the query is executed on the server. This module is included in almost all popular PostgreSQL distributions and can be easily enabled in the configuration file.&lt;br&gt;
When the module detects that a query is running longer than the limit you’ve set, it captures and logs the data of this query plan.&lt;br&gt;
Well, now that the job is done, we open the log file only to find… a long wall of plain text that, at a glance, tells us nothing about the query plan except its execution time.&lt;br&gt;
Interpreting plans this way is neither informative nor convenient, and there are other issues on top of that.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A node reports only the &lt;strong&gt;total resources of its whole subtree&lt;/strong&gt;. What does that mean for us? That we can't see the time spent on a specific operation, say an index scan, if it has anything nested inside. We have to check whether the node has child operations or CTEs underneath and, if it does, subtract their time from the node's total ourselves.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One more issue: the time shown for a node is the &lt;strong&gt;time of a single execution of that node&lt;/strong&gt;. If the node runs inside a loop, the plan shows an increased number of loops, but the per-execution time stays the same. So, to get the total time for the node, you have to multiply the per-loop time by the number of loops. And all these calculations are supposed to happen in your head.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
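&lt;p&gt;The second pitfall can be sketched with a hypothetical plan fragment (all numbers are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Nested Loop  (actual time=0.050..2.400 rows=100 loops=1)
  -&amp;gt;  Seq Scan on a  (actual time=0.010..0.080 rows=100 loops=1)
  -&amp;gt;  Index Scan using b_pkey on b  (actual time=0.011..0.011 rows=1 loops=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Index Scan line shows 0.011 ms, but with &lt;code&gt;loops=100&lt;/code&gt; its real contribution is roughly 0.011 × 100 ≈ 1.1 ms, almost half of the total, even though the plan never prints that number anywhere.&lt;/p&gt;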

&lt;p&gt;This way, we have almost no chance to detect the weakest points in our plans. As the developers themselves admit in their manual, “Plan reading is an art that requires some experience to master…”.&lt;br&gt;
Just picture it: 1,000 developers. Is it realistic to explain all the details to each of them? Sure, some are quick to grasp things, while others just can't figure it out. Either way, they all need this knowledge in their daily work, so how do we pass our experience on to them?&lt;/p&gt;
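&lt;p&gt;Enabling the auto_explain module mentioned above is a small configuration change. A minimal sketch for &lt;code&gt;postgresql.conf&lt;/code&gt; (the threshold value is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# load the module at server start
shared_preload_libraries = 'auto_explain'

# log the plan of any query running longer than this
auto_explain.log_min_duration = '100ms'

# include actual row counts and timings, plus buffer usage
auto_explain.log_analyze = on
auto_explain.log_buffers = on
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that &lt;code&gt;auto_explain.log_analyze&lt;/code&gt; adds timing overhead to every query, not just the slow ones, so the threshold should be chosen with the server's load in mind.&lt;/p&gt;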

&lt;h2&gt;
  
  
  Plan visualization
&lt;/h2&gt;

&lt;p&gt;And that made us realize that if we were going to solve these issues, an insightful plan visualization was a must.&lt;br&gt;
We started to look through what the market could offer, what developers were using, and what solutions were available.&lt;br&gt;
Eventually, we found that there were very few solutions still being actively developed. Honestly, there was just one: Explain Depesz by Hubert Lubaczewski. You paste the text of your plan into a field, and after a bit of processing you get the plan details organized as a table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;individual execution time for each node&lt;/li&gt;
&lt;li&gt;total execution time for the subtree&lt;/li&gt;
&lt;li&gt;actual and estimated numbers of extracted records&lt;/li&gt;
&lt;li&gt;the contents of each node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The service also kept an archive of plans: users could add their plans and share the link with anyone they wanted to discuss the issues with.&lt;br&gt;
It had its drawbacks, of course. First of all, the endless copy and paste: you constantly copy log data and paste it into the service, over and over again.&lt;br&gt;
The solution also lacked any analysis of the amount of data read: there was no information on buffer usage, which is normally displayed by &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;. The system couldn't parse, analyze, or manage such data. When you read a lot of data and need to ensure it's properly allocated across disk and memory cache, this information is crucial.&lt;br&gt;
And, finally, this project was developing really slowly. There were rare commits with minimal updates. Plus, the service is written in Perl, and anyone who has experience with this language will know what I mean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dtlqo5vvgfxk0tq6jh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dtlqo5vvgfxk0tq6jh6.png" alt="Image description" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could have handled these issues but for the main problem that finally made us give up on this system. I'm talking about errors when analyzing CTEs and other types of dynamic nodes, such as InitPlan and SubPlan nodes.&lt;br&gt;
Let’s have a look at the most common case: the execution times of the individual nodes add up to more than the overall execution time of the entire query. The issue is that the time spent generating the &lt;em&gt;CTE wasn't subtracted from the CTE Scan node&lt;/em&gt;, so we don't actually know how long the CTE scan itself took.&lt;br&gt;
So, we realized it was high time to start building our native solution, which our developers were very excited about. We used a typical web services stack: Node.js with Express for the backend, Bootstrap for design, and D3.js for creating chart visualizations. The first prototype was out in about two weeks, with a number of features we really needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;native parser for query plans&lt;/strong&gt; which allows us to parse any PostgreSQL-generated plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;accurate analysis of dynamic nodes&lt;/strong&gt;, such as CTE Scans, InitPlans, and SubPlans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;analysis of buffers usage&lt;/strong&gt; — whether data pages are read from the memory, local cache, or disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;visualization of data&lt;/strong&gt; which makes it easier for us to identify issues by looking at the visual display rather than digging deep into the logged data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4vfftxfr9m1428drrcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4vfftxfr9m1428drrcf.png" alt="Image description" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, we've got a plan where all the key data is highlighted. Our developers normally deal with a simplified version of the plan. With all the numbers parsed out and placed to the left and right, only the first line remains in the middle to indicate the node type: CTE Scan, CTE generation, or Seq Scan.&lt;br&gt;
This simplified visualization of the query plan is what we call the &lt;em&gt;plan template&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcijvaij4fzh68n3c73ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcijvaij4fzh68n3c73ut.png" alt="Image description" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also thought that it would be useful to see the time spent on each node as a share of the overall execution time. We got this done by simply adding the &lt;em&gt;pie chart&lt;/em&gt;.&lt;br&gt;
If we take a look at this node, we can see that the Seq Scan took less than a quarter of the total time, while the remaining three quarters went to the CTE Scan. A quick note about CTE Scans if you use them in your queries: they're not that fast; in fact, they can be even slower than regular table scans.&lt;br&gt;
Charts like this really shine in certain cases, for example, when we look at a specific segment and see that over half of the total time went to a Seq Scan, or that some filter discarded a large share of the records. It's quite handy that you can send this analysis to the developer so they can fix the query.&lt;br&gt;
Of course, there were some hiccups along the way. The first problem we faced was rounding. As we know, the execution time of each node is reported to an accuracy of 1 microsecond. When the number of loops exceeds 1,000, PostgreSQL reports the rounded per-loop time, so multiplying back gives an imprecise total, "somewhere between 0.95 ms and 1.05 ms". That's acceptable at microsecond scale, but not at milliseconds, so we have to keep it in mind when analyzing the resource usage of each individual node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faumbrf4c1jo0eg1d4132.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faumbrf4c1jo0eg1d4132.png" alt="Image description" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next problem is much more complicated, as it concerns the usage of buffers across nodes. It took us another 4 weeks on top of the initial 2 to sort it out.&lt;br&gt;
It's actually quite a typical situation: you create a CTE and then read from it. PostgreSQL doesn't read anything until the query actually pulls rows from the CTE; here we fetch the 1st record from the CTE, and then the 101st record from the same CTE.&lt;/p&gt;
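&lt;p&gt;A toy query in the spirit of what's described above, reading the 1st and then the 101st record from the same CTE (the column choice is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;WITH cte AS (
  SELECT oid, relname FROM pg_class
)
SELECT (SELECT relname FROM cte LIMIT 1)            AS first_rec,
       (SELECT relname FROM cte OFFSET 100 LIMIT 1) AS rec_101;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each subquery produces its own CTE Scan node in the plan, while the CTE itself is materialized only once, and that is exactly where the buffer accounting gets confusing.&lt;/p&gt;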

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffitbt6ygdt56nz88u4xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffitbt6ygdt56nz88u4xi.png" alt="Image description" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then let's take a look at the plan of our query. We can see that 3 pages of buffer data are used in the Seq Scan operation, 1 page in the CTE Scan, and 2 more pages in the second CTE Scan. If we add it all up, we get 6, while only 3 pages are actually read from the table. Note that a CTE Scan doesn't read from the table at all: it only reads the materialized CTE from memory. With all this, we can finally see that the accounting doesn't work the way it's supposed to!&lt;br&gt;
It turns out that only 3 pages of buffers came from the Seq Scan. The 1st page was accounted for in the first CTE Scan, and the next 2 pages in the 2nd CTE Scan. So, in total, 3 pages of data were read, not 6.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5b5k6cbwywmperma07w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5b5k6cbwywmperma07w.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This very view helped us see that plan execution isn't just a tree, but a kind of acyclic graph. As a result, we got a diagram where we could finally put all the pieces together. &lt;br&gt;
Let me show you what we did. We started with a CTE created from pg_class and read from it twice. Most of the time was spent executing the second read. It's clear that reading up to the 101st entry is a lot more expensive than reading just the first one.&lt;br&gt;
Now we could sit back and relax for a while. It was like, “I know Kung Fu”. We brought our experience to life and it was on the screen. Finally, we could use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log consolidation
&lt;/h2&gt;

&lt;p&gt;While our developers breathed a sigh of relief, we did some thinking and realized that with hundreds of active servers, copy and paste wasn't going to work. The only solution was to start collecting this data on our side. Actually, there is a default module that can collect statistics, but it needs to be enabled in the configuration file. I’m talking about the &lt;em&gt;pg_stat_statements module&lt;/em&gt; which for a number of reasons wasn’t a good fit for us.&lt;br&gt;
First of all, it assigns &lt;em&gt;different Query IDs&lt;/em&gt; to the same queries when different schemas are used within the same database. So, if you first run &lt;code&gt;SET search_path = '01'; SELECT * FROM users LIMIT 1;&lt;/code&gt; and then run the same query after &lt;code&gt;SET search_path = '02';&lt;/code&gt;, the module records them as different entries in the statistics. This makes it impossible to collect overall stats for this query profile without taking schemas into account.&lt;br&gt;
The second issue is the lack of plans: the module records only the query itself. We can see what's slowing us down, but not why. And here we come back to the problem of rapidly changing datasets.&lt;br&gt;
And to top it off, this module lacks "facts", which doesn't allow you to address a specific instance of the query execution because we only have access to aggregated statistics. Of course, you can deal with this, but why should you? That’s why this whole copy and paste thing made us start working on our own log collector.&lt;br&gt;
The collector connects to the database server over SSH, authenticated by a certificate, and uses &lt;code&gt;tail -F&lt;/code&gt; to follow the log file in real time. This session gives us a full mirror of the log file generated by the server. The load on the server itself is minimal: we don't parse anything there, we just mirror the traffic.&lt;br&gt;
Since we’d already used Node.js to develop the interface, we decided to keep using this tool to write the collector. This technology proved to be effective because it’s convenient to use JavaScript for loosely-formatted textual data, such as logs. The Node.js environment as a backend platform makes it easy to work with network connections and data streams.&lt;br&gt;
This way, we set up two connections: the first one to follow the log and retrieve its data, and the other one to periodically get data from the database. Imagine that the log file says that the table with OID = 123 is locked, but you have no idea what it could be. It's high time to ask the database about the mysterious OID = 123, or anything else we don't know yet, to find out what it really means.&lt;br&gt;
Right at the start of developing this system, we needed to monitor the 10 servers that were causing us the most trouble and were hardest to manage. But by the end of the first quarter, we were up to a hundred servers because the system really took off and everyone wanted to use it.&lt;br&gt;
Once we crunched the numbers, it became clear that a huge amount of data was flowing through. We went with PostgreSQL for storage because it's what we monitor and what we know best, and the COPY command seemed to be the fastest way to write the data.&lt;br&gt;
But simply piling the data up isn't an option. At 50,000 queries per second across a hundred servers, the logs add up to 100-150 GB per day. That's why we partitioned the data.&lt;br&gt;
The first big step was daily log partitioning because, in most cases, developers weren't interested in the correlation between days. There's no point in looking back at what happened yesterday if we've already released a new version of the app tonight. The stats will be different, you know.&lt;br&gt;
We also got really good at writing using &lt;code&gt;COPY&lt;/code&gt;. It's not just about using &lt;code&gt;COPY&lt;/code&gt; because it's faster than &lt;code&gt;INSERT&lt;/code&gt; but about mastering the skill of COPYing.&lt;br&gt;
The third thing we had to do was to give up on triggers and foreign keys.&lt;br&gt;
In other words, our database has no referential integrity. So, if a table has FKs and you try to insert a record that references other records in the database, PostgreSQL has to check whether these records exist. To do this, it runs &lt;code&gt;SELECT 1 FROM master_fk1_table WHERE ...&lt;/code&gt;, with the identifier you're inserting to make sure the foreign key isn't broken and the insertion is successful.&lt;br&gt;
Instead of inserting just one record with its indexes into the target table, we’re dealing with extra readings from all the tables that it references. We didn't sign up for this as our goal was to record as much data as we could in the shortest time and with the least load possible. Well, it's probably best to get rid of the foreign keys.&lt;br&gt;
We're moving on to aggregation and hashing. First, we implemented these in the database—it was really handy to be able to use a trigger to increment a counter by 1 in a table as soon as a record appeared. It worked well and was convenient, but it had its drawbacks. Inserting a single record also required reading from and writing to related tables. Moreover, you had to do it every time you inserted a record.&lt;br&gt;
Now, imagine you have a table where you simply count the number of queries going through a particular host: &lt;code&gt;+1, +1, +1, ..., +1&lt;/code&gt;. But you don't really need this — you can sum it up in the collector memory and send it to the database in one go, say &lt;code&gt;+10&lt;/code&gt;.&lt;br&gt;
Of course, in case of a failure you can lose logical integrity, but this is an almost unrealistic scenario when you have a reliable server with a battery-backed controller, a transaction log, and a file system journal... The overhead of triggers and FKs just isn't worth that guarantee.&lt;br&gt;
The same goes for hashing. Once a query comes in, you compute a unique ID in the database, save it there, and then everyone can use it. Everything’s fine until at the same time another request to save the same ID comes in, which causes a lock. So, it's better to generate IDs on the client side if you can.&lt;br&gt;
We found that using the MD5 on the text (query, plan, template, etc.) worked really well for us. We compute it on the collector side and save the ready ID into the database. The length of the MD5 and daily partitioning allow us not to worry about possible collisions.&lt;br&gt;
And to make the writing process even faster, we had to modify the procedure itself.&lt;br&gt;
How do we usually write data? We have a dataset, and we split it into several tables, then use COPY—one part into the first table, then the second, then the third... It's inconvenient because we're writing one data stream in three sequential operations. Could be better.&lt;br&gt;
And to make it better, we just need to split these streams to run in parallel with each other. This way, we can have separate streams for errors, queries, templates, locks, and so on, all of which can be written in parallel. To do this, we just need to &lt;em&gt;continuously keep a COPY stream open for each target table&lt;/em&gt;.&lt;br&gt;
Note that the collector must always have a stream available to write to. To make sure the database actually sees this data and nobody gets stuck waiting for it to be written, the COPY operation needs to be interrupted periodically. In our case, the most effective period was around 100 milliseconds: we close the COPY stream and immediately reopen it for the same table. If one stream isn't enough during peaks, we maintain a pool of streams up to a certain limit.&lt;br&gt;
We also found that any aggregation (meaning collecting records into batches) is a bad idea. A typical example of this is doing an &lt;code&gt;INSERT ... VALUES&lt;/code&gt; with 1,000 records. This causes a sudden spike in write activity, which means that all other write operations have to wait.&lt;br&gt;
Just to be on the safe side, it's better not to aggregate or buffer data at all. If buffering to the disk does occur (thanks to the Stream API in Node.js, we can detect this), just pause that connection. Once you get an event that it’s free again, write to it from the accumulated queue. While it’s busy, use the next free connection from the pool and write to it.&lt;br&gt;
Before we switched to this writing approach, we were doing about 4,000 write operations per second; the new method cut that load by a factor of four. With the monitored databases we've added since, incoming data is now up to 100 MB/s. We keep logs for the previous three months, about 10-15 TB in total. In our experience, three months is enough for a developer to track down any issue.&lt;/p&gt;
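&lt;p&gt;The daily partitioning described above can be sketched with declarative partitions (the table layout is hypothetical; the real schema is more involved):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- one parent table, range-partitioned by day
CREATE TABLE query_log (
  dt        date    NOT NULL,
  host_id   integer NOT NULL,
  query_md5 char(32),
  duration  interval
) PARTITION BY RANGE (dt);

-- one child table per day; old days are dropped wholesale
CREATE TABLE query_log_2024_10_21 PARTITION OF query_log
  FOR VALUES FROM ('2024-10-21') TO ('2024-10-22');
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Dropping a whole partition is far cheaper than running a &lt;code&gt;DELETE&lt;/code&gt; over one shared table, which matters when you retain only the last three months of a 10-15 TB log.&lt;/p&gt;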

&lt;h2&gt;
  
  
  Analyzing issues
&lt;/h2&gt;

&lt;p&gt;Of course, it’s great to collect all this data, but what's really important is being able to analyze it. Millions of different plans are created every day and we have to manage them somehow.&lt;br&gt;
With so many plans on hand, the first step is to make their number manageable by structuring the data. We've identified three main things to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Find out where this query came from.&lt;/strong&gt; We need to know which application sent this query: the web interface, the backend, the payment system, or any other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Localize the issue.&lt;/strong&gt; It’s important to identify which server the issue is about. If you have a few servers for one app and one of them suddenly stops working, you need to address the server directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Describe the details of the issue.&lt;/strong&gt; 
To identify the source that sent us a query, we use a built-in method called &lt;code&gt;SET application_name = '{bl-host}:{bl-method}';&lt;/code&gt;. This sets up a session variable and sends the name of the business logic host that made the query and the name of the method or application that initiated it.
Once we've got the source of the query, we need to log this data. To do this, we need to configure the variable &lt;code&gt;log_line_prefix = ' %m [%p:%v] [%d] %r %a'&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, the log file shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time&lt;/li&gt;
&lt;li&gt;process and transaction identifiers&lt;/li&gt;
&lt;li&gt;database name&lt;/li&gt;
&lt;li&gt;IP address of the requester&lt;/li&gt;
&lt;li&gt;method name.&lt;/li&gt;
&lt;/ul&gt;
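&lt;p&gt;With the &lt;code&gt;log_line_prefix&lt;/code&gt; above, a log line might look roughly like this (all values are made up for illustration; &lt;code&gt;%m&lt;/code&gt; is the timestamp, &lt;code&gt;%p:%v&lt;/code&gt; the process and virtual transaction IDs, &lt;code&gt;%d&lt;/code&gt; the database, &lt;code&gt;%r&lt;/code&gt; the remote host, and &lt;code&gt;%a&lt;/code&gt; the application name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; 2024-10-21 12:01:58.123 UTC [4242:7/8123] [appdb] 10.1.2.3(52144) front-host:listUsers LOG:  duration: 1523.456 ms ...
&lt;/code&gt;&lt;/pre&gt;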

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz4mxe5gia49uq9be1ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz4mxe5gia49uq9be1ss.png" alt="Image description" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then realized that it's not actually very helpful to look at the correlations for a single query across different servers. It's pretty unusual for the application to have the same issue on different servers, but if it does happen, you can look at any one of these servers.&lt;br&gt;
So, our new principle &lt;em&gt;one server = one day&lt;/em&gt; proved to be enough for any kind of analysis.&lt;br&gt;
We start with the analysis in the context of the template, which is a simplified version of the query plan with all the numerical values removed. Next, we look at the application or method, and finally, we focus on the specific node of the plan that caused problems.&lt;br&gt;
Once we started using templates instead of particular instances, we noticed a couple of advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Far fewer objects to analyze&lt;/strong&gt;. We used to deal with thousands of queries or plans, but now we only have to deal with a few dozen templates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt;. By summarizing the "facts" within a certain context, we can display their occurrence throughout the day. This helps us notice if a template is too frequent, like once an hour when it should be once a day. If this happens, we need to find out what's going wrong—why it's happening, or whether it should be happening at all. This is another, purely visual, method of analysis that doesn't involve numbers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwvezaoep3ng661ckegw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwvezaoep3ng661ckegw.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other methods of analysis are based on the metrics we can get from the query plan: the number of times the template occurred, the total and average running time, the amount of data read from disk compared to the data read from memory, and so on.&lt;br&gt;
Imagine you visit the analytics page for a host and notice that disk reads are unusually high, to the point that the server's disk is struggling. The next thing to ask is: who is doing all that reading?&lt;br&gt;
You can sort by any column to decide which issue to focus on: CPU load, disk load, or the total number of queries. Once sorted out, we can identify the top issues, fix them, and then deploy a new version of the application.&lt;br&gt;
You can also see different apps that use the same template for queries like &lt;code&gt;SELECT * FROM users WHERE login = 'John'&lt;/code&gt;. This could be anything from a frontend app to a backend app or even a processing app. You may start wondering why the processing layer is reading user data if it isn’t interacting with users.&lt;br&gt;
Another approach is to take a look at how each individual app works by examining its operations. For example, using the timeline, you can see that the frontend app runs certain tasks once an hour. This makes you think that it's not typical for the frontend app to run something once an hour.&lt;br&gt;
After a while, we realized that we needed aggregated statistics for plan nodes. We extracted only those nodes from the plans that interact with the table data (whether they read/write data using an index or not). So, we added one more aspect to the previous analysis: the number of rows each node returned versus the number it filtered out (Rows Removed by Filter).&lt;br&gt;
For example, if you don't have the right index on a table, a query against it falls back to a Seq Scan and filters out every row except the one you need. If you're filtering out 100 million rows a day this way, it's probably worth creating an index to boost performance.&lt;/p&gt;
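&lt;p&gt;The Seq Scan symptom above, and the index that removes it, can be sketched like this (table and column names are illustrative, and the plan fragment in the comments is a mock-up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE login = 'John';
--  Seq Scan on users (actual ... rows=1 loops=1)
--    Filter: (login = 'John')
--    Rows Removed by Filter: 9999999

-- an index turns this into a single-row Index Scan
CREATE INDEX users_login_idx ON users (login);
&lt;/code&gt;&lt;/pre&gt;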

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyok9xj6ykdict90l1gqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyok9xj6ykdict90l1gqc.png" alt="Image description" width="799" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After analyzing all the plan nodes, we noticed some patterns that cause more issues than others. It’d be a big help if we could tell the developers, "Hey, here you're reading by index, then sorting, then filtering", and all of that usually returns just one record.&lt;br&gt;
This pattern is something that developers often come across, for example, when looking up the latest order date for a user. If there's no date index or if the existing index doesn't include the date, you'll run into this issue.&lt;br&gt;
Since we know these patterns are causing problems, we can highlight them to the developers so they can address these issues. Now, when our developers open a plan, they can see exactly what needs to be done and how to fix it.&lt;/p&gt;
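&lt;p&gt;For the "latest order date" pattern, a sketch of the fix (with a hypothetical &lt;code&gt;orders&lt;/code&gt; table) is a composite index that already stores the rows in the order the query needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Lets PostgreSQL read the single newest row directly,
-- instead of index scan + sort + filter:
CREATE INDEX CONCURRENTLY orders_user_date_idx
  ON orders (user_id, order_date DESC);

SELECT order_date
  FROM orders
 WHERE user_id = 42
 ORDER BY order_date DESC
 LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;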

&lt;blockquote&gt;
&lt;p&gt;Now, thanks to the &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_bestpractices_2110" rel="noopener noreferrer"&gt;Saby Explain tool&lt;/a&gt; we've designed, developers with any level of expertise can handle their query issues.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>postgres</category>
      <category>autoexplain</category>
      &lt;category&gt;postgresql&lt;/category&gt;
      <category>node</category>
    </item>
    <item>
      <title>Decoding EXPLAIN: How to Uncover Its Hidden Messages</title>
      <dc:creator>Saby_Explain</dc:creator>
      <pubDate>Thu, 17 Oct 2024 08:15:25 +0000</pubDate>
      <link>https://forem.com/saby_explain/decoding-explain-how-to-uncover-its-hidden-messages-174e</link>
      <guid>https://forem.com/saby_explain/decoding-explain-how-to-uncover-its-hidden-messages-174e</guid>
      <description>&lt;p&gt;This standard question always comes up when developers and business owners consult with database administrators or PostgreSQL experts. It sounds like, &lt;em&gt;“Why do database queries take so long?”&lt;/em&gt; These are the most common &lt;em&gt;whys&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ineffective query algorithms&lt;/strong&gt;&lt;br&gt;
it's probably not a great idea to join CTEs that have thousands of records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;outdated stats&lt;/strong&gt;&lt;br&gt;
the actual content of the tables is very different from the statistics previously collected using the ANALYZE statement&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;shortage of resources&lt;/strong&gt;&lt;br&gt;
the CPU isn't powerful enough, gigabytes of memory are constantly being churned, or the disk can't keep up with the database's demands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;locks held by other processes &lt;strong&gt;may block queries from running&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
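&lt;p&gt;For the "outdated stats" case, a quick sketch (the &lt;code&gt;users&lt;/code&gt; table is hypothetical) is to refresh the statistics manually and check when they were last collected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Re-collect planner statistics for the table:
ANALYZE users;

-- See when statistics were last gathered, manually or by autovacuum:
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname = 'users';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;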

&lt;p&gt;While it's not so easy to spot blocked queries and analyze why they get blocked, all the other issues can be solved with the help of &lt;em&gt;query plans&lt;/em&gt;. You can get these plans using the EXPLAIN statement or the auto_explain module. I'd definitely recommend using EXPLAIN (ANALYZE, BUFFERS) straight away.&lt;/p&gt;
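&lt;p&gt;For reference, here's how you'd request such a plan in a session, plus a sketch of an auto_explain setup for logging slow queries (the 100ms threshold is just an example value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- One-off analysis of a specific query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pg_class WHERE relname = 'pg_class';

-- Automatically log the plans of slow queries in this session:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '100ms';
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;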

&lt;p&gt;Even though the PostgreSQL developers note in their documentation that &lt;em&gt;“Plan reading is an art that requires some experience to master…”&lt;/em&gt;, we can get this job done without much experience as long as we have the right tool. What’s the average query plan like? Let’s take a look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Scan using pg_class_relname_nsp_index on pg_class (actual time=0.049..0.050 rows=1 loops=1)
  Index Cond: (relname = $1)
  Filter: (oid = $0)
  Buffers: shared hit=4
  InitPlan 1 (returns $0,$1)
    -&amp;gt;  Limit (actual time=0.019..0.020 rows=1 loops=1)
          Buffers: shared hit=1
          -&amp;gt;  Seq Scan on pg_class pg_class_1 (actual time=0.015..0.015 rows=1 loops=1)
                Filter: (relkind = 'r'::"char")
                Rows Removed by Filter: 5
                Buffers: shared hit=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the plan may also have this format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Append  (cost=868.60..878.95 rows=2 width=233) (actual time=0.024..0.144 rows=2 loops=1)"
"  Buffers: shared hit=3"
"  CTE cl"
"    -&amp;gt;  Seq Scan on pg_class  (cost=0.00..868.60 rows=9972 width=537) (actual time=0.016..0.042 rows=101 loops=1)"
"          Buffers: shared hit=3"
"  -&amp;gt;  Limit  (cost=0.00..0.10 rows=1 width=233) (actual time=0.023..0.024 rows=1 loops=1)"
"        Buffers: shared hit=1"
"        -&amp;gt;  CTE Scan on cl  (cost=0.00..997.20 rows=9972 width=233) (actual time=0.021..0.021 rows=1 loops=1)"
"              Buffers: shared hit=1"
"  -&amp;gt;  Limit  (cost=10.00..10.10 rows=1 width=233) (actual time=0.117..0.118 rows=1 loops=1)"
"        Buffers: shared hit=2"
"        -&amp;gt;  CTE Scan on cl cl_1  (cost=0.00..997.20 rows=9972 width=233) (actual time=0.001..0.104 rows=101 loops=1)"
"              Buffers: shared hit=2"
"Planning Time: 0.634 ms"
"Execution Time: 0.248 ms"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyway, interpreting the plain-text query plan can be quite tricky:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;each node reports only the total resources of its entire subtree,&lt;br&gt;
which requires you to subtract the children's values from the node's to find out how long a specific node took to execute or how much table data it read from the disk&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the node execution time needs to be multiplied by the number of loops,&lt;br&gt;
and while the subtraction isn't hard to do in your head, the multiplication is: the plan shows only the averaged per-loop execution time, whereas in reality there can be hundreds of very different loop timings&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all this makes it tough to figure out what the “weakest link” is.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
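&lt;p&gt;Here's a quick worked example of both calculations on a toy plan fragment (the numbers are invented for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nested Loop (actual time=0.060..9.200 rows=100 loops=1)
  -&amp;gt;  Seq Scan on a (actual time=0.010..1.200 rows=100 loops=1)
  -&amp;gt;  Index Scan using b_pkey on b (actual time=0.050..0.070 rows=1 loops=100)

-- Index Scan total:   0.070 ms x 100 loops = 7.0 ms (a per-loop average!)
-- Seq Scan total:     1.2 ms x 1 loop      = 1.2 ms
-- Nested Loop itself: 9.2 - 7.0 - 1.2      = 1.0 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;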

&lt;p&gt;After we’d tried to explain the whole thing to our developers, we finally got a clear picture of what it really looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyrd9ethjcadrmrh2y6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyrd9ethjcadrmrh2y6n.png" alt="Image description" width="755" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That was the moment when we realized we needed…&lt;/p&gt;

&lt;h2&gt;
  
  
  Powerful tool
&lt;/h2&gt;

&lt;p&gt;We've added all the key features that help identify issues based on the query and plan data and suggest the best solution. Now we’re ready to share some of our experience with the community.&lt;br&gt;
Excited to introduce our tool, and hope you'll enjoy using it — &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_explaindecoding_1710" rel="noopener noreferrer"&gt;Saby Explain&lt;/a&gt;. You can click &lt;strong&gt;DEMO&lt;/strong&gt; to analyze your plan.&lt;/p&gt;
&lt;h2&gt;
  
  
  Plan visibility
&lt;/h2&gt;

&lt;p&gt;Could you easily interpret the plan that looks like this?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on pg_class (actual time=0.009..1.304 rows=6609 loops=1)
  Buffers: shared hit=263
Planning Time: 0.108 ms
Execution Time: 1.800 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I don’t think so.&lt;/p&gt;

&lt;p&gt;But if the plan were &lt;em&gt;visually structured&lt;/em&gt;, with the key indicators displayed separately, it'd be much easier to get a grasp of the situation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28tvzbf1uh6ggrtxh60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl28tvzbf1uh6ggrtxh60.png" alt="Image description" width="800" height="164"&gt;&lt;/a&gt;&lt;br&gt;
As for more complex plans, a pie chart will come in handy to find out the &lt;em&gt;time spent on the execution of each node&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs70p9bgi2542vnbw2tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs70p9bgi2542vnbw2tm.png" alt="Image description" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And when dealing with large plans, a &lt;em&gt;query processing flowchart&lt;/em&gt; will be a great help:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzwdfokmpqme8cauj3le.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzwdfokmpqme8cauj3le.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flowchart may also help you out in non-standard situations, for example, where an execution plan has multiple roots:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlzmehfjnlv17shc5lw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlzmehfjnlv17shc5lw0.png" alt="Image description" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query structure suggestions
&lt;/h2&gt;

&lt;p&gt;Once we've laid out the query plan structure and identified all the potential issues, we can point out these weak points to developers with a plain-language explanation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlxsquqyuxmhnstk93xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlxsquqyuxmhnstk93xh.png" alt="Image description" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve already collected a number of query plan patterns for which we know effective improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Row-level query profiling
&lt;/h2&gt;

&lt;p&gt;You can also overlay the analyzed plan with the original query to see the execution time for each operator, just like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4u3k2furcnwi0blhodb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4u3k2furcnwi0blhodb.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
or even this way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd51ll1rx4fifact51ly8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd51ll1rx4fifact51ly8.png" alt="Image description" width="423" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using parameters in queries
&lt;/h2&gt;

&lt;p&gt;If you’ve attached not only the query but also its parameters (from the DETAIL log line) to the plan, you can use this data in the following ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;substitute values directly in the query &lt;br&gt;
for immediate execution against your database and for further profiling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;substitute values using the PREPARE/EXECUTE statements&lt;br&gt;
to emulate how the query optimizer plans when the actual parameter values aren't known in advance, for example, when working with partitioned tables&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
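&lt;p&gt;As a sketch (the &lt;code&gt;orders&lt;/code&gt; table and the parameter value are hypothetical), the two approaches look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Suppose the DETAIL log line reported: parameters: $1 = '42'

-- Option 1: substitute the value directly and profile it:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;

-- Option 2: keep the query parameterized to see how the optimizer
-- behaves when the value is bound at execution time:
PREPARE q(integer) AS
SELECT * FROM orders WHERE user_id = $1;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q(42);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;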

&lt;h2&gt;
  
  
  Archive of query plans
&lt;/h2&gt;

&lt;p&gt;We’d love you to add your plans, get them analyzed, and share them with colleagues! Your plans will be &lt;a href="https://explain-postgresql.com/archive/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_explaindecoding_1710" rel="noopener noreferrer"&gt;archived&lt;/a&gt; so you can have a look at them again later.&lt;br&gt;
If you don’t want others to see your plan, remember to select the "&lt;em&gt;Do not save to archive&lt;/em&gt;" checkbox.&lt;br&gt;
In future articles, we'll talk about some of the difficulties that may arise when analyzing plans and their possible solutions.&lt;/p&gt;

&lt;p&gt;More information on the &lt;a href="https://explain.saby.dev/en/?utm_source=devto&amp;amp;utm_medium=foc&amp;amp;utm_campaign=ex_explain&amp;amp;utm_content=devto_explaindecoding_1710" rel="noopener noreferrer"&gt;Explain tool&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>postgres</category>
      <category>explain</category>
      <category>queryplan</category>
    </item>
  </channel>
</rss>
