<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Philip McClarence</title>
    <description>The latest articles on Forem by Philip McClarence (@philip_mcclarence_2ef9475).</description>
    <link>https://forem.com/philip_mcclarence_2ef9475</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2690053%2F913499a1-620d-4487-a868-d677f1aca106.png</url>
      <title>Forem: Philip McClarence</title>
      <link>https://forem.com/philip_mcclarence_2ef9475</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/philip_mcclarence_2ef9475"/>
    <language>en</language>
    <item>
      <title>PostgreSQL Join Optimization: Nested Loop, Hash, and Merge</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:00:09 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-join-optimization-nested-loop-hash-and-merge-1cn9</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-join-optimization-nested-loop-hash-and-merge-1cn9</guid>
      <description>&lt;p&gt;PostgreSQL has three join algorithms. The planner picks between them for every join in every query, driven by several things at once: the estimated sizes of the two inputs, whether they arrive already sorted on the join key, the type of join (inner vs left/semi/anti), which operators are &lt;code&gt;mergejoinable&lt;/code&gt; or &lt;code&gt;hashjoinable&lt;/code&gt;, whether a hash table will fit in &lt;code&gt;work_mem&lt;/code&gt;, and the cost parameters that weigh I/O against CPU. Get the decision right and a three-way join across millions of rows runs in tens of milliseconds. Get it wrong — usually by encouraging a Nested Loop on two large unsorted inputs — and the same query takes minutes.&lt;/p&gt;

&lt;p&gt;This article is the third in the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series. We assume the reader can &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;read EXPLAIN output&lt;/a&gt; and is familiar with the &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;indexing vocabulary&lt;/a&gt;. The running dataset is the same Neon Postgres 17.8 database used throughout the series: 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt;, 1,000,000-row &lt;code&gt;sim_bp_order_items&lt;/code&gt;, 200,000-row &lt;code&gt;sim_bp_users&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We'll cover how each of the three join strategies works, when the planner picks each, what indexes each one wants, and how to read multi-way joins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nested Loop — small outer, indexed inner
&lt;/h2&gt;

&lt;p&gt;Nested Loop is the simplest strategy: for each row on the outer side, scan the inner side for matches. Without any index on the inner side, this is a full scan per outer row, O(outer × inner), and catastrophic for two large tables. With an index on the inner side's join key, each "scan" of the inner is a handful of page reads (a btree descent plus a heap fetch for any columns not in the index), so the total cost is &lt;em&gt;outer-rows × random-I/O-per-probe&lt;/em&gt; rather than a quadratic blowup. When the outer side is small and the inner has an index, Nested Loop is nearly unbeatable.&lt;/p&gt;

&lt;p&gt;Here's a three-way join that the planner executes as a tower of Nested Loops. The query is "twenty recent pending orders with the user's email and the items in each order":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price_cents&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit  (cost=1.28..18.06 rows=20 width=41) (actual time=5.098..43.038 rows=20 loops=1)
  Buffers: shared hit=96 read=45
  -&amp;gt;  Nested Loop  (cost=1.28..159741.54 rows=190376 width=41)
        (actual time=5.097..43.027 rows=20 loops=1)
        -&amp;gt;  Nested Loop  (cost=0.85..78058.04 rows=95188 width=33)
              (actual time=2.949..10.878 rows=9 loops=1)
              Inner Unique: true
              -&amp;gt;  Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
                    (cost=0.42..30949.29 rows=100300 width=16)
                    (actual time=1.566..2.610 rows=9 loops=1)
                    Filter: ((o.status)::text = 'pending'::text)
                    Rows Removed by Filter: 44
              -&amp;gt;  Memoize  (cost=0.43..0.55 rows=1 width=25)
                    (actual time=0.916..0.916 rows=1 loops=9)
                    Cache Key: o.user_id
                    Cache Mode: logical
                    Hits: 0  Misses: 9  Evictions: 0  Overflows: 0  Memory Usage: 2kB
                    -&amp;gt;  Index Scan using sim_bp_users_pkey on sim_bp_users u
                          (cost=0.42..0.54 rows=1 width=25)
                          (actual time=0.846..0.846 rows=1 loops=9)
                          Index Cond: (u.user_id = o.user_id)
                          Filter: ((u.status)::text = 'active'::text)
        -&amp;gt;  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
              (cost=0.42..0.83 rows=3 width=12)
              (actual time=2.418..3.567 rows=2 loops=9)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 43.129 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;43 ms for a three-way join across 200k × 500k × 1M rows is good. The plan is a tower of two Nested Loops: the inner one joins orders and users, and the outer one joins that intermediate result with order items. Read it in execution order, starting from the most deeply nested scan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;Index Scan Backward&lt;/code&gt; on &lt;code&gt;sim_bp_orders.created_at&lt;/code&gt; walks the index in reverse — newest first — looking for pending orders. &lt;code&gt;rows=9 loops=1&lt;/code&gt; means the outer driver produced nine orders before the whole pipeline had enough downstream rows to satisfy &lt;code&gt;LIMIT 20&lt;/code&gt;. Forty-four rows were read and filtered as non-pending along the way.&lt;/li&gt;
&lt;li&gt;For each of those nine orders, a &lt;code&gt;Memoize → Index Scan on sim_bp_users_pkey&lt;/code&gt; looks up the user. Memoize is a PostgreSQL 14+ cache that short-circuits the inner scan when the same key appears repeatedly; here the nine orders happen to be from nine different users, so it's effectively nine primary-key lookups with no cache hits.&lt;/li&gt;
&lt;li&gt;For each matching &lt;code&gt;(order, user)&lt;/code&gt; pair, the &lt;code&gt;Index Scan using idx_sim_bp_order_items_order_id&lt;/code&gt; on the inner side of the outer Nested Loop returns about two line items per order (&lt;code&gt;rows=2 loops=9&lt;/code&gt;; EXPLAIN reports the per-loop average, rounded). The &lt;code&gt;LIMIT 20&lt;/code&gt; applies to the final joined row count, so the executor stops as soon as 20 &lt;code&gt;(order, user, item)&lt;/code&gt; tuples have been produced, which is the point where 9 orders × ~2.2 items each ≈ 20 rows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the Nested Loop success case: the outer driver returns a tiny number of rows thanks to the &lt;code&gt;LIMIT&lt;/code&gt; + ordered index, and every inner lookup is an indexed point query. Without the &lt;code&gt;LIMIT&lt;/code&gt;, the planner would likely pick a very different strategy — possibly a Hash Join cascade — because it would have to produce tens of thousands of rows instead of twenty.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Nested Loop failure mode
&lt;/h3&gt;

&lt;p&gt;The same strategy is a disaster when the outer side is large. Consider "count the items across all pending orders," which must process 100,000 pending orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we force the planner to use a Nested Loop (by disabling hash and merge joins), the result is telling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Aggregate (actual time=1621.494..1621.495 rows=1 loops=1)
  Buffers: shared hit=398994 read=2894
  -&amp;gt;  Nested Loop  (actual time=6.422..1606.338 rows=200535 loops=1)
        -&amp;gt;  Index Only Scan on sim_bp_orders o
              (actual time=4.859..123.354 rows=100252 loops=1)
        -&amp;gt;  Index Only Scan on sim_bp_order_items oi
              (actual time=0.013..0.014 rows=2 loops=100252)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 1621.525 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.6 seconds for the same result the planner produces in 1.2 seconds via a Parallel Hash Join (next section). More interestingly, the &lt;code&gt;Buffers&lt;/code&gt; line shows &lt;strong&gt;398,994 pages hit&lt;/strong&gt; — that's from 100,252 inner-index probes, each one re-traversing the btree descent of &lt;code&gt;idx_sim_bp_order_items_order_id&lt;/code&gt;. Many of those probes hit the same upper index pages over and over (that's why it's mostly &lt;code&gt;hit&lt;/code&gt;, not &lt;code&gt;read&lt;/code&gt;), but it's still enormous repeated page traffic that dominates CPU even when the data is fully cached. Under concurrency, other queries would find their own working set evicted from &lt;code&gt;shared_buffers&lt;/code&gt; to make room.&lt;/p&gt;
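
&lt;p&gt;A forced plan like the one above can be reproduced with the planner's &lt;code&gt;enable_*&lt;/code&gt; settings. This is a minimal sketch, assuming the series dataset is loaded; wrapping it in a transaction with &lt;code&gt;SET LOCAL&lt;/code&gt; is our choice here, so the overrides disappear at &lt;code&gt;ROLLBACK&lt;/code&gt; and never leak into other sessions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- Diagnostic only: discourage the two strategies the planner would prefer.
-- These switches penalise the node type heavily rather than strictly forbidding it.
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

ROLLBACK;  -- discards the SET LOCAL overrides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These switches are useful for investigating a plan; they are not meant as a permanent configuration.&lt;/p&gt;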

&lt;p&gt;The MyDBA analyzer rule &lt;code&gt;nested_loop_large&lt;/code&gt; is specifically for this failure mode: it fires when a Nested Loop has &lt;code&gt;Plan Rows &amp;gt; 1000&lt;/code&gt; on the outer side and &lt;code&gt;Plan Rows &amp;gt; 100&lt;/code&gt; on the inner side. At those sizes the Nested Loop is almost always the wrong strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hash Join — larger sides, unsorted input
&lt;/h2&gt;

&lt;p&gt;Hash Join works in two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build phase.&lt;/strong&gt; Read the smaller side in full, building an in-memory hash table keyed by the join column(s). This happens inside the &lt;code&gt;Hash&lt;/code&gt; node you see in the plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probe phase.&lt;/strong&gt; Stream the larger side through the hash table, emitting matched rows as they come.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hash Join doesn't care whether the inputs are sorted, which makes it the fallback when Merge Join isn't available. It wants the build side's hash table to fit in memory (&lt;code&gt;work_mem&lt;/code&gt; × &lt;code&gt;hash_mem_multiplier&lt;/code&gt;); if it doesn't, the join spills: PostgreSQL partitions both sides into batches by the hash of the join key, writes them to temporary files, and processes one pair of batches at a time. Spilling is visible in the plan as &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; on the &lt;code&gt;Hash&lt;/code&gt; or &lt;code&gt;Hash Join&lt;/code&gt; node, and the MyDBA analyzer rule &lt;code&gt;hash_batches_spill&lt;/code&gt; fires on it.&lt;/p&gt;

&lt;p&gt;Here's the same count query the planner actually chose — a Parallel Hash Join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Finalize Aggregate  (actual time=1196.234..1199.894 rows=1 loops=1)
  Buffers: shared hit=3827 read=6356
  -&amp;gt;  Gather (Workers Planned: 2, Workers Launched: 2)
        -&amp;gt;  Partial Aggregate  (actual time=1179.014..1179.016 rows=1 loops=3)
              -&amp;gt;  Parallel Hash Join
                    (actual time=170.554..1143.676 rows=333333 loops=3)
                    Hash Cond: (oi.order_id = o.order_id)
                    -&amp;gt;  Parallel Seq Scan on sim_bp_order_items oi
                          (actual time=1.589..703.241 rows=333333 loops=3)
                    -&amp;gt;  Parallel Hash
                          Buckets: 524288  Batches: 1  Memory Usage: 23712kB
                          -&amp;gt;  Parallel Seq Scan on sim_bp_orders o
                                (actual time=0.009..38.403 rows=166667 loops=3)
                                Filter: ((o.status)::text = 'pending'::text)
 Execution Time: 1199.945 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.2 seconds, 10,183 buffer pages touched — about 40× fewer than the forced Nested Loop. The planner built the hash table from &lt;code&gt;sim_bp_orders&lt;/code&gt; (the smaller filtered side, 100k pending rows) and probed it with &lt;code&gt;sim_bp_order_items&lt;/code&gt;. &lt;code&gt;Batches: 1&lt;/code&gt; means the hash table fit in &lt;code&gt;work_mem&lt;/code&gt; entirely, so there was no spill.&lt;/p&gt;

&lt;p&gt;Note the Parallel Seq Scan on both sides. That is not a planner mistake — when you're going to read every pending row anyway, a sequential scan is cheaper than an indexed scan because it avoids random I/O and plays nicely with read-ahead. Hash Join is perfectly happy to consume an unsorted stream.&lt;/p&gt;

&lt;p&gt;The Parallel Hash Join is a newer variant (PostgreSQL 11+) where workers collaborate to build one shared hash table and then probe it in parallel. Under the hood, &lt;code&gt;Parallel Hash&lt;/code&gt; coordinates the build; each worker contributes to it and then proceeds to scan its share of the probe side. This is why you see &lt;code&gt;Workers Planned: 2, Workers Launched: 2&lt;/code&gt; at the top and three loops in each node (one leader + two workers).&lt;/p&gt;

&lt;h3&gt;
  
  
  When Hash Join is suboptimal
&lt;/h3&gt;

&lt;p&gt;Three cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build side too large.&lt;/strong&gt; If even the smaller input is several times larger than &lt;code&gt;work_mem&lt;/code&gt;, the hash join spills to disk in multiple batches and performance degrades sharply. The fix is either to raise &lt;code&gt;work_mem&lt;/code&gt; (per session, not cluster-wide) or to add an index that makes a different strategy viable. &lt;code&gt;hash_batches_spill&lt;/code&gt; flags this in the analyzer output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probe side is tiny.&lt;/strong&gt; If one input is five rows and the other is fifty million, Nested Loop into an indexed inner is cheaper than building any hash table. PostgreSQL's cost model handles this case correctly most of the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both inputs already sorted.&lt;/strong&gt; If both sides come out of index scans that produce rows in join-key order, Merge Join is usually cheaper because it skips the hash build. The planner generally figures this out on its own when it sees the access paths.&lt;/p&gt;
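
&lt;p&gt;For the first case (build side too large), a per-session &lt;code&gt;work_mem&lt;/code&gt; change can be tried out before committing to it. A sketch, assuming 64MB is an acceptable budget for this one transaction; the value is illustrative, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW work_mem;             -- current per-operation memory budget
SHOW hash_mem_multiplier;  -- hash tables may use work_mem times this factor (PostgreSQL 13+)

BEGIN;
SET LOCAL work_mem = '64MB';  -- illustrative value; applies to this transaction only

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';
-- In the output, check the Hash (or Parallel Hash) node:
-- "Batches: 1" means the hash table stayed in memory; a higher number means it spilled.

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;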

&lt;h2&gt;
  
  
  Merge Join — both sides sorted
&lt;/h2&gt;

&lt;p&gt;Merge Join walks two pre-sorted inputs in parallel, pairing rows with matching keys in a single pass. It's optimal when both inputs are already sorted on the join key — typically because both are served from index scans on the join column, or because the query itself requires an &lt;code&gt;ORDER BY&lt;/code&gt; that aligns with the join key.&lt;/p&gt;

&lt;p&gt;The planner picks Merge Join less often than you might expect, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If one side is small and the other has an index, Nested Loop is usually cheaper per row.&lt;/li&gt;
&lt;li&gt;If neither side is sorted and both are large, Hash Join wins — sorting both sides just to merge them is rarely cost-effective.&lt;/li&gt;
&lt;li&gt;Merge Join's sweet spot is two large pre-sorted streams, which is often a signal that a materialised view or a pre-joined table would be cheaper still.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A canonical Merge Join shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If both tables have indexes on &lt;code&gt;order_id&lt;/code&gt; (they do — the primary key on orders and &lt;code&gt;idx_sim_bp_order_items_order_id&lt;/code&gt;) and the ORDER BY forces ordered output, the planner may produce something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Merge Join
  Merge Cond: (o.order_id = oi.order_id)
  -&amp;gt;  Index Scan using sim_bp_orders_pkey on sim_bp_orders o
  -&amp;gt;  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single pass through both indexes, no hash build, no sort step. When the prerequisites are met, with both sides produced in join-key order, Merge Join is often the cheapest option.&lt;/p&gt;

&lt;p&gt;In practice you'll see Merge Join most often on joins with explicit ordering, or in the middle of larger plans where the planner noticed that an upstream node was already producing sorted output.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the planner chooses
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's planner is cost-based. For each join, it enumerates the plausible strategies (Nested Loop, Hash Join, Merge Join, and each direction for each — which side is inner, which is outer) and picks the lowest-cost option. The cost model incorporates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimated row counts from both sides (crucially — if these are wrong, everything downstream is wrong).&lt;/li&gt;
&lt;li&gt;Whether each side has a useful index on the join column.&lt;/li&gt;
&lt;li&gt;Current &lt;code&gt;work_mem&lt;/code&gt; — the planner knows whether a hash table will fit or whether it'll have to plan a spill.&lt;/li&gt;
&lt;li&gt;Whether inputs are already sorted (from index scans or prior sort nodes).&lt;/li&gt;
&lt;li&gt;The cost parameters: &lt;code&gt;random_page_cost&lt;/code&gt;, &lt;code&gt;seq_page_cost&lt;/code&gt;, &lt;code&gt;cpu_tuple_cost&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The single biggest cause of wrong-strategy joins is &lt;strong&gt;bad row estimates&lt;/strong&gt;. If the planner thinks a side will produce 15 rows and it actually produces 150,000, it might pick a Nested Loop (optimal for 15) when a Hash Join (optimal for 150,000) would be 100× faster. The MyDBA analyzer rule &lt;code&gt;row_estimate_inaccurate&lt;/code&gt; fires when the actual-to-estimated ratio exceeds 10× in either direction, and the fix is almost always &lt;code&gt;ANALYZE&lt;/code&gt; on the affected table, or extended statistics if the bad estimate comes from a correlation the planner doesn't know about.&lt;/p&gt;
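
&lt;p&gt;A quick way to check whether stale statistics are behind a bad estimate is to look at when the table was last analyzed, refresh it, and re-run the plan. A sketch using the series' tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- When were statistics last gathered, and how many rows changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('sim_bp_orders', 'sim_bp_order_items', 'sim_bp_users');

-- Refresh statistics for the table whose estimate looks wrong.
ANALYZE sim_bp_orders;

-- Re-check: compare the estimated "rows=" with the actual "rows=" on each node.
EXPLAIN (ANALYZE)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;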

&lt;p&gt;The second biggest cause is &lt;strong&gt;missing statistics on correlated predicates&lt;/strong&gt;. The planner assumes predicates are independent: if &lt;code&gt;WHERE tenant_id = 7 AND region = 'eu'&lt;/code&gt; actually matches far more rows than &lt;code&gt;P(tenant_id=7) × P(region='eu')&lt;/code&gt; predicts (for example, because tenant 7 operates only in the EU), the planner will underestimate and pick the wrong join strategy. Extended statistics (&lt;code&gt;CREATE STATISTICS ... ON tenant_id, region FROM ...&lt;/code&gt;) are the specific fix.&lt;/p&gt;
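
&lt;p&gt;Spelled out, the extended-statistics fix looks roughly like this. The table name &lt;code&gt;tenant_events&lt;/code&gt; and the statistics name are hypothetical, since the example predicate is not tied to a table in the running dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table: tenant_events(tenant_id, region, ...).
-- The mcv kind requires PostgreSQL 12 or later.
CREATE STATISTICS tenant_events_tenant_region_stats (dependencies, ndistinct, mcv)
    ON tenant_id, region
    FROM tenant_events;

-- Extended statistics are only populated when the table is analyzed.
ANALYZE tenant_events;

-- Confirm the statistics object exists and which kinds were requested.
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxname = 'tenant_events_tenant_region_stats';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;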

&lt;h2&gt;
  
  
  Join order: how PostgreSQL decides what to join first
&lt;/h2&gt;

&lt;p&gt;In a three-way join &lt;code&gt;A ⨝ B ⨝ C&lt;/code&gt;, there are several possible orders: &lt;code&gt;(A ⨝ B) ⨝ C&lt;/code&gt;, &lt;code&gt;A ⨝ (B ⨝ C)&lt;/code&gt;, and if the join conditions allow it, &lt;code&gt;(A ⨝ C) ⨝ B&lt;/code&gt;. For a fourth table you get a lot more permutations. PostgreSQL's planner searches through them.&lt;/p&gt;

&lt;p&gt;The heuristic is: &lt;strong&gt;do the most selective joins first&lt;/strong&gt;, so the intermediate result is as small as possible. A join that filters &lt;code&gt;rows_A × rows_B&lt;/code&gt; down to 100 rows should happen before a join that would blow the intermediate to millions.&lt;/p&gt;

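&lt;p&gt;For queries with fewer than 12 tables, PostgreSQL uses dynamic programming to search join orders near-exhaustively (explicit &lt;code&gt;JOIN&lt;/code&gt; lists and subqueries are only reordered freely up to &lt;code&gt;join_collapse_limit&lt;/code&gt; and &lt;code&gt;from_collapse_limit&lt;/code&gt;, both 8 by default). For 12+ tables, the planner switches to the Genetic Query Optimizer (GEQO), which uses heuristic search and sometimes produces non-optimal plans on complex joins. If you have a very wide query (12+ tables, complex conditions), tune &lt;code&gt;geqo_threshold&lt;/code&gt;, &lt;code&gt;join_collapse_limit&lt;/code&gt; and &lt;code&gt;from_collapse_limit&lt;/code&gt;, or consider splitting the problem with CTEs marked &lt;code&gt;MATERIALIZED&lt;/code&gt; (since PostgreSQL 12, plain CTEs are usually inlined into the main query).&lt;/p&gt;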

&lt;p&gt;A few practical levers when the planner picks a wrong join order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add or fix indexes.&lt;/strong&gt; A missing index on a join column often drives the planner to avoid that join until later, resulting in large intermediates. Indexing fixes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANALYZE&lt;/code&gt; recently.&lt;/strong&gt; Stale row counts → bad estimates → bad orders. Autovacuum handles this for active tables, but statistics are often out of date right after a bulk load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended statistics.&lt;/strong&gt; For correlated join keys, &lt;code&gt;CREATE STATISTICS&lt;/code&gt; on the correlation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewriting to constrain the planner.&lt;/strong&gt; &lt;code&gt;STRAIGHT_JOIN&lt;/code&gt; doesn't exist in PostgreSQL, but you can force the order by using explicit &lt;code&gt;JOIN&lt;/code&gt; syntax and setting &lt;code&gt;join_collapse_limit = 1&lt;/code&gt;. Use sparingly — the cost model is usually right.&lt;/li&gt;
&lt;/ul&gt;
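
&lt;p&gt;The last lever can be sketched as follows. With &lt;code&gt;join_collapse_limit = 1&lt;/code&gt; the planner keeps explicit &lt;code&gt;JOIN&lt;/code&gt;s in the order they are written, so the query text itself becomes the join order. The order chosen here (orders, then users, then items) is only an illustration, not a claim that it is the best one for this query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW geqo_threshold;       -- default 12
SHOW join_collapse_limit;  -- default 8

BEGIN;
SET LOCAL join_collapse_limit = 1;  -- explicit JOINs are planned in written order

EXPLAIN
SELECT u.email, o.order_id, oi.quantity
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

ROLLBACK;  -- the override ends with the transaction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing this plan's estimated cost with the unconstrained plan shows how much the planner's own ordering was buying.&lt;/p&gt;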

&lt;h2&gt;
  
  
  When join strategy doesn't matter — and what does
&lt;/h2&gt;

&lt;p&gt;Sometimes the join strategy is correct and the query is still slow. The real costs are upstream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A slow sub-query or CTE feeding the join.&lt;/strong&gt; The join isn't the problem; its input is. Diagnose by looking at the actual timing of each side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An expensive filter that prevents index use.&lt;/strong&gt; If one side of the join is doing a sequential scan because of a non-sargable WHERE clause, the join strategy can't save you. See &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-wide projections.&lt;/strong&gt; &lt;code&gt;SELECT *&lt;/code&gt; on a 400-column table passed through a join is expensive in row width; projecting only the columns you need tightens the whole pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When reading a multi-way join plan, resist the urge to focus on the outermost join. Instead, scan the leaves of the plan tree for the biggest &lt;code&gt;actual rows × loops&lt;/code&gt; node — that's where the time is actually going.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outer size&lt;/th&gt;
&lt;th&gt;Inner size&lt;/th&gt;
&lt;th&gt;Inner indexed?&lt;/th&gt;
&lt;th&gt;Inputs sorted?&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (≤1K)&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Nested Loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Nested Loop or Hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Merge Join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Hash Join (may spill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;No (build side &amp;gt; work_mem)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Hash Join with spill — raise work_mem or add an index&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plan shapes that should always prompt investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nested Loop with outer rows &amp;gt; 1,000 and no Memoize cache → fires &lt;code&gt;nested_loop_large&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Hash or Hash Join with &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; → fires &lt;code&gt;hash_batches_spill&lt;/code&gt;; either raise &lt;code&gt;work_mem&lt;/code&gt; or add an index so that a different join strategy becomes viable.&lt;/li&gt;
&lt;li&gt;Any join where &lt;code&gt;row_estimate_inaccurate&lt;/code&gt; fires on either side — fix statistics first, then re-examine the join.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Joins are the category most affected by the quality of your WHERE clauses. The next article in the series covers &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt; — sargability, composite-index column ordering, and the operators that silently disable indexes. If your joins look right but the inputs to them are slow, that's almost always where the fix lives.&lt;/p&gt;

&lt;p&gt;For the subquery/CTE patterns that sometimes appear in place of explicit joins (&lt;code&gt;EXISTS&lt;/code&gt;, correlated subqueries, LATERAL), see &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;#postgres #performance #database #sql&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-join-optimization&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Index Usage and Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:00:03 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-index-usage-and-optimization-4jgf</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-index-usage-and-optimization-4jgf</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Index Usage and Optimization
&lt;/h1&gt;

&lt;p&gt;Indexing is the single biggest lever in SQL performance, and it is also the category where most of the bad advice lives. "Add an index" solves a narrow class of problems. "Add the right index, in the right shape, for the right query, and drop the ones you don't need" is the actual job — and it's more design work than most teams expect.&lt;/p&gt;

&lt;p&gt;This is article 2 in a series on PostgreSQL query analysis. The pillar is &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;The Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt;; article 1 covers &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;reading EXPLAIN output&lt;/a&gt;. The running dataset is 500k-row &lt;code&gt;sim_bp_orders&lt;/code&gt; / 200k-row &lt;code&gt;sim_bp_users&lt;/code&gt; / 50k-row &lt;code&gt;sim_bp_products&lt;/code&gt; on Neon Postgres 17.8; every EXPLAIN block is from a real run.&lt;/p&gt;

&lt;p&gt;We'll cover: when the planner actually uses an index, the four design choices that matter most (column selection, partial, covering, expression), the less-common index types and when they beat btrees, how to find unused indexes, and four cases where &lt;em&gt;not&lt;/em&gt; adding an index is the correct call.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the planner picks an index
&lt;/h2&gt;

&lt;p&gt;An index is a data structure; "using an index" is a planner decision. PostgreSQL estimates the cost of each candidate plan — sequential scan, index scan, index-only scan, bitmap scan — and picks the cheapest. Three things drive that choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selectivity.&lt;/strong&gt; The estimated fraction of rows the query will return. If the filter returns 0.1% of rows, an index scan is almost always cheaper. If the filter returns 30%, it depends on the rest of the query shape. If the filter returns 70%, the planner will almost always choose a sequential scan because visiting most of the heap sequentially costs less than reading index pages plus random heap I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlation.&lt;/strong&gt; If the rows matching the filter are physically clustered on disk, the planner's random-access penalty shrinks and an index scan becomes more attractive. If they're scattered, random I/O dominates and seq scan wins. The &lt;code&gt;pg_stats.correlation&lt;/code&gt; column (range -1 to 1) tells you how clustered each column's values are. Time-series tables (&lt;code&gt;created_at&lt;/code&gt;) often have near-1 correlation because they're append-mostly; status columns usually hover near 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost parameters.&lt;/strong&gt; &lt;code&gt;random_page_cost&lt;/code&gt; (default 4.0) vs &lt;code&gt;seq_page_cost&lt;/code&gt; (default 1.0). On SSD-backed storage those defaults are too conservative; lowering &lt;code&gt;random_page_cost&lt;/code&gt; to 1.5 or 2.0 makes the planner reach for indexes more readily. Setting it &lt;em&gt;below&lt;/em&gt; &lt;code&gt;seq_page_cost&lt;/code&gt; is almost always wrong — it implies random I/O is faster than sequential, which isn't true on any real storage. If you're tempted to go there, you probably want to raise &lt;code&gt;effective_cache_size&lt;/code&gt; instead.&lt;/p&gt;
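
&lt;p&gt;Both correlation and the cost parameters can be inspected before touching any index. The following is a minimal sketch against the running dataset, assuming the tables live in the &lt;code&gt;public&lt;/code&gt; schema and have been analyzed recently. The one-day filter is only an illustrative query, and &lt;code&gt;SET LOCAL&lt;/code&gt; keeps the cost change scoped to the transaction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- How physically ordered is each column? Values near 1 or -1 favour index scans.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'   -- assumption: default schema
  AND tablename = 'sim_bp_orders'
ORDER BY attname;

-- Try a lower random_page_cost for one transaction only and compare the plan
-- with the one produced under the default setting.
BEGIN;
SET LOCAL random_page_cost = 1.5;
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE created_at &amp;gt; now() - interval '1 day';   -- illustrative filter
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the plan only changes under the lower setting, that is a hint the default is steering the planner away from an index it could use; whether the new plan is actually faster still needs an &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; run.&lt;/p&gt;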

&lt;p&gt;If a plan has a &lt;code&gt;Seq Scan&lt;/code&gt;, no index-type nodes, and more than two nodes total, you probably have a missing or ignored index. It's a signal, not a verdict — some queries genuinely don't want an index — but it's worth checking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boring case — primary key lookup
&lt;/h2&gt;

&lt;p&gt;The cheapest index in any database is the primary-key btree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Scan using sim_bp_users_pkey on sim_bp_users
  (cost=0.42..8.44 rows=1 width=51) (actual time=8.683..8.686 rows=1 loops=1)
  Index Cond: (sim_bp_users.user_id = 12345)
  Buffers: shared read=4
 Execution Time: 9.700 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four page reads for a lookup in a 200,000-row table: &lt;code&gt;shared read=4&lt;/code&gt; means four pages (8 kB each by default) were not found in shared buffers and had to be read in. The 9.7 ms execution time is dominated by those cold-cache reads against Neon's networked storage; on a warm cache this drops to sub-millisecond. This is the shape every OLTP single-row lookup should have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four design choices that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Column selection — matching the query shape
&lt;/h3&gt;

&lt;p&gt;A composite index on &lt;code&gt;(user_id, created_at)&lt;/code&gt; helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ?&lt;/code&gt; (uses the leading column alone).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ? AND created_at &amp;gt; ?&lt;/code&gt; (uses both).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ? ORDER BY created_at DESC LIMIT n&lt;/code&gt; (uses leading equality + sorted trailing column).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; help &lt;code&gt;WHERE created_at &amp;gt; ?&lt;/code&gt; in isolation. This is the leftmost-prefix rule: a btree composite index can efficiently answer queries that constrain a contiguous prefix of its columns, starting with the leading one. PostgreSQL 17 has no btree skip scan. PostgreSQL 18 adds one, but it only pays off when the leading column has few distinct values, so it does not rescue an index led by a reasonable-cardinality column.&lt;/p&gt;

&lt;p&gt;Rule of thumb: leading columns should be equality predicates, trailing columns range predicates or sort keys. &lt;code&gt;(tenant_id, created_at)&lt;/code&gt;, not &lt;code&gt;(created_at, tenant_id)&lt;/code&gt;.&lt;/p&gt;
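
&lt;p&gt;The leftmost-prefix rule is easy to confirm with two plans. This is a sketch only: the index name &lt;code&gt;idx_bp_orders_user_created&lt;/code&gt; and the literal values are made up for illustration, while the table and columns come from the running dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index name; equality column first, range/sort column second.
CREATE INDEX idx_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at);

-- Served by the index: equality on the leading column, ordering on the trailing one.
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;

-- Leading column unconstrained: this index does not help, so expect the planner
-- to fall back to another index on created_at or a sequential scan.
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE created_at &amp;gt; now() - interval '7 days';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;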

&lt;h3&gt;
  
  
  2. Partial indexes — when 80% of the table is irrelevant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_pending_recent&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index contains only rows where &lt;code&gt;status = 'pending'&lt;/code&gt;, so when pending orders make up about 20% of the table it is roughly one-fifth the size of a full index on &lt;code&gt;created_at&lt;/code&gt;. The planner can use it for any query whose &lt;code&gt;WHERE&lt;/code&gt; clause &lt;em&gt;implies&lt;/em&gt; &lt;code&gt;status = 'pending'&lt;/code&gt;; it checks this with a limited implication proof over the predicates. So &lt;code&gt;WHERE status = 'pending' AND created_at &amp;gt; now() - interval '1 day'&lt;/code&gt; works, but &lt;code&gt;WHERE status IN ('pending', 'shipped') AND ...&lt;/code&gt; doesn't (the &lt;code&gt;IN&lt;/code&gt; predicate doesn't imply the partial predicate).&lt;/p&gt;

&lt;p&gt;Two gotchas with partial indexes: they're fragile to query rewording (a function, a cast, or a reworded predicate can break the implication proof), and they add write cost whenever a row moves into or out of the partial predicate.&lt;/p&gt;
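
&lt;p&gt;The two predicates discussed above can be compared side by side. A small sketch, assuming &lt;code&gt;idx_bp_orders_pending_recent&lt;/code&gt; has been created as shown; the selected columns are arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Implies status = 'pending': the partial index is a candidate.
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE status = 'pending'
  AND created_at &amp;gt; now() - interval '1 day';

-- Does not imply status = 'pending': the partial index cannot be used.
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE status IN ('pending', 'shipped')
  AND created_at &amp;gt; now() - interval '1 day';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Being a candidate is not the same as being chosen: the planner still compares costs, so the first query may use a different plan if another index looks cheaper.&lt;/p&gt;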

&lt;h3&gt;
  
  
  3. Covering indexes — eliminating heap fetches
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; tucks non-key columns into the leaf pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_pending_by_amount&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A query that filters on &lt;code&gt;status = 'pending'&lt;/code&gt; and &lt;code&gt;SELECT&lt;/code&gt;s any combination of &lt;code&gt;order_id, user_id, total_amount_cents, created_at&lt;/code&gt; can be served entirely from this index's pages, provided the visibility map marks the relevant heap pages as all-visible. On a write-heavy table where autovacuum can't keep up, you'll see non-zero &lt;code&gt;Heap Fetches:&lt;/code&gt; in EXPLAIN, which defeats most of the benefit.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; columns cannot be used for index conditions. Rule: put columns used for filtering/joining/ordering in the key; put columns you're only retrieving in &lt;code&gt;INCLUDE&lt;/code&gt;.&lt;/p&gt;
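
&lt;p&gt;Whether an index-only scan really avoids the heap depends on the visibility map, which can be checked directly. A minimal sketch, assuming the covering index above exists; &lt;code&gt;relpages&lt;/code&gt; and &lt;code&gt;relallvisible&lt;/code&gt; are estimates that are refreshed by &lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;ANALYZE&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- relallvisible close to relpages means most heap pages are all-visible.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'sim_bp_orders';

-- A manual vacuum refreshes the visibility map if autovacuum is behind.
VACUUM (ANALYZE) sim_bp_orders;

-- Re-check the plan: look for "Index Only Scan" and the "Heap Fetches" line.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, user_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;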

&lt;h3&gt;
  
  
  4. Expression indexes — indexing computed values
&lt;/h3&gt;

&lt;p&gt;This is where most "why isn't my index being used?" problems live. A btree on &lt;code&gt;email&lt;/code&gt; can't serve &lt;code&gt;WHERE lower(email) = ?&lt;/code&gt; or &lt;code&gt;WHERE lower(email) LIKE 'prefix%'&lt;/code&gt;. Case-insensitive prefix search on a 200k-row table without an expression index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gather  (cost=1000.00..5841.09 rows=1000 width=25) (actual time=0.553..122.758 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -&amp;gt;  Parallel Seq Scan on sim_bp_users
        Filter: (lower((email)::text) ~~ 'user12%'::text)
        Rows Removed by Filter: 94444
 Execution Time: 122.833 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parallel seq scan, 94k rows filtered per worker, 122 ms. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_users_email_lower&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;text_pattern_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For equality on lowercased email, a plain &lt;code&gt;CREATE INDEX ... (lower(email))&lt;/code&gt; is enough. For prefix &lt;code&gt;LIKE&lt;/code&gt;, &lt;code&gt;text_pattern_ops&lt;/code&gt; is needed unless the database uses the &lt;code&gt;C&lt;/code&gt; collation, because PostgreSQL can only rewrite &lt;code&gt;LIKE 'prefix%'&lt;/code&gt; into an index range scan when the index orders text by byte value rather than by locale collation.&lt;/p&gt;
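
&lt;p&gt;An expression index is only considered when the query repeats the indexed expression. The sketch below assumes &lt;code&gt;idx_bp_users_email_lower&lt;/code&gt; exists as defined above; the &lt;code&gt;upper()&lt;/code&gt; variant is included purely as a counter-example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Matches the indexed expression lower(email), so the index is usable.
EXPLAIN
SELECT user_id, email
FROM sim_bp_users
WHERE lower(email) LIKE 'user12%';

-- Same column, different expression: upper(email) is not what was indexed,
-- so this predicate cannot use idx_bp_users_email_lower.
EXPLAIN
SELECT user_id, email
FROM sim_bp_users
WHERE upper(email) LIKE 'USER12%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;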

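&lt;p&gt;For comparison, here is the same prefix search written against the raw column (&lt;code&gt;email LIKE 'user12%'&lt;/code&gt;, without &lt;code&gt;lower()&lt;/code&gt;), served by the existing &lt;code&gt;idx_sim_bp_users_email_pattern&lt;/code&gt; index on &lt;code&gt;email text_pattern_ops&lt;/code&gt;:&lt;br&gt;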
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Only Scan using idx_sim_bp_users_email_pattern on sim_bp_users
  (cost=0.42..29.87 rows=20 width=8) (actual time=0.057..24.729 rows=20 loops=1)
  Index Cond: ((email ~&amp;gt;=~ 'user12'::text) AND (email ~&amp;lt;~ 'user13'::text))
  Filter: ((email)::text ~~ 'user12%'::text)
  Heap Fetches: 0
 Execution Time: 24.757 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Index Cond uses &lt;code&gt;~&amp;gt;=~&lt;/code&gt; and &lt;code&gt;~&amp;lt;~&lt;/code&gt;, real PostgreSQL operators from &lt;code&gt;text_pattern_ops&lt;/code&gt; that do byte-order comparisons. That is 24.7 ms against 122.8 ms for the parallel sequential scan: roughly five times faster, and the gap widens on larger tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index types beyond btree
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GIN — when equality becomes containment
&lt;/h3&gt;

&lt;p&gt;For values with internal structure (arrays, JSONB, full-text search vectors, trigrams):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_data_gin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now this is sargable:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"type": "purchase"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



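&lt;p&gt;&lt;code&gt;jsonb_path_ops&lt;/code&gt; supports only containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;) and the jsonpath match operators (&lt;code&gt;@?&lt;/code&gt;, &lt;code&gt;@@&lt;/code&gt;), but produces a significantly smaller and faster index than the default &lt;code&gt;jsonb_ops&lt;/code&gt;. Use it unless you need the key-existence operators (&lt;code&gt;?&lt;/code&gt;, &lt;code&gt;?|&lt;/code&gt;, &lt;code&gt;?&amp;amp;&lt;/code&gt;).&lt;/p&gt;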

&lt;p&gt;GIN with &lt;code&gt;pg_trgm&lt;/code&gt; turns substring &lt;code&gt;LIKE&lt;/code&gt; queries (&lt;code&gt;LIKE '%needle%'&lt;/code&gt;) into index-backed scans.&lt;/p&gt;
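
&lt;p&gt;A trigram index might look like the following. This is a sketch: the index name is hypothetical, &lt;code&gt;'%needle%'&lt;/code&gt; stands in for a real search term, and it assumes the &lt;code&gt;pg_trgm&lt;/code&gt; extension is available on your instance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; gin_trgm_ops is the operator class shipped with pg_trgm.
CREATE INDEX idx_bp_users_email_trgm
    ON sim_bp_users USING gin (email gin_trgm_ops);

-- Substring match that a btree cannot serve but the trigram index can.
EXPLAIN
SELECT user_id, email
FROM sim_bp_users
WHERE email LIKE '%needle%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Search terms shorter than three characters yield no usable trigrams, so very short patterns gain little from this index.&lt;/p&gt;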

&lt;h3&gt;
  
  
  BRIN — when the data is physically ordered
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_created_at_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our 500,000-row orders table, a BRIN index is ~24 kB; a btree on the same column is ~5 MB. BRIN loses effectiveness immediately if the data isn't correlated: on a shuffled table, the min/max of every page range spans nearly the whole value domain, so the scan can't skip any ranges. BRIN is effectively useless on uncorrelated columns and brilliant on time-series data.&lt;/p&gt;
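
&lt;p&gt;Both halves of that claim, size and correlation, can be measured. The sketch below assumes the BRIN index above has been created and that a btree named &lt;code&gt;idx_sim_bp_orders_created_at&lt;/code&gt; exists on the same column; substitute your own index names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare on-disk size of the BRIN index and the btree (names are assumptions).
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relname IN ('idx_bp_orders_created_at_brin',
                    'idx_sim_bp_orders_created_at');

-- BRIN only pays off when this value is close to 1 or -1.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'sim_bp_orders'
  AND attname = 'created_at';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;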

&lt;h3&gt;
  
  
  GiST / SP-GiST / hash
&lt;/h3&gt;

&lt;p&gt;Geometric types, ranges, and fuzzy matching use GiST or SP-GiST. Hash indexes only support equality and are usually beaten by btrees even for point lookups — use them only when you've measured a specific case where they win.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to add an index
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write-heavy, read-light tables.&lt;/strong&gt; Every index is write cost.&lt;/li&gt;
&lt;li&gt;
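&lt;strong&gt;Low selectivity.&lt;/strong&gt; A btree on a boolean &lt;code&gt;is_active&lt;/code&gt; where 90% of rows are active will almost never be chosen for &lt;code&gt;is_active = true&lt;/code&gt;, because a sequential scan is cheaper. If you only query the rare value, a partial index on it is better.&lt;/li&gt;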
&lt;li&gt;
&lt;strong&gt;Queries that need most of the table.&lt;/strong&gt; Reports over large windows are best served by parallel seq scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant indexes.&lt;/strong&gt; &lt;code&gt;(a, b, c)&lt;/code&gt; subsumes &lt;code&gt;(a, b)&lt;/code&gt; and &lt;code&gt;(a)&lt;/code&gt; for lookups. Drop the prefixes and keep the longest, unless a prefix index backs a unique or primary-key constraint.&lt;/li&gt;
&lt;/ol&gt;
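
&lt;p&gt;Redundant prefixes can be listed from the catalog. The query below is a rough sketch, not a complete audit: it compares column lists only, so it ignores operator classes, sort direction, and index type, and it skips expression and partial indexes altogether. Treat every row as a candidate to review by hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pairs of indexes on the same table where one column list is a
-- leading prefix of (or identical to) the other.
SELECT a.indrelid::regclass   AS table_name,
       a.indexrelid::regclass AS shorter_index,
       b.indexrelid::regclass AS longer_index
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid
 AND a.indexrelid &amp;lt;&amp;gt; b.indexrelid
 AND (b.indkey::text || ' ') LIKE (a.indkey::text || ' %')
WHERE NOT a.indisunique                     -- keep constraint-backing indexes
  AND a.indpred IS NULL  AND b.indpred IS NULL    -- skip partial indexes
  AND a.indexprs IS NULL AND b.indexprs IS NULL   -- skip expression indexes
  AND a.indrelid IN (SELECT relid FROM pg_stat_user_tables);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;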

&lt;h2&gt;
  
  
  Finding unused indexes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_constraint&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conindid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contype&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'u'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



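&lt;p&gt;Real result from the running database (the &lt;code&gt;table_name&lt;/code&gt; column is omitted here; note that the second row has one recorded scan, so the &lt;code&gt;idx_scan = 0&lt;/code&gt; filter above would exclude it):&lt;/p&gt;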

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;index_name&lt;/th&gt;
&lt;th&gt;size&lt;/th&gt;
&lt;th&gt;idx_scan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;idx_sim_bp_users_username_pattern&lt;/td&gt;
&lt;td&gt;6184 kB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;idx_sim_bp_users_email_pattern&lt;/td&gt;
&lt;td&gt;7960 kB&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One 6 MB index with zero scans is a straightforward drop. The &lt;code&gt;NOT EXISTS&lt;/code&gt; clause skips PK/unique/exclusion constraint indexes — those enforce integrity and are used internally even if no user query hits them.&lt;/p&gt;

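&lt;p&gt;Two caveats: &lt;code&gt;pg_stat_reset()&lt;/code&gt; zeros the counter (check &lt;code&gt;stats_reset&lt;/code&gt; in &lt;code&gt;pg_stat_database&lt;/code&gt; before acting, so you know how long the counters have been accumulating), and each server only counts its own scans, so an index with zero scans on the primary may still be serving queries on a read replica. Check every replica before dropping.&lt;/p&gt;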
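
&lt;p&gt;A cautious drop sequence might look like this. It is a sketch that assumes you have already confirmed the index is unused on every server that receives queries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- When were the statistics last reset? A recent timestamp means idx_scan = 0
-- proves very little.
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- CONCURRENTLY avoids blocking writes on the table while the index is removed.
-- It cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_sim_bp_users_username_pattern;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;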

&lt;h2&gt;
  
  
  Adding the right index — a complete example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a suitable index, this query runs as a 51 ms sequential scan over 500k rows followed by a top-N heapsort. Three plausible candidates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(status)&lt;/code&gt;&lt;/strong&gt; — cheapest, most general, but the planner still needs a sort step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(status, total_amount_cents DESC)&lt;/code&gt;&lt;/strong&gt; — solves filter and sort. The sort is free because the index is already ordered on the trailing column within each status group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(total_amount_cents DESC) WHERE status = 'pending'&lt;/code&gt;&lt;/strong&gt; — only pending rows indexed. Smaller, faster to maintain, but only helps pending queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 3 plus &lt;code&gt;INCLUDE (order_id, user_id, created_at)&lt;/code&gt; gives an Index Only Scan and is the right call for this specific query. If the dashboard later adds &lt;code&gt;status IN ('pending', 'processing')&lt;/code&gt;, you'd want option 2 instead. Design indexes for the query you have, and re-read the plans every six months.&lt;/p&gt;





&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-index-usage-optimization&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>Reading PostgreSQL EXPLAIN and EXPLAIN ANALYZE Output</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:00:07 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/reading-postgresql-explain-and-explain-analyze-output-3o74</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/reading-postgresql-explain-and-explain-analyze-output-3o74</guid>
      <description>&lt;p&gt;Every PostgreSQL performance conversation eventually lands on a question that sounds trivial: &lt;em&gt;what does this EXPLAIN mean?&lt;/em&gt; The output is almost readable. There are node names in English, numbers that look familiar, and enough structure that you can guess at the intent. But if you're guessing, you're going to miss the signal that actually matters — and the difference between a plan that returns in 0.3 ms and one that returns in 400 ms is often one line of EXPLAIN output that looks like boilerplate.&lt;/p&gt;

&lt;p&gt;This article is a systematic walk through how to read an EXPLAIN plan on PostgreSQL 17, using real output captured from a live database. By the end you should be able to look at a plan, identify what each node is doing and why, spot the three places where things usually go wrong, and articulate in one sentence why the query is slow — or whether it's actually fine and something else is wrong.&lt;/p&gt;

&lt;p&gt;This is part of the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series.&lt;/p&gt;

&lt;h2&gt;
  
  
  EXPLAIN vs EXPLAIN ANALYZE vs EXPLAIN (ANALYZE, BUFFERS)
&lt;/h2&gt;

&lt;p&gt;The three variants you'll use in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt; — asks the planner what it &lt;em&gt;would&lt;/em&gt; do, without running the query. Fast (milliseconds), safe for expensive queries, but every number is an estimate. Useful for "how expensive does the planner think this is?" and "did my new index change the plan shape?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;&lt;/strong&gt; — actually runs the query and reports what happened. You get both the planner's estimates and the real measured results, side by side. Use this in development and staging; use it on production only after thinking about the cost. &lt;strong&gt;Three warnings:&lt;/strong&gt; (1) &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on an &lt;code&gt;INSERT&lt;/code&gt;/&lt;code&gt;UPDATE&lt;/code&gt;/&lt;code&gt;DELETE&lt;/code&gt; will execute the DML — wrap in a &lt;code&gt;BEGIN; ... ROLLBACK;&lt;/code&gt; if you don't want the side effects. (2) The query runs end-to-end, so a slow query is slow again, and any locks it takes are held for real. (3) &lt;code&gt;ANALYZE&lt;/code&gt; pulls rows into the buffer cache and may evict other working-set pages; running it on a busy production system can perturb the performance of the exact thing you're measuring. On hot-path queries, prefer capturing a representative plan via &lt;code&gt;auto_explain&lt;/code&gt; or an EXPLAIN visualiser in a monitoring tool rather than running &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; ad-hoc under load.&lt;/p&gt;
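
&lt;p&gt;The first warning in practice, as a minimal sketch. The &lt;code&gt;order_id&lt;/code&gt; value and the &lt;code&gt;'cancelled'&lt;/code&gt; status are hypothetical placeholders; the point is the transaction wrapper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- The UPDATE really executes, so the timings are real.
EXPLAIN ANALYZE
UPDATE sim_bp_orders
SET status = 'cancelled'     -- hypothetical status value
WHERE order_id = 12345;      -- hypothetical id

-- Discard the change. The row lock taken by the UPDATE is held until here.
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;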

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)&lt;/code&gt;&lt;/strong&gt; — the version you should default to. &lt;code&gt;BUFFERS&lt;/code&gt; adds per-node cache-hit/read/dirtied counts and must still be specified explicitly; &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on its own does &lt;em&gt;not&lt;/em&gt; include buffer statistics. &lt;code&gt;VERBOSE&lt;/code&gt; adds the output column list at each node (useful for spotting why indexes aren't being chosen). &lt;code&gt;SETTINGS&lt;/code&gt; reports any non-default planner knobs that might be influencing the plan.&lt;/p&gt;

&lt;p&gt;You can also ask for structured output with &lt;code&gt;FORMAT JSON&lt;/code&gt;, &lt;code&gt;FORMAT YAML&lt;/code&gt;, or &lt;code&gt;FORMAT XML&lt;/code&gt;. JSON preserves every field and is what you want for programmatic analysis; the text format is easier to read inline.&lt;/p&gt;
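
&lt;p&gt;Put together, the default option set with structured output looks like the sketch below. The query itself is just a simple primary-key lookup on the running dataset, chosen for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Options go in one parenthesised list; FORMAT JSON returns a single JSON document.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT JSON)
SELECT user_id, email
FROM sim_bp_users
WHERE user_id = 12345;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Drop &lt;code&gt;FORMAT JSON&lt;/code&gt; to get the same information in the indented text layout used throughout this article.&lt;/p&gt;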

&lt;h2&gt;
  
  
  The plan tree
&lt;/h2&gt;

&lt;p&gt;Every EXPLAIN output is a tree. The root is the outermost node, which is whatever produces the query's final rows; children feed their output up to their parent. PostgreSQL indents children under their parent with arrows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parent Node
  -&amp;gt;  Child A
  -&amp;gt;  Child B
        -&amp;gt;  Grandchild
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The top-down narrative is: "to produce &lt;code&gt;Parent Node&lt;/code&gt;'s output, PostgreSQL runs &lt;code&gt;Child A&lt;/code&gt; and &lt;code&gt;Child B&lt;/code&gt;, feeding both into the parent. &lt;code&gt;Child B&lt;/code&gt; itself is produced by running &lt;code&gt;Grandchild&lt;/code&gt;." Execution order is bottom-up (leaves run first), but the way to &lt;em&gt;read&lt;/em&gt; the plan is top-down — start with "what is this query ultimately asking for?" and then follow the tree down to understand how PostgreSQL intends to answer.&lt;/p&gt;

&lt;p&gt;Here's a real example — "show twenty recent pending orders with the user's email." The plan is against a 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table and 200,000-row &lt;code&gt;sim_bp_users&lt;/code&gt; table on PostgreSQL 17.8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;075&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;277&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;211&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Nested&lt;/span&gt; &lt;span class="n"&gt;Loop&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;77853&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;074&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;Inner&lt;/span&gt; &lt;span class="k"&gt;Unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;211&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;Backward&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_created_at&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;30949&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;012&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;151&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;106&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Memoize&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;43&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;006&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;006&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Cache&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
              &lt;span class="k"&gt;Cache&lt;/span&gt; &lt;span class="k"&gt;Mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;logical&lt;/span&gt;
              &lt;span class="n"&gt;Hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Misses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="n"&gt;Evictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Overflows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
              &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users_pkey&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;183&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;309&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read top-down. The root is &lt;code&gt;Limit&lt;/code&gt;, which caps the result at twenty rows. Below it is a &lt;code&gt;Nested Loop&lt;/code&gt; that joins two sources: an &lt;code&gt;Index Scan Backward&lt;/code&gt; over &lt;code&gt;sim_bp_orders&lt;/code&gt; and a &lt;code&gt;Memoize&lt;/code&gt; wrapping an &lt;code&gt;Index Scan&lt;/code&gt; on &lt;code&gt;sim_bp_users&lt;/code&gt;. The outer loop walks the orders index backwards (newest first) filtering for &lt;code&gt;status = 'pending'&lt;/code&gt;, and for each matching order, looks up the user via the primary-key index — but the &lt;code&gt;Memoize&lt;/code&gt; caches results by &lt;code&gt;user_id&lt;/code&gt; in case the same user appears multiple times (they don't in this particular run, so all 20 are cache misses).&lt;/p&gt;

&lt;p&gt;This is a very good plan. 0.309 ms, 211 shared-buffer hits, no reads from disk. The &lt;code&gt;LIMIT 20&lt;/code&gt; short-circuits the nested loop early: the orders scan reads only 126 rows, discarding 106 of them through the filter, before twenty matches are found. The same query with a much larger &lt;code&gt;LIMIT&lt;/code&gt; would have very different numbers.&lt;/p&gt;

&lt;p&gt;Now let's break down what each number means.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-node fields: cost, rows, width, time, loops
&lt;/h2&gt;

&lt;p&gt;On every node, PostgreSQL prints something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first parenthesis (&lt;code&gt;cost=... rows=... width=...&lt;/code&gt;) is the &lt;strong&gt;planner's estimate&lt;/strong&gt;. The second (&lt;code&gt;actual time=... rows=... loops=...&lt;/code&gt;) is &lt;strong&gt;what actually happened&lt;/strong&gt; when the query ran. &lt;code&gt;EXPLAIN&lt;/code&gt; without &lt;code&gt;ANALYZE&lt;/code&gt; only prints the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cost=startup..total&lt;/code&gt;.&lt;/strong&gt; Dimensionless units, scaled relative to &lt;code&gt;seq_page_cost&lt;/code&gt; (1.0 by default). The other cost GUCs (&lt;code&gt;random_page_cost&lt;/code&gt;, &lt;code&gt;cpu_tuple_cost&lt;/code&gt;, &lt;code&gt;cpu_index_tuple_cost&lt;/code&gt;, &lt;code&gt;cpu_operator_cost&lt;/code&gt;) are all expressed in the same arbitrary unit, which lets the planner compare heterogeneous operations against each other. &lt;code&gt;startup&lt;/code&gt; is the estimated cost incurred before this node can return its first row; &lt;code&gt;total&lt;/code&gt; is the estimated cost to produce all rows. The difference matters: a &lt;code&gt;Sort&lt;/code&gt; node has a high startup cost (it has to consume all input before it can produce the first row) but a low marginal cost per row after that. An &lt;code&gt;Index Scan&lt;/code&gt; has a very low startup cost. When a node sits beneath a &lt;code&gt;Limit&lt;/code&gt;, what matters most is the child's &lt;em&gt;startup&lt;/em&gt; cost plus the small share of its remaining cost needed for the first few rows, because the limit stops asking for rows as soon as it has enough.&lt;/p&gt;
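
&lt;p&gt;As a rough way to see the constants behind those cost units, the planner cost settings can be read from &lt;code&gt;pg_settings&lt;/code&gt;. This is only an inspection sketch; the values returned depend on your configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List the planner cost constants that cost= figures are built from.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost')
ORDER BY name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;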

&lt;p&gt;&lt;strong&gt;&lt;code&gt;rows=N&lt;/code&gt;.&lt;/strong&gt; The planner's estimate of how many rows this node will emit. &lt;em&gt;Per loop&lt;/em&gt; — see below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;width=W&lt;/code&gt;.&lt;/strong&gt; Estimated average row width in bytes. Mostly informational; you use it to sanity-check whether a &lt;code&gt;Sort&lt;/code&gt; or &lt;code&gt;Hash&lt;/code&gt; might spill to disk (row width × estimated rows ≈ memory requirement).&lt;/p&gt;
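
&lt;p&gt;To make that sanity check concrete, here is a back-of-the-envelope sketch. The row count and width are made-up illustrative values, not taken from a plan in this article, and the product covers raw row data only; real sorts and hash tables carry per-row overhead on top.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical node reporting rows=1000000 width=40
SELECT pg_size_pretty(1000000::bigint * 40) AS approx_raw_data;  -- about 38 MB

-- Compare against the per-operation memory budget.
SHOW work_mem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;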

&lt;p&gt;&lt;strong&gt;&lt;code&gt;actual time=startup..total&lt;/code&gt;.&lt;/strong&gt; Wall-clock milliseconds, averaged &lt;em&gt;per loop&lt;/em&gt; and including time spent in child nodes. &lt;code&gt;startup&lt;/code&gt; is the time until this node produced its first row; &lt;code&gt;total&lt;/code&gt; is the time until it produced its last row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;actual rows=r loops=l&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;rows&lt;/code&gt; is the number of rows produced &lt;em&gt;per loop&lt;/em&gt;, averaged over all &lt;code&gt;l&lt;/code&gt; loops. To get the total rows this node emitted, multiply: &lt;code&gt;rows × loops&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Loops matter. In the nested loop example above, the &lt;code&gt;Memoize&lt;/code&gt; node reports &lt;code&gt;rows=1 loops=20&lt;/code&gt; — meaning the node was executed 20 times (once per outer row), and each execution produced 1 row. Total output: 20 rows. But the &lt;code&gt;actual time=0.006..0.006&lt;/code&gt; is &lt;em&gt;per loop&lt;/em&gt;, so the total time spent in &lt;code&gt;Memoize&lt;/code&gt; was about &lt;code&gt;0.006 ms × 20 = 0.12 ms&lt;/code&gt;. Forgetting to multiply by &lt;code&gt;loops&lt;/code&gt; is the single most common mistake in reading EXPLAIN output — a node that looks fast per loop can still dominate the query time if it runs 50,000 times.&lt;/p&gt;
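
&lt;p&gt;The same multiplication applied to the two inner nodes of the plan above, written as plain arithmetic. Per-loop times are averages, so the totals are approximate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 0.006 * 20 AS memoize_total_ms,     -- Memoize: 0.006 ms x 20 loops
       0.003 * 20 AS index_scan_total_ms;  -- users Index Scan: 0.003 ms x 20 loops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;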

&lt;p&gt;The relationship between &lt;code&gt;rows&lt;/code&gt; estimate and &lt;code&gt;actual rows&lt;/code&gt; is arguably &lt;em&gt;the&lt;/em&gt; most important signal in a plan. If the planner estimated 15 and the actual was 8,000, the plan was built on bad assumptions: every decision that depended on that estimate (join strategy, memory allocation, whether to parallelise) is suspect. Compare like with like: both figures are per loop. A ratio past 10× in either direction is worth treating as a warning; past 100× it's usually critical. The usual first fix is &lt;code&gt;ANALYZE&lt;/code&gt; on the affected table, or extended statistics if the bad estimate comes from correlated columns that the planner assumes are independent.&lt;/p&gt;
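
&lt;p&gt;A minimal sketch of both fixes. The statistics object name &lt;code&gt;sim_bp_orders_status_user_stats&lt;/code&gt; is hypothetical, and treating &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; as the correlated pair is an assumption made purely for illustration; substitute the columns your own misestimated predicate uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Refresh the planner's per-column statistics for the table.
ANALYZE sim_bp_orders;

-- Hypothetical extended statistics on two columns assumed to be correlated.
CREATE STATISTICS sim_bp_orders_status_user_stats (dependencies, ndistinct)
    ON status, user_id
    FROM sim_bp_orders;

-- Extended statistics are only populated by the next ANALYZE.
ANALYZE sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;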

&lt;h2&gt;
  
  
  Node types you'll see most often
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scan nodes&lt;/strong&gt; — where rows enter the plan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Seq Scan&lt;/code&gt;&lt;/strong&gt; — read every row of a table. Reports &lt;code&gt;Filter:&lt;/code&gt; when there's a WHERE clause applied, and &lt;code&gt;Rows Removed by Filter:&lt;/code&gt; telling you how many rows were read and discarded. Cheap on small tables, catastrophic on large ones with selective filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Index Scan&lt;/code&gt;&lt;/strong&gt; — use an index to find rows, then fetch each matching row from the heap for any columns the index doesn't contain. Reports &lt;code&gt;Index Cond:&lt;/code&gt; for conditions satisfied by the index, and optionally &lt;code&gt;Filter:&lt;/code&gt; for conditions the index cannot evaluate, which are applied to each row after the heap fetch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Index Only Scan&lt;/code&gt;&lt;/strong&gt; — use an index to find rows &lt;em&gt;and&lt;/em&gt; return all requested columns directly from the index, skipping the heap entirely. Requires either that the index includes every referenced column (see &lt;code&gt;INCLUDE&lt;/code&gt;) or that all columns are part of the index keys. Reports &lt;code&gt;Heap Fetches:&lt;/code&gt; — this number should be close to zero; a non-zero count means the visibility map didn't cover some pages and PostgreSQL had to check the heap anyway, defeating the point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Bitmap Index Scan&lt;/code&gt; + &lt;code&gt;Bitmap Heap Scan&lt;/code&gt;&lt;/strong&gt; — two-step pattern for combining multiple index conditions or for queries that match many rows. First, the index scan builds a bitmap of heap locations that might have matches. Then the heap scan visits those pages once each, avoiding re-reading pages that contain multiple matches. Reports exact and lossy heap block counts (&lt;code&gt;Heap Blocks: exact=N lossy=M&lt;/code&gt; in text format). A high lossy count means &lt;code&gt;work_mem&lt;/code&gt; was too small to track individual tuples, so PostgreSQL fell back to page-level tracking and has to recheck the condition against every row on those pages (see &lt;code&gt;Recheck Cond:&lt;/code&gt; and &lt;code&gt;Rows Removed by Index Recheck:&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
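
&lt;p&gt;As one illustration of the &lt;code&gt;Index Only Scan&lt;/code&gt; requirements, here is a hedged sketch of a covering index. The index name is hypothetical, and the column choice assumes a query that reads only &lt;code&gt;created_at&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; from &lt;code&gt;sim_bp_orders&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical covering index: created_at is the key, the other columns ride along.
CREATE INDEX idx_sim_bp_orders_created_at_covering
    ON sim_bp_orders (created_at)
    INCLUDE (status, user_id);

-- VACUUM refreshes the visibility map, which keeps Heap Fetches low.
VACUUM (ANALYZE) sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;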

&lt;p&gt;&lt;strong&gt;Join nodes&lt;/strong&gt; — combining two inputs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Nested Loop&lt;/code&gt;&lt;/strong&gt; — for each row on the outer side, scan the inner side. Optimal when the outer side is small and the inner side has an index on the join column. Pathological when both sides are large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Hash Join&lt;/code&gt;&lt;/strong&gt; — build a hash table from the smaller side (the &lt;code&gt;Hash&lt;/code&gt; child), then probe it with each row from the other side. Optimal for equi-joins on unordered data when the smaller side fits in &lt;code&gt;work_mem&lt;/code&gt;. The &lt;code&gt;Hash&lt;/code&gt; child reports &lt;code&gt;Buckets:&lt;/code&gt;, &lt;code&gt;Batches:&lt;/code&gt; and &lt;code&gt;Memory Usage:&lt;/code&gt; (&lt;code&gt;Hash Batches&lt;/code&gt; in the structured output formats). If the batch count is greater than 1, the hash table didn't fit in memory and had to spill to temp files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Merge Join&lt;/code&gt;&lt;/strong&gt; — two pre-sorted inputs, walked in parallel. Optimal when both sides are already sorted (or can be sorted cheaply via an index). Reports &lt;code&gt;Merge Cond:&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
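
&lt;p&gt;One way to see why the planner preferred a particular join type is to discourage that type for a single transaction and compare the resulting plan. The sketch below assumes a query shaped like the one behind the first plan in this article; the select list is a guess based on the columns visible in the plan output. The &lt;code&gt;enable_*&lt;/code&gt; settings are diagnostic tools, not production tuning knobs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- Discourage nested loops for this transaction only.
SET LOCAL enable_nestloop = off;

-- Assumed query shape; adjust to the statement you are investigating.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.created_at, o.status, u.email
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending'
ORDER BY o.created_at DESC
LIMIT 20;

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the alternative plan turns out faster in practice, that usually points back at a row estimate or cost setting rather than at the join algorithm itself.&lt;/p&gt;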

&lt;p&gt;&lt;strong&gt;Sort and aggregation nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Sort&lt;/code&gt;&lt;/strong&gt; — ordering rows. Reports &lt;code&gt;Sort Key:&lt;/code&gt; (the columns being sorted) and &lt;code&gt;Sort Method:&lt;/code&gt; (algorithm), followed in text format by &lt;code&gt;Memory:&lt;/code&gt; or &lt;code&gt;Disk:&lt;/code&gt; with the space used in kB (&lt;code&gt;Sort Space Type&lt;/code&gt; and &lt;code&gt;Sort Space Used&lt;/code&gt; in the structured output formats).

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;top-N heapsort&lt;/code&gt;&lt;/strong&gt; — used under a &lt;code&gt;LIMIT N&lt;/code&gt;. Keeps only N rows in a heap regardless of input size. Efficient in memory and time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;quicksort&lt;/code&gt;&lt;/strong&gt; — everything fits in &lt;code&gt;work_mem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;external merge&lt;/code&gt;&lt;/strong&gt; — didn't fit; spilled to disk.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;Aggregate&lt;/code&gt; / &lt;code&gt;HashAggregate&lt;/code&gt; / &lt;code&gt;GroupAggregate&lt;/code&gt;&lt;/strong&gt; — SUM/AVG/COUNT/GROUP BY. &lt;code&gt;HashAggregate&lt;/code&gt; builds a hash table keyed by the group-by columns; &lt;code&gt;GroupAggregate&lt;/code&gt; requires presorted input. Since PostgreSQL 13, &lt;code&gt;HashAggregate&lt;/code&gt; can spill to disk; when it does, it reports &lt;code&gt;Planned Partitions: N&lt;/code&gt;, &lt;code&gt;Batches: M&lt;/code&gt; and a non-zero &lt;code&gt;Disk Usage:&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;Limit&lt;/code&gt;&lt;/strong&gt; — cap the number of rows. Often the shortcut that makes a plan fast.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;WindowAgg&lt;/code&gt;&lt;/strong&gt; — window functions like &lt;code&gt;ROW_NUMBER()&lt;/code&gt; and &lt;code&gt;SUM() OVER&lt;/code&gt;.&lt;/li&gt;

&lt;/ul&gt;
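
&lt;p&gt;If a &lt;code&gt;Sort&lt;/code&gt; reports &lt;code&gt;external merge&lt;/code&gt; or a &lt;code&gt;HashAggregate&lt;/code&gt; reports disk usage, one hedged experiment is to raise &lt;code&gt;work_mem&lt;/code&gt; for the current session only and re-run the plan. The value &lt;code&gt;64MB&lt;/code&gt; is an arbitrary illustrative figure, and the query is a hypothetical aggregate over the tables used in this article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session-level change; illustrative value only.
SET work_mem = '64MB';

-- Hypothetical aggregate query; check Sort Method and Disk Usage in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, count(*) AS order_count
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
GROUP BY u.email
ORDER BY order_count DESC
LIMIT 20;

-- Return to the configured default.
RESET work_mem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because &lt;code&gt;work_mem&lt;/code&gt; applies per operation rather than per query, a session-scoped change like this is a safer experiment than raising it globally.&lt;/p&gt;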

&lt;p&gt;&lt;strong&gt;Parallelism.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Gather&lt;/code&gt; / &lt;code&gt;Gather Merge&lt;/code&gt;&lt;/strong&gt; — the leader process collecting results from parallel workers. &lt;code&gt;Workers Planned:&lt;/code&gt; and &lt;code&gt;Workers Launched:&lt;/code&gt; tell you how many workers the planner asked for vs actually got. When &lt;code&gt;Launched &amp;lt; Planned&lt;/code&gt;, the system is short on parallel worker slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Parallel Seq Scan&lt;/code&gt; / &lt;code&gt;Parallel Index Scan&lt;/code&gt; / &lt;code&gt;Parallel Hash Join&lt;/code&gt;&lt;/strong&gt; — parallel-aware variants of the base node types.&lt;/li&gt;
&lt;/ul&gt;
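
&lt;p&gt;When &lt;code&gt;Workers Launched&lt;/code&gt; falls short of &lt;code&gt;Workers Planned&lt;/code&gt;, the settings that bound parallel workers can be checked directly. This is only an inspection sketch; appropriate values depend on core count and workload.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW max_worker_processes;             -- total background worker slots
SHOW max_parallel_workers;             -- how many of those may serve parallel queries
SHOW max_parallel_workers_per_gather;  -- cap per Gather or Gather Merge node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;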

&lt;p&gt;&lt;strong&gt;Utility nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Materialize&lt;/code&gt;&lt;/strong&gt; — cache an intermediate result so the parent can rescan it without redoing the work. Common above the inner side of a Nested Loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Memoize&lt;/code&gt;&lt;/strong&gt; (new in PostgreSQL 14) — LRU cache above an inner loop. Reports &lt;code&gt;Cache Key:&lt;/code&gt;, &lt;code&gt;Hits:&lt;/code&gt;, &lt;code&gt;Misses:&lt;/code&gt;, &lt;code&gt;Evictions:&lt;/code&gt;, and &lt;code&gt;Memory Usage:&lt;/code&gt;. A high hit ratio is good; a high miss ratio just means the cache didn't help this particular query but didn't hurt either.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CTE Scan&lt;/code&gt;&lt;/strong&gt; — reading from a materialised CTE. In PostgreSQL 12+ most CTEs are inlined and this node disappears; you see it when a CTE is referenced multiple times or marked &lt;code&gt;MATERIALIZED&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SubPlan&lt;/code&gt;&lt;/strong&gt; — a correlated subquery, executed once per outer row. Almost always worth rewriting as a JOIN.&lt;/li&gt;
&lt;/ul&gt;
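
&lt;p&gt;A sketch of the kind of rewrite meant for &lt;code&gt;SubPlan&lt;/code&gt;, using the tables from this article. The queries are illustrative rather than taken from the original workload, and the equivalence assumes &lt;code&gt;user_id&lt;/code&gt; is the primary key of &lt;code&gt;sim_bp_users&lt;/code&gt;, as the &lt;code&gt;sim_bp_users_pkey&lt;/code&gt; index in the plan suggests. Compare both plans before committing to the rewrite.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Correlated scalar subquery: typically appears as a SubPlan run once per user row.
SELECT u.email,
       (SELECT count(*)
        FROM sim_bp_orders o
        WHERE o.user_id = u.user_id) AS order_count
FROM sim_bp_users u;

-- Join form: each table is read once; users without orders still get a count of 0.
SELECT u.email, count(o.user_id) AS order_count
FROM sim_bp_users u
LEFT JOIN sim_bp_orders o ON o.user_id = u.user_id
GROUP BY u.user_id, u.email;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;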

&lt;h2&gt;
  
  
  The Buffers line
&lt;/h2&gt;

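&lt;p&gt;With &lt;code&gt;BUFFERS&lt;/code&gt; enabled, every node reports how many 8 KB pages it touched, with each node's count including its children (the &lt;code&gt;Nested Loop&lt;/code&gt; above shows 211 hits: 131 from the orders scan plus 80 from the users lookup):&lt;br&gt;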
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buffers: shared hit=3689
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four counters to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared hit=N&lt;/code&gt;&lt;/strong&gt; — pages found in &lt;code&gt;shared_buffers&lt;/code&gt; (PostgreSQL's cache). No I/O system calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared read=N&lt;/code&gt;&lt;/strong&gt; — pages the backend had to read into &lt;code&gt;shared_buffers&lt;/code&gt; via a &lt;code&gt;read()&lt;/code&gt; system call. Whether the OS page cache satisfied the read without touching disk is invisible to EXPLAIN — these show up as reads regardless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared dirtied=N&lt;/code&gt;&lt;/strong&gt; — pages the query modified in cache. Common with DML; in a read-only SELECT, usually comes from hint-bit updates or cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared written=N&lt;/code&gt;&lt;/strong&gt; — pages written back out during this node's execution. Usually this is the backend itself being forced to evict dirty pages to make room for new ones, not the background writer — so a high &lt;code&gt;written&lt;/code&gt; count means your query is doing someone else's work because the dirty-page pool was already full.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also &lt;code&gt;local hit/read/dirtied/written&lt;/code&gt; for per-session temporary tables, and &lt;code&gt;temp read=N written=N&lt;/code&gt; for work files produced by sorts and hash joins that spilled.&lt;/p&gt;

&lt;p&gt;A plan reporting &lt;code&gt;shared read=2016, temp written=2051&lt;/code&gt; is telling you two things: some of the pages it needed were not in &lt;code&gt;shared_buffers&lt;/code&gt;, &lt;em&gt;and&lt;/em&gt; the query itself is generating its own on-disk temp files because some operation (sort, hash join, hash aggregate) exceeded &lt;code&gt;work_mem&lt;/code&gt;. An oversized bitmap behaves differently: it turns lossy rather than writing temp files. Both problems are fixable; both hurt.&lt;/p&gt;
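
&lt;p&gt;Because each buffer is one page, the counters convert to bytes by multiplying by the block size. A small sketch using the figures quoted above and assuming the default 8 kB block size:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Confirm the block size; 8192 on a default build.
SHOW block_size;

-- Convert page counts to sizes, assuming 8192-byte pages.
SELECT pg_size_pretty(2016 * 8192::bigint) AS shared_read,   -- roughly 16 MB
       pg_size_pretty(2051 * 8192::bigint) AS temp_written;  -- roughly 16 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;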

&lt;h2&gt;
  
  
  A harder plan: the HashAggregate spill
&lt;/h2&gt;

&lt;p&gt;Here's a plan with more going on — "the twenty users with the most pending-or-shipped orders." Against the same 500,000-row orders table and 200,000-row users table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42281&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;42282&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;141&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3737&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1320&lt;/span&gt; &lt;span class="n"&gt;written&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2051&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Sort&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42281&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;42722&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;81&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;406&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;664&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;406&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;667&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="n"&gt;heapsort&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;HashAggregate&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;33757&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;37588&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;347&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;354&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;392&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;138&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;117060&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Group&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
              &lt;span class="n"&gt;Planned&lt;/span&gt; &lt;span class="n"&gt;Partitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="n"&gt;Batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8241&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;  &lt;span class="n"&gt;Disk&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6920&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
              &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="k"&gt;Join&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7932&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21080&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;974&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;285&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;175263&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;9939&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;018&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;809&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;175263&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{pending,shipped}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[]))&lt;/span&gt;
                          &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;324737&lt;/span&gt;
                    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Hash&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;864&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;865&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;Buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;131072&lt;/span&gt;  &lt;span class="n"&gt;Batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6822&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
                          &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;735&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;218&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;107&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;215&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;408 ms. Let's read it top-down and find where the time actually goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root: &lt;code&gt;Limit&lt;/code&gt; + &lt;code&gt;Sort&lt;/code&gt;.&lt;/strong&gt; The Sort is &lt;code&gt;top-N heapsort, Memory: 26 kB&lt;/code&gt; — fine. Under the &lt;code&gt;LIMIT 20&lt;/code&gt;, a top-N sort is almost free regardless of input size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;HashAggregate&lt;/code&gt;: the first red flag.&lt;/strong&gt; The Group Key is &lt;code&gt;u.email&lt;/code&gt;; the aggregate is a &lt;code&gt;count(*)&lt;/code&gt; across the 175k joined rows. Two numbers jump out: &lt;code&gt;Planned Partitions: 4  Batches: 5&lt;/code&gt; and &lt;code&gt;Memory Usage: 8241 kB  Disk Usage: 6920 kB&lt;/code&gt;. PostgreSQL 13+ can spill a &lt;code&gt;HashAggregate&lt;/code&gt; to disk when the hash table exceeds its memory budget (&lt;code&gt;work_mem&lt;/code&gt; × &lt;code&gt;hash_mem_multiplier&lt;/code&gt;): the executor detects that not all groups will fit in memory, writes the input rows for groups that don't fit out to per-partition spill files, and processes those partitions in later passes. The exact number of spill-and-resume cycles isn't something you should read literally from the &lt;code&gt;Batches&lt;/code&gt; count, but the presence of &lt;code&gt;Disk Usage&lt;/code&gt; at all is the signal: this query is paying for temp file I/O on every run. The &lt;code&gt;temp written=2051&lt;/code&gt; buffer count at the top (about 16 MB at 8 kB per block) includes this spill; 6920 kB is roughly 865 of those blocks, and the remainder is consistent with the Hash Join batches below. The join beneath this node has emitted its last row by about 285 ms, so most of the remaining ~120 ms of the 408 ms total is spent here.&lt;/p&gt;
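
&lt;p&gt;To check whether spills like this also happen outside a single &lt;code&gt;EXPLAIN&lt;/code&gt; run, one option is to watch the cumulative temp-file counters. This is a minimal sketch, assuming you can read the standard &lt;code&gt;pg_stat_database&lt;/code&gt; view; the counters accumulate since the last statistics reset, so compare two readings taken some minutes apart rather than reading one value in isolation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cumulative temp-file activity for the current database.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
WHERE datname = current_database();

-- Optionally log every temp file this session creates (0 = no size threshold).
-- Changing log_temp_files normally needs superuser or an explicitly granted privilege.
SET log_temp_files = 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;temp_bytes&lt;/code&gt; climbs steadily while the dashboard is running, the spill is a recurring cost rather than a one-off.&lt;/p&gt;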

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Hash Join&lt;/code&gt; + &lt;code&gt;Hash&lt;/code&gt; child: the second red flag.&lt;/strong&gt; &lt;code&gt;Buckets: 131072  Batches: 2  Memory Usage: 6822 kB&lt;/code&gt;. The hash table built from &lt;code&gt;sim_bp_users&lt;/code&gt; needed about 13 MB (the build side is 200k rows at ~64 bytes each) and didn't fit in the hash memory budget, which is &lt;code&gt;work_mem&lt;/code&gt; × &lt;code&gt;hash_mem_multiplier&lt;/code&gt;: 8 MB with a 4 MB &lt;code&gt;work_mem&lt;/code&gt; and the default multiplier of 2.0. When a hash join spills, PostgreSQL partitions &lt;em&gt;both&lt;/em&gt; sides by a hash of the join key and processes one matched pair of partitions at a time, so each probe row is tested only against its matching partition, not against every batch. The cost is the extra I/O of writing the build and probe sides to per-partition temp files and reading them back.&lt;/p&gt;
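
&lt;p&gt;Hash-based nodes are limited by &lt;code&gt;work_mem&lt;/code&gt; multiplied by &lt;code&gt;hash_mem_multiplier&lt;/code&gt; (default 2.0 since PostgreSQL 15), not by &lt;code&gt;work_mem&lt;/code&gt; alone. A quick way to see the budget a session is actually working with might look like the sketch below; the column alias is illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW work_mem;
SHOW hash_mem_multiplier;

-- Effective memory limit for Hash and HashAggregate nodes in this session.
SELECT pg_size_pretty(
         pg_size_bytes(current_setting('work_mem'))
         * current_setting('hash_mem_multiplier')::numeric
       ) AS hash_memory_budget;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing that figure with the &lt;code&gt;Memory Usage&lt;/code&gt; and &lt;code&gt;Batches&lt;/code&gt; values in the plan shows how far over budget the build side is.&lt;/p&gt;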

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Seq Scan on sim_bp_orders&lt;/code&gt;.&lt;/strong&gt; 175k rows returned, 325k removed by filter (total = 500k, the whole table). The filter is &lt;code&gt;status IN ('pending', 'shipped')&lt;/code&gt;. There is no index on &lt;code&gt;status&lt;/code&gt;, so the whole table is scanned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Seq Scan on sim_bp_users&lt;/code&gt;.&lt;/strong&gt; 200k rows returned, no filter, because we need all users. It reads 2016 pages from outside shared buffers (&lt;code&gt;shared read=2016&lt;/code&gt;), which may mean the OS cache or the storage layer; either way, the users table is mostly cold in PostgreSQL's own cache.&lt;/p&gt;

&lt;p&gt;The bottleneck order, from biggest to smallest: HashAggregate spill, Hash Join build-side batches, Seq Scans. Three different fixes are plausible, and which one is appropriate depends on how often this query runs, how much &lt;code&gt;work_mem&lt;/code&gt; the rest of the workload can tolerate, and whether the data is append-mostly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raise &lt;code&gt;work_mem&lt;/code&gt; per-session&lt;/strong&gt; to ~20 MB so both the HashAggregate and the Hash Join stay in memory. Caveat: &lt;code&gt;work_mem&lt;/code&gt; is a limit that applies to each sort or hash node in each running query, not a per-connection total, so raising it globally multiplies by the number of such nodes active across all concurrent queries. Set it per-role (&lt;code&gt;ALTER ROLE dashboard SET work_mem = '20MB'&lt;/code&gt;) or per-session in the dashboard's connection pool, not cluster-wide.&lt;/li&gt;
&lt;li&gt;
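&lt;strong&gt;Index &lt;code&gt;sim_bp_orders.status&lt;/code&gt;&lt;/strong&gt; so the scan becomes a Bitmap or Index Scan instead of reading all 500k rows. At ~35% selectivity a plain btree on &lt;code&gt;status&lt;/code&gt; alone is unlikely to beat a seq scan by much, but a partial index or a multi-column &lt;code&gt;(status, user_id)&lt;/code&gt; index has a better chance, because the join needs only &lt;code&gt;user_id&lt;/code&gt; from this table and could be served by an index-only scan.&lt;/li&gt;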
&lt;li&gt;
&lt;strong&gt;Materialise the aggregate&lt;/strong&gt; into a small summary table refreshed on a schedule or via triggers, if the query is a dashboard that runs every 10 seconds and the underlying data is append-mostly.&lt;/li&gt;
&lt;/ul&gt;
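
&lt;p&gt;As a rough sketch, the three options could look like this in SQL. The index and view names are hypothetical, and the &lt;code&gt;SELECT&lt;/code&gt; inside the materialised view is reconstructed from the plan above rather than copied from the original query, so treat it as an assumption about the query's shape.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Option 1: raise work_mem for one transaction only.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL work_mem = '20MB';
-- ... run the dashboard query here ...
COMMIT;

-- Option 2: a partial index holding only the statuses the query filters on.
-- CONCURRENTLY avoids blocking writes but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_sim_bp_orders_open_user_id
    ON sim_bp_orders (user_id)
    WHERE status IN ('pending', 'shipped');

-- Option 3: precompute the aggregate into a materialised view.
CREATE MATERIALIZED VIEW open_orders_per_user AS
SELECT u.email,
       count(*) AS open_order_count
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status IN ('pending', 'shipped')
GROUP BY u.email;

-- REFRESH ... CONCURRENTLY requires a unique index on the materialised view.
CREATE UNIQUE INDEX ON open_orders_per_user (email);
REFRESH MATERIALIZED VIEW CONCURRENTLY open_orders_per_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each option should be verified with its own &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; run before and after, since the planner may or may not pick up the new index at this selectivity.&lt;/p&gt;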

&lt;p&gt;A fair DBA answer is "measure each fix in isolation and pick based on the workload" — not any specific prescribed order. If the query runs once a day in a reporting job, the &lt;code&gt;work_mem&lt;/code&gt; bump is cheapest; if it runs constantly and powers a UI, the materialised result wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five most common mistakes in reading plans
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparing &lt;code&gt;rows&lt;/code&gt; without multiplying by &lt;code&gt;loops&lt;/code&gt;.&lt;/strong&gt; Both &lt;code&gt;rows&lt;/code&gt; and &lt;code&gt;actual time&lt;/code&gt; are per-loop averages. A node reporting &lt;code&gt;rows=1 loops=50000&lt;/code&gt; produced 50,000 rows in total. A node reporting &lt;code&gt;rows=50000 loops=1&lt;/code&gt; produced the same 50,000 rows in a very different shape. Always look at &lt;code&gt;loops&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Looking at top-line cost/time and calling it a day.&lt;/strong&gt; The top-line number tells you the query is slow; it doesn't tell you &lt;em&gt;which node&lt;/em&gt; is slow. Scan the tree for the node with the highest &lt;code&gt;actual time × loops&lt;/code&gt; after subtracting its children's time, because each node's figure includes everything beneath it. That's where the time is spent, and usually where the fix is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trusting the planner's estimates when &lt;code&gt;actual rows&lt;/code&gt; disagrees.&lt;/strong&gt; If &lt;code&gt;rows=15&lt;/code&gt; on the estimate and &lt;code&gt;actual rows=8000&lt;/code&gt;, every downstream decision was built on the wrong premise. Don't try to understand why the plan is shaped the way it is until you've fixed the estimate (usually with &lt;code&gt;ANALYZE&lt;/code&gt; or extended statistics).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing the &lt;code&gt;Rows Removed by Filter&lt;/code&gt; line.&lt;/strong&gt; A &lt;code&gt;Seq Scan&lt;/code&gt; returning a reasonable number of rows looks fine — until you notice the filter line says ten million rows were read and discarded to produce those few. The scan was fine; the cost is in the discard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignoring the &lt;code&gt;Buffers&lt;/code&gt; line.&lt;/strong&gt; Two plans can have identical shapes and wildly different performance if one hits cache and the other doesn't. &lt;code&gt;shared hit=5&lt;/code&gt; means "hot"; &lt;code&gt;shared read=50000&lt;/code&gt; means "the storage layer did all the work, and next time it might be even worse." The &lt;code&gt;Buffers&lt;/code&gt; line is the only way to see this without looking at the timing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
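
&lt;p&gt;Two of these mistakes come down to not asking PostgreSQL for the right information. The sketch below shows one way to request buffer counters and to refresh or extend statistics; the query is only an illustration against the &lt;code&gt;sim_bp_orders&lt;/code&gt; table, and the statistics object name and its column choice are hypothetical. Note that &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; actually executes the statement.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- BUFFERS adds the shared hit/read and temp read/written counters (mistake 5).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders
WHERE status = 'pending';

-- Refresh per-column statistics when estimated and actual rows diverge (mistake 3).
ANALYZE sim_bp_orders;

-- Extended statistics for columns whose values are correlated.
-- They are only populated by the next ANALYZE.
CREATE STATISTICS sim_bp_orders_status_user_stats (dependencies, ndistinct)
    ON status, user_id
    FROM sim_bp_orders;
ANALYZE sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;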

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;If the first plan in this article (the nested loop) looked straightforward and the second (the HashAggregate spill) made sense, you've mostly got it. The rest of the series digs into specific bottleneck categories — missing indexes, join-strategy mistakes, aggregate spills, non-sargable WHERE clauses — and what to do about each. The next piece is &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;PostgreSQL Index Usage and Optimization&lt;/a&gt;.&lt;/p&gt;


&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-explain-analyze-reading&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Complete Guide to PostgreSQL SQL Query Analysis &amp; Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:00:06 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/the-complete-guide-to-postgresql-sql-query-analysis-optimization-3lbe</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/the-complete-guide-to-postgresql-sql-query-analysis-optimization-3lbe</guid>
      <description>&lt;p&gt;Most PostgreSQL performance work is wasted because it starts from the wrong end. Someone notices a slow query, skim-reads &lt;code&gt;EXPLAIN&lt;/code&gt;, pattern-matches to "missing index," adds one, and moves on. Sometimes that works. Often it doesn't — and when it doesn't, the next attempt is usually an even blunter instrument: "just add more RAM," "just use a read replica," "just cache it."&lt;/p&gt;

&lt;p&gt;This guide is a systematic alternative. The argument is that a large fraction of single-query latency problems in OLTP workloads fall into one of a small number of bottleneck categories, each with a characteristic EXPLAIN signature and a well-understood fix. (Lock contention, vacuum bloat, replication lag, and the generic-plan vs custom-plan behaviour of prepared statements are real and common, but they are cluster-level or protocol-level problems rather than single-plan problems; this guide is strictly about the latter.) If you can name the category in sixty seconds of reading the plan, the fix usually follows in minutes.&lt;/p&gt;

&lt;p&gt;We'll work through the full workflow end-to-end on a real query against a real PostgreSQL 17 database, then map the eight bottleneck categories to the eight deep-dive articles that make up this series. Every EXPLAIN snippet below is captured from an actual run against a 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table on a Neon Postgres 17.8 database — not a synthetic example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the EXPLAIN plan&lt;/strong&gt; — specifically the three signals that matter most: estimated-vs-actual row counts, access path at each scan node, and where time is actually spent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categorise the bottleneck&lt;/strong&gt; — translate the plan signals into one of eight categories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply the matching fix&lt;/strong&gt; — index, rewrite, tune memory, or restructure the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with a second EXPLAIN&lt;/strong&gt; — before/after is how you know you actually fixed something.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The rest of this article walks through each step on a concrete example.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical slow query
&lt;/h2&gt;

&lt;p&gt;Our running example is a dashboard query: "show me the fifty highest-value pending orders."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table is 500,000 rows, with roughly 20% in &lt;code&gt;status = 'pending'&lt;/code&gt;. There's a primary key on &lt;code&gt;order_id&lt;/code&gt;, indexes on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;, but &lt;strong&gt;no index on &lt;code&gt;status&lt;/code&gt; or &lt;code&gt;total_amount_cents&lt;/code&gt;&lt;/strong&gt;. We've disabled parallel execution (&lt;code&gt;SET max_parallel_workers_per_gather = 0&lt;/code&gt;) for this example so the plan reads cleanly. Here's the plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13270&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;13271&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;873&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;883&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Sort&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13270&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;13521&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;871&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;877&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="n"&gt;heapsort&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;9939&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;011&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;781&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100252&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;399748&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;073&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;908&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fifty-one milliseconds is not a disaster on its own. It's the kind of number that gets shrugged at until a hundred of these queries run concurrently on a busy application server, at which point CPU saturates and every request starts stacking.&lt;/p&gt;
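
&lt;p&gt;For reference, a plan like the one above can be captured with something along these lines. The &lt;code&gt;BUFFERS&lt;/code&gt; option is an assumption inferred from the &lt;code&gt;Buffers:&lt;/code&gt; lines in the output; the query and the parallelism setting are the ones already described.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Disable parallel workers for this session so the plan stays single-process.
SET max_parallel_workers_per_gather = 0;

-- ANALYZE executes the query and reports real timings and row counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;

-- Restore the configured default afterwards.
RESET max_parallel_workers_per_gather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;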

&lt;h2&gt;
  
  
  Step 1 — Read the plan
&lt;/h2&gt;

&lt;p&gt;Three signals tell you nearly everything about a plan node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 1: how the table is accessed.&lt;/strong&gt; At the leaf of this plan is &lt;code&gt;Seq Scan on sim_bp_orders&lt;/code&gt;. A sequential scan means the planner's cost model decided reading every row was cheaper than any available index: sometimes because no useful index exists, sometimes because existing indexes don't match the query shape, occasionally because statistics are misleading the cost estimate. On small tables, or when the query needs a large fraction of the table anyway, a seq scan is often genuinely the cheapest plan. But on a 500k-row table where the filter discards 80% of the rows and the query ends in &lt;code&gt;ORDER BY ... LIMIT 50&lt;/code&gt;, it's the wrong shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 2: rows removed by filter.&lt;/strong&gt; &lt;code&gt;Rows Removed by Filter: 399748&lt;/code&gt;, with &lt;code&gt;rows=100252&lt;/code&gt; actually matching. The scan touched every row in the table. A filter selectivity of ~20% is not pathological by itself, but it means roughly 400,000 rows are read and discarded every time the dashboard refreshes. An index on the filter column would let PostgreSQL skip them entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 3: planner estimate vs reality.&lt;/strong&gt; &lt;code&gt;rows=100300&lt;/code&gt; estimated vs &lt;code&gt;rows=100252&lt;/code&gt; actual. Essentially perfect. If the ratio had been ten-to-one or worse in either direction, the plan would be built on bad assumptions and &lt;code&gt;ANALYZE&lt;/code&gt; would be the first move. Here, statistics are healthy.&lt;/p&gt;
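
&lt;p&gt;When the estimate and the actual count do disagree, it can help to look at what the planner believes about the column before re-running &lt;code&gt;ANALYZE&lt;/code&gt;. This is a minimal sketch using the standard &lt;code&gt;pg_stats&lt;/code&gt; and &lt;code&gt;pg_stat_user_tables&lt;/code&gt; views; it assumes the table name is unique across schemas, so add a &lt;code&gt;schemaname&lt;/code&gt; filter if it is not.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The value frequencies the planner uses to estimate status = 'pending'.
SELECT most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'sim_bp_orders'
  AND attname = 'status';

-- When the table was last analyzed, manually or by autovacuum.
SELECT last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'sim_bp_orders';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A frequency near 0.2 for &lt;code&gt;pending&lt;/code&gt; would match the ~20% share described above and explain why the estimate lands so close.&lt;/p&gt;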

&lt;p&gt;There's a fourth thing worth naming: the &lt;code&gt;Sort&lt;/code&gt; node above the scan is a &lt;code&gt;top-N heapsort&lt;/code&gt;. Unlike a full sort, a top-N heapsort streams all input rows through a heap of size N (50 here), so it reads all 100,252 pending rows but only ever holds 50 in memory. That's why &lt;code&gt;Memory: 30kB&lt;/code&gt; is so small. Even so, it's 100,252 rows of unnecessary work: an index on &lt;code&gt;(total_amount_cents DESC) WHERE status = 'pending'&lt;/code&gt; would let the planner walk the index from the largest value downward and stop after fifty entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Categorise the bottleneck
&lt;/h2&gt;

&lt;p&gt;Once you've read the plan, map what you see to one of eight categories. Each category has a characteristic signature; each maps to a deep-dive article in this series.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan signal&lt;/th&gt;
&lt;th&gt;Bottleneck category&lt;/th&gt;
&lt;th&gt;Fix article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You can't even read the plan confidently&lt;/td&gt;
&lt;td&gt;Plan literacy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;Reading EXPLAIN / EXPLAIN ANALYZE Output&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Seq Scan&lt;/code&gt; with large row counts, many &lt;code&gt;Rows Removed by Filter&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Missing or wrong index&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;Index Usage &amp;amp; Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested Loop joining large tables; Hash Join spilling to disk&lt;/td&gt;
&lt;td&gt;Join strategy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;Join Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CTE Scan&lt;/code&gt; feeding a filter; &lt;code&gt;SubPlan&lt;/code&gt; running per outer row&lt;/td&gt;
&lt;td&gt;Subquery / CTE structure&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;HashAggregate&lt;/code&gt; or &lt;code&gt;Sort&lt;/code&gt; spilling, expensive window functions&lt;/td&gt;
&lt;td&gt;Aggregate or window tuning&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-aggregate-window-tuning" rel="noopener noreferrer"&gt;Aggregate &amp;amp; Window Function Tuning&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index exists but isn't being used; function on indexed column&lt;/td&gt;
&lt;td&gt;WHERE clause shape&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan is "fine" but the query itself is the problem&lt;/td&gt;
&lt;td&gt;Query rewriting&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;Query Rewriting Techniques&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SELECT *&lt;/code&gt;, implicit casts, deep pagination, N+1 from ORM&lt;/td&gt;
&lt;td&gt;Anti-pattern&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-query-anti-patterns" rel="noopener noreferrer"&gt;Anti-Patterns &amp;amp; Common Mistakes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our example plan maps cleanly. A &lt;code&gt;Seq Scan&lt;/code&gt; with 400,000 rows removed by filter, sitting under an &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; that can't exploit any existing index, is the textbook signature for the &lt;em&gt;Missing or wrong index&lt;/em&gt; category. The &lt;code&gt;Sort&lt;/code&gt; above it is solvable in the same stroke — a single partial index can eliminate both the scan and the sort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Apply the fix
&lt;/h2&gt;

&lt;p&gt;The fix is a partial index, with the non-filter columns tucked into &lt;code&gt;INCLUDE&lt;/code&gt; so the planner can serve the query from the index alone without touching the heap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_pending_by_amount&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four non-obvious choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial index.&lt;/strong&gt; Only pending orders are indexed, because that's the only status the query cares about. A full index on &lt;code&gt;(status, total_amount_cents)&lt;/code&gt; would work too; it would contain roughly 5× more entries. Partial indexes only help queries whose WHERE clause implies the index's predicate — so if this dashboard later adds &lt;code&gt;WHERE status IN ('pending', 'processing')&lt;/code&gt;, the planner will skip this index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sort direction in the index.&lt;/strong&gt; Specifying &lt;code&gt;total_amount_cents DESC&lt;/code&gt; means the planner can scan the btree in the direction that produces rows in the needed order without an explicit &lt;code&gt;Sort&lt;/code&gt; node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiebreaker.&lt;/strong&gt; In a real dashboard you'd almost always want a tiebreaker column — &lt;code&gt;ORDER BY total_amount_cents DESC&lt;/code&gt; isn't deterministic for ties, and two rows with equal totals would shuffle between pages; adding &lt;code&gt;, order_id DESC&lt;/code&gt; to both the index and the query fixes that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Covering (&lt;code&gt;INCLUDE&lt;/code&gt;).&lt;/strong&gt; The SELECT list is satisfied entirely from index tuples, which lets the planner serve the query as an Index Only Scan without heap fetches. Index Only Scan also requires the visibility map to mark the relevant heap pages all-visible, so on a write-heavy table where autovacuum can't keep up, you may still see heap fetches even with a covering index.&lt;/li&gt;
&lt;/ul&gt;
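
&lt;p&gt;As a rough sketch of the tiebreaker variant, the index and the matching query might look like the following. The index name, the SELECT list, and the &lt;code&gt;LIMIT 50&lt;/code&gt; page size are assumptions for illustration; the table and column names are the ones used in this article. Because &lt;code&gt;order_id&lt;/code&gt; becomes a key column, it no longer needs to appear in &lt;code&gt;INCLUDE&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index name; order_id moves from INCLUDE into the key as a tiebreaker
CREATE INDEX CONCURRENTLY idx_sim_bp_orders_pending_by_amount_id
    ON sim_bp_orders (total_amount_cents DESC, order_id DESC)
    INCLUDE (user_id, created_at)
    WHERE status = 'pending';

-- The ORDER BY has to list the same columns in the same directions as the index key
SELECT order_id, user_id, created_at, total_amount_cents
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC, order_id DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;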

&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; avoids the &lt;code&gt;SHARE&lt;/code&gt; lock that a plain &lt;code&gt;CREATE INDEX&lt;/code&gt; holds for the whole build and that blocks every write to the table, so normal reads and writes continue while the index builds. It still holds a weaker &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock, waits for transactions that hold old snapshots to finish before advancing between phases, and runs two passes over the table. As a result it's slower than &lt;code&gt;CREATE INDEX&lt;/code&gt; in wall-clock time, and a single long-running transaction can stall the build indefinitely. On a 500k-row table the build takes seconds; on a 500M-row table it can take hours. The application stays up the whole time. A partial index on &lt;code&gt;status = 'pending'&lt;/code&gt; still pays write cost when rows are inserted into or updated out of that state, so if &lt;code&gt;pending&lt;/code&gt; is a high-churn status, weigh the read win against the write overhead.&lt;/p&gt;
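
&lt;p&gt;Two checks are worth keeping at hand around a concurrent build. A concurrent build that fails or is cancelled leaves an invalid index behind, and open transactions are what the build waits on. The queries below are a minimal sketch against the standard catalogs; the &lt;code&gt;LIMIT 5&lt;/code&gt; is an arbitrary choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Indexes marked invalid, for example after a failed or cancelled concurrent build
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- Oldest open transactions, the usual suspects when a build appears stuck
SELECT pid, state, xact_start, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An invalid index still costs writes but is never used for reads, so drop it with &lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; before retrying the build.&lt;/p&gt;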

&lt;h2&gt;
  
  
  Step 4 — Verify
&lt;/h2&gt;

&lt;p&gt;Same query, same data, index in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;021&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;031&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Only&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_pending_by_amount&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3544&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;68&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;021&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;026&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Fetches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;186&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;045&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0.045 ms&lt;/strong&gt;, down from 50.9 ms: roughly 1100× faster. Buffers dropped from 3,689 hit to 5 hit. The &lt;code&gt;Sort&lt;/code&gt; node is gone entirely: the index is already sorted in the right order. The &lt;code&gt;Filter&lt;/code&gt; line is gone: the partial index guarantees every row it contains already satisfies &lt;code&gt;status = 'pending'&lt;/code&gt;. &lt;code&gt;Heap Fetches: 0&lt;/code&gt; means the visibility map marked every heap page behind the index entries we read as all-visible, so PostgreSQL served all 50 tuples from the index without reading a single heap page.&lt;/p&gt;

&lt;p&gt;Two caveats on the headline number. First, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;'s &lt;code&gt;Execution Time&lt;/code&gt; measures server-side SQL execution only: it excludes network round-trip, client-side tuple deserialisation, and connection pool overhead. Real application latency for this query is probably closer to 2–10 ms depending on your region. Second, the measurement is on a hot cache with an immediately-post-vacuum visibility map; a colder cache would show &lt;code&gt;Buffers: shared read=N&lt;/code&gt; instead of all &lt;code&gt;hit&lt;/code&gt;. The meaningful improvement is the ~700× drop in buffer accesses (3,689 to 5), because that's what translates into lower CPU under concurrency.&lt;/p&gt;
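
&lt;p&gt;One way to judge how far the &lt;code&gt;Heap Fetches: 0&lt;/code&gt; result will hold up on your own system is to look at how much of the table the visibility map currently marks all-visible. This is only a sketch: &lt;code&gt;relpages&lt;/code&gt; and &lt;code&gt;relallvisible&lt;/code&gt; are estimates that &lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;ANALYZE&lt;/code&gt; refresh, not live counters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Share of heap pages marked all-visible as of the last VACUUM / ANALYZE
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'sim_bp_orders';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A low percentage on a write-heavy table suggests Index Only Scans will still visit the heap for many rows.&lt;/p&gt;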

&lt;h2&gt;
  
  
  The eight categories
&lt;/h2&gt;

&lt;p&gt;The workflow above treats "spot the category" as a two-sentence step. In practice, each category has its own rules, exceptions, and non-obvious variants. The rest of this series is eight standalone articles, each diving into one category.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reading EXPLAIN / EXPLAIN ANALYZE output
&lt;/h3&gt;

&lt;p&gt;Before you can optimise a plan, you have to be able to read one. EXPLAIN reports the planner's estimated plan; EXPLAIN ANALYZE executes the query and reports what actually happened. The deep dive covers every common node type, the meaning of &lt;code&gt;loops&lt;/code&gt;, &lt;code&gt;Buffers&lt;/code&gt;, &lt;code&gt;Memory&lt;/code&gt;, &lt;code&gt;Workers Planned vs Launched&lt;/code&gt;, and the five most common ways to misread a plan. → &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;Reading EXPLAIN / EXPLAIN ANALYZE Output&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Index usage and optimisation
&lt;/h3&gt;

&lt;p&gt;A large share of single-query OLTP latency problems come down to indexing — either missing, or present but not matching the query shape. But "add an index" understates what's actually required: choosing columns in the right order, deciding between full and partial indexes, using &lt;code&gt;INCLUDE&lt;/code&gt; for covering indexes, expression indexes for computed predicates, GIN/GiST/BRIN for the data types where btrees are wrong, and knowing when &lt;em&gt;not&lt;/em&gt; to add one. → &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;Index Usage &amp;amp; Optimisation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Join optimisation
&lt;/h3&gt;

&lt;p&gt;The planner picks between Nested Loop, Hash Join, and Merge Join based on cost estimates. Each has a regime where it's best, and the worst joins are the ones using the wrong strategy — usually a Nested Loop on two large tables. → &lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;Join Optimisation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Subquery and CTE optimisation
&lt;/h3&gt;

&lt;p&gt;PostgreSQL 12 changed CTE semantics: a CTE that used to always be materialised is now inlined by default when it is non-recursive, free of side effects, and referenced only once, unless you ask for materialisation explicitly. That change made many old "CTE as optimisation fence" tricks silently stop working. → &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;.&lt;/p&gt;
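
&lt;p&gt;To make the difference concrete, here is a small sketch using the &lt;code&gt;sim_bp_orders&lt;/code&gt; table from the worked example. The CTE name and the &lt;code&gt;user_id = 42&lt;/code&gt; filter are made up for illustration; the &lt;code&gt;MATERIALIZED&lt;/code&gt; and &lt;code&gt;NOT MATERIALIZED&lt;/code&gt; keywords are available from PostgreSQL 12 onwards:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Old behaviour on request: the CTE is computed once and acts as an optimisation fence
WITH pending AS MATERIALIZED (
    SELECT order_id, user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT * FROM pending WHERE user_id = 42;

-- Inlining on request: the outer user_id filter can be pushed down into the scan
WITH pending AS NOT MATERIALIZED (
    SELECT order_id, user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT * FROM pending WHERE user_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing the two with &lt;code&gt;EXPLAIN&lt;/code&gt; should show a &lt;code&gt;CTE Scan&lt;/code&gt; node only in the first form.&lt;/p&gt;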

&lt;h3&gt;
  
  
  5. Aggregate and window function tuning
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt; and window functions look declarative, but the planner has strong opinions about how to execute them: HashAggregate versus GroupAggregate, partial and parallel aggregation, window frame optimisation. Sorts and hashes that spill to disk are almost always the visible symptom, and &lt;code&gt;work_mem&lt;/code&gt; is almost always the knob. → &lt;a href="https://mydba.dev/blog/postgres-aggregate-window-tuning" rel="noopener noreferrer"&gt;Aggregate &amp;amp; Window Function Tuning&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. WHERE clause optimisation
&lt;/h3&gt;

&lt;p&gt;An index is only useful if the WHERE clause is &lt;em&gt;sargable&lt;/em&gt;. Wrapping an indexed column in a function (&lt;code&gt;lower(email) = '...'&lt;/code&gt;), comparing it against a value of a different type so that the column itself gets cast (&lt;code&gt;integer_column = 123.0&lt;/code&gt;), or doing arithmetic on the column instead of on the constant all silently disable indexes that look like they should apply. → &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;.&lt;/p&gt;
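
&lt;p&gt;For the function-wrapping case, one common remedy is an expression index that matches the predicate exactly. The &lt;code&gt;users&lt;/code&gt; table, its &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;email&lt;/code&gt; columns, and the index name below are hypothetical and only serve as a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table and index names
-- A plain index on (email) cannot serve a lower(email) predicate; this one can
CREATE INDEX idx_users_email_lower ON users (lower(email));

-- The predicate must use the same expression as the index definition
SELECT user_id, email
FROM users
WHERE lower(email) = 'alice@example.com';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;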

&lt;h3&gt;
  
  
  7. Query rewriting techniques
&lt;/h3&gt;

&lt;p&gt;Sometimes the plan is "fine" but the query itself is asking the wrong question. Correlated subqueries can usually become lateral joins; &lt;code&gt;NOT IN&lt;/code&gt; with NULLs should be &lt;code&gt;NOT EXISTS&lt;/code&gt;; offset pagination past a few hundred pages should be keyset pagination; &lt;code&gt;DISTINCT&lt;/code&gt; over a large set is often &lt;code&gt;GROUP BY&lt;/code&gt; in disguise. → &lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;Query Rewriting Techniques&lt;/a&gt;.&lt;/p&gt;
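
&lt;p&gt;The pagination rewrite is the easiest of these to show. The sketch below reuses &lt;code&gt;sim_bp_orders&lt;/code&gt; and assumes an index on &lt;code&gt;(total_amount_cents DESC, order_id DESC)&lt;/code&gt;; the offset and the boundary values &lt;code&gt;(12500, 987654)&lt;/code&gt; are invented and stand for the last row of the previous page:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Offset pagination: the server produces and discards 10,000 rows before returning 50
SELECT order_id, total_amount_cents
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC, order_id DESC
LIMIT 50 OFFSET 10000;

-- Keyset pagination: start right after the last row already shown
-- (12500, 987654) is a placeholder for that row's total_amount_cents and order_id
SELECT order_id, total_amount_cents
FROM sim_bp_orders
WHERE status = 'pending'
  AND (total_amount_cents, order_id) &amp;lt; (12500, 987654)
ORDER BY total_amount_cents DESC, order_id DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The row comparison works because both sort columns run in the same direction; mixed directions need a more verbose predicate.&lt;/p&gt;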

&lt;h3&gt;
  
  
  8. Anti-patterns and common mistakes
&lt;/h3&gt;

&lt;p&gt;The final category is queries that are wrong by construction: &lt;code&gt;SELECT *&lt;/code&gt; in hot paths, implicit type casts that silently disable indexes, missing &lt;code&gt;LIMIT&lt;/code&gt; on exploratory joins, N+1 patterns coming out of ORMs, inserting one row at a time instead of batching. → &lt;a href="https://mydba.dev/blog/postgres-query-anti-patterns" rel="noopener noreferrer"&gt;Anti-Patterns &amp;amp; Common Mistakes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you can already read EXPLAIN confidently, the highest-value articles are probably &lt;strong&gt;Index Usage&lt;/strong&gt; and &lt;strong&gt;Query Rewriting&lt;/strong&gt;, because those are where the largest wins hide. If reading the plans in this article felt like work, start with &lt;strong&gt;Reading EXPLAIN&lt;/strong&gt; and come back here.&lt;/p&gt;

&lt;p&gt;Slow queries are not mysterious. They fall into a small number of categories, each with a characteristic plan signature and a well-understood fix. Learn to recognise the signatures and most of the rest follows.&lt;/p&gt;





&lt;p&gt;Canonical version with the full series linked: &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-query-analysis-complete-guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Parallel Query: Configuration &amp; Performance Tuning</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:00:02 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-parallel-query-configuration-performance-tuning-1oih</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-parallel-query-configuration-performance-tuning-1oih</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Parallel Query: Configuration &amp;amp; Performance Tuning
&lt;/h1&gt;

&lt;p&gt;Your analytical query scans a 50 GB table, aggregates 200 million rows, and takes 25 seconds. Your server has 16 CPU cores. PostgreSQL uses at most 3 of them: the leader process plus 2 workers. The other 13 sit idle. The &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; default of 2 is leaving most of that hardware unused. Let's fix that -- and understand when you should not.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Parallel Query Works
&lt;/h2&gt;

&lt;p&gt;PostgreSQL divides large operations across multiple CPU cores. Worker processes each scan a portion of the data and feed their results through a Gather node to the leader process, which combines them and, by default, also does a share of the scanning itself. Sequential scans, hash joins, aggregates, and B-tree index scans all support parallel execution.&lt;/p&gt;

&lt;p&gt;The key defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_parallel_workers_per_gather = 2&lt;/code&gt; -- max workers per parallel operation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_parallel_workers = 8&lt;/code&gt; -- total parallel workers across all sessions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_worker_processes = 8&lt;/code&gt; -- total background workers (shared with other subsystems)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;min_parallel_table_scan_size = 8MB&lt;/code&gt; -- minimum table size for parallel scan&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parallel_setup_cost = 1000&lt;/code&gt; -- planner's estimate for starting a worker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parallel_tuple_cost = 0.1&lt;/code&gt; -- per-tuple transfer cost estimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tuned for general-purpose workloads. For analytical queries on large tables, they're far too conservative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting the Problem
&lt;/h2&gt;

&lt;p&gt;Check your current settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;short_desc&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%parallel%'&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'max_worker_processes'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether queries actually use parallel workers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Gather&lt;/code&gt; or &lt;code&gt;Gather Merge&lt;/code&gt; -- parallel execution is happening&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Workers Planned: 2&lt;/code&gt; and &lt;code&gt;Workers Launched: 2&lt;/code&gt; -- how many workers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Workers Launched &amp;lt; Workers Planned&lt;/code&gt; -- system ran out of workers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check if workers are being exhausted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Currently active parallel workers&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_parallel_workers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'parallel worker'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- The limit&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'max_parallel_workers'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If active workers frequently approach the limit, queries are competing for workers and some run with fewer than planned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning for Analytical Workloads
&lt;/h2&gt;

&lt;p&gt;If your database runs analytical queries on large tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- More workers per query&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- More total workers&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enough background worker slots&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_worker_processes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Lower table size threshold&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;min_parallel_table_scan_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Apply (max_worker_processes requires restart)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
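
&lt;p&gt;Since a reload is not enough for every parameter, it can help to confirm what actually took effect. A minimal check, assuming the settings were changed with &lt;code&gt;ALTER SYSTEM&lt;/code&gt; and the configuration has been reloaded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- pending_restart is true for values that changed in the config but still need a restart
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('max_parallel_workers_per_gather',
               'max_parallel_workers',
               'max_worker_processes',
               'min_parallel_table_scan_size')
ORDER BY name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Until the restart happens, &lt;code&gt;max_parallel_workers&lt;/code&gt; is effectively limited by the old &lt;code&gt;max_worker_processes&lt;/code&gt; value.&lt;/p&gt;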



&lt;p&gt;Rule of thumb: &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; = half your CPU cores, &lt;code&gt;max_parallel_workers&lt;/code&gt; = total cores. On a 16-core server: 8 and 16 respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lower Cost Thresholds
&lt;/h3&gt;

&lt;p&gt;If medium-sized tables aren't getting parallelized despite adequate configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_setup_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;-- default: 1000&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_tuple_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;-- default: 0.1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower &lt;code&gt;parallel_setup_cost&lt;/code&gt; makes the planner consider parallelism for smaller operations. Lower &lt;code&gt;parallel_tuple_cost&lt;/code&gt; makes parallel plans look cheaper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Session Overrides
&lt;/h3&gt;

&lt;p&gt;For mixed workloads, set parallelism based on the connection type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Reporting query: maximum parallelism, scoped to one transaction&lt;/span&gt;
&lt;span class="c1"&gt;-- (SET LOCAL only takes effect inside a transaction block)&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;LOCAL&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;LOCAL&lt;/span&gt; &lt;span class="n"&gt;parallel_setup_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;LOCAL&lt;/span&gt; &lt;span class="n"&gt;parallel_tuple_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- OLTP session: disable parallelism for the whole session&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
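
&lt;p&gt;If reporting and OLTP traffic connect as different database roles, the same split can be made without touching application code by attaching the setting to the role. The role names below are hypothetical; this is a sketch of the idea rather than a recommended configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical role names; values apply to sessions started after the change
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 8;
ALTER ROLE app_user SET max_parallel_workers_per_gather = 0;

-- Inspect per-role overrides
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolconfig IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Existing connections keep their old value until they reconnect, which matters behind a connection pooler.&lt;/p&gt;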



&lt;h3&gt;
  
  
  Per-Table Settings
&lt;/h3&gt;

&lt;p&gt;For critical large tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Request 8 workers for scans on this table (still capped by max_parallel_workers_per_gather)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallel_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



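&lt;p&gt;This overrides the planner's size-based worker count calculation for the table. It is a request, not a guarantee: the count is still capped by &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;, and workers only launch if slots are free under &lt;code&gt;max_parallel_workers&lt;/code&gt;.&lt;/p&gt;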
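
&lt;p&gt;To see which tables carry such an override, and to undo one, something along these lines should work. It reads the standard &lt;code&gt;pg_class.reloptions&lt;/code&gt; column and uses the &lt;code&gt;orders&lt;/code&gt; table from the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Storage parameters set on the table, including parallel_workers if present
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders';

-- Return to the planner's automatic calculation
ALTER TABLE orders RESET (parallel_workers);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;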

&lt;h2&gt;
  
  
  What Parallelizes (and What Doesn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Parallel?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sequential Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B-tree Index Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bitmap Heap Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hash Join&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge Join&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested Loop&lt;/td&gt;
&lt;td&gt;Yes (outer side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate (count, sum, avg)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE INDEX (B-tree)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Append (UNION ALL, partitions)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UPDATE, DELETE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTE scans (materialised WITH queries)&lt;/td&gt;
&lt;td&gt;No (leader only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursors&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FOR UPDATE/SHARE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Memory Multiplication Trap
&lt;/h2&gt;

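&lt;p&gt;&lt;code&gt;work_mem&lt;/code&gt; applies per process, and to each sort or hash step within it. A query with &lt;code&gt;work_mem = 256MB&lt;/code&gt; and 4 parallel workers runs in 5 processes (the leader plus 4 workers), so a single sort or hash step can consume 5 × 256 MB = 1.28 GB. Budget accordingly:&lt;br&gt;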
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_connections * (max_parallel_workers_per_gather + 1) * work_mem &amp;lt; available RAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches people who increase parallelism without accounting for memory.&lt;/p&gt;
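
&lt;p&gt;A quick way to put a number on that budget is to compute it from the live settings. This is a deliberately pessimistic sketch: counting the leader as one extra process per query is an assumption made here, and the result ignores queries with several sort or hash steps as well as the cap that &lt;code&gt;max_parallel_workers&lt;/code&gt; puts on total workers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Worst-case work_mem demand if every connection ran one fully parallel sort or hash
SELECT pg_size_pretty(
           current_setting('max_connections')::bigint
           * (current_setting('max_parallel_workers_per_gather')::bigint + 1)  -- workers + leader
           * pg_size_bytes(current_setting('work_mem'))
       ) AS worst_case_work_mem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare the result with the RAM left over after &lt;code&gt;shared_buffers&lt;/code&gt; and the operating system cache.&lt;/p&gt;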

&lt;h2&gt;
  
  
  Verify the Impact
&lt;/h2&gt;

&lt;p&gt;Compare sequential vs parallel execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Disable parallelism for this session (SET LOCAL only works inside a transaction block)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable parallelism for this session&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Return to the configured default&lt;/span&gt;
&lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the ideal case the parallel plan takes about &lt;code&gt;sequential_time / (1 + num_workers)&lt;/code&gt;. In practice, expect 60-80% of that theoretical speedup because of worker startup and Gather overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention Strategy
&lt;/h2&gt;

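&lt;p&gt;&lt;strong&gt;OLAP databases&lt;/strong&gt;: aggressive parallelism. &lt;code&gt;max_parallel_workers_per_gather = CPU_cores / 2&lt;/code&gt;, &lt;code&gt;max_parallel_workers = CPU_cores&lt;/code&gt;, and lower cost thresholds (&lt;code&gt;parallel_setup_cost&lt;/code&gt;, &lt;code&gt;parallel_tuple_cost&lt;/code&gt;, &lt;code&gt;min_parallel_table_scan_size&lt;/code&gt;).&lt;/p&gt;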

&lt;p&gt;&lt;strong&gt;OLTP databases&lt;/strong&gt;: keep defaults or disable. Many short concurrent queries don't benefit -- worker overhead exceeds speedup on small queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed workloads&lt;/strong&gt;: per-connection settings. Reporting connections get high parallelism. App connections get zero.&lt;/p&gt;
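
&lt;p&gt;One way to get per-connection settings is to attach them to the roles that each kind of connection logs in as. This is a minimal sketch: the role names &lt;code&gt;reporting_user&lt;/code&gt; and &lt;code&gt;app_user&lt;/code&gt; and the value 8 are hypothetical placeholders, not recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical role names; substitute the roles your connections use.
-- Role-level settings take effect the next time the role starts a session.
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 8;
ALTER ROLE app_user SET max_parallel_workers_per_gather = 0;

-- After reconnecting as one of those roles, confirm the value in effect.
SHOW max_parallel_workers_per_gather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Setting it on the role keeps the policy in the database rather than in each application's connection string.&lt;/p&gt;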

&lt;p&gt;Monitor &lt;code&gt;Workers Launched&lt;/code&gt; vs &lt;code&gt;Workers Planned&lt;/code&gt; in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output. A consistent shortfall means the worker pool is exhausted: raise &lt;code&gt;max_parallel_workers&lt;/code&gt;, and &lt;code&gt;max_worker_processes&lt;/code&gt; if that is the lower of the two. If CPU hits 100% during parallel queries and other sessions slow down, reduce &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-parallel-query" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-parallel-query&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL Point-in-Time Recovery with pgBackRest</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:00:03 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-point-in-time-recovery-with-pgbackrest-1cg6</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-point-in-time-recovery-with-pgbackrest-1cg6</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Point-in-Time Recovery with pgBackRest
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; gives you a snapshot at the moment you ran it. If your last dump was 6 hours ago and someone accidentally deletes a production table, those 6 hours are gone. Even with hourly dumps, you lose everything between the last dump and the incident. For a database processing thousands of transactions per minute, that gap is devastating. Point-in-time recovery (PITR) eliminates that gap -- restoring your database to any specific second by replaying the write-ahead log on top of a base backup.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PITR Works
&lt;/h2&gt;

&lt;p&gt;Two mechanisms combine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base backups&lt;/strong&gt; -- periodic snapshots of all database files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL archiving&lt;/strong&gt; -- continuous copying of every completed WAL segment to a backup repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The WAL records every change made to the database. By replaying WAL segments from a base backup forward to a target timestamp, you reconstruct the exact state at that moment. If the last archived WAL is 30 seconds old, your maximum data loss is 30 seconds -- not 6 hours.&lt;/p&gt;

&lt;p&gt;pgBackRest is the standard tool for this. It handles base backups (full, incremental, differential), WAL archiving, retention, verification, and recovery -- with parallel compression, encryption, and remote repository support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting Whether You're Protected
&lt;/h2&gt;

&lt;p&gt;Check if WAL archiving is even enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_mode'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_command'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'wal_level'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need &lt;code&gt;archive_mode = on&lt;/code&gt;, a non-empty &lt;code&gt;archive_command&lt;/code&gt;, and &lt;code&gt;wal_level = replica&lt;/code&gt; (or &lt;code&gt;logical&lt;/code&gt;). If &lt;code&gt;archive_mode&lt;/code&gt; is &lt;code&gt;off&lt;/code&gt;, PITR is impossible.&lt;/p&gt;

&lt;p&gt;Check for archiving failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;archived_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_archived_wal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_archived_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_failed_wal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_failed_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_archived_time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;archive_lag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_archiver&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



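&lt;p&gt;A &lt;code&gt;failed_count&lt;/code&gt; that keeps rising, a &lt;code&gt;last_failed_time&lt;/code&gt; newer than &lt;code&gt;last_archived_time&lt;/code&gt;, or an &lt;code&gt;archive_lag&lt;/code&gt; of more than a few minutes on a busy system means the pipeline is broken. &lt;code&gt;failed_count&lt;/code&gt; is cumulative since the last statistics reset, so a non-zero value on its own may reflect an old incident. While archiving fails, WAL segments accumulate on the primary and will eventually fill the disk.&lt;/p&gt;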
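
&lt;p&gt;To see how much WAL is backed up, you can count the segments still waiting for the archiver. This is a sketch that assumes PostgreSQL 12 or later and a role that is a superuser or a member of &lt;code&gt;pg_monitor&lt;/code&gt;; the alias &lt;code&gt;segments_waiting&lt;/code&gt; is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Each .ready file marks a completed WAL segment not yet archived.
SELECT count(*) AS segments_waiting
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A count that stays near zero is healthy. A count that keeps growing means segments are being produced faster than they are archived.&lt;/p&gt;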

&lt;p&gt;Verify backup freshness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest info &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main
pgbackrest verify &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up pgBackRest
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install and Configure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Debian/Ubuntu&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;pgbackrest

&lt;span class="c"&gt;# RHEL/Rocky&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;pgbackrest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create &lt;code&gt;/etc/pgbackrest/pgbackrest.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[main]&lt;/span&gt;
&lt;span class="py"&gt;pg1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/lib/postgresql/18/main&lt;/span&gt;

&lt;span class="nn"&gt;[global]&lt;/span&gt;
&lt;span class="py"&gt;repo1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/lib/pgbackrest&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-full&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-diff&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;aes-256-cbc&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-pass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-secure-encryption-passphrase&lt;/span&gt;

&lt;span class="py"&gt;process-max&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;compress-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;zst&lt;/span&gt;
&lt;span class="py"&gt;compress-level&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;

&lt;span class="py"&gt;log-level-console&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;
&lt;span class="py"&gt;log-level-file&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configure PostgreSQL
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;wal_level&lt;/span&gt; = &lt;span class="n"&gt;replica&lt;/span&gt;
&lt;span class="n"&gt;archive_mode&lt;/span&gt; = &lt;span class="n"&gt;on&lt;/span&gt;
&lt;span class="n"&gt;archive_command&lt;/span&gt; = &lt;span class="s1"&gt;'pgbackrest --stanza=main archive-push %p'&lt;/span&gt;
&lt;span class="n"&gt;archive_timeout&lt;/span&gt; = &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;archive_timeout = 60&lt;/code&gt; forces a WAL segment switch after 60 seconds of activity even if the segment isn't full. This caps maximum data loss at roughly 60 seconds, plus the time it takes to push the segment to the repository.&lt;/p&gt;

&lt;p&gt;Restart PostgreSQL, then initialize the stanza:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main stanza-create
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Schedule Backups
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full backup (weekly)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full backup

&lt;span class="c"&gt;# Differential (daily -- changes since last full)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;diff backup

&lt;span class="c"&gt;# Incremental (every 6 hours -- changes since the last backup of any type)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;incr backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cron schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 2 * * 0  pgbackrest --stanza=main --type=full backup
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
0 */6 * * *  pgbackrest --stanza=main --type=incr backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performing Recovery
&lt;/h2&gt;

&lt;p&gt;When disaster strikes, restore to a specific timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop postgresql

pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-02-28 14:30:00+00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--delta&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    restore

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;--target&lt;/code&gt; to just before the incident. &lt;code&gt;--target-action=promote&lt;/code&gt; opens the database for read-write after recovery. &lt;code&gt;--delta&lt;/code&gt; lets pgBackRest restore over the existing data directory; without it, the restore refuses to run unless that directory is empty.&lt;/p&gt;

&lt;p&gt;You can also restore to a named restore point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create before a risky operation&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_create_restore_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'before_schema_migration'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"before_schema_migration"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test Recovery Regularly
&lt;/h2&gt;

&lt;p&gt;This is the most critical step. Schedule monthly recovery tests onto a separate test server or a scratch data directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-02-28 12:00:00+00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--pg1-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/lib/postgresql/18/test_recovery &lt;span class="se"&gt;\&lt;/span&gt;
    restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the restored database contains expected data at the target timestamp. If recovery fails, fix the configuration before you need it in an emergency. Record the actual recovery time -- that's your real RTO.&lt;/p&gt;
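
&lt;p&gt;A quick sanity check on the restored instance might look like the sketch below. The two functions are built in; the commented-out spot check uses a hypothetical table name, so substitute a table you know was receiving writes around the target time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the restored test instance once it has started.
SELECT
    pg_is_in_recovery()             AS still_in_recovery,    -- false once promoted
    pg_last_xact_replay_timestamp() AS last_replayed_commit; -- expected at or just before the target

-- Hypothetical spot check against a table with steady write traffic:
-- SELECT max(created_at) FROM your_busy_table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;pg_last_xact_replay_timestamp()&lt;/code&gt; keeps the last replayed commit time only until the instance is restarted, so run the check right after recovery completes.&lt;/p&gt;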

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;Build PITR into infrastructure from day one. Every production PostgreSQL database should have WAL archiving before its first production write.&lt;/p&gt;

&lt;p&gt;Monitor three metrics continuously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Archive lag&lt;/strong&gt; -- alert if &amp;gt; 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed archive count&lt;/strong&gt; -- any non-zero value requires investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup age&lt;/strong&gt; -- alert if exceeding your backup interval + buffer&lt;/li&gt;
&lt;/ol&gt;
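
&lt;p&gt;The first two metrics can be read from &lt;code&gt;pg_stat_archiver&lt;/code&gt;. The following is a minimal sketch of an alert query using the 5-minute threshold above; the column aliases are illustrative, and backup age has to come from &lt;code&gt;pgbackrest info&lt;/code&gt; rather than from SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    now() - last_archived_time AS archive_lag,
    -- True when nothing has been archived for more than 5 minutes.
    (now() - last_archived_time) &amp;gt; interval '5 minutes' AS archive_lag_alert,
    failed_count,
    -- True when the most recent archive attempt was a failure.
    last_failed_time IS NOT NULL
        AND (last_archived_time IS NULL
             OR last_failed_time &amp;gt; last_archived_time) AS archiver_failing_now
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a system with no write activity, the lag alert can fire without anything being wrong, so pair it with the failure flag before paging anyone.&lt;/p&gt;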

&lt;p&gt;An untested backup is not a backup. Test quarterly at minimum. Document the procedure, the expected recovery time, and who executes it. Run end-to-end: restore, replay WAL, verify data, record duration.&lt;/p&gt;

&lt;p&gt;Store backups off-host. A backup on the same disk is destroyed by the same failure. Use S3, Azure Blob, GCS, or a separate server. Enable encryption.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-point-in-time-recovery" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-point-in-time-recovery&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL BRIN Indexes: When &amp; How to Use Block Range Indexes</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 20 Apr 2026 10:00:03 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-brin-indexes-when-how-to-use-block-range-indexes-3g6d</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-brin-indexes-when-how-to-use-block-range-indexes-3g6d</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL BRIN Indexes: When &amp;amp; How to Use Block Range Indexes
&lt;/h1&gt;

&lt;p&gt;You have a 500-million-row events table. The B-tree index on &lt;code&gt;created_at&lt;/code&gt; consumes 12 GB. Every insert must update that 12 GB index. Backups include 12 GB of index data. The buffer cache is full of index pages. And all you ever do is range queries: "give me events from last week." There's a better way. A BRIN index on the same column would be orders of magnitude smaller, typically a few megabytes or less rather than 12 GB, and for this query pattern it performs comparably.&lt;/p&gt;

&lt;h2&gt;
  
  
  How BRIN Works
&lt;/h2&gt;

&lt;p&gt;Instead of indexing every individual row (like B-tree), BRIN stores the minimum and maximum values for ranges of consecutive physical blocks. The default is 128 pages (~1 MB of table data) per range entry.&lt;/p&gt;

&lt;p&gt;To find rows where &lt;code&gt;created_at = '2026-01-15'&lt;/code&gt;, PostgreSQL reads the BRIN index, identifies which block ranges &lt;em&gt;could&lt;/em&gt; contain that date (any range where min &amp;lt;= '2026-01-15' &amp;lt;= max), and scans only those ranges. Block ranges that can't contain the target value are skipped entirely.&lt;/p&gt;

&lt;p&gt;The trade-off is precision. B-tree points to exact rows. BRIN points to block ranges that &lt;em&gt;might&lt;/em&gt; contain matching rows, then scans those blocks sequentially. This is fine when matching rows are clustered together (time-series data), but terrible when values are scattered randomly across the table.&lt;/p&gt;
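
&lt;p&gt;You can watch this behaviour in a query plan. The sketch below assumes a BRIN index on &lt;code&gt;events.created_at&lt;/code&gt; already exists; the date range is an arbitrary example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Range query over one week of events.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM events
WHERE created_at &amp;gt;= '2026-01-08'
  AND created_at &amp;lt; '2026-01-15';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the planner uses the BRIN index, the plan shows a Bitmap Heap Scan fed by a Bitmap Index Scan on that index. &lt;code&gt;Heap Blocks: lossy=...&lt;/code&gt; reports how many blocks were read, and &lt;code&gt;Rows Removed by Index Recheck&lt;/code&gt; counts the false positives that were read and then discarded.&lt;/p&gt;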

&lt;h2&gt;
  
  
  When BRIN Works (and When It Doesn't)
&lt;/h2&gt;

&lt;p&gt;The key metric is &lt;strong&gt;physical correlation&lt;/strong&gt; -- how closely the column values track with the physical row position on disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check correlation for candidate columns&lt;/span&gt;
&lt;span class="c1"&gt;-- Values close to 1.0 or -1.0 = good for BRIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_distinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;null_frac&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'event_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Above 0.9&lt;/strong&gt; (absolute value, so -0.95 counts too): ideal for BRIN. Matching rows are tightly clustered in a few block ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.7 to 0.9&lt;/strong&gt;: can still benefit, but more false-positive blocks scanned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Below 0.7&lt;/strong&gt;: BRIN will scan too many irrelevant blocks. Use B-tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal BRIN candidate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Characteristic&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Append-only or mostly-append&lt;/td&gt;
&lt;td&gt;Physical order matches logical order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-series or log data&lt;/td&gt;
&lt;td&gt;Timestamp correlates with insertion order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table size &amp;gt; 1 GB&lt;/td&gt;
&lt;td&gt;B-tree overhead becomes significant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range queries are primary access&lt;/td&gt;
&lt;td&gt;BRIN excels at range filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low update/delete frequency&lt;/td&gt;
&lt;td&gt;Updates break physical correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure mode: creating a BRIN index on &lt;code&gt;user_id&lt;/code&gt; in a table where inserts come from many users in random order. Nearly every block range then spans almost the full range of &lt;code&gt;user_id&lt;/code&gt; values, so the min/max summaries rule nothing out and PostgreSQL must scan the entire table anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding BRIN Candidates
&lt;/h2&gt;

&lt;p&gt;Identify large tables with oversized B-tree indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_to_table_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1073741824&lt;/span&gt;  &lt;span class="c1"&gt;-- tables &amp;gt; 1 GB&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



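&lt;p&gt;Tables over 1 GB whose B-tree indexes consume 10% or more of the table size are prime candidates, provided the indexed columns have high correlation. The query lists indexes of every type, so confirm the access method of each one (for example via &lt;code&gt;pg_indexes.indexdef&lt;/code&gt;) before acting on it.&lt;/p&gt;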

&lt;h2&gt;
  
  
  Creating and Tuning BRIN Indexes
&lt;/h2&gt;

&lt;p&gt;Basic BRIN index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tune pages_per_range
&lt;/h3&gt;

&lt;p&gt;The default of 128 pages summarizes ~1 MB of table data per entry. You can tune this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- More granular: larger index, fewer false positives&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin_fine&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_per_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Less granular: tiny index, more false positives&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin_coarse&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_per_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most time-series tables, the default of 128 works well.&lt;/p&gt;
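
&lt;p&gt;As a rough sketch of the arithmetic behind that default: the table data covered by one BRIN entry is &lt;code&gt;pages_per_range&lt;/code&gt; multiplied by the block size. The second query reads the configured parameters back from &lt;code&gt;pg_class.reloptions&lt;/code&gt;; an index created without explicit options shows NULL there and uses the default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Data covered by one BRIN entry at the default of 128 pages, in kB
-- (block_size is 8192 on a standard build)
SELECT current_setting('block_size')::int * 128 / 1024 AS kb_per_range;

-- Inspect the storage parameters set on each BRIN index
SELECT c.relname AS index_name, c.reloptions
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE am.amname = 'brin';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;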

&lt;h3&gt;
  
  
  Enable Autosummarize
&lt;/h3&gt;

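&lt;p&gt;By default, newly filled block ranges are not summarized in the BRIN index until vacuum runs. Until then, every BRIN scan has to read those unsummarized ranges in full, so queries on recent data do more work than necessary. Enabling &lt;code&gt;autosummarize&lt;/code&gt; queues a summarization as soon as a range fills:&lt;br&gt;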
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autosummarize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For append-heavy workloads, &lt;code&gt;autosummarize = on&lt;/code&gt; is strongly recommended.&lt;/p&gt;
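
&lt;p&gt;If the index already exists, it should not need to be rebuilt to get this behaviour. A minimal sketch, assuming the &lt;code&gt;idx_events_created_brin&lt;/code&gt; index from earlier: &lt;code&gt;autosummarize&lt;/code&gt; can be switched on as a storage parameter, and &lt;code&gt;brin_summarize_new_values()&lt;/code&gt; summarizes any ranges that are still pending:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Turn on autosummarize for an existing BRIN index
ALTER INDEX idx_events_created_brin SET (autosummarize = on);

-- Summarize ranges that have not been summarized yet;
-- returns the number of new range summaries inserted
SELECT brin_summarize_new_values('idx_events_created_brin');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;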

&lt;h3&gt;
  
  
  Multi-Column BRIN
&lt;/h3&gt;

&lt;p&gt;A BRIN index can cover multiple columns. This pays off when each column correlates with the physical row order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_multi_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;created_at&lt;/code&gt; and an auto-incrementing &lt;code&gt;event_id&lt;/code&gt; increase together, so both have high correlation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verify the Improvement
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



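&lt;p&gt;You should see a &lt;code&gt;Bitmap Heap Scan&lt;/code&gt; fed by a &lt;code&gt;Bitmap Index Scan on idx_events_created_brin&lt;/code&gt;. The buffer count should be much lower than for a full sequential scan, and a small &lt;code&gt;Rows Removed by Index Recheck&lt;/code&gt; value indicates that few false-positive block ranges were read.&lt;/p&gt;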

&lt;p&gt;Compare index sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size difference should be dramatic -- a BRIN index is often 100-1000x smaller than a B-tree on the same column.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;Always check &lt;code&gt;pg_stats.correlation&lt;/code&gt; before creating a BRIN index. A BRIN on a low-correlation column is worse than useless -- it costs maintenance time and fools you into thinking data is indexed.&lt;/p&gt;

&lt;p&gt;Monitor BRIN effectiveness after creation. Compare buffer counts in EXPLAIN ANALYZE between BRIN-indexed queries and sequential scans. If BRIN isn't reducing buffer reads by at least 50%, the correlation is too low.&lt;/p&gt;

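&lt;p&gt;Watch for operations that break physical correlation: UPDATEs that change the indexed column, CLUSTER on a different column, or bulk DELETEs followed by new inserts that reuse the freed space. If correlation degrades, consider running &lt;code&gt;CLUSTER&lt;/code&gt; to restore physical order; it rewrites the table under an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock, so schedule it for a maintenance window.&lt;/p&gt;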
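
&lt;p&gt;One way to watch for that degradation is to re-read the planner statistics after a fresh &lt;code&gt;ANALYZE&lt;/code&gt;. A sketch, assuming the &lt;code&gt;events&lt;/code&gt; table and &lt;code&gt;created_at&lt;/code&gt; column used throughout; values near 1 or -1 mean the column still tracks physical order, and where exactly to draw the line is a judgment call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Refresh statistics so correlation reflects the current table layout
ANALYZE events;

SELECT schemaname, tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events'
  AND attname = 'created_at';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;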

&lt;p&gt;For time-series tables growing by gigabytes per day, switching from B-tree to BRIN can reduce index storage by 99% -- and your insert performance will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-brin-index" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-brin-index&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL Covering Indexes: Eliminate Heap Fetches with INCLUDE</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sun, 19 Apr 2026 10:00:02 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-covering-indexes-eliminate-heap-fetches-with-include-3lcl</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-covering-indexes-eliminate-heap-fetches-with-include-3lcl</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Covering Indexes: Eliminate Heap Fetches with INCLUDE
&lt;/h1&gt;

&lt;p&gt;You have an index on &lt;code&gt;customer_id&lt;/code&gt;. Your query filters by &lt;code&gt;customer_id&lt;/code&gt; and selects &lt;code&gt;customer_name&lt;/code&gt; and &lt;code&gt;customer_email&lt;/code&gt;. PostgreSQL finds the matching rows in the index (fast), then fetches each row from the heap table to get the name and email (slow). Those heap fetches are random I/O operations scattered across the table. On a 100M-row table returning 1,000 rows, that can mean up to 1,000 random reads -- and they dominate the query execution time. A covering index can eliminate them entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Index Lookups Actually Work
&lt;/h2&gt;

&lt;p&gt;Every standard B-tree index lookup is two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Index scan&lt;/strong&gt;: find matching row pointers (TIDs) in the index -- fast, sequential access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap fetch&lt;/strong&gt;: retrieve actual row data from the heap table -- slow, random I/O&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The heap fetch is the bottleneck. For single-row lookups it's barely noticeable. For queries returning hundreds or thousands of rows, it dominates execution time.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;index-only scan&lt;/strong&gt; skips step 2 entirely. If all columns the query needs exist in the index, PostgreSQL reads everything from the index. No heap access, no random I/O.&lt;/p&gt;

&lt;h2&gt;
  
  
  The INCLUDE Clause (PostgreSQL 11+)
&lt;/h2&gt;

&lt;p&gt;Before PostgreSQL 11, covering indexes required composite indexes on all columns: &lt;code&gt;CREATE INDEX ON customers (customer_id, customer_name, customer_email)&lt;/code&gt;. This works but has a cost -- the index maintains sort order on all three columns, so every insert and update pays to order columns that nobody searches or sorts by.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; adds columns to the index leaf pages without including them in the sort key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_customers_covering&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key column (&lt;code&gt;customer_id&lt;/code&gt;) is the search key for &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, and joins. The included columns are stored alongside but not sorted -- they exist solely to enable index-only scans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting Covering Index Opportunities
&lt;/h2&gt;

&lt;p&gt;Look for index scans with heap fetches in EXPLAIN output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Index Scan using idx_customers_id&lt;/code&gt; -- the heap was visited for each row&lt;/li&gt;
&lt;li&gt;
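&lt;code&gt;Heap Fetches: 1000&lt;/code&gt; under an &lt;code&gt;Index Only Scan&lt;/code&gt; node -- 1,000 trips to the heap table despite the covering index (a plain &lt;code&gt;Index Scan&lt;/code&gt; does not print this line)&lt;/li&gt;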
&lt;li&gt;High &lt;code&gt;Buffers: shared hit=...&lt;/code&gt; from random heap access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: &lt;code&gt;Index Only Scan&lt;/code&gt; with &lt;code&gt;Heap Fetches: 0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Find candidates system-wide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_tup_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_tup_fetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;idx_tup_fetch&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indexes with high &lt;code&gt;idx_tup_fetch&lt;/code&gt; are driving many heap fetches. Cross-reference them with your most frequent queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dashboard Query
&lt;/h3&gt;

&lt;p&gt;A reporting dashboard showing recent orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- The covering index&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_orders_recent_covering&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single index serves the WHERE filter, the ORDER BY, and the LIMIT, and it supplies every selected column. As long as the visibility map is current, the query is answered entirely from the index with no heap access.&lt;/p&gt;

&lt;h3&gt;
  
  
  INCLUDE vs Composite
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- INCLUDE: customer_name is retrieved but never searched&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Composite: both columns are used in WHERE or ORDER BY&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;INCLUDE&lt;/code&gt; when extra columns are only in the SELECT list. Use composite when columns appear in WHERE or ORDER BY.&lt;/p&gt;
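
&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; also matters for unique indexes, where adding a column to the key would change what is enforced. A sketch, assuming a &lt;code&gt;customers&lt;/code&gt; table with a &lt;code&gt;customer_email&lt;/code&gt; column (the index names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Uniqueness is enforced on customer_id only; customer_email is payload
CREATE UNIQUE INDEX CONCURRENTLY idx_customers_id_covering
    ON customers (customer_id)
    INCLUDE (customer_email);

-- By contrast, this enforces uniqueness of the (customer_id, customer_email) pair,
-- which would allow the same customer_id to appear with different emails
CREATE UNIQUE INDEX CONCURRENTLY idx_customers_id_email
    ON customers (customer_id, customer_email);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;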

&lt;h2&gt;
  
  
  The Vacuum Dependency
&lt;/h2&gt;

&lt;p&gt;Index-only scans require pages to be marked "all-visible" in the visibility map. PostgreSQL can skip the heap only for these pages. Vacuum maintains the visibility map.&lt;/p&gt;

&lt;p&gt;If vacuum falls behind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages are not marked all-visible&lt;/li&gt;
&lt;li&gt;Index-only scans must check the heap for every row on those pages, and the planner may fall back to regular index scans&lt;/li&gt;
&lt;li&gt;Your covering index provides little or no benefit
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check vacuum activity, a proxy for visibility map health&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_vacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'customers'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;last_autovacuum&lt;/code&gt; is stale and dead tuples are accumulating, tune autovacuum to run more frequently on that table. A covering index without healthy vacuum is a wasted investment.&lt;/p&gt;
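
&lt;p&gt;The statistics view above only shows vacuum activity. For a more direct estimate, &lt;code&gt;pg_class&lt;/code&gt; records how many pages were marked all-visible as of the last vacuum or analyze. A sketch, assuming the &lt;code&gt;customers&lt;/code&gt; table; the autovacuum setting is an illustrative value, not a recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Share of heap pages marked all-visible (estimates, not live counts)
SELECT
    relname,
    relpages,
    relallvisible,
    round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'customers';

-- Vacuum this table once roughly 1% of its rows are dead (the default is 20%)
ALTER TABLE customers SET (autovacuum_vacuum_scale_factor = 0.01);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A low &lt;code&gt;pct_all_visible&lt;/code&gt; on a table that relies on index-only scans suggests vacuum is not keeping up with the write rate.&lt;/p&gt;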

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;When writing a new query that selects specific columns from an indexed table, ask: "Can I add these columns to the index with INCLUDE to eliminate heap fetches?"&lt;/p&gt;

&lt;p&gt;Keep covering indexes focused. Don't add every column -- include only the columns your most frequent queries select. Different queries needing different columns should get targeted covering indexes, not one massive index.&lt;/p&gt;

&lt;p&gt;Monitor heap fetch counts over time. A query showing &lt;code&gt;Heap Fetches: 0&lt;/code&gt; today may regress if vacuum falls behind or if a new column is added to the SELECT list. Track &lt;code&gt;idx_tup_fetch&lt;/code&gt; on important indexes -- a sudden increase signals regression.&lt;/p&gt;

&lt;p&gt;Review covering indexes when query patterns change. Application refactors that add columns to SELECT clauses break index-only scans silently -- the query works, but performance drops.&lt;/p&gt;
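
&lt;p&gt;To make that failure mode concrete, compare two queries against the &lt;code&gt;idx_customers_covering&lt;/code&gt; index defined earlier. &lt;code&gt;customer_phone&lt;/code&gt; is a hypothetical column used only for illustration; because it is not in the index, the second query has to go back to the heap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- All referenced columns are in the index: eligible for an Index Only Scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_name, customer_email
FROM customers
WHERE customer_id BETWEEN 1000 AND 2000;

-- customer_phone (hypothetical) is not in the index: an index-only scan is no longer possible
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_name, customer_email, customer_phone
FROM customers
WHERE customer_id BETWEEN 1000 AND 2000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;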




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-covering-index" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-covering-index&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL JSONB Indexing: GIN, Expression &amp; Partial Index Strategies</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sat, 18 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-jsonb-indexing-gin-expression-partial-index-strategies-i11</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-jsonb-indexing-gin-expression-partial-index-strategies-i11</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL JSONB Indexing: GIN, Expression &amp;amp; Partial Index Strategies
&lt;/h1&gt;

&lt;p&gt;JSONB is one of PostgreSQL's killer features. You get schema-less flexibility inside a relational database -- user preferences, API payloads, feature flags, event metadata, all stored without defining every column up front. The problem is that most developers treat JSONB as a black box: throw data in, query it with &lt;code&gt;-&amp;gt;&lt;/code&gt; and &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt;, maybe slap a GIN index on it, and assume PostgreSQL will figure out how to make it fast. It will not. Let's walk through the three indexing strategies and when to use each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamental Confusion
&lt;/h2&gt;

&lt;p&gt;The most common JSONB indexing mistake: creating a GIN index and expecting it to speed up &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; equality queries. It doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- You create this index&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_metadata_gin&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- And expect this query to use it&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- NOPE. Sequential scan. The GIN index doesn't support -&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default GIN operator class (&lt;code&gt;jsonb_ops&lt;/code&gt;) supports the &lt;code&gt;@&amp;gt;&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, &lt;code&gt;?|&lt;/code&gt;, and &lt;code&gt;?&amp;amp;&lt;/code&gt; operators (plus the jsonpath operators &lt;code&gt;@?&lt;/code&gt; and &lt;code&gt;@@&lt;/code&gt; on PostgreSQL 12 and later) -- not &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt;. To accelerate that query, you need either an expression index or a rewrite that uses &lt;code&gt;@&amp;gt;&lt;/code&gt; containment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting the Problem
&lt;/h2&gt;

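&lt;p&gt;Find tables that have JSONB columns and a high sequential scan count (the statistics cannot attribute a scan to a specific column, so treat the results as candidates):&lt;br&gt;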
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq_tup_read&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_attribute&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_type&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;atttypid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'jsonb'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attnum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attisdropped&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq_scan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq_tup_read&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm with EXPLAIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- This does NOT use a GIN index&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- This DOES use a GIN index&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"status": "active"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;Seq Scan&lt;/code&gt; on the first query despite having a GIN index, the operator mismatch is your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 1: GIN Index for Containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;When your queries use containment, GIN is the right choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Default operator class: supports @&amp;gt;, ?, ?|, ?&amp;amp;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_metadata_gin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- jsonb_path_ops: supports only @&amp;gt;, but 2-3x smaller and faster&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_metadata_pathops&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



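&lt;p&gt;Use &lt;code&gt;jsonb_path_ops&lt;/code&gt; when you only need &lt;code&gt;@&amp;gt;&lt;/code&gt; containment. It hashes full paths rather than indexing every key/value, creating a significantly smaller index. On 10M rows with complex documents, the size difference can reach 3-4x. The trade-off is that it cannot serve the key-existence operators &lt;code&gt;?&lt;/code&gt;, &lt;code&gt;?|&lt;/code&gt;, and &lt;code&gt;?&amp;amp;&lt;/code&gt;.&lt;/p&gt;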

&lt;p&gt;Rewrite &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; equality to containment to leverage GIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Before (no GIN support)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- After (uses GIN index)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"status": "active"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 2: Expression Index for Specific Keys
&lt;/h2&gt;

&lt;p&gt;When you repeatedly query one specific key, an expression index is more efficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- B-tree on a specific extracted key&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_status&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- Now this works&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expression indexes are smaller than GIN (one value per row vs. decomposing the entire document), support range queries (&lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;BETWEEN&lt;/code&gt;), and support &lt;code&gt;ORDER BY&lt;/code&gt;. They're the right choice when you query known, specific keys.&lt;/p&gt;
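
&lt;p&gt;A rough sketch of the range and ordering case. Because &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; returns text, comparisons against the index above are string comparisons; for numeric ranges, index the cast expression and repeat exactly the same expression in the query. The &lt;code&gt;score&lt;/code&gt; key and the index name &lt;code&gt;idx_events_score&lt;/code&gt; are assumptions for illustration, and the cast fails if any row holds a non-numeric value under that key.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical: B-tree on a numeric key extracted from the document
CREATE INDEX CONCURRENTLY idx_events_score
    ON events (((metadata-&amp;gt;&amp;gt;'score')::int));

-- WHERE and ORDER BY repeat the indexed expression so the planner can match it
SELECT * FROM events
WHERE (metadata-&amp;gt;&amp;gt;'score')::int &amp;gt; 90
ORDER BY (metadata-&amp;gt;&amp;gt;'score')::int DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;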

&lt;p&gt;For JSONB arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_tags_gin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'tags'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- Query: find events tagged "important"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'tags'&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'"important"'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 3: Partial Index for Selective Conditions
&lt;/h2&gt;

&lt;p&gt;When only a fraction of rows match your query, avoid indexing the entire table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Index only active events (if 90% are archived)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_active_metadata&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'status'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



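&lt;p&gt;The resulting index is much smaller and cheaper to maintain, and writes to archived rows skip it entirely. The planner only considers a partial index for queries whose &lt;code&gt;WHERE&lt;/code&gt; clause implies the index predicate. Combine partial and expression indexes to target a single access path:&lt;br&gt;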
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Expression index on user_id, only for purchase events&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_purchase_user&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'purchase'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
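
&lt;p&gt;To illustrate how a partial index is matched, here is a sketch of two queries against &lt;code&gt;idx_events_purchase_user&lt;/code&gt;. The &lt;code&gt;user_id&lt;/code&gt; value &lt;code&gt;'42'&lt;/code&gt; is made up; run &lt;code&gt;EXPLAIN&lt;/code&gt; on your own queries to confirm which plan you actually get.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Eligible for the partial index: the WHERE clause repeats the index predicate
SELECT * FROM events
WHERE metadata-&amp;gt;&amp;gt;'type' = 'purchase'
  AND metadata-&amp;gt;&amp;gt;'user_id' = '42';

-- Not eligible: nothing here implies metadata-&amp;gt;&amp;gt;'type' = 'purchase'
SELECT * FROM events
WHERE metadata-&amp;gt;&amp;gt;'user_id' = '42';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;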



&lt;h2&gt;
  
  
  The Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Pattern&lt;/th&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;@&amp;gt;&lt;/code&gt; containment&lt;/td&gt;
&lt;td&gt;GIN (&lt;code&gt;jsonb_path_ops&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE data @&amp;gt; '{"k": "v"}'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; equality on known key&lt;/td&gt;
&lt;td&gt;Expression (B-tree)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE data-&amp;gt;&amp;gt;'status' = 'x'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Key existence (&lt;code&gt;?&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;GIN (default &lt;code&gt;jsonb_ops&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE data ? 'email'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range on extracted value&lt;/td&gt;
&lt;td&gt;Expression (B-tree)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE (data-&amp;gt;&amp;gt;'score')::int &amp;gt; 90&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array containment&lt;/td&gt;
&lt;td&gt;GIN on sub-path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE data-&amp;gt;'tags' @&amp;gt; '"x"'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Write Performance Trade-off
&lt;/h2&gt;

&lt;p&gt;GIN index maintenance is expensive. Every INSERT or UPDATE touching the JSONB column must decompose the entire document to update the index. On write-heavy tables with large documents, this can cut insert throughput by 30-50%.&lt;/p&gt;

&lt;p&gt;If you only query 2-3 keys, expression indexes on those keys are dramatically cheaper to maintain. Monitor insert latency after adding GIN indexes -- if INSERTs slow down significantly, switch to targeted expression or partial indexes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;Document which JSONB keys will be queried when you add the column. This drives index selection from day one instead of retrofitting after performance degrades.&lt;/p&gt;

&lt;p&gt;Monitor GIN index size relative to table size. If the GIN index approaches the table size, you're indexing too much.&lt;/p&gt;
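
&lt;p&gt;One way to make that comparison, sketched against the system catalogs and assuming the table is named &lt;code&gt;events&lt;/code&gt; as in the earlier examples. &lt;code&gt;pg_table_size&lt;/code&gt; includes TOAST storage, which is where large JSONB documents usually live.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Size of each GIN index on events next to the size of the table itself
SELECT
    index_class.relname                               AS index_name,
    pg_size_pretty(pg_relation_size(index_class.oid)) AS index_size,
    pg_size_pretty(pg_table_size(pg_index.indrelid))  AS table_size
FROM pg_index
JOIN pg_class AS index_class ON index_class.oid = pg_index.indexrelid
JOIN pg_am ON pg_am.oid = index_class.relam
WHERE pg_index.indrelid = 'events'::regclass
  AND pg_am.amname = 'gin'
ORDER BY pg_relation_size(index_class.oid) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;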

&lt;p&gt;Review JSONB query patterns quarterly. Application features evolve, and the keys you query change. Unused JSONB indexes waste space and slow writes for zero benefit.&lt;/p&gt;
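
&lt;p&gt;A minimal sketch for spotting candidates during that review, again assuming the &lt;code&gt;events&lt;/code&gt; table. The scan counters accumulate since the last statistics reset and are tracked per server, so check replicas separately before dropping anything.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Indexes on events, least-used first
SELECT
    indexrelname                                 AS index_name,
    idx_scan                                     AS scans_since_stats_reset,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'events'
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;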




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-jsonb-indexing" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-jsonb-indexing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Transaction Isolation Levels Explained</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-transaction-isolation-levels-explained-52kj</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-transaction-isolation-levels-explained-52kj</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Transaction Isolation Levels Explained
&lt;/h1&gt;

&lt;p&gt;If you've ever had a subtle data inconsistency bug in production and traced it back to "we're in a transaction, so the data should be consistent" -- you've run into a transaction isolation misunderstanding. PostgreSQL's Read Committed default is perfectly correct, but it doesn't do what most developers think it does. And switching to a higher level introduces a completely different class of problems. Let's unpack all three levels and when to actually use each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What PostgreSQL Actually Supports
&lt;/h2&gt;

&lt;p&gt;The SQL standard defines four isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable. PostgreSQL accepts all four in syntax, but Read Uncommitted behaves identically to Read Committed -- PostgreSQL's MVCC architecture never exposes uncommitted data. So in practice, you have three distinct behaviors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read Committed (The Default)
&lt;/h2&gt;

&lt;p&gt;Each statement within a transaction sees a snapshot as of the start of &lt;em&gt;that statement&lt;/em&gt;, not the start of the transaction. If another transaction commits between your two SELECTs, the second one sees the committed changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- sees 1000&lt;/span&gt;

&lt;span class="c1"&gt;-- Another transaction commits: UPDATE accounts SET balance = 500 WHERE account_id = 1;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- sees 500 (non-repeatable read)&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the right choice for most workloads. Non-repeatable reads and phantom reads are only a problem if your application logic depends on seeing the same data across multiple queries within a single transaction -- and most transactions execute a single query or a short sequence of independent queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repeatable Read (Snapshot Isolation)
&lt;/h2&gt;

&lt;p&gt;The transaction sees a snapshot as of its first non-transaction-control statement. All queries within the transaction see the same consistent view, regardless of concurrent commits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;REPEATABLE&lt;/span&gt; &lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- sees 1000&lt;/span&gt;

&lt;span class="c1"&gt;-- Another transaction commits: UPDATE accounts SET balance = 500 WHERE account_id = 1;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- still sees 1000 (snapshot isolation)&lt;/span&gt;

&lt;span class="c1"&gt;-- But trying to update the same row fails:&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- ERROR: could not serialize access due to concurrent update&lt;/span&gt;
&lt;span class="k"&gt;ROLLBACK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- must retry the entire transaction&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch: if your transaction tries to UPDATE, DELETE, or lock (&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;) a row that another transaction modified and committed after the snapshot was taken, PostgreSQL aborts with a serialization error (SQLSTATE &lt;code&gt;40001&lt;/code&gt;). Your application must catch this and retry the entire transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serializable (SSI)
&lt;/h2&gt;

&lt;p&gt;The strongest level. PostgreSQL uses Serializable Snapshot Isolation (SSI) to detect read/write dependency cycles and abort transactions that would produce non-serializable results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Transaction A:&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SERIALIZABLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;branch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'east'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'west'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Transaction B (concurrent):&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SERIALIZABLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;branch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'west'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'east'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- One of these will fail with a serialization error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches more anomalies than Repeatable Read but has a higher abort rate. To detect the dependencies, PostgreSQL tracks predicate locks (SIRead locks). They never block other transactions, but tracking them consumes memory and CPU.&lt;/p&gt;
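
&lt;p&gt;The predicate locks are visible in &lt;code&gt;pg_locks&lt;/code&gt; with the mode &lt;code&gt;SIReadLock&lt;/code&gt;. As a quick sketch, this lists what each Serializable session currently holds; many relation-level entries suggest sequential scans or lock promotion, both of which widen the conflict surface.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Predicate locks currently held by Serializable transactions
SELECT
    pid,
    locktype,
    relation::regclass AS relation_name,
    page,
    tuple
FROM pg_locks
WHERE mode = 'SIReadLock'
ORDER BY pid;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;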

&lt;h2&gt;
  
  
  Detecting Isolation Problems
&lt;/h2&gt;

&lt;p&gt;Check the server-wide default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;default_transaction_isolation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor for sessions holding long-running snapshots (common with Repeatable Read/Serializable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend_xid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'client backend'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;backend_xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sessions with very old &lt;code&gt;backend_xmin&lt;/code&gt; values are holding snapshots that prevent vacuum from cleaning dead tuples -- a major bloat risk.&lt;/p&gt;

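&lt;p&gt;Check the per-database deadlock counter. Note that &lt;code&gt;conflicts&lt;/code&gt; counts queries canceled by recovery conflicts on a standby, not lock contention, so it stays at zero on a primary:&lt;br&gt;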
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;database_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deadlocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;conflicts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_database&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing Retry Logic
&lt;/h2&gt;

&lt;p&gt;This is non-negotiable for Repeatable Read and Serializable. The retry must re-execute the &lt;em&gt;entire transaction&lt;/em&gt;, not just the failed statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transaction_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a transaction function with automatic retry on serialization failure.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connection_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getconn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_isolation_level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extensions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ISOLATION_LEVEL_SERIALIZABLE&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transaction_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SerializationFailure&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="c1"&gt;# Exponential backoff with jitter
&lt;/span&gt;            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;connection_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putconn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Which Level When?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Read Committed&lt;/strong&gt; (default): most web applications, CRUD operations, independent queries within a transaction. The non-repeatable read phenomenon is only a problem if your logic depends on reading the same data twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeatable Read&lt;/strong&gt;: read-heavy reporting that needs a consistent point-in-time snapshot. Financial reports, balance calculations, audit queries. Keep these transactions short to avoid holding snapshots that prevent vacuum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serializable&lt;/strong&gt;: write-heavy transactions with complex invariants where application-level locking is impractical. Double-booking prevention, inventory allocation, accounting entries with zero-sum constraints.&lt;/p&gt;

&lt;p&gt;Set isolation per-transaction, not globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Per-transaction (recommended)&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;REPEATABLE&lt;/span&gt; &lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- ... queries ...&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Kill long-idle transactions&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;idle_in_transaction_session_timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'5min'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prevention Strategy
&lt;/h2&gt;

&lt;p&gt;Default to Read Committed. Upgrade per-transaction only when you have a documented reason.&lt;/p&gt;

&lt;p&gt;Monitor &lt;code&gt;idle in transaction&lt;/code&gt; sessions regardless of isolation level. At Read Committed it's wasteful; at Repeatable Read or Serializable it actively prevents vacuum from reclaiming dead tuples.&lt;/p&gt;
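
&lt;p&gt;A small monitoring sketch using &lt;code&gt;pg_stat_activity&lt;/code&gt;; the column aliases are only illustrative. It lists sessions that have an open transaction but are not running anything, oldest transaction first.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sessions idle inside an open transaction, oldest transaction first
SELECT
    pid,
    usename,
    now() - xact_start      AS transaction_age,
    now() - state_change    AS idle_for,
    substring(query, 1, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;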

&lt;p&gt;Design schemas to minimize serialization conflicts. Two transactions that always update the same counter row will always conflict under Serializable. Restructure to per-user or per-partition counters to reduce contention.&lt;/p&gt;
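
&lt;p&gt;One possible shape for such a restructuring, sketched with a hypothetical &lt;code&gt;page_view_counters&lt;/code&gt; table (the table, its columns, and the choice of 16 shards are assumptions, not part of the original schema). Each logical counter is split across several rows so that concurrent writers usually touch different rows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical schema: one logical counter split across 16 rows
CREATE TABLE page_view_counters (
    page_id    bigint   NOT NULL,
    shard      smallint NOT NULL,
    view_count bigint   NOT NULL DEFAULT 0,
    PRIMARY KEY (page_id, shard)
);

-- Pre-create the shards for page 42
INSERT INTO page_view_counters (page_id, shard)
SELECT 42, shard_number FROM generate_series(0, 15) AS shard_number;

-- Each writer increments one shard; the application picks the shard (7 here)
-- at random and passes it in as a literal or bind parameter
UPDATE page_view_counters
SET view_count = view_count + 1
WHERE page_id = 42 AND shard = 7;

-- Readers sum the shards to get the total
SELECT sum(view_count) AS total_views
FROM page_view_counters
WHERE page_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The shard is chosen by the application rather than with &lt;code&gt;random()&lt;/code&gt; in the &lt;code&gt;WHERE&lt;/code&gt; clause, because a volatile function there is re-evaluated for every row and can match zero or several shards.&lt;/p&gt;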

&lt;p&gt;Under high concurrency at Serializable level, expect 5-20% serialization failures. If your retry rate is higher, the access patterns may need restructuring more than the isolation level needs raising.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-transaction-isolation-levels" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-transaction-isolation-levels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL Slow Query Log: Finding &amp; Fixing Your Slowest Queries</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:00:06 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-slow-query-log-finding-fixing-your-slowest-queries-2n4d</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-slow-query-log-finding-fixing-your-slowest-queries-2n4d</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Slow Query Log: Finding &amp;amp; Fixing Your Slowest Queries
&lt;/h1&gt;

&lt;p&gt;Here's something that surprises a lot of developers: PostgreSQL doesn't log slow queries by default. The &lt;code&gt;log_min_duration_statement&lt;/code&gt; parameter ships at &lt;code&gt;-1&lt;/code&gt; (disabled), which means your database could be running queries that take 5 seconds, 10 seconds, even 30 seconds -- and nobody knows unless someone complains. Performance regressions happen silently, accumulate over months, and by the time they cross the pain threshold, the root cause is buried under layers of schema changes and data growth.&lt;/p&gt;

&lt;p&gt;Let's fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The most common scenario is a query that gradually degrades. It took 50ms six months ago, takes 500ms today, and nobody noticed the drift because slow query logging was never enabled. The table grew, statistics shifted, maybe an index got dropped during a migration. Without logging, there's no trail to follow.&lt;/p&gt;

&lt;p&gt;Even teams that enable slow query logging often misconfigure it. Three common mistakes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too high a threshold.&lt;/strong&gt; Setting &lt;code&gt;log_min_duration_statement&lt;/code&gt; to 10 seconds catches only catastrophic queries. The steady stream of 1-2 second queries that collectively dominate database load goes completely unnoticed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too low a threshold.&lt;/strong&gt; Setting it to 1ms floods the logs with noise. Finding meaningful signals in gigabytes of log output is impractical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizing outliers instead of total load.&lt;/strong&gt; Slow query logs show individual executions. A query that runs once at 800ms is less important than a query running 10,000 times per hour at 50ms each (500 seconds of total load). Without aggregation via &lt;code&gt;pg_stat_statements&lt;/code&gt;, you're chasing the wrong queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;First, check what's currently configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check current slow query log configuration&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;short_desc&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'log_min_duration_statement'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'log_statement'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'log_duration'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'log_line_prefix'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'auto_explain.log_min_duration'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;log_min_duration_statement&lt;/code&gt; shows &lt;code&gt;-1&lt;/code&gt;, slow query logging is completely disabled. A value of &lt;code&gt;0&lt;/code&gt; logs everything (useful for debugging, too noisy for production). A good production starting point is &lt;code&gt;250&lt;/code&gt; to &lt;code&gt;1000&lt;/code&gt; milliseconds.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;pg_stat_statements&lt;/code&gt; to find the queries that matter most -- sorted by total execution time, not individual execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Top queries by total execution time (cumulative load)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stddev_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;stddev_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



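&lt;p&gt;The &lt;code&gt;total_exec_time&lt;/code&gt; column is the most actionable metric. A query averaging 5ms called 1 million times (5,000 seconds total) deserves more attention than one averaging 2 seconds called 10 times (20 seconds total). A &lt;code&gt;stddev_exec_time&lt;/code&gt; that is high relative to the mean flags queries where some executions are fast and others are slow, which often points to plan instability, skewed parameter values, or cache misses. The &lt;code&gt;*_exec_time&lt;/code&gt; column names apply to PostgreSQL 13 and later; older versions expose &lt;code&gt;total_time&lt;/code&gt;, &lt;code&gt;mean_time&lt;/code&gt;, and so on.&lt;/p&gt;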
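
&lt;p&gt;To put a number on that variability, one option is to rank statements by standard deviation divided by mean. This is a minimal sketch, assuming PostgreSQL 13+ column names; the &lt;code&gt;calls&lt;/code&gt; and &lt;code&gt;mean_exec_time&lt;/code&gt; cut-offs are arbitrary illustrations, not recommended values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Statements whose execution time varies most relative to their mean
SELECT
    substring(query, 1, 100) AS query_preview,
    calls,
    round(mean_exec_time::numeric, 1) AS avg_time_ms,
    round(stddev_exec_time::numeric, 1) AS stddev_ms,
    round((stddev_exec_time / nullif(mean_exec_time, 0))::numeric, 2) AS variation_ratio
FROM pg_stat_statements
WHERE calls &amp;gt; 100            -- illustrative cut-off: enough samples to be meaningful
  AND mean_exec_time &amp;gt; 1     -- illustrative cut-off: skip sub-millisecond statements
ORDER BY stddev_exec_time / nullif(mean_exec_time, 0) DESC NULLS LAST
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A ratio well above 1 suggests the slow executions are far from typical, which makes that statement a reasonable candidate for plan capture.&lt;/p&gt;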

&lt;h2&gt;
  
  
  Setting Up Slow Query Logging
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enable with an appropriate threshold:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 250ms is a good starting point for web/API workloads&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'250ms'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For analytical workloads where multi-second queries are normal, set it higher (2-5 seconds). You can always lower it later once you've processed the initial findings.&lt;/p&gt;
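
&lt;p&gt;When one cluster serves both kinds of workload, the threshold can also be overridden per role or per database instead of globally. A minimal sketch, assuming a hypothetical &lt;code&gt;reporting_user&lt;/code&gt; role and &lt;code&gt;analytics&lt;/code&gt; database (both names are placeholders), run as a superuser:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical names: reporting_user and analytics are placeholders
ALTER ROLE reporting_user SET log_min_duration_statement = '5s';
ALTER DATABASE analytics SET log_min_duration_statement = '2s';

-- Overrides apply to new sessions; list what is currently stored
SELECT r.rolname, d.datname, s.setconfig
FROM pg_db_role_setting AS s
LEFT JOIN pg_roles AS r ON r.oid = s.setrole
LEFT JOIN pg_database AS d ON d.oid = s.setdatabase;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This keeps the global threshold tight for the web/API traffic while the reporting sessions stay out of the log unless they are unusually slow.&lt;/p&gt;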

&lt;p&gt;&lt;strong&gt;Add useful context to log lines:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_line_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'%m [%p] %q%u@%d '&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- %m = timestamp with milliseconds&lt;/span&gt;
&lt;span class="c1"&gt;-- %p = process ID&lt;/span&gt;
&lt;span class="c1"&gt;-- %q = stop here for non-session (background) processes&lt;/span&gt;
&lt;span class="c1"&gt;-- %u = user name&lt;/span&gt;
&lt;span class="c1"&gt;-- %d = database name&lt;/span&gt;

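&lt;span class="c1"&gt;-- Log bind parameter values with statements that fail (PostgreSQL 13+).&lt;/span&gt;
&lt;span class="c1"&gt;-- Slow-statement log entries already include them, capped by log_parameter_max_length.&lt;/span&gt;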
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_parameter_max_length_on_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enable auto_explain for automatic plan capture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In postgresql.conf (takes effect after a server restart): shared_preload_libraries = 'pg_stat_statements, auto_explain'&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_min_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'500ms'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;-- on = actual times (adds overhead)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_buffers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;-- only takes effect when log_analyze is on&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_nested_statements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



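&lt;p&gt;With &lt;code&gt;auto_explain.log_analyze = off&lt;/code&gt;, the log contains the plan with estimated costs only. The per-node timings and row counts of &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; are not collected, which avoids instrumentation overhead on every statement. It also means &lt;code&gt;auto_explain.log_buffers&lt;/code&gt; has no effect until &lt;code&gt;log_analyze&lt;/code&gt; is enabled. Turn &lt;code&gt;log_analyze&lt;/code&gt; &lt;code&gt;on&lt;/code&gt; only during targeted debugging.&lt;/p&gt;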
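
&lt;p&gt;For that kind of targeted debugging, the module can be loaded and configured for a single session instead of server-wide. A minimal sketch, assuming a superuser session; the final query is only a stand-in for whatever statement you are investigating:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Affects only the current session; run as a superuser
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;   -- log the plan of every statement in this session
SET auto_explain.log_analyze = on;       -- include actual row counts and timings
SET auto_explain.log_buffers = on;       -- include buffer usage (needs log_analyze)

-- Stand-in for the statement under investigation; replace with your own
SELECT count(*) FROM pg_class WHERE relkind = 'r';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The plan is written to the server log, not returned to the client, and the settings disappear when the session ends.&lt;/p&gt;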

&lt;p&gt;&lt;strong&gt;On AWS RDS/Aurora&lt;/strong&gt;, configure via parameter groups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;250&lt;/span&gt;
&lt;span class="py"&gt;log_statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
&lt;span class="py"&gt;shared_preload_libraries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;pg_stat_statements,auto_explain&lt;/span&gt;
&lt;span class="py"&gt;auto_explain.log_min_duration&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



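&lt;p&gt;RDS does not send PostgreSQL logs to CloudWatch by default. Enable "Publish to CloudWatch" for the PostgreSQL log in the RDS console to export them automatically. Note that &lt;code&gt;shared_preload_libraries&lt;/code&gt; is a static parameter, so changing it takes effect only after the instance reboots.&lt;/p&gt;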

&lt;p&gt;&lt;strong&gt;Enable pg_stat_statements if not active:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Requires pg_stat_statements in shared_preload_libraries (server restart)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- pg_stat_statements.max can only change at server start: restart to apply&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- pg_stat_statements.track is picked up by a reload&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'all'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Analyzing Logs at Scale
&lt;/h2&gt;

&lt;p&gt;For historical analysis, pgBadger generates HTML reports from PostgreSQL log files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate an HTML report&lt;/span&gt;
pgbadger /var/log/postgresql/postgresql-&lt;span class="k"&gt;*&lt;/span&gt;.log &lt;span class="nt"&gt;-o&lt;/span&gt; slow_query_report.html

&lt;span class="c"&gt;# For a specific date range&lt;/span&gt;
pgbadger &lt;span class="nt"&gt;--begin&lt;/span&gt; &lt;span class="s2"&gt;"2026-02-01 00:00:00"&lt;/span&gt; &lt;span class="nt"&gt;--end&lt;/span&gt; &lt;span class="s2"&gt;"2026-02-28 23:59:59"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    /var/log/postgresql/postgresql-&lt;span class="k"&gt;*&lt;/span&gt;.log &lt;span class="nt"&gt;-o&lt;/span&gt; february_report.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pgBadger normalizes queries, groups them by pattern, and shows total execution time, call frequency, and hourly distribution. It's an excellent complement to real-time monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention Strategy
&lt;/h2&gt;

&lt;p&gt;Establish a slow query budget as part of your performance SLA. Define what "slow" means for your app -- for a web API, over 100ms might be unacceptable; for a batch pipeline, 5 seconds might be fine. Set &lt;code&gt;log_min_duration_statement&lt;/code&gt; to match your SLA and treat every logged query as a performance bug.&lt;/p&gt;

&lt;p&gt;Monitor trends, not just individual slow queries. A query degrading from 10ms to 50ms over three months won't trigger a 250ms threshold, but it's a 5x regression that will eventually become a problem.&lt;/p&gt;
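
&lt;p&gt;One way to watch for this kind of drift without a dedicated tool is to snapshot &lt;code&gt;pg_stat_statements&lt;/code&gt; on a schedule and compare consecutive intervals. A minimal sketch, assuming PostgreSQL 13+ column names and a hypothetical &lt;code&gt;query_stats_snapshot&lt;/code&gt; table (the name and layout are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical snapshot table; name and columns are illustrative
CREATE TABLE IF NOT EXISTS query_stats_snapshot (
    captured_at     timestamptz NOT NULL DEFAULT now(),
    userid          oid,
    dbid            oid,
    queryid         bigint,
    calls           bigint,
    total_exec_time double precision
);

-- Run on a schedule (cron, pg_cron, or similar)
INSERT INTO query_stats_snapshot (userid, dbid, queryid, calls, total_exec_time)
SELECT userid, dbid, queryid, sum(calls)::bigint, sum(total_exec_time)
FROM pg_stat_statements
GROUP BY userid, dbid, queryid;

-- Calls and average time per call between consecutive snapshots
SELECT
    captured_at,
    queryid,
    calls - lag(calls) OVER w AS calls_in_interval,
    (total_exec_time - lag(total_exec_time) OVER w)
        / nullif(calls - lag(calls) OVER w, 0) AS avg_ms_in_interval
FROM query_stats_snapshot
WINDOW w AS (PARTITION BY userid, dbid, queryid ORDER BY captured_at)
ORDER BY queryid, captured_at;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Counters restart from zero after &lt;code&gt;pg_stat_statements_reset()&lt;/code&gt; or when an entry is evicted, so discard any interval where &lt;code&gt;calls_in_interval&lt;/code&gt; is negative.&lt;/p&gt;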

&lt;p&gt;Include &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; in code reviews. Before merging any PR that adds or modifies a database query, run it against a production-sized dataset and verify the plan uses indexes appropriately.&lt;/p&gt;
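
&lt;p&gt;Keep in mind that &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; actually executes the statement. For queries that modify data, wrapping the check in a transaction and rolling it back keeps the review side-effect free. A minimal sketch, using a hypothetical &lt;code&gt;orders&lt;/code&gt; table and predicate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table and predicate, for illustration only
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'
WHERE created_at &amp;lt; now() - interval '90 days';
ROLLBACK;  -- EXPLAIN ANALYZE really ran the UPDATE; undo it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;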

&lt;p&gt;Reset &lt;code&gt;pg_stat_statements&lt;/code&gt; after major releases (&lt;code&gt;SELECT pg_stat_statements_reset()&lt;/code&gt;) to get a clean baseline. Cumulative stats make it hard to isolate the impact of recent changes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-slow-query-log" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-slow-query-log&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL Monitoring Tools Compared (2026)</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 15 Apr 2026 10:00:03 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-monitoring-tools-compared-2026-24k3</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-monitoring-tools-compared-2026-24k3</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Monitoring Tools Compared (2026)
&lt;/h1&gt;

&lt;p&gt;PostgreSQL gives you everything you need to understand what's happening inside your database -- &lt;code&gt;pg_stat_statements&lt;/code&gt;, &lt;code&gt;pg_stat_activity&lt;/code&gt;, &lt;code&gt;pg_locks&lt;/code&gt;, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;. The data is all there. The problem is turning it into actionable insight without building a custom monitoring stack.&lt;/p&gt;

&lt;p&gt;If you're running more than a couple of PostgreSQL instances, or if "SSH in and run a query" isn't a sustainable monitoring strategy, you need a tool. Here's every major option compared.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Baseline: What PostgreSQL Provides Natively
&lt;/h2&gt;

&lt;p&gt;Before evaluating tools, know what you get for free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Current activity&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'client backend'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_event&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Top queries by total time (requires pg_stat_statements)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Database health snapshot&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blks_hit&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blks_hit&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;blks_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cache_hit_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;xact_commit&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;xact_rollback&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rollbacks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deadlocks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_database&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find yourself running these queries manually more than once a week, you need a monitoring tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tool Categories
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Postgres-Specialized SaaS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  myDBA.dev
&lt;/h4&gt;

&lt;p&gt;Purpose-built for PostgreSQL. A lightweight Go collector gathers metrics at 15-second intervals. No agent installation on the database server needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated health checks across 10 domains with scored assessments&lt;/li&gt;
&lt;li&gt;EXPLAIN plan capture and regression detection&lt;/li&gt;
&lt;li&gt;Index advisor with specific recommendations&lt;/li&gt;
&lt;li&gt;Lock chain visualization&lt;/li&gt;
&lt;li&gt;Extension-specific monitoring: TimescaleDB, pgvector, PostGIS&lt;/li&gt;
&lt;li&gt;Replication topology mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newer product, smaller community than established tools&lt;/li&gt;
&lt;li&gt;No infrastructure-level metrics (CPU, memory, disk) -- monitors PostgreSQL, not the host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier (7-day retention, 1 instance). Pro tier for extended retention and multi-instance.&lt;/p&gt;

&lt;h4&gt;
  
  
  pganalyze
&lt;/h4&gt;

&lt;p&gt;Mature PostgreSQL monitoring focused on query performance analysis. Open-source collector written in Go. Good documentation and established user base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep query performance analysis&lt;/li&gt;
&lt;li&gt;Automated index recommendations&lt;/li&gt;
&lt;li&gt;Schema evolution tracking&lt;/li&gt;
&lt;li&gt;Log-based EXPLAIN plan collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collector requires installation on the database host or sidecar container&lt;/li&gt;
&lt;li&gt;Batch-interval processing (not real-time)&lt;/li&gt;
&lt;li&gt;No health check scoring&lt;/li&gt;
&lt;li&gt;No extension-specific monitoring (pgvector, PostGIS, TimescaleDB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Starts at $249/month per server. No free tier for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  General-Purpose SaaS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Datadog
&lt;/h4&gt;

&lt;p&gt;Monitors PostgreSQL as part of its broader platform alongside host metrics, APM, and logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified view across your entire stack&lt;/li&gt;
&lt;li&gt;Correlate database slowness with application latency and infra issues&lt;/li&gt;
&lt;li&gt;Excellent alerting and dashboard customization&lt;/li&gt;
&lt;li&gt;APM integration shows which endpoints generate the most DB load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shallow Postgres depth -- no EXPLAIN plans, no index advisor, no vacuum analysis&lt;/li&gt;
&lt;li&gt;Query analysis relies on pg_stat_statements without deeper plan-level insights&lt;/li&gt;
&lt;li&gt;Postgres is one of hundreds of integrations, not the focus&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Database monitoring starts at $70/host/month (on top of $15+/host for infra monitoring).&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Hosted Open Source
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Percona Monitoring and Management (PMM)
&lt;/h4&gt;

&lt;p&gt;Open-source monitoring for PostgreSQL, MySQL, and MongoDB. Grafana-based dashboards with VictoriaMetrics storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free and open-source&lt;/li&gt;
&lt;li&gt;Query analytics (QAN) for slow query identification&lt;/li&gt;
&lt;li&gt;Familiar Grafana dashboards&lt;/li&gt;
&lt;li&gt;Multi-database support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You must host, upgrade, backup, and scale the PMM server&lt;/li&gt;
&lt;li&gt;PostgreSQL support less mature than MySQL (Percona's core focus)&lt;/li&gt;
&lt;li&gt;No automated health checks or EXPLAIN plan regression detection&lt;/li&gt;
&lt;li&gt;Dashboard complexity can overwhelm smaller teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free. Commercial support available.&lt;/p&gt;

&lt;h4&gt;
  
  
  pgwatch2
&lt;/h4&gt;

&lt;p&gt;Postgres-only monitoring. Collects metrics via SQL queries, stores in InfluxDB or TimescaleDB, visualizes with Grafana.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free and Postgres-specific&lt;/li&gt;
&lt;li&gt;Flexible custom SQL metric collection&lt;/li&gt;
&lt;li&gt;Lightweight collector&lt;/li&gt;
&lt;li&gt;Good time-series storage choices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three components to host and maintain (collector, metrics store, Grafana)&lt;/li&gt;
&lt;li&gt;No built-in alerting (Grafana alerting or external tools)&lt;/li&gt;
&lt;li&gt;No EXPLAIN plan analysis or recommendations&lt;/li&gt;
&lt;li&gt;Significantly more setup effort than hosted tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI / Log Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  pgBadger
&lt;/h4&gt;

&lt;p&gt;Parses PostgreSQL log files and generates detailed HTML reports. Not real-time -- post-hoc analysis only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely detailed reports: query normalization, hourly patterns, error categorization&lt;/li&gt;
&lt;li&gt;Zero database load (reads log files, not live connections)&lt;/li&gt;
&lt;li&gt;Single Perl script, no dependencies beyond a Perl installation&lt;/li&gt;
&lt;li&gt;Free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static reports, not continuous monitoring&lt;/li&gt;
&lt;li&gt;Requires specific PostgreSQL logging configuration&lt;/li&gt;
&lt;li&gt;No alerting, no dashboards, no ongoing tracking&lt;/li&gt;
&lt;/ul&gt;
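
&lt;p&gt;As a rough idea of what that logging configuration involves, the sketch below enables the kinds of settings pgBadger's documentation lists so that its report sections have data to work with. Treat the exact values as assumptions and check the pgBadger documentation for your version, particularly for the &lt;code&gt;log_line_prefix&lt;/code&gt; formats it can parse:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative values; tune the thresholds for your workload
ALTER SYSTEM SET log_min_duration_statement = '250ms';
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;               -- log every temporary file
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log every autovacuum run
ALTER SYSTEM SET lc_messages = 'C';                -- untranslated messages for the parser

SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;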

&lt;h3&gt;
  
  
  DIY: pg_stat_statements + Grafana
&lt;/h3&gt;

&lt;p&gt;Build your own by querying system views, storing results in Prometheus/InfluxDB/TimescaleDB, and building Grafana dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Complete control, no vendor lock-in, integrates with existing Grafana.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt; Significant build and maintenance time. No automated analysis. System views change between major versions, so a PostgreSQL upgrade can break your collection queries. Whoever builds it ends up maintaining it indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;myDBA.dev&lt;/th&gt;
&lt;th&gt;pganalyze&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;PMM&lt;/th&gt;
&lt;th&gt;pgwatch2&lt;/th&gt;
&lt;th&gt;pgBadger&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Hours-Days&lt;/td&gt;
&lt;td&gt;Hours-Days&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosting&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres depth&lt;/td&gt;
&lt;td&gt;Deep&lt;/td&gt;
&lt;td&gt;Deep&lt;/td&gt;
&lt;td&gt;Shallow&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Deep (logs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXPLAIN plans&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health check scoring&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index advisor&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extension monitoring&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Delayed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Via Grafana&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (self-host)&lt;/td&gt;
&lt;td&gt;Yes (self-host)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Small team, few instances, need Postgres depth:&lt;/strong&gt; myDBA.dev (free tier) or pganalyze (paid). Both provide the Postgres-specific insights that generic tools miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform team, many services, need unified observability:&lt;/strong&gt; Datadog. Its Postgres monitoring is shallow but the correlation with APM and infrastructure metrics is valuable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget-constrained, willing to self-host:&lt;/strong&gt; PMM for the most features, pgwatch2 for a lighter footprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just need periodic analysis:&lt;/strong&gt; pgBadger. Parse your logs, get a report, fix the issues. No ongoing infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The general rule:&lt;/strong&gt; evaluate tools against your most common incidents. If your problems are missing indexes, vacuum backlogs, and replication lag, choose a tool that monitors all three with specific recommendations -- not one that shows you a CPU graph and leaves you to figure out the database-level cause.&lt;/p&gt;
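
&lt;p&gt;As a rough illustration of what a tool needs to surface for those three incident types, here is how each can be spot-checked from the native statistics views. These are minimal sketches, not substitutes for a proper index advisor or lag alerting, and the row limits are arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Possible missing indexes: tables read mostly by sequential scans
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan &amp;gt; coalesce(idx_scan, 0)
ORDER BY seq_tup_read DESC
LIMIT 10;

-- Vacuum backlog: tables carrying the most dead tuples
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Replication lag in bytes, per standby (run on the primary)
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If a candidate tool cannot show you at least this much, with history and recommendations on top, it probably won't shorten those incidents.&lt;/p&gt;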

&lt;h2&gt;
  
  
  Start Here
&lt;/h2&gt;

&lt;p&gt;Regardless of which tool you choose, these are foundational:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enable &lt;code&gt;pg_stat_statements&lt;/code&gt;&lt;/strong&gt; -- nearly every tool above relies on it (pgBadger, which reads logs, is the exception)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set &lt;code&gt;log_min_duration_statement&lt;/code&gt;&lt;/strong&gt; -- capture slow queries in logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;&lt;/strong&gt; -- no tool replaces understanding query plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor continuously&lt;/strong&gt; -- trends reveal problems before users do&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then layer your monitoring tool on top for historical analysis, alerting, and automated recommendations.&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>postgres</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
