Forem: Philip McClarence

PostgreSQL Plan Signatures: Quick Reference

Philip McClarence — Wed, 06 May 2026 14:00:06 +0000

A scannable lookup companion to the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Designed for when you have an EXPLAIN plan in front of you and need the pattern → fix mapping fast, without re-reading the deep-dive articles. Three tables below:

Plan-node signatures — "when you see this, do that."
SQL anti-patterns — "if your code looks like this, replace with that."
MyDBA analyzer rules — severity, trigger condition, and the article that covers each.

If a row points to a deeper article, that's where the full explanation with captured EXPLAIN examples lives.

1. Plan-node signatures

A "signature" is a field or combination of fields you can spot in an EXPLAIN plan at a glance.

Plan signature	What it means	Fix	Deep dive
`Seq Scan` on table > 10k rows with selective filter	Missing or unusable index	Add an index on the filter column, or see non-sargable cases below	Index usage
`Seq Scan` + `Filter: fn(col) = ...`	Function on column disables index	Normalise on write, or add expression index `ON t (fn(col))`	WHERE clause
`Seq Scan` with `Rows Removed by Filter >> Actual Rows`	Filter running after scan instead of index	Add index matching the filter; check if filter is sargable	WHERE clause
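`Index Only Scan` with large `Heap Fetches:`	Visibility map is stale, so the scan still visits the heap	`VACUUM` the table and keep autovacuum healthy; if the plan is a plain `Index Scan` because the index lacks SELECT-list columns, add `INCLUDE` columns	Index usage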
`Index Only Scan` with `Heap Fetches: 0`	Optimal — the visibility map is working	No action; keep autovacuum healthy	Reading EXPLAIN
`Nested Loop` with outer > 1,000 and no Memoize	Quadratic join on large inputs	Add index on inner join column; `nested_loop_large` rule	Joins
`Hash Join` with `Batches > 1`	Hash table spilled to disk	Raise `work_mem` per-session, or add index for different strategy	Joins
`Hash` node `Memory Usage > work_mem`	Will spill on next execution	Same: raise `work_mem` or change join strategy	Joins
`Sort` with `Sort Method: external merge`	Sort didn't fit in work_mem	Raise `work_mem` or add index providing sorted input	Aggregate/window
`Sort` with `Sort Method: top-N heapsort`	Good — only N rows kept in memory	No action; verify `LIMIT` is reasonable	Reading EXPLAIN
`HashAggregate` with `Batches > 1` or `Disk Usage > 0`	Aggregate spill	Raise `work_mem` or enable `GroupAggregate` with sorted index	Aggregate/window
`actual rows=r loops=l` where `l >> 1`	Node executed per outer-loop iteration	Usually a nested loop or SubPlan; see whether rewrite is possible	Joins, Subquery/CTE
`SubPlan N` appearing under an outer node	Correlated subquery executed per row	Rewrite as JOIN, aggregating JOIN, or LATERAL	Subquery/CTE
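`CTE Scan`	Materialised CTE, predicates can't push in	`NOT MATERIALIZED` (PG 12+) if the CTE is non-recursive and side-effect-free; single-reference CTEs of that kind are already inlined by default	Subquery/CTE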
`Plan Rows` vs `Actual Rows` off by 10×+	Stale statistics → bad plan	`ANALYZE` table, consider extended statistics	Reading EXPLAIN
`Workers Planned > Workers Launched`	Parallel-worker pool exhausted	Raise `max_parallel_workers`, check for contention	Reading EXPLAIN
`Lossy Heap Blocks > 50%` on `Bitmap Heap Scan`	Bitmap exceeded work_mem, fell back to page-level	Raise `work_mem`	Reading EXPLAIN
`Gather` / `Gather Merge` above every scan	Parallelism engaged; check worker count is optimal	Usually fine; tune `max_parallel_workers_per_gather` if I/O-bound	Joins
`Buffers: shared read >> shared hit` on hot path	Working set doesn't fit in cache	Raise `shared_buffers`, check for too-small cache	Reading EXPLAIN
`Memoize` with near-zero hits	Cache isn't paying off (no repeated keys)	No action; negligible cost	Joins
`Run Condition:` on `WindowAgg`	PG 15+ optimisation — window function short-circuited	Working as intended	Aggregate/window
`Incremental Sort` with `Presorted Key:`	Input already sorted on a prefix of the sort keys, so only the remaining keys are sorted within each group	Working as intended; cheaper than full sort	Aggregate/window

2. SQL anti-patterns and replacements

When you recognise one of these in code review, the replacement is usually a mechanical substitution.

Anti-pattern	Why it's bad	Replacement
`SELECT *` from wide table	Disables Index Only Scan; bloats network payload	Name the columns you actually need
`WHERE int_col = 123.0`	Implicit cast on column disables index (`text_col = 123` is rejected outright with "operator does not exist")	`WHERE int_col = 123`
`WHERE lower(col) = 'x'`	Function on column disables index	Normalise on write, or `CREATE INDEX ON t (lower(col))`
`WHERE col NOT IN (SELECT ...)`	NULL-unsafe; returns 0 rows on NULLs	`WHERE NOT EXISTS (SELECT 1 FROM ... WHERE ...)`
`WHERE col = 'a' OR col = 'b'`	Usually fine, but harder to index-plan	`WHERE col IN ('a', 'b')`
`OFFSET N LIMIT M` with large N	Reads and discards N rows	Keyset pagination with composite cursor
`SELECT DISTINCT ... ORDER BY ...` for top-1-per-group	Ambiguous — any row, not the first	`SELECT DISTINCT ON (key) ... ORDER BY key, ...`
Loop of one-row `INSERT`s	1 round-trip per row	`COPY FROM STDIN`, or multi-row `VALUES`, or `INSERT ... SELECT`
`SELECT` then `INSERT` upsert	Race condition; two round trips	`INSERT ... ON CONFLICT (col) DO UPDATE`
`DELETE FROM t WHERE old_date < ...` on massive table	Single huge lock; WAL storm; autovacuum blocked	Chunked loop of `DELETE ... WHERE id IN (SELECT ... LIMIT 10000)` with `COMMIT` per chunk
N+1 from ORM loops	1 + N round trips; N plan-and-execute cycles	Eager load in one or two queries (`joinedload`, `includes`, `prefetch_related`)
`count(*)` on huge tables	Full table scan	`reltuples` estimate, trigger-maintained counter, or redesign UI to not need the total
`date_trunc('day', col) = '2024-01-15'`	Function disables index on `col`	`col >= '2024-01-15' AND col < '2024-01-16'`
Storing dates/numbers/booleans as text	Every type-aware query non-sargable	`ALTER TABLE` to the right type
WHERE clause with `NOW()` wrapped by user-defined function marked VOLATILE	Re-evaluated per row	Mark UDF `STABLE` or `IMMUTABLE` if semantics allow
Transactions held open across external calls	Autovacuum blocked, bloat	Finish SQL before external calls; keep transactions short
`SELECT ... FOR UPDATE` on big ranges	Locks every row returned	Use `FOR UPDATE SKIP LOCKED` with a `LIMIT` for worker queues; narrow the SELECT

3. MyDBA analyzer rules

The 15-rule first-pass analyzer in frontend/src/utils/explain-plan-analyzer.ts. Use this as a reference for what each rule means and where the full treatment lives in the series.

Rule ID	Severity	Trigger condition	Article
`seq_scan_large`	warning / critical	`Seq Scan` with `Plan Rows > 10,000` (critical above 100,000)	Index usage
`excessive_filter_rows`	warning	`Rows Removed by Filter / Actual Rows > 10` and `Rows Removed > 1000`	WHERE clause
`nested_loop_large`	warning	`Nested Loop` with outer > 1,000 and inner > 100 rows	Joins
`sort_on_disk`	warning	`Sort Space Type = Disk`	Aggregate/window
`hash_batches_spill`	warning	`Hash Batches > 1` on Hash or Hash Join	Joins
`row_estimate_inaccurate`	warning / critical	`actual_rows / plan_rows` ratio > 10 or < 0.1 (critical at 100 / 0.01)	Reading EXPLAIN
`lossy_bitmap_scan`	warning	`Lossy Heap Blocks / Total > 50%` on `Bitmap Heap Scan`	Reading EXPLAIN
`cte_materialized`	info	`CTE Scan` node present	Subquery/CTE
`correlated_subplan`	warning	Node's JSON has a non-empty `Subplan Name`	Subquery/CTE
`parallel_workers_missing`	info	`Workers Launched < Workers Planned`	Reading EXPLAIN
`high_cache_miss_rate`	warning	`shared_read / (read + hit) > 50%` and `read > 1000` blocks	Reading EXPLAIN
`temp_blocks_written`	warning	`Temp Written Blocks > 100`	Aggregate/window
`very_high_total_cost`	warning	`Total Cost > 1,000,000`	Reading EXPLAIN
`deep_plan_tree`	info	Plan has > 30 nodes	Reading EXPLAIN
`no_index_usage`	warning	No `Index*` nodes, at least one `Seq Scan`, > 2 total nodes	Index usage

Severity meanings:

critical: plan is almost certainly broken for this query size.
warning: plan has a specific, known problem; investigate.
info: worth noting, not necessarily actionable.

The analyzer runs on EXPLAIN plans in JSON format. If you paste a text-format plan into the MyDBA EXPLAIN visualiser, some rules (specifically the ones that rely on fields the text parser doesn't extract — such as Hash Batches or Temp Written Blocks) may not fire even when the underlying condition is present. Prefer capturing plans with EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT JSON) for the most reliable analysis.
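As a rough illustration of how the trigger conditions in the table translate into code, here is a minimal Python sketch of three of the rules evaluated against a parsed EXPLAIN (FORMAT JSON) result. It is not the MyDBA analyzer itself: the function names walk_plan, check_node, and analyze are hypothetical, and applying the cache-miss rule to each node rather than to the plan as a whole is an assumption.

```python
def walk_plan(node):
    """Yield a plan node and all of its descendants, depth-first."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def check_node(node):
    """Return (rule_id, severity) pairs for one plan node.

    Thresholds follow the rules table; the implementation is only a sketch.
    """
    findings = []
    plan_rows = node.get("Plan Rows", 0)
    actual_rows = node.get("Actual Rows", 0)

    # row_estimate_inaccurate: ratio > 10 or < 0.1 (critical at 100 / 0.01).
    if plan_rows > 0 and actual_rows > 0:
        ratio = actual_rows / plan_rows
        if ratio > 100 or ratio < 0.01:
            findings.append(("row_estimate_inaccurate", "critical"))
        elif ratio > 10 or ratio < 0.1:
            findings.append(("row_estimate_inaccurate", "warning"))

    # excessive_filter_rows: removed / actual > 10 and removed > 1000.
    removed = node.get("Rows Removed by Filter", 0)
    if removed > 1000 and removed / max(actual_rows, 1) > 10:
        findings.append(("excessive_filter_rows", "warning"))

    # high_cache_miss_rate: read / (read + hit) > 50% and read > 1000 blocks.
    read = node.get("Shared Read Blocks", 0)
    hit = node.get("Shared Hit Blocks", 0)
    if read > 1000 and read / (read + hit) > 0.5:
        findings.append(("high_cache_miss_rate", "warning"))

    return findings


def analyze(explain_json):
    """Run check_node over every node of a parsed EXPLAIN (FORMAT JSON) result."""
    root = explain_json[0]["Plan"]
    return [
        (node["Node Type"], rule, severity)
        for node in walk_plan(root)
        for rule, severity in check_node(node)
    ]
```

Feed it the output of json.loads() on the EXPLAIN result; each finding names the node type, the rule ID, and the severity.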

How to use this reference

Capture the plan with EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) in psql, or via the MyDBA EXPLAIN visualiser which captures it automatically.
Scan for the dominant signature — the single biggest cost in the plan, usually the node with the highest actual time × loops.
Look up the signature in table 1. If the fix points to a deeper article, read the relevant section there.
Apply the fix. Almost all of them are single SQL statements or a single session GUC change.
Re-run EXPLAIN. Verify the plan actually changed the way you expected. If it didn't, the fix was for a different root cause.
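Step 2 can be approximated in code. The sketch below ranks nodes by self time, taken as Actual Total Time × Actual Loops for the node minus the same product for its direct children. The function names are hypothetical, and the subtraction is only an approximation: parallel workers, InitPlans, and CTEs can make per-node times overlap.

```python
def inclusive_ms(node):
    """Time spent in a node including its children, summed over all loops."""
    return node.get("Actual Total Time", 0.0) * node.get("Actual Loops", 1)


def self_times(node):
    """Yield (node_type, self_time_ms) for a node and every descendant."""
    children = node.get("Plans", [])
    own = inclusive_ms(node) - sum(inclusive_ms(child) for child in children)
    # Clamp at zero: rounding in EXPLAIN output can push the difference negative.
    yield node["Node Type"], max(own, 0.0)
    for child in children:
        yield from self_times(child)


def dominant_node(explain_json):
    """Return the (node_type, self_time_ms) pair with the largest self time."""
    root = explain_json[0]["Plan"]
    return max(self_times(root), key=lambda pair: pair[1])
```

The node this returns is the one to look up in table 1 first.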

The entire workflow, end-to-end, is described in the pillar article. If the series has been useful, that's the single link to bookmark — everything else is deep-dive for specific categories.

#postgres #performance #database #sql

Full series and canonical copy: https://mydba.dev/blog/postgres-plan-signatures-quick-reference

PostgreSQL Query Anti-Patterns and Common Mistakes

Philip McClarence — Tue, 05 May 2026 14:00:05 +0000

Most of the articles in this series are about making reasonable queries faster. This one is about queries that are wrong by construction — patterns that account for most production SQL performance incidents, not because the query is subtle, but because the pattern is near-universal and the fix is well-known to people who've seen it before. If you recognise these in your codebase, fixing them is almost always a straightforward win.

Each anti-pattern below maps to a specific plan signature or log symptom; cross-references to deeper treatment are inline.

1. N+1 queries — the ORM default

The single most common performance problem in applications talking to PostgreSQL. The pattern is one query to fetch a list of parent rows, plus one query per parent to fetch related data:

# One query — fetches 500 orders.
orders = db.query("SELECT * FROM orders WHERE status = 'pending'")

# Five hundred more queries — one per order.
for order in orders:
    items = db.query(
        "SELECT * FROM order_items WHERE order_id = %s",
        order.id,
    )

501 network round trips. 501 query plans. 501 parse-and-execute cycles. Even if each individual query is 2 ms, you've burned a full second of wall-clock time on a dashboard query that could have been one SQL statement.

Fix: a single JOIN.

SELECT o.order_id, o.status, o.created_at,
       oi.item_id, oi.quantity, oi.unit_price_cents
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

One query. The application then groups the result by order_id to reconstruct the parent-children shape.
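A minimal sketch of that regrouping step, assuming each result row arrives as a dict keyed by column name (many drivers can be configured to return rows that way). The function name group_items_by_order is hypothetical.

```python
def group_items_by_order(rows):
    """Rebuild the parent-children shape from flat JOIN rows.

    Each row is assumed to be a dict with the columns selected in the JOIN.
    Orders come back in the order they first appear in the result.
    """
    orders = {}
    for row in rows:
        order = orders.setdefault(
            row["order_id"],
            {
                "order_id": row["order_id"],
                "status": row["status"],
                "created_at": row["created_at"],
                "items": [],
            },
        )
        order["items"].append(
            {
                "item_id": row["item_id"],
                "quantity": row["quantity"],
                "unit_price_cents": row["unit_price_cents"],
            }
        )
    return list(orders.values())
```

Because the query uses an inner JOIN, orders with no items never appear in the result; switch to a LEFT JOIN (and skip rows whose item_id is NULL) if empty orders matter.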

Detecting N+1 from the PostgreSQL side: pg_stat_statements normalises parameter values away, so the pattern shows up as a single order_items entry whose calls count is hundreds of times that of the parent orders query. Most modern ORMs can eager-load related rows in one or two queries when you ask: in SQLAlchemy it's joinedload(); in ActiveRecord, .includes(:items); in Django, prefetch_related(). Usually a one-line change.

2. `SELECT *` on wide tables in hot paths

Every column the query returns has a cost: bytes read from heap, bytes serialised over the wire, bytes deserialised on the client. A 400-column table with SELECT * returns hundreds of bytes per row that the application typically discards.

The worse consequence is plan quality. An Index Only Scan can only be chosen when every referenced column is available from the index — SELECT * forces heap fetches unconditionally. A covering index with INCLUDE becomes useless because the planner still has to visit the heap for the columns not in the index.

Fix: name the columns you actually need. The resulting plan can use covering indexes, narrower sort keys, and tighter network payload.

Where it's OK to keep SELECT *: database-admin queries, dump scripts, and CTEs that genuinely need every column. In production application code, it's almost always wrong.

3. Implicit type casts that disable indexes

Covered in detail in the WHERE clause article. The symptom: an index exists on the filtered column, but the plan shows Seq Scan with the filter appearing in a Filter: line instead of Index Cond:.

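-- integer column compared against a numeric literal.
SELECT * FROM t WHERE int_col = 123.0;

PostgreSQL resolves this as a numeric comparison and casts the integer column to numeric, so a btree index on int_col cannot be used. (A text column compared against an integer literal, WHERE text_col = 123, never gets that far: stock PostgreSQL rejects it with "operator does not exist: text = integer".) The three fix patterns:

Use the correct literal type: WHERE int_col = 123.
Cast the literal explicitly: WHERE int_col = 123.0::int.
If the column semantically should be numeric, change the schema.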

This anti-pattern also appears after schema migrations — a column type change from varchar(N) to citext or a domain type can silently shift which side of the comparison gets cast. Always re-run EXPLAIN on the key queries after any column-type change.

4. Functions on indexed columns

The other half of the non-sargable family. Wrapping an indexed column in a function disables the index:

SELECT * FROM sim_bp_users WHERE lower(email) = '...';
SELECT * FROM sim_bp_orders WHERE date_trunc('day', created_at) = '2024-01-15';

On our dataset, WHERE lower(email) LIKE 'user12%' produces a parallel sequential scan at 122 ms. The same query against the plain column with a pattern-ops index takes 25 ms.

Fixes:

Normalise on write — store emails lowercased, dates as dates not timestamps, and so on.
Expression index that computes the same function — CREATE INDEX ON sim_bp_users (lower(email)).
Rewrite the predicate to avoid the function — WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16' instead of date_trunc('day', created_at) = '2024-01-15'.
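For the third fix, the application has to supply both ends of the range. A minimal sketch, assuming UTC day boundaries and psycopg-style %s placeholders (both are assumptions, as is the helper name day_bounds):

```python
from datetime import date, datetime, time, timedelta, timezone


def day_bounds(day, tz=timezone.utc):
    """Return (start, end) datetimes covering one calendar day, end exclusive."""
    start = datetime.combine(day, time.min, tzinfo=tz)
    return start, start + timedelta(days=1)


# Half-open range on the bare column, so an index on created_at stays usable.
ORDERS_FOR_DAY_SQL = (
    "SELECT * FROM sim_bp_orders "
    "WHERE created_at >= %s AND created_at < %s"
)

params = day_bounds(date(2024, 1, 15))
```

The half-open form (>= start, < next day) avoids the off-by-one problems of BETWEEN with a 23:59:59 upper bound.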

5. Missing `LIMIT` on exploratory joins

A query written in development against a 100-row test table can become a 10-million-row query in production. Joins without a LIMIT are the most common source of surprise-large result sets — a new feature ships, the WHERE clause on the production dataset matches a million rows instead of a hundred, and the application silently returns a million rows per request.

-- No LIMIT — returns however many rows happen to match.
SELECT o.*, u.email
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending';

Fix: add a LIMIT. For user-facing queries, paginate (ideally with keyset — see query rewriting techniques). For background jobs, chunk the processing.

This is also a security posture question. A query without a LIMIT exposed through an API endpoint gives an attacker a trivial amplification attack — one HTTP request → one full-table read.

6. One-row-at-a-time INSERTs

Bulk-loading data with a loop of INSERT INTO t VALUES ($1, $2, $3) calls is common in migrations, ETL jobs, and CSV importers. Each statement is a full round trip plus a plan-and-execute cycle; for a million rows, that's a million of each.

Fixes, roughly in order of preference:

COPY FROM STDIN — PostgreSQL's bulk-load protocol. Orders of magnitude faster than INSERT because it avoids per-row parsing and planning. Most drivers expose it.
Multi-row VALUES — INSERT INTO t (...) VALUES ($1, ...), ($2, ...), ($3, ...) packs multiple rows into one statement. Ten to a hundred per call is a sweet spot.
INSERT INTO t SELECT ... FROM source — if the source data is already in the database, do the load in SQL and skip the client round-trip entirely.

Wrap the bulk operation in a single transaction so WAL flushes happen at commit, not per row. synchronous_commit = off for the loading session is safe for non-durable data and further speeds things up.
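When COPY is not available, the multi-row VALUES form can be generated mechanically. The sketch below builds the statement text and a flat parameter list; the helper names and the psycopg-style %s placeholders are assumptions, and the chunk size is left to the caller.

```python
def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_multirow_insert(table, columns, rows):
    """Build one multi-row INSERT and its flat parameter list.

    Identifiers are interpolated directly, so `table` and `columns` must come
    from trusted code, never from user input. Values stay as bound parameters.
    """
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    params = [value for row in rows for value in row]
    return sql, params
```

A loader would iterate over chunked(all_rows, n), call build_multirow_insert for each chunk, and execute the result inside one transaction.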

7. Keeping transactions open

Transactions hold locks and hold back the xmin horizon, which stops vacuum from removing dead rows. A transaction left open for an hour (usually by a stuck batch job or a long-running analytical query) prevents autovacuum from cleaning up dead tuples across the whole database for the duration. Symptoms: growing table bloat, vacuum runs that reclaim nothing, XID-wraparound warnings on busy databases.

The specific antipattern is usually application-level:

# One transaction stays open across every external call; if one call hangs, the transaction hangs with it.
with db.transaction():
    rows = db.query("SELECT ...")  # reads a snapshot.
    for row in rows:
        call_external_api(row)  # could block for minutes.
        db.execute("UPDATE ...")

Fix: keep transactions short. Never call out to external services inside a transaction. Batch the reads outside the transaction; inside the transaction, do only the minimum set of SQL changes that need to be atomic.
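One way to restructure the loop above, as a sketch. The db object is the same hypothetical client as in the original snippet, assumed to run standalone queries in autocommit mode, and the SQL text is left elided exactly as it was.

```python
def process_rows(db, call_external_api):
    """Read, call out, then write, holding a transaction only for the write.

    `db` is a hypothetical client exposing query(), execute() and
    transaction(), mirroring the snippet above.
    """
    # Read outside any explicit transaction (assumes autocommit for reads).
    rows = db.query("SELECT ...")
    for row in rows:
        # The slow external call happens with no transaction open.
        result = call_external_api(row)
        # Short transaction: only the statements that must be atomic.
        with db.transaction():
            db.execute("UPDATE ...", result)
```

The trade-off is that the read and the writes no longer share a snapshot, so the UPDATE should re-check any condition that may have changed in between.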

pg_stat_activity shows xact_start for every running transaction; any row where now() - xact_start > interval '5 minutes' is worth investigating. The idle in transaction state is the most common failure mode.

8. Storing dates, numbers, or booleans as strings

Every time a type-mismatched predicate runs, the planner either (a) casts the literal to text (sargable, OK) or (b) casts the column to the right type (not sargable, disables index). The schema wins: if the column is created_at text, every query that filters it numerically or by date pays a non-sargable cost.

-- created_at stored as text like '2024-01-15 14:22:00'.
SELECT * FROM logs
WHERE created_at::timestamp > now() - interval '1 hour';  -- non-sargable.

Fix: migrate to the right type. ALTER TABLE ... ALTER COLUMN ... TYPE timestamptz USING created_at::timestamptz; rewrites the whole table while holding an ACCESS EXCLUSIVE lock, so reads and writes are blocked for the duration of the rewrite. On small tables pick a quiet moment; for very large tables do an online migration (new column, backfill, switch, drop old).

Indexes on the right type are smaller, faster, and actually usable by predicates. Indexes on text representations of dates give you the worst of both worlds: locale-sensitive comparison, extra parse cost, and usually the wrong sort order for natural ranges.

9. `count(*)` as a cheap operation

Applications often display "total records: 847,291" somewhere, computed with SELECT count(*) FROM big_table. On PostgreSQL, that's a full scan — the storage layer doesn't maintain row counts because of MVCC visibility. A 100-million-row count(*) takes seconds to minutes.

Fixes:

Accept an approximate count. pg_class.reltuples is the planner's cached estimate — good within a few percent on recently-analyzed tables, and free. SELECT reltuples::bigint FROM pg_class WHERE relname = 'big_table'.
Trigger-maintained counter table. If you truly need an exact count and it's shown on every page load, maintain it with an AFTER INSERT/DELETE trigger that updates a tiny summary table.
Avoid the need for a count. Pagination UIs that show "Page 127 of 3,482,195" rarely benefit from the total; "Page 127" with Next/Previous buttons is enough.

10. Ignoring `pg_stat_statements`

The extension that tracks query statistics — most-expensive queries, average time per execution, total time across all runs. On any non-toy database it's the single most useful diagnostic tool.

-- Requires pg_stat_statements in shared_preload_libraries (needs a server restart).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total time:
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Total time (not per-call time) is the right sort: a 2 ms query called a million times is a bigger target than a 5 s query called once a day.

Set pg_stat_statements.track = all (not the default top) to capture statements inside functions too. On cloud-managed instances, this is often already on by default.

Catching anti-patterns automatically

Many of these patterns show up as specific plan shapes:

seq_scan_large → patterns 3, 4, 8 (non-sargable predicates, missing indexes).
excessive_filter_rows → pattern 5 (missing LIMIT, wide filter).
nested_loop_large → pattern 1 (N+1 at plan level, or missing join index).
no_index_usage → patterns 3, 4, 8 (index never usable because of casts or types).
hash_batches_spill, sort_on_disk, temp_blocks_written → work_mem / aggregate patterns from earlier articles.

The workflow is the same every time: capture a plan, identify the category, apply the fix, verify. Most of the value is in recognising the category quickly — the fixes are standard patterns once the category is clear. The pillar guide ties the full series together.

#postgres #performance #database #sql

Originally published at https://mydba.dev/blog/postgres-query-anti-patterns

PostgreSQL Query Rewriting Techniques

Philip McClarence — Mon, 04 May 2026 14:00:07 +0000

The previous articles in this series covered performance problems you fix by adding indexes, restructuring joins, or tuning memory. This one is about the queries where the plan is "fine" — every node is doing something reasonable — but the query itself is asking the wrong question, producing unnecessarily large intermediate results or forcing the planner down a path that a different SQL shape would avoid.

These rewrites don't change what the query returns. They change how PostgreSQL goes about computing it. Learn to recognise the patterns and most of them are mechanical — if the original form matches X, rewrite to Y — and the performance improvement is often an order of magnitude or more with no downside.

This article is the seventh in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Every EXPLAIN block below is captured from the same Neon Postgres 17.8 database used throughout.

Offset pagination → keyset pagination

The single highest-impact rewrite in this article. OFFSET N LIMIT M is the default pagination shape in most ORMs and REST API frameworks. It's also a performance landmine as soon as users deep-paginate. To return page 1000 of 500,000 rows (20 per page), PostgreSQL must read and discard 19,980 rows before returning the 20 you want. Page 1 is fast; page 1000 is slow; page 10000 is a disaster.

Captured against our 500,000-row sim_bp_orders table, skipping 24,000 pages of 20 orders each to reach page 24,001 of 25,000:

SELECT order_id, user_id, created_at
FROM sim_bp_orders
ORDER BY created_at DESC
LIMIT 20 OFFSET 480000;

Limit  (cost=28511.34..28512.53 rows=20 width=16) (actual time=1900.713..1900.731 rows=20 loops=1)
  Buffers: shared hit=481693 read=1775
  ->  Index Scan Backward using idx_sim_bp_orders_created_at
        (actual time=0.018..1878.609 rows=480020 loops=1)
 Execution Time: 1900.750 ms

1.9 seconds for 20 rows. The Index Scan Backward returns rows=480020 before the Limit takes 20: PostgreSQL walked the created_at index backwards, visited every heap tuple for visibility checks, and discarded 99.996% of them. Buffers: shared hit=481693 read=1775 is 483,468 page accesses; at the default 8 kB page size that is about 3.7 GiB of page traffic for a result the size of a tweet.

The fix is keyset pagination — instead of OFFSET 480000, remember the cursor value of the last row you returned and ask for rows less than that:

-- Pass the created_at from the last row of the previous page.
SELECT order_id, user_id, created_at
FROM sim_bp_orders
WHERE created_at < '2024-03-01'
ORDER BY created_at DESC
LIMIT 20;

Limit  (actual time=0.979..1.014 rows=20 loops=1)
  Buffers: shared hit=22 read=1
  ->  Index Scan Backward using idx_sim_bp_orders_created_at
        Index Cond: (created_at < '2024-03-01'::timestamptz)
        (actual time=0.978..1.010 rows=20 loops=1)
 Execution Time: 1.032 ms

1 ms, 23 buffers touched (22 hits, 1 read). The Index Cond means the planner could start the index scan from the cursor position rather than the beginning: no discarded rows, no wasted buffer reads. Page 1 and page 10,000 have identical cost.

Three things to know about keyset pagination:

Use a composite cursor for uniqueness. ORDER BY created_at DESC isn't a deterministic total order unless created_at is unique. For production systems, use (created_at, id) or similar: WHERE (created_at, order_id) < ('2024-03-01 14:22:00+00', 984523) ORDER BY created_at DESC, order_id DESC LIMIT 20. This ensures no rows are skipped or duplicated at page boundaries when multiple rows share the same timestamp.
The index has to match the sort. ORDER BY created_at DESC, order_id DESC works against (created_at DESC, order_id DESC) directly or (created_at, order_id) read backwards. Mismatches force an in-memory sort that undoes the keyset win.
You give up random-access "jump to page N" semantics. Keyset pagination is forward/backward through an ordered stream. Most APIs and infinite-scroll UIs don't actually need random access; if yours does, you're stuck with OFFSET (or need a completely different data model).
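On the application side, the composite cursor is usually handed to clients as an opaque token. The sketch below shows one way to do that; the function names, the base64-encoded JSON token format, and the psycopg-style %s placeholders are all assumptions, and created_at is carried as an ISO-style string.

```python
import base64
import json


def encode_cursor(created_at, order_id):
    """Pack the last row's sort key into an opaque, URL-safe token."""
    payload = json.dumps([created_at, order_id]).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(token):
    """Unpack a token produced by encode_cursor into (created_at, order_id)."""
    created_at, order_id = json.loads(base64.urlsafe_b64decode(token))
    return created_at, order_id


def page_query(token=None, page_size=20):
    """Return (sql, params) for the first page, or for the page after `token`."""
    sql = "SELECT order_id, user_id, created_at FROM sim_bp_orders"
    params = []
    if token is not None:
        # Row comparison matches the ORDER BY, so the index scan can start here.
        sql += " WHERE (created_at, order_id) < (%s, %s)"
        params.extend(decode_cursor(token))
    sql += " ORDER BY created_at DESC, order_id DESC LIMIT %s"
    params.append(page_size)
    return sql, params
```

The token is not signed, so treat its contents as untrusted input; because the decoded values are only ever passed as bound parameters, a tampered token can change which page is returned but cannot alter the SQL.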

Correlated scalar subquery → aggregating JOIN

A scalar subquery in the SELECT list runs once per outer row (SubPlan N in the plan). Total cost is the outer row count times the cost of one inner lookup, which approaches O(n²) when each inner lookup has to scan. The rewrite is a LEFT JOIN to a pre-aggregated subquery or CTE:

-- Before: SubPlan runs once per user.
SELECT u.user_id,
       u.email,
       (SELECT count(*) FROM sim_bp_orders o
        WHERE o.user_id = u.user_id AND o.status = 'pending') AS pending_count
FROM sim_bp_users u
WHERE u.status = 'active';

-- After: single aggregation, left-joined.
SELECT u.user_id, u.email, COALESCE(p.pending_count, 0) AS pending_count
FROM sim_bp_users u
LEFT JOIN (
    SELECT user_id, count(*) AS pending_count
    FROM sim_bp_orders
    WHERE status = 'pending'
    GROUP BY user_id
) p ON p.user_id = u.user_id
WHERE u.status = 'active';

The rewrite computes all per-user counts in a single aggregating scan over sim_bp_orders, then joins them against users. On large outer sets (say, all 200k active users instead of LIMIT 100), the rewrite is usually 20-100× faster because the aggregation happens once rather than 200,000 times.

For "top-N related rows per outer" (not just count), use LATERAL JOIN with LIMIT N.

`NOT IN` → `NOT EXISTS`

The most insidious bug in SQL, bar none. NOT IN returns no rows whenever the inner set contains a single NULL, because x NOT IN (a, b, NULL) evaluates to x <> a AND x <> b AND x <> NULL, and x <> NULL is unknown, making the whole AND evaluate to unknown (not-true, hence excluded).

-- If any user in the inner query has a NULL email, this returns empty.
SELECT * FROM customers
WHERE email NOT IN (SELECT email FROM unsubscribed_users);

-- Correct, NULL-safe equivalent:
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM unsubscribed_users u WHERE u.email = c.email);

NOT EXISTS uses existence semantics, not three-valued logic, so NULLs don't poison the result. The two forms also usually produce different plans. NOT EXISTS typically becomes an anti-join (Hash Anti Join, Merge Anti Join, or Nested Loop Anti Join in the plan), which PostgreSQL executes about as cheaply as a regular join. NOT IN against a subquery is typically planned as a SubPlan filter instead: hashed when the inner set fits in work_mem, re-scanned per outer row when it does not, and the latter is far slower.

Rule: never write NOT IN against a subquery unless you've confirmed the compared column is NOT NULL at the schema level. In production code, just default to NOT EXISTS.
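The NULL behaviour is easier to believe once you step through it. The sketch below models SQL's three-valued logic in Python, with None standing in for NULL/unknown; it is an illustration of the semantics, not of how PostgreSQL implements them, and all function names are made up for the example.

```python
def sql_not_equal(left, right):
    """SQL's <> operator: unknown (None) if either side is NULL."""
    if left is None or right is None:
        return None
    return left != right


def sql_and(values):
    """SQL's AND over several operands: False wins, then unknown, then True."""
    result = True
    for value in values:
        if value is False:
            return False
        if value is None:
            result = None
    return result


def sql_not_in(x, candidates):
    """x NOT IN (candidates), evaluated as a chain of x <> c joined by AND."""
    return sql_and(sql_not_equal(x, c) for c in candidates)


def where(rows, predicate):
    """A WHERE clause keeps a row only when the predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


emails = ["a@example.com", "b@example.com", "c@example.com"]
# One NULL in the inner set: every row evaluates to False or unknown.
kept_with_null = where(emails, lambda e: sql_not_in(e, ["b@example.com", None]))
# No NULLs: behaves the way most people expect.
kept_without_null = where(emails, lambda e: sql_not_in(e, ["b@example.com"]))
```

With the NULL present, kept_with_null is empty; without it, kept_without_null holds the two addresses that are not in the list.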

`DISTINCT` → `GROUP BY`

SELECT DISTINCT tells PostgreSQL to deduplicate the output; GROUP BY on the same columns does the same thing. When the only goal is deduplication (no aggregate functions), the two are equivalent, and the planner usually produces the same plan for each. But GROUP BY is strictly more flexible — it composes with HAVING, plays nicely with window functions, and handles expressions more cleanly.

The rewrite that actually matters is when DISTINCT is used in a query shape that's really asking for something else. "The latest order per user" is often written as:

-- Wrong: this gets any order, not the latest.
SELECT DISTINCT ON (user_id) user_id, order_id, created_at
FROM sim_bp_orders;

DISTINCT ON (user_id) returns one row per user_id, but which row is unspecified without an ORDER BY. Usually you want:

SELECT DISTINCT ON (user_id) user_id, order_id, created_at
FROM sim_bp_orders
ORDER BY user_id, created_at DESC;

This returns the latest order per user, provided the ORDER BY starts with the DISTINCT ON column. An index on (user_id, created_at DESC) lets this run as an index scan that emits one row per user without a separate sort.

DISTINCT ON is a PostgreSQL extension (not standard SQL) but it's the cleanest expression of "top-1 per group" when the pattern fits. For top-N with N > 1, use LATERAL (below) or a window function with a Run Condition.

Chunked deletes and updates

Large DELETE or UPDATE statements take locks on every row they touch, generate WAL proportional to the row count, and can trigger autovacuum storms. A 10-million-row delete often locks out writers for minutes. The rewrite is to do it in chunks:

-- Problematic: single massive delete.
DELETE FROM sim_bp_logs WHERE created_at < now() - interval '90 days';

-- Chunked: loop until no more rows to delete.
DO $$
DECLARE
    deleted_count int;
BEGIN
    LOOP
        DELETE FROM sim_bp_logs
        WHERE log_id IN (
            SELECT log_id FROM sim_bp_logs
            WHERE created_at < now() - interval '90 days'
            LIMIT 10000
        );
        GET DIAGNOSTICS deleted_count = ROW_COUNT;
        EXIT WHEN deleted_count = 0;
        COMMIT;  -- Releases locks; next iteration starts fresh txn.
    END LOOP;
END $$;

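Each chunk commits separately, releasing locks and letting autovacuum catch up between iterations. The IN (SELECT ... LIMIT ...) form is needed because DELETE ... LIMIT isn't valid PostgreSQL syntax (unlike MySQL). COMMIT inside a DO block requires PostgreSQL 11 or later, and the block must not be run inside an outer transaction.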

The same pattern applies to bulk UPDATEs. Batch size depends on row width and lock contention tolerance — 1,000 for wide rows with heavy concurrent load, up to 100,000 for narrow rows on an off-hours maintenance window.
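The loop can also be driven from application code, which makes it easy to add a pause between chunks or a cap on the work done per run. A minimal sketch: run_chunk is a hypothetical callable that executes the chunk statement with the given limit, commits, and returns the number of rows deleted, and the %s placeholder is psycopg-style by assumption.

```python
import time

# Same statement as the DO block above, with the chunk size as a parameter.
CHUNK_SQL = """
DELETE FROM sim_bp_logs
WHERE log_id IN (
    SELECT log_id FROM sim_bp_logs
    WHERE created_at < now() - interval '90 days'
    LIMIT %s
)
"""


def delete_in_chunks(run_chunk, chunk_size=10000, pause_seconds=0.0, max_chunks=None):
    """Call run_chunk(chunk_size) until it reports zero deleted rows.

    run_chunk is expected to execute CHUNK_SQL, commit, and return the row
    count. Returns the total number of rows deleted across all chunks.
    """
    total = 0
    chunks = 0
    while max_chunks is None or chunks < max_chunks:
        deleted = run_chunk(chunk_size)
        if deleted == 0:
            break
        total += deleted
        chunks += 1
        if pause_seconds:
            # Give autovacuum and concurrent writers room between chunks.
            time.sleep(pause_seconds)
    return total
```

Keeping the database call behind a callable means the loop itself can be unit-tested without a server.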

`INSERT ... ON CONFLICT`

Pre-existing code often uses a read-then-write pattern for upserts:

-- Anti-pattern: race condition between the SELECT and INSERT.
SELECT 1 FROM sim_bp_users WHERE email = $1;
-- (application: if not found) INSERT INTO sim_bp_users ...;

Two round trips, and two sessions can both read "not found" and both try to insert, producing a unique-constraint violation. The PostgreSQL idiom is INSERT ... ON CONFLICT:

INSERT INTO sim_bp_users (email, username, status)
VALUES ($1, $2, 'active')
ON CONFLICT (email) DO UPDATE
    SET username = EXCLUDED.username,
        status   = 'active'
RETURNING user_id;

One round trip, atomic, race-free. EXCLUDED references the row that would have been inserted (before the conflict). For "do nothing on duplicate," use ON CONFLICT (col) DO NOTHING. The conflict target must be a column or constraint that has a unique index — without one, PostgreSQL has no way to detect "a conflicting row already exists."

`SELECT *` in production queries

Not a rewrite of the query's logic, but a rewrite of its projection. SELECT * from a wide table pulls every column over the wire and through every plan node — Index Only Scans degrade to regular Index Scans (heap fetches required for the extra columns), join memory usage multiplies, sort widths explode.

The specific cost isn't always catastrophic, but the robustness cost is. A column-type change on an upstream table can break downstream consumers that didn't know they depended on the old width. In production code, name every column you actually need.

The exception: dump tools, ad-hoc debugging, and CTEs that genuinely pass all columns through. Context-dependent, but the default should be "name the columns."

`HAVING` vs `WHERE`

HAVING filters after aggregation; WHERE filters before. If a predicate could apply before aggregation, it should — the aggregate then operates on fewer rows. A classic misuse:

-- Inefficient: aggregate over all orders, then filter.
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
GROUP BY user_id
HAVING user_id IN (SELECT user_id FROM active_users);

-- Better: filter before aggregation.
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
WHERE user_id IN (SELECT user_id FROM active_users)
GROUP BY user_id;

The WHERE clause restricts the set of rows that go into the GROUP BY, so the aggregate runs over a smaller input. Only predicates that depend on the aggregate result (e.g., HAVING count(*) > 5) belong in HAVING; anything else is almost always more efficient in WHERE.

The planner usually pushes predicates from HAVING to WHERE when it's safe, but not always — especially when there are subqueries or complex expressions involved. Writing the filter in WHERE to begin with removes the uncertainty.
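
A query often needs both kinds of predicate at once. As a sketch of the split (the threshold of 5 and the 'pending' filter are illustrative values only):

```sql
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
WHERE status = 'pending'   -- row-level predicate: applied before grouping
GROUP BY user_id
HAVING count(*) > 5;       -- depends on the aggregate: has to stay in HAVING
```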

Composite rewrites: correlated subquery + LATERAL + keyset pagination

Real-world queries often combine several anti-patterns. The "show me the latest 20 orders for each of the top 100 users by lifetime spend" query is classic:

-- Naive: one subquery for the user list, window function for the per-user top-N.
WITH top_users AS (
    SELECT user_id
    FROM sim_bp_orders
    GROUP BY user_id
    ORDER BY sum(total_amount_cents) DESC
    LIMIT 100
)
SELECT * FROM (
    SELECT o.*,
           row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM sim_bp_orders o
    WHERE user_id IN (SELECT user_id FROM top_users)
) t WHERE rn <= 20;

The CTE lists 100 top users; the window function computes row numbers for all their orders (potentially thousands each); then the outer WHERE keeps only the top 20 per user. The window function is doing 10-100× the work that's actually needed.

Rewritten with LATERAL + LIMIT:

WITH top_users AS (
    SELECT user_id, sum(total_amount_cents) AS total_spent
    FROM sim_bp_orders
    GROUP BY user_id
    ORDER BY 2 DESC
    LIMIT 100
)
SELECT t.user_id, t.total_spent, recent.*
FROM top_users t,
LATERAL (
    SELECT order_id, total_amount_cents, created_at
    FROM sim_bp_orders
    WHERE user_id = t.user_id
    ORDER BY created_at DESC
    LIMIT 20
) recent;

For each of the 100 top users, a LATERAL subquery returns their 20 most recent orders — at most 2000 rows total, vs potentially hundreds of thousands in the window-function form. PostgreSQL 15+ can sometimes optimise the window-function form via Run Condition, but LATERAL is both clearer and more reliably cheap.
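
The LATERAL form is only cheap if each inner lookup can read a user's newest orders straight from an index. A sketch of a supporting index, assuming none with this shape exists yet (the index name is hypothetical):

```sql
-- Equality column first, then the ORDER BY column in the direction the query reads it.
CREATE INDEX IF NOT EXISTS idx_sim_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at DESC);
```

With it in place, each of the 100 inner subqueries can become an index scan that stops after 20 entries rather than sorting all of that user's orders.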

When not to rewrite

Every rewrite has a small risk of changing semantics in an edge case. Before deploying:

Diff the results. Run the old and new forms against the same data; check the row counts and a representative sample match exactly.
Check the plan with EXPLAIN ANALYZE. The rewrite should show the cost improvement you expect; if it doesn't, there's a case where the planner disagreed.
Run both under load. Synthetic benchmarks rarely capture the real cache and concurrency effects. A rewrite that's 10× faster in isolation might be only 2× faster in production — still worth it, but measure.
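
One way to run the "diff the results" step entirely in SQL is a two-way EXCEPT ALL, which compares the two forms as multisets. The sketch below reuses the HAVING vs WHERE pair from earlier; old_form and new_form are placeholder names for whatever two queries are being compared.

```sql
WITH old_form AS (
    SELECT user_id, count(*) AS order_count
    FROM sim_bp_orders
    GROUP BY user_id
    HAVING user_id IN (SELECT user_id FROM active_users)
),
new_form AS (
    SELECT user_id, count(*) AS order_count
    FROM sim_bp_orders
    WHERE user_id IN (SELECT user_id FROM active_users)
    GROUP BY user_id
)
-- Rows (with multiplicity) that appear in one form but not the other.
(SELECT 'only in old' AS side, *
 FROM (SELECT * FROM old_form EXCEPT ALL SELECT * FROM new_form) d)
UNION ALL
(SELECT 'only in new' AS side, *
 FROM (SELECT * FROM new_form EXCEPT ALL SELECT * FROM old_form) d);
```

An empty result means the two forms returned the same rows the same number of times. It says nothing about row order, so compare ORDER BY behaviour separately if the caller depends on it.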

Rewriting for performance is the right move after indexing, before buying bigger hardware. The patterns in this article cover most of what you'll find in a typical OLTP codebase; for the actually-broken queries — the ones that are wrong by construction — see the companion article on PostgreSQL Query Anti-Patterns and Common Mistakes.

#postgres #performance #database #sql

Full series and canonical version: https://mydba.dev/blog/postgres-query-rewriting-techniques

PostgreSQL WHERE Clause Optimization

Philip McClarence — Fri, 01 May 2026 14:00:04 +0000

The single question that decides whether an index helps your query is: can the planner match the WHERE clause against the index? If the answer is yes, you get an index or bitmap scan and the query returns quickly. If the answer is no — because you wrapped the indexed column in a function, used an implicit cast, or combined conditions with OR in a way the planner can't decompose — the index is silently unused and the table is sequentially scanned.

The catch is that "the planner can match the predicate" isn't a yes-or-no rule; it's a long list of conditions. This article is the sixth in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series and covers the conditions most often violated in production SQL. Every EXPLAIN block is captured from the series' Neon Postgres 17.8 database.

Sargable predicates — the rule in one sentence

A predicate is sargable (Search ARGument ABLE) when it compares an indexed column, or a leading prefix of an indexed expression, against a constant or parameter — without wrapping the indexed value in a function the planner can't invert. The term isn't formal PostgreSQL terminology, but it's the right mental model: sargable ⇒ indexable; non-sargable ⇒ sequential scan, no matter how many indexes you add.

The canonical non-sargable predicate is a function on the column:

-- Not sargable — the index on email can't help.
SELECT * FROM sim_bp_users WHERE lower(email) = 'user42@example.com';

An index on email doesn't help here because the planner's test is lower(email) = constant, and the index doesn't store lower(email). The fix is either:

Normalise on write. Store emails lowercased; query against the raw column. Most applications should have been doing this anyway.
Expression index on the function. CREATE INDEX ON sim_bp_users (lower(email)) — the index stores the lowercased value, and lower(email) = 'x' becomes sargable against it.
citext extension. A case-insensitive text type with its own operator class. Indexes on citext columns work for equality and pattern operators; which exact cases are index-usable depends on the operator class and the collation semantics. citext is usually the cleanest solution for "case-insensitive equality everywhere" in the schema; for prefix-heavy workloads, an expression index with text_pattern_ops (covered in the index usage article) is often a better fit because its semantics are simpler.
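
As a sketch of the expression-index option (both index names are hypothetical):

```sql
-- Equality lookups on the lowercased value.
CREATE INDEX idx_sim_bp_users_email_lower
    ON sim_bp_users (lower(email));

-- The query has to repeat the indexed expression for the planner to match it.
SELECT user_id, email
FROM sim_bp_users
WHERE lower(email) = 'user42@example.com';

-- Prefix searches (lower(email) LIKE 'user12%') need the pattern operator
-- class unless the database uses the C collation.
CREATE INDEX idx_sim_bp_users_email_lower_pattern
    ON sim_bp_users (lower(email) text_pattern_ops);
```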

A real capture shows the difference. Non-sargable lower(email) LIKE 'user12%' against 200k rows:

Parallel Seq Scan on sim_bp_users
    Filter: (lower((email)::text) ~~ 'user12%'::text)
    Rows Removed by Filter: 94444
 Execution Time: 122.833 ms

Sargable email LIKE 'user12%' with the existing text_pattern_ops index:

Index Only Scan using idx_sim_bp_users_email_pattern on sim_bp_users
    Index Cond: ((email ~>=~ 'user12'::text) AND (email ~<~ 'user13'::text))
    Heap Fetches: 0
 Execution Time: 24.757 ms

Same data, same 20-row output, 5× faster — and the ratio widens with table size.

Implicit casts that silently disable indexes

PostgreSQL's type system is strict, but it will coerce types when the operator allows it. The implicit coercion usually happens on the constant side of the comparison, which is safe. When it happens on the column side, it's a silent index-bypass:

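-- Sargable: PG resolves the untyped literal '123' to int; the index on int_col still applies.
SELECT * FROM t WHERE int_col = '123';

-- Not sargable: there is no int = numeric operator, so PG casts int_col to numeric, wrapping the column.
SELECT * FROM t WHERE int_col = 123.0;

-- Not sargable: an explicit cast on the column wraps it the same way.
SELECT * FROM t WHERE text_col::int = 123;

In the second and third forms, the planner sees cast(column) = constant, and the cast prevents a plain index on the column from matching. (A bare text_col = 123 never gets that far: PostgreSQL has no implicit text-to-integer cast and rejects the query with "operator does not exist: text = integer".) The fix is usually "use the right type in the query," but occasionally a text column genuinely needs to index-match integer literals, in which case an expression index on the cast solves it: CREATE INDEX ON t ((text_col::int)), used by queries that write the same text_col::int = 123 expression. Rare, but real.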

More insidious: the varchar(N) ↔ text case. status varchar(20) is indexed; the query does WHERE status = 'pending'. PostgreSQL picks the right operator and the index is used. Change the column type to citext or an application-specific domain, and operator resolution can pick a different candidate — sometimes applying a cast on the column side and silently disabling the index. Schema-type changes are a plan-breaking migration; re-run EXPLAIN on the key queries after any column-type change.

The leftmost-prefix rule (quickly, with the consequences)

A composite btree on (a, b, c) helps queries that use:

a = ?
a = ? AND b = ?
a = ? AND b = ? AND c = ?
a = ? AND b < ?
a = ? AND b = ? ORDER BY c

It does not help queries that filter only on b or only on c. A range on a combined with equality on b is only partly served: the range on a bounds the scan, but every index entry inside that range still has to be checked against b. The planner can use a prefix of the index's columns starting from the leading one.

The practical implication: for a composite index, put equality predicates first and range/ORDER BY columns last. An index on (tenant_id, created_at) serves a tenant-scoped time-range filter cleanly; with (created_at, tenant_id), the same query has to walk every tenant's entries in the time range and discard the non-matching ones, and the planner may prefer a seq scan outright.

A common mistake is trying to "cover multiple access patterns with one composite index." If the app filters sometimes by status, sometimes by user_id, and sometimes by both, neither (status, user_id) nor (user_id, status) serves both single-column filters efficiently. You usually want two single-column indexes (the planner will combine them with a BitmapAnd when both are filtered), or one composite index plus a single-column index on the composite's trailing column, since the composite already serves filters on its leading column alone.
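
To make the "equality first, range last" ordering concrete, here is a sketch on the orders table. The index name is hypothetical, and the planner may still choose a different plan depending on how selective each predicate is.

```sql
-- Equality column first, range / ORDER BY column last.
CREATE INDEX idx_sim_bp_orders_status_created
    ON sim_bp_orders (status, created_at);

-- Can be served by one contiguous slice of the index: the 'pending'
-- entries from the last 7 days, already ordered by created_at.
SELECT order_id, created_at
FROM sim_bp_orders
WHERE status = 'pending'
  AND created_at >= now() - interval '7 days'
ORDER BY created_at;
```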

OR across indexed columns — the BitmapOr pattern

OR in a WHERE clause used to be a classic "can't use an index" gotcha. Modern PostgreSQL handles the common case well via BitmapOr. Each branch of the OR produces a bitmap from its respective index; the bitmaps are unioned; a single heap scan visits only the matching pages:

SELECT user_id, email
FROM sim_bp_users
WHERE email = 'user42@example.com'
   OR username = 'user42';

Bitmap Heap Scan on sim_bp_users
  Recheck Cond: (((email)::text = 'user42@example.com'::text)
              OR ((username)::text = 'user42'::text))
  ->  BitmapOr
        ->  Bitmap Index Scan on idx_sim_bp_users_email_pattern
              Index Cond: ((email)::text = 'user42@example.com'::text)
        ->  Bitmap Index Scan on idx_sim_bp_users_username_pattern
              Index Cond: ((username)::text = 'user42'::text)
 Execution Time: 4.406 ms

4.4 ms. Both branches of the OR hit an index; the BitmapOr merges the two TID bitmaps (automatically deduplicating tuple IDs that appeared in both branches, since the bitmap is a set structure indexed by TID); the Bitmap Heap Scan visits each matched page once, rechecks the combined condition, and emits matching rows. No rewrite needed.

OR becomes a problem when only some of the branches are indexable, or when the branches match most of the table. In those cases the planner often falls back to a seq scan because the total estimated cost of two bitmap scans + union + recheck is similar to a single scan. If the optimizer picks a seq scan for an OR you thought would hit an index, check each branch individually — the non-sargable one is usually the culprit.

OR → UNION: when the planner won't decompose

For the classic "OR across tables" case — WHERE t.x = 1 OR u.y = 2 in a join — the planner can't always produce a BitmapOr because the two sides are in different relations. The rewrite:

-- Before: OR across joined tables.
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending' OR u.status = 'suspended';

-- After: UNION (not UNION ALL — we want deduplication).
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending'
UNION
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE u.status = 'suspended';

Each branch of the UNION is a separate query the planner can optimise independently — one can use an index on o.status, the other on u.status, and the dedup at the top removes overlap. This only wins when both branches are individually selective; if one branch matches most of the table, UNION isn't faster.

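UNION vs UNION ALL matters for correctness: UNION dedupes (expensive if the output is large and has many overlaps); UNION ALL doesn't (faster, but returns duplicate rows for the overlap). Default to UNION if you're rewriting an OR to preserve equivalent semantics. The equivalence holds only when the select list identifies each row uniquely, as order_id does here; otherwise UNION also collapses duplicates that the OR form would have returned.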
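
If the dedup step turns out to be the expensive part, one alternative is UNION ALL with the second branch excluding the rows the first branch already returned. This is a general SQL technique rather than something measured on the series' database; a sketch:

```sql
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending'
UNION ALL
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE u.status = 'suspended'
  -- Skip rows the first branch already produced. IS DISTINCT FROM keeps
  -- rows whose o.status is NULL, which a plain <> would silently drop.
  AND o.status IS DISTINCT FROM 'pending';
```

The branches are disjoint by construction, so no sort or hash is needed to remove overlap.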

`LIKE '%needle%'` — leading wildcards

Standard btree indexes can only help LIKE when the pattern has a fixed prefix. LIKE 'user12%' is range-scannable (with text_pattern_ops or C collation); LIKE '%user12%' isn't — there's no way to translate it into a range on a sorted index.

The fix is a trigram index (pg_trgm extension, GIN):

CREATE EXTENSION pg_trgm;
CREATE INDEX idx_users_email_trgm ON sim_bp_users USING gin (email gin_trgm_ops);

-- Now this can use a Bitmap Index Scan on the GIN index instead of a seq scan:
SELECT * FROM sim_bp_users WHERE email LIKE '%user12%';

Trigram GIN indexes store overlapping 3-character substrings of the text. The query engine decomposes %user12% the same way and looks up candidate rows in the index. The match set is usually narrow enough that the subsequent heap scan is cheap, even though it has to re-verify each candidate against the full pattern.

GIN indexes have write amplification — inserts are roughly 3× slower than for a btree, and updates trigger full re-indexing of the changed row. Use trigram GINs sparingly on high-write tables.

`IS NULL`, `IS NOT NULL`, and three-valued logic

IS NULL is sargable against a btree and against most specialised indexes. column IS NULL can use an index if the index covers nulls (the default for btrees), producing a fast point-scan of the null rows. This is worth knowing because "find the records that haven't been processed yet" is a common pattern on append-mostly tables.

The failure mode is <> with nullable columns. status <> 'completed' excludes rows where status is NULL: NULL <> 'completed' evaluates to NULL (unknown) rather than true, and WHERE keeps only rows whose predicate is true. If you actually want "all rows where status is not completed or is unknown," you have to write it explicitly: status <> 'completed' OR status IS NULL, or status IS DISTINCT FROM 'completed' (which treats NULL as a value).

NOT IN (SELECT ...) on a nullable inner column is the same trap at a higher level: if any row in the subquery has NULL for the compared column, NOT IN returns no rows at all. Use NOT EXISTS (see the subquery/CTE article) unless you've proven the inner column is NOT NULL.
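
The NOT IN trap is easy to reproduce without any tables. A self-contained sketch using inline VALUES lists:

```sql
-- Returns zero rows. For x = 1 the test is NOT (1 = 2 OR 1 = NULL), which is
-- NULL rather than true; for x = 2 it is plainly false.
SELECT x
FROM (VALUES (1), (2)) AS t(x)
WHERE x NOT IN (SELECT y FROM (VALUES (2), (NULL::int)) AS s(y));

-- Returns the single row x = 1. NOT EXISTS only asks whether a matching
-- row exists, so the NULL in s never poisons the result.
SELECT x
FROM (VALUES (1), (2)) AS t(x)
WHERE NOT EXISTS (
    SELECT 1
    FROM (VALUES (2), (NULL::int)) AS s(y)
    WHERE s.y = t.x
);
```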

Function calls on the constant side — safe

The non-sargable warning applies to functions on the column side, not the constant side. WHERE created_at > now() - interval '1 day' is sargable because now() - interval '1 day' evaluates to a constant (once per query), and the planner compares the indexed created_at to that constant.

The subtlety is that functions marked VOLATILE (like random()) can't be evaluated once and cached; they're re-evaluated per row, which changes the plan in surprising ways. User-defined functions default to VOLATILE unless you explicitly mark them STABLE or IMMUTABLE. If you're calling a UDF in a WHERE clause that should be constant for the duration of the query, mark it STABLE — otherwise the planner treats it as volatile and loses optimisation opportunities.
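
As a sketch of marking a helper STABLE, assuming created_at is a timestamp type (reporting_cutoff is a hypothetical function invented for this example):

```sql
-- STABLE: the result does not change within a single statement, so the
-- planner may treat the call as a constant for index comparisons.
CREATE OR REPLACE FUNCTION reporting_cutoff()
RETURNS timestamptz
LANGUAGE sql
STABLE
AS $$ SELECT now() - interval '1 day' $$;

SELECT order_id
FROM sim_bp_orders
WHERE created_at > reporting_cutoff();
```

Left at the default VOLATILE, the same function could not serve as an index scan bound and would be evaluated as a per-row filter instead.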

Partial index predicate implication

Partial indexes have their own sargability requirement, in addition to the usual one: the query's WHERE clause must imply the partial index's predicate, from the planner's perspective. The planner uses a built-in theorem prover on predicates, which handles equality, inequality, and simple boolean structure. It doesn't handle:

Function calls (WHERE lower(status) = 'pending' won't match a partial index on WHERE status = 'pending' because the function disables the implication).
OR-wrapped forms that don't obviously decompose.
Casts that the theorem prover doesn't recognise as reversible.

When a partial index isn't being used, the most common reason is that the query's predicate isn't obviously implying the partial predicate. Rewrite the query to match the partial predicate as literally as possible.
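
A sketch of a partial index with one query that matches its predicate and one that does not (the index name is hypothetical):

```sql
CREATE INDEX idx_sim_bp_orders_pending_created
    ON sim_bp_orders (created_at)
    WHERE status = 'pending';

-- Eligible: the WHERE clause contains the index predicate verbatim.
SELECT order_id
FROM sim_bp_orders
WHERE status = 'pending'
  AND created_at > now() - interval '1 day';

-- Not eligible: the prover does not look inside lower(), so it cannot
-- conclude that every matching row satisfies status = 'pending'.
SELECT order_id
FROM sim_bp_orders
WHERE lower(status) = 'pending'
  AND created_at > now() - interval '1 day';
```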

A diagnostic recipe

When an index exists but a query isn't using it:

Look at the Filter: line in EXPLAIN. If the filter mentions the indexed column with any function around it, that's the non-sargable form. Rewrite.
Check column types match literal types. WHERE int_column = '123' is fine; WHERE int_column = 123.0 (or a numeric-typed parameter) casts the column to numeric and loses the index.
Check the Index Cond: line for the expected index. If the index is available but the plan shows Filter: instead of Index Cond:, the planner decided the predicate couldn't use the index — look for functions or casts on the column.
Try SET enable_seqscan = off; just for the session. The resulting plan tells you what the planner would use if forced. If it's still a seq scan or a bizarre fallback, the predicate is genuinely unindexable.
For partial indexes, read the partial predicate carefully. The query's WHERE clause has to imply it literally, not just semantically.
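
For the enable_seqscan step, wrapping the experiment in a transaction with SET LOCAL keeps the setting from leaking into the rest of the session. A sketch, using the lower(email) query from earlier as the query under test:

```sql
BEGIN;
SET LOCAL enable_seqscan = off;   -- reverts automatically at COMMIT or ROLLBACK
EXPLAIN
SELECT user_id, email
FROM sim_bp_users
WHERE lower(email) = 'user42@example.com';
ROLLBACK;
```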

Next steps

When the predicates are right but the query itself is structured awkwardly, the next article — Query Rewriting Techniques — covers the systematic transformations that turn expensive SQL into cheap SQL without changing results: DISTINCT → GROUP BY, keyset pagination, batch operations, and the other rewrites every production SQL writer eventually needs.

#postgres #performance #database #sql

Originally published at mydba.dev/blog/postgres-where-clause-optimization.

PostgreSQL Aggregate and Window Function Tuning

Philip McClarence — Thu, 30 Apr 2026 14:00:08 +0000

GROUP BY and window functions look declarative — the query says what it wants, and PostgreSQL figures out how to compute it. In practice the planner has strong opinions about how: whether to hash or sort, whether to parallelise, whether to spill memory to disk, whether a matching index changes the plan entirely. Learn to read what the planner picked and why, and aggregate-heavy queries become one of the easiest categories to tune.

This article is the fifth in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Every EXPLAIN block below is captured from a real run on the series' Neon Postgres 17.8 database (500,000-row sim_bp_orders and friends).

The two aggregate strategies

For GROUP BY, the planner chooses primarily between two implementations — plus some parallel and distinct variants layered on top.

HashAggregate builds a hash table keyed by the group-by columns; each incoming row probes the hash and either creates a new entry or updates an existing one's running aggregate state. Fast when the hash table fits in work_mem. Doesn't care about input order.

GroupAggregate requires input already sorted on the group-by columns. Each group's rows arrive contiguously, so the aggregate can emit a result row and clear its state between groups — constant memory regardless of group count. Picked when the input is already sorted (typically because the group-by matches an index order) or when the planner thinks the hash table won't fit.

The distinguishing signal in EXPLAIN is the node type itself: HashAggregate vs GroupAggregate. When you see Sort → GroupAggregate and no matching index, the planner has decided a sort + streaming aggregate is cheaper than trying to hash. In parallel plans you'll often see a composite shape — Partial HashAggregate inside each worker, topped by Finalize GroupAggregate on the leader — which is a parallel partial-aggregation pattern rather than "just a HashAggregate."

Here's that exact shape, from the classic dashboard query "how many orders in each status?":

SELECT status, count(*), avg(total_amount_cents)
FROM sim_bp_orders
GROUP BY status;

Finalize GroupAggregate  (cost=8334.96..8336.27 rows=5 width=49)
    (actual time=148.938..151.912 rows=5 loops=1)
  Group Key: sim_bp_orders.status
  Buffers: shared hit=3705
  ->  Gather Merge  (actual time=148.924..151.895 rows=15 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (actual time=140.390..140.391 rows=5 loops=3)
              Sort Key: sim_bp_orders.status
              Sort Method: quicksort  Memory: 25kB
              ->  Partial HashAggregate
                    (actual time=140.366..140.367 rows=5 loops=3)
                    Group Key: sim_bp_orders.status
                    Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on sim_bp_orders
                          (actual time=0.006..32.097 rows=166667 loops=3)
 Execution Time: 151.973 ms

152 ms. This is parallel partial aggregation: each parallel worker (plus the leader, making three process loops) computes a partial HashAggregate over its slice of the table (rows=166667 loops=3 ≈ 500k total), produces its five-row partial result, sorts those by status, and feeds them up to Gather Merge. The leader then finalises with Finalize GroupAggregate — combining the three sets of partial states into five final rows. Partial aggregation is the reason aggregate queries scale so well with parallel workers: only the partial group states (5 rows per worker here, 15 rows total) cross the worker-to-leader boundary, no matter how big the input was.

The Batches: 1 Memory Usage: 24kB on the Partial HashAggregate means the hash table fit in work_mem and didn't spill. Five groups with running sum/count fits easily in 24 kB.

The aggregate spill — and how to diagnose it

Things get interesting when the number of groups grows. A HashAggregate spill on a 117k-group count looks like:

HashAggregate  (actual time=347.354..392.138 rows=117060 loops=1)
  Group Key: u.email
  Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 6920kB

The Disk Usage: 6920kB and Batches > 1 are the spill signals. PostgreSQL 13+ handles this gracefully — the executor detects that not all groups fit in memory, writes partial state to per-partition spill files, and processes them in additional passes — but the extra I/O is not free. On our database it cost roughly 40% of the query's total time.

Two fixes for HashAggregate spills:

Raise work_mem per-session so the hash fits in memory. Set per-role (ALTER ROLE analytics SET work_mem = '64MB') rather than cluster-wide, because work_mem is allocated per sort/hash node per connection and a cluster-wide raise multiplies by concurrency.
Sort + GroupAggregate is cheaper than a spilling HashAggregate when the group-by column is indexed. Force it with SET enable_hashagg = off; as a diagnostic, and if the Sort + GroupAggregate plan is faster, the underlying issue is "too many groups for current work_mem." Usually the right answer is to raise work_mem for the session anyway, since Sort also uses work_mem.
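
Both fixes can be tried inside one transaction so that neither setting outlives the experiment. The query below is an assumed stand-in for the 117k-group aggregate (the capture above shows only Group Key: u.email, not the SQL that produced it):

```sql
BEGIN;

-- Fix 1: give the hash table room, for this transaction only.
SET LOCAL work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, count(*)
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
GROUP BY u.email;

-- Fix 2 (diagnostic only): see what Sort + GroupAggregate costs instead.
SET LOCAL enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, count(*)
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
GROUP BY u.email;

ROLLBACK;
```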

The MyDBA analyzer rule temp_blocks_written fires when a node's Temp Written Blocks exceeds 100. That field is populated from JSON-format EXPLAIN output — MyDBA's visualiser runs the rules over JSON plans, not the text format pasted here — so the rule fires automatically on both HashAggregate spills and Sort spills when captured through the native integration.
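
To inspect that field by hand, request JSON output with buffer statistics. A sketch using the status dashboard query from earlier:

```sql
-- Each plan node in the JSON carries "Temp Read Blocks" and
-- "Temp Written Blocks" keys; Sort nodes also report "Sort Space Type".
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT status, count(*), avg(total_amount_cents)
FROM sim_bp_orders
GROUP BY status;
```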

Sort spills — the external merge

When a sort doesn't fit in work_mem, PostgreSQL falls back to an external merge sort: write sorted runs to disk, then merge them. You see this as Sort Method: external merge with a Disk: size in the sort node:

SELECT status,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;

Percentiles are expensive because the implementation needs an ordered sample per group. PostgreSQL's percentile_cont evaluates as an ordered-set aggregate, which requires sorting the input per group:

GroupAggregate  (actual time=202.589..358.067 rows=5 loops=1)
  Group Key: status
  Buffers: shared hit=3697, temp read=2707 written=2494
  ->  Sort  (actual time=146.514..202.485 rows=500000 loops=1)
        Sort Key: status
        Sort Method: external merge  Disk: 12048kB
        Buffers: shared hit=3689, temp read=1506 written=1512
        ->  Seq Scan on sim_bp_orders  (actual time=0.007..49.886 rows=500000 loops=1)
 Execution Time: 358.230 ms

358 ms. The Sort spilled 12 MB of temp files. The GroupAggregate node above it shows its own temp read=2707 written=2494 — that's the ordered-set aggregate's internal tuplestore materialising per-group sorted input for the percentile computation, not a generic "every aggregate spills" phenomenon. Ordered-set aggregates like percentile_cont, percentile_disc, and mode() all force per-group materialisation; a simple count() or avg() on the same plan wouldn't produce that second temp-I/O figure. The MyDBA rule sort_on_disk fires on any Sort with Sort Space Type = Disk, which this plan has.

The right fix depends on the workload. For a one-off analytical report, raising work_mem to ~40 MB for that session turns the external merge into an in-memory quicksort. For a dashboard that runs this every minute, you want a materialised view:

CREATE MATERIALIZED VIEW order_amount_percentiles AS
SELECT status,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;

-- REFRESH ... CONCURRENTLY needs a unique index on the view; the view has one row per status.
CREATE UNIQUE INDEX ON order_amount_percentiles (status);

-- Refresh on whatever schedule fits your freshness requirement:
REFRESH MATERIALIZED VIEW CONCURRENTLY order_amount_percentiles;

REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view. It computes the new contents, diffs them against the current rows, and applies only the changes, so readers are not locked out of the view while it refreshes. The dashboard then queries the view instead of re-running the percentile calculation, and the 358 ms query becomes a 0.5 ms scan of a five-row view.

Window functions

A window function produces an output row for every input row, but with access to a frame of related rows. The syntax:

agg_func(...) OVER (
    PARTITION BY col1, col2     -- split input into independent groups
    ORDER BY col3, col4          -- order within each partition
    ROWS BETWEEN ... AND ...     -- or RANGE BETWEEN, or GROUPS BETWEEN
)

The planner implements window functions via a WindowAgg node that consumes an input ordered appropriately and emits one output row per input. If the input isn't already ordered, the planner inserts a Sort before the WindowAgg — which is often where the cost lives.

Consider a common pattern: "the most recent order per user." Pre-PostgreSQL 15 the usual rewrite was:

SELECT user_id, order_id, created_at
FROM (
    SELECT user_id, order_id, created_at,
           row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM sim_bp_orders
) t
WHERE rn = 1
LIMIT 100;

The PostgreSQL 15+ optimisation for this is the WindowAgg Run Condition — the planner notices that WHERE rn = 1 can be pushed into the WindowAgg, so it can stop computing row numbers for each partition as soon as rn > 1:

Limit  (actual time=0.093..0.525 rows=100 loops=1)
  Buffers: shared hit=305
  ->  WindowAgg  (actual time=0.092..0.518 rows=100 loops=1)
        Run Condition: (row_number() OVER (?) <= 1)
        ->  Incremental Sort
              Sort Key: user_id, created_at DESC
              Presorted Key: user_id
              Full-sort Groups: 9  Sort Method: quicksort
              ->  Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders
                    (actual time=0.014..0.339 rows=302 loops=1)
 Execution Time: 0.543 ms

0.54 ms. Two optimisations are visible:

Run Condition: (row_number() OVER (?) <= 1) — the WindowAgg stops producing rows for a partition once rn exceeds 1, so only the first row per user is computed. This lets the plan short-circuit once LIMIT 100 is satisfied after only 302 input rows (not the full 500k).
Incremental Sort with Presorted Key: user_id — the input arrives already sorted by user_id (from idx_sim_bp_orders_user_id), and the WindowAgg needs it sorted by (user_id, created_at DESC). An Incremental Sort only sorts within each user_id group rather than globally, which costs drastically less memory and allows pipelined execution.

Even so, a LATERAL join with LIMIT 1 inside is often simpler and at least as fast for "top-N per group" with small N.
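
A sketch of that LATERAL form for "the most recent order per user", driving from sim_bp_users instead of scanning the orders table:

```sql
SELECT u.user_id, recent.order_id, recent.created_at
FROM sim_bp_users u
CROSS JOIN LATERAL (
    SELECT o.order_id, o.created_at
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) recent
LIMIT 100;
```

CROSS JOIN LATERAL drops users with no orders, which matches the window-function version; use LEFT JOIN LATERAL ... ON true to keep them with NULLs instead.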

Frame specifications

Omit the frame clause and a window function falls back to an implicit default that trips people up. The rules:

No ORDER BY clause → every row in the partition is a peer of every other, so the default frame is effectively RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the whole partition. This is what you want for sum() or avg() over an entire partition.
ORDER BY clause present → the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the running total up to this row, including any later rows that tie with it on the ORDER BY value. This is what you want for running sums, but easy to get wrong.
Ranking functions (row_number(), rank(), dense_rank()) — the frame is irrelevant because the function's result only depends on the ordering.

A common mistake: computing a "running average over the last 7 rows" and getting a running average over all preceding rows because the frame clause was omitted. The fix is explicit:

avg(value) OVER (
    ORDER BY timestamp
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

ROWS BETWEEN N PRECEDING AND CURRENT ROW is a physical window of N+1 rows. RANGE BETWEEN '7 days' PRECEDING AND CURRENT ROW is a logical window based on the ORDER BY value — useful when timestamps aren't evenly spaced. GROUPS BETWEEN N PRECEDING AND CURRENT ROW (PostgreSQL 11+) treats ties as a single "group" and counts those.
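
The difference between the default RANGE frame and an explicit ROWS frame shows up only when the ORDER BY column has ties. A self-contained sketch with two rows sharing ts = 2 (the column names are made up for the example):

```sql
SELECT ts, amount,
       -- Default frame: RANGE ... CURRENT ROW, which includes tied peers.
       sum(amount) OVER (ORDER BY ts) AS default_range_frame,
       -- Physical frame: strictly the rows up to this one.
       sum(amount) OVER (ORDER BY ts
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
FROM (VALUES (1, 10), (2, 20), (2, 30), (3, 40)) AS t(ts, amount);
```

default_range_frame comes out as 10, 60, 60, 100: both ts = 2 rows see each other. rows_frame gives the two tied rows different running totals, and which one gets the smaller value depends on the order the sort happens to emit them in.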

LAG, LEAD, and first/last value

The navigation functions — lag(x, n), lead(x, n), first_value(x), last_value(x) — let you reference rows offset from the current one. Classic use: detect state transitions.

SELECT order_id, status,
       lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status,
       created_at
FROM sim_bp_orders
WHERE user_id = 42;

Each row gets the status of the user's previous order. The window can then be wrapped in a subquery or CTE to find "orders where the status changed":

WITH ordered AS (
    SELECT order_id, status, created_at,
           lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status
    FROM sim_bp_orders
)
SELECT * FROM ordered WHERE prev_status IS DISTINCT FROM status;

Two performance notes. First, last_value() with a default frame is surprising — because the default frame ends at the current row, last_value() returns the current row's value, not the partition's last. To actually get the partition's last value, specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Second, LAG and LEAD compile to very cheap operations (just a pointer to the previous/next row in the window), while first_value/last_value with an explicit full-partition frame can force materialisation.
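
A sketch of the last_value() fix, returning each order alongside the status of that user's most recent order:

```sql
SELECT order_id, status, created_at,
       last_value(status) OVER (
           PARTITION BY user_id
           ORDER BY created_at
           -- Without this clause the frame ends at the current row (and its
           -- peers), so last_value() would just echo the current status.
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS latest_status
FROM sim_bp_orders
WHERE user_id = 42;
```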

Aggregate-related indexes

An index on the GROUP BY columns is a straightforward win when it exists: the planner can use GroupAggregate over an index scan and skip the hash build entirely. The index has to cover the group key in exactly the right order — a composite index on (status, created_at) serves GROUP BY status, but a (created_at, status) doesn't.

For queries that frequently aggregate a narrow window of a big table (WHERE created_at > ... GROUP BY user_id), a partial index or materialised view of the aggregate result is usually the right answer, because no planner optimisation makes re-aggregating millions of rows on every call cheap. Precomputation is the most robust performance tactic for aggregates.

Quick diagnostic checklist

When an aggregate query is slow:

Is the aggregate node a HashAggregate with Batches > 1 or Disk Usage > 0? The hash table spilled. Raise work_mem for the session, or create a supporting index to enable GroupAggregate instead.
Is there a Sort feeding a GroupAggregate (shown as its child in the plan) with Sort Method: external merge? The sort spilled. Same fix: more work_mem, or an index that provides pre-sorted input.
Is there a WindowAgg over a Sort that processes all input before the LIMIT? Check if a Run Condition is possible (PG 15+) or if the problem can be rewritten as LATERAL + LIMIT N.
Is the aggregate running every time the dashboard loads? Move it behind a materialised view refreshed on schedule. This is usually the biggest win of all.
Does the MyDBA analyzer flag sort_on_disk, hash_batches_spill, or temp_blocks_written? These are the three rules that specifically target aggregate-related spills; if any fire, follow the suggestion inline.

Next steps

Aggregates interact closely with the shape of your WHERE clauses — a filter that narrows the input set before aggregation is almost always cheaper than aggregating and then filtering. The next article, WHERE Clause Optimisation, covers sargability and composite-index ordering in detail, with an eye toward getting predicates to apply as early in the plan as possible.

#postgres #performance #database #sql

Full article with the complete series: https://mydba.dev/blog/postgres-aggregate-window-tuning

PostgreSQL Subquery and CTE Optimization

Philip McClarence — Wed, 29 Apr 2026 14:00:05 +0000

Every SELECT in PostgreSQL is made of smaller SELECTs, even when it doesn't look that way. WHERE col IN (SELECT ...), WHERE EXISTS (SELECT ...), (SELECT count(*) FROM ... WHERE ...) in the column list, WITH x AS (SELECT ...) — these look syntactically different but all get rewritten into plan nodes at plan time. Which plan node the planner chooses determines whether your query runs in three milliseconds or three seconds, and the rules are different for each pattern.

This is part of the Complete Guide to PostgreSQL SQL Query Analysis & Optimization. Assumes you can read EXPLAIN output and are familiar with how the planner chooses join strategies. Running dataset: 500k-row sim_bp_orders, 200k-row sim_bp_users, on Neon Postgres 17.8.

We'll cover: scalar and existence subqueries (SubPlan, EXISTS, IN), when correlated subqueries should be rewritten as joins, how CTEs are executed on modern PostgreSQL, when to use MATERIALIZED vs NOT MATERIALIZED, LATERAL joins, and recursive CTEs.

Scalar correlated subqueries — the SubPlan trap

A scalar subquery in the column list is the easiest way to accidentally write a query whose cost multiplies with the outer row count:

SELECT u.user_id,
       u.email,
       (SELECT count(*)
          FROM sim_bp_orders o
         WHERE o.user_id = u.user_id
           AND o.status = 'pending') AS pending_count
FROM sim_bp_users u
WHERE u.status = 'active'
LIMIT 100;

The query reads naturally: "for each active user, count their pending orders." The plan is what that description implies:

Limit  (cost=0.42..1642.83 rows=100 width=33) (actual time=0.088..3.438 rows=100 loops=1)
  Buffers: shared hit=565 read=3
  ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
        (cost=0.42..3118066 rows=189807 width=33)
        (actual time=0.087..3.433 rows=100 loops=1)
        Filter: ((u.status)::text = 'active'::text)
        SubPlan 1
          ->  Aggregate  (cost=16.24..16.25 rows=1 width=8)
                (actual time=0.033..0.033 rows=1 loops=100)
                ->  Bitmap Heap Scan on sim_bp_orders o
                      (actual time=0.032..0.033 rows=0 loops=100)
                      Recheck Cond: (o.user_id = u.user_id)
                      Filter: ((o.status)::text = 'pending'::text)
                      ->  Bitmap Index Scan on idx_sim_bp_orders_user_id
                            (actual time=0.029..0.029 rows=3 loops=100)
                            Index Cond: (o.user_id = u.user_id)
 Execution Time: 3.444 ms

Two signals. First, the SubPlan 1 node is inside the outer index scan, so it runs once per outer row. actual time=0.033..0.033 rows=1 loops=100 tells you the subquery was executed 100 times (once per user returned). With LIMIT 100 it's cheap; without the limit it would run once per active user, roughly 190,000 times by the planner's rows=189807 estimate, and at 0.033 ms per execution that's about six seconds of subquery time before any other work.

Second, SubPlan N in a plan is a heads-up that the query is executing per-outer-row work, which is almost always worth rewriting — either as an aggregating JOIN or a correlated aggregate pushed into a LATERAL. Both rewrites scale better as the outer set grows.
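The aggregating JOIN rewrite might look like the sketch below. It drops the LIMIT so the whole active-user set is counted, and uses LEFT JOIN so that users with no pending orders still appear with a count of zero, matching what the scalar subquery returned.

```sql
SELECT u.user_id,
       u.email,
       count(o.order_id) AS pending_count  -- counts only matched orders; 0 when none
FROM sim_bp_users u
LEFT JOIN sim_bp_orders o
       ON o.user_id = u.user_id
      AND o.status = 'pending'             -- filter in ON, not WHERE, to keep unmatched users
WHERE u.status = 'active'
GROUP BY u.user_id, u.email;
```

Putting the status filter in the ON clause matters: moved to WHERE, it would turn the LEFT JOIN back into an inner join and silently drop users with no pending orders.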

EXISTS, IN, and JOIN — three ways to express "filter by related rows"

For the "find rows that have at least one related row" pattern, SQL offers three syntactic choices. They don't all produce the same plan.

Rewriting the earlier query as an EXISTS — asking a boolean question, "find users who have at least one pending order":

SELECT u.user_id, u.email
FROM sim_bp_users u
WHERE u.status = 'active'
  AND EXISTS (
      SELECT 1 FROM sim_bp_orders o
      WHERE o.user_id = u.user_id
        AND o.status = 'pending'
  )
LIMIT 100;

Limit  (cost=0.85..137.98 rows=100 width=25) (actual time=0.089..3.238 rows=100 loops=1)
  Buffers: shared hit=726 read=1
  ->  Merge Semi Join  (actual time=0.088..3.234 rows=100 loops=1)
        Merge Cond: (u.user_id = o.user_id)
        ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
              Filter: ((u.status)::text = 'active'::text)
        ->  Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders o
              Filter: ((o.status)::text = 'pending'::text)
              Rows Removed by Filter: 586
 Execution Time: 3.240 ms

The planner picked a Merge Semi Join, which stops at the first match per outer row. That's exactly what EXISTS semantics require. Both sides come in as user_id-ordered streams (left from the users primary-key btree; right from idx_sim_bp_orders_user_id with status='pending' as a filter), and the merge walks them in lockstep. No per-outer-row SubPlan, no re-execution. The planner doesn't always pick Merge Semi Join; a Nested Loop Semi Join with an index probe is also common, especially with a tight outer LIMIT. Both shapes leave the planner free to choose how the two inputs are paired; the SubPlan pattern locks in one subquery execution per outer row.

IN (SELECT ...) is the second option. Most of the time PostgreSQL treats WHERE col IN (SELECT ...) and WHERE EXISTS (SELECT ... WHERE ... = col) identically, producing the same plan. Two gotchas:

NOT IN with nullable columns is not equivalent to NOT EXISTS. If any value in the inner set is NULL, NOT IN can never evaluate to true: it yields false for matches and NULL for everything else, so the query returns no rows. Always prefer NOT EXISTS unless you've proven the column is NOT NULL.
IN on a literal value list (WHERE id IN (1, 2, 3)) is a different beast: the planner treats it as id = ANY (ARRAY[1,2,3]), nothing to do with subqueries.
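The NOT IN gotcha is easy to reproduce without any tables. The sketch below uses inline VALUES lists with made-up data; the aliases t, s, x, and y are purely illustrative.

```sql
-- NOT IN: the NULL in the inner set makes the predicate NULL for 2 and 3
-- and false for 1, so this returns zero rows.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE x NOT IN (SELECT y FROM (VALUES (1), (NULL)) AS s(y));

-- NOT EXISTS: the NULL simply never matches, so this returns 2 and 3.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE NOT EXISTS (
    SELECT 1
    FROM (VALUES (1), (NULL)) AS s(y)
    WHERE s.y = t.x
);
```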

An explicit JOIN works too, but duplicates outer rows for each matching inner row:

SELECT DISTINCT u.user_id, u.email
FROM sim_bp_users u
JOIN sim_bp_orders o ON o.user_id = u.user_id
WHERE u.status = 'active' AND o.status = 'pending';

The DISTINCT is required because a user with five pending orders would appear five times. Usually slower than EXISTS (produces all matching rows then distincts them down), and you have to remember the DISTINCT. Use EXISTS for existence questions, JOIN for data you actually want from the related table.

Rule of thumb:

Count of related rows → aggregating subquery or aggregating JOIN with GROUP BY.
Existence → EXISTS. (Non-existence → NOT EXISTS.)
Data from related rows → regular JOIN.
First/last/top-N of related rows per outer → LATERAL (below).

LATERAL — top-N per group without window functions

A LATERAL join lets a subquery on the right side of a FROM reference columns from the left side: "for each row on the left, evaluate this subquery with those columns bound, and join the result." The SQL-standard way to express "the latest order per customer," "the most recent status message per ticket" — any top-N per outer group.

SELECT u.user_id, u.email,
       latest.order_id,
       latest.total_amount_cents
FROM sim_bp_users u
CROSS JOIN LATERAL (
    SELECT order_id, total_amount_cents
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) latest
WHERE u.status = 'active'
LIMIT 50;

"For each active user, return their most recent order."

Nested Loop  (actual time=0.016..0.452 rows=50 loops=1)
  Buffers: shared hit=314
  ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
        Filter: ((u.status)::text = 'active'::text)
  ->  Subquery Scan on latest  (actual time=0.008..0.008 rows=1 loops=54)
        ->  Sort  (actual time=0.007..0.007 rows=1 loops=54)
              Sort Key: o.created_at DESC
              Sort Method: quicksort  Memory: 25kB
              ->  Bitmap Heap Scan on sim_bp_orders o
                    (actual time=0.003..0.006 rows=3 loops=54)
                    Recheck Cond: (o.user_id = u.user_id)
 Execution Time: 0.462 ms

0.46 ms. The executor ran the lateral subquery 54 times (once per active user scanned, until the outer LIMIT 50 was satisfied; four of those users had no orders and produced no output row). Each lateral execution was a cheap bitmap index scan plus a tiny sort bounded by LIMIT 1.

The window-function equivalent — ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) with an outer WHERE rn = 1 — often produces a worse plan on PostgreSQL when only the top 1 or 2 per group are needed, because it computes row numbers for every row before filtering. LATERAL with a LIMIT inside lets the planner stop early.
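For comparison, the window-function form described above could be written roughly as follows. The subquery alias ranked and the column alias rn are illustrative.

```sql
SELECT user_id, order_id, total_amount_cents
FROM (
    SELECT o.user_id,
           o.order_id,
           o.total_amount_cents,
           -- Numbers every order within its user, newest first.
           row_number() OVER (PARTITION BY o.user_id
                              ORDER BY o.created_at DESC) AS rn
    FROM sim_bp_orders o
) ranked
WHERE rn = 1;  -- keep only the most recent order per user
```

Running both forms under EXPLAIN (ANALYZE, BUFFERS) on your own data is the reliable way to see which one reads fewer rows.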

Two practical notes:

CROSS JOIN LATERAL vs LEFT JOIN LATERAL. CROSS JOIN LATERAL drops outer rows where the subquery returns nothing. LEFT JOIN LATERAL ... ON TRUE preserves them with NULLs. Swapping them changes results silently.
Indexes matter more than for anything else. The subquery runs per outer row, so any table scan inside it multiplies. The lateral on sim_bp_orders.user_id was quick because idx_sim_bp_orders_user_id exists. Without it, each of the 54 executions would have to scan all 500,000 rows of sim_bp_orders.
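Both notes can be combined in one sketch: the LEFT JOIN LATERAL variant that keeps users without orders, plus a composite index that could let each lateral probe read the newest order directly instead of sorting. The index name is hypothetical, and whether the planner uses it to drop the Sort node should be confirmed with EXPLAIN.

```sql
-- Hypothetical index: equality column first, then the ORDER BY column.
CREATE INDEX idx_sim_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at DESC);

SELECT u.user_id, u.email,
       latest.order_id,
       latest.total_amount_cents
FROM sim_bp_users u
LEFT JOIN LATERAL (
    SELECT o.order_id, o.total_amount_cents
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) latest ON TRUE          -- users with no orders are kept, with NULLs
WHERE u.status = 'active'
LIMIT 50;
```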

CTEs — materialised by default no longer

Before PostgreSQL 12, every WITH clause was an optimisation fence: the CTE was computed in full and stored in a temporary buffer, and the planner could not push predicates from the outer query into the CTE. People used this intentionally (the "CTE trick" to force materialisation), but it also silently hurt a lot of queries.

PostgreSQL 12 reversed the default. Now a CTE that is referenced once, is not recursive, and has no side effects (no data-modifying statements or volatile functions) is inlined: the planner treats it like a subquery, and predicate pushdown works as expected. CTEs referenced multiple times, recursive CTEs, and CTEs containing INSERT/UPDATE/DELETE are still materialised.

Two keywords override the default:

WITH foo AS NOT MATERIALIZED (...) — force inlining even if referenced multiple times.
WITH foo AS MATERIALIZED (...) — force materialisation even if referenced only once.

Typical cases:

-- Inlined by default — works like a subquery, predicates push in.
WITH recent_pending AS (
    SELECT order_id, user_id, created_at
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT rp.order_id, u.email
FROM recent_pending rp
JOIN sim_bp_users u ON u.user_id = rp.user_id
WHERE rp.created_at > now() - interval '7 days';

The created_at > now() - interval '7 days' filter is pushed into the CTE, so the combined filter (status = 'pending' AND created_at > ...) can use a single index scan rather than materialising all pending orders first.
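To see the difference for yourself, the same query can be planned with materialisation forced. This is a diagnostic sketch, not a recommendation: in the materialised plan you should expect a CTE Scan on recent_pending, with the created_at filter applied only after all pending orders have been collected.

```sql
EXPLAIN
WITH recent_pending AS MATERIALIZED (   -- forced fence, for comparison only
    SELECT order_id, user_id, created_at
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT rp.order_id, u.email
FROM recent_pending rp
JOIN sim_bp_users u ON u.user_id = rp.user_id
WHERE rp.created_at > now() - interval '7 days';
```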

-- Expensive aggregation referenced twice — worth materialising once.
WITH user_totals AS MATERIALIZED (
    SELECT user_id, sum(total_amount_cents) AS total
    FROM sim_bp_orders
    GROUP BY user_id
)
SELECT u.email, ut.total
FROM sim_bp_users u
JOIN user_totals ut ON ut.user_id = u.user_id
WHERE ut.total > 1000000

UNION ALL

SELECT u.email, 0
FROM sim_bp_users u
WHERE NOT EXISTS (SELECT 1 FROM user_totals WHERE user_id = u.user_id);

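Because user_totals is referenced twice, PostgreSQL would materialise it by default even without the keyword; writing MATERIALIZED makes the intent explicit and preserves the behaviour if a later edit leaves only one reference. Had it been declared NOT MATERIALIZED, the aggregation would run twice (once per reference). Materialised, it runs once and both references read the stored result through a CTE Scan.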

Recursive CTEs

Recursive CTEs are for hierarchical data: trees, graphs, transitive closures, category parents, reporting chains.

WITH RECURSIVE employee_tree AS (
    -- Base case: root of the tree
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step: children of previously-found rows
    SELECT e.employee_id, e.manager_id, e.name, et.depth + 1
    FROM employees e
    JOIN employee_tree et ON et.employee_id = e.manager_id
)
SELECT * FROM employee_tree;

PostgreSQL computes the base case, then repeatedly applies the recursive step to previously-produced rows until no new rows are generated. Two practical concerns:

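Guard against non-termination. PostgreSQL rejects a self-reference in the base term outright, so the practical risk is a recursive step that keeps producing rows, typically because the data contains a cycle. With UNION ALL such a query loops forever. Use depth < N as a guard when testing; on PostgreSQL 14+ the CYCLE clause detects cycles explicitly.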
Index the join column. The recursive step joins the CTE's accumulated rows against the source table — without an index on employees.manager_id, each iteration is a sequential scan.
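Both concerns applied to the employee_tree example might look like this. The depth limit of 20 and the index name are assumptions for illustration; choose a limit that comfortably exceeds the deepest hierarchy you expect.

```sql
-- Hypothetical index name; supports the recursive step's join.
CREATE INDEX idx_employees_manager_id ON employees (manager_id);

WITH RECURSIVE employee_tree AS (
    -- Base case: root of the tree
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step, with an assumed depth guard of 20 levels
    SELECT e.employee_id, e.manager_id, e.name, et.depth + 1
    FROM employees e
    JOIN employee_tree et ON et.employee_id = e.manager_id
    WHERE et.depth < 20
)
SELECT * FROM employee_tree;
```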

For transitive-closure queries (shortest paths, graph traversals), recursive CTEs work but scale poorly beyond a few tens of thousands of rows. For heavier graph workloads, look at dedicated extensions or materialised adjacency tables.

Subqueries in the FROM clause

SELECT ... FROM (SELECT ...) AS sub is semantically just a derived table. The planner pulls a simple subquery up into the outer query and pushes predicates in, the same treatment single-reference CTEs have received since PG 12.

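One case where FROM subqueries matter: forcing a computation to happen once rather than twice per row. If you have SELECT ..., f(x) AS computed_val FROM t WHERE f(x) > 10, PostgreSQL evaluates f(x) twice for every row that passes the filter (once for the filter, once for the projection); it does not eliminate common subexpressions, whatever the function's volatility marking. Wrapping the expensive call in a FROM subquery gives one call per row only if the planner is prevented from flattening the subquery, for example with OFFSET 0 or a MATERIALIZED CTE.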
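A sketch of the fenced form, using the paragraph's placeholder names: t, x, and f stand in for your own table, column, and expensive function, and id is an assumed key column. OFFSET 0 is a long-standing idiom for stopping the planner from flattening a subquery; confirm the effect with EXPLAIN on your version.

```sql
SELECT sub.id, sub.computed_val
FROM (
    SELECT t.id,
           f(t.x) AS computed_val   -- evaluated once per row here
    FROM t
    OFFSET 0                        -- keeps the subquery from being flattened
) AS sub
WHERE sub.computed_val > 10;        -- filters on the already-computed value
```

The trade-off is that the fence also blocks predicate pushdown, so f runs for every row of t, including rows the filter later discards.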

Practical rules

SubPlan in the plan output → consider rewriting as a JOIN or LATERAL.
EXISTS / IN / JOIN+DISTINCT → default to EXISTS for boolean questions; it's usually clearest and gets the best plan on PostgreSQL.
NOT IN on a nullable column → almost always a bug. Use NOT EXISTS.
CTE used once → inlined by default in PG 12+. Don't wrap something in a CTE hoping to force materialisation; use MATERIALIZED explicitly.
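CTE used multiple times with expensive aggregation → materialise it. That is already the default for multiply-referenced CTEs; write MATERIALIZED to make the intent explicit.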
Top-N per group → LATERAL with LIMIT inside. Cleaner plan than window functions for small N.
Recursive traversals → WITH RECURSIVE, but index the join column and put a depth guard on anything you're not sure terminates.

Next in the series: WHERE Clause Optimisation — sargability, composite-index column ordering, and the operators that silently disable indexes.

#postgres #performance #database #sql

Canonical version: https://mydba.dev/blog/postgres-subquery-cte-optimization

PostgreSQL Join Optimization: Nested Loop, Hash, and Merge

Philip McClarence — Tue, 28 Apr 2026 14:00:09 +0000

PostgreSQL has three join algorithms. The planner picks between them for every join in every query, driven by several things at once: the estimated sizes of the two inputs, whether they arrive already sorted on the join key, the type of join (inner vs left/semi/anti), which operators are mergejoinable or hashjoinable, whether a hash table will fit in work_mem, and the cost parameters that weigh I/O against CPU. Get the decision right and a three-way join across millions of rows runs in tens of milliseconds. Get it wrong — usually by encouraging a Nested Loop on two large unsorted inputs — and the same query takes minutes.

This article is the third in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. We assume the reader can read EXPLAIN output and is familiar with the indexing vocabulary. The running dataset is the same Neon Postgres 17.8 database used throughout the series: 500,000-row sim_bp_orders, 1,000,000-row sim_bp_order_items, 200,000-row sim_bp_users.

We'll cover how each of the three join strategies works, when the planner picks each, what indexes each one wants, and how to read multi-way joins.

Nested Loop — small outer, indexed inner

Nested Loop is the simplest strategy: for each row on the outer side, scan the inner side for matches. Without any index on the inner side, this is a full scan per outer row — O(outer × inner) — and catastrophic for two large tables. With an index on the inner side's join key, each "scan" of the inner is a handful of page reads (a btree descent plus a heap fetch for any columns not in the index), so the total cost is outer-rows × random-I/O-per-probe rather than a polynomial blowup. When the outer side is small and the inner has an index, Nested Loop is nearly unbeatable.

Here's a three-way join that the planner executes as a tower of Nested Loops. The query is "twenty recent pending orders with the user's email and the items in each order":

SELECT u.email, o.order_id, oi.quantity, oi.unit_price_cents
FROM sim_bp_users u
JOIN sim_bp_orders o ON o.user_id = u.user_id
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending' AND u.status = 'active'
ORDER BY o.created_at DESC
LIMIT 20;

Limit  (cost=1.28..18.06 rows=20 width=41) (actual time=5.098..43.038 rows=20 loops=1)
  Buffers: shared hit=96 read=45
  ->  Nested Loop  (cost=1.28..159741.54 rows=190376 width=41)
        (actual time=5.097..43.027 rows=20 loops=1)
        ->  Nested Loop  (cost=0.85..78058.04 rows=95188 width=33)
              (actual time=2.949..10.878 rows=9 loops=1)
              Inner Unique: true
              ->  Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
                    (cost=0.42..30949.29 rows=100300 width=16)
                    (actual time=1.566..2.610 rows=9 loops=1)
                    Filter: ((o.status)::text = 'pending'::text)
                    Rows Removed by Filter: 44
              ->  Memoize  (cost=0.43..0.55 rows=1 width=25)
                    (actual time=0.916..0.916 rows=1 loops=9)
                    Cache Key: o.user_id
                    Cache Mode: logical
                    Hits: 0  Misses: 9  Evictions: 0  Overflows: 0  Memory Usage: 2kB
                    ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
                          (cost=0.42..0.54 rows=1 width=25)
                          (actual time=0.846..0.846 rows=1 loops=9)
                          Index Cond: (u.user_id = o.user_id)
                          Filter: ((u.status)::text = 'active'::text)
        ->  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
              (cost=0.42..0.83 rows=3 width=12)
              (actual time=2.418..3.567 rows=2 loops=9)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 43.129 ms

43 ms for a three-way join across 200k × 500k × 1M rows is good. The plan is a tower of two Nested Loops: the inner one joins orders and users, the outer one joins that intermediate result with order items. Read it in execution order, starting from the innermost scan:

The Index Scan Backward on sim_bp_orders.created_at walks the index in reverse — newest first — looking for pending orders. rows=9 loops=1 means the outer driver produced nine orders before the whole pipeline had enough downstream rows to satisfy LIMIT 20. Forty-four rows were read and filtered as non-pending along the way.
For each of those nine orders, a Memoize → Index Scan on sim_bp_users_pkey looks up the user. Memoize is a PostgreSQL 14+ cache that short-circuits the inner scan when the same key appears repeatedly; here the nine orders happen to be from nine different users, so it's effectively nine primary-key lookups with no cache hits.
For each matching (order, user) pair, the Index Scan using idx_sim_bp_order_items_order_id (the inner side of the top Nested Loop) returns an average of two to three line items per order (rows=2 loops=9; EXPLAIN rounds the per-loop average to an integer). The LIMIT 20 applies to the final joined row count, so the executor stops as soon as 20 (order, user, item) tuples have been produced, which is the point where 9 orders × ~2.2 items each ≈ 20 rows.

This is the Nested Loop success case: the outer driver returns a tiny number of rows thanks to the LIMIT + ordered index, and every inner lookup is an indexed point query. Without the LIMIT, the planner would likely pick a very different strategy — possibly a Hash Join cascade — because it would have to produce tens of thousands of rows instead of twenty.

The Nested Loop failure mode

The same strategy is a disaster when the outer side is large. Consider "count the items across all pending orders," which must process 100,000 pending orders:

SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

If we force the planner to use a Nested Loop (by disabling hash and merge joins), the result is telling:

Aggregate (actual time=1621.494..1621.495 rows=1 loops=1)
  Buffers: shared hit=398994 read=2894
  ->  Nested Loop  (actual time=6.422..1606.338 rows=200535 loops=1)
        ->  Index Only Scan on sim_bp_orders o
              (actual time=4.859..123.354 rows=100252 loops=1)
        ->  Index Only Scan on sim_bp_order_items oi
              (actual time=0.013..0.014 rows=2 loops=100252)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 1621.525 ms

1.6 seconds for the same result the planner produces in 1.2 seconds via a Parallel Hash Join (next section). More interestingly, the Buffers line shows 398,994 pages hit — that's from 100,252 inner-index probes, each one re-traversing the btree descent of idx_sim_bp_order_items_order_id. Many of those probes hit the same upper index pages over and over (that's why it's mostly hit, not read), but it's still enormous repeated page traffic that dominates CPU even when the data is fully cached. Under concurrency, other queries would find their own working set evicted from shared_buffers to make room.

The MyDBA analyzer rule nested_loop_large is specifically for this failure mode: it fires when a Nested Loop has Plan Rows > 1000 on the outer side and Plan Rows > 100 on the inner side. At those sizes the Nested Loop is almost always the wrong strategy.
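The forced Nested Loop plan above can be reproduced with the planner's enable_* switches. These are diagnostic settings; the sketch scopes them to a single transaction with SET LOCAL so nothing leaks into the rest of the session.

```sql
BEGIN;

-- Discourage the other two join strategies for this transaction only.
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

ROLLBACK;
```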

Hash Join — larger sides, unsorted input

Hash Join works in two phases:

Build phase. Read the smaller side in full, building an in-memory hash table keyed by the join column(s). This happens inside the Hash node you see in the plan.
Probe phase. Stream the larger side through the hash table, emitting matched rows as they come.

Hash Join doesn't care whether the inputs are sorted, which makes it the fallback when Merge Join isn't available. It wants the build side to fit in work_mem (multiplied by hash_mem_multiplier on PostgreSQL 13+); if it doesn't, the join spills: PostgreSQL partitions both sides by a hash of the join key and processes one pair of partitions at a time. Spilling is visible in the plan as Batches > 1 on the Hash or Hash Join node, and the MyDBA analyzer rule hash_batches_spill fires on it.

Here's the same count query the planner actually chose — a Parallel Hash Join:

Finalize Aggregate  (actual time=1196.234..1199.894 rows=1 loops=1)
  Buffers: shared hit=3827 read=6356
  ->  Gather (Workers Planned: 2, Workers Launched: 2)
        ->  Partial Aggregate  (actual time=1179.014..1179.016 rows=1 loops=3)
              ->  Parallel Hash Join
                    (actual time=170.554..1143.676 rows=333333 loops=3)
                    Hash Cond: (oi.order_id = o.order_id)
                    ->  Parallel Seq Scan on sim_bp_order_items oi
                          (actual time=1.589..703.241 rows=333333 loops=3)
                    ->  Parallel Hash
                          Buckets: 524288  Batches: 1  Memory Usage: 23712kB
                          ->  Parallel Seq Scan on sim_bp_orders o
                                (actual time=0.009..38.403 rows=166667 loops=3)
                                Filter: ((o.status)::text = 'pending'::text)
 Execution Time: 1199.945 ms

1.2 seconds, 10,183 buffer pages touched — about 40× fewer than the forced Nested Loop. The planner built the hash table from sim_bp_orders (the smaller filtered side, 100k pending rows) and probed it with sim_bp_order_items. Batches: 1 means the hash table fit in work_mem entirely, so there was no spill.

Note the Parallel Seq Scan on both sides. That is not a planner mistake — when you're going to read every pending row anyway, a sequential scan is cheaper than an indexed scan because it avoids random I/O and plays nicely with read-ahead. Hash Join is perfectly happy to consume an unsorted stream.

The Parallel Hash Join is a newer variant (PostgreSQL 11+) where workers collaborate to build one shared hash table and then probe it in parallel. Under the hood, Parallel Hash coordinates the build; each worker contributes to it and then proceeds to scan its share of the probe side. This is why you see Workers Planned: 2, Workers Launched: 2 at the top and three loops in each node (one leader + two workers).

When Hash Join is suboptimal

Three cases:

Build side too large. If the smaller input is still several times larger than work_mem, the hash join spills and performance degrades sharply. The fix is either to raise work_mem (per-session, not cluster-wide), or to add an index that makes a different join strategy viable. hash_batches_spill flags this in the analyzer output.

One side is tiny. If one input is five rows and the other is fifty million, Nested Loop into an indexed inner is cheaper than building any hash table. PostgreSQL's cost model handles this case correctly most of the time.

Both inputs already sorted. If both sides come out of index scans that produce rows in join-key order, Merge Join is usually cheaper because it skips the hash build. The planner usually figures this out on its own when it sees the access paths.
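For the first case, the memory settings can be inspected and then raised for one role rather than the whole cluster. The role name reporting_user and the 128MB figure are hypothetical; hash_mem_multiplier exists on PostgreSQL 13 and later and scales the memory a hash table may use relative to work_mem.

```sql
-- Current limits for this session.
SHOW work_mem;
SHOW hash_mem_multiplier;   -- PostgreSQL 13+

-- Hypothetical role and value: applies to that role's future sessions only.
ALTER ROLE reporting_user SET work_mem = '128MB';
```

Scoping the change to a reporting role keeps the higher limit away from high-concurrency OLTP connections, where every sort and hash node could claim that much memory.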

Merge Join — both sides sorted

Merge Join walks two pre-sorted inputs in lockstep, pairing rows with matching keys in a single pass. It's optimal when both inputs are already sorted on the join key, typically because both are served from index scans on the join column, or because the query itself requires an ORDER BY that aligns with the join key.

The planner picks Merge Join less often than you might expect, because:

If one side is small and the other has an index on the join key, Nested Loop is usually cheaper.
If neither side is sorted and both are large, Hash Join wins — sorting both sides just to merge them is rarely cost-effective.
Merge Join's sweet spot is two large pre-sorted streams, which is often a signal that a materialised view or a pre-joined table would be cheaper still.

A canonical Merge Join shape:

SELECT o.order_id, oi.quantity
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
ORDER BY o.order_id;

If both tables have indexes on order_id (they do — the primary key on orders and idx_sim_bp_order_items_order_id) and the ORDER BY forces ordered output, the planner may produce something like:

Merge Join
  Merge Cond: (o.order_id = oi.order_id)
  ->  Index Scan using sim_bp_orders_pkey on sim_bp_orders o
  ->  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi

Single pass through both indexes, no hash build, no sort step. When the prerequisite is met (both sides produced in join-key order), Merge Join is often the cheapest option.

In practice you'll see Merge Join most often on joins with explicit ordering, or in the middle of larger plans where the planner noticed that an upstream node was already producing sorted output.

How the planner chooses

PostgreSQL's planner is cost-based. For each join, it enumerates the plausible strategies (Nested Loop, Hash Join, Merge Join, and each direction for each — which side is inner, which is outer) and picks the lowest-cost option. The cost model incorporates:

Estimated row counts from both sides (crucially — if these are wrong, everything downstream is wrong).
Whether each side has a useful index on the join column.
Current work_mem — the planner knows whether a hash table will fit or whether it'll have to plan a spill.
Whether inputs are already sorted (from index scans or prior sort nodes).
The cost parameters: random_page_cost, seq_page_cost, cpu_tuple_cost, etc.

The single biggest cause of wrong-strategy joins is bad row estimates. If the planner thinks a side will produce 15 rows and it actually produces 150,000, it might pick a Nested Loop (optimal for 15) when a Hash Join (optimal for 150,000) would be 100× faster. The MyDBA analyzer rule row_estimate_inaccurate fires when the actual-to-estimated ratio exceeds 10× in either direction, and the fix is almost always ANALYZE on the affected table, or extended statistics if the bad estimate comes from a correlation the planner doesn't know about.
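Before reaching for anything more elaborate, it is worth checking how fresh the statistics are. The sketch below reads the standard pg_stat_user_tables view for the series' three tables and then refreshes one of them by hand.

```sql
-- How many rows changed since the last ANALYZE, and when it last ran.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('sim_bp_orders', 'sim_bp_order_items', 'sim_bp_users');

-- Refresh statistics for a table whose estimates look off.
ANALYZE sim_bp_orders;
```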

The second biggest cause is missing statistics on correlated predicates. The planner assumes predicates are independent: if WHERE tenant_id = 7 AND region = 'eu' matches far more rows than P(tenant_id=7) × P(region='eu') predicts (say, because every row for tenant 7 is in 'eu'), the planner will underestimate and pick the wrong join strategy. Extended statistics (CREATE STATISTICS ... ON tenant_id, region FROM ...) are the specific fix.
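Spelled out in full, the statement could look like this. The table name tenant_events and the statistics name are hypothetical, since the article elides them; the mcv kind needs PostgreSQL 12 or later.

```sql
-- Hypothetical table and statistics names.
CREATE STATISTICS stx_tenant_events_tenant_region (dependencies, mcv)
    ON tenant_id, region
    FROM tenant_events;

-- Extended statistics are only populated by the next ANALYZE.
ANALYZE tenant_events;
```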

Join order: how PostgreSQL decides what to join first

In a three-way join A ⨝ B ⨝ C, there are several possible orders: (A ⨝ B) ⨝ C, A ⨝ (B ⨝ C), and if the join conditions allow it, (A ⨝ C) ⨝ B. For a fourth table you get a lot more permutations. PostgreSQL's planner searches through them.

The heuristic is: do the most selective joins first, so the intermediate result is as small as possible. A join that filters rows_A × rows_B down to 100 rows should happen before a join that would blow the intermediate to millions.

For queries with fewer than 12 tables, PostgreSQL uses dynamic programming to enumerate orders exhaustively, within the limits set by join_collapse_limit and from_collapse_limit (both default to 8). For 12+ tables, the planner switches to the Genetic Query Optimizer (GEQO), which uses heuristic search and sometimes produces non-optimal plans on complex joins. If you have a very wide query (12+ tables, complex conditions), tune geqo_threshold, join_collapse_limit, and from_collapse_limit, or consider splitting the problem with explicit MATERIALIZED CTEs.
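The thresholds involved can be read from pg_settings before changing anything; managed services sometimes ship non-default values.

```sql
SELECT name, setting
FROM pg_settings
WHERE name IN ('geqo',
               'geqo_threshold',
               'join_collapse_limit',
               'from_collapse_limit');
```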

A few practical levers when the planner picks a wrong join order:

Add or fix indexes. A missing index on a join column often drives the planner to avoid that join until later, resulting in large intermediates. Indexing fixes it.
ANALYZE recently. Stale row counts → bad estimates → bad orders. Autovacuum handles this for active tables, but statistics are often out of date right after a bulk load.
Extended statistics. For correlated join keys, CREATE STATISTICS on the correlation.
Rewriting to constrain the planner. STRAIGHT_JOIN doesn't exist in PostgreSQL, but you can force the order by using explicit JOIN syntax and setting join_collapse_limit = 1. Use sparingly — the cost model is usually right.
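The last lever, sketched on the series' tables: with join_collapse_limit set to 1 the planner joins in the order the JOIN clauses are written, so orders join users first, then order items. SET LOCAL keeps the override inside one transaction; the query shape is an illustration, not a recommended order.

```sql
BEGIN;

-- Planner follows the written JOIN order for this transaction only.
SET LOCAL join_collapse_limit = 1;

EXPLAIN
SELECT u.email, o.order_id, oi.quantity
FROM sim_bp_orders o
JOIN sim_bp_users u        ON u.user_id = o.user_id
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

ROLLBACK;
```

Comparing this plan's estimated cost with the unconstrained one shows how much the planner's own ordering was buying you.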

When join strategy doesn't matter — and what does

Sometimes the join strategy is correct and the query is still slow. The real costs are upstream:

A slow sub-query or CTE feeding the join. The join isn't the problem; its input is. Diagnose by looking at the actual timing of each side.
An expensive filter that prevents index use. If one side of the join is doing a sequential scan because of a non-sargable WHERE clause, the join strategy can't save you. See WHERE Clause Optimisation.
Over-wide projections. SELECT * on a 400-column table passed through a join is expensive in row width; projecting only the columns you need tightens the whole pipeline.

When reading a multi-way join plan, resist the urge to focus on the outermost join. Instead, scan the leaves of the plan tree for the biggest actual rows × loops node — that's where the time is actually going.

Quick reference

Outer size	Inner size	Inner indexed?	Inputs sorted?	Strategy
Small (≤1K)	Any	Yes	—	Nested Loop
Medium	Large	Yes	—	Nested Loop or Hash
Large	Large	—	Both	Merge Join
Large	Large	—	No	Hash Join (may spill)
Large	Large, build side > work_mem	—	No	Hash Join with spill; raise work_mem or add an index

Plan shapes that should always prompt investigation:

Nested Loop with outer rows > 1,000 and no Memoize cache → fires nested_loop_large.
Hash or Hash Join with Batches > 1 → fires hash_batches_spill; either raise work_mem or add an index that enables a different join strategy.
Any join where row_estimate_inaccurate fires on either side — fix statistics first, then re-examine the join.

Next steps

Joins are the category most affected by the quality of your WHERE clauses. The next article in the series covers WHERE Clause Optimisation — sargability, composite-index column ordering, and the operators that silently disable indexes. If your joins look right but the inputs to them are slow, that's almost always where the fix lives.

For the subquery/CTE patterns that sometimes appear in place of explicit joins (EXISTS, correlated subqueries, LATERAL), see Subquery & CTE Optimisation.

#postgres #performance #database #sql

Originally published at mydba.dev/blog/postgres-join-optimization.

PostgreSQL Index Usage and Optimization

Philip McClarence — Mon, 27 Apr 2026 14:00:03 +0000

Indexing is the single biggest lever in SQL performance, and it is also the category where most of the bad advice lives. "Add an index" solves a narrow class of problems. "Add the right index, in the right shape, for the right query, and drop the ones you don't need" is the actual job — and it's more design work than most teams expect.

This is article 2 in a series on PostgreSQL query analysis. The pillar is The Complete Guide to PostgreSQL SQL Query Analysis & Optimization; article 1 covers reading EXPLAIN output. The running dataset is 500k-row sim_bp_orders / 200k-row sim_bp_users / 50k-row sim_bp_products on Neon Postgres 17.8; every EXPLAIN block is from a real run.

We'll cover: when the planner actually uses an index, the four design choices that matter most (column selection, partial, covering, expression), the less-common index types and when they beat btrees, how to find unused indexes, and four cases where not adding an index is the correct call.

When the planner picks an index

An index is a data structure; "using an index" is a planner decision. PostgreSQL estimates the cost of each candidate plan — sequential scan, index scan, index-only scan, bitmap scan — and picks the cheapest. Three things drive that choice:

Selectivity. The estimated fraction of rows the query will return. If the filter returns 0.1% of rows, an index scan is almost always cheaper. If the filter returns 30%, it depends on the rest of the query shape. If the filter returns 70%, the planner will almost always choose a sequential scan because visiting most of the heap sequentially costs less than reading index pages plus random heap I/O.

Correlation. If the rows matching the filter are physically clustered on disk, the planner's random-access penalty shrinks and an index scan becomes more attractive. If they're scattered, random I/O dominates and seq scan wins. The pg_stats.correlation column (range -1 to 1) tells you how clustered each column's values are. Time-series tables (created_at) often have near-1 correlation because they're append-mostly; status columns usually hover near 0.

Cost parameters. random_page_cost (default 4.0) vs seq_page_cost (default 1.0). On SSD-backed storage those defaults are too conservative; lowering random_page_cost to 1.5 or 2.0 makes the planner reach for indexes more readily. Setting it below seq_page_cost is almost always wrong — it implies random I/O is faster than sequential, which isn't true on any real storage. If you're tempted to go there, you probably want to raise effective_cache_size instead.
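To see what the planner has to work with for a given column, and to try a cost-parameter change without touching cluster-wide settings, a minimal sketch against the running dataset could look like the following. It assumes the tables live in the public schema and have been analyzed recently; the 1.5 value is just one point from the range mentioned above, not a recommendation.

```sql
-- Per-column statistics that feed the selectivity and correlation estimates.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'sim_bp_orders'
  AND attname IN ('created_at', 'status');

-- Try a lower random_page_cost for one transaction only, then compare
-- the plan with what the default setting produces.
BEGIN;
SET LOCAL random_page_cost = 1.5;
EXPLAIN
SELECT * FROM sim_bp_orders
WHERE created_at > now() - interval '1 day';
ROLLBACK;
```

SET LOCAL reverts at the end of the transaction, so the experiment cannot leak into other sessions.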

If a plan has a Seq Scan, no index-type nodes, and more than two nodes total, you probably have a missing or ignored index. It's a signal, not a verdict — some queries genuinely don't want an index — but it's worth checking.

The boring case — primary key lookup

The cheapest index in any database is the primary-key btree:

SELECT * FROM sim_bp_users WHERE user_id = 12345;

Index Scan using sim_bp_users_pkey on sim_bp_users
  (cost=0.42..8.44 rows=1 width=51) (actual time=8.683..8.686 rows=1 loops=1)
  Index Cond: (sim_bp_users.user_id = 12345)
  Buffers: shared read=4
 Execution Time: 9.700 ms

Four shared-buffer reads for a 200,000-row table. The 9.7 ms execution time is dominated by cold-cache reads against Neon's networked storage; on a warm-cache benchmark this drops to sub-millisecond. This is the shape every OLTP single-row lookup should have.

The four design choices that matter

1. Column selection — matching the query shape

A composite index on (user_id, created_at) helps:

WHERE user_id = ? (uses the leading column alone).
WHERE user_id = ? AND created_at > ? (uses both).
WHERE user_id = ? ORDER BY created_at DESC LIMIT n (uses leading equality + sorted trailing column).

It does not help WHERE created_at > ? in isolation. This is the leftmost-prefix rule: a btree composite index can answer queries that use a contiguous prefix of its columns, starting with the leading one. PostgreSQL 17 btrees have no skip scan; PostgreSQL 18 adds one, but it only pays off when the leading column has few distinct values.

Rule of thumb: leading columns should be equality predicates, trailing columns range predicates or sort keys. (tenant_id, created_at), not (created_at, tenant_id).
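As a rough illustration of the leftmost-prefix rule on the running dataset, the sketch below creates a composite index and contrasts a query that can use it with one that cannot. The index name is hypothetical, and the user_id value is arbitrary.

```sql
-- Hypothetical index: equality column first, range/sort column second.
CREATE INDEX idx_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at);

-- Can use the index: equality on the leading column, ordered by the
-- trailing column, so no separate sort step is needed.
EXPLAIN
SELECT order_id, created_at
FROM sim_bp_orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;

-- No predicate on the leading column: this index is not a good fit,
-- and the planner has to look for another access path.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE created_at > now() - interval '1 day';
```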

2. Partial indexes — when 80% of the table is irrelevant

CREATE INDEX idx_bp_orders_pending_recent
    ON sim_bp_orders (created_at)
    WHERE status = 'pending';

The index only contains rows where status = 'pending', so it's roughly one-fifth the size of a full index on created_at. The planner will use it for any query whose WHERE clause implies status = 'pending'; it establishes this by trying to prove that the query's predicates logically imply the index predicate. So WHERE status = 'pending' AND created_at > now() - interval '1 day' works, but WHERE status IN ('pending', 'shipped') AND ... doesn't (the IN predicate doesn't imply the partial predicate).

Two gotchas: partial indexes are fragile to query rewording (a function, a cast, or a reworded predicate can break the implication proof), and they pay write cost whenever a row moves into or out of the partial predicate.
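One way to check whether a given query shape still implies the index predicate is to run plain EXPLAIN on both variants and look for idx_bp_orders_pending_recent in the output. A sketch, assuming the partial index above exists:

```sql
-- Implies status = 'pending': the partial index is a candidate.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE status = 'pending'
  AND created_at > now() - interval '1 day';

-- Does not imply the index predicate: the partial index cannot be used,
-- so expect a different access path here.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE status IN ('pending', 'shipped')
  AND created_at > now() - interval '1 day';
```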

3. Covering indexes — eliminating heap fetches

INCLUDE tucks non-key columns into the leaf pages:

CREATE INDEX idx_bp_orders_pending_by_amount
    ON sim_bp_orders (total_amount_cents DESC)
    INCLUDE (order_id, user_id, created_at)
    WHERE status = 'pending';

A query that SELECTs any combination of order_id, user_id, total_amount_cents, created_at from this index can be served entirely from index pages — provided the visibility map marks the relevant heap pages as all-visible. On a write-heavy table where autovacuum can't keep up, you'll see non-zero Heap Fetches: in EXPLAIN, which defeats most of the benefit.

INCLUDE columns cannot be used for index conditions. Rule: put columns used for filtering/joining/ordering in the key; put columns you're only retrieving in INCLUDE.
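Since index-only scans depend on the visibility map, it can help to check how much of the table is marked all-visible before blaming the index design. The query below is a sketch using the estimates stored in pg_class; both counters are approximations refreshed by VACUUM and ANALYZE, so treat the percentage as indicative.

```sql
-- Approximate share of heap pages marked all-visible.
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'sim_bp_orders';

-- If coverage is low, a manual vacuum refreshes the visibility map.
VACUUM (ANALYZE) sim_bp_orders;
```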

4. Expression indexes — indexing computed values

This is where most "why isn't my index being used?" problems live. A btree on email can't serve WHERE lower(email) = ? or WHERE lower(email) LIKE 'prefix%'. Case-insensitive prefix search on a 200k-row table without an expression index:

Gather  (cost=1000.00..5841.09 rows=1000 width=25) (actual time=0.553..122.758 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on sim_bp_users
        Filter: (lower((email)::text) ~~ 'user12%'::text)
        Rows Removed by Filter: 94444
 Execution Time: 122.833 ms

Parallel seq scan, 94k rows filtered per worker, 122 ms. The fix:

CREATE INDEX idx_bp_users_email_lower
    ON sim_bp_users (lower(email) text_pattern_ops);

For equality on lowercased email, a plain CREATE INDEX ... (lower(email)) is enough. For prefix LIKE, text_pattern_ops is needed because PostgreSQL can only rewrite LIKE 'prefix%' into an index range scan when the index orders text by byte value rather than by locale collation.

For comparison, the same prefix search written against the raw column (email LIKE 'user12%') can use the existing idx_sim_bp_users_email_pattern index on email text_pattern_ops:

Index Only Scan using idx_sim_bp_users_email_pattern on sim_bp_users
  (cost=0.42..29.87 rows=20 width=8) (actual time=0.057..24.729 rows=20 loops=1)
  Index Cond: ((email ~>=~ 'user12'::text) AND (email ~<~ 'user13'::text))
  Filter: ((email)::text ~~ 'user12%'::text)
  Heap Fetches: 0
 Execution Time: 24.757 ms

The Index Cond uses ~>=~ and ~<~ — real PostgreSQL operators from text_pattern_ops that do byte-order comparisons. 24.7 ms vs 122.8 ms — five times faster, and the gap widens on larger tables.
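An expression index is only considered when the query repeats the indexed expression. The sketch below assumes idx_bp_users_email_lower from above has been created; the second query is a deliberately mismatched variant for contrast.

```sql
-- Collect statistics for the indexed expression after creating the index.
ANALYZE sim_bp_users;

-- Matches the indexed expression lower(email) with a left-anchored
-- pattern, so the expression index is a candidate.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, email
FROM sim_bp_users
WHERE lower(email) LIKE 'user12%';

-- Different expression: idx_bp_users_email_lower is not a candidate here.
EXPLAIN
SELECT user_id, email
FROM sim_bp_users
WHERE upper(email) LIKE 'USER12%';
```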

Index types beyond btree

GIN — when equality becomes containment

For values with internal structure (arrays, JSONB, full-text search vectors, trigrams):

CREATE INDEX idx_events_data_gin
    ON events USING gin (event_data jsonb_path_ops);

-- Now this is sargable:
SELECT * FROM events WHERE event_data @> '{"type": "purchase"}';

jsonb_path_ops supports only containment (@>) and the jsonpath operators (@? and @@), not the key-existence operators (?, ?| and ?&), but it produces a significantly smaller and usually faster index than the default jsonb_ops. Use it unless you need key-existence queries.

GIN with pg_trgm turns substring LIKE queries (LIKE '%needle%') into index-backed scans.
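A trigram index could be set up roughly as follows. The index name is hypothetical, and username is assumed to be a text column on sim_bp_users; very short search patterns produce few trigrams and benefit much less.

```sql
-- pg_trgm is a contrib extension; it must be installed once per database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; gin_trgm_ops is the operator class from pg_trgm.
CREATE INDEX idx_bp_users_username_trgm
    ON sim_bp_users USING gin (username gin_trgm_ops);

-- An unanchored pattern can now be answered with a bitmap scan on the index.
EXPLAIN
SELECT user_id, username
FROM sim_bp_users
WHERE username LIKE '%needle%';
```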

BRIN — when the data is physically ordered

CREATE INDEX idx_bp_orders_created_at_brin
    ON sim_bp_orders USING brin (created_at);

For our 500,000-row orders table, a BRIN index is ~24 kB; a btree on the same column is ~5 MB. BRIN loses effectiveness immediately if the data isn't correlated — on a shuffled table, the min/max of every page range overlaps the whole value domain and the planner can't skip anything. BRIN is effectively useless on uncorrelated columns and brilliant on time-series data.
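The size difference is easy to confirm on your own data. This sketch assumes both the BRIN index above and the btree idx_sim_bp_orders_created_at exist on the table:

```sql
-- Compare on-disk size of the BRIN index and the btree on the same column.
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
WHERE c.relname IN ('idx_bp_orders_created_at_brin',
                    'idx_sim_bp_orders_created_at')
ORDER BY pg_relation_size(c.oid) DESC;
```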

GiST / SP-GiST / hash

Geometric types, ranges, and fuzzy matching use GiST or SP-GiST. Hash indexes only support equality and are usually beaten by btrees even for point lookups — use them only when you've measured a specific case where they win.

When NOT to add an index

Write-heavy, read-light tables. Every index is write cost.
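Low selectivity. A btree on a boolean is_active where 90% of rows are active will almost never be chosen for is_active = true, because a sequential scan is cheaper. If queries only target the rare value, a partial index on it is better.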
Queries that need most of the table. Reports over large windows are best served by parallel seq scan.
Redundant indexes. (a, b, c) subsumes (a, b) and (a). Drop the prefixes, keep the longest.

Finding unused indexes

SELECT
    s.indexrelname AS index_name,
    s.relname AS table_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS size,
    s.idx_scan
FROM pg_stat_user_indexes s
WHERE s.schemaname = 'public'
  AND s.idx_scan = 0
  AND NOT EXISTS (
      SELECT 1 FROM pg_constraint c
      WHERE c.conindid = s.indexrelid AND c.contype IN ('p', 'u', 'x')
  )
ORDER BY pg_relation_size(s.indexrelid) DESC;

Real result from the running database:

index_name	size	idx_scan
idx_sim_bp_users_username_pattern	6184 kB	0
idx_sim_bp_users_email_pattern	7960 kB	1

One 6 MB index with zero scans is a straightforward drop. The NOT EXISTS clause skips PK/unique/exclusion constraint indexes — those enforce integrity and are used internally even if no user query hits them.

Two caveats: pg_stat_reset() zeros the counter (check the stats timestamp before acting), and a replica's stats only count scans on that replica (don't drop an index from the primary based on replica stats alone).
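Before acting on a zero idx_scan, it is worth checking how long the counters have been accumulating. A sketch of that check, followed by a non-blocking drop of the index identified above:

```sql
-- When were this database's cumulative statistics last reset?
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- Drop without blocking concurrent writes on the table.
-- Note: DROP INDEX CONCURRENTLY cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_sim_bp_users_username_pattern;
```

If stats_reset is recent, a zero scan count says little; wait for a full business cycle before dropping anything.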

Adding the right index — a complete example

SELECT order_id, user_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;

Without a supporting index, this runs as a 51 ms sequential scan over 500k rows with a top-N heapsort. Three plausible candidates:

(status) — cheapest, most general, but the planner still needs a sort step.
(status, total_amount_cents DESC) — solves filter and sort. The sort is free because the index is already ordered on the trailing column within each status group.
(total_amount_cents DESC) WHERE status = 'pending' — only pending rows indexed. Smaller, faster to maintain, but only helps pending queries.

Option 3 plus INCLUDE (order_id, user_id, created_at) gives Index Only Scan and is the right call for this specific query. If the dashboard later adds status IN ('pending', 'processing'), you'd want option 2 instead. Design indexes for the query you have, and re-read the plans every six months.
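Option 3 with INCLUDE is the idx_bp_orders_pending_by_amount definition shown in the covering-index section. For comparison, option 2 might be written as below; the index name is hypothetical. Re-running the query under EXPLAIN after creating either index is the quickest way to see which one the planner prefers.

```sql
-- Option 2 (hypothetical name): general index serving filter and sort
-- for any status value.
CREATE INDEX idx_bp_orders_status_amount
    ON sim_bp_orders (status, total_amount_cents DESC);

-- Verify: look for an Index Scan or Index Only Scan and no Sort node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, user_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;
```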

Originally published at mydba.dev/blog/postgres-index-usage-optimization.

Reading PostgreSQL EXPLAIN and EXPLAIN ANALYZE Output

Philip McClarence — Fri, 24 Apr 2026 14:00:07 +0000

Every PostgreSQL performance conversation eventually lands on a question that sounds trivial: what does this EXPLAIN mean? The output is almost readable. There are node names in English, numbers that look familiar, and enough structure that you can guess at the intent. But if you're guessing, you're going to miss the signal that actually matters — and the difference between a plan that returns in 0.3 ms and one that returns in 400 ms is often one line of EXPLAIN output that looks like boilerplate.

This article is a systematic walk through how to read an EXPLAIN plan on PostgreSQL 17, using real output captured from a live database. By the end you should be able to look at a plan, identify what each node is doing and why, spot the three places where things usually go wrong, and articulate in one sentence why the query is slow — or whether it's actually fine and something else is wrong.

This is part of the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series.

EXPLAIN vs EXPLAIN ANALYZE vs EXPLAIN (ANALYZE, BUFFERS)

The three variants you'll use in practice:

EXPLAIN — asks the planner what it would do, without running the query. Fast (milliseconds), safe for expensive queries, but every number is an estimate. Useful for "how expensive does the planner think this is?" and "did my new index change the plan shape?"

EXPLAIN ANALYZE — actually runs the query and reports what happened. You get both the planner's estimates and the real measured results, side by side. Use this in development and staging; use it on production only after thinking about the cost. Three warnings: (1) EXPLAIN ANALYZE on an INSERT/UPDATE/DELETE will execute the DML — wrap in a BEGIN; ... ROLLBACK; if you don't want the side effects. (2) The query runs end-to-end, so a slow query is slow again, and any locks it takes are held for real. (3) ANALYZE pulls rows into the buffer cache and may evict other working-set pages; running it on a busy production system can perturb the performance of the exact thing you're measuring. On hot-path queries, prefer capturing a representative plan via auto_explain or an EXPLAIN visualiser in a monitoring tool rather than running EXPLAIN ANALYZE ad-hoc under load.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) — the version you should default to. BUFFERS adds per-node cache-hit/read/dirtied counts and must still be specified explicitly; EXPLAIN ANALYZE on its own does not include buffer statistics. VERBOSE adds the output column list at each node (useful for spotting why indexes aren't being chosen). SETTINGS reports any non-default planner knobs that might be influencing the plan.

You can also ask for structured output with FORMAT JSON, FORMAT YAML, or FORMAT XML. JSON preserves every field and is what you want for programmatic analysis; the text format is easier to read inline.
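As a sketch of the variants above in practice: the first statement measures a real UPDATE but rolls it back, and the second requests JSON output for programmatic analysis. The order_id value is arbitrary and only illustrative.

```sql
-- EXPLAIN ANALYZE executes the DML, so wrap it in a transaction
-- and roll back to discard the side effects.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
UPDATE sim_bp_orders
SET status = 'shipped'
WHERE order_id = 12345;
ROLLBACK;

-- Structured output: every field preserved, suitable for tooling.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT count(*)
FROM sim_bp_orders
WHERE status = 'pending';
```

The rollback discards the row change, but locks are still held and buffers are still touched while the statement runs.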

The plan tree

Every EXPLAIN output is a tree. The root is the outermost node, which is whatever produces the query's final rows; children feed their output up to their parent. PostgreSQL indents children under their parent with arrows:

Parent Node
  ->  Child A
  ->  Child B
        ->  Grandchild

The top-down narrative is: "to produce Parent Node's output, PostgreSQL runs Child A and Child B, feeding both into the parent. Child B itself is produced by running Grandchild." Execution order is bottom-up (leaves run first), but the way to read the plan is top-down — start with "what is this query ultimately asking for?" and then follow the tree down to understand how PostgreSQL intends to answer.

Here's a real example — "show twenty recent pending orders with the user's email." The plan is against a 500,000-row sim_bp_orders table and 200,000-row sim_bp_users table on PostgreSQL 17.8:

Limit  (cost=0.85..16.38 rows=20 width=37) (actual time=0.075..0.277 rows=20 loops=1)
  Buffers: shared hit=211
  ->  Nested Loop  (cost=0.85..77853.84 rows=100300 width=37) (actual time=0.074..0.275 rows=20 loops=1)
        Inner Unique: true
        Buffers: shared hit=211
        ->  Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
              (cost=0.42..30949.29 rows=100300 width=20)
              (actual time=0.012..0.151 rows=20 loops=1)
              Filter: ((status)::text = 'pending'::text)
              Rows Removed by Filter: 106
              Buffers: shared hit=131
        ->  Memoize  (cost=0.43..0.55 rows=1 width=25) (actual time=0.006..0.006 rows=1 loops=20)
              Cache Key: o.user_id
              Cache Mode: logical
              Hits: 0  Misses: 20  Evictions: 0  Overflows: 0  Memory Usage: 3kB
              Buffers: shared hit=80
              ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
                    (cost=0.42..0.54 rows=1 width=25)
                    (actual time=0.003..0.003 rows=1 loops=20)
                    Index Cond: (u.user_id = o.user_id)
                    Buffers: shared hit=80
 Planning Time: 1.183 ms
 Execution Time: 0.309 ms

Read top-down. The root is Limit, which caps the result at twenty rows. Below it is a Nested Loop that joins two sources: an Index Scan Backward over sim_bp_orders and a Memoize wrapping an Index Scan on sim_bp_users. The outer loop walks the orders index backwards (newest first) filtering for status = 'pending', and for each matching order, looks up the user via the primary-key index — but the Memoize caches results by user_id in case the same user appears multiple times (they don't in this particular run, so all 20 are cache misses).

This is a very good plan: 0.309 ms, 211 shared-buffer hits, no reads from disk. The LIMIT 20 short-circuits the nested loop early: only 126 rows are read from the orders index (20 kept, 106 removed by the filter) before twenty matches are found. The same query with a much larger LIMIT would have very different numbers.

Now let's break down what each number means.

Per-node fields: cost, rows, width, time, loops

On every node, PostgreSQL prints something like:

Node Name  (cost=S..T rows=R width=W) (actual time=s..t rows=r loops=l)

The first parenthesis (cost=... rows=... width=...) is the planner's estimate. The second (actual time=... rows=... loops=...) is what actually happened when the query ran. EXPLAIN without ANALYZE only prints the first.

cost=startup..total. Dimensionless units, scaled relative to seq_page_cost (1.0 by default). The other cost GUCs (random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost) are all expressed in the same arbitrary unit, which lets the planner compare heterogeneous operations against each other. startup is the estimated cost to produce the first row from this node; total is the estimated cost to produce all rows. The difference matters: a Sort node has a high startup cost (it has to consume all input before it can produce the first row) but a low marginal cost per row after that. An Index Scan has a very low startup cost. When a node sits below a Limit, what matters is the child's startup cost plus the cost of the few rows the Limit actually pulls, not the child's total cost, because the Limit stops asking for rows as soon as it has enough.

rows=N. The planner's estimate of how many rows this node will emit. Per loop — see below.

width=W. Estimated average row width in bytes. Mostly informational; you use it to sanity-check whether a Sort or Hash might spill to disk (row width × estimated rows ≈ memory requirement).

actual time=startup..total. Wall-clock milliseconds, measured per loop. startup is the time to produce the first row from this node; total is the time to produce the last row.

actual rows=r loops=l. rows is the number of rows produced per loop, averaged over all l loops. To get the total rows this node emitted, multiply: rows × loops.

Loops matter. In the nested loop example above, the Memoize node reports rows=1 loops=20 — meaning the node was executed 20 times (once per outer row), and each execution produced 1 row. Total output: 20 rows. But the actual time=0.006..0.006 is per loop, so the total time spent in Memoize was about 0.006 ms × 20 = 0.12 ms. Forgetting to multiply by loops is the single most common mistake in reading EXPLAIN output — a node that looks fast per loop can still dominate the query time if it runs 50,000 times.

The relationship between rows estimate and actual rows is arguably the most important signal in a plan. If the planner estimated 15 and the actual was 8,000, the plan was built on bad assumptions: every decision it made downstream (join strategy, memory allocation, whether to parallelise) was wrong. A ratio past 10× in either direction is worth treating as a warning; past 100× it's usually critical. The fix is almost always ANALYZE on the affected table, or extended statistics if the bad estimate comes from correlated columns that the planner assumes are independent.
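Both remedies are short to apply. In the sketch below, the statistics object name is hypothetical and the column pair is chosen purely for illustration; in practice you would pick the columns whose combined filter produces the bad estimate.

```sql
-- Refresh the per-column statistics first; this alone often fixes
-- stale estimates.
ANALYZE sim_bp_orders;

-- Extended statistics for columns the planner would otherwise treat
-- as independent (hypothetical name, illustrative column pair).
CREATE STATISTICS stx_bp_orders_status_user (dependencies, ndistinct)
    ON status, user_id
    FROM sim_bp_orders;

-- Extended statistics are only populated by a subsequent ANALYZE.
ANALYZE sim_bp_orders;
```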

Node types you'll see most often

Scan nodes — where rows enter the plan.

Seq Scan — read every row of a table. Reports Filter: when there's a WHERE clause applied, and Rows Removed by Filter: telling you how many rows were read and discarded. Cheap on small tables, catastrophic on large ones with selective filters.
Index Scan — use an index to find rows, then fetch each matching row from the heap for any columns the index doesn't contain. Reports Index Cond: for conditions satisfied by the index, and optionally Filter: for conditions that have to be rechecked after the heap fetch.
Index Only Scan — use an index to find rows and return all requested columns directly from the index, skipping the heap entirely. Requires either that the index includes every referenced column (see INCLUDE) or that all columns are part of the index keys. Reports Heap Fetches: — this number should be close to zero; a non-zero count means the visibility map didn't cover some pages and PostgreSQL had to check the heap anyway, defeating the point.
Bitmap Index Scan + Bitmap Heap Scan — two-step pattern for combining multiple index conditions or for queries that match many rows. First, the index scan builds a bitmap of heap pages that might have matches. Then the heap scan visits those pages once each, avoiding re-reading pages that contain multiple matches. Reports Exact Heap Blocks and Lossy Heap Blocks — a high lossy-block count means work_mem was too small to track individual tuples, so PostgreSQL fell back to page-level tracking and has to re-filter the matches.

Join nodes — combining two inputs.

Nested Loop — for each row on the outer side, scan the inner side. Optimal when the outer side is small and the inner side has an index on the join column. Pathological when both sides are large.
Hash Join — build a hash table from the smaller side (the Hash child), then probe it with each row from the other side. Optimal for equi-joins on unordered data when the smaller side fits in work_mem. Reports Hash Batches: — if this is greater than 1, the hash table didn't fit in memory and had to spill.
Merge Join — two pre-sorted inputs, walked in parallel. Optimal when both sides are already sorted (or can be sorted cheaply via an index). Reports Merge Cond:.

Sort and aggregation nodes.

Sort — ordering rows. Reports Sort Key: (the columns being sorted), Sort Method: (algorithm), Sort Space Type: (Memory or Disk), and Sort Space Used: (in KB).
- top-N heapsort — used under a LIMIT N. Keeps only N rows in a heap regardless of input size. Efficient in memory and time.
- quicksort — everything fits in work_mem.
- external merge — didn't fit; spilled to disk.
Aggregate / HashAggregate / GroupAggregate — SUM/AVG/COUNT/GROUP BY. HashAggregate builds a hash table keyed by the group-by columns; GroupAggregate requires presorted input. HashAggregate can spill to disk with Planned Partitions: N Batches: M.
Limit — cap the number of rows. Often the shortcut that makes a plan fast.
WindowAgg — window functions like ROW_NUMBER() and SUM() OVER.

Parallelism.

Gather / Gather Merge — the leader process collecting results from parallel workers. Workers Planned: and Workers Launched: tell you how many workers the planner asked for vs actually got. When Launched < Planned, the system is short on parallel worker slots.
Parallel Seq Scan / Parallel Index Scan / Parallel Hash Join — parallel-aware variants of the base node types.
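When Workers Launched falls short of Workers Planned, the limits that govern the worker pool are a reasonable first place to look. A quick way to read them:

```sql
-- Settings that bound how many parallel workers a query can obtain.
SELECT name, setting
FROM pg_settings
WHERE name IN ('max_worker_processes',
               'max_parallel_workers',
               'max_parallel_workers_per_gather')
ORDER BY name;
```

The pool is shared across all sessions, so a shortfall can also mean other queries were holding the workers at that moment.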

Utility nodes.

Materialize — cache an intermediate result so the parent can rescan it without redoing the work. Common above the inner side of a Nested Loop.
Memoize (new in PostgreSQL 14) — LRU cache above an inner loop. Reports Cache Key:, Hits:, Misses:, Evictions:, and Memory Usage:. A high hit ratio is good; a high miss ratio just means the cache didn't help this particular query but didn't hurt either.
CTE Scan — reading from a materialised CTE. In PostgreSQL 12+ most CTEs are inlined and this node disappears; you see it when a CTE is referenced multiple times or marked MATERIALIZED.
SubPlan — a correlated subquery, executed once per outer row. Almost always worth rewriting as a JOIN.
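To make the SubPlan case concrete, here is a sketch of a scalar subquery on the running dataset and a join that returns the same rows. The equivalence relies on user_id being the primary key of sim_bp_users, so at most one user matches each order.

```sql
-- Scalar subquery in the SELECT list: typically planned as a SubPlan
-- that runs once per outer row.
SELECT o.order_id,
       (SELECT u.email
        FROM sim_bp_users u
        WHERE u.user_id = o.user_id) AS email
FROM sim_bp_orders o
WHERE o.status = 'pending';

-- Join form. LEFT JOIN keeps orders with no matching user (email is
-- NULL), which is what the scalar subquery returned for them.
SELECT o.order_id, u.email
FROM sim_bp_orders o
LEFT JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending';
```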

The Buffers line

With BUFFERS enabled, every node reports how many 8 KB pages it touched:

Buffers: shared hit=3689

The four counters to know:

shared hit=N — pages found in shared_buffers (PostgreSQL's cache). No I/O system calls.
shared read=N — pages the backend had to read into shared_buffers via a read() system call. Whether the OS page cache satisfied the read without touching disk is invisible to EXPLAIN — these show up as reads regardless.
shared dirtied=N — pages the query modified in cache. Common with DML; in a read-only SELECT, usually comes from hint-bit updates or cleanup.
shared written=N — pages written back out during this node's execution. Usually this is the backend itself being forced to evict dirty pages to make room for new ones, not the background writer — so a high written count means your query is doing someone else's work because the dirty-page pool was already full.

There's also local hit/read/dirtied/written for per-session temporary tables, and temp read=N written=N for work files produced by sorts and hash joins that spilled.

A query doing shared read=2016, temp written=2051 in a single node is telling you two things: the table isn't fitting in cache, and the query itself is generating its own on-disk temp files because some operation (hash, sort, bitmap) exceeded work_mem. Both are fixable; both hurt.

A harder plan: the HashAggregate spill

Here's a plan with more going on — "the twenty users with the most pending-or-shipped orders." Against the same 500,000-row orders table and 200,000-row users table:

Limit  (cost=42281.85..42282.10 rows=20 width=29) (actual time=408.141..408.145 rows=20 loops=1)
  Buffers: shared hit=3737 read=2016, temp read=1320 written=2051
  ->  Sort  (cost=42281.85..42722.81 rows=176383 width=29) (actual time=406.664..406.667 rows=20 loops=1)
        Sort Key: (count(*)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=33757.54..37588.36 rows=176383 width=29)
              (actual time=347.354..392.138 rows=117060 loops=1)
              Group Key: u.email
              Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 6920kB
              ->  Hash Join  (cost=7932.00..21080.02 rows=176383 width=21)
                    (actual time=140.974..285.275 rows=175263 loops=1)
                    Hash Cond: (o.user_id = u.user_id)
                    ->  Seq Scan on sim_bp_orders o  (cost=0.00..9939.00 rows=176383 width=4)
                          (actual time=0.018..51.809 rows=175263 loops=1)
                          Filter: ((status)::text = ANY ('{pending,shipped}'::text[]))
                          Rows Removed by Filter: 324737
                    ->  Hash  (cost=4064.00..4064.00 rows=200000 width=25)
                          (actual time=140.864..140.865 rows=200000 loops=1)
                          Buckets: 131072  Batches: 2  Memory Usage: 6822kB
                          ->  Seq Scan on sim_bp_users u  (cost=0.00..4064.00 rows=200000 width=25)
                                (actual time=1.735..87.218 rows=200000 loops=1)
 Planning Time: 1.107 ms
 Execution Time: 408.215 ms

408 ms. Let's read it top-down and find where the time actually goes.

Root: Limit + Sort. The Sort is top-N heapsort, Memory: 26 kB — fine. Under the LIMIT 20, a top-N sort is almost free regardless of input size.

HashAggregate: the first red flag. The Group Key is u.email; the aggregate is a count(*) across the 175k joined rows. Two numbers jump out: Planned Partitions: 4 Batches: 5 and Memory Usage: 8241 kB Disk Usage: 6920 kB. PostgreSQL 13+ can spill a HashAggregate to disk when the hash table exceeds its memory budget (work_mem × hash_mem_multiplier): the executor detects that not all groups will fit in memory, writes unfinished groups out to per-partition spill files, and processes them in a later pass. The exact number of spill-and-resume cycles isn't something you should read literally from the Batches count, but the presence of Disk Usage at all is the signal: this query is paying for temp file I/O on every run. The temp written=2051 buffer count at the top is driven in part by this spill (the Hash Join batches below account for the rest), and this is the dominant cost of the query.

Hash Join + Hash child: the second red flag. Buckets: 131072 Batches: 2 Memory Usage: 6822 kB. The hash table built from sim_bp_users needed about 13 MB (the build side is 200k rows at ~64 bytes each) and didn't fit in the hash memory budget, which is work_mem × hash_mem_multiplier (8 MB with the default 4 MB work_mem and the default multiplier of 2.0 on PostgreSQL 15+), so it was split into two batches. When a hash join spills, PostgreSQL partitions both sides by the join key and processes one matched pair of partitions at a time: each probe row is tested only against its matching partition, not against every batch. The cost is the extra I/O of writing the build and probe sides to per-partition temp files and reading them back.

Seq Scan on sim_bp_orders. 175k rows returned, 324k removed by filter (total = 500k, the whole table). The filter is status IN ('pending', 'shipped'). No index on status, so the whole table is scanned.

Seq Scan on sim_bp_users. 200k rows returned, no filter — we need all users. Reads 2016 pages from disk (shared read=2016); the users table is mostly cold in cache.

The bottleneck order, from biggest to smallest: HashAggregate spill, Hash Join build-side batches, Seq Scans. Three different fixes are plausible, and which one is appropriate depends on how often this query runs, how much work_mem the rest of the workload can tolerate, and whether the data is append-mostly:

Raise work_mem per-session to ~20 MB so both the HashAggregate and the Hash Join stay in memory. Caveat: work_mem is allocated per sort/hash node per connection, so raising it globally multiplies by the number of concurrent queries doing sorts. Set it per-role (ALTER ROLE dashboard SET work_mem = '32MB') or per-session in the dashboard's connection pool, not cluster-wide.
Index sim_bp_orders.status so the scan becomes a Bitmap or Index Scan instead of reading all 500k rows. At ~35% selectivity a plain btree might not beat a seq scan by much, but a partial index or a multi-column (status, user_id) would.
Materialise the aggregate into a small summary table refreshed on a schedule or via triggers, if the query is a dashboard that runs every 10 seconds and the underlying data is append-mostly.
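The three fixes could be sketched as follows. All object names are hypothetical, the query text is reconstructed from the plan above, and the third option uses a materialized view as one possible way to build the summary table; each is meant to be measured on its own rather than applied together.

```sql
-- Fix 1: more memory for one transaction only.
BEGIN;
SET LOCAL work_mem = '32MB';
-- ... run the dashboard query here and compare EXPLAIN (ANALYZE, BUFFERS) ...
COMMIT;

-- Fix 2: index the filter column together with the join column.
CREATE INDEX idx_bp_orders_status_user
    ON sim_bp_orders (status, user_id);

-- Fix 3: precompute the aggregate and read from the summary instead.
CREATE MATERIALIZED VIEW mv_bp_open_orders_per_user AS
SELECT u.email, count(*) AS open_orders
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status IN ('pending', 'shipped')
GROUP BY u.email;

-- Refresh on whatever schedule the dashboard can tolerate.
REFRESH MATERIALIZED VIEW mv_bp_open_orders_per_user;
```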

A fair DBA answer is "measure each fix in isolation and pick based on the workload" — not any specific prescribed order. If the query runs once a day in a reporting job, the work_mem bump is cheapest; if it runs constantly and powers a UI, the materialised result wins.

The five most common mistakes in reading plans

Comparing rows without multiplying by loops. A node reporting rows=1 loops=50000 produced 50,000 rows. A node reporting rows=50000 loops=1 produced the same 50,000 rows in a very different shape. Always look at loops.
Looking at top-line cost/time and calling it a day. The top-line number tells you the query is slow; it doesn't tell you which node is slow. Scan the tree for the node with the highest actual time × loops — that's where the time is spent, and usually where the fix is.
Trusting the planner's estimates when actual rows disagrees. If rows=15 on the estimate and actual rows=8000, every downstream decision was built on the wrong premise. Don't try to understand why the plan is shaped the way it is until you've fixed the estimate (usually with ANALYZE or extended statistics).
Missing the Rows Removed by Filter line. A Seq Scan returning a reasonable number of rows looks fine — until you notice the filter line says ten million rows were read and discarded to produce those few. The scan was fine; the cost is in the discard.
Ignoring the Buffers line. Two plans can have identical shapes and wildly different performance if one hits cache and the other doesn't. shared hit=5 means "hot"; shared read=50000 means "the storage layer did all the work, and next time it might be even worse." The Buffers line is the only way to see this without looking at the timing.

Next steps

If the first plan in this article (the nested loop) looked straightforward and the second (the HashAggregate spill) made sense, you've mostly got it. The rest of the series digs into specific bottleneck categories — missing indexes, join-strategy mistakes, aggregate spills, non-sargable WHERE clauses — and what to do about each. The next piece is PostgreSQL Index Usage and Optimization.

#postgres #performance #database #sql

Originally published at https://mydba.dev/blog/postgres-explain-analyze-reading

The Complete Guide to PostgreSQL SQL Query Analysis & Optimization

Philip McClarence — Thu, 23 Apr 2026 14:00:06 +0000

Most PostgreSQL performance work is wasted because it starts from the wrong end. Someone notices a slow query, skim-reads EXPLAIN, pattern-matches to "missing index," adds one, and moves on. Sometimes that works. Often it doesn't — and when it doesn't, the next attempt is usually an even blunter instrument: "just add more RAM," "just use a read replica," "just cache it."

This guide is a systematic alternative. The argument is that a large fraction of single-query latency problems in OLTP workloads fall into one of a small number of bottleneck categories, each with a characteristic EXPLAIN signature and a well-understood fix. (Lock contention, vacuum bloat, replication lag, and the generic-plan vs custom-plan behaviour of prepared statements are real and common, but they are cluster-level or protocol-level problems rather than single-plan problems; this guide is strictly about the latter.) If you can name the category in sixty seconds of reading the plan, the fix usually follows in minutes.

We'll work through the full workflow end-to-end on a real query against a real PostgreSQL 17 database, then map the eight bottleneck categories to the eight deep-dive articles that make up this series. Every EXPLAIN snippet below is captured from an actual run against a 500,000-row sim_bp_orders table on a Neon Postgres 17.8 database — not a synthetic example.

The workflow

Read the EXPLAIN plan — specifically the three signals that matter most: estimated-vs-actual row counts, access path at each scan node, and where time is actually spent.
Categorise the bottleneck — translate the plan signals into one of eight categories.
Apply the matching fix — index, rewrite, tune memory, or restructure the query.
Verify with a second EXPLAIN — before/after is how you know you actually fixed something.

That's it. The rest of this article walks through each step on a concrete example.

A typical slow query

Our running example is a dashboard query: "show me the fifty highest-value pending orders."

SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;

The table is 500,000 rows, with roughly 20% in status = 'pending'. There's a primary key on order_id, indexes on user_id and created_at, but no index on status or total_amount_cents. We've disabled parallel execution (SET max_parallel_workers_per_gather = 0) for this example so the plan reads cleanly. Here's the plan:

Limit  (cost=13270.89..13271.02 rows=50 width=20) (actual time=50.873..50.883 rows=50 loops=1)
  Buffers: shared hit=3689
  ->  Sort  (cost=13270.89..13521.64 rows=100300 width=20) (actual time=50.871..50.877 rows=50 loops=1)
        Sort Key: sim_bp_orders.total_amount_cents DESC
        Sort Method: top-N heapsort  Memory: 30kB
        Buffers: shared hit=3689
        ->  Seq Scan on sim_bp_orders  (cost=0.00..9939.00 rows=100300 width=20) (actual time=0.011..37.781 rows=100252 loops=1)
              Filter: ((status)::text = 'pending'::text)
              Rows Removed by Filter: 399748
              Buffers: shared hit=3689
 Planning Time: 0.073 ms
 Execution Time: 50.908 ms

Fifty-one milliseconds is not a disaster on its own. It's the kind of number that gets shrugged at until a hundred of these queries run concurrently on a busy application server, at which point CPU saturates and every request starts stacking.
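
A plan like the one above can be captured with something along these lines. The exact EXPLAIN options used for the article are an assumption; BUFFERS is included here because the captured plan shows Buffers lines.

```sql
-- Session-level setting only; RESET puts the configured value back afterwards.
SET max_parallel_workers_per_gather = 0;

-- ANALYZE executes the query for real, so run it against data you can afford to read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;

RESET max_parallel_workers_per_gather;
```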

Step 1 — Read the plan

Three signals tell you nearly everything about a plan node.

Signal 1 — how the table is accessed. At the leaf of this plan is Seq Scan on sim_bp_orders. A sequential scan means the planner's cost model decided reading every row was cheaper than any available index — sometimes because no useful index exists, sometimes because existing indexes don't match the query shape, occasionally because statistics are misleading the cost estimate. On small tables, or when the query needs a large fraction of the table anyway, a seq scan is often genuinely the cheapest plan. But on a 500k-row table with a selective filter and an ORDER BY ... LIMIT 50, it's the wrong shape.

Signal 2 — rows removed by filter. Rows Removed by Filter: 399,748, with Actual Rows: 100,252 matching. The scan touched every row in the table. The filter selectivity is ~20% — not pathological by itself — but 400,000 rows of pure waste every time the dashboard refreshes. An index on the filter column would let PostgreSQL skip them entirely.

Signal 3 — planner estimate vs reality. rows=100,300 estimated vs rows=100,252 actual. Essentially perfect. If the ratio had been ten-to-one or worse in either direction, the plan would be built on bad assumptions and ANALYZE would be the first move. Here, statistics are healthy.
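
When the estimate and the actual count do diverge, a first pass might look like the sketch below. The extended-statistics object is hypothetical: it assumes status and user_id are correlated, which nothing in this example establishes.

```sql
-- Refresh the planner's statistics for the table.
ANALYZE sim_bp_orders;

-- Inspect what the planner now believes about the filter column.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'sim_bp_orders'
  AND attname = 'status';

-- Hypothetical: only useful if the two columns really are correlated.
CREATE STATISTICS IF NOT EXISTS sim_bp_orders_status_user_dep (dependencies)
    ON status, user_id FROM sim_bp_orders;

-- Extended statistics are populated by the next ANALYZE.
ANALYZE sim_bp_orders;
```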

There's a fourth node worth naming: the Sort above the scan is a top-N heapsort. Unlike a full sort, a top-N heapsort streams all input rows through a heap of size N (50 here) — it reads all 100,252 pending rows but only ever holds 50 in memory. That's why the Memory: 30kB is so small. Even so, it's 100,252 rows of unnecessary work: an index on (total_amount_cents DESC) WHERE status = 'pending' would let the planner walk the index from the largest value downward and stop after fifty entries.

Step 2 — Categorise the bottleneck

Once you've read the plan, map what you see to one of eight categories. Each category has a characteristic signature; each maps to a deep-dive article in this series.

Plan signal	Bottleneck category	Fix article
You can't even read the plan confidently	Plan literacy	Reading EXPLAIN / EXPLAIN ANALYZE Output
`Seq Scan` with large row counts, many `Rows Removed by Filter`	Missing or wrong index	Index Usage & Optimisation
Nested Loop joining large tables; Hash Join spilling to disk	Join strategy	Join Optimisation
`CTE Scan` feeding a filter; `SubPlan` running per outer row	Subquery / CTE structure	Subquery & CTE Optimisation
`HashAggregate` or `Sort` spilling, expensive window functions	Aggregate or window tuning	Aggregate & Window Function Tuning
Index exists but isn't being used; function on indexed column	WHERE clause shape	WHERE Clause Optimisation
Plan is "fine" but the query itself is the problem	Query rewriting	Query Rewriting Techniques
`SELECT *`, implicit casts, deep pagination, N+1 from ORM	Anti-pattern	Anti-Patterns & Common Mistakes

Our example plan maps cleanly. A Seq Scan with 400,000 rows removed by filter, sitting under an ORDER BY ... LIMIT that can't exploit any existing index, is the textbook signature for the Missing or wrong index category. The Sort above it is solvable in the same stroke — a single partial index can eliminate both the scan and the sort.
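
Before settling on the category, it is worth confirming which indexes actually exist. A quick sketch, assuming the table lives in the public schema:

```sql
-- List every index on the table together with its full definition.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'        -- assumption: default schema
  AND tablename = 'sim_bp_orders'
ORDER BY indexname;
```

If nothing in indexdef mentions status or total_amount_cents, the Seq Scan is the planner doing the best it can with what it has.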

Step 3 — Apply the fix

The fix is a partial index, with the non-filter columns tucked into INCLUDE so the planner can serve the query from the index alone without touching the heap:

CREATE INDEX CONCURRENTLY idx_sim_bp_orders_pending_by_amount
    ON sim_bp_orders (total_amount_cents DESC)
    INCLUDE (order_id, user_id, created_at)
    WHERE status = 'pending';

Four non-obvious choices:

Partial index. Only pending orders are indexed, because that's the only status the query cares about. A full index on (status, total_amount_cents) would work too; it would contain roughly 5× more entries. Partial indexes only help queries whose WHERE clause implies the index's predicate — so if this dashboard later adds WHERE status IN ('pending', 'processing'), the planner will skip this index.
Sort direction in the index. Specifying total_amount_cents DESC means the planner can scan the btree in the direction that produces rows in the needed order without an explicit Sort node.
Tiebreaker. In a real dashboard you'd almost always want a tiebreaker column — ORDER BY total_amount_cents DESC isn't deterministic for ties, and two rows with equal totals would shuffle between pages; adding , order_id DESC to both the index and the query fixes that.
Covering (INCLUDE). The SELECT list is satisfied entirely from index tuples, which lets the planner serve the query as an Index Only Scan without heap fetches. Index Only Scan also requires the visibility map to mark the relevant heap pages all-visible, so on a write-heavy table where autovacuum can't keep up, you may still see heap fetches even with a covering index.
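
The tiebreaker variant described above could be sketched like this. The index name is made up for illustration; order_id moves from INCLUDE into the key so that the index order matches the new ORDER BY.

```sql
-- Hypothetical name; same partial predicate, with order_id as a second key column.
CREATE INDEX CONCURRENTLY idx_sim_bp_orders_pending_by_amount_id
    ON sim_bp_orders (total_amount_cents DESC, order_id DESC)
    INCLUDE (user_id, created_at)
    WHERE status = 'pending';

-- The query must use the same two-column ordering to walk the index directly.
SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC, order_id DESC
LIMIT 50;
```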

CREATE INDEX CONCURRENTLY avoids the SHARE lock that a plain CREATE INDEX holds for the entire build (which blocks inserts, updates, and deletes, though not reads), so normal reads and writes continue while the index builds. It still takes a weaker SHARE UPDATE EXCLUSIVE lock, waits for transactions that hold old snapshots on the target table to finish before advancing between phases, and runs two passes over the table. That makes it slower than CREATE INDEX in wall-clock time, and a single long-running transaction that has touched this table can stall the build indefinitely. On a 500k-row table the build takes seconds; on a 500M-row table it can take hours. The application stays up the whole time. A partial index on status = 'pending' still pays write cost when rows are inserted into or updated out of that state, so if pending is a high-churn status, weigh the read win against the write overhead.
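
A few catalog queries help when a concurrent build seems stuck or has failed. This is a sketch rather than a complete runbook; note that a cancelled or failed CREATE INDEX CONCURRENTLY leaves an invalid index behind, which has to be dropped before retrying.

```sql
-- While the build runs: current phase, and whether it is waiting on other sessions.
SELECT pid, phase, lockers_total, lockers_done, current_locker_pid,
       blocks_done, blocks_total
FROM pg_stat_progress_create_index;

-- Oldest open transactions, the usual reason a build stalls between phases.
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
ORDER BY xact_start
LIMIT 5;

-- After the build: any index on the table that was left invalid.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'sim_bp_orders'::regclass
  AND NOT i.indisvalid;
```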

Step 4 — Verify

Same query, same data, index in place:

Limit  (cost=0.42..2.18 rows=50 width=20) (actual time=0.021..0.031 rows=50 loops=1)
  Buffers: shared hit=5
  ->  Index Only Scan using idx_sim_bp_orders_pending_by_amount on sim_bp_orders
        (cost=0.42..3544.68 rows=100300 width=20)
        (actual time=0.021..0.026 rows=50 loops=1)
        Heap Fetches: 0
        Buffers: shared hit=5
 Planning Time: 0.186 ms
 Execution Time: 0.045 ms

0.045 ms, down from 50.9 ms — roughly 1100× faster. Buffers dropped from 3,689 hit to 5 hit. The Sort node is gone entirely: the index is already sorted in the right order. The Filter line is gone: the partial index guarantees every row it contains already satisfies status = 'pending'. Heap Fetches: 0 means the visibility map covered every leaf page we touched, so PostgreSQL served all 50 tuples from the index without reading a single heap page.

Two caveats on the headline number. First, EXPLAIN ANALYZE's Execution Time measures server-side SQL execution only: it excludes network round-trip, client-side tuple deserialisation, and connection pool overhead. Real application latency for this query is probably closer to 2–10 ms depending on your region. Second, the measurement is on a hot cache with an immediately-post-vacuum visibility map; a colder cache would show Buffers: shared read=N instead of all hit. The meaningful improvement is the ~700× drop in buffers touched (3,689 down to 5), since that's what translates into lower CPU under concurrency.
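
Whether Heap Fetches stays at zero in production depends on the visibility map, and whether the index is used at all shows up in the statistics views. A rough check, bearing in mind that relallvisible is an estimate refreshed by VACUUM and ANALYZE:

```sql
-- Share of heap pages currently marked all-visible (estimate).
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'sim_bp_orders';

-- How often each index on the table has been scanned since the last stats reset.
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'sim_bp_orders'
ORDER BY idx_scan DESC;
```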

The eight categories

The workflow above treats "spot the category" as a two-sentence step. In practice, each category has its own rules, exceptions, and non-obvious variants. The rest of this series is eight standalone articles, each diving into one category.

1. Reading EXPLAIN / EXPLAIN ANALYZE output

Before you can optimise a plan, you have to be able to read one. EXPLAIN reports the planner's estimated plan; EXPLAIN ANALYZE executes the query and reports what actually happened. The deep dive covers every common node type, the meaning of loops, Buffers, Memory, Workers Planned vs Launched, and the five most common ways to misread a plan. → Reading EXPLAIN / EXPLAIN ANALYZE Output.

2. Index usage and optimisation

A large share of single-query OLTP latency problems come down to indexing — either missing, or present but not matching the query shape. But "add an index" understates what's actually required: choosing columns in the right order, deciding between full and partial indexes, using INCLUDE for covering indexes, expression indexes for computed predicates, GIN/GiST/BRIN for the data types where btrees are wrong, and knowing when not to add one. → Index Usage & Optimisation.

3. Join optimisation

The planner picks between Nested Loop, Hash Join, and Merge Join based on cost estimates. Each has a regime where it's best, and the worst joins are the ones using the wrong strategy — usually a Nested Loop on two large tables. → Join Optimisation.

4. Subquery and CTE optimisation

PostgreSQL 12 changed CTE semantics — what used to always be materialised is now inlined by default, except when you ask for materialisation explicitly. That change made many old "CTE as optimisation fence" tricks silently stop working. → Subquery & CTE Optimisation.
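
A sketch of the two explicit spellings available since PostgreSQL 12, using the sim_bp_orders table from earlier; the aggregate itself is only there to give the CTE a consumer.

```sql
-- Force the old behaviour: compute the CTE once and keep it as a fence.
WITH pending AS MATERIALIZED (
    SELECT user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT user_id, sum(total_amount_cents) AS pending_cents
FROM pending
GROUP BY user_id;

-- Ask for inlining explicitly, letting the planner push conditions into the CTE.
WITH pending AS NOT MATERIALIZED (
    SELECT user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT user_id, sum(total_amount_cents) AS pending_cents
FROM pending
WHERE user_id = 42          -- illustrative value
GROUP BY user_id;
```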

5. Aggregate and window function tuning

GROUP BY and window functions look declarative, but the planner has strong opinions about how to execute them: HashAggregate versus GroupAggregate, partial and parallel aggregation, window frame optimisation. Sorts and hashes that spill to disk are almost always the visible symptom, and work_mem is almost always the knob. → Aggregate & Window Function Tuning.

6. WHERE clause optimisation

An index is only useful if the WHERE clause is sargable. Wrapping an indexed column in a function (lower(email) = '...'), forcing an implicit cast on the column (integer_column = 123.0, which is compared as numeric), or comparing on the wrong side of an operator all silently disable indexes that look like they should apply. → WHERE Clause Optimisation.
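
For the function-on-column case, one common remedy is an expression index that matches the predicate exactly. A minimal sketch against a hypothetical app_users(email text) table, which is not part of this article's schema:

```sql
-- Hypothetical table and index names, for illustration only.
CREATE INDEX idx_app_users_email_lower
    ON app_users (lower(email));

-- The predicate must use the same expression as the index definition.
EXPLAIN
SELECT *
FROM app_users
WHERE lower(email) = 'alice@example.com';
```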

7. Query rewriting techniques

Sometimes the plan is "fine" but the query itself is asking the wrong question. Correlated subqueries can usually become lateral joins; NOT IN with NULLs should be NOT EXISTS; offset pagination past a few hundred pages should be keyset pagination; DISTINCT over a large set is often GROUP BY in disguise. → Query Rewriting Techniques.
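
As one example of such a rewrite, here is offset pagination next to its keyset equivalent. The literal values are illustrative, and the sketch assumes created_at is a timestamp type and order_id is numeric.

```sql
-- Offset pagination: reads and discards 5,000 rows to return 50.
SELECT order_id, created_at
FROM sim_bp_orders
ORDER BY created_at DESC, order_id DESC
LIMIT 50 OFFSET 5000;

-- Keyset pagination: resume after the last row of the previous page.
SELECT order_id, created_at
FROM sim_bp_orders
WHERE (created_at, order_id) < ('2026-04-01 12:00:00+00', 123456)
ORDER BY created_at DESC, order_id DESC
LIMIT 50;
```

The row comparison only becomes a cheap index range condition if an index exists on (created_at, order_id); with the single-column created_at index from the example schema the planner has less to work with.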

8. Anti-patterns and common mistakes

The final category is queries that are wrong by construction: SELECT * in hot paths, implicit type casts that silently disable indexes, missing LIMIT on exploratory joins, N+1 patterns coming out of ORMs, inserting one row at a time instead of batching. → Anti-Patterns & Common Mistakes.

Where to start

If you can already read EXPLAIN confidently, the highest-value articles are probably Index Usage and Query Rewriting, because those are where the largest wins hide. If reading the plans in this article felt like work, start with Reading EXPLAIN and come back here.

Slow queries are not mysterious. They fall into a small number of categories, each with a characteristic plan signature and a well-understood fix. Learn to recognise the signatures and most of the rest follows.

#postgres #performance #database #sql

Canonical version with the full series linked: https://mydba.dev/blog/postgres-query-analysis-complete-guide

PostgreSQL Parallel Query: Configuration & Performance Tuning

Philip McClarence — Wed, 22 Apr 2026 10:00:02 +0000

PostgreSQL Parallel Query: Configuration & Performance Tuning

Your analytical query scans a 50 GB table, aggregates 200 million rows, and takes 25 seconds. Your server has 16 CPU cores. PostgreSQL uses... 3 of them at most: two parallel workers plus the leader. The other 13 sit idle. The max_parallel_workers_per_gather default of 2 is leaving a large potential speedup on the table. Let's fix that -- and understand when you should not.

How Parallel Query Works

PostgreSQL divides large operations across multiple CPU cores. Worker processes each scan a portion of the data, feed results through a Gather node to the leader process, which combines them. Sequential scans, hash joins, aggregates, and B-tree index scans all support parallel execution.

The key defaults:

max_parallel_workers_per_gather = 2 -- max workers per parallel operation
max_parallel_workers = 8 -- total parallel workers across all sessions
max_worker_processes = 8 -- total background workers (shared with other subsystems)
min_parallel_table_scan_size = 8MB -- minimum table size for parallel scan
parallel_setup_cost = 1000 -- planner's estimate for starting a worker
parallel_tuple_cost = 0.1 -- per-tuple transfer cost estimate

These are tuned for general-purpose workloads. For analytical queries on large tables, they're far too conservative.

Detecting the Problem

Check your current settings:

SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name LIKE '%parallel%'
   OR name = 'max_worker_processes'
ORDER BY name;

Check whether queries actually use parallel workers:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT customer_region, count(*), sum(order_total)
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_region;

Look for:

Gather or Gather Merge -- parallel execution is happening
Workers Planned: 2 and Workers Launched: 2 -- how many workers
Workers Launched < Workers Planned -- system ran out of workers

Check if workers are being exhausted:

-- Currently active parallel workers
SELECT count(*) AS active_parallel_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';

-- The limit
SELECT setting AS max_parallel_workers
FROM pg_settings
WHERE name = 'max_parallel_workers';

If active workers frequently approach the limit, queries are competing for workers and some run with fewer than planned.
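
The two checks above can be folded into one query that is easier to sample repeatedly. A small sketch; it is a point-in-time snapshot, so it needs to be run many times to say anything about "frequently".

```sql
-- Active parallel workers next to the configured ceiling.
SELECT (SELECT count(*)
        FROM pg_stat_activity
        WHERE backend_type = 'parallel worker')      AS active_parallel_workers,
       current_setting('max_parallel_workers')::int  AS max_parallel_workers;
```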

Tuning for Analytical Workloads

If your database runs analytical queries on large tables:

-- More workers per query
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;

-- More total workers
ALTER SYSTEM SET max_parallel_workers = 16;

-- Enough background worker slots
ALTER SYSTEM SET max_worker_processes = 20;

-- Lower table size threshold
ALTER SYSTEM SET min_parallel_table_scan_size = '1MB';

-- Apply (max_worker_processes requires restart)
SELECT pg_reload_conf();

Rule of thumb: max_parallel_workers_per_gather = half your CPU cores, max_parallel_workers = total cores. On a 16-core server: 8 and 16 respectively.

Lower Cost Thresholds

If medium-sized tables aren't getting parallelized despite adequate configuration:

ALTER SYSTEM SET parallel_setup_cost = 100;    -- default: 1000
ALTER SYSTEM SET parallel_tuple_cost = 0.01;   -- default: 0.1
SELECT pg_reload_conf();

Lower parallel_setup_cost makes the planner consider parallelism for smaller operations. Lower parallel_tuple_cost makes parallel plans look cheaper.
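
Before persisting lower thresholds with ALTER SYSTEM, the effect on a specific query can be previewed inside a transaction. A sketch using the orders query from earlier in this article:

```sql
BEGIN;

-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL parallel_setup_cost = 100;
SET LOCAL parallel_tuple_cost = 0.01;

-- Plain EXPLAIN: shows whether a Gather node now appears, without running the query.
EXPLAIN
SELECT customer_region, count(*), sum(order_total)
FROM orders
GROUP BY customer_region;

ROLLBACK;
```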

Per-Session Overrides

For mixed workloads, set parallelism based on the connection type:

-- Reporting query: maximum parallelism
-- (SET LOCAL only takes effect inside a transaction block)
BEGIN;
SET LOCAL max_parallel_workers_per_gather = 8;
SET LOCAL parallel_setup_cost = 0;
SET LOCAL parallel_tuple_cost = 0;

SELECT region, count(*), avg(order_total)
FROM orders
GROUP BY region;
COMMIT;

-- OLTP session: disable parallelism for the rest of the session
SET max_parallel_workers_per_gather = 0;
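
If reporting and application traffic connect as different database roles, the same split can be attached to the roles so that no client code has to issue SET. The role names below are hypothetical.

```sql
-- Hypothetical roles; the settings apply to sessions started after the change.
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 8;
ALTER ROLE app_user       SET max_parallel_workers_per_gather = 0;

-- Confirm what is stored for each role.
SELECT r.rolname, s.setconfig
FROM pg_db_role_setting s
JOIN pg_roles r ON r.oid = s.setrole
WHERE r.rolname IN ('reporting_user', 'app_user');
```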

Per-Table Settings

For critical large tables:

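-- Ask the planner to assume 8 workers for scans on this table
ALTER TABLE orders SET (parallel_workers = 8);

This overrides the planner's size-based worker count calculation for the table. The number actually used is still capped by max_parallel_workers_per_gather and by the workers available at run time.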
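
A short sketch for checking and undoing the per-table setting:

```sql
-- Storage parameters currently set on the table (NULL means none).
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders';

-- Return the table to the planner's automatic calculation.
ALTER TABLE orders RESET (parallel_workers);
```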

What Parallelizes (and What Doesn't)

Operation	Parallel?
Sequential Scan	Yes
B-tree Index Scan	Yes
Bitmap Heap Scan	Yes
Hash Join	Yes
Merge Join	Yes
Nested Loop	Yes (outer side)
Aggregate (count, sum, avg)	Yes
CREATE INDEX (B-tree)	Yes
Append (UNION ALL, partitions)	Yes
UPDATE, DELETE	No
Scans of materialised CTEs (WITH queries)	No
Cursors	No
FOR UPDATE/SHARE	No

The Memory Multiplication Trap

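work_mem applies per process, and separately to each sort or hash step. A query with work_mem = 256MB and 4 parallel workers runs in 5 processes (the workers plus the leader), so a single sort or hash step can consume 5 × 256 MB, about 1.28 GB. Budget accordingly:

max_connections * (max_parallel_workers_per_gather + 1) * work_mem < available RAM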

This catches people who increase parallelism without accounting for memory.
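
The per-query ceiling can be read off the live settings. A rough sketch that counts the leader as one extra process and assumes a single sort or hash step; plans with several such steps multiply the figure again.

```sql
-- Worst-case memory for one sort/hash step of one parallel query.
SELECT current_setting('work_mem') AS work_mem,
       current_setting('max_parallel_workers_per_gather')::int + 1 AS processes_per_query,
       pg_size_pretty(
           pg_size_bytes(current_setting('work_mem'))
           * (current_setting('max_parallel_workers_per_gather')::int + 1)
       ) AS per_step_ceiling;
```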

Verify the Impact

Compare sequential vs parallel execution:

-- Disable parallelism (plain SET: SET LOCAL has no effect outside a transaction block)
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT customer_region, count(*), sum(order_total)
FROM orders
GROUP BY customer_region;

-- Enable parallelism
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT customer_region, count(*), sum(order_total)
FROM orders
GROUP BY customer_region;

-- Return to the configured value
RESET max_parallel_workers_per_gather;

The parallel plan should show roughly sequential_time / (1 + num_workers) execution time, with 60-80% of theoretical speedup typical due to Gather overhead.
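
To make that expectation concrete, here is the arithmetic for a hypothetical query that takes 60 seconds without parallelism. The 60-second figure is invented for illustration.

```sql
-- Ideal time is sequential_time / (1 + workers); realistic time assumes
-- only 60-80% of the theoretical speedup is achieved.
SELECT workers,
       round(60.0 / (1 + workers), 1)         AS ideal_seconds,
       round(60.0 / (0.8 * (1 + workers)), 1) AS at_80_pct_seconds,
       round(60.0 / (0.6 * (1 + workers)), 1) AS at_60_pct_seconds
FROM (VALUES (2), (4), (8)) AS w(workers);
```

With 4 workers the ideal is 12 seconds, and the 60-80% band puts the realistic result between 15 and 20 seconds.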

Prevention Strategy

OLAP databases: aggressive parallelism. max_parallel_workers_per_gather = CPU_cores / 2, max_parallel_workers = CPU_cores, lower cost thresholds.

OLTP databases: keep defaults or disable. Many short concurrent queries don't benefit -- worker overhead exceeds speedup on small queries.

Mixed workloads: per-connection settings. Reporting connections get high parallelism. App connections get zero.

Monitor Workers Launched vs Workers Planned. Consistent shortfall means you need more max_parallel_workers. If CPU hits 100% during parallel queries and other sessions slow down, reduce max_parallel_workers_per_gather.

Originally published at mydba.dev/blog/postgres-parallel-query

PostgreSQL Point-in-Time Recovery with pgBackRest

Philip McClarence — Tue, 21 Apr 2026 10:00:03 +0000

PostgreSQL Point-in-Time Recovery with pgBackRest

pg_dump gives you a snapshot at the moment you ran it. If your last dump was 6 hours ago and someone accidentally deletes a production table, those 6 hours are gone. Even with hourly dumps, you lose everything between the last dump and the incident. For a database processing thousands of transactions per minute, that gap is devastating. Point-in-time recovery (PITR) eliminates that gap -- restoring your database to any specific second by replaying the write-ahead log on top of a base backup.

How PITR Works

Two mechanisms combine:

Base backups -- periodic snapshots of all database files
WAL archiving -- continuous streaming of every WAL segment to a backup repository

The WAL records every change made to the database. By replaying WAL segments from a base backup forward to a target timestamp, you reconstruct the exact state at that moment. If the last archived WAL is 30 seconds old, your maximum data loss is 30 seconds -- not 6 hours.

pgBackRest is the standard tool for this. It handles base backups (full, incremental, differential), WAL archiving, retention, verification, and recovery -- with parallel compression, encryption, and remote repository support.

Detecting Whether You're Protected

Check if WAL archiving is even enabled:

SELECT name, setting
FROM pg_settings
WHERE name IN (
    'archive_mode',
    'archive_command',
    'archive_timeout',
    'wal_level'
);

You need archive_mode = on and wal_level = replica (or logical). If archive_mode is off, PITR is impossible.

Check for archiving failures:

SELECT
    archived_count,
    failed_count,
    last_archived_wal,
    last_archived_time,
    last_failed_wal,
    last_failed_time,
    now() - last_archived_time AS archive_lag
FROM pg_stat_archiver;

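A growing failed_count, or an archive_lag of more than a few minutes on a database that is taking writes, means the pipeline is broken. failed_count is cumulative since the last statistics reset, so a non-zero value may be historical; compare last_failed_time with last_archived_time to see whether the failure is current. While archiving is failing, WAL segments accumulate on the primary and will eventually fill the disk.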
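
Those checks can be turned into two alert-style queries. This is a sketch: the 5-minute threshold is a starting point rather than a recommendation, and pg_ls_archive_statusdir() needs superuser or pg_monitor privileges.

```sql
-- Is the most recent archive attempt a failure, and is the lag over threshold?
SELECT failed_count,
       last_failed_time,
       last_archived_time,
       last_failed_time IS NOT NULL
         AND (last_archived_time IS NULL
              OR last_failed_time > last_archived_time) AS failing_now,
       now() - last_archived_time > interval '5 minutes' AS lag_over_threshold
FROM pg_stat_archiver;

-- WAL segments finished but not yet archived.
SELECT count(*) AS segments_waiting
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';
```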

Verify backup freshness:

pgbackrest info --stanza=main
pgbackrest verify --stanza=main

Setting Up pgBackRest

Install and Configure

# Debian/Ubuntu
sudo apt-get install pgbackrest

# RHEL/Rocky
sudo dnf install pgbackrest

Create /etc/pgbackrest/pgbackrest.conf:

[main]
pg1-path=/var/lib/postgresql/18/main

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-secure-encryption-passphrase

process-max=4
compress-type=zst
compress-level=6

log-level-console=info
log-level-file=detail

Configure PostgreSQL

Add to postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60

The archive_timeout = 60 forces a WAL switch every 60 seconds even if the segment isn't full. This caps maximum data loss at 60 seconds.

Restart PostgreSQL, then initialize the stanza:

pgbackrest --stanza=main stanza-create
pgbackrest --stanza=main check
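
Once the check passes, archiving can also be confirmed from inside PostgreSQL. A minimal sketch; the archiver works asynchronously, so the last query may need to be repeated after a few seconds.

```sql
-- Confirm the settings are live (pending_restart = true means a restart is still needed).
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'archive_mode', 'archive_command', 'archive_timeout');

-- Force the current WAL segment to close so it becomes eligible for archiving.
-- This is a no-op if nothing has been written since the previous switch.
SELECT pg_switch_wal();

-- The newest archived segment and its timestamp should advance.
SELECT last_archived_wal, last_archived_time, failed_count
FROM pg_stat_archiver;
```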

Schedule Backups

# Full backup (weekly)
pgbackrest --stanza=main --type=full backup

# Differential (daily -- changes since last full)
pgbackrest --stanza=main --type=diff backup

# Incremental (every 6 hours -- changes since the last backup of any type)
pgbackrest --stanza=main --type=incr backup

Cron schedule:

0 2 * * 0  pgbackrest --stanza=main --type=full backup
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
0 */6 * * *  pgbackrest --stanza=main --type=incr backup

Performing Recovery

When disaster strikes, restore to a specific timestamp:

sudo systemctl stop postgresql

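pgbackrest --stanza=main --type=time \
    --target="2026-02-28 14:30:00+00" \
    --target-action=promote \
    --delta \
    restore

sudo systemctl start postgresql

Set --target to just before the incident. --target-action=promote opens the database for read-write after recovery. --delta lets pgBackRest restore into the existing, non-empty data directory; without it the restore refuses to run unless the directory is empty.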

You can also restore to a named restore point:

-- Create before a risky operation
SELECT pg_create_restore_point('before_schema_migration');

pgbackrest --stanza=main --type=name \
    --target="before_schema_migration" \
    --target-action=promote \
    restore

Test Recovery Regularly

This is the most critical step. Schedule monthly recovery tests to a standby server:

pgbackrest --stanza=main --type=time \
    --target="2026-02-28 12:00:00+00" \
    --target-action=promote \
    --pg1-path=/var/lib/postgresql/18/test_recovery \
    restore

Verify the restored database contains expected data at the target timestamp. If recovery fails, fix the configuration before you need it in an emergency. Record the actual recovery time -- that's your real RTO.
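
A few sanity checks to run on the restored instance. The orders table and created_at column in the last query are assumptions standing in for whatever table best shows recent writes in your schema.

```sql
-- false once recovery has finished and the server has been promoted.
SELECT pg_is_in_recovery();

-- Commit time of the last transaction replayed; should sit just before the target.
SELECT pg_last_xact_replay_timestamp();

-- Hypothetical data check: the newest row should not be later than the target time.
SELECT max(created_at) AS newest_row
FROM orders;
```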

Prevention

Build PITR into infrastructure from day one. Every production PostgreSQL database should have WAL archiving before its first production write.

Monitor three metrics continuously:

Archive lag -- alert if > 5 minutes
Failed archive count -- any non-zero value requires investigation
Backup age -- alert if exceeding your backup interval + buffer

An untested backup is not a backup. Test quarterly at minimum. Document the procedure, the expected recovery time, and who executes it. Run end-to-end: restore, replay WAL, verify data, record duration.

Store backups off-host. A backup on the same disk is destroyed by the same failure. Use S3, Azure Blob, GCS, or a separate server. Enable encryption.

Originally published at mydba.dev/blog/postgres-point-in-time-recovery

