<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Philip McClarence</title>
    <description>The latest articles on Forem by Philip McClarence (@philip_mcclarence_2ef9475).</description>
    <link>https://forem.com/philip_mcclarence_2ef9475</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2690053%2F913499a1-620d-4487-a868-d677f1aca106.png</url>
      <title>Forem: Philip McClarence</title>
      <link>https://forem.com/philip_mcclarence_2ef9475</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/philip_mcclarence_2ef9475"/>
    <language>en</language>
    <item>
      <title>PostgreSQL Logical Replication: Setup, Monitoring &amp; Troubleshooting</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:00:02 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/postgresql-logical-replication-setup-monitoring-troubleshooting-3og5</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/postgresql-logical-replication-setup-monitoring-troubleshooting-3og5</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Logical Replication: Setup, Monitoring &amp;amp; Troubleshooting
&lt;/h1&gt;

&lt;p&gt;PostgreSQL logical replication decodes WAL changes into a logical format and streams them to subscribers. Unlike physical (streaming) replication that creates byte-for-byte copies of the entire cluster, logical replication lets you replicate a subset of tables, replicate between different PostgreSQL major versions, and feed changes into systems running a different schema.&lt;/p&gt;

&lt;p&gt;Sounds great on the surface. The complexity hits after the initial setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Physical Replication Isn't Always Enough
&lt;/h2&gt;

&lt;p&gt;Streaming replication sends raw WAL bytes to replicas. It's great for HA failover, but falls apart when you need selectivity. You can't replicate just three tables out of a hundred. You can't replicate from PostgreSQL 14 to PostgreSQL 17 during a version upgrade. You can't stream changes into a data warehouse with a different schema.&lt;/p&gt;

&lt;p&gt;Logical replication solves all of these by decoding WAL into row-level change events (INSERT, UPDATE, DELETE) and applying them on the subscriber. You define a publication with a set of tables on the source, create a subscription on the target, and PostgreSQL handles the initial data copy and ongoing change streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting It Up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  On the publisher
&lt;/h3&gt;

&lt;p&gt;First, set &lt;code&gt;wal_level&lt;/code&gt; (this requires a restart):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;wal_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'logical'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Restart PostgreSQL after this change&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a publication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Publish specific tables&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;orders_pub&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Or publish all tables in a schema (PG15+)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;analytics_pub&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  On the subscriber
&lt;/h3&gt;

&lt;p&gt;Create the subscription. This automatically creates a replication slot on the publisher and copies existing data before streaming changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_sub&lt;/span&gt;
    &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="s1"&gt;'host=publisher.example.com port=5432 dbname=myapp user=replicator password=secret'&lt;/span&gt;
    &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;orders_pub&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor the initial sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;srsubid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;srrelid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="n"&gt;srsubstate&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s1"&gt;'i'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'initializing'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s1"&gt;'d'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'data_copying'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s1"&gt;'s'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'synchronized'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s1"&gt;'r'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'ready'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sync_state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_subscription_rel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Three Things That Bite You in Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. DDL is not replicated
&lt;/h3&gt;

&lt;p&gt;This is the big one. An &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt; on the publisher is completely invisible to the subscriber. The subscriber doesn't know the column exists, so when it receives rows with the new column, replication breaks.&lt;/p&gt;

&lt;p&gt;The fix requires careful ordering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: Apply on subscriber FIRST&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;priority_level&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'normal'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: Apply on publisher&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;priority_level&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'normal'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 3: Refresh the subscription&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_sub&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule is: additive changes (ADD COLUMN, ADD TABLE) go on the subscriber first. Destructive changes (DROP COLUMN, DROP TABLE) go on the publisher first. Get it backwards and replication stops.&lt;/p&gt;
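&lt;p&gt;For a destructive change, the same &lt;code&gt;orders&lt;/code&gt; example runs in reverse (extra columns on the subscriber are tolerated, so the publisher goes first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Step 1: Drop on the publisher FIRST (no new changes reference the column)
ALTER TABLE orders DROP COLUMN priority_level;

-- Step 2: Then drop on the subscriber
ALTER TABLE orders DROP COLUMN priority_level;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;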

&lt;h3&gt;
  
  
  2. Replication slots retain WAL indefinitely
&lt;/h3&gt;

&lt;p&gt;A replication slot tells PostgreSQL "do not delete any WAL after this point until I say so." If the subscriber goes offline -- a failed server, a dropped subscription that wasn't cleaned up, a CDC tool that crashed -- the slot keeps retaining WAL segments.&lt;/p&gt;

&lt;p&gt;Hours or days later, &lt;code&gt;pg_wal/&lt;/code&gt; has grown from a few GB to hundreds of GB. The disk fills. PostgreSQL can't write new WAL. All transactions stall. The primary is effectively down.&lt;/p&gt;

&lt;p&gt;Monitor slot lag constantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;confirmed_flush_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;slot_lag_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;wal_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;slot_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'logical'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And set a safety limit (PG13+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_slot_wal_keep_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'10GB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a slot exceeds this limit, PostgreSQL invalidates it rather than filling the disk. The subscriber will need to be re-synced, but the publisher stays online.&lt;/p&gt;
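&lt;p&gt;Re-syncing after invalidation usually means recreating the subscription so the initial copy runs again -- a sketch using the earlier subscription (truncate the target tables first so the copy doesn't hit duplicates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the subscriber
DROP SUBSCRIPTION orders_sub;
TRUNCATE orders, order_items, customers;
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=publisher.example.com port=5432 dbname=myapp user=replicator password=secret'
    PUBLICATION orders_pub;  -- copy_data defaults to true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;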

&lt;h3&gt;
  
  
  3. Conflict resolution is minimal
&lt;/h3&gt;

&lt;p&gt;If an INSERT on the subscriber violates a unique constraint because the row already exists, replication stops dead. You have to manually intervene:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Disable the subscription&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_sub&lt;/span&gt; &lt;span class="n"&gt;DISABLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Fix the conflict&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Re-enable&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_sub&lt;/span&gt; &lt;span class="n"&gt;ENABLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On PG15+, you can have the subscription disable itself on error instead of retrying in a loop, then skip the failing transaction using the LSN reported in the error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Stop the retry loop on the next error
ALTER SUBSCRIPTION orders_sub SET (disable_on_error = true);

-- Skip the failing transaction; the LSN comes from the error message
ALTER SUBSCRIPTION orders_sub SKIP (lsn = '0/14C0378');
ALTER SUBSCRIPTION orders_sub ENABLE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But skipping discards the entire failing transaction, which may not be acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  REPLICA IDENTITY: The Hidden Requirement
&lt;/h2&gt;

&lt;p&gt;For UPDATE and DELETE to work, the subscriber needs to identify which row to modify. By default it uses the primary key. Tables without a primary key need explicit configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Option 1: Send all columns (works but increases WAL volume)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Option 2: Use a unique index&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_event_id&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_event_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a replica identity, UPDATE and DELETE operations on the publisher fail with an error.&lt;/p&gt;
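&lt;p&gt;A catalog query can flag published tables that would hit this before it happens (a sketch; adjust for your publications):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Published tables still on the default replica identity but lacking a primary key
SELECT pt.schemaname, pt.tablename
FROM pg_publication_tables pt
JOIN pg_class c
  ON c.oid = format('%I.%I', pt.schemaname, pt.tablename)::regclass
WHERE c.relreplident = 'd'
  AND NOT EXISTS (
      SELECT 1 FROM pg_index i
      WHERE i.indrelid = c.oid AND i.indisprimary
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;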

&lt;h2&gt;
  
  
  Aurora PostgreSQL Setup
&lt;/h2&gt;

&lt;p&gt;For AWS Aurora, logical replication requires a parameter group change and reboot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In the RDS parameter group:&lt;/span&gt;
&lt;span class="c1"&gt;-- rds.logical_replication = 1&lt;/span&gt;
&lt;span class="c1"&gt;-- This automatically sets wal_level = logical&lt;/span&gt;

&lt;span class="c1"&gt;-- After reboot, grant the replication role:&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;rds_replication&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;replicator_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Then create publications and subscriptions as normal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring Replication Health
&lt;/h2&gt;

&lt;p&gt;Check subscription status and lag on the subscriber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;subname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;subscription_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latest_end_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;apply_lag_bytes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_subscription&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;subname&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether errors have stalled replication (PG15+; the full error text is in the subscriber's server log):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    subname,
    apply_error_count,
    sync_error_count,
    stats_reset
FROM pg_stat_subscription_stats
WHERE apply_error_count &gt; 0 OR sync_error_count &gt; 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prevention Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt;&lt;/strong&gt; -- 5-20 GB depending on workload. This prevents a single orphaned slot from filling your disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document your schema change procedure&lt;/strong&gt; -- subscriber-first for additions, publisher-first for removals. Include &lt;code&gt;ALTER SUBSCRIPTION ... REFRESH PUBLICATION&lt;/code&gt; in the checklist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use UUID primary keys&lt;/strong&gt; in bidirectional setups to avoid conflicts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep subscriber tables read-only&lt;/strong&gt; in unidirectional setups (separate role without write privileges).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor both apply lag and slot lag&lt;/strong&gt; continuously. They indicate different problems.&lt;/li&gt;
&lt;/ul&gt;
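&lt;p&gt;For the UUID point, a minimal sketch (the table name is illustrative; &lt;code&gt;gen_random_uuid()&lt;/code&gt; is built in since PG13):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- UUIDs generated independently on each node won't collide the way serials do
CREATE TABLE orders_bidir (
    order_id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at timestamptz NOT NULL DEFAULT now()
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;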

&lt;h2&gt;
  
  
  What Logical Replication Does NOT Replicate
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sequences (sync periodically with &lt;code&gt;pg_dump --data-only -t '*_seq'&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;DDL changes (manual coordination required)&lt;/li&gt;
&lt;li&gt;Large objects&lt;/li&gt;
&lt;li&gt;Materialized view refreshes&lt;/li&gt;
&lt;li&gt;TRUNCATE (on PostgreSQL versions before 11)&lt;/li&gt;
&lt;/ul&gt;
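&lt;p&gt;Sequence sync in practice is a periodic &lt;code&gt;setval&lt;/code&gt; on the subscriber (sequence and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the subscriber after cutover, or on a schedule
SELECT setval('orders_order_id_seq',
              (SELECT COALESCE(max(order_id), 1) FROM orders));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;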

&lt;p&gt;Plan for these gaps in your architecture rather than discovering them in production.&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>MyDBA.dev vs Percona PMM: Deep PostgreSQL Intelligence vs Multi-Database Monitoring</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:00:01 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-percona-pmm-deep-postgresql-intelligence-vs-multi-database-monitoring-2g7j</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-percona-pmm-deep-postgresql-intelligence-vs-multi-database-monitoring-2g7j</guid>
      <description>&lt;h1&gt;
  
  
  MyDBA.dev vs Percona PMM: Deep PostgreSQL Intelligence vs Multi-Database Monitoring
&lt;/h1&gt;

&lt;p&gt;Percona Monitoring and Management (PMM) is one of the most established open-source database monitoring tools available. It is free, actively maintained, and covers MySQL, MongoDB, and PostgreSQL in a single platform. If you run a polyglot database environment, PMM is a strong choice. This article compares PMM with MyDBA.dev to help you decide which tool fits your PostgreSQL monitoring needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Percona PMM&lt;/th&gt;
&lt;th&gt;MyDBA.dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (open source, self-hosted)&lt;/td&gt;
&lt;td&gt;Free tier (hosted) / Pro from GBP 19/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL, MongoDB, PostgreSQL&lt;/td&gt;
&lt;td&gt;PostgreSQL only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted server + agents on every host&lt;/td&gt;
&lt;td&gt;SaaS with lightweight collector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;QAN with fingerprinting, EXPLAIN&lt;/td&gt;
&lt;td&gt;Fingerprinting, EXPLAIN + EXPLAIN ANALYZE, plan regression detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Health checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Percona Advisors (security, config, perf)&lt;/td&gt;
&lt;td&gt;75+ checks with scored domains and fix scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index advisor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes -- missing index detection with CREATE INDEX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XID wraparound&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic metric exposure&lt;/td&gt;
&lt;td&gt;Dedicated monitoring with blocker detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extension monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;TimescaleDB, pgvector, PostGIS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lock visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard lock metrics&lt;/td&gt;
&lt;td&gt;Interactive dependency graphs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema comparison&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Cross-instance schema diff&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where PMM Excels
&lt;/h2&gt;

&lt;p&gt;PMM deserves genuine credit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is fully free and open source.&lt;/strong&gt; Not freemium with paywalled features -- everything is available. As of 2026, all Percona Advisors (security, configuration, and performance checks) are included without any subscription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-database support is real.&lt;/strong&gt; If you run MySQL, MongoDB, and PostgreSQL together, PMM gives you one monitoring platform. The MySQL coverage in particular is excellent -- PMM was built on Percona's deep MySQL expertise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana dashboards are flexible.&lt;/strong&gt; PMM ships with dozens of pre-built dashboards and you can build your own. If your team already knows Grafana, you are immediately productive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advisors catch real issues.&lt;/strong&gt; The framework checks for default passwords, SSL misconfigurations, memory settings, checkpoint tuning, replication lag, and more. These run automatically and surface findings in the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backup integration.&lt;/strong&gt; PMM integrates with Percona's backup tools for MySQL and MongoDB -- useful if backup monitoring is part of your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where MyDBA.dev Goes Further
&lt;/h2&gt;

&lt;p&gt;The core difference is specialization. PMM is a multi-database monitoring platform adapted to support PostgreSQL. MyDBA.dev is built exclusively for PostgreSQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health Checks with Fix Scripts
&lt;/h3&gt;

&lt;p&gt;PMM's Advisors cover a solid set of checks. MyDBA.dev runs 75+ checks across 10 scored domains -- Vacuum, Indexes, Storage, Performance, Schema, Security, WAL &amp;amp; Backup, Connections, Replication, and Extensions. Every finding includes a fix script: not just "this index is missing" but the exact &lt;code&gt;CREATE INDEX&lt;/code&gt; statement. Domain scores (0-100) let you track improvement over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pmm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pmm%2Fhealth-check-overview.png" alt="Health check overview showing scored domains" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Advisor
&lt;/h3&gt;

&lt;p&gt;PMM has no index recommendation engine. MyDBA.dev analyzes query patterns, sequential scan frequency, and table access statistics to recommend missing indexes. Each recommendation includes the &lt;code&gt;CREATE INDEX&lt;/code&gt; statement, estimated impact, and the queries that would benefit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pmm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pmm%2Findex-advisor.png" alt="Index advisor with CREATE INDEX recommendations" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  XID Wraparound Protection
&lt;/h3&gt;

&lt;p&gt;XID wraparound is a PostgreSQL-specific risk that can force emergency autovacuum or database shutdown. PMM exposes basic XID age metrics. MyDBA.dev provides dedicated monitoring with age tracking, blocker detection (long-running transactions, prepared transactions, replication slots), trend analysis, and recovery scripts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pmm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pmm%2Fxid-wraparound.png" alt="XID wraparound monitoring with blocker detection" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Monitoring
&lt;/h3&gt;

&lt;p&gt;TimescaleDB, pgvector, PostGIS -- these extensions have their own performance characteristics and failure modes. MyDBA.dev monitors TimescaleDB (chunk health, compression ratios, continuous aggregate freshness), pgvector (index selection, distance functions, filtered search), and PostGIS (spatial index efficiency, SRID consistency, geometry quality). PMM has no extension awareness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pmm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pmm%2Fpostgis-monitoring.png" alt="PostGIS monitoring with spatial analysis" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No Infrastructure to Manage
&lt;/h3&gt;

&lt;p&gt;PMM requires deploying and maintaining PMM Server (Docker/VM) plus agents on every monitored host. You manage storage, upgrades, availability, and resources. MyDBA.dev is SaaS -- install a lightweight collector binary and the monitoring infrastructure is handled for you. No Grafana to upgrade, no Prometheus storage to manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lock Chain Visualization and Schema Diff
&lt;/h3&gt;

&lt;p&gt;MyDBA.dev renders interactive lock chain dependency graphs showing the full blocking tree -- which session holds the lock, which are waiting, what queries are involved. PMM provides standard lock metrics without the visual dependency mapping. MyDBA.dev also includes cross-instance schema comparison for staging vs production or pre/post migration diffs. PMM does not offer schema comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PMM:&lt;/strong&gt; Free (open source). You provide infrastructure for PMM Server and agents. The real cost is your team's time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MyDBA.dev:&lt;/strong&gt; Free tier (one connection, 7-day retention). Pro from GBP 19/month (30-day retention, all features). No infrastructure to manage.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Which
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose PMM when:&lt;/strong&gt; You run MySQL + MongoDB + PostgreSQL together, you are already in the Percona ecosystem, your team has ops capacity for self-hosting, you need backup integration, or budget is the primary constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose MyDBA.dev when:&lt;/strong&gt; Your environment is PostgreSQL-only or PostgreSQL-primary, you want health checks with fix scripts and index recommendations, you use PostgreSQL extensions, you want to avoid managing monitoring infrastructure, or you value remediation guidance alongside detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;PMM is a mature, genuinely free monitoring platform with excellent multi-database support. If you run MySQL alongside PostgreSQL, it is hard to beat.&lt;/p&gt;

&lt;p&gt;The trade-off is PostgreSQL depth. PMM was built on MySQL heritage and adapted to PostgreSQL. MyDBA.dev was built for PostgreSQL from the ground up. The difference shows in extension monitoring, index recommendations, XID wraparound tooling, lock chain visualization, and the 75+ health checks with fix scripts. If PostgreSQL is your primary database, that specialization matters.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/mydba-vs-pmm" rel="noopener noreferrer"&gt;mydba.dev/blog/mydba-vs-pmm&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>postgres</category>
      <category>tooling</category>
    </item>
    <item>
      <title>MyDBA.dev vs pgwatch: Metrics Collection vs Actionable Intelligence</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 06 Apr 2026 10:00:05 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-pgwatch-metrics-collection-vs-actionable-intelligence-5d8m</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-pgwatch-metrics-collection-vs-actionable-intelligence-5d8m</guid>
      <description>&lt;p&gt;If you are evaluating PostgreSQL monitoring tools, pgwatch is almost certainly on your list. It should be. It is one of the best open-source options available -- mature, lightweight, flexible, and purpose-built for PostgreSQL.&lt;/p&gt;

&lt;p&gt;We built myDBA.dev to solve a different problem, and the distinction matters more than you might expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What pgwatch Does Well
&lt;/h2&gt;

&lt;p&gt;pgwatch deserves genuine credit. Here is where it excels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any SQL query becomes a metric.&lt;/strong&gt; This is pgwatch's killer feature. Write a SQL query, add it to the configuration, and it shows up in Grafana. Business-specific KPIs, custom health indicators, application-level metrics stored in your database -- if you can SELECT it, pgwatch can track it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is free and open source.&lt;/strong&gt; BSD-3-Clause license, no vendor lock-in, full control. For organizations with strict open-source mandates or zero monitoring budget, this is a decisive advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It scales.&lt;/strong&gt; The Go-based collector is lightweight enough to monitor hundreds of PostgreSQL instances without becoming a bottleneck. Multiple storage backends -- PostgreSQL, TimescaleDB, Prometheus -- give you flexibility in how you store and query historical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-discovery works.&lt;/strong&gt; Point it at Patroni, PgBouncer, Pgpool2, or AWS RDS and it automatically discovers and monitors all members. Topology changes are handled without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana integration is mature.&lt;/strong&gt; 30+ predefined dashboards covering database overview, table statistics, index usage, replication, locks, WAL, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Between Data and Action
&lt;/h2&gt;

&lt;p&gt;Here is the pattern we kept seeing: teams deploy pgwatch, build beautiful Grafana dashboards, and collect comprehensive metrics. Then something goes wrong.&lt;/p&gt;

&lt;p&gt;The dashboard shows &lt;code&gt;n_dead_tup&lt;/code&gt; at 2.4 million on the orders table. The question is not whether you can see the number -- pgwatch shows it clearly. The question is what you do next.&lt;/p&gt;

&lt;p&gt;Is autovacuum disabled on that table? Is a long-running transaction blocking vacuum? Are the per-table vacuum settings misconfigured? What is the exact command to fix it?&lt;br&gt;
&lt;/p&gt;
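&lt;p&gt;Before reaching for a fix, a couple of catalog views usually narrow it down. A quick diagnostic sketch (the &lt;code&gt;orders&lt;/code&gt; table name is illustrative -- substitute your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Has autovacuum run recently on the bloated table?
SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Is an idle-in-transaction session holding back vacuum?
SELECT pid, state, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
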

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The fix pgwatch cannot generate for you:&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_enabled&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Re-enables autovacuum with default settings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gap between seeing a metric and knowing what to do about it is where myDBA.dev focuses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What myDBA.dev Adds
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Health Checks with Fix Scripts
&lt;/h3&gt;

&lt;p&gt;pgwatch has no health check system. It collects and displays metrics. Interpretation is left to the operator.&lt;/p&gt;

&lt;p&gt;myDBA.dev runs 75+ automated health checks across 12 domains -- Configuration, Performance, Vacuum, Replication, Indexes, Security, Storage, Connections, TimescaleDB, XID Wraparound, pgvector, and Wait Events. Each finding produces three things: what is wrong, why it matters, and a copy-pasteable SQL fix calculated from your actual server configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Advisor
&lt;/h3&gt;

&lt;p&gt;pgwatch collects index usage statistics. myDBA.dev goes further: it analyzes your query workload, identifies sequential scans on large tables, detects duplicate and overlapping indexes, and generates &lt;code&gt;CREATE INDEX&lt;/code&gt; or &lt;code&gt;DROP INDEX CONCURRENTLY&lt;/code&gt; statements with estimated impact.&lt;/p&gt;
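&lt;p&gt;You can approximate the first pass of that analysis by hand. A sketch for surfacing never-scanned indexes -- treat the results as candidates only, since an index may be serving replicas or enforcing a constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Indexes with zero scans since statistics were last reset
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
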

&lt;h3&gt;
  
  
  Automatic EXPLAIN Plans
&lt;/h3&gt;

&lt;p&gt;pgwatch does not collect query execution plans. When a query regresses from 50ms to 12 seconds, you see the timing change in your dashboard but have no plan to explain why.&lt;/p&gt;

&lt;p&gt;myDBA.dev automatically collects EXPLAIN plans for your top queries each collection cycle. When performance degrades, you compare old and new plans to see whether the planner switched scan types, whether row estimates drifted, or whether a new join strategy is suboptimal.&lt;/p&gt;
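&lt;p&gt;What myDBA.dev automates, you can do once by hand for a single suspect statement -- the query below is hypothetical, standing in for one of your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One-off plan capture for comparison against a known-good baseline
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
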

&lt;h3&gt;
  
  
  Extension Monitoring
&lt;/h3&gt;

&lt;p&gt;pgwatch can store data in TimescaleDB but does not monitor it. It does not track chunk health, compression ratios, continuous aggregate staleness, or job failures. It has no pgvector monitoring for index build progress or recall accuracy. No PostGIS monitoring for spatial index quality or SRID mismatches.&lt;/p&gt;

&lt;p&gt;myDBA.dev has dedicated monitoring and health checks for TimescaleDB, pgvector, and PostGIS.&lt;/p&gt;

&lt;h3&gt;
  
  
  XID Wraparound Protection
&lt;/h3&gt;

&lt;p&gt;pgwatch shows XID age as a metric. myDBA.dev detects wraparound risk, identifies what is blocking autovacuum from making progress (long-running transactions, prepared transactions, replication slots), and generates recovery scripts.&lt;/p&gt;
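&lt;p&gt;Those three blocker categories can be spot-checked in a single pass. A sketch -- counts only, not a substitute for trend monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Count the usual vacuum blockers
SELECT 'idle-in-transaction session' AS blocker, count(*) AS found
FROM pg_stat_activity WHERE state = 'idle in transaction'
UNION ALL
SELECT 'prepared transaction', count(*) FROM pg_prepared_xacts
UNION ALL
SELECT 'inactive replication slot', count(*)
FROM pg_replication_slots WHERE NOT active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
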

&lt;h2&gt;
  
  
  Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;pgwatch&lt;/th&gt;
&lt;th&gt;myDBA.dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes (BSD-3-Clause)&lt;/td&gt;
&lt;td&gt;No (SaaS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom SQL metrics&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks with fix scripts&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;75+ checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index advisor&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXPLAIN plan collection&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extension monitoring&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;TimescaleDB, pgvector, PostGIS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosting required&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;Free (self-hosted)&lt;/td&gt;
&lt;td&gt;Free tier, Pro from GBP 19/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to Choose pgwatch
&lt;/h2&gt;

&lt;p&gt;Choose pgwatch if your team has deep PostgreSQL expertise and prefers raw data over guided recommendations. If you already run Grafana and Prometheus, pgwatch integrates natively. If you need custom business-domain metrics, pgwatch's SQL-based approach is unmatched. If budget is zero and you have the ops capacity to self-host, pgwatch is the clear winner on cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose myDBA.dev
&lt;/h2&gt;

&lt;p&gt;Choose myDBA.dev if you want the interpretation done for you. If your team does not have a senior DBA who can turn raw metrics into fix scripts, the health check system fills that gap. If you need extension monitoring, EXPLAIN plan collection, or index recommendations, those are capabilities pgwatch does not offer. If you do not want to manage monitoring infrastructure, SaaS means one less thing to operate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;pgwatch answers "what is happening?" myDBA.dev answers "what is happening, what does it mean, and what should I do about it?"&lt;/p&gt;

&lt;p&gt;Both are good tools solving different problems. The right choice depends on whether your team needs data or direction.&lt;/p&gt;

&lt;p&gt;Full comparison with screenshots: &lt;a href="https://mydba.dev/blog/mydba-vs-pgwatch" rel="noopener noreferrer"&gt;MyDBA.dev vs pgwatch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>opensource</category>
      <category>postgres</category>
    </item>
    <item>
      <title>MyDBA.dev vs Datadog Database Monitoring: PostgreSQL-Native vs Full-Stack Observability</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sun, 05 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-datadog-database-monitoring-postgresql-native-vs-full-stack-observability-3742</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-datadog-database-monitoring-postgresql-native-vs-full-stack-observability-3742</guid>
      <description>&lt;h1&gt;
  
  
  Datadog Database Monitoring vs MyDBA.dev: What PostgreSQL Teams Actually Need
&lt;/h1&gt;

&lt;p&gt;If you run PostgreSQL in production, you have probably evaluated Datadog Database Monitoring at some point. It is the default choice for many teams because it integrates with everything else in the Datadog ecosystem. But "integrates with everything" and "monitors PostgreSQL deeply" are not the same thing. Let me walk through where each tool excels and where each falls short, from the perspective of someone who spends their days thinking about PostgreSQL performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Trade-Off: Breadth vs Depth
&lt;/h2&gt;

&lt;p&gt;Datadog is a full-stack observability platform that happens to include database monitoring. MyDBA.dev is a PostgreSQL monitoring tool that does nothing else. This distinction shapes every difference between them.&lt;/p&gt;

&lt;p&gt;Datadog's breadth means you get APM traces correlated with database queries, infrastructure metrics alongside query latency, and a single pane of glass across MySQL, PostgreSQL, MongoDB, and Redis. If your debugging workflow starts at the application layer and drills down to the database, Datadog's trace-to-query correlation is genuinely impressive.&lt;/p&gt;

&lt;p&gt;MyDBA.dev's depth means you get 75+ PostgreSQL-specific health checks with SQL remediation scripts, a cluster-aware index advisor, XID wraparound monitoring, and dedicated dashboards for extensions like TimescaleDB, pgvector, and PostGIS. If your debugging workflow starts at the database layer, these capabilities matter more than APM integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Things Datadog Does Not Cover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Extension Monitoring
&lt;/h3&gt;

&lt;p&gt;This is the biggest gap. TimescaleDB, pgvector, and PostGIS are increasingly common in PostgreSQL deployments, and Datadog treats all three as invisible. No chunk size tracking for TimescaleDB, no index recall metrics for pgvector, no spatial index efficiency checks for PostGIS.&lt;/p&gt;

&lt;p&gt;MyDBA.dev monitors all three with dedicated dashboards. For pgvector alone there are 13 health checks covering index type selection, recall estimation, quantization settings, and storage efficiency. For TimescaleDB, you get hypertable chunk monitoring, compression ratio tracking, continuous aggregate freshness, and background job health.&lt;/p&gt;

&lt;p&gt;If you use any of these extensions, this is likely the deciding factor.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. XID Wraparound Protection
&lt;/h3&gt;

&lt;p&gt;Transaction ID wraparound is PostgreSQL's most dangerous failure mode. As a database approaches the 2-billion XID limit, PostgreSQL launches aggressive anti-wraparound vacuums, and if those cannot complete in time it stops accepting new write transactions -- a production outage that can take hours to resolve.&lt;/p&gt;

&lt;p&gt;MyDBA.dev tracks XID age trends over time, alerts on rising age, and identifies which tables are contributing to the problem. Datadog does not monitor XID age at all. For PostgreSQL-specific failure modes, a generalist tool leaves blind spots.&lt;/p&gt;
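&lt;p&gt;The raw number itself is one query away; tracking its trend and surfacing the blockers is the part that needs tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Current XID age per database; alert long before the 2-billion limit
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
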

&lt;h3&gt;
  
  
  3. Health Scoring with Fix Scripts
&lt;/h3&gt;

&lt;p&gt;MyDBA.dev runs automated health checks across 10 domains -- Connections, Indexes, Performance, Replication, Schema, Security, Storage, Vacuum, WAL &amp;amp; Backup, and Extensions -- and assigns a score from A to F. Each finding includes a severity level, an explanation, and a ready-to-run SQL fix script.&lt;/p&gt;

&lt;p&gt;Datadog surfaces some recommendations, but without a structured scoring system or actionable remediation scripts. The difference: "you have a problem" versus "you have a problem, here is why, and here is the SQL to fix it."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cluster-Aware Index Advisor
&lt;/h3&gt;

&lt;p&gt;MyDBA.dev analyzes query workloads and recommends specific indexes with estimated cost savings, write-performance impact, and CREATE INDEX statements. It accounts for replication topology and existing index overlap.&lt;/p&gt;

&lt;p&gt;Datadog provides basic missing index detection -- tables with high sequential scan counts that probably need an index. It does not perform workload-based analysis or provide cost/benefit estimates.&lt;/p&gt;
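&lt;p&gt;That sequential-scan heuristic is roughly this query -- a sketch, and high &lt;code&gt;seq_scan&lt;/code&gt; counts on small tables are usually fine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tables read mostly by sequential scan: possible missing-index candidates
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
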

&lt;h3&gt;
  
  
  5. Plan Regression Detection
&lt;/h3&gt;

&lt;p&gt;MyDBA.dev automatically collects EXPLAIN plans for your top queries and detects when a query's execution plan changes for the worse. If the planner switches from an index scan to a sequential scan after an ANALYZE or a statistics change, you get alerted before users notice the slowdown.&lt;/p&gt;

&lt;p&gt;Datadog lets you manually run EXPLAIN ANALYZE through the UI, which is useful but reactive rather than proactive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Datadog Is the Right Choice
&lt;/h2&gt;

&lt;p&gt;Full credit where it is due -- Datadog wins in several important scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-service architectures&lt;/strong&gt; where you need to trace latency from the frontend through microservices to the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polyglot database environments&lt;/strong&gt; with MySQL, MongoDB, Redis, and PostgreSQL all in the same stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams already invested in Datadog&lt;/strong&gt; where adding database monitoring is incremental, not a new tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom dashboarding needs&lt;/strong&gt; where combining database metrics with application and infrastructure metrics in flexible layouts matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your primary debugging workflow is "slow API response -&amp;gt; which service -&amp;gt; which query -&amp;gt; why is it slow," Datadog's end-to-end tracing is hard to replicate with any database-specific tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Reality
&lt;/h2&gt;

&lt;p&gt;Datadog Database Monitoring costs $70/host/month. It requires the Infrastructure tier ($15/host/month) as a prerequisite. So the effective cost is $85/host/month minimum.&lt;/p&gt;

&lt;p&gt;MyDBA.dev pricing is per-organization, not per-host:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Servers&lt;/th&gt;
&lt;th&gt;Datadog Annual Cost&lt;/th&gt;
&lt;th&gt;MyDBA.dev Pro Annual Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$1,020&lt;/td&gt;
&lt;td&gt;£228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;$5,100&lt;/td&gt;
&lt;td&gt;£228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;$10,200&lt;/td&gt;
&lt;td&gt;£228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;$25,500&lt;/td&gt;
&lt;td&gt;£228&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At scale, the difference is not marginal -- it is an order of magnitude. MyDBA.dev also has a free tier with 7-day retention and full health-check access. Datadog does not offer a free database monitoring tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the Decision
&lt;/h2&gt;

&lt;p&gt;Choose Datadog if PostgreSQL is one of many things you need to monitor and full-stack correlation is critical. Choose MyDBA.dev if PostgreSQL is the thing you need to monitor and depth of PostgreSQL-specific intelligence matters more than breadth.&lt;/p&gt;

&lt;p&gt;They are not mutually exclusive. Some teams run Datadog for application-level observability and MyDBA.dev for PostgreSQL-specific depth. MyDBA.dev's lightweight Go collector adds minimal overhead and does not conflict with the Datadog Agent.&lt;/p&gt;

&lt;p&gt;The deciding question: is your bottleneck "I need to connect database performance to application behavior" (Datadog) or "I need to understand what is happening inside PostgreSQL" (MyDBA.dev)?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/mydba-vs-datadog" rel="noopener noreferrer"&gt;mydba.dev/blog/mydba-vs-datadog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>MyDBA.dev vs pganalyze: Which PostgreSQL Monitor Should You Choose?</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sat, 04 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-pganalyze-which-postgresql-monitor-should-you-choose-3e81</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/mydbadev-vs-pganalyze-which-postgresql-monitor-should-you-choose-3e81</guid>
      <description>&lt;h1&gt;
  
  
  pganalyze vs MyDBA.dev -- A Practical PostgreSQL Monitoring Comparison
&lt;/h1&gt;

&lt;p&gt;I've been running PostgreSQL in production for years, and if there's one thing I've learned about monitoring tools, it's this: the best time to evaluate them is before you need them. Not during a 3am incident when you're staring at a chart that says "something is wrong" but gives you no idea how to fix it.&lt;/p&gt;

&lt;p&gt;Both pganalyze and MyDBA.dev are PostgreSQL-focused monitoring tools -- not generic infrastructure platforms that treat Postgres as an afterthought. But they have meaningfully different philosophies about what monitoring should do. Here's a practical comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  pganalyze: The Established Player
&lt;/h2&gt;

&lt;p&gt;pganalyze has been around since 2013 and has built genuine depth in several areas.&lt;/p&gt;

&lt;p&gt;Their &lt;strong&gt;index advisor&lt;/strong&gt; uses hypothetical index simulation ("What If?" analysis) to recommend new indexes and predict their performance impact before you create them. You can see estimated query cost reduction for candidate indexes, which is valuable when you're weighing the write overhead of a new index against query speed improvements.&lt;/p&gt;

&lt;p&gt;Their &lt;strong&gt;VACUUM advisor&lt;/strong&gt; provides per-table autovacuum tuning recommendations, including freeze age analysis. If you're struggling with autovacuum configuration, pganalyze will tell you exactly which tables need what settings.&lt;/p&gt;
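&lt;p&gt;The output of that kind of advisor is per-table storage parameters. For illustration only -- the table name and values below are not a recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Example per-table autovacuum override (illustrative values)
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_cost_limit = 2000
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
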

&lt;p&gt;&lt;strong&gt;Log insights&lt;/strong&gt; parse your PostgreSQL logs to surface connection errors, lock timeouts, checkpoint warnings, and other events that pg_stat_statements alone misses. This is well-executed and fills a real gap.&lt;/p&gt;

&lt;p&gt;The tool is mature, stable, and well-documented. It does what it does reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  MyDBA.dev: The Opinionated Newcomer
&lt;/h2&gt;

&lt;p&gt;MyDBA.dev (launched 2025) takes a different stance: monitoring should tell you what's wrong AND hand you the fix. Not just a red chart -- a diagnosis, an explanation, and a SQL script you can copy-paste.&lt;/p&gt;

&lt;h3&gt;
  
  
  75+ Health Checks with Fix Scripts
&lt;/h3&gt;

&lt;p&gt;This is the core differentiator. MyDBA.dev runs 75+ automated health checks across 12 domains (indexes, vacuum, security, WAL, replication, storage, connections, configuration, queries, extensions, locks, XID). Each failing check includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A severity score&lt;/li&gt;
&lt;li&gt;A plain-English explanation&lt;/li&gt;
&lt;li&gt;A ready-to-run SQL fix script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pganalyze" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pganalyze%2Fhealth-check-overview.png" alt="MyDBA.dev health check dashboard" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;pganalyze surfaces around 20-30 check-style findings, but doesn't generate fix scripts. At 3am, that difference matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Monitoring (The Biggest Gap)
&lt;/h3&gt;

&lt;p&gt;If you run TimescaleDB, pgvector, or PostGIS, this is the deciding factor. MyDBA.dev provides dedicated monitoring for all three:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB&lt;/strong&gt;: chunk health, compression ratios, continuous aggregate staleness, job monitoring, 15+ extension-specific health checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector&lt;/strong&gt;: index type analysis (IVFFlat vs HNSW), recall estimation, storage analysis, 13 health checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostGIS&lt;/strong&gt;: spatial index coverage, geometry quality, SRID consistency, 19 health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;pganalyze has no extension monitoring whatsoever. For teams whose workload is dominated by extension behavior (time-series ingestion, vector similarity search, spatial queries), this is a significant blind spot.&lt;/p&gt;

&lt;h3&gt;
  
  
  XID Wraparound Protection
&lt;/h3&gt;

&lt;p&gt;Transaction ID wraparound can force your database into read-only emergency mode. MyDBA.dev provides a dedicated dashboard showing current XID age, tables approaching danger, and -- critically -- &lt;strong&gt;blocker detection&lt;/strong&gt; with recovery scripts. It identifies the specific long-running transaction, abandoned replication slot, or prepared transaction preventing XID advancement, and gives you the command to fix it.&lt;/p&gt;

&lt;p&gt;pganalyze shows basic xmin horizon data. When you're racing against an emergency wraparound vacuum, the difference between "your XID age is high" and "this replication slot is the blocker, here's the DROP command" is the difference between a quick fix and an hour of investigation.&lt;/p&gt;
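&lt;p&gt;The slot check itself is simple; knowing which slot is safe to drop is the hard part. A sketch -- only drop a slot after confirming no consumer still needs it (the slot name below is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Which slots are pinning the XID horizon?
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- Once a slot is confirmed abandoned:
SELECT pg_drop_replication_slot('stale_slot');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
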

&lt;h3&gt;
  
  
  Cluster-Aware Index Advisor
&lt;/h3&gt;

&lt;p&gt;MyDBA.dev aggregates index usage across the entire replication topology. An index that looks unused on the primary might be serving all of your read-replica analytics queries. Dropping it based on primary-only statistics would break those queries. pganalyze evaluates each server independently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mydba.dev/blog/mydba-vs-pganalyze" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fyubpcnvytyatwmbfjsxb.supabase.co%2Fstorage%2Fv1%2Fobject%2Fpublic%2Fblog-images%2Fmydba-vs-pganalyze%2Findex-advisor.png" alt="Cluster-aware index advisor" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;pganalyze&lt;/th&gt;
&lt;th&gt;MyDBA.dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;1 server + 1 replica (all features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entry&lt;/td&gt;
&lt;td&gt;$149/mo (1 server)&lt;/td&gt;
&lt;td&gt;£19/mo (1 server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 servers&lt;/td&gt;
&lt;td&gt;$349/mo&lt;/td&gt;
&lt;td&gt;£76/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;pganalyze has no free tier. MyDBA.dev's free tier includes every feature -- the paid plan adds more servers and longer data retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Should You Choose?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose pganalyze&lt;/strong&gt; if you need mature VACUUM advisory with per-table recommendations, hypothetical index analysis, or if you're already invested in it and it covers your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose MyDBA.dev&lt;/strong&gt; if you need health checks with remediation scripts, extension monitoring (TimescaleDB/pgvector/PostGIS), XID protection with blocker detection, cluster-aware index analysis, or a free tier to evaluate with production data.&lt;/p&gt;

&lt;p&gt;Both are solid, PostgreSQL-focused tools. The right answer depends on your workload. If you run extensions or need remediation guidance, MyDBA.dev. If you need deep VACUUM and hypothetical index analysis, pganalyze. If budget matters, the pricing difference alone may be decisive.&lt;/p&gt;

&lt;p&gt;Full comparison with more detail: &lt;a href="https://mydba.dev/blog/mydba-vs-pganalyze" rel="noopener noreferrer"&gt;mydba.dev/blog/mydba-vs-pganalyze&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/mydba-vs-pganalyze" rel="noopener noreferrer"&gt;mydba.dev/blog/mydba-vs-pganalyze&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>XID Wraparound Recovery: The Runbook Your Database Needs Before It's Too Late</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/xid-wraparound-recovery-the-runbook-your-database-needs-before-its-too-late-2lc0</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/xid-wraparound-recovery-the-runbook-your-database-needs-before-its-too-late-2lc0</guid>
      <description>&lt;p&gt;If you have been running PostgreSQL in production for long enough, you have probably seen a blog post or conference talk about XID wraparound. You nodded along, thought "I should set up monitoring for that," and then went back to shipping features. This article is the runbook you will wish you had written before the 3 AM page.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Transaction ID wraparound is PostgreSQL's nuclear option. Every write transaction consumes a 32-bit XID. When the counter approaches the 2-billion limit, PostgreSQL refuses to assign new XIDs to prevent data loss. No writes, effectively no working application -- until you run a vacuum that can take hours on large tables.&lt;/p&gt;

&lt;p&gt;The insidious part: it sneaks up. XID age grows slowly, day by day, invisible unless you are actively watching. A database consuming 10 million XIDs per day has 200 days before hitting the 2-billion limit. That feels like plenty of time until you realize that autovacuum has been silently blocked for the last 3 weeks.&lt;/p&gt;

&lt;p&gt;Three things block vacuum from advancing the XID horizon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-running transactions&lt;/strong&gt; -- a single &lt;code&gt;idle in transaction&lt;/code&gt; session prevents vacuum from removing or freezing any rows that transaction's snapshot might still see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abandoned prepared transactions&lt;/strong&gt; -- &lt;code&gt;PREPARE TRANSACTION&lt;/code&gt; creates a durable transaction that survives server restarts. If nobody commits or rolls it back, it holds the XID horizon indefinitely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lagging replication slots&lt;/strong&gt; -- replication slots retain WAL and, via their &lt;code&gt;catalog_xmin&lt;/code&gt;, pin the oldest XID the server must keep. A slot that falls behind holds the XID horizon for the entire cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these blockers has a different detection method and a different fix. The catch is that you need to check all three, in sequence, and understand how they interact.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;Start with the database-level XID age, then drill into per-table ages and potential blockers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Database-level XID age&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xid_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pct_wraparound&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'template0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'template1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Top 10 tables holding back the XID horizon&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xid_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relkind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'r'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now check all three blocker categories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Long-running transactions holding back vacuum&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xmin_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Orphaned prepared transactions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xid_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_prepared_xacts&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Replication slots preventing XID advancement&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slot_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;slot_xmin_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;catalog_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;slot_catalog_xmin_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;catalog_xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;greatest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;catalog_xmin&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with manual detection: you need to run each of these queries, correlate the results, and determine which specific blocker is the root cause. If a prepared transaction has XID age 1.4 billion and the worst table has XID age 1.4 billion, the prepared transaction is your blocker -- but that correlation is not obvious when you are running queries one at a time under pressure.&lt;/p&gt;
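
&lt;p&gt;One way to shortcut that correlation (a sketch, assuming nothing beyond the catalog views already shown) is a single &lt;code&gt;UNION&lt;/code&gt; that ranks every blocker source by XID age:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One-shot blocker ranking: the top row is the horizon holder
SELECT 'backend xmin' AS source, pid::text AS detail, age(backend_xmin) AS xid_age
FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'prepared xact', gid, age(transaction) FROM pg_prepared_xacts
UNION ALL
SELECT 'replication slot', slot_name, greatest(age(xmin), age(catalog_xmin))
FROM pg_replication_slots WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
ORDER BY xid_age DESC NULLS LAST LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;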

&lt;h2&gt;
  
  
  Monitoring for XID Wraparound
&lt;/h2&gt;

&lt;p&gt;The key to effective XID monitoring is not just tracking the age number -- it is automatically correlating blockers with the current horizon. A good monitoring setup should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track per-database and per-table XID ages as time series, not snapshots&lt;/li&gt;
&lt;li&gt;Identify which specific blocker (transaction, prepared transaction, or replication slot) is holding the horizon&lt;/li&gt;
&lt;li&gt;Generate actionable recovery steps in the correct order&lt;/li&gt;
&lt;li&gt;Alert at meaningful thresholds: 500 million (25%) for early warning, 1 billion (50%) for urgent action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without automated correlation, you are left running queries one at a time during an emergency and hoping you check the right thing first.&lt;/p&gt;
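
&lt;p&gt;As a sketch of what automated thresholding looks like, a single query can classify every database against the 25% and 50% marks of the ~2.1 billion XID limit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Classify databases against the alert thresholds
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 1) AS pct_of_limit,
       CASE WHEN age(datfrozenxid) &amp;gt;= 1000000000 THEN 'URGENT'
            WHEN age(datfrozenxid) &amp;gt;= 500000000  THEN 'WARNING'
            ELSE 'OK' END AS status
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;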

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;Follow this sequence. Order matters -- clearing blockers before vacuuming prevents wasted work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Remove prepared transaction blockers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- List and roll back orphaned prepared transactions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_prepared_xacts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ROLLBACK&lt;/span&gt; &lt;span class="n"&gt;PREPARED&lt;/span&gt; &lt;span class="s1"&gt;'txn_2024_q4'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Terminate long-running transaction blockers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Kill sessions holding the XID horizon&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_terminate_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Drop inactive replication slots&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check if the slot is genuinely needed before dropping&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_drop_replication_slot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stale_subscriber_slot'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Run targeted VACUUM FREEZE on the worst tables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Freeze the tables with the highest XID age first&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- Remove throttling for emergency&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;FREEZE&lt;/span&gt; &lt;span class="n"&gt;large_events_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;FREEZE&lt;/span&gt; &lt;span class="n"&gt;user_sessions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;FREEZE&lt;/span&gt; &lt;span class="n"&gt;audit_log&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="n"&gt;vacuum_cost_delay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify XID age decreased&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xid_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vacuuming without clearing blockers first is the most common mistake. The vacuum will complete, consuming hours of I/O, but the database XID age will not decrease because the horizon is still pinned by the blocker.&lt;/p&gt;
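
&lt;p&gt;A quick pre-flight check (a sketch) makes this mistake hard to commit: compare the oldest blocker age against the database XID age, and if the two are close, clear the blocker before starting the vacuum.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- If oldest_blocker_age is near db_xid_age, vacuum alone will not help
SELECT greatest(
    (SELECT coalesce(max(age(backend_xmin)), 0) FROM pg_stat_activity),
    (SELECT coalesce(max(age(transaction)), 0) FROM pg_prepared_xacts),
    (SELECT coalesce(max(age(xmin)), 0) FROM pg_replication_slots)
) AS oldest_blocker_age,
(SELECT age(datfrozenxid) FROM pg_database
 WHERE datname = current_database()) AS db_xid_age;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;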

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Set alerts on XID age at meaningful thresholds. Alert at 500 million (25% of capacity) for early warning, and at 1 billion (50%) for urgent action. The default &lt;code&gt;autovacuum_freeze_max_age&lt;/code&gt; of 200 million triggers aggressive vacuum early, but only if autovacuum is not blocked.&lt;/p&gt;

&lt;p&gt;Prevent blocker accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Auto-terminate forgotten sessions&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;idle_in_transaction_session_timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'10min'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Monitor prepared transactions (should normally be empty)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_prepared_xacts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Check replication slot health daily&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key to preventing wraparound is not running &lt;code&gt;VACUUM FREEZE&lt;/code&gt; harder -- it is ensuring nothing blocks autovacuum's freeze cycle from completing on schedule.&lt;/p&gt;
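
&lt;p&gt;One concrete check for a stalled freeze cycle (a sketch; the thresholds are illustrative): flag high-churn tables that autovacuum has not visited recently.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- High-churn tables autovacuum has skipped for over a day
SELECT schemaname, relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup &amp;gt; 10000
  AND (last_autovacuum IS NULL OR last_autovacuum &amp;lt; now() - interval '1 day')
ORDER BY n_dead_tup DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;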

</description>
      <category>database</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>postgres</category>
    </item>
    <item>
      <title>WAL and Vacuum Monitoring: The Two Metrics That Predict Every Outage</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Thu, 02 Apr 2026 10:00:05 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/wal-and-vacuum-monitoring-the-two-metrics-that-predict-every-outage-27o0</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/wal-and-vacuum-monitoring-the-two-metrics-that-predict-every-outage-27o0</guid>
      <description>&lt;p&gt;I used to think PostgreSQL outages were unpredictable. Then I started tracking two metrics consistently and realized that every single major outage I had seen was telegraphed days or weeks in advance by one of them. This article covers those two metrics, the SQL to track them, and what to do when they start trending in the wrong direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The two most common PostgreSQL failure modes share a pattern: they are slow-moving, fully preventable, and invisible until the moment they are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 1: Disk exhaustion.&lt;/strong&gt; WAL segments accumulate because archiving falls behind or replication slots retain old WAL. Temporary files pile up from large sorts. The &lt;code&gt;pg_wal&lt;/code&gt; directory grows from 1 GB to 50 GB over a weekend. Monday morning, the disk is full, the database cannot write WAL, and all transactions hang.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode 2: XID wraparound.&lt;/strong&gt; Dead tuples accumulate because autovacuum is blocked, throttled, or misconfigured. Transaction ID age creeps from 100 million to 500 million to 1.5 billion. Eventually PostgreSQL refuses new transactions to prevent data corruption. The database becomes effectively read-only.&lt;/p&gt;

&lt;p&gt;Both failures are predicted by the same two signals: WAL generation rate and vacuum health. If WAL generation is outpacing archive or cleanup, disk will eventually fill. If dead tuples are accumulating faster than vacuum removes them, XID wraparound is approaching. Both problems develop over days or weeks. Both produce no visible symptoms until the final failure -- no slow queries, no error messages, no degraded throughput. Just a sudden, total outage.&lt;/p&gt;

&lt;p&gt;The challenge is not that these signals are hard to collect. It is that they require continuous monitoring with trend analysis. A snapshot query showing "WAL rate is 500 MB/hour" is meaningless without knowing whether that is normal for this workload or 5x higher than yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;WAL generation rate&lt;/strong&gt; -- compare LSN positions over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Current WAL position (run twice with interval to calculate rate)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'0/0'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_wal_mb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Checkpoint frequency and timing&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;checkpoints_timed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkpoints_req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;buffers_checkpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffers_backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;checkpoint_write_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkpoint_sync_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_bgwriter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;checkpoints_req&lt;/code&gt; (forced checkpoints) is growing faster than &lt;code&gt;checkpoints_timed&lt;/code&gt; (scheduled checkpoints), WAL is being generated fast enough to hit &lt;code&gt;max_wal_size&lt;/code&gt; before &lt;code&gt;checkpoint_timeout&lt;/code&gt; elapses. This means larger I/O spikes during checkpoints and more WAL retained on disk.&lt;/p&gt;
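
&lt;p&gt;Note that on PostgreSQL 17 and later these counters moved out of &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; into a dedicated view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL 17+: checkpoint statistics live in pg_stat_checkpointer
SELECT num_timed, num_requested, write_time, sync_time
FROM pg_stat_checkpointer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;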

&lt;p&gt;&lt;strong&gt;Dead tuple accumulation&lt;/strong&gt; -- the leading indicator for vacuum health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Tables with the most dead tuples&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_autoanalyze&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
           &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
       &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Active autovacuum workers (saturation check)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_workers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'autovacuum_max_workers'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_workers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'autovacuum:%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If active workers consistently equals max workers, tables are queuing for vacuum. Dead tuples accumulate during the wait, and each vacuum run takes longer because there is more work to do -- a feedback loop that gets worse over time.&lt;/p&gt;

&lt;p&gt;The fundamental limitation of these queries is that they show the current state, not the trend. You need to run them repeatedly, store the results, and compute deltas. That is monitoring infrastructure work that most teams skip until after the first outage.&lt;/p&gt;
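
&lt;p&gt;A minimal version of that infrastructure fits in a few statements (a sketch -- the &lt;code&gt;wal_samples&lt;/code&gt; table is invented for illustration, and &lt;code&gt;min&lt;/code&gt;/&lt;code&gt;max&lt;/code&gt; over &lt;code&gt;pg_lsn&lt;/code&gt; requires PostgreSQL 14+):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sample the WAL position on a schedule (e.g. every minute from cron)
CREATE TABLE IF NOT EXISTS wal_samples (
    sampled_at timestamptz NOT NULL DEFAULT now(),
    lsn        pg_lsn      NOT NULL DEFAULT pg_current_wal_lsn()
);
INSERT INTO wal_samples DEFAULT VALUES;

-- WAL generation rate in MB/hour over the last hour of samples
SELECT pg_wal_lsn_diff(max(lsn), min(lsn)) / 1024.0 / 1024
       / nullif(extract(epoch FROM max(sampled_at) - min(sampled_at)) / 3600, 0)
       AS mb_per_hour
FROM wal_samples
WHERE sampled_at &amp;gt; now() - interval '1 hour';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;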

&lt;h2&gt;
  
  
  Why Trends Matter More Than Snapshots
&lt;/h2&gt;

&lt;p&gt;The real value in WAL and vacuum monitoring is not the current number -- it is the direction and rate of change. Here is what to watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WAL generation rate doubling&lt;/strong&gt; over a week signals a workload change or a configuration drift that will eventually exhaust disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead tuple ratio climbing&lt;/strong&gt; on a specific table means vacuum is falling behind on that table, even if overall vacuum health looks fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacuum duration increasing&lt;/strong&gt; over time means bloat is accumulating between runs and you need to trigger vacuum more frequently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All autovacuum workers busy&lt;/strong&gt; for more than 10 consecutive minutes means tables are queuing and the backlog is growing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A proper monitoring setup collects these metrics at regular intervals and presents them as time series, so you can distinguish a one-time spike (a batch import) from a sustained trend (an application change generating more writes).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;WAL accumulation&lt;/strong&gt; -- tune checkpoint and WAL configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Increase max_wal_size to reduce forced checkpoints&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_wal_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'4GB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Spread checkpoint I/O over the full interval&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;checkpoint_completion_target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Compress WAL to reduce disk usage and archive bandwidth&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;wal_compression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'lz4'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If WAL accumulation is caused by a lagging replication slot, either fix the subscriber or drop the slot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check slot lag&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vacuum backlog&lt;/strong&gt; -- per-table autovacuum tuning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- More aggressive vacuum on high-churn tables&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Add more workers if all are consistently busy&lt;/span&gt;
&lt;span class="c1"&gt;-- In postgresql.conf (requires restart):&lt;/span&gt;
&lt;span class="c1"&gt;-- autovacuum_max_workers = 5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For immediate dead tuple cleanup, run a manual vacuum on the worst tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;VERBOSE&lt;/code&gt; flag outputs per-page statistics including pages scanned, tuples removed, and pages truncated -- confirming that vacuum is actually making progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Monitor both metrics continuously with alerts on trend, not just threshold. A WAL generation rate that doubles in a week is an alert-worthy event even if the absolute value is still within disk capacity. A dead tuple count that is climbing on a specific table is a signal even if XID age is still low.&lt;/p&gt;

&lt;p&gt;Set concrete alert thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WAL&lt;/strong&gt;: alert when generation rate exceeds 2x the 7-day average, or when &lt;code&gt;pg_wal&lt;/code&gt; directory exceeds 50% of available disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacuum&lt;/strong&gt;: alert when any table's dead tuple ratio exceeds 20%, or when all autovacuum workers are busy for more than 10 consecutive minutes&lt;/li&gt;
&lt;/ul&gt;
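
&lt;p&gt;The &lt;code&gt;pg_wal&lt;/code&gt; half of that alert can be fed straight from SQL (a sketch; &lt;code&gt;pg_ls_waldir()&lt;/code&gt; requires superuser or the &lt;code&gt;pg_monitor&lt;/code&gt; role):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Current pg_wal footprint; alert when it passes 50% of the volume
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;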

&lt;p&gt;The goal is not to react to outages -- it is to see them forming days in advance and intervene when the fix is a configuration change rather than an emergency recovery.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Wait Event Heatmaps: See What PostgreSQL Is Actually Waiting On</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 01 Apr 2026 10:00:05 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/wait-event-heatmaps-see-what-postgresql-is-actually-waiting-on-2339</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/wait-event-heatmaps-see-what-postgresql-is-actually-waiting-on-2339</guid>
      <description>&lt;h1&gt;
  
  
  Wait Event Heatmaps: See What PostgreSQL Is Actually Waiting On
&lt;/h1&gt;

&lt;p&gt;If you have ever stared at a Grafana dashboard full of green CPU and memory panels while your users complain about slow queries, you have experienced the fundamental blind spot of resource utilization monitoring. PostgreSQL has a built-in profiler that most teams completely ignore — wait events.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your database feels slow but the usual metrics look fine. CPU is at 30%. Memory is stable. Disk I/O graphs show nothing dramatic. The problem is invisible because you are looking at resource utilization metrics, which tell you how much of each resource is consumed but not what PostgreSQL is spending its time doing.&lt;/p&gt;

&lt;p&gt;PostgreSQL tracks exactly what every backend is waiting on at any given moment. These are wait events, and they fall into categories: &lt;code&gt;IO&lt;/code&gt; (reading or writing data files), &lt;code&gt;Lock&lt;/code&gt; (heavyweight row or table locks), &lt;code&gt;LWLock&lt;/code&gt; (lightweight internal locks like buffer mapping or WAL insertion), &lt;code&gt;BufferPin&lt;/code&gt; (waiting for access to a buffer pinned by another backend), &lt;code&gt;Client&lt;/code&gt; (waiting for the application to send data or read results), and &lt;code&gt;Activity&lt;/code&gt; (background process idle waits). When a backend is actively computing — executing an operator, sorting rows, evaluating expressions — there is no wait event at all.&lt;/p&gt;

&lt;p&gt;Wait events are the closest thing PostgreSQL offers to a profiler. They tell you whether your database is slow because of I/O, lock contention, internal bottlenecks, or client-side delays. A database that shows 50% of sessions in &lt;code&gt;IO:DataFileRead&lt;/code&gt; has a fundamentally different problem than one showing 50% in &lt;code&gt;Lock:transactionid&lt;/code&gt; — the first needs better caching or faster storage, the second needs shorter transactions. Without wait event data, you are guessing at the root cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;Query &lt;code&gt;pg_stat_activity&lt;/code&gt; to see what each active backend is currently waiting on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Current wait events for all active sessions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;wait_event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;wait_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_duration&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;pg_backend_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;wait_event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Aggregate current wait events by type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;wait_event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;wait_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;wait_event&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with these queries is that wait events are transient. A session's wait event changes every millisecond — it reads a data page (&lt;code&gt;IO:DataFileRead&lt;/code&gt;), processes it (no wait event), reads another page, acquires a buffer pin (&lt;code&gt;BufferPin:BufferPin&lt;/code&gt;), and so on. A single-point query captures whatever is happening at that exact instant, which may or may not represent the actual bottleneck.&lt;/p&gt;

&lt;p&gt;To build a meaningful picture, you need to sample repeatedly and aggregate. Some teams write cron jobs that query &lt;code&gt;pg_stat_activity&lt;/code&gt; every second and log the results. This works, but building a sampling infrastructure, storing the time-series data, and visualizing the results is a significant investment.&lt;/p&gt;
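
&lt;p&gt;A minimal version of that sampling loop can be sketched entirely in SQL. The &lt;code&gt;wait_event_samples&lt;/code&gt; table name here is illustrative, and the insert would be driven by cron or pg_cron on a short interval:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One-time setup: a table to accumulate samples (illustrative name)
CREATE TABLE IF NOT EXISTS wait_event_samples (
    sampled_at      timestamptz NOT NULL DEFAULT now(),
    wait_event_type text,
    wait_event      text,
    session_count   int NOT NULL
);

-- Run on a schedule, e.g. every second
INSERT INTO wait_event_samples (wait_event_type, wait_event, session_count)
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid != pg_backend_pid()
GROUP BY wait_event_type, wait_event;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Trim old rows periodically (for example, anything older than a week) to keep the table small.&lt;/p&gt;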

&lt;h2&gt;
  
  
  Building a Wait Event Picture
&lt;/h2&gt;

&lt;p&gt;The most effective way to use wait event data is as a heatmap over time. Each row represents a wait event category, and color intensity shows how many sessions were in that state at each point. This format reveals patterns that point-in-time queries cannot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I/O spikes aligned with checkpoint intervals&lt;/strong&gt; — If &lt;code&gt;IO:DataFileRead&lt;/code&gt; spikes every 5 minutes, your checkpoint is too aggressive or &lt;code&gt;shared_buffers&lt;/code&gt; is too small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock contention correlated with batch jobs&lt;/strong&gt; — &lt;code&gt;Lock:transactionid&lt;/code&gt; spikes at the same time every day point to a batch process with long transactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LWLock pressure during peak writes&lt;/strong&gt; — &lt;code&gt;LWLock:WALInsert&lt;/code&gt; waits during high write throughput indicate WAL is a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client waits indicating application issues&lt;/strong&gt; — &lt;code&gt;Client:ClientRead&lt;/code&gt; means PostgreSQL is waiting for your application, not the other way around.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drilling into a specific wait event type to see which queries contribute the most wait time turns a vague "the database is slow" into a specific "this query on this table is causing 40% of all I/O waits."&lt;/p&gt;
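
&lt;p&gt;If you sample &lt;code&gt;pg_stat_activity&lt;/code&gt; on a schedule, the heatmap input is one aggregation away. This sketch assumes a &lt;code&gt;wait_event_samples&lt;/code&gt; table (an illustrative name) holding one row per wait event per sample:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sessions per wait event type per minute, ready to plot as a heatmap
SELECT
    date_trunc('minute', sampled_at) AS minute,
    wait_event_type,
    sum(session_count) AS sessions
FROM wait_event_samples
WHERE sampled_at &amp;gt; now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;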

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;Each wait event category maps to a different class of fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IO:DataFileRead / IO:DataFileExtend&lt;/strong&gt; — The database is reading data from disk instead of shared buffers, or extending table files during inserts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check buffer cache hit ratio&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blks_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blks_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blks_read&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cache_hit_ratio&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_database&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the hit ratio is below 99%, increase &lt;code&gt;shared_buffers&lt;/code&gt;. If a specific query causes most of the reads, add an index to reduce the number of pages scanned. If the hit ratio is already high but I/O waits persist, the storage subsystem is the bottleneck — consider faster disks or moving to NVMe.&lt;/p&gt;
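
&lt;p&gt;To see which tables are driving the disk reads, &lt;code&gt;pg_statio_user_tables&lt;/code&gt; breaks block reads down per relation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tables with the most heap blocks read from disk rather than cache
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit * 100.0 /
             nullif(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;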

&lt;p&gt;&lt;strong&gt;Lock:transactionid / Lock:tuple&lt;/strong&gt; — Sessions are waiting for other transactions to commit or for row-level locks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find long-running transactions holding locks&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;transaction_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_snippet&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'idle in transaction'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reduce transaction duration. Set &lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt; to automatically kill forgotten sessions. Break large batch operations into smaller transactions.&lt;/p&gt;
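
&lt;p&gt;The timeout is a one-line configuration change. The values below are illustrative starting points, and &lt;code&gt;app_user&lt;/code&gt; is a placeholder role name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cluster-wide: kill sessions idle in transaction for more than 5 minutes
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Or scope a stricter limit to a single role
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '1min';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;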

&lt;p&gt;&lt;strong&gt;LWLock:WALWrite / LWLock:WALInsert&lt;/strong&gt; — WAL writing is a bottleneck, typically during heavy write workloads.&lt;/p&gt;

&lt;p&gt;Move the WAL directory to faster storage. Increase &lt;code&gt;wal_buffers&lt;/code&gt;. Tune &lt;code&gt;checkpoint_completion_target&lt;/code&gt; toward 0.9 to spread checkpoint I/O more evenly.&lt;/p&gt;
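
&lt;p&gt;Those are plain configuration changes. The values below are illustrative, not universal recommendations; note that changing &lt;code&gt;wal_buffers&lt;/code&gt; requires a restart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER SYSTEM SET wal_buffers = '64MB';                -- takes effect after restart
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- reload is enough
SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;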

&lt;p&gt;&lt;strong&gt;Client:ClientRead / Client:ClientWrite&lt;/strong&gt; — PostgreSQL is waiting for the application to send the next command or read the result. This is not a database problem — investigate network latency, connection pooler configuration, or application-side processing delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Establish wait event baselines for your workload. A healthy OLTP database typically shows minimal wait events during normal operation — most time is spent actively processing, not waiting. When a new wait event pattern appears or an existing one intensifies, it signals a change in workload or a developing bottleneck.&lt;/p&gt;

&lt;p&gt;Sample wait events at regular intervals and retain the history. The value of wait event monitoring increases over time as you build context for what "normal" looks like. A 10% increase in &lt;code&gt;IO:DataFileRead&lt;/code&gt; after a deployment points directly at a query change. A gradual increase in &lt;code&gt;LWLock:BufferMapping&lt;/code&gt; over weeks suggests growing memory pressure as data volume increases.&lt;/p&gt;

&lt;p&gt;Pair wait event monitoring with query-level analysis to close the loop — when a wait event spikes, drill through to the specific queries causing it, fix them, and verify the wait event subsides.&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Isolating the Query Type That's Killing Your Database</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:00:06 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/isolating-the-query-type-thats-killing-your-database-7jg</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/isolating-the-query-type-thats-killing-your-database-7jg</guid>
      <description>&lt;h1&gt;
  
  
  Isolating the Query Type That's Killing Your Database
&lt;/h1&gt;

&lt;p&gt;When your PostgreSQL database is on fire at 2am, the last thing you want to be doing is scrolling through a flat list of 400 query templates trying to visually pattern-match which ones are SELECTs and which ones are UPDATEs. Yet that is exactly what most teams do — because &lt;code&gt;pg_stat_statements&lt;/code&gt; gives you everything in one undifferentiated list.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;During an incident, the first question is: "Is this a read problem or a write problem?" The answer determines everything about your investigation path. If the database is slow because of a runaway SELECT scanning millions of rows without an index, you need to find that query and add an index. If it is slow because of a bulk INSERT generating massive WAL, you need to find the write and throttle it. Different causes, different fixes, different urgency.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; shows everything in a flat list — hundreds of query templates sorted by total time or mean time. During an incident you are scrolling through SELECTs, INSERTs, UPDATEs, DELETEs, and utility statements mixed together. The query consuming the most total time might be a harmless high-frequency SELECT that runs millions of times per day. The actual problem might be an UPDATE that runs 50 times per hour but holds row locks for 2 seconds each time.&lt;/p&gt;

&lt;p&gt;Without filtering by statement type, you are pattern-matching visually: scanning query text for keywords, mentally categorizing each entry, trying to spot the anomaly. This is slow and error-prone, especially under incident pressure.&lt;/p&gt;

&lt;p&gt;Workload composition also matters outside of incidents. A database that shifts from 80% reads / 20% writes to 50/50 after a new feature launch is fundamentally different — the same hardware and configuration may no longer be adequate. But you cannot see this shift without statement type classification.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;You can manually classify queries by parsing the statement type from &lt;code&gt;pg_stat_statements&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Group query statistics by statement type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SELECT'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'SELECT'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'INSERT'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'INSERT'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'UPDATE'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'UPDATE'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'DELETE'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'DELETE'&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'UTILITY'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;statement_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;template_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_mean_time_ms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dbid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_time_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a workload breakdown, but it is a one-time snapshot with no trending. And the prefix-based classification is fragile — it breaks on queries with leading comments, CTEs starting with &lt;code&gt;WITH&lt;/code&gt;, or whitespace variations.&lt;/p&gt;

&lt;p&gt;To identify the most expensive queries per type during an incident:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Top 5 most expensive SELECT queries by total time&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;queryid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;~*&lt;/span&gt; &lt;span class="s1"&gt;'^&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s1"&gt;*SELECT'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;dbid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Top 5 most expensive INSERT/UPDATE/DELETE queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;queryid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_preview&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;~*&lt;/span&gt; &lt;span class="s1"&gt;'^&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s1"&gt;*(INSERT|UPDATE|DELETE)'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;dbid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires multiple queries, manual regex, and mental assembly. During an incident, every minute spent writing SQL is a minute not spent fixing the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Statement Type Classification Into Your Workflow
&lt;/h2&gt;

&lt;p&gt;The right approach is to classify every query template by statement type at collection time — not at query time during an incident. A proper classifier handles edge cases that simple regex misses: CTEs starting with &lt;code&gt;WITH&lt;/code&gt;, queries with leading comments, and prepared statement variations.&lt;/p&gt;
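
&lt;p&gt;A somewhat more robust SQL-side sketch strips leading whitespace and line comments before taking the first keyword. It still surfaces CTEs as &lt;code&gt;WITH&lt;/code&gt;, which you would map onward by inspecting the statement that follows the CTE list:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Classify by first keyword after stripping leading whitespace and -- comments
SELECT
    upper(substring(
        regexp_replace(query, '^(\s|--[^\n]*\n?)+', '')
        from '^\w+'
    )) AS first_keyword,
    count(*) AS template_count
FROM pg_stat_statements
GROUP BY 1
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;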

&lt;p&gt;With pre-classified queries, your incident triage workflow becomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter to the suspected statement type (SELECT for read latency, INSERT/UPDATE for write pressure)&lt;/li&gt;
&lt;li&gt;Sort by total time to find the most expensive queries of that type&lt;/li&gt;
&lt;li&gt;Sort by mean time to find the slowest individual executions&lt;/li&gt;
&lt;li&gt;Sort by calls to identify high-frequency queries that might benefit from caching or batching&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined with sortable columns, statement type filtering becomes a rapid triage tool. The workflow is: click the suspected type, sort by total time, and the top entry is your primary suspect. This takes seconds rather than the minutes required for manual SQL classification.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;Once you have isolated the statement type, each category has a distinct investigation path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SELECT problems — reads consuming too much time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check for sequential scans on large tables&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq_tup_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;seq_scan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10485760&lt;/span&gt;  &lt;span class="c1"&gt;-- &amp;gt; 10MB&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;seq_tup_read&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common fixes: add missing indexes, rewrite N+1 query patterns into JOINs, add &lt;code&gt;LIMIT&lt;/code&gt; clauses to unbounded queries.&lt;/p&gt;
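
&lt;p&gt;As a hypothetical example, once the scan data points at a hot table, the index can be built without blocking writes (note that &lt;code&gt;CONCURRENTLY&lt;/code&gt; cannot run inside a transaction block); the table and column names are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Index the column the slow SELECT filters on (names are placeholders)
CREATE INDEX CONCURRENTLY idx_orders_customer_id
    ON orders (customer_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;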

&lt;p&gt;&lt;strong&gt;INSERT problems — writes overwhelming WAL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check WAL generation rate&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'0/0'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_wal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check for indexes slowing down inserts&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;indexrelname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common fixes: use &lt;code&gt;COPY&lt;/code&gt; instead of row-by-row INSERT, drop unused indexes on write-heavy tables, batch inserts into larger transactions.&lt;/p&gt;
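
&lt;p&gt;The difference between row-by-row and batched writes is easy to sketch; the &lt;code&gt;events&lt;/code&gt; table and file path are illustrative. Server-side &lt;code&gt;COPY ... FROM&lt;/code&gt; reads a file on the database host, so from a client use &lt;code&gt;\copy&lt;/code&gt; in psql instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Instead of many single-row statements...
INSERT INTO events (payload) VALUES ('a');
INSERT INTO events (payload) VALUES ('b');

-- ...send one multi-row statement per batch
INSERT INTO events (payload) VALUES ('a'), ('b'), ('c');

-- For bulk loads, COPY is faster still
COPY events (payload) FROM '/tmp/events.csv' WITH (FORMAT csv);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;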

&lt;p&gt;&lt;strong&gt;UPDATE/DELETE problems — lock contention:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check for lock waits&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_pid&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt; &lt;span class="n"&gt;blocked&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_locks&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_locks&lt;/span&gt; &lt;span class="n"&gt;kl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;kl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactionid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactionid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;kl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt; &lt;span class="n"&gt;blocking&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;granted&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common fixes: add missing foreign key indexes (cascading deletes scan child tables), reduce transaction scope, avoid &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; where &lt;code&gt;SKIP LOCKED&lt;/code&gt; would work.&lt;/p&gt;
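&lt;p&gt;As a sketch of the &lt;code&gt;SKIP LOCKED&lt;/code&gt; pattern (the &lt;code&gt;jobs&lt;/code&gt; table and its columns here are hypothetical), competing workers claim rows instead of queueing behind each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Each worker claims one pending row; rows locked by other
-- workers are skipped rather than waited on
SELECT id
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With plain &lt;code&gt;FOR UPDATE&lt;/code&gt;, every worker would block on the same row; with &lt;code&gt;SKIP LOCKED&lt;/code&gt; (PostgreSQL 9.5+), each worker immediately gets the next available row.&lt;/p&gt;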

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Track statement type distribution as a workload health indicator. Establish your baseline — a healthy ratio for your application — and alert when the mix shifts significantly. A sudden increase in DELETE percentage might indicate a runaway cleanup job, while a spike in UTILITY statements might signal excessive &lt;code&gt;ANALYZE&lt;/code&gt; or &lt;code&gt;VACUUM&lt;/code&gt; activity competing for resources.&lt;/p&gt;
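&lt;p&gt;One way to approximate that distribution is to classify &lt;code&gt;pg_stat_statements&lt;/code&gt; entries by their leading keyword (a rough sketch that assumes the extension is installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Approximate statement mix by call count
SELECT
    CASE
        WHEN query ILIKE 'select%' THEN 'SELECT'
        WHEN query ILIKE 'insert%' THEN 'INSERT'
        WHEN query ILIKE 'update%' THEN 'UPDATE'
        WHEN query ILIKE 'delete%' THEN 'DELETE'
        ELSE 'OTHER'
    END AS stmt_type,
    round(100.0 * sum(calls) / sum(sum(calls)) OVER (), 1) AS pct_of_calls
FROM pg_stat_statements
GROUP BY stmt_type
ORDER BY pct_of_calls DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is only an approximation (queries starting with &lt;code&gt;WITH&lt;/code&gt; land in OTHER), but it is enough to track how the mix drifts over time.&lt;/p&gt;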

&lt;p&gt;Review workload composition after every major feature launch. A feature that adds a new UPDATE-heavy workflow to a previously read-dominated database changes the performance profile fundamentally. Index strategies, &lt;code&gt;work_mem&lt;/code&gt; settings, and WAL configuration that worked for a read workload may need adjustment for a mixed workload.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Schema Diff: Catch Drift Between Staging and Production</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 30 Mar 2026 10:00:06 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/schema-diff-catch-drift-between-staging-and-production-4jh4</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/schema-diff-catch-drift-between-staging-and-production-4jh4</guid>
      <description>&lt;h1&gt;
  
  
  Schema Diff: Catch Drift Between Staging and Production
&lt;/h1&gt;

&lt;p&gt;Your migration tool says everything is applied. Your CI pipeline is green. But production has a column that staging does not, and nobody knows who added it or when. Sound familiar? Schema drift is the gap between what your migration history says should exist and what actually exists in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A deployment fails in production because a migration expects a column that exists in staging but not production. Someone added it manually during debugging weeks ago and forgot to create a proper migration. Or the deployment succeeds, but a query starts failing because the index it depends on was only created in the staging environment. Or a constraint was dropped in production to fix a data issue and never re-added.&lt;/p&gt;

&lt;p&gt;Schema drift between environments is one of those problems everyone accumulates and nobody actively monitors. Each drift is individually small — a missing column, a different default value, an index present in one environment but not the other. But they compound. By the time you discover them, it is during a deployment or an incident, when the cost of surprise is highest.&lt;/p&gt;

&lt;p&gt;The drift is not always between staging and production. Replicas can drift from their primary if DDL is applied directly to one node. Development databases diverge as engineers apply ad-hoc changes. Even within a single environment, a failed migration that partially applied can leave the schema in an inconsistent state — half the tables have the new column, half do not.&lt;/p&gt;

&lt;p&gt;Migration tools like Flyway, Liquibase, and Alembic track what migrations were applied, but they do not verify the actual schema matches the expected state. A migration marked as "applied" in the history table says nothing about whether someone later modified the schema by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;Comparing schemas manually means querying &lt;code&gt;information_schema&lt;/code&gt; on both databases and diffing the results. Start with tables and columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Compare columns between two schemas&lt;/span&gt;
&lt;span class="c1"&gt;-- Run on EACH database and diff the output&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;column_default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_nullable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;character_maximum_length&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordinal_position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- List all indexes with their definitions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexdef&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indexname&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then constraints, functions, triggers, and sequences. Each object type requires its own query, its own export, and its own diff. For a schema with 50 tables, this is hundreds of rows to compare across six or more categories. You can script it, but maintaining the script and interpreting the diff requires ongoing effort.&lt;/p&gt;
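&lt;p&gt;For example, constraints can be exported for diffing via &lt;code&gt;pg_constraint&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List constraints with their full definitions
-- Run on EACH database and diff the output
SELECT
    conrelid::regclass::text AS table_name,
    conname,
    pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE connamespace = 'public'::regnamespace
ORDER BY table_name, conname;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;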

&lt;p&gt;The deeper problem is that nobody does this proactively. Schema comparison is a reactive activity — you do it after something breaks, not before. The gap between "last time someone checked" and "now" is where drift accumulates undetected.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Approach to Schema Comparison
&lt;/h2&gt;

&lt;p&gt;Whether you build custom tooling or use a monitoring platform, an effective schema diff should categorize differences by type: missing tables, missing columns, different column types or defaults, missing indexes, missing constraints, and missing functions. Color-coding by severity helps with triage — a missing column that exists in production but not staging is a deployment risk, while a different column default may be intentional.&lt;/p&gt;

&lt;p&gt;Even better, generate the exact ALTER statements needed to reconcile each difference. A missing column should produce &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt;. A different default should produce &lt;code&gt;ALTER COLUMN SET DEFAULT&lt;/code&gt;. A missing index should produce &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;. This eliminates the manual step of writing migration SQL.&lt;/p&gt;

&lt;p&gt;Safe patterns for generated statements include &lt;code&gt;IF NOT EXISTS&lt;/code&gt; where supported, &lt;code&gt;CONCURRENTLY&lt;/code&gt; for index operations, and explicit data types to avoid ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;Walk through the most common drift scenarios and their fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing column&lt;/strong&gt; — A column exists in staging but not production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Add the missing column with a safe default&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;IF NOT EXISTS&lt;/code&gt; (PostgreSQL 9.6+) so the statement is idempotent — safe to run even if someone already added the column manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different default value&lt;/strong&gt; — The column exists in both environments but defaults differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Align the default value&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Missing index&lt;/strong&gt; — An index exists in production but not staging (or vice versa):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create the missing index without blocking writes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;idx_orders_status_code&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always use &lt;code&gt;CONCURRENTLY&lt;/code&gt; in production. A regular &lt;code&gt;CREATE INDEX&lt;/code&gt; takes a &lt;code&gt;SHARE&lt;/code&gt; lock on the table, blocking all inserts, updates, and deletes until the index is built. Note that &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; cannot run inside a transaction block, so keep it out of transactional migration scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing constraint&lt;/strong&gt; — A foreign key or check constraint was dropped or never created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Re-add a missing foreign key constraint&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;fk_order_items_order_id&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding a foreign key validates all existing rows, which can be slow on large tables. Use &lt;code&gt;NOT VALID&lt;/code&gt; to add the constraint without validating existing data, then validate separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;fk_order_items_order_id&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;VALID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Validate existing rows separately (does not block writes)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
    &lt;span class="n"&gt;VALIDATE&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;fk_order_items_order_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Run schema diff as part of your deployment pipeline. Before applying migrations to production, compare the production schema against what your migration tool expects. This catches manual changes that were applied outside the migration system — the single most common source of drift.&lt;/p&gt;
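&lt;p&gt;A minimal pipeline check can diff &lt;code&gt;pg_dump --schema-only&lt;/code&gt; output from both environments. The hostnames and database name below are placeholders, and the same &lt;code&gt;pg_dump&lt;/code&gt; version should be used against both databases so the output ordering matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Dump DDL only, excluding ownership and grants that differ by design
pg_dump --schema-only --no-owner --no-privileges \
    -h staging-db.example.com -d appdb &amp;gt; staging_schema.sql
pg_dump --schema-only --no-owner --no-privileges \
    -h prod-db.example.com -d appdb &amp;gt; prod_schema.sql

# Any output is drift; diff's non-zero exit code fails the pipeline step
diff -u staging_schema.sql prod_schema.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;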

&lt;p&gt;Compare your primary against each replica periodically. Physical replication keeps schemas in sync, but logical replication does not — DDL is never replicated in logical replication, so every schema change must be applied to subscribers separately. If you use logical replication, schema drift between publisher and subscriber is almost guaranteed without explicit management.&lt;/p&gt;

&lt;p&gt;Establish a monthly schema comparison between staging and production even when no deployment is planned. Drift accumulates in the gaps between deployments.&lt;/p&gt;

&lt;p&gt;Treat schema as code, but verify the code matches reality. Migration history says what should have been applied. Schema diff tells you what actually exists. The gap between the two is where incidents start.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>database</category>
      <category>devops</category>
      <category>sql</category>
    </item>
    <item>
      <title>Interactive Replication Topology: See Your Cluster at a Glance</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sun, 29 Mar 2026 10:00:07 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/interactive-replication-topology-see-your-cluster-at-a-glance-3lb8</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/interactive-replication-topology-see-your-cluster-at-a-glance-3lb8</guid>
      <description>&lt;h1&gt;
  
  
  Interactive Replication Topology: See Your Cluster at a Glance
&lt;/h1&gt;

&lt;p&gt;Replication lag is one of those metrics that everyone monitors as a single number — and that single number hides almost everything you need to know to actually fix the problem. Is the lag caused by network, storage, or CPU? The answer changes your remediation completely, and the default PostgreSQL views do not make it obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You manage a PostgreSQL primary with three read replicas. One of the replicas has a cascading standby attached to it. Replication lag is spiking and your application team is reporting stale reads.&lt;/p&gt;

&lt;p&gt;To diagnose the problem, you need to answer several questions at once: Which replica is lagging? Is it write lag (the primary has sent WAL but the replica has not received it), flush lag (the replica received it but has not flushed to disk), or replay lag (flushed but not yet applied)? Is the cascading standby affected by the same lag, or is it independently lagging from its upstream replica? Are all replication slots active, or is an inactive slot causing WAL retention to balloon?&lt;/p&gt;

&lt;p&gt;Answering these questions means connecting to each node separately. You run &lt;code&gt;pg_stat_replication&lt;/code&gt; on the primary to see what it knows about each replica. You run &lt;code&gt;pg_stat_wal_receiver&lt;/code&gt; on each replica to see the receiver's perspective. You check &lt;code&gt;pg_replication_slots&lt;/code&gt; on the primary to verify slot health. Each query returns raw LSN positions and interval values that you mentally compare across nodes. With three replicas and a cascading standby, that is at least five separate connections and eight queries before you have a picture of the cluster state.&lt;/p&gt;

&lt;p&gt;During an incident, this is too slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;On the primary, &lt;code&gt;pg_stat_replication&lt;/code&gt; shows the current state of each connected standby:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run on the PRIMARY: check replication lag to each standby&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sent_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;write_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;flush_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;replay_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;write_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;flush_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;replay_lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sync_state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On each replica, &lt;code&gt;pg_stat_wal_receiver&lt;/code&gt; shows the receiver's perspective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run on each REPLICA: check WAL receiver status&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_msg_send_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_msg_receipt_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;conninfo&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_wal_receiver&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check replication slot health — inactive slots prevent WAL recycling and cause disk usage to grow indefinitely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run on the PRIMARY: check replication slot health&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;slot_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;retained_wal_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;wal_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The raw data is all there, but it is distributed across nodes, expressed in LSN positions that require mental arithmetic, and presented as flat rows with no visual representation of the topology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Sense of the Data
&lt;/h2&gt;

&lt;p&gt;When monitoring replication, the key is breaking total lag into its three components and understanding what each one means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write lag&lt;/strong&gt; (time between primary sending WAL and replica receiving it) points to network issues between primary and replica.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flush lag&lt;/strong&gt; (time between receiving and flushing to disk) suggests slow storage on the replica.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay lag&lt;/strong&gt; (time between flushing and applying) means the replica is CPU-bound on WAL replay — common when the primary runs many parallel writes but the replica applies them single-threaded.&lt;/li&gt;
&lt;/ul&gt;
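&lt;p&gt;The lag columns in &lt;code&gt;pg_stat_replication&lt;/code&gt; report these as intervals; the same breakdown can be read in bytes from the LSN columns, which helps when the interval columns are null (for example, on an idle system):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the PRIMARY: per-stage lag in bytes for each standby
SELECT
    application_name,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn))   AS network_gap,
    pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn))  AS storage_gap,
    pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) AS replay_gap
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whichever gap dominates points at the matching bottleneck: network, replica storage, or replay CPU.&lt;/p&gt;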

&lt;p&gt;Trend lines matter more than point-in-time values. Is lag stable, growing, or recovering? A stable 500ms of replay lag is normal for a busy system. A replay lag that grows 100ms every minute is a replica falling behind that will eventually become unusable for reads.&lt;/p&gt;

&lt;p&gt;For cascading standbys, check whether lag is inherited from the upstream replica or independent. If the upstream replica has 3 seconds of replay lag and the cascading standby has 3.5 seconds, the standby is fine — it is just inheriting the upstream lag plus 500ms of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;The fix depends on which lag component is elevated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High write lag (network)&lt;/strong&gt; — Check bandwidth and latency between primary and replica. Verify &lt;code&gt;wal_sender_timeout&lt;/code&gt; and &lt;code&gt;wal_receiver_timeout&lt;/code&gt; are not too aggressive for your network conditions. For cross-region replication, ensure adequate bandwidth for your WAL generation rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High flush lag (storage)&lt;/strong&gt; — The replica's storage cannot keep up with incoming WAL. Move the replica's WAL directory to faster storage (NVMe). Check for I/O contention from queries running on the replica.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High replay lag (CPU)&lt;/strong&gt; — The replica is applying WAL single-threaded and falling behind. For logical replication on PostgreSQL 16+, enable parallel apply of large in-progress transactions with &lt;code&gt;max_parallel_apply_workers_per_subscription&lt;/code&gt;. For physical replication, reduce the write rate on the primary during peak times or add more replicas to distribute read load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inactive replication slots&lt;/strong&gt; retaining WAL indefinitely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drop an inactive slot that is retaining excessive WAL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_drop_replication_slot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stale_replica_slot'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Or set a maximum WAL retention size (PG 13+)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_slot_wal_keep_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'10GB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cascading standby disconnected&lt;/strong&gt; — Check &lt;code&gt;primary_conninfo&lt;/code&gt; on the cascading standby. Verify the upstream replica allows replication connections (&lt;code&gt;max_wal_senders&lt;/code&gt; not exhausted, &lt;code&gt;pg_hba.conf&lt;/code&gt; permits the connection).&lt;/p&gt;
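&lt;p&gt;Two quick checks, one on each side of the cascade (&lt;code&gt;primary_conninfo&lt;/code&gt; is viewable with &lt;code&gt;SHOW&lt;/code&gt; from PostgreSQL 12 on):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the cascading STANDBY: where is it trying to connect?
SHOW primary_conninfo;

-- Run on the UPSTREAM replica: are any WAL sender slots free?
SELECT current_setting('max_wal_senders')::int - count(*) AS free_wal_senders
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;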

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Monitor all three lag types separately — write, flush, and replay. Total replication lag hides the root cause. A replica showing 5 seconds of total lag could be a network problem, a storage problem, or a CPU problem, and each requires a different fix.&lt;/p&gt;

&lt;p&gt;Set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; to prevent inactive slots from consuming all available disk. Without this setting, a single inactive slot will retain WAL indefinitely until the primary runs out of disk space and stops accepting writes.&lt;/p&gt;

&lt;p&gt;Alert on replication slot inactivity. A slot that becomes inactive usually means a replica has disconnected — either crashed, lost network connectivity, or was decommissioned without dropping its slot.&lt;/p&gt;
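&lt;p&gt;A query like this can back that alert (&lt;code&gt;wal_status&lt;/code&gt; requires PostgreSQL 13+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Inactive slots and the WAL they are pinning
SELECT
    slot_name,
    slot_type,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;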

&lt;p&gt;Catching a lagging replica at 2 seconds of delay is routine maintenance; catching it at 2 hours of delay is an incident.&lt;/p&gt;

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>From $1 Placeholders to Executable Queries: Debugging the Parameters That Matter</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sat, 28 Mar 2026 10:00:04 +0000</pubDate>
      <link>https://forem.com/philip_mcclarence_2ef9475/from-1-placeholders-to-executable-queries-debugging-the-parameters-that-matter-o45</link>
      <guid>https://forem.com/philip_mcclarence_2ef9475/from-1-placeholders-to-executable-queries-debugging-the-parameters-that-matter-o45</guid>
      <description>&lt;h1&gt;
  
  
  From $1 Placeholders to Executable Queries: Debugging the Parameters That Matter
&lt;/h1&gt;

&lt;p&gt;Every PostgreSQL DBA has been in this situation: a query shows up as slow in pg_stat_statements, but the template uses &lt;code&gt;$1&lt;/code&gt; placeholders and you have no idea which parameter values actually caused the slow execution. The query text is &lt;code&gt;WHERE customer_id = $1&lt;/code&gt; — but which customer? The one with 15 orders or the one with 500,000? Without the actual values, you cannot reproduce the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt; is the foundation of PostgreSQL query performance monitoring. It aggregates execution statistics per query template, grouping all executions of &lt;code&gt;SELECT * FROM orders WHERE customer_id = $1&lt;/code&gt; regardless of the parameter value. This is exactly what you want for identifying expensive query patterns — but it hides a critical detail.&lt;/p&gt;

&lt;p&gt;A query might average 5ms across 100,000 calls but take 12 seconds for one specific customer. The customer with ID 847291 has 500,000 orders while the median customer has 15. The index scan that works perfectly for most customers degrades to near-sequential-scan performance for this outlier. The average hides the catastrophe.&lt;/p&gt;
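&lt;p&gt;You can at least spot which templates hide outliers by comparing the mean against the maximum (column names below are the PostgreSQL 13+ spelling; older versions use &lt;code&gt;mean_time&lt;/code&gt; and &lt;code&gt;max_time&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Templates whose worst execution dwarfs their average
SELECT
    left(query, 60) AS query_start,
    calls,
    round(mean_exec_time::numeric, 1) AS mean_ms,
    round(max_exec_time::numeric, 1)  AS max_ms
FROM pg_stat_statements
WHERE calls &amp;gt; 100
  AND max_exec_time &amp;gt; mean_exec_time * 50
ORDER BY max_exec_time DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This narrows the search, but it still does not tell you which parameter values produced the maximum.&lt;/p&gt;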

&lt;p&gt;You cannot reproduce this problem from &lt;code&gt;pg_stat_statements&lt;/code&gt; alone. You see &lt;code&gt;WHERE customer_id = $1&lt;/code&gt; and the mean time, but not the parameter value that caused the outlier. Without the actual parameter values, you cannot run EXPLAIN ANALYZE to see what the planner does for that specific input.&lt;/p&gt;

&lt;p&gt;The standard workaround is to search PostgreSQL logs. If &lt;code&gt;log_min_duration_statement&lt;/code&gt; is configured, slow queries are logged with their parameters in a DETAIL line. But parsing log files to extract parameters, correlate them with query templates, and reconstruct executable queries is manual, tedious, and error-prone — exactly the kind of work that gets skipped during a production incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It
&lt;/h2&gt;

&lt;p&gt;PostgreSQL can log slow queries with their parameter values, but you need the right configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check if slow query logging is enabled&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- -1 means disabled, 0 logs everything, positive value is milliseconds&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable logging of queries slower than 500ms&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, slow queries appear in the PostgreSQL log with a DETAIL line containing the parameter values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;LOG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12847&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;  &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;DETAIL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'847291'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To reconstruct an executable query from this, you must manually substitute each parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Manually reconstructed from log parsing&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'847291'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for one query. For systematic analysis, you would need to parse the log files programmatically, match DETAIL lines to their query statements, and build executable queries for each slow execution. Most teams lack this tooling, so parameter values are effectively lost.&lt;/p&gt;
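&lt;p&gt;If the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is available, it can at least rank the normalized templates worth chasing in the logs (a sketch; the column names below are for PostgreSQL 13+, older versions use &lt;code&gt;mean_time&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Requires the pg_stat_statements extension
-- Queries here are normalized: parameters appear only as $1, $2, ...
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;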

&lt;p&gt;Another approach is &lt;code&gt;auto_explain&lt;/code&gt;, which logs the EXPLAIN plan for slow queries. This gives you the plan, but still not an executable query you can paste into psql to reproduce it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable auto_explain for queries over 1 second&lt;/span&gt;
&lt;span class="k"&gt;LOAD&lt;/span&gt; &lt;span class="s1"&gt;'auto_explain'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_min_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;auto_explain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating Parameter Extraction
&lt;/h2&gt;

&lt;p&gt;The ideal solution is a monitoring tool that parses PostgreSQL logs automatically, extracts DETAIL parameters, and substitutes them into query templates to produce executable queries. This turns a tedious manual process into an automated pipeline.&lt;/p&gt;

&lt;p&gt;With automated extraction, each query template accumulates a collection of real-world parameter samples. Templates with zero executable samples use &lt;code&gt;$N&lt;/code&gt; placeholders that cannot be run directly. Templates with samples have actual parameter values from the logs. The higher the sample count, the more diverse the parameter combinations available for testing.&lt;/p&gt;

&lt;p&gt;This is particularly valuable for queries where performance varies dramatically based on input. You can test with the actual values that caused slow executions — not guessed values or synthetic test data, but the real parameters from production.&lt;/p&gt;

&lt;p&gt;The workflow becomes: find a slow query template, click into its samples, copy the executable query with one click, and paste it into psql with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; to see exactly what the planner does for that specific parameter combination.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix It
&lt;/h2&gt;

&lt;p&gt;Once you have an executable query with the problematic parameters, diagnose the plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run EXPLAIN ANALYZE with the exact parameters from the sample&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'847291'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this plan to the same query with a typical parameter value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'12345'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skewed data distribution:&lt;/strong&gt; Customer 847291 has 500,000 rows, but the planner estimates 15 (the table-wide average). Fix with extended statistics or a partial index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Extended statistics for correlated columns&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="n"&gt;orders_customer_date_stats&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Partial index for high-volume customers&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_orders_high_volume&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
                          &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Date range too wide:&lt;/strong&gt; &lt;code&gt;$2 = '2020-01-01'&lt;/code&gt; pulls 5 years of data. The application may need pagination or the query needs a tighter default range.&lt;/p&gt;
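&lt;p&gt;A mitigation sketch, assuming the application can tolerate a bounded window (the 90-day interval and LIMIT below are illustrative, not from the original query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Bound the window and page through results instead of scanning 5 years
SELECT * FROM orders
WHERE customer_id = '847291'
  AND order_date &amp;gt; now() - interval '90 days'
ORDER BY order_date DESC
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;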

&lt;p&gt;&lt;strong&gt;Generic plan vs. custom plan:&lt;/strong&gt; For prepared statements, after five executions the planner may switch to a cached generic plan that is suboptimal for outlier parameter values. On PostgreSQL 12+, check by setting &lt;code&gt;plan_cache_mode = force_custom_plan&lt;/code&gt; as a diagnostic.&lt;/p&gt;
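&lt;p&gt;A minimal way to run that diagnostic in psql, preparing the statement with the sampled values (assuming &lt;code&gt;customer_id&lt;/code&gt; is text, as the quoted literals suggest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;PREPARE slow_orders (text, date) AS
    SELECT * FROM orders WHERE customer_id = $1 AND order_date &amp;gt; $2;

-- Plan tailored to these exact values
SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE slow_orders('847291', '2020-01-01');

-- Generic plan for comparison
SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE slow_orders('847291', '2020-01-01');

DEALLOCATE slow_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;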

&lt;h2&gt;
  
  
  How to Prevent It
&lt;/h2&gt;

&lt;p&gt;Ensure &lt;code&gt;log_min_duration_statement&lt;/code&gt; is set to a reasonable threshold — 500ms to 1000ms is a good starting point. Too low floods the logs; too high misses important slow queries. This single setting is the prerequisite for parameter extraction; without it, no tool can recover parameter values from normalized templates.&lt;/p&gt;
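&lt;p&gt;On PostgreSQL 13+, sampling offers a middle ground: log a fraction of statements above a lower threshold instead of all of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL 13+: log roughly 10% of statements slower than 100ms,
-- while log_min_duration_statement still logs everything above its threshold
ALTER SYSTEM SET log_min_duration_sample = 100;
ALTER SYSTEM SET log_statement_sample_rate = 0.1;
SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;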

&lt;p&gt;On PostgreSQL 13+, make sure &lt;code&gt;log_parameter_max_length&lt;/code&gt; has not been lowered from its default of -1, which logs parameter values in full; a positive value truncates each parameter in the log DETAIL line at that many bytes. Truncated parameters produce incomplete executable queries.&lt;/p&gt;
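&lt;p&gt;A quick check, mirroring the earlier configuration block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL 13+
SHOW log_parameter_max_length;
-- -1 logs parameter values in full; a positive value truncates at that many bytes
ALTER SYSTEM SET log_parameter_max_length = -1;
SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;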

&lt;p&gt;Review captured samples periodically to identify parameter patterns that consistently cause slow execution. If the same customer ID or date range appears across multiple slow samples, the problem is data skew rather than a missing index.&lt;/p&gt;
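&lt;p&gt;The skew itself is easy to confirm, either from the table or from the planner's own statistics (using the &lt;code&gt;orders&lt;/code&gt; example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Heaviest customers by actual row count
SELECT customer_id, count(*) AS n_rows
FROM orders
GROUP BY customer_id
ORDER BY n_rows DESC
LIMIT 10;

-- What the planner believes: most common values and their frequencies
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;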

</description>
      <category>database</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
