Forem: JoongHyuk Shin

1.3.4 RETURNING Mapping and View-Target DML Transformation

JoongHyuk Shin — Mon, 25 May 2026 11:02:53 +0000

The previous section (1.3.3) looked at the work the rewriter does to fill empty slots with values and expressions inside an INSERT/UPDATE/DELETE the user issued. Injecting defaults for omitted INSERT columns, resolving the DEFAULT keyword, and merging partial UPDATE assignments all belong to that work.

The other half of DML rewriting has a different flavor. It is the work of moving what existing references point to. It shows up in two places. One is when the RETURNING clause has to be resolved. The other is when a DML aimed at a view has to be rewritten as a DML against a base table. In both cases the expressions the user wrote stay the same; only what those expressions point to (the varno/varattno of Var nodes) gets swapped.

RULE's RETURNING is reshaped to the user's request

All three commands (INSERT/UPDATE/DELETE) can carry a RETURNING clause. It is the clause that hands the affected rows back as a result. RETURNING usually flows into the planning stage exactly as the user wrote it, but when DML is issued against a table that has an INSTEAD/ALSO rule attached, the rewriter adds one more mapping step.

Take a concrete scenario. Suppose an account_summary view and a rule on it are defined like this.

CREATE VIEW account_summary AS
    SELECT id, owner, balance FROM accounts;

CREATE RULE log_summary_insert AS ON INSERT TO account_summary
    DO INSTEAD INSERT INTO accounts (owner, balance)
                          VALUES (NEW.owner, NEW.balance)
                          RETURNING id, owner, balance;

This rule is an INSTEAD rule: when the user issues an INSERT to account_summary, it replaces that with an INSERT to accounts. The action itself carries a RETURNING clause whose list is the three columns id, owner, balance.

Now suppose the user issues this.

INSERT INTO account_summary (owner, balance) VALUES ('alice', 100)
   RETURNING id, balance;

The user's requested RETURNING list is two columns, id, balance. The shape differs from the rule action's RETURNING list of id, owner, balance. What the rewriter does here is clear. It reshapes the rule action's RETURNING list to the shape the user asked for, id, balance. The query that ends up being executed looks like this.

INSERT INTO accounts (owner, balance) VALUES ('alice', 100)
   RETURNING id, balance;

The user's shape is what ends up in the result. The RETURNING id, owner, balance written into the rule action itself acts only as a signal that "this rule knows how to resolve RETURNING over these columns"; the final shape follows the user's request.

There are three more branches.

First, the user did not write a RETURNING clause. In that case the rule action's RETURNING is simply discarded. The user said they don't want a result, so the RETURNING expressions are never evaluated.

Second, RETURNING lists appear on more than one rule action. Suppose, for example, account_summary has both of these rules attached.

CREATE RULE log_summary_insert AS ON INSERT TO account_summary
    DO INSTEAD INSERT INTO accounts (owner, balance)
                          VALUES (NEW.owner, NEW.balance)
                          RETURNING id, owner, balance;

CREATE RULE audit_summary_insert AS ON INSERT TO account_summary
    DO ALSO   INSERT INTO accounts_audit (owner, balance)
                          VALUES (NEW.owner, NEW.balance)
                          RETURNING audit_id, owner, balance;

DO ALSO does not intercept the user's command; it runs as an additional action alongside. Now a single INSERT event has two actions carrying RETURNING: one for accounts, the other for accounts_audit. When the user asks for RETURNING id, balance, there is no way to decide which action the result should come from. Because of this ambiguity, the rewriter reports an error at this point.

Third, an INSTEAD rule has no RETURNING at all but the user asked for one. This is also an error. The message is "cannot perform INSERT/UPDATE/DELETE RETURNING on relation X".

If we boil these three branches down to one line: the RULE system prioritizes the user's requested result shape, but the path is sound only when exactly one rule action can resolve it.

DML on a view is rewritten as DML on a base relation

When an INSERT/UPDATE/DELETE targets a view and that view is auto-updatable (a view that satisfies the simple-view expansion conditions from 1.3.1), the rewriter rewrites the view-target DML as a DML against the base relation. INSERT, UPDATE, DELETE, and MERGE all go through the same function rewriteTargetView.

Take an example first. Suppose a view and a user query like this.

CREATE VIEW active_accounts AS
    SELECT * FROM accounts WHERE deleted_at IS NULL;

UPDATE active_accounts SET balance = balance + 100
    WHERE owner = 'alice'
    RETURNING id, balance;

The user issued a query shaped like an UPDATE against the active_accounts view. The relation targeted by the UPDATE, i.e., the relation whose rows will be modified, is called the result relation in PG terminology. Here the slot for the result relation holds active_accounts. A view doesn't carry data of its own, so it cannot be updated as-is; the update has to flow through to the base table behind the view, which is accounts.

In one line, the transformation replaces every view reference inside the query with a reference to the base relation. Broken down into steps, three things happen.

First, the view RTE that sat in the result relation slot is swapped for the base relation RTE. In the example above, the slot initially held an RTE pointing to active_accounts; that gets replaced with the RTE of the base table accounts that the view definition points to.

Second, every Var node inside the query (Var in targetList, Var in RETURNING, Var in ON CONFLICT) has its attno rewritten from the view's column index to the base table's column index. One point worth pinning down here. When the user issues a DML targeting a view, the column names they write in the RETURNING clause are the view's column names. That's because when the analyzer resolves the column references in RETURNING, it looks at the catalog of the result relation, which is the view. After that, the rewriter rewrites those Var nodes' attno to point at slot numbers in the base table. SQL text is not what changes; one integer field on a tree node is swapped for a different number.

In our example, the view is shaped as SELECT * FROM accounts, so the view's id maps straight to the base's id. Rewriting attno is the whole story. If the view had instead been written as SELECT id AS account_id, owner, balance FROM accounts, with an alias renaming a column, the user would have written RETURNING account_id, and that Var's attno would be rewritten to point at the base's id slot. The label on the column the user gets back stays exactly as they wrote it, i.e., the view column name (account_id), and the value inside that column is read from the base's id.

Third, if the view definition had a WHERE clause, that condition is added to the user's query. The view definition in our example is ... WHERE deleted_at IS NULL, and that condition defines which rows are "visible" through the view. The UPDATE the user issued must target rows that fall within the view's visibility range, so the view's WHERE is ANDed onto the user's WHERE. UPDATE and DELETE pick up the merge this way; INSERT creates new rows and has no WHERE clause of its own for the view's WHERE to attach to, so it skips this step.

Conceptually, the rewriter takes the query above and produces something like this.

UPDATE accounts SET balance = balance + 100
    WHERE owner = 'alice' AND deleted_at IS NULL
    RETURNING id, balance;

The view definition's WHERE (deleted_at IS NULL) got ANDed onto the user's WHERE (owner = 'alice'), and the Var nodes in RETURNING have been remapped from view column indexes to base column indexes in accounts. The SQL above isn't real SQL text; it's an analogy that pictures the transformed result in SQL form. Inside the tree, the Var nodes in RETURNING simply point at base column slots; the result is not re-emitted as SQL text. The user thinks they updated a view, but the base table is what's actually updated, and the view no longer appears in the post-transformation query.

A security_barrier view pins the view's WHERE in front of securityQuals

If a view was created with the security_barrier option, one thing changes. Instead of being ANDed onto the user's WHERE in the usual way, the view definition's WHERE is pinned at the front of the base relation RTE's securityQuals. This is the same securityQuals list from 1.3.2, the container that held RLS quals. Going into the same container brings the same protection effect. The planner cannot reorder a leaky function the user planted in their WHERE clause to be evaluated before the view's filter. That stops a leaky function from bypassing the view's filter to see base rows.

WITH CHECK OPTION stops a new row from breaking the view's visibility

If the view carries WITH CHECK OPTION, one more thing is added. It is the promise that any row coming in (or being changed) through INSERT/UPDATE must satisfy the view's WHERE again. Adding WITH CHECK OPTION to the end of the view definition looks like this.

CREATE VIEW active_accounts AS
    SELECT * FROM accounts WHERE deleted_at IS NULL
    WITH CHECK OPTION;

If we issue an UPDATE against a view defined this way, the row produced by the UPDATE has to still satisfy deleted_at IS NULL. That stops the user from setting deleted_at to some timestamp via that UPDATE and shoving the row they were updating outside the view's visibility range. The rewriter turns this promise into a check entry and pins it at the front of the Query's withCheckOptions list.

The same container from 1.3.2 shows up again. The withCheckOptions list that held RLS WITH CHECK conditions now also takes the view's CHECK OPTION in the same place. The only difference is the identifier that distinguishes the kind of check; the enforcement mechanism is the same. Even the enforcement semantics (an error is raised on violation, not a silent drop) are identical.

Consider the case where RLS and view CHECK OPTION are both active on the same table/view. Say the base table accounts has an RLS WITH CHECK policy saying "an INSERTed row's owner must always be current_user", and the active_accounts view above carries a WITH CHECK OPTION saying "the row must satisfy deleted_at IS NULL". When the user issues an INSERT through active_accounts, the rewriter pins both conditions onto the withCheckOptions list together. The executor checks the new row against both conditions in order; a violation of either raises an error. Because they sit in the same list and are enforced the same way, the enforcement mechanism is single.

View transformation unwinds recursively

When the view-target DML work finishes, the resulting query re-enters RewriteQuery recursively. The base table might still have another view attached, and if RLS is attached to the base table, it gets applied in step 2 (fireRIRrules) that runs after this transformation. That's why multi-layer constructions like view-on-view, view-on-RLS, and RLS-on-view each unwind at their own layer.

What this means in practice

First, code that takes the RETURNING result from a view-target DML by positional index can break when the view definition changes. When you issue a DML against a view with RETURNING *, the result columns are labeled with the view's column names, but the data is actually read from the base table. Take a view defined like this.

CREATE VIEW account_summary AS
    SELECT id, owner, balance FROM accounts;

Application code issues an INSERT against this view, takes the RETURNING *, and assumes the first slot is id.

row = cur.execute(
    "INSERT INTO account_summary (owner, balance) VALUES (%s, %s) "
    "RETURNING *",
    ("alice", 100),
).fetchone()
new_id = row[0]   # assumed to be id

Later, during operation, suppose the team decides to add created_at to the view and redefines the column order.

CREATE OR REPLACE VIEW account_summary AS
    SELECT created_at, id, owner, balance FROM accounts;

If the same code keeps running, row[0] is now created_at, and new_id ends up with a timestamp. The code still compiles, no SQL error fires, so the bug is found late. Reading by column name (row["id"]) lets the value find its own slot even if the view's column order shifts. On code paths that issue DML through views, standardizing on reading RETURNING results by column name is the safer choice.

Second, an UPDATE against a view with WITH CHECK OPTION cannot shove the very row it is updating outside the view. One common trap shows up on the soft-delete pattern. Soft-delete means not actually DELETEing a row but recording a deletion timestamp in a column like deleted_at, treating the row as "removed". A view that exposes only "rows not yet deleted" is then layered on top. active_accounts is exactly that shape: in the base accounts the row stays put, and only rows with deleted_at IS NULL are exposed through the view. When application code tries to soft-delete a row through that same view, as in UPDATE active_accounts SET deleted_at = now() WHERE id = ?, the row resulting from that UPDATE no longer satisfies deleted_at IS NULL. The row falls outside the view's visibility range (the area of "visible rows" the view's WHERE defines). WITH CHECK OPTION blocks this with an error. The application meant to soft-delete, but the view's visibility promise stands in direct conflict with that intent, so the operation is rejected.

There are two ways to resolve the conflict. One is to issue the soft-delete UPDATE directly against the base table instead of routing it through the view. The view keeps the read-side visibility job, and the soft-delete write happens on the base. The other is to intentionally drop the view's CHECK OPTION and instead enforce, at the application layer, the promise that "INSERTed/UPDATEd rows must satisfy deleted_at IS NULL". The application takes on the promise the engine used to enforce.

1.3.3 INSERT/UPDATE/DELETE Targetlist Normalization (Default Expansion)

JoongHyuk Shin — Fri, 22 May 2026 09:52:25 +0000

The two rewriter tasks we have seen so far, view expansion and RLS policy injection, were unpacked under the assumption that the query is a SELECT. INSERT/UPDATE/DELETE go through one more preparatory stage on top of those. Even when the user only writes INSERT INTO accounts (owner) VALUES ('alice'), what PostgreSQL ends up looking at is a query where the id, balance, and created_at slots already contain expressions. That preparation does not exist for SELECT. The subject of this section is how that preparation works, and why it has to happen before user-defined rules get applied.

DML preparation has to happen before rule application

The rewriter is divided into two phases. The first phase applies user-defined rules; the second applies view expansion and RLS to each resulting Query (fireRIRrules, seen in 1.3.1 and 1.3.2). SELECT queries have essentially nothing to do in the first phase. Only INSERT/UPDATE/DELETE receive additional processing inside the first phase.

The first step of that additional processing is targetlist normalization. For INSERT, the rewriter fills in default expressions for columns the user did not name (when those columns have defaults), and replaces any DEFAULT keyword the user wrote with the actual expression. UPDATE goes through the same normalization function, but with less to do because there are no missing slots to fill. DELETE skips the stage entirely.

Why does this normalization have to come before rule application? The answer comes from looking at how the RULE system (1.3.1) refers to columns. The accounts table used throughout this section is defined as follows.

CREATE TABLE accounts (
    id          integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    owner       text NOT NULL,
    balance     numeric DEFAULT 0,
    created_at  timestamptz DEFAULT now(),
    deleted_at  timestamptz
);

The last column, deleted_at, has no DEFAULT clause and is not an identity column. In other words, it is a column with no default. This shape is common in the soft-delete pattern, and the way this column is handled later becomes one interesting case in this section.

Suppose an audit rule is attached to this table.

CREATE RULE log_account_insert AS ON INSERT TO accounts
    DO ALSO INSERT INTO audit_log (account_id, owner_name, logged_at)
                          VALUES (NEW.id,    NEW.owner,  now());

The rule says "whenever someone inserts a row into accounts, also insert a row into audit_log." Inside the rule body, NEW.id and NEW.owner are virtual expressions that mean "the id column, the owner column of the row the user is about to insert." When the rule author writes these references, they make one assumption: no matter what INSERT the user throws, that row will have an id slot and an owner slot, with values in them.

Now suppose the normalization step did not exist, and the rule received the user's original query as is. The user runs INSERT INTO accounts (owner) VALUES ('alice'). The original query's targetlist contains a single entry, owner = 'alice'. The id is missing. When the rule fires, it looks into the trigger query's targetlist to resolve NEW.id. There is no matching slot. At this point PostgreSQL's rule mapping (the precise mechanism is covered later in this section) replaces the unmatched slot with a NULL constant. So the rule itself does not error out. But what ends up in audit_log is (NULL, 'alice', ...). Meanwhile the user's base INSERT goes through the planner separately, picks up an actual integer (say 17) from the sequence into the id slot, and lands (17, 'alice', ...) in accounts. The audit table is supposed to track "what row got inserted," and yet it never receives the new row's id. That is not what the rule author had in mind.

The opposite case: normalization has finished, then the rule fires. The rewriter has already filled the three missing columns (id, balance, created_at) with expressions, so the trigger query the rule sees has all four columns sitting in their proper slots. NEW.id points to the sequence expression now occupying that slot, and NEW.owner points to 'alice'. The rule mapping has nothing to fall back to NULL for. The account_id slot in the audit_log INSERT receives the same sequence expression, so at execution time audit_log gets a value tied to the same new row id that accounts received. The tracking the rule author originally assumed finally works.

The same INSERT may have some columns explicitly given and others omitted. The rule has no way to tell. The rewriter has to run before the rule and fill in the missing columns in attno order, so that the user's query the rule sees is in the standard shape. The position of this preparation is determined by what the next stage expects as input.

This is qualitatively different from what the rewriter does for SELECT. View expansion and RLS are transformations that add conditions that were not there in the user's query; DML preparation is a transformation that fills in slots that were left empty in the user's query. Both touch the tree, but with different motivations.

INSERT auto-injects default expressions into missing columns

Suppose the user fires this query against the accounts table just defined.

INSERT INTO accounts (owner) VALUES ('alice');

The targetList of the Query tree produced by the analyzer contains only the single entry owner = 'alice'. The id, balance, created_at, and deleted_at are missing. The rewriter fills the empty slots as follows.

id is defined as GENERATED ALWAYS AS IDENTITY, a SQL-standard identity column. PostgreSQL creates a dedicated sequence (a catalog object called accounts_id_seq) for such columns and, on every INSERT, draws the next integer from that sequence and fills the column with it. What the rewriter places into this slot is the expression node that retrieves the next value from the sequence (NextValueExpr).
balance has DEFAULT 0 written in the column definition. That definition is stored as an expression in the pg_attrdef catalog, and the rewriter pulls the expression out of there and places it into the slot.
created_at follows the same route, and the now() function call expression goes into its slot.
deleted_at has no default. The rewriter does not place any expression into this slot.

After three columns get filled, the targetList ends up shaped like (NextValueExpr node, 'alice', 0, now(), empty) across five positions. The user's query had only one column (owner), but the query PostgreSQL ends up looking at has four columns filled with expressions and one (deleted_at) left as an empty slot. The decision of what to fill in each column (or whether to leave it empty) belongs to a single responsibility function (build_column_default); when an expression is produced, it gets wrapped in a new TargetEntry and inserted into the targetList.

The decision follows three priority steps. First, if the column is an identity column, the node that retrieves the next value from a sequence. Second, if the column definition contains DEFAULT <expression>, that expression. Third, if neither applies but the column's data type itself has a default, that type default. In the example above, id took the first path, while balance and created_at took the second. deleted_at falls through all three.

Columns without a default: how the empty slot resolves to NULL

When all three priority steps fail, build_column_default returns NULL, and the rewriter does not create a TargetEntry for that column in the INSERT targetList. The attno slot itself exists, since the table has five columns by definition, but one of those slots (here, deleted_at) is simply left empty rather than carrying a TargetEntry. This might look like it breaks the promise stated earlier ("the query the rule sees is in the standard shape with every column sitting in its slot"). If the slot is empty, surely the rule pointing to NEW.deleted_at should get the wrong value or trigger an error.

That is not how it works. PostgreSQL has been designed so that every stage downstream of the empty slot interprets it as NULL in a consistent way. A concrete scenario makes this clear. Suppose another rule is also attached to accounts.

CREATE RULE forward_deleted_at AS ON INSERT TO accounts
    DO ALSO INSERT INTO sync_queue (account_id, deleted_at)
                         VALUES (NEW.id, NEW.deleted_at);

This rule references NEW.deleted_at. But the normalized targetList of the user's INSERT INTO accounts (owner) VALUES ('alice') does not contain a TargetEntry for deleted_at (no default, so no expression was placed there). When the rule mapping step looks into the trigger query's targetList for the deleted_at slot, it does not find one.

At this point PostgreSQL's rule mapping substitutes a NULL constant in place of the missing TargetEntry. So the deleted_at slot of the sync_queue INSERT receives NULL. The rule's promise is upheld in a normal way. Pointing to NEW.deleted_at simply produces NULL; there is no rule breakage, no neighboring column getting accidentally dragged into the slot.

The base table side works the same way. When the planner produces the plan for the INSERT and sees no TargetEntry for deleted_at in the normalized targetList, it fills that slot with a NULL constant to complete the row. The row that lands in accounts ends up as (1, 'alice', 0, now(), NULL).

At both of those sites, the sync_queue INSERT produced by rule mapping and the base accounts row produced by the planner, the NULL is filled in the same way. There is no inconsistency where one site produces NULL and the other produces some other value. That is why the rewriter does not bother building an explicit NULL expression for the targetList. Inserting a NULL TargetEntry and leaving the slot empty have the same end result, so PostgreSQL chose to leave the slot empty and keep the tree lighter.

The `DEFAULT` keyword arrives as a SetToDefault placeholder and gets resolved

The previous section covered automatic filling when the user omits a column. The user can also write the DEFAULT keyword explicitly to say "fill this slot with the default expression."

INSERT INTO accounts VALUES (DEFAULT, 'bob', DEFAULT, DEFAULT);

When this SQL passes through the analyzer, each DEFAULT written by the user becomes a SetToDefault placeholder node in the Query tree. SetToDefault is just a temporary dummy placed where an expression node belongs; it is not an executable expression. The node definition comment in PostgreSQL says this directly.

/*
 * Placeholder node for a DEFAULT marker in an INSERT or UPDATE command.
 *
 * This is not an executable expression: it must be replaced by the actual
 * column default expression during rewriting.
 */

The expression that replaces a SetToDefault is built by the same responsibility function shown earlier. The same three-step priority applies, so each DEFAULT slot receives the corresponding identity / pg_attrdef / type-default expression. The core mechanism is the same up to this point. What differs is where the placeholder lives in the tree, and that depends on the form of the INSERT.

Single-row INSERT is substituted directly in the targetList

For the single-row INSERT case above, since there is only one row, the four values (DEFAULT, 'bob', DEFAULT, DEFAULT) go straight into the Query's targetList. The three SetToDefault nodes sit at positions 1, 3, and 4 of the targetList. The rewriter walks the targetList column by column, and whenever it encounters a SetToDefault, it substitutes the expression produced by the responsibility function in place. In terms of the Query tree structure from 1.3.1, one row in the targetList represents the expression for one column, and that expression flips from SetToDefault to the actual default expression.

Multi-row INSERT runs through a VALUES RTE in between

For multi-row INSERT, which inserts several rows in one statement, the Query tree shape is slightly different.

INSERT INTO accounts VALUES (DEFAULT, 'bob',   DEFAULT, DEFAULT),
                            (DEFAULT, 'carol', 500,     DEFAULT),
                            (DEFAULT, 'dave',  DEFAULT, '2026-01-01');

With three rows, the targetList cannot hold three rows at once (targetList is shaped as one expression per column). Instead, PostgreSQL bundles the rows into a separate node and attaches it to the Query tree's from-list. That node is called a VALUES RTE. RTE is short for RangeTblEntry (introduced in 1.2.3), a single data-source unit referenced from the from-list (a table, a sub-query, a function, and here, the VALUES bundle). A VALUES RTE holds a list of rows internally, and each row in turn holds a list of values. The seven SetToDefault placeholders from the INSERT above are scattered across positions 1, 3, and 4 of the three rows inside the VALUES RTE.

Why does the tree shape differ for what looks like the same INSERT? A single-row INSERT is the case where producing one row is the entire result of the Query, while a multi-row INSERT is closer to "produce a set of rows and pour them into the table." PostgreSQL handles the latter the same way it handles a regular sub-query. A VALUES RTE sits in the same kind of slot a SELECT's result set would, an "inline data source." That is how INSERT INTO accounts SELECT ... and INSERT INTO accounts VALUES (...), (...) end up sharing the same rewriter and planner code paths.

The multi-row path walks each value of each row inside the VALUES RTE, and whenever it sees a SetToDefault, it substitutes the expression produced by the same responsibility function. The expression produced is identical to the single-row case; only the location it lands in differs.

There is one small optimization. A VALUES RTE may carry 100 or 1,000 rows. If none of them contain a SetToDefault, the rewriter skips the expensive list rebuild entirely. Rather than reconstructing all those rows, it passes the original list through to the next stage. The reason PostgreSQL has this fast path is the practical observation that multi-row INSERT is typically used for bulk loads, where the DEFAULT keyword almost never appears.

UPDATE only tidies what the user wrote; DELETE leaves the targetlist alone

UPDATE goes through the same normalization function as INSERT, but it does much less work than INSERT does. There is no need to look for missing columns to fill. The SET clause of an UPDATE only puts the columns the user explicitly listed into the targetList, and the columns the user did not list must keep the existing row's values. So there is nothing to fill in for empty slots. The two things the rewriter does for UPDATE are: first, resolve any DEFAULT keyword the user wrote into the actual expression (the same SetToDefault substitution mechanism as in INSERT); second, combine separate-line updates of parts of the same column into a single expression. The second item needs a brief explanation.

A column in PostgreSQL can be more than a simple scalar (int, text, etc.); it can be a composite type or an array. You can write a SET clause that updates only part of such a column. For example, if address is a composite-type column shaped like (street, city, zip), you can update two different fields of the same column in one UPDATE.

UPDATE accounts
   SET address.street = '101 Pine St',
       address.city   = 'Seattle'
   WHERE id = 1;

Or you can update different subscripts of an array column tags:

UPDATE accounts
   SET tags[2] = 'premium',
       tags[4] = 'verified'
   WHERE id = 1;

The two SET items look like the same column name (address, tags) appearing twice in the SQL text, but they really refer to different parts of the column. PostgreSQL combines the two lines into a single expression that operates on the column as a whole. address.street = '101 Pine St' followed by address.city = 'Seattle' becomes "take the original address, change street first, then take the result and change city" as one expression. As a result, even though the column name appears twice in the SET clause, at execution time the column is updated once.

When the entire column is assigned twice (UPDATE t SET address = X, address = Y, replacing the whole column twice), that is no longer a partial update but a genuine conflict, and the rewriter throws a syntax error. The combining only happens when the two SET items point to different parts of the same column.

DELETE is simpler still. In the DML branch, DELETE has a one-line case body that reads "Nothing to do here." DELETE produces no column values. It only decides which rows to remove, so there is nothing to do to the targetList itself. None of the things we have seen so far (INSERT's default filling, the DEFAULT keyword substitution for INSERT/UPDATE, UPDATE's partial-update combining) apply to DELETE. The amount of work the rewriter does is set by how much responsibility the user's query has for producing row data.

That covers the DML's targetlist normalization. We have walked through the filling-in-empty-slots half of DML rewriting. The other half, redirecting references to point at different places, is covered in 1.3.4 through RETURNING mapping and DML-on-view transformation.

What this means in practice

First, when an INSERT feels slow, default expressions are the first thing to suspect. Even if the user does not name a column, the default expression for it still runs on every INSERT. If an expression in pg_attrdef calls a function (now(), uuid_generate_v4()) or a sub-query, that function or sub-query runs once per missing column on every INSERT. Sequence consumption, statistics counters, side-effect function calls, all of these happen exactly as if the user had provided the value explicitly. When INSERT performance looks off, the fastest first check is to inspect what each default expression evaluates to via \d+ <table>. Avoiding heavy sub-queries in default expressions follows the same reasoning.

Second, RULE definitions can safely reference NEW.<column> even for columns without a default. With a rule like the forward_deleted_at example from the body text, there is no need to separately check whether the user named deleted_at. If they did, the value flows through as written; if they did not, NULL flows through. Missing data is uniformly carried as NULL, so the rule author does not have to branch on every possible combination of "which columns did the user actually write." This safety guarantee shows up often in INSERT triggers and audit rules.

1.3.2 How RLS Rewrites Queries (Mechanism)

JoongHyuk Shin — Sun, 17 May 2026 10:53:10 +0000

Once you put row-level security (RLS) on a table, queries that read from that table don't run the way the user wrote them. A single line SELECT * FROM accounts comes out of the rewriter looking more like SELECT * FROM accounts WHERE owner = current_user. The user never wrote a WHERE clause, but it's there. What RLS as a feature does is covered in 7.5. Here we look at what that "silently changes" actually is, inside Postgres.

The qual the rewriter adds

The core of RLS is simpler than it sounds. It takes the expression written in the policy and grafts it onto the target table as a qual. That's the whole mechanism. What PG calls a qual is a shorthand for a boolean expression that filters rows, the kind of thing that lives in a WHERE clause (short for "qualification", and the same concept as what the SQL standard usually calls a predicate). If a query is the whole tree carrying SELECT/FROM/WHERE and so on, a qual is one expression node inside that tree. RLS doesn't run a separate enforcement engine. It plants the policy's expression into the query tree and deforms the tree itself.

This work happens at the same site as view expansion (covered in 1.3.1). The RIR pass, driven by fireRIRrules, finishes everything else first (CTE handling, view expansion, sublink recursion) and then, as the last step, applies RLS policies. It walks the query's range table one more time and, for each RTE that points at a regular relation, attaches the policy's conditions.

The place those conditions go is the securityQuals field on the RTE. The comment in the PG source describes the field this way:

/*
 * securityQuals is a list of security barrier quals (boolean expressions),
 * to be tested in the listed order before returning a row from the
 * relation.  It is always NIL in parser output.  Entries are added by the
 * rewriter to implement security-barrier views and/or row-level security.
 */

Two things stand out. First, securityQuals is filled by the rewriter, not the parser. Conditions appear on the RTE that weren't in the user's query at all. Second, RLS and security-barrier views share the same container. A security-barrier view (defined with CREATE VIEW ... WITH (security_barrier)) is one whose filter conditions the planner is forbidden from reordering against the user's WHERE clauses. Its purpose is essentially the same as RLS, so PG implements both on top of the same machinery. Even though 1.3.1 showed view expansion turning a view reference into a sub-query, the conditions of a security-barrier view and the conditions of an RLS policy end up in the very same securityQuals list. (The leaky view attack scenarios that security-barrier views guard against, and the operational pitfalls, are covered in 7.5.3.)

RLS fills the same container the same way. A "policy" in RLS is a catalog object created by CREATE POLICY. A single policy bundles five things together: a name, which commands it applies to (SELECT/INSERT/UPDATE/DELETE or ALL), which roles it applies to, the USING expression that filters existing rows, and the WITH CHECK expression that validates new rows. It's stored as one row in the pg_policy catalog, and it rides along when the table's relcache is loaded. The rewriter pulls the USING and WITH CHECK expressions out of that catalog row and grafts them onto the query tree.

Suppose the accounts table has this policy:

CREATE POLICY account_owner ON accounts
    FOR SELECT USING (owner = current_user);

The user runs SELECT * FROM accounts WHERE balance > 0. After the analyzer, the Query tree's RTE for accounts has an empty securityQuals. When the rewriter's RLS pass visits that RTE, it pulls the policy's USING expression owner = current_user, copies it, fixes up the Var references so they point at the right relation, and drops it into securityQuals. What the planner ends up seeing, conceptually, is something like:

SELECT * FROM accounts WHERE (owner = current_user) AND (balance > 0)

owner = current_user is a condition the user never wrote. The rewriter pulled it out of the policy and stuck it in.

One thing to flag, though. That SQL above is a conceptual view. PG isn't actually rewriting the user's SQL text into that form. The text the user submitted stays as it was. What gets transformed is the Query tree it was parsed into. The policy condition lands on the accounts RTE's securityQuals field inside that tree. The SQL above is just a human-readable rendering of the tree-level change. What pg_stat_statements records and what shows up in the logs is the original SQL the user typed. The RLS transformation happens at the tree level, not at the text level.

Why the rewriter, not somewhere else

Doing RLS as a query transformation is itself a design choice. Row-level access control can be built in other ways. One alternative is to add a separate filter stage in the execution engine that evaluates the policy condition as each row comes off the relation. PG didn't go that route. It does the work in the rewriter, planting the condition into the query tree.

The reason lies one step downstream, in the planner. Once the rewriter has folded the policy condition in as a normal qual, the planner doesn't see it as anything special. It looks like just another condition in the query. So if there's an index on the column, the planner can pick an index scan. If the condition can be pushed below a join, the planner can push it. It estimates selectivity from statistics like any other qual. The RLS condition benefits from the planner's full optimization machinery.

A separate runtime filter wouldn't get any of this. The filter tends to bolt onto the tail of an already-built plan, which means the table gets fully scanned before the policy condition is applied. There's no path to index the policy condition. Putting RLS in the rewriter is the choice that keeps access control inside the part of the query the planner can optimize.

There's also a reason fireRIRrules saves RLS for last. A policy's condition can contain a sub-query (for example, USING (owner IN (SELECT ...))), and adding such a condition means the infinite-recursion check needs to run again on the new condition. If RLS were folded into the view-expansion loop above, the tree walk for view expansion would end up visiting the sub-queries inside RLS conditions twice. Pulling RLS out into a separate pass after view expansion avoids that.

The qual that filters and the qual that checks

A policy can carry two kinds of condition, and the two land in different parts of the query tree and get enforced in different ways. This distinction is the most important point in the whole RLS mechanism.

The USING condition decides which of the existing rows in the table are visible. The owner = current_user from before is a USING. It goes into securityQuals and behaves like a WHERE clause. Rows that don't match are simply absent from the result. They disappear silently.

The WITH CHECK condition decides which new rows are allowed in. Rows added by INSERT or UPDATE have to satisfy it. This condition doesn't go into securityQuals. It goes into a separate list on the Query node called withCheckOptions, and it gets enforced differently. When a row violates the condition, the row isn't quietly dropped. PG raises an error.

Why is one silent and the other loud? The scenarios make it obvious. Suppose the user runs SELECT * FROM accounts and there happen to be 100 accounts they don't own. It's natural for the USING condition to remove those from the result. The user asked for "rows I can see," and the unseen rows being missing is the expected behavior. It's not an error.

Now suppose the user runs INSERT INTO accounts VALUES (...) trying to insert an account owned by someone else. If that row were silently dropped, the user would believe the INSERT succeeded while the data isn't actually there. The command reported success, but there's no result behind it. That's the worst kind of bug to leave for the application. So PG raises an error on WITH CHECK violations. The comment in rowsecurity.c spells out this intent: the check option exists specifically to make sure a policy violation is signaled as an error, because otherwise rows you were trying to add could silently disappear.

So USING is a visibility filter and WITH CHECK is a write gate. One is silent, the other is loud. The same RLS policy gets unpacked into different containers and different enforcement on the read path versus the write path.

There's one convenience. If a policy doesn't specify an explicit WITH CHECK, PG reuses the USING condition as the WITH CHECK. The reasoning is that "only rows you can read can be written" is a reasonable default in most cases. Specifying a separate WITH CHECK lets you write policies where the read condition and the write condition differ.

How policies get combined

A single table can carry multiple policies. By what rule does the rewriter combine them into one condition?

Policies come in two flavors: permissive and restrictive. A permissive policy says "allow if this condition is met," and multiple permissive policies are combined with OR. Any one of them passing is enough to make the row visible. A restrictive policy says "must satisfy this condition," and they're combined with AND. Violating any single restrictive policy hides the row. The rewriter filters the policies by command type and role, then ORs the permissive ones into a single chunk and ANDs the restrictive ones onto it.

Imagine the accounts table has three policies. One is a permissive policy that lets users see accounts they own, with the condition owner = current_user. Another is a permissive policy that also lets users see accounts flagged as public, with is_public = true. The third is a restrictive policy that blocks deleted accounts no matter what, with deleted_at IS NULL. When the user runs a SELECT, the rewriter combines the three into the following condition on accounts's securityQuals:

(owner = current_user OR is_public = true) AND deleted_at IS NULL

The two permissive conditions OR together to form an allow-channel ("either I own it, or it's public"), and the single restrictive condition ANDs on top to enforce "and also, not deleted." Adding more permissive policies widens the allow-channel, so more rows become visible. Adding more restrictive policies adds more gates that have to pass, so fewer rows become visible. The general shape is (permissive1 OR permissive2 OR ...) AND restrictive1 AND restrictive2 AND ....

There's one rule worth highlighting. If there are no permissive policies at all, no row is visible. That holds even if there are plenty of restrictive policies on the table. The reason is that permissive policies are the grounds for visibility, and restrictive policies are constraints that trim what's already allowed. With no grounds for visibility, there's nothing to trim. The rewriter handles this by putting a single false constant into securityQuals. The effective condition becomes WHERE false, and no row passes.

That's RLS's default-deny behavior. If you run ALTER TABLE ... ENABLE ROW LEVEL SECURITY on a table without creating a single policy, no one (except the table owner or a user with BYPASSRLS) sees a single row. It might feel counterintuitive that not writing a WHERE clause results in zero rows, but RLS doesn't sit on the same layer as the user's WHERE. It's an access-control gate sitting above that layer. The gate's default being deny is the same fail-closed principle behind firewalls and filesystem ACLs. Zero policies means zero rules granting passage, so zero rows pass through, which is internally consistent. Turning RLS on is in itself a declaration that "anything not explicitly allowed is forbidden," and that's the safe default that prevents one forgotten policy from leaking data.

Security barrier: quals have an order

That securityQuals is not just a list but an ordered list is the last piece of the RLS mechanism. The field comment said "tested in the listed order," and that order matters.

Why does it matter? After the rewriter plants the policy conditions, the planner picks each element of securityQuals out (in process_security_barrier_quals) and moves it into the rel's qual list. As it does so, it assigns each element a distinct security_level. Earlier elements get a lower level, later elements a higher level. Conditions from the user's own WHERE clause get an even higher level than any of those.

The level enforces evaluation order. Lower-level conditions have to be evaluated before higher-level ones. The concrete thing it prevents is a function the user wrote in WHERE getting evaluated before the RLS condition.

Picture this query:

SELECT * FROM accounts WHERE leaky_function(account_number);

Suppose leaky_function is a function that leaks its argument value somewhere. The leak can take several forms. A PL/pgSQL function might write the argument to the server log with RAISE NOTICE '%', $1. Or you can write a function that deliberately raises an error for certain argument values, like 1 / (CASE WHEN account_number = '...' THEN 0 ELSE 1 END) triggering a division-by-zero. When PG raises a runtime error like that, the error context includes the call site and the argument expression, so the argument value gets carried out into the user-visible error message or the server log. Any path where information about the argument escapes the function through something other than its return value makes the function leaky. (The opposite is leakproof, a function guaranteed to have no such paths.) PG groups all such functions under the term leaky functions.

The RLS policy is supposed to only show rows where owner = current_user. But if the planner, for optimization reasons, evaluates leaky_function before the RLS condition, the function receives the account_number of rows the user has no right to see as its argument. Even if the policy then removes those rows from the result, the function already saw the values and leaked them, into the server log or the error message. RLS has been pierced.

The security_level prevents this. The RLS condition is at a lower level, and the user's WHERE conditions are at a higher level. The planner therefore cannot push a user condition containing a non-leakproof function below the RLS condition. The RLS condition filters first, and only the rows that pass make it to the user's function. This is the substance of the name "security barrier": qual order keeps optimization from crossing the security boundary.

There's one more thing about the order. The rewriter attaches the RLS conditions to the front of rte->securityQuals. If a table has both an RLS policy and conditions coming in from a security-barrier view, the RLS condition on the table itself ends up at the lower level, the position that gets evaluated first. The decision is that policies bound directly to a table take precedence over conditions coming in through a view.

What this means in practice

First, the performance of queries on an RLS-protected table comes down to whether the policy condition can use an index. The RLS condition isn't a separate filter, it's a regular qual merged into the query, so the planner may or may not be able to use an index for it. A simple USING (owner = current_user) will use an index scan if there's one on owner. A USING (owner IN (SELECT ... FROM ...)) with a sub-query, on the other hand, attaches that sub-query to every query against the table. If queries on an RLS table feel slow, the first step is to use EXPLAIN to see how the policy condition is being executed. Policy conditions are part of index design too.

Second, if an INSERT under RLS completes without error, the row actually went in. WITH CHECK violations are loud, not silent. Rows blocked by the policy don't vanish behind a success response, so you can trust that on the read side. What's a separate concern is whether the application code is catching that error and handling it properly. Make sure WITH CHECK violation errors are routed through the same handling path as ordinary constraint violations.

Third, turning RLS on without creating a policy locks the table down entirely. ENABLE ROW LEVEL SECURITY is itself a default-deny declaration. If a migration enables RLS in one statement and creates policies in another, and the second statement is missing, any code deployed in between gets empty results. To make matters worse, table owners bypass RLS, so the problem won't surface in a development environment where you connect as the owner; it only shows up in production. The lesson is to put the RLS-enable statement and the policy creation in a single transaction, and to test at least once with a role that isn't the owner.

1.3.1 Rule system and view expansion

JoongHyuk Shin — Thu, 14 May 2026 10:49:29 +0000

This section is about how PostgreSQL makes a view reference disappear. When a user writes SELECT * FROM my_view, the planner no longer sees my_view in that query. As soon as PostgreSQL receives the query, it moves the SELECT definition that the view name points to into that spot. Where and how that substitution happens, and how the fact that the mechanism rides on PostgreSQL's old RULE system surfaces in application code, is the subject of this section.

View and materialized view: what is the difference

Views themselves are worth one paragraph of setup. PostgreSQL has two kinds of views: the plain view and the materialized view.

A plain view holds no data. A view definition is a SELECT line (or a larger query), and every time you SELECT from the view, you get back the result of running that definition SELECT freshly. The view itself has no storage. Only the view definition is registered in the catalog, and each time a query comes in, that definition gets expanded to read data from the base tables. The subject of this section is the expansion of plain views.

A materialized view stores its data. It runs the view definition's SELECT once, keeps the result on disk, and later SELECTs read that stored result directly. When the base table data changes, the materialized view's result is not refreshed automatically; the user has to explicitly run REFRESH MATERIALIZED VIEW to recompute it. So the rewriter leaves materialized views alone, without expanding them.

Aspect	Plain view	Materialized view
Data storage	None	Yes (on disk)
Behavior on SELECT	Runs definition SELECT each time	Reads stored result
When updated	Reflects base table changes immediately	Updated only on `REFRESH`
Rewriter handling	Expanded into definition SELECT	Left as is

With this picture in mind, it becomes clear that this section deals with how PostgreSQL moves the definition SELECT into place for plain views.

PostgreSQL's RULE system

PostgreSQL has a mechanism called the RULE system. It is the system responsible for rewriting the query itself, right after the query arrives at the backend and before it is actually executed. As noted in the 1.3 chapter introduction, the rewriter is the stage that changes the form of a query while preserving its meaning, and the RULE system is the tool that decides which rules to apply for that form change. (Transformations that change the meaning itself, such as subquery unnesting or predicate push-down, are the planner's responsibility in PostgreSQL and are covered in chapter 1.4.)

A rule is a bundle of four elements.

Element	Meaning	Example
Target relation	Which table/view it applies to	the `users` table
Event type	Which command triggers it	one of INSERT, UPDATE, DELETE, SELECT
Condition	Under what condition it fires (optional)	`WHERE NEW.role = 'admin'`
Action	What it does when fired	replace the original command with another query, or run an additional query

This bundle is stored as one row in the pg_rewrite catalog. A user can create a rule directly with the CREATE RULE command, and PostgreSQL also registers one rule automatically when a view is created.

Comparing it to the trigger, which does a similar job, makes the position of the RULE clear. A trigger acts at the executor stage, hooking into each individual row. This row was inserted, so add a row to the audit table, that kind of thing. A rule rewrites the query as a whole at the rewriter stage. Query A came in, so rewrite it as query B and run that, that kind of thing. If a trigger is a "per-row side effect," a rule is a "per-query transformation."

Here is one example of a rule a user might define directly. Suppose you want to record into an audit table every time an INSERT happens on the users table.

CREATE RULE log_user_insert AS
    ON INSERT TO users
    DO ALSO INSERT INTO user_audit_log (user_id, action, ts)
            VALUES (NEW.id, 'INSERT', now());

DO ALSO means "run the original INSERT as is, and additionally run this audit INSERT too." With this rule in place, when you fire INSERT INTO users (id, name) VALUES (1, 'alice'), the result of passing through the rewriter is two Query trees. One is the original INSERT into users, the other is the INSERT into user_audit_log. Both queries go through the planner and executor and get executed. This is what it looks like for one SQL text to fan out into multiple statements as it passes through the rewriter. (DO INSTEAD replaces the original query; DO ALSO, or the default, adds to the original query.)

A user-defined rule is one use case of the RULE system. But the rule you encounter most often in ordinary applications is not one a user made directly. It is the SELECT rule that PostgreSQL registers automatically when you create a view.

A view is a special case of the RULE system

Creating a view in PostgreSQL means registering two things at once.

One is the registration of an empty table. When you run CREATE VIEW my_view AS SELECT ..., PostgreSQL adds a row to pg_class. It gets an OID like an ordinary table, and column definitions go into pg_attribute too. The only difference is that the relkind column is marked 'v' (view). Looking at just this row, a view is an empty table. No file is even created to store data.

The other is the registration of a SELECT substitution promise. PostgreSQL adds a row to the pg_rewrite catalog, and that row is the promise: "when this view is queried with SELECT, instead of returning an empty result, run the SELECT written in the view definition and return that result." The formal name of this promise is the ON SELECT DO INSTEAD SELECT rule. Unpacked word by word:

ON SELECT: which event it fires on. When a query that reads this relation with SELECT comes in.
DO INSTEAD: how the original action is handled. Instead of the original action (querying this empty view as is).
SELECT ...: what gets run. The SELECT written in the view definition.

Putting the three parts together gives the promise: "when this view is queried with SELECT, do not return 0 rows from the empty table; run the view definition's SELECT and return that result." PostgreSQL does not keep a view as a separate data structure; it expresses the concept of a view with one empty table and this one-line promise. Mapping it to the general rule definition from the previous section: the view's automatic rule has the view itself as target, SELECT as event, no condition, and the execution of the definition SELECT as action.

This automatic SELECT rule for views has two constraints. There is exactly one ON SELECT rule per view, and that rule fires unconditionally. A single line of PostgreSQL code captures this fact directly.

"RIR" stands for "Retrieve-Instead-Retrieve", that is an ON SELECT DO INSTEAD SELECT rule (which has to be unconditional and where only one rule can exist on each relation).

The abbreviation RIR shows up all over the PostgreSQL code, and its identity is in this one line. "retrieve" is the SELECT keyword from the POSTQUEL era, and its trace remains in the abbreviation. Two historical names you will often encounter in PostgreSQL material are worth a brief note.

POSTGRES (all caps): the name of the database project started at UC Berkeley in 1986. Led by Michael Stonebraker as the follow-on research project, it meant the next generation after Ingres.
POSTQUEL: the name of the query language that the POSTGRES project used. It was designed as the successor to QUEL (Ingres's query language). It is not SQL.

When SQL was adopted in 1995, the project renamed itself to PostgreSQL and POSTQUEL disappeared, but traces of that era remain in abbreviations like RIR and in some function names.

The rewriter works in two steps

The rewriter has a single entry point function, but inside it two steps run in sequence.

Step 1: applying the INSERT/UPDATE/DELETE rules that the user defined with CREATE RULE. The audit rule we saw earlier fires at this step. One input query can fan out into zero or several queries.
Step 2: applying RIR rules. That is, firing the view's automatic ON SELECT rule to expand a view reference into its definition SELECT. It is applied to each of the queries produced by Step 1.

What the separation of the two steps means is clear. View expansion always happens last. Whether a user-defined rule fans one query into several or changes it into something else, if a view reference remains in the result, it all gets resolved in Step 2. By the time it reaches the planner, no view reference remains anywhere.

Step 2 does one thing. It walks the query's range table, and when it meets a view reference, it expands the view's definition SELECT into that spot.

The expansion mechanism resolves cleanly thanks to the unification of the RTE data structure we saw in 1.2.3. In the Query that the analyzer produced, the view reference spot holds an RTE of kind RTE_RELATION (that is, an RTE pointing at an ordinary table or view). After passing through rewriter Step 2, that spot becomes kind RTE_SUBQUERY (an RTE that holds a sub-Query inside it), and the view definition's SELECT sits inside it. The item at slot N of the range table just changes kind and stays in the same spot.

Suppose a view v_active_users is defined as SELECT id, name FROM users WHERE deleted_at IS NULL. Here is what the one line SELECT * FROM v_active_users looks like before and after the rewriter.

before rewriter (analyzer output)        after rewriter (planner input)

Query                                    Query
├─ commandType = CMD_SELECT              ├─ commandType = CMD_SELECT
├─ rtable                                ├─ rtable
│   └─ RangeTblEntry                     │   └─ RangeTblEntry
│       (rtekind = RTE_RELATION,         │       (rtekind = RTE_SUBQUERY,
│        relid = OID(v_active_users))    │        subquery = Query{
│                                        │             rtable = [users RTE],
│                                        │             targetList = [id, name],
│                                        │             qual = (deleted_at IS NULL)
│                                        │        })
├─ jointree → RangeTblRef(1)             ├─ jointree → RangeTblRef(1)
└─ targetList: SELECT *                  └─ targetList: SELECT *

Slot 1 of the range table stays in place. Only the kind changed from RTE_RELATION to RTE_SUBQUERY, and the view definition sits inside it whole. Thanks to this unification, the planner, the stage after the rewriter, does not know the concept of a view at all. The Query the planner receives is just an ordinary SELECT with a sub-query expanded into a slot. This is why, when you look at EXPLAIN output, the view name disappears and base table names show up directly.

A view definition can contain another view inside it. In that case the expanded sub-Query is recursively passed to the same function to resolve the inner view too. The problem is when a view references itself (either directly or through another view). Left as is, that becomes an infinite recursion of expanding and expanding again. To prevent this, PostgreSQL carries a list of the OIDs of views currently being processed, and when it meets a view already in that list, it throws the error infinite recursion detected in rules for relation "..." and aborts query execution. It chose to cut things off with a clear error rather than spin forever.

Step 2 exceptions: materialized views and EXCLUDED

The Step 2 view expansion logic has two explicit exceptions.

First, materialized views are not expanded. As we saw earlier, a materialized view reads its stored result directly, so there is no need to expand its definition on every query. Step 2 passes an RTE with relkind 'm' straight through.

Second, the virtual name EXCLUDED in INSERT ... ON CONFLICT is not expanded. To understand EXCLUDED, we need to briefly touch PostgreSQL's UPSERT syntax.

PostgreSQL has the INSERT ... ON CONFLICT (...) DO UPDATE SET ... syntax. Commonly called UPSERT, it means "attempt the INSERT, but if it hits a unique constraint or similar and a conflict occurs, run an UPDATE in its place instead." For example, in a table that records a user's login count, you can handle "if it is a new user, add it; if it already exists, increment the count" in one query.

INSERT INTO users (id, name, login_count)
VALUES (1, 'alice', 1)
ON CONFLICT (id) DO UPDATE SET
    name = EXCLUDED.name,                   -- the incoming value 'alice'
    login_count = users.login_count + 1;    -- the existing row's value + 1

If a row with id = 1 already exists, the INSERT conflicts, and the UPDATE in the ON CONFLICT clause runs against that row. Here the UPDATE must be able to reference both rows: the row already in the DB (referenced as users.column) and the row that tried to come in via INSERT but was blocked (referenced as EXCLUDED.column). The name EXCLUDED carries the meaning "the values excluded because of the conflict." In the example above, EXCLUDED.name is 'alice', and users.login_count points at the existing count stored in the DB.

EXCLUDED is neither a real table nor a view. It is a virtual row name that exists only inside the UPSERT syntax. Internally PostgreSQL does represent EXCLUDED as an RTE_RELATION RTE, but the view expansion logic must not wrongly fire and try to expand this RTE (there is no definition to expand). So when the rewriter meets this RTE, it skips it with a separate branch.

The RULE system has many pitfalls and is not recommended

The fact that view expansion rides on the RULE system is a trace of PostgreSQL's history. The RULE system existed from the POSTGRES era, and views were implemented as one branch of that system. The idea was that if a user directly defined INSERT/UPDATE/DELETE rules on a view with CREATE RULE, that view could be used as an update target.

The RULE system itself is powerful but full of pitfalls. The same rule getting evaluated multiple times, cascading happening differently than intended, permission checks behaving subtly: problems come up often. In PostgreSQL 9.3, the automatically updatable view feature was introduced to sidestep those pitfalls. When a user fires INSERT/UPDATE/DELETE against a simple view (one that, say, just picks some columns from a single base table), PostgreSQL analyzes the view definition and automatically converts it into INSERT/UPDATE/DELETE against the base table, with no rule definition needed. For complex views that cannot meet the automatically-updatable conditions, solving it with an INSTEAD OF trigger is the standard approach today. PostgreSQL's official documentation recommends triggers over rules for INSERT/UPDATE/DELETE.

In many cases, tasks that could be performed by rules on
INSERT/UPDATE/DELETE are better done with triggers. Triggers are
notationally a bit more complicated, but their semantics are much
simpler to understand. Rules tend to have surprising results when
the original query contains volatile functions: volatile functions
may get executed more times than expected in the process of carrying
out the rules.

The ON SELECT rule generated automatically when a view is created is something users rarely deal with directly. PostgreSQL installs it, and the rewriter expands it. The case where a user faces the RULE system is usually when they want to make a complex view updatable that automatically updatable views cannot handle, and even then an INSTEAD OF trigger is safer and less work.

In other RDBMS engines, view expansion is usually handled as a separate stage. PostgreSQL's decision was to fold view expansion into one branch of general rule application. The result of that unification is that data structures and code paths gather in one place, but the cost is that the RULE system, an old abstraction, is exposed to users as is.

What this means in practice

First, when a view name disappears from EXPLAIN output, the plan is not wrong. Since the view reference gets expanded into its definition SELECT at the rewriter stage, the query the planner receives has no concept of a view at all. It is normal for base table names to show up in EXPLAIN output instead of the view name you wrote. The more complex the view definition, the more you need to look at the base tables in the EXPLAIN output and map them back to the view definition.

Second, when making a view updatable, consider an automatically updatable view or an INSTEAD OF trigger before a RULE. For a simple view that meets the automatically-updatable conditions, INSERT/UPDATE/DELETE work with no separate definition. If it cannot meet the conditions, an INSTEAD OF trigger is safer than a user-defined RULE. Defining a RULE directly is the last resort.

Third, when you change a view definition with ALTER, every query that calls that view gets expanded with the new definition at its next analysis. A view definition is stored as a SELECT rule in the catalog, and view expansion has the rewriter re-read that catalog entry at execution time and expand it. When the catalog changes, an invalidation message propagates to the plan cache, and on the next execution the result is planned with a fresh expansion using the new definition (see 1.2.3 plan cache). From the application side, this means you can use a view as a code abstraction. Just changing the view definition consistently updates the behavior of every query that uses that view from the next call onward.

1.3 Rewriter: How a Query is Rewritten

JoongHyuk Shin — Thu, 14 May 2026 10:49:28 +0000

By the time the analysis stage from 1.2 finishes, the SQL has become a Query tree. Catalog coordinates are baked in, locks (as we saw in 1.2.2) are already held from that point, and the type of every expression is determined. The starting point of this chapter is one fact: that Query tree does not go straight to the planner. There is one more stage in between, and that stage is the rewriter.

The rewriter's input and output are the same. It takes a Query tree and produces a Query tree. The number of trees that come out can be zero, one, or several. That means a single user SQL can fan out into multiple statements as it passes through the rewriter. The key point is that the input format and the output format are identical. The planner only receives Query trees that the rewriter passed through, and it does not distinguish whether a tree is exactly what the user wrote or the result of a transformation.

Why does this stage exist on its own? Analysis is the work of connecting the meaning of SQL to the catalog, and planning is the work of deciding how to execute that meaning. The transformation that sits between them, the one that does not change what the query means but does change the form of the query itself, is the rewriter's responsibility. Expanding a view reference into its underlying SELECT, injecting the WHERE condition of a row-level security (RLS) policy into the query, filling in defaults for columns omitted from an INSERT: all of this happens here.

1.3 splits into three sections.

1.3.1 Rule system and view expansion: the mechanism by which a view reference gets expanded in place into a sub-query. This expansion rides on the RULE system that PostgreSQL has carried since its early days, and we cover how that fact surfaces in application code.
1.3.2 How RLS rewrites queries: when a row-level security policy is defined in the catalog, how the rewriter weaves that policy into the query as a WHERE condition.
1.3.3 INSERT/UPDATE/DELETE rewriting: filling in default values, handling RETURNING, and the flow that turns an automatically updatable view into INSERT/UPDATE/DELETE against the view's base table.

By the end of this chapter, you should have a clear picture of where and how "the SQL I wrote" and "the SQL the planner saw" diverge. The three places they diverge are view expansion, RLS policy injection, and INSERT/UPDATE/DELETE rewriting. When EXPLAIN output or permission behavior does not match your intuition, recalling these three places is the start of an accurate trace.

1.2.3 Query tree node types: Query, RangeTblEntry, TargetEntry

JoongHyuk Shin — Sun, 10 May 2026 08:05:36 +0000

After the Analyze stage from 1.2.2, the raw parse tree has become a Query tree. This section is about the shape of that result. How it differs from a raw parse tree, what abstraction sits at its core, and what assumptions the next stages (rewriter and planner) get to start from when they receive a Query tree.

Tokens descend into catalog coordinates

A raw parse tree and a Query tree are both trees, but they sit on different planes. A raw parse tree mirrors the syntactic structure of SQL text, so its nodes look a lot like tokens. Where a table name appeared, you find a node carrying that name as a token. Where a column appeared, you find another name token. The types of expressions are not yet determined.

A Query tree is what those tokens look like after they have been resolved against the catalog. The same positions hold different representations now. Where a table name was, there is now an RTE carrying that table's OID and lock mode. Where a column name was, there is now an index pair: "the M-th attribute of the N-th item in rtable". Where an operator token sat, there is now an OID pointing to the function that implements that operator. The same shape of tree carries the same kinds of information at the same positions, but the information has dropped one level down from surface SQL text into catalog coordinates.

For SELECT id, name FROM users WHERE age > 18, the contrast looks like this:

raw parse tree (sketch)          Query tree (sketch)

SelectStmt                       Query
├─ targetList                    ├─ commandType = CMD_SELECT
│   ├─ ColumnRef "id"            ├─ rtable
│   └─ ColumnRef "name"          │   └─ RangeTblEntry
├─ fromClause                    │       (relid = 16384, RTE_RELATION)
│   └─ RangeVar "users"          ├─ jointree (FROM + WHERE)
└─ whereClause                   ├─ targetList
    └─ A_Expr                    │   ├─ TargetEntry(resno=1, expr=Var(rt=1, att=1))
        ('>', ColumnRef "age",   │   └─ TargetEntry(resno=2, expr=Var(rt=1, att=2))
              A_Const 18)        └─ qual: OpExpr(opno=...,
                                            args=[Var(rt=1, att=3), Const(int4=18)])

The descent into catalog coordinates is right there in the picture. Table names become OIDs, column names become pairs of (rtable index, attribute number), operator tokens become typed expression nodes. The rewriter and planner work on those coordinates and have nothing left to resolve.

PG's decision to fold every kind of FROM item into a single enum looks unremarkable at first, but its effect compounds. If each kind of input were a different node type, the rewriter and planner would scatter kind-specific handling across every branch they touch. With one node type and an enum tag, transformations that cut across many places (think "scan every RTE and check locks", "inject an RLS policy on every RTE") fit into one place consistently.

Range table: every FROM input in one place

The Query carries a single List that holds every item appearing in the FROM clause. PG calls this list the range table, and an item in it a RangeTblEntry. A plain table, a subquery (FROM (SELECT ...)), a result of an explicit JOIN, a set-returning function call (FROM generate_series(...)), an inline row set built with VALUES (...), a CTE (WITH ... AS (...)). Whatever form a FROM input takes, it gets folded into a single entry in this list.

The kind tag lives in one field, rtekind, on the RTE. There is no separate node type per kind of input; the same RangeTblEntry object carries different rtekind values to distinguish them. The object type is one, and the value of rtekind decides which auxiliary fields are meaningful. For a plain table the OID and lock mode fields matter; for a subquery the sub-Query field matters; and so on.

You can see the payoff of this consolidation in the shape of the rewriter and planner code. Both stages, having received a Query, work to a single model: "take the N items in rtable, glue them together via jointree, and evaluate targetList against the result." A subquery, a JOIN, a function call all flow through the same model. Adding a new kind of input is a matter of adding one line to RTEKind and filling in a few branches.

The range table plays another role. Column references look at it. In a Query tree, a column is not a name; it is two integers, "attribute M of rtable item N". A column reference's meaning is bound to the order of items in the range table. That is why a rewriter or planner that adds, removes, or reorders rtable items must update the indices in existing column references at the same time.

Target list: SELECT, INSERT, UPDATE, RETURNING all in one shape

The other key List on a Query is the target list. The output columns of a SELECT, the values to assign in INSERT/UPDATE, and the columns returned by RETURNING all live in lists of the same shape with the same kind of node. One target list per Query for the SELECT/INSERT/UPDATE body, plus a second one for RETURNING when present.

The shape is one, but its meaning shifts slightly with context. Each entry has a field called resno (short for result number, indicating where the entry lands in the result), and the interpretation of this number depends on whether you are looking at SELECT, INSERT, or UPDATE. Take this table to compare the two cases.

CREATE TABLE users (
    id    serial PRIMARY KEY,  -- attno 1
    name  text,                 -- attno 2
    age   int                   -- attno 3
);

For SELECT, resno is the ordinal position of the output column.

SELECT name, age FROM users;

This SELECT's target list has two entries. The first, holding the name expression, has resno 1; the second, holding age, has resno 2. They land in the first and second positions of the result set, respectively. Even though name has attno 2 and age has attno 3 in the table, the result positions in SELECT are independent of those attnos: they go 1, 2, in the order they appear in the SELECT list.

For INSERT, resno is the attribute number of the destination column.

INSERT INTO users (name, age) VALUES ('alice', 20);

This INSERT's target list also has two entries, but the first's resno is not 1, the second's not 2. They are 2 and 3, the attnos of name and age in the table definition. The number that lands in resno is not the order in which the columns were written, but the position of the column in the table itself. UPDATE's SET clause follows the same rule: attno in resno.

The fact that resno carries different meanings in different contexts feels odd at first, but folding two meanings into one List buys clear leverage. Expression evaluation code, target list traversal, RETURNING handling all assume one node type. The branching shrinks. The context-dependent meaning is settled when the List is built, and downstream stages just read what is already there.

The target list also carries items invisible to the user. When an expression appears in ORDER BY or GROUP BY but not in the SELECT list, that expression is added to the target list with a "junk" mark. Sorting and grouping need a place to evaluate the value, so the value rides along in the same List. Junk columns get evaluated but stripped from the final output. The visible result is clean from the application side, while internally a few extra columns flow alongside to make sorting work.

Where the Query goes

Right after a Query tree is built, the rewriter takes it. RULE expansion, view expansion, and row-level security (RLS) policy injection all happen on the Query tree at this stage. The output is still a Query tree (or a list of zero or more, when a rule expands one statement into several). Details are covered in chapter 1.3.

The rewriter hands the Query off to the planner. The planner's job is to convert a Query tree into a PlannedStmt: the strategy for how this SQL will be executed. Which indexes to use, what JOIN order to take, whether to go parallel. Chapter 1.4 covers this in depth.

What is interesting is that the next stage, the executor, does not look at the Query tree. The Query definition in parsenodes.h says it in one line: "the Query structure is not used by the executor." The executor's input is a PlannedStmt, and the Query tree, once turned into a plan, does not flow further down. The Query tree is the common currency of parser/analyzer/rewriter/planner, and that is where its lifetime ends.

This separation of stages is reflected directly in how prepared statements work. A prepared statement is a mechanism to avoid going through the parser every time the same SQL runs repeatedly, by holding the analyzed result or plan in memory. There are two layers of holding. One layer holds the Query tree (already resolved against the catalog) in memory; the layer above holds the PlannedStmt that has gone through the planner. What is commonly called the plan cache refers to these two layers together. A held Query tree lets the next execution start from the planner; a held PlannedStmt is handed straight to the executor. If the catalog changes in between, an invalidation message arrives, the held results are discarded, and analysis or planning runs again. Because the per-stage outputs of parser, analyzer, rewriter, planner, and executor are cleanly separated, deciding which layer to invalidate and where to restart is straightforward.

What this means in practice

First, EXPLAIN works in the same shape for SELECT, INSERT, UPDATE, DELETE, and MERGE. Five kinds of statement live in the same Query data structure, and the planner produces a single PlannedStmt regardless of kind. From the application side, you don't have to wonder whether you need a different tool to inspect an INSERT plan than a SELECT plan. From the tooling side, monitoring tools can deal with plans through one interface without branching on statement kind.

Second, INSERT/UPDATE remains safe even when only some columns are written. The resno in the target list carries the attno of the destination column rather than the order in which the user wrote it. The columns the user wrote land in their attno positions; missing columns are filled in with defaults later by the rewriter, which works on the same target list (covered in 1.3.3). This is the structural reason ORM-generated INSERT/UPDATE statements work correctly even when columns are out of order or only partially specified.

Third, the same SQL can produce different Query trees depending on when it was analyzed. Because catalog coordinates get baked into the Query tree, the same SQL text seen at two different times against two different catalog states produces two different trees. A prepared statement that produces different plans across two connections, a plan cache that re-analyzes after a catalog invalidation, the same code pointing at different tables under different search_path settings: all of these emerge on top of this fact. From the application side, this is a single-line summary of where the intuition "same text, same behavior" breaks.

1.2.2 Semantic analysis: name resolution, type checking, catalog lookup

JoongHyuk Shin — Thu, 07 May 2026 10:57:23 +0000

A raw parse tree captures the structure of SQL text, but not what that structure refers to. Whether the token users is a table in some schema, whether the two sides of = have compatible types, whether count is a function or a column. Filling in those answers and turning a raw parse tree into a meaningful query tree is the job of Semantic analysis. PG code commonly shortens it to "Analyze". The shape of the resulting node structure is covered in 1.2.3.

Parser and Analyzer

A note on terminology first. In PG code, "parser" is used in two scopes. Narrowly, it refers to the lexer and grammar that turn SQL text into a raw parse tree. Broadly, it includes the next stage, Analyze, as well. Both stages live under src/backend/parser/.

This chapter splits them. The raw parser takes SQL text and produces a raw parse tree (1.2.1). The Analyzer takes a raw parse tree and produces a Query tree by consulting the catalog (this section). The two stages differ in their output shape (raw parse tree → Query tree) and in whether they touch the catalog.

What the raw parse tree leaves out

The raw parser checks only syntax. The fact that SELECT a FROM t WHERE a = 1 is grammatically well-formed is the limit of what a raw parse tree asserts. Whether t actually exists, whether a is a column of that table, whether 1 can be coerced to the type of a: the raw parse tree knows none of this.

Why split into two stages? A few reasons.

Separation of concerns. Mixing syntax checks with semantic checks puts too much responsibility into one stage. Splitting keeps each stage's code simple.
Raw parse must work in an aborted transaction. When a transaction is in an aborted state, the catalog cannot be touched (covered at the end of 1.2.1). But commands like COMMIT/ROLLBACK still need to be recognized to escape that state. The split lets raw parse identify the command type without catalog access.
Reuse of raw parse output. A single raw parse tree can be Analyzed multiple times in different contexts (different search_path, different parameter types). Prepared statements rely on this reuse pattern.

The result is a clean two-stage split. Raw parse is independent of the catalog and only checks syntax. Analyze then aggressively consults the catalog and fills in the blanks. Analyze does four things. It looks up every name in the raw parse tree against the catalog and replaces them with OIDs (object identifiers, the 32-bit integer IDs assigned to every object in the catalog). It determines the type of every expression. It inserts cast nodes wherever types do not match. It packages the result into a new tree called Query.

Name resolution: what does each name in SQL refer to?

Name resolution is the task of figuring out which catalog object each table, column, or function name in the SQL is pointing at.

Start with table names. As Analyze reads down the FROM clause, it encounters items like FROM users, looks up the users table in the catalog, and obtains its OID. It then registers it in an internal list as "the Nth table appearing in this SQL's FROM". PG calls this list the range table. (Why "range"? Covered in 1.2.3.)

A range table holds more than just plain tables. The FROM clause can contain subqueries (FROM (SELECT ...)), JOIN results (FROM t1 JOIN t2 ON ...), and function calls (FROM generate_series(1, 10)). All of these are equivalent in being "an entry that produces rows", so each gets one slot in the range table. The differences between kinds are covered in 1.2.3.

When the schema is not specified (FROM users versus FROM public.users), PG walks the connection's search_path in order and picks the first schema that contains a matching table. This decision is locked in at Analyze time. So the same SQL text can refer to different tables across connections with different search_path settings.

Column resolution comes next. Column names appearing in WHERE, SELECT, ORDER BY, and similar positions are matched against the range table that was just built, deciding "which column of which table this name refers to". The familiar column reference "a" is ambiguous error from SELECT a FROM t1 JOIN t2 ON ... (when a exists in both t1 and t2) is raised at this stage.

Column references can have between one and four dot-separated parts. a (column only), t.a (table.column), s.t.a (schema.table.column), db.s.t.a (database.schema.table.column). Depending on how many dots are present, different parts are interpreted differently, and the priority rules for that interpretation are baked into this stage.

Function calls are slightly more involved. PG allows function overloading, so multiple versions of the same function name can be registered with different argument types. For example, length() has separate versions accepting text and bytea, and the right one is chosen based on the argument types at the call site. Analyze gathers all candidates with the same name and picks the best match by argument-type matching. If the match is ambiguous, an ambiguous error is raised. If there is no match, function does not exist.

Type checking and cast insertion

In an expression tree, leaf nodes are one of three things. Constants, column references, or parameters. The constant 1 has type integer, the column name has type text, the parameter $1 has whatever type the caller declared. Analyze walks the expression tree from leaf to root, determining the type of every node along the way. The result type of a + b is determined only after the types of a and b are settled, and the type of (a + b) * 2 is determined from that result.

When the two sides of an operation differ in type, an automatic cast may be inserted. In WHERE a = '5' where a is integer and '5' is text, PG inserts a cast node converting '5' to integer so the comparison is well-typed. The authoritative answer for which type-to-type conversions are possible lives in the pg_cast catalog. It records the source type, target type, conversion method, and which context the cast is automatically applied in. A few representative rows look like this.

source → target	Method	Context
`int4` → `int8`	Function call (`int8(int4)`)	implicit
`int8` → `int4`	Function call (`int4(int8)`)	assignment
`int4` → `oid`	Binary passthrough	implicit
`json` → `jsonb`	Text I/O detour	assignment

Two of the columns capture the heart of how PG does casts. The method is how the source value is turned into the target type, and the context is where that cast is automatically applied.

The method falls into one of three kinds.

No conversion needed because the memory representation is the same. The int4 → oid row is this case. Both are 32-bit integers, so the binary representation passes through. The conversion cost is essentially zero.
A cast function is called. The int4 → int8 row is this case, with a registered conversion function int8(int4) that gets called. The cost is one function call.
Output as text and read it back. Every type in PG has an output function that serializes its value to a string and an input function that parses a string back into a value of that type. To bridge two types that have no direct cast function, PG runs the source type's output function to produce a string, then feeds that string through the target type's input function. The json → jsonb row works this way: a json value is first emitted as a textual form like {"a":1}, then that text is parsed by the jsonb input function into the binary JSON representation. This route is inefficient and not used when a direct cast function can be defined, but it serves as a fallback for user-defined types or for bridging two types whose representations differ significantly.

If no method is registered, PG raises cannot cast type X to type Y.

The same cast does not apply automatically in every position. PG splits cast contexts into three groups.

implicit. PG inserts the cast on its own, without a request from the user. Applied in positions like comparisons and arithmetic. The int4 → int8 row is implicit, so WHERE bigint_col = int_col automatically promotes int4 to int8 for the comparison even though the two sides differ in type.
assignment. Applies when a value is being put into a column. Operates in INSERT, the SET of UPDATE, COPY, and similar positions. Some conversions that are not registered as implicit are registered as assignment. The int8 → int4 row is one such example. Narrowing a wide integer to a narrower one risks precision loss, so applying it automatically in a comparison would be dangerous, but in a position where a value is being placed into a column of a particular type, accepting that loss is closer to the user's intent.
explicit. The user directly requests the cast with CAST(col AS integer) or col::integer. PG treats this as an unambiguous statement of intent, so even conversions not registered as implicit or assignment are allowed here.

The three contexts form a relationship in which the set of cast pairs PG applies automatically grows wider. A cast registered as implicit is also automatic in assignment positions, and of course in explicit positions too. Conversely, a cast registered only at the explicit level does not happen automatically in implicit or assignment positions; the user has to spell it out. In set notation, implicit ⊂ assignment ⊂ explicit in terms of which pairs PG converts on your behalf.

This split keeps PG away from a common pitfall in other RDBMSs where implicit conversions happen too eagerly and produce surprising results. PG deliberately keeps implicit casts narrow. Before 8.3 there was an implicit cast between text and integer, but it caused too many unintended comparisons and was removed in later versions.

Catalog lookups and locks

Every decision above involves catalog lookups. An OID lookup per FROM table, an attribute lookup per column, a pg_type lookup per type in the expression, a pg_proc lookup per function call. Analyzing one SQL statement can trigger dozens of catalog lookups.

To keep the cost from blowing up, PG caches lookup results in memory (the syscache, a system catalog cache). A second lookup with the same key in the same connection hits memory. When another connection mutates the catalog with ALTER TABLE or similar, an invalidation message is delivered to the syscache and just the affected entry is dropped.

The bigger user-visible effect is not the lookup itself but its side effect. When Analyze finds a FROM table in the catalog, it also takes a lock on that table. A plain FROM in a SELECT acquires AccessShareLock (the read lock). Tables locked by FOR UPDATE/SHARE get RowShareLock. The target of an INSERT/UPDATE/DELETE gets RowExclusiveLock. By the time Analyze finishes, every table touched by the statement is locked. The exact semantics of each lock mode and the conflict matrix between them are covered in 3.5.1 (Heavyweight locks).

What is interesting is that locks are taken at the Analyze stage. Other designs are possible. Some other RDBMSs delay lock acquisition until plan or execute time. PG chose to take the lock during the catalog visit instead. The effect is clear: once parse finishes, the schema is guaranteed not to change for the rest of plan and execute, on a per-statement basis. The race in which the catalog mutates under ALTER TABLE while planning is in progress, invalidating the plan, is shut out before it can happen.

After Analyze finishes

Once Analyze is done, PG does two more things.

The first is computing a query identifier. When monitoring tools track slow-running queries, calls like WHERE id = 1 and WHERE id = 2 should be grouped together as instances of the same pattern (WHERE id = ?). Otherwise, a statistic like "this query pattern averages X ms" loses its meaning.

The catch is in what key the statistics table groups by. The key here is the data-structure key, the same kind of key as in a hash table where matching keys land in the same slot and accumulate. Using the SQL text directly as a key is no good. WHERE id = 1 and WHERE id = 2 are different texts and would land in separate slots, scattering what should be one row of statistics across many. What we want is one row per pattern, not one row per call.

So at the end of Analyze, PG normalizes the Query tree by stripping out the variable parts (constants and the like, with the same effect as replacing them with ?) and then hashes the remainder into a 64-bit identifier. Same pattern produces the same identifier; same identifier accumulates into the same statistics slot. This process is called query jumble, and the resulting identifier is the key used by monitoring tools to group queries of the same kind.

The second thing is a hook where extensions plug in. PG places a post-analyze hook at the very end of Analyze, so any extension registered there has its code invoked right after each query is analyzed. Statistics-gathering extensions like pg_stat_statements register on this hook to collect each query's identifier and execution time.

What this means in practice

First, the same SQL text can refer to different tables depending on the connection's search_path. This is a direct consequence of table name resolution being locked in at Analyze time and following search_path. In container or deployment environments where the default search_path happens to differ, tests pass while production reports "table not found". This is one of the first things to check when debugging such an error.

Second, throwing a SELECT inside a transaction immediately locks every table the statement touches, and those locks are held until the transaction ends. Lock acquisition being baked into the Analyze stage shows up directly in how applications behave. Patterns like BEGIN; SELECT ...; (some other work), where a transaction is opened early, hold AccessShareLock on the SELECT's tables for the entire transaction's duration, which can block ALTER TABLE in other sessions. This is one of the reasons long-running transactions are operationally risky (other reasons are covered in chapter 3 on MVCC). Conversely, from the application's point of view, the moment a SELECT is issued, the schema becomes protected against change for the remainder of that statement.

Third, the unit of "query" that monitoring tools group on is not the SQL text but the normalized pattern of the analyzed result. When pg_stat_statements collects slow queries, the key it uses to lump same-shaped queries onto one row is the query identifier computed at the end of analysis. Statements that differ only in constant values but share the same structure accumulate into the same row. So calling WHERE id = 1 a thousand times and calling WHERE id = 1, WHERE id = 2, ..., WHERE id = 1000 once each look the same to the monitoring tool: a single row.

1.2.1 From SQL Text to Raw Parse Tree

JoongHyuk Shin — Wed, 06 May 2026 02:06:12 +0000

A line like SELECT name FROM users WHERE id = 1 arrives at the backend. As we saw in 1.1.1, the backend is a child process forked by the postmaster when the client connects, dedicated to that one client. What this process first holds in its hands is just a byte array. It does not yet know where the keywords end, where the identifiers begin, or where the integer constant lives. Turning that byte array into a tree structure is the front half of the second of the five stages, namely raw parsing. This section covers only that front half. The back half (parse analysis), which consults the catalog to attach meaning, belongs to 1.2.2. (The catalog, in case you need a refresher, is the set of internal tables PostgreSQL maintains to describe itself: which tables have which columns, which functions take which argument types, and so on, all stored as rows in these tables.)

The output of raw parsing is a single RawStmt node per SQL string (or a List, if multiple statements are joined with ;). This RawStmt wraps a raw node like SelectStmt for SELECT, InsertStmt for INSERT, and so on. The name "raw" means it has not seen the catalog. Whether users is actually an existing table, which column id refers to, none of that is known yet. All that has been captured is the grammatical structure: a SELECT keyword followed by a column list, a FROM followed by an identifier, a WHERE followed by a comparison expression.

Two tools dividing the work: flex and Bison

PostgreSQL's raw parser is a collaboration of two tools. One side is the lexer (lexical analyzer), the other is the grammar (syntactic analyzer). The lexer slices the byte array into tokens. The grammar takes that token sequence and groups it into a tree.

PostgreSQL does not write these two by hand. It uses standard generator tools brought in from outside. The lexer is flex, the grammar is Bison. Both are code generators. The developer writes rules, and at build time the tool reads those rules and emits actual working lexer/parser C code.

For flex, a rule means "if this regex pattern comes in, emit this token." For example, "if a letter is followed by alphanumerics, emit an IDENT (identifier) token", "if only digits appear, emit an ICONST (integer constant) token." PostgreSQL keeps these rules in scan.l (about 1,500 lines on the current PostgreSQL 18). At build time flex reads this file and generates the C function that does the tokenizing.

For Bison, a rule means "if this token sequence appears, build this tree node." For example, "if SELECT is followed by a column list, then FROM and a table identifier, build a SelectStmt node." PostgreSQL keeps these grammar rules in gram.y (about 20,000 lines on the current PostgreSQL 18). At build time Bison reads this file and generates the C parser function that groups tokens into a tree.

The driver that ties these two together is the raw_parser() function.

List *
raw_parser(const char *str, RawParseMode mode)
{
    core_yyscan_t yyscanner;
    base_yy_extra_type yyextra;
    int           yyresult;

    scanner_init(str, ...);     /* flex init */
    parser_init(&yyextra);      /* Bison state init */
    yyresult = base_yyparse(yyscanner);   /* one cycle */
    scanner_finish(yyscanner);

    if (yyresult)
        return NIL;             /* syntax error */

    return yyextra.parsetree;   /* List of RawStmt */
}

base_yyparse() is the actual parser function Bison generated. Inside it, whenever a token is needed, it calls the lexer and groups tokens into a tree according to grammar rules. The lexer does not run on its own. It is pulled along by the grammar.

What the lexer (scan.l) sees

What the lexer does is simple. It scans the byte array from the start, finds the longest pattern that matches one of the regex rules, and hands the corresponding token type and value to the grammar. What does "longest matching pattern" mean? When several rules can match starting at the same position, the lexer picks the one with the longer match. For example, if the input contains >=, the lexer does not slice off > alone. It groups >= into a single token (such as GREATER_EQUALS). > alone would also match a comparison-operator rule, but a longer-matching rule for >= exists, so that one wins. In lex terminology this is called longest match. Almost every lexer behaves this way. So when SELECT name FROM users WHERE id = 1 comes in, the lexer emits tokens in this order.

SELECT     → keyword SELECT
name       → identifier (IDENT, value = "name")
FROM       → keyword FROM
users      → identifier (IDENT, value = "users")
WHERE      → keyword WHERE
id         → identifier (IDENT, value = "id")
=          → token '='
1          → integer constant (ICONST, value = 1)

How does the lexer tell identifiers from keywords? Every identifier candidate (a letter followed by alphanumerics) first matches the same regex rule. After that, the matched string is checked against the keyword table by binary search. If it is in the table, it becomes a keyword token. If not, IDENT.

The size and categorization of that keyword table matter. On the current PostgreSQL 18 there are 494 keywords, split into four categories.

Category	Count	Meaning
UNRESERVED	330	usable as identifier (e.g. `abort`, `aggregate`)
COL_NAME	63	usable as a column name but not as a function/type name
TYPE_FUNC_NAME	23	usable as a type or function name
RESERVED	78	never usable as identifier (e.g. `select`, `from`, `where`)

This split is why a query like SELECT abort FROM ... works in PostgreSQL: abort is UNRESERVED, so it can also be used as an identifier. By contrast, SELECT select FROM ... is a syntax error. RESERVED never gives way. Compatibility differences with other RDBMSes often come from this split.

There is an interesting comment at the top of scan.l: "rules in this file must be kept in sync with src/fe_utils/psqlscan.l and src/interfaces/ecpg/preproc/pgc.l!" In other words, the same SQL lexer rules live in three places. The psql client has its own lexer, ecpg (the embedded SQL preprocessor) has its own. The reason is that each tool has slightly different lexical needs, but the practical consequence is that changing the lexer rules in PostgreSQL means synchronizing three files.

The tree the grammar (gram.y) builds

As the lexer emits tokens, the grammar takes them in order, matches them against grammar rules, and assembles a tree. It builds bottom-up. Small subtrees are grouped first, those subtrees are then collected into larger nodes, and finally a single node corresponding to the whole statement is completed. This act of "taking a sequence of tokens or subtrees and reducing them into a single parent node" is called reduce in parser terminology. For example, if the token sequence IDENT '=' ICONST matches the "binary comparison expression" rule, those three are reduced into a single subtree. That subtree is then reduced as part of a WHERE clause rule. That WHERE clause is reduced as part of a SelectStmt rule. In the end, one SelectStmt node is built and gets wrapped in a RawStmt.

Here is a picture of the raw parse tree that SELECT name FROM users WHERE id = 1 produces. The tree has the root at the top, leaves toward the bottom.

                        RawStmt
                           │
                       SelectStmt
              ┌────────────┼─────────────┐
              │            │             │
          targetList   fromClause    whereClause
              │            │             │
            IDENT       RangeVar      A_Expr (=)
            "name"      "users"      ┌──────┴──────┐
                                     │             │
                                   IDENT         ICONST
                                   "id"             1

The RawStmt at the top is the output of the raw parser. Keywords like SELECT, FROM, WHERE are only markers that tell the grammar rule which subtree to reduce into, so they do not survive as separate tree nodes. Only meaningful values (identifiers, constants) from the input tokens remain as leaves.

PostgreSQL's top rule is stmtmulti. It expresses that multiple SQL statements may be joined with ;.

stmtmulti:  stmtmulti ';' toplevel_stmt
                { ... lappend($1, makeRawStmt($3, @3)); ... }
          | toplevel_stmt
                { ... list_make1(makeRawStmt($1, @1)); ... }
;

Each toplevel_stmt is one SQL statement, and they all get wrapped by makeRawStmt() into a RawStmt. That is why the raw parser's output is always a List of RawStmt.

Going down into toplevel_stmt and then stmt, you find that stmt is a giant OR rule.

stmt:
        AlterEventTrigStmt
      | AlterCollationStmt
      | AlterDatabaseStmt
      | ...
      | SelectStmt
      | InsertStmt
      | ...
;

More than 120 statement kinds are listed here on the current PostgreSQL 18, one per line. Each then expands into its own sub-rules. SelectStmt alone has dozens of sub-rules, and every SELECT element such as FROM clause, WHERE clause, GROUP BY, set operation unfolds inside it like a tree. The roughly 20,000 lines of gram.y are the result of this unfolding.

A new tree node is built inside the reduce action. At the moment SELECT * FROM users reduces into a SelectStmt node, makeNode(SelectStmt) allocates the node and fills in target list, FROM clause, WHERE, and so on. At this moment we do not know which OID of which table users refers to. The identifier string "users" simply lives inside a RangeVar node. Catalog lookup is the next stage's job.

gram.y also happens to be the file that most directly shows PostgreSQL's history of SQL compatibility changes. When a new PostgreSQL version adds a new SQL feature (such as MERGE or JSON_TABLE), the grammar rules grow accordingly, so dozens to hundreds of lines get added to gram.y each time. Following git blame is enough to see which SQL feature entered which PostgreSQL version. Its roughly 20,000-line size is the trace of three decades of SQL standard accumulation in a single file.

The lookahead filter: base_yylex

There is one more detail. The parser Bison generates is LALR(1). Spelling out the acronym, that is Look-Ahead Left-to-right Rightmost-derivation, with 1-token lookahead. The name sounds intimidating, but the gist is simple. The parser scans input from left to right exactly once, and at each step it peeks at exactly one upcoming token ("1-token lookahead") to decide which grammar rule to apply. Being able to decide with one token of lookahead keeps the parser fast and memory-efficient. Almost every mainstream compiler/DB parser works this way.

The catch is that SQL grammar has a few cases that do not fit cleanly into LALR(1). Multi-word tokens like NULLS FIRST and WITH ORDINALITY need to be received by the grammar as single tokens. If NULLS and FIRST were given to the grammar as separate tokens, the grammar could not decide what comes after NULLS based on a single lookahead.

PostgreSQL solves this by inserting one filter layer between the lexer and the grammar. That filter is base_yylex(). The actual lexer flex generates is core_yylex(), but Bison never calls it directly. It always receives tokens through base_yylex(). base_yylex looks at one token, and if this is a case that needs the next token pulled in and merged, it gives the grammar a single combined token. The result is that the grammar can be written cleanly as LALR(1), and the complexity of multi-word tokens is isolated inside base_yylex.

Never touches the catalog

The most important constraint of the raw parser stage is that it never accesses the system catalog. This is not just a convention but a correctness requirement.

The header comment of pg_parse_query() explains why.

Analysis and rewriting cannot be done in an aborted transaction, since they require access to database tables. So, we rely on the raw parser to determine whether we've seen a COMMIT or ABORT command; when we are in abort state, other commands are not processed any further than the raw parse stage.

To follow this comment, you first need to know what "the transaction is in abort state" means. Here is the most intuitive scenario.

BEGIN;
INSERT INTO users(id, name) VALUES (1, 'a');
INSERT INTO users(id, name) VALUES (1, 'b');   -- unique violation!
SELECT * FROM users;

The moment the third line violates the unique constraint and errors out, PostgreSQL marks this transaction as "already broken." That is the abort state. The next line, SELECT * FROM users;, is grammatically fine and the table exists in the catalog, but PostgreSQL refuses to execute it. Instead, every subsequent SQL gets the same one-line error.

ERROR:  current transaction is aborted, commands ignored until end of transaction block

This is the situation in psql where, after one query inside a transaction fails, every following query is rejected with the same message. To unlock it, the client must explicitly send ROLLBACK (or COMMIT, which in abort state has the same effect as rollback). In other words, the only SQL that PostgreSQL effectively accepts in abort state is ROLLBACK.

Now the relationship between the raw parser and the catalog matters. ROLLBACK still arrives as SQL text, so it has to be parsed somehow. But in abort state, even catalog reads are blocked. What if the raw parser depended on the catalog? Parsing ROLLBACK itself would then error out, and the client would have no way to unwind the transaction. Killing and reopening the connection would be the only escape.

The design choice that avoids this from the start is the raw parser's catalog ban. The raw parser must work in any transaction state, so it never touches the catalog. All identifier-level meaning analysis is deferred to the next stage (parse analysis). 1.2.2 covers that story.

What this means in practice

First, "parsing is fast" really means raw parsing is fast. PostgreSQL's raw parser is pure string work without catalog lookup, so it is very fast. But the bulk of what a prepared statement (1.1.2) caches is not raw parsing. It is the next stage (parse analysis). Parse analysis is where catalog lookups, function-overload resolution, and type checking happen. When people talk about the per-query parse cost of the simple query protocol being a burden, they almost always mean parse analysis cost, not raw parse cost. To diagnose accurately, separate which stage's cost you are looking at.

Second, keyword categories are a hidden trap in PostgreSQL compatibility and migration. A word that other RDBMSes happily allow as a column or function name may be RESERVED in PostgreSQL. If your migration tool does not automatically wrap such names in quotes ("name"), you get a syntax error. Conversely, identifiers PostgreSQL allows freely (such as abort) may be blocked in another DB. When reviewing a migration, comparing the keyword tables on both sides is the safe move.

Third, the raw parser only catches syntax errors. Errors like "table does not exist", "function signature does not match", "column is ambiguous" are all thrown by the next stage, parse analysis. So when a query has both a syntax error and a semantic error, the syntax error is reported first. The semantic error only surfaces once syntax is clean. If your error message suddenly changes during debugging, that may be a signal that the syntax issue cleared and you are now seeing the next stage's error.

1.2 Parser and Analyzer: How SQL Gets Its Meaning

JoongHyuk Shin — Wed, 06 May 2026 02:05:24 +0000

A line like SELECT name FROM users WHERE id = 1 is just text when the client sends it. The first thing the backend does after receiving it is figure out "what do these characters mean." This chapter covers the second of the five stages we saw in 1.1.1: the parser and analyzer.

By the time this stage finishes, the SQL text has been transformed twice. First into a raw parse tree that captures the grammatical structure, then into a Query tree with meaning attached after consulting the catalog. The catalog, in case you need a refresher, is the set of internal tables that PostgreSQL keeps to describe itself. Which tables exist, what columns they have, what argument types each function accepts, what data types are defined: all of it lives as rows in these tables. PostgreSQL treats user data and metadata uniformly, both as ordinary tables. The raw stage looks only at form (does this follow SELECT syntax, where are the IDENTs). The Query stage looks at substance (this identifier users is which table in which schema and what is its OID, id is which column and what is its type, can 1 be coerced to that column's type).

1.2 splits into three sections.

1.2.1 From SQL text to raw parse tree (lexer, grammar): how the flex-based lexer and Bison-based grammar turn an SQL string into a tree. This stage is pure syntactic work, no catalog access.
1.2.2 Semantic analysis: name resolution, type checking, catalog lookup: take the raw parse tree, dig into the catalog to find what each identifier really refers to, check types, resolve function overloads. This is the body of how PostgreSQL gives meaning to SQL text.
1.2.3 Query tree node types (Query, RangeTblEntry, TargetEntry): the core nodes of the Query tree, which is the output of semantic analysis. This node structure is the standard input format that rewriter, planner, and executor all consume.

By the end of this chapter, you should have a clear picture of how SQL text meets the catalog and acquires meaning, and what data structures carry that meaning into the next stage.

1.1.3 Optimizable vs Utility

JoongHyuk Shin — Tue, 05 May 2026 07:36:20 +0000

Inside the five-stage pipeline from 1.1.1, there is another fork right after the parser. PostgreSQL classifies every SQL command into one of two camps. One side holds the optimizable queries, the other holds the utility commands. The classification is decided by a single field on the Query node, commandType, and from that point on the two camps travel completely different paths. One goes through the rewriter, the planner, and the executor. The other bypasses all three.

This fork was a single line in the 1.1.1 picture, but it shapes the entire internal structure of PostgreSQL, so it earns its own section.

Five optimizables, and everything else

PostgreSQL defines its command types as a single enum.

typedef enum CmdType
{
    CMD_UNKNOWN,
    CMD_SELECT,
    CMD_UPDATE,
    CMD_INSERT,
    CMD_DELETE,
    CMD_MERGE,
    CMD_UTILITY,
    CMD_NOTHING,
} CmdType;

Of these, CMD_SELECT, CMD_INSERT, CMD_UPDATE, CMD_DELETE, and CMD_MERGE are the optimizable ones. As the name suggests, these are queries the planner can do meaningful work on. It rearranges join order using a cost model, picks indexes, and chooses scan methods.

CMD_UTILITY is the catch-all for everything else: CREATE TABLE, ALTER TABLE, DROP, VACUUM, BEGIN/COMMIT/ROLLBACK, COPY, NOTIFY, LISTEN, CLUSTER, REINDEX, GRANT, SET, SHOW, TRUNCATE, LOCK, FETCH, CHECKPOINT, PREPARE TRANSACTION, CREATE INDEX, CREATE FUNCTION, and many more. What they share is a single property: the planner has no room to produce a better plan via cost comparison.

A command like CREATE TABLE foo (id int) cannot have two different paths. It just inserts a few rows into the system catalog and asks the storage manager to allocate a new relfilenode. BEGIN similarly nudges the transaction state by one step; there is no choice in "how to BEGIN." VACUUM walks a target table page by page and cleans up dead tuples through a fixed procedure. The point is that cost comparison is meaningless here.

The two camps are wired through different code paths. The first split happens in the analyzer, right after the parser hands over a raw parse tree.

The fork lives in transformStmt's switch

When the raw parse tree arrives, transformStmt() runs a large switch on the node tag (src/backend/parser/analyze.c).

switch (nodeTag(parseTree))
{
    /* Optimizable statements */
    case T_InsertStmt:
        result = transformInsertStmt(...);
        break;
    case T_SelectStmt:
        result = transformSelectStmt(...);
        break;
    /* ... UPDATE, DELETE, MERGE ... */

    /* Special cases (utility wrappers around an optimizable inside) */
    case T_DeclareCursorStmt:
    case T_ExplainStmt:
    case T_CreateTableAsStmt:
    case T_CallStmt:
        /* transform the inner query separately */
        ...

    default:
        /* every other utility */
        result = makeNode(Query);
        result->commandType = CMD_UTILITY;
        result->utilityStmt = parseTree;
        break;
}

PostgreSQL does meaningful semantic analysis on the five optimizable statement types and a handful of special cases. Everything else falls through to the default branch, gets stamped commandType = CMD_UTILITY, and the raw parse tree is stored verbatim in the utilityStmt field. Nothing was actually analyzed; a Query shell was wrapped around the raw tree with a "this is utility" sticker.

The next stage, the rewriter, also reads that sticker (src/backend/tcop/postgres.c).

if (query->commandType == CMD_UTILITY)
    querytree_list = list_make1(query);   /* don't rewrite utilities */
else
    querytree_list = QueryRewrite(query);

If the query is utility, the rewriter does not touch it. The rule system, view expansion, and RLS policy application all live on the optimizable side and have no meaning for utility commands.

The planner is the same.

if (query->commandType == CMD_UTILITY)
{
    /* Utility commands require no planning. */
    stmt = makeNode(PlannedStmt);
    stmt->commandType = CMD_UTILITY;
    stmt->utilityStmt = query->utilityStmt;
    ...
}
else
{
    stmt = pg_plan_query(query, ...);   /* invoke the planner */
}

Utility commands never call into the planner. An empty PlannedStmt wrapper is built, and the raw parse tree is dropped into it. A PlannedStmt with no plan tree.

The executor stage is split too. Optimizable statements feed their plan tree into the executor proper, which produces rows. Utility statements get handed off to ProcessUtility(), which dispatches to a per-statement handler. The dispatch logic and the individual handlers belong to later chapters (DDL in 1.6, transaction commands in chapter 4).

When you lay out the four stages side by side, the asymmetry is sharp.

Stage	Optimizable (5 types)	Utility (everything else)
Parse analysis	Dedicated transform function	Wrapped in a Query shell
Rewriter	Rule system applied	Skipped (passes through)
Planner	Plan tree generated	Skipped (empty PlannedStmt holding the raw tree)
Executor	`ExecutorRun()` (walks the plan tree)	`ProcessUtility()` (per-statement handler)

The whole sophistication of the planner exists for those five types; utility bypasses it entirely. This is not an efficiency choice. It is a structural asymmetry, because utility commands have no alternative paths to compare.

Two species sharing one system

Once you see how this asymmetry is wired, the two camps look almost like two species inside the same engine.

Hooks live on different paths. A hook in PostgreSQL is a function pointer exposed at a key point in the execution path so that external code (typically an extension) can plug in. An extension installs its own function address into the hook, and PostgreSQL calls it whenever the hook is non-null at the appropriate moment. Which camp the hook applies to depends on where it sits. planner_hook is invoked just before the planner runs, so it only affects optimizable queries. Utility never enters the planner, so planner_hook never fires for utility. On the other side, ProcessUtility_hook is invoked just before ProcessUtility() runs, so it only applies to utility commands. That is how an audit logging extension like pgaudit intercepts DDL and DCL. You need both hooks together to cover every SQL execution path. If you write extensions, you have to know up front which camp your hook is intercepting.

Statistics live in a single channel, but the metrics mean different things. PostgreSQL accumulates usage patterns about which SQL ran how often and for how long, and DBAs use that data to find slow queries and pick tuning targets. By default (pg_stat_statements.track_utility = on), pg_stat_statements tracks both optimizable and utility commands, gathering execution counts and accumulated time into one place. Plan-level metrics (plan time, plan cache hits, and so on) only carry meaning for optimizable commands where a plan actually exists. For utility rows those columns are empty or zero. The two camps appear in the same table, but which metric means what for which camp has to be kept in mind.

Prepared statements are an optimizable-only tool. The prepared statements from 1.1.2 cache plans for optimizable queries. SQL PREPARE grammar itself does not accept utility commands. PostgreSQL only allows SELECT, INSERT, UPDATE, DELETE, MERGE, and VALUES as PREPARE targets; PREPARE p AS ALTER TABLE ... is rejected as a syntax error. Utility commands run through ProcessUtility directly on every call, with no plan cache layer to step in.

EXPLAIN's reach is asymmetric. EXPLAIN SELECT ... draws a plan tree. EXPLAIN ALTER TABLE ... does not work; there is no plan tree to draw. The exception is the special-case group from the switch above (T_DeclareCursorStmt, T_ExplainStmt, T_CreateTableAsStmt, T_CallStmt). They are classified as utility on the outside but contain an optimizable query that goes through the regular pipeline. That hybrid path is why EXPLAIN ANALYZE INSERT INTO ... SELECT ... works. The outer wrapper is utility; the SELECT and INSERT inside are optimizable.

I once tried to debug "why is this long ALTER TABLE so slow" by analyzing the plan. The plan had no answer. The bulk of the cost lived outside the plan tree: lock waits, catalog updates, full-table rewrites, and WAL volume. That was when I learned why utility needs its own dedicated path. Some costs are invisible to the plan-level cost model, and to see them you need a different stage of tools.

What this means in practice

First, planner-related tuning and monitoring tools have no meaning for utility. EXPLAIN, plan-level metrics in pg_stat_statements, auto_explain, plan_cache_mode, all of these are optimizable-side tools. Looking at the plan to debug a slow DDL or DCL is pointless because there is no plan. Utility commands grab and release catalog locks while running directly on every call. Slow utility is almost always lock contention or I/O cost, not a plan choice issue, so the diagnostic path is to set log_min_duration_statement to capture timings, watch lock waits in pg_stat_activity, and look at wait_events to see where the command is stuck.

Second, audit and security requirements split into two camps and attach to different mechanisms. Tracking "who changed which schema and when" lives on the utility side. pgaudit captures ALTER/CREATE/DROP/GRANT into an audit log because it sits on ProcessUtility_hook. Row-level access control like "this user can only see a subset of rows in this table" lives on the optimizable side. RLS (Row-Level Security) runs in the rewriter, automatically attaching policy conditions to optimizable queries. The two requirements get bundled under the same security umbrella, but they hook into completely different stages of the pipeline. RLS cannot stop a schema change, and ProcessUtility_hook cannot filter rows. When a compliance requirement comes in, the first task is to classify it as schema-level tracking versus row-level access control. Only then do the candidate tools fall into place, and you almost always need both mechanisms together to leave no gaps.

1.1.2 Simple vs Extended

JoongHyuk Shin — Tue, 05 May 2026 04:35:38 +0000

The fork visible in 1.1.1 (simple query protocol on one side, extended on the other) is the subject of this section, one level deeper. 1.1.1 set the skeleton: simple is one message, extended is four. The job here is to show how that split translates into four distinct outcomes: plan reuse, parameter safety, pipelining, and error handling.

Message sequence: the shape of one cycle is different

Putting the message sequences side by side makes the difference visible at a glance.

Simple:

Client                          Server
  │                               │
  │── 'Q' (SQL text) ────────────▶│
  │                               │ parse → analyze/rewrite → plan
  │                               │ → create portal → execute → drop portal
  │◀── RowDescription, DataRow*, CommandComplete, ReadyForQuery

Extended:

Client                          Server
  │                               │
  │── 'P' (SQL template) ────────▶│ parse + analyze, store prepared statement
  │◀── ParseComplete                
  │                               │
  │── 'B' (parameter values) ────▶│ choose plan, create portal
  │◀── BindComplete                
  │                               │
  │── 'E' (execute) ─────────────▶│ run portal, send rows
  │◀── DataRow*, CommandComplete   
  │                               │
  │── 'S' (Sync) ────────────────▶│ close transaction
  │◀── ReadyForQuery

Simple finishes one cycle in a single message. Extended slices the cycle into four messages, and that slicing is what produces the four capabilities below.

Capability one: execution plans get reused

The central concept that lets extended split the stages is the prepared statement.

A prepared statement is a SQL template that has already been parsed and analyzed. The places where values would go are left blank with placeholders like $1, $2, and at execution time only the actual values get plugged into those slots. Take INSERT INTO users (id, name) VALUES ($1, $2). Once you turn that into a prepared statement, you can run it later by sending only the values: (1, 'Alice'), (2, 'Bob'). The full SQL text isn't reparsed each time. Give it a name and it becomes a named prepared statement you can call back during the session. Send it without a name and it's an unnamed prepared statement, automatically discarded the moment the next 'P' arrives.

The four messages of the extended protocol are exactly that flow, sliced.

Message	What it does
`'P'` Parse	Take the SQL template, finish parse and analysis, store as a prepared statement
`'B'` Bind	Bind actual parameter values to the prepared statement and prepare for execution (create a portal)
`'E'` Execute	Run the prepared portal and send result rows
`'S'` Sync	End of the cycle, send ReadyForQuery

What this means is that the same prepared statement can be re-executed many times with different parameters by repeating just 'B' + 'E'. Take inserting 1,000 users.

# Driver pseudocode: 1000 INSERTs via a prepared statement
stmt = conn.prepare("INSERT INTO users (id, name) VALUES ($1, $2)")
for i in range(1000):
    stmt.execute(i, f"user{i}")

conn.prepare(...) corresponds to a single 'P' message. Parsing and analysis of the SQL text happen there. Each of the 1000 stmt.execute(...) calls corresponds to a 'B' + 'E' pair. Parse and analyze run only once on the first call; the remaining 999 do bind and execute only. With simple query, the same INSERT text would be sent 1000 times and reparsed 1000 times.

Internally, a prepared statement is held in a structure called CachedPlanSource, which keeps the raw parse tree and the analysis result. When the same prepared statement gets another 'B' + 'E', the backend starts from the saved CachedPlanSource, only redecides the execution plan, and runs. Parsing and analysis are skipped.

Generic plan vs custom plan

One step further. Plan reuse is real, but to be precise there are two kinds of plan.

Custom plan: recomputed every time using the bound parameter values. Helpful when the optimal path differs by value. Take WHERE status = $1. Suppose status='pending' matches 1% of rows and status='completed' matches 99%. A distribution where the value-by-value ratios are this lopsided is what's usually called a skewed distribution. Index scan is fast for 'pending'; sequential scan is fast for 'completed'. Custom plan looks at the value on every call and picks the path that fits it. (Plan construction is the entire subject of chapter 1.4; the kinds and behavior of scan nodes are covered in 1.5.2.)
Generic plan: planned once without knowing the parameters and cached. Every EXECUTE from then on reuses the cached plan, so from each call's point of view the cost of "planning this one" is zero. The trade-off is that the same path is forced for every parameter value.

PostgreSQL decides between the two on every EXECUTE. The decision function is choose_custom_plan(), and the default policy is:

For the first 5 EXECUTEs, always use a custom plan. Collect actual cost measurements.
From the 6th onward, compare the average custom plan cost against the generic plan cost. The custom average includes the cost of planning every time, while the generic side has that cost as zero (for the reason above), so the comparison is intentionally asymmetric.
If generic is cheaper, switch to generic. Otherwise, stay on custom.

The decision can be forced via the plan_cache_mode GUC. auto (default) runs the policy above; force_custom_plan always uses custom; force_generic_plan always uses generic.

Working on another RDBMS engine, the first time I saw the "5 customs, then start comparing" rule I spent a while looking for the reason behind that 5. The conclusion: it's an arbitrary constant. The PG source comment literally says "until we have done at least 5 (arbitrary)". Other engines tend to be stricter with plan cache policy (e.g. lock the first plan in as the generic one) and let you override via a knob, while PG chose to decide dynamically on every call. The result is that a PG prepared statement isn't simply a "plan cache"; it's "automatic switching driven by statistics." This is one reason that even ORM code that uses prepared statements automatically can show much less plan caching than people expect: if a statement is called fewer than 5 times, it gets recomputed as a custom plan every time.

Capability two: SQL injection is structurally blocked

In simple query, putting a parameter into a query means embedding the value inside the SQL text, something like f"SELECT * FROM users WHERE id = {user_input}". If user_input is untrusted, you've just opened the door to SQL injection.

Extended separates the SQL template from the parameter values into different messages. 'P' carries only the template, like SELECT * FROM users WHERE id = $1. 'B' carries the values that fill those slots, in binary or text form. Those values never go through the SQL parser. They're plugged into the already-parsed plan tree as data.

When JDBC PreparedStatement, libpq PQexecParams, or psycopg2 supports ? or $1 placeholders, that's the path being used internally. The real mechanism for SQL injection prevention lives here. It isn't "remember to escape on the client"; it's a structure where the parser has no chance to interpret a user-supplied value as a SQL token.

Capability three: messages can be batched (pipelining)

Simple sends ReadyForQuery back the moment each 'Q' is processed. The client can't send the next query until that response arrives. One query equals one round-trip.

Extended only sends ReadyForQuery when an 'S' (Sync) arrives. That means a sequence like 'P', 'B', 'E', 'B', 'E', 'B', 'E', 'S' can go out as a single batch. 100 INSERTs in one round-trip. In environments with significant network latency (cross-region cloud calls, for instance), the throughput difference is large.

Built on top of this mechanism, PG 14 introduced an official pipeline mode in libpq (PQpipelineSync, PQenterPipelineMode, etc.). The wire-level capability existed before, but the libpq client API for it wasn't clean.

Capability four: a partial error doesn't break the whole batch

Simple, on error, immediately sends ErrorResponse plus ReadyForQuery. The cycle closes right away and the backend is ready for the next 'Q'. As noted above, simple is a 1-round-trip structure, so when the backend returns to normal mode there's nothing queued in the buffer behind the failed message. Closing out and waiting for the next 'Q' is enough.

Where extended runs into real trouble is the batch case. As we saw in capability three, a typical client pushes 'B', 'E', 'B', 'E', ..., 'S' into the wire all at once. Suppose you send 100 INSERTs by pipelining: one 'P', followed by 100 pairs of 'B' + 'E' and a single 'S' all line up in the backend's buffer. While the backend is processing the 1st 'B', the 51st and 70th messages are already sitting in that buffer waiting their turn.

Now suppose the 50th 'B' fails with something like a unique violation. If the backend behaved like simple (immediately sending ErrorResponse + ReadyForQuery and returning to normal mode), it would pull the 51st 'B' out of the buffer and start processing it next. But that 51st 'B' was sent by the client under the assumption that the first 50 had succeeded. The transaction is already aborted, so processing the 51st errors out too. Same for 52, 53, ..., 100. The client ends up tracking the original error plus 50 more downstream errors.

PG avoids this chaos with a different strategy. The moment an error occurs, the backend enters a special state called ignore_till_sync. While in that state, every message that arrives is dropped without being processed until the client explicitly sends an 'S' (Sync). No additional error responses go out. Once 'S' arrives, the backend finally sends ReadyForQuery and starts accepting messages normally again.

The result is that the client receives exactly two responses: one ErrorResponse (the 50th failure) and one ReadyForQuery (in reply to 'S'). A clean boundary forms: "the batch failed somewhere, and everything past that point was discarded." ignore_till_sync is, in essence, the byproduct that makes pipelining safe.

All four in one table

Compressing the four capabilities into a single comparison.

Area	Simple	Extended
Message count	1 (`'Q'`)	4+ (`'P'`, `'B'`, `'E'`, `'S'`)
Plan reuse	None (parse + plan every time)	Yes (`CachedPlanSource` + auto generic/custom)
Parameters	Inline in SQL text	Separated as data at bind time
SQL injection	Client is responsible for escaping	Prevented at the protocol level
Round-trips	1 per query	Batched (1 per Sync)
Error handling	Immediate ReadyForQuery	ignore_till_sync (wait until Sync)

What this means in practice

First, don't assume that "the ORM uses prepared statements" means you're getting full plan caching. PG plans a custom plan for the first 5 EXECUTEs of every prepared statement. If a statement is called only once or twice inside a short transaction, the plan caching benefit is essentially zero. The real benefit shows up in workloads that call the same prepared statement dozens to hundreds of times with different parameters. The ratio of calls to plans in pg_stat_statements, plus a forced plan_cache_mode setting, are the two diagnostic tools.

Second, the answer to "why isn't my prepared statement going generic?" is the wall of 5. Forcing plan_cache_mode = force_generic_plan brings planning cost to zero but locks every parameter value to the same path. With skewed data this can actually be slower. The opposite, force_custom_plan, pays planning cost every time. The default auto, which decides dynamically from the 6th call, is usually safest, but there are environments where explicitly choosing generic is worth the GUC tweak. For example, environments where prepared statements have very short lifetimes due to PgBouncer transaction pooling.

Third, in environments with significant network latency, pipelining is the real lever. Cross-region RDS calls in the cloud, or even same-region setups where there's millisecond-level latency between application and database, will turn 100 simple INSERTs into 100 round-trips. libpq pipeline mode, or JDBC addBatch() + executeBatch(), can collapse that to a single round-trip. Just keep in mind that the error-handling complexity goes up (you need to understand what ignore_till_sync means), so it pays to design batch-level transaction boundaries and a retry policy at the same time as the pipelining itself.

1.1.1 Life of a Query

JoongHyuk Shin — Mon, 04 May 2026 10:13:22 +0000

This section is the map for the rest of the book. The five stages introduced in the 1.1 chapter overview (message dispatch, parser + analyzer, rewriter, planner, executor) are traced here through the actual code: which functions implement each stage, and in what order they get called. The mechanics of each of the five stages are unpacked in later chapters. Here, only the skeleton matters: how a backend starts up, how it receives messages, and where the first fork in the road appears.

One term to settle first. In this book, a statement refers to a single unit of SQL command that the client sends to the backend. One SELECT * FROM users, one INSERT INTO users ..., each is one statement. A single request may carry several statements joined by ;, and in that case each is handled as a separate statement. Every section from here on uses "statement" in this sense.

One backend process owns one query

Every time a client connects, PostgreSQL forks a backend process for it (the parent is postmaster). That process stays alive until the client disconnects, and it handles every query that client sends, by itself. Unlike the thread-pool model common in other RDBMSs, PG uses one OS process per connection. The reasons behind that decision are taken up in 6.1.1.

The actual entry point of that backend is a function called PostgresMain. The name is grand; what it does is unexpectedly simple. Two things, then off it goes.

First, it installs signal handlers. Signals are asynchronous notifications the OS delivers to a process (for example, SIGTERM is a request to shut down, SIGUSR1 is for PG-internal communication). A backend has to react to signals from postmaster and from other backends, so each signal is wired to a handler ahead of time. Signals and IPC in general are covered in 6.3.

Second, it initializes the transaction system. Every SQL statement in PG, even without an explicit BEGIN, runs inside some transaction. The transaction system is the core PG machinery that tracks BEGIN/COMMIT boundaries, MVCC visibility, XID assignment, and so on. Transactions and MVCC are the subject of all of chapter 3. For now, it's enough to know that this machinery is set up before the backend ever sees a SQL statement.

Once those two preparations are done, the real work of the backend begins. An infinite loop.

for (;;)
{
    ...
    ReadyForQuery(whereToSendOutput);
    ...
    firstchar = ReadCommand(&input_message);
    ...
    switch (firstchar)
    {
        case PqMsg_Query:        // 'Q', simple query
            exec_simple_query(query_string);
            break;
        case PqMsg_Parse:        // 'P', extended: parse
            exec_parse_message(...);
            break;
        case PqMsg_Bind:         // 'B', extended: bind
            exec_bind_message(&input_message);
            break;
        case PqMsg_Execute:      // 'E', extended: execute
            exec_execute_message(portal_name, max_rows);
            break;
        case PqMsg_Sync:         // 'S', end of an extended cycle
            finish_xact_command();
            send_ready_for_query = true;
            break;
        ...
    }
}

This loop is the entire life of a backend.

"Announce that I'm ready, read one message, dispatch on its type." Repeat forever. When the client closes the connection, an 'X' (Terminate) message arrives, the loop exits, and the process dies.

The first fork in the road is visible right here. There's the 'Q' path and the 'P' / 'B' / 'E' path. That split is the difference between the simple query protocol and the extended query protocol.

Simple vs extended

Simple is the case where a single message contains the SQL text in full. Type SELECT 1; into psql and hit enter, and that's what flies across the wire. The backend receives that one message and runs the full five-stage cycle (dispatch, parser + analyzer, rewriter, planner, executor) before returning the result.

Extended does the same job but splits it into four messages ('P', 'B', 'E', 'S'). Splitting the stages opens up plan reuse, parameter safety, and pipelining. The semantic differences between the two protocols and how they play out in practice are unpacked in section 1.1.2.

Optimizable vs utility

Everything described so far assumes optimizable statements: SELECT/INSERT/UPDATE/DELETE. These have paths to optimize. The planner decides between sequential and index scan, hash join and nested loop, one join order or another.

But statements like CREATE TABLE, VACUUM, SET, and BEGIN (the utility statements) are different. There's nothing for a cost model to optimize. They're DDL or system commands, with no path to choose. In that case the planner produces only an empty shell of a plan and hands the actual work to a utility-statement handler. The executor never gets called on this path.

The detailed branching is the subject of 1.1.3. The takeaway here is just one thing: not every query in PG goes through the planner.

The big picture

We can now compress the journey of a SQL line into a single diagram.

Client
   │
   │  'Q' (or 'P' + 'B' + 'E')
   ▼
PostgresMain main loop
   │
   ▼
exec_simple_query
   │
   ├─ pg_parse_query           → raw parse tree     (1.2.1, 1.2.3)
   │
   ├─ pg_analyze_and_rewrite   → list of Query nodes (1.2.2, 1.3)
   │
   ├─ pg_plan_queries          → execution plan      (1.4 chapter)
   │     └─ utility produces an empty shell          (1.1.3)
   │
   ├─ PortalStart + PortalRun  → tuple pulling       (1.5)
   │
   └─ PortalDrop + finish_xact_command
   │
   ▼
ReadyForQuery → back to the top of the loop

Each box in this diagram corresponds to a chapter in the book. 1.2 is parser and analyzer, 1.3 is rewriter, 1.4 is planner, 1.5 is the executor. All of part 1 is essentially one zoomed-in view of this diagram.

Working on another RDBMS engine, I once found this aspect of PG surprising. PG accepts a multi-statement query like SELECT 1; SELECT 2; as a single simple-query message. What's even more surprising is the transaction handling. Without an explicit BEGIN/COMMIT, all those statements get bundled into a single implicit transaction block, and if even one of them fails, the whole batch rolls back.

At first I assumed this was just standard behavior. Comparing the client protocols of other major databases made it clear this is a PG-specific decision. MySQL has CLIENT_MULTI_STATEMENTS off by default, so multi-statement queries are simply rejected (you have to flip the flag explicitly because of SQL injection risk). Even with the flag on, statements are processed sequentially, and because autocommit is the default, each one commits as its own transaction. Oracle accepts only one statement per OCI call, so to bundle multiple statements you have to wrap them in an anonymous PL/SQL block (BEGIN ... END;). SQL Server accepts multiple statements in a T-SQL batch, but atomic handling still requires an explicit BEGIN TRANSACTION. None of the three does what PG does: bundle automatically as soon as the message arrives.

What this means in practice

This five-stage skeleton turns out to be the foundation for two diagnostic tools you'll use in operations.

First, you can see exactly where EXPLAIN's output comes from. EXPLAIN runs only as far as stage 4 (plan); it skips stage 5 (execute). EXPLAIN ANALYZE actually runs through stage 5 and measures it. That's why EXPLAIN ANALYZE produces real load and shouldn't be casually run in production: an EXPLAIN ANALYZE UPDATE ... actually updates rows. The familiar BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK idiom exists for exactly this reason.

Second, the fact that one backend means one process means one query at a time explains why connection pooling matters so much. A backend's main loop is essentially single-threaded. While one client runs a long query, that backend can't do anything else. Connection counts therefore drive memory and scheduling costs linearly, and a pooler like PgBouncer becomes effectively mandatory. The answer to "why are PostgreSQL connections so expensive?" lives inside this one-line main loop.

Forem: JoongHyuk Shin

1.3.4 RETURNING Mapping and View-Target DML Transformation

RULE's RETURNING is reshaped to the user's request

DML on a view is rewritten as DML on a base relation

A security_barrier view pins the view's WHERE in front of securityQuals

WITH CHECK OPTION stops a new row from breaking the view's visibility

View transformation unwinds recursively

What this means in practice

1.3.3 INSERT/UPDATE/DELETE Targetlist Normalization (Default Expansion)

DML preparation has to happen before rule application

INSERT auto-injects default expressions into missing columns

Columns without a default: how the empty slot resolves to NULL

The DEFAULT keyword arrives as a SetToDefault placeholder and gets resolved

Single-row INSERT is substituted directly in the targetList

Multi-row INSERT runs through a VALUES RTE in between

UPDATE only tidies what the user wrote; DELETE leaves the targetlist alone

What this means in practice

1.3.2 How RLS Rewrites Queries (Mechanism)

The qual the rewriter adds

Why the rewriter, not somewhere else

The qual that filters and the qual that checks

How policies get combined

Security barrier: quals have an order

What this means in practice

1.3.1 Rule system and view expansion

View and materialized view: what is the difference

PostgreSQL's RULE system

A view is a special case of the RULE system

The rewriter works in two steps

Step 2 exceptions: materialized views and EXCLUDED

The RULE system has many pitfalls and is not recommended

What this means in practice

1.3 Rewriter: How a Query is Rewritten

1.2.3 Query tree node types: Query, RangeTblEntry, TargetEntry

Tokens descend into catalog coordinates

Range table: every FROM input in one place

Target list: SELECT, INSERT, UPDATE, RETURNING all in one shape

Where the Query goes

What this means in practice

1.2.2 Semantic analysis: name resolution, type checking, catalog lookup

Parser and Analyzer

What the raw parse tree leaves out

Name resolution: what does each name in SQL refer to?

Type checking and cast insertion

Catalog lookups and locks

After Analyze finishes

What this means in practice

1.2.1 From SQL Text to Raw Parse Tree

Two tools dividing the work: flex and Bison

What the lexer (scan.l) sees

The tree the grammar (gram.y) builds

The lookahead filter: base_yylex

Never touches the catalog

What this means in practice

1.2 Parser and Analyzer: How SQL Gets Its Meaning

1.1.3 Optimizable vs Utility

Five optimizables, and everything else

The fork lives in transformStmt's switch

Two species sharing one system

What this means in practice

1.1.2 Simple vs Extended

Message sequence: the shape of one cycle is different

Capability one: execution plans get reused

Generic plan vs custom plan

Capability two: SQL injection is structurally blocked

Capability three: messages can be batched (pipelining)

Capability four: a partial error doesn't break the whole batch

All four in one table

What this means in practice

1.1.1 Life of a Query

One backend process owns one query

Simple vs extended

Optimizable vs utility

The big picture

What this means in practice

The `DEFAULT` keyword arrives as a SetToDefault placeholder and gets resolved