<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: JoongHyuk Shin</title>
    <description>The latest articles on Forem by JoongHyuk Shin (@joonghyukshin).</description>
    <link>https://forem.com/joonghyukshin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911251%2F63b302ae-0e4c-4fa7-916b-72b2a119b393.png</url>
      <title>Forem: JoongHyuk Shin</title>
      <link>https://forem.com/joonghyukshin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/joonghyukshin"/>
    <language>en</language>
    <item>
      <title>1.2.1 From SQL Text to Raw Parse Tree</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Wed, 06 May 2026 02:06:12 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/121-from-sql-text-to-raw-parse-tree-342c</link>
      <guid>https://forem.com/joonghyukshin/121-from-sql-text-to-raw-parse-tree-342c</guid>
      <description>&lt;p&gt;A line like &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; arrives at the backend. As we saw in 1.1.1, the backend is a child process forked by the postmaster when the client connects, dedicated to that one client. What this process first holds in its hands is just a byte array. It does not yet know where the keywords end, where the identifiers begin, or where the integer constant lives. Turning that byte array into a tree structure is the front half of the second of the five stages, namely raw parsing. This section covers only that front half. The back half (parse analysis), which consults the catalog to attach meaning, belongs to 1.2.2. (The catalog, in case you need a refresher, is the set of internal tables PostgreSQL maintains to describe itself: which tables have which columns, which functions take which argument types, and so on, all stored as rows in these tables.)&lt;/p&gt;

&lt;p&gt;The output of raw parsing is a single &lt;code&gt;RawStmt&lt;/code&gt; node per SQL string (or a List, if multiple statements are joined with &lt;code&gt;;&lt;/code&gt;). This RawStmt wraps a raw node like &lt;code&gt;SelectStmt&lt;/code&gt; for SELECT, &lt;code&gt;InsertStmt&lt;/code&gt; for INSERT, and so on. The name "raw" means it has not seen the catalog. Whether &lt;code&gt;users&lt;/code&gt; is actually an existing table, which column &lt;code&gt;id&lt;/code&gt; refers to, none of that is known yet. All that has been captured is the grammatical structure: a SELECT keyword followed by a column list, a FROM followed by an identifier, a WHERE followed by a comparison expression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two tools dividing the work: flex and Bison
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's raw parser is a collaboration of two tools. One side is the &lt;strong&gt;lexer&lt;/strong&gt; (lexical analyzer), the other is the &lt;strong&gt;grammar&lt;/strong&gt; (syntactic analyzer). The lexer slices the byte array into tokens. The grammar takes that token sequence and groups it into a tree.&lt;/p&gt;

&lt;p&gt;PostgreSQL does not write these two by hand. It uses standard generator tools brought in from outside. The lexer is &lt;strong&gt;flex&lt;/strong&gt;, the grammar is &lt;strong&gt;Bison&lt;/strong&gt;. Both are code generators. The developer writes rules, and at build time the tool reads those rules and emits actual working lexer/parser C code.&lt;/p&gt;

&lt;p&gt;For flex, a rule means "if this regex pattern comes in, emit this token." For example, "if a letter is followed by alphanumerics, emit an IDENT (identifier) token", "if only digits appear, emit an ICONST (integer constant) token." PostgreSQL keeps these rules in &lt;code&gt;scan.l&lt;/code&gt; (about 1,400 lines). At build time flex reads this file and generates the C function that does the tokenizing.&lt;/p&gt;
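&lt;p&gt;To make those two rules concrete, here is a toy classifier in plain C. Everything in it (&lt;code&gt;classify_word&lt;/code&gt;, the token names) is invented for this sketch; the real rules in &lt;code&gt;scan.l&lt;/code&gt; handle far more cases (quoted identifiers, operators, comments, Unicode escapes).&lt;/p&gt;

```c
/* Toy classifier mirroring the two flex rules just described:
 * a letter (or underscore) followed by alphanumerics -> IDENT,
 * digits only -> ICONST.  All names here are invented for this
 * sketch; they are not PostgreSQL's. */
typedef enum { TOK_IDENT, TOK_ICONST, TOK_OTHER } ToyToken;

ToyToken classify_word(const char *s)
{
    int i;

    if (s[0] == '\0')
        return TOK_OTHER;

    if (isalpha((unsigned char) s[0]) || s[0] == '_')
    {
        for (i = 1; s[i] != '\0'; i++)
        {
            if (isalnum((unsigned char) s[i]))
                continue;
            if (s[i] == '_')
                continue;
            return TOK_OTHER;   /* whole word matches neither rule */
        }
        return TOK_IDENT;       /* e.g. "users", "name" */
    }

    for (i = 0; s[i] != '\0'; i++)
    {
        if (!isdigit((unsigned char) s[i]))
            return TOK_OTHER;
    }
    return TOK_ICONST;          /* e.g. "1", "42" */
}
```

&lt;p&gt;Note that &lt;code&gt;select&lt;/code&gt; classifies as an identifier candidate here too. Whether it then becomes a keyword token is a separate lookup, covered below.&lt;/p&gt;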

&lt;p&gt;For Bison, a rule means "if this token sequence appears, build this tree node." For example, "if SELECT is followed by a column list, then FROM and a table identifier, build a SelectStmt node." PostgreSQL keeps these grammar rules in &lt;code&gt;gram.y&lt;/code&gt; (about 21,000 lines). At build time Bison reads this file and generates the C parser function that groups tokens into a tree.&lt;/p&gt;

&lt;p&gt;The driver that ties these two together is the &lt;code&gt;raw_parser()&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;raw_parser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RawParseMode&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;core_yyscan_t&lt;/span&gt; &lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;base_yy_extra_type&lt;/span&gt; &lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;           &lt;span class="n"&gt;yyresult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;scanner_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;     &lt;span class="cm"&gt;/* flex init */&lt;/span&gt;
    &lt;span class="n"&gt;parser_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="cm"&gt;/* Bison state init */&lt;/span&gt;
    &lt;span class="n"&gt;yyresult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_yyparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="cm"&gt;/* one cycle */&lt;/span&gt;
    &lt;span class="n"&gt;scanner_finish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyresult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="cm"&gt;/* syntax error */&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsetree&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="cm"&gt;/* List of RawStmt */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;base_yyparse()&lt;/code&gt; is the actual parser function Bison generated. Inside it, whenever a token is needed, it calls the lexer and groups tokens into a tree according to grammar rules. The lexer does not run on its own. It is pulled along by the grammar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the lexer (scan.l) sees
&lt;/h2&gt;

&lt;p&gt;What the lexer does is simple. It scans the byte array from the start, finds the longest pattern that matches one of the regex rules, and hands the corresponding token type and value to the grammar. What does "longest matching pattern" mean? When several rules can match starting at the same position, the lexer picks the one with the longer match. For example, if the input contains &lt;code&gt;&amp;gt;=&lt;/code&gt;, the lexer does not slice off &lt;code&gt;&amp;gt;&lt;/code&gt; alone. It groups &lt;code&gt;&amp;gt;=&lt;/code&gt; into a single token (such as &lt;code&gt;GREATER_EQUALS&lt;/code&gt;). &lt;code&gt;&amp;gt;&lt;/code&gt; alone would also match a comparison-operator rule, but a longer-matching rule for &lt;code&gt;&amp;gt;=&lt;/code&gt; exists, so that one wins. In lex terminology this is called longest match. Almost every lexer behaves this way. So when &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; comes in, the lexer emits tokens in this order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;     &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="n"&gt;users&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;=&lt;/span&gt;          &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="n"&gt;constant&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ICONST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How does the lexer tell identifiers from keywords? Every identifier candidate (a letter followed by alphanumerics) first matches the same regex rule. After that, the matched string is looked up in the keyword table: a binary search over a sorted list in older versions, a generated perfect-hash function since PostgreSQL 12. If it is in the table, it becomes a keyword token. If not, IDENT.&lt;/p&gt;
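&lt;p&gt;A minimal sketch of that lookup, using the classic sorted-table binary search. The six-word table is an excerpt invented for this sketch, not the real 500-plus entry list generated from &lt;code&gt;kwlist.h&lt;/code&gt;.&lt;/p&gt;

```c
/* Sketch of the keyword lookup: after the identifier rule matches,
 * the word is searched in a sorted keyword list.  The table is a
 * tiny invented excerpt of the real one. */
static const char *const keywords[] = {
    "delete", "from", "insert", "select", "update", "where"
};

/* Returns the keyword's index, or -1 for a plain identifier. */
int keyword_lookup(const char *word)
{
    int lo = 0;
    int hi = (int) (sizeof(keywords) / sizeof(keywords[0])) - 1;

    while (hi >= lo)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(word, keywords[mid]);

        if (cmp == 0)
            return mid;         /* found: emit a keyword token */
        if (cmp > 0)
            lo = mid + 1;       /* word sorts after the midpoint */
        else
            hi = mid - 1;       /* word sorts before the midpoint */
    }
    return -1;                  /* not a keyword: emit IDENT */
}
```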

&lt;p&gt;The size and categorization of that keyword table matter. PostgreSQL has 511 keywords, split into four categories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNRESERVED&lt;/td&gt;
&lt;td&gt;346&lt;/td&gt;
&lt;td&gt;usable as identifier (e.g. &lt;code&gt;abort&lt;/code&gt;, &lt;code&gt;aggregate&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COL_NAME&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;usable as a column name but not as a function/type name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TYPE_FUNC_NAME&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;usable as a type or function name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RESERVED&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;never usable as identifier (e.g. &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;where&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This split is why a query like &lt;code&gt;SELECT abort FROM ...&lt;/code&gt; works in PostgreSQL: &lt;code&gt;abort&lt;/code&gt; is UNRESERVED, so it can also be used as an identifier. By contrast, &lt;code&gt;SELECT select FROM ...&lt;/code&gt; is a syntax error. RESERVED never gives way. Compatibility differences with other RDBMSes often come from this split.&lt;/p&gt;

&lt;p&gt;There is an interesting comment at the top of scan.l: "rules in this file must be kept in sync with &lt;code&gt;src/fe_utils/psqlscan.l&lt;/code&gt; and &lt;code&gt;src/interfaces/ecpg/preproc/pgc.l&lt;/code&gt;!" In other words, the same SQL lexer rules live in three places. The psql client has its own lexer, ecpg (the embedded SQL preprocessor) has its own. The reason is that each tool has slightly different lexical needs, but the practical consequence is that changing the lexer rules in PostgreSQL means synchronizing three files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tree the grammar (gram.y) builds
&lt;/h2&gt;

&lt;p&gt;As the lexer emits tokens, the grammar takes them in order, matches them against grammar rules, and assembles a tree. It builds bottom-up. Small subtrees are grouped first, those subtrees are then collected into larger nodes, and finally a single node corresponding to the whole statement is completed. This act of "taking a sequence of tokens or subtrees and reducing them into a single parent node" is called &lt;strong&gt;reduce&lt;/strong&gt; in parser terminology. For example, if the token sequence &lt;code&gt;IDENT '=' ICONST&lt;/code&gt; matches the "binary comparison expression" rule, those three are reduced into a single subtree. That subtree is then reduced as part of a WHERE clause rule. That WHERE clause is reduced as part of a SelectStmt rule. In the end, one SelectStmt node is built and gets wrapped in a RawStmt.&lt;/p&gt;
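&lt;p&gt;As a sketch of what one reduce action produces, here is a toy version of the &lt;code&gt;IDENT '=' ICONST&lt;/code&gt; step. &lt;code&gt;ToyExpr&lt;/code&gt; and &lt;code&gt;make_binary_expr&lt;/code&gt; are invented names; the real nodes are &lt;code&gt;A_Expr&lt;/code&gt;, &lt;code&gt;ColumnRef&lt;/code&gt;, and &lt;code&gt;A_Const&lt;/code&gt;, allocated with &lt;code&gt;makeNode()&lt;/code&gt; inside the action in gram.y.&lt;/p&gt;

```c
/* Toy version of what the reduce action for IDENT '=' ICONST builds:
 * three inputs collapse into one subtree the parent rule can use. */
typedef struct ToyExpr
{
    const char *op;         /* the operator, here "=" */
    const char *lhs_name;   /* identifier on the left */
    int         rhs_value;  /* integer constant on the right */
} ToyExpr;

ToyExpr *make_binary_expr(const char *op, const char *name, int value)
{
    ToyExpr *e = malloc(sizeof(ToyExpr));

    e->op = op;
    e->lhs_name = name;
    e->rhs_value = value;
    return e;               /* this subtree is what the parent rule sees */
}
```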

&lt;p&gt;Here is a picture of the raw parse tree that &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; produces. The tree has the root at the top, leaves toward the bottom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        RawStmt
                           │
                       SelectStmt
              ┌────────────┼─────────────┐
              │            │             │
          targetList   fromClause    whereClause
              │            │             │
            IDENT       RangeVar      A_Expr (=)
            "name"      "users"      ┌──────┴──────┐
                                     │             │
                                   IDENT         ICONST
                                   "id"             1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RawStmt at the top is the output of the raw parser. Keywords like SELECT, FROM, WHERE are only markers that tell the grammar rule which subtree to reduce into, so they do not survive as separate tree nodes. Only meaningful values (identifiers, constants) from the input tokens remain as leaves.&lt;/p&gt;

&lt;p&gt;PostgreSQL's top rule is &lt;code&gt;stmtmulti&lt;/code&gt;. It expresses that multiple SQL statements may be joined with &lt;code&gt;;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stmtmulti:  stmtmulti ';' toplevel_stmt
                { ... lappend($1, makeRawStmt($3, @3)); ... }
          | toplevel_stmt
                { ... list_make1(makeRawStmt($1, @1)); ... }
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;toplevel_stmt&lt;/code&gt; is one SQL statement, and they all get wrapped by &lt;code&gt;makeRawStmt()&lt;/code&gt; into a RawStmt. That is why the raw parser's output is always a List of &lt;code&gt;RawStmt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Going down into &lt;code&gt;toplevel_stmt&lt;/code&gt; and then &lt;code&gt;stmt&lt;/code&gt;, you find that &lt;code&gt;stmt&lt;/code&gt; is a giant OR rule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stmt:
        AlterEventTrigStmt
      | AlterCollationStmt
      | AlterDatabaseStmt
      | ...
      | SelectStmt
      | InsertStmt
      | ...
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than 200 statement kinds are listed here, one per line. Each then expands into its own sub-rules. &lt;code&gt;SelectStmt&lt;/code&gt; alone has dozens of sub-rules, and every element of SELECT (the FROM clause, the WHERE clause, GROUP BY, set operations) unfolds inside it like a tree. The 21,000 lines of gram.y are the result of this unfolding.&lt;/p&gt;

&lt;p&gt;A new tree node is built inside the reduce action. At the moment &lt;code&gt;SELECT * FROM users&lt;/code&gt; reduces into a SelectStmt node, &lt;code&gt;makeNode(SelectStmt)&lt;/code&gt; allocates the node and fills in target list, FROM clause, WHERE, and so on. At this moment we do not know which OID of which table &lt;code&gt;users&lt;/code&gt; refers to. The identifier string &lt;code&gt;"users"&lt;/code&gt; simply lives inside a &lt;code&gt;RangeVar&lt;/code&gt; node. Catalog lookup is the next stage's job.&lt;/p&gt;

&lt;p&gt;gram.y also happens to be the file that most directly shows PostgreSQL's history of SQL compatibility changes. When a new PostgreSQL version adds a new SQL feature (such as MERGE or JSON_TABLE), the grammar rules grow accordingly, so dozens to hundreds of lines get added to gram.y each time. Following git blame is enough to see which SQL feature entered which PostgreSQL version. The 21,000-line size is the trace of three decades of SQL standard accumulation in a single file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lookahead filter: base_yylex
&lt;/h2&gt;

&lt;p&gt;There is one more detail. The parser Bison generates is &lt;strong&gt;LALR(1)&lt;/strong&gt;. Spelling out the acronym, that is Look-Ahead Left-to-right Rightmost-derivation, with 1-token lookahead. The name sounds intimidating, but the gist is simple. The parser scans input from left to right exactly once, and at each step it peeks at exactly one upcoming token ("1-token lookahead") to decide which grammar rule to apply. Being able to decide with one token of lookahead keeps the parser fast and memory-efficient. Almost every mainstream compiler/DB parser works this way.&lt;/p&gt;

&lt;p&gt;The catch is that SQL grammar has a few cases that do not fit cleanly into LALR(1). Multi-word tokens like &lt;code&gt;NULLS FIRST&lt;/code&gt; and &lt;code&gt;WITH ORDINALITY&lt;/code&gt; need to be received by the grammar as single tokens. If NULLS and FIRST were given to the grammar as separate tokens, the grammar could not decide what comes after NULLS based on a single lookahead.&lt;/p&gt;

&lt;p&gt;PostgreSQL solves this by inserting one filter layer between the lexer and the grammar. That filter is &lt;code&gt;base_yylex()&lt;/code&gt;. The actual lexer flex generates is &lt;code&gt;core_yylex()&lt;/code&gt;, but Bison never calls it directly. It always receives tokens through &lt;code&gt;base_yylex()&lt;/code&gt;. base_yylex looks at one token, and if this is a case that needs the next token pulled in and merged, it gives the grammar a single combined token. The result is that the grammar can be written cleanly as LALR(1), and the complexity of multi-word tokens is isolated inside base_yylex.&lt;/p&gt;
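&lt;p&gt;The filtering idea can be sketched in a few lines. The token codes and &lt;code&gt;filtered_lex&lt;/code&gt; are invented for this illustration; the real &lt;code&gt;base_yylex()&lt;/code&gt; in &lt;code&gt;src/backend/parser/parser.c&lt;/code&gt; keeps a small lookahead buffer and handles several such pairs.&lt;/p&gt;

```c
/* Sketch of the lookahead filter: look at the current token, peek at
 * the next one, and for known pairs hand the grammar a single merged
 * token.  All names here are invented for this sketch. */
enum ToyTok
{
    TOK_NULLS = 1, TOK_FIRST, TOK_LAST,
    TOK_NULLS_FIRST, TOK_NULLS_LAST, TOK_EOF
};

/* Returns the (possibly merged) token; *consumed reports how many
 * input tokens it swallowed. */
int filtered_lex(int cur, int next, int *consumed)
{
    *consumed = 1;
    if (cur == TOK_NULLS)
    {
        if (next == TOK_FIRST)
        {
            *consumed = 2;
            return TOK_NULLS_FIRST;   /* NULLS FIRST becomes one token */
        }
        if (next == TOK_LAST)
        {
            *consumed = 2;
            return TOK_NULLS_LAST;    /* NULLS LAST becomes one token */
        }
    }
    return cur;                       /* everything else passes through */
}
```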

&lt;h2&gt;
  
  
  Never touches the catalog
&lt;/h2&gt;

&lt;p&gt;The most important constraint of the raw parser stage is that &lt;strong&gt;it never accesses the system catalog&lt;/strong&gt;. This is not just a convention but a correctness requirement.&lt;/p&gt;

&lt;p&gt;The header comment of &lt;code&gt;pg_parse_query()&lt;/code&gt; explains why.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analysis and rewriting cannot be done in an aborted transaction, since they require access to database tables. So, we rely on the raw parser to determine whether we've seen a COMMIT or ABORT command; when we are in abort state, other commands are not processed any further than the raw parse stage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To follow this comment, you first need to know what "the transaction is in abort state" means. Here is the most intuitive scenario.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;-- unique violation!&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment the third line violates the unique constraint and errors out, PostgreSQL marks this transaction as "already broken." That is the abort state. The next line, &lt;code&gt;SELECT * FROM users;&lt;/code&gt;, is grammatically fine and the table exists in the catalog, but PostgreSQL refuses to execute it. Instead, every subsequent SQL gets the same one-line error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR:  current transaction is aborted, commands ignored until end of transaction block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the situation in &lt;code&gt;psql&lt;/code&gt; where, after one query inside a transaction fails, every following query is rejected with the same message. To unlock it, the client must explicitly send &lt;code&gt;ROLLBACK&lt;/code&gt; (or &lt;code&gt;COMMIT&lt;/code&gt;, which in abort state has the same effect as rollback). In other words, the only SQL that PostgreSQL effectively accepts in abort state is a transaction-ending command.&lt;/p&gt;

&lt;p&gt;Now the relationship between the raw parser and the catalog matters. ROLLBACK still arrives as SQL text, so it has to be parsed somehow. But in abort state, even catalog reads are blocked. What if the raw parser depended on the catalog? Parsing ROLLBACK itself would then error out, and the client would have no way to unwind the transaction. Killing and reopening the connection would be the only escape.&lt;/p&gt;

&lt;p&gt;The design choice that avoids this from the start is the raw parser's catalog ban. The raw parser must work in any transaction state, so it never touches the catalog. All identifier-level meaning analysis is deferred to the next stage (parse analysis). 1.2.2 covers that story.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, "parsing is fast" really means raw parsing is fast.&lt;/strong&gt; PostgreSQL's raw parser is pure string work without catalog lookup, so it is very fast. But the bulk of what a prepared statement (1.1.2) caches is not raw parsing. It is &lt;strong&gt;the next stage (parse analysis)&lt;/strong&gt;. Parse analysis is where catalog lookups, function-overload resolution, and type checking happen. When people talk about the per-query parse cost of the simple query protocol being a burden, they almost always mean parse analysis cost, not raw parse cost. To diagnose accurately, separate which stage's cost you are looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, keyword categories are a hidden trap in PostgreSQL compatibility and migration.&lt;/strong&gt; A word that other RDBMSes happily allow as a column or function name may be RESERVED in PostgreSQL. If your migration tool does not automatically wrap such names in quotes (&lt;code&gt;"name"&lt;/code&gt;), you get a syntax error. Conversely, identifiers PostgreSQL allows freely (such as &lt;code&gt;abort&lt;/code&gt;) may be blocked in another DB. When reviewing a migration, comparing the keyword tables on both sides is the safe move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the raw parser only catches syntax errors.&lt;/strong&gt; Errors like "table does not exist", "function signature does not match", "column is ambiguous" are all thrown by the next stage, parse analysis. So when a query has both a syntax error and a semantic error, the syntax error is reported first. The semantic error only surfaces once syntax is clean. If your error message suddenly changes during debugging, that may be a signal that the syntax issue cleared and you are now seeing the next stage's error.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>parser</category>
    </item>
    <item>
      <title>1.2 Parser and Analyzer: How SQL Gets Its Meaning</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Wed, 06 May 2026 02:05:24 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/12-parser-and-analyzer-how-sql-gets-its-meaning-44e3</link>
      <guid>https://forem.com/joonghyukshin/12-parser-and-analyzer-how-sql-gets-its-meaning-44e3</guid>
      <description>&lt;p&gt;A line like &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; is just text when the client sends it. The first thing the backend does after receiving it is figure out "what do these characters mean." This chapter covers the second of the five stages we saw in 1.1.1: the parser and analyzer.&lt;/p&gt;

&lt;p&gt;By the time this stage finishes, the SQL text has been transformed twice. First into a &lt;strong&gt;raw parse tree&lt;/strong&gt; that captures the grammatical structure, then into a &lt;strong&gt;Query tree&lt;/strong&gt; with meaning attached after consulting the catalog. The catalog, in case you need a refresher, is the set of internal tables that PostgreSQL keeps to describe itself. Which tables exist, what columns they have, what argument types each function accepts, what data types are defined: all of it lives as rows in these tables. PostgreSQL treats user data and metadata uniformly, both as ordinary tables. The raw stage looks only at form (does this follow SELECT syntax, where are the IDENTs). The Query stage looks at substance (this identifier &lt;code&gt;users&lt;/code&gt; is which table in which schema and what is its OID, &lt;code&gt;id&lt;/code&gt; is which column and what is its type, can &lt;code&gt;1&lt;/code&gt; be coerced to that column's type).&lt;/p&gt;

&lt;p&gt;1.2 splits into three sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.2.1 From SQL text to raw parse tree (lexer, grammar)&lt;/strong&gt;: how the flex-based lexer and Bison-based grammar turn an SQL string into a tree. This stage is pure syntactic work, no catalog access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.2.2 Semantic analysis: name resolution, type checking, catalog lookup&lt;/strong&gt;: take the raw parse tree, dig into the catalog to find what each identifier really refers to, check types, resolve function overloads. This is the body of how PostgreSQL gives meaning to SQL text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.2.3 Query tree node types (Query, RangeTblEntry, TargetEntry)&lt;/strong&gt;: the core nodes of the Query tree, which is the output of semantic analysis. This node structure is the standard input format that rewriter, planner, and executor all consume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this chapter, you should have a clear picture of how SQL text meets the catalog and acquires meaning, and what data structures carry that meaning into the next stage.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>parser</category>
    </item>
    <item>
      <title>1.1.3 Optimizable vs Utility</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Tue, 05 May 2026 07:36:20 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/113-optimizable-vs-utility-51bl</link>
      <guid>https://forem.com/joonghyukshin/113-optimizable-vs-utility-51bl</guid>
      <description>&lt;p&gt;Inside the five-stage pipeline from 1.1.1, there is another fork right after the parser. PostgreSQL classifies every SQL command into one of two camps. One side holds the &lt;strong&gt;optimizable&lt;/strong&gt; queries, the other holds the &lt;strong&gt;utility&lt;/strong&gt; commands. The classification is decided by a single field on the Query node, &lt;code&gt;commandType&lt;/code&gt;, and from that point on the two camps travel &lt;strong&gt;completely different paths&lt;/strong&gt;. One goes through the rewriter, the planner, and the executor. The other bypasses all three.&lt;/p&gt;

&lt;p&gt;This fork was a single line in the 1.1.1 picture, but it shapes the entire internal structure of PostgreSQL, so it earns its own section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five optimizables, and everything else
&lt;/h2&gt;

&lt;p&gt;PostgreSQL defines its command types as a single enum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;CmdType&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;CMD_UNKNOWN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_DELETE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_MERGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_UTILITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CMD_NOTHING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;CmdType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of these, &lt;code&gt;CMD_SELECT&lt;/code&gt;, &lt;code&gt;CMD_INSERT&lt;/code&gt;, &lt;code&gt;CMD_UPDATE&lt;/code&gt;, &lt;code&gt;CMD_DELETE&lt;/code&gt;, and &lt;code&gt;CMD_MERGE&lt;/code&gt; are the &lt;strong&gt;optimizable&lt;/strong&gt; ones. As the name suggests, these are queries the planner can do meaningful work on. It rearranges join order using a cost model, picks indexes, and chooses scan methods.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CMD_UTILITY&lt;/code&gt; is the catch-all for everything else: &lt;code&gt;CREATE TABLE&lt;/code&gt;, &lt;code&gt;ALTER TABLE&lt;/code&gt;, &lt;code&gt;DROP&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt;/&lt;code&gt;ROLLBACK&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;NOTIFY&lt;/code&gt;, &lt;code&gt;LISTEN&lt;/code&gt;, &lt;code&gt;CLUSTER&lt;/code&gt;, &lt;code&gt;REINDEX&lt;/code&gt;, &lt;code&gt;GRANT&lt;/code&gt;, &lt;code&gt;SET&lt;/code&gt;, &lt;code&gt;SHOW&lt;/code&gt;, &lt;code&gt;TRUNCATE&lt;/code&gt;, &lt;code&gt;LOCK&lt;/code&gt;, &lt;code&gt;FETCH&lt;/code&gt;, &lt;code&gt;CHECKPOINT&lt;/code&gt;, &lt;code&gt;PREPARE TRANSACTION&lt;/code&gt;, &lt;code&gt;CREATE INDEX&lt;/code&gt;, &lt;code&gt;CREATE FUNCTION&lt;/code&gt;, and many more. What they share is a single property: &lt;strong&gt;the planner has no room to produce a better plan via cost comparison&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A command like &lt;code&gt;CREATE TABLE foo (id int)&lt;/code&gt; cannot have two different paths. It just inserts a few rows into the system catalog and asks the storage manager to allocate a new relfilenode. &lt;code&gt;BEGIN&lt;/code&gt; similarly nudges the transaction state by one step; there is no choice in "how to BEGIN." &lt;code&gt;VACUUM&lt;/code&gt; walks a target table page by page and cleans up dead tuples through a fixed procedure. The point is that cost comparison is meaningless here.&lt;/p&gt;

&lt;p&gt;The two camps are wired through different code paths. The first split happens in the analyzer, right after the parser hands over a raw parse tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fork lives in transformStmt's switch
&lt;/h2&gt;

&lt;p&gt;When the raw parse tree arrives, &lt;code&gt;transformStmt()&lt;/code&gt; runs a large switch on the node tag (&lt;code&gt;src/backend/parser/analyze.c&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parseTree&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Optimizable statements */&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_InsertStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformInsertStmt&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_SelectStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformSelectStmt&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... UPDATE, DELETE, MERGE ... */&lt;/span&gt;

    &lt;span class="cm"&gt;/* Special cases (utility wrappers around an optimizable inside) */&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_DeclareCursorStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_ExplainStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_CreateTableAsStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;T_CallStmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="cm"&gt;/* transform the inner query separately */&lt;/span&gt;
        &lt;span class="p"&gt;...&lt;/span&gt;

    &lt;span class="nl"&gt;default:&lt;/span&gt;
        &lt;span class="cm"&gt;/* every other utility */&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;makeNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;commandType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CMD_UTILITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;utilityStmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parseTree&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL does meaningful semantic analysis on the five optimizable statement types and a handful of special cases. Everything else falls through to the default branch, gets stamped &lt;code&gt;commandType = CMD_UTILITY&lt;/code&gt;, and the raw parse tree is stored verbatim in the &lt;code&gt;utilityStmt&lt;/code&gt; field. Nothing was actually analyzed; a Query shell was wrapped around the raw tree with a "this is utility" sticker.&lt;/p&gt;
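&lt;p&gt;The shape of that default branch is easy to model. Here is a toy Python sketch (not PostgreSQL code; the tag strings and the dict standing in for the Query node are illustrative):&lt;/p&gt;

```python
# Toy model of transformStmt()'s split: the five optimizable tags get a
# dedicated transform; everything else is wrapped, unanalyzed, in a
# Query shell stamped CMD_UTILITY. Tag names mirror the real node tags.
OPTIMIZABLE_TAGS = {"SelectStmt", "InsertStmt", "UpdateStmt", "DeleteStmt", "MergeStmt"}

def transform_stmt(raw_tree):
    tag = raw_tree["tag"]
    if tag in OPTIMIZABLE_TAGS:
        # Real code dispatches to transformSelectStmt() and friends here.
        return {"commandType": "CMD_" + tag[:-4].upper(), "analyzed": True}
    # Default branch: no semantic analysis, just a shell around the raw tree.
    return {"commandType": "CMD_UTILITY", "utilityStmt": raw_tree}

select_query = transform_stmt({"tag": "SelectStmt"})
create_query = transform_stmt({"tag": "CreateStmt"})
```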

&lt;p&gt;The next stage, the rewriter, also reads that sticker (&lt;code&gt;src/backend/tcop/postgres.c&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;commandType&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CMD_UTILITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;querytree_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_make1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="cm"&gt;/* don't rewrite utilities */&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;querytree_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;QueryRewrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the query is utility, the rewriter does not touch it. The rule system, view expansion, and RLS policy application all live on the optimizable side and have no meaning for utility commands.&lt;/p&gt;

&lt;p&gt;The planner is the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;commandType&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CMD_UTILITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Utility commands require no planning. */&lt;/span&gt;
    &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;makeNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannedStmt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;commandType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CMD_UTILITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;utilityStmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;utilityStmt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_plan_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;   &lt;span class="cm"&gt;/* invoke the planner */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Utility commands never call into the planner. An empty &lt;code&gt;PlannedStmt&lt;/code&gt; wrapper is built, and the raw parse tree is dropped into it. A PlannedStmt with no plan tree.&lt;/p&gt;

&lt;p&gt;The executor stage is split too. Optimizable statements feed their plan tree into the executor proper, which produces rows. Utility statements get handed off to &lt;code&gt;ProcessUtility()&lt;/code&gt;, which dispatches to a per-statement handler. The dispatch logic and the individual handlers belong to later chapters (DDL in 1.6, transaction commands in chapter 4).&lt;/p&gt;

&lt;p&gt;When you lay out the four stages side by side, the asymmetry is sharp.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Optimizable (5 types)&lt;/th&gt;
&lt;th&gt;Utility (everything else)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Parse analysis&lt;/td&gt;
&lt;td&gt;Dedicated transform function&lt;/td&gt;
&lt;td&gt;Wrapped in a Query shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rewriter&lt;/td&gt;
&lt;td&gt;Rule system applied&lt;/td&gt;
&lt;td&gt;Skipped (passes through)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planner&lt;/td&gt;
&lt;td&gt;Plan tree generated&lt;/td&gt;
&lt;td&gt;Skipped (empty PlannedStmt holding the raw tree)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Executor&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ExecutorRun()&lt;/code&gt; (walks the plan tree)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ProcessUtility()&lt;/code&gt; (per-statement handler)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The whole sophistication of the planner exists for those five types; utility bypasses it entirely. This is not an efficiency choice. It is a structural asymmetry, because utility commands have no alternative paths to compare.&lt;/p&gt;
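&lt;p&gt;The same asymmetry can be compressed into a toy routing function (a Python sketch, not PostgreSQL code; the stage names match the real entry points, everything else is illustrative):&lt;/p&gt;

```python
# Toy trace of which stages a Query visits after parse analysis.
# Every branch keys off the single commandType sticker.
def run_stages(query):
    if query["commandType"] == "CMD_UTILITY":
        # Rewriter and planner are bypassed entirely; an empty
        # PlannedStmt wrapper goes straight to the utility dispatcher.
        return ["ProcessUtility"]
    return ["QueryRewrite", "pg_plan_query", "ExecutorRun"]

select_path = run_stages({"commandType": "CMD_SELECT"})
vacuum_path = run_stages({"commandType": "CMD_UTILITY"})
```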

&lt;h2&gt;
  
  
  Two species sharing one system
&lt;/h2&gt;

&lt;p&gt;Once you see how this asymmetry is wired, the two camps look almost like two species inside the same engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hooks live on different paths.&lt;/strong&gt; A hook in PostgreSQL is a function pointer exposed at a key point in the execution path so that external code (typically an extension) can plug in. An extension installs its own function address into the hook, and PostgreSQL calls it whenever the hook is non-null at the appropriate moment. Which camp the hook applies to depends on where it sits. &lt;code&gt;planner_hook&lt;/code&gt; is invoked just before the planner runs, so it only affects optimizable queries. Utility never enters the planner, so &lt;code&gt;planner_hook&lt;/code&gt; never fires for utility. On the other side, &lt;code&gt;ProcessUtility_hook&lt;/code&gt; is invoked just before &lt;code&gt;ProcessUtility()&lt;/code&gt; runs, so it only applies to utility commands. That is how an audit logging extension like pgaudit intercepts DDL and DCL. You need both hooks together to cover every SQL execution path. If you write extensions, you have to know up front which camp your hook is intercepting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistics are split too.&lt;/strong&gt; PostgreSQL accumulates usage patterns about which SQL ran how often and for how long, and DBAs use that data to find slow queries and pick tuning targets. The recording channels, however, differ between the two camps. &lt;code&gt;pg_stat_statements&lt;/code&gt; records SELECT/INSERT/UPDATE/DELETE/MERGE executions along with plan-level information; it can count utility commands as well (&lt;code&gt;pg_stat_statements.track_utility&lt;/code&gt;, on by default), but with none of the plan-level detail, so DDL auditing usually needs a separate channel like &lt;code&gt;log_statement = ddl&lt;/code&gt;. A monitoring tool that wants to draw "what is happening in this system" has to read both channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prepared statements mean different things in each camp.&lt;/strong&gt; The prepared statements from 1.1.2 cache plans for optimizable queries. SQL-level &lt;code&gt;PREPARE&lt;/code&gt; accepts only the optimizable types, but at the protocol level a utility command can be sent through a Parse message too. Since there is no plan, there is nothing to cache: the server just keeps the raw tree around and routes each execution through ProcessUtility again. Same name, different semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLAIN's reach is asymmetric.&lt;/strong&gt; &lt;code&gt;EXPLAIN SELECT ...&lt;/code&gt; draws a plan tree. &lt;code&gt;EXPLAIN ALTER TABLE ...&lt;/code&gt; does not work; there is no plan tree to draw. The exception is the special-case group from the switch above (&lt;code&gt;T_DeclareCursorStmt&lt;/code&gt;, &lt;code&gt;T_ExplainStmt&lt;/code&gt;, &lt;code&gt;T_CreateTableAsStmt&lt;/code&gt;, &lt;code&gt;T_CallStmt&lt;/code&gt;). They are classified as utility on the outside but contain an optimizable query that goes through the regular pipeline. That hybrid path is why &lt;code&gt;EXPLAIN ANALYZE INSERT INTO ... SELECT ...&lt;/code&gt; works. The outer wrapper is utility; the SELECT and INSERT inside are optimizable.&lt;/p&gt;
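&lt;p&gt;EXPLAIN's reach can be stated as a small predicate. A toy Python sketch (the statement kinds are illustrative strings, not real node tags): a plan tree exists only where an optimizable query does, and the hybrid wrappers carry one inside.&lt;/p&gt;

```python
# Toy predicate for EXPLAIN's reach: plain utility has no plan tree to
# draw; hybrid wrappers are explainable via the optimizable query inside.
OPTIMIZABLE = {"SELECT", "INSERT", "UPDATE", "DELETE", "MERGE"}
HYBRID_WRAPPERS = {"DECLARE CURSOR", "EXPLAIN", "CREATE TABLE AS", "CALL"}

def has_plan_tree(stmt):
    if stmt["kind"] in OPTIMIZABLE:
        return True
    if stmt["kind"] in HYBRID_WRAPPERS and "inner" in stmt:
        return has_plan_tree(stmt["inner"])
    return False  # plain utility: nothing for EXPLAIN to draw

plain_select = {"kind": "SELECT"}
alter_table = {"kind": "ALTER TABLE"}
ctas = {"kind": "CREATE TABLE AS", "inner": {"kind": "SELECT"}}
```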

&lt;blockquote&gt;
&lt;p&gt;I once tried to debug "why is this long ALTER TABLE so slow" by analyzing the plan. The plan had no answer. The bulk of the cost lived outside the plan tree: lock waits, catalog updates, full-table rewrites, and WAL volume. That was when I learned why utility needs its own dedicated path. Some costs are invisible to the plan-level cost model, and to see them you need a different stage of tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, planner-related tuning and monitoring tools have no meaning for utility.&lt;/strong&gt; &lt;code&gt;EXPLAIN&lt;/code&gt;, plan-level metrics in &lt;code&gt;pg_stat_statements&lt;/code&gt;, &lt;code&gt;auto_explain&lt;/code&gt;, &lt;code&gt;plan_cache_mode&lt;/code&gt;, all of these are optimizable-side tools. Looking at the plan to debug a slow DDL or DCL is pointless because there is no plan. By the same logic, an ORM or migration tool that wraps &lt;code&gt;ALTER TABLE&lt;/code&gt; in a prepared statement assuming "the plan will be cached anyway" has the wrong mental model. Utility commands have no plan; each call runs directly, taking and releasing catalog locks as it goes. Slow utility is almost always &lt;strong&gt;lock contention&lt;/strong&gt; or &lt;strong&gt;I/O cost&lt;/strong&gt;, not a plan choice issue, so the diagnostic path is different: set &lt;code&gt;log_min_duration_statement&lt;/code&gt; to capture timings, watch lock waits in &lt;code&gt;pg_stat_activity&lt;/code&gt;, and read its &lt;code&gt;wait_event_type&lt;/code&gt; and &lt;code&gt;wait_event&lt;/code&gt; columns to see where the command is stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, audit and security requirements split into two camps and attach to different mechanisms.&lt;/strong&gt; Tracking "who changed which schema and when" lives on the utility side. pgaudit captures ALTER/CREATE/DROP/GRANT into an audit log because it sits on &lt;code&gt;ProcessUtility_hook&lt;/code&gt;. Row-level access control like "this user can only see a subset of rows in this table" lives on the optimizable side. RLS (Row-Level Security) runs in the rewriter, automatically attaching extra WHERE conditions to SELECT/UPDATE/DELETE. The two requirements get bundled under the same security umbrella, but they hook into completely different stages of the pipeline. RLS cannot stop a schema change, and &lt;code&gt;ProcessUtility_hook&lt;/code&gt; cannot filter rows. When a compliance requirement comes in, the first task is to classify it as schema-level tracking versus row-level access control. Only then do the candidate tools fall into place, and you almost always need both mechanisms together to leave no gaps.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>sql</category>
    </item>
    <item>
      <title>1.1.2 Simple vs Extended</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Tue, 05 May 2026 04:35:38 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/112-simple-vs-extended-4l5l</link>
      <guid>https://forem.com/joonghyukshin/112-simple-vs-extended-4l5l</guid>
      <description>&lt;p&gt;The fork visible in 1.1.1 (simple query protocol on one side, extended on the other) is the subject of this section, one level deeper. 1.1.1 set the skeleton: simple is one message, extended is four. The job here is to show how that split translates into four distinct outcomes: plan reuse, parameter safety, pipelining, and error handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message sequence: the shape of one cycle is different
&lt;/h2&gt;

&lt;p&gt;Putting the message sequences side by side makes the difference visible at a glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                          Server
  │                               │
  │── 'Q' (SQL text) ────────────▶│
  │                               │ parse → analyze/rewrite → plan
  │                               │ → create portal → execute → drop portal
  │◀── RowDescription, DataRow*, CommandComplete, ReadyForQuery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extended&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                          Server
  │                               │
  │── 'P' (SQL template) ────────▶│ parse + analyze, store prepared statement
  │◀── ParseComplete                
  │                               │
  │── 'B' (parameter values) ────▶│ choose plan, create portal
  │◀── BindComplete                
  │                               │
  │── 'E' (execute) ─────────────▶│ run portal, send rows
  │◀── DataRow*, CommandComplete   
  │                               │
  │── 'S' (Sync) ────────────────▶│ close transaction
  │◀── ReadyForQuery               
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple finishes one cycle in a single message. Extended slices the cycle into four messages, and that slicing is what produces the four capabilities below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capability one: execution plans get reused
&lt;/h2&gt;

&lt;p&gt;The central concept that lets extended split the stages is the &lt;strong&gt;prepared statement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A prepared statement is a SQL template that has already been parsed and analyzed. The places where values would go are left blank with placeholders like &lt;code&gt;$1&lt;/code&gt;, &lt;code&gt;$2&lt;/code&gt;, and at execution time only the actual values get plugged into those slots. Take &lt;code&gt;INSERT INTO users (id, name) VALUES ($1, $2)&lt;/code&gt;. Once you turn that into a prepared statement, you can run it later by sending only the values: &lt;code&gt;(1, 'Alice')&lt;/code&gt;, &lt;code&gt;(2, 'Bob')&lt;/code&gt;. The full SQL text isn't reparsed each time. Give it a name and it becomes a &lt;strong&gt;named prepared statement&lt;/strong&gt; you can call back during the session. Send it without a name and it's an &lt;strong&gt;unnamed prepared statement&lt;/strong&gt;, automatically discarded the moment the next &lt;code&gt;'P'&lt;/code&gt; arrives.&lt;/p&gt;

&lt;p&gt;The four messages of the extended protocol are exactly that flow, sliced.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Message&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;'P'&lt;/code&gt; Parse&lt;/td&gt;
&lt;td&gt;Take the SQL template, finish parse and analysis, store as a prepared statement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;'B'&lt;/code&gt; Bind&lt;/td&gt;
&lt;td&gt;Bind actual parameter values to the prepared statement and prepare for execution (create a portal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;'E'&lt;/code&gt; Execute&lt;/td&gt;
&lt;td&gt;Run the prepared portal and send result rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;'S'&lt;/code&gt; Sync&lt;/td&gt;
&lt;td&gt;End of the cycle, send ReadyForQuery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What this means is that the same prepared statement can be re-executed many times with different parameters by repeating just &lt;code&gt;'B' + 'E'&lt;/code&gt;. Take inserting 1,000 users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Driver pseudocode: 1000 INSERTs via a prepared statement
&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO users (id, name) VALUES ($1, $2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;conn.prepare(...)&lt;/code&gt; corresponds to a single 'P' message. Parsing and analysis of the SQL text happen there, exactly once. Each of the 1000 &lt;code&gt;stmt.execute(...)&lt;/code&gt; calls corresponds to a 'B' + 'E' pair, doing bind and execute only. With simple query, the same INSERT text would be sent 1000 times and reparsed 1000 times.&lt;/p&gt;

&lt;p&gt;Internally, a prepared statement is held in a structure called &lt;code&gt;CachedPlanSource&lt;/code&gt;, which keeps the raw parse tree and the analysis result. When the same prepared statement gets another &lt;code&gt;'B' + 'E'&lt;/code&gt;, the backend starts from the saved &lt;code&gt;CachedPlanSource&lt;/code&gt;, only redecides the execution plan, and runs. Parsing and analysis are skipped.&lt;/p&gt;
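&lt;p&gt;The reuse pattern can be sketched as a toy session cache (plain Python, no real protocol; the &lt;code&gt;Session&lt;/code&gt; class and its dict entries are stand-ins for the backend's &lt;code&gt;CachedPlanSource&lt;/code&gt; machinery):&lt;/p&gt;

```python
# Toy session-level statement cache: 'P' parses once and stores the
# result; each 'B' + 'E' starts from the stored entry, so parse_count
# stays at 1 no matter how many times we execute.
class Session:
    def __init__(self):
        self.cache = {}        # name -> parsed template (CachedPlanSource stand-in)
        self.parse_count = 0

    def parse(self, name, sql):              # one 'P' message
        self.parse_count += 1                # the only place parsing happens
        self.cache[name] = {"sql": sql, "analyzed": True}

    def bind_execute(self, name, params):    # one 'B' + 'E' pair
        source = self.cache[name]            # reuse the saved source
        return (source["sql"], params)       # plan choice and execution elided

session = Session()
session.parse("ins", "INSERT INTO users (id, name) VALUES ($1, $2)")
for i in range(1000):
    session.bind_execute("ins", (i, f"user{i}"))
```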

&lt;h3&gt;
  
  
  Generic plan vs custom plan
&lt;/h3&gt;

&lt;p&gt;One step further. Plan reuse is real, but to be precise there are two kinds of plan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom plan&lt;/strong&gt;: recomputed every time using the bound parameter values. Helpful when the optimal path differs by value. Take &lt;code&gt;WHERE status = $1&lt;/code&gt;. Suppose &lt;code&gt;status='pending'&lt;/code&gt; matches 1% of rows and &lt;code&gt;status='completed'&lt;/code&gt; matches 99%. A distribution where the value-by-value ratios are this lopsided is what's usually called a &lt;strong&gt;skewed distribution&lt;/strong&gt;. Index scan is fast for 'pending'; sequential scan is fast for 'completed'. Custom plan looks at the value on every call and picks the path that fits it. (Plan construction is the entire subject of chapter 1.4; the kinds and behavior of scan nodes are covered in 1.5.2.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic plan&lt;/strong&gt;: planned once without knowing the parameters and cached. Every EXECUTE from then on reuses the cached plan, so from each call's point of view the cost of "planning this one" is zero. The trade-off is that the same path is forced for every parameter value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL decides between the two on every EXECUTE. The decision function is &lt;code&gt;choose_custom_plan()&lt;/code&gt;, and the default policy is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For the first 5 EXECUTEs, always use a custom plan. Collect actual cost measurements.&lt;/li&gt;
&lt;li&gt;From the 6th onward, compare the average custom plan cost against the generic plan cost. The custom average includes the cost of planning every time, while the generic side has that cost as zero (for the reason above), so the comparison is intentionally asymmetric.&lt;/li&gt;
&lt;li&gt;If generic is cheaper, switch to generic. Otherwise, stay on custom.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The decision can be forced via the &lt;code&gt;plan_cache_mode&lt;/code&gt; GUC. &lt;code&gt;auto&lt;/code&gt; (default) runs the policy above; &lt;code&gt;force_custom_plan&lt;/code&gt; always uses custom; &lt;code&gt;force_generic_plan&lt;/code&gt; always uses generic.&lt;/p&gt;
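&lt;p&gt;The default policy is compact enough to model directly. A toy Python version (the cost numbers are invented; the real &lt;code&gt;choose_custom_plan()&lt;/code&gt; works from planner cost estimates, not fixed constants):&lt;/p&gt;

```python
# Toy version of the default choose_custom_plan() policy: five custom
# plans first, then compare average custom cost (planning included)
# against the generic plan cost (whose planning cost counts as zero).
class PlanSource:
    def __init__(self, generic_cost, custom_cost, planning_cost):
        self.generic_cost = generic_cost
        self.custom_cost = custom_cost      # per-execution run cost
        self.planning_cost = planning_cost  # paid on every custom plan
        self.num_custom = 0
        self.total_custom = 0.0

    def choose(self):
        if self.num_custom < 5:             # "at least 5 (arbitrary)"
            return "custom"
        avg_custom = self.total_custom / self.num_custom
        return "generic" if self.generic_cost < avg_custom else "custom"

    def execute(self):
        kind = self.choose()
        if kind == "custom":
            self.num_custom += 1
            self.total_custom += self.custom_cost + self.planning_cost
        return kind

# Same run cost either way, so the repeated planning cost is what tips
# the comparison toward generic from the 6th call onward.
src = PlanSource(generic_cost=10.0, custom_cost=10.0, planning_cost=2.0)
history = [src.execute() for _ in range(8)]
```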

&lt;blockquote&gt;
&lt;p&gt;Working on another RDBMS engine, the first time I saw the "5 customs, then start comparing" rule I spent a while looking for the reason behind that 5. The conclusion: it's an arbitrary constant. The PG source comment literally says "until we have done at least 5 (arbitrary)". Other engines tend to be stricter with plan cache policy (e.g. lock the first plan in as the generic one) and let you override via a knob, while PG chose to decide dynamically on every call. The result is that a PG prepared statement isn't simply a "plan cache"; it's "automatic switching driven by statistics." This is one reason that even ORM code that uses prepared statements automatically can show much less plan caching than people expect: if a statement is called fewer than 5 times, it gets recomputed as a custom plan every time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Capability two: SQL injection is structurally blocked
&lt;/h2&gt;

&lt;p&gt;In simple query, putting a parameter into a query means embedding the value inside the SQL text, something like &lt;code&gt;f"SELECT * FROM users WHERE id = {user_input}"&lt;/code&gt;. If &lt;code&gt;user_input&lt;/code&gt; is untrusted, you've just opened the door to SQL injection.&lt;/p&gt;

&lt;p&gt;Extended separates the SQL template from the parameter values into different messages. &lt;code&gt;'P'&lt;/code&gt; carries only the template, like &lt;code&gt;SELECT * FROM users WHERE id = $1&lt;/code&gt;. &lt;code&gt;'B'&lt;/code&gt; carries the values that fill those slots, in binary or text form. Those values never go through the SQL parser. They're plugged into the already-parsed plan tree as data.&lt;/p&gt;

&lt;p&gt;When JDBC's &lt;code&gt;PreparedStatement&lt;/code&gt; takes &lt;code&gt;?&lt;/code&gt;, libpq's &lt;code&gt;PQexecParams&lt;/code&gt; takes &lt;code&gt;$1&lt;/code&gt;, or psycopg2 takes &lt;code&gt;%s&lt;/code&gt; placeholders, that's the path being used internally. The real mechanism for SQL injection prevention lives here. It isn't "remember to escape on the client"; it's a structure where the parser never gets a chance to interpret a user-supplied value as a SQL token.&lt;/p&gt;
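&lt;p&gt;The contrast fits in a few lines of plain Python (strings only, no database; this just shows what travels in 'Q' versus what travels in 'P' and 'B'):&lt;/p&gt;

```python
# Inlining a value into SQL text vs sending it separately. In the
# parameterized form the malicious value never reaches the SQL parser;
# it rides in the 'B' message as data.
malicious = "1 OR 1=1"

# Simple-protocol style: the value becomes part of the SQL text.
inlined = f"SELECT * FROM users WHERE id = {malicious}"

# Extended-protocol style: 'P' carries only the template,
# 'B' carries the values, which are bound as data, never parsed.
template = "SELECT * FROM users WHERE id = $1"
params = (malicious,)
```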

&lt;h2&gt;
  
  
  Capability three: messages can be batched (pipelining)
&lt;/h2&gt;

&lt;p&gt;Simple sends ReadyForQuery back the moment each &lt;code&gt;'Q'&lt;/code&gt; is processed. The client can't send the next query until that response arrives. One query equals one round-trip.&lt;/p&gt;

&lt;p&gt;Extended only sends ReadyForQuery when an &lt;code&gt;'S'&lt;/code&gt; (Sync) arrives. That means a sequence like &lt;code&gt;'P', 'B', 'E', 'B', 'E', 'B', 'E', 'S'&lt;/code&gt; can go out as a single batch. 100 INSERTs in one round-trip. In environments with significant network latency (cross-region cloud calls, for instance), the throughput difference is large.&lt;/p&gt;

&lt;p&gt;Built on top of this mechanism, PG 14 introduced an official pipeline mode in libpq (&lt;code&gt;PQpipelineSync&lt;/code&gt;, &lt;code&gt;PQenterPipelineMode&lt;/code&gt;, etc.). The wire-level capability existed before, but the libpq client API for it wasn't clean.&lt;/p&gt;
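&lt;p&gt;A back-of-envelope model makes the round-trip arithmetic concrete (server-side processing time is ignored here, so the numbers isolate network latency only):&lt;/p&gt;

```python
# Latency model: with simple query every statement pays one network
# round-trip; with pipelining one Sync covers the whole batch.
def simple_latency_ms(n_statements, rtt_ms):
    return n_statements * rtt_ms

def pipelined_latency_ms(n_statements, rtt_ms):
    # 'P' + n x ('B','E') + 'S' go out together: one round-trip total,
    # regardless of n_statements.
    return rtt_ms

# 100 INSERTs across a 50 ms cross-region link:
simple_ms = simple_latency_ms(100, 50)        # 100 round-trips
pipelined_ms = pipelined_latency_ms(100, 50)  # 1 round-trip
```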

&lt;h2&gt;
  
  
  Capability four: a partial error doesn't break the whole batch
&lt;/h2&gt;

&lt;p&gt;Simple, on error, immediately sends ErrorResponse plus ReadyForQuery. The cycle closes right away and the backend is ready for the next &lt;code&gt;'Q'&lt;/code&gt;. As noted above, simple is a 1-round-trip structure, so when the backend returns to normal mode there's nothing queued in the buffer behind the failed message. Closing out and waiting for the next &lt;code&gt;'Q'&lt;/code&gt; is enough.&lt;/p&gt;

&lt;p&gt;Where extended runs into real trouble is the batch case. As we saw in capability three, a typical client pushes &lt;code&gt;'B', 'E', 'B', 'E', ..., 'S'&lt;/code&gt; into the wire all at once. Suppose you send 100 INSERTs by pipelining: one 'P', followed by 100 pairs of 'B' + 'E' and a single 'S', all line up in the backend's buffer. While the backend is processing the 1st 'B', the 51st through 100th messages are already sitting in that buffer waiting their turn.&lt;/p&gt;

&lt;p&gt;Now suppose the 50th 'B' fails with something like a unique violation. If the backend behaved like simple (immediately sending ErrorResponse + ReadyForQuery and returning to normal mode), it would pull the 51st 'B' out of the buffer and start processing it next. But that 51st 'B' was sent by the client under the assumption that the first 50 had succeeded. The transaction is already aborted, so processing the 51st errors out too. Same for 52, 53, ..., 100. The client ends up tracking the original error plus 50 more downstream errors.&lt;/p&gt;

&lt;p&gt;PG avoids this chaos with a different strategy. The moment an error occurs, the backend enters a special state called &lt;strong&gt;ignore_till_sync&lt;/strong&gt;. While in that state, every message that arrives is dropped without being processed until the client explicitly sends an &lt;code&gt;'S'&lt;/code&gt; (Sync). No additional error responses go out. Once &lt;code&gt;'S'&lt;/code&gt; arrives, the backend finally sends ReadyForQuery and starts accepting messages normally again.&lt;/p&gt;

&lt;p&gt;The result is that the client receives exactly two responses: one ErrorResponse (the 50th failure) and one ReadyForQuery (in reply to &lt;code&gt;'S'&lt;/code&gt;). A clean boundary forms: "the batch failed somewhere, and everything past that point was discarded." ignore_till_sync is, in essence, the byproduct that makes pipelining safe.&lt;/p&gt;
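&lt;p&gt;The state machine is simple enough to sketch. A toy Python backend loop (message kinds and acknowledgements are simplified; each message here is a &lt;code&gt;(kind, ok)&lt;/code&gt; pair where &lt;code&gt;ok&lt;/code&gt; stands in for "processing succeeded"):&lt;/p&gt;

```python
# Toy backend loop for one pipelined batch: on the first error it flips
# into ignore_till_sync, drops everything until Sync arrives, then
# answers with a single ReadyForQuery.
ACK = {"P": "ParseComplete", "B": "BindComplete", "E": "CommandComplete"}

def process_batch(messages):
    responses = []
    ignore_till_sync = False
    for kind, ok in messages:
        if ignore_till_sync:
            if kind == "S":
                ignore_till_sync = False
                responses.append("ReadyForQuery")
            continue                      # dropped without processing
        if kind == "S":
            responses.append("ReadyForQuery")
        elif ok:
            responses.append(ACK[kind])
        else:                             # e.g. unique violation on a 'B'
            responses.append("ErrorResponse")
            ignore_till_sync = True
    return responses

# One 'P', 100 'B' + 'E' pairs (the 50th 'B' fails), one trailing Sync.
batch = [("P", True)]
for i in range(100):
    batch.append(("B", i != 49))
    batch.append(("E", True))
batch.append(("S", True))
responses = process_batch(batch)
```

Everything after the failed bind collapses into exactly two responses: the one ErrorResponse and the ReadyForQuery that answers the Sync.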

&lt;h2&gt;
  
  
  All four in one table
&lt;/h2&gt;

&lt;p&gt;Compressing the four capabilities into a single comparison.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Simple&lt;/th&gt;
&lt;th&gt;Extended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Message count&lt;/td&gt;
&lt;td&gt;1 (&lt;code&gt;'Q'&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;4+ (&lt;code&gt;'P'&lt;/code&gt;, &lt;code&gt;'B'&lt;/code&gt;, &lt;code&gt;'E'&lt;/code&gt;, &lt;code&gt;'S'&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan reuse&lt;/td&gt;
&lt;td&gt;None (parse + plan every time)&lt;/td&gt;
&lt;td&gt;Yes (&lt;code&gt;CachedPlanSource&lt;/code&gt; + auto generic/custom)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parameters&lt;/td&gt;
&lt;td&gt;Inline in SQL text&lt;/td&gt;
&lt;td&gt;Separated as data at bind time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL injection&lt;/td&gt;
&lt;td&gt;Client is responsible for escaping&lt;/td&gt;
&lt;td&gt;Prevented at the protocol level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round-trips&lt;/td&gt;
&lt;td&gt;1 per query&lt;/td&gt;
&lt;td&gt;Batched (1 per Sync)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Immediate ReadyForQuery&lt;/td&gt;
&lt;td&gt;ignore_till_sync (wait until Sync)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, don't assume that "the ORM uses prepared statements" means you're getting full plan caching.&lt;/strong&gt; PG plans a custom plan for the first 5 EXECUTEs of every prepared statement. If a statement is called only once or twice inside a short transaction, the plan caching benefit is essentially zero. The real benefit shows up in workloads that call the same prepared statement dozens to hundreds of times with different parameters. The ratio of &lt;code&gt;calls&lt;/code&gt; to &lt;code&gt;plans&lt;/code&gt; in &lt;code&gt;pg_stat_statements&lt;/code&gt;, plus a forced &lt;code&gt;plan_cache_mode&lt;/code&gt; setting, are the two diagnostic tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, the answer to "why isn't my prepared statement going generic?" is the wall of 5.&lt;/strong&gt; Forcing &lt;code&gt;plan_cache_mode = force_generic_plan&lt;/code&gt; brings planning cost to zero but locks every parameter value to the same path. With skewed data this can actually be slower. The opposite, &lt;code&gt;force_custom_plan&lt;/code&gt;, pays planning cost every time. The default &lt;code&gt;auto&lt;/code&gt;, which decides dynamically from the 6th call, is usually safest, but there are environments where explicitly choosing generic is worth the GUC tweak. For example, environments where prepared statements have very short lifetimes due to PgBouncer transaction pooling.&lt;/p&gt;
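&lt;p&gt;The shape of the &lt;code&gt;auto&lt;/code&gt; heuristic can be sketched in a few lines. This is a simplified model of the decision in &lt;code&gt;plancache.c&lt;/code&gt;, not the real code: the actual version also folds a planning-cost surcharge into the custom side and handles several edge cases.&lt;/p&gt;

```c
#include <stdbool.h>

/* Simplified sketch of the "auto" plan choice: the first 5 executions
 * always get a custom plan; from the 6th on, the generic plan wins
 * unless it is costlier than the average custom plan seen so far. */

typedef struct {
    int    num_custom_plans;
    double total_custom_cost;
} PlanStats;

static bool choose_custom_plan(const PlanStats *s, double generic_cost)
{
    if (s->num_custom_plans < 5)
        return true;                       /* the "wall of 5" */

    double avg_custom = s->total_custom_cost / s->num_custom_plans;
    return generic_cost > avg_custom;      /* generic too expensive: stay custom */
}

static void record_custom_plan(PlanStats *s, double cost)
{
    s->num_custom_plans++;
    s->total_custom_cost += cost;
}
```

&lt;p&gt;The skewed-data trap falls straight out of the comparison: if the generic plan's estimated cost is far above the average custom cost, &lt;code&gt;auto&lt;/code&gt; keeps paying for custom plans forever, which is exactly when forcing a mode is worth considering.&lt;/p&gt;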

&lt;p&gt;&lt;strong&gt;Third, in environments with significant network latency, pipelining is the real lever.&lt;/strong&gt; Cross-region RDS calls in the cloud, or even same-region setups where there's millisecond-level latency between application and database, will turn 100 simple INSERTs into 100 round-trips. libpq pipeline mode, or JDBC &lt;code&gt;addBatch()&lt;/code&gt; + &lt;code&gt;executeBatch()&lt;/code&gt;, can collapse that to a single round-trip. Just keep in mind that the error-handling complexity goes up (you need to understand what ignore_till_sync means), so it pays to design batch-level transaction boundaries and a retry policy at the same time as the pipelining itself.&lt;/p&gt;
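&lt;p&gt;The arithmetic behind that claim fits in one function. A back-of-envelope model (the 2 ms RTT and 100 µs of per-statement server work used below are assumed figures for illustration, not measurements):&lt;/p&gt;

```c
#include <stdbool.h>

/* Back-of-envelope latency model: with the simple protocol every
 * statement pays a full network round-trip; in a pipeline all the
 * statements share one round-trip (flushed by a single Sync), while
 * the per-statement server-side work accrues either way. */
static long total_us(long n_statements, long rtt_us, long server_us,
                     bool pipelined)
{
    long network_us = pipelined ? rtt_us : rtt_us * n_statements;
    return network_us + server_us * n_statements;
}
```

&lt;p&gt;At 2 ms of RTT, 100 one-at-a-time INSERTs cost about 210 ms while the pipelined batch costs about 12 ms: the network term, which dominated, collapses to a constant.&lt;/p&gt;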

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>protocol</category>
    </item>
    <item>
      <title>1.1.1 Life of a Query</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Mon, 04 May 2026 10:13:22 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/111-life-of-a-query-1phg</link>
      <guid>https://forem.com/joonghyukshin/111-life-of-a-query-1phg</guid>
      <description>&lt;p&gt;This section is the map for the rest of the book. The five stages introduced in the 1.1 chapter overview (parse, analyze/rewrite, plan, portal, execute) are traced here through the actual code: which functions implement each stage, and in what order they get called. The mechanics of each of the five stages are unpacked in later chapters. Here, only the skeleton matters: how a backend starts up, how it receives messages, and where the first fork in the road appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  One backend process owns one query
&lt;/h2&gt;

&lt;p&gt;Every time a client connects, PostgreSQL forks a &lt;strong&gt;backend process&lt;/strong&gt; for it (the parent is &lt;code&gt;postmaster&lt;/code&gt;). That process stays alive until the client disconnects, and it handles every query that client sends, by itself. Unlike the thread-pool model common in other RDBMSs, PG uses one OS process per connection. The reasons behind that decision are taken up in 6.1.1.&lt;/p&gt;

&lt;p&gt;The actual entry point of that backend is a function called &lt;code&gt;PostgresMain&lt;/code&gt;. The name is grand; what it does is unexpectedly simple. Two things, then off it goes.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;it installs signal handlers&lt;/strong&gt;. Signals are asynchronous notifications the OS delivers to a process (for example, &lt;code&gt;SIGTERM&lt;/code&gt; is a request to shut down, &lt;code&gt;SIGUSR1&lt;/code&gt; is for PG-internal communication). A backend has to react to signals from &lt;code&gt;postmaster&lt;/code&gt; and from other backends, so each signal is wired to a handler ahead of time. Signals and IPC in general are covered in 6.3.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;it initializes the transaction system&lt;/strong&gt;. Every SQL statement in PG, even without an explicit &lt;code&gt;BEGIN&lt;/code&gt;, runs inside some transaction. The transaction system is the core PG machinery that tracks &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt; boundaries, MVCC visibility, XID assignment, and so on. Transactions and MVCC are the subject of all of chapter 3. For now, it's enough to know that this machinery is set up before the backend ever sees a SQL statement.&lt;/p&gt;

&lt;p&gt;Once those two preparations are done, the real work of the backend begins. An infinite loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(;;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;ReadyForQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;whereToSendOutput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;firstchar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReadCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;input_message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstchar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;PqMsg_Query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="c1"&gt;// 'Q', simple query&lt;/span&gt;
            &lt;span class="n"&gt;exec_simple_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;PqMsg_Parse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="c1"&gt;// 'P', extended: parse&lt;/span&gt;
            &lt;span class="n"&gt;exec_parse_message&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;PqMsg_Bind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="c1"&gt;// 'B', extended: bind&lt;/span&gt;
            &lt;span class="n"&gt;exec_bind_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;input_message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;PqMsg_Execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="c1"&gt;// 'E', extended: execute&lt;/span&gt;
            &lt;span class="n"&gt;exec_execute_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;portal_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;PqMsg_Sync&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="c1"&gt;// 'S', end of an extended cycle&lt;/span&gt;
            &lt;span class="n"&gt;finish_xact_command&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;send_ready_for_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop is the entire life of a backend.&lt;/p&gt;

&lt;p&gt;"Announce that I'm ready, read one message, dispatch on its type." Repeat forever. When the client closes the connection, an &lt;code&gt;'X'&lt;/code&gt; (Terminate) message arrives, the loop exits, and the process dies.&lt;/p&gt;

&lt;p&gt;The first fork in the road is visible right here. There's the &lt;code&gt;'Q'&lt;/code&gt; path and the &lt;code&gt;'P' / 'B' / 'E'&lt;/code&gt; path. That split is the difference between the simple query protocol and the extended query protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple vs extended
&lt;/h2&gt;

&lt;p&gt;Simple is the case where a single message contains the SQL text in full. Type &lt;code&gt;SELECT 1;&lt;/code&gt; into &lt;code&gt;psql&lt;/code&gt; and hit enter, and that's what flies across the wire. The backend receives that one message and runs the full five-stage cycle (parse, analyze and rewrite, plan, portal, execute) before returning the result.&lt;/p&gt;

&lt;p&gt;Extended does the same job but splits it into four messages (&lt;code&gt;'P'&lt;/code&gt;, &lt;code&gt;'B'&lt;/code&gt;, &lt;code&gt;'E'&lt;/code&gt;, &lt;code&gt;'S'&lt;/code&gt;). Splitting the stages opens up plan reuse, parameter safety, and pipelining. The semantic differences between the two protocols and how they play out in practice are unpacked in section 1.1.2.&lt;/p&gt;
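&lt;p&gt;To make the four messages concrete, here is what one of them looks like at the byte level. The sketch below hand-assembles a &lt;code&gt;'P'&lt;/code&gt; (Parse) frame following the documented layout: a type byte, an int32 length that counts itself but not the type byte, two NUL-terminated strings (statement name and query), an int16 parameter count, and one int32 type OID per parameter, all integers big-endian.&lt;/p&gt;

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Write big-endian integers into a byte buffer. */
static size_t put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
    return 4;
}

static size_t put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v;
    return 2;
}

/* Assemble a Parse ('P') message into buf; returns total bytes written. */
static size_t build_parse_message(uint8_t *buf, const char *stmt,
                                  const char *query,
                                  const uint32_t *param_oids,
                                  uint16_t nparams)
{
    size_t off = 0;
    buf[off++] = 'P';                            /* type byte */
    off += 4;                                    /* length field, patched below */
    memcpy(buf + off, stmt, strlen(stmt) + 1);   off += strlen(stmt) + 1;
    memcpy(buf + off, query, strlen(query) + 1); off += strlen(query) + 1;
    off += put_be16(buf + off, nparams);
    for (uint16_t i = 0; i < nparams; i++)
        off += put_be32(buf + off, param_oids[i]);
    put_be32(buf + 1, (uint32_t)(off - 1));      /* length excludes the type byte */
    return off;
}
```

&lt;p&gt;Note what is &lt;em&gt;not&lt;/em&gt; in this frame: parameter values. Those travel later, as data, inside the &lt;code&gt;'B'&lt;/code&gt; message, which is where the parameter-safety property of the extended protocol comes from.&lt;/p&gt;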

&lt;h2&gt;
  
  
  Optimizable vs utility
&lt;/h2&gt;

&lt;p&gt;Everything described so far assumes &lt;strong&gt;optimizable statements&lt;/strong&gt;: &lt;code&gt;SELECT/INSERT/UPDATE/DELETE&lt;/code&gt;. These have paths to optimize. The planner decides between sequential and index scan, hash join and nested loop, one join order or another.&lt;/p&gt;

&lt;p&gt;But statements like &lt;code&gt;CREATE TABLE&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, &lt;code&gt;SET&lt;/code&gt;, and &lt;code&gt;BEGIN&lt;/code&gt; (the &lt;strong&gt;utility statements&lt;/strong&gt;) are different. There's nothing for a cost model to optimize. They're DDL or system commands, with no path to choose. In that case the planner produces only an empty shell of a plan and hands the actual work to a utility-statement handler. The executor never gets called on this path.&lt;/p&gt;
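&lt;p&gt;The fork can be modeled as a two-way dispatch. This is a toy classification, not the real &lt;code&gt;pg_plan_queries&lt;/code&gt; code, which works on parse-tree node types rather than an enum like this:&lt;/p&gt;

```c
#include <stdbool.h>

/* Toy model of the optimizable/utility fork: only the four DML statement
 * kinds go through the cost-based planner; everything else gets a shell
 * plan and is handed to the utility-statement handler, bypassing the
 * executor entirely. */
typedef enum {
    STMT_SELECT, STMT_INSERT, STMT_UPDATE, STMT_DELETE,   /* optimizable */
    STMT_CREATE_TABLE, STMT_VACUUM, STMT_SET, STMT_BEGIN  /* utility */
} StmtKind;

static bool goes_through_planner(StmtKind kind)
{
    switch (kind) {
        case STMT_SELECT:
        case STMT_INSERT:
        case STMT_UPDATE:
        case STMT_DELETE:
            return true;   /* cost model picks scans, joins, join order */
        default:
            return false;  /* shell plan; the utility handler does the work */
    }
}
```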

&lt;p&gt;The detailed branching is the subject of 1.1.3. The takeaway here is just one thing: not every query in PG goes through the planner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The big picture
&lt;/h2&gt;

&lt;p&gt;We can now compress the journey of a SQL line into a single diagram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
   │
   │  'Q' (or 'P' + 'B' + 'E')
   ▼
PostgresMain main loop
   │
   ▼
exec_simple_query
   │
   ├─ pg_parse_query           → raw parse tree     (1.2.1, 1.2.3)
   │
   ├─ pg_analyze_and_rewrite   → list of Query nodes (1.2.2, 1.3)
   │
   ├─ pg_plan_queries          → execution plan      (1.4 chapter)
   │     └─ utility produces an empty shell          (1.1.3)
   │
   ├─ PortalStart + PortalRun  → tuple pulling       (1.5)
   │
   └─ PortalDrop + finish_xact_command
   │
   ▼
ReadyForQuery → back to the top of the loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each box in this diagram corresponds to a chapter in the book. 1.2 is parser and analyzer, 1.3 is rewriter, 1.4 is planner, 1.5 is the executor. All of part 1 is essentially one zoomed-in view of this diagram.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Working on another RDBMS engine, I once found this aspect of PG surprising. PG accepts a multi-statement query like &lt;code&gt;SELECT 1; SELECT 2;&lt;/code&gt; as a single simple-query message. What's even more surprising is the transaction handling. Without an explicit &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt;, all those statements get bundled into a single implicit transaction block, and if even one of them fails, the whole batch rolls back.&lt;/p&gt;

&lt;p&gt;At first I assumed this was just standard behavior. Comparing the client protocols of other major databases made it clear this is a PG-specific decision. MySQL has &lt;code&gt;CLIENT_MULTI_STATEMENTS&lt;/code&gt; off by default, so multi-statement queries are simply rejected (you have to flip the flag explicitly because of SQL injection risk). Even with the flag on, statements are processed sequentially, and because autocommit is the default, each one commits as its own transaction. Oracle accepts only one statement per OCI call, so to bundle multiple statements you have to wrap them in an anonymous PL/SQL block (&lt;code&gt;BEGIN ... END;&lt;/code&gt;). SQL Server accepts multiple statements in a T-SQL batch, but atomic handling still requires an explicit &lt;code&gt;BEGIN TRANSACTION&lt;/code&gt;. None of the three does what PG does: bundle automatically as soon as the message arrives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;This five-stage skeleton turns out to be the foundation for two diagnostic tools you'll use in operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, you can see exactly where EXPLAIN's output comes from.&lt;/strong&gt; &lt;code&gt;EXPLAIN&lt;/code&gt; runs only as far as stage 4 (plan); it skips stage 5 (execute). &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; actually runs through stage 5 and measures it. That's why &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; produces real load and shouldn't be casually run in production: an &lt;code&gt;EXPLAIN ANALYZE UPDATE ...&lt;/code&gt; actually updates rows. The familiar &lt;code&gt;BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK&lt;/code&gt; idiom exists for exactly this reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, the fact that one backend is one process running one query at a time explains why connection pooling matters so much.&lt;/strong&gt; A backend's main loop is essentially single-threaded. While one client runs a long query, that backend can't do anything else. Connection counts therefore drive memory and scheduling costs linearly, and a pooler like PgBouncer becomes effectively mandatory. The answer to "why are PostgreSQL connections so expensive?" lives inside this one-line main loop.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>protocol</category>
    </item>
    <item>
      <title>1.1 Where Does a Query Go?</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Mon, 04 May 2026 09:57:44 +0000</pubDate>
      <link>https://forem.com/joonghyukshin/11-where-does-a-query-go-1bka</link>
      <guid>https://forem.com/joonghyukshin/11-where-does-a-query-go-1bka</guid>
      <description>&lt;p&gt;Suppose a client sends &lt;code&gt;SELECT * FROM users WHERE id = 1&lt;/code&gt;. The path that single line travels before coming back as a result row is longer than you might expect. Inside the PostgreSQL backend, that SQL goes through a five-stage pipeline. The five stages are exactly the five chapters of Chapter 1.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.1 Where Does a Query Go?&lt;/strong&gt;: the backend decides which processing path the client message should follow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.2 Parser and Analyzer: How SQL Gets Its Meaning&lt;/strong&gt;: the SQL text is parsed, and the catalog is consulted to give it meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.3 Rewriter: How a Query is Rewritten&lt;/strong&gt;: the RULE system expands views, injects RLS policies, and otherwise transforms the query tree.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4 Planner: Which Path to Take&lt;/strong&gt;: a cost model explores possible execution paths and picks the best one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5 Executor: How Results Come Back&lt;/strong&gt;: the chosen plan is walked, pulling tuples up and sending them to the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chapter 1.1, the one you're reading, splits into three sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.1.1 Life of a Query&lt;/strong&gt;: compresses all five stages into a single diagram. The map for the rest of the book.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.1.2 Simple vs Extended&lt;/strong&gt;: looks at the semantic difference between the two protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.1.3 Optimizable vs Utility&lt;/strong&gt;: shows how &lt;code&gt;SELECT/INSERT/...&lt;/code&gt; and &lt;code&gt;CREATE/VACUUM/...&lt;/code&gt; take different paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this chapter, it should be clear how the backend's main loop receives a client message and dispatches it to the right function.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
