<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sergey Solovev</title>
    <description>The latest articles on Forem by Sergey Solovev (@ashenblade).</description>
    <link>https://forem.com/ashenblade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1101367%2F23b691b3-bbd6-4fe2-82e4-d4f65a724abf.jpeg</url>
      <title>Forem: Sergey Solovev</title>
      <link>https://forem.com/ashenblade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ashenblade"/>
    <language>en</language>
    <item>
      <title>Create and debug PostgreSQL extension using VS Code</title>
      <dc:creator>Sergey Solovev</dc:creator>
      <pubDate>Sat, 18 Oct 2025 15:27:17 +0000</pubDate>
      <link>https://forem.com/ashenblade/create-and-debug-postgresql-extension-using-vs-code-2kg</link>
      <guid>https://forem.com/ashenblade/create-and-debug-postgresql-extension-using-vs-code-2kg</guid>
      <description>&lt;p&gt;In this tutorial we will create a PostgreSQL extension called &lt;code&gt;ban_sus_query&lt;/code&gt;. It checks that DML queries contain predicates and throws an error otherwise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To avoid confusion, from here on I will use the term &lt;code&gt;contrib&lt;/code&gt; for a PostgreSQL extension and &lt;code&gt;extension&lt;/code&gt; for the PostgreSQL Hacker Helper VS Code extension.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This tutorial is written not only for newcomers to PostgreSQL development, but also as a walkthrough of the VS Code extension &lt;a href="https://github.com/ashenBlade/postgres-dev-helper" rel="noopener noreferrer"&gt;PostgreSQL Hacker Helper&lt;/a&gt;. You can find its documentation &lt;a href="https://ashenblade.github.io/postgres-dev-helper" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;First things first, you must set up a development environment. This involves 2 things.&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL setup
&lt;/h3&gt;

&lt;p&gt;To create an extension you must first build PostgreSQL from source. Even if you do not plan to run a single query and just trust me, you will not be able to build the extension otherwise, because its compilation depends on several files that are generated only when PostgreSQL itself is compiled.&lt;/p&gt;

&lt;p&gt;Do not worry about it. The documentation site has 2 related pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ashenblade.github.io/postgres-dev-helper/postgresql_setup" rel="noopener noreferrer"&gt;PostgreSQL setup&lt;/a&gt; - introduction to PostgreSQL build system&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ashenblade.github.io/postgres-dev-helper/dev_scripts" rel="noopener noreferrer"&gt;Development scripts&lt;/a&gt; - various development scripts for PostgreSQL building, developing and debugging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both pages contain not only examples, but also tips.&lt;/p&gt;

&lt;h3&gt;
  
  
  VS Code setup
&lt;/h3&gt;

&lt;p&gt;VS Code is the IDE we are using, so we should set it up properly for convenient development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ashenblade.github.io/postgres-dev-helper/vscode_setup" rel="noopener noreferrer"&gt;VS Code setup page&lt;/a&gt; contains lists of necessary extensions and configuration files.&lt;/p&gt;

&lt;p&gt;It also has a very handy &lt;code&gt;tasks.json&lt;/code&gt; file with lots of predefined tasks for various PostgreSQL development use cases. They use the scripts from the previous section, so you can use them to turn VS Code into a PostgreSQL IDE. For example, you can get PostgreSQL up and running from scratch using just 2 tasks: &lt;code&gt;Bootstrap&lt;/code&gt; and &lt;code&gt;Run DB&lt;/code&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating initial files
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has infrastructure for building and installing contribs. In short, contribs follow a template: most parts are the same for all of them.&lt;/p&gt;

&lt;p&gt;So, to create the contrib faster we will use the command &lt;code&gt;PGHH: Bootstrap extension&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45oxmuoam6maj66ce40f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45oxmuoam6maj66ce40f.png" alt="Bootstrap extension command" width="335" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will prompt us to bootstrap some files - choose only C sources.&lt;/p&gt;

&lt;p&gt;After that we will have our contrib files created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyrlk7luhpe6zftfqy1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyrlk7luhpe6zftfqy1f.png" alt="README.md with directory contents" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial code
&lt;/h2&gt;

&lt;p&gt;The query execution pipeline has 3 stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse/Semantic analysis - query string parsing and resolving tables&lt;/li&gt;
&lt;li&gt;Plan - query optimization and creating execution plan&lt;/li&gt;
&lt;li&gt;Execution - actual query execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our logic will be added to the second stage, because we must check the real execution plan, not the Query.&lt;br&gt;
This is because a query can change in many ways as it goes through multiple transformations: predicates can be removed or added, so we may end up with a completely different query than the one in the original query string.&lt;/p&gt;

&lt;p&gt;To implement that we will install a hook on the planner: &lt;code&gt;planner_hook&lt;/code&gt;. Inside it we will invoke the actual &lt;code&gt;planner&lt;/code&gt; and check its output for the existence of predicates.&lt;/p&gt;

&lt;p&gt;The starter code is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"postgres.h"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"fmgr.h"&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;"optimizer/planner.h"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef PG_MODULE_MAGIC
&lt;/span&gt;&lt;span class="n"&gt;PG_MODULE_MAGIC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;planner_hook_type&lt;/span&gt; &lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;_PG_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;_PG_fini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Plan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;PlannedStmt&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;ban_sus_query_planner_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;ParamListInfo&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;PlannedStmt&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;standard_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;planTree&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;ereport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errmsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DML query does not contain predicates"&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;_PG_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;prev_planner_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner_hook&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;planner_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ban_sus_query_planner_hook&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;_PG_fini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;planner_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are ready to add the "business logic", but first let's understand what such suspicious queries look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examine queries
&lt;/h2&gt;

&lt;p&gt;A suspicious query is a DELETE/UPDATE query that does not contain predicates.&lt;/p&gt;

&lt;p&gt;One benefit of checking already planned statements is that all predicates have already been optimized, in the sense that boolean simplification rules have been applied.&lt;/p&gt;

&lt;p&gt;A query plan is a tree of &lt;code&gt;Plan&lt;/code&gt; nodes. Each &lt;code&gt;Plan&lt;/code&gt; contains &lt;code&gt;lefttree&lt;/code&gt;/&lt;code&gt;righttree&lt;/code&gt; (its left and right children) and &lt;code&gt;qual&lt;/code&gt; (the list of predicates to apply at this node). We do not need to check every node, only the UPDATE/DELETE ones, which are represented by &lt;code&gt;ModifyTable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thus our goal is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;traverse the plan tree, find &lt;code&gt;ModifyTable&lt;/code&gt; and check that its &lt;code&gt;qual&lt;/code&gt; is not empty&lt;/p&gt;
&lt;/blockquote&gt;
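
&lt;p&gt;A minimal sketch of the traversal part of this goal, assuming every child is reachable through &lt;code&gt;lefttree&lt;/code&gt;/&lt;code&gt;righttree&lt;/code&gt;. The helper name &lt;code&gt;find_modify_table&lt;/code&gt; is hypothetical and is not part of the contrib:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include "postgres.h"

#include "nodes/plannodes.h"

/* Hypothetical helper: return the first ModifyTable node found, or NULL */
static ModifyTable *
find_modify_table(Plan *plan)
{
    ModifyTable *found;

    /* Recursion end */
    if (plan == NULL)
        return NULL;

    /* IsA() compares the node tag stored at the start of every node */
    if (IsA(plan, ModifyTable))
        return (ModifyTable *) plan;

    /* Depth-first: left subtree first, then right subtree */
    found = find_modify_table(plan-&amp;gt;lefttree);
    if (found != NULL)
        return found;

    return find_modify_table(plan-&amp;gt;righttree);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that some nodes (for example &lt;code&gt;Append&lt;/code&gt;) keep their children in separate lists, so a sketch like this would not descend into them.&lt;/p&gt;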

&lt;p&gt;But first, let's run sample queries to see what their plans look like inside and which predicates they have.&lt;/p&gt;

&lt;p&gt;For tests we will use this setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Test queries&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To do this we will use our contrib: install it using &lt;code&gt;make install&lt;/code&gt;, add it to the preloaded libraries with &lt;code&gt;shared_preload_libraries='ban_sus_query'&lt;/code&gt; and put a breakpoint on the &lt;code&gt;return&lt;/code&gt; statement in the &lt;code&gt;ban_sus_query_planner_hook&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;PostgreSQL has a multiprocess architecture, so you must specify the PID of the process (backend) to which you want to attach. You can do this in 2 ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request the PID directly using the &lt;code&gt;SELECT pg_backend_pid();&lt;/code&gt; query. It returns the PID of the backend serving your session.&lt;/li&gt;
&lt;li&gt;Search for the required backend by typing &lt;code&gt;postgres&lt;/code&gt; in the quick input (when using &lt;code&gt;${command:pickProcess}&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Either way, you must set up a &lt;code&gt;launch.json&lt;/code&gt; file. In the documentation for the extension I have included a sample configuration file (&lt;a href="https://ashenblade.github.io/postgres-dev-helper/vscode_setup/#launchjson" rel="noopener noreferrer"&gt;link&lt;/a&gt;) that you can use in most projects.&lt;/p&gt;

&lt;p&gt;After the debugger has attached and you run the first DELETE query, the one without a predicate, you will see the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;PlannedStmt&lt;/code&gt; contains top-level &lt;code&gt;ModifyTable&lt;/code&gt; with empty &lt;code&gt;qual&lt;/code&gt; list&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwrog5vopmkjpj2k2cwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwrog5vopmkjpj2k2cwc.png" alt="qual is empty in ModifyTable node" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inner &lt;code&gt;SeqScan&lt;/code&gt; also contains empty &lt;code&gt;qual&lt;/code&gt; list&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1319ni71fpfyfcv6i0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1319ni71fpfyfcv6i0a.png" alt="qual is empty in SeqScan node" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now run the &lt;code&gt;DELETE&lt;/code&gt; query with a predicate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The top-level &lt;code&gt;ModifyTable&lt;/code&gt; in &lt;code&gt;PlannedStmt&lt;/code&gt; still contains an empty &lt;code&gt;qual&lt;/code&gt; list&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sbji16g7aeo692t5r0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sbji16g7aeo692t5r0u.png" alt="qual is still empty in ModifyTable node" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inner &lt;code&gt;SeqScan&lt;/code&gt; now contains &lt;code&gt;qual&lt;/code&gt; with single element - equality predicate&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx5j7u0p5nbpam651gsk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx5j7u0p5nbpam651gsk.png" alt="SeqScan has single equality predicate" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is no surprise: our &lt;code&gt;ModifyTable&lt;/code&gt; does not apply any filtering, it just takes tuples from its children (note that, by convention, single-child nodes store the child in &lt;code&gt;lefttree&lt;/code&gt;), so its &lt;code&gt;qual&lt;/code&gt; is empty. The filtering is applied in &lt;code&gt;SeqScan&lt;/code&gt;, and that is what we must check.&lt;/p&gt;
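
&lt;p&gt;The check that this observation suggests could look roughly like the sketch below. It assumes a PostgreSQL version in which &lt;code&gt;ModifyTable&lt;/code&gt; keeps its single input plan in &lt;code&gt;lefttree&lt;/code&gt;, as in the plans shown above, and it looks only at the direct child. The name &lt;code&gt;modify_child_has_qual&lt;/code&gt; is hypothetical; &lt;code&gt;outerPlan&lt;/code&gt; is the PostgreSQL macro that reads &lt;code&gt;lefttree&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include "postgres.h"

#include "nodes/plannodes.h"

/* Hypothetical helper: does the direct child of ModifyTable filter rows? */
static bool
modify_child_has_qual(ModifyTable *modify)
{
    /* outerPlan(node) expands to ((Plan *) node)-&amp;gt;lefttree */
    Plan *child = outerPlan(modify);

    /* NIL is the empty list, so a non-NIL qual holds at least one predicate */
    return child != NULL &amp;amp;&amp;amp; child-&amp;gt;qual != NIL;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
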

&lt;blockquote&gt;
&lt;p&gt;As you may notice, the extension shows all Node variables with their actual types instead of a generic &lt;code&gt;Plan&lt;/code&gt; entry.&lt;br&gt;
The extension is also able to show you the elements of container types (&lt;code&gt;List *&lt;/code&gt; in this example).&lt;br&gt;
On top of that, it renders &lt;code&gt;Expr&lt;/code&gt; nodes (expressions) the way they appeared in the query, so you do not have to manually check each field, trying to figure out what expression it is.&lt;/p&gt;

&lt;p&gt;Without the extension you would have to evaluate 2 expressions: first get the &lt;code&gt;NodeTag&lt;/code&gt;, then cast the variable to the type that the &lt;code&gt;NodeTag&lt;/code&gt; denotes.&lt;br&gt;
In this example, to show &lt;code&gt;stmt-&amp;gt;planTree&lt;/code&gt; with the extension all you need to do is expand the tree node in the variables explorer. Manually (without the extension) you need to evaluate 2 expressions, e.g. in &lt;code&gt;watch&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;((Node *)planTree)-&amp;gt;type&lt;/code&gt; - returns &lt;code&gt;T_ModifyTable&lt;/code&gt;, the tag of the &lt;code&gt;ModifyTable&lt;/code&gt; node, and then&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(ModifyTable *)planTree&lt;/code&gt; - show variable with real type.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96ibnu3b48eac5805fhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96ibnu3b48eac5805fhm.png" alt="Evaluate variable in watch" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each such manipulation takes roughly 5 seconds, but this time accumulates and can add up to as much as 1 hour a day - just to show a variable's contents!&lt;/p&gt;

&lt;p&gt;And there is no such workaround for &lt;code&gt;Expr&lt;/code&gt; variables: the debugger alone will not show you their textual representation.&lt;br&gt;
For that you have to dump the variable to the log using the &lt;code&gt;pprint&lt;/code&gt; function, which is not very convenient when you are developing in an IDE.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now we are ready to write some code.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;is_sus_query&lt;/code&gt; implementation
&lt;/h2&gt;

&lt;p&gt;To recap, our goal was to &lt;em&gt;traverse the plan tree, find &lt;code&gt;ModifyTable&lt;/code&gt; and check that its &lt;code&gt;qual&lt;/code&gt; is not empty&lt;/em&gt;, but now we can refine it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Search for &lt;code&gt;ModifyTable&lt;/code&gt; in the &lt;code&gt;Plan&lt;/code&gt; tree and check that its children have a non-empty &lt;code&gt;qual&lt;/code&gt; list&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since tree traversal is naturally recursive, we will use 2 recursive functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_sus_query&lt;/code&gt; - the main function, which traverses the plan tree to find a &lt;code&gt;ModifyTable&lt;/code&gt; node, and when it finds one invokes...&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;contains_predicates&lt;/code&gt; - a function that checks whether the given &lt;code&gt;Plan&lt;/code&gt; node contains any predicate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start with &lt;code&gt;is_sus_query&lt;/code&gt;. All we have to do here is check whether the &lt;code&gt;Plan&lt;/code&gt; is a &lt;code&gt;ModifyTable&lt;/code&gt; and, if so, check that its children contain predicates.&lt;/p&gt;

&lt;p&gt;Node type checking is a frequent operation, so the extension ships with some snippets. One of them is &lt;code&gt;isaif&lt;/code&gt;, which expands to an &lt;code&gt;if(IsA())&lt;/code&gt; check:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcd6escd0odrgw958ry5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcd6escd0odrgw958ry5.png" alt=" raw `isaif` endraw  expansion" width="498" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have determined that it is a DML operation, we check that it is a DELETE or UPDATE, because &lt;code&gt;ModifyTable&lt;/code&gt; is used for other operations too, e.g. &lt;code&gt;INSERT&lt;/code&gt;. This is not hard: just check the &lt;code&gt;operation&lt;/code&gt; member.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Plan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="n"&gt;ModifyTable&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;modify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ModifyTable&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modify&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CMD_UPDATE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CMD_DELETE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="cm"&gt;/* Check predicates */&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;default:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now check that these operations contain predicates using the &lt;code&gt;contains_predicates&lt;/code&gt; function (defined later).&lt;/p&gt;

&lt;p&gt;Also, do not forget to handle recursion: call &lt;code&gt;is_sus_query&lt;/code&gt; for the children and handle the base case (&lt;code&gt;NULL&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The resulting function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Plan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Recursion end */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModifyTable&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ModifyTable&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;modify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ModifyTable&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modify&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CMD_UPDATE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CMD_DELETE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;contains_predicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modify&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lefttree&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nl"&gt;default:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cm"&gt;/* Handle recursion */&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;lefttree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;is_sus_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;righttree&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;code&gt;contains_predicates&lt;/code&gt; implementation
&lt;/h2&gt;

&lt;p&gt;Now let's perform the actual check for the existence of predicates in &lt;code&gt;contains_predicates&lt;/code&gt;. Inside this function we must check that the given &lt;code&gt;Plan&lt;/code&gt; contains predicates.&lt;/p&gt;

&lt;p&gt;But the situation is complicated by the fact that we are given only a base &lt;code&gt;Plan&lt;/code&gt; and do not know the actual query. For example, this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will contain a JOIN in the &lt;code&gt;lefttree&lt;/code&gt; of &lt;code&gt;ModifyTable&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtx7k65q7jom4kbc9dpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtx7k65q7jom4kbc9dpr.png" alt="Merge Join as child of ModifyTable" width="632" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus we have to clarify what &lt;code&gt;contains_predicates&lt;/code&gt; must check. To keep things simple, we will just look for the first node with any predicate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;contains_predicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Plan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;qual&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;contains_predicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;lefttree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;contains_predicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;righttree&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
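
&lt;p&gt;To see which plan shapes pass this check, the same traversal can be tried outside of PostgreSQL on a simplified tree. The sketch below is only an illustration under stated assumptions: &lt;code&gt;MockPlan&lt;/code&gt;, &lt;code&gt;mock_contains_predicates&lt;/code&gt; and &lt;code&gt;demo_delete_using&lt;/code&gt; are hypothetical names, and the struct keeps a plain predicate counter instead of the real &lt;code&gt;qual&lt;/code&gt; list.&lt;/p&gt;

```c
/* Simplified stand-in for PostgreSQL's Plan node (hypothetical, not the real struct) */
typedef struct MockPlan
{
    struct MockPlan *lefttree;
    struct MockPlan *righttree;
    int         nquals;         /* number of predicates attached to this node */
} MockPlan;

/* Same traversal as contains_predicates: stop at the first node with a predicate */
static int
mock_contains_predicates(const MockPlan *plan)
{
    if (!plan)
        return 0;

    if (plan->nquals != 0)
        return 1;

    return mock_contains_predicates(plan->lefttree) ||
           mock_contains_predicates(plan->righttree);
}

/*
 * Build a tree shaped like the DELETE ... USING example:
 * ModifyTable -> Join -> (outer scan, inner scan).
 * In this mock only the inner scan may carry a filter.
 */
static int
demo_delete_using(int inner_scan_quals)
{
    MockPlan    nodes[4] = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}};

    nodes[0].lefttree = nodes + 1;      /* ModifyTable -> Join */
    nodes[1].lefttree = nodes + 2;      /* Join -> outer scan */
    nodes[1].righttree = nodes + 3;     /* Join -> inner scan */
    nodes[3].nquals = inner_scan_quals;

    return mock_contains_predicates(nodes);
}
```

&lt;p&gt;With this mock, &lt;code&gt;demo_delete_using(1)&lt;/code&gt; finds the predicate two levels below the root, while &lt;code&gt;demo_delete_using(0)&lt;/code&gt; walks the whole tree and reports that nothing was found.&lt;/p&gt;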



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;First things first, let's test the example queries we defined above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;does&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;predicates&lt;/span&gt;
&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;does&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;predicates&lt;/span&gt;
&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's working as expected.&lt;/p&gt;

&lt;p&gt;Also, as we injected our contrib as the last step, we can handle more complicated cases, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;does&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;predicates&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Further improvements
&lt;/h2&gt;

&lt;p&gt;This is just the beginning of the contrib, because there are lots of corner cases that are not handled.&lt;/p&gt;

&lt;p&gt;For example, if we change &lt;code&gt;true&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; in the last query, we will still get an &lt;code&gt;ERROR&lt;/code&gt;.&lt;br&gt;
That is because the database has realized that the subplan will not return anything, so it replaced it with a "dummy" Plan: a &lt;code&gt;Result&lt;/code&gt; node with a &lt;code&gt;FALSE&lt;/code&gt; one-time check, so nothing will be returned:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1spive2zw19e6x5pxlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1spive2zw19e6x5pxlp.png" alt="dummy Result Node" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So far we have seen how you can quickly create a new contrib using a single command that creates all the necessary files.&lt;/p&gt;

&lt;p&gt;To write some templated code, we used the &lt;code&gt;isaif&lt;/code&gt; snippet to quickly add a check for the Node type.&lt;/p&gt;

&lt;p&gt;Also, we have traversed the query plan tree and seen its nodes without having to obtain the &lt;code&gt;NodeTag&lt;/code&gt; and cast to the given type, which incredibly boosts productivity.&lt;/p&gt;

&lt;p&gt;And like the icing on the cake, we saw expression representations of predicates. For our purposes this is not a very big deal, because the query contained only one predicate, but in large queries with dozens of different predicates it's just a lifesaver.&lt;/p&gt;

&lt;p&gt;But actually, this is only a small part of the extension's features. As stated before, it knows the semantics of PostgreSQL variables and actively uses them. For example, it displays elements of hash tables, renders some built-in scalar types in a more convenient way (e.g. &lt;code&gt;XLogRecPtr&lt;/code&gt; is shown in &lt;code&gt;FILE/OFFSET&lt;/code&gt; form, not as an integer), supports bitmasks, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips &amp;amp; tricks
&lt;/h2&gt;

&lt;p&gt;As you may have noticed, the extension greatly facilitates the development and debugging process. But an extension is just an extension, an automation tool: there are lots of other aspects that can be optimized. On the documentation site you can find useful tips and tricks for development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ashenblade.github.io/postgres-dev-helper/dev_scripts" rel="noopener noreferrer"&gt;PostgreSQL management scripts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ashenblade.github.io/postgres-dev-helper/vscode_setup/#launchjson" rel="noopener noreferrer"&gt;VS Code automation files (&lt;code&gt;tasks.json&lt;/code&gt; and &lt;code&gt;launch.json&lt;/code&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ashenblade.github.io/postgres-dev-helper/vscode_setup/#extensions" rel="noopener noreferrer"&gt;Useful extensions for VS Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And, of course, main links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ashenBlade/postgres-dev-helper" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://marketplace.visualstudio.com/items?itemName=ash-blade.postgresql-hacker-helper" rel="noopener noreferrer"&gt;VS Code marketplace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ashenblade.github.io/postgres-dev-helper" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>vscode</category>
    </item>
    <item>
      <title>The PostgreSQL Hacker Helper extension is one year old</title>
      <dc:creator>Sergey Solovev</dc:creator>
      <pubDate>Thu, 14 Aug 2025 12:09:02 +0000</pubDate>
      <link>https://forem.com/ashenblade/the-postgresql-hacker-helper-extension-is-one-year-old-5519</link>
      <guid>https://forem.com/ashenblade/the-postgresql-hacker-helper-extension-is-one-year-old-5519</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/ashenBlade/postgres-dev-helper" rel="noopener noreferrer"&gt;PostgreSQL Hacker Helper&lt;/a&gt; is a VS Code extension for developing PostgreSQL source code. A couple of days ago (August 9th), one year has passed since the release of version 1.0.0.&lt;/p&gt;

&lt;p&gt;Initially, it was a utility for dynamically calculating expressions and casting variables, but after a while I realized that not everything is so simple. The main catch is that there are types (if you can say so) that require special treatment.&lt;/p&gt;

&lt;p&gt;The most striking example is a List, a dynamic array. What's so special about it? Firstly, there is only one data structure, but internally it stores one of pointer/&lt;code&gt;int&lt;/code&gt;/&lt;code&gt;TransactionId&lt;/code&gt;/&lt;code&gt;Oid&lt;/code&gt;. Secondly, its implementation depends on the version. Previously, it was implemented as a linked list, but today it is an array.&lt;/p&gt;

&lt;p&gt;Another interesting example is &lt;code&gt;Value&lt;/code&gt;. Today, this structure does not exist, as it has been split into separate String, Integer, Float, Boolean, and BitString (src/include/nodes/value.h). This also violates the initially beautiful picture, as you have to add (complex) logic - the name of the structure does not correspond to the type of the stored node.&lt;/p&gt;

&lt;p&gt;I've added a lot of features this year:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expression rendering (variables representing expressions are displayed by the expression they represent)&lt;/li&gt;
&lt;li&gt;Displaying the contents of hash tables&lt;/li&gt;
&lt;li&gt;Pointers to relations from variables of the &lt;code&gt;Relids&lt;/code&gt; type&lt;/li&gt;
&lt;li&gt;Formatting source code using &lt;code&gt;pgindent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bootstrapping new extensions (creating template extension files)&lt;/li&gt;
&lt;li&gt;Dump node representation to a log or a separate text file (via &lt;code&gt;pprint&lt;/code&gt;/&lt;code&gt;nodeToString&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we talk about non-functional features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Greater extensibility due to the configuration file&lt;/li&gt;
&lt;li&gt;Support for multiple debugger extensions&lt;/li&gt;
&lt;li&gt;Testing (integration) and a CI pipeline for it&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;What I remember most is adding CodeLLDB debugger support. I worked on it for 5 days from morning to night. At the same time, I added testing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The most difficult part of all this is supporting older versions of PostgreSQL. For the extension to work, I rely on dynamic evaluation of functions in the debugger, but various major releases may break binary compatibility and some functions may be removed. I can't remember how many times I spent hours looking for workarounds to implement some functionality.&lt;/p&gt;

&lt;p&gt;Looking at all this, I realize that now it can be called an entire IDE for PostgreSQL. Although it seems that everything that could have been written has already been done, I am constantly finding new opportunities for its development.&lt;/p&gt;

&lt;p&gt;Links for those interested: &lt;a href="https://github.com/ashenBlade/postgres-dev-helper" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, &lt;a href="https://marketplace.visualstudio.com/items?itemName=ash-blade.postgresql-hacker-helper" rel="noopener noreferrer"&gt;VS Code marketplace&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
    </item>
    <item>
      <title>pg_dphyp: teach PostgreSQL to JOIN tables in a different way</title>
      <dc:creator>Sergey Solovev</dc:creator>
      <pubDate>Mon, 28 Jul 2025 09:15:11 +0000</pubDate>
      <link>https://forem.com/ashenblade/pgdphyp-teach-postgresql-to-join-tables-in-a-different-way-33o8</link>
      <guid>https://forem.com/ashenblade/pgdphyp-teach-postgresql-to-join-tables-in-a-different-way-33o8</guid>
      <description>&lt;p&gt;Greetings!&lt;/p&gt;

&lt;p&gt;I work at Tantor Labs as a database developer and naturally I am fond of databases. Once, while reading the &lt;a href="https://www.redbook.io" rel="noopener noreferrer"&gt;red book&lt;/a&gt;, I decided to study the planner in depth. The main part of a relational database planner is join ordering, and I came across the DPhyp algorithm, which is used in most modern (and not so modern) databases. I wondered: is there anything like it in PostgreSQL? Surprisingly, nothing. Well, if something does not exist, you need to create it yourself.&lt;/p&gt;

&lt;p&gt;This article is not about DPhyp per se, but about what I had to deal with in the process of writing the corresponding extension for PostgreSQL. But first things first, a little theory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you want to go straight to the extension, here is the &lt;a href="https://github.com/TantorLabs/pg_dphyp" rel="noopener noreferrer"&gt;link to the repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Join ordering algorithms
&lt;/h2&gt;

&lt;p&gt;The query planner in databases is perhaps the most complex and important component of the system, especially if we are talking about terabytes (especially petabytes) of data. It doesn't matter how fast the hardware is: if the planner made a mistake and started using sequential scan instead of the index scan, that's it, please come back for the result in a week. In this complex component, you can single out the core: JOIN ordering. Choosing the right table JOIN order has the greatest impact on the cost of the entire query. For example, a query like this...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;...has 5 possible shapes of the JOIN tree. In general, this is the number of binary trees with &lt;code&gt;n&lt;/code&gt; leaves, where leaves are tables, i.e. the &lt;a href="https://en.wikipedia.org/wiki/Catalan_number" rel="noopener noreferrer"&gt;Catalan number&lt;/a&gt; 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Cn−1C_{n - 1}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. But do not forget that the order of tables is also important, so for each shape of the JOIN tree we must consider all table permutations. Thus, the number of possible JOIN orderings for a query with &lt;code&gt;n&lt;/code&gt; tables is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Cn−1n!C_{n - 1}n!&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mclose"&gt;!&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. This number grows very fast. For example, for 6 tables it will be 30240, and for 7 it will be 665280! Needless to say, from a certain point on, this number becomes so huge that it is practically impossible to find the optimal plan by simply iterating through all the combinations. The way we search for a suitable ordering determines the architecture of the planner: top-down vs bottom-up.&lt;/p&gt;
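
&lt;p&gt;To get a feel for how fast this grows, here is a minimal C sketch that evaluates the formula above. The function names (&lt;code&gt;catalan&lt;/code&gt;, &lt;code&gt;factorial&lt;/code&gt;, &lt;code&gt;join_orderings&lt;/code&gt;) are made up for this illustration and are not part of PostgreSQL; it also assumes that 64-bit arithmetic is enough, which holds only for a small number of tables.&lt;/p&gt;

```c
/* Catalan number C(k): the number of binary tree shapes with k + 1 leaves */
static unsigned long long
catalan(unsigned int k)
{
    unsigned long long c = 1;   /* C(0) */
    unsigned int i;

    /* C(i + 1) = C(i) * 2 * (2i + 1) / (i + 2); the division is always exact */
    for (i = 0; i != k; i++)
        c = c * 2 * (2 * i + 1) / (i + 2);

    return c;
}

/* n! = the number of ways to assign n tables to the leaves of a fixed tree shape */
static unsigned long long
factorial(unsigned int n)
{
    unsigned long long f = 1;
    unsigned int i;

    for (i = 2; n >= i; i++)
        f *= i;

    return f;
}

/* Number of JOIN orderings for a query with ntables tables: C(n - 1) * n! */
static unsigned long long
join_orderings(unsigned int ntables)
{
    if (ntables == 0)
        return 0;

    return catalan(ntables - 1) * factorial(ntables);
}
```

&lt;p&gt;For 6 and 7 tables this gives 30240 and 665280, the same numbers as mentioned above.&lt;/p&gt;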

&lt;p&gt;In the top-down approach (also called goal-oriented) we start from the root and recursively descend the query tree. The advantage here is that we have the full context in our hands and can use it. The most illustrative example is correlated subqueries: a top-down planner can use the current context to transform such a nested subquery into a simple JOIN, which can dramatically improve performance. An example is &lt;a href="https://www.srdc.com.tr/share/publications/1995/cas.pdf" rel="noopener noreferrer"&gt;the cascades planner&lt;/a&gt; (roughly speaking, it is a framework), which is used in Microsoft SQL Server: it can easily move the nodes of the query graph under GROUP BY, which the other approach (bottom-up) cannot do without additional help (its architecture assumes a static set of relations to join).&lt;/p&gt;

&lt;p&gt;Bottom-up is the opposite approach. Here all JOINs are planned first, and only then are sorting, grouping and other operators added. This approach is used in many databases, including PostgreSQL. Its advantage is scalability, as it allows you to plan a huge number of JOINs. For example, the article &lt;a href="https://db.in.tum.de/~radke/papers/hugejoins.pdf" rel="noopener noreferrer"&gt;Adaptive Optimization of Very Large Join Queries&lt;/a&gt; presents an approach that allows planning queries with several thousand tables by combining different algorithms. The example in the article is SAP, which, due to the constant use of views within other views, can create a query using thousands of regular tables.&lt;/p&gt;

&lt;p&gt;There are lots of bottom-up algorithms, so let's consider the ones that PostgreSQL uses.&lt;/p&gt;
&lt;h3&gt;
  
  
  DPsize
&lt;/h3&gt;

&lt;p&gt;At the dawn of RDBMS, no one fully understood how everything should work, and everyone did their best. That is when the first dynamic programming algorithm for finding the join order appeared. Today, everyone (at least in the articles) calls it DPsize.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's remember what a relation is. Relational algebra is built on working with relations. A relation can be understood as a simple data source with its own schema (attributes). The simplest example of a relation is a table, but another important example is a &lt;code&gt;JOIN&lt;/code&gt;, because in fact it meets the requirements (it produces tuples and has attributes). Next, I will use the term "relation", but where it is important to emphasize the difference, I will say "table".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The idea of DPsize is simple: to create a JOIN of &lt;code&gt;i&lt;/code&gt; tables, you need to JOIN smaller relations whose table counts add up to &lt;code&gt;i&lt;/code&gt;. For example, for &lt;code&gt;4&lt;/code&gt; we need to join &lt;code&gt;1 + 3&lt;/code&gt; or &lt;code&gt;2 + 2&lt;/code&gt; tables. This is dynamic programming: the answer at the current step depends on the answers of the previous ones, and the base case is a relation of size &lt;code&gt;1&lt;/code&gt;, an ordinary table. The algorithm performs well on an OLTP workload, when there are few tables, and provides almost optimal query plans, but problems begin when there are too many tables.&lt;/p&gt;

&lt;p&gt;As you can see, the time/space complexity of this algorithm is exponential, since each subsequent step has to process even more pairs of relations. Different databases deal with this in different ways, and PostgreSQL switches to a different algorithm.&lt;/p&gt;
&lt;h3&gt;
  
  
  GEQO
&lt;/h3&gt;

&lt;p&gt;GEQO (Genetic Query Optimizer) is a genetic algorithm for finding the optimal query plan. If you try to run a query in PostgreSQL, first for 11 tables, and then for 12, you will be surprised, because the planning time drops from a few seconds to about ten &lt;em&gt;milliseconds&lt;/em&gt;. Why? Because GEQO has entered the game (the &lt;code&gt;geqo_threshold&lt;/code&gt; setting is &lt;code&gt;12&lt;/code&gt; by default, and GEQO is used for queries with at least that many tables). This is a randomized algorithm, and its core idea can be described as follows: first, we build &lt;em&gt;at least some&lt;/em&gt; query plan, and then we carry out several iterations (their number is determined by the configuration), in each of which we randomly change some nodes, and the plan with the best cost goes on to the next iteration. Hence the name "genetic": the strongest (in our case, the cheapest) pass into the next generation (iteration).&lt;/p&gt;
&lt;h2&gt;
  
  
  DPhyp
&lt;/h2&gt;

&lt;p&gt;Let's move on to the main topic — the DPhyp algorithm. DPhyp is a dynamic programming algorithm for JOIN ordering. Its main idea is that the query itself contains guidance on how tables should be joined. So why not use it? The main problem is in the query representation itself. I mentioned earlier that a query can be represented as a graph, but is it possible to do this for the purposes of iterating over JOINS? To understand the difficulty, let's look at an example from the paper:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;R1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R6&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;R1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;R2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;R4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;R5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
      &lt;span class="n"&gt;R1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;R2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;R3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;R4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;R5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;R6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Yes, we see that there are several &lt;em&gt;explicitly&lt;/em&gt; connected tables — for them we can create edges in our graph (for example, &lt;code&gt;R1 - R2&lt;/code&gt;), but what about the last predicate, which actually connects multiple tables? This problem is solved by DPhyp, a dynamic programming algorithm (DP) based on &lt;em&gt;hypergraphs&lt;/em&gt; (hyp — hypergraph). You should not be afraid, I assume that you are familiar with an ordinary graph — a set of nodes connected by edges, but a hypergraph is a set of hypernodes connected by hyperedges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hypernode — is a &lt;em&gt;set&lt;/em&gt; of regular &lt;em&gt;nodes&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;hyperedge — is an edge, connecting 2 &lt;em&gt;hypernodes&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The example query has the following hyperedges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;{R1} - {R2}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{R2} - {R3}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{R4} - {R5}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{R5} - {R6}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{R1, R2, R3} - {R4, R5, R6}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If there is only one relation in a hypernode, then it is called a &lt;em&gt;simple hypernode&lt;/em&gt;. Similarly, if a hyperedge connects two simple hypernodes, then it is a &lt;em&gt;simple hyperedge&lt;/em&gt;. Thus, the first four of the list above are simple hyperedges.&lt;/p&gt;
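
&lt;p&gt;The definitions above can be written down as a small data structure. This is only a sketch for illustration: the names &lt;code&gt;HyperNode&lt;/code&gt;, &lt;code&gt;HyperEdge&lt;/code&gt;, &lt;code&gt;example_edges&lt;/code&gt; and the helper functions are assumptions made here, and a hypernode is stored as a plain array of relation numbers for readability, which is not necessarily how a real implementation represents sets.&lt;/p&gt;

```c
#define MAX_RELS 6

/* A hypernode is a set of relations, stored here as an array of relation numbers */
typedef struct HyperNode
{
    int         nrels;
    int         rels[MAX_RELS];
} HyperNode;

/* A hyperedge connects two hypernodes */
typedef struct HyperEdge
{
    HyperNode   left;
    HyperNode   right;
} HyperEdge;

/* The five hyperedges of the example query, in the same order as in the list */
static const HyperEdge example_edges[5] = {
    {{1, {1}}, {1, {2}}},
    {{1, {2}}, {1, {3}}},
    {{1, {4}}, {1, {5}}},
    {{1, {5}}, {1, {6}}},
    {{3, {1, 2, 3}}, {3, {4, 5, 6}}},
};

/* A hyperedge is simple when both of its hypernodes contain exactly one relation */
static int
is_simple_hyperedge(const HyperEdge *edge)
{
    if (edge->left.nrels != 1)
        return 0;

    return edge->right.nrels == 1;
}

/* Count the simple hyperedges in an array of hyperedges */
static int
count_simple_hyperedges(const HyperEdge *edges, int nedges)
{
    int         i;
    int         count = 0;

    for (i = 0; i != nedges; i++)
        count += is_simple_hyperedge(edges + i);

    return count;
}
```

&lt;p&gt;Calling &lt;code&gt;count_simple_hyperedges(example_edges, 5)&lt;/code&gt; counts four simple hyperedges; only the last one, which comes from the predicate over all six tables, is a complex one.&lt;/p&gt;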

&lt;p&gt;To create a plan for a set of &lt;code&gt;i&lt;/code&gt; nodes, we do not need to consider all possible pairs whose sizes sum to &lt;code&gt;i&lt;/code&gt;, only pairs of hypernodes whose union gives exactly this set. Such a pair consists of two disjoint sets: a &lt;code&gt;connected subgraph&lt;/code&gt; (csg, subgraph) and a &lt;code&gt;connected complement&lt;/code&gt; (cmp, complement). These abbreviations will often appear in the post.&lt;/p&gt;

&lt;p&gt;To make everything work optimally, we need to introduce a small restriction: an order on the nodes. The nodes (i.e. tables) must be ordered; roughly speaking, they should be numbered (numbering is used most often) so that we can compare and sort them later. To understand why, let's look at the core of the algorithm: the neighborhood.&lt;/p&gt;

&lt;p&gt;During the operation of the algorithm, we use neighbors to move from one hypernode to another. The paper provides a mathematical definition, but the easiest way to put it is that the neighbors of a hypernode are the other nodes reachable from it. It is also required that this set be minimal, otherwise we will process the same hypernodes several times. This is where the order is needed: when we traverse the edges to find neighbors, we add only the &lt;em&gt;representative&lt;/em&gt; of the hypernode, its &lt;em&gt;minimal element&lt;/em&gt;, to the set. Then we just need to check that the other edges do not contain nodes that have already been added.&lt;/p&gt;

&lt;p&gt;The last important detail of the algorithm is the &lt;em&gt;excluded set&lt;/em&gt;. In DPsize, as an optimization, we start iterating over complementary pairs not from 0, but from the next index; otherwise we would try to join a relation with itself and process some pairs twice. The idea is about the same here: we keep track of nodes that are not worth considering (excluded) and check this almost everywhere, even when finding neighbors. Thanks to this, we avoid considering the same set twice.&lt;/p&gt;

&lt;p&gt;The main logic of the algorithm in the article is represented by five functions, and the core of the algorithm can be briefly described as follows: we need to find a csg-cmp pair that combines the entire query, so using neighbor search, csg and cmp will increase alternately/recursively. The only question is where to start. The answer here is this — we start iterating over all nodes starting &lt;em&gt;from the end&lt;/em&gt;, and add to &lt;em&gt;excluded&lt;/em&gt; set all nodes that are smaller than the current one. As a result, we start with a simple hypernode that no one has considered, and then recursively expand it and find the cmp for it using neighbors.&lt;/p&gt;
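&lt;p&gt;The starting loop described above can be sketched as follows. This is only an illustration of the visiting order and of the initial excluded sets, assuming nodes are numbered from bit 0 and there are at most 64 of them; &lt;code&gt;StartPoint&lt;/code&gt; and &lt;code&gt;enumerate_start_points&lt;/code&gt; are hypothetical names, and the recursive expansion itself is left out.&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* A set of nodes: bit i is set when node i belongs to the set. */
typedef uint64_t nodeset;

typedef struct StartPoint
{
    nodeset csg;      /* simple hypernode the enumeration starts from */
    nodeset excluded; /* all nodes with a smaller index */
} StartPoint;

/*
 * Fill `out` (capacity of at least nnodes entries, nnodes in 1..64) in the
 * order the nodes are visited: from the highest index down to the lowest.
 * Returns the number of entries written.
 */
static size_t
enumerate_start_points(size_t nnodes, StartPoint *out)
{
    size_t count = 0;

    for (size_t i = nnodes; i-- > 0;)
    {
        nodeset self = (nodeset) 1 << i;

        out[count].csg = self;
        /* Subtracting 1 from a single bit sets every lower bit. */
        out[count].excluded = self - 1;
        count++;
    }

    return count;
}
```

&lt;p&gt;For six nodes the first start point is the last node with all five other nodes excluded, and the last start point is the first node with an empty excluded set.&lt;/p&gt;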

&lt;p&gt;Actually, it's not difficult to understand the functions now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Solve&lt;/code&gt; — algorithm's entry point. Iterate over all simple hypernodes from end and invoke &lt;code&gt;EmitCsg&lt;/code&gt; and &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EmitCsg&lt;/code&gt; — accepts &lt;em&gt;fixed&lt;/em&gt; csg, for which we find corresponding cmp, and then invoke &lt;code&gt;EmitCsgCmp&lt;/code&gt; and/or &lt;code&gt;EnumerateCmpRecursive&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EnumerateCsgRecursive&lt;/code&gt; – accepts csg, which is expanded using its neighborhood, then invokes &lt;code&gt;EmitCsg&lt;/code&gt; and/or &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EnumerateCmpRecursive&lt;/code&gt; — accepts &lt;em&gt;fixed&lt;/em&gt; csg with cmp, which is expanded using cmp's neighborhood, then invoke &lt;code&gt;EmitCsgCmp&lt;/code&gt; and/or &lt;code&gt;EnumerateCmpRecursive&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EmitCsgCmp&lt;/code&gt; — creates query plan for given &lt;em&gt;fixed&lt;/em&gt; csg/cmp pair.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main idea is clear. And as an example, the article also has a step-by-step illustration of how the algorithm works for the query from the example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh3lbtzm7yrso32ze3zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh3lbtzm7yrso32ze3zz.png" alt="Algorithm tracing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Iteration is performed backwards, so we start from &lt;code&gt;R6&lt;/code&gt; (the highest index), but it does not have any neighbors, because all other nodes are in the excluded set.&lt;/li&gt;
&lt;li&gt;Move on to &lt;code&gt;R5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R6&lt;/code&gt; and create cmp &lt;code&gt;{R6}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand csg itself to &lt;code&gt;{R5, R6}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Move on to &lt;code&gt;R4&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R5&lt;/code&gt; and create cmp &lt;code&gt;{R5}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand cmp to &lt;code&gt;{R5, R6}&lt;/code&gt; (&lt;code&gt;R6&lt;/code&gt; is a neighbor of &lt;code&gt;R5&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Go back to step 6 and expand csg to &lt;code&gt;{R4, R5}&lt;/code&gt; (before that, cmp was expanded).&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R6&lt;/code&gt; for csg &lt;code&gt;{R4, R5}&lt;/code&gt; and create corresponding cmp &lt;code&gt;{R6}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand csg &lt;code&gt;{R4, R5}&lt;/code&gt; to &lt;code&gt;{R4, R5, R6}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Move on to &lt;code&gt;R3&lt;/code&gt; (&lt;code&gt;{R1, R2}&lt;/code&gt; are excluded), but it does not have any neighbors:

&lt;ul&gt;
&lt;li&gt;All hypernodes we have a direct edge with (&lt;code&gt;R1&lt;/code&gt; and &lt;code&gt;R2&lt;/code&gt;) are excluded (they have a lower index);&lt;/li&gt;
&lt;li&gt;The representative of the left hypernode of the complex hyperedge (&lt;code&gt;min({R1, R2, R3}) = R1&lt;/code&gt;) is also in the excluded set, so this hyperedge is not considered.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Move on to &lt;code&gt;R2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R3&lt;/code&gt; and create cmp &lt;code&gt;{R3}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand csg to &lt;code&gt;{R2, R3}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Move on to &lt;code&gt;R1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R2&lt;/code&gt; and create cmp &lt;code&gt;{R2}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand cmp to &lt;code&gt;{R2, R3}&lt;/code&gt; (&lt;code&gt;R3&lt;/code&gt; is a neighbor of &lt;code&gt;R2&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Expand csg to &lt;code&gt;{R1, R2}&lt;/code&gt; (&lt;code&gt;R2&lt;/code&gt; is a neighbor of &lt;code&gt;R1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R3&lt;/code&gt; and create cmp &lt;code&gt;{R3}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Expand csg to &lt;code&gt;{R1, R2, R3}&lt;/code&gt; (&lt;code&gt;R3&lt;/code&gt; is a neighbor of &lt;code&gt;R2&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Find neighbor &lt;code&gt;R4&lt;/code&gt; and create cmp &lt;code&gt;{R4}&lt;/code&gt; (use representative &lt;code&gt;R4&lt;/code&gt; of complex hyperedge &lt;code&gt;{R1, R2, R3} - {R4, R5, R6}&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Expand cmp to &lt;code&gt;{R4, R5}&lt;/code&gt; (&lt;code&gt;R5&lt;/code&gt; is a neighbor of &lt;code&gt;R4&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Expand cmp to &lt;code&gt;{R4, R5, R6}&lt;/code&gt; once again (&lt;code&gt;R6&lt;/code&gt; is a neighbor of &lt;code&gt;R5&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Move back to step 20 and expand csg using &lt;code&gt;{R4}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Then add &lt;code&gt;{R5}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Finally add &lt;code&gt;{R6}&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The algorithm is quite good and intuitive. Can it be added to PostgreSQL? Yes, and it has always been possible! For this case there is &lt;code&gt;join_search_hook&lt;/code&gt;, a hook that allows you to replace the vanilla JOIN search algorithm.&lt;/p&gt;
&lt;h2&gt;
  
  
  pg_dphyp extension
&lt;/h2&gt;

&lt;p&gt;The idea to create this extension came to me by chance: I was studying different JOIN algorithms and came across this one. An Internet search didn't turn up anything like it for PostgreSQL, and I realized that I had to do it myself. If you want to see the result right away, &lt;a href="https://github.com/TantorLabs/pg_dphyp" rel="noopener noreferrer"&gt;here is the link to the repository&lt;/a&gt;. As for the core of the algorithm, I did not contribute anything new; on the contrary, I mostly borrowed from others. Before starting my own implementation, I looked at the implementations in several databases, in particular YDB, MySQL and DuckDB. If you want to learn DPhyp from code, I recommend looking at the &lt;a href="https://github.com/ydb-platform/ydb/blob/c23202bc294cf703741f1ea6ac30786578a58920/ydb/library/yql/dq/opt/dq_opt_dphyp_solver.h" rel="noopener noreferrer"&gt;YDB code&lt;/a&gt;: it is clean, clear and very easy to read. I did not start with YDB myself, though, but with MySQL. Its implementation has since been significantly changed and optimized, so you will not figure it out right away without relying on the comments and a basic understanding of DPhyp itself.&lt;/p&gt;

&lt;p&gt;In the implementation, I tried to stay close to the paper and make as few changes as possible. For example, the names of functions and some variables are the same as in the paper. But although the core of the algorithm is practically unchanged, there are changes at the level of implementation decisions, and the first of them concerns the representation of sets.&lt;/p&gt;
&lt;h3&gt;
  
  
  Set representation
&lt;/h3&gt;

&lt;p&gt;Sets are the workhorse of the algorithm, underlying all of its effectiveness. Different DBMS do this in different ways, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB uses &lt;a href="https://github.com/duckdb/duckdb/blob/73f85abbbdd38555ef7afa08090dfb4b10120df8/src/include/duckdb/optimizer/join_order/join_relation.hpp#L24" rel="noopener noreferrer"&gt;numbers directly and stores them in an array&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;YDB - &lt;a href="https://github.com/ydb-platform/ydb/blob/c23202bc294cf703741f1ea6ac30786578a58920/ydb/library/yql/dq/opt/dq_opt_join_cost_based.cpp#L341" rel="noopener noreferrer"&gt;&lt;code&gt;std::bitset&amp;lt;&amp;gt;&lt;/code&gt; from C++ standard library&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;MySQL - &lt;a href="https://github.com/mysql/mysql-server/blob/ff05628a530696bc6851ba6540ac250c7a059aa7/sql/join_optimizer/node_map.h#L40" rel="noopener noreferrer"&gt;8-byte number as bitset&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone who works with PostgreSQL knows that it has &lt;a href="https://github.com/postgres/postgres/blob/62a47aea1d8d8ea36e63fe6dd3d9891452a3f968/src/include/nodes/bitmapset.h#L49" rel="noopener noreferrer"&gt;its own implementation of a set, &lt;code&gt;Bitmapset&lt;/code&gt;&lt;/a&gt;. It is used everywhere, and the most common use case is storing relation IDs. It would seem that you should just take it and use it, but the problem is that the algorithm performs a lot of set operations, and &lt;code&gt;Bitmapset&lt;/code&gt; creates a new copy every time it changes, which means unnecessary memory allocations. In PostgreSQL itself this is rarely a problem, because a &lt;code&gt;Bitmapset&lt;/code&gt; seldom changes after it is created, but in my case it changes constantly, and this is critical.&lt;/p&gt;

&lt;p&gt;In the first implementation, I solved the problem by implementing two approaches at once: I created two files, one using &lt;code&gt;bitmapword&lt;/code&gt; (an 8-byte number/bitset, as in MySQL) and the other using &lt;code&gt;Bitmapset&lt;/code&gt; for complex queries (with more than 64 tables). But this happened at the very beginning of development, when I still did not really understand the subtleties of the algorithm. After a while I stopped updating the file with &lt;code&gt;Bitmapset&lt;/code&gt; (I decided to add the changes later), and finally deleted it completely.&lt;/p&gt;

&lt;p&gt;Now I use the representation of a set using a number, and this does not cause many problems. The basic operations with a set are performed with simple bitwise operations: &lt;code&gt;|&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;~&lt;/code&gt;. But there are a couple more operations that are important for the algorithm itself, for example, iterating over the elements of the set in the process of calculating neighbors. There are many such operations, so I put them in a separate &lt;a href="https://github.com/TantorLabs/pg_dphyp/blob/b5406651b8f95743042be847b38c06b75bd23670/simplebms.h" rel="noopener noreferrer"&gt;header file&lt;/a&gt;.&lt;/p&gt;
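&lt;p&gt;As an illustration of how cheap these operations are, here is a small sketch of a set stored in one 64-bit number. The names are hypothetical and are not the ones used in the header file linked above; the iteration helper is modeled on the calling convention of PostgreSQL's &lt;code&gt;bms_next_member&lt;/code&gt; (start with &lt;code&gt;-1&lt;/code&gt;, stop on a negative result), but uses a plain loop instead of any compiler intrinsic.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stdint.h>

/* A set of up to 64 elements: bit i is set when element i is present. */
typedef uint64_t nodeset;

static nodeset set_union(nodeset a, nodeset b)      { return a | b; }
static nodeset set_intersect(nodeset a, nodeset b)  { return a & b; }
static nodeset set_difference(nodeset a, nodeset b) { return a & ~b; }

/* True when every element of a is also in b. */
static bool set_is_subset(nodeset a, nodeset b) { return (a & ~b) == 0; }

/* True when a and b share at least one element. */
static bool set_overlaps(nodeset a, nodeset b)  { return (a & b) != 0; }

/*
 * Return the smallest member greater than `prev`, or -1 when there is none.
 * Pass -1 as `prev` to get the first member.
 */
static int
set_next_member(nodeset s, int prev)
{
    for (int i = prev + 1; i < 64; i++)
    {
        if (s & ((nodeset) 1 << i))
            return i;
    }
    return -1;
}
```

&lt;p&gt;None of these helpers allocates memory, which is the main reason to prefer a plain number over &lt;code&gt;Bitmapset&lt;/code&gt; in a hot loop.&lt;/p&gt;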

&lt;p&gt;Another interesting operation is iteration over all subsets, which is needed to expand csg/cmp. Since a set is a number, the operation is performed on a number. In MySQL and YDB, this is solved with the bit trick &lt;code&gt;(state - init) &amp;amp; init&lt;/code&gt; (where &lt;code&gt;init&lt;/code&gt; is the full set and &lt;code&gt;state&lt;/code&gt; is the current subset), which, when applied repeatedly, behaves like an increment that changes only the bits of the set. This implementation is currently &lt;a href="https://github.com/TantorLabs/pg_dphyp/blob/b5406651b8f95743042be847b38c06b75bd23670/pg_dphyp.c#L541" rel="noopener noreferrer"&gt;in use&lt;/a&gt;. For example, for the set &lt;code&gt;01010010&lt;/code&gt; we get the following subsets:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000010    001
00010000    010
00010010    011
01000000    100
01000010    101
01010000    110
01010010    111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On the left is the subset as a number in binary representation, and on the right is its bitmask (one bit per element of the original set). Since the iteration behaves like an increment, the bitmask read as a number equals the iteration number. We will use this property often below.&lt;/p&gt;
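&lt;p&gt;A minimal sketch of this iteration, written in the form &lt;code&gt;(state - init) &amp;amp; init&lt;/code&gt; with &lt;code&gt;init&lt;/code&gt; being the full set and &lt;code&gt;state&lt;/code&gt; the current subset. The function names are hypothetical; &lt;code&gt;subset_by_index&lt;/code&gt; is my own illustration of the "bitmask equals iteration number" property and is not taken from any of the mentioned implementations.&lt;/p&gt;

```c
#include <stdint.h>

/* A set of up to 64 elements: bit i is set when element i is present. */
typedef uint64_t nodeset;

/*
 * Next non-empty subset of `init` after `state`, in increasing order.
 * Start with state = 0; the result is 0 again once all subsets were visited.
 * Unsigned subtraction wraps around, which is exactly what the trick relies on.
 */
static nodeset
next_subset(nodeset state, nodeset init)
{
    return (state - init) & init;
}

/*
 * The k-th subset of `init` (k starts at 1): the bits of k are spread over
 * the set bits of `init`, lowest bit first.
 */
static nodeset
subset_by_index(nodeset init, uint64_t k)
{
    nodeset result = 0;

    while (init != 0 && k != 0)
    {
        nodeset lowest = init & (~init + 1);

        if (k & 1)
            result |= lowest;
        init &= ~lowest;
        k >>= 1;
    }

    return result;
}
```

&lt;p&gt;Starting from &lt;code&gt;0&lt;/code&gt; and calling &lt;code&gt;next_subset&lt;/code&gt; with the set &lt;code&gt;01010010&lt;/code&gt; visits the seven subsets from the table in the same order and then returns &lt;code&gt;0&lt;/code&gt;, which serves as the stop condition.&lt;/p&gt;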
&lt;h3&gt;
  
  
  DP-table
&lt;/h3&gt;

&lt;p&gt;DPhyp is a dynamic programming algorithm, and it has its own table for tracking execution state. If you look at the algorithm in the paper, you can see that this table is used only for storing ready-made query plans. In PostgreSQL, the &lt;code&gt;RelOptInfo&lt;/code&gt; structure stores the list of query plans (&lt;code&gt;Path&lt;/code&gt; structures) for a fixed set of relations, and &lt;em&gt;it is already stored in a hash table&lt;/em&gt;. It may seem that there is no need to create your own hash table, but that is not the case. The problem lies in PostgreSQL itself, or rather in its &lt;code&gt;FULL JOIN&lt;/code&gt; processing.&lt;/p&gt;

&lt;p&gt;For this type of JOIN, only the equality predicate is currently supported, and in the code, when such a predicate occurs, all relations in the left and right parts fall into separate lists that are planned &lt;em&gt;independently&lt;/em&gt;. For example, for query &lt;code&gt;SELECT * FROM t1 FULL JOIN (SELECT t2.x x FROM t2 JOIN t3 ON t2.x = t3.x) s ON t1.x = s.x&lt;/code&gt; we have to run 2 JOIN algorithms: &lt;code&gt;{t2, t3}&lt;/code&gt; and &lt;code&gt;{t1, {t2, t3}}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is why, internally, tables get their own indexes (that is, DPhyp node indexes do not necessarily correspond to relation indexes). So even if I temporarily converted a &lt;code&gt;bitmapword&lt;/code&gt; into a &lt;code&gt;Bitmapset&lt;/code&gt;, which would require an additional memory allocation, I would not be able to do it when relation indexes exceed 64 (the number of bits in an eight-byte number).&lt;/p&gt;

&lt;p&gt;The hash table in PostgreSQL (the &lt;code&gt;HTAB&lt;/code&gt; structure) has one peculiarity: it stores an element and its key in the same structure (the key must be the first member of the element's structure). The built-in hash table for storing &lt;code&gt;RelOptInfo&lt;/code&gt;s hashes a &lt;code&gt;Bitmapset&lt;/code&gt; key, while pg_dphyp uses &lt;code&gt;bitmapword&lt;/code&gt; (for the reason discussed above). So the extension creates and maintains its own hash table.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hypergraph creation
&lt;/h3&gt;

&lt;p&gt;Another important task is building the hypergraph itself. The problem is that, unlike YDB or MySQL, we are not in charge here and have to obey the rules of the database.&lt;/p&gt;

&lt;p&gt;At first glance there is no problem: I can go through all the predicates used in the query and create edges from them. Actually, this is how it is implemented now. But the devil is in the details.&lt;/p&gt;

&lt;p&gt;The information needed to build hyperedges is stored in 3 separate places:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PlannerInfo-&amp;gt;join_info_list&lt;/code&gt; — list of non-INNER JOIN predicates.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RelOptInfo-&amp;gt;joinclauses&lt;/code&gt; — list of JOIN predicates (require more than 1 relation).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PlannerInfo-&amp;gt;eq_classes&lt;/code&gt; — list of equivalence classes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first is the simplest. This list contains restrictions that are imposed by various non-INNER (i.e. LEFT/RIGHT/FULL, etc.) JOINs. Each entry has 2 pairs of sides: the syntactic constraints and the minimal ones (the only ones needed for the calculation). Just in case, a hyperedge is created for both pairs.&lt;/p&gt;

&lt;p&gt;The second one is harder because of the expression itself: the expression may not be binary, or it may be binary but mention the same relation on both sides. For such cases, I added my own concept, the cross join set (CJS). It is just a set of relations that all need to be connected with each other (a clique). For the relations in a CJS, I create a simple edge for every pair. This solves the problem that some predicates may be missing. For example, &lt;code&gt;WHERE sample_func(t1.x, t2.x, t3.x)&lt;/code&gt; has the CJS &lt;code&gt;{t1, t2, t3}&lt;/code&gt;: this is not a binary predicate, so we create the &lt;code&gt;{t1} - {t2}&lt;/code&gt;, &lt;code&gt;{t2} - {t3}&lt;/code&gt; and &lt;code&gt;{t1} - {t3}&lt;/code&gt; hyperedges.&lt;/p&gt;
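&lt;p&gt;Turning a cross join set into a clique is a simple double loop over the members of the set. The sketch below assumes sets are stored as 64-bit numbers; &lt;code&gt;SimpleEdge&lt;/code&gt; and &lt;code&gt;cjs_to_edges&lt;/code&gt; are hypothetical names and this is not the extension's actual code.&lt;/p&gt;

```c
#include <stddef.h>
#include <stdint.h>

/* A set of relations: bit i is set when relation i belongs to the set. */
typedef uint64_t nodeset;

/* A simple edge: both sides contain exactly one relation. */
typedef struct SimpleEdge
{
    nodeset left;
    nodeset right;
} SimpleEdge;

/*
 * Emit one simple edge for every pair of relations in `cjs`.
 * Writes at most `capacity` edges into `out` and returns how many were written.
 */
static size_t
cjs_to_edges(nodeset cjs, SimpleEdge *out, size_t capacity)
{
    size_t count = 0;

    for (nodeset rest = cjs; rest != 0;)
    {
        /* Take the lowest remaining relation and pair it with all higher ones. */
        nodeset first = rest & (~rest + 1);

        rest &= ~first;

        for (nodeset others = rest; others != 0;)
        {
            nodeset second = others & (~others + 1);

            others &= ~second;

            if (count == capacity)
                return count;

            out[count].left = first;
            out[count].right = second;
            count++;
        }
    }

    return count;
}
```

&lt;p&gt;A set of &lt;code&gt;n&lt;/code&gt; relations produces &lt;code&gt;n * (n - 1) / 2&lt;/code&gt; edges, so three relations give exactly the three edges listed above.&lt;/p&gt;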

&lt;p&gt;Lastly, the third: equivalence classes. An equivalence class is a PostgreSQL mechanism for tracking expressions that are known to be equal to each other. It is used not only in predicates (to deduce equivalences): for example, expressions in &lt;code&gt;ORDER BY&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt; are also represented by an equivalence class (possibly degenerate, with a single element). What matters here, however, is how it affects us. Equivalence classes appear whenever the query contains equality expressions, and even one is enough to create a class. As you might guess, they are also used for JOINs. When I encounter an equivalence class with multiple relations, I have to create hyperedges for each pair. Therefore, queries with only equalities turn into a clique under the hood (as you may notice, the logic is the same as for CJS).&lt;/p&gt;

&lt;p&gt;For example, consider this query:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
         &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
         &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;equivalence classes: &lt;code&gt;{t1.x, t2.x}&lt;/code&gt; and &lt;code&gt;{t1.x + t2.x, t3.x}&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;JOIN predicate: &lt;code&gt;t1.y &amp;gt; t2.y&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;special (LEFT) JOIN predicate: &lt;code&gt;t4.x = t3.x&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will create the following hyperedges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{t1} - {t2}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{t3} - {t1, t2}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{t3} - {t4}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Disconnected subgraphs
&lt;/h3&gt;

&lt;p&gt;Another problem follows from the above. What happens if we get a disconnected graph (a forest)? In this case, the algorithm will not be able to create the final plan. More precisely, it will create a plan for each connected component in the forest, but it will not be able to do so for the &lt;em&gt;entire&lt;/em&gt; query. Moreover, this problem may appear not only because of &lt;code&gt;CROSS JOIN&lt;/code&gt; or &lt;code&gt;,&lt;/code&gt;, but also because of external parameters of subqueries. Here is an example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; 
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;I discovered this when I ran &lt;code&gt;\d&lt;/code&gt; (to display all tables) in &lt;code&gt;psql&lt;/code&gt;, and the query behind it turned out to be exactly such an example.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the subquery, &lt;code&gt;t1.x&lt;/code&gt; is a parameter, so the resulting graph is a forest: there is no relation ID for &lt;code&gt;t1&lt;/code&gt; inside the subquery. On the one hand, disconnected graphs are rare in practice, so spending resources on detecting them is usually wasted. On the other hand, when one does occur and is not detected, we lose extra time (the planning time simply doubles with nothing to show for it). Therefore, I left it up to the users to decide what to do.&lt;/p&gt;

&lt;p&gt;There is a setting, &lt;code&gt;pg_dphyp.cj_strategy&lt;/code&gt;, which accepts 3 values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;no&lt;/code&gt; – if we failed to build result plan, then invoke DPsize/GEQO with initial values;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pass&lt;/code&gt; – if we failed to build result plan, then collect plans for all connected components and pass them to DPsize/GEQO;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;detect&lt;/code&gt; – find all connected components during initialization and create dummy hyperedges for them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might think that &lt;code&gt;no&lt;/code&gt; is superfluous, but it is not: the plans created for the separate subgraphs may be suboptimal, because there may be implicit connections between the disconnected subgraphs that could have helped to build a better plan. Therefore, the final plan may also be suboptimal.&lt;/p&gt;

&lt;p&gt;The question remains: how to find disconnected subgraphs? The answer is simple: the &lt;a href="https://github.com/TantorLabs/pg_dphyp/blob/b5406651b8f95743042be847b38c06b75bd23670/pg_dphyp.c#L953" rel="noopener noreferrer"&gt;Union-Find (disjoint-set) algorithm&lt;/a&gt;. For performance, an optimized version is used, with union by rank and leader updating (path compression).&lt;/p&gt;
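&lt;p&gt;For readers who have not met this structure before, here is a generic textbook sketch of a disjoint-set forest with union by rank and path compression. It is not the code from the extension: the names, the fixed limit of 64 nodes and the component counter are my own assumptions for the illustration.&lt;/p&gt;

```c
#define MAX_NODES 64

/* Disjoint-set forest over nodes 0..nnodes-1. */
typedef struct DisjointSets
{
    int parent[MAX_NODES]; /* parent[i] == i means i is a leader */
    int rank[MAX_NODES];   /* upper bound on the height of the tree */
    int ncomponents;       /* number of connected components so far */
} DisjointSets;

static void
ds_init(DisjointSets *ds, int nnodes)
{
    for (int i = 0; i < nnodes; i++)
    {
        ds->parent[i] = i;
        ds->rank[i] = 0;
    }
    ds->ncomponents = nnodes;
}

/* Find the leader of x and point every node on the path directly at it. */
static int
ds_find(DisjointSets *ds, int x)
{
    int root = x;

    while (ds->parent[root] != root)
        root = ds->parent[root];

    while (ds->parent[x] != root)
    {
        int next = ds->parent[x];

        ds->parent[x] = root;
        x = next;
    }

    return root;
}

/* Merge the components of a and b; the lower-rank tree goes under the other. */
static void
ds_union(DisjointSets *ds, int a, int b)
{
    int ra = ds_find(ds, a);
    int rb = ds_find(ds, b);

    if (ra == rb)
        return;

    if (ds->rank[ra] < ds->rank[rb])
    {
        int tmp = ra;

        ra = rb;
        rb = tmp;
    }

    ds->parent[rb] = ra;
    if (ds->rank[ra] == ds->rank[rb])
        ds->rank[ra]++;
    ds->ncomponents--;
}
```

&lt;p&gt;Calling &lt;code&gt;ds_union&lt;/code&gt; for both ends of every edge leaves &lt;code&gt;ncomponents&lt;/code&gt; greater than 1 exactly when the graph is disconnected, and &lt;code&gt;ds_find&lt;/code&gt; tells which component each node belongs to.&lt;/p&gt;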
&lt;h3&gt;
  
  
  Hypergraph representation
&lt;/h3&gt;

&lt;p&gt;One more task is storing the hypergraph. The graph can be stored as a list of edges, or it can be an adjacency table. But for a hypergraph, only the first option remains, since it is practically impossible to build an adjacency table for hypergraph (hyperedge connects hypernodes which can have more than 1 node on both sides).&lt;/p&gt;

&lt;p&gt;Well, we've decided to use a list of hyperedges, but is it possible to apply any optimizations? Yes, it is. Almost all implementations (for example, &lt;a href="https://github.com/ydb-platform/ydb/blob/c23202bc294cf703741f1ea6ac30786578a58920/ydb/library/yql/dq/opt/dq_opt_join_hypergraph.h#L84" rel="noopener noreferrer"&gt;YDB&lt;/a&gt; and &lt;a href="https://github.com/mysql/mysql-server/blob/ff05628a530696bc6851ba6540ac250c7a059aa7/sql/join_optimizer/hypergraph.h#L69" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;) have an optimization for simple hyperedges. Let me remind you once again that a simple hyperedge has a single-element set on both sides. The optimization uses this fact to store all nodes reachable through simple hyperedges in a single set. Then, at run time, a single operation (e.g. a subset or overlap check) is enough to check multiple edges at once. Such a set is often called a "simple neighborhood"; I think it's clear why. This optimization is also used in the extension, but I went a little further and store the simple neighborhood for each hypernode (in the &lt;code&gt;HyperNode&lt;/code&gt; structure). This simplifies the work a bit, since there is no need to spend time on an additional iteration over nodes, and it takes up only 8 bytes of space.&lt;/p&gt;
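&lt;p&gt;A minimal sketch of the idea of a cached simple neighborhood. The structure and function names below are hypothetical and the layout of the real &lt;code&gt;HyperNode&lt;/code&gt; structure is not reproduced here; the sketch only assumes that each hypernode keeps, next to its own nodes, the set of nodes reachable from it through simple edges.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stdint.h>

/* A set of nodes: bit i is set when node i belongs to the set. */
typedef uint64_t nodeset;

typedef struct HyperNodeSketch
{
    nodeset nodes;               /* nodes that make up this hypernode */
    nodeset simple_neighborhood; /* outside nodes reachable by simple edges */
} HyperNodeSketch;

/*
 * Combine two disjoint hypernodes. The cached neighborhood of the result is
 * the union of both neighborhoods without the nodes that are now inside.
 */
static HyperNodeSketch
hypernode_merge(HyperNodeSketch a, HyperNodeSketch b)
{
    HyperNodeSketch merged;

    merged.nodes = a.nodes | b.nodes;
    merged.simple_neighborhood =
        (a.simple_neighborhood | b.simple_neighborhood) & ~merged.nodes;

    return merged;
}

/* One overlap check answers "is there any simple edge between a and b?". */
static bool
connected_by_simple_edge(HyperNodeSketch a, HyperNodeSketch b)
{
    return (a.simple_neighborhood & b.nodes) != 0;
}
```

&lt;p&gt;The check costs one bitwise AND regardless of how many simple edges the hypernode has, and the cache itself is a single 8-byte field.&lt;/p&gt;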

&lt;p&gt;But this is still not the end. Earlier, I said that for the same expression in the query text, multiple instances of &lt;code&gt;RestrictInfo&lt;/code&gt; (the representation of an expression in the source code) can be created, each with a different set of indexes of the required relations. I also said that I could not process these additional relation IDs, so I could end up with a lot of duplicate expressions. This can lead to a lot of wasted work. I solved this with &lt;a href="https://github.com/ashenBlade/pg_dphyp/blob/0cdc5b410d3bce41398a6646c576cca77994b6e3/pg_dphyp.c#L1096" rel="noopener noreferrer"&gt;sorting&lt;/a&gt;: all hyperedges are stored in a sorted array. Adding a new hyperedge means adding a new element to the sorted array, so binary search can quickly find duplicates and prevent such bloat.&lt;/p&gt;
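&lt;p&gt;The deduplication can be sketched as an insert into a sorted array guarded by a binary search. This is an illustration only: &lt;code&gt;HyperEdge&lt;/code&gt;, &lt;code&gt;edge_compare&lt;/code&gt; and &lt;code&gt;edges_insert_unique&lt;/code&gt; are hypothetical names, the ordering by (left, right) is my assumption, and the sketch expects the caller to normalize the sides of an edge beforehand so that the same edge is always written the same way.&lt;/p&gt;

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A set of nodes: bit i is set when node i belongs to the set. */
typedef uint64_t nodeset;

typedef struct HyperEdge
{
    nodeset left;
    nodeset right;
} HyperEdge;

/* Total order on edges: compare left sides first, then right sides. */
static int
edge_compare(const HyperEdge *a, const HyperEdge *b)
{
    if (a->left != b->left)
        return a->left < b->left ? -1 : 1;
    if (a->right != b->right)
        return a->right < b->right ? -1 : 1;
    return 0;
}

/*
 * Insert `edge` into the sorted array `edges` holding `*count` elements.
 * Returns false when the edge is already present or the array is full.
 */
static bool
edges_insert_unique(HyperEdge *edges, size_t *count, size_t capacity,
                    HyperEdge edge)
{
    size_t lo = 0;
    size_t hi = *count;

    /* Binary search for the edge or for its insertion position. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    cmp = edge_compare(&edges[mid], &edge);

        if (cmp == 0)
            return false;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (*count == capacity)
        return false;

    /* Shift the tail one slot to the right and place the new edge. */
    memmove(&edges[lo + 1], &edges[lo], (*count - lo) * sizeof(HyperEdge));
    edges[lo] = edge;
    (*count)++;

    return true;
}
```

&lt;p&gt;The lookup is logarithmic; the insert itself still shifts elements, which is acceptable as long as the graph is built once and then only read.&lt;/p&gt;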
&lt;h3&gt;
  
  
  Query plan building
&lt;/h3&gt;

&lt;p&gt;If you have read the original paper, then you know that DPhyp is not only about JOIN ordering: it also gives some tips for effective query plan creation. This is an important point, since some JOIN operators are not commutative (for example, &lt;code&gt;LEFT JOIN&lt;/code&gt;), and such JOINs impose restrictions. But that's not all. Let me remind you that DPsize (the built-in planner) is tightly coupled with the code base, so much so that it is impossible to create a plan for &lt;code&gt;i&lt;/code&gt; tables without first finding the optimal plans for the underlying ones.&lt;/p&gt;

&lt;p&gt;According to DPhyp, the query plan is built in &lt;code&gt;EmitCsgCmp&lt;/code&gt;. Initially, that's what I did:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take 2 &lt;code&gt;RelOptInfo&lt;/code&gt; (csg/cmp, left/right) and invoke &lt;code&gt;make_join_rel&lt;/code&gt; with them.&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;generate_useful_gather_paths&lt;/code&gt; to create gather paths (parallel).&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;set_cheapest&lt;/code&gt; to find the best plan.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But if you look inside, you will see that the &lt;code&gt;generate_useful_gather_paths&lt;/code&gt; and &lt;code&gt;set_cheapest&lt;/code&gt; functions go through all the paths created so far, so the time spent in them only grows as planning proceeds. Moreover, &lt;code&gt;set_cheapest&lt;/code&gt; records the cheapest paths found so far, and those paths are then used to create new &lt;code&gt;RelOptInfo&lt;/code&gt;s. That is a problem: with this approach we may end up building on the first plan created rather than on the cheapest one.&lt;/p&gt;

&lt;p&gt;For these reasons, I switched to a DPsize-like approach: for each hypernode, keep a list of candidate hypernode pairs that can produce it. At the end, recursively build the target &lt;code&gt;RelOptInfo&lt;/code&gt;s from these lists and only then invoke &lt;code&gt;generate_useful_gather_paths&lt;/code&gt;/&lt;code&gt;set_cheapest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Yes, there is a chance that the list will contain a pair that cannot produce a query plan, and I will waste time on it. But the overhead of the initial approach (with its constant calls to &lt;code&gt;set_cheapest&lt;/code&gt;) is still greater, and the probability that some pair of hypernodes will not create a useful JOIN is extremely low due to the very idea of DPhyp.&lt;/p&gt;
&lt;h2&gt;
  
  
  Optimal neighborhood calculation
&lt;/h2&gt;

&lt;p&gt;Finally, we've come to the most interesting part: how we are going to traverse the neighbors. Three functions in the algorithm are involved in creating csg/cmp pairs, and together they need to compute neighbors as many as four times: &lt;code&gt;EmitCsg&lt;/code&gt;, &lt;code&gt;EnumerateCsgRec&lt;/code&gt; and &lt;code&gt;EnumerateCmpRec&lt;/code&gt;. Everything starts to look more intimidating when you realize that the algorithm iterates over all possible subsets of neighbors (&lt;a href="https://en.wikipedia.org/wiki/Power_set" rel="noopener noreferrer"&gt;power set&lt;/a&gt; without the empty set), which gives 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2i−12^i - 1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 combinations.&lt;/p&gt;

&lt;p&gt;Yes, we have added simple neighborhood optimization, but still we need to process all the nodes in the subsets. Totally, we get 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑k=1nkCnk\sum_{k = 1}^{n}kC_{n}^{k}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 nodes that need to be processed for all subsets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The number of distinct subsets of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 elements is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;CnkC_{n}^{k}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, but for each subset of size 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, each of its elements must be processed, and there are 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 of them. In other words, the number of elements in all subsets of size 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kCnkkC_{n}^{k}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, let's look at the definition of neighborhood:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;E↓′(S,X)=v∣(u,v)∈E,u⊆S,v∩S=∅,v∩X=∅E↓(S,X)=minimal set:∀v∈E↓′(S,X)∃v′∈E↓(S,X):v′⊆vN(S,X)=∪v∈E↓(S,X)min(v)
E\darr'(S, X) = {v|(u, v) \in E, u \subseteq S, v \cap S = \emptyset, v \cap X = \emptyset } \newline

E\darr(S, X) = minimal\ set: \forall_{v\in E\darr'(S,X)} \exists v' \in E\darr(S,X): v' \subseteq v \newline

\mathcal{N}(S, X) = \cup_{v \in E\darr(S, X)} min(v)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&lt;span class="mrel"&gt;↓&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;⊆&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∩&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∅&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∩&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∅&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;↓&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;minima&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span 
class="mord mathnormal"&gt;se&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;:&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∀&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;span class="mrel mtight"&gt;∈&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;E&lt;/span&gt;&lt;span class="mrel mtight"&gt;&lt;span class="mrel mtight"&gt;↓&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;S&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;X&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∃&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;↓&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;:&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;⊆&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span 
class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;N&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;X&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mbin"&gt;∪&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;span class="mrel mtight"&gt;∈&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;E&lt;/span&gt;&lt;span class="mrel mtight"&gt;↓&lt;/span&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;S&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;X&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;min&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The set 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;E↓′E\darr'&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&lt;span class="mrel"&gt;↓&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is built: the targets of the (interesting) hyperedges that start inside the current hypernode and point outside both it and the excluded set.&lt;/li&gt;
&lt;li&gt;This set is minimized to eliminate subsumed hypernodes (so there are no overlaps and potential duplicates).&lt;/li&gt;
&lt;li&gt;Only representatives (the smallest elements) are selected from the remaining hypernodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second step (finding the &lt;em&gt;minimal&lt;/em&gt; set) is computationally expensive, so many databases use a shortcut: as hyperedges are traversed, the representative of a hypernode is added to the neighborhood &lt;em&gt;only&lt;/em&gt; if that hypernode does not intersect the neighborhood computed so far. This approach is used by &lt;a href="https://github.com/mysql/mysql-server/blob/ff05628a530696bc6851ba6540ac250c7a059aa7/sql/join_optimizer/subgraph_enumeration.h#L314" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt; and &lt;a href="https://github.com/ydb-platform/ydb/blob/c23202bc294cf703741f1ea6ac30786578a58920/ydb/library/yql/dq/opt/dq_opt_dphyp_solver.h#L438" rel="noopener noreferrer"&gt;YDB&lt;/a&gt;.&lt;/p&gt;
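
&lt;p&gt;The shortcut can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the MySQL or YDB code: &lt;code&gt;NodeSet&lt;/code&gt;, &lt;code&gt;Hyperedge&lt;/code&gt;, &lt;code&gt;Representative&lt;/code&gt; and &lt;code&gt;GreedyNeighborhood&lt;/code&gt; are made-up names, hypernodes are 64-bit masks, and every hyperedge is assumed to be listed in both orientations.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeSet = std::uint64_t;  // bit i is set <=> node i belongs to the set

// Hypothetical edge representation: a hyperedge from hypernode `left`
// to hypernode `right` (both non-empty and disjoint).
struct Hyperedge {
  NodeSet left;
  NodeSet right;
};

// Lowest set bit, i.e. the representative min(v) of a non-empty hypernode.
inline NodeSet Representative(NodeSet v) {
  assert(v != 0);
  return v & (~v + 1);
}

// Greedy approximation of N(S, X): no explicit minimization step.
inline NodeSet GreedyNeighborhood(const std::vector<Hyperedge>& edges,
                                  NodeSet s, NodeSet x) {
  NodeSet neighborhood = 0;
  for (const Hyperedge& e : edges) {
    const bool starts_inside = (e.left & ~s) == 0;         // u is a subset of S
    const bool points_outside = (e.right & (s | x)) == 0;  // v misses S and X
    if (!starts_inside || !points_outside) continue;
    // Skip hypernodes that overlap the neighborhood collected so far.
    if ((e.right & neighborhood) != 0) continue;
    neighborhood |= Representative(e.right);
  }
  return neighborhood;
}
```

&lt;p&gt;Because the overlap check only looks at the representatives collected so far, the result of this sketch depends on the traversal order and can contain extra nodes compared to the exact minimal-set definition.&lt;/p&gt;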

&lt;p&gt;Does it make sense to optimize this fragment? The MySQL code contains quite a lot of comments explaining the decisions made (most often based on microbenchmarks). One &lt;a href="https://github.com/mysql/mysql-server/blob/ff05628a530696bc6851ba6540ac250c7a059aa7/sql/join_optimizer/subgraph_enumeration.h#L280" rel="noopener noreferrer"&gt;such comment&lt;/a&gt; is written for the neighborhood computation function &lt;code&gt;FindNeighborhood&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...&lt;br&gt;
This function accounts for roughly 20–70% of the total DPhyp running time, depending on the shape of the graph (~40% average across the microbenchmarks)&lt;br&gt;
...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is, a large share of the algorithm's running time (about 40% on average, and up to 70%) is spent computing neighbors, so yes, there is a point in optimizing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL approach
&lt;/h3&gt;

&lt;p&gt;The idea implemented in MySQL is based on a property of the subsets that we get when iterating incrementally. Notice that the next step often yields a &lt;em&gt;superset of the previous step&lt;/em&gt;: on every second step, the only change is that the first bit is switched from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;. And if only that bit is switched from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;, we get a superset! For example, &lt;code&gt;0011101&lt;/code&gt; is a superset of &lt;code&gt;0011100&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It is tempting to simply take the previous neighbors and extend them using this first element, but is that correct? Yes, it is. If we look again at the definition of neighbors from the article, we will not see any restrictions on the order of node traversal, which means that we can take one set and enlarge it with some new element.&lt;/p&gt;

&lt;p&gt;To make the idea clearer, let's look at iterating over the subsets of a set of 4 nodes. Here is the example from that comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0001
0010 *
0011
0100 *
0101
0110 *
0111
1000 *
1001
1010 *
1011
1100 *
1101
1110 *
1111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The asterisks mark the places where the neighbors have to be computed from scratch, that is, the cache must be reset. On the subset that follows, it is enough to process only one node (the first one). If everything is done honestly (without optimizations), then for an initial set of size 4, 32 nodes have to be processed; with this heuristic, only 20.&lt;/p&gt;
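
&lt;p&gt;These two numbers are easy to re-check with a small counting sketch. The function names are made up for this illustration; the only assumption is the rule described above: a subset with the first bit set reuses the previous result and processes a single node, while any other subset processes all of its nodes.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Number of nodes (set bits) in a subset mask.
inline unsigned CountNodes(std::uint32_t subset) {
  unsigned count = 0;
  for (; subset != 0; subset &= subset - 1) ++count;
  return count;
}

// Every non-empty subset of n nodes is processed from scratch.
inline std::uint64_t NodesWithoutCache(unsigned n) {
  assert(n < 32);
  std::uint64_t total = 0;
  for (std::uint32_t subset = 1; subset < (std::uint32_t{1} << n); ++subset) {
    total += CountNodes(subset);
  }
  return total;
}

// A subset with the lowest bit set differs from the previous subset only in
// that bit, so it processes one node; all other subsets are recomputed.
// (The very first subset, 0001, has no predecessor but is one node anyway.)
inline std::uint64_t NodesWithLowestBitCache(unsigned n) {
  assert(n < 32);
  std::uint64_t total = 0;
  for (std::uint32_t subset = 1; subset < (std::uint32_t{1} << n); ++subset) {
    const bool lowest_bit_set = (subset & 1u) != 0;
    total += lowest_bit_set ? 1u : CountNodes(subset);
  }
  return total;
}
```

&lt;p&gt;In closed form, the first count is n·2&lt;sup&gt;n-1&lt;/sup&gt; and the second is (n+1)·2&lt;sup&gt;n-2&lt;/sup&gt; for n ≥ 2, so the ratio 2n/(n+1) approaches 2 as the set grows.&lt;/p&gt;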

&lt;p&gt;The comment also points out that there are optimal subset traversal orders that give even greater gains. For a set of four elements, the order looks like this (the version in the code comment probably contains a small mistake: after 0111 comes 1110, which does not allow optimal calculation of neighbors, so here is my corrected version).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0001 
0010 *
0011 
0100 *
0101
0110 *
0111
1000 *
1010
1100
1001 *
1011
1101
1110 *
1111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The idea is simple: we iteratively shift the rightmost (first) bit to the left until it hits the left side of a sequence of ones, and when this left side is exhausted (all ones), we start a new digit. The cache is updated at the beginning of a new digit or when this bit hits a sequence of ones.&lt;/p&gt;

&lt;p&gt;Here we only need 15 iterations (nodes to process). In practice, even the simple heuristic of tracking the lowest bit gives a significant speedup, requiring almost 2 times fewer iterations.&lt;/p&gt;

&lt;p&gt;Is the optimal order known for a larger set? Yes, but only for five elements (I won't reproduce it here; you can find it in the same comment).&lt;/p&gt;

&lt;p&gt;If someone thought, "just take a shortcut and that's it," then alas. Do not forget that so far we have only been looking at a &lt;em&gt;bitmask&lt;/em&gt; of included/excluded elements of the set. In reality, the elements of the set are sparse within the bitmask. For example, iterating over the three-element set &lt;code&gt;00101001&lt;/code&gt; looks like this (the subset is on the left, the corresponding mask/iteration number is on the right):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000001   001
00001000   010
00001001   011
00100000   100
00100001   101
00101000   110
00101001   111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
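
&lt;p&gt;A sparse set like this can be walked with a well-known bit trick, and each dense iteration number can be mapped back to its subset. The sketch below is only an illustration: &lt;code&gt;NonEmptySubsets&lt;/code&gt; and &lt;code&gt;Expand&lt;/code&gt; are hypothetical helper names, not functions from MySQL or the extension.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeSet = std::uint64_t;

// All non-empty subsets of `set`, in increasing numeric order.
inline std::vector<NodeSet> NonEmptySubsets(NodeSet set) {
  std::vector<NodeSet> result;
  if (set == 0) return result;
  NodeSet sub = 0;
  do {
    // The subtraction borrows through the positions that are not in `set`,
    // so this behaves like "+1" on the dense iteration number.
    sub = (sub - set) & set;
    result.push_back(sub);
  } while (sub != set);
  return result;
}

// Spreads the bits of a dense iteration number over the positions of `set`.
inline NodeSet Expand(NodeSet dense, NodeSet set) {
  NodeSet result = 0;
  for (NodeSet rest = set; rest != 0; rest &= rest - 1, dense >>= 1) {
    if ((dense & 1u) != 0) result |= rest & (~rest + 1);  // lowest bit of rest
  }
  assert(dense == 0);  // the number must fit into the size of the set
  return result;
}
```

&lt;p&gt;For the set &lt;code&gt;00101001&lt;/code&gt;, &lt;code&gt;NonEmptySubsets&lt;/code&gt; yields the left column of the listing above in the same order, and &lt;code&gt;Expand&lt;/code&gt; turns a value from the right column into the subset next to it.&lt;/p&gt;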


&lt;p&gt;
  Limitations of caching scheme
  &lt;p&gt;The neighbor caching scheme only works when the set of excluded nodes &lt;em&gt;is fixed&lt;/em&gt;. This requirement is satisfied by the last loops in the &lt;code&gt;Enumerate*&lt;/code&gt; functions, where the excluded set stays the same during all iterations (in the definition from the article, these functions recursively call themselves, but with a fixed set of excluded nodes 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;X∪N(S,X)X\cup\mathcal{N}(S,X)&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;∪&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;N&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;S&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/p&gt;

&lt;p&gt;But even this is enough to increase performance, since the other two places where subsets are traversed are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;EnumerateCmpRecursive&lt;/code&gt;, but it only invokes &lt;code&gt;EmitCsgCmp&lt;/code&gt;, which does not need to compute a neighborhood;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EnumerateCsgRecursive&lt;/code&gt;, but it invokes &lt;code&gt;EmitCsg&lt;/code&gt;, which requires a new excluded set (it is not fixed and depends on the iteration).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To understand why neighbors cannot be cached if the set of excluded nodes is changing, let's look at a specific example. Call &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt; with a hypernode 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;S=000110S = 000110&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;S&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;000110&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and excluded set 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;X=000111X = 000111&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;000111&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Its neighborhood, whose subsets we will iterate over, is 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;N=1110000\mathcal{N} = 1110000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;N&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1110000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (note that 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;00010000001000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;0001000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is free for now and does not belong to anyone).&lt;/p&gt;

&lt;p&gt;We want to compute neighbors efficiently, and for this purpose we cache them. The question is: what should we cache? The immutable part is 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;SS&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;S&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, to which a subset of neighbors is added. We decide to cache it, but then this situation arises: for some node there is a hyperedge, the right side of which is 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;10010001001000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1001000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (note that it uses the previously free 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;00010000001000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;0001000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S =  000110
X =  000111
N = 1110000  &amp;lt;-- current neighborhood
R = 1001000  &amp;lt;-- neighbor candidate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;How can this hyperedge be processed correctly? We have two possible outcomes (depending on whether the neighbors have been added to the excluded set or not):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the neighbors &lt;em&gt;were not added&lt;/em&gt; to the set of excluded nodes. Then the node 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;00010000001000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;0001000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is added to the neighbors for &lt;code&gt;EmitCsg&lt;/code&gt;, which will be passed everywhere. But this means that for iterations in which 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;10000001000000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1000000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is involved, the logic of computing neighbors in &lt;code&gt;EmitCsg&lt;/code&gt; will be violated, since this element will be in the excluded set, and accordingly &lt;em&gt;the edge should not be taken into account&lt;/em&gt;;&lt;/li&gt;
&lt;li&gt;the neighbors &lt;em&gt;were added&lt;/em&gt; to the set of excluded nodes. Then we get the opposite problem: in iterations where the 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;10000001000000&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1000000&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 element is not involved, &lt;em&gt;some node will be missing&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you look at the MySQL code, you can see that they still use caching there. Why does it work? The point is that they compute the neighbors for a subset and then &lt;em&gt;add the excluded nodes and the original neighbors to the resulting neighbors&lt;/em&gt;. The excluded nodes are added because of how the &lt;code&gt;Enumerate*&lt;/code&gt; functions work: the excluded set passed to them contains all previously found neighbor sets, although a recursive call may have continued with only a small subset of them. For example, suppose &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt; recursively called itself several times, each time the set of neighbors consisted of two elements, and each recursive call passed on only one of them; the accumulated excluded set is still the union of all neighbors found. That is, the excluded set passed to us can contain nodes that will be valid neighbors in &lt;code&gt;EmitCsg&lt;/code&gt; (since it resets the current excluded set and starts using its own).&lt;/p&gt;

&lt;p&gt;Example: after several recursive calls, &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt; has 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;S=1010001011S = 1010001011&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;S&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1010001011&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;X=1111111111X = 1111111111&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;1111111111&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, but when invoking &lt;code&gt;EmitCsg&lt;/code&gt;, the excluded set will be reset to 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;X=0000000011X = 0000000011&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;0000000011&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (for example, because the smallest node is 2). Then adding 
&lt;span&gt;
  &lt;span&gt;&lt;span&gt;X∖S=0101110100X \setminus S = 0101110100&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;X&lt;/span&gt;&lt;span&gt;∖&lt;/span&gt;&lt;span&gt;S&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;0101110100&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 will simply mean that, just in case, we want to process those nodes that were neighbors of previous calls, but did not get into the current CSG set.&lt;/p&gt;

&lt;p&gt;The original neighbors are added by the same logic, and to preserve the correctness of the algorithm, all nodes that are already in the resulting subgraph are removed from the resulting set. The developers know up front that the neighborhood obtained this way may be incomplete, so they add everything that could be a neighbor, even if this leads to redundant computation. Here is the relevant part of the code, with comments describing the reasons in more detail (&lt;code&gt;lowest_node_idx&lt;/code&gt; is the index of the node with which the current iteration in &lt;code&gt;solve&lt;/code&gt; is running):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// EnumerateComplementsTo() resets the forbidden set, since nodes that&lt;/span&gt;
&lt;span class="c1"&gt;// were forbidden under this subgraph may very well be part of the&lt;/span&gt;
&lt;span class="c1"&gt;// complement. However, this also means that the neighborhood we just&lt;/span&gt;
&lt;span class="c1"&gt;// computed may be incomplete; it just looks at recently-added nodes,&lt;/span&gt;
&lt;span class="c1"&gt;// but there are older nodes that may have neighbors that we added to&lt;/span&gt;
&lt;span class="c1"&gt;// the forbidden set (X) instead of the subgraph itself (S). However,&lt;/span&gt;
&lt;span class="c1"&gt;// this is also the only time we add to the forbidden set, so we know&lt;/span&gt;
&lt;span class="c1"&gt;// exactly which nodes they are! Thus, simply add our forbidden set&lt;/span&gt;
&lt;span class="c1"&gt;// to the neighborhood for purposes of computing the complement.&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// This behavior is tested in the SmallStar unit test.&lt;/span&gt;
&lt;span class="n"&gt;new_neighborhood&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;forbidden&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;TablesBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lowest_node_idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// This node's neighborhood is also part of the new neighborhood&lt;/span&gt;
&lt;span class="c1"&gt;// it's just not added to the forbidden set yet, so we missed it in&lt;/span&gt;
&lt;span class="c1"&gt;// the previous calculation).&lt;/span&gt;
&lt;span class="n"&gt;new_neighborhood&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;neighborhood&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;To implement their singleton cache, they use the &lt;a href="https://github.com/mysql/mysql-server/blob/ff05628a530696bc6851ba6540ac250c7a059aa7/sql/join_optimizer/subgraph_enumeration.h#L163" rel="noopener noreferrer"&gt;class &lt;code&gt;NeighborhoodCache&lt;/code&gt;&lt;/a&gt; and pass it transitively almost everywhere. Its logic is simple: before starting the neighbor search, it finds the delta between the sets (in practice, it checks that the last bit is set), and at the end it saves the computed neighbors, but only if the first bit (the &lt;code&gt;taboo bit&lt;/code&gt;) is not set, because such a set will not contribute to any further neighborhood sets (only the last bit can still be set from there).&lt;/p&gt;

&lt;p&gt;As soon as I understood the idea, I copied the code into my project almost word for word, only renaming &lt;code&gt;taboo&lt;/code&gt; to &lt;code&gt;forbidden&lt;/code&gt;. The code lived in this form for about a week, but then I realized: GPLv2! The MySQL code is distributed under the GPLv2 license, and since I had copied almost everything word for word (at that time probably without even fully understanding the idea itself), I violated this license: the extension is MIT-licensed, and GPLv2 code cannot simply be redistributed under MIT terms! So I faced a question: should I throw out this good optimization and make the code slow, or keep it and change the license to GPLv2? I chose the first option, and this was the beginning of several exciting weeks of thinking about how to optimize this combinatorics.&lt;/p&gt;
&lt;h3&gt;
  
  
  Suffix cache
&lt;/h3&gt;

&lt;p&gt;My task is to optimize the algorithm, but without making it a carbon copy of MySQL. The idea itself is clear: why compute the neighborhood of the entire set if you can simply extend a previous result with a delta? We need to find a pattern, or something similar, that helps us detect such a reusable subset.&lt;/p&gt;

&lt;p&gt;But what if you look at it from the other side? Literally from the other side. Indeed, our subset iteration method has a good property: the upper part (MSB) changes much less frequently than the lower part (LSB). So why don't we use this property? We will just cache something that rarely changes!&lt;/p&gt;

&lt;p&gt;But what is "rarely changing" in essence? In the first idea, I took the closing ones for this constant — the sequence of the last (leftmost) ones (with a fixed digit) in the number can only increase during increment, that is, it is enough only to count the neighbors for this sequence, and then add the changing one. We can separate the base (immutable) part with a simple bitmask, and we know its size (calculated).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we recall the optimal iteration sequence proposed by MySQL, then we can see that this strategy is ideally suited for such a sequence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How do we track changes in this base part? It's easy, because this is binary arithmetic: we keep track of how many iterations remain until the next new &lt;code&gt;1&lt;/code&gt;, then divide that number by &lt;code&gt;2&lt;/code&gt; and wait the calculated number of iterations. When a new digit begins, we simply reset the counter and repeat. The algorithm is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialization:

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;nb_cache = 0&lt;/code&gt;: the cached neighborhood.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nb_cache_subset = 0&lt;/code&gt;: the bitmask of the subset whose neighborhood is stored in &lt;code&gt;nb_cache&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;next_update = 0&lt;/code&gt;: the number of iterations until the next cache update.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prev_update = 0&lt;/code&gt;: the saved &lt;code&gt;next_update&lt;/code&gt; value.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;For each iteration (starting from 1):

&lt;ol&gt;
&lt;li&gt;If &lt;code&gt;next_update == 0&lt;/code&gt; (a new digit is added to the base part):

&lt;ol&gt;
&lt;li&gt;Calculate the neighborhood using the first element (add it to the cached neighborhood).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;next_update = prev_update / 2&lt;/code&gt;: the next new &lt;code&gt;1&lt;/code&gt; appears after half as many iterations as the previous one (binary arithmetic).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prev_update = next_update&lt;/code&gt;: save the new value.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nb_cache = *currently computed neighborhood*&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nb_cache_subset = *current subset*&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;If there is only a single element in the subset (i.e. the iteration number is a power of 2):

&lt;ol&gt;
&lt;li&gt;Calculate the neighborhood using the only element.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;next_update = 2^*digit number - 2*&lt;/code&gt;: the number of iterations before a new digit is added to the base part.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prev_update = next_update&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nb_cache = *currently computed neighborhood*&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nb_cache_subset = *current subset*&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Otherwise:

&lt;ol&gt;
&lt;li&gt;Remove &lt;code&gt;nb_cache_subset&lt;/code&gt; from the current subset (this gives the delta).&lt;/li&gt;
&lt;li&gt;Calculate the current neighborhood as &lt;code&gt;nb_cache&lt;/code&gt; + the neighborhood of the delta.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  1-layered cache
&lt;/h3&gt;

&lt;p&gt;Is this a good idea? Yes, but only as a starting point for further reasoning. In practice, this caching scheme pays off only near the end, when almost everything is filled with ones (I have not found a way to iterate over subsets in the optimal order, so we iterate by incrementing).&lt;/p&gt;

&lt;p&gt;Let's go back to the beginning. We discussed that rarely changing parts should be cached, and for this part we took the run of closing ones. It rarely changes, but note that it is variable, so you need to track its size. Now we will &lt;em&gt;make the base part fixed&lt;/em&gt;, that is, we will &lt;em&gt;cache a certain suffix&lt;/em&gt; of the set.&lt;/p&gt;

&lt;p&gt;Which gives a greater gain: caching the closing ones or a fixed MSB suffix? Let's count on the same set of 4 elements. For the closing ones caching scheme we get the following computation (the second column shows the number of processed nodes, and an asterisk marks the iterations where the cache is updated):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0001  1
0010  1  *
0011  2
0100  1  *
0101  2
0110  2  *
0111  1
1000  1  *
1001  1
1010  1
1011  2
1100  2  *
1101  1
1110  3  *
1111  1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In total, 22 elements need to be processed — two more than the taboo-bit scheme.&lt;/p&gt;

&lt;p&gt;With the MSB suffix caching scheme, you first need to answer a question: what is its size? If the suffix is too small, we will spend a lot of work on the variable part; if it is too large, we will recompute the cached MSB part often. In principle, the optimum can be calculated, because it's the same binary math. But for clarity, let's take an MSB suffix of size 2:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0001  1
0010  1
0011  2
0100  1  *
0101  1
0110  1
0111  2
1000  1  *
1001  1
1010  1
1011  2
1100  2  *
1101  1
1110  1
1111  2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That is only 20 elements, the same number as in the MySQL approach and fewer than in the closing ones approach.&lt;/p&gt;
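
&lt;p&gt;The count for the fixed-suffix scheme can be reproduced with a small simulation. This is a minimal sketch in Python rather than the extension's C code; the helper names (&lt;code&gt;ones&lt;/code&gt;, &lt;code&gt;processed_nodes&lt;/code&gt;) are hypothetical, and it assumes that every cache update recomputes the base from scratch, exactly as in the listing above:&lt;/p&gt;

```python
def ones(x):
    """Number of set bits in x, i.e. the number of nodes in the subset."""
    return bin(x).count("1")


def processed_nodes(total_bits, table_bits):
    """Per-iteration cost when the upper bits are cached as a fixed base.

    The low `table_bits` bits are the changing part, the rest is the base.
    """
    low_size = 2 ** table_bits
    per_step = []
    for subset in range(1, 2 ** total_bits):
        base, low = divmod(subset, low_size)
        if low == 0:
            # A new base begins: its neighborhood is computed from scratch.
            per_step.append(ones(base))
        else:
            # The base comes from the cache, only the low part is processed.
            per_step.append(ones(low))
    return per_step


steps = processed_nodes(total_bits=4, table_bits=2)
print(steps)       # per-iteration cost, in the same order as the listing
print(sum(steps))  # 20
```

&lt;p&gt;Changing &lt;code&gt;table_bits&lt;/code&gt; is a quick way to see how the split between the base and the changing part affects the total.&lt;/p&gt;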

&lt;blockquote&gt;
&lt;p&gt;Here, every time I updated the cache, I computed the neighbors from scratch. You might have noticed that when calculating the cached part, we could use the same optimization with the last bit: add only this last element to the previously cached neighbors. This optimization brings the count down to only 16 iterations, which is much more efficient.&lt;br&gt;
Back then I missed this, but it was for the best, as you will soon see.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scheme that caches a fixed MSB part proved to be better, and besides, it is configurable (the parameter is the size of the cached part). So we choose the caching scheme with the MSB suffix. From here on, I will call this immutable part (which we use as the starting value for calculating neighbors) the base. The algorithm for working with it is quite simple: every 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;264−len(MSB)2^{64 - len(MSB)}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;64&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;MSB&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations we save this base part to the cache, and then just use it as a starting point when calculating neighbors.&lt;/p&gt;
&lt;h3&gt;
  
  
  2-layered cache
&lt;/h3&gt;

&lt;p&gt;Wait, we're working with sets, and any set can be created from a previous one by adding a single element! For example, the set &lt;code&gt;011010&lt;/code&gt; can be constructed from any of &lt;code&gt;001010&lt;/code&gt;, &lt;code&gt;010010&lt;/code&gt; or &lt;code&gt;011000&lt;/code&gt;. This also means that you can always create the current set by adding just its outermost element, counting from the beginning. The initial example with the taboo bit is a special case of this, where the added element is the first one.&lt;/p&gt;

&lt;p&gt;Let's draw a graph for a 4-element set and see how each set can be created from a previous one:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0000 &amp;lt;+  &amp;lt;+  &amp;lt;+
   ^  |   |   |
0001  |   |   |
      |   |   |
0010 -+   |   |
   ^      |   |
0011      |   |
          |   |
0100 &amp;lt;+  -+   |
   ^  |       |
0101  |       |
      |       |
0110 -+       |
   ^          |
0111          |
              |
1000 &amp;lt;+  &amp;lt;+  -+
   ^  |   |
1001  |   |
      |   |
1010 -+   |
   ^      |
1011      |
          |
1100 &amp;lt;+  -+
   ^  |
1101  |
      |
1110 -+
   ^
1111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Do you see the pattern? We take the entry at &lt;em&gt;the index that is smaller than ours by some power of two and use it as a base&lt;/em&gt; to create the current set by adding the current outermost element. What is this power of two? The same diagram shows that it is given by the number of trailing zeros of the iteration number, i.e. the zeros below its lowest &lt;code&gt;1&lt;/code&gt; (equivalently, the index of the first element): the set of neighbors that can be used to create the current one is located 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2zeros2^{zeros}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;zeros&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 steps back, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;zeroszeros&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;zeros&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the number of trailing zeros of the iteration number.&lt;/p&gt;

&lt;p&gt;This is a &lt;em&gt;dynamic programming table&lt;/em&gt;! To calculate the current set, we use the result of the previous calculation. Wait! Doesn't this mean that to calculate the neighbors for each of the 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2i2^i&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 subsets, we will need to process exactly 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2i2^i&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 nodes? Yes, it does! Thus, our entire set can be divided into 2 parts: the base part, where we cache the rarely changing upper part, and the table part, which is calculated using our dynamic programming table. There is no question about the size of each part: the table part takes what is left, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;64−len(MSB)64 - len(MSB)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;64&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;MSB&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (or vice versa). As a result, we have the following algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the iteration number (the numeric representation of the subset bitmask) is divisible by 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2tablesize2^{table size}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;es&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, then compute the neighborhood from scratch, because a new base part begins here.&lt;/li&gt;
&lt;li&gt;Otherwise:

&lt;ol&gt;
&lt;li&gt;Take lower part of iteration number (i.e. using bit mask)&lt;/li&gt;
&lt;li&gt;Count the trailing zeros of this lower part (the zeros below its lowest &lt;code&gt;1&lt;/code&gt;) and get the delta: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;delta=2zerosdelta = 2^{zeros}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;lt&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;zeros&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Take parent neighborhood (starting point for computation): 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;table[iteration−delta]table[iteration - delta]&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mopen"&gt;[&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;er&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;lt&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mclose"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Compute neighborhood using first element&lt;/li&gt;
&lt;li&gt;Save computed neighborhood to table&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
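
&lt;p&gt;The table part of the steps above can be sketched as follows. This is an illustrative Python version, not the extension's C code: the names &lt;code&gt;lowest_element&lt;/code&gt;, &lt;code&gt;build_table&lt;/code&gt; and &lt;code&gt;adjacency&lt;/code&gt; are hypothetical, and the neighborhood is simplified to a plain union of per-node adjacency bitmasks (no forbidden set, no removal of the subset itself):&lt;/p&gt;

```python
def lowest_element(subset):
    """Index of the lowest set bit of a non-zero subset (its trailing zeros)."""
    index = 0
    while subset % 2 == 0:
        subset //= 2
        index += 1
    return index


def build_table(adjacency, table_bits, base_neighborhood=0):
    """Fill the table for every subset of the lowest `table_bits` nodes.

    adjacency[i] is a bitmask with the neighbors of node i (assumed input).
    Entry 0 holds the cached base neighborhood; every other entry is its
    parent entry extended with the neighbors of exactly one node.
    """
    table = [base_neighborhood] * (2 ** table_bits)
    processed = 0
    for subset in range(1, 2 ** table_bits):
        zeros = lowest_element(subset)
        delta = 2 ** zeros
        parent = table[subset - delta]
        table[subset] = parent | adjacency[zeros]
        processed += 1
    return table, processed


# Chain graph 0 - 1 - 2 - 3, one adjacency bitmask per node.
adjacency = [0b0010, 0b0101, 0b1010, 0b0100]
table, processed = build_table(adjacency, table_bits=4)
print(processed)                     # 15 nodes for 15 non-empty subsets
print(format(table[0b0011], "04b"))  # 0111
```

&lt;p&gt;Each non-empty subset costs exactly one processed node, because its parent entry is always filled earlier in the same pass.&lt;/p&gt;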

&lt;p&gt;A practical question arises: how much memory should we allocate for the table? An 8-byte number is used to store a set (&lt;code&gt;bitmapword == uint64&lt;/code&gt;), so the size of each entry is fixed. This means that in my case it takes 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;8∗2i=2i+38* 2^i = 2^{i+3}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;8&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 bytes to store a table with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ii&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 elements. Now we can make rough estimates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table size&lt;/th&gt;
&lt;th&gt;Memory consumption&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;32 B&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;64 B&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;128 B&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;256 B&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;512 B&lt;/td&gt;
&lt;td&gt;192&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;127&lt;/td&gt;
&lt;td&gt;321&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2 KB&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;255&lt;/td&gt;
&lt;td&gt;769&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;td&gt;2304&lt;/td&gt;
&lt;td&gt;511&lt;/td&gt;
&lt;td&gt;1793&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8 KB&lt;/td&gt;
&lt;td&gt;5120&lt;/td&gt;
&lt;td&gt;1023&lt;/td&gt;
&lt;td&gt;4097&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;16 KB&lt;/td&gt;
&lt;td&gt;11264&lt;/td&gt;
&lt;td&gt;2047&lt;/td&gt;
&lt;td&gt;9217&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;32 KB&lt;/td&gt;
&lt;td&gt;24576&lt;/td&gt;
&lt;td&gt;4095&lt;/td&gt;
&lt;td&gt;20481&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Legend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table size — size of suffix we are caching.&lt;/li&gt;
&lt;li&gt;Memory consumption — amount of memory required to allocate table (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2table size+32^{table\ size + 3}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mspace mtight"&gt;&lt;span class="mtight"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/li&gt;
&lt;li&gt;Total — total number of all elements across all subsets (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑k=1table sizekCtable sizek\sum_{k = 1}^{table\ size}k C_{table\ size}^{k}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mspace mtight"&gt;&lt;span class="mtight"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mspace mtight"&gt;&lt;span class="mtight"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/li&gt;
&lt;li&gt;Optimized — number of elements we have to process in table version, the same as number of iterations but without first (empty) set (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2table size−12^{table\ size} - 1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mspace mtight"&gt;&lt;span class="mtight"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/li&gt;
&lt;li&gt;Saved — difference between "Total" and "Optimized".&lt;/li&gt;
&lt;/ul&gt;
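
&lt;p&gt;The rows of the table follow directly from the formulas in the legend. Here is a minimal Python sketch that recomputes them; the function name &lt;code&gt;estimate&lt;/code&gt; is hypothetical, and the 8-byte entry size is the &lt;code&gt;uint64&lt;/code&gt; assumption from the text:&lt;/p&gt;

```python
from math import comb


def estimate(table_size):
    """Return (memory in bytes, total, optimized, saved) for one table size."""
    memory_bytes = 8 * 2 ** table_size  # one 8-byte bitmapword per entry
    total = sum(k * comb(table_size, k) for k in range(1, table_size + 1))
    optimized = 2 ** table_size - 1     # one node per non-empty subset
    return memory_bytes, total, optimized, total - optimized


for size in range(2, 13):
    print(size, *estimate(size))
```

&lt;p&gt;For a table size of 10 this gives 8192 bytes, 5120 total elements, 1023 optimized and 4097 saved, matching the corresponding row.&lt;/p&gt;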

&lt;p&gt;The table size can be calculated dynamically, or you can simply pick a good value for your workload heuristically (a constant set in the configuration). To keep the following reasoning simple, I will choose a table size of 10 elements.&lt;/p&gt;

&lt;p&gt;I was wondering: what kind of gain does increasing the table give? For example, how many iterations do we save if we use a table with &lt;code&gt;tbl + 1&lt;/code&gt; elements instead of &lt;code&gt;tbl&lt;/code&gt;? To begin with, here is the formula for the total number of iterations. Let the number of neighbors (i.e., the size of the set) be &lt;code&gt;max&lt;/code&gt; (in the current implementation it is &lt;code&gt;64&lt;/code&gt;, but let's generalize), and let the size of the table be &lt;code&gt;tbl&lt;/code&gt;. Then the number of iterations required to process all subsets of neighbors is:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑k=0max−tblCmax−tblk(2tbl−1+k)
\sum^{k = 0}{max - tbl}C{max - tbl}^{k}(2^{tbl} - 1 + k)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;That is, for each new &lt;code&gt;base&lt;/code&gt; there are 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations to calculate its neighbors, and then another 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2tbl2^{tbl}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations for the table. We do this for each subset (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑max−tblk\sum^{k}_{max - tbl}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
). Now we can calculate the difference between the number of iterations for the current table size and for the table size increased by 1.&lt;/p&gt;
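
&lt;p&gt;As a quick sanity check, the sum above can be evaluated numerically. The sketch below is only an illustration: the function name &lt;code&gt;iterations&lt;/code&gt; and the parameters &lt;code&gt;max_size&lt;/code&gt; and &lt;code&gt;tbl&lt;/code&gt; are hypothetical names for the symbols in the formula, not part of any PostgreSQL code.&lt;/p&gt;

```python
from math import comb


def iterations(max_size, tbl):
    """Evaluate sum over k of C(max - tbl, k) * (2^tbl - 1 + k).

    max_size and tbl are assumed names for the formula's max and tbl.
    """
    n = max_size - tbl  # upper limit of the sum
    # comb(n, k) subsets of size k, each costing 2^tbl - 1 + k iterations
    return sum(comb(n, k) * (2 ** tbl - 1 + k) for k in range(n + 1))


# Compare the current table size with the one increased by 1
current = iterations(4, 2)
increased = iterations(4, 3)
print(current, increased, current - increased)
```

&lt;p&gt;With these example values (max of 4, table size 2) the sum gives 16 iterations, against 15 for table size 3, so the difference is 1.&lt;/p&gt;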


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑kmax−tblCmax−tblk(2tbl−1+k)−∑kmax−tbl−1Cmax−tbl−1k(2tbl+1−1+k)=∑kmax−tbl−1Cmax−tblk(2tbl−1+k)−∑kmax−tbl−1Cmax−tbl−1k(2tbl+1−1+k)+Cmax−tblmax−tbl(2tbl−1+max−tbl)=∑kmax−tbl−1(Cmax−tbl−1k+Cmax−tbl−1k−1)(2tbl−1+k)−∑kmax−tbl−1Cmax−tbl−1k(2tbl+1−1+k)+2tbl−1+max−tbl=∑kmax−tbl−1Cmax−tbl−1k(2tbl−1+k)+∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)−∑kmax−tbl−1Cmax−tbl−1k(2tbl+1−1+k)+2tbl−1+max−tbl=∑kmax−tbl−1Cmax−tbl−1k(2tbl−1+k−2tbl+1+1−k)+∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)+2tbl−1+max−tbl=−2tbl∑kmax−tbl−1Cmax−tbl−1k+∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)+2tbl−1+max−tbl=2tbl(∑kmax−tbl−1Cmax−tbl−1k−1−∑kmax−tbl−1Cmax−tbl−1k)+∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)+2tbl+max−tbl
\sum_{k}^{max - tbl}C_{max - tbl}^{k}(2^{tbl} - 1 + k) - \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl + 1} - 1 + k) = \newline
\sum_{k}^{max - tbl - 1}C_{max - tbl}^{k}(2^{tbl} - 1 + k) - \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl + 1} - 1 + k) + C_{max - tbl}^{max - tbl}(2^{tbl} - 1 + max - tbl) = \newline
\sum_{k}^{max - tbl - 1}(C_{max - tbl - 1}^{k} + C_{max - tbl - 1}^{k - 1})(2^{tbl} - 1 + k) - \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl + 1} - 1 + k) + 2^{tbl} - 1 + max - tbl = \newline
\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl} - 1 + k) + \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2^{tbl} - 1 + k) - \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl + 1} - 1 + k) + 2^{tbl} - 1 + max - tbl = \newline
\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}(2^{tbl} - 1 + k - 2^{tbl + 1} + 1 - k) + \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2^{tbl} - 1 + k) + 2^{tbl} - 1 + max - tbl = \newline
-2^{tbl}\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k} + \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2 ^{tbl} - 1 + k) + 2^{tbl} - 1 + max - tbl = \newline
2^{tbl}(\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1} - \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k}) + \sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2^{tbl} - 1 + k) + 2^{tbl} + max - tbl
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span 
class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord 
mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin 
mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord 
mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span 
class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span 
class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span 
class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span 
class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span 
class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal 
mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord 
mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 
mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin 
mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span 
class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord 
mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span 
class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal 
mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord 
mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span 
class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
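
&lt;p&gt;The simplification that follows rests on the difference of the two binomial sums collapsing to zero. As a quick sanity check, here is a minimal sketch that evaluates that difference numerically. It assumes that n stands for max - tbl - 1 and that k runs from 1 to n (the formulas here do not spell out the lower bound); the helper name telescoping_difference is purely illustrative.&lt;/p&gt;

```python
from math import comb


def telescoping_difference(n):
    """Sum over k = 1..n of C(n, k - 1) - C(n, k)."""
    # Assumption: k starts at 1, so the first term of the sum is C(n, 0) - C(n, 1).
    return sum(comb(n, k - 1) - comb(n, k) for k in range(1, n + 1))


# n is a stand-in for max - tbl - 1; the loop range is arbitrary.
for n in range(1, 11):
    # The second value is the closed form C(n, 0) - C(n, n) left after cancellation.
    print(n, telescoping_difference(n), comb(n, 0) - comb(n, n))
```

&lt;p&gt;Under these assumptions both printed values agree for every n, which is the cancellation used in the next step.&lt;/p&gt;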


&lt;p&gt;If we expand the two sums inside the parentheses in the first term, we get a telescoping sum of the following form: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;C0−C1+C1−C2+C2−...C^0 - C^1 + C^1 - C^2 + C^2 - ...&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;...&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Adjacent terms cancel in pairs, and we are left with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;C0−Cmax−tbl−1=0C^0 - C^{max - tbl - 1} = 0&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (both binomial coefficients equal 1), so the first term is zero. The final expression is then:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)+2tbl−1+max−tbl=∑kmax−tbl−1Cmax−tbl−1k−1(2tbl−1+k)=∑kmax−tblCmax−tbl−1k(2tbl+k)
\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2^{tbl} - 1 + k) + 2^{tbl} - 1 + max - tbl = \newline
\sum_{k}^{max - tbl - 1}C_{max - tbl - 1}^{k - 1}(2^{tbl} - 1 + k) = \newline
\sum_{k}^{max - tbl}C_{max - tbl - 1}^{k}(2^{tbl} + k)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal 
mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span 
class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Considering that 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;max&amp;gt;tblmax &amp;gt; tbl&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (the formula assumes that we increase 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;tbltbl&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 by 1, that is, this is the previous value, and it makes no sense to increase the size of the table beyond the size of the set), we see that this expression is non-negative, since each term of this sum is non-negative.&lt;/p&gt;
&lt;h3&gt;
  
  
  3-layered cache
&lt;/h3&gt;

&lt;p&gt;So, we came up with the idea of storing a table of cached neighborhoods. It's not bad, but it's a pity that only a fixed number of nodes fits in the cache. For complex queries, we will have to recalculate the base. Yes, we can make the table size configurable, but at some point the table becomes impractically large: for example, 15 elements already require a &lt;em&gt;256 KB&lt;/em&gt; table. In the worst case, after several such recursive calls (for example, of the same &lt;code&gt;EnumerateCsgRecursive&lt;/code&gt;), the memory allocated for the tables alone may exceed a megabyte. Is there any other way to optimize? Yes.&lt;/p&gt;

&lt;p&gt;Let's go back to the beginning. The whole idea of caching in MySQL was based on calculating the neighbors by the first node of the current set, but when the moment came to save the neighbors for further calculations, they &lt;em&gt;did not&lt;/em&gt; save the neighbors if this first &lt;code&gt;taboo bit&lt;/code&gt; was set. Why? Because this is the end of the line! Neighbors computed for such a set will never be used again during incremental iteration. The proof is not far away: look at the previous scheme and you'll see that the neighbors calculated in odd iterations are not used by anyone. That is, we can safely skip saving the neighbors of odd iterations. How much does this optimization save us? Exactly half: half of the sets are even, and the other half are odd. The code, of course, has to be modified slightly to account for the indexing of the new scheme: to get the previous index, you need to take not 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2zeros2^{zeros}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;zeros&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, but 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2zeros−12^{zeros - 1}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;zeros&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 elements back.&lt;/p&gt;

&lt;p&gt;Wait, but if we cut off the first bits, there will be another part with a similar pattern. What if we optimize it the same way, and how many times can we repeat this? I won't spend much time on these thoughts and will say right away that I was able to apply a similar prefix compression for four elements, while needing only two additional variables to calculate the current neighbors. Thus, we come to a 3-layered caching scheme: base, table, hot (initially I called them well-done, medium and rare). The third part, hot, can be called a compressed prefix.&lt;/p&gt;

&lt;p&gt;The algorithm now depends even more on the iteration number: depending on its value, we work with the base, table, or hot part.&lt;/p&gt;

&lt;p&gt;First, let's look at the order of processing the hot part. It consists of 4 elements, and the main thing is that the pattern repeats itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0   0000 &amp;lt;+  &amp;lt;+  &amp;lt;+    quad leader
       ^  |   |   |
1   0001  |   |   |
          |   |   |
2   0010 -+   |   |
       ^      |   |
3   0011      |   |
              |   |
4   0100 &amp;lt;+  -+   |
       ^  |       |
5   0101  |       |
          |       |
6   0110 -+       |
       ^          |
7   0111          |
                  |
8   1000 &amp;lt;+  &amp;lt;+  -+    quad leader
       ^  |   |
9   1001  |   |
          |   |
10  1010 -+   |
       ^      |
11  1011      |
              |
12  1100 &amp;lt;+  -+
       ^  |
13  1101  |
          |
14  1110 -+
       ^
15  1111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We divide these 16 elements (the total number of subsets of 4 elements) into two halves, each of which has its own leader, called the &lt;em&gt;quad leader&lt;/em&gt;. In fact, these are the cached neighbors for a set in which only the last bit is set. We calculate it twice: at iterations 0 and 8. Next, we also use the optimization for odd iterations: at each even iteration we save its neighbors and simply use the ready-made value at the next (odd) iteration (we don't need to save anything for odd ones).&lt;/p&gt;

&lt;p&gt;As a result, we are left with iterations 2, 4, 6, 10, 12 and 14. For 2, 6, 10 and 14, we just need to take the neighbors of the previous even iteration, which we save for the odd iterations anyway. But 4 and 12 require special processing: they need the quad leader as a starting point, so we keep it as well (this is the second required variable). As a result, we store only two calculated neighbor values: the previous even iteration and the quad leader.&lt;/p&gt;
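
&lt;p&gt;The claim that two saved values are enough for the hot part can be checked with a small sketch. It is only an illustration under assumptions: instead of real neighborhoods it tracks the iteration number at which each saved value was produced, the names &lt;code&gt;expected_parent&lt;/code&gt; and &lt;code&gt;two_variables_suffice&lt;/code&gt; are hypothetical, and treating iteration 8 as starting from the previous quad leader is read off the diagram above.&lt;/p&gt;

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the original code): for a hot-part
 * iteration number 0..15, the iteration whose neighbors serve as the
 * parent, i.e. the number with its lowest set bit cleared.
 */
static unsigned
expected_parent(unsigned iter)
{
    return iter & (iter - 1);
}

/*
 * Walk the hot iterations keeping only two saved values and return 1 if
 * the parent picked from them always matches expected_parent().
 */
static int
two_variables_suffice(void)
{
    unsigned quad_leader = 0;   /* iteration that produced the quad leader */
    unsigned prev_even = 0;     /* last even iteration that was saved */
    unsigned iter;

    for (iter = 1; iter < 16; iter++)
    {
        unsigned source;

        if (iter % 4 == 0)
            source = quad_leader;   /* 4, 8 and 12 start from the leader */
        else
            source = prev_even;     /* odd ones and 2, 6, 10, 14 */

        if (source != expected_parent(iter))
            return 0;

        if (iter % 8 == 0)
            quad_leader = iter;     /* iteration 8 becomes the new leader */
        if (iter % 2 == 0)
            prev_even = iter;       /* odd iterations save nothing */
    }
    return 1;
}
```

&lt;p&gt;In a real implementation the two variables would hold neighborhood bitmaps rather than iteration numbers; the bookkeeping of when each one is read and overwritten stays the same.&lt;/p&gt;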

&lt;p&gt;Moving on to the table part. The changes primarily affected the calculation of the table index. Now, for the parent neighborhood, you need to subtract &lt;code&gt;4&lt;/code&gt; (the size of the hot part) from the number of leading zeros: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2zeros−42^{zeros - 4}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;zeros&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Note that whenever a new table part begins, a new hot part begins as well, so each time after calculating the neighbors and saving the value to the table, we should reset the state of the hot part: assign the current neighborhood value as the new quad leader.&lt;/p&gt;

&lt;p&gt;The base part is almost unchanged. First, as follows from the previous steps, when starting a new base part we need to reset the state of the table part (the calculation of the table begins anew), as well as the hot part (set the quad leader to the currently calculated neighbors). But that's not all: why don't we use the same trick with odd iterations here? Indeed, we can: if the first bit of the current iteration number of the base part is 1, we can simply recompute the neighbors, just as we did before.&lt;/p&gt;

&lt;p&gt;The last question is how to decide which action to perform. The prefix of the iteration number tells us. It is easy to see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2tablesize2^{table size}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ab&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;es&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ze&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations, the base part should be updated (which in terms of binary numbers means that the first 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ii&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 bits are 0);&lt;/li&gt;
&lt;li&gt;every 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;24=162^4 = 16&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;16&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations, new entries should be made in the table;&lt;/li&gt;
&lt;li&gt;every 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;88&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;8&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 iterations, the quad leader should be updated in the hot part.&lt;/li&gt;
&lt;/ul&gt;
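
&lt;p&gt;The three rules above can be sketched as a small dispatch function. This is a minimal sketch, not the code of the contrib: the enum &lt;code&gt;CacheAction&lt;/code&gt;, the function &lt;code&gt;choose_action&lt;/code&gt; and the parameter &lt;code&gt;cached_bits&lt;/code&gt; are hypothetical names, and it assumes that the table and hot parts together cover the low &lt;code&gt;cached_bits&lt;/code&gt; bits of the iteration number.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Size of the hot part in bits, as in the text */
#define HOT_BITS 4

/* Hypothetical action names, introduced only for this sketch */
typedef enum CacheAction
{
    ACTION_BASE,        /* recompute the base, reset table and hot parts */
    ACTION_TABLE,       /* write a new table entry, reset the hot part */
    ACTION_QUAD_LEADER, /* update the quad leader */
    ACTION_HOT          /* ordinary iteration inside the hot part */
} CacheAction;

/*
 * Pick the action for an iteration number. cached_bits is the number of
 * low bits covered by the table and hot parts together (assumed to be at
 * least HOT_BITS and below 64).
 */
static CacheAction
choose_action(uint64_t iteration, int cached_bits)
{
    uint64_t cached_mask = (UINT64_C(1) << cached_bits) - 1;
    uint64_t hot_mask = (UINT64_C(1) << HOT_BITS) - 1;

    /* All cached bits are zero: a new base part starts */
    if ((iteration & cached_mask) == 0)
        return ACTION_BASE;
    /* Low 4 bits are zero: every 16 iterations */
    if ((iteration & hot_mask) == 0)
        return ACTION_TABLE;
    /* Low 3 bits are zero: every 8 iterations */
    if ((iteration & (hot_mask >> 1)) == 0)
        return ACTION_QUAD_LEADER;
    return ACTION_HOT;
}
```

&lt;p&gt;The checks go from the rarest event to the most frequent one, because an iteration number that starts a new base part also satisfies the conditions for a table entry and a quad leader.&lt;/p&gt;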

&lt;p&gt;The algorithm has become much more complicated, but that is done for the sake of optimization. Now for the same set size, not 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2i2^{i}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 table entries are required, but only 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2i−42^{i - 4}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. For example, with 8-byte entries, only 0.5 KB is now required instead of 8 KB, while all 10 elements are still cached.&lt;/p&gt;

&lt;p&gt;And when this idea settled in my head and I started writing code with all my might, it suddenly dawned on me...&lt;/p&gt;
&lt;h3&gt;
  
  
  Perfect cache
&lt;/h3&gt;

&lt;p&gt;I looked at the graph again, drew it for five elements, and noticed something that had been in plain sight all along. Here is this diagram, with the number of accesses to each element on the right (omitted for odd iterations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000 &amp;lt;+  &amp;lt;+  &amp;lt;+  &amp;lt;+    5
    ^  |   |   |   |  
00001  |   |   |   |
       |   |   |   |
00010 -+   |   |   |    1
    ^      |   |   |
00011      |   |   |
           |   |   |
00100 &amp;lt;+  -+   |   |    2
    ^  |       |   |
00101  |       |   |
       |       |   |
00110 -+       |   |    1
    ^          |   |
00111          |   |
               |   |
01000 &amp;lt;+  &amp;lt;+  -+   |    3
    ^  |   |       |
01001  |   |       |
       |   |       |
01010 -+   |       |    1
    ^      |       |
01011      |       |
           |       |
01100 &amp;lt;+  -+       |    2
    ^  |           |
01101  |           |
       |           |
01110 -+           |    1
    ^              |
01111              |
                   |
10000 &amp;lt;+  &amp;lt;+  &amp;lt;+  -+    4  
    ^  |   |   |
10001  |   |   |
       |   |   |
10010 -+   |   |        1
    ^      |   |
10011      |   |
           |   |
10100 &amp;lt;+  -+   |        2
    ^  |       |
10101  |       |
       |       |
10110 -+       |        1
    ^          |
10111          |
               |
11000 &amp;lt;+  &amp;lt;+  -+        3
    ^  |   |
11001  |   |
       |   |
11010 -+   |            1
    ^      |
11011      |
           |
11100 &amp;lt;+  -+            2
    ^  |
11101  |
       |
11110 -+                1
    ^
11111
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do you see the pattern? If not yet, here's a hint: look at the relationship between the prefix of the iteration number and the number of hits. &lt;em&gt;The number of accesses to an element is equal to the number of zeros at the beginning (that is, the low-order zeros before the first set bit), and after this number of accesses, the value will not be used at all&lt;/em&gt;. It will be replaced by a new one whose iteration number has the same number of zeros at the beginning. That is, we &lt;em&gt;don't need&lt;/em&gt; to keep these values, and since we only iterate forward (incrementally), we can safely overwrite old elements. With this approach, the maximum size of the table is equal to the maximum number of nodes. Since I represent a set as a number, I don't even need to think about it: I allocate a 64-element array on the stack (64 * 8 = 512 bytes).&lt;/p&gt;

&lt;p&gt;But how do we use it correctly? Let's remember how we compute the neighbors: we take the parent neighborhood and calculate the current one using the first element. And here's the second observation: the parent set is our set without its lowest set bit; for example, for &lt;code&gt;01110&lt;/code&gt; the parent is &lt;code&gt;01100&lt;/code&gt;. And in which cell is the parent neighborhood for &lt;code&gt;01100&lt;/code&gt; stored? That's right, under index &lt;code&gt;2&lt;/code&gt;, because it has 2 zeros at the beginning. The corner case is when we move to a new digit: after deleting the bit we get an empty set, but for an empty set (of nodes) we return an empty set (neighborhood), because it cannot have neighbors.&lt;/p&gt;

&lt;p&gt;And now the algorithm itself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the DP table, a (dynamic) array.&lt;/li&gt;
&lt;li&gt;Start the next iteration.&lt;/li&gt;
&lt;li&gt;Remove the first bit (element) from the iteration number and count the number of zeros in the result.&lt;/li&gt;
&lt;li&gt;Take the element with this index from the DP table.&lt;/li&gt;
&lt;li&gt;Calculate the neighborhood from the cached one, using the first element of the current subset as the delta.&lt;/li&gt;
&lt;li&gt;Save the computed neighborhood to the DP table under the index equal to the number of zeros of the current iteration number.&lt;/li&gt;
&lt;/ol&gt;
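
&lt;p&gt;The six steps above can be sketched on a toy graph and compared with a brute-force computation. This is a simplified illustration under assumptions, not the code of the contrib: the subset is taken to be the iteration number itself, only simple edges are modeled (no excluded set, no complex edges), the names &lt;code&gt;neighbors_brute_force&lt;/code&gt; and &lt;code&gt;perfect_cache_matches&lt;/code&gt; are hypothetical, and zeros are counted with the GCC/Clang builtin &lt;code&gt;__builtin_ctzll&lt;/code&gt;.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64

/*
 * Toy stand-in for the real neighborhood computation: adjacency[v] is a
 * bitmask of the nodes adjacent to v. Excluded sets and complex edges are
 * left out of this sketch.
 */
static uint64_t
neighbors_brute_force(const uint64_t *adjacency, uint64_t subset)
{
    uint64_t result = 0;

    while (subset != 0)
    {
        result |= adjacency[__builtin_ctzll(subset)];
        subset &= subset - 1;   /* drop the lowest set bit */
    }
    return result;
}

/*
 * Run the cached scheme over iterations 1 .. 2^node_count - 1 and return 1
 * if every cached result equals the brute-force one. node_count must be
 * below 64 here.
 */
static int
perfect_cache_matches(const uint64_t *adjacency, int node_count)
{
    uint64_t cache[MAX_NODES] = {0};    /* step 1: the DP table */
    uint64_t last = (UINT64_C(1) << node_count) - 1;
    uint64_t iter;

    for (iter = 1; iter <= last; iter++)    /* step 2 */
    {
        int first = __builtin_ctzll(iter);      /* first element */
        uint64_t parent = iter & (iter - 1);    /* step 3 */
        uint64_t neighbors;

        /* step 4: an empty parent has an empty neighborhood */
        neighbors = (parent == 0) ? 0 : cache[__builtin_ctzll(parent)];

        /* step 5: the delta is the first element of the subset */
        neighbors |= adjacency[first];

        /* step 6: the slot is the zero count of the current number */
        cache[first] = neighbors;

        if (neighbors != neighbors_brute_force(adjacency, iter))
            return 0;
    }
    return 1;
}
```

&lt;p&gt;For example, for the path graph 0-1-2-3-4 the adjacency masks are &lt;code&gt;{0x02, 0x05, 0x0A, 0x14, 0x08}&lt;/code&gt;, and the function can be called as &lt;code&gt;perfect_cache_matches(path, 5)&lt;/code&gt;.&lt;/p&gt;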

&lt;p&gt;Is this correct? Yes. Here is a proof by contradiction. It follows from a property of the incremental subset iteration algorithm.&lt;/p&gt;

&lt;p&gt;Algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Given current iteration &lt;code&gt;iter&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Remove last bit and get number &lt;code&gt;iter_parent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Calculate number of zeros and get parent neighborhood from table &lt;code&gt;nb_parent&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The claim is that &lt;code&gt;nb_parent&lt;/code&gt; is the neighborhood of the set &lt;code&gt;iter_parent&lt;/code&gt;. Let's assume that it is not, that is, the cell holds the neighbors of some &lt;code&gt;iter_parent_2&lt;/code&gt; that is not equal to &lt;code&gt;iter_parent&lt;/code&gt;. Then this number is either greater or smaller than expected, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iter_parent_2&lt;/code&gt; cannot be greater than &lt;code&gt;iter_parent&lt;/code&gt;, since then it would also be greater than the original &lt;code&gt;iter&lt;/code&gt; (the differences are in the MSB part), but we iterate in strictly increasing order, which means that &lt;em&gt;we have not yet encountered this number&lt;/em&gt;, and its value is not in the table.&lt;/li&gt;
&lt;li&gt;it cannot be smaller either, because that would mean that we skipped &lt;code&gt;iter_parent&lt;/code&gt;: both numbers have the same number of leading zeros (by our assumption), and since &lt;code&gt;iter_parent&lt;/code&gt; comes &lt;em&gt;after&lt;/em&gt; &lt;code&gt;iter_parent_2&lt;/code&gt; (it is numerically larger), we should have overwritten the value under the corresponding index (the neighbors of &lt;code&gt;iter_parent_2&lt;/code&gt;), but we did not do this and thus violated the algorithm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only remaining option is that the neighborhood of the &lt;code&gt;iter_parent&lt;/code&gt; is stored under this index. Given that the neighbors are defined for an empty set, we can also say that we can't get garbage either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if the current set consists of a single element, we have moved to a new higher-order digit that we have not visited before. By removing this element we get an empty set, for which the neighbors are defined; the computed neighbors are then saved under this new, previously unseen index, so its value becomes defined;&lt;/li&gt;
&lt;li&gt;otherwise, the number of leading zeros does &lt;em&gt;not exceed&lt;/em&gt; the position of the highest digit (otherwise we would have started a new digit, which is case 1), and we must have already written the neighbors for that index into the table, so the value is defined.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we have proved that this caching scheme is correct. Here is the code that computes all of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;SubsetIteratorState&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Current subset */&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* State variables for subset iteration */&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* Current iteration number */&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* Neighborhood cache */&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;cached_neighborhood&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;SubsetIteratorState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cm"&gt;/* Get parent neighborhood */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt;
&lt;span class="nf"&gt;get_parent_neighborhood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DPHypContext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SubsetIteratorState&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;zero_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;last_bit_removed&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/* Remove first bit/element */&lt;/span&gt;
    &lt;span class="n"&gt;last_bit_removed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bmw_lowest_bit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_bit_removed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* There are no neighbors */&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;zero_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_rightmost_one_pos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_bit_removed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;cached_neighborhood&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;zero_count&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* Calculate neighborhood for current iteration */&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt;
&lt;span class="nf"&gt;get_neighbors_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DPHypContext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;subgroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SubsetIteratorState&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;zero_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;EdgeArray&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;complex_edges&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;subgroup&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_rightmost_one_pos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Computation basis - parent neighborhood */&lt;/span&gt;
    &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_parent_neighborhood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Add simple neighborhood */&lt;/span&gt;
    &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;bmw_difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Process complex edges */&lt;/span&gt;
    &lt;span class="n"&gt;complex_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;complex_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_start_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;complex_edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;complex_edges&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;HyperEdge&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;complex_edges&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;bmw_is_subset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subgroup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bmw_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;bmw_lowest_bit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Save current neighborhood to table */&lt;/span&gt;
    &lt;span class="n"&gt;zero_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_rightmost_one_pos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;iter_state&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;cached_neighborhood&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;zero_count&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The scheme can also be slightly optimized: there is no need to save the neighborhood on odd iterations. But this is a micro-optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it possible to add more optimizations? Yes. An example of this is already in the code above — indexing.&lt;/p&gt;

&lt;p&gt;
  Not exactly perfect cache
  &lt;p&gt;Yes, this caching strategy is good, but not perfect. Recall the MySQL caching approach: they use the cache even in the first &lt;code&gt;EnumerateCsgRec&lt;/code&gt; loop, where the excluded set varies. MySQL handles this with a heuristic: it caches excluded nodes and then adds all previously prohibited ones which &lt;em&gt;could be&lt;/em&gt; neighbors.&lt;/p&gt;

&lt;p&gt;Here is an example when this is &lt;em&gt;not&lt;/em&gt; the case. There are 3 calls to &lt;code&gt;EnumerateCsgRec&lt;/code&gt; (&lt;code&gt;S&lt;/code&gt; is a subgraph, &lt;code&gt;X&lt;/code&gt; is a set of excluded nodes, &lt;code&gt;N&lt;/code&gt; is the neighbors that are currently being iterated over, i.e. &lt;code&gt;N(S, X)&lt;/code&gt; has been calculated):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. S = 00000001, X = 00000001, N = 00001110
2. S = 00001001, X = 00001111, N = 01110000
3. S = 01001001, X = 01111111, N = 10000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first two calls yielded neighborhoods of three adjacent elements each, and now we are in the third call: subgraph = &lt;code&gt;1001001&lt;/code&gt;, excluded = &lt;code&gt;1111111&lt;/code&gt;, neighbors = &lt;code&gt;10000000&lt;/code&gt;. In the first (and only) iteration of the loop, we found that &lt;code&gt;11001001&lt;/code&gt; has a plan, so we need to find neighbors for this set. Following the logic of MySQL, the neighbors for it are &lt;code&gt;10110110&lt;/code&gt;, since you need to add all the previous neighbors.&lt;/p&gt;

&lt;p&gt;Where could there be a problem here? Look at the second call: where did the neighbors &lt;code&gt;1110000&lt;/code&gt; come from? We got them from &lt;code&gt;0000100&lt;/code&gt;, but it &lt;em&gt;was not in our &lt;strong&gt;final&lt;/strong&gt; subgraph&lt;/em&gt;, which means that one of the hyperedges could contain &lt;code&gt;01000000&lt;/code&gt;, which should be excluded (contained in the final subgraph).&lt;/p&gt;

&lt;p&gt;It is not difficult to guess that after this, the number of nodes (and therefore possible csg/cmp pairs) that should be considered will increase, which means that the number of "junk" solutions that either will not give a useful answer or will be considered several times will also increase. There is certainly a connection between them, since the past neighbors are obtained from current nodes.&lt;/p&gt;

&lt;p&gt;It is difficult to say whether this is bad or good for MySQL. But for PostgreSQL, at least at this stage of the extension's life, we can say that it is bad. The reason is the choice of the plan creation function, &lt;code&gt;make_join_rel&lt;/code&gt;. Calling this function is very expensive and takes up almost all of the planning time (you can see this if you build a flame graph of the planner). If we used the MySQL approach here, quite a lot of unnecessary CSG/CMP pairs would be produced, and we would have to create a plan for each of them. Most likely this time and these resources would be wasted, because even if there are no predicates between the relations, we can still create a &lt;code&gt;CROSS JOIN&lt;/code&gt;. In short, in the current cost model it is much cheaper to spend an extra call to the neighbor-finding function (&lt;code&gt;get_neighbors&lt;/code&gt;) than to create a plan (&lt;code&gt;make_join_rel&lt;/code&gt;). But that's not all.&lt;/p&gt;

&lt;p&gt;Note the fundamental difference between the caching approaches. MySQL can use its cache whenever it is needed, since it only has to check a subset. It may not be optimal, but it works even in the case of &lt;em&gt;conditional&lt;/em&gt; execution. The "perfect" cache is another matter: it has to be maintained constantly, that is, we are required to compute the neighbors at every iteration to keep the DP-table intact so that further calculations give the correct result.&lt;/p&gt;

&lt;p&gt;It's hard to say which is better:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In some scenarios, it makes sense to use a "perfect" cache (and adapt caching from MySQL):

&lt;ul&gt;
&lt;li&gt;we already have a plan in place for most of the subsets;&lt;/li&gt;
&lt;li&gt;no further recursive calls are expected (and there will be no combinatorial explosion)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In other scenarios, there are no such plans at all, so there is no need to compute the neighbors up front; they can be calculated only conditionally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even so, taking the first approach, we should evaluate the consequences, since adding a few more nodes to the neighborhood will increase further costs for other iterations. This can almost completely negate the gains you're making right now: don't forget that each additional node increases the number of subset iterations &lt;em&gt;by 2 times&lt;/em&gt;, and this additional node will also add new nodes to future neighbors! From the example above, we added two nodes to the neighbors, which means that there will be four times as many iterations, and for each one we will need to find more neighbors, call other functions, and so on. And how many such "indirect" neighbors will accumulate for an even greater number of recursive calls is hard to imagine.&lt;/p&gt;



&lt;br&gt;
&lt;/p&gt;
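&lt;p&gt;The doubling is easy to check in isolation. Below is a minimal sketch in C, not code from the extension: &lt;code&gt;count_nonempty_subsets&lt;/code&gt; is a hypothetical helper that walks all non-empty subsets of a neighborhood bitmask the way a subset iterator would, so its result shows how many loop iterations a neighborhood of a given size costs.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the extension): count the non-empty
 * subsets of `mask` by visiting each of them exactly once.
 */
static uint64_t
count_nonempty_subsets(uint64_t mask)
{
    uint64_t count = 0;

    /*
     * (sub - 1) & mask steps to the next smaller subset of mask.
     * The walk starts at the full set and stops when it reaches 0.
     */
    for (uint64_t sub = mask; sub != 0; sub = (sub - 1) & mask)
        count++;

    return count;
}
```

&lt;p&gt;A neighborhood of three nodes (&lt;code&gt;00000111&lt;/code&gt;) costs 7 iterations, while the same neighborhood with two more nodes (&lt;code&gt;00011111&lt;/code&gt;) costs 31, which is roughly four times as many, and every one of those iterations may recurse further.&lt;/p&gt;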

&lt;h2&gt;
  
  
  Hyperedge indexing
&lt;/h2&gt;

&lt;p&gt;Earlier I mentioned that complex hyperedges are sorted to get rid of duplicates. But there is another benefit. If we look at the places in DPhyp where we need to traverse hyperedges (computing the neighborhood and determining the connection (edges) between two hypernodes), we can see a similar pattern: there are moving parts, and there are fixed ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when searching for neighbors, the set of excluded nodes remains unchanged (or only increases).&lt;/li&gt;
&lt;li&gt;when determining the connection between hypernodes, both hypernodes in hyperedge are fixed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's try to use this knowledge, starting with the excluded nodes. Analysis of DPhyp makes it clear that the set of excluded nodes always has the same structure: a run of leading ones (counting from the least significant bit), followed by sparse elements. For example, &lt;code&gt;010110011111&lt;/code&gt; has 5 leading ones.&lt;/p&gt;

&lt;p&gt;This is no coincidence, the algorithm works this way: we should not look at nodes that have not yet been processed, so each time we go through the edges, we check that no part of an edge intersects with these excluded nodes (this constant part is determined in &lt;code&gt;solve&lt;/code&gt;). During iteration the set of excluded nodes can only grow, never shrink, which means that we know in advance which hyperedges will definitely not satisfy the condition: those that intersect with these leading ones.&lt;/p&gt;

&lt;p&gt;What we can do is sort all the hyperedges by their right hypernode, using the number of leading zeros as the comparison key. Then, during iteration, we compute the length of the run of leading ones in the excluded set and start the loop at the first hyperedge whose right hypernode has at least that many leading zeros, i.e. whose lowest element lies above the last element of the run of ones.&lt;/p&gt;

&lt;p&gt;I'll show you an example. We have several hyperedges sorted by the number of leading zeros. I wrote indexes on the right for convenience (the left part of the hyperedge is not important, only the right one, so &lt;code&gt;xxxxx&lt;/code&gt; is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    xxxxx - 00101    0
    xxxxx - 00111    1
    xxxxx - 00100    2
    xxxxx - 00100    3
    xxxxx - 01100    4
    xxxxx - 10100    5
    xxxxx - 11100    6
    xxxxx - 01000    7
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it's easy to figure out which index we should start iterating from, depending on how many excluded nodes we have in front of us:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;excluded&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;start index&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;11101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;10011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can only be sure of the length of the run of leading ones; the rest of the set is a black box for us. Because of this, in the third row we start the iteration from &lt;code&gt;2&lt;/code&gt;, although it is clear that everything else is excluded as well.&lt;/li&gt;
&lt;li&gt;To keep the code general when there are no suitable hyperedges, we can return the length of the array (a large number).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To understand the benefit of this optimization, remember that all edges in the graph are bidirectional, that is, for each hyperedge we create two entries with swapped left and right sides. It may turn out that a very large number of tables refer to the same node (for example, a fact table with 100+ dimensions), and that this table has the largest index, so &lt;code&gt;excluded&lt;/code&gt; will include almost all tables. Without the index, we would have to go through all the complex edges in vain, knowing in advance that this cannot give any result.&lt;/p&gt;

&lt;p&gt;Well, how are we going to get the start index? The first thing that comes to mind when someone says "sorted" is binary search: just search using the given hypernode (its set) as the key. But don't rush. Recall that our space of "keys" (the number of leading zeros/ones) is discrete. It starts from zero and only grows in steps of one (in my case, there is a limit of 64). What fits the description of such a structure? An array! At position &lt;code&gt;i&lt;/code&gt; this array stores the index of the first hyperedge whose right hypernode contains at least &lt;code&gt;i&lt;/code&gt; leading zeros.&lt;/p&gt;

&lt;p&gt;What about the gaps? In the example, &lt;code&gt;00111&lt;/code&gt; is followed by &lt;code&gt;00100&lt;/code&gt;, without any &lt;code&gt;00010&lt;/code&gt;. We fill such gaps with the start index of the next length that does occur: if no edge has exactly &lt;code&gt;i&lt;/code&gt; leading zeros, entry &lt;code&gt;i&lt;/code&gt; points to the first edge with more than &lt;code&gt;i&lt;/code&gt; leading zeros (in the example, an excluded set with one leading one starts at index &lt;code&gt;2&lt;/code&gt;, the same as for two leading ones). The base is index &lt;code&gt;0&lt;/code&gt;: if the excluded set has no leading ones, the value is &lt;code&gt;0&lt;/code&gt;, since in this case we must iterate over all hyperedges.&lt;/p&gt;

&lt;p&gt;The cost of using such an index is just the cost of counting the leading ones in the set of excluded nodes. This can be done with a simple bit trick: we add &lt;code&gt;1&lt;/code&gt; and get a run of &lt;code&gt;0&lt;/code&gt; of the same length followed by a single &lt;code&gt;1&lt;/code&gt;, and then compute the position of this "1". This position can be obtained with a special instruction that counts trailing zeros, for example &lt;code&gt;BSF&lt;/code&gt; or &lt;code&gt;TZCNT&lt;/code&gt; on x86 (PostgreSQL wraps it in &lt;code&gt;pg_rightmost_one_pos64&lt;/code&gt;). So the time complexity of an index lookup is &lt;code&gt;O(1)&lt;/code&gt;.&lt;/p&gt;
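&lt;p&gt;The bit trick can be checked against a plain loop. This is a minimal standalone sketch, not the extension's code: both function names are hypothetical, and &lt;code&gt;__builtin_ctzll&lt;/code&gt; is assumed to be available (it is a GCC/Clang builtin that returns the position of the lowest set bit).&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: count the run of ones that starts at bit 0. */
static int
leading_ones_loop(uint64_t set)
{
    int n = 0;

    while (set & 1)
    {
        n++;
        set >>= 1;
    }
    return n;
}

/*
 * Bit-trick version: adding 1 turns the low run of ones into zeros and
 * sets the bit right after it, so the position of the lowest set bit of
 * (set + 1) equals the length of the run.
 *
 * Assumption: `set` does not have all 64 bits set, because then
 * set + 1 wraps to 0 and __builtin_ctzll(0) is undefined.
 */
static int
leading_ones_fast(uint64_t set)
{
    return __builtin_ctzll(set + 1);
}
```

&lt;p&gt;For the set &lt;code&gt;010110011111&lt;/code&gt; mentioned earlier both versions return 5, and for the rows of the table above (&lt;code&gt;00100&lt;/code&gt;, &lt;code&gt;00001&lt;/code&gt;, &lt;code&gt;11101&lt;/code&gt;, &lt;code&gt;10011&lt;/code&gt;, &lt;code&gt;00111&lt;/code&gt;, &lt;code&gt;01111&lt;/code&gt;) they return 0, 1, 1, 2, 3 and 4.&lt;/p&gt;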

&lt;p&gt;For the edges from the example, we can build such an index (the index of the array of edges on the right):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    0: 0
    1: 2
    2: 2
    3: 7
    4: 8
    5: 8
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size of this index depends only on the maximum number of leading &lt;code&gt;0&lt;/code&gt; among the right hypernodes of all edges in the array. For example, you can see that I could create an array of only four elements, and return &lt;code&gt;8&lt;/code&gt; (the size of the edge array) for everything beyond it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Of course, there is a problem with sparse sets, e.g. when the right hypernode is &lt;code&gt;{1, 1000}&lt;/code&gt;. Such arrays waste too much memory. We can fix this by using sparse arrays, but I don't think that this is a very big problem for now.&lt;/p&gt;
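&lt;p&gt;One way such an index could be built is a single pass over the sorted edges. The sketch below is an illustration under assumptions, not the extension's implementation: &lt;code&gt;DemoEdges&lt;/code&gt;, &lt;code&gt;demo_build_index&lt;/code&gt; and &lt;code&gt;demo_start_index&lt;/code&gt; are hypothetical names, only the right sides of the edges are kept, and every right side is assumed to be non-empty. The data is the eight-edge example from above.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified edge array: right hypernodes only, already sorted
 * by their number of low zero bits. */
typedef struct
{
    const uint64_t *right;
    int         size;
    int         start_idx[64];      /* entry i: first edge with >= i low zeros */
    int         start_idx_size;
} DemoEdges;

/* Right sides from the example: 00101 00111 00100 00100 01100 10100 11100 01000 */
static const uint64_t demo_rights[] = {0x05, 0x07, 0x04, 0x04, 0x0C, 0x14, 0x1C, 0x08};

static void
demo_build_index(DemoEdges *edges)
{
    int         next = 0;           /* next entry of start_idx to fill */

    for (int i = 0; i < edges->size; i++)
    {
        /* __builtin_ctzll is a GCC/Clang builtin; right[i] must be non-zero */
        int         zeros = __builtin_ctzll(edges->right[i]);

        /* Edge i is the first edge with at least next..zeros low zeros;
         * this also fills gaps for lengths that never occur. */
        while (next <= zeros)
            edges->start_idx[next++] = i;
    }
    edges->start_idx_size = next;
}

static int
demo_start_index(const DemoEdges *edges, uint64_t excluded)
{
    int         ones = __builtin_ctzll(excluded + 1);

    if (edges->start_idx_size <= ones)
        return edges->size;         /* nothing can match: skip every edge */
    return edges->start_idx[ones];
}

/* Convenience wrapper: build the index for the example and look up one set. */
static int
demo_example_start_index(uint64_t excluded)
{
    DemoEdges   edges = {.right = demo_rights, .size = 8};

    demo_build_index(&edges);
    return demo_start_index(&edges, excluded);
}
```

&lt;p&gt;For the example edges this produces the four entries &lt;code&gt;0, 2, 2, 7&lt;/code&gt;, and any excluded set with four or more leading ones gets &lt;code&gt;8&lt;/code&gt;, which matches the table of start indexes above.&lt;/p&gt;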
&lt;/blockquote&gt;

&lt;p&gt;We have learned how to quickly cut off unnecessary edges when searching for neighbors. But we also use edges to determine the connectivity of two hypernodes. Is it possible to use this index here? Yes, we can. Two hypernodes are connected when there is a hyperedge whose left part is a subset of the left hypernode and whose right part is a subset of the right one. The hypernodes for which we check connectivity do not change during iteration, just like the excluded set. Now note that the right part of a hyperedge definitely cannot be a subset if it contains elements whose index is less than the smallest index in the right hypernode. Example: the right side &lt;code&gt;001010&lt;/code&gt; of a hyperedge can never be a subset of the hypernode &lt;code&gt;001100&lt;/code&gt;, since the edge contains the second element from the right, which lies below the lowest element of the hypernode. That is, the semantics here are practically the same as for the excluded set: we must skip all edges that have elements smaller than the smallest element of the right hypernode.&lt;/p&gt;

&lt;p&gt;The starting index is computed by the &lt;code&gt;get_start_index&lt;/code&gt; function. For convenience, it takes a set of excluded nodes as input rather than a ready-made (minimal) node index. The same function is used when determining the connectivity of two hypernodes: you just need to subtract "1" from the set representing the right hypernode before passing it as the argument. The run of "0" then turns back into a run of "1", which has the same semantics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="nf"&gt;get_start_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EdgeArray&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lowest_bit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;lowest_bit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_rightmost_one_pos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;start_idx_size&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;lowest_bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;start_idx&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lowest_bit&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
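&lt;p&gt;The "subtract 1" step for the connectivity check can be illustrated separately. This is a small sketch with hypothetical helper names, again assuming the GCC/Clang builtin &lt;code&gt;__builtin_ctzll&lt;/code&gt; and non-empty sets; it only demonstrates the arithmetic and is not taken from the extension.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length of the run of ones starting at bit 0 (set must not be all ones). */
static int
low_ones(uint64_t set)
{
    return __builtin_ctzll(set + 1);
}

/* Position of the lowest member of a non-empty set. */
static int
lowest_member(uint64_t set)
{
    return __builtin_ctzll(set);
}

/* true if every member of `sub` is also a member of `super`. */
static bool
is_subset(uint64_t sub, uint64_t super)
{
    return (sub & ~super) == 0;
}

/*
 * Key to pass instead of an excluded set when checking connectivity:
 * subtracting 1 turns the low zeros of the hypernode into ones, so the
 * run of low ones covers exactly the positions below its lowest member.
 */
static uint64_t
connectivity_key(uint64_t right_hypernode)
{
    return right_hypernode - 1;
}
```

&lt;p&gt;For the hypernode &lt;code&gt;001100&lt;/code&gt; the key is &lt;code&gt;001011&lt;/code&gt; with two low ones, the same number as the position of its lowest member. The edge side &lt;code&gt;001010&lt;/code&gt; has its lowest member at position 1, so it sorts before that start index and is skipped, which is safe because it is not a subset of &lt;code&gt;001100&lt;/code&gt; anyway.&lt;/p&gt;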



&lt;h2&gt;
  
  
  Query complexity
&lt;/h2&gt;

&lt;p&gt;So, PostgreSQL uses 2 algorithms: DPsize and GEQO, and the latter is used when the number of tables in the query reaches the value of the &lt;code&gt;geqo_threshold&lt;/code&gt; parameter. But why exactly the &lt;em&gt;number of tables&lt;/em&gt;? The fact is that the complexity of DPsize is determined by the number of tables, since it considers all possible combinations. With DPhyp everything is different: its complexity depends on the shape of the query graph. The original paper compares the performance of the algorithms on several types of queries (chains, cliques, etc.), but it does not give a direct answer on how to determine the complexity of a query (or maybe I overlooked it). This answer is given in another paper, &lt;a href="https://db.in.tum.de/~radke/papers/hugejoins.pdf" rel="noopener noreferrer"&gt;"Adaptive Optimization of Large Join Queries"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The authors of this paper propose a meta-algorithm that combines several different JOIN ordering algorithms and makes it possible to build query plans for several thousand tables. For simple queries, the authors suggest using DPhyp, but what is a simple query? If there are 100 tables in a query, this does not mean that DPhyp cannot handle it. If the query graph is a simple chain (all predicates have the form &lt;code&gt;Ti.x OP T(i + 1).x&lt;/code&gt;), then it's not difficult to find a plan for it, but if it's a clique (every table joins with every other), then even 15 tables is too much. For DPhyp, complexity should be measured not by the number of tables, but by the complexity of the query graph: &lt;em&gt;the number of connected subgraphs&lt;/em&gt;. A value of 10000 connected subgraphs is the limit of query efficiency, which corresponds to about 14 tables in a clique.&lt;/p&gt;
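&lt;p&gt;To get a feeling for these numbers, the two extreme shapes have simple closed forms: in a chain every contiguous run of tables is a connected subgraph, and in a clique every non-empty subset is. The sketch below uses hypothetical helper names and only evaluates these formulas against the 10000 budget; it is an illustration, not code from the paper or the extension.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Chain of n tables: n * (n + 1) / 2 contiguous runs. */
static uint64_t
chain_subgraphs(unsigned n)
{
    return (uint64_t) n * (n + 1) / 2;
}

/* Clique of n tables (n < 64): 2^n - 1 non-empty subsets. */
static uint64_t
clique_subgraphs(unsigned n)
{
    return (UINT64_C(1) << n) - 1;
}

/* Smallest number of tables for which the count exceeds the budget. */
static unsigned
first_over_budget(uint64_t (*count) (unsigned), uint64_t budget)
{
    unsigned    n = 1;

    while (count(n) <= budget)
        n++;
    return n;
}
```

&lt;p&gt;A chain of 100 tables has only 5050 connected subgraphs and stays under the budget until 141 tables, while a clique goes from 8191 subgraphs at 13 tables to 16383 at 14.&lt;/p&gt;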

&lt;p&gt;The same paper offers not only the idea, but also a function for calculating the number of connected subgraphs: &lt;code&gt;countCC&lt;/code&gt;. When I looked at it, I realized that it perfectly fits the caching scheme described above. Without further ado, here is the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;uint64&lt;/span&gt;
&lt;span class="nf"&gt;count_cc_recursive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DPHypContext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;subgraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;uint64&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uint64&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;base_neighborhood&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;SubsetIteratorState&lt;/span&gt; &lt;span class="n"&gt;subset_iter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;subset_iterator_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subset_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_neighborhood&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset_iterator_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subset_iter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;excluded_ext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;neighborhood&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;excluded_ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;base_neighborhood&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subgraph&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;subset_iter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;neighborhood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_neighbors_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded_ext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subset_iter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count_cc_recursive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded_ext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;neighborhood&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;uint64&lt;/span&gt;
&lt;span class="nf"&gt;count_cc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DPHypContext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uint64&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;uint64&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rels_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;rels_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;initial_rels&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;rels_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;bitmapword&lt;/span&gt; &lt;span class="n"&gt;neighborhood&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmw_make_b_v&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;neighborhood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_neighbors_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count_cc_recursive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bmw_make_singleton&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;neighborhood&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have kept the names of the functions the same as in the article, but adapted the signature to support efficient iteration across neighbors.&lt;/p&gt;

&lt;p&gt;Bonus: the number of connected subgraphs equals the size of the resulting DP table, so this value is now also used to preallocate the DP hash table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Let's first check everything on a simple query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t14&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t7&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t9&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t11&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t12&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t14&lt;/span&gt;&lt;span 
class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are 14 tables connected by a single hyperedge: &lt;code&gt;{t1, t2, t3, t4, t5, t6, t7} - {t8, t9, t10, t11, t12, t13, t14}&lt;/code&gt;. Each table has a single column with three values (so that the query does not run for too long). For DPsize, the result is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                  QUERY PLAN                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..215311.78 rows=1594323 width=56) (actual time=1.672..1330.944 rows=2083371 loops=1)
   Join Filter: (((((((t1.x + t2.x) + t3.x) + t4.x) + t5.x) + t6.x) + t7.x) &amp;gt; ((((((t8.x + t9.x) + t10.x) + t11.x) + t12.x) + t13.x) + t14.x))
   Rows Removed by Join Filter: 2699598
   -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28) (actual time=0.057..0.612 rows=2187 loops=1)
         -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16) (actual time=0.042..0.128 rows=81 loops=1)
               -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.029..0.061 rows=9 loops=1)
                     -&amp;gt;  Seq Scan on t4  (cost=0.00..1.03 rows=3 width=4) (actual time=0.018..0.028 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.003..0.008 rows=3 loops=3)
                           -&amp;gt;  Seq Scan on t5  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.012 rows=3 loops=1)
               -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8) (actual time=0.001..0.005 rows=9 loops=9)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.007..0.025 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t6  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.010 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t7  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.006 rows=3 loops=1)
         -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12) (actual time=0.001..0.002 rows=27 loops=81)
               -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12) (actual time=0.014..0.037 rows=27 loops=1)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.008..0.022 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t1  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.008 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t2  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.007 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.001 rows=3 loops=9)
                           -&amp;gt;  Seq Scan on t3  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.004 rows=3 loops=1)
   -&amp;gt;  Materialize  (cost=0.00..47.29 rows=2187 width=28) (actual time=0.000..0.078 rows=2187 loops=2187)
         -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28) (actual time=0.039..0.405 rows=2187 loops=1)
               -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16) (actual time=0.021..0.043 rows=81 loops=1)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.010..0.014 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t11  (cost=0.00..1.03 rows=3 width=4) (actual time=0.004..0.005 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.002..0.002 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t12  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.004 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8) (actual time=0.001..0.002 rows=9 loops=9)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.009..0.013 rows=9 loops=1)
                                 -&amp;gt;  Seq Scan on t13  (cost=0.00..1.03 rows=3 width=4) (actual time=0.004..0.005 rows=3 loops=1)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.002 rows=3 loops=3)
                                       -&amp;gt;  Seq Scan on t14  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.003 rows=3 loops=1)
               -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12) (actual time=0.000..0.001 rows=27 loops=81)
                     -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12) (actual time=0.016..0.026 rows=27 loops=1)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.010..0.014 rows=9 loops=1)
                                 -&amp;gt;  Seq Scan on t8  (cost=0.00..1.03 rows=3 width=4) (actual time=0.005..0.006 rows=3 loops=1)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.002 rows=3 loops=3)
                                       -&amp;gt;  Seq Scan on t9  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.003 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.001 rows=3 loops=9)
                                 -&amp;gt;  Seq Scan on t10  (cost=0.00..1.03 rows=3 width=4) (actual time=0.004..0.005 rows=3 loops=1)
 Planning Time: 3069.835 ms
 Execution Time: 1371.090 ms
(44 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Planning took 3 seconds, although executing the query took less than 1.5 seconds. What will DPhyp give us?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                  QUERY PLAN                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..215311.78 rows=1594323 width=56) (actual time=1.612..1325.670 rows=2083371 loops=1)
   Join Filter: (((((((t1.x + t2.x) + t3.x) + t4.x) + t5.x) + t6.x) + t7.x) &amp;gt; ((((((t8.x + t9.x) + t10.x) + t11.x) + t12.x) + t13.x) + t14.x))
   Rows Removed by Join Filter: 2699598
   -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28) (actual time=0.039..0.551 rows=2187 loops=1)
         -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16) (actual time=0.029..0.131 rows=81 loops=1)
               -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.021..0.051 rows=9 loops=1)
                     -&amp;gt;  Seq Scan on t4  (cost=0.00..1.03 rows=3 width=4) (actual time=0.015..0.023 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.006 rows=3 loops=3)
                           -&amp;gt;  Seq Scan on t5  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.012 rows=3 loops=1)
               -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8) (actual time=0.001..0.006 rows=9 loops=9)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.006..0.034 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t6  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.012 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.005 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t7  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.009 rows=3 loops=1)
         -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12) (actual time=0.000..0.002 rows=27 loops=81)
               -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12) (actual time=0.009..0.032 rows=27 loops=1)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.006..0.020 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t2  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.008 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.003 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t3  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.006 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.000..0.001 rows=3 loops=9)
                           -&amp;gt;  Seq Scan on t1  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.003 rows=3 loops=1)
   -&amp;gt;  Materialize  (cost=0.00..47.29 rows=2187 width=28) (actual time=0.000..0.078 rows=2187 loops=2187)
         -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28) (actual time=0.025..0.397 rows=2187 loops=1)
               -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16) (actual time=0.013..0.037 rows=81 loops=1)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.007..0.012 rows=9 loops=1)
                           -&amp;gt;  Seq Scan on t11  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.005 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.002 rows=3 loops=3)
                                 -&amp;gt;  Seq Scan on t12  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.003 rows=3 loops=1)
                     -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8) (actual time=0.001..0.002 rows=9 loops=9)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.006..0.009 rows=9 loops=1)
                                 -&amp;gt;  Seq Scan on t13  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.003 rows=3 loops=1)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.001 rows=3 loops=3)
                                       -&amp;gt;  Seq Scan on t14  (cost=0.00..1.03 rows=3 width=4) (actual time=0.002..0.003 rows=3 loops=1)
               -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12) (actual time=0.000..0.001 rows=27 loops=81)
                     -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12) (actual time=0.011..0.021 rows=27 loops=1)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8) (actual time=0.007..0.011 rows=9 loops=1)
                                 -&amp;gt;  Seq Scan on t9  (cost=0.00..1.03 rows=3 width=4) (actual time=0.004..0.004 rows=3 loops=1)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.001..0.001 rows=3 loops=3)
                                       -&amp;gt;  Seq Scan on t10  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.003 rows=3 loops=1)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4) (actual time=0.000..0.001 rows=3 loops=9)
                                 -&amp;gt;  Seq Scan on t8  (cost=0.00..1.03 rows=3 width=4) (actual time=0.003..0.004 rows=3 loops=1)
 Planning Time: 4.706 ms
 Execution Time: 1365.543 ms
(44 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Under 5 milliseconds&lt;/strong&gt;! At the same time, the plans are equivalent (the same shape and the same total cost), so planning became about &lt;strong&gt;650 times&lt;/strong&gt; faster (3069.8 ms vs. 4.7 ms)!&lt;/p&gt;

&lt;p&gt;But this is just a small query. Let's take something more serious: JOB (Join Order Benchmark), presented in &lt;a href="https://vldb.org/pvldb/vol9/p204-leis.pdf" rel="noopener noreferrer"&gt;"How Good Are Query Optimizers, Really?"&lt;/a&gt;. It was used to compare the planners of several databases, including PostgreSQL. The benchmark can be found in &lt;a href="https://github.com/gregrahn/join-order-benchmark" rel="noopener noreferrer"&gt;this repository&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the paper, the authors compared planner estimates, not planning time, so simple queries were used: INNER equi-joins (i.e., the JOIN conditions are equalities only). A wide variety of workloads, including all kinds of complex outer joins, should be used to evaluate real performance, but for now this is good enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How the testing was performed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the time was measured using a simple extension that measures the execution time of &lt;code&gt;join_search_hook&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;after loading the data (IMDB dataset), the database is slightly larger than 8GB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There were 113 queries in total, but only 33 distinct query classes: all the others are variations with different constants. Each query was run 10 times and its average time was calculated; then, to reduce noise, the results of queries from the same class (differing only in constants) were grouped and averaged again.&lt;/p&gt;
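&lt;p&gt;The two-level averaging described above can be sketched as follows. This is a minimal illustration, not the actual measurement code: the function names and the flat array layout are assumptions made for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;/* Arithmetic mean of n samples; returns 0.0 for an empty input. */
double
mean(const double *samples, int n)
{
    double sum = 0.0;
    int    i;

    if (n == 0)
        return 0.0;

    for (i = 0; i != n; i++)
        sum += samples[i];

    return sum / n;
}

/*
 * Value reported for one query class.
 *
 * Assumed (hypothetical) layout: runs[q * runs_per_query + r] holds
 * run number r of query variant q. Every variant is averaged over its
 * runs first, then the per-variant means are averaged again.
 */
double
class_mean(const double *runs, int variants, int runs_per_query)
{
    double sum = 0.0;
    int    q;

    if (variants == 0)
        return 0.0;

    for (q = 0; q != variants; q++)
        sum += mean(runs + q * runs_per_query, runs_per_query);

    return sum / variants;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, a class with two variants whose runs took &lt;code&gt;{1, 3}&lt;/code&gt; and &lt;code&gt;{5, 7}&lt;/code&gt; gets per-query means of 2 and 6, and a class value of 4.&lt;/p&gt;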

&lt;p&gt;Finally we got the following table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query class&lt;/th&gt;
&lt;th&gt;Time, DPsize&lt;/th&gt;
&lt;th&gt;Time, DPhyp&lt;/th&gt;
&lt;th&gt;Cost, DPsize&lt;/th&gt;
&lt;th&gt;Cost, DPhyp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;20063.63..20063.64&lt;/td&gt;
&lt;td&gt;20047.21..20047.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;3917.21..3917.22&lt;/td&gt;
&lt;td&gt;3865.78..3865.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;16893.51..16893.52&lt;/td&gt;
&lt;td&gt;16893.07..16893.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;16537.03..16537.04&lt;/td&gt;
&lt;td&gt;16532.75..16532.76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;55136.70..55136.71&lt;/td&gt;
&lt;td&gt;55110.84..55110.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;9136.23..9136.24&lt;/td&gt;
&lt;td&gt;8601.80..8601.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1.30&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;td&gt;26596.24..26596.25&lt;/td&gt;
&lt;td&gt;25281.84..25281.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;237500.71..237500.73&lt;/td&gt;
&lt;td&gt;215342.88..215342.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;1.75&lt;/td&gt;
&lt;td&gt;1.95&lt;/td&gt;
&lt;td&gt;121709.02..121709.03&lt;/td&gt;
&lt;td&gt;118041.88..118041.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0.23&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;218646.80..218646.81&lt;/td&gt;
&lt;td&gt;216146.76..216146.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;2.03&lt;/td&gt;
&lt;td&gt;4264.53..4264.54&lt;/td&gt;
&lt;td&gt;4263.81..4263.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1.47&lt;/td&gt;
&lt;td&gt;2.27&lt;/td&gt;
&lt;td&gt;18062.07..18062.08&lt;/td&gt;
&lt;td&gt;17923.86..17923.87&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;3.28&lt;/td&gt;
&lt;td&gt;3.73&lt;/td&gt;
&lt;td&gt;19880.57..19880.58&lt;/td&gt;
&lt;td&gt;19561.58..19561.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;1.20&lt;/td&gt;
&lt;td&gt;1.30&lt;/td&gt;
&lt;td&gt;6675.30..6675.31&lt;/td&gt;
&lt;td&gt;6675.12..6675.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;4.70&lt;/td&gt;
&lt;td&gt;6.03&lt;/td&gt;
&lt;td&gt;140612.22..140612.23&lt;/td&gt;
&lt;td&gt;105280.24..105280.26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2.03&lt;/td&gt;
&lt;td&gt;2.30&lt;/td&gt;
&lt;td&gt;4373.61..4373.62&lt;/td&gt;
&lt;td&gt;3928.40..3928.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;1.10&lt;/td&gt;
&lt;td&gt;4526.53..4526.54&lt;/td&gt;
&lt;td&gt;4073.57..4073.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;1.03&lt;/td&gt;
&lt;td&gt;36882.96..36882.97&lt;/td&gt;
&lt;td&gt;33151.67..33151.68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;8.35&lt;/td&gt;
&lt;td&gt;9.23&lt;/td&gt;
&lt;td&gt;141225.66..141225.67&lt;/td&gt;
&lt;td&gt;131527.60..131527.61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;4.60&lt;/td&gt;
&lt;td&gt;6.23&lt;/td&gt;
&lt;td&gt;12982.67..12982.68&lt;/td&gt;
&lt;td&gt;12976.69..12976.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;4.57&lt;/td&gt;
&lt;td&gt;5.90&lt;/td&gt;
&lt;td&gt;3833.12..3833.13&lt;/td&gt;
&lt;td&gt;3833.12..3833.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;17.55&lt;/td&gt;
&lt;td&gt;21.00&lt;/td&gt;
&lt;td&gt;7532.63..7532.64&lt;/td&gt;
&lt;td&gt;7532.28..7532.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;16.07&lt;/td&gt;
&lt;td&gt;20.53&lt;/td&gt;
&lt;td&gt;43981.68..43981.69&lt;/td&gt;
&lt;td&gt;42108.50..42108.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;47.90&lt;/td&gt;
&lt;td&gt;56.35&lt;/td&gt;
&lt;td&gt;6665.46..6665.47&lt;/td&gt;
&lt;td&gt;6580.84..6580.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;3.73&lt;/td&gt;
&lt;td&gt;5.30&lt;/td&gt;
&lt;td&gt;8502.14..8502.15&lt;/td&gt;
&lt;td&gt;8495.15..8495.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;29.83&lt;/td&gt;
&lt;td&gt;41.83&lt;/td&gt;
&lt;td&gt;9324.37..9324.38&lt;/td&gt;
&lt;td&gt;9237.43..9237.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;42.40&lt;/td&gt;
&lt;td&gt;69.67&lt;/td&gt;
&lt;td&gt;1053.48..1053.49&lt;/td&gt;
&lt;td&gt;1290.69..1290.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;137.20&lt;/td&gt;
&lt;td&gt;208.17&lt;/td&gt;
&lt;td&gt;7534.70..7534.71&lt;/td&gt;
&lt;td&gt;7534.67..7534.68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;949.77&lt;/td&gt;
&lt;td&gt;936.80&lt;/td&gt;
&lt;td&gt;4013.73..4013.74&lt;/td&gt;
&lt;td&gt;4013.73..4013.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;38.67&lt;/td&gt;
&lt;td&gt;61.60&lt;/td&gt;
&lt;td&gt;9323.59..9323.60&lt;/td&gt;
&lt;td&gt;9323.59..9323.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;20.83&lt;/td&gt;
&lt;td&gt;34.00&lt;/td&gt;
&lt;td&gt;9584.13..9584.14&lt;/td&gt;
&lt;td&gt;9575.61..9575.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;3880.96..3880.97&lt;/td&gt;
&lt;td&gt;3838.37..3838.38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;165.90&lt;/td&gt;
&lt;td&gt;216.83&lt;/td&gt;
&lt;td&gt;3029.91..3029.92&lt;/td&gt;
&lt;td&gt;2995.14..2995.15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When I skimmed through the table, I couldn't believe it. Let's look at it as a bar plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FashenBlade%2Fhabr-posts%2Fmaster%2Fpg_dphyp%2Fimg%2Fjob_cost_compare.en.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FashenBlade%2Fhabr-posts%2Fmaster%2Fpg_dphyp%2Fimg%2Fjob_cost_compare.en.svg" alt="Visual comparison of costs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the overwhelming majority of cases DPhyp produces a &lt;em&gt;cheaper&lt;/em&gt; plan than DPsize. Yes, planning takes a little longer, but the plan is better, which means that we win in the long run!&lt;/p&gt;

&lt;p&gt;However, something is clearly wrong here. After all, we only control the &lt;em&gt;join order&lt;/em&gt; of the relations; we do not create the plans ourselves, that is up to PostgreSQL. So does the vanilla planner misbehave? Let's look at the output manually and see what happened. We will examine query &lt;code&gt;6f&lt;/code&gt;. This is what DPsize gives us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                        QUERY PLAN                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15215.38..15215.39 rows=1 width=96)
   -&amp;gt;  Nested Loop  (cost=8.10..15175.34 rows=5339 width=48)
         -&amp;gt;  Nested Loop  (cost=7.67..12744.50 rows=5339 width=37)
               Join Filter: (ci.movie_id = t.id)
               -&amp;gt;  Nested Loop  (cost=7.23..12483.46 rows=147 width=41)
                     -&amp;gt;  Nested Loop  (cost=6.80..12351.21 rows=270 width=20)
                           -&amp;gt;  Seq Scan on keyword k  (cost=0.00..3691.40 rows=8 width=20)
                                 Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[]))
                           -&amp;gt;  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1079.43 rows=305 width=8)
                                 Recheck Cond: (k.id = keyword_id)
                                 -&amp;gt;  Bitmap Index Scan on keyword_id_movie_keyword  (cost=0.00..6.72 rows=305 width=0)
                                       Index Cond: (keyword_id = k.id)
                     -&amp;gt;  Index Scan using title_pkey on title t  (cost=0.43..0.49 rows=1 width=21)
                           Index Cond: (id = mk.movie_id)
                           Filter: (production_year &amp;gt; 2000)
               -&amp;gt;  Index Scan using movie_id_cast_info on cast_info ci  (cost=0.44..1.33 rows=36 width=8)
                     Index Cond: (movie_id = mk.movie_id)
         -&amp;gt;  Index Scan using name_pkey on name n  (cost=0.43..0.46 rows=1 width=19)
               Index Cond: (id = ci.person_id)
(19 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's for DPhyp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                        QUERY PLAN                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13720.08..13720.09 rows=1 width=96)
   -&amp;gt;  Nested Loop  (cost=8.10..13704.27 rows=2108 width=48)
         -&amp;gt;  Nested Loop  (cost=7.67..12744.50 rows=2108 width=37)
               Join Filter: (ci.movie_id = t.id)
               -&amp;gt;  Nested Loop  (cost=7.23..12483.46 rows=147 width=41)
                     -&amp;gt;  Nested Loop  (cost=6.80..12351.21 rows=270 width=20)
                           -&amp;gt;  Seq Scan on keyword k  (cost=0.00..3691.40 rows=8 width=20)
                                 Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[]))
                           -&amp;gt;  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1079.43 rows=305 width=8)
                                 Recheck Cond: (k.id = keyword_id)
                                 -&amp;gt;  Bitmap Index Scan on keyword_id_movie_keyword  (cost=0.00..6.72 rows=305 width=0)
                                       Index Cond: (keyword_id = k.id)
                     -&amp;gt;  Index Scan using title_pkey on title t  (cost=0.43..0.49 rows=1 width=21)
                           Index Cond: (id = mk.movie_id)
                           Filter: (production_year &amp;gt; 2000)
               -&amp;gt;  Index Scan using movie_id_cast_info on cast_info ci  (cost=0.44..1.33 rows=36 width=8)
                     Index Cond: (movie_id = mk.movie_id)
         -&amp;gt;  Index Scan using name_pkey on name n  (cost=0.43..0.46 rows=1 width=19)
               Index Cond: (id = ci.person_id)
(19 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plan is cheaper by almost 1500 units! That is, the extension really gives you a better plan... Stop! &lt;strong&gt;The plans are identical, but the costs are different&lt;/strong&gt;? The difference originates in the third node, the &lt;code&gt;Nested Loop&lt;/code&gt; with the &lt;code&gt;ci.movie_id = t.id&lt;/code&gt; join filter: DPsize estimates it at &lt;code&gt;5339&lt;/code&gt; rows, and DPhyp at &lt;code&gt;2108&lt;/code&gt;. How could this even happen? We need to find out.&lt;/p&gt;

&lt;p&gt;The starting point will be tracing the query to discover which subplans are used to create this NL. We will have to do this manually with a debugger, because there is no suitable setting for it: the &lt;code&gt;OPTIMIZER_DEBUG&lt;/code&gt; macro prints only the finished relations, while we need the order in which pairs of relations are combined to build the final one.&lt;/p&gt;

&lt;p&gt;For DPsize, the order will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{1, 2, 3} {5}
{1, 3, 5} {2}
{2, 3, 5} {1}
{1, 5} {2, 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For DPhyp, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{1} {2, 3, 5}
{1, 5} {2, 3}
{1, 3, 5} {2}
{1, 2, 3} {5}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Numbers are IDs of relations:&lt;/p&gt;

&lt;p&gt;1 - &lt;code&gt;cast_info ci&lt;/code&gt;&lt;br&gt;
2 - &lt;code&gt;keyword k&lt;/code&gt;&lt;br&gt;
3 - &lt;code&gt;movie_keyword mk&lt;/code&gt;&lt;br&gt;
5 - &lt;code&gt;title t&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
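&lt;p&gt;Whether the two traces cover the same joins, regardless of order, can be checked mechanically. Below is a minimal sketch that stores each relation set as a bitmask. The type, the constants and the function names are hypothetical and made up for this illustration; they are not PostgreSQL code, and the bit values are arbitrary rather than the real relation IDs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;/* Hypothetical bit per relation (arbitrary values, not the real IDs). */
enum
{
    REL_CI = 1,     /* 1 - cast_info ci */
    REL_K = 2,      /* 2 - keyword k */
    REL_MK = 4,     /* 3 - movie_keyword mk */
    REL_T = 8       /* 5 - title t */
};

/* One join attempt: the two relation sets that are being joined. */
typedef struct JoinPair
{
    unsigned left;
    unsigned right;
} JoinPair;

/* Two attempts match if they join the same two sets, in either order. */
static int
same_pair(JoinPair a, JoinPair b)
{
    if (a.left == b.left)
        return a.right == b.right;
    if (a.left == b.right)
        return a.right == b.left;
    return 0;
}

/* Returns 1 if the trace contains the wanted pair. */
static int
contains_pair(const JoinPair *trace, int n, JoinPair wanted)
{
    int i;

    for (i = 0; i != n; i++)
        if (same_pair(trace[i], wanted))
            return 1;
    return 0;
}

/*
 * Returns 1 if both traces contain the same pairs, ignoring order.
 * Assumes that no pair is repeated within a single trace.
 */
int
same_join_set(const JoinPair *a, int na, const JoinPair *b, int nb)
{
    int i;

    if (na != nb)
        return 0;
    for (i = 0; i != na; i++)
        if (!contains_pair(b, nb, a[i]))
            return 0;
    return 1;
}

/* The traces from above, in the order each algorithm produced them. */
static const JoinPair dpsize_trace[] = {
    {REL_CI | REL_K | REL_MK, REL_T},
    {REL_CI | REL_MK | REL_T, REL_K},
    {REL_K | REL_MK | REL_T, REL_CI},
    {REL_CI | REL_T, REL_K | REL_MK},
};

static const JoinPair dphyp_trace[] = {
    {REL_CI, REL_K | REL_MK | REL_T},
    {REL_CI | REL_T, REL_K | REL_MK},
    {REL_CI | REL_MK | REL_T, REL_K},
    {REL_CI | REL_K | REL_MK, REL_T},
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;same_join_set(dpsize_trace, 4, dphyp_trace, 4)&lt;/code&gt; returns 1: every pair from one trace also appears in the other, only at a different position.&lt;/p&gt;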

&lt;p&gt;Despite the difference in order, the same pairs are processed, meaning we don't lose anything. But why is the cost different then? Let's look at it from the other side: what are these numbers &lt;code&gt;2108&lt;/code&gt; and &lt;code&gt;5339&lt;/code&gt; in the query plans, and where do they come from? In the code, this is the &lt;code&gt;rows&lt;/code&gt; field of the &lt;code&gt;Path&lt;/code&gt; structure. How is this member initialized? It turns out that the &lt;code&gt;rows&lt;/code&gt; of the &lt;code&gt;Path&lt;/code&gt; structure is initialized from the &lt;code&gt;rows&lt;/code&gt; field of &lt;code&gt;RelOptInfo&lt;/code&gt;, and this is done for all types of plan nodes (the examples below are the JOIN nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* https://github.com/postgres/postgres/blob/144ad723a4484927266a316d1c9550d56745ff67/src/backend/optimizer/path/costsize.c#L3375 */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;final_cost_nestloop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NestPath&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinCostWorkspace&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinPathExtraData&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ppi_rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* https://github.com/postgres/postgres/blob/144ad723a4484927266a316d1c9550d56745ff67/src/backend/optimizer/path/costsize.c#L3873 */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;final_cost_mergejoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MergePath&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinCostWorkspace&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinPathExtraData&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ppi_rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* https://github.com/postgres/postgres/blob/144ad723a4484927266a316d1c9550d56745ff67/src/backend/optimizer/path/costsize.c#L4305 */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;final_cost_hashjoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HashPath&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinCostWorkspace&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinPathExtraData&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;param_info&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ppi_rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jpath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, then where does &lt;code&gt;rows&lt;/code&gt; in &lt;code&gt;RelOptInfo&lt;/code&gt; come from? Searching the code shows that for a JOIN there is &lt;em&gt;only one&lt;/em&gt; place where it is initialized: &lt;code&gt;set_joinrel_size_estimates&lt;/code&gt;. It is called from two places: &lt;code&gt;build_join_rel&lt;/code&gt;, which creates a &lt;em&gt;new&lt;/em&gt; JOIN &lt;code&gt;RelOptInfo&lt;/code&gt;, and &lt;code&gt;build_child_join_rel&lt;/code&gt;, which does the same for inherited tables (partitions end up here). Our query has no partitions, so &lt;code&gt;build_join_rel&lt;/code&gt; is used. So when is the number of rows estimated? When the structure is &lt;em&gt;created&lt;/em&gt; for the first time, &lt;code&gt;set_joinrel_size_estimates&lt;/code&gt; is called and sets this field, &lt;em&gt;estimating it from the current pair&lt;/em&gt; of relations being joined. In other words, the row count is estimated once, and that estimate is then reused in all cases. This sounds quite logical: the predicates are fixed for each set of relations, so the number of rows returned by the set should not depend on the physical implementation of the operators.&lt;/p&gt;
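
&lt;p&gt;To make the "estimated once, then reused" behaviour concrete, here is a minimal Python sketch of the idea. It is a toy model, not PostgreSQL code: the &lt;code&gt;JoinRelCache&lt;/code&gt; class, its method and all the numbers are made up for illustration.&lt;/p&gt;

```python
class JoinRelCache:
    """Toy model of join relations keyed by their set of base relation IDs."""

    def __init__(self):
        self.rows_by_relids = {}

    def build_join_rel(self, outer, inner, outer_rows, inner_rows, selectivity):
        """Return the row estimate for the join of `outer` and `inner`.

        The estimate is computed only when the joined set is seen for the
        first time; later calls reuse it, whatever pair they pass in.
        """
        relids = frozenset(outer) | frozenset(inner)
        if relids not in self.rows_by_relids:
            self.rows_by_relids[relids] = outer_rows * inner_rows * selectivity
        return self.rows_by_relids[relids]


cache = JoinRelCache()
# Hypothetical inputs: the first pair to produce {1, 2, 3} fixes the estimate.
first = cache.build_join_rel({1, 2}, {3}, outer_rows=1000, inner_rows=50, selectivity=0.01)
# A different split of the same set arrives later with different input sizes.
second = cache.build_join_rel({1}, {2, 3}, outer_rows=200, inner_rows=400, selectivity=0.01)
print(first, second)
```

&lt;p&gt;Both calls return the same value, so in this model the estimate for a set of relations depends on which pair happens to be enumerated first.&lt;/p&gt;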

&lt;p&gt;But then why do the estimates differ so much? To find out, we trace the query again, this time recording every invocation and every estimate made. Let's build a tree whose nodes are sets of relations and whose children are the sets from which the parent is created. Since the query uses only &lt;code&gt;INNER JOIN&lt;/code&gt;, the number of tuples is calculated as &lt;code&gt;nrows = outer_rows * inner_rows * jselec&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nrows&lt;/code&gt; — total number of tuples;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;outer_rows&lt;/code&gt; — number of tuples in outer (left) part;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inner_rows&lt;/code&gt; — number of tuples in inner (right) part;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jselec&lt;/code&gt; — predicates selectivity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For DPsize, the call tree looks like this (the predicate selectivity is written under each set, and each edge is labeled with the number of output tuples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     {1, 2, 3, 5}
                        3.95e-7
                    /           \
                9813           1375372
                 /                   \
            {1, 2, 3}                {5}
             7.45e-6  
            /      \ 
     164574168      8
        /            \
    {1, 3}           {2}
     1e-6
    /     \
36245584  4523930 
  /          \
{1}          {3}  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The selectivities shown are truncated to save space, since the numbers are long, but even so you can see that &lt;code&gt;9813 * 1375372 * 3.95e-7 ≈ 5331&lt;/code&gt;. With the dropped digits restored, we arrive at the expected number, &lt;code&gt;5339&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now let's see what happened in DPhyp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 {1, 2, 3, 5}
                    3.95e-7
                  /         \   
             36245584        147
               /               \
            {1}             {2, 3, 5}
                              7.45e-6
                            /        \
                           8       2461152
                          /             \
                        {2}           {3, 5}
                                      3.95e-7
                                     /      \
                                 4523930  1375372
                                   /          \
                                 {3}          {5} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;36245584 * 147 * 3.95e-7 ≈ 2104&lt;/code&gt;; restoring the dropped digits and rounding gives us &lt;code&gt;2108&lt;/code&gt;.&lt;/p&gt;
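
&lt;p&gt;The two top-level products can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the helper name &lt;code&gt;join_rows&lt;/code&gt; is made up, and the inputs are the truncated numbers from the two trees above, so the results land slightly below the planner's values.&lt;/p&gt;

```python
def join_rows(outer_rows, inner_rows, jselec):
    """Inner join size estimate: nrows = outer_rows * inner_rows * jselec."""
    return outer_rows * inner_rows * jselec


# DPsize builds {1, 2, 3, 5} from {1, 2, 3} and {5} first.
dpsize_rows = join_rows(9813, 1375372, 3.95e-7)
# DPhyp builds the same set from {1} and {2, 3, 5} first.
dphyp_rows = join_rows(36245584, 147, 3.95e-7)

print(f"DPsize: {dpsize_rows:.1f}")
print(f"DPhyp:  {dphyp_rows:.1f}")
```

&lt;p&gt;The selectivity is the same in both cases; only the input sizes differ, and that alone changes the result by a factor of about 2.5.&lt;/p&gt;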

&lt;p&gt;So we have found the root of the problem: inaccurate initial estimates. Is that really so bad? In our case, yes, because if we run the query, we get the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                             QUERY PLAN                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13720.08..13720.09 rows=1 width=96) (actual time=8753.807..8753.810 rows=1 loops=1)
   -&amp;gt;  Nested Loop  (cost=8.10..13704.27 rows=2108 width=48) (actual time=0.637..8530.663 rows=785477 loops=1)
         -&amp;gt;  Nested Loop  (cost=7.67..12744.50 rows=2108 width=37) (actual time=0.623..2643.045 rows=785477 loops=1)
               Join Filter: (ci.movie_id = t.id)
               -&amp;gt;  Nested Loop  (cost=7.23..12483.46 rows=147 width=41) (actual time=0.610..405.496 rows=14165 loops=1)
                     -&amp;gt;  Nested Loop  (cost=6.80..12351.21 rows=270 width=20) (actual time=0.597..140.721 rows=35548 loops=1)
                           -&amp;gt;  Seq Scan on keyword k  (cost=0.00..3691.40 rows=8 width=20) (actual time=0.143..37.305 rows=8 loops=1)
                                 Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[]))
                                 Rows Removed by Filter: 134162
                           -&amp;gt;  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1079.43 rows=305 width=8) (actual time=0.993..12.278 rows=4444 loops=8)
                                 Recheck Cond: (k.id = keyword_id)
                                 Heap Blocks: exact=23488
                                 -&amp;gt;  Bitmap Index Scan on keyword_id_movie_keyword  (cost=0.00..6.72 rows=305 width=0) (actual time=0.501..0.501 rows=4444 loops=8)
                                       Index Cond: (keyword_id = k.id)
                     -&amp;gt;  Index Scan using title_pkey on title t  (cost=0.43..0.49 rows=1 width=21) (actual time=0.007..0.007 rows=0 loops=35548)
                           Index Cond: (id = mk.movie_id)
                           Filter: (production_year &amp;gt; 2000)
                           Rows Removed by Filter: 1
               -&amp;gt;  Index Scan using movie_id_cast_info on cast_info ci  (cost=0.44..1.33 rows=36 width=8) (actual time=0.008..0.148 rows=55 loops=14165)
                     Index Cond: (movie_id = mk.movie_id)
         -&amp;gt;  Index Scan using name_pkey on name n  (cost=0.43..0.46 rows=1 width=19) (actual time=0.007..0.007 rows=1 loops=785477)
               Index Cond: (id = ci.person_id)
 Planning Time: 1.419 ms
 Execution Time: 8753.873 ms
(24 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In reality, that node produced 785477 tuples, so the estimates are off by roughly the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DPhyp: 370;&lt;/li&gt;
&lt;li&gt;DPsize: 150.&lt;/li&gt;
&lt;/ul&gt;
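
&lt;p&gt;These factors are just the actual row count divided by each estimate. A short Python check, using only the numbers from the plans above (the variable names are arbitrary):&lt;/p&gt;

```python
actual_rows = 785477
estimates = {"DPhyp": 2108, "DPsize": 5339}

# Underestimation factor: how many times more rows the node really returned.
factors = {name: actual_rows / rows for name, rows in estimates.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.0f}x")

# How much further off DPhyp is compared to DPsize.
ratio = factors["DPhyp"] / factors["DPsize"]
print(f"ratio: {ratio:.2f}")
```

&lt;p&gt;The unrounded factors are about 373 and 147, which the list above rounds to 370 and 150.&lt;/p&gt;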

&lt;p&gt;Compared to DPsize, our estimate is off by more than a factor of two, and in the worse direction: it is an even larger underestimate. But that's not all. Recall the first example, the query with a single hyperedge. It is planned very quickly, but if we change it slightly, for example by moving &lt;code&gt;t7.x&lt;/code&gt; to the right side of the binary predicate, we get the following plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..215344.72 rows=1594323 width=56)
   Join Filter: ((((((t1.x + t2.x) + t3.x) + t4.x) + t5.x) + t6.x) &amp;gt; (((((((t7.x + t8.x) + t9.x) + t10.x) + t11.x) + t12.x) + t13.x) + t14.x))
   -&amp;gt;  Nested Loop  (cost=0.00..93.00 rows=6561 width=32)
         -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16)
               -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                     -&amp;gt;  Seq Scan on t7  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                           -&amp;gt;  Seq Scan on t8  (cost=0.00..1.03 rows=3 width=4)
               -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t9  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t10  (cost=0.00..1.03 rows=3 width=4)
         -&amp;gt;  Materialize  (cost=0.00..5.80 rows=81 width=16)
               -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t11  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t12  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                                 -&amp;gt;  Seq Scan on t13  (cost=0.00..1.03 rows=3 width=4)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                       -&amp;gt;  Seq Scan on t14  (cost=0.00..1.03 rows=3 width=4)
   -&amp;gt;  Materialize  (cost=0.00..19.94 rows=729 width=24)
         -&amp;gt;  Nested Loop  (cost=0.00..16.29 rows=729 width=24)
               -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t2  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t3  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                           -&amp;gt;  Seq Scan on t1  (cost=0.00..1.03 rows=3 width=4)
               -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12)
                     -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                                 -&amp;gt;  Seq Scan on t5  (cost=0.00..1.03 rows=3 width=4)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                       -&amp;gt;  Seq Scan on t6  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t4  (cost=0.00..1.03 rows=3 width=4)
 Planning Time: 9.477 ms
(42 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a little slower, 9 ms instead of 4 ms, but still fast. However, speed is no longer the point. Look at what DPsize produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                                  QUERY PLAN                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..215311.78 rows=1594323 width=56)
   Join Filter: ((((((t1.x + t2.x) + t3.x) + t4.x) + t5.x) + t6.x) &amp;gt; (((((((t7.x + t8.x) + t9.x) + t10.x) + t11.x) + t12.x) + t13.x) + t14.x))
   -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28)
         -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16)
               -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                     -&amp;gt;  Seq Scan on t4  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                           -&amp;gt;  Seq Scan on t5  (cost=0.00..1.03 rows=3 width=4)
               -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t6  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t7  (cost=0.00..1.03 rows=3 width=4)
         -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12)
               -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t1  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t2  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                           -&amp;gt;  Seq Scan on t3  (cost=0.00..1.03 rows=3 width=4)
   -&amp;gt;  Materialize  (cost=0.00..47.29 rows=2187 width=28)
         -&amp;gt;  Nested Loop  (cost=0.00..36.36 rows=2187 width=28)
               -&amp;gt;  Nested Loop  (cost=0.00..5.39 rows=81 width=16)
                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t11  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t12  (cost=0.00..1.03 rows=3 width=4)
                     -&amp;gt;  Materialize  (cost=0.00..2.23 rows=9 width=8)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                                 -&amp;gt;  Seq Scan on t13  (cost=0.00..1.03 rows=3 width=4)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                       -&amp;gt;  Seq Scan on t14  (cost=0.00..1.03 rows=3 width=4)
               -&amp;gt;  Materialize  (cost=0.00..3.69 rows=27 width=12)
                     -&amp;gt;  Nested Loop  (cost=0.00..3.56 rows=27 width=12)
                           -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                                 -&amp;gt;  Seq Scan on t8  (cost=0.00..1.03 rows=3 width=4)
                                 -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                       -&amp;gt;  Seq Scan on t9  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t10  (cost=0.00..1.03 rows=3 width=4)
 Planning Time: 3337.097 ms
(42 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planning time is indeed much longer, but take a closer look at the query plan. The first thing to notice is that the cost is lower. Here is why:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     -&amp;gt;  Nested Loop  (cost=0.00..2.18 rows=9 width=8)
                           -&amp;gt;  Seq Scan on t6  (cost=0.00..1.03 rows=3 width=4)
                           -&amp;gt;  Materialize  (cost=0.00..1.04 rows=3 width=4)
                                 -&amp;gt;  Seq Scan on t7  (cost=0.00..1.03 rows=3 width=4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, DPsize was able to &lt;em&gt;find an implicit connection&lt;/em&gt; between the relations, even though they are on different sides of the operator. DPhyp will not do this, because the relations are on different sides of a hyperedge and its logic forbids it: you cannot join individual nodes of different hypernodes if they lie on opposite sides of a hyperedge. From this we can conclude that DPhyp is a very good algorithm, but still a &lt;em&gt;heuristic&lt;/em&gt;. Unfortunately, these are not all the problems.&lt;/p&gt;

&lt;p&gt;Three different sources are used to create hyperedges, but they do not cover all possible cases. The problem lies in the &lt;code&gt;joinclauses&lt;/code&gt;. During planning, PostgreSQL creates all possible variants of an expression, each with a different set of required relations on the left and right sides. This lets it consider different positions for the expression in the join tree. The catch is that the relation IDs used there may refer not to tables but to the indexes of JOIN nodes (&lt;code&gt;RangeTblEntry&lt;/code&gt; of type &lt;code&gt;RTE_JOIN&lt;/code&gt;). They are there for a reason: to "tell" the planner which relations can be used and how they may be reordered. To see this, let's look at the following query (taken from PostgreSQL's own regression tests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;
  &lt;span class="n"&gt;text_tbl&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
  &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'***'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d1&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;
    &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8&lt;/span&gt;
      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
      &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;int4_tbl&lt;/span&gt; &lt;span class="n"&gt;i4&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problematic predicate here is &lt;code&gt;i8.q2 = i4.f1&lt;/code&gt;: it is stored in &lt;em&gt;3&lt;/em&gt; copies, each with a different set of relations on the left and right sides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{3}       - {8}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{3, 6}    - {8}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{3, 6, 7} - {8}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3 and 8 are the indexes of tables &lt;code&gt;i8&lt;/code&gt; and &lt;code&gt;i4&lt;/code&gt;, respectively. But what are 6 and 7? They are the indexes of JOINs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 6&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'***'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d1&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;
    &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8&lt;/span&gt;
      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
      &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- 7&lt;/span&gt;
&lt;span class="n"&gt;text_tbl&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
  &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'***'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d1&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;
    &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8&lt;/span&gt;
      &lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;int8_tbl&lt;/span&gt; &lt;span class="n"&gt;i8b2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
      &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reflects the limitations: &lt;code&gt;i8.q2 = i4.f1&lt;/code&gt; does not apply to &lt;code&gt;i8&lt;/code&gt; itself, but to the result of the &lt;code&gt;LEFT JOIN&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;OK, but then why is the execution time longer? If we are not considering additional cases (discussed above), then this time should be shorter. Here's the thing.&lt;/p&gt;

&lt;p&gt;If we take a look at the code of the vanilla planner, we can see that it is quite smart. The &lt;a href="https://github.com/postgres/postgres/blob/0810fbb02dbe70b8a7a7bcc51580827b8bbddbdc/src/backend/optimizer/path/joinrels.c#L73" rel="noopener noreferrer"&gt;function &lt;code&gt;join_search_one_level&lt;/code&gt;&lt;/a&gt; is responsible for processing a single level of DPsize:&lt;/p&gt;

&lt;p&gt;
  join_search_one_level
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* src/backend/optimizer/path/joinrels.c */&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;join_search_one_level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;      &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;joinrels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;join_rel_level&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;ListCell&lt;/span&gt;   &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;         &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;join_cur_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * Make ZIG-ZAG plan - join table with previous level.
     */&lt;/span&gt;
    &lt;span class="n"&gt;foreach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;joininfo&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;has_eclass_joins&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
            &lt;span class="n"&gt;has_join_restriction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;int&lt;/span&gt;         &lt;span class="n"&gt;first_rel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;first_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;foreach_current_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
                &lt;span class="n"&gt;first_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="n"&gt;make_rels_by_clause_joins&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;first_rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;make_rels_by_clauseless_joins&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * Creation of "bushy" plans - main DPsize logic, where all possible
     * pairs of relations with several tables on both sides are considered.
     */&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt;         &lt;span class="n"&gt;other_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;foreach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="kt"&gt;int&lt;/span&gt;         &lt;span class="n"&gt;first_rel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;ListCell&lt;/span&gt;   &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;other_level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;first_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;foreach_current_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
                &lt;span class="n"&gt;first_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="n"&gt;for_each_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;other_level&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;first_rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;new_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                &lt;span class="cm"&gt;/* Build plan only if it makes sense */&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bms_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;relids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;relids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;have_relevant_joinclause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                        &lt;span class="n"&gt;have_join_order_restriction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;make_join_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * Build CROSS JOIN if we failed to build anything at current level
     */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;foreach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="n"&gt;make_rels_by_clauseless_joins&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;joinrels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;The logic is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, we create left-deep/right-deep plans — we create the current level by joining 1 table to the previous level (even if there is no predicate, that is, we create a &lt;code&gt;CROSS JOIN&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Then we run the main DPsize logic, considering all possible pairs of relation sets that together form the target set.&lt;/li&gt;
&lt;li&gt;In the end, if we couldn't find anything, then we create the &lt;code&gt;CROSS JOIN&lt;/code&gt; nodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main optimization is in step 2: we &lt;em&gt;do not&lt;/em&gt; create extra &lt;code&gt;CROSS JOIN&lt;/code&gt;s if they are not needed. This is, in effect, an imitation of the behavior of DPhyp, since its idea is to join relations only if there is a connection (an edge) between them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* Left/right parts do not intersect */&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bms_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;relids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;relids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Have edge between nodes */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;have_relevant_joinclause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="n"&gt;have_join_order_restriction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;make_join_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;old_rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
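&lt;p&gt;To make the effect of this check concrete, here is a toy Python model of the DPsize enumeration that only joins two relation sets when a predicate links them. It is a sketch, not PostgreSQL code: the names &lt;code&gt;connected&lt;/code&gt; and &lt;code&gt;dpsize_levels&lt;/code&gt; are hypothetical, predicates are reduced to plain pairs of relation names, steps 1 and 2 are merged into one loop, and both join order restrictions and the final &lt;code&gt;CROSS JOIN&lt;/code&gt; fallback are left out.&lt;/p&gt;

```python
def connected(left, right, edges):
    """True if some join predicate links a relation in left with one in right."""
    return any((a in left and b in right) or (a in right and b in left)
               for a, b in edges)


def dpsize_levels(relations, edges):
    """Return, per level, the relation sets that can be built without cross joins.

    Toy model (assumption): a predicate is just a pair of relation names.
    """
    levels = {1: {frozenset([r]) for r in relations}}
    for level in range(2, len(relations) + 1):
        found = set()
        # Split the target size into two smaller, already computed levels.
        for k in range(1, level // 2 + 1):
            for left in levels[k]:
                for right in levels[level - k]:
                    # Same two checks as in the C snippet: the sides must not
                    # overlap, and there must be an edge between them.
                    if left.isdisjoint(right) and connected(left, right, edges):
                        found.add(left | right)
        levels[level] = found
    return levels
```

&lt;p&gt;For a chain &lt;code&gt;a - b - c&lt;/code&gt;, the call &lt;code&gt;dpsize_levels("abc", [("a", "b"), ("b", "c")])&lt;/code&gt; never produces the pair &lt;code&gt;{a, c}&lt;/code&gt; on level 2, because no predicate links those two relations.&lt;/p&gt;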



&lt;p&gt;We spend the remaining microseconds on operational overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hypergraph building.&lt;/li&gt;
&lt;li&gt;Neighbors traversing.&lt;/li&gt;
&lt;li&gt;Hash-table maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these extra cycles accumulate and result in an even longer execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;The extension is, of course, not ready yet for performance reasons. The existing planner infrastructure is tightly coupled to the rest of the PostgreSQL code base, so to implement some functionality you either have to look for hacks or accept that it will work inefficiently.&lt;/p&gt;

&lt;p&gt;The question arises: is all of this necessary? It is R&amp;amp;D. As I said at the beginning, the planner is an important part of the DBMS, so research in this area may at some point turn into a significant gain (the keyword is "may"). Unfortunately, the payback on this particular investment is negative for now: in practice, this algorithm creates plans that are neither better nor faster.&lt;/p&gt;

&lt;p&gt;Let's summarize the disadvantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Some possible optimizations are lost: some predicates currently cannot be completely transformed into hyperedges because they reference &lt;code&gt;RangeTable&lt;/code&gt; indexes that do NOT refer to relations (this is mainly a problem for OUTER JOINs).&lt;/li&gt;
&lt;li&gt;Disconnected subgraphs require special processing: the user must tell the extension how to act (the &lt;code&gt;cj_strategy&lt;/code&gt; setting).&lt;/li&gt;
&lt;li&gt;Estimations suffer due to the suboptimal order of relation pairs: the order of the processed relation pairs is important for JOINS, but now it does not correspond to what the built-in planner does.&lt;/li&gt;
&lt;li&gt;It takes quite a long time to complete: the built-in &lt;code&gt;make_join_rel&lt;/code&gt; is currently being used, but it takes the lion's share of the time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is also good news: everything is fixable. The limitations are dictated by the implementation alone, that is, they are not fundamental limitations. No one forbids us to write our own &lt;code&gt;make_join_rel&lt;/code&gt; optimized for DPhyp. As a last resort, we can patch the core and add what we're missing. Moreover, we found at least one query that we were able to speed up hundreds of times, and this can open up a whole field of applicability, which means that the work has not been done in vain.&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/TantorLabs/pg_dphyp" rel="noopener noreferrer"&gt;Extension pg_dphyp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/p539-moerkotte.pdf" rel="noopener noreferrer"&gt;Dynamic Programming Strikes Back&lt;/a&gt; — DPhyp&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://db.in.tum.de/~radke/papers/hugejoins.pdf" rel="noopener noreferrer"&gt;Adaptive Optimization of Very Large Join Queries&lt;/a&gt; — planning of large queries (1000+ tables)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>How to handle files?</title>
      <dc:creator>Sergey Solovev</dc:creator>
      <pubDate>Sat, 03 May 2025 15:21:10 +0000</pubDate>
      <link>https://forem.com/ashenblade/how-to-handle-files-4mpl</link>
      <guid>https://forem.com/ashenblade/how-to-handle-files-4mpl</guid>
      <description>&lt;p&gt;Greetings!&lt;/p&gt;

&lt;p&gt;Fault tolerance is a very important aspect of every non-startup application.&lt;br&gt;
Wikipedia gives the following &lt;a href="https://en.wikipedia.org/wiki/Fault_tolerance" rel="noopener noreferrer"&gt;definition&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this gives only a brief overview. Fault tolerance spans many areas, especially in software engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network failures (e.g. a connection stalls due to a power outage on an intermediate router)&lt;/li&gt;
&lt;li&gt;Dependent service unavailability (e.g. another microservice)&lt;/li&gt;
&lt;li&gt;Hardware bugs (e.g. the &lt;a href="https://en.wikipedia.org/wiki/Pentium_FDIV_bug" rel="noopener noreferrer"&gt;Pentium FDIV bug&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Storage layer corruptions (e.g. &lt;a href="https://en.wikipedia.org/wiki/Data_degradation" rel="noopener noreferrer"&gt;bit rot&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a database developer, I'm interested in the latter: in the end, all data is stored on disks.&lt;/p&gt;

&lt;p&gt;But disks are not the only source of such faults. Other parts of the stack can misbehave, or the developer may simply use them incorrectly.&lt;/p&gt;

&lt;p&gt;I'm going to explain how this 'file write stack' (named by analogy with the 'network stack') works.&lt;br&gt;
Of course, the main concern will be fault tolerance.&lt;/p&gt;
&lt;h2&gt;
  
  
  Application
&lt;/h2&gt;

&lt;p&gt;Everything starts in the application's code. Usually, there is a separate interface for working with files.&lt;/p&gt;

&lt;p&gt;Each PL (programming language) has its own interface. Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fwrite&lt;/code&gt; - C&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;std::fstream::write&lt;/code&gt; - C++&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FileStream.Write&lt;/code&gt; - C#&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FileOutputStream.write&lt;/code&gt; - Java&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;open().write&lt;/code&gt; - Python&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;os.WriteFile&lt;/code&gt; - Go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all facilities provided by the PL itself: read, write, etc.&lt;br&gt;
Their main advantage is platform independence: the runtime (e.g. C#) or the compiler (e.g. C) implements them.&lt;br&gt;
But this has drawbacks, and buffering is one of them: each call to this virtual "write" function stores the data in a special buffer, and the buffer is later written out all at once (with a single write syscall).&lt;/p&gt;

&lt;p&gt;According to the documentation, each of the PLs above supports buffering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C - &lt;a href="https://en.cppreference.com/w/c/io/setvbuf" rel="noopener noreferrer"&gt;setvbuf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;C++ - &lt;a href="https://cplusplus.com/reference/fstream/filebuf/" rel="noopener noreferrer"&gt;filebuf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;C# - &lt;a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/BufferedFileStreamStrategy.cs" rel="noopener noreferrer"&gt;BufferedFileStreamStrategy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Java - &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#newBufferedWriter-java.nio.file.Path-java.nio.charset.Charset-java.nio.file.OpenOption...-" rel="noopener noreferrer"&gt;Files.newBufferedWriter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Python - &lt;a href="https://docs.python.org/3/library/io.html#io.BufferedIOBase" rel="noopener noreferrer"&gt;io.BufferedIOBase&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go - &lt;a href="https://pkg.go.dev/bufio#Reader" rel="noopener noreferrer"&gt;bufio.Reader&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;As for C#: the &lt;code&gt;FileStream&lt;/code&gt; class delegates all requests to another class, &lt;code&gt;FileStreamStrategy&lt;/code&gt; (the &lt;a href="https://en.wikipedia.org/wiki/Strategy_pattern" rel="noopener noreferrer"&gt;Strategy pattern&lt;/a&gt;).&lt;br&gt;
For example, when a &lt;code&gt;FileStream&lt;/code&gt; is created through &lt;code&gt;File.Open()&lt;/code&gt;, &lt;code&gt;BufferedFileStreamStrategy&lt;/code&gt; wraps &lt;code&gt;OSFileStreamStrategy&lt;/code&gt;.&lt;br&gt;
It adds a buffering layer on top of the syscalls.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Generally, buffering in user space is a good feature: it usually improves performance.&lt;br&gt;
But the programmer should be aware of these additional buffers. I can identify 2 types of buffering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manual (Go, Java) - the buffered file object is created explicitly&lt;/li&gt;
&lt;li&gt;Transparent (C, C++) - the buffer is maintained automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the first case we know that the buffer should be flushed at the end of a write session, but in the second case you can easily shoot yourself in the foot:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The file is open and its data sits in the in-memory buffer, but no flush callbacks are registered, or the flushing code was simply forgotten&lt;/li&gt;
&lt;li&gt;The application is terminated abnormally (e.g. it receives SIGKILL)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all such cases, the data still held in the in-memory buffer is lost.&lt;/p&gt;
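&lt;p&gt;A minimal Python sketch of this effect: it writes a short payload through a default (buffered) file object and reports the on-disk size before and after an explicit flush. The function name &lt;code&gt;sizes_before_and_after_flush&lt;/code&gt; is hypothetical, and the sketch assumes the payload is smaller than the user-space buffer (a few kilobytes by default in CPython).&lt;/p&gt;

```python
import os
import tempfile


def sizes_before_and_after_flush(payload):
    """Return the on-disk file size before and after an explicit flush."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        f = open(path, "wb")             # buffered by default
        f.write(payload)                 # stays in the user-space buffer
        before = os.path.getsize(path)   # what the OS has seen so far
        f.flush()                        # hand the buffer over to the OS
        after = os.path.getsize(path)
        f.close()
        return before, after
    finally:
        os.remove(path)
```

&lt;p&gt;If the process were killed between &lt;code&gt;write&lt;/code&gt; and &lt;code&gt;flush&lt;/code&gt;, the operating system would never have received the payload at all.&lt;/p&gt;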

&lt;p&gt;So, there are 2 solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always flush data after a write session.
For a simple application, the easiest way is to add a single write function that accepts a batch of write buffers and calls the flush function at the end.&lt;/li&gt;
&lt;li&gt;Disable buffering entirely and write directly.&lt;/li&gt;
&lt;/ol&gt;
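&lt;p&gt;Both options can be sketched in Python with one small helper. This is only an illustration under stated assumptions: &lt;code&gt;BatchWriter&lt;/code&gt; and &lt;code&gt;demo&lt;/code&gt; are hypothetical names, the default &lt;code&gt;buffering=-1&lt;/code&gt; models option 1 (buffer plus one flush per batch), and &lt;code&gt;buffering=0&lt;/code&gt; models option 2 (no user-space buffer at all).&lt;/p&gt;

```python
import os
import tempfile


class BatchWriter:
    """Hypothetical helper: keeps a file open and flushes once per batch."""

    def __init__(self, path, buffering=-1):
        # buffering=-1 keeps the default buffer, buffering=0 disables it
        self._file = open(path, "ab", buffering=buffering)

    def write_batch(self, chunks):
        for chunk in chunks:
            self._file.write(chunk)
        self._file.flush()  # effectively a no-op when buffering is disabled

    def close(self):
        self._file.close()


def demo():
    """Return the on-disk size seen after each batch, before close()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        buffered = BatchWriter(path)               # option 1
        buffered.write_batch([b"ab", b"cd"])
        size_buffered = os.path.getsize(path)
        buffered.close()

        direct = BatchWriter(path, buffering=0)    # option 2
        direct.write_batch([b"ef"])
        size_direct = os.path.getsize(path)
        direct.close()
        return size_buffered, size_direct
    finally:
        os.remove(path)
```

&lt;p&gt;In both modes the data has reached the operating system by the time &lt;code&gt;write_batch&lt;/code&gt; returns. The difference is the number of write calls made along the way: one per batch with the buffer, one per chunk without it.&lt;/p&gt;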

&lt;p&gt;From a performance point of view, the first option is the most attractive.&lt;/p&gt;

&lt;p&gt;
  Performance comparison
  &lt;p&gt;I have made a small benchmark: a sequential write of 64 MB of data into a file.&lt;br&gt;
Test matrix: an old laptop with an HDD and a new laptop with an NVMe M.2 SSD.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Direct write, s&lt;/th&gt;
&lt;th&gt;Buffered write, s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Old&lt;/td&gt;
&lt;td&gt;894.7&lt;/td&gt;
&lt;td&gt;109.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New&lt;/td&gt;
&lt;td&gt;8.932&lt;/td&gt;
&lt;td&gt;1.198&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The result is obvious: buffered IO is roughly 7.5 to 8 times faster.&lt;br&gt;
Source code of benchmark &lt;a href="https://github.com/ashenBlade/habr-posts/blob/file-write/file-write/src/FileWrite.Benchmarks/FileWriteBenchmarks.cs" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;



&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating system
&lt;/h2&gt;

&lt;p&gt;A programming language gives a good platform abstraction: the programmer does not need to think (at least, not very often) about which operating system the application runs on.&lt;br&gt;
But eventually, the PL interface is mapped to some syscalls.&lt;br&gt;
These syscalls are OS dependent, but in general we have 4 main functions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;*nix&lt;/th&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open/Create&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;open&lt;/code&gt;/&lt;code&gt;creat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CreateFile&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pread&lt;/code&gt;/&lt;code&gt;read&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReadFile&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pwrite&lt;/code&gt;/&lt;code&gt;write&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WriteFile&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Close&lt;/td&gt;
&lt;td&gt;&lt;code&gt;close&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CloseHandle&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;*nix&lt;/code&gt; means a Unix-like OS: Linux, FreeBSD, macOS.&lt;br&gt;
The semantics of some operations can differ, but that is not very important right now: the common interface and behaviour pattern are the same.&lt;/p&gt;
&lt;/blockquote&gt;
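&lt;p&gt;To illustrate how a language interface sits on top of these calls, here is a small Go sketch. The helper &lt;code&gt;roundTrip&lt;/code&gt; is a hypothetical name; the comments describe the *nix side only and are meant as an orientation, not as a description of every platform:&lt;/p&gt;

```go
package main

import "os"

// roundTrip is a hypothetical helper that touches all four operations from
// the table: it opens a temporary file, writes data, reads it back and
// closes the file.
func roundTrip(data []byte) ([]byte, error) {
	// Open/Create: ends up in an open-family syscall on *nix.
	f, err := os.CreateTemp("", "syscall-demo-*")
	if err != nil {
		return nil, err
	}
	// Deferred calls run in reverse order: Close first, then Remove.
	defer os.Remove(f.Name())
	// Close: close on *nix.
	defer f.Close()

	// Write: WriteAt is positional, pwrite on *nix.
	if _, err := f.WriteAt(data, 0); err != nil {
		return nil, err
	}
	// Read: ReadAt is positional, pread on *nix.
	got := make([]byte, len(data))
	if _, err := f.ReadAt(got, 0); err != nil {
		return nil, err
	}
	return got, nil
}
```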

&lt;p&gt;There is also buffering at the OS layer. It is called the &lt;a href="https://en.wikipedia.org/wiki/Page_cache" rel="noopener noreferrer"&gt;page cache&lt;/a&gt;.&lt;br&gt;
And it's even easier to shoot yourself in the foot with it.&lt;/p&gt;

&lt;p&gt;File operations are performed in pages (contiguous blocks of data, usually 4 KB), even if only a single byte is requested.&lt;/p&gt;

&lt;p&gt;When we read, the page is stored in a buffer. When we then send a write request, it is not written to the device immediately. Instead, the page is just marked as "dirty" and scheduled to be flushed in the near future.&lt;br&gt;
All such pages (clean and dirty) make up the page cache.&lt;/p&gt;

&lt;p&gt;How can this harm us? Consider these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user sends a request to update their profile, for example the password&lt;/li&gt;
&lt;li&gt;We open the file with user information and retrieve the required page&lt;/li&gt;
&lt;li&gt;The page is updated with the new password (of course, it is a password hash for security reasons)&lt;/li&gt;
&lt;li&gt;The application makes a "write" request to the OS (it returns successfully)&lt;/li&gt;
&lt;li&gt;We tell the user that everything is updated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What could go wrong? Power outage!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user has replaced the old password with the new one in a password manager (the old &lt;em&gt;cryptographic&lt;/em&gt; password is gone)&lt;/li&gt;
&lt;li&gt;The "dirty" page with the new password (step 3) was lost because of the power failure, so the old password is still the one stored on disk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Congratulations: the user has lost access to the account.&lt;/p&gt;

&lt;p&gt;The page cache is very helpful, especially when there are many operations on the same page.&lt;br&gt;
But when we need a "durable" write (that is, the data must be saved to disk permanently, not just updated in memory), we must ensure the data is flushed to disk.&lt;/p&gt;

&lt;p&gt;Even here there are 2 solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Special syscall to flush data to disk&lt;/li&gt;
&lt;li&gt;Bypass page cache and always write to disk&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Special syscall to flush data
&lt;/h3&gt;

&lt;p&gt;This solution uses a separate syscall that ensures all our write requests have been completed.&lt;/p&gt;
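&lt;p&gt;Applied to the password scenario above, the idea can be sketched in Go as follows. The helper &lt;code&gt;saveDurably&lt;/code&gt; is a hypothetical name; it relies on &lt;code&gt;File.Sync&lt;/code&gt; from the standard library, which calls &lt;code&gt;fsync&lt;/code&gt; on Linux:&lt;/p&gt;

```go
package main

import "os"

// saveDurably is a hypothetical helper: it reports success only after the
// flush syscall has returned, not when the data has merely been copied
// into the page cache.
func saveDurably(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	// At this point a successful Write only means "the page is dirty".
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Flush the dirty pages of this file before telling anyone it is saved.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}
```

&lt;p&gt;Only when &lt;code&gt;saveDurably&lt;/code&gt; returns &lt;code&gt;nil&lt;/code&gt; should the application tell the user that the update is complete.&lt;/p&gt;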
&lt;h4&gt;
  
  
  Linux
&lt;/h4&gt;

&lt;p&gt;In Linux we can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;fdatasync(fd)&lt;/code&gt; - ensures that &lt;em&gt;data in memory&lt;/em&gt; and &lt;em&gt;on disk&lt;/em&gt; are synchronized, that is, it flushes pending ("dirty") pages.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync(fd)&lt;/code&gt; - the same as &lt;code&gt;fdatasync(fd)&lt;/code&gt;, but also flushes file metadata (e.g. last access time). Usually, you don't need this and should prefer &lt;code&gt;fdatasync&lt;/code&gt; for performance.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sync_file_range(fd, range)&lt;/code&gt; - flushes a specific range of a file to disk (according to &lt;code&gt;man&lt;/code&gt;, this is a very dangerous function, so I list it here only for completeness)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sync()&lt;/code&gt; - flushes dirty pages to disk for &lt;em&gt;all files&lt;/em&gt;, not just a specified one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As mentioned above, &lt;code&gt;fdatasync&lt;/code&gt; synchronizes only the contents, without the metadata that &lt;code&gt;fsync&lt;/code&gt; also flushes, so it is faster.&lt;br&gt;
etcd &lt;a href="https://github.com/etcd-io/etcd/blob/6f55dfa26e1a359e47e1fb15af79951e97dbac39/client/pkg/fileutil/sync_linux.go#L32" rel="noopener noreferrer"&gt;documents this in comments&lt;/a&gt; and explicitly separates the two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Fsync is a wrapper around file.Sync(). Special handling is needed on darwin platform.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Fdatasync is similar to fsync(), but does not flush modified metadata&lt;/span&gt;
&lt;span class="c"&gt;// unless that metadata is needed in order to allow a subsequent data retrieval&lt;/span&gt;
&lt;span class="c"&gt;// to be correctly handled.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fd&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Windows
&lt;/h4&gt;

&lt;p&gt;On Windows, there are these functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;_commit&lt;/code&gt; - flushes data from file to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FlushFileBuffers(fd)&lt;/code&gt; - causes all buffered data to be written to a file (I'm not a Windows developer, so I do not know how it differs from the previous one)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NtFlushBuffersFileEx(fd, params)&lt;/code&gt; - also flushes the page cache, but it is a lower-level function (marked with &lt;code&gt;NTSYSCALLAPI&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;P.S. &lt;code&gt;NtFlushBuffersFileEx&lt;/code&gt; is &lt;a href="https://github.com/postgres/postgres/blob/874d817baa160ca7e68bee6ccc9fc1848c56e750/src/port/win32fdatasync.c#L40" rel="noopener noreferrer"&gt;used in Postgres&lt;/a&gt; to implement a cross-platform &lt;code&gt;fdatasync&lt;/code&gt;, while &lt;code&gt;fsync&lt;/code&gt; is mapped to &lt;code&gt;_commit&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/9acae56ce0b0812f3e940cf1f87e73e8d5784e78/src/include/port/win32_port.h#L85&lt;/span&gt;
&lt;span class="cm"&gt;/* Windows doesn't have fsync() as such, use _commit() */&lt;/span&gt;
&lt;span class="cp"&gt;#define fsync(fd) _commit(fd)
&lt;/span&gt;
&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/874d817baa160ca7e68bee6ccc9fc1848c56e750/src/port/win32fdatasync.c#L23&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="nf"&gt;fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_NtFlushBuffersFileEx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;FLUSH_FLAGS_FILE_DATA_SYNC_ONLY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;iosb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  macOS
&lt;/h4&gt;

&lt;p&gt;macOS is also worth mentioning. Although it is POSIX-compliant, a single call to &lt;code&gt;fsync&lt;/code&gt; is not enough: an &lt;code&gt;fcntl(F_FULLFSYNC)&lt;/code&gt; call is required.&lt;br&gt;
This is even stated in the &lt;a href="https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the &lt;em&gt;F_FULLFSYNC fcntl&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, this is &lt;a href="https://github.com/etcd-io/etcd/blob/6f55dfa26e1a359e47e1fb15af79951e97dbac39/client/pkg/fileutil/sync_darwin.go#L36" rel="noopener noreferrer"&gt;handled in etcd&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Fsync on HFS/OSX flushes the data on to the physical drive but the drive&lt;/span&gt;
&lt;span class="c"&gt;// may not write it to the persistent media for quite sometime and it may be&lt;/span&gt;
&lt;span class="c"&gt;// written in out-of-order sequence. Using F_FULLFSYNC ensures that the&lt;/span&gt;
&lt;span class="c"&gt;// physical drive's buffer will also get flushed to the media.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FcntlInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;F_FULLFSYNC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Fdatasync on darwin platform invokes fcntl(F_FULLFSYNC) for actual persistence&lt;/span&gt;
&lt;span class="c"&gt;// on physical drive media.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bypass OS page cache
&lt;/h3&gt;

&lt;p&gt;The second solution requires us to pass special flags when opening the file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Linux
&lt;/h4&gt;

&lt;p&gt;Linux requires passing 2 flags, &lt;code&gt;O_SYNC | O_DIRECT&lt;/code&gt;, to &lt;code&gt;open&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;O_SYNC&lt;/code&gt; - every write is flushed to disk immediately. There is also &lt;code&gt;O_DSYNC&lt;/code&gt;, which gives &lt;code&gt;fdatasync&lt;/code&gt; semantics instead of &lt;code&gt;fsync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;O_DIRECT&lt;/code&gt; - writes bypass the page cache (equivalent to disabling it entirely for this file)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using &lt;code&gt;O_SYNC&lt;/code&gt;/&lt;code&gt;O_DSYNC&lt;/code&gt; is equivalent to calling &lt;code&gt;fsync&lt;/code&gt;/&lt;code&gt;fdatasync&lt;/code&gt; right after we call &lt;code&gt;write&lt;/code&gt;/&lt;code&gt;pwrite&lt;/code&gt;.&lt;/p&gt;
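&lt;p&gt;A minimal Go sketch of the flag-based approach, using the portable &lt;code&gt;os.O_SYNC&lt;/code&gt; constant. The helpers &lt;code&gt;openSynced&lt;/code&gt; and &lt;code&gt;appendRecord&lt;/code&gt; are hypothetical names. &lt;code&gt;O_DIRECT&lt;/code&gt; is deliberately left out of the sketch, since it puts alignment requirements on buffers and is not supported by every file system:&lt;/p&gt;

```go
package main

import "os"

// openSynced is a hypothetical helper: it opens a file so that every Write
// returns only after the data has been flushed, as if a sync call followed
// each write.
func openSynced(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND|os.O_SYNC, 0o644)
}

// appendRecord writes one record through the synchronous descriptor.
func appendRecord(path string, record []byte) error {
	f, err := openSynced(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(record); err != nil {
		f.Close()
		return err
	}
	// No explicit Sync call here: O_SYNC already made the write synchronous.
	return f.Close()
}
```

&lt;p&gt;The trade-off is that every single write pays the flush cost, so this fits small, infrequent records better than bulk output.&lt;/p&gt;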

&lt;p&gt;Also, with &lt;code&gt;fcntl&lt;/code&gt; you can change the &lt;code&gt;O_DIRECT&lt;/code&gt; flag, but not &lt;code&gt;O_SYNC&lt;/code&gt;.&lt;br&gt;
This is specified in the documentation for &lt;code&gt;fcntl&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;... It is not possible to change the O_DSYNC and O_SYNC flags; see BUGS, below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Windows
&lt;/h4&gt;

&lt;p&gt;In Windows we have similar flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FILE_FLAG_NO_BUFFERING&lt;/code&gt; - disables OS page buffering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FILE_FLAG_WRITE_THROUGH&lt;/code&gt; - all writes are flushed to disk without buffering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The documentation &lt;a href="https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilea#caching-behavior" rel="noopener noreferrer"&gt;describes&lt;/a&gt; the behaviour when both are specified:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING are both specified, so that system caching is not in effect, then the data is immediately flushed to disk without going through the Windows system cache. The operating system also requests a write-through of the hard disk's local hardware cache to persistent media.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And &lt;a href="https://devblogs.microsoft.com/oldnewthing/20210729-00/?p=105494" rel="noopener noreferrer"&gt;this blog article&lt;/a&gt; provides a comparison matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
  &lt;td colspan="2" rowspan="2"&gt;&lt;/td&gt;
  &lt;td colspan="2"&gt;NO_BUFFERING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clear&lt;/td&gt;
&lt;td&gt;Set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td rowspan="2"&gt;WRITE_THROUGH&lt;/td&gt;
  &lt;td&gt;Clear&lt;/td&gt;
  &lt;td&gt;Writes go into cache&lt;br&gt;
Lazily written to disk&lt;br&gt;
No hardware flush&lt;/td&gt;
  &lt;td&gt;Writes bypass cache&lt;br&gt;
Immediately written to disk&lt;br&gt;
No hardware flush&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Set&lt;/td&gt;
  &lt;td&gt;Writes go into cache&lt;br&gt;
Immediately written to disk&lt;br&gt;
Hardware flush&lt;/td&gt;
  &lt;td&gt;Writes bypass cache&lt;br&gt;
Immediately written to disk&lt;br&gt;
Hardware flush&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;Postgres uses this mapping between Windows/Linux flags:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;th&gt;Unix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FILE_FLAG_NO_BUFFERING&lt;/td&gt;
&lt;td&gt;O_DIRECT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FILE_FLAG_WRITE_THROUGH&lt;/td&gt;
&lt;td&gt;O_DSYNC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/30e144287a72529c9cd9fd6b07fe96eb8a1e270e/src/port/open.c#L65&lt;/span&gt;
&lt;span class="n"&gt;HANDLE&lt;/span&gt;
&lt;span class="nf"&gt;pgwin32_open_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fileFlags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;backup_semantics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CreateFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="c1"&gt;// ...&lt;/span&gt;
         &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;fileFlags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;O_DIRECT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;FILE_FLAG_NO_BUFFERING&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
         &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;fileFlags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;O_DSYNC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;FILE_FLAG_WRITE_THROUGH&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;INVALID_HANDLE_VALUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Directory sync
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Linux
&lt;/h4&gt;

&lt;p&gt;But that's not all. The Unix way is that everything is a file, &lt;em&gt;even a directory&lt;/em&gt;. A directory is a file, but its read/write operations are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read&lt;/code&gt; - iterate directory entries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write&lt;/code&gt; - create/delete entry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reads are not very interesting for us, but writes are: when we create or delete a file (moving a file is both operations), we must ensure these operations are flushed by calling &lt;code&gt;fsync(directory_fd)&lt;/code&gt;.&lt;br&gt;
This is &lt;a href="https://man7.org/linux/man-pages/man2/fsync.2.html#DESCRIPTION" rel="noopener noreferrer"&gt;documented&lt;/a&gt; as well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Calling &lt;code&gt;fsync()&lt;/code&gt; does not necessarily ensure that the entry in the directory containing the file has also reached disk.&lt;br&gt;
For that an explicit &lt;code&gt;fsync()&lt;/code&gt; on a file descriptor for the directory is also needed.&lt;/p&gt;
&lt;/blockquote&gt;
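&lt;p&gt;A minimal Go sketch of this two-step flush on Linux. The helper &lt;code&gt;createDurably&lt;/code&gt; is a hypothetical name; the directory is opened read-only purely to obtain a descriptor that can be passed to &lt;code&gt;fsync&lt;/code&gt;:&lt;/p&gt;

```go
package main

import (
	"os"
	"path/filepath"
)

// createDurably is a hypothetical helper: it makes both the file contents
// and the directory entry that points to the file durable.
func createDurably(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Step 1: flush the file itself.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// Step 2: flush the parent directory, so the new entry survives a crash.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	if err := dir.Sync(); err != nil {
		dir.Close()
		return err
	}
	return dir.Close()
}
```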
&lt;h4&gt;
  
  
  Windows
&lt;/h4&gt;

&lt;p&gt;On Windows, we cannot do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A directory cannot be opened this way: we get &lt;code&gt;EACCES&lt;/code&gt; when we try&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FlushFileBuffers&lt;/code&gt; works only with files or with a whole volume (all files on the volume), and the latter requires elevated privileges.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again, Postgres uses the *nix style: it calls &lt;code&gt;fsync&lt;/code&gt; for directories, but for Windows it &lt;a href="https://github.com/postgres/postgres/blob/874d817baa160ca7e68bee6ccc9fc1848c56e750/src/backend/storage/file/fd.c#L3797" rel="noopener noreferrer"&gt;has an additional check&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * fsync_fname_ext -- Try to fsync a file or directory
 *
 * If ignore_perm is true, ignore errors upon trying to open unreadable
 * files. Logs other errors at a caller-specified level.
 *
 * Returns 0 if the operation succeeded, -1 otherwise.
 */&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="nf"&gt;fsync_fname_ext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;isdir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;ignore_perm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;elevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="cm"&gt;/*
  * Some OSes don't allow us to fsync directories at all, so we can ignore
  * those errors. Anything else needs to be logged.
  */&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isdir&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;EBADF&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;errno&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;EINVAL&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  fsync errors
&lt;/h2&gt;

&lt;p&gt;Even if &lt;code&gt;write&lt;/code&gt; returned a success status code, it does not mean that the write was successful: recall that we use buffering, so we have only updated an in-memory page.&lt;/p&gt;

&lt;p&gt;When we call &lt;code&gt;fsync&lt;/code&gt;, it can return an error. This was shown in the code above.&lt;br&gt;
But what should we do when we get an error code from &lt;code&gt;fsync&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Generally: halt immediately. If you are developing for a specific OS, it depends. Why? The main reason is that we can no longer be sure about consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File system can be broken&lt;/li&gt;
&lt;li&gt;File can have a hole (invalid write)&lt;/li&gt;
&lt;li&gt;"Dirty" pages can be just marked "clean" and no longer flushed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter may cause the file contents in memory to differ from those on disk, and we will not notice it.&lt;/p&gt;

&lt;p&gt;Marking pages as "clean" is part of the Linux implementation: it assumes that the file system completes the write operation correctly.&lt;/p&gt;

&lt;p&gt;Postgres hackers made a brief overview of how different operating systems behave on an &lt;code&gt;fsync&lt;/code&gt; error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Darwin/macOS - invalidate buffers&lt;/li&gt;
&lt;li&gt;OpenBSD - invalidate buffers&lt;/li&gt;
&lt;li&gt;NetBSD - invalidate buffers&lt;/li&gt;
&lt;li&gt;FreeBSD - remain "dirty"&lt;/li&gt;
&lt;li&gt;Linux (after 4.16) - marked "clean"&lt;/li&gt;
&lt;li&gt;Windows - unknown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, if we are developing a cross-platform application, the state of the buffers after an &lt;code&gt;fsync&lt;/code&gt; error is undefined.&lt;br&gt;
But when developing for a specific OS, it is defined.&lt;/p&gt;
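&lt;p&gt;One way to encode the "halt, do not retry" policy is sketched below in Go. All names (&lt;code&gt;syncer&lt;/code&gt;, &lt;code&gt;failStopFile&lt;/code&gt;, &lt;code&gt;flakySyncer&lt;/code&gt;) are hypothetical. &lt;code&gt;flakySyncer&lt;/code&gt; is a test double imitating a system where the first &lt;code&gt;fsync&lt;/code&gt; fails and a later one "succeeds" because the lost pages were already marked clean:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// syncer is the only part of a file this sketch needs (os.File satisfies it).
type syncer interface {
	Sync() error
}

// failStopFile is a hypothetical wrapper: after the first failed Sync it
// never calls Sync again and keeps returning the original error.
type failStopFile struct {
	file   syncer
	broken error
}

func newFailStopFile(s syncer) *failStopFile {
	f := new(failStopFile)
	f.file = s
	return f
}

func (f *failStopFile) Sync() error {
	if f.broken != nil {
		// A retry could report success even though data was lost.
		return f.broken
	}
	if err := f.file.Sync(); err != nil {
		f.broken = fmt.Errorf("fsync failed, on-disk state is unknown: %w", err)
		return f.broken
	}
	return nil
}

// flakySyncer is a test double: the first Sync fails, later ones succeed.
type flakySyncer struct {
	calls int
}

func (s *flakySyncer) Sync() error {
	s.calls++
	if s.calls == 1 {
		return errors.New("input/output error")
	}
	return nil
}
```

&lt;p&gt;With this wrapper a naive retry loop never sees a misleading success; the caller has to stop and recover from a known good state instead.&lt;/p&gt;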
&lt;h3&gt;
  
  
  fsyncgate
&lt;/h3&gt;

&lt;p&gt;We cannot talk about &lt;code&gt;fsync&lt;/code&gt; errors without touching on fsyncgate.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fsyncgate&lt;/code&gt; is the name of a bug found in Postgres that caused data loss. More information can be found &lt;a href="https://wiki.postgresql.org/wiki/Fsync_Errors" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
Briefly, the developers initially thought the &lt;code&gt;fsync&lt;/code&gt; semantics were as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If &lt;code&gt;fsync()&lt;/code&gt; completed successfully, then all writes since last &lt;em&gt;successful&lt;/em&gt; &lt;code&gt;fsync()&lt;/code&gt; were flushed to disk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words: keep calling &lt;code&gt;fsync&lt;/code&gt; until it returns a successful status code.&lt;br&gt;
But as we saw earlier, this is a misconception. In reality, if &lt;code&gt;fsync&lt;/code&gt; returns an error code, the "dirty" pages may simply be forgotten.&lt;br&gt;
Therefore, it is more correct to state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If &lt;code&gt;fsync()&lt;/code&gt; completed successfully, then all writes since last &lt;del&gt;successful&lt;/del&gt; &lt;code&gt;fsync()&lt;/code&gt; were flushed to disk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was noticed in 2018, and the response was discussed on the &lt;a href="https://www.postgresql.org/message-id/CAMsr%2BYFNivjj1eYX0-%3DjfaAi8u%2BQ6CSOXN82_xuALzXAdpWe-Q%40mail.gmail.com" rel="noopener noreferrer"&gt;mailing list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The bug was fixed in version 12 (with backpatching) in &lt;a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1" rel="noopener noreferrer"&gt;this commit&lt;/a&gt; using the following code (it raises the error level):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/30e144287a72529c9cd9fd6b07fe96eb8a1e270e/src/backend/storage/file/fd.c#L3936&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="nf"&gt;data_sync_elevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;elevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_sync_retry&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;elevel&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PANIC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Any function, for example this one - https://github.com/postgres/postgres/blob/30e144287a72529c9cd9fd6b07fe96eb8a1e270e/src/backend/access/heap/rewriteheap.c#L1132&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;sample_function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Any fsync call in the logic&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ereport(ERROR,&lt;/span&gt;
       &lt;span class="n"&gt;ereport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_sync_elevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errcode_for_file_access&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                 &lt;span class="n"&gt;errmsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"could not fsync file &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: %m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;
  Page cache in databases
  &lt;p&gt;Disk operations are one of the most important parts of any database. So many databases implement their own page cache manager (subsystem) and do not rely on the OS.&lt;/p&gt;

&lt;p&gt;This adds some improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More effective IO operations scheduling (durable write should occur after &lt;code&gt;COMMIT&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;More effective (potentially) page eviction algorithms&lt;/li&gt;
&lt;li&gt;Size of page and whole cache can be adjusted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are examples from several databases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DBMS&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Page eviction algorithm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/postgres/postgres/blob/d13ff82319ccaacb04d77b77a010ea7a1717564f/src/include/storage/bufmgr.h" rel="noopener noreferrer"&gt;bufmgr.h&lt;/a&gt;, &lt;a href="https://github.com/postgres/postgres/blob/d13ff82319ccaacb04d77b77a010ea7a1717564f/src/backend/storage/buffer/bufmgr.c" rel="noopener noreferrer"&gt;bufmgr.c&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;clock-sweep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;sources N/A&lt;/td&gt;
&lt;td&gt;LRU-2 (LRU-K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;sources N/A&lt;/td&gt;
&lt;td&gt;LRU, Temperature-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL (InnoDB)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/storage/innobase/include/buf0buf.h" rel="noopener noreferrer"&gt;buf0buf.h&lt;/a&gt;, &lt;a href="https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/storage/innobase/buf/buf0buf.cc" rel="noopener noreferrer"&gt;buf0buf.cc&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;LRU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
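
&lt;p&gt;To make the clock-sweep entry in the table more concrete, here is a minimal sketch of the idea. It is a toy model and not Postgres code: the class name &lt;code&gt;ClockSweepCache&lt;/code&gt;, the usage counter cap and the dictionary-based lookup are assumptions made for illustration.&lt;/p&gt;

```python
class ClockSweepCache:
    """Toy buffer pool with clock-sweep eviction (illustrative only)."""

    MAX_USAGE = 5  # assumed cap on the per-slot usage counter

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = [None] * capacity   # page id stored in each slot
        self.usage = [0] * capacity      # usage counter per slot
        self.slot_of = {}                # page id -> slot index
        self.hand = 0                    # current position of the clock hand

    def access(self, page_id):
        """Touch a page; return the id of the evicted page, if any."""
        slot = self.slot_of.get(page_id)
        if slot is not None:
            # Hit: bump the usage counter, capped at MAX_USAGE.
            self.usage[slot] = min(self.usage[slot] + 1, self.MAX_USAGE)
            return None
        # Miss: sweep until a slot with usage 0 is found,
        # decrementing every counter the hand passes over.
        while self.usage[self.hand] != 0:
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % self.capacity
        victim = self.hand
        evicted = self.pages[victim]
        if evicted is not None:
            del self.slot_of[evicted]
        self.pages[victim] = page_id
        self.usage[victim] = 1
        self.slot_of[page_id] = victim
        self.hand = (self.hand + 1) % self.capacity
        return evicted
```

&lt;p&gt;Frequently accessed pages accumulate a higher counter and therefore survive more passes of the hand, which approximates LRU without maintaining an ordered list.&lt;/p&gt;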

&lt;p&gt;As I said, a custom page cache manager can optimize database workloads, and a smart page eviction algorithm is not the only place to optimize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Oracle and SQL Server can use flash disks as temporary page cache storage instead of flushing pages to the main disk (setting &lt;code&gt;DB_FLASH_CACHE_FILE&lt;/code&gt; for Oracle and BPE (buffer pool extension) for SQL Server)&lt;/li&gt;
&lt;li&gt;Postgres has the &lt;code&gt;pg_prewarm&lt;/code&gt; extension to "warm" the page cache&lt;/li&gt;
&lt;/ul&gt;



&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I searched the Linux kernel source code and found the &lt;code&gt;fsync&lt;/code&gt; implementation &lt;a href="https://github.com/torvalds/linux/blob/67be068d31d423b857ffd8c34dbcc093f8dfff76/fs/buffer.c#L769" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  File system
&lt;/h2&gt;

&lt;p&gt;We usually say "write to disk", but the disk is only the final destination. There is another layer the data has to pass through first: the file system.&lt;/p&gt;

&lt;p&gt;Nowadays all data is stored in file systems. Personally, I do not know of any system that accesses block devices directly using either &lt;a href="https://en.wikipedia.org/wiki/Logical_block_addressing" rel="noopener noreferrer"&gt;LBA&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Cylinder-head-sector" rel="noopener noreferrer"&gt;CHS&lt;/a&gt; - you just write at some offset in a file.&lt;br&gt;
So, before data is written to disk, it must be processed by the file system.&lt;/p&gt;

&lt;p&gt;Today there is a &lt;a href="https://en.wikipedia.org/wiki/Comparison_of_file_systems" rel="noopener noreferrer"&gt;huge number&lt;/a&gt; of different file systems, so I will focus primarily on a small subset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ext family&lt;/li&gt;
&lt;li&gt;btrfs&lt;/li&gt;
&lt;li&gt;xfs&lt;/li&gt;
&lt;li&gt;ntfs&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;I chose them because they are the most popular, in my opinion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;File systems can be characterized by many aspects (e.g. maximum file name length), but here we are interested in those related to fault tolerance.&lt;/p&gt;
&lt;h3&gt;
  
  
  File system integrity
&lt;/h3&gt;

&lt;p&gt;From a programming perspective, the file system is a global, shared object. Everyone has access to it and can modify it.&lt;br&gt;
Taking a global lock during write operations seems like a bad idea, so different file systems use a couple of other mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Journaling (ext family, ntfs, xfs) - the file system maintains a log of operations and can replay it after a failure&lt;/li&gt;
&lt;li&gt;COW, copy-on-write (btrfs, zfs) - blocks of data are not modified in place; instead, a new block is created with the modified data&lt;/li&gt;
&lt;li&gt;Log-structured - the file system itself is a log of operations&lt;/li&gt;
&lt;/ul&gt;
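
&lt;p&gt;The journaling idea can be illustrated with a toy replay routine. This is only a sketch under simplifying assumptions: the record layout (&lt;code&gt;"write"&lt;/code&gt; and &lt;code&gt;"commit"&lt;/code&gt; tuples), the transaction ids and the dictionary used as a "disk" are all hypothetical, and real file systems are far more involved.&lt;/p&gt;

```python
def replay(journal, disk):
    """Apply only committed transactions from the journal to the disk image."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _kind, _txid, block_no, data = rec
            disk[block_no] = data
    return disk


# Transaction 1 is complete; transaction 2 was cut off by a crash.
journal = [
    ("write", 1, 10, b"inode: size=4096"),
    ("write", 1, 11, b"data block"),
    ("commit", 1),
    ("write", 2, 10, b"inode: size=8192"),
    # crash: no commit record for transaction 2
]
disk = replay(journal, {10: b"inode: size=0"})
```

&lt;p&gt;After replay the disk contains the changes of transaction 1 only; the half-finished transaction 2 is ignored because its commit record never reached the journal.&lt;/p&gt;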

&lt;p&gt;Not all file systems support such mechanisms, e.g. ext2 is not a journaling file system, so all modifications are applied directly.&lt;/p&gt;

&lt;p&gt;That seems to be the answer: just choose a file system with journaling/COW and you are done. But no.&lt;/p&gt;

&lt;p&gt;The study &lt;a href="https://research.cs.wisc.edu/wind/Publications/sfa-dsn05.pdf" rel="noopener noreferrer"&gt;Model-Based Failure Analysis of Journaling File Systems&lt;/a&gt; reveals that even journaled file systems can end up in an inconsistent state.&lt;br&gt;
Three file systems (ext3, reiserfs, jfs) were examined for their behaviour when a block write error occurs. The results are presented in the table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i478bjs7zai6cjhfodt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i478bjs7zai6cjhfodt.png" alt="Errors in journaled file systems"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, every file system can leave the contents of a block (Data Block) in an inconsistent state (DC, Data Corruption).&lt;br&gt;
Conclusion: do not rely heavily on file system journaling.&lt;/p&gt;

&lt;p&gt;I also found &lt;a href="https://pages.cs.wisc.edu/~laksh/research/Bairavasundaram-ThesisWithFront.pdf" rel="noopener noreferrer"&gt;research on NTFS&lt;/a&gt; (163 p.) - even with metadata redundancy, the file system can be corrupted and become unrecoverable.&lt;/p&gt;

&lt;p&gt;What about COW and log-structured file systems?&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://elinux.org/images/b/b6/EMMC-SSD_File_System_Tuning_Methodology_v1.0.pdf" rel="noopener noreferrer"&gt;this research&lt;/a&gt; (26 p.) btrfs was tested on an SSD, and after several power outages the file system became unusable and unrecoverable.&lt;br&gt;
Journaled ext4, on the other hand, survived 1406 of them.&lt;/p&gt;

&lt;p&gt;As for log-structured file systems, I could not find any research on them.&lt;/p&gt;
&lt;h3&gt;
  
  
  Magic number
&lt;/h3&gt;

&lt;p&gt;We can single out two main components of a file that must be updated: the file size and the data blocks.&lt;br&gt;
Which one should be updated first?&lt;/p&gt;

&lt;p&gt;If the file size is updated first, then after a failure the file may contain garbage: the file size has already grown, but the new block holds whatever data happened to be there.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This can be compared to the garbage stored in uninitialized variables on the stack or in freshly allocated memory (e.g. after &lt;code&gt;malloc&lt;/code&gt; in C)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the file system's perspective, there is no inconsistency. From the application's perspective, there may be one: garbage in the data.&lt;/p&gt;

&lt;p&gt;We can protect ourselves by using special markers in the file: after reading a chunk of data, we first check that the marker is present.&lt;br&gt;
Usually, it is some predefined constant.&lt;/p&gt;
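
&lt;p&gt;A minimal sketch of such a marker check might look like this. The marker value &lt;code&gt;0xFEED&lt;/code&gt;, the header layout (marker followed by payload length) and the function names are assumptions made for illustration; they are not taken from any of the systems mentioned in this article.&lt;/p&gt;

```python
import struct

MAGIC = 0xFEED                  # hypothetical marker value
HEADER = struct.Struct("!HI")   # marker (2 bytes) + payload length (4 bytes)


def pack_chunk(payload):
    """Prefix the payload with the marker and its length."""
    return HEADER.pack(MAGIC, len(payload)) + payload


def unpack_chunk(raw):
    """Return the payload, or None if the chunk looks like garbage."""
    header = raw[:HEADER.size]
    if len(header) != HEADER.size:
        return None                       # too short to hold a header
    magic, length = HEADER.unpack(header)
    if magic != MAGIC:
        return None                       # marker missing: not our data
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None                       # chunk was cut off
    return payload
```

&lt;p&gt;A block of zeroes or leftover data from a previously deleted file will almost certainly fail the marker comparison, so the reader can stop there instead of interpreting garbage as a valid record.&lt;/p&gt;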

&lt;p&gt;This approach is used by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres - adds a "magic number" to the header of each WAL page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  Postgres code
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
* Each page of XLOG file has a header like this:
*/&lt;/span&gt;
&lt;span class="cp"&gt;#define XLOG_PAGE_MAGIC 0xD114 &lt;/span&gt;&lt;span class="cm"&gt;/* can be used as WAL version indicator */&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;XLogPageHeaderData&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;uint16&lt;/span&gt;  &lt;span class="n"&gt;xlp_magic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="cm"&gt;/* magic value for correctness checks */&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;XLogPageHeaderData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka - the &lt;a href="https://github.com/apache/kafka/blob/9bc9fae9425e4dac64ef078cd3a4e7e6e09cc45a/clients/src/main/java/org/apache/kafka/common/record/FileLogInputStream.java#L63" rel="noopener noreferrer"&gt;"magic number"&lt;/a&gt; is used as a version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  Kafka magic number
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FileLogInputStream&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;LogInputStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;FileLogInputStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FileChannelRecordBatch&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;FileChannelRecordBatch&lt;/span&gt; &lt;span class="nf"&gt;nextBatch&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;

        &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;magic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logHeaderBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;MAGIC_OFFSET&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;FileChannelRecordBatch&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;magic&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nc"&gt;RecordBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MAGIC_VALUE_V2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LegacyFileChannelRecordBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;magic&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fileRecords&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultFileChannelRecordBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;magic&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fileRecords&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQLite - uses the &lt;a href="https://github.com/sqlite/sqlite/blob/f79b0bdcbfb46164cfd665d256f2862bf3f42a7c/src/btree.c#L22" rel="noopener noreferrer"&gt;ASCII constant string "SQLite format 3"&lt;/a&gt; for the main database file and &lt;a href="https://github.com/sqlite/sqlite/blob/f79b0bdcbfb46164cfd665d256f2862bf3f42a7c/src/pager.c#L754" rel="noopener noreferrer"&gt;random bytes&lt;/a&gt; for the journal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  SQLite magic numbers
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Header for undo journal&lt;/span&gt;

&lt;span class="cm"&gt;/*
** Journal files begin with the following magic string.  The data
** was obtained from /dev/random.  It is used only as a sanity check.
*/&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;aJournalMagic&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mh"&gt;0xd9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xd5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xf9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xa1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xd7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;readJournalHdr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="cm"&gt;/* Pager object */&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;isHot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="n"&gt;journalSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="cm"&gt;/* Size of the open journal file in bytes */&lt;/span&gt;
&lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pNRec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="cm"&gt;/* OUT: Value read from the nRec field */&lt;/span&gt;
&lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pDbSize&lt;/span&gt;                 &lt;span class="cm"&gt;/* OUT: Value of original database size field */&lt;/span&gt;
&lt;span class="p"&gt;){&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                      &lt;span class="cm"&gt;/* Return code */&lt;/span&gt;
&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;     &lt;span class="cm"&gt;/* A buffer to hold the magic header */&lt;/span&gt;
&lt;span class="n"&gt;i64&lt;/span&gt; &lt;span class="n"&gt;iHdrOff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                 &lt;span class="cm"&gt;/* Offset of journal header being read */&lt;/span&gt;

&lt;span class="cm"&gt;/* Read in the first 8 bytes of the journal header. If they do not match
** the  magic string found at the start of each journal header, return
** SQLITE_DONE. If an IO error occurs, return an error code. Otherwise,
** proceed.
*/&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;isHot&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;iHdrOff&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalHdr&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;iHdrOff&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;memcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aJournalMagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SQLITE_DONE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Header of main database file&lt;/span&gt;

&lt;span class="cp"&gt;#ifndef SQLITE_FILE_HEADER &lt;/span&gt;&lt;span class="cm"&gt;/* 123456789 123456 */&lt;/span&gt;&lt;span class="cp"&gt;
#  define SQLITE_FILE_HEADER "SQLite format 3"
#endif
&lt;/span&gt;
&lt;span class="cm"&gt;/*
** The header string that appears at the beginning of every
** SQLite database.
*/&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;zMagicHeader&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_FILE_HEADER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;lockBtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BtShared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pBt&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;nPage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="cm"&gt;/* EVIDENCE-OF: R-43737-39999 Every valid SQLite database file begins
    ** with the following 16 bytes (in hex): 53 51 4c 69 74 65 20 66 6f 72 6d
    ** 61 74 20 33 00. */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;memcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zMagicHeader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;page1_init_failed&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="n"&gt;page1_init_failed&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;pBt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pPage1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;Additionally, there are some predefined headers for different file formats. For example, BOM, JPEG or PNG.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checksum
&lt;/h3&gt;

&lt;p&gt;But a magic number is just a constant that does not depend on the stored data - it can only be used to say definitely "this is garbage".&lt;/p&gt;

&lt;p&gt;What can go wrong? For example, we send a write request and it fails in the middle of the operation. Then the magic number in the header will be valid, but the contents of the file will not be.&lt;br&gt;
One might think that we should move the magic number to the end of the chunk, but who said that data is written from beginning to end and &lt;em&gt;not from end to beginning&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Even so, what happens if data in the middle of a page gets corrupted?&lt;br&gt;
The magic number will not detect this, so we need some small digest of our chunk.&lt;br&gt;
We store it together with the data and, after reading, compare it with the digest computed over the data we have just read to ensure integrity.&lt;/p&gt;

&lt;p&gt;This small digest is called a checksum. It is just a small fixed-size byte array computed by some hash function over the whole contents of the file.&lt;br&gt;
A checksum can be stored either at the beginning or at the end of the chunk.&lt;br&gt;
If you have the choice, I recommend storing it at the end because of data locality: the (OS) page holding it will most likely already be in cache (the same applies to the CPU caches, L1/L2/L3).&lt;/p&gt;
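
&lt;p&gt;As a rough illustration, a chunk with a trailing checksum could be written and verified as in the sketch below. It uses plain CRC-32 from Python's &lt;code&gt;zlib&lt;/code&gt; module rather than CRC32C, and the function names &lt;code&gt;seal&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; as well as the 4-byte trailer layout are assumptions for the example.&lt;/p&gt;

```python
import struct
import zlib

TRAILER = struct.Struct("!I")  # CRC-32 stored at the end of the chunk


def seal(payload):
    """Append the CRC-32 of the payload to the payload."""
    return payload + TRAILER.pack(zlib.crc32(payload))


def verify(chunk):
    """Return the payload if the trailing CRC-32 matches, otherwise None."""
    payload, trailer = chunk[:-TRAILER.size], chunk[-TRAILER.size:]
    if len(trailer) != TRAILER.size:
        return None                       # too short to hold a checksum
    (stored,) = TRAILER.unpack(trailer)
    if zlib.crc32(payload) != stored:
        return None                       # data does not match its digest
    return payload
```

&lt;p&gt;Unlike a magic number, the digest depends on every byte of the payload, so a flipped byte in the middle of the chunk makes &lt;code&gt;verify&lt;/code&gt; return &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;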

&lt;p&gt;Checksum is widely used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres - uses &lt;a href="https://github.com/postgres/postgres/blob/5f79cb7629a4ce6321f509694ebf475a931608b6/src/include/access/xlogrecord.h#L49" rel="noopener noreferrer"&gt;CRC32C for WAL records&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  Postgres WAL CRC
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;XLogRecord&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;pg_crc32c&lt;/span&gt; &lt;span class="n"&gt;xl_crc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="cm"&gt;/* CRC for this record */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;XLogRecord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;etcd - uses &lt;a href="https://github.com/etcd-io/etcd/blob/8b9909e20de0aeacb3e63a0df992347b2f683703/server/storage/wal/walpb/record.pb.go#L28" rel="noopener noreferrer"&gt;CRC for WAL records&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  etcd sliding crc
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Record&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;Crc&lt;/span&gt;                  &lt;span class="kt"&gt;uint32&lt;/span&gt;   &lt;span class="s"&gt;`protobuf:"varint,2,opt,name=crc" json:"crc"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KurrentDB (formerly EventStore) - uses &lt;a href="https://github.com/EventStore/EventStore/blob/e18b459c0f44c76ff7a1146d023070f5423b759a/src/EventStore.Core/TransactionLog/Chunks/ChunkFooter.cs#L21" rel="noopener noreferrer"&gt;MD5&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  EventStore MD5
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChunkFooter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;MD5Hash&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;h3&gt;
  
  
  File System consistency model
&lt;/h3&gt;

&lt;p&gt;Before going further, we should talk about the consistency model.&lt;/p&gt;

&lt;p&gt;In programming languages there is such a thing as a &lt;a href="https://en.wikipedia.org/wiki/Memory_model_(programming)" rel="noopener noreferrer"&gt;"memory model"&lt;/a&gt;.&lt;br&gt;
In short, a program can be described in terms of store/load operations (assigning a value to a variable/reading a variable's value), and the memory model describes which operations can be reordered (more precisely, which reorderings are &lt;em&gt;prohibited&lt;/em&gt;).&lt;br&gt;
Such reordering can improve performance, but once we move to multi-threaded/concurrent execution, these reorderings can break everything.&lt;br&gt;
Consider this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 1&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;244&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// 2&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;555&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In case 1, are we allowed to write &lt;code&gt;b&lt;/code&gt; first and only then &lt;code&gt;a&lt;/code&gt;, i.e. perform store/store reordering?&lt;/li&gt;
&lt;li&gt;In case 2, are we allowed to write &lt;code&gt;b&lt;/code&gt; first and only then read from &lt;code&gt;a&lt;/code&gt;, i.e. perform store/load reordering?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The memory model answers these questions. Programming languages and hardware architectures document their memory models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.cppreference.com/w/cpp/language/memory_model" rel="noopener noreferrer"&gt;C++&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md" rel="noopener noreferrer"&gt;C#&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.umd.edu/~pugh/java/memoryModel/jsr133.pdf" rel="noopener noreferrer"&gt;Java&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://peps.python.org/pep-0583/" rel="noopener noreferrer"&gt;Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://go.dev/ref/mem" rel="noopener noreferrer"&gt;Go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://doc.rust-lang.org/nomicon/atomics.html" rel="noopener noreferrer"&gt;Rust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24592.pdf" rel="noopener noreferrer"&gt;AMD64&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's leave programming languages for later. The main thing is that we &lt;strong&gt;can define such rules for the file system as well&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is called persistency semantics (the name comes from &lt;a href="https://people.mpi-sws.org/~viktor/papers/popl2021-persevere.pdf" rel="noopener noreferrer"&gt;this paper&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The study &lt;a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf" rel="noopener noreferrer"&gt;"All File Systems Are Not Created Equal"&lt;/a&gt; identified these "basic" operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File chunk overwrite&lt;/li&gt;
&lt;li&gt;Append data to file&lt;/li&gt;
&lt;li&gt;Rename&lt;/li&gt;
&lt;li&gt;Directory operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After conducting the experiment, this table was compiled:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ih53raozd3unmp699j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ih53raozd3unmp699j.png" alt="Persistency semantics of different file systems"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Also note that the table shows only &lt;em&gt;found&lt;/em&gt; bugs - if a bug was not found, it does not mean that it does not exist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For simplicity, I will keep using the term "journaled" for all file systems that have some kind of buffer in which operations are stored before being applied - COW, log-structured, journaled and so on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Atomicity
&lt;/h4&gt;

&lt;p&gt;A single logical operation may consist of multiple physical operations. For example, appending to a file (writing to its end) consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update size of file&lt;/li&gt;
&lt;li&gt;Add new data block&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Atomicity in this case means transactional atomicity, as in databases: if even a single operation fails, we must be able to roll the state back to the initial one (with the same metadata).&lt;/p&gt;

&lt;p&gt;We can draw the following conclusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No file system can atomically append multiple data blocks to a file (only a single block)&lt;/li&gt;
&lt;li&gt;Overwriting a single disk sector is atomic&lt;/li&gt;
&lt;li&gt;Non-journaled file systems almost never provide atomicity of operations&lt;/li&gt;
&lt;li&gt;Directory operations are almost always atomic, except on non-journaled file systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What happens in case of a failure during a non-atomic operation?&lt;br&gt;
In the worst case, the file system will be corrupted and you will have to run &lt;code&gt;fsck&lt;/code&gt; (and pray to God that it gets fixed).&lt;br&gt;
As for files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There will be garbage in the file (in case of append) - a new block was allocated and the length was updated, but the data was not written&lt;/li&gt;
&lt;li&gt;Part of a block will be overwritten - the overwrite operation failed in the middle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same study also presents a comparison matrix of the behaviour in case of a failure during some common file operation patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mu2l8nf9boo1bm55s0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mu2l8nf9boo1bm55s0q.png" alt="Observed errors in case of failure during file operation pattern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Legend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PA&lt;/code&gt; (Prefix Append) - safe append of new data blocks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ARVR&lt;/code&gt; (Atomic Replace Via Rename) - rename file to update large amount of data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ACVR&lt;/code&gt; (Atomic Create Via Rename) - create new file with initialized contents by renaming file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, no file system is perfect: each of them can become inconsistent if an operation is interrupted.&lt;/p&gt;

&lt;p&gt;
  Important ARVR/ACVR assumption made
  &lt;p&gt;The study covered the behaviour of &lt;code&gt;ACVR&lt;/code&gt;/&lt;code&gt;ARVR&lt;/code&gt; and showed that even these operations can be non-atomic.&lt;br&gt;
But there is a caveat: there is no &lt;code&gt;fsync&lt;/code&gt; call in those experiments.&lt;br&gt;
Take a look at the test specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Atomic Replace Via Rename (ARVR)
initial:
  g &amp;lt;- creat("file", 0600)
  write(g, old)
main:
  f &amp;lt;- creat("file.tmp", 0600)
  write(f, new)
  rename("file.tmp", "file")
exists?:
  content("file") ≠ old ∧ content("file") ≠ new

# Atomic create via rename (ACVR)
main:
  f &amp;lt;- creat("file.tmp", 0600)
  write(f, data)
  rename("file.tmp", "file")
exists?:
  content("file") ≠ ∅ ∧ content("file") ≠ data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, there is no &lt;code&gt;fsync&lt;/code&gt; call between &lt;code&gt;write&lt;/code&gt; and &lt;code&gt;rename&lt;/code&gt;. Here is what can happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New file is created&lt;/li&gt;
&lt;li&gt;Data is written to file&lt;/li&gt;
&lt;li&gt;File is renamed&lt;/li&gt;
&lt;li&gt;File system decides to perform the rename first (&lt;em&gt;operation reordering&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;!Failure!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After restart we observe that the file is empty, because only the &lt;code&gt;rename&lt;/code&gt; operation was performed, and it was applied to a still-empty file.&lt;/p&gt;

&lt;p&gt;To fix this, we should add an &lt;code&gt;fsync&lt;/code&gt; call before &lt;code&gt;rename&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But file system developers know about this pattern and have added some hacks (to flush data to disk before &lt;code&gt;rename&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ext4 - &lt;a href="https://man7.org/linux/man-pages/man5/ext4.5.html" rel="noopener noreferrer"&gt;mount option &lt;code&gt;auto_da_alloc&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;btrfs - &lt;a href="https://btrfs.readthedocs.io/en/latest/Administration.html" rel="noopener noreferrer"&gt;mount option &lt;code&gt;flushoncommit&lt;/code&gt;&lt;/a&gt; (description taken from &lt;a href="https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/FAQ.html#What_are_the_crash_guarantees_of_overwrite-by-rename.3F#What_are_the_crash_guarantees_of_overwrite-by-rename.3F" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but the wiki is archived, so I do not know whether it still holds)&lt;/li&gt;
&lt;li&gt;xfs - according to &lt;a href="https://www.spinics.net/lists/xfs/msg36717.html" rel="noopener noreferrer"&gt;mailing list&lt;/a&gt;, &lt;code&gt;the sync-after-rename behaviour was suggested and rejected for xfs&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the last question - is &lt;code&gt;rename&lt;/code&gt; itself atomic?&lt;br&gt;
Of course, we can write data to a temporary file and &lt;code&gt;fsync&lt;/code&gt; it, but what is the point if everything breaks during the &lt;code&gt;rename&lt;/code&gt; call?&lt;br&gt;
The &lt;a href="https://man7.org/linux/man-pages/man2/rename.2.html#DESCRIPTION" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; says the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, &lt;code&gt;rename&lt;/code&gt; is "atomic" only from the multi-&lt;em&gt;process&lt;/em&gt;ing point of view, but not from the fault-tolerance one.&lt;br&gt;
Then, we find:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means that in case of errors, the contents of &lt;code&gt;newpath&lt;/code&gt; stay the same.&lt;br&gt;
But still, there is no information about fault tolerance.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://www.gnu.org/software/libc/manual/html_node/Renaming-Files.html#Renaming-Files" rel="noopener noreferrer"&gt;GNU C documentation&lt;/a&gt; I have found the following behaviour description:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If there is a system crash during the operation, it is possible for both names to still exist; but newname will always be intact if it exists at all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally - the target file (the one being replaced) does not change in case of failure.&lt;/p&gt;

&lt;p&gt;So, the conclusion: &lt;code&gt;rename&lt;/code&gt; is atomic, but you should call &lt;code&gt;fsync&lt;/code&gt; before it to guarantee that the new data is actually present on disk and the file is not empty/half-full.&lt;/p&gt;
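
&lt;p&gt;The conclusion above can be sketched in code. This is a minimal Python sketch, not a definitive implementation: the function name &lt;code&gt;atomic_replace&lt;/code&gt; is hypothetical, and the final &lt;code&gt;fsync&lt;/code&gt; of the parent directory is an extra precaution assumed here (POSIX only), not something the study prescribes.&lt;/p&gt;

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at `path` with `data` (bytes): write, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target,
    # so the rename never has to cross a file system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Flush the contents BEFORE rename, otherwise the file system
            # may persist the rename of a still-empty file first.
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup of the temporary file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Assumption: also flush the directory entry (extra precaution, POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

&lt;p&gt;With this order of calls a reader of &lt;code&gt;path&lt;/code&gt; should see either the old contents or the new ones, which is exactly the property that ARVR is expected to give.&lt;/p&gt;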



&lt;/p&gt;

&lt;h4&gt;
  
  
  Reorderings
&lt;/h4&gt;

&lt;p&gt;As for operation reordering, we can conclude the following: if a file system is journaled, then the order of operations is preserved &lt;em&gt;in most cases&lt;/em&gt;.&lt;br&gt;
But that's not true for ext2, ext3-writeback, ext4-writeback and reiserfs-writeback, where operations can be reordered.&lt;/p&gt;

&lt;p&gt;Also, do not forget about directory operations - they are handled in the same way as regular file operations.&lt;br&gt;
That means that if we write to a file and then &lt;code&gt;rename&lt;/code&gt; it, the actual order of operations can be: &lt;code&gt;rename&lt;/code&gt; the empty file, then start appending data blocks.&lt;/p&gt;

&lt;p&gt;All file systems can do this, so always call &lt;code&gt;fsync&lt;/code&gt;!&lt;/p&gt;
&lt;h4&gt;
  
  
  Write barrier
&lt;/h4&gt;

&lt;p&gt;In memory models there is a notion of a "write barrier" - a mechanism that prohibits reordering of store/load operations.&lt;br&gt;
As you may notice, file systems also have such a mechanism - it is &lt;code&gt;fsync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We can give such semantics:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All write operations happen before &lt;code&gt;fsync&lt;/code&gt; (for the same file)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can say nothing about the reordering of write operations before &lt;code&gt;fsync&lt;/code&gt;, but we can definitely say that they will not be moved after &lt;code&gt;fsync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Actually, &lt;code&gt;fsync&lt;/code&gt; is not a real barrier, we can just use such semantics for it.&lt;br&gt;
A similar discussion has already taken place: there was a suggestion to add an &lt;code&gt;fbarrier&lt;/code&gt; syscall, but &lt;a href="https://lwn.net/Articles/326505/" rel="noopener noreferrer"&gt;Linus rejected&lt;/a&gt; this idea, considering that it would add unnecessary complexity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Other file systems
&lt;/h3&gt;

&lt;p&gt;The previous studies focused primarily on widely used *nix file systems, but of course there are many others.&lt;/p&gt;
&lt;h4&gt;
  
  
  NTFS
&lt;/h4&gt;

&lt;p&gt;NTFS is the "standard" file system on Windows. I didn't find any research about its fault tolerance.&lt;/p&gt;

&lt;p&gt;The only thing I can do is draw conclusions based on file system properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Only metadata is journaled - there may be garbage in files after operations&lt;/li&gt;
&lt;li&gt;Metadata is duplicated - in case of a hardware fault, you always have a "second chance" to recover it&lt;/li&gt;
&lt;li&gt;Has its own transactional API (TxF), but developers &lt;a href="https://learn.microsoft.com/en-us/windows/win32/fileio/deprecation-of-txf#abstract" rel="noopener noreferrer"&gt;should not use it&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  APFS
&lt;/h4&gt;

&lt;p&gt;APFS (Apple File System) is Apple's file system, intended to replace HFS+.&lt;/p&gt;

&lt;p&gt;According to papers and blogs (everything that I could find):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uses copy-on-write (not journaling) - the main reason is that it fits SSD well&lt;/li&gt;
&lt;li&gt;Has the &lt;code&gt;Atomic Safe-Save&lt;/code&gt; technique to guarantee atomic &lt;code&gt;rename&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Uses checksums for metadata only, not for user data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I have highlighted the last point specifically, because at first I misunderstood &lt;a href="https://danluu.com/filesystem-errors#error-detection" rel="noopener noreferrer"&gt;the article&lt;/a&gt; I relied on.&lt;br&gt;
It states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;apfs doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But, if you go to the referenced article, you will see:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;APFS checksums its own metadata but not user data&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The misunderstanding is caused by the fact that the referenced article compares APFS with ZFS (which checksums user data), while the first article does not mention this.&lt;/p&gt;
&lt;h3&gt;
  
  
  Application &lt;code&gt;fsync&lt;/code&gt; error handling
&lt;/h3&gt;

&lt;p&gt;In the previous section we talked about &lt;code&gt;fsync&lt;/code&gt; and its errors. We saw how the OS handles &lt;code&gt;fsync&lt;/code&gt; errors (in my opinion, &lt;code&gt;fsync&lt;/code&gt; should be a special function in the file system driver interface, so the OS should handle such errors as well), but how do real applications respond to them?&lt;/p&gt;

&lt;p&gt;Here we will use the paper &lt;a href="https://www.usenix.org/system/files/atc20-rebello.pdf" rel="noopener noreferrer"&gt;"Can Applications Recover from fsync Failures?"&lt;/a&gt;. The title is self-descriptive: the paper shows how different software and file systems behave when they encounter an error returned by &lt;code&gt;fsync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first table shows the behaviour of file systems:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FashenBlade%2Fhabr-posts%2Ffile-write%2Ffile-write%2Fimg%2Fcan-appinlications-recover-from-fsync-table-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FashenBlade%2Fhabr-posts%2Ffile-write%2Ffile-write%2Fimg%2Fcan-appinlications-recover-from-fsync-table-1.png" alt="Behaviour of file systems in case of fsync error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ext4 data - means journaled mode&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This table says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fsync&lt;/code&gt; errors arise only while writing data blocks or the journal. For metadata errors the behaviour differs: xfs and btrfs are remounted in read-only mode, whereas ext4 just logs the error (to the OS) and continues to work&lt;/li&gt;
&lt;li&gt;When an error occurs during data block writing, metadata is not rolled back. So the file size can be increased, but the file will contain garbage.&lt;/li&gt;
&lt;li&gt;After the file system recovers at runtime (an error occurred and the file system driver was unloaded and loaded again), the in-memory state can be left unmodified. For example, in btrfs the metadata can change after recovery, but the file descriptor is stale and its offset points beyond the end of the file.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;All file systems mark pages "clean", but that is because the tests were run on Linux&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next table shows behaviour of applications in case of &lt;code&gt;fsync&lt;/code&gt; error arising:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntmk0ujgs60w95kmn4h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntmk0ujgs60w95kmn4h1.png" alt="Application behaviour for fsync error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Legend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OV (old value) - the old value is returned instead of the new one&lt;/li&gt;
&lt;li&gt;FF (false failure) - an error is reported to the user, although the operation actually succeeded&lt;/li&gt;
&lt;li&gt;KC/VC (key/value corruption) - data was corrupted (tests were run on key-value stores)&lt;/li&gt;
&lt;li&gt;KNF (key not found) - the user is told that everything is fine, but the new value was not saved (it was lost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can make such conclusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If an &lt;code&gt;fsync&lt;/code&gt; error does not surface immediately, the number of application errors increases&lt;/li&gt;
&lt;li&gt;COW file systems handle &lt;code&gt;fsync&lt;/code&gt; errors better than regular journaled ones&lt;/li&gt;
&lt;li&gt;In case of an &lt;code&gt;fsync&lt;/code&gt; error, many applications just halt and roll back to the previous state (&lt;code&gt;-&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; in the table)&lt;/li&gt;
&lt;/ul&gt;
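
&lt;p&gt;The "halt instead of retry" strategy can be sketched as follows. This is only an illustration in Python under stated assumptions: &lt;code&gt;write_durably&lt;/code&gt; and &lt;code&gt;DurabilityError&lt;/code&gt; are hypothetical names, and the &lt;code&gt;fsync&lt;/code&gt; parameter exists only so that a failure can be simulated. It does not reproduce the behaviour of any application from the paper.&lt;/p&gt;

```python
import os


class DurabilityError(Exception):
    """Raised when we can no longer tell what has reached the disk."""


def write_durably(fd, data, fsync=os.fsync):
    """Write `data` to `fd` and call fsync once; never retry a failed fsync."""
    view = memoryview(data)
    while len(view) != 0:
        # os.write may write only a part of the buffer, so loop until done.
        written = os.write(fd, view)
        view = view[written:]
    try:
        fsync(fd)
    except OSError as error:
        # Dirty pages may already be marked clean, so a second fsync could
        # report success although the data was lost. Treat this as fatal
        # and let the caller recover from its last known good state.
        raise DurabilityError("fsync failed, on-disk state is unknown") from error
```

&lt;p&gt;The caller is expected to stop and recover (for example, from its own log) rather than call &lt;code&gt;fsync&lt;/code&gt; again on the same descriptor.&lt;/p&gt;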
&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;p&gt;And the last layer - persistent storage. We will talk about HDD and SSD.&lt;/p&gt;

&lt;p&gt;They have different storage technologies under the hood, but now we are focused on their parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to failure&lt;/li&gt;
&lt;li&gt;ECC (error correction codes)&lt;/li&gt;
&lt;li&gt;Access controller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  Beyond HDD and SSD
  &lt;p&gt;There are other storage devices beside HDD and SSD. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tape drives&lt;/li&gt;
&lt;li&gt;CD/DVD/Blu-ray disks&lt;/li&gt;
&lt;li&gt;PCM, FRAM, MRAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will not cover them in the rest of the article, but they are worth mentioning here.&lt;/p&gt;

&lt;p&gt;Tape drives seem to be an ideal option for backup storage. Compared to HDD, tape drives have better longevity and a better capacity/cost ratio. But they are not suitable for modern workloads with lots of random access.&lt;/p&gt;

&lt;p&gt;CD/DVD/Blu-ray disks are also not forgotten. According to &lt;a href="https://www.anythingresearch.com/industry/Manufacturing-Reproducing-Magnetic-Optical-Media.htm" rel="noopener noreferrer"&gt;this research&lt;/a&gt;, sales of optical disks are only increasing.&lt;br&gt;
Again, optical disks are not well suited for intensive workloads.&lt;/p&gt;

&lt;p&gt;Also, there are some technologies like &lt;a href="https://en.wikipedia.org/wiki/Phase-change_memory" rel="noopener noreferrer"&gt;PCM&lt;/a&gt; (Phase-Change Memory), &lt;a href="https://en.wikipedia.org/wiki/Ferroelectric_RAM" rel="noopener noreferrer"&gt;FRAM&lt;/a&gt; (Ferroelectric RAM) and &lt;a href="https://en.wikipedia.org/wiki/Magnetoresistive_RAM" rel="noopener noreferrer"&gt;MRAM&lt;/a&gt; (Magnetoresistive RAM). I couldn't find enough information on them, so I won't say anything so as not to misinform.&lt;/p&gt;



&lt;/p&gt;

&lt;h3&gt;
  
  
  Time to failure
&lt;/h3&gt;

&lt;p&gt;Every piece of equipment wears out. If processing hardware (CPU or GPU) breaks down, we just replace it and that's all.&lt;/p&gt;

&lt;p&gt;But if our persistent storage breaks down, we might lose data. Backblaze released a &lt;a href="https://www.backblaze.com/blog/backblaze-drive-stats-for-2023/" rel="noopener noreferrer"&gt;report for 2023&lt;/a&gt; with disk failure statistics. We can draw the following conclusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.backblaze.com/blog/wp-content/uploads/2024/02/4-Lifetime-AFR.png" rel="noopener noreferrer"&gt;AFR (Annualized Failure Rate)&lt;/a&gt; depends on many factors, for example vendor and disk size, but the average across the board is about 65 months (5.5 years)&lt;/li&gt;
&lt;li&gt;Compared to 2022, AFR increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As for SSD, they have a &lt;a href="https://www.backblaze.com/blog/ssd-drive-stats-mid-2022-review/" rel="noopener noreferrer"&gt;report for 2022&lt;/a&gt;. According to it, AFR for SSD is 0.92% (lower than for HDD). But take into account that SSDs were put into operation only in 2018, so the statistics may be inaccurate.&lt;/p&gt;
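
&lt;p&gt;To make the AFR figures easier to read, here is a small Python sketch of how such a percentage is commonly computed: failures divided by drive-years of operation. The function name and the fleet numbers are hypothetical and are not taken from the reports.&lt;/p&gt;

```python
def annualized_failure_rate(failures, drive_days):
    """Return AFR in percent: failures per drive-year of operation."""
    if drive_days == 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Hypothetical fleet: 1000 drives running for a full year, 15 of them failed.
afr = annualized_failure_rate(failures=15, drive_days=1000 * 365)
print(round(afr, 2))
```

&lt;p&gt;Counting drive-days instead of drives lets disks that were added or removed during the year contribute only for the time they actually worked.&lt;/p&gt;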

&lt;p&gt;Now, something about the impact of physical world on HDD/SSD operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDD has more moving parts, so it is more susceptible to physical damage. A vivid example is &lt;a href="https://www.youtube.com/watch?v=tDacjrSCeq4" rel="noopener noreferrer"&gt;"Shouting in the Datacenter"&lt;/a&gt;. This video shows that even such a small action is enough to affect an HDD - response time increases. Digging deeper, &lt;a href="https://www.princeton.edu/~pmittal/publications/acoustic-ashes18.pdf" rel="noopener noreferrer"&gt;this study&lt;/a&gt; analyzed the effect of noise on HDD operations - the HDD was forced to work under noise (ADoS - Acoustic Denial of Service). As a result, the positioning error rate increased and sometimes disk failures occurred.&lt;/li&gt;
&lt;li&gt;SSD, on the other hand, does not have moving parts, but it heavily relies on electricity. It is very sensitive to sudden power outages! &lt;a href="https://arxiv.org/pdf/1805.00140.pdf" rel="noopener noreferrer"&gt;This research&lt;/a&gt; tested the behaviour of SSDs under such outages. It turned out that a sudden power outage can lead to data integrity violation, data loss or even &lt;a href="https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf" rel="noopener noreferrer"&gt;turn the SSD into a brick&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ECC
&lt;/h3&gt;

&lt;p&gt;Often, HDD and SSD have built-in support for ECC - Error Correction Codes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDD supports the &lt;a href="https://en.wikipedia.org/wiki/Advanced_Format" rel="noopener noreferrer"&gt;Advanced Format&lt;/a&gt; layout. It allows storing ECC for a whole sector. But this requires additional support from the OS (nowadays almost every OS supports it). Also note that file systems can &lt;a href="https://wiki.archlinux.org/title/Advanced_Format#File_systems" rel="noopener noreferrer"&gt;know about&lt;/a&gt; Advanced Format and adjust to it, but this is out of the article's scope.&lt;/li&gt;
&lt;li&gt;SSD also has such support, but only for &lt;a href="https://en.wikipedia.org/wiki/Flash_memory#NAND_flash" rel="noopener noreferrer"&gt;NAND&lt;/a&gt; storage technology and &lt;a href="https://en.wikipedia.org/wiki/Flash_memory#NOR_flash" rel="noopener noreferrer"&gt;NOR&lt;/a&gt; (for microcontrollers). But this is supported at the storage level, not by the OS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But ECC is much smaller than the stored data, so sometimes it cannot fix all errors. &lt;a href="https://arxiv.org/pdf/2012.12373.pdf" rel="noopener noreferrer"&gt;This study&lt;/a&gt; compared HDD and SSD lifecycles (using their own tests). The following plots show how many Uncorrectable Errors (UE), that is, errors that ECC cannot handle, occurred before disk failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggec7mmgs4q03l91sd4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggec7mmgs4q03l91sd4s.png" alt="Uncorrectable Errors in disk lifecycles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conclusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of UE on SSD depends on the lifetime of the disk, whereas on HDD it depends on head flying hours.&lt;/li&gt;
&lt;li&gt;The number of errors on HDD increases dramatically 2 days before failure.&lt;/li&gt;
&lt;li&gt;The number of UE on SSD is higher than on HDD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Principal conclusion: modern hardware (with a modern OS) has ECC, but we should not rely on it heavily.&lt;/p&gt;

&lt;p&gt;P.S. Point 1 is another reason to care about data locality (the less the disk head moves, the longer the HDD lives)&lt;/p&gt;

&lt;h3&gt;
  
  
  Access controller
&lt;/h3&gt;

&lt;p&gt;The last component is the disk access controller. It resides in the disk itself and handles all requests to the disk. An important detail here is the disk cache.&lt;/p&gt;

&lt;p&gt;Recall &lt;code&gt;fsync&lt;/code&gt; - it must make sure all changes were flushed to disk. But if you look more closely at its &lt;code&gt;man&lt;/code&gt; page, you will see:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The fsync() implementations in older kernels and lesser used filesystems do not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In old kernel versions (for Linux, older than 2.2) &lt;code&gt;fsync&lt;/code&gt; did not know how to correctly flush the disk cache, and the same is true for some file systems. According to the article &lt;a href="https://lwn.net/Articles/457667/" rel="noopener noreferrer"&gt;"Ensuring data reaches disk"&lt;/a&gt;, there is a &lt;code&gt;barrier&lt;/code&gt; mount option for ext3/4, btrfs and xfs file systems that enables barriers (disk cache flushing).&lt;/p&gt;

&lt;p&gt;I have looked through some &lt;a href="https://github.com/torvalds/linux/tree/a4145ce1e7bc247fd6f2846e8699473448717b37/fs" rel="noopener noreferrer"&gt;Linux kernel code&lt;/a&gt; and figured out that only "read-only" file systems do not have an &lt;code&gt;fsync&lt;/code&gt; implementation (e.g. &lt;a href="https://github.com/torvalds/linux/blob/67be068d31d423b857ffd8c34dbcc093f8dfff76/fs/efs/dir.c#L13" rel="noopener noreferrer"&gt;efs&lt;/a&gt; and &lt;a href="https://github.com/torvalds/linux/blob/67be068d31d423b857ffd8c34dbcc093f8dfff76/fs/isofs/dir.c#L268" rel="noopener noreferrer"&gt;isofs&lt;/a&gt; do not register a custom &lt;code&gt;fsync&lt;/code&gt;).&lt;br&gt;
Also, there is a generic &lt;code&gt;fsync&lt;/code&gt; implementation (e.g. for non-journaled file systems).&lt;/p&gt;

&lt;p&gt;
  HFS fsync implementation
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/torvalds/linux/blob/a4145ce1e7bc247fd6f2846e8699473448717b37/block/bdev.c#L203&lt;/span&gt;
&lt;span class="cm"&gt;/*
 * Write out and wait upon all the dirty data associated with a block
 * device via its mapping.  Does not take the superblock lock.
 */&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sync_blockdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;block_device&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bdev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bdev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;filemap_write_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bdev&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;bd_inode&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;i_mapping&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;EXPORT_SYMBOL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sync_blockdev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/torvalds/linux/blob/a4145ce1e7bc247fd6f2846e8699473448717b37/mm/filemap.c#L779&lt;/span&gt;
&lt;span class="cm"&gt;/**
 * file_write_and_wait_range - write out &amp;amp; wait on a file range
 * @file: file pointing to address_space with pages
 * @lstart: offset in bytes where the range starts
 * @lend: offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart-&amp;gt;lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * After writing out and waiting on the data, we check and advance the
 * f_wb_err cursor to the latest value, and return any errors detected there.
 *
 * Return: %0 on success, negative error code otherwise.
 */&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;file_write_and_wait_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;file&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loff_t&lt;/span&gt; &lt;span class="n"&gt;lstart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loff_t&lt;/span&gt; &lt;span class="n"&gt;lend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;address_space&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;f_mapping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lend&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;lstart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping_needs_writeback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__filemap_fdatawrite_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lstart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;WB_SYNC_ALL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="cm"&gt;/* See comment of filemap_write_and_wait() */&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;EIO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;__filemap_fdatawait_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lstart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lend&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;err2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_check_and_advance_wb_err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;err2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;EXPORT_SYMBOL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_write_and_wait_range&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/torvalds/linux/blob/a4145ce1e7bc247fd6f2846e8699473448717b37/fs/hfs/inode.c#L661&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;hfs_file_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;file&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;filp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loff_t&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loff_t&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;datasync&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="n"&gt;file_write_and_wait_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="c1"&gt;// ... &lt;/span&gt;
 &lt;span class="n"&gt;sync_blockdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;s_bdev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="c1"&gt;// ... &lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;Again, the access controller is not a bad thing. For example, it allows an SSD to live longer by smartly utilizing blocks, as they have a limited number of P/E cycles (roughly speaking, the access controller plays a more crucial part in SSD than in HDD).&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistence properties
&lt;/h3&gt;

&lt;p&gt;At the end, let's talk about persistence properties provided by HDD and SSD: atomicity and PowerSafe OverWrite.&lt;/p&gt;

&lt;h4&gt;
  
  
  Atomicity
&lt;/h4&gt;

&lt;p&gt;IO is very slow compared to other operations. See &lt;a href="https://colin-scott.github.io/personal_website/research/interactive_latency.html" rel="noopener noreferrer"&gt;"Latency Numbers every programmer should know"&lt;/a&gt; (extended version by year) - even in 2020 an HDD spends 2 ms just on a seek.&lt;br&gt;
As we cannot eliminate disk seeks (for HDD) or increase memory cells (for SSD), the only thing we can do is add some optimizations. The main one is batching operations into blocks. So, even when you request to read/write a single byte, you actually read/write a whole block.&lt;/p&gt;

&lt;p&gt;Today, we have 2 block sizes: 512 bytes and 4 KB. For HDD such a unit is called a "sector", and for SSD there are "page" and "block" for read and write respectively.&lt;br&gt;
Reading cannot corrupt data, but writing can, so we will focus primarily on write operations and use the term "unit of write".&lt;/p&gt;
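
&lt;p&gt;To illustrate the batching described above, here is a small Python sketch that computes which units of write are touched by a byte range. The function name is hypothetical and the unit size of 4096 bytes is just an assumed default, real devices may differ.&lt;/p&gt;

```python
def units_touched(offset, length, unit_size=4096):
    """Return indexes of write units covered by bytes [offset, offset + length)."""
    if length == 0:
        return range(0)
    first = offset // unit_size
    last = (offset + length - 1) // unit_size
    return range(first, last + 1)


# A single byte still costs a whole unit:
print(list(units_touched(offset=5000, length=1)))    # [1]
# A 100-byte write crossing a boundary touches two units:
print(list(units_touched(offset=4090, length=100)))  # [0, 1]
```

&lt;p&gt;The second case is the interesting one for fault tolerance: a logically small write is split across two units, and each of them is written separately.&lt;/p&gt;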

&lt;p&gt;There is already &lt;a href="https://stackoverflow.com/a/61832882" rel="noopener noreferrer"&gt;an answer to such question on StackOverflow&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a sector write sent by the kernel is &lt;em&gt;likely&lt;/em&gt; atomic&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But only under certain conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The access controller has a backup battery (in case a power outage occurs during the operation)&lt;/li&gt;
&lt;li&gt;The SCSI disk vendor guarantees write atomicity&lt;/li&gt;
&lt;li&gt;(for NVMe) a special atomic write function is called&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds quite logical, so I'll believe it.&lt;/p&gt;
&lt;h4&gt;
  
  
  PowerSafe OverWrite
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.sqlite.org/psow.html" rel="noopener noreferrer"&gt;PowerSafe OverWrite (PSOW)&lt;/a&gt; is a term used by SQLite developers to describe the behaviour of file systems and disks in case of a sudden power outage.&lt;br&gt;
The meaning is as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When an application writes a range of bytes in a file, no bytes outside of that range will change, even if the write occurs just before a crash or power failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice, it means the disk has a backup battery that is used to safely write the remaining data. Without it, a write to a single unit of write can end up in either or both of these states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One part of the write range contains new data and the other part contains old data&lt;/li&gt;
&lt;li&gt;The part of the same sector that is outside the write range contains garbage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
  Atomicity and PSOW are not the same
  &lt;p&gt;At first glance, one might think atomicity and PowerSafe OverWrite are the same, but that is not true.&lt;/p&gt;

&lt;p&gt;For example, imagine this situation: we want to overwrite part of a file, and a power outage occurs during the write. Depending on which combination of properties holds, the consequences differ.&lt;/p&gt;

&lt;p&gt;To be specific, we have 3 sectors (units of write), all filled with 0 (old data), and we want to overwrite a range with 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              A         B         C
Sectors: |000000000|000000000|000000000|        
Write:        |------------------|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



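&lt;p&gt;Then we have the following situations (rows A, B and C show the result when the outage hits while that sector is being written):&lt;/p&gt;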

&lt;ol&gt;
&lt;li&gt;Atomic + PSOW: each sector contains either new data or old data.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: | 000011111 | 000000000 | 000000000 |
B: | 000000000 | 111111111 | 000000000 |
C: | 000000000 | 000000000 | 111100000 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;!Atomic + PSOW: the sector being overwritten when power is lost will contain garbage
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: | 000011010 | 000000000 | 000000000 |
B: | 000000000 | 110011010 | 000000000 |
C: | 000000000 | 000000000 | 001000000 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE: data outside the sector is not affected/corrupted/changed&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Atomic + !PSOW: thanks to atomicity, the sector we are writing is written successfully, but without PSOW nothing guarantees that the other sectors stay intact. Consider this result:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: | 000011111 | 001011100 | 000000000 |
B: | 000000000 | 111111111 | 001000000 |
C: | 000001010 | 111000110 | 111100000 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How can this happen? For example, the battery holds enough charge to write that single page and nothing more: once the page is written, the disk head moves around randomly and affects stored data with the remaining magnetic energy.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;!Atomic + !PSOW: it gets more interesting when nothing is guaranteed. The SQLite developers give an example of such behaviour: the OS reads a whole sector, modifies some bytes and writes the sector back (Read-Modify-Write), and power is lost during the write. The data is only partially written and the ECC is not updated, so after restart the disk controller detects an incorrect sector and clears it.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: | 111111111 | 000000000 | 000000000 |
B: | 000000000 | 111111111 | 000000000 |
C: | 000000000 | 000000000 | 111111111 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
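&lt;p&gt;The first case (Atomic + PSOW) is easy to reproduce with a tiny simulation. This is only a sketch: the sector size and the write range are read off the diagram above, and the helper names are made up for illustration.&lt;/p&gt;

```python
SECTOR_SIZE = 9    # bytes per sector, as drawn in the diagram
WRITE_START = 4    # first overwritten byte (inside sector A)
WRITE_END = 22     # one past the last overwritten byte (inside sector C)


def write_one_sector(disk: str, sector: int) -> str:
    """Apply only the part of the write that falls into the given sector."""
    lo = max(WRITE_START, sector * SECTOR_SIZE)
    hi = min(WRITE_END, (sector + 1) * SECTOR_SIZE)
    return disk[:lo] + "1" * (hi - lo) + disk[hi:]


def show(disk: str) -> str:
    """Render the disk sector by sector, like the diagrams in the text."""
    parts = [disk[i:i + SECTOR_SIZE] for i in range(0, len(disk), SECTOR_SIZE)]
    return "| " + " | ".join(parts) + " |"


old = "0" * (3 * SECTOR_SIZE)

if __name__ == "__main__":
    for name, sector in zip("ABC", range(3)):
        print(name + ":", show(write_one_sector(old, sector)))
```

&lt;p&gt;Each printed row matches the corresponding row of the Atomic + PSOW diagram: the sector that was written holds only new data inside the write range, and every byte outside the range is untouched.&lt;/p&gt;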



&lt;p&gt;The &lt;a href="https://github.com/hashicorp/raft-wal" rel="noopener noreferrer"&gt;hashicorp/raft-wal&lt;/a&gt; repository has a README that collects the assumptions different applications make about persistence guarantees:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Application&lt;/th&gt;
&lt;th&gt;Atomicity&lt;/th&gt;
&lt;th&gt;PowerSafe OverWrite&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sqlite.org/atomiccommit.html#_hardware_assumptions" rel="noopener noreferrer"&gt;SQLite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;+ (from 3.7.9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/hashicorp/raft-wal/tree/main?tab=readme-ov-file#our-assumptions" rel="noopener noreferrer"&gt;Hashicorp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/hashicorp/raft-wal/tree/main?tab=readme-ov-file#user-content-etcd-wal" rel="noopener noreferrer"&gt;Etcd/wal&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/hashicorp/raft-wal/tree/main?tab=readme-ov-file#user-content-lmdb" rel="noopener noreferrer"&gt;LMDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/hashicorp/raft-wal/tree/main?tab=readme-ov-file#user-content-rocksdb-wal" rel="noopener noreferrer"&gt;BoltDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;You might think that's all, but we have not covered one important layer that many modern programming languages have: the runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime
&lt;/h2&gt;

&lt;p&gt;At the very beginning we went from the programming language straight to the OS, but some languages have a managed layer in between, the runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime itself - Node.js, .NET, JVM&lt;/li&gt;
&lt;li&gt;Interpreter - Python, Ruby&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In C/C++ we can make the syscall directly (just invoke &lt;code&gt;fsync&lt;/code&gt;), but other languages may have an abstraction layer on top. Here we will focus on invoking &lt;code&gt;fsync&lt;/code&gt;, as it is necessary to ensure data is persisted.&lt;/p&gt;
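&lt;p&gt;Python, for instance, exposes the syscall almost directly through &lt;code&gt;os.fsync&lt;/code&gt;, but the interpreter's own buffer has to be flushed first. A minimal sketch, assuming CPython on a POSIX system; the helper name &lt;code&gt;durable_write&lt;/code&gt; is hypothetical:&lt;/p&gt;

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to persist it before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # runtime buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> storage device
```

&lt;p&gt;Without the &lt;code&gt;flush()&lt;/code&gt; call the data could still sit in the runtime's buffer when &lt;code&gt;fsync&lt;/code&gt; runs, so there would be nothing for the OS to persist yet.&lt;/p&gt;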

&lt;p&gt;For Java it is very simple: the &lt;code&gt;force(true)&lt;/code&gt; method of &lt;code&gt;FileChannel&lt;/code&gt;. According to the documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Forces any updates to this channel's file to be written to the storage device that contains it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we do not call &lt;code&gt;fsync&lt;/code&gt; directly; we program against the abstractions the runtime provides. The same applies to .NET: the &lt;code&gt;FileStream&lt;/code&gt; class has an overloaded method &lt;code&gt;Flush(bool flushToDisk)&lt;/code&gt;. When &lt;code&gt;true&lt;/code&gt; is passed, all data must be flushed to disk:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use this overload when you want to ensure that all buffered data in intermediate file buffers is written to disk. When you call the Flush method, the operating system I/O buffer is also flushed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note again that there is not a word about &lt;code&gt;fsync&lt;/code&gt;, as it is a platform-dependent implementation detail. But I am not used to blindly relying on words, so let's look at the .NET source code, where we find the following call chain:&lt;/p&gt;

&lt;p&gt;
  FileStream.Flush call chain
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FileStream&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;FileStreamStrategy&lt;/span&gt; &lt;span class="n"&gt;_strategy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// https://github.com/dotnet/runtime/blob/da781b3aab1bc30793812bced4a6b64d2df31a9f/src/libraries/System.Private.CoreLib/src/System/IO/FileStream.cs#L389&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;virtual&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;flushToDisk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_strategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsClosed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ThrowHelper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ThrowObjectDisposedException_FileClosed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;_strategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flushToDisk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;abstract&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OSFileStreamStrategy&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FileStreamStrategy&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// https://github.com/dotnet/runtime/blob/da781b3aab1bc30793812bced4a6b64d2df31a9f/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/OSFileStreamStrategy.cs#L137&lt;/span&gt;
    &lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;sealed&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;flushToDisk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flushToDisk&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;CanWrite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;FileStreamHelpers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FlushToDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_fileHandle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FileStreamHelpers&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// https://github.com/dotnet/runtime/blob/da781b3aab1bc30793812bced4a6b64d2df31a9f/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/FileStreamHelpers.Unix.cs#L40&lt;/span&gt;
    &lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;FlushToDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SafeFileHandle&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorInfo&lt;/span&gt; &lt;span class="n"&gt;errorInfo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetLastErrorInfo&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errorInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EROFS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EINVAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENOTSUP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;// Ignore failures for special files that don't support synchronization.&lt;/span&gt;
                    &lt;span class="c1"&gt;// In such cases there's nothing to flush.&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="n"&gt;Interop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetExceptionForIoErrno&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errorInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Interop&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Sys&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// https://github.com/dotnet/runtime/blob/da781b3aab1bc30793812bced4a6b64d2df31a9f/src/libraries/Common/src/Interop/Unix/System.Native/Interop.FSync.cs#L11&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;LibraryImport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Libraries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemNative&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EntryPoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"SystemNative_FSync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SetLastError&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;partial&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;FSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SafeFileHandle&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/dotnet/runtime/blob/da781b3aab1bc30793812bced4a6b64d2df31a9f/src/native/libs/System.Native/pal_io.c#L736&lt;/span&gt;
&lt;span class="n"&gt;int32_t&lt;/span&gt; &lt;span class="nf"&gt;SystemNative_FSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intptr_t&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fileDescriptor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ToFileDescriptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;int32_t&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
&lt;span class="cp"&gt;#if defined(TARGET_OSX) &amp;amp;&amp;amp; HAVE_F_FULLFSYNC
&lt;/span&gt;    &lt;span class="nf"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_FULLFSYNC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="nf"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileDescriptor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;errno&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;EINTR&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;So, when we pass &lt;code&gt;true&lt;/code&gt;, an &lt;code&gt;fsync&lt;/code&gt; must occur. But let's see what happens in reality. I wrote the following code and traced it with &lt;code&gt;strace&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FileStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FileMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenOrCreate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello, world"&lt;/span&gt;&lt;span class="n"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is part of &lt;code&gt;strace&lt;/code&gt; output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openat(AT_FDCWD, "/path/sample.txt", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 19
lseek(19, 0, SEEK_CUR)                  = 0
pwrite64(19, "hello, world", 12, 0)     = 12
fsync(19)                               = 0
flock(19, LOCK_UN)                      = 0
close(19)                               = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;openat&lt;/code&gt; - the file is opened and gets file descriptor 19&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lseek&lt;/code&gt; - the current file position is queried (0, the very beginning)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pwrite64&lt;/code&gt; - our data is written&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync(19)&lt;/code&gt; - &lt;code&gt;fsync&lt;/code&gt; call happened&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;close(19)&lt;/code&gt; - file is closed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's all fine: &lt;code&gt;fsync&lt;/code&gt; is called. But I used .NET 8.0.1 for that, so next I wanted to test the behaviour on another version, 7.0.11. The source code is the same, but the output is different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openat(AT_FDCWD, "/path/sample.txt", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 19
lseek(19, 0, SEEK_CUR)                  = 0
pwrite64(19, "hello, world", 12, 0)     = 12
flock(19, LOCK_UN)                      = 0
close(19)                               = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;fsync&lt;/code&gt;! Moreover, if we add a second &lt;code&gt;Flush(true)&lt;/code&gt; call, &lt;code&gt;fsync&lt;/code&gt; appears, and all subsequent calls invoke it as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openat(AT_FDCWD, "/path/sample.txt", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 19
lseek(19, 0, SEEK_CUR)                  = 0
pwrite64(19, "hello, world", 12, 0)     = 12
fsync(19)                               = 0
flock(19, LOCK_UN)                      = 0
close(19)                               = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I concluded that on this version the first &lt;code&gt;Flush(true)&lt;/code&gt; is ignored (for some reason, idk), but subsequent ones are not.&lt;/p&gt;

&lt;p&gt;I must also note that a developer is often limited to the abstractions and programming model the runtime provides. Take .NET as an example again.&lt;br&gt;
Remember that directories are also files, so we must call &lt;code&gt;fsync&lt;/code&gt; on them after operations that change their contents. In .NET we cannot open directories (a Windows legacy):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Directory&lt;/code&gt; class does not have &lt;code&gt;Open&lt;/code&gt; method (or some &lt;code&gt;Sync&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If we create a &lt;code&gt;FileStream&lt;/code&gt; with a directory path (even in read-only mode), we get &lt;code&gt;UnauthorizedAccessException&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have found a rough workaround: call the &lt;code&gt;open&lt;/code&gt; function using P/Invoke, get the directory file descriptor and wrap it in a &lt;code&gt;SafeFileHandle&lt;/code&gt;. In this case there is no exception and we can use &lt;code&gt;fsync&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateDirectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample-directory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;directoryFlags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// O_DIRECTORY | O_RDONLY&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;directoryFlags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FileStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SafeFileHandle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;FileAccess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadWrite&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;DllImport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"libc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EntryPoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="kt"&gt;nint&lt;/span&gt; &lt;span class="nf"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is &lt;code&gt;strace&lt;/code&gt; output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openat(AT_FDCWD, "/path/sample-directory", O_RDONLY|O_DIRECTORY) = 19
lseek(19, 0, SEEK_CUR)                  = 0
lseek(19, 0, SEEK_CUR)                  = 0
fsync(19)                               = 0
close(19)                               = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
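&lt;p&gt;For comparison, a runtime that exposes raw file descriptors needs no workaround for this. A minimal Python sketch, assuming Linux (&lt;code&gt;os.O_DIRECTORY&lt;/code&gt; is not available on Windows); the helper name &lt;code&gt;fsync_directory&lt;/code&gt; is made up for illustration:&lt;/p&gt;

```python
import os
import tempfile


def fsync_directory(path: str) -> None:
    """Open a directory read-only and flush its entries to disk."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        fsync_directory(directory)
```

&lt;p&gt;Tracing this with &lt;code&gt;strace&lt;/code&gt; should show the same &lt;code&gt;openat&lt;/code&gt; with &lt;code&gt;O_RDONLY|O_DIRECTORY&lt;/code&gt; followed by &lt;code&gt;fsync&lt;/code&gt; and &lt;code&gt;close&lt;/code&gt;, although I have not included a trace for it here.&lt;/p&gt;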



&lt;h3&gt;
  
  
  Key takeaways
&lt;/h3&gt;

&lt;p&gt;To sum up, the I/O stack has many details that we must take into account to ensure data integrity. Each layer has its own semantics and characteristics.&lt;br&gt;
Neglecting them means neglecting our data: corruption, loss, garbage and other unpleasant events.&lt;/p&gt;

&lt;p&gt;Here is a small diagram briefly summarizing all of the above:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx27fmbvjx2kuptipe7ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx27fmbvjx2kuptipe7ag.png" alt="File write operation stack"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  File operations patterns
&lt;/h2&gt;

&lt;p&gt;Having considered the problems that can arise when working with files, let's look at the opposite side: how to fight them.&lt;/p&gt;

&lt;p&gt;As developers we create huge systems with lots of connected subsystems. The most illustrative example is a database (persistent, not in-memory).&lt;br&gt;
So most of the example implementations will be taken from the source code of DBMSs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Create new file
&lt;/h3&gt;

&lt;p&gt;Let's kick off from the very beginning - creating a new file.&lt;/p&gt;

&lt;p&gt;If we want a new &lt;em&gt;empty&lt;/em&gt; file, we just need to call &lt;em&gt;&lt;code&gt;fsync&lt;/code&gt; on the parent directory&lt;/em&gt; after creating the file:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creat("/dir/data")&lt;/code&gt; - create new file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - sync directory's contents&lt;/li&gt;
&lt;/ol&gt;
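&lt;p&gt;The two steps above can be sketched in Python as follows. This assumes a POSIX system; the flags mirror what &lt;code&gt;creat&lt;/code&gt; does, and the helper name is hypothetical.&lt;/p&gt;

```python
import os


def create_empty_file(path: str) -> None:
    """Create an empty file and make its directory entry durable."""
    # Step 1: creat() is equivalent to open() with these three flags.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.close(fd)

    # Step 2: fsync the parent directory so the new entry survives a crash.
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```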

&lt;p&gt;Initially the file is empty, but what if it must have initial data, e.g. a header with metadata? As we have seen, just &lt;code&gt;fsync&lt;/code&gt; is not enough, because if the operation is interrupted, the file will either not exist or be half-written (which is not acceptable).&lt;/p&gt;

&lt;p&gt;We have already seen the pattern for this: &lt;code&gt;Atomic Create Via Rename&lt;/code&gt;. Again, the name is self-descriptive: to create an initialized file, we prepare a fully initialized temporary file and rename it into place.&lt;br&gt;
The algorithm is the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creat("/dir/data.tmp")&lt;/code&gt; - create temporary file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/data.tmp", new_data)&lt;/code&gt; - write required data to it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/data.tmp")&lt;/code&gt; - flush contents of temporary file to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - flush updates to parent directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rename("/dir/data.tmp", "/dir/data")&lt;/code&gt; - rename (replace) temporary file with target one&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - flush rename to parent directory&lt;/li&gt;
&lt;/ol&gt;
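&lt;p&gt;Here is how the six steps above might look in Python. It is a sketch under assumptions: a POSIX system, a &lt;code&gt;.tmp&lt;/code&gt; suffix for the temporary file, and helper names (&lt;code&gt;fsync_dir&lt;/code&gt;, &lt;code&gt;atomic_create&lt;/code&gt;) that are invented for illustration.&lt;/p&gt;

```python
import os


def fsync_dir(path: str) -> None:
    """Flush directory entries of the given directory to disk."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def atomic_create(path: str, data: bytes) -> None:
    """Create an initialized file so that readers see all of it or none."""
    parent = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"

    with open(tmp_path, "wb") as tmp:   # 1. create temporary file
        tmp.write(data)                 # 2. write required data
        tmp.flush()
        os.fsync(tmp.fileno())          # 3. flush its contents to disk
    fsync_dir(parent)                   # 4. flush the new directory entry
    os.rename(tmp_path, path)           # 5. rename into place
    fsync_dir(parent)                   # 6. flush the rename
```

&lt;p&gt;If the process dies before step 5, only the temporary file is left behind and the target name never points at partial data.&lt;/p&gt;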

&lt;p&gt;Practical example: creation of a new log segment (a file with the operations being performed on data) in etcd:&lt;/p&gt;

&lt;p&gt;
  Creation of new log segment file in etcd
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// cut closes current file written and creates a new one ready to append.&lt;/span&gt;
&lt;span class="c"&gt;// cut first creates a temp wal file and writes necessary headers into it.&lt;/span&gt;
&lt;span class="c"&gt;// Then cut atomically rename temp wal file to a wal file.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;WAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c"&gt;// Название для нового файла сегмента&lt;/span&gt;
 &lt;span class="n"&gt;fpath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;walName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enti&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c"&gt;// 1. Create temporary file&lt;/span&gt;
 &lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 2. Write data to temporary file&lt;/span&gt;
 &lt;span class="c"&gt;// update writer and save the previous crc&lt;/span&gt;
 &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;prevCrc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
 &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newFileEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prevCrc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saveCrc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prevCrc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;walpb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetadataType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saveState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// atomically move temp wal file to wal file&lt;/span&gt;

    &lt;span class="c"&gt;// 3-4. Flush file data to disk&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 5. Rename temporary file to target&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fpath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// 6. Flush "rename" of parent directory to disk&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fileutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// reopen newTail with its new path so calls to Name() match the wal filename format&lt;/span&gt;
 &lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fileutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LockFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;O_WRONLY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fileutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrivateFileMode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newTail&lt;/span&gt;
 &lt;span class="n"&gt;prevCrc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
 &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newFileEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prevCrc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I have marked the steps of our algorithm with numbers in the comments. As you can see, the steps are performed in the same order.&lt;/p&gt;

&lt;p&gt;Also, the target file is reopened at the end. This is done so that the file handle reports the target filename rather than the temporary one (it does not affect correctness).&lt;/p&gt;



&lt;/p&gt;

&lt;h3&gt;
  
  
  File modification
&lt;/h3&gt;

&lt;p&gt;The file is created, and now we must modify it: write new data to it or update (overwrite) existing data. Here we have 2 options:&lt;/p&gt;

&lt;h4&gt;
  
  
  Change of a small file
&lt;/h4&gt;

&lt;p&gt;If our file is small enough, we can apply a sibling of the previous pattern: &lt;code&gt;Atomic Replace Via Rename&lt;/code&gt; (btrfs calls this &lt;code&gt;overwrite-by-rename&lt;/code&gt;). The algorithm is almost the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creat("/dir/data.tmp")&lt;/code&gt; - create temporary file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/data.tmp", new_data)&lt;/code&gt; - write new data to it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/data.tmp")&lt;/code&gt; - flush data to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - update parent directory contents (new file creation)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rename("/dir/data.tmp", "/dir/data")&lt;/code&gt; - rename (replace) old file with a new one&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - update parent directory contents (file rename)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Earlier I said that some file systems can detect this pattern and perform &lt;code&gt;fsync&lt;/code&gt; themselves. But as developers we target multiple systems and should not rely on specific technologies, in this case specific file system features.&lt;/p&gt;

&lt;p&gt;Practical example: LevelDB flushing a memtable to disk.&lt;/p&gt;

&lt;p&gt;
  LevelDB flush memtable to disk
  &lt;p&gt;LevelDB stores data in memory in a special structure, the MemTable. When it becomes too big, it is flushed to disk; this is called "Compaction".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/db/db_impl.cc#L549&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;DBImpl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CompactMemTable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;

  &lt;span class="c1"&gt;// Replace immutable memtable with the generated Table&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetPrevLogNumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetLogNumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logfile_number_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Earlier logs no longer needed&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;versions_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;LogAndApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mutex_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Versionset-&amp;gt;LogAndApply&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/db/version_set.cc#L777&lt;/span&gt;
&lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="n"&gt;VersionSet&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;LogAndApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VersionEdit&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Create new state - apply patches to current state (in-memory for now)&lt;/span&gt;
  &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaveTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;Finalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Initialize new descriptor log file if necessary by creating&lt;/span&gt;
  &lt;span class="c1"&gt;// a temporary file that contains a snapshot of the current version.&lt;/span&gt;
  &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;new_manifest_file&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;descriptor_log_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="c1"&gt;// 2. Create temporary file&lt;/span&gt;
    &lt;span class="n"&gt;new_manifest_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DescriptorFileName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbname_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest_file_number_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;NewWritableFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_manifest_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;descriptor_file_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 3. Write data (new snapshot) to temporary file&lt;/span&gt;
      &lt;span class="n"&gt;descriptor_log_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;descriptor_file_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WriteSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;descriptor_log_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 4. Flush data to disk&lt;/span&gt;
      &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;descriptor_file_&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Sync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// If we just created a new descriptor file, install it by writing a&lt;/span&gt;
    &lt;span class="c1"&gt;// new CURRENT file that points to it.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;new_manifest_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 5. Rename temporary file to target&lt;/span&gt;
      &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SetCurrentFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbname_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest_file_number_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/util/env_posix.cc#L334&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PosixWritableFile&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;WritableFile&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nl"&gt;public:&lt;/span&gt;
  &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="n"&gt;Sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncDirIfManifest&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FlushBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SyncFd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="nf"&gt;SyncFd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;fd_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="cp"&gt;#if HAVE_FULLFSYNC
&lt;/span&gt;    &lt;span class="c1"&gt;// On macOS and iOS, fsync() doesn't guarantee durability past power&lt;/span&gt;
    &lt;span class="c1"&gt;// failures. fcntl(F_FULLFSYNC) is required for that purpose. Some&lt;/span&gt;
    &lt;span class="c1"&gt;// filesystems don't support fcntl(F_FULLFSYNC), and require a fallback to&lt;/span&gt;
    &lt;span class="c1"&gt;// fsync().&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_FULLFSYNC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#endif  // HAVE_FULLFSYNC
&lt;/span&gt;
&lt;span class="cp"&gt;#if HAVE_FDATASYNC
&lt;/span&gt;    &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;sync_success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;sync_success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#endif  // HAVE_FDATASYNC
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sync_success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;PosixError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/db/filename.cc#L123&lt;/span&gt;
&lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="nf"&gt;SetCurrentFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Env&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;descriptor_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Rename temporary file to target&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;RenameFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CurrentFileName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The steps are the same. Also, if you read the code, you may have noticed the macOS-specific branch: on that platform durability requires &lt;code&gt;fcntl(F_FULLFSYNC)&lt;/code&gt; instead of &lt;code&gt;fsync&lt;/code&gt;.&lt;/p&gt;



&lt;br&gt;
&lt;/p&gt;
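
&lt;p&gt;The macOS fallback logic from LevelDB's &lt;code&gt;SyncFd&lt;/code&gt; can be approximated in Python as follows. This is a simplified sketch, not a port. The helper name &lt;code&gt;sync_fd&lt;/code&gt; is hypothetical. It falls back to plain &lt;code&gt;fsync&lt;/code&gt; rather than &lt;code&gt;fdatasync&lt;/code&gt;, and it relies on the &lt;code&gt;fcntl&lt;/code&gt; module, which exists only on Unix-like systems.&lt;/p&gt;

```python
import fcntl
import os
import tempfile


def sync_fd(fd: int) -> str:
    """Flush `fd` to stable storage and report which primitive was used."""
    # The constant is exposed by Python only on platforms that define it (macOS).
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems do not support F_FULLFSYNC; fall back to fsync().
            pass
    os.fsync(fd)
    return "fsync"


# Hypothetical usage with a scratch file.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"payload")
    f.flush()  # move Python's userspace buffer into the kernel first
    used = sync_fd(f.fileno())
```

&lt;p&gt;On Linux &lt;code&gt;used&lt;/code&gt; will be &lt;code&gt;"fsync"&lt;/code&gt;. On macOS it depends on whether the underlying filesystem accepts &lt;code&gt;F_FULLFSYNC&lt;/code&gt;.&lt;/p&gt;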

&lt;h4&gt;
  
  
  UNDO log
&lt;/h4&gt;

&lt;p&gt;The previous approach works for a small file, but what if the file is large or free disk space is low?&lt;/p&gt;

&lt;p&gt;So far we have seen many atomic operations combined with &lt;code&gt;fsync&lt;/code&gt;, and now we will use them to build an atomic file overwrite. But what does "atomically" mean here? We can go in either direction: undo the operation (roll back) or reapply it (redo). Thus, we come to 2 main concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UNDO log - contains information about how to rollback to previous state if operation fails&lt;/li&gt;
&lt;li&gt;REDO log - contains information about what operation we wanted to perform in order to redo it if some failure occurred&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We describe the UNDO log first; the REDO log comes after.&lt;/p&gt;

&lt;p&gt;The undo log contains the information needed to &lt;em&gt;roll back&lt;/em&gt; operations. In the case of a file rewrite it can &lt;em&gt;store the original data that we are overwriting&lt;/em&gt;. So, after a restart (that is, after a failure occurred) we check that log for incomplete operations and undo them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example for undo/redo log is borrowed from &lt;a href="https://danluu.com/file-consistency" rel="noopener noreferrer"&gt;"Files Are Hard"&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our example is the following: we want to write new data (&lt;code&gt;new_data&lt;/code&gt;) starting at byte 10 (&lt;code&gt;start&lt;/code&gt;) with a length of 15 bytes (&lt;code&gt;length&lt;/code&gt;), so the undo log will contain bytes 10 up to (but not including) 25 of the original data (&lt;code&gt;old_data&lt;/code&gt;).&lt;br&gt;
The algorithm is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creat("/dir/undo.log")&lt;/code&gt; - create undo log&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/undo.log", "[check_sum, start, length, old_data]")&lt;/code&gt; - store original data, that we want to overwrite:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;start&lt;/code&gt; - start position&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;length&lt;/code&gt; - length of byte range&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;old_data&lt;/code&gt; - actual data in that range&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check_sum&lt;/code&gt; - checksum computed for the data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/undo.log")&lt;/code&gt; - persist undo log on disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - sync parent directory contents (undo log creation)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/data", new_data)&lt;/code&gt; - overwrite original file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/data")&lt;/code&gt; - flush file updates to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unlink("/dir/undo.log")&lt;/code&gt; - remove undo log&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - sync parent directory contents (undo log deletion)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What is taken into account in this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A fault can occur right after the undo log is created, or data can be corrupted during reboot, so we store a checksum of the saved data&lt;/li&gt;
&lt;li&gt;Always call &lt;code&gt;fsync&lt;/code&gt;, even for the undo log - without it the undo log can disappear after we have already modified the main data file&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;fsync&lt;/code&gt; after deleting the undo log as well - otherwise, after a restart, we would think that the operation had not completed at the time of the halt and perform a rollback&lt;/li&gt;
&lt;/ul&gt;
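&lt;p&gt;The 8 steps above can be sketched in C roughly like this. Treat it as an illustration under assumptions, not a reference implementation: the function names (&lt;code&gt;undo_overwrite&lt;/code&gt;, &lt;code&gt;write_all&lt;/code&gt;, &lt;code&gt;sync_dir&lt;/code&gt;), the record layout (three 64-bit integers in host byte order followed by &lt;code&gt;old_data&lt;/code&gt;) and the use of FNV-1a as a stand-in checksum are all my own choices for the example.&lt;/p&gt;

```c
#define _XOPEN_SOURCE 700   /* pread, pwrite, fsync */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define FNV_BASIS 0xcbf29ce484222325ULL

/* FNV-1a hash, used here only as a stand-in for a real checksum such as CRC32. */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t n) {
    const unsigned char *p = buf;
    for (size_t i = 0; i < n; i++) {
        h = (h ^ p[i]) * 0x100000001b3ULL;
    }
    return h;
}

/* write() may be partial, so loop until everything is written. */
static int write_all(int fd, const void *buf, size_t n) {
    const unsigned char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0) return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* fsync a directory so that file creation/removal inside it is durable. */
static int sync_dir(const char *dir) {
    int fd = open(dir, O_RDONLY);
    if (fd < 0) return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Overwrite `length` bytes of data_path at offset `start`, guarded by an undo log.
 * Returns 0 on success, -1 on error. On error the undo log (if already created)
 * is left in place so that a later rollback can use it. */
int undo_overwrite(const char *dir, const char *data_path, const char *undo_path,
                   uint64_t start, const void *new_data, uint64_t length) {
    uint64_t header[3] = { 0, start, length };   /* check_sum, start, length */
    int rc = -1, undo_fd = -1;
    assert(new_data != NULL);

    /* Read the bytes we are about to overwrite and compute their checksum. */
    unsigned char *old_data = malloc(length ? length : 1);
    int data_fd = open(data_path, O_RDWR);
    if (old_data == NULL || data_fd < 0) goto out;
    if (pread(data_fd, old_data, length, (off_t)start) != (ssize_t)length) goto out;
    header[0] = fnv1a(fnv1a(FNV_BASIS, &header[1], 2 * sizeof header[1]), old_data, length);

    undo_fd = open(undo_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);   /* step 1 */
    if (undo_fd < 0) goto out;
    if (write_all(undo_fd, header, sizeof header) != 0) goto out;    /* step 2 */
    if (write_all(undo_fd, old_data, length) != 0) goto out;
    if (fsync(undo_fd) != 0) goto out;                               /* step 3 */
    if (sync_dir(dir) != 0) goto out;                                /* step 4 */
    /* step 5; a short pwrite is treated as an error to keep the sketch small */
    if (pwrite(data_fd, new_data, length, (off_t)start) != (ssize_t)length) goto out;
    if (fsync(data_fd) != 0) goto out;                               /* step 6 */
    if (unlink(undo_path) != 0) goto out;                            /* step 7 */
    if (sync_dir(dir) != 0) goto out;                                /* step 8 */
    rc = 0;
out:
    if (undo_fd >= 0) close(undo_fd);
    if (data_fd >= 0) close(data_fd);
    free(old_data);
    return rc;
}
```

&lt;p&gt;For the example from the text the call would look like &lt;code&gt;undo_overwrite("/dir", "/dir/data", "/dir/undo.log", 10, new_data, 15)&lt;/code&gt;. Note that every step that must be durable before the next one starts is followed by its own &lt;code&gt;fsync&lt;/code&gt;.&lt;/p&gt;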

&lt;blockquote&gt;
&lt;p&gt;Actually, you do not need to constantly create and remove the undo log - you can create it once and use special markers to detect that an operation completed successfully.&lt;br&gt;
But you still need to call &lt;code&gt;fsync&lt;/code&gt; whenever you change that file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By now you have probably figured out how to roll back:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check whether the undo log exists (or whether some operations are marked "in-progress")&lt;/li&gt;
&lt;li&gt;Verify the records (checksums)&lt;/li&gt;
&lt;li&gt;Find all undo operations (start byte, length, data)&lt;/li&gt;
&lt;li&gt;Write the original data back&lt;/li&gt;
&lt;li&gt;Remove the undo log (or mark the operations "aborted")&lt;/li&gt;
&lt;li&gt;Flush changes to disk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if another fault happens during rollback, we can always retry, because the algorithm is idempotent (additionally, you should consider atomicity and PSOW).&lt;/p&gt;
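&lt;p&gt;The rollback steps above can be sketched in C as well. Again, this is only an illustration under assumptions: the function name &lt;code&gt;undo_rollback&lt;/code&gt;, the record layout (three 64-bit integers in host byte order followed by &lt;code&gt;old_data&lt;/code&gt;) and the FNV-1a stand-in checksum are hypothetical. The sketch also makes 2 choices of its own: it flushes the data file before removing the log, and it treats a record with a bad checksum as a log that was never fully written, which (given the order of steps in the write algorithm) means the data file was not touched yet.&lt;/p&gt;

```c
#define _XOPEN_SOURCE 700   /* pwrite, fsync */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define FNV_BASIS 0xcbf29ce484222325ULL

/* FNV-1a hash, used here only as a stand-in for a real checksum such as CRC32. */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t n) {
    const unsigned char *p = buf;
    for (size_t i = 0; i < n; i++) {
        h = (h ^ p[i]) * 0x100000001b3ULL;
    }
    return h;
}

/* fsync a directory so that file removal inside it is durable. */
static int sync_dir(const char *dir) {
    int fd = open(dir, O_RDONLY);
    if (fd < 0) return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Returns 1 if old data was restored, 0 if there was nothing to undo, -1 on error.
 * It is safe to call again after a crash in the middle: the log is removed last. */
int undo_rollback(const char *dir, const char *data_path, const char *undo_path) {
    uint64_t header[3];                 /* check_sum, start, length */
    unsigned char *old_data = NULL;
    struct stat st;
    int rc = -1, data_fd = -1;
    assert(dir != NULL && data_path != NULL && undo_path != NULL);

    int undo_fd = open(undo_path, O_RDONLY);                  /* 1. log exists? */
    if (undo_fd < 0) return errno == ENOENT ? 0 : -1;

    /* 2-3. Read the record and validate it: size must match, checksum must match. */
    int valid = fstat(undo_fd, &st) == 0
        && read(undo_fd, header, sizeof header) == (ssize_t)sizeof header
        && (uint64_t)st.st_size - sizeof header == header[2];
    if (valid) {
        old_data = malloc(header[2] ? header[2] : 1);
        if (old_data == NULL) goto out;
        valid = read(undo_fd, old_data, header[2]) == (ssize_t)header[2]
            && header[0] == fnv1a(fnv1a(FNV_BASIS, &header[1], 2 * sizeof header[1]),
                                  old_data, header[2]);
    }
    if (valid) {
        /* 4. Write the original bytes back and make them durable first. */
        data_fd = open(data_path, O_WRONLY);
        if (data_fd < 0) goto out;
        if (pwrite(data_fd, old_data, header[2], (off_t)header[1]) != (ssize_t)header[2]) goto out;
        if (fsync(data_fd) != 0) goto out;
    }
    /* 5-6. Remove the log (valid or torn) and flush the directory entry. */
    if (unlink(undo_path) != 0 || sync_dir(dir) != 0) goto out;
    rc = valid ? 1 : 0;
out:
    close(undo_fd);
    if (data_fd >= 0) close(data_fd);
    free(old_data);
    return rc;
}
```

&lt;p&gt;Because the log is removed only after the restored data has been flushed, a crash at any point inside this function leaves the log in place, and the next start simply repeats the same writes. A production system may prefer to report a bad checksum instead of silently discarding the log.&lt;/p&gt;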

&lt;p&gt;SQLite uses an undo log, and its developers &lt;a href="https://devdoc.net/database/sqlite-3.0.7.2/atomiccommit.html#section_7_6" rel="noopener noreferrer"&gt;added 2 optimizations&lt;/a&gt; (because creating a new file is a costly operation):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Truncate the log file to zero length after the operation is complete - &lt;code&gt;PRAGMA journal_mode=TRUNCATE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Overwrite the log file header with zeros - &lt;code&gt;PRAGMA journal_mode=PERSIST&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Practical example: undo log in SQLite&lt;/p&gt;

&lt;p&gt;
  SQLite undo log
  &lt;p&gt;This algorithm is described in the documentation: &lt;a href="https://devdoc.net/database/sqlite-3.0.7.2/atomiccommit.html" rel="noopener noreferrer"&gt;Atomic Commit In SQLite&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/btree.c#L4388&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3BtreeCommit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Btree&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// First phase - create log and update data in database file&lt;/span&gt;
  &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3BtreeCommitPhaseOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="c1"&gt;// Second phase - remove/trancate/nullify log&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3BtreeCommitPhaseTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/btree.c#L4267&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3BtreeCommitPhaseOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Btree&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zSuperJrnl&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;inTrans&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;TRANS_WRITE&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3PagerCommitPhaseOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pBt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zSuperJrnl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/pager.c#L6437&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3PagerCommitPhaseOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="cm"&gt;/* Pager object */&lt;/span&gt;
  &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zSuper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="cm"&gt;/* If not NULL, the super-journal name */&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;noSync&lt;/span&gt;                      &lt;span class="cm"&gt;/* True to omit the xSync on the db file */&lt;/span&gt;
&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="cm"&gt;/* Return code */&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;pagerFlushOnCommit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pagerUseWal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 1. Save original data to log&lt;/span&gt;
      &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pager_incr_changecounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="c1"&gt;// 2. Sync data to disk (call fsync)&lt;/span&gt;
      &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;syncJournal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;commit_phase_one_exit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="c1"&gt;// 3. Write updated data to main database file&lt;/span&gt;
      &lt;span class="n"&gt;pList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3PcacheDirtyList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pPCache&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pager_write_pagelist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pList&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;commit_phase_one_exit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="c1"&gt;// 4. Flush main database file to disk&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3PagerSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zSuper&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;commit_phase_one_exit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/pager.c#L4259&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;syncJournal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;newHdr&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                         &lt;span class="cm"&gt;/* Return code */&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="c1"&gt;// Write total amount of pages equal to 0 (just in case)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;memcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aMagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aJournalMagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
          &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;u8&lt;/span&gt; &lt;span class="n"&gt;zerobyte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;zerobyte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iNextHdrOffset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
          &lt;span class="c1"&gt;// Flush written data&lt;/span&gt;
          &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;syncFlags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Write header with metadata&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zHeader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zHeader&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalHdr&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="c1"&gt;// Flush data to disk again&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;syncFlags&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;syncFlags&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_SYNC_FULL&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;SQLITE_SYNC_DATAONLY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalHdr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalOff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/pager.c#L6372&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3PagerSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zSuper&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="c1"&gt;// fsync&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;syncFlags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/btree.c#L4356&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3BtreeCommitPhaseTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Btree&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;bCleanup&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;inTrans&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;TRANS_WRITE&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3PagerCommitPhaseTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pBt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;bCleanup&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="n"&gt;sqlite3BtreeLeave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/pager.c#L6674&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;sqlite3PagerCommitPhaseTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                  &lt;span class="cm"&gt;/* Return code */&lt;/span&gt;
  &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pager_end_transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;setSuper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pager_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/pager.c#L2033&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;pager_end_transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Pager&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;hasSuper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;bCommit&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     

  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;isOpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;sqlite3JournalIsInMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalMode&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;PAGER_JOURNALMODE_TRUNCATE&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="c1"&gt;// PRAGMA journal_mode=TRUNCATE&lt;/span&gt;
      &lt;span class="c1"&gt;// Truncate file to 0&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalOff&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsTruncate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fullSync&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
          &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;syncFlags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalOff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalMode&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;PAGER_JOURNALMODE_PERSIST&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="c1"&gt;// PRAGMA journal_mode=PERSIST&lt;/span&gt;
      &lt;span class="c1"&gt;// Nullify header&lt;/span&gt;
      &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zeroJournalHdr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hasSuper&lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tempFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;journalOff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// PRAGMA journal_mode=DELETE&lt;/span&gt;
      &lt;span class="c1"&gt;// Remove undo log file&lt;/span&gt;
      &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;bDelete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tempFile&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;sqlite3OsClose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jfd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;bDelete&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3OsDelete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pVfs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;zJournal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pPager&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;extraSync&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;rc2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// sqlite3OsDelete implementation for *nix&lt;/span&gt;
&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/os_unix.c#L6533&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;unixDelete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;sqlite3_vfs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NotUsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="cm"&gt;/* VFS containing this as the xDelete method */&lt;/span&gt;
  &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;zPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="cm"&gt;/* Name of file to be deleted */&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;dirSync&lt;/span&gt;               &lt;span class="cm"&gt;/* If true, fsync() directory after deleting file */&lt;/span&gt;
&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Remove file itself&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;osUnlink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unixLogError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SQLITE_IOERR_DELETE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unlink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Sync directory contents&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dirSync&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;osOpenDirectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;SQLITE_OK&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
      &lt;span class="c1"&gt;// "Improved fsync"&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;full_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unixLogError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SQLITE_IOERR_DIR_FSYNC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fsync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLITE_OK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// "Improved fsync" - is just fsync that counts in specifics of some OS&lt;/span&gt;
&lt;span class="c1"&gt;// https://github.com/sqlite/sqlite/blob/5007833f5f82d33c95f44c65fc46221de1c5950f/src/os_unix.c#L3638&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;full_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fullSync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;dataOnly&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="cm"&gt;/* If we compiled with the SQLITE_NO_SYNC flag, then syncing is a
  ** no-op.  But go ahead and call fstat() to validate the file
  ** descriptor as we need a method to provoke a failure during
  ** coverage testing.
  */&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef SQLITE_NO_SYNC
&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;stat&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;osFstat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#elif HAVE_FULLFSYNC
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;fullSync&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;osFcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_FULLFSYNC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="cm"&gt;/* If the FULLFSYNC failed, fall back to attempting an fsync().
  ** It shouldn't be possible for fullfsync to fail on the local
  ** file system (on OSX), so failure indicates that FULLFSYNC
  ** isn't supported for this file system. So, attempt an fsync
  ** and (for now) ignore the overhead of a superfluous fcntl call.
  ** It'd be better to detect fullfsync support once and avoid
  ** the fcntl call every time sync is called.
  */&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cp"&gt;#elif defined(__APPLE__)
&lt;/span&gt;  &lt;span class="cm"&gt;/* fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly
  ** so currently we default to the macro that redefines fdatasync to fsync
  */&lt;/span&gt;
  &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;  &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#if OS_VXWORKS
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;==-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="n"&gt;ENOTSUP&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#endif &lt;/span&gt;&lt;span class="cm"&gt;/* OS_VXWORKS */&lt;/span&gt;&lt;span class="cp"&gt;
#endif &lt;/span&gt;&lt;span class="cm"&gt;/* ifdef SQLITE_NO_SYNC elif HAVE_FULLFSYNC */&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;OS_VXWORKS&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  REDO log
&lt;/h4&gt;

&lt;p&gt;The REDO log works similarly. The main difference is that we write the &lt;em&gt;new&lt;/em&gt; data to the log instead of the old.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creat("/dir/redo.log")&lt;/code&gt; - create new REDO log&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/redo.log", "[check_sum, start, length, new_data]")&lt;/code&gt; - write new data to log:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;start&lt;/code&gt; - starting position of data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;length&lt;/code&gt; - length of new data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;new_data&lt;/code&gt; - new data being written&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check_sum&lt;/code&gt; - checksum for new data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/redo.log")&lt;/code&gt; - flush new redo log file changes to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - sync parent directory contents (create redo-log)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write("/dir/data", new_data)&lt;/code&gt; - write new data to main file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir/data")&lt;/code&gt; - flush changes of main file to disk&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unlink("/dir/redo.log")&lt;/code&gt; - remove redo-log (mark completed)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync("/dir")&lt;/code&gt; - sync parent directory contents (remove redo-log)&lt;/li&gt;
&lt;/ol&gt;
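
&lt;p&gt;The steps above can be sketched in C roughly as follows. This is only a minimal illustration under stated assumptions, not code taken from any database: the names &lt;code&gt;redo_update&lt;/code&gt;, &lt;code&gt;sync_dir&lt;/code&gt;, &lt;code&gt;write_all&lt;/code&gt; and &lt;code&gt;checksum&lt;/code&gt; are hypothetical, the record is stored as plain text, and the checksum is a toy byte sum.&lt;/p&gt;

```c
#define _XOPEN_SOURCE 700
/* System headers (the quoted form falls back to the system include path). */
#include "assert.h"
#include "fcntl.h"
#include "stdio.h"
#include "unistd.h"

/* Toy checksum (byte sum). An assumption: real systems use CRC32 or similar. */
static unsigned checksum(const char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i != len; i++)
        sum += (unsigned char)buf[i];
    return sum;
}

/* fsync() a directory so that a created or removed entry becomes durable. */
static int sync_dir(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    if (fd == -1)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Write all of buf to fd; a short write is treated as a failure here. */
static int write_all(int fd, const char *buf, size_t len)
{
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* Hypothetical entry point: put new_data at offset start of data_path,
 * protecting the change with a REDO log file log_path located in dir. */
int redo_update(const char *dir, const char *log_path, const char *data_path,
                off_t start, const char *new_data, size_t len)
{
    assert(len != 0);

    /* Record header: "check_sum start length " followed by the raw data. */
    char hdr[64];
    int hdr_len = snprintf(hdr, sizeof hdr, "%u %lld %zu ",
                           checksum(new_data, len), (long long)start, len);

    /* Steps 1-3: create the log, write the record, flush the log file. */
    int log_fd = creat(log_path, 0644);
    if (log_fd == -1)
        return -1;
    int rc = write_all(log_fd, hdr, (size_t)hdr_len);
    if (rc == 0) rc = write_all(log_fd, new_data, len);
    if (rc == 0) rc = fsync(log_fd);
    close(log_fd);

    /* Step 4: sync the parent directory (the log file now exists). */
    if (rc == 0) rc = sync_dir(dir);
    if (rc != 0)
        return -1;

    /* Steps 5-6: write the new data to the main file and flush it. */
    int data_fd = open(data_path, O_WRONLY | O_CREAT, 0644);
    if (data_fd == -1)
        return -1;
    rc = pwrite(data_fd, new_data, len, start) == (ssize_t)len ? 0 : -1;
    if (rc == 0) rc = fsync(data_fd);
    close(data_fd);
    if (rc != 0)
        return -1;

    /* Steps 7-8: remove the log and sync the parent directory again. */
    if (unlink(log_path) == -1)
        return -1;
    return sync_dir(dir);
}
```

&lt;p&gt;A call such as &lt;code&gt;redo_update("/dir", "/dir/redo.log", "/dir/data", 0, "hello", 5)&lt;/code&gt; returns 0 on success and -1 on any failed step. The sketch leaves out the recovery routine that would replay a leftover log after a crash.&lt;/p&gt;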

&lt;p&gt;The main advantage of the REDO log, which is widely used by DBMSs, is that we can return a response to the user &lt;em&gt;right after the record has been written to the redo log&lt;/em&gt; (step 4).&lt;br&gt;
We can be sure the changes will be applied, either later (flushed to the file in the background) or during recovery.&lt;/p&gt;

&lt;p&gt;As with the UNDO log, we can apply some optimizations here. Again, the main one is to keep the log file around instead of constantly creating and deleting it. When we commit, we just mark the record as applied (or append a special record saying that the previous record has been applied, a "commit record").&lt;/p&gt;
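
&lt;p&gt;A persistent log with commit records might look like the sketch below. Everything in it is an assumption made for illustration: the text record layout and the names &lt;code&gt;wal_open&lt;/code&gt;, &lt;code&gt;wal_append&lt;/code&gt;, &lt;code&gt;wal_commit&lt;/code&gt;, &lt;code&gt;REC_DATA&lt;/code&gt; and &lt;code&gt;REC_COMMIT&lt;/code&gt; are hypothetical and do not come from any real database.&lt;/p&gt;

```c
#define _XOPEN_SOURCE 700
/* System headers (the quoted form falls back to the system include path). */
#include "assert.h"
#include "fcntl.h"
#include "stdio.h"
#include "string.h"
#include "unistd.h"

/* Hypothetical record kinds for this sketch. */
enum { REC_DATA = 'D', REC_COMMIT = 'C' };

/* Open the persistent log, creating it only if it does not exist yet.
 * O_APPEND makes every write land at the current end of the file.
 * After the very first creation the parent directory should be synced once. */
int wal_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Append one text record "kind id length payload" plus a newline,
 * then flush the log file to disk. */
int wal_append(int fd, char kind, unsigned id, const char *payload)
{
    assert(payload != NULL);

    char rec[256] = "";
    int n = snprintf(rec, sizeof rec, "%c %u %zu %s\n",
                     kind, id, strlen(payload), payload);
    /* snprintf reports the full length; a mismatch means the record did not fit. */
    if ((size_t)n != strlen(rec))
        return -1;
    if (write(fd, rec, (size_t)n) != (ssize_t)n)
        return -1;
    return fsync(fd);
}

/* Commit: append a small marker saying that record id has been applied,
 * instead of deleting the log file. */
int wal_commit(int fd, unsigned id)
{
    return wal_append(fd, REC_COMMIT, id, "");
}
```

&lt;p&gt;With this layout, recovery would scan the log and replay only those data records that have no matching commit record. No per-transaction &lt;code&gt;unlink&lt;/code&gt; or directory &lt;code&gt;fsync&lt;/code&gt; is needed, because the file itself is never removed.&lt;/p&gt;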

&lt;p&gt;The REDO log has an alternative name: WAL, Write-Ahead Log.&lt;/p&gt;

&lt;p&gt;Practical example: WAL in Postgres.&lt;/p&gt;

&lt;p&gt;
  WAL in Postgres
  &lt;p&gt;As I said, WAL is widely used in databases. It is used in &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/admin/managing-the-redo-log.html" rel="noopener noreferrer"&gt;Oracle&lt;/a&gt;, &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/current/runtime-config-wal.html" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt;, &lt;a href="https://www.sqlite.org/wal.html" rel="noopener noreferrer"&gt;SQLite&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/database-file-operations/logging-data-storage-algorithms" rel="noopener noreferrer"&gt;SQL Server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This example is split into 2 parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;COMMIT&lt;/code&gt; - save data to WAL&lt;/li&gt;
&lt;li&gt;Dirty page flushing to disk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here we can see the advantage I highlighted earlier: it is enough to write a record to the WAL to return a response to the user and continue processing other requests. The dirty page will be flushed to the table file later, during a checkpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1. "COMMIT;"&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/cc6e64afda530576d83e331365d36c758495a7cd/src/backend/access/transam/xact.c#L2158&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;CommitTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;   
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="n"&gt;RecordTransactionCommit&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/97d85be365443eb4bf84373a7468624762382059/src/backend/access/transam/xact.c#L1284&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;TransactionId&lt;/span&gt;
&lt;span class="nf"&gt;RecordTransactionCommit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;XactLogCommitRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GetCurrentTransactionStopTimestamp&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                        &lt;span class="n"&gt;nchildren&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nrels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;ndroppedstats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;droppedstats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;nmsgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invalMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;RelcacheInitFileInval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;MyXactFlags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;InvalidTransactionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt; &lt;span class="cm"&gt;/* plain commit */&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/eeefd4280f6e5167d70efabb89586b7d38922d95/src/backend/access/transam/xact.c#L5736&lt;/span&gt;
&lt;span class="n"&gt;XLogRecPtr&lt;/span&gt;
&lt;span class="nf"&gt;XactLogCommitRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TimestampTz&lt;/span&gt; &lt;span class="n"&gt;commit_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nsubxacts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TransactionId&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;subxacts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nrels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelFileLocator&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ndroppedstats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xl_xact_stats_item&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;droppedstats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nmsgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SharedInvalidationMessage&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;relcacheInval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;xactflags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TransactionId&lt;/span&gt; &lt;span class="n"&gt;twophase_xid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;twophase_gid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;XLogInsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RM_XACT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;XLogRecPtr&lt;/span&gt;
&lt;span class="nf"&gt;XLogInsertRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;XLogRecData&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rdata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;XLogRecPtr&lt;/span&gt; &lt;span class="n"&gt;fpw_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;uint8&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;num_fpi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;topxid_included&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;XLogFlush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EndPos&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/dbfc44716596073b99e093a04e29e774a518f520/src/backend/access/transam/xlog.c#L2728&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;XLogFlush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;XLogRecPtr&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="n"&gt;XLogWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WriteRqst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;insertTLI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/dbfc44716596073b99e093a04e29e774a518f520/src/backend/access/transam/xlog.c#L2273&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;XLogWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;XLogwrtRqst&lt;/span&gt; &lt;span class="n"&gt;WriteRqst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeLineID&lt;/span&gt; &lt;span class="n"&gt;tli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;flexible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* Have data to write */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Create new WAL segment file or open existing&lt;/span&gt;
     &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* Размер сегмента превышен */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;openLogFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;XLogFileInit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openLogSegNo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tli&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openLogFile&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;openLogFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;XLogFileOpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openLogSegNo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tli&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Write data to WAL&lt;/span&gt;
        &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;written&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_pwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openLogFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nleft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startoffset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* Have data to write */&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Flush WAL to disk&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finishing_seg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;issue_xlog_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openLogFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openLogSegNo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tli&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/dbfc44716596073b99e093a04e29e774a518f520/src/backend/access/transam/xlog.c#L8516&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;issue_xlog_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;XLogSegNo&lt;/span&gt; &lt;span class="n"&gt;segno&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeLineID&lt;/span&gt; &lt;span class="n"&gt;tli&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// fsync behaviour can be adjusted by GUC (configuration)&lt;/span&gt;
 &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wal_sync_method&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;WAL_SYNC_METHOD_FSYNC&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;pg_fsync_no_writethrough&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;WAL_SYNC_METHOD_FSYNC_WRITETHROUGH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;pg_fsync_writethrough&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;WAL_SYNC_METHOD_FDATASYNC&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;pg_fdatasync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    
   &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Flush "dirty" pages to disk, table file itself&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/97d85be365443eb4bf84373a7468624762382059/src/backend/storage/buffer/bufmgr.c#L3437&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;FlushBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BufferDesc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SMgrRelation&lt;/span&gt; &lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IOObject&lt;/span&gt; &lt;span class="n"&gt;io_object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;IOContext&lt;/span&gt; &lt;span class="n"&gt;io_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 3. Flush change to WAL&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf_state&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;BM_PERMANENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;XLogFlush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Flush changes to table file&lt;/span&gt;
    &lt;span class="n"&gt;smgrwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;BufTagGetForkNum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blockNum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;bufToWrite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/eeefd4280f6e5167d70efabb89586b7d38922d95/src/include/storage/smgr.h#L121&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;smgrwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMgrRelation&lt;/span&gt; &lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ForkNumber&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockNumber&lt;/span&gt; &lt;span class="n"&gt;blocknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;skipFsync&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;smgrwritev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blocknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skipFsync&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/eeefd4280f6e5167d70efabb89586b7d38922d95/src/backend/storage/smgr/smgr.c#L631&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;smgrwritev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMgrRelation&lt;/span&gt; &lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ForkNumber&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockNumber&lt;/span&gt; &lt;span class="n"&gt;blocknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockNumber&lt;/span&gt; &lt;span class="n"&gt;nblocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;skipFsync&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;smgrsw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;smgr_which&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;smgr_writev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blocknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nblocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skipFsync&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/eeefd4280f6e5167d70efabb89586b7d38922d95/src/backend/storage/smgr/md.c#L928&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;mdwritev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMgrRelation&lt;/span&gt; &lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ForkNumber&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockNumber&lt;/span&gt; &lt;span class="n"&gt;blocknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockNumber&lt;/span&gt; &lt;span class="n"&gt;nblocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;skipFsync&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 5. Write data to table file&lt;/span&gt;
 &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* Have more data blocks */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;FileWriteV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mdfd_vfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iovcnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seekpos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;WAIT_EVENT_DATA_FILE_WRITE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 6. Flush data to disk&lt;/span&gt;
    &lt;span class="n"&gt;register_dirty_segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/6b41ef03306f50602f68593d562cd73d5e39a9b9/src/backend/storage/file/fd.c#L2192&lt;/span&gt;
&lt;span class="kt"&gt;ssize_t&lt;/span&gt;
&lt;span class="nf"&gt;FileWriteV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;iovec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;iov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;iovcnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;off_t&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;uint32&lt;/span&gt; &lt;span class="n"&gt;wait_event_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Perform write directly&lt;/span&gt;
 &lt;span class="n"&gt;returnCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_pwritev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vfdP&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iovcnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/eeefd4280f6e5167d70efabb89586b7d38922d95/src/backend/storage/smgr/md.c#L1353&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;register_dirty_segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMgrRelation&lt;/span&gt; &lt;span class="n"&gt;reln&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ForkNumber&lt;/span&gt; &lt;span class="n"&gt;forknum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MdfdVec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* Failed to ask checkpointer to perform fsync */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Perform fsync directly&lt;/span&gt;
  &lt;span class="n"&gt;FileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mdfd_vfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WAIT_EVENT_DATA_FILE_SYNC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// https://github.com/postgres/postgres/blob/c20d90a41ca869f9c6dd4058ad1c7f5c9ee9d912/src/backend/storage/file/fd.c#L2297&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="nf"&gt;FileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uint32&lt;/span&gt; &lt;span class="n"&gt;wait_event_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;pg_fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VfdCache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Segmented log + Snapshot
&lt;/h3&gt;

&lt;p&gt;Since we are talking about WAL, it is worth discussing the pair of segmented log and snapshot. &lt;a href="https://martinfowler.com/articles/patterns-of-distributed-systems/segmented-log.html" rel="noopener noreferrer"&gt;Segmented log&lt;/a&gt; is a pattern that represents a single "logical" log as multiple "physical" segments.&lt;/p&gt;

&lt;p&gt;The benefit of this approach is significant. Without it we have only one big file. In a large, heavily loaded system this file might grow to thousands of terabytes. You might think this is fine because we have a huge amount of storage, but remember that the file lives in a file system, which may limit the maximum file size.&lt;br&gt;
Even if we use zfs or xfs (with file size limit of 8 exbibytes), there is still an engineering problem: maintenance.&lt;/p&gt;

&lt;p&gt;But if we split this big file into several small or medium-sized files, we get not only the ability to grow the WAL indefinitely, but also a more comfortable engineering experience: old WAL segments can be moved to some other place (and possibly compressed).&lt;/p&gt;
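&lt;p&gt;The mapping from a "logical" log position to a "physical" segment can be sketched in a few lines of C. This is only an illustration that assumes fixed-size segments; the names &lt;code&gt;SegmentPosition&lt;/code&gt; and &lt;code&gt;locate_in_segment&lt;/code&gt; are made up for this example and are not taken from any real code base.&lt;/p&gt;

```c
/* Hypothetical layout: the logical log is cut into fixed-size segments. */
typedef struct SegmentPosition
{
    unsigned long long segment_no;  /* which physical segment file */
    unsigned long long offset;      /* byte offset inside that segment */
} SegmentPosition;

/* Map a position in the single "logical" log to a "physical" segment. */
SegmentPosition
locate_in_segment(unsigned long long logical_pos,
                  unsigned long long segment_size)
{
    SegmentPosition pos;

    pos.segment_no = logical_pos / segment_size;
    pos.offset = logical_pos % segment_size;
    return pos;
}
```

&lt;p&gt;For example, with 16 MiB segments (16777216 bytes) the logical position 40000000 falls into segment 2 at offset 6445568.&lt;/p&gt;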

&lt;p&gt;The segmented log itself is a good tool for transactional processing (ability to roll back), but if our server runs for several months, the log can become too large and startup time becomes impractical (several hours). This is where the "Snapshot" comes in: a serialized application state to which all WAL records up to a certain record have already been applied.&lt;/p&gt;

&lt;p&gt;An illustration from the &lt;a href="https://raft.github.io/raft.pdf" rel="noopener noreferrer"&gt;Raft paper&lt;/a&gt; shows the relationship between these 2 components well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56g02p3ad1q630w8suqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56g02p3ad1q630w8suqj.png" alt="Relationship between segmented log and snapshot from Raft"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, we have 2 "files" representing application state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snapshot - file with serialized state after applying some WAL records, and&lt;/li&gt;
&lt;li&gt;Segmented log/WAL - multiple files representing single logical sequence of commands that must be applied to state to bring it up to date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Actual state = Snapshot + WAL records.&lt;/p&gt;
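&lt;p&gt;The formula above can be illustrated with a toy recovery routine. This is a minimal sketch under strong assumptions: the application state is a single counter, every WAL record is an "add delta" command, and the types &lt;code&gt;Snapshot&lt;/code&gt; and &lt;code&gt;WalRecord&lt;/code&gt; as well as &lt;code&gt;recover_state&lt;/code&gt; are hypothetical names used only here.&lt;/p&gt;

```c
/* Hypothetical application state: a single counter. */
typedef struct Snapshot
{
    long last_applied;  /* index of the last WAL record included in the snapshot */
    long counter;       /* serialized state at that point */
} Snapshot;

typedef struct WalRecord
{
    long index;  /* position of the record in the logical log */
    long delta;  /* command: add delta to the counter */
} WalRecord;

/* Rebuild the actual state: start from the snapshot, replay newer records. */
long
recover_state(Snapshot snap, const WalRecord *wal, int nrecords)
{
    long counter = snap.counter;
    int  i;

    for (i = 0; i != nrecords; i++)
    {
        /* Records already covered by the snapshot are skipped. */
        if (wal[i].index > snap.last_applied)
            counter += wal[i].delta;
    }
    return counter;
}
```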

&lt;p&gt;Finally, the main magic is that we can now work with both files atomically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To create or update snapshot - use &lt;code&gt;ACVR&lt;/code&gt;/&lt;code&gt;ARVR&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;When some command comes from user - add to WAL/Segmented log (and return response)&lt;/li&gt;
&lt;li&gt;New WAL segments created - &lt;code&gt;ACVR&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, when using a snapshot we do not actually need old WAL segments, because they are already applied in the snapshot. If we were using not a segmented log but a &lt;em&gt;monolog&lt;/em&gt; (I came up with the name myself), freeing up disk space would be problematic: an atomic operation would require &lt;code&gt;ARVR&lt;/code&gt;, which consumes too many resources for a large file. With a segmented log we can just remove some log segments (or send them to another storage/archive).&lt;/p&gt;
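&lt;p&gt;Deciding which segments are no longer needed can be sketched as follows. The function &lt;code&gt;removable_segments&lt;/code&gt; and its parameters are assumptions made for this illustration: each segment is described only by the index of its last record, and the newest segment is always kept because it is still being appended to.&lt;/p&gt;

```c
/*
 * Count the leading segments that can be removed (or archived).
 * last_index[i] is the index of the last record stored in segment i;
 * segments are ordered from oldest to newest.
 */
int
removable_segments(const long *last_index, int nsegments, long snapshot_index)
{
    int n;

    /* The newest segment (nsegments - 1) is never a candidate. */
    for (n = 0; nsegments - 1 > n; n++)
    {
        /* This segment holds records that are not in the snapshot yet. */
        if (last_index[n] > snapshot_index)
            break;
    }
    return n;
}
```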

&lt;p&gt;Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres: WAL represented as &lt;a href="https://www.postgresql.org/docs/current/wal-internals.html" rel="noopener noreferrer"&gt;multiple segments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Apache Kafka: each partition consists of &lt;a href="https://kafka.apache.org/documentation/#log" rel="noopener noreferrer"&gt;several segments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;etcd: data stored in &lt;a href="https://pkg.go.dev/github.com/etcd-io/etcd/snap" rel="noopener noreferrer"&gt;snapshot&lt;/a&gt; and &lt;a href="https://pkg.go.dev/go.etcd.io/etcd/wal" rel="noopener noreferrer"&gt;segmented WAL&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;log-cabin: data stored in &lt;a href="https://github.com/logcabin/logcabin/blob/master/Storage/SnapshotFile.cc" rel="noopener noreferrer"&gt;snapshot&lt;/a&gt; and &lt;a href="https://github.com/logcabin/logcabin/blob/master/Storage/SegmentedLog.cc" rel="noopener noreferrer"&gt;segmented WAL&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Actually, we can think of table files in Postgres as a snapshot.&lt;br&gt;
But in Raft a snapshot is immutable, while Postgres modifies such a "snapshot" directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This approach to organizing data storage has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fault-tolerant update of application state: fault-tolerant WAL write and snapshot updates&lt;/li&gt;
&lt;li&gt;Faster data updates: the response is sent right after the record is saved to WAL, plus data locality and sequential access when writing to WAL&lt;/li&gt;
&lt;li&gt;More efficient replication: streaming replication of state (just send WAL records sequentially) and/or sending an immutable snapshot&lt;/li&gt;
&lt;li&gt;Faster application startup: we only need to deserialize the application state and apply the WAL records written after the snapshot (instead of applying &lt;em&gt;all&lt;/em&gt; records ever written, as we would without a snapshot)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post we went through the main layers that a write request passes through. But some topics are not covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network file systems&lt;/li&gt;
&lt;li&gt;A deeper description of storage devices&lt;/li&gt;
&lt;li&gt;Architecture and implementation of different file systems&lt;/li&gt;
&lt;li&gt;Comparison of file systems&lt;/li&gt;
&lt;li&gt;Ephemeral file systems (e.g. OverlayFS used in Docker)&lt;/li&gt;
&lt;li&gt;Cross-platform&lt;/li&gt;
&lt;li&gt;Bugs in implementation of each layer&lt;/li&gt;
&lt;li&gt;Behaviours in emulators/virtual machines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope this article was useful. Bye!&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>postgres</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>PostgreSQL planner development and debugging</title>
      <dc:creator>Sergey Solovev</dc:creator>
      <pubDate>Thu, 06 Feb 2025 15:52:46 +0000</pubDate>
      <link>https://forem.com/ashenblade/postgresql-planner-development-and-debugging-47mc</link>
      <guid>https://forem.com/ashenblade/postgresql-planner-development-and-debugging-47mc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a translation of my talk "Debugging PostgreSQL planner" from the &lt;a href="https://pgbootcamp.ru/en/2024-kazan" rel="noopener noreferrer"&gt;PGBootCamp 2024&lt;/a&gt; conference.&lt;br&gt;
You can find the repository with the source code and other materials &lt;a href="https://github.com/TantorLabs/meetups/tree/main/2024-09-17_Kazan/Sergey%20Solovev%20-%20Debugging%20PostgreSQL%20planner" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post we will look at how the PostgreSQL planner works at the code level (functions and data structures) and how to hack on it. We will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go over the main functions used by the planner and its main pipeline&lt;/li&gt;
&lt;li&gt;Get acquainted with the type system: &lt;code&gt;Node&lt;/code&gt; and its children&lt;/li&gt;
&lt;li&gt;See how a query is represented in code and which data structures represent its parts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After some theory we will implement and add a little feature into the planner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;All work is done in the folder linked above (it is the current working directory).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want to reproduce some parts of this post, you need to set up the repository.&lt;/p&gt;

&lt;p&gt;All you need to do is to run &lt;code&gt;init.sh&lt;/code&gt;. This script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Downloads PostgreSQL 16.4&lt;/li&gt;
&lt;li&gt;Applies patch&lt;/li&gt;
&lt;li&gt;Copies development scripts&lt;/li&gt;
&lt;li&gt;If VS Code is installed:

&lt;ol&gt;
&lt;li&gt;Copies configuration files to &lt;code&gt;.vscode&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Installs required extensions&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;To download the source code you also need &lt;code&gt;wget&lt;/code&gt; or &lt;code&gt;curl&lt;/code&gt;, and &lt;code&gt;tar&lt;/code&gt; to unpack the archive.&lt;br&gt;
If they are missing, install them or download the archive manually using &lt;a href="https://ftp.postgresql.org/pub/source/v16.4/postgresql-16.4.tar.gz" rel="noopener noreferrer"&gt;this link&lt;/a&gt; and store it in the same directory as the &lt;code&gt;init.sh&lt;/code&gt; script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For building and debugging PostgreSQL you also need to have these libraries/executables installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;libreadline&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bison&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;make&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gcc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gdb&lt;/code&gt; (or another debugger)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CPAN&lt;/code&gt; (for Perl)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install them like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Debian based&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;build-essential gdb bison flex libreadline-dev git

&lt;span class="c"&gt;# RPM based&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;yum update
&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install &lt;/span&gt;gcc gdb bison flex make readline-devel perl-CPAN git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the whole setup pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone repository and go to directory with meetup files&lt;/span&gt;
git clone https://github.com/TantorLabs/meetups
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"meetups/2024-09-17_Kazan/Sergey Solovev - Debugging PostgreSQL planner"&lt;/span&gt;


&lt;span class="c"&gt;# Run initialization script: downloads files, applies patches, etc.&lt;/span&gt;
./init.sh

&lt;span class="c"&gt;# Go to PostgreSQL source code directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;postgresql

&lt;span class="c"&gt;# Run build using dev script (parallel with 8 threads)&lt;/span&gt;
./dev/build.sh &lt;span class="nt"&gt;-j&lt;/span&gt; 8

&lt;span class="c"&gt;# Initialize DB and setup schema for tests&lt;/span&gt;
./dev/run.sh &lt;span class="nt"&gt;--init-db&lt;/span&gt; &lt;span class="nt"&gt;--run-db&lt;/span&gt; &lt;span class="nt"&gt;--psql&lt;/span&gt; &lt;span class="nt"&gt;--script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;../schema_constrexcl.sql &lt;span class="nt"&gt;--stop-db&lt;/span&gt;

&lt;span class="c"&gt;# Run VS Code (if you have it installed) and open the main work file&lt;/span&gt;
code &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--goto&lt;/span&gt; src/backend/optimizer/util/constrexcl.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this you can run the queries from &lt;code&gt;queries_constrexcl.sql&lt;/code&gt; using &lt;code&gt;psql&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will work with these files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;src/backend/optimizer/util/constrexcl.c&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/backend/optimizer/util/clauses.c&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/backend/optimizer/plan/planmain.c&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For full-fledged debugging you should install extra dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;icu-i18n&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;zstd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;zlib&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pkg-config&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the Perl package &lt;code&gt;IPC::Run&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-level planner architecture
&lt;/h2&gt;

&lt;p&gt;Now for the theory. I will explain some parts without digging too deep. If you want to explore the harder parts, you are welcome to read the READMEs: you can find them in the PostgreSQL source code repository under &lt;code&gt;src/backend/optimizer&lt;/code&gt;. These READMEs explain multiple aspects of the planner workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query processing algorithm
&lt;/h3&gt;

&lt;p&gt;Let's kick off with the query processing pipeline. It can be represented as 4 distinct stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parsing&lt;/li&gt;
&lt;li&gt;Rewriting&lt;/li&gt;
&lt;li&gt;Planning&lt;/li&gt;
&lt;li&gt;Execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0amo3bmjxospzm81msv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0amo3bmjxospzm81msv.png" alt="Query processing pipeline stages" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can guess, we will talk about the third stage, planning. It can also be divided into 4 stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing&lt;/li&gt;
&lt;li&gt;Optimization&lt;/li&gt;
&lt;li&gt;Finding possible paths&lt;/li&gt;
&lt;li&gt;Plan creation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stages 1 and 2 both perform optimizations. The main difference is that &lt;code&gt;Preprocessing&lt;/code&gt; works with the &lt;code&gt;Query tree&lt;/code&gt; and makes "simple" optimizations, e.g. constant folding (calculating the result of constant expressions).&lt;br&gt;
&lt;code&gt;Optimization&lt;/code&gt;, in contrast, makes harder optimizations that rely on "whole-query" knowledge: JOINs, constants, partitions, tables etc.&lt;/p&gt;
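&lt;p&gt;To make constant folding more concrete, here is a toy version in C. It is a sketch that does not use the real PostgreSQL data structures: &lt;code&gt;Expr&lt;/code&gt;, &lt;code&gt;ExprKind&lt;/code&gt; and &lt;code&gt;fold_constants&lt;/code&gt; are hypothetical names, and only integer addition is supported.&lt;/p&gt;

```c
typedef enum ExprKind
{
    EXPR_CONST,  /* literal value */
    EXPR_VAR,    /* column reference, unknown at planning time */
    EXPR_ADD     /* left + right */
} ExprKind;

typedef struct Expr
{
    ExprKind     kind;
    int          value;  /* used by EXPR_CONST */
    struct Expr *left;   /* used by EXPR_ADD */
    struct Expr *right;  /* used by EXPR_ADD */
} Expr;

/* Fold constant subexpressions in place, bottom-up. */
void
fold_constants(Expr *expr)
{
    if (expr->kind != EXPR_ADD)
        return;

    /* Simplify the operands first. */
    fold_constants(expr->left);
    fold_constants(expr->right);

    if (expr->left->kind != EXPR_CONST)
        return;
    if (expr->right->kind != EXPR_CONST)
        return;

    /* Both operands are known: replace the node with the computed result. */
    expr->value = expr->left->value + expr->right->value;
    expr->kind = EXPR_CONST;
    expr->left = 0;
    expr->right = 0;
}
```

&lt;p&gt;With this sketch the expression &lt;code&gt;x + (1 + 2)&lt;/code&gt; becomes &lt;code&gt;x + 3&lt;/code&gt;: the inner addition is folded, while the outer one stays because &lt;code&gt;x&lt;/code&gt; is not a constant.&lt;/p&gt;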

&lt;p&gt;During stage 3 we find all possible ways to execute the query (excluding those known to be inefficient). For example, if a table has indexes, then after some optimizations it may turn out that we do not need explicit sorting, because &lt;code&gt;Index only scan&lt;/code&gt; already provides sorted data.&lt;/p&gt;

&lt;p&gt;In the last stage we pick the best path and create an execution &lt;em&gt;plan&lt;/em&gt; for it. This plan will be used by the Executor to actually run the query.&lt;/p&gt;
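&lt;p&gt;Picking the best path can be pictured as choosing the candidate with the lowest estimated cost. The sketch below is heavily simplified and assumes that a single total cost is the only criterion; &lt;code&gt;DemoPath&lt;/code&gt; and &lt;code&gt;cheapest_path&lt;/code&gt; are hypothetical names, and the real planner takes more properties of a path into account.&lt;/p&gt;

```c
/* Hypothetical, heavily simplified path: only the estimated total cost. */
typedef struct DemoPath
{
    const char *name;
    double      total_cost;
} DemoPath;

/* Return the index of the cheapest candidate path, or -1 if there are none. */
int
cheapest_path(const DemoPath *paths, int npaths)
{
    int best = -1;
    int i;

    for (i = 0; i != npaths; i++)
    {
        if (best == -1 || paths[best].total_cost > paths[i].total_cost)
            best = i;
    }
    return best;
}
```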
&lt;h3&gt;
  
  
  Source code organization
&lt;/h3&gt;

&lt;p&gt;In the source code these stages are organized in the following way.&lt;/p&gt;

&lt;p&gt;We have "main" functions - entry points for "main" activity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;query_planner&lt;/code&gt; - creates access paths for tables (e.g. &lt;code&gt;SeqScan&lt;/code&gt;, &lt;code&gt;IndexScan&lt;/code&gt;, etc.) and also creates and initializes the main data structures for further use.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grouping_planner&lt;/code&gt; - wrapper for &lt;code&gt;query_planner&lt;/code&gt; that adds postprocessing logic after we retrieve data from tables: sorting, grouping, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;subquery_planner&lt;/code&gt; - entry point for planning a single subquery. It is called each time we encounter a new subquery (recursively).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;standard_planner&lt;/code&gt; - entry point for the whole planner.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Schematically, it looks like this.&lt;/p&gt;

&lt;p&gt;At the top we have &lt;code&gt;standard_planner&lt;/code&gt;. It prepares the environment and calls &lt;code&gt;subquery_planner&lt;/code&gt; for the top query (the top query is also a subquery, but without a parent).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;subquery_planner&lt;/code&gt; preprocesses the &lt;em&gt;query tree&lt;/em&gt; and calls &lt;code&gt;grouping_planner&lt;/code&gt; to run the main planning logic.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grouping_planner&lt;/code&gt; runs &lt;code&gt;query_planner&lt;/code&gt; and after that adds postprocessing nodes to the query plan: sorting, grouping, window functions, etc.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;query_planner&lt;/code&gt; initializes state of planner and performs optimizations. After optimizations it calls &lt;code&gt;make_one_rel&lt;/code&gt; to actually find best access path.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;make_one_rel&lt;/code&gt; finds best access path for whole relation (including strategy for JOIN search).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;standard_join_search&lt;/code&gt; determines the best JOIN strategy. It uses a dynamic programming algorithm, as opposed to GEQO (Genetic Query Optimizer), which is used when the number of tables in a single JOIN reaches &lt;code&gt;geqo_threshold&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;standard_planner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Initialize global planner state */&lt;/span&gt;
    &lt;span class="n"&gt;subquery_planner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* Query tree preprocessing */&lt;/span&gt;
        &lt;span class="n"&gt;grouping_planner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="cm"&gt;/* Process GROUP operations: SORT, GROUP BY, WINDOW, SET ops */&lt;/span&gt;
            &lt;span class="n"&gt;query_planner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="cm"&gt;/* Planner state initialization */&lt;/span&gt;
                &lt;span class="cm"&gt;/* Optimizations */&lt;/span&gt;
                &lt;span class="cm"&gt;/* Find access paths */&lt;/span&gt;
                &lt;span class="n"&gt;make_one_rel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="cm"&gt;/* JOIN strategy */&lt;/span&gt;
                    &lt;span class="n"&gt;standard_join_search&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="cm"&gt;/* Add postprocessing nodes (group, sort, etc...) */&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="cm"&gt;/* Choose best path */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="cm"&gt;/* Generating plan for best path */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe5crjywdae7ut8uuds0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe5crjywdae7ut8uuds0.png" alt="Correlation between query and source code functions" width="483" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data structures
&lt;/h2&gt;

&lt;p&gt;Now let's talk about the type system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes and trees
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has its own type system based on C-style inheritance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Node&lt;/code&gt; is the base structure for everything. It has a single member of type &lt;code&gt;NodeTag&lt;/code&gt; - the type discriminator.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NodeTag&lt;/code&gt; is a simple &lt;code&gt;enum&lt;/code&gt;; each value is named as the &lt;code&gt;T_&lt;/code&gt; prefix + the name of the structure.&lt;/p&gt;
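&lt;p&gt;To make the idea of tag-based inheritance concrete, here is a minimal standalone sketch. It is a simplified model, not PostgreSQL's actual headers: the structures carry only a few illustrative fields, and &lt;code&gt;node_name&lt;/code&gt; is a hypothetical helper added for the example.&lt;/p&gt;

```c
#include <assert.h>
#include <string.h>

/* Simplified standalone model; the real definitions live in src/include/nodes. */
typedef enum NodeTag
{
    T_Invalid = 0,
    T_Var,
    T_Const
} NodeTag;

/* Base "class": it contains only the type discriminator. */
typedef struct Node
{
    NodeTag     type;
} Node;

/* "Derived" structures start with the tag, so they can be inspected as Node. */
typedef struct Var
{
    NodeTag     type;
    int         varno;      /* which relation the column belongs to */
    int         varattno;   /* column number inside that relation */
} Var;

typedef struct Const
{
    NodeTag     type;
    long        constvalue; /* illustrative; the real field is a Datum */
} Const;

#define nodeTag(nodeptr)        (((const Node *) (nodeptr))->type)
#define IsA(nodeptr, _type_)    (nodeTag(nodeptr) == T_##_type_)

/* Hypothetical helper: dispatch on the tag to find the concrete type. */
static const char *
node_name(const void *node)
{
    switch (nodeTag(node))
    {
        case T_Var:
            return "Var";
        case T_Const:
            return "Const";
        default:
            return "unknown";
    }
}
```

&lt;p&gt;Because every structure begins with the tag, any pointer can be checked with &lt;code&gt;IsA(ptr, Var)&lt;/code&gt; before it is cast to the concrete type.&lt;/p&gt;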

&lt;p&gt;All possible nodes are known in advance. Their tag values are defined in the &lt;code&gt;src/include/nodes/nodes.h&lt;/code&gt; header file or, starting with version 16, are generated by code-gen into &lt;code&gt;src/include/nodes/nodetags.h&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The definitions of these nodes are stored in header files in &lt;code&gt;src/include/nodes&lt;/code&gt;. The names of such files end with &lt;code&gt;nodes.h&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/primnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/pathnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/plannodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/execnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/memnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/miscnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/replnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/parsenodes.h&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/include/nodes/supportnodes.h&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Node examples
&lt;/h3&gt;

&lt;p&gt;This diagram shows a tree of several Node types, grouped by purpose (the grouping is my own). In total there are about 500 types, so we will only look at the most used ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw1vabqwzkxfnmx8w13c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw1vabqwzkxfnmx8w13c.png" alt="Diagram of Node types" width="765" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;List&lt;/code&gt; is a dynamic array. It can store &lt;code&gt;void *&lt;/code&gt;, &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;Oid&lt;/code&gt; or &lt;code&gt;TransactionId&lt;/code&gt; (only one specific type per list). Its tag tells us which kind of data is stored in it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;&lt;code&gt;NodeTag&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;void *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;T_List&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;int&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;T_IntList&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Oid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;T_OidList&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TransactionId&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;T_XidList&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;NOTE: &lt;code&gt;List&lt;/code&gt; is unique in that it has 4 different tags for a single structure.&lt;/p&gt;
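&lt;p&gt;The table above can be sketched as one structure whose tag selects which member of each cell is valid. This is a simplified standalone model (the real &lt;code&gt;List&lt;/code&gt; has more fields), and &lt;code&gt;sum_int_list&lt;/code&gt; is a hypothetical helper used only for illustration.&lt;/p&gt;

```c
#include <assert.h>

/* Simplified standalone model of the tagged List; not the full definition. */
typedef unsigned int Oid;
typedef unsigned int TransactionId;

typedef enum NodeTag
{
    T_Invalid = 0,
    T_List,
    T_IntList,
    T_OidList,
    T_XidList
} NodeTag;

/* One cell holds exactly one of the four element kinds. */
typedef union ListCell
{
    void       *ptr_value;
    int         int_value;
    Oid         oid_value;
    TransactionId xid_value;
} ListCell;

typedef struct List
{
    NodeTag     type;       /* says which union member is valid */
    int         length;     /* number of elements in use */
    ListCell   *elements;   /* the array of cells */
} List;

/* Hypothetical helper: sum an integer list, refusing any other list kind. */
static long
sum_int_list(const List *list)
{
    long        sum = 0;

    assert(list->type == T_IntList);
    for (int i = 0; i < list->length; i++)
        sum += list->elements[i].int_value;
    return sum;
}
```

&lt;p&gt;Checking the tag before reading a cell is what keeps a single structure safe to use for four different element types.&lt;/p&gt;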

&lt;p&gt;&lt;code&gt;Expr&lt;/code&gt; is the base structure for nodes that can be evaluated (expressions). You can encounter them in the attribute list of &lt;code&gt;SELECT&lt;/code&gt; or in &lt;code&gt;WHERE&lt;/code&gt; conditions. Examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Evaluation result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Var&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Table column&lt;/td&gt;
&lt;td&gt;Value of specified column in tuple of specified table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Const&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Constant&lt;/td&gt;
&lt;td&gt;Constant value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OpExpr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;td&gt;Operator invocation with specified arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FuncExpr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Function&lt;/td&gt;
&lt;td&gt;Function invocation with specified arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BoolExpr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Boolean expression&lt;/td&gt;
&lt;td&gt;Evaluation of boolean expression (&lt;code&gt;NOT&lt;/code&gt;, &lt;code&gt;AND&lt;/code&gt;, &lt;code&gt;OR&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SubPlan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-SELECT&lt;/td&gt;
&lt;td&gt;Result of SELECT query execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Node&lt;/code&gt; and &lt;code&gt;Expr&lt;/code&gt; are &lt;em&gt;abstract&lt;/em&gt;: they do not have their own &lt;code&gt;NodeTag&lt;/code&gt; values.&lt;br&gt;
They serve as markers: &lt;code&gt;Node&lt;/code&gt; - any object of the PostgreSQL type system, &lt;code&gt;Expr&lt;/code&gt; - an expression.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One more group is planner nodes. They are used by the planner and keep data in a form that is convenient for it, e.g. the tables of a JOIN are stored as a set, not as a tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query tree representation
&lt;/h3&gt;

&lt;p&gt;First, let's look at how a query is represented in code. This diagram shows the parts of a query with the corresponding data structures (nodes).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsjszxg1ibapnnupevmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsjszxg1ibapnnupevmm.png" alt="Query tree and corresponding data structures" width="641" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Query&lt;/code&gt; represents the query tree of a single query. Each sub-query gets its own &lt;code&gt;Query&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next comes the representation of tables. There is an important abstraction here - the &lt;code&gt;Range Table&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Roughly speaking, the &lt;code&gt;Range Table&lt;/code&gt; is everything that can appear in the &lt;code&gt;FROM&lt;/code&gt; clause, i.e. the data sources. Each such element is called a &lt;code&gt;Range Table Entry&lt;/code&gt; (RTE):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table&lt;/li&gt;
&lt;li&gt;Function&lt;/li&gt;
&lt;li&gt;Subquery&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;JOIN&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;CTE&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VALUES (), ()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;RangeTblEntry&lt;/code&gt; is the structure that represents a Range Table Entry. The &lt;code&gt;RTEKind&lt;/code&gt; enum is used to distinguish the different types of RTE.&lt;/p&gt;

&lt;p&gt;For the query in the example we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tbl1&lt;/code&gt; - &lt;code&gt;RTE_RELATION&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tbl2&lt;/code&gt; - &lt;code&gt;RTE_RELATION&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tbl1 JOIN tbl2&lt;/code&gt; - &lt;code&gt;RTE_JOIN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generate_series&lt;/code&gt; - &lt;code&gt;RTE_FUNCTION&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tbl1 JOIN tbl2 LEFT OUTER JOIN generate_series&lt;/code&gt; - &lt;code&gt;RTE_JOIN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(SELECT MAX(id) ... )&lt;/code&gt; - &lt;code&gt;RTE_SUBQUERY&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that each query has its own Range Table, so &lt;code&gt;tbl3&lt;/code&gt; is an RTE &lt;em&gt;of the subquery&lt;/em&gt; and is not in the list above.&lt;/p&gt;
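&lt;p&gt;As a rough illustration of "one range table per query", the entries listed above can be written down as two flat arrays. This is a standalone sketch: &lt;code&gt;RangeTblEntrySketch&lt;/code&gt; and &lt;code&gt;count_kind&lt;/code&gt; are hypothetical names, the enum is only a subset of the real &lt;code&gt;RTEKind&lt;/code&gt;, and the order of entries follows the list above rather than the order PostgreSQL would assign.&lt;/p&gt;

```c
#include <assert.h>

/* Subset of the kinds mentioned in the text; the real enum has more values. */
typedef enum RTEKind
{
    RTE_RELATION,
    RTE_SUBQUERY,
    RTE_JOIN,
    RTE_FUNCTION
} RTEKind;

/* Hypothetical, heavily reduced stand-in for RangeTblEntry. */
typedef struct RangeTblEntrySketch
{
    RTEKind     rtekind;
    const char *source;
} RangeTblEntrySketch;

/* Range table of the outer query from the example. */
static const RangeTblEntrySketch outer_rtable[] = {
    {RTE_RELATION, "tbl1"},
    {RTE_RELATION, "tbl2"},
    {RTE_JOIN, "tbl1 JOIN tbl2"},
    {RTE_FUNCTION, "generate_series"},
    {RTE_JOIN, "tbl1 JOIN tbl2 LEFT OUTER JOIN generate_series"},
    {RTE_SUBQUERY, "(SELECT MAX(id) ... )"}
};

/* The subquery has its own, separate range table. */
static const RangeTblEntrySketch subquery_rtable[] = {
    {RTE_RELATION, "tbl3"}
};

/* Count the entries of one kind in a range table. */
static int
count_kind(const RangeTblEntrySketch *rtable, int length, RTEKind kind)
{
    int         count = 0;

    for (int i = 0; i < length; i++)
    {
        if (rtable[i].rtekind == kind)
            count++;
    }
    return count;
}
```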

&lt;h3&gt;
  
  
  Query representation in planner
&lt;/h3&gt;

&lt;p&gt;Now let's see how the planner sees the same query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoven6imo3aphtelwidm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoven6imo3aphtelwidm.png" alt="Parts of query and planner data structures" width="649" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Main parts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;PlannerInfo&lt;/code&gt; - information about single query (just like &lt;code&gt;Query&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RelOptInfo&lt;/code&gt; - planner information about 1 &lt;em&gt;relation&lt;/em&gt;: table, function, subquery, CTE, JOIN etc. (just like &lt;code&gt;RangeTblEntry&lt;/code&gt;). To distinguish the different types of relations, &lt;code&gt;RelOptKind&lt;/code&gt; (another enum) is used.&lt;/p&gt;

&lt;p&gt;In the previous section you may have noticed that some &lt;code&gt;RTEKind&lt;/code&gt; values are coloured. Let's draw a parallel between &lt;code&gt;RelOptKind&lt;/code&gt; and &lt;code&gt;RTEKind&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RELOPT_BASEREL&lt;/code&gt; corresponds to &lt;code&gt;RTE_RELATION&lt;/code&gt; (table), &lt;code&gt;RTE_FUNCTION&lt;/code&gt; (function) and &lt;code&gt;RTE_SUBQUERY&lt;/code&gt; (subquery). The planner mostly does not need to know which of these it is dealing with; all it needs are 2 things: a name (to be able to address the relation) and the list of returned attributes.&lt;/p&gt;

&lt;p&gt;Such relations are called &lt;em&gt;base relations&lt;/em&gt;. The planner likes to work with them, and knowing the origin of a relation is not that important.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RELOPT_JOINREL&lt;/code&gt; is created for &lt;code&gt;RTE_JOIN&lt;/code&gt;. Here we can see one more difference from the query tree. In the query tree we have many &lt;code&gt;RTE_JOIN&lt;/code&gt; entries (1 for each JOIN), but here there is only 1. The reason lies in the storage format - the planner stores the tables of a join as a set. This simplifies the work required to search for the optimal JOIN order.&lt;/p&gt;
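&lt;p&gt;Why a set helps can be shown with a small standalone sketch. It assumes a 32-bit mask as a stand-in for the set of relation indexes (PostgreSQL itself uses a &lt;code&gt;Bitmapset&lt;/code&gt; for this), and the &lt;code&gt;relids_*&lt;/code&gt; helpers are hypothetical names chosen for the example.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: bit i is set when range table index i is a member. */
typedef uint32_t Relids;

/* Build a set that contains a single relation. */
static Relids
relids_make_singleton(int rti)
{
    return (Relids) 1 << rti;
}

/* A join relation is identified by the union of its inputs. */
static Relids
relids_union(Relids a, Relids b)
{
    return a | b;
}

/* Check whether a relation takes part in the given set. */
static bool
relids_is_member(int rti, Relids set)
{
    return (set & relids_make_singleton(rti)) != 0;
}
```

&lt;p&gt;With this representation &lt;code&gt;(A JOIN B) JOIN C&lt;/code&gt; and &lt;code&gt;A JOIN (B JOIN C)&lt;/code&gt; produce the same set, so different join orders of the same tables can be compared as alternatives for one and the same relation.&lt;/p&gt;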

&lt;blockquote&gt;
&lt;p&gt;Other values of &lt;code&gt;RelOptKind&lt;/code&gt; are not covered here. But they are also useful, e.g. to support table inheritance or &lt;code&gt;UNION ALL&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Extra structures
&lt;/h3&gt;

&lt;p&gt;In its work the planner uses multiple auxiliary data structures. Here we will cover only 2 of them: &lt;code&gt;RestrictInfo&lt;/code&gt; and &lt;code&gt;EquivalenceClass&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dw3wmeuv913s2kdejk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dw3wmeuv913s2kdejk.png" alt="RestrictInfo and EquivalenceClass" width="452" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RestrictInfo&lt;/code&gt; is a restriction (constraint) applied to the query, e.g. a condition from &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt;. The query in the diagram has 3 restrictions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;t1.id = t2.id&lt;/code&gt; - &lt;code&gt;JOIN&lt;/code&gt; condition&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;t1.value = t2.value&lt;/code&gt; - first &lt;code&gt;AND&lt;/code&gt; condition in &lt;code&gt;WHERE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;t1.value = 0&lt;/code&gt; - second &lt;code&gt;AND&lt;/code&gt; condition in &lt;code&gt;WHERE&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Any conditions/restrictions are called &lt;code&gt;qualifications&lt;/code&gt; (or &lt;code&gt;quals&lt;/code&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;EquivalenceClass&lt;/code&gt; is a set of values known to be equal to each other. In the same query we have 2 equivalence classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;t1.id&lt;/code&gt; and &lt;code&gt;t2.id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;t1.value&lt;/code&gt;, &lt;code&gt;t2.value&lt;/code&gt; and &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that an &lt;code&gt;EquivalenceClass&lt;/code&gt; is built from &lt;code&gt;RestrictInfo&lt;/code&gt;s. Both data structures are important, because they help the planner find better scan paths. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RestrictInfo&lt;/code&gt; - which index to use. If we have an index on &lt;code&gt;t1.value&lt;/code&gt;, then we can use it because of the &lt;code&gt;t1.value = 0&lt;/code&gt; qualification&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EquivalenceClass&lt;/code&gt; - find dependencies between the data used. If in the previous example the index is on &lt;code&gt;t2.value&lt;/code&gt; (not &lt;code&gt;t1.value&lt;/code&gt;), then we can not use the &lt;code&gt;t1.value = 0&lt;/code&gt; qual directly, but &lt;code&gt;t1.value&lt;/code&gt; and &lt;code&gt;t2.value&lt;/code&gt; are in the same &lt;code&gt;EquivalenceClass&lt;/code&gt;, so we transitively conclude that &lt;code&gt;t2.value = 0&lt;/code&gt; and can use the index.&lt;/li&gt;
&lt;/ul&gt;
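&lt;p&gt;The transitive step from the last bullet can be sketched in a few lines. This is a toy standalone model, not the planner's real &lt;code&gt;EquivalenceClass&lt;/code&gt;: columns are plain strings, the class holds at most one constant, and &lt;code&gt;EquivClassSketch&lt;/code&gt;, &lt;code&gt;ec_contains&lt;/code&gt; and &lt;code&gt;ec_derive_const&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EC_MAX_MEMBERS 8

/* Toy model: a set of column names plus an optional constant member. */
typedef struct EquivClassSketch
{
    const char *columns[EC_MAX_MEMBERS];
    int         ncolumns;
    bool        has_const;
    long        const_value;
} EquivClassSketch;

/* Is the column a member of this equivalence class? */
static bool
ec_contains(const EquivClassSketch *ec, const char *column)
{
    for (int i = 0; i < ec->ncolumns; i++)
    {
        if (strcmp(ec->columns[i], column) == 0)
            return true;
    }
    return false;
}

/*
 * If the column belongs to a class that also contains a constant,
 * we may conclude "column = constant" even if the query never says so.
 */
static bool
ec_derive_const(const EquivClassSketch *ec, const char *column, long *out_value)
{
    if (!ec->has_const || !ec_contains(ec, column))
        return false;

    *out_value = ec->const_value;
    return true;
}
```

&lt;p&gt;For the class { &lt;code&gt;t1.value&lt;/code&gt;, &lt;code&gt;t2.value&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt; } asking about &lt;code&gt;t2.value&lt;/code&gt; yields the constant &lt;code&gt;0&lt;/code&gt;, which is exactly the derived qual that makes the index on &lt;code&gt;t2.value&lt;/code&gt; usable.&lt;/p&gt;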

&lt;h2&gt;
  
  
  Implementing Constraint Exclusion
&lt;/h2&gt;

&lt;p&gt;Now we know enough to start working on the planner.&lt;/p&gt;

&lt;p&gt;As practice, we will implement the "Constraint Exclusion" optimization: examine the provided constraints, find those that are mutually exclusive and remove the relations (tables) they apply to.&lt;/p&gt;

&lt;p&gt;For example, this query can not return any tuple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such a pattern has the following representation in nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvec064lymh3thmst9og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvec064lymh3thmst9og.png" alt="Constraint Exclusion" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BoolExpr&lt;/code&gt; - a boolean expression of the &lt;code&gt;AND&lt;/code&gt; kind, in which&lt;/li&gt;
&lt;li&gt;Both arguments are &lt;code&gt;OpExpr&lt;/code&gt; (operator invocations), having&lt;/li&gt;
&lt;li&gt;Opposite operators and&lt;/li&gt;
&lt;li&gt;Equal (by value) operands:&lt;/li&gt;
&lt;li&gt;Left operand - table column (&lt;code&gt;Var&lt;/code&gt;) and&lt;/li&gt;
&lt;li&gt;Right operand - constant (&lt;code&gt;Const&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we find such a pattern, we can conclude that this relation can be removed - it will not return any data.&lt;/p&gt;

&lt;p&gt;For simplicity we will search ONLY for this pattern - no &lt;code&gt;OR&lt;/code&gt;/&lt;code&gt;NOT&lt;/code&gt;, operands are not swapped, the two operators must be adjacent, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Now we are working in &lt;code&gt;src/backend/optimizer/util/constrexcl.c&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, we create a helper function &lt;code&gt;extract_operands&lt;/code&gt; - it will extract the &lt;code&gt;Var&lt;/code&gt; and the &lt;code&gt;Const&lt;/code&gt; from the provided &lt;code&gt;OpExpr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An operator is also a function, and it also has arguments. They are stored in the member &lt;code&gt;args&lt;/code&gt;, which is a &lt;code&gt;List&lt;/code&gt;. We are looking for a binary operator, so it must have 2 arguments. To get the length of a list, the function &lt;code&gt;list_length&lt;/code&gt; is used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the pattern, the first argument must be a table column (&lt;code&gt;Var&lt;/code&gt;), and the second - a constant (&lt;code&gt;Const&lt;/code&gt;). To check the type of a Node use the macro &lt;code&gt;IsA(node, type)&lt;/code&gt;, and to get the first and last elements of a &lt;code&gt;List&lt;/code&gt; use the &lt;code&gt;linitial&lt;/code&gt; and &lt;code&gt;llast&lt;/code&gt; macros, respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;out_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;out_const&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result, the whole function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;extract_operands&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;out_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;out_const&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;out_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;out_const&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's move on to the implementation of the main logic. It will be located in the function &lt;code&gt;is_mutually_exclusive&lt;/code&gt;. This function accepts 2 &lt;code&gt;OpExpr&lt;/code&gt;s and checks that they match the pattern.&lt;/p&gt;

&lt;p&gt;First, use the previously written function to extract the stored &lt;code&gt;Var&lt;/code&gt; and &lt;code&gt;Const&lt;/code&gt; from both &lt;code&gt;OpExpr&lt;/code&gt;s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;extract_operands&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;extract_operands&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now check that the corresponding operands are equal. To check 2 nodes for equality use the function &lt;code&gt;equal&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This function accepts two &lt;code&gt;void *&lt;/code&gt; arguments, but works only with &lt;code&gt;Node *&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* Table columns */&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* Constants */&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Last step - check the operators, they must be opposite. This can be done using the system catalog table &lt;code&gt;pg_operator&lt;/code&gt;. We need to check the column &lt;code&gt;oprnegate&lt;/code&gt; - it contains the Oid of the opposite operator for the given one.&lt;/p&gt;

&lt;p&gt;We will not use this table directly - the header file &lt;code&gt;utils/lsyscache.h&lt;/code&gt; declares multiple helper functions for working with the system catalog. The function &lt;code&gt;get_negator&lt;/code&gt; is the one we want - it returns the Oid of the opposite operator for the given one.&lt;/p&gt;

&lt;p&gt;As a small performance improvement, do only 1 check, assuming that if the opposite of the left operator is the right operator, then the opposite of the right operator is the left one. That is, they are symmetrical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;get_negator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
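&lt;p&gt;To see what the negator lookup amounts to, here is a standalone sketch with a tiny in-memory table in place of &lt;code&gt;pg_operator&lt;/code&gt;. The Oid values are made up for the example, and &lt;code&gt;OperatorRow&lt;/code&gt; and &lt;code&gt;sketch_get_negator&lt;/code&gt; are hypothetical names; the real &lt;code&gt;get_negator&lt;/code&gt; reads the system catalog instead.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical row: operator Oid, its name and the Oid of its negator. */
typedef struct OperatorRow
{
    Oid         oid;
    const char *oprname;
    Oid         oprnegate;
} OperatorRow;

/* Made-up Oids; only the pairing of each operator with its negator matters. */
static const OperatorRow operator_table[] = {
    {1, "=", 2},
    {2, "<>", 1},
    {3, "<", 6},
    {4, ">", 5},
    {5, "<=", 4},
    {6, ">=", 3}
};

/* Return the negator of the operator, or InvalidOid if there is none. */
static Oid
sketch_get_negator(Oid opno)
{
    size_t      nrows = sizeof(operator_table) / sizeof(operator_table[0]);

    for (size_t i = 0; i < nrows; i++)
    {
        if (operator_table[i].oid == opno)
            return operator_table[i].oprnegate;
    }
    return InvalidOid;
}
```

&lt;p&gt;In this table the negator of &lt;code&gt;&amp;gt;&lt;/code&gt; is &lt;code&gt;&amp;lt;=&lt;/code&gt; and the other way round, so a single comparison is enough for the pair from the example query.&lt;/p&gt;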



&lt;p&gt;As a result, we have the following function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;is_mutually_exclusive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Const&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;extract_operands&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;extract_operands&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_const&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;get_negator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last thing to do is to put this logic in the right place. First, add it to the first stage: query tree preprocessing, which is performed in &lt;code&gt;subquery_planner&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The main preprocessing logic lives in the &lt;code&gt;preprocess_expression&lt;/code&gt; function. It is a generic function that walks through all nodes and preprocesses them (possibly rewriting the query tree).&lt;/p&gt;

&lt;p&gt;In it we are interested in &lt;code&gt;simplify_and_arguments&lt;/code&gt; (&lt;code&gt;src/backend/optimizer/util/clauses.c&lt;/code&gt;) - it is called inside &lt;code&gt;preprocess_expression&lt;/code&gt; and performs optimization of multiple qualifications in &lt;code&gt;AND&lt;/code&gt; expression.&lt;/p&gt;

&lt;p&gt;Why do we need this function? Because it has an out parameter &lt;code&gt;forceFalse&lt;/code&gt;. If it is set to &lt;code&gt;true&lt;/code&gt;, the whole list of qualifications can be replaced with the constant &lt;code&gt;FALSE&lt;/code&gt;. That is, we do not need to implement this ourselves.&lt;/p&gt;

&lt;p&gt;This function works as follows. It takes the list of conditions in &lt;code&gt;AND&lt;/code&gt; and builds a new list by iterating through all elements (performing some logic on each). We can add our logic to the end of each iteration: compare the current qual with the previous one, that is, the last element already appended to the new list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;simplify_and_arguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;eval_const_expressions_context&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;haveNull&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;forceFalse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unprocessed_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
        &lt;span class="cm"&gt;/* Previous and current expressions must be operator invocations */&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;newargs&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newargs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="cm"&gt;/* Check for mutual exclusion */&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_mutually_exclusive&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;llast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newargs&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="cm"&gt;/* Tell the caller to replace all expressions with a single FALSE */&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;forceFalse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;newargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lappend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;newargs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
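
&lt;p&gt;The shape of this loop can be tried outside PostgreSQL with a toy model. The sketch below is an assumption-laden simplification: &lt;code&gt;ToyQual&lt;/code&gt;, &lt;code&gt;toy_is_mutually_exclusive&lt;/code&gt; and &lt;code&gt;toy_simplify_and&lt;/code&gt; are hypothetical names, plain integers stand in for &lt;code&gt;Var&lt;/code&gt; and &lt;code&gt;Const&lt;/code&gt;, and only the &lt;code&gt;&amp;lt;=&lt;/code&gt; / &lt;code&gt;&amp;gt;&lt;/code&gt; pair is modelled.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified qual of the form "column OP constant". */
typedef enum { OP_LE, OP_GT, OP_EQ } ToyOp;

typedef struct
{
    int   column;   /* stand-in for Var */
    ToyOp op;       /* stand-in for opno */
    int   constant; /* stand-in for Const */
} ToyQual;

static bool
toy_is_mutually_exclusive(const ToyQual *left, const ToyQual *right)
{
    if (left->column != right->column || left->constant != right->constant)
        return false;
    /* Only the <= / > pair is modelled in this sketch. */
    return (left->op == OP_LE && right->op == OP_GT) ||
           (left->op == OP_GT && right->op == OP_LE);
}

/*
 * Copy quals into "kept", comparing each qual only with the last kept one.
 * Returns the number of kept quals. On a contradiction it sets *force_false
 * and returns 0, mirroring "return NIL" in the original loop.
 */
static size_t
toy_simplify_and(const ToyQual *quals, size_t nquals,
                 ToyQual *kept, bool *force_false)
{
    size_t nkept = 0;

    assert(quals != NULL && kept != NULL && force_false != NULL);
    *force_false = false;

    for (size_t i = 0; i < nquals; i++)
    {
        if (nkept > 0 && toy_is_mutually_exclusive(&quals[i], &kept[nkept - 1]))
        {
            *force_false = true;
            return 0;
        }
        kept[nkept++] = quals[i];
    }
    return nkept;
}
```

&lt;p&gt;Because only adjacent elements are compared, a contradicting pair separated by another qual is not detected by this approach.&lt;/p&gt;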



&lt;p&gt;Now let's test what we have got. We use the following schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tbl2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;To run the DB and/or psql you can use the script &lt;code&gt;./dev/run.sh --init-db --run-db --psql&lt;/code&gt;&lt;br&gt;
or the VS Code task &lt;code&gt;Run psql&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Test query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline (vanilla PostgreSQL) is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..43.90 rows=11 width=4) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: ((value &amp;gt; 0) AND (value &amp;lt;= 0))
 Planning Time: 0.186 ms
 Execution Time: 0.015 ms
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the query was actually executed with the filter applied. Now rebuild with our changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To build the sources you can use the script &lt;code&gt;./dev/build.sh&lt;/code&gt;, the VS Code task &lt;code&gt;Build&lt;/code&gt;, or the shortcut &lt;code&gt;Ctrl + Shift + B&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Before rerunning, do not forget to stop the DB with the script &lt;code&gt;./dev/run.sh --stop-db&lt;/code&gt; or the VS Code task &lt;code&gt;Stop DB&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And run that query again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=0) (actual time=0.001..0.002 rows=0 loops=1)
   One-Time Filter: false
 Planning Time: 0.033 ms
 Execution Time: 0.013 ms
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This output tells us that the whole result was replaced with an empty set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Result&lt;/code&gt; - node returning ready-made values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;One-Time Filter: false&lt;/code&gt; - a filter that rejects all tuples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's run the debugger and check how our logic works step by step. Set a breakpoint at the start of &lt;code&gt;is_mutually_exclusive&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To attach the debugger (using the configuration files provided in this repository):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Obtain the PID of the backend (if you run it using the scripts in this repo, the PID is shown after psql starts)&lt;/li&gt;
&lt;li&gt;Start a debugging session in VS Code (i.e. press F5)&lt;/li&gt;
&lt;li&gt;Enter the PID of the backend into the pop-up window&lt;/li&gt;
&lt;li&gt;Enter password if necessary&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run our query and wait for the breakpoint to be hit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrhbiw26ikd9s3ma85fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrhbiw26ikd9s3ma85fk.png" alt="First breakpoint" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step into the &lt;code&gt;extract_operands&lt;/code&gt; function and look at what is inside the provided &lt;code&gt;OpExpr&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will use the extension &lt;a href="https://marketplace.visualstudio.com/items?itemName=ash-blade.postgresql-hacker-helper" rel="noopener noreferrer"&gt;&lt;code&gt;PostgreSQL Hacker Helper&lt;/code&gt;&lt;/a&gt; to look inside variables instead of the built-in variables view.&lt;br&gt;
Why? Because this extension knows about Nodes and shows them using their real types.&lt;br&gt;
That is, given a base &lt;code&gt;Node *&lt;/code&gt;, it checks its &lt;code&gt;type&lt;/code&gt;, casts it to the appropriate type and shows that variable.&lt;br&gt;
On the next screenshot you can see how it reveals the real Nodes inside the provided &lt;code&gt;List&lt;/code&gt;. Without this extension you would have to enter numerous expressions manually in the &lt;code&gt;Watch&lt;/code&gt; tab.&lt;/p&gt;
&lt;/blockquote&gt;
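
&lt;p&gt;The idea of inspecting a tag and then casting can be sketched in a few lines of plain C. Everything here is a hypothetical, trimmed-down model (&lt;code&gt;ToyTag&lt;/code&gt;, &lt;code&gt;ToyNode&lt;/code&gt;, &lt;code&gt;ToyVar&lt;/code&gt;, &lt;code&gt;ToyConst&lt;/code&gt;, &lt;code&gt;toy_is_a&lt;/code&gt;, &lt;code&gt;toy_as_var&lt;/code&gt;), not PostgreSQL's actual definitions and not the extension's implementation. It only illustrates why a tag stored in the first member lets a tool recover the real type behind a generic pointer.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags; PostgreSQL's real NodeTag enum is much larger. */
typedef enum { T_ToyVar, T_ToyConst } ToyTag;

/* Every node starts with its tag, so any node can be viewed as a ToyNode. */
typedef struct
{
    ToyTag type;
} ToyNode;

typedef struct
{
    ToyTag type;        /* always T_ToyVar */
    int    varattno;    /* column number */
} ToyVar;

typedef struct
{
    ToyTag type;        /* always T_ToyConst */
    long   value;       /* constant value */
} ToyConst;

/* Same idea as IsA(): look only at the tag stored in the first member. */
#define toy_is_a(node, tag) (((const ToyNode *) (node))->type == (tag))

/* Checked downcast: returns the node as ToyVar, or NULL if it is not one. */
static const ToyVar *
toy_as_var(const void *node)
{
    assert(node != NULL);
    return toy_is_a(node, T_ToyVar) ? (const ToyVar *) node : NULL;
}
```

&lt;p&gt;Given a pointer of unknown type, &lt;code&gt;toy_as_var&lt;/code&gt; returns a usable &lt;code&gt;ToyVar&lt;/code&gt; pointer only when the tag matches. That is the same decision you would otherwise make by hand in the &lt;code&gt;Watch&lt;/code&gt; tab.&lt;/p&gt;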

&lt;p&gt;Expand its contents: there are multiple members, but we need to check what is inside &lt;code&gt;args&lt;/code&gt;, the arguments of the operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp871gasgn4g67zgrk1qv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp871gasgn4g67zgrk1qv.png" alt="Arguments of operator invocation" width="712" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see 2 elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;Var&lt;/code&gt; - table column&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Const&lt;/code&gt; - constant&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these arguments the function should succeed: return &lt;code&gt;true&lt;/code&gt; and set the out parameters. Keep pressing F10 until we reach &lt;code&gt;return&lt;/code&gt;. Yes, the function works as expected.&lt;/p&gt;

&lt;p&gt;Next, step out of the function and see what is inside the other &lt;code&gt;OpExpr&lt;/code&gt;, the variable &lt;code&gt;right&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n1rh42pubv2njlzgix4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n1rh42pubv2njlzgix4.png" alt="right variable data" width="616" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see the same picture: 2 arguments, &lt;code&gt;Var&lt;/code&gt; and &lt;code&gt;Const&lt;/code&gt;. This time we step over &lt;code&gt;extract_operands&lt;/code&gt;, since it should work correctly, and stop at the first &lt;code&gt;equal&lt;/code&gt; invocation, the one comparing the &lt;code&gt;Var&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;Let's manually compare the contents of both &lt;code&gt;Var&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwscjayfhmcafqg2raby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwscjayfhmcafqg2raby.png" alt="Var of left and right expressions" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only member that differs is &lt;code&gt;location&lt;/code&gt;. That is expected, because each &lt;code&gt;Var&lt;/code&gt; comes from a different part of the query text, and &lt;code&gt;equal&lt;/code&gt; does not take location fields into account. Step over, and we reach the next &lt;code&gt;equal&lt;/code&gt;, which means both variables point to the same table column.&lt;/p&gt;
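
&lt;p&gt;Why can two nodes with different &lt;code&gt;location&lt;/code&gt; values still compare as equal? A minimal sketch of such a comparison follows. &lt;code&gt;ToyVar&lt;/code&gt; and &lt;code&gt;toy_var_equal&lt;/code&gt; are hypothetical names for a trimmed-down model. The real &lt;code&gt;Var&lt;/code&gt; has more members and the real comparison is done by &lt;code&gt;equal&lt;/code&gt;, which skips location fields.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down Var; the real struct has more members. */
typedef struct
{
    int varno;      /* index of the relation in the range table */
    int varattno;   /* attribute (column) number */
    int location;   /* token position in the query text */
} ToyVar;

/* Compares what the Var refers to, deliberately skipping "location". */
static bool
toy_var_equal(const ToyVar *a, const ToyVar *b)
{
    assert(a != NULL && b != NULL);
    return a->varno == b->varno && a->varattno == b->varattno;
}
```

&lt;p&gt;Two references to the same column written at different positions in the query text are therefore treated as the same variable.&lt;/p&gt;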

&lt;p&gt;Let's compare constants manually again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28wbbeqv0f5e33d48fp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28wbbeqv0f5e33d48fp4.png" alt="Const of left and right expressions" width="799" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step over, and we reach the last step: checking the operator Oids. That means the constants are also equal.&lt;/p&gt;

&lt;p&gt;Take a look at the Oids of the operators, the &lt;code&gt;opno&lt;/code&gt; member of &lt;code&gt;OpExpr&lt;/code&gt;. The right &lt;code&gt;opno&lt;/code&gt; is 521.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t2ksvyjxtwigthpqplo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t2ksvyjxtwigthpqplo.png" alt="Oid of right operator" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set a breakpoint at &lt;code&gt;return&lt;/code&gt; in the function &lt;code&gt;get_negator&lt;/code&gt; and continue execution (F5). When this breakpoint is hit, we can see that the negator of the left operator is 521, the same Oid as the right operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsgi5drhumt2m6h7tq8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsgi5drhumt2m6h7tq8h.png" alt="Oid of negator of left operator" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our logic lives in the preprocessing stage. That is a significant drawback. Why? Because it handles only simple cases. The following query won't be optimized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tbl2&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; 
       &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both qualifications are present here too, so we can tell that this query returns nothing, but our code cannot.&lt;/p&gt;

&lt;p&gt;To fix this, we will move our code to the next stage: planner optimizations.&lt;/p&gt;

&lt;p&gt;As you may remember, this stage starts in &lt;code&gt;query_planner&lt;/code&gt;, because the members of &lt;code&gt;PlannerInfo&lt;/code&gt; are initialized there. It is a huge structure, but all we need is &lt;code&gt;simple_rel_array&lt;/code&gt;, an array of &lt;code&gt;RelOptInfo&lt;/code&gt; (the structure representing a relation/table).&lt;/p&gt;

&lt;p&gt;The new logic works similarly to the preprocessing stage: iterate through all qualifications and, when mutually exclusive ones are found, replace the whole list with a single &lt;code&gt;FALSE&lt;/code&gt; (this time we do it ourselves).&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;RelOptInfo&lt;/code&gt; we need the member &lt;code&gt;baserestrictinfo&lt;/code&gt;, the list of qualifications imposed on the relation. If a JOIN has qualifications that relate to only 1 relation, they end up in the same list as the &lt;code&gt;WHERE&lt;/code&gt; qualifications.&lt;/p&gt;

&lt;p&gt;We will iterate over all &lt;code&gt;RelOptInfo&lt;/code&gt;s and over all &lt;code&gt;RestrictInfo&lt;/code&gt;s in &lt;code&gt;baserestrictinfo&lt;/code&gt;. When we find 2 adjacent mutually exclusive expressions, we replace the whole list of qualifications.&lt;/p&gt;
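
&lt;p&gt;The reason the second stage sees more than the first one can be illustrated with a toy model of per-relation qual lists. This is only a sketch under loud assumptions: &lt;code&gt;ToyQual&lt;/code&gt;, &lt;code&gt;ToyRel&lt;/code&gt; and &lt;code&gt;toy_distribute_quals&lt;/code&gt; are hypothetical names, every qual references exactly one relation, and the join is assumed to be an inner join. It does not reproduce the planner's real algorithm.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_QUALS 8

/* Where the qual was written in the query text. */
typedef enum { FROM_JOIN_ON, FROM_WHERE } ToyOrigin;

typedef struct
{
    ToyOrigin origin;   /* syntactic origin of the qual */
    int       relid;    /* the only relation the qual references */
} ToyQual;

/* Toy stand-in for RelOptInfo with its baserestrictinfo list. */
typedef struct
{
    const ToyQual *quals[TOY_MAX_QUALS];
    size_t         nquals;
} ToyRel;

/* Attach each single-relation qual to its relation, ignoring its origin. */
static void
toy_distribute_quals(const ToyQual *quals, size_t nquals,
                     ToyRel *rels, size_t nrels)
{
    for (size_t i = 0; i < nquals; i++)
    {
        ToyRel *rel;

        assert(quals[i].relid >= 0 && (size_t) quals[i].relid < nrels);
        rel = &rels[quals[i].relid];
        assert(rel->nquals < TOY_MAX_QUALS);
        rel->quals[rel->nquals++] = &quals[i];
    }
}
```

&lt;p&gt;In the failing query above, &lt;code&gt;t1.value &amp;gt; 0&lt;/code&gt; comes from &lt;code&gt;ON&lt;/code&gt; and &lt;code&gt;t1.value &amp;lt;= 0&lt;/code&gt; comes from &lt;code&gt;WHERE&lt;/code&gt;, yet in this model both land next to each other in the list of the same relation, where an adjacency check can find them.&lt;/p&gt;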

&lt;p&gt;This logic will live in a single function that we will add to &lt;code&gt;query_planner&lt;/code&gt;. Its signature is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;root&lt;/code&gt; - current query context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rel&lt;/code&gt; - the relation we are exploring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, check that we have at least 2 qualifications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now comes the main logic: finding 2 adjacent mutually exclusive &lt;code&gt;OpExpr&lt;/code&gt;s. We store the previous qualification in a separate variable, initialize it with the first element of the &lt;code&gt;baserestrictinfo&lt;/code&gt; list, and start iterating from the 2nd element, using the &lt;code&gt;linitial&lt;/code&gt; and &lt;code&gt;for_each_from&lt;/code&gt; macros respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ListCell&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/* Read first element */&lt;/span&gt;
    &lt;span class="n"&gt;prev_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Start iterating from the 2nd element */&lt;/span&gt;
    &lt;span class="n"&gt;for_each_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* Read next element */&lt;/span&gt;
        &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we get the next element, check that the previous and current qualifications are both &lt;code&gt;OpExpr&lt;/code&gt; (using &lt;code&gt;IsA&lt;/code&gt;) and, if so, check that they are mutually exclusive (using &lt;code&gt;is_mutually_exclusive&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Both qualifications are operator calls */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="cm"&gt;/* Qualifications are mutually exclusive */&lt;/span&gt;
        &lt;span class="n"&gt;is_mutually_exclusive&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the check succeeds, replace the whole list with a single element, the constant &lt;code&gt;FALSE&lt;/code&gt;, and we are done.&lt;/p&gt;

&lt;p&gt;This can be done using 3 separate functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;makeBoolConst&lt;/code&gt; - create &lt;code&gt;Const&lt;/code&gt;, SQL boolean constant&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;make_restrictinfo&lt;/code&gt; - create &lt;code&gt;RestrictInfo&lt;/code&gt; with previously created boolean constant (it has multiple flags - we don't need them, so they are all &lt;code&gt;false&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;list_make1&lt;/code&gt; - create &lt;code&gt;List&lt;/code&gt; with 1 element
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;

    &lt;span class="cm"&gt;/* Create qualification */&lt;/span&gt;
    &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;false_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_restrictinfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="cm"&gt;/* Create constant `FALSE` */&lt;/span&gt;
                                                  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Expr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;makeBoolConst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                  &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="cm"&gt;/* Create single element List */&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_make1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;false_rinfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the check fails, just update the previous &lt;code&gt;RestrictInfo&lt;/code&gt; variable and move on to the next pair.&lt;/p&gt;

&lt;p&gt;The whole function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ListCell&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;prev_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;for_each_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;lfirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;IsA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="n"&gt;is_mutually_exclusive&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;cur_rinfo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;false_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_restrictinfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Expr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;makeBoolConst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                          &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_make1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;false_rinfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;prev_rinfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur_rinfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
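
&lt;p&gt;To see what this loop does and does not catch, here is a minimal standalone sketch. It is not PostgreSQL code: &lt;code&gt;DemoQual&lt;/code&gt;, &lt;code&gt;DemoOp&lt;/code&gt; and both functions are hypothetical stand-ins, and every qualification is assumed to be "value OP constant" on the same column.&lt;/p&gt;

```c
/* Hypothetical, simplified model: each qual is "value OP constant". */
typedef enum DemoOp
{
    DEMO_GT,    /* value greater than constant */
    DEMO_LE     /* value less than or equal to constant */
} DemoOp;

typedef struct DemoQual
{
    DemoOp op;
    int    constant;
} DemoQual;

/*
 * Toy stand-in for is_mutually_exclusive: with only two operators,
 * the same constant with different operators is a contradiction.
 */
static int
demo_is_mutually_exclusive(DemoQual a, DemoQual b)
{
    if (a.constant != b.constant)
        return 0;
    return a.op != b.op;
}

/*
 * Compare each qual with its immediate predecessor, like the
 * prev_rinfo / cur_rinfo pair above. Returns 1 when a contradicting
 * pair is found (the real code would then install a single FALSE).
 */
int
demo_has_adjacent_contradiction(const DemoQual *quals, int nquals)
{
    for (int i = 1; nquals > i; i++)
    {
        if (demo_is_mutually_exclusive(quals[i - 1], quals[i]))
            return 1;
    }
    return 0;
}
```

&lt;p&gt;In this toy model a list with a third qual sitting between the two contradicting ones is not collapsed, because only neighbours are compared. The loop above appears to share this property, which is fine for our test query.&lt;/p&gt;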



&lt;p&gt;We wrote a function for a single relation, but a query can have multiple of them, stored in the &lt;code&gt;simple_rel_array&lt;/code&gt; member of &lt;code&gt;PlannerInfo&lt;/code&gt;. We will write another function, &lt;code&gt;collapse_mutually_exclusive_quals&lt;/code&gt;, for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; 
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to traverse the entire array and apply the logic to each element. But even here it's not that simple. First remark - array indexing starts at 1, not 0. This is because indexes of this array serve as IDs of range table entries in the range table (&lt;code&gt;Query-&amp;gt;rtable&lt;/code&gt;). Element 0 is always &lt;code&gt;NULL&lt;/code&gt;. So the loop looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; 
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
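
&lt;p&gt;The 1-based indexing can be illustrated with a small standalone sketch. &lt;code&gt;DemoRte&lt;/code&gt; and both functions are hypothetical names, not PostgreSQL API, and the sketch assumes the per-relation array has one slot more than the range table has entries, which is what the loop bound above suggests.&lt;/p&gt;

```c
/* Hypothetical stand-in for a range table entry. */
typedef struct DemoRte
{
    int table_id;
} DemoRte;

/*
 * Range table indexes are 1-based, so index N lives at C offset N - 1.
 * Returns -1 for an index outside 1 .. rtable_len.
 */
int
demo_table_id_for_index(const DemoRte *rtable, int rtable_len, int rt_index)
{
    if (1 > rt_index || rt_index > rtable_len)
        return -1;
    return rtable[rt_index - 1].table_id;
}

/*
 * An array indexed directly by range table index needs one extra slot,
 * because slot 0 is never used.
 */
int
demo_rel_array_size(int rtable_len)
{
    return rtable_len + 1;
}
```

&lt;p&gt;With a range table of two entries, valid indexes are 1 and 2, and the matching array has 3 slots, so a loop from 1 up to (but not including) the array size visits exactly those two.&lt;/p&gt;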



&lt;p&gt;Second remark - type of relation. As you may remember, &lt;code&gt;RelOptInfo&lt;/code&gt; can be created for tables, JOINs, functions, etc. We need to check this.&lt;/p&gt;

&lt;p&gt;Initially, we designed the logic to work with tables, but we do not need table-specific features (like storage format) - all we need is the returned attributes (columns); we do not even need the table name.&lt;/p&gt;

&lt;p&gt;If you remember, such relations are called &lt;code&gt;base rel&lt;/code&gt;. So from now on we work with base relations and check that &lt;code&gt;RelOptInfo&lt;/code&gt; is a base relation. This info is stored in the &lt;code&gt;reloptkind&lt;/code&gt; member.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; 
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reloptkind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RELOPT_BASEREL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Business logic is implemented, now it's time to integrate it into planner. We:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the &lt;code&gt;baserestrictinfo&lt;/code&gt; member&lt;/li&gt;
&lt;li&gt;That is stored in &lt;code&gt;RelOptInfo&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Which is stored in the &lt;code&gt;simple_rel_array&lt;/code&gt; member&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So our code must run after &lt;code&gt;simple_rel_array&lt;/code&gt; has been initialized. If we study the code a little, we find that this is done in the &lt;code&gt;add_base_rels_to_query&lt;/code&gt; function, so we add our code right after that function call (in &lt;code&gt;src/backend/optimizer/plan/planmain.c&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;query_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;query_pathkeys_callback&lt;/span&gt; &lt;span class="n"&gt;qp_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;qp_extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="n"&gt;add_base_rels_to_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;jointree&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Constraint Exclusion */&lt;/span&gt;
    &lt;span class="n"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Do not forget to remove our code from the first stage (query tree preprocessing) in the &lt;code&gt;clause.c&lt;/code&gt; file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stop old DB, rebuild, run psql and execute our query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..43.90 rows=11 width=4) (actual time=0.007..0.008 rows=0 loops=1)
   Filter: ((value &amp;gt; 0) AND (value &amp;lt;= 0))
 Planning Time: 0.042 ms
 Execution Time: 0.020 ms
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our patch did not work: the planner still uses a table scan. Let's run the debugger and see what happened. Set a breakpoint at the start of the &lt;code&gt;collapse_mutually_exclusive_quals&lt;/code&gt; function and run the query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6h95yl1707hqodhu09h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6h95yl1707hqodhu09h.png" alt="Breakpoint" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make some steps (stepping into other functions) and we find that the length check for the &lt;code&gt;baserestrictinfo&lt;/code&gt; list did not pass - the planner thinks the list is empty (&lt;code&gt;NULL&lt;/code&gt; means an empty list).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n3dnnzggew52kzpx8yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n3dnnzggew52kzpx8yk.png" alt="Empty baserestrictinfo list" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason is that most of the main data structures (such as &lt;code&gt;PlannerInfo&lt;/code&gt; or &lt;code&gt;RelOptInfo&lt;/code&gt;) are huge, so they are initialized on the fly as planning proceeds. There is no single function doing that - initialization is split between multiple functions.&lt;br&gt;
We assumed that once &lt;code&gt;add_base_rels_to_query&lt;/code&gt; has created each &lt;code&gt;RelOptInfo&lt;/code&gt;, its &lt;code&gt;baserestrictinfo&lt;/code&gt; is initialized too. But that is not true in this case - each &lt;code&gt;RelOptInfo&lt;/code&gt; is not fully initialized, only its basic properties are set.&lt;/p&gt;
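
&lt;p&gt;The effect of such staged initialization can be shown with a tiny standalone sketch. Everything here is hypothetical (&lt;code&gt;StagedRel&lt;/code&gt; and the three functions are not PostgreSQL API); the two stages only play the roles that the functions discussed in this section have in the planner.&lt;/p&gt;

```c
/* Hypothetical stand-in for RelOptInfo: only the fields needed here. */
typedef struct StagedRel
{
    int relid;      /* set by stage 1 */
    int nquals;     /* set by stage 2, stays 0 until then */
} StagedRel;

/* Stage 1: creates the relation with basic properties only. */
void
staged_build_rel(StagedRel *rel, int relid)
{
    rel->relid = relid;
    rel->nquals = 0;
}

/* Stage 2: a later step attaches the qualifications. */
void
staged_distribute_quals(StagedRel *rel, int nquals)
{
    rel->nquals = nquals;
}

/* Our check has something to work with only after stage 2 has run. */
int
staged_can_collapse(const StagedRel *rel)
{
    return rel->nquals >= 2;
}
```

&lt;p&gt;Calling &lt;code&gt;staged_can_collapse&lt;/code&gt; right after stage 1 always reports nothing to do, which mirrors what we observed in the debugger.&lt;/p&gt;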

&lt;p&gt;So this bug is fixed by finding the place where the &lt;code&gt;baserestrictinfo&lt;/code&gt; list is initialized and available for use. It is &lt;code&gt;deconstruct_jointree&lt;/code&gt; - add our code right after this function invocation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;query_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;query_pathkeys_callback&lt;/span&gt; &lt;span class="n"&gt;qp_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;qp_extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
    &lt;span class="n"&gt;joinlist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deconstruct_jointree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/* Move our code invocation down */&lt;/span&gt;
    &lt;span class="n"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop DB, rebuild and run query again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=0) (actual time=0.001..0.002 rows=0 loops=1)
   One-Time Filter: false
 Planning Time: 0.125 ms
 Execution Time: 0.012 ms
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Constraint Exclusion logic was successfully applied to the test query. Now let's test it on a new case, with JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tbl2&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; 
   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running this query, we get the following result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# EXPLAIN ANALYZE 
SELECT * FROM tbl1 t1 
JOIN tbl2 t2 
   ON t1.value &amp;gt; 0 
WHERE t1.value &amp;lt;= 0;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something went wrong: the backend crashed and we cannot send queries anymore. The log file does not contain enough information to find the cause of the error, so we will debug it.&lt;/p&gt;

&lt;p&gt;Start psql, attach the debugger, set a breakpoint on the &lt;code&gt;collapse_mutually_exclusive_quals&lt;/code&gt; function and run the query. When the breakpoint hits, make some steps in the &lt;code&gt;for&lt;/code&gt; loop and stop at the &lt;code&gt;if&lt;/code&gt; statement with the &lt;code&gt;reloptkind&lt;/code&gt; check. Make another step and we stop at the &lt;code&gt;collapse_mutually_exclusive_quals_for_rel&lt;/code&gt; function invocation, which means the given &lt;code&gt;rel&lt;/code&gt; is a base rel - &lt;code&gt;t1&lt;/code&gt; or &lt;code&gt;t2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex0b50tn0xlru7crhyof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex0b50tn0xlru7crhyof.png" alt="Check RelOptInfo kind" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step into it and we stop at the &lt;code&gt;baserestrictinfo&lt;/code&gt; list length check. If we look at this list we see 2 &lt;code&gt;RestrictInfo&lt;/code&gt;s: both have an &lt;code&gt;OpExpr&lt;/code&gt; with a &lt;code&gt;Var&lt;/code&gt; and a &lt;code&gt;Const&lt;/code&gt; on the left and right side respectively. That means the planner moved the condition from &lt;code&gt;JOIN&lt;/code&gt; into the &lt;code&gt;WHERE&lt;/code&gt; list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7di52x58krx5etjb5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7di52x58krx5etjb5g.png" alt="Checking baserestrictinfo list length" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, after making some steps, we test the &lt;code&gt;RestrictInfo&lt;/code&gt;s for mutual exclusion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcidrswbdbg11b624m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcidrswbdbg11b624m4.png" alt="Check mutually exclusive" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test passed, and now we create a list with the single constant &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qiju41tkuhnx5hrzj2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qiju41tkuhnx5hrzj2y.png" alt="Creating list of single FALSE" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that we return to the parent function and move to the next relation. It also passes the "base rel" check and &lt;code&gt;collapse_mutually_exclusive_quals_for_rel&lt;/code&gt; is called. But this time the &lt;code&gt;baserestrictinfo&lt;/code&gt; list is empty - this must be the &lt;code&gt;t2&lt;/code&gt; table, because it does not have any qualifications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf9x6spdl0c81ocx9emv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf9x6spdl0c81ocx9emv.png" alt="Empty baserestrictinfo list" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We return to the parent again and move to the next (last) &lt;code&gt;RelOptInfo&lt;/code&gt;. When we step into the &lt;code&gt;if&lt;/code&gt; statement we get a SEGFAULT. If we look more closely, we see that &lt;code&gt;rel&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao7sprr749z8cuen94wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao7sprr749z8cuen94wd.png" alt="rel is NULL" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not extraordinary - it is a normal situation. Why? Because we did not read the comments for &lt;code&gt;simple_rel_array&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PlannerInfo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/*
     * simple_rel_array holds pointers to "base rels" and "other rels" (see
     * comments for RelOptInfo for more info).  It is indexed by rangetable
     * index (so entry 0 is always wasted).  Entries can be NULL when an RTE
     * does not correspond to a base relation, such as a join RTE or an
     * unreferenced view RTE; or if the RelOptInfo hasn't been made yet.
     */&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array&lt;/span&gt; &lt;span class="n"&gt;pg_node_attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array_size&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take a closer look at the third sentence: "Entries can be NULL when an RTE does not correspond to a base relation ...". So, if our &lt;code&gt;rel&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, it is not a base relation. What is it then? Look at the corresponding RTE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpemzi4wuufbek67r4te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpemzi4wuufbek67r4te.png" alt="Real rel type" width="736" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was a JOIN, which is why there was no &lt;code&gt;RelOptInfo&lt;/code&gt; pointer in that slot.&lt;/p&gt;

&lt;p&gt;Fix this bug by checking that &lt;code&gt;rel&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; 
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;simple_rel_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reloptkind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RELOPT_BASEREL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also add a small &lt;code&gt;Assert&lt;/code&gt;, just to be sure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt;
&lt;span class="nf"&gt;collapse_mutually_exclusive_quals_for_rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PlannerInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RelOptInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ListCell&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;RestrictInfo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prev_rinfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;baserestrictinfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cm"&gt;/* ... */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop the DB, rebuild, and run the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=24) (actual time=0.002..0.002 rows=0 loops=1)
   One-Time Filter: false
 Planning Time: 0.046 ms
 Execution Time: 0.012 ms
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bug is fixed and the query executes successfully.&lt;/p&gt;

&lt;p&gt;But it's worth noting that deleting such information (the list of predicates) is not a good idea: we throw away known information and waste resources trying to apply this optimization.&lt;/p&gt;

&lt;p&gt;The best way to do this is to create a new &lt;code&gt;Path&lt;/code&gt; (a &lt;code&gt;Result&lt;/code&gt; with empty output) when we start generating all possible paths. And this is what the real PostgreSQL implementation does. When the GUC setting &lt;code&gt;constraint_exclusion&lt;/code&gt; is set to &lt;code&gt;on&lt;/code&gt; (the default is &lt;code&gt;partition&lt;/code&gt;, which is used to discard unnecessary partitions), PostgreSQL starts looking for such mutually exclusive qualifications.&lt;/p&gt;

&lt;p&gt;If we dig deeper, we will see that the constraint exclusion logic is applied in the function &lt;code&gt;relation_excluded_by_constraints&lt;/code&gt; (a self-explanatory name), which is called inside &lt;code&gt;set_rel_size&lt;/code&gt; (when computing sizes for each relation to find the best path). This function has somewhat complicated logic, but in short it does what we have done: it traverses all predicates and finds the ones that are mutually exclusive (with O(n^2) complexity compared to O(n) in our implementation). It also tries to find mutually exclusive predicates by all possible means: it tries to commute operators, evaluates immutable functions (a real function invocation), and so on.&lt;/p&gt;
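
&lt;p&gt;To make the complexity difference concrete, here is a minimal, self-contained sketch. It is a toy model, not PostgreSQL code: &lt;code&gt;ToyPred&lt;/code&gt;, &lt;code&gt;toy_negator&lt;/code&gt;, &lt;code&gt;toy_refutes&lt;/code&gt; and both scan functions are hypothetical names, a predicate is reduced to "column op constant", and the linear scan is assumed to compare each predicate only with the previous one.&lt;/p&gt;

```c
/* Toy model, NOT PostgreSQL code: a predicate has the shape "column op constant". */
typedef enum ToyOp
{
    TOY_EQ,                     /* column is equal to the constant */
    TOY_NE                      /* column is not equal to the constant */
} ToyOp;

typedef struct ToyPred
{
    int         column;         /* hypothetical column number */
    ToyOp       op;             /* comparison operator */
    int         value;          /* constant on the right-hand side */
} ToyPred;

/* The negator of an operator is true exactly when the operator is false. */
static ToyOp
toy_negator(ToyOp op)
{
    return (op == TOY_EQ) ? TOY_NE : TOY_EQ;
}

/* "x op1 y" and "x op2 y" exclude each other when op2 is the negator of op1. */
static int
toy_refutes(const ToyPred *a, const ToyPred *b)
{
    if (a->column != b->column || a->value != b->value)
        return 0;
    return toy_negator(a->op) == b->op;
}

/* O(n): compare only neighbouring predicates (assumed linear variant). */
static int
toy_excluded_adjacent(const ToyPred *preds, int n)
{
    int         i;

    for (i = n - 1; i > 0; i--)
    {
        if (toy_refutes(preds + i - 1, preds + i))
            return 1;
    }
    return 0;
}

/* O(n^2): compare every pair, so the order of predicates does not matter. */
static int
toy_excluded_pairwise(const ToyPred *preds, int n)
{
    int         i;
    int         j;

    for (i = n - 1; i > 0; i--)
    {
        for (j = i - 1; j >= 0; j--)
        {
            if (toy_refutes(preds + j, preds + i))
                return 1;
        }
    }
    return 0;
}
```

&lt;p&gt;For a condition like &lt;code&gt;a = 5 AND b = 7 AND a &lt;&gt; 5&lt;/code&gt; the contradicting predicates are not neighbours, so in this sketch only the pairwise scan detects the contradiction. That is the price of the cheaper linear pass.&lt;/p&gt;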

&lt;p&gt;I found this piece of code that finds mutually exclusive operators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;span class="nf"&gt;operator_predicate_proof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Expr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;refute_it&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;weak&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Oid&lt;/span&gt; &lt;span class="n"&gt;pred_collation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;clause_collation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Oid&lt;/span&gt; &lt;span class="n"&gt;pred_op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;clause_op&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;test_op&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pred_leftop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pred_rightop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clause_leftop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clause_rightop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * Both expressions must be binary opclauses, else we can't do anything.
     *
     * Note: in future we might extend this logic to other operator-based
     * constructs such as DistinctExpr.  But the planner isn't very smart
     * about DistinctExpr in general, and this probably isn't the first place
     * to fix if you want to improve that.
     */&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;is_opclause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pred_opexpr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;is_opclause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;clause_opexpr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpExpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * If they're marked with different collations then we can't do anything.
     * This is a cheap test so let's get it out of the way early.
     */&lt;/span&gt;
    &lt;span class="n"&gt;pred_collation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;inputcollid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;clause_collation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;inputcollid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_collation&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;clause_collation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/* Grab the operator OIDs now too.  We may commute these below. */&lt;/span&gt;
    &lt;span class="n"&gt;pred_op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;clause_op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;opno&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * We have to match up at least one pair of input expressions.
     */&lt;/span&gt;
    &lt;span class="n"&gt;pred_leftop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pred_rightop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;clause_leftop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;linitial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;clause_rightop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;lsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause_opexpr&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_leftop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_leftop&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_rightop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_rightop&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="cm"&gt;/* We have x op1 y and x op2 y */&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;get_negator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;clause_op&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="cm"&gt;/* Omitted */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, this code is very similar to ours, except for the collation check. But the most interesting part is the omitted one at the end. That's what I was talking about - all the possible ways to find such operators: checking for a different order of operands, testing for commutative operators, checking for constants or constant functions, operator evaluation, etc.&lt;/p&gt;
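
&lt;p&gt;As an illustration of the "different order of operands" case, here is a minimal, self-contained sketch. It is a toy model and not the omitted PostgreSQL code: &lt;code&gt;ToyClause&lt;/code&gt;, &lt;code&gt;toy_commutator&lt;/code&gt;, &lt;code&gt;toy_negator&lt;/code&gt; and &lt;code&gt;toy_refutes&lt;/code&gt; are hypothetical names, and operands are plain integer ids instead of expression trees. The idea is to commute one clause so that both read "x op y", and only then apply the negator test.&lt;/p&gt;

```c
/* Toy model, NOT PostgreSQL code: operands are plain integer ids. */
typedef enum ToyOp
{
    TOY_LT, TOY_LE, TOY_GT, TOY_GE, TOY_EQ, TOY_NE
} ToyOp;

typedef struct ToyClause
{
    int         left;           /* id of the left operand */
    ToyOp       op;             /* comparison operator */
    int         right;          /* id of the right operand */
} ToyClause;

/* "x op y" is equivalent to "y commutator(op) x". */
static ToyOp
toy_commutator(ToyOp op)
{
    switch (op)
    {
        case TOY_LT: return TOY_GT;
        case TOY_LE: return TOY_GE;
        case TOY_GT: return TOY_LT;
        case TOY_GE: return TOY_LE;
        default:     return op;     /* EQ and NE commute with themselves */
    }
}

/* negator(op) is true exactly when op is false. */
static ToyOp
toy_negator(ToyOp op)
{
    switch (op)
    {
        case TOY_LT: return TOY_GE;
        case TOY_LE: return TOY_GT;
        case TOY_GT: return TOY_LE;
        case TOY_GE: return TOY_LT;
        case TOY_EQ: return TOY_NE;
        default:     return TOY_EQ;
    }
}

/* Returns 1 when the two clauses can never both be true, else 0. */
static int
toy_refutes(ToyClause pred, ToyClause clause)
{
    /* Operands do not line up: flip the clause and commute its operator. */
    if (pred.left != clause.left || pred.right != clause.right)
    {
        ToyClause   flipped;

        flipped.left = clause.right;
        flipped.op = toy_commutator(clause.op);
        flipped.right = clause.left;
        clause = flipped;
    }

    /* Still no match: the clauses talk about different operands. */
    if (pred.left != clause.left || pred.right != clause.right)
        return 0;

    /* Now we have "x op1 y" and "x op2 y". */
    return toy_negator(pred.op) == clause.op;
}
```

&lt;p&gt;With operand ids 1 and 2, the clause "2 LE 1" is flipped into "1 GE 2", which is the negator form of "1 LT 2", so the pair is reported as mutually exclusive. In PostgreSQL itself the commutator and negator of an operator come from the system catalog (the snippet above already uses &lt;code&gt;get_negator&lt;/code&gt;), not from a hard-coded table like this one.&lt;/p&gt;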

&lt;h2&gt;
  
  
  Lifehacks
&lt;/h2&gt;

&lt;p&gt;In this last part I will share some lifehacks for planner development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disable password prompt
&lt;/h3&gt;

&lt;p&gt;When we attach a debugger to a backend, we are prompted for a password. This is a security feature, but it is not very convenient during development.&lt;/p&gt;

&lt;p&gt;You can disable this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using configuration file &lt;code&gt;/etc/sysctl.d/10-ptrace.conf&lt;/code&gt; - set &lt;code&gt;kernel.yama.ptrace_scope = 0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;By writing &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;/proc/sys/kernel/yama/ptrace_scope&lt;/code&gt; - &lt;code&gt;echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should combine them: apply the second method to disable the password prompt immediately, and the first to keep the change across reboots. Otherwise you must repeat the second method after each reboot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Display backend PID
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;psql&lt;/code&gt; has its own configuration file - &lt;code&gt;.psqlrc&lt;/code&gt;. It contains commands to be executed when &lt;code&gt;psql&lt;/code&gt; starts.&lt;/p&gt;

&lt;p&gt;We need to know the PID of the backend to attach the debugger to it. So this is a good place to insert our &lt;code&gt;SELECT pg_backend_pid();&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, set the environment variable &lt;code&gt;PSQLRC&lt;/code&gt; to the path of our &lt;code&gt;.psqlrc&lt;/code&gt; - &lt;code&gt;PSQLRC="/home/user/project/build/.psqlrc"&lt;/code&gt; (do not forget to &lt;code&gt;export&lt;/code&gt; it so that &lt;code&gt;psql&lt;/code&gt; can see it).&lt;/p&gt;

&lt;p&gt;After that, the backend PID will be displayed when &lt;code&gt;psql&lt;/code&gt; starts.&lt;/p&gt;

&lt;p&gt;TIP: using the &lt;code&gt;psql&lt;/code&gt; feature that writes output to an external file, you can save the PID in a file for later use. For example, you can integrate this into VS Code to automatically attach to the backend when pressing F5 (start debugging).&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL debugging tools
&lt;/h3&gt;

&lt;p&gt;Of course, PostgreSQL has its own debugging facilities. Most of them are related to displaying Nodes.&lt;/p&gt;

&lt;p&gt;First, the header file &lt;code&gt;print.h&lt;/code&gt; declares many functions that output Node structures to the log in a custom format (not JSON/YAML).&lt;/p&gt;

&lt;p&gt;The most basic function is &lt;code&gt;pprint&lt;/code&gt;. It prints any provided Node to &lt;code&gt;stdout&lt;/code&gt;. The other functions are specialized - &lt;code&gt;print_expr&lt;/code&gt;, &lt;code&gt;print_pathkeys&lt;/code&gt;, &lt;code&gt;print_tl&lt;/code&gt;, &lt;code&gt;print_slot&lt;/code&gt;, &lt;code&gt;print_rt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Second, there are 2 special feature macros. They must be enabled at compile time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OPTIMIZER_DEBUG&lt;/code&gt; - common planner&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GEQO_DEBUG&lt;/code&gt; - GEQO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If they are defined, then after a certain stage the result of its work will be dumped to the log.&lt;/p&gt;

&lt;p&gt;Third, there are GUC settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;debug_print_parse&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;debug_print_rewritten&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;debug_print_plan&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a setting is set to &lt;code&gt;on&lt;/code&gt;, then after the corresponding stage (query parsing, query tree rewriting, or planning) the result of its work will be dumped to the log.&lt;/p&gt;

&lt;p&gt;But they output only the final result of a stage, not its separate steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;This is no longer a lifehack, but it helps&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When working with PostgreSQL we can define 4 main stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bootstrapping (&lt;code&gt;configure&lt;/code&gt; invocation)&lt;/li&gt;
&lt;li&gt;Compilation&lt;/li&gt;
&lt;li&gt;Tests running&lt;/li&gt;
&lt;li&gt;Run DB and &lt;code&gt;psql&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each step we can write automation scripts. But when working in VS Code, we get a big benefit from integrating them into the editor. You can do this using VS Code Tasks. For example, a build task can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build PostgreSQL and install into directory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${workspaceFolder}/build.sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"problemMatcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"build"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"isDefault"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setting you can start the build by pressing the Ctrl + Shift + B shortcut.&lt;/p&gt;

&lt;p&gt;I have developed some scripts for automation. They are located in the &lt;a href="https://github.com/TantorLabs/meetups/tree/main/2024-09-17_Kazan/Sergey%20Solovev%20-%20Debugging%20PostgreSQL%20planner/dev" rel="noopener noreferrer"&gt;&lt;code&gt;dev&lt;/code&gt;&lt;/a&gt; folder. You can run them from a terminal or any other way you like. In this folder you can also find a &lt;a href="https://github.com/TantorLabs/meetups/tree/main/2024-09-17_Kazan/Sergey%20Solovev%20-%20Debugging%20PostgreSQL%20planner/dev/README_en.md" rel="noopener noreferrer"&gt;&lt;code&gt;README&lt;/code&gt;&lt;/a&gt; with a tutorial and usage examples.&lt;/p&gt;

&lt;p&gt;Also, there are VS Code configuration files for integration with these scripts. They are located in the &lt;a href="https://github.com/TantorLabs/meetups/tree/main/2024-09-17_Kazan/Sergey%20Solovev%20-%20Debugging%20PostgreSQL%20planner/.vscode" rel="noopener noreferrer"&gt;&lt;code&gt;.vscode&lt;/code&gt;&lt;/a&gt; folder.&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL Hacker Helper
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PostgreSQL Hacker Helper&lt;/code&gt; is a VS Code extension for PostgreSQL development.&lt;/p&gt;

&lt;p&gt;Its main feature is revealing Node variables according to the value of their NodeTag. The extension knows about Node types, and when it finds one, it casts the variable to the type given by its NodeTag and displays it that way.&lt;/p&gt;

&lt;p&gt;You have already seen this extension in the screenshots, when we were debugging source code and revealing &lt;code&gt;OpExpr&lt;/code&gt;, &lt;code&gt;Var&lt;/code&gt;, &lt;code&gt;Const&lt;/code&gt; variables.&lt;/p&gt;

&lt;p&gt;As you may have noticed, it also knows about &lt;code&gt;List&lt;/code&gt; - it shows each member of the array according to its NodeTag, not as a plain &lt;code&gt;Node *&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It has many more features, like formatting using &lt;code&gt;pgindent&lt;/code&gt; or bootstrapping extension files.&lt;/p&gt;

&lt;p&gt;The extension supports almost all PostgreSQL versions - starting from 8.&lt;/p&gt;

&lt;p&gt;Link to the extension &lt;a href="https://marketplace.visualstudio.com/items?itemName=ash-blade.postgresql-hacker-helper" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we've done
&lt;/h2&gt;

&lt;p&gt;In summary, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Got acquainted with the planner work pipeline, its main stages and the corresponding functions&lt;/li&gt;
&lt;li&gt;Learnt the PostgreSQL type system and the most used functions for working with it&lt;/li&gt;
&lt;li&gt;Designed and implemented our own optimization for the planner&lt;/li&gt;
&lt;li&gt;Added our logic to different pipeline stages&lt;/li&gt;
&lt;li&gt;Fixed code using the debugger&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the main thing: do not be afraid to try the &lt;a href="https://github.com/ashenBlade/postgres-dev-helper" rel="noopener noreferrer"&gt;PostgreSQL Hacker Helper&lt;/a&gt; extension!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>c</category>
    </item>
  </channel>
</rss>
