<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yuwei Xiao</title>
    <description>The latest articles on Forem by Yuwei Xiao (@ywxiao).</description>
    <link>https://forem.com/ywxiao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818169%2F49c4afbc-2455-4dec-b1ef-04bf24d5963e.jpeg</url>
      <title>Forem: Yuwei Xiao</title>
      <link>https://forem.com/ywxiao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ywxiao"/>
    <language>en</language>
    <item>
      <title>pg_duckpipe: Real-time CDC for streaming Postgres Table into Columnar Ducklake</title>
      <dc:creator>Yuwei Xiao</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:46:00 +0000</pubDate>
      <link>https://forem.com/ywxiao/pgduckpipe-real-time-cdc-for-streaming-postgres-table-into-columnar-ducklake-536d</link>
      <guid>https://forem.com/ywxiao/pgduckpipe-real-time-cdc-for-streaming-postgres-table-into-columnar-ducklake-536d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: &lt;a href="https://github.com/relytcloud/pg_duckpipe" rel="noopener noreferrer"&gt;pg_duckpipe&lt;/a&gt; is a PostgreSQL extension that continuously streams your heap tables into DuckLake columnar tables via WAL-based CDC. One SQL call to start, no external infrastructure required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why pg_duckpipe?
&lt;/h2&gt;

&lt;p&gt;When we released &lt;a href="https://github.com/relytcloud/pg_ducklake" rel="noopener noreferrer"&gt;pg_ducklake&lt;/a&gt;, it brought a columnar lakehouse storage layer to PostgreSQL: DuckDB-powered analytical tables backed by Parquet, with metadata living in PostgreSQL's own catalog. One question kept coming up: how do I keep these analytical tables in sync with my transactional tables automatically?&lt;/p&gt;

&lt;p&gt;This is a real problem. If you manage DuckLake tables by hand, running periodic ETL jobs or batch inserts, you end up with stale data, extra scripts to maintain, and an operational surface area that grows with every table. For teams that want fresh analytical views of their OLTP data, this quickly becomes painful.&lt;/p&gt;

&lt;p&gt;pg_duckpipe addresses this. It is a PostgreSQL extension (and optionally a standalone daemon) that streams changes from regular heap tables into DuckLake columnar tables in real time. No Kafka, no Debezium, no external orchestrator. Just PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpr96bg8bwgrbehont65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpr96bg8bwgrbehont65.png" alt="architecture" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The Docker image ships with both pg_ducklake and pg_duckpipe pre-configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; - name duckpipe &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 15432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;duckdb &lt;span class="se"&gt;\&lt;/span&gt;
  pgducklake/pgduckpipe:18-main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sync a Local Table&lt;/strong&gt;&lt;br&gt;
Sync a heap table into a columnar copy for analytical queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to the database&lt;/span&gt;
&lt;span class="n"&gt;psql&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;localhost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="mi"&gt;15432&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a table and insert some data&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Start syncing to a columnar copy&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.orders'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Query the columnar copy&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_ducklake&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
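
&lt;p&gt;The initial snapshot of a large table takes a moment, so before trusting the columnar copy you can check its sync state. This is a sketch built on the &lt;code&gt;duckpipe.status()&lt;/code&gt; call shown in the remote-sync example below; the exact column set may differ across versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Wait until the table has left SNAPSHOT/CATCHUP and is STREAMING
SELECT source_table, state, rows_synced
FROM duckpipe.status()
WHERE source_table = 'public.orders';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;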



&lt;p&gt;&lt;strong&gt;Sync from a Remote PostgreSQL&lt;/strong&gt;&lt;br&gt;
pg_duckpipe can replicate from a remote PostgreSQL instance. The source database does not need pg_duckpipe or pg_ducklake installed. It only needs &lt;code&gt;wal_level = logical&lt;/code&gt; and a replication user. This makes it easy to add an analytical layer to an existing production database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a sync group pointing to the remote database&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;conninfo&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'host=prod-db.example.com port=5432 dbname=myapp user=replicator'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Add tables to sync&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sync_group&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.customers'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sync_group&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Check sync progress&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;source_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_synced&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
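
&lt;p&gt;Once both tables are streaming, the columnar copies can be joined locally for analytics. The sketch below assumes the same &lt;code&gt;_ducklake&lt;/code&gt; naming convention as the local example; adjust the table names to your configured targets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top customers by revenue, computed entirely on the columnar copies
SELECT c.id, sum(o.total) AS revenue
FROM orders_ducklake o
JOIN customers_ducklake c ON c.id = o.customer_id
GROUP BY c.id
ORDER BY revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;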



&lt;p&gt;&lt;strong&gt;Under the Hood&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg_duckpipe is written in Rust. Here is how changes flow from source to lakehouse:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tail the WAL stream. Connect to PostgreSQL's logical replication protocol via the pgoutput plugin.&lt;/li&gt;
&lt;li&gt;Decode and route. Parse each change and dispatch it to a per-table in-memory queue.&lt;/li&gt;
&lt;li&gt;Flush to DuckLake. Batch-write queued changes into DuckLake columnar tables through embedded DuckDB connections.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc8uscnmb86s08xg1mv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc8uscnmb86s08xg1mv7.png" alt="pipeline" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few design choices worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-table isolation. Each synced table progresses through its own state machine (SNAPSHOT, CATCHUP, STREAMING) independently. A failure in one table never blocks another.&lt;/li&gt;
&lt;li&gt;Backpressure. If flush workers fall behind, the slot consumer pauses WAL consumption rather than accumulating unbounded memory.&lt;/li&gt;
&lt;li&gt;Crash safety. Per-table LSN tracking and an idempotent DELETE+INSERT flush path ensure at-least-once delivery with correct replay after restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For a deeper look at the architecture, check out the codebase and docs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;pg_duckpipe is under active development. Here is what we are working on next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functionality: schema DDL propagation, broader PostgreSQL version support.&lt;/li&gt;
&lt;li&gt;Performance: flush worker thread pool, bounded queues, adaptive batching.&lt;/li&gt;
&lt;li&gt;Maintenance &amp;amp; Observability: auto-compaction, scheduled flush policies, per-table lag metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Give it a try, open an issue if something breaks, and send a PR if you want to help shape it. Let's start duck piping!&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/relytcloud/pg_duckpipe" rel="noopener noreferrer"&gt;https://github.com/relytcloud/pg_duckpipe&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>pg_ducklake: Columnar Storage for PostgreSQL</title>
      <dc:creator>Yuwei Xiao</dc:creator>
      <pubDate>Wed, 11 Mar 2026 09:29:01 +0000</pubDate>
      <link>https://forem.com/ywxiao/pgducklake-columnar-storage-for-postgresql-53fh</link>
      <guid>https://forem.com/ywxiao/pgducklake-columnar-storage-for-postgresql-53fh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Access pg_ducklake at &lt;a href="https://github.com/relytcloud/pg_ducklake/" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PostgreSQL is The World's Most Advanced Open Source Relational Database[1].&lt;br&gt;
It has the broadest and most mature ecosystem in the modern data stack: from AI integrations, JDBC drivers, and ORMs, to robust tooling for monitoring, backups and replication.&lt;br&gt;
But PostgreSQL is fundamentally a row-store designed for transactions, and that makes it a less natural fit for large analytical scans and aggregations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;[1] quoted from &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;https://www.postgresql.org/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, the lakehouse approach is steadily becoming the default for analytics: an architecture that separates storage from compute, built on open columnar file formats (most often Parquet).&lt;br&gt;
Table formats like Delta Lake and Apache Iceberg brought this model mainstream, but they suffer from complexity in metadata management.&lt;/p&gt;

&lt;p&gt;DuckLake is a newer entrant that keeps catalog metadata in a plain SQL database while storing the data itself in open Parquet files.&lt;/p&gt;

&lt;p&gt;pg_ducklake sits right at the intersection of these worlds: it brings a native lakehouse experience into PostgreSQL, while keeping accessibility from the DuckDB ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL × DuckDB × DuckLake
&lt;/h2&gt;

&lt;p&gt;pg_ducklake creates a unified experience by bridging these three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL provides the Interface and Catalog: You manage and query tables using familiar Postgres SQL, while all table metadata is stored natively in PostgreSQL heap tables.&lt;/li&gt;
&lt;li&gt;DuckDB powers the Execution Engine: A vectorized DuckDB engine is embedded directly within the PostgreSQL backend to handle analytical scans and aggregations with high efficiency.&lt;/li&gt;
&lt;li&gt;DuckLake serves as the Storage Format: It defines the "open" nature of the data: combining Parquet files on S3 with the metadata in Postgres, which ensures external DuckDB clients (CLI, Python, etc.) can also access the same tables.&lt;/li&gt;
&lt;/ul&gt;
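
&lt;p&gt;In practice, this means analytical tables are created and queried with ordinary SQL. The sketch below is illustrative only: the access-method name (&lt;code&gt;USING ducklake&lt;/code&gt;) is an assumption, so check the repository docs for the exact DDL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical DDL: a columnar DuckLake table managed from Postgres
CREATE TABLE events (
  user_id BIGINT,
  event_type TEXT,
  ts TIMESTAMP
) USING ducklake;

-- Regular Postgres SQL; scans run on the embedded DuckDB engine
INSERT INTO events VALUES (1, 'click', now());
SELECT event_type, count(*) FROM events GROUP BY event_type;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;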

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk6a4qdhy2rm2qm98sln.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk6a4qdhy2rm2qm98sln.jpeg" alt="architecture" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What pg_ducklake brings
&lt;/h2&gt;

&lt;p&gt;The goal is simple: use PostgreSQL normally, but get lakehouse-style tables when you need them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres-native ergonomics: DuckLake tables are managed from PostgreSQL, using familiar SQL and tooling. So they fit naturally into Postgres apps, BI and analyst workflows. In replica-friendly deployments (e.g., serverless Postgres setups like Neon), you can often scale read-heavy analytics by adding read replicas.&lt;/li&gt;
&lt;li&gt;Open tables by default: Parquet data + Postgres catalog; DuckDB clients (e.g., CLI, Python) can read the same DuckLake tables directly, using Postgres as the metadata provider.&lt;/li&gt;
&lt;li&gt;DuckDB speed for analytics: vectorized execution + columnar storage for scans and aggregations.&lt;/li&gt;
&lt;/ul&gt;
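
&lt;p&gt;For example, an external DuckDB client can attach the same catalog directly. This is a sketch using DuckDB's &lt;code&gt;ducklake&lt;/code&gt; extension; the connection string and table name are placeholders for your own deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- From the DuckDB CLI or Python, with Postgres as the metadata provider
INSTALL ducklake;
ATTACH 'ducklake:postgres:host=localhost dbname=mydb user=postgres' AS lake;
SELECT count(*) FROM lake.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;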

&lt;h2&gt;
  
  
  What’s next for pg_ducklake
&lt;/h2&gt;

&lt;p&gt;pg_ducklake is under active development, and we’re aiming toward a production-grade lakehouse experience inside PostgreSQL. On the roadmap are practical features like schema evolution, time travel, partitioning / layout controls, and table maintenance (compaction / garbage collection), along with clearer operational guidance as more real-world users kick the tires.&lt;/p&gt;

&lt;p&gt;Feedback and contributions are very welcome, especially real-world workloads, feature requests, and sharp edges you run into.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;pg_ducklake: Native lakehouse tables in PostgreSQL. Open data, DuckDB speed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>duckdb</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
