<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yuwei Xiao</title>
    <description>The latest articles on Forem by Yuwei Xiao (@ywxiao).</description>
    <link>https://forem.com/ywxiao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818169%2F49c4afbc-2455-4dec-b1ef-04bf24d5963e.jpeg</url>
      <title>Forem: Yuwei Xiao</title>
      <link>https://forem.com/ywxiao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ywxiao"/>
    <language>en</language>
    <item>
      <title>pg_duckpipe: Real-time CDC for streaming Postgres Table into Columnar Ducklake</title>
      <dc:creator>Yuwei Xiao</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:46:00 +0000</pubDate>
      <link>https://forem.com/ywxiao/pgduckpipe-real-time-cdc-for-streaming-postgres-table-into-columnar-ducklake-536d</link>
      <guid>https://forem.com/ywxiao/pgduckpipe-real-time-cdc-for-streaming-postgres-table-into-columnar-ducklake-536d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: &lt;a href="https://github.com/relytcloud/pg_duckpipe" rel="noopener noreferrer"&gt;pg_duckpipe&lt;/a&gt; is a PostgreSQL extension that continuously streams your heap tables into DuckLake columnar tables via WAL-based CDC. One SQL call to start, no external infrastructure required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why pg_duckpipe?
&lt;/h2&gt;

&lt;p&gt;When we released &lt;a href="https://github.com/relytcloud/pg_ducklake" rel="noopener noreferrer"&gt;pg_ducklake&lt;/a&gt;, it brought a columnar lakehouse storage layer to PostgreSQL: DuckDB-powered analytical tables backed by Parquet, with metadata living in PostgreSQL's own catalog. One question kept coming up: how do I keep these analytical tables in sync with my transactional tables automatically?&lt;/p&gt;

&lt;p&gt;This is a real problem. If you manage DuckLake tables by hand, running periodic ETL jobs or batch inserts, you end up with stale data, extra scripts to maintain, and an operational surface area that grows with every table. For teams that want fresh analytical views of their OLTP data, this quickly becomes painful.&lt;/p&gt;

&lt;p&gt;pg_duckpipe addresses this. It is a PostgreSQL extension (and optionally a standalone daemon) that streams changes from regular heap tables into DuckLake columnar tables in real time. No Kafka, no Debezium, no external orchestrator. Just PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpr96bg8bwgrbehont65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpr96bg8bwgrbehont65.png" alt="architecture" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The Docker image ships with both pg_ducklake and pg_duckpipe pre-configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; - name duckpipe &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 15432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;duckdb &lt;span class="se"&gt;\&lt;/span&gt;
  pgducklake/pgduckpipe:18-main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sync a Local Table&lt;/strong&gt;&lt;br&gt;
Sync a heap table into a columnar copy for analytical queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to the database&lt;/span&gt;
&lt;span class="n"&gt;psql&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;localhost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="mi"&gt;15432&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a table and insert some data&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Start syncing to a columnar copy&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.orders'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Query the columnar copy&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_ducklake&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
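
&lt;p&gt;The initial snapshot of a large table takes a moment, so before trusting the columnar copy you can check its sync state. This is a sketch built on the &lt;code&gt;duckpipe.status()&lt;/code&gt; call shown in the remote-sync example below; the exact column set may differ across versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Wait until the table has left SNAPSHOT/CATCHUP and is STREAMING
SELECT source_table, state, rows_synced
FROM duckpipe.status()
WHERE source_table = 'public.orders';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;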



&lt;p&gt;&lt;strong&gt;Sync from a Remote PostgreSQL&lt;/strong&gt;&lt;br&gt;
pg_duckpipe can replicate from a remote PostgreSQL instance. The source database does not need pg_duckpipe or pg_ducklake installed. It only needs &lt;code&gt;wal_level = logical&lt;/code&gt; and a replication user. This makes it easy to add an analytical layer to an existing production database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a sync group pointing to the remote database&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;conninfo&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'host=prod-db.example.com port=5432 dbname=myapp user=replicator'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Add tables to sync&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sync_group&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.customers'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sync_group&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'prod_replica'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Check sync progress&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;source_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_synced&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;duckpipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
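
&lt;p&gt;Once both tables are streaming, the columnar copies can be joined locally for analytics. The sketch below assumes the same &lt;code&gt;_ducklake&lt;/code&gt; naming convention as the local example; adjust the table names to your configured targets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top customers by revenue, computed entirely on the columnar copies
SELECT c.id, sum(o.total) AS revenue
FROM orders_ducklake o
JOIN customers_ducklake c ON c.id = o.customer_id
GROUP BY c.id
ORDER BY revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;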



&lt;p&gt;&lt;strong&gt;Under the Hood&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg_duckpipe is written in Rust. Here is how changes flow from source to lakehouse:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tail the WAL stream. Connect to PostgreSQL's logical replication protocol via the pgoutput plugin.&lt;/li&gt;
&lt;li&gt;Decode and route. Parse each change and dispatch it to a per-table in-memory queue.&lt;/li&gt;
&lt;li&gt;Flush to DuckLake. Batch-write queued changes into DuckLake columnar tables through embedded DuckDB connections.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc8uscnmb86s08xg1mv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc8uscnmb86s08xg1mv7.png" alt="pipeline" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few design choices worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-table isolation. Each synced table progresses through its own state machine (SNAPSHOT, CATCHUP, STREAMING) independently. A failure in one table never blocks another.&lt;/li&gt;
&lt;li&gt;Backpressure. If flush workers fall behind, the slot consumer pauses WAL consumption rather than accumulating unbounded memory.&lt;/li&gt;
&lt;li&gt;Crash safety. Per-table LSN tracking and an idempotent DELETE+INSERT flush path ensure at-least-once delivery with correct replay after restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For a deeper look at the architecture, check out the codebase and docs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;pg_duckpipe is under active development. Here is what we are working on next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functionality: schema DDL propagation, broader PostgreSQL version support.&lt;/li&gt;
&lt;li&gt;Performance: flush worker thread pool, bounded queues, adaptive batching.&lt;/li&gt;
&lt;li&gt;Maintenance &amp;amp; Observability: auto-compaction, scheduled flush policies, per-table lag metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Give it a try, open an issue if something breaks, and send a PR if you want to help shape it. Let's start duck piping!&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/relytcloud/pg_duckpipe" rel="noopener noreferrer"&gt;https://github.com/relytcloud/pg_duckpipe&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>pg_ducklake: Columnar Storage for PostgreSQL</title>
      <dc:creator>Yuwei Xiao</dc:creator>
      <pubDate>Wed, 11 Mar 2026 09:29:01 +0000</pubDate>
      <link>https://forem.com/ywxiao/pgducklake-columnar-storage-for-postgresql-53fh</link>
      <guid>https://forem.com/ywxiao/pgducklake-columnar-storage-for-postgresql-53fh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Access pg_ducklake at &lt;a href="https://github.com/relytcloud/pg_ducklake/" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PostgreSQL is The World's Most Advanced Open Source Relational Database[1].&lt;br&gt;
It has the broadest and most mature ecosystem in the modern data stack: from AI integrations, JDBC drivers, and ORMs, to robust tooling for monitoring, backups and replication.&lt;br&gt;
But PostgreSQL is fundamentally a row-store designed for transactions, and that makes it a less natural fit for large analytical scans and aggregations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;[1] quoted from &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;https://www.postgresql.org/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, the lakehouse approach is steadily becoming the default for analytics: an architecture that separates storage from compute, built on open columnar file formats (most often Parquet).&lt;br&gt;
Table formats like Delta Lake and Apache Iceberg brought this model mainstream, but they suffer from complexity in metadata management.&lt;/p&gt;

&lt;p&gt;DuckLake is a newer entrant that keeps catalog metadata in a plain SQL database while storing the data itself in open Parquet files.&lt;/p&gt;

&lt;p&gt;pg_ducklake sits right at the intersection of these worlds: it brings a native lakehouse experience into PostgreSQL, while keeping accessibility from the DuckDB ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL × DuckDB × DuckLake
&lt;/h2&gt;

&lt;p&gt;pg_ducklake creates a unified experience by bridging these three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL provides the Interface and Catalog: You manage and query tables using familiar Postgres SQL, while all table metadata is stored natively in PostgreSQL heap tables.&lt;/li&gt;
&lt;li&gt;DuckDB powers the Execution Engine: A vectorized DuckDB engine is embedded directly within the PostgreSQL backend to handle analytical scans and aggregations with high efficiency.&lt;/li&gt;
&lt;li&gt;DuckLake serves as the Storage Format: It defines the "open" nature of the data: combining Parquet files on S3 with the metadata in Postgres, which ensures external DuckDB clients (CLI, Python, etc.) can also access the same tables.&lt;/li&gt;
&lt;/ul&gt;
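
&lt;p&gt;In practice, this means analytical tables are created and queried with ordinary SQL. The sketch below is illustrative only: the access-method name (&lt;code&gt;USING ducklake&lt;/code&gt;) is an assumption, so check the repository docs for the exact DDL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical DDL: a columnar DuckLake table managed from Postgres
CREATE TABLE events (
  user_id BIGINT,
  event_type TEXT,
  ts TIMESTAMP
) USING ducklake;

-- Regular Postgres SQL; scans run on the embedded DuckDB engine
INSERT INTO events VALUES (1, 'click', now());
SELECT event_type, count(*) FROM events GROUP BY event_type;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;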

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk6a4qdhy2rm2qm98sln.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk6a4qdhy2rm2qm98sln.jpeg" alt="architecture" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What pg_ducklake brings
&lt;/h2&gt;

&lt;p&gt;The goal is simple: use PostgreSQL normally, but get lakehouse-style tables when you need them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres-native ergonomics: DuckLake tables are managed from PostgreSQL, using familiar SQL and tooling. So they fit naturally into Postgres apps, BI and analyst workflows. In replica-friendly deployments (e.g., serverless Postgres setups like Neon), you can often scale read-heavy analytics by adding read replicas.&lt;/li&gt;
&lt;li&gt;Open tables by default: Parquet data + Postgres catalog; DuckDB clients (e.g., CLI, Python) can read the same DuckLake tables directly, using Postgres as the metadata provider.&lt;/li&gt;
&lt;li&gt;DuckDB speed for analytics: vectorized execution + columnar storage for scans and aggregations.&lt;/li&gt;
&lt;/ul&gt;
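
&lt;p&gt;For example, an external DuckDB client can attach the same catalog directly. This is a sketch using DuckDB's &lt;code&gt;ducklake&lt;/code&gt; extension; the connection string and table name are placeholders for your own deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- From the DuckDB CLI or Python, with Postgres as the metadata provider
INSTALL ducklake;
ATTACH 'ducklake:postgres:host=localhost dbname=mydb user=postgres' AS lake;
SELECT count(*) FROM lake.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;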

&lt;h2&gt;
  
  
  What’s next for pg_ducklake
&lt;/h2&gt;

&lt;p&gt;pg_ducklake is under active development, and we’re aiming toward a production-grade lakehouse experience inside PostgreSQL. On the roadmap are practical features like schema evolution, time travel, partitioning / layout controls, and table maintenance (compaction / garbage collection), along with clearer operational guidance as more real-world users kick the tires.&lt;/p&gt;

&lt;p&gt;Feedback and contributions are very welcome, especially real-world workloads, feature requests, and sharp edges you run into.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;pg_ducklake: Native lakehouse tables in PostgreSQL. Open data, DuckDB speed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>duckdb</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
