What Is CDC?
Change Data Capture (CDC) is a mechanism that tracks row-level changes (inserts, updates, deletes) in a database and notifies downstream systems in the order they occur. In disaster recovery scenarios, CDC is often used for real-time synchronization from a primary database to a standby one.
source ----------> CDC ----------> sink
Apache SeaTunnel CDC
SeaTunnel CDC supports two types of synchronization:
- Snapshot Reading: Reads historical data from tables
- Incremental Tracking: Captures and reads incremental change logs from tables
Lock-Free Snapshot Sync
Why emphasize “lock-free”? Because some CDC platforms (e.g., Debezium) may lock tables during historical data sync. SeaTunnel avoids this by reading snapshots without locking. Here’s the basic snapshot read process:
storage------------->splitEnumerator----------split---------->reader
^ |
| |
\-----------------report------------/
Split Assignment:
The splitEnumerator
divides table data into multiple chunks (splits) based on a specified field (e.g., table ID or unique key) and step size.
Parallel Processing:
Each split is routed to a different reader for parallel reading. Each reader occupies one connection.
Event Feedback:
After completing a split, the reader reports progress back to the splitEnumerator
.
Each split sent to a reader contains metadata:
String splitId // Routing ID
TableId tableId // Table ID
SeatunnelRowType splitKeyType // Field type used for splitting
Object splitStart // Split start point
Object splitEnd // Split end point
The reader generates SQL statements based on this info. Before reading, it records the log position (low watermark
) for the split. Once complete, it reports:
String splitId // Split ID
Offset highWatermark // Log position after split is processed
Incremental Synchronization
After snapshot reading, any changes in the source DB are continuously captured and synced in real time. Unlike the snapshot phase, this stage reads from the DB’s log (e.g., MySQL binlog) and typically uses a single-threaded reader to reduce pressure on the DB.
data log------------->splitEnumerator----------split---------->reader
^ |
| |
\-----------------report------------/
During incremental sync, all snapshot splits and tables are merged into a single split. Metadata for the incremental split:
String splitId
Offset startingOffset // Earliest log start across all splits
Offset endingOffset // Log end position; null if continuous
List<TableId> tableIds
Map<TableId, Offset> tableWatermarks // Watermarks per table from snapshot phase
List<CompletedSnapshotSplitInfo> completedSnapshotSplitInfos // Snapshot split details
CompletedSnapshotSplitInfo includes:
String splitId
TableId tableId
SeatunnelRowType splitKeyType
Object splitStart
Object splitEnd
Offset watermark // High watermark from report
The incremental phase finds the smallest watermark from snapshot splits and starts reading from that log position.
Exactly-Once Guarantee
Whether during snapshot or incremental sync, the database may still be changing. How does SeaTunnel ensure exactly-once processing?
Snapshot Phase
During snapshot reading, imagine a split is being read and data changes (e.g., insert k3, update k2, delete k1). Without special handling, updates could be missed. SeaTunnel solves this by:
- Reading the low watermark from the database log before the split
- Reading data for split
{start, end}
- Recording the high watermark after the split
- If
high = low
, no change occurred. Ifhigh > low
, a change occurred during the read. SeaTunnel:
- Caches snapshot data in-memory
- Replays log events between low and high watermark onto the in-memory table using primary key order
- Reports
high watermark
- Reports
insert k3 update k2 delete k1
| | |
v v v
bin log --|---------------------------------------------------|-- log offset
low watermark high watermark
CDC read data: k1 k3 k4
| replay
v
Final result: k2 k3' k4
Incremental Phase
Before starting incremental sync, SeaTunnel validates the snapshot phase. It checks for inter-split changes—data changes that may have occurred between splits. SeaTunnel handles this by:
- Finding the minimum watermark from all snapshot splits
- Starting to read from that log position
- For each log event, checking if the data was already processed in a snapshot split
- If not, it's inter-split data and will be corrected
- After all tables are validated, true incremental sync begins
|------------filter split2-----------------|
|----filter split1------|
data log -|-----------------------|------------------|----------------------------------|- log offset
min watermark split1 watermark split2 watermark max watermark
Fault Tolerance and Checkpointing
How to support pause and resume? SeaTunnel uses the Chandy-Lamport distributed snapshot algorithm.
Assume two processes, p1 and p2:
p1 has variables X1 Y1 Z1, p2 has X2 Y2 Z2:
p1 p2
X1:0 X2:4
Y1:0 Y2:2
Z1:0 Z2:3
p1 initiates a global snapshot by recording its local state and sending a marker to p2. Before p2 receives it, it sends message M to p1.
p1 p2
X1:0 -------marker-------> X2:4
Y1:0 <---------M---------- Y2:2
Z1:0 Z2:3
p2 receives the marker, records its own state. p1 then receives message M and logs it separately as it already took a snapshot. Final snapshot:
p1 M p2
X1:0 X2:4
Y1:0 Y2:2
Z1:0 Z2:3
In SeaTunnel CDC, markers are sent to all nodes — readers, splitEnumerators, writers — and each stores its current state for recovery.
Top comments (0)