<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Apache SeaTunnel</title>
    <description>The latest articles on Forem by Apache SeaTunnel (@seatunnel).</description>
    <link>https://forem.com/seatunnel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F844122%2Fc6155eb3-df58-448b-8d88-36865c4f1d84.jpg</url>
      <title>Forem: Apache SeaTunnel</title>
      <link>https://forem.com/seatunnel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/seatunnel"/>
    <language>en</language>
    <item>
      <title>A Practical DataOps Development Framework Based on WhaleStudio’s Three-Layer Model</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:37:01 +0000</pubDate>
      <link>https://forem.com/seatunnel/a-practical-dataops-development-framework-based-on-whalestudios-three-layer-model-1j9l</link>
      <guid>https://forem.com/seatunnel/a-practical-dataops-development-framework-based-on-whalestudios-three-layer-model-1j9l</guid>
      <description>&lt;p&gt;As data platforms evolve from simply “getting jobs to run” to achieving stable and reliable operations, the challenges teams face also begin to shift. Early on, the focus is mainly on whether tasks execute successfully. As scale increases, the concerns move toward access control, clarity of data pipelines, manageability of changes, and the ability to recover from failures.&lt;/p&gt;

&lt;p&gt;This is where DataOps starts to show its real value. It is not just a set of tool usage guidelines, but an engineering methodology that spans development, scheduling, and governance. Using WhaleStudio’s development management framework as an example, this article distills a set of practical standards drawn directly from real production experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Development Framework
&lt;/h2&gt;

&lt;p&gt;In complex data platforms, managing everything through a single dimension quickly becomes insufficient as the system grows. WhaleStudio introduces a three-layer structure of Project, Workflow, and Task, which decouples governance, orchestration, and execution, creating clear boundaries for system management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F150g5rxu5mh8gr6ws2gd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F150g5rxu5mh8gr6ws2gd.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Project as the Governance Boundary
&lt;/h3&gt;

&lt;p&gt;The project layer is the most fundamental part of the system, yet it is also the most commonly misused. In many teams, projects are treated merely as a way to organize directories. This approach often leads to problems later, such as unclear permissions, resource misuse, and ambiguous ownership.&lt;/p&gt;

&lt;p&gt;In a well-designed system, projects should serve as governance boundaries. Everything related to access control should be scoped within a project, including user permissions, data source access, script resources, alerting strategies, and Worker group configurations.&lt;/p&gt;

&lt;p&gt;A practical rule is simple: whenever certain users should not be able to view or modify specific resources, enforce isolation at the project level rather than relying on conventions or manual processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow as the Business Pipeline
&lt;/h3&gt;

&lt;p&gt;If projects define who can do what, workflows define how work is organized.&lt;/p&gt;

&lt;p&gt;A workflow is essentially a DAG that represents dependencies between tasks. In a typical data pipeline, workflows connect data ingestion, SQL processing, script execution, and sub-process calls into a complete business flow.&lt;/p&gt;

&lt;p&gt;Beyond orchestration, workflows also handle scheduling concerns such as dependency management, parallel and sequential execution strategies, retry mechanisms, and backfill logic. This means a workflow is not just a representation of execution logic, but also a key part of system stability design.&lt;/p&gt;

&lt;p&gt;In practice, workflows should be treated as traceable and replayable pipelines rather than just collections of tasks.&lt;/p&gt;
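&lt;p&gt;The idea above can be sketched in a few lines: a workflow as a task DAG executed in dependency order, with a naive per-task retry. This is an illustrative Python sketch, not WhaleStudio’s actual API; the task names and retry policy are hypothetical.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: task -> set of upstream dependencies.
workflow = {
    "ingest_orders": set(),
    "clean_orders": {"ingest_orders"},
    "aggregate_daily": {"clean_orders"},
    "export_report": {"aggregate_daily"},
}

def run_task(name: str) -> None:
    # Stand-in for a real SQL, Shell, or Python task.
    print(f"running {name}")

def run_workflow(dag: dict, max_retries: int = 2) -> list:
    """Execute tasks in topological (dependency) order with naive retries."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        for attempt in range(max_retries + 1):
            try:
                run_task(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; a real scheduler would alert here
    return order

run_workflow(workflow)
```

A real scheduler adds much more on top of this skeleton (backfill windows, cross-workflow dependencies, alerting), but the execution core is the same dependency-ordered traversal.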

&lt;h3&gt;
  
  
  Task as the Smallest Execution Unit
&lt;/h3&gt;

&lt;p&gt;Under workflows, tasks represent the smallest unit of execution and have the most direct impact on system stability.&lt;/p&gt;

&lt;p&gt;Common task types include SQL, Shell, Python, and data integration jobs. Despite their differences, they should follow consistent design principles such as traceability, retry capability, and recoverability.&lt;/p&gt;

&lt;p&gt;In many production scenarios, issues do not originate from the scheduler itself, but from the tasks. For example, non-idempotent SQL logic, scripts without proper error handling, or strong dependencies on external systems can amplify risks during retries or backfills. Establishing standards at the task level is therefore critical to overall system reliability.&lt;/p&gt;
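&lt;p&gt;As a concrete illustration of the idempotency point, a common pattern is to make each load overwrite its target partition rather than blindly append, so a retry or backfill yields the same final state. A minimal Python sketch, with a hypothetical in-memory table and partition key:&lt;/p&gt;

```python
# Idempotent "delete-then-insert" load: rerunning the same partition
# produces the same final state, so retries and backfills are safe.
def load_partition(table: dict, partition: str, rows: list) -> dict:
    # Drop any rows previously written for this partition...
    table = {k: v for k, v in table.items() if v["dt"] != partition}
    # ...then write the new batch keyed by primary key.
    for row in rows:
        table[row["id"]] = row
    return table

table = {}
batch = [{"id": 1, "dt": "2026-04-01", "amount": 10}]
table = load_partition(table, "2026-04-01", batch)
table = load_partition(table, "2026-04-01", batch)  # retry: same result
print(len(table))  # still one row, not two
```

The same pattern maps directly to SQL tasks as a DELETE-then-INSERT (or MERGE) scoped to the partition being loaded.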

&lt;p&gt;Once the responsibilities of the three layers are clearly defined, the next step is to manage permissions and design workflows effectively to prevent the system from becoming unmanageable as it scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principles for Data Access and Workflow Design
&lt;/h2&gt;

&lt;p&gt;As teams grow and business logic becomes more complex, access control and workflow design become key factors affecting both efficiency and stability. Without consistent standards, systems can quickly become chaotic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organize Projects by Business Domain
&lt;/h3&gt;

&lt;p&gt;Projects should primarily be structured around business domains such as sales, risk control, or finance. This aligns naturally with organizational structure and helps clarify ownership.&lt;/p&gt;

&lt;p&gt;When cross-team collaboration is required, resource sharing should be implemented through authorization mechanisms rather than placing everything into a single project. While the latter may seem convenient initially, it often leads to uncontrolled permissions over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate Responsibilities in Permission Design
&lt;/h3&gt;

&lt;p&gt;Permissions should never default to giving everyone full access. Roles such as development, testing, operations, and auditing should be clearly separated, each with its own scope of authority.&lt;/p&gt;

&lt;p&gt;This approach reduces the risk of accidental changes and helps standardize release processes, making system changes more controlled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Balance Isolation and Reuse
&lt;/h3&gt;

&lt;p&gt;Resource management must balance isolation with reuse. Data sources, scripts, resource pools, and Worker groups should be isolated by default to avoid unintended interference.&lt;/p&gt;

&lt;p&gt;When reuse is necessary, it should be achieved through controlled authorization rather than duplicating configurations. This reduces maintenance overhead and avoids inconsistencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolve Permission Differences Through Projects
&lt;/h3&gt;

&lt;p&gt;Whenever permission differences exist, they must be handled through project-level isolation. For example, if certain datasets should only be accessible to specific users, this must be enforced through system mechanisms rather than informal agreements.&lt;/p&gt;

&lt;p&gt;Although this principle seems straightforward, it is often overlooked, leading to loss of control over the permission system.&lt;/p&gt;

&lt;p&gt;Once the permission model is stable, workflow design becomes the key factor in maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Workflow Size
&lt;/h3&gt;

&lt;p&gt;As the number of tasks grows, placing everything into a single workflow leads to rapidly increasing maintenance costs and higher risk during changes.&lt;/p&gt;

&lt;p&gt;In practice, workflows should be split based on data layers or business domains, such as ODS, DWD, DWS, and ADS. The number of nodes within a workflow should remain within a manageable range to avoid excessive complexity.&lt;/p&gt;
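&lt;p&gt;One lightweight way to enforce a “manageable range” is a simple lint over workflow definitions that flags any workflow whose node count exceeds an agreed threshold. A sketch follows; the layer prefixes come from the article, while the inventory and the 50-node limit are assumed examples, not WhaleStudio defaults.&lt;/p&gt;

```python
# Hypothetical workflow inventory: workflow name -> node count.
workflows = {
    "ods_ingest_daily": 18,
    "dwd_clean_orders": 34,
    "dws_aggregate_sales": 61,  # too large: candidate for splitting
    "ads_report_export": 9,
}

MAX_NODES = 50  # assumed team convention, not a product default

def oversized(inventory: dict, limit: int = MAX_NODES) -> list:
    """Return workflows whose node count exceeds the agreed limit."""
    return sorted(name for name, n in inventory.items() if n > limit)

print(oversized(workflows))  # ['dws_aggregate_sales']
```

Running such a check in CI keeps workflow growth visible before it becomes a maintenance problem.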

&lt;h3&gt;
  
  
  Upgrade Governance When Complexity Increases
&lt;/h3&gt;

&lt;p&gt;When the number of workflows grows too large or directory structures become unmanageable, relying on labels or folders is no longer sufficient. At this point, governance should be elevated to a higher level, such as introducing additional project segmentation.&lt;/p&gt;

&lt;p&gt;This is not merely structural optimization, but an evolution of governance strategy.&lt;/p&gt;

&lt;p&gt;Once design principles are clear, implementation should align with team size. There is no single solution that fits all teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Strategies for Different Team Sizes
&lt;/h2&gt;

&lt;p&gt;DataOps does not have a universal solution. The right approach depends on team size and system complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large Teams with Layered Isolation
&lt;/h3&gt;

&lt;p&gt;In large or complex data warehouse environments, multiple business domains, permission boundaries, and data pipelines coexist. In such cases, data warehouse layers such as ODS, DWD, DWS, and ADS should be mapped to different projects and workflows.&lt;/p&gt;

&lt;p&gt;Dependencies across projects and workflows must be clearly defined. Impact analysis tools should be used for global governance to ensure changes do not introduce cascading failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medium Sized Teams with Balanced Design
&lt;/h3&gt;

&lt;p&gt;For medium-sized teams, the goal is to maintain stability while avoiding unnecessary complexity.&lt;/p&gt;

&lt;p&gt;Projects should not be overly fragmented, and workflows should not be split excessively. Instead, different scheduling cycles such as daily and monthly jobs can be connected through well-defined dependencies.&lt;/p&gt;

&lt;p&gt;The focus at this stage should be on unified scheduling strategies and resource pool management rather than introducing overly complex governance frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small Teams with Fast Execution
&lt;/h3&gt;

&lt;p&gt;For small teams or early-stage projects, the priority is to establish a working delivery pipeline.&lt;/p&gt;

&lt;p&gt;A single workflow can be used to handle core business processes, supported by naming conventions, alerting mechanisms, and backfill strategies to ensure baseline quality. As complexity increases, the system can gradually evolve toward more fine-grained structures.&lt;/p&gt;

&lt;p&gt;This approach keeps costs under control while avoiding overly heavy design in the early stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From Project to Workflow to Task, WhaleStudio’s three-layer model provides a clear division of responsibilities. Projects define governance boundaries, workflows manage business orchestration, and tasks handle execution.&lt;/p&gt;

&lt;p&gt;With well-designed permission models and properly structured workflows, systems can remain stable and controllable even as complexity grows.&lt;/p&gt;

&lt;p&gt;The essence of DataOps lies not in the tools themselves, but in building an engineering system that can evolve sustainably. Only when permissions, resources, and execution logic are governed under a unified framework can a data platform truly support long-term business growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previous Articles
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/5-when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-32ba42558db1" rel="noopener noreferrer"&gt;(5) When Your Data Warehouse Breaks Down, It’s Probably a Naming Problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/codex/4-why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4fddecde4288?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(4) Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;(3) Key Design Principles for ODS/Detail Layer Implementation: Building the Data Ingestion Layer as a “Stable and Operable” Infrastructure&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/i-a-complete-guide-to-building-and-standardizing-a-modern-lakehouse-architecture-an-overview-of-9a2a263f2f1b?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(I) A Complete Guide to Building and Standardizing a Modern Lakehouse Architecture: An Overview of Data Warehouses and Data Lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Coming Next
&lt;/h2&gt;

&lt;p&gt;Part 7: Scheduling design best practices&lt;/p&gt;




</description>
      <category>dataops</category>
      <category>ai</category>
      <category>database</category>
      <category>terraform</category>
    </item>
    <item>
      <title>You Don’t Apply to Become an ASF Member, You Grow Into It</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:11:30 +0000</pubDate>
      <link>https://forem.com/seatunnel/you-dont-apply-to-become-an-asf-member-you-grow-into-it-4oa8</link>
      <guid>https://forem.com/seatunnel/you-dont-apply-to-become-an-asf-member-you-grow-into-it-4oa8</guid>
      <description>&lt;p&gt;Very few people set “becoming an ASF Member” as a clear goal.&lt;/p&gt;

&lt;p&gt;Not because it lacks appeal, but because there is no application process and no defined path. It is more of an outcome, something that happens after sustained contributions are naturally recognized within a community.&lt;/p&gt;

&lt;p&gt;Fan Jia followed exactly that kind of path.&lt;/p&gt;

&lt;p&gt;Recently, he was invited to join the Apache Software Foundation as a Member. Taking this opportunity, we had an in-depth conversation with him. More than a recognition of achievement, the discussion felt like a reflection on his journey—from data integration, to open source participation, to system design and community understanding—tracing how an engineer gradually arrives at this point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqij6yoerzb0vvm4ozss.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqij6yoerzb0vvm4ozss.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting from Data Integration
&lt;/h2&gt;

&lt;p&gt;Fan Jia’s current work focuses on data integration, particularly in areas such as data synchronization, Change Data Capture, and data infrastructure. As he describes it, his day-to-day work can be distilled into one core objective: enabling data to flow reliably across different systems.&lt;/p&gt;

&lt;p&gt;In practice, this is far more complex than it sounds. It involves synchronizing data between heterogeneous systems, handling schema evolution, and ensuring stability in complex production environments. Alongside this, he has been actively contributing to the Apache SeaTunnel community over the long term.&lt;/p&gt;

&lt;p&gt;What stands out is that his starting point was not open source itself, but a set of concrete and persistent engineering problems. Those problems became the foundation for his later involvement in open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  How He Got Into Open Source
&lt;/h2&gt;

&lt;p&gt;When asked how he first got involved in open source, his answer was straightforward—it started with his job. After joining WhaleOps, he became involved in the development, maintenance, and partial architectural design of Apache SeaTunnel.&lt;/p&gt;

&lt;p&gt;In the early stage, his contributions were similar to those of most engineers, focusing on solving specific issues such as fixing bugs and improving features. Over time, however, his attention shifted toward system design and how the project could run reliably across broader and more diverse scenarios.&lt;/p&gt;

&lt;p&gt;This transition did not happen overnight. It emerged gradually through continuous involvement. As his focus moved from isolated problems to the system as a whole, his role evolved along with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  From User to Maintainer
&lt;/h2&gt;

&lt;p&gt;He describes this phase as a shift in perspective and responsibility.&lt;/p&gt;

&lt;p&gt;As a user, the focus is on whether a feature exists and whether it meets immediate needs. As a maintainer, the concerns expand to system stability, backward compatibility, adaptability across different use cases, and the real experience of community users.&lt;/p&gt;

&lt;p&gt;At the same time, the sense of responsibility becomes more concrete. Writing code is no longer just about completing a task. It becomes part of maintaining a system that runs in real production environments, making every technical decision more deliberate.&lt;/p&gt;

&lt;p&gt;Once this shift in perspective happens, the truly complex problems begin to surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Memorable Technical Challenge
&lt;/h2&gt;

&lt;p&gt;During his time contributing to SeaTunnel, one of the most memorable challenges was building the Zeta engine from scratch.&lt;/p&gt;

&lt;p&gt;This was not about solving a single isolated issue, but about tackling a combination of complex system-level problems. At the execution model level, the engine needed to support both batch and stream processing, balancing throughput and latency while avoiding bottlenecks under high concurrency.&lt;/p&gt;

&lt;p&gt;From a concurrency perspective, multi-threaded execution introduced challenges such as race conditions, deadlocks, and unpredictable execution order. These issues are often difficult to reproduce and tend to surface only after prolonged runtime.&lt;/p&gt;

&lt;p&gt;In terms of resource management, real production workloads involve long-running tasks and large data volumes. Memory control, thread pool isolation, and backpressure handling become critical. Out-of-memory errors are especially dangerous, as they can impact not only individual tasks but the stability of the entire service process.&lt;/p&gt;

&lt;p&gt;For stability and recoverability, the system must guarantee no data loss, avoid uncontrolled duplication, and correctly restore state after failures or restarts. This typically requires integrating checkpointing and state management mechanisms.&lt;/p&gt;

&lt;p&gt;Overall, this was not a single technical problem, but a full-scale systems engineering challenge.&lt;/p&gt;

&lt;p&gt;These experiences also shaped how he understands collaboration in open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Important Skill in Open Source
&lt;/h2&gt;

&lt;p&gt;When asked what matters most in an open source community, his answer was patience.&lt;/p&gt;

&lt;p&gt;A pull request in open source rarely gets merged immediately. It usually goes through multiple stages, including initial implementation, community review, several rounds of revision, CI validation, and documentation updates. Along the way, various issues can arise. Without patience, it is easy to give up midway.&lt;/p&gt;

&lt;p&gt;However, consistently pushing through these details is exactly what defines high-quality contributions.&lt;/p&gt;

&lt;p&gt;This understanding of the process is also reflected in his advice to newcomers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advice for New Contributors
&lt;/h2&gt;

&lt;p&gt;For developers just getting started in open source, he believes the most important things are curiosity and the willingness to act.&lt;/p&gt;

&lt;p&gt;Often, the biggest barrier is not technical difficulty, but simply not getting started. Once you take the first step—submitting a small PR or joining a discussion—everything else tends to follow naturally.&lt;/p&gt;

&lt;p&gt;He also emphasizes the importance of expressing your own ideas and even questioning existing designs. Open source communities are inherently open environments, and everyone starts as a beginner.&lt;/p&gt;

&lt;p&gt;As participation deepens, feedback from the community becomes more visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment He Became an ASF Member
&lt;/h2&gt;

&lt;p&gt;When he learned that he had become an ASF Member, his first reaction was excitement and happiness.&lt;/p&gt;

&lt;p&gt;Unlike many achievements, this is not something you apply for. It is a recognition from the community based on long-term contributions, which makes it especially meaningful.&lt;/p&gt;

&lt;p&gt;At the same time, he sees it not just as an honor, but as an increase in responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Role Means
&lt;/h2&gt;

&lt;p&gt;In his view, being an ASF Member is fundamentally about responsibility.&lt;/p&gt;

&lt;p&gt;It is not only about continuing technical contributions, but also about fostering a healthy community, helping new contributors grow, and participating in higher-level governance. It also means being accountable to users, ensuring that projects run reliably in real-world environments.&lt;/p&gt;

&lt;p&gt;As his role evolves, so does his understanding of the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding The Apache Way
&lt;/h2&gt;

&lt;p&gt;He summarizes his understanding of The Apache Way in one phrase: Community Over Code.&lt;/p&gt;

&lt;p&gt;The long-term success of an open source project depends not only on its technology but also on whether it maintains open and transparent decision-making, encourages contributors from diverse backgrounds, and builds governance based on consensus.&lt;/p&gt;

&lt;p&gt;These factors ultimately determine the vitality of a project.&lt;/p&gt;

&lt;p&gt;With this perspective, he approaches projects from a broader viewpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  How He Sees SeaTunnel
&lt;/h2&gt;

&lt;p&gt;In his view, SeaTunnel’s strengths lie in several areas.&lt;/p&gt;

&lt;p&gt;From an architectural standpoint, it supports a multi-engine model, allowing users to choose the most suitable execution engine for different scenarios. From an ecosystem perspective, it provides a rich set of connectors, enabling integration with various databases, data lakes, and messaging systems.&lt;/p&gt;

&lt;p&gt;In terms of capabilities, CDC is a key strength, supporting both data change capture and schema evolution, making the system more adaptable to complex production environments.&lt;/p&gt;

&lt;p&gt;At the same time, despite these capabilities, SeaTunnel maintains a relatively lightweight design, allowing users to adopt and use it at a lower cost.&lt;/p&gt;

&lt;p&gt;These insights come from long-term hands-on experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Open Source Changed Him
&lt;/h2&gt;

&lt;p&gt;Open source has had a significant impact on his career, especially in how he approaches problems.&lt;/p&gt;

&lt;p&gt;Within a company, systems are usually designed around specific business needs. In open source, however, solutions must consider much broader and more general use cases, which pushes engineers to make longer-term architectural decisions.&lt;/p&gt;

&lt;p&gt;Collaborating with developers from different companies and backgrounds also expands one’s technical perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Sentence About Open Source
&lt;/h2&gt;

&lt;p&gt;When asked to summarize open source in one sentence, he said:&lt;/p&gt;

&lt;p&gt;“Open source is not just about sharing code; it is a process where developers and communities grow together.”&lt;/p&gt;

&lt;p&gt;It may sound simple, but when viewed in the context of his journey, it is less a conclusion and more a natural outcome.&lt;/p&gt;

&lt;p&gt;From solving concrete data problems, to participating in system design, to thinking about how projects run reliably across different scenarios, and eventually to engaging in community collaboration and consensus building, there is no clear boundary between these stages.&lt;/p&gt;

&lt;p&gt;It is a continuous process where perspective gradually expands through doing the work.&lt;/p&gt;

&lt;p&gt;Becoming an ASF Member is not the end of this journey, but a milestone along the way. It reflects recognition of past contributions and signals greater responsibility ahead.&lt;/p&gt;

&lt;p&gt;If there is one deeper takeaway from this experience, it may not be a specific technology or a single project, but a more enduring capability:&lt;/p&gt;

&lt;p&gt;The ability to keep investing in uncertainty and to continue doing the right thing even when there is no immediate reward.&lt;/p&gt;




&lt;p&gt;About Apache SeaTunnel&lt;br&gt;
Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.&lt;/p&gt;

&lt;p&gt;Welcome to fill out this form to be a speaker of Apache SeaTunnel: &lt;a href="https://forms.gle/vtpQS6ZuxqXMt6DT6" rel="noopener noreferrer"&gt;https://forms.gle/vtpQS6ZuxqXMt6DT6&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;Why do we need Apache SeaTunnel?&lt;br&gt;
Apache SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.&lt;br&gt;
Data loss and duplication&lt;br&gt;
Task buildup and latency&lt;br&gt;
Low throughput&lt;br&gt;
Long application-to-production cycle time&lt;br&gt;
Lack of application status monitoring&lt;/p&gt;

&lt;p&gt;Apache SeaTunnel Usage Scenarios&lt;br&gt;
Massive data synchronization&lt;br&gt;
Massive data integration&lt;br&gt;
ETL of large volumes of data&lt;br&gt;
Massive data aggregation&lt;br&gt;
Multi-source data processing&lt;/p&gt;

&lt;p&gt;Features of Apache SeaTunnel&lt;br&gt;
Rich components&lt;br&gt;
High scalability&lt;br&gt;
Easy to use&lt;br&gt;
Mature and stable&lt;/p&gt;

&lt;p&gt;How to get started with Apache SeaTunnel quickly?&lt;br&gt;
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/docs/2.1.0/developement/setup" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.1.0/developement/setup&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can I contribute?&lt;br&gt;
We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!&lt;/p&gt;

&lt;p&gt;Submit an issue:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/issues" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/issues&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contribute code to:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pulls" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pulls&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to the community development mailing list :&lt;br&gt;
&lt;a href="mailto:dev-subscribe@seatunnel.apache.org"&gt;dev-subscribe@seatunnel.apache.org&lt;/a&gt;&lt;br&gt;
Development Mailing List :&lt;br&gt;
&lt;a href="mailto:dev@seatunnel.apache.org"&gt;dev@seatunnel.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join Slack:&lt;br&gt;
&lt;a href="https://join.slack.com/t/apacheseatunnel/shared_invite/zt-3uouszk3m-PtLLNyZsJVqE5Gb6gn24mA" rel="noopener noreferrer"&gt;https://join.slack.com/t/apacheseatunnel/shared_invite/zt-3uouszk3m-PtLLNyZsJVqE5Gb6gn24mA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow us on Twitter:&lt;br&gt;
&lt;a href="https://twitter.com/ASFSeaTunnel" rel="noopener noreferrer"&gt;https://twitter.com/ASFSeaTunnel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join us now!❤️❤️&lt;/p&gt;

</description>
      <category>asf</category>
      <category>ai</category>
      <category>opensource</category>
      <category>apacheseatunnel</category>
    </item>
    <item>
      <title>What Happened in Apache SeaTunnel? This March You Shouldn’t Miss</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 07:06:02 +0000</pubDate>
      <link>https://forem.com/seatunnel/what-happened-in-apache-seatunnel-this-march-you-shouldnt-miss-2l12</link>
      <guid>https://forem.com/seatunnel/what-happened-in-apache-seatunnel-this-march-you-shouldnt-miss-2l12</guid>
      <description>&lt;p&gt;Hey there! The March 2026 report is here. The Apache SeaTunnel community has been incredibly active. A total of 26 contributors participated, version 2.3.13 was released, five new connectors were added, and major improvements were made across the core engine, file connectors, CDC, and Transform modules. More than 20 bugs were also fixed.&lt;/p&gt;

&lt;p&gt;On top of that, infrastructure upgrades were rolled out. Whether you’re an enterprise or individual user, it’s a great time to upgrade, explore new features, and stay in sync with the community momentum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodj024zrqk6ky1zx1isr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodj024zrqk6ky1zx1isr.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reporting period: March 1, 2026 to March 30, 2026&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Release Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.3.13&lt;/td&gt;
&lt;td&gt;March 14, 2026&lt;/td&gt;
&lt;td&gt;Released this month with 50+ new features and 20+ bug fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Download&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/download" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/download&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Key Updates in Version 2.3.13
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 New Connectors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Connector&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HugeGraph Sink&lt;/td&gt;
&lt;td&gt;Adds support for Apache HugeGraph&lt;/td&gt;
&lt;td&gt;#10002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Introduces DuckDB as both Source and Sink&lt;/td&gt;
&lt;td&gt;#10285&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lance&lt;/td&gt;
&lt;td&gt;Adds support for writing to Lance datasets&lt;/td&gt;
&lt;td&gt;#9894&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS DSQL&lt;/td&gt;
&lt;td&gt;Adds AWS DSQL Sink connector&lt;/td&gt;
&lt;td&gt;#9739&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoTDB&lt;/td&gt;
&lt;td&gt;Adds Source and Sink support for IoTDB 2.x&lt;/td&gt;
&lt;td&gt;#9872&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.2 Core Engine Enhancements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Supports arbitrarily nested arrays and map types&lt;/td&gt;
&lt;td&gt;#9881&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Adds min-pause checkpoint configuration&lt;/td&gt;
&lt;td&gt;#9804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Introduces REST API to inspect pending queue details&lt;/td&gt;
&lt;td&gt;#10078&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;Adds support for Flink 1.20.1&lt;/td&gt;
&lt;td&gt;#9576&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;Enables schema evolution for CDC sources&lt;/td&gt;
&lt;td&gt;#9867&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Adds sink committed metrics and commit rate calculation&lt;/td&gt;
&lt;td&gt;#10233&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.3 File Connector Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Connector&lt;/th&gt;
&lt;th&gt;Enhancement&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HdfsFile&lt;/td&gt;
&lt;td&gt;Enables parallel reading for large files&lt;/td&gt;
&lt;td&gt;#10332&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LocalFile&lt;/td&gt;
&lt;td&gt;Supports chunked parallel reading for CSV, TEXT, JSON files&lt;/td&gt;
&lt;td&gt;#10142&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet&lt;/td&gt;
&lt;td&gt;Adds logical partitioning support&lt;/td&gt;
&lt;td&gt;#10239&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HdfsFile and LocalFile&lt;/td&gt;
&lt;td&gt;Adds sync_mode=update support&lt;/td&gt;
&lt;td&gt;#10437, #10268&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase&lt;/td&gt;
&lt;td&gt;Supports time-range scanning&lt;/td&gt;
&lt;td&gt;#10318&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hive&lt;/td&gt;
&lt;td&gt;Supports automatic failover across multiple Metastore URIs&lt;/td&gt;
&lt;td&gt;#10253&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.4 CDC Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Maxwell Canal Debezium&lt;/td&gt;
&lt;td&gt;Optimizes JSON format and supports merging update_before and update_after&lt;/td&gt;
&lt;td&gt;#9805&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Adds Protobuf deserialization support via Schema Registry wire format&lt;/td&gt;
&lt;td&gt;#10183&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Injects record timestamp as EventTime metadata&lt;/td&gt;
&lt;td&gt;#9994&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL CDC&lt;/td&gt;
&lt;td&gt;Optimizes wait time for schema evolution&lt;/td&gt;
&lt;td&gt;#10040&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.5 Transform Enhancements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transformation&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal Embeddings&lt;/td&gt;
&lt;td&gt;Adds support for multimodal embeddings&lt;/td&gt;
&lt;td&gt;#9673&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RegexExtract&lt;/td&gt;
&lt;td&gt;Introduces regex-based extraction transform&lt;/td&gt;
&lt;td&gt;#9829&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL to Paimon&lt;/td&gt;
&lt;td&gt;Adds support for MERGE INTO syntax&lt;/td&gt;
&lt;td&gt;#10206&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  3. Bug Fixes in Version 2.3.13
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV Reader&lt;/td&gt;
&lt;td&gt;Fixes parsing failure caused by empty first column&lt;/td&gt;
&lt;td&gt;#10383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;Improves batch parallel reads by replacing limit offset with last batch sort value&lt;/td&gt;
&lt;td&gt;#9801&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Adds support for TIMESTAMP_TZ type&lt;/td&gt;
&lt;td&gt;#10048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Fixes cluster mode bug and adds end-to-end tests&lt;/td&gt;
&lt;td&gt;#9869&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;Improves writer close logic&lt;/td&gt;
&lt;td&gt;#10051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Optimizes resource cleanup for Scroll API&lt;/td&gt;
&lt;td&gt;#10124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL CDC&lt;/td&gt;
&lt;td&gt;Optimizes schema evolution wait time&lt;/td&gt;
&lt;td&gt;#10040&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Community Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Contributors in March 2026
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Contributor&lt;/th&gt;
&lt;th&gt;PR Count&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🏅&lt;/td&gt;
&lt;td&gt;@zhangshenghang&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@yzeng1618&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@davidzollo&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@chl-wxp&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@liunaijie&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@dybyte&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@ricky2129&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@corgy-w&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@zooo-code&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@kuleat&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@LeonYoah&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@OmkarK-7&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@icekimchi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@assokhi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@Sephiroth1024&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@Best2Two&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@ic4y&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@misi1987107&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@CosmosNi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@chocoboxxf&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@xiaochen-zhou&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@qingzheguo-flash&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/rameshreddy-adutla"&gt;@rameshreddy-adutla&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@CNF96&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@MuraliMon&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@ocean-zhc&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A total of 51 PRs were merged in March. Huge thanks to all 26 contributors.&lt;/p&gt;

&lt;p&gt;Full contributor list:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/graphs/contributors" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/graphs/contributors&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure Updates
&lt;/h3&gt;

&lt;p&gt;End-to-end test Docker images migrated to the seatunnelhub repository&lt;br&gt;
JDK Docker images upgraded&lt;br&gt;
CI timeouts optimized, with Kafka set to 140 minutes and Kudu to 60 minutes&lt;br&gt;
Added Metalake support for managing data source metadata&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Recommendations for Enterprises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Upgrade Guidance
&lt;/h3&gt;

&lt;p&gt;We strongly recommend upgrading production environments to version 2.3.13&lt;br&gt;
This release includes more than 50 new features and over 20 bug fixes&lt;/p&gt;

&lt;h3&gt;
  
  
  Features to Watch
&lt;/h3&gt;

&lt;p&gt;New connectors including HugeGraph, DuckDB, IoTDB, AWS DSQL, and Lance&lt;br&gt;
Improved large file processing with parallel chunked reads in HdfsFile and LocalFile&lt;br&gt;
Enhanced CDC capabilities including schema evolution and multi-format Kafka support&lt;br&gt;
Improved observability with new sink committed metrics&lt;br&gt;
Support for Flink 1.20.1&lt;/p&gt;

&lt;h3&gt;
  
  
  Notes
&lt;/h3&gt;

&lt;p&gt;Some connector APIs have changed, so reviewing the upgrade documentation is recommended&lt;br&gt;
Using the seatunnelhub image repository is strongly encouraged&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Key Metrics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;March Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Releases&lt;/td&gt;
&lt;td&gt;1 release (2.3.13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Connectors&lt;/td&gt;
&lt;td&gt;5+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Enhancements&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug Fixes&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contributors&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  7. What’s Coming Next
&lt;/h2&gt;

&lt;p&gt;Further optimization of CDC performance&lt;br&gt;
More cloud-native data source integrations&lt;br&gt;
Improved metrics and monitoring capabilities&lt;/p&gt;

&lt;p&gt;Compiled and edited by the SeaTunnel Community&lt;/p&gt;

</description>
      <category>seatunnel</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>(5) When Your Data Warehouse Breaks Down, It’s Probably a Naming Problem</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 06:59:33 +0000</pubDate>
      <link>https://forem.com/seatunnel/5when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-3p1c</link>
      <guid>https://forem.com/seatunnel/5when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-3p1c</guid>
      <description>&lt;p&gt;As a data warehouse grows, the first thing that tends to get out of control is not the data itself—but naming. Naming conventions may seem like a minor detail, but they directly determine whether data is easy to find, understand, and maintain. As the fifth article in the Data Lakehouse Design and Practice series, this article starts from real-world usage and summarizes core methods for table and field naming. By combining layered prefixes, unified terminology (word roots), and cycle encoding, table names become self-explanatory. Together with metric naming and governance processes, this helps build a clear and collaborative data system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals and Methods of Naming Conventions: Make Table Names Self-Explanatory So Teams Can Work Without Friction
&lt;/h2&gt;

&lt;p&gt;In a data warehouse system, naming conventions are not just about form—they are foundational infrastructure that directly impacts collaboration efficiency and data quality. A good naming system has one core goal: make the table name itself carry enough information so that people can understand what the table is, where it comes from, and how to use it—without needing extra documentation. Ideally, a table name should be “readable at a glance” and include key information such as data layer, owning team, business domain, subject domain, core object meaning, and update cycle or data scope. When these elements are systematically encoded into table names, data discovery, metric interpretation, troubleshooting, and team handovers all become significantly more efficient, reducing communication costs.&lt;/p&gt;

&lt;p&gt;A naming system is essentially a “word root system” that standardizes business language. For example, the same business object must use the same term consistently across tables (e.g., avoid mixing “rack” and “shelf”). Similarly, metric naming should follow unified rules—for instance, all ratio-type metrics should use the &lt;code&gt;_rate&lt;/code&gt; suffix, avoiding ambiguity from mixing terms like ratio, percent, or rt.&lt;/p&gt;

&lt;p&gt;Layer prefixes must be strictly standardized. They allow users to immediately identify the data layer and purpose of a table: &lt;code&gt;ods_&lt;/code&gt; for source-aligned data, &lt;code&gt;dwd_&lt;/code&gt; for detailed standardized data, &lt;code&gt;dws_&lt;/code&gt; for aggregated data, &lt;code&gt;ads_&lt;/code&gt; for application-facing outputs, and &lt;code&gt;dim_&lt;/code&gt; for shared dimensions. These prefixes are not just naming conventions—they directly reflect the data architecture.&lt;/p&gt;

&lt;p&gt;Another often overlooked but critical aspect is encoding update cycles or data scope into table names. For example, &lt;code&gt;_1d&lt;/code&gt; represents the last day, &lt;code&gt;_td&lt;/code&gt; means up to today, and &lt;code&gt;_7d&lt;/code&gt; means the last seven days. This prevents confusion between tables with the same name but different time semantics, reducing the risk of metric misuse.&lt;/p&gt;

&lt;p&gt;At the asset management level, table types must be clearly distinguished. Production tables are long-term assets, intermediate tables serve only processing workflows and should have retention policies, and temporary tables are for one-time validation and must not enter production pipelines. Prefixes like &lt;code&gt;mid_&lt;/code&gt; and &lt;code&gt;tmp_&lt;/code&gt; help prevent data asset pollution at the source.&lt;/p&gt;

&lt;p&gt;Finally, naming conventions must be integrated with governance processes. Any new table or field must include complete metadata such as owner, field definitions, metric definitions, update frequency, dependencies, and lifecycle. Tables without such metadata may be usable in the short term but will almost certainly become technical debt in the long run. In practice, it is best to standardize templates first—ensuring key fields like layer, domain, and cycle are strictly consistent—while allowing limited flexibility in non-critical parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Naming Conventions: Templates, Cycle Encoding, and Examples
&lt;/h2&gt;

&lt;p&gt;In practice, table naming should follow a structured template to ensure completeness and consistency. A general template can be defined as &lt;code&gt;{layer}_{dept}_{biz_domain}_{subject}_{object}_{cycle_or_range}&lt;/code&gt;, where each component has a clear role: layer indicates data level, dept indicates ownership, biz_domain defines the business domain, subject represents analytical abstraction, object defines the entity or behavior, and cycle_or_range specifies the time scope.&lt;/p&gt;
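&lt;p&gt;As a minimal sketch, the template above can be turned into a small name builder. The component values and the layer-prefix whitelist below are illustrative examples drawn from this article, not an official specification, and the mapping of the sample name onto the template components is one plausible reading.&lt;/p&gt;

```python
from dataclasses import dataclass

# Whitelist of layer prefixes mentioned in this article (illustrative, not exhaustive).
VALID_LAYERS = {"ods", "dwd", "dws", "ads", "dim", "mid", "tmp"}

@dataclass
class TableName:
    layer: str           # data layer prefix, e.g. dws
    dept: str            # owning team
    biz_domain: str      # business domain
    subject: str         # analytical subject
    obj: str             # entity or behavior ({object} in the template)
    cycle_or_range: str  # time scope, e.g. 1d, td, 7d

    def render(self):
        # Reject names that do not start with a known layer prefix.
        if self.layer not in VALID_LAYERS:
            raise ValueError(f"unknown layer prefix: {self.layer}")
        return "_".join([self.layer, self.dept, self.biz_domain,
                         self.subject, self.obj, self.cycle_or_range])

print(TableName("dws", "asale", "trd", "byr", "subpay", "1d").render())
# dws_asale_trd_byr_subpay_1d
```

&lt;p&gt;Encoding the template as a structure, rather than assembling strings by hand, lets a review tool reject non-conforming names before a table ever reaches production.&lt;/p&gt;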

&lt;p&gt;Cycle and range encoding is especially important. Common patterns include &lt;code&gt;_1d&lt;/code&gt; (last day), &lt;code&gt;_td&lt;/code&gt; (to date), &lt;code&gt;_7d&lt;/code&gt; or &lt;code&gt;_30d&lt;/code&gt; (last N days). Additional markers can distinguish data types or update modes, such as &lt;code&gt;d&lt;/code&gt; for daily snapshots, &lt;code&gt;w&lt;/code&gt; for weekly data, &lt;code&gt;i&lt;/code&gt; for incremental tables, &lt;code&gt;f&lt;/code&gt; for full tables, and &lt;code&gt;l&lt;/code&gt; for slowly changing tables. These conventions allow users to quickly understand temporal semantics.&lt;/p&gt;
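&lt;p&gt;These suffix markers can be decoded mechanically. The lookup table below reproduces only the conventions listed in this article; any warehouse would extend it with its own vocabulary.&lt;/p&gt;

```python
# Suffix -> meaning, as described above (illustrative mapping only).
CYCLE = {
    "1d": "last day", "td": "to date", "7d": "last 7 days", "30d": "last 30 days",
    "d": "daily snapshot", "w": "weekly", "i": "incremental",
    "f": "full", "l": "slowly changing",
}

def describe_suffix(table_name):
    # The cycle/range marker is the last underscore-separated token.
    suffix = table_name.rsplit("_", 1)[-1]
    return CYCLE.get(suffix, "unknown")

print(describe_suffix("dws_trade_user_pay_1d"))  # last day
print(describe_suffix("dwd_trade_order_i"))      # incremental
```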

&lt;p&gt;For example, in the aggregation layer, &lt;code&gt;dws_asale_trd_byr_subpay_1d&lt;/code&gt; represents buyer-level, staged payment transactions aggregated over the last day, while &lt;code&gt;dws_asale_trd_itm_slr_hh&lt;/code&gt; represents hourly aggregation at the seller-item level. Although long, such names are highly informative and readable.&lt;/p&gt;

&lt;p&gt;Dimension tables follow a separate convention, using the &lt;code&gt;dim_&lt;/code&gt; prefix and a &lt;code&gt;{scope}_{object}&lt;/code&gt; structure, such as &lt;code&gt;dim_pub_area&lt;/code&gt; (public area dimension) or &lt;code&gt;dim_asale_item&lt;/code&gt; (item dimension), emphasizing cross-domain reuse.&lt;/p&gt;

&lt;p&gt;Intermediate tables should be tightly bound to their target tables, typically named as &lt;code&gt;mid_{target_table}_{suffix}&lt;/code&gt;, such as &lt;code&gt;mid_dws_xxx_01&lt;/code&gt;. Temporary tables must use the &lt;code&gt;tmp_&lt;/code&gt; prefix and are strictly limited to development or validation, never entering production dependencies. For manually maintained data, tables in the DWD layer can explicitly include &lt;code&gt;manual&lt;/code&gt;, such as &lt;code&gt;dwd_trade_manual_client_info_l&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Field and Metric Naming Conventions: Rules, Structure, and Examples
&lt;/h2&gt;

&lt;p&gt;At the field level, naming must be strictly standardized. All field names should use lowercase with underscores—camelCase is not allowed. Readability should take priority over brevity, and consistent naming must be maintained for the same semantic meaning.&lt;/p&gt;

&lt;p&gt;Partition fields should be unified globally—for example, &lt;code&gt;dt&lt;/code&gt; for date, &lt;code&gt;hh&lt;/code&gt; for hour, and &lt;code&gt;mi&lt;/code&gt; for minute—with fixed formats. This improves development efficiency and avoids confusion across tables.&lt;/p&gt;

&lt;p&gt;Field suffixes should clearly indicate meaning: &lt;code&gt;_cnt&lt;/code&gt; for counts, &lt;code&gt;_amt&lt;/code&gt; or &lt;code&gt;_price&lt;/code&gt; for monetary values (choose one consistently), and boolean fields should use the &lt;code&gt;is_&lt;/code&gt; prefix and never be nullable. These conventions allow users to infer data types and meanings at a glance.&lt;/p&gt;
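&lt;p&gt;A hedged sketch of how these field rules could be enforced as a lint pass. The checks below cover only the conventions named in this article (snake_case, the &lt;code&gt;is_&lt;/code&gt; prefix for booleans); a real linter would carry a larger rule set.&lt;/p&gt;

```python
import re

# Lowercase snake_case: starts with a letter, then letters, digits, underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def lint_field(name, is_boolean=False):
    """Return a list of convention violations for one column name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("use lowercase_with_underscores, not camelCase")
    if is_boolean and not name.startswith("is_"):
        problems.append("boolean fields must use the is_ prefix")
    return problems

print(lint_field("payAmount"))                    # flags camelCase
print(lint_field("deleted", is_boolean=True))     # flags missing is_ prefix
print(lint_field("pay_amt"))                      # [] -- conforms
```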

&lt;p&gt;NULL handling must also follow consistent rules. Typically, dimension fields use &lt;code&gt;-1&lt;/code&gt; for unknown values, while metric fields use &lt;code&gt;0&lt;/code&gt; to indicate no occurrence. This prevents NULL propagation in aggregations and improves data stability.&lt;/p&gt;
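&lt;p&gt;The NULL-handling rule is simple enough to state as code: dimension fields default to &lt;code&gt;-1&lt;/code&gt; (unknown), metric fields to &lt;code&gt;0&lt;/code&gt; (no occurrence). The column role assignments in the example are hypothetical.&lt;/p&gt;

```python
DIM_DEFAULT = -1   # unknown dimension value
METRIC_DEFAULT = 0  # metric did not occur

def fill_defaults(row, dim_cols, metric_cols):
    """Replace NULLs per the convention, without mutating the input row."""
    out = dict(row)
    for c in dim_cols:
        if out.get(c) is None:
            out[c] = DIM_DEFAULT
    for c in metric_cols:
        if out.get(c) is None:
            out[c] = METRIC_DEFAULT
    return out

row = {"area_id": None, "pay_cnt": None, "pay_amt": 35}
print(fill_defaults(row, ["area_id"], ["pay_cnt", "pay_amt"]))
# {'area_id': -1, 'pay_cnt': 0, 'pay_amt': 35}
```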

&lt;p&gt;Metric naming should be structured as a combination of business qualifier, time qualifier, aggregation method, and base metric. For example, &lt;code&gt;trade_amt&lt;/code&gt; represents transaction amount, &lt;code&gt;install_poi_cnt&lt;/code&gt; represents installation point count, and &lt;code&gt;pay_succ_rate&lt;/code&gt; represents payment success rate. Aggregation methods should use fixed terms like &lt;code&gt;sum&lt;/code&gt;, &lt;code&gt;avg&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, and &lt;code&gt;min&lt;/code&gt;, avoiding inconsistent alternatives like “total.”&lt;/p&gt;
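&lt;p&gt;The metric grammar above can also be checked programmatically. The suffix and banned-term vocabularies below are taken from this article's examples and are illustrative, not exhaustive.&lt;/p&gt;

```python
# Terms the article says to avoid in favor of the unified vocabulary.
BANNED = {"total", "ratio", "percent", "rt"}
# Standard unit suffixes from the field conventions: _cnt, _amt, plus _rate for ratios.
SUFFIXES = {"cnt", "amt", "rate"}

def metric_name(parts):
    """Join qualifier/aggregation/base tokens into a metric name, enforcing the rules."""
    for p in parts:
        if p in BANNED:
            raise ValueError(f"'{p}' is not allowed; use sum/avg/max/min and _rate/_cnt/_amt")
    if parts[-1] not in SUFFIXES:
        raise ValueError(f"metric must end with one of {sorted(SUFFIXES)}")
    return "_".join(parts)

print(metric_name(["trade", "amt"]))         # trade_amt
print(metric_name(["pay", "succ", "rate"]))  # pay_succ_rate
```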

&lt;p&gt;A full example from fields to metrics: in the detail layer, an incremental order table might be named &lt;code&gt;dwd_trade_order_i&lt;/code&gt;, containing fields such as order ID, user ID, payment amount, order status, and partition keys. In the aggregation layer, &lt;code&gt;dws_trade_user_pay_1d&lt;/code&gt; summarizes user-level payments over the last day, including metrics like payment success count, total payment amount, and success rate. Finally, in the application layer, a table like &lt;code&gt;ads_fin_kpi_board_d&lt;/code&gt; provides business-facing dashboards with KPIs such as GMV, refund amount, net revenue, and number of paying users.&lt;/p&gt;

&lt;p&gt;By standardizing naming across tables, fields, and metrics, a data warehouse can achieve clear semantics, consistent structure, and efficient collaboration. While such conventions may introduce some overhead initially, they are essential for scalability and team coordination in the long term.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Earlier Posts in This Series:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/codex/4-why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4fddecde4288?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(4)Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;(3) Key Design Principles for ODS/Detail Layer Implementation: Building the Data Ingestion Layer as a “Stable and Operable” Infrastructure&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/i-a-complete-guide-to-building-and-standardizing-a-modern-lakehouse-architecture-an-overview-of-9a2a263f2f1b?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(I) A Complete Guide to Building and Standardizing a Modern Lakehouse Architecture: An Overview of Data Warehouses and Data Lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;strong&gt;Next Post:&lt;/strong&gt; (6) DataOps Development Standards and Best Practices&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>bigdata</category>
      <category>datawarehouse</category>
    </item>
    <item>
      <title>Growing with the Community: Zhang Shenghang’s Path to Apache SeaTunnel PMC Member</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:55:16 +0000</pubDate>
      <link>https://forem.com/seatunnel/growing-with-the-community-zhang-shenghangs-path-to-apache-seatunnel-pmc-member-3co1</link>
      <guid>https://forem.com/seatunnel/growing-with-the-community-zhang-shenghangs-path-to-apache-seatunnel-pmc-member-3co1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhipmcy6jrz7ao2ul4w5h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhipmcy6jrz7ao2ul4w5h.jpg" width="800" height="377"&gt;&lt;/a&gt;&lt;br&gt;
🎉 Hi Community—more exciting news! Zhang Shenghang has been invited to join the Apache SeaTunnel PMC in recognition of his outstanding contributions—well deserved!&lt;/p&gt;

&lt;p&gt;Over the years, Zhang has been highly active in the Apache SeaTunnel community. From improving code quality, refining documentation, to engaging with the community and mentoring newcomers, his presence has been everywhere. He consistently embraces the Apache Way, contributing with dedication and passion to the growth of the project.&lt;/p&gt;

&lt;p&gt;We took this opportunity to conduct an in-depth interview with him. Covering his background, open source journey, PMC role, and thoughts on community development and culture, this conversation offers a closer look at his story and his enthusiasm for open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Background &amp;amp; Open Source Journey
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Could you briefly introduce yourself and how you entered the big data and open source space?&lt;br&gt;
Name: Zhang Shenghang&lt;br&gt;
GitHub: zhangshenghang&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvnu7d1ec2vu0l315yhw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvnu7d1ec2vu0l315yhw.jpg" width="415" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;When did you start contributing to Apache SeaTunnel, and what was the motivation?&lt;br&gt;
I started contributing to Apache SeaTunnel in June 2024. Initially, I was using DataX, a classic standalone data integration tool. However, it lacks service-oriented and distributed capabilities, which creates limitations in large-scale data synchronization scenarios. That’s when I came across Apache SeaTunnel as a more comprehensive solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What key contributions or features have you worked on in SeaTunnel?&lt;br&gt;
He has contributed to multiple core features and improvements, including adding a pending queue feature for SeaTunnel Engine task scheduling, enabling Kafka Protobuf format support, introducing Kerberos testing in e2e workflows, implementing a new resource scheduling algorithm in SeaTunnel Engine, adding TTL support for HBase Sink, introducing API-based log retrieval, fixing Flink source 100% busy issues, supporting the Typesense connector, enabling default value substitution for configuration variables, fixing Doris custom SQL execution issues, correcting Kafka consumer offset auto-commit logic, and resolving RabbitMQ checkpoint issues in Flink mode.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Source Contributions &amp;amp; Growth
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Which contribution or experience impressed you the most?&lt;br&gt;
What impressed me most was not just submitting a PR, but the full process—from discovering a problem, analyzing it, discussing solutions with the community, to finally implementing and validating the fix. Issues involving engine scheduling, resource allocation, and Flink stability often look simple on the surface but are deeply tied to framework mechanisms and runtime behavior. Solving them requires both deep code understanding and close collaboration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the most important skill in open source collaboration?&lt;br&gt;
All are important, but if I had to choose one, it would be the ability to collaborate continuously. Technical skills are foundational, but communication is equally critical—open source is not just about writing code, but explaining context, design decisions, and trade-offs clearly so others can understand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What advice would you give to beginners in open source?&lt;br&gt;
Don’t overestimate the difficulty. You don’t need to start with massive features or deep architectural changes. Fixing a bug, improving documentation, adding tests, or optimizing small features are all valuable contributions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Becoming a PMC Member
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Congratulations on becoming a PMC Member! What was your first reaction?&lt;br&gt;
Thank you. My first reaction was both excitement and a strong sense of responsibility. It’s recognition of past contributions, but also a reminder that a PMC Member is not just a contributor, but a community builder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What does becoming a PMC Member mean to you and the community?&lt;br&gt;
To me, it represents recognition of long-term contributions, collaboration ability, and responsibility. Personally, it means thinking beyond individual modules and considering the project’s overall development, governance, and ecosystem. For the community, more PMC Members mean more people willing to take responsibility and drive sustainable growth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How important is the Apache Way to open source success?&lt;br&gt;
It emphasizes “Community Over Code.” A project succeeds not just because of good code, but because of an open, transparent, and sustainable collaboration culture.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  SeaTunnel Community Development
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;What key milestones has SeaTunnel gone through?&lt;br&gt;
SeaTunnel has evolved from a data synchronization tool into a more comprehensive data integration platform, expanding across connectors, orchestration, engines, and observability. The maturation of SeaTunnel Engine is a major turning point, enabling stronger unified execution capabilities. Additionally, increased community activity and internationalization have significantly boosted its impact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you see SeaTunnel’s position and future?&lt;br&gt;
SeaTunnel is building a unique position by balancing rich connectors, strong engine capabilities, scalability, and enterprise readiness. Compared to traditional tools, it fits modern data infrastructure better; compared to heavyweight platforms, it remains flexible and extensible. It has strong potential to become a leading global open source data integration project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What are your future plans as a PMC Member?&lt;br&gt;
I plan to focus on improving SeaTunnel Engine, scheduling, resource management, and system stability; strengthening connectors and production readiness; and helping new contributors onboard faster through issue guidance, PR reviews, and knowledge sharing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Personal Growth &amp;amp; Open Source Culture
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How has open source impacted your career and growth?&lt;br&gt;
Professionally, it has exposed me to real-world complex problems and high-standard collaboration environments. Personally, it has deepened my understanding of collaboration, responsibility, and long-term thinking. Open source has shaped not only my technical skills but also my mindset and working style.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How would you summarize the spirit of open source in one sentence?&lt;br&gt;
Open source is about collaboratively creating, improving, and sharing technology in an open and inclusive way for the benefit of everyone.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>asf</category>
      <category>community</category>
      <category>bigdata</category>
      <category>apacheseatunnel</category>
    </item>
    <item>
      <title>Rethinking ClassLoader Governance in Apache SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:45:04 +0000</pubDate>
      <link>https://forem.com/seatunnel/rethinking-classloader-governance-in-apache-seatunnel-2leh</link>
      <guid>https://forem.com/seatunnel/rethinking-classloader-governance-in-apache-seatunnel-2leh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjud5he2ysxi7mt0jg01.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjud5he2ysxi7mt0jg01.jpg" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, while diving into the Apache SeaTunnel Zeta Engine codebase, I followed the ClassLoader thread and conducted a relatively systematic review.&lt;/p&gt;

&lt;p&gt;Overall, the current design already has a clear foundational structure, especially the centralized management approach of &lt;code&gt;ClassLoaderService&lt;/code&gt;, which is actually quite rare among similar systems 👍.&lt;/p&gt;

&lt;p&gt;Here, I try to take a different perspective—starting from &lt;strong&gt;“ClassLoader governance in long-running runtimes”&lt;/strong&gt;—to summarize some observations and outline a possible evolution path. These may not be entirely accurate, but are intended to spark discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  From “Usable” to “Governable”
&lt;/h2&gt;

&lt;p&gt;Apache SeaTunnel already handles multi-connector coexistence and dynamic loading and execution well. From a “functional availability” perspective, the mechanism works. But if we move one step further and ask: &lt;strong&gt;can ClassLoaders have a controllable lifecycle and verifiable reclamation?&lt;/strong&gt; the evaluation criteria begin to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observations (Runtime-Oriented)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Semantic Gap Between “Release” and “Close”
&lt;/h3&gt;

&lt;p&gt;Currently, &lt;code&gt;releaseClassLoader()&lt;/code&gt; removes cache entries and performs some thread-level cleanup when the reference count drops to zero, but it does not explicitly call &lt;code&gt;URLClassLoader.close()&lt;/code&gt;. For example, no close call is observed in &lt;code&gt;DefaultClassLoaderService.releaseClassLoader()&lt;/code&gt;, and &lt;code&gt;DefaultClassLoaderService.close()&lt;/code&gt; mainly clears internal cache structures. This raises a noteworthy point: JAR handle release depends on GC timing, and in long-running scenarios or on certain platforms (such as Windows), files may not be released promptly. 👉 This is closer to a “logical release” than to the end of the resource’s lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Class Loading Boundaries Can Still Change at Runtime
&lt;/h3&gt;

&lt;p&gt;In some paths, dependencies are still injected into the current ClassLoader via &lt;code&gt;addURL&lt;/code&gt;, such as: reflective calls to &lt;code&gt;addURL&lt;/code&gt; in &lt;code&gt;AbstractPluginDiscovery&lt;/code&gt;, and plugin dependency injection into the current loader in Flink execution paths. This leads to an interesting phenomenon: class loading boundaries are not only defined by loader structure, but also influenced by runtime behavior. While not problematic for a single job, under scenarios like repeated jobs in the same process or switching plugin combinations, boundaries may accumulate “historical residue”.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Some Residual Surfaces Are Not Fully Closed
&lt;/h3&gt;

&lt;p&gt;There are multiple TCCL usage patterns in the codebase (synchronous / asynchronous / cross-thread), and some paths show: TCCL not restored in &lt;code&gt;finally&lt;/code&gt;, or inconsistent baselines during cross-thread restoration. For example: TCCL usage in cooperative workers within &lt;code&gt;TaskExecutionService&lt;/code&gt;, and asymmetric restoration in some operations (such as source / restore). Additionally, some typical ClassLoader retention points are not yet uniformly governed, such as JDBC Driver registration (e.g., TDengine-related implementations) and connectors directly setting TCCL without restoring it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Possible Evolution Path (For Reference)
&lt;/h2&gt;

&lt;p&gt;Based on these observations, I’ve outlined a &lt;strong&gt;progressive governance path&lt;/strong&gt; that avoids large-scale refactoring and can be implemented in phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Close the ClassLoader Lifecycle
&lt;/h3&gt;

&lt;p&gt;Key ideas: explicitly call &lt;code&gt;close()&lt;/code&gt; on URLClassLoaders created by SeaTunnel at the appropriate time, and define clear ownership—“who creates, who closes”. This shifts from “GC-dependent release” to “controlled release”.&lt;/p&gt;
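&lt;p&gt;A minimal sketch of what “who creates, who closes” could look like. The class and method names here (&lt;code&gt;ConnectorClassLoaderRegistry&lt;/code&gt;, &lt;code&gt;acquire&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;) are my own illustrations, not existing SeaTunnel APIs; the point is only that the component that creates a loader is also the single owner allowed to close it:&lt;/p&gt;

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not SeaTunnel code): pair "who creates" with "who closes".
// The registry that creates a loader is the only place allowed to close() it,
// which turns GC-dependent release into controlled release.
public class ConnectorClassLoaderRegistry {
    private final Map loaders = new HashMap();   // key: jobId, value: URLClassLoader
    private final Map refCounts = new HashMap(); // key: jobId, value: Integer

    public synchronized URLClassLoader acquire(String jobId, URL[] jars) {
        URLClassLoader loader = (URLClassLoader) loaders.get(jobId);
        if (loader == null) {
            loader = new URLClassLoader(jars, getClass().getClassLoader());
            loaders.put(jobId, loader);
            refCounts.put(jobId, Integer.valueOf(0));
        }
        Integer count = (Integer) refCounts.get(jobId);
        refCounts.put(jobId, Integer.valueOf(count.intValue() + 1));
        return loader;
    }

    public synchronized void release(String jobId) throws IOException {
        Integer count = (Integer) refCounts.get(jobId);
        if (count == null) {
            return; // unknown or already fully released
        }
        int remaining = count.intValue() - 1;
        if (remaining > 0) {
            refCounts.put(jobId, Integer.valueOf(remaining));
        } else {
            refCounts.remove(jobId);
            URLClassLoader loader = (URLClassLoader) loaders.remove(jobId);
            // End of resource lifecycle: JAR handles are freed here, not at GC time.
            loader.close();
        }
    }

    public synchronized int liveLoaderCount() {
        return loaders.size();
    }
}
```

&lt;p&gt;Raw collection types are used only to keep the sketch short; real code would of course use generics and handle unbalanced release calls more defensively.&lt;/p&gt;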

&lt;h3&gt;
  
  
  Phase 2: Stabilize Loading Boundaries
&lt;/h3&gt;

&lt;p&gt;Goals: avoid runtime &lt;code&gt;addURL&lt;/code&gt; where possible, and determine the full classpath before loader creation. This ensures consistent behavior of the same loader over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Consolidate Common Residual Points
&lt;/h3&gt;

&lt;p&gt;Standardize patterns such as: wrapping TCCL with try-with-resources, pairing JDBC Driver registration and deregistration, and clearly assigning ClassLoader ownership to threads and ThreadLocal. This turns implicit references into manageable resources.&lt;/p&gt;
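&lt;p&gt;The TCCL pattern above can be sketched as a tiny &lt;code&gt;AutoCloseable&lt;/code&gt; wrapper. The class name &lt;code&gt;ThreadContextClassLoaderScope&lt;/code&gt; is mine, not an existing SeaTunnel type; the idea is that try-with-resources guarantees the symmetric restore even when the body throws:&lt;/p&gt;

```java
// Hypothetical helper showing the try-with-resources TCCL pattern:
// set on entry, restore in close(), so the restore always runs.
public class ThreadContextClassLoaderScope implements AutoCloseable {
    private final ClassLoader previous;

    public ThreadContextClassLoaderScope(ClassLoader temporary) {
        this.previous = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(temporary);
    }

    @Override
    public void close() {
        // Symmetric restore: the thread leaves the scope exactly as it entered.
        Thread.currentThread().setContextClassLoader(previous);
    }
}
```

&lt;p&gt;Usage is then a one-liner around plugin work: &lt;code&gt;try (ThreadContextClassLoaderScope scope = new ThreadContextClassLoaderScope(pluginLoader)) { ... }&lt;/code&gt;, which removes the “forgot to restore in &lt;code&gt;finally&lt;/code&gt;” class of residue entirely.&lt;/p&gt;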

&lt;h3&gt;
  
  
  Phase 4: Introduce Verifiable Reclamation
&lt;/h3&gt;

&lt;p&gt;As an enhancement: use &lt;code&gt;WeakReference + ReferenceQueue&lt;/code&gt; to track loaders, or expose simple runtime metrics (e.g., number of live loaders). The goal is not absolute precision, but the ability to reasonably judge whether resources have been released.&lt;/p&gt;
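&lt;p&gt;A possible shape for such a tracker, again as an illustration rather than a concrete proposal (the class name is mine): each loader is tracked with a &lt;code&gt;WeakReference&lt;/code&gt; registered against a &lt;code&gt;ReferenceQueue&lt;/code&gt;, and once the collector reclaims a loader its reference appears on the queue, so the live count drops without keeping the loader alive:&lt;/p&gt;

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of verifiable reclamation (not SeaTunnel code).
// WeakReferences do not prevent collection; reclaimed loaders surface on the queue.
public class ClassLoaderTracker {
    private final ReferenceQueue queue = new ReferenceQueue();
    private final Set tracked = new HashSet();

    public synchronized void track(ClassLoader loader) {
        tracked.add(new WeakReference(loader, queue));
    }

    // Drain references whose loaders have been reclaimed, then report the rest.
    // This is the "number of live loaders" metric mentioned above.
    public synchronized int liveLoaderCount() {
        Object ref = queue.poll();
        while (ref != null) {
            tracked.remove(ref);
            ref = queue.poll();
        }
        return tracked.size();
    }
}
```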

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;These issues rarely surface in short-lived tasks. But in scenarios such as long-running engine nodes, repeated task scheduling, or frequent plugin switching, these boundary issues accumulate over time. The results may include Metaspace growth, inability to replace JARs, and occasional class conflicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-Sentence Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From “class isolation” to “governable ClassLoaders with verifiable reclamation.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The above reflects my current understanding and organization of the topic. Some points may not be entirely accurate—feedback and real-world scenarios are very welcome 🙌. If the community is interested, this could evolve into a more general and reusable infrastructure capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: Code References
&lt;/h2&gt;

&lt;p&gt;Some code locations noted during analysis (not exhaustive): &lt;code&gt;DefaultClassLoaderService&lt;/code&gt; (release/close), &lt;code&gt;AbstractPluginDiscovery&lt;/code&gt; (addURL), Flink starter execution paths (plugin injection), &lt;code&gt;TaskExecutionService&lt;/code&gt; (TCCL usage), various operations (source/restore), and connectors (Iceberg / Paimon / TDengine, etc.).&lt;/p&gt;

</description>
      <category>classloader</category>
      <category>apacheseatunnel</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>From Apache SeaTunnel to ASF Member: A Story of Long-Term Commitment</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 27 Mar 2026 03:15:17 +0000</pubDate>
      <link>https://forem.com/seatunnel/from-apache-seatunnel-to-asf-member-a-story-of-long-term-commitment-4pp9</link>
      <guid>https://forem.com/seatunnel/from-apache-seatunnel-to-asf-member-a-story-of-long-term-commitment-4pp9</guid>
      <description>&lt;p&gt;Recently, after internal discussions, the Apache Software Foundation invited several PMC Members from the Apache SeaTunnel project to become ASF Members—one of the highest honors within the foundation. Among them is &lt;strong&gt;Wang Hailin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp33vya9ozsbnl9drwnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp33vya9ozsbnl9drwnn.png" alt="3d5c8aaf1091f7a7ef66425e97d147bc" width="800" height="721"&gt;&lt;/a&gt;&lt;br&gt;
Congratulations to Wang Hailin on becoming an ASF Member! For a key contributor to the SeaTunnel community, this recognition is not only a personal milestone, but also a moment of pride for the entire community.&lt;/p&gt;

&lt;p&gt;Over the years, he has remained deeply involved in the community: from refining documentation to improving code, from participating in technical discussions to helping newcomers. His contributions can be seen across almost every corner of the project. Beyond SeaTunnel, he has also been actively contributing to multiple ASF projects, consistently practicing the Apache Way advocated by the foundation. It is this steady, long-term dedication that has led to this important recognition.&lt;/p&gt;

&lt;p&gt;To mark the occasion, the community conducted an in-depth interview with him. This article is structured into five sections—personal background, open-source journey, the path to ASF Member, SeaTunnel community development, and open-source culture—to give a closer look at his growth, his experiences in open source, and the passion and persistence behind his contributions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Background &amp;amp; Open Source Journey
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcyr6qckib47t2xmgng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcyr6qckib47t2xmgng.png" alt="Wang Hailin" width="800" height="1069"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: Could you briefly introduce yourself and how you got into big data and open source?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Hey guys, I’m Wang Hailin, and my GitHub ID is hailin0. I mainly work on data infrastructure, with a focus on data integration, data synchronization, and data platforms.&lt;/p&gt;

&lt;p&gt;Outside of work, I enjoy engaging with open-source communities—sharing practical experience and exchanging ideas around data platforms and integration technologies.&lt;/p&gt;

&lt;p&gt;My entry into big data and open source is closely tied to my earlier work experience. While working on systems like data development platforms and performance monitoring, I frequently dealt with data ingestion and synchronization challenges, which required exploring various data integration tools.&lt;/p&gt;

&lt;p&gt;That’s when I came across SeaTunnel. What stood out to me was its extensible architecture—it supports a wide range of data sources and complex synchronization scenarios, making it well-suited for enterprise use. This sparked my interest, and I gradually started contributing to the community. Over time, through continuous contributions and discussions, I became one of the core contributors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: When did you start contributing to SeaTunnel, and what was the trigger?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: It started from a practical need at work. At the time, I was building a data platform and needed a reliable data integration tool. During that evaluation process, I discovered SeaTunnel.&lt;/p&gt;

&lt;p&gt;Back then, the project wasn’t as mature as it is today, but its architecture left a strong impression on me—especially the plugin-based Connector system and the flexible data synchronization model.&lt;/p&gt;

&lt;p&gt;I began using SeaTunnel in real-world scenarios, and gradually got involved in contributing. Starting with small fixes and bug patches, I later participated in more feature development and community discussions, eventually becoming a long-term contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What key areas or features have you contributed to in SeaTunnel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: My contributions mainly fall into a few areas.&lt;/p&gt;

&lt;p&gt;Early on, I worked on Connector development and improvements. For a data integration platform, the Connector ecosystem is fundamental—it determines which data sources and systems the platform can connect to.&lt;/p&gt;

&lt;p&gt;As I became more involved, I also contributed to framework-level and infrastructure work, such as improving the E2E testing system and refining the logging framework to make the project more robust and standardized.&lt;/p&gt;

&lt;p&gt;Later, as I gained a deeper understanding of the synchronization engine, I started working on CDC (Change Data Capture) capabilities, including CDC read/write and DDL synchronization. In real production environments, schema changes (DDL) are unavoidable. If a system cannot handle schema evolution properly, data pipelines can easily break.&lt;/p&gt;

&lt;p&gt;Overall, these efforts are driven by a single goal: to make SeaTunnel not just a data synchronization tool, but a reliable data integration infrastructure for enterprise environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Contributions &amp;amp; Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q4: Which contribution or experience left the deepest impression on you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: One experience that stands out is working on DDL support in CDC scenarios.&lt;/p&gt;

&lt;p&gt;At first glance, DDL may seem like a simple SQL parsing problem. But in a data synchronization system, it must flow correctly through the entire pipeline: from Source capturing the event, to passing it through the data stream, to executing schema changes on the Sink.&lt;/p&gt;

&lt;p&gt;The real challenge lies in maintaining consistency between DDL and data changes. In practice, synchronization jobs run concurrently across multiple nodes, so DDL events must maintain a consistent order throughout the distributed pipeline.&lt;/p&gt;

&lt;p&gt;This requires tight integration with state management mechanisms like Checkpoint and Savepoint, ensuring that after recovery or restart, DDL and data events remain in the correct order.&lt;/p&gt;

&lt;p&gt;When you combine all these factors, DDL handling becomes a system-level challenge involving distributed data flow, state consistency, and multi-system compatibility.&lt;/p&gt;

&lt;p&gt;This work took quite a long time and involved extensive discussions with other contributors. It’s one of the more complex aspects of many data synchronization systems, and we aimed to make SeaTunnel more reliable for enterprise real-time scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: What do you think is the most important skill in open source collaboration?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: I would say communication and collaboration are critical.&lt;/p&gt;

&lt;p&gt;Technical skills are the foundation, but many decisions in open source are made through discussion and consensus. Being able to clearly express your ideas, understand others’ perspectives, and move toward agreement is essential.&lt;/p&gt;

&lt;p&gt;Another important factor is patience and long-term commitment. Open source is not a short-term effort—it requires sustained involvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q6: What advice would you give to newcomers in open source?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Start small. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix a bug&lt;/li&gt;
&lt;li&gt;Improve documentation&lt;/li&gt;
&lt;li&gt;Submit a small feature enhancement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps you get familiar with the codebase and development workflow.&lt;/p&gt;

&lt;p&gt;Also, participate in discussions. Even asking questions or joining simple conversations helps you understand the project’s design.&lt;/p&gt;

&lt;p&gt;Open source is a long journey—you don’t need to aim for big features at the beginning. What matters more is understanding the architecture, not just the code.&lt;/p&gt;

&lt;p&gt;Many core contributors grow over years—from users to contributors, and eventually to maintainers.&lt;/p&gt;

&lt;p&gt;For me, the biggest gain from open source is not a specific piece of code, but the opportunity to collaborate with developers from different companies and backgrounds. That experience is incredibly valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Becoming an ASF Member
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q7: What was your first reaction when you were invited to become an ASF Member?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: I was surprised and very grateful.&lt;/p&gt;

&lt;p&gt;ASF Membership is not something you apply for—it comes through nomination and voting by existing members. So it represents recognition from the community for long-term contributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8: How closely is this achievement tied to your work in SeaTunnel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Very closely.&lt;/p&gt;

&lt;p&gt;The SeaTunnel community gave me many opportunities to grow—from contributing code to participating in community governance. Through this process, I gradually learned how Apache communities operate.&lt;/p&gt;

&lt;p&gt;It’s not just about technical contributions, but also collaboration and governance, which are all important factors in becoming an ASF Member.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q9: What does becoming an ASF Member mean to you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: To me, it represents responsibility.&lt;/p&gt;

&lt;p&gt;It’s not only recognition of past contributions, but also a commitment to continue contributing to the Apache community—helping projects grow, supporting new projects entering the ecosystem, and promoting open-source culture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q10: How do you see the importance of the Apache Way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: The Apache community emphasizes &lt;strong&gt;“Community Over Code.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A successful project needs not only strong technology, but also a healthy community, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open and transparent decision-making&lt;/li&gt;
&lt;li&gt;Consensus-driven governance&lt;/li&gt;
&lt;li&gt;Encouraging participation from diverse contributors&lt;/li&gt;
&lt;li&gt;Continuously welcoming new contributors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are key reasons why Apache projects can succeed in the long run.&lt;/p&gt;

&lt;h2&gt;
  
  
  SeaTunnel Community Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q11: What are the key milestones in SeaTunnel’s growth?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Several milestones stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entering the Apache Incubator&lt;/li&gt;
&lt;li&gt;Unifying APIs and introducing the Zeta engine&lt;/li&gt;
&lt;li&gt;Graduating as a Top-Level Project (TLP)&lt;/li&gt;
&lt;li&gt;Rapid iteration in the 2.3.x series with increasing stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SeaTunnel was open-sourced in 2017, entered the Apache Incubator in 2021, and became a TLP in 2023. This journey reflects not only technical evolution but also the maturation of community governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q12: How do you see SeaTunnel’s positioning in data integration?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: In recent years, the demand for efficient data movement has grown significantly, and synchronization scenarios have become more complex.&lt;/p&gt;

&lt;p&gt;SeaTunnel aims to be a high-performance, extensible platform that supports diverse data integration needs across different use cases.&lt;/p&gt;

&lt;p&gt;It already supports multiple data sources, batch processing, real-time synchronization, and CDC.&lt;/p&gt;

&lt;p&gt;Looking ahead, I believe it will continue to evolve in areas such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expanding the connector ecosystem&lt;/li&gt;
&lt;li&gt;Strengthening data transformation capabilities&lt;/li&gt;
&lt;li&gt;Improving fault handling&lt;/li&gt;
&lt;li&gt;Enhancing ecosystem integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open Source Culture &amp;amp; Personal Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q13: How has open source influenced your career?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: It has influenced me in two major ways.&lt;/p&gt;

&lt;p&gt;First, it broadened my technical perspective. In company projects, decisions are often driven by specific business needs. In open source, designs must work across different use cases, systems, and organizations. This leads to a more comprehensive understanding of system design.&lt;/p&gt;

&lt;p&gt;Second, it deepened my understanding of software engineering and collaboration. In open source, a feature goes through idea proposal, design discussion, review, and iteration before merging. This process emphasizes design and communication, not just coding.&lt;/p&gt;

&lt;p&gt;Working with developers from different countries and backgrounds also brings fresh perspectives.&lt;/p&gt;

&lt;p&gt;For me, the biggest gain is the opportunity to collaborate in an open environment and solve problems with talented engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q14: How would you summarize the spirit of open source in one sentence?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Based on my experience, the most valuable aspect of open source is that it provides a space for long-term participation and growth.&lt;/p&gt;

&lt;p&gt;I started as a user, using tools to solve problems. Then I began contributing small fixes, and gradually got involved in feature development and core system design.&lt;/p&gt;

&lt;p&gt;Looking back, it’s a journey from user → contributor → maintainer.&lt;/p&gt;

&lt;p&gt;In a company, knowledge often stays within a team. In open source, your work can be seen, used, and improved by many others. As the project grows, so do the people involved.&lt;/p&gt;

&lt;p&gt;So if I had to summarize it in one sentence:&lt;/p&gt;

&lt;p&gt;Open source is not just about sharing code—it’s about growing together with the community.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>asf</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Apache SeaTunnel Performance Tuning: How to Set JVM Parameters the Right Way</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 27 Mar 2026 03:13:13 +0000</pubDate>
      <link>https://forem.com/seatunnel/apache-seatunnel-performance-tuning-how-to-set-jvm-parameters-the-right-way-28e0</link>
      <guid>https://forem.com/seatunnel/apache-seatunnel-performance-tuning-how-to-set-jvm-parameters-the-right-way-28e0</guid>
      <description>&lt;p&gt;As a high-performance distributed data integration platform, properly tuning JVM parameters for Apache SeaTunnel is essential if you want better throughput, lower latency, and stable execution.&lt;/p&gt;

&lt;p&gt;So how should you tune JVM parameters?&lt;br&gt;
In this article, we’ll walk through where to configure them, how precedence works, the key parameters to focus on, and some practical tuning strategies.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Configuration File Locations
&lt;/h2&gt;

&lt;p&gt;SeaTunnel manages JVM parameters through configuration files under &lt;code&gt;$SEATUNNEL_HOME/config/&lt;/code&gt;. Depending on the deployment role, there are four main files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Name&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Default Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid mode (&lt;code&gt;master_and_worker&lt;/code&gt;), where Master and Worker run in the same process&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g -XX:+UseG1GC&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_master_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dedicated Master node, responsible for scheduling and state management (no computation)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_worker_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dedicated Worker node, responsible for data reading, transformation, and writing (main memory consumer)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_client_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Client side (&lt;code&gt;seatunnel.sh&lt;/code&gt;), used to parse configs and submit jobs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms256m -Xmx512m&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  2. Parameter Precedence
&lt;/h2&gt;

&lt;p&gt;Understanding parameter precedence is critical when troubleshooting.&lt;/p&gt;

&lt;p&gt;SeaTunnel loads JVM parameters in the following order, and &lt;strong&gt;later ones override earlier ones&lt;/strong&gt; (for example, the last &lt;code&gt;-Xmx&lt;/code&gt; wins):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment variable &lt;code&gt;JAVA_OPTS&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Loaded first. You can define it in system env variables or in &lt;code&gt;config/seatunnel-env.sh&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration files (&lt;code&gt;config/jvm_*_options&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Loaded next, and &lt;strong&gt;override anything set in &lt;code&gt;JAVA_OPTS&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Command-line parameters (&lt;code&gt;-DJvmOption&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Loaded last, with &lt;strong&gt;the highest priority&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If &lt;code&gt;JAVA_OPTS="-Xmx4g"&lt;/code&gt;, the config file sets &lt;code&gt;-Xmx2g&lt;/code&gt;, and the startup command includes &lt;code&gt;-DJvmOption="-Xmx8g"&lt;/code&gt;, then the effective value will be &lt;strong&gt;8g&lt;/strong&gt;.&lt;/p&gt;
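&lt;p&gt;The merge can be sketched in a few lines of shell. The variable names are illustrative; only the “last flag wins” behavior is the point, since the JVM honors the final occurrence of a repeated flag:&lt;/p&gt;

```shell
# Simulate the three layers in load order: JAVA_OPTS, then the jvm_* config file,
# then the command-line -DJvmOption value.
JAVA_OPTS="-Xmx4g"       # layer 1: environment variable
FILE_OPTS="-Xmx2g"       # layer 2: config/jvm_options (illustrative)
CLI_OPTS="-Xmx8g"        # layer 3: -DJvmOption on the command line

MERGED="$JAVA_OPTS $FILE_OPTS $CLI_OPTS"
# The JVM honors the last -Xmx it sees:
EFFECTIVE=$(echo "$MERGED" | grep -o '\-Xmx[0-9a-z]*' | tail -n 1)
echo "$EFFECTIVE"
```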
&lt;h2&gt;
  
  
  3. Key JVM Tuning Parameters
&lt;/h2&gt;
&lt;h3&gt;
  
  
  3.1 Heap Memory
&lt;/h3&gt;

&lt;p&gt;Heap memory is the most important part of JVM tuning. It directly determines how much data SeaTunnel can process in parallel without running into OOM (Out Of Memory).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-Xms&lt;/code&gt;&lt;/strong&gt;: Initial heap size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-Xmx&lt;/code&gt;&lt;/strong&gt;: Maximum heap size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Worker nodes&lt;/strong&gt;:&lt;br&gt;
It’s strongly recommended to set &lt;code&gt;-Xms&lt;/code&gt; and &lt;code&gt;-Xmx&lt;/code&gt; to the &lt;strong&gt;same value&lt;/strong&gt; (for example, &lt;code&gt;-Xms8g -Xmx8g&lt;/code&gt;).&lt;br&gt;
This avoids runtime heap resizing, reduces performance fluctuations, and helps prevent memory fragmentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Master nodes&lt;/strong&gt;:&lt;br&gt;
Memory requirements are relatively low. In most cases, &lt;code&gt;2g–4g&lt;/code&gt; is sufficient. Increase it only if the cluster handles many jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;:&lt;br&gt;
The default &lt;code&gt;512m&lt;/code&gt; is usually enough. If your job configuration (SQL/JSON) is very large (tens of thousands of lines), consider increasing it to &lt;code&gt;1g&lt;/code&gt; or more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3.2 Off-Heap Memory
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Important note:&lt;/strong&gt;&lt;br&gt;
You may notice that the actual physical memory (RSS) used by SeaTunnel is significantly larger than the &lt;code&gt;-Xmx&lt;/code&gt; value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;br&gt;
SeaTunnel uses Netty for network communication, which relies heavily on &lt;strong&gt;off-heap (direct) memory&lt;/strong&gt; for zero-copy data transfer.&lt;br&gt;
In addition, thread stacks (&lt;code&gt;-Xss * number of threads&lt;/code&gt;), Metaspace, and JVM overhead also consume non-heap memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Risk:&lt;/strong&gt;&lt;br&gt;
If the machine runs out of physical memory, the Linux OOM Killer may terminate the process (usually a Worker).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommendations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve memory for the OS:&lt;/strong&gt;&lt;br&gt;
On an 8GB machine, keep &lt;code&gt;-Xmx&lt;/code&gt; below &lt;code&gt;5g&lt;/code&gt;, leaving around 3GB for off-heap memory and the operating system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker/Kubernetes:&lt;/strong&gt;&lt;br&gt;
The container memory limit must be larger than &lt;code&gt;-Xmx&lt;/code&gt; plus estimated off-heap usage.&lt;br&gt;
A common rule is to set it to about &lt;strong&gt;1.5× &lt;code&gt;-Xmx&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
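&lt;p&gt;For example, with &lt;code&gt;-Xmx8g&lt;/code&gt; the 1.5× rule suggests roughly a 12Gi container limit. A hypothetical Kubernetes snippet (the field values are illustrative, not a universal recommendation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    memory: "12Gi"
  limits:
    memory: "12Gi"  # about 1.5x -Xmx8g: heap plus Netty direct buffers, thread stacks, Metaspace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;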
&lt;h3&gt;
  
  
  3.3 Garbage Collector
&lt;/h3&gt;

&lt;p&gt;SeaTunnel’s Zeta engine recommends using &lt;strong&gt;G1GC&lt;/strong&gt;, which provides more predictable pause times for large heaps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:+UseG1GC&lt;/code&gt;&lt;/strong&gt;: Enable G1 GC (enabled by default)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:MaxGCPauseMillis=200&lt;/code&gt;&lt;/strong&gt;: Target maximum GC pause time (in milliseconds)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time workloads&lt;/strong&gt;:
If latency is critical, you can lower this value (e.g., &lt;code&gt;100&lt;/code&gt;).
Keep in mind this may increase GC frequency and slightly reduce overall throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch workloads&lt;/strong&gt;:
The default &lt;code&gt;200ms&lt;/code&gt; is usually a good balance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:InitiatingHeapOccupancyPercent=45&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
Heap occupancy threshold that triggers concurrent GC.&lt;br&gt;
If you observe frequent Full GC, try lowering it (e.g., &lt;code&gt;40&lt;/code&gt;) so GC starts earlier.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
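&lt;p&gt;Put together, a latency-sensitive Worker might carry a G1 block like this in &lt;code&gt;jvm_worker_options&lt;/code&gt; (the values are illustrative starting points to adapt, not universal recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;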
&lt;h3&gt;
  
  
  3.4 Metaspace
&lt;/h3&gt;

&lt;p&gt;Metaspace stores class metadata. SeaTunnel consumes metaspace when loading connectors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:MaxMetaspaceSize&lt;/code&gt;&lt;/strong&gt;: Maximum metaspace size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default (&lt;code&gt;2g&lt;/code&gt;) is usually sufficient.&lt;br&gt;
If you encounter &lt;code&gt;java.lang.OutOfMemoryError: Metaspace&lt;/code&gt;, increase it accordingly.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.5 Troubleshooting
&lt;/h3&gt;

&lt;p&gt;When OOM happens, heap dumps are extremely helpful for diagnosis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:+HeapDumpOnOutOfMemoryError&lt;/code&gt;&lt;/strong&gt;: Generate a heap dump automatically on OOM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:HeapDumpPath=/tmp/seatunnel/dump/&lt;/code&gt;&lt;/strong&gt;: Path to store dump files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure the disk has enough space (at least larger than &lt;code&gt;-Xmx&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In container environments, ensure the path is mounted to the host; otherwise, dumps will be lost after restart&lt;/li&gt;
&lt;/ul&gt;
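&lt;p&gt;Besides automatic dumps on OOM, you can also capture a dump on demand from a running process with standard JDK tools (the process name below matches the &lt;code&gt;jps&lt;/code&gt; output shown later; adjust it to your deployment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Find the SeaTunnel server PID
PID=$(jps | grep SeaTunnelServer | awk '{print $1}')
# Dump only live objects to keep the file smaller
jmap -dump:live,format=b,file=/tmp/seatunnel/dump/manual.hprof "$PID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;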
&lt;h2&gt;
  
  
  4. JDK Compatibility
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommended versions&lt;/strong&gt;: &lt;strong&gt;Java 8 (JDK 1.8)&lt;/strong&gt; or &lt;strong&gt;Java 11&lt;/strong&gt;&lt;br&gt;
These are the most thoroughly tested versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Java 17+&lt;/strong&gt;:&lt;br&gt;
Generally supported, but due to the module system introduced in Java 9+, you may encounter &lt;code&gt;InaccessibleObjectException&lt;/code&gt; caused by restricted reflection access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
If this happens, add &lt;code&gt;--add-opens&lt;/code&gt; options in &lt;code&gt;jvm_options&lt;/code&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--add-opens&lt;/span&gt; java.base/java.lang&lt;span class="o"&gt;=&lt;/span&gt;ALL-UNNAMED
&lt;span class="nt"&gt;--add-opens&lt;/span&gt; java.base/java.util&lt;span class="o"&gt;=&lt;/span&gt;ALL-UNNAMED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Production Tuning Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Large-Scale Batch Processing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Characteristics&lt;/strong&gt;: Large data volume (TB scale), throughput is the priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xms8g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx8g&lt;/span&gt;
&lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC
&lt;span class="nt"&gt;-XX&lt;/span&gt;:ParallelGCThreads&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the source reads data too quickly, memory may build up&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Besides increasing heap size, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limiting &lt;code&gt;read_limit.rows_per_second&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Adjusting &lt;code&gt;parallelism&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
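&lt;p&gt;Both knobs are set in the job configuration; a minimal sketch, assuming they live in the &lt;code&gt;env&lt;/code&gt; block and with purely illustrative values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env {
  # Cap how fast the source emits rows to avoid in-memory buildup
  read_limit.rows_per_second = 10000
  parallelism = 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;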

&lt;h3&gt;
  
  
  Scenario 2: Real-Time CDC Synchronization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Characteristics&lt;/strong&gt;: Long-running jobs, latency-sensitive, relatively stable memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xms4g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx4g&lt;/span&gt;
&lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC
&lt;span class="nt"&gt;-XX&lt;/span&gt;:MaxGCPauseMillis&lt;span class="o"&gt;=&lt;/span&gt;100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checkpoint frequency also affects memory usage (state backend caching)&lt;/li&gt;
&lt;li&gt;If memory pressure is high, consider increasing &lt;code&gt;checkpoint.interval&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scenario 3: Low-Memory Deployment (e.g., 4GB)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: High chance of the process being killed by the OS OOM killer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xmx2560m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Allocate about 2.5GB to heap&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leave the remaining 1.5GB for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Off-heap memory (Netty)&lt;/li&gt;
&lt;li&gt;OS&lt;/li&gt;
&lt;li&gt;Other processes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. How to Verify Your Configuration
&lt;/h2&gt;

&lt;p&gt;After starting SeaTunnel, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jps &lt;span class="nt"&gt;-v&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;SeaTunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;12345 SeaTunnelServer ... &lt;span class="nt"&gt;-Xms8g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx8g&lt;/span&gt; &lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure your parameters (e.g., &lt;code&gt;-Xmx8g&lt;/code&gt;) appear &lt;strong&gt;at the end of the list&lt;/strong&gt;: when a flag is repeated, the JVM honors the last occurrence, so anything listed after your settings overrides them.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Docker / Kubernetes-Specific Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Recommended Approach: Container-Aware Memory
&lt;/h3&gt;

&lt;p&gt;In Kubernetes, memory is typically controlled via &lt;code&gt;resources.limits.memory&lt;/code&gt;.&lt;br&gt;
Instead of hardcoding &lt;code&gt;-Xmx&lt;/code&gt;, it’s better to use percentage-based settings so the JVM can adapt automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAVA_OPTS&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-XX:+UseContainerSupport&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-XX:MaxRAMPercentage=70.0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-XshowSettings:vm"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:+UseContainerSupport&lt;/code&gt;: Allows the JVM to detect container memory limits (on by default since JDK 8u191 and JDK 10)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxRAMPercentage=70.0&lt;/code&gt;: Sets heap to 70% of container memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 70%?&lt;/strong&gt;&lt;br&gt;
The remaining 30% is needed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct memory (Netty)&lt;/li&gt;
&lt;li&gt;Metaspace&lt;/li&gt;
&lt;li&gt;Thread stacks&lt;/li&gt;
&lt;li&gt;JVM overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.2 Resource Limits
&lt;/h3&gt;

&lt;p&gt;Make sure Kubernetes resource settings align with JVM needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; you want an 8GB heap&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM: 70%&lt;/li&gt;
&lt;li&gt;K8s limit: &lt;code&gt;8 / 0.7 ≈ 11.4GB&lt;/code&gt; → round up to &lt;code&gt;12Gi&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12Gi"&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12Gi"&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.3 Overriding Default Config
&lt;/h3&gt;

&lt;p&gt;If default config files already define memory settings, they may override &lt;code&gt;JAVA_OPTS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To ensure your settings take effect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use command-line parameters (highest priority):&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-DJvmOption=-XX:MaxRAMPercentage=70.0"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Mount custom config files via ConfigMap&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  7.4 Common Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;❌ Setting &lt;code&gt;limits.memory = 4Gi&lt;/code&gt; and &lt;code&gt;-Xmx4g&lt;/code&gt;&lt;br&gt;
→ No space left for non-heap memory → process will be killed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;❌ Not setting &lt;code&gt;requests&lt;/code&gt;&lt;br&gt;
→ Pod may be scheduled on a node without enough memory&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;jvm_options&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seatunnel-cluster.sh&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;values.yaml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>jvm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:13:54 +0000</pubDate>
      <link>https://forem.com/seatunnel/why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4cfa</link>
      <guid>https://forem.com/seatunnel/why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4cfa</guid>
      <description>&lt;p&gt;In a data warehouse system, the DWS and ADS layers mark the critical boundary between “data modeling” and “data delivery.” The former carries shared aggregation and metric reuse capabilities, determining the stability and efficiency of the data system; the latter is oriented toward specific consumption scenarios, directly impacting business delivery efficiency and user experience.&lt;/p&gt;

&lt;p&gt;If the DWS layer is poorly designed, metrics will be repeatedly produced in the ADS layer, ultimately leading to inconsistent definitions and siloed data; if the ADS layer runs out of control, it can even backfire on the shared layer, forming unmanageable data assets. Therefore, a healthy data system must establish a clear boundary and evolution mechanism between “shared foundation” and “flexible delivery.”&lt;/p&gt;

&lt;p&gt;As the fourth article in the Data Lakehouse design and practice series, this piece systematically summarizes &lt;strong&gt;the core design principles of the DWS/ADS delivery layer&lt;/strong&gt;, including methods for shared aggregation and subject-wide table modeling, metric definition frameworks, delivery layer strategies, and lifecycle governance practices. It also addresses common issues, helping teams build a highly reusable, governable, and sustainable data delivery system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DWS Must Be “Thick Enough”
&lt;/h2&gt;

&lt;p&gt;In many teams’ data systems, the DWS layer is often underestimated or even weakened, resulting in all requirements being pushed to the ADS layer. In the short term, this seems flexible, but over time it quickly spirals out of control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtm5tm3b03fhpgvqzkf7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtm5tm3b03fhpgvqzkf7.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core positioning of DWS is as a shared aggregation and reuse layer. It is not designed to serve a single report, but to provide a unified data foundation for &lt;strong&gt;multiple applications to share&lt;/strong&gt;. If this layer is underdeveloped, every new requirement will trigger recalculation and redefinition of metrics, resulting in a bunch of incompatible results.&lt;/p&gt;

&lt;p&gt;In practice, a healthy state is: &lt;strong&gt;about 70% of analytical needs can be directly fulfilled by combining DWS tables.&lt;/strong&gt; This means most scenarios do not require creating new tables, but rather combining existing shared capabilities. This “ready-to-use” capability is the core of reuse value.&lt;/p&gt;

&lt;p&gt;Conversely, if each department has its own ADS tables and each report has its own metric definitions, typical silo problems emerge: metrics with the same name do not match, computations are duplicated, and data cannot be aligned. Teams spend most of their time reconciling definitions instead of analyzing business.&lt;/p&gt;

&lt;p&gt;The value of DWS lies precisely in solving these common issues. By precomputing aggregated results of high-frequency dimension combinations, building subject-wide tables, and unifying metric outputs, DWS moves dispersed computations to the offline layer. As a result, online queries no longer rely on temporary large-scale joins or full table scans, making performance and cost more controllable.&lt;/p&gt;

&lt;p&gt;More importantly, it changes team collaboration. Metrics no longer depend on verbal agreements—they exist as data assets: with owners, definitions, lineage, and quality rules. So-called “metric disputes” essentially become “asset governance issues.”&lt;/p&gt;

&lt;p&gt;But there is a prerequisite: DWS must be governable. If fields lack explanations, metrics lack definitions, update frequency is unclear, or quality rules are missing, this layer will become a “wide-table collection nobody dares to use,” reducing reuse rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Aggregation and Subject-Wide Tables: Balancing Reuse and Performance
&lt;/h2&gt;

&lt;p&gt;DWS design revolves around two types of tables: shared aggregation tables and subject-wide tables.&lt;/p&gt;

&lt;p&gt;Shared aggregation tables hinge on &lt;strong&gt;clarity&lt;/strong&gt;. They must clearly define aggregation granularity (e.g., daily, weekly, monthly, or cumulative), dimension combinations (e.g., time, organization, channel, category), and metric calculation scope (e.g., amount, count, or frequency). Without clear boundaries, downstream reuse becomes unreliable.&lt;/p&gt;
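&lt;p&gt;As a concrete sketch (every table and column name here is hypothetical), a daily shared aggregation table makes its granularity, dimensions, and metric scope explicit in the schema itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical DWS table: daily grain, explicit dimensions and metrics
CREATE TABLE dws_trade_order_1d (
  stat_date     DATE          NOT NULL,  -- grain: one row per day per dimension combination
  org_code      VARCHAR(32)   NOT NULL,  -- dimension: organization
  channel_code  VARCHAR(32)   NOT NULL,  -- dimension: channel
  pay_amount    DECIMAL(18,2),           -- metric: successful payment amount
  pay_order_cnt BIGINT                   -- metric: count of successful orders
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;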

&lt;p&gt;Subject-wide tables emphasize &lt;strong&gt;usability&lt;/strong&gt;. They usually focus on a business domain, e.g., users, transactions, or products, flattening frequently joined dimensions in advance to reduce query complexity. Importantly, wide tables are a result-oriented form for analytics—they are &lt;strong&gt;not a replacement for fact tables&lt;/strong&gt; and must be traceable back to underlying models.&lt;/p&gt;

&lt;p&gt;A common practical problem is wide tables continually growing. To mitigate this, fields can be governed based on usage frequency: retain high-frequency fields in the main wide table, split or join low-frequency fields on demand, and regularly slim tables according to usage.&lt;/p&gt;

&lt;p&gt;Another common pitfall is mixing different aggregation levels in the same table, e.g., daily and monthly data together. This greatly increases misuse risk and complicates maintenance. A better approach is to split tables by level or at least enforce strict naming conventions.&lt;/p&gt;

&lt;p&gt;All these designs assume &lt;strong&gt;consistent dimensions&lt;/strong&gt; exist. Core dimensions such as user, organization, channel, and time must have unified codes and definitions, otherwise cross-table reuse fails.&lt;/p&gt;

&lt;p&gt;From a performance perspective, DWS’s core strategy is always &lt;strong&gt;pre-aggregation first&lt;/strong&gt;. Reduce data scan scale via offline computation before applying indexing, partitioning, or materialized views. Otherwise, all optimizations become remedial measures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metric Framework: Layered Design from Atomic to Composite
&lt;/h2&gt;

&lt;p&gt;If DWS solves &lt;strong&gt;data reuse&lt;/strong&gt;, then the metric framework ensures &lt;strong&gt;definition consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A governable metric system typically has three levels: atomic metrics, derived metrics, and composite metrics.&lt;/p&gt;

&lt;p&gt;Atomic metrics are the fundamental units. They must clearly define the target, scope, filters, and time granularity. For example, “successful payment amount” must clearly count only successful payments and use the payment completion time.&lt;/p&gt;

&lt;p&gt;Derived metrics are calculated from atomic metrics. For example, average order value = “successful payment amount / number of successful orders.” The key point is that derived metrics must inherit the definitions of their atomic metrics; otherwise, their results will silently diverge.&lt;/p&gt;
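&lt;p&gt;Sketched in SQL against a hypothetical DWS table &lt;code&gt;dws_trade_order_1d&lt;/code&gt; that already carries both atomic metrics, the derivation is pure arithmetic on inherited definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Average order value derived from two atomic metrics;
-- scope and filters are inherited from the DWS table, not redefined here
SELECT stat_date,
       SUM(pay_amount) / NULLIF(SUM(pay_order_cnt), 0) AS avg_order_value
FROM dws_trade_order_1d
GROUP BY stat_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;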

&lt;p&gt;Composite metrics span multiple processes or business domains, e.g., conversion rate, retention, or repeat purchase. These rely heavily on a consistent dimension system and event definitions, making them the most prone to ambiguity.&lt;/p&gt;

&lt;p&gt;To avoid confusion, every metric must have four elements: business definition, calculation formula, scope, and time granularity. This is not just documentation—it is the basis for traceability and auditability.&lt;/p&gt;

&lt;p&gt;Metrics must also support version control. Changes to definitions cannot overwrite historical results directly; versions or effective dates should be used to prevent “historical data being rewritten.”&lt;/p&gt;

&lt;p&gt;In terms of layering, atomic metrics should reside in DWS (or be traceable back to DWD), while ADS handles only lightweight combination and presentation. If ADS takes on definition duties, it quickly becomes a new “metric generation layer.”&lt;/p&gt;

&lt;h2&gt;
  
  
  ADS and Data Marts: Delivery for Consumption
&lt;/h2&gt;

&lt;p&gt;If DWS is about &lt;strong&gt;accumulation&lt;/strong&gt;, ADS is about &lt;strong&gt;delivery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;ADS (or DM, data marts) aims to provide data products for specific consumption scenarios, e.g., BI reports, API services, or analytical datasets. Structures here emphasize &lt;strong&gt;usability&lt;/strong&gt;, not generality.&lt;/p&gt;

&lt;p&gt;Delivery tables should follow a &lt;strong&gt;“one table, one scenario”&lt;/strong&gt; principle. Field names can be closer to business semantics, and additional display, sort, or status fields can be added to improve user experience.&lt;/p&gt;

&lt;p&gt;But one bottom line must be enforced: &lt;strong&gt;delivery should not invent metrics&lt;/strong&gt;. All core metrics must come from DWS or the metric system; ADS only handles combination, formatting, and lightweight calculation. Violating this quickly returns to “one metric per report.”&lt;/p&gt;

&lt;p&gt;Update frequency must respect business SLA. Daily, hourly, or minute-level updates directly affect compute chains and resource costs. The higher the frequency, the more careful you must be with field scale and calculation complexity.&lt;/p&gt;

&lt;p&gt;Governance of data marts is also crucial. They can be department- or scenario-specific, but must be built on a unified dimension and metric framework. Views or semantic layers may meet variation needs, but duplicating underlying logic is not allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  From “Fast Delivery” to “Sustainable Evolution”
&lt;/h2&gt;

&lt;p&gt;Early on, many teams go through a phase of stacking tables in ADS for fast delivery. This feels responsive at first, but over time problems emerge—delivery layers balloon, shared layers hollow out, and maintenance costs soar.&lt;/p&gt;

&lt;p&gt;A healthier model: &lt;strong&gt;gradually thicken the shared layer (DWS), keep the delivery layer light, and continuously recover general capabilities back to DWS.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This also implies delivery tables must support lifecycle management. Track usage frequency, retire low-value tables, or recycle general fields and metrics back to the shared layer to avoid duplication.&lt;/p&gt;

&lt;p&gt;Ultimately, a mature data system is not “built fast,” but “used long.” Layered DWS and ADS design underpins this long-term evolution.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ads</category>
      <category>database</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>SeaTunnel Gravitino: Schema URL–Driven Automatic Table Structure Detection</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 20 Mar 2026 09:53:29 +0000</pubDate>
      <link>https://forem.com/seatunnel/seatunnel-x-gravitino-schema-url-driven-automatic-table-structure-detection-3e59</link>
      <guid>https://forem.com/seatunnel/seatunnel-x-gravitino-schema-url-driven-automatic-table-structure-detection-3e59</guid>
      <description>&lt;p&gt;Recently, the community published an article titled &lt;a href="https://medium.com/@apacheseatunnel/say-goodbye-to-hand-written-schemas-bedbf1a49cf3" rel="noopener noreferrer"&gt;“Say Goodbye to Hand-Written Schemas! SeaTunnel’s Integration with Gravitino Metadata REST API Is a Really Cool Move”&lt;/a&gt;, which drew strong reactions from readers, with many saying, “This is really awesome!”&lt;/p&gt;

&lt;p&gt;The contributor behind this feature is extremely proactive, and it’s expected to be available soon (according to reliable sources, likely in version 3.0.0). To help the community better understand it, the contributor wrote a detailed article explaining the initial capabilities of the Gravitino REST API and how to use it—let’s take a closer look!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Background and Problems to Solve
&lt;/h2&gt;

&lt;p&gt;When using Apache SeaTunnel for batch or sync tasks, if the source is unstructured or semi-structured, &lt;strong&gt;the source usually requires an explicit schema definition&lt;/strong&gt; (field names, types, order).&lt;/p&gt;

&lt;p&gt;In real production environments, this leads to several typical issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables have many fields and complex types, making manual schema maintenance costly and error-prone&lt;/li&gt;
&lt;li&gt;Upstream table structure changes (adding fields, changing types) require corresponding updates to SeaTunnel jobs&lt;/li&gt;
&lt;li&gt;For existing tables, simply syncing data still requires repeated metadata description, leading to redundancy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the core question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can SeaTunnel directly reuse table structure definitions from an existing metadata system, instead of declaring schema repeatedly in jobs?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This feature was introduced to solve this problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Introduction to Gravitino (Relevant Capabilities)
&lt;/h2&gt;

&lt;p&gt;Gravitino is a unified metadata management and access service, providing standardized REST APIs to manage and expose the following objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metalake (logical isolation unit)&lt;/li&gt;
&lt;li&gt;Catalogs (e.g., MySQL, Hive, Iceberg)&lt;/li&gt;
&lt;li&gt;Schema / Database&lt;/li&gt;
&lt;li&gt;Table and its field definitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Gravitino:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table structures can be &lt;strong&gt;centrally managed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Downstream systems can dynamically fetch schema definitions via &lt;strong&gt;HTTP APIs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No need to maintain field information in every compute or sync job&lt;/li&gt;
&lt;/ul&gt;
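&lt;p&gt;For instance, fetching a table definition is a plain HTTP call along these lines (host, port, and all object names below are placeholders; check the Gravitino REST API documentation for the exact path in your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s http://localhost:8090/api/metalakes/demo_metalake/catalogs/mysql_catalog/schemas/test/tables/demo_user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;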

&lt;p&gt;The new capability introduced in SeaTunnel is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Support for automatically pulling table structures via &lt;code&gt;schema_url&lt;/code&gt; provided by Gravitino in the source schema definition.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Local Test Environment Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Prepare MySQL Environment
&lt;/h3&gt;

&lt;h4&gt;
  
  
  3.1.1 Create Target Table
&lt;/h4&gt;

&lt;p&gt;Pre-create the target table &lt;code&gt;test.demo_user&lt;/code&gt; in MySQL with the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`demo_user`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="nb"&gt;unsigned&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="n"&gt;AUTO_INCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`user_code`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`user_name`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`password`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`email`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`phone`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`gender`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`age`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`level`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`score`&lt;/span&gt; &lt;span class="nb"&gt;decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`balance`&lt;/span&gt; &lt;span class="nb"&gt;decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`is_deleted`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`register_ip`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`last_login_ip`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`login_count`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`remark`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext1`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext2`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext3`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext4`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext5`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`created_by`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`updated_by`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`created_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`updated_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`birthday`&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`last_login_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`version`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`id`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="nv"&gt;`uk_user_code`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`user_code`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InnoDB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;CHARSET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;utf8mb4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3.1.2 Create the Table Schema to Sync
&lt;/h4&gt;

&lt;p&gt;In production, table structures are often managed centrally in systems such as &lt;code&gt;paimon&lt;/code&gt;, &lt;code&gt;hive&lt;/code&gt;, or &lt;code&gt;hudi&lt;/code&gt;. For this test, the table schema points to the target table &lt;code&gt;test.demo_user&lt;/code&gt; created in the previous step.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Register the Table Schema in Gravitino
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Gravitino can connect directly to a database and scan all of its tables.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fablntfy94wjck5i4cpj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fablntfy94wjck5i4cpj9.png" alt="img" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This table is managed in Gravitino as a table under the &lt;code&gt;local-mysql&lt;/code&gt; catalog.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze7w3w0izx2jq4teno24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze7w3w0izx2jq4teno24.png" alt="img\_1" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metalake: &lt;code&gt;test_Metalake&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Table Structure Access Explanation
&lt;/h3&gt;

&lt;p&gt;Table structures in Gravitino can be accessed via the REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8090/api/metalakes/test_Metalake/catalogs/${catalog}/schemas/${schema}/tables/${table}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this test, the actual &lt;code&gt;schema_url&lt;/code&gt; used is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The returned JSON contains the complete field definitions of the &lt;code&gt;demo_user&lt;/code&gt; table.&lt;/p&gt;
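&lt;p&gt;For reference, a trimmed sketch of the response shape (the field names follow the &lt;code&gt;demo_user&lt;/code&gt; DDL above; the exact JSON envelope and type spellings may vary by Gravitino version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "code": 0,
  "table": {
    "name": "demo_user",
    "columns": [
      { "name": "id",          "type": "long",        "nullable": false },
      { "name": "login_count", "type": "integer",     "nullable": true },
      { "name": "created_by",  "type": "varchar(64)", "nullable": true }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;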

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpvr81wp05j9l2hupvha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpvr81wp05j9l2hupvha.png" alt="img\_2" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Local Deployment of SeaTunnel
&lt;/h3&gt;

&lt;p&gt;Since this feature hasn’t been officially released, you need to manually compile the latest &lt;code&gt;dev&lt;/code&gt; branch and deploy it locally.&lt;/p&gt;
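&lt;p&gt;A typical build sketch (the repository URL is the official one; the exact module list and flags are assumptions, so consult the project build docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/apache/seatunnel.git
cd seatunnel
git checkout dev
./mvnw clean package -pl seatunnel-dist -am -DskipTests
# unpack the tarball produced under seatunnel-dist/target/ and deploy it locally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;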

&lt;h3&gt;
  
  
  3.5 Prepare Data Files
&lt;/h3&gt;

&lt;p&gt;This test case uses a CSV file containing 2,000 records.&lt;/p&gt;
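&lt;p&gt;Illustrative first lines of such a file (the values are made up; the column order must match the schema that Gravitino returns for &lt;code&gt;demo_user&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,U0001,Alice,5,admin
2,U0002,Bob,12,admin
3,U0003,Carol,0,admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;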

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb03go0l8uj41xwhct7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb03go0l8uj41xwhct7m.png" alt="img\_3" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. SeaTunnel Job Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core Configuration Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BATCH"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;LocalFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/wangxuepeng/Desktop/seatunnel/apache-seatunnel-2.3.13-SNAPSHOT/test_data"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;schema_url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://localhost:3306/test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.mysql.cj.jdbc.Driver"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"demo_user"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;generate_sink_sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 Key Configuration Notes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;schema.schema_url&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Points to the table metadata REST API in Gravitino&lt;/li&gt;
&lt;li&gt;SeaTunnel automatically fetches the table schema at job start&lt;/li&gt;
&lt;li&gt;No need to manually declare field lists in jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;generate_sink_sql = true&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sink automatically generates INSERT SQL based on the parsed schema&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
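&lt;p&gt;For comparison, without &lt;code&gt;schema_url&lt;/code&gt; the same source would have to declare every field by hand (an abbreviated sketch; the type names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schema {
  fields {
    id = bigint
    login_count = int
    created_by = string
    created_time = timestamp
    # ...every remaining demo_user column, repeated in every job that reads this table
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;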

&lt;h2&gt;
  
  
  5. Data and Job Execution Results
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Log screenshot:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffud8tvig4gx1xpzxdrtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffud8tvig4gx1xpzxdrtm.png" alt="img\_4" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During job execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source automatically parses field structure via &lt;code&gt;schema_url&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CSV fields automatically align with the table schema&lt;/li&gt;
&lt;li&gt;Data successfully written to MySQL &lt;code&gt;demo_user&lt;/code&gt; table&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Supported Connectors
&lt;/h3&gt;

&lt;p&gt;Currently, the &lt;code&gt;dev&lt;/code&gt; branch supports file-type connectors including &lt;code&gt;local&lt;/code&gt;, &lt;code&gt;hdfs&lt;/code&gt;, &lt;code&gt;s3&lt;/code&gt;, etc.&lt;/p&gt;
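&lt;p&gt;The usage pattern is the same across these connectors. For example, an &lt;code&gt;HdfsFile&lt;/code&gt; source could point at the same Gravitino endpoint (a sketch; the &lt;code&gt;fs.defaultFS&lt;/code&gt; value and path are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode:8020"
    path = "/data/demo_user"
    file_format_type = "csv"
    schema {
      schema_url = "http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;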

&lt;h3&gt;
  
  
  6.2 Does &lt;code&gt;schema_url&lt;/code&gt; support multiple tables?
&lt;/h3&gt;

&lt;p&gt;This feature does not interfere with multi-table configuration; the two can be combined, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;LocalFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;tables_configs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/seatunnel/read/metalake/table1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;field_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;row_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;skip_header_row_number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.table1"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;string&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;int&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_boolean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;boolean&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_double&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;double&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/seatunnel/read/metalake/table2"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;field_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;row_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;skip_header_row_number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.table2"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;schema_url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://gravitino:8090/api/metalakes/test_metalake/catalogs/test_catalog/schemas/test_schema/tables/table2"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Feature Summary
&lt;/h2&gt;

&lt;p&gt;By introducing &lt;strong&gt;Gravitino &lt;code&gt;schema_url&lt;/code&gt;–based automatic schema parsing&lt;/strong&gt;, SeaTunnel gains the following advantages in data sync scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminates repeated schema definitions, reducing job configuration complexity&lt;/li&gt;
&lt;li&gt;Reuses a unified metadata management system, improving consistency&lt;/li&gt;
&lt;li&gt;Keeps jobs resilient to table structure changes, significantly lowering maintenance costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feature is ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises with mature metadata platforms&lt;/li&gt;
&lt;li&gt;Large tables with many fields or frequent schema changes&lt;/li&gt;
&lt;li&gt;Users seeking improved maintainability of SeaTunnel jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  8. References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code PR&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/10402" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10402&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;schema_url&lt;/code&gt; Configuration Docs&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gravitino</category>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Apache SeaTunnel 2.3.13 Major Release! Top 10 Features You Should Know</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 20 Mar 2026 09:40:59 +0000</pubDate>
      <link>https://forem.com/seatunnel/apache-seatunnel-2313-major-release-top-10-features-you-should-know-94i</link>
      <guid>https://forem.com/seatunnel/apache-seatunnel-2313-major-release-top-10-features-you-should-know-94i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqif2qqdenxyzo3u7zwsg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqif2qqdenxyzo3u7zwsg.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
The Apache SeaTunnel community has officially released &lt;strong&gt;version 2.3.13&lt;/strong&gt;! This release is a milestone for Apache SeaTunnel, bringing important features such as &lt;strong&gt;the Checkpoint API, a Flink engine upgrade, parallel processing of large files, multi-table sync, an AI Embedding Transform, and richer connector extensions&lt;/strong&gt;. Whether for batch processing or real-time CDC syncing to a Lakehouse, SeaTunnel can now support your data integration tasks more efficiently, stably, and intelligently.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;strong&gt;50+ community contributors&lt;/strong&gt;, this release includes &lt;strong&gt;100+ PRs&lt;/strong&gt; of new features, optimizations, and bug fixes. If you are building &lt;strong&gt;data warehouses, real-time sync platforms, or AI data pipelines&lt;/strong&gt;, this release is worth your attention.&lt;/p&gt;

&lt;p&gt;No time to read the full Release Notes? No worries: here are the &lt;strong&gt;Top 10 features of this release&lt;/strong&gt;, with PR links if you want to dig deeper.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Release Note: &lt;a href="https://github.com/apache/seatunnel/releases/tag/2.3.13" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/releases/tag/2.3.13&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  01 New Checkpoint API Enhances Task Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;In data sync tasks, checkpoints are one of the core mechanisms to ensure task reliability. SeaTunnel 2.3.13 introduces &lt;strong&gt;Checkpoint API&lt;/strong&gt; (#10065), making task state management more flexible and providing a solid foundation for future scheduling and operation capabilities. The Zeta engine supports &lt;strong&gt;min-pause configuration&lt;/strong&gt; (#9804) to avoid system pressure caused by frequent checkpoints.&lt;/p&gt;
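&lt;p&gt;Checkpoint behaviour is tuned in the job's &lt;code&gt;env&lt;/code&gt; block. A minimal sketch (&lt;code&gt;checkpoint.interval&lt;/code&gt; is a documented option; the exact key for the new min-pause setting from #9804 should be taken from the release docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env {
  parallelism = 2
  job.mode = "STREAMING"
  checkpoint.interval = 30000   # milliseconds between checkpoints
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;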

&lt;p&gt;Monitoring has also been enhanced: Sink commit metrics with commit-rate calculation (#10233), PendingJobs information in the task overview interface (#9902), and a REST API for viewing the Pending queue (#10078).&lt;/p&gt;

&lt;p&gt;These capabilities help users better understand task execution status and optimize checkpoint strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  02 Flink 1.20.1 Support and Enhanced CDC
&lt;/h2&gt;

&lt;p&gt;On the engine side, this version improves Apache Flink support. SeaTunnel now supports &lt;strong&gt;Flink 1.20.1&lt;/strong&gt; (#9576), and CDC sync capabilities have been enhanced. CDC Source now supports &lt;strong&gt;Schema Evolution&lt;/strong&gt; (#9867), automatically adapting sync tasks to source table structure changes.&lt;/p&gt;

&lt;p&gt;Additionally, NO_CDC Source also supports checkpoints (#10094), improving task recovery. These changes make SeaTunnel more stable in scenarios with frequent database schema changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  03 Large File Parallel Reading Significantly Improved
&lt;/h2&gt;

&lt;p&gt;In real data platforms, large amounts of data often exist as files, such as HDFS, object storage, or local file systems.&lt;/p&gt;

&lt;p&gt;This release significantly optimizes file processing performance. HDFS File Connector supports true large file parallel splitting (#10332), LocalFile Connector supports CSV, Text, JSON large file parallel reading (#10142), and Parquet files now support Logical Split (#10239).&lt;/p&gt;

&lt;p&gt;HDFS File also supports multi-table reading (#9816). These improvements significantly increase throughput for TB-scale file processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  04 File Connector Adds Update Sync Mode
&lt;/h2&gt;

&lt;p&gt;Previously, file sync tasks only supported append or overwrite. In this version, multiple file connectors add &lt;strong&gt;sync_mode=update&lt;/strong&gt;, including FTP, SFTP, and LocalFile Source (#10437), and HdfsFile Source (#10268). This allows file sync tasks to support update semantics, better fitting incremental data processing scenarios.&lt;/p&gt;
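&lt;p&gt;A hypothetical usage sketch, based only on the option name cited above (check the linked PRs for each connector's exact semantics):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source {
  LocalFile {
    path = "/data/incremental"
    file_format_type = "csv"
    sync_mode = "update"   # new in 2.3.13; previously only append/overwrite behaviour was possible
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;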

&lt;h2&gt;
  
  
  05 Connector Ecosystem Expansion
&lt;/h2&gt;

&lt;p&gt;SeaTunnel 2.3.13 continues to expand and enhance the connector ecosystem. For analytical databases, it adds DuckDB Source and Sink support (#10285), suitable for local analysis and data exploration.&lt;/p&gt;

&lt;p&gt;New or enhanced connectors include Apache HugeGraph Sink (#10002), AWS DSQL Sink (#9739), Lance Dataset Sink (#9894), IoTDB 2.x Source and Sink (#9872).&lt;/p&gt;

&lt;p&gt;Existing connectors have also been improved: PostgreSQL supports TIMESTAMP_TZ (#10048), Hive Sink supports SchemaSaveMode and DataSaveMode (#9743), MongoDB Sink supports multi-table writing and adds SaveMode (#9958 / #9883).&lt;/p&gt;

&lt;p&gt;These updates significantly improve SeaTunnel’s adaptability in database and Lakehouse scenarios and the efficiency of building data pipelines.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Connector&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Feature Highlights&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Analytical DB&lt;/td&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Source/Sink&lt;/td&gt;
&lt;td&gt;Read and write data from DuckDB, suitable for local analysis and exploration&lt;/td&gt;
&lt;td&gt;#10285&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph DB&lt;/td&gt;
&lt;td&gt;Apache HugeGraph&lt;/td&gt;
&lt;td&gt;Sink&lt;/td&gt;
&lt;td&gt;Write data into HugeGraph&lt;/td&gt;
&lt;td&gt;#10002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Lakehouse&lt;/td&gt;
&lt;td&gt;AWS DSQL&lt;/td&gt;
&lt;td&gt;Sink&lt;/td&gt;
&lt;td&gt;Write data into AWS DSQL&lt;/td&gt;
&lt;td&gt;#9739&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File/Dataset&lt;/td&gt;
&lt;td&gt;Lance Dataset&lt;/td&gt;
&lt;td&gt;Sink&lt;/td&gt;
&lt;td&gt;Write data into Lance Dataset&lt;/td&gt;
&lt;td&gt;#9894&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time Series DB&lt;/td&gt;
&lt;td&gt;IoTDB 2.x&lt;/td&gt;
&lt;td&gt;Source/Sink&lt;/td&gt;
&lt;td&gt;Add IoTDB 2.x source and sink support&lt;/td&gt;
&lt;td&gt;#9872&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relational DB&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;Support TIMESTAMP_TZ type&lt;/td&gt;
&lt;td&gt;#10048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Warehouse&lt;/td&gt;
&lt;td&gt;Hive&lt;/td&gt;
&lt;td&gt;Sink&lt;/td&gt;
&lt;td&gt;Support SchemaSaveMode and DataSaveMode&lt;/td&gt;
&lt;td&gt;#9743&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document DB&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;Sink&lt;/td&gt;
&lt;td&gt;Support multi-table write and new SaveMode&lt;/td&gt;
&lt;td&gt;#9958 / #9883&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  06 Kafka Supports Protobuf Schema Registry
&lt;/h2&gt;

&lt;p&gt;In real-time scenarios, Kafka often uses Schema Registry. This release adds &lt;strong&gt;Protobuf Schema Registry Wire Format support&lt;/strong&gt; (#10183) to Kafka Connector, allowing SeaTunnel to directly parse Protobuf data managed via Schema Registry, making real-time pipeline construction easier.&lt;/p&gt;
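&lt;p&gt;A hedged sketch of such a source (&lt;code&gt;format = "protobuf"&lt;/code&gt; and the message-name option appear in the Kafka connector docs; the Schema Registry wire-format wiring added in #10183 should be confirmed there):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source {
  Kafka {
    bootstrap.servers = "kafka:9092"
    topic = "events"
    format = "protobuf"
    protobuf_message_name = "Event"   # assumption: message type registered for this topic
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;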

&lt;h2&gt;
  
  
  07 New AI Embedding Transform
&lt;/h2&gt;

&lt;p&gt;As AI and data engineering continue to converge, more companies need vector data pipelines.&lt;/p&gt;

&lt;p&gt;SeaTunnel adds &lt;strong&gt;Multimodal Embedding Transform&lt;/strong&gt; (#9673) in the Transform component, generating vector data directly in pipelines for vector databases, RAG systems, and AI retrieval applications. &lt;strong&gt;RegexExtract Transform&lt;/strong&gt; (#9829) further enhances data cleaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  08 Markdown Parser Supports RAG Scenarios
&lt;/h2&gt;

&lt;p&gt;Markdown documents are common in AI data preparation. This release adds &lt;strong&gt;Markdown Parser&lt;/strong&gt; (#9760) and related documentation (#9834) for parsing and structuring Markdown, facilitating RAG pipeline construction.&lt;/p&gt;

&lt;h2&gt;
  
  
  09 Stability and Performance Improvements
&lt;/h2&gt;

&lt;p&gt;This release includes numerous stability and performance optimizations, such as ClickHouse Connector parallel read strategy (#9801), MySQL Connector shard calculation (#9975), JSON parsing for nested structures (#10000), Zeta engine task metrics (#9833), and more.&lt;/p&gt;

&lt;p&gt;It also fixes production issues like Zeta engine memory leak on task cancellation (#10315), ClickHouse ThreadLocal memory leak (#10264), MongoDB multi-task submit (#10116), HBase Source scan exception (#10287), Hive Sink init failure (#10331), etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  10 Bug Fixes and Documentation Updates
&lt;/h2&gt;

&lt;p&gt;Fixes include CDC Snapshot Split null pointer (#10404), ClickHouse memory leak (#10264), MongoDB multi-task submit (#10064, #10116), HBase scan exceptions (#10336, #10287), JDBC schema merge overflow (#10387, #9942, #10093), Hive Sink overwrite semantics (#10279, #9823, #9743), Elasticsearch Sink task exit issue (#10038), and other Connector, Transform, Engine, UI, CI fixes (#10422, #10013, etc.).&lt;/p&gt;

&lt;p&gt;Documentation improvements include SeaTunnel MCP &amp;amp; x2SeaTunnel docs (#10108), connector config examples (#10283, #10250, #10241, #10202), multi-table sync examples (#10241), upgrade incompatibility notes (#10068), and doc structure optimizations (#10262, #10395, #10351, #10420, #10438, #10424, #10109, #10382, #10385), helping new users get started and developers better understand architecture and features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thanks to Contributors ❤️
&lt;/h2&gt;

&lt;p&gt;Special thanks to release manager @xiaochen-zhou for strong support in planning and execution. Thanks to all volunteers; your efforts keep the SeaTunnel community growing!&lt;/p&gt;

&lt;p&gt;Adam Wang, AzkabanWarden.Gf, Bo Schuster, cloud456, CloverDew, corgy-w, CosmosNi, Cyanty, David Zollo, dotfive-star, dy102, dyp12, Frui Guo, Jarvis, Jast, Jeremy, JeremyXin, Jia Fan, Joonseo Lee, krutoileshii, 老王, Leon Yoah, Li Dongxu, LiJie20190102, limin, LimJiaWenBrenda, liucongjy, loupipalien, mengxpgogogo-eng, misi, 巧克力黑, shfshihuafeng, silenceland, Sim Chou, Steven Zhao, wanmingshi, wtybxqm, yzeng1618, zhan7236, zhangdonghao, zhuxt2015, zy&lt;/p&gt;

&lt;h2&gt;
  
  
  Download &amp;amp; Try
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Download: &lt;a href="https://seatunnel.apache.org/download" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Upgrade Guide: &lt;a href="https://seatunnel.apache.org/docs/upgrade-guide" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/upgrade-guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upgrade Note&lt;/strong&gt;: If you are on &lt;strong&gt;SeaTunnel 2.3.x&lt;/strong&gt;, upgrading to 2.3.13 is generally safe, as this release focuses on feature enhancements and stability. Back up your config files and test in a staging environment first. For jobs that use checkpoints, stop them and confirm state consistency before upgrading to avoid checkpoint conflicts. Check for connector config changes (Hive, MongoDB, Kafka). If you run on the Flink engine, consider upgrading to Flink 1.20.x for better compatibility and CDC support.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>database</category>
    </item>
    <item>
      <title>Apache DolphinScheduler 3.4.1 Released with Task Dispatch Timeout Detection</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 13 Mar 2026 08:02:50 +0000</pubDate>
      <link>https://forem.com/seatunnel/apache-dolphinscheduler-341-released-with-task-dispatch-timeout-detection-3i5c</link>
      <guid>https://forem.com/seatunnel/apache-dolphinscheduler-341-released-with-task-dispatch-timeout-detection-3i5c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ia1426s8ss2jsv1x1wy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ia1426s8ss2jsv1x1wy.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The community has officially released Apache DolphinScheduler &lt;strong&gt;3.4.1&lt;/strong&gt;. As a maintenance release in the &lt;strong&gt;3.4.x series&lt;/strong&gt;, this update focuses on &lt;strong&gt;improving scheduling stability, enhancing task execution control, and fixing system issues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The new version introduces a &lt;strong&gt;task dispatch timeout detection mechanism&lt;/strong&gt; and &lt;strong&gt;maximum runtime control for tasks&lt;/strong&gt;, while also resolving multiple issues in scheduling logic, plugin functionality, and API behavior. In addition, system documentation, development processes, and project structure have been further optimized.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For more details, see the Release Note:
&lt;a href="https://github.com/apache/dolphinscheduler/releases/tag/3.4.1" rel="noopener noreferrer"&gt;https://github.com/apache/dolphinscheduler/releases/tag/3.4.1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source code download:
&lt;a href="https://dolphinscheduler.apache.org/zh-cn/download/3.4.1" rel="noopener noreferrer"&gt;https://dolphinscheduler.apache.org/zh-cn/download/3.4.1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Key Highlights
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Task Dispatch Timeout Detection Mechanism
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;task dispatch timeout checking logic&lt;/strong&gt; has been added to the Master scheduling module. When a task is dispatched to a Worker for execution, if the &lt;strong&gt;Worker Group does not exist or no Worker nodes are available&lt;/strong&gt;, the scheduler can detect the dispatch exception within a certain period and handle it accordingly.&lt;/p&gt;

&lt;p&gt;This mechanism prevents tasks from remaining in a waiting state for an extended time and improves the system’s fault tolerance in scenarios involving resource anomalies (#17795, #17796).&lt;/p&gt;

&lt;h2&gt;
  
  
  Support for Configuring Maximum Runtime for Workflow and Task Instances
&lt;/h2&gt;

&lt;p&gt;The new version allows users to configure a &lt;strong&gt;maximum runtime&lt;/strong&gt; for both &lt;strong&gt;Workflow Instances&lt;/strong&gt; and &lt;strong&gt;Task Instances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Users can define the maximum execution duration for tasks or workflows. If the runtime exceeds the configured threshold, the system can trigger timeout handling mechanisms, preventing tasks from hanging or occupying resources indefinitely and improving overall operational controllability (#17931, #17932).&lt;/p&gt;

&lt;h1&gt;
  
  
  Key Fixes and Improvements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scheduling System Stability Fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fixed an issue where &lt;strong&gt;task timeout alerts were not triggered&lt;/strong&gt; (#17820, #17818)&lt;/li&gt;
&lt;li&gt;Fixed the issue where the &lt;strong&gt;workflow failure strategy did not take effect&lt;/strong&gt; (#17834, #17851)&lt;/li&gt;
&lt;li&gt;Automatically mark a task as failed when &lt;strong&gt;task execution context initialization fails&lt;/strong&gt; (#17758, #17821)&lt;/li&gt;
&lt;li&gt;Fixed incorrect &lt;strong&gt;parallelism calculation in backfill tasks under parallel execution mode&lt;/strong&gt; (#17831, #17853)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Database and Compatibility Fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fixed SQL execution errors for &lt;strong&gt;dependent tasks in PostgreSQL environments&lt;/strong&gt; (#17690, #17837)&lt;/li&gt;
&lt;li&gt;Fixed mismatched &lt;strong&gt;INT/BIGINT column types in database tables&lt;/strong&gt; (#17979, #17988)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  API and Permission Fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Removed the &lt;code&gt;WAIT_TO_RUN&lt;/code&gt; state and added a &lt;strong&gt;FAILOVER state&lt;/strong&gt; when querying workflow instances (#17838, #17839)&lt;/li&gt;
&lt;li&gt;Added &lt;strong&gt;tenant validation&lt;/strong&gt; for the Workflow API (#17969, #17970)&lt;/li&gt;
&lt;li&gt;Fixed an issue where &lt;strong&gt;non-admin users could not delete their own Access Tokens&lt;/strong&gt; (#17995, #17997)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Plugin and Task Execution Fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fixed incorrect &lt;strong&gt;JVM parameter position in Java Task&lt;/strong&gt; (#17848, #17850)&lt;/li&gt;
&lt;li&gt;Fixed an issue where &lt;strong&gt;Procedure Task parameters could not be passed correctly&lt;/strong&gt; (#17967, #17968)&lt;/li&gt;
&lt;li&gt;Fixed the issue where &lt;strong&gt;ProcedureTask could not return parameters or execute query stored procedures&lt;/strong&gt; (#17971, #17973)&lt;/li&gt;
&lt;li&gt;Fixed an issue where the &lt;strong&gt;HTTP plugin could not send nested JSON structures&lt;/strong&gt; (#17912, #17911)&lt;/li&gt;
&lt;li&gt;Fixed inconsistent &lt;strong&gt;timeout units in the HTTP alert plugin&lt;/strong&gt; (#17915, #17920)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  UI and Documentation Fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Removed the &lt;strong&gt;STOP state&lt;/strong&gt; from task instances in the UI (#17864, #17865)&lt;/li&gt;
&lt;li&gt;Fixed an issue where &lt;strong&gt;locks were not released when workflow definition list loading failed&lt;/strong&gt; (#17984, #17989)&lt;/li&gt;
&lt;li&gt;Fixed the &lt;strong&gt;Keycloak login icon 404 issue&lt;/strong&gt; (#18006, #18007)&lt;/li&gt;
&lt;li&gt;Corrected errors in the &lt;strong&gt;installation documentation&lt;/strong&gt; (#17901, #17903)&lt;/li&gt;
&lt;li&gt;Fixed a &lt;strong&gt;SeaTunnel documentation link 404 issue&lt;/strong&gt; (#17904, #17905)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  In-Depth Feature Analysis
&lt;/h1&gt;

&lt;p&gt;In modern data platform architectures, scheduling systems often serve as key infrastructure connecting various computing engines. Tasks from systems such as Apache Spark, Apache Flink, and Apache Hive are commonly orchestrated through a unified scheduler.&lt;/p&gt;

&lt;p&gt;However, in production environments, scheduling systems often face challenges such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker resource anomalies preventing tasks from being scheduled&lt;/li&gt;
&lt;li&gt;Uncontrollable task execution time&lt;/li&gt;
&lt;li&gt;Unstable plugin execution behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The newly introduced &lt;strong&gt;task dispatch timeout detection mechanism&lt;/strong&gt; enables the scheduler to quickly identify anomalies when Workers do not exist or resources are unavailable, preventing tasks from waiting indefinitely (#17795, #17796).&lt;/p&gt;

&lt;p&gt;At the same time, the &lt;strong&gt;maximum runtime control capability&lt;/strong&gt; provides a more flexible management approach for task execution. By setting a maximum runtime for workflows or tasks, the system can take action when tasks hang or run abnormally long, preventing resources from being occupied for extended periods (#17931, #17932).&lt;/p&gt;

&lt;p&gt;These improvements further enhance DolphinScheduler’s &lt;strong&gt;stability and controllability in production-grade data platform environments&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Acknowledgements
&lt;/h1&gt;

&lt;p&gt;The release of &lt;strong&gt;Apache DolphinScheduler 3.4.1&lt;/strong&gt; would not have been possible without the contributions of community developers. Special thanks to the release manager &lt;strong&gt;@ruanwenjun&lt;/strong&gt; and the following contributors for their work on this version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SbloodyS&lt;/li&gt;
&lt;li&gt;njnu-seafish&lt;/li&gt;
&lt;li&gt;Mrhs121&lt;/li&gt;
&lt;li&gt;ylq5126&lt;/li&gt;
&lt;li&gt;qiong-zhou&lt;/li&gt;
&lt;li&gt;XpengCen&lt;/li&gt;
&lt;li&gt;iampratap7997-dot&lt;/li&gt;
&lt;li&gt;yzeng1618&lt;/li&gt;
&lt;li&gt;Alexander1902&lt;/li&gt;
&lt;li&gt;maomao199691&lt;/li&gt;
&lt;li&gt;asadjan4611&lt;/li&gt;
&lt;li&gt;dill21yu&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Apache DolphinScheduler 3.4.1&lt;/strong&gt; is a maintenance release focused on &lt;strong&gt;improving scheduling stability and enhancing task runtime control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With the introduction of scheduling fault-tolerance mechanisms, maximum task runtime control, and numerous bug fixes, this version further strengthens the system’s reliability in production environments.&lt;/p&gt;

&lt;p&gt;As the community continues to grow, Apache DolphinScheduler is steadily improving its capabilities in the data workflow orchestration space, providing enterprises with a more stable and efficient infrastructure for building modern data platforms. We welcome more contributors to join the community and help drive the development of the project forward.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
