<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: WhaleOps</title>
    <description>The latest articles on Forem by WhaleOps (@whaleops).</description>
    <link>https://forem.com/whaleops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1002315%2Fae90ceb1-5b7c-43f9-b2be-f64844ee360d.png</url>
      <title>Forem: WhaleOps</title>
      <link>https://forem.com/whaleops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/whaleops"/>
    <language>en</language>
    <item>
      <title>Tribute to the passing of Teradata Automation!</title>
      <dc:creator>WhaleOps</dc:creator>
      <pubDate>Mon, 27 Feb 2023 02:47:24 +0000</pubDate>
      <link>https://forem.com/whaleops/tribute-to-the-passing-of-teradata-automation-19kd</link>
      <guid>https://forem.com/whaleops/tribute-to-the-passing-of-teradata-automation-19kd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpjagkxloioq8vburuou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpjagkxloioq8vburuou.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On February 15, 2023, Teradata officially withdrew from China after 26 years. As a fellow professional data company, WhaleOps deeply regrets this. As an editor at WhaleOps, I am also a fan of Teradata and have long followed the development of its various product lines. While everyone is wondering about the future of the Teradata data warehouse, they overlook the fact that Teradata has a secret weapon: Teradata Automation, the data warehouse scheduling suite that ships with the Teradata data warehouse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rapid growth of Teradata around the world, and especially in Greater China, is inseparable from the assistance of Teradata Automation. Today, we look back at the history of Teradata Automation and ahead to its future. We also hope that DolphinScheduler and WhaleScheduler, which have paid tribute to Automation since their birth, can take over its mantle and continue to benefit the next generation of schedulers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Evolution of Teradata Automation
&lt;/h2&gt;

&lt;p&gt;From its birth, Teradata differed from other data warehouses (Oracle, DB2). It abandoned the then-common ETL architecture in favor of the ELT model that the big data field now delights in talking about. That is, its overall solution does not use Informatica/DataStage/Talend to extract and transform the data source; instead, the source system is exported as interface files, which enter the data warehouse staging layer through FastLoad/MultiLoad/TPump/Parallel Transporter (TPT) (those interested in alternative tools can refer to the open-source Apache SeaTunnel, or WhaleTunnel, the commercial version from WhaleOps). TeradataSQL scripts are then executed through BTEQ, and all scripts run according to the triggers and dependencies (DAG) between tasks.&lt;/p&gt;
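
&lt;p&gt;To make the trigger-and-dependency (DAG) idea concrete, here is a minimal, hypothetical Python sketch (not Automation's actual code) that orders task scripts so each one runs only after all of its upstream tasks have completed:&lt;/p&gt;

```python
from collections import deque

def schedule(dependencies):
    """Return an execution order for tasks given their upstream dependencies.

    `dependencies` maps each task name to the tasks it must wait for --
    the same task-level trigger/dependency (DAG) idea Automation applied.
    Raises ValueError if the dependencies contain a cycle (not a DAG).
    """
    indegree = {task: len(ups) for task, ups in dependencies.items()}
    downstream = {task: [] for task in dependencies}
    for task, ups in dependencies.items():
        for up in ups:
            downstream[up].append(task)
    # Tasks with no upstream dependencies are ready immediately.
    ready = deque(sorted(t for t, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(dependencies):
        raise ValueError("cycle detected: dependencies are not a DAG")
    return order
```

&lt;p&gt;For example, a bulk load followed by a BTEQ transform and a final report would be declared as &lt;code&gt;schedule({"load": [], "transform": ["load"], "report": ["transform"]})&lt;/code&gt;.&lt;/p&gt;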

&lt;p&gt;Does this architecture of Teradata Automation sound familiar? Yes, Oozie, Azkaban, and Airflow, which became popular later in the big data field, all share this logical architecture. They pale in comparison with Teradata Automation, which was developed back in the 1990s!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Although Automation has been around for ages, as the originator of ELT scheduling tools and the most widely used ELT scheduling tool in enterprise settings, it holds a great advantage in its comprehensive coverage of business cases.&lt;/em&gt; Therefore, with those business cases in mind, the design of Apache DolphinScheduler, including many functions of the commercial WhaleScheduler, still shows the shadow of this tribute to Teradata Automation. Teradata Automation offers task-level scheduling control, however, while DolphinScheduler and WhaleScheduler schedule at the workflow + task level; that design pays tribute to Informatica, another global ETL scheduling predecessor (whose story I will tell later).&lt;/p&gt;

&lt;p&gt;The first version of Automation was written in Perl by Jet Wu, a Teradata Taiwan employee. It became famous for being lightweight, simple in structure, and stable. It also uses flag files to work around Teradata's weak OLTP performance, and it quickly became popular in Teradata's global projects. Later, it was modified by great engineers such as Wang Yongfu (Teradata) in China to further improve usability. The directory catalog of Automation in Greater China looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ETL (Automation home directory)
 | - -/APP Stores ETL task scripts; create the subsystem directory first, then the ETL task directory under it
 | - -/DATA
 | - - - /complete Stores data that has executed successfully; subdirectories created by system name and date
 | - - - /fail
 | - - - - -/bypass Stores files that do not need to be executed; subdirectories by system name and date
 | - - - - -/corrupt Stores files whose size does not match; subdirectories by system name and date
 | - - - - -/duplicate Stores files received in duplicate; subdirectories by system name and date
 | - - - - -/error Stores files that produced errors at runtime; subdirectories by system name and date
 | - - - - -/unknown Stores files not defined in the ETL Automation mechanism; subdirectories by date
 | - - - /message Stores the control files used to send message notifications
 | - - - /process Stores data files and control files used by jobs currently executing
 | - - - /queue Stores data files and control files used by jobs ready to execute
 | - - - /receive Receives data files and control files from the various source systems
 | - -/LOG Stores log files generated by the ETL Automation system program and by each job run
 | - -/bin Stores the executables of the ETL Automation system program
 | - -/etc Stores configuration files of the ETL Automation mechanism
 | - -/lock Stores lock files generated by the ETL Automation system program during each job run
 | - -/tmp Temporary buffer directory for temporary files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the beginning, the interface looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd14x15ktmru6tbwfmnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd14x15ktmru6tbwfmnh.png" alt=" " width="570" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This version of Automation endured for more than 10 years. Later, as the number of tasks grew, the old version, which relied on file + Teradata metadata storage, fell short in performance, and its run-state management had become quite complicated. It was therefore updated, and a new version was produced in China that loads task parameters into memory and preloads them to raise task parallelism and reduce data latency, and that adds sophisticated run-state management. Even now, Teradata Automation is still used by many financial systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hats off to Teradata Automation!
&lt;/h2&gt;

&lt;p&gt;From the beginning, the open-source community of Apache DolphinScheduler absorbed all the scheduling-system concepts of the time, with many functions paying tribute to Teradata Automation. For example, the cross-project workflow/task dependency task (Dependent) is exactly the same as Teradata Automation's dependency setting; at the time, Airflow, Azkaban, and Oozie had no such function. On the strength of its excellent performance, UI, and functional design, Apache DolphinScheduler convinced Wang Yongfu, one of the core developers of Teradata Automation in China, to migrate his company's scheduling from Teradata Automation to Apache DolphinScheduler, and he expressed great appreciation for it. Looking back, I still remember how excited the community was to be recognized by Yongfu: it proved that DolphinScheduler was a world-leading project, and only by becoming a top-level Apache project could it be worthy of that name.&lt;/p&gt;

&lt;p&gt;Now WhaleOps gathers talent from Internet companies, Informatica, IBM, and Teradata, including many die-hard Teradata fans, so we boldly carried some of Automation's concepts into WhaleScheduler. Teradata Automation users find these functions familiar, while other users slap their thighs and exclaim at how creative the design is! But to be honest, WhaleScheduler is standing on the shoulders of giants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency/Trigger Distributed Memory Engine Design&lt;/li&gt;
&lt;li&gt;Trigger mechanism (in addition to file trigger, also add Kafka, SQL detection trigger)&lt;/li&gt;
&lt;li&gt;Run-state weighting (TD slang: "dictatorship mode"; mainly for scenarios where data arrives late but regulatory reports must still be delivered first)&lt;/li&gt;
&lt;li&gt;Run-state isolation (TD slang: "anti-epidemic mode", a nod to a certain big name; mainly for containing dirty data so it does not keep polluting downstream tasks)&lt;/li&gt;
&lt;li&gt;blacklist&lt;/li&gt;
&lt;li&gt;whitelist&lt;/li&gt;
&lt;li&gt;A preheating mechanism, etc.&lt;/li&gt;
&lt;/ul&gt;
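
&lt;p&gt;The trigger mechanisms above can be pictured as simple predicates that a scheduler polls before firing a workflow. This is an illustrative Python sketch under assumed interfaces, not WhaleScheduler's actual API:&lt;/p&gt;

```python
import os

def file_trigger(flag_path):
    """File trigger: fire once the watched flag file exists (Automation-style)."""
    return os.path.exists(flag_path)

def sql_trigger(run_query, probe_sql):
    """SQL detection trigger: fire when a probe query returns any rows.

    `run_query` is a caller-supplied function executing SQL against the source.
    """
    return len(run_query(probe_sql)) > 0

def should_fire(triggers):
    """A workflow fires only when every configured trigger is satisfied."""
    return all(check() for check in triggers)
```

&lt;p&gt;A Kafka trigger would follow the same shape, with the predicate consuming from a topic instead of probing a file or a table.&lt;/p&gt;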

&lt;p&gt;WhaleScheduler has also absorbed Teradata Automation's Excel import/export function, so that business departments can easily maintain complex DAGs through Excel sheets instead of configuring complex tasks through the UI. These are all tributes to Teradata Automation. Without the continuous innovation and exploration of the Automation pioneers in business cases, and without the generations of Teradata veterans who created data warehouse specifications, how could we create world-class open-source communities and commercial products out of thin air?&lt;/p&gt;

&lt;p&gt;Although Teradata has withdrawn from China, its innovative technical architecture and the spirit of its professionals continue to inspire our younger generations to forge ahead. Teradata Automation can no longer serve you, but the TD fans at WhaleOps sincerely hope that WhaleScheduler, which integrates Internet cloud-native scheduling with traditional scheduling, can take over Teradata Automation's mantle and continue to contribute to the world.&lt;/p&gt;

&lt;p&gt;We also hope that WhaleScheduler, together with Apache DolphinScheduler, will open up a more innovative path for future builders of scheduling systems!&lt;/p&gt;

&lt;p&gt;Finally, I would like to pay tribute to Teradata Automation and the users and practitioners who have worked tirelessly on the scheduling system with this article!&lt;/p&gt;

&lt;p&gt;📌📌 You're welcome to fill out this survey to share feedback on your user experience, or just your ideas about Apache DolphinScheduler :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.surveymonkey.com/r/7CHHWGW" rel="noopener noreferrer"&gt;https://www.surveymonkey.com/r/7CHHWGW&lt;/a&gt;&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>WhaleOps is selected as 2022 China's cutting-edge technology pioneer enterprise</title>
      <dc:creator>WhaleOps</dc:creator>
      <pubDate>Fri, 06 Jan 2023 02:19:31 +0000</pubDate>
      <link>https://forem.com/whaleops/whaleops-is-selected-as-2022-chinas-cutting-edge-technology-pioneer-enterprise-217n</link>
      <guid>https://forem.com/whaleops/whaleops-is-selected-as-2022-chinas-cutting-edge-technology-pioneer-enterprise-217n</guid>
      <description>&lt;p&gt;Congratulations!👏🏻🎉🎉WhaleOps is selected as 2022 China's cutting-edge technology pioneer enterprise by leading technical media SegmentFault.&lt;/p&gt;

&lt;p&gt;Thanks for the honor! We will work harder to strengthen ecosystem construction, serve more developers, and offer more DevOps services to help enterprises with their digital transformation. 💪🏻💪🏻 &lt;/p&gt;

</description>
      <category>web3</category>
      <category>blockchain</category>
      <category>solidity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>SeaTunnel Zeta engine, the first choice for massive data synchronization, is officially released!</title>
      <dc:creator>WhaleOps</dc:creator>
      <pubDate>Fri, 06 Jan 2023 02:18:00 +0000</pubDate>
      <link>https://forem.com/whaleops/seatunnel-zeta-engine-the-first-choice-for-massive-data-synchronization-is-officially-released-47n6</link>
      <guid>https://forem.com/whaleops/seatunnel-zeta-engine-the-first-choice-for-massive-data-synchronization-is-officially-released-47n6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbe52yyyoqsaf8z1ub0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbe52yyyoqsaf8z1ub0n.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache SeaTunnel (incubating) has launched official version 2.3.0 and officially released its core synchronization engine, Zeta! In addition, SeaTunnel 2.3.0 brings many long-awaited new features, including CDC support and nearly a hundred Connectors.&lt;/p&gt;

&lt;p&gt;Documentation&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/docs/2.3.0/about" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.3.0/about&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Download link &lt;/p&gt;

&lt;p&gt;&lt;a href="https://seatunnel.apache.org/download/" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/download/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  01 Major update
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SeaTunnel synchronization engine—Zeta Officially Released
&lt;/h3&gt;

&lt;p&gt;Zeta Engine is a data synchronization engine designed and developed specifically for data synchronization scenarios. It is faster, more stable, more resource-efficient, and easier to use. In comparisons of open-source synchronization engines around the world, Zeta's performance is far ahead. The SeaTunnel Zeta engine went through several R&amp;amp;D versions, and the beta was released in October 2022. After community discussion, it was named Zeta (after the fastest star in the universe, which the community believes fully reflects the character of the engine). Thanks to the efforts of community contributors, the production-ready version of Zeta Engine is now officially released. Its features include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple and easy to use: the new engine minimizes dependence on third-party services and can provide cluster management, snapshot storage, and cluster HA without relying on big data components such as ZooKeeper and HDFS. This is very useful for users who have no big data platform, or who do not want to depend on one for data synchronization.&lt;/li&gt;
&lt;li&gt;More resource-efficient: at the CPU level, Zeta Engine internally uses Dynamic Thread Sharing. In real-time synchronization scenarios with many tables but little data per table, Zeta Engine runs the synchronization tasks in shared threads, which reduces unnecessary thread creation and saves system resources. On the read and write sides, Zeta Engine is designed to minimize the number of JDBC connections. In CDC scenarios, Zeta Engine reuses log reading and parsing resources as much as possible.&lt;/li&gt;
&lt;li&gt;More stable: in this version, Zeta Engine uses the Pipeline as the minimum granularity of checkpointing and fault tolerance for data synchronization tasks. The failure of one task affects only the tasks that have upstream or downstream relationships with it, avoiding, as far as possible, a single task failure causing the entire job to fail or roll back. For scenarios where the source data has a limited retention time, Zeta Engine can enable a data cache that automatically caches data read from the source; downstream tasks then read the cached data and write it to the target. In this scenario, even if the target side fails and data cannot be written, the source side can still be read normally, preventing source data from being deleted upon expiration.&lt;/li&gt;
&lt;li&gt;Faster: Zeta Engine's execution plan optimizer rewrites the plan to reduce possible network transfer of data, thereby cutting the overall performance loss caused by data serialization and deserialization and completing data synchronization faster. It also supports rate limiting, so that synchronization jobs can run at a reasonable speed.&lt;/li&gt;
&lt;li&gt;Data synchronization support for all scenarios: SeaTunnel aims to support full and incremental synchronization in offline batch mode, as well as real-time synchronization and CDC.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Nearly 100 Connectors supported
&lt;/h3&gt;

&lt;p&gt;SeaTunnel 2.3.0 supports 97 Connectors, including ClickHouse, S3, Redshift, HDFS, Kafka, MySQL, Oracle, SQLServer, Teradata, PostgreSQL, Amazon DynamoDB, Greenplum, Hudi, MaxCompute, OSSFile, and more (see: &lt;a href="https://seatunnel.apache.org/docs/2.3.0/Connector-v2-release-state" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.3.0/Connector-v2-release-state&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In this version, with abundant feedback from users and extensive testing, many Connectors have been brought up to production-ready standards. For Connectors still in the Alpha and Beta stages, you're welcome to join the testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support for CDC Connectors
&lt;/h3&gt;

&lt;p&gt;Change data capture (CDC) is the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to downstream processes or systems. It is an important, long-awaited function in data integration. Version 2.3.0 supports CDC Connectors for the first time, mainly JDBC connectors (including MySQL, SQLServer, etc.).&lt;/p&gt;
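
&lt;p&gt;Conceptually, a CDC source emits a stream of changelog events that downstream systems replay. As a minimal illustration (using a hypothetical event shape, not SeaTunnel's actual API), applying such a stream to an in-memory replica looks like this:&lt;/p&gt;

```python
def apply_changelog(table, events):
    """Replay CDC changelog events onto an in-memory replica of a table.

    Each event is an (op, key, row) tuple; real CDC connectors capture
    these from the database's transaction log, not from a Python list.
    """
    for op, key, row in events:
        if op in ("insert", "update"):
            table[key] = row       # upsert the new row image
        elif op == "delete":
            table.pop(key, None)   # remove the row if present
        else:
            raise ValueError(f"unknown operation: {op}")
    return table
```

&lt;p&gt;Replaying the same ordered event stream against any target yields the same final state, which is what lets CDC keep heterogeneous systems in sync.&lt;/p&gt;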

&lt;p&gt;SeaTunnel CDC is a solution that absorbs the advantages and discards the disadvantages of the CDC components on the market, while targeting the pain points reported by many users. It has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports basic CDC&lt;/li&gt;
&lt;li&gt;Support for lock-free parallel snapshots of historical data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following functions are still under development, and we believe they will be available soon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for log heartbeat detection and dynamic table addition&lt;/li&gt;
&lt;li&gt;Support for sharded databases and tables, and multi-structure table reading&lt;/li&gt;
&lt;li&gt;Support for schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zeta Engine Metrics Support
&lt;/h3&gt;

&lt;p&gt;SeaTunnel 2.3.0 also supports Zeta Metrics. After a job finishes, users can obtain various metrics, including job execution time, job execution status, and the amount of data the job processed. In the future, we will provide more comprehensive metrics to help users better monitor the running status of jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zeta engine supports persistent storage
&lt;/h3&gt;

&lt;p&gt;SeaTunnel 2.3.0 provides persistent storage. Users can store job metadata in persistent storage, which ensures that job metadata is not lost after SeaTunnel restarts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zeta Engine CheckPoint supports the S3 storage plugin
&lt;/h3&gt;

&lt;p&gt;Amazon S3 provides cloud object storage for a variety of use cases, and it is also one of the checkpoint storage plugins the community has requested most recently. We therefore added an S3 checkpoint storage plugin, compatible with the S3N and S3A protocols.&lt;/p&gt;

&lt;h2&gt;
  
  
  02 Change Log
&lt;/h2&gt;

&lt;h3&gt;
  
  
  New Features
&lt;/h3&gt;

&lt;p&gt;Core&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Core] [Log] Integrate slf4j and log4j2 unified management log #3025&lt;/li&gt;
&lt;li&gt;[Core] [Connector-V2] [Exception] Unified Connector exception format #3045&lt;/li&gt;
&lt;li&gt;[Core] [Shade] [Hadoop] Add Hadoop-shade package #3755&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connector-V2&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Connector-V2] [Elasticsearch] Added Source Connector #2821&lt;/li&gt;
&lt;li&gt;[Connector-V2] [AmazondynamoDB] Added AmazondynamoDB Source &amp;amp; Sink Connector #3166&lt;/li&gt;
&lt;li&gt;[Connector-V2] [StarRocks] Add StarRocks Sink Connector #3164&lt;/li&gt;
&lt;li&gt;[Connector-V2] [DB2] Added DB2 source &amp;amp; sink connector #2410&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Transform] Added transform-v2 API #3145&lt;/li&gt;
&lt;li&gt;[Connector-V2] [InfluxDB] Added influxDB Sink Connector #3174&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Cassandra] Added Cassandra Source &amp;amp; Sink Connector #3229&lt;/li&gt;
&lt;li&gt;[Connector-V2] [MyHours] Added MyHours Source Connector #3228&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Lemlist] Added Lemlist Source Connector #3346&lt;/li&gt;
&lt;li&gt;[Connector-V2] [CDC] [MySql] Add MySql CDC Source Connector #3455&lt;/li&gt;
&lt;li&gt;[Connector-V2] [CDC] [SqlServer] Added SqlServer CDC Source Connector #3686&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Klaviyo] Added Klaviyo Source Connector #3443&lt;/li&gt;
&lt;li&gt;[Connector-V2] [OneSingal] Added OneSingal Source Connector #3454&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Slack] Added Slack Sink Connector #3226&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jira] Added Jira Source Connector #3473&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Sqlite] Added Sqlite Source &amp;amp; Sink Connector #3089&lt;/li&gt;
&lt;li&gt;[Connector-V2] [OpenMldb] Add OpenMldb Source Connector #3313&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Teradata] Added Teradata Source &amp;amp; Sink Connector #3362&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Doris] Add Doris Source &amp;amp; Sink Connector #3586&lt;/li&gt;
&lt;li&gt;[Connector-V2] [MaxCompute] Added MaxCompute Source &amp;amp; Sink Connector #3640&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Doris] [Streamload] Add Doris streamload Sink Connector #3631&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Redshift] Added Redshift Source &amp;amp; Sink Connector #3615&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Notion] Add Notion Source Connector #3470&lt;/li&gt;
&lt;li&gt;[Connector-V2] [File] [Oss-Jindo] Add OSS Jindo Source &amp;amp; Sink Connector #3456&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zeta engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support print job metrics when the job completes #3691&lt;/li&gt;
&lt;li&gt;Add Metrics information statistics #3621&lt;/li&gt;
&lt;li&gt;Support for IMAP file storage (including local files, HDFS, S3) #3418 #3675&lt;/li&gt;
&lt;li&gt;Support saving job restart status information #3637&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;E2E&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[E2E] [Http] Add HTTP type Connector e2e test case #3340&lt;/li&gt;
&lt;li&gt;[E2E] [File] [Local] Add local file Connector e2e test case #3221&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bug Fixes
&lt;/h3&gt;

&lt;p&gt;Connector-V2&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Fix Jdbc Source cannot be stopped in batch mode #3220&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Fix Jdbc connection reset error #3670&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Fix NPE in Jdbc connector exactly-once #3730&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Hive] Fix NPE during Hive data writing #3258&lt;/li&gt;
&lt;li&gt;[Connector-V2] [File] Fix NPE when FileConnector gets FileSystem #3506&lt;/li&gt;
&lt;li&gt;[Connector-V2][File] Fix NPE thrown when fileNameExpression is not configured by File Connector user #3706&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Hudi] Fix the bug that the split owner of Hudi Connector may be negative #3184&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Fix the error that the resource is not closed after the execution of Jdbc Connector #3358&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zeta engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ST-Engine] Fix the problem of duplicate data file names when using the Zeta engine #3717&lt;/li&gt;
&lt;li&gt;[ST-Engine] Fix the problem that the node fails to read data from Imap persistence properly #3722&lt;/li&gt;
&lt;li&gt;[ST-Engine] Fix Zeta Engine Checkpoint #3213&lt;/li&gt;
&lt;li&gt;[ST-Engine] Fix Zeta engine Checkpoint failed bug #3769&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Optimization
&lt;/h3&gt;

&lt;p&gt;Core&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Core] [Starter] [Flink] Modify Starter API to be compatible with Flink version #2982&lt;/li&gt;
&lt;li&gt;[Core] [Pom] [Package] Optimize the packaging process #3751&lt;/li&gt;
&lt;li&gt;[Core] [Starter] Optimize Logo printing logic to adapt to higher version JDK #3160&lt;/li&gt;
&lt;li&gt;[Core] [Shell] Optimize binary plugin download script #3462&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connector-V1&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Connector-V1] Remove Connector V1 module #3450&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connector-V2&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Connector-V2] Add Connector Split basic module to reuse logic #3335&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Redis] support cluster mode &amp;amp; user authentication #3188&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Clickhouse] support nest and array data types #3047&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Clickhouse] support geo type data #3141&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Clickhouse] Improve double data type conversion #3441&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Clickhouse] Improve Float, Long type data conversion #3471&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Kafka] supports setting the starting offset or message time for reading #3157&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Kafka] support specifying multiple partition keys #3230&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Kafka] supports dynamic discovery of partitions and topics #3125&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Kafka] Support Text format #3711&lt;/li&gt;
&lt;li&gt;[Connector-V2] [IotDB] Add parameter validation #3412&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Support setting data fetch size #3478&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] support Upsert configuration #3708&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Jdbc] Optimize the submission process of Jdbc Connector #3451&lt;/li&gt;
&lt;li&gt;[Connector-V2][Oracle] Improve datatype mapping for Oracle connector #3486&lt;/li&gt;
&lt;li&gt;[Connector-V2] [Http] Support extracting complex Json strings in Http connector #3510&lt;/li&gt;
&lt;li&gt;[Connector-V2] [File] [S3] Support S3A protocol #3632&lt;/li&gt;
&lt;li&gt;[Connector-V2] [File] [HDFS] support using hdfs-site.xml #3778&lt;/li&gt;
&lt;li&gt;[Connector-V2] [File] Support file splitting #3625&lt;/li&gt;
&lt;li&gt;[Connector-V2] [CDC] Support writing CDC changelog events in Jdbc ElasticSearch #3673&lt;/li&gt;
&lt;li&gt;[Connector-V2] [CDC] Support writing CDC changelog events in Jdbc ClickHouse #3653&lt;/li&gt;
&lt;li&gt;[Connector-V2] [CDC] Support writing CDC changelog events in Jdbc Connector #3444&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zeta engine&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zeta engine optimizations to improve performance #3216&lt;/li&gt;
&lt;li&gt;Support custom JVM parameters #3307&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[CI] Optimize CI execution process for faster execution #3179 #3194&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  E2E
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[E2E] [Flink] Support command line execution on task manager #3224&lt;/li&gt;
&lt;li&gt;[E2E] [Jdbc] Optimize JDBC e2e to improve test code stability #3234&lt;/li&gt;
&lt;li&gt;[E2E] [Spark] Corrected Spark version in e2e container to 2.4.6 #3225&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the specific Change log: &lt;a href="https://github.com/apache/incubator-seatunnel/releases/tag/2.3.0" rel="noopener noreferrer"&gt;https://github.com/apache/incubator-seatunnel/releases/tag/2.3.0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  03 Acknowledgement
&lt;/h2&gt;

&lt;p&gt;Every release takes the efforts of many community contributors, who give their time to the project in the dead of night, during holidays, after work, and in fragmented moments. Special thanks to &lt;a class="mentioned-user" href="https://dev.to/jun"&gt;@jun&lt;/a&gt; Gao and @ChaoTian for their multiple rounds of performance and stability testing of the candidate version. We sincerely appreciate everyone's contributions. The following is the list of contributors (GitHub IDs) for this version, in no particular order:&lt;/p&gt;

&lt;p&gt;EricJoy2048&lt;br&gt;
TaoZex&lt;br&gt;
Hisoka-X&lt;br&gt;
TyrantLucifer&lt;br&gt;
ic4y&lt;br&gt;
liugddx&lt;br&gt;
Calvin Kirs&lt;br&gt;
ashulin&lt;br&gt;
hailin0&lt;br&gt;
Carl-Zhou-CN&lt;br&gt;
FW Lamb&lt;br&gt;
wuchunfu&lt;br&gt;
john8628&lt;br&gt;
lightzhao&lt;br&gt;
15531651225&lt;br&gt;
zhaoliang01&lt;br&gt;
harveyyue&lt;br&gt;
Monster Chenzhuo&lt;br&gt;
hx23840&lt;br&gt;
Solomon-aka-beatsAll&lt;br&gt;
matesoul&lt;br&gt;
lianghuan-xatu&lt;br&gt;
skyoct&lt;br&gt;
25Mr-LiuXu&lt;br&gt;
iture123&lt;br&gt;
FlechazoW&lt;br&gt;
mans2singh&lt;/p&gt;

&lt;p&gt;Special thanks to our Release Manager, @TyrantLucifer. Although this was his first time in the role, he actively communicated with the community on version planning, spared no effort to track issues before the release, handled blocking issues, and managed version quality. He proved perfectly qualified for this release. Thanks for his contribution to the community, and we welcome more Committers and PPMC members to take on the Release Manager role to help the community deliver releases faster and with high quality.&lt;/p&gt;

</description>
      <category>python</category>
      <category>mentorship</category>
      <category>community</category>
    </item>
  </channel>
</rss>
