<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: DMetaSoul</title>
    <description>The latest articles on Forem by DMetaSoul (@dmetasoul).</description>
    <link>https://forem.com/dmetasoul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F830376%2F0d598db8-b993-4d02-add9-974045951281.png</url>
      <title>Forem: DMetaSoul</title>
      <link>https://forem.com/dmetasoul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dmetasoul"/>
    <language>en</language>
    <item>
      <title>Open-Source First Anniversary, 1.2K Stars! A One-Year Review of LakeSoul, the Unique Open-Source Lakehouse</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Wed, 28 Dec 2022 11:26:25 +0000</pubDate>
      <link>https://forem.com/dmetasoul/open-source-first-anniversary-star-12k-review-on-the-anniversary-of-lakesoul-the-unique-open-source-lakehouse-10di</link>
      <guid>https://forem.com/dmetasoul/open-source-first-anniversary-star-12k-review-on-the-anniversary-of-lakesoul-the-unique-open-source-lakehouse-10di</guid>
      <description>&lt;p&gt;LakeSoul, the only lakehouse framework open-sourced in China, has been open source for one year, since the end of December 2021. During this year, two major versions, 1.0 and 2.0, were released, and its features attracted the attention of technology enthusiasts worldwide, earning 1.2K GitHub stars.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07gq4x3xqkvopi2y5ufl.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07gq4x3xqkvopi2y5ufl.PNG" alt="Image description" width="722" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LakeSoul’s design goal is a simple, high-performance cloud-native data lake that supports both BI and AI application scenarios. Version 1.0 was built around the Spark engine and implemented the core functionality of a lakehouse. Version 2.0 focuses on ecosystem construction and a rebuild of the underlying framework, further enriching functionality, supporting more usage scenarios, and moving closer to real-world requirements. Our previous article, “Lakehouse is the future of data intelligence? Then you must know about the only open-source LakeSoul in China,” briefly introduced LakeSoul’s design concept. This article reviews the project's first year of open-source development and presents some recent benchmark results for reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of LakeSoul version 1.0
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Both Merge on Read (MOR) and Copy on Write (COW) write modes are supported. MOR uses Upsert semantics and suits write-heavy scenarios, while COW uses Update semantics and suits read-heavy scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MOR mode achieves high write throughput, and read-time merging remains efficient thanks to multipath ordered merges; Compaction can further optimize read performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In MOR mode, a user-defined MergeOperator is supported: users can supply UDFs that combine multiple values of the same primary key at read time, flexibly implementing common merge logic beyond simple overwrite, such as summation or taking the latest non-null value. For specific usage scenarios and methods, see “Case Sharing: LakeSoul’s unique MergeOperator function.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-stream concurrent updates are supported: multiple streams can write to different columns as long as they share the same primary key column, with no extra configuration needed for writes. This enables fast multi-stream wide-table construction (eliminating Join), machine-learning sample assembly, and so on. For concrete application scenarios, see our previous article, “Using LakeSoul to build a real-time machine learning sample library.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Besides reading and writing data, it supports deleting tables and partitions (DropTable, DropPartition), multilevel partitioning, and primary-key hash bucketing, and exposes Spark SQL, DataFrame, and Streaming APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spark logical and physical plans are extended to implement semantics such as CDC ingestion and Merge Into; for CDC, a real-time ingestion pipeline based on Debezium + Kafka + Structured Streaming + LakeSoul is provided.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
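&lt;p&gt;The read-time merge described in points 2 and 3 can be sketched conceptually in a few lines of plain Python. This is an illustration of the idea only — the function names and in-memory layout are invented for the example and are not LakeSoul APIs:&lt;/p&gt;

```python
# Conceptual sketch (not LakeSoul code): merge-on-read over ordered file
# versions, with pluggable per-column merge operators.
from typing import Optional


def merge_sum(old: Optional[int], new: Optional[int]) -> Optional[int]:
    """Sum values of the same primary key across versions."""
    if old is None:
        return new
    if new is None:
        return old
    return old + new


def merge_last_non_null(old, new):
    """Keep the latest non-null value."""
    return old if new is None else new


def merge_on_read(file_versions, ops):
    """file_versions: list of {pk: row} dicts, ordered oldest first.
    ops maps a column name to its merge operator."""
    merged = {}
    for version in file_versions:                 # latest version comes last
        for pk, row in version.items():
            if pk not in merged:
                merged[pk] = dict(row)
                continue
            for col, val in row.items():
                op = ops.get(col, lambda old, new: new)  # default: overwrite
                merged[pk][col] = op(merged[pk].get(col), val)
    return merged


v1 = {1: {"clicks": 3, "name": "a"}}
v2 = {1: {"clicks": 2, "name": None}}
result = merge_on_read([v1, v2],
                       {"clicks": merge_sum, "name": merge_last_non_null})
# result[1] == {"clicks": 5, "name": "a"}
```

&lt;p&gt;In LakeSoul itself the versions are Parquet files combined with a multipath ordered merge rather than Python dicts, but the per-column operator semantics follow the same idea.&lt;/p&gt;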

&lt;h2&gt;
  
  
  The LakeSoul 2.0 development roadmap
&lt;/h2&gt;

&lt;p&gt;In version 1.0, LakeSoul implemented the basic read and write capabilities of a lakehouse, with high write throughput in MOR scenarios (see the benchmarks below) and good read-time merge performance. In the 1.0 architecture, however, the metadata and IO layers had some design flaws and were tightly coupled, which restricted the further expansion of new features. Version 2.0 therefore undertook a major refactoring and added many practical features.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Catalog metadata layer was refactored and decoupled into a standalone module that multiple engines can access. We replaced Cassandra with PostgreSQL, whose strong transactional capabilities enable complex concurrency control, conflict detection, and a two-phase commit consistency protocol. At the same time, a carefully designed metadata table schema lets metadata reads and writes go through primary-key indexes for high performance: even the smallest 2-core/4 GB Postgres instance on the cloud can reach thousands of QPS. This high-performance metadata layer means LakeSoul can manage files at large scale with efficient concurrency control.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for the Flink unified stream-and-batch engine expands upstream ingestion capacity, with a focus on multi-source real-time synchronization built on Flink CDC to meet enterprise requirements for real-time ingestion at the scale of thousands of tables. LakeSoul Flink CDC can synchronize an entire database of thousands of tables: configuring just the source database name is enough to synchronize all of its tables automatically. LakeSoul detects new tables automatically, synchronizes DDL changes automatically, and stays compatible with old-version data when reading, making real-time ingestion simpler and faster. A two-phase commit protocol is implemented through metadata-layer transactions and idempotence, guaranteeing exactly-once ingestion semantics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Downstream, Hive partitions can be mounted automatically. By configuring the LakeSoul partition field and the Hive external table name, each Compaction automatically mounts the data to the Hive partition of the same name; users can also define different Hive partition field names for the synchronization. Apache Kyuubi supports direct access to the LakeSoul lakehouse via Hive JDBC. More data warehouse integrations will follow to build a richer upstream and downstream ecosystem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More complete surrounding functionality, including snapshot read, incremental read, rollback, and data cleanup. Incremental read can run in Streaming mode: given a start timestamp and a periodic trigger, it continuously reads the incremental data (including CDC data) that has arrived since the last read. Snapshot read returns the data ingested up to a specified timestamp. Rollback restores the latest version before a specified timestamp, while data cleanup removes all metadata and data before a specified timestamp. These functions apply to operations and maintenance, troubleshooting, version rollback, and streaming pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
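&lt;p&gt;The exactly-once guarantee mentioned in point 2 rests on making each checkpoint's commit idempotent: replaying a commit after a failure must be a no-op. The toy sketch below shows that idea under invented names — it is not LakeSoul's actual metadata API:&lt;/p&gt;

```python
# Illustrative sketch of idempotent commits for exactly-once ingestion:
# each checkpoint commits under a unique id, so a replay after failure
# cannot duplicate data.
class MetaStore:
    def __init__(self):
        self.committed = {}              # commit_id -> list of files

    def commit(self, commit_id: str, files: list) -> bool:
        """Atomically record files for a checkpoint; idempotent on retry."""
        if commit_id in self.committed:  # replayed after a failure: no-op
            return False
        self.committed[commit_id] = files
        return True


meta = MetaStore()
assert meta.commit("table_a/ckpt-7", ["part-0.parquet"]) is True
assert meta.commit("table_a/ckpt-7", ["part-0.parquet"]) is False  # replay ignored
```

&lt;p&gt;In the real system the check-and-insert happens inside a Postgres transaction, so concurrent writers cannot both claim the same checkpoint id.&lt;/p&gt;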

&lt;p&gt;From version 1.0 to 2.0, LakeSoul has further opened up its upstream and downstream links, enabling richer scenarios, including real-time ingestion from multiple online databases, unified stream-and-batch ETL, multi-stream wide-table construction (eliminating Join), and real-time machine-learning feature pipelines, greatly improving both performance and usability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results
&lt;/h2&gt;

&lt;p&gt;From the start, LakeSoul's goal has been a cloud-native, high-throughput, highly concurrent, unified stream-and-batch lakehouse storage framework, with a carefully designed concurrent write transaction mechanism and an efficient MOR implementation. In several competitions and on public datasets, its performance has been impressive.&lt;/p&gt;

&lt;h2&gt;
  
  
  CCF BDCI 2022 — Data Lake Stream and Batch Integrated Performance Challenge benchmark
&lt;/h2&gt;

&lt;p&gt;Recently, DMetaSoul and the CCF (China Computer Federation) jointly held the “Data Lake Stream and Batch Integrated Performance Challenge” as part of the Big Data and Intelligent Computing Competition (BDCI 2022). The task was to upsert 11 batches of data into a lakehouse table: the first batch contained 20 million rows and the following ten batches 5 million rows each. During the upsert, some fields had to be summed or have null values filtered by primary key, and the total write and read time was measured. Contestants could choose any data lake storage framework, with Spark as the computing engine. The final evaluation results of several open-source lakehouse frameworks are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn00ec2lg8it5572wzuk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn00ec2lg8it5572wzuk.PNG" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By comparison, LakeSoul shows a clear performance advantage when large batches of data are updated repeatedly: MOR mode delivers high write throughput while keeping read performance close to that of COW.&lt;/p&gt;

&lt;p&gt;Review code reference: &lt;a href="https://github.com/meta-soul/ccf-bdci2022-datalake-contest-examples" rel="noopener noreferrer"&gt;https://github.com/meta-soul/ccf-bdci2022-datalake-contest-examples&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CHBenchmark: real-time CDC ingestion and query benchmark
&lt;/h2&gt;

&lt;p&gt;CHBenchmark is a public benchmark that combines TPC-C and TPC-H, using 21 SQL queries to measure query time. We used Flink CDC to synchronize 12 tables, with an initial full load of 16 million rows and more than 700,000 incremental rows over the following 30 minutes; the checkpoint interval was 10 minutes. Query times were recorded after the full load and again after 30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tt9yroxrxknmvw4nop5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tt9yroxrxknmvw4nop5.PNG" alt="Image description" width="800" height="269"&gt;&lt;/a&gt;&lt;br&gt;
By comparison, LakeSoul shows an advantage in read performance under continuous small-batch incremental updates, benefiting from its efficient metadata management and its high-performance MOR design. For individual statements, however, there is still room for optimization, mainly in the time spent on IO.&lt;/p&gt;

&lt;p&gt;Review code reference: &lt;a href="https://github.com/meta-soul/LakeSoul/pull/115" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/pull/115&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In response, the LakeSoul community has started a new project, NativeIO, which decouples the IO layer so that more computing frameworks can benefit from the lakehouse. The NativeIO layer, implemented in Rust, adopts the Parquet IO layer from the arrow-rs project and optimizes object-storage access with asynchronous parallelism; in our preliminary tests, object-storage access performance more than doubled. NativeIO also encapsulates IO, merge, and other logic behind a C API that can be called from Java, Python, and other languages, making it easy to provide unified LakeSoul lakehouse access for a variety of computing engines.&lt;/p&gt;

&lt;p&gt;Overall, over the past year, guided by community feedback and third-party evaluations, LakeSoul has continuously pursued a more efficient and reasonable design, expanded its upstream and downstream links, improved its ecosystem, and moved toward an integrated BI + AI lakehouse platform. You can try it here: &lt;a href="https://github.com/meta-soul/LakeSoul" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We welcome you to try LakeSoul and share feedback to help build an open-source lakehouse community!&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>The Best Open-Source Lakehouse Project: LakeSoul 2.0 Supports Snapshot, Rollback, Flink, and Hive Interconnection</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Fri, 08 Jul 2022 08:11:42 +0000</pubDate>
      <link>https://forem.com/dmetasoul/the-best-open-source-lakehouse-project-lakesoul-20-supports-snapshot-rollback-flink-and-hive-interconnection-49ni</link>
      <guid>https://forem.com/dmetasoul/the-best-open-source-lakehouse-project-lakesoul-20-supports-snapshot-rollback-flink-and-hive-interconnection-49ni</guid>
      <description>&lt;p&gt;I published an article about &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul" rel="noopener noreferrer"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt;, the data lakehouse, a few months ago. &lt;a href="https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2"&gt;https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, LakeSoul has been updated with more complete functionality that adapts to a wider range of scenarios and production deployments.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/meta-soul" rel="noopener noreferrer"&gt;&lt;strong&gt;DMetaSoul&lt;/strong&gt;&lt;/a&gt; team released LakeSoul 2.0 in early July, which has been upgraded and optimized in many ways to improve the flexibility of its architectural design and better adapt to customers' needs for future rapid business development. &lt;br&gt;
The main objectives of the &lt;a href="https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2" rel="noopener noreferrer"&gt;&lt;strong&gt;LakeSoul 2.0&lt;/strong&gt;&lt;/a&gt; upgrade:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Support multiple computing engines (Flink, Presto, etc.) by refactoring the Catalog and decoupling it from Spark;&lt;/li&gt;
&lt;li&gt;Use the PostgreSQL protocol, whose stronger transactional mechanism replaces Cassandra and removes the cost of Cassandra cluster management for enterprises;&lt;/li&gt;
&lt;li&gt;Support more functions needed in production, such as version snapshots, rollback, and Hive interconnection;&lt;/li&gt;
&lt;li&gt;Strengthen ecosystem construction and improve upstream and downstream link design.
DMetaSoul achieved these goals through Catalog refactoring, adapting Spark to the new Catalog, developing new user features, and supporting the Flink computing engine. The features of &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2" rel="noopener noreferrer"&gt;LakeSoul 2.0&lt;/a&gt;&lt;/strong&gt; follow.&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;
  
  
  1. Catalog refactoring
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1.1 Supports Postgres SQL protocol
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2" rel="noopener noreferrer"&gt;&lt;strong&gt;LakeSoul 2.0&lt;/strong&gt;&lt;/a&gt;, all interaction between metadata and the database is implemented over the PostgreSQL (PG) protocol, for the reasons discussed at &lt;a href="https://github.com/meta-soul/LakeSoul/issues/23" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/issues/23&lt;/a&gt;. On the one hand, Cassandra does not support multi-partition transactions on a single table; on the other, Cassandra clusters carry higher maintenance costs, while the PostgreSQL protocol is widely used in enterprises and cheaper to operate. You need to configure the PG parameters; for details, see &lt;a href="https://github.com/meta-soul/LakeSoul/wiki/02.-QuickStart" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/wiki/02.-QuickStart&lt;/a&gt; &lt;/p&gt;
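&lt;p&gt;To illustrate why transactions matter here, the sketch below uses Python's built-in sqlite3 as a stand-in for Postgres: a commit touching several partitions of one table either fully succeeds or fully rolls back, which is exactly what Cassandra's single-table multi-partition model cannot guarantee. The schema is invented for the example and is not LakeSoul's real metadata schema:&lt;/p&gt;

```python
# Multi-partition atomic metadata commit, sketched with sqlite3 standing
# in for Postgres (hypothetical schema, not LakeSoul's).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE partition_info (
    table_name TEXT, part TEXT, file TEXT,
    PRIMARY KEY (table_name, part, file))""")


def commit_files(conn, table, partition_files):
    """Commit files to multiple partitions in one transaction."""
    try:
        with conn:                        # BEGIN ... COMMIT / ROLLBACK
            for part, f in partition_files:
                conn.execute(
                    "INSERT INTO partition_info VALUES (?, ?, ?)",
                    (table, part, f))
        return True
    except sqlite3.IntegrityError:        # conflict: whole commit rolls back
        return False


assert commit_files(conn, "t", [("p1", "a.parquet"), ("p2", "b.parquet")])
# Re-inserting a.parquet conflicts, so neither row of this commit lands:
assert not commit_files(conn, "t", [("p1", "a.parquet"), ("p3", "c.parquet")])
rows = conn.execute("SELECT COUNT(*) FROM partition_info").fetchone()[0]
# rows == 2: only the first, fully successful commit is visible
```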
&lt;h2&gt;
  
  
  1.2 Independent Catalog framework
&lt;/h2&gt;

&lt;p&gt;In LakeSoul 2.0, the Catalog is decoupled from Spark, realizing an independent metadata storage structure and interface. Spark, Flink, Presto, and other computing engines can interconnect with LakeSoul, providing multi-engine streaming and batch integration capability. &lt;/p&gt;
&lt;h2&gt;
  
  
  1.3 The data commit conflict detection mechanism
&lt;/h2&gt;

&lt;p&gt;When multiple tasks write data to the same table simultaneously, data consistency problems can arise within a partition. To solve this, LakeSoul classifies data commits into four types: Append, Compaction, Update, and Merge. When two commits of given types arrive concurrently, conflict detection determines whether the later commit succeeds, fails (X), or can be retried. For example, if two tasks append data simultaneously and a conflict is detected, the conflicting task simply re-commits.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j8hwz0zgdehp3irelbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j8hwz0zgdehp3irelbo.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
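&lt;p&gt;The commit-type matrix in the figure can be thought of as a lookup table from a pair of commit types to an outcome. The rules below are illustrative placeholders, not LakeSoul's actual rules:&lt;/p&gt;

```python
# Sketch of a commit-type conflict matrix: for each pair of concurrent
# commit types, decide whether the later commit is retried or fails.
# The specific entries here are invented for illustration.
RETRY, FAIL = "retry", "fail"

CONFLICT_RULES = {
    # (already-committed type, incoming type) -> outcome for the incoming commit
    ("Append", "Append"): RETRY,       # two append jobs: re-commit on conflict
    ("Append", "Compaction"): RETRY,
    ("Compaction", "Merge"): RETRY,
    ("Update", "Append"): FAIL,        # an update invalidates a concurrent append
    ("Update", "Update"): FAIL,
}


def resolve(committed: str, incoming: str) -> str:
    """Look up the outcome; unknown pairs fail conservatively."""
    return CONFLICT_RULES.get((committed, incoming), FAIL)


assert resolve("Append", "Append") == "retry"
assert resolve("Update", "Update") == "fail"
```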
&lt;h1&gt;
  
  
  2. Adapting Spark to the new Catalog
&lt;/h1&gt;

&lt;p&gt;After the refactoring, the Catalog is decoupled from Spark. Since most of the original Spark functionality is affected by this decoupling, three parts had to be reworked against the new Catalog design.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.1 Spark DDL
&lt;/h2&gt;

&lt;p&gt;DDL statements (such as CREATE TABLE and DROP TABLE) and DataSet-related functions (such as save) in Spark SQL interact closely with the Catalog, which stores the information about created tables. LakeSoul reworked the Spark Scala interface to align it with the new Catalog. &lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Spark DataSink
&lt;/h2&gt;

&lt;p&gt;The DataSink path (insert into, upsert, etc.) involves not only metadata (such as table information) but also data file records and partition-level commit conflict detection when stream or batch tasks write data. DMetaSoul adjusted the data operations that run after a Spark job completes, and partition commit conflict detection was also moved down into the Catalog. &lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Spark DataSource
&lt;/h2&gt;

&lt;p&gt;Merge on Read (MOR) is optimized on the DataSource side. In version 1.0, data reads were sorted by the write version number of the data files, and during merging the latest version overwrote older versions by default. In LakeSoul 2.0, the write-version attribute on data files was removed in favor of an ordered file list, with the newest file at the end of the list. &lt;br&gt;
In LakeSoul 1.0, users of the MergeOperators provided by LakeSoul, such as MergeOpLong, needed to register the operator and specify the associated column names when reading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSoulTable.registerMergeOperator(spark, "org.apache.spark.sql.lakesoul.MergeOpLong", "longMOp")
LakeSoulTable.forPath("tablepath").toDF
  .withColumn("value", expr("longMOp(value)"))
  .select("hash", "value")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a table needs the merge operator on many or even all of its columns, specifying them one by one is inconvenient; when the user does not specify one, LakeSoul applies DefaultMergeOp to all fields by default. &lt;br&gt;
For example, in &lt;a href="https://github.com/meta-soul/LakeSoul/issues/30" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/issues/30&lt;/a&gt;, the user expected MergeNonNullOp to be applied to all fields by default, which version 1.0 could not do. LakeSoul 2.0 makes the default MergeOperator configurable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark-shell --conf defaultMergeOpInfo="org.apache.spark.sql.execution.datasources.v2.merge.parquet.batch.merge_operator.DefaultMergeOp"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  3. New service features
&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2" rel="noopener noreferrer"&gt;LakeSoul 2.0&lt;/a&gt;, several new features address real-world business requirements. &lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Snapshot
&lt;/h2&gt;

&lt;p&gt;Snapshots let users view historical data. In LakeSoul, each Upsert, Update, or Insert on a partition generates a new data version; as the data keeps being updated, historical versions accumulate, and users want to inspect them for data comparison and similar tasks (&lt;a href="https://github.com/meta-soul/LakeSoul/issues/41" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/issues/41&lt;/a&gt;). LakeSoul 2.0 lets users read a given version of a partition via a snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSoulTable.forPath("tablepath","range=rangeval",2).toDF.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3.2 Rollback
&lt;/h2&gt;

&lt;p&gt;Rollback lets users promote a historical version of the data to be the current version, taking the data back in time. LakeSoul 2.0 provides the ability to roll back a partition to a historical version, &lt;br&gt;
&lt;a href="https://github.com/meta-soul/LakeSoul/issues/42" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/issues/42&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSoulTable.forPath("tablepath").rollbackPartition("range=rangeval",2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3.3 Support Hive
&lt;/h2&gt;

&lt;p&gt;In the financial and Internet sectors, many enterprises use Hive as their offline data warehouse. In version 2.0, LakeSoul considers both the uniformity of the upstream computing engine and the diversity of downstream data outputs, and Hive is the first downstream data warehouse tool that LakeSoul supports. &lt;br&gt;
Users create a Hive external table and pass its name to a Compaction call; the Compaction then mounts the compacted files to that Hive table. Note that this path does not support Merge On Read; only Copy On Write is supported.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSoulTable.forPath("tablepath").compaction("range=rangeval","hiveExternalTableName")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3.4 Support Flink
&lt;/h2&gt;

&lt;p&gt;LakeSoul 2.0 supports unified streaming and batch writes from Flink with exactly-once semantics. This guarantees data accuracy for Flink CDC and similar scenarios, and, combined with the Merge On Read and Schema AutoMerge capabilities of LakeSoul 2.0, delivers high commercial value in practical scenarios such as multi-table and sharded-database merging.&lt;/p&gt;

&lt;p&gt;Conclusion &lt;br&gt;
The LakeSoul 2.0 release (&lt;a href="https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2" rel="noopener noreferrer"&gt;https://github.com/meta-soul/LakeSoul/releases/tag/v2.0.0-spark-3.1.2&lt;/a&gt;) further iterates on the surrounding ecosystem, paying attention to seamless integration with upstream and downstream tools as well as to features relevant to enterprise business applications.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>bigdata</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Must-See for Data Engineers: The Future Trend of Big Data Cloud Services</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Sun, 26 Jun 2022 09:02:55 +0000</pubDate>
      <link>https://forem.com/dmetasoul/data-engineers-must-see-the-future-trend-of-big-data-cloud-services-196p</link>
      <guid>https://forem.com/dmetasoul/data-engineers-must-see-the-future-trend-of-big-data-cloud-services-196p</guid>
      <description>&lt;h1&gt;
  
  
  Evolution trends of big data systems
&lt;/h1&gt;

&lt;p&gt;On the demand side, &lt;strong&gt;big data systems started around 1995 with transaction-processing (TP) scenarios&lt;/strong&gt;, such as banks' daily online transaction processing. &lt;strong&gt;By 2005 came analysis scenarios (AP)&lt;/strong&gt;, such as inverted indexing of search keywords, which do not require complex SQL features and focus on concurrent performance. &lt;strong&gt;Around 2010, hybrid scenarios (HTAP)&lt;/strong&gt; used a single system for both transaction processing and real-time analysis, reducing operational complexity. &lt;strong&gt;In 2015 came complex analytics scenarios&lt;/strong&gt;, that is, converged analytics across sources such as public cloud, private cloud, and edge cloud. &lt;strong&gt;Finally, the real-time hybrid scenario (HSAP)&lt;/strong&gt; converges real-time business insights, serving, and analytics. &lt;/p&gt;

&lt;p&gt;On the supply side, &lt;strong&gt;big data systems began around 1995 with relational databases (MySQL)&lt;/strong&gt;, oriented to point storage and queries and scaled horizontally through sharding and middleware. &lt;strong&gt;By 2005, non-relational databases (NoSQL)&lt;/strong&gt; stored large amounts of unstructured data and scaled well horizontally. &lt;strong&gt;Around 2010, hybrid databases (NewSQL) combined MySQL's expressiveness&lt;/strong&gt; and consistency with NoSQL's scalability. &lt;strong&gt;Finally, by 2015, data lakes and data warehouses&lt;/strong&gt; enabled data integration across business lines and systems. We have now reached the era of the next generation of big data systems. &lt;/p&gt;

&lt;p&gt;When analyzing what problems big data can actually solve for a business, three questions must not be ignored: the customer's data volume, the number of schemas and how often they and the business logic change, and the main ways and frequency in which the data is used. The analysis has to return to the most basic logic of data processing. Data processing consists of nothing more than reads and writes, so there are ultimately four patterns — write less, read less; write more, read less; write less, read more; and write more, read more — each corresponding to a different technical stack. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write less, read less: OLTP-type applications, which focus on point storage and queries and are well served by MySQL. &lt;/li&gt;
&lt;li&gt;Write more, read less: a common but underappreciated case is the application debug log, which is huge in storage and which developers tend not to optimize, searching through the mass of logs only when something goes wrong. For a growing Internet enterprise, Elasticsearch alone can account for 50% of big data costs. First, a search engine must maintain a full index, so the cost cannot be reduced; second, companies at this stage often do not yet use big data to serve the business directly, so there are no other big data applications to amortize the cost. This cost hides inside the overall technical budget and is not visible, so it receives no special optimization. &lt;/li&gt;
&lt;li&gt;Write less, read more: BI data analysis, i.e., OLAP, falls into this category; it generally writes sequentially, then computes and outputs the results. Almost all big data cloud-service startups crowd into this red-ocean field. &lt;/li&gt;
&lt;li&gt;Write more, read more: real-time computing in the form of search, advertising, and recommendation. The business scenario is dynamic marketing based on user profiles; recommendation in particular is becoming ubiquitous, since any information flow based on user characteristics is a recommendation. The data is large because user behavior is recorded in detail, and real-time computing makes dynamic predictions and judgments through algorithms. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of application scenarios, the latter two read/write patterns have evolved into Hybrid Transactional &amp;amp; Analytical Processing (HTAP) and Hybrid Serving &amp;amp; Analytical Processing (HSAP), respectively. In terms of market volume, HTAP has seen more startup activity recently, but it solves technical problems that are already well defined. Along the timeline, HSAP will eventually subsume HTAP, because HSAP solves business problems through technology. &lt;/p&gt;

&lt;p&gt;Data engineers and developers need to focus on future industry trends and business pain points to improve their skills. Those in segments like HTAP, whose practitioner numbers may shrink in the future, need to think harder about their career choices. More importantly, ask why an industry that is promising and able to solve current technical problems has so few practitioners and companies: the reasons point to the industry's breakthrough opportunities and matter greatly to practitioners. &lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges of HSAP
&lt;/h1&gt;

&lt;p&gt;First, HSAP and HTAP are not antagonistic; HSAP even borrows many of HTAP's design ideas. For example, HTAP replaces MySQL by changing the storage layer: &lt;/p&gt;

&lt;p&gt;HTAP is an upgrade of the databases typically used in "transaction" scenarios to process structured data. Traditional databases logically use row storage, each row being one record: the whole row must be read into memory for computation even though usually only a few fields of the row are processed, so computing efficiency and CPU utilization are low. &lt;/p&gt;

&lt;p&gt;When search engines and big data arrived, it became common to scan data at large scale while processing only certain fields of each row. Based on these usage characteristics, column storage emerged. Column storage is algorithm-friendly because adding a column (a "feature" used by the algorithm) is very convenient. Another benefit of column storage is that it enables a CPU optimization known as vectorization: a single instruction executes on multiple data elements simultaneously, greatly improving computing efficiency. Therefore, HTAP tends to emphasize column storage, vectorization, MPP, and similar technologies to improve the efficiency of big data processing. &lt;/p&gt;
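&lt;p&gt;As a toy illustration of the row-versus-column trade-off described above (plain Python, with invented field names), scanning one field in a columnar layout touches only that field's array: &lt;/p&gt;

```python
# Toy comparison of row layout vs. column layout (all names invented).
rows = [
    {"user_id": i, "age": 20 + i % 50, "clicks": i * 3 % 7}
    for i in range(1000)
]

# Row store: every row object must be touched to read a single field.
total_age_row = sum(row["age"] for row in rows)

# Column store: each field lives in its own contiguous list, so scanning
# "age" never touches "user_id" or "clicks" -- and vectorized CPU
# instructions can stream over exactly this kind of contiguous array.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "age": [r["age"] for r in rows],
    "clicks": [r["clicks"] for r in rows],
}
total_age_col = sum(columns["age"])
assert total_age_row == total_age_col

# Adding a derived feature column is a single new list, with no rewrite of
# existing rows -- the algorithm-friendliness noted above.
columns["age_squared"] = [a * a for a in columns["age"]]
```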

&lt;p&gt;However, this does not mean that row storage is overshadowed by column storage. Both row and column storage fit particular usage scenarios and carry costs; it is a balance between cost and efficiency. Therefore, in terms of storage format and computing efficiency, HSAP does not need to innovate for innovation's sake. &lt;br&gt;
The biggest difference between HSAP and HTAP is that HSAP is both a technology and a business, so the first question it must answer is how to model data from a business scenario. &lt;/p&gt;

&lt;p&gt;A traditional database is also known as a relational database, and its data modeling, in the form of a schema, is very mature. HSAP can be seen as having evolved from search engines. The earliest search engines retrieved text, so they fall under NoSQL, that is, non-relational databases. Since then, Internet businesses have grown increasingly diversified, mixing transactions and information flows. E-commerce, for example, has both large-scale data businesses and complex transaction links.&lt;/p&gt;

&lt;p&gt;Moreover, in its Search, advertising, and recommendation businesses, e-commerce also needs structured data, such as commodity prices, discounts, and logistics information. Therefore, the data-serving base of e-commerce requires very good modeling, which is the work not of the engineer building the transaction link but of the search engine architect. Modeling data services is critical and greatly impacts search engine storage and computing efficiency. &lt;/p&gt;

&lt;p&gt;So the prerequisites for adopting HSAP are good business data modeling, storage optimization, query acceleration, and so on. Data modeling has no well-standardized solution, because it requires understanding both the complex big data infrastructure and the business. One possible evolution path is that big data architects discover more scenarios while practicing HSAP, abstract those scenarios through data modeling, gradually accumulate experience, and eventually turn it into good products. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application analysis of HSAP&lt;/strong&gt; &lt;br&gt;
What are the core customer issues in the HSAP space? Instead of taking an Internet platform with huge big data analysis and serving requirements as the example, consider a generic XX Bank. The basic scenarios are as follows: &lt;br&gt;
Market financial products according to user group dynamics; &lt;br&gt;
Attract users from the YY bank next door with reasonable concessions. &lt;/p&gt;

&lt;p&gt;The core pain points of the bank's big data architecture team come from the above scenarios, which can basically be classified as "user growth." They require integrating big data analysis and serving (i.e., a typical HSAP problem). However, the bank's BI demands are already well covered by existing products, so that pain point is not acute. &lt;strong&gt;The current data warehouse architecture has the following problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data latency: production and batch tasks in the data warehouse usually produce T+1 output, and stream-batch integration is not supported, making it difficult to serve business scenarios with high timeliness requirements. &lt;/li&gt;
&lt;li&gt;Metadata scaling is weak, and a performance bottleneck appears when the number of partitions grows rapidly. &lt;/li&gt;
&lt;li&gt;The resource scheduling capability is insufficient and cannot be containerized for elastic expansion. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements for the technologies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stream-batch integration: based on unified real-time storage, with upstream and downstream computing using an event-trigger mode, the output latency of downstream data is greatly shortened; &lt;/li&gt;
&lt;li&gt;Horizontal metadata scaling: supports managing tables with many partitions and files.&lt;/li&gt;
&lt;li&gt;Flexible resource scheduling: container-based elastic scaling, on-demand resource utilization, and both public and private cloud deployment are supported. &lt;/li&gt;
&lt;li&gt;Open systems and interfaces: serving is the mainstream workload, but other complex offline and BI analytical processing is best kept in the same unified storage system; this makes it easy to connect with existing systems and lets other engines pull data out for processing. Therefore, compatibility with the SQL language is also a must. 
Without looking too far ahead: a company that solves these problems well in the next 2-3 years will be very successful. &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Case
&lt;/h1&gt;

&lt;p&gt;This article takes as examples &lt;a href="https://www.snowflake.com/"&gt;&lt;strong&gt;&lt;em&gt;Snowflake&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;, an American public company, and &lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;, an open-source product from a Chinese startup. &lt;br&gt;
Snowflake is a typical PLG (Product-led Growth) company. In terms of product, Snowflake has delivered real customer value: elastic expansion and shrinkage of cloud storage. Specifically: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It truly takes advantage of the infinitely expandable storage and computing power of the cloud; &lt;/li&gt;
&lt;li&gt;It truly gives customers zero operations and maintenance with high availability, saving them worry; &lt;/li&gt;
&lt;li&gt;It saves customers money. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These principles coincide with how new consumer products are introduced to meet users' unmet needs, and the product details are done well. Snowflake, for example, designed the Virtual Warehouse, which comes in T-shirt sizes ranging from X-Small to 4X-Large, to isolate users from each other. Such product designs can only come from a deep understanding of requirements, and they provide great customer value. &lt;/p&gt;

&lt;p&gt;In addition, Snowflake has executed a better L-shaped strategy from a business perspective. In the health care sector, public information shows that it amplifies the value of data by enabling "data exchange" and even achieves network effects. But there is more to it than that. Snowflake has been suspected of inflating a bubble, yet given second-hand information (not available online), its bet on a company providing digital SaaS services in health care makes logical sense. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; meets the technology needs to solve the core problems of our customers in the HSAP space. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stream-batch integration: based on unified real-time storage, with upstream and downstream computing using an event-trigger mode, the output latency of downstream data is greatly shortened; &lt;/li&gt;
&lt;li&gt;Horizontal metadata scaling: supports managing tables with many partitions and files. &lt;/li&gt;
&lt;li&gt;Elastic resource scheduling: containerized elastic scaling, on-demand resource utilization, and support for both public and private cloud deployment; &lt;/li&gt;
&lt;li&gt;Open system and interfaces: serving is the mainstream workload, but other complex offline and BI analytical processing should also run on the same unified storage system. On the one hand, this connects with existing systems; on the other, it allows other engines to pull data out for processing. Therefore, compatibility with the SQL language is also a must.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Quickly develop risk control algorithms in business scenarios based on MetaSpore</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Wed, 15 Jun 2022 09:26:30 +0000</pubDate>
      <link>https://forem.com/dmetasoul/quickly-develop-risk-control-algorithms-in-business-scenarios-based-on-metaspore-3c1c</link>
      <guid>https://forem.com/dmetasoul/quickly-develop-risk-control-algorithms-in-business-scenarios-based-on-metaspore-3c1c</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/meta-soul/MetaSpore"&gt;MetaSpre&lt;/a&gt;:&lt;a href="https://github.com/meta-soul/MetaSpore"&gt;https://github.com/meta-soul/MetaSpore&lt;/a&gt;&lt;br&gt;
&lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;AlphaIDE&lt;/a&gt;:&lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;https://registry-alphaide.dmetasoul.com/login&lt;/a&gt;&lt;br&gt;
After decades of development, the traditional credit evaluation model for risk management is relatively mature and stable. Represented by the FICO score in the United States, it builds a rule engine and promotes the rapid development of the financial loan business in the United States. In recent years, with the leapfrog development of big data and artificial intelligence technologies, financial institutions can draw diversified user portraits and build more accurate risk control models under the support of new technologies.&lt;/p&gt;

&lt;p&gt;This article takes the Tianchi loan default dataset as an example, trains and evaluates a default prediction model with &lt;a href="https://github.com/meta-soul/MetaSpore"&gt;MetaSpore&lt;/a&gt; in the &lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;AlphaIDE&lt;/a&gt; development environment launched by &lt;a href="https://github.com/meta-soul"&gt;DMetaSoul&lt;/a&gt;, and derives an intelligent credit score from the estimated probability. The following chapters focus on environment usage, problem modeling, feature derivation, modeling, scorecards, and so on. &lt;/p&gt;
&lt;h1&gt;
  
  
  2. MetaSpore on AlphaIDE
&lt;/h1&gt;
&lt;h2&gt;
  
  
  2.1 IDE environment configuration and startup
&lt;/h2&gt;

&lt;p&gt;Register or log in to your AlphaIDE account: &lt;a href="https://registry-alphaide.dmetasoul.com/"&gt;https://registry-alphaide.dmetasoul.com/&lt;/a&gt;. Enter the AlphaIDE service, click Application Service on the left, open the Kubeflow drop-down menu, and enter the Jupyter page. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0CeFhWKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n57qpcx90wzm0vkybodi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0CeFhWKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n57qpcx90wzm0vkybodi.png" alt="Image description" width="301" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the upper right to create the Notebook and select the desired resources under CPU/RAM. A &lt;strong&gt;2-core CPU and 8 GB or more of RAM&lt;/strong&gt; are recommended: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CdEX8-un--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5kz79p70c5t7hmjleb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CdEX8-un--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5kz79p70c5t7hmjleb3.png" alt="Image description" width="836" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check &lt;strong&gt;Kubeflow, AWS, and Spark&lt;/strong&gt;, and then Launch to create the Notebook. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--POJ9-9MH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t2nn5jx7q8x309nzmz28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--POJ9-9MH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t2nn5jx7q8x309nzmz28.png" alt="Image description" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the Jupyter Notebook, click Connect to open a Terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:meta-soul/MetaSpore.git 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cloning the code base, you can run the algorithm demos and start developing.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 Training machine learning models
&lt;/h2&gt;

&lt;p&gt;See the algorithm projects in MetaSpore's demo directory for &lt;strong&gt;recommendation, search, NLP, and risk control-related applications&lt;/strong&gt;. Here we take the risk control project (demo/riskmodels/loan_default/) as an example: &lt;br&gt;
&lt;strong&gt;1. Start the Spark session:&lt;/strong&gt; if you train on the AlphaIDE distributed Spark cluster, you need to &lt;strong&gt;add the spark.kubernetes.namespace configuration parameter&lt;/strong&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def init_spark(app_name, cluster_namespace, ..., **kwargs):
    spark = pyspark.sql.SparkSession.builder\
        .appName(app_name) \
        .config("spark.kubernetes.namespace", cluster_namespace)
        ...
        .getOrCreate()
    sc = spark.sparkContext
    print(sc.version)
    print(sc.applicationId)
    print(sc.uiWebUrl)
    return spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you can also run in local mode by replacing the spark.kubernetes.namespace configuration in the builder chain above with master("local"). When the number of samples is small, local mode is the better choice. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start the model training script:&lt;/strong&gt; follow the steps in the README.md document to prepare the training data, initialize the required model configuration file, and then execute the training script. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Save the model training results:&lt;/strong&gt; save the trained model and estimation results to S3 cloud storage, as is done in default_estimation_spark_lgbm.py.&lt;/p&gt;

&lt;p&gt;It may take some time to download the dependent libraries during the first execution. &lt;/p&gt;

&lt;h1&gt;
  
  
  3. Intelligent risk control algorithm
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3.1 Background
&lt;/h2&gt;

&lt;p&gt;Before diving into the actual code implementation, let's take a quick look at the dataset. This article uses the dataset provided by the Tianchi community, whose task is to predict whether a user will default on a loan. The data comes from the loan records of a credit platform: over 1.2 million records in total, with 47 columns of variables, 15 of them anonymized, plus 200,000 records as test set A and 200,000 as test set B. The "isDefault" column serves as the training label, and the other columns can be used as model features, including ID, categorical, and numerical features. For the complete introduction to the feature columns, refer to the official description of the dataset; the following figure shows sample data: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XIcpbYmE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyeondjevgkf6f0k4s1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XIcpbYmE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyeondjevgkf6f0k4s1l.png" alt="Image description" width="880" height="154"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The so-called default rate prediction uses a machine learning model to build a binary classifier, that is, the model estimates: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K8qFIO1C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x3qk37b7832samjeh7z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K8qFIO1C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x3qk37b7832samjeh7z7.png" alt="Image description" width="229" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When implementing a risk control algorithm, the model's accuracy and interpretability must be considered together, so linear models and tree models are generally used. The following examples are also based on the LightGBM model. &lt;/p&gt;

&lt;p&gt;Based on the default rate prediction model, we can establish a credit score for each user, similar to Ant Group's Zhima (Sesame) Credit score. Generally speaking, the lower the default probability, the higher the score should be, which helps loan officers evaluate customers. &lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 Feature Engineering
&lt;/h2&gt;

&lt;p&gt;Evaluation problems related to financial loans are mainly based on tabular data, so the importance of feature engineering is self-evident. Common features in the dataset include ID, categorical, and continuous numerical types, which require standard data handling such as EDA, missing value imputation, outlier processing, normalization, feature binning, and importance assessment.&lt;br&gt;
The process is documented in the tianchi_loan section of the GitHub codebase: &lt;a href="https://github.com/meta-soul/MetaSpore/blob/main/demo/dataset"&gt;https://github.com/meta-soul/MetaSpore/blob/main/demo/dataset&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Weight of Evidence (WoE) encoding is required for feature derivation in risk control models, especially standard risk control scorecards. WoE represents the difference between the proportions of good and bad customers in the current feature bin, smoothed by the log function and calculated with the following formula: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xt5C4rIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/foi62qopaxem9hanyzg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xt5C4rIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/foi62qopaxem9hanyzg4.png" alt="Image description" width="319" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Among them: &lt;br&gt;
p_y and p_n respectively denote the proportion of good (or bad) customers in this bin relative to all good (or bad) customers; &lt;br&gt;
Bad_i and Bad_t respectively denote the number of bad customers in the current bin and the number of bad customers among all customers; &lt;br&gt;
Good_i and Good_t respectively denote the number of good customers in the current bin and the number of good customers among all customers. &lt;/p&gt;

&lt;p&gt;The larger the absolute value of WoE, the more significant the difference, and the better the feature bin predicts the outcome. On the other hand, if WoE is 0, the ratio of good to bad customers in the bin equals the overall ratio, and the bin has no predictive power. &lt;/p&gt;
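&lt;p&gt;To make the formula concrete, here is a minimal plain-Python sketch of per-bin WoE. The bin counts are invented, and the sign convention (log of the good proportion over the bad proportion) follows the definitions above; some teams use the inverse ratio, which only flips the sign: &lt;/p&gt;

```python
import math

def woe(good_i, good_total, bad_i, bad_total):
    """Weight of Evidence for one bin: ln(p_good / p_bad), where p_good and
    p_bad are this bin's shares of all good and all bad customers."""
    p_good = good_i / good_total
    p_bad = bad_i / bad_total
    return math.log(p_good / p_bad)

# Invented bins of a single feature: (good count, bad count) per bin.
bins = [(70, 30), (150, 50), (180, 20)]
good_total = sum(g for g, b in bins)  # 400
bad_total = sum(b for g, b in bins)   # 100

woes = [woe(g, good_total, b, bad_total) for g, b in bins]
# A bin whose good/bad mix matches the overall mix gets WoE close to 0
# (no signal); a large absolute WoE marks a strongly predictive bin.
```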

&lt;p&gt;The calculation process: &lt;a href="https://github.com/meta-soul/MetaSpore/tree/main/demo/dataset/tianchi_loan/woe.ipynb"&gt;https://github.com/meta-soul/MetaSpore/tree/main/demo/dataset/tianchi_loan/woe.ipynb&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In actual business scenarios, much of the work focuses on feature binning and filtering. Commonly used binning methods include equal-frequency, equal-width, Best-KS, and ChiMerge. For feature screening, the chi-square test, tree-model-based wrappers, and IV (Information Value, based on WoE) are the common methods. However, these tasks require a great deal of data analysis and iteration for each specific business. &lt;/p&gt;
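&lt;p&gt;As a hedged sketch of two of the techniques just mentioned, the following plain-Python snippet implements equal-frequency binning and IV on invented counts (production code would work from the dataset's real bins): &lt;/p&gt;

```python
import math

def equal_frequency_bins(values, n_bins):
    """Cut points for equal-frequency binning: sort the values and split
    them into n_bins chunks of (near-)equal size."""
    s = sorted(values)
    step = len(s) / n_bins
    return [s[int(i * step)] for i in range(1, n_bins)]

def information_value(bin_stats, good_total, bad_total):
    """IV = sum over bins of (p_good - p_bad) * WoE, with
    WoE = ln(p_good / p_bad); every term is non-negative."""
    iv = 0.0
    for good_i, bad_i in bin_stats:
        p_good = good_i / good_total
        p_bad = bad_i / bad_total
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv

# Invented (good, bad) counts per bin of one candidate feature.
stats = [(100, 40), (150, 35), (150, 25)]
iv = information_value(stats, good_total=400, bad_total=100)
# One common rule of thumb (it varies by team): below 0.02 useless,
# 0.02-0.1 weak, 0.1-0.3 medium, above 0.3 a strong predictor.
```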

&lt;h2&gt;
  
  
  3.3 Model training
&lt;/h2&gt;

&lt;p&gt;Once the samples and features are ready, we can train the model. Here we use the Spark version of the LightGBM model, which balances model performance and interpretability and is compatible with AlphaIDE and MetaSpore Serving. Model training code: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jhBn18qU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvagpokiyb1eotamibjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jhBn18qU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvagpokiyb1eotamibjd.png" alt="Image description" width="880" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another advantage of the Spark version of the machine learning model is that it can make full use of cluster computing resources during hyperparameter optimization. As a demonstration, we use Hyperopt to optimize the combination of two hyperparameters, learningRate and numIterations. Define the hyperparameter search space as follows: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OZvJyCFA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/glxo776swkknfed8k1a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OZvJyCFA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/glxo776swkknfed8k1a7.png" alt="Image description" width="880" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the train_with_hyperopt function is defined, optimize the parameter combination. The optimization process is slow; when execution finishes, best_params can be printed to view the best parameter combination: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5U0BK_HS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba1x6mjpzk3js4z2t4ua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5U0BK_HS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba1x6mjpzk3js4z2t4ua.png" alt="Image description" width="880" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practical business, hyperparameter optimization covers a larger space and consumes more computing time. After hyperparameter optimization is complete, the model can be retrained on the full sample set. After training, we can export the model to ONNX format; for details, please refer to the introduction in [2,3] or the export code in our GitHub repository: &lt;a href="https://github.com/meta-soul/MetaSpore/blob/main/demo/riskmodels/loan_default/default_estimation_spark_lgbm.py"&gt;https://github.com/meta-soul/MetaSpore/blob/main/demo/riskmodels/loan_default/default_estimation_spark_lgbm.py&lt;/a&gt; &lt;/p&gt;
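&lt;p&gt;Hyperopt's own API (fmin, hp.uniform, Trials) is richer than can be shown here; as a library-free illustration of the same idea, the sketch below runs a random search over the two hyperparameters named above against a toy objective (all values are illustrative, not the demo's actual code): &lt;/p&gt;

```python
import random

def random_search(objective, space, n_trials=20, seed=42):
    """Minimal stand-in for Hyperopt-style search: sample hyperparameter
    combinations uniformly from `space` and keep the lowest-loss one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        trials.append((objective(params), params))
    best_loss, best_params = min(trials, key=lambda t: t[0])
    return best_params, best_loss

# Toy objective standing in for "train the model, return 1 - AUC"; in the
# demo, this is the role played by the train_with_hyperopt function.
def toy_objective(p):
    return (p["learningRate"] - 0.1) ** 2 + (p["numIterations"] / 1000 - 0.3) ** 2

space = {"learningRate": (0.01, 0.3), "numIterations": (50, 1000)}
best_params, best_loss = random_search(toy_objective, space)
```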

&lt;h2&gt;
  
  
  3.4 Model evaluation
&lt;/h2&gt;

&lt;p&gt;In addition to the commonly used AUC metric, the Kolmogorov-Smirnov (KS) index is another measure for evaluating risk control models. The higher the KS index, the stronger the model's ability to differentiate risk. The business significance of the KS curve is that we allow the model to make a small number of errors in exchange for identifying as many bad samples as possible. In general: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KS&amp;lt;0.2&lt;/strong&gt;, the differentiation ability of the model is not high, and the application value is not high; &lt;br&gt;
&lt;strong&gt;0.2&amp;lt;=KS&amp;lt;0.4&lt;/strong&gt;, general models are concentrated in this interval, so it is necessary to continue to observe the tuning model; &lt;br&gt;
&lt;strong&gt;0.4&amp;lt;=KS&amp;lt;0.7&lt;/strong&gt;. The model has good differentiation ability and strong application value; &lt;br&gt;
&lt;strong&gt;KS&amp;lt;=0.7&lt;/strong&gt;, there may be a fitting phenomenon that needs to be checked. &lt;/p&gt;
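&lt;p&gt;For reference, the KS statistic itself can be computed directly as the maximum gap between the cumulative distributions of bad and good samples ranked by score. A minimal plain-Python sketch (not the MetaSpore toolbox implementation): &lt;/p&gt;

```python
def ks_statistic(scores, labels):
    """KS as the largest gap between the cumulative shares of bad (label 1)
    and good (label 0) samples when ranked by score, highest first
    (a higher score meaning a higher estimated default probability)."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    cum_bad = cum_good = 0
    ks = 0.0
    for _, label in pairs:
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
    return ks

# A perfectly separating model reaches KS = 1.0; a random one stays near 0.
ks_demo = ks_statistic([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
```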

&lt;p&gt;We provide a KS value calculation and KS drawing toolbox in MetaSpore. After we write the test results into S3 cloud storage, we can call the functions in the toolbox to evaluate the model: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H79wTOC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b38m2vlygxj1c64p2ni2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H79wTOC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b38m2vlygxj1c64p2ni2.png" alt="Image description" width="880" height="109"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qmdJWPYV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aiqt4iye5ictxpn8nyr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qmdJWPYV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aiqt4iye5ictxpn8nyr2.png" alt="Image description" width="880" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3.5 Credit Score
&lt;/h2&gt;

&lt;p&gt;Assuming that we have reasonably estimated the loan default rate through the machine learning model, we can give the following score: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---l_gZl0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1o84odl8c6w2vh1sj80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---l_gZl0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1o84odl8c6w2vh1sj80.png" alt="Image description" width="345" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the following two assumptions hold, then we can derive the formulas for the constants A and B: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Initial value assumption:&lt;/strong&gt; the score is anchored at S0, i.e. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FeIS3hpT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9r6ijy24qfe9v3185rst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FeIS3hpT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9r6ijy24qfe9v3185rst.png" alt="Image description" width="174" height="31"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Points to Double the Odds assumption:&lt;/strong&gt; when the odds double, the credit score decreases by a fixed PDO: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j_AjYyd7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfb9qpazfr3b7r2xiabn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j_AjYyd7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfb9qpazfr3b7r2xiabn.png" alt="Image description" width="247" height="23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The formula for calculating A and B: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kjAgS3kQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fq1t239kwhn5qbfxudzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kjAgS3kQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fq1t239kwhn5qbfxudzm.png" alt="Image description" width="187" height="64"&gt;&lt;/a&gt;&lt;br&gt;
Here are the results of running the scorecard. They show that the lower the default probability, the higher the credit score.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dIHDDnfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6e78p1vgdz6rlu2j4yg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dIHDDnfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6e78p1vgdz6rlu2j4yg9.png" alt="Image description" width="880" height="379"&gt;&lt;/a&gt;&lt;br&gt;
Refer to the GitHub repository for code implementation: &lt;a href="https://github.com/meta-soul/MetaSpore/tree/main/demo/riskmodels/loan_default/notebooks/credit_scorecard.ipynb"&gt;https://github.com/meta-soul/MetaSpore/tree/main/demo/riskmodels/loan_default/notebooks/credit_scorecard.ipynb&lt;/a&gt; &lt;/p&gt;
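&lt;p&gt;The two assumptions pin down A and B: B = PDO / ln(2) and A = S0 + B * ln(odds0). A minimal plain-Python sketch with invented calibration values (600 points at odds 1:50, PDO = 20): &lt;/p&gt;

```python
import math

def scorecard_constants(s0, odds0, pdo):
    """From the two assumptions: Score = A - B * ln(odds), with
    Score(odds0) = s0 and Score(2 * odds) = Score(odds) - pdo.
    Solving gives B = pdo / ln(2) and A = s0 + B * ln(odds0)."""
    b = pdo / math.log(2)
    a = s0 + b * math.log(odds0)
    return a, b

def credit_score(p_default, a, b):
    """Map an estimated default probability to a score, odds = p / (1 - p)."""
    odds = p_default / (1 - p_default)
    return a - b * math.log(odds)

# Invented calibration: 600 points at odds 1:50, 20 points to double the odds.
A, B = scorecard_constants(s0=600, odds0=1 / 50, pdo=20)
# A lower default probability yields a higher score, as in the output above.
```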

&lt;p&gt;The scoring results can also be evaluated with the KS indicator. In addition, note that implementing a standard scorecard requires a linear model, usually logistic regression. Besides a total credit score for each user, the linear model's intercept and each bin of every one-dimensional feature also need to be assigned scores.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;DMetaSoul used MetaSpore on AlphaIDE to quickly implement a loan default rate prediction model on an open-source dataset and build a scorecard on top of it. Beyond this version of the demo system, feature derivation, binning, and screening can be made more refined, and these often determine the performance ceiling of a risk control system. Finally, here are the code base addresses and the AlphaIDE trial link (AlphaIDE tutorial): &lt;br&gt;
Default rate forecast: &lt;a href="https://github.com/meta-soul/MetaSpore/tree/main/demo/riskmodels/loan_default"&gt;https://github.com/meta-soul/MetaSpore/tree/main/demo/riskmodels/loan_default&lt;/a&gt; &lt;br&gt;
MetaSpore's one-stop machine learning development platform: &lt;a href="https://github.com/meta-soul/MetaSpore"&gt;https://github.com/meta-soul/MetaSpore&lt;/a&gt; &lt;br&gt;
AlphaIDE trial link: &lt;a href="https://registry-alphaide.dmetasoul.com"&gt;https://registry-alphaide.dmetasoul.com&lt;/a&gt; &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A New One-stop AI development and production platform, AlphaIDE</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Wed, 15 Jun 2022 07:00:34 +0000</pubDate>
      <link>https://forem.com/dmetasoul/a-new-one-stop-ai-development-and-production-platform-alphaide-14kb</link>
      <guid>https://forem.com/dmetasoul/a-new-one-stop-ai-development-and-production-platform-alphaide-14kb</guid>
      <description>&lt;p&gt;&lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;AlphaIDE&lt;/a&gt;, a new one-stop AI development and production platform, lets you quickly deploy an intelligent data platform, providing one-click deployment and operation of a Web IDE development interface, data analysis, machine learning training, online prediction, and algorithm experimentation services.&lt;/p&gt;

&lt;p&gt;I’ve posted about &lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;, an open-source framework for unified streaming and batch table storage, and &lt;a href="https://github.com/meta-soul/MetaSpore"&gt;MetaSpore&lt;/a&gt;, an open-source platform for machine learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/qazmkop/the-design-concept-of-an-almighty-opensource-project-about-machine-learning-platform-46p"&gt;The design concept of the almighty Opensource project about machine learning platform, MetaSpore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2"&gt;The design concept of the best open-source project about big data and data lakehouse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using LakeSoul and MetaSpore, you can quickly develop real-time data analysis and complete offline machine learning pipelines; for details, see &lt;a href="https://dev.to/qazmkop/build-a-real-time-machine-learning-sample-library-using-the-best-open-source-project-about-big-data-and-data-lakehouse-lakesoul-55f5"&gt;Build a real-time machine learning sample library using the best open-source project about big data and data lakehouse, LakeSoul&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, deploying analytical computing tasks and online services in a production environment still involves navigating many environments. The development stage needs a simple, user-friendly integrated development environment; the deployment stage must consider container cluster construction, highly available workflow scheduling, load balancing and elastic scaling of online services, and adaptation to different public and private cloud platforms. There is a lot of bridging work between business development and deployment. In the past, this required an operations team familiar with cloud-native and container technologies to cooperate on development and deployment, which brought extra communication and time costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recently, I found a product, &lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;AlphaIDE&lt;/a&gt;, that addresses these issues, although it is still in beta. DMetaSoul officially launched &lt;a href="https://registry-alphaide.dmetasoul.com/login"&gt;AlphaIDE&lt;/a&gt; to solve the above problems, providing a complete development and production environment.&lt;/strong&gt; Through containerization and seamless connection with mainstream public clouds at home and abroad, an intelligent data platform can be deployed easily and quickly, providing one-click deployment and operation of a Web IDE development interface, data analysis, machine learning training, online prediction, and algorithm experimentation services. Since AlphaIDE is in beta, a small portion of the interface text is still in Chinese (most of it is in English), but it is well worth trying out with the help of the browser’s translation function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of AlphaIDE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data-centric one-stop development and production platform&lt;/strong&gt;&lt;br&gt;
AlphaIDE provides complete code development, job scheduling, online production service deployment, and rich DataOps and MLOps development tools. Developers can build and deploy the complete process from data ingestion through model computation to online experimentation by focusing on their data and models, without worrying about the underlying infrastructure. AlphaIDE hides details such as low-level cluster operations and cloud vendor integration, allowing data scientists and algorithm engineers to develop and go online independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Integrated standard development components&lt;/strong&gt;&lt;br&gt;
AlphaIDE integrates the self-developed LakeSoul and MetaSpore frameworks and provides open-source computing frameworks such as Spark and Flink. It also integrates a customized Kubeflow, supports production scheduling of Notebook workflows, and ships customized Jupyter/CodeServer container images, so developers do not need to package their own runtime images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cloud-native private deployment&lt;/strong&gt;&lt;br&gt;
AlphaIDE supports private deployment under a user’s own cloud service account. The user only needs to grant authorization for one cloud account, which can be revoked after the cluster is created; AlphaIDE is then deployed automatically under that account without manual intervention. All cluster resources and data reside in the user’s cloud VPCs (Virtual Private Clouds), which can connect to existing databases and storage systems on the intranet without repeated remote data imports, ensuring data privacy and security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Rich enterprise-level functionality&lt;/strong&gt;&lt;br&gt;
AlphaIDE provides SSO and Role-Based Access Control (RBAC) permission management, which can connect to an enterprise’s account system, such as LDAP, or a third-party account system, such as GitHub Teams. AlphaIDE also supports resource quota control and integrates log collection and query, service status, performance monitoring, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying out AlphaIDE
&lt;/h2&gt;

&lt;p&gt;The AlphaIDE Demo service is available online. After registering and logging in, users can enter the Demo cluster to experience AlphaIDE’s core functions for free. Part of the product interface is not yet in English, but it is still worth trying out AlphaIDE with the help of the browser’s translation function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e-Bsqy_O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esgny9odp8exi4zh9ip9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e-Bsqy_O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esgny9odp8exi4zh9ip9.png" alt="Image description" width="630" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Users can register for the demo cluster with an email account at &lt;a href="https://registry-alphaide.dmetasoul.com/"&gt;https://registry-alphaide.dmetasoul.com/&lt;/a&gt;. After passing email verification, click to enter the free test cluster and log in to the Demo cluster using the email address and password specified during registration. For details, see the &lt;a href="https://dev.to/qazmkop/usage-guidequickly-deploy-an-intelligent-data-platform-with-the-one-stop-ai-development-and-production-platform-alphaide-4bgn"&gt;AlphaIDE Service Usage Guide&lt;/a&gt;.&lt;br&gt;
In addition to the basic IDE instructions, AlphaIDE also provides code demos for running tests.&lt;/p&gt;

&lt;p&gt;Run the LakeSoul streaming/batch sample: &lt;a href="https://github.com/meta-soul/LakeSoul/wiki/Lakesoul-IDE-Demo"&gt;https://github.com/meta-soul/LakeSoul/wiki/Lakesoul-IDE-Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the MetaSpore model training example: &lt;a href="https://github.com/meta-soul/MetaSpore/blob/main/tutorials/metaspore-getting-started.ipynb"&gt;https://github.com/meta-soul/MetaSpore/blob/main/tutorials/metaspore-getting-started.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the Demo of DMetaSoul’s previously released MovieLens end-to-end recommendation system: &lt;a href="https://github.com/meta-soul/MetaSpore/tree/main/demo/movielens/offline#readme"&gt;https://github.com/meta-soul/MetaSpore/tree/main/demo/movielens/offline#readme&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You are welcome to sign up and try AlphaIDE.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>bigdata</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Usage Guide: Quickly deploy an intelligent data platform with the One-stop AI development and production platform, AlphaIDE</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Wed, 15 Jun 2022 06:29:47 +0000</pubDate>
      <link>https://forem.com/dmetasoul/usage-guidequickly-deploy-an-intelligent-data-platform-with-the-one-stop-ai-development-and-production-platform-alphaide-4bgn</link>
      <guid>https://forem.com/dmetasoul/usage-guidequickly-deploy-an-intelligent-data-platform-with-the-one-stop-ai-development-and-production-platform-alphaide-4bgn</guid>
      <description>&lt;h1&gt;
  
  
  1. Log in
&lt;/h1&gt;

&lt;p&gt;Open the AlphaIDE link: &lt;a href="https://registry-alphaide.dmetasoul.com/#/login"&gt;https://registry-alphaide.dmetasoul.com/#/login&lt;/a&gt;. You can register by email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jfv8S7YH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cksu4ueguwg3brps883k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jfv8S7YH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cksu4ueguwg3brps883k.png" alt="" width="673" height="882"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After registration, you will receive a verification link by email. After clicking it, you can log in with the email address and password you just registered.&lt;/p&gt;

&lt;p&gt;After login, &lt;strong&gt;click to enter the trial IDE environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the login page is displayed, log in with the email address and password from the previous step.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Using the IDE
&lt;/h1&gt;

&lt;h2&gt;
  
  
  2.1 Creating a Namespace
&lt;/h2&gt;

&lt;p&gt;First, in the left navigation bar, go to Kubeflow - Home:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6zfQkzfB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mdkao1r18vmdag5nj25c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6zfQkzfB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mdkao1r18vmdag5nj25c.png" alt="Image description" width="331" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the Kubeflow initialization page, click Start Setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---R7SGw6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3t1em7t1kkg17pqxlsg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---R7SGw6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3t1em7t1kkg17pqxlsg3.png" alt="Image description" width="880" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then on the Namespace creation page, click Finish. The default Namespace is the user name:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pI17mC_s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tkav0igtmgqtahd9c74u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pI17mC_s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tkav0igtmgqtahd9c74u.png" alt="Image description" width="880" height="823"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Creating a Jupyter Notebook
&lt;/h2&gt;

&lt;p&gt;After entering the Demo IDE service, click the application service on the left and click the Kubeflow drop-down menu to enter the Jupyter page.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C0hJdIvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27f4nmvovnb8beerqalr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C0hJdIvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/27f4nmvovnb8beerqalr.png" alt="Image description" width="301" height="269"&gt;&lt;/a&gt;&lt;br&gt;
Click Create Notebook in the upper right corner to go to the Notebook creation page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h4jgUY-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4haz8608p5r6d4smhnmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h4jgUY-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4haz8608p5r6d4smhnmv.png" alt="Image description" width="556" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After entering the Notebook name, check all the options in the Configurations area and keep the default settings elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BycL7K86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ppwa0k22ahdbibnho4xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BycL7K86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ppwa0k22ahdbibnho4xd.png" alt="Image description" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll to the bottom and click Launch. After the Notebook is created, click Connect to enter the Jupyter development environment. Many resource files need to be read during the initial loading, so wait a minute or so.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Testing Spark Tasks
&lt;/h2&gt;

&lt;p&gt;In Jupyter Notebook, create a Python3 Kernel Notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--36bfb3ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f4gtoo9wn5mkf7731kt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--36bfb3ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f4gtoo9wn5mkf7731kt1.png" alt="Image description" width="220" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After entering the Notebook code development screen, enter the following test code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config('spark.master', 'local')\
.getOrCreate()
from datetime import datetime, date
from pyspark.sql import Row
df = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then press Shift + Enter to see the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8zqdZSlQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk5j4o0y993izeclxg64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8zqdZSlQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fk5j4o0y993izeclxg64.png" alt="Image description" width="880" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AlphaIDE’s Jupyter integrates plug-ins such as the Python Language Server and Spark Monitor to provide functions such as Python code completion and Spark task progress display, facilitating development and debugging. You can also install additional plug-ins or themes in the Jupyter Extensions interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.4 Testing MetaSpore Tasks
&lt;/h2&gt;

&lt;p&gt;AlphaIDE is already integrated with MetaSpore. You can test MetaSpore’s introductory tutorial Notebook: &lt;a href="https://github.com/meta-soul/MetaSpore/blob/main/tutorials/metaspore-getting-started.ipynb"&gt;https://github.com/meta-soul/MetaSpore/blob/main/tutorials/metaspore-getting-started.ipynb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The S3 bucket name of the AlphaIDE Demo service is alphaide-demo. Replace YOUR_S3_BUCKET in the tutorial with this bucket name, and prefix the paths used to save data with s3://alphaide-demo/. The feature description schema file required for the test is in the tutorial directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.5 Testing LakeSoul Tasks
&lt;/h2&gt;

&lt;p&gt;LakeSoul Demo link: &lt;a href="https://github.com/meta-soul/LakeSoul/wiki/03.-Usage-Doc#1-create-and-write-lakesoultable"&gt;https://github.com/meta-soul/LakeSoul/wiki/03.-Usage-Doc#1-create-and-write-lakesoultable&lt;/a&gt;&lt;br&gt;
The introduction of LakeSoul: &lt;a href="https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2"&gt;https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2.6 Running the MovieLens Demo
&lt;/h2&gt;

&lt;p&gt;DMetaSoul provides a MovieLens Demo: &lt;a href="https://github.com/meta-soul/MetaSpore/blob/main/demo/movielens/offline/README-CN.md"&gt;https://github.com/meta-soul/MetaSpore/blob/main/demo/movielens/offline/README-CN.md&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>bigdata</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Solved a practical business problem when using Hudi: LakeSoul supports null field non-override semantics</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Sun, 29 May 2022 10:45:22 +0000</pubDate>
      <link>https://forem.com/dmetasoul/solved-a-practical-business-problem-when-using-hudi-lakesoul-supports-null-field-non-override-semanticssemantics-a6o</link>
      <guid>https://forem.com/dmetasoul/solved-a-practical-business-problem-when-using-hudi-lakesoul-supports-null-field-non-override-semanticssemantics-a6o</guid>
      <description>&lt;p&gt;Recently, the &lt;a href="https://github.com/meta-soul/LakeSoul"&gt;&lt;strong&gt;LakeSoul&lt;/strong&gt;&lt;/a&gt; R&amp;amp;D team helped a user solve a practical business problem encountered when using Hudi; here is a summary. The business process: an upstream system extracts raw data from an online DB table into JSON format and writes it into Kafka; a downstream system uses Spark to read the messages from Kafka, updates and aggregates the data using Hudi, and sends the results to a downstream database for analysis.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qTHwEp11--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w729gkwjtmtuatbagpj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qTHwEp11--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w729gkwjtmtuatbagpj3.png" alt="Image description" width="702" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some Kafka messages contain only a subset of the original table's fields. Sample Kafka data: {A: A1, C: C4, D: D6, E: E7} {A: A2, B: B4, E: E6} {A: A3, B: B5, C: C5, D: D5, E: E5}. In subsequent updates, a missing field should keep its latest historical value rather than being overwritten.&lt;br&gt;
The following figure shows a simplified data update process. The original table has five fields, A, B, C, D, and E, where field A is the primary key and its type is string. Spark reads batch data from Kafka and converts it to the format required by Upsert (a DataFrame with a fixed schema); the updated table contents are then read via MOR (Merge on Read).&lt;/p&gt;

&lt;p&gt;Implementing this business process with Hudi's Merge on Read runs into a problem: the misaligned JSON data above has no fixed schema, which is not supported. If each missing field is instead filled with a null value, the invalid null overwrites the original content. If Merge Into is used, the write path degrades to Copy on Write, and write performance fails to meet requirements. A workaround is to fetch the unchanged data from the original table to complete each record, but this increases resource costs and development workload, which is inconsistent with the user's expectations.&lt;/p&gt;

&lt;p&gt;LakeSoul supports custom MergeOperators: when performing an Upsert, each field can be given a user-defined MergeOperator, whose input is the field's original value together with the newly upserted values, so the merge result can be decided according to business requirements. Writing one is as simple as writing a native Spark UDF. Since Upsert requires a primary key, multiple delta files may hold different values for the same primary key and the same field, and the MergeOperator controls how these values are merged. The default MergeOperator implementation is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class DefaultMergeOp[T] extends MergeOperator[T] {
  override def mergeData(input: Seq[T]): T = {
    input.last
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this scenario, you can define a MergeOperator that treats null as a unique marker for missing fields (the business guarantees the marker never conflicts with normal data) and skips it during the merge, returning the latest non-null value. When Spark processes the JSON data and executes the Upsert, nulls are thus ignored and the original content is not overwritten. This removes the need to pre-fill missing fields with data from the original table, significantly improving execution efficiency and simplifying the code logic. The code for this custom MergeOperator is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
class MergeNonNullOp[T] extends MergeOperator[T] {
  override def mergeData(input: Seq[T]): T = {
    // Drop real nulls and the "null" string marker, then keep the latest value
    val nonNull = input.filter(_ != null)
    nonNull.filter(!_.equals("null")).last
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, a simple custom implementation of MergeOperator solves an otherwise complex business problem.&lt;/p&gt;
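
&lt;p&gt;To make the merge semantics concrete, here is a plain-Python sketch. This is not the LakeSoul API; the table layout and helper names are hypothetical, chosen only to mimic MergeNonNullOp's behavior: per primary key and field, values from successive Upserts are kept in write order, and merge-on-read returns the last value that is not a null marker.&lt;/p&gt;

```python
# "null" (and Python's None) stand in for the unique missing-field marker;
# the business guarantees the marker never collides with normal data.
NULL_MARKERS = (None, "null")

def merge_non_null(values):
    """Mimic MergeNonNullOp.mergeData: the latest non-null value wins."""
    kept = [v for v in values if v not in NULL_MARKERS]
    return kept[-1] if kept else None

def upsert_rows(table, rows, fields):
    """Accumulate upserted values per primary key (field A) and per field."""
    for row in rows:
        rec = table.setdefault(row["A"], {f: [] for f in fields})
        for f in fields:
            rec[f].append(row.get(f))  # a missing field arrives as the marker
    return table

def read_merged(table, fields):
    """Merge-on-read view: apply the merge operator per primary key and field."""
    return {pk: {f: merge_non_null(vals[f]) for f in fields}
            for pk, vals in table.items()}

fields = ["A", "B", "C"]
table = {}
upsert_rows(table, [{"A": "A1", "B": "B1", "C": "C1"}], fields)
# The second upsert for key A1 omits field B; the earlier B value survives.
upsert_rows(table, [{"A": "A1", "C": "C2"}], fields)
merged = read_merged(table, fields)
# merged["A1"] == {"A": "A1", "B": "B1", "C": "C2"}
```

&lt;p&gt;With the default operator (input.last), the second upsert would have overwritten B with the marker; the null-ignoring operator keeps B1 while still taking the newer C2.&lt;/p&gt;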

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; team plans to ship this empty-field-ignoring MergeOperator as a built-in, with a global option controlling whether it is enabled by default, further improving development efficiency. See the GitHub issue: &lt;a href="https://github.com/meta-soul/LakeSoul/issues/30"&gt;https://github.com/meta-soul/LakeSoul/issues/30&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the future, &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; will also support Merge Into SQL syntax for defining Upsert and Merge on Read behavior, further improving the expressiveness of streaming and batch write updates.&lt;br&gt;
For more information on LakeSoul, the cloud-native unified streaming and batch table storage framework, refer to the previous articles:&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/qazmkop/build-a-real-time-machine-learning-sample-library-using-the-best-open-source-project-about-big-data-and-data-lakehouse-lakesoul-55f5"&gt;Build a real-time machine learning sample library using the best open-source project about big data and data lakehouse, LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2"&gt;Design concept of a best opensource project about big data and data lakehouse&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/qazmkop/4-best-opensource-projects-about-big-data-you-should-try-out-1j6c"&gt;4 best opensource projects about big data you should try out&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>What is the Lakehouse, the latest Direction of Big Data Architecture?</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Sat, 14 May 2022 17:06:38 +0000</pubDate>
      <link>https://forem.com/dmetasoul/what-is-the-lakehouse-the-latest-direction-of-big-data-architecture-59i</link>
      <guid>https://forem.com/dmetasoul/what-is-the-lakehouse-the-latest-direction-of-big-data-architecture-59i</guid>
      <description>&lt;p&gt;&lt;strong&gt;1. Terminology&lt;/strong&gt;&lt;br&gt;
Since the article uses many terms, the main ones are briefly introduced first to make it easier to read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Database:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Databases have been used in computers since the 1960s. At that stage, however, database structures were mainly hierarchical or network models, with extreme dependence between data and programs, so applications were relatively limited.&lt;br&gt;
Today, "database" usually means a relational database, i.e., a database that organizes data using the relational model. It stores data in rows and columns and offers high structure, strong data independence, and low redundancy. The relational model, introduced in 1970, truly separated data from programs, and relational databases became an integral part of mainstream computer systems and one of the most important database products. Almost all new database products support the relational model, and even many non-relational products provide relational interfaces.&lt;br&gt;
Relational databases are mainly used for Online Transaction Processing (OLTP), which handles basic, routine transactions such as bank transactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data warehouse:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
With the large-scale application of databases, data in the information industry grew explosively. To study the relationships between data and mine its hidden value, more and more people needed Online Analytical Processing (OLAP) to analyze data and explore deeper relationships and information. However, sharing data between different databases is difficult, and data integration and analysis are also very challenging.&lt;br&gt;
To solve the problem of enterprise data integration and analysis, Bill Inmon, the father of the data warehouse, proposed the data warehouse in 1990. Its primary function is to run OLAP, through the data warehouse's unique storage architecture, over the large amounts of data accumulated by OLTP over the years, helping decision-makers quickly and effectively extract valuable information from massive data and providing decision support. Since the emergence of the data warehouse, the information industry has evolved from operational systems based on relational databases toward decision support systems.&lt;/p&gt;

&lt;p&gt;Compared with a database, a data warehouse has two characteristics:&lt;br&gt;
A data warehouse is subject-oriented and integrated. It is built to support analysis across businesses using scattered operational data, so the required data must be extracted from multiple heterogeneous sources, processed and integrated, reorganized by subject, and finally loaded into the warehouse.&lt;br&gt;
A data warehouse mainly supports enterprise decision analysis, and the data operations involved are mostly queries, so it can improve query speed and reduce overhead by optimizing table structures and storage layouts. Although warehouses are well suited to structured data, many modern enterprises must deal with unstructured and semi-structured data of high diversity, velocity, and volume; for many of these scenarios, data warehousing is unsuitable and not the most cost-effective option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data lake&lt;/em&gt;:&lt;/strong&gt;&lt;br&gt;
The essence of a data lake is a solution composed of a data storage architecture plus data processing tools. The storage architecture must be scalable and reliable enough to store massive data of any type, including structured, semi-structured, and unstructured data. The processing tools fall into two broad categories: the first focuses on moving data into the lake, including defining data sources, formulating data synchronization policies, moving data, and compiling data catalogs; the second focuses on analyzing, mining, and utilizing the data in the lake. A data lake needs comprehensive data management, diverse data analysis, full data life cycle management, and secure data acquisition and publishing capabilities. Without these management tools, metadata goes missing, the quality of the data in the lake cannot be guaranteed, and the lake eventually deteriorates into a data swamp.&lt;/p&gt;

&lt;p&gt;It has become a common understanding within enterprises that data is an important asset. As enterprises develop, data keeps piling up, and they hope to keep all data relevant to production and operations, manage and govern it centrally, and mine and explore its value. Data lakes were created in this context. A data lake is a large repository that centrally stores structured and unstructured data. It can store raw data from multiple sources and of various types, and the data can be accessed, processed, analyzed, and transmitted without prior structuring. A data lake helps enterprises quickly perform federated analysis across heterogeneous data sources and explore the value of their data.&lt;/p&gt;

&lt;p&gt;With the development of big data and AI, the value of the data in data lakes is rising and being redefined. A data lake can give enterprises a variety of capabilities, such as centralized data management, help in building more optimized operating models, and other capabilities such as predictive analysis and recommendation models, which stimulate the subsequent growth of enterprise capabilities.&lt;br&gt;
The difference between a data warehouse and a data lake can be likened to that between a warehouse and a lake: a warehouse stores goods from specific sources, while lake water comes from rivers, streams, and other sources and is raw. While good for storing data, data lakes lack some key features: they do not support transaction processing, do not guarantee data quality, and lack consistency/isolation, making it almost impossible to mix appends and reads or to combine batch and streaming jobs. For these reasons, many data lake capabilities have not been realized, and the benefits of the lake are lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data lakehouse:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Wikipedia does not give a specific definition of the lakehouse. A lakehouse combines the advantages of the data lake and the data warehouse: on low-cost cloud storage in open formats, it implements data structures and data management functions similar to those of a data warehouse. Its features include concurrent data reads and writes, architectural support for data governance mechanisms, direct access to source data, separation of storage and computing resources, open storage formats, support for structured and semi-structured data (including audio and video), and end-to-end streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Evolution direction of big data system:&lt;/strong&gt;&lt;br&gt;
In recent years, many new computing and storage frameworks have emerged in the big data field. On the computing side, general-purpose engines represented by Spark and Flink and OLAP systems represented by ClickHouse have emerged. On the storage side, object storage has become the new standard and an important foundation for data lake and lakehouse integration, and local cache acceleration layers such as Alluxio and JuiceFS have appeared. Several key evolution directions stand out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cloud-native. Public and private clouds provide hardware abstraction for computing and storage, abstracting away traditional IaaS management and operations. An important feature of cloud-native is that both computing and storage are elastic. Making good use of this elasticity to reduce costs while improving resource utilization is an issue every computing and storage framework needs to consider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time. Traditional Hive is an offline data warehouse that provides T+1 data processing, which cannot meet new service requirements, and the traditional Lambda architecture introduces complexity and data inconsistencies. How to build an efficient real-time data warehouse that achieves real-time or quasi-real-time writes, updates, and analysis on low-cost cloud storage is a new challenge for computing and storage frameworks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computing engine diversification. Big data computing engines are blooming: while MapReduce is dying out, Spark, Flink, and various OLAP frameworks are thriving. Each framework has its own design focus, some specializing in vertical scenarios and others converging in features, and the selection of big data frameworks is becoming more and more diverse.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this context, the lakehouse and unified stream-batch processing emerged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3. What problems can be solved by integrating the lakehouse?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;3.1 Connect data storage and computing&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Companies' need for flexible, high-performance systems serving a wide range of data applications, including SQL analysis, real-time monitoring, data science, and machine learning, has not diminished. Most of the latest advances in AI come from models that better handle unstructured data (text, images, video, audio). The two-dimensional relational tables of a pure data warehouse cannot handle semi-structured or unstructured data, and AI engines cannot run on data warehouse models alone. A common solution is to combine the advantages of the data lake and the warehouse into a lakehouse, which resolves the limitations of the data lake by implementing data structures and data management functions similar to those of a data warehouse directly on the lake's low-cost storage.&lt;/p&gt;

&lt;p&gt;The data warehouse platform grew out of big data demands, while the data lake platform grew out of AI demands. The two platforms are completely separated at the cluster level, and data and computation cannot flow freely between them. With a lakehouse, data can flow seamlessly between the data lake and the data warehouse, connecting the different layers of data storage and computation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3.2 Flexibility and ecological richness&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A lakehouse combines the flexibility and ecological richness of the data lake with the growth and enterprise-grade capabilities of the data warehouse. Its main advantages are as follows:&lt;br&gt;
Data duplication: if an organization maintains a data lake and multiple data warehouses simultaneously, there is inevitably data redundancy. At best this leads to inefficient data processing; at worst it leads to inconsistent data. The lakehouse removes this duplication and makes the data truly unique.&lt;br&gt;
High storage costs: data warehouses and data lakes both try to reduce the cost of data storage. Data warehouses often do so by reducing redundancy and integrating heterogeneous data sources, while data lakes store computational data on inexpensive hardware using big data file systems and Spark. The lakehouse architecture combines these techniques to reduce costs as much as possible.&lt;/p&gt;

&lt;p&gt;Differences between reporting and analysis applications: data scientists tend to work with data lakes, applying various analytical techniques to raw data, while reporting analysts tend to use consolidated data such as data warehouses or data marts. In an organization there is often little overlap between the two teams, yet a certain amount of duplication and contradiction between their work. With a lakehouse, both teams can work on the same data architecture, avoiding unnecessary duplication.&lt;/p&gt;

&lt;p&gt;Data stagnation: data stagnation is one of the most severe problems of the data lake, which can quickly turn into a data swamp if left ungoverned. Data tends to be thrown into the lake easily but governed poorly, and in the long run the timeliness of the data becomes increasingly difficult to trace. A lakehouse that manages massive data can keep analysis data fresh more effectively.&lt;br&gt;
Risk of potential incompatibilities: data analytics is still an emerging field, and new tools and techniques appear every year. Some may only be compatible with data lakes, others only with data warehouses. A lakehouse means being prepared for both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Conclusion:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In general, the lakehouse has the following key characteristics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transaction support: in an enterprise, data is often read and written concurrently by business systems. ACID transaction support ensures the consistency and correctness of concurrent data access, especially in SQL access mode.&lt;/li&gt;
&lt;li&gt;Data modeling and data governance: the lakehouse can support the realization and transformation of various data models and DW modeling architectures, such as the star and snowflake schemas. The system should ensure data integrity and have robust governance and audit mechanisms.&lt;/li&gt;
&lt;li&gt;BI support: the lakehouse lets BI tools run directly on the source data, which speeds up analysis and reduces data latency. It is also more cost-effective than operating two separate copies of the data.&lt;/li&gt;
&lt;li&gt;Storage-compute separation: separating storage from compute lets the system scale to greater concurrency and data capacity. (Some newer data warehouses have adopted this architecture.)&lt;/li&gt;
&lt;li&gt;Openness: with open, standardized storage formats (such as Parquet) and rich API support, various tools and engines (including machine learning and Python/R libraries) can access the data directly and efficiently.&lt;/li&gt;
&lt;li&gt;Support for multiple data types (structured and unstructured): the lakehouse provides storage, transformation, analysis, and access for data including images, video, audio, semi-structured data, and text.&lt;/li&gt;
&lt;li&gt;Support for various workloads: including data science, machine learning, and SQL queries and analysis. These workloads may require multiple tools, but all are backed by the same data repository.&lt;/li&gt;
&lt;li&gt;End-to-end streaming: real-time reporting has become a normal requirement in the enterprise. With streaming support, there is no longer a need to build a dedicated system just to serve real-time data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;4. Four best open-source lakehouse projects&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/apache/hudi"&gt;Hudi&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Hudi is an open-source project providing tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping your data in open-source file formats.&lt;br&gt;
Apache Hudi brings core warehouse and database functionality directly to a data lake, which is great for streaming workloads, and enables users to create efficient incremental batch pipelines. Hudi is also very compatible: it can be used on any cloud, and it supports Apache Spark, Flink, Presto, Trino, Hive, and many other query engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/apache/iceberg"&gt;Iceberg&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Iceberg is an open table format for huge analytic datasets, with schema evolution, hidden partitioning, partition layout evolution, time travel, version rollback, and more.&lt;br&gt;
Iceberg was built for huge tables, even those that cannot be read with a distributed SQL engine, and is used in production where a single table can contain tens of petabytes of data. Iceberg is known for fast scan planning, advanced filtering, compatibility with any cloud store, serializable isolation, and support for multiple concurrent writers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;Lakesoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
LakeSoul is a unified streaming and batch table storage solution built on the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operation, schema evolution, and streaming &amp;amp; batch unification.&lt;br&gt;
LakeSoul specializes in row- and column-level incremental upserts, highly concurrent writes, and bulk scans for data on cloud storage. Its cloud-native compute-storage-separated architecture makes deployment very simple while supporting huge amounts of data at a lower cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Delta Lake&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Delta Lake is an open-source storage framework for building a lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, with APIs for Scala, Java, Rust, Ruby, and Python. It provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/apache/hudi"&gt;Hudi&lt;/a&gt;&lt;/em&gt; focuses more on the fast landing of streaming data and the correction of delayed data. &lt;a href="https://github.com/apache/iceberg"&gt;Iceberg&lt;/a&gt; focuses on providing a unified operation API by shielding the differences of the underlying data storage formats, forming a standard, open and universal data organization lattice, so that different engines can access through API. &lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;Lakesoul&lt;/a&gt;&lt;/em&gt;, now based on spark, focuses more on building a standardized pipeline of data lakehouse. &lt;em&gt;Delta Lake&lt;/em&gt;, an open-source project from Databricks, tends to address storage formats such as Parquet and ORC on the Spark level.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>database</category>
    </item>
    <item>
      <title>MMML | Deploy HuggingFace training model rapidly based on MetaSpore</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Wed, 11 May 2022 15:15:36 +0000</pubDate>
      <link>https://forem.com/dmetasoul/mmml-deployment-huggingface-training-model-rapidly-based-on-metaspore-29ig</link>
      <guid>https://forem.com/dmetasoul/mmml-deployment-huggingface-training-model-rapidly-based-on-metaspore-29ig</guid>
      <description>&lt;p&gt;A few days ago, HuggingFace announced a $100 million Series C funding round, which was big news in open source machine learning and could be a sign of where the industry is headed. Two days before the HuggingFace funding announcement, open-source machine learning platform &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;&lt;strong&gt;MetaSpore&lt;/strong&gt;&lt;/a&gt; released a demo based on the HuggingFace Rapid deployment pre-training model. &lt;/p&gt;

&lt;p&gt;As deep learning makes innovative breakthroughs in computer vision, natural language processing, speech understanding, and other fields, more and more unstructured data can be perceived, understood, and processed by machines. These advances are mainly due to the powerful learning ability of deep learning: by pre-training deep models on massive data, the models capture internal data patterns that help many downstream tasks. As industry and academia invest more and more energy in pre-training research, pre-trained model hubs such as HuggingFace and Timm have emerged one after another, and the open-source community is releasing the dividends of large pre-trained models at an unprecedented speed.&lt;/p&gt;

&lt;p&gt;In recent years, the data that machines model and understand has gradually evolved from single-modal to multimodal, and the semantic gap between different modalities is being eliminated, making cross-modal data retrieval possible. Take CLIP, OpenAI's open-source work, as an example: it pre-trains twin towers for images and texts on a dataset of 400 million image-text pairs and connects the semantics of the two. Many academic researchers have since tackled multimodal problems such as image generation and retrieval based on this technique. But although frontier technology has bridged the semantic gap between modalities, putting it into production still involves heavy model tuning, offline data processing, high-performance online inference architecture design, heterogeneous computing, and online algorithm deployment, which keeps frontier multimodal retrieval from landing in production and becoming broadly accessible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/meta-soul" rel="noopener noreferrer"&gt;&lt;strong&gt;DMetaSoul&lt;/strong&gt;&lt;/a&gt; aims at the above technical pain points, abstracting and uniting many links such as model training optimization, online reasoning, and algorithm experiment, forming a set of solutions that can quickly apply offline pre-training model to online. This paper will introduce how to use the HuggingFace community pre-training model to conduct online reasoning and algorithm experiments based on &lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt; technology ecology so that the benefits of the pre-training model can be fully released to the specific business or industry and small and medium-sized enterprises. And we will give the text search text and text search graph two multimodal retrieval demonstration examples for your reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Multimodal semantic retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The sample architecture of multimodal retrieval is as follows:&lt;br&gt;
Our multimodal retrieval system supports both text-to-text and text-to-image search scenarios and includes offline processing, model inference, online services, and other core modules:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd6cskusiusv16rcnnna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd6cskusiusv16rcnnna.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Offline processing, including the offline data processing flows for the text-to-text and text-to-image scenarios: model tuning, model export, index database construction, data push, etc.&lt;/li&gt;
&lt;li&gt;Model inference. After the offline model training, we deployed our NLP and CV large models based on the MetaSpore Serving framework. MetaSpore Serving helps us conveniently perform online inference, elastic scheduling, load balancing, and resource scheduling in heterogeneous environments.&lt;/li&gt;
&lt;li&gt;Online services. Based on MetaSpore's online algorithm application framework, MetaSpore has a complete set of reusable online search services, including a front-end retrieval UI, multimodal data preprocessing, vector recall and ranking algorithms, an A/B experimental framework, etc. MetaSpore supports both text-to-text and text-to-image search and can be migrated to other application scenarios at low cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The HuggingFace open-source community provides several excellent baseline models for multimodal retrieval problems, and these are often the starting point for optimization in industry. MetaSpore also uses HuggingFace community pre-trained models in its online text-to-text and text-to-image search services. Text-to-text search is based on a semantic similarity model optimized by MetaSpore for the question-answering domain, and text-to-image search is based on a community pre-trained model.&lt;/p&gt;

&lt;p&gt;These community open-source pre-trained models are exported to the general ONNX format and loaded into MetaSpore Serving for online inference. The following sections describe model export and the online retrieval algorithm services in detail. Model inference is a standardized SaaS service loosely coupled with the business. Interested readers can refer to my previous post: The design concept of MetaSpore, a new generation of the one-stop machine learning platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.1 Offline Processing&lt;/strong&gt;&lt;br&gt;
Offline processing mainly involves exporting and loading the online models, and building and pushing the index for the document library. You can follow the step-by-step instructions below to complete the offline processing for text-to-text and text-to-image search and see how an offline pre-trained model performs inference in MetaSpore.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;1.1.1 Search text by text&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Traditional text retrieval systems are based on literal matching algorithms such as BM25. Because users' query words are diverse, a semantic gap between queries and documents is often encountered; for example, users misspell "iPhone" as "Phone," or search terms are extremely long, such as "1 ~ 3 months old baby autumn small size bag pants". Traditional text retrieval systems use spelling correction, synonym expansion, query rewriting, and other means to alleviate the semantic gap, but fundamentally fail to solve the problem. Only when the retrieval system fully understands users' queries and documents can it meet users' retrieval demands at the semantic level. With the continuous progress of pre-training and representation learning, some commercial search engines have been integrating semantic vector retrieval into their retrieval ecosystems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Semantic retrieval model&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This paper introduces a set of semantic vector retrieval applications. MetaSpore built a set of semantic retrieval systems based on encyclopedia question and answer data. MetaSpore adopted the Sentence-Bert model as the semantic vector representation model, which fine-tunes the twin tower BERT in supervised or unsupervised ways to make the model more suitable for retrieval tasks. The model structure is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k4mc92xyn2xihyru4dj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k4mc92xyn2xihyru4dj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Query-Doc symmetric two-tower model is used for text search and question-answer retrieval. The vector representations of the online Query and the offline Doc share the same model, so the offline library-building model and the online Query inference model must be consistent. This case uses MetaSpore's text representation model Sbert-Chinese-QMC-domain-V1, optimized on open-source semantic similarity datasets. The model encodes the question-answer data as vectors during offline database construction, and encodes the user query as a vector during online retrieval; because Query and Doc live in the same semantic space, users' semantic retrieval demands can be served by vector similarity calculation.&lt;/p&gt;
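
&lt;p&gt;The vector similarity step at the end of this chain can be sketched in a few lines of Python. This is a toy illustration with made-up four-dimensional vectors; a production system delegates this search to a vector database such as Milvus:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy document vectors, as if produced offline by the Doc tower.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0, 0.1],
    "doc_b": [0.0, 0.8, 0.6, 0.0],
    "doc_c": [0.7, 0.2, 0.1, 0.0],
}

# A query vector, as if produced online by the Query tower.
query = [1.0, 0.1, 0.0, 0.1]

# Rank documents by similarity to the query, best match first.
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query, doc_vectors[d]),
                reverse=True)
print(ranked)  # doc_a is closest to the query
```

&lt;p&gt;Because both towers map into the same semantic space, this single metric suffices for retrieval; at scale, approximate nearest-neighbor indexes replace the exhaustive loop.&lt;/p&gt;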

&lt;p&gt;Since the text representation model encodes the Query online, we need to export the model for the online service. Go to the Q&amp;amp;A data library code directory and export the model following the documentation. The script uses PyTorch tracing to export the model into the "./export" directory. The exported artifacts are mainly the ONNX model used for online inference, the Tokenizer, and related configuration files. They are loaded into MetaSpore Serving by the online serving system described below for model inference. Since the exported model will be copied to cloud storage, you need to configure the related variables in env.sh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Build the library for text-to-text search&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The retrieval database is built on a million-scale encyclopedia question-answer dataset. Following the description document, download the data and complete the database construction. The question-answer data is encoded into vectors by the offline model, and the resulting data is pushed to the service components. The whole database construction process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing: convert the original data into a more general JSONLine format for database construction;&lt;/li&gt;
&lt;li&gt;Build the index: use the same model as online, sbert-chinese-qmc-domain-v1, to index documents (one document object per line);&lt;/li&gt;
&lt;li&gt;Push the inverted (vector) and forward (document field) data to each component server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following is an example of the database data format. After offline database construction is completed, various data are pushed to corresponding service components, such as Milvus storing vector representation of documents and MongoDB storing summary information of documents. Online retrieval algorithm services will use these service components to obtain relevant data.&lt;/p&gt;
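
&lt;p&gt;The exact schema is defined by the repository's build scripts; as an illustration only (the field names here are hypothetical), a JSONLine record carrying both the forward document fields and the inverted vector might be produced like this:&lt;/p&gt;

```python
import json

# One document per line: forward fields (for the document store, e.g. MongoDB)
# plus the embedding vector (for the vector store, e.g. Milvus).
record = {
    "id": "qa-000001",
    "question": "What is a data lakehouse?",
    "answer": "An architecture combining data lake and data warehouse features.",
    "vector": [0.012, -0.034, 0.087, 0.003],  # truncated; real vectors have hundreds of dims
}

line = json.dumps(record)   # one JSONLine entry, ready to append to the build file
restored = json.loads(line)
```

&lt;p&gt;Each line of the build file is an independent JSON object, which makes the format easy to stream and to split across the downstream service components.&lt;/p&gt;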

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.1.2 Search image by text&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Text and images are easy for humans to relate semantically but difficult for machines. From the perspective of data form, text is one-dimensional, discrete, ID-based data built from words, while images are continuous two- or three-dimensional data. Moreover, text is a subjective creation of human beings with rich expressive power, including twists, metaphors, and other devices, while images are machine representations of the objective world. In short, bridging the semantic gap between text and image data is much more complex than text-to-text search. Traditional text-to-image retrieval generally relies on external text descriptions of the images and retrieves through the text associated with each image, which in essence degrades the problem to text search. It also faces many issues, such as how to obtain the associated text for pictures and whether the accuracy of the underlying text search is high enough. Deep models have gradually evolved from single-modal to multimodal in recent years. Taking OpenAI's open-source project CLIP as an example, it trains on massive image-text data from the Internet and maps text and image data into the same semantic space, making semantic-vector-based text-to-image search possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;CLIP graphic model&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The text-to-image search introduced in this article is implemented with semantic vector retrieval, using the CLIP pre-trained model in a two-tower retrieval architecture. Because CLIP has already aligned the semantics of its text-side and image-side towers on massive image-text data, it is particularly suitable for the text-to-image scenario. The model structure is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq2jsahqeolm8cz59bmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq2jsahqeolm8cz59bmc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because image and text data take different forms, the Query-Doc asymmetric two-tower model is used for text-to-image retrieval: the image-side tower is used for offline database construction, and the text-side tower is used for online queries. In online retrieval, the text-side model encodes the Query, and the result is searched against the database built with the image-side model; the CLIP pre-training, which draws image-text pairs closer together in vector space, guarantees the semantic correlation between images and texts.&lt;br&gt;
Here we need to export the text-side model for online MetaSpore Serving inference. Since the retrieval scene is Chinese, a CLIP model supporting Chinese understanding is selected. The exported content includes the ONNX model used for online inference and the Tokenizer, similar to text-to-text search, and MetaSpore Serving loads it for model inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Build the library for image search&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Download the Unsplash Lite library data and complete the construction according to the instructions. The whole database construction process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing: specify the image directory and generate a more general JSONLine file for library construction;&lt;/li&gt;
&lt;li&gt;Build the index: use the openai/clip-vit-base-patch32 pre-trained model to index the gallery, outputting one document object per line of index data;&lt;/li&gt;
&lt;li&gt;Push the inverted (vector) and forward (document field) data to each component server.
As with text search, after offline database construction the relevant data is pushed to the service components, which the online retrieval algorithm services call to obtain it.&lt;/li&gt;
&lt;/ol&gt;
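
&lt;p&gt;The push step above, which splits each indexed record between the vector store and the document store, can be sketched as follows. The field names and the single-record shape are illustrative, not the repository's actual push format:&lt;/p&gt;

```python
def split_record(record):
    """Split one indexed document into its inverted (vector) payload and
    its forward (document-field) payload for the respective stores."""
    # Vector payload: id plus embedding, destined for a vector index (e.g. Milvus).
    vector_payload = {"id": record["id"], "vector": record["vector"]}
    # Document payload: everything except the vector, destined for a
    # document store (e.g. MongoDB) that serves summaries at query time.
    doc_payload = {k: v for k, v in record.items() if k != "vector"}
    return vector_payload, doc_payload

record = {"id": "img-42", "url": "https://example.com/42.jpg", "vector": [0.1, 0.2, 0.3]}
vec, doc = split_record(record)
```

&lt;p&gt;Keeping the two payloads joined only by id is what lets the online service recall by vector first and then fetch display fields separately.&lt;/p&gt;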

&lt;p&gt;&lt;strong&gt;1.2 Online Services&lt;/strong&gt; &lt;br&gt;
The overall online service architecture diagram is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx5stlszomddqub469jj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx5stlszomddqub469jj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The multimodal search online service system supports both text-to-text and text-to-image search. The whole online service consists of the following parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query preprocessing service: encapsulates the preprocessing logic (for text, images, etc.) of the pre-trained models and provides it through a gRPC interface;&lt;/li&gt;
&lt;li&gt;Retrieval algorithm service: the whole algorithm processing chain, including A/B experiment traffic-splitting configuration, MetaSpore Serving calls, vector recall, ranking, document summaries, etc.;&lt;/li&gt;
&lt;li&gt;User entry service: provides a Web UI for users to debug and track down problems in the retrieval service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the perspective of a user request, these services form invocation dependencies from back to front, so to build a multimodal sample you need to bring up each service from front to back. Before doing this, remember to export the offline model, put it online, and build the library. This article introduces each part of the online service system; you can bring up the whole system step by step following the guidance below. See the README at the end of this article for more details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.2.1 Query preprocessing service&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Deep learning models tend to be based on tensors, but NLP/CV models often have a preprocessing part that translates raw text and images into tensors that deep learning models can accept. For example, NLP class models often have a pre-tokenizer to transform text data of string type into discrete tensor data. CV class models also have similar processing logic to complete the cropping, scaling, transformation, and other processing of input images through preprocessing. On the one hand, considering that this part of preprocessing logic is decoupled from tensor reasoning of the depth model, on the other hand, the reason of the depth model has an independent technical system based on ONNX, so MetaSpore disassembled this part of preprocessing logic.&lt;/p&gt;

&lt;p&gt;NLP pretreatment Tokenizer has been integrated into the Query pretreatment service. MetaSpore dismantlement with a relatively general convention. Users only need to provide preprocessing logic files to realize the loading and prediction interface and export the necessary data and configuration files loaded into the preprocessing service. Subsequent CV preprocessing logic will also be integrated in this manner.&lt;/p&gt;

&lt;p&gt;The preprocessing service currently exposes a gRPC interface and is depended on by the Query preprocessing (QP) module in the retrieval algorithm service. After a user request reaches the retrieval algorithm service, it is forwarded to the preprocessing service, which completes the data preprocessing before subsequent processing continues. The README provides details on how the preprocessing service is started, how a preprocessing model exported offline to cloud storage enters the service, and how to debug the service.&lt;/p&gt;

&lt;p&gt;To further improve the efficiency and stability of model inference, MetaSpore Serving implements a Python preprocessing submodule. MetaSpore can thus provide gRPC services through a user-specified preprocessor.py, complete tokenizer preprocessing in NLP or CV-related preprocessing, and translate requests into tensors the deep model can handle. Finally, model inference is carried out by MetaSpore Serving's subsequent submodules.&lt;/p&gt;

&lt;p&gt;The relevant code is presented here: &lt;a href="https://github.com/meta-soul/MetaSpore/compare/add_python_preprocessor" rel="noopener noreferrer"&gt;https://github.com/meta-soul/MetaSpore/compare/add_python_preprocessor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.2.2 Retrieval algorithm services&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The retrieval algorithm service is the core of the whole online service system. It is responsible for experiment triage, the assembly of algorithm chains such as preprocessing, recall, and ranking, and the invocation of dependent component services. The whole retrieval algorithm service is developed on the Java Spring framework and supports the multimodal retrieval scenarios of text-to-text search and text-to-image search. Thanks to good internal abstraction and modular design, it is highly flexible and can be migrated to similar application scenarios at low cost.&lt;br&gt;
Here's a quick guide to configuring the environment and setting up the retrieval algorithm service; see the README for more details:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install dependent components. Use Maven to install the online-Serving component.&lt;/li&gt;
&lt;li&gt;Set up the search service configuration. Copy the template configuration file and replace the MongoDB, Milvus, and other configurations according to the development/production environment.&lt;/li&gt;
&lt;li&gt;Install and configure Consul. Consul lets you synchronize the search service configuration in real-time, including experiment traffic splitting, recall parameters, and ranking parameters. The project's configuration file shows the current configuration parameters for text-to-text and text-to-image search. The modelName parameter in the preprocessing and recall stages refers to the corresponding model exported in offline processing.&lt;/li&gt;
&lt;li&gt;Start the service. Once the above configuration is complete, the retrieval service can be started from the entry script.
Once the service is started, you can test it! For example, for a user with userId=10 who wants to query "How to renew ID card," access the text search service.&lt;/li&gt;
&lt;/ol&gt;
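&lt;p&gt;As an illustration, such a test request might be assembled as below. The host, path, and parameter names are hypothetical, chosen only to sketch the shape of a query; the service's real interface is defined in its README.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical sketch of a text-search request URL; the host, path, and
# parameter names are illustrative, not the service's actual interface.
def build_search_url(user_id, query, host="http://localhost:8080"):
    params = {"userId": user_id, "query": query}
    return host + "/search/text?" + urlencode(params)

url = build_search_url(10, "How to renew ID card")
print(url)
```

&lt;p&gt;The assembled URL can then be issued with curl or any HTTP client against the running retrieval service.&lt;/p&gt;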

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.2.3 User Entry Service&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Because the retrieval algorithm service is exposed as an API, it is difficult to locate and trace problems with it directly; the text-to-image scenario in particular benefits from displaying retrieval results intuitively to support iterative optimization of the retrieval algorithm. This article therefore provides a lightweight Web UI for text search and image search: a search input box and a results display page. Developed with Flask, the service can easily be integrated with other retrieval applications; it calls the retrieval algorithm service and displays the returned results on the page.&lt;/p&gt;

&lt;p&gt;It's also easy to install and start the service. Once you're done, go to &lt;a href="http://127.0.0.1:8090" rel="noopener noreferrer"&gt;http://127.0.0.1:8090&lt;/a&gt; to see whether the search UI service is working correctly. See the README at the end of this article for details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multimodal system demonstration&lt;/strong&gt;&lt;br&gt;
The multimodal retrieval service can be started once offline processing and online service environment configuration have been completed following the instructions above. Examples of text-to-image searches are shown below.&lt;br&gt;
Open the entry page of the text-to-image search application and enter "cat" first; you can see that the top three returned results are all cats:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewry0wfyvegczgbmhxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewry0wfyvegczgbmhxd.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
If you add a color constraint to "cat" to retrieve "black cat," you can see that it does return a black cat:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dml37ozc76r7vysuiql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dml37ozc76r7vysuiql.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Going further, strengthen the constraint on the search term by changing it to "black cat on the bed," and the returned results contain pictures of a black cat climbing on a bed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbzc2sgbj0ulxgf7d7ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbzc2sgbj0ulxgf7d7ip.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Even after the color and scene modifications in the example above, the text-to-image search system can still find the right cat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Cutting-edge pre-training technology can bridge the semantic gap between different modalities, and the HuggingFace community greatly reduces the cost for developers to use pre-trained models. Combined with the MetaSpore technology ecosystem of online inference and online microservices provided by DMetaSoul, pre-trained models are no longer confined to offline experimentation. Instead, they can truly achieve end-to-end deployment from cutting-edge technology to industrial scenarios, fully releasing the dividends of large pre-trained models. In the future, &lt;a href="https://github.com/meta-soul" rel="noopener noreferrer"&gt;DMetaSoul&lt;/a&gt; will continue to improve and optimize the MetaSpore technology ecosystem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;More automated and broader access to the HuggingFace community ecosystem. MetaSpore will soon release a general model rollout mechanism to make the HuggingFace ecosystem readily accessible and will later integrate the preprocessing service into the online service.&lt;/li&gt;
&lt;li&gt;Offline algorithm optimization for multimodal retrieval. For multimodal retrieval scenarios, MetaSpore will continuously and iteratively optimize the offline algorithm components, including the text recall/ranking models and image-text recall/ranking models, to improve the accuracy and efficiency of the retrieval algorithm.
For related code and reference documentation in this article, please visit:
&lt;a href="http://github.com/meta-soul/MetaSpore/blob/main/demo/multimodal/online/README-CN.md" rel="noopener noreferrer"&gt;https://github.com/meta-soul/MetaSpore/tree/main/demo/multimodal/online&lt;/a&gt;
Some image sources:
&lt;a href="https://github.com/openai/CLIP/raw/main/CLIP.png" rel="noopener noreferrer"&gt;https://github.com/openai/CLIP/raw/main/CLIP.png&lt;/a&gt;
&lt;a href="https://www.sbert.net/examples/training/sts/README.html" rel="noopener noreferrer"&gt;https://www.sbert.net/examples/training/sts/README.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>algorithms</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build a real-time machine learning sample library using the best open-source project about big data and data lakehouse, LakeSoul</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Fri, 06 May 2022 14:17:08 +0000</pubDate>
      <link>https://forem.com/dmetasoul/build-a-real-time-machine-learning-sample-library-using-the-best-open-source-project-about-big-data-and-data-lakehouse-lakesoul-55f5</link>
      <guid>https://forem.com/dmetasoul/build-a-real-time-machine-learning-sample-library-using-the-best-open-source-project-about-big-data-and-data-lakehouse-lakesoul-55f5</guid>
      <description>&lt;p&gt;The previous article, "&lt;a href="https://dev.to/qazmkop/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2"&gt;&lt;em&gt;The design concept of the best open-source project about big data and data lakehouse&lt;/em&gt;&lt;/a&gt;" introduced the design concept and partial realization principle of &lt;a href="https://github.com/meta-soul/LakeSoul"&gt;&lt;strong&gt;LakeSoul&lt;/strong&gt;&lt;/a&gt;'s open-source and stream batch integrated surface storage framework. The original intention of the design of &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; is to solve various problems that are difficult to solve in traditional Hive data warehouse scenarios, including Upsert update, Merge on Read, and concurrent write. This article will demonstrate the core capabilities of &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; using a typical application scenario: building a real-time machine learning sample library.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Background of business requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1.1 Online recommendation system&lt;/strong&gt;&lt;br&gt;
In the Internet, finance, and other industries, many business scenarios, such as search, advertising, recommendation, and risk control, can be summarized as online personalized recommendation systems. For example, in e-commerce, customized "guess you like" recommendations based on the personalized recommendation system can improve users' click and purchase rates. In advertising, personalized recommendation is the core system for achieving accurate targeting and improving ROI. In financial risk control, it is necessary to predict users' repayment ability and likelihood of default in real-time and provide a personalized credit line and repayment cycle for each user.&lt;/p&gt;

&lt;p&gt;Recommendation systems have been widely used across industries. Building an industrial-grade online recommendation system requires connecting many links and systems and spending a large amount of time on development work. The MetaSpore framework developed by DMetaSoul provides a one-stop recommendation system development solution; please refer to my previous post, &lt;a href="https://dev.to/qazmkop/the-design-concept-of-an-almighty-opensource-project-about-machine-learning-platform-46p"&gt;"&lt;em&gt;The design concept of an almighty Opensource project about machine learning platform.&lt;/em&gt;"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article focuses on building a real-time sample library to realize the complete closed loop of "user feedback, model iteration," so the recommendation system can learn and iterate independently and quickly capture changes in user interests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2 What is the sample library of recommendation System machine learning?&lt;/strong&gt;&lt;br&gt;
In the recommendation system, the core part is the personalized ranking algorithm model. Model training starts with constructing samples, learning each user's preferences through various features and user behavior feedback labels. A sample library usually consists of several parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Feature&lt;/strong&gt;: includes users' basic attributes, historical behaviors, and recent real-time behaviors. The basic attributes may come from real-time online requests or from behavior labels mined by an offline DMP. Users' historical and real-time behavior generally includes events with feedback in the user's history and some related aggregate statistical indicators.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Item Feature&lt;/strong&gt;: the item is the object to be recommended to users, which can be commodities, news, advertisements, etc. Features generally include all kinds of item attributes, both discrete attributes and continuous statistical attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User feedback&lt;/strong&gt;: the label in the algorithm model. Labels are all kinds of user feedback behaviors, such as impression, click, conversion, etc. Algorithm models learn to model user preferences through the relationship between features and labels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1.3 Challenges of building the machine learning sample library&lt;/strong&gt;&lt;br&gt;
There are several problems and challenges when building the machine learning sample library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time requirements.&lt;/strong&gt; Model learning in mainstream industry recommendation systems has been moving toward online, real-time training. The more timely the model update, the faster it can capture changes in users' interests, providing more accurate recommendation results and better business effects; this requires the sample library to support high write throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-stream updates.&lt;/strong&gt; Many online features must flow back in real-time for further model training after the model's online ranking calculation. User feedback also needs to be fed back into the sample library, often via multiple live streams. In this case, multiple live streams simultaneously write to different columns of the sample library. Traditional Hive data warehouses generally cannot support real-time updates and must implement this through a full Join; however, operation efficiency is low when the Join window is large, and a large amount of data is redundant. Flink's window Join also suffers from enormous state data and high operation and maintenance costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel experiments.&lt;/strong&gt; In practical business development, algorithm engineers often need to run parallel experiments with multiple models and compare the results simultaneously. Different models may require different feature and label columns with different update patterns. For example, offline batch jobs generate some features, and this batch data also needs to be inserted into the feature database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature backtracking.&lt;/strong&gt; In algorithm business development, new features sometimes need to be added and models retrained over history, which requires batch updates of new features onto historical data. This is also challenging to implement efficiently in Hive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many challenges in constructing a real-time sample library for recommendation-system algorithm scenarios. The root cause is that Hive data warehouse functionality and performance are weak: scenarios such as stream-batch integration, incremental update, and concurrent write cannot be well supported. ByteDance and other companies have previously shared Hudi-based solutions for building recommendation system samples with stream-batch integration. &lt;em&gt;However, Hudi still has problems in actual use, such as with concurrent updates.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt;, the stream-batch integrated table storage framework developed and open-sourced by DMetaSoul, can solve these problems well. &lt;em&gt;The following sections detail how to use LakeSoul to build an industrial-grade recommendation system sample library.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2 Building a real-time machine learning sample library
&lt;/h2&gt;

&lt;p&gt;LakeSoul is a table storage framework designed for streaming batch scenarios with the following key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column level Update (Upsert)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support Merge on Read&lt;/strong&gt;, which merges data while reading to improve write throughput&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support object storage without file semantics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent write&lt;/strong&gt;, which can support multiple streams and batch jobs to update the same partition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed metadata management to improve the scalability of metadata&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt;, which allows you to add and delete columns of a table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall design of building the machine learning sample library with &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; is to use Upsert instead of Join: multiple groups of features and labels are written into the same table by streaming and batch jobs, respectively, achieving highly concurrent writes and high read/write throughput. The following sections explain the specific implementation process and fundamental principles in detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.1 Primary key design&lt;/strong&gt;&lt;br&gt;
To enable efficient merging, &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; provides the ability to set primary keys. The primary key columns of a table are divided evenly into a specified number of hash buckets, and within each bucket the data is written sorted by primary key. At read time, the incremental files, each ordered by primary key, are merged to obtain the final result.&lt;/p&gt;

&lt;p&gt;In the recommendation system sample library, all feature and label reflux data is tagged with the request ID generated during the online request, which is used as the Join key in offline Join scenarios. Therefore, you can use the request ID as the primary key of the &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; sample table and the hour as the Range partition. Create the LakeSoul table in a Spark job as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LakeSoulTable.createTable(data, path).shortTableName("sample").hashPartitions("request_id").hashBucketNum(100).rangePartitions("hour").create()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a table with &lt;em&gt;'request_id'&lt;/em&gt; as the primary key, 100 hash buckets, and the hour as the Range partition.&lt;/p&gt;
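&lt;p&gt;The effect of the hash bucket setting can be sketched with a small simulation. This is an illustrative stand-in, not LakeSoul's actual hashing code: any stable hash of the primary key modulo the bucket count routes all rows for one request ID to the same bucket.&lt;/p&gt;

```python
import hashlib

NUM_HASH_BUCKETS = 100

# Illustrative stand-in for primary-key hash bucketing (LakeSoul's actual
# hash function may differ): a stable digest of request_id modulo the
# configured bucket count.
def bucket_for(request_id: str) -> int:
    digest = hashlib.md5(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_HASH_BUCKETS

# The same request_id always routes to the same bucket, so feature and
# label rows for one request can later be merged without a global Join.
print(bucket_for("req-0001") == bucket_for("req-0001"))  # True
```

&lt;p&gt;Because the routing is deterministic, every stream that upserts rows for a given request ID writes into the same bucket, which is what makes the later ordered merge cheap.&lt;/p&gt;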

&lt;p&gt;&lt;strong&gt;2.2 Data writing and Concurrent update&lt;/strong&gt;&lt;br&gt;
Since the features and labels come from different streams and batches, we need jobs from multiple streams or batches to update the sample table concurrently. Each piece of data must have a &lt;em&gt;request_id&lt;/em&gt; column and an hour column. When executing &lt;em&gt;LakeSoulTable.Upsert&lt;/em&gt;, the &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; Spark writer automatically repartitions data into buckets based on &lt;em&gt;request_id&lt;/em&gt; and writes it into the corresponding partition bucket according to the hour column. A batch of written data can span multiple &lt;em&gt;Range&lt;/em&gt; partitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; supports multi-stream concurrent &lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul/wiki/03.-Usage-Doc#3-upsert-lakesoultable"&gt;Upsert&lt;/a&gt;&lt;/em&gt;, which can meet the needs of multi-stream real-time updates of the sample database. For example, there are two streams, namely feature reflux and label reflux data, which can be updated into the sample database in real-time by performing Upsert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Read feature reflow, update sample table
val featureStreamDF = spark.readStream...
val lakeSoulTable = LakeSoulTable.forPath(tablePath)
lakeSoulTable.upsert(featureStreamDF)
// Read label reflow, update sample sheet
val labelStreamDF = spark.readStream...
val lakeSoulTable = LakeSoulTable.forPath(tablePath)
lakeSoulTable.upsert(labelStreamDF)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because no &lt;em&gt;Merge&lt;/em&gt; operation is required at write time, only the current incremental data is written, so writes achieve high throughput. In actual tests, the write rate per core on cloud object storage exceeds 30 MB/s, so with 30 Spark executors, each allocated one CPU core, the overall write speed can reach about 1 GB/s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.3 Merge on Read&lt;/strong&gt;&lt;br&gt;
LakeSoul automatically merges &lt;em&gt;Upsert&lt;/em&gt; data when reading it, so the interface for reading a LakeSoul table is no different from reading an ordinary table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val lakeSoulTable = LakeSoulTable.forPath(path)
lakeSoulTable.toDF.select("*").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can also be queried using SQL Select statements. In the underlying implementation, because each hash bucket is already ordered by primary key, only an external merge of multiple ordered lists is required, as shown below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7-tkSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ugphg7jwfv0i31sg8rp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7-tkSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ugphg7jwfv0i31sg8rp.png" alt="Image description" width="880" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure shows both the sample and label streams performing multiple &lt;em&gt;Upsert&lt;/em&gt;s. When a read job runs, LakeSoul automatically finds the incrementally updated files based on the metadata service's update records and completes an ordered external merge. LakeSoul implements ordered merging of Parquet files and improves the performance of the multi-way ordered merge with an optimized min-heap design.&lt;/p&gt;
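&lt;p&gt;The ordered external merge can be illustrated with Python's heapq.merge over several primary-key-sorted incremental batches. This is a conceptual simulation of Merge on Read, not LakeSoul's Parquet merge implementation; later batches take precedence for the same key, mirroring Upsert semantics.&lt;/p&gt;

```python
import heapq

# Conceptual simulation of Merge on Read: each incremental file is already
# sorted by primary key, so reading is a k-way ordered merge (min-heap),
# with later writes overriding earlier ones for the same key.
def merge_on_read(sorted_batches):
    merged = {}
    # Tag each row with its batch's write order so the heap yields rows
    # for equal keys oldest-first; the newest version then wins below.
    tagged = [[(key, i, row) for key, row in batch]
              for i, batch in enumerate(sorted_batches)]
    for key, version, row in heapq.merge(*tagged):
        merged[key] = row  # the last (newest) version for a key wins
    return sorted(merged.items())

base   = [("r1", {"feat": 1}), ("r2", {"feat": 2})]
upsert = [("r2", {"feat": 20}), ("r3", {"feat": 3})]
print(merge_on_read([base, upsert]))
```

&lt;p&gt;Because every input batch is already sorted, the heap only ever compares the heads of the k batches, which is what keeps merge-on-read cheap even with many incremental files.&lt;/p&gt;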

&lt;p&gt;&lt;strong&gt;2.4 Data Backfill&lt;/strong&gt;&lt;br&gt;
Since &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; supports Upsert on any Range partition's data, there is no difference between backfilling and streaming writes: when the data to be inserted is ready, Spark performs an Upsert to update historical data. LakeSoul automatically recognizes schema changes and updates the table's metadata to implement schema evolution. LakeSoul provides complete data warehouse table storage functionality, and every historical partition can be queried and updated. Compared with Flink's window Join scheme, it avoids the problem of invisible intermediate states and can quickly realize mass updates and backtracking of historical data.&lt;/p&gt;
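&lt;p&gt;The interplay of Upsert-based backfill and schema evolution can be sketched with a small in-memory simulation. This illustrates the semantics only and is not LakeSoul code: a batch carrying a brand-new feature column is upserted into historical rows keyed by request ID, and the table's schema simply grows.&lt;/p&gt;

```python
# In-memory simulation of column-level Upsert and schema evolution:
# rows are keyed by the primary key; an upsert merges columns per key,
# and new columns extend the schema instead of requiring a full rewrite.
def upsert(table, batch):
    for key, cols in batch:
        table.setdefault(key, {}).update(cols)
    return table

table = {}
# Streaming writes: feature reflux and label reflux for the same requests.
upsert(table, [("r1", {"feat_a": 0.5}), ("r2", {"feat_a": 0.7})])
upsert(table, [("r1", {"label": 1}), ("r2", {"label": 0})])
# Backfill: a batch job adds a brand-new feature column to historical data.
upsert(table, [("r1", {"feat_new": 3.2}), ("r2", {"feat_new": 1.1})])
print(table["r1"])
```

&lt;p&gt;Each stream or batch only writes the columns it owns, and the per-key merge assembles the full sample row, which is exactly the Upsert-instead-of-Join design described above.&lt;/p&gt;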

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article introduced the application of &lt;strong&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/strong&gt; in a typical stream-batch integration scenario: building the sample library for recommendation system machine learning. LakeSoul's stream-batch integration and Merge-on-Read capabilities can support large-scale, large-window, multi-stream real-time updates, solving problems that exist with massive Joins in Hive warehouses and with Flink's window Join.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>bigdata</category>
      <category>database</category>
    </item>
    <item>
      <title>The design concept of an almighty Opensource project about machine learning platform</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Sat, 30 Apr 2022 16:23:43 +0000</pubDate>
      <link>https://forem.com/dmetasoul/the-design-concept-of-an-almighty-opensource-project-about-machine-learning-platform-46p</link>
      <guid>https://forem.com/dmetasoul/the-design-concept-of-an-almighty-opensource-project-about-machine-learning-platform-46p</guid>
      <description>&lt;p&gt;In my previous article, "&lt;strong&gt;&lt;a href="https://dev.to/qazmkop/almighty-opensource-project-about-machine-learning-you-should-try-out-1i5f"&gt;Almighty Opensource project about machine learning you should try out&lt;/a&gt;&lt;/strong&gt;", I introduced MetaSpore. In this article, I will introduce &lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt;'s design philosophy in detail. Next week, I will update the detailed steps of using MetaSpore to build industrial recommendation systems rapidly.&lt;br&gt;
How did &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;&lt;strong&gt;MetaSpore&lt;/strong&gt;&lt;/a&gt; move its complex recommendation algorithms beyond the Internet giants to reach the vast majority of SMEs and developers? This article unveils &lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt;'s core design philosophy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1. What is a one-stop machine learning platform&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
When it comes to machine learning, people tend to think of various machine learning frameworks, such as TensorFlow and PyTorch, and the many models implemented as Python code on GitHub. However, there are still many difficulties in implementing an algorithm model in specific business scenarios. Taking the recommendation system as an example, the following problems will be encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to generate training data (samples)?&lt;/strong&gt; In recommendation scenarios, it is often necessary to join user feedback signals (impressions, clicks, etc.) with various feature sources, perform the necessary feature cleaning and extraction, and carry out complex data processing such as splitting off validation sets and negative sampling. These steps usually cannot be done inside machine learning frameworks, so a big data system is needed to process batch or streaming data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to transfer training data generated by big data platforms to a deep learning framework?&lt;/strong&gt; Frameworks such as TensorFlow and PyTorch have their own data input formats and corresponding DataLoader interfaces, requiring developers to parse the data. Problems such as data sharding and variable-length features often puzzle algorithm engineers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to run distributed training that can freely schedule cluster resources, including GPUs?&lt;/strong&gt; This may require a dedicated operations team to manage machine learning-related hardware scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to train the sparse feature model and use the NLP pre-training model?&lt;/strong&gt; In the recommendation scenario, we need to be able to handle sparse features on a large scale to model the interesting relationship between users and goods. At the same time, multimodal model fusion has gradually become the frontier direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to make online predictions efficiently after model training?&lt;/strong&gt; In addition to cluster resources, elastic scheduling and load balancing are also involved; these problems require dynamic resource allocation in a heterogeneous environment with CPUs, GPUs, and NPUs. For complex models, distillation, quantization, and other means are necessary to ensure predictive performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After the model goes online, how does the online system extract and join features, ensure online/offline feature consistency, and evaluate the algorithm's effect quantitatively?&lt;/strong&gt; An online algorithm application framework is needed that can integrate with the online system, read all kinds of online data sources, and provide a multi-layer ABTest traffic experiment function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finally, how to iterate efficiently in algorithm experiments?&lt;/strong&gt; Engineers want to run multiple parallel experiments quickly to improve business performance rather than get bogged down in complex system environment configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In large Internet companies, these problems are often solved by multiple teams and different systems. For example, the following image is from the overall architecture of the recommendation system shared by Netflix's algorithm engineering team:&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz3oy77tjzu5gpdfq4mx.png" alt="Image description"&gt;&lt;br&gt;
As the figure shows, building a complete, industrial-grade recommendation system is quite complex and tedious, requiring considerable knowledge in different fields and significant engineering investment. SMEs lack both the staffing and a one-stop platform to solve these problems in a standardized way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2. &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s one-stop machine learning platform&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
DMetaSoul's original intention in developing and open-sourcing MetaSpore is to help enterprises and developers solve the various problems encountered in algorithm business development, providing a one-stop development experience through standardized components and development interfaces and meeting enterprises' and developers' need for algorithm-business best practices. Specifically, MetaSpore embodies the following core design concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model training is seamlessly integrated with the big data system and can directly read structured and unstructured data from all kinds of data lakes and warehouses for training.&lt;/strong&gt; Data feature preprocessing and model training are linked together seamlessly, avoiding tedious data import/export and format conversion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for sparse features.&lt;/strong&gt; Large-scale sparse Embedding layer training is necessary for search and generalization scenarios. Some processing of sparse features is involved, such as cross combination, variable-length feature pooling, etc., which requires special support from the training framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide high-performance online forecasting services.&lt;/strong&gt; Online prediction services support neural networks (including sparse Embedding), decision trees, and a variety of traditional machine learning models. Supports heterogeneous hardware computing acceleration, reducing the engineering threshold for online deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified offline feature calculation.&lt;/strong&gt; Through the unified feature format and calculation logic, the unified offline feature calculation saves the repeated development of multiple systems and ensures the consistency of offline features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online algorithm application framework.&lt;/strong&gt; The online algorithm application framework covers the common function points of online systems, such as automatic feature extraction from multiple data sources, feature calculation, the predictive service interface, dynamic experiment configuration, and dynamic ABTest traffic splitting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace open source.&lt;/strong&gt; The MetaSpore platform provides several homegrown components to implement these core function points. At the same time, MetaSpore's development philosophy is to embrace the mature open source ecosystem as much as possible without fragmentation, lowering the barriers to learning and enabling developers to develop based on their existing experience quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MetaSpore integrates data, algorithms, and online systems to provide a one-stop, full-process development experience for algorithms through these core functional points and design concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.1 &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s integration with the big data ecosystem&lt;/strong&gt;&lt;br&gt;
A large class of machine learning algorithms deals with structured tabular data and conducts prediction, classification, regression, and modeling; CTR estimation for recommended advertising and financial risk control are common examples. In this kind of algorithm practice, data preprocessing is critical, and data quality directly determines the final model effect. Common processing steps include feature splicing, null value filling, discretization into buckets, outlier filtering, feature engineering, and sample generation such as validation-set splitting and random shuffling. In engineering practice these steps usually rely on big data systems, while deep learning frameworks lack comprehensive big data processing functions.&lt;br&gt;
Connecting data and training therefore raises many problems. Big data systems use multiple storage formats and systems, which is difficult for deep learning frameworks to support one by one. Traditionally, algorithm engineers must handle format conversion themselves, converting big data formats into formats the deep learning framework can recognize. This is not only cumbersome and a source of data redundancy, but also makes incremental streaming learning challenging to implement.&lt;br&gt;
MetaSpore provides an offline training framework using the Parameter Server architecture for such scenarios. The framework integrates seamlessly with PySpark, allowing a Spark DataFrame to be fed directly into model training with no format conversion required. Spark supports multiple data sources (including CSV, Parquet, and ORC), various data stores and data lakes such as Hive, and streaming data such as Kafka. By connecting to PySpark, MetaSpore can train directly on any data source Spark supports, greatly simplifying the data processing pipeline.&lt;br&gt;
In terms of implementation, MetaSpore takes advantage of the PandasUDF functionality provided by PySpark. PandasUDF is PySpark's vectorized Python UDF interface, which uses Arrow as an efficient data transfer protocol from the JVM to Python. The worker end of the MetaSpore training framework is a PandasUDF that receives a batch of data and feeds it to the model training module. Training results, such as predicted values, are sent back to Spark in Arrow format to form a new DataFrame. Take a look at a code example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Defining Neural Networks
module = metaspore.algos.MLP()
# Create a PyTorch Distributed Training Estimator
estimator = metaspore.PyTorchEstimator(module)
# Read training and validation sets via Spark
train_df = spark_session.read.csv('hdfs://movielens/train/')
test_df = spark_session.read.csv('hdfs://movielens/test/')
# Model training on the training set
model = estimator.fit(train_df)
# Use the trained model to make predictions on the test set
test_prediction_result = model.transform(test_df)
# The predicted result is Spark DataFrame，Operations that can be invoked on any DataFrame
evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator()
auc = evaluator.evaluate(test_prediction_result)
print(auc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that MetaSpore integrates seamlessly with Spark. The read.csv call in the middle can be replaced by any data source Spark supports, and the model can be trained after arbitrarily complex data processing in Spark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.2 &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s support for sparse discrete feature training&lt;/strong&gt;&lt;br&gt;
In search, advertising, recommendation, and other personalization scenarios, a large class of features is sparse and discrete. Such a feature is usually converted to a fixed ID via one-hot encoding and mapped to a vector that serves as a learnable parameter. Two problems arise in this scenario: first, the feature ID space is very large, often in the hundreds of millions; second, the ID space is not fixed, as the value set of some discrete features changes constantly.&lt;br&gt;
For the first problem, MetaSpore provides a Parameter Server architecture that shards the Embedding Table. For the second, MetaSpore provides a complete dynamic sparse feature learning scheme, storing the Embedding Table in a hash table so that Embedding Vectors can be added and removed. Hashing of raw discrete feature values, multi-value combination, and multi-value pooling are integrated and defined as a PyTorch Module for the recommendation scenario. A sparse MLP model can be defined with MetaSpore as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class MLPLayer(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.sparse = metaspore.EmbeddingSumConcat(deep_embedding_dim,
                                           deep_column_name_path,
                                           deep_combine_schema_path)
        dense_layers=[]
        dnn_linear_num=len(hidden_lay_unit)
        dense_layers.append(metaspore.nn.Normalization(feature_dim*embedding_size))
        dense_layers.append(torch.nn.Linear(feature_dim*embedding_size, hidden_lay_unit[0]))
        dense_layers.append(torch.nn.ReLU())
        for i in range(dnn_linear_num - 2):
            dense_layers.append(torch.nn.Linear(hidden_lay_unit[i], hidden_lay_unit[i + 1]))
            dense_layers.append(torch.nn.ReLU())
        dense_layers.append(torch.nn.Linear(hidden_lay_unit[-2], hidden_lay_unit[-1]))
        self.dnn = torch.nn.Sequential(*dense_layers)

    def forward(self, input_raw_strings):
        sparse = self.sparse(input_raw_strings)
        return self.dnn(sparse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, the EmbeddingSumConcat layer accepts raw string input and automatically performs hashing, combination, and pooling (Sum by default). The input_raw_strings argument of the forward method is a batch of tabular data read by PySpark (a Pandas DataFrame object). Compared with TorchRec and other recommendation-oriented deep learning frameworks, the API surface is simpler and the call logic more direct and easier to understand.&lt;/p&gt;
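The dynamic sparse feature scheme described above can be illustrated with a small pure-Python sketch; the hashing scheme, dimension, and feature values here are hypothetical stand-ins, not MetaSpore's actual implementation. Feature values are hashed to IDs, embedding vectors live in a hash table so the ID space can grow freely, and multi-value features are sum-pooled:

```python
import hashlib

EMBEDDING_DIM = 4  # hypothetical embedding width

def feature_hash(raw_value):
    # Map an arbitrary string feature value to a stable 64-bit ID
    return int.from_bytes(hashlib.md5(raw_value.encode()).digest()[:8], "little")

class DynamicEmbeddingTable:
    """Hash-table-backed embedding store: a vector is created on first
    access, so the discrete ID space need not be fixed in advance."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}

    def lookup(self, feature_id):
        return self.table.setdefault(feature_id, [0.0] * self.dim)

def sum_pool(table, raw_values):
    # Multi-value pooling: element-wise sum over all values of one feature
    pooled = [0.0] * table.dim
    for v in raw_values:
        vec = table.lookup(feature_hash(v))
        pooled = [p + x for p, x in zip(pooled, vec)]
    return pooled

table = DynamicEmbeddingTable(EMBEDDING_DIM)
pooled = sum_pool(table, ["action", "comedy"])  # a two-valued "genre" feature
```

Deleting stale IDs from the hash table is what allows the Embedding Vectors to be removed as well as added, which a fixed-size one-hot table cannot do.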

&lt;p&gt;&lt;strong&gt;2.3 &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s model Serving service&lt;/strong&gt;&lt;br&gt;
The MetaSpore platform integrates an online model Serving (prediction) service. Unlike TFServing, MetaSpore Serving has the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Support for a wider range of models.&lt;/strong&gt; In addition to sparse NN models generated by the offline training framework, MetaSpore Serving supports NLP, CV, and other NN models; decision tree models such as XGBoost and LightGBM; and models from machine learning libraries such as Spark ML and SKLearn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated feature processing logic.&lt;/strong&gt; Feature calculation is an integral part of model Serving, such as sparse discrete feature processing or the Tokenizer in NLP. Traditionally, this logic is implemented twice, once offline and once online, which is inefficient and prone to logical inconsistencies and other errors. MetaSpore Serving integrates the computational logic for sparse features so that one set of code is shared between offline and online.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supports heterogeneous hardware acceleration, such as CPU, GPU, and NPU.&lt;/strong&gt; Model prediction usually requires selecting different hardware to perform calculations in different scenarios. MetaSpore Serving is capable of supporting common hardware for heterogeneous computing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MetaSpore Serving uses OnnxRuntime as its computational library and treats feature processing as part of the prediction logic. The prediction of each model, including several feature table computations and the final Onnx model computation, forms a DAG. For example, in a typical Wide&amp;amp;Deep model, the computational logic at Serving time can be roughly expressed as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17auvt3jyeqwshvniak1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17auvt3jyeqwshvniak1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this way, MetaSpore Serving can easily integrate various feature extraction and calculation modules, maximizing alignment with offline logic and simplifying the development experience.&lt;/p&gt;
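The serving-side DAG idea can be sketched as follows; the node names are hypothetical stand-ins for the feature tables and final Onnx node of a Wide and Deep graph, and the point is simply that each node runs only after its inputs:

```python
# Toy dependency graph: each node lists the nodes it consumes.
# Node names are hypothetical, chosen to mirror the description above.
graph = {
    "sparse_features": [],
    "dense_features": [],
    "wide_part": ["sparse_features"],
    "deep_part": ["sparse_features", "dense_features"],
    "onnx_model": ["wide_part", "deep_part"],
}

def topo_order(g):
    """Return an execution order in which every node follows its inputs."""
    seen, order = set(), []
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in g[node]:
            visit(dep)
        order.append(node)
    for node in g:
        visit(node)
    return order

order = topo_order(graph)
```

Running the feature tables and model in such an order is what lets a serving engine compute each feature exactly once and feed the results into the final model node.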

&lt;p&gt;&lt;strong&gt;2.4 &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s unified offline and online feature computation&lt;/strong&gt;&lt;br&gt;
For table-class models (sparse or dense), MetaSpore takes an Arrow Table as input offline and expresses the feature calculation logic as an Arrow Compute execution plan, with each feature evaluated by an expression map. For discrete features, hashing and multi-value combinations are supported; for dense features, normalization, binarization, and bucketing are supported. Much of this calculation can be done directly with Arrow's built-in expressions; for computations Arrow lacks, you can register your own expressions with a custom Compute Kernel.&lt;/p&gt;

&lt;p&gt;In terms of implementation, the feature-computation expressions used offline are serialized and saved in the model's export directory. After the online prediction service loads them, it constructs the same Arrow Compute expressions, so exactly the same feature calculation is shared between offline and online. In addition, because Arrow Compute supports filtering, joins, and other operations, MetaSpore Serving can also apply them to input batches. Thanks to Arrow's columnar expression design, these calculations are vectorized and deliver relatively high performance.&lt;/p&gt;

&lt;p&gt;Both offline feature calculation and online input use the unified Arrow Table format. Arrow itself provides multi-language interfaces for creating Arrow tables, so MetaSpore Serving easily supports calls from multiple languages. For example, for an XGBoost model with ten float-type feature inputs, the following Python program calls the MetaSpore prediction service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import grpc
import metaspore_pb2
import metaspore_pb2_grpc
import pyarrow as pa
with grpc.insecure_channel('0.0.0.0:50051') as channel:
    stub = metaspore_pb2_grpc.PredictStub(channel)
    # Construct a one-line, 10-feature sample
    row = []
    values = [0.6558618,0.13005558,0.03510657,0.23048967,0.63329154,0.43201634,0.5795548,0.5384891,0.9612295,0.39274803]
    for i in range(10):
        row.append(pa.array([values[i]], type=pa.float32()))
    # Create Arrow RecordBatch
    rb = pa.RecordBatch.from_arrays(row, [f'field_{i}' for i in range(10)])
    # Serialize the RecordBatch
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, rb.schema) as writer:
        writer.write_batch(rb)

    # Construct the gRPC request for MetaSpore Serving
    payload_map = {"input": sink.getvalue().to_pybytes()}
    request = metaspore_pb2.PredictRequest(model_name="xgboost_model", payload=payload_map)
    # Call Serving service prediction
    reply = stub.Predict(request)
    for name in reply.payload:
        with pa.BufferReader(reply.payload[name]) as reader:
            tensor = pa.ipc.read_tensor(reader)
            print(f'Predict Output Tensor: {tensor.to_numpy()}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the example shows, online prediction also takes an Arrow Table of input features and can invoke MetaSpore Serving directly, with an invocation method that is consistent between offline and online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.5 &lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;'s online algorithm application framework&lt;/strong&gt;&lt;br&gt;
With the offline training framework and online Serving service in place, one final step remains before an algorithm lands in a business scenario: the online algorithm experiment. To verify the validity of an algorithm model in a service scenario, a baseline must be established and compared with the new model. This requires an online experimental framework that can easily define algorithm experiments, read online features, and call model prediction services, and that can split traffic across multiple experiments for ABTest comparison. A configuration center is also needed for fast experimental iteration: it should dynamically load and refresh experiment and traffic-split configurations, support hot reloading of experimental parameters, and provide various debugging and tracing functions. This link directly determines whether an AI model can finally be put into practical business applications.&lt;/p&gt;

&lt;p&gt;MetaSpore provides an online algorithm application framework that covers the entire flow of online experiments. This framework is based on SpringBoot + Spring Cloud and provides the following core functions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Experiment Pipeline.&lt;/strong&gt; Developers can add experimental classes with simple annotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@ExperimentAnnotation(name = "rank.widedeep")
@Component
public class WideDeepRankExperiment implements BaseExperiment&amp;lt;RecommendResult, RecommendResult&amp;gt;
{
    @Override
    public void initialize(Map&amp;lt;String, Object&amp;gt; map) {}

    @SneakyThrows
    @Override
    public RecommendResult run(Context context, RecommendResult recommendResult)
{}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Spring Cloud configuration center is then used to dynamically assemble multiple experiment flows, including traffic splitting for horizontal experiments and the ordering of vertical experiment layers. Taking a recommendation system as an example, with two experiments in the recall layer and two in the ranking layer, and the two layers orthogonal, the online experiment process can be assembled through the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scene-config:
  scenes:
    - name: guess-you-like
      layers:
        - name: match
          normalLayerArgs:
            - experimentName: match.base
              ratio: 0.3
            - experimentName: match.multiple
              ratio: 0.7
        - name: rank
          normalLayerArgs:
            - experimentName: rank.wideDeep
              ratio: 0.5
            - experimentName: rank.lightGBM
              ratio: 0.5
  experiments:
    - layerName: match
      experimentName: match.base
      extraExperimentArgs:
        modelName: match.base
        matcherNames: [ItemCfMatcher]
    - layerName: match
      experimentName: match.multiple
      extraExperimentArgs:
        modelName: match.multiple
        matcherNames: [ItemCfMatcher, SwingMatcher, TwoTowersMatcher]
    - layerName: rank
      experimentName: rank.wideDeep
      extraExperimentArgs:
        modelName: movie_lens_wdl
        ranker: WideAndDeepRanker
        maxReservation: 100
    - layerName: rank
      experimentName: rank.lightGBM
      extraExperimentArgs:
        modelName: lightgbm_test_model
        ranker: LightGBMRanker
        maxReservation: 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the configuration and experimental class implementation are added to the SpringBoot project, the framework will automatically create and initialize the objects of the experimental class and execute them by order of the experimental layer, and automatically select an experiment to execute in each layer according to the tangential flow configuration. The developer can freely define the input and output between experiments, providing a high degree of flexibility. At the same time, this configuration file is placed in the configuration center, supporting dynamic hot update, modification of experimental configuration, and cutting flow configuration, which can take effect in real-time.&lt;br&gt;
**2.Online feature extraction framework. **In online prediction, we need to construct an Arrow Table of features. However, online features can come from multiple sources, often with various databases, caches, and upstream services. MetaSpore, based on SpringBoot JPA, encapsulates a feature extraction code generation framework that automatically generates database access code only after the developer defines the schema and data source of the feature.&lt;br&gt;
**3.MetaSpore provides a Client implementation of online Serving, **supporting SpringBoot and providing annotation injection to access Serving.&lt;/p&gt;
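As an illustration of how ratio-based traffic splitting like the normalLayerArgs configuration above can work, here is a sketch of deterministically bucketing a user into one experiment per layer; the hashing scheme and helper function are hypothetical, not the framework's actual code:

```python
import hashlib

def pick_experiment(user_id, layer_args):
    """Assign a user to one experiment in a layer, proportional to the
    configured ratios. The assignment is deterministic: the same user
    always lands in the same experiment for this layer."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 10000.0  # stable bucket in [0, 1)
    cumulative = 0.0
    for name, ratio in layer_args:
        cumulative += ratio
        if bucket < cumulative:
            return name
    return layer_args[-1][0]  # guard against ratios not summing to 1.0

# Mirrors the match layer above: 30% baseline, 70% multi-channel recall
match_layer = [("match.base", 0.3), ("match.multiple", 0.7)]
assignment = pick_experiment("user_42", match_layer)
```

Because each layer hashes independently (in practice with a per-layer salt), horizontal layers stay orthogonal, which is the property the configuration above relies on.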

&lt;p&gt;&lt;strong&gt;2.6 MetaSpore embraces the open-source ecosystem&lt;/strong&gt;&lt;br&gt;
MetaSpore is itself an open-source project and builds on mature open-source ecosystems. For offline training, MetaSpore's framework bridges PySpark and PyTorch, the most popular frameworks in big data and deep learning, seamlessly combining their ecosystems to better address the pain points of real business scenarios. The online service is likewise fully integrated with gRPC, SpringBoot, and Spring Cloud, allowing developers to obtain best practices for landing big data intelligent applications within a standard, familiar technology stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt;'s philosophy is to embrace the open-source ecosystem fully. For example, for offline NLP large model training, developers who have the need can still use the existing open-source framework, such as OneFlow, to train the model; MetaSpore can continue to provide the ability to incorporate pre-training of the large model into multimodal model learning, as well as online Serving of the NLP model, making the NLP large model more inclusive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Conclusion&lt;/strong&gt;&lt;br&gt;
This article detailed the design philosophy and core features of &lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt;'s one-stop machine learning platform, to help readers better understand MetaSpore and how it is implemented in business scenarios.&lt;/p&gt;

&lt;p&gt;Next week, I will update the detailed steps of using &lt;strong&gt;&lt;a href="https://github.com/meta-soul/MetaSpore" rel="noopener noreferrer"&gt;MetaSpore&lt;/a&gt;&lt;/strong&gt; to build industrial recommendation systems rapidly.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>deeplearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Design concept of a best opensource project about big data and data lakehouse</title>
      <dc:creator>DMetaSoul</dc:creator>
      <pubDate>Sat, 16 Apr 2022 15:54:49 +0000</pubDate>
      <link>https://forem.com/dmetasoul/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2</link>
      <guid>https://forem.com/dmetasoul/design-concept-of-a-best-opensource-project-about-big-data-and-data-lakehouse-24o2</guid>
      <description>&lt;p&gt;Since the birth of Hadoop, the open-source ecosystem of big data systems has gone through nearly 15 years. In the past 15 years, various computing and storage frameworks have emerged in big data, but they still haven't reached a convergence state. Under the general trend of cloud-native, stream and batch integration, lake and lakehouse integration, there are still many problems to be solved in big data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;&lt;strong&gt;&lt;em&gt;LakeSoul&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; is a streaming batch integrated table storage framework developed by DMetaSoul, which has made a lot of design optimization around the new trend of big data architecture systems. This paper explains the core concept and design principle of &lt;a href="https://github.com/meta-soul/LakeSoul"&gt;&lt;strong&gt;&lt;em&gt;LakeSoul&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;, the Open-source Project, in detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.Evolution trend of big data system architecture&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In recent years, several new computing and storage frameworks have emerged in the field of big data. General-purpose computing engines represented by Spark and Flink appeared, along with OLAP systems represented by ClickHouse. Object storage has become the new storage standard and an essential base for integrating the data lake and lakehouse. At the same time, local cache acceleration layers such as Alluxio and JuiceFS have emerged. It is not hard to see several key evolution directions in the field of big data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1. Cloud-native.&lt;/em&gt;&lt;/strong&gt; Public and private clouds provide computing and storage hardware abstraction, abstracting the traditional IaaS management operation and maintenance. An important feature of cloud-native is that both computing and storage provide elastic capabilities. Making good use of elastic capabilities and reducing costs while improving resource utilization is an issue that both computing and storage frameworks need to consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2. Real-time.&lt;/em&gt;&lt;/strong&gt; The traditional Hive offline data warehouse provides T+1 data processing and cannot meet new service requirements, while the traditional Lambda architecture introduces complexity and data inconsistencies that also fail to meet business requirements. How to build an efficient real-time data warehouse system, realizing real-time or quasi-real-time write, update, and analysis on low-cost cloud storage, is a new challenge for computing and storage frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3. Diversified computing engines.&lt;/em&gt;&lt;/strong&gt; Big data computing engines are blooming, and while MapReduce is dying out, Spark, Flink, and various OLAP frameworks are still thriving. Each framework has its design focus, some deep in vertical scenarios, others with converging features, and the selection of big data frameworks are becoming more and more diverse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;4. Data Lakehouse.&lt;/em&gt;&lt;/strong&gt; Wikipedia does not give a specific definition of Data Lakehouse. In LakeSoul's view, the Data Lakehouse combines the advantages of both the data lake and the data warehouse: on top of low-cost cloud storage in an open format, it realizes the data structure and data management functions of a data warehouse. It includes the following features: concurrent data reads and writes, architectural support with a data governance mechanism, direct access to source data, separation of storage and computing resources, open storage formats, support for structured and semi-structured data (audio and video), and end-to-end streaming.&lt;/p&gt;

&lt;p&gt;From the perspective of technology maturity development, the Data Lake is in a period of steady rise and recovery, while the Data Lakehouse is still in the period of expectation expansion, and the technology has not been completely converged. There are still many problems in specific business scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--87s0B5M_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6l0qgtvowsdvgg2knppi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--87s0B5M_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6l0qgtvowsdvgg2knppi.png" alt="Image description" width="880" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Considering the latest evolution direction of big data, how to connect various engines and storage in the cloud-native architecture system and adapt to the requirements of the rapidly changing data intelligence services of the upper layer is a problem that needs to be solved in the current big data platform architecture. In order to decode the above issues, first of all, it is necessary to have a set of perfect storage frameworks which can provide high data concurrency, high throughput reading, and writing ability and complete data warehouse management ability on the cloud, and expose such storage ability to multiple computing engines in a common way. That's why LakeSoul was developed and made open-source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2.Detailed explanation of LakeSoul's design concept&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul/wiki"&gt;LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; is a unified streaming and batch table storage framework. LakeSoul has the following core features in design:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1.Efficient and scalable metadata management.&lt;/em&gt;&lt;/strong&gt; With the rapid growth of data volumes, data warehouses need to be able to handle the rapid increase in partitions and files. LakeSoul dramatically improves metadata scalability by using a distributed, decentralized database to store Catalog information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2. Supports concurrent writing.&lt;/em&gt;&lt;/strong&gt; LakeSoul implements concurrency control through its metadata services, supports concurrent updates by multiple jobs in the same partition, and controls merge or rollback mechanisms by intelligently distinguishing write types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3. Supports incremental writing and Upsert.&lt;/em&gt;&lt;/strong&gt; LakeSoul provides incremental Append and row-level Upsert capabilities and supports Merge on Read mode to improve data ingestion flexibility and performance. LakeSoul implements Merge on Read for many update types, giving high write performance because data does not need to be read and merged while writing, while a highly optimized Merge Reader ensures read performance.&lt;/p&gt;
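The Merge on Read idea, writing deltas cheaply and merging by primary key only at read time, can be sketched in a few lines; the row schema and the "id" key here are hypothetical:

```python
def merge_on_read(base_rows, upsert_batches):
    """Toy merge-on-read: later upserts win by primary key, and columns
    absent from an upsert keep their previous values."""
    merged = {row["id"]: dict(row) for row in base_rows}
    for batch in upsert_batches:
        for row in batch:
            merged.setdefault(row["id"], {}).update(row)
    return sorted(merged.values(), key=lambda row: row["id"])

base = [{"id": 1, "name": "a", "score": 10}, {"id": 2, "name": "b", "score": 20}]
deltas = [
    [{"id": 2, "score": 25}],               # row-level update
    [{"id": 3, "name": "c", "score": 30}],  # incremental append
]
result = merge_on_read(base, deltas)
```

The write path only appends the delta batches; all reconciliation cost is deferred to the reader, which is why write throughput stays high.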

&lt;p&gt;&lt;strong&gt;&lt;em&gt;4. Real-time Data Warehouses function.&lt;/em&gt;&lt;/strong&gt; LakeSoul supports streaming and batching writes, concurrent updates at the row or column level, framework-independent CDC, SCD, multi-version backtracking, and other common data warehouse features. Combined with the ability of streaming and batch integration, it can support the common real-time data warehouse construction requirements.&lt;br&gt;
The overall architecture of LakeSoul is as follows:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LfWaLSwZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lnqtr79w9o1pfep2aowi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LfWaLSwZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lnqtr79w9o1pfep2aowi.png" alt="Image description" width="880" height="406"&gt;&lt;/a&gt;&lt;br&gt;
Next, Let's learn more details about the above design points and implementation mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2.1 Highly extensible Catalog metadata service&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
LakeSoul supports multi-level partition management, with multiple range partitions and one hash partition at the table partition level. In a real business scenario, a large data warehouse submits a large amount of partition information to the metadata layer over long periods of updates, and for real-time or quasi-real-time update scenarios, commits become even more frequent. Metadata inflation is therefore common, resulting in inefficient metadata access. Metadata performance greatly impacts query performance because partition information and other basic information about data distribution must be accessed during data queries. Therefore, a high-performance, highly extensible metadata service is important for a data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; uses the Cassandra distributed database as its metadata store to improve metadata performance and scalability. Cassandra is a decentralized distributed NoSQL database that provides rich data modeling and high read/write throughput, and it scales out easily. Using Cassandra as the metadata store also offers tunable levels of availability and consistency, enabling straightforward concurrency control and ACID semantics for metadata operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; organizes the primary keys and indexes of its metadata tables so that a single primary-key lookup on a leaf-level partition returns all the information for that partition, including the snapshot of the current readable version. A partition snapshot contains the full file paths and commit types for full writes and incremental updates, so the read plan for a partition can be built by sequentially traversing the file commits in the snapshot. This makes partition-information access efficient and, equally important, avoids listing files and directories, a required optimization for object storage systems such as S3 and OSS. This is the core of LakeSoul's partition management mechanism.&lt;/p&gt;
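The single-lookup read-plan construction described above can be sketched in Python. This is a hedged illustration under assumed names (`MetaStore`, `FileCommit`, commit types `"full"`/`"incremental"`), not LakeSoul's actual metadata schema:

```python
from dataclasses import dataclass

@dataclass
class FileCommit:
    path: str
    commit_type: str  # "full" (overwrite) or "incremental" (append/upsert)

@dataclass
class PartitionSnapshot:
    version: int
    commits: list  # ordered list of FileCommit, oldest first

class MetaStore:
    """Toy model: one primary-key row per leaf partition, as in a
    Cassandra-style partition table keyed by (table, range, bucket)."""
    def __init__(self):
        self._rows = {}

    def put(self, key, snapshot):
        self._rows[key] = snapshot

    def read_plan(self, key):
        """Build a read plan with one key lookup: scan commits in order,
        and drop everything before the latest full (overwrite) commit."""
        snap = self._rows[key]
        plan = []
        for c in snap.commits:
            if c.commit_type == "full":
                plan = [c.path]      # a full write invalidates earlier files
            else:
                plan.append(c.path)  # incremental files are merged on read
        return plan

meta = MetaStore()
meta.put(("sales", "dt=2022-12-01", 0), PartitionSnapshot(
    version=3,
    commits=[FileCommit("f1.parquet", "full"),
             FileCommit("f2.parquet", "incremental"),
             FileCommit("f3.parquet", "full"),
             FileCommit("f4.parquet", "incremental")]))
print(meta.read_plan(("sales", "dt=2022-12-01", 0)))
# ['f3.parquet', 'f4.parquet']
```

Note how no directory listing is ever needed: the snapshot row alone determines which files a reader opens.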

&lt;p&gt;In contrast, Hive uses MySQL as its metadata storage layer. Besides scalability problems, Hive also has a bottleneck when querying partition information, and managing more than a few thousand partitions in a single table is difficult. &lt;strong&gt;&lt;em&gt;LakeSoul also scales to far more partitions than Iceberg and Delta Lake, which use the file system to store metadata.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.2 &lt;em&gt;ACID transactions and concurrent writes&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To ensure consistency between concurrent writes and reads, LakeSoul supports ACID transactions and concurrent updates. Unlike OLTP scenarios, lakehouse updates are file-level granular. LakeSoul uses Cassandra's lightweight transactions (LWT) to implement partition-level updates. Cassandra is not a full relational transactional database, but LWT provides atomicity and isolation for updates. Availability and consistency, in turn, can be controlled through Cassandra's consistency levels, which can be chosen to match the needs of the business scenario, providing greater flexibility.&lt;br&gt;
Specifically, when the computing engine commits the files it produced for a partition, it first commits the partition file-update information, such as a full update or an incremental update, and then advances the reader-visible version via LWT. When concurrent updates are detected, LakeSoul automatically distinguishes between write types to determine whether there is a conflict, and decides whether to resubmit the data directly or roll back the computation.&lt;/p&gt;
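The commit protocol above amounts to a compare-and-set on the partition version. The sketch below simulates a Cassandra-LWT-style conditional update (`UPDATE ... IF version = ?`) with an in-process lock standing in for Cassandra's Paxos round; all names are illustrative:

```python
import threading

class PartitionVersion:
    """Toy model of the reader-visible version of one partition."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.commits = []

    def commit(self, expected_version, files):
        """Compare-and-set: apply the commit only if nobody else has
        committed since this writer read `expected_version`."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: caller re-reads, then resubmits or rolls back
            self.commits.append(files)
            self.version += 1
            return True

p = PartitionVersion()
seen = p.version                             # writers A and B both read version 0
assert p.commit(seen, ["a.parquet"]) is True     # A wins the conditional update
assert p.commit(seen, ["b.parquet"]) is False    # B detects the conflict
assert p.commit(p.version, ["b.parquet"]) is True  # B re-reads and resubmits
```

For non-conflicting write types (e.g. two Appends to the same partition), the real system can simply resubmit; only genuinely conflicting writes need to roll back.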

&lt;p&gt;&lt;strong&gt;2.3 &lt;em&gt;Incremental Update and Upsert&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
With the support of an efficient metadata layer, LakeSoul implements a fast incremental update mechanism. LakeSoul supports multiple update modes, including Append, Overwrite, and Upsert. Log-style streaming writes usually use Append; in this case, the metadata layer only needs to record the file paths appended to each partition, and read jobs read all appended files in a partition and complete the merge on read. For Overwrite, that is, when an Update/Delete under arbitrary conditions or a Compaction occurs, LakeSoul uses a copy-on-write mechanism.&lt;/p&gt;

&lt;p&gt;LakeSoul supports an even more efficient Upsert mechanism when a table has a hash partition and the Upsert operates on the hash keys. Within each hash bucket, LakeSoul sorts the files by the hash key, so after executing Upsert multiple times there are multiple ordered files. A read operation then completes the merge on read by simply merging these ordered files. Upsert is shown as follows:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r0wqu3mf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x0vkpszla8wcxq0ef9z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r0wqu3mf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x0vkpszla8wcxq0ef9z3.png" alt="Image description" width="880" height="355"&gt;&lt;/a&gt;&lt;br&gt;
In this way, LakeSoul achieves high write throughput while maintaining read speed through highly optimized ordered-file merging.&lt;/p&gt;
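The merge-on-read step can be illustrated with Python's `heapq.merge`: each Upsert leaves one file sorted by the hash key, and a read is a k-way merge of sorted runs in which the newest file wins for duplicate keys. The file contents here are invented for illustration:

```python
import heapq

def merge_on_read(sorted_files):
    """sorted_files is ordered oldest -> newest; each file is a list of
    (key, value) pairs sorted by key. Later files override earlier ones."""
    # Tag every row with its file index so rows for the same key arrive
    # oldest-first from the merge; the newest assignment then wins.
    runs = [[(k, i, v) for k, v in f] for i, f in enumerate(sorted_files)]
    merged = {}
    for k, _i, v in heapq.merge(*runs):
        merged[k] = v
    return sorted(merged.items())

f1 = [(1, "a"), (3, "c")]          # file produced by the first Upsert
f2 = [(2, "b"), (3, "C")]          # second Upsert overrides key 3
print(merge_on_read([f1, f2]))
# [(1, 'a'), (2, 'b'), (3, 'C')]
```

Because every run is already sorted, the merge is a single streaming pass with no random I/O, which is what keeps read speed high despite many small Upsert files.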

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2.4 Real-time data warehousing functions&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
LakeSoul is committed to simplifying the adoption of the big data lakehouse and aims to provide concise, efficient real-time data warehouse capabilities for data development. To this end, LakeSoul has developed and optimized several practical functions, including CDC, SCD, and TimeTravel, for common real-time data warehouse business scenarios.&lt;br&gt;
LakeSoul CDC provides its own set of CDC semantics. By designating an operation column in the CDC sink table as the CDC semantic column, LakeSoul can interconnect with various CDC sources such as Debezium, Flink, and Canal: the operation column of the source is converted into the LakeSoul CDC operation column. When synchronizing an online OLTP database into LakeSoul, the sink table can also use hash bucketing and write CDC events as Upserts on the hash keys, achieving very high CDC synchronization speed.&lt;/p&gt;
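As a rough illustration of the CDC semantic column, the sketch below maps Debezium-style op codes (`c`/`r`/`u`/`d`) to an operation value and applies the events as primary-key Upserts on a sink table modeled as a dict. The mapping and helper names are assumptions for illustration, not LakeSoul's exact contract:

```python
# Debezium envelope op codes: c=create, r=snapshot read, u=update, d=delete.
DEBEZIUM_OPS = {"c": "insert", "r": "insert", "u": "update", "d": "delete"}

def to_cdc_row(event):
    """Flatten a Debezium-style envelope into (key, payload, cdc_op)."""
    op = DEBEZIUM_OPS[event["op"]]
    row = event["before"] if op == "delete" else event["after"]
    return row["id"], row, op

def apply_cdc(table, events):
    """Apply CDC rows as primary-key Upserts on the sink table (a dict)."""
    for ev in events:
        key, row, op = to_cdc_row(ev)
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = row  # insert and update are both plain upserts
    return table

events = [
    {"op": "c", "before": None, "after": {"id": 1, "amt": 10}},
    {"op": "u", "before": {"id": 1, "amt": 10}, "after": {"id": 1, "amt": 25}},
    {"op": "d", "before": {"id": 1, "amt": 25}, "after": None},
]
print(apply_cdc({}, events))
# {}  (the row was created, updated, then deleted)
```

Because inserts, updates, and deletes all reduce to keyed Upserts on the hash key, the CDC stream rides the same fast ordered-merge write path described above.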

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3. LakeSoul application scenario Example&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;3.1 CDC Real-time Large-screen Report Synchronization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To analyze the data in an online database in real time, the data is usually first synchronized into the data warehouse, where data analysis is performed and BI reports and large-screen dashboards are built. For example, during an e-commerce promotion, the consumption amount, order count, and inventory per province, region, and product group should be displayed on a large screen. These transaction data come from the online transaction database, so they must be imported into the data warehouse for multidimensional report analysis. If a periodic database dump is used in such scenarios, the latency and storage overhead are too high to meet the timeliness requirements. &lt;strong&gt;Real-time computation with Flink, on the other hand, suffers from cumbersome development, operation, and maintenance.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Using LakeSoul's real-time CDC synchronization and converting primary-key operations into Upserts, extremely high write throughput can be achieved.&lt;/strong&gt; Data changes in the online database are synchronized to the data warehouse in near real time, and BI analysis of the online data can then be carried out quickly through SQL queries. Through Debezium, a variety of online databases can be supported, including MySQL, Oracle, and others. Ingesting CDC streams directly into the lake greatly simplifies the real-time ingestion and update architecture of the data lakehouse.&lt;/p&gt;

&lt;p&gt;LakeSoul provides a LakeSoul CDC MVP validation document: &lt;a href="https://github.com/meta-soul/LakeSoul/blob/main/examples/cdc_ingestion_debezium/README-CN.md"&gt;https://github.com/meta-soul/LakeSoul/blob/main/examples/cdc_ingestion_debezium/README-CN.md&lt;/a&gt;, which contains a complete example of real-time CDC ingestion into the lake; interested readers can refer to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3.2 Real-time sample database construction of recommendation algorithm system&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A common requirement in recommendation algorithm scenarios is to join multiple tables, including user features, product features, exposure labels, and click labels, to build a sample library for model training. An offline Join suffers from low timeliness and high consumption of computing resources, while the common alternative of multi-stream Joins in Flink occupies substantial resources when the time window is large.&lt;/p&gt;

&lt;p&gt;In this case, LakeSoul can be used to build a real-time sample library by converting the multi-stream Join into concurrent multi-stream Upserts. Since the different streams share the same primary key, that key can be set as the hash partition key for efficient Upserts. This approach supports Upserts over large time windows while maintaining high write throughput.&lt;/p&gt;
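The idea of replacing a multi-stream Join with concurrent Upserts on a shared primary key can be sketched as follows: each stream writes only its own columns, and the sample row fills in as the partial updates merge. This is a toy model of the pattern, not LakeSoul's actual API:

```python
def upsert(table, stream_rows):
    """Merge partial rows from one stream into the sample table by key.
    Each stream contributes only the columns it knows about."""
    for key, cols in stream_rows:
        table.setdefault(key, {}).update(cols)
    return table

samples = {}
# Streams arrive independently, keyed by the same (user, item) primary key.
upsert(samples, [("u1:i9", {"user_feat": [0.1, 0.7]})])     # user-feature stream
upsert(samples, [("u1:i9", {"item_feat": [0.3]})])          # item-feature stream
upsert(samples, [("u1:i9", {"exposed": 1, "clicked": 0})])  # label stream
print(samples["u1:i9"])
# {'user_feat': [0.1, 0.7], 'item_feat': [0.3], 'exposed': 1, 'clicked': 0}
```

Unlike a windowed stream Join, no stream has to buffer state waiting for the others; a late-arriving label simply lands as one more Upsert on the existing row.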

&lt;p&gt;The above outlines the design concept and application cases of LakeSoul. From these, it can be seen that &lt;a href="https://github.com/meta-soul/LakeSoul"&gt;LakeSoul&lt;/a&gt; is very innovative and has performance advantages. This is also why I chose to use LakeSoul, even though it was still a young open-source project with only about 500 stars at the time.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
