<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Apache Doris</title>
    <description>The latest articles on Forem by Apache Doris (@apachedoris).</description>
    <link>https://forem.com/apachedoris</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F868250%2F969aae9e-130a-4966-a0d8-84d4278b28fa.jpg</url>
      <title>Forem: Apache Doris</title>
      <link>https://forem.com/apachedoris</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/apachedoris"/>
    <language>en</language>
    <item>
      <title>Can I use Apache Doris with my existing RAG system?</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 28 Jan 2026 21:32:47 +0000</pubDate>
      <link>https://forem.com/apachedoris/can-i-use-apache-doris-with-my-existing-rag-system-3f2f</link>
      <guid>https://forem.com/apachedoris/can-i-use-apache-doris-with-my-existing-rag-system-3f2f</guid>
      <description>&lt;p&gt;This question came up in our recent webinar Q&amp;amp;A [video below👇]. &lt;br&gt;
The short answer: Yes. Apache Doris can replace your existing vector store (ChromaDB, Pinecone, Milvus...), but your chunking, embedding pipeline, and application logic stay exactly as they are.&lt;/p&gt;

&lt;p&gt;A lot of RAG system infrastructure today looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres for structured data&lt;/li&gt;
&lt;li&gt;Pinecone/ChromaDB/Milvus/Weaviate for vectors&lt;/li&gt;
&lt;li&gt;Some even add Elasticsearch for keyword search&lt;/li&gt;
&lt;li&gt;Your app stitches results together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what if "clients want to query their database with an LLM, not just text, but structured and unstructured data together?"&lt;/p&gt;

&lt;p&gt;When your vectors, keywords, and metadata live in different systems, it's hard to run a search like this efficiently: "find Python engineers in San Francisco hired in 2024 with similar backgrounds to this resume."&lt;/p&gt;

&lt;p&gt;But with Apache Doris, a real-time database that now supports hybrid search and vector search, you can run those searches in one SQL query, in one database, on one unified system.&lt;/p&gt;
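&lt;p&gt;As a rough sketch only (the table, column, and distance-function names below are illustrative, not Apache Doris' exact syntax), such a search collapses into a single statement that filters on metadata, matches keywords, and ranks by vector similarity at once:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative only: hybrid search in one SQL query (names are hypothetical)
SELECT id, name, hired_at
FROM engineers
WHERE city = 'San Francisco'
  AND YEAR(hired_at) = 2024
  AND resume_text MATCH_ANY 'Python'               -- keyword search via inverted index
ORDER BY l2_distance(resume_embedding, :query_vec) -- vector similarity to the given resume
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;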

&lt;p&gt;If you're running RAG in production, juggling multiple databases, and facing cost and performance issues, it might be worth asking: what if you didn't have to?&lt;br&gt;
🔗 See how ByteDance uses Apache Doris' hybrid search to cut down vector search cost: &lt;a href="https://www.velodb.io/blog/bytedance-solved-billion-scale-vector-search-problem-with-apache-doris-4-0?utm_source=linkedin" rel="noopener noreferrer"&gt;https://www.velodb.io/blog/bytedance-solved-billion-scale-vector-search-problem-with-apache-doris-4-0?utm_source=linkedin&lt;/a&gt;&lt;br&gt;
🔗 Watch the webinar in full: &lt;a href="https://www.youtube.com/watch?v=kKiXWNWZYVc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=kKiXWNWZYVc&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>apachedoris</category>
      <category>vectorsearch</category>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from PostgreSQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 20 Jan 2026 22:14:48 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-5aem</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-5aem</guid>
      <description>&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;When migrating data from PostgreSQL (including PostgreSQL-compatible Amazon Aurora) to VeloDB, Flink can be introduced as a real-time synchronization engine to keep the data consistent and fresh. Flink's high-throughput, low-latency stream processing enables efficient full data loading and incremental change handling for databases.&lt;/p&gt;

&lt;p&gt;For real-time synchronization, PostgreSQL's Logical Replication can be enabled to capture CDC (Change Data Capture) events. Whether on a self-hosted PostgreSQL or on cloud-based Amazon Aurora PostgreSQL, Flink CDC can subscribe to changes once the logical decoding plugin is enabled and a replication slot is created, thereby achieving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full data initial load: First import business data from PostgreSQL/Aurora into VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time synchronization of incremental changes: Capture Insert/Update/Delete operations based on Logical Replication and continuously write them to VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
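
&lt;p&gt;For reference, on a self-hosted PostgreSQL the logical-replication prerequisites can be checked (and a slot created) with standard commands; the slot name below is just an example, and Flink CDC can also manage its own slot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- wal_level must be 'logical' for logical decoding
SHOW wal_level;

-- Optionally pre-create a replication slot using the built-in pgoutput plugin
SELECT * FROM pg_create_logical_replication_slot('flink_cdc_slot', 'pgoutput');

-- Inspect existing slots
SELECT slot_name, plugin, active FROM pg_replication_slots;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;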

&lt;p&gt;The following takes Amazon Aurora-PostgreSQL as an example to demonstrate how to use Flink CDC to subscribe to Aurora changes and synchronize them to VeloDB in real time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Create an AWS RDS Aurora PostgreSQL instance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nplp5dpn0lqx2ju1jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nplp5dpn0lqx2ju1jy.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Create a VeloDB warehouse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qwgmdt5sc9vi340nv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qwgmdt5sc9vi340nv2.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Create a PostgreSQL database and corresponding tables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Load data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
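
&lt;p&gt;Before starting the sync job, it is worth confirming the sample data landed as expected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run in PostgreSQL; should report the 5 rows inserted above
SELECT COUNT(*) FROM public.student;
SELECT id, name, score FROM public.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;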



&lt;h2&gt;
  
  
  4. Enable PostgreSQL Logical Replication
&lt;/h2&gt;

&lt;p&gt;Create a parameter group and modify the rds.logical_replication configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktav4qgd42kxpxloophx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktav4qgd42kxpxloophx.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wh63dd8gn23ae8y40df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wh63dd8gn23ae8y40df.png" alt=" " width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modify the PostgreSQL configuration: replace the DB Cluster Parameter Group with the one created above, apply the changes, and restart the service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqsmhscpfd8c3pn3art2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqsmhscpfd8c3pn3art2.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;
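
&lt;p&gt;Once the instance has restarted, you can confirm from a SQL session that logical replication is actually in effect:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both should now reflect the new parameter group
SHOW rds.logical_replication;  -- expect: on
SHOW wal_level;                -- expect: logical
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;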

&lt;h2&gt;
  
  
  5. Install Flink With Doris Connector
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Download the pre-built installation package
&lt;/h3&gt;

&lt;p&gt;Based on Flink 1.17, we provide a pre-built installation package that can be downloaded and extracted directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Manual installation
&lt;/h3&gt;

&lt;p&gt;If you already have a Flink environment or need a different Flink version, you can install manually. Taking Flink 1.17 as an example, download the Flink installation package and its dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Postgres CDC Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Doris Connector&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the download is complete, extract the Flink installation package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then place the Flink PostgreSQL CDC Connector and the Doris Connector jars into the flink-1.17.2/lib directory.&lt;/p&gt;

&lt;p&gt;As follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2oq577dh7dkjn4hvhm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2oq577dh7dkjn4hvhm3.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Submit the Flink synchronization job
&lt;/h2&gt;

&lt;p&gt;When the job is submitted, the Doris Connector automatically creates the corresponding tables in VeloDB based on the table structure of the upstream PostgreSQL database.&lt;/p&gt;

&lt;p&gt;Flink supports job submission and operation in modes such as Local, Standalone, and Yarn. If you already have a Flink environment, you can directly submit the job to your own Flink environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Local Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
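
&lt;p&gt;Once the job is running, you can check the result on the VeloDB side. VeloDB speaks the MySQL protocol (the 9030 port in the jdbc-url above), so any MySQL client works; the database and table names below assume the connector's defaults:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against VeloDB through a MySQL-protocol client
SELECT COUNT(*) FROM test_db.student;  -- should match the PostgreSQL row count
SELECT id, name, score FROM test_db.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;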



&lt;h3&gt;
  
  
  6.2 Standalone Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Yarn Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.4 K8S Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--postgres&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: For more Connector parameters, refer to the Flink Doris Connector documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Verify Historical Data Synchronization
&lt;/h2&gt;

&lt;p&gt;On its first run, the Flink job synchronizes the full historical data. Check the synchronization status in VeloDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbxrqb52w0m6fqi56xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbxrqb52w0m6fqi56xi.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Verify Real-Time Data Synchronization
&lt;/h2&gt;

&lt;p&gt;For scenarios that require capturing deleted rows, enable the following configuration in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For details, refer to the PostgreSQL documentation on REPLICA IDENTITY.&lt;/p&gt;

&lt;p&gt;Perform data modifications in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
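Flink CDC turns each of these statements into a change event that the Doris sink applies to the target table. A toy in-memory sketch of that apply logic (simplified event shapes, not the connector's actual code; real events are Debezium-style records):

```python
# Toy model: apply CDC events to a table keyed by primary key.
def apply_cdc(table, events):
    for op, row in events:
        if op in ("insert", "update"):  # upsert semantics on the primary key
            table[row["id"]] = row
        elif op == "delete":            # needs REPLICA IDENTITY FULL upstream
            table.pop(row["id"], None)
    return table

# State before the changes (abbreviated rows), then the three statements above.
table = {2: {"id": 2, "score": 91.0, "age": 22}, 3: {"id": 3}}
events = [
    ("insert", {"id": 6, "name": "Frank Zhao", "age": 24, "score": 88.75}),
    ("delete", {"id": 3}),
    ("update", {"id": 2, "score": 95.0, "age": 23}),
]
apply_cdc(table, events)
```

Note the delete case: without REPLICA IDENTITY FULL, PostgreSQL may not emit enough of the old row for the sink to identify which record to remove.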



&lt;p&gt;Verify data changes in VeloDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftifdboaadc92ivtsw2vk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftifdboaadc92ivtsw2vk.png" alt=" " width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>bigdata</category>
      <category>database</category>
      <category>doris</category>
    </item>
    <item>
      <title>Apache Doris IP change problem handling method</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 18 Dec 2025 19:32:05 +0000</pubDate>
      <link>https://forem.com/apachedoris/apache-doris-ip-change-problem-handling-method-3o6k</link>
      <guid>https://forem.com/apachedoris/apache-doris-ip-change-problem-handling-method-3o6k</guid>
      <description>&lt;h2&gt;
  
  
  Background note
&lt;/h2&gt;

&lt;p&gt;A host may have multiple IPs, for example because it has multiple network interface cards or because virtual interfaces were created by Docker or similar environments. Apache Doris does not automatically pick the right IP, so on a host with multiple IPs you must force the correct one through the priority_networks configuration item.&lt;/p&gt;

&lt;p&gt;priority_networks exists for both FE and BE and is written in fe.conf and be.conf respectively. It tells the process which IP to bind to when FE or BE starts. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;priority_networks = 10.1.3.0/24&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is CIDR notation. FE or BE uses this configuration to find a matching IP as its local IP.&lt;/p&gt;

&lt;p&gt;CIDR uses slash notation: an IP address followed by a slash and the number of network ID bits. The following two examples show the conversion.&lt;/p&gt;

&lt;p&gt;① 192.168.0.0/16, converted to a 32-bit binary address: 11000000.10101000.00000000.00000000. The /16 means the network ID is 16 bits, i.e. the first 16 bits of the address are fixed, corresponding to the network segment 11000000.10101000.00000000.00000000 ~ 11000000.10101000.11111111.11111111.&lt;/p&gt;

&lt;p&gt;② 192.168.1.2/24, converted to a 32-bit binary address: 11000000.10101000.00000001.00000010. The /24 means the first 24 bits of the address are fixed, corresponding to the network segment 11000000.10101000.00000001.00000000 ~ 11000000.10101000.00000001.11111111.&lt;/p&gt;
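The containment test described above (does a host IP fall inside the configured segment?) is exactly what the priority_networks matching amounts to. The two conversions can be checked with Python's standard ipaddress module, shown here purely for illustration:

```python
import ipaddress

# Example ①: 192.168.0.0/16 fixes the first 16 bits, leaving 2**16 addresses.
net16 = ipaddress.ip_network("192.168.0.0/16")

# Example ②: the /24 network containing host 192.168.1.2 is 192.168.1.0/24.
net24 = ipaddress.ip_interface("192.168.1.2/24").network

in16 = ipaddress.ip_address("192.168.31.78") in net16  # inside the /16 segment
in24 = ipaddress.ip_address("192.168.2.9") in net24    # wrong third octet
```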

&lt;p&gt;In the following scenarios, the IP changes and FE/BE can no longer start or operate normally:&lt;/p&gt;

&lt;p&gt;① A cluster migration changes the IP network segment&lt;/p&gt;

&lt;p&gt;② A dynamic address in a virtualized environment changes the IP&lt;/p&gt;

&lt;p&gt;③ priority_networks is not configured correctly before a restart, so the IP obtained after restarting is inconsistent with the metadata&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Hardware information
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CPU model: ARM64&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory: 2GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hard drive: 36GB SSD&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Software information
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;VM image: CentOS 7&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Doris version: 1.2.4 (other versions are also acceptable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cluster size: 1FE * 3BE&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  FE recovery
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3. Exception log
&lt;/h2&gt;

&lt;p&gt;Checking fe.out shows the following exception, and the FE process cannot start at this point.&lt;/p&gt;

&lt;p&gt;Before operating, back up all FE metadata and stop upstream reads and writes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0iricmlt6c9jkq7mhn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0iricmlt6c9jkq7mhn5.png" alt=" " width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Get the current IP
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkctuqmlnfi4p5mowltsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkctuqmlnfi4p5mowltsz.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Reset IP information
&lt;/h2&gt;

&lt;p&gt;After resetting the IP information, the above exception is still reported; the metadata also needs to be reset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# modify fe.conf priority_networks&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.0.0/16
&lt;span class="c"&gt;# or use this&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.31.78/16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
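Both values shown above select the same network: masking the host bits of 192.168.31.78/16 yields exactly 192.168.0.0/16, so either form matches the host's new IP. A quick check with Python's standard ipaddress module (illustrative only):

```python
import ipaddress

# priority_networks accepts a pure network or an address with a prefix length;
# stripping the host bits of 192.168.31.78/16 gives the same /16 network.
net_a = ipaddress.ip_network("192.168.0.0/16")
net_b = ipaddress.ip_interface("192.168.31.78/16").network

same = (net_a == net_b)                                   # equivalent configs
matches = ipaddress.ip_address("192.168.31.78") in net_a  # the FE host matches
```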



&lt;h2&gt;
  
  
  6. Reset metadata record
&lt;/h2&gt;

&lt;p&gt;After resetting the metadata record, the FE process can start but is not usable; it still requires metadata recovery mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Annotate out the old ips previously recorded in the fe metadata&lt;/span&gt;
vim doris-meta/image/ROLE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1epocqznlooj77pa8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1epocqznlooj77pa8n.png" alt=" " width="632" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Metadata mode recovery
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add metadata_failure_recovery=true to fe.conf to restart fe in recovery mode&lt;/span&gt;
vim fe.conf
&lt;span class="nv"&gt;metadata_failure_recovery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="c"&gt;# Then go to http://192.168.31.78:8030/login, if you can open the fe web UI, it can be normal boot fe&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7d6asfc00avgs9g7g7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7d6asfc00avgs9g7g7.png" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Reset fe cluster node
&lt;/h2&gt;

&lt;p&gt;Although FE can now start in metadata recovery mode, it is not fully restored: the FE nodes recorded in the metadata still list the old IP, not the newly configured one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Execute&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;following&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;web&lt;/span&gt; &lt;span class="n"&gt;ui&lt;/span&gt; &lt;span class="n"&gt;Playground&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="n"&gt;recorded&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;remove&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.81:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; 
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.78:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old IP nodes are as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9p3p5tzm4wxotjn2c49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9p3p5tzm4wxotjn2c49.png" alt=" " width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new IP node after reset is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjceyqgj9etct4myoqya8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjceyqgj9etct4myoqya8.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Turn off metadata mode and restart FE
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Annotate metadata_failure_recovery=true in fe.conf Turn off recovery mode and restart fe&lt;/span&gt;
vim fe.conf
&lt;span class="c"&gt;#metadata_failure_recovery=true&lt;/span&gt;

&lt;span class="c"&gt;# and then go to http://192.168.31.78:8030/login, if you can open the fe web UI, fe completely restored&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  BE Recovery
&lt;/h1&gt;

&lt;h2&gt;
  
  
  10. Get the current IP
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wu4vuux51w339m13xds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wu4vuux51w339m13xds.png" alt=" " width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Reset IP information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# modify be.conf priority_networks&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.0.0/16
&lt;span class="c"&gt;# or use this&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.31.136/16
&lt;span class="c"&gt;# After setting, restart be&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  12. Reset BE cluster node
&lt;/h2&gt;

&lt;p&gt;Although BE can now start, it is not fully restored: the BE nodes recorded in the FE metadata still list the old IPs, not the newly configured ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Execute&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;following&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;web&lt;/span&gt; &lt;span class="n"&gt;ui&lt;/span&gt; &lt;span class="n"&gt;Playground&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="n"&gt;recorded&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;remove&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.81:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.72:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.133:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.78:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.71:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.136:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After all three BEs were reset, they were fully restored as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcd1j8cplbvn8c63a1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcd1j8cplbvn8c63a1s.png" alt=" " width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the Apache Doris cluster failure caused by the IP change has been fully resolved.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apachedoris</category>
      <category>database</category>
      <category>olap</category>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from PostgreSQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 17 Dec 2025 22:04:51 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-188l</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-188l</guid>
      <description>&lt;p&gt;Migrating data from PostgreSQL (or Amazon Aurora-PostgreSQL) to VeloDB while ensuring &lt;strong&gt;real-time consistency&lt;/strong&gt; can be a challenge—luckily, Flink CDC (Change Data Capture) solves this problem with high throughput and low latency. This step-by-step guide will walk you through using Flink CDC to sync data from Aurora-PostgreSQL to VeloDB, covering full data loading and incremental change capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When syncing PostgreSQL/Aurora to VeloDB, Flink acts as the real-time stream processing engine, and PostgreSQL’s &lt;strong&gt;Logical Replication&lt;/strong&gt; captures CDC events. This combination enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full data initial load&lt;/strong&gt;: Import existing business data from PostgreSQL/Aurora to VeloDB in one go.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time incremental sync&lt;/strong&gt;: Capture &lt;code&gt;INSERT/UPDATE/DELETE&lt;/code&gt; operations from PostgreSQL and write them to VeloDB continuously.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use &lt;strong&gt;Amazon Aurora-PostgreSQL&lt;/strong&gt; as the source and VeloDB as the sink to demonstrate the entire process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An AWS RDS Aurora PostgreSQL instance (or self-hosted PostgreSQL).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A VeloDB warehouse (with FE nodes accessible via network).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17+ environment (we’ll cover both pre-built and manual installation).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network connectivity between Flink, PostgreSQL/Aurora, and VeloDB (e.g., security groups, VPC peering).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up Aurora-PostgreSQL &amp;amp; Test Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Create an Aurora-PostgreSQL Instance
&lt;/h3&gt;

&lt;p&gt;First, create an AWS RDS Aurora PostgreSQL instance (skip this if you already have one).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63eaxw7u15r5hz0ftd4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63eaxw7u15r5hz0ftd4t.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Create a Database and Table
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fystj4pzk6jatc1gncja8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fystj4pzk6jatc1gncja8.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connect to your Aurora-PostgreSQL instance and run the following SQL to create a test database and table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 创建表&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- 插入数据&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.3 Enable PostgreSQL Logical Replication
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7ip60grsuaa9rznhc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7ip60grsuaa9rznhc2.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frll6iafjaqdp4wlxgwdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frll6iafjaqdp4wlxgwdl.png" alt=" " width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modify the PostgreSQL configuration: attach the DB cluster parameter group you just created, apply the changes, and reboot the instance so the new settings take effect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr57an6efml4hp0rx71a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr57an6efml4hp0rx71a.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Install Flink with Doris/VeloDB Connector
&lt;/h2&gt;

&lt;p&gt;VeloDB is compatible with the &lt;strong&gt;Flink Doris Connector&lt;/strong&gt;, so we’ll use that to connect Flink to VeloDB. You can choose either the pre-built package or manual installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Pre-Built Installation (Simplest)
&lt;/h3&gt;

&lt;p&gt;We provide a pre-built Flink 1.17 package with all required connectors (PostgreSQL CDC + Doris/VeloDB). Simply download and extract it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Manual Installation (For Existing Flink Environments)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fwutnd2p3aoxyku2z9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fwutnd2p3aoxyku2z9w.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you already have Flink 1.17 installed, download the required dependencies and add them to the &lt;code&gt;lib&lt;/code&gt; directory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Download Flink 1.17.2: &lt;a href="https://flink.apache.org/downloads.html#flink-117" rel="noopener noreferrer"&gt;Flink 1.17.2 Download&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Flink PostgreSQL CDC Connector: &lt;a href="https://mvnrepository.com/artifact/com.ververica/flink-connector-postgres-cdc/2.4.0" rel="noopener noreferrer"&gt;Flink Postgres CDC&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Flink Doris Connector: &lt;a href="https://repo.maven.apache.org/maven2/org/apache/doris/flink-doris-connector/" rel="noopener noreferrer"&gt;Flink Doris Connector&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Submit the Flink CDC Sync Job
&lt;/h2&gt;

&lt;p&gt;The Flink Doris Connector will &lt;strong&gt;automatically create corresponding tables in VeloDB&lt;/strong&gt; based on the PostgreSQL table structure, so no manual DDL is needed on the VeloDB side. We’ll cover job submission in four common environments: Local, Standalone, YARN, and Kubernetes.&lt;/p&gt;
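&lt;p&gt;Once a sync job is up, you can inspect the auto-created table on the VeloDB side to confirm the schema mapping. A quick check (the database and table names mirror the PostgreSQL source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against VeloDB over its MySQL-compatible protocol (port 9030)
SHOW CREATE TABLE test_db.student;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;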

&lt;h3&gt;
  
  
  Important Notes Before Submission
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Port&lt;/strong&gt;: Aurora PostgreSQL listens on port &lt;strong&gt;5432&lt;/strong&gt; by default (3306 is the MySQL port), so make sure &lt;code&gt;--postgres-conf port&lt;/code&gt; in the commands below matches your instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Name&lt;/strong&gt;: &lt;code&gt;--postgres-conf database-name&lt;/code&gt; must match the source database exactly (&lt;code&gt;test_db&lt;/code&gt; in this tutorial); a mismatched value such as &lt;code&gt;test&lt;/code&gt; will cause the job to find no tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customize Params&lt;/strong&gt;: Replace placeholder values (e.g., &lt;code&gt;hostname&lt;/code&gt;, &lt;code&gt;fenodes&lt;/code&gt;, &lt;code&gt;password&lt;/code&gt;) with your actual credentials.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.1 Local Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Standalone Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 Yarn Environment (Per-Job Mode)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.4 Kubernetes Environment (Session Mode)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--postgres&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 &lt;strong&gt;Tip&lt;/strong&gt;: For more Flink Doris Connector parameters, check the &lt;a href="https://doris.apache.org/docs/data-operate/load/flink-doris-connector" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Verify Data Sync
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Verify Full Historical Data Sync
&lt;/h3&gt;

&lt;p&gt;The Flink job will first sync all existing data from PostgreSQL to VeloDB. Connect to your VeloDB warehouse and query the &lt;code&gt;student&lt;/code&gt; table to confirm the data is present.  &lt;/p&gt;
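&lt;p&gt;For example, querying VeloDB over its MySQL-compatible protocol (the expected counts assume the five sample rows inserted in Step 1.2):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- All five historical rows should be present
SELECT COUNT(*) FROM test_db.student;

-- Spot-check the synced values
SELECT id, name, score FROM test_db.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;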

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5zxhei6ceh095s5wjai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5zxhei6ceh095s5wjai.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Verify Real-Time Incremental Sync
&lt;/h3&gt;

&lt;p&gt;To capture &lt;strong&gt;DELETE&lt;/strong&gt; operations (required for full incremental sync), first enable full replica identity on the PostgreSQL table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📚 &lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY" rel="noopener noreferrer"&gt;PostgreSQL Replica Identity Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, modify data in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check VeloDB to confirm the changes are synced in real-time:  &lt;/p&gt;
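
&lt;p&gt;For example (again assuming the same table name in VeloDB), each change type can be verified directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- INSERT: the new row should appear
SELECT * FROM student WHERE id = 6;

-- DELETE: this should return no rows
SELECT * FROM student WHERE id = 3;

-- UPDATE: score should be 95.00 and age 23
SELECT score, age FROM student WHERE id = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;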

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4odo1z9ehcip275qo142.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4odo1z9ehcip275qo142.png" alt=" " width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL Port Mismatch&lt;/strong&gt;: Don’t use 3306 (MySQL) for PostgreSQL—use 5432 instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logical Replication Not Enabled&lt;/strong&gt;: Without &lt;code&gt;rds.logical_replication = 1&lt;/code&gt;, CDC events won’t be captured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replica Identity Missing&lt;/strong&gt;: For DELETE operations, &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; is required (otherwise, deletes won’t sync).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Connectivity&lt;/strong&gt;: Ensure Flink can reach Aurora-PostgreSQL (5432) and VeloDB (8080, 9030) via security groups/VPC.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Flink CDC provides a robust, real-time way to sync data from PostgreSQL/Aurora to VeloDB, covering both full data loads and incremental changes. By following this guide, you can set up a reliable sync pipeline with minimal effort. If you run into issues, check the Flink and VeloDB logs for details, or refer to the official documentation for additional parameters.&lt;/p&gt;

&lt;p&gt;Happy syncing! 🚀&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>postgressql</category>
      <category>apachedoris</category>
      <category>database</category>
    </item>
    <item>
      <title>Agent Facing Analytics with High Concurrency: Doris vs Clickhouse vs Snowflake</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 10 Dec 2025 21:09:50 +0000</pubDate>
      <link>https://forem.com/apachedoris/agent-facing-analytics-with-high-concurrency-doris-vs-clickhouse-vs-snowflake-18ij</link>
      <guid>https://forem.com/apachedoris/agent-facing-analytics-with-high-concurrency-doris-vs-clickhouse-vs-snowflake-18ij</guid>
      <description>&lt;p&gt;Data warehouses have evolved drastically over the past 30 years—from BI-driven legacy systems to big data-powered modern platforms. Now, with the explosion of GenAI and LLM applications, we're entering a new era where data warehouses must seamlessly integrate with AI workflows, support real-time agent interactions, and deliver extreme performance at scale. Apache Doris 4.0 emerges as the game-changer, combining enterprise-grade analytics with AI-native capabilities to meet the demands of today's intelligent applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Data Warehouses: From Legacy to AI-Native
&lt;/h2&gt;

&lt;p&gt;Let's trace the journey of data warehouses and understand how AI is reshaping their core requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy Data Warehouses (BI-Driven)
&lt;/h3&gt;

&lt;p&gt;The first generation of data warehouses separated analytical data from transactional systems to handle large volumes of historical data (e.g., daily trading reports for stockbrokers). However, they quickly hit walls in the big data era:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Expensive hardware upgrades with limited horizontal scaling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: On-premise deployments required specialized hardware and high maintenance costs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced Analytics&lt;/strong&gt;: Poor support for real-time insights, AI, and ML&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Rigid architectures unable to adapt to new use cases or diverse data sources&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Modern Data Warehouses (Big Data-Driven)
&lt;/h3&gt;

&lt;p&gt;Post-2000, the mobile internet and e-commerce boom drove the need for more agile analytics. Modern data warehouses addressed legacy limitations with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stateless Compute/Storage&lt;/strong&gt;: Lower overhead for scaling resources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-Latency&lt;/strong&gt;: Sub-second response times for user queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-Concurrency&lt;/strong&gt;: Effortlessly handles thousands of concurrent workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Workloads&lt;/strong&gt;: Supports ad-hoc queries, ETL, and batch processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Queries&lt;/strong&gt;: Breaks data silos by unifying access to data lakes, transactional DBs, and more&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouses in the AI Era
&lt;/h3&gt;

&lt;p&gt;ISG Research predicts: &lt;em&gt;"Through 2027, almost all enterprises developing GenAI applications will invest in data platforms with vector search and retrieval-augmented generation (RAG) to complement foundation models with proprietary data."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LLMs thrive on high-quality data—for both training and inference. AI-driven applications require data warehouses to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance Volume &amp;amp; Quality&lt;/strong&gt;: High-quality data directly impacts model performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dual-Purpose Data&lt;/strong&gt;: Support both model training and real-time inference&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Freshness&lt;/strong&gt;: Handle continuous data read/write with near-zero latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent-Friendly&lt;/strong&gt;: Enable autonomous AI agents to interact without human intervention&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-First Design&lt;/strong&gt;: Natively support LLM functions, vector storage, and high-performance vector I/O&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: Agentic-Facing Analytics
&lt;/h2&gt;

&lt;p&gt;Traditional BI and OLAP systems are built for &lt;em&gt;passive, historical reporting&lt;/em&gt;—a handful of users running heavy queries with generous latency tolerance. AI changes this with &lt;strong&gt;agentic-facing analytics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Proactive, autonomous AI agents that reason, analyze in real-time, and trigger actions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workloads shift to: &lt;em&gt;"Massive users (agents), light/iterative queries, zero latency tolerance"&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Requires millisecond response times for thousands of concurrent queries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Legacy OLAP systems can't keep up—their pre-aggregated data cubes, batch processing, and data silos create bottlenecks for agentic workflows. The solution? A &lt;strong&gt;semantics-and-response-centric architecture&lt;/strong&gt; that prioritizes flexibility, real-time access, and unified data context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Doris Outperforms Competitors: Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Apache Doris (and its commercial distribution VeloDB) sets a new standard for performance across key analytics benchmarks. We compared it against Snowflake and ClickHouse Cloud with equivalent compute resources (128 cores for VeloDB/ClickHouse, XL-size cluster for Snowflake) using Apache JMeter to measure QPS at 10/30/50 parallelisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Overview
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Key Findings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSB-FLAT&lt;/td&gt;
&lt;td&gt;Single wide-table queries (no joins)&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 4.76–7.39x, ClickHouse 4.76–6.92x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSB (Star Schema)&lt;/td&gt;
&lt;td&gt;Join-heavy analytics&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 5.17–6.37x; ClickHouse failed most join queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TPC-H&lt;/td&gt;
&lt;td&gt;Complex ad-hoc decision support&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 1.71–3.10x; ClickHouse couldn’t run all queries (Q20/Q21/Q22 failed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex Joins&lt;/strong&gt;: Doris excels at join-heavy workloads (SSB/TPC-H) thanks to its advanced optimizer and execution engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Concurrency&lt;/strong&gt;: Maintains performance at scale (50 parallelisms) while competitors struggle with memory or parsing errors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wide-Table Performance&lt;/strong&gt;: Even in single-table scans (SSB-FLAT), outperforms purpose-built systems like ClickHouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Efficiency&lt;/strong&gt;: Delivers more throughput per compute unit than Snowflake’s elastic architecture&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deep Dive: Apache Doris Core Technologies
&lt;/h2&gt;

&lt;p&gt;Apache Doris’s performance and AI readiness stem from its innovative architecture. Let’s explore the key features powering its success.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Pruning: "Don’t Process Unnecessary Data"
&lt;/h3&gt;

&lt;p&gt;The most efficient way to process data is to avoid processing it entirely. Doris uses two types of pruning:&lt;/p&gt;

&lt;h4&gt;
  
  
  Static Filters (Pre-Execution)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition Pruning&lt;/strong&gt;: The frontend (FE) uses partition metadata to skip irrelevant partitions (e.g., time-based partitions outside the queried date range)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key Column Pruning&lt;/strong&gt;: Data is sorted by key columns—binary search narrows down the row range to scan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value Column Pruning&lt;/strong&gt;: Column files store min/max metadata to skip files that can’t match predicates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dynamic Filters (Post-Execution)
&lt;/h4&gt;

&lt;p&gt;For joins, filters are generated after building hash tables on the build side. This prunes irrelevant data on the probe side before joining, reducing join overhead.&lt;/p&gt;
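
&lt;p&gt;As an illustration (with hypothetical &lt;code&gt;sales&lt;/code&gt; and &lt;code&gt;dim_store&lt;/code&gt; tables), this is the query shape that benefits: the filter built from the small dimension side prunes the large fact-table scan before the join:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- dim_store (build side) yields few store_ids after the region filter;
-- a runtime filter on those store_ids prunes the sales (probe side) scan
SELECT s.store_id, SUM(s.amount) AS total
FROM sales s
JOIN dim_store d ON s.store_id = d.store_id
WHERE d.region = 'EMEA'
GROUP BY s.store_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;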

&lt;h3&gt;
  
  
  2. Advanced Pruning Optimizations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  LIMIT Pruning
&lt;/h4&gt;

&lt;p&gt;Pushes LIMIT clauses down to data scanning—stops processing once the required number of rows is retrieved.&lt;/p&gt;

&lt;h4&gt;
  
  
  TopK Pruning
&lt;/h4&gt;

&lt;p&gt;Optimizes TopK queries (e.g., "top 10 highest-grossing products") with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Local truncation in scanning threads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global merge sort via a coordinator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two-phase execution: first sort key columns to get row indices, then fetch required columns—avoids full data scans&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
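
&lt;p&gt;The target query shape, sketched against a hypothetical &lt;code&gt;products&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Phase one sorts only the sort key (gross_revenue) to find row indices;
-- the remaining columns are fetched for just the 10 winning rows
SELECT product_id, product_name, gross_revenue
FROM products
ORDER BY gross_revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;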

&lt;h4&gt;
  
  
  Join Pruning
&lt;/h4&gt;

&lt;p&gt;Reduces probe-side data for hash joins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Uses build-side hash table values to filter probe-side data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimizes data transfer and join computation (O(M+N) complexity vs. O(M*N) for Cartesian product)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Pipeline Engine: Efficient Execution at Scale
&lt;/h3&gt;

&lt;p&gt;Doris uses a coroutine-like pipeline engine to maximize CPU utilization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Yields CPU during blocking operations (disk I/O, network I/O in joins/exchanges)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eliminates thread switching overhead with task scheduling triggered by external events (e.g., RPC completion)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independent parallelism per pipeline (not constrained by tablet count)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even data distribution to minimize skewing via local exchange optimization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shared states across pipeline tasks (reduces initialization overhead)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Vectorized Query Execution
&lt;/h3&gt;

&lt;p&gt;Processes data in batches (vectors) instead of row-by-row, leveraging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SIMD (Single Instruction, Multiple Data) CPU instructions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loop unrolling to reduce branch mispredictions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accelerated compression, computation, and data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delivers 2–10x performance gains for analytical queries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI-Native Capabilities in Apache Doris 4.0
&lt;/h2&gt;

&lt;p&gt;Apache Doris 4.0 is built for the AI era with native support for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector Search&lt;/strong&gt;: High-performance storage and retrieval of feature vectors for LLM inference&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG Integration&lt;/strong&gt;: Seamlessly connects with LLMs to augment generation with proprietary data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Functions&lt;/strong&gt;: Built-in UDFs for ML/LLM workflows (e.g., embedding generation, text processing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;: Native Model Context Protocol (MCP) server support, so LLM-based agents and tools can connect to Doris directly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent Compatibility&lt;/strong&gt;: Designed for programmatic access by AI agents with low-latency responses&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI revolution demands data warehouses that are fast, flexible, and AI-native. Apache Doris 4.0 delivers on all fronts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Outperforms competitors in complex joins, high concurrency, and wide-table analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Features like data pruning, pipeline engine, and vectorized execution enable millisecond response times&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI-native capabilities (vector search, RAG, agent support) integrate seamlessly with GenAI workflows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams building AI-driven applications, Apache Doris isn’t just a data warehouse—it’s the foundation for intelligent, real-time analytics that powers the next generation of products and decision-making.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>ai</category>
      <category>apachedoris</category>
      <category>database</category>
    </item>
    <item>
      <title>Deploying Apache Doris with Storage-Compute Separation Using MinIO: A Practical Guide</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Fri, 05 Dec 2025 22:05:49 +0000</pubDate>
      <link>https://forem.com/apachedoris/deploying-apache-doris-with-storage-compute-separation-using-minio-a-practical-guide-381j</link>
      <guid>https://forem.com/apachedoris/deploying-apache-doris-with-storage-compute-separation-using-minio-a-practical-guide-381j</guid>
      <description>&lt;p&gt;Modern data processing faces multiple challenges. The ever-growing volume of data drives up traditional storage costs, especially with unstructured data becoming more prevalent. Data quality issues further increase the burden of storage and cleansing. Additionally, enterprises often struggle with data integration across multiple internal systems, which raises the bar for efficient and cost-effective data analytics.&lt;/p&gt;

&lt;p&gt;Apache Doris, a high-performance real-time analytics database with lakehouse capabilities, combined with MinIO, a high-performance S3-compatible object storage system, offers a powerful solution. Together, they enable an efficient, low-cost data analytics platform. This article explores the strengths of Apache Doris and MinIO and provides a step-by-step deployment guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Choose Apache Doris and MinIO?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apache Doris: High-Performance Real-Time Analytics Database
&lt;/h3&gt;

&lt;p&gt;Apache Doris is built on an MPP (Massively Parallel Processing) architecture, known for its efficiency, simplicity, and versatility—delivering sub-second query results on massive datasets. Key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Performance&lt;/strong&gt;: Sub-second responses for large datasets, supporting high-concurrency point queries and complex analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Analytics&lt;/strong&gt;: Enables real-time data ingestion and querying for instant insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Streamlined design with low operational and maintenance costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Horizontal scaling via MPP to handle large-scale data and high-concurrency workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Scenario Support&lt;/strong&gt;: Ideal for reports, ad-hoc queries, user profiling, log retrieval, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Robust Integration&lt;/strong&gt;: Seamlessly works with MySQL, PostgreSQL, Hive, Flink, and other tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Active Community&lt;/strong&gt;: Backed by 600+ contributors, deployed in production by 5,000+ organizations (including TikTok, Baidu).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Doris supports two deployment modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Integrated storage-compute (data stored internally)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate storage-compute (uses third-party storage like MinIO)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MinIO: High-Performance Object Storage
&lt;/h3&gt;

&lt;p&gt;MinIO is an open-source, distributed object storage system optimized for cloud-native workloads. Core strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Performance&lt;/strong&gt;: Fast data access to meet real-time analytics demands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Horizontal scaling for growing data volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Effectiveness&lt;/strong&gt;: Open-source, on-premises deployable (avoids cloud storage premiums).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;S3 Compatibility&lt;/strong&gt;: Fully compatible with Amazon S3 API for easy tool integration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;: Uses erasure coding for data redundancy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Supports bare-metal, Kubernetes, or cloud environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features make MinIO an ideal storage backend for Doris in a storage-compute separation architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Planning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Software Versions
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MinIO&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;High-performance object storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Doris&lt;/td&gt;
&lt;td&gt;3.0.6&lt;/td&gt;
&lt;td&gt;Real-time analytics database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doris Manager&lt;/td&gt;
&lt;td&gt;25.0.0&lt;/td&gt;
&lt;td&gt;Visual tool for Doris installation/deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Server Layout
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node IP&lt;/th&gt;
&lt;th&gt;Doris Manager&lt;/th&gt;
&lt;th&gt;MinIO&lt;/th&gt;
&lt;th&gt;MetaService&lt;/th&gt;
&lt;th&gt;FE&lt;/th&gt;
&lt;th&gt;BE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.2"&gt;172.20.1.2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.3"&gt;172.20.1.3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.4"&gt;172.20.1.4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.5"&gt;172.20.1.5&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production environments: Use higher-spec machines and isolate components for optimal performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Modify OS Parameters
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swapoff &lt;span class="nt"&gt;-a&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.conf &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
vm.max_map_count = 2000000
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Take effect immediately&lt;/span&gt;
sysctl &lt;span class="nt"&gt;-p&lt;/span&gt;

&lt;span class="c"&gt;# Append the following two lines to /etc/security/limits.conf&lt;/span&gt;
vi /etc/security/limits.conf
&lt;span class="k"&gt;*&lt;/span&gt; soft nofile 1000000
&lt;span class="k"&gt;*&lt;/span&gt; hard nofile 1000000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Install Required Tools
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; net-tools
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; cron
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; iputils-ping

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying MinIO
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download MinIO
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://dl.min.io/server/minio/release/linux-amd64/minio
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x minio

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Start MinIO on Each Node
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_REGION_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_ROOT_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minio
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minioadmin
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disk&lt;span class="o"&gt;{&lt;/span&gt;1..4&lt;span class="o"&gt;}&lt;/span&gt;/minio
&lt;span class="nb"&gt;nohup &lt;/span&gt;minio server &lt;span class="nt"&gt;--address&lt;/span&gt; :9000 &lt;span class="nt"&gt;--console-address&lt;/span&gt; :9001 http://172.20.1.&lt;span class="o"&gt;{&lt;/span&gt;2...5&lt;span class="o"&gt;}&lt;/span&gt;:9000/mnt/disk&lt;span class="o"&gt;{&lt;/span&gt;1...4&lt;span class="o"&gt;}&lt;/span&gt;/minio 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Configure MinIO Client
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://dl.min.io/client/mc/release/linux-amd64/mc
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x mc
./mc &lt;span class="nb"&gt;alias set &lt;/span&gt;myminio http://127.0.0.1:9000 minio minioadmin
./mc mb myminio/doris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: If MinIO is deployed on a local network without TLS, explicitly include &lt;code&gt;http://&lt;/code&gt; in the endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying Doris Manager
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download Doris Manager
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://enterprise-doris-releases.oss-accelerate.aliyuncs.com/doris-manager/velodb-manager-25.0.0-x64-bin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Extract and Start Service
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxf&lt;/span&gt; velodb-manager-25.0.0-x64-bin.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;velodb-manager-25.0.0-x64-bin/webserver/bin
bash start.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Access Web Interface
&lt;/h4&gt;

&lt;p&gt;Open your browser and navigate to &lt;code&gt;http://&amp;lt;Doris Manager IP&amp;gt;:8004&lt;/code&gt;. Follow the prompts to create an admin account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2it6o3bg713en16mnay4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2it6o3bg713en16mnay4.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying Apache Doris
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download Doris
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-3.0.6.2-bin-x64.tar.gz
&lt;span class="nb"&gt;mv &lt;/span&gt;apache-doris-3.0.6.2-bin-x64.tar.gz /opt/downloads/doris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Create Cluster via Doris Manager
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxx9ukyczzb8fd676b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxx9ukyczzb8fd676b0.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select Doris version (3.0.6) and set root password&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjar0px3nprs7vzj050gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjar0px3nprs7vzj050gm.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Enter MinIO details:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsk02nxqs12ki5pv6jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsk02nxqs12ki5pv6jy.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Configure Nodes
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run this script on all nodes to deploy the agent:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget http://172.20.1.2:8004/api/download/deploy.sh &lt;span class="nt"&gt;-O&lt;/span&gt; deploy_agent.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;chmod&lt;/span&gt; +x deploy_agent.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./deploy_agent.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Input node IPs in the Doris Manager interface&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
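&lt;p&gt;If you have several nodes, the agent-deployment step can be fanned out from a single machine. A minimal sketch that only prints the commands to run; the node IPs are assumptions matching this example network (172.20.1.2:8004 is the Doris Manager address from the script above):&lt;/p&gt;

```shell
# Print (rather than execute) the agent-deployment command for each node.
# The node IPs below are placeholders; replace them with your own hosts.
MANAGER=172.20.1.2:8004
for ip in 172.20.1.3 172.20.1.4 172.20.1.5; do
  echo "ssh root@$ip \"wget http://$MANAGER/api/download/deploy.sh -O deploy_agent.sh && chmod +x deploy_agent.sh && ./deploy_agent.sh\""
done
```

&lt;p&gt;Pipe the output to &lt;code&gt;sh&lt;/code&gt; (or run each line by hand) once the IPs are correct.&lt;/p&gt;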

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqcutu815np6gwfwn1fj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqcutu815np6gwfwn1fj.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Configure FE nodes (specify roles and resources)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33pazvwisixkxmk9w3w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33pazvwisixkxmk9w3w8.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Configure BE nodes (specify storage paths and resources)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjnjbcoorw48kleortu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjnjbcoorw48kleortu2.png" alt=" " width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Deploy Cluster
&lt;/h4&gt;

&lt;p&gt;Click "Deploy" and wait for the process to complete (10-15 minutes). Verify cluster status in Doris Manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe82gurm7le6qvf4t6cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe82gurm7le6qvf4t6cm.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48z3vr9lapiqb71zj6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48z3vr9lapiqb71zj6u.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Access Query Interface
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhy36a1kr65e854291fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhy36a1kr65e854291fc.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya5koim1qv63g9v0gf6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya5koim1qv63g9v0gf6r.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Create Doris Table
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`test`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="nv"&gt;`test`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`amazon_reviews`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="nv"&gt;`review_date`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`marketplace`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`customer_id`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`review_id`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_id`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_parent`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_title`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_category`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`star_rating`&lt;/span&gt; &lt;span class="nb"&gt;smallint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`helpful_votes`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`total_votes`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`vine`&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`verified_purchase`&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`review_headline`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`review_body`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OLAP&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`review_date`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'OLAP'&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`review_date`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;"compression"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ZSTD"&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Download Sample Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Load Data into Doris
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location-trusted&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; root:&amp;lt;your password&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-T&lt;/span&gt; amazon_reviews_2010.snappy.parquet &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"format:parquet"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
http://127.0.0.1:8030/api/test/amazon_reviews/_stream_load

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
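&lt;p&gt;Stream Load reports success or failure in a JSON result body rather than in the HTTP status alone, so it is worth checking the &lt;code&gt;Status&lt;/code&gt; field of the response. A minimal sketch, using an illustrative response string (not real output from this load):&lt;/p&gt;

```shell
# Illustrative Stream Load response; a real one is returned by the curl
# command above. "Status" and "NumberLoadedRows" are genuine fields of
# the Stream Load result JSON.
response='{"Status": "Success", "NumberLoadedRows": 100, "NumberFilteredRows": 0}'
case "$response" in
  *'"Status": "Success"'*) echo "load ok" ;;
  *)                       echo "load failed" ;;
esac
```

&lt;p&gt;If the status is anything other than &lt;code&gt;Success&lt;/code&gt;, the response also carries an &lt;code&gt;ErrorURL&lt;/code&gt; pointing at the rejected rows.&lt;/p&gt;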



&lt;h4&gt;
  
  
  5. Verify Data in MinIO
&lt;/h4&gt;

&lt;p&gt;Log into MinIO Console (&lt;code&gt;http://&amp;lt;MinIO IP&amp;gt;:9001&lt;/code&gt;) → Check &lt;code&gt;doris&lt;/code&gt; bucket for data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19lrwr25op6nygodpotp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19lrwr25op6nygodpotp.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;star_rating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;amazon_reviews&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;review_body&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%is super awesome%'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This setup is ideal for enterprises looking to balance performance and cost in real-time analytics scenarios. Try it out with the guide above and share your experience!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from MySQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 02 Dec 2025 20:40:25 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-mysql-to-velodb-5888</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-mysql-to-velodb-5888</guid>
<description>&lt;p&gt;When migrating data from MySQL (including MySQL-compatible databases such as Amazon Aurora) to VeloDB, Flink can serve as the real-time synchronization engine, ensuring both data consistency and timeliness. Its high-throughput, low-latency stream processing makes it well suited to full data synchronization and incremental change handling alike.&lt;/p&gt;

&lt;p&gt;For real-time synchronization, enable the MySQL binlog so that CDC (Change Data Capture) events can be captured. Whether you run a traditional self-hosted MySQL or Amazon Aurora MySQL in the cloud, you can enable the binlog and subscribe to it with Flink CDC to achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full data initial load: Import existing data from MySQL/Aurora to VeloDB first&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time synchronization of incremental changes: Capture Insert/Update/Delete operations based on Binlog and continuously write them to VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall link is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0citx5o0wctlf4y2yl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0citx5o0wctlf4y2yl.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we take Amazon Aurora-MySQL as an example to demonstrate how to use Flink CDC to capture data changes in Aurora and synchronize them to VeloDB in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Create an AWS RDS Aurora MySQL instance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft5a0cmo5vwedwoxj8d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft5a0cmo5vwedwoxj8d9.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a MySQL database and corresponding tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InnoDB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;CHARSET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;utf8mb4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Create a VeloDB warehouse
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5k6ev0gnp82nxcrzmco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5k6ev0gnp82nxcrzmco.png" alt=" " width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Modify MySQL configuration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a parameter group and add the binlog configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88ql3x4jrm5ihd6g8llp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88ql3x4jrm5ihd6g8llp.png" alt=" " width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Modify &lt;code&gt;binlog_format&lt;/code&gt; to &lt;code&gt;ROW&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfjl0em88v0aqkldubue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfjl0em88v0aqkldubue.png" alt=" " width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Attach the newly created DB cluster parameter group to the instance, apply the changes, and then restart the instance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwhunnsn5bs15mlze1p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwhunnsn5bs15mlze1p1.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;
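&lt;p&gt;After the parameter group is attached and the instance restarted, you can confirm the CDC prerequisites from any MySQL client connected to the Aurora endpoint. The statements are printed here because the endpoint is environment-specific:&lt;/p&gt;

```shell
# Both SHOW VARIABLES statements are standard MySQL; run them against the
# Aurora writer endpoint with your usual client.
printf '%s\n' \
  "SHOW VARIABLES LIKE 'binlog_format';  -- expect ROW" \
  "SHOW VARIABLES LIKE 'log_bin';        -- expect ON"
```

&lt;p&gt;If &lt;code&gt;binlog_format&lt;/code&gt; is not &lt;code&gt;ROW&lt;/code&gt;, Flink CDC cannot reconstruct row-level changes, so re-check the parameter group before continuing.&lt;/p&gt;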

&lt;h3&gt;
  
  
  5. Install Flink With Doris Connector
&lt;/h3&gt;

&lt;h4&gt;
  
  
  5.1 Download the pre-built installation package
&lt;/h4&gt;

&lt;p&gt;For Flink 1.17, we provide a pre-built installation package that you can download and decompress directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  5.2 Manual installation
&lt;/h4&gt;

&lt;p&gt;If you already have a Flink environment or need a different Flink version, install manually. Taking Flink 1.17 as an example, download the Flink distribution and its dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink MySQL CDC Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Doris Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MySQL Driver&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the download is complete, extract the Flink installation package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then place the Flink MySQL CDC Connector, the Doris Connector, and the MySQL driver jar into the &lt;code&gt;flink-1.17.2/lib&lt;/code&gt; directory, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftpj6f6o7lo4mrfdvou7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftpj6f6o7lo4mrfdvou7.png" alt=" " width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;
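&lt;p&gt;The jars above can be fetched from Maven Central. The coordinates below are assumptions matching the versions used in this walkthrough, and the commands are printed rather than executed, so verify the versions against your Flink installation first:&lt;/p&gt;

```shell
# Print wget commands for the three dependency jars. The paths follow the
# standard Maven Central layout; the exact versions are assumptions.
BASE=https://repo1.maven.org/maven2
for artifact in \
  org/apache/doris/flink-doris-connector-1.17/25.1.0/flink-doris-connector-1.17-25.1.0.jar \
  com/ververica/flink-sql-connector-mysql-cdc/2.4.2/flink-sql-connector-mysql-cdc-2.4.2.jar \
  com/mysql/mysql-connector-j/8.0.33/mysql-connector-j-8.0.33.jar
do
  echo "wget -P flink-1.17.2/lib $BASE/$artifact"
done
```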

&lt;h3&gt;
  
  
  6. Submit the Flink synchronization job
&lt;/h3&gt;

&lt;p&gt;When the job is submitted, the Doris Connector automatically creates the corresponding tables in VeloDB based on the schema of the upstream MySQL tables.&lt;/p&gt;

&lt;p&gt;Flink supports submitting and running jobs in Local, Standalone, YARN, and Kubernetes modes. If you already have a Flink environment, you can submit the job to it directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  6.1 Local Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.2 Standalone Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.3 Yarn Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.4 K8S Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: For more parameters of the Connector, refer to this link.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Verify Historical Data Synchronization
&lt;/h3&gt;

&lt;p&gt;On its first run, the Flink job synchronizes the full historical data. Check the synchronization status in VeloDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrm8uewycxttv3tgi3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrm8uewycxttv3tgi3t.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Verify Real-Time Data Synchronization
&lt;/h3&gt;

&lt;p&gt;Perform data modifications in MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify data changes in VeloDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq06ozn2y47frku7mgpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq06ozn2y47frku7mgpw.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>mysql</category>
    </item>
    <item>
      <title>Apache Doris AI Capabilities Unveiled (Part II): Deep Analysis of AI_AGG and EMBED Functions</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 26 Nov 2025 20:24:27 +0000</pubDate>
      <link>https://forem.com/apachedoris/apache-doris-ai-capabilities-unveiled-part-ii-deep-analysis-of-aiagg-and-embed-functions-2kek</link>
      <guid>https://forem.com/apachedoris/apache-doris-ai-capabilities-unveiled-part-ii-deep-analysis-of-aiagg-and-embed-functions-2kek</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After a preliminary exploration of the possibilities of AI functions, we now turn our attention to two more core functions: &lt;strong&gt;AI_AGG&lt;/strong&gt; and &lt;strong&gt;EMBED&lt;/strong&gt;. We will delve into the design philosophy, implementation principles, and business applications of these two functions, demonstrating how Doris seamlessly integrates text aggregation and semantic vector analysis into SQL through native function design, providing users with a more powerful and user-friendly intelligent data analysis experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI_AGG: AI-Based Text Aggregation
&lt;/h2&gt;

&lt;p&gt;Aggregation is one of the most common operations in data analysis. However, when dealing with massive volumes of user comments, support tickets, or log texts, traditional aggregate functions cannot process such unstructured text directly. To address this, Doris provides &lt;strong&gt;AI_AGG&lt;/strong&gt;, an aggregate function that calls an AI model to aggregate text. It lets analysts run custom, instruction-driven tasks over large volumes of text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;p&gt;For detailed usage of AI_AGG, please refer to: &lt;a href="https://doris.apache.org/docs/dev/sql-manual/sql-functions/aggregate-functions/ai-agg" rel="noopener noreferrer"&gt;Apache Doris AI_AGG Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1: Summarize Customer Service Tickets
&lt;/h4&gt;

&lt;p&gt;The following table simulates a simple set of customer service tickets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;support_tickets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ticket_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Same problem as Alice. Also seeing 502 errors on the SSO page.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Payment Declined&lt;/td&gt;
&lt;td&gt;Credit card charged twice but order still shows pending.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Getting redirected back to login after entering 2FA code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Cannot log in after password reset. Tried clearing cache and different browsers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;Slow Dashboard&lt;/td&gt;
&lt;td&gt;Dashboard takes &amp;gt;30 seconds to load since the last release.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can use &lt;code&gt;AI_AGG&lt;/code&gt; to summarize customer issues for different problem types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AI_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Summarize every ticket detail into one short paragraph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ai_summary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;support_tickets&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;ai_summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Slow Dashboard&lt;/td&gt;
&lt;td&gt;The dashboard is experiencing slow loading times, taking over 30 seconds to load following the most recent release.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment Declined&lt;/td&gt;
&lt;td&gt;A customer reports being charged twice for their order, which remains in a pending status.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Users are experiencing login issues, including 2FA redirection, post-password reset failures, and SSO 502 errors, despite clearing cache and trying different browsers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  AI_AGG Technical Analysis: Dynamic Pre-aggregation
&lt;/h3&gt;

&lt;p&gt;Combining aggregate functions with AI raises a hard constraint: the total text volume within a group can far exceed the model's context window, so simply concatenating all rows and sending them to the model at once is not feasible. Doris solves this through &lt;strong&gt;dynamic pre-aggregation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0r3xl0720224lfhne4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0r3xl0720224lfhne4.png" alt=" " width="642" height="1012"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Monitoring&lt;/strong&gt;: During the text aggregation process, AI_AGG maintains an internal text buffer for each group (currently fixed at 128KB, compatible with most AI context windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Pre-aggregation&lt;/strong&gt;: When a new text row would cause the buffer to exceed the threshold, AI_AGG triggers pre-aggregation—pausing to send the current buffer to the AI for intermediate processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Replacement&lt;/strong&gt;: The AI's concise intermediate result replaces the original long text in the buffer, freeing space for more data. If the buffer still exceeds the threshold after replacement, AI_AGG errors out to prevent model service overload.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
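&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is an illustrative model only, not Doris internals: &lt;code&gt;ai_summarize&lt;/code&gt; stands in for the model call, and the threshold is scaled down from the real 128KB buffer.&lt;/p&gt;

```python
# Sketch of AI_AGG-style dynamic pre-aggregation (illustrative only,
# not Doris internals). ai_summarize stands in for the model call, and
# THRESHOLD is scaled down from the real 128 KB per-group buffer.
THRESHOLD = 64  # bytes of buffered text per group

def ai_summarize(text):
    # Placeholder for the real model call: returns a shorter intermediate result.
    return text[: THRESHOLD // 2]

def ai_agg(rows):
    buffer = ""
    for row in rows:
        # 1. Context monitoring: would this row overflow the buffer?
        if len(buffer) + len(row) + 1 > THRESHOLD:
            # 2. Dynamic pre-aggregation: send the buffer for intermediate
            #    processing, then 3. replace it with the concise result.
            buffer = ai_summarize(buffer)
            if len(buffer) + len(row) + 1 > THRESHOLD:
                # Still over the threshold after replacement: error out
                # rather than overload the model service.
                raise RuntimeError("row too large even after pre-aggregation")
        buffer += row + "\n"
    # Final aggregation call over whatever remains in the buffer.
    return ai_summarize(buffer)
```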

&lt;p&gt;This implementation integrates seamlessly with Doris's distributed query plan, leveraging multi-node parallel computing. Users can perform efficient intelligent analysis on massive text data using familiar SQL aggregation syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  EMBED: Text Vectorization Function
&lt;/h2&gt;

&lt;p&gt;For detailed usage of EMBED, please refer to: &lt;a href="https://doris.apache.org/docs/dev/sql-manual/sql-functions/ai-functions/distance-functions/embed" rel="noopener noreferrer"&gt;Apache Doris EMBED Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core function of &lt;strong&gt;EMBED&lt;/strong&gt; is to convert any text into a high-dimensional floating-point vector through AI. This vector is a mathematical representation of the text in a semantic space, capturing its semantic information. Texts with similar semantics will have vectors that are closer in this space.&lt;/p&gt;
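&lt;p&gt;"Closer in this space" is typically measured with cosine similarity. Here is a toy illustration with hand-made 3-dimensional vectors standing in for real EMBED output (actual embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

# Hand-made 3-d vectors standing in for real EMBED output.
vec_travel  = [0.9, 0.1, 0.0]  # "travel reimbursement policy"
vec_expense = [0.8, 0.2, 0.1]  # "expense report rules" (semantically close)
vec_vpn     = [0.0, 0.1, 0.9]  # "VPN user guide" (unrelated topic)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Semantically similar texts score close to 1.0; unrelated texts near 0.
print(cosine_similarity(vec_travel, vec_expense))  # high
print(cosine_similarity(vec_travel, vec_vpn))      # low
```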

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Example 1: Build a Knowledge Base with Vectorization
&lt;/h4&gt;

&lt;p&gt;The following statements build a table simulating a simple employee handbook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Embedding vector generated by EMBED function'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"replication_num"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"1"&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Travel Reimbursement Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"travel reimbursement policy"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Leave Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"leave request policy"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"VPN User Guide"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"VPN guide intranet access"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Meeting Room Reservation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"meeting room booking reservation"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Procurement Request Process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"procurement request process finance"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By vectorizing text with &lt;code&gt;EMBED&lt;/code&gt;, combined with Doris's vector functions, you can perform the following operations:&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Q&amp;amp;A Retrieval (with COSINE_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COSINE_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"How to apply for travel reimbursement?"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Travel Reimbursement Policy&lt;/td&gt;
&lt;td&gt;Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.&lt;/td&gt;
&lt;td&gt;0.4463210454563673&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;0.5726841578491431&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  2. Problem Analysis Matching (with L2_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;L2_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"How to access the company intranet"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;distance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;VPN User Guide&lt;/td&gt;
&lt;td&gt;To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate.&lt;/td&gt;
&lt;td&gt;0.5838271122253775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Travel Reimbursement Policy&lt;/td&gt;
&lt;td&gt;Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.&lt;/td&gt;
&lt;td&gt;1.272394695975331&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  3. Text Relevance Matching (with INNER_PRODUCT)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;INNER_PRODUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Leave system request leader approval"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="n"&gt;DESCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;0.33268885332504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Meeting Room Reservation&lt;/td&gt;
&lt;td&gt;Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.&lt;/td&gt;
&lt;td&gt;0.29224032230852487&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  4. Find Similar Content (with L1_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;L1_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Procurement application process"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;distance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;18.66882028897362&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Meeting Room Reservation&lt;/td&gt;
&lt;td&gt;Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.&lt;/td&gt;
&lt;td&gt;30.90449328294426&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Leave Policy&lt;/td&gt;
&lt;td&gt;Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required.&lt;/td&gt;
&lt;td&gt;31.060405636536416&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Flexible Vector Dimension Control
&lt;/h3&gt;

&lt;p&gt;Through Doris's built-in &lt;strong&gt;RESOURCE&lt;/strong&gt; mechanism, users can set the &lt;code&gt;ai.dimensions&lt;/code&gt; parameter when configuring an AI Resource to precisely specify the dimension of the generated vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-dimensional vectors&lt;/strong&gt;: Retain richer semantic information (suitable for high-precision retrieval).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-dimensional vectors&lt;/strong&gt;: Save storage space and accelerate computation (suitable for lightweight matching).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure the AI model configured in the RESOURCE supports the specified dimension (otherwise, requests may fail).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For models that do not support dimension customization (e.g., OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt;), the &lt;code&gt;ai.dimensions&lt;/code&gt; setting will be ignored, and the model's default dimension will be used.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
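
&lt;p&gt;As a hedged sketch of what such a configuration might look like, assuming an OpenAI-compatible provider (only &lt;code&gt;ai.dimensions&lt;/code&gt; is taken from the text above; the other property names follow common Doris AI Resource conventions and should be verified against your version's documentation, and the endpoint, model name, and key are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical AI Resource with a custom embedding dimension
CREATE RESOURCE "embedding_resource"
PROPERTIES (
    "type" = "ai",
    "ai.provider_type" = "openai",               -- assumed provider
    "ai.endpoint" = "&amp;lt;endpoint_url&amp;gt;",              -- placeholder
    "ai.model_name" = "text-embedding-3-small",  -- placeholder model
    "ai.api_key" = "&amp;lt;api_key&amp;gt;",                    -- placeholder
    "ai.dimensions" = "256"                      -- lower dimension for lightweight matching
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A smaller value such as 256 trades some retrieval precision for less storage and faster distance computation; a larger value keeps richer semantics for high-precision retrieval.&lt;/p&gt;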

&lt;h2&gt;
  
  
  Summary and Outlook
&lt;/h2&gt;

&lt;p&gt;With the &lt;strong&gt;AI_AGG&lt;/strong&gt; and &lt;strong&gt;EMBED&lt;/strong&gt; functions, Apache Doris embeds AI capabilities directly into its database kernel, bringing intelligent analysis to its native SQL engine and greatly expanding the boundaries of data analysis and intelligent applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI_AGG&lt;/strong&gt;: With dynamic pre-aggregation, it enables intelligent analysis of unstructured text (e.g., user comments, logs) directly in the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EMBED&lt;/strong&gt;: Seamlessly integrates with vector functions to provide end-to-end semantic retrieval solutions (e.g., Q&amp;amp;A systems, content recommendation), simplifying application development.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features empower SQL with the ability to command AI models, allowing data analysts to harness powerful AI at low cost and high efficiency to uncover deeper semantic value in data.&lt;/p&gt;

&lt;p&gt;Looking ahead, Doris will continue to deepen the integration of AI and databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimize model scheduling and computational performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore cutting-edge features like multi-modal data analysis and AI Agent interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuously lower the barrier to using AI technology, making data-driven intelligent decisions ubiquitous.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>apachedoris</category>
      <category>olap</category>
      <category>database</category>
    </item>
    <item>
      <title>Building Real-Time Lakehouse with S3 Tables, AWS Glue, and Apache Doris</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Fri, 21 Nov 2025 18:45:38 +0000</pubDate>
      <link>https://forem.com/apachedoris/building-real-time-lakehouse-with-s3-tables-aws-glue-and-apache-doris-5am0</link>
      <guid>https://forem.com/apachedoris/building-real-time-lakehouse-with-s3-tables-aws-glue-and-apache-doris-5am0</guid>
      <description>&lt;p&gt;We built a real-time lakehouse with S3 Tables, AWS Glue, and Apache Doris. In this solution, S3 Tables stores data in the Apache Iceberg format on Amazon S3. AWS Glue manages and organizes metadata and schema, providing a single catalog that connects all resources. And Apache Doris runs sub-second queries directly on those Iceberg tables: no ETL, no data copies, no complex architecture.&lt;/p&gt;

&lt;p&gt;Together, S3 Tables, AWS Glue, and Apache Doris form a real-time lakehouse that combines the openness of a data lake with the high performance of a data warehouse, providing a key data foundation for AI and agentic workloads.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Unified metadata for easy table discovery and governance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open Apache Iceberg tables on S3 with ACID, time-travel, and schema evolution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A high-performance query engine, Apache Doris, offering low latency and high concurrency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interoperability across engines with Spark, Flink, Trino, Doris, and more&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a practical, production-ready real-time lakehouse you can use to power dashboards, streaming analytics, or AI features directly from the data lake. The solution also applies to many other open-source combinations: table formats like Iceberg and Paimon, catalogs like Unity, Polaris, and Gravitino, and query engines like Spark, Flink, and Trino.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple steps to replicate
&lt;/h2&gt;

&lt;p&gt;Let's see how to set up this solution in a demo. We will explore how to harness the power of Apache Doris and configure it, as a third-party engine, to work with the AWS Glue Iceberg REST Catalog. The demo includes details on how to perform read/write operations against S3 Tables through AWS Glue.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create S3 Table Buckets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr5q6hfqo43euolh41s8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr5q6hfqo43euolh41s8.png" alt=" " width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create policy for Glue and S3 Tables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstnl3topote0wqtbgtyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstnl3topote0wqtbgtyv.png" alt=" " width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the following JSON policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetCatalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetTables"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:CreateTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:UpdateTable"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog/s3tablescatalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog/s3tablescatalog/&amp;lt;bucket_name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:table/s3tablescatalog/&amp;lt;bucket_name&amp;gt;/&amp;lt;db_name&amp;gt;/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:database/s3tablescatalog/&amp;lt;bucket_name&amp;gt;/&amp;lt;db_name&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"lakeformation:GetDataAccess"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Attach the policy to your user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzizk79jrkb3xn929jre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzizk79jrkb3xn929jre.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for the policy you just created and attach it to your user.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to the Iceberg catalog using SQL
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Catalog&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;my_glue_catalog&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.catalog.type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'warehouse'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;acount_id&amp;gt;:s3tablescatalog/&amp;lt;bucket_name&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.uri'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://glue.&amp;lt;region&amp;gt;.amazonaws.com/iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.sigv4-enabled'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.signing-name'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'glue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.signing-region'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;region&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.access-key-id'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;ak&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.secret-access-key'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;sk&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'test_connection'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Switch to the catalog&lt;/span&gt;
&lt;span class="n"&gt;SWITCH&lt;/span&gt; &lt;span class="n"&gt;my_glue_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- View current existing databases&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a new database&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;DATABSE&lt;/span&gt; &lt;span class="n"&gt;gluedb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Change to the newly created database&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;gluedb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a new Iceberg table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Insert values into table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Jacky"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Query the Iceberg table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="err"&gt;；&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the placeholders with the real information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Next Steps
&lt;/h2&gt;

&lt;p&gt;A unified data foundation is what makes real-time analytics possible, and it is key for companies adopting large-scale AI and agentic workloads.&lt;/p&gt;

&lt;p&gt;S3 Tables and AWS Glue provide an open, governed data layer, and Apache Doris delivers sub-second analytics directly on that data. This real-time lakehouse offers a simpler architecture, smarter governance, and AI readiness, allowing teams to query fresh information without complex ETL or data silos.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>lakehouse</category>
      <category>database</category>
      <category>apachedoris</category>
    </item>
    <item>
      <title>10x Query Performance Improvement: The Design and Implementation of the New Unique Key</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 20 Nov 2025 19:44:20 +0000</pubDate>
      <link>https://forem.com/apachedoris/10x-query-performance-improvement-the-design-and-implementation-of-the-new-unique-key-157l</link>
      <guid>https://forem.com/apachedoris/10x-query-performance-improvement-the-design-and-implementation-of-the-new-unique-key-157l</guid>
      <description>&lt;p&gt;In business scenarios of real-time data warehouses, providing good support for real-time data updates is an extremely important capability. For example, in scenarios such as database synchronization (CDC), e-commerce transaction orders, advertising effect delivery, and marketing business reports, when facing changes in upstream data, it is usually necessary to quickly capture change records and promptly modify single or multiple rows of data. This ensures that business analysts and related analysis platforms can quickly grasp the latest progress and improve the timeliness of business decisions.&lt;/p&gt;

&lt;p&gt;OLAP databases have traditionally been weak at data updates. As data timeliness requirements grow stronger and real-time data warehouse workloads expand, implementing efficient real-time update capabilities has become key to staying competitive.&lt;/p&gt;

&lt;p&gt;In the past, Apache Doris mainly implemented real-time data Upserts through the Unique Key data model. Thanks to its underlying LSM Tree-like structure, it provides strong support for high-frequency writes on large datasets. However, its Merge-on-Read update mode has become a bottleneck restricting Apache Doris' real-time update capabilities, as it may cause query jitter under concurrent reading and writing of real-time data.&lt;/p&gt;

&lt;p&gt;Based on this, in Apache Doris 1.2.0 we introduced a new data update method for the Unique Key model, Merge-on-Write, striving to balance real-time updates and efficient queries. This article details the design, implementation, and effects of the new primary key model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation of the Original Unique Key Model
&lt;/h1&gt;

&lt;p&gt;Users familiar with Apache Doris' history may know that Doris' initial design was inspired by Google Mesa, and it only had Duplicate Key and Aggregate Key models at first. The Unique Key model was added later based on user needs during Doris' development. However, the demand for real-time updates was not so strong at that time, so the implementation of Unique Key was relatively simple - it was just a wrapper around the Aggregate Key model, without in-depth optimization for real-time update requirements.&lt;/p&gt;

&lt;p&gt;Specifically, the implementation of the Unique Key model is just a special case of the Aggregate Key model. If you use the Aggregate Key model and set the aggregation type of all non-key columns to REPLACE, you can achieve exactly the same effect. As shown in the following figure, when describing example_tbl, a table of the Unique Key model, the aggregation type in the last column shows that it is equivalent to an Aggregate Key table where all columns have the REPLACE aggregation type.&lt;/p&gt;

&lt;p&gt;Image: Original Unique-Key-Aggregate-Key&lt;/p&gt;

&lt;p&gt;Both the Unique Key and Aggregate Key data models adopt the Merge-On-Read approach: when data is imported, it is first written to a new Rowset, with no deduplication performed at write time. Only when a query is initiated is a multi-way merge sort performed, during which rows with duplicate keys are grouped together and aggregated. Keys with higher versions overwrite those with lower versions, and only the record with the highest version is returned to the user.&lt;/p&gt;

&lt;p&gt;The following figure is a simplified representation of the execution process of the Unique Key model:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Simplified Unique-Key&lt;/p&gt;
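&lt;p&gt;The merge described above can be condensed into a minimal sketch (a simplified model: &lt;code&gt;std::map&lt;/code&gt; stands in for Doris' multi-way merge sort over sorted Segments, and &lt;code&gt;Row&lt;/code&gt; is an illustrative type, not a Doris structure):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

// One row as the reader sees it: primary key, import version, payload.
struct Row { std::string key; uint64_t version; std::string value; };

// Merge-on-Read: nothing is deduplicated at write time; for each key the
// query keeps only the row with the highest version.
std::vector&amp;lt;Row&amp;gt; merge_on_read(const std::vector&amp;lt;std::vector&amp;lt;Row&amp;gt;&amp;gt;&amp;amp; rowsets) {
    std::map&amp;lt;std::string, Row&amp;gt; latest;  // ordered by key, like the merged output
    for (const auto&amp;amp; rs : rowsets)
        for (const auto&amp;amp; r : rs)
            if (!latest.count(r.key) || latest[r.key].version &amp;lt; r.version)
                latest[r.key] = r;  // a higher version overwrites a lower one
    std::vector&amp;lt;Row&amp;gt; out;
    for (const auto&amp;amp; [k, r] : latest) out.push_back(r);
    return out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;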

&lt;p&gt;Although their implementation methods are relatively consistent, the usage scenarios of the Unique Key and Aggregate Key data models are significantly different:&lt;/p&gt;

&lt;p&gt;When users create a table with the Aggregate Key model, they have a very clear understanding of the aggregation query conditions - aggregating according to the columns specified by the Aggregate Key, and the aggregate functions on the Value columns are the main aggregation methods (COUNT/SUM/MAX/MIN, etc.) used by users. For example, using user_id as the Aggregate Key and summing the number of visits and duration to calculate UV and user usage duration.&lt;/p&gt;

&lt;p&gt;However, the main function of the Key in the Unique Key data model is to ensure uniqueness, not to serve as an aggregation Key. For example, in the order scenario, data synchronized from TP databases through Flink CDC uses the order ID as the Unique Key for deduplication. However, during queries, filtering, aggregation and analysis are usually performed on certain Value columns (such as order status, order amount, order time consumption, order placement time, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Shortcomings
&lt;/h1&gt;

&lt;p&gt;As can be seen from the above, when users query using the Unique Key model, they actually perform two aggregation operations. The first is to aggregate all data by Key according to the Unique Key to remove duplicate Keys; the second is to aggregate according to the actual aggregation conditions required by the query. These two aggregation operations lead to serious efficiency issues and low query performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data deduplication requires expensive multi-way merge sorting, and full Key comparison consumes a lot of CPU computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective data pruning cannot be performed, which introduces a large amount of extra data IO. For example, suppose a partition holds 10 million rows but only 1,000 of them match the filter conditions. The rich indexes of an OLAP system are designed to locate those 1,000 rows efficiently, but since it is impossible to tell whether a given row in a given file is still valid, the indexes cannot be used: the data must first go through a full merge sort and deduplication before the surviving rows can be filtered. This amounts to roughly a 10,000-fold IO amplification (a rough estimate; the actual amplification is more complicated to calculate).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Scheme Research and Selection
&lt;/h1&gt;

&lt;p&gt;To solve the problems of the original Unique Key model and better meet the needs of business scenarios, we decided to optimize it and conducted detailed research into optimization schemes for the read and write efficiency issues.&lt;/p&gt;

&lt;p&gt;There have been many industry explorations on solutions to the above problems. There are three representative types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Delete + Insert: That is, when writing data, find the overwritten key through a primary key index and mark it as deleted. A representative system is Microsoft's SQL Server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delta Store: Divide data into base data and delta data. Each primary key in the base data is guaranteed to be unique. All updates are recorded in the Delta Store. During queries, the base data and delta data are merged. At the same time, background merge threads regularly merge the delta data and base data. A representative system is Apache Kudu.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy-on-Write: When updating data, directly copy the original data row, update it, and write it to a new file. This method is widely used in data lakes, with representative systems such as Apache Hudi and Delta Lake.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation mechanisms and comparisons of these three schemes are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete + Insert (i.e., Merge-on-Write)
&lt;/h2&gt;

&lt;p&gt;A representative example is the scheme proposed in the paper "Real-Time Analytical Processing with SQL Server" published by SQL Server in VLDB in 2015. Simply put, this paper proposes that when writing data, old data is marked for deletion (using a data structure called Delete Bitmap), and new data is recorded in the Delta Store. During queries, the Base data, Delete Bitmap, and data in the Delta Store are merged to obtain the latest data. The overall scheme is shown in the following figure, and will not be elaborated due to space limitations.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Merge-on-Write&lt;/p&gt;

&lt;p&gt;The advantage of this scheme is that any valid primary key exists only in one place (either in Base Data or Delta Store), which avoids a large amount of merge sorting consumption during queries. At the same time, various rich columnar indexes in the Base data remain valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Store
&lt;/h2&gt;

&lt;p&gt;A representative system using the Delta Store method is Apache Kudu. In Kudu, data is divided into Base Data and Delta Data. The primary keys in the Base Data are all unique. Any modification to the Base data will be first written to the Delta Store (marking the corresponding relationship with the Base Data through row numbers, which can avoid sorting during merging). Different from the Base + Delta of SQL Server mentioned earlier, Kudu does not mark deletions, so data with the same primary key will exist in two places. Therefore, during queries, the data from Base and Delta must be merged to obtain the latest result. Kudu's scheme is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delta-Store&lt;/p&gt;

&lt;p&gt;Kudu's scheme can also avoid the high cost caused by merge sorting when reading data. However, since data with the same primary key can exist in multiple places, it is difficult to ensure the accuracy of indexes and cannot perform efficient predicate pushdown. Indexes and predicate pushdown are important means for analytical databases to optimize performance, so this shortcoming has a significant impact on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-On-Write
&lt;/h2&gt;

&lt;p&gt;Since Apache Doris is positioned as a real-time analytical database, the Copy-On-Write scheme has too high a cost for real-time updates and is not suitable for Doris.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheme Comparison
&lt;/h2&gt;

&lt;p&gt;The following table compares the schemes. Merge-On-Read is the default implementation of the Unique Key model, i.e., the implementation before version 1.2. Merge-On-Write is the Delete + Insert scheme described earlier.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Scheme Comparison&lt;/p&gt;

&lt;p&gt;As can be seen from the above, Merge-On-Write trades moderate write costs for lower read costs, well supports predicate pushdown and non-key column index filtering, and has good effects on query performance optimization. After comprehensive comparison, we chose Merge-On-Write as the final optimization scheme.&lt;/p&gt;

&lt;h1&gt;
  
  
  Design and Implementation of the New Scheme
&lt;/h1&gt;

&lt;p&gt;In short, the processing flow of Merge-On-Write is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For each Key, find its position in the Base data (rowsetid + segmentid + row number).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the Key exists, mark the corresponding row of data as deleted. The information of marked deletion is recorded in the Delete Bitmap, and each Segment has a corresponding Delete Bitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write the updated data to a new Rowset, complete the transaction, and make the new data visible (able to be queried).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;During queries, read the Delete Bitmap, filter out the rows marked as deleted, and only return valid data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
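
&lt;p&gt;The four steps above can be condensed into a minimal sketch (a single-tablet toy model; &lt;code&gt;RowLocation&lt;/code&gt;, the global maps, and the function names are illustrative rather than Doris internals, and transactions and multi-versioning are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;utility&amp;gt;

// Illustrative stand-in for "rowsetid + segmentid + row number".
struct RowLocation { uint32_t rowset_id; uint32_t segment_id; uint32_t row_id; };

std::map&amp;lt;std::string, RowLocation&amp;gt; primary_key_index;  // key -&amp;gt; current location
std::map&amp;lt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, std::set&amp;lt;uint32_t&amp;gt;&amp;gt;
    delete_bitmap;  // (rowset, segment) -&amp;gt; rows marked as deleted

// Steps 1-3: locate the old row, mark it deleted, record the new location.
void upsert(const std::string&amp;amp; key, const RowLocation&amp;amp; new_loc) {
    auto it = primary_key_index.find(key);
    if (it != primary_key_index.end()) {
        const RowLocation&amp;amp; old = it-&amp;gt;second;
        delete_bitmap[{old.rowset_id, old.segment_id}].insert(old.row_id);
    }
    primary_key_index[key] = new_loc;
}

// Step 4: at query time, rows present in the Delete Bitmap are filtered out.
bool is_deleted(const RowLocation&amp;amp; loc) {
    auto it = delete_bitmap.find({loc.rowset_id, loc.segment_id});
    return it != delete_bitmap.end() &amp;amp;&amp;amp; it-&amp;gt;second.count(loc.row_id) &amp;gt; 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;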

&lt;h2&gt;
  
  
  Key Issues
&lt;/h2&gt;

&lt;p&gt;To design a Merge-On-Write scheme suitable for Doris, the following key issues need to be focused on solving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How to efficiently locate whether there is old data that needs to be marked for deletion during import?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently store the information of marked deletion?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently use the marked deletion information to filter data during the query phase?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can multi-version support be realized?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to avoid transaction conflicts in concurrent imports and write conflicts between imports and Compaction?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the additional memory consumption introduced by the scheme reasonable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the write performance degradation caused by write costs within an acceptable range?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on the above key issues, we have implemented a series of optimization measures to solve these problems well. They will be introduced in detail in the following text:&lt;/p&gt;

&lt;h3&gt;
  
  
  Primary Key Index
&lt;/h3&gt;

&lt;p&gt;Since Doris is a columnar storage system designed for large-scale analysis, it did not originally have a primary key index. Therefore, to quickly determine whether a primary key is being overwritten and locate the row number of the overwritten key, a primary key index had to be added to Doris.&lt;/p&gt;

&lt;p&gt;We have taken the following optimization measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintain a primary key index for each Segment. The primary key index is implemented using a scheme similar to RocksDB Partitioned Index. This scheme can achieve very high query QPS, and the file-based index scheme can also save memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a Bloom Filter corresponding to the primary key index for each Segment. The primary key index will only be queried when the Bloom Filter hits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Record a primary key range [min-key, max-key] for each Segment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a pure in-memory interval tree, constructed using the primary key ranges of all Segments. When querying a primary key, there is no need to traverse all Segments. The interval tree can be used to locate the Segments that may contain the primary key, greatly reducing the amount of indexes that need to be queried.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For all hit Segments, query them in descending order of version. In Doris, a higher version means more updated data. Therefore, if a primary key hits in the index of a higher-version Segment, there is no need to continue querying lower-version Segments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flow of querying a single primary key is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Primary Key Index&lt;/p&gt;
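
&lt;p&gt;Putting these measures together, a single-key lookup can be sketched as follows (a simplified model: one &lt;code&gt;std::map&lt;/code&gt; stands in for both the Bloom filter and the file-based primary key index, and a linear range check replaces the interval tree; all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;optional&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

// Per-Segment lookup metadata from the list above.
struct SegmentMeta {
    std::string min_key, max_key;
    uint64_t version;
    std::map&amp;lt;std::string, uint32_t&amp;gt; keys;  // primary key -&amp;gt; row number
};

// Returns (version, row number) from the newest Segment containing the key.
std::optional&amp;lt;std::pair&amp;lt;uint64_t, uint32_t&amp;gt;&amp;gt;
find_key(std::vector&amp;lt;SegmentMeta&amp;gt; segs, const std::string&amp;amp; key) {
    // Range pruning (the real implementation consults the in-memory interval tree).
    segs.erase(std::remove_if(segs.begin(), segs.end(),
                              [&amp;amp;](const SegmentMeta&amp;amp; s) {
                                  return key &amp;lt; s.min_key || key &amp;gt; s.max_key;
                              }),
               segs.end());
    // Probe candidates in descending version order: the first hit wins, so
    // lower-version Segments never need to be consulted.
    std::sort(segs.begin(), segs.end(),
              [](const SegmentMeta&amp;amp; a, const SegmentMeta&amp;amp; b) {
                  return a.version &amp;gt; b.version;
              });
    for (const auto&amp;amp; s : segs) {
        auto it = s.keys.find(key);  // Bloom filter check + index lookup, collapsed
        if (it != s.keys.end()) return std::make_pair(s.version, it-&amp;gt;second);
    }
    return std::nullopt;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;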

&lt;h3&gt;
  
  
  Delete Bitmap
&lt;/h3&gt;

&lt;p&gt;Delete Bitmap adopts a multi-version recording method, as shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delete-Bitmap&lt;/p&gt;

&lt;p&gt;The Segment file in the figure is generated by the import of version 5, including the imported data of version 5 in this Tablet.&lt;/p&gt;

&lt;p&gt;The import of version 6 includes the update of primary key B, so the second row will be marked as deleted in the Bitmap, and the modification of this Segment by the import of version 6 will be recorded in the DeleteBitmap.&lt;/p&gt;

&lt;p&gt;The import of version 7 includes the update of primary key A, which will also generate a Bitmap corresponding to the version; similarly, the import of version 8 will also generate a corresponding Bitmap.&lt;/p&gt;

&lt;p&gt;All Delete Bitmaps are stored in a large Map. Each import will serialize the latest Delete Bitmap into RocksDB. The key definitions are as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BitmapKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RowsetId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BitmapKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roaring&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Roaring&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delete_bitmap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Segment in each Rowset will record multiple versions of Bitmaps. A Bitmap with Version x means the modification of the current Segment by the import of version x.&lt;/p&gt;

&lt;p&gt;Advantages of multi-version Delete Bitmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It can well support multi-version queries. For example, after the import of version 7 is completed, a query on this table starts to execute and will use Version 7. Even if the query takes a long time and the import of version 8 is completed during the query execution, there is no need to worry about reading the data of version 8 (or missing the data deleted by version 8).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can well support complex Schema Changes. In Doris, complex Schema Changes (such as type conversion) require double writing first, and at the same time convert historical data before a certain version and then delete the old version of data. Multi-version Delete Bitmap can well support the current Schema Change implementation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can support multi-version requirements during data copying and replica repair.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, multi-version Delete Bitmaps also come at a cost. In the previous example, to access the data of version 8, the three Bitmaps of v6, v7, and v8 must be merged into one complete Bitmap, which is then used to filter the Segment data. In real-time high-frequency import scenarios, a large number of Bitmaps can accumulate quickly, and the union operation on RoaringBitmaps is CPU-intensive. To minimize the impact of these union operations, we added an LRUCache to the DeleteBitmap that records the most recently merged Bitmaps.&lt;/p&gt;
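
&lt;p&gt;Following the &lt;code&gt;BitmapKey&lt;/code&gt; definition above, the version-pinned merge plus cache can be sketched like this (&lt;code&gt;std::set&lt;/code&gt; stands in for &lt;code&gt;roaring::Roaring&lt;/code&gt;, a plain map stands in for the LRUCache with eviction omitted, and &lt;code&gt;RowsetId&lt;/code&gt; is reduced to an integer for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;tuple&amp;gt;

using RowsetId  = uint32_t;
using SegmentId = uint32_t;
using Version   = uint64_t;
using BitmapKey = std::tuple&amp;lt;RowsetId, SegmentId, Version&amp;gt;;
using Bitmap    = std::set&amp;lt;uint32_t&amp;gt;;  // stand-in for roaring::Roaring

std::map&amp;lt;BitmapKey, Bitmap&amp;gt; delete_bitmap;
std::map&amp;lt;BitmapKey, Bitmap&amp;gt; merged_cache;  // stand-in for the LRUCache

// Union every Bitmap of (rs, seg) whose version is &amp;lt;= query_version, so a
// query pinned to a version sees exactly the deletions up to that version.
const Bitmap&amp;amp; bitmap_for_query(RowsetId rs, SegmentId seg, Version query_version) {
    BitmapKey cache_key{rs, seg, query_version};
    auto hit = merged_cache.find(cache_key);
    if (hit != merged_cache.end()) return hit-&amp;gt;second;  // merge each version once
    Bitmap merged;
    for (const auto&amp;amp; [k, bm] : delete_bitmap)
        if (std::get&amp;lt;0&amp;gt;(k) == rs &amp;amp;&amp;amp; std::get&amp;lt;1&amp;gt;(k) == seg &amp;amp;&amp;amp;
            std::get&amp;lt;2&amp;gt;(k) &amp;lt;= query_version)
            merged.insert(bm.begin(), bm.end());
    return merged_cache[cache_key] = merged;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;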

&lt;h3&gt;
  
  
  Write Flow
&lt;/h3&gt;

&lt;p&gt;When writing data, the primary key index of each Segment will be created first, and then the Delete Bitmap will be updated. The establishment of the primary key index is relatively simple and will not be described in detail due to space limitations. The focus is on introducing the more complex Delete Bitmap update flow:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Write Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DeltaWriter will first flush the data to the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Publish phase, batch point queries are performed on all Keys, and the Bitmaps corresponding to the overwritten Keys are updated. In the figure, the version of the newly written Rowset is 8, and it modifies data in 3 Rowsets, so 3 Bitmap modification records are generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating the Bitmap in the Publish phase ensures that no new visible Rowsets will appear during the batch point query of Keys and Bitmap update, ensuring the correctness of Bitmap update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a Segment is not modified, there will be no Bitmap record corresponding to the version. For example, Segment1 of Rowset1 has no Bitmap corresponding to Version 8.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Read Flow
&lt;/h3&gt;

&lt;p&gt;The reading flow of Bitmap is shown in the following figure. It can be seen from the figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Read Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Query requesting version 7 will only see the data corresponding to version 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When reading the data of Rowset5, the Bitmaps generated by the modifications of v6 and v7 to it will be merged to obtain the complete DeleteBitmap corresponding to Version7, which is used to filter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the example in the figure, the import of version 8 overwrites a piece of data in Segment2 of Rowset1, but the Query requesting version 7 can still read this piece of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency import scenarios, there may be a large number of versions of Bitmaps. Merging these Bitmaps itself may also consume a lot of CPU computing resources. Therefore, we introduced an LRUCache, and each version of Bitmap only needs to be merged once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Normal Compaction Flow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When Compaction reads data, it obtains the version Vx of the Rowset being processed, and will automatically filter out the rows marked as deleted through the Delete Bitmap (see the query layer adaptation part earlier).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, all DeleteBitmaps on the source Rowset that are less than or equal to version Vx can be cleaned up.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During the execution of Compaction, a new import task may be submitted, assuming the corresponding version is Vy. If the write corresponding to Vy has modifications to the Rowset in the Compaction source, it will be updated to Vy of the DeleteBitmap of this Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, check all DeleteBitmaps on this Rowset that are greater than Vx, and update the row numbers in them to the Segment row numbers in the newly generated Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As shown in the following figure, Compaction selects three Rowsets [0-5], [6-6], [7-7]. During the Compaction process, the import of Version8 is successfully executed. In the Compaction Commit phase, it is necessary to process the new Bitmap generated by the data import of Version8.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Compaction&lt;/p&gt;
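
&lt;p&gt;The two rules above can be sketched as follows (a simplified model: &lt;code&gt;std::set&lt;/code&gt; stands in for &lt;code&gt;roaring::Roaring&lt;/code&gt;, the bitmaps of a single source Rowset are keyed by import version, and &lt;code&gt;row_map&lt;/code&gt; is the old-row-to-new-row mapping a compaction writer would record; all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;

using Version = uint64_t;
using Bitmap  = std::set&amp;lt;uint32_t&amp;gt;;  // stand-in for roaring::Roaring

// After a compaction that read its sources at version vx: bitmaps &amp;lt;= vx were
// already applied during the merge and are dropped with the source Rowsets,
// while newer bitmaps (e.g. a Version-8 import landing mid-compaction) are
// remapped onto the row numbers of the newly written Rowset.
std::map&amp;lt;Version, Bitmap&amp;gt; finish_compaction(
        const std::map&amp;lt;Version, Bitmap&amp;gt;&amp;amp; source_bitmaps,
        Version vx,
        const std::map&amp;lt;uint32_t, uint32_t&amp;gt;&amp;amp; row_map) {  // old row -&amp;gt; new row
    std::map&amp;lt;Version, Bitmap&amp;gt; output;  // bitmaps carried over to the new Rowset
    for (const auto&amp;amp; [version, bm] : source_bitmaps) {
        if (version &amp;lt;= vx) continue;  // cleaned up after compaction completes
        Bitmap remapped;
        for (uint32_t row : bm) {
            auto it = row_map.find(row);
            if (it != row_map.end()) remapped.insert(it-&amp;gt;second);
        }
        if (!remapped.empty()) output[version] = remapped;
    }
    return output;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;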

&lt;h3&gt;
  
  
  Write Performance Optimization
&lt;/h3&gt;

&lt;p&gt;In the initial design, DeltaWriter performed no point queries or Delete Bitmap updates during the data writing phase; both were done in the Publish phase. This guaranteed that all data preceding the version was visible when updating the Delete Bitmap, ensuring its correctness. However, high-frequency import tests showed that serially point-querying and updating all of each Rowset's data in the Publish phase added enough overhead to cause a significant drop in import throughput.&lt;/p&gt;

&lt;p&gt;Therefore, in the final design, we split the Delete Bitmap update into two phases: the first phase runs in parallel and looks up and marks deletions only against the versions visible at that moment; the second phase runs serially and handles the newly imported Rowsets that the first phase may have missed. Since the amount of incremental data processed in the second phase is very small, the impact on overall throughput is limited.&lt;/p&gt;

&lt;h1&gt;
  
  
  Optimization Effects
&lt;/h1&gt;

&lt;p&gt;The new Merge-On-Write implementation marks old data as deleted during writing, which can always ensure that valid primary keys only appear in one file (that is, the uniqueness of primary keys is ensured during writing). There is no need to deduplicate primary keys through merge sorting during reading. For high-frequency writing scenarios, this greatly reduces the additional consumption during query execution.&lt;/p&gt;

&lt;p&gt;In addition, the new version implementation can also support predicate pushdown and make good use of Doris' rich indexes. Sufficient data pruning can be performed at the data IO level, greatly reducing the amount of data read and computed. Therefore, there is a significant performance improvement in queries in many scenarios.&lt;/p&gt;

&lt;p&gt;It should be noted that if users use the Unique Key in low-frequency batch update scenarios, the improvement of the Merge-On-Write implementation on users' query effects may not be obvious. Because for low-frequency batch updates, Doris' Compaction mechanism can usually quickly compact the data into a good state (that is, Compaction completes the deduplication of primary keys), avoiding the deduplication computing cost during queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Effects on Aggregation Analysis
&lt;/h2&gt;

&lt;p&gt;We conducted tests using the Lineitem table, which has the largest data volume in TPC-H 100. To simulate multiple continuous writing scenarios, the data was divided into 100 parts and imported repeatedly 3 times. Then count(*) queries were performed, and the effect comparison is as follows:&lt;/p&gt;

&lt;p&gt;Image: Optimization - Aggregation Analysis&lt;/p&gt;

&lt;p&gt;The scenarios with and without Cache were compared respectively. In the case of no Cache, due to the high time consumption of loading data from the disk, there is an overall performance improvement of about 4 times; excluding the impact of disk reading overhead, in the case of Cache, the computing efficiency of the new version implementation can be improved by more than 20 times.&lt;/p&gt;

&lt;p&gt;The effect of Sum is similar, and will not be listed due to space limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSB Flat
&lt;/h2&gt;

&lt;p&gt;In addition to simple Count and Sum, we also tested the SSB-Flat dataset. The optimization effect on the 100G dataset (divided into 10 parts and imported multiple times to simulate data update scenarios) is shown in the following figure:&lt;/p&gt;

&lt;p&gt;In business scenarios of real-time data warehouses, providing good support for real-time data updates is an extremely important capability. For example, in scenarios such as database synchronization (CDC), e-commerce transaction orders, advertising effect delivery, and marketing business reports, when facing changes in upstream data, it is usually necessary to quickly capture change records and promptly modify single or multiple rows of data. This ensures that business analysts and related analysis platforms can quickly grasp the latest progress and improve the timeliness of business decisions.&lt;/p&gt;

&lt;p&gt;For OLAP databases, which have traditionally been weak at data updates, how to better implement real-time update capabilities has become a key to winning fierce competition in today's environment where data timeliness requirements are increasingly strong and the application scope of real-time data warehouse businesses is expanding.&lt;/p&gt;

&lt;p&gt;In the past, Apache Doris mainly implemented real-time data Upserts through the Unique Key data model. Due to its underlying LSM Tree-like structure, it provides strong support for high-frequency writes of large datasets. However, its Merge-on-Read update mode has become a bottleneck restricting Apache Doris' real-time update capabilities, which may cause query jitters when dealing with concurrent reading and writing of real-time data.&lt;/p&gt;

&lt;p&gt;Based on this, in the Apache Doris 1.2.0 version, we introduced a new data update method - Merge-On-Write - for the Unique Key model, striving to balance real-time updates and efficient queries. This article will detail the design, implementation and effects of the new primary key model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation of the Original Unique Key Model
&lt;/h1&gt;

&lt;p&gt;Users familiar with Apache Doris' history may know that Doris' initial design was inspired by Google Mesa, and it only had Duplicate Key and Aggregate Key models at first. The Unique Key model was added later based on user needs during Doris' development. However, the demand for real-time updates was not so strong at that time, so the implementation of Unique Key was relatively simple - it was just a wrapper around the Aggregate Key model, without in-depth optimization for real-time update requirements.&lt;/p&gt;

&lt;p&gt;Specifically, the implementation of the Unique Key model is just a special case of the Aggregate Key model. If you use the Aggregate Key model and set the aggregation type of all non-key columns to REPLACE, you can achieve exactly the same effect. As shown in the following figure, when describing example_tbl, a table of the Unique Key model, the aggregation type in the last column shows that it is equivalent to an Aggregate Key table where all columns have the REPLACE aggregation type.&lt;/p&gt;

&lt;p&gt;Image: Original Unique-Key-Aggregate-Key&lt;/p&gt;

&lt;p&gt;Both the Unique Key and Aggregate Key data models adopt the Merge-On-Read implementation method. That is, when data is imported, it is first written to a new Rowset, and no deduplication is performed after writing. Only when a query is initiated will multi-way concurrent sorting be performed. During multi-way merge sorting, duplicate keys will be grouped together and aggregation operations will be performed. Among them, keys with higher versions will overwrite those with lower versions, and finally only the record with the highest version will be returned to the user.&lt;/p&gt;

&lt;p&gt;The following figure is a simplified representation of the execution process of the Unique Key model:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Simplified Unique-Key&lt;/p&gt;

&lt;p&gt;Although their implementation methods are relatively consistent, the usage scenarios of the Unique Key and Aggregate Key data models are significantly different:&lt;/p&gt;

&lt;p&gt;When users create a table with the Aggregate Key model, they have a very clear understanding of the aggregation query conditions - aggregating according to the columns specified by the Aggregate Key, and the aggregate functions on the Value columns are the main aggregation methods (COUNT/SUM/MAX/MIN, etc.) used by users. For example, using user_id as the Aggregate Key and summing the number of visits and duration to calculate UV and user usage duration.&lt;/p&gt;

&lt;p&gt;However, the main function of the Key in the Unique Key data model is to ensure uniqueness, not to serve as an aggregation Key. For example, in the order scenario, data synchronized from TP databases through Flink CDC uses the order ID as the Unique Key for deduplication. However, during queries, filtering, aggregation and analysis are usually performed on certain Value columns (such as order status, order amount, order time consumption, order placement time, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Shortcomings
&lt;/h1&gt;

&lt;p&gt;As can be seen from the above, when users query using the Unique Key model, they actually perform two aggregation operations. The first is to aggregate all data by Key according to the Unique Key to remove duplicate Keys; the second is to aggregate according to the actual aggregation conditions required by the query. These two aggregation operations lead to serious efficiency issues and low query performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data deduplication requires expensive multi-way merge sorting, and full Key comparison consumes a lot of CPU computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective data pruning cannot be performed, introducing a large amount of additional data IO. For example, if a data partition has 10 million pieces of data, but only 1,000 pieces meet the filtering conditions, the rich indexes of the OLAP system are designed to efficiently filter out these 1,000 pieces of data. However, since it is impossible to determine whether a certain piece of data in a specific file is valid, these indexes cannot be used. It is necessary to first perform full merge sorting and data deduplication, and then filter these finally confirmed valid data. This brings about a 10,000-fold IO amplification (this figure is only a rough estimate, and the actual amplification effect is more complicated to calculate).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Scheme Research and Selection
&lt;/h1&gt;

&lt;p&gt;In order to solve the problems existing in the original Unique Key model and better meet the needs of business scenarios, we decided to optimize the Unique Key model and conducted a detailed research on optimization schemes for read and write efficiency issues.&lt;/p&gt;

&lt;p&gt;There have been many industry explorations on solutions to the above problems. There are three representative types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Delete + Insert: That is, when writing data, find the overwritten key through a primary key index and mark it as deleted. A representative system is Microsoft's SQL Server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delta Store: Divide data into base data and delta data. Each primary key in the base data is guaranteed to be unique. All updates are recorded in the Delta Store. During queries, the base data and delta data are merged. At the same time, background merge threads regularly merge the delta data and base data. A representative system is Apache Kudu.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy-on-Write: When updating data, directly copy the original data row, update it, and write it to a new file. This method is widely used in data lakes, with representative systems such as Apache Hudi and Delta Lake.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation mechanisms and comparisons of these three schemes are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete + Insert (i.e., Merge-on-Write)
&lt;/h2&gt;

&lt;p&gt;A representative example is the scheme proposed in the paper "Real-Time Analytical Processing with SQL Server", published by the SQL Server team at VLDB 2015. In short, when data is written, old rows are marked as deleted (via a data structure called a Delete Bitmap) and new rows are recorded in the Delta Store. At query time, the Base data, the Delete Bitmap, and the Delta Store are combined to produce the latest data. The overall scheme is shown in the figure below; we omit the details for space.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Merge-on-Write&lt;/p&gt;

&lt;p&gt;The advantage of this scheme is that any valid primary key exists only in one place (either in Base Data or Delta Store), which avoids a large amount of merge sorting consumption during queries. At the same time, various rich columnar indexes in the Base data remain valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Store
&lt;/h2&gt;

&lt;p&gt;A representative system using the Delta Store method is Apache Kudu. In Kudu, data is divided into Base Data and Delta Data, and the primary keys in the Base Data are all unique. Any modification to the Base Data is first written to the Delta Store (row numbers record its correspondence with the Base Data, which avoids sorting during merges). Unlike SQL Server's Base + Delta scheme described earlier, Kudu does not mark deletions, so data with the same primary key can exist in two places. Therefore, at query time, the Base and Delta data must be merged to obtain the latest result. Kudu's scheme is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delta-Store&lt;/p&gt;

&lt;p&gt;Kudu's scheme can also avoid the high cost caused by merge sorting when reading data. However, since data with the same primary key can exist in multiple places, it is difficult to ensure the accuracy of indexes and cannot perform efficient predicate pushdown. Indexes and predicate pushdown are important means for analytical databases to optimize performance, so this shortcoming has a significant impact on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-On-Write
&lt;/h2&gt;

&lt;p&gt;Since Apache Doris is positioned as a real-time analytical database, the Copy-On-Write scheme has too high a cost for real-time updates and is not suitable for Doris.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheme Comparison
&lt;/h2&gt;

&lt;p&gt;The following table compares the schemes. Merge-On-Read is the default implementation of the Unique Key model, i.e., the implementation before version 1.2; Merge-On-Write is the Delete + Insert scheme described above.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Scheme Comparison&lt;/p&gt;

&lt;p&gt;As can be seen from the above, Merge-On-Write trades moderate write costs for lower read costs, well supports predicate pushdown and non-key column index filtering, and has good effects on query performance optimization. After comprehensive comparison, we chose Merge-On-Write as the final optimization scheme.&lt;/p&gt;

&lt;h1&gt;
  
  
  Design and Implementation of the New Scheme
&lt;/h1&gt;

&lt;p&gt;In short, the processing flow of Merge-On-Write is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For each Key, find its position in the Base data (rowsetid + segmentid + row number).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the Key exists, mark the corresponding row of data as deleted. The information of marked deletion is recorded in the Delete Bitmap, and each Segment has a corresponding Delete Bitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write the updated data to a new Rowset, complete the transaction, and make the new data visible (able to be queried).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;During queries, read the Delete Bitmap, filter out the rows marked as deleted, and only return valid data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
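The four steps above can be sketched in simplified C++. This is an illustrative model only: `Segment`, `publish`, and `visible` are hypothetical names, and a `std::set` of row numbers stands in for the RoaringBitmap that Doris actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using RowId = uint32_t;
// Stand-in for roaring::Roaring: the set of row numbers marked deleted.
using Bitmap = std::set<RowId>;

struct Segment {
    std::map<std::string, RowId> primary_index;  // primary key -> row number
    Bitmap delete_bitmap;                        // rows marked deleted
};

// Steps 1-2: locate each incoming key in the base segments; if found,
// mark the old row as deleted in that segment's delete bitmap.
void publish(std::vector<Segment>& base, const std::vector<std::string>& new_keys) {
    for (const auto& key : new_keys) {
        for (auto& seg : base) {
            auto it = seg.primary_index.find(key);
            if (it != seg.primary_index.end()) {
                seg.delete_bitmap.insert(it->second);  // mark old row deleted
                break;  // a valid key exists in only one place
            }
        }
    }
}

// Step 4: at query time, a row is visible only if it is not marked deleted.
bool visible(const Segment& seg, RowId row) {
    return seg.delete_bitmap.count(row) == 0;
}
```

Step 3 (writing the new data into its own Rowset) is omitted here; the point is that after `publish`, readers never need a merge sort, only the bitmap filter.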

&lt;h2&gt;
  
  
  Key Issues
&lt;/h2&gt;

&lt;p&gt;To design a Merge-On-Write scheme suitable for Doris, the following key issues need to be focused on solving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How to efficiently locate whether there is old data that needs to be marked for deletion during import?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently store the information of marked deletion?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently use the marked deletion information to filter data during the query phase?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can multi-version support be realized?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to avoid transaction conflicts in concurrent imports and write conflicts between imports and Compaction?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the additional memory consumption introduced by the scheme reasonable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the write performance degradation caused by write costs within an acceptable range?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on the above key issues, we have implemented a series of optimization measures to solve these problems well. They will be introduced in detail in the following text:&lt;/p&gt;

&lt;h3&gt;
  
  
  Primary Key Index
&lt;/h3&gt;

&lt;p&gt;Doris is a columnar storage system designed for large-scale analytics and originally had no primary key index. So, to quickly determine whether a newly written primary key overwrites an existing row, and to find that row's number, a primary key index had to be added to Doris.&lt;/p&gt;

&lt;p&gt;We have taken the following optimization measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintain a primary key index for each Segment. The primary key index is implemented using a scheme similar to RocksDB Partitioned Index. This scheme can achieve very high query QPS, and the file-based index scheme can also save memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a Bloom Filter corresponding to the primary key index for each Segment. The primary key index will only be queried when the Bloom Filter hits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Record a primary key range [min-key, max-key] for each Segment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a pure in-memory interval tree, constructed using the primary key ranges of all Segments. When querying a primary key, there is no need to traverse all Segments. The interval tree can be used to locate the Segments that may contain the primary key, greatly reducing the amount of indexes that need to be queried.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For all hit Segments, query them in descending order of version. In Doris, a higher version means more updated data. Therefore, if a primary key hits in the index of a higher-version Segment, there is no need to continue querying lower-version Segments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
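The pruning ideas above (per-Segment min/max key ranges, plus querying candidates in descending version order) can be sketched as follows. All names are illustrative, and for brevity a linear range filter replaces the interval tree Doris uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct SegmentIndex {
    uint64_t version;
    std::string min_key, max_key;            // recorded primary key range
    std::map<std::string, uint32_t> index;   // primary key -> row number
};

// Return (version, row number) of the newest segment containing `key`, if any.
std::optional<std::pair<uint64_t, uint32_t>>
lookup(std::vector<SegmentIndex> segs, const std::string& key) {
    // Prune segments whose [min-key, max-key] range cannot contain the key.
    // (In Doris an in-memory interval tree does this without a full scan.)
    segs.erase(std::remove_if(segs.begin(), segs.end(),
                              [&](const SegmentIndex& s) {
                                  return key < s.min_key || key > s.max_key;
                              }),
               segs.end());
    // Query candidates in descending version order: a hit in a higher-version
    // segment means lower-version segments never need to be consulted.
    std::sort(segs.begin(), segs.end(),
              [](const SegmentIndex& a, const SegmentIndex& b) {
                  return a.version > b.version;
              });
    for (const auto& s : segs) {
        auto it = s.index.find(key);
        if (it != s.index.end())
            return std::pair<uint64_t, uint32_t>{s.version, it->second};
    }
    return std::nullopt;
}
```

The per-Segment Bloom filter described above would sit just before the `s.index.find(key)` probe, skipping the index read entirely on a miss.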

&lt;p&gt;The flow of querying a single primary key is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Primary Key Index&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete Bitmap
&lt;/h3&gt;

&lt;p&gt;Delete Bitmap adopts a multi-version recording method, as shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delete-Bitmap&lt;/p&gt;

&lt;p&gt;The Segment file in the figure was generated by the version 5 import and contains the data that this import wrote into the Tablet.&lt;/p&gt;

&lt;p&gt;The version 6 import updates primary key B, so the second row is marked as deleted in a Bitmap, and this modification of the Segment is recorded in the DeleteBitmap under version 6.&lt;/p&gt;

&lt;p&gt;The version 7 import updates primary key A and likewise generates a Bitmap for its version; the version 8 import does the same.&lt;/p&gt;

&lt;p&gt;All Delete Bitmaps are stored in a large Map. Each import will serialize the latest Delete Bitmap into RocksDB. The key definitions are as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BitmapKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RowsetId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BitmapKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roaring&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Roaring&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delete_bitmap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Segment in each Rowset will record multiple versions of Bitmaps. A Bitmap with Version x means the modification of the current Segment by the import of version x.&lt;/p&gt;

&lt;p&gt;Advantages of multi-version Delete Bitmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It can well support multi-version queries. For example, after the import of version 7 is completed, a query on this table starts to execute and will use Version 7. Even if the query takes a long time and the import of version 8 is completed during the query execution, there is no need to worry about reading the data of version 8 (or missing the data deleted by version 8).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can well support complex Schema Changes. In Doris, complex Schema Changes (such as type conversion) require double writing first, and at the same time convert historical data before a certain version and then delete the old version of data. Multi-version Delete Bitmap can well support the current Schema Change implementation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can support multi-version requirements during data copying and replica repair.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, multi-version Delete Bitmaps also have a cost. In the example above, accessing the data of version 8 requires merging the three Bitmaps of v6, v7, and v8 into one complete Bitmap, which is then used to filter the Segment data. Real-time, high-frequency imports can easily generate a large number of Bitmaps, and RoaringBitmap union operations are CPU-expensive. To minimize the impact of these union operations, we added an LRUCache to the DeleteBitmap that keeps the most recently merged Bitmaps.&lt;/p&gt;
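The merge-with-cache idea can be sketched like this. A `std::set` stands in for `roaring::Roaring`, a plain map stands in for the LRUCache, and all names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

using Version = uint64_t;
using Bitmap = std::set<uint32_t>;  // stand-in for roaring::Roaring

// All per-version delete bitmaps of one segment, keyed by import version.
using VersionedBitmaps = std::map<Version, Bitmap>;

// Union all bitmaps with version <= query_version, caching the result so the
// expensive merge runs once per version (Doris uses an LRUCache for this).
const Bitmap& merged_up_to(const VersionedBitmaps& bms, Version query_version,
                           std::map<Version, Bitmap>& cache) {
    auto hit = cache.find(query_version);
    if (hit != cache.end()) return hit->second;  // merged once already
    Bitmap out;
    for (const auto& [v, bm] : bms) {
        if (v > query_version) break;        // newer imports stay invisible
        out.insert(bm.begin(), bm.end());    // roaring would use operator|=
    }
    return cache.emplace(query_version, std::move(out)).first->second;
}
```

With the bitmaps from the example (v6 marks row 1, v7 marks row 0, v8 marks row 3), a version 7 query merges only v6 and v7, while a version 8 query additionally folds in v8.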

&lt;h3&gt;
  
  
  Write Flow
&lt;/h3&gt;

&lt;p&gt;When writing data, the primary key index of each Segment is created first, and then the Delete Bitmap is updated. Building the primary key index is relatively straightforward, so we focus on the more involved Delete Bitmap update flow:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Write Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DeltaWriter will first flush the data to the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Publish phase, batch point queries are performed on all Keys, and the Bitmaps corresponding to the overwritten Keys are updated. In the following figure, the version of the newly written Rowset is 8, which modifies the data in 3 Rowsets, so 3 Bitmap modification records will be generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating the Bitmap in the Publish phase ensures that no new visible Rowsets will appear during the batch point query of Keys and Bitmap update, ensuring the correctness of Bitmap update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a Segment is not modified, there will be no Bitmap record corresponding to the version. For example, Segment1 of Rowset1 has no Bitmap corresponding to Version 8.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Read Flow
&lt;/h3&gt;

&lt;p&gt;The reading flow of Bitmap is shown in the following figure. It can be seen from the figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Read Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Query requesting version 7 will only see the data corresponding to version 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When reading the data of Rowset5, the Bitmaps generated by the modifications of v6 and v7 to it will be merged to obtain the complete DeleteBitmap corresponding to Version7, which is used to filter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the example in the figure, the import of version 8 overwrites a piece of data in Segment2 of Rowset1, but the Query requesting version 7 can still read this piece of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency import scenarios, there may be a large number of versions of Bitmaps. Merging these Bitmaps itself may also consume a lot of CPU computing resources. Therefore, we introduced an LRUCache, and each version of Bitmap only needs to be merged once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Normal Compaction Flow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When Compaction reads data, it obtains the version Vx of the Rowset being processed, and will automatically filter out the rows marked as deleted through the Delete Bitmap (see the query layer adaptation part earlier).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, all DeleteBitmaps on the source Rowset that are less than or equal to version Vx can be cleaned up.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During Compaction, a new import task may commit; suppose its version is Vy. If the Vy write modifies rows in a Rowset that is a Compaction input, the changes are recorded under version Vy of that Rowset's DeleteBitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction completes, all DeleteBitmaps on the input Rowsets with versions greater than Vx are checked, and the row numbers in them are translated to the Segment row numbers of the newly generated Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
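The row-number translation in the second bullet can be sketched as follows, assuming a hypothetical `row_map` that records where each surviving old row landed in the new Rowset after Compaction.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using RowId = uint32_t;
using Bitmap = std::set<RowId>;  // stand-in for roaring::Roaring

// Translate a delete bitmap recorded against the old Rowset layout into the
// row numbers of the newly compacted Rowset. `row_map` maps old row -> new row.
Bitmap remap_bitmap(const Bitmap& old_bm, const std::map<RowId, RowId>& row_map) {
    Bitmap out;
    for (RowId old_row : old_bm) {
        auto it = row_map.find(old_row);
        if (it != row_map.end()) out.insert(it->second);
        // Rows absent from row_map were already dropped by Compaction itself
        // (they were marked deleted at or below Vx), so nothing to carry over.
    }
    return out;
}
```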

&lt;p&gt;As shown in the following figure, Compaction selects three Rowsets [0-5], [6-6], [7-7]. During the Compaction process, the import of Version8 is successfully executed. In the Compaction Commit phase, it is necessary to process the new Bitmap generated by the data import of Version8.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Compaction&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Performance Optimization
&lt;/h3&gt;

&lt;p&gt;In the initial design, DeltaWriter performed no point queries or Delete Bitmap updates during the write phase; both were done in the Publish phase. This guarantees that all data preceding the version is visible when the Delete Bitmap is updated, ensuring its correctness. However, high-frequency import tests showed that serially point-querying every key of each Rowset and updating the bitmaps in the Publish phase caused a significant drop in import throughput.&lt;/p&gt;

&lt;p&gt;Therefore, the final design splits the Delete Bitmap update into two phases: the first phase runs in parallel and finds and marks deletions only against the versions visible at that time; the second phase runs serially and covers only the Rowsets imported since the first phase, which the first phase may have missed. Because the incremental data handled in the second phase is very small, the impact on overall throughput is limited.&lt;/p&gt;
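A minimal sketch of the two-phase update, with illustrative names and versions treated as plain integers:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Rowset {
    uint64_t version;
    std::map<std::string, uint32_t> index;   // primary key -> row number
    std::set<uint32_t> delete_bitmap;        // rows marked deleted
};

// Mark old rows of `keys` deleted in rowsets with version in (lo, hi].
void mark_range(std::vector<Rowset>& rowsets, const std::vector<std::string>& keys,
                uint64_t lo, uint64_t hi) {
    for (auto& rs : rowsets) {
        if (rs.version <= lo || rs.version > hi) continue;
        for (const auto& k : keys) {
            auto it = rs.index.find(k);
            if (it != rs.index.end()) rs.delete_bitmap.insert(it->second);
        }
    }
}

// Phase 1 (parallel): cover everything visible when the write started.
// Phase 2 (serial, in Publish): cover only the few rowsets committed since.
void two_phase_update(std::vector<Rowset>& rowsets, const std::vector<std::string>& keys,
                      uint64_t visible_at_write, uint64_t publish_version) {
    mark_range(rowsets, keys, 0, visible_at_write);                    // bulk
    mark_range(rowsets, keys, visible_at_write, publish_version - 1);  // delta
}
```

The second `mark_range` call touches only versions between the snapshot the writer saw and its own publish version, which is why its cost stays small.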

&lt;h1&gt;
  
  
  Optimization Effects
&lt;/h1&gt;

&lt;p&gt;The new Merge-On-Write implementation marks old data as deleted during writing, which can always ensure that valid primary keys only appear in one file (that is, the uniqueness of primary keys is ensured during writing). There is no need to deduplicate primary keys through merge sorting during reading. For high-frequency writing scenarios, this greatly reduces the additional consumption during query execution.&lt;/p&gt;

&lt;p&gt;In addition, the new version implementation can also support predicate pushdown and make good use of Doris' rich indexes. Sufficient data pruning can be performed at the data IO level, greatly reducing the amount of data read and computed. Therefore, there is a significant performance improvement in queries in many scenarios.&lt;/p&gt;

&lt;p&gt;Note that for low-frequency batch-update workloads, the query-side improvement from Merge-On-Write may not be obvious: with infrequent batch updates, Doris' Compaction mechanism can usually compact the data into good shape quickly (that is, Compaction itself completes the primary key deduplication), which already avoids the deduplication cost at query time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Effects on Aggregation Analysis
&lt;/h2&gt;

&lt;p&gt;We tested with the Lineitem table, the largest table in TPC-H 100. To simulate continuous writes, the data was split into 100 parts and imported 3 times over; count(*) queries were then run. The comparison is as follows:&lt;/p&gt;

&lt;p&gt;Image: Optimization - Aggregation Analysis&lt;/p&gt;

&lt;p&gt;We compared the scenarios with and without Cache. Without Cache, where loading data from disk dominates, the new implementation is about 4x faster overall; with Cache, which excludes the disk-read overhead, the computing efficiency of the new implementation improves by more than 20x.&lt;/p&gt;

&lt;p&gt;The effect of Sum is similar, and will not be listed due to space limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSB Flat
&lt;/h2&gt;

&lt;p&gt;In addition to simple Count and Sum, we also tested the SSB-Flat dataset. The optimization effect on the 100G dataset (divided into 10 parts and imported multiple times to simulate data update scenarios) is shown in the following figure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0pl3yj4ukewjb4ttzjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0pl3yj4ukewjb4ttzjd.png" alt=" " width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explanation of test results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Under the typical configuration of 32C64GB, the total time for all queries to complete is 4.5 seconds for the new version implementation, and 126.4 seconds for the old version implementation, with a speed difference of nearly 30 times. Further analysis found that when queries were executed on the table of the old version implementation, all 32-core CPUs were fully loaded. Therefore, a machine with a higher configuration was used to test the query time on the table of the old version implementation when computing resources were sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under the configuration of 64C128GB, the total time of the old version implementation is 49.9s, and the maximum number of cores used is about 48. When computing resources are sufficient, the old version implementation still has a 12-fold performance gap compared with the new version implementation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can be seen that the new version implementation not only greatly improves the query speed, but also significantly reduces CPU consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact on Data Import
&lt;/h2&gt;

&lt;p&gt;The new Merge-On-Write implementation primarily optimizes query performance and, as shown above, does so effectively. These gains, however, come from extra work done at write time, so Merge-On-Write slows imports down to a small extent. Thanks to concurrency and the pipelining between batches of imports, the overall import throughput does not drop significantly.&lt;/p&gt;

&lt;h1&gt;
  
  
  Usage Method
&lt;/h1&gt;

&lt;p&gt;In version 1.2, Merge-on-Write is a new feature and is disabled by default. Users can enable it by adding the following property when creating a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="nv"&gt;"enable_unique_key_merge_on_write"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, the new Merge-on-Write update mode differs from the old Merge-on-Read implementation, so an existing Unique Key table cannot be switched over by adding the property via &lt;code&gt;ALTER TABLE&lt;/code&gt;; the property can only be specified when creating a new table. To migrate an old table to a new one, use &lt;code&gt;INSERT INTO new_table SELECT * FROM old_table&lt;/code&gt;.&lt;/p&gt;
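Putting both together, a minimal illustrative DDL follows. The column list, table names, and bucket count are placeholders, not from the original article; only the `enable_unique_key_merge_on_write` property is the documented switch.

```sql
-- Create a Unique Key table with Merge-on-Write enabled from the start
-- (schema and bucket count are illustrative placeholders).
CREATE TABLE t_user (
    `user_id` BIGINT NOT NULL,
    `name`    VARCHAR(64)
)
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true"
);

-- Migrate data from an existing Merge-on-Read table:
INSERT INTO t_user SELECT * FROM old_t_user;
```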


</description>
      <category>bigdata</category>
      <category>olap</category>
      <category>database</category>
      <category>apachedoris</category>
    </item>
    <item>
      <title>1 billion JSON records, 1-second query response: Apache Doris vs. ClickHouse, Elasticsearch, and PostgreSQL</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 04 Nov 2025 19:32:35 +0000</pubDate>
      <link>https://forem.com/apachedoris/1-billion-json-records-1-second-query-response-apache-doris-vs-clickhouse-elasticsearch-and-22m2</link>
      <guid>https://forem.com/apachedoris/1-billion-json-records-1-second-query-response-apache-doris-vs-clickhouse-elasticsearch-and-22m2</guid>
      <description>&lt;p&gt;Honestly, every time I check performance benchmarks, my eyes instinctively dart to see where Apache Doris ranks. Opening JSONBench's leaderboard this time, I felt that familiar mix of anticipation and nervousness. Fortunately, the result brought me a sigh of relief: Apache Doris snagged third place with just its default configuration, trailing only two versions of ClickHouse (the maintainer of JSONBench itself).&lt;/p&gt;

&lt;p&gt;Not bad. But can Apache Doris go even further? I wanted to see how much more we could cut query latency through optimization, and find out the true performance gap between Apache Doris and ClickHouse. Long story short, here's a before-and-after comparison chart of our optimizations. For the details behind the improvements, read on!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjd66q4ccp6edhv3xoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjd66q4ccp6edhv3xoa.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtb6mj0eyf6xr87523ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtb6mj0eyf6xr87523ss.png" alt=" " width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  I. What is JSONBench?
&lt;/h1&gt;

&lt;p&gt;JSONBench is a benchmark tool specifically designed for JSON data analytics, with the following core features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Data&lt;/strong&gt;: 1 billion JSON-format user behavior logs from real production environments;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Cases&lt;/strong&gt;: 5 SQL queries specifically designed for JSON structures, accurately evaluating the database's ability to process semi-structured data;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Participants&lt;/strong&gt;: Covers mainstream databases such as ClickHouse, SingleStore, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the time of testing, Apache Doris had already delivered an impressive performance: twice as fast as Elasticsearch and a staggering 80 times faster than PostgreSQL!&lt;/p&gt;

&lt;p&gt;JSONBench Official Website: &lt;a href="https://jsonbench.com" rel="noopener noreferrer"&gt;jsonbench.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzp3r8do8yoirpwghsew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzp3r8do8yoirpwghsew.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to its performance advantage, Apache Doris is also highly competitive in storage footprint: on the same dataset, its storage size is only 50% of Elasticsearch's and one third of PostgreSQL's.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq3minivqmkxxtb92b5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq3minivqmkxxtb92b5x.png" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1.1 JSONBench Testing Process
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a table named &lt;code&gt;Bluesky&lt;/code&gt; in the database and import 1 billion real user behavior logs;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each query is executed 3 times, with the operating system's page cache cleared beforehand, so the runs cover both cold and warm query scenarios;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the database performance ranking based on the total query execution time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1.2 Apache Doris Test Basic Configuration
&lt;/h2&gt;

&lt;p&gt;In this test, Apache Doris used the VARIANT data type to store JSON data (introduced in Doris version 2.1, specifically designed for semi-structured JSON data), with the default table structure as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="n"&gt;AUTO_INCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`data`&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"replication_num"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core Advantages of VARIANT Data Type&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No need to predefine column structures; it can directly store complex data containing integers, strings, booleans, and other types;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adapts to frequently changing nested structures: column names and types are inferred automatically from the data during writes, and write schemas are merged dynamically;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stores JSON key-value pairs as dynamic sub-columns, balancing the flexibility of semi-structured data with the query efficiency of columnar storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More information about VARIANT data type: &lt;a href="https://doris.apache.org/docs/3.0/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT" rel="noopener noreferrer"&gt;Apache Doris Official Documentation&lt;/a&gt;&lt;/p&gt;
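&lt;p&gt;As a minimal sketch of working with VARIANT (the sample values below are hypothetical, not benchmark data), a raw JSON document can be written directly, and its sub-fields read back by path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Insert raw JSON; Doris infers the sub-columns automatically
INSERT INTO bluesky (`data`) VALUES
    ('{"kind": "commit", "did": "did:plc:example", "time_us": 1700000000000000}');

-- Access a sub-field by path and cast it to a concrete type
SELECT cast(data['kind'] AS TEXT) AS kind FROM bluesky LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;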

&lt;h1&gt;
  
  
  II. Apache Doris Performance Optimization Practice
&lt;/h1&gt;

&lt;p&gt;The JSONBench leaderboard is based on the performance data of each database system under its default configuration. However, in actual production environments, can we further unlock the potential of Apache Doris through tuning? The following is the complete optimization process.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 Basic Environment Configuration
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Server: AWS M6i.8xlarge (32 cores, 128GB memory);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operating System: Ubuntu 24.04;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Doris Version: v3.0.5.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2.2 Core Optimization: Schema Structuring Transformation
&lt;/h2&gt;

&lt;p&gt;All queries in JSONBench target fixed JSON extraction paths, which means the actual schema of the semi-structured data is fixed. Based on this, we used &lt;strong&gt;Generated Columns&lt;/strong&gt; to extract the frequently accessed fields, combining the advantages of semi-structured and structured data. For frequently accessed JSON paths or computed expressions, adding generated columns can significantly improve query speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2.1 Optimized Table Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.kind'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.commit.operation'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.commit.collection'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'$.did'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.time_us'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`data`&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"replication_num"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transformation reduces the data-extraction overhead during queries, and the flattened columns can also be used as partition columns to achieve a more balanced data distribution.&lt;/p&gt;
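&lt;p&gt;For example, the flattened &lt;code&gt;time&lt;/code&gt; column could serve as a range partition key. This is a sketch only: the partition names and boundaries below are illustrative, and support for generated columns as partition keys should be verified against the documentation for your Doris version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE bluesky_by_month (
    kind VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data, '$.kind')) NOT NULL,
    `time` DATETIME GENERATED ALWAYS AS (from_microsecond(get_json_bigint(data, '$.time_us'))) NOT NULL,
    `data` variant NOT NULL
)
DUPLICATE KEY (kind)
PARTITION BY RANGE (`time`) (
    -- Illustrative monthly partitions
    PARTITION p202411 VALUES LESS THAN ('2024-12-01'),
    PARTITION p202412 VALUES LESS THAN ('2025-01-01')
)
DISTRIBUTED BY HASH(kind) BUCKETS 32
PROPERTIES ("replication_num"="1");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;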

&lt;h3&gt;
  
  
  2.2.2 Supporting Query Statement Optimization
&lt;/h3&gt;

&lt;p&gt;Query statements need to be modified synchronously to use flattened columns. The following is a comparison before and after optimization:&lt;/p&gt;

&lt;h4&gt;
  
  
  Before Optimization (Native JSON Query):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Query 1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 2&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span 
class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 3&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span 
class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.repost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.like'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 4&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span 
class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 5&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MILLISECONDS_DIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span 
class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  After Optimization (Flattened Column Query):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Query 1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 2&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 3&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.repost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.like'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 4&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 5&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MILLISECONDS_DIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2.3 Page Cache Tuning
&lt;/h2&gt;

&lt;p&gt;After modifying the query statements, we enabled performance profiling and executed the complete test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;enable_profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
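&lt;p&gt;Besides the FE Web UI, recent Doris versions can also list collected profiles directly from the MySQL client (worth verifying the exact statement for your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List collected query profiles from the MySQL client
SHOW QUERY PROFILE "/";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;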



&lt;p&gt;By viewing the profile through the FE Web UI (port 8030), we found that the Page Cache hit rate of the SCAN Operator was extremely low: cold reads were still occurring during the warm query runs (like opening the fridge for a snack, finding it empty, and having to go all the way to the supermarket). The key data is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cached Pages Number (CachedPagesNum): 1.258K (1258);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total Pages Number (TotalPagesNum): 7.422K (7422).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause is that the default size of Page Cache is not sufficient to hold all the data of the Bluesky table. The solution is to add a configuration in &lt;code&gt;be.conf&lt;/code&gt; to increase the proportion of Page Cache in total memory from the default 20% to 60%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="n"&gt;storage_page_cache_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After re-running the test, the cold read issue was completely resolved, with a cache hit rate of 100%:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cached Pages Number (CachedPagesNum): 7.316K (7316);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total Pages Number (TotalPagesNum): 7.316K (7316).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2.4 Maximizing Parallelism Configuration
&lt;/h2&gt;

&lt;p&gt;To further unleash performance, we set the session variable &lt;code&gt;parallel_pipeline_task_num&lt;/code&gt; to 32 — since the test server has 32 CPU cores, matching the parallelism to the number of CPU cores can maximize CPU utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Parallelism configuration for a single Fragment&lt;/span&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;parallel_pipeline_task_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  III. Optimization Result: Surpassing ClickHouse by 39%
&lt;/h1&gt;

&lt;p&gt;After the above-mentioned adjustments to schema, queries, memory limits, and CPU parameters, we compared the performance of Apache Doris before and after optimization, as well as with other database systems:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eozc1ssk7shx3f5qj9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eozc1ssk7shx3f5qj9i.png" alt=" " width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Core improvement data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compared with pre-optimization, Apache Doris reduced the total query time by 74%;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared with ClickHouse, which was previously ranked first on the leaderboard, the performance was improved by 39%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  IV. Summary and Future Outlook
&lt;/h1&gt;

&lt;p&gt;Through schema structuring transformation, query statement optimization, cache configuration adjustment, and parallelism parameter tuning, Apache Doris has achieved a significant reduction in semi-structured data query latency. Even under the default configuration, it lagged behind ClickHouse by only a few seconds when querying 1 billion JSON records; after optimization, backed by its strong JSON processing capabilities, VARIANT data type support, and Generated Columns feature, it clearly surpassed comparable databases in this scenario.&lt;/p&gt;

&lt;p&gt;In the future, Apache Doris will continue to deepen its semi-structured data processing capabilities and achieve more powerful and efficient analytics through the following directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimize sparse VARIANT column storage to support more than 10,000 sub-columns;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduce memory usage of wide tables with 10,000-level columns;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support custom types and indexes for VARIANT sub-columns based on column name patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>database</category>
      <category>olap</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The data lakehouse evolution</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 30 Oct 2025 18:44:20 +0000</pubDate>
      <link>https://forem.com/apachedoris/the-data-lakehouse-evolution-3a7e</link>
      <guid>https://forem.com/apachedoris/the-data-lakehouse-evolution-3a7e</guid>
      <description>&lt;p&gt;Data lakehouses are everywhere in today’s conversations about modern data architecture. But before we get swept up in the buzz, it’s worth stepping back to understand how the industry got here — and what we truly need from a lakehouse. Then I’ll introduce Apache Doris as a next-generation lakehouse solution and show how it delivers on those expectations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution towards lakehouse
&lt;/h2&gt;

&lt;h3&gt;
  
  
  01 Traditional data warehouse
&lt;/h3&gt;

&lt;p&gt;In the early days of enterprise digital transformation, the growing complexity of business data gave rise to traditional data warehouses. These systems were designed to empower business intelligence (BI) by consolidating structured data from diverse sources through ETL pipelines.&lt;/p&gt;

&lt;p&gt;With features like well-defined schemas, columnar storage, and tightly coupled compute-storage architecture, data warehouses enabled fast, reliable analysis and reporting using standard SQL. They also ensured data consistency through centralized management and strict transactional controls.&lt;/p&gt;

&lt;p&gt;However, as the digital landscape expanded—driven by the rise of the internet, IoT, and an explosion of unstructured data formats like logs, images, and documents—traditional warehouses struggled to scale efficiently or support flexible, exploratory analytics.&lt;/p&gt;

&lt;p&gt;This gap sparked the emergence of data lakes, offering a more cost-effective, schema-flexible approach better suited for big data and machine learning workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  02 Data lake
&lt;/h3&gt;

&lt;p&gt;Google’s pioneering contributions to big data—Google File System (GFS), MapReduce, and BigTable—ignited a global wave of innovation and laid the foundation for the Hadoop ecosystem. Hadoop revolutionized large-scale data processing by enabling cost-efficient computation on commodity hardware. Data lakes, built on this foundation, became instrumental for handling complex and massive datasets across a variety of use cases:&lt;/p&gt;

&lt;p&gt;Massive-scale data processing: By leveraging distributed storage and parallel computing, data lakes support high-throughput processing on standard computing nodes, eliminating the need for expensive proprietary hardware.&lt;/p&gt;

&lt;p&gt;Multi-modal data support &amp;amp; low-cost storage: Unlike traditional data warehouses, data lakes store raw, unstructured, or semi-structured data without rigid schema definitions. With a schema-on-read approach, structure is applied at query time, preserving the full value of diverse data types such as images, videos, logs, and more. Object storage further reduces costs while enabling massive scalability.&lt;/p&gt;

&lt;p&gt;Multi-modal computing: A single dataset in a data lake can be accessed by various engines for different tasks—SQL querying, machine learning, AI model training—delivering a highly flexible and unified analytics environment.&lt;/p&gt;

&lt;p&gt;The term "data lake" vividly captures the essence: vast pools of raw data stored in a unified layer, ready for various downstream processing. As the architecture evolved, a three-tier design emerged, paving the way for lakehouse-style analytics:&lt;/p&gt;

&lt;p&gt;Storage layer: It is backed by distributed file systems or cloud object storage services like HDFS, AWS S3, and Azure Blob. These platforms offer near-infinite scalability, high availability, and cost-efficiency. Data is retained in its original form to provide flexibility for various analytical use cases.&lt;/p&gt;

&lt;p&gt;Compute layer: Data stored in lakes can be accessed by multiple compute engines based on workload needs. Hive enables batch ETL via HiveQL, Spark handles batch, streaming, and ML tasks, while Presto excels at interactive, ad-hoc querying.&lt;/p&gt;

&lt;p&gt;Metadata layer: Services like Hive Metastore manage schema definitions, partitions, and data locations to provide a shared metadata foundation across engines. This layer is crucial for consistent interpretation, collaboration, and discoverability of data within the lake.&lt;/p&gt;

&lt;h3&gt;
  
  
  03 New challenges in modern data processing
&lt;/h3&gt;

&lt;p&gt;Over the years, both data warehouses and data lakes have evolved to serve vital roles in enterprise data architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xlwv48805ur5ef7j5l1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xlwv48805ur5ef7j5l1.png" alt=" " width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, as modern businesses demand real-time insights, greater flexibility, and open ecosystem integration, both architectures are facing growing limitations. Here’s a snapshot of the key challenges confronting each:&lt;/p&gt;

&lt;p&gt;For traditional data warehouses:&lt;/p&gt;

&lt;p&gt;Lack of real-time capabilities: In high-stakes scenarios like flash sales or real-time monitoring, businesses expect sub-second analytics. Traditional warehouses rely on static ETL processes and struggle to handle continuously changing data, making real-time decision-making difficult. For instance, tracking dynamic shipment routes in logistics is a challenge without true real-time data handling.&lt;/p&gt;

&lt;p&gt;Inflexibility with semi-structured and unstructured data: With the rise of data from social media, medical imaging, and IoT, rigid schema management in warehouses leads to inefficiencies in storage, indexing, and querying. Handling massive volumes of clinical text and images in healthcare research is one such area where traditional warehouses fall short.&lt;/p&gt;

&lt;p&gt;For data lakes:&lt;/p&gt;

&lt;p&gt;Performance bottlenecks: While great for batch processing, engines like Spark and Hive lag behind in interactive, low-latency analytics. Business users and analysts often face sluggish query response times, which can hinder timely decision-making in areas like real-time fraud detection or financial risk assessment.&lt;/p&gt;

&lt;p&gt;Lack of transactional integrity: To maximize flexibility and scalability, data lakes often sacrifice transactional guarantees. This tradeoff can lead to data inconsistency or even loss during complex data operations, posing risks for accuracy-critical applications.&lt;/p&gt;

&lt;p&gt;Data governance pitfalls: Open-write access in data lakes can lead to data quality issues and inconsistency. Without robust governance, the "data lake" can quickly become a "data swamp", making it difficult to extract reliable insights, especially when integrating diverse data sources with inconsistent formats or semantics.&lt;/p&gt;

&lt;p&gt;To meet the needs of both real-time analytics and flexible data processing, many organizations maintain both data warehouses and data lakes. However, this dual-system approach introduces its own challenges: data duplication, redundant pipelines, fragmented user experiences, and data silos. As a result, the industry is shifting toward a unified solution: merging the strengths of warehouses and lakes into a unified lakehouse architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  04 Data lakehouse
&lt;/h3&gt;

&lt;p&gt;The lakehouse architecture unifies storage, computation, and metadata into a single cohesive platform—reducing redundancy, lowering costs, and ensuring data freshness. Over time, this architecture has crystallized into a multi-layered paradigm:&lt;/p&gt;

&lt;p&gt;Storage layer: the solid foundation. Building on the distributed storage capabilities pioneered by data lakes, lakehouses typically rely on HDFS or cloud object stores (e.g., AWS S3, Azure Blob, GCS). Data is stored in raw or open columnar formats like Parquet and ORC, which offer high compression and efficient columnar access. This setup drastically reduces I/O overhead and provides a performant backbone for downstream data processing.&lt;/p&gt;

&lt;p&gt;Open data formats: interoperability. In addition to open file formats, like Parquet and ORC, which ensure interoperability across diverse compute engines, lakehouse systems also embrace open table formats like Apache Iceberg, Hudi, and Delta Lake, enabling features such as near real-time updates, ACID transactions, time travel, and snapshot isolation. These formats ensure seamless compatibility across SQL engines and unify the flexibility of data lakes with the transactional guarantees of traditional warehouses, so the same dataset can be available for both real-time processing and historical analytics.&lt;/p&gt;

&lt;p&gt;Computation layer: diverse engines, unified power. The computation layer combines various engines to leverage their respective strengths. Spark powers large-scale batch jobs and machine learning with its rich APIs. Flink handles real-time stream processing. Presto and Apache Doris excel at ultra-fast, interactive queries. By leveraging a shared storage layer and integrated resource management, these engines can collaboratively execute complex workflows, serving use cases from real-time dashboards to in-depth analytics.&lt;/p&gt;

&lt;p&gt;Metadata layer: the intelligent control plane. Evolving from tools like Hive Metastore to modern systems such as Unity Catalog and Apache Gravitino, metadata management in lakehouses provides a unified namespace and centralized data catalog across multi-cloud and multi-cluster environments. This allows users to easily discover, govern, and interact with data, regardless of where it resides or which engine is querying it. Enhanced features like access control, audit logging, and lineage tracking ensure enterprise-grade data governance.&lt;/p&gt;

&lt;p&gt;In essence, the lakehouse unites the best of both worlds—retaining the cost-efficiency and scalability of lakes while integrating the performance and reliability of warehouses. By standardizing data formats, centralizing metadata, and supporting hybrid processing (real-time + batch), it’s quickly becoming the gold standard for modern big data architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Doris: the lakehouse solution
&lt;/h2&gt;

&lt;p&gt;To respond to the trend and provide better analytics services, Apache Doris has extensively enhanced its data lakehousing capabilities since version 2.1.&lt;/p&gt;

&lt;p&gt;As enterprises push forward with building a lakehouse architecture, they often face complex challenges—from selecting new systems and integrating legacy platforms to managing data format conversions, adapting to new APIs, ensuring seamless system transitions, and coordinating teams across departments for permissions and compliance. To help companies navigate this complexity, Apache Doris introduces two core concepts: "Boundless Data" and "Boundless Lakehouse". These ideas aim to accelerate the lakehouse transformation while minimizing risks and costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  01 Boundless Data
&lt;/h3&gt;

&lt;p&gt;Boundless Data focuses on breaking down data silos. Apache Doris offers unified query acceleration and simplifies system architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  Easy data access
&lt;/h4&gt;

&lt;p&gt;Apache Doris supports a wide range of data systems and formats through its flexible extensible connector framework, enabling users to run cross-platform SQL analytics without overhauling their existing data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15vyds0shmw3div9f32a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15vyds0shmw3div9f32a.png" alt=" " width="800" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Doris offers powerful data source connectors that make it easy to connect to and efficiently extract data from a wide range of systems—whether it's Hive, Iceberg, Hudi, Paimon, or any database that supports the JDBC protocol. For lakehouse systems, Doris can seamlessly retrieve table schemas and distribution information from the underlying metadata services, enabling smart query planning. Thanks to its MPP (Massively Parallel Processing) architecture, Doris can scan and process distributed data at scale with high performance. Below is a list of the supported data sources along with their corresponding metadata and storage systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowrwo78yzlt2i1r9kzo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowrwo78yzlt2i1r9kzo5.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Doris features an extensible connector framework that makes it easy for developers to integrate custom enterprise data sources and achieve seamless data interoperability:&lt;/p&gt;

&lt;p&gt;Doris defines a standardized three-level structure (Catalog, Database, and Table), so developers can easily map to the appropriate layers of their target data systems. Doris also provides standard interfaces for metadata services and data access, allowing developers to integrate new data sources simply by implementing the defined APIs.&lt;/p&gt;

&lt;p&gt;Additionally, Doris is compatible with Trino connectors, enabling teams to directly deploy Trino plugin packages into a Doris cluster with minimal configuration. Doris already supports integrations with sources like Kudu, BigQuery, Delta Lake, Kafka, and Redis.&lt;/p&gt;

&lt;p&gt;Beyond integration, Doris also enables convenient cross-source data processing. It allows users to create multiple connectors at runtime, so they can perform federated queries across different data sources using standard SQL. For example, users can easily join a fact table from Hive with a dimension table from MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hive_table&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mysql_table&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
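
&lt;p&gt;As a sketch of how the catalogs used in such a federated query might be created (the endpoints, user names, and driver details below are illustrative placeholders, not values from this article; adapt them to your deployment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative only: connection properties depend on your environment
CREATE CATALOG hive PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

CREATE CATALOG mysql PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/db",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once created, both catalogs appear under the same three-level namespace, which is why the join above can reference &lt;code&gt;hive.db.hive_table&lt;/code&gt; and &lt;code&gt;mysql.db.mysql_table&lt;/code&gt; side by side.&lt;/p&gt;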



&lt;p&gt;Combined with Doris’ built-in job scheduling capabilities, users can automate such queries. For example, they can set it as an hourly job and write the query results into an Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;JOB&lt;/span&gt; &lt;span class="n"&gt;schedule_load&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;SCHEDULE&lt;/span&gt; &lt;span class="k"&gt;EVERY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ice_table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hive_table&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mysql_table&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  High-performance data processing
&lt;/h4&gt;

&lt;p&gt;High-performance data analytics is a fundamental driver behind the transition from data lakes to unified lakehouse architectures. Apache Doris has extensively optimized data processing and offers a rich set of query acceleration features:&lt;/p&gt;

&lt;p&gt;Execution engine: Doris is built on an MPP (Massively Parallel Processing) framework combined with a pipeline-based execution model. This design enables it to process massive datasets quickly in a multi-machine, multi-core distributed environment. With fully vectorized operators, Doris delivers leading performance on industry-standard benchmarks like TPC-DS.&lt;/p&gt;

&lt;p&gt;Query optimizer: Doris features an intelligent query optimizer that automatically handles complex SQL requests. It deeply optimizes operations such as multi-table joins, aggregations, sorting, and pagination. Specifically, it uses advanced cost models and relational algebra transformations to generate highly efficient execution plans, making SQL writing much simpler for users while boosting performance.&lt;/p&gt;

&lt;p&gt;Caching and I/O optimization: Accessing external data sources often involves high-latency, unstable network communication. Doris addresses this with a comprehensive caching system. It has optimized cache types, freshness, and strategies to maximize the use of memory and local high-speed disks. It also fine-tunes network I/O to deal with high throughput, low IOPS, and high latency, offering near-local performance even when accessing remote data sources.&lt;/p&gt;

&lt;p&gt;Materialized views and transparent acceleration: Doris supports flexible refresh strategies for materialized views, including full refresh and partition-based incremental refresh, to reduce maintenance costs and improve data freshness. In addition to manual refresh, it also supports scheduled and data-triggered refreshes for greater automation. Transparent acceleration means the query optimizer automatically routes queries to the best available materialized view. Featuring columnar storage, efficient compression, and intelligent indexing, Doris materialized views can greatly improve query efficiency and can even replace traditional caching layers.&lt;/p&gt;

&lt;p&gt;As a result, in benchmark tests on a 1TB TPC-DS dataset using the Iceberg table format, Apache Doris completed 99 queries in just one-third of the total time taken by Trino.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57wl8ykajwzqssz1t8uf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57wl8ykajwzqssz1t8uf.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In real-world user scenarios, Apache Doris delivers performance gains over Presto while using only half the computing resources. On average, Doris reduces query latency by 20%, and achieves a 50% reduction in 95th percentile latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s36t26nqkewqaby5mow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s36t26nqkewqaby5mow.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Seamless migration
&lt;/h4&gt;

&lt;p&gt;When integrating multiple data sources into a unified lakehouse, migrating SQL queries is often one of the biggest hurdles. Different SQL dialects across systems can create major compatibility challenges, leading to costly and time-consuming rewrites.&lt;/p&gt;

&lt;p&gt;To simplify this process, Apache Doris offers a SQL Converter. It allows users to directly query data using SQL dialects from other engines because it automatically translates queries into Doris SQL (standard SQL). Currently, Doris supports SQL dialects from Presto/Trino, Hive, PostgreSQL, and ClickHouse, achieving over 99% compatibility in some production environments.&lt;/p&gt;
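
&lt;p&gt;As an illustration, dialect translation is typically enabled per session via a session variable. The sketch below assumes a hypothetical &lt;code&gt;logs&lt;/code&gt; table and a Presto-style function; exact variable values may vary across Doris versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: enable the Presto/Trino dialect for the current session
SET sql_dialect = 'presto';

-- This Presto-style query is then translated to Doris SQL automatically
SELECT approx_distinct(user_id) FROM logs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;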

&lt;h3&gt;
  
  
  02 Boundless Lakehouse
&lt;/h3&gt;

&lt;p&gt;Beyond query migration, Doris also addresses the need for architectural streamlining.&lt;/p&gt;

&lt;h4&gt;
  
  
  Modern deployment architecture
&lt;/h4&gt;

&lt;p&gt;Since version 3.0, Doris has supported a cloud-native, compute-storage decoupled architecture. This modern deployment model maximizes resource efficiency by enabling independent scaling of compute and storage resources, giving enterprises flexible resource management for large-scale analytics workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv94y5wrjyhcs9mqjyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv94y5wrjyhcs9mqjyq.png" alt=" " width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As illustrated above, in the compute-storage decoupled mode of Apache Doris, the compute nodes no longer store the primary data. Instead, HDFS or object storage serves as a unified, shared storage layer. This architecture powers a reliable and cost-efficient lakehouse in the following ways:&lt;/p&gt;

&lt;p&gt;Cost-efficient storage: Storage and compute resources scale separately, allowing enterprises to expand storage without incurring additional compute costs. Meanwhile, organizations benefit from low-cost cloud object storage and higher availability. For the frequently accessed hot data, users can still cache it on local high-speed disks for better performance.&lt;/p&gt;

&lt;p&gt;Single source of truth: All data is centralized in the shared storage layer, making it accessible across multiple compute clusters. This ensures data consistency, eliminates duplication, and simplifies data management.&lt;/p&gt;

&lt;p&gt;Workload flexibility: Users can dynamically adjust their compute resources to match different workloads. For example, batch processing, real-time analytics, and machine learning use cases vary in resource requirements. With storage and compute decoupled, enterprises can fine-tune resource usage for maximum efficiency across diverse operational demands.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data storage and management
&lt;/h4&gt;

&lt;p&gt;Apache Doris offers a rich set of data storage and management capabilities, supporting both mainstream lakehouse table formats like Iceberg and Hudi as well as its own highly optimized storage format. Beyond simply accommodating industry standards, Doris brings even greater flexibility and performance to the table.&lt;/p&gt;

&lt;p&gt;Semi-structured data support: Apache Doris natively supports semi-structured data types such as JSON and VARIANT to provide a schemaless experience that eliminates the overhead of manual data transformation and cleansing. Users can directly ingest raw JSON data, which Doris stores in a high-performance columnar format for complex analytics.&lt;/p&gt;
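
&lt;p&gt;A minimal sketch of this schemaless workflow (table and field names are hypothetical, and the single-replica property is for demonstration only; syntax may vary slightly across Doris versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table: raw JSON lands in a VARIANT column
CREATE TABLE events (
    id BIGINT,
    payload VARIANT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

INSERT INTO events VALUES
    (1, '{"level": "error", "source": {"ip": "10.0.0.1"}}');

-- Sub-fields are queried directly; Doris stores them columnarly
SELECT payload['level'], payload['source']['ip'] FROM events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;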

&lt;p&gt;Data updates: Doris enables near real-time data updates and efficient change data capture (CDC). Also, the partial column update capability allows users to easily merge multiple data streams into wide tables inside Doris, simplifying data pipelines.&lt;/p&gt;
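
&lt;p&gt;A hedged sketch of merging one stream's columns into a wide table (the table and column names are illustrative; partial column updates assume a merge-on-write unique-key table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: update only a subset of columns of a wide unique-key table
SET enable_unique_key_partial_update = true;

INSERT INTO user_profile (user_id, last_login)
VALUES (42, '2025-10-30 18:00:00');
-- Columns not listed keep their previous values for that key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;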

&lt;p&gt;Data indexing: Doris offers various indexing options, such as prefix indexes, inverted indexes, skiplist indexes, and BloomFilter indexes, to speed up query performance and minimize both local and network I/O, especially in compute-storage decoupled environments.&lt;/p&gt;
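
&lt;p&gt;For instance, an inverted index can be declared at table creation time to accelerate keyword search. This is a sketch with hypothetical table and column names; index properties depend on your Doris version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical log table with an inverted index on the message column
CREATE TABLE app_logs (
    ts DATETIME,
    msg STRING,
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY (ts)
DISTRIBUTED BY HASH (ts) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Keyword search served by the inverted index instead of a full scan
SELECT ts, msg FROM app_logs WHERE msg MATCH_ANY 'timeout';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;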

&lt;p&gt;Stream and batch writing: Doris supports both bulk batch loading and high-frequency writes through micro-batching. It leverages MVCC (Multi-Version Concurrency Control) to seamlessly manage both real-time and historical data within the same dataset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Openness
&lt;/h4&gt;

&lt;p&gt;The openness of a data lakehouse is key to data integration and management efficiency. As discussed earlier, Apache Doris offers strong support for open table formats and file formats. Beyond that, Doris ensures the same level of openness for its own storage. It provides an open storage API based on the Arrow Flight SQL protocol, combining the high performance of Arrow Flight with the usability of JDBC/ODBC. Through this interface, users can easily access data stored in Doris using ADBC clients for Python, Java, Spark, and Flink.&lt;/p&gt;

&lt;p&gt;Instead of relying solely on open file formats, Doris' open storage API abstracts away the underlying file format complexities, allowing it to fully exploit its advanced storage features like indexing for faster data retrieval. Meanwhile, the compute engine does not need to adapt to storage-level changes. This means that any engine supporting the protocol can seamlessly benefit from Doris' capabilities without additional integration work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The end
&lt;/h2&gt;

&lt;p&gt;The data lakehouse represents the future of unified analytics, but its success depends on overcoming performance and complexity barriers. Apache Doris combines the scalability of a data lake with the speed and reliability of a warehouse. It stays true to the idea of an open data lakehouse with boundless data and architecture, empowering it with real-time querying, elastic scalability, and open-source flexibility.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>lakehouse</category>
      <category>dataengineering</category>
      <category>olap</category>
    </item>
  </channel>
</rss>
