<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Apache Doris</title>
    <description>The latest articles on Forem by Apache Doris (@apachedoris).</description>
    <link>https://forem.com/apachedoris</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F868250%2F969aae9e-130a-4966-a0d8-84d4278b28fa.jpg</url>
      <title>Forem: Apache Doris</title>
      <link>https://forem.com/apachedoris</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/apachedoris"/>
    <language>en</language>
    <item>
      <title>Can I use Apache Doris with my existing RAG system?</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 28 Jan 2026 21:32:47 +0000</pubDate>
      <link>https://forem.com/apachedoris/can-i-use-apache-doris-with-my-existing-rag-system-3f2f</link>
      <guid>https://forem.com/apachedoris/can-i-use-apache-doris-with-my-existing-rag-system-3f2f</guid>
      <description>&lt;p&gt;This question came up in our recent webinar Q&amp;amp;A [video below👇]. &lt;br&gt;
The short answer: Yes. Apache Doris can replace your existing vector store (ChromaDB, Pinecone, Milvus...), but your chunking, embedding pipeline, and application logic stay exactly as they are.&lt;/p&gt;

&lt;p&gt;A lot of RAG system infrastructure today looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Postgres for structured data&lt;/li&gt;
&lt;li&gt;Pinecone/ChromaDB/Milvus/Weaviate for vectors&lt;/li&gt;
&lt;li&gt;Some even add Elasticsearch for keyword search&lt;/li&gt;
&lt;li&gt;Your app stitches results together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what if "clients want to query their database with an LLM, not just text, but structured and unstructured data together?"&lt;/p&gt;

&lt;p&gt;When your vectors, keywords, and metadata live in different systems, it's hard to run a search like this efficiently: "find Python engineers in San Francisco hired in 2024 with similar backgrounds to this resume."&lt;/p&gt;

&lt;p&gt;But with Apache Doris, a real-time database that now supports hybrid search and vector search, you can run those searches in one SQL query, in one database, on one unified system.&lt;/p&gt;
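&lt;p&gt;As a rough sketch only (the table, column, and distance-function names below are illustrative, not Apache Doris' exact syntax), such a search collapses into a single statement that filters on metadata, matches keywords, and ranks by vector similarity at once:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative only: hybrid search in one SQL query (names are hypothetical)
SELECT id, name, hired_at
FROM engineers
WHERE city = 'San Francisco'
  AND YEAR(hired_at) = 2024
  AND resume_text MATCH_ANY 'Python'               -- keyword search via inverted index
ORDER BY l2_distance(resume_embedding, :query_vec) -- vector similarity to the given resume
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;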

&lt;p&gt;If you're running RAG in production, juggling multiple databases, and facing cost and performance issues, it might be worth asking: what if you didn't have to?&lt;br&gt;
🔗 See how ByteDance uses Apache Doris' hybrid search to cut down vector search cost: &lt;a href="https://www.velodb.io/blog/bytedance-solved-billion-scale-vector-search-problem-with-apache-doris-4-0?utm_source=linkedin" rel="noopener noreferrer"&gt;https://www.velodb.io/blog/bytedance-solved-billion-scale-vector-search-problem-with-apache-doris-4-0?utm_source=linkedin&lt;/a&gt;&lt;br&gt;
🔗 Watch the webinar in full: &lt;a href="https://www.youtube.com/watch?v=kKiXWNWZYVc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=kKiXWNWZYVc&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>apachedoris</category>
      <category>vectorsearch</category>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from PostgreSQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 20 Jan 2026 22:14:48 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-5aem</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-5aem</guid>
      <description>&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;When migrating data from PostgreSQL (including PostgreSQL-compatible Amazon Aurora) to VeloDB, Flink can be introduced as a real-time synchronization engine to keep the data consistent and fresh. Flink's high-throughput, low-latency stream processing enables efficient full data loading and incremental change handling for databases.&lt;/p&gt;

&lt;p&gt;For real-time synchronization, PostgreSQL's Logical Replication can be enabled to capture CDC (Change Data Capture) events. Whether on a self-hosted PostgreSQL or on cloud-based Amazon Aurora PostgreSQL, Flink CDC can subscribe to changes once the logical decoding plugin is enabled and a replication slot is created, thereby achieving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full data initial load: First import business data from PostgreSQL/Aurora into VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time synchronization of incremental changes: Capture Insert/Update/Delete operations based on Logical Replication and continuously write them to VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
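
&lt;p&gt;For reference, on a self-hosted PostgreSQL the logical-replication prerequisites can be checked (and a slot created) with standard commands; the slot name below is just an example, and Flink CDC can also manage its own slot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- wal_level must be 'logical' for logical decoding
SHOW wal_level;

-- Optionally pre-create a replication slot using the built-in pgoutput plugin
SELECT * FROM pg_create_logical_replication_slot('flink_cdc_slot', 'pgoutput');

-- Inspect existing slots
SELECT slot_name, plugin, active FROM pg_replication_slots;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;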

&lt;p&gt;The following takes Amazon Aurora-PostgreSQL as an example to demonstrate how to use Flink CDC to subscribe to Aurora changes and synchronize them to VeloDB in real time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Create an AWS RDS Aurora PostgreSQL instance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nplp5dpn0lqx2ju1jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nplp5dpn0lqx2ju1jy.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Create a VeloDB warehouse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qwgmdt5sc9vi340nv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qwgmdt5sc9vi340nv2.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Create a PostgreSQL database and corresponding tables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Load data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
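
&lt;p&gt;Before starting the sync job, it is worth confirming the sample data landed as expected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run in PostgreSQL; should report the 5 rows inserted above
SELECT COUNT(*) FROM public.student;
SELECT id, name, score FROM public.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;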



&lt;h2&gt;
  
  
  4. Enable PostgreSQL Logical Replication
&lt;/h2&gt;

&lt;p&gt;Create a parameter group and modify the rds.logical_replication configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktav4qgd42kxpxloophx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktav4qgd42kxpxloophx.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wh63dd8gn23ae8y40df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wh63dd8gn23ae8y40df.png" alt=" " width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modify the PostgreSQL configuration: replace the DB Cluster Parameter Group with the one created above, apply the changes, and restart the service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqsmhscpfd8c3pn3art2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqsmhscpfd8c3pn3art2.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;
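
&lt;p&gt;Once the instance has restarted, you can confirm from a SQL session that logical replication is actually in effect:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both should now reflect the new parameter group
SHOW rds.logical_replication;  -- expect: on
SHOW wal_level;                -- expect: logical
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;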

&lt;h2&gt;
  
  
  5. Install Flink With Doris Connector
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Download the pre-built installation package
&lt;/h3&gt;

&lt;p&gt;Based on Flink 1.17, we provide a pre-built installation package that can be downloaded and extracted directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Manual installation
&lt;/h3&gt;

&lt;p&gt;If you already have a Flink environment or need a different Flink version, you can install manually. Taking Flink 1.17 as an example, download the Flink installation package and its dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Postgres CDC Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Doris Connector&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the download is complete, extract the Flink installation package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then place the Flink PostgreSQL CDC Connector and the Doris Connector jars into the flink-1.17.2/lib directory.&lt;/p&gt;

&lt;p&gt;As follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2oq577dh7dkjn4hvhm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2oq577dh7dkjn4hvhm3.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Submit the Flink synchronization job
&lt;/h2&gt;

&lt;p&gt;When the job is submitted, the Doris Connector automatically creates the corresponding tables in VeloDB based on the table structure of the upstream PostgreSQL database.&lt;/p&gt;

&lt;p&gt;Flink supports job submission and operation in modes such as Local, Standalone, and Yarn. If you already have a Flink environment, you can directly submit the job to your own Flink environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Local Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
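
&lt;p&gt;Once the job is running, you can check the result on the VeloDB side. VeloDB speaks the MySQL protocol (the 9030 port in the jdbc-url above), so any MySQL client works; the database and table names below assume the connector's defaults:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against VeloDB through a MySQL-protocol client
SELECT COUNT(*) FROM test_db.student;  -- should match the PostgreSQL row count
SELECT id, name, score FROM test_db.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;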



&lt;h3&gt;
  
  
  6.2 Standalone Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Yarn Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.4 K8S Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--postgres&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: For more Connector parameters, refer to the Flink Doris Connector documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Verify Historical Data Synchronization
&lt;/h2&gt;

&lt;p&gt;On its first run, the Flink job synchronizes the full historical data. Check the synchronization status in VeloDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbxrqb52w0m6fqi56xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbxrqb52w0m6fqi56xi.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Verify Real-Time Data Synchronization
&lt;/h2&gt;

&lt;p&gt;For scenarios that require capturing deleted rows, enable the following configuration in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For details, refer to the PostgreSQL documentation on REPLICA IDENTITY.&lt;/p&gt;

&lt;p&gt;Perform data modifications in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
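Flink CDC turns each of these statements into a change event that the Doris sink applies to the target table. A toy in-memory sketch of that apply logic (simplified event shapes, not the connector's actual code; real events are Debezium-style records):

```python
# Toy model: apply CDC events to a table keyed by primary key.
def apply_cdc(table, events):
    for op, row in events:
        if op in ("insert", "update"):  # upsert semantics on the primary key
            table[row["id"]] = row
        elif op == "delete":            # needs REPLICA IDENTITY FULL upstream
            table.pop(row["id"], None)
    return table

# State before the changes (abbreviated rows), then the three statements above.
table = {2: {"id": 2, "score": 91.0, "age": 22}, 3: {"id": 3}}
events = [
    ("insert", {"id": 6, "name": "Frank Zhao", "age": 24, "score": 88.75}),
    ("delete", {"id": 3}),
    ("update", {"id": 2, "score": 95.0, "age": 23}),
]
apply_cdc(table, events)
```

Note the delete case: without REPLICA IDENTITY FULL, PostgreSQL may not emit enough of the old row for the sink to identify which record to remove.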



&lt;p&gt;Verify data changes in VeloDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftifdboaadc92ivtsw2vk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftifdboaadc92ivtsw2vk.png" alt=" " width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>bigdata</category>
      <category>database</category>
      <category>doris</category>
    </item>
    <item>
      <title>Apache Doris IP change problem handling method</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 18 Dec 2025 19:32:05 +0000</pubDate>
      <link>https://forem.com/apachedoris/apache-doris-ip-change-problem-handling-method-3o6k</link>
      <guid>https://forem.com/apachedoris/apache-doris-ip-change-problem-handling-method-3o6k</guid>
      <description>&lt;h2&gt;
  
  
  Background note
&lt;/h2&gt;

&lt;p&gt;A host may have multiple IPs, for example because it has multiple network interface cards or because virtual interfaces were created by Docker or similar environments. Apache Doris does not automatically pick the right IP, so on a host with multiple IPs you must force the correct one through the priority_networks configuration item.&lt;/p&gt;

&lt;p&gt;priority_networks exists for both FE and BE and is written in fe.conf and be.conf respectively. It tells the process which IP to bind to when FE or BE starts. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;priority_networks = 10.1.3.0/24&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is CIDR notation. FE or BE uses this configuration to find a matching IP as its local IP.&lt;/p&gt;

&lt;p&gt;CIDR uses slash notation: an IP address followed by a slash and the number of network ID bits. The following two examples show the conversion.&lt;/p&gt;

&lt;p&gt;① 192.168.0.0/16, converted to a 32-bit binary address: 11000000.10101000.00000000.00000000. The /16 means the network ID is 16 bits, i.e. the first 16 bits of the address are fixed, corresponding to the network segment 11000000.10101000.00000000.00000000 ~ 11000000.10101000.11111111.11111111.&lt;/p&gt;

&lt;p&gt;② 192.168.1.2/24, converted to a 32-bit binary address: 11000000.10101000.00000001.00000010. The /24 means the first 24 bits of the address are fixed, corresponding to the network segment 11000000.10101000.00000001.00000000 ~ 11000000.10101000.00000001.11111111.&lt;/p&gt;
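The containment test described above (does a host IP fall inside the configured segment?) is exactly what the priority_networks matching amounts to. The two conversions can be checked with Python's standard ipaddress module, shown here purely for illustration:

```python
import ipaddress

# Example ①: 192.168.0.0/16 fixes the first 16 bits, leaving 2**16 addresses.
net16 = ipaddress.ip_network("192.168.0.0/16")

# Example ②: the /24 network containing host 192.168.1.2 is 192.168.1.0/24.
net24 = ipaddress.ip_interface("192.168.1.2/24").network

in16 = ipaddress.ip_address("192.168.31.78") in net16  # inside the /16 segment
in24 = ipaddress.ip_address("192.168.2.9") in net24    # wrong third octet
```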

&lt;p&gt;In the following scenarios, the IP changes and FE/BE can no longer start or operate normally:&lt;/p&gt;

&lt;p&gt;① A cluster migration changes the IP network segment&lt;/p&gt;

&lt;p&gt;② A dynamic address in a virtualized environment changes the IP&lt;/p&gt;

&lt;p&gt;③ priority_networks is not configured correctly before a restart, so the IP obtained after restarting is inconsistent with the metadata&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Hardware information
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CPU model: ARM64&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory: 2GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hard drive: 36GB SSD&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Software information
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;VM image: CentOS 7&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Doris version: 1.2.4 (other versions are also acceptable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cluster size: 1FE * 3BE&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  FE recovery
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3. Exception log
&lt;/h2&gt;

&lt;p&gt;Checking fe.out shows the following exception, and the FE process cannot start at this point.&lt;/p&gt;

&lt;p&gt;Before operating, back up all FE metadata and stop upstream reads and writes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0iricmlt6c9jkq7mhn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0iricmlt6c9jkq7mhn5.png" alt=" " width="800" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Get the current IP
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkctuqmlnfi4p5mowltsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkctuqmlnfi4p5mowltsz.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Reset IP information
&lt;/h2&gt;

&lt;p&gt;After resetting the IP information, the above exception is still reported; the metadata also needs to be reset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# modify fe.conf priority_networks&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.0.0/16
&lt;span class="c"&gt;# or use this&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.31.78/16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
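Both values shown above select the same network: masking the host bits of 192.168.31.78/16 yields exactly 192.168.0.0/16, so either form matches the host's new IP. A quick check with Python's standard ipaddress module (illustrative only):

```python
import ipaddress

# priority_networks accepts a pure network or an address with a prefix length;
# stripping the host bits of 192.168.31.78/16 gives the same /16 network.
net_a = ipaddress.ip_network("192.168.0.0/16")
net_b = ipaddress.ip_interface("192.168.31.78/16").network

same = (net_a == net_b)                                   # equivalent configs
matches = ipaddress.ip_address("192.168.31.78") in net_a  # the FE host matches
```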



&lt;h2&gt;
  
  
  6. Reset metadata record
&lt;/h2&gt;

&lt;p&gt;After resetting the metadata record, the FE process can start but is not usable; it still requires metadata recovery mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Annotate out the old ips previously recorded in the fe metadata&lt;/span&gt;
vim doris-meta/image/ROLE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1epocqznlooj77pa8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn1epocqznlooj77pa8n.png" alt=" " width="632" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Metadata mode recovery
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add metadata_failure_recovery=true to fe.conf to restart fe in recovery mode&lt;/span&gt;
vim fe.conf
&lt;span class="nv"&gt;metadata_failure_recovery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="c"&gt;# Then go to http://192.168.31.78:8030/login, if you can open the fe web UI, it can be normal boot fe&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7d6asfc00avgs9g7g7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fru7d6asfc00avgs9g7g7.png" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Reset fe cluster node
&lt;/h2&gt;

&lt;p&gt;Although FE can now start in metadata recovery mode, it is not fully restored: the FE nodes recorded in the metadata still list the old IP, not the newly configured one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Execute&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;following&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;web&lt;/span&gt; &lt;span class="n"&gt;ui&lt;/span&gt; &lt;span class="n"&gt;Playground&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="n"&gt;recorded&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;remove&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.81:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; 
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.78:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old IP nodes are as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9p3p5tzm4wxotjn2c49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9p3p5tzm4wxotjn2c49.png" alt=" " width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new IP node after reset is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjceyqgj9etct4myoqya8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjceyqgj9etct4myoqya8.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Turn off metadata mode and restart FE
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Annotate metadata_failure_recovery=true in fe.conf Turn off recovery mode and restart fe&lt;/span&gt;
vim fe.conf
&lt;span class="c"&gt;#metadata_failure_recovery=true&lt;/span&gt;

&lt;span class="c"&gt;# and then go to http://192.168.31.78:8030/login, if you can open the fe web UI, fe completely restored&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  BE Recovery
&lt;/h1&gt;

&lt;h2&gt;
  
  
  10. Get the current IP
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wu4vuux51w339m13xds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wu4vuux51w339m13xds.png" alt=" " width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Reset IP information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# modify be.conf priority_networks&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.0.0/16
&lt;span class="c"&gt;# or use this&lt;/span&gt;
priority_networks &lt;span class="o"&gt;=&lt;/span&gt; 192.168.31.136/16
&lt;span class="c"&gt;# After setting, restart be&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  12. Reset BE cluster node
&lt;/h2&gt;

&lt;p&gt;Although BE can now start, it is not fully restored: the BE nodes recorded in the FE metadata still list the old IPs, not the newly configured ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Execute&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;following&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;web&lt;/span&gt; &lt;span class="n"&gt;ui&lt;/span&gt; &lt;span class="n"&gt;Playground&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="n"&gt;recorded&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;fe&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;remove&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.81:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.72:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;DROPP&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.133:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;add&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.78:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.71:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FOLLOWER&lt;/span&gt; &lt;span class="nv"&gt;"192.168.31.136:9010"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After all three BEs were reset, they were fully restored as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcd1j8cplbvn8c63a1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcd1j8cplbvn8c63a1s.png" alt=" " width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the Apache Doris cluster failure caused by the IP change has been fully resolved.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apachedoris</category>
      <category>database</category>
      <category>olap</category>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from PostgreSQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 17 Dec 2025 22:04:51 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-188l</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-postgresql-to-velodb-188l</guid>
      <description>&lt;p&gt;Migrating data from PostgreSQL (or Amazon Aurora-PostgreSQL) to VeloDB while ensuring &lt;strong&gt;real-time consistency&lt;/strong&gt; can be a challenge—luckily, Flink CDC (Change Data Capture) solves this problem with high throughput and low latency. This step-by-step guide will walk you through using Flink CDC to sync data from Aurora-PostgreSQL to VeloDB, covering full data loading and incremental change capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When syncing PostgreSQL/Aurora to VeloDB, Flink acts as the real-time stream processing engine, and PostgreSQL’s &lt;strong&gt;Logical Replication&lt;/strong&gt; captures CDC events. This combination enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full data initial load&lt;/strong&gt;: Import existing business data from PostgreSQL/Aurora to VeloDB in one go.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time incremental sync&lt;/strong&gt;: Capture &lt;code&gt;INSERT/UPDATE/DELETE&lt;/code&gt; operations from PostgreSQL and write them to VeloDB continuously.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use &lt;strong&gt;Amazon Aurora-PostgreSQL&lt;/strong&gt; as the source and VeloDB as the sink to demonstrate the entire process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An AWS RDS Aurora PostgreSQL instance (or self-hosted PostgreSQL).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A VeloDB warehouse (with FE nodes accessible via network).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17+ environment (we’ll cover both pre-built and manual installation).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network connectivity between Flink, PostgreSQL/Aurora, and VeloDB (e.g., security groups, VPC peering).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up Aurora-PostgreSQL &amp;amp; Test Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Create an Aurora-PostgreSQL Instance
&lt;/h3&gt;

&lt;p&gt;First, create an AWS RDS Aurora PostgreSQL instance (skip this if you already have one).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63eaxw7u15r5hz0ftd4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63eaxw7u15r5hz0ftd4t.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Create a Database and Table
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fystj4pzk6jatc1gncja8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fystj4pzk6jatc1gncja8.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connect to your Aurora-PostgreSQL instance and run the following SQL to create a test database and table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 创建表&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- 插入数据&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.3 Enable PostgreSQL Logical Replication
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7ip60grsuaa9rznhc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7ip60grsuaa9rznhc2.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frll6iafjaqdp4wlxgwdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frll6iafjaqdp4wlxgwdl.png" alt=" " width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modify the PostgreSQL configuration: attach the DB cluster parameter group you just created, apply the changes, and reboot the instance so the new settings take effect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr57an6efml4hp0rx71a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr57an6efml4hp0rx71a.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Install Flink with Doris/VeloDB Connector
&lt;/h2&gt;

&lt;p&gt;VeloDB is compatible with the &lt;strong&gt;Flink Doris Connector&lt;/strong&gt;, so we’ll use that to connect Flink to VeloDB. You can choose either the pre-built package or manual installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Pre-Built Installation (Simplest)
&lt;/h3&gt;

&lt;p&gt;We provide a pre-built Flink 1.17 package with all required connectors (PostgreSQL CDC + Doris/VeloDB). Simply download and extract it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Manual Installation (For Existing Flink Environments)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fwutnd2p3aoxyku2z9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fwutnd2p3aoxyku2z9w.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you already have Flink 1.17 installed, download the required dependencies and add them to the &lt;code&gt;lib&lt;/code&gt; directory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Download Flink 1.17.2: &lt;a href="https://flink.apache.org/downloads.html#flink-117" rel="noopener noreferrer"&gt;Flink 1.17.2 Download&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Flink PostgreSQL CDC Connector: &lt;a href="https://mvnrepository.com/artifact/com.ververica/flink-connector-postgres-cdc/2.4.0" rel="noopener noreferrer"&gt;Flink Postgres CDC&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Flink Doris Connector: &lt;a href="https://repo.maven.apache.org/maven2/org/apache/doris/flink-doris-connector/" rel="noopener noreferrer"&gt;Flink Doris Connector&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Submit the Flink CDC Sync Job
&lt;/h2&gt;

&lt;p&gt;The Flink Doris Connector will &lt;strong&gt;automatically create corresponding tables in VeloDB&lt;/strong&gt; based on the PostgreSQL table structure, so no manual DDL is needed on the VeloDB side. We’ll cover job submission in four common environments: Local, Standalone, YARN, and Kubernetes.&lt;/p&gt;
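&lt;p&gt;Once a sync job is up, you can inspect the auto-created table on the VeloDB side to confirm the schema mapping. A quick check (the database and table names mirror the PostgreSQL source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against VeloDB over its MySQL-compatible protocol (port 9030)
SHOW CREATE TABLE test_db.student;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;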

&lt;h3&gt;
  
  
  Important Notes Before Submission
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Port&lt;/strong&gt;: Aurora PostgreSQL listens on port &lt;strong&gt;5432&lt;/strong&gt; by default (3306 is the MySQL port), so make sure &lt;code&gt;--postgres-conf port&lt;/code&gt; in the commands below matches your instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Name&lt;/strong&gt;: &lt;code&gt;--postgres-conf database-name&lt;/code&gt; must match the source database exactly (&lt;code&gt;test_db&lt;/code&gt; in this tutorial); a mismatched value such as &lt;code&gt;test&lt;/code&gt; will cause the job to find no tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customize Params&lt;/strong&gt;: Replace placeholder values (e.g., &lt;code&gt;hostname&lt;/code&gt;, &lt;code&gt;fenodes&lt;/code&gt;, &lt;code&gt;password&lt;/code&gt;) with your actual credentials.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.1 Local Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Standalone Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 Yarn Environment (Per-Job Mode)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.4 Kubernetes Environment (Session Mode)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    postgres-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--postgres&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.xxx.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--postgres-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 &lt;strong&gt;Tip&lt;/strong&gt;: For more Flink Doris Connector parameters, check the &lt;a href="https://doris.apache.org/docs/data-operate/load/flink-doris-connector" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Verify Data Sync
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Verify Full Historical Data Sync
&lt;/h3&gt;

&lt;p&gt;The Flink job will first sync all existing data from PostgreSQL to VeloDB. Connect to your VeloDB warehouse and query the &lt;code&gt;student&lt;/code&gt; table to confirm the data is present.  &lt;/p&gt;
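&lt;p&gt;For example, querying VeloDB over its MySQL-compatible protocol (the expected counts assume the five sample rows inserted in Step 1.2):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- All five historical rows should be present
SELECT COUNT(*) FROM test_db.student;

-- Spot-check the synced values
SELECT id, name, score FROM test_db.student ORDER BY id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;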

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5zxhei6ceh095s5wjai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5zxhei6ceh095s5wjai.png" alt=" " width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Verify Real-Time Incremental Sync
&lt;/h3&gt;

&lt;p&gt;To capture &lt;strong&gt;DELETE&lt;/strong&gt; operations (required for full incremental sync), first enable full replica identity on the PostgreSQL table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📚 &lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY" rel="noopener noreferrer"&gt;PostgreSQL Replica Identity Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, modify data in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check VeloDB to confirm the changes are synced in real-time:  &lt;/p&gt;
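
&lt;p&gt;For example (again assuming the same table name in VeloDB), each change type can be verified directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- INSERT: the new row should appear
SELECT * FROM student WHERE id = 6;

-- DELETE: this should return no rows
SELECT * FROM student WHERE id = 3;

-- UPDATE: score should be 95.00 and age 23
SELECT score, age FROM student WHERE id = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;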

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4odo1z9ehcip275qo142.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4odo1z9ehcip275qo142.png" alt=" " width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL Port Mismatch&lt;/strong&gt;: Don’t use 3306 (MySQL) for PostgreSQL—use 5432 instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logical Replication Not Enabled&lt;/strong&gt;: Without &lt;code&gt;rds.logical_replication = 1&lt;/code&gt;, CDC events won’t be captured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replica Identity Missing&lt;/strong&gt;: For DELETE operations, &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; is required (otherwise, deletes won’t sync).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Connectivity&lt;/strong&gt;: Ensure Flink can reach Aurora-PostgreSQL (5432) and VeloDB (8080, 9030) via security groups/VPC.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Flink CDC provides a robust, real-time way to sync data from PostgreSQL/Aurora to VeloDB, covering both full data loads and incremental changes. By following this guide, you can set up a reliable sync pipeline with minimal effort. If you run into issues, check the Flink and VeloDB logs for details, or refer to the official documentation for additional parameters.&lt;/p&gt;

&lt;p&gt;Happy syncing! 🚀&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>postgressql</category>
      <category>apachedoris</category>
      <category>database</category>
    </item>
    <item>
      <title>Agent Facing Analytics with High Concurrency: Doris vs Clickhouse vs Snowflake</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 10 Dec 2025 21:09:50 +0000</pubDate>
      <link>https://forem.com/apachedoris/agent-facing-analytics-with-high-concurrency-doris-vs-clickhouse-vs-snowflake-18ij</link>
      <guid>https://forem.com/apachedoris/agent-facing-analytics-with-high-concurrency-doris-vs-clickhouse-vs-snowflake-18ij</guid>
      <description>&lt;p&gt;Data warehouses have evolved drastically over the past 30 years—from BI-driven legacy systems to big data-powered modern platforms. Now, with the explosion of GenAI and LLM applications, we're entering a new era where data warehouses must seamlessly integrate with AI workflows, support real-time agent interactions, and deliver extreme performance at scale. Apache Doris 4.0 emerges as the game-changer, combining enterprise-grade analytics with AI-native capabilities to meet the demands of today's intelligent applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Data Warehouses: From Legacy to AI-Native
&lt;/h2&gt;

&lt;p&gt;Let's trace the journey of data warehouses and understand how AI is reshaping their core requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy Data Warehouses (BI-Driven)
&lt;/h3&gt;

&lt;p&gt;The first generation of data warehouses separated analytical data from transactional systems to handle large volumes of historical data (e.g., daily trading reports for stockbrokers). However, they quickly hit walls in the big data era:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Expensive hardware upgrades with limited horizontal scaling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: On-premise deployments required specialized hardware and high maintenance costs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced Analytics&lt;/strong&gt;: Poor support for real-time insights, AI, and ML&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Rigid architectures unable to adapt to new use cases or diverse data sources&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Modern Data Warehouses (Big Data-Driven)
&lt;/h3&gt;

&lt;p&gt;Post-2000, the mobile internet and e-commerce boom drove the need for more agile analytics. Modern data warehouses addressed legacy limitations with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stateless Compute/Storage&lt;/strong&gt;: Lower overhead for scaling resources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-Latency&lt;/strong&gt;: Sub-second response times for user queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-Concurrency&lt;/strong&gt;: Effortlessly handles thousands of concurrent workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Workloads&lt;/strong&gt;: Supports ad-hoc queries, ETL, and batch processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Federated Queries&lt;/strong&gt;: Breaks data silos by unifying access to data lakes, transactional DBs, and more&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouses in the AI Era
&lt;/h3&gt;

&lt;p&gt;ISG Research predicts: &lt;em&gt;"Through 2027, almost all enterprises developing GenAI applications will invest in data platforms with vector search and retrieval-augmented generation (RAG) to complement foundation models with proprietary data."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LLMs thrive on high-quality data—for both training and inference. AI-driven applications require data warehouses to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance Volume &amp;amp; Quality&lt;/strong&gt;: High-quality data directly impacts model performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dual-Purpose Data&lt;/strong&gt;: Support both model training and real-time inference&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Freshness&lt;/strong&gt;: Handle continuous data read/write with near-zero latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent-Friendly&lt;/strong&gt;: Enable autonomous AI agents to interact without human intervention&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-First Design&lt;/strong&gt;: Natively support LLM functions, vector storage, and high-performance vector I/O&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: Agentic-Facing Analytics
&lt;/h2&gt;

&lt;p&gt;Traditional BI and OLAP systems are built for &lt;em&gt;passive, historical reporting&lt;/em&gt;—a handful of users running heavy queries with generous latency tolerance. AI changes this with &lt;strong&gt;agentic-facing analytics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Proactive, autonomous AI agents that reason, analyze in real-time, and trigger actions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workloads shift to: &lt;em&gt;"Massive users (agents), light/iterative queries, zero latency tolerance"&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Requires millisecond response times for thousands of concurrent queries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Legacy OLAP systems can't keep up—their pre-aggregated data cubes, batch processing, and data silos create bottlenecks for agentic workflows. The solution? A &lt;strong&gt;semantics-and-response-centric architecture&lt;/strong&gt; that prioritizes flexibility, real-time access, and unified data context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Doris Outperforms Competitors: Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Apache Doris (and its commercial distribution VeloDB) sets a new standard for performance across key analytics benchmarks. We compared it against Snowflake and ClickHouse Cloud with equivalent compute resources (128 cores for VeloDB/ClickHouse, XL-size cluster for Snowflake) using Apache JMeter to measure QPS at 10/30/50 parallelisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Overview
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Key Findings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSB-FLAT&lt;/td&gt;
&lt;td&gt;Single wide-table queries (no joins)&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 4.76–7.39x, ClickHouse 4.76–6.92x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSB (Star Schema)&lt;/td&gt;
&lt;td&gt;Join-heavy analytics&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 5.17–6.37x; ClickHouse failed most join queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TPC-H&lt;/td&gt;
&lt;td&gt;Complex ad-hoc decision support&lt;/td&gt;
&lt;td&gt;VeloDB outperforms Snowflake 1.71–3.10x; ClickHouse couldn’t run all queries (Q20/Q21/Q22 failed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex Joins&lt;/strong&gt;: Doris excels at join-heavy workloads (SSB/TPC-H) thanks to its advanced optimizer and execution engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Concurrency&lt;/strong&gt;: Maintains performance at scale (50 parallelisms) while competitors struggle with memory or parsing errors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wide-Table Performance&lt;/strong&gt;: Even in single-table scans (SSB-FLAT), outperforms purpose-built systems like ClickHouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Efficiency&lt;/strong&gt;: Delivers more throughput per compute unit than Snowflake’s elastic architecture&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deep Dive: Apache Doris Core Technologies
&lt;/h2&gt;

&lt;p&gt;Apache Doris’s performance and AI readiness stem from its innovative architecture. Let’s explore the key features powering its success.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Pruning: "Don’t Process Unnecessary Data"
&lt;/h3&gt;

&lt;p&gt;The most efficient way to process data is to avoid processing it entirely. Doris uses two types of pruning:&lt;/p&gt;

&lt;h4&gt;
  
  
  Static Filters (Pre-Execution)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition Pruning&lt;/strong&gt;: The frontend (FE) uses partition metadata to skip irrelevant partitions (e.g., time-based partitions outside the queried date range)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key Column Pruning&lt;/strong&gt;: Data is sorted by key columns—binary search narrows down the row range to scan&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value Column Pruning&lt;/strong&gt;: Column files store min/max metadata to skip files that can’t match predicates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dynamic Filters (Post-Execution)
&lt;/h4&gt;

&lt;p&gt;For joins, filters are generated after building hash tables on the build side. This prunes irrelevant data on the probe side before joining, reducing join overhead.&lt;/p&gt;
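
&lt;p&gt;As an illustration (with hypothetical &lt;code&gt;sales&lt;/code&gt; and &lt;code&gt;dim_store&lt;/code&gt; tables), this is the query shape that benefits: the filter built from the small dimension side prunes the large fact-table scan before the join:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- dim_store (build side) yields few store_ids after the region filter;
-- a runtime filter on those store_ids prunes the sales (probe side) scan
SELECT s.store_id, SUM(s.amount) AS total
FROM sales s
JOIN dim_store d ON s.store_id = d.store_id
WHERE d.region = 'EMEA'
GROUP BY s.store_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;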

&lt;h3&gt;
  
  
  2. Advanced Pruning Optimizations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  LIMIT Pruning
&lt;/h4&gt;

&lt;p&gt;Pushes LIMIT clauses down to data scanning—stops processing once the required number of rows is retrieved.&lt;/p&gt;

&lt;h4&gt;
  
  
  TopK Pruning
&lt;/h4&gt;

&lt;p&gt;Optimizes TopK queries (e.g., "top 10 highest-grossing products") with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Local truncation in scanning threads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global merge sort via a coordinator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two-phase execution: first sort key columns to get row indices, then fetch required columns—avoids full data scans&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
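
&lt;p&gt;The target query shape, sketched against a hypothetical &lt;code&gt;products&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Phase one sorts only the sort key (gross_revenue) to find row indices;
-- the remaining columns are fetched for just the 10 winning rows
SELECT product_id, product_name, gross_revenue
FROM products
ORDER BY gross_revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;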

&lt;h4&gt;
  
  
  Join Pruning
&lt;/h4&gt;

&lt;p&gt;Reduces probe-side data for hash joins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Uses build-side hash table values to filter probe-side data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimizes data transfer and join computation (O(M+N) complexity vs. O(M*N) for Cartesian product)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Pipeline Engine: Efficient Execution at Scale
&lt;/h3&gt;

&lt;p&gt;Doris uses a coroutine-like pipeline engine to maximize CPU utilization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Yields CPU during blocking operations (disk I/O, network I/O in joins/exchanges)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eliminates thread switching overhead with task scheduling triggered by external events (e.g., RPC completion)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independent parallelism per pipeline (not constrained by tablet count)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even data distribution to minimize skewing via local exchange optimization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shared states across pipeline tasks (reduces initialization overhead)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Vectorized Query Execution
&lt;/h3&gt;

&lt;p&gt;Processes data in batches (vectors) instead of row-by-row, leveraging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SIMD (Single Instruction, Multiple Data) CPU instructions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loop unrolling to reduce branch mispredictions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accelerated compression, computation, and data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delivers 2–10x performance gains for analytical queries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI-Native Capabilities in Apache Doris 4.0
&lt;/h2&gt;

&lt;p&gt;Apache Doris 4.0 is built for the AI era with native support for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector Search&lt;/strong&gt;: High-performance storage and retrieval of feature vectors for LLM inference&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG Integration&lt;/strong&gt;: Seamlessly connects with LLMs to augment generation with proprietary data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Functions&lt;/strong&gt;: Built-in UDFs for ML/LLM workflows (e.g., embedding generation, text processing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;: Native Model Context Protocol (MCP) server support, so LLM-based agents and tools can connect to Doris directly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent Compatibility&lt;/strong&gt;: Designed for programmatic access by AI agents with low-latency responses&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI revolution demands data warehouses that are fast, flexible, and AI-native. Apache Doris 4.0 delivers on all fronts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Outperforms competitors in complex joins, high concurrency, and wide-table analytics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Features like data pruning, pipeline engine, and vectorized execution enable millisecond response times&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI-native capabilities (vector search, RAG, agent support) integrate seamlessly with GenAI workflows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams building AI-driven applications, Apache Doris isn’t just a data warehouse—it’s the foundation for intelligent, real-time analytics that powers the next generation of products and decision-making.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>ai</category>
      <category>apachedoris</category>
      <category>database</category>
    </item>
    <item>
      <title>Deploying Apache Doris with Storage-Compute Separation Using MinIO: A Practical Guide</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Fri, 05 Dec 2025 22:05:49 +0000</pubDate>
      <link>https://forem.com/apachedoris/deploying-apache-doris-with-storage-compute-separation-using-minio-a-practical-guide-381j</link>
      <guid>https://forem.com/apachedoris/deploying-apache-doris-with-storage-compute-separation-using-minio-a-practical-guide-381j</guid>
      <description>&lt;p&gt;Modern data processing faces multiple challenges. The ever-growing volume of data drives up traditional storage costs, especially with unstructured data becoming more prevalent. Data quality issues further increase the burden of storage and cleansing. Additionally, enterprises often struggle with data integration across multiple internal systems, which raises the bar for efficient and cost-effective data analytics.&lt;/p&gt;

&lt;p&gt;Apache Doris, a high-performance real-time analytics database with lakehouse capabilities, combined with MinIO, a high-performance S3-compatible object storage system, offers a powerful solution. Together, they enable an efficient, low-cost data analytics platform. This article explores the strengths of Apache Doris and MinIO and provides a step-by-step deployment guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Choose Apache Doris and MinIO?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apache Doris: High-Performance Real-Time Analytics Database
&lt;/h3&gt;

&lt;p&gt;Apache Doris is built on an MPP (Massively Parallel Processing) architecture, known for its efficiency, simplicity, and versatility—delivering sub-second query results on massive datasets. Key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Performance&lt;/strong&gt;: Sub-second responses for large datasets, supporting high-concurrency point queries and complex analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Analytics&lt;/strong&gt;: Enables real-time data ingestion and querying for instant insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Streamlined design with low operational and maintenance costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Horizontal scaling via MPP to handle large-scale data and high-concurrency workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Scenario Support&lt;/strong&gt;: Ideal for reports, ad-hoc queries, user profiling, log retrieval, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Robust Integration&lt;/strong&gt;: Seamlessly works with MySQL, PostgreSQL, Hive, Flink, and other tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Active Community&lt;/strong&gt;: Backed by 600+ contributors, deployed in production by 5,000+ organizations (including TikTok, Baidu).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Doris supports two deployment modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Integrated storage-compute (data stored internally)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate storage-compute (uses third-party storage like MinIO)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MinIO: High-Performance Object Storage
&lt;/h3&gt;

&lt;p&gt;MinIO is an open-source, distributed object storage system optimized for cloud-native workloads. Core strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Performance&lt;/strong&gt;: Fast data access to meet real-time analytics demands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Horizontal scaling for growing data volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Effectiveness&lt;/strong&gt;: Open-source, on-premises deployable (avoids cloud storage premiums).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;S3 Compatibility&lt;/strong&gt;: Fully compatible with Amazon S3 API for easy tool integration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;: Uses erasure coding for data redundancy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Deployment&lt;/strong&gt;: Supports bare-metal, Kubernetes, or cloud environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features make MinIO an ideal storage backend for Doris in a storage-compute separation architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Planning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Software Versions
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MinIO&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;High-performance object storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Doris&lt;/td&gt;
&lt;td&gt;3.0.6&lt;/td&gt;
&lt;td&gt;Real-time analytics database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doris Manager&lt;/td&gt;
&lt;td&gt;25.0.0&lt;/td&gt;
&lt;td&gt;Visual tool for Doris installation/deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Server Layout
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node IP&lt;/th&gt;
&lt;th&gt;Doris Manager&lt;/th&gt;
&lt;th&gt;MinIO&lt;/th&gt;
&lt;th&gt;MetaService&lt;/th&gt;
&lt;th&gt;FE&lt;/th&gt;
&lt;th&gt;BE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.2"&gt;172.20.1.2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.3"&gt;172.20.1.3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.4"&gt;172.20.1.4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="//172.20.1.5"&gt;172.20.1.5&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✔️&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production environments: Use higher-spec machines and isolate components for optimal performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Modify OS Parameters
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;swapoff &lt;span class="nt"&gt;-a&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.conf &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
vm.max_map_count = 2000000
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Take effect immediately&lt;/span&gt;
sysctl &lt;span class="nt"&gt;-p&lt;/span&gt;

&lt;span class="c"&gt;# Append the following two lines to /etc/security/limits.conf&lt;/span&gt;
vi /etc/security/limits.conf
&lt;span class="k"&gt;*&lt;/span&gt; soft nofile 1000000
&lt;span class="k"&gt;*&lt;/span&gt; hard nofile 1000000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Install Required Tools
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; net-tools
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; cron
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; iputils-ping

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying MinIO
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download MinIO
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://dl.min.io/server/minio/release/linux-amd64/minio
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x minio

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Start MinIO on Each Node
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_REGION_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_ROOT_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minio
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MINIO_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minioadmin
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disk&lt;span class="o"&gt;{&lt;/span&gt;1..4&lt;span class="o"&gt;}&lt;/span&gt;/minio
&lt;span class="nb"&gt;nohup &lt;/span&gt;minio server &lt;span class="nt"&gt;--address&lt;/span&gt; :9000 &lt;span class="nt"&gt;--console-address&lt;/span&gt; :9001 http://172.20.1.&lt;span class="o"&gt;{&lt;/span&gt;2...5&lt;span class="o"&gt;}&lt;/span&gt;:9000/mnt/disk&lt;span class="o"&gt;{&lt;/span&gt;1...4&lt;span class="o"&gt;}&lt;/span&gt;/minio 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Configure MinIO Client
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://dl.min.io/client/mc/release/linux-amd64/mc
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x mc
./mc &lt;span class="nb"&gt;alias set &lt;/span&gt;myminio http://127.0.0.1:9000 minio minioadmin
./mc mb myminio/doris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: If MinIO is deployed on a local network without TLS, explicitly include &lt;code&gt;http://&lt;/code&gt; in the endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying Doris Manager
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download Doris Manager
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://enterprise-doris-releases.oss-accelerate.aliyuncs.com/doris-manager/velodb-manager-25.0.0-x64-bin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Extract and Start Service
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxf&lt;/span&gt; velodb-manager-25.0.0-x64-bin.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;velodb-manager-25.0.0-x64-bin/webserver/bin
bash start.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Access Web Interface
&lt;/h4&gt;

&lt;p&gt;Open your browser and navigate to &lt;code&gt;http://&amp;lt;Doris Manager IP&amp;gt;:8004&lt;/code&gt;. Follow the prompts to create an admin account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2it6o3bg713en16mnay4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2it6o3bg713en16mnay4.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying Apache Doris
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Download Doris
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-3.0.6.2-bin-x64.tar.gz
&lt;span class="nb"&gt;mv &lt;/span&gt;apache-doris-3.0.6.2-bin-x64.tar.gz /opt/downloads/doris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Create Cluster via Doris Manager
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxx9ukyczzb8fd676b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jxx9ukyczzb8fd676b0.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select Doris version (3.0.6) and set root password&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjar0px3nprs7vzj050gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjar0px3nprs7vzj050gm.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Enter MinIO details:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsk02nxqs12ki5pv6jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsk02nxqs12ki5pv6jy.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Configure Nodes
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run this script on all nodes to deploy the agent:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget http://172.20.1.2:8004/api/download/deploy.sh &lt;span class="nt"&gt;-O&lt;/span&gt; deploy_agent.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;chmod&lt;/span&gt; +x deploy_agent.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./deploy_agent.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Input node IPs in the Doris Manager interface&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
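&lt;p&gt;If you have several nodes, the agent-deployment step can be fanned out from a single machine. A minimal sketch that only prints the commands to run; the node IPs are assumptions matching this example network (172.20.1.2:8004 is the Doris Manager address from the script above):&lt;/p&gt;

```shell
# Print (rather than execute) the agent-deployment command for each node.
# The node IPs below are placeholders; replace them with your own hosts.
MANAGER=172.20.1.2:8004
for ip in 172.20.1.3 172.20.1.4 172.20.1.5; do
  echo "ssh root@$ip \"wget http://$MANAGER/api/download/deploy.sh -O deploy_agent.sh && chmod +x deploy_agent.sh && ./deploy_agent.sh\""
done
```

&lt;p&gt;Pipe the output to &lt;code&gt;sh&lt;/code&gt; (or run each line by hand) once the IPs are correct.&lt;/p&gt;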

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqcutu815np6gwfwn1fj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqcutu815np6gwfwn1fj.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Configure FE nodes (specify roles and resources)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33pazvwisixkxmk9w3w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33pazvwisixkxmk9w3w8.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Configure BE nodes (specify storage paths and resources)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjnjbcoorw48kleortu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjnjbcoorw48kleortu2.png" alt=" " width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Deploy Cluster
&lt;/h4&gt;

&lt;p&gt;Click "Deploy" and wait for the process to complete (10-15 minutes). Verify cluster status in Doris Manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe82gurm7le6qvf4t6cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe82gurm7le6qvf4t6cm.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48z3vr9lapiqb71zj6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48z3vr9lapiqb71zj6u.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Access Query Interface
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhy36a1kr65e854291fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhy36a1kr65e854291fc.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya5koim1qv63g9v0gf6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya5koim1qv63g9v0gf6r.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Create Doris Table
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;`test`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="nv"&gt;`test`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`amazon_reviews`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="nv"&gt;`review_date`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`marketplace`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`customer_id`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="nv"&gt;`review_id`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_id`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_parent`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_title`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`product_category`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`star_rating`&lt;/span&gt; &lt;span class="nb"&gt;smallint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`helpful_votes`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`total_votes`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`vine`&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`verified_purchase`&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`review_headline`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`review_body`&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OLAP&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`review_date`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'OLAP'&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`review_date`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;"compression"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ZSTD"&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Download Sample Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Load Data into Doris
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location-trusted&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; root:&amp;lt;your password&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-T&lt;/span&gt; amazon_reviews_2010.snappy.parquet &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"format:parquet"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
http://127.0.0.1:8030/api/test/amazon_reviews/_stream_load

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
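&lt;p&gt;Stream Load reports success or failure in a JSON result body rather than in the HTTP status alone, so it is worth checking the &lt;code&gt;Status&lt;/code&gt; field of the response. A minimal sketch, using an illustrative response string (not real output from this load):&lt;/p&gt;

```shell
# Illustrative Stream Load response; a real one is returned by the curl
# command above. "Status" and "NumberLoadedRows" are genuine fields of
# the Stream Load result JSON.
response='{"Status": "Success", "NumberLoadedRows": 100, "NumberFilteredRows": 0}'
case "$response" in
  *'"Status": "Success"'*) echo "load ok" ;;
  *)                       echo "load failed" ;;
esac
```

&lt;p&gt;If the status is anything other than &lt;code&gt;Success&lt;/code&gt;, the response also carries an &lt;code&gt;ErrorURL&lt;/code&gt; pointing at the rejected rows.&lt;/p&gt;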



&lt;h4&gt;
  
  
  5. Verify Data in MinIO
&lt;/h4&gt;

&lt;p&gt;Log into MinIO Console (&lt;code&gt;http://&amp;lt;MinIO IP&amp;gt;:9001&lt;/code&gt;) → Check &lt;code&gt;doris&lt;/code&gt; bucket for data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19lrwr25op6nygodpotp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19lrwr25op6nygodpotp.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;star_rating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;amazon_reviews&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;review_body&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%is super awesome%'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This setup is ideal for enterprises looking to balance performance and cost in real-time analytics scenarios. Try it out with the guide above and share your experience!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Overview of Real-Time Data Synchronization from MySQL to VeloDB</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 02 Dec 2025 20:40:25 +0000</pubDate>
      <link>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-mysql-to-velodb-5888</link>
      <guid>https://forem.com/apachedoris/overview-of-real-time-data-synchronization-from-mysql-to-velodb-5888</guid>
<description>&lt;p&gt;When migrating data from MySQL (including MySQL-compatible databases such as Amazon Aurora) to VeloDB, Flink can serve as the real-time synchronization engine, ensuring both data consistency and timeliness. Its high-throughput, low-latency stream processing makes it well suited to full data synchronization and incremental change handling alike.&lt;/p&gt;

&lt;p&gt;For real-time synchronization, enable the MySQL binlog so that CDC (Change Data Capture) events can be captured. Whether you run a traditional self-hosted MySQL or Amazon Aurora MySQL in the cloud, you can enable the binlog and subscribe to it with Flink CDC to achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full data initial load: Import existing data from MySQL/Aurora to VeloDB first&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time synchronization of incremental changes: Capture Insert/Update/Delete operations based on Binlog and continuously write them to VeloDB&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall link is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0citx5o0wctlf4y2yl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0citx5o0wctlf4y2yl.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we take Amazon Aurora-MySQL as an example to demonstrate how to use Flink CDC to capture data changes in Aurora and synchronize them to VeloDB in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Create an AWS RDS Aurora MySQL instance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft5a0cmo5vwedwoxj8d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft5a0cmo5vwedwoxj8d9.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a MySQL database and corresponding tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InnoDB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;CHARSET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;utf8mb4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice Zhang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13800138000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob Li'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13900139000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;76&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie Wang'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13600136000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'David Chen'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'david@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13500135000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Emma Liu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'emma@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13700137000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Create a VeloDB warehouse
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5k6ev0gnp82nxcrzmco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5k6ev0gnp82nxcrzmco.png" alt=" " width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Modify MySQL configuration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a parameter group and add the binlog configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88ql3x4jrm5ihd6g8llp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88ql3x4jrm5ihd6g8llp.png" alt=" " width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Modify &lt;code&gt;binlog_format&lt;/code&gt; to &lt;code&gt;ROW&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfjl0em88v0aqkldubue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfjl0em88v0aqkldubue.png" alt=" " width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Attach the newly created DB cluster parameter group to the instance, apply the changes, and then restart the instance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwhunnsn5bs15mlze1p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwhunnsn5bs15mlze1p1.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;
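&lt;p&gt;After the parameter group is attached and the instance restarted, you can confirm the CDC prerequisites from any MySQL client connected to the Aurora endpoint. The statements are printed here because the endpoint is environment-specific:&lt;/p&gt;

```shell
# Both SHOW VARIABLES statements are standard MySQL; run them against the
# Aurora writer endpoint with your usual client.
printf '%s\n' \
  "SHOW VARIABLES LIKE 'binlog_format';  -- expect ROW" \
  "SHOW VARIABLES LIKE 'log_bin';        -- expect ON"
```

&lt;p&gt;If &lt;code&gt;binlog_format&lt;/code&gt; is not &lt;code&gt;ROW&lt;/code&gt;, Flink CDC cannot reconstruct row-level changes, so re-check the parameter group before continuing.&lt;/p&gt;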

&lt;h3&gt;
  
  
  5. Install Flink With Doris Connector
&lt;/h3&gt;

&lt;h4&gt;
  
  
  5.1 Download the pre-built installation package
&lt;/h4&gt;

&lt;p&gt;For Flink 1.17, we provide a pre-built installation package that you can download and decompress directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  5.2 Manual installation
&lt;/h4&gt;

&lt;p&gt;If you already have a Flink environment or need a different Flink version, install manually. Taking Flink 1.17 as an example, download the Flink distribution and its dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flink 1.17&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink MySQL CDC Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flink Doris Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MySQL Driver&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the download is complete, extract the Flink installation package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zxvf&lt;/span&gt; flink-1.17.2-bin-scala_2.12.tgz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then place the Flink MySQL CDC Connector, the Doris Connector, and the MySQL driver jar into the &lt;code&gt;flink-1.17.2/lib&lt;/code&gt; directory, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftpj6f6o7lo4mrfdvou7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftpj6f6o7lo4mrfdvou7.png" alt=" " width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;
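&lt;p&gt;The jars above can be fetched from Maven Central. The coordinates below are assumptions matching the versions used in this walkthrough, and the commands are printed rather than executed, so verify the versions against your Flink installation first:&lt;/p&gt;

```shell
# Print wget commands for the three dependency jars. The paths follow the
# standard Maven Central layout; the exact versions are assumptions.
BASE=https://repo1.maven.org/maven2
for artifact in \
  org/apache/doris/flink-doris-connector-1.17/25.1.0/flink-doris-connector-1.17-25.1.0.jar \
  com/ververica/flink-sql-connector-mysql-cdc/2.4.2/flink-sql-connector-mysql-cdc-2.4.2.jar \
  com/mysql/mysql-connector-j/8.0.33/mysql-connector-j-8.0.33.jar
do
  echo "wget -P flink-1.17.2/lib $BASE/$artifact"
done
```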

&lt;h3&gt;
  
  
  6. Submit the Flink synchronization job
&lt;/h3&gt;

&lt;p&gt;When the job is submitted, the Doris Connector automatically creates the corresponding tables in VeloDB based on the schema of the upstream MySQL tables.&lt;/p&gt;

&lt;p&gt;Flink supports submitting and running jobs in Local, Standalone, YARN, and Kubernetes modes. If you already have a Flink environment, you can submit the job to it directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  6.1 Local Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.2 Standalone Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; remote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.3 Yarn Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; yarn-per-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6.4 K8S Environment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flink-1.17.2-bin
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/flink run &lt;span class="nt"&gt;-t&lt;/span&gt; kubernetes-session &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dexecution&lt;/span&gt;.checkpointing.interval&lt;span class="o"&gt;=&lt;/span&gt;10s &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                              
    &lt;span class="nt"&gt;-Dparallelism&lt;/span&gt;.default&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                             
    &lt;span class="nt"&gt;-c&lt;/span&gt; org.apache.doris.flink.tools.cdc.CdcTools &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                        
    lib/flink-doris-connector-1.17-25.1.0.jar &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                           
    mysql-sync-database &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                 
    &lt;span class="nt"&gt;--database&lt;/span&gt; test_db &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                                  
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;database-test.cluster-ro-ckbuyoqerz2c.us-east-1.rds.amazonaws.com &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3306 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                              
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                          
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                        
    &lt;span class="nt"&gt;--mysql-conf&lt;/span&gt; database-name&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                     
    &lt;span class="nt"&gt;--including-tables&lt;/span&gt; &lt;span class="s2"&gt;"student"&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                       
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;fenodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:8080 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                           
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                                               
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; jdbc-url&lt;span class="o"&gt;=&lt;/span&gt;jdbc:mysql://lb-40579077-a97732bc6c030909.elb.us-east-1.amazonaws.com:9030 &lt;span class="se"&gt;\ &lt;/span&gt;                                                                                                  
    &lt;span class="nt"&gt;--sink-conf&lt;/span&gt; sink.label-prefix&lt;span class="o"&gt;=&lt;/span&gt;label

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: For more parameters of the Connector, refer to this link.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Verify Historical Data Synchronization
&lt;/h3&gt;

&lt;p&gt;On its first run, the Flink job synchronizes the full historical data. Check the synchronization status in VeloDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrm8uewycxttv3tgi3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdrm8uewycxttv3tgi3t.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Verify Real-Time Data Synchronization
&lt;/h3&gt;

&lt;p&gt;Perform data modifications in MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Frank Zhao'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'frank@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'13400134000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify data changes in VeloDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq06ozn2y47frku7mgpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq06ozn2y47frku7mgpw.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>mysql</category>
    </item>
    <item>
      <title>Apache Doris AI Capabilities Unveiled (Part II): Deep Analysis of AI_AGG and EMBED Functions</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Wed, 26 Nov 2025 20:24:27 +0000</pubDate>
      <link>https://forem.com/apachedoris/apache-doris-ai-capabilities-unveiled-part-ii-deep-analysis-of-aiagg-and-embed-functions-2kek</link>
      <guid>https://forem.com/apachedoris/apache-doris-ai-capabilities-unveiled-part-ii-deep-analysis-of-aiagg-and-embed-functions-2kek</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After a preliminary exploration of the possibilities of AI functions, we now turn our attention to two more core functions: &lt;strong&gt;AI_AGG&lt;/strong&gt; and &lt;strong&gt;EMBED&lt;/strong&gt;. We will delve into the design philosophy, implementation principles, and business applications of these two functions, demonstrating how Doris seamlessly integrates text aggregation and semantic vector analysis into SQL through native function design, providing users with a more powerful and user-friendly intelligent data analysis experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI_AGG: AI-Based Text Aggregation
&lt;/h2&gt;

&lt;p&gt;Aggregation is one of the most common operations in data analysis. However, when dealing with massive volumes of user comments, support tickets, or log texts, traditional aggregate functions cannot process such unstructured text directly. To address this, Doris provides &lt;strong&gt;AI_AGG&lt;/strong&gt;, an aggregate function that calls an AI model to aggregate text. It lets analysts run custom, instruction-driven tasks over large volumes of text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;p&gt;For detailed usage of AI_AGG, please refer to: &lt;a href="https://doris.apache.org/docs/dev/sql-manual/sql-functions/aggregate-functions/ai-agg" rel="noopener noreferrer"&gt;Apache Doris AI_AGG Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1: Summarize Customer Service Tickets
&lt;/h4&gt;

&lt;p&gt;The following table simulates a simple set of customer service tickets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;support_tickets&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ticket_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Same problem as Alice. Also seeing 502 errors on the SSO page.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Payment Declined&lt;/td&gt;
&lt;td&gt;Credit card charged twice but order still shows pending.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Getting redirected back to login after entering 2FA code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Cannot log in after password reset. Tried clearing cache and different browsers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dave&lt;/td&gt;
&lt;td&gt;Slow Dashboard&lt;/td&gt;
&lt;td&gt;Dashboard takes &amp;gt;30 seconds to load since the last release.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can use &lt;code&gt;AI_AGG&lt;/code&gt; to summarize customer issues for different problem types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AI_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Summarize every ticket detail into one short paragraph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ai_summary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;support_tickets&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;ai_summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Slow Dashboard&lt;/td&gt;
&lt;td&gt;The dashboard is experiencing slow loading times, taking over 30 seconds to load following the most recent release.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment Declined&lt;/td&gt;
&lt;td&gt;A customer reports being charged twice for their order, which remains in a pending status.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Login Failure&lt;/td&gt;
&lt;td&gt;Users are experiencing login issues, including 2FA redirection, post-password reset failures, and SSO 502 errors, despite clearing cache and trying different browsers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  AI_AGG Technical Analysis: Dynamic Pre-aggregation
&lt;/h3&gt;

&lt;p&gt;Combining aggregate functions with AI raises a hard constraint: the total text volume within a group can far exceed the model's context window, so simply concatenating all rows and sending them to the model at once is not feasible. Doris solves this through &lt;strong&gt;dynamic pre-aggregation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0r3xl0720224lfhne4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0r3xl0720224lfhne4.png" alt=" " width="642" height="1012"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Monitoring&lt;/strong&gt;: During the text aggregation process, AI_AGG maintains an internal text buffer for each group (currently fixed at 128KB, compatible with most AI context windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Pre-aggregation&lt;/strong&gt;: When a new text row would cause the buffer to exceed the threshold, AI_AGG triggers pre-aggregation—pausing to send the current buffer to the AI for intermediate processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Replacement&lt;/strong&gt;: The AI's concise intermediate result replaces the original long text in the buffer, freeing space for more data. If the buffer still exceeds the threshold after replacement, AI_AGG errors out to prevent model service overload.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
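&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is an illustrative model only, not Doris internals: &lt;code&gt;ai_summarize&lt;/code&gt; stands in for the model call, and the threshold is scaled down from the real 128KB buffer.&lt;/p&gt;

```python
# Sketch of AI_AGG-style dynamic pre-aggregation (illustrative only,
# not Doris internals). ai_summarize stands in for the model call, and
# THRESHOLD is scaled down from the real 128 KB per-group buffer.
THRESHOLD = 64  # bytes of buffered text per group

def ai_summarize(text):
    # Placeholder for the real model call: returns a shorter intermediate result.
    return text[: THRESHOLD // 2]

def ai_agg(rows):
    buffer = ""
    for row in rows:
        # 1. Context monitoring: would this row overflow the buffer?
        if len(buffer) + len(row) + 1 > THRESHOLD:
            # 2. Dynamic pre-aggregation: send the buffer for intermediate
            #    processing, then 3. replace it with the concise result.
            buffer = ai_summarize(buffer)
            if len(buffer) + len(row) + 1 > THRESHOLD:
                # Still over the threshold after replacement: error out
                # rather than overload the model service.
                raise RuntimeError("row too large even after pre-aggregation")
        buffer += row + "\n"
    # Final aggregation call over whatever remains in the buffer.
    return ai_summarize(buffer)
```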

&lt;p&gt;This implementation integrates seamlessly with Doris's distributed query plan, leveraging multi-node parallel computing. Users can perform efficient intelligent analysis on massive text data using familiar SQL aggregation syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  EMBED: Text Vectorization Function
&lt;/h2&gt;

&lt;p&gt;For detailed usage of EMBED, please refer to: &lt;a href="https://doris.apache.org/docs/dev/sql-manual/sql-functions/ai-functions/distance-functions/embed" rel="noopener noreferrer"&gt;Apache Doris EMBED Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core function of &lt;strong&gt;EMBED&lt;/strong&gt; is to convert any text into a high-dimensional floating-point vector through AI. This vector is a mathematical representation of the text in a semantic space, capturing its semantic information. Texts with similar semantics will have vectors that are closer in this space.&lt;/p&gt;
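&lt;p&gt;"Closer in this space" is typically measured with cosine similarity. Here is a toy illustration with hand-made 3-dimensional vectors standing in for real EMBED output (actual embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

# Hand-made 3-d vectors standing in for real EMBED output.
vec_travel  = [0.9, 0.1, 0.0]  # "travel reimbursement policy"
vec_expense = [0.8, 0.2, 0.1]  # "expense report rules" (semantically close)
vec_vpn     = [0.0, 0.1, 0.9]  # "VPN user guide" (unrelated topic)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Semantically similar texts score close to 1.0; unrelated texts near 0.
print(cosine_similarity(vec_travel, vec_expense))  # high
print(cosine_similarity(vec_travel, vec_vpn))      # low
```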

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Example 1: Build a Knowledge Base with Vectorization
&lt;/h4&gt;

&lt;p&gt;The following statements build a table simulating a simple employee handbook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;FLOAT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Embedding vector generated by EMBED function'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"replication_num"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"1"&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Travel Reimbursement Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"travel reimbursement policy"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Leave Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"leave request policy"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"VPN User Guide"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"VPN guide intranet access"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Meeting Room Reservation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"meeting room booking reservation"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Procurement Request Process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"procurement request process finance"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By vectorizing text with &lt;code&gt;EMBED&lt;/code&gt;, combined with Doris's vector functions, you can perform the following operations:&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Q&amp;amp;A Retrieval (with COSINE_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COSINE_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"How to apply for travel reimbursement?"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Travel Reimbursement Policy&lt;/td&gt;
&lt;td&gt;Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.&lt;/td&gt;
&lt;td&gt;0.4463210454563673&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;0.5726841578491431&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  2. Problem Analysis Matching (with L2_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;L2_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"How to access the company intranet"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;distance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;VPN User Guide&lt;/td&gt;
&lt;td&gt;To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate.&lt;/td&gt;
&lt;td&gt;0.5838271122253775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Travel Reimbursement Policy&lt;/td&gt;
&lt;td&gt;Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.&lt;/td&gt;
&lt;td&gt;1.272394695975331&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  3. Text Relevance Matching (with INNER_PRODUCT)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;INNER_PRODUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Leave system request leader approval"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="n"&gt;DESCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;0.33268885332504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Meeting Room Reservation&lt;/td&gt;
&lt;td&gt;Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.&lt;/td&gt;
&lt;td&gt;0.29224032230852487&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  4. Find Similar Content (with L1_DISTANCE)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;L1_DISTANCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EMBED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Procurement application process"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="n"&gt;ASCLIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;th&gt;distance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Procurement Request Process&lt;/td&gt;
&lt;td&gt;Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.&lt;/td&gt;
&lt;td&gt;18.66882028897362&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Meeting Room Reservation&lt;/td&gt;
&lt;td&gt;Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.&lt;/td&gt;
&lt;td&gt;30.90449328294426&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Leave Policy&lt;/td&gt;
&lt;td&gt;Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required.&lt;/td&gt;
&lt;td&gt;31.060405636536416&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Flexible Vector Dimension Control
&lt;/h3&gt;

&lt;p&gt;Through Doris's built-in &lt;strong&gt;RESOURCE&lt;/strong&gt; mechanism, users can set the &lt;code&gt;ai.dimensions&lt;/code&gt; parameter when configuring an AI Resource to precisely specify the dimension of the generated vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-dimensional vectors&lt;/strong&gt;: Retain richer semantic information (suitable for high-precision retrieval).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-dimensional vectors&lt;/strong&gt;: Save storage space and accelerate computation (suitable for lightweight matching).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure the AI model configured in the RESOURCE supports the specified dimension (otherwise, requests may fail).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For models that do not support dimension customization (e.g., OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt;), the &lt;code&gt;ai.dimensions&lt;/code&gt; setting will be ignored, and the model's default dimension will be used.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
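
&lt;p&gt;As a hedged sketch of what such a configuration might look like, assuming an OpenAI-compatible provider (only &lt;code&gt;ai.dimensions&lt;/code&gt; is taken from the text above; the other property names follow common Doris AI Resource conventions and should be verified against your version's documentation, and the endpoint, model name, and key are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical AI Resource with a custom embedding dimension
CREATE RESOURCE "embedding_resource"
PROPERTIES (
    "type" = "ai",
    "ai.provider_type" = "openai",               -- assumed provider
    "ai.endpoint" = "&amp;lt;endpoint_url&amp;gt;",              -- placeholder
    "ai.model_name" = "text-embedding-3-small",  -- placeholder model
    "ai.api_key" = "&amp;lt;api_key&amp;gt;",                    -- placeholder
    "ai.dimensions" = "256"                      -- lower dimension for lightweight matching
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A smaller value such as 256 trades some retrieval precision for less storage and faster distance computation; a larger value keeps richer semantics for high-precision retrieval.&lt;/p&gt;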

&lt;h2&gt;
  
  
  Summary and Outlook
&lt;/h2&gt;

&lt;p&gt;With the &lt;strong&gt;AI_AGG&lt;/strong&gt; and &lt;strong&gt;EMBED&lt;/strong&gt; functions, Apache Doris embeds AI capabilities directly into its database kernel, bringing intelligent analysis to its native SQL engine and greatly expanding the boundaries of data analysis and intelligent applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI_AGG&lt;/strong&gt;: With dynamic pre-aggregation, it enables intelligent analysis of unstructured text (e.g., user comments, logs) directly in the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EMBED&lt;/strong&gt;: Seamlessly integrates with vector functions to provide end-to-end semantic retrieval solutions (e.g., Q&amp;amp;A systems, content recommendation), simplifying application development.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features empower SQL with the ability to command AI models, allowing data analysts to harness powerful AI at low cost and high efficiency to uncover deeper semantic value in data.&lt;/p&gt;

&lt;p&gt;Looking ahead, Doris will continue to deepen the integration of AI and databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimize model scheduling and computational performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explore cutting-edge features like multi-modal data analysis and AI Agent interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuously lower the barrier to using AI technology, making data-driven intelligent decisions ubiquitous.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>apachedoris</category>
      <category>olap</category>
      <category>database</category>
    </item>
    <item>
      <title>Building Real-Time Lakehouse with S3 Tables, AWS Glue, and Apache Doris</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Fri, 21 Nov 2025 18:45:38 +0000</pubDate>
      <link>https://forem.com/apachedoris/building-real-time-lakehouse-with-s3-tables-aws-glue-and-apache-doris-5am0</link>
      <guid>https://forem.com/apachedoris/building-real-time-lakehouse-with-s3-tables-aws-glue-and-apache-doris-5am0</guid>
      <description>&lt;p&gt;We built a real-time lakehouse with S3 Tables, AWS Glue, and Apache Doris. In this solution, S3 Tables stores data in the Apache Iceberg format on Amazon S3. AWS Glue manages and organizes metadata and schema, providing a single catalog that connects all resources. And Apache Doris runs sub-second queries directly on those Iceberg tables: no ETL, no data copies, no complex architecture.&lt;/p&gt;

&lt;p&gt;Together, S3 Tables, AWS Glue, and Apache Doris form a real-time lakehouse that combines the openness of a data lake with the high performance of a data warehouse, providing a key data foundation for AI and agentic workloads.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Unified metadata for easy table discovery and governance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open Apache Iceberg tables on S3 with ACID, time-travel, and schema evolution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A high-performance query engine, Apache Doris, offering low latency and high concurrency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interoperability across engines with Spark, Flink, Trino, Doris, and more&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a practical, production-ready real-time lakehouse you can use to power dashboards, streaming analytics, or AI features directly from the data lake. The solution also applies to many other open-source combinations: table formats like Iceberg and Paimon, catalogs like Unity, Polaris, and Gravitino, and query engines like Spark, Flink, and Trino.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple steps to replicate
&lt;/h2&gt;

&lt;p&gt;Let's see how to set up this solution in a demo. We will explore how to harness the power of Apache Doris and configure it, as a third-party engine, to work with the AWS Glue Iceberg REST Catalog. The demo includes details on how to perform read/write operations against S3 Tables through AWS Glue.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create S3 Table Buckets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr5q6hfqo43euolh41s8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr5q6hfqo43euolh41s8.png" alt=" " width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create policy for Glue and S3 Tables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstnl3topote0wqtbgtyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstnl3topote0wqtbgtyv.png" alt=" " width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the following JSON policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetCatalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetTables"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:CreateTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"glue:UpdateTable"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog/s3tablescatalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:catalog/s3tablescatalog/&amp;lt;bucket_name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:table/s3tablescatalog/&amp;lt;bucket_name&amp;gt;/&amp;lt;db_name&amp;gt;/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:&amp;lt;region&amp;gt;:&amp;lt;account_id&amp;gt;:database/s3tablescatalog/&amp;lt;bucket_name&amp;gt;/&amp;lt;db_name&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"lakeformation:GetDataAccess"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Attach the policy to your user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzizk79jrkb3xn929jre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzizk79jrkb3xn929jre.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for the policy you just created and attach it to your user.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to the Iceberg catalog using SQL
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Catalog&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;my_glue_catalog&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.catalog.type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'warehouse'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;acount_id&amp;gt;:s3tablescatalog/&amp;lt;bucket_name&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.uri'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://glue.&amp;lt;region&amp;gt;.amazonaws.com/iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.sigv4-enabled'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.signing-name'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'glue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.signing-region'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;region&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.access-key-id'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;ak&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.secret-access-key'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;sk&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'test_connection'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Switch to the catalog&lt;/span&gt;
&lt;span class="n"&gt;SWITCH&lt;/span&gt; &lt;span class="n"&gt;my_glue_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- View current existing databases&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a new database&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;DATABSE&lt;/span&gt; &lt;span class="n"&gt;gluedb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Change to the newly created database&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;gluedb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Create a new Iceberg table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Insert values into table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Jacky"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Query the Iceberg table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_table&lt;/span&gt;&lt;span class="err"&gt;；&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the placeholders with the real information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Next Steps
&lt;/h2&gt;

&lt;p&gt;A unified data foundation is what makes real-time analytics possible, and it is key for companies adopting large-scale AI and agentic workloads.&lt;/p&gt;

&lt;p&gt;S3 Tables and AWS Glue provide an open, governed data layer, and Apache Doris delivers sub-second analytics directly on that data. This real-time lakehouse offers a simpler architecture, smarter governance, and AI readiness, allowing teams to query fresh information without complex ETL or data silos.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>lakehouse</category>
      <category>database</category>
      <category>apachedoris</category>
    </item>
    <item>
      <title>10x Query Performance Improvement: The Design and Implementation of the New Unique Key</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 20 Nov 2025 19:44:20 +0000</pubDate>
      <link>https://forem.com/apachedoris/10x-query-performance-improvement-the-design-and-implementation-of-the-new-unique-key-157l</link>
      <guid>https://forem.com/apachedoris/10x-query-performance-improvement-the-design-and-implementation-of-the-new-unique-key-157l</guid>
      <description>&lt;p&gt;In business scenarios of real-time data warehouses, providing good support for real-time data updates is an extremely important capability. For example, in scenarios such as database synchronization (CDC), e-commerce transaction orders, advertising effect delivery, and marketing business reports, when facing changes in upstream data, it is usually necessary to quickly capture change records and promptly modify single or multiple rows of data. This ensures that business analysts and related analysis platforms can quickly grasp the latest progress and improve the timeliness of business decisions.&lt;/p&gt;

&lt;p&gt;OLAP databases have traditionally been weak at data updates. As data timeliness requirements grow stronger and real-time data warehouse workloads expand, implementing efficient real-time update capabilities has become key to staying competitive.&lt;/p&gt;

&lt;p&gt;In the past, Apache Doris mainly implemented real-time data Upserts through the Unique Key data model. Thanks to its underlying LSM Tree-like structure, it provides strong support for high-frequency writes on large datasets. However, its Merge-on-Read update mode has become a bottleneck restricting Apache Doris' real-time update capabilities, as it may cause query jitter under concurrent reading and writing of real-time data.&lt;/p&gt;

&lt;p&gt;Based on this, in Apache Doris 1.2.0 we introduced a new data update method for the Unique Key model, Merge-on-Write, striving to balance real-time updates and efficient queries. This article details the design, implementation, and effects of the new primary key model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation of the Original Unique Key Model
&lt;/h1&gt;

&lt;p&gt;Users familiar with Apache Doris' history may know that Doris' initial design was inspired by Google Mesa, and it only had Duplicate Key and Aggregate Key models at first. The Unique Key model was added later based on user needs during Doris' development. However, the demand for real-time updates was not so strong at that time, so the implementation of Unique Key was relatively simple - it was just a wrapper around the Aggregate Key model, without in-depth optimization for real-time update requirements.&lt;/p&gt;

&lt;p&gt;Specifically, the implementation of the Unique Key model is just a special case of the Aggregate Key model. If you use the Aggregate Key model and set the aggregation type of all non-key columns to REPLACE, you can achieve exactly the same effect. As shown in the following figure, when describing example_tbl, a table of the Unique Key model, the aggregation type in the last column shows that it is equivalent to an Aggregate Key table where all columns have the REPLACE aggregation type.&lt;/p&gt;

&lt;p&gt;Image: Original Unique-Key-Aggregate-Key&lt;/p&gt;

&lt;p&gt;Both the Unique Key and Aggregate Key data models adopt the Merge-On-Read approach: when data is imported, it is first written to a new Rowset, with no deduplication performed at write time. Only when a query is initiated is a multi-way merge sort performed, during which rows with duplicate keys are grouped together and aggregated. Keys with higher versions overwrite those with lower versions, and only the record with the highest version is returned to the user.&lt;/p&gt;

&lt;p&gt;The following figure is a simplified representation of the execution process of the Unique Key model:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Simplified Unique-Key&lt;/p&gt;
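&lt;p&gt;The merge described above can be condensed into a minimal sketch (a simplified model: &lt;code&gt;std::map&lt;/code&gt; stands in for Doris' multi-way merge sort over sorted Segments, and &lt;code&gt;Row&lt;/code&gt; is an illustrative type, not a Doris structure):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

// One row as the reader sees it: primary key, import version, payload.
struct Row { std::string key; uint64_t version; std::string value; };

// Merge-on-Read: nothing is deduplicated at write time; for each key the
// query keeps only the row with the highest version.
std::vector&amp;lt;Row&amp;gt; merge_on_read(const std::vector&amp;lt;std::vector&amp;lt;Row&amp;gt;&amp;gt;&amp;amp; rowsets) {
    std::map&amp;lt;std::string, Row&amp;gt; latest;  // ordered by key, like the merged output
    for (const auto&amp;amp; rs : rowsets)
        for (const auto&amp;amp; r : rs)
            if (!latest.count(r.key) || latest[r.key].version &amp;lt; r.version)
                latest[r.key] = r;  // a higher version overwrites a lower one
    std::vector&amp;lt;Row&amp;gt; out;
    for (const auto&amp;amp; [k, r] : latest) out.push_back(r);
    return out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;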

&lt;p&gt;Although their implementation methods are relatively consistent, the usage scenarios of the Unique Key and Aggregate Key data models are significantly different:&lt;/p&gt;

&lt;p&gt;When users create a table with the Aggregate Key model, they have a very clear understanding of the aggregation query conditions - aggregating according to the columns specified by the Aggregate Key, and the aggregate functions on the Value columns are the main aggregation methods (COUNT/SUM/MAX/MIN, etc.) used by users. For example, using user_id as the Aggregate Key and summing the number of visits and duration to calculate UV and user usage duration.&lt;/p&gt;

&lt;p&gt;However, the main function of the Key in the Unique Key data model is to ensure uniqueness, not to serve as an aggregation Key. For example, in the order scenario, data synchronized from TP databases through Flink CDC uses the order ID as the Unique Key for deduplication. However, during queries, filtering, aggregation and analysis are usually performed on certain Value columns (such as order status, order amount, order time consumption, order placement time, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Shortcomings
&lt;/h1&gt;

&lt;p&gt;As can be seen from the above, when users query using the Unique Key model, they actually perform two aggregation operations. The first is to aggregate all data by Key according to the Unique Key to remove duplicate Keys; the second is to aggregate according to the actual aggregation conditions required by the query. These two aggregation operations lead to serious efficiency issues and low query performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data deduplication requires expensive multi-way merge sorting, and full Key comparison consumes a lot of CPU computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective data pruning cannot be performed, which introduces a large amount of extra data IO. For example, suppose a partition holds 10 million rows but only 1,000 of them match the filter conditions. The rich indexes of an OLAP system are designed to locate those 1,000 rows efficiently, but since it is impossible to tell whether a given row in a given file is still valid, the indexes cannot be used: the data must first go through a full merge sort and deduplication before the surviving rows can be filtered. This amounts to roughly a 10,000-fold IO amplification (a rough estimate; the actual amplification is more complicated to calculate).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Scheme Research and Selection
&lt;/h1&gt;

&lt;p&gt;To solve the problems of the original Unique Key model and better meet the needs of business scenarios, we decided to optimize it and conducted detailed research into optimization schemes for the read and write efficiency issues.&lt;/p&gt;

&lt;p&gt;There have been many industry explorations on solutions to the above problems. There are three representative types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Delete + Insert: That is, when writing data, find the overwritten key through a primary key index and mark it as deleted. A representative system is Microsoft's SQL Server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delta Store: Divide data into base data and delta data. Each primary key in the base data is guaranteed to be unique. All updates are recorded in the Delta Store. During queries, the base data and delta data are merged. At the same time, background merge threads regularly merge the delta data and base data. A representative system is Apache Kudu.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy-on-Write: When updating data, directly copy the original data row, update it, and write it to a new file. This method is widely used in data lakes, with representative systems such as Apache Hudi and Delta Lake.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation mechanisms and comparisons of these three schemes are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete + Insert (i.e., Merge-on-Write)
&lt;/h2&gt;

&lt;p&gt;A representative example is the scheme proposed in the paper "Real-Time Analytical Processing with SQL Server" published by SQL Server in VLDB in 2015. Simply put, this paper proposes that when writing data, old data is marked for deletion (using a data structure called Delete Bitmap), and new data is recorded in the Delta Store. During queries, the Base data, Delete Bitmap, and data in the Delta Store are merged to obtain the latest data. The overall scheme is shown in the following figure, and will not be elaborated due to space limitations.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Merge-on-Write&lt;/p&gt;

&lt;p&gt;The advantage of this scheme is that any valid primary key exists only in one place (either in Base Data or Delta Store), which avoids a large amount of merge sorting consumption during queries. At the same time, various rich columnar indexes in the Base data remain valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Store
&lt;/h2&gt;

&lt;p&gt;A representative system using the Delta Store method is Apache Kudu. In Kudu, data is divided into Base Data and Delta Data. The primary keys in the Base Data are all unique. Any modification to the Base data will be first written to the Delta Store (marking the corresponding relationship with the Base Data through row numbers, which can avoid sorting during merging). Different from the Base + Delta of SQL Server mentioned earlier, Kudu does not mark deletions, so data with the same primary key will exist in two places. Therefore, during queries, the data from Base and Delta must be merged to obtain the latest result. Kudu's scheme is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delta-Store&lt;/p&gt;

&lt;p&gt;Kudu's scheme can also avoid the high cost caused by merge sorting when reading data. However, since data with the same primary key can exist in multiple places, it is difficult to ensure the accuracy of indexes and cannot perform efficient predicate pushdown. Indexes and predicate pushdown are important means for analytical databases to optimize performance, so this shortcoming has a significant impact on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-On-Write
&lt;/h2&gt;

&lt;p&gt;Since Apache Doris is positioned as a real-time analytical database, the Copy-On-Write scheme has too high a cost for real-time updates and is not suitable for Doris.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheme Comparison
&lt;/h2&gt;

&lt;p&gt;The following table compares the schemes. Merge-On-Read is the default implementation of the Unique Key model, i.e., the implementation before version 1.2. Merge-On-Write is the Delete + Insert scheme described earlier.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Scheme Comparison&lt;/p&gt;

&lt;p&gt;As can be seen from the above, Merge-On-Write trades moderate write costs for lower read costs, well supports predicate pushdown and non-key column index filtering, and has good effects on query performance optimization. After comprehensive comparison, we chose Merge-On-Write as the final optimization scheme.&lt;/p&gt;

&lt;h1&gt;
  
  
  Design and Implementation of the New Scheme
&lt;/h1&gt;

&lt;p&gt;In short, the processing flow of Merge-On-Write is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For each Key, find its position in the Base data (rowsetid + segmentid + row number).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the Key exists, mark the corresponding row of data as deleted. The information of marked deletion is recorded in the Delete Bitmap, and each Segment has a corresponding Delete Bitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write the updated data to a new Rowset, complete the transaction, and make the new data visible (able to be queried).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;During queries, read the Delete Bitmap, filter out the rows marked as deleted, and only return valid data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
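
&lt;p&gt;The four steps above can be condensed into a minimal sketch (a single-tablet toy model; &lt;code&gt;RowLocation&lt;/code&gt;, the global maps, and the function names are illustrative rather than Doris internals, and transactions and multi-versioning are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;utility&amp;gt;

// Illustrative stand-in for "rowsetid + segmentid + row number".
struct RowLocation { uint32_t rowset_id; uint32_t segment_id; uint32_t row_id; };

std::map&amp;lt;std::string, RowLocation&amp;gt; primary_key_index;  // key -&amp;gt; current location
std::map&amp;lt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, std::set&amp;lt;uint32_t&amp;gt;&amp;gt;
    delete_bitmap;  // (rowset, segment) -&amp;gt; rows marked as deleted

// Steps 1-3: locate the old row, mark it deleted, record the new location.
void upsert(const std::string&amp;amp; key, const RowLocation&amp;amp; new_loc) {
    auto it = primary_key_index.find(key);
    if (it != primary_key_index.end()) {
        const RowLocation&amp;amp; old = it-&amp;gt;second;
        delete_bitmap[{old.rowset_id, old.segment_id}].insert(old.row_id);
    }
    primary_key_index[key] = new_loc;
}

// Step 4: at query time, rows present in the Delete Bitmap are filtered out.
bool is_deleted(const RowLocation&amp;amp; loc) {
    auto it = delete_bitmap.find({loc.rowset_id, loc.segment_id});
    return it != delete_bitmap.end() &amp;amp;&amp;amp; it-&amp;gt;second.count(loc.row_id) &amp;gt; 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;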

&lt;h2&gt;
  
  
  Key Issues
&lt;/h2&gt;

&lt;p&gt;To design a Merge-On-Write scheme suitable for Doris, the following key issues need to be focused on solving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How to efficiently locate whether there is old data that needs to be marked for deletion during import?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently store the information of marked deletion?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently use the marked deletion information to filter data during the query phase?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can multi-version support be realized?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to avoid transaction conflicts in concurrent imports and write conflicts between imports and Compaction?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the additional memory consumption introduced by the scheme reasonable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the write performance degradation caused by write costs within an acceptable range?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on the above key issues, we have implemented a series of optimization measures to solve these problems well. They will be introduced in detail in the following text:&lt;/p&gt;

&lt;h3&gt;
  
  
  Primary Key Index
&lt;/h3&gt;

&lt;p&gt;Since Doris is a columnar storage system designed for large-scale analysis, it did not originally have a primary key index. Therefore, to quickly determine whether a primary key is being overwritten and locate the row number of the overwritten key, a primary key index had to be added to Doris.&lt;/p&gt;

&lt;p&gt;We have taken the following optimization measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintain a primary key index for each Segment. The primary key index is implemented using a scheme similar to RocksDB Partitioned Index. This scheme can achieve very high query QPS, and the file-based index scheme can also save memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a Bloom Filter corresponding to the primary key index for each Segment. The primary key index will only be queried when the Bloom Filter hits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Record a primary key range [min-key, max-key] for each Segment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a pure in-memory interval tree, constructed using the primary key ranges of all Segments. When querying a primary key, there is no need to traverse all Segments. The interval tree can be used to locate the Segments that may contain the primary key, greatly reducing the amount of indexes that need to be queried.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For all hit Segments, query them in descending order of version. In Doris, a higher version means more updated data. Therefore, if a primary key hits in the index of a higher-version Segment, there is no need to continue querying lower-version Segments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flow of querying a single primary key is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Primary Key Index&lt;/p&gt;
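
&lt;p&gt;Putting these measures together, a single-key lookup can be sketched as follows (a simplified model: one &lt;code&gt;std::map&lt;/code&gt; stands in for both the Bloom filter and the file-based primary key index, and a linear range check replaces the interval tree; all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;optional&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

// Per-Segment lookup metadata from the list above.
struct SegmentMeta {
    std::string min_key, max_key;
    uint64_t version;
    std::map&amp;lt;std::string, uint32_t&amp;gt; keys;  // primary key -&amp;gt; row number
};

// Returns (version, row number) from the newest Segment containing the key.
std::optional&amp;lt;std::pair&amp;lt;uint64_t, uint32_t&amp;gt;&amp;gt;
find_key(std::vector&amp;lt;SegmentMeta&amp;gt; segs, const std::string&amp;amp; key) {
    // Range pruning (the real implementation consults the in-memory interval tree).
    segs.erase(std::remove_if(segs.begin(), segs.end(),
                              [&amp;amp;](const SegmentMeta&amp;amp; s) {
                                  return key &amp;lt; s.min_key || key &amp;gt; s.max_key;
                              }),
               segs.end());
    // Probe candidates in descending version order: the first hit wins, so
    // lower-version Segments never need to be consulted.
    std::sort(segs.begin(), segs.end(),
              [](const SegmentMeta&amp;amp; a, const SegmentMeta&amp;amp; b) {
                  return a.version &amp;gt; b.version;
              });
    for (const auto&amp;amp; s : segs) {
        auto it = s.keys.find(key);  // Bloom filter check + index lookup, collapsed
        if (it != s.keys.end()) return std::make_pair(s.version, it-&amp;gt;second);
    }
    return std::nullopt;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;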

&lt;h3&gt;
  
  
  Delete Bitmap
&lt;/h3&gt;

&lt;p&gt;Delete Bitmap adopts a multi-version recording method, as shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delete-Bitmap&lt;/p&gt;

&lt;p&gt;The Segment file in the figure is generated by the import of version 5, including the imported data of version 5 in this Tablet.&lt;/p&gt;

&lt;p&gt;The import of version 6 includes the update of primary key B, so the second row will be marked as deleted in the Bitmap, and the modification of this Segment by the import of version 6 will be recorded in the DeleteBitmap.&lt;/p&gt;

&lt;p&gt;The import of version 7 includes the update of primary key A, which will also generate a Bitmap corresponding to the version; similarly, the import of version 8 will also generate a corresponding Bitmap.&lt;/p&gt;

&lt;p&gt;All Delete Bitmaps are stored in a large Map. Each import will serialize the latest Delete Bitmap into RocksDB. The key definitions are as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BitmapKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RowsetId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BitmapKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roaring&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Roaring&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delete_bitmap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Segment in each Rowset will record multiple versions of Bitmaps. A Bitmap with Version x means the modification of the current Segment by the import of version x.&lt;/p&gt;

&lt;p&gt;Advantages of multi-version Delete Bitmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It can well support multi-version queries. For example, after the import of version 7 is completed, a query on this table starts to execute and will use Version 7. Even if the query takes a long time and the import of version 8 is completed during the query execution, there is no need to worry about reading the data of version 8 (or missing the data deleted by version 8).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can well support complex Schema Changes. In Doris, complex Schema Changes (such as type conversion) require double writing first, and at the same time convert historical data before a certain version and then delete the old version of data. Multi-version Delete Bitmap can well support the current Schema Change implementation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can support multi-version requirements during data copying and replica repair.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, multi-version Delete Bitmaps also come at a cost. In the previous example, to access the data of version 8, the three Bitmaps of v6, v7, and v8 must be merged into one complete Bitmap, which is then used to filter the Segment data. In real-time high-frequency import scenarios, a large number of Bitmaps can accumulate quickly, and the union operation on RoaringBitmaps is CPU-intensive. To minimize the impact of these union operations, we added an LRUCache to the DeleteBitmap that records the most recently merged Bitmaps.&lt;/p&gt;
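
&lt;p&gt;Following the &lt;code&gt;BitmapKey&lt;/code&gt; definition above, the version-pinned merge plus cache can be sketched like this (&lt;code&gt;std::set&lt;/code&gt; stands in for &lt;code&gt;roaring::Roaring&lt;/code&gt;, a plain map stands in for the LRUCache with eviction omitted, and &lt;code&gt;RowsetId&lt;/code&gt; is reduced to an integer for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;tuple&amp;gt;

using RowsetId  = uint32_t;
using SegmentId = uint32_t;
using Version   = uint64_t;
using BitmapKey = std::tuple&amp;lt;RowsetId, SegmentId, Version&amp;gt;;
using Bitmap    = std::set&amp;lt;uint32_t&amp;gt;;  // stand-in for roaring::Roaring

std::map&amp;lt;BitmapKey, Bitmap&amp;gt; delete_bitmap;
std::map&amp;lt;BitmapKey, Bitmap&amp;gt; merged_cache;  // stand-in for the LRUCache

// Union every Bitmap of (rs, seg) whose version is &amp;lt;= query_version, so a
// query pinned to a version sees exactly the deletions up to that version.
const Bitmap&amp;amp; bitmap_for_query(RowsetId rs, SegmentId seg, Version query_version) {
    BitmapKey cache_key{rs, seg, query_version};
    auto hit = merged_cache.find(cache_key);
    if (hit != merged_cache.end()) return hit-&amp;gt;second;  // merge each version once
    Bitmap merged;
    for (const auto&amp;amp; [k, bm] : delete_bitmap)
        if (std::get&amp;lt;0&amp;gt;(k) == rs &amp;amp;&amp;amp; std::get&amp;lt;1&amp;gt;(k) == seg &amp;amp;&amp;amp;
            std::get&amp;lt;2&amp;gt;(k) &amp;lt;= query_version)
            merged.insert(bm.begin(), bm.end());
    return merged_cache[cache_key] = merged;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;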

&lt;h3&gt;
  
  
  Write Flow
&lt;/h3&gt;

&lt;p&gt;When writing data, the primary key index of each Segment will be created first, and then the Delete Bitmap will be updated. The establishment of the primary key index is relatively simple and will not be described in detail due to space limitations. The focus is on introducing the more complex Delete Bitmap update flow:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Write Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DeltaWriter will first flush the data to the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Publish phase, batch point queries are performed on all Keys, and the Bitmaps corresponding to the overwritten Keys are updated. In the figure, the version of the newly written Rowset is 8, and it modifies data in 3 Rowsets, so 3 Bitmap modification records are generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating the Bitmap in the Publish phase ensures that no new visible Rowsets will appear during the batch point query of Keys and Bitmap update, ensuring the correctness of Bitmap update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a Segment is not modified, there will be no Bitmap record corresponding to the version. For example, Segment1 of Rowset1 has no Bitmap corresponding to Version 8.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Read Flow
&lt;/h3&gt;

&lt;p&gt;The reading flow of Bitmap is shown in the following figure. It can be seen from the figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Read Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Query requesting version 7 will only see the data corresponding to version 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When reading the data of Rowset5, the Bitmaps generated by the modifications of v6 and v7 to it will be merged to obtain the complete DeleteBitmap corresponding to Version7, which is used to filter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the example in the figure, the import of version 8 overwrites a piece of data in Segment2 of Rowset1, but the Query requesting version 7 can still read this piece of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency import scenarios, there may be a large number of versions of Bitmaps. Merging these Bitmaps itself may also consume a lot of CPU computing resources. Therefore, we introduced an LRUCache, and each version of Bitmap only needs to be merged once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Normal Compaction Flow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When Compaction reads data, it obtains the version Vx of the Rowset being processed, and will automatically filter out the rows marked as deleted through the Delete Bitmap (see the query layer adaptation part earlier).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, all DeleteBitmaps on the source Rowset that are less than or equal to version Vx can be cleaned up.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During the execution of Compaction, a new import task may be submitted, assuming the corresponding version is Vy. If the write corresponding to Vy has modifications to the Rowset in the Compaction source, it will be updated to Vy of the DeleteBitmap of this Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, check all DeleteBitmaps on this Rowset that are greater than Vx, and update the row numbers in them to the Segment row numbers in the newly generated Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As shown in the following figure, Compaction selects three Rowsets [0-5], [6-6], [7-7]. During the Compaction process, the import of Version8 is successfully executed. In the Compaction Commit phase, it is necessary to process the new Bitmap generated by the data import of Version8.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Compaction&lt;/p&gt;
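
&lt;p&gt;The two rules above can be sketched as follows (a simplified model: &lt;code&gt;std::set&lt;/code&gt; stands in for &lt;code&gt;roaring::Roaring&lt;/code&gt;, the bitmaps of a single source Rowset are keyed by import version, and &lt;code&gt;row_map&lt;/code&gt; is the old-row-to-new-row mapping a compaction writer would record; all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;set&amp;gt;

using Version = uint64_t;
using Bitmap  = std::set&amp;lt;uint32_t&amp;gt;;  // stand-in for roaring::Roaring

// After a compaction that read its sources at version vx: bitmaps &amp;lt;= vx were
// already applied during the merge and are dropped with the source Rowsets,
// while newer bitmaps (e.g. a Version-8 import landing mid-compaction) are
// remapped onto the row numbers of the newly written Rowset.
std::map&amp;lt;Version, Bitmap&amp;gt; finish_compaction(
        const std::map&amp;lt;Version, Bitmap&amp;gt;&amp;amp; source_bitmaps,
        Version vx,
        const std::map&amp;lt;uint32_t, uint32_t&amp;gt;&amp;amp; row_map) {  // old row -&amp;gt; new row
    std::map&amp;lt;Version, Bitmap&amp;gt; output;  // bitmaps carried over to the new Rowset
    for (const auto&amp;amp; [version, bm] : source_bitmaps) {
        if (version &amp;lt;= vx) continue;  // cleaned up after compaction completes
        Bitmap remapped;
        for (uint32_t row : bm) {
            auto it = row_map.find(row);
            if (it != row_map.end()) remapped.insert(it-&amp;gt;second);
        }
        if (!remapped.empty()) output[version] = remapped;
    }
    return output;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;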

&lt;h3&gt;
  
  
  Write Performance Optimization
&lt;/h3&gt;

&lt;p&gt;In the initial design, DeltaWriter performed no point queries or Delete Bitmap updates during the data writing phase; both were done in the Publish phase. This guaranteed that all data preceding the version was visible when updating the Delete Bitmap, ensuring its correctness. However, high-frequency import tests showed that serially point-querying and updating all of each Rowset's data in the Publish phase added enough overhead to cause a significant drop in import throughput.&lt;/p&gt;

&lt;p&gt;Therefore, in the final design, we split the Delete Bitmap update into two phases: the first phase runs in parallel and looks up and marks deletions only against the versions visible at that moment; the second phase runs serially and handles the newly imported Rowsets that the first phase may have missed. Since the amount of incremental data processed in the second phase is very small, the impact on overall throughput is limited.&lt;/p&gt;

&lt;h1&gt;
  
  
  Optimization Effects
&lt;/h1&gt;

&lt;p&gt;The new Merge-On-Write implementation marks old data as deleted during writing, which can always ensure that valid primary keys only appear in one file (that is, the uniqueness of primary keys is ensured during writing). There is no need to deduplicate primary keys through merge sorting during reading. For high-frequency writing scenarios, this greatly reduces the additional consumption during query execution.&lt;/p&gt;

&lt;p&gt;In addition, the new version implementation can also support predicate pushdown and make good use of Doris' rich indexes. Sufficient data pruning can be performed at the data IO level, greatly reducing the amount of data read and computed. Therefore, there is a significant performance improvement in queries in many scenarios.&lt;/p&gt;

&lt;p&gt;It should be noted that if users use the Unique Key in low-frequency batch update scenarios, the improvement of the Merge-On-Write implementation on users' query effects may not be obvious. Because for low-frequency batch updates, Doris' Compaction mechanism can usually quickly compact the data into a good state (that is, Compaction completes the deduplication of primary keys), avoiding the deduplication computing cost during queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Effects on Aggregation Analysis
&lt;/h2&gt;

&lt;p&gt;We conducted tests using the Lineitem table, which has the largest data volume in TPC-H 100. To simulate multiple continuous writing scenarios, the data was divided into 100 parts and imported repeatedly 3 times. Then count(*) queries were performed, and the effect comparison is as follows:&lt;/p&gt;

&lt;p&gt;Image: Optimization - Aggregation Analysis&lt;/p&gt;

&lt;p&gt;The scenarios with and without Cache were compared respectively. In the case of no Cache, due to the high time consumption of loading data from the disk, there is an overall performance improvement of about 4 times; excluding the impact of disk reading overhead, in the case of Cache, the computing efficiency of the new version implementation can be improved by more than 20 times.&lt;/p&gt;

&lt;p&gt;The effect of Sum is similar, and will not be listed due to space limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSB Flat
&lt;/h2&gt;

&lt;p&gt;In addition to simple Count and Sum, we also tested the SSB-Flat dataset. The optimization effect on the 100G dataset (divided into 10 parts and imported multiple times to simulate data update scenarios) is shown in the following figure:&lt;/p&gt;

&lt;p&gt;In business scenarios of real-time data warehouses, providing good support for real-time data updates is an extremely important capability. For example, in scenarios such as database synchronization (CDC), e-commerce transaction orders, advertising effect delivery, and marketing business reports, when facing changes in upstream data, it is usually necessary to quickly capture change records and promptly modify single or multiple rows of data. This ensures that business analysts and related analysis platforms can quickly grasp the latest progress and improve the timeliness of business decisions.&lt;/p&gt;

&lt;p&gt;For OLAP databases, which have traditionally been weak at data updates, how to better implement real-time update capabilities has become a key to winning fierce competition in today's environment where data timeliness requirements are increasingly strong and the application scope of real-time data warehouse businesses is expanding.&lt;/p&gt;

&lt;p&gt;In the past, Apache Doris mainly implemented real-time data Upserts through the Unique Key data model. Due to its underlying LSM Tree-like structure, it provides strong support for high-frequency writes of large datasets. However, its Merge-on-Read update mode has become a bottleneck restricting Apache Doris' real-time update capabilities, which may cause query jitters when dealing with concurrent reading and writing of real-time data.&lt;/p&gt;

&lt;p&gt;Based on this, in the Apache Doris 1.2.0 version, we introduced a new data update method - Merge-On-Write - for the Unique Key model, striving to balance real-time updates and efficient queries. This article will detail the design, implementation and effects of the new primary key model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation of the Original Unique Key Model
&lt;/h1&gt;

&lt;p&gt;Users familiar with Apache Doris' history may know that Doris' initial design was inspired by Google Mesa, and it only had Duplicate Key and Aggregate Key models at first. The Unique Key model was added later based on user needs during Doris' development. However, the demand for real-time updates was not so strong at that time, so the implementation of Unique Key was relatively simple - it was just a wrapper around the Aggregate Key model, without in-depth optimization for real-time update requirements.&lt;/p&gt;

&lt;p&gt;Specifically, the implementation of the Unique Key model is just a special case of the Aggregate Key model. If you use the Aggregate Key model and set the aggregation type of all non-key columns to REPLACE, you can achieve exactly the same effect. As shown in the following figure, when describing example_tbl, a table of the Unique Key model, the aggregation type in the last column shows that it is equivalent to an Aggregate Key table where all columns have the REPLACE aggregation type.&lt;/p&gt;

&lt;p&gt;Image: Original Unique-Key-Aggregate-Key&lt;/p&gt;

&lt;p&gt;Both the Unique Key and Aggregate Key data models adopt the Merge-On-Read implementation method. That is, when data is imported, it is first written to a new Rowset, and no deduplication is performed after writing. Only when a query is initiated will multi-way concurrent sorting be performed. During multi-way merge sorting, duplicate keys will be grouped together and aggregation operations will be performed. Among them, keys with higher versions will overwrite those with lower versions, and finally only the record with the highest version will be returned to the user.&lt;/p&gt;

&lt;p&gt;The following figure is a simplified representation of the execution process of the Unique Key model:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Simplified Unique-Key&lt;/p&gt;

&lt;p&gt;Although their implementation methods are relatively consistent, the usage scenarios of the Unique Key and Aggregate Key data models are significantly different:&lt;/p&gt;

&lt;p&gt;When users create a table with the Aggregate Key model, they have a very clear understanding of the aggregation query conditions - aggregating according to the columns specified by the Aggregate Key, and the aggregate functions on the Value columns are the main aggregation methods (COUNT/SUM/MAX/MIN, etc.) used by users. For example, using user_id as the Aggregate Key and summing the number of visits and duration to calculate UV and user usage duration.&lt;/p&gt;

&lt;p&gt;However, the main function of the Key in the Unique Key data model is to ensure uniqueness, not to serve as an aggregation Key. For example, in the order scenario, data synchronized from TP databases through Flink CDC uses the order ID as the Unique Key for deduplication. However, during queries, filtering, aggregation and analysis are usually performed on certain Value columns (such as order status, order amount, order time consumption, order placement time, etc.).&lt;/p&gt;

&lt;h1&gt;
  
  
  Shortcomings
&lt;/h1&gt;

&lt;p&gt;As can be seen from the above, when users query using the Unique Key model, they actually perform two aggregation operations. The first is to aggregate all data by Key according to the Unique Key to remove duplicate Keys; the second is to aggregate according to the actual aggregation conditions required by the query. These two aggregation operations lead to serious efficiency issues and low query performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data deduplication requires expensive multi-way merge sorting, and full Key comparison consumes a lot of CPU computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective data pruning cannot be performed, introducing a large amount of additional data IO. For example, if a data partition has 10 million pieces of data, but only 1,000 pieces meet the filtering conditions, the rich indexes of the OLAP system are designed to efficiently filter out these 1,000 pieces of data. However, since it is impossible to determine whether a certain piece of data in a specific file is valid, these indexes cannot be used. It is necessary to first perform full merge sorting and data deduplication, and then filter these finally confirmed valid data. This brings about a 10,000-fold IO amplification (this figure is only a rough estimate, and the actual amplification effect is more complicated to calculate).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Scheme Research and Selection
&lt;/h1&gt;

&lt;p&gt;In order to solve the problems existing in the original Unique Key model and better meet the needs of business scenarios, we decided to optimize the Unique Key model and conducted a detailed research on optimization schemes for read and write efficiency issues.&lt;/p&gt;

&lt;p&gt;There have been many industry explorations on solutions to the above problems. There are three representative types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Delete + Insert: That is, when writing data, find the overwritten key through a primary key index and mark it as deleted. A representative system is Microsoft's SQL Server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delta Store: Divide data into base data and delta data. Each primary key in the base data is guaranteed to be unique. All updates are recorded in the Delta Store. During queries, the base data and delta data are merged. At the same time, background merge threads regularly merge the delta data and base data. A representative system is Apache Kudu.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy-on-Write: When updating data, directly copy the original data row, update it, and write it to a new file. This method is widely used in data lakes, with representative systems such as Apache Hudi and Delta Lake.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation mechanisms and comparisons of these three schemes are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete + Insert (i.e., Merge-on-Write)
&lt;/h2&gt;

&lt;p&gt;A representative example is the scheme proposed in the paper "Real-Time Analytical Processing with SQL Server", published by the SQL Server team at VLDB 2015. In short, when data is written, old rows are marked as deleted (via a data structure called a Delete Bitmap) and new rows are recorded in the Delta Store. At query time, the Base data, the Delete Bitmap, and the Delta Store are combined to produce the latest data. The overall scheme is shown in the figure below; we omit the details for space.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Merge-on-Write&lt;/p&gt;

&lt;p&gt;The advantage of this scheme is that any valid primary key exists only in one place (either in Base Data or Delta Store), which avoids a large amount of merge sorting consumption during queries. At the same time, various rich columnar indexes in the Base data remain valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Store
&lt;/h2&gt;

&lt;p&gt;A representative system using the Delta Store method is Apache Kudu. In Kudu, data is divided into Base Data and Delta Data, and the primary keys in the Base Data are all unique. Any modification to the Base Data is first written to the Delta Store (row numbers record its correspondence with the Base Data, which avoids sorting during merges). Unlike SQL Server's Base + Delta scheme described earlier, Kudu does not mark deletions, so data with the same primary key can exist in two places. Therefore, at query time, the Base and Delta data must be merged to obtain the latest result. Kudu's scheme is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delta-Store&lt;/p&gt;

&lt;p&gt;Kudu's scheme can also avoid the high cost caused by merge sorting when reading data. However, since data with the same primary key can exist in multiple places, it is difficult to ensure the accuracy of indexes and cannot perform efficient predicate pushdown. Indexes and predicate pushdown are important means for analytical databases to optimize performance, so this shortcoming has a significant impact on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-On-Write
&lt;/h2&gt;

&lt;p&gt;Since Apache Doris is positioned as a real-time analytical database, the Copy-On-Write scheme has too high a cost for real-time updates and is not suitable for Doris.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheme Comparison
&lt;/h2&gt;

&lt;p&gt;The following table compares the schemes. Merge-On-Read is the default implementation of the Unique Key model, i.e., the implementation before version 1.2; Merge-On-Write is the Delete + Insert scheme described above.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Scheme Comparison&lt;/p&gt;

&lt;p&gt;As can be seen from the above, Merge-On-Write trades moderate write costs for lower read costs, well supports predicate pushdown and non-key column index filtering, and has good effects on query performance optimization. After comprehensive comparison, we chose Merge-On-Write as the final optimization scheme.&lt;/p&gt;

&lt;h1&gt;
  
  
  Design and Implementation of the New Scheme
&lt;/h1&gt;

&lt;p&gt;In short, the processing flow of Merge-On-Write is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For each Key, find its position in the Base data (rowsetid + segmentid + row number).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the Key exists, mark the corresponding row of data as deleted. The information of marked deletion is recorded in the Delete Bitmap, and each Segment has a corresponding Delete Bitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write the updated data to a new Rowset, complete the transaction, and make the new data visible (able to be queried).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;During queries, read the Delete Bitmap, filter out the rows marked as deleted, and only return valid data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
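The four steps above can be sketched in simplified C++. This is an illustrative model only: `Segment`, `publish`, and `visible` are hypothetical names, and a `std::set` of row numbers stands in for the RoaringBitmap that Doris actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using RowId = uint32_t;
// Stand-in for roaring::Roaring: the set of row numbers marked deleted.
using Bitmap = std::set<RowId>;

struct Segment {
    std::map<std::string, RowId> primary_index;  // primary key -> row number
    Bitmap delete_bitmap;                        // rows marked deleted
};

// Steps 1-2: locate each incoming key in the base segments; if found,
// mark the old row as deleted in that segment's delete bitmap.
void publish(std::vector<Segment>& base, const std::vector<std::string>& new_keys) {
    for (const auto& key : new_keys) {
        for (auto& seg : base) {
            auto it = seg.primary_index.find(key);
            if (it != seg.primary_index.end()) {
                seg.delete_bitmap.insert(it->second);  // mark old row deleted
                break;  // a valid key exists in only one place
            }
        }
    }
}

// Step 4: at query time, a row is visible only if it is not marked deleted.
bool visible(const Segment& seg, RowId row) {
    return seg.delete_bitmap.count(row) == 0;
}
```

Step 3 (writing the new data into its own Rowset) is omitted here; the point is that after `publish`, readers never need a merge sort, only the bitmap filter.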

&lt;h2&gt;
  
  
  Key Issues
&lt;/h2&gt;

&lt;p&gt;To design a Merge-On-Write scheme suitable for Doris, the following key issues need to be focused on solving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How to efficiently locate whether there is old data that needs to be marked for deletion during import?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently store the information of marked deletion?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to efficiently use the marked deletion information to filter data during the query phase?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can multi-version support be realized?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to avoid transaction conflicts in concurrent imports and write conflicts between imports and Compaction?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the additional memory consumption introduced by the scheme reasonable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the write performance degradation caused by write costs within an acceptable range?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on the above key issues, we have implemented a series of optimization measures to solve these problems well. They will be introduced in detail in the following text:&lt;/p&gt;

&lt;h3&gt;
  
  
  Primary Key Index
&lt;/h3&gt;

&lt;p&gt;Doris is a columnar storage system designed for large-scale analytics and originally had no primary key index. So, to quickly determine whether a newly written primary key overwrites an existing row, and to find that row's number, a primary key index had to be added to Doris.&lt;/p&gt;

&lt;p&gt;We have taken the following optimization measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintain a primary key index for each Segment. The primary key index is implemented using a scheme similar to RocksDB Partitioned Index. This scheme can achieve very high query QPS, and the file-based index scheme can also save memory usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a Bloom Filter corresponding to the primary key index for each Segment. The primary key index will only be queried when the Bloom Filter hits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Record a primary key range [min-key, max-key] for each Segment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain a pure in-memory interval tree, constructed using the primary key ranges of all Segments. When querying a primary key, there is no need to traverse all Segments. The interval tree can be used to locate the Segments that may contain the primary key, greatly reducing the amount of indexes that need to be queried.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For all hit Segments, query them in descending order of version. In Doris, a higher version means more updated data. Therefore, if a primary key hits in the index of a higher-version Segment, there is no need to continue querying lower-version Segments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
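The pruning ideas above (per-Segment min/max key ranges, plus querying candidates in descending version order) can be sketched as follows. All names are illustrative, and for brevity a linear range filter replaces the interval tree Doris uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct SegmentIndex {
    uint64_t version;
    std::string min_key, max_key;            // recorded primary key range
    std::map<std::string, uint32_t> index;   // primary key -> row number
};

// Return (version, row number) of the newest segment containing `key`, if any.
std::optional<std::pair<uint64_t, uint32_t>>
lookup(std::vector<SegmentIndex> segs, const std::string& key) {
    // Prune segments whose [min-key, max-key] range cannot contain the key.
    // (In Doris an in-memory interval tree does this without a full scan.)
    segs.erase(std::remove_if(segs.begin(), segs.end(),
                              [&](const SegmentIndex& s) {
                                  return key < s.min_key || key > s.max_key;
                              }),
               segs.end());
    // Query candidates in descending version order: a hit in a higher-version
    // segment means lower-version segments never need to be consulted.
    std::sort(segs.begin(), segs.end(),
              [](const SegmentIndex& a, const SegmentIndex& b) {
                  return a.version > b.version;
              });
    for (const auto& s : segs) {
        auto it = s.index.find(key);
        if (it != s.index.end())
            return std::pair<uint64_t, uint32_t>{s.version, it->second};
    }
    return std::nullopt;
}
```

The per-Segment Bloom filter described above would sit just before the `s.index.find(key)` probe, skipping the index read entirely on a miss.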

&lt;p&gt;The flow of querying a single primary key is shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Primary Key Index&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete Bitmap
&lt;/h3&gt;

&lt;p&gt;Delete Bitmap adopts a multi-version recording method, as shown in the following figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Delete-Bitmap&lt;/p&gt;

&lt;p&gt;The Segment file in the figure was generated by the version 5 import and contains the data that this import wrote into the Tablet.&lt;/p&gt;

&lt;p&gt;The version 6 import updates primary key B, so the second row is marked as deleted in a Bitmap, and this modification of the Segment is recorded in the DeleteBitmap under version 6.&lt;/p&gt;

&lt;p&gt;The version 7 import updates primary key A and likewise generates a Bitmap for its version; the version 8 import does the same.&lt;/p&gt;

&lt;p&gt;All Delete Bitmaps are stored in a large Map. Each import will serialize the latest Delete Bitmap into RocksDB. The key definitions are as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;BitmapKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RowsetId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SegmentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BitmapKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roaring&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Roaring&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;delete_bitmap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Segment in each Rowset will record multiple versions of Bitmaps. A Bitmap with Version x means the modification of the current Segment by the import of version x.&lt;/p&gt;

&lt;p&gt;Advantages of multi-version Delete Bitmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It can well support multi-version queries. For example, after the import of version 7 is completed, a query on this table starts to execute and will use Version 7. Even if the query takes a long time and the import of version 8 is completed during the query execution, there is no need to worry about reading the data of version 8 (or missing the data deleted by version 8).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can well support complex Schema Changes. In Doris, complex Schema Changes (such as type conversion) require double writing first, and at the same time convert historical data before a certain version and then delete the old version of data. Multi-version Delete Bitmap can well support the current Schema Change implementation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can support multi-version requirements during data copying and replica repair.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, multi-version Delete Bitmaps also have a cost. In the example above, accessing the data of version 8 requires merging the three Bitmaps of v6, v7, and v8 into one complete Bitmap, which is then used to filter the Segment data. Real-time, high-frequency imports can easily generate a large number of Bitmaps, and RoaringBitmap union operations are CPU-expensive. To minimize the impact of these union operations, we added an LRUCache to the DeleteBitmap that keeps the most recently merged Bitmaps.&lt;/p&gt;
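The merge-with-cache idea can be sketched like this. A `std::set` stands in for `roaring::Roaring`, a plain map stands in for the LRUCache, and all names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

using Version = uint64_t;
using Bitmap = std::set<uint32_t>;  // stand-in for roaring::Roaring

// All per-version delete bitmaps of one segment, keyed by import version.
using VersionedBitmaps = std::map<Version, Bitmap>;

// Union all bitmaps with version <= query_version, caching the result so the
// expensive merge runs once per version (Doris uses an LRUCache for this).
const Bitmap& merged_up_to(const VersionedBitmaps& bms, Version query_version,
                           std::map<Version, Bitmap>& cache) {
    auto hit = cache.find(query_version);
    if (hit != cache.end()) return hit->second;  // merged once already
    Bitmap out;
    for (const auto& [v, bm] : bms) {
        if (v > query_version) break;        // newer imports stay invisible
        out.insert(bm.begin(), bm.end());    // roaring would use operator|=
    }
    return cache.emplace(query_version, std::move(out)).first->second;
}
```

With the bitmaps from the example (v6 marks row 1, v7 marks row 0, v8 marks row 3), a version 7 query merges only v6 and v7, while a version 8 query additionally folds in v8.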

&lt;h3&gt;
  
  
  Write Flow
&lt;/h3&gt;

&lt;p&gt;When writing data, the primary key index of each Segment is created first, and then the Delete Bitmap is updated. Building the primary key index is relatively straightforward, so we focus on the more involved Delete Bitmap update flow:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Write Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DeltaWriter will first flush the data to the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Publish phase, batch point queries are performed on all Keys, and the Bitmaps corresponding to the overwritten Keys are updated. In the following figure, the version of the newly written Rowset is 8, which modifies the data in 3 Rowsets, so 3 Bitmap modification records will be generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating the Bitmap in the Publish phase ensures that no new visible Rowsets will appear during the batch point query of Keys and Bitmap update, ensuring the correctness of Bitmap update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a Segment is not modified, there will be no Bitmap record corresponding to the version. For example, Segment1 of Rowset1 has no Bitmap corresponding to Version 8.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Read Flow
&lt;/h3&gt;

&lt;p&gt;The reading flow of Bitmap is shown in the following figure. It can be seen from the figure:&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Read Flow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Query requesting version 7 will only see the data corresponding to version 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When reading the data of Rowset5, the Bitmaps generated by the modifications of v6 and v7 to it will be merged to obtain the complete DeleteBitmap corresponding to Version7, which is used to filter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the example in the figure, the import of version 8 overwrites a piece of data in Segment2 of Rowset1, but the Query requesting version 7 can still read this piece of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency import scenarios, there may be a large number of versions of Bitmaps. Merging these Bitmaps itself may also consume a lot of CPU computing resources. Therefore, we introduced an LRUCache, and each version of Bitmap only needs to be merged once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Normal Compaction Flow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When Compaction reads data, it obtains the version Vx of the Rowset being processed, and will automatically filter out the rows marked as deleted through the Delete Bitmap (see the query layer adaptation part earlier).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction is completed, all DeleteBitmaps on the source Rowset that are less than or equal to version Vx can be cleaned up.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling of Compaction and Write Conflicts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During Compaction, a new import task may commit; suppose its version is Vy. If the Vy write modifies rows in a Rowset that is a Compaction input, the changes are recorded under version Vy of that Rowset's DeleteBitmap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After Compaction completes, all DeleteBitmaps on the input Rowsets with versions greater than Vx are checked, and the row numbers in them are translated to the Segment row numbers of the newly generated Rowset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
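The row-number translation in the second bullet can be sketched as follows, assuming a hypothetical `row_map` that records where each surviving old row landed in the new Rowset after Compaction.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using RowId = uint32_t;
using Bitmap = std::set<RowId>;  // stand-in for roaring::Roaring

// Translate a delete bitmap recorded against the old Rowset layout into the
// row numbers of the newly compacted Rowset. `row_map` maps old row -> new row.
Bitmap remap_bitmap(const Bitmap& old_bm, const std::map<RowId, RowId>& row_map) {
    Bitmap out;
    for (RowId old_row : old_bm) {
        auto it = row_map.find(old_row);
        if (it != row_map.end()) out.insert(it->second);
        // Rows absent from row_map were already dropped by Compaction itself
        // (they were marked deleted at or below Vx), so nothing to carry over.
    }
    return out;
}
```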

&lt;p&gt;As shown in the following figure, Compaction selects three Rowsets [0-5], [6-6], [7-7]. During the Compaction process, the import of Version8 is successfully executed. In the Compaction Commit phase, it is necessary to process the new Bitmap generated by the data import of Version8.&lt;/p&gt;

&lt;p&gt;Image: Performance Improvement - Compaction&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Performance Optimization
&lt;/h3&gt;

&lt;p&gt;In the initial design, DeltaWriter performed no point queries or Delete Bitmap updates during the write phase; both were done in the Publish phase. This guarantees that all data preceding the version is visible when the Delete Bitmap is updated, ensuring its correctness. However, high-frequency import tests showed that serially point-querying every key of each Rowset and updating the bitmaps in the Publish phase caused a significant drop in import throughput.&lt;/p&gt;

&lt;p&gt;Therefore, the final design splits the Delete Bitmap update into two phases: the first phase runs in parallel and finds and marks deletions only against the versions visible at that time; the second phase runs serially and covers only the Rowsets imported since the first phase, which the first phase may have missed. Because the incremental data handled in the second phase is very small, the impact on overall throughput is limited.&lt;/p&gt;
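A minimal sketch of the two-phase update, with illustrative names and versions treated as plain integers:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Rowset {
    uint64_t version;
    std::map<std::string, uint32_t> index;   // primary key -> row number
    std::set<uint32_t> delete_bitmap;        // rows marked deleted
};

// Mark old rows of `keys` deleted in rowsets with version in (lo, hi].
void mark_range(std::vector<Rowset>& rowsets, const std::vector<std::string>& keys,
                uint64_t lo, uint64_t hi) {
    for (auto& rs : rowsets) {
        if (rs.version <= lo || rs.version > hi) continue;
        for (const auto& k : keys) {
            auto it = rs.index.find(k);
            if (it != rs.index.end()) rs.delete_bitmap.insert(it->second);
        }
    }
}

// Phase 1 (parallel): cover everything visible when the write started.
// Phase 2 (serial, in Publish): cover only the few rowsets committed since.
void two_phase_update(std::vector<Rowset>& rowsets, const std::vector<std::string>& keys,
                      uint64_t visible_at_write, uint64_t publish_version) {
    mark_range(rowsets, keys, 0, visible_at_write);                    // bulk
    mark_range(rowsets, keys, visible_at_write, publish_version - 1);  // delta
}
```

The second `mark_range` call touches only versions between the snapshot the writer saw and its own publish version, which is why its cost stays small.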

&lt;h1&gt;
  
  
  Optimization Effects
&lt;/h1&gt;

&lt;p&gt;The new Merge-On-Write implementation marks old data as deleted during writing, which can always ensure that valid primary keys only appear in one file (that is, the uniqueness of primary keys is ensured during writing). There is no need to deduplicate primary keys through merge sorting during reading. For high-frequency writing scenarios, this greatly reduces the additional consumption during query execution.&lt;/p&gt;

&lt;p&gt;In addition, the new version implementation can also support predicate pushdown and make good use of Doris' rich indexes. Sufficient data pruning can be performed at the data IO level, greatly reducing the amount of data read and computed. Therefore, there is a significant performance improvement in queries in many scenarios.&lt;/p&gt;

&lt;p&gt;Note that for low-frequency batch-update workloads, the query-side improvement from Merge-On-Write may not be obvious: with infrequent batch updates, Doris' Compaction mechanism can usually compact the data into good shape quickly (that is, Compaction itself completes the primary key deduplication), which already avoids the deduplication cost at query time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Effects on Aggregation Analysis
&lt;/h2&gt;

&lt;p&gt;We tested with the Lineitem table, the largest table in TPC-H 100. To simulate continuous writes, the data was split into 100 parts and imported 3 times over; count(*) queries were then run. The comparison is as follows:&lt;/p&gt;

&lt;p&gt;Image: Optimization - Aggregation Analysis&lt;/p&gt;

&lt;p&gt;We compared the scenarios with and without Cache. Without Cache, where loading data from disk dominates, the new implementation is about 4x faster overall; with Cache, which excludes the disk-read overhead, the computing efficiency of the new implementation improves by more than 20x.&lt;/p&gt;

&lt;p&gt;The effect of Sum is similar, and will not be listed due to space limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSB Flat
&lt;/h2&gt;

&lt;p&gt;In addition to simple Count and Sum, we also tested the SSB-Flat dataset. The optimization effect on the 100G dataset (divided into 10 parts and imported multiple times to simulate data update scenarios) is shown in the following figure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0pl3yj4ukewjb4ttzjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0pl3yj4ukewjb4ttzjd.png" alt=" " width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explanation of test results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Under the typical configuration of 32C64GB, the total time for all queries to complete is 4.5 seconds for the new version implementation, and 126.4 seconds for the old version implementation, with a speed difference of nearly 30 times. Further analysis found that when queries were executed on the table of the old version implementation, all 32-core CPUs were fully loaded. Therefore, a machine with a higher configuration was used to test the query time on the table of the old version implementation when computing resources were sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under the configuration of 64C128GB, the total time of the old version implementation is 49.9s, and the maximum number of cores used is about 48. When computing resources are sufficient, the old version implementation still has a 12-fold performance gap compared with the new version implementation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can be seen that the new version implementation not only greatly improves the query speed, but also significantly reduces CPU consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact on Data Import
&lt;/h2&gt;

&lt;p&gt;The new Merge-On-Write implementation primarily optimizes query performance and, as shown above, does so effectively. These gains, however, come from extra work done at write time, so Merge-On-Write slows imports down to a small extent. Thanks to concurrency and the pipelining between batches of imports, the overall import throughput does not drop significantly.&lt;/p&gt;

&lt;h1&gt;
  
  
  Usage Method
&lt;/h1&gt;

&lt;p&gt;In version 1.2, Merge-on-Write is a new feature and is disabled by default. Users can enable it by adding the following property when creating a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="nv"&gt;"enable_unique_key_merge_on_write"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, the new Merge-on-Write update mode differs from the old Merge-on-Read implementation, so an existing Unique Key table cannot be switched over by adding the property via &lt;code&gt;ALTER TABLE&lt;/code&gt;; the property can only be specified when creating a new table. To migrate an old table to a new one, use &lt;code&gt;INSERT INTO new_table SELECT * FROM old_table&lt;/code&gt;.&lt;/p&gt;
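Putting both together, a minimal illustrative DDL follows. The column list, table names, and bucket count are placeholders, not from the original article; only the `enable_unique_key_merge_on_write` property is the documented switch.

```sql
-- Create a Unique Key table with Merge-on-Write enabled from the start
-- (schema and bucket count are illustrative placeholders).
CREATE TABLE t_user (
    `user_id` BIGINT NOT NULL,
    `name`    VARCHAR(64)
)
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 8
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true"
);

-- Migrate data from an existing Merge-on-Read table:
INSERT INTO t_user SELECT * FROM old_t_user;
```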


</description>
      <category>bigdata</category>
      <category>olap</category>
      <category>database</category>
      <category>apachedoris</category>
    </item>
    <item>
      <title>1 billion JSON records, 1-second query response: Apache Doris vs. ClickHouse, Elasticsearch, and PostgreSQL</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Tue, 04 Nov 2025 19:32:35 +0000</pubDate>
      <link>https://forem.com/apachedoris/1-billion-json-records-1-second-query-response-apache-doris-vs-clickhouse-elasticsearch-and-22m2</link>
      <guid>https://forem.com/apachedoris/1-billion-json-records-1-second-query-response-apache-doris-vs-clickhouse-elasticsearch-and-22m2</guid>
      <description>&lt;p&gt;Honestly, every time I check performance benchmarks, my eyes instinctively dart to see where Apache Doris ranks. Opening JSONBench's leaderboard this time, I felt that familiar mix of anticipation and nervousness. Fortunately, the result brought me a sigh of relief: Apache Doris snagged third place with just its default configuration, trailing only two versions of ClickHouse (the maintainer of JSONBench itself).&lt;/p&gt;

&lt;p&gt;Not bad. But can Apache Doris go even further? I wanted to see how much more we could cut query latency through optimization, and find out the true performance gap between Apache Doris and ClickHouse. Long story short, here's a before-and-after comparison chart of our optimizations. For the details behind the improvements, read on!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjd66q4ccp6edhv3xoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkjd66q4ccp6edhv3xoa.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtb6mj0eyf6xr87523ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtb6mj0eyf6xr87523ss.png" alt=" " width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  I. What is JSONBench?
&lt;/h1&gt;

&lt;p&gt;JSONBench is a benchmark tool specifically designed for JSON data analytics, with the following core features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Data&lt;/strong&gt;: 1 billion JSON-format user behavior logs from real production environments;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Cases&lt;/strong&gt;: 5 SQL queries specifically designed for JSON structures, accurately evaluating the database's ability to process semi-structured data;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Participants&lt;/strong&gt;: Covers mainstream databases such as ClickHouse, SingleStore, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the time of testing, Apache Doris had already delivered an impressive performance: twice as fast as Elasticsearch and a staggering 80 times faster than PostgreSQL!&lt;/p&gt;

&lt;p&gt;JSONBench Official Website: &lt;a href="https://jsonbench.com" rel="noopener noreferrer"&gt;jsonbench.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzp3r8do8yoirpwghsew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzp3r8do8yoirpwghsew.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to its performance advantage, Apache Doris is also highly competitive in storage footprint: on the same dataset, its storage size is only 50% of Elasticsearch's and one third of PostgreSQL's.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq3minivqmkxxtb92b5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq3minivqmkxxtb92b5x.png" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1.1 JSONBench Testing Process
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a table named &lt;code&gt;Bluesky&lt;/code&gt; in the database and import 1 billion real user behavior logs;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each query is executed 3 times, with the operating system's page cache cleared beforehand, so the runs cover both cold and warm query scenarios;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the database performance ranking based on the total query execution time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1.2 Apache Doris Test Basic Configuration
&lt;/h2&gt;

&lt;p&gt;In this test, Apache Doris used the VARIANT data type to store JSON data (introduced in Doris version 2.1, specifically designed for semi-structured JSON data), with the default table structure as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="n"&gt;AUTO_INCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`data`&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"replication_num"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core Advantages of VARIANT Data Type&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No need to predefine column structures; it can directly store complex data containing integers, strings, booleans, and other types;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adapts to frequently changing nested structures: column names and types are inferred automatically from the data during writes, and write schemas are merged dynamically;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stores JSON key-value pairs as dynamic sub-columns, balancing the flexibility of semi-structured data with the query efficiency of columnar storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More information about VARIANT data type: &lt;a href="https://doris.apache.org/docs/3.0/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT" rel="noopener noreferrer"&gt;Apache Doris Official Documentation&lt;/a&gt;&lt;/p&gt;
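&lt;p&gt;As a minimal sketch of working with VARIANT (the sample values below are hypothetical, not benchmark data), a raw JSON document can be written directly, and its sub-fields read back by path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Insert raw JSON; Doris infers the sub-columns automatically
INSERT INTO bluesky (`data`) VALUES
    ('{"kind": "commit", "did": "did:plc:example", "time_us": 1700000000000000}');

-- Access a sub-field by path and cast it to a concrete type
SELECT cast(data['kind'] AS TEXT) AS kind FROM bluesky LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;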

&lt;h1&gt;
  
  
  II. Apache Doris Performance Optimization Practice
&lt;/h1&gt;

&lt;p&gt;The JSONBench leaderboard is based on the performance data of each database system under its default configuration. However, in actual production environments, can we further unlock the potential of Apache Doris through tuning? The following is the complete optimization process.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 Basic Environment Configuration
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Server: AWS M6i.8xlarge (32 cores, 128GB memory);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operating System: Ubuntu 24.04;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Doris Version: v3.0.5.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2.2 Core Optimization: Schema Structuring Transformation
&lt;/h2&gt;

&lt;p&gt;All queries in JSONBench target fixed JSON extraction paths, which means the actual schema of the semi-structured data is fixed. Based on this, we used &lt;strong&gt;Generated Columns&lt;/strong&gt; to extract the frequently accessed fields, combining the advantages of semi-structured and structured data. For frequently accessed JSON paths or computed expressions, adding generated columns can significantly improve query speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2.1 Optimized Table Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.kind'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.commit.operation'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.commit.collection'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'$.did'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_json_bigint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.time_us'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`data`&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DUPLICATE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTRIBUTED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BUCKETS&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"replication_num"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transformation reduces the data-extraction overhead during queries, and the flattened columns can also be used as partition columns to achieve a more balanced data distribution.&lt;/p&gt;
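&lt;p&gt;For example, the flattened &lt;code&gt;time&lt;/code&gt; column could serve as a range partition key. This is a sketch only: the partition names and boundaries below are illustrative, and support for generated columns as partition keys should be verified against the documentation for your Doris version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE bluesky_by_month (
    kind VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data, '$.kind')) NOT NULL,
    `time` DATETIME GENERATED ALWAYS AS (from_microsecond(get_json_bigint(data, '$.time_us'))) NOT NULL,
    `data` variant NOT NULL
)
DUPLICATE KEY (kind)
PARTITION BY RANGE (`time`) (
    -- Illustrative monthly partitions
    PARTITION p202411 VALUES LESS THAN ('2024-12-01'),
    PARTITION p202412 VALUES LESS THAN ('2025-01-01')
)
DISTRIBUTED BY HASH(kind) BUCKETS 32
PROPERTIES ("replication_num"="1");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;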

&lt;h3&gt;
  
  
  2.2.2 Supporting Query Statement Optimization
&lt;/h3&gt;

&lt;p&gt;Query statements need to be modified synchronously to use flattened columns. The following is a comparison before and after optimization:&lt;/p&gt;

&lt;h4&gt;
  
  
  Before Optimization (Native JSON Query):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Query 1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 2&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span 
class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 3&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span 
class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.repost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.like'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 4&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span 
class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 5&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'did'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MILLISECONDS_DIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_microsecond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'time_us'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span 
class="s1"&gt;'kind'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'operation'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'commit'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'collection'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  After Optimization (Flattened Column Query):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Query 1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 2&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 3&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.repost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.like'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 4&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;first_post_ts&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Query 5&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;did&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MILLISECONDS_DIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bluesky&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'commit'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'create'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app.bsky.feed.post'&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;activity_span&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2.3 Page Cache Tuning
&lt;/h2&gt;

&lt;p&gt;After modifying the query statements, we enabled performance profiling and executed the complete test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;enable_profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
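&lt;p&gt;Besides the FE Web UI, recent Doris versions can also list collected profiles directly from the MySQL client (worth verifying the exact statement for your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List collected query profiles from the MySQL client
SHOW QUERY PROFILE "/";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;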



&lt;p&gt;By viewing the profile through the FE Web UI (port 8030), we found that the Page Cache hit rate of the SCAN Operator was extremely low: cold reads were still occurring during the warm query runs (like opening the fridge for a snack, finding it empty, and having to go all the way to the supermarket). The key data is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cached Pages Number (CachedPagesNum): 1.258K (1258);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total Pages Number (TotalPagesNum): 7.422K (7422).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause is that the default size of Page Cache is not sufficient to hold all the data of the Bluesky table. The solution is to add a configuration in &lt;code&gt;be.conf&lt;/code&gt; to increase the proportion of Page Cache in total memory from the default 20% to 60%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="n"&gt;storage_page_cache_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After re-running the test, the cold read issue was completely resolved, with a cache hit rate of 100%:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cached Pages Number (CachedPagesNum): 7.316K (7316);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total Pages Number (TotalPagesNum): 7.316K (7316).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2.4 Maximizing Parallelism Configuration
&lt;/h2&gt;

&lt;p&gt;To further unleash performance, we set the session variable &lt;code&gt;parallel_pipeline_task_num&lt;/code&gt; to 32 — since the test server has 32 CPU cores, matching the parallelism to the number of CPU cores can maximize CPU utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Parallelism configuration for a single Fragment&lt;/span&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;parallel_pipeline_task_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  III. Optimization Result: Surpassing ClickHouse by 39%
&lt;/h1&gt;

&lt;p&gt;After the above-mentioned adjustments to schema, queries, memory limits, and CPU parameters, we compared the performance of Apache Doris before and after optimization, as well as with other database systems:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eozc1ssk7shx3f5qj9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eozc1ssk7shx3f5qj9i.png" alt=" " width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Core improvement data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compared with pre-optimization, Apache Doris reduced the total query time by 74%;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared with ClickHouse, which was previously ranked first on the leaderboard, the performance was improved by 39%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  IV. Summary and Future Outlook
&lt;/h1&gt;

&lt;p&gt;Through schema structuring transformation, query statement optimization, cache configuration adjustment, and parallelism parameter tuning, Apache Doris has achieved a significant reduction in semi-structured data query latency. Even under the default configuration, it lagged behind ClickHouse by only a few seconds when querying 1 billion JSON records; after optimization, backed by its strong JSON processing capabilities, VARIANT data type support, and Generated Columns feature, it clearly surpassed comparable databases in this scenario.&lt;/p&gt;

&lt;p&gt;In the future, Apache Doris will continue to deepen its semi-structured data processing capabilities and achieve more powerful and efficient analytics through the following directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimize sparse VARIANT column storage to support more than 10,000 sub-columns;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduce memory usage of wide tables with 10,000-level columns;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support custom types and indexes for VARIANT sub-columns based on column name patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>database</category>
      <category>olap</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The data lakehouse evolution</title>
      <dc:creator>Apache Doris</dc:creator>
      <pubDate>Thu, 30 Oct 2025 18:44:20 +0000</pubDate>
      <link>https://forem.com/apachedoris/the-data-lakehouse-evolution-3a7e</link>
      <guid>https://forem.com/apachedoris/the-data-lakehouse-evolution-3a7e</guid>
      <description>&lt;p&gt;Data lakehouses are everywhere in today’s conversations about modern data architecture. But before we get swept up in the buzz, it’s worth stepping back to understand how the industry got here — and what we truly need from a lakehouse. Then I’ll introduce Apache Doris as a next-generation lakehouse solution and show how it delivers on those expectations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution towards lakehouse
&lt;/h2&gt;

&lt;h3&gt;
  
  
  01 Traditional data warehouse
&lt;/h3&gt;

&lt;p&gt;In the early days of enterprise digital transformation, the growing complexity of business data gave rise to traditional data warehouses. These systems were designed to empower business intelligence (BI) by consolidating structured data from diverse sources through ETL pipelines.&lt;/p&gt;

&lt;p&gt;With features like well-defined schemas, columnar storage, and tightly coupled compute-storage architecture, data warehouses enabled fast, reliable analysis and reporting using standard SQL. They also ensured data consistency through centralized management and strict transactional controls.&lt;/p&gt;

&lt;p&gt;However, as the digital landscape expanded—driven by the rise of the internet, IoT, and an explosion of unstructured data formats like logs, images, and documents—traditional warehouses struggled to scale efficiently or support flexible, exploratory analytics.&lt;/p&gt;

&lt;p&gt;This gap sparked the emergence of data lakes, offering a more cost-effective, schema-flexible approach better suited for big data and machine learning workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  02 Data lake
&lt;/h3&gt;

&lt;p&gt;Google’s pioneering contributions to big data—Google File System (GFS), MapReduce, and BigTable—ignited a global wave of innovation and laid the foundation for the Hadoop ecosystem. Hadoop revolutionized large-scale data processing by enabling cost-efficient computation on commodity hardware. Data lakes, built on this foundation, became instrumental for handling complex and massive datasets across a variety of use cases:&lt;/p&gt;

&lt;p&gt;Massive-scale data processing: By leveraging distributed storage and parallel computing, data lakes support high-throughput processing on standard computing nodes, eliminating the need for expensive proprietary hardware.&lt;/p&gt;

&lt;p&gt;Multi-modal data support &amp;amp; low-cost storage: Unlike traditional data warehouses, data lakes store raw, unstructured, or semi-structured data without rigid schema definitions. With a schema-on-read approach, structure is applied at query time, preserving the full value of diverse data types such as images, videos, logs, and more. Object storage further reduces costs while enabling massive scalability.&lt;/p&gt;

&lt;p&gt;Multi-modal computing: A single dataset in a data lake can be accessed by various engines for different tasks—SQL querying, machine learning, AI model training—delivering a highly flexible and unified analytics environment.&lt;/p&gt;

&lt;p&gt;The term "data lake" vividly captures the essence: vast pools of raw data stored in a unified layer, ready for various downstream processing. As the architecture evolved, a three-tier design emerged, paving the way for lakehouse-style analytics:&lt;/p&gt;

&lt;p&gt;Storage layer: It is backed by distributed file systems or cloud object storage services like HDFS, AWS S3, and Azure Blob. These platforms offer near-infinite scalability, high availability, and cost-efficiency. Data is retained in its original form to provide flexibility for various analytical use cases.&lt;/p&gt;

&lt;p&gt;Compute layer: Data stored in lakes can be accessed by multiple compute engines based on workload needs. Hive enables batch ETL via HiveQL, Spark handles batch, streaming, and ML tasks, while Presto excels at interactive, ad-hoc querying.&lt;/p&gt;

&lt;p&gt;Metadata layer: Services like Hive Metastore manage schema definitions, partitions, and data locations to provide a shared metadata foundation across engines. This layer is crucial for consistent interpretation, collaboration, and discoverability of data within the lake.&lt;/p&gt;

&lt;h3&gt;
  
  
  03 New challenges in modern data processing
&lt;/h3&gt;

&lt;p&gt;Over the years, both data warehouses and data lakes have evolved to serve vital roles in enterprise data architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xlwv48805ur5ef7j5l1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xlwv48805ur5ef7j5l1.png" alt=" " width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, as modern businesses demand real-time insights, greater flexibility, and open ecosystem integration, both architectures are facing growing limitations. Here’s a snapshot of the key challenges confronting each:&lt;/p&gt;

&lt;p&gt;For traditional data warehouses:&lt;/p&gt;

&lt;p&gt;Lack of real-time capabilities: In high-stakes scenarios like flash sales or real-time monitoring, businesses expect sub-second analytics. Traditional warehouses rely on static ETL processes and struggle to handle continuously changing data, making real-time decision-making difficult. For instance, tracking dynamic shipment routes in logistics is a challenge without true real-time data handling.&lt;/p&gt;

&lt;p&gt;Inflexibility with semi-structured and unstructured data: With the rise of data from social media, medical imaging, and IoT, rigid schema management in warehouses leads to inefficiencies in storage, indexing, and querying. Handling massive volumes of clinical text and images in healthcare research is one such area where traditional warehouses fall short.&lt;/p&gt;

&lt;p&gt;For data lakes:&lt;/p&gt;

&lt;p&gt;Performance bottlenecks: While great for batch processing, engines like Spark and Hive lag behind in interactive, low-latency analytics. Business users and analysts often face sluggish query response times, which can hinder timely decision-making in areas like real-time fraud detection or financial risk assessment.&lt;/p&gt;

&lt;p&gt;Lack of transactional integrity: To maximize flexibility and scalability, data lakes often sacrifice transactional guarantees. This tradeoff can lead to data inconsistency or even loss during complex data operations, posing risks for accuracy-critical applications.&lt;/p&gt;

&lt;p&gt;Data governance pitfalls: Open-write access in data lakes can lead to data quality issues and inconsistency. Without robust governance, the "data lake" can quickly become a "data swamp", making it difficult to extract reliable insights, especially when integrating diverse data sources with inconsistent formats or semantics.&lt;/p&gt;

&lt;p&gt;To meet the needs of both real-time analytics and flexible data processing, many organizations maintain both data warehouses and data lakes. However, this dual-system approach introduces its own challenges: data duplication, redundant pipelines, fragmented user experiences, and data silos. As a result, the industry is shifting toward a unified solution: merging the strengths of warehouses and lakes into a unified lakehouse architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  04 Data lakehouse
&lt;/h3&gt;

&lt;p&gt;The lakehouse architecture unifies storage, computation, and metadata into a single cohesive platform—reducing redundancy, lowering costs, and ensuring data freshness. Over time, this architecture has crystallized into a multi-layered paradigm:&lt;/p&gt;

&lt;p&gt;Storage layer: the solid foundation. Building on the distributed storage capabilities pioneered by data lakes, lakehouses typically rely on HDFS or cloud object stores (e.g., AWS S3, Azure Blob, GCS). Data is stored in raw or open columnar formats like Parquet and ORC, which offer high compression and efficient columnar access. This setup drastically reduces I/O overhead and provides a performant backbone for downstream data processing.&lt;/p&gt;

&lt;p&gt;Open data formats: interoperability. In addition to open file formats, like Parquet and ORC, which ensure interoperability across diverse compute engines, lakehouse systems also embrace open table formats like Apache Iceberg, Hudi, and Delta Lake, enabling features such as near real-time updates, ACID transactions, time travel, and snapshot isolation. These formats ensure seamless compatibility across SQL engines and unify the flexibility of data lakes with the transactional guarantees of traditional warehouses, so the same dataset can be available for both real-time processing and historical analytics.&lt;/p&gt;

&lt;p&gt;Computation layer: diverse engines, unified power. The computation layer combines various engines to leverage their respective strengths. Spark powers large-scale batch jobs and machine learning with its rich APIs. Flink handles real-time stream processing. Presto and Apache Doris excel at ultra-fast, interactive queries. By leveraging a shared storage layer and integrated resource management, these engines can collaboratively execute complex workflows, serving use cases from real-time dashboards to in-depth analytics.&lt;/p&gt;

&lt;p&gt;Metadata layer: the intelligent control plane. Evolving from tools like Hive Metastore to modern systems such as Unity Catalog and Apache Gravitino, metadata management in lakehouses provides a unified namespace and centralized data catalog across multi-cloud and multi-cluster environments. This allows users to easily discover, govern, and interact with data, regardless of where it resides or which engine is querying it. Enhanced features like access control, audit logging, and lineage tracking ensure enterprise-grade data governance.&lt;/p&gt;

&lt;p&gt;In essence, the lakehouse unites the best of both worlds—retaining the cost-efficiency and scalability of lakes while integrating the performance and reliability of warehouses. By standardizing data formats, centralizing metadata, and supporting hybrid processing (real-time + batch), it’s quickly becoming the gold standard for modern big data architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Doris: the lakehouse solution
&lt;/h2&gt;

&lt;p&gt;To respond to the trend and provide better analytics services, Apache Doris has extensively enhanced its data lakehousing capabilities since version 2.1.&lt;/p&gt;

&lt;p&gt;As enterprises push forward with building a lakehouse architecture, they often face complex challenges—from selecting new systems and integrating legacy platforms to managing data format conversions, adapting to new APIs, ensuring seamless system transitions, and coordinating teams across departments for permissions and compliance. To help companies navigate this complexity, Apache Doris introduces two core concepts: "Boundless Data" and "Boundless Lakehouse". These ideas aim to accelerate the lakehouse transformation while minimizing risks and costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  01 Boundless Data
&lt;/h3&gt;

&lt;p&gt;Boundless Data focuses on breaking down data silos. Apache Doris offers unified query acceleration and simplifies system architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  Easy data access
&lt;/h4&gt;

&lt;p&gt;Apache Doris supports a wide range of data systems and formats through its flexible extensible connector framework, enabling users to run cross-platform SQL analytics without overhauling their existing data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15vyds0shmw3div9f32a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15vyds0shmw3div9f32a.png" alt=" " width="800" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Doris offers powerful data source connectors that make it easy to connect to and efficiently extract data from a wide range of systems—whether it's Hive, Iceberg, Hudi, Paimon, or any database that supports the JDBC protocol. For lakehouse systems, Doris can seamlessly retrieve table schemas and distribution information from the underlying metadata services, enabling smart query planning. Thanks to its MPP (Massively Parallel Processing) architecture, Doris can scan and process distributed data at scale with high performance. Below is a list of the supported data sources along with their corresponding metadata and storage systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowrwo78yzlt2i1r9kzo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowrwo78yzlt2i1r9kzo5.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Doris features an extensible connector framework that makes it easy for developers to integrate custom enterprise data sources and achieve seamless data interoperability:&lt;/p&gt;

&lt;p&gt;Doris defines a standardized three-level structure (Catalog, Database, and Table), so developers can easily map to the appropriate layers of their target data systems. Doris also provides standard interfaces for metadata services and data access, allowing developers to integrate new data sources simply by implementing the defined APIs.&lt;/p&gt;

&lt;p&gt;Additionally, Doris is compatible with Trino connectors, enabling teams to directly deploy Trino plugin packages into a Doris cluster with minimal configuration. Doris already supports integrations with sources like Kudu, BigQuery, Delta Lake, Kafka, and Redis.&lt;/p&gt;

&lt;p&gt;Beyond integration, Doris also enables convenient cross-source data processing. It allows users to create multiple connectors at runtime, so they can perform federated queries across different data sources using standard SQL. For example, users can easily join a fact table from Hive with a dimension table from MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hive_table&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mysql_table&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
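
&lt;p&gt;As a sketch of how the catalogs used in such a federated query might be created (the endpoints, user names, and driver details below are illustrative placeholders, not values from this article; adapt them to your deployment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative only: connection properties depend on your environment
CREATE CATALOG hive PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

CREATE CATALOG mysql PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/db",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once created, both catalogs appear under the same three-level namespace, which is why the join above can reference &lt;code&gt;hive.db.hive_table&lt;/code&gt; and &lt;code&gt;mysql.db.mysql_table&lt;/code&gt; side by side.&lt;/p&gt;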



&lt;p&gt;Combined with Doris’ built-in job scheduling capabilities, users can automate such queries. For example, they can set it as an hourly job and write the query results into an Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;JOB&lt;/span&gt; &lt;span class="n"&gt;schedule_load&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;SCHEDULE&lt;/span&gt; &lt;span class="k"&gt;EVERY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ice_table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hive_table&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mysql_table&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  High-performance data processing
&lt;/h4&gt;

&lt;p&gt;High-performance data analytics is a fundamental driver behind the transition from data lakes to unified lakehouse architectures. Apache Doris has extensively optimized data processing and offers a rich set of query acceleration features:&lt;/p&gt;

&lt;p&gt;Execution engine: Doris is built on an MPP (Massively Parallel Processing) framework combined with a pipeline-based execution model. This design enables it to process massive datasets quickly in a multi-machine, multi-core distributed environment. With fully vectorized operators, Doris delivers leading performance on industry-standard benchmarks like TPC-DS.&lt;/p&gt;

&lt;p&gt;Query optimizer: Doris features an intelligent query optimizer that automatically handles complex SQL requests. It deeply optimizes operations such as multi-table joins, aggregations, sorting, and pagination. Specifically, it uses advanced cost models and relational algebra transformations to generate highly efficient execution plans, making SQL writing much simpler for users while boosting performance.&lt;/p&gt;

&lt;p&gt;Caching and I/O optimization: Accessing external data sources often involves high-latency, unstable network communication. Doris addresses this with a comprehensive caching system. It has optimized cache types, freshness, and strategies to maximize the use of memory and local high-speed disks. It also fine-tunes network I/O to deal with high throughput, low IOPS, and high latency, offering near-local performance even when accessing remote data sources.&lt;/p&gt;

&lt;p&gt;Materialized views and transparent acceleration: Doris supports flexible refresh strategies for materialized views, including full refresh and partition-based incremental refresh, to reduce maintenance costs and improve data freshness. In addition to manual refresh, it also supports scheduled and data-triggered refreshes for greater automation. Transparent acceleration means the query optimizer automatically routes queries to the best available materialized view. Featuring columnar storage, efficient compression, and intelligent indexing, Doris materialized views can greatly improve query efficiency and can even replace traditional caching layers.&lt;/p&gt;

&lt;p&gt;As a result, in benchmark tests on a 1TB TPC-DS dataset using the Iceberg table format, Apache Doris completed 99 queries in just one-third of the total time taken by Trino.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57wl8ykajwzqssz1t8uf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57wl8ykajwzqssz1t8uf.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In real-world user scenarios, Apache Doris delivers performance gains over Presto while using only half the computing resources. On average, Doris reduces query latency by 20%, and achieves a 50% reduction in 95th percentile latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s36t26nqkewqaby5mow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s36t26nqkewqaby5mow.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Seamless migration
&lt;/h4&gt;

&lt;p&gt;When integrating multiple data sources into a unified lakehouse, migrating SQL queries is often one of the biggest hurdles. Different SQL dialects across systems can create major compatibility challenges, leading to costly and time-consuming rewrites.&lt;/p&gt;

&lt;p&gt;To simplify this process, Apache Doris offers a SQL Converter. It allows users to directly query data using SQL dialects from other engines because it automatically translates queries into Doris SQL (standard SQL). Currently, Doris supports SQL dialects from Presto/Trino, Hive, PostgreSQL, and ClickHouse, achieving over 99% compatibility in some production environments.&lt;/p&gt;
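
&lt;p&gt;As an illustration, dialect translation is typically enabled per session via a session variable. The sketch below assumes a hypothetical &lt;code&gt;logs&lt;/code&gt; table and a Presto-style function; exact variable values may vary across Doris versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: enable the Presto/Trino dialect for the current session
SET sql_dialect = 'presto';

-- This Presto-style query is then translated to Doris SQL automatically
SELECT approx_distinct(user_id) FROM logs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;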

&lt;h3&gt;
  
  
  02 Boundless Lakehouse
&lt;/h3&gt;

&lt;p&gt;Beyond query migration, Doris also addresses the need for architectural streamlining.&lt;/p&gt;

&lt;h4&gt;
  
  
  Modern deployment architecture
&lt;/h4&gt;

&lt;p&gt;Since version 3.0, Doris has supported a cloud-native, compute-storage decoupled architecture. This modern deployment model maximizes resource efficiency by enabling independent scaling of compute and storage resources, giving enterprises flexible resource management for large-scale analytics workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv94y5wrjyhcs9mqjyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv94y5wrjyhcs9mqjyq.png" alt=" " width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As illustrated above, in the compute-storage decoupled mode of Apache Doris, the compute nodes no longer store the primary data. Instead, HDFS or object storage serves as a unified, shared storage layer. This architecture powers a reliable and cost-efficient lakehouse in the following ways:&lt;/p&gt;

&lt;p&gt;Cost-efficient storage: Storage and compute resources scale separately, allowing enterprises to expand storage without incurring additional compute costs. Meanwhile, organizations benefit from low-cost cloud object storage and higher availability. For the frequently accessed hot data, users can still cache it on local high-speed disks for better performance.&lt;/p&gt;

&lt;p&gt;Single source of truth: All data is centralized in the shared storage layer, making it accessible across multiple compute clusters. This ensures data consistency, eliminates duplication, and simplifies data management.&lt;/p&gt;

&lt;p&gt;Workload flexibility: Users can dynamically adjust their compute resources to match different workloads. For example, batch processing, real-time analytics, and machine learning use cases vary in resource requirements. With storage and compute decoupled, enterprises can fine-tune resource usage for maximum efficiency across diverse operational demands.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data storage and management
&lt;/h4&gt;

&lt;p&gt;Apache Doris offers a rich set of data storage and management capabilities, supporting both mainstream lakehouse table formats like Iceberg and Hudi as well as its own highly optimized storage format. Beyond simply accommodating industry standards, Doris brings even greater flexibility and performance to the table.&lt;/p&gt;

&lt;p&gt;Semi-structured data support: Apache Doris natively supports semi-structured data types such as JSON and VARIANT to provide a schemaless experience that eliminates the overhead of manual data transformation and cleansing. Users can directly ingest raw JSON data, which Doris stores in a high-performance columnar format for complex analytics.&lt;/p&gt;
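
&lt;p&gt;A minimal sketch of this schemaless workflow (table and field names are hypothetical, and the single-replica property is for demonstration only; syntax may vary slightly across Doris versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table: raw JSON lands in a VARIANT column
CREATE TABLE events (
    id BIGINT,
    payload VARIANT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

INSERT INTO events VALUES
    (1, '{"level": "error", "source": {"ip": "10.0.0.1"}}');

-- Sub-fields are queried directly; Doris stores them columnarly
SELECT payload['level'], payload['source']['ip'] FROM events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;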

&lt;p&gt;Data updates: Doris enables near real-time data updates and efficient change data capture (CDC). Also, the partial column update capability allows users to easily merge multiple data streams into wide tables inside Doris, simplifying data pipelines.&lt;/p&gt;
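
&lt;p&gt;A hedged sketch of merging one stream's columns into a wide table (the table and column names are illustrative; partial column updates assume a merge-on-write unique-key table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: update only a subset of columns of a wide unique-key table
SET enable_unique_key_partial_update = true;

INSERT INTO user_profile (user_id, last_login)
VALUES (42, '2025-10-30 18:00:00');
-- Columns not listed keep their previous values for that key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;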

&lt;p&gt;Data indexing: Doris offers various indexing options, such as prefix indexes, inverted indexes, skiplist indexes, and BloomFilter indexes, to speed up query performance and minimize both local and network I/O, especially in compute-storage decoupled environments.&lt;/p&gt;
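
&lt;p&gt;For instance, an inverted index can be declared at table creation time to accelerate keyword search. This is a sketch with hypothetical table and column names; index properties depend on your Doris version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical log table with an inverted index on the message column
CREATE TABLE app_logs (
    ts DATETIME,
    msg STRING,
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY (ts)
DISTRIBUTED BY HASH (ts) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Keyword search served by the inverted index instead of a full scan
SELECT ts, msg FROM app_logs WHERE msg MATCH_ANY 'timeout';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;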

&lt;p&gt;Stream and batch writing: Doris supports both bulk batch loading and high-frequency writes through micro-batching. It leverages MVCC (Multi-Version Concurrency Control) to seamlessly manage both real-time and historical data within the same dataset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Openness
&lt;/h4&gt;

&lt;p&gt;The openness of a data lakehouse is key to data integration and management efficiency. As discussed earlier, Apache Doris offers strong support for open table formats and file formats. Beyond that, Doris ensures the same level of openness for its own storage. It provides an open storage API based on the Arrow Flight SQL protocol, combining the high performance of Arrow Flight with the usability of JDBC/ODBC. Through this interface, users can easily access data stored in Doris using ADBC clients for Python, Java, Spark, and Flink.&lt;/p&gt;

&lt;p&gt;Instead of relying solely on open file formats, Doris' open storage API abstracts away the underlying file format complexities, allowing it to fully exploit its advanced storage features like indexing for faster data retrieval. Meanwhile, the compute engine does not need to adapt to storage-level changes. This means that any engine supporting the protocol can seamlessly benefit from Doris' capabilities without additional integration work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The end
&lt;/h2&gt;

&lt;p&gt;The data lakehouse represents the future of unified analytics, but its success depends on overcoming performance and complexity barriers. Apache Doris combines the scalability of a data lake with the speed and reliability of a warehouse. It stays true to the idea of an open data lakehouse with boundless data and architecture, empowering it with real-time querying, elastic scalability, and open-source flexibility.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>lakehouse</category>
      <category>dataengineering</category>
      <category>olap</category>
    </item>
  </channel>
</rss>
