<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BladePipe</title>
    <description>The latest articles on Forem by BladePipe (@bladepipe).</description>
    <link>https://forem.com/bladepipe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2123762%2F3d600285-5652-4be9-9cdb-25038e97be8e.jpg</url>
      <title>Forem: BladePipe</title>
      <link>https://forem.com/bladepipe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bladepipe"/>
    <language>en</language>
    <item>
      <title>DynamoDB vs MongoDB in 2025: Key Differences, Use Cases</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Tue, 26 Aug 2025 02:26:02 +0000</pubDate>
      <link>https://forem.com/bladepipe/dynamodb-vs-mongodb-in-2025-key-differences-use-cases-1ed0</link>
      <guid>https://forem.com/bladepipe/dynamodb-vs-mongodb-in-2025-key-differences-use-cases-1ed0</guid>
      <description>&lt;p&gt;Choosing the right database for a given application is always a problem for data engineers. Two popular NoSQL database options that frequently come up are AWS DynamoDB and MongoDB. Both offer scalability and flexibility but differ significantly in their architecture, features, and operational characteristics. This blog provides a comprehensive comparison to help you make an informed decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon DynamoDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt; is Amazon’s fully managed, serverless NoSQL service. It supports both key–value and document data, scales automatically, and delivers single-digit millisecond response times at any size. Features like global tables, on-demand scaling, and tight integration with AWS services make it a go-to for high-scale workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully managed service&lt;/strong&gt;: No servers to manage. DynamoDB automatically partitions data and scales throughput, eliminating operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency at scale&lt;/strong&gt;: It is designed for consistent millisecond latency for reads and writes, even under heavy load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep AWS integration&lt;/strong&gt;: It integrates natively with Lambda, API Gateway, Kinesis, CloudWatch, and IAM, simplifying the building of serverless architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global replication&lt;/strong&gt;: Global tables offer multi-region, active-active replication that automatically keeps copies of a DynamoDB table in sync across different AWS Regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
DynamoDB has &lt;a href="https://aws.amazon.com/dynamodb/pricing" rel="noopener noreferrer"&gt;two pricing modes&lt;/a&gt;: &lt;strong&gt;On‑Demand&lt;/strong&gt; (pay per request) and &lt;strong&gt;Provisioned&lt;/strong&gt; (buy read/write capacity units). On-demand is simple for unpredictable or spiky traffic, while provisioned is more cost-efficient for steady high throughput. &lt;/p&gt;

&lt;p&gt;For storage, the first 25 GB per month is free; beyond that, storage is billed at $0.25 per GB per month.&lt;/p&gt;

&lt;p&gt;Additional costs apply for backup, global tables, change data capture, etc. &lt;/p&gt;
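A back-of-the-envelope sketch of the storage pricing above (rates as quoted: first 25 GB per month free, then $0.25 per GB; request, backup, and transfer costs excluded):

```python
# Rough DynamoDB storage-cost estimate, using the rates quoted above.
FREE_TIER_GB = 25
PRICE_PER_GB = 0.25  # USD per GB per month beyond the free tier

def monthly_storage_cost(total_gb: float) -> float:
    """Estimated monthly storage bill in USD (storage only, no request costs)."""
    billable = max(0.0, total_gb - FREE_TIER_GB)
    return round(billable * PRICE_PER_GB, 2)

print(monthly_storage_cost(20))   # within the free tier -> 0.0
print(monthly_storage_cost(125))  # 100 billable GB -> 25.0
```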

&lt;h2&gt;
  
  
  What is MongoDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt; is a document database that stores data as BSON (binary JSON) documents. It’s flexible, schema-optional, and supports rich queries, secondary indexes, and powerful aggregation pipelines. You can self-host it or use MongoDB Atlas, the managed service that runs on AWS, Azure, or GCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Data Model&lt;/strong&gt;: Documents allow for embedding and nested structures, accommodating complex and evolving data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versatile ad-hoc queries&lt;/strong&gt;: It supports a wide range of query types, including field-based lookups, regular expressions, and geospatial queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich indexing &amp;amp; analytics&lt;/strong&gt;: It supports compound, text, geospatial, wildcard, and partial indexes. The aggregation pipeline enables complex transformations and analytics inside the database. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;ACID Transactions&lt;/strong&gt;: It supports multi-document ACID transactions (since v4.0), keeping data consistent even when operations fail partway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MongoDB Enterprise&lt;/strong&gt; (self-managed) incurs the infrastructure costs (servers, storage, networking) on your chosen platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB Atlas&lt;/strong&gt; (managed service) has &lt;a href="https://www.mongodb.com/pricing" rel="noopener noreferrer"&gt;a free tier, shared tiers, and dedicated clusters billed hourly&lt;/a&gt; (pay‑as‑you‑go). Pricing depends on cloud provider, instance family, vCPU/RAM, storage, backup retention, and data transfer.&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB vs MongoDB At a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DynamoDB&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed NoSQL database (AWS)&lt;/td&gt;
&lt;td&gt;Document NoSQL database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS only&lt;/td&gt;
&lt;td&gt;On-premise / MongoDB Atlas (managed on multiple cloud providers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key-value and document&lt;/td&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Document Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;400 KB per item&lt;/td&gt;
&lt;td&gt;16 MB per document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Primary key lookups, range queries, secondary indexes; limited aggregation&lt;/td&gt;
&lt;td&gt;Ad-hoc queries, joins, and an advanced aggregation pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic partitioning and scaling&lt;/td&gt;
&lt;td&gt;Manual or automated scaling via sharding and replica sets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eventually consistent by default, optional strong consistency; multi-item ACID transactions&lt;/td&gt;
&lt;td&gt;Tunable consistency levels; multi-document ACID transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single-digit millisecond response time&lt;/td&gt;
&lt;td&gt;Varies based on configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated with AWS IAM&lt;/td&gt;
&lt;td&gt;Role-Based Access Control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Region Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in via global tables (active-active)&lt;/td&gt;
&lt;td&gt;Atlas Global Clusters or custom sharding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep AWS integration&lt;/td&gt;
&lt;td&gt;Broad ecosystem, multi-cloud support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vendor Lock-in&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (AWS only)&lt;/td&gt;
&lt;td&gt;Lower (run on multiple clouds or on-prem)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Core Features Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Model &amp;amp; Query
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employs a key-value store with support for document structures. &lt;/li&gt;
&lt;li&gt;Optimized for fast lookups based on the primary key.&lt;/li&gt;
&lt;li&gt;Global and local secondary indexes for additional access paths.&lt;/li&gt;
&lt;li&gt;Limited aggregation support.&lt;/li&gt;
&lt;/ul&gt;
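The access pattern above can be sketched as the parameters for a boto3 `Table.query` call. The table name ("Orders"), key names, and values below are hypothetical:

```python
# Sketch of a DynamoDB key-based access pattern. This dict is the shape of
# the parameters you would pass to boto3's Table.query; the table and
# attribute names are hypothetical.
query_params = {
    "KeyConditionExpression": "customer_id = :cid AND order_date > :since",
    "ExpressionAttributeValues": {
        ":cid": "C-1001",
        ":since": "2025-01-01",
    },
    "Limit": 25,  # DynamoDB paginates; complex filtering happens client-side
}

# With boto3 this would run as (requires AWS credentials, not executed here):
#   table = boto3.resource("dynamodb").Table("Orders")
#   response = table.query(**query_params)
```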

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A document-oriented database where data is stored in BSON documents within collections.&lt;/li&gt;
&lt;li&gt;Expressive query language that supports many operators.&lt;/li&gt;
&lt;li&gt;Powerful aggregation pipelines allow for complex in-database transformations.&lt;/li&gt;
&lt;/ul&gt;
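As a sketch of the aggregation pipeline idea, here is the kind of stage list you would pass to pymongo's `collection.aggregate` on a live deployment; the "orders" collection and its fields are hypothetical:

```python
# Hypothetical MongoDB aggregation pipeline: top 10 customers by spend on
# shipped orders. Each stage transforms the documents flowing through it.
pipeline = [
    {"$match": {"status": "shipped"}},          # filter documents
    {"$group": {"_id": "$customer_id",          # group per customer
                "total": {"$sum": "$amount"},
                "orders": {"$sum": 1}}},
    {"$sort": {"total": -1}},                   # biggest spenders first
    {"$limit": 10},
]
# With pymongo: results = db.orders.aggregate(pipeline)
```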

&lt;h3&gt;
  
  
  Scalability and Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic horizontal scaling of both storage and throughput.&lt;/li&gt;
&lt;li&gt;Single-digit millisecond latency at any scale.&lt;/li&gt;
&lt;li&gt;Handles huge throughput with AWS-managed partitioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scales via sharding and replica sets.&lt;/li&gt;
&lt;li&gt;Some effort is required to set up and manage sharding.&lt;/li&gt;
&lt;li&gt;Performance depends on query patterns, indexing, and the chosen consistency level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consistency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventually consistent reads by default, or strongly consistent reads at the cost of higher latency.&lt;/li&gt;
&lt;li&gt;ACID transactions across one or more tables within a single AWS region.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offers various read concerns to control the consistency and isolation of read operations.&lt;/li&gt;
&lt;li&gt;ACID transactions for multi-document operations.&lt;/li&gt;
&lt;/ul&gt;
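The consistency knobs above can be sketched as the per-request options each database accepts (table, key, and collection names are hypothetical):

```python
# DynamoDB: opt into a strongly consistent read per request. This is the
# low-level GetItem parameter shape for boto3's client API; the table and
# key are hypothetical.
dynamodb_get = {
    "TableName": "Orders",
    "Key": {"order_id": {"S": "O-42"}},
    "ConsistentRead": True,  # default is False (eventually consistent)
}

# MongoDB: tune consistency via read/write concerns (pymongo options).
mongo_options = {
    "readConcern": {"level": "majority"},  # read data acknowledged by a majority
    "writeConcern": {"w": "majority"},     # wait for majority acknowledgment
}
```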

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic multi-AZ replication within a region.&lt;/li&gt;
&lt;li&gt;Automatic regional failover.&lt;/li&gt;
&lt;li&gt;Global tables for automated multi-region, active-active replication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replica sets provide high availability, with one primary node and multiple secondary nodes.&lt;/li&gt;
&lt;li&gt;Automatic failover via replica set elections; Atlas further automates operations in managed clusters.&lt;/li&gt;
&lt;li&gt;Atlas Global Clusters enable zone sharding to partition data and pin it to specific regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Choose Between Them?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner. Both are mature, battle-tested products. You may consider the following cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose DynamoDB if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You are all-in on AWS.&lt;/strong&gt; DynamoDB integrates seamlessly with other AWS services, making it a natural choice for serverless services built within the AWS ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your query patterns are simple and predictable.&lt;/strong&gt; The ideal use case for DynamoDB is fetching data using a known primary key. It's not designed for complex, ad-hoc queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You prefer minimal operational burden&lt;/strong&gt;. DynamoDB is fully managed by AWS, minimizing the operational overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world case: &lt;a href="https://www.youtube.com/watch?v=TCnmtSY2dFM" rel="noopener noreferrer"&gt;How Disney+ scales globally on Amazon DynamoDB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose MongoDB if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You require complex querying and data aggregation.&lt;/strong&gt; MongoDB's rich query language and aggregation pipelines are well suited to performing data searches and analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a flexible schema.&lt;/strong&gt; MongoDB's document model easily accommodates data structure changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want deployment flexibility.&lt;/strong&gt; MongoDB can be run on-premises, on any cloud provider (AWS, GCP, Azure), or as a fully managed service via MongoDB Atlas. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world case: &lt;a href="https://www.mongodb.com/solutions/customer-case-studies/novo-nordisk?tck=customer" rel="noopener noreferrer"&gt;How Novo Nordisk accelerates time to value with GenAI and MongoDB&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stream Data to DynamoDB and MongoDB Easily
&lt;/h2&gt;

&lt;p&gt;In real-world architectures, DynamoDB and MongoDB don’t exist in isolation. They’re part of a larger data ecosystem that needs to move information in and out in real time. &lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; fits perfectly. As a real-time, end-to-end data replication tool, it supports &lt;a href="https://www.bladepipe.com/connector" rel="noopener noreferrer"&gt;60+ out-of-the-box connectors&lt;/a&gt;. It captures data changes (CDC) from multiple sources and continuously syncs them into DynamoDB or MongoDB with sub-second latency. This ensures both databases always have fresh, consistent data without manual ETL jobs or complex pipelines. Both &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;on-prem and cloud deployments&lt;/a&gt; are supported. &lt;/p&gt;

&lt;p&gt;With BladePipe, teams only need to focus on building applications, not moving data.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>dynamodb</category>
      <category>aws</category>
      <category>database</category>
    </item>
    <item>
      <title>10 Best LangChain Alternatives You Must Know in 2025</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 25 Jul 2025 05:33:35 +0000</pubDate>
      <link>https://forem.com/bladepipe/10-best-langchain-alternatives-you-must-know-in-2025-2ce5</link>
      <guid>https://forem.com/bladepipe/10-best-langchain-alternatives-you-must-know-in-2025-2ce5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; has become a go-to framework for building LLM-powered applications, including retrieval-augmented generation (RAG) and autonomous agents. But it’s not the only option out there, and depending on your needs, it might not even be the best. &lt;/p&gt;

&lt;p&gt;If you’re hitting limits with LangChain, or just want to explore what else is out there, this post breaks down 10 top alternatives that give you more flexibility, performance, or control. Whether you need better data pipelines, simpler orchestration, or enterprise-ready agents, there’s likely a tool better suited to your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LangChain?
&lt;/h2&gt;

&lt;p&gt;LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs). At its core, LangChain provides a modular and composable toolkit for "chaining" different components together. It allows developers to focus on complex workflows rather than raw prompts and API calls.&lt;/p&gt;

&lt;p&gt;The framework is built around a few key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chains&lt;/strong&gt;: Sequences of calls that form a complete application workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: LLM-powered dynamic chains, determining which tools to use and in what order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools &amp;amp; Function Calling&lt;/strong&gt;: External systems that agents interact with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: Allows applications to remember past conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrations&lt;/strong&gt;: Plug-and-play support for LLMs, vector databases, document loaders, etc.&lt;/li&gt;
&lt;/ul&gt;
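The "chain" idea can be illustrated without the framework itself: each step is a callable, and a chain is just their composition. A minimal plain-Python sketch, not LangChain's actual API:

```python
# Minimal illustration of the "chain" concept (NOT LangChain's real API):
# each step transforms the running state; a chain composes the steps.
def chain(*steps):
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run

# Hypothetical steps: a prompt template, a stand-in for an LLM call, a parser.
fill_prompt = lambda topic: f"Summarize: {topic}"
fake_llm = lambda prompt: f"[summary of '{prompt}']"  # placeholder, no real model
parse = lambda text: text.strip("[]")

summarize = chain(fill_prompt, fake_llm, parse)
print(summarize("NoSQL databases"))  # -> summary of 'Summarize: NoSQL databases'
```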

&lt;h2&gt;
  
  
  LangChain Use Cases
&lt;/h2&gt;

&lt;p&gt;LangChain's versatility has made it a popular choice for a wide range of AI applications. Some of the most common use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;: With RAG, user queries are enhanced with information retrieved from external sources like vector databases, file systems, or knowledge bases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt;: Use LangChain to design complex workflows where LLMs interact with external tools and systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Chatbots&lt;/strong&gt;: LangChain supports multi-turn conversations and memory management, making it suitable for applications that require context-aware interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Analysis and Summarization&lt;/strong&gt;: LangChain is often used for applications that process, summarize, and analyze large volumes of text—across PDFs, email threads, research papers, or internal reports.&lt;/li&gt;
&lt;/ul&gt;
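The retrieval step behind RAG can be sketched with toy vectors standing in for a real embedding model and vector database:

```python
# Toy RAG retrieval: rank stored chunks by cosine similarity to the query
# vector, then feed the best matches into the prompt. The 3-d "embeddings"
# and document names are fabricated for illustration.
import math

docs = {
    "returns policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "warranty terms": [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:k]

context = retrieve([1.0, 0.0, 0.0])  # query vector close to "returns policy"
prompt = f"Answer using: {context}. Question: ..."
```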

&lt;h2&gt;
  
  
  Why Consider LangChain Alternatives?
&lt;/h2&gt;

&lt;p&gt;While LangChain is a powerful and widely-adopted framework, it's not without its drawbacks. Here are some common reasons developers and teams look elsewhere:&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity
&lt;/h3&gt;

&lt;p&gt;LangChain’s abstractions are powerful, but they can also be &lt;strong&gt;heavyweight&lt;/strong&gt;. For simple pipelines, it might feel like using a full orchestration engine to run a shell script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Bottlenecks
&lt;/h3&gt;

&lt;p&gt;The layered nature of LangChain can sometimes introduce performance overhead. For applications that require &lt;strong&gt;low latency&lt;/strong&gt; and &lt;strong&gt;high throughput&lt;/strong&gt;, this can be a significant issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difficult Debugging
&lt;/h3&gt;

&lt;p&gt;LangChain can feel overly complex, especially for newcomers. The framework's abstraction layers, while powerful, can sometimes make it difficult to understand what's happening under the hood. &lt;strong&gt;Debugging can be particularly challenging when things go wrong in a long chain.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rapidly Evolving Ecosystem
&lt;/h3&gt;

&lt;p&gt;The AI landscape is changing constantly. New frameworks are being developed with novel approaches, more intuitive interfaces, and better performance for specific tasks. Staying open to these alternatives is crucial for building the best possible applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 LangChain Alternatives
&lt;/h2&gt;

&lt;p&gt;Let’s explore ten powerful alternatives to LangChain, each with unique strengths across use cases like RAG, agents, automation, and orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjn7bpvmmfon71c9yw7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjn7bpvmmfon71c9yw7o.png" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; is a data framework designed specifically to connect your private data with LLMs. While LangChain is about "chaining" different tools, LlamaIndex focuses on the "smart storage" and retrieval part of the equation, making it a powerful tool for RAG applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible document loaders and index types (list, tree, vector, keyword)&lt;/li&gt;
&lt;li&gt;Powerful query engines and retrievers&lt;/li&gt;
&lt;li&gt;Tool calling and agent integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Developers building LLM applications on top of private documents with fine-tuned control over retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwdjo3o9epapizlgi1wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwdjo3o9epapizlgi1wy.png" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a real-time data integration tool. Its RagApi function automates the process of building RAG applications. Through two end-to-end data pipelines in BladePipe, you can deliver data to vector databases in real time and always keep the knowledge fresh. It supports both cloud and on-premise deployment, ideal for teams of all sizes to get the right AI application solution.&lt;/p&gt;

&lt;p&gt;Compared to traditional RAG setups, which often involve lots of manual work, BladePipe RagApi offers several unique benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: One to import documents, and one to create the API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: No need to write any code, just configure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform compatibility&lt;/strong&gt;: Support DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Integrate it directly with existing Chat apps or tools with no extra setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Individuals and teams needing production-grade data pipelines for AI/RAG with minimal operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Haystack
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fheystack-54b151e1e8b7b784fc2ef6c4c5b44d62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fheystack-54b151e1e8b7b784fc2ef6c4c5b44d62.png" width="800" height="368"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://haystack.deepset.ai/" rel="noopener noreferrer"&gt;Haystack&lt;/a&gt; is an open-source framework for building search systems, question-answering applications, and conversational AI. It offers a modular, pipeline-based architecture that lets developers connect components like retrievers, readers, generators, and rankers with ease. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modular components for indexing, retrieval and generation&lt;/li&gt;
&lt;li&gt;70+ integrations with LLMs, vector databases, and transformer models&lt;/li&gt;
&lt;li&gt;REST API support, Dockerized deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Building flexible, search-focused AI applications with full control over natural language processing (NLP) pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Kernel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fsementic-af25b37332ab3edcf0927c5f40860d82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fsementic-af25b37332ab3edcf0927c5f40860d82.png" width="800" height="589"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; is an open-source SDK from Microsoft. It provides a lightweight framework for integrating cutting-edge AI models into existing applications. It's particularly strong for developers working in C#, Python, or Java and aims to act as an efficient middleware for building AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;     &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native plugin model for AI skills&lt;/li&gt;
&lt;li&gt;Multi-language support (.NET, Python, Java)&lt;/li&gt;
&lt;li&gt;Integration with Microsoft ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enterprise teams looking to build secure, composable AI agents integrated with Microsoft ecosystems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Langroid
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5iar2rqgp48bse5jl7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5iar2rqgp48bse5jl7e.png" width="800" height="601"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://langroid.github.io/langroid/" rel="noopener noreferrer"&gt;Langroid&lt;/a&gt; is an open-source Python framework that introduces a multi-agent programming paradigm. Instead of focusing on simple chains, Langroid treats agents as first-class citizens, enabling the creation of complex applications where multiple agents collaborate to solve a task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;     &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python-native agents with natural language and structured task definition&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Supports various LLMs, vector databases, and function-calling tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Developers building collaborative agents with clear execution paths and modular logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Griptape
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fgriptape-5cbc2b0b73889e8cae09f4ab1f7f9ed1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fgriptape-5cbc2b0b73889e8cae09f4ab1f7f9ed1.png" width="800" height="324"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.griptape.ai/" rel="noopener noreferrer"&gt;Griptape&lt;/a&gt; is a Python-based framework for building and running AI applications, specifically focused on creating reliable and production-ready RAG applications. It offers a structured approach to building LLM workflows, with strong control over data flow and governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secure AI agent building&lt;/li&gt;
&lt;li&gt;Cloud-native design with plugin support&lt;/li&gt;
&lt;li&gt;A structured way to define AI workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enterprise AI workflows requiring traceability and production readiness.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoChain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxli6hzpd5jbzkwvxp5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxli6hzpd5jbzkwvxp5y.png" width="800" height="529"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://autochain.forethought.ai/" rel="noopener noreferrer"&gt;AutoChain&lt;/a&gt; is a lightweight and simple framework for building LLM applications. It's designed to be a more straightforward alternative to LangChain, focusing on ease of use and quick prototyping. The goal is to provide a clean and intuitive way to create multi-step LLM workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;      &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight and extensible generative agent pipeline&lt;/li&gt;
&lt;li&gt;Simple memory tracking for conversation history and tool outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Builders who want to move fast without complex abstractions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Braintrust
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbraintrust-61f15fd92b29b80d3aa71dcc3447eade.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbraintrust-61f15fd92b29b80d3aa71dcc3447eade.png" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.braintrust.dev/" rel="noopener noreferrer"&gt;Braintrust&lt;/a&gt; is an open-source framework for building, testing, and deploying LLM workflows with a focus on reliability, observability, and performance. It stands out with built-in support for prompt versioning, output evaluation, and detailed logging, making it ideal for optimizing AI behavior over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools for continuous evaluation of LLM outputs&lt;/li&gt;
&lt;li&gt;Built-in monitoring, logging, and benchmarking&lt;/li&gt;
&lt;li&gt;Work with popular LLM providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Teams building production LLM apps with performance and traceability in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flowise AI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fflowise-03b30a4c6e6a43959a02782cb1a94ce3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fflowise-03b30a4c6e6a43959a02782cb1a94ce3.png" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://flowiseai.com/" rel="noopener noreferrer"&gt;Flowise AI&lt;/a&gt; is a low-code, visual tool for building and managing LLM applications. It's perfect for those who prefer a drag-and-drop interface over writing code. It's built on top of the LangChain ecosystem but provides a much more accessible and user-friendly experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drag-and-drop interface for LLM apps&lt;/li&gt;
&lt;li&gt;100+ integrations with LLMs, vector stores and more&lt;/li&gt;
&lt;li&gt;Local and cloud deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Non-technical users or rapid prototyping of LLM workflows visually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rivet
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Frivet-d637aad4e50a9c4f0ac46fddc35f3899.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Frivet-d637aad4e50a9c4f0ac46fddc35f3899.png" width="800" height="284"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://rivet.ironcladapp.com/" rel="noopener noreferrer"&gt;Rivet&lt;/a&gt; is a visual programming environment for building and prototyping LLM applications. It uses a graph-based interface to allow developers to visually design and test their AI workflows. Rivet's focus is on providing a powerful, intuitive, and highly-performant tool for building complex AI graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual interface for prompt iterations and experiments&lt;/li&gt;
&lt;li&gt;Built-in prompt editor and playground for fine-tuning prompts&lt;/li&gt;
&lt;li&gt;Real-time debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI teams optimizing prompts, chain design, or evaluation strategies collaboratively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with BladePipe
&lt;/h2&gt;

&lt;p&gt;LangChain has paved the way for building powerful LLM applications, offering developers a flexible framework to prototype agents, RAG pipelines, and chatbots. But as teams move from experimentation to production, LangChain’s framework can introduce complexity, performance issues, and operational overhead.&lt;/p&gt;

&lt;p&gt;If you're building RAG systems that depend on fresh and structured data, BladePipe is a strong contender. With built-in support for embedding and real-time sync, BladePipe turns your raw data into retrieval-ready intelligence. Skip the complexity. Try BladePipe and build AI systems that actually scale.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>rag</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>BladePipe vs. Fivetran-Features, Pricing and More (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 18 Jul 2025 06:02:05 +0000</pubDate>
      <link>https://forem.com/bladepipe/bladepipe-vs-fivetran-features-pricing-and-more-2025-f0k</link>
      <guid>https://forem.com/bladepipe/bladepipe-vs-fivetran-features-pricing-and-more-2025-f0k</guid>
      <description>&lt;p&gt;In today’s data-driven landscape, businesses rely heavily on efficient data integration platforms to consolidate and transform data from multiple sources. Two prominent players in this space are &lt;strong&gt;Fivetran&lt;/strong&gt; and &lt;strong&gt;BladePipe&lt;/strong&gt;, both offering solutions to automate and streamline data movement across cloud and on-premises environments. &lt;/p&gt;

&lt;p&gt;This blog provides a clear comparison of BladePipe and Fivetran as of 2025, covering their core features, pricing models, deployment options, and suitability for different business needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Intro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is BladePipe?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a data integration platform known for its extremely low latency and high performance, facilitating efficient migration and sync of data across both on-premises and cloud databases. Founded in 2019, it’s built for analytics, microservices and AI-focused use cases that emphasize real-time data.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time replication&lt;/strong&gt;, with a latency less than 10 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end pipeline&lt;/strong&gt; for great reliability and easy maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-stop management&lt;/strong&gt; of the whole lifecycle from schema evolution to monitoring and alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code RAG&lt;/strong&gt; building for simpler and smarter AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Fivetran?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.fivetran.com/" rel="noopener noreferrer"&gt;Fivetran&lt;/a&gt; is a global leader in automated data movement and is widely trusted by many companies. It offers a fully managed ELT (Extract-Load-Transform) service that automates data pipelines with prebuilt connectors, ensuring robust data sync and automatic adaptation to source schema changes. &lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed ELT pipelines&lt;/strong&gt;, automating the entire Extract-Load-Transform process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensive connectors&lt;/strong&gt; (700+ prebuilt connectors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong data transformation ability&lt;/strong&gt; with dbt integration and built-in models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic schema handling&lt;/strong&gt;, reducing human efforts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Fivetran&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mode&lt;/td&gt;
&lt;td&gt;Real-time CDC-first/ETL&lt;/td&gt;
&lt;td&gt;ELT/Batch CDC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Latency&lt;/td&gt;
&lt;td&gt;≤ 10 seconds&lt;/td&gt;
&lt;td&gt;≥ 1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Connectors&lt;/td&gt;
&lt;td&gt;40+ connectors built by BladePipe&lt;/td&gt;
&lt;td&gt;700+ connectors, 450+ are Lite (API) connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Data Fetch&lt;/td&gt;
&lt;td&gt;Pull and Push hybrid&lt;/td&gt;
&lt;td&gt;Pull-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transformation&lt;/td&gt;
&lt;td&gt;Built-in transformations and custom code&lt;/td&gt;
&lt;td&gt;Post-load transformation and dbt integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification &amp;amp; Correction&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Options&lt;/td&gt;
&lt;td&gt;Self-hosted/Cloud (BYOC)&lt;/td&gt;
&lt;td&gt;Self-hosted/Hybrid/SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR, HIPAA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Enterprise-level support&lt;/td&gt;
&lt;td&gt;Tiered support (Standard, Enterprise, Business Critical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLA&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Pipeline Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; adopts batch-based CDC, which means data is read at batch intervals. It offers a range of sync frequencies, from as low as 1 minute (for Enterprise/Business Critical plans) to 24 hours, so end-to-end latency is often around 10 minutes in practice. Batch reads also put extra pressure on the source systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; uses &lt;strong&gt;real-time Change Data Capture (CDC)&lt;/strong&gt; for data integration. That means it instantly grabs data changes from your source and delivers them to the destination within seconds. This approach is a big shift from traditional batch-based CDC methods. In BladePipe, real-time CDC works with nearly all of its 40+ connectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe clearly beats Fivetran on latency, making it ideal for use cases that require always-fresh data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; offers an extensive library of 700+ pre-built connectors, covering databases, APIs, files and more, satisfying diverse business needs. Around 450 of them are Lite connectors, built for specific use cases with limited endpoints. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers &lt;strong&gt;over 40 pre-built connectors&lt;/strong&gt;. It focuses on essential systems for real-time needs, like OLTPs, OLAPs, messaging tools, search engines, data warehouses/lakes, and vector databases. This makes it a great choice for real-time projects where getting fresh data quickly is a fundamental requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Fivetran excels with its broad range of connectors, while BladePipe focuses on data delivery for critical real-time infrastructure. Choose the right tool that works for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran's&lt;/strong&gt; reliability has been a point of concern. Its &lt;a href="https://status.fivetran.com/" rel="noopener noreferrer"&gt;status page&lt;/a&gt; regularly shows 15 or more incidents per month, including connector failures, 3rd-party service errors, and other service degradations. It even experienced an outage lasting more than 2 days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; is built with production-grade reliability at its core. It provides real-time dashboards for monitoring every step of data movement. Alert notifications can be triggered for latency, failures, or data loss. That makes it easy to maintain pipelines and solve problems, enhancing reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe shows a more reliable system performance than Fivetran, and its monitoring and alerting mechanism brings even stronger support for stable pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; offers documentation, a support portal, and email support on the Standard plan. However, some customers report long waits for a response. Enterprise and Business Critical plans enjoy a 1-hour support response, but the costs are much higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers a more &lt;strong&gt;white-glove support experience&lt;/strong&gt;. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs. Its technical team works closely with clients during onboarding and when fine-tuning data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, both Fivetran and BladePipe provide documentation and technical support, but BladePipe's hands-on model may suit teams that want closer guidance, while Fivetran's fastest response times are reserved for its higher-priced tiers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case Comparison
&lt;/h2&gt;

&lt;p&gt;Based on the features stated above, the performance of the two tools varies in different use cases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Fivetran&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data sync between relational databases&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data sync between online business databases (RDB, data warehouse, message, cache, search engine)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data lakehouse support&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS sources support&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud data sync&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Pricing Model Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is a crucial consideration when evaluating data integration tools, especially for startups and organizations with extensive data replication needs. Fivetran and BladePipe employ significantly different pricing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fivetran
&lt;/h3&gt;

&lt;p&gt;Fivetran has four plans to consider: &lt;strong&gt;Free&lt;/strong&gt;, &lt;strong&gt;Standard&lt;/strong&gt;, &lt;strong&gt;Enterprise&lt;/strong&gt; and &lt;strong&gt;Business Critical&lt;/strong&gt;. The Free plan covers low volumes (e.g., up to 500,000 MAR). The other three plans adopt MAR-based tiered pricing. See more details at the &lt;a href="https://www.fivetran.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Besides, Fivetran separately charges for data transformation based on the models users run in a month, making the costs even higher.&lt;/p&gt;

&lt;p&gt;As of March 2025, Fivetran has changed to &lt;strong&gt;connector-level pricing&lt;/strong&gt;: pricing and discounts are often applied per individual connector instead of across the entire account. This means that if you have many connectors, your total cost might increase even if your overall data volume hasn't changed. &lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;BladePipe offers two plans to choose from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $0.01 per million rows of full data and $10 per million rows of incremental data. You can easily evaluate the costs via the &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;price calculator&lt;/a&gt;. It is available at &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-3moxhopumtmdc" rel="noopener noreferrer"&gt;AWS Marketplace&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: The costs are based on the number of pipelines and the duration you need. Contact the sales team for specific costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Here's a quick comparison of costs between BladePipe (BYOC) and Fivetran (Standard).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Million Rows per Month&lt;/th&gt;
&lt;th&gt;BladePipe* (BYOC)&lt;/th&gt;
&lt;th&gt;Fivetran (Standard)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;$210&lt;/td&gt;
&lt;td&gt;$500+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 M&lt;/td&gt;
&lt;td&gt;$300&lt;/td&gt;
&lt;td&gt;$1350+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 M&lt;/td&gt;
&lt;td&gt;$1200&lt;/td&gt;
&lt;td&gt;$2900+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*: includes one AWS EC2 t2.xlarge instance ($200/month) for the BladePipe Worker.&lt;/p&gt;
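&lt;p&gt;The BYOC column above can be reproduced with a quick back-of-the-envelope calculation from the published rates ($10 per million rows of incremental data plus the $200/month Worker instance). The sketch below is illustrative only, not an official calculator:&lt;/p&gt;

```python
# Rough monthly estimate for BladePipe BYOC, assuming the published
# rate of $10 per million incremental rows plus a $200/month
# EC2 t2.xlarge for the BladePipe Worker. Illustrative only.
WORKER_EC2_USD = 200
USD_PER_MILLION_INCREMENTAL_ROWS = 10

def byoc_monthly_cost(million_rows_per_month):
    """Estimated BYOC cost in USD for a month of incremental sync."""
    return WORKER_EC2_USD + USD_PER_MILLION_INCREMENTAL_ROWS * million_rows_per_month

for volume in (1, 10, 100):
    print(volume, byoc_monthly_cost(volume))  # 210, 300, 1200 — matches the table
```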

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe is a better choice when it comes to costs, considering the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-effectiveness&lt;/strong&gt;: BladePipe is much cheaper than Fivetran when moving the same amount of data. Besides, BladePipe doesn't charge separately for data transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Predictability&lt;/strong&gt;: BladePipe's direct per-million-row pricing offers more immediate cost predictability, especially for large, consistent data volumes. Fivetran's MAR-based billing can be less predictable due to the nature of "active rows", the separate data transformation charge and the new connector-level pricing. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Choosing between Fivetran and BladePipe depends heavily on your organization's specific data integration needs and priorities. Fivetran provides extensive coverage of connectors and an automated ELT experience for analytics. BladePipe features real-time CDC, ideal for mission-critical data syncs. In terms of pricing, BladePipe is a cost-effective choice for start-ups and organizations with a tight budget.&lt;/p&gt;

&lt;p&gt;Evaluate your specific data sources, latency requirements, budget, internal team resources, and desired level of support to make the most suitable choice.&lt;/p&gt;

</description>
      <category>programming</category>
    </item>
    <item>
      <title>A Comprehensive Guide to Wide Table (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Thu, 10 Jul 2025 10:02:06 +0000</pubDate>
      <link>https://forem.com/bladepipe/a-comprehensive-guide-to-wide-table-2025-2l0j</link>
      <guid>https://forem.com/bladepipe/a-comprehensive-guide-to-wide-table-2025-2l0j</guid>
      <description>&lt;p&gt;In real-world business scenarios, even a basic report often requires joining 7 or 8 tables. This can severely impact query performance. Sometimes it takes hours for business teams to get a simple analysis done.&lt;/p&gt;

&lt;p&gt;This article dives into how wide table technology helps solve this pain point. We’ll also show you how to build wide tables with zero code, making real-time cross-table data integration easier than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge with Complex Queries
&lt;/h2&gt;

&lt;p&gt;As business systems grow more complex, so do their data models. In an e-commerce system, for instance, tables recording orders, products, and users are naturally interrelated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order table&lt;/strong&gt;: product ID (linked to &lt;strong&gt;Product table&lt;/strong&gt;), quantity, discount, total price, buyer ID (linked to &lt;strong&gt;User table&lt;/strong&gt;), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product table&lt;/strong&gt;: name, color, texture, inventory, seller (linked to &lt;strong&gt;User table&lt;/strong&gt;), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User table&lt;/strong&gt;: account info, phone numbers, emails, passwords, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relational databases are great at normalizing data and ensuring efficient storage and transaction performance. But when it comes to analytics, especially queries involving filtering, aggregation, and multi-table JOINs, the traditional schema becomes a performance bottleneck.&lt;/p&gt;

&lt;p&gt;Take a query like "Top 10 products by sales in the last month": the more JOINs involved, the more complex and slower the query. And the number of possible query plans grows rapidly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tables Joined&lt;/th&gt;
&lt;th&gt;Possible Query Plans&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;40320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3628800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
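&lt;p&gt;The counts in the table are simply the factorial of the number of tables joined: with n tables, the optimizer can in principle join them in any of n! orders. A quick check:&lt;/p&gt;

```python
import math

# With n tables, a join can be performed in any of n! orders,
# which is where the query-plan counts in the table come from.
def possible_join_orders(n_tables):
    return math.factorial(n_tables)

for n in (2, 4, 6, 8, 10):
    print(n, possible_join_orders(n))  # 2, 24, 720, 40320, 3628800
```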

&lt;p&gt;For CRM or ERP systems, joining 5+ tables is standard. Then, the real question becomes: &lt;strong&gt;How to find the best query plan efficiently?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To tackle this, two main strategies have emerged: &lt;strong&gt;Query Optimization&lt;/strong&gt; and &lt;strong&gt;Precomputation&lt;/strong&gt;, with &lt;strong&gt;wide tables&lt;/strong&gt; being a key form of the latter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Optimization vs Precomputation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;p&gt;One solution is to reduce the number of possible query plans to accelerate query speed, a technique known as pruning. Two common approaches have emerged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RBO (Rule-Based Optimizer)&lt;/strong&gt;: RBO doesn't consider the actual distribution of your data. Instead, it tweaks SQL query plans based on a set of predefined, static rules. Most databases have some common optimization rules built in, like predicate pushdown. Depending on their specific business needs and architectural design, different databases also have their own unique optimization rules. Take SAP HANA, for instance: it powers SAP ERP operations and is designed for in-memory processing with lots of joins. Because of this, its optimizer rules are noticeably different from other databases'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CBO (Cost-Based Optimizer)&lt;/strong&gt;: CBO evaluates I/O, CPU and other resource consumption, and picks the plan with the lowest cost. This type of optimization dynamically adjusts based on the specific data distribution and the features of your SQL query. Even two identical SQL queries might end up with completely different query plans if the parameter values are different. CBO typically relies on a sophisticated and complex statistics subsystem, including crucial information like the volume of data in each table and data distribution histograms based on primary keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most modern databases combine both RBO and CBO.&lt;/p&gt;
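&lt;p&gt;You can watch an optimizer make these choices yourself. The sketch below uses SQLite's EXPLAIN QUERY PLAN purely as an illustration (SQLite's planner differs from that of server databases); it prints the plan chosen for a simple two-table join:&lt;/p&gt;

```python
import sqlite3

# Inspect the plan an optimizer picks for a two-table join.
# SQLite is used only for illustration; most databases offer a similar EXPLAIN.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE INDEX idx_orders_user ON orders(user_id);
""")
plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT u.name, o.id FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()
for step in plan:
    print(step)  # typically a SCAN of one table and an indexed SEARCH of the other
```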

&lt;h3&gt;
  
  
  Precomputation
&lt;/h3&gt;

&lt;p&gt;Precomputation assumes &lt;strong&gt;the relationships between tables are stable&lt;/strong&gt;, so instead of joining on every query, it pre-joins data ahead of time into a wide table. When data changes, only the changes are delivered to the wide table. The idea has been around since the early days of &lt;strong&gt;materialized views&lt;/strong&gt; in relational databases. &lt;/p&gt;
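&lt;p&gt;As a minimal, self-contained illustration of the idea (using SQLite with hypothetical order and product tables), the JOIN is run once to materialize a wide table, and later analytics queries read that flat table directly:&lt;/p&gt;

```python
import sqlite3

# Precompute a wide table: run the JOIN once at build time, so that
# analytics queries later read a single flat table instead of joining.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders  (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
INSERT INTO product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO orders  VALUES (10, 1, 3), (11, 2, 5), (12, 9, 1);
-- LEFT JOIN semantics: every order is kept even when its product
-- lookup fails (order 12 points at a missing product).
CREATE TABLE order_wide AS
SELECT o.id AS order_id, o.qty, p.name AS product_name
FROM orders o LEFT JOIN product p ON o.product_id = p.id;
""")
rows = db.execute(
    "SELECT order_id, qty, product_name FROM order_wide ORDER BY order_id"
).fetchall()
print(rows)  # order 12 survives with product_name = None
```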

&lt;p&gt;Compared with live queries, precomputation massively reduces runtime computation. But it's not perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited JOIN semantics&lt;/strong&gt;: Hard to handle anything beyond LEFT JOIN efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy updates&lt;/strong&gt;: A single change on the “1” side of a 1-to-N relation can cause cascading updates, challenging service reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functionality trade-offs&lt;/strong&gt;: Precomputed tables lack the full flexibility of live queries (e.g. JOINs, filters, functions).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practice: Combine Both
&lt;/h3&gt;

&lt;p&gt;In the real world, a hybrid approach works best: use &lt;strong&gt;precomputation&lt;/strong&gt; to generate intermediate wide tables, and use &lt;strong&gt;live queries&lt;/strong&gt; on top of those to apply filters and aggregations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precomputation&lt;/strong&gt;: A popular approach is stream computing, with stream processing databases emerging in recent years. Materialized views in traditional relational databases or data warehouses also offer an excellent solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Live queries&lt;/strong&gt;: Real-time analytics databases deliver significant performance boosts in data filtering and aggregation, thanks to columnar and hybrid row-column data structures, new instruction sets like AVX-512, high-performance computing hardware such as FPGAs and GPUs, and software techniques like distributed computing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BladePipe's Wide Table Evolution
&lt;/h2&gt;

&lt;p&gt;BladePipe started with a high-code approach: users had to write scripts to fetch related table data and construct wide tables manually during data sync. It worked, but the effort required made it hard to scale.&lt;/p&gt;

&lt;p&gt;Now, BladePipe supports &lt;strong&gt;visual wide table building&lt;/strong&gt;, enabling zero-code configuration. Users can select a driving table and the lookup tables directly in the UI to define JOINs. The system handles both initial data migration and real-time updates.&lt;/p&gt;

&lt;p&gt;It currently supports visual wide table creation in the following pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL -&amp;gt; MySQL/StarRocks/Doris/SelectDB&lt;/li&gt;
&lt;li&gt;PostgreSQL/SQL Server/Oracle/MySQL -&amp;gt; MySQL&lt;/li&gt;
&lt;li&gt;PostgreSQL -&amp;gt; StarRocks/Doris/SelectDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More supported pipelines are coming soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Visual Wide Table Building Works in BladePipe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Definitions
&lt;/h3&gt;

&lt;p&gt;In BladePipe, a wide table consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driving Table&lt;/strong&gt;: The main table used as the data source. Only one driving table can be selected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup Tables&lt;/strong&gt;: Additional tables joined to the driving table. Multiple lookup tables are supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By default, the join behavior follows &lt;strong&gt;Left Join&lt;/strong&gt; semantics: all records from the driving table are preserved, regardless of whether corresponding records exist in lookup tables.&lt;/p&gt;

&lt;p&gt;BladePipe currently supports two types of join structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linear&lt;/strong&gt;: e.g., A.b_id = B.id AND B.c_id = C.id. Each table can only be selected once, and circular references are not allowed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star&lt;/strong&gt;: e.g., A.b_id = B.id AND A.c_id = C.id. Each lookup table connects directly to the driving table. Cycles are not allowed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In both cases, table A is the driving table, while B, C, etc. are lookup tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Change Rule
&lt;/h3&gt;

&lt;h4&gt;
  
  
  If the target is a relational DB (e.g. MySQL):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driving table INSERT&lt;/strong&gt;: Fields from lookup tables are automatically filled in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driving table UPDATE/DELETE&lt;/strong&gt;: Lookup fields are not updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table INSERT&lt;/strong&gt;: If downstream tables exist, the operation is converted to an UPDATE to refresh Lookup fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table UPDATE&lt;/strong&gt;: If downstream tables exist, no changes are applied to related fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table DELETE&lt;/strong&gt;: If downstream tables exist, the operation is converted to an UPDATE with all fields set to NULL.&lt;/li&gt;
&lt;/ul&gt;
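&lt;p&gt;The rules above boil down to a small dispatch table. The sketch below is a hypothetical summary of the rule mapping for a relational target, not BladePipe's actual implementation:&lt;/p&gt;

```python
# Hypothetical summary of the change-translation rules above for a
# relational target, keyed by (table role, source operation).
# This mirrors the listed rules; it is not BladePipe's code.
RULES = {
    ("driving", "INSERT"): "insert row, auto-fill lookup fields",
    ("driving", "UPDATE"): "apply update, leave lookup fields untouched",
    ("driving", "DELETE"): "apply delete, leave lookup fields untouched",
    ("lookup", "INSERT"): "convert to UPDATE refreshing lookup fields",
    ("lookup", "UPDATE"): "no change applied to related fields",
    ("lookup", "DELETE"): "convert to UPDATE setting lookup fields to NULL",
}

def wide_table_action(table_role, operation):
    """Action applied to the wide table for a given source change."""
    return RULES[(table_role, operation)]

print(wide_table_action("lookup", "DELETE"))
```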

&lt;h4&gt;
  
  
  If the target is an overwrite-style DB (e.g. StarRocks, Doris):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All operations (INSERT, UPDATE, DELETE) on the Driving table will auto-fill Lookup fields.&lt;/li&gt;
&lt;li&gt;All operations on Lookup tables are ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
  If you want to include lookup table updates when the target is an overwrite-style database, set up a two-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Source DB → relational DB wide table&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wide table → overwrite-style DB&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step-by-Step Guide
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to BladePipe. Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Tables&lt;/strong&gt; step, 

&lt;ol&gt;
&lt;li&gt;Choose the tables that will participate in the wide table.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and enter a name as the wide table name.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;In the &lt;strong&gt;Data Processing&lt;/strong&gt; step,   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the left panel, select the Driving Table and click &lt;strong&gt;Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;Wide Table&lt;/strong&gt; to define the join.

&lt;ul&gt;
&lt;li&gt;Specify Lookup Columns (multiple columns are supported).&lt;/li&gt;
&lt;li&gt;Select additional fields from the Lookup Table and define how they map to wide table columns. This helps avoid naming conflicts across different source tables.
&lt;/li&gt;
&lt;li&gt;If a Lookup Table joins to another table, &lt;strong&gt;make sure to include the relevant Lookup columns&lt;/strong&gt;. For example, in A.b_id = B.id AND B.c_id = C.id, when selecting fields from B, c_id must be included.
&lt;/li&gt;
&lt;li&gt;When multiple Driving or Lookup tables contain fields with the same name, always &lt;strong&gt;map them to different target column names to avoid collisions&lt;/strong&gt;.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-194c95d00ab307fc48cb86ccf890fd29.png" width="800" height="412"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Submit&lt;/strong&gt; to save the configuration.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e8e7901d2fdbde1faabffb8980fa5ac2.png" width="800" height="417"&gt;
&lt;/li&gt;
&lt;li&gt;Click Lookup Tables on the left panel to check whether field mappings are correct.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Continue with the DataJob creation process, and start the DataJob.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Wide tables are a powerful way to speed up analytics by precomputing complex JOINs. With BladePipe’s visual builder, even non-engineers can set up and maintain real-time wide tables across multiple data systems.&lt;/p&gt;

&lt;p&gt;Whether you're a data architect or a DBA, this tool helps streamline your analytics layer and power up your dashboards with near-instant queries.&lt;/p&gt;

</description>
      <category>widetable</category>
      <category>database</category>
      <category>mysql</category>
      <category>programming</category>
    </item>
    <item>
      <title>BladePipe vs. Airbyte : Features, Pricing and More (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 04 Jul 2025 06:26:26 +0000</pubDate>
      <link>https://forem.com/bladepipe/bladepipe-vs-airbyte-features-pricing-and-more-2025-3j13</link>
      <guid>https://forem.com/bladepipe/bladepipe-vs-airbyte-features-pricing-and-more-2025-3j13</guid>
      <description>&lt;p&gt;In today’s data-driven landscape, building reliable pipelines is a business imperative, and the right integration tool can make a difference.&lt;/p&gt;

&lt;p&gt;Two modern tools are &lt;strong&gt;BladePipe&lt;/strong&gt; and &lt;strong&gt;Airbyte&lt;/strong&gt;. BladePipe focuses on real-time end-to-end replication, while Airbyte offers a rich connector ecosystem for ELT pipelines. So, which one fits your use case?&lt;/p&gt;

&lt;p&gt;In this blog, we break down the core differences between BladePipe and Airbyte to help you make an informed choice. &lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is BladePipe?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a real-time end-to-end data replication tool. Founded in 2019, it’s built for high-throughput, low-latency environments, powering real-time analytics, AI applications, or microservices that require always-fresh data.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time replication&lt;/strong&gt;, with latency under 10 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end pipeline&lt;/strong&gt; for great reliability and easy maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-stop management&lt;/strong&gt; of the whole lifecycle from schema evolution to monitoring and alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code RAG&lt;/strong&gt; building for simpler and smarter AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Airbyte?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://airbyte.com/" rel="noopener noreferrer"&gt;Airbyte&lt;/a&gt; is founded in 2020. It is an open-source data integration platform that focuses on ELT pipelines. It offers a large library of pre-built and marketplace connectors for moving batch data from various sources to popular data warehouses and other destinations.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on &lt;strong&gt;batch-based ELT&lt;/strong&gt; pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensive connector&lt;/strong&gt; ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source&lt;/strong&gt; core with paid enterprise version.&lt;/li&gt;
&lt;li&gt;Support for &lt;strong&gt;custom connectors&lt;/strong&gt; with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Airbyte&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mode&lt;/td&gt;
&lt;td&gt;Real-time CDC-first/ETL&lt;/td&gt;
&lt;td&gt;ELT-first/(Batch) CDC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Mode&lt;/td&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Latency&lt;/td&gt;
&lt;td&gt;≤ 10 seconds&lt;/td&gt;
&lt;td&gt;≥ 1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Connectors&lt;/td&gt;
&lt;td&gt;40+ connectors built by BladePipe&lt;/td&gt;
&lt;td&gt;50+ maintained connectors, 500+ marketplace connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Data Fetch&lt;/td&gt;
&lt;td&gt;Pull and Push hybrid&lt;/td&gt;
&lt;td&gt;Pull-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transformation&lt;/td&gt;
&lt;td&gt;Built-in transformations and custom code&lt;/td&gt;
&lt;td&gt;dbt and SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification &amp;amp; Correction&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Options&lt;/td&gt;
&lt;td&gt;Cloud (BYOC)/Self-hosted&lt;/td&gt;
&lt;td&gt;Self-hosted (OSS)/Cloud (Managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR, HIPAA Conduit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Enterprise-level support&lt;/td&gt;
&lt;td&gt;Community (free) and Enterprise-level support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Pipeline Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; moves data through &lt;strong&gt;batch-based extraction and loading&lt;/strong&gt;. It supports Debezium-based CDC, which is applicable to &lt;a href="https://docs.airbyte.com/platform/understanding-airbyte/cdc#limitations" rel="noopener noreferrer"&gt;only a few sources&lt;/a&gt;, and only for tables with primary keys. In Airbyte CDC, changes are pulled and loaded in scheduled batches (e.g., every 5 minutes or 1 hour), which puts the &lt;strong&gt;latency at minutes or even hours&lt;/strong&gt; depending on the sync frequency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; is built around &lt;strong&gt;real-time Change Data Capture (CDC)&lt;/strong&gt;. Unlike batch-based CDC, BladePipe captures changes in the source instantly and delivers them to the destination with &lt;strong&gt;sub-second latency&lt;/strong&gt;. Real-time CDC is available for almost all of its 40+ connectors. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte usually has higher latency. BladePipe CDC is more suitable for real-time architectures where freshness, latency, and data integrity are essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; clearly leads in the breadth of supported sources and destinations. Airbyte now supports &lt;strong&gt;over 550 connectors&lt;/strong&gt;, most of which are &lt;strong&gt;API-based&lt;/strong&gt;. Airbyte allows custom connector building through its Connector Builder, greatly extending its connector reach. However, &lt;strong&gt;only around 50 of them are official Airbyte connectors&lt;/strong&gt; backed by an SLA; the rest are open-source connectors maintained by the community. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt;, on the other hand, focuses on depth over breadth. It now supports &lt;strong&gt;40+ connectors&lt;/strong&gt;, which are &lt;strong&gt;all self-built and actively maintained&lt;/strong&gt;. It targets critical real-time infrastructure: OLTPs, OLAPs, message middleware, search engines, data warehouses/lakes, vector databases, etc. This makes it a better fit for real-time applications, where data freshness and change tracking matter more than diversity of sources. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte stands out for its extensive coverage of connectors, while BladePipe focuses on real-time change delivery among multiple sources. Choose the suitable tool based on your specific need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Transformation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt;, as an ELT-first platform, uses &lt;strong&gt;a post-load transformation model&lt;/strong&gt;: data is loaded into the target first, and transformation is applied afterwards. It offers two output options: a serialized JSON object or a normalized version as tables. For advanced users, custom transformations can be done via SQL and through integration with dbt. But the transformation capabilities are limited because data is transformed only after being loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; finishes &lt;strong&gt;data transformation in real time before data loading&lt;/strong&gt;. Configure the transformation method when creating a pipeline, and all is done automatically. BladePipe supports &lt;a href="https://doc.bladepipe.com/blog/data_insights/etl_tranform" rel="noopener noreferrer"&gt;built-in data transformations&lt;/a&gt; in a visualized way, including data filtering, data masking, column pruning, mapping, etc. Complex transformations can be done via custom code. With BladePipe, data gets ready when it flows through the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte's data transformation capabilities are limited due to its ELT approach. BladePipe offers both built-in transformations and custom code to satisfy various needs, and the transformations happen in real time.&lt;/p&gt;
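&lt;p&gt;As an illustration of the transform-before-load idea, a minimal sketch (hypothetical names, not BladePipe's actual API) might prune and mask each change event while it is in flight:&lt;/p&gt;

```python
import hashlib

def mask_email(value: str) -> str:
    """Replace an email's local part with a short hash (data masking)."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{digest}@{domain}"

def transform(row: dict, keep: set, masked: set) -> dict:
    """Prune columns and mask sensitive fields before the row is loaded."""
    out = {k: v for k, v in row.items() if k in keep}   # column pruning
    for col in masked.intersection(out):
        out[col] = mask_email(out[col])                 # data masking
    return out

event = {"id": 7, "email": "alice@example.com", "ssn": "123-45-6789"}
ready = transform(event, keep={"id", "email"}, masked={"email"})
# 'ssn' is dropped and 'email' is masked before the row reaches the target
```

&lt;p&gt;Loading only the transformed row means sensitive fields never reach the target, which is the practical difference from transforming after load.&lt;/p&gt;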

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; provides &lt;strong&gt;free and paid technical support&lt;/strong&gt;. Open-source users can seek help in the community or solve issues themselves; it's free of charge but can be time-consuming for urgent production issues. Cloud customers can get help by chatting with Airbyte team members and contributors. Enterprise-level support is a separate paid tier with custom SLAs and access to training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers a more &lt;strong&gt;white-glove support experience&lt;/strong&gt;. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs, and its technical team is closely involved in onboarding and tuning pipelines. In addition, for all customers, alert notifications can be sent via email and webhook to ensure pipeline reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, both Airbyte and BladePipe provide documentation and technical support for better understanding and use. Just think about your needs and make the right choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Model Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is one of the key factors to consider when evaluating tools, especially for startups and organizations with large amounts of data to replicate. BladePipe and Airbyte differ greatly in their pricing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;BladePipe offers two plans to choose from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $0.01 per million rows of full data or $10 per million rows of incremental data. You can easily evaluate the costs via the &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;price calculator&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: The costs are based on the number of pipelines and duration you need. Talk to the sales team on specific costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Airbyte
&lt;/h3&gt;

&lt;p&gt;Airbyte has four plans to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open Source&lt;/strong&gt;: Free to use for self-hosted deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $2.50 per credit, starting at $10/month (4 credits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt;: Custom pricing for cloud deployment. Talk to the sales team on specific costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: Custom pricing for self-hosted deployment. Talk to the sales team on specific costs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Here's a quick comparison of costs between BladePipe BYOC and Airbyte Cloud.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Million Rows per Month&lt;/th&gt;
&lt;th&gt;BladePipe* (BYOC)&lt;/th&gt;
&lt;th&gt;Airbyte (Cloud)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;$210&lt;/td&gt;
&lt;td&gt;$450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 M&lt;/td&gt;
&lt;td&gt;$300&lt;/td&gt;
&lt;td&gt;$1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 M&lt;/td&gt;
&lt;td&gt;$1200&lt;/td&gt;
&lt;td&gt;$3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 M&lt;/td&gt;
&lt;td&gt;$10200&lt;/td&gt;
&lt;td&gt;$14000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*: Includes one AWS EC2 t2.xlarge instance for the worker, at $200/month.&lt;/p&gt;
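&lt;p&gt;The BladePipe column follows directly from the published rates: $10 per million rows of incremental data plus the ~$200/month EC2 worker. A quick sketch of that arithmetic (assuming incremental rows only):&lt;/p&gt;

```python
def byoc_monthly_cost(incremental_million_rows, rate=10.0, worker=200.0):
    """Per-row replication fee plus one EC2 worker instance per month."""
    return incremental_million_rows * rate + worker

for m in (1, 10, 100, 1000):
    print(m, byoc_monthly_cost(m))  # 210.0, 300.0, 1200.0, 10200.0
```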

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe is much cheaper than Airbyte, and the cost gap widens as more data is moved per month. If you have a tight budget or need to integrate billions of rows of data, BladePipe is a cost-effective option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The right tool is critical for any business, and the choice should depend on your use case. This article lists a number of considerations and key differences. To summarize, Airbyte excels at extensive connectors and an open ecosystem, while BladePipe is designed for real-time end-to-end data use cases. &lt;/p&gt;

&lt;p&gt;If your organization is building applications that rely on always-fresh data, such as AI assistants, real-time search, or event streaming, BladePipe is likely a better fit.&lt;/p&gt;

&lt;p&gt;If your organization needs to integrate data from various APIs or would like to have in-house staff maintain connectors, you may try Airbyte.&lt;/p&gt;

</description>
      <category>airbyte</category>
      <category>bladepipe</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Prevent Replication Loops in MySQL Bidirectional Sync?</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 27 Jun 2025 07:24:54 +0000</pubDate>
      <link>https://forem.com/bladepipe/how-to-prevent-replication-loops-in-mysql-bidirectional-sync-2kgp</link>
      <guid>https://forem.com/bladepipe/how-to-prevent-replication-loops-in-mysql-bidirectional-sync-2kgp</guid>
      <description>&lt;p&gt;Real-time MySQL-to-MySQL two-way data sync is essential for high availability, seamless disaster recovery and active-active data architectures. It helps keep data consistent and up-to-date across various systems, regardless of where changes occur. &lt;/p&gt;

&lt;p&gt;However, it's not easy to keep data updated and consistent in a two-way MySQL pipeline. Replication loops are one of the biggest challenges. In this post, we'll explain how to perform MySQL bidirectional data sync while preventing infinite replication loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Replication Loop?
&lt;/h2&gt;

&lt;p&gt;A replication loop is a critical issue in MySQL two-way sync setups. It occurs when the same change keeps getting replicated back and forth between the two databases endlessly: Database A sends an update to Database B; Database B treats it as a new change and sends it back to A, and so on, over and over again.&lt;/p&gt;

&lt;p&gt;This cycle can lead to several serious issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Duplication&lt;/strong&gt;: The same update may be applied multiple times, potentially causing duplicate rows, incorrect data, or integrity violations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased Latency and Load&lt;/strong&gt;: Continuous replication of the same changes consumes CPU, I/O, and network resources, degrading system performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult Troubleshooting&lt;/strong&gt;: Even minor update conflicts can escalate when each system repeatedly re-applies changes, making conflict resolution complex. Identifying the source of the loop and the specific transactions causing it can be extremely challenging.&lt;/li&gt;
&lt;/ul&gt;
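&lt;p&gt;The runaway behavior is easy to picture with a toy simulation (purely illustrative): without a marker on replicated changes, a single UPDATE keeps bouncing between the two databases; with a marker, it stops after the first delivery.&lt;/p&gt;

```python
def replication_hops(max_hops: int, tag_changes: bool) -> int:
    """Count how many times one UPDATE crosses the wire between A and B."""
    replicated = False  # has this change already been applied by the replicator?
    hops = 0
    for _ in range(max_hops):
        if tag_changes and replicated:
            break       # tagged change came from the peer: do not send it back
        replicated = True
        hops += 1       # deliver to the peer, which replicates it again
    return hops

print(replication_hops(1000, tag_changes=False))  # 1000: bounces until cut off
print(replication_hops(1000, tag_changes=True))   # 1: stops after one delivery
```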

&lt;h2&gt;
  
  
  How to Prevent Infinite Loops?
&lt;/h2&gt;

&lt;p&gt;A typical way to prevent replication loops in MySQL two-way sync is GTID (Global Transaction Identifier), which combines &lt;code&gt;server_uuid&lt;/code&gt; and a transaction ID into a conflict marker. However, this solution has its limitations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, a professional data replication tool, introduces a more streamlined approach by &lt;strong&gt;tagging binlog events&lt;/strong&gt; directly.&lt;/p&gt;

&lt;p&gt;In a typical DML binlog sequence—&lt;code&gt;QueryEvent (TxBegin)&lt;/code&gt;, &lt;code&gt;TableMapEvent&lt;/code&gt;, &lt;code&gt;WriteRowEvent (IUD)&lt;/code&gt;, and &lt;code&gt;QueryEvent (TxEnd)&lt;/code&gt;—tagging the &lt;code&gt;WriteRowEvent&lt;/code&gt; would be ideal for conflict handling. But doing so generally requires modifying the MySQL storage engine code, which is complex and invasive.&lt;/p&gt;

&lt;p&gt;Upon deep investigation, BladePipe discovered that MySQL's binlog includes a special event called &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, which logs the original SQL statement when the &lt;code&gt;binlog_rows_query_log_events&lt;/code&gt; parameter is enabled. This event can carry comments, which opens up a clean tagging mechanism.&lt;/p&gt;

&lt;p&gt;Leveraging this, BladePipe automatically adds a custom marker &lt;code&gt;/*ccw*/&lt;/code&gt; when writing data to the target MySQL database. This tag appears in the &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, making it easy to identify and filter out in a bidirectional sync. &lt;/p&gt;

&lt;p&gt;This mechanism has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No dependency on GTID&lt;/li&gt;
&lt;li&gt;Order-independent and parallelizable replication&lt;/li&gt;
&lt;li&gt;Reduced operations on the target database&lt;/li&gt;
&lt;li&gt;Broad compatibility with cloud-based MySQL services&lt;/li&gt;
&lt;li&gt;Support for database/table/column-level filtering, mapping, and custom data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this enhancement, the new binlog event sequence becomes:&lt;br&gt;
&lt;code&gt;QueryEvent (TxBegin)&lt;/code&gt;, &lt;code&gt;TableMapEvent&lt;/code&gt;, &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, &lt;code&gt;WriteRowEvent&lt;/code&gt;, and &lt;code&gt;QueryEvent (TxEnd)&lt;/code&gt;.&lt;/p&gt;
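&lt;p&gt;In effect, the loop guard is a filter on the replication stream: any transaction whose &lt;code&gt;RowsQueryLogEvent&lt;/code&gt; carries the &lt;code&gt;/*ccw*/&lt;/code&gt; marker was written by the replicator itself and is skipped. A simplified sketch in Python (the event structures here are illustrative, not BladePipe's internals):&lt;/p&gt;

```python
MARKER = "/*ccw*/"

def should_replicate(tx_events: list) -> bool:
    """Skip transactions whose RowsQueryLogEvent carries the replicator's tag."""
    for ev in tx_events:
        if ev.get("type") == "RowsQueryLogEvent" and MARKER in ev.get("query", ""):
            return False  # change originated from the peer pipeline: drop it
    return True           # genuine user change: replicate it

# A transaction written by an application user, and the same change as
# re-logged on the target after the replicator applied it with the tag.
user_tx = [
    {"type": "QueryEvent", "query": "BEGIN"},
    {"type": "RowsQueryLogEvent", "query": "UPDATE t SET v = 2"},
    {"type": "WriteRowsEvent"},
]
applied_tx = [
    {"type": "QueryEvent", "query": "BEGIN"},
    {"type": "RowsQueryLogEvent", "query": "/*ccw*/ UPDATE t SET v = 2"},
    {"type": "WriteRowsEvent"},
]
```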

&lt;h2&gt;
  
  
  How to Perform MySQL Two-Way Sync Using BladePipe?
&lt;/h2&gt;

&lt;p&gt;Next, we'll give a step-by-step guide on how to perform a MySQL two-way data sync. In the demonstration, we use RDS for MySQL instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in Install Worker (Docker) or Install Worker (Binary) in the BladePipe documentation to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add DataSource
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the RDS console. Go to the instance details page and click &lt;strong&gt;Parameters&lt;/strong&gt;, then enable &lt;strong&gt;binlog_rows_query_log_events&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;. It is suggested to modify the description of the DataSource to prevent mistaking the databases when you configure two-way DataJobs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-0451ebcab8311f3116a589a8e665d77b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-0451ebcab8311f3116a589a8e665d77b.png" width="800" height="473"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Create Forward DataJob
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Note: In bidirectional sync, the forward DataJob generally refers to the DataJob where the source database has data and the target database has no data, which involves the initialization of data at the target database.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources, and click &lt;strong&gt;Test Connection&lt;/strong&gt; to ensure the connections to both the source and target DataSources are successful.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-91b0bfdd683cf98f292b3a92dc60b4f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-91b0bfdd683cf98f292b3a92dc60b4f8.png" width="800" height="472"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;Properties&lt;/strong&gt; Page:

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type, together with the &lt;strong&gt;Full Data&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Synchronize DDL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn off &lt;strong&gt;Start Automatically&lt;/strong&gt; so that parameters can be set after the DataJob is created.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-91e370b5a809ac6b16b24089b3347118.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-91e370b5a809ac6b16b24089b3347118.png" width="800" height="474"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the tables and columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Details&lt;/strong&gt; &amp;gt; &lt;strong&gt;Functions&lt;/strong&gt; &amp;gt; &lt;strong&gt;Modify DataJob Params&lt;/strong&gt;.

&lt;ol&gt;
&lt;li&gt;Choose the &lt;strong&gt;Target&lt;/strong&gt; tab, and set &lt;strong&gt;deCycle&lt;/strong&gt; to true.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; and start the DataJob.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-5e485d8eae6d1bf75c1baab89279d9c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-5e485d8eae6d1bf75c1baab89279d9c6.png" width="800" height="159"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create Reverse DataJob
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources (&lt;strong&gt;the reverse of the forward DataJob's selection&lt;/strong&gt;), and click &lt;strong&gt;Test Connection&lt;/strong&gt; to ensure the connections to both the source and target DataSources are successful.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-bdf85a05662b93681b33b4c5bd1dfe23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-bdf85a05662b93681b33b4c5bd1dfe23.png" width="800" height="473"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;Properties&lt;/strong&gt; Page:

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt;, and DO NOT check &lt;strong&gt;Full Data&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Synchronize DDL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn off &lt;strong&gt;Start Automatically&lt;/strong&gt; so that parameters can be set after the DataJob is created.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-7f158b949ac19ce84d76fa89134bcec4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-7f158b949ac19ce84d76fa89134bcec4.png" width="800" height="472"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the tables and columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Details&lt;/strong&gt; &amp;gt; &lt;strong&gt;Functions&lt;/strong&gt; &amp;gt; &lt;strong&gt;Modify DataJob Params&lt;/strong&gt;.

&lt;ol&gt;
&lt;li&gt;Choose the &lt;strong&gt;Target&lt;/strong&gt; tab, and set &lt;strong&gt;deCycle&lt;/strong&gt; to true.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; and start the DataJob.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-45355ef3c2a9db46c197749cc742b686.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-45355ef3c2a9db46c197749cc742b686.png" width="800" height="164"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm that the forward and reverse DataJobs are running well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-da2247a153298726a7db12999dc50fc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-da2247a153298726a7db12999dc50fc1.png" width="800" height="184"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Check the Result
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run some DML statements in the source database. Changes appear in the forward DataJob's monitoring charts, but not in the reverse DataJob's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F9-400023a32155fc662448d66c43a24be3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F9-400023a32155fc662448d66c43a24be3.png" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-a220cf3f34a5525695dd21204ab71acc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-a220cf3f34a5525695dd21204ab71acc.png" width="800" height="435"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run some DML statements in the target database. Changes appear in the reverse DataJob's monitoring charts, but not in the forward DataJob's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-5f769104f2a3cc79e93a056588704de8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-5f769104f2a3cc79e93a056588704de8.png" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F12-8cd1926ea97943841e71067d6ff35581.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F12-8cd1926ea97943841e71067d6ff35581.png" width="800" height="449"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are the drawbacks of this solution?
&lt;/h3&gt;

&lt;p&gt;First, it requires enabling the MySQL global variable &lt;code&gt;binlog_rows_query_log_events&lt;/code&gt;, which is disabled by default. Compared to GTID, which is typically enabled, this is a relative disadvantage.&lt;/p&gt;

&lt;p&gt;Second, enabling this feature can cause the binlog to grow faster, potentially leading to increased disk usage and shorter binlog retention cycles.&lt;/p&gt;

&lt;p&gt;Third, for BladePipe, this approach increases memory usage because the SQL statement text must be stored, resulting in higher resource consumption.&lt;/p&gt;

&lt;p&gt;That said, considering the significant improvements in performance and stability, BladePipe believes the benefits outweigh the drawbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What other pipelines does this solution support?
&lt;/h3&gt;

&lt;p&gt;At present, BladePipe has not conducted in-depth research on whether other data sources support tagging within DML statements or row data. However, tagging-based mechanisms remain a promising direction worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we looked at how to prevent infinite replication loops in MySQL bidirectional sync, a key step toward an architecture with high availability, elasticity, and disaster recovery.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>database</category>
      <category>tutorial</category>
      <category>data</category>
    </item>
    <item>
      <title>Redis Sync at Scale: A Smarter Way to Handle Big Keys</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Tue, 24 Jun 2025 08:01:15 +0000</pubDate>
      <link>https://forem.com/bladepipe/redis-sync-at-scale-a-smarter-way-to-handle-big-keys-5e53</link>
      <guid>https://forem.com/bladepipe/redis-sync-at-scale-a-smarter-way-to-handle-big-keys-5e53</guid>
      <description>&lt;p&gt;In enterprise-grade data replication workflows, Redis is widely adopted thanks to its blazing speed and flexible data structures. But as data grows, so do the keys in Redis—literally. Over time, it’s common to see Redis keys ballooning with hundreds of thousands of elements in structures like Lists, Sets, or Hashes.&lt;/p&gt;

&lt;p&gt;These “big keys” are usually one of the roots of poor performance in a full data migration or sync, slowing down processes or even bringing them to a crashing halt.&lt;/p&gt;

&lt;p&gt;That’s why &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, a professional data replication platform, recently rolled out a fresh round of enhancements to its Redis support. This includes expanded command coverage, a data verification feature, and, more importantly, &lt;strong&gt;major improvements for big key sync&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s dig into how these improvements work and how they keep Redis migrations smooth and reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Big Key Sync
&lt;/h2&gt;

&lt;p&gt;In high-throughput, real-time applications, it’s common for a single Redis key to contain a massive number of elements. When it comes to syncing that data, a few serious issues can pop up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-Memory (OOM) Crashes:&lt;/strong&gt; Reading big keys all at once can cause the sync process to blow up memory usage, sometimes leading to OOM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol Size Limits:&lt;/strong&gt; Redis commands and payloads have strict limits (e.g., 512MB for a single command via the RESP protocol). Exceed those limits, and Redis will reject the operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target-Side Write Failures:&lt;/strong&gt; Even if the source syncs properly, the target Redis might fail to process oversized writes, leading to data sync interruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How BladePipe Tackles Big Key Syncs
&lt;/h2&gt;

&lt;p&gt;To address these issues, BladePipe introduces lazy loading and sharded sync mechanisms specifically tailored for big keys without sacrificing data integrity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lazy Loading
&lt;/h3&gt;

&lt;p&gt;Traditional data sync tools often attempt to load an entire key into memory in one go. BladePipe flips the script by using on-demand loading. Instead of stuffing the entire key into memory, BladePipe streams it shard-by-shard during the sync process.&lt;/p&gt;

&lt;p&gt;This dramatically reduces memory usage and minimizes the risk of OOM crashes.&lt;/p&gt;
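
&lt;p&gt;The idea can be sketched in a few lines (an illustrative stand-in, not BladePipe’s implementation): pull the key through a cursor-style iterator in fixed-size batches, so only one batch is resident in memory at a time. In practice the iterator would be a SCAN-family cursor (e.g., SSCAN or HSCAN); here a plain generator stands in for it:&lt;/p&gt;

```python
from itertools import islice

def batched(iterable, size):
    """Pull fixed-size batches from any iterator, keeping at most
    one batch in memory at a time (stand-in for a SCAN cursor)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Simulate lazily walking a 10,000-element key in 1024-element batches.
big_key = (f"member:{i}" for i in range(10_000))  # generator: nothing loaded yet
sizes = [len(b) for b in batched(big_key, 1024)]  # nine full batches plus a remainder
```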

&lt;h3&gt;
  
  
  Sharded Sync
&lt;/h3&gt;

&lt;p&gt;The heart of BladePipe’s big key optimization lies in breaking big keys into smaller shards. Each shard contains a configurable number of elements and is sent to the target Redis as a separate command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configurable parameter: &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Default value: 1024 elements per shard&lt;/li&gt;
&lt;li&gt;Supported types: List, Set, ZSet, Hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: If a Set contains 500,000 elements, BladePipe will divide it into 489 shards, each with up to 1024 elements, and send them as separate SADD commands.&lt;/p&gt;
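
&lt;p&gt;The shard math above can be sketched as follows (hypothetical helpers, not BladePipe code; the default batch size mirrors &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;):&lt;/p&gt;

```python
def plan_shards(total_elements, batch_size=1024):
    # Ceiling division: how many shards a key of this size needs.
    return (total_elements + batch_size - 1) // batch_size

def shard(elements, batch_size=1024):
    # Each yielded chunk would become one SADD (Set), RPUSH (List), etc.
    for i in range(0, len(elements), batch_size):
        yield elements[i:i + batch_size]

print(plan_shards(500_000))  # 489 shards of up to 1024 elements
```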

&lt;h3&gt;
  
  
  Shard-by-Shard Sync Process
&lt;/h3&gt;

&lt;p&gt;Here’s a breakdown of how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shard Planning:&lt;/strong&gt; BladePipe inspects the total number of elements in a big key and calculates how many shards are needed based on the parameter &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shard Construction &amp;amp; Dispatch:&lt;/strong&gt; Each shard is formatted into a Redis-compatible command and sent to the target sequentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order &amp;amp; Integrity Guarantees:&lt;/strong&gt; Shards are written in the correct order, preserving data consistency on the target Redis.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Results
&lt;/h2&gt;

&lt;p&gt;To benchmark the improvements, BladePipe ran sync tests with a mixed dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million regular keys (String, List, Hash, Set, ZSet)&lt;/li&gt;
&lt;li&gt;50,000 large keys (~30MB each; max ~35MB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what performance looked like:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbig_key-ed8661861f03b1aa6071b3633394695d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbig_key-ed8661861f03b1aa6071b3633394695d.png" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results show that even with big keys in the mix, BladePipe achieved a steady sync throughput of 4–5K RPS from Redis to Redis, which is enough to handle daily production workloads for most businesses without compromising accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Big keys don’t have to be big problems. With lazy loading and sharded sync, BladePipe provides a reliable and memory-safe way to handle full Redis migrations—even for your biggest keys.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>bigkey</category>
      <category>programming</category>
    </item>
    <item>
      <title>Real-Time Data Sync: 4 Questions We Get All the Time</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 20 Jun 2025 07:37:33 +0000</pubDate>
      <link>https://forem.com/bladepipe/real-time-data-sync-4-questions-we-get-all-the-time-16jf</link>
      <guid>https://forem.com/bladepipe/real-time-data-sync-4-questions-we-get-all-the-time-16jf</guid>
      <description>&lt;p&gt;We work closely with teams building real-time systems, migrating databases, or bridging heterogeneous data platforms. Along the way, we hear a lot of recurring questions. So we figured—why not write them down?&lt;/p&gt;

&lt;p&gt;This is Part 1 of a practical Q&amp;amp;A series on real-time data sync. In this post, I'd like to share thoughts on the following questions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How should I choose between official and third-party tools?&lt;/li&gt;
&lt;li&gt;Can my project rely on “real-time” sync latency?&lt;/li&gt;
&lt;li&gt;What does real-time data sync mean to my project?&lt;/li&gt;
&lt;li&gt;How do I keep pipeline stability and data integrity over time?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How should I choose between official and third-party tools?
&lt;/h2&gt;

&lt;p&gt;Mature database vendors typically provide their own tools for data migration or cold/hot backup, like Oracle GoldenGate or MySQL's built-in dump utilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official tools&lt;/strong&gt; often deliver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The best possible performance for the migration and sync of that database.&lt;/li&gt;
&lt;li&gt;Compatibility with obscure engine-specific features.&lt;/li&gt;
&lt;li&gt;Support for special cases that third-party tools often cannot (e.g., Oracle GoldenGate parsing Redo logs).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they also tend to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offer limited or no support for other databases.&lt;/li&gt;
&lt;li&gt;Be less flexible for niche or custom workflows.&lt;/li&gt;
&lt;li&gt;Lock you in, making it harder to move data out than in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Third-party tools&lt;/strong&gt; shine when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're syncing across platforms (e.g., MySQL to Kafka/Iceberg/Elasticsearch).&lt;/li&gt;
&lt;li&gt;You need advanced features like filtering and transformation.&lt;/li&gt;
&lt;li&gt;The official tool simply doesn't support your use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it’s homogeneous migration or backup, &lt;strong&gt;use the official tool&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If it’s heterogeneous sync or anything custom, &lt;strong&gt;go with a third-party tool&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Can my project rely on “real-time” sync latency?
&lt;/h2&gt;

&lt;p&gt;In short: any data sync process that doesn't guarantee distributed transaction consistency comes with some latency risk. Even distributed transactions come at a cost—usually via redundant replication and sacrificing write performance or availability.&lt;/p&gt;

&lt;p&gt;Latency typically falls into two categories: &lt;strong&gt;fault-induced latency&lt;/strong&gt; and &lt;strong&gt;business-induced latency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fault-induced Latency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues with the sync tool itself, such as memory limits or bugs.&lt;/li&gt;
&lt;li&gt;Source/target database failures—data can't be pulled or written properly.&lt;/li&gt;
&lt;li&gt;Constraint conflicts on the target side, leading to write errors.&lt;/li&gt;
&lt;li&gt;Incomplete schema on the target side causing insert failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business-induced Latency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bulk data imports or data corrections on the source side.&lt;/li&gt;
&lt;li&gt;Traffic spikes during business peaks exceeding the tool’s processing capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can reduce the chances of delays (via &lt;strong&gt;task tuning&lt;/strong&gt;, &lt;strong&gt;schema change rule setting&lt;/strong&gt;, and &lt;strong&gt;database resource planning&lt;/strong&gt;), but you’ll never fully eliminate them. So the real question becomes: &lt;/p&gt;

&lt;p&gt;Do you have a fallback plan (e.g. graceful degradation) when latency hits? &lt;/p&gt;

&lt;p&gt;That would significantly mitigate the risks brought by high latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does real-time data sync mean to my project?
&lt;/h2&gt;

&lt;p&gt;Two words: &lt;strong&gt;incremental&lt;/strong&gt; + &lt;strong&gt;real-time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Unlike traditional batch-based ETL, a good real-time sync tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures only what changes, saving massive bandwidth. &lt;/li&gt;
&lt;li&gt;Delivers changes within seconds, enabling use cases like fraud detection or live analytics.&lt;/li&gt;
&lt;li&gt;Preserves deletes and DDLs, whereas traditional ETL often relies on external metadata services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like this:&lt;br&gt;
You don’t want to re-copy 1 billion rows every night when only 100 changed. Real-time sync gives you the speed and precision needed to power fast, reliable data products.&lt;/p&gt;
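
&lt;p&gt;The saving is easy to put numbers on (purely illustrative figures, assuming a 200-byte average row):&lt;/p&gt;

```python
row_bytes = 200                        # assumed average row size
full_copy = 1_000_000_000 * row_bytes  # nightly re-copy of all 1 billion rows
incremental = 100 * row_bytes          # shipping only the 100 changed rows

ratio = full_copy // incremental       # how much less data incremental sync moves
```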

&lt;p&gt;And with modern architectures—where one DB handles transactions, another serves queries, and a third powers ML—real-time sync is the glue holding it all together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I keep pipeline stability and data integrity over time?
&lt;/h2&gt;

&lt;p&gt;Most stability issues come from three factors: &lt;strong&gt;schema changes&lt;/strong&gt;, &lt;strong&gt;traffic pattern shifts&lt;/strong&gt;, and &lt;strong&gt;network environment issues&lt;/strong&gt;. Mitigating or planning for these risks greatly improves stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incompatibilities between schema change methods (e.g., native DDL, online tools like pt-osc or gh-ost) and the sync tool’s capabilities.&lt;/li&gt;
&lt;li&gt;Uncoordinated changes to target schemas may cause errors or schema misalignment.&lt;/li&gt;
&lt;li&gt;Changes on the target side (e.g., schema changes or writes) may conflict with sync logic, causing inconsistency between the source and target schemas, or constraint conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traffic Shifts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business surges causing unexpected peak loads that outstrip the sync tool’s capacity, leading to memory exhaustion or lag.&lt;/li&gt;
&lt;li&gt;Ops activities like mass data corrections causing large data volumes and sync bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Network Environment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing database whitelisting for sync nodes, causing sync tasks to fail with connection issues.&lt;/li&gt;
&lt;li&gt;High latency in cross-region setups causing read/write problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can reduce these risks significantly via &lt;strong&gt;change control setting&lt;/strong&gt;, &lt;strong&gt;load testing during peak traffic&lt;/strong&gt;, and &lt;strong&gt;pre-launch resource validation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Data loss issues typically result from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mismatched parallelism strategy&lt;/strong&gt; causing write disorder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflicting writes&lt;/strong&gt; on the target side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excessive latency&lt;/strong&gt; not handled in time, causing source-side logs to be purged before sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to fight back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism strategy mismatch&lt;/strong&gt; often occurs due to cascading updates or primary key reuse. You may need to fall back to table-level sync granularity, then verify and correct the data to ensure consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target-side writes&lt;/strong&gt; should be prevented via access control and database usage standardization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excessive latency&lt;/strong&gt; must be caught via robust alerting. Also, extend log retention (ideally 24+ hours) on the source database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these measures in place, you can significantly enhance sync stability and data reliability—laying a solid foundation for data-driven business operations.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Intercontinental Data Sync - A Comparative Study for Performance Tuning</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Wed, 18 Jun 2025 06:23:18 +0000</pubDate>
      <link>https://forem.com/bladepipe/intercontinental-data-sync-a-comparative-study-for-performance-tuning-3egk</link>
      <guid>https://forem.com/bladepipe/intercontinental-data-sync-a-comparative-study-for-performance-tuning-3egk</guid>
      <description>&lt;p&gt;When it comes to moving data across vast distances, particularly between continents, businesses often face a range of challenges that can impact performance. At &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, we regularly help enterprises tackle these hurdles. The most common question we receive is: &lt;strong&gt;What’s the best way to deploy BladePipe for optimal performance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While we can offer general advice based on our experience, the reality is that these tasks come with many variables. This article explores the best practice for intercontinental data migration and sync, blending theory with hands-on insights from real-world experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Intercontinental Data Sync
&lt;/h2&gt;

&lt;p&gt;Intercontinental data migration is no easy feat. There are two primary challenges that stand in the way of fast and reliable data transfers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unavoidable network latency:&lt;/strong&gt; For instance, network latency between Singapore and the U.S. typically ranges from 150ms to 300ms, which is significantly higher compared to the sub-5ms latency of typical relational database INSERT/UPDATE operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex factors affecting network quality:&lt;/strong&gt; Factors such as packet loss and routing paths can degrade the performance of intercontinental data transfers. Unlike intranet communication, intercontinental transfers pass through multiple layers of switches and routers in data centers and backbone networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
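
&lt;p&gt;A back-of-the-envelope calculation shows why the round trip, not the database, dominates (assumed figures; this ignores bandwidth limits and server-side batch execution cost):&lt;/p&gt;

```python
rtt = 0.200           # assumed 200 ms Singapore-to-U.S. round trip
local_insert = 0.005  # sub-5 ms for the INSERT itself

# One row per round trip: the network is the bottleneck.
serial_rps = 1 / (rtt + local_insert)       # roughly 4.9 rows/sec

# 1,000 rows folded into one statement: the round trip is amortized.
batch = 1_000
batched_rps = batch / (rtt + local_insert)  # roughly 4,900 rows/sec
```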

&lt;p&gt;Beyond these, it’s critical to consider the load on both the source and target databases, network bandwidth, and the volume of data being transferred.&lt;/p&gt;

&lt;p&gt;When using BladePipe, understanding its data extraction and writing mechanisms is essential to determine the best deployment strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  BladePipe Migration &amp;amp; Sync Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Migration Techniques
&lt;/h3&gt;

&lt;p&gt;For relational databases, BladePipe uses &lt;strong&gt;JDBC-based data scanning&lt;/strong&gt;, with support for &lt;strong&gt;resumable migration&lt;/strong&gt; using techniques like pagination. Additionally, it supports &lt;strong&gt;parallel data migration&lt;/strong&gt;—both inter-table and intra-table parallelism (via multiple tasks with specific filters).&lt;/p&gt;

&lt;p&gt;On the target side, since all data is inserted via INSERT operations, BladePipe uses several batch writing techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batching&lt;/li&gt;
&lt;li&gt;Splitting and parallel writing&lt;/li&gt;
&lt;li&gt;Bulk inserts&lt;/li&gt;
&lt;li&gt;INSERT rewriting (e.g., converting multiple rows into &lt;code&gt;insert..values(),(),()&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
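
&lt;p&gt;The INSERT rewriting trick can be sketched like this (a hypothetical helper; real code should bind the parameters through the driver rather than interpolating values):&lt;/p&gt;

```python
def rewrite_insert(table, columns, rows):
    """Fold many single-row INSERTs into one multi-row statement,
    i.e. INSERT INTO t (a, b) VALUES (%s, %s), (%s, %s), ..."""
    cols = ", ".join(columns)
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([one_row] * len(rows))
    params = [v for row in rows for v in row]
    return f"INSERT INTO {table} ({cols}) VALUES {values}", params

sql, params = rewrite_insert("orders", ["id", "amount"], [(1, 9.5), (2, 3.0)])
# One statement, one round trip, two rows.
```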

&lt;h3&gt;
  
  
  Data Sync Techniques
&lt;/h3&gt;

&lt;p&gt;BladePipe supports different methods for capturing incremental changes depending on the source database. Here’s a quick look:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Database&lt;/th&gt;
&lt;th&gt;Incremental Capture Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;Binlog parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Logical WAL subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;LogMiner parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;SQL Server CDC table scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;Oplog scan / ChangeStream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;PSYNC command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SAP Hana&lt;/td&gt;
&lt;td&gt;Trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Message subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StarRocks&lt;/td&gt;
&lt;td&gt;Periodic incremental scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These methods largely rely on the source database to emit incremental changes, which can vary based on network conditions.&lt;/p&gt;

&lt;p&gt;On the target side, data sync differs from migration: &lt;strong&gt;more operations&lt;/strong&gt; (INSERT/UPDATE/DELETE) must be handled, and &lt;strong&gt;order consistency&lt;/strong&gt; must be preserved. BladePipe offers a variety of techniques to improve data sync performance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batching&lt;/td&gt;
&lt;td&gt;Reduce network overhead and help with merge performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning by unique key&lt;/td&gt;
&lt;td&gt;Ensure data order consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning by table&lt;/td&gt;
&lt;td&gt;Looser method when unique key changes occur&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-statement execution&lt;/td&gt;
&lt;td&gt;Reduce network latency by concatenating SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bulk load&lt;/td&gt;
&lt;td&gt;For data sources with full-image and upsert capabilities, INSERT/UPDATE operations are converted into INSERT for batch overwriting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed tasks&lt;/td&gt;
&lt;td&gt;Allow parallel writes of the same amount of data using multiple tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
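
&lt;p&gt;As a minimal sketch of “partitioning by unique key” (not BladePipe’s code): hash each change event’s key to pick a worker, so all changes to the same row land on the same worker in commit order, while different rows proceed in parallel:&lt;/p&gt;

```python
import zlib

def partition(events, n_workers):
    """events: (unique_key, operation) pairs in commit order.
    Same key maps to the same bucket, so per-row order is preserved
    even though buckets are applied in parallel."""
    buckets = [[] for _ in range(n_workers)]
    for key, op in events:
        buckets[zlib.crc32(str(key).encode()) % n_workers].append((key, op))
    return buckets

events = [(1, "INSERT"), (2, "INSERT"), (1, "UPDATE"), (1, "DELETE")]
buckets = partition(events, 4)  # all three events for key 1 share one bucket
```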

&lt;h2&gt;
  
  
  Exploring the Best Practice
&lt;/h2&gt;

&lt;p&gt;BladePipe’s design emphasizes performance optimizations on the target side, which are &lt;strong&gt;more controllable&lt;/strong&gt;. Typically, we recommend deploying BladePipe near the source database to mitigate the impact of network quality on data extraction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskaietfrcjtdw24mvx5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskaietfrcjtdw24mvx5o.png" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But does this theory hold up in practice? To test this, we conducted an intercontinental MySQL-to-MySQL migration and sync experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimental Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source MySQL: located in Singapore (4 cores, 8GB RAM)&lt;/li&gt;
&lt;li&gt;Target MySQL: located in Silicon Valley, USA (4 cores, 8GB RAM)&lt;/li&gt;
&lt;li&gt;BladePipe: deployed on VMs in both Singapore and Silicon Valley (8 cores, 16GB RAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test Plan:&lt;/strong&gt; We migrated and synchronized the same data twice to compare performance with BladePipe deployed in different locations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figxcqk5y0piarraffx4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figxcqk5y0piarraffx4z.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Generate 1.3 million rows of data in Singapore MySQL.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;BladePipe deployed in Singapore&lt;/strong&gt; to migrate data to &lt;strong&gt;the U.S.&lt;/strong&gt; and record performance. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-04a29444d2f8e2571cf3f2b2d026910f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-04a29444d2f8e2571cf3f2b2d026910f.png" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make data changes (INSERT/UPDATE) at &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record sync performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-c2d254d1abbe42cd1793ec7ed788ff54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-c2d254d1abbe42cd1793ec7ed788ff54.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop the DataJob and delete target data.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;BladePipe deployed in the U.S.&lt;/strong&gt; to migrate the data again from &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-ef5dcf6d68996284d38cf50bc9852e31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-ef5dcf6d68996284d38cf50bc9852e31.png" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make data changes at &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record sync performance again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-9d97316d3c0a7ca5bf46bccbdc15af39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-9d97316d3c0a7ca5bf46bccbdc15af39.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Results &amp;amp; Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment Location&lt;/th&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source (Singapore)&lt;/td&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;6.5k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target (Silicon Valley)&lt;/td&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;15k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source (Singapore)&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;8k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target (Silicon Valley)&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;32k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm35gjtb4wgux1s5uqhhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm35gjtb4wgux1s5uqhhh.png" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprisingly, deploying BladePipe at the target (Silicon Valley) significantly outperformed the source-side deployment.    &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Potential Reasons:&lt;/strong&gt;    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network policies and bandwidth differences between the two locations.&lt;/li&gt;
&lt;li&gt;Target-side batch writes are less affected by poor network conditions compared to binlog/logical scanning on the source side.&lt;/li&gt;
&lt;li&gt;Other unpredictable network variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;While the experiment offers valuable insights into intercontinental data migration and sync, real-world environments can differ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production databases may be under heavy load, impacting the ability to push incremental changes efficiently.&lt;/li&gt;
&lt;li&gt;Dedicated network lines may offer more consistent network quality.&lt;/li&gt;
&lt;li&gt;Gateway rules and security policies vary across data centers, affecting performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our recommendation:&lt;/strong&gt; During the POC phase, deploy BladePipe on both the source and target sides, compare performance, and &lt;strong&gt;choose the best deployment strategy based on real-world results&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Build a Local RAG Using Ollama, PostgreSQL and BladePipe</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 13 Jun 2025 07:04:34 +0000</pubDate>
      <link>https://forem.com/bladepipe/build-a-local-rag-using-ollama-postgresql-and-bladepipe-4emp</link>
      <guid>https://forem.com/bladepipe/build-a-local-rag-using-ollama-postgresql-and-bladepipe-4emp</guid>
      <description>&lt;p&gt;Retrieval-Augmented Generation (RAG) is becoming increasingly common in enterprise applications. Unlike lightweight Q&amp;amp;A systems designed for personal users, enterprise RAG solutions must be reliable, controllable, scalable, and most importantly—secure.&lt;/p&gt;

&lt;p&gt;Many companies are cautious about sending internal data to public cloud-based models or vector databases due to the risk of sensitive information leakage. For industries with strict compliance needs, this is often a dealbreaker.&lt;/p&gt;

&lt;p&gt;To address these challenges, BladePipe now supports building local RAG services with Ollama, enabling enterprises to run intelligent RAG services entirely within their own infrastructure. This article walks you through &lt;strong&gt;building a fully private, production-ready RAG application&lt;/strong&gt;—without writing any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Enterprise-Grade RAG Service?
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade RAG emphasizes end-to-end integration, data control, and tight alignment with business systems. The goal isn’t just smart Q&amp;amp;A. It brings automation and intelligence that genuinely boost business.&lt;/p&gt;

&lt;p&gt;Compared to hobby or research-focused RAG setups, enterprise systems have four key traits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully private stack&lt;/strong&gt;: All components must run locally or in a private cloud. No data leaves the enterprise boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diverse data sources&lt;/strong&gt;: Beyond plain text files, databases and other formats must be supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental data syncing&lt;/strong&gt;: Business data updates constantly. RAG indexes must stay in sync automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated tool calling (MCP-like capabilities)&lt;/strong&gt;: Retrieval and generation are only part of the story. Tools like SQL query, function calls, or workflow execution must be supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing BladePipe RagApi
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;’s &lt;strong&gt;RagApi&lt;/strong&gt; encapsulates both vector search and LLM-based Q&amp;amp;A capabilities and supports the MCP protocol. It’s designed to help every one quickly launch their own RAG services.&lt;/p&gt;

&lt;p&gt;Compared with the traditional way of building RAG services, RagApi's key advantages include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: Import documents + publish API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: Everything is configurable—no development needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform support&lt;/strong&gt;: Support DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Easily integrate into your existing chat UI or toolchain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Here’s how to build a fully private, secure RAG service using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local model reasoning and embedding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; for local vector storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BladePipe RagApi&lt;/strong&gt; for building and managing the RagApi service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall workflow is like:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_1-e92761b59b81ade1c796fa313c25873c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_1-e92761b59b81ade1c796fa313c25873c.png" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run Ollama Locally
&lt;/h3&gt;

&lt;p&gt;Ollama allows you to deploy LLMs on your local machine. It will be used for both &lt;strong&gt;embedding&lt;/strong&gt; and &lt;strong&gt;reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download Ollama from &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;https://ollama.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;After installation, run the following command to pull and run a suitable model for embedding and reasoning, such as &lt;code&gt;deepseek-r1&lt;/code&gt;.
Note: Large models may require significant hardware resources.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama run deepseek-r1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jdqiyaz7hqnkm0bm5ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jdqiyaz7hqnkm0bm5ae.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up PGVector
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install Docker (skip if already installed). Follow the steps below for your operating system:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: Refer to the official installation doc: &lt;a href="https://docs.docker.com/desktop/setup/install/mac-install/" rel="noopener noreferrer"&gt;Docker Desktop for Mac&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CentOS / RHEL&lt;/strong&gt;: Refer to the script below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## centos / rhel&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;yum-config-manager &lt;span class="nt"&gt;--add-repo&lt;/span&gt; https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker-ce-20.10.9-3.&lt;span class="k"&gt;*&lt;/span&gt; docker-ce-cli-20.10.9-3.&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker

&lt;span class="nb"&gt;sudo echo&lt;/span&gt; &lt;span class="s1"&gt;'{"exec-opts": ["native.cgroupdriver=systemd"]}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/docker/daemon.json

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ubuntu&lt;/strong&gt;: Refer to the script below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## ubuntu&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -

&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository &lt;span class="s2"&gt;"deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lsb_release &lt;span class="nt"&gt;-cs&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; stable"&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;docker-ce&lt;span class="o"&gt;=&lt;/span&gt;5:20.10.24~3-0~ubuntu-&lt;span class="k"&gt;*&lt;/span&gt; docker-ce-cli&lt;span class="o"&gt;=&lt;/span&gt;5:20.10.24~3-0~ubuntu-&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker

&lt;span class="nb"&gt;sudo echo&lt;/span&gt; &lt;span class="s1"&gt;'{"exec-opts": ["native.cgroupdriver=systemd"]}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/docker/daemon.json

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start PostgreSQL + pgvector Container Service:
Execute the following command to deploy the PostgreSQL environment in one go:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;' &amp;gt; init_pgvector.sh
#!/bin/bash

# create docker-compose.yml
cat &amp;lt;&amp;lt;YML &amp;gt; docker-compose.yml
version: "3"
services:
  db:
    container_name: pgvector-db
    hostname: 127.0.0.1
    image: pgvector/pgvector:pg16
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=api
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=123456
YML

# Start container service (run in background)
docker-compose up --build -d

# Wait for container to start, then enter database and enable vector extension
echo "Waiting for container to start..."
sleep 5

docker exec -it pgvector-db psql -U root -d api -c "CREATE EXTENSION IF NOT EXISTS vector;"
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Grant execution permissions and run the script&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x init_pgvector.sh
./init_pgvector.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After execution, local PostgreSQL will automatically enable the pgvector extension, ready to store document embeddings securely on-prem.&lt;/p&gt;
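&lt;p&gt;Under the hood, pgvector ranks stored embeddings by their distance to the query vector. The idea can be sketched in plain Python with cosine similarity (toy 3-dimensional vectors here for illustration; real embeddings from &lt;code&gt;deepseek-r1&lt;/code&gt; have 4096 dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|); pgvector exposes the
    # equivalent cosine-distance operator for SQL queries.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (hypothetical values for illustration).
docs = {
    "install.md": [0.9, 0.1, 0.0],
    "faq.md": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # install.md is the closest match
```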

&lt;h3&gt;
  
  
  Deploy BladePipe (Enterprise)
&lt;/h3&gt;

&lt;p&gt;Follow the &lt;a href="https://doc.bladepipe.com/productOP/onPremise/installation/install_all_in_one_binary" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt; to download &lt;a href="https://www.bladepipe.com/" rel="noopener noreferrer"&gt;BladePipe (Enterprise)&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Building
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add DataSources
&lt;/h3&gt;

&lt;p&gt;Log in to BladePipe. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Files (SshFile):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Select &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;SshFile&lt;/strong&gt;. You can set &lt;a href="https://doc.bladepipe.com/reference/file_schema_format" rel="noopener noreferrer"&gt;extra parameters&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the machine IP where the files are stored and SSH port (default 22).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account &amp;amp; Password&lt;/strong&gt;: Username and password of the machine.&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;fileSuffixArray&lt;/strong&gt;: Set to &lt;code&gt;.md&lt;/code&gt; to include markdown files.&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;dbsJson&lt;/strong&gt;: Copy the default value and modify the &lt;strong&gt;schema&lt;/strong&gt; value (the root path where the target files are located)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"cc_virtual_fs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemas"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/Users/zoe/cloudcanal-doc-v2/locales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:[]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_3-278a4734454e0782ce6deb93520e50f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_3-278a4734454e0782ce6deb93520e50f8.png" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database (PostgreSQL):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: localhost:5432&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account&lt;/strong&gt;: root&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: 123456&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_4-4fffae73e1004921c958c24ba2bb1c58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_4-4fffae73e1004921c958c24ba2bb1c58.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM (Ollama):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;Ollama&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: localhost:11434&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;llmEmbedding&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "deepseek-r1": {
    "dimension": 4096
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Parameter &lt;strong&gt;llmChat&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "deepseek-r1": {
    "temperature": 1,
    "topP": 0.9,
    "showReasoning": false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_5-8ad1236f1defe6684607ef21766b6c4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_5-8ad1236f1defe6684607ef21766b6c4d.png" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RagApi Service (BladePipe):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;RagApi&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Set host to localhost and port to 18089.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: Customize a string (e.g. &lt;code&gt;my-bp-rag-key&lt;/code&gt;), used for authentication when calling RagApi later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_6-27efeb77b49b69e10907c5004e0c4ea2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_6-27efeb77b49b69e10907c5004e0c4ea2.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 1: Vectorize the Docs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;source: SshFile&lt;/strong&gt;, &lt;strong&gt;target: PostgreSQL&lt;/strong&gt;, and test the connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_7-ede2770725478c9ad6799b370ff70db7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_7-ede2770725478c9ad6799b370ff70db7.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Full Data&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;/li&gt;
&lt;li&gt;On the &lt;strong&gt;Tables&lt;/strong&gt; page:

&lt;ol&gt;
&lt;li&gt;Select the markdown files you want to process.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and fill in the table name (e.g. &lt;code&gt;knowledge_base&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Set LLM&lt;/strong&gt; &amp;gt; &lt;strong&gt;Ollama&lt;/strong&gt;, and select the instance and the embedding model.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_8-94169e91021b34a8b9cb39e037633396.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_8-94169e91021b34a8b9cb39e037633396.png" width="800" height="431"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_9-7ef0d3bb898a0b706d4b4cb3bdd4d110.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_9-7ef0d3bb898a0b706d4b4cb3bdd4d110.png" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page, click &lt;strong&gt;Batch Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;LLM embedding&lt;/strong&gt;. Select the fields for embedding, and check &lt;strong&gt;Select All&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_10-b0118385b0de9416a53cd23406fa88d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_10-b0118385b0de9416a53cd23406fa88d9.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_11-b44e5ab4685faf0f9112f2e797fcd387.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_11-b44e5ab4685faf0f9112f2e797fcd387.png" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 2: Build RagApi Service
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;source: PostgreSQL&lt;/strong&gt; (with vectors stored), &lt;strong&gt;target: RagApi&lt;/strong&gt;, and test the connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_12-856913b70fb9cc15e46616f15081ecfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_12-856913b70fb9cc15e46616f15081ecfb.png" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;/li&gt;
&lt;li&gt;On the &lt;strong&gt;Tables&lt;/strong&gt; page, select the vector table(s). Then click &lt;strong&gt;Set LLM&lt;/strong&gt;, and choose &lt;strong&gt;Ollama&lt;/strong&gt; as the &lt;strong&gt;Embedding LLM&lt;/strong&gt; and &lt;strong&gt;Chat LLM&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_13-bcfc97bc40042944896c811a436f52ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_13-bcfc97bc40042944896c811a436f52ca.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt; to finish the setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_15-38b48999513373ead55dc4e89915a7ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_15-38b48999513373ead55dc4e89915a7ef.png" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Perform a simple test using the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;curl&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;18089&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type: application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization: Bearer my-cc-rag-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
          {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;},
          {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}
        ],
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false
      }&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
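&lt;p&gt;The same request can be issued from Python with only the standard library. A sketch using &lt;code&gt;urllib&lt;/code&gt; (it assumes the API key &lt;code&gt;my-bp-rag-key&lt;/code&gt; configured for the RagApi DataSource above):&lt;/p&gt;

```python
import json
import urllib.request

API_KEY = "my-bp-rag-key"  # the key set when adding the RagApi DataSource
URL = "http://localhost:18089/v1/chat/completions"

def build_chat_request(messages, stream=False):
    # Same payload and headers as the curl test above.
    body = json.dumps({"messages": messages, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )

req = build_chat_request([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# With the RagApi service running, send it and read the OpenAI-style reply:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.get_method())  # POST
```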



&lt;h2&gt;
  
  
  Test
&lt;/h2&gt;

&lt;p&gt;You can test the RagApi with &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, a visual tool that supports OpenAI-compatible APIs.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open CherryStudio, click the Settings icon in the bottom left corner.&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Model Provider&lt;/strong&gt;, search for &lt;strong&gt;OpenAI&lt;/strong&gt; and configure:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: your RagApi key configured in BladePipe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Host&lt;/strong&gt;: &lt;a href="http://localhost:18089" rel="noopener noreferrer"&gt;http://localhost:18089&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model ID&lt;/strong&gt;: BP_RAG&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgungmtlp7p22avv1qlof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgungmtlp7p22avv1qlof.png" width="800" height="228"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt6nyp56oggfnsyy01li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt6nyp56oggfnsyy01li.png" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Back on the chat page:

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Add Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Default Assistant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Right click &lt;strong&gt;Default Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Edit Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Model Settings&lt;/strong&gt;, and choose BP_RAG as the default model.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzn7u0kv4hdxpottjjq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzn7u0kv4hdxpottjjq4.png" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now try asking: &lt;code&gt;What privileges does BladePipe require for a source MySQL?&lt;/code&gt;. RagApi will search your vector database and generate a response using the chat model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lcrvxppg4380aqdbvxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lcrvxppg4380aqdbvxt.png" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade RAG services prioritize data privacy and control. By combining BladePipe with Ollama, you can easily achieve a fully private RAG service deployment, creating a truly reliable enterprise-grade RAG solution that does not depend on public networks.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>postgres</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build a RAG Chatbot with OpenAI and BladePipe - No Code Required</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 06 Jun 2025 07:20:34 +0000</pubDate>
      <link>https://forem.com/bladepipe/build-a-rag-chatbot-with-openai-and-bladepipe-no-code-required-4d0</link>
      <guid>https://forem.com/bladepipe/build-a-rag-chatbot-with-openai-and-bladepipe-no-code-required-4d0</guid>
      <description>&lt;p&gt;In &lt;a href="https://doc.bladepipe.com/blog/ai/rag_concept" rel="noopener noreferrer"&gt;a previous article&lt;/a&gt;, we explained key GenAI concepts like RAG, Function Calling, MCP, and AI Agents. Now the question is: how do we go from concepts to practice?  &lt;/p&gt;

&lt;p&gt;Currently, you can find plenty of RAG building tutorials online, but most of them are based on frameworks like LangChain, which still have a learning curve for beginners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, as a data integration platform, already supports the access to and processing of multiple data sources. This gives it a natural edge in setting up the semantic search foundations for a RAG system. Recently, BladePipe launched &lt;strong&gt;RagApi&lt;/strong&gt;, which wraps up &lt;strong&gt;vector search&lt;/strong&gt; and &lt;strong&gt;Q&amp;amp;A capabilities&lt;/strong&gt; into a &lt;strong&gt;plug-and-play API service&lt;/strong&gt;. With just two DataJobs in BladePipe, you can have your own RAG service—no coding required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BladePipe RagApi?
&lt;/h2&gt;

&lt;p&gt;Compared to traditional RAG setups, which often involve lots of manual work, BladePipe RagApi offers several unique benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: One to import documents, and one to create the API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: No need to write any code, just configure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform compatibility&lt;/strong&gt;: Supports DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Integrate it directly with existing Chat apps or tools with no extra setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo: A Q&amp;amp;A Service for BladePipe Docs
&lt;/h2&gt;

&lt;p&gt;We’ll use BladePipe’s own documentation as a knowledge base to create a RAG-based Q&amp;amp;A service.&lt;/p&gt;

&lt;p&gt;Here’s what we’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BladePipe&lt;/strong&gt; – to build and manage the RagApi service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; – as the vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt; – OpenAI text-embedding-3-large&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat model&lt;/strong&gt; – OpenAI GPT-4o&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the overall workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fragapi_workflow-856e95a62e86522ba49452797820a8f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fragapi_workflow-856e95a62e86522ba49452797820a8f7.png" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_docker" rel="noopener noreferrer"&gt;Install Worker (Docker)&lt;/a&gt; or &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_binary" rel="noopener noreferrer"&gt;Install Worker (Binary)&lt;/a&gt; to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare Your Resources
&lt;/h3&gt;

&lt;p&gt;Log in to the &lt;a href="https://openai.com/index/openai-api/" rel="noopener noreferrer"&gt;OpenAI API platform&lt;/a&gt; and create an API key.&lt;br&gt;&lt;br&gt;
Install a local PostgreSQL instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;#!/bin/bash

# create file docker-compose.yml
cat &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;EOF&lt;/span&gt; &lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; docker-compose.yml
version: "3"
services:
  db:
    container_name: pgvector-db
    hostname: 127.0.0.1
    image: pgvector/pgvector:pg16
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=api
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=123456
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
EOF

# Start docker-compose automatically
docker-compose up --build

# Access PostgreSQL
docker exec -it pgvector-db psql -U root -d api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a privileged user and log in.&lt;br&gt;&lt;br&gt;
Switch to the target schema where you need to create tables (like &lt;code&gt;public&lt;/code&gt;).&lt;br&gt;&lt;br&gt;
Run the following SQL to enable vector capability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS vector&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add DataSources
&lt;/h3&gt;

&lt;p&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Files:&lt;/strong&gt;   &lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;SshFile&lt;/strong&gt;. You can set &lt;a href="https://doc.bladepipe.com/reference/file_schema_format" rel="noopener noreferrer"&gt;extra parameters&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the IP address of the machine where the files are stored and the SSH port (default 22).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account &amp;amp; Password&lt;/strong&gt;: Username and password of the machine.&lt;/li&gt;
&lt;li&gt;Parameter &lt;strong&gt;fileSuffixArray&lt;/strong&gt;: Set to &lt;code&gt;.md&lt;/code&gt; to include Markdown files.&lt;/li&gt;
&lt;li&gt;Parameter &lt;strong&gt;dbsJson&lt;/strong&gt;: Copy the default value and modify the &lt;strong&gt;schema&lt;/strong&gt; value (the root path where the target files are located).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"cc_virtual_fs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemas"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/cloudcanal-doc-v2/locales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:[]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-cc70ae4e3b8c5051c9a3ac55643a5f46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-cc70ae4e3b8c5051c9a3ac55643a5f46.png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add the Vector Database:&lt;/strong&gt;   &lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;, then connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-1123c7b039b74ab9b22ffe4e37fd6429.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-1123c7b039b74ab9b22ffe4e37fd6429.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add an LLM:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Independent Cloud Platform&lt;/strong&gt; &amp;gt; &lt;strong&gt;Manually Fill&lt;/strong&gt; &amp;gt; &lt;strong&gt;OpenAI&lt;/strong&gt;, and fill in the API key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-30509920901044517f3f2cbdb0c71f69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-30509920901044517f3f2cbdb0c71f69.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add RagApi Service:&lt;/strong&gt;    &lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;RagApi&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Set host to &lt;code&gt;localhost&lt;/code&gt; and port to 18089.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: Create your own API key for later use. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-dcc508aa9fd4a185d3a861fd4658318c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-dcc508aa9fd4a185d3a861fd4658318c.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 1: Vectorize Your Data
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Choose source: &lt;strong&gt;SshFile&lt;/strong&gt;, target: &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and test the connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-857f1ee2ddab3fb13e8a9271fb3f0a24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-857f1ee2ddab3fb13e8a9271fb3f0a24.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Full Data&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;br&gt;&lt;br&gt;
On the &lt;strong&gt;Tables&lt;/strong&gt; page,&lt;br&gt;
    1. Select the markdown files you want to process.&lt;br&gt;
    2. Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and fill in the table name (e.g. &lt;code&gt;vector_store&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-8a689e786aab30efc30e57219c62784b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-8a689e786aab30efc30e57219c62784b.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page,&lt;br&gt;&lt;br&gt;
    1. Click &lt;strong&gt;Set LLM&lt;/strong&gt; &amp;gt; &lt;strong&gt;OpenAI&lt;/strong&gt;, and select the instance and the embedding model (e.g. &lt;code&gt;text-embedding-3-large&lt;/code&gt;).&lt;br&gt;
    2. Click &lt;strong&gt;Batch Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;LLM embedding&lt;/strong&gt;. Select the fields for embedding, and check &lt;strong&gt;Select All&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-1-16f07226ade47c47e9aade29f08637ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-1-16f07226ade47c47e9aade29f08637ac.png" width="800" height="374"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-2-5f14ee948990dedbe1fe4ff8842ab291.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-2-5f14ee948990dedbe1fe4ff8842ab291.png" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-afbc31f00cae85df748ee7591a8ae49e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-afbc31f00cae85df748ee7591a8ae49e.png" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 2: Build RagApi Service
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Choose source: &lt;strong&gt;PostgreSQL&lt;/strong&gt; (where the vectors are stored), target: &lt;strong&gt;RagApi&lt;/strong&gt;, and test the connection.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-f12916f024b65d72a59df9c548184293.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-f12916f024b65d72a59df9c548184293.png" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB). &lt;br&gt;
On the &lt;strong&gt;Tables&lt;/strong&gt; page, select the vector table(s).   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F13-470a46a3154ec318bcea09cb1e64f6f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F13-470a46a3154ec318bcea09cb1e64f6f1.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page, click &lt;strong&gt;Set LLM&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embedding LLM&lt;/strong&gt;: Select OpenAI and the embedding model (e.g. &lt;code&gt;text-embedding-3-large&lt;/code&gt;). &lt;strong&gt;Note:&lt;/strong&gt; Make sure vector dimensions in PostgreSQL match the embedding model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat LLM&lt;/strong&gt;: Select OpenAI and the chat model (e.g. &lt;code&gt;gpt-4o&lt;/code&gt;). &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F14-ce81c80de144806e353adefc8d96e647.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F14-ce81c80de144806e353adefc8d96e647.png" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt; to finish the setup. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F16-fef9a1a7ce238ab7cd024eb9cb66e0ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F16-fef9a1a7ce238ab7cd024eb9cb66e0ca.png" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test
&lt;/h2&gt;

&lt;p&gt;You can test the RagApi with &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, a visual tool that supports OpenAI-compatible APIs.&lt;/p&gt;

&lt;p&gt;Open &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, click the Settings icon in the bottom left corner.&lt;br&gt;&lt;br&gt;
Under &lt;strong&gt;Model Provider&lt;/strong&gt;, search for &lt;strong&gt;OpenAI&lt;/strong&gt; and configure:&lt;br&gt;
    - &lt;strong&gt;API Key&lt;/strong&gt;: your RagApi key configured in BladePipe&lt;br&gt;
    - &lt;strong&gt;API Host&lt;/strong&gt;: &lt;a href="http://localhost:18089" rel="noopener noreferrer"&gt;http://localhost:18089&lt;/a&gt;&lt;br&gt;
    - &lt;strong&gt;Model ID&lt;/strong&gt;: BP_RAG&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozdlt0wkmsm3i92vb9f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozdlt0wkmsm3i92vb9f1.png" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ogpbesa1rmgv80jv78u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ogpbesa1rmgv80jv78u.png" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back on the chat page:&lt;br&gt;&lt;br&gt;
    - Click &lt;strong&gt;Add Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Default Assistant&lt;/strong&gt;.&lt;br&gt;
    - Right click &lt;strong&gt;Default Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Edit Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Model Settings&lt;/strong&gt;, and choose BP_RAG as the default model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26xfuu3fq2snz68msbei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26xfuu3fq2snz68msbei.png" width="800" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now try asking: &lt;code&gt;How to create an incremental DataJob in BladePipe?&lt;/code&gt;. RagApi will search your vector database and generate a response using the chat model.&lt;/p&gt;
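Because RagApi speaks the OpenAI chat protocol, the same question can also be sent from the command line. This is a minimal sketch, assuming the host, port, and model ID configured above; the `/v1/chat/completions` path follows the OpenAI convention, and `your-ragapi-key` is a placeholder for the key you created in BladePipe.

```shell
#!/bin/bash
# Build an OpenAI-style chat request for the RagApi service.
# RAGAPI_KEY is a placeholder; use the key you created in BladePipe.
RAGAPI_HOST="http://localhost:18089"
RAGAPI_KEY="your-ragapi-key"
PAYLOAD='{"model":"BP_RAG","messages":[{"role":"user","content":"How to create an incremental DataJob in BladePipe?"}]}'
echo "$PAYLOAD"

# Send the request (requires the RagApi DataJob to be running):
# curl -s "$RAGAPI_HOST/v1/chat/completions" \
#   -H "Authorization: Bearer $RAGAPI_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

Any OpenAI-compatible client library should work the same way against this endpoint.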

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv1uch069mn44nkcex6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv1uch069mn44nkcex6.png" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;With just a few steps, we’ve built a fully functional RagApi service from scratch—vectorized the data, connected to a vector DB, configured LLMs, generated prompts, and deployed an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, teams can quickly build Q&amp;amp;A services based on external knowledge without writing any code. It's a powerful yet accessible way to tap into GenAI for your own data.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>openai</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Load Data From MySQL to Iceberg in Real Time</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Wed, 28 May 2025 07:59:50 +0000</pubDate>
      <link>https://forem.com/bladepipe/how-to-load-data-from-mysql-to-iceberg-in-real-time-2f</link>
      <guid>https://forem.com/bladepipe/how-to-load-data-from-mysql-to-iceberg-in-real-time-2f</guid>
      <description>&lt;p&gt;As companies deal with more data than ever before, the need for real-time, scalable, and low-cost storage becomes critical. That's where Apache Iceberg shines. In this post, I’ll walk you through how to build a real-time data sync pipeline from MySQL to Iceberg using &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;—a tool that makes data migration ridiculously simple.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iceberg
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Iceberg?
&lt;/h3&gt;

&lt;p&gt;If you haven’t heard of Iceberg yet, it’s an open table format designed for large analytic datasets. Think of it as a smarter table format for your data lake—supporting schema evolution, hidden partitioning, ACID transactions, and near real-time data access.&lt;/p&gt;

&lt;p&gt;It includes two key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt;: Think of this as metadata—the table names, columns, data types, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt;: Where the metadata and actual files are stored—like on S3 or HDFS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Iceberg?
&lt;/h3&gt;

&lt;p&gt;Iceberg is open and flexible. It defines clear standards for catalog, file formats, data storage, and data access. This makes it widely compatible with different tools and services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs&lt;/strong&gt;: AWS Glue, Hive, Nessie, JDBC, or custom REST catalogs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File formats&lt;/strong&gt;: Parquet, ORC, Avro, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage options&lt;/strong&gt;: AWS S3, Azure Blob, MinIO, HDFS, Posix FS, local file systems, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data access&lt;/strong&gt;: Real-time data warehouses like StarRocks, Doris, ClickHouse, or batch/stream processing engines like Spark, Flink, and Hive can all read, process and analyze Iceberg data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides its openness, Iceberg strikes a good balance between large-scale storage and near real-time support for inserts, updates, and deletes.&lt;/p&gt;

&lt;p&gt;Here’s a quick comparison across several database types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database Type&lt;/th&gt;
&lt;th&gt;Relational DB&lt;/th&gt;
&lt;th&gt;Real-time Data Warehouse&lt;/th&gt;
&lt;th&gt;Traditional Big Data&lt;/th&gt;
&lt;th&gt;Data Lake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Capacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to a few TBs&lt;/td&gt;
&lt;td&gt;100+ TBs&lt;/td&gt;
&lt;td&gt;PB level&lt;/td&gt;
&lt;td&gt;PB level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Millisecond-level latency, 10K+ QPS&lt;/td&gt;
&lt;td&gt;Second-to-minute latency, thousands of QPS&lt;/td&gt;
&lt;td&gt;Hour-to-day latency, very low QPS&lt;/td&gt;
&lt;td&gt;Minute-level latency, low QPS (batch write)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ACID compliant&lt;/td&gt;
&lt;td&gt;ACID compliant or eventually consistent&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High or very high&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Openness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium (storage-compute decoupling)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this table, it’s clear that Iceberg offers &lt;strong&gt;low cost&lt;/strong&gt;, &lt;strong&gt;massive storage&lt;/strong&gt;, and &lt;strong&gt;strong compatibility with analytics tools&lt;/strong&gt;—a good replacement for older big data systems.&lt;/p&gt;

&lt;p&gt;And thanks to its open architecture, you can keep exploring new use cases for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BladePipe?
&lt;/h2&gt;

&lt;p&gt;Setting up Iceberg sounds great—until you realize how much work it takes to actually migrate and sync data from your transactional database. That’s where BladePipe comes in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supported Catalogs and Storage
&lt;/h3&gt;

&lt;p&gt;BladePipe currently supports 3 Iceberg catalogs and 2 storage backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue + AWS S3&lt;/li&gt;
&lt;li&gt;Nessie + MinIO / AWS S3&lt;/li&gt;
&lt;li&gt;REST Catalog + MinIO / AWS S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a fully cloud-based setup: Use AWS RDS + EC2 to deploy BladePipe + AWS Glue + AWS S3.&lt;/p&gt;

&lt;p&gt;For an on-premise setup: Use a self-hosted relational database + On-Premise deployment of BladePipe + Nessie or REST catalog + MinIO.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-Stop Data Sync
&lt;/h3&gt;

&lt;p&gt;Before data replication, there's often a lot of manual setup. BladePipe takes care of that for you—automatically handling schema mapping, historical data migration, and other preparation.&lt;/p&gt;

&lt;p&gt;Even though Iceberg isn't a traditional database, BladePipe supports an automatic data sync process, including converting schemas, mapping data types, adapting field lengths, cleaning constraints, etc. Everything happens in BladePipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Procedures
&lt;/h2&gt;

&lt;p&gt;In this post, we’ll use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: MySQL (self-hosted)&lt;/li&gt;
&lt;li&gt;Target: Iceberg backed by AWS Glue + S3&lt;/li&gt;
&lt;li&gt;Sync Tool: &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe (Cloud)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go step-by-step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_docker" rel="noopener noreferrer"&gt;Install Worker (Docker)&lt;/a&gt; or &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_binary" rel="noopener noreferrer"&gt;Install Worker (Binary)&lt;/a&gt; to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add DataSources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add two sources – one MySQL, one Iceberg. For Iceberg, fill in the following (replace &lt;code&gt;&amp;lt;...&amp;gt;&lt;/code&gt; with your values):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the AWS Glue endpoint.
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glue.&amp;lt;aws_glue_region_code&amp;gt;.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version&lt;/strong&gt;: Leave as default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: Fill in meaningful words to help identify it.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extra Info&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;httpsEnabled&lt;/strong&gt;: Enable it to set the value to true.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogName&lt;/strong&gt;: Enter a meaningful name, such as glue__catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogType&lt;/strong&gt;: Fill in GLUE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogWarehouse&lt;/strong&gt;: The place where metadata and files are stored, such as s3://_iceberg.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogProps&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"io-impl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.iceberg.aws.s3.S3FileIO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://s3.&amp;lt;aws_s3_region_code&amp;gt;.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.access-key-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.secret-access-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.path-style-access"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_region&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider.glue.access-key-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_glue_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider.glue.secret-access-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_glue_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.amazonaws.glue.catalog.credentials.GlueAwsCredentialsProvider"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-afd4d16b1739f59151ceb30d6189cfc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-afd4d16b1739f59151ceb30d6189cfc4.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Create a DataJob
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources, and click &lt;strong&gt;Test Connection&lt;/strong&gt; for both. Here's the recommended Iceberg table property configuration:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"format-version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"parquet.compression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"snappy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"iceberg.write.format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.metadata.delete-after-commit.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.metadata.previous-versions-max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.update.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.delete.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.merge.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.distribution-mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.object-storage.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.spark.accept-any-schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e436c11d029481dc58c5a86d17a2fc7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e436c11d029481dc58c5a86d17a2fc7b.png" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type, together with the &lt;strong&gt;Full Data&lt;/strong&gt; option. Use at least a 1 GB or 2 GB DataJob specification; smaller specifications may hit memory issues with large batches.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-aaf4ce14be8ce88cbcdb85c426ceab33.png" width="800" height="428"&gt;
&lt;/li&gt;
&lt;li&gt;Select the tables to be replicated. It’s best to stay under 1000 tables per DataJob.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-a04b97e30e5784d7159c6a90e948cdbd.png" width="800" height="429"&gt;
&lt;/li&gt;
&lt;li&gt;Select the columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation, and start the DataJob.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-3a09842159318f4b02a02b13b575f071.png" alt="mysql_to_iceberg_running" width="800" height="125"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4: Test &amp;amp; Verify
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generate some insert/update/delete operations on MySQL.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fmysql_to_iceberg_incre_data-ec037ce34cb79945652094c5f056e4ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fmysql_to_iceberg_incre_data-ec037ce34cb79945652094c5f056e4ce.png" alt="mysql_to_iceberg_incre_data" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stop data generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up a pay-as-you-go Aliyun EMR cluster for StarRocks, add the AWS Glue Iceberg catalog as an external catalog, and run verification queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
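
&lt;p&gt;For step 1, the test workload can be as simple as a handful of DML statements on the source MySQL database. A minimal sketch (the &lt;code&gt;orders&lt;/code&gt; table and its columns here are hypothetical placeholders; substitute one of your replicated tables):&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example table; use any table included in the DataJob.
INSERT INTO orders (id, amount, status) VALUES (1001, 99.90, 'NEW');
UPDATE orders SET status = 'PAID' WHERE id = 1001;
DELETE FROM orders WHERE id = 1001;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each of these operations should appear in the Iceberg table shortly after it commits on MySQL.&lt;/p&gt;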

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In StarRocks, add the external catalog:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;glue_test&lt;/span&gt;
 &lt;span class="n"&gt;PROPERTIES&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;"type"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"iceberg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"iceberg.catalog.type"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"glue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.use_instance_profile"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.access_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_glue_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.secret_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_glue_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.region"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ap-southeast-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.use_instance_profile"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.access_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_s3_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.secret_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_s3_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.region"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ap-southeast-1"&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;glue_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;new_planner_optimize_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MySQL row count&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ol4wkxyqiqcgul663tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ol4wkxyqiqcgul663tt.png" alt="mysql_data_count" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iceberg row count&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pzix7ffe18fnd1f4c4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pzix7ffe18fnd1f4c4m.png" alt="iceberg_data_count" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
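
&lt;p&gt;With the external catalog in place, the row counts shown in the screenshots can be reproduced with a query like the following (the database and table names are placeholders; run the same &lt;code&gt;COUNT(*)&lt;/code&gt; on MySQL to compare):&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;SET CATALOG glue_test;

-- Hypothetical database/table names; the counts should match the MySQL side
-- once the incremental DataJob has caught up.
SELECT COUNT(*) FROM my_db.my_table;
&lt;/code&gt;&lt;/pre&gt;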

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Building a robust, real-time data pipeline from MySQL to Iceberg used to be a heavy lift. With tools like &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, it becomes as easy as clicking through a setup wizard.&lt;/p&gt;

&lt;p&gt;Whether you're modernizing your data platform or experimenting with lakehouse architectures, this combo gives you a low-cost, high-scale option to play with.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>mysql</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
