<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BladePipe</title>
    <description>The latest articles on Forem by BladePipe (@bladepipe).</description>
    <link>https://forem.com/bladepipe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2123762%2F3d600285-5652-4be9-9cdb-25038e97be8e.jpg</url>
      <title>Forem: BladePipe</title>
      <link>https://forem.com/bladepipe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bladepipe"/>
    <language>en</language>
    <item>
      <title>DynamoDB vs MongoDB in 2025: Key Differences, Use Cases</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Tue, 26 Aug 2025 02:26:02 +0000</pubDate>
      <link>https://forem.com/bladepipe/dynamodb-vs-mongodb-in-2025-key-differences-use-cases-1ed0</link>
      <guid>https://forem.com/bladepipe/dynamodb-vs-mongodb-in-2025-key-differences-use-cases-1ed0</guid>
      <description>&lt;p&gt;Choosing the right database for a given application is always a problem for data engineers. Two popular NoSQL database options that frequently come up are AWS DynamoDB and MongoDB. Both offer scalability and flexibility but differ significantly in their architecture, features, and operational characteristics. This blog provides a comprehensive comparison to help you make an informed decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon DynamoDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt; is Amazon’s fully managed, serverless NoSQL service. It supports both key–value and document data, scales automatically, and delivers single-digit millisecond response times at any size. Features like global tables, on-demand scaling, and tight integration with AWS services make it a go-to for high-scale workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully managed service&lt;/strong&gt;: No servers to manage. DynamoDB automatically partitions data and scales throughput, eliminating operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency at scale&lt;/strong&gt;: It is designed for consistent millisecond latency for reads and writes, even under heavy load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep AWS integration&lt;/strong&gt;: It integrates natively with Lambda, API Gateway, Kinesis, CloudWatch, and IAM, simplifying the building of serverless architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global replication&lt;/strong&gt;: Global tables offer multi-region, active-active replication that automatically keeps copies of a DynamoDB table in sync across different AWS Regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
DynamoDB has &lt;a href="https://aws.amazon.com/dynamodb/pricing" rel="noopener noreferrer"&gt;two pricing modes&lt;/a&gt;: &lt;strong&gt;On‑Demand&lt;/strong&gt; (pay per request) and &lt;strong&gt;Provisioned&lt;/strong&gt; (buy read/write capacity units). On-demand is simple for unpredictable or spiky traffic, while provisioned is more cost-efficient for steady high throughput. &lt;/p&gt;

&lt;p&gt;For storage, the first 25 GB per month is free; beyond that, storage is billed at $0.25 per GB per month.&lt;/p&gt;

&lt;p&gt;Additional costs apply for backup, global tables, change data capture, etc. &lt;/p&gt;
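A back-of-the-envelope sketch of the storage pricing above (rates as quoted: first 25 GB per month free, then $0.25 per GB; request, backup, and transfer costs excluded):

```python
# Rough DynamoDB storage-cost estimate, using the rates quoted above.
FREE_TIER_GB = 25
PRICE_PER_GB = 0.25  # USD per GB per month beyond the free tier

def monthly_storage_cost(total_gb: float) -> float:
    """Estimated monthly storage bill in USD (storage only, no request costs)."""
    billable = max(0.0, total_gb - FREE_TIER_GB)
    return round(billable * PRICE_PER_GB, 2)

print(monthly_storage_cost(20))   # within the free tier -> 0.0
print(monthly_storage_cost(125))  # 100 billable GB -> 25.0
```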

&lt;h2&gt;
  
  
  What is MongoDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt; is a document database that stores data as BSON (binary JSON) documents. It’s flexible, schema-optional, and supports rich queries, secondary indexes, and powerful aggregation pipelines. You can self-host it or use MongoDB Atlas, the managed service that runs on AWS, Azure, or GCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Data Model&lt;/strong&gt;: Documents allow for embedding and nested structures, accommodating complex and evolving data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versatile ad-hoc queries&lt;/strong&gt;: It supports a wide range of query types, including field-based lookups, regular expressions, and geospatial queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich indexing &amp;amp; analytics&lt;/strong&gt;: It supports compound, text, geospatial, wildcard, and partial indexes. The aggregation pipeline enables complex transformations and analytics inside the database. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;ACID Transactions&lt;/strong&gt;: It supports multi-document ACID transactions (since v4.0), keeping data consistent even when operations fail partway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MongoDB Enterprise&lt;/strong&gt; (self-managed) incurs the infrastructure costs (servers, storage, networking) on your chosen platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB Atlas&lt;/strong&gt; (managed service) has &lt;a href="https://www.mongodb.com/pricing" rel="noopener noreferrer"&gt;a free tier, shared tiers, and dedicated clusters billed hourly&lt;/a&gt; (pay‑as‑you‑go). Pricing depends on cloud provider, instance family, vCPU/RAM, storage, backup retention, and data transfer.&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB vs MongoDB At a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DynamoDB&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed NoSQL database (AWS)&lt;/td&gt;
&lt;td&gt;Document NoSQL database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS only&lt;/td&gt;
&lt;td&gt;On-premise / MongoDB Atlas (managed on multiple cloud providers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key-value and document&lt;/td&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max Document Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;400 KB per item&lt;/td&gt;
&lt;td&gt;16 MB per document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Primary key lookups, range queries, secondary indexes; limited aggregation&lt;/td&gt;
&lt;td&gt;Ad-hoc queries, joins, and an advanced aggregation pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic partitioning and scaling&lt;/td&gt;
&lt;td&gt;Manual or automated scaling via sharding and replica sets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eventually consistent by default, optional strong consistency; multi-item ACID transactions&lt;/td&gt;
&lt;td&gt;Tunable consistency levels; multi-document ACID transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single-digit millisecond response time&lt;/td&gt;
&lt;td&gt;Varies based on configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated with AWS IAM&lt;/td&gt;
&lt;td&gt;Role-Based Access Control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Region Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in via global tables (active-active)&lt;/td&gt;
&lt;td&gt;Atlas Global Clusters or custom sharding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep AWS integration&lt;/td&gt;
&lt;td&gt;Broad ecosystem, multi-cloud support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vendor Lock-in&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (AWS only)&lt;/td&gt;
&lt;td&gt;Lower (run on multiple clouds or on-prem)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Core Features Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Model &amp;amp; Query
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employs a key-value store with support for document structures. &lt;/li&gt;
&lt;li&gt;Optimized for fast lookups based on the primary key.&lt;/li&gt;
&lt;li&gt;Global and local secondary indexes for additional access paths.&lt;/li&gt;
&lt;li&gt;Limited aggregation support.&lt;/li&gt;
&lt;/ul&gt;
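The access pattern above can be sketched as the parameters for a boto3 `Table.query` call. The table name ("Orders"), key names, and values below are hypothetical:

```python
# Sketch of a DynamoDB key-based access pattern. This dict is the shape of
# the parameters you would pass to boto3's Table.query; the table and
# attribute names are hypothetical.
query_params = {
    "KeyConditionExpression": "customer_id = :cid AND order_date > :since",
    "ExpressionAttributeValues": {
        ":cid": "C-1001",
        ":since": "2025-01-01",
    },
    "Limit": 25,  # DynamoDB paginates; complex filtering happens client-side
}

# With boto3 this would run as (requires AWS credentials, not executed here):
#   table = boto3.resource("dynamodb").Table("Orders")
#   response = table.query(**query_params)
```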

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A document-oriented database where data is stored in BSON documents within collections.&lt;/li&gt;
&lt;li&gt;Expressive query language that supports many operators.&lt;/li&gt;
&lt;li&gt;Powerful aggregation pipelines allow for complex in-database transformations.&lt;/li&gt;
&lt;/ul&gt;
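As a sketch of the aggregation pipeline idea, here is the kind of stage list you would pass to pymongo's `collection.aggregate` on a live deployment; the "orders" collection and its fields are hypothetical:

```python
# Hypothetical MongoDB aggregation pipeline: top 10 customers by spend on
# shipped orders. Each stage transforms the documents flowing through it.
pipeline = [
    {"$match": {"status": "shipped"}},          # filter documents
    {"$group": {"_id": "$customer_id",          # group per customer
                "total": {"$sum": "$amount"},
                "orders": {"$sum": 1}}},
    {"$sort": {"total": -1}},                   # biggest spenders first
    {"$limit": 10},
]
# With pymongo: results = db.orders.aggregate(pipeline)
```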

&lt;h3&gt;
  
  
  Scalability and Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic horizontal scaling of both storage and throughput.&lt;/li&gt;
&lt;li&gt;Single-digit millisecond latency at any scale.&lt;/li&gt;
&lt;li&gt;Handles huge throughput with AWS-managed partitioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scales via sharding and replica sets.&lt;/li&gt;
&lt;li&gt;Some effort is required to set up and manage sharding.&lt;/li&gt;
&lt;li&gt;Performance depends on query patterns, indexing, and the chosen consistency level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consistency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventually consistent reads by default, or strongly consistent reads at the cost of higher latency.&lt;/li&gt;
&lt;li&gt;ACID transactions across one or more tables within a single AWS region.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offers various read concerns to control the consistency and isolation of read operations.&lt;/li&gt;
&lt;li&gt;ACID transactions for multi-document operations.&lt;/li&gt;
&lt;/ul&gt;
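The consistency knobs above can be sketched as the per-request options each database accepts (table, key, and collection names are hypothetical):

```python
# DynamoDB: opt into a strongly consistent read per request. This is the
# low-level GetItem parameter shape for boto3's client API; the table and
# key are hypothetical.
dynamodb_get = {
    "TableName": "Orders",
    "Key": {"order_id": {"S": "O-42"}},
    "ConsistentRead": True,  # default is False (eventually consistent)
}

# MongoDB: tune consistency via read/write concerns (pymongo options).
mongo_options = {
    "readConcern": {"level": "majority"},  # read data acknowledged by a majority
    "writeConcern": {"w": "majority"},     # wait for majority acknowledgment
}
```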

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic multi-AZ replication within a region.&lt;/li&gt;
&lt;li&gt;Automatic regional failover.&lt;/li&gt;
&lt;li&gt;Global tables for automated multi-region, active-active replication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replica sets provide high availability, with one primary node and multiple secondary nodes.&lt;/li&gt;
&lt;li&gt;Automatic failover via replica set elections; Atlas further automates operations in managed clusters.&lt;/li&gt;
&lt;li&gt;Atlas Global Clusters enable zone sharding to partition data and pin it to specific regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Choose Between Them?
&lt;/h2&gt;

&lt;p&gt;There’s no universal winner. Both are mature, battle-tested products. You may consider the following cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose DynamoDB if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You are all-in on AWS.&lt;/strong&gt; DynamoDB integrates seamlessly with other AWS services, making it a natural choice for serverless services built within the AWS ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your query patterns are simple and predictable.&lt;/strong&gt; The ideal use case for DynamoDB is fetching data using a known primary key. It's not designed for complex, ad-hoc queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You prefer minimal operational burden&lt;/strong&gt;. DynamoDB is fully managed by AWS, minimizing the operational overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world case: &lt;a href="https://www.youtube.com/watch?v=TCnmtSY2dFM" rel="noopener noreferrer"&gt;How Disney+ scales globally on Amazon DynamoDB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose MongoDB if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You require complex querying and data aggregation.&lt;/strong&gt; MongoDB's rich query language and aggregation pipelines are well suited to performing data searches and analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a flexible schema.&lt;/strong&gt; MongoDB's document model easily accommodates data structure changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want deployment flexibility.&lt;/strong&gt; MongoDB can be run on-premises, on any cloud provider (AWS, GCP, Azure), or as a fully managed service via MongoDB Atlas. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world case: &lt;a href="https://www.mongodb.com/solutions/customer-case-studies/novo-nordisk?tck=customer" rel="noopener noreferrer"&gt;How Novo Nordisk accelerates time to value with GenAI and MongoDB&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stream Data to DynamoDB and MongoDB Easily
&lt;/h2&gt;

&lt;p&gt;In real-world architectures, DynamoDB and MongoDB don’t exist in isolation. They’re part of a larger data ecosystem that needs to move information in and out in real time. &lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; fits perfectly. As a real-time, end-to-end data replication tool, it supports &lt;a href="https://www.bladepipe.com/connector" rel="noopener noreferrer"&gt;60+ out-of-the-box connectors&lt;/a&gt;. It captures data changes (CDC) from multiple sources and continuously syncs them into DynamoDB or MongoDB with sub-second latency. This ensures both databases always have fresh, consistent data without manual ETL jobs or complex pipelines. Both &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;on-prem and cloud deployments&lt;/a&gt; are supported. &lt;/p&gt;

&lt;p&gt;With BladePipe, teams only need to focus on building applications, not moving data.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>dynamodb</category>
      <category>aws</category>
      <category>database</category>
    </item>
    <item>
      <title>10 Best LangChain Alternatives You Must Know in 2025</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 25 Jul 2025 05:33:35 +0000</pubDate>
      <link>https://forem.com/bladepipe/10-best-langchain-alternatives-you-must-know-in-2025-2ce5</link>
      <guid>https://forem.com/bladepipe/10-best-langchain-alternatives-you-must-know-in-2025-2ce5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; has become a go-to framework for building LLM-powered applications, including retrieval-augmented generation (RAG) and autonomous agents. But it’s not the only option out there, and depending on your needs, it might not even be the best. &lt;/p&gt;

&lt;p&gt;If you’re hitting limits with LangChain, or just want to explore what else is out there, this post breaks down 10 top alternatives that give you more flexibility, performance, or control. Whether you need better data pipelines, simpler orchestration, or enterprise-ready agents, there’s likely a tool better suited to your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LangChain?
&lt;/h2&gt;

&lt;p&gt;LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs). At its core, LangChain provides a modular and composable toolkit for "chaining" different components together. It allows developers to focus on complex workflows rather than raw prompts and API calls.&lt;/p&gt;

&lt;p&gt;The framework is built around a few key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chains&lt;/strong&gt;: Sequences of calls that form a complete application workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: LLM-powered dynamic chains, determining which tools to use and in what order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools &amp;amp; Function Calling&lt;/strong&gt;: External systems that agents interact with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: Allows applications to remember past conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrations&lt;/strong&gt;: Plug-and-play support for LLMs, vector databases, document loaders, etc.&lt;/li&gt;
&lt;/ul&gt;
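The "chain" idea can be illustrated without the framework itself: each step is a callable, and a chain is just their composition. A minimal plain-Python sketch, not LangChain's actual API:

```python
# Minimal illustration of the "chain" concept (NOT LangChain's real API):
# each step transforms the running state; a chain composes the steps.
def chain(*steps):
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run

# Hypothetical steps: a prompt template, a stand-in for an LLM call, a parser.
fill_prompt = lambda topic: f"Summarize: {topic}"
fake_llm = lambda prompt: f"[summary of '{prompt}']"  # placeholder, no real model
parse = lambda text: text.strip("[]")

summarize = chain(fill_prompt, fake_llm, parse)
print(summarize("NoSQL databases"))  # -> summary of 'Summarize: NoSQL databases'
```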

&lt;h2&gt;
  
  
  LangChain Use Cases
&lt;/h2&gt;

&lt;p&gt;LangChain's versatility has made it a popular choice for a wide range of AI applications. Some of the most common use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;: With RAG, user queries are enhanced with information retrieved from external sources like vector databases, file systems, or knowledge bases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt;: Use LangChain to design complex workflows where LLMs interact with external tools and systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Chatbots&lt;/strong&gt;: LangChain supports multi-turn conversations and memory management, making it suitable for applications that require context-aware interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Analysis and Summarization&lt;/strong&gt;: LangChain is often used for applications that process, summarize, and analyze large volumes of text—across PDFs, email threads, research papers, or internal reports.&lt;/li&gt;
&lt;/ul&gt;
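The retrieval step behind RAG can be sketched with toy vectors standing in for a real embedding model and vector database:

```python
# Toy RAG retrieval: rank stored chunks by cosine similarity to the query
# vector, then feed the best matches into the prompt. The 3-d "embeddings"
# and document names are fabricated for illustration.
import math

docs = {
    "returns policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "warranty terms": [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:k]

context = retrieve([1.0, 0.0, 0.0])  # query vector close to "returns policy"
prompt = f"Answer using: {context}. Question: ..."
```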

&lt;h2&gt;
  
  
  Why Consider LangChain Alternatives?
&lt;/h2&gt;

&lt;p&gt;While LangChain is a powerful and widely-adopted framework, it's not without its drawbacks. Here are some common reasons developers and teams look elsewhere:&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity
&lt;/h3&gt;

&lt;p&gt;LangChain’s abstractions are powerful, but they can also be &lt;strong&gt;heavyweight&lt;/strong&gt;. For simple pipelines, it might feel like using a full orchestration engine to run a shell script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Bottlenecks
&lt;/h3&gt;

&lt;p&gt;The layered nature of LangChain can sometimes introduce performance overhead. For applications that require &lt;strong&gt;low latency&lt;/strong&gt; and &lt;strong&gt;high throughput&lt;/strong&gt;, this can be a significant issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difficult Debugging
&lt;/h3&gt;

&lt;p&gt;LangChain can feel overly complex, especially for newcomers. The framework's abstraction layers, while powerful, can sometimes make it difficult to understand what's happening under the hood. &lt;strong&gt;Debugging can be particularly challenging when things go wrong in a long chain.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rapidly Evolving Ecosystem
&lt;/h3&gt;

&lt;p&gt;The AI landscape is changing constantly. New frameworks are being developed with novel approaches, more intuitive interfaces, and better performance for specific tasks. Staying open to these alternatives is crucial for building the best possible applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 LangChain Alternatives
&lt;/h2&gt;

&lt;p&gt;Let’s explore ten powerful alternatives to LangChain, each with unique strengths across use cases like RAG, agents, automation, and orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjn7bpvmmfon71c9yw7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjn7bpvmmfon71c9yw7o.png" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; is a data framework designed specifically to connect your private data with LLMs. While LangChain is about "chaining" different tools, LlamaIndex focuses on the "smart storage" and retrieval part of the equation, making it a powerful tool for RAG applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible document loaders and index types (list, tree, vector, keyword)&lt;/li&gt;
&lt;li&gt;Powerful query engines and retrievers&lt;/li&gt;
&lt;li&gt;Tool calling and agent integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Developers building LLM applications on top of private documents with fine-tuned control over retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwdjo3o9epapizlgi1wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwdjo3o9epapizlgi1wy.png" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a real-time data integration tool. Its RagApi function automates the process of building RAG applications. Through two end-to-end data pipelines in BladePipe, you can deliver data to vector databases in real time and always keep the knowledge fresh. It supports both cloud and on-premise deployment, ideal for teams of all sizes to get the right AI application solution.&lt;/p&gt;

&lt;p&gt;Compared to traditional RAG setups, which often involve lots of manual work, BladePipe RagApi offers several unique benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: One to import documents, and one to create the API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: No need to write any code, just configure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform compatibility&lt;/strong&gt;: Support DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Integrate it directly with existing Chat apps or tools with no extra setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Individuals and teams needing production-grade data pipelines for AI/RAG with minimal operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Haystack
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fheystack-54b151e1e8b7b784fc2ef6c4c5b44d62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fheystack-54b151e1e8b7b784fc2ef6c4c5b44d62.png" width="800" height="368"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://haystack.deepset.ai/" rel="noopener noreferrer"&gt;Haystack&lt;/a&gt; is an open-source framework for building search systems, question-answering applications, and conversational AI. It offers a modular, pipeline-based architecture that lets developers connect components like retrievers, readers, generators, and rankers with ease. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modular components for indexing, retrieval and generation&lt;/li&gt;
&lt;li&gt;70+ integrations with LLMs, vector databases, and transformer models&lt;/li&gt;
&lt;li&gt;REST API support, Dockerized deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Building flexible, search-focused AI applications with full control over natural language processing (NLP) pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Kernel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fsementic-af25b37332ab3edcf0927c5f40860d82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fsementic-af25b37332ab3edcf0927c5f40860d82.png" width="800" height="589"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; is an open-source SDK from Microsoft. It provides a lightweight framework for integrating cutting-edge AI models into existing applications. It's particularly strong for developers working in C#, Python, or Java and aims to act as an efficient middleware for building AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;     &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native plugin model for AI skills&lt;/li&gt;
&lt;li&gt;Multi-language support (.NET, Python, Java)&lt;/li&gt;
&lt;li&gt;Integration with Microsoft ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enterprise teams looking to build secure, composable AI agents integrated with Microsoft ecosystems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Langroid
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5iar2rqgp48bse5jl7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5iar2rqgp48bse5jl7e.png" width="800" height="601"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://langroid.github.io/langroid/" rel="noopener noreferrer"&gt;Langroid&lt;/a&gt; is an open-source Python framework that introduces a multi-agent programming paradigm. Instead of focusing on simple chains, Langroid treats agents as first-class citizens, enabling the creation of complex applications where multiple agents collaborate to solve a task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;     &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python-native agents with natural language and structured task definition&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Supports various LLMs, vector databases, and function-calling tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Developers building collaborative agents with clear execution paths and modular logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Griptape
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fgriptape-5cbc2b0b73889e8cae09f4ab1f7f9ed1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fgriptape-5cbc2b0b73889e8cae09f4ab1f7f9ed1.png" width="800" height="324"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.griptape.ai/" rel="noopener noreferrer"&gt;Griptape&lt;/a&gt; is a Python-based framework for building and running AI applications, specifically focused on creating reliable and production-ready RAG applications. It offers a structured approach to building LLM workflows, with strong control over data flow and governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secure AI agent building&lt;/li&gt;
&lt;li&gt;Cloud-native design with plugin support&lt;/li&gt;
&lt;li&gt;A structured way to define AI workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enterprise AI workflows requiring traceability and production readiness.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoChain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxli6hzpd5jbzkwvxp5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxli6hzpd5jbzkwvxp5y.png" width="800" height="529"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://autochain.forethought.ai/" rel="noopener noreferrer"&gt;AutoChain&lt;/a&gt; is a lightweight and simple framework for building LLM applications. It's designed to be a more straightforward alternative to LangChain, focusing on ease of use and quick prototyping. The goal is to provide a clean and intuitive way to create multi-step LLM workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;      &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight and extensible generative agent pipeline&lt;/li&gt;
&lt;li&gt;Simple memory tracking for conversation history and tool outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Builders who want to move fast without complex abstractions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Braintrust
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbraintrust-61f15fd92b29b80d3aa71dcc3447eade.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbraintrust-61f15fd92b29b80d3aa71dcc3447eade.png" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.braintrust.dev/" rel="noopener noreferrer"&gt;Braintrust&lt;/a&gt; is an open-source framework for building, testing, and deploying LLM workflows with a focus on reliability, observability, and performance. It stands out with built-in support for prompt versioning, output evaluation, and detailed logging, making it ideal for optimizing AI behavior over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools for continuous evaluation of LLM outputs&lt;/li&gt;
&lt;li&gt;Built-in monitoring, logging, and benchmarking&lt;/li&gt;
&lt;li&gt;Work with popular LLM providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Teams building production LLM apps with performance and traceability in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flowise AI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fflowise-03b30a4c6e6a43959a02782cb1a94ce3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fflowise-03b30a4c6e6a43959a02782cb1a94ce3.png" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://flowiseai.com/" rel="noopener noreferrer"&gt;Flowise AI&lt;/a&gt; is a low-code, visual tool for building and managing LLM applications. It's perfect for those who prefer a drag-and-drop interface over writing code. It's built on top of the LangChain ecosystem but provides a much more accessible and user-friendly experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drag-and-drop interface for LLM apps&lt;/li&gt;
&lt;li&gt;100+ integrations with LLMs, vector stores and more&lt;/li&gt;
&lt;li&gt;Local and cloud deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Non-technical users or rapid prototyping of LLM workflows visually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rivet
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Frivet-d637aad4e50a9c4f0ac46fddc35f3899.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Frivet-d637aad4e50a9c4f0ac46fddc35f3899.png" width="800" height="284"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://rivet.ironcladapp.com/" rel="noopener noreferrer"&gt;Rivet&lt;/a&gt; is a visual programming environment for building and prototyping LLM applications. It uses a graph-based interface to allow developers to visually design and test their AI workflows. Rivet's focus is on providing a powerful, intuitive, and highly-performant tool for building complex AI graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual interface for prompt iterations and experiments&lt;/li&gt;
&lt;li&gt;Built-in prompt editor and playground for fine-tuning prompts&lt;/li&gt;
&lt;li&gt;Real-time debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI teams optimizing prompts, chain design, or evaluation strategies collaboratively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with BladePipe
&lt;/h2&gt;

&lt;p&gt;LangChain has paved the way for building powerful LLM applications, offering developers a flexible framework to prototype agents, RAG pipelines, and chatbots. But as teams move from experimentation to production, LangChain’s framework can introduce complexity, performance issues, and operational overhead.&lt;/p&gt;

&lt;p&gt;If you're building RAG systems that depend on fresh and structured data, BladePipe is a strong contender. With built-in support for embedding and real-time sync, BladePipe turns your raw data into retrieval-ready intelligence. Skip the complexity. Try BladePipe and build AI systems that actually scale.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>rag</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>BladePipe vs. Fivetran-Features, Pricing and More (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 18 Jul 2025 06:02:05 +0000</pubDate>
      <link>https://forem.com/bladepipe/bladepipe-vs-fivetran-features-pricing-and-more-2025-f0k</link>
      <guid>https://forem.com/bladepipe/bladepipe-vs-fivetran-features-pricing-and-more-2025-f0k</guid>
      <description>&lt;p&gt;In today’s data-driven landscape, businesses rely heavily on efficient data integration platforms to consolidate and transform data from multiple sources. Two prominent players in this space are &lt;strong&gt;Fivetran&lt;/strong&gt; and &lt;strong&gt;BladePipe&lt;/strong&gt;, both offering solutions to automate and streamline data movement across cloud and on-premises environments. &lt;/p&gt;

&lt;p&gt;This blog provides a clear comparison of BladePipe and Fivetran as of 2025, covering their core features, pricing models, deployment options, and suitability for different business needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Intro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is BladePipe?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a data integration platform known for its extremely low latency and high performance, facilitating efficient migration and sync of data across both on-premises and cloud databases. Founded in 2019, it’s built for analytics, microservices and AI-focused use cases that emphasize real-time data.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time replication&lt;/strong&gt;, with a latency less than 10 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end pipeline&lt;/strong&gt; for great reliability and easy maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-stop management&lt;/strong&gt; of the whole lifecycle from schema evolution to monitoring and alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code RAG&lt;/strong&gt; building for simpler and smarter AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Fivetran?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.fivetran.com/" rel="noopener noreferrer"&gt;Fivetran&lt;/a&gt; is a global leader in automated data movement and is widely trusted by many companies. It offers a fully managed ELT (Extract-Load-Transform) service that automates data pipelines with prebuilt connectors, ensuring robust data sync and automatic adaptation to source schema changes. &lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed ELT pipelines&lt;/strong&gt;, automating the entire Extract-Load-Transform process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensive connectors&lt;/strong&gt; (700+ prebuilt connectors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong data transformation ability&lt;/strong&gt; with dbt integration and built-in models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic schema handling&lt;/strong&gt;, reducing human efforts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Fivetran&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mode&lt;/td&gt;
&lt;td&gt;Real-time CDC-first/ETL&lt;/td&gt;
&lt;td&gt;ELT/Batch CDC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Latency&lt;/td&gt;
&lt;td&gt;≤ 10 seconds&lt;/td&gt;
&lt;td&gt;≥ 1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Connectors&lt;/td&gt;
&lt;td&gt;40+ connectors built by BladePipe&lt;/td&gt;
&lt;td&gt;700+ connectors, 450+ are Lite (API) connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Data Fetch&lt;/td&gt;
&lt;td&gt;Pull and Push hybrid&lt;/td&gt;
&lt;td&gt;Pull-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transformation&lt;/td&gt;
&lt;td&gt;Built-in transformations and custom code&lt;/td&gt;
&lt;td&gt;Post-load transformation and dbt integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification &amp;amp; Correction&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Options&lt;/td&gt;
&lt;td&gt;Self-hosted/Cloud (BYOC)&lt;/td&gt;
&lt;td&gt;Self-hosted/Hybrid/SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR, HIPAA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Enterprise-level support&lt;/td&gt;
&lt;td&gt;Tiered support (Standard, Enterprise, Business Critical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLA&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Pipeline Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; adopts batch-based CDC, which means data is read at batch intervals. It offers a range of sync frequencies, from as low as 1 minute (for Enterprise/Business Critical plans) to 24 hours, so end-to-end latency is often around 10 minutes in practice. Batch reads also put extra pressure on the source systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; uses &lt;strong&gt;real-time Change Data Capture (CDC)&lt;/strong&gt; for data integration. That means it instantly grabs data changes from your source and delivers them to the destination within seconds. This approach is a big shift from traditional batch-based CDC methods. In BladePipe, real-time CDC works with nearly all of its 40+ connectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe clearly beats Fivetran on latency, making it ideal for use cases that require always-fresh data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; offers an extensive library of 700+ pre-built connectors, covering databases, APIs, files and more, satisfying diverse business needs. Around 450 of them are Lite connectors, built for specific use cases with limited endpoints. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers &lt;strong&gt;over 40 pre-built connectors&lt;/strong&gt;. It focuses on essential systems for real-time needs, like OLTPs, OLAPs, messaging tools, search engines, data warehouses/lakes, and vector databases. This makes it a great choice for real-time projects where getting fresh data quickly is a fundamental requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Fivetran excels with its broad range of connectors, while BladePipe focuses on data delivery for critical real-time infrastructure. Choose the right tool that works for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran's&lt;/strong&gt; reliability has been a point of concern. Its &lt;a href="https://status.fivetran.com/" rel="noopener noreferrer"&gt;status page&lt;/a&gt; regularly shows 15 or more incidents per month, including connector failures, 3rd-party service errors, and other service degradations. It even experienced an outage lasting more than 2 days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; is built with production-grade reliability at its core. It provides real-time dashboards for monitoring every step of data movement. Alert notifications can be triggered for latency, failures, or data loss. That makes it easy to maintain pipelines and solve problems, enhancing reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe shows a more reliable system performance than Fivetran, and its monitoring and alerting mechanism brings even stronger support for stable pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fivetran&lt;/strong&gt; offers documentation, a support portal, and email support on the Standard plan. However, some customers report long waits for a response. Enterprise and Business Critical plans enjoy a 1-hour support response, but the costs are much higher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers a more &lt;strong&gt;white-glove support experience&lt;/strong&gt;. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs. Its technical team works closely with clients during onboarding and when fine-tuning data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, both Fivetran and BladePipe provide documentation and technical support, but BladePipe's hands-on model may suit teams that want closer guidance, while Fivetran's fastest response times are reserved for its higher-priced tiers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case Comparison
&lt;/h2&gt;

&lt;p&gt;Based on the features stated above, the performance of the two tools varies in different use cases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Fivetran&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data sync between relational databases&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data sync between online business databases (RDB, data warehouse, message, cache, search engine)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data lakehouse support&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS sources support&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud data sync&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Pricing Model Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is a crucial consideration when evaluating data integration tools, especially for startups and organizations with extensive data replication needs. Fivetran and BladePipe employ significantly different pricing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fivetran
&lt;/h3&gt;

&lt;p&gt;Fivetran has four plans to consider: &lt;strong&gt;Free&lt;/strong&gt;, &lt;strong&gt;Standard&lt;/strong&gt;, &lt;strong&gt;Enterprise&lt;/strong&gt; and &lt;strong&gt;Business Critical&lt;/strong&gt;. The Free plan covers low volumes (e.g., up to 500,000 MAR). The other three plans adopt MAR-based tiered pricing. See more details at the &lt;a href="https://www.fivetran.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Besides, Fivetran separately charges for data transformation based on the models users run in a month, making the costs even higher.&lt;/p&gt;

&lt;p&gt;As of March 2025, Fivetran has changed to &lt;strong&gt;connector-level pricing&lt;/strong&gt;: pricing and discounts are often applied per individual connector instead of across the entire account. This means that if you have many connectors, your total cost might increase even if your overall data volume hasn't changed. &lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;BladePipe offers two plans to choose from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $0.01 per million rows of full data and $10 per million rows of incremental data. You can easily evaluate the costs via the &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;price calculator&lt;/a&gt;. It is available at &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-3moxhopumtmdc" rel="noopener noreferrer"&gt;AWS Marketplace&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: The costs are based on the number of pipelines and the duration you need. Contact the sales team for specific costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Here's a quick comparison of costs between BladePipe (BYOC) and Fivetran (Standard).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Million Rows per Month&lt;/th&gt;
&lt;th&gt;BladePipe* (BYOC)&lt;/th&gt;
&lt;th&gt;Fivetran (Standard)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;$210&lt;/td&gt;
&lt;td&gt;$500+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 M&lt;/td&gt;
&lt;td&gt;$300&lt;/td&gt;
&lt;td&gt;$1350+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 M&lt;/td&gt;
&lt;td&gt;$1200&lt;/td&gt;
&lt;td&gt;$2900+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*: includes one AWS EC2 t2.xlarge instance ($200/month) for the BladePipe Worker.&lt;/p&gt;
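&lt;p&gt;The BYOC column above can be reproduced with a quick back-of-the-envelope calculation from the published rates ($10 per million rows of incremental data plus the $200/month Worker instance). The sketch below is illustrative only, not an official calculator:&lt;/p&gt;

```python
# Rough monthly estimate for BladePipe BYOC, assuming the published
# rate of $10 per million incremental rows plus a $200/month
# EC2 t2.xlarge for the BladePipe Worker. Illustrative only.
WORKER_EC2_USD = 200
USD_PER_MILLION_INCREMENTAL_ROWS = 10

def byoc_monthly_cost(million_rows_per_month):
    """Estimated BYOC cost in USD for a month of incremental sync."""
    return WORKER_EC2_USD + USD_PER_MILLION_INCREMENTAL_ROWS * million_rows_per_month

for volume in (1, 10, 100):
    print(volume, byoc_monthly_cost(volume))  # 210, 300, 1200 — matches the table
```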

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe is a better choice when it comes to costs, considering the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-effectiveness&lt;/strong&gt;: BladePipe is much cheaper than Fivetran when moving the same amount of data. Besides, BladePipe doesn't charge separately for data transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Predictability&lt;/strong&gt;: BladePipe's direct per-million-row pricing offers more immediate cost predictability, especially for large, consistent data volumes. Fivetran's MAR-based billing can be less predictable due to the nature of "active rows", the separate data transformation charge and the new connector-level pricing. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Choosing between Fivetran and BladePipe depends heavily on your organization's specific data integration needs and priorities. Fivetran provides extensive coverage of connectors and an automated ELT experience for analytics. BladePipe features real-time CDC, ideal for mission-critical data syncs. In terms of pricing, BladePipe is a cost-effective choice for start-ups and organizations with a tight budget.&lt;/p&gt;

&lt;p&gt;Evaluate your specific data sources, latency requirements, budget, internal team resources, and desired level of support to make the most suitable choice.&lt;/p&gt;

</description>
      <category>programming</category>
    </item>
    <item>
      <title>A Comprehensive Guide to Wide Table (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Thu, 10 Jul 2025 10:02:06 +0000</pubDate>
      <link>https://forem.com/bladepipe/a-comprehensive-guide-to-wide-table-2025-2l0j</link>
      <guid>https://forem.com/bladepipe/a-comprehensive-guide-to-wide-table-2025-2l0j</guid>
      <description>&lt;p&gt;In real-world business scenarios, even a basic report often requires joining 7 or 8 tables. This can severely impact query performance. Sometimes it takes hours for business teams to get a simple analysis done.&lt;/p&gt;

&lt;p&gt;This article dives into how wide table technology helps solve this pain point. We’ll also show you how to build wide tables with zero code, making real-time cross-table data integration easier than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge with Complex Queries
&lt;/h2&gt;

&lt;p&gt;As business systems grow more complex, so do their data models. In an e-commerce system, for instance, tables recording orders, products, and users are naturally interrelated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order table&lt;/strong&gt;: product ID (linked to &lt;strong&gt;Product table&lt;/strong&gt;), quantity, discount, total price, buyer ID (linked to &lt;strong&gt;User table&lt;/strong&gt;), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product table&lt;/strong&gt;: name, color, texture, inventory, seller (linked to &lt;strong&gt;User table&lt;/strong&gt;), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User table&lt;/strong&gt;: account info, phone numbers, emails, passwords, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relational databases are great at normalizing data and ensuring efficient storage and transaction performance. But when it comes to analytics, especially queries involving filtering, aggregation, and multi-table JOINs, the traditional schema becomes a performance bottleneck.&lt;/p&gt;

&lt;p&gt;Take a query like "Top 10 products by sales in the last month": the more JOINs involved, the more complex and slower the query. And the number of possible query plans grows rapidly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tables Joined&lt;/th&gt;
&lt;th&gt;Possible Query Plans&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;40320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3628800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
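&lt;p&gt;The counts in the table are simply the factorial of the number of tables joined: with n tables, the optimizer can in principle join them in any of n! orders. A quick check:&lt;/p&gt;

```python
import math

# With n tables, a join can be performed in any of n! orders,
# which is where the query-plan counts in the table come from.
def possible_join_orders(n_tables):
    return math.factorial(n_tables)

for n in (2, 4, 6, 8, 10):
    print(n, possible_join_orders(n))  # 2, 24, 720, 40320, 3628800
```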

&lt;p&gt;For CRM or ERP systems, joining 5+ tables is standard. Then, the real question becomes: &lt;strong&gt;How to find the best query plan efficiently?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To tackle this, two main strategies have emerged: &lt;strong&gt;Query Optimization&lt;/strong&gt; and &lt;strong&gt;Precomputation&lt;/strong&gt;, with &lt;strong&gt;wide tables&lt;/strong&gt; being a key form of the latter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Optimization vs Precomputation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;p&gt;One solution is to reduce the number of possible query plans to accelerate query speed, a technique known as pruning. Two common approaches have emerged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RBO (Rule-Based Optimizer)&lt;/strong&gt;: RBO doesn't consider the actual distribution of your data. Instead, it tweaks SQL query plans based on a set of predefined, static rules. Most databases have some common optimization rules built in, like predicate pushdown. Depending on their specific business needs and architectural design, different databases also have their own unique optimization rules. Take SAP HANA, for instance: it powers SAP ERP operations and is designed for in-memory processing with lots of joins. Because of this, its optimizer rules are noticeably different from other databases'.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CBO (Cost-Based Optimizer)&lt;/strong&gt;: CBO evaluates I/O, CPU and other resource consumption, and picks the plan with the lowest cost. This type of optimization dynamically adjusts based on the specific data distribution and the features of your SQL query. Even two identical SQL queries might end up with completely different query plans if the parameter values are different. CBO typically relies on a sophisticated and complex statistics subsystem, including crucial information like the volume of data in each table and data distribution histograms based on primary keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most modern databases combine both RBO and CBO.&lt;/p&gt;
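&lt;p&gt;You can watch an optimizer make these choices yourself. The sketch below uses SQLite's EXPLAIN QUERY PLAN purely as an illustration (SQLite's planner differs from that of server databases); it prints the plan chosen for a simple two-table join:&lt;/p&gt;

```python
import sqlite3

# Inspect the plan an optimizer picks for a two-table join.
# SQLite is used only for illustration; most databases offer a similar EXPLAIN.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE INDEX idx_orders_user ON orders(user_id);
""")
plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT u.name, o.id FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()
for step in plan:
    print(step)  # typically a SCAN of one table and an indexed SEARCH of the other
```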

&lt;h3&gt;
  
  
  Precomputation
&lt;/h3&gt;

&lt;p&gt;Precomputation assumes &lt;strong&gt;the relationships between tables are stable&lt;/strong&gt;, so instead of joining on every query, it pre-joins data ahead of time into a wide table. When data changes, only the changes are delivered to the wide table. The idea has been around since the early days of &lt;strong&gt;materialized views&lt;/strong&gt; in relational databases. &lt;/p&gt;
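&lt;p&gt;As a minimal, self-contained illustration of the idea (using SQLite with hypothetical order and product tables), the JOIN is run once to materialize a wide table, and later analytics queries read that flat table directly:&lt;/p&gt;

```python
import sqlite3

# Precompute a wide table: run the JOIN once at build time, so that
# analytics queries later read a single flat table instead of joining.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders  (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
INSERT INTO product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO orders  VALUES (10, 1, 3), (11, 2, 5), (12, 9, 1);
-- LEFT JOIN semantics: every order is kept even when its product
-- lookup fails (order 12 points at a missing product).
CREATE TABLE order_wide AS
SELECT o.id AS order_id, o.qty, p.name AS product_name
FROM orders o LEFT JOIN product p ON o.product_id = p.id;
""")
rows = db.execute(
    "SELECT order_id, qty, product_name FROM order_wide ORDER BY order_id"
).fetchall()
print(rows)  # order 12 survives with product_name = None
```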

&lt;p&gt;Compared with live queries, precomputation massively reduces runtime computation. But it's not perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited JOIN semantics&lt;/strong&gt;: Hard to handle anything beyond LEFT JOIN efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy updates&lt;/strong&gt;: A single change on the “1” side of a 1-to-N relation can cause cascading updates, challenging service reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functionality trade-offs&lt;/strong&gt;: Precomputed tables lack the full flexibility of live queries (e.g. JOINs, filters, functions).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practice: Combine Both
&lt;/h3&gt;

&lt;p&gt;In the real world, a hybrid approach works best: use &lt;strong&gt;precomputation&lt;/strong&gt; to generate intermediate wide tables, and use &lt;strong&gt;live queries&lt;/strong&gt; on top of those to apply filters and aggregations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precomputation&lt;/strong&gt;: A popular approach is stream computing, with stream processing databases emerging in recent years. Materialized views in traditional relational databases or data warehouses also offer an excellent solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Live queries&lt;/strong&gt;: Real-time analytics databases deliver significant performance boosts in data filtering and aggregation, thanks to columnar and hybrid row-column data structures, new instruction sets like AVX-512, high-performance computing hardware such as FPGAs and GPUs, and software techniques like distributed computing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BladePipe's Wide Table Evolution
&lt;/h2&gt;

&lt;p&gt;BladePipe started with a high-code approach: users had to write scripts to fetch related table data and construct wide tables manually during data sync. It worked, but the effort required made it hard to scale.&lt;/p&gt;

&lt;p&gt;Now, BladePipe supports &lt;strong&gt;visual wide table building&lt;/strong&gt;, enabling zero-code configuration. Users can select a driving table and the lookup tables directly in the UI to define JOINs. The system handles both initial data migration and real-time updates.&lt;/p&gt;

&lt;p&gt;It currently supports visual wide table creation in the following pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL -&amp;gt; MySQL/StarRocks/Doris/SelectDB&lt;/li&gt;
&lt;li&gt;PostgreSQL/SQL Server/Oracle/MySQL -&amp;gt; MySQL&lt;/li&gt;
&lt;li&gt;PostgreSQL -&amp;gt; StarRocks/Doris/SelectDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More supported pipelines are coming soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Visual Wide Table Building Works in BladePipe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Definitions
&lt;/h3&gt;

&lt;p&gt;In BladePipe, a wide table consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driving Table&lt;/strong&gt;: The main table used as the data source. Only one driving table can be selected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup Tables&lt;/strong&gt;: Additional tables joined to the driving table. Multiple lookup tables are supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By default, the join behavior follows &lt;strong&gt;Left Join&lt;/strong&gt; semantics: all records from the driving table are preserved, regardless of whether corresponding records exist in lookup tables.&lt;/p&gt;

&lt;p&gt;BladePipe currently supports two types of join structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linear&lt;/strong&gt;: e.g., A.b_id = B.id AND B.c_id = C.id. Each table can only be selected once, and circular references are not allowed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star&lt;/strong&gt;: e.g., A.b_id = B.id AND A.c_id = C.id. Each lookup table connects directly to the driving table. Cycles are not allowed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In both cases, table A is the driving table, while B, C, etc. are lookup tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Change Rule
&lt;/h3&gt;

&lt;h4&gt;
  
  
  If the target is a relational DB (e.g. MySQL):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driving table INSERT&lt;/strong&gt;: Fields from lookup tables are automatically filled in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driving table UPDATE/DELETE&lt;/strong&gt;: Lookup fields are not updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table INSERT&lt;/strong&gt;: If downstream tables exist, the operation is converted to an UPDATE to refresh Lookup fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table UPDATE&lt;/strong&gt;: If downstream tables exist, no changes are applied to related fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup table DELETE&lt;/strong&gt;: If downstream tables exist, the operation is converted to an UPDATE with all fields set to NULL.&lt;/li&gt;
&lt;/ul&gt;
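&lt;p&gt;The rules above boil down to a small dispatch table. The sketch below is a hypothetical summary of the rule mapping for a relational target, not BladePipe's actual implementation:&lt;/p&gt;

```python
# Hypothetical summary of the change-translation rules above for a
# relational target, keyed by (table role, source operation).
# This mirrors the listed rules; it is not BladePipe's code.
RULES = {
    ("driving", "INSERT"): "insert row, auto-fill lookup fields",
    ("driving", "UPDATE"): "apply update, leave lookup fields untouched",
    ("driving", "DELETE"): "apply delete, leave lookup fields untouched",
    ("lookup", "INSERT"): "convert to UPDATE refreshing lookup fields",
    ("lookup", "UPDATE"): "no change applied to related fields",
    ("lookup", "DELETE"): "convert to UPDATE setting lookup fields to NULL",
}

def wide_table_action(table_role, operation):
    """Action applied to the wide table for a given source change."""
    return RULES[(table_role, operation)]

print(wide_table_action("lookup", "DELETE"))
```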

&lt;h4&gt;
  
  
  If the target is an overwrite-style DB (e.g. StarRocks, Doris):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All operations (INSERT, UPDATE, DELETE) on the Driving table will auto-fill Lookup fields.&lt;/li&gt;
&lt;li&gt;All operations on Lookup tables are ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
  If you want to include lookup table updates when the target is an overwrite-style database, set up a two-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Source DB → relational DB wide table&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wide table → overwrite-style DB&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step-by-Step Guide
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to BladePipe. Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Tables&lt;/strong&gt; step, 

&lt;ol&gt;
&lt;li&gt;Choose the tables that will participate in the wide table.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and enter a name as the wide table name.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;In the &lt;strong&gt;Data Processing&lt;/strong&gt; step,   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the left panel, select the Driving Table and click &lt;strong&gt;Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;Wide Table&lt;/strong&gt; to define the join.

&lt;ul&gt;
&lt;li&gt;Specify Lookup Columns (multiple columns are supported).&lt;/li&gt;
&lt;li&gt;Select additional fields from the Lookup Table and define how they map to wide table columns. This helps avoid naming conflicts across different source tables.
&lt;/li&gt;
&lt;li&gt;If a Lookup Table joins to another table, &lt;strong&gt;make sure to include the relevant Lookup columns&lt;/strong&gt;. For example, in A.b_id = B.id AND B.c_id = C.id, when selecting fields from B, c_id must be included.
&lt;/li&gt;
&lt;li&gt;When multiple Driving or Lookup tables contain fields with the same name, always &lt;strong&gt;map them to different target column names to avoid collisions&lt;/strong&gt;.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-194c95d00ab307fc48cb86ccf890fd29.png" width="800" height="412"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Submit&lt;/strong&gt; to save the configuration.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e8e7901d2fdbde1faabffb8980fa5ac2.png" width="800" height="417"&gt;
&lt;/li&gt;
&lt;li&gt;Click Lookup Tables on the left panel to check whether field mappings are correct.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Continue with the DataJob creation process, and start the DataJob.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Wide tables are a powerful way to speed up analytics by precomputing complex JOINs. With BladePipe’s visual builder, even non-engineers can set up and maintain real-time wide tables across multiple data systems.&lt;/p&gt;

&lt;p&gt;Whether you're a data architect or a DBA, this tool helps streamline your analytics layer and power up your dashboards with near-instant queries.&lt;/p&gt;

</description>
      <category>widetable</category>
      <category>database</category>
      <category>mysql</category>
      <category>programming</category>
    </item>
    <item>
      <title>BladePipe vs. Airbyte : Features, Pricing and More (2025)</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 04 Jul 2025 06:26:26 +0000</pubDate>
      <link>https://forem.com/bladepipe/bladepipe-vs-airbyte-features-pricing-and-more-2025-3j13</link>
      <guid>https://forem.com/bladepipe/bladepipe-vs-airbyte-features-pricing-and-more-2025-3j13</guid>
      <description>&lt;p&gt;In today’s data-driven landscape, building reliable pipelines is a business imperative, and the right integration tool can make a difference.&lt;/p&gt;

&lt;p&gt;Two modern tools are &lt;strong&gt;BladePipe&lt;/strong&gt; and &lt;strong&gt;Airbyte&lt;/strong&gt;. BladePipe focuses on real-time end-to-end replication, while Airbyte offers a rich connector ecosystem for ELT pipelines. So, which one fits your use case?&lt;/p&gt;

&lt;p&gt;In this blog, we break down the core differences between BladePipe and Airbyte to help you make an informed choice. &lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is BladePipe?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt; is a real-time end-to-end data replication tool. Founded in 2019, it’s built for high-throughput, low-latency environments, powering real-time analytics, AI applications, or microservices that require always-fresh data.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time replication&lt;/strong&gt;, with latency under 10 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end pipeline&lt;/strong&gt; for great reliability and easy maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-stop management&lt;/strong&gt; of the whole lifecycle from schema evolution to monitoring and alerting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code RAG&lt;/strong&gt; building for simpler and smarter AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Airbyte?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://airbyte.com/" rel="noopener noreferrer"&gt;Airbyte&lt;/a&gt; is founded in 2020. It is an open-source data integration platform that focuses on ELT pipelines. It offers a large library of pre-built and marketplace connectors for moving batch data from various sources to popular data warehouses and other destinations.&lt;/p&gt;

&lt;p&gt;The key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on &lt;strong&gt;batch-based ELT&lt;/strong&gt; pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensive connector&lt;/strong&gt; ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source&lt;/strong&gt; core with paid enterprise version.&lt;/li&gt;
&lt;li&gt;Support for &lt;strong&gt;custom connectors&lt;/strong&gt; with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;BladePipe&lt;/th&gt;
&lt;th&gt;Airbyte&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sync Mode&lt;/td&gt;
&lt;td&gt;Real-time CDC-first/ETL&lt;/td&gt;
&lt;td&gt;ELT-first/(Batch) CDC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Mode&lt;/td&gt;
&lt;td&gt;Batch and Streaming&lt;/td&gt;
&lt;td&gt;Batch only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync Latency&lt;/td&gt;
&lt;td&gt;≤ 10 seconds&lt;/td&gt;
&lt;td&gt;≥ 1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Connectors&lt;/td&gt;
&lt;td&gt;40+ connectors built by BladePipe&lt;/td&gt;
&lt;td&gt;50+ maintained connectors, 500+ marketplace connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Data Fetch&lt;/td&gt;
&lt;td&gt;Pull and Push hybrid&lt;/td&gt;
&lt;td&gt;Pull-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Transformation&lt;/td&gt;
&lt;td&gt;Built-in transformations and custom code&lt;/td&gt;
&lt;td&gt;dbt and SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Strong support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification &amp;amp; Correction&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Options&lt;/td&gt;
&lt;td&gt;Cloud (BYOC)/Self-hosted&lt;/td&gt;
&lt;td&gt;Self-hosted (OSS)/Cloud (Managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR&lt;/td&gt;
&lt;td&gt;SOC 2, ISO 27001, GDPR, HIPAA Conduit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Enterprise-level support&lt;/td&gt;
&lt;td&gt;Community (free) and Enterprise-level support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Pipeline Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; moves data through &lt;strong&gt;batch-based extraction and loading&lt;/strong&gt;. It supports Debezium-based CDC, which is applicable to &lt;a href="https://docs.airbyte.com/platform/understanding-airbyte/cdc#limitations" rel="noopener noreferrer"&gt;only a few sources&lt;/a&gt;, and only for tables with primary keys. In Airbyte CDC, changes are pulled and loaded in scheduled batches (e.g., every 5 minutes or 1 hour), which puts the &lt;strong&gt;latency at minutes or even hours&lt;/strong&gt; depending on the sync frequency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; is built around &lt;strong&gt;real-time Change Data Capture (CDC)&lt;/strong&gt;. Unlike batch-based CDC, BladePipe captures changes in the source instantly and delivers them to the destination with &lt;strong&gt;sub-second latency&lt;/strong&gt;. Real-time CDC is available for almost all of its 40+ connectors. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte usually has higher latency. BladePipe CDC is more suitable for real-time architectures where freshness, latency, and data integrity are essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; clearly leads in the breadth of supported sources and destinations. Airbyte now supports &lt;strong&gt;over 550 connectors&lt;/strong&gt;, most of which are &lt;strong&gt;API-based&lt;/strong&gt;. Airbyte allows custom connector building through its Connector Builder, greatly extending its connector reach. However, &lt;strong&gt;only around 50 of them are official Airbyte connectors&lt;/strong&gt; backed by an SLA; the rest are open-source connectors maintained by the community. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt;, on the other hand, focuses on depth over breadth. It now supports &lt;strong&gt;40+ connectors&lt;/strong&gt;, which are &lt;strong&gt;all self-built and actively maintained&lt;/strong&gt;. It targets critical real-time infrastructure: OLTPs, OLAPs, message middleware, search engines, data warehouses/lakes, vector databases, etc. This makes it a better fit for real-time applications, where data freshness and change tracking matter more than diversity of sources. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte stands out for its extensive coverage of connectors, while BladePipe focuses on real-time change delivery among multiple sources. Choose the suitable tool based on your specific need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Transformation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt;, as an ELT-first platform, uses &lt;strong&gt;a post-load transformation model&lt;/strong&gt;: data is loaded into the target first, and transformation is applied afterwards. It offers two output options: a serialized JSON object or a normalized version as tables. For advanced users, custom transformations can be done via SQL and through integration with dbt. But the transformation capabilities are limited because data is transformed only after being loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; finishes &lt;strong&gt;data transformation in real time before data loading&lt;/strong&gt;. Configure the transformation method when creating a pipeline, and all is done automatically. BladePipe supports &lt;a href="https://doc.bladepipe.com/blog/data_insights/etl_tranform" rel="noopener noreferrer"&gt;built-in data transformations&lt;/a&gt; in a visualized way, including data filtering, data masking, column pruning, mapping, etc. Complex transformations can be done via custom code. With BladePipe, data gets ready when it flows through the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, Airbyte's data transformation capabilities are limited due to its ELT approach. BladePipe offers both built-in transformations and custom code to satisfy various needs, and the transformations happen in real time.&lt;/p&gt;
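&lt;p&gt;As an illustration of the transform-before-load idea, a minimal sketch (hypothetical names, not BladePipe's actual API) might prune and mask each change event while it is in flight:&lt;/p&gt;

```python
import hashlib

def mask_email(value: str) -> str:
    """Replace an email's local part with a short hash (data masking)."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{digest}@{domain}"

def transform(row: dict, keep: set, masked: set) -> dict:
    """Prune columns and mask sensitive fields before the row is loaded."""
    out = {k: v for k, v in row.items() if k in keep}   # column pruning
    for col in masked.intersection(out):
        out[col] = mask_email(out[col])                 # data masking
    return out

event = {"id": 7, "email": "alice@example.com", "ssn": "123-45-6789"}
ready = transform(event, keep={"id", "email"}, masked={"email"})
# 'ssn' is dropped and 'email' is masked before the row reaches the target
```

&lt;p&gt;Loading only the transformed row means sensitive fields never reach the target, which is the practical difference from transforming after load.&lt;/p&gt;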

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Airbyte&lt;/strong&gt; provides &lt;strong&gt;free and paid technical support&lt;/strong&gt;. Open-source users can seek help in the community or solve issues themselves; it's free of charge but can be time-consuming for urgent production issues. Cloud customers can get help by chatting with Airbyte team members and contributors. Enterprise-level support is a separate paid tier with custom SLAs and access to training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BladePipe&lt;/strong&gt; offers a more &lt;strong&gt;white-glove support experience&lt;/strong&gt;. For both Cloud and Enterprise customers, BladePipe provides corresponding SLAs, and its technical team is closely involved in onboarding and tuning pipelines. In addition, for all customers, alert notifications can be sent via email and webhook to ensure pipeline reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, both Airbyte and BladePipe provide documentation and technical support for better understanding and use. Just think about your needs and make the right choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Model Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is one of the key factors to consider when evaluating tools, especially for startups and organizations with large amounts of data to replicate. BladePipe and Airbyte differ greatly in their pricing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  BladePipe
&lt;/h3&gt;

&lt;p&gt;BladePipe offers two plans to choose from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $0.01 per million rows of full data or $10 per million rows of incremental data. You can easily evaluate the costs via the &lt;a href="https://www.bladepipe.com/pricing" rel="noopener noreferrer"&gt;price calculator&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: The costs are based on the number of pipelines and duration you need. Talk to the sales team on specific costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Airbyte
&lt;/h3&gt;

&lt;p&gt;Airbyte has four plans to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open Source&lt;/strong&gt;: Free to use for self-hosted deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt;: $2.50 per credit, starting at $10/month (4 credits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt;: Custom pricing for cloud deployment. Talk to the sales team on specific costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise&lt;/strong&gt;: Custom pricing for self-hosted deployment. Talk to the sales team on specific costs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Here's a quick comparison of costs between BladePipe BYOC and Airbyte Cloud.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Million Rows per Month&lt;/th&gt;
&lt;th&gt;BladePipe* (BYOC)&lt;/th&gt;
&lt;th&gt;Airbyte (Cloud)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;$210&lt;/td&gt;
&lt;td&gt;$450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 M&lt;/td&gt;
&lt;td&gt;$300&lt;/td&gt;
&lt;td&gt;$1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 M&lt;/td&gt;
&lt;td&gt;$1200&lt;/td&gt;
&lt;td&gt;$3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 M&lt;/td&gt;
&lt;td&gt;$10200&lt;/td&gt;
&lt;td&gt;$14000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*: Includes one AWS EC2 t2.xlarge instance for the worker, at $200/month.&lt;/p&gt;
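&lt;p&gt;The BladePipe column follows directly from the published rates: $10 per million rows of incremental data plus the ~$200/month EC2 worker. A quick sketch of that arithmetic (assuming incremental rows only):&lt;/p&gt;

```python
def byoc_monthly_cost(incremental_million_rows, rate=10.0, worker=200.0):
    """Per-row replication fee plus one EC2 worker instance per month."""
    return incremental_million_rows * rate + worker

for m in (1, 10, 100, 1000):
    print(m, byoc_monthly_cost(m))  # 210.0, 300.0, 1200.0, 10200.0
```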

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, BladePipe is much cheaper than Airbyte, and the cost gap widens as more data is moved per month. If you have a tight budget or need to integrate billions of rows of data, BladePipe is a cost-effective option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The right tool is critical for any business, and the choice should depend on your use case. This article lists a number of considerations and key differences. To summarize, Airbyte excels at extensive connectors and an open ecosystem, while BladePipe is designed for real-time end-to-end data use cases. &lt;/p&gt;

&lt;p&gt;If your organization is building applications that rely on always-fresh data, such as AI assistants, real-time search, or event streaming, BladePipe is likely a better fit.&lt;/p&gt;

&lt;p&gt;If your organization needs to integrate data from various APIs or would like to have in-house staff maintain connectors, you may try Airbyte.&lt;/p&gt;

</description>
      <category>airbyte</category>
      <category>bladepipe</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Prevent Replication Loops in MySQL Bidirectional Sync?</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 27 Jun 2025 07:24:54 +0000</pubDate>
      <link>https://forem.com/bladepipe/how-to-prevent-replication-loops-in-mysql-bidirectional-sync-2kgp</link>
      <guid>https://forem.com/bladepipe/how-to-prevent-replication-loops-in-mysql-bidirectional-sync-2kgp</guid>
      <description>&lt;p&gt;Real-time MySQL-to-MySQL two-way data sync is essential for high availability, seamless disaster recovery and active-active data architectures. It helps keep data consistent and up-to-date across various systems, regardless of where changes occur. &lt;/p&gt;

&lt;p&gt;However, it's not easy to keep data updated and consistent in a two-way MySQL pipeline. Replication loops are one of the biggest challenges. In this post, we'll explain how to perform MySQL bidirectional data sync while preventing infinite replication loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Replication Loop?
&lt;/h2&gt;

&lt;p&gt;A replication loop is a critical issue in MySQL two-way sync setups. It occurs when the same change keeps getting replicated back and forth between the two databases endlessly: Database A sends an update to Database B; Database B treats it as a new change and sends it back to A, and so on, over and over again.&lt;/p&gt;

&lt;p&gt;This cycle can lead to several serious issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Duplication&lt;/strong&gt;: The same update may be applied multiple times, potentially causing duplicate rows, incorrect data, or integrity violations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased Latency and Load&lt;/strong&gt;: Continuous replication of the same changes consumes CPU, I/O, and network resources, degrading system performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficult Troubleshooting&lt;/strong&gt;: Even minor update conflicts can escalate when each system repeatedly re-applies changes, making conflict resolution complex. Identifying the source of the loop and the specific transactions causing it can be extremely challenging.&lt;/li&gt;
&lt;/ul&gt;
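&lt;p&gt;The runaway behavior is easy to picture with a toy simulation (purely illustrative): without a marker on replicated changes, a single UPDATE keeps bouncing between the two databases; with a marker, it stops after the first delivery.&lt;/p&gt;

```python
def replication_hops(max_hops: int, tag_changes: bool) -> int:
    """Count how many times one UPDATE crosses the wire between A and B."""
    replicated = False  # has this change already been applied by the replicator?
    hops = 0
    for _ in range(max_hops):
        if tag_changes and replicated:
            break       # tagged change came from the peer: do not send it back
        replicated = True
        hops += 1       # deliver to the peer, which replicates it again
    return hops

print(replication_hops(1000, tag_changes=False))  # 1000: bounces until cut off
print(replication_hops(1000, tag_changes=True))   # 1: stops after one delivery
```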

&lt;h2&gt;
  
  
  How to Prevent Infinite Loops?
&lt;/h2&gt;

&lt;p&gt;A typical way to prevent replication loops in MySQL two-way sync is GTID (Global Transaction Identifier), which combines &lt;code&gt;server_uuid&lt;/code&gt; and a transaction ID into a conflict marker. However, this solution has its limitations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, a professional data replication tool, introduces a more streamlined approach by &lt;strong&gt;tagging binlog events&lt;/strong&gt; directly.&lt;/p&gt;

&lt;p&gt;In a typical DML binlog sequence—&lt;code&gt;QueryEvent (TxBegin)&lt;/code&gt;, &lt;code&gt;TableMapEvent&lt;/code&gt;, &lt;code&gt;WriteRowEvent (IUD)&lt;/code&gt;, and &lt;code&gt;QueryEvent (TxEnd)&lt;/code&gt;—tagging the &lt;code&gt;WriteRowEvent&lt;/code&gt; would be ideal for conflict handling. But doing so generally requires modifying the MySQL storage engine code, which is complex and invasive.&lt;/p&gt;

&lt;p&gt;Upon deep investigation, BladePipe discovered that MySQL's binlog includes a special event called &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, which logs the original SQL statement when the &lt;code&gt;binlog_rows_query_log_events&lt;/code&gt; parameter is enabled. This event can carry comments, which opens up a clean tagging mechanism.&lt;/p&gt;

&lt;p&gt;Leveraging this, BladePipe automatically adds a custom marker &lt;code&gt;/*ccw*/&lt;/code&gt; when writing data to the target MySQL database. This tag appears in the &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, making it easy to identify and filter out in a bidirectional sync. &lt;/p&gt;

&lt;p&gt;This mechanism has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No dependency on GTID&lt;/li&gt;
&lt;li&gt;Order-independent and parallelizable replication&lt;/li&gt;
&lt;li&gt;Reduced operations on the target database&lt;/li&gt;
&lt;li&gt;Broad compatibility with cloud-based MySQL services&lt;/li&gt;
&lt;li&gt;Support for database/table/column-level filtering, mapping, and custom data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this enhancement, the new binlog event sequence becomes:&lt;br&gt;
&lt;code&gt;QueryEvent (TxBegin)&lt;/code&gt;, &lt;code&gt;TableMapEvent&lt;/code&gt;, &lt;code&gt;RowsQueryLogEvent&lt;/code&gt;, &lt;code&gt;WriteRowEvent&lt;/code&gt;, and &lt;code&gt;QueryEvent (TxEnd)&lt;/code&gt;.&lt;/p&gt;
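&lt;p&gt;In effect, the loop guard is a filter on the replication stream: any transaction whose &lt;code&gt;RowsQueryLogEvent&lt;/code&gt; carries the &lt;code&gt;/*ccw*/&lt;/code&gt; marker was written by the replicator itself and is skipped. A simplified sketch in Python (the event structures here are illustrative, not BladePipe's internals):&lt;/p&gt;

```python
MARKER = "/*ccw*/"

def should_replicate(tx_events: list) -> bool:
    """Skip transactions whose RowsQueryLogEvent carries the replicator's tag."""
    for ev in tx_events:
        if ev.get("type") == "RowsQueryLogEvent" and MARKER in ev.get("query", ""):
            return False  # change originated from the peer pipeline: drop it
    return True           # genuine user change: replicate it

# A transaction written by an application user, and the same change as
# re-logged on the target after the replicator applied it with the tag.
user_tx = [
    {"type": "QueryEvent", "query": "BEGIN"},
    {"type": "RowsQueryLogEvent", "query": "UPDATE t SET v = 2"},
    {"type": "WriteRowsEvent"},
]
applied_tx = [
    {"type": "QueryEvent", "query": "BEGIN"},
    {"type": "RowsQueryLogEvent", "query": "/*ccw*/ UPDATE t SET v = 2"},
    {"type": "WriteRowsEvent"},
]
```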

&lt;h2&gt;
  
  
  How to Perform MySQL Two-Way Sync Using BladePipe?
&lt;/h2&gt;

&lt;p&gt;Next, we'll give a step-by-step guide on how to perform a MySQL two-way data sync. In the demonstration, we use RDS for MySQL instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in Install Worker (Docker) or Install Worker (Binary) in the BladePipe documentation to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add DataSource
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the RDS console. Go to the instance details page and click &lt;strong&gt;Parameters&lt;/strong&gt;, then enable &lt;strong&gt;binlog_rows_query_log_events&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;. It is suggested to modify the description of the DataSource to prevent mistaking the databases when you configure two-way DataJobs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-0451ebcab8311f3116a589a8e665d77b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-0451ebcab8311f3116a589a8e665d77b.png" width="800" height="473"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Create Forward DataJob
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Note: In bidirectional sync, the forward DataJob generally refers to the DataJob where the source database has data and the target database has no data, which involves the initialization of data at the target database.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources, and click &lt;strong&gt;Test Connection&lt;/strong&gt; to ensure the connections to both the source and target DataSources are successful.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-91b0bfdd683cf98f292b3a92dc60b4f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-91b0bfdd683cf98f292b3a92dc60b4f8.png" width="800" height="472"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;Properties&lt;/strong&gt; Page:

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type, together with the &lt;strong&gt;Full Data&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Synchronize DDL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn off &lt;strong&gt;Start Automatically&lt;/strong&gt; so that parameters can be set after the DataJob is created.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-91e370b5a809ac6b16b24089b3347118.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-91e370b5a809ac6b16b24089b3347118.png" width="800" height="474"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the tables and columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Details&lt;/strong&gt; &amp;gt; &lt;strong&gt;Functions&lt;/strong&gt; &amp;gt; &lt;strong&gt;Modify DataJob Params&lt;/strong&gt;.

&lt;ol&gt;
&lt;li&gt;Choose the &lt;strong&gt;Target&lt;/strong&gt; tab, and set &lt;strong&gt;deCycle&lt;/strong&gt; to true.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; and start the DataJob.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-5e485d8eae6d1bf75c1baab89279d9c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-5e485d8eae6d1bf75c1baab89279d9c6.png" width="800" height="159"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create Reverse DataJob
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources (&lt;strong&gt;the reverse of the forward DataJob's selection&lt;/strong&gt;), and click &lt;strong&gt;Test Connection&lt;/strong&gt; to ensure the connections to both the source and target DataSources are successful.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-bdf85a05662b93681b33b4c5bd1dfe23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-bdf85a05662b93681b33b4c5bd1dfe23.png" width="800" height="473"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;Properties&lt;/strong&gt; Page:

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt;, and DO NOT check &lt;strong&gt;Full Data&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Synchronize DDL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn off &lt;strong&gt;Start Automatically&lt;/strong&gt; so that parameters can be set after the DataJob is created.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-7f158b949ac19ce84d76fa89134bcec4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-7f158b949ac19ce84d76fa89134bcec4.png" width="800" height="472"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the tables and columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Details&lt;/strong&gt; &amp;gt; &lt;strong&gt;Functions&lt;/strong&gt; &amp;gt; &lt;strong&gt;Modify DataJob Params&lt;/strong&gt;.

&lt;ol&gt;
&lt;li&gt;Choose the &lt;strong&gt;Target&lt;/strong&gt; tab, and set &lt;strong&gt;deCycle&lt;/strong&gt; to true.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; and start the DataJob.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-45355ef3c2a9db46c197749cc742b686.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-45355ef3c2a9db46c197749cc742b686.png" width="800" height="164"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm that the forward and reverse DataJobs are running well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-da2247a153298726a7db12999dc50fc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-da2247a153298726a7db12999dc50fc1.png" width="800" height="184"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Check the Result
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run some DML statements in the source database. Changes appear in the forward DataJob's monitoring charts, but not in the reverse DataJob's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F9-400023a32155fc662448d66c43a24be3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F9-400023a32155fc662448d66c43a24be3.png" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-a220cf3f34a5525695dd21204ab71acc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-a220cf3f34a5525695dd21204ab71acc.png" width="800" height="435"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run some DML statements in the target database. Changes appear in the reverse DataJob's monitoring charts, but not in the forward DataJob's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-5f769104f2a3cc79e93a056588704de8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-5f769104f2a3cc79e93a056588704de8.png" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F12-8cd1926ea97943841e71067d6ff35581.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F12-8cd1926ea97943841e71067d6ff35581.png" width="800" height="449"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are the drawbacks of this solution?
&lt;/h3&gt;

&lt;p&gt;First, it requires enabling the MySQL global variable &lt;code&gt;binlog_rows_query_log_events&lt;/code&gt;, which is disabled by default. Compared to GTID, which is typically enabled, this is a relative disadvantage.&lt;/p&gt;

&lt;p&gt;Second, enabling this feature can cause the binlog to grow faster, potentially leading to increased disk usage and shorter binlog retention cycles.&lt;/p&gt;

&lt;p&gt;Third, for BladePipe, this approach increases memory usage because the SQL statement text must be stored, resulting in higher resource consumption.&lt;/p&gt;

&lt;p&gt;That said, considering the significant improvements in performance and stability, BladePipe believes the benefits outweigh the drawbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What other pipelines does this solution support?
&lt;/h3&gt;

&lt;p&gt;At present, BladePipe has not conducted in-depth research on whether other data sources support tagging within DML statements or row data. However, tagging-based mechanisms remain a promising direction worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we looked at how to prevent infinite replication loops in MySQL bidirectional sync, a key step toward an architecture with high availability, elasticity, and disaster recovery.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>database</category>
      <category>tutorial</category>
      <category>data</category>
    </item>
    <item>
      <title>Redis Sync at Scale: A Smarter Way to Handle Big Keys</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Tue, 24 Jun 2025 08:01:15 +0000</pubDate>
      <link>https://forem.com/bladepipe/redis-sync-at-scale-a-smarter-way-to-handle-big-keys-5e53</link>
      <guid>https://forem.com/bladepipe/redis-sync-at-scale-a-smarter-way-to-handle-big-keys-5e53</guid>
      <description>&lt;p&gt;In enterprise-grade data replication workflows, Redis is widely adopted thanks to its blazing speed and flexible data structures. But as data grows, so do the keys in Redis—literally. Over time, it’s common to see Redis keys ballooning with hundreds of thousands of elements in structures like Lists, Sets, or Hashes.&lt;/p&gt;

&lt;p&gt;These “big keys” are usually one of the roots of poor performance in a full data migration or sync, slowing down processes or even bringing them to a crashing halt.&lt;/p&gt;

&lt;p&gt;That’s why &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, a professional data replication platform, recently rolled out a fresh round of enhancements to its Redis support. This includes expanded command coverage, a data verification feature, and, more importantly, &lt;strong&gt;major improvements for big key sync&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s dig into how these improvements work and how they keep Redis migrations smooth and reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Big Key Sync
&lt;/h2&gt;

&lt;p&gt;In high-throughput, real-time applications, it’s common for a single Redis key to contain a massive number of elements. When it comes to syncing that data, a few serious issues can pop up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-Memory (OOM) Crashes:&lt;/strong&gt; Reading big keys all at once can cause the sync process to blow up memory usage, sometimes leading to OOM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol Size Limits:&lt;/strong&gt; Redis commands and payloads have strict limits (e.g., 512MB for a single command via the RESP protocol). Exceed those limits, and Redis will reject the operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target-Side Write Failures:&lt;/strong&gt; Even if the source syncs properly, the target Redis might fail to process oversized writes, leading to data sync interruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How BladePipe Tackles Big Key Syncs
&lt;/h2&gt;

&lt;p&gt;To address these issues, BladePipe introduces lazy loading and sharded sync mechanisms specifically tailored for big keys without sacrificing data integrity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lazy Loading
&lt;/h3&gt;

&lt;p&gt;Traditional data sync tools often attempt to load an entire key into memory in one go. BladePipe flips the script by using on-demand loading. Instead of stuffing the entire key into memory, BladePipe streams it shard-by-shard during the sync process.&lt;/p&gt;

&lt;p&gt;This dramatically reduces memory usage and minimizes the risk of OOM crashes.&lt;/p&gt;
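
&lt;p&gt;The idea can be sketched in a few lines (an illustrative stand-in, not BladePipe’s implementation): pull the key through a cursor-style iterator in fixed-size batches, so only one batch is resident in memory at a time. In practice the iterator would be a SCAN-family cursor (e.g., SSCAN or HSCAN); here a plain generator stands in for it:&lt;/p&gt;

```python
from itertools import islice

def batched(iterable, size):
    """Pull fixed-size batches from any iterator, keeping at most
    one batch in memory at a time (stand-in for a SCAN cursor)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Simulate lazily walking a 10,000-element key in 1024-element batches.
big_key = (f"member:{i}" for i in range(10_000))  # generator: nothing loaded yet
sizes = [len(b) for b in batched(big_key, 1024)]  # nine full batches plus a remainder
```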

&lt;h3&gt;
  
  
  Sharded Sync
&lt;/h3&gt;

&lt;p&gt;The heart of BladePipe’s big key optimization lies in breaking big keys into smaller shards. Each shard contains a configurable number of elements and is sent to the target Redis as a separate command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configurable parameter: &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Default value: 1024 elements per shard&lt;/li&gt;
&lt;li&gt;Supported types: List, Set, ZSet, Hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: If a Set contains 500,000 elements, BladePipe will divide it into 489 shards, each with up to 1024 elements, and send them as separate SADD commands.&lt;/p&gt;
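
&lt;p&gt;The shard math above can be sketched as follows (hypothetical helpers, not BladePipe code; the default batch size mirrors &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;):&lt;/p&gt;

```python
def plan_shards(total_elements, batch_size=1024):
    # Ceiling division: how many shards a key of this size needs.
    return (total_elements + batch_size - 1) // batch_size

def shard(elements, batch_size=1024):
    # Each yielded chunk would become one SADD (Set), RPUSH (List), etc.
    for i in range(0, len(elements), batch_size):
        yield elements[i:i + batch_size]

print(plan_shards(500_000))  # 489 shards of up to 1024 elements
```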

&lt;h3&gt;
  
  
  Shard-by-Shard Sync Process
&lt;/h3&gt;

&lt;p&gt;Here’s a breakdown of how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shard Planning:&lt;/strong&gt; BladePipe inspects the total number of elements in a big key and calculates how many shards are needed based on the parameter &lt;code&gt;parseFullEventBatchSize&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shard Construction &amp;amp; Dispatch:&lt;/strong&gt; Each shard is formatted into a Redis-compatible command and sent to the target sequentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order &amp;amp; Integrity Guarantees:&lt;/strong&gt; Shards are written in the correct order, preserving data consistency on the target Redis.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Results
&lt;/h2&gt;

&lt;p&gt;To benchmark the improvements, BladePipe ran sync tests with a mixed dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million regular keys (String, List, Hash, Set, ZSet)&lt;/li&gt;
&lt;li&gt;50,000 large keys (~30MB each; max ~35MB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what performance looked like:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbig_key-ed8661861f03b1aa6071b3633394695d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fbig_key-ed8661861f03b1aa6071b3633394695d.png" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results show that even with big keys in the mix, BladePipe achieved a steady sync throughput of 4–5K RPS from Redis to Redis, which is enough to handle daily production workloads for most businesses without compromising accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Big keys don’t have to be big problems. With lazy loading and sharded sync, BladePipe provides a reliable and memory-safe way to handle full Redis migrations—even for your biggest keys.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>bigkey</category>
      <category>programming</category>
    </item>
    <item>
      <title>Real-Time Data Sync: 4 Questions We Get All the Time</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 20 Jun 2025 07:37:33 +0000</pubDate>
      <link>https://forem.com/bladepipe/real-time-data-sync-4-questions-we-get-all-the-time-16jf</link>
      <guid>https://forem.com/bladepipe/real-time-data-sync-4-questions-we-get-all-the-time-16jf</guid>
      <description>&lt;p&gt;We work closely with teams building real-time systems, migrating databases, or bridging heterogeneous data platforms. Along the way, we hear a lot of recurring questions. So we figured—why not write them down?&lt;/p&gt;

&lt;p&gt;This is Part 1 of a practical Q&amp;amp;A series on real-time data sync. In this post, I'd like to share thoughts on the following questions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How should I choose between official and third-party tools?&lt;/li&gt;
&lt;li&gt;Can my project rely on “real-time” sync latency?&lt;/li&gt;
&lt;li&gt;What does real-time data sync mean to my project?&lt;/li&gt;
&lt;li&gt;How do I keep pipeline stability and data integrity over time?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How should I choose between official and third-party tools?
&lt;/h2&gt;

&lt;p&gt;Mature database vendors typically provide their own tools for data migration or cold/hot backup, like Oracle GoldenGate or MySQL's built-in dump utilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official tools&lt;/strong&gt; often deliver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The best possible performance for the migration and sync of that database.&lt;/li&gt;
&lt;li&gt;Compatibility with obscure engine-specific features.&lt;/li&gt;
&lt;li&gt;Support for special cases that third-party tools often cannot (e.g., Oracle GoldenGate parsing Redo logs).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they also tend to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offer limited or no support for other databases.&lt;/li&gt;
&lt;li&gt;Be less flexible for niche or custom workflows.&lt;/li&gt;
&lt;li&gt;Lock you in, making it harder to move data out than in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Third-party tools&lt;/strong&gt; shine when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're syncing across platforms (e.g., MySQL to Kafka/Iceberg/Elasticsearch).&lt;/li&gt;
&lt;li&gt;You need advanced features like filtering and transformation.&lt;/li&gt;
&lt;li&gt;The official tool simply doesn't support your use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it’s homogeneous migration or backup, &lt;strong&gt;use the official tool&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If it’s heterogeneous sync or anything custom, &lt;strong&gt;go with a third-party tool&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Can my project rely on “real-time” sync latency?
&lt;/h2&gt;

&lt;p&gt;In short: any data sync process that doesn't guarantee distributed transaction consistency comes with some latency risk. Even distributed transactions come at a cost—usually via redundant replication and sacrificing write performance or availability.&lt;/p&gt;

&lt;p&gt;Latency typically falls into two categories: &lt;strong&gt;fault-induced latency&lt;/strong&gt; and &lt;strong&gt;business-induced latency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fault-induced Latency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues with the sync tool itself, such as memory limits or bugs.&lt;/li&gt;
&lt;li&gt;Source/target database failures—data can't be pulled or written properly.&lt;/li&gt;
&lt;li&gt;Constraint conflicts on the target side, leading to write errors.&lt;/li&gt;
&lt;li&gt;Incomplete schema on the target side causing insert failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business-induced Latency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bulk data imports or data corrections on the source side.&lt;/li&gt;
&lt;li&gt;Traffic spikes during business peaks exceeding the tool’s processing capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can reduce the chances of delays (via &lt;strong&gt;task tuning&lt;/strong&gt;, &lt;strong&gt;schema change rule setting&lt;/strong&gt;, and &lt;strong&gt;database resource planning&lt;/strong&gt;), but you’ll never fully eliminate them. So the real question becomes: &lt;/p&gt;

&lt;p&gt;Do you have a fallback plan (e.g. graceful degradation) when latency hits? &lt;/p&gt;

&lt;p&gt;That would significantly mitigate the risks brought by high latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does real-time data sync mean to my project?
&lt;/h2&gt;

&lt;p&gt;Two words: &lt;strong&gt;incremental&lt;/strong&gt; + &lt;strong&gt;real-time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Unlike traditional batch-based ETL, a good real-time sync tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures only what changes, saving massive bandwidth. &lt;/li&gt;
&lt;li&gt;Delivers changes within seconds, enabling use cases like fraud detection or live analytics.&lt;/li&gt;
&lt;li&gt;Preserves deletes and DDLs, whereas traditional ETL often relies on external metadata services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like this:&lt;br&gt;
You don’t want to re-copy 1 billion rows every night when only 100 changed. Real-time sync gives you the speed and precision needed to power fast, reliable data products.&lt;/p&gt;
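
&lt;p&gt;The saving is easy to put numbers on (purely illustrative figures, assuming a 200-byte average row):&lt;/p&gt;

```python
row_bytes = 200                        # assumed average row size
full_copy = 1_000_000_000 * row_bytes  # nightly re-copy of all 1 billion rows
incremental = 100 * row_bytes          # shipping only the 100 changed rows

ratio = full_copy // incremental       # how much less data incremental sync moves
```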

&lt;p&gt;And with modern architectures—where one DB handles transactions, another serves queries, and a third powers ML—real-time sync is the glue holding it all together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I keep pipeline stability and data integrity over time?
&lt;/h2&gt;

&lt;p&gt;Most stability issues come from three factors: &lt;strong&gt;schema changes&lt;/strong&gt;, &lt;strong&gt;traffic pattern shifts&lt;/strong&gt;, and &lt;strong&gt;network environment issues&lt;/strong&gt;. Mitigating or planning for these risks greatly improves stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incompatibilities between schema change methods (e.g., native DDL, online tools like pt-osc or gh-ost) and the sync tool’s capabilities.&lt;/li&gt;
&lt;li&gt;Uncoordinated changes to target schemas may cause errors or schema misalignment.&lt;/li&gt;
&lt;li&gt;Changes on the target side (e.g., schema changes or writes) may conflict with sync logic, causing inconsistency between the source and target schemas, or constraint conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traffic Shifts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business surges causing unexpected peak loads that outstrip the sync tool’s capacity, leading to memory exhaustion or lag.&lt;/li&gt;
&lt;li&gt;Ops activities like mass data corrections causing large data volumes and sync bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Network Environment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing database whitelisting for sync nodes, causing sync tasks to fail with connection issues.&lt;/li&gt;
&lt;li&gt;High latency in cross-region setups causing read/write problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can reduce these risks significantly via &lt;strong&gt;change control setting&lt;/strong&gt;, &lt;strong&gt;load testing during peak traffic&lt;/strong&gt;, and &lt;strong&gt;pre-launch resource validation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Data loss issues typically result from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mismatched parallelism strategy&lt;/strong&gt; causing write disorder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflicting writes&lt;/strong&gt; on the target side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excessive latency&lt;/strong&gt; not handled in time, causing source-side logs to be purged before sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to fight back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism strategy mismatch&lt;/strong&gt; often occurs due to cascading updates or primary key reuse. You may need to fall back to table-level sync granularity, then verify and correct the data to ensure consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target-side writes&lt;/strong&gt; should be prevented via access control and database usage standardization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excessive latency&lt;/strong&gt; must be caught via robust alerting. Also, extend log retention (ideally 24+ hours) on the source database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these measures in place, you can significantly enhance sync stability and data reliability—laying a solid foundation for data-driven business operations.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Intercontinental Data Sync - A Comparative Study for Performance Tuning</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Wed, 18 Jun 2025 06:23:18 +0000</pubDate>
      <link>https://forem.com/bladepipe/intercontinental-data-sync-a-comparative-study-for-performance-tuning-3egk</link>
      <guid>https://forem.com/bladepipe/intercontinental-data-sync-a-comparative-study-for-performance-tuning-3egk</guid>
      <description>&lt;p&gt;When it comes to moving data across vast distances, particularly between continents, businesses often face a range of challenges that can impact performance. At &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, we regularly help enterprises tackle these hurdles. The most common question we receive is: &lt;strong&gt;What’s the best way to deploy BladePipe for optimal performance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While we can offer general advice based on our experience, the reality is that these tasks come with many variables. This article explores the best practice for intercontinental data migration and sync, blending theory with hands-on insights from real-world experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Intercontinental Data Sync
&lt;/h2&gt;

&lt;p&gt;Intercontinental data migration is no easy feat. There are two primary challenges that stand in the way of fast and reliable data transfers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unavoidable network latency:&lt;/strong&gt; For instance, network latency between Singapore and the U.S. typically ranges from 150ms to 300ms, which is significantly higher compared to the sub-5ms latency of typical relational database INSERT/UPDATE operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex factors affecting network quality:&lt;/strong&gt; Factors such as packet loss and routing paths can degrade the performance of intercontinental data transfers. Unlike intranet communication, intercontinental transfers pass through multiple layers of switches and routers in data centers and backbone networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
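
&lt;p&gt;A back-of-the-envelope calculation shows why the round trip, not the database, dominates (assumed figures; this ignores bandwidth limits and server-side batch execution cost):&lt;/p&gt;

```python
rtt = 0.200           # assumed 200 ms Singapore-to-U.S. round trip
local_insert = 0.005  # sub-5 ms for the INSERT itself

# One row per round trip: the network is the bottleneck.
serial_rps = 1 / (rtt + local_insert)       # roughly 4.9 rows/sec

# 1,000 rows folded into one statement: the round trip is amortized.
batch = 1_000
batched_rps = batch / (rtt + local_insert)  # roughly 4,900 rows/sec
```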

&lt;p&gt;Beyond these, it’s critical to consider the load on both the source and target databases, network bandwidth, and the volume of data being transferred.&lt;/p&gt;

&lt;p&gt;When using BladePipe, understanding its data extraction and writing mechanisms is essential to determine the best deployment strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  BladePipe Migration &amp;amp; Sync Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Migration Techniques
&lt;/h3&gt;

&lt;p&gt;For relational databases, BladePipe uses &lt;strong&gt;JDBC-based data scanning&lt;/strong&gt;, with support for &lt;strong&gt;resumable migration&lt;/strong&gt; using techniques like pagination. Additionally, it supports &lt;strong&gt;parallel data migration&lt;/strong&gt;—both inter-table and intra-table parallelism (via multiple tasks with specific filters).&lt;/p&gt;

&lt;p&gt;On the target side, since all data is inserted via INSERT operations, BladePipe uses several batch writing techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batching&lt;/li&gt;
&lt;li&gt;Splitting and parallel writing&lt;/li&gt;
&lt;li&gt;Bulk inserts&lt;/li&gt;
&lt;li&gt;INSERT rewriting (e.g., converting multiple rows into &lt;code&gt;insert..values(),(),()&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
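
&lt;p&gt;The INSERT rewriting trick can be sketched like this (a hypothetical helper; real code should bind the parameters through the driver rather than interpolating values):&lt;/p&gt;

```python
def rewrite_insert(table, columns, rows):
    """Fold many single-row INSERTs into one multi-row statement,
    i.e. INSERT INTO t (a, b) VALUES (%s, %s), (%s, %s), ..."""
    cols = ", ".join(columns)
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([one_row] * len(rows))
    params = [v for row in rows for v in row]
    return f"INSERT INTO {table} ({cols}) VALUES {values}", params

sql, params = rewrite_insert("orders", ["id", "amount"], [(1, 9.5), (2, 3.0)])
# One statement, one round trip, two rows.
```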

&lt;h3&gt;
  
  
  Data Sync Techniques
&lt;/h3&gt;

&lt;p&gt;BladePipe supports different methods for capturing incremental changes depending on the source database. Here’s a quick look:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Database&lt;/th&gt;
&lt;th&gt;Incremental Capture Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;Binlog parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Logical WAL subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle&lt;/td&gt;
&lt;td&gt;LogMiner parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server&lt;/td&gt;
&lt;td&gt;SQL Server CDC table scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;Oplog scan / ChangeStream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;PSYNC command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SAP Hana&lt;/td&gt;
&lt;td&gt;Trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Message subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StarRocks&lt;/td&gt;
&lt;td&gt;Periodic incremental scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These methods largely rely on the source database to emit incremental changes, which can vary based on network conditions.&lt;/p&gt;

&lt;p&gt;On the target side, data sync differs from migration: &lt;strong&gt;more operations&lt;/strong&gt; (INSERT/UPDATE/DELETE) must be handled, and &lt;strong&gt;order consistency&lt;/strong&gt; must be preserved. BladePipe offers a variety of techniques to improve data sync performance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batching&lt;/td&gt;
&lt;td&gt;Reduce network overhead and help with merge performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning by unique key&lt;/td&gt;
&lt;td&gt;Ensure data order consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning by table&lt;/td&gt;
&lt;td&gt;Looser method when unique key changes occur&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-statement execution&lt;/td&gt;
&lt;td&gt;Reduce network latency by concatenating SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bulk load&lt;/td&gt;
&lt;td&gt;For data sources with full-image and upsert capabilities, INSERT/UPDATE operations are converted into INSERT for batch overwriting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed tasks&lt;/td&gt;
&lt;td&gt;Allow parallel writes of the same amount of data using multiple tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
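
&lt;p&gt;As a minimal sketch of “partitioning by unique key” (not BladePipe’s code): hash each change event’s key to pick a worker, so all changes to the same row land on the same worker in commit order, while different rows proceed in parallel:&lt;/p&gt;

```python
import zlib

def partition(events, n_workers):
    """events: (unique_key, operation) pairs in commit order.
    Same key maps to the same bucket, so per-row order is preserved
    even though buckets are applied in parallel."""
    buckets = [[] for _ in range(n_workers)]
    for key, op in events:
        buckets[zlib.crc32(str(key).encode()) % n_workers].append((key, op))
    return buckets

events = [(1, "INSERT"), (2, "INSERT"), (1, "UPDATE"), (1, "DELETE")]
buckets = partition(events, 4)  # all three events for key 1 share one bucket
```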

&lt;h2&gt;
  
  
  Exploring the Best Practice
&lt;/h2&gt;

&lt;p&gt;BladePipe’s design emphasizes performance optimizations on the target side, which are &lt;strong&gt;more controllable&lt;/strong&gt;. Typically, we recommend deploying BladePipe near the source database to mitigate the impact of network quality on data extraction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskaietfrcjtdw24mvx5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskaietfrcjtdw24mvx5o.png" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But does this theory hold up in practice? To test this, we conducted an intercontinental MySQL-to-MySQL migration and sync experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimental Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source MySQL: located in Singapore (4 cores, 8GB RAM)&lt;/li&gt;
&lt;li&gt;Target MySQL: located in Silicon Valley, USA (4 cores, 8GB RAM)&lt;/li&gt;
&lt;li&gt;BladePipe: deployed on VMs in both Singapore and Silicon Valley (8 cores, 16GB RAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test Plan:&lt;/strong&gt; We migrated and synchronized the same data twice to compare performance with BladePipe deployed in different locations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figxcqk5y0piarraffx4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figxcqk5y0piarraffx4z.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Generate 1.3 million rows of data in Singapore MySQL.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;BladePipe deployed in Singapore&lt;/strong&gt; to migrate data to &lt;strong&gt;the U.S.&lt;/strong&gt; and record performance. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-04a29444d2f8e2571cf3f2b2d026910f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-04a29444d2f8e2571cf3f2b2d026910f.png" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make data changes (INSERT/UPDATE) at &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record sync performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-c2d254d1abbe42cd1793ec7ed788ff54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-c2d254d1abbe42cd1793ec7ed788ff54.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop the DataJob and delete target data.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;BladePipe deployed in the U.S.&lt;/strong&gt; to migrate the data again from &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-ef5dcf6d68996284d38cf50bc9852e31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-ef5dcf6d68996284d38cf50bc9852e31.png" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make data changes at &lt;strong&gt;Singapore MySQL&lt;/strong&gt; and record sync performance again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-9d97316d3c0a7ca5bf46bccbdc15af39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-9d97316d3c0a7ca5bf46bccbdc15af39.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Results &amp;amp; Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment Location&lt;/th&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source (Singapore)&lt;/td&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;6.5k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target (Silicon Valley)&lt;/td&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;15k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source (Singapore)&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;8k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target (Silicon Valley)&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;32k records/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm35gjtb4wgux1s5uqhhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm35gjtb4wgux1s5uqhhh.png" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprisingly, deploying BladePipe at the target (Silicon Valley) significantly outperformed the source-side deployment.    &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Potential Reasons:&lt;/strong&gt;    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network policies and bandwidth differences between the two locations.&lt;/li&gt;
&lt;li&gt;Target-side batch writes are less affected by poor network conditions compared to binlog/logical scanning on the source side.&lt;/li&gt;
&lt;li&gt;Other unpredictable network variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;While the experiment offers valuable insights into intercontinental data migration and sync, real-world environments can differ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production databases may be under heavy load, impacting the ability to push incremental changes efficiently.&lt;/li&gt;
&lt;li&gt;Dedicated network lines may offer more consistent network quality.&lt;/li&gt;
&lt;li&gt;Gateway rules and security policies vary across data centers, affecting performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our recommendation:&lt;/strong&gt; During the POC phase, deploy BladePipe on both the source and target sides, compare performance, and &lt;strong&gt;choose the best deployment strategy based on real-world results&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Build a Local RAG Using Ollama, PostgreSQL and BladePipe</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 13 Jun 2025 07:04:34 +0000</pubDate>
      <link>https://forem.com/bladepipe/build-a-local-rag-using-ollama-postgresql-and-bladepipe-4emp</link>
      <guid>https://forem.com/bladepipe/build-a-local-rag-using-ollama-postgresql-and-bladepipe-4emp</guid>
      <description>&lt;p&gt;Retrieval-Augmented Generation (RAG) is becoming increasingly common in enterprise applications. Unlike lightweight Q&amp;amp;A systems designed for personal users, enterprise RAG solutions must be reliable, controllable, scalable, and most importantly—secure.&lt;/p&gt;

&lt;p&gt;Many companies are cautious about sending internal data to public cloud-based models or vector databases due to the risk of sensitive information leakage. For industries with strict compliance needs, this is often a dealbreaker.&lt;/p&gt;

&lt;p&gt;To address these challenges, BladePipe now supports building local RAG services with Ollama, enabling enterprises to run intelligent RAG services entirely within their own infrastructure. This article walks you through &lt;strong&gt;building a fully private, production-ready RAG application&lt;/strong&gt;—without writing any code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Enterprise-Grade RAG Service?
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade RAG emphasizes end-to-end integration, data control, and tight alignment with business systems. The goal isn’t just smart Q&amp;amp;A. It brings automation and intelligence that genuinely boost business.&lt;/p&gt;

&lt;p&gt;Compared to hobby or research-focused RAG setups, enterprise systems have four key traits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully private stack&lt;/strong&gt;: All components must run locally or in a private cloud. No data leaves the enterprise boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diverse data sources&lt;/strong&gt;: Beyond plain text files, databases and other formats must be supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental data syncing&lt;/strong&gt;: Business data updates constantly. RAG indexes must stay in sync automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated tool calling (MCP-like capabilities)&lt;/strong&gt;: Retrieval and generation are only part of the story. Tools like SQL query, function calls, or workflow execution must be supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing BladePipe RagApi
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;’s &lt;strong&gt;RagApi&lt;/strong&gt; encapsulates both vector search and LLM-based Q&amp;amp;A capabilities and supports the MCP protocol. It’s designed to help every one quickly launch their own RAG services.&lt;/p&gt;

&lt;p&gt;Compared with the traditional way of building RAG services, RagApi's key advantages include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: Import documents + publish API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: Everything is configurable—no development needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform support&lt;/strong&gt;: Support DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Easily integrate into your existing chat UI or toolchain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Here’s how to build a fully private, secure RAG service using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local model reasoning and embedding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; for local vector storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BladePipe RagApi&lt;/strong&gt; for building and managing the RagApi service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall workflow is like:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_1-e92761b59b81ade1c796fa313c25873c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_1-e92761b59b81ade1c796fa313c25873c.png" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run Ollama Locally
&lt;/h3&gt;

&lt;p&gt;Ollama allows you to deploy LLMs on your local machine. It will be used for both &lt;strong&gt;embedding&lt;/strong&gt; and &lt;strong&gt;reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download Ollama from &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;https://ollama.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;After installation, run the following command to pull and run a suitable model for embedding and reasoning, such as &lt;code&gt;deepseek-r1&lt;/code&gt;.
Note: Large models may require significant hardware resources.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama run deepseek-r1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jdqiyaz7hqnkm0bm5ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jdqiyaz7hqnkm0bm5ae.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up PGVector
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install Docker (skip if already installed). Follow the steps below for your operating system:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: Refer to the official installation doc: &lt;a href="https://docs.docker.com/desktop/setup/install/mac-install/" rel="noopener noreferrer"&gt;Docker Desktop for Mac&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CentOS / RHEL&lt;/strong&gt;: Refer to the script below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## centos / rhel&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;yum-config-manager &lt;span class="nt"&gt;--add-repo&lt;/span&gt; https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker-ce-20.10.9-3.&lt;span class="k"&gt;*&lt;/span&gt; docker-ce-cli-20.10.9-3.&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker

&lt;span class="nb"&gt;sudo echo&lt;/span&gt; &lt;span class="s1"&gt;'{"exec-opts": ["native.cgroupdriver=systemd"]}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/docker/daemon.json

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ubuntu&lt;/strong&gt;: Refer to the script below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;## ubuntu&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -

&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository &lt;span class="s2"&gt;"deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lsb_release &lt;span class="nt"&gt;-cs&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; stable"&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;docker-ce&lt;span class="o"&gt;=&lt;/span&gt;5:20.10.24~3-0~ubuntu-&lt;span class="k"&gt;*&lt;/span&gt; docker-ce-cli&lt;span class="o"&gt;=&lt;/span&gt;5:20.10.24~3-0~ubuntu-&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker

&lt;span class="nb"&gt;sudo echo&lt;/span&gt; &lt;span class="s1"&gt;'{"exec-opts": ["native.cgroupdriver=systemd"]}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/docker/daemon.json

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start PostgreSQL + pgvector Container Service:
Execute the following command to deploy the PostgreSQL environment in one go:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;' &amp;gt; init_pgvector.sh
#!/bin/bash

# create docker-compose.yml
cat &amp;lt;&amp;lt;YML &amp;gt; docker-compose.yml
version: "3"
services:
  db:
    container_name: pgvector-db
    hostname: 127.0.0.1
    image: pgvector/pgvector:pg16
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=api
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=123456
YML

# Start container service (run in background)
docker-compose up --build -d

# Wait for container to start, then enter database and enable vector extension
echo "Waiting for container to start..."
sleep 5

docker exec -it pgvector-db psql -U root -d api -c "CREATE EXTENSION IF NOT EXISTS vector;"
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Grant execution permissions and run the script&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x init_pgvector.sh
./init_pgvector.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After execution, local PostgreSQL will automatically enable the pgvector extension, ready to store document embeddings securely on-prem.&lt;/p&gt;
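&lt;p&gt;Under the hood, pgvector ranks stored embeddings by their distance to the query vector. The idea can be sketched in plain Python with cosine similarity (toy 3-dimensional vectors here for illustration; real embeddings from &lt;code&gt;deepseek-r1&lt;/code&gt; have 4096 dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|); pgvector exposes the
    # equivalent cosine-distance operator for SQL queries.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (hypothetical values for illustration).
docs = {
    "install.md": [0.9, 0.1, 0.0],
    "faq.md": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # install.md is the closest match
```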

&lt;h3&gt;
  
  
  Deploy BladePipe (Enterprise)
&lt;/h3&gt;

&lt;p&gt;Follow the &lt;a href="https://doc.bladepipe.com/productOP/onPremise/installation/install_all_in_one_binary" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt; to download &lt;a href="https://www.bladepipe.com/" rel="noopener noreferrer"&gt;BladePipe (Enterprise)&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Building
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add DataSources
&lt;/h3&gt;

&lt;p&gt;Log in to BladePipe. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Files (SshFile):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Select &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;SshFile&lt;/strong&gt;. You can set &lt;a href="https://doc.bladepipe.com/reference/file_schema_format" rel="noopener noreferrer"&gt;extra parameters&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the machine IP where the files are stored and SSH port (default 22).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account &amp;amp; Password&lt;/strong&gt;: Username and password of the machine.&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;fileSuffixArray&lt;/strong&gt;: Set to &lt;code&gt;.md&lt;/code&gt; to include markdown files.&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;dbsJson&lt;/strong&gt;: Copy the default value and modify the &lt;strong&gt;schema&lt;/strong&gt; value (the root path where the target files are located)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"cc_virtual_fs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemas"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/Users/zoe/cloudcanal-doc-v2/locales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:[]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_3-278a4734454e0782ce6deb93520e50f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_3-278a4734454e0782ce6deb93520e50f8.png" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database (PostgreSQL):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: localhost:5432&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account&lt;/strong&gt;: root&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: 123456&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_4-4fffae73e1004921c958c24ba2bb1c58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_4-4fffae73e1004921c958c24ba2bb1c58.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM (Ollama):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;Ollama&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: localhost:11434&lt;/li&gt;
&lt;li&gt;
Parameter &lt;strong&gt;llmEmbedding&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "deepseek-r1": {
    "dimension": 4096
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Parameter &lt;strong&gt;llmChat&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "deepseek-r1": {
    "temperature": 1,
    "topP": 0.9,
    "showReasoning": false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_5-8ad1236f1defe6684607ef21766b6c4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_5-8ad1236f1defe6684607ef21766b6c4d.png" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RagApi Service (BladePipe):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;RagApi&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Set host to localhost and port to 18089.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: Customize a string (e.g. &lt;code&gt;my-bp-rag-key&lt;/code&gt;), used for authentication when calling RagApi later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_6-27efeb77b49b69e10907c5004e0c4ea2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_6-27efeb77b49b69e10907c5004e0c4ea2.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 1: Vectorize the Docs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;source: SshFile&lt;/strong&gt;, &lt;strong&gt;target: PostgreSQL&lt;/strong&gt;, and test the connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_7-ede2770725478c9ad6799b370ff70db7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_7-ede2770725478c9ad6799b370ff70db7.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Full Data&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;/li&gt;
&lt;li&gt;On the &lt;strong&gt;Tables&lt;/strong&gt; page:

&lt;ol&gt;
&lt;li&gt;Select the markdown files you want to process.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and fill in the table name (e.g. &lt;code&gt;knowledge_base&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Set LLM&lt;/strong&gt; &amp;gt; &lt;strong&gt;Ollama&lt;/strong&gt;, and select the instance and the embedding model.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_8-94169e91021b34a8b9cb39e037633396.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_8-94169e91021b34a8b9cb39e037633396.png" width="800" height="431"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_9-7ef0d3bb898a0b706d4b4cb3bdd4d110.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_9-7ef0d3bb898a0b706d4b4cb3bdd4d110.png" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page, click &lt;strong&gt;Batch Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;LLM embedding&lt;/strong&gt;. Select the fields for embedding, and check &lt;strong&gt;Select All&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_10-b0118385b0de9416a53cd23406fa88d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_10-b0118385b0de9416a53cd23406fa88d9.png" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_11-b44e5ab4685faf0f9112f2e797fcd387.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_11-b44e5ab4685faf0f9112f2e797fcd387.png" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 2: Build RagApi Service
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;source: PostgreSQL&lt;/strong&gt; (with vectors stored), &lt;strong&gt;target: RagApi&lt;/strong&gt;, and test the connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_12-856913b70fb9cc15e46616f15081ecfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_12-856913b70fb9cc15e46616f15081ecfb.png" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;/li&gt;
&lt;li&gt;On the &lt;strong&gt;Tables&lt;/strong&gt; page, select the vector table(s). Then click &lt;strong&gt;Set LLM&lt;/strong&gt;, and choose &lt;strong&gt;Ollama&lt;/strong&gt; as the &lt;strong&gt;Embedding LLM&lt;/strong&gt; and &lt;strong&gt;Chat LLM&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_13-bcfc97bc40042944896c811a436f52ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_13-bcfc97bc40042944896c811a436f52ca.png" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt; to finish the setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_15-38b48999513373ead55dc4e89915a7ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Flocal_15-38b48999513373ead55dc4e89915a7ef.png" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Perform a simple test using the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;curl&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;18089&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type: application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization: Bearer my-cc-rag-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
          {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;},
          {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}
        ],
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false
      }&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
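&lt;p&gt;The same request can be issued from Python with only the standard library. A sketch using &lt;code&gt;urllib&lt;/code&gt; (it assumes the API key &lt;code&gt;my-bp-rag-key&lt;/code&gt; configured for the RagApi DataSource above):&lt;/p&gt;

```python
import json
import urllib.request

API_KEY = "my-bp-rag-key"  # the key set when adding the RagApi DataSource
URL = "http://localhost:18089/v1/chat/completions"

def build_chat_request(messages, stream=False):
    # Same payload and headers as the curl test above.
    body = json.dumps({"messages": messages, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )

req = build_chat_request([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# With the RagApi service running, send it and read the OpenAI-style reply:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.get_method())  # POST
```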



&lt;h2&gt;
  
  
  Test
&lt;/h2&gt;

&lt;p&gt;You can test the RagApi with &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, a visual tool that supports OpenAI-compatible APIs.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open CherryStudio, click the Settings icon in the bottom left corner.&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;Model Provider&lt;/strong&gt;, search for &lt;strong&gt;OpenAI&lt;/strong&gt; and configure:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: your RagApi key configured in BladePipe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Host&lt;/strong&gt;: &lt;a href="http://localhost:18089" rel="noopener noreferrer"&gt;http://localhost:18089&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model ID&lt;/strong&gt;: BP_RAG&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgungmtlp7p22avv1qlof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgungmtlp7p22avv1qlof.png" width="800" height="228"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt6nyp56oggfnsyy01li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt6nyp56oggfnsyy01li.png" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Back on the chat page:

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Add Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Default Assistant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Right click &lt;strong&gt;Default Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Edit Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Model Settings&lt;/strong&gt;, and choose BP_RAG as the default model.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzn7u0kv4hdxpottjjq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzn7u0kv4hdxpottjjq4.png" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now try asking: &lt;code&gt;What privileges does BladePipe require for a source MySQL?&lt;/code&gt;. RagApi will search your vector database and generate a response using the chat model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lcrvxppg4380aqdbvxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lcrvxppg4380aqdbvxt.png" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade RAG services prioritize data privacy and control. By combining BladePipe with Ollama, you can easily achieve a fully private RAG service deployment, creating a truly reliable enterprise-grade RAG solution that does not depend on public networks.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>postgres</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build a RAG Chatbot with OpenAI and BladePipe - No Code Required</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Fri, 06 Jun 2025 07:20:34 +0000</pubDate>
      <link>https://forem.com/bladepipe/build-a-rag-chatbot-with-openai-and-bladepipe-no-code-required-4d0</link>
      <guid>https://forem.com/bladepipe/build-a-rag-chatbot-with-openai-and-bladepipe-no-code-required-4d0</guid>
      <description>&lt;p&gt;In &lt;a href="https://doc.bladepipe.com/blog/ai/rag_concept" rel="noopener noreferrer"&gt;a previous article&lt;/a&gt;, we explained key GenAI concepts like RAG, Function Calling, MCP, and AI Agents. Now the question is: how do we go from concepts to practice?  &lt;/p&gt;

&lt;p&gt;Currently, you can find plenty of RAG building tutorials online, but most of them are based on frameworks like LangChain, which still have a learning curve for beginners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, as a data integration platform, already supports the access to and processing of multiple data sources. This gives it a natural edge in setting up the semantic search foundations for a RAG system. Recently, BladePipe launched &lt;strong&gt;RagApi&lt;/strong&gt;, which wraps up &lt;strong&gt;vector search&lt;/strong&gt; and &lt;strong&gt;Q&amp;amp;A capabilities&lt;/strong&gt; into a &lt;strong&gt;plug-and-play API service&lt;/strong&gt;. With just two DataJobs in BladePipe, you can have your own RAG service—no coding required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BladePipe RagApi?
&lt;/h2&gt;

&lt;p&gt;Compared to traditional RAG setups, which often involve lots of manual work, BladePipe RagApi offers several unique benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two DataJobs for a RAG service&lt;/strong&gt;: One to import documents, and one to create the API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code deployment&lt;/strong&gt;: No need to write any code, just configure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustable parameters&lt;/strong&gt;: Adjust vector top-K, match threshold, prompt templates, model temperature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model and platform compatibility&lt;/strong&gt;: Supports DashScope (Alibaba Cloud), OpenAI, DeepSeek, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;: Integrate it directly with existing Chat apps or tools with no extra setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo: A Q&amp;amp;A Service for BladePipe Docs
&lt;/h2&gt;

&lt;p&gt;We’ll use BladePipe’s own documentation as a knowledge base to create a RAG-based Q&amp;amp;A service.&lt;/p&gt;

&lt;p&gt;Here’s what we’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BladePipe&lt;/strong&gt; – to build and manage the RagApi service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; – as the vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt; – OpenAI text-embedding-3-large&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat model&lt;/strong&gt; – OpenAI GPT-4o&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the overall workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fragapi_workflow-856e95a62e86522ba49452797820a8f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fragapi_workflow-856e95a62e86522ba49452797820a8f7.png" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_docker" rel="noopener noreferrer"&gt;Install Worker (Docker)&lt;/a&gt; or &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_binary" rel="noopener noreferrer"&gt;Install Worker (Binary)&lt;/a&gt; to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare Your Resources
&lt;/h3&gt;

&lt;p&gt;Log in to the &lt;a href="https://openai.com/index/openai-api/" rel="noopener noreferrer"&gt;OpenAI API platform&lt;/a&gt; and create an API key.&lt;br&gt;&lt;br&gt;
Install a local PostgreSQL instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;#!/bin/bash

# create file docker-compose.yml
cat &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;EOF&lt;/span&gt; &lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; docker-compose.yml
version: "3"
services:
  db:
    container_name: pgvector-db
    hostname: 127.0.0.1
    image: pgvector/pgvector:pg16
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=api
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=123456
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
EOF

# Start docker-compose automatically
docker-compose up --build

# Access PostgreSQL
docker exec -it pgvector-db psql -U root -d api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a privileged user and log in.&lt;br&gt;&lt;br&gt;
Switch to the target schema where you need to create tables (like &lt;code&gt;public&lt;/code&gt;).&lt;br&gt;&lt;br&gt;
Run the following SQL to enable vector capability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS vector&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add DataSources
&lt;/h3&gt;

&lt;p&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;. Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Files:&lt;/strong&gt;   &lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;SshFile&lt;/strong&gt;. You can set &lt;a href="https://doc.bladepipe.com/reference/file_schema_format" rel="noopener noreferrer"&gt;extra parameters&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the IP address of the machine where the files are stored and the SSH port (default 22).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account &amp;amp; Password&lt;/strong&gt;: Username and password of the machine.&lt;/li&gt;
&lt;li&gt;Parameter &lt;strong&gt;fileSuffixArray&lt;/strong&gt;: Set to &lt;code&gt;.md&lt;/code&gt; to include Markdown files.&lt;/li&gt;
&lt;li&gt;Parameter &lt;strong&gt;dbsJson&lt;/strong&gt;: Copy the default value and modify the &lt;strong&gt;schema&lt;/strong&gt; value (the root path where the target files are located).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"cc_virtual_fs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemas"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/cloudcanal-doc-v2/locales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:[]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-cc70ae4e3b8c5051c9a3ac55643a5f46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-cc70ae4e3b8c5051c9a3ac55643a5f46.png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add the Vector Database:&lt;/strong&gt;   &lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;PostgreSQL&lt;/strong&gt;, then connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-1123c7b039b74ab9b22ffe4e37fd6429.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-1123c7b039b74ab9b22ffe4e37fd6429.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add an LLM:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Choose &lt;strong&gt;Independent Cloud Platform&lt;/strong&gt; &amp;gt; &lt;strong&gt;Manually Fill&lt;/strong&gt; &amp;gt; &lt;strong&gt;OpenAI&lt;/strong&gt;, and fill in the API key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-30509920901044517f3f2cbdb0c71f69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-30509920901044517f3f2cbdb0c71f69.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add RagApi Service:&lt;/strong&gt;    &lt;/p&gt;

&lt;p&gt;Choose &lt;strong&gt;Self Maintenance&lt;/strong&gt; &amp;gt; &lt;strong&gt;RagApi&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Set host to &lt;code&gt;localhost&lt;/code&gt; and port to 18089.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: Create your own API key for later use. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-dcc508aa9fd4a185d3a861fd4658318c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-dcc508aa9fd4a185d3a861fd4658318c.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 1: Vectorize Your Data
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Choose source: &lt;strong&gt;SshFile&lt;/strong&gt;, target: &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and test the connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-857f1ee2ddab3fb13e8a9271fb3f0a24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F5-857f1ee2ddab3fb13e8a9271fb3f0a24.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Full Data&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB).&lt;br&gt;&lt;br&gt;
On the &lt;strong&gt;Tables&lt;/strong&gt; page,&lt;br&gt;
    1. Select the markdown files you want to process.&lt;br&gt;
    2. Click &lt;strong&gt;Batch Modify Target Names&lt;/strong&gt; &amp;gt; &lt;strong&gt;Unified table name&lt;/strong&gt;, and fill in the table name (e.g. &lt;code&gt;vector_store&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-8a689e786aab30efc30e57219c62784b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F7-8a689e786aab30efc30e57219c62784b.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page,&lt;br&gt;&lt;br&gt;
    1. Click &lt;strong&gt;Set LLM&lt;/strong&gt; &amp;gt; &lt;strong&gt;OpenAI&lt;/strong&gt;, and select the instance and the embedding model (e.g. &lt;code&gt;text-embedding-3-large&lt;/code&gt;).&lt;br&gt;
    2. Click &lt;strong&gt;Batch Operation&lt;/strong&gt; &amp;gt; &lt;strong&gt;LLM embedding&lt;/strong&gt;. Select the fields for embedding, and check &lt;strong&gt;Select All&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-1-16f07226ade47c47e9aade29f08637ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-1-16f07226ade47c47e9aade29f08637ac.png" width="800" height="374"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-2-5f14ee948990dedbe1fe4ff8842ab291.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F8-2-5f14ee948990dedbe1fe4ff8842ab291.png" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-afbc31f00cae85df748ee7591a8ae49e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F10-afbc31f00cae85df748ee7591a8ae49e.png" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataJob 2: Build RagApi Service
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Choose source: &lt;strong&gt;PostgreSQL&lt;/strong&gt; (where the vectors are stored), target: &lt;strong&gt;RagApi&lt;/strong&gt;, and test the connection.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-f12916f024b65d72a59df9c548184293.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F11-f12916f024b65d72a59df9c548184293.png" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type. Keep the specification as default (2 GB). &lt;br&gt;
On the &lt;strong&gt;Tables&lt;/strong&gt; page, select the vector table(s).   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F13-470a46a3154ec318bcea09cb1e64f6f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F13-470a46a3154ec318bcea09cb1e64f6f1.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Data Processing&lt;/strong&gt; page, click &lt;strong&gt;Set LLM&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embedding LLM&lt;/strong&gt;: Select OpenAI and the embedding model (e.g. &lt;code&gt;text-embedding-3-large&lt;/code&gt;). &lt;strong&gt;Note:&lt;/strong&gt; Make sure vector dimensions in PostgreSQL match the embedding model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat LLM&lt;/strong&gt;: Select OpenAI and the chat model (e.g. &lt;code&gt;gpt-4o&lt;/code&gt;). &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F14-ce81c80de144806e353adefc8d96e647.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F14-ce81c80de144806e353adefc8d96e647.png" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;Creation&lt;/strong&gt; page, click &lt;strong&gt;Create DataJob&lt;/strong&gt; to finish the setup. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F16-fef9a1a7ce238ab7cd024eb9cb66e0ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F16-fef9a1a7ce238ab7cd024eb9cb66e0ca.png" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test
&lt;/h2&gt;

&lt;p&gt;You can test the RagApi with &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, a visual tool that supports OpenAI-compatible APIs.&lt;/p&gt;

&lt;p&gt;Open &lt;a href="https://cherry-ai.com/" rel="noopener noreferrer"&gt;CherryStudio&lt;/a&gt;, click the Settings icon in the bottom left corner.&lt;br&gt;&lt;br&gt;
Under &lt;strong&gt;Model Provider&lt;/strong&gt;, search for &lt;strong&gt;OpenAI&lt;/strong&gt; and configure:&lt;br&gt;
    - &lt;strong&gt;API Key&lt;/strong&gt;: your RagApi key configured in BladePipe&lt;br&gt;
    - &lt;strong&gt;API Host&lt;/strong&gt;: &lt;a href="http://localhost:18089" rel="noopener noreferrer"&gt;http://localhost:18089&lt;/a&gt;&lt;br&gt;
    - &lt;strong&gt;Model ID&lt;/strong&gt;: BP_RAG&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozdlt0wkmsm3i92vb9f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozdlt0wkmsm3i92vb9f1.png" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ogpbesa1rmgv80jv78u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ogpbesa1rmgv80jv78u.png" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back on the chat page:&lt;br&gt;&lt;br&gt;
    - Click &lt;strong&gt;Add Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Default Assistant&lt;/strong&gt;.&lt;br&gt;
    - Right click &lt;strong&gt;Default Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Edit Assistant&lt;/strong&gt; &amp;gt; &lt;strong&gt;Model Settings&lt;/strong&gt;, and choose BP_RAG as the default model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26xfuu3fq2snz68msbei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26xfuu3fq2snz68msbei.png" width="800" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now try asking: &lt;code&gt;How to create an incremental DataJob in BladePipe?&lt;/code&gt;. RagApi will search your vector database and generate a response using the chat model.&lt;/p&gt;
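Because RagApi speaks the OpenAI chat protocol, the same question can also be sent from the command line. This is a minimal sketch, assuming the host, port, and model ID configured above; the `/v1/chat/completions` path follows the OpenAI convention, and `your-ragapi-key` is a placeholder for the key you created in BladePipe.

```shell
#!/bin/bash
# Build an OpenAI-style chat request for the RagApi service.
# RAGAPI_KEY is a placeholder; use the key you created in BladePipe.
RAGAPI_HOST="http://localhost:18089"
RAGAPI_KEY="your-ragapi-key"
PAYLOAD='{"model":"BP_RAG","messages":[{"role":"user","content":"How to create an incremental DataJob in BladePipe?"}]}'
echo "$PAYLOAD"

# Send the request (requires the RagApi DataJob to be running):
# curl -s "$RAGAPI_HOST/v1/chat/completions" \
#   -H "Authorization: Bearer $RAGAPI_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

Any OpenAI-compatible client library should work the same way against this endpoint.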

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv1uch069mn44nkcex6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuv1uch069mn44nkcex6.png" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;With just a few steps, we’ve built a fully functional RagApi service from scratch—vectorized the data, connected to a vector DB, configured LLMs, generated prompts, and deployed an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, teams can quickly build Q&amp;amp;A services based on external knowledge without writing any code. It's a powerful yet accessible way to tap into GenAI for your own data.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>openai</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Load Data From MySQL to Iceberg in Real Time</title>
      <dc:creator>BladePipe</dc:creator>
      <pubDate>Wed, 28 May 2025 07:59:50 +0000</pubDate>
      <link>https://forem.com/bladepipe/how-to-load-data-from-mysql-to-iceberg-in-real-time-2f</link>
      <guid>https://forem.com/bladepipe/how-to-load-data-from-mysql-to-iceberg-in-real-time-2f</guid>
      <description>&lt;p&gt;As companies deal with more data than ever before, the need for real-time, scalable, and low-cost storage becomes critical. That's where Apache Iceberg shines. In this post, I’ll walk you through how to build a real-time data sync pipeline from MySQL to Iceberg using &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;—a tool that makes data migration ridiculously simple.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iceberg
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Iceberg?
&lt;/h3&gt;

&lt;p&gt;If you haven’t heard of Iceberg yet, it’s an open table format designed for large analytic datasets. Think of it as a smarter table format for your data lake—supporting schema evolution, hidden partitioning, ACID transactions, and near real-time data access.&lt;/p&gt;

&lt;p&gt;It includes two key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt;: Think of this as metadata—the table names, columns, data types, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt;: Where the metadata and actual files are stored—like on S3 or HDFS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Iceberg?
&lt;/h3&gt;

&lt;p&gt;Iceberg is open and flexible. It defines clear standards for catalog, file formats, data storage, and data access. This makes it widely compatible with different tools and services.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs&lt;/strong&gt;: AWS Glue, Hive, Nessie, JDBC, or custom REST catalogs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File formats&lt;/strong&gt;: Parquet, ORC, Avro, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage options&lt;/strong&gt;: AWS S3, Azure Blob, MinIO, HDFS, Posix FS, local file systems, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data access&lt;/strong&gt;: Real-time data warehouses like StarRocks, Doris, ClickHouse, or batch/stream processing engines like Spark, Flink, and Hive can all read, process and analyze Iceberg data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides its openness, Iceberg strikes a good balance between large-scale storage and near real-time support for inserts, updates, and deletes.&lt;/p&gt;

&lt;p&gt;Here’s a quick comparison across several database types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database Type&lt;/th&gt;
&lt;th&gt;Relational DB&lt;/th&gt;
&lt;th&gt;Real-time Data Warehouse&lt;/th&gt;
&lt;th&gt;Traditional Big Data&lt;/th&gt;
&lt;th&gt;Data Lake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Capacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to a few TBs&lt;/td&gt;
&lt;td&gt;100+ TBs&lt;/td&gt;
&lt;td&gt;PB level&lt;/td&gt;
&lt;td&gt;PB level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Millisecond-level latency, 10K+ QPS&lt;/td&gt;
&lt;td&gt;Second-to-minute latency, thousands of QPS&lt;/td&gt;
&lt;td&gt;Hour-to-day latency, very low QPS&lt;/td&gt;
&lt;td&gt;Minute-level latency, low QPS (batch write)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ACID compliant&lt;/td&gt;
&lt;td&gt;ACID compliant or eventually consistent&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High or very high&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Openness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium (storage-compute decoupling)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this table, it’s clear that Iceberg offers &lt;strong&gt;low cost&lt;/strong&gt;, &lt;strong&gt;massive storage&lt;/strong&gt;, and &lt;strong&gt;strong compatibility with analytics tools&lt;/strong&gt;—a good replacement for older big data systems.&lt;/p&gt;

&lt;p&gt;And thanks to its open architecture, you can keep exploring new use cases for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BladePipe?
&lt;/h2&gt;

&lt;p&gt;Setting up Iceberg sounds great—until you realize how much work it takes to actually migrate and sync data from your transactional database. That’s where BladePipe comes in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supported Catalogs and Storage
&lt;/h3&gt;

&lt;p&gt;BladePipe currently supports 3 Iceberg catalogs and 2 storage backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue + AWS S3&lt;/li&gt;
&lt;li&gt;Nessie + MinIO / AWS S3&lt;/li&gt;
&lt;li&gt;REST Catalog + MinIO / AWS S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a fully cloud-based setup: Use AWS RDS + EC2 to deploy BladePipe + AWS Glue + AWS S3.&lt;/p&gt;

&lt;p&gt;For an on-premise setup: Use a self-hosted relational database + On-Premise deployment of BladePipe + Nessie or REST catalog + MinIO.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-Stop Data Sync
&lt;/h3&gt;

&lt;p&gt;Before data replication, there's often a lot of manual setup. BladePipe takes care of that for you—automatically handling schema mapping, historical data migration, and other preparation.&lt;/p&gt;

&lt;p&gt;Even though Iceberg isn't a traditional database, BladePipe supports an automatic data sync process, including converting schemas, mapping data types, adapting field lengths, cleaning constraints, etc. Everything happens in BladePipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Procedures
&lt;/h2&gt;

&lt;p&gt;In this post, we’ll use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: MySQL (self-hosted)&lt;/li&gt;
&lt;li&gt;Target: Iceberg backed by AWS Glue + S3&lt;/li&gt;
&lt;li&gt;Sync Tool: &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe (Cloud)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go step-by-step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install BladePipe
&lt;/h3&gt;

&lt;p&gt;Follow the instructions in &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_docker" rel="noopener noreferrer"&gt;Install Worker (Docker)&lt;/a&gt; or &lt;a href="https://doc.bladepipe.com/productOP/byoc/installation/install_worker_binary" rel="noopener noreferrer"&gt;Install Worker (Binary)&lt;/a&gt; to download and install a BladePipe Worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add DataSources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the &lt;a href="https://cloud.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe Cloud&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;DataSource&lt;/strong&gt; &amp;gt; &lt;strong&gt;Add DataSource&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add two sources – one MySQL, one Iceberg. For Iceberg, fill in the following (replace &lt;code&gt;&amp;lt;...&amp;gt;&lt;/code&gt; with your values):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address&lt;/strong&gt;: Fill in the AWS Glue endpoint.
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glue.&amp;lt;aws_glue_region_code&amp;gt;.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version&lt;/strong&gt;: Leave as default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: Fill in meaningful words to help identify it.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extra Info&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;httpsEnabled&lt;/strong&gt;: Enable it to set the value to true.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogName&lt;/strong&gt;: Enter a meaningful name, such as glue__catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogType&lt;/strong&gt;: Fill in GLUE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogWarehouse&lt;/strong&gt;: The place where metadata and files are stored, such as s3://_iceberg.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;catalogProps&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"io-impl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.iceberg.aws.s3.S3FileIO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://s3.&amp;lt;aws_s3_region_code&amp;gt;.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.access-key-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.secret-access-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3.path-style-access"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_s3_region&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider.glue.access-key-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_glue_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider.glue.secret-access-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;aws_glue_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client.credentials-provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.amazonaws.glue.catalog.credentials.GlueAwsCredentialsProvider"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-afd4d16b1739f59151ceb30d6189cfc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F1-afd4d16b1739f59151ceb30d6189cfc4.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Create a DataJob
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;DataJob&lt;/strong&gt; &amp;gt; &lt;a href="https://doc.bladepipe.com/operation/job_manage/create_job/create_full_incre_task" rel="noopener noreferrer"&gt;&lt;strong&gt;Create DataJob&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Select the source and target DataSources, and click &lt;strong&gt;Test Connection&lt;/strong&gt; for both. Here's the recommended Iceberg table property configuration:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"format-version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"parquet.compression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"snappy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"iceberg.write.format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.metadata.delete-after-commit.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.metadata.previous-versions-max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.update.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.delete.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.merge.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"merge-on-read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.distribution-mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.object-storage.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"write.spark.accept-any-schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e436c11d029481dc58c5a86d17a2fc7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F2-e436c11d029481dc58c5a86d17a2fc7b.png" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;Incremental&lt;/strong&gt; for DataJob Type, together with the &lt;strong&gt;Full Data&lt;/strong&gt; option. Use at least a 1 GB or 2 GB DataJob specification; smaller specifications may hit memory issues with large batches.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F3-aaf4ce14be8ce88cbcdb85c426ceab33.png" width="800" height="428"&gt;
&lt;/li&gt;
&lt;li&gt;Select the tables to be replicated. It’s best to stay under 1000 tables per DataJob.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F4-a04b97e30e5784d7159c6a90e948cdbd.png" width="800" height="429"&gt;
&lt;/li&gt;
&lt;li&gt;Select the columns to be replicated.&lt;/li&gt;
&lt;li&gt;Confirm the DataJob creation, and start the DataJob.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2F6-3a09842159318f4b02a02b13b575f071.png" alt="mysql_to_iceberg_running" width="800" height="125"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4: Test &amp;amp; Verify
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generate some insert/update/delete operations on MySQL.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fmysql_to_iceberg_incre_data-ec037ce34cb79945652094c5f056e4ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoc.bladepipe.com%2Fassets%2Fimages%2Fmysql_to_iceberg_incre_data-ec037ce34cb79945652094c5f056e4ce.png" alt="mysql_to_iceberg_incre_data" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stop data generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up a pay-as-you-go Aliyun EMR cluster for StarRocks, add the AWS Glue Iceberg catalog as an external catalog, and run verification queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
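
&lt;p&gt;For step 1, the test workload can be as simple as a handful of DML statements on the source MySQL database. A minimal sketch (the &lt;code&gt;orders&lt;/code&gt; table and its columns here are hypothetical placeholders; substitute one of your replicated tables):&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example table; use any table included in the DataJob.
INSERT INTO orders (id, amount, status) VALUES (1001, 99.90, 'NEW');
UPDATE orders SET status = 'PAID' WHERE id = 1001;
DELETE FROM orders WHERE id = 1001;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each of these operations should appear in the Iceberg table shortly after it commits on MySQL.&lt;/p&gt;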

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In StarRocks, add the external catalog:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;glue_test&lt;/span&gt;
 &lt;span class="n"&gt;PROPERTIES&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;"type"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"iceberg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"iceberg.catalog.type"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"glue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.use_instance_profile"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.access_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_glue_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.secret_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_glue_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.glue.region"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ap-southeast-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.use_instance_profile"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.access_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_s3_iam_user_access_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.secret_key"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&amp;lt;aws_s3_iam_user_secret_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"aws.s3.region"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"ap-southeast-1"&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;glue_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;new_planner_optimize_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MySQL row count&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ol4wkxyqiqcgul663tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ol4wkxyqiqcgul663tt.png" alt="mysql_data_count" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iceberg row count&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pzix7ffe18fnd1f4c4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pzix7ffe18fnd1f4c4m.png" alt="iceberg_data_count" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
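
&lt;p&gt;With the external catalog in place, the row counts shown in the screenshots can be reproduced with a query like the following (the database and table names are placeholders; run the same &lt;code&gt;COUNT(*)&lt;/code&gt; on MySQL to compare):&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;SET CATALOG glue_test;

-- Hypothetical database/table names; the counts should match the MySQL side
-- once the incremental DataJob has caught up.
SELECT COUNT(*) FROM my_db.my_table;
&lt;/code&gt;&lt;/pre&gt;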

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Building a robust, real-time data pipeline from MySQL to Iceberg used to be a heavy lift. With tools like &lt;a href="https://www.bladepipe.com" rel="noopener noreferrer"&gt;BladePipe&lt;/a&gt;, it becomes as easy as clicking through a setup wizard.&lt;/p&gt;

&lt;p&gt;Whether you're modernizing your data platform or experimenting with lakehouse architectures, this combo gives you a low-cost, high-scale option to play with.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>mysql</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
