<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Estuary</title>
    <description>The latest articles on Forem by Estuary (@estuary).</description>
    <link>https://forem.com/estuary</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9217%2F645b5d1c-918b-4917-bae7-4467464c6119.png</url>
      <title>Forem: Estuary</title>
      <link>https://forem.com/estuary</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/estuary"/>
    <language>en</language>
    <item>
      <title>2x Faster MongoDB CDC: An Engineering Deep-Dive on Performance Optimization</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:29:53 +0000</pubDate>
      <link>https://forem.com/estuary/2x-faster-mongodb-cdc-an-engineering-deep-dive-on-performance-optimization-4ghb</link>
      <guid>https://forem.com/estuary/2x-faster-mongodb-cdc-an-engineering-deep-dive-on-performance-optimization-4ghb</guid>
      <description>&lt;p&gt;Estuary’s focus on in-house crafted connectors isn’t an accident.&lt;/p&gt;

&lt;p&gt;It’s not about keeping secrets; we’re not a black box factory and &lt;a href="https://github.com/estuary/connectors" rel="noopener noreferrer"&gt;connector source code&lt;/a&gt; is publicly available for anyone to review. It’s about maintaining the responsibility of ownership, starting with a high-quality base product, and refining from there.&lt;/p&gt;

&lt;p&gt;Integrations are specifically designed to work seamlessly with Estuary, providing standard customization options and converting data to standard formats with as little waste as possible. And connectors receive continuous updates to keep up with API changes or fine-tune performance.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://docs.estuary.dev/reference/Connectors/capture-connectors/MongoDB/" rel="noopener noreferrer"&gt;MongoDB capture connector&lt;/a&gt; recently received one of these upgrades: while the connector reliably got the job done, it could fall behind in high-volume enterprise use cases. This could be especially detrimental for real-time pipelines that counted on the connector’s functionality with MongoDB’s change streams—if the connector couldn’t keep up with the data coming in, downstream systems could experience delays.&lt;/p&gt;

&lt;p&gt;For real-time native applications, even small slowdowns have an outsized impact. Consider the route change notification for a shipment that arrives just after a driver misses the turnoff. Or a triage system that doesn't capture the latest developments in its priority calculations.&lt;/p&gt;

&lt;p&gt;It was definitely time for some optimization work.&lt;/p&gt;

&lt;p&gt;On the case was Mahdi Dibaiee. Based in Dublin, Ireland when not on adventures around the world, Mahdi has been a Senior Software Engineer with Estuary for nearly four years. Having worked on data planes, Estuary’s &lt;a href="https://docs.estuary.dev/guides/get-started-with-flowctl/" rel="noopener noreferrer"&gt;&lt;code&gt;flowctl&lt;/code&gt;&lt;/a&gt; CLI, and various connectors, his deep knowledge of the platform lets him flexibly pick up whatever tasks have current top priority.&lt;/p&gt;

&lt;p&gt;This is a behind-the-scenes look at how he analyzed the existing implementation’s limitations, researched solutions, and ended up with double the speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Small Documents
&lt;/h2&gt;

&lt;p&gt;“Make this integration faster,” while a laudable goal, isn’t much to go on. Why were captures falling behind? What was the expected throughput rate? And how could we find specific areas to improve?&lt;/p&gt;

&lt;p&gt;First, start with a baseline.&lt;/p&gt;

&lt;p&gt;The MongoDB capture connector tended towards a throughput rate of 34 MB/s when working with standard-sized documents, such as those around 20 KB apiece.&lt;/p&gt;

&lt;p&gt;To test how the connector would react under different circumstances, Mahdi tried it out against a stream of much smaller documents, each around 250 bytes.&lt;/p&gt;

&lt;p&gt;Something concerning happened when the connector processed these small documents. The capture’s ingestion rate dropped down to a meager 6 MB/s. While this “tiny document” use case would be unlikely in the wild, 6 MB/s was still far too slow.&lt;/p&gt;

&lt;p&gt;It also uncovered a possible path forward.&lt;/p&gt;

&lt;p&gt;“This told us that we had a large overhead-per-document,” Mahdi explained, which resulted in the abysmal slowdown.&lt;/p&gt;

&lt;p&gt;Essentially, all document processing would include some overhead. Changing the size of processed documents acted as a lever to quickly check just how much the overhead impacted performance: smaller documents with the same amount of overhead per document led to more overall time spent on the overhead rather than on making progress.&lt;/p&gt;
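&lt;p&gt;To see why a fixed per-document overhead hits small documents hardest, consider a back-of-the-envelope model. The constants below are illustrative assumptions, not Mahdi’s measurements: a fixed cost per document plus a per-byte processing cost.&lt;/p&gt;

```go
package main

import "fmt"

// effectiveThroughput models bytes-per-second throughput when every document
// pays a fixed overhead (overheadSec) on top of a per-byte processing cost.
func effectiveThroughput(docBytes, overheadSec, perByteSec float64) float64 {
	return docBytes / (overheadSec + docBytes*perByteSec)
}

func main() {
	// Illustrative (not measured) constants: 20 µs fixed overhead per
	// document, plus per-byte work equivalent to ~60 MB/s.
	const overhead = 20e-6
	const perByte = 1.0 / 60e6

	small := effectiveThroughput(250, overhead, perByte) / 1e6
	large := effectiveThroughput(20_000, overhead, perByte) / 1e6
	fmt.Printf("250 B docs: %.1f MB/s\n", small)  // ≈ 10.3 MB/s
	fmt.Printf("20 KB docs: %.1f MB/s\n", large)  // ≈ 56.6 MB/s
}
```

&lt;p&gt;With these made-up numbers, 250-byte documents manage under a fifth of the throughput of 20 KB documents, purely because the fixed per-document cost dominates.&lt;/p&gt;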

&lt;p&gt;If he could find ways to reduce that overhead, all pipelines should speed up, not just ones with tiny documents.&lt;/p&gt;

&lt;p&gt;But where exactly did that overhead come from? To tune the MongoDB capture’s performance, some digging would be required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reason Behind the Bottleneck
&lt;/h2&gt;

&lt;p&gt;To get a picture of the systems involved, Mahdi profiled a particular MongoDB capture that was struggling to keep up with its load.&lt;/p&gt;

&lt;p&gt;First, he ruled out a couple of obvious culprits. He checked CPU load and memory pressure on both MongoDB’s side and the capture connector’s side. Neither indicated any issues.&lt;/p&gt;

&lt;p&gt;Next, Mahdi wanted to see where Estuary spent the most time when ingesting data from MongoDB. He set up a detailed tracing view, dividing up the time for each data fetch and marking out network and CPU activity.&lt;/p&gt;

&lt;p&gt;The trace exposed two areas of note: one a suspiciously empty space, and one a suspiciously long process, both related to the connector call to get more documents. In total, this caused Estuary to spend around two seconds on each batch of fetched data, which isn’t quite the millisecond latency Estuary aims for.&lt;/p&gt;

&lt;p&gt;So, what was actually happening?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nrovhcq8n9p3pi459mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nrovhcq8n9p3pi459mm.png" alt="A 2-second slice of time showing CPU activity in the MongoDB connector" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Activity trace for a MongoDB capture. ~2 seconds is highlighted, showing a noticeable gap in CPU usage before a string of activity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first 600ms of this cycle corresponded to the data fetch itself. When one batch of data finished processing, the connector sent a request over the network for more, then started working on the new batch once it arrived.&lt;/p&gt;

&lt;p&gt;Because of this synchronous mode of operation, the connector essentially sat around for over half a second each time it checked for new data. When working with an end-to-end real-time system, those milliseconds in the pipeline add up. Not to mention the cumulative idle time, with the CPU doing nothing much for nearly a third of each cycle.&lt;/p&gt;

&lt;p&gt;There, then, was an obvious bottleneck, but the activity following the fetch was also curious. The remaining 1.4 seconds in the cycle were spent processing documents.&lt;/p&gt;

&lt;p&gt;By itself, emitting documents and checkpoints to Estuary shouldn’t take that long. But there was one more step in the processing phase that might: decoding MongoDB’s BSON documents in the first place.&lt;/p&gt;

&lt;p&gt;With the possibility of optimizing document processing in the mix, there were now two avenues to improve the connector’s performance.&lt;/p&gt;

&lt;p&gt;Why not implement both?&lt;/p&gt;

&lt;h2&gt;
  
  
  From Go to Rust: An Expedient Solution
&lt;/h2&gt;

&lt;p&gt;The CPU’s idle time was perhaps the more straightforward fix. Mahdi immediately identified that making the connector slightly more asynchronous would keep the CPU busy and shave those 600ms off each batch.&lt;/p&gt;

&lt;p&gt;To do so, he modified Estuary’s MongoDB connector to pre-fetch the next batch while still processing the current one. To preserve ordering and bound memory usage, he limited the number of in-flight batches to four. With a maximum of 16 MB per MongoDB cursor batch, this caps the connector’s memory consumption at 64 MB.&lt;/p&gt;
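&lt;p&gt;The prefetching pattern can be sketched with a bounded channel in standard-library Go. This is a simplified illustration of the idea, not the connector’s actual code:&lt;/p&gt;

```go
package main

import "fmt"

// batch stands in for one MongoDB cursor batch (up to 16 MB in the real connector).
type batch struct{ id int }

// fetchBatches simulates the network fetch loop: it pulls batches from the
// source and sends them into a bounded channel. The channel's capacity limits
// how many batches are in flight, bounding memory while preserving order.
func fetchBatches(n int, out chan<- batch) {
	for i := 0; i < n; i++ {
		out <- batch{id: i} // blocks when the buffer is full
	}
	close(out)
}

func main() {
	// Capacity 4 mirrors the connector's limit of four in-flight batches.
	pipeline := make(chan batch, 4)
	go fetchBatches(8, pipeline)

	// The consumer processes batches in order while the producer keeps
	// fetching ahead, so the CPU never sits idle waiting on the network.
	for b := range pipeline {
		fmt.Println("processing batch", b.id)
	}
}
```

&lt;p&gt;The buffered channel gives backpressure for free: the producer stalls as soon as four batches are queued, so memory stays bounded no matter how fast the source is.&lt;/p&gt;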

&lt;p&gt;This change alone would provide a welcome performance boost, but there was still the unsatisfyingly slow document processing time to contend with. And it was a trickier problem.&lt;/p&gt;

&lt;p&gt;To standardize data coming from and going to a variety of different systems using a variety of different document formats and data types, Estuary translates everything to JSON as an intermediary. This makes it simple to mix and match data sources and destinations, or plug in a new connector: each connector only needs to handle its specific system and translation to or from the shared language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp9e20ltotbcov41xour.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp9e20ltotbcov41xour.png" alt="Estuary connectors are plug-and-play by going through an intermediary JSON conversion" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Estuary translates MongoDB’s BSON documents to JSON so as to then easily translate the data to any destination format.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MongoDB documents come in BSON, or Binary JSON. This binary-encoded serialization of JSON-like documents generally makes for efficient storage and retrieval. It also includes a handful of additional data types, such as datetime and more specific numeric types.&lt;/p&gt;

&lt;p&gt;This sounds like it would make for a reasonably simple conversion, but Mahdi found that Estuary’s MongoDB connector spent a lot of time decoding documents with Go’s &lt;a href="https://pkg.go.dev/github.com/mongodb/mongo-go-driver/bson" rel="noopener noreferrer"&gt;&lt;code&gt;bson&lt;/code&gt;&lt;/a&gt; package. On reflection, perhaps this wasn’t much of a surprise. Go’s &lt;a href="https://pkg.go.dev/reflect" rel="noopener noreferrer"&gt;&lt;code&gt;reflect&lt;/code&gt;&lt;/a&gt; package, which inspects data types that aren’t known at compile time, is notoriously slow, and the &lt;code&gt;bson&lt;/code&gt; package relies heavily on &lt;code&gt;reflect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Looking for alternatives, he first ran some benchmarks on Rust’s corresponding &lt;a href="https://github.com/mongodb/bson-rust" rel="noopener noreferrer"&gt;&lt;code&gt;bson&lt;/code&gt;&lt;/a&gt; crate. The results were clear: the Rust version decoded documents 2x faster than Go’s.&lt;/p&gt;

&lt;p&gt;Mahdi’s meticulous research also uncovered another option. Rust’s most popular serialization/deserialization crate, &lt;a href="https://crates.io/crates/serde" rel="noopener noreferrer"&gt;&lt;code&gt;serde&lt;/code&gt;&lt;/a&gt;, has a &lt;code&gt;serde-transcode&lt;/code&gt; plugin crate. This transcoder can convert documents from one format to another without any intermediary layer, cutting down on unnecessary processing steps. With this, the BSON to JSON conversion could be 3x faster than the existing Go implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59iic0go2u7elb7yezcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59iic0go2u7elb7yezcb.png" alt="Rust's BSON conversion is 3x faster than Go" width="661" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;serde&lt;/code&gt; couldn’t simply be swapped in as-is. Mahdi wrapped the out-of-the-box serializer in custom logic, extending the JSON conversion and sanitizing the data. The resulting implementation fit Estuary’s specific needs while retaining the 3x performance boost.&lt;/p&gt;

&lt;p&gt;These changes would address both bottlenecks and refurbish the MongoDB capture connector.&lt;/p&gt;

&lt;h2&gt;
  
  
  End Result: Supercharged MongoDB Captures
&lt;/h2&gt;

&lt;p&gt;One question remained: would these improvements hold up across various scenarios? Thorough testing commenced.&lt;/p&gt;

&lt;p&gt;Mahdi started where it all began: the tiny documents scenario. He ran the MongoDB connector on a stream of small 250-byte documents, first using the main version before switching to the improved branch. The measly ~6 MB/s throughput rate rose to around 17.5 MB/s, nearly tripling throughput for the small-documents use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho53g8znox409za3i9wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho53g8znox409za3i9wr.png" alt="Throughput rate for small-sized documents, first using Go, then Rust" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mahdi graphs out throughput results for the MongoDB connector, first using the original Go implementation, followed by the Rust transcoder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, this scenario was only ever meant as a test and example, a way to define how much overhead we were seeing as the connector processed documents.&lt;/p&gt;

&lt;p&gt;Mahdi therefore reran the test, this time using 20 KB documents, a more standard size. The original 34 MB/s rate jumped to 57 MB/s, almost doubling throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu5quofu12wtuyzpenue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu5quofu12wtuyzpenue.png" alt="Throughput rate for average-sized documents, first using Go, then Rust" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The difference when using larger documents is still substantial, even if less pronounced.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This rate was much more reasonable, allowing for around 200 GB of data ingestion per hour and ensuring the Estuary connector could keep up with higher volume use cases.&lt;/p&gt;

&lt;p&gt;What this means in practical terms is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Huge initial databases would get backfilled in half the time&lt;/li&gt;
&lt;li&gt;The platform would be able to handle twice as much data in continuous CDC mode&lt;/li&gt;
&lt;li&gt;Spikes in activity would be handled more gracefully: instead of choking performance, real-time events would stay real-time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After review and approval, Mahdi rolled out the changes to a select set of users first so he could closely monitor affected pipelines. He would be ready to quickly revert or revise as needed if any problems arose.&lt;/p&gt;

&lt;p&gt;With so many use cases and interactions, one minor issue did rear its head: Rust and Go handle invalid UTF-8 characters differently. With a little more customization, Mahdi adjusted the connector’s leniency toward invalid characters to mimic the original Go behavior.&lt;/p&gt;

&lt;p&gt;Other than that, the rollout was smooth sailing, with capture throughput ticking upwards across the board.&lt;/p&gt;

&lt;p&gt;So if you recently noticed your MongoDB capture speeding up: now you know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;While 200 GB an hour is a decent clip, Mahdi noted that there is still room for further improvement. The main issue now is that the connector is relatively CPU-bound. And, after all, efficiency is one of those goals that doesn’t have a specific end.&lt;/p&gt;

&lt;p&gt;For now, though, there are new challenges to face.&lt;/p&gt;

&lt;p&gt;To test out the capture connector’s speed yourself, &lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;try it out in Estuary&lt;/a&gt;. Or &lt;a href="https://estuary.dev/contact-us/" rel="noopener noreferrer"&gt;set up a call&lt;/a&gt; to discuss how the connector could fit into your particular use case.&lt;/p&gt;

&lt;p&gt;Or if you’re simply interested in switching to Rust for faster BSON decoding in your own code, check out Mahdi’s repo on &lt;a href="https://github.com/mdibaiee/bson-benchmarks" rel="noopener noreferrer"&gt;benchmarking Rust and Go&lt;/a&gt; or his work in &lt;a href="https://github.com/estuary/connectors/pull/3596" rel="noopener noreferrer"&gt;Estuary’s source code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>How to Stream OLTP Data to MotherDuck in Real Time with Estuary</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Fri, 26 Sep 2025 05:51:23 +0000</pubDate>
      <link>https://forem.com/estuary/from-oltp-to-olap-streaming-databases-into-motherduck-with-estuary-1nd4</link>
      <guid>https://forem.com/estuary/from-oltp-to-olap-streaming-databases-into-motherduck-with-estuary-1nd4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;DuckDB is quickly becoming one of the most talked-about analytical databases. It is fast, lightweight, and designed to run inside your applications, often described as &lt;em&gt;SQLite for analytics&lt;/em&gt;. While it works great on a laptop for local analysis, production workflows need something more scalable.&lt;br&gt;&lt;br&gt;
That is where &lt;strong&gt;MotherDuck&lt;/strong&gt; comes in. MotherDuck takes the power of DuckDB and brings it to the cloud. It adds collaboration features, secure storage, and a serverless model that lets teams use DuckDB at scale without worrying about infrastructure.&lt;br&gt;&lt;br&gt;
In this guide, you will learn how to stream data from an OLTP system into MotherDuck using &lt;strong&gt;Estuary&lt;/strong&gt;. This approach lets you run analytical queries on fresh data without putting extra load on your production database.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;🎥 Prefer watching instead of reading? Check out the short walkthrough below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2flyH-rjmqI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DuckDB Is Gaining Traction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; is an open source analytical database designed with a clear goal: to make complex queries fast and simple without heavy infrastructure. Instead of being a traditional client-server database, DuckDB is embedded. It runs inside the host process, which reduces overhead and makes it easy to integrate directly into applications, notebooks, or scripts.&lt;br&gt;&lt;br&gt;
Several features stand out:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-process operation&lt;/strong&gt;: Similar to SQLite, DuckDB runs where your code runs. This avoids network calls and gives you low-latency access to data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columnar and vectorized execution&lt;/strong&gt;: DuckDB is optimized for analytical queries. Its execution model speeds up heavy operations such as aggregations, filtering, and joins on large tables.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability and extensibility&lt;/strong&gt;: It has a very small footprint and no external dependencies. At the same time, extensions support advanced data types and file formats, including Parquet, JSON, and geospatial data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless file access&lt;/strong&gt;: DuckDB can query local files directly without requiring an ETL pipeline. For example, you can run SQL queries on CSV or Parquet files straight from disk.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with data science tools&lt;/strong&gt;: DuckDB connects smoothly with Python, R, and Jupyter notebooks, which makes it a favorite among data scientists.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this balance of speed, flexibility, and simplicity, DuckDB is increasingly used as the analytical layer in modern data pipelines, as well as for ad hoc analysis by engineers and analysts.&lt;/p&gt;
&lt;h2&gt;
  
  
  MotherDuck: DuckDB in the Cloud
&lt;/h2&gt;

&lt;p&gt;DuckDB is excellent for local analysis, but production environments often require more than a local embedded database. Teams need collaboration, security, and scalability. That is where &lt;strong&gt;&lt;a href="https://motherduck.com/" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;&lt;/strong&gt; comes in.&lt;br&gt;&lt;br&gt;
MotherDuck is a managed cloud service built on top of DuckDB. It extends the same fast and lightweight query engine into a serverless environment while adding features that make it practical for organizations:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless architecture&lt;/strong&gt;: No servers to manage and no infrastructure overhead. MotherDuck scales automatically with your workloads.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Share queries, results, and datasets with teammates in real time. This makes it easier for teams to work from the same source of truth.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure secret storage&lt;/strong&gt;: Manage credentials and connections safely in the cloud.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with pipelines&lt;/strong&gt;: Platforms like Estuary can write directly into MotherDuck, which means your data is always fresh and ready for analysis.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, MotherDuck gives teams the best of both worlds: the performance and simplicity of DuckDB combined with the scalability and ease of use of a modern cloud service.&lt;/p&gt;
&lt;h2&gt;
  
  
  OLTP → OLAP: The Core Use Case
&lt;/h2&gt;

&lt;p&gt;Most production applications run on OLTP databases such as PostgreSQL, MySQL, or MongoDB. These systems are designed for fast inserts, updates, and deletes. They keep applications responsive but are not optimized for running heavy analytical queries.  &lt;/p&gt;

&lt;p&gt;Running aggregations, joins, or reports directly on an OLTP database can:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow down your application performance.
&lt;/li&gt;
&lt;li&gt;Increase operational risk by adding load to your production environment.
&lt;/li&gt;
&lt;li&gt;Limit the ability of analysts and data scientists to explore data freely.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why organizations separate &lt;strong&gt;OLTP (transactional)&lt;/strong&gt; systems from &lt;strong&gt;OLAP (analytical)&lt;/strong&gt; systems. The OLTP database handles day-to-day transactions, while an OLAP database is dedicated to complex queries and reporting.  &lt;/p&gt;

&lt;p&gt;DuckDB, and by extension MotherDuck, fits perfectly as an OLAP layer. With &lt;strong&gt;&lt;a href="https://estuary.dev/product/" rel="noopener noreferrer"&gt;Estuary&lt;/a&gt;&lt;/strong&gt;, you can capture real-time changes from your OLTP source and stream them into MotherDuck. This way, analysts always have up-to-date data to query without touching the production database.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Setting Up Estuary with MotherDuck
&lt;/h2&gt;

&lt;p&gt;In this section, we’ll walk through the process of connecting your OLTP source to MotherDuck using Estuary. The setup is straightforward and only takes a few steps.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Prepare Your Source in Estuary
&lt;/h3&gt;

&lt;p&gt;Before you can send data to MotherDuck, you need a source system connected in Estuary. A source could be any OLTP database such as PostgreSQL, MySQL, or MongoDB. Estuary also supports SaaS applications, event streams, and file-based sources.  &lt;/p&gt;

&lt;p&gt;To prepare a source:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;strong&gt;Captures&lt;/strong&gt; tab in the Estuary dashboard.
&lt;/li&gt;
&lt;li&gt;Create a new capture and select the connector for your source system.
&lt;/li&gt;
&lt;li&gt;Provide the connection details (for example, host, port, database name, and credentials).
&lt;/li&gt;
&lt;li&gt;Save and publish the capture.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once this is done, Estuary begins ingesting data from your source and continuously tracks new changes. This stream of data is stored in an internal collection, which you will later connect to MotherDuck.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip&lt;/em&gt;: If you are new to Estuary, try starting with a simple dataset (like PostgreSQL or a CSV file) before moving on to production-scale sources. &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a MotherDuck Materialization
&lt;/h3&gt;

&lt;p&gt;With your source capture running, the next step is to &lt;a href="https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/" rel="noopener noreferrer"&gt;set up MotherDuck&lt;/a&gt; as the destination for your data. In Estuary, this is called a &lt;strong&gt;materialization&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4qmbqsoxzdjfxz30lw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4qmbqsoxzdjfxz30lw9.png" alt="Search for “MotherDuck” in the Estuary catalog and choose it as your materialization connector."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To create one:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;strong&gt;Destinations&lt;/strong&gt; tab in the Estuary dashboard.
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;New Materialization&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;MotherDuck&lt;/strong&gt; in the connector catalog and select it.
&lt;/li&gt;
&lt;li&gt;Give the materialization a descriptive name so you can easily identify it later.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, you will see the configuration screen for the MotherDuck connector. This is where you provide the details that allow Estuary to stage data and deliver it into your MotherDuck database.  &lt;/p&gt;

&lt;p&gt;In the next step, you’ll configure &lt;strong&gt;AWS S3 staging&lt;/strong&gt;, which Estuary uses as a temporary storage location for data loads.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Configure AWS S3 Staging
&lt;/h3&gt;

&lt;p&gt;The MotherDuck connector in Estuary uses an Amazon S3 bucket as a staging area. Data is first written to S3, then loaded into MotherDuck. This design ensures high reliability and scalability for large datasets.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgu04n0bform8cgu7vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgu04n0bform8cgu7vi.png" alt="Example IAM users in AWS for Estuary and MotherDuck. Each user should have S3 read and write permissions."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s what you need to set up:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create or choose an S3 bucket&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note down the bucket name and its region.
&lt;/li&gt;
&lt;li&gt;Optionally, you can define a prefix if you want Estuary to organize staged files under a specific folder.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set up IAM permissions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html" rel="noopener noreferrer"&gt;Create or use an IAM user&lt;/a&gt; that has read and write access to the S3 bucket.
&lt;/li&gt;
&lt;li&gt;Attach a policy with at least the following actions:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3:PutObject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3:GetObject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3:ListBucket&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate access keys&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the AWS console, go to the IAM user’s &lt;strong&gt;Security Credentials&lt;/strong&gt; tab.
&lt;/li&gt;
&lt;li&gt;Create an access key and secret key.
&lt;/li&gt;
&lt;li&gt;Copy these values into the Estuary dashboard when configuring the MotherDuck connector.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
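&lt;p&gt;The IAM permissions in step 2 might look like the following minimal policy sketch, where &lt;code&gt;YOUR_BUCKET&lt;/code&gt; is a placeholder for your bucket name (add a prefix to the object resource if you configured one, and scope down further as needed):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET"
    }
  ]
}
```

&lt;p&gt;Note that &lt;code&gt;s3:ListBucket&lt;/code&gt; applies to the bucket ARN itself, while the object actions apply to the &lt;code&gt;/*&lt;/code&gt; object ARN.&lt;/p&gt;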

&lt;p&gt;At this point, Estuary knows where to stage data and has the permissions needed to write into your S3 bucket.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip&lt;/em&gt;: For production, avoid using a root account. Always generate access keys from an IAM user with the least privileges necessary.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Set Up MotherDuck
&lt;/h3&gt;

&lt;p&gt;Now that AWS S3 staging is ready, it’s time to configure the MotherDuck side of the connection. This step makes sure MotherDuck can pull the staged data into your chosen database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xb3jypjvhv0230wecu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xb3jypjvhv0230wecu1.png" alt="Example of the MotherDuck connector configuration in Estuary, with service token, database, and S3 staging details filled in."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate an access token&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to your MotherDuck account.
&lt;/li&gt;
&lt;li&gt;Open the &lt;strong&gt;Settings&lt;/strong&gt; menu and go to &lt;strong&gt;Access Tokens&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Create a new token and copy it into the Estuary connector configuration.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide AWS credentials to MotherDuck&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MotherDuck needs permission to read the staged files from your S3 bucket.
&lt;/li&gt;
&lt;li&gt;You can provide these credentials in one of two ways:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;a. By running SQL statements inside MotherDuck:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="n"&gt;aws_access_key&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;ACCESS_KEY&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="n"&gt;aws_secret_key&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;SECRET_KEY&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;b. Or by entering them through the MotherDuck UI.  &lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Choose a target database&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select an existing database in your MotherDuck account, or create a new one.
&lt;/li&gt;
&lt;li&gt;Copy its name into the Estuary configuration.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Decide on delete behavior&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Soft deletes&lt;/strong&gt;: Mark a record as deleted but keep it in the table for historical analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard deletes&lt;/strong&gt;: Remove the record entirely.
&lt;/li&gt;
&lt;li&gt;Choose the option that best matches your analytics or compliance needs.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;
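&lt;p&gt;To make the difference concrete, here is a minimal standalone sketch of the two delete modes, using plain SQLite and a made-up &lt;code&gt;_meta_deleted&lt;/code&gt; flag column (the connector's actual column naming may differ):&lt;/p&gt;

```python
import sqlite3

# Hypothetical local illustration of soft vs hard deletes.
# Table and column names here are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, _meta_deleted INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])

# Soft delete: flag the row but keep it for historical analysis.
conn.execute("UPDATE orders SET _meta_deleted = 1 WHERE id = 2")

# Hard delete: the row is removed entirely.
conn.execute("DELETE FROM orders WHERE id = 3")

live = conn.execute("SELECT COUNT(*) FROM orders WHERE _meta_deleted = 0").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(live, total)  # soft-deleted rows still count toward the total
```

&lt;p&gt;Soft deletes preserve history at the cost of table size; hard deletes keep tables lean but discard the audit trail.&lt;/p&gt;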

&lt;h3&gt;
  
  
  Step 5: Publish and Stream Data
&lt;/h3&gt;

&lt;p&gt;Once your MotherDuck materialization is configured, the final step is to publish it and start the data flow.  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Select your source data&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link an entire capture (for example, your PostgreSQL database).
&lt;/li&gt;
&lt;li&gt;Or choose specific collections you want to replicate.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review the configuration&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double-check that your S3 credentials, MotherDuck token, and database name are correct.
&lt;/li&gt;
&lt;li&gt;Make sure you selected the right delete behavior (soft or hard).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save and publish&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;Next&lt;/strong&gt;, then &lt;strong&gt;Save &amp;amp; Publish&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Estuary will immediately begin streaming data from your OLTP source into MotherDuck.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From here, data updates in your source will flow continuously into your MotherDuck database. This gives you a near real-time OLAP environment for analytics, without adding load to your production system.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Query in MotherDuck
&lt;/h3&gt;

&lt;p&gt;With the connector published, your data is now flowing into MotherDuck. The final step is to start exploring it.  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;MotherDuck dashboard&lt;/strong&gt; and go to &lt;strong&gt;Notebooks&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Select the database you configured as the destination.
&lt;/li&gt;
&lt;li&gt;Run queries using DuckDB’s familiar &lt;a href="https://duckdb.org/docs/stable/sql/introduction.html" rel="noopener noreferrer"&gt;SQL syntax&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if you replicated an &lt;code&gt;orders&lt;/code&gt; table from your OLTP database, you could analyze top customers like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch5mqt7dn8d3ha7a5ake.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch5mqt7dn8d3ha7a5ake.png" alt="Running a SQL query in MotherDuck to explore the replicated dataset streamed through Estuary."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;By combining Estuary and MotherDuck, you can build a modern pipeline that keeps analytics separate from your production workload without adding extra complexity.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estuary captures real-time changes from your OLTP databases.
&lt;/li&gt;
&lt;li&gt;Data is staged in S3 for reliability.
&lt;/li&gt;
&lt;li&gt;MotherDuck provides a cloud-native DuckDB environment where your team can query and collaborate.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup is fast to configure, easy to maintain, and scales with your needs. Instead of managing batch jobs or writing custom scripts, you can focus on analysis and insights.  &lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB is lightweight and powerful for analytics, while MotherDuck brings it to the cloud for collaboration and scalability.
&lt;/li&gt;
&lt;li&gt;Estuary makes it simple to stream data from OLTP systems into MotherDuck in real time.
&lt;/li&gt;
&lt;li&gt;AWS S3 is used as a staging layer, requiring IAM permissions and credentials.
&lt;/li&gt;
&lt;li&gt;Once published, you can query fresh data in MotherDuck notebooks using DuckDB SQL.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Ready to try it yourself? &lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;Explore Estuary&lt;/a&gt; and see how quickly you can start streaming data into MotherDuck.  &lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>dataengineering</category>
      <category>motherduck</category>
      <category>database</category>
    </item>
    <item>
      <title>2025 Data Warehouse Benchmark: What BigQuery, Snowflake, and Others Don’t Tell You</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Thu, 17 Jul 2025 08:11:33 +0000</pubDate>
      <link>https://forem.com/estuary/2025-data-warehouse-benchmark-what-bigquery-snowflake-and-others-dont-tell-you-392a</link>
      <guid>https://forem.com/estuary/2025-data-warehouse-benchmark-what-bigquery-snowflake-and-others-dont-tell-you-392a</guid>
      <description>&lt;h1&gt;
  
  
  We Benchmark-Tested 5 Data Warehouses. Here's What Broke.
&lt;/h1&gt;

&lt;p&gt;Choosing a data warehouse shouldn’t feel like a gamble — but it often is.&lt;/p&gt;

&lt;p&gt;Marketing sites are polished. Demos are cherry-picked. Docs are full of high-level promises. But when your data team starts moving &lt;strong&gt;terabytes of real data&lt;/strong&gt;, things change fast: performance bottlenecks, cost spikes, memory errors… and sometimes complete failure.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary&lt;/a&gt;, we help teams build real-time data pipelines that push warehouses hard — across batch and streaming. We’ve seen the consequences of choosing the wrong warehouse. So we built the &lt;strong&gt;benchmark we wish existed earlier&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 The Estuary 2025 Data Warehouse Benchmark
&lt;/h2&gt;

&lt;p&gt;We benchmarked 5 major data warehouses under real workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google BigQuery&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Fabric&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We didn’t just run canned TPC-H queries — we loaded &lt;strong&gt;over 8TB of structured + semi-structured data&lt;/strong&gt;, then hit each platform with real-world SQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrfmdr3mtg0lxh18x9tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrfmdr3mtg0lxh18x9tb.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Joins, window functions, filters, and nesting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-F&lt;/strong&gt; (“The Frankenquery”) — a deliberately brutal query that pushes limits&lt;/li&gt;
&lt;li&gt;Full lifecycle tracking from ingest to query via &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary Flow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cost-to-runtime ratios with no vendor tuning or caching games&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📂 Our full methodology is &lt;a href="https://github.com/estuary/estuary-warehouse-benchmark" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. Clone it. Run your own tests. Contribute.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧠 What We Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🔵 BigQuery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fast — especially on nested JSON
&lt;/li&gt;
&lt;li&gt;But &lt;strong&gt;zero cost guardrails&lt;/strong&gt; = high bill risk
&lt;/li&gt;
&lt;li&gt;Cost-per-minute hit &lt;strong&gt;$15+&lt;/strong&gt; under some setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚪ Snowflake
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stable, predictable, smart scaling
&lt;/li&gt;
&lt;li&gt;Good balance of performance and cost
&lt;/li&gt;
&lt;li&gt;Strong default choice for teams who want reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟨 Databricks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Great for ML workflows
&lt;/li&gt;
&lt;li&gt;SQL under load? Needs tuning
&lt;/li&gt;
&lt;li&gt;Performance quirks at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟥 Redshift &amp;amp; 🟩 Fabric
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Memory errors, long runtimes, incomplete results
&lt;/li&gt;
&lt;li&gt;Multiple queries failed or stalled for hours
&lt;/li&gt;
&lt;li&gt;Definitely &lt;strong&gt;not&lt;/strong&gt; plug-and-play ready&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📉 Chart: Cost vs Runtime
&lt;/h2&gt;


&lt;p&gt;The cost-vs-runtime chart tracks &lt;strong&gt;$ per minute of query runtime&lt;/strong&gt; across warehouses and instance sizes.&lt;br&gt;&lt;br&gt;
Red bands = platforms that failed under load or threw memory errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Rankings That Actually Matter
&lt;/h2&gt;

&lt;p&gt;We scored each platform on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost-efficiency 💰
&lt;/li&gt;
&lt;li&gt;Runtime performance ⚡
&lt;/li&gt;
&lt;li&gt;Scalability 📈
&lt;/li&gt;
&lt;li&gt;Reliability under pressure 🧱
&lt;/li&gt;
&lt;li&gt;Startup-friendliness 🚀
&lt;/li&gt;
&lt;li&gt;Enterprise readiness 🏢&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 Some platforms were efficient at small scale but crashed under growth. Others performed well but cost 10x more than peers.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📥 Get the Full Report
&lt;/h2&gt;

&lt;p&gt;If you’re:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Planning a warehouse migration
&lt;/li&gt;
&lt;li&gt;Scaling analytics or ML pipelines
&lt;/li&gt;
&lt;li&gt;Comparing Snowflake vs BigQuery vs Databricks
&lt;/li&gt;
&lt;li&gt;Or just tired of guessing…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://estuary.dev/data-warehouse-benchmark-report/" rel="noopener noreferrer"&gt;&lt;strong&gt;Download the full benchmark report&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  👨‍🔬 Built by Engineers, Not Marketers
&lt;/h2&gt;

&lt;p&gt;We created this benchmark at Estuary because we work with these warehouses daily. Our product — &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary Flow&lt;/a&gt; — streams real-time data from sources like PostgreSQL, Kafka, MongoDB, and SaaS apps into modern warehouses.&lt;/p&gt;

&lt;p&gt;We’ve helped teams recover from 18-month migrations and $100k+ in wasted compute. So we’re publishing what we’ve learned.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤝 Contribute or fork the test harness here:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/estuary/estuary-warehouse-benchmark" rel="noopener noreferrer"&gt;🔗 GitHub Repo&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/estuary" rel="noopener noreferrer"&gt;🌐 Estuary GitHub&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  💬 Join the Discussion
&lt;/h2&gt;

&lt;p&gt;Have you had similar (or better?) experiences with these platforms?&lt;br&gt;&lt;br&gt;
Spot something we should test next?&lt;/p&gt;

&lt;p&gt;Drop your thoughts, logs, or horror stories in the comments. We’re all ears 👇&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>cloud</category>
      <category>datawarehouse</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Refresh Smarter: How Estuary’s Dataflow Reset Makes Backfills a Breeze</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Tue, 15 Jul 2025 04:14:10 +0000</pubDate>
      <link>https://forem.com/estuary/refresh-smarter-how-estuarys-dataflow-reset-makes-backfills-a-breeze-4jd8</link>
      <guid>https://forem.com/estuary/refresh-smarter-how-estuarys-dataflow-reset-makes-backfills-a-breeze-4jd8</guid>
      <description>&lt;p&gt;Backfills have always been a critical - but sometimes tedious - part of managing robust data pipelines. Whether you're dealing with schema drift, outdated destination tables, or bad source data, initiating a full reset of your pipeline used to require multiple steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not anymore.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Estuary’s new Dataflow Reset&lt;/strong&gt; feature, you can perform a clean-sweep backfill in just one step - reloading your sources, refreshing schemas, re-triggering derivations, and updating destination tables - all at once.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is a Dataflow Reset?
&lt;/h2&gt;

&lt;p&gt;A Dataflow Reset is Estuary’s one-click solution to refresh your &lt;strong&gt;entire dataflow&lt;/strong&gt;. It works from top to bottom:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-extracts data from the source
&lt;/li&gt;
&lt;li&gt;Re-runs all derivations
&lt;/li&gt;
&lt;li&gt;Recalculates schemas using updated data
&lt;/li&gt;
&lt;li&gt;Rebuilds destination tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just a re-run - it's a &lt;strong&gt;recalibration&lt;/strong&gt;. If your schemas previously became too broad (due to inconsistent or junk data), the reset starts fresh and reflects the true shape of your source.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Use It?
&lt;/h2&gt;

&lt;p&gt;The new Dataflow Reset option is ideal for scenarios like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structural changes in your source system
&lt;/li&gt;
&lt;li&gt;Schema inference gone awry
&lt;/li&gt;
&lt;li&gt;Destination tables out of sync with upstream logic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; It automatically tracks which downstream resources (like materializations) need updating - no manual selection required.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to your &lt;strong&gt;Capture&lt;/strong&gt; in the Estuary Flow web app.
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Edit&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Backfill&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;The default backfill mode will now trigger a &lt;strong&gt;Dataflow Reset&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it - your pipeline is reset and refreshed in one action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3l6rq9ira7bzcqj1r7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3l6rq9ira7bzcqj1r7.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prefer Fine-Grained Control?
&lt;/h2&gt;

&lt;p&gt;You can still choose from advanced backfill options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incremental Backfill&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Reprocess only the source data while keeping the existing destination intact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Materialization-Only Backfill&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Rebuild destination tables from current collection data - no need to touch the source.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These modes are perfect for more targeted recovery and testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Known Limitation
&lt;/h2&gt;

&lt;p&gt;Avoid using &lt;strong&gt;Dataflow Reset&lt;/strong&gt; with &lt;strong&gt;Dekaf materializations&lt;/strong&gt; (Estuary’s Kafka-compatible interface). This combination is currently unsupported.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;Want a deeper dive into backfilling options, use cases, and caveats? Check out the Estuary docs:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.estuary.dev/reference/backfilling-data/" rel="noopener noreferrer"&gt;https://docs.estuary.dev/reference/backfilling-data/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataflow Reset&lt;/strong&gt; is a full-pipeline refresh: source -&amp;gt; schema -&amp;gt; derivation -&amp;gt; destination
&lt;/li&gt;
&lt;li&gt;Automatically recalculates schema to avoid issues caused by bad or outdated data
&lt;/li&gt;
&lt;li&gt;Easy to trigger and safer than ever to run
&lt;/li&gt;
&lt;li&gt;Still supports advanced, partial backfill modes
&lt;/li&gt;
&lt;li&gt;Avoid using with Dekaf (for now)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make your next backfill a breeze with Estuary.&lt;/p&gt;

</description>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Load Data from Amazon S3 to Snowflake in Real Time</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Wed, 09 Jul 2025 06:39:46 +0000</pubDate>
      <link>https://forem.com/estuary/how-to-load-data-from-amazon-s3-to-snowflake-in-real-time-4i02</link>
      <guid>https://forem.com/estuary/how-to-load-data-from-amazon-s3-to-snowflake-in-real-time-4i02</guid>
      <description>&lt;p&gt;Got a bunch of raw data sitting in Amazon S3 and need to get it into Snowflake for analytics — fast? You’re not alone.&lt;/p&gt;

&lt;p&gt;Maybe it’s JSON logs, CSV exports, or event data piling up in your S3 bucket. Maybe you’ve tried batch pipelines or custom scripts but ran into delays, duplicates, or schema chaos. What you actually need is a clean, reliable way to load that S3 data to Snowflake, without spending weeks building and maintaining it.&lt;/p&gt;

&lt;p&gt;That’s exactly what Estuary Flow is built for.&lt;/p&gt;

&lt;p&gt;Flow makes it easy to build real-time S3 to Snowflake data pipelines with no code, no ops overhead, and no latency headaches. It connects directly to your S3 bucket, picks up new files as they arrive, and keeps your Snowflake warehouse in sync continuously.&lt;/p&gt;

&lt;p&gt;In this walkthrough, we’ll show you how to set up an Amazon S3 to Snowflake pipeline using Estuary Flow from start to finish. You’ll go from raw files to live Snowflake tables in just a few steps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: If you're looking to stream data from Amazon S3 to Snowflake, you're in the right place — and Flow makes it a breeze.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Stream S3 Data to Snowflake in Real Time?
&lt;/h2&gt;

&lt;p&gt;Let’s be honest — batch processing worked fine back when dashboards only needed to update once a day. But today, teams expect real-time answers: marketing needs up-to-the-minute campaign performance, operations teams need live inventory data, and product managers want to react to user behavior as it happens.&lt;/p&gt;

&lt;p&gt;That’s where streaming data from S3 to Snowflake changes the game.&lt;/p&gt;

&lt;p&gt;If you’re storing raw files — like logs, events, or exports — in Amazon S3, you’re already halfway there. The missing piece is a low-latency pipeline that gets that data into Snowflake the moment it arrives. No waiting for hourly jobs. No stale reports. Just fresh, query-ready data flowing in 24/7.&lt;/p&gt;

&lt;p&gt;Here are a few reasons real-time sync matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analytics that actually keep up – Get real-time insights instead of reading yesterday’s data.
&lt;/li&gt;
&lt;li&gt;Automation that reacts fast – Trigger workflows in Snowflake based on live S3 updates.
&lt;/li&gt;
&lt;li&gt;Simplified ops – Eliminate brittle scripts, manual backfills, and sync delays.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Rather than consuming S3 event notifications, Flow polls your bucket every few minutes to detect new files, then streams them to Snowflake immediately. It’s batch under the hood, but real-time in effect.&lt;/p&gt;
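&lt;p&gt;The polling pattern behind this is easy to picture. Here's a hypothetical, stripped-down sketch using a local directory in place of an S3 bucket (the real connector talks to the S3 API and tracks its state durably; treat this purely as an illustration):&lt;/p&gt;

```python
from pathlib import Path

def poll_for_new_files(directory, seen, handle):
    """One polling pass: hand any file we haven't seen before to `handle`."""
    for path in sorted(Path(directory).glob("*.json")):
        if path.name not in seen:
            handle(path)
            seen.add(path.name)

# A scheduler would invoke this every few minutes, e.g.:
#   while True:
#       poll_for_new_files("landing/", seen, load_into_warehouse)
#       time.sleep(300)  # a 5-minute cadence, like Flow's default
```

&lt;p&gt;Each pass only processes files it hasn't recorded yet, which is what makes "batch under the hood" feel real-time in effect.&lt;/p&gt;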

&lt;h2&gt;
  
  
  Why Use Estuary Flow Instead of Traditional ETL Tools?
&lt;/h2&gt;

&lt;p&gt;If you’ve tried to move data from Amazon S3 to Snowflake before, you probably know the drill: patch together an ETL tool, deal with scheduling, wrestle with schema mismatches, and hope the job doesn’t break halfway through.&lt;/p&gt;

&lt;p&gt;The thing is, most ETL tools were built for a different era — one where “real time” meant “hourly,” and everything ran in batches. Estuary Flow flips that on its head.&lt;/p&gt;

&lt;p&gt;Here’s how Flow makes your S3 to Snowflake pipeline way easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time by Default:&lt;/strong&gt; Flow isn’t just fast — it’s built for continuous streaming. Once you connect your S3 bucket, Flow automatically picks up new files as they land and streams the data directly into Snowflake.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Code Required:&lt;/strong&gt; Set up everything — capture, schema, and materialization — through a clean UI. You don’t need to write Python, wrangle Airflow, or babysit cron jobs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema-Aware + Smart:&lt;/strong&gt; Flow infers the structure of your S3 data and helps you map it to Snowflake tables. You can tighten up schemas, apply transformations, and evolve structure over time without breaking your pipeline.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-Once Delivery:&lt;/strong&gt; No duplicates. No reprocessing. Flow uses cloud-native guarantees to ensure data lands in Snowflake exactly once, even if things get weird.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built to Scale:&lt;/strong&gt; Whether you're syncing a few JSON files or streaming terabytes a day, Flow scales automatically without locking you into complex infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estuary Flow takes the friction out of real-time data integration from S3 to Snowflake, so you can focus on using the data, not moving it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Need to Get Started
&lt;/h2&gt;

&lt;p&gt;You don’t need much to build an Amazon S3 to Snowflake pipeline with Estuary Flow — just a few basics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Estuary Flow Account
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;Sign up for free&lt;/a&gt; to access the Flow web app — no downloads or setup required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon S3 Bucket
&lt;/h3&gt;

&lt;p&gt;This is your data source. You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket name &amp;amp; region
&lt;/li&gt;
&lt;li&gt;Either public access or your AWS access key + secret key
&lt;/li&gt;
&lt;li&gt;(Optional) A folder path, called a “prefix”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Snowflake Account
&lt;/h3&gt;

&lt;p&gt;Your destination for the data. Make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A database, schema, and virtual warehouse
&lt;/li&gt;
&lt;li&gt;A user with access
&lt;/li&gt;
&lt;li&gt;Your account URL + login credentials
&lt;/li&gt;
&lt;li&gt;(Optional) warehouse name and role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. With these in place, you’re ready to connect the pieces and start streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Capture Data from Amazon S3
&lt;/h2&gt;

&lt;p&gt;First up, you’ll connect Estuary Flow to your S3 bucket — this step is called a capture. It’s how Flow knows where to pull your data from.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oifx0r91ls9pq46w9ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oifx0r91ls9pq46w9ao.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into Estuary Flow at &lt;a href="https://dashboard.estuary.dev/" rel="noopener noreferrer"&gt;dashboard.estuary.dev&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Click the Sources tab and select New Capture. &lt;/li&gt;
&lt;li&gt;Choose Amazon S3 from the list of connectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ll see a form where you enter your S3 details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture name – Something like myorg/s3-orders
&lt;/li&gt;
&lt;li&gt;AWS credentials – Only needed if your bucket isn’t public
&lt;/li&gt;
&lt;li&gt;Bucket name &amp;amp; region – From your S3 console
&lt;/li&gt;
&lt;li&gt;Prefix (optional) – To pull from a specific folder
&lt;/li&gt;
&lt;li&gt;Match keys (optional) – For filtering files, like *.json&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you click Next, Flow will connect to your bucket and auto-generate a schema based on your data. You’ll see a preview of your Flow collection — this acts as a live copy of your S3 data inside Flow.&lt;/p&gt;

&lt;p&gt;Click Save and Publish to finish the capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behind the scenes, Flow checks your S3 bucket on a 5-minute schedule (by default) to pick up new or updated files. This polling loop is what delivers near-real-time sync on top of static object storage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, let’s connect this to Snowflake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Materialize to Snowflake
&lt;/h2&gt;

&lt;p&gt;Now that your data is flowing into Estuary, it’s time to materialize it to Snowflake — in other words, stream it directly into a Snowflake table.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6lpje0vpzaksqr15o0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6lpje0vpzaksqr15o0l.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After saving your S3 capture, click Materialize Collections.
&lt;/li&gt;
&lt;li&gt;Choose the Snowflake connector from the destination list.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ll fill out a simple form with your Snowflake details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Materialization name – e.g., myorg/s3-to-snowflake
&lt;/li&gt;
&lt;li&gt;Account URL – Like myorg-account.snowflakecomputing.com
&lt;/li&gt;
&lt;li&gt;User + Password – A Snowflake user with the right permissions
&lt;/li&gt;
&lt;li&gt;Database &amp;amp; Schema – Where the table will live
&lt;/li&gt;
&lt;li&gt;Warehouse – Optional, but recommended
&lt;/li&gt;
&lt;li&gt;Role – Optional if already assigned to the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once Flow connects, you’ll see your captured collection (from S3) listed.&lt;/p&gt;

&lt;p&gt;From here, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename the output table
&lt;/li&gt;
&lt;li&gt;Enable delta updates (if you want changes applied instead of full inserts)
&lt;/li&gt;
&lt;li&gt;Use Schema Inference to map your flat S3 data into Snowflake’s tabular format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;To do that:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click the Collection tab
&lt;/li&gt;
&lt;li&gt;Select Schema Inference
&lt;/li&gt;
&lt;li&gt;Review the suggested schema → Click Apply&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, hit Save and Publish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ That’s it — you’ve now got a fully working, real-time S3 to Snowflake pipeline. Flow will continuously sync new files from your bucket straight into your Snowflake warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next? Supercharge Your S3 to Snowflake Pipeline
&lt;/h2&gt;

&lt;p&gt;You now have a fully operational, real-time pipeline from Amazon S3 to Snowflake — and it runs continuously, no scripts or schedulers required.&lt;/p&gt;

&lt;p&gt;But that’s just the beginning.&lt;/p&gt;

&lt;p&gt;With Estuary Flow, you can take things even further:&lt;/p&gt;

&lt;h3&gt;
  
  
  Add Transformations (a.k.a. Derivations)
&lt;/h3&gt;

&lt;p&gt;Want to clean, filter, or join your data before it lands in Snowflake? Use derivations to apply real-time transformations using SQL or TypeScript, right inside Flow.&lt;br&gt;&lt;br&gt;
You can enrich JSON objects, flatten nested structures, or create entirely new views.&lt;/p&gt;
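&lt;p&gt;To illustrate what "flatten nested structures" means in practice, here's a tiny standalone Python sketch (Flow derivations themselves are written in SQL or TypeScript inside Flow; this only shows the shape of the transformation):&lt;/p&gt;

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

record = {"user": {"id": 7, "geo": {"city": "Oslo"}}, "event": "click"}
print(flatten(record))  # {'user.id': 7, 'user.geo.city': 'Oslo', 'event': 'click'}
```

&lt;p&gt;A flattened record like this maps cleanly onto warehouse columns, which is the same idea a derivation applies continuously to a collection.&lt;/p&gt;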

&lt;h3&gt;
  
  
  Plug into More Systems
&lt;/h3&gt;

&lt;p&gt;Need to send the same S3 data to BigQuery, Kafka, or a dashboard tool? Just add another materialization — Flow supports multi-destination sync out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor + Optimize
&lt;/h3&gt;

&lt;p&gt;Use Flow’s built-in observability tools or plug into OpenMetrics to monitor throughput, schema evolution, and pipeline health in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Streaming S3 Data to Snowflake Today
&lt;/h2&gt;

&lt;p&gt;The old way — batch jobs, manual scripts, clunky ETL — just can’t keep up with today’s speed of data.&lt;/p&gt;

&lt;p&gt;With Estuary Flow, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sync Amazon S3 to Snowflake in real time
&lt;/li&gt;
&lt;li&gt;Handle schema changes effortlessly
&lt;/li&gt;
&lt;li&gt;Scale without infrastructure headaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ready to go from raw files to real-time insights?&lt;br&gt;&lt;br&gt;
Try Estuary Flow for free and build your first streaming data pipeline today.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
