<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Estuary</title>
    <description>The latest articles on Forem by Estuary (@estuary).</description>
    <link>https://forem.com/estuary</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9217%2F645b5d1c-918b-4917-bae7-4467464c6119.png</url>
      <title>Forem: Estuary</title>
      <link>https://forem.com/estuary</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/estuary"/>
    <language>en</language>
    <item>
      <title>2x Faster MongoDB CDC: An Engineering Deep-Dive on Performance Optimization</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:29:53 +0000</pubDate>
      <link>https://forem.com/estuary/2x-faster-mongodb-cdc-an-engineering-deep-dive-on-performance-optimization-4ghb</link>
      <guid>https://forem.com/estuary/2x-faster-mongodb-cdc-an-engineering-deep-dive-on-performance-optimization-4ghb</guid>
      <description>&lt;p&gt;Estuary’s focus on in-house crafted connectors isn’t an accident.&lt;/p&gt;

&lt;p&gt;It’s not about keeping secrets; we’re not a black box factory and &lt;a href="https://github.com/estuary/connectors" rel="noopener noreferrer"&gt;connector source code&lt;/a&gt; is publicly available for anyone to review. It’s about maintaining the responsibility of ownership, starting with a high-quality base product, and refining from there.&lt;/p&gt;

&lt;p&gt;Integrations are specifically designed to work seamlessly with Estuary, providing standard customization options and converting data to standard formats with as little waste as possible. And connectors receive continuous updates to keep up with API changes or fine-tune performance.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://docs.estuary.dev/reference/Connectors/capture-connectors/MongoDB/" rel="noopener noreferrer"&gt;MongoDB capture connector&lt;/a&gt; recently received one of these upgrades: while the connector reliably got the job done, it could fall behind in high-volume enterprise use cases. This could be especially detrimental for real-time pipelines that counted on the connector’s functionality with MongoDB’s change streams—if the connector couldn’t keep up with the data coming in, downstream systems could experience delays.&lt;/p&gt;

&lt;p&gt;For real-time native applications, even small slowdowns have an outsized impact. Consider the route change notification for a shipment that arrives just after a driver misses the turnoff. Or a triage system that doesn't capture the latest developments in its priority calculations.&lt;/p&gt;

&lt;p&gt;It was definitely time for some optimization work.&lt;/p&gt;

&lt;p&gt;On the case was Mahdi Dibaiee. Based in Dublin, Ireland when not on adventures around the world, Mahdi has been a Senior Software Engineer with Estuary for nearly four years. Having worked on data planes, Estuary’s &lt;a href="https://docs.estuary.dev/guides/get-started-with-flowctl/" rel="noopener noreferrer"&gt;&lt;code&gt;flowctl&lt;/code&gt;&lt;/a&gt; CLI, and various connectors, his deep knowledge of the platform lets him flexibly pick up whatever tasks have current top priority.&lt;/p&gt;

&lt;p&gt;This is a behind-the-scenes look at how he analyzed the existing implementation’s limitations, researched solutions, and ended up with double the speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Small Documents
&lt;/h2&gt;

&lt;p&gt;“Make this integration faster,” while a laudable goal, isn’t much to go on. Why were captures falling behind? What was the expected throughput rate? And how could we find specific areas to improve?&lt;/p&gt;

&lt;p&gt;First, start with a baseline.&lt;/p&gt;

&lt;p&gt;The MongoDB capture connector tended towards a throughput rate of 34 MB/s when working with standard-sized documents, such as those around 20 KB apiece.&lt;/p&gt;

&lt;p&gt;To test how the connector would react under different circumstances, Mahdi tried it out against a stream of much smaller documents, each around 250 bytes.&lt;/p&gt;

&lt;p&gt;Something concerning happened when the connector processed these small documents. The capture’s ingestion rate dropped down to a meager 6 MB/s. While this “tiny document” use case would be unlikely in the wild, 6 MB/s was still far too slow.&lt;/p&gt;

&lt;p&gt;It also uncovered a possible path forward.&lt;/p&gt;

&lt;p&gt;“This told us that we had a large overhead-per-document,” Mahdi explained, which resulted in the abysmal slowdown.&lt;/p&gt;

&lt;p&gt;Essentially, all document processing would include some overhead. Changing the size of processed documents acted as a lever to quickly check just how much the overhead impacted performance: smaller documents with the same amount of overhead per document led to more overall time spent on the overhead rather than on making progress.&lt;/p&gt;
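&lt;p&gt;To see why a fixed per-document overhead hits small documents hardest, consider a back-of-the-envelope model. The constants below are illustrative assumptions, not Mahdi’s measurements: a fixed cost per document plus a per-byte processing cost.&lt;/p&gt;

```go
package main

import "fmt"

// effectiveThroughput models bytes-per-second throughput when every document
// pays a fixed overhead (overheadSec) on top of a per-byte processing cost.
func effectiveThroughput(docBytes, overheadSec, perByteSec float64) float64 {
	return docBytes / (overheadSec + docBytes*perByteSec)
}

func main() {
	// Illustrative (not measured) constants: 20 µs fixed overhead per
	// document, plus per-byte work equivalent to ~60 MB/s.
	const overhead = 20e-6
	const perByte = 1.0 / 60e6

	small := effectiveThroughput(250, overhead, perByte) / 1e6
	large := effectiveThroughput(20_000, overhead, perByte) / 1e6
	fmt.Printf("250 B docs: %.1f MB/s\n", small)  // ≈ 10.3 MB/s
	fmt.Printf("20 KB docs: %.1f MB/s\n", large)  // ≈ 56.6 MB/s
}
```

&lt;p&gt;With these made-up numbers, 250-byte documents manage under a fifth of the throughput of 20 KB documents, purely because the fixed per-document cost dominates.&lt;/p&gt;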

&lt;p&gt;If he could find ways to reduce that overhead, all pipelines should speed up, not just ones with tiny documents.&lt;/p&gt;

&lt;p&gt;But where exactly did that overhead come from? To tune the MongoDB capture’s performance, some digging would be required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reason Behind the Bottleneck
&lt;/h2&gt;

&lt;p&gt;To get a picture of the systems involved, Mahdi profiled a particular MongoDB capture that was struggling to keep up with its load.&lt;/p&gt;

&lt;p&gt;First, he ruled out a couple of obvious culprits. He checked CPU load and memory pressure on both MongoDB’s side and the capture connector’s side. Neither indicated any issues.&lt;/p&gt;

&lt;p&gt;Next, Mahdi wanted to see where Estuary spent the most time when ingesting data from MongoDB. He set up a detailed tracing view, dividing up the time for each data fetch and marking out network and CPU activity.&lt;/p&gt;

&lt;p&gt;The trace exposed two areas of note: one a suspiciously empty space, and one a suspiciously long process, both related to the connector call to get more documents. In total, this caused Estuary to spend around two seconds on each batch of fetched data, which isn’t quite the millisecond latency Estuary aims for.&lt;/p&gt;

&lt;p&gt;So, what was actually happening?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nrovhcq8n9p3pi459mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nrovhcq8n9p3pi459mm.png" alt="A 2-second slice of time showing CPU activity in the MongoDB connector" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Activity trace for a MongoDB capture. ~2 seconds is highlighted, showing a noticeable gap in CPU usage before a string of activity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first 600ms of this cycle corresponded to the data fetch itself. When one batch of data finished processing, the connector sent a request over the network for more, then started working on the new batch once it arrived.&lt;/p&gt;

&lt;p&gt;Because of this synchronous mode of operation, the connector essentially sat around for over half a second each time it checked for new data. When working with an end-to-end real-time system, those milliseconds in the pipeline add up. Not to mention the cumulative idle time, with the CPU doing nothing much for nearly a third of each cycle.&lt;/p&gt;

&lt;p&gt;There, then, was an obvious bottleneck, but the activity following the fetch was also curious. The remaining 1.4 seconds in the cycle were spent processing documents.&lt;/p&gt;

&lt;p&gt;By itself, emitting documents and checkpoints to Estuary shouldn’t take that long. But there was one more step in the processing phase that might: decoding MongoDB’s BSON documents in the first place.&lt;/p&gt;

&lt;p&gt;With the possibility of optimizing document processing in the mix, there were now two avenues to improve the connector’s performance.&lt;/p&gt;

&lt;p&gt;Why not implement both?&lt;/p&gt;

&lt;h2&gt;
  
  
  From Go to Rust: An Expedient Solution
&lt;/h2&gt;

&lt;p&gt;The CPU’s idle time was perhaps the more straightforward fix. Mahdi immediately identified that making the connector slightly more asynchronous would keep the CPU busy and shave those 600ms off each batch.&lt;/p&gt;

&lt;p&gt;To do so, he modified Estuary’s MongoDB connector to pre-fetch the next batch while still processing the current one. To preserve ordering and bound memory usage, he limited the number of in-flight batches to four. With a maximum of 16 MB per MongoDB cursor batch, this caps the connector’s memory consumption at 64 MB.&lt;/p&gt;
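&lt;p&gt;The prefetching pattern can be sketched with a bounded channel in standard-library Go. This is a simplified illustration of the idea, not the connector’s actual code:&lt;/p&gt;

```go
package main

import "fmt"

// batch stands in for one MongoDB cursor batch (up to 16 MB in the real connector).
type batch struct{ id int }

// fetchBatches simulates the network fetch loop: it pulls batches from the
// source and sends them into a bounded channel. The channel's capacity limits
// how many batches are in flight, bounding memory while preserving order.
func fetchBatches(n int, out chan<- batch) {
	for i := 0; i < n; i++ {
		out <- batch{id: i} // blocks when the buffer is full
	}
	close(out)
}

func main() {
	// Capacity 4 mirrors the connector's limit of four in-flight batches.
	pipeline := make(chan batch, 4)
	go fetchBatches(8, pipeline)

	// The consumer processes batches in order while the producer keeps
	// fetching ahead, so the CPU never sits idle waiting on the network.
	for b := range pipeline {
		fmt.Println("processing batch", b.id)
	}
}
```

&lt;p&gt;The buffered channel gives backpressure for free: the producer stalls as soon as four batches are queued, so memory stays bounded no matter how fast the source is.&lt;/p&gt;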

&lt;p&gt;This change alone would provide a welcome performance boost, but there was still the unsatisfyingly slow document processing time to contend with. And it was a trickier problem.&lt;/p&gt;

&lt;p&gt;To standardize data coming from and going to a variety of different systems using a variety of different document formats and data types, Estuary translates everything to JSON as an intermediary. This makes it simple to mix and match data sources and destinations, or plug in a new connector: each connector only needs to handle its specific system and translation to or from the shared language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp9e20ltotbcov41xour.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp9e20ltotbcov41xour.png" alt="Estuary connectors are plug-and-play by going through an intermediary JSON conversion" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Estuary translates MongoDB’s BSON documents to JSON so as to then easily translate the data to any destination format.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MongoDB documents come in BSON, or Binary JSON. This binary-encoded serialization of JSON-like documents generally makes for efficient storage and retrieval. It also includes a handful of additional data types, such as datetime and more specific numeric types.&lt;/p&gt;

&lt;p&gt;This sounds like it would make for a reasonably simple conversion, but Mahdi found that Estuary’s MongoDB connector spent a lot of time decoding documents with Go’s &lt;a href="https://pkg.go.dev/github.com/mongodb/mongo-go-driver/bson" rel="noopener noreferrer"&gt;&lt;code&gt;bson&lt;/code&gt;&lt;/a&gt; package. On reflection, perhaps this wasn’t much of a surprise. Go’s &lt;a href="https://pkg.go.dev/reflect" rel="noopener noreferrer"&gt;&lt;code&gt;reflect&lt;/code&gt;&lt;/a&gt; package, which inspects data types that aren’t known at compile time, is notoriously slow, and the &lt;code&gt;bson&lt;/code&gt; package relies heavily on &lt;code&gt;reflect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Looking for alternatives, he first ran some benchmarks on Rust’s corresponding &lt;a href="https://github.com/mongodb/bson-rust" rel="noopener noreferrer"&gt;&lt;code&gt;bson&lt;/code&gt;&lt;/a&gt; crate. The results were clear: the Rust version decoded documents 2x faster than Go’s.&lt;/p&gt;

&lt;p&gt;Mahdi’s meticulous research also uncovered another option. Rust’s most popular serialization/deserialization crate, &lt;a href="https://crates.io/crates/serde" rel="noopener noreferrer"&gt;&lt;code&gt;serde&lt;/code&gt;&lt;/a&gt;, has a &lt;code&gt;serde-transcode&lt;/code&gt; plugin crate. This transcoder can convert documents from one format to another without any intermediary layer, cutting down on unnecessary processing steps. With this, the BSON to JSON conversion could be 3x faster than the existing Go implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59iic0go2u7elb7yezcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59iic0go2u7elb7yezcb.png" alt="Rust's BSON conversion is 3x faster than Go" width="661" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;serde&lt;/code&gt; couldn’t simply be swapped in as-is. Mahdi wrapped the out-of-the-box serializer in custom logic, extending the JSON conversion and sanitizing the data. The resulting implementation fit Estuary’s specific needs while retaining the 3x performance boost.&lt;/p&gt;

&lt;p&gt;These changes would address both bottlenecks and refurbish the MongoDB capture connector.&lt;/p&gt;

&lt;h2&gt;
  
  
  End Result: Supercharged MongoDB Captures
&lt;/h2&gt;

&lt;p&gt;One question remained: would these improvements hold up across various scenarios? Thorough testing commenced.&lt;/p&gt;

&lt;p&gt;Mahdi started where it all began: the tiny documents scenario. He ran the MongoDB connector on a stream of small 250-byte documents, first using the main version before switching to the improved branch. The measly ~6 MB/s throughput rate rose to around 17.5 MB/s, nearly tripling throughput for the small-documents use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho53g8znox409za3i9wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho53g8znox409za3i9wr.png" alt="Throughput rate for small-sized documents, first using Go, then Rust" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mahdi graphs out throughput results for the MongoDB connector, first using the original Go implementation, followed by the Rust transcoder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, this scenario was only ever meant as a test and example, a way to define how much overhead we were seeing as the connector processed documents.&lt;/p&gt;

&lt;p&gt;Mahdi therefore reran the test, this time using 20 KB documents, a more standard size. The original 34 MB/s rate jumped to 57 MB/s, almost doubling throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu5quofu12wtuyzpenue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu5quofu12wtuyzpenue.png" alt="Throughput rate for average-sized documents, first using Go, then Rust" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The difference when using larger documents is still substantial, even if less pronounced.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This rate was much more reasonable, allowing for around 200 GB of data ingestion per hour and ensuring the Estuary connector could keep up with higher volume use cases.&lt;/p&gt;

&lt;p&gt;What this means in practical terms is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Huge initial databases would get backfilled in half the time&lt;/li&gt;
&lt;li&gt;The platform would be able to handle twice as much data in continuous CDC mode&lt;/li&gt;
&lt;li&gt;Spikes in activity would be handled more gracefully: instead of choking performance, real-time events would stay real-time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After review and approval, Mahdi rolled out the changes to a select set of users first so he could closely monitor affected pipelines. He would be ready to quickly revert or revise as needed if any problems arose.&lt;/p&gt;

&lt;p&gt;With so many use cases and interactions, one minor issue did rear its head: Rust and Go handle invalid UTF-8 characters differently. With a little more customization, Mahdi adjusted the connector’s leniency toward invalid characters to mimic the original Go behavior.&lt;/p&gt;

&lt;p&gt;Other than that, the rollout was smooth sailing, with capture throughput ticking upwards across the board.&lt;/p&gt;

&lt;p&gt;So if you recently noticed your MongoDB capture speeding up: now you know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;While 200 GB an hour is a decent clip, Mahdi noted that there is still room for further improvement. The main issue now is that the connector is relatively CPU-bound. And, after all, efficiency is one of those goals that doesn’t have a specific end.&lt;/p&gt;

&lt;p&gt;For now, though, there are new challenges to face.&lt;/p&gt;

&lt;p&gt;To test out the capture connector’s speed yourself, &lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;try it out in Estuary&lt;/a&gt;. Or &lt;a href="https://estuary.dev/contact-us/" rel="noopener noreferrer"&gt;set up a call&lt;/a&gt; to discuss how the connector could fit into your particular use case.&lt;/p&gt;

&lt;p&gt;Or if you’re simply interested in switching to Rust for faster BSON decoding in your own code, check out Mahdi’s repo on &lt;a href="https://github.com/mdibaiee/bson-benchmarks" rel="noopener noreferrer"&gt;benchmarking Rust and Go&lt;/a&gt; or his work in &lt;a href="https://github.com/estuary/connectors/pull/3596" rel="noopener noreferrer"&gt;Estuary’s source code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>How to Stream OLTP Data to MotherDuck in Real Time with Estuary</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Fri, 26 Sep 2025 05:51:23 +0000</pubDate>
      <link>https://forem.com/estuary/from-oltp-to-olap-streaming-databases-into-motherduck-with-estuary-1nd4</link>
      <guid>https://forem.com/estuary/from-oltp-to-olap-streaming-databases-into-motherduck-with-estuary-1nd4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;DuckDB is quickly becoming one of the most talked-about analytical databases. It is fast, lightweight, and designed to run inside your applications, often described as &lt;em&gt;SQLite for analytics&lt;/em&gt;. While it works great on a laptop for local analysis, production workflows need something more scalable.&lt;br&gt;&lt;br&gt;
That is where &lt;strong&gt;MotherDuck&lt;/strong&gt; comes in. MotherDuck takes the power of DuckDB and brings it to the cloud. It adds collaboration features, secure storage, and a serverless model that lets teams use DuckDB at scale without worrying about infrastructure.&lt;br&gt;&lt;br&gt;
In this guide, you will learn how to stream data from an OLTP system into MotherDuck using &lt;strong&gt;Estuary&lt;/strong&gt;. This approach lets you run analytical queries on fresh data without putting extra load on your production database.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;🎥 Prefer watching instead of reading? Check out the short walkthrough below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2flyH-rjmqI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DuckDB Is Gaining Traction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; is an open source analytical database designed with a clear goal: to make complex queries fast and simple without heavy infrastructure. Instead of being a traditional client-server database, DuckDB is embedded. It runs inside the host process, which reduces overhead and makes it easy to integrate directly into applications, notebooks, or scripts.&lt;br&gt;&lt;br&gt;
Several features stand out:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-process operation&lt;/strong&gt;: Similar to SQLite, DuckDB runs where your code runs. This avoids network calls and gives you low-latency access to data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columnar and vectorized execution&lt;/strong&gt;: DuckDB is optimized for analytical queries. Its execution model speeds up heavy operations such as aggregations, filtering, and joins on large tables.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability and extensibility&lt;/strong&gt;: It has a very small footprint and no external dependencies. At the same time, extensions support advanced data types and file formats, including Parquet, JSON, and geospatial data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless file access&lt;/strong&gt;: DuckDB can query local files directly without requiring an ETL pipeline. For example, you can run SQL queries on CSV or Parquet files straight from disk.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with data science tools&lt;/strong&gt;: DuckDB connects smoothly with Python, R, and Jupyter notebooks, which makes it a favorite among data scientists.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this balance of speed, flexibility, and simplicity, DuckDB is increasingly used as the analytical layer in modern data pipelines, as well as for ad hoc analysis by engineers and analysts.&lt;/p&gt;
&lt;h2&gt;
  
  
  MotherDuck: DuckDB in the Cloud
&lt;/h2&gt;

&lt;p&gt;DuckDB is excellent for local analysis, but production environments often require more than a local embedded database. Teams need collaboration, security, and scalability. That is where &lt;strong&gt;&lt;a href="https://motherduck.com/" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;&lt;/strong&gt; comes in.&lt;br&gt;&lt;br&gt;
MotherDuck is a managed cloud service built on top of DuckDB. It extends the same fast and lightweight query engine into a serverless environment while adding features that make it practical for organizations:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless architecture&lt;/strong&gt;: No servers to manage and no infrastructure overhead. MotherDuck scales automatically with your workloads.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Share queries, results, and datasets with teammates in real time. This makes it easier for teams to work from the same source of truth.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure secret storage&lt;/strong&gt;: Manage credentials and connections safely in the cloud.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with pipelines&lt;/strong&gt;: Platforms like Estuary can write directly into MotherDuck, which means your data is always fresh and ready for analysis.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, MotherDuck gives teams the best of both worlds: the performance and simplicity of DuckDB combined with the scalability and ease of use of a modern cloud service.&lt;/p&gt;
&lt;h2&gt;
  
  
  OLTP → OLAP: The Core Use Case
&lt;/h2&gt;

&lt;p&gt;Most production applications run on OLTP databases such as PostgreSQL, MySQL, or MongoDB. These systems are designed for fast inserts, updates, and deletes. They keep applications responsive but are not optimized for running heavy analytical queries.  &lt;/p&gt;

&lt;p&gt;Running aggregations, joins, or reports directly on an OLTP database can:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow down your application performance.
&lt;/li&gt;
&lt;li&gt;Increase operational risk by adding load to your production environment.
&lt;/li&gt;
&lt;li&gt;Limit the ability of analysts and data scientists to explore data freely.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why organizations separate &lt;strong&gt;OLTP (transactional)&lt;/strong&gt; systems from &lt;strong&gt;OLAP (analytical)&lt;/strong&gt; systems. The OLTP database handles day-to-day transactions, while an OLAP database is dedicated to complex queries and reporting.  &lt;/p&gt;

&lt;p&gt;DuckDB, and by extension MotherDuck, fits perfectly as an OLAP layer. With &lt;strong&gt;&lt;a href="https://estuary.dev/product/" rel="noopener noreferrer"&gt;Estuary&lt;/a&gt;&lt;/strong&gt;, you can capture real-time changes from your OLTP source and stream them into MotherDuck. This way, analysts always have up-to-date data to query without touching the production database.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Setting Up Estuary with MotherDuck
&lt;/h2&gt;

&lt;p&gt;In this section, we’ll walk through the process of connecting your OLTP source to MotherDuck using Estuary. The setup is straightforward and only takes a few steps.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Prepare Your Source in Estuary
&lt;/h3&gt;

&lt;p&gt;Before you can send data to MotherDuck, you need a source system connected in Estuary. A source could be any OLTP database such as PostgreSQL, MySQL, or MongoDB. Estuary also supports SaaS applications, event streams, and file-based sources.  &lt;/p&gt;

&lt;p&gt;To prepare a source:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;strong&gt;Captures&lt;/strong&gt; tab in the Estuary dashboard.
&lt;/li&gt;
&lt;li&gt;Create a new capture and select the connector for your source system.
&lt;/li&gt;
&lt;li&gt;Provide the connection details (for example, host, port, database name, and credentials).
&lt;/li&gt;
&lt;li&gt;Save and publish the capture.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once this is done, Estuary begins ingesting data from your source and continuously tracks new changes. This stream of data is stored in an internal collection, which you will later connect to MotherDuck.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip&lt;/em&gt;: If you are new to Estuary, try starting with a simple dataset (like PostgreSQL or a CSV file) before moving on to production-scale sources. &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a MotherDuck Materialization
&lt;/h3&gt;

&lt;p&gt;With your source capture running, the next step is to &lt;a href="https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/" rel="noopener noreferrer"&gt;set up MotherDuck&lt;/a&gt; as the destination for your data. In Estuary, this is called a &lt;strong&gt;materialization&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4qmbqsoxzdjfxz30lw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4qmbqsoxzdjfxz30lw9.png" alt="Search for “MotherDuck” in the Estuary catalog and choose it as your materialization connector."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To create one:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;strong&gt;Destinations&lt;/strong&gt; tab in the Estuary dashboard.
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;New Materialization&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;MotherDuck&lt;/strong&gt; in the connector catalog and select it.
&lt;/li&gt;
&lt;li&gt;Give the materialization a descriptive name so you can easily identify it later.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, you will see the configuration screen for the MotherDuck connector. This is where you provide the details that allow Estuary to stage data and deliver it into your MotherDuck database.  &lt;/p&gt;

&lt;p&gt;In the next step, you’ll configure &lt;strong&gt;AWS S3 staging&lt;/strong&gt;, which Estuary uses as a temporary storage location for data loads.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Configure AWS S3 Staging
&lt;/h3&gt;

&lt;p&gt;The MotherDuck connector in Estuary uses an Amazon S3 bucket as a staging area. Data is first written to S3, then loaded into MotherDuck. This design ensures high reliability and scalability for large datasets.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgu04n0bform8cgu7vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgu04n0bform8cgu7vi.png" alt="Example IAM users in AWS for Estuary and MotherDuck. Each user should have S3 read and write permissions."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s what you need to set up:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create or choose an S3 bucket&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note down the bucket name and its region.
&lt;/li&gt;
&lt;li&gt;Optionally, you can define a prefix if you want Estuary to organize staged files under a specific folder.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set up IAM permissions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html" rel="noopener noreferrer"&gt;Create or use an IAM user&lt;/a&gt; that has read and write access to the S3 bucket.
&lt;/li&gt;
&lt;li&gt;Attach a policy with at least the following actions:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3:PutObject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3:GetObject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3:ListBucket&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate access keys&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the AWS console, go to the IAM user’s &lt;strong&gt;Security Credentials&lt;/strong&gt; tab.
&lt;/li&gt;
&lt;li&gt;Create an access key and secret key.
&lt;/li&gt;
&lt;li&gt;Copy these values into the Estuary dashboard when configuring the MotherDuck connector.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
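&lt;p&gt;The IAM permissions in step 2 might look like the following minimal policy sketch, where &lt;code&gt;YOUR_BUCKET&lt;/code&gt; is a placeholder for your bucket name (add a prefix to the object resource if you configured one, and scope down further as needed):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET"
    }
  ]
}
```

&lt;p&gt;Note that &lt;code&gt;s3:ListBucket&lt;/code&gt; applies to the bucket ARN itself, while the object actions apply to the &lt;code&gt;/*&lt;/code&gt; object ARN.&lt;/p&gt;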

&lt;p&gt;At this point, Estuary knows where to stage data and has the permissions needed to write into your S3 bucket.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tip&lt;/em&gt;: For production, avoid using a root account. Always generate access keys from an IAM user with the least privileges necessary.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Set Up MotherDuck
&lt;/h3&gt;

&lt;p&gt;Now that AWS S3 staging is ready, it’s time to configure the MotherDuck side of the connection. This step makes sure MotherDuck can pull the staged data into your chosen database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xb3jypjvhv0230wecu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xb3jypjvhv0230wecu1.png" alt="Example of the MotherDuck connector configuration in Estuary, with service token, database, and S3 staging details filled in."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate an access token&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to your MotherDuck account.
&lt;/li&gt;
&lt;li&gt;Open the &lt;strong&gt;Settings&lt;/strong&gt; menu and go to &lt;strong&gt;Access Tokens&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Create a new token and copy it into the Estuary connector configuration.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide AWS credentials to MotherDuck&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MotherDuck needs permission to read the staged files from your S3 bucket.
&lt;/li&gt;
&lt;li&gt;You can provide these credentials in one of two ways:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;a. By running SQL statements inside MotherDuck:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="n"&gt;aws_access_key&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;ACCESS_KEY&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SECRET&lt;/span&gt; &lt;span class="n"&gt;aws_secret_key&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;SECRET_KEY&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;b. Or by entering them through the MotherDuck UI.  &lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Choose a target database&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select an existing database in your MotherDuck account, or create a new one.
&lt;/li&gt;
&lt;li&gt;Copy its name into the Estuary configuration.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Decide on delete behavior&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Soft deletes&lt;/strong&gt;: Mark a record as deleted but keep it in the table for historical analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard deletes&lt;/strong&gt;: Remove the record entirely.
&lt;/li&gt;
&lt;li&gt;Choose the option that best matches your analytics or compliance needs.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;
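&lt;p&gt;To make the difference concrete, here is a minimal standalone sketch of the two delete modes, using plain SQLite and a made-up &lt;code&gt;_meta_deleted&lt;/code&gt; flag column (the connector's actual column naming may differ):&lt;/p&gt;

```python
import sqlite3

# Hypothetical local illustration of soft vs hard deletes.
# Table and column names here are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, _meta_deleted INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])

# Soft delete: flag the row but keep it for historical analysis.
conn.execute("UPDATE orders SET _meta_deleted = 1 WHERE id = 2")

# Hard delete: the row is removed entirely.
conn.execute("DELETE FROM orders WHERE id = 3")

live = conn.execute("SELECT COUNT(*) FROM orders WHERE _meta_deleted = 0").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(live, total)  # soft-deleted rows still count toward the total
```

&lt;p&gt;Soft deletes preserve history at the cost of table size; hard deletes keep tables lean but discard the audit trail.&lt;/p&gt;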

&lt;h3&gt;
  
  
  Step 5: Publish and Stream Data
&lt;/h3&gt;

&lt;p&gt;Once your MotherDuck materialization is configured, the final step is to publish it and start the data flow.  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Select your source data&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Link an entire capture (for example, your PostgreSQL database).
&lt;/li&gt;
&lt;li&gt;Or choose specific collections you want to replicate.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review the configuration&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double-check that your S3 credentials, MotherDuck token, and database name are correct.
&lt;/li&gt;
&lt;li&gt;Make sure you selected the right delete behavior (soft or hard).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save and publish&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;Next&lt;/strong&gt;, then &lt;strong&gt;Save &amp;amp; Publish&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Estuary will immediately begin streaming data from your OLTP source into MotherDuck.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From here, data updates in your source will flow continuously into your MotherDuck database. This gives you a near real-time OLAP environment for analytics, without adding load to your production system.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Query in MotherDuck
&lt;/h3&gt;

&lt;p&gt;With the connector published, your data is now flowing into MotherDuck. The final step is to start exploring it.  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;MotherDuck dashboard&lt;/strong&gt; and go to &lt;strong&gt;Notebooks&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Select the database you configured as the destination.
&lt;/li&gt;
&lt;li&gt;Run queries using DuckDB’s familiar &lt;a href="https://duckdb.org/docs/stable/sql/introduction.html" rel="noopener noreferrer"&gt;SQL syntax&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if you replicated an &lt;code&gt;orders&lt;/code&gt; table from your OLTP database, you could analyze top customers like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch5mqt7dn8d3ha7a5ake.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch5mqt7dn8d3ha7a5ake.png" alt="Running a SQL query in MotherDuck to explore the replicated dataset streamed through Estuary."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;By combining Estuary and MotherDuck, you can build a modern pipeline that keeps analytics separate from your production workload without adding extra complexity.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estuary captures real-time changes from your OLTP databases.
&lt;/li&gt;
&lt;li&gt;Data is staged in S3 for reliability.
&lt;/li&gt;
&lt;li&gt;MotherDuck provides a cloud-native DuckDB environment where your team can query and collaborate.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup is fast to configure, easy to maintain, and scales with your needs. Instead of managing batch jobs or writing custom scripts, you can focus on analysis and insights.  &lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DuckDB is lightweight and powerful for analytics, while MotherDuck brings it to the cloud for collaboration and scalability.
&lt;/li&gt;
&lt;li&gt;Estuary makes it simple to stream data from OLTP systems into MotherDuck in real time.
&lt;/li&gt;
&lt;li&gt;AWS S3 is used as a staging layer, requiring IAM permissions and credentials.
&lt;/li&gt;
&lt;li&gt;Once published, you can query fresh data in MotherDuck notebooks using DuckDB SQL.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Ready to try it yourself? &lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;Explore Estuary&lt;/a&gt; and see how quickly you can start streaming data into MotherDuck.  &lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>dataengineering</category>
      <category>motherduck</category>
      <category>database</category>
    </item>
    <item>
      <title>2025 Data Warehouse Benchmark: What BigQuery, Snowflake, and Others Don’t Tell You</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Thu, 17 Jul 2025 08:11:33 +0000</pubDate>
      <link>https://forem.com/estuary/2025-data-warehouse-benchmark-what-bigquery-snowflake-and-others-dont-tell-you-392a</link>
      <guid>https://forem.com/estuary/2025-data-warehouse-benchmark-what-bigquery-snowflake-and-others-dont-tell-you-392a</guid>
      <description>&lt;h1&gt;
  
  
  We Benchmark-Tested 5 Data Warehouses. Here's What Broke.
&lt;/h1&gt;

&lt;p&gt;Choosing a data warehouse shouldn’t feel like a gamble — but it often is.&lt;/p&gt;

&lt;p&gt;Marketing sites are polished. Demos are cherry-picked. Docs are full of high-level promises. But when your data team starts moving &lt;strong&gt;terabytes of real data&lt;/strong&gt;, things change fast: performance bottlenecks, cost spikes, memory errors… and sometimes complete failure.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary&lt;/a&gt;, we help teams build real-time data pipelines that push warehouses hard — across batch and streaming. We’ve seen the consequences of choosing the wrong warehouse. So we built the &lt;strong&gt;benchmark we wish existed earlier&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 The Estuary 2025 Data Warehouse Benchmark
&lt;/h2&gt;

&lt;p&gt;We benchmarked 5 major data warehouses under real workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google BigQuery&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Fabric&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We didn’t just run canned TPC-H queries — we loaded &lt;strong&gt;over 8TB of structured + semi-structured data&lt;/strong&gt;, then hit each platform with real-world SQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrfmdr3mtg0lxh18x9tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrfmdr3mtg0lxh18x9tb.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Joins, window functions, filters, and nesting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-F&lt;/strong&gt; (“The Frankenquery”) — a deliberately brutal query that pushes limits&lt;/li&gt;
&lt;li&gt;Full lifecycle tracking from ingest to query via &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary Flow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cost-to-runtime ratios with no vendor tuning or caching games&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📂 Our full methodology is &lt;a href="https://github.com/estuary/estuary-warehouse-benchmark" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. Clone it. Run your own tests. Contribute.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧠 What We Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🔵 BigQuery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fast — especially on nested JSON
&lt;/li&gt;
&lt;li&gt;But &lt;strong&gt;zero cost guardrails&lt;/strong&gt; = high bill risk
&lt;/li&gt;
&lt;li&gt;Cost-per-minute hit &lt;strong&gt;$15+&lt;/strong&gt; under some setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚪ Snowflake
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stable, predictable, smart scaling
&lt;/li&gt;
&lt;li&gt;Good balance of performance and cost
&lt;/li&gt;
&lt;li&gt;Strong default choice for teams who want reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟨 Databricks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Great for ML workflows
&lt;/li&gt;
&lt;li&gt;SQL under load? Needs tuning
&lt;/li&gt;
&lt;li&gt;Performance quirks at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟥 Redshift &amp;amp; 🟩 Fabric
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Memory errors, long runtimes, incomplete results
&lt;/li&gt;
&lt;li&gt;Multiple queries failed or stalled for hours
&lt;/li&gt;
&lt;li&gt;Definitely &lt;strong&gt;not&lt;/strong&gt; plug-and-play ready&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📉 Chart: Cost vs Runtime
&lt;/h2&gt;


&lt;p&gt;The cost-vs-runtime chart tracks &lt;strong&gt;$ per minute of query runtime&lt;/strong&gt; across warehouses and instance sizes.&lt;br&gt;&lt;br&gt;
Red bands = platforms that failed under load or threw memory errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Rankings That Actually Matter
&lt;/h2&gt;

&lt;p&gt;We scored each platform on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost-efficiency 💰
&lt;/li&gt;
&lt;li&gt;Runtime performance ⚡
&lt;/li&gt;
&lt;li&gt;Scalability 📈
&lt;/li&gt;
&lt;li&gt;Reliability under pressure 🧱
&lt;/li&gt;
&lt;li&gt;Startup-friendliness 🚀
&lt;/li&gt;
&lt;li&gt;Enterprise readiness 🏢&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 Some platforms were efficient at small scale but crashed under growth. Others performed well but cost 10x more than peers.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📥 Get the Full Report
&lt;/h2&gt;

&lt;p&gt;If you’re:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Planning a warehouse migration
&lt;/li&gt;
&lt;li&gt;Scaling analytics or ML pipelines
&lt;/li&gt;
&lt;li&gt;Comparing Snowflake vs BigQuery vs Databricks
&lt;/li&gt;
&lt;li&gt;Or just tired of guessing…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://estuary.dev/data-warehouse-benchmark-report/" rel="noopener noreferrer"&gt;&lt;strong&gt;Download the full benchmark report&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  👨‍🔬 Built by Engineers, Not Marketers
&lt;/h2&gt;

&lt;p&gt;We created this benchmark at Estuary because we work with these warehouses daily. Our product — &lt;a href="https://estuary.dev" rel="noopener noreferrer"&gt;Estuary Flow&lt;/a&gt; — streams real-time data from sources like PostgreSQL, Kafka, MongoDB, and SaaS apps into modern warehouses.&lt;/p&gt;

&lt;p&gt;We’ve helped teams recover from 18-month migrations and $100k+ in wasted compute. So we’re publishing what we’ve learned.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤝 Contribute or fork the test harness here:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/estuary/estuary-warehouse-benchmark" rel="noopener noreferrer"&gt;🔗 GitHub Repo&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/estuary" rel="noopener noreferrer"&gt;🌐 Estuary GitHub&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  💬 Join the Discussion
&lt;/h2&gt;

&lt;p&gt;Have you had similar (or better?) experiences with these platforms?&lt;br&gt;&lt;br&gt;
Spot something we should test next?&lt;/p&gt;

&lt;p&gt;Drop your thoughts, logs, or horror stories in the comments. We’re all ears 👇&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>cloud</category>
      <category>datawarehouse</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Refresh Smarter: How Estuary’s Dataflow Reset Makes Backfills a Breeze</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Tue, 15 Jul 2025 04:14:10 +0000</pubDate>
      <link>https://forem.com/estuary/refresh-smarter-how-estuarys-dataflow-reset-makes-backfills-a-breeze-4jd8</link>
      <guid>https://forem.com/estuary/refresh-smarter-how-estuarys-dataflow-reset-makes-backfills-a-breeze-4jd8</guid>
      <description>&lt;p&gt;Backfills have always been a critical - but sometimes tedious - part of managing robust data pipelines. Whether you're dealing with schema drift, outdated destination tables, or bad source data, initiating a full reset of your pipeline used to require multiple steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not anymore.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Estuary’s new Dataflow Reset&lt;/strong&gt; feature, you can perform a clean-sweep backfill in just one step - reloading your sources, refreshing schemas, re-triggering derivations, and updating destination tables - all at once.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is a Dataflow Reset?
&lt;/h2&gt;

&lt;p&gt;A Dataflow Reset is Estuary’s one-click solution to refresh your &lt;strong&gt;entire dataflow&lt;/strong&gt;. It works from top to bottom:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-extracts data from the source
&lt;/li&gt;
&lt;li&gt;Re-runs all derivations
&lt;/li&gt;
&lt;li&gt;Recalculates schemas using updated data
&lt;/li&gt;
&lt;li&gt;Rebuilds destination tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just a re-run - it's a &lt;strong&gt;recalibration&lt;/strong&gt;. If your schemas previously became too broad (due to inconsistent or junk data), the reset starts fresh and reflects the true shape of your source.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Use It?
&lt;/h2&gt;

&lt;p&gt;The new Dataflow Reset option is ideal for scenarios like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structural changes in your source system
&lt;/li&gt;
&lt;li&gt;Schema inference gone awry
&lt;/li&gt;
&lt;li&gt;Destination tables out of sync with upstream logic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; It automatically tracks which downstream resources (like materializations) need updating - no manual selection required.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to your &lt;strong&gt;Capture&lt;/strong&gt; in the Estuary Flow web app.
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Edit&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Backfill&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;The default backfill mode will now trigger a &lt;strong&gt;Dataflow Reset&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it - your pipeline is reset and refreshed in one action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3l6rq9ira7bzcqj1r7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3l6rq9ira7bzcqj1r7.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prefer Fine-Grained Control?
&lt;/h2&gt;

&lt;p&gt;You can still choose from advanced backfill options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incremental Backfill&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Reprocess only the source data while keeping the existing destination intact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Materialization-Only Backfill&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Rebuild destination tables from current collection data - no need to touch the source.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These modes are perfect for more targeted recovery and testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Known Limitation
&lt;/h2&gt;

&lt;p&gt;Avoid using &lt;strong&gt;Dataflow Reset&lt;/strong&gt; with &lt;strong&gt;Dekaf materializations&lt;/strong&gt; (Estuary’s Kafka-compatible interface). This combination is currently unsupported.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;Want a deeper dive into backfilling options, use cases, and caveats? Check out the Estuary docs:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.estuary.dev/reference/backfilling-data/" rel="noopener noreferrer"&gt;https://docs.estuary.dev/reference/backfilling-data/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataflow Reset&lt;/strong&gt; is a full-pipeline refresh: source -&amp;gt; schema -&amp;gt; derivation -&amp;gt; destination
&lt;/li&gt;
&lt;li&gt;Automatically recalculates schema to avoid issues caused by bad or outdated data
&lt;/li&gt;
&lt;li&gt;Easy to trigger and safer than ever to run
&lt;/li&gt;
&lt;li&gt;Still supports advanced, partial backfill modes
&lt;/li&gt;
&lt;li&gt;Avoid using with Dekaf (for now)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make your next backfill a breeze with Estuary.&lt;/p&gt;

</description>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Load Data from Amazon S3 to Snowflake in Real Time</title>
      <dc:creator>Sourabh Gupta</dc:creator>
      <pubDate>Wed, 09 Jul 2025 06:39:46 +0000</pubDate>
      <link>https://forem.com/estuary/how-to-load-data-from-amazon-s3-to-snowflake-in-real-time-4i02</link>
      <guid>https://forem.com/estuary/how-to-load-data-from-amazon-s3-to-snowflake-in-real-time-4i02</guid>
      <description>&lt;p&gt;Got a bunch of raw data sitting in Amazon S3 and need to get it into Snowflake for analytics — fast? You’re not alone.&lt;/p&gt;

&lt;p&gt;Maybe it’s JSON logs, CSV exports, or event data piling up in your S3 bucket. Maybe you’ve tried batch pipelines or custom scripts but ran into delays, duplicates, or schema chaos. What you actually need is a clean, reliable way to load that S3 data to Snowflake, without spending weeks building and maintaining it.&lt;/p&gt;

&lt;p&gt;That’s exactly what Estuary Flow is built for.&lt;/p&gt;

&lt;p&gt;Flow makes it easy to build real-time S3 to Snowflake data pipelines with no code, no ops overhead, and no latency headaches. It connects directly to your S3 bucket, picks up new files as they arrive, and keeps your Snowflake warehouse in sync continuously.&lt;/p&gt;

&lt;p&gt;In this walkthrough, we’ll show you how to set up an Amazon S3 to Snowflake pipeline using Estuary Flow from start to finish. You’ll go from raw files to live Snowflake tables in just a few steps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: If you're looking to stream data from Amazon S3 to Snowflake, you're in the right place — and Flow makes it a breeze.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Stream S3 Data to Snowflake in Real Time?
&lt;/h2&gt;

&lt;p&gt;Let’s be honest — batch processing worked fine back when dashboards only needed to update once a day. But today, teams expect real-time answers: marketing needs up-to-the-minute campaign performance, operations teams need live inventory data, and product managers want to react to user behavior as it happens.&lt;/p&gt;

&lt;p&gt;That’s where streaming data from S3 to Snowflake changes the game.&lt;/p&gt;

&lt;p&gt;If you’re storing raw files — like logs, events, or exports — in Amazon S3, you’re already halfway there. The missing piece is a low-latency pipeline that gets that data into Snowflake the moment it arrives. No waiting for hourly jobs. No stale reports. Just fresh, query-ready data flowing in 24/7.&lt;/p&gt;

&lt;p&gt;Here are a few reasons real-time sync matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analytics that actually keep up – Get real-time insights instead of reading yesterday’s data.
&lt;/li&gt;
&lt;li&gt;Automation that reacts fast – Trigger workflows in Snowflake based on live S3 updates.
&lt;/li&gt;
&lt;li&gt;Simplified ops – Eliminate brittle scripts, manual backfills, and sync delays.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Rather than consuming S3 event notifications, Flow polls your bucket every few minutes to detect new files, then streams them to Snowflake immediately. It’s batch under the hood, but real-time in effect.&lt;/p&gt;
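&lt;p&gt;The polling pattern behind this is easy to picture. Here's a hypothetical, stripped-down sketch using a local directory in place of an S3 bucket (the real connector talks to the S3 API and tracks its state durably; treat this purely as an illustration):&lt;/p&gt;

```python
from pathlib import Path

def poll_for_new_files(directory, seen, handle):
    """One polling pass: hand any file we haven't seen before to `handle`."""
    for path in sorted(Path(directory).glob("*.json")):
        if path.name not in seen:
            handle(path)
            seen.add(path.name)

# A scheduler would invoke this every few minutes, e.g.:
#   while True:
#       poll_for_new_files("landing/", seen, load_into_warehouse)
#       time.sleep(300)  # a 5-minute cadence, like Flow's default
```

&lt;p&gt;Each pass only processes files it hasn't recorded yet, which is what makes "batch under the hood" feel real-time in effect.&lt;/p&gt;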

&lt;h2&gt;
  
  
  Why Use Estuary Flow Instead of Traditional ETL Tools?
&lt;/h2&gt;

&lt;p&gt;If you’ve tried to move data from Amazon S3 to Snowflake before, you probably know the drill: patch together an ETL tool, deal with scheduling, wrestle with schema mismatches, and hope the job doesn’t break halfway through.&lt;/p&gt;

&lt;p&gt;The thing is, most ETL tools were built for a different era — one where “real time” meant “hourly,” and everything ran in batches. Estuary Flow flips that on its head.&lt;/p&gt;

&lt;p&gt;Here’s how Flow makes your S3 to Snowflake pipeline way easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time by Default:&lt;/strong&gt; Flow isn’t just fast — it’s built for continuous streaming. Once you connect your S3 bucket, Flow automatically picks up new files as they land and streams the data directly into Snowflake.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Code Required:&lt;/strong&gt; Set up everything — capture, schema, and materialization — through a clean UI. You don’t need to write Python, wrangle Airflow, or babysit cron jobs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema-Aware + Smart:&lt;/strong&gt; Flow infers the structure of your S3 data and helps you map it to Snowflake tables. You can tighten up schemas, apply transformations, and evolve structure over time without breaking your pipeline.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-Once Delivery:&lt;/strong&gt; No duplicates. No reprocessing. Flow uses cloud-native guarantees to ensure data lands in Snowflake exactly once, even if things get weird.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built to Scale:&lt;/strong&gt; Whether you're syncing a few JSON files or streaming terabytes a day, Flow scales automatically without locking you into complex infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estuary Flow takes the friction out of real-time data integration from S3 to Snowflake, so you can focus on using the data, not moving it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Need to Get Started
&lt;/h2&gt;

&lt;p&gt;You don’t need much to build an Amazon S3 to Snowflake pipeline with Estuary Flow — just a few basics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Estuary Flow Account
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dashboard.estuary.dev/register" rel="noopener noreferrer"&gt;Sign up for free&lt;/a&gt; to access the Flow web app — no downloads or setup required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon S3 Bucket
&lt;/h3&gt;

&lt;p&gt;This is your data source. You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket name &amp;amp; region
&lt;/li&gt;
&lt;li&gt;Either public access or your AWS access key + secret key
&lt;/li&gt;
&lt;li&gt;(Optional) A folder path, called a “prefix”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Snowflake Account
&lt;/h3&gt;

&lt;p&gt;Your destination for the data. Make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A database, schema, and virtual warehouse
&lt;/li&gt;
&lt;li&gt;A user with access
&lt;/li&gt;
&lt;li&gt;Your account URL + login credentials
&lt;/li&gt;
&lt;li&gt;(Optional) warehouse name and role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. With these in place, you’re ready to connect the pieces and start streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Capture Data from Amazon S3
&lt;/h2&gt;

&lt;p&gt;First up, you’ll connect Estuary Flow to your S3 bucket — this step is called a capture. It’s how Flow knows where to pull your data from.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oifx0r91ls9pq46w9ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oifx0r91ls9pq46w9ao.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into Estuary Flow at &lt;a href="https://dashboard.estuary.dev/" rel="noopener noreferrer"&gt;dashboard.estuary.dev&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Click the Sources tab and select New Capture. &lt;/li&gt;
&lt;li&gt;Choose Amazon S3 from the list of connectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ll see a form where you enter your S3 details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture name – Something like myorg/s3-orders
&lt;/li&gt;
&lt;li&gt;AWS credentials – Only needed if your bucket isn’t public
&lt;/li&gt;
&lt;li&gt;Bucket name &amp;amp; region – From your S3 console
&lt;/li&gt;
&lt;li&gt;Prefix (optional) – To pull from a specific folder
&lt;/li&gt;
&lt;li&gt;Match keys (optional) – For filtering files, like *.json&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you click Next, Flow will connect to your bucket and auto-generate a schema based on your data. You’ll see a preview of your Flow collection — this acts as a live copy of your S3 data inside Flow.&lt;/p&gt;

&lt;p&gt;Click Save and Publish to finish the capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behind the scenes, Flow checks your S3 bucket on a 5-minute schedule (by default) to pick up new or updated files. This polling loop is what delivers near-real-time sync on top of static object storage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, let’s connect this to Snowflake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Materialize to Snowflake
&lt;/h2&gt;

&lt;p&gt;Now that your data is flowing into Estuary, it’s time to materialize it to Snowflake — in other words, stream it directly into a Snowflake table.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6lpje0vpzaksqr15o0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6lpje0vpzaksqr15o0l.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After saving your S3 capture, click Materialize Collections.
&lt;/li&gt;
&lt;li&gt;Choose the Snowflake connector from the destination list.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ll fill out a simple form with your Snowflake details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Materialization name – e.g., myorg/s3-to-snowflake
&lt;/li&gt;
&lt;li&gt;Account URL – Like myorg-account.snowflakecomputing.com
&lt;/li&gt;
&lt;li&gt;User + Password – A Snowflake user with the right permissions
&lt;/li&gt;
&lt;li&gt;Database &amp;amp; Schema – Where the table will live
&lt;/li&gt;
&lt;li&gt;Warehouse – Optional, but recommended
&lt;/li&gt;
&lt;li&gt;Role – Optional if already assigned to the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once Flow connects, you’ll see your captured collection (from S3) listed.&lt;/p&gt;

&lt;p&gt;From here, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename the output table
&lt;/li&gt;
&lt;li&gt;Enable delta updates (if you want changes applied instead of full inserts)
&lt;/li&gt;
&lt;li&gt;Use Schema Inference to map your flat S3 data into Snowflake’s tabular format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;To do that:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click the Collection tab
&lt;/li&gt;
&lt;li&gt;Select Schema Inference
&lt;/li&gt;
&lt;li&gt;Review the suggested schema → Click Apply&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, hit Save and Publish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ That’s it — you’ve now got a fully working, real-time S3 to Snowflake pipeline. Flow will continuously sync new files from your bucket straight into your Snowflake warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next? Supercharge Your S3 to Snowflake Pipeline
&lt;/h2&gt;

&lt;p&gt;You now have a fully operational, real-time pipeline from Amazon S3 to Snowflake — and it runs continuously, no scripts or schedulers required.&lt;/p&gt;

&lt;p&gt;But that’s just the beginning.&lt;/p&gt;

&lt;p&gt;With Estuary Flow, you can take things even further:&lt;/p&gt;

&lt;h3&gt;
  
  
  Add Transformations (a.k.a. Derivations)
&lt;/h3&gt;

&lt;p&gt;Want to clean, filter, or join your data before it lands in Snowflake? Use derivations to apply real-time transformations using SQL or TypeScript, right inside Flow.&lt;br&gt;&lt;br&gt;
You can enrich JSON objects, flatten nested structures, or create entirely new views.&lt;/p&gt;
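&lt;p&gt;To illustrate what "flatten nested structures" means in practice, here's a tiny standalone Python sketch (Flow derivations themselves are written in SQL or TypeScript inside Flow; this only shows the shape of the transformation):&lt;/p&gt;

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

record = {"user": {"id": 7, "geo": {"city": "Oslo"}}, "event": "click"}
print(flatten(record))  # {'user.id': 7, 'user.geo.city': 'Oslo', 'event': 'click'}
```

&lt;p&gt;A flattened record like this maps cleanly onto warehouse columns, which is the same idea a derivation applies continuously to a collection.&lt;/p&gt;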

&lt;h3&gt;
  
  
  Plug into More Systems
&lt;/h3&gt;

&lt;p&gt;Need to send the same S3 data to BigQuery, Kafka, or a dashboard tool? Just add another materialization — Flow supports multi-destination sync out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor + Optimize
&lt;/h3&gt;

&lt;p&gt;Use Flow’s built-in observability tools or plug into OpenMetrics to monitor throughput, schema evolution, and pipeline health in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Streaming S3 Data to Snowflake Today
&lt;/h2&gt;

&lt;p&gt;The old way — batch jobs, manual scripts, clunky ETL — just can’t keep up with today’s speed of data.&lt;/p&gt;

&lt;p&gt;With Estuary Flow, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sync Amazon S3 to Snowflake in real time
&lt;/li&gt;
&lt;li&gt;Handle schema changes effortlessly
&lt;/li&gt;
&lt;li&gt;Scale without infrastructure headaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ready to go from raw files to real-time insights?&lt;br&gt;&lt;br&gt;
Try Estuary Flow for free and build your first streaming data pipeline today.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
