<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Giovanni Martinez</title>
    <description>The latest articles on Forem by Giovanni Martinez (@iqtoolkit).</description>
    <link>https://forem.com/iqtoolkit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3820977%2F9e6a9053-4c6f-45e1-a270-0867c38cdd0f.jpg</url>
      <title>Forem: Giovanni Martinez</title>
      <link>https://forem.com/iqtoolkit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iqtoolkit"/>
    <language>en</language>
    <item>
      <title>Beyond the API: DocumentDB vs. Aurora PostgreSQL for JSON Workloads</title>
      <dc:creator>Giovanni Martinez</dc:creator>
      <pubDate>Tue, 28 Apr 2026 11:44:33 +0000</pubDate>
      <link>https://forem.com/aws-builders/beyond-the-api-documentdb-vs-aurora-postgresql-for-json-workloads-4dgl</link>
      <guid>https://forem.com/aws-builders/beyond-the-api-documentdb-vs-aurora-postgresql-for-json-workloads-4dgl</guid>
      <description>&lt;p&gt;In my last post, we dug into the deep internal trade-offs of running JSON workloads on PostgreSQL vs. MongoDB. We looked at how PostgreSQL's MVCC and TOAST architectures create hidden write amplification, and how MongoDB's WiredTiger engine handles documents differently with a copy-on-write B-Tree design.&lt;/p&gt;

&lt;p&gt;But what happens when you decide you do not want to manage those database servers yourself? You migrate to AWS, open the managed services menu, and start comparing &lt;strong&gt;Amazon DocumentDB&lt;/strong&gt; with &lt;strong&gt;Amazon Aurora PostgreSQL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is the plot twist most architecture discussions miss: &lt;strong&gt;at this point, you are no longer comparing PostgreSQL heap internals directly against MongoDB WiredTiger internals. You are comparing two different database compute layers built on AWS's decoupled, log-centric, multi-AZ distributed storage model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That shift completely changes the "Postgres vs. Mongo" math for JSON workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shared Foundation: The Database as a Log-Centric System
&lt;/h2&gt;

&lt;p&gt;To understand both DocumentDB and Aurora PostgreSQL, you first need to understand Aurora's storage architecture.&lt;/p&gt;

&lt;p&gt;In traditional single-instance database deployments, the database engine is responsible for page management and writes data pages plus WAL over network-attached block storage. The compute node and storage node are tightly coupled from a write path perspective.&lt;/p&gt;

&lt;p&gt;Aurora changed that model.&lt;/p&gt;

&lt;p&gt;In Aurora's architecture, the compute layer primarily emits redo records to a distributed storage subsystem spanning three Availability Zones. The storage service is responsible for durable replication and page materialization across its internal fleet. This substantially reduces the amount of page-oriented write work the compute node must manage directly.&lt;/p&gt;

&lt;p&gt;Data is replicated across multiple copies in three AZs, continuously backed up to S3, and storage scales automatically.&lt;/p&gt;

&lt;p&gt;The key insight for this comparison: &lt;strong&gt;Amazon DocumentDB and Aurora PostgreSQL both use this same architectural pattern of decoupled compute and distributed, log-oriented storage.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon DocumentDB Unmasked: Compatibility Layer, Not MongoDB Internals
&lt;/h2&gt;

&lt;p&gt;DocumentDB is often described informally as "managed MongoDB on AWS." That description is directionally useful for app teams, but technically misleading — and the distinction matters at the architecture level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon DocumentDB is not MongoDB.&lt;/strong&gt; It does not run MongoDB's codebase. It does not use MongoDB's WiredTiger storage engine. It does not contain any MongoDB SSPL-licensed code. It is a fully proprietary AWS-built database engine that implements a &lt;strong&gt;subset&lt;/strong&gt; of the MongoDB wire protocol and API surface. The "with MongoDB compatibility" qualifier in the product name is doing a lot of heavy lifting — it means your MongoDB drivers and application code can connect and issue operations, but the engine executing those operations is fundamentally different under the hood.&lt;/p&gt;

&lt;p&gt;When your application sends a MongoDB-style operation to DocumentDB, the service accepts the wire protocol request at its compute layer and executes it through DocumentDB's own engine and storage path built on AWS's distributed storage architecture. The query planner is different. The index internals are different. The replication model is different. What you get is API-level compatibility — not behavioral equivalence.&lt;/p&gt;

&lt;p&gt;As of early 2026, DocumentDB supports compatibility modes for MongoDB 3.6, 4.0, 5.0, and the recently launched 8.0 — which brings a new query planner (Planner v3), collation, views, and additional aggregation stages. However, even with 8.0, the compatibility story has real gaps that matter at the architecture level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-offs of Compatibility by Emulation
&lt;/h2&gt;

&lt;p&gt;The decoupled storage model brings real operational advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You do not manage MongoDB replica set internals directly.&lt;/li&gt;
&lt;li&gt;Storage growth is decoupled from traditional node-local disk constraints.&lt;/li&gt;
&lt;li&gt;Read scaling is straightforward via replicas that share the underlying distributed storage model.&lt;/li&gt;
&lt;li&gt;Backups, durability, and failover characteristics inherit AWS-managed behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But compatibility-by-emulation comes with engineering consequences: &lt;strong&gt;API compatibility is not the same as internal engine equivalence or full ecosystem parity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is where things get concrete. These are real compatibility gaps that have bitten teams mid-migration:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;$lookup&lt;/code&gt; with correlated subqueries
&lt;/h3&gt;

&lt;p&gt;DocumentDB (5.0 and earlier) supports equality-based &lt;code&gt;$lookup&lt;/code&gt; joins and uncorrelated subqueries, but &lt;strong&gt;does not support correlated subqueries&lt;/strong&gt; — the &lt;code&gt;$lookup&lt;/code&gt; variant where you use &lt;code&gt;let&lt;/code&gt; and &lt;code&gt;pipeline&lt;/code&gt; together with &lt;code&gt;$expr&lt;/code&gt; to reference parent fields inside the child pipeline. This is a common MongoDB pattern for filtered joins. Your aggregation will fail at runtime with an error along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MongoServerError: Aggregation stage not supported:
  '$lookup on multiple join conditions and uncorrelated subquery'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your application relies on correlated &lt;code&gt;$lookup&lt;/code&gt; pipelines for cross-collection queries, this is a migration-blocking gap that forces you to restructure the aggregation into multiple application-level round trips or flatten the data model.&lt;/p&gt;
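&lt;p&gt;For concreteness, here is a minimal sketch of the correlated pattern in question (the &lt;code&gt;orders&lt;/code&gt;/&lt;code&gt;items&lt;/code&gt; collections and field names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Correlated $lookup: the child pipeline references parent fields
// via `let` + `$expr`. Runs on native MongoDB; fails on DocumentDB.
db.orders.aggregate([
  { $lookup: {
      from: "items",
      let: { orderId: "$_id", minQty: "$min_qty" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$order_id", "$$orderId"] },  // correlated join key
            { $gte: ["$qty", "$$minQty"] }        // correlated filter
        ] } } }
      ],
      as: "matching_items"
  } }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;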

&lt;h3&gt;
  
  
  Missing or version-gated aggregation stages
&lt;/h3&gt;

&lt;p&gt;On DocumentDB 5.0, several commonly used aggregation stages are missing entirely: &lt;code&gt;$facet&lt;/code&gt;, &lt;code&gt;$unionWith&lt;/code&gt;, &lt;code&gt;$graphLookup&lt;/code&gt;, &lt;code&gt;$setWindowFields&lt;/code&gt;, and &lt;code&gt;$merge&lt;/code&gt;. DocumentDB 8.0 adds some of these (notably &lt;code&gt;$merge&lt;/code&gt;, &lt;code&gt;$bucket&lt;/code&gt;, &lt;code&gt;$facet&lt;/code&gt;, and &lt;code&gt;$set&lt;/code&gt;/&lt;code&gt;$unset&lt;/code&gt;), but &lt;code&gt;$setWindowFields&lt;/code&gt; and &lt;code&gt;$graphLookup&lt;/code&gt; remain unavailable. If your analytics pipelines use window functions via &lt;code&gt;$setWindowFields&lt;/code&gt; — a feature MongoDB introduced in 5.0 — you will not find a DocumentDB equivalent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index behavior differences
&lt;/h3&gt;

&lt;p&gt;DocumentDB does not support partial indexes or case-insensitive indexes, and it cannot leverage indexes for queries that use certain operators. For teams relying on partial indexes to reduce index size on sparse data or conditional queries, this means either bloated indexes or redesigned query patterns.&lt;/p&gt;
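&lt;p&gt;For example, a partial index definition like this one runs on native MongoDB but is not supported on DocumentDB (collection and field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Partial index: only documents matching the filter are indexed,
// keeping the index small on sparse data. Native MongoDB only.
db.orders.createIndex(
  { shipped_at: 1 },
  { partialFilterExpression: { status: "active" } }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;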

&lt;h3&gt;
  
  
  Ecosystem tooling assumptions
&lt;/h3&gt;

&lt;p&gt;Tools that assume native MongoDB internals — such as &lt;code&gt;mongodump&lt;/code&gt;/&lt;code&gt;mongorestore&lt;/code&gt; beyond version 100.6.1, or change stream consumers expecting MongoDB-native latency characteristics — may behave differently or fail. Client-side field-level encryption and queryable encryption are not supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; Run the &lt;a href="https://github.com/awslabs/amazon-documentdb-tools" rel="noopener noreferrer"&gt;AWS DocumentDB compatibility tool&lt;/a&gt; against your actual workload in staging. Pay special attention to aggregation pipelines with &lt;code&gt;$lookup&lt;/code&gt; subqueries and any stage not listed in the &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html" rel="noopener noreferrer"&gt;supported APIs documentation&lt;/a&gt;. Do not assume that "MongoDB-compatible" means "drop-in replacement."&lt;/p&gt;




&lt;h2&gt;
  
  
  Aurora PostgreSQL: Does It Fix the MVCC / TOAST Problem?
&lt;/h2&gt;

&lt;p&gt;If you read my previous post, you know that PostgreSQL's MVCC and TOAST mechanisms cause heavy write amplification for large, frequently updated JSON documents. So if you migrate to Aurora PostgreSQL — where the compute node sends redo records to the distributed storage layer rather than writing full pages over a network-attached volume — does that fix the Postgres JSON penalty?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The short answer: No. But the bottleneck shifts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aurora vastly reduces page-level network I/O, but the Aurora PostgreSQL compute node still runs the standard PostgreSQL engine in memory. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MVCC still happens.&lt;/strong&gt; When you update a &lt;code&gt;jsonb&lt;/code&gt; document, PostgreSQL still creates a new tuple version and leaves a dead tuple behind in the heap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOAST still happens.&lt;/strong&gt; Documents over 2KB are still compressed and chunked into separate TOAST tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HOT updates are still blocked on &lt;code&gt;jsonb&lt;/code&gt;.&lt;/strong&gt; Expression indexes on &lt;code&gt;jsonb&lt;/code&gt; fields prevent PostgreSQL from applying Heap-Only Tuple optimization, meaning every update — even a single key change — forces a new index entry across all indexes on that table. This is a write amplification multiplier that compounds with the number of indexes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM is still required.&lt;/strong&gt; You still have to tune autovacuum aggressively to clean up dead tuples. In Aurora, unbounded bloat also means unbounded storage cost — you pay per GB on the Aurora volume.&lt;/li&gt;
&lt;/ul&gt;
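&lt;p&gt;A minimal sketch of the HOT-blocking effect described above (table, index, and field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- An expression index on a jsonb path makes the jsonb column an
-- indexed column, which disables Heap-Only Tuple (HOT) updates:
CREATE INDEX idx_events_type ON events ((payload -&amp;gt;&amp;gt; 'event_type'));

-- Even an update that leaves 'event_type' untouched now writes a
-- new entry into every index on the table:
UPDATE events
SET payload = jsonb_set(payload, '{status}', '"done"')
WHERE id = 42;

-- Observe the ratio of HOT to regular updates:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'events';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;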

&lt;p&gt;The most critical Aurora-specific consideration is &lt;strong&gt;I/O cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In standard Aurora configurations, you are billed per I/O request, at roughly $0.20 per million requests depending on region. Because updating a TOASTed &lt;code&gt;jsonb&lt;/code&gt; document causes write amplification (MVCC new tuple + TOAST rewrite + index entries), a write-heavy JSON workload can generate I/O charges that aren't obvious until your bill arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical tip:&lt;/strong&gt; If you are running a heavy JSON mutation workload on Aurora PostgreSQL, evaluate &lt;strong&gt;Aurora I/O-Optimized&lt;/strong&gt;. This cluster configuration eliminates per-request I/O charges in exchange for a higher baseline on compute and storage pricing. AWS's guidance is that if I/O spend exceeds 25% of your total Aurora database bill, I/O-Optimized likely saves money — and for workloads dominated by large &lt;code&gt;jsonb&lt;/code&gt; updates with multiple indexes, you can easily clear that threshold. Teams have reported 30–40% cost reductions on I/O-intensive workloads after switching.&lt;/p&gt;
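&lt;p&gt;As a back-of-the-envelope illustration of the 25% rule (all dollar figures below are hypothetical round numbers, not quotes; check current regional pricing before deciding):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard configuration, monthly (illustrative):
  Compute:                          $500
  Storage (500 GB):                  $50
  I/O (1,500M requests x $0.20/M):  $300
  Total:                            $850   I/O share ~35%, above 25%

I/O-Optimized (no per-request I/O charges; higher compute and
per-GB storage rates, both illustrative):
  Compute:                          $650
  Storage (500 GB):                 $112
  I/O:                                $0
  Total:                            $762   ~10% cheaper overall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;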

&lt;p&gt;A secondary option worth evaluating: &lt;strong&gt;Aurora Optimized Reads&lt;/strong&gt; with NVMe-backed instance types (r6gd/r6id). This extends the buffer pool to local SSD, which helps with read-heavy JSON access patterns where TOAST decompression causes repeated storage round-trips.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying, Indexing, and the Compute Bottleneck
&lt;/h2&gt;

&lt;p&gt;Because both services use the same underlying distributed storage model, performance differences are won and lost in the compute layer — specifically in the buffer cache and how efficiently each engine executes queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aurora PostgreSQL (&lt;code&gt;jsonb&lt;/code&gt; + GIN)
&lt;/h3&gt;

&lt;p&gt;PostgreSQL's query optimizer has full native visibility into your tables and indexes. When you query a &lt;code&gt;jsonb&lt;/code&gt; column using a GIN index, PostgreSQL knows exactly how to traverse the decompressed JSON structure. You can join a deeply nested JSON document against multiple normalized relational tables, filter with a CTE, and aggregate with window functions — and the query planner handles all of it natively with decades of optimization work behind it.&lt;/p&gt;

&lt;p&gt;You also get partial indexes, expression indexes, and exclusion constraints on JSON data — tools that let you build highly targeted index structures that DocumentDB cannot replicate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon DocumentDB
&lt;/h3&gt;

&lt;p&gt;DocumentDB is highly efficient for standard document retrieval and basic filtering. For pure document access patterns — key lookups, single-collection queries with indexed predicates — performance is strong and the operational simplicity is real.&lt;/p&gt;

&lt;p&gt;However, complex analytics are where the emulation model shows its limits. A multi-stage aggregation pipeline (&lt;code&gt;$match&lt;/code&gt;, &lt;code&gt;$unwind&lt;/code&gt;, &lt;code&gt;$group&lt;/code&gt;, &lt;code&gt;$lookup&lt;/code&gt;) requires the DocumentDB compute layer to pull data from the shared storage volume, hold it in memory, and execute pipeline stages iteratively. The &lt;code&gt;$lookup&lt;/code&gt; operator in particular lacks the relational query optimizer that PostgreSQL uses to build efficient join execution plans — DocumentDB uses hash, sort-merge, or nested-loop algorithms but without the cost-based planner that selects between them intelligently based on table statistics.&lt;/p&gt;

&lt;p&gt;For reporting or ad-hoc analytics that span multiple collections, this becomes a compute bottleneck at scale. And because &lt;code&gt;$setWindowFields&lt;/code&gt; is unavailable, any window-function-style analytics require client-side post-processing or a separate analytics layer entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Since the underlying storage model is shared, your decision is primarily about your application architecture, team skill set, and where your workload is headed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Amazon DocumentDB if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are doing a lift-and-shift of an existing MongoDB application and want managed AWS infrastructure without rewriting your data access layer — but &lt;strong&gt;only after&lt;/strong&gt; you have validated compatibility using AWS's tooling against your actual aggregation pipelines and index usage&lt;/li&gt;
&lt;li&gt;Your data is heavily siloed into independent documents, cross-collection joins are rare, and your aggregation pipelines use only the supported subset&lt;/li&gt;
&lt;li&gt;Schema flexibility and rapid iteration on document shape are more valuable than referential integrity — and you have a team experienced with MongoDB operational patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Aurora PostgreSQL if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building greenfield — the relational model with &lt;code&gt;jsonb&lt;/code&gt; gives you the highest ceiling for complexity as requirements evolve, without having to migrate engines later&lt;/li&gt;
&lt;li&gt;Your data model requires or will eventually require referential integrity, foreign keys, and complex &lt;code&gt;JOIN&lt;/code&gt; operations&lt;/li&gt;
&lt;li&gt;You need robust analytics: full SQL, window functions, GIN indexes, partial indexes, and expression indexes on JSON data&lt;/li&gt;
&lt;li&gt;You want one engine that handles both structured and semi-structured data without a separate system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The case that often gets decided wrong:&lt;/strong&gt; A team has a MongoDB application that uses &lt;code&gt;$lookup&lt;/code&gt; with correlated subqueries, &lt;code&gt;$facet&lt;/code&gt; for multi-faceted search, or &lt;code&gt;$setWindowFields&lt;/code&gt; for analytical aggregations. They assume DocumentDB is the natural AWS landing zone because the API looks the same. In practice, they would be better served migrating to Aurora PostgreSQL with a &lt;code&gt;jsonb&lt;/code&gt;-first schema, where the features they depend on have native equivalents that are more mature and more performant — SQL &lt;code&gt;JOIN&lt;/code&gt; with a cost-based planner instead of &lt;code&gt;$lookup&lt;/code&gt; emulation, &lt;code&gt;FILTER&lt;/code&gt; clauses and window functions instead of &lt;code&gt;$facet&lt;/code&gt; and &lt;code&gt;$setWindowFields&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If your MongoDB usage is already pushing past simple document access into relational territory, the migration to Aurora PostgreSQL is an investment that pays back in query capability and long-term maintainability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;Amazon DocumentDB is a genuine feat of cloud engineering — decoupling a MongoDB-compatible wire protocol from its underlying storage internals enabled legacy application teams to scale on AWS without rearchitecting their data layer. The 8.0 release narrows the gap significantly. For the right workload — high-throughput document access with straightforward query patterns — it earns its place.&lt;/p&gt;

&lt;p&gt;But for modern, data-intensive architectures — where data models inevitably become relational as a product matures — &lt;strong&gt;Aurora PostgreSQL&lt;/strong&gt; remains the stronger long-term foundation. Tune your autovacuum, watch your TOAST tables, evaluate Aurora I/O-Optimized for write-heavy JSON workloads, and keep an eye on HOT update limitations if you are indexing &lt;code&gt;jsonb&lt;/code&gt; fields aggressively. The operational overhead is real, but the query expressiveness and data integrity guarantees pay it back.&lt;/p&gt;

&lt;p&gt;Have you hit unexpected I/O costs from TOAST write amplification on Aurora? Or run into DocumentDB compatibility gaps mid-migration? Drop a comment — I'd love to compare notes.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>documentdb</category>
      <category>aurora</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL vs. MongoDB for JSON: The Internal Trade-offs They Don't Tell You in the Docs</title>
      <dc:creator>Giovanni Martinez</dc:creator>
      <pubDate>Sun, 15 Mar 2026 13:02:35 +0000</pubDate>
      <link>https://forem.com/aws-builders/postgresql-vs-mongodb-for-json-the-internal-trade-offs-they-dont-tell-you-in-the-docs-40oe</link>
      <guid>https://forem.com/aws-builders/postgresql-vs-mongodb-for-json-the-internal-trade-offs-they-dont-tell-you-in-the-docs-40oe</guid>
      <description>&lt;p&gt;The question comes up constantly in architecture discussions: "Should we use MongoDB or PostgreSQL for our JSON-heavy workload?" Having managed both at scale, I can tell you the answer is not as simple as "MongoDB is for documents, Postgres is for tables." There are deep internals at play on both sides that will affect your performance, storage footprint, and operational burden in ways that a quick benchmark won't reveal. Let's dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, a Fundamental Framing Problem
&lt;/h2&gt;

&lt;p&gt;MongoDB is often called a "document database," which people interpret as: great for JSON, superior to relational databases for flexible data. That framing is misleading. MongoDB is not simply a JSON store with a query layer on top. It is a non-relational database: it has no native concept of joins, no foreign key enforcement, no referential integrity, and multi-document ACID transactions spanning arbitrary collections only as a later addition (v4.0) that carries significant performance overhead and is not the default usage pattern.&lt;/p&gt;

&lt;p&gt;To be precise for teams running MongoDB Atlas or Enterprise: while MongoDB does support multi-document transactions, they are bound by a 60-second execution limit by default (the &lt;code&gt;transactionLifetimeLimitSeconds&lt;/code&gt; server parameter) and incur significant throughput penalties as lock contention increases. In PostgreSQL, a transaction is a first-class citizen — the default behavior for every statement. In MongoDB, a multi-document transaction is a specialized escape hatch you reach for when your schema design has failed to keep related data local to a single document. That distinction matters enormously at scale.&lt;/p&gt;
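&lt;p&gt;For comparison, the "escape hatch" looks like this in &lt;code&gt;mongosh&lt;/code&gt; (database and collection names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Multi-document transaction: explicit session, explicit scope,
// subject to the lifetime limit (60 seconds by default).
const session = db.getMongo().startSession();
session.withTransaction(() =&amp;gt; {
  const orders    = session.getDatabase("shop").orders;
  const inventory = session.getDatabase("shop").inventory;
  orders.insertOne({ _id: 1, sku: "A-100", qty: 2 });
  inventory.updateOne({ sku: "A-100" }, { $inc: { qty: -2 } });
});
session.endSession();

// The PostgreSQL equivalent is just the default statement flow:
//   BEGIN; INSERT ...; UPDATE ...; COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;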

&lt;p&gt;This matters because the moment your data has relationships — orders belong to customers, line items belong to orders, products belong to categories — MongoDB forces you to either embed everything (document bloat, duplication, update anomalies) or handle joins in application code (&lt;code&gt;$lookup&lt;/code&gt; is available but is a post-processing aggregation step, not a query optimizer join). Neither is free. PostgreSQL's relational model with JSON support gives you both flexibility and the full power of set-based relational operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL's JSON Capabilities: More Than You Think
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has two JSON data types: &lt;code&gt;json&lt;/code&gt; (stored as plain text, re-parsed on each access) and &lt;code&gt;jsonb&lt;/code&gt; (stored as a parsed binary format, indexed, and operator-rich). For any production workload, use &lt;code&gt;jsonb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;jsonb&lt;/code&gt; you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GIN indexes&lt;/strong&gt; on the entire document or specific paths for fast containment queries (&lt;code&gt;@&amp;gt;&lt;/code&gt; operator)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path-based expression indexes&lt;/strong&gt;: &lt;code&gt;CREATE INDEX ON events ((payload -&amp;gt;&amp;gt; 'event_type'))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full SQL&lt;/strong&gt;: join your JSON documents against normalized tables, filter with CTEs, aggregate with window functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial indexes&lt;/strong&gt;: index only the subset of rows where a JSON field meets a condition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; via &lt;code&gt;CHECK&lt;/code&gt; constraints on JSON paths when you need it&lt;/li&gt;
&lt;/ul&gt;
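&lt;p&gt;The capabilities above, sketched as SQL (the &lt;code&gt;events&lt;/code&gt; table and its fields are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- GIN index plus a containment query:
CREATE INDEX idx_events_payload ON events USING GIN (payload);
SELECT * FROM events WHERE payload @&amp;gt; '{"event_type": "login"}';

-- Partial index: only rows where a JSON field meets a condition:
CREATE INDEX idx_failed_events ON events ((payload -&amp;gt;&amp;gt; 'user_id'))
WHERE payload -&amp;gt;&amp;gt; 'status' = 'failed';

-- Schema validation via a CHECK constraint on a JSON key:
ALTER TABLE events
  ADD CONSTRAINT has_event_type CHECK (payload ? 'event_type');

-- Full SQL: join documents against a normalized table:
SELECT u.name, count(*)
FROM events e
JOIN users u ON u.id = (e.payload -&amp;gt;&amp;gt; 'user_id')::int
GROUP BY u.name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;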

&lt;p&gt;MongoDB also has rich query capabilities on nested documents, but it lacks the composability of SQL. Complex reporting that mixes document access with aggregation across related collections becomes an aggregation pipeline exercise that few SQL developers would recognize as readable.&lt;/p&gt;

&lt;h2&gt;
  
  
  MVCC: The Hidden Cost in PostgreSQL JSON Workloads
&lt;/h2&gt;

&lt;p&gt;PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle concurrent reads and writes without locking. The mechanics create a write amplification problem that is especially painful for large &lt;code&gt;jsonb&lt;/code&gt; columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How MVCC works on an UPDATE:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE UPDATE:
[Heap Page]
 +------------------------------------------+
 | Tuple v1 (xmin=100, xmax=0) | ...data... |
 +------------------------------------------+

AFTER UPDATE (change one JSON key):
[Heap Page]
 +------------------------------------------+
 | Tuple v1 (xmin=100, xmax=200) | ...data... | &amp;lt;-- marked DEAD
 | Tuple v2 (xmin=200, xmax=0)   | ...data... | &amp;lt;-- NEW full copy
 +------------------------------------------+
                                               ^
                    Dead tuple occupies space until VACUUM runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you update a row, PostgreSQL does not modify the existing row in place. It writes a new version of the entire row and marks the old version as dead. Even if you change a single key in a 10KB &lt;code&gt;jsonb&lt;/code&gt; document, the full 10KB is written again. Readers on older snapshots see the prior version until their transaction completes — which is excellent for read concurrency, but means dead tuples accumulate on disk.&lt;/p&gt;

&lt;p&gt;For JSON-heavy workloads with frequent partial updates, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table bloat builds faster than in equivalent workloads on narrow rows&lt;/li&gt;
&lt;li&gt;Index bloat follows, because index entries point to specific heap tuple versions&lt;/li&gt;
&lt;li&gt;Query performance degrades as the visibility map becomes stale and more pages need checking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MongoDB's WiredTiger storage engine also uses MVCC internally, but it employs a copy-on-write B-Tree model rather than PostgreSQL's heap-based tuple versioning. When you update a document, WiredTiger caches the modification in memory and appends it to a Write-Ahead Log. During its periodic checkpoint process, it writes modified pages to new block locations on disk and eventually frees the old space.&lt;/p&gt;

&lt;p&gt;While WiredTiger avoids the exact single-row write amplification seen in PostgreSQL, it is not zero-cost. It still involves writing out entire compressed pages during checkpoints, and relies heavily on background cache eviction to maintain performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  VACUUM: PostgreSQL's Maintenance Obligation
&lt;/h2&gt;

&lt;p&gt;VACUUM is PostgreSQL's answer to MVCC dead tuple accumulation. It reclaims space occupied by dead tuples, updates the visibility map (allowing Index-Only Scans to skip heap fetches), and prevents transaction ID wraparound — the catastrophic failure mode where Postgres refuses to accept new transactions.&lt;/p&gt;

&lt;p&gt;PostgreSQL has autovacuum, a background daemon that triggers based on a dead tuple threshold (&lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; defaults to 20% of table size). For large tables, this default is dangerously high — a 500 million row table would need 100 million dead tuples before autovacuum wakes up.&lt;/p&gt;

&lt;p&gt;For JSON-heavy workloads, tune aggressively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; to &lt;code&gt;0.01&lt;/code&gt; or even &lt;code&gt;0.005&lt;/code&gt; for large tables&lt;/li&gt;
&lt;li&gt;Raise autovacuum's I/O budget by reducing &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Monitor &lt;code&gt;pg_stat_user_tables&lt;/code&gt;: track &lt;code&gt;n_dead_tup&lt;/code&gt;, &lt;code&gt;last_autovacuum&lt;/code&gt;, and &lt;code&gt;last_autoanalyze&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;VACUUM ANALYZE&lt;/code&gt; after bulk loads or mass updates to refresh planner statistics&lt;/li&gt;
&lt;/ul&gt;
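&lt;p&gt;These settings can be applied per table rather than cluster-wide; the values below are illustrative starting points, not universal recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Per-table autovacuum tuning for a hot jsonb table:
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_cost_delay   = 1   -- ms; lower = bigger I/O budget
);

-- Monitoring: dead tuples and last (auto)vacuum runs:
SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;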

&lt;p&gt;MongoDB does not have an equivalent to VACUUM. WiredTiger reclaims space within its B-tree pages automatically via checkpointing, and collection-level compaction can be triggered manually. There is no "transaction ID wraparound" risk, and space reclamation is generally more transparent to the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  TOAST: PostgreSQL's Large Value Storage
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has a hard limit: a single row must fit on one 8KB page. Since &lt;code&gt;jsonb&lt;/code&gt; documents can easily exceed 8KB, PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) to handle large values.&lt;/p&gt;

&lt;p&gt;When a &lt;code&gt;jsonb&lt;/code&gt; value exceeds roughly 2KB (the TOAST threshold), PostgreSQL will automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compress the value (pglz compression by default; LZ4 is selectable from PostgreSQL 14 via &lt;code&gt;default_toast_compression&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If still too large, chunk it into 2KB segments stored in a separate TOAST table (&lt;code&gt;pg_toast_&amp;lt;oid&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Store a pointer in the main heap row referencing the TOAST chunks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is largely transparent, but the performance implications are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt;: fetching a TOASTed column requires an additional heap scan on the TOAST table — extra I/O on every large document fetch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updates&lt;/strong&gt;: updating any field in a large &lt;code&gt;jsonb&lt;/code&gt; document causes the entire value to be re-TOASTed, even if you only changed one key. Combined with MVCC write amplification, this is double the I/O penalty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM on TOAST tables&lt;/strong&gt;: autovacuum must process the TOAST table separately; TOAST table bloat is a common source of hidden disk usage that operators miss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index access&lt;/strong&gt;: GIN indexes on &lt;code&gt;jsonb&lt;/code&gt; operate on the decompressed value, so retrieving the full document still requires a TOAST table hit, even if the query filter was satisfied entirely by a GIN index&lt;/li&gt;
&lt;/ul&gt;
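&lt;p&gt;Hidden TOAST usage can be made visible with a catalog query along these lines (the table name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Find the TOAST table behind a jsonb-heavy table, compare sizes:
SELECT c.relname AS main_table,
       t.relname AS toast_table,
       pg_size_pretty(pg_relation_size(c.oid)) AS main_size,
       pg_size_pretty(pg_relation_size(t.oid)) AS toast_size
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relname = 'events';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;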

&lt;p&gt;The practical recommendation: if your JSON documents regularly exceed 4–8KB, consider splitting large, rarely-queried fields into separate columns or an object store. Keep the frequently-queried JSON fields in a compact &lt;code&gt;jsonb&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;MongoDB documents have their own size limit (16MB per document) and store data in BSON format. WiredTiger handles variable-length documents within its own page structures, with no operator-visible overflow mechanism like TOAST, which gives MongoDB an advantage for workloads dominated by large, frequently-updated documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  PostgreSQL (jsonb)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transactions&lt;/strong&gt;: ACID by default for every statement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joins&lt;/strong&gt;: native SQL joins with optimizer support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity&lt;/strong&gt;: foreign keys and relational constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update write cost&lt;/strong&gt;: full-row rewrite on updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large JSON reads&lt;/strong&gt;: TOAST can add extra I/O on large values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space maintenance&lt;/strong&gt;: requires VACUUM tuning on write-heavy workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index options&lt;/strong&gt;: GIN, B-tree, partial, and expression indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt;: full SQL, CTEs, and window functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt;: mixed relational + JSON workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MongoDB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transactions&lt;/strong&gt;: multi-document ACID available, but with higher overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joins&lt;/strong&gt;: &lt;code&gt;$lookup&lt;/code&gt; as an aggregation stage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity&lt;/strong&gt;: no native relational integrity model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update write cost&lt;/strong&gt;: copy-on-write page checkpointing in WiredTiger&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large JSON reads&lt;/strong&gt;: BSON inline storage up to 16MB document limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space maintenance&lt;/strong&gt;: automatic space reuse, optional manual compaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index options&lt;/strong&gt;: compound, multikey, text, and geospatial indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt;: aggregation pipeline model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt;: document-first workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxcrmxyymgndny6jycoc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxcrmxyymgndny6jycoc.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Operational Reality
&lt;/h2&gt;

&lt;p&gt;In my experience managing both at scale, PostgreSQL's MVCC + VACUUM model requires more active DBA engagement for write-heavy JSON workloads. You will fight bloat if you don't tune autovacuum aggressively. TOAST adds I/O overhead that isn't obvious until you instrument it. But the payoff — full SQL expressiveness, relational integrity, and a single database for everything — is significant.&lt;/p&gt;

&lt;p&gt;MongoDB's operational model is simpler for pure document workloads, but the moment your product evolves and relationships emerge (they always do), you pay the cost of having chosen a non-relational foundation at a time when re-architecting is expensive.&lt;/p&gt;

&lt;p&gt;The best database for JSON is the one you understand deeply enough to tune, monitor, and operate at production scale. For most teams building data-intensive applications in 2026, that database is PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One contender I haven't covered here is Amazon DocumentDB — a MongoDB-compatible service built on the Aurora storage layer that deserves its own deep dive. I'll be publishing a follow-up post that adds DocumentDB to the mix, including what it actually is under the hood, where it diverges from native MongoDB, and how it stacks up against Aurora PostgreSQL for JSON workloads on AWS.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Have you run into TOAST bloat or MVCC write amplification in a PostgreSQL JSON workload? Or migrated from MongoDB back to Postgres? Drop a comment — I'd love to compare notes.&lt;/p&gt;

&lt;p&gt;canonical_url: &lt;a href="https://iqtoolkit.ai/blog/postgresql-vs-mongodb-json-internal-tradeoffs" rel="noopener noreferrer"&gt;https://iqtoolkit.ai/blog/postgresql-vs-mongodb-json-internal-tradeoffs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>mongodb</category>
      <category>database</category>
      <category>json</category>
    </item>
  </channel>
</rss>
