Forem: Rick Houlihan

Lakebase, Meet PDB: The "Third-Generation" Database Oracle Shipped in 2013

Rick Houlihan — Mon, 11 May 2026 19:32:27 +0000

By Rick Houlihan & Patrick Meredith

Databricks named the right problem. Their answer is a credible execution of an idea Oracle Multitenant solved a decade earlier — and as it turns out, the gap they think they've found in Oracle was only one PL/SQL package away from closing.

The Pitch That Started This

A colleague forwarded me the Databricks blog post the other day. Opening line:

"In our previous blog, we introduced Lakebase, the third-generation database architecture that fundamentally separates storage and compute."

— Databricks, "How agentic software development will change databases"

So, like what Oracle did 12 years ago.

I'm being a little snide. Bear with me — there's a real article underneath. The blog is a thoughtful read about how AI agents are changing database workloads, and most of the diagnosis is right. Their telemetry is interesting:

"In Databricks's Lakebase service, AI agents now create roughly 4x more databases than human users."
"[O]n average, each database project has ~10 branches and some databases with nested branches reaching depths of over 500 iterations…"
"[F]or about half of these agentic applications, the database compute lifetime is less than 10 seconds."

That last number is real. Agents don't behave like humans. They generate variants by the dozen, run them in parallel, evaluate against an eval set, keep the winner, throw away the losers. Evolutionary development. The economics break down completely on a database that costs $200/month per instance with a five-minute provisioning cycle.

So Databricks is right about the problem. They're right that databases need a branching primitive. They're right that storage and compute need to scale independently. They're right that the always-on cost floor doesn't survive contact with agents.

This article is not about whether they're wrong on the diagnosis.

It's about whether their answer is novel — and what the architecture-correct version looks like. Because Oracle has been shipping the same primitive in the engine since July 2013, and a small Python + PL/SQL wrapper is all that separates it from the developer experience Databricks just announced.

Patrick and I thought it was worth writing this down.

What Lakebase Actually Is

Spoiler: it's Neon.

Databricks announced its agreement to acquire Neon on May 14, 2025. The press release didn't disclose a price (industry reporting put it at roughly $1 billion), but it did volunteer a useful telemetry data point: "over 80 percent of the databases provisioned on Neon were created automatically by AI agents rather than by humans." That number is also the reason this acquisition happened — Neon, founded in 2021 by Postgres committers, had built a serverless Postgres architecture that AI agents could actually afford to use: stateless compute nodes, a Paxos-based safekeeper quorum holding WAL, and a pageserver materializing pages on demand from object storage. Branches were stamped as metadata pointers at a moment in WAL history; copy-on-write at the storage layer made divergence cheap.

That architecture is good engineering. It's also exactly what Databricks now ships as Lakebase. Their own architecture deep-dive opens with:

"In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers."

— Databricks, "How lakebase architecture delivers 5x faster Postgres writes"

The same post describes how, when Postgres compute requests a page from storage, the pageserver "reconstructs it by finding the most recent materialized image of that page and replaying any WAL deltas on top." If you've read Neon's published architecture overview, this is familiar vocabulary — stateless compute → safekeepers → pageserver → object storage — because it is Neon's architecture. Lakebase is Neon with a Databricks brand on top.

To be clear: that's not a problem. Neon is good engineering. Acquiring it and integrating it with the lakehouse is a perfectly defensible product move — buying a four-year-old startup whose technology already solves the agent-economics problem is faster than building one yourself. Nobody should be mad about an acquisition.

The problem is the next thing Databricks did, which was call a four-year-old Postgres-branching architecture "the third-generation database architecture that fundamentally separates storage and compute." That's a marketing claim, not an architectural one, and it has two specific issues. First, "third generation" implies a chronology — first generation was monolithic, second was something, this is the third — and Databricks has never been particularly clear about what the second generation was, which is convenient because any honest answer would include systems older than Lakebase that already do what Lakebase does. Second, the "fundamentally separates storage and compute" phrasing treats compute/storage separation as a 2025 innovation, which is awkward when Snowflake shipped that architecture commercially in 2014 and Oracle shipped a multitenant variant of it in July 2013.

"Third generation" sells better than "we acquired a 2021 startup six months ago, here's what they built." It also doesn't survive a history check.

That's the next section.

The "Third-Generation" Sleight of Hand

Same Databricks blog post — "A New Era of Databases: Lakebase," June 12, 2025 — one "Database Architecture Evolution" section, three generations laid out in sequence.

Generation 1 — the monoliths:

"Examples: MySQL, Postgres, classic Oracle"

"Database systems started as absolute monoliths."

Generation 2 — proprietary loose coupling:

"Examples: Aurora, Oracle Exadata"

"As cloud infrastructure improved, vendors physically separated storage from compute, moving storage into proprietary backend tiers."

Same Oracle. Two generations. One page apart. Pick one.

I'll be charitable and assume the intended argument was "early Oracle was a monolith, modern Oracle isn't." Fine. Then "modern" deserves a timeline.

Year	System	What was separated
2001	Oracle Real Application Clusters (RAC)	Multiple compute nodes against a single shared SAN/NAS storage substrate (Oracle 9i)
2008	Oracle Exadata v1	Database servers vs. intelligent storage cells with predicate offload (Smart Scan), GA September 2008
2010	Google Dremel / BigQuery	Disaggregated storage and compute, columnar — VLDB 2010 paper
July 1, 2013	Oracle Database 12c / Multitenant	`CREATE PLUGGABLE DATABASE … FROM … SNAPSHOT COPY` ships in the engine
2014	Snowflake (GA)	Three-layer cloud-native: storage / virtual warehouses / cloud services
Nov 2014 / Jul 2015	Amazon Aurora	Compute decoupled from a 6-way replicated storage layer across 3 AZs (preview Nov 2014, GA July 2015)
2021	Neon (founded)	Postgres-specific WAL-level disaggregation with branching
May 14, 2025	Lakebase = Databricks acquires Neon	Neon's architecture wrapped around open lake storage

Storage and compute have been separated in production databases for 25 years. Across two paradigms, four vendors, and at minimum seven shipping systems before Lakebase showed up. "Third generation" isn't an architectural claim. It's a marketing label that requires the reader to forget about Oracle RAC, Exadata, Dremel, Multitenant, Snowflake, Aurora, and Neon in roughly that order.

So what's actually new in Lakebase? The same blog is honest about this if you read past the generation label:

"Like Gen 2, it separates compute from storage, but with a critical difference: both the storage infrastructure and the data formats are completely open."

Translation: Gen 2 already separated storage from compute. Their own text concedes the point. The Gen 3 differentiator they're actually claiming is open data formats. We'll dismantle that claim in Section 10 — short version, "open formats" turns out to do less work than the marketing suggests once you ask which formats, governed by whom, queryable how. But file the claim for now.

The other thing the launch blog flags as Gen 3 distinctive is branching:

"Databases can be branched and cloned the way developers branch code."

Branching as a developer-experience primitive is a fair thing to call out — it genuinely changes how AI agents and dev workflows interact with databases, and we conceded that point in Section 1. Branching as a database-engine primitive, though, has shipped in Oracle Multitenant since July 1, 2013, with documented syntax, multiple supported storage substrates, and a hard limit four to eight times higher than Lakebase's. Which is the next section.

"Third-generation database architecture? We're on our fifth." - Patrick Meredith

PDB Snapshot Copy: The Branching Primitive Oracle Has Shipped Since 2013

The syntax is one statement:

CREATE PLUGGABLE DATABASE my_experiment_branch
  FROM base_experiment_pdb
  SNAPSHOT COPY;

The Oracle 19c SQL Reference describes what happens underneath:

"The SNAPSHOT COPY clause instructs the database to clone the source PDB using storage snapshots. This reduces the time required to create the clone because the database does not need to make a complete copy of the source data files."

— Oracle Database 19c SQL Language Reference: CREATE PLUGGABLE DATABASE

What "storage snapshots" means depends on the substrate. The same reference is explicit: with CLONEDB=FALSE, "the underlying file system for the source PDB's files must support storage snapshots. Such file systems include Oracle Automatic Storage Management Cluster File System (Oracle ACFS) and Direct NFS Client storage." With CLONEDB=TRUE, "the underlying file system for the source PDB's files can be any local file system, network file system (NFS), or clustered file system that has Direct NFS enabled. However, the source PDB must remain in open read-only mode as long as any clones exist."

So:

Storage substrate	Snapshot mechanism	Notes
Oracle ACFS	Copy-on-write storage snapshots	`CLONEDB=FALSE` path
Direct NFS Client (dNFS)	Copy-on-write storage snapshots on snapshot-capable NFS array	`CLONEDB=FALSE` path
Exadata sparse disk groups	Copy-on-write	Source PDB must be read-only
Standard FS + `CLONEDB=TRUE`	dNFS sparse files over NFS	Source PDB must remain open read-only while clones exist
Exascale (23ai+)	Redirect-on-write	"created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers"

Note the precision on "redirect-on-write" — that's Oracle's official term only for Exascale snapshots in 23ai+. Older substrates use copy-on-write semantics. Per the Exadata Database Service on Exascale Infrastructure documentation: "These PDB snapshots leverage Exascale redirect-on-write technology so that they are created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers." The distinction matters if you're going to argue with someone about it.

Sibling features in the Multitenant family:

PDB Snapshot Carousel (introduced in 18c, not 19c — common citation error). Per oracle-base.com: "Oracle 18c introduced the concept of a snapshot carousel, which is a series of point-in-time copies, or snapshots, of a PDB." Default 8 snapshots, hard cap at 8 via MAX_PDB_SNAPSHOTS. Oldest is overwritten when full. Useful for short-horizon point-in-time recovery without the overhead of full backups.
Refreshable Clones. Physically full copies with incremental redo apply. Different beast from snapshot copies (full storage cost, but ongoing sync from source). Convertible one-way to a regular PDB.
PDB density. Up to 4098 PDBs per CDB on Enterprise Edition with Multitenant licensing — the MAX_PDBS reference lists possible values of 5, 254, or 4098 by edition (Standard/Express, Standard Edition 2, Enterprise Edition respectively).

Now compare ceilings:

Platform	Branch limit	Branch depth	Cross-region
AWS Aurora	15 copy-on-write clones per source; 16th becomes a full copy	No explicit depth ceiling, but each level re-consumes the 15 budget	"You can't create a clone in a different AWS Region from the source Aurora DB cluster"
Lakebase (Databricks doc)	500 per project; only 10 unarchived (active) at once	Hundreds nested (per their telemetry)	Per region
Oracle Multitenant	Up to 4098 PDBs per CDB	No documented depth limit	RAC + Data Guard, cross-region via Active Data Guard

Lakebase's 500-per-project ceiling is generous compared to Aurora's 15. Oracle's 4098 is generous compared to Lakebase's 500 by an order of magnitude. And Lakebase has another hard cap that doesn't appear in the cloning side of the comparison: it allows only 10 unarchived (active) branches at once. Oracle has no equivalent active-cap; you tune branch density via Resource Manager based on your workload, which is the next section.

This primitive shipped on July 1, 2013, in Oracle Database 12c. Twelve years before Lakebase. In the database engine, not in a wrapper. With a single SQL statement, documented in the official SQL Language Reference. There is no Postgres extension here. There is no separate page server, no Paxos quorum, no $1B acquisition. It's just CREATE PLUGGABLE DATABASE … SNAPSHOT COPY, and it has been since the series finale of Breaking Bad.

The Compute Story Most People Get Wrong

A note on this section: the structural argument here came from Patrick during a Slack thread when he challenged me on the scale-to-zero comparison. I had it wrong initially. Here's the correct read, in his voice.

The naive comparison says Lakebase wins on scale-to-zero because branches scale individually to zero compute when idle. Oracle, the story goes, is "always on" — fixed ECPUs allocated to the ADB instance, multiple PDBs sharing the pool, no per-branch zero-cost dormancy.

That comparison gets the shape right and the conclusion wrong.

Yes, in Autonomous Database Serverless, ECPUs are allocated at the instance level, not per PDB. Yes, Snapshot Copy PDB branches inside an ADB share that pool. The naive read says: "uh-oh, no isolation, abandoned branches will eat compute." The correct read is: abandoned branches in a shared pool consume nothing by construction — because they aren't reserving anything.

Walk through the mechanics:

Closed PDBs consume zero CPU and zero shadow processes. ALTER PLUGGABLE DATABASE foo CLOSE IMMEDIATE; and the branch is dormant. The 26c SQL Reference describes the semantic: "the PDB equivalent of the SQL*Plus SHUTDOWN command with the immediate mode." Metadata stays in the dictionary; nothing else stays resident.
Idle open PDBs consume near-zero. Just metadata pages.
Active PDBs draw from the shared pool. That pool auto-scales: per the Oracle docs, "with compute auto scaling enabled the database can use up to three times more CPU and IO resources than specified by the number of ECPUs." You pay for the burst when it happens, not when it doesn't.
Resource Manager governs the priority. CPU shares, MAX_IOPS, MAX_MBPS, sessions, parallel servers, per-PDB SGA_TARGET and PGA_AGGREGATE_LIMIT. You decide which branches get more pool when contended.
V$PDBS and V$RESOURCE_LIMIT expose per-branch consumption so a supervisor process can watch and auto-suspend.

So what's the real difference? Lakebase per-DB scale-to-zero with cold-start latency on resume. Oracle shared elastic pool with no cold start.

For an agentic workflow, where the supervisor might wake an "abandoned" branch tomorrow to revisit a hypothesis it shelved today, the no-cold-start property matters. The branch has been consuming nothing; the moment it gets a connection, it's responsive within milliseconds because the compute pool is already warm. Lakebase, by design, has to spin compute back up.

Which means the elasticity scoreboard most people read off the spec sheet — "Lakebase: scale-to-zero ✅ / Oracle: shared pool ❌" — is solving the same problem two different ways and pretending one wins. Different shape. Same economics for abandoned experiments. Faster wakeup on Oracle when the agent comes back.

Sharing compute between PDBs isn't a bug. It means abandoned branches aren't wasting compute, period.

Or as I put it in Slack when this came up: "What we want is exactly what we already have. The compute is scaled. Abandoned branches contribute nothing." That's the architecture.

— Patrick

The Hard Limits

Side-by-side, with citations on every claim:

Capability	Lakebase	Oracle Multitenant + ADB
Total branches	500 / project (Databricks doc)	Up to 4098 / CDB (MAX_PDBS)
Active branches	10 (hard cap)	No hard cap; tuned via Resource Manager
Branch creation speed	Instant (metadata + COW)	Near-instant on snapshot-capable storage
Cold-start on resume	Sub-second to multi-second	None — shared pool
ACID	Postgres MVCC	Full ACID, RAC, Active Data Guard
Failover behavior	Postgres-standard (kills in-flight)	Transparent Application Continuity — in-flight transaction replay
Vector search	Postgres extension	In-engine, optimized by 40-year-old CBO
JSON	jsonb (sequential traversal)	OSON binary, hash-indexed O(1) field access
Graph	Postgres extension	SQL/PGQ, in-engine
Cross-modal queries (vector + JSON + graph + relational)	Limited by extension boundaries	Single transaction, single query plan
Open data format	"Postgres page on S3" (Postgres-only readable)	OSON + Iceberg + Parquet + Mongo wire + native SQL
Mongo wire compatibility	None	Yes (Oracle MongoDB API)

Lakebase wins on developer-experience polish today. The branching UX is wired into the product, the CLI is published, the dashboard renders branch trees. Credit where due — that's a real product investment.

Oracle wins on every limit that matters once you stop counting GitHub stars. Density (4098 vs 500). Active concurrency (no cap vs 10). ACID. Failover that doesn't kill your transactions. Vector + JSON + graph + spatial + relational in one query plan optimized by 40 years of CBO development. Mongo wire compatibility, for the developers who already wrote against MongoDB and don't want to rewrite their app to evaluate a database.

The DX gap is real. It's also the easiest gap to close, which is the next section.

The DX Gap, And Why It's Trivial to Close

Patrick said it best in the original Slack thread: "We probably should develop a lightweight external API too. That should be extremely simple — it's all external to the database."

He was right and he's already shipped it.

The DX gap is real. There is no pdb branch my-experiment command in stock Oracle. Lakebase has a polished branching UX with a published CLI, a dashboard, and git-shaped semantics. We're not going to pretend otherwise.

But this is a wrapper-shaped problem, not a kernel-shaped problem. Patrick built the wrapper:

pmeredit/pdb-branch — "a small multi-language library over a shared PL/SQL package for making Oracle PDB snapshot copies feel like cheap database branches for agentic workflow experiments."

Python, Node.js, Rust, and Java bindings, plus a Rust-built pdb CLI. Releases alongside this article.

The architecture is small enough to fit on a napkin:

PDB_BRANCH PL/SQL package — installed and upgraded automatically by the language binding at startup. Wraps CREATE PLUGGABLE DATABASE … SNAPSHOT COPY with idempotent lifecycle DDL.
Three control tables in CDB$ROOT:
- PDB_BRANCH_BRANCHES — branch registry (name, parent, state, expiration, score)
- PDB_BRANCH_EVENTS — audit log of branch lifecycle events
- PDB_BRANCH_PROFILES — branch-to-Resource-Manager-profile mapping
BranchClient wrappers in four languages — Python over python-oracledb, Node.js over oracledb, Rust over the ODPI-C-based oracle crate (with a pure-Rust oracle-rs path for non-SYSDBA work), and Java. One PL/SQL contract, four idiomatic surfaces.
A pdb Rust CLI — bin/pdb wraps the Rust binding so callers don't need to know Cargo's target/ layout. git branch-shaped commands, .pdbprofile TOML config, and per-flag environment-variable overrides.
Optional Resource Manager profiles: PDB_BRANCH_ACTIVE, PDB_BRANCH_IDLE, PDB_BRANCH_BACKGROUND.

Two ways to drive it. The library surface (Python shown; Node/Rust/Java are equivalents):

from pdb_branch import BranchClient

client = BranchClient(connection)  # auto-installs/upgrades PL/SQL package

client.create_branch(
    "AGENT_RAG_042",
    from_pdb="GOLDEN_MASTER",
    notes="try smaller chunk size and rerank before answer synthesis",
)
client.record_score("AGENT_RAG_042", 0.91, notes="eval: qa_regression_v3")
client.promote("AGENT_RAG_042", notes="winner for current retrieval policy")
client.cleanup(close_idle_after_minutes=60, drop_expired=True)

Or, at the shell, the same workflow via the pdb CLI:

bin/pdb init --dsn localhost:1521/FREE --user sys --password ... --from FREEPDB1
bin/pdb branch AGENT_RAG_042 --notes "try smaller chunk size and rerank"
bin/pdb score   AGENT_RAG_042 0.91 --notes "eval: qa_regression_v3"
bin/pdb promote AGENT_RAG_042
bin/pdb branch -d AGENT_RAG_042

bin/pdb init writes a .pdbprofile so the daily commands stay short. The CLI also accepts environment-variable overrides and flag overrides — flags beat env vars beat .pdbprofile beat local defaults.

That's the entire developer experience. Branch, score, promote, reap. The argument that Oracle "doesn't have git branch for databases" was true a week ago. Today there's a CLI in the repo, an integration test that runs it against an Oracle Free container in CI, and a Rust binary you can drop in your $PATH.

One architectural point worth elevating: the two-connection security model. The agent never gets SYSDBA. There are two distinct connections:

Control-plane connection — trusted orchestration code → CDB$ROOT as SYSDBA → uses BranchClient to create, open, close, and drop PDB branches.
Workload connection — the agent → branch PDB → normal application user → ordinary SQL against branch-local data.

The agent receives only a DSN to its assigned branch and standard application credentials. It cannot create branches, drop branches, or escape its sandbox. Lakebase has nothing analogous in its branching API today; the agent-vs-supervisor security boundary is enforced at the cloud-IAM layer rather than in the database itself, and that's a category weaker than separation of concerns enforced inside the engine.

Snapshot-copy fallback is engineered, not aspirational. When the library requests SNAPSHOT COPY and the underlying storage rejects it — Oracle Free's container filesystem returns ORA-17525 / ORA-65169, for instance — the library transparently retries as a full clone, records a SNAPSHOT_COPY_FALLBACK row in PDB_BRANCH_EVENTS, and (in the Python binding) emits a SnapshotCopyFallbackWarning. Correctness is preserved on substrates that can't sparse-clone; the events table makes it visible when that happened so capacity planning isn't a guessing game.

Free deployment path:

Oracle Database 23ai/26ai Free Docker image — container-registry.oracle.com/database/free. CDB service FREE, default PDB FREEPDB1. Multiple branch PDBs supported. The Free image's container filesystem doesn't support storage snapshots, so snapshot_copy=True is silently treated as a full clone via the fallback path above — which means 10–30 branches realistic on a laptop, not hundreds. $0 cost forever, and the Oracle Free integration tests in the repo run the Python, Node.js, Rust, Java, and CLI surfaces against this image in CI.
Self-managed CDB on 19c+ with snapshot-capable storage — production target. ACFS, dNFS, Exadata sparse, or Exascale. Branch DDL uses Oracle Managed Files via CREATE_FILE_DEST, preferring DB_CREATE_FILE_DEST when set and otherwise deriving a destination from the parent PDB's datafile directory.
ADB Serverless / Always Free is explicitly NOT a v1 target. ADB application connections land in an existing PDB, not in CDB$ROOT, so they cannot run PDB branch DDL. A real architectural constraint of ADB's tenancy model, not a pdb-branch limitation.

The README is honest about v1 boundaries: the idempotent installer doesn't migrate destructive schema changes yet; PL/SQL identifiers are restricted to simple unquoted Oracle names; promotion is metadata-only, with scaling and export workflows left to deployment-specific adapters. That's an honest v1 scope.

The article is the "why." The repo is the "how." They land together, today.

The Agentic Workflow on Oracle

The lifecycle Patrick described in our Slack thread, mapped to the actual pdb-branch API:

Phase 1 — heavy experimentation. The supervisor holds the SYSDBA control-plane connection and spins up branches:

for hypothesis in hypotheses:
    branches.create_branch(
        f"AGENT_{hypothesis.id}",
        from_pdb="GOLDEN_MASTER",
        notes=hypothesis.description,
    )
    branches.set_profile(f"AGENT_{hypothesis.id}", "PDB_BRANCH_ACTIVE")

Each agent receives a DSN to its assigned branch plus an app-user credential. Agents do not see CDB$ROOT. They run their experiments — vector queries, JSON queries, SQL, whatever the eval needs — against ordinary Oracle PDBs. Once the branch PDB is open there is no special "branch query mode": the branch is just an isolated Oracle PDB service.

Phase 2 — evaluate. Supervisor logs scores back to PDB_BRANCH_BRANCHES as agents finish:

branches.record_score("AGENT_RAG_042", 0.91, notes="eval: qa_regression_v3")

The supervisor process can watch V$PDBS (open mode, last open time, total size) and V$RESOURCE_LIMIT (per-PDB CPU and I/O draw) for liveness and resource consumption.

Phase 3 — promote and reap. Winners stay active. Losers get downgraded or closed:

branches.promote("AGENT_RAG_042", notes="winner for current retrieval policy")
branches.cleanup(close_idle_after_minutes=60, drop_expired=True)

cleanup is the auto-suspend / auto-drop primitive. In production you don't run this from the supervisor; you schedule PDB_BRANCH.CLEANUP from DBMS_SCHEDULER so the orchestration code doesn't need to babysit branch lifecycle.

Behind those four method calls, the SQL is exactly what you'd expect:

CREATE PLUGGABLE DATABASE AGENT_RAG_042
    FROM GOLDEN_MASTER SNAPSHOT COPY;

ALTER PLUGGABLE DATABASE AGENT_RAG_042 OPEN;

ALTER PLUGGABLE DATABASE AGENT_RAG_042
    SET DB_PERFORMANCE_PROFILE='PDB_BRANCH_ACTIVE';

INSERT INTO PDB_BRANCH_BRANCHES (NAME, PARENT, STATE, NOTES, CREATED)
VALUES ('AGENT_RAG_042', 'GOLDEN_MASTER', 'ACTIVE',
        'try smaller chunk size...', SYSTIMESTAMP);

INSERT INTO PDB_BRANCH_EVENTS (BRANCH_NAME, EVENT_TYPE, DETAILS, EVENT_TIME)
VALUES ('AGENT_RAG_042', 'CREATED', '{"from":"GOLDEN_MASTER"}', SYSTIMESTAMP);

Five statements, one transaction. The branch is live. An agent connects to AGENT_RAG_042 as app_user and runs its experiment.

This is what Databricks calls evolutionary algorithms in the database. It's the right framing. The substrate has been Oracle for a decade; what was missing was the wrapper that makes it feel like git. Each language binding is roughly one module long, the Rust pdb CLI is one binary, and they all sit on top of one shared PL/SQL package. The whole DX gap was about that much code.

Cost Reality

Both platforms have real costs and real free entry points. Skipping the marketing-deck pricing slide and going straight to what an engineer would actually pay:

Workload pattern	Lakebase	Oracle ADB Serverless 2 ECPU
50 mostly-idle branches, occasional bursts	$80–$150/mo	$190–$290/mo
100+ branches, high density	Hits the 10-active wall	Scales naturally to thousands
Sustained 8+ hr/day activity	Capacity-unit cost climbs	Cheaper at sustained load
Storage at scale	$0.345 / GB-month	~$0.024 / GB-month (≈15× cheaper)
Free for prototyping	Always Free tier (limited)	Free Docker image: $0 forever

These are public list prices as of mid-2026, picked from each vendor's published rates. Run the numbers for your workload.

The honest read:

Lakebase wins on bursty, mostly-idle floors with light data. That's the optimization point of per-DB scale-to-zero, and they do it well.
Oracle wins on density, sustained activity, and storage at scale. When agents are actually doing work, the shared-pool model delivers more compute per dollar. When experiment data grows, the storage cost differential alone (~15×) can dominate the total.
Oracle Free Docker is genuinely free. No cloud signup, no credit card, no quotas. Patrick's pdb-branch README documents this as the recommended local prototyping path.

This is the compute story restated as economics. Per-DB scale-to-zero looks cheap when nothing is running. Shared elastic pool is cheaper when anything is running. Pick the model that matches your workload, not the marketing scoreboard.

What's Actually New About Lakebase

Worth giving Databricks an honest hearing. The "third-generation" framing collapses the moment you check the dates. What about their other claim — that in Lakebase "both the storage infrastructure and the data formats are completely open"?

That one survives partway and dies in the details.

The operational store in Lakebase is Postgres page format on cloud object storage. That's what they mean by "open storage infrastructure." But Postgres' on-disk page layout is a physical storage format, not a portable interchange format. The only thing that can read a Postgres page file is the Postgres engine. Calling that "open" because the Postgres source code is open is a category error. By that logic, MongoDB's BSON is "open" because the spec is published.

The other openness claim — that the same data is queryable as Iceberg by external analytical engines — is true. But the Iceberg view isn't the operational store. It's a separate projection layer (the "Mooncake" bridge — Databricks' OLTP-to-lakehouse export pipeline). Iceberg files are derived from the operational Postgres pages, not the same bytes.

Which means Lakebase's actual architecture is:

A canonical store in Postgres-only page format. Closed to anything that isn't Postgres.
A projected shape in Iceberg, exported to make the data analytically accessible.

That's exactly canonical form + projected shape. It's the architecture pattern I've been calling Unified Model Theory for the last two years. Databricks reinvented UMT, called the closed canonical store "open," and called the projection layer "openness."

Oracle's answer to "open data" is the converged engine itself: same canonical store, multiple shapes natively in the engine — SQL, JSON Duality Views, Property Graph, Vector, Spatial, Full-Text Search, Mongo wire protocol, OSON serialization out, Iceberg/Parquet for analytics. No bridge layer required. The cost-based optimizer sees all the modalities in a single query plan.

The architecture-correct way to expose canonical data through multiple shapes is to do it in the engine. That is what Oracle has been shipping for 40 years and what UMT formalizes. Databricks' Lakebase + Mooncake architecture is one valid implementation pattern of the same idea, with two extra hops and a new vocabulary.

What's actually new in Lakebase isn't the architecture. It's the packaging — a polished branching UX wired into a data lake brand and a billion dollars of marketing oxygen. That's a real product investment and a credible push into a market segment Oracle has under-marketed. Credit where due.

It's just not "third-generation database architecture." It's first-generation Postgres branching with a second-generation marketing department.

The Real Take

Three things to land:

1. Agents do need branching. Databricks' diagnosis is correct, and the agentic future they describe is real. Database branching is the missing primitive for evolutionary development. Cost floors do break the economics. Storage and compute do need to scale independently. Credit where due.

2. Lakebase is competent execution of an idea Oracle Multitenant solved in 2013. Neon is good engineering. Lakebase is Neon plus a brand and a UX layer. That's fine — but it isn't "third generation." It's a four-year-old Postgres-branching architecture, recently acquired and rebranded.

3. The architecture-correct version exists today. Full ACID. Up to 4098 branches per CDB. Vector, graph, JSON, spatial, full-text — single engine, single transaction, single query plan optimized by 40 years of cost-based optimizer development. Transparent Application Continuity replays in-flight transactions across failover. The two-connection security model keeps agents out of CDB$ROOT by construction.

The only real gap was developer experience. Patrick's pdb-branch closes it. Today. A Python client, a PL/SQL package, three control tables, and a sane API. Branch, score, promote, reap.

Stop reinventing 2013. Build the wrapper. Ship.

Third-generation database architecture? We're on our fifth.

— Rick & Patrick

Citations

Databricks (primary subject):

"How agentic software development will change databases" — https://www.databricks.com/blog/how-agentic-software-development-will-change-databases
"A New Era of Databases: Lakebase" (June 12, 2025) — https://www.databricks.com/blog/what-is-a-lakebase
"How lakebase architecture delivers 5x faster Postgres writes" — https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
"Database Branching in Postgres: Git-Style Workflows" — https://www.databricks.com/blog/database-branching-postgres-git-style-workflows-databricks-lakebase
"Databricks Agrees to Acquire Neon" press release (May 14, 2025) — https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems

Oracle Database documentation:

12c Multitenant Concepts — https://docs.oracle.com/database/121/CNCPT/cdbovrvw.htm
19c CREATE PLUGGABLE DATABASE — https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html
19c Cloning a PDB — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/cloning-a-pdb.html
19c Administering a PDB Snapshot Carousel — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/administering-pdb-snapshots.html
19c MAX_PDBS reference — https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html
21c V$PDBS reference — https://docs.oracle.com/en/database/oracle/oracle-database/21/refrn/V-PDBS.html
26c ALTER PLUGGABLE DATABASE — https://docs.oracle.com/en/database/oracle/oracle-database/26/sqlrf/ALTER-PLUGGABLE-DATABASE.html
Resource Manager for PDBs (19c) — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/using-oracle-resource-manager-for-pdbs-with-sql-plus.html
ADB Compute Models (ECPU/OCPU) — https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/autonomous-compute-models.html
ADB Auto-Scale 3× — https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/autonomous-auto-scale.html
PDB Snapshots on Exadata Exascale (23ai+) — https://docs.oracle.com/en/learn/exadb-xs-pdb-snapshot/index.html

Historical context:

Dremel 2020 retrospective (VLDB) — https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf
Aurora 10-year retrospective — https://aws.amazon.com/blogs/aws/celebrating-10-years-of-amazon-aurora-innovation/
Aurora cloning hard limits — https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html
Snowflake architecture — https://docs.snowflake.com/en/user-guide/intro-key-concepts
Oracle 18c PDB Snapshot Carousel introduction — https://oracle-base.com/articles/18c/multitenant-pdb-snapshot-carousel-18c

Neon / Postgres branching:

Neon architecture overview — https://neon.com/docs/introduction/architecture-overview
Neon branching docs — https://neon.com/docs/introduction/branching
TechTarget on Databricks/Neon acquisition — https://www.techtarget.com/searchdatamanagement/news/366623864/Databricks-adds-Postgres-database-with-1B-Neon-acquisition

Companion repository:

pmeredit/pdb-branch — https://github.com/pmeredit/pdb-branch

Oracle SQL/JSON: The Developer's Guide to Querying JSON Like a Pro

Rick Houlihan — Wed, 15 Apr 2026 14:12:57 +0000

SQL and JSON aren't enemies. They never were.

For a decade, the industry sold you a false choice: pick the rigidity of relational tables or the flexibility of JSON documents. Build your app on SQL or build it on a document store. Structure or freedom. Choose.

It was never a real tradeoff. It was a failure of implementation masquerading as a law of nature.

Oracle's SQL/JSON support — built into the database engine since 12c and dramatically expanded through 19c, 21c, and 26ai — proves the point. You get the full expressive power of SQL (joins, aggregations, window functions, a 40-year-old cost-based optimizer) and the full flexibility of JSON (nested documents, arrays, schema-optional structures) in the same query, the same transaction, the same execution plan.

This article is a practical developer's guide. We'll start simple and build to sophisticated. By the end, you'll understand CTEs, every major SQL/JSON function, and how they compose into queries that would require five different databases in a polyglot architecture.

Run It Yourself

Every SQL example below is executed end-to-end against a live Oracle 26ai database by the article's companion validator: rhoulihan/json-sql-guide.

The validator extracts every fenced SQL block, runs it against Oracle 26ai Free in a Docker container, and produces an annotated copy of this article with a green ✓ or red ✗ badge after every snippet. If a Pull Request to this article ever introduces a broken example, CI catches it before merge.

If you want to follow along in your own database — or contribute a fix — clone the repo, boot Oracle 26ai Free, and run validator run against this markdown file. The README walks through it. Every sample table, every seed row, every CREATE INDEX in this article comes from there.

Let's build.

The Sample Data

Everything in this article runs against a single table. Simple enough to follow, rich enough to be real. The companion repository (rhoulihan/json-sql-guide) provides this schema and sample data — make setup && validator run will create it for you. The schema is:

orders (
  id         NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  order_doc  JSON
)

Notice the column type: JSON. Not VARCHAR2. Not CLOB. The native JSON data type — introduced in 21c — stores documents in Oracle's OSON binary format. Hash-indexed field navigation. O(1) access to any field at any depth. This matters more than most developers realize, and we'll come back to why.

Documents in order_doc follow this shape:

{
  "orderId":   1001,
  "customer":  {"name": "Acme Corp", "tier": "platinum",
                "address": {"city": "Belmont", "state": "CA", "zip": "94002"}
               },
  "status":    "shipped",
  "priority":  "high",
  "orderDate": "2025-03-15",
  "shipping":  {"method": "express",
                "address": {"city": "Austin", "state": "TX", "zip": "78701"}
               },
  "items":     [{"product": "Widget Pro",  "quantity": 10, "unitPrice": 29.99, "category": "hardware"},
                {"product": "Gadget Plus", "quantity": 5,  "unitPrice": 49.99, "category": "electronics"},
                {"product": "Cable Kit",   "quantity": 50, "unitPrice": 4.99,  "category": "accessories"}],
  "tags":      ["wholesale", "priority", "Q1-promo"]
}

The validator loads ten such documents — different customers, statuses, item arrays, and tag combinations — so every filter and projection example below returns meaningful results. To follow along in your own database, clone the companion repo, boot Oracle 26ai Free, and run validator run against this article.

1. Dot-Notation: The Easy On-Ramp

If you've never queried JSON in Oracle, start here. Dot-notation gives you direct field access using the syntax you'd expect from any programming language:

SELECT o.order_doc.customer,
       o.order_doc.status,
       o.order_doc.shipping.address.city
FROM   orders o;

That's it. Table alias, JSON column, dot-separated field path. Oracle navigates the OSON binary structure, hashes the field names, jumps directly to the offsets. No parsing. No scanning.

Types are preserved. When the column is declared as JSON (the native OSON type), Oracle knows the underlying types of your values. A number stored as a JSON number comes back as a number. A string comes back as a string. You can use dot-notation results directly in comparisons, arithmetic, and predicates — Oracle handles the type mapping:

-- Types flow naturally — no casting required
SELECT o.order_doc.customer       AS customer_name,
       o.order_doc.orderId        AS order_id,
       o.order_doc.status         AS status
FROM   orders o
WHERE  o.order_doc.orderId > 1000;   -- numeric comparison works

Item methods let you coerce types when you need explicit control — converting to a specific SQL type or applying a transformation:

SELECT o.order_doc.customer.string()       AS customer_name,
       o.order_doc.orderId.number()         AS order_id,
       o.order_doc.orderDate.date()         AS order_date,
       o.order_doc.items.size()             AS item_count
FROM   orders o
WHERE  o.order_doc.priority.string() = 'high';

The distinction matters: you don't have to call .string() to use a string value as a string. But item methods give you precision when you need it — .date() parses a JSON string into a SQL DATE you can use in date arithmetic, .number() ensures numeric precision for financial calculations, and methods like .upper(), .size(), and .sum() apply transformations inline without a separate function call.

Available item methods include: .string(), .number(), .date(), .timestamp(), .boolean(), .double(), .length(), .upper(), .lower(), .size(), .type(), .abs(), .ceiling(), .floor(), .count(), .sum(), .avg(), .minNumber(), .maxNumber().

That's not a thin wrapper around JSON. That's a full type conversion system built into the path expression itself.

Array access works how you'd expect:

-- First item
SELECT o.order_doc.items[0].product AS first_product
FROM   orders o;

-- All items (returns a JSON array)
SELECT o.order_doc.items[*].product AS all_products
FROM   orders o;

-- Last item — dot-notation doesn't recognize [last], so reach for
-- JSON_VALUE or JSON_QUERY when you need it
SELECT JSON_VALUE(o.order_doc, '$.items[last].product') AS last_product
FROM   orders o;

Dot-notation is perfect for quick queries, dashboards, and ad-hoc exploration. When you need more control — type safety, error handling, complex path expressions — you reach for the SQL/JSON functions.

2. JSON_VALUE: Surgical Scalar Extraction

JSON_VALUE extracts a single scalar value from a JSON document and returns it as a SQL type. Think of it as dot-notation's more disciplined sibling.

SELECT JSON_VALUE(o.order_doc, '$.customer.name')                     AS customer,
       JSON_VALUE(o.order_doc, '$.orderId' RETURNING NUMBER)          AS order_id,
       JSON_VALUE(o.order_doc, '$.shipping.address.zip')              AS zip_code
FROM   orders o;

Why use JSON_VALUE when dot-notation exists?

Dot-notation gives you type preservation and coercion via item methods. JSON_VALUE adds two things dot-notation can't do: error handling and SQL-standard portability (it's part of the ISO SQL/JSON spec, so your queries translate across databases that support the standard).

Error handling saves you at runtime:

-- NULL ON ERROR (default): missing path returns NULL silently
SELECT JSON_VALUE(o.order_doc, '$.nonexistent' NULL ON ERROR) AS nullable_value
FROM   orders o
FETCH  FIRST 1 ROW ONLY;

-- ERROR ON ERROR: missing path raises ORA-40462 (shown as reference;
-- uncomment to see the error in action)
-- SELECT JSON_VALUE(o.order_doc, '$.nonexistent' ERROR ON ERROR)
-- FROM   orders o;

-- DEFAULT ... ON ERROR: missing path returns your fallback
-- (RETURNING comes before DEFAULT; order matters)
SELECT JSON_VALUE(o.order_doc, '$.discount'
                  RETURNING NUMBER DEFAULT 0 ON ERROR) AS discount
FROM   orders o
FETCH  FIRST 1 ROW ONLY;

The same three options — NULL, ERROR, DEFAULT — work for ON EMPTY (path exists but matches nothing). During development, use ERROR ON ERROR to catch path mistakes early. In production, choose the behavior that matches your business logic.

Pro tip (26ai+): Instead of adding error clauses to every function, set the session-level default and clear it when you're done:

ALTER SESSION SET JSON_BEHAVIOR = 'ON_ERROR:ERROR';
-- ... debug session ...
ALTER SESSION SET JSON_BEHAVIOR = 'ON_ERROR:NULL';

This applies to all SQL/JSON functions in the session — JSON_VALUE, JSON_QUERY, JSON_TABLE, and JSON_EXISTS all inherit the error default. Any function that doesn't have an explicit ON ERROR clause now raises errors instead of silently returning NULL. The second statement above resets back to the permissive default; in production the most common pattern is to leave the session at ON_ERROR:NULL and add explicit ERROR ON ERROR only on the specific paths where you want strict validation.

Indexability is where both JSON_VALUE and dot-notation earn their keep in production. You can create functional B-tree indexes on JSON fields using either syntax:

-- Functional index using JSON_VALUE
CREATE INDEX idx_order_customer ON orders (
  JSON_VALUE(order_doc, '$.customer.name' RETURNING VARCHAR2(100) ERROR ON ERROR)
);

-- Equivalent index using dot-notation
CREATE INDEX idx_order_status ON orders o (
  o.order_doc.status.string()
);

The type-matching gotcha: The RETURNING type in your index definition and your query must match exactly. If you create the index with RETURNING NUMBER but your query omits the RETURNING clause (which defaults to VARCHAR2), the CBO won't recognize them as the same expression — and your index sits unused. This is the single most common SQL/JSON performance mistake:

-- Index defined with RETURNING NUMBER
CREATE INDEX idx_order_id ON orders (
  JSON_VALUE(order_doc, '$.orderId' RETURNING NUMBER ERROR ON ERROR)
);

-- This query USES the index (types match)
SELECT * FROM orders
WHERE  JSON_VALUE(order_doc, '$.orderId' RETURNING NUMBER) = 1001;

-- This query IGNORES the index — JSON_VALUE defaults to VARCHAR2, not NUMBER
SELECT * FROM orders
WHERE  JSON_VALUE(order_doc, '$.orderId') = 1001;

Same field. Same data. Completely different execution plans. Check your EXPLAIN PLAN — if you see a full table scan where you expected an index range scan, type mismatch is the first thing to investigate.

You can also create composite indexes across multiple JSON fields, just like relational composite indexes:

CREATE INDEX idx_order_cust_date ON orders (
  JSON_VALUE(order_doc, '$.customer.name' RETURNING VARCHAR2(100)),
  JSON_VALUE(order_doc, '$.orderDate' RETURNING DATE)
);

This powers queries that filter on customer and sort by date — a common API pattern — without touching the table at all.

3. Extended Types: What OSON Actually Stores

Here's the thing that surprises developers: OSON stores more than the JSON spec defines.

The JSON standard has six types — string, number, boolean, null, object, array. That's the entire type system. Dates are strings. Timestamps are strings. High-precision decimals are floats (with all the IEEE 754 rounding baggage that implies). Binary data? Base64-encoded strings. Vectors? Not a thing.

OSON is a superset. It stores Oracle's native scalar types directly in the document — as themselves, not as stringified approximations. Write a DATE, store a DATE, read a DATE. No parsing on the way in or out.

This is why dot-notation has so many item methods and JSON_VALUE has so many RETURNING options. They're not type conversions — they're accessors for types that are already there.

The Extended Type Table

OSON Type	Dot-notation	JSON_VALUE RETURNING	When to use
`VARCHAR2` / text	`.string()`	`RETURNING VARCHAR2(n)`	Text, identifiers, categorical data
`NUMBER`	`.number()`	`RETURNING NUMBER`	Money, quantities, anything where rounding matters
`BINARY_DOUBLE`	`.double()`	`RETURNING BINARY_DOUBLE`	IEEE 754 double — scientific math
`BINARY_FLOAT`	`.float()`	`RETURNING BINARY_FLOAT`	Single-precision float
`BOOLEAN`	`.boolean()`	`RETURNING BOOLEAN` (23ai+)	True/false flags
`DATE`	`.date()`	`RETURNING DATE`	Second-precision date+time (see note below)
`TIMESTAMP`	`.timestamp()`	`RETURNING TIMESTAMP`	Sub-second precision, no timezone
`TIMESTAMP WITH TIME ZONE`	`.timestamp()`	`RETURNING TIMESTAMP WITH TIME ZONE`	Zone-aware moments in time (see note below)
`INTERVAL YEAR TO MONTH`	—	`RETURNING INTERVAL YEAR TO MONTH`	Month/year durations
`INTERVAL DAY TO SECOND`	—	`RETURNING INTERVAL DAY TO SECOND`	Sub-second durations
`RAW`	`.binary()`	`RETURNING RAW`	Hashes, fingerprints, binary payloads
`VECTOR` (26ai)	—	`RETURNING VECTOR`	Embeddings for similarity search

A note on DATE and temporal types. Oracle's DATE despite the name actually stores date + time down to the second — year, month, day, hour, minute, second. There's no "date without time" primitive in Oracle. If you want day-level matching (e.g., "all orders placed on April 11" regardless of the time of day), you'll want to TRUNC() the value or use a range predicate like WHERE dt >= DATE '2026-04-11' AND dt < DATE '2026-04-12'. TIMESTAMP WITH TIME ZONE has the opposite subtlety — the same moment in time can have multiple representations (2026-04-11 14:30 UTC == 2026-04-11 10:30 EDT), so equality comparisons need to either normalize to UTC or use the right comparison operator. These aren't OSON quirks — they're inherited from Oracle's core type system. The important thing is that OSON stores them as those types, with full fidelity, so you get the same semantics as a relational column of the same type.

Why You Care

Dates are the feature developers underestimate until trouble tickets start piling up. In a text-based JSON store, your orderDate is a string — and that string can be anything. Someone stores "2026-04-11", someone else stores "2026-04-11T14:30:00Z", a third person stores "04/11/2026" because their frontend didn't normalize. Your sort order is now a lottery. Your range filter silently skips rows. Your index is polluted with mismatched formats that the database has no way to validate.

So you push validation into application code. Every write path has to sanitize the date format before it hits the database. Every integration has to agree on the convention. Every bug report starts with "wait, what format is this field actually in?"

In OSON, your orderDate is a DATE. Period. The database enforces it at insert time — anything that isn't a valid date gets rejected before it pollutes storage. Queries just work:

-- Native date comparison. No parse. No cast. Index still works.
SELECT o.order_doc.customer
FROM   orders o
WHERE  o.order_doc.orderDate.date() > SYSDATE - 7;

The same logic applies to every extended type:

NUMBER — decimal precision IEEE 754 can't represent. Critical for money.
TIMESTAMP WITH TIME ZONE — scheduling across regions without reinventing timezone math.
VECTOR — similarity search in the same document as your operational data. No sidecar. No sync lag.
RAW — binary hashes without base64 inflation.

You stop writing parsing code. The database handles it.

Getting Typed Values Into OSON

Three paths, depending on where the data comes from:

1. SQL/PL/SQL construction — automatic. JSON_OBJECT and JSON_ARRAY preserve native types:

INSERT INTO events (event_doc) VALUES (JSON_OBJECT(
  'name'      VALUE 'Product Launch',
  'scheduled' VALUE TIMESTAMP '2026-06-15 09:30:00 America/New_York',
  'price'     VALUE 299.99,     -- stored as NUMBER
  'confirmed' VALUE TRUE        -- stored as BOOLEAN
));

2. JSON_SCALAR — explicit. Force a specific SQL type into a JSON scalar:

SELECT JSON_OBJECT('expiresAt' VALUE JSON_SCALAR(SYSTIMESTAMP + INTERVAL '30' DAY))
FROM DUAL;

3. Extended JSON syntax — the text path. When JSON arrives as a string from an API, use type markers with the EXTENDED keyword:

INSERT INTO events (event_doc) VALUES (JSON('{
  "name": "Product Launch",
  "scheduled": {"$oracleTimestampTZ": "2026-06-15T09:30:00-04:00"}
}' EXTENDED));

Oracle recognizes $oracleDate, $oracleTimestamp, $oracleTimestampTZ, $oracleBinary and stores them as native types.

The Payoff

OSON makes JSON a structured document with first-class types — the type system in your database matches the type system in your application. No translation layer. No parsing code. No rounding bugs you find in production. Dates stay dates, numbers stay numbers, vectors stay vectors, all the way from storage to your response body.

4. JSON_QUERY: When You Need the Whole Object

JSON_VALUE returns scalars. JSON_QUERY returns JSON fragments — objects, arrays, or multiple values.

-- Extract the shipping object (PRETTY needs a textual RETURNING type)
SELECT JSON_QUERY(o.order_doc, '$.shipping'
                  RETURNING VARCHAR2(4000) PRETTY) AS shipping_info
FROM   orders o;

-- Extract the items array (returns JSON by default)
SELECT JSON_QUERY(o.order_doc, '$.items') AS all_items
FROM   orders o;

-- Extract all product names (multiple values → need a wrapper)
SELECT JSON_QUERY(o.order_doc, '$.items[*].product' WITH WRAPPER) AS product_names
FROM   orders o;
-- Result: ["Widget Pro","Gadget Plus","Cable Kit"]

The Wrapper Clause

This is the part that confuses people. Here's the rule:

What the path returns	WITHOUT WRAPPER (default)	WITH WRAPPER	WITH CONDITIONAL WRAPPER
Single object/array	Returns as-is	Wraps in array	Returns as-is
Single scalar	Error/NULL	Wraps in array `[42]`	Wraps in array `[42]`
Multiple values	Error/NULL	Wraps in array `[1,2,3]`	Wraps in array `[1,2,3]`

WITH CONDITIONAL WRAPPER is the pragmatic choice — it wraps only when the result isn't already a single JSON value. Use it when you're not sure whether a path will return one thing or many.

Error handling in JSON_QUERY offers additional options beyond the NULL/ERROR pattern:

-- Return empty array when path doesn't match
SELECT JSON_QUERY(o.order_doc, '$.reviews' EMPTY ARRAY ON ERROR) FROM orders o;
-- Result: []

-- Return empty object
SELECT JSON_QUERY(o.order_doc, '$.metadata' EMPTY OBJECT ON ERROR) FROM orders o;
-- Result: {}

These are Oracle-specific. No other database gives you EMPTY ARRAY ON ERROR. Sounds small — until you're building a JSON API response and don't want null checks littering your application code.

5. JSON_EXISTS: Filtering with Path Predicates

JSON_EXISTS isn't a function — it's a SQL condition. It returns TRUE or FALSE. Use it in WHERE clauses to filter rows based on JSON content.

-- Orders that have a shipping address
SELECT o.order_doc.customer.string()
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc, '$.shipping.address');

-- Orders with at least one item over $25
SELECT o.order_doc.customer.string()
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc, '$.items[*]?(@.unitPrice > 25)');

That ?(@.unitPrice > 25) is a filter expression. The @ symbol refers to the current element being evaluated by the filter — which for $.items[*] means each item in the array, one at a time. You can combine multiple conditions:

-- Items that are expensive AND high quantity
SELECT o.order_doc.orderId
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc,
         '$.items[*]?(@.unitPrice > 25 && @.quantity >= 5)');

-- Orders with specific tags
SELECT o.order_doc.orderId
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc,
         '$.tags[*]?(@.string() == "wholesale")');

-- Nested existence checks
SELECT o.order_doc.orderId
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc,
         '$?(@.status == "shipped"
             && exists(@.items[*]?(@.category == "electronics")))');

Filters aren't just for arrays. A filter is a predicate that applies to whatever path step it's attached to — the step doesn't have to return an array. For example:

-- Return the customer name only when the customer lives in Belmont
SELECT JSON_VALUE(o.order_doc,
                  '$.customer?(@.address.city == "Belmont").name'
                  NULL ON ERROR) AS belmont_customer
FROM   orders o;

There's no array in that path. $.customer is a single object, and the filter either passes (returning the name) or fails (returning NULL). When used with JSON_VALUE, non-matching rows come back as NULL — which is usually exactly what you want for "give me the name only if..." queries.

Path expressions transparently process intermediate arrays. This is a subtlety that catches people off guard, especially if you're coming from a language where arrays require explicit iteration. In Oracle's path expression engine, $.a.b doesn't just navigate from field a to field b — if a happens to be an array, the engine automatically iterates through every element and evaluates .b on each one. You don't need $.a[*].b — the [*] is implicit.

This matters most with JSON_EXISTS:

-- Works even if 'address' is an array of objects
SELECT o.order_doc.orderId
FROM   orders o
WHERE  JSON_EXISTS(o.order_doc, '$.shipping.address?(@.city == "Boston")');

If addresses is a single object, the filter tests that object. If addresses is an array, the filter tests each element — and returns true if any element matches. Same path expression, both shapes. You don't have to know (or care) whether the field is a scalar, an object, or an array when you write the filter. The engine handles the polymorphism.

This is particularly useful for schemas that evolve over time — a field that started as a single value and later became an array doesn't break your existing queries.

Filter-then-project is a common pattern. You can chain a filter with a projection to return a subset of fields from matching elements:

-- Get the SKUs of line items that cost more than $25
SELECT JSON_QUERY(o.order_doc,
  '$.items[*]?(@.unitPrice > 25).product' WITH WRAPPER) AS expensive_skus
FROM orders o;
-- ["Widget Pro","Gadget Plus"]

Notice the WITH WRAPPER clause. When a filter-then-project returns multiple values — like the SKUs above — you need JSON_QUERY to re-wrap them into a JSON array. JSON_VALUE won't work here because it only returns single scalars. This is the most common reason developers reach for JSON_QUERY over JSON_VALUE in real applications.

The PASSING Clause: Bind Variables in Path Expressions

This is a feature most developers don't know exists — and it changes how you write dynamic JSON filters:

WHERE JSON_EXISTS(o.order_doc,
  '$.items[*]?(@.category == $cat && @.unitPrice < $max)'
  PASSING 'electronics' AS "cat", 50 AS "max")

No string concatenation. No SQL injection risk. Bind variables plugged directly into the JSON path expression.

This matters more than most developers realize. In document databases, dynamic filters typically mean building query strings in application code — concatenating user input into $match pipelines or find() predicates. Every concatenated string is a potential injection vector. Every dynamically constructed query is a unique query shape the engine has to parse and plan from scratch.

The PASSING clause eliminates both problems at once. The CBO parses the path expression once, builds an execution plan once, and reuses it across every parameter combination. Ten thousand different category/price filter combinations hit the same optimized plan. That's not just safer — it's measurably faster. Hard-parsed queries burn CPU. Soft-parsed queries with bind variables don't. At scale, that difference shows up on your cloud bill.

The Full Filter Predicate Toolkit

Oracle's path expressions support far more than basic comparisons:

Operator	Example
`==`, `!=`, `<`, `<=`, `>`, `>=`	`?(@.price > 100)`
`&&` (AND), `\	\
{% raw %}`exists()`	`$?exists(@.items[*]?(@.flagged == true))`
`in()`	`?(@.category in ("electronics","tools"))`
`has substring`	`?(@.name has substring "Pro")`
`starts with`	`?(@.sku starts with "WDG-")`
`like_regex`	`?(@.email like_regex ".*@oracle\\.com")`

PostgreSQL users: this is roughly equivalent to jsonb_path_exists, but richer.

MongoDB users: this is your $elemMatch and $regex — except it runs inside the SQL optimizer with access to indexes, join reordering, and cost-based plan selection.

6. JSON_TABLE: The Bridge Between Worlds

If you learn one SQL/JSON function from this article, make it JSON_TABLE. It projects JSON data into relational rows and columns — the bridge between document flexibility and relational power.

SELECT jt.*
FROM   orders o,
       JSON_TABLE(o.order_doc, '$.items[*]'
         COLUMNS (
           row_num    FOR ORDINALITY,
           product    VARCHAR2(100)  PATH '$.product',
           quantity   NUMBER         PATH '$.quantity',
           unit_price NUMBER         PATH '$.unitPrice',
           category   VARCHAR2(50)   PATH '$.category'
         )
       ) jt;

Result:

ROW_NUM	PRODUCT	QUANTITY	UNIT_PRICE	CATEGORY
1	Widget Pro	10	29.99	hardware
2	Gadget Plus	5	49.99	electronics
3	Cable Kit	50	4.99	accessories

Each array element becomes a row. Each JSON field becomes a column. The FOR ORDINALITY column auto-generates row numbers. And now you can do everything SQL does: GROUP BY, ORDER BY, SUM(), AVG(), JOIN to other tables, window functions — the full relational toolkit applied to JSON data.

Column Types

JSON_TABLE supports four column types:

SELECT jt.*
FROM   orders o,
       JSON_TABLE(o.order_doc, '$' COLUMNS (
         -- Value column (like JSON_VALUE): extracts a scalar
         customer    VARCHAR2(100) PATH '$.customer.name',

         -- Query column (like JSON_QUERY): extracts JSON fragment
         shipping    VARCHAR2(4000) FORMAT JSON PATH '$.shipping',

         -- Exists column (like JSON_EXISTS): returns 'true'/'false'
         has_priority VARCHAR2(5) EXISTS PATH '$.priority',

         -- Ordinality column: auto-generated row number
         row_num     FOR ORDINALITY
       )) jt
FETCH FIRST 5 ROWS ONLY;

NESTED PATH: Hierarchical Flattening

This is where JSON_TABLE gets powerful. Real-world JSON is nested. Orders contain items. Items contain variants. You need to flatten multiple levels without writing self-joins.

SELECT jt.*
FROM   orders o,
       JSON_TABLE(o.order_doc, '$'
         COLUMNS (
           order_id    NUMBER        PATH '$.orderId',
           customer    VARCHAR2(100) PATH '$.customer.name',
           NESTED PATH '$.items[*]' COLUMNS (
             item_num   FOR ORDINALITY,
             product    VARCHAR2(100) PATH '$.product',
             quantity   NUMBER        PATH '$.quantity',
             unit_price NUMBER        PATH '$.unitPrice'
           )
         )
       ) jt;

Here's what that produces from our sample order document:

ORDER_ID	CUSTOMER	ITEM_NUM	PRODUCT	QUANTITY	UNIT_PRICE
1001	Acme Corp	1	Widget Pro	10	29.99
1001	Acme Corp	2	Gadget Plus	5	49.99
1001	Acme Corp	3	Cable Kit	50	4.99

One JSON document became three relational rows. The parent fields (order_id, customer) repeat for each array element. The ITEM_NUM ordinality column tracks position within the array. This is the document-to-relational bridge in action — and from here, you can GROUP BY category, JOIN to a products table, run window functions, or anything else SQL gives you.

The NESTED PATH clause creates an implicit lateral join — Oracle automatically cross-applies the nested array to the parent fields. Parent rows are preserved even when the array is empty (left outer join semantics). Sibling NESTED PATHs at the same level produce a union join, not a Cartesian product — Oracle knows you don't want row explosion.

You can nest multiple levels deep, and you can have sibling NESTED PATHs at the same level. This is where the join semantics get interesting. Consider a document with both items and tags:

SELECT jt.*
FROM   orders o,
       JSON_TABLE(o.order_doc, '$'
         COLUMNS (
           order_id    NUMBER        PATH '$.orderId',
           NESTED PATH '$.items[*]' COLUMNS (
             product    VARCHAR2(100) PATH '$.product',
             quantity   NUMBER        PATH '$.quantity'
           ),
           NESTED PATH '$.tags[*]' COLUMNS (
             tag        VARCHAR2(50)  PATH '$'
           )
         )
       ) jt;

ORDER_ID	PRODUCT	QUANTITY	TAG
1001	Widget Pro	10
1001	Gadget Plus	5
1001	Cable Kit	50
1001			wholesale
1001			priority
1001			Q1-promo

Sibling NESTED PATHs produce a union join, not a Cartesian product. You get 3 item rows + 3 tag rows = 6 total rows, not 3 × 3 = 9. Oracle fills the non-matching columns with NULLs. This is critical — without union join semantics, sibling arrays would cause row explosion that scales multiplicatively with array size.

Compare with MongoDB's $unwind. MongoDB's equivalent operation is the aggregation pipeline's $unwind stage, which flattens one array at a time. To flatten both items and tags, you'd write:

db.orders.aggregate([
  { $unwind: "$items" },
  { $unwind: "$tags" }
])

This produces a Cartesian product — |items| × |tags| rows per document. For our three-item, three-tag example, that's 9 rows per order, not 6. Now imagine a document with 50 line items and 20 tags: MongoDB gives you 1,000 rows. Oracle gives you 70. The ratio gets worse as the arrays grow.

You can work around this in MongoDB with $facet (which runs the two unwinds as independent sub-pipelines and merges the results), but now you're hand-engineering join semantics that Oracle handles automatically as part of the core JSON_TABLE operator. One more case where the developer is the optimizer.

You can also nest paths within paths for multi-level hierarchies:

SELECT jt.*
FROM   orders o,
       JSON_TABLE(o.order_doc, '$'
         COLUMNS (
           order_id NUMBER PATH '$.orderId',
           NESTED PATH '$.items[*]' COLUMNS (
             product   VARCHAR2(100) PATH '$.product',
             quantity  NUMBER        PATH '$.quantity',
             item_ord  FOR ORDINALITY
           )
         )) jt
FETCH FIRST 5 ROWS ONLY;

Why JSON_TABLE Matters

Here's the thing most developers miss: JSON_TABLE doesn't just make JSON queryable. It makes JSON optimizable.

Once JSON is projected into relational columns, the CBO treats it like any other relational data. It can push predicates through the projection. It can use functional indexes. It can choose between nested loops, hash joins, and merge joins. It can parallelize.

The document database world doesn't have this. MongoDB's aggregation pipeline is a sequential chain of stages — each stage processes the full result set of the previous stage. There's no cost-based optimizer reordering stages, pushing filters earlier, or choosing between join algorithms.

What does that actually mean for you as a developer? It means you stop writing optimization logic in your application code.

In a document database, you learn — usually the hard way — that the order you filter, sort, and aggregate matters. You hand-tune your aggregation pipeline stages. You restructure queries because putting $match before $unwind is faster than after. You read blog posts about which pipeline stages can use indexes and which can't. You become the optimizer. That's not your job.

With JSON_TABLE, you describe what you want and the database figures out how to get it. You write WHERE category = 'electronics' on the outer query, and the CBO decides — without you asking — to push that filter down into the JSON scan, hit the functional index, and skip 99% of the documents. You join the flattened JSON to a products table, and the CBO picks the join algorithm (nested loops for small results, hash join for large ones) based on actual table statistics, not your guess about data volume.

Add a column to the SELECT? The plan adapts. Data distribution changes as your table grows from 10K to 10M rows? The plan adapts. You didn't change a line of code. You write declarative SQL. The database does the engineering.

Oracle's CBO has had 40+ years of development. When you use JSON_TABLE, you're handing your JSON to that optimizer. That's not a small thing — it's the difference between writing query logic and writing query infrastructure.

7. What's a CTE and Why Should You Care?

A Common Table Expression (CTE) is a named, temporary result set defined in a WITH clause that exists for the duration of a single SQL statement. If you've never used one, think of it as a named subquery that you can reference multiple times — and that makes complex SQL readable.

WITH high_value_items AS (
  SELECT jt.*
  FROM   orders o,
         JSON_TABLE(o.order_doc, '$.items[*]'
           COLUMNS (
             product    VARCHAR2(100)  PATH '$.product',
             quantity   NUMBER         PATH '$.quantity',
             unit_price NUMBER         PATH '$.unitPrice'
           )
         ) jt
  WHERE  jt.unit_price > 10
)
SELECT product, quantity * unit_price AS line_total
FROM   high_value_items
ORDER  BY line_total DESC;

That's a CTE. The WITH clause defines high_value_items. The main query uses it like a table. The query reads top-to-bottom, each step building on the last.

Why CTEs Matter for JSON Work

JSON transformations are inherently multi-step: flatten the document, filter and enrich, then re-assemble into a new shape. Without CTEs, you're nesting subqueries five levels deep. With CTEs, each step is a named, testable, readable block.

WITH
  -- Step 1: Flatten JSON into rows
  raw_items AS (
    SELECT jt.product, jt.quantity, jt.unit_price, jt.category
    FROM   orders o,
           JSON_TABLE(o.order_doc, '$.items[*]'
             COLUMNS (
               product    VARCHAR2(100) PATH '$.product',
               quantity   NUMBER        PATH '$.quantity',
               unit_price NUMBER        PATH '$.unitPrice',
               category   VARCHAR2(50)  PATH '$.category'
             )) jt
  ),
  -- Step 2: Join with relational product catalog
  enriched_items AS (
    SELECT ri.*, p.supplier
    FROM   raw_items ri
    LEFT JOIN products p ON p.product_name = ri.product
  ),
  -- Step 3: Aggregate by category
  category_totals AS (
    SELECT category, SUM(quantity * unit_price) AS revenue
    FROM   enriched_items
    GROUP  BY category
  )
-- Step 4: Build the JSON response
SELECT JSON_OBJECT('category' VALUE category, 'revenue' VALUE revenue)
FROM   category_totals;

Four steps. Each one readable in isolation. Each one debuggable by replacing the final SELECT with SELECT * FROM step_N.

Multiple CTEs and Chaining

CTEs can reference earlier CTEs in the same WITH clause:

WITH
  step1 AS (SELECT id, JSON_VALUE(order_doc, '$.customer.name') AS customer FROM orders),
  step2 AS (SELECT id, customer FROM step1 WHERE customer LIKE 'A%'),
  step3 AS (SELECT s2.id, s2.customer, p.supplier
            FROM step2 s2 JOIN products p ON p.supplier LIKE '%Acme%')
SELECT DISTINCT customer, supplier FROM step3;

CTEs vs Aggregation Pipelines: A Mental Model

If you're coming from MongoDB, you already understand multi-step data transformations — that's what the aggregation pipeline is. CTEs are the same idea, with two critical differences: named addressability and optimizer-driven execution order.

A MongoDB pipeline is an array of stages that execute in declaration order. Modern pipelines can branch via $facet, $lookup, and $unionWith, so technically the pipeline is a DAG too — but stages still run in the order you wrote them, with only limited pipelining across non-blocking stages. There's no cost-based reordering. There's no "the optimizer decided it was cheaper to filter before the unwind." If you put $match after $unwind when it should have been before, you pay for it — and so does every row that had to be flattened just to get thrown away a stage later.

And here's a detail that matters more than it looks: $facet branches execute sequentially, not in parallel. Each sub-pipeline runs to completion before the next one starts. You get the shape of a branch — separate result sets from the same input — but not the physics of parallelism. If Branch A takes 200ms and Branch B takes 300ms, you wait 500ms, not 300ms. The branches share an input document set, but the engine processes them one at a time.

CTEs are also a DAG, but with three key differences:

Each step is named and addressable. The final SELECT can reach into any CTE simultaneously — join raw_items to category_totals, filter on enriched, aggregate across all three. You're not locked into the most-recent-stage-feeds-the-next-stage structure.
The CBO decides the execution order. Push a predicate from the outer query into the innermost CTE? Sure. Reorder joins based on row estimates? Yes. Pick a hash join here and a nested loop there? Automatic. You describe the logic; the optimizer decides the physics.
Independent branches can execute in parallel. When the CBO sees that two CTEs have no data dependency on each other, it can run them concurrently. The same two branches that take 500ms sequentially in $facet take 300ms under Oracle's parallel execution — because the optimizer knows they're independent and schedules them accordingly.

The differences compound in practice:

	Aggregation Pipeline	CTEs
Step identity	Anonymous (position in array)	Named (referenced by name)
Data flow	Mostly linear; branches via `$facet`/`$lookup`/`$unionWith`	Full DAG: any step can reference any earlier step
Execution order	Runs in declaration order, no cost-based reordering	CBO reorders based on statistics and predicates
Parallelism	`$facet` branches execute sequentially, not in parallel	CBO parallelizes independent branches automatically
Debugging	Insert `$out` or `explain()` at specific stages	Replace final SELECT with `SELECT * FROM step_N`
Reuse	Can't reference earlier stages from later ones (except via `$facet`)	Any CTE reusable by multiple downstream steps
Optimization	Developer hand-tunes stage order	CBO pushes predicates, picks join methods, parallelizes

The aggregation pipeline forces you to think about how the database should process data — stage order matters, $match belongs before $unwind, put your projections last to minimize intermediate document size. CTEs let you think about what the data should look like at each logical stage, and the CBO figures out the how. That's the core difference: you become the optimizer in one model, and you stop being the optimizer in the other.

Recursive CTEs

For hierarchical data — org charts, category trees, bill-of-materials — Oracle supports recursive CTEs. The structure has two parts: an anchor query that selects the starting rows, and a recursive query that joins back to the CTE itself to walk the hierarchy:

WITH category_tree (id, parent_id, name, lvl) AS (
  -- Anchor: start at root categories (no parent)
  SELECT id, parent_id, name, 1
  FROM   categories
  WHERE  parent_id IS NULL
  UNION ALL
  -- Recursive: for each row found so far, find its children
  SELECT c.id, c.parent_id, c.name, t.lvl + 1
  FROM   category_tree t
  JOIN   categories c ON c.parent_id = t.id
)
SEARCH DEPTH FIRST BY name SET seq
SELECT LPAD(' ', (lvl-1)*2) || name AS category_hierarchy
FROM   category_tree
ORDER  BY seq;

The SEARCH clause controls what order the recursion visits nodes, and generates a seq column you can use for deterministic ORDER BY. Think of it like walking a tree — you have two choices for how to walk it:

SEARCH DEPTH FIRST BY name — go deep before going wide. Visit a node, then immediately visit its children, then their children, all the way to the leaf, before backtracking to visit siblings. This produces the indented tree output you'd expect:

Electronics
  Laptops
    Gaming Laptops
    Ultrabooks
  Phones
    Android
    iOS
Clothing
  Men
  Women

SEARCH BREADTH FIRST BY name — go wide before going deep. Visit all nodes at level 1, then all nodes at level 2, then level 3. This produces a level-by-level listing:

Electronics
Clothing
  Laptops
  Phones
  Men
  Women
    Gaming Laptops
    Ultrabooks
    Android
    iOS

Depth-first is what you want for tree displays, navigation menus, and indented reports. Breadth-first is useful when you care about levels — "show me all second-level categories" or "find everything within two hops of this node."

The BY name part determines the sort order among siblings at each level. Replace name with any column — BY created_date visits the oldest siblings first, BY priority DESC visits the highest priority first.

Cycle detection: Hierarchical data can have bugs — a category that's its own grandparent creates an infinite loop. Oracle catches this with the CYCLE clause on the recursive CTE:

WITH category_tree (id, parent_id, name, lvl) AS (
  SELECT id, parent_id, name, 1
  FROM   categories WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.parent_id, c.name, t.lvl + 1
  FROM   category_tree t
  JOIN   categories c ON c.parent_id = t.id
)
CYCLE id SET is_cycle TO 'Y' DEFAULT 'N'
SELECT id, name, lvl, is_cycle FROM category_tree;

This adds an is_cycle column and stops recursion on any row that would revisit an already-seen id. Without it, a cyclic reference means a runaway query.

Materialization: When Performance Matters

The CBO decides whether to materialize a CTE (store results in a temp table) or inline it (copy the SQL text into the main query). The default behavior:

Referenced once: the CBO inlines it (allows predicate pushing, index usage)
Referenced multiple times: the CBO materializes it (compute once, read many)

You can override this with hints:

WITH expensive_json_parse AS (
  SELECT /*+ MATERIALIZE */
         JSON_VALUE(order_doc, '$.customer.name') AS customer,
         JSON_VALUE(order_doc, '$.orderDate' RETURNING DATE) AS order_date
  FROM   orders
)
SELECT * FROM expensive_json_parse FETCH FIRST 5 ROWS ONLY;

When to materialize: The CTE does expensive work that's reused multiple times, especially when the CBO might otherwise inline and recompute. Compute it once, read it many.

When to inline: The main query has selective predicates that need to push down into the JSON scan to hit functional indexes. Inlining keeps the optimization boundary open.

Here's a case where MATERIALIZE is genuinely the right call — a self-join on an expensively-parsed CTE:

-- Cross-sell analysis: find customer pairs who both bought products
-- in the same category. The CTE is self-joined, so it's referenced twice.
WITH customer_categories AS (
  SELECT /*+ MATERIALIZE */
         JSON_VALUE(o.order_doc, '$.customer.name') AS customer,
         jt.category
  FROM   orders o,
         JSON_TABLE(o.order_doc, '$.items[*]'
           COLUMNS (category VARCHAR2(50) PATH '$.category')) jt
)
SELECT DISTINCT a.customer       AS customer_a,
                b.customer       AS customer_b,
                a.category       AS shared_category
FROM   customer_categories a
JOIN   customer_categories b ON a.category = b.category
                            AND a.customer < b.customer;

Without materialization, the CBO might inline customer_categories into both sides of the self-join — meaning the JSON_TABLE flattening runs twice, parsing every order document twice. With MATERIALIZE, the parse happens once, the result lands in a session-scoped temp table, and both sides of the self-join read from that temp table at memory speed.

For a 10-million-row orders table with 5 items per document on average, that's the difference between 10 million JSON parses and 20 million. Same result. Half the work.

Note on auto-materialization: Oracle automatically materializes CTEs that are referenced multiple times — so in the example above, the CBO often makes this decision on its own. The explicit /*+ MATERIALIZE */ hint does two things: it documents intent (so future readers understand the query was designed with materialization in mind), and it protects against CBO regressions when statistics shift or plans change between database versions. For a CTE that's expensive to compute and referenced multiple times, making the hint explicit is defensive engineering.

The INLINE hint goes the other direction — it tells Oracle to treat the CTE as if you had copy-pasted its SQL into the main query, instead of walling it off in a temp table. Why would you want that? Because materialization creates a boundary the optimizer can't see through. Once results are in a temp table, Oracle can't look at what the outer query is going to do with them — it has to compute the full CTE result, every row, before your WHERE clause even runs.

Here's what that looks like in practice:

WITH recent_orders AS (
  SELECT /*+ INLINE */
         JSON_VALUE(order_doc, '$.customer.name') AS customer,
         JSON_VALUE(order_doc, '$.orderDate' RETURNING DATE) AS order_date
  FROM   orders
)
SELECT * FROM recent_orders WHERE order_date > SYSDATE - 7;

Think of it like this: if the CTE is inlined, Oracle can see that you only want orders from the last 7 days. It looks at your functional index on $.orderDate, jumps straight to those rows, and reads maybe 0.1% of the table.

If the CTE is materialized, Oracle has to compute recent_orders first — which means reading every order, parsing every JSON document, building the full intermediate result — and then filtering to the last 7 days. On a 10-million-row table, that's parsing 10 million documents to return 10,000 rows. The index sits unused because the optimizer never gets a chance to use it.

The INLINE hint removes the wall. It tells Oracle "don't stage this — fold it into the main query so you can see the whole picture and optimize across it." When your outer query has a selective filter that should hit an index, inlining is how you make sure that happens.

Both hints exist for the same reason: sometimes you know the shape of your data better than the optimizer's statistics do. Check your execution plans with EXPLAIN PLAN or DBMS_XPLAN. If you see a full table scan where you expected an index range scan, materialization is often the culprit — and INLINE is the fix. If you see the same expensive CTE appearing multiple times in the plan, that's the opposite problem — and MATERIALIZE is the fix.

Hint only when the execution plan tells you to. Trust the CBO by default.

8. Building JSON: The Construction Functions

Reading JSON is half the story. The other half is constructing JSON from relational data — building API responses, materializing document views, assembling complex payloads.

Important: the default return type is VARCHAR2, not JSON. JSON_OBJECT, JSON_ARRAY, JSON_ARRAYAGG, and JSON_OBJECTAGG all default to RETURNING VARCHAR2(4000) for backward compatibility. That means without an explicit RETURNING JSON clause, you're getting a serialized text string — not the binary OSON representation. This causes two real problems: (1) when you nest one of these inside another, you have to use FORMAT JSON to prevent double-escaping (more on this in a moment), and (2) you lose the type fidelity OSON gives you (numbers stay numbers, dates stay dates, etc.). Always add RETURNING JSON when you mean to produce JSON-typed data, especially in CTEs that feed downstream JSON construction. The examples in this section show both forms — explicit and implicit — so you can see how each behaves. In production code, default to RETURNING JSON and only drop it when you specifically want a serialized text result.

The JSON { } and JSON [ ] shorthand (26ai). Oracle added JSON-like constructor syntax that solves the RETURNING problem by defaulting to RETURNING JSON instead of VARCHAR2. JSON { 'foo' : 'bar' } is equivalent to JSON_OBJECT('foo' : 'bar' RETURNING JSON), and JSON [ 1, 2, 3 ] is equivalent to JSON_ARRAY(1, 2, 3 RETURNING JSON). If you're writing new code on 26ai, prefer the JSON { } form — it's shorter, the return type is correct by default, and you never have to worry about missing RETURNING JSON or FORMAT JSON. The JSON_OBJECT / JSON_ARRAY functions remain available and are the right choice when you need explicit RETURNING VARCHAR2 or FORMAT JSON control.

JSON_OBJECT: Rows to Objects

-- Key-value pairs
SELECT JSON_OBJECT(
  'customer' VALUE o.order_doc.customer.string(),
  'total'    VALUE (
    SELECT SUM(jt.quantity * jt.unit_price)
    FROM   JSON_TABLE(o.order_doc, '$.items[*]'
             COLUMNS (quantity NUMBER PATH '$.quantity',
                      unit_price NUMBER PATH '$.unitPrice')) jt
  )
  RETURNING JSON
) AS order_summary
FROM orders o;
-- {"customer":"Acme Corp","total":799.4}

Column shorthand — the column name becomes the key:

SELECT JSON_OBJECT(employee_id, first_name, last_name, email)
FROM   employees WHERE employee_id = 100;
-- {"employee_id":100,"first_name":"Steven","last_name":"King","email":"SKING"}

Wildcard — all columns at once:

SELECT JSON_OBJECT(*) FROM employees WHERE employee_id = 100;

NULL handling — choose your API contract:

-- Include nulls explicitly (default)
SELECT JSON_OBJECT('name' VALUE 'Alice', 'middle' VALUE NULL NULL ON NULL);
-- {"name":"Alice","middle":null}

-- Omit nulls entirely
SELECT JSON_OBJECT('name' VALUE 'Alice', 'middle' VALUE NULL ABSENT ON NULL);
-- {"name":"Alice"}

JSON_ARRAY: Building Arrays

What's DUAL? Think of it as Oracle's REPL. When you open a Node or Python shell and type 1 + 1, you get 2 back — no variables, no setup, just the expression evaluated. DUAL is how you do that in SQL. It's a built-in one-row table that exists purely as a target for expressions you want to run standalone. SELECT SYSDATE FROM DUAL is the SQL equivalent of typing Date.now() into a JavaScript console. Any time you see FROM DUAL in this article, we're just running an expression to show you the result — treat the code block like a REPL session.

SELECT JSON_ARRAY(1, 'two', 3.0, NULL RETURNING JSON) FROM DUAL;
-- [1,"two",3.0]   (NULL omitted by default — ABSENT ON NULL)

SELECT JSON_ARRAY(1, 'two', 3.0, NULL NULL ON NULL RETURNING JSON) FROM DUAL;
-- [1,"two",3.0,null]

26ai: Build JSON arrays directly from a query. This is the syntax you want when you're building an API response. You have rows in a table. You need them as a JSON array in the response body. In 26ai, you just write that:

SELECT JSON_ARRAY(
  SELECT JSON_OBJECT('product' VALUE product_name, 'price' VALUE list_price)
  FROM   products
  WHERE  category_id = 10
  ORDER  BY list_price DESC
  RETURNING JSON
) AS product_catalog
FROM DUAL;

Read it like you'd read code: "give me a JSON array containing an object per product in category 10, sorted by price." That's exactly what the SQL says. No extra steps.

Before 26ai, that simple intent required extra SQL plumbing — you had to stage the rows into a helper query and pull them out with a different function. For anything beyond trivial API responses, this added real noise to your code. The 26ai version is what you'd write on a whiteboard if somebody asked you how it had to work under the hood.

JSON_ARRAYAGG and JSON_OBJECTAGG: Aggregation into JSON

Here's the mental model: JSON_OBJECT and JSON_ARRAY build JSON from one row at a time. JSON_ARRAYAGG and JSON_OBJECTAGG build JSON from many rows at once. Same as the difference between writing row.toJSON() in your ORM versus calling .map(r => r.toJSON()) over a result set — the singular version shapes one record, the plural version rolls up a collection.

These are the workhorses of API response building. Any time you have a parent-child relationship — customer and their orders, order and its line items, post and its comments — you're going to use JSON_ARRAYAGG. It's the function that collapses "many rows" into "one JSON array" so the structure matches what your frontend or API consumer actually wants.

-- Roll up all items in an order into a ranked JSON array
SELECT JSON_ARRAYAGG(
  JSON_OBJECT('product' VALUE jt.product,
              'total'   VALUE jt.quantity * jt.unit_price
              RETURNING JSON)
  ORDER BY jt.quantity * jt.unit_price DESC
  RETURNING JSON
) AS items_ranked
FROM   orders o,
       JSON_TABLE(o.order_doc, '$.items[*]'
         COLUMNS (product VARCHAR2(100) PATH '$.product',
                  quantity NUMBER PATH '$.quantity',
                  unit_price NUMBER PATH '$.unitPrice')) jt;

Notice the RETURNING JSON on both the inner JSON_OBJECT and the outer JSON_ARRAYAGG. Without those, you'd be aggregating VARCHAR2 strings into a VARCHAR2 string, which means double-escaping issues and lost type fidelity. With them, you're aggregating JSON values into a JSON value — clean, binary, native.

Read that as: "take every item row, shape each one as a {product, total} object, then roll them up into a single array sorted by total descending." One statement does the row-level shaping and the collection-level rollup. No application code. No loops. No post-processing.

The ORDER BY inside the aggregate is important — it controls the order of elements in the resulting JSON array. Without it, array element order is undefined. With it, you can guarantee that your API consumers get data in the order they expect (top-N lists, chronological feeds, priority queues).

JSON_OBJECTAGG is the less common sibling — it builds a single JSON object where keys and values both come from table rows. Think of it as turning two columns into a dictionary:

-- Turn a settings table into a JSON config object
SELECT JSON_OBJECTAGG(setting_name VALUE setting_value RETURNING JSON) AS config
FROM   user_settings
WHERE  user_id = 42;
-- {"theme":"dark","language":"en","timezone":"America/Chicago"}

Use it when you have key-value pair data in rows and need a lookup object in the response. Settings, feature flags, translation dictionaries, metadata maps — anything where you'd otherwise reduce rows into an object in application code.

Why these are the "workhorses": Every non-trivial API response needs nested collections. Every nested collection starts as rows in a table. JSON_ARRAYAGG and JSON_OBJECTAGG are the bridge between those two worlds. Combined with JSON_OBJECT for per-row shaping and the FORMAT JSON clause for composition, they let you build arbitrarily nested API responses in a single SQL statement — no middleware, no serialization layer, no ORM-to-JSON mapping code.

The database returns the JSON. Your handler just sends it.

The FORMAT JSON Clause

Here's the gotcha that catches every developer on day one. You build a JSON array in a CTE. You reference it in a JSON_OBJECT to nest it inside a parent object. You run the query. You get back this:

{"orderItems":"[{\"product\":\"Widget Pro\"},{\"product\":\"Gadget Plus\"}]"}

That's not a nested array. That's a string containing an escaped JSON array. Your frontend just received "orderItems" as text — they'd have to call JSON.parse() on the value to use it. Broken.

There are two fixes — and the better one is what we just talked about.

Fix #1 (preferred): use RETURNING JSON so the intermediate result is JSON, not VARCHAR2.

WITH item_array AS (
  SELECT JSON_ARRAYAGG(
    JSON_OBJECT('product' VALUE jt.product RETURNING JSON)
    RETURNING JSON
  ) AS items_json
  FROM orders o,
       JSON_TABLE(o.order_doc, '$.items[*]'
         COLUMNS (product VARCHAR2(100) PATH '$.product')) jt
)
SELECT JSON_OBJECT(
  'orderItems' VALUE items_json     -- Already JSON, no FORMAT JSON needed
  RETURNING JSON
) FROM item_array;

items_json is now a JSON-typed column. When you nest it inside the outer JSON_OBJECT, Oracle knows it's already JSON and doesn't escape it. This is the cleanest pattern — and the one Zhen Hua Liu (the OSON architect) recommends — because it preserves type fidelity all the way through the pipeline.

Fix #2 (legacy / interop): use FORMAT JSON to vouch for a VARCHAR2 value.

If you're working with a VARCHAR2 or CLOB column that already contains serialized JSON — perhaps from an external source, a legacy table, or a function that returns text — you don't have the luxury of changing the source type. That's where FORMAT JSON earns its keep:

SELECT JSON_OBJECT(
  'payload' VALUE legacy_text_column FORMAT JSON   -- Trust me, this is JSON
  RETURNING JSON
) FROM legacy_table;

FORMAT JSON tells Oracle "this string is already valid JSON, insert it as-is, don't escape it." Without that clause, Oracle assumes text — and text gets escaped for safety (otherwise a malicious value containing "} could break out of the parent structure). With it, Oracle inserts the value verbatim.

When you need FORMAT JSON: working with VARCHAR2/CLOB columns that store serialized JSON. Legacy tables. External payloads. Anywhere the source type is text but the content is JSON.

When you don't: When you control the source type. Use RETURNING JSON on every constructor in the pipeline and the problem disappears entirely.

This trips up every developer exactly once — usually on a Monday morning, usually ten minutes before a demo. Now you know how to avoid it (use RETURNING JSON everywhere) and how to fix it when you can't (use FORMAT JSON to vouch for the bytes).

9. Transforming JSON In-Place

Sometimes you don't want to decompose and rebuild — you want to modify the JSON document directly. Oracle gives you two tools.

JSON_TRANSFORM: The Swiss Army Knife

Here's the pattern every developer has written at least once: pull a document out of the database, JSON.parse it, mutate a few fields in application code, JSON.stringify it back, and write it to the database. Maybe wrap the whole thing in a read-modify-write loop with optimistic concurrency to avoid losing updates. Maybe get it wrong the first time. Probably write a retry.

JSON_TRANSFORM is the SQL-native alternative. You tell Oracle what you want changed. Oracle does the mutation on the binary OSON structure in place — no parse, no serialize, no round-trip, no race. Introduced in 21c and dramatically expanded in 26ai, it's the most expressive in-place JSON modification function in any database.

-- Multiple operations in a single call
SELECT JSON_TRANSFORM(o.order_doc,
  SET    '$.status'      = 'delivered',
  SET    '$.deliveredAt'  = SYSTIMESTAMP,
  REMOVE '$.priority',
  RENAME '$.customer'     = 'customerName',
  APPEND '$.tags'         = 'completed'
  RETURNING VARCHAR2(4000) PRETTY
)
FROM orders o;

Read that like a script: update the status, stamp the delivery time, drop the priority field, rename customer to customerName, append "completed" to the tags array. Five mutations. One function call. One atomic operation.

Core operations:

Operation	What it does	Think of it like
`SET`	Create the field if missing, replace if present	`obj[key] = value` (JavaScript)
`INSERT`	Create only — error if the field already exists	`dict.setdefault(key, value)` (Python)
`REPLACE`	Replace only — no-op if the field is missing	`if key in obj: obj[key] = value`
`APPEND`	Push a value onto a JSON array	`arr.push(value)`
`REMOVE`	Delete a field or array element	`delete obj[key]`
`RENAME`	Change a field name, keeping its value	`obj[newKey] = obj[oldKey]; delete obj[oldKey]`
`KEEP`	Whitelist: keep these fields, drop everything else	`pick(obj, [...])` from Lodash

Each operation has its own error handling — you decide what should happen if the target path is missing, already exists, or has a NULL value:

-- Only stamp createdAt if it's not already there (won't overwrite existing audits)
UPDATE orders SET order_doc = JSON_TRANSFORM(order_doc,
  INSERT '$.audit.createdAt' = SYSTIMESTAMP   -- create the path if missing
  IGNORE ON EXISTING                          -- skip silently if already set
  RETURNING JSON
);

Why this matters: Every one of these operations happens atomically inside a single SQL transaction. No lost updates. No torn writes. No "I parsed version 1, you parsed version 1, I wrote version 2, you wrote version 2, my change is gone." You skip the entire class of concurrency bugs that the parse-mutate-write pattern introduces — because there's no parse and no intermediate state for another session to race against.

And because Oracle modifies the OSON binary format in place, updates are piecewise — only the changed portions of the document get written to disk, undo, and redo. A 2KB update to a 2MB document costs 2KB of I/O, not 2MB. At scale, that's the difference between a blog-post database and a system of record.

Conditional and Array Transformations

Set a value on every element of an array using [*] in the path:

-- Mark every item as tax-exempt
SELECT JSON_TRANSFORM(o.order_doc,
  SET '$.items[*].taxExempt' = TRUE
  RETURNING VARCHAR2(4000) PRETTY
)
FROM orders o
FETCH FIRST 1 ROW ONLY;

Conditional updates use a CASE expression in plain SQL — set the value based on a JSON_VALUE extraction, then write it back:

SELECT JSON_TRANSFORM(o.order_doc,
  SET '$.trackable' = CASE
                        WHEN JSON_VALUE(o.order_doc, '$.status') = 'shipped'
                          THEN TRUE
                        ELSE FALSE
                      END
  RETURNING VARCHAR2(4000) PRETTY
)
FROM orders o
FETCH FIRST 3 ROWS ONLY;

Array set operations: UNION, INTERSECT, MINUS, SORT — applied to JSON arrays like relational set operators.

JSON_MERGEPATCH: The Simple Alternative

For straightforward updates, JSON_MERGEPATCH implements RFC 7396:

-- Update status, add a field, remove a field (set to null)
UPDATE orders SET order_doc = JSON_MERGEPATCH(
  order_doc,
  '{"status": "delivered", "trackingUrl": "https://track.example.com/1001", "priority": null}'
);

Rules are simple: new keys are added, existing keys are replaced, null removes the key.

Limitation: You can't modify individual array elements. JSON_MERGEPATCH replaces arrays wholesale. For granular array work, use JSON_TRANSFORM.

10. Putting It All Together: The CTE Pipeline

Here's where everything converges. Let's build a realistic query: flatten orders from JSON, join with a relational product catalog, compute analytics, and construct a new JSON API response. One statement. One transaction.

WITH
-- Step 1: Flatten JSON orders into relational rows
order_lines AS (
  SELECT o.id                                         AS order_pk,
         JSON_VALUE(o.order_doc, '$.orderId'
                    RETURNING NUMBER)                  AS order_id,
         JSON_VALUE(o.order_doc, '$.customer.name')    AS customer,
         JSON_VALUE(o.order_doc, '$.orderDate'
                    RETURNING DATE)                    AS order_date,
         jt.product,
         jt.quantity,
         jt.unit_price,
         jt.category
  FROM   orders o,
         JSON_TABLE(o.order_doc, '$.items[*]'
           COLUMNS (
             product    VARCHAR2(100) PATH '$.product',
             quantity   NUMBER        PATH '$.quantity',
             unit_price NUMBER        PATH '$.unitPrice',
             category   VARCHAR2(50)  PATH '$.category'
           )
         ) jt
),

-- Step 2: Enrich with relational product catalog
enriched AS (
  SELECT ol.*,
         p.supplier,
         p.weight_kg,
         ol.quantity * ol.unit_price AS line_total
  FROM   order_lines ol
  LEFT JOIN products p ON p.product_name = ol.product
),

-- Step 3: Aggregate by category
category_summary AS (
  SELECT customer,
         category,
         SUM(line_total)   AS category_total,
         SUM(quantity)      AS units_sold
  FROM   enriched
  GROUP  BY customer, category
)

-- Step 4: Construct the JSON API response
SELECT JSON_OBJECT(
  'customer'   VALUE cs.customer,
  'categories' VALUE JSON_ARRAYAGG(
    JSON_OBJECT(
      'category' VALUE cs.category,
      'total'    VALUE cs.category_total,
      'units'    VALUE cs.units_sold
      RETURNING JSON
    )
    ORDER BY cs.category_total DESC
    RETURNING JSON
  )
  RETURNING JSON
) AS api_response
FROM   category_summary cs
GROUP  BY cs.customer;

Read it top to bottom. Each CTE is one logical step. Each step is testable in isolation (replace the final SELECT with SELECT * FROM enriched to debug step 2). The CBO optimizes the entire pipeline as a single execution plan — it can push predicates from step 4 down into step 1, reorder joins, and choose the optimal access path for each table.

Now stop and think about what just happened. You described the data you wanted. You described how to shape it. You handed that description to Oracle. What you got back was a fully-formed JSON API response — computed, aggregated, sorted, assembled — in one round trip, one transaction, one execution plan.

Count the code that didn't have to exist:

No ORM.
No serialization layer.
No hand-written JSON builder walking through result sets in application code.
No stitching together results from multiple queries.
No worrying about which query reads the latest data and which one doesn't.
No retry logic when a mid-pipeline read returns stale state.

The handler code to deliver this API response is now one line: res.send(row). That's the whole request path.

The Multi-Model Query

This is where the gloves come off. In Oracle 26ai, the pipeline extends to vector search, graph traversal, and spatial queries — all in the same WITH clause, all in the same transaction, all under the same optimizer.

To make the example self-contained, here's a minimal RAG-style schema with three docs, a vector embedding per doc, and a small knowledge graph:

CREATE TABLE documents (
  id     NUMBER PRIMARY KEY,
  title  VARCHAR2(200),
  author VARCHAR2(100)
);

CREATE TABLE knowledge_base (
  doc_id     NUMBER REFERENCES documents,
  chunk_text VARCHAR2(4000),
  embedding  VECTOR(4, FLOAT32)
);

CREATE TABLE kg_nodes (
  node_id   NUMBER PRIMARY KEY,
  doc_id    NUMBER REFERENCES documents,
  name      VARCHAR2(100)
);

CREATE TABLE kg_edges (
  edge_id   NUMBER PRIMARY KEY,
  src_id    NUMBER REFERENCES kg_nodes,
  dst_id    NUMBER REFERENCES kg_nodes,
  rel_type  VARCHAR2(50)
);

INSERT INTO documents VALUES (1, 'Indexing JSON in Oracle',  'R. Houlihan');
INSERT INTO documents VALUES (2, 'Vector search at scale',   'A. Lovelace');
INSERT INTO documents VALUES (3, 'Graph queries with SQL/PGQ','G. Hopper');

INSERT INTO knowledge_base VALUES
  (1, 'Functional indexes on JSON_VALUE', TO_VECTOR('[0.10, 0.20, 0.30, 0.40]'));
INSERT INTO knowledge_base VALUES
  (2, 'Approximate nearest neighbor with HNSW', TO_VECTOR('[0.11, 0.21, 0.31, 0.42]'));
INSERT INTO knowledge_base VALUES
  (3, 'Property graph traversal in 23ai+',  TO_VECTOR('[0.50, 0.60, 0.70, 0.80]'));

INSERT INTO kg_nodes VALUES (10, 1, 'Indexes');
INSERT INTO kg_nodes VALUES (11, 2, 'HNSW');
INSERT INTO kg_nodes VALUES (12, 3, 'SQL/PGQ');
INSERT INTO kg_edges VALUES (100, 10, 11, 'related_to');
INSERT INTO kg_edges VALUES (101, 11, 12, 'used_by');

CREATE PROPERTY GRAPH knowledge_graph
  VERTEX TABLES (kg_nodes KEY (node_id) PROPERTIES (doc_id, name))
  EDGE TABLES (kg_edges KEY (edge_id)
                        SOURCE      KEY (src_id) REFERENCES kg_nodes (node_id)
                        DESTINATION KEY (dst_id) REFERENCES kg_nodes (node_id)
                        PROPERTIES (rel_type));

WITH
semantic_matches AS (
  -- Vector similarity search (literal probe vector — bind one in production)
  SELECT doc_id, chunk_text,
         VECTOR_DISTANCE(embedding,
                         TO_VECTOR('[0.10, 0.20, 0.30, 0.40]'),
                         COSINE) AS score
  FROM   knowledge_base
  ORDER  BY score
  FETCH  FIRST 20 ROWS ONLY
),
graph_context AS (
  -- Graph traversal (SQL/PGQ). GRAPH_TABLE projects the doc_id alongside
  -- the matched entities so the join can stay outside the operator.
  SELECT g.entity, g.related, g.rel_type
  FROM   GRAPH_TABLE(knowledge_graph
           MATCH (v1) -[e]-> (v2)
           COLUMNS (v1.doc_id  AS source_doc,
                    v1.name    AS entity,
                    v2.name    AS related,
                    e.rel_type AS rel_type)
         ) g
  JOIN   semantic_matches s ON s.doc_id = g.source_doc
),
enriched AS (
  -- Relational metadata
  SELECT s.doc_id, s.score, d.title, d.author
  FROM   semantic_matches s
  JOIN   documents d ON d.id = s.doc_id
)
SELECT JSON_OBJECT(
  'results' VALUE (
    SELECT JSON_ARRAYAGG(
      JSON_OBJECT(
        'title'     VALUE e.title,
        'score'     VALUE ROUND(e.score, 4),
        'relations' VALUE (
          SELECT JSON_ARRAYAGG(
            JSON_OBJECT('entity'  VALUE g.entity,
                        'related' VALUE g.related,
                        'type'    VALUE g.rel_type
                        RETURNING JSON)
            RETURNING JSON
          )
          FROM graph_context g
        )
        RETURNING JSON
      ) ORDER BY e.score
      RETURNING JSON
    ) FROM enriched e
  )
  RETURNING JSON
) AS rag_context
FROM DUAL;

Vector search. Graph traversal. Relational joins. JSON construction. One query. One transaction. One optimizer.

The Alternative: Polyglot Persistence

Let's walk through building the same feature on a "purpose-built" stack. You know the pitch — "use the right tool for the job." Here are the jobs:

Vector similarity search → vector database (Pinecone, Weaviate, or Atlas Search)
Graph traversal → graph database (Neo4j, Neptune)
Document storage → MongoDB or DynamoDB
Relational metadata → Postgres or MySQL
Assembly → your application code, on an EC2 instance somewhere, holding the bag

Now your request path looks like this:

HTTP call out to the vector service. Marshall the request, send it over TCP, wait for the response, unmarshall 20 results.
HTTP call out to the graph database. Pass the IDs from step 1. Marshall, send, wait, unmarshall.
HTTP call out to the document store. Pass IDs again. Marshall, send, wait, unmarshall.
HTTP call out to the relational database. Pass IDs again. Marshall, send, wait, unmarshall.
Write application code to JOIN all four result sets in-memory.
Hand-build the JSON response object.
Ship it.

Count the overhead. Four network round-trips. Four serialization cycles. Four deserialization cycles. Four connection pools. Four auth handshakes. Four monitoring dashboards. Four SDKs with four different query APIs your team has to learn. And the developer — not the database — is the one writing the join logic that used to be FROM a JOIN b.

"But we have a unified API!" — okay, let's actually look at that. A unified API over multiple backend services reduces the number of network calls your app makes, because the vendor's gateway is now doing the fan-out instead of your code. That's a real win for developer ergonomics. It is also the entire win.

What the unified API does not give you is a unified data model. MongoDB, for instance, exposes document storage, Atlas Search, vector search, time series, and a bolt-on graph primitive ($graphLookup) through a common surface. Looks clean. But underneath, there's still no relational model, no cost-based optimizer that spans those stores, and no way to join four data shapes in a single execution plan. You can $lookup your way to something that resembles a join, but you're stitching stages in an aggregation pipeline — the developer is still the optimizer, and cross-shard $graphLookup can't even participate in a multi-document transaction. Consistency? Atlas Search indexes update in a separate process with one-to-fifteen seconds of lag. Your vector results and your document results can — and routinely do — disagree about what's actually in the database right now.

So yes, the unified API saves you some network round trips. It does not save you from the thing that breaks your RAG pipeline at 3am: the underlying stores don't share a model, don't share an optimizer, and don't share a transaction boundary. The seams are still there. You just can't see them anymore — and neither can your debugger.

The Consistency Problem

Now the part that isn't obvious until it's production.

In the polyglot model, every service has its own clock. The vector database indexed your document three seconds ago. The graph database processed the relationship change yesterday. The document store has today's data. The relational metadata is fresh from the last ETL job that ran at 4am. All four sources are individually correct and collectively lying.

When you're building a dashboard, you shrug. When you're feeding a retrieval-augmented LLM, you don't. The model gets vector results pointing to a document that no longer exists, graph relationships that were severed an hour ago, and metadata from a snapshot taken last night. What does the model do with contradictory facts?

It hallucinates. With confidence. And your users — or worse, your agents taking actions on behalf of your users — treat that hallucination as truth.

There's no application fix for this. You can't retry your way out of it. You can't throw more monitoring at it. The inconsistency is baked into the architecture the moment you decided to split your data across systems that don't share a transaction boundary.

What the Converged Engine Gives You

In the Oracle query above, the CBO sees the entire statement. It knows the vector search returns ~20 rows, so it'll probably use that as the driving input. It knows the graph traversal is bounded by a predicate. It knows the documents join can use an index. It builds one execution plan that's optimized across vector, graph, relational, and JSON operations simultaneously.

But the part that matters for your users is this: every piece of data in the response came from the same moment in time. The vector search, the graph traversal, and the relational join all ran inside the same transaction against the same snapshot. There's no "the vector index says this but the document says that." There's just one answer, internally consistent, from one source of truth.

For an LLM, that's the difference between grounding and gambling. For an autonomous agent, it's the difference between correct action and expensive liability. For you — the developer on the hook when the 3am page comes in — it's the difference between "the model got confused" and "the system did what it was supposed to do."

One query. One transaction. One optimizer. One truth. That's not a feature. That's architecture — and no amount of clever SDK design on top of a polyglot stack gets you there.

11. Indexing and Performance

Good SQL/JSON queries are fast. Properly indexed SQL/JSON queries are blazing. Wrong indexes — or missing ones — turn beautiful declarative SQL into a sequential scan of your entire table, which is an experience every developer enjoys exactly once before they learn to run EXPLAIN PLAN.

Oracle gives you five different kinds of JSON indexes, each optimized for a different access pattern. Here's the decision tree:

What you're doing	Use this
Querying one scalar field, known queries	Functional index on `JSON_VALUE`
Same field queried with multiple types	Multiple functional indexes — one per type
Looking up values inside a JSON array	Multivalue index (21c+)
Ad-hoc or unknown query patterns across the whole document	JSON search index
Only want to index rows matching a condition	Add `WHERE` to any of the above — partial index (23ai+)

The CBO picks whichever one matches your predicate. You don't hint, you don't route — you just write SQL and Oracle figures out which index to use. Most production workloads combine two or three of these. Let's walk through them.

Functional Indexes on JSON_VALUE

For known access patterns — the queries your app runs a million times a day:

CREATE INDEX idx_order_status ON orders (
  JSON_VALUE(order_doc, '$.status' RETURNING VARCHAR2(20) ERROR ON ERROR)
);

-- Composite index across multiple JSON fields
CREATE INDEX idx_order_cust_date ON orders (
  JSON_VALUE(order_doc, '$.customer.name' RETURNING VARCHAR2(100)),
  JSON_VALUE(order_doc, '$.orderDate' RETURNING DATE)
);

Or use dot-notation with item methods:

CREATE INDEX idx_order_customer ON orders o (
  o.order_doc.customer.string()
);

Critical rule: the expression has to match exactly. The CBO compares the index definition and the query predicate as expressions, not as semantic equivalents. "Close enough" is not enough. Specifically:

Return type must match. RETURNING NUMBER in the index and a query without a RETURNING clause (which defaults to VARCHAR2) = no match. Index ignored.
VARCHAR2 length is part of the type. RETURNING VARCHAR2(100) in the index and RETURNING VARCHAR2(400) in the query = no match. These are different type instances as far as the optimizer is concerned.
Dot-notation and JSON_VALUE are different expressions. An index created with o.order_doc.customer.string() will not be used by a query written with JSON_VALUE(order_doc, '$.customer' RETURNING VARCHAR2(100)) — even though both compute "the customer field as a string." The item method .string() is internally equivalent to JSON_VALUE with a default VARCHAR2 return (typically VARCHAR2(4000)), and that's rarely the same type as your RETURNING VARCHAR2(n) index definition.

The practical rule: pick one syntax for each access path — either dot-notation with item methods, or JSON_VALUE with explicit RETURNING clauses — and use it consistently for both the index definition and every query that should hit it. Mixing them is the single most common reason "my index isn't getting used" in Oracle SQL/JSON.

If you find out too late that your queries and your index are using different expressions, you have two options: rebuild the index to match, or change the queries to match the index. There's no magic reconciliation. Run EXPLAIN PLAN or check DBMS_XPLAN.DISPLAY_CURSOR — if you see a full table scan where you expected an index range scan, expression mismatch is almost always the culprit.

What if your queries hit the same path with different types? Create multiple functional indexes — one per type. Each one is a thin B-tree, each one matches exactly one query shape, and the CBO picks whichever one fits:

-- Index for string comparisons: WHERE amount.string() = '100.00'
CREATE INDEX idx_amount_str ON orders o (o.order_doc.amount.string());

-- Index for numeric comparisons: WHERE amount.number() > 100
CREATE INDEX idx_amount_num ON orders o (o.order_doc.amount.number());

You pay for the extra storage and write amplification on that field, but every query shape gets index access. We will discuss how to avoid this in a minute. This is the "explicit coverage" approach — and it's especially useful for legacy code paths that query the same field with different conventions than your newer code.

If you'd rather not enumerate types by hand, the next option covers every type at once.

JSON Search Indexes: Index Everything, Query Anywhere

Functional and multivalue indexes are targeted — they cover specific paths you already know you'll query. But document data is document data: schemas drift, new fields appear, analysts show up with questions nobody anticipated, multi-tenant customers each care about different attributes. What do you do when you don't know the access patterns in advance?

You create a JSON search index:

CREATE SEARCH INDEX idx_orders_search ON orders (order_doc) FOR JSON;

That one statement makes every path in every document indexed. $.customer? Indexed. $.shipping.address.city? Indexed. $.items[*]?(@.category == "electronics")? Indexed. $.totally.new.field.you.added.last.tuesday? Indexed, automatically, the moment data with that path shows up. You didn't declare any of those paths individually. You didn't add DDL when the schema evolved. The index adapts.

Any JSON_EXISTS, JSON_VALUE, or full-text predicate on that column becomes a candidate for index access. The CBO chooses it automatically when no more targeted index fits.

Comparisons with MongoDB — the honest version. People draw a few different comparisons between Oracle's JSON search index and MongoDB's indexing options. Some are closer than others. The accurate mapping:

MongoDB	Oracle	What they both do
Wildcard index `{ "$**": 1 }`	JSON search index (scalar use)	Index every path for ad-hoc queries
Atlas Search	JSON search index (full-text use)	Above + tokenization, stemming, phrase, regex
Multikey index	Multivalue index	Per-field array indexing
Compound index	Composite functional index	Multiple fields in one index

The wildcard index is the closest comparison when you're just talking about "index everything for unknown access patterns" — both let you skip the up-front decision of which fields to index, and both stay in the operational engine (no sidecar). That's the fair comparison and it's worth giving MongoDB credit for the feature.

Where wildcard indexes fall short for production use:

One field per query. A wildcard index can only help a query that predicates on a single field. Multi-field queries like { customer: "Acme", status: "shipped" } can't use the wildcard index for both predicates — it picks one and scans for the other. Compound queries still need compound indexes, which means you're back to enumerating access patterns.
No complex filters. Oracle's JSON_EXISTS with filter predicates (?(@.price > 100 && @.category in ("electronics","tools"))) gets evaluated at the index level. MongoDB's wildcard index can't support that class of predicate — it has to scan after the index narrows things down.
Can't be unique, can't be TTL, can't be a shard key. A lot of functionality you give up for the convenience.
No full-text. If you want tokenization, stemming, or phrase search, you stand up Atlas Search as a separate process. Separate process means separate sync cycle, and here's the punchline: Atlas Search runs as a sidecar (mongot) with one to fifteen seconds of indexing lag. During that window, your document exists but isn't searchable. Your RAG pipeline reads the vector result and then can't find the corresponding document because the search index is playing catch-up. Your support team files a ticket that says "the search is broken." Again.

Oracle's JSON search index does all of it — scalar ad-hoc queries, complex filter predicates, full-text search — from a single index definition, in-kernel and transactionally consistent by default. The index updates inside the same transaction that updates the row. Commit returns. The document is searchable. Period. No sidecar process. No replication lag. No "eventual" anything.

-- Transactional (default)
CREATE SEARCH INDEX idx_orders_search ON orders (order_doc) FOR JSON
  PARAMETERS ('SYNC (ON COMMIT)');

-- Or periodic, if you'd rather trade freshness for write throughput
CREATE SEARCH INDEX idx_orders_search ON orders (order_doc) FOR JSON
  PARAMETERS ('SYNC (EVERY "FREQ=SECONDLY; INTERVAL=1")');

Your choice, your trade-off. Oracle's default is consistency. MongoDB's only option is async.

When to use it: Analytics workloads. Multi-tenant systems where each tenant queries different fields. Applications where the schema evolves faster than you can deploy new DDL. Anywhere you'd otherwise be tempted to stand up Elasticsearch as a sidecar just to make ad-hoc queries fast. The JSON search index gives you that capability inside your operational database, consistent with your operational data, without standing up new infrastructure.

The trade-off: it's heavier to maintain than a targeted functional index. Every insert tokenizes the whole document and updates the inverted lists. For write-heavy OLTP workloads with well-known access patterns, targeted functional indexes are still the better call. Most real systems use both:

Two or three functional indexes on the hot-path fields (status, customer, date, whatever drives 80% of your queries)
One JSON search index as the catch-all for everything else
A couple of partial indexes layered on top for high-cardinality optional fields

The CBO picks whichever index fits the query. You write normal SQL and let the optimizer sort it out.

Multivalue Indexes (21c+)

For indexing values inside JSON arrays — something ordinary functional indexes can't do:

CREATE MULTIVALUE INDEX idx_order_tags ON orders o (
  o.order_doc.tags.string()
);

-- This query uses the multivalue index
SELECT * FROM orders o
WHERE  JSON_EXISTS(o.order_doc, '$.tags[*]?(@.string() == "wholesale")');

Execution plans show INDEX RANGE SCAN (MULTI VALUE) when the optimizer picks it up.

A note on type coverage for array values. The multivalue index above is locked to .string() — it indexes the tag values as text. If your arrays can hold values of different types (say, a metrics array that mixes numbers and strings, or an audit log with mixed scalar types), the same pattern applies as with scalar indexes: create one multivalue index per type you actually query on:

-- Tags queried as strings
CREATE MULTIVALUE INDEX idx_order_tags ON orders o (o.order_doc.tags.string());

-- Numeric IDs queried as numbers
CREATE MULTIVALUE INDEX idx_order_ids_num ON orders o (o.order_doc.relatedIds.number());

MongoDB's multikey indexes are strict about BSON types — and this is where developers get burned. Say two services write to the same tags array, one as strings and one as numbers:

// Inserted by Service A
{ _id: 1, tags: ["urgent", "123"] }    // "123" is a string

// Inserted by Service B
{ _id: 2, tags: ["urgent", 123] }      // 123 is a number

Now query:

db.orders.find({ tags: 123 })          // returns _id: 2 only
db.orders.find({ tags: "123" })        // returns _id: 1 only

Neither query finds both documents. MongoDB's BSON comparison treats "123" and 123 as different values, so the index stores them as different entries and queries silently miss across the type boundary. To catch both, you need $in: [123, "123"] — and now your application code has to know the full set of type variations that might exist in the database, which defeats the point of having a schema-flexible store.

Oracle's multivalue indexes normalize at index time. When you create a multivalue index with .string(), both "123" and 123 are stored as the string "123" in the index. When you create one with .number(), both are stored as the number 123. You pick the canonical representation and the index enforces it, regardless of how heterogeneous the source data is:

-- Both "123" and 123 are indexed as the string "123"
CREATE MULTIVALUE INDEX idx_order_tags ON orders o (o.order_doc.tags.string());

-- Both "123" and 123 are indexed as the number 123
CREATE MULTIVALUE INDEX idx_order_tags_num ON orders o (o.order_doc.tags.number());

A query using @.string() == "123" on the string index hits both documents. A query using @.number() == 123 on the number index hits both documents. You choose the type view once at index creation time and get consistent results forever, without your app code having to enumerate every type variation the source data might contain.

If you want broad coverage across multiple types without declaring each one by hand, use a JSON search index — it catches everything.

Sparse Indexes: Skip Rows You Don't Care About

Here's a question every developer asks once they start putting indexes on optional JSON fields: if most of my documents don't have a trackingUrl, am I still paying to index NULL on every insert?

For single-column B-tree indexes, the answer is automatic — Oracle doesn't index rows where the indexed column is NULL. An index on JSON_VALUE(order_doc, '$.trackingUrl') skips unshipped orders for free.

For everything else — composite indexes, indexes you want to skip based on conditions other than NULL — wrap the index expression in a CASE that returns NULL for the rows you want to exclude. Oracle treats those rows as un-indexed, the same way it treats genuine NULLs:

-- Only index trackingUrl on shipped orders. The CASE returns NULL for
-- non-shipped rows, and Oracle skips NULLs from the B-tree.
CREATE INDEX idx_shipped_tracking ON orders (
  CASE
    WHEN JSON_VALUE(order_doc, '$.status') = 'shipped'
      THEN JSON_VALUE(order_doc, '$.trackingUrl' RETURNING VARCHAR2(200))
  END
);

Rows where status isn't 'shipped' are never stored in this B-tree — not on insert, not on update, not on delete. The index stays small and the write path stays fast. The CBO matches a query of the form WHERE CASE WHEN ... = 'shipped' THEN ... END = '<url>' straight to this index.

Pattern: Type-Routed Indexes for Polymorphic Documents

Now combine the CASE trick with the multi-type coverage pattern and you get something MongoDB can't replicate cleanly: write-path routing based on document shape.

A lot of real-world document collections are polymorphic. One table holds customers, products, and invoices. Or one table holds v1 documents with amount stored as a string ("1,234.56") and v2 documents with amount stored as a NUMBER. Standard indexing says you either pick one type and break the other, or you index the lowest-common-denominator representation and lose query precision. Neither is great.

CASE-gated indexes let you define one index per shape and route writes to the right one based on a discriminator attribute:

-- Schema-version routing: each row contributes to exactly one of these
CREATE INDEX idx_amount_v1 ON orders (
  CASE
    WHEN JSON_VALUE(order_doc, '$.schemaVersion') = '1'
      THEN JSON_VALUE(order_doc, '$.amount' RETURNING VARCHAR2(30))
  END
);

CREATE INDEX idx_amount_v2 ON orders (
  CASE
    WHEN JSON_VALUE(order_doc, '$.schemaVersion') = '2'
      THEN JSON_VALUE(order_doc, '$.amount' RETURNING NUMBER)
  END
);

Now when you insert a v1 document, only idx_amount_v1 gets maintained. When you insert a v2 document, only idx_amount_v2. Neither index stores NULL entries for the other version, and neither contends with the other on the write path. The total write amplification stays the same as a single-version index even though the collection handles multiple schemas simultaneously.

The same pattern works for polymorphic entity types:

-- One table, multiple entity types, write-path-isolated indexes
CREATE INDEX idx_customer_name ON entities (
  CASE
    WHEN JSON_VALUE(doc, '$.entityType') = 'customer'
      THEN JSON_VALUE(doc, '$.companyName' RETURNING VARCHAR2(200))
  END
);

CREATE INDEX idx_product_sku ON entities (
  CASE
    WHEN JSON_VALUE(doc, '$.entityType') = 'product'
      THEN JSON_VALUE(doc, '$.sku' RETURNING VARCHAR2(50))
  END
);

CREATE INDEX idx_invoice_number ON entities (
  CASE
    WHEN JSON_VALUE(doc, '$.entityType') = 'invoice'
      THEN JSON_VALUE(doc, '$.invoiceNo' RETURNING NUMBER)
  END
);

Insert a customer document: only the customer index gets touched. Insert a product: only the product index. Insert an invoice: only the invoice index. Write overhead becomes proportional to the entity type you're writing, not to the total number of indexes on the table.

This is the cleanest answer to the "polymorphic collection" problem I've seen in any database. You don't pay for indexes that don't apply. You don't deal with NULL pollution. You don't need to split the data across multiple tables to get write-path isolation. And query time, the CBO picks the right index automatically based on the predicate — no hints, no routing logic in application code.

The trade-off: schema-version or entity-type discriminators have to be discoverable from the document. Make sure your write path sets them consistently — these become load-bearing fields. But if you're already tagging documents with a type or version (and most polyglot migrations do), you get write-path routing for free.

The OSON Advantage

Every one of these indexes is riding on top of the native JSON type's OSON binary format — and that matters more than most developers realize.

Here's why. When your JSON is stored as text (VARCHAR2, CLOB, or MongoDB's BSON), every field access has to walk the document byte-by-byte until it finds the right offset. O(n) complexity per lookup, where n is how deep into the document the field lives. Every index maintenance operation — every insert, every update — pays that parse cost again. At scale, this shows up as CPU burn on your write path and latency spikes on deeply nested field queries.

OSON doesn't parse. It navigates. Field names are hash-indexed, so looking up $.shipping.address.zip is three hash jumps — direct offset lookups, not scans. O(1) per field, regardless of document size or nesting depth. Your index maintenance is cheaper because the underlying field access is cheaper. Your queries are cheaper. Your writes are cheaper.

I've benchmarked this extensively. At field position 1000 in a document, OSON is 529x faster than BSON for field access. At position 50, it's 28.6x. This isn't marketing. It's storage engine mechanics — hash-indexed O(1) versus length-prefixed sequential O(n). Same input, same output, radically different physics.

And because OSON supports piecewise updates, modifying one field in a large document only writes the changed portion to disk, undo, and redo. Update $.status on a 2MB document and you pay for 2KB of I/O, not 2MB. Functional indexes on unchanged fields don't get touched at all. Text-stored JSON has to rewrite the entire column on every update — including BSON, which has no piecewise update path at the storage layer.

For the deep dive on the binary format mechanics, see my article: Why Binary Document Protocols Aren't All Created Equal.

The bottom line for developers: the index features in this section work on top of OSON because OSON makes them practical. You can have type-routed partial indexes and full-document search indexes and multivalue arrays and targeted functional indexes on the same column because the underlying format is fast enough that maintaining all of them on every write is affordable. On a text-based or sequentially-scanned format, this mix of indexing strategies would murder your write throughput. On OSON, it's just how you build things.

12. The Bigger Picture

We've covered the mechanics. Let's zoom out.

JSON Schema Validation (26ai)

Oracle 26ai lets you enforce document structure at the storage layer:

CREATE TABLE validated_orders (
  id   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  data JSON VALIDATE '{
    "type": "object",
    "properties": {
      "orderId":  {"type": "number"},
      "customer": {"type": "string"},
      "items":    {"type": "array", "items": {
        "type": "object",
        "required": ["product", "quantity"]
      } }
    },
    "required": ["orderId", "customer", "items"]
  }'
);

Invalid documents are rejected at insert time with ORA-40875. No application-layer validation. No "trust the client" hope. The database enforces the contract. Oracle is the first major relational database to support JSON Schema validation natively as a constraint.

JSON_SERIALIZE: Making Binary Readable

When you need to inspect OSON binary content as text:

-- Pretty print with alphabetically sorted keys (26ai). The clauses
-- must come in the order: RETURNING <type>, PRETTY, ORDERED.
SELECT JSON_SERIALIZE(order_doc RETURNING VARCHAR2(4000) PRETTY ORDERED)
FROM   orders;

Essential for debugging, logging, and API responses. The ORDERED keyword (26ai) sorts keys alphabetically — useful for deterministic output in tests and diffs.

JSON Relational Duality Views: The UMT Implementation

Everything in this article leads here. Duality Views are the physical implementation of Unified Model Theory — one truth (normalized relational tables), many shapes (JSON document views), zero data duplication.

Throughout this article we've worked with JSON-storage tables and projected them into relational shapes via JSON_TABLE. Duality Views go the other direction: normalized relational tables are the source of truth, and the view projects them as updatable JSON documents. To make the example concrete, here's the relational schema:

CREATE TABLE dv_orders (
  order_id      NUMBER PRIMARY KEY,
  customer_name VARCHAR2(200) NOT NULL
);

CREATE TABLE dv_order_items (
  item_id      NUMBER PRIMARY KEY,
  order_id     NUMBER REFERENCES dv_orders ON DELETE CASCADE,
  product_name VARCHAR2(200),
  quantity     NUMBER,
  unit_price   NUMBER(12, 2)
);

INSERT INTO dv_orders VALUES (5001, 'Acme Corp');
INSERT INTO dv_orders VALUES (5002, 'Beta LLC');

INSERT INTO dv_order_items VALUES (1, 5001, 'Widget Pro', 10, 29.99);
INSERT INTO dv_order_items VALUES (2, 5001, 'Cable Kit', 50, 4.99);
INSERT INTO dv_order_items VALUES (3, 5002, 'Gadget Plus', 5, 49.99);

CREATE OR REPLACE JSON RELATIONAL DUALITY VIEW order_dv AS
  SELECT JSON {
    '_id'      : o.order_id,         -- root PK must be projected
    'customer' : o.customer_name,
    'items'    : [ SELECT JSON {
                     '_id'      : i.item_id,   -- nested PK must also be projected
                     'product'  : i.product_name,
                     'quantity' : i.quantity,
                     'price'    : i.unit_price
                   }
                   FROM dv_order_items i WITH UPDATE
                   WHERE i.order_id = o.order_id ]
  }
  FROM dv_orders o WITH INSERT UPDATE DELETE;

That's a fully updatable JSON document view over normalized relational tables. INSERT a document through the view and it decomposes into relational rows across multiple tables. Query through the view and it assembles documents from those tables. Full ACID. Optimistic concurrency via ETags. Accessible via SQL, REST, or even the MongoDB wire protocol.

Model the domain. Project the access. One truth. Many shapes. Zero tradeoffs.

The Bottom Line

Oracle's SQL/JSON isn't a compatibility checkbox. It's not "we also do JSON." It's a complete query language for hierarchical data, integrated into the most mature optimizer in the industry, backed by a binary format designed by database research veterans, and extended with construction, transformation, validation, and duality features that don't exist anywhere else.

But the real story for developers isn't the feature list. It's what these features eliminate from your stack.

The Code That Stops Existing

Look at the shape of a typical application built around a traditional database. You have an ORM translating between objects and tables, because your tables don't look like your objects. You have DTO classes shuttling data between the ORM layer and your API layer, because your ORM objects don't look like your API responses. You have serialization code building JSON from those DTOs, because your API consumers want JSON. You have validation libraries on top of all of it, because none of those layers enforce a contract.

And when your JSON data doesn't fit the relational model cleanly, you stand up a second database — a document store, a search engine, a vector index — and now you have integration code. Change data capture. Sync jobs. Consistency compensators. Retry logic. Pipelines that fail at 2am. Tickets nobody wants.

Oracle SQL/JSON collapses this stack. Your tables can hold both normalized relational data and JSON documents. Your queries return JSON directly, shaped to match your API contract. Your indexes cover both data shapes. Your transactions span both models. Your optimizer plans across both.

Count the code that stops existing:

No ORM required. Your query emits the JSON structure your API consumer wants. res.send(row.api_response) is the whole handler.
No DTO layer. There's nothing to translate. The database returned the final shape.
No serialization library. The database did the serialization, and it did it on the binary OSON structure, not in your application memory.
No validation framework for the API response. JSON Schema validation runs at the storage layer, not the application layer.
No integration layer between document and relational data. Both live in the same table, same transaction, same optimizer.
No sidecar search or vector services. JSON search and native VECTOR types live inside the database, consistent with the operational data.
No consistency compensators. The engine gives you ACID across JSON, relational, graph, and vector operations in the same transaction.

Every one of these is a class of code that exists in a polyglot architecture because the underlying databases don't talk to each other. Eliminate the need for them to talk to each other, and the code evaporates. What's left is the part of your application that actually creates business value.

Let Physics Define Your Data Model

Here's the principle that should drive every storage decision you make: if data is accessed together, store it together. If access patterns aren't clear, normalize it. That's it. That's the rule. The database's job is to make both possible without forcing you to choose an architecture up front.

All data is ultimately relational — entities, attributes, and the relationships between them. That's not a philosophy; that's how information works. But storing everything as normalized relational tables is an implementation choice, and it's the wrong choice when your access patterns are well-known and involve reading hierarchical data together. You're paying for joins you don't need, on data you always fetch as a unit, to get back a result you have to re-hierarchize in application code anyway.

The physics argument goes the other way too. When your access patterns are unknown, varied, or analytical — when different consumers need the same data in different shapes, or when you need aggregation across entities — normalized relational is the only sensible choice. Embedding everything in documents means duplicating data, fighting update anomalies, and losing the ability to reason efficiently about relationships.

Oracle lets you pick per-workload without committing to a paradigm. You have the full toolkit:

Native JSON type columns. Store documents as first-class data alongside relational columns in the same table. When your access patterns are well-known and you want document locality, put the document in the row. One read, one fetch, one shape.
JSON Collections (SODA / MongoDB API). Pure document storage with the MongoDB wire protocol on top. Drop-in compatibility for existing document applications, with all the indexing, ACID, and optimizer capabilities of the Oracle engine underneath.
Hybrid tables. Relational columns for structured, queryable fields alongside a JSON column for flexible, schema-optional attributes. The best of both worlds when your data has both known structure and emergent properties.
Normalized relational tables. When access patterns are varied, analytical workloads matter, and entity relationships drive the queries, normal form is still the right answer. Use it.
JSON Relational Duality Views. When the same canonical data needs to serve multiple access patterns with multiple document shapes, model once in relational form and project as many document views as you need. Updates flow both ways with full ACID guarantees. One source of truth, many projections, no CDC.

Every one of these stores the same kind of data. Every one participates in the same transactions. Every one is indexed and optimized by the same cost-based optimizer. You don't have to pick one approach for the whole application — different tables can use different strategies based on how the data is actually accessed. Let the physics of your access patterns define the model, not the other way around.

This is the answer to "should we use a document database or a relational database?" that the industry has been arguing about for fifteen years. The answer isn't one or the other. The answer is: whichever shape fits the access pattern, with full SQL and full JSON support either way, inside a single engine with a single transaction boundary.

Your application code sees documents where documents make sense. Your analytics queries see tables where tables make sense. Your reports see normalized joins. Your AI agents see vector-indexed context. Nobody sees the seams because there are no seams.

Your New Stack

The developer who actually internalizes what Oracle SQL/JSON can do doesn't just write faster JSON queries. They build fundamentally simpler systems.

No ORM. The database returns your API response.
No sidecar services. Search, vector, graph, and relational share the same engine and transaction.
No sync jobs. There's nothing to sync — one truth, many projections.
No complex middleware. The query is the middleware.
No "which database does this belong in?" debates. The answer is always "this one."
No praying that five systems agree on current state. They're all the same system.

You write one query. You get one truth. Your application is smaller, your operational surface is smaller, your bug reports are fewer, and your pages at 3am have better reasons than "the search index is lagging again."

That's not a feature. That's physics. ⚡