Forem: Ryan Giggs

I Built a Real-Time Crypto Analytics Pipeline for $0.01/Month — Here's the Full Architecture

Ryan Giggs — Mon, 27 Apr 2026 08:39:51 +0000

How I combined Apache Flink, Redpanda, Airflow, dbt Cloud, and Grafana to track Bitcoin, Ethereum, Solana, BNB, and Cardano in real time — all running on Google Cloud for less than a cup of coffee per month.

If you've been learning data engineering, you've probably built pipelines that move CSV files from A to B. Nothing wrong with that — but real data engineering interviews are starting to ask a different question: Can you handle data that never stops arriving?

This article walks through CoinPulse, a project I built from scratch to answer that question. It's a hybrid streaming and batch crypto analytics pipeline that:

Consumes live trade events from Binance's WebSocket feed at ~1 tick per second per coin
Processes them with Apache Flink using 1-minute tumbling windows
Enriches the streaming data daily with CoinGecko market metadata via Airflow
Transforms everything with dbt Cloud
Serves a live, auto-refreshing Grafana dashboard — publicly accessible

Total cloud cost: ~$0.01/month.

Here's how I built it.

The Problem I Was Solving

Crypto market data has two distinct tempos:

Real-time (milliseconds to seconds): Price ticks from exchanges. You need stream processing here — you can't batch-load ticks after the fact and call it "live." The data volume is too high to dump raw into a warehouse row by row.

Daily (slow-moving context): Market cap, trading volume, rankings, OHLC candlesticks. The underlying values move all day, but as context for the dashboard a once-a-day snapshot is enough. You don't need Flink for this; a simple scheduled API call works fine.

The interesting engineering challenge is: how do you join these two tempos cheaply and reliably?

The naive answer — stream everything into BigQuery using the Streaming Insert API — costs money at scale. The smarter answer is what I built: stream to GCS first, batch-load to BigQuery for free, and join the two lanes in dbt.

Architecture Overview

Before diving in, here's the full picture:

STREAMING LANE
Binance WebSocket → Python Producer → Redpanda → PyFlink → GCS → BigQuery
                    (real-time ticks)  (Kafka)  (windows)  (JSONL) (free load)

BATCH LANE  
CoinGecko API → Airflow DAG (7 tasks) → GCS → BigQuery
                (daily @ 06:00 UTC)    (Parquet)  (free load)

TRANSFORMATION LAYER
BigQuery → dbt Cloud → crypto_staging.* → crypto_mart.*
           (daily @ 07:00 UTC)

VISUALIZATION
BigQuery → Grafana Cloud → 6 panels, auto-refresh 30s
           (BigQuery plugin, free)

Everything runs locally in Docker except the data warehouse (BigQuery) and transformation/visualization layers (dbt Cloud + Grafana Cloud). This is the "local-to-cloud" pattern — keep compute on your machine, use GCP only for storage and analytics.

The Streaming Lane: Binance → Redpanda → PyFlink → GCS

Why Binance Instead of CoinCap?

I originally planned to use CoinCap's WebSocket feed. Halfway through the build, I discovered CoinCap now requires an API key even for the free tier. Binance's public market-data WebSocket, however, needs no account and no API key. Binance does cap connections and streams per IP, but a single connection carrying five trade streams sits far below those limits.

The URL pattern is:

wss://stream.binance.com:9443/stream?streams=btcusdt@trade/ethusdt@trade/solusdt@trade/bnbusdt@trade/adausdt@trade

Each message looks like:

{
  "stream": "btcusdt@trade",
  "data": {
    "s": "BTCUSDT",
    "p": "76034.12",
    "q": "0.00150000",
    "T": 1776875894886
  }
}

The Python Producer

The producer is a Docker container running websockets + confluent-kafka. It connects to Binance, maps Binance pair names to readable coin names (BTCUSDT → bitcoin), and publishes JSON messages to a Redpanda topic:

message = {
    "symbol": "bitcoin",
    "price_usd": 76034.12,
    "event_timestamp": "2026-04-23T18:31:34+00:00"
}
producer.produce(topic="crypto-prices", key="bitcoin", value=json.dumps(message))
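
The mapping step between the two payloads above can be sketched roughly as follows. This is a minimal illustration, not the project's actual producer: the function name to_producer_message and the PAIR_TO_COIN table are hypothetical, the coin names other than bitcoin are assumed, and only the fields visible in the sample messages are used.

```python
import json
from datetime import datetime, timezone

# Hypothetical lookup table: Binance pair name -> readable coin name.
# Only "bitcoin" appears in the article; the other names are assumptions.
PAIR_TO_COIN = {
    "BTCUSDT": "bitcoin",
    "ETHUSDT": "ethereum",
    "SOLUSDT": "solana",
    "BNBUSDT": "binancecoin",
    "ADAUSDT": "cardano",
}


def to_producer_message(raw: str) -> dict:
    """Convert one combined-stream trade payload into the producer's message shape."""
    data = json.loads(raw)["data"]
    # "T" is the trade time in milliseconds since the Unix epoch.
    event_time = datetime.fromtimestamp(data["T"] / 1000, tz=timezone.utc)
    return {
        "symbol": PAIR_TO_COIN[data["s"]],
        "price_usd": float(data["p"]),  # Binance sends prices as strings
        "event_timestamp": event_time.isoformat(timespec="seconds"),
    }
```

The returned dict can then be passed through json.dumps and handed to producer.produce, with the symbol doubling as the message key so that all ticks for one coin land in the same partition.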

Why Redpanda Instead of Kafka?

Kafka traditionally required ZooKeeper (recent releases can run without it in KRaft mode, but many Docker examples still ship both). In a local Docker setup, that's up to three containers just for the message broker (ZooKeeper, Kafka broker, Schema Registry if you want it). Redpanda is Kafka-API compatible, meaning PyFlink's Kafka connector works with it unchanged, but it runs as a single binary with no ZooKeeper. It starts in under 3 seconds and uses ~512MB RAM instead of Kafka's 2-3GB.

The PyFlink Job

The Flink job does the heavy lifting. It consumes from Redpanda, applies a 1-minute tumbling event-time window per coin symbol, and computes:

avg_price — average price within the window
min_price / max_price — price range
price_stddev — standard deviation (volatility proxy)
open_price / close_price — first and last tick
record_count — number of ticks received

Event-time windows (as opposed to processing-time) are important here. They use the timestamp embedded in the Binance message, so late-arriving messages (within a 10-second watermark) land in the correct window rather than the current one.
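
To make the window logic concrete, here is a plain-Python sketch of what a 1-minute tumbling event-time aggregation computes. It is not the PyFlink job itself: the function names are hypothetical, watermarks are not modelled, and the use of population standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW_MS = 60_000  # 1-minute tumbling windows


def window_start(event_ms: int) -> int:
    """Start (in epoch ms) of the tumbling window that owns this event time."""
    return event_ms - (event_ms % WINDOW_MS)


def aggregate(ticks):
    """Group (symbol, event_ms, price) ticks by symbol and window, then summarise."""
    buckets = defaultdict(list)
    for symbol, event_ms, price in ticks:
        buckets[(symbol, window_start(event_ms))].append((event_ms, price))

    results = {}
    for key, events in buckets.items():
        events.sort()  # order by event time so open/close are first/last tick
        prices = [price for _, price in events]
        results[key] = {
            "avg_price": mean(prices),
            "min_price": min(prices),
            "max_price": max(prices),
            "price_stddev": pstdev(prices),  # population stddev (assumption)
            "open_price": prices[0],
            "close_price": prices[-1],
            "record_count": len(prices),
        }
    return results
```

Because each tick is bucketed by its own timestamp rather than by arrival order, a tick that shows up late still lands in the window it belongs to, which is the property event-time semantics provide.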

The output of each completed window is a JSONL file written to GCS:

gs://coinpulse-data-lake/streaming/2026/04/23/18/stream_bitcoin_20260423_183134.jsonl
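
A path with that shape could be built along these lines. This is only a sketch: the helper name window_object_uri is hypothetical, and which timestamp the real job puts in the file name (window start, window end, or write time) is not stated, so the function simply takes one as an argument.

```python
from datetime import datetime, timezone

BUCKET = "coinpulse-data-lake"  # bucket name taken from the example path above


def window_object_uri(symbol: str, ts: datetime) -> str:
    """Build an hour-partitioned GCS URI for one window's JSONL file."""
    ts = ts.astimezone(timezone.utc)  # expects a timezone-aware datetime
    return (
        f"gs://{BUCKET}/streaming/{ts:%Y/%m/%d/%H}/"
        f"stream_{symbol}_{ts:%Y%m%d_%H%M%S}.jsonl"
    )
```

Keeping the year/month/day/hour segments in the prefix means a load job can target a single hour of files with one wildcard URI.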

A scheduled BigQuery Load Job (free) then loads these files into crypto_raw.stream_prices, partitioned by HOUR on event_timestamp and clustered by symbol.

The key insight here: I never use BigQuery's Streaming Insert API, which charges ~$0.01 per 200MB. Load Jobs are completely free. The data is 1 minute old in the warehouse instead of 0 seconds old — but for a dashboard refreshing every 30 seconds, nobody notices.

The Batch Lane: CoinGecko → Airflow → GCS → BigQuery

The Airflow DAG

The batch pipeline runs as a 7-task Airflow DAG at 06:00 UTC daily with two parallel branches:

fetch_coingecko → transform_to_parquet → upload_to_gcs → load_to_bigquery
fetch_ohlc      → upload_ohlc_to_gcs   → load_ohlc_to_bigquery

Markets branch hits CoinGecko's /coins/markets endpoint for all 5 coins in a single call, getting current price, market cap, 24h volume, price change percentage, rank, and fully diluted valuation.

OHLC branch hits /coins/{id}/ohlc?days=1 for each coin, getting candlestick arrays [timestamp, open, high, low, close].

Both branches serialize data to typed Parquet (using explicit schema definitions, not autodetect — more on why this matters below) and upload to GCS before loading to BigQuery.

The Schema War (And How I Won It)

This took me longer than I'd like to admit. The root cause: PyArrow (used by Pandas .to_parquet()) infers Python datetime objects as TIMESTAMP in Parquet metadata. But when BigQuery's autodetect reads a datetime.isoformat() string, it creates the column as STRING. When the same column arrives as a proper datetime object in the next run, BigQuery rejects the load because TIMESTAMP ≠ STRING.

The permanent fix: define explicit schemas in the BigQuery Load Job config and force all string columns to str dtype in Pandas before writing Parquet:

df["snapshot_date"] = df["snapshot_date"].astype(str)
df["ingested_at"]   = df["ingested_at"].astype(str)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=MARKETS_SCHEMA,  # explicit, no autodetect
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

Once I stopped fighting autodetect and defined schemas explicitly, the pipeline ran clean for weeks.
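
The same idea can be applied one step earlier, before the rows ever reach Pandas. The following standard-library sketch (the helper name normalise_temporal is hypothetical, not part of the project) renders every temporal column as an ISO-8601 string so that each run hands the same Python type to the Parquet writer:

```python
from datetime import date, datetime, timezone


def normalise_temporal(row: dict, columns: tuple) -> dict:
    """Return a copy of row with the named columns rendered as ISO-8601 strings."""
    out = dict(row)
    for col in columns:
        value = out[col]
        if isinstance(value, (datetime, date)):
            out[col] = value.isoformat()
        else:
            out[col] = str(value)  # already a string in some runs
    return out
```

Whether the warehouse column is then declared as STRING or parsed into TIMESTAMP is a separate choice made in the explicit schema; the point is only that the input type no longer varies between runs.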

The Transformation Layer: dbt Cloud

Why dbt Cloud Over dbt Core

I already had Redpanda, Flink, Airflow, and PostgreSQL running locally — approximately 13GB RAM. Adding a dbt Core Docker container would push that higher for no functional gain. dbt Cloud's free developer plan gives a managed scheduler, visual lineage graph, and test runner at zero cost and zero local RAM.

The Model Structure

Staging layer (views, zero storage cost):

stg_stream_prices — cleans and type-casts streaming data, filters nulls and zero prices
stg_coingecko_markets — adds market_cap_category (large/mid/small/micro cap) derived column
stg_ohlc_candles — computes candle_range, candle_change_pct, and candle_direction (bullish/bearish)
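
As an illustration of the derived candle columns, here is a Python sketch of the arithmetic. The real logic lives in dbt SQL that is not shown here, so the exact formulas, the function name describe_candle, and the choice to treat a flat candle as bullish are all assumptions.

```python
def describe_candle(candle):
    """Derive staging-style metrics from one [timestamp, open, high, low, close] array."""
    timestamp_ms, open_, high, low, close = candle
    return {
        "timestamp_ms": timestamp_ms,
        "candle_range": high - low,
        "candle_change_pct": (close - open_) / open_ * 100,
        # Assumption: a candle that closes exactly at its open counts as bullish.
        "candle_direction": "bullish" if close >= open_ else "bearish",
    }
```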

Mart layer (incremental tables, insert_overwrite):

mart_crypto_prices — joins streaming aggregations with latest daily market metadata
mart_volatility — computes hourly composite_volatility_score combining streaming stddev with OHLC candle metrics

The mart models use insert_overwrite with hour-level partitioning and an incremental filter that scans only the last 2 hours of streaming data per run. This means dbt never full-scans the streaming table — it touches only the partitions it needs to update.
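
To see which partitions such a run touches, consider this small sketch. It assumes the filter is a simple "event time within the last N hours" condition, and the function name partitions_to_refresh is hypothetical. Note that a 2-hour lookback can span up to three hourly partitions, because the current hour is only partly filled.

```python
from datetime import datetime, timedelta, timezone


def partitions_to_refresh(now: datetime, lookback_hours: int = 2):
    """Hour partitions an incremental run would overwrite, oldest first."""
    current = now.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return [current - timedelta(hours=h) for h in range(lookback_hours, -1, -1)]
```

For a run at 07:05 UTC this yields the 05:00, 06:00, and 07:00 partitions; every older partition is left untouched and is never scanned.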

The dbt Deploy Job

A scheduled dbt Cloud production environment job runs daily at 07:00 UTC — one hour after Airflow — ensuring mart tables always have fresh data before the dashboard refreshes.

After some initial failures (a common dbt Fusion syntax issue with accepted_values requiring arguments: nesting), the deploy job has run cleanly every day since April 19, 2026. Each run completes in under 45 seconds.

The Dashboard: Grafana Cloud

Why Grafana Over Looker Studio

Looker Studio is the natural choice for BigQuery — it has a native connector and no setup friction. But it has one fatal flaw for streaming analytics: it doesn't auto-refresh.

Grafana Cloud's free tier includes the BigQuery plugin (one-click install, no credit card), supports 30-second auto-refresh, has proper time series panels with logarithmic Y-axis (essential when plotting BTC at $76K alongside ADA at $0.25 on the same chart), and produces significantly more polished visualizations.

My prior projects (Sovereign Debt Observatory, Tech Ecosystem Observatory) both used Looker Studio, so using Grafana here also demonstrates range across tooling.

The 6 Dashboard Panels

Panel 1 — Live Crypto Prices (Streaming)
A treemap showing the latest price for each coin from crypto_raw.stream_prices, colour-coded green/red by price direction. BTC at $76K, ETH at $2.3K, SOL at $86, BNB at $630, ADA at $0.249.

Panel 2 — Price Trend Over Time (Streaming)
A time series chart with logarithmic Y-axis showing avg_price per 1-minute window for all coins. The log scale handles the 300,000x price spread between BTC and ADA.

Panel 3 — Hourly Volatility Score (Streaming)
A gradient bar chart (green → red) showing the composite_volatility_score per coin from mart_volatility. Cardano leads at 0.566 (most volatile relative to its price), Binance Coin lowest at 0.264.

Panel 4 — Market Cap Rankings (Batch)
BTC dominates at $1.57T, ETH at $289B. The scale difference is immediately visible and gives context to the streaming data.

Panel 5 — Daily OHLC Candles (Batch)
A table panel showing raw candle data from coingecko_ohlc_candles with candle_direction colour-coded — bearish rows highlighted in red.

Panel 6 — 24h Price Change % (Batch)
ADA leads at +0.563%, ETH is the only negative at -0.132%.

Public URL: https://derrickryangiggs.grafana.net/public-dashboards/81560968e15140f08f65b52d78a4b252

Infrastructure: Terraform for Everything

All GCP resources are defined in Terraform — nothing clicked in the console. Running terraform apply provisions:

GCP project with billing account linked
Required APIs (BigQuery, BigQuery Storage, Cloud Storage, Resource Manager)
Service account with least-privilege IAM roles (BQ Data Editor, BQ Job User, GCS Object Admin)
GCS bucket with 90-day lifecycle rule
BigQuery datasets and tables with explicit schemas, partitioning, and clustering

Nothing is hardcoded. All values flow from .env via TF_VAR_ prefixed variables. This is non-negotiable — hardcoded project IDs in .tf files are a code review failure waiting to happen.

The Cost Breakdown (For Real)

Service	Details	Monthly Cost
Redpanda	Local Docker	$0.00
PyFlink	Local Docker	$0.00
Apache Airflow	Local Docker	$0.00
Google Cloud Storage	~200MB of Parquet + JSONL	~$0.01
BigQuery	<1GB data, Load Jobs only	~$0.00
dbt Cloud	Developer plan	$0.00
Grafana Cloud	Free tier	$0.00
Total		~$0.01/month

The $0.01 is literally the GCS storage bill. Everything else is free.
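
As a rough sanity check on that figure, the arithmetic looks like this. The per-GB price is an assumption (a typical Standard-class rate for a single US region; check current pricing for your location), and the 200MB volume comes from the table above.

```python
GCS_USD_PER_GB_MONTH = 0.020  # assumed Standard-class price; varies by region
STORED_MB = 200               # figure from the cost table above


def monthly_storage_cost(stored_mb: float, usd_per_gb_month: float) -> float:
    """Storage cost in USD for one month, using 1 GB = 1024 MB."""
    return stored_mb / 1024 * usd_per_gb_month


cost = monthly_storage_cost(STORED_MB, GCS_USD_PER_GB_MONTH)
print(f"${cost:.4f} per month")
```

Under these assumptions the result is about $0.0039, which is under one cent and consistent with the ~$0.01 line in the table.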

The three decisions that make this possible:

Use GCS + Load Jobs instead of BigQuery Streaming Inserts. Streaming Inserts are billed by the volume of data inserted (the ~$0.01 per 200MB noted earlier). Load Jobs are free. The trade-off is 1-minute data freshness instead of real-time, which is acceptable for a dashboard refreshing every 30 seconds.
Run compute locally. Cloud Composer (managed Airflow) starts at $300/month. Dataproc, which can host Flink as an optional component, is similarly expensive. Running on your own machine costs nothing except electricity.
Use free tiers everywhere else. dbt Cloud developer plan, Grafana Cloud free tier, CoinGecko Demo API free tier, Binance WebSocket (no auth required). Every external service in this pipeline has a genuinely free tier sufficient for this workload.

Hard-Won Lessons

1. Flink's WindowFunction.apply() signature changed in 2.2.0
In PyFlink 1.18, apply(self, key, window, inputs, collector) uses a collector argument. In 2.2.0, the signature is apply(self, key, window, inputs) and you use yield instead. This isn't documented prominently. I lost an afternoon to this.

2. Read env vars inside functions, not at module import time
My Airflow DAG originally read GCS_BUCKET_NAME = os.environ["GCS_BUCKET_NAME"] at the module level. When Airflow's scheduler reloads the module after a container restart, GCS_BUCKET_NAME is empty — the env var isn't available at import time in the scheduler process. Moving all os.environ reads inside the task functions fixed this permanently.

3. Never use autodetect for BigQuery Load Jobs if you care about schema stability
autodetect=True infers schema from the first file it sees. If that file has a datetime object, BQ creates a TIMESTAMP column. If the next file has an .isoformat() string, BQ rejects it as a STRING / TIMESTAMP mismatch. Define explicit schemas in the LoadJobConfig and cast DataFrame columns explicitly before writing Parquet.

4. The Flink TaskManager won't connect if jobmanager.rpc.address in config.yaml doesn't match your Docker container name
The baked-in config.yaml in the PyFlink image has jobmanager.rpc.address: jobmanager. My container is named coinpulse-flink-jobmanager. The FLINK_PROPERTIES env var doesn't always override config.yaml. The fix: add a sed command in the Dockerfile to rewrite the address in config.yaml directly.

5. Redpanda's rpk topic create command arguments must be on one line
Multi-line bash in Docker Compose entrypoint with > YAML block scalar doesn't handle line continuations the way you'd expect. The --brokers, --partitions, and --replicas flags ended up being interpreted as separate commands. Fix: use ["bash", "-c", "single long command string"] format instead.
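
Lesson 2 is the easiest of these to show in code. A minimal sketch of the pattern, in which the function body and the batch/ prefix are hypothetical and only the GCS_BUCKET_NAME variable comes from the project:

```python
import os


def upload_to_gcs():
    """Task callable: resolve configuration when the task runs, not at import."""
    # Reading here means the value is looked up in the worker's environment at
    # execution time instead of being frozen when the DAG file is parsed.
    bucket = os.environ["GCS_BUCKET_NAME"]
    return f"gs://{bucket}/batch/"
```

A module-level GCS_BUCKET_NAME = os.environ[...] assignment, by contrast, is evaluated once per parse of the DAG file, in whichever process happens to be parsing it.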

What I'd Add Next (The "Circuit Breaker" Enhancements)

After sharing the project, I got feedback suggesting three production-grade improvements. Here's my assessment:

1. dbt-expectations data contracts
Add guardrails so the pipeline fails loudly if bad data arrives — negative prices, null coin symbols, impossible price spikes. In dbt, this is two lines per test in your schema YAML. The dbt-expectations package brings expect_column_values_to_be_between and similar assertions. Highly recommended, zero cost.

2. BigQuery ML ARIMA_PLUS price forecasting
Train a time series model directly in BigQuery SQL to forecast the next 7 days of prices per coin. The first 10GB of CREATE MODEL data processed monthly is free — both projects' datasets are well under that threshold, making this effectively free. It turns the dashboard from "here's what happened" into "here's what might happen."

3. GitHub Actions CI/CD for Terraform
A terraform plan on every PR to the repo, ensuring infrastructure changes are reviewed before applying. Standard practice in production environments, takes about 20 lines of GitHub Actions YAML to implement.

The Full Stack, Summarised

Layer	Tool	Why
IaC	Terraform	Reproducible, version-controlled infrastructure
Message Broker	Redpanda	Kafka-compatible, no Zookeeper, runs in one container
Stream Processing	PyFlink 2.2.0	Stateful windowed aggregations, event-time semantics
Orchestration	Apache Airflow	Full end-to-end DAG, retry logic, monitoring
Object Storage	GCS	Staging layer between compute and warehouse
Data Warehouse	BigQuery	Serverless, free Load Jobs, columnar, partitioned
Transformations	dbt Cloud	Incremental models, lineage graph, scheduled runs
Visualization	Grafana Cloud	Auto-refresh, BigQuery plugin free on all tiers

Try It Yourself

The full project is open source:

GitHub: https://github.com/Derrick-Ryan-Giggs/coinpulse

Live Dashboard: https://derrickryangiggs.grafana.net/public-dashboards/81560968e15140f08f65b52d78a4b252

The README has step-by-step reproduction instructions — clone, fill in .env, run terraform apply, start the Docker stacks, and you have a running pipeline within about 20 minutes. All you need is a GCP account (free tier works), a CoinGecko Demo API key (free, no card), dbt Cloud account (free developer plan), and Grafana Cloud account (free forever).

What This Project Taught Me

Building this reinforced something I suspected but didn't fully appreciate: the hardest problems in data engineering aren't the algorithms — they're the plumbing.

Getting PyFlink to connect to a Redpanda broker with the right container hostname was a day of debugging. Getting the Airflow DAG to read env vars correctly on scheduled runs (not just manual triggers) was another. Getting BigQuery to accept the same schema consistently across runs required understanding how PyArrow serialises Python types to Parquet column metadata.

None of these problems are glamorous. But solving them is what separates a "pipeline that worked once on my laptop" from a "pipeline that runs reliably every day."

That's what CoinPulse is — a pipeline that runs reliably every day, for $0.01/month.

If you found this useful or have questions about any part of the implementation, feel free to reach out. I write regularly about data engineering, cloud infrastructure, and GCP on Medium, Dev.to, and Hashnode.

Oracle GoldenGate 23ai: Powering Distributed AI with Real-Time Data Replication

Ryan Giggs — Sat, 18 Apr 2026 11:10:46 +0000

The database world has always needed a reliable bridge between operational systems and analytical or AI workloads. For decades, Oracle GoldenGate has served that role — silently moving terabytes of transactional changes across databases, clouds, and continents with minimal latency and near-zero impact on source systems. With the release of GoldenGate 23ai in 2024, that bridge now carries a new payload: AI vector embeddings.

This post covers GoldenGate's core use cases, the Microservices Architecture that underpins modern deployments, and the powerful new distributed AI capabilities that make GoldenGate 23ai a critical component of enterprise GenAI pipelines.

What Is Oracle GoldenGate?

At its heart, GoldenGate is a Change Data Capture (CDC) and real-time replication engine. Rather than querying the source database directly — which would impose load — GoldenGate reads the database's redo/transaction logs, captures only the changed deltas (inserts, updates, deletes), and delivers them to one or more target systems with sub-second latency.

This log-based approach makes GoldenGate extremely low-impact on production systems while supporting highly demanding replication topologies across heterogeneous environments: Oracle to PostgreSQL, Oracle to Kafka, MySQL to BigQuery, and many more.
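
Conceptually, the target side of change data capture replays an ordered stream of deltas. The sketch below illustrates that idea in plain Python; the record shape (op, key, row) and the function name apply_changes are hypothetical and are not GoldenGate's trail format or API.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Replay an ordered list of change records onto a keyed target table."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op == "insert":
            target[key] = dict(change["row"])
        elif op == "update":
            target[key].update(change["row"])  # only changed columns are shipped
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target
```

Only the deltas cross the wire, so the cost of keeping a target in sync scales with the rate of change rather than with the size of the source tables.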

Common GoldenGate Use Cases

GoldenGate's flexibility makes it applicable across a broad range of enterprise scenarios:

1. Multi-Active, High Availability, and Cross-Region Deployments

GoldenGate enables active-active (bidirectional) replication, where multiple database instances in different regions accept both reads and writes simultaneously. Built-in conflict detection and resolution ensures data consistency across sites. This architecture powers mission-critical applications that require zero planned downtime and regional failover capabilities — with GoldenGate's ExaPortMon-style awareness ensuring continuous replication even under network disruptions.

2. Data Offloading and Data Hub

Rather than running analytical queries against production OLTP databases, GoldenGate streams only the changed data to a dedicated analytics database, data warehouse, or data lake in real time. This eliminates the traditional "nightly CSV batch" pattern and gives analysts access to data that is seconds, not hours, behind production.

3. Migrations and Upgrades

GoldenGate is the gold standard for zero-downtime database migrations — moving from on-premises to cloud, upgrading Oracle Database versions, or switching from Oracle to PostgreSQL. The approach: synchronize old and new systems in parallel using GoldenGate, then cut over the application connection once replication lag reaches zero. Even migrations of hundreds of terabytes are reduced to a switchover window measured in minutes.

4. Analytics Data Feeds

GoldenGate feeds real-time change streams into analytics platforms — data warehouses, Apache Kafka topics, Oracle Streaming Service, Confluent Cloud, and cloud object stores. This underpins event-driven architectures and real-time dashboards that would be impossible with batch-based ETL.

5. Stream Analytics

With GoldenGate Stream Analytics, change events flowing through GoldenGate can be enriched, filtered, and analyzed in motion — before they land in a target system. Integrations with Oracle Machine Learning (OML) and ONNX (Open Neural Network Exchange) enable actionable AI/ML directly from streaming pipelines, turning raw CDC events into predictive signals.

GoldenGate Microservices Architecture

When Oracle introduced the Microservices Architecture (MA) in GoldenGate 12.3 (August 2017), it represented a fundamental modernization of how GoldenGate is deployed, managed, and integrated. The classic command-line-driven architecture (GGSCI) was complemented by a fully API-driven, web-based platform.

Five Core Components

The Microservices Architecture is built around five independently manageable services:

Service Manager — acts as a watchdog, managing and monitoring all other services within a deployment
Administration Server — the central control entity for managing Extract, Replicat, and all replication processes
Distribution Server — a high-performance data distribution agent that routes trail files to one or more targets over HTTPS/WebSockets
Receiver Server — handles all incoming trail files from remote Distribution Servers
Performance Metrics Server — collects and exposes real-time performance data for all GoldenGate processes via REST API or JSON

Key Benefits of Microservices Architecture

REST API-driven management: Every GoldenGate operation — starting a replicat, checking lag, configuring a pipeline — is exposed via a standardized REST API, enabling automation with Python, Terraform, and CI/CD pipelines.
Renewed web UI in 23ai: GoldenGate 23ai ships with a completely redesigned graphical interface, with more logical, guided assistant steps for setting up integrations — a marked improvement over the previous look and feel.
TLS encryption and OAuth 2.0 authentication out of the box, with integration into external identity providers.
Simpler patching and upgrades: Each microservice can be updated with minimal disruption. Upgrading from GoldenGate 21c to 23ai, for example, requires only downloading the new software and updating the OGG_HOME directory parameters.
Deployment flexibility: Runs on-premises, as a fully managed service in OCI (OCI GoldenGate), or on third-party clouds including AWS, Azure, and GCP.

GoldenGate Version Upgrade Path

For teams still running older GoldenGate versions, the supported upgrade sequence through the modern microservices era is:

12.1 → 12.2 → 12.3 (Microservices introduced) → 18c → 19c → 21c → 23ai → 26ai (next LTS)

Note: Classic Architecture support for Oracle Database is deprecated in 23ai. All new deployments should use Microservices Architecture. GoldenGate 26ai has been announced as the next Long-Term Support (LTS) release, with further AI and Data Mesh enhancements planned.

Distributed AI Processing with Vector Replication

This is where GoldenGate 23ai breaks genuinely new ground. The release adds first-class support for the Oracle Database 23ai VECTOR data type — the foundational building block of AI Vector Search and RAG (Retrieval-Augmented Generation) pipelines.

What Is a Vector Embedding?

When AI models process documents, images, or other unstructured data, they convert that content into vector embeddings — high-dimensional numerical representations that encode semantic meaning. These vectors are stored in vector databases and searched using similarity algorithms (cosine distance, dot product, Euclidean distance) to retrieve contextually relevant content.
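
For readers new to these measures, the three similarity functions named above can be written in a few lines of plain Python. This is a teaching sketch on small lists; production vector stores use optimized, indexed implementations.

```python
import math


def dot(a, b):
    """Dot product: larger means more aligned (and/or longer) vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine distance ignores vector length and compares direction only, which is why it is a common default for text embeddings.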

Keeping vector stores synchronized with the operational databases that hold the source data is a significant engineering challenge — and exactly what GoldenGate 23ai solves.

Key Vector Replication Capabilities

Migrate Vectors into Oracle Vector Database Without Downtime

GoldenGate can migrate vector embeddings stored in one database into Oracle Database 23ai (or Oracle AI Database 26ai) without any service interruption. This enables seamless transitions from on-premises environments to OCI, or between cloud providers, while keeping AI-powered applications running continuously.

Replicate and Consolidate Vector Changes in Real Time

As source data changes — a customer record is updated, a product description is revised — GoldenGate captures those changes and propagates updated vectors to all target vector stores in real time. This ensures that RAG pipelines are always querying fresh, consistent embeddings rather than stale snapshots.

Heterogeneous Vector Replication

GoldenGate 23ai can replicate vectors both homogeneously (Oracle to Oracle) and heterogeneously across different vector stores, provided the embedding algorithm, type, and dimension are consistent across source and target. If the source and target use different embedding algorithms, GoldenGate can replicate the raw source data instead, allowing the target system to re-vectorize it with its own model.

Support extends beyond Oracle: GoldenGate 23ai also supports capture and delivery of the pgvector extension for PostgreSQL and its derivatives, as well as the VECTOR datatype for MySQL 9.0+.

Multi-Cloud, Multi-Active Oracle Vector Database

GoldenGate enables active-active vector database architectures spanning multiple cloud regions or providers. AI applications in different regions can write to and read from local vector stores while GoldenGate keeps all instances synchronized — enabling both low-latency AI inference and high availability for vector workloads.

Stream Changes to Search Engines

GoldenGate can stream vector and document changes in real time to Elasticsearch or OpenSearch compatible search indexes, enabling hybrid semantic + keyword search architectures without manual synchronization pipelines.

GoldenGate Data Streams: Pub/Sub for Database Events

A major new capability in GoldenGate 23ai is the Data Streams service, which provides simplified, event-driven access to database change records via AsyncAPI over WebSocket connections. This enables developers to subscribe to database change events — including vector updates — in a pub/sub pattern, without needing to manage trail files or configure traditional Replicat processes. Combined with Oracle Database 23ai's JSON Relational Duality views, this means document-side changes can be captured and propagated as structured events in real time.

Generative AI with Your Own Business Data

GoldenGate 23ai directly enables three patterns for building AI applications on private enterprise data:

1. From-Scratch LLMs (Custom Foundation Models)

Organizations with truly unique proprietary datasets — medical records, legal case files, financial transaction histories — may opt to train custom Large Language Models from the ground up on that data. GoldenGate ensures the training datasets fed into these models are continuously fresh and complete, sourced in real time from operational databases rather than periodic exports.

This is an advanced, resource-intensive approach suited to well-resourced organizations with data assets that are sufficiently distinctive to justify the investment.

2. Fine-Tuned LLMs (Domain Adaptation)

A more practical approach for most enterprises: take a pre-trained foundational LLM (such as Llama, Mistral, or an OCI Generative AI model) and fine-tune it on a private domain dataset. GoldenGate replicates the relevant operational data into a training pipeline, keeping the fine-tuned model aligned with current business reality as data evolves over time.

3. RAG (Retrieval-Augmented Generation) — The Most Common Pattern

RAG is currently the dominant architecture for enterprise GenAI. Instead of baking private knowledge into model weights (which requires expensive retraining), RAG retrieves relevant documents at query time and injects them into the LLM's prompt as context.

GoldenGate 23ai is purpose-built to power better RAG pipelines:

Keeps vector stores current: As source data changes, GoldenGate propagates those changes to the vector index within seconds — not hours.
Enables a consolidated vector hub: Rather than maintaining separate, potentially inconsistent vector stores for each application, GoldenGate can consolidate vectors from multiple source databases into a single Oracle AI Database vector hub.
Reduces hallucination risk: A stale vector index feeds the LLM outdated context, so answers can be confidently wrong about the current state of the business. Fresher data means more accurate, grounded responses.
Supports heterogeneous sources: GoldenGate can pull data from Oracle, PostgreSQL, MySQL, SQL Server, DB2, and more — vectorize it, and deliver it to Oracle AI Database, all in a unified pipeline.
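
The retrieve-then-inject step that defines RAG can be sketched in a few lines. This is a toy illustration with an in-memory list standing in for the vector store; the function names, the document shape, and the prompt template are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def build_prompt(question, query_vector, store, k=2):
    """Pick the k most similar documents and prepend them to the question."""
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    context = "\n".join(doc["text"] for doc in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Everything the model sees comes from whatever the store holds at query time, which is why the freshness of the replicated vectors matters so much.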

Why GoldenGate 23ai for AI: The Summary

Capability	Impact
Real-time vector replication	RAG pipelines query fresh embeddings, not stale snapshots
Zero-downtime vector migration	Move AI workloads to cloud without service interruption
Multi-cloud active-active vectors	Low-latency AI inference globally with high availability
Heterogeneous source support	Unify vectors from Oracle, PostgreSQL, MySQL, and more
Data Streams (AsyncAPI)	Event-driven AI pipelines without complex trail file management
Stream to search engines	Elasticsearch/OpenSearch integration for hybrid search
Fine-tuning data pipelines	Keep domain-adapted LLMs current with live business data

The underlying thesis is the same one Oracle argues for Exadata AI Storage: bring AI to the data, not data to the AI. GoldenGate 23ai extends this philosophy to the replication layer — ensuring that wherever your AI application runs, it has access to consistent, real-time data from your authoritative operational systems.

As enterprise AI moves from experimentation to production, the quality of data pipelines will increasingly determine the quality of AI outcomes. GoldenGate 23ai is Oracle's answer to that challenge.

Are you using GoldenGate for AI pipelines or RAG architectures? What challenges have you run into keeping vector stores synchronized? Share in the comments below.

Building the Sovereign Debt Observatory: An End-to-End ELT Pipeline on World Bank Debt Data for Low and Middle-Income Countries

Ryan Giggs — Thu, 09 Apr 2026 10:02:52 +0000

Introduction

Global sovereign debt is one of the most consequential datasets in existence. It shapes foreign policy, determines credit ratings, drives IMF bailout decisions, and affects the daily lives of billions of people in developing countries. The World Bank publishes this data openly — 130+ countries, 27 years of history, updated quarterly — yet there is no ready-made analytical layer on top of it.

If you want to answer a question like "which African countries have the highest ratio of private nonguaranteed debt to total external debt, and how has that changed since 2010?", you have to manually download Excel files from multiple World Bank portals, clean inconsistent column names, handle missing values, and stitch everything together in a spreadsheet. Every time the data updates, you do it again.

That is the problem this project solves.

The Sovereign Debt Observatory is an end-to-end ELT pipeline that ingests World Bank external debt data, lands it in a cloud data lake, transforms it in BigQuery using dbt Cloud, orchestrates everything quarterly with Apache Airflow, and surfaces the answers in a Looker Studio dashboard.

This article walks through every layer of the pipeline — the architecture decisions, the technical challenges, and how I solved them.

The Five Questions This Pipeline Answers

Before writing a single line of code, I defined the analytical questions the pipeline needed to answer. This kept every decision grounded in purpose rather than technology for its own sake.

How is gross external debt distributed across public, publicly guaranteed, private nonguaranteed, and multilateral sectors per country?
Which countries carry the highest short-term external debt exposure and how has that changed since 2010?
What share of external debt is foreign-currency denominated and where is that ratio worsening?
How has regional external debt stock evolved from 1998 to 2025 across Africa, Latin America, East Asia, South Asia, Europe and Central Asia, and the Middle East?
Which countries face the heaviest debt service pressure relative to their total debt position?

Architecture Overview

The pipeline follows a modern ELT pattern — extract and load first, transform inside the warehouse.

World Bank API (IDS source 2)
        |
        v
PySpark job (Docker container)
        |
        v
Google Cloud Storage — raw Parquet, partitioned by extracted_date
        |
        v
BigQuery external tables (raw dataset)
        |
        v
dbt Cloud — staging views + mart tables
        |
        v
Looker Studio dashboard
        |
Orchestrated by Apache Airflow on Docker Compose
Infrastructure provisioned by Terraform

Why ELT and not ETL?

In ETL, transformation happens before loading — your Spark job does the cleaning, aggregation, and business logic before writing to the warehouse. In ELT, raw data lands untransformed and the warehouse does all the heavy lifting.

For this project, ELT is the right choice for three reasons. First, raw data lands in GCS untouched, so if analytical requirements change I can write a new dbt model without re-running the ingestion layer, provided the relevant partition has not yet aged out under the bucket's 90-day lifecycle rule. Second, BigQuery is optimized for analytical SQL transformations at scale and is far better at this than PySpark running in a Docker container on a local machine. Third, dbt gives us version-controlled, tested, documented transformations that are readable by anyone with SQL knowledge.

PySpark's job in this pipeline is purely mechanical: hit the API, paginate, write Parquet. No business logic. No aggregations. Pure extract and load.

Data Sources

International Debt Statistics (IDS) — World Bank source 2

The IDS database is the flagship World Bank debt dataset. It covers external debt stocks and flows for low and middle income countries, with annual data going back to 1998. I access it through the wbgapi Python library, which wraps the World Bank Indicators API v2.

I ingest nine series:

Series Code	Description
DT.DOD.DECT.CD	Total external debt stocks
DT.DOD.DLXF.CD	Long-term external debt
DT.DOD.DPNG.CD	Private nonguaranteed debt
DT.DOD.MIBR.CD	PPG IBRD loans
DT.DOD.DPPG.CD	Public and publicly guaranteed debt
DT.DOD.DIMF.CD	IMF credit
DT.DOD.PVLX.CD	Present value of external debt
DT.DOD.MWBG.CD	IBRD loans and IDA credits
DT.DOD.MIDA.CD	PPG IDA loans

One important lesson from building this: the World Bank has multiple API source databases, and not all of them support the standard wbgapi query format. Sources 22 (QEDS SDDS), 23 (QEDS GDDS), and 54 (JEDH) all return JSON decode errors when queried programmatically — they use a separate DataBank backend. Only source 2 (IDS) reliably supports the Indicators API. This cost me several hours of debugging.

Quarterly External Debt Statistics SDDS (QEDS)

QEDS provides quarterly debt payment schedule data broken down by sector and maturity. Unlike IDS, QEDS does not support programmatic API access in the standard format. The World Bank provides bulk Excel downloads for each supplementary table instead.

I download five Excel files directly:

Table 1.5 — Net external debt position by sector
Table 3 — Debt service payment schedule by sector
Table 3.2 — Debt service by sector and instrument
Table 2.1 — Foreign currency and domestic currency debt
Table 1.6 — Reconciliation of positions and flows

Infrastructure as Code with Terraform

Every GCP resource in this project is provisioned by Terraform. The four core resources are a GCS bucket for the data lake and three BigQuery datasets: raw, staging, and mart.

resource "google_storage_bucket" "data_lake" {
  name          = var.gcs_bucket_name
  location      = var.region
  force_destroy = true

  lifecycle_rule {
    action { type = "Delete" }
    condition { age = 90 }
  }

  versioning { enabled = true }
}

resource "google_bigquery_dataset" "raw" {
  dataset_id                 = "raw"
  location                   = var.location
  delete_contents_on_destroy = true
}

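The 90-day lifecycle rule on GCS automatically deletes old partitions, keeping storage costs near zero. Because versioning is enabled on the bucket, the Delete action turns a live object into a noncurrent version rather than removing it outright, so a second rule targeting noncurrent versions is needed to actually reclaim that space. The entire GCP footprint for this project costs less than $0.05 per month. BigQuery's free tier covers 1 TB of queries per month and 10 GB of storage, which is far more than this dataset requires.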

A setup_gcp.sh script handles the one-time bootstrap — creating the GCP project, enabling APIs, creating the service account, and granting IAM roles. The billing account ID is passed as an environment variable so it never appears in version-controlled files.

BILLING_ACCOUNT=your-billing-id bash scripts/setup_gcp.sh

Ingestion Layer — PySpark on Docker

JEDH / IDS Ingestion

The IDS ingestion script uses wbgapi to fetch all nine series across all available countries from 1998 to 2025. The API returns data in wide format — one row per country per series, with year columns as separate fields. PySpark writes this to GCS as Snappy-compressed Parquet, partitioned by extraction date.

One critical issue I hit: the World Bank API returns year columns as bare integers (1998, 1999, etc.). BigQuery rejects column names that start with numbers. The fix was to prefix all year columns with year_ before writing:

combined.columns = [
    f"year_{col}" if str(col).isdigit() else col
    for col in combined.columns
]

This produces clean column names like year_1998, year_1999 that BigQuery accepts without complaint.

QEDS Ingestion

The QEDS ingestion downloads five Excel files from World Bank DataBank, reads all sheets from each file using pandas.read_excel, and concatenates them into a single DataFrame.

The Excel files have messy column names — spaces, brackets, special characters. A clean_column_name function normalizes everything:

import re

def clean_column_name(col):
    """Normalize a raw Excel header into a lowercase, BigQuery-safe name."""
    col = str(col)
    col = re.sub(r'\s*\[.*?\]', '', col)       # remove [YR2021Q4] suffixes
    col = col.strip()
    col = re.sub(r'[^a-zA-Z0-9_]', '_', col)   # replace spaces, brackets, symbols
    col = re.sub(r'_+', '_', col)              # collapse runs of underscores
    col = col.strip('_')
    if col and col[0].isdigit():
        col = 'q_' + col  # prefix quarter columns: q_2021q4
    return col.lower()

I also hit an OOM error trying to write the QEDS data through PySpark — 76,835 rows with 90+ columns across 214 sheets was too much for the JVM heap in the Docker container. The fix was to bypass Spark entirely for QEDS and write directly to GCS using the google-cloud-storage Python client:

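import io

from google.cloud import storage

# Serialize the DataFrame to Parquet in memory instead of going through Spark
buffer = io.BytesIO()
df.to_parquet(buffer, index=False, engine="pyarrow")
buffer.seek(0)

# Upload the in-memory file to the data lake bucket
client = storage.Client()
bucket = client.bucket(GCS_BUCKET)
blob = bucket.blob(output_path)
blob.upload_from_file(buffer, content_type="application/octet-stream")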

No JVM, no Spark executor, no OOM. The lesson: use the right tool for the job. PySpark is excellent for large distributed datasets. For a 76K-row DataFrame from an Excel file, plain pandas and the GCS Python client are simpler and more reliable.

Docker Image

The ingestion image is built on eclipse-temurin:17-jdk-jammy rather than a plain Python image. This is because PySpark requires Java, and the python:3.11-slim base image uses Debian Trixie which does not carry openjdk-17-jdk in its default repositories. The Temurin image ships Java 17 out of the box, which is exactly what Spark 3.5.1 needs.

The GCS connector JAR is downloaded at build time:

RUN curl --progress-bar -L \
    https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
    -o ${SPARK_HOME}/jars/gcs-connector-hadoop3-latest.jar

Credentials are never baked into the image. They are mounted at runtime as a volume:

docker run --rm \
  -v /path/to/key.json:/app/credentials/key.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials/key.json \
  sovereign-debt-ingestion:v1 python3 extract_jedh.py

Orchestration with Apache Airflow

Airflow runs on Docker Compose using the official apache/airflow:2.9.2 image. The stack includes a Celery executor with Redis as the message broker and Postgres as the metadata database.

The DAG runs quarterly — on the 1st of January, April, July, and October at 06:00 UTC:

schedule_interval="0 6 1 1,4,7,10 *"
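
To sanity-check that cron expression, here is a minimal sketch in plain Python that lists the run times it implies. The helper name quarterly_runs is hypothetical and not part of Airflow, and it hard-codes the month, day, and hour fields rather than parsing cron syntax.

```python
from datetime import datetime, timezone

# Fields taken from the cron expression "0 6 1 1,4,7,10 *"
RUN_MONTHS = (1, 4, 7, 10)   # January, April, July, October
RUN_DAY = 1                  # 1st of the month
RUN_HOUR = 6                 # 06:00 UTC


def quarterly_runs(year):
    """Return the four scheduled run times for a given year (hypothetical helper)."""
    return [
        datetime(year, month, RUN_DAY, RUN_HOUR, 0, tzinfo=timezone.utc)
        for month in RUN_MONTHS
    ]


for run in quarterly_runs(2026):
    print(run.isoformat())
```

One thing to keep in mind with cron schedules in Airflow 2: a run is triggered once its data interval has ended, so the run whose logical date is 1 January actually starts on 1 April.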

Task flow:

extract_load_jedh >> extract_load_qeds

Both tasks use the DockerOperator to spin up the ingestion container on the host machine, which means Airflow itself does not need PySpark, Java, or any data dependencies — it just tells Docker to run the job.

The Docker socket problem

The DockerOperator communicates with the host Docker daemon through the Unix socket at /var/run/docker.sock. This requires two things: the socket must be mounted into the Airflow worker container, and the worker must have permission to use it.

The socket mount goes in docker-compose.yml under the common volumes section:

volumes:
  - /var/run/docker.sock:/var/run/docker.sock

The permission fix requires adding the Docker group ID to the Airflow worker. On my machine the Docker group ID is 984:

group_add:
  - "984"

Without the group add, the worker can see the socket but gets PermissionError(13, 'Permission denied'). This is a common Airflow + Docker-in-Docker gotcha that took several debugging sessions to resolve.
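
The group ID differs between machines, so it is worth looking it up rather than copying 984. A minimal sketch, assuming a Linux host with the socket at the default path; socket_group_id is a hypothetical helper name, not something Airflow or Docker provides:

```python
import os


def socket_group_id(path="/var/run/docker.sock"):
    """Return the numeric group ID that owns the given file or socket."""
    return os.stat(path).st_gid


if __name__ == "__main__":
    sock = "/var/run/docker.sock"
    if os.path.exists(sock):
        # Put this number under group_add in docker-compose.yml
        print(socket_group_id(sock))
    else:
        print(f"{sock} not found; is Docker running on this host?")
```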

Transformation with dbt Cloud

Staging Layer

The staging models are materialized as views — no storage cost, no latency, just a SQL lens over the raw external tables.

stg_jedh does the heavy lifting: it unpivots the wide year-column format into long format using BigQuery's UNPIVOT operator:

unpivot(
    value for year in (
        year_1998, year_1999, year_2000, ..., year_2025
    )
)

Then extracts the year integer from the column name:

cast(replace(year, 'year_', '') as int64) as year

And adds a human-readable series description via a CASE statement, turning DT.DOD.DECT.CD into Total external debt stocks.

The result is a clean long-format table: one row per country per series per year.
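
For readers who think more easily in Python than in SQL, the wide-to-long reshaping can be sketched as below. This is an illustration of the idea, not the dbt model itself: unpivot_years is a hypothetical helper, the id field names are assumed from the mart description, and the numbers are made up.

```python
def unpivot_years(row, id_fields=("country_code", "series_code")):
    """Turn one wide row (year_1998, year_1999, ...) into long-format rows."""
    ids = {field: row[field] for field in id_fields}
    long_rows = []
    for column, value in row.items():
        if not column.startswith("year_"):
            continue
        if value is None:
            # BigQuery's UNPIVOT excludes NULL values by default
            continue
        # Same idea as cast(replace(year, 'year_', '') as int64)
        year = int(column.replace("year_", ""))
        long_rows.append({**ids, "year": year, "value": value})
    return long_rows


# Made-up numbers, purely for illustration
wide = {
    "country_code": "KEN",
    "series_code": "DT.DOD.DECT.CD",
    "year_1998": 100.0,
    "year_1999": None,
    "year_2000": 120.0,
}
for record in unpivot_years(wide):
    print(record)
```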

stg_qeds is simpler — it selects the clean columns, uses SAFE_CAST to convert quarter values to float64, and filters out null countries and series codes. SAFE_CAST is preferable to CAST here because it returns NULL on failure rather than throwing an error, which is the right behaviour for messy Excel data.
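
The difference between the two casts is easiest to see with a small Python analogue. This is only a sketch of the behaviour: safe_cast_float is a hypothetical helper, the sample values are made up, and Python's float() does not accept exactly the same string formats as BigQuery.

```python
def safe_cast_float(value):
    """Return value as a float, or None when it cannot be converted."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        # A plain cast would raise here and fail the whole query
        return None


# Typical spreadsheet debris next to real numbers (made-up values)
for raw in ["1234.5", " 42 ", "..", "n/a", "", None]:
    print(repr(raw), "->", safe_cast_float(raw))
```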

Mart Layer

The mart models are materialized as tables with partitioning and clustering for query efficiency.

mart_debt_stocks is partitioned by year and clustered by country_code and series_code. It enriches the staging data with YoY percentage change and debt as a percentage of total external debt per country per year — both computed using window functions:

lag(b.debt_value_usd) over (
    partition by b.country_code, b.series_code
    order by b.year
) as prev_year_value,

safe_divide(
    b.debt_value_usd - lag(b.debt_value_usd) over (
        partition by b.country_code, b.series_code
        order by b.year
    ),
    lag(b.debt_value_usd) over (
        partition by b.country_code, b.series_code
        order by b.year
    )
) * 100 as yoy_change_pct
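
The same calculation can be sketched in plain Python to make the roles of lag and safe_divide explicit. This is an illustrative analogue for a single (country_code, series_code) partition, not the mart model; the function names and the sample values are assumptions.

```python
def safe_divide(numerator, denominator):
    """Mirror BigQuery's SAFE_DIVIDE: None on division by zero or NULL input."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def yoy_change_pct(values_by_year):
    """Map year -> YoY % change for one (country_code, series_code) partition."""
    result = {}
    previous = None  # plays the role of lag(...) over (order by year)
    for year in sorted(values_by_year):
        current = values_by_year[year]
        if current is None or previous is None:
            result[year] = None
        else:
            ratio = safe_divide(current - previous, previous)
            result[year] = None if ratio is None else ratio * 100
        previous = current
    return result


# Made-up values for a single country and series
series = {2019: 100.0, 2020: 110.0, 2021: 99.0, 2022: 0.0, 2023: 5.0}
print(yoy_change_pct(series))
```

The first year has no predecessor, and the year after a zero value has a zero denominator, so both come back as None instead of raising an error.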

mart_regional_debt assigns countries to six World Bank regions using a CASE statement on ISO3 country codes, then aggregates total, average, max, and min debt stocks per region per series per year. This powers the regional trajectory time series on the dashboard.
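
A rough Python equivalent of that mapping-then-aggregating step is sketched below. The lookup table is an illustrative subset standing in for the CASE statement, the function name is hypothetical, and the values are made up.

```python
from collections import defaultdict

# Illustrative subset only; the real model maps every country via a CASE statement
REGION_BY_ISO3 = {
    "KEN": "Africa",
    "NGA": "Africa",
    "BRA": "Latin America",
    "IND": "South Asia",
}


def aggregate_by_region(rows):
    """Group long-format rows by (region, series_code, year) and summarize."""
    groups = defaultdict(list)
    for row in rows:
        region = REGION_BY_ISO3.get(row["country_code"])
        if region is None or row["value"] is None:
            continue  # skip unmapped countries and missing values
        groups[(region, row["series_code"], row["year"])].append(row["value"])
    return {
        key: {
            "total": sum(values),
            "average": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
        }
        for key, values in groups.items()
    }


# Made-up values, purely for illustration
rows = [
    {"country_code": "KEN", "series_code": "DT.DOD.DECT.CD", "year": 2020, "value": 30.0},
    {"country_code": "NGA", "series_code": "DT.DOD.DECT.CD", "year": 2020, "value": 70.0},
    {"country_code": "BRA", "series_code": "DT.DOD.DECT.CD", "year": 2020, "value": 500.0},
]
for key, stats in aggregate_by_region(rows).items():
    print(key, stats)
```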

mart_debt_service computes total annual debt payments and average quarterly payments per country from the QEDS payment schedule data, enabling debt service pressure analysis.

dbt Tests

All models have not_null tests on primary dimension columns. Running dbt test after every model change ensures data quality is enforced at the transformation layer rather than discovered downstream in the dashboard.

The Dashboard

The Looker Studio dashboard has two pages, both connected directly to the BigQuery mart tables.

Page 1 — Global Debt Overview answers Q4 at a glance. A time series chart using mart_regional_debt shows six regional debt trajectories from 1998 to 2024. Africa's trajectory is notably steeper post-2010. A bar chart shows the top 20 countries by total external debt for the selected year. A scorecard shows total global external debt.

Page 2 — Country Deep-Dive answers Q1, Q2, and Q5. A stacked bar chart shows debt composition by sector over time for any selected country. A line chart shows short-term vulnerability trends. A table sorted by total 2021 payments shows which countries face the most acute debt service pressure.

Live dashboard: https://lookerstudio.google.com/reporting/7fc18e9e-a5c6-4616-b920-b5b4bddf2264

Key Technical Lessons

1. Know which World Bank API sources support programmatic access. Only source 2 (IDS) reliably supports the Indicators API v2 via wbgapi. Sources 22, 23, and 54 use a different DataBank backend and return JSON decode errors. For QEDS data, use the bulk Excel downloads instead.

2. BigQuery rejects column names starting with numbers. Prefix them before writing Parquet. A simple list comprehension handles this: f"year_{col}" if str(col).isdigit() else col.

3. Use the right tool for the data size. PySpark is excellent for large datasets but overkill for a 76K-row Excel file. The GCS Python client with pandas and PyArrow is simpler, faster, and doesn't OOM.

4. Docker-in-Docker with Airflow requires explicit socket mounting and group permissions. Mount /var/run/docker.sock and add the Docker group ID to the worker container. Without the group add, the worker can see the socket but fails with PermissionError(13, 'Permission denied').

5. ELT separates concerns cleanly. When the analytical questions evolved during development, I only needed to update dbt models — never the ingestion layer. This separation is the most valuable architectural decision in the project.

What I Would Do Differently

If I were building this again, I would use the World Bank DataBank bulk download API to get historical QEDS time series data instead of point-in-time Excel files. The current QEDS data only has a handful of quarters because the Excel files are snapshots of the latest publication. A proper time series would require downloading and archiving each quarterly release.

I would also add a load_to_bigquery step that loads Parquet directly into native BigQuery tables rather than using external tables. External tables work well but they require manually updating the source URI list each time a new partition is added. A native table with partitioning handles this automatically.

Conclusion

The Sovereign Debt Observatory took the World Bank's raw debt data from scattered Excel files and API endpoints to a fully automated, tested, and documented analytical pipeline. Every component is reproducible — Terraform provisions the infrastructure, Docker packages the ingestion environment, Airflow schedules the runs, and dbt documents and tests the transformations.

The full source code is on GitHub: https://github.com/Derrick-Ryan-Giggs/sovereign-debt-observatory

If you have questions about any part of the implementation, drop them in the comments. I am happy to go deeper on any layer.

Ryan Derrick Giggs is a data engineering practitioner and technical writer based in Nairobi, Kenya. He is currently completing the DataTalksClub Data Engineering Zoomcamp 2026.

LinkedIn: https://linkedin.com/in/ryan-giggs-a19330265
GitHub: https://github.com/Derrick-Ryan-Giggs

Exadata AI Storage: A New Era of AI-Powered Database Infrastructure

Ryan Giggs — Mon, 06 Apr 2026 07:41:04 +0000

Oracle's Exadata platform has always been synonymous with extreme database performance. But with the release of Exadata System Software 24ai and 25ai, alongside the debut of Oracle Exadata X11M in January 2025, Oracle has taken a decisive step into the AI era. Exadata is no longer just a high-performance OLTP and analytics machine — it is now purpose-built to accelerate AI vector search, in-database machine learning, and mixed enterprise workloads, all on a single converged platform.

This post breaks down the key new features across software, infrastructure, high availability, monitoring, and security — and explains why they matter.

What Is Exadata AI Storage?

At its core, Exadata AI Storage refers to Oracle's strategy of pushing intelligence deeper into the storage layer. Rather than offloading AI computation to separate, purpose-built vector databases or GPUs, Oracle brings AI operations — particularly AI Vector Search — directly to the storage servers themselves. This means vector index builds, similarity searches, and distance calculations happen closer to where data lives, dramatically reducing the data movement that kills performance in distributed architectures.

The result? Key vector search operations running up to 30x faster with Exadata System Software 24ai, and further accelerated on X11M with in-memory vector index (HNSW) scans running up to 43% faster on database servers and up to 55% faster on storage servers compared to the previous generation.

Key New Software Features

1. AI Smart Scan

AI Smart Scan is an extension of Exadata's legendary Smart Scan technology, now purpose-built for AI workloads. It offloads compute-intensive AI Vector Search operations — including vector index builds and similarity queries — directly to Exadata's intelligent storage servers. This eliminates the need to ship raw data up to the database tier for processing.

Critically, AI Smart Scan enables thousands of concurrent AI vector searches in multi-user environments. This is a significant differentiator for enterprise RAG (Retrieval Augmented Generation) pipelines and AI applications that need to serve many users simultaneously, not just batch processes.

With the latest release (Exadata System Software 25ai / 25.1), Adaptive Top-K Filtering further extends this: each storage server maintains a running Top-K result set, reducing data returned to the database servers by up to 4.7x. Similarly, VECTOR_DISTANCE() calculations are now projected from storage, delivering up to 4.6x faster distance-based queries.
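
The general idea behind per-server Top-K filtering can be illustrated outside Oracle entirely. The sketch below is a conceptual Python model and not Oracle's implementation: each simulated storage server returns only its K nearest vectors, and the coordinator merges those short lists. All names (local_top_k, merge_top_k) and all data are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_top_k(query, vectors, k):
    """What one storage server sends back: its k nearest (distance, id) pairs."""
    scored = ((euclidean(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)


def merge_top_k(partials, k):
    """Coordinator step: merge the per-server lists into a global top k."""
    return heapq.nsmallest(k, (pair for partial in partials for pair in partial))


# Two simulated storage servers holding made-up 2-D vectors
server_a = {"a1": (0.0, 0.0), "a2": (5.0, 5.0), "a3": (1.0, 1.0)}
server_b = {"b1": (0.5, 0.5), "b2": (9.0, 9.0), "b3": (2.0, 2.0)}

query = (0.0, 0.0)
partials = [local_top_k(query, s, k=2) for s in (server_a, server_b)]
print(merge_top_k(partials, k=2))
```

Because the global K nearest neighbours are always contained in the union of the per-server K nearest, the coordinator only ever sees K rows per server instead of every candidate, which is where the reduction in data movement comes from.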

2. Exadata RDMA Memory (XRMEM)

XRMEM replaces the persistent memory (Intel Optane PMem) that earlier generations used, adapting to changes in the memory vendor landscape. It is built on DDR5 DRAM and accessed via RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE), bypassing the OS, I/O, and network software stacks entirely.

The practical benefit: ultra-low read latency as low as 14 microseconds — a 21% improvement over prior generations — with scan throughput of up to 500 GB/s from XRMEM alone. Each Exadata X11M Extreme Flash Storage Server contains 1.25 TB of XRMEM, which sits as an acceleration tier in front of the Smart Flash Cache.

XRMEM is particularly impactful for OLTP workloads that require sub-20-microsecond response times, and it now also accelerates AI vector index reads transparently.

3. On-Storage Processing

This is the foundational principle that makes all the above possible. Exadata storage servers are not passive disk arrays — they are intelligent compute nodes in their own right. SQL filtering, column projection, decompression, encryption/decryption, bloom filters, and now AI vector operations all execute on the storage servers, not the database servers.

This dramatically reduces the volume of data sent over the internal network and processed by database-tier CPUs. For analytics workloads, this "push-down" processing model is why Exadata consistently delivers 10–100x better throughput than general-purpose storage.

4. In-Memory Columnar Speeds for JSON Queries

Exadata's In-Memory Columnar Cache on storage servers (also called Columnar Cache) stores a columnar representation of row-oriented data directly in flash cache. When queries access this data, they get the performance of columnar analytics without requiring data to be reformatted or migrated.

With Oracle Database 23ai — which is required to unlock the full Exadata Exascale feature set — JSON documents stored natively benefit from this columnar acceleration. Oracle Database 23ai's JSON Relational Duality views, which expose the same data as both JSON and relational tables simultaneously, can be queried at columnar memory speeds on Exadata, collapsing the performance gap between document and relational workloads.

5. Transparent Cross-Tier Scan

Exadata's multi-tier storage hierarchy — XRMEM → Smart Flash Cache → disk — is managed automatically and transparently. When a Smart Scan or AI Smart Scan runs, Exadata intelligently sources data from whichever tier contains it, combining reads from memory, flash, and disk in parallel.

This means administrators and developers never need to manually partition hot vs. cold data or tune tier placement. The system continuously tracks access patterns and moves data to the appropriate tier based on usage, keeping the hottest data in XRMEM and the next hottest in flash. The database never sees these tiers explicitly — it simply issues a SQL or vector query and receives results.

6. Caching Enhancements

Several caching improvements ship with the latest Exadata System Software releases:

Automatic KEEP Object Load into Exadata Flash Cache: Objects tagged with the KEEP storage clause in Oracle Database are automatically and proactively loaded into Exadata Smart Flash Cache — even before they are first accessed — ensuring zero cold-start latency for critical tables and indexes.
Write Back Flash Cache: Database block writes are cached in flash, eliminating disk I/O bottlenecks for large OLTP and batch workloads.
Cell-to-Cell Rebalance preserving XRMEM and Flash Cache: When data rebalances across storage servers (due to maintenance or failure), both the XRMEM and flash cache contents are also rebalanced, preserving performance levels rather than causing a cold-cache performance dip.

7. Columnar Smart Scan at Memory Speed

When Oracle Database In-Memory is enabled, Exadata automatically stores data in columnar format within Flash Cache and XRMEM if it will improve query performance. This brings columnar analytics performance — historically associated only with in-database memory (DRAM on database servers) — to storage-resident data, enabling analytics at memory speeds even when datasets exceed what fits in database-server DRAM.

A single Exadata X11M rack can deliver up to 100 GB/s of flash throughput and 500 GB/s from XRMEM for the hottest data, far exceeding what traditional storage arrays can achieve even with flash added.

8. Exadata Cache Observability

Exadata System Software 24.1 introduced ecstat (Exadata Storage Cache Statistics), a real-time utility that provides per-storage-server statistics on Smart Flash Cache usage, XRMEM hits, and I/O performance. This was a long-standing gap — DBAs previously had to rely on AWR snapshots to understand cache behavior.

In Exadata System Software 25.2, this was extended with CellSQLStat, which provides real-time, per-storage-server insights into active Smart Scan operations: CPU and memory usage, Storage Index and Columnar Cache I/O savings, flash and XRMEM hit rates, scan rates, and more. Both ecstat and CellSQLStat data are automatically included in ExaWatcher collections, making them available for historical analysis.

Infrastructure Improvements

Increased Number of Virtual Machines

With Exadata Exascale (the new intelligent storage architecture introduced in 2024 and available on X11M), the limit on Virtual Machine clusters per database server increases dramatically. Traditional Exadata with ASM supported 4, 8, or 12 VMs per database server. Exascale raises this ceiling to 50 VMs per database server, enabling far greater consolidation of Oracle Database workloads on a single Exadata system without sacrificing isolation or performance.

Exascale also centralizes VM storage in the shared Exascale storage pool rather than on individual database servers, increasing flexibility and simplifying management.

Secure Boot for KVM Virtual Machines

Exadata System Software now supports Secure Boot for KVM guest VMs, ensuring that only cryptographically signed and trusted OS images can boot on Exadata database servers. This closes a significant attack vector in virtualized deployments and aligns Exadata's on-premises security posture with cloud-native security standards. It complements existing features like Trusted Partitions for Oracle Linux Virtualization.

High Availability and Network Resilience

Improved RoCE Network Resilience: ExaPortMon

Every Exadata database and storage server connects to the internal network via dual 100 GbE RoCE ports for an aggregate of 200 Gbps. If a RoCE leaf switch port becomes stalled — appearing online but unable to pass traffic — it can cause cluster instability without triggering a clean failure.

ExaPortMon (introduced in Exadata System Software 24ai) solves this. It continuously monitors both RoCE ports on every server. When it detects a stalled port, it automatically migrates the IP address to the healthy port, keeping network traffic flowing and preventing outages. When the stalled port recovers, ExaPortMon automatically returns the IP address to its original port. No manual intervention required.

Enhanced RoCE Network Security: Exadata Secure RDMA Fabric Isolation

Exadata Secure RDMA Fabric Isolation (Secure Fabric) provides strict network isolation between VM clusters sharing the same physical Exadata infrastructure. It prevents database servers in one VM cluster from communicating with those in another over the RoCE fabric, eliminating lateral movement risk in consolidated and multi-tenant deployments.

Starting with Exadata System Software 25.1, Secure Fabric is automatically selected by default for all new on-premises deployments with X8M and newer hardware — bringing on-premises deployments into alignment with cloud deployments, which have always used Secure Fabric.

Monitoring and Management

AWR & SQL Monitor Enhancements

Oracle Automatic Workload Repository (AWR) on Exadata is enhanced with Exadata-specific storage server metrics alongside standard Oracle wait event data. AWR now collects and reports on XRMEM, Flash Cache, and HDD device performance, enabling DBAs to correlate database wait events with storage-tier behavior in a single report.

SQL Monitor is similarly enhanced, providing end-to-end visibility into query execution that includes storage offload statistics, Smart Scan I/O savings, and flash cache hit rates — all tied to the specific SQL statement being analyzed.

JSON API for Management Server

Exadata's Management Server (MS) now exposes a JSON REST API, enabling programmatic access to Exadata management and monitoring functions. This is a significant modernization of Exadata's management interface, making it easier to integrate Exadata health metrics, alerts, and configuration into modern observability stacks (Grafana, custom dashboards, CI/CD pipelines) without relying solely on traditional CLI tools like cellcli or Enterprise Manager.

Security Enhancements

SNMP v3 Security

Exadata System Software 24.1 introduced mandatory SNMP security improvements across all storage servers and database servers. The key changes:

SNMP v3 is now the recommended and encouraged standard, supporting SHA-256, SHA-384, and SHA-512 authentication protocols for strong authentication and encryption.
All SNMP subscriber definitions now require the connection type to be explicitly specified — administrators can no longer leave it ambiguous.
SNMP v1 remains available but triggers an explicit warning discouraging its use.
Default community strings like public and private are actively discouraged by the system, prompting administrators to set strong, unique values.

This tightens Exadata's management plane security, closing a common vulnerability in enterprise infrastructure where SNMP v1 with default community strings is still widely used.

Why This All Matters: The Convergence Thesis

The strategic bet Oracle is making with Exadata AI Storage is one of convergence over fragmentation. The enterprise AI market has seen an explosion of purpose-built vector databases (Pinecone, Weaviate, Qdrant, Milvus, Chroma), and many organizations are building RAG pipelines that shuffle data between separate systems: a relational database for operational data, a vector database for embeddings, an object store for documents.

Exadata offers a fundamentally different architecture: bring the AI to the data, not the data to the AI. With Oracle Database 23ai (now succeeded by Oracle AI Database 26ai as the long-term support release), all of this runs in a single converged engine. Relational queries, vector similarity search, JSON document queries, graph traversals, and full-text search are executed as optimized SQL on Exadata hardware. Advanced AI features including AI Vector Search are included at no additional charge.

And with Exadata Exascale reducing the entry cost for Exadata Database Service by up to 95% and enabling organizations to start with as little as 300 GB of storage, the platform is no longer exclusively for Fortune 500 database estates. It is increasingly accessible to organizations of any size that need to build AI applications on enterprise-grade, governed, transactionally consistent data.

Summary Table

Category	Key Feature	Benefit
AI Workloads	AI Smart Scan + Adaptive Top-K	Up to 30x faster vector search; 4.7x less data to DB servers
Memory	XRMEM (DDR5 + RDMA)	14µs read latency; 500 GB/s scan throughput
Caching	Auto KEEP Load, Write Back, Columnar Cache	Zero cold-start for critical objects; analytics at memory speed
Observability	ecstat + CellSQLStat	Real-time per-cell Smart Scan and cache monitoring
Infrastructure	Exascale VM limit increase	Up to 50 VMs per DB server (up from 12)
Security	Secure Fabric default on-prem	Automatic lateral-movement isolation for VM clusters
Network HA	ExaPortMon	Auto-failover between RoCE ports; no manual intervention
Security	SNMP v3 enforcement	SHA-512 auth; eliminates default community string risk
Management	JSON API for Management Server	Programmatic integration with modern observability stacks

Exadata AI Storage represents Oracle's clearest articulation yet of its "converged data" strategy: a single platform that handles OLTP, analytics, and AI workloads without requiring organizations to build and manage a fragmented ecosystem of specialized tools. With Exadata System Software 25ai, the X11M generation, and the Exascale architecture now generally available across OCI, multicloud (AWS, Azure, Google Cloud), and on-premises, there has never been a better time to evaluate what Exadata can do for your AI application stack.

The numbers speak for themselves — but the architecture is the real story.

Have you worked with Exadata AI Storage or Oracle Database 23ai/26ai AI Vector Search? Share your experience in the comments.

Tech Ecosystem Observatory: How I Built a Cloud-Native Data Pipeline to Track Global Tech Layoffs vs YC Startup Activity

Ryan Giggs — Mon, 30 Mar 2026 07:44:40 +0000

Just completed my Data Engineering Zoomcamp 2026 capstone project: the Tech Ecosystem Observatory.

Built a full cloud-native batch data pipeline from scratch that answers: which industries are shedding the most jobs, and how does that correlate with YC startup activity?

Here's what the pipeline looks like end to end:

✅ Terraform — provisioned GCS bucket and BigQuery datasets as infrastructure as code

✅ Docker — containerized all ingestion scripts into a portable image

✅ Kestra — orchestrated a 4-task batch DAG running weekly every Monday 6AM UTC

✅ Google Cloud Storage — raw JSONL data lake storing layoffs and YC company data

✅ BigQuery — partitioned by date (monthly) and clustered by industry/country for optimized queries

✅ dbt Cloud — built staging views and mart tables (mart_monthly_layoffs + mart_tech_ecosystem)

✅ Looker Studio — 2-page interactive dashboard with layoff trends, geo maps, ecosystem stress ratios

📊 Data: 4,317 layoff events (2023–2024) + 5,690 YC-backed companies

🔗 Live dashboard: https://lookerstudio.google.com/reporting/b1620cae-97cb-4911-82b8-dd0c46ee8acb

💻 GitHub: https://github.com/Derrick-Ryan-Giggs/tech-ecosystem-observatory

Huge thanks to @DataTalksClub and Alexey Grigorev for building and maintaining this incredible free course. If you're serious about data engineering, this is where you start 👇

https://github.com/DataTalksClub/data-engineering-zoomcamp/

Who else is building data pipelines? Drop your projects below 👇

Streaming with PyFlink and Redpanda

Ryan Giggs — Mon, 16 Mar 2026 13:34:19 +0000

Week 7 of Data Engineering Zoomcamp by @DataTalksClub complete

Just finished Module 7 - Streaming with PyFlink. Learned how to:

Set up Redpanda as a Kafka replacement
Build Kafka producers and consumers in Python
Create tumbling and session windows in Flink
Analyze real-time taxi trip data with stream processing

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/Streaming-Homework

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/

Oracle Database 23ai: Vector Similarity Search - Exact, Approximate, and Multi-Vector Strategies

Ryan Giggs — Mon, 16 Mar 2026 08:57:31 +0000

Oracle Database 23ai's AI Vector Search provides multiple strategies for finding similar vectors, each with different trade-offs between accuracy, speed, and resource usage. Understanding when to use exact search, approximate search, or multi-vector search—and knowing the essential vector functions—is crucial for building high-performance semantic search applications.

Understanding Similarity Search Types

1. Exact Similarity Search (Flat Search)

Exact similarity search calculates a query vector's distance to all other vectors. It's also called flat search or exhaustive search because every vector in the dataset is compared.

Characteristics:

Gives the most accurate results
Perfect search quality (100% recall)
Query time grows with dataset size, because every vector is compared
No indexes required
Suitable for small to medium datasets (thousands to hundreds of thousands of vectors)

SQL Example:

SELECT 
  product_id,
  product_name,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS similarity
FROM products
ORDER BY similarity
FETCH FIRST 10 ROWS ONLY;

When to Use:

Small datasets where performance is acceptable
When perfect accuracy is required
Development and testing phases
Benchmarking approximate search results
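To make the "compare against every vector" idea concrete, here is a minimal pure-Python sketch of flat search. It is not how Oracle implements it internally; the function names and the toy product data are my own, chosen only for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, rows, k):
    """Score every row against the query (flat search) and keep the k closest."""
    scored = [(cosine_distance(vec, query), row_id) for row_id, vec in rows]
    scored.sort()
    return [row_id for _, row_id in scored[:k]]

# Hypothetical toy data: (product_id, embedding)
products = [
    (1, [1.0, 0.0]),
    (2, [0.9, 0.1]),
    (3, [0.0, 1.0]),
    (4, [-1.0, 0.0]),
]
top2 = exact_top_k([1.0, 0.0], products, k=2)
print(top2)  # [1, 2]
```

The work is proportional to the number of rows, which is why this approach stops being practical as the table grows.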

2. Approximate Similarity Search

Approximate similarity search uses vector indexes to dramatically speed up searches with minimal accuracy loss. Instead of checking every vector, it leverages specialized data structures to narrow the search space.

Key Requirements:

You must enable the vector pool in the SGA before creating HNSW indexes (in-memory neighbor graph indexes).

ALTER SYSTEM SET vector_memory_size = 800M SCOPE=SPFILE;
-- Restart required

Characteristics:

Can be more efficient but less accurate than exact search
Uses target accuracy setting (typically 90-99%)
Requires vector indexes (HNSW or IVF)
Scales to millions or billions of vectors
Typical accuracy: 95%+ with proper configuration

SQL Example:

-- Using FETCH APPROXIMATE
SELECT 
  product_id,
  product_name
FROM products
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Performance Comparison:

According to Oracle benchmarks, exact search on 50,000 vectors took 1.50 seconds versus 0.47 seconds with HNSW index—over 3x faster with the same top-10 results.
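One way to check accuracy figures like these on your own data is to measure recall: the fraction of the exact top-k that the approximate query also returns. As far as I understand it, this is what target accuracy is aiming at, but treat that reading as an assumption. The ID lists below are made up for illustration.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

# Hypothetical results for the same query, exact vs. approximate
exact_ids = [11, 42, 7, 93, 5, 18, 64, 30, 2, 77]
approx_ids = [11, 42, 7, 93, 5, 18, 64, 30, 2, 51]

recall = recall_at_k(exact_ids, approx_ids)
print(recall)  # 0.9
```

Averaging this over a sample of real queries gives a practical accuracy number to compare against the index setting.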

Vector Index Types

Oracle AI Vector Search supports two main types of vector indexes:

HNSW (Hierarchical Navigable Small World)

In-Memory Neighbor Graph vector index that creates a navigable graph structure for ultra-fast similarity search.

Creating HNSW Index:

CREATE VECTOR INDEX products_hnsw_idx
ON products(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

Characteristics:

Fully built in-memory in the vector pool
Extremely fast queries (sub-100ms)
Requires substantial memory
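No DML operations allowed after creation in the initial 23ai release (later release updates add transactional support)
Not available in RAC environments in the initial 23ai release (check the documentation for your release update)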

Memory Calculation:

Formula: 1.3 × number_of_vectors × number_of_dimensions × dimension_byte_size

Example for 1M vectors, 768 dimensions, FLOAT32 (4 bytes):

1.3 × 1,000,000 × 768 × 4 = 3.99 GB
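The same arithmetic as a small helper, in case you want to size the vector pool for other shapes. The function name and parameters are my own; the 1.3 overhead factor is the one from the formula above.

```python
def hnsw_memory_bytes(num_vectors, dimensions, bytes_per_dimension, overhead=1.3):
    """Rough HNSW memory estimate using the 1.3x rule of thumb."""
    return overhead * num_vectors * dimensions * bytes_per_dimension

estimate = hnsw_memory_bytes(1_000_000, 768, 4)
print(f"{estimate / 1e9:.2f} GB")     # 3.99 GB (decimal gigabytes)
print(f"{estimate / 2**30:.2f} GiB")  # 3.72 GiB (binary gigabytes)
```

Note the difference between decimal GB and binary GiB when comparing the estimate against a vector_memory_size setting.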

IVF (Inverted File Flat)

Neighbor Partition vector index built on disk with blocks cached in the buffer cache.

Creating IVF Index:

CREATE VECTOR INDEX products_ivf_idx
ON products(embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;

Characteristics:

Disk-based, blocks cached in buffer cache
More scalable for very large datasets
Supports DML operations (may require periodic rebuild)
Works in RAC environments
Slightly slower than HNSW but still very fast

How IVF Works:

The index partitions vectors into clusters based on similarity. By default, the number of partitions equals the square root of the dataset size. During search, only relevant clusters are examined.
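Here is a toy sketch of that idea in pure Python: assign each vector to its nearest centroid, then search only the partition(s) closest to the query. The centroids are hard-coded and all names are hypothetical; a real index learns centroids by clustering and is far more elaborate.

```python
import math

def default_partition_count(num_vectors):
    """Square-root rule of thumb described above."""
    return max(1, round(math.sqrt(num_vectors)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(vectors, centroids):
    """Assign every vector ID to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: squared_l2(vec, centroids[i]))
        partitions[nearest].append(vec_id)
    return partitions

def ivf_search(query, vectors, centroids, partitions, k, probes=1):
    """Scan only the `probes` partitions whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: squared_l2(query, centroids[i]))
    candidates = [vid for i in ranked[:probes] for vid in partitions[i]]
    candidates.sort(key=lambda vid: squared_l2(query, vectors[vid]))
    return candidates[:k]

vectors = {1: [0.0, 0.1], 2: [0.2, 0.0], 3: [5.0, 5.1], 4: [5.2, 4.9]}
centroids = [[0.1, 0.05], [5.1, 5.0]]  # hypothetical; normally learned from the data
partitions = build_partitions(vectors, centroids)

print(default_partition_count(1_000_000))  # 1000
print(ivf_search([5.0, 5.0], vectors, centroids, partitions, k=1))  # [3]
```

Probing more partitions raises accuracy at the cost of scanning more vectors, which is the trade-off the target accuracy setting controls.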

3. Multi-Vector Similarity Search

Multi-vector similarity search is usually used for multi-document search where documents are split into chunks, and chunks are embedded individually into vectors.

Use Cases:

Long documents split into paragraphs or sections
Product catalogs with multiple descriptions
Research papers chunked for semantic search
Multi-modal data (text + images)

How It Works:

Documents are split into chunks
Each chunk is embedded individually into a separate vector
Chunks are stored with partition keys linking them to parent documents
Search retrieves top chunks across all documents
Results can be aggregated by document

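Oracle's row-limiting clause can group chunks that belong to the same document using partitions, for example FETCH FIRST 2 PARTITIONS BY doc_id, 3 ROWS ONLY to return the top 3 chunks from each of the 2 best-matching documents. These are logical groupings in the query and are separate from physical table partitioning.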

Table Structure Example:

CREATE TABLE document_chunks (
  doc_id NUMBER,
  chunk_id NUMBER,
  chunk_text CLOB,
  embedding VECTOR(768, FLOAT32),
  PRIMARY KEY (doc_id, chunk_id)
) PARTITION BY RANGE (doc_id) (
  PARTITION p1 VALUES LESS THAN (1000),
  PARTITION p2 VALUES LESS THAN (2000),
  PARTITION p3 VALUES LESS THAN (MAXVALUE)
);

Multi-Vector Search Query:

-- Find top chunks across all documents
SELECT 
  doc_id,
  chunk_id,
  chunk_text,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS score
FROM document_chunks
ORDER BY score
FETCH APPROXIMATE FIRST 20 ROWS ONLY;

Aggregating by Document:

-- Get top documents based on best chunk match
SELECT 
  doc_id,
  MIN(VECTOR_DISTANCE(embedding, :query_vector, COSINE)) AS best_score
FROM document_chunks
GROUP BY doc_id
ORDER BY best_score
FETCH FIRST 10 ROWS ONLY;
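The GROUP BY with MIN above boils down to "keep each document's best chunk, then rank documents by it". A minimal Python sketch of the same aggregation, with made-up (doc_id, distance) pairs:

```python
def best_chunk_per_document(chunk_scores, top_n):
    """Keep the smallest chunk distance per document, then rank documents by it."""
    best = {}
    for doc_id, distance in chunk_scores:
        if doc_id not in best or distance < best[doc_id]:
            best[doc_id] = distance
    return sorted(best.items(), key=lambda item: item[1])[:top_n]

# Hypothetical chunk-level results: (doc_id, distance to the query)
chunk_scores = [(101, 0.42), (101, 0.10), (102, 0.25), (103, 0.31), (102, 0.60)]
top_docs = best_chunk_per_document(chunk_scores, top_n=2)
print(top_docs)  # [(101, 0.1), (102, 0.25)]
```

Using the minimum means a single strongly matching chunk is enough to surface a document; averaging chunk distances would favor documents that match throughout instead.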

Narrowing Search Results

Attribute Filtering with WHERE Clause

Use the WHERE clause to filter results on metadata or business attributes. Filtering works alongside the ORDER BY clause that ranks by vector distance, so both can appear in the same query.

This powerful combination enables hybrid search: semantic similarity (vector search) plus exact filters (traditional SQL).

Example: Filter by Category and Date

SELECT 
  product_id,
  product_name,
  category,
  COSINE_DISTANCE(embedding, :query_vector) AS similarity
FROM products
WHERE category = 'Electronics'
  AND launch_date > DATE '2024-01-01'
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Example: Price Range Filter

SELECT 
  product_id,
  product_name,
  price,
  L2_DISTANCE(embedding, :query_vector) AS distance
FROM products
WHERE price BETWEEN 100 AND 500
  AND in_stock = 'Y'
ORDER BY embedding <-> :query_vector
FETCH APPROXIMATE FIRST 5 ROWS ONLY;

Best Practices for Filtering:

Apply filters in WHERE clause before vector search
Use indexed columns for better performance
Combine multiple conditions as needed
Leverage partition pruning for partitioned tables

Testing Other Distance Functions

Oracle provides shorthand distance functions that simplify syntax and improve code readability.

Distance Function Equivalents

L1_DISTANCE(v1, v2) computes the Manhattan (L1) distance

SELECT L1_DISTANCE(
  VECTOR('[1, 2, 3]'),
  VECTOR('[4, 5, 6]')
) AS manhattan_dist
FROM DUAL;

L2_DISTANCE(v1, v2) computes the Euclidean (L2) distance

SELECT L2_DISTANCE(
  VECTOR('[3, 0]'),
  VECTOR('[0, 4]')
) AS euclidean_dist
FROM DUAL;
-- Result: 5.0

COSINE_DISTANCE(v1, v2) computes the cosine distance

SELECT COSINE_DISTANCE(
  VECTOR('[1, 0]'),
  VECTOR('[1, 1]')
) AS cosine_dist
FROM DUAL;

INNER_PRODUCT(v1, v2) is the same as dot product

SELECT INNER_PRODUCT(
  VECTOR('[2, 3]'),
  VECTOR('[4, 5]')
) AS dot_product
FROM DUAL;
-- Result: 2*4 + 3*5 = 23

Note on DOT vs INNER_PRODUCT:

The VECTOR_DISTANCE function with DOT metric returns the negated inner product, while the INNER_PRODUCT function returns the actual dot product.

-- These are NOT equivalent (my_table is a placeholder name):
SELECT VECTOR_DISTANCE(v1, v2, DOT) FROM my_table;  -- Returns -1 * dot_product
SELECT INNER_PRODUCT(v1, v2) FROM my_table;          -- Returns dot_product
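The reason for the negation is easier to see with numbers: ORDER BY sorts ascending, so negating the dot product puts the most similar vector first. A small Python sketch, with function names of my own choosing:

```python
def inner_product(a, b):
    """Plain dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def dot_distance(a, b):
    """Negated dot product: smaller means more similar, so ascending sort works."""
    return -inner_product(a, b)

query = [2, 3]
candidates = {"a": [4, 5], "b": [1, 1], "c": [-4, -5]}

by_distance = sorted(candidates, key=lambda name: dot_distance(query, candidates[name]))
print(inner_product(query, candidates["a"]))  # 23
print(dot_distance(query, candidates["a"]))   # -23
print(by_distance)                            # ['a', 'b', 'c']
```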

Other Essential Vector Functions

1. Vector Constructors

TO_VECTOR() converts a string or a character large object (CLOB) to a vector

-- From string
SELECT TO_VECTOR('[1.5, 2.3, 4.1]') AS vec FROM DUAL;

-- From CLOB
DECLARE
  my_clob CLOB := '[0.1, 0.2, 0.3, 0.4]';
  my_vector VECTOR;
BEGIN
  my_vector := TO_VECTOR(my_clob);
END;
/

TO_VECTOR also takes another vector as input, adjusts its format, and returns the adjusted vector as output.

VECTOR() converts a string or CLOB into a vector

SELECT VECTOR('[1, 2, 3]', 3, FLOAT32) AS vec FROM DUAL;

TO_VECTOR and VECTOR are synonymous—they perform the same function.

2. Vector Serializer

VECTOR_SERIALIZE() converts a vector into a string or a CLOB

SELECT VECTOR_SERIALIZE(embedding) AS vec_string
FROM products
WHERE product_id = 1;

-- Result: '[0.12, 0.45, 0.78, ...]'

Use Cases:

Exporting vectors for external processing
Debugging and inspection
Logging and auditing
Integration with non-Oracle systems

3. Vector Norm

VECTOR_NORM() returns the Euclidean norm of a vector—the distance from the origin to the vector.

Also known as magnitude or length, it's calculated as the square root of the sum of squared components.

Formula:

norm = √(x₁² + x₂² + ... + xₙ²)

SQL Example:

SELECT VECTOR_NORM(TO_VECTOR('[3, 4]')) AS magnitude
FROM DUAL;
-- Result: 5.0 (because √(9 + 16) = 5)

Practical Use: Normalizing Vectors

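-- Normalize vector to unit length
-- Note: verify that your release supports dividing a vector by a scalar;
-- if it does not, normalize in the application before inserting
SELECT 
  embedding / VECTOR_NORM(embedding) AS normalized_vector
FROM products
WHERE product_id = 1;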

Normalized vectors have a magnitude of 1.0, which is required for meaningful dot product similarity comparisons.
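If you normalize on the application side before inserting, the logic is only a few lines. This is a minimal sketch with helper names of my own, not an Oracle API:

```python
import math

def norm(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit length."""
    length = norm(v)
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

unit = normalize([3.0, 4.0])
print(unit)        # [0.6, 0.8]
print(norm(unit))  # 1.0, up to floating-point rounding
```

Once both vectors have unit length, their dot product equals their cosine similarity, which is why the two metrics rank results identically on normalized data.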

4. Vector Dimension Count

VECTOR_DIMENSION_COUNT() returns the number of dimensions of a vector

SELECT VECTOR_DIMENSION_COUNT(embedding) AS dimensions
FROM products
WHERE product_id = 1;

-- Result: 768 (for BERT-style embeddings)

Use Cases:

Validating embedding dimensions
Debugging dimension mismatches
Dynamic schema inspection
Migration validation

5. Vector Dimension Format

VECTOR_DIMENSION_FORMAT() returns the storage format of the vector

SELECT VECTOR_DIMENSION_FORMAT(embedding) AS format
FROM products
WHERE product_id = 1;

-- Possible results: 'INT8', 'FLOAT32', 'FLOAT64', 'BINARY'

Use Cases:

Schema documentation
Storage optimization analysis
Migration planning
Model compatibility verification

Complete Working Example

Here's a comprehensive example demonstrating all three search types:

-- Create table
CREATE TABLE research_papers (
  paper_id NUMBER PRIMARY KEY,
  title VARCHAR2(500),
  abstract CLOB,
  category VARCHAR2(100),
  publish_date DATE,
  embedding VECTOR(768, FLOAT32)
);

-- Insert sample data
INSERT INTO research_papers VALUES (
  1,
  'Advances in Vector Search',
  'This paper explores efficient algorithms...',
  'Computer Science',
  DATE '2024-06-15',
  VECTOR('[0.1, 0.2, ...]', 768, FLOAT32)
);
COMMIT;

-- 1. EXACT SIMILARITY SEARCH
-- Most accurate, slower for large datasets
SELECT 
  paper_id,
  title,
  COSINE_DISTANCE(embedding, :query_vector) AS similarity
FROM research_papers
WHERE category = 'Computer Science'
ORDER BY similarity
FETCH FIRST 10 ROWS ONLY;

-- 2. CREATE VECTOR INDEX for approximate search
CREATE VECTOR INDEX papers_hnsw_idx
ON research_papers(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- 3. APPROXIMATE SIMILARITY SEARCH
-- Faster, 95%+ accuracy
SELECT 
  paper_id,
  title,
  publish_date
FROM research_papers
WHERE category = 'Computer Science'
  AND publish_date > DATE '2024-01-01'
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

-- 4. MULTI-VECTOR SEARCH (for chunked documents)
-- Create chunked version
CREATE TABLE paper_chunks (
  paper_id NUMBER,
  chunk_id NUMBER,
  chunk_text CLOB,
  embedding VECTOR(768, FLOAT32),
  PRIMARY KEY (paper_id, chunk_id)
);

-- Find best matching chunks
SELECT 
  paper_id,
  chunk_id,
  L2_DISTANCE(embedding, :query_vector) AS distance
FROM paper_chunks
ORDER BY distance
FETCH APPROXIMATE FIRST 20 ROWS ONLY;

-- Aggregate to document level
SELECT 
  p.paper_id,
  p.title,
  MIN(COSINE_DISTANCE(c.embedding, :query_vector)) AS best_match_score
FROM research_papers p
JOIN paper_chunks c ON p.paper_id = c.paper_id
GROUP BY p.paper_id, p.title
ORDER BY best_match_score
FETCH FIRST 10 ROWS ONLY;

Best Practices

1. Choose the Right Search Strategy

Exact search: Fewer than 100K vectors, or when perfect accuracy is required
Approximate with HNSW: 100K to 10M vectors, need sub-100ms latency, have memory available
Approximate with IVF: 10M+ vectors, limited memory, can tolerate slightly higher latency
Multi-vector: Document chunks, multi-modal data, detailed granularity needed

2. Optimize Index Configuration

-- For high accuracy (slower but better results)
CREATE VECTOR INDEX idx1 ON products(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 99;

-- For speed (faster but slightly lower accuracy)
CREATE VECTOR INDEX idx2 ON products(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 90;

3. Use Attribute Filtering Effectively

Combine vector similarity with traditional filters to narrow results:

-- Efficient: Filter first, then search
SELECT * FROM products
WHERE price < 100                      -- Traditional filter
  AND category = 'Electronics'         -- Traditional filter
ORDER BY embedding <=> :query_vector   -- Vector search
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

4. Monitor and Tune Performance

-- Check if index is being used
EXPLAIN PLAN FOR
SELECT * FROM products
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

5. Handle Dimension Mismatches

-- Validate dimensions before comparison
SELECT 
  VECTOR_DIMENSION_COUNT(embedding) AS dims,
  VECTOR_DIMENSION_FORMAT(embedding) AS format
FROM products
WHERE VECTOR_DIMENSION_COUNT(embedding) != 768;

Common Pitfalls and Solutions

Pitfall 1: Forgetting to Enable Vector Pool

Problem: HNSW index creation fails with memory error

Solution: Configure vector pool before creating HNSW indexes

ALTER SYSTEM SET vector_memory_size = 1G SCOPE=SPFILE;
-- Restart database

Pitfall 2: Using Wrong Index Type

Problem: Slow performance despite having an index

Solution: Match your workload to index type:

HNSW for read-heavy, memory-available scenarios
IVF for write-heavy or memory-constrained scenarios

Pitfall 3: Metric Mismatch

Problem: Index not being used despite proper syntax

Solution: Ensure query metric matches index metric

-- Index uses COSINE
CREATE VECTOR INDEX idx ON products(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE;

-- Query must also use COSINE
-- Correct: Uses <=> (COSINE operator)
ORDER BY embedding <=> :query_vector

-- Wrong: Uses <-> (EUCLIDEAN operator)
-- This will NOT use the index!
ORDER BY embedding <-> :query_vector

Pitfall 4: Not Using FETCH APPROXIMATE

Problem: Query is slow despite having vector index

Solution: Use FETCH APPROXIMATE to enable index usage

-- Without APPROXIMATE - performs exact search
FETCH FIRST 10 ROWS ONLY;

-- With APPROXIMATE - uses index
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Conclusion

Oracle Database 23ai provides a comprehensive vector search framework with multiple strategies to fit different use cases:

Search Strategies:

Exact: Perfect accuracy, suitable for smaller datasets
Approximate: 95%+ accuracy with dramatic speed improvements using HNSW or IVF indexes
Multi-vector: Chunk-level granularity for document search

Key Requirements:

Enable vector pool for HNSW indexes: vector_memory_size
Use FETCH APPROXIMATE for index-accelerated queries
Match distance metrics between index and query

Essential Functions:

Constructors: TO_VECTOR(), VECTOR()
Serializer: VECTOR_SERIALIZE()
Utilities: VECTOR_NORM(), VECTOR_DIMENSION_COUNT(), VECTOR_DIMENSION_FORMAT()
Distance Functions: L1_DISTANCE(), L2_DISTANCE(), COSINE_DISTANCE(), INNER_PRODUCT()

Best Practices:

Choose search strategy based on dataset size and accuracy requirements
Combine vector search with WHERE clause filters for powerful hybrid queries
Monitor index usage with EXPLAIN PLAN
Normalize vectors when using dot product similarity

By understanding these search strategies and vector functions, you can build high-performance semantic search applications that scale from thousands to billions of vectors while maintaining excellent accuracy and query speed.

Oracle Database 23ai: Creating Vectors and Understanding Distance Metrics for Similarity Search

Ryan Giggs — Mon, 09 Mar 2026 07:52:54 +0000

Oracle Database 23ai introduces native vector capabilities that enable semantic search directly within SQL. Understanding how to create vectors, calculate distances, and choose appropriate metrics is fundamental to building effective AI-powered applications. This comprehensive guide explores vector operations in Oracle 23ai with practical examples and best practices.

The VECTOR Data Type

Oracle 23ai introduces a native VECTOR data type designed specifically to store and manage vector embeddings efficiently within the database.

Declaration Syntax:

-- Flexible: any dimensions and format
embedding VECTOR

-- Specific dimensions, flexible format
embedding VECTOR(512)

-- Fully specified: dimensions and format
embedding VECTOR(512, FLOAT32)

Format Options:

INT8: 8-bit integers
FLOAT32: 32-bit floating-point (IEEE standard, most common)
FLOAT64: 64-bit floating-point (IEEE standard, higher precision)
BINARY: Binary vectors for specialized use cases

Oracle Database automatically casts values between formats when needed.

Creating Vectors with the VECTOR Constructor

The VECTOR constructor is a function that allows us to create vectors without storing them. It's particularly useful for learning purposes, testing, and ad-hoc queries.

Basic Syntax

VECTOR(array_literal [, number_of_dimensions] [, format])

Parameters:

array_literal: String representation of vector values
number_of_dimensions: Optional, inferred from array if not specified
format: Optional (INT8, FLOAT32, FLOAT64, BINARY)

Examples

Simple 2D Vector:

SELECT VECTOR('[0, 0]');

Output: [0, 0]

Output in Scientific Notation:

SELECT VECTOR('[10, 0]');

Output: [1.0E+001, 0] (equivalent to [10, 0])

Specifying Dimensions:

SELECT VECTOR('[1, 2, 3]', 3, FLOAT32);

Creating Vectors from Variables:

DECLARE
  my_vector VECTOR;
BEGIN
  my_vector := VECTOR('[0.5, 0.8, 0.2]');
  DBMS_OUTPUT.PUT_LINE(my_vector);
END;
/

Best Practices for Learning

Use a small number of dimensions (2-4) when learning vector concepts, as they're easier to visualize and understand. For production, match the dimensions to your embedding model (typically 384-1536).

The VECTOR_DISTANCE Function

VECTOR_DISTANCE is the main function for calculating the distance between two vectors. It takes the two vectors as its first two arguments and an optional distance metric as the third.

Syntax

VECTOR_DISTANCE(expr1, expr2 [, distance_metric])

Parameters:

expr1, expr2: Two vector expressions to compare
distance_metric: Optional; defaults to COSINE (or HAMMING for BINARY vectors)

Returns: BINARY_DOUBLE representing the distance

Default Behavior

If you do not specify a distance metric, then the default distance metric is cosine. If the input vectors are BINARY vectors, the default metric is hamming.

Basic Example

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[0, 1]'),
    COSINE
  ) AS distance;

Vector Distance Metrics: Choosing the Right One

Oracle AI Vector Search supports multiple distance metrics, each suited for different use cases and data characteristics.

1. Euclidean Distance (L2)

Euclidean distance gives us the straight-line distance between two vectors. It uses the Pythagorean theorem and is sensitive to both vector size and direction.

Formula:

d = √[(x₂-x₁)² + (y₂-y₁)²]

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN
  )
) AS distance;

Result: 5 (forming a 3-4-5 right triangle)

Use Cases:

Spatial data and geographical coordinates
Physical measurements
Image similarity (pixel-level comparisons)
When absolute distance matters

2. Euclidean Squared Distance (L2_SQUARED)

This is the Euclidean distance without the final square root. When ordering matters more than the actual distance values, squared Euclidean distance is useful because it is cheaper to compute and produces the same ranking.

Advantage: Faster computation for ranking/sorting

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN_SQUARED
  )
) AS distance_squared;

Result: 25 (5²)
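A quick way to convince yourself that skipping the square root is safe for ranking: the square root is monotonic, so both metrics sort candidates in the same order. A small Python check with made-up points:

```python
import math

def l2_squared(a, b):
    """Sum of squared component differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(l2_squared(a, b))

query = [0.0, 0.0]
points = {"p1": [3.0, 4.0], "p2": [1.0, 1.0], "p3": [6.0, 8.0]}  # hypothetical data

rank_l2 = sorted(points, key=lambda k: l2(query, points[k]))
rank_sq = sorted(points, key=lambda k: l2_squared(query, points[k]))

print(l2([3.0, 0.0], [0.0, 4.0]), l2_squared([3.0, 0.0], [0.0, 4.0]))  # 5.0 25.0
print(rank_l2 == rank_sq)  # True
```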

3. Cosine Similarity / Cosine Distance

Cosine similarity is one of the most widely used similarity metrics, especially in natural language processing (NLP). The smaller the angle between two vectors, the more similar they are.

Cosine measures the angle between two vectors rather than their magnitude. It's ideal for text embeddings where direction matters more than magnitude.

Formula:

cosine_similarity = (A · B) / (||A|| × ||B||)
cosine_distance = 1 - cosine_similarity

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[1, 1]'),
    COSINE
  ) AS cosine_dist;

Key Characteristics:

Range: 0 (identical) to 2 (opposite)
Normalized: Insensitive to vector magnitude
Best for: Text embeddings, document similarity, semantic search

Why It's Popular:

Cosine is one of the most useful metrics since it measures the angle between two vectors instead of the difference in size or position. This makes it perfect for comparing text embeddings where the semantic meaning is encoded in direction, not magnitude.
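To see the magnitude-invariance in numbers, here is the cosine distance formula from above written out in plain Python (the function name is mine). Scaling one vector by 10 leaves the distance unchanged, and the 0-to-2 range shows up at the extremes.

```python
import math

def cosine_distance(a, b):
    """1 - (A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(round(cosine_distance([1, 0], [1, 1]), 4))    # 0.2929
print(round(cosine_distance([1, 0], [10, 10]), 4))  # 0.2929 (same direction, longer vector)
print(cosine_distance([1, 0], [0, 1]))              # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))             # 2.0 (opposite)
```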

4. Dot Product Similarity

The dot product equals the product of the two vectors' magnitudes and the cosine of the angle between them. Equivalently, it is the sum of the products of corresponding coordinates. Larger values mean more similar; smaller values mean less similar.

Formula:

dot_product = Σ(aᵢ × bᵢ)

Note: Oracle's DOT metric calculates the negated dot product, so more negative values indicate greater similarity.

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[2, 3]'),
    VECTOR('[4, 5]'),
    DOT
  ) AS dot_distance;

Use Cases:

Recommendation systems
Similarity ranking with normalized vectors
Fast approximate nearest neighbor search

Important: For meaningful results, vectors should be normalized to unit length.
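Here is a small illustration of why normalization matters. With raw vectors, a long vector pointing in a different direction can outscore a perfectly aligned short one; after normalizing, direction alone decides. The helper names and vectors are hypothetical.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

query = [1.0, 0.0]
aligned = [1.0, 0.0]        # same direction as the query
long_off_axis = [3.0, 3.0]  # 45 degrees off, but much longer

# Raw dot product: the longer, less aligned vector wins
print(inner_product(query, aligned), inner_product(query, long_off_axis))  # 1.0 3.0

# After normalization: the aligned vector wins
q = normalize(query)
aligned_wins = inner_product(q, normalize(aligned)) > inner_product(q, normalize(long_off_axis))
print(aligned_wins)  # True
```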

5. Manhattan Distance (L1)

Manhattan distance measures distance along a uniform grid, such as city blocks, power grids, or chessboards. It is cheaper to compute than Euclidean distance because it needs no squaring or square root.

Also known as taxicab distance or L1 distance, it calculates the sum of absolute differences between vector components.

Formula:

manhattan = Σ|aᵢ - bᵢ|

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 2]'),
    VECTOR('[4, 6]'),
    MANHATTAN
  ) AS manhattan_dist;

Result: |4-1| + |6-2| = 7

Use Cases:

Grid-based problems
Route planning on city streets
Feature selection in machine learning
When diagonal movement isn't allowed

6. Hamming Distance

Hamming distance counts the positions at which two vectors differ. For binary vectors, it is the number of bits that must be flipped to turn one vector into the other.

It is widely used for error detection and correction in networks and digital communication.

SQL Example:

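-- BINARY vectors pack 8 dimensions into each uint8 value,
-- so the dimension count must be a multiple of 8
SELECT 
  VECTOR_DISTANCE(
    VECTOR('[176]', 8, BINARY),  -- bits 10110000
    VECTOR('[224]', 8, BINARY),  -- bits 11100000
    HAMMING
  ) AS hamming_dist;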

Result: 2 (positions 2 and 4 differ)
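The count can be reproduced in Python by XOR-ing the packed bytes and counting the set bits. This sketch assumes the bit patterns 10110 and 11100 padded with zeros to 8 bits and stored one byte per 8 dimensions; the function name is my own.

```python
def hamming_distance(packed_a, packed_b):
    """Count differing bits between two equal-length sequences of uint8 values."""
    if len(packed_a) != len(packed_b):
        raise ValueError("vectors must have the same number of bytes")
    return sum(bin(x ^ y).count("1") for x, y in zip(packed_a, packed_b))

a = [0b10110000]  # 176
b = [0b11100000]  # 224
print(hamming_distance(a, b))  # 2
```

The XOR leaves a 1 only where the two inputs disagree, so counting the ones gives the number of differing positions.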

Use Cases:

Error detection and correction
Genetic sequence comparison
Digital communication
Binary classification problems

7. Jaccard Distance

Jaccard distance measures dissimilarity between binary vectors based on the ratio of intersection to union.

Requirements: Both vectors must be BINARY

Formula:

jaccard = 1 - (|A ∩ B| / |A ∪ B|)

SQL Example:

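-- BINARY vectors pack 8 dimensions into each uint8 value,
-- so the dimension count must be a multiple of 8
SELECT 
  VECTOR_DISTANCE(
    VECTOR('[208]', 8, BINARY),  -- bits 11010000
    VECTOR('[176]', 8, BINARY),  -- bits 10110000
    JACCARD
  ) AS jaccard_dist;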

Use Cases:

Set similarity comparison
Document deduplication
Recommendation systems
Clustering binary data
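For a worked instance of the Jaccard formula, here is a plain-Python version operating on lists of 0/1 values (one list entry per bit, names of my own choosing). For the bit patterns 1101 and 1011, two positions are set in both and four are set in at least one.

```python
def jaccard_distance(bits_a, bits_b):
    """1 - |A intersect B| / |A union B| over positions where bits are set."""
    intersection = sum(1 for x, y in zip(bits_a, bits_b) if x and y)
    union = sum(1 for x, y in zip(bits_a, bits_b) if x or y)
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1.0 - intersection / union

dist = jaccard_distance([1, 1, 0, 1], [1, 0, 1, 1])
print(dist)  # 0.5
```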

Shorthand Distance Operators

Oracle provides convenient shorthand operators for common distance calculations, making SQL queries more concise and readable.

Available Operators

<-> Euclidean Distance Operator

SELECT '[1, 2]' <-> '[0, 1]' AS euclidean_dist;

Equivalent to:

SELECT L2_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]')) AS euclidean_dist;

SELECT VECTOR_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]'), EUCLIDEAN) AS euclidean_dist;

<=> Cosine Distance Operator

SELECT '[1, 0]' <=> '[0, 1]' AS cosine_dist;

Equivalent to:

SELECT COSINE_DISTANCE(VECTOR('[1, 0]'), VECTOR('[0, 1]')) AS cosine_dist;

<#> Negative Dot Product Operator

SELECT '[2, 3]' <#> '[4, 5]' AS neg_dot_product;

Equivalent to:

SELECT -1 * INNER_PRODUCT(VECTOR('[2, 3]'), VECTOR('[4, 5]')) AS neg_dot_product;

Practical Example

-- Compare products by embedding similarity
SELECT 
  p1.product_name,
  p2.product_name,
  p1.embedding <=> p2.embedding AS similarity_score
FROM products p1, products p2
WHERE p1.product_id = 100
  AND p2.product_id != 100
ORDER BY similarity_score
FETCH FIRST 5 ROWS ONLY;

Shorthand Distance Functions

In addition to operators, Oracle provides shorthand functions for cleaner code:

L1_DISTANCE - Manhattan distance
L2_DISTANCE - Euclidean distance
COSINE_DISTANCE - Cosine distance
INNER_PRODUCT - Dot product (not negated)
HAMMING_DISTANCE - Hamming distance
JACCARD_DISTANCE - Jaccard distance

Example:

SELECT 
  L2_DISTANCE(v1.embedding, v2.embedding) AS euclidean,
  COSINE_DISTANCE(v1.embedding, v2.embedding) AS cosine,
  INNER_PRODUCT(v1.embedding, v2.embedding) AS dot_prod
FROM vectors v1, vectors v2
WHERE v1.id = 1 AND v2.id = 2;

Performing Similarity Search

The VECTOR_DISTANCE function can be used to perform similarity search by ordering results based on vector proximity.

Exact Similarity Search

SELECT 
  id,
  description,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH FIRST 10 ROWS ONLY;

Approximate Similarity Search (with Index)

-- Using FETCH APPROXIMATE with vector index
SELECT 
  id,
  description
FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Key Difference:

EXACT: Compares query vector with every vector (slower, 100% accurate)
APPROXIMATE: Uses vector indexes (HNSW/IVF) for fast search at the index's target accuracy (for example, 95%)

Choosing the Right Distance Metric

Decision Matrix

Use Case	Recommended Metric	Why
Text embeddings	COSINE	Captures semantic similarity, magnitude-invariant
Image similarity	EUCLIDEAN	Pixel-level comparisons benefit from absolute distance
Recommendation systems	DOT (normalized vectors)	Fast computation, works well with normalized data
Grid/route problems	MANHATTAN	Natural fit for grid-based navigation
Binary classification	HAMMING	Direct bit difference counting
Error detection	HAMMING	Counts differing positions
Set similarity	JACCARD	Measures intersection/union ratio

Performance Considerations

Roughly fastest to slowest per comparison (actual differences depend on hardware and vector format):

DOT (simple multiplication and sum)
EUCLIDEAN_SQUARED (avoids square root)
MANHATTAN (absolute values and sum)
EUCLIDEAN (includes square root)
COSINE (normalization overhead)

Matching Index and Query Metrics

If a similarity search query specifies a distance metric that conflicts with the metric in a vector index, the vector index is not used and an exact search is performed instead.

Best Practice: Ensure your query metric matches your index metric for optimal performance.

-- Index created with COSINE
CREATE VECTOR INDEX docs_idx ON documents(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Query should also use COSINE for index usage
SELECT id FROM documents
ORDER BY embedding <=> :query_vector  -- Uses COSINE operator
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Complete Working Example

Here's a comprehensive example demonstrating vector creation, storage, and similarity search:

-- Create table with vector column
CREATE TABLE product_embeddings (
  product_id NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  description CLOB,
  embedding VECTOR(384, FLOAT32)
);

-- Insert sample data
INSERT INTO product_embeddings VALUES (
  1,
  'Laptop Computer',
  'High-performance laptop for developers',
  VECTOR('[0.2, 0.8, 0.5, ...]', 384, FLOAT32)
);

INSERT INTO product_embeddings VALUES (
  2,
  'Wireless Mouse',
  'Ergonomic wireless mouse',
  VECTOR('[0.1, 0.3, 0.7, ...]', 384, FLOAT32)
);

COMMIT;

-- Create vector index for fast similarity search
CREATE VECTOR INDEX product_emb_idx 
ON product_embeddings(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Perform similarity search
DECLARE
  query_vec VECTOR;
BEGIN
  -- Create query vector (in practice, from embedding model)
  query_vec := VECTOR('[0.15, 0.75, 0.6, ...]', 384, FLOAT32);

  -- Find similar products
  FOR rec IN (
    SELECT 
      product_name,
      description,
      COSINE_DISTANCE(embedding, query_vec) AS similarity
    FROM product_embeddings
    ORDER BY embedding <=> query_vec
    FETCH APPROXIMATE FIRST 5 ROWS ONLY
  ) LOOP
    DBMS_OUTPUT.PUT_LINE(
      'Product: ' || rec.product_name || 
      ' | Similarity: ' || rec.similarity
    );
  END LOOP;
END;
/

Best Practices

1. Choose Appropriate Dimensions

Match embedding dimensions to your model:

384: MiniLM, lightweight models
768: BERT, sentence transformers
1024: Cohere embedding models
1536: OpenAI ada-002

2. Normalize Vectors When Using DOT

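-- Normalize vector to unit length
-- Note: verify that your release supports dividing a vector by a scalar;
-- if it does not, normalize in the application before inserting
SELECT 
  VECTOR_NORM(embedding) AS original_magnitude,
  embedding / VECTOR_NORM(embedding) AS normalized_vector
FROM documents
WHERE id = 1;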

3. Use Appropriate Formats

FLOAT32: Default, balances precision and performance
FLOAT64: When high precision is critical
INT8: For quantized models, saves storage

4. Monitor Index Usage

-- Check if index is being used
EXPLAIN PLAN FOR
SELECT id FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

5. Benchmark Different Metrics

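Test multiple metrics on your own data rather than assuming one is best. The query below compares the average pairwise distance per metric; it uses a self-join, so run it on a sample: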

-- Compare metrics
SELECT 
  'COSINE' AS metric,
  AVG(COSINE_DISTANCE(e1.embedding, e2.embedding)) AS avg_dist
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id
UNION ALL
SELECT 
  'EUCLIDEAN',
  AVG(L2_DISTANCE(e1.embedding, e2.embedding))
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id;
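
It also helps to see what each metric computes before benchmarking. The Python sketch below implements the three metrics that have shorthand operators in Oracle (`<->`, `<=>`, `<#>`) so you can compare them on a pair of small vectors. The function names and the sample vectors are illustrative only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; what the <-> operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; what the <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def negative_dot_product(a, b):
    """Negated dot product; what the <#> operator returns."""
    return -sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]

print("euclidean:", euclidean_distance(a, b))
print("cosine:   ", cosine_distance(a, b))
print("neg dot:  ", negative_dot_product(a, b))
```

In all three cases a smaller value means a closer match, which is why the dot product is negated: sorting ascending then works the same way for every metric.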

Common Pitfalls and Solutions

Pitfall 1: Dimension Mismatch

Problem: Comparing vectors with different dimensions

-- This will error
SELECT VECTOR('[1, 2]') <=> VECTOR('[1, 2, 3]');

Solution: Ensure all vectors have the same dimensions
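
One way to catch this before data reaches the database is to validate dimensions in the loading code. This is a minimal Python sketch, assuming your embeddings arrive as plain lists; the helper name is hypothetical.

```python
def find_dimension_mismatches(vectors, expected_dimensions):
    """Return the positions of vectors whose length differs from the target."""
    return [
        index
        for index, vector in enumerate(vectors)
        if len(vector) != expected_dimensions
    ]

batch = [
    [1.0, 2.0],
    [1.0, 2.0, 3.0],  # wrong length for a 2-dimension column
    [0.5, 0.5],
]

bad_rows = find_dimension_mismatches(batch, expected_dimensions=2)
print("rows to fix before insert:", bad_rows)
```

Declaring the column with fixed dimensions, such as VECTOR(384, FLOAT32), gives you the same protection at insert time.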

Pitfall 2: Wrong Metric for Data Type

Problem: Using JACCARD on non-binary vectors

Solution: Use JACCARD only with BINARY vectors
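
To see why these metrics only make sense on bits, here is a minimal Python sketch of Hamming and Jaccard distance on bit-packed vectors. It assumes each byte holds 8 binary dimensions; the function names and sample bytes are illustrative.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def hamming_distance(a, b):
    """Count of bit positions where the two vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(popcount(x ^ y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - (bits set in both) / (bits set in either)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    if union == 0:
        return 0.0  # two all-zero vectors are treated as identical here
    return 1.0 - intersection / union

a = bytes([0b11110000])
b = bytes([0b11000011])

print("hamming:", hamming_distance(a, b))
print("jaccard:", jaccard_distance(a, b))
```

Both functions work on set membership (bit on or off), so applying them to floating-point embeddings would throw away the magnitudes that those embeddings encode.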

Pitfall 3: Not Using Indexes

Problem: Slow similarity searches on large datasets

Solution: Create appropriate vector indexes (HNSW for speed, IVF for scale)

Pitfall 4: Metric Mismatch with Index

Problem: Query metric conflicts with index metric, causing full table scan

Solution: Match query metric to index metric

Conclusion

Oracle Database 23ai's native vector capabilities provide a powerful, integrated platform for semantic search and AI-powered applications. Key takeaways:

Vector Creation:

VECTOR constructor for creating vectors directly in SQL
Support for multiple formats (INT8, FLOAT32, FLOAT64, BINARY)
Flexible dimension specification

Distance Metrics:

COSINE: Default, best for text embeddings and semantic similarity
EUCLIDEAN: Straight-line distance, good for spatial data
DOT: Fast for normalized vectors, recommendation systems
MANHATTAN: Grid-based problems, faster than Euclidean
HAMMING: Binary vectors, error detection
JACCARD: Set similarity with binary vectors

Shorthand Operators:

<-> for Euclidean distance
<=> for Cosine distance
<#> for negative dot product

Best Practices:

Match metric to your use case and embedding model
Create appropriate vector indexes
Ensure metric consistency between index and queries
Use approximate search for large datasets
Benchmark different metrics on your data

By understanding these vector operations and distance metrics, you can build efficient, accurate similarity search applications entirely within Oracle Database 23ai.

Batch Processing with Apache Spark

Ryan Giggs — Sat, 07 Mar 2026 13:04:08 +0000

Week 6 of Data Engineering Zoomcamp by @DataTalksClub complete
Just finished Module 6 - Batch Processing with Spark. Learned how to:

✅ Set up PySpark and create Spark sessions

✅ Read and process Parquet files at scale

✅ Repartition data for optimal performance

✅ Analyze millions of taxi trips with DataFrames

✅ Use Spark UI for monitoring jobs

Processing 4M+ taxi trips with Spark - distributed computing is powerful

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/pyspark-homework

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/

From APIs to Warehouses: AI-Assisted Data Ingestion with dlt

Ryan Giggs — Sun, 01 Mar 2026 11:44:32 +0000

🚀 dlt Workshop of Data Engineering Zoomcamp by @DataTalksClub complete
Just finished the Data Ingestion workshop with @dltHub. Learned how to:
✅ Build REST API data pipelines with dlt
✅ Use AI-assisted development with dlt MCP Server
✅ Load paginated API data into DuckDB
✅ Inspect pipeline data with dlt Dashboard and marimo notebooks
Built a full NYC taxi data pipeline from a custom API - AI-assisted data engineering is the future

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/-my-dlt-taxi-pipeline

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/

Oracle AI Vector Search: DML and DDL Operations on Vector Columns

Ryan Giggs — Thu, 26 Feb 2026 08:18:20 +0000

Oracle Database 23ai introduces the VECTOR data type, enabling you to store AI embeddings alongside traditional business data. Understanding how to perform Data Manipulation Language (DML) and Data Definition Language (DDL) operations on vector columns is essential for building AI-powered applications. This guide covers the operations that vector columns support and the restrictions that apply to them.

Understanding the VECTOR Data Type

The VECTOR data type can store vectors with:

Arbitrary number of dimensions: From 1 to 65,535 dimensions (65,528 for BINARY format)
Multiple formats: INT8, FLOAT32, FLOAT64, or BINARY
Flexible or fixed specifications: Can be declared with or without constraints

Declaration Examples:

-- Flexible: any dimensions and format
CREATE TABLE flexible_vectors (
    id NUMBER,
    embedding VECTOR
);

-- Fixed: specific dimensions and format
CREATE TABLE fixed_vectors (
    id NUMBER,
    embedding VECTOR(384, FLOAT32)
);

DML Operations on Vectors

1. INSERT Operations

You can directly insert vectors into tables using several methods:

Method A: Insert Using Vector Literals

-- Insert vector as array literal
INSERT INTO products (product_id, name, embedding)
VALUES (1, 'Laptop', '[0.1, 0.8, 0.5, 0.3]');

-- Insert with TO_VECTOR function
INSERT INTO products (product_id, name, embedding)
VALUES (2, 'Mouse', TO_VECTOR('[0.2, 0.7, 0.4, 0.6]'));

Method B: Insert with Embedding Generation

-- Generate embeddings during insert
INSERT INTO products (product_id, name, description, embedding)
VALUES (
    3,
    'Keyboard',
    'Mechanical gaming keyboard with RGB',
    VECTOR_EMBEDDING(doc_model USING 'Mechanical gaming keyboard with RGB' AS data)
);

Method C: Insert from Another Table

-- Copy vectors from another table
INSERT INTO products_archive (product_id, name, embedding)
SELECT product_id, name, embedding
FROM products
WHERE created_date < ADD_MONTHS(SYSDATE, -12);

2. UPDATE Operations

You can directly update vector columns:

-- Update with vector literal
UPDATE products
SET embedding = '[0.15, 0.85, 0.55, 0.35]'
WHERE product_id = 1;

-- Update using embedding generation
UPDATE products
SET embedding = VECTOR_EMBEDDING(doc_model USING description AS data)
WHERE embedding IS NULL;

-- Conditional update
UPDATE products
SET embedding = VECTOR_EMBEDDING(doc_model USING description AS data)
WHERE category = 'Electronics'
    AND embedding IS NULL;

3. DELETE Operations

Delete operations work normally with tables containing vector columns:

-- Delete specific rows
DELETE FROM products
WHERE product_id = 100;

-- Delete based on relational criteria
DELETE FROM products
WHERE created_date < ADD_MONTHS(SYSDATE, -24)
    AND status = 'ARCHIVED';

-- Delete all rows
DELETE FROM products;
-- or
TRUNCATE TABLE products;

4. Loading Data Using SQL*Loader

SQL*Loader can load vector data from external files:

Control File Example (products.ctl):

LOAD DATA
INFILE 'products.dat'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
    product_id,
    product_name,
    description,
    embedding CHAR(4000)
)

Data File Example (products.dat):

1,Laptop,High-performance laptop,"[0.1,0.8,0.5,0.3,0.6]"
2,Mouse,Wireless gaming mouse,"[0.2,0.7,0.4,0.6,0.5]"
3,Keyboard,Mechanical keyboard,"[0.15,0.85,0.55,0.35,0.65]"

Load Command:

sqlldr userid=username/password@db control=products.ctl log=products.log
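
The vector text contains commas, which collide with the comma field delimiter, so the vector field must be quoted in the data file and the control file must treat double quotes as an enclosure (for example with OPTIONALLY ENCLOSED BY '"'). Here is a minimal Python sketch for producing such a file with the standard csv module. The function name and sample rows are hypothetical.

```python
import csv
import io

def write_products_dat(rows, stream):
    """Write (id, name, description, embedding) rows as comma-separated text.

    The embedding is rendered as a bracketed list. csv.QUOTE_MINIMAL wraps
    any field containing a comma in double quotes, so the vector stays one field.
    """
    writer = csv.writer(stream, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for product_id, name, description, embedding in rows:
        vector_text = "[" + ",".join(repr(float(x)) for x in embedding) + "]"
        writer.writerow([product_id, name, description, vector_text])

rows = [
    (1, "Laptop", "High-performance laptop", [0.1, 0.8, 0.5]),
    (2, "Mouse", "Wireless gaming mouse", [0.2, 0.7, 0.4]),
]

buffer = io.StringIO()
write_products_dat(rows, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass an open file handle instead of the in-memory buffer, for example `open("products.dat", "w", newline="")`.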

5. DML on Tables with HNSW Indexes

Important Update (23ai Release 23.6+):

In Oracle Database 23ai releases 23.4 and 23.5, DML operations were not allowed on tables with HNSW indexes. Starting with Release 23.6, this restriction has been lifted:

Transactional consistency: DML modifications are now supported on tables with HNSW indexes
RAC support: Guarantees transactional consistency even in Oracle RAC environments
Consistent results: Vector search queries using HNSW indexes see transactionally consistent results based on their read snapshot

-- These operations now work with HNSW indexes (23.6+)
INSERT INTO products VALUES (101, 'New Product', '[0.1, 0.2, 0.3]');
UPDATE products SET embedding = '[0.2, 0.3, 0.4]' WHERE product_id = 101;
DELETE FROM products WHERE product_id = 101;

Vector DDL Operations

Tables with Multiple Vector Columns

Tables can have:

More than one column of VECTOR data type
Different formats and dimensions in different columns

CREATE TABLE multimedia_content (
    content_id NUMBER PRIMARY KEY,
    title VARCHAR2(500),
    description CLOB,
    -- Text embedding: 384 dimensions, FLOAT32
    text_embedding VECTOR(384, FLOAT32),
    -- Image embedding: 512 dimensions, FLOAT32
    image_embedding VECTOR(512, FLOAT32),
    -- Audio embedding: 256 dimensions, INT8
    audio_embedding VECTOR(256, INT8),
    -- Flexible dimension column
    metadata_embedding VECTOR
);

Adding Vector Columns

You can add vector columns to existing tables:

-- Add vector column with specific dimensions
ALTER TABLE products
ADD description_vector VECTOR(384, FLOAT32);

-- Add flexible vector column
ALTER TABLE customers
ADD preference_vector VECTOR;

-- No DEFAULT clause is needed: existing rows hold NULL until populated
ALTER TABLE documents
ADD content_vector VECTOR(768, FLOAT32);

After Adding, Populate the Column:

-- Generate embeddings for existing data
UPDATE products
SET description_vector = VECTOR_EMBEDDING(doc_model USING description AS data);

Dropping Vector Columns

-- Drop a specific vector column
ALTER TABLE products
DROP COLUMN description_vector;

-- Drop multiple columns
ALTER TABLE products
DROP (description_vector, image_vector);

Dropping Tables with Vector Columns

Tables containing vector columns can be dropped normally:

-- Drop table
DROP TABLE products;

-- Drop table with CASCADE CONSTRAINTS
DROP TABLE products CASCADE CONSTRAINTS;

-- Drop and purge from recycle bin
DROP TABLE products PURGE;

Prohibited Operations on Vector Columns

Oracle Database 23ai has specific restrictions on where and how vector columns can be used. Understanding these limitations is crucial for proper database design.

1. Table and Storage Restrictions

Cannot Define Vector Columns In:

External Tables:

-- This will fail
CREATE TABLE ext_vectors (
    id NUMBER,
    embedding VECTOR
)
ORGANIZATION EXTERNAL (...);

Note: As of Oracle Database 26ai, external tables CAN be created with VECTOR columns, allowing vector embeddings in text or binary format stored in external files to be rendered as the VECTOR data type. Check your database version for availability.

Index-Organized Tables (IOTs):

-- Cannot use as primary key
CREATE TABLE iot_vectors (
    embedding VECTOR PRIMARY KEY,  -- Not allowed
    data VARCHAR2(100)
)
ORGANIZATION INDEX;

-- Cannot use as non-key column either
CREATE TABLE iot_vectors (
    id NUMBER PRIMARY KEY,
    embedding VECTOR,  -- Not allowed
    data VARCHAR2(100)
)
ORGANIZATION INDEX;

Clusters and Cluster Tables:

-- Vectors cannot be part of clusters
CREATE CLUSTER vector_cluster (
    id NUMBER,
    embedding VECTOR  -- Not allowed
);

Global Temporary Tables:

-- This will fail
CREATE GLOBAL TEMPORARY TABLE temp_vectors (
    id NUMBER,
    embedding VECTOR  -- Not allowed
)
ON COMMIT DELETE ROWS;

Manual Segment Space Management (MSSM) Tablespaces:

Only the SYS user can create vectors as BasicFiles in MSSM tablespaces. Regular users should use Automatic Segment Space Management (ASSM) tablespaces:

-- Create ASSM tablespace for vectors
CREATE TABLESPACE vector_data
DATAFILE '/u01/app/oracle/oradata/vector_data01.dbf' SIZE 1G
SEGMENT SPACE MANAGEMENT AUTO;  -- Required for non-SYS users

2. Partitioning Restrictions

Partitioning and Sub-partitioning Keys:

-- Vector columns cannot be partition or sub-partition keys
CREATE TABLE sales_data (
    sale_id NUMBER,
    sale_date DATE,
    embedding VECTOR(384, FLOAT32)
)
PARTITION BY RANGE (sale_date)
SUBPARTITION BY HASH (embedding)  -- Not allowed
(...);

Vectors in Partitioned Tables (Allowed):

-- Vectors CAN exist in partitioned tables
CREATE TABLE sales_data (
    sale_id NUMBER,
    sale_date DATE,
    product_name VARCHAR2(200),
    embedding VECTOR(384, FLOAT32)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01')
);

3. Constraint Restrictions

Primary Keys:

CREATE TABLE vectors (
    embedding VECTOR PRIMARY KEY  -- Not allowed
);

Foreign Keys:

CREATE TABLE vectors (
    id NUMBER PRIMARY KEY,
    embedding VECTOR
);

CREATE TABLE related (
    id NUMBER PRIMARY KEY,
    vector_ref VECTOR REFERENCES vectors(embedding)  -- Not allowed
);

Unique Constraints:

CREATE TABLE vectors (
    id NUMBER PRIMARY KEY,
    embedding VECTOR UNIQUE  -- Not allowed
);

Check Constraints:

CREATE TABLE vectors (
    id NUMBER PRIMARY KEY,
    embedding VECTOR CHECK (embedding IS NOT NULL)  -- Not allowed
);

Default Values:

CREATE TABLE vectors (
    id NUMBER PRIMARY KEY,
    embedding VECTOR DEFAULT '[0,0,0,0]'  -- Not allowed
);

4. Column Modification Restrictions

Cannot Modify Vector Column Definition:

-- Cannot change dimensions or format
ALTER TABLE products
MODIFY (embedding VECTOR(512, FLOAT64));  -- Not allowed

-- Cannot change to/from VECTOR type
ALTER TABLE products
MODIFY (embedding CLOB);  -- Not allowed

Workaround - Add New Column:

-- Instead, add a new column
ALTER TABLE products
ADD embedding_new VECTOR(512, FLOAT64);

-- Repopulate by re-embedding with a model that outputs the new dimensions
-- ("model" is a placeholder for the name of your loaded model)
UPDATE products
SET embedding_new = VECTOR_EMBEDDING(model USING description AS data);

-- Drop old column
ALTER TABLE products DROP COLUMN embedding;

-- Rename new column
ALTER TABLE products RENAME COLUMN embedding_new TO embedding;

5. Index Restrictions

Non-Vector Indexes:

Vector columns cannot be part of traditional indexes:

-- B-tree index
CREATE INDEX idx_embedding ON products(embedding);  -- Not allowed

-- Bitmap index
CREATE BITMAP INDEX idx_embedding ON products(embedding);  -- Not allowed

-- Reverse key index
CREATE INDEX idx_embedding ON products(embedding) REVERSE;  -- Not allowed

-- Function-based index
CREATE INDEX idx_embedding ON products(UPPER(embedding));  -- Not allowed

Vector Indexes Only:

-- HNSW vector index
CREATE VECTOR INDEX idx_hnsw ON products(embedding)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- IVF vector index
CREATE VECTOR INDEX idx_ivf ON products(embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;

6. Other Restrictions

Continuous Query Notification (CQN):

Vector columns are not supported in CQN queries.

Comparison Operators:

Standard comparison operators cannot be used:

-- These will fail
SELECT * FROM products WHERE embedding = '[0.1, 0.2]';
SELECT * FROM products WHERE embedding > '[0.1, 0.2]';
SELECT * FROM products WHERE embedding < '[0.1, 0.2]';

Use Vector Distance Functions Instead:

-- Correct approach
SELECT product_name
FROM products
ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR('[0.1, 0.2, 0.3]'), COSINE)
FETCH FIRST 10 ROWS ONLY;
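
Conceptually, this query scores every row against the query vector and keeps the closest ones. The Python sketch below shows that exact (non-indexed) search on in-memory data. The product names and vectors are made up for illustration, and the function names are not part of any Oracle API.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(rows, query, k):
    """Return the k (distance, name) pairs closest to the query vector."""
    scored = ((cosine_distance(embedding, query), name) for name, embedding in rows)
    return heapq.nsmallest(k, scored)

products = [
    ("Laptop", [0.1, 0.8, 0.5]),
    ("Mouse", [0.9, 0.1, 0.0]),
    ("Keyboard", [0.2, 0.7, 0.4]),
]

query = [0.1, 0.8, 0.5]
for distance, name in top_k(products, query, k=2):
    print(f"{name}: {distance:.4f}")
```

An exact search like this touches every row, which is why large tables benefit from a vector index and FETCH APPROXIMATE.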

Best Practices for Vector DML and DDL

1. Design Considerations

-- Good: Specify dimensions and format upfront
CREATE TABLE products (
    product_id NUMBER PRIMARY KEY,
    name VARCHAR2(200),
    embedding VECTOR(384, FLOAT32)  -- Clear specification
);

-- Less optimal: flexible vectors are harder to optimize
CREATE TABLE products (
    product_id NUMBER PRIMARY KEY,
    name VARCHAR2(200),
    embedding VECTOR  -- May lead to performance issues
);

2. Batch Operations

-- Efficient: Batch update
UPDATE products
SET embedding = VECTOR_EMBEDDING(doc_model USING description AS data)
WHERE embedding IS NULL;

-- Less efficient: Row-by-row updates in loop

3. Transaction Management

-- Good practice: Commit after large operations
BEGIN
    INSERT INTO products_archive
    SELECT * FROM products WHERE created_date < DATE '2023-01-01';

    COMMIT;
END;
/

4. Use ASSM Tablespaces

-- Create tablespace with ASSM for vector tables
CREATE TABLESPACE vector_ts
DATAFILE '/u01/oradata/vector_ts01.dbf' SIZE 10G
AUTOEXTEND ON NEXT 1G MAXSIZE UNLIMITED
SEGMENT SPACE MANAGEMENT AUTO;

-- Create table in ASSM tablespace
CREATE TABLE products (
    product_id NUMBER PRIMARY KEY,
    embedding VECTOR(384, FLOAT32)
) TABLESPACE vector_ts;

5. Monitor Vector Column Usage

-- List the vector columns in your schema
SELECT 
    table_name,
    column_name,
    data_type,
    nullable
FROM user_tab_columns
WHERE data_type = 'VECTOR';

-- Check table size with vectors
SELECT 
    segment_name,
    segment_type,
    bytes/1024/1024 AS size_mb
FROM user_segments
WHERE segment_name = 'PRODUCTS';

Summary

DML Operations Supported:

INSERT vectors directly or with embedding generation
UPDATE vector columns
DELETE rows from tables with vectors
Load data using SQL*Loader
DML on tables with HNSW indexes (23.6+)

DDL Operations Supported:

Create tables with multiple vector columns
Different formats and dimensions per column
ADD vector columns to existing tables
DROP vector columns
DROP tables containing vectors

Key Restrictions:

External tables (except 26ai+)
Index-Organized Tables (IOTs)
Clusters and cluster tables
Global temporary tables
Partitioning and sub-partitioning keys
Primary keys, foreign keys, unique constraints
Check constraints, default values
Column modification (dimensions/format)
MSSM tablespaces (non-SYS users)
Non-vector indexes (B-tree, bitmap, etc.)
Standard comparison operators (=, >, <)

Understanding these operations and restrictions ensures you can effectively design and manage vector-enabled applications in Oracle Database 23ai, combining the power of AI embeddings with traditional relational data operations.

Data Platforms with Bruin

Ryan Giggs — Tue, 24 Feb 2026 09:04:29 +0000

Week 5 of Data Engineering Zoomcamp by @DataTalksClub complete

Just finished Module 5 - Data Platforms with Bruin. Learned how to:

✅ Build end-to-end ELT pipelines with Bruin
✅ Configure environments and connections
✅ Use materialization strategies for incremental processing
✅ Add data quality checks to ensure data integrity
✅ Deploy pipelines from local to cloud (BigQuery)

Modern data platforms in a single CLI tool - no vendor lock-in

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/bruin-taxi-pipeline-homework

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/