Forem: Prithvi S

Elasticsearch Cluster Health 101 – Deep Dive

Prithvi S — Tue, 05 May 2026 00:31:31 +0000

You ship your Elasticsearch cluster to production. Traffic spikes. Suddenly your dashboard flashes YELLOW. What does that mean? Are you about to lose data? Can you keep the service running? This guide teaches you how to read your cluster’s health signals, diagnose problems, and keep your data safe. It focuses on the architecture and coordination aspects of Elasticsearch, not on the low‑level search mechanics.

What Is Cluster Health?

Cluster health is the collective state of all nodes, shards, and data replication in an Elasticsearch cluster. Elasticsearch reports three health levels:

GREEN – All primary and replica shards are allocated and active.
YELLOW – All primary shards are active, but some replica shards are missing.
RED – One or more primary shards are unassigned, meaning some data is not searchable.

These states tell you whether the cluster can serve read requests, whether it has redundancy, and whether it can recover from failures.

Under the Hood – Cluster Coordination

Master‑eligible Nodes

The master‑eligible nodes maintain the cluster state – a compact JSON document that records shard locations, node metadata, and index settings. Only master‑eligible nodes can become the master node that orchestrates shard allocation and rebalancing.

Master Election

Elasticsearch uses a quorum‑based voting algorithm. A master is elected when a majority of master‑eligible nodes agree on the same node. This prevents split‑brain scenarios where two partitions each think they own the master.

Discovery

When a node starts, it contacts the configured discovery.seed_hosts (or uses the default unicast list) to locate other nodes. Once a master is found, the node receives the latest cluster state and registers itself.

Example Topology

Node A – master‑eligible
Node B – master‑eligible
Node C – master‑eligible
Node D – data
Node E – data
Node F – coordinating (client) – no data

The three master‑eligible nodes form a quorum. If one master fails, the remaining two can elect a new master without interruption.

Shard Allocation & Rebalancing

Primary and Replica Shards

Primary shard – receives write operations first.
Replica shard – read‑only copy of a primary; provides redundancy and additional query capacity.

Allocation Rules

Elasticsearch tries to spread shards across nodes to avoid a single point of failure. By default, it places at most one shard (primary or replica) of the same index on a node.

Unassigned Shards

A shard becomes unassigned when there is no node that satisfies its allocation rules. Common reasons:

The node that held the shard is down.
Disk usage on all nodes exceeds the cluster.routing.allocation.disk.threshold_enabled limit.
Allocation is manually disabled via settings.

Rebalancing

When a node fails, the master promotes a replica to primary and creates a new replica on another node. This process is asynchronous; it may take seconds to minutes depending on shard size and network bandwidth.

Allocation Awareness

You can tag nodes with attributes like rack, zone, or custom labels. Allocation filters (cluster.routing.allocation.awareness.attributes) ensure that replicas are spread across these domains, improving fault tolerance.

Reading the Health Endpoint

The simplest way to check cluster health is the GET /_cluster/health API.

GET /_cluster/health
{
  "cluster_name": "my‑cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 4,
  "active_primary_shards": 120,
  "active_shards": 240,
  "relocating_shards": 2,
  "initializing_shards": 0,
  "unassigned_shards": 4,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 3,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 96.0
}

Key fields:

status: overall health (green, yellow, red).
active_primary_shards: number of primary shards that are active.
unassigned_shards: shards that have no assigned node; a non‑zero value is a red flag for primary shards.
relocating_shards: shards moving to a new node during rebalancing.

You can wait for a specific health level before proceeding with operations:

GET /_cluster/health?wait_for_status=green&timeout=30s

If the cluster does not reach green within the timeout, the request returns a warning.

Diagnosing Common Issues

Red Cluster – Primary Shard Unassigned

Symptoms: status: red, unassigned_shards > 0.
Root causes:

Node failure (hardware crash, network loss).
Disk full on all nodes, preventing allocation.
Allocation disabled (cluster.routing.allocation.enable: none). Steps to fix:
Identify the missing primary shard:

   GET /_cat/shards?v&s=state&h=index,shard,prirep,state,unassigned.reason

Check node health and disk usage:

   GET /_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m,disk.avail

Free disk space or add a new node.
Enable allocation if it was disabled.
Run _cluster/reroute to force allocation if needed.

Yellow Cluster – Replica Shards Missing

Symptoms: status: yellow, active_shards equals active_primary_shards but active_shards_percent_as_number < 100.
Why it’s okay to serve queries: All primary shards are active, so data is readable. However you lack redundancy; a second node failure could cause data loss.
Fixes:

Wait for automatic rebalancing; new replicas are created as nodes become available.
Increase the number of data nodes or add capacity.
Check for allocation filters that might be preventing replica placement.

High CPU / Memory with Green Status

Observation: Cluster reports green but query latency is high.
Actions:

Examine shard size distribution: GET /_cat/shards?h=index,shard,prirep,size,node
Look for hot shards on a single node.
Review query patterns; use profile API to find slow queries.

Key Diagnostic Commands

Command	Purpose
`GET /_cluster/health`	Overall health snapshot
`GET /_cluster/health?wait_for_status=green&timeout=50s`	Block until cluster is green
`GET /_cat/nodes?v=true&h=name,ip,heap.percent,cpu,load_1m`	Node‑level resource usage
`GET /_cat/shards?health=red`	List shards causing red status
`GET /_cat/shards?health=yellow`	List shards pending replica allocation
`GET /_cat/indices?health=red`	Identify indices with missing primaries
`GET /_nodes/stats`	Detailed JVM, OS, and thread pool stats
`GET /_cluster/allocation/explain`	Explain why a particular shard is not allocated

Monitoring & Alerting Strategy

What to Alert On

Cluster status red – immediate pager.
Unassigned primary shards > 0 – critical, may indicate data loss.
Disk usage > 85% on any data node – block new shard allocation.
Heap usage > 80% – risk of out‑of‑memory errors.
Relocating shards > 10 for > 5 minutes – possible rebalancing stall.

Tools

Kibana dashboards – visualize cluster_health, node_stats, and shard_allocation.
Elastic Stack Alerting – create Watcher alerts that send Slack or email.
Prometheus + Grafana – scrape /metrics endpoint via the Exporter.

Sample Alert Rule (Kibana)

"condition": {
  "script": {
    "source": "params.status == 'red' || params.unassigned_shards > 0",
    "params": {
      "status": "{{ctx.payload.status}}",
      "unassigned_shards": {{ctx.payload.unassigned_shards}}
    }
  }
}

When the rule fires, you receive a notification with the cluster name and a link to the health API.

Common Mistakes & Fixes

Single master node – no quorum; a network partition can split the cluster. Fix: Deploy at least three master‑eligible nodes.
Ignoring yellow status – you lose redundancy. Fix: Treat yellow as a warning and add capacity or wait for rebalancing.
All shards on one node – no fault tolerance. Fix: Use default allocation rules or explicit awareness attributes.
Allocation disabled after a failure – shards stay unassigned. Fix: Re‑enable allocation (PUT /_cluster/settings).
Oversharding – too many small shards increase cluster state size and recovery time. Fix: Aim for 10‑50 GB per shard.

Conclusion

Cluster health is your safety net. Green means everything is replicated; yellow signals missing replicas; red means primary data is missing. Understanding the master election, shard allocation, and rebalancing process helps you react quickly when something goes wrong. Monitor the health API, set alerts on critical metrics, and keep an eye on disk and heap usage. With these practices you can keep your Elasticsearch deployment reliable and resilient.

References

Elasticsearch Documentation – Cluster APIs: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html
Elastic Stack Monitoring Guide: https://www.elastic.co/guide/en/observability/current/monitoring.html

Author: Prithvi S, Staff Software Engineer at Cloudera and Open‑source Enthusiast

The Life of a Search Query in OpenSearch

Prithvi S — Fri, 01 May 2026 13:10:44 +0000

OpenSearch is an open‑source search and analytics engine built on Apache Lucene. When you send a search request, a complex dance of components runs behind the scenes to turn a simple HTTP call into a ranked list of results. In this article we follow a query from the moment it hits the REST endpoint all the way to the final merged response, explaining each step in plain language while preserving the technical depth that engineers expect.

1. The Entry Point – REST Request

Everything starts with an HTTP request to the OpenSearch REST API, typically a GET /my-index/_search with a JSON body describing the query. The request can include parameters such as size, from, and sorting directives. The client can be anything from curl to a Python SDK, but the wire format is always the same: JSON over HTTP.

OpenSearch runs a lightweight HTTP server that parses the request and hands it over to the coordinating node – the node that received the request. In a cluster, any node can act as the coordinating node; it does not need to store data.

2. Routing the Request – Shard Selection

OpenSearch stores data in shards – Lucene indexes that are distributed across the cluster. Each document is assigned to a primary shard based on a routing value, which defaults to the document _id. The routing formula is:

hash(routing) % number_of_primary_shards

The coordinating node runs this hash function to determine which primary shards are responsible for the data the query touches. For a simple term query across a single index, the coordinating node may need to contact all primary shards of that index. If the query targets a specific routing value, the set of shards can be reduced dramatically, improving latency.

3. Query Phase – Parallel Execution on Shards

Once the responsible shards are known, the coordinating node forwards the query to the shard nodes (which may be the same physical node or a different one). Each shard executes the query locally against its Lucene segments. This phase has two important sub‑steps:

3.1 Segment Search

Lucene stores data in immutable segments. During the query phase, each segment is searched independently. OpenSearch can search segments in parallel within a shard – a feature called concurrent segment search introduced in version 3.0. The engine decides automatically how many slices to create based on CPU cores and segment size.

3.2 Scoring with BM25

For each matching document, Lucene computes a relevance score using the BM25 algorithm. The key parameters are term frequency, inverse document frequency, and the b length‑normalisation factor (default 0.75). The shard returns the top‑k (default 10) documents along with their scores.

4. Fetch Phase – Getting Full Documents

The query phase only returns document IDs and scores. If the client also requested the _source field (which most do), a second round called the fetch phase runs. The coordinating node asks each shard for the full source of the selected documents. Shards retrieve the stored _source from the Lucene stored fields and send it back.

Because the fetch phase may involve moving larger payloads over the network, OpenSearch tries to keep the number of fetched documents small. This is why pagination (from/size) and stored_fields filters are important performance knobs.

5. Merging Results – The Coordinating Node’s Role

After receiving the top‑k results from each shard, the coordinating node merges them into a single ranked list. It re‑applies the global size and from parameters, then sorts the combined set based on the BM25 scores returned by each shard. If the query includes custom sorts, the coordinating node also applies those rules.

The final merged list is then formatted as a JSON response and sent back to the client.

6. Behind the Scenes – Refresh, Translog, and Near‑Real‑Time Search

While the query is being processed, OpenSearch maintains a near‑real‑time view of the data. New documents are first written to an in‑memory buffer and appended to the translog for durability. Every second (the default refresh interval) the buffer is flushed to a new Lucene segment, making the freshly indexed documents searchable. This means there is typically a < 1‑second lag between indexing and visibility in search results.

7. Practical Tips for Optimising Queries

Issue	Why it Happens	Mitigation
Slow query latency	Too many shards queried, high segment count	Use routing, configure `index.routing_partition_size`, force‑merge to reduce segments
High CPU usage	Concurrent segment search on large shards	Tune `search.max_concurrent_shard_requests` and `search.max_concurrent_segments`
Stale results	Refresh interval too large for real‑time needs	Reduce `index.refresh_interval` on hot indices
Large payloads	Fetching full `_source` for many docs	Use `stored_fields` or `docvalue_fields`, limit `size`

8. Conclusion

A search query in OpenSearch is more than a simple HTTP call. It involves routing, parallel shard execution, scoring, optional fetching, and a final merge step that stitches everything together. Understanding each stage helps you design better schemas, tune performance, and avoid common pitfalls such as unnecessary shard scans or excessive refresh intervals.

By visualising the journey of a query, you gain the confidence to diagnose latency issues, choose the right indexing strategies, and make the most of OpenSearch’s powerful plugin and analysis ecosystems.

Author bio: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. Follow my work on GitHub: https://github.com/iprithv

Credential Vending in Apache Polaris: Securing Data Access Without Sharing Keys

Prithvi S — Thu, 30 Apr 2026 00:30:40 +0000

Introduction

In modern data architectures, managing who can access what data is a constant challenge. Traditional approaches rely on long‑lived access keys or service accounts that are difficult to rotate and can become a security liability. Apache Polaris tackles this problem head‑on with a built‑in credential vending mechanism. Instead of distributing static keys, Polaris mints short‑lived, scoped credentials on demand, giving each request exactly the permissions it needs and expiring them after a few minutes.

This post walks through the design, implementation, and benefits of credential vending in Polaris. It also shows how the feature integrates with the rest of the system, discusses best practices, and provides a practical example of using the API.

Why Credential Vending?

Data engineers and scientists often need to read or write to cloud storage (S3, GCS, Azure) as part of their pipelines. Giving them permanent access keys creates several problems:

Key leakage – a single compromised key can expose an entire bucket.
Rotation overhead – keys must be rotated regularly, which is operationally heavy.
Principle of least privilege – static keys usually have broad permissions, violating least‑privilege best practices.

Credential vending solves these issues by generating short‑lived, scoped tokens that are tied to a specific operation (read‑only, read‑write) and a narrow resource path. Tokens expire after a configurable period (default ~15 minutes) and can be revoked instantly if needed.

Architecture Overview

Below is a high‑level diagram of the credential vending flow (illustrated with a professional image from Unsplash – placeholder):

Client Request – An engine (Spark, Flink, Trino) sends an HTTP request to Polaris to perform an action on a table.
Auth Check – Polaris authorizes the request using its two‑tier RBAC model.
Storage Lookup – The system determines which cloud storage backend backs the catalog (S3, GCS, Azure).
Credential Minting – Polaris calls the cloud provider’s token service (AWS STS, GCS token API, Azure AD) to create a temporary token with the exact permissions required.
Response – The temporary credential is returned to the client, which uses it for the subsequent data operation.

Deep Dive: How Polaris Mints Credentials

1. Authorization Layer

Polaris first evaluates the request against its RBAC model. The model consists of:

Principal Roles – assigned to users, service accounts, or automated agents.
Catalog Roles – define privileges on catalog objects (e.g., TABLE_READ_DATA, TABLE_WRITE_DATA).
PolarisAuthorizer – resolves the effective privileges for the request.

Only if the request has the required privilege does Polaris proceed to credential vending.

2. Storage Integration

Polaris supports three major cloud storage providers via the PolarisStorageIntegration interface. Each implementation knows how to:

Translate a credential scope (e.g., s3://my-bucket/path/) into a provider‑specific request.
Call the provider’s temporary credential service.
Apply any additional constraints (IP allow‑list, expiration window).

AWS Example

AssumeRoleRequest req = AssumeRoleRequest.builder()
    .roleArn(storageConfig.getAwsRoleArn())
    .durationSeconds(900) // 15 minutes
    .policy(scopedPolicy) // restrict to specific bucket/prefix
    .build();
Credentials creds = stsClient.assumeRole(req).credentials();

GCS Example

GoogleCredentials scoped = GoogleCredentials.createFromSecret(
    storageConfig.getServiceAccountJson())
    .createScoped(List.of("https://www.googleapis.com/auth/devstorage.read_write"))
    .createDelegated(storageConfig.getServiceAccountEmail());
AccessToken token = scoped.refreshAccessToken();

3. Token Construction and Caching

After receiving the provider token, Polaris wraps it in a PolarisCredential object that includes:

Provider name (aws, gcs, azure)
Expiration timestamp
Scoped resource path
Original request ID for tracing

Polaris also caches tokens for a short window to reduce provider calls when identical scopes are requested repeatedly.

Benefits in Real‑World Deployments

Benefit	Description
Reduced Blast Radius	Compromise of a short‑lived token limits exposure to a few minutes and a narrow path.
Automatic Revocation	Tokens expire automatically; administrators can also invalidate the cache to force re‑minting.
Compliance Friendly	Auditable token issuance logs simplify regulatory reporting.
Operational Simplicity	No need to rotate static keys; credential lifecycle is managed by Polaris.

Practical Example: Reading a Table from Spark

import org.apache.polaris.client.PolarisClient
import org.apache.spark.sql.SparkSession

val polaris = PolarisClient.builder()
  .endpoint("https://polaris.mycompany.com")
  .authToken("Bearer <user‑jwt>")
  .build()

val cred = polaris.getTemporaryCredential(
  catalog = "analytics",
  namespace = "sales",
  table = "transactions",
  privilege = "TABLE_READ_DATA"
)

// Spark can now read directly using the temporary S3 credentials
val df = SparkSession.builder()
  .appName("PolarisDemo")
  .getOrCreate()

df.read
  .format("iceberg")
  .option("fs.s3a.access.key", cred.accessKey)
  .option("fs.s3a.secret.key", cred.secretKey)
  .option("fs.s3a.session.token", cred.sessionToken)
  .load("s3://my‑bucket/analytics/sales/transactions")

df.show()

The Spark job never sees a permanent AWS key; it receives a scoped token that expires after 15 minutes.

Best Practices for Using Credential Vending

Limit Scope Aggressively – Include the bucket and prefix that the request truly needs.
Set Short Expiration – Default of 5‑15 minutes is usually sufficient for a data pipeline step.
Cache Wisely – Enable short‑term caching to reduce provider latency, but ensure cache invalidation on role changes.
Monitor Token Usage – Polaris logs each token issuance; integrate with your observability stack to detect anomalies.
Rotate Underlying IAM Roles – Even though tokens are short‑lived, the underlying IAM role should be rotated periodically.

Conclusion

Apache Polaris’ credential vending mechanism provides a modern, secure alternative to static access keys. By issuing short‑lived, scoped tokens on demand, Polaris reduces the attack surface, simplifies compliance, and aligns with the principle of least privilege. As data pipelines continue to scale and integrate with multiple cloud providers, such dynamic credential management becomes a cornerstone of a robust data governance strategy.

If you want to try it yourself, check out the Polaris GitHub repository and the official documentation. Feel free to reach out with questions or share your own experiences – secure data access is a community effort.

Author Bio: I'm Prithvi S, Staff Software Engineer at Cloudera and Open‑source Enthusiast. Follow my work on GitHub: https://github.com/iprithv

Improving Search Relevance in OpenSearch: A Practical Guide for Engineers

Prithvi S — Mon, 27 Apr 2026 14:03:41 +0000

OpenSearch powers search at scale for many organizations, but raw relevance scores often need fine‑tuning to match user expectations. The Search Relevance plugin gives engineers a structured way to evaluate, experiment, and improve relevance without writing custom code for every change. In this post we walk through a complete workflow: from defining a query set to running experiments, measuring metrics, and applying the insights to boost search quality.

1. Why Search Relevance Matters

Even the most powerful search engine can return results that feel irrelevant if the underlying scoring models are not aligned with the business problem. Users expect the most useful documents at the top, and a mismatch can increase bounce rates, reduce conversions, and erode trust. The Search Relevance plugin provides a repeatable process to measure relevance, experiment with configurations, and iterate based on data‑driven metrics.

2. Core Concepts

2.1 Query Sets

A query set is a collection of representative user queries that you want to evaluate. Each entry includes the query text, optional filters, and a unique identifier. Building a good query set is critical: it should cover the most common intents, edge cases, and any domain‑specific terminology.

2.2 Experiments

An experiment ties a query set to one or more search configurations. A configuration may adjust analyzers, boosting rules, or ranking functions. Experiments run the queries against each configuration and collect judgments for every result.

2.3 Judgments

Judgments capture the perceived relevance of a document for a given query. They can be explicit (human annotators rating relevance on a scale) or implicit (click‑through, dwell time). The plugin stores judgments in internal system indexes, making them available for metric computation.

3. Building a Search Quality Evaluation Pipeline

The following steps outline a practical pipeline that you can adapt to any OpenSearch cluster.

3.1 Create the Query Set

### Example JSON for a query set stored in index .search-relevance-query-set
curl -X POST "http://localhost:9200/.search-relevance-query-set/_doc" \
  -H 'Content-Type: application/json' -d '{
    "name": "ecommerce‑top‑queries",
    "queries": [
      {"id": "q1", "query": "wireless headphones"},
      {"id": "q2", "query": "best laptop for developers"},
      {"id": "q3", "query": "budget travel insurance"}
    ]
  }'

3.2 Define Search Configurations

Each configuration lives in index .search-relevance-config. You can adjust analyzers, boost fields, or enable function scoring.

curl -X POST "http://localhost:9200/.search-relevance-config/_doc" \
  -H 'Content-Type: application/json' -d '{
    "name": "baseline",
    "settings": {"boost": 1.0},
    "analyzer": "standard"
  }'

Create additional configs (e.g., custom‑boost‑titles) to compare against the baseline.

3.3 Launch the Experiment

curl -X POST "http://localhost:9200/.search-relevance-experiment/_doc" \
  -H 'Content-Type: application/json' -d '{
    "name": "Q1‑baseline‑vs‑custom",
    "query_set": "ecommerce‑top‑queries",
    "configs": ["baseline", "custom‑boost‑titles"],
    "type": "PAIRWISE_COMPARISON"
  }'

The experiment moves to RUNNING state and the plugin orchestrates query execution across the selected configs.

4. Collecting Judgments

The UI of the Search Relevance Dashboards presents side‑by‑side results for each config. Evaluators choose a relevance rating (e.g., 0‑3) for each document. Implicit signals such as click‑through are also recorded automatically.

Tip: Use a small group of domain experts for explicit judgments and supplement with implicit data from production traffic.

5. Computing Metrics

After the experiment completes, the plugin calculates common relevance metrics:

nDCG (Normalized Discounted Cumulative Gain) – accounts for position bias and relevance grades.
Precision@k – proportion of relevant results in the top k.
Recall@k – coverage of relevant documents in the top k.
MRR (Mean Reciprocal Rank) – average of the reciprocal rank of the first relevant result.

The results are stored in index .search-relevance-metrics and can be queried via the REST API.

6. Interpreting the Results

A typical output looks like:

config               nDCG   Precision@5   MRR
baseline             0.72   0.48           0.55
custom‑boost‑titles   0.79   0.55           0.62

Higher scores indicate better alignment with the judged relevance. In this example, boosting titles improves nDCG by 0.07 and lifts Precision@5 by 7 percentage points.

7. Applying the Findings

7.1 Update Index Settings

Based on the metrics, you might adjust the search.analyzer or add custom scoring scripts. For the title‑boost example, you could add a field‑level boost in the query DSL.

"query": {
  "function_score": {
    "query": {"match": {"title": "{{query}}"}},
    "field_value_factor": {"field": "title_boost", "factor": 1.5}
  }
}

7.2 Re‑run the Experiment

After applying changes, rerun the experiment to verify the impact. This iterative loop ensures that each tweak produces measurable gains.

8. Real‑World Case Study: Fixing Bad Results

A media platform noticed that users searching for "latest tech news" were frequently seeing older articles. The team:

Defined a query set focused on timeliness.
Ran a baseline experiment and observed low nDCG (0.48).
Added a recency boost using a decay function.
Re‑ran the experiment and saw nDCG rise to 0.73.
Deployed the new config to production, resulting in a 15 % increase in click‑through rate for the targeted queries.

9. Best Practices

Keep query sets small but representative – 30‑50 queries often provide enough signal.
Use both explicit and implicit judgments – they complement each other.
Version your search configurations – the plugin stores them as immutable objects, making rollback trivial.
Automate metric extraction – integrate a CI step that fetches the latest metrics after each experiment.
Document decisions – store rationale in the experiment description field for future reference.

10. Conclusion

Search relevance is an ongoing discipline, not a one‑time setting. The OpenSearch Search Relevance plugin gives you a repeatable, data‑driven process to measure and improve relevance. By defining clear query sets, running controlled experiments, and acting on concrete metrics, you can turn vague user complaints into quantifiable improvements.

About the author

Prithvi S – Staff Software Engineer at Cloudera and Open‑source enthusiast. Follow my work on GitHub: https://github.com/iprithv

The Metadata Tree: How Apache Iceberg Tracks Everything Without a Database

Prithvi S — Fri, 24 Apr 2026 00:31:47 +0000

If you've worked with petabyte-scale data lakes, you've felt the pain: tables fragment into thousands of small files, query engines can't find what they need, schema changes become nightmares, and concurrent writes collide silently. Apache Iceberg solves these problems with something deceptively simple: a metadata hierarchy that tracks every file, every change, and every snapshot without needing a centralized database.

In this post, we'll walk through Iceberg's architecture from the ground up. By the end, you'll understand why Netflix built this, how it works, and why it's becoming the standard for analytics across Spark, Trino, Flink, and beyond.

The Problem: Why Big Data Tables Are So Fragile

Before Iceberg, data lakes used simple file layouts (Parquet/ORC on HDFS or S3). Sounds fine in theory. In practice:

Atomicity fails. A write dies halfway through. Are those partial files valid? Which ones should queries skip?
Partition discovery is slow. Query engines scan directory listings to find what data exists. With millions of files, this takes minutes.
Schema changes break old files. Add a column, and now old data files don't have it. Readers fail or invent NULLs.
Concurrent writes collide. Two writers claim they "finished" the same table simultaneously. Which one won? Corruption.
Pruning is guesswork. Engines don't know column ranges within files. They scan everything.

Hive, Delta, and early Iceberg prototypes tried band-aids (metadata services, loose conventions). They all had the same fundamental issue: the catalog was either centralized (slow, inconsistent) or decentralized (unreliable).

Iceberg took a different approach: immutable snapshots and a metadata hierarchy, with a single atomic pointer as the source of truth.

Meet the Metadata Tree

Here's the insight: instead of asking "what files exist?", Iceberg asks "what does this snapshot contain?". A snapshot is an immutable point-in-time view of the table. Every write creates a new snapshot. Every snapshot points to a set of files. And Iceberg tracks all of this through a beautifully layered hierarchy.

Let's build it from the bottom up:

Layer 1: Data Files (Parquet/ORC/Avro)

The base layer is simple: actual data. Files contain table rows in columnar format (Parquet is standard, ORC supported, Avro optional).

s3://my-warehouse/my-table/data/
  00000-abc123.parquet  (1 GB, 10M rows, partition_date=2024-01-01)
  00001-def456.parquet  (950 MB, 9.5M rows, partition_date=2024-01-01)
  00002-ghi789.parquet  (1.1 GB, 11M rows, partition_date=2024-01-02)

Key property: These files are write-once. Once created, they never change. If you delete a row, you don't rewrite the file (we'll see how Iceberg handles deletes later). This makes them safe for concurrent reads and enables efficient caching.

Layer 2: Manifest Files (Track Data Files and Stats)

Now here's where Iceberg gets clever. Instead of listing files via S3 API calls (slow, inconsistent), Iceberg writes a manifest file for each snapshot. A manifest is a Parquet file that lists all data files in that snapshot along with their metadata.

For each data file, a manifest stores:

File path: s3://my-warehouse/my-table/data/00000-abc123.parquet
File format: parquet
File size: 1,073,741,824 bytes
Partition info: {partition_date: 2024-01-01} (as integers, not strings)
Record count: 10,000,000
Column stats: min/max/null_count for each column
- user_id: min=1000, max=9999999, null_count=0
- amount: min=0.01, max=9999.99, null_count=1200
- timestamp: min=2024-01-01 00:00:00, max=2024-01-01 23:59:59, null_count=0

Why stats? Because query engines use them for predicate pushdown. If you ask "SELECT * WHERE amount > 5000", the engine checks the manifest: "Does any file have max amount >= 5000?". If no, skip the entire file.

s3://my-warehouse/my-table/metadata/
  manifests/
    20240501-000001-abc.avro  (manifest for snapshot 1, 500 KB)
    20240502-000002-def.avro  (manifest for snapshot 2, 510 KB)

Multiple snapshots = multiple manifests.

Layer 3: Manifest List (Snapshot-Level Index)

One manifest per snapshot might be thousands of files. Iceberg adds another layer: the manifest list. This is a small Parquet file that lists which manifests belong to a given snapshot.

For each manifest, it stores:

Manifest file path: s3://my-warehouse/my-table/metadata/manifests/20240501-000001-abc.avro
Manifest length: 500,000 bytes
Partition stats: min/max partition values across all files in this manifest
- partition_date: min=2024-01-01, max=2024-01-01
Record count: Sum of all files in this manifest

So if your snapshot includes 100 manifests, the manifest list is a ~50 KB file that summarizes all of them. Engines can scan the manifest list to decide which manifests to read.

s3://my-warehouse/my-table/metadata/
  snap-20240501000001.avro  (manifest list for snapshot 1, 25 KB)
  snap-20240502000002.avro  (manifest list for snapshot 2, 26 KB)

Layer 4: Metadata File (Table Schema, History, Current Snapshot)

The metadata file is a JSON file that contains the entire truth about the table:

{
  "format-version": 2,
  "table-uuid": "12345-abcde",
  "location": "s3://my-warehouse/my-table",
  "last-updated-ms": 1714605600000,
  "last-column-id": 5,
  "schema": {
    "type": "struct",
    "fields": [
      {"id": 1, "name": "user_id", "required": true, "type": "long"},
      {"id": 2, "name": "email", "required": true, "type": "string"},
      {"id": 3, "name": "amount", "required": false, "type": "double"},
      {"id": 4, "name": "created_at", "required": true, "type": "timestamp"},
      {"id": 5, "name": "country", "required": false, "type": "string"}
    ]
  },
  "partition-spec": [
    {"name": "date_partition", "transform": "year", "source-id": 4}
  ],
  "current-snapshot-id": 2,
  "snapshots": [
    {
      "snapshot-id": 1,
      "timestamp-ms": 1714519200000,
      "summary": {"operation": "append", "added-files": "100"},
      "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240501000001.avro"
    },
    {
      "snapshot-id": 2,
      "timestamp-ms": 1714605600000,
      "summary": {"operation": "append", "added-files": "5"},
      "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240502000002.avro"
    }
  ],
  "properties": {
    "write.format.default": "parquet"
  }
}

See current-snapshot-id: 2? That pointer tells engines: "If you want the latest table, load snapshot 2, which points to this manifest list, which lists these manifests, which contain these files."

The metadata file also contains:

Schema: Column definitions and IDs (more on this later)
Partition spec: How to partition new data (year, month, day, hour, bucket, truncate)
Snapshot history: Every snapshot ever created, its timestamp, and what manifest-list it uses
Properties: Table config (compression, default format, etc.)

s3://my-warehouse/my-table/metadata/
  00000.json  (metadata file v0, created at table init)
  00001.json  (metadata file v1, created after first write)
  00002.json  (metadata file v2, created after second write)

Each write creates a new metadata file.

Layer 5: Catalog Pointer (The Only Mutable Piece)

Finally, the catalog. This is where Iceberg gets its reliability. The catalog is a simple mapping:

table_name -> current_metadata_file_location

For example:

my-table -> s3://my-warehouse/my-table/metadata/00002.json

That's it. One pointer. Everything else is immutable.

How does the catalog live? It doesn't live in S3. It lives in a catalog service:

Hive metastore: Legacy, still widely used
AWS Glue: Managed, but slower
Nessie: Git-like, multi-branch table versioning
REST catalog: HTTP-based, engine-agnostic (Apache Polaris)
Iceberg catalog: In-memory, for testing

When you write to a table:

Load current metadata file (via catalog pointer)
Generate new data files and manifests
Create a new metadata file that points to the new manifests
Atomically update the catalog pointer from old metadata to new metadata

Step 4 is the secret sauce. Most catalog services support atomic CAS (compare-and-swap) operations. If two writers race, only one CAS succeeds. The loser retries with the new current state. This gives you serializable isolation without distributed locks.

Putting It Together: A Write Operation

Let's trace a real write to see all five layers in action.

Initial state:

Catalog: my-table -> metadata/00000.json
Metadata 00000: current-snapshot-id=1, 100 data files
Snapshot 1: manifest-list points to 10 manifests

Writer appends 5 new files:

Writer generates 5 new Parquet files in the data directory
Writer creates a new manifest (manifest v1) listing these 5 files + stats
Writer creates a new manifest-list (snap-2) that includes both old manifests and the new one
Writer creates new metadata file (00001.json):
- current-snapshot-id: 2
- snapshot 2 points to new manifest-list
- schema unchanged
- partition-spec unchanged
Atomic CAS: Writer tells catalog: "Update my-table to point to metadata/00001.json (only if it still points to 00000.json)"
Catalog confirms: pointer updated
Write succeeds. Old metadata file (00000.json) stays on disk for history/time travel

Concurrent reader:

Sees catalog still pointing to 00000.json (old snapshot)
Reads that snapshot's manifest-list
Reads 10 manifests, finds 100 files
Scans those 100 files
Write doesn't interfere; reader gets consistent view

Later reader:

Sees catalog now pointing to 00001.json
Reads snapshot 2's manifest-list
Finds 11 manifests (10 old + 1 new)
Reads 105 files total
Sees the 5 new rows

Why This Design Is Brilliant

1. Atomicity without transactions: Only the catalog pointer moves. Everything else is immutable. No metadata locks, no two-phase commit.

2. Snapshot isolation: Readers hold a reference to a metadata file (or snapshot ID). Old snapshots never change. Even if 10 new writes happen, that reader's view stays frozen.

3. Time travel: Query engine wants rows from May 1st? Load the snapshot from May 1st, read that manifest-list and files. No replaying transactions; the snapshot already exists.

4. File pruning: Manifest files store column stats. Query engines skip files without matching data before ever touching S3/HDFS. Orders of magnitude faster than directory scans.

5. Partition evolution: Want to change from daily to hourly partitions? Create new metadata file with new partition-spec. Old data files keep old partition values (stored in manifest). New files use new partition values. Readers handle both transparently.

6. Schema evolution: Columns are identified by ID, not name. Rename a column? New metadata file with updated schema. Old files still have ID 1 for user_id; readers understand both.

7. Concurrent writes at scale: Thousands of writers, zero coordination. Optimistic locking via metadata CAS. If you have 10% write collision rate, 99% of writers succeed on first try.

The Performance Trade-off

This architecture does have a cost: metadata reads. Every query must:

Load metadata file (JSON from S3)
Load manifest-list (Parquet)
Load relevant manifests (Parquet)
Filter down to data files

For a table with millions of files, even with parallelism, this adds 100-500ms to query startup. This is why Iceberg works best with a warm metadata cache (many query engines cache manifest files in memory) and why partition pruning is critical (manifest-list stores partition bounds, so you can skip entire manifest groups).

Next Steps in the Iceberg Journey

Now that you understand the metadata tree, you're ready for:

Schema evolution: How field IDs make ALTER TABLE safe
Hidden partitioning: How Iceberg auto-applies transforms (year, month, day, etc.)
Delete semantics: Position deletes vs equality deletes, Copy-on-Write vs Merge-on-Read
Optimistic concurrency: How CAS prevents write collisions
Time travel: Querying historical snapshots by timestamp

Conclusion

Apache Iceberg's genius is in its simplicity: a five-layer metadata tree with one atomic pointer at the top. No central database. No complex consistency protocols. Just immutable files, clever indexing, and compare-and-swap semantics.

If you're building data platforms at scale, this is the architecture you should understand. Whether you're using Spark, Trino, Flink, or any other engine, Iceberg's design enables correctness, performance, and flexibility that older table formats simply can't match.

Start with a test table in your warehouse. Try time travel. Try schema evolution. Feel how different Iceberg is. Then you'll really get why it matters.

Want to dive deeper? Check out the Apache Iceberg spec at https://iceberg.apache.org/spec/ and the source code at https://github.com/apache/iceberg.

About the Author

I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: https://github.com/iprithv

How OpenSearch Plugins Really Work: Architecture & Extension Points

Prithvi S — Thu, 23 Apr 2026 15:05:38 +0000

OpenSearch is powerful out of the box, but its true flexibility comes from plugins. Yet most developers treat plugins as black boxes: you install them, they work, and you move on. But what if you need to build one? Or understand why a plugin broke after an upgrade? Or design a system that integrates with OpenSearch's plugin ecosystem?

In this post, I'll walk you through how plugins actually work: compilation, packaging, installation, and the extension points that make customization possible. By the end, you'll understand the mechanics well enough to build your own.

The Plugin Lifecycle: From Source to Running Code

Step 1: Writing and Compiling a Plugin

A plugin is a Java project with dependencies on OpenSearch core. At minimum, you need:

dependencies {
    compileOnly "org.opensearch:opensearch:${opensearch_version}"
}

That compileOnly is critical: your plugin compiles against OpenSearch, but doesn't bundle it. The plugin will run inside the OpenSearch JVM, using the host's core libraries.

Your plugin entry point is a class that extends Plugin. For example:

public class MyCustomPlugin extends Plugin implements SearchPlugin {
    @Override
    public List<QuerySpec<?>> getQueries() {
        return Collections.singletonList(
            new QuerySpec<>(MyCustomQuery.NAME, MyCustomQuery::new, p -> MyCustomQuery.fromXContent(p))
        );
    }
}

This simple declaration tells OpenSearch: "I provide a custom query type called my_custom_query."

Step 2: Building the Plugin Artifact

When you run gradle build, you produce a .zip file containing:

my-plugin-1.0.0.zip
├── opensearch-plugin-descriptor.properties
├── lib/
│   ├── my-plugin-1.0.0.jar
│   └── my-dependencies.jar (if any third-party libs needed)
├── bin/ (optional: scripts)
└── config/ (optional: default settings)

The opensearch-plugin-descriptor.properties file is the plugin manifest:

name=my-custom-plugin
description=My custom query plugin
version=1.0.0
opensearch.version=2.13.0
java.version=11
classname=com.example.MyCustomPlugin

This manifest declares: which OpenSearch version the plugin targets, what Java version it needs, and crucially, the entry point class name.

Step 3: Installation via the opensearch-plugin Tool

You install via CLI:

./bin/opensearch-plugin install file:///path/to/my-plugin-1.0.0.zip

The tool does several things:

Verifies the manifest — reads opensearch-plugin-descriptor.properties
Version checks — ensures plugin targets the installed OpenSearch version
Extracts — unpacks to plugins/my-custom-plugin/
Loads classes — prepares the plugin for JVM loading
Restarts the node — required to load the plugin code

After restart, your plugin code is live.

Class Loader Isolation and Bootstrap

Here's where it gets interesting. Your plugin code runs in the same JVM as OpenSearch core. How does OpenSearch prevent your plugin from accidentally (or maliciously) breaking core?

Class Loader Isolation:

OpenSearch uses a custom PluginClassLoader for each plugin. This loader is a child of the core class loader, but has its own namespace:

Core classes (org.opensearch.*) resolve from the main class loader
Plugin classes resolve from the plugin's class loader first
If a class isn't found in the plugin loader, it falls back to core

This prevents version conflicts. If your plugin wants to use a specific version of a library, it can bundle it, and its class loader will find that version first without conflicting with core.

Bootstrap Contract:

When OpenSearch starts, it:

Discovers all plugins in plugins/ directory
Reads each plugin's descriptor
Creates a PluginClassLoader for each
Instantiates each plugin's entry point class via reflection
Calls lifecycle methods: onIndexModule(), onNodeStarted(), etc.

If a plugin fails to load, OpenSearch will refuse to start. This is intentional: it's safer to fail loudly than to silently omit a plugin that applications might depend on.

Extension Points: How Plugins Hook Into OpenSearch

A plugin doesn't have direct access to internal OpenSearch code. Instead, it implements well-defined extension point interfaces. OpenSearch discovers these implementations and calls them at the right moments.

SearchPlugin: Custom Query Types and Aggregations

The most common extension point for search-focused plugins:

public class MySearchPlugin extends Plugin implements SearchPlugin {
    @Override
    public List<QuerySpec<?>> getQueries() {
        // Register custom query types
        return Collections.singletonList(
            new QuerySpec<>(MyQuery.NAME, MyQuery::new, p -> MyQuery.fromXContent(p))
        );
    }

    @Override
    public List<AggregationSpec> getAggregations() {
        // Register custom aggregations
        return Collections.singletonList(
            new AggregationSpec(MyAggregation.NAME, MyAggregation::new, p -> MyAggregation.parse(p))
        );
    }

    @Override
    public List<ScoreFunctionSpec<?>> getScoreFunctions() {
        // Register custom scoring functions
        return Collections.singletonList(
            new ScoreFunctionSpec<>(MyScoreFunction.NAME, MyScoreFunction::new, p -> MyScoreFunction.parse(p))
        );
    }
}

Once registered, your custom query is available via the REST API:

GET /my-index/_search
{
  "query": {
    "my_custom_query": {
      "field": "title",
      "boost": 2.0
    }
  }
}

ActionPlugin: Custom REST and Transport Actions

For plugins that need custom REST endpoints or transport operations:

public class MyActionPlugin extends Plugin implements ActionPlugin {
    @Override
    public List<ActionHandler<?, ?>> getActions() {
        return Collections.singletonList(
            new ActionHandler<>(MyAction.INSTANCE, TransportMyAction.class)
        );
    }

    @Override
    public List<RestHandler> getRestHandlers(Settings settings, RestController restController, 
            ClusterSettings clusterSettings, IndexScopedSettings indexScopedSettings,
            SettingsFilter settingsFilter, List<NamedWriteableRegistry> namedWriteableRegistries,
            List<NamedXContentRegistry> namedXContentRegistries, Supplier<DiscoveryNodes> nodesInCluster,
            Supplier<ClusterState> clusterStateSupplier) {
        return Collections.singletonList(
            new RestMyHandler()
        );
    }
}

Now you can hit a custom endpoint:

POST /_plugin/my-action
{
  "param1": "value"
}

MapperPlugin: Custom Field Types

If you need a new field type (beyond standard text, keyword, numeric, etc.):

public class MyMapperPlugin extends Plugin implements MapperPlugin {
    @Override
    public Map<String, Mapper.TypeParser> getMappers() {
        return Collections.singletonMap(
            "my_custom_field",
            (name, node, parserContext) -> new MyCustomFieldMapper(name, parserContext)
        );
    }
}

Now you can use it in mappings:

PUT /my-index
{
  "mappings": {
    "properties": {
      "custom_field": {
        "type": "my_custom_field",
        "analyzer": "standard"
      }
    }
  }
}

EnginePlugin: Custom Lucene Behavior

For advanced use cases, you can hook into the Lucene engine itself:

public class MyEnginePlugin extends Plugin implements EnginePlugin {
    @Override
    public Optional<EngineFactory> getEngineFactory(IndexSettings indexSettings) {
        return Optional.of(config -> new MyCustomEngine(config));
    }
}

IngestPlugin: Custom Processors

For plugins that process documents during ingestion:

public class MyIngestPlugin extends Plugin implements IngestPlugin {
    @Override
    public Map<String, Processor.Factory> getProcessors(Processor.Parameters parameters) {
        return Collections.singletonMap(
            "my_processor",
            (factories, tag, config) -> new MyIngestProcessor(tag, config)
        );
    }
}

Use in pipeline:

PUT /_ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "my_processor": {
        "field": "content"
      }
    }
  ]
}

Real-World Example: The Search Relevance Plugin

OpenSearch's own search-relevance plugin demonstrates these concepts in action. It provides:

Custom query types for A/B testing search relevance
Custom aggregations for metrics collection
REST endpoints to manage experiments
System indexes (prefixed with .plugins-search-rel-) to store experiment state
Concurrent search request deciders (OpenSearch 2.17+) for custom query execution strategies

The plugin is battle-tested in production, used by teams optimizing ranking and relevance across massive datasets.

System Indexes: How Plugins Store Their Own State

Most non-trivial plugins need to persist data. Rather than requiring external storage, they use system indexes within OpenSearch itself.

System indexes are prefixed with .plugins- or .opendistro-:

.plugins-search-rel-<version>-experiments
.plugins-search-rel-<version>-notes
.plugins-ml-config
.opendistro-job-scheduler-lock

The challenge: how do you evolve the schema without breaking existing deployments?

OpenSearch plugins use a schema versioning pattern:

public static final String SCHEMA_VERSION = "1";

private void ensureIndexInitialized() {
    if (!indexExists()) {
        createIndex();
        return;
    }

    Map<String, Object> indexMeta = getIndexMeta();
    String currentVersion = (String) indexMeta.getOrDefault("schema_version", "0");

    if (!currentVersion.equals(SCHEMA_VERSION)) {
        migrateSchema(currentVersion, SCHEMA_VERSION);
    }
}

private void migrateSchema(String fromVersion, String toVersion) {
    // Use Put Mapping API to add new fields (additive only)
    // Never remove or change existing field types
    putMapping(newFields);
}

This ensures:

Old documents coexist with new schema
Upgrades are backwards compatible
No downtime required for schema evolution

Performance and Reliability Considerations

Startup Time

Each plugin adds to startup time. Large plugins or plugins that do heavy initialization can slow cluster startup. Monitor this in production.

Class Loader Memory

Each plugin gets its own class loader, holding copies of loaded classes in memory. Many plugins = higher memory footprint. Keep plugin count reasonable.

API Stability

OpenSearch's plugin APIs are versioned with OpenSearch itself. When OpenSearch releases a major version, plugins must recompile and test. This is by design: it ensures plugins stay compatible with core.

Security

Plugins run in the same JVM as OpenSearch core. A malicious or buggy plugin can crash the entire node. Only install plugins from trusted sources. In multi-tenant environments, consider network isolation or separate clusters.

Building Your Own Plugin: Where to Start

Clone the plugin template: OpenSearch provides plugin-template repository
Implement your extension point (SearchPlugin, ActionPlugin, etc.)
Write tests — use OpenSearch's testing framework
Build the .zip — gradle build produces the artifact
Install locally — ./bin/opensearch-plugin install file://...
Test end-to-end — verify your REST endpoint/query/aggregation works
Publish — host on artifact repository or GitHub Releases

Conclusion

OpenSearch plugins are not magic. They're well-structured Java code that hooks into OpenSearch via extension points. Understanding this architecture demystifies plugin behavior, helps you troubleshoot issues, and opens the door to building custom extensions.

Whether you're optimizing search relevance, integrating with custom systems, or building observability tooling, the plugin architecture gives you the hooks you need without compromising core stability.

The next time a plugin breaks after an upgrade, you'll know exactly where to look. And when you need to build one, you'll have a mental model of how the pieces fit together.

About the Author

The Credential Vending Revolution: How Polaris Eliminates Long-Lived Keys

Prithvi S — Thu, 23 Apr 2026 09:24:08 +0000

The Problem Nobody Wants to Talk About

You're a data engineer at a mid-sized company. Your team needs access to production data for analytics, ML pipelines, and ad-hoc queries. So you do what everyone does: you create long-lived AWS credentials (access key + secret), store them in a vault (or worse, environment variables), and distribute them to your team.

Then you pray.

You pray nobody copies them to Slack. You pray an engineer doesn't accidentally commit them to GitHub. You pray that when someone leaves, you remember to rotate them. You pray a compromised machine doesn't expose them to attackers.

This is the status quo. And it's broken.

For years, data catalogs have accepted this reality: want to access data? Here's a credential. Use it however you want. Rotation? Access control? Audit trails? Maybe in the next version.

Apache Polaris just threw that playbook in the trash.

Instead of handing out credentials, Polaris mints them on demand. Every request for data gets a fresh, short-lived token scoped to exactly what's needed. No long-lived keys. No distribution. No prayer.

This is credential vending. And it's about to change how we think about data security.

What Is Credential Vending?

Credential vending is simple in concept but elegant in execution: instead of pre-issuing static credentials, a system dynamically generates temporary, scoped credentials when you request access to data.

Here's how it works in Polaris:

You request access: A Spark engine asks Polaris for permission to read a specific table
Polaris authorizes: It checks your identity, verifies your role, confirms you have TABLE_READ_DATA privilege
Polaris mints a credential: It calls AWS STS, GCS token service, or Azure token service and gets a temporary credential
Polaris scopes it: That credential is locked to the specific table path, read-only, and expires in ~15 minutes
Your engine uses it: Spark gets the scoped token, reads exactly what it needs, then the credential expires automatically

No long-lived keys. No distribution. No rotation headaches.

The genius is in the details.

Why This Matters: The Security Cascade

1. No Long-Lived Credentials to Steal

Traditional approach: You create an AWS access key with read+write permissions to your S3 bucket. You give it to 20 engineers. It lives in vaults, notebooks, CI/CD pipelines. Anywhere there's a copy, there's a vulnerability.

Attack surface = number of copies × time each copy exists.

Polaris approach: Your team never touches long-lived credentials. The only keys that exist are:

Polaris' own cloud provider credentials (stored securely, rotated regularly)
Temporary tokens minted per-request, valid for 15 minutes, then deleted

Attack surface = 1 system × 15 minutes at a time.

That's a 1000x reduction in exposure.

2. Instant Revocation

With long-lived credentials, revoking access means updating IAM policies, rotating keys, or waiting for vault secrets to refresh. By then, a compromised key might already be in use.

With credential vending, revocation is instant: Polaris stops issuing credentials for that principal, and their next data request fails immediately. No key in circulation. No 15-minute grace period for attackers to exploit.

3. Path-Level Scoping

Your Spark job needs to read s3://data-lake/customers/transactions. With traditional credentials, you'd give it broad S3 access. Polaris? It mints a credential valid only for that exact table path.

Even if that credential leaks, an attacker can only read transactions, not your employee records, financial data, or anything else in the bucket.

Write operations get the same treatment: TABLE_WRITE_DATA privilege generates a credential that can only write to that specific table, not drop it, not truncate it, not write to other tables.

Privilege mapped to cloud permissions. Boundaries enforced at the storage layer.

4. Compliance & Audit Trail

Regulators (GDPR, HIPAA, SOX) love paper trails. Credential vending creates one automatically:

Every credential vend is logged with who requested it, what table, when it was issued, when it expired
You can correlate data access with user identity without relying on IAM logs that are often delayed or incomplete
Breaches are traceable: "Which credentials were active when this data left the building?" Answer: the ones issued to that specific user for that specific 15-minute window

No more "we distributed keys, so we don't know who accessed what" handwaving.

Under the Hood: How Polaris Mints Credentials

Let's get technical. Here's the flow for an S3-backed Polaris catalog:

The Request

A Trino query hits your Polaris catalog asking to read table prod.warehouse.users. Polaris receives:

Your identity (service principal or user)
The table you want to access
The operation (read, write, etc.)

Authorization Check

Polaris checks:

Does this principal have a role assigned?
Does that role have TABLE_READ_DATA privilege for this table (or its parent namespace)?
If yes, proceed. If no, fail fast.

This is enforced by Polaris' two-tier RBAC model: Principal Roles (identity) separate from Catalog Roles (permissions). More on that in a future post, but the key insight: authorization happens before any credential is minted.

Credential Minting

Assuming authorization passes, Polaris looks up the storage configuration for this catalog. For S3, it has:

An AWS account ID
A role ARN (e.g., arn:aws:iam::123456789:role/polaris-data-lake)
An external ID (for added security)

Polaris calls STS:AssumeRole with parameters:

Role ARN
Duration: 900 seconds (15 minutes)
Session policy: A JSON policy restricting the token to s3:GetObject on paths matching s3://data-lake/prod/warehouse/users/*

AWS returns a temporary security credential (access key, secret key, session token).

Scoping the Credential

Here's where it gets clever. Polaris doesn't just pass through the STS response. It crafts a session policy that restricts the token further:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::data-lake/prod/warehouse/users/*"
      ]
    }
  ]
}

For TABLE_WRITE_DATA, the policy includes s3:PutObject:

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
  "Resource": [
    "arn:aws:s3:::data-lake/prod/warehouse/users/*"
  ]
}

Notice: s3:DeleteObject is never included. You can write to the table, but you can't delete it or its backing files. Polaris itself controls deletes through atomic metadata operations.

Response to Engine

Polaris returns the temporary credential to your Spark/Trino/Flink engine. The engine uses it to read data. After 15 minutes, the token expires automatically.

No revocation needed. No cleanup. No leftover keys.

The Federation Game Changer (v1.3.0)

Here's where it gets even more interesting.

Many organizations don't run pure Iceberg. You have Snowflake for analytics, Delta Lake in Databricks, Hudi on Hadoop, and Iceberg in your data lake. Managing credentials across all of them is a nightmare.

Polaris v1.3.0 introduces federated credential vending.

Instead of each external system managing its own credentials, Polaris can mint credentials on behalf of external catalogs. Your Snowflake-to-Iceberg migration? Polaris handles credential vending for both. Your Databricks Delta table accessed through Polaris? Same story.

This is huge for:

Data mesh architectures: One source of truth for credential vending across multiple catalog types
Migrations: Seamlessly bridge old systems (Snowflake, Glue) with new ones (Iceberg) without credential sprawl
Multi-cloud setups: Mint GCS credentials for BigQuery, S3 credentials for Iceberg, all from a single Polaris instance

Performance: The Caching Question

"But doesn't minting a credential for every request add latency?"

Yes. Each STS call takes 100-200ms.

Polaris solves this with intelligent caching:

For repeated requests from the same principal to the same table, Polaris reuses the cached credential (until near expiration)
This reduces cloud provider API calls significantly
The tradeoff: earlier revocation (e.g., if permissions change mid-session) requires a cache flush

For most workloads (batch jobs, dashboards, recurring queries), this is a net win. Latency is imperceptible; security is massively improved.

Implementing Credential Vending in Your Catalog

If you're building a catalog or evaluating options, here's what to look for:

Cloud-native credential generation: Does the system call your cloud provider's token service (STS, GCS, Azure), or does it generate its own tokens? Cloud-native is better (leverages existing IAM, auditable).
Scoping mechanism: Are credentials scoped to table paths, operations, or both? Path + operation = maximum security.
Expiration: How short can tokens be? 15 minutes is ideal for security; anything longer risks exposing stale credentials.
Caching strategy: How does the system balance revocation latency with performance? Intelligent caching (by principal + table) is the sweet spot.
Multi-cloud support: Do you need GCS, S3, and Azure all at once? Credential vending should work across cloud providers.

Polaris nails all five. That's why it's becoming the standard for open-source Iceberg deployments.

What's Next?

Credential vending is the foundation. On top of it, Polaris builds:

Two-tier RBAC for fine-grained access control
OPA integration for externalized policies
Metrics reporting for observability
Generic table support for non-Iceberg formats

But credential vending is the security core. It's why Polaris is uniquely positioned for zero-trust data architectures.

If your organization is scaling data access, if compliance is a concern, or if you're tired of rotating long-lived credentials, Polaris' credential vending approach is worth the migration.

No more praying.

Questions? Thoughts on credential vending, Polaris, or data security architecture?

Drop a comment below or find me on GitHub: https://github.com/iprithv

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. Follow my work on GitHub.

Elasticsearch Cluster Health 101: Understanding Your Distributed System's Vital Signs

Prithvi S — Tue, 21 Apr 2026 14:29:18 +0000

You ship your Elasticsearch cluster to production. Traffic hits it. Three hours later, your monitoring dashboard flashes yellow. Your heart sinks. What does that mean? Are you in trouble? Should you wake up the on-call engineer at 2 AM?

This post teaches you to read your cluster's health like a doctor reads vital signs. By the end, you'll understand what GREEN, YELLOW, and RED actually mean, why your cluster sometimes needs time to heal itself, and how to spot real problems before they become disasters.

What Is Cluster Health? The Three States

Every Elasticsearch cluster has a health status. It's not a guess. It's a concrete signal that tells you whether your data is safe.

GET /_cluster/health
{
  "cluster_name": "production",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 3,
  "active_primary_shards": 12,
  "active_shards": 36,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

GREEN means all is well. Every primary shard has a replica. Every index is fully replicated. You can sleep soundly.

YELLOW means something is missing, but it's not critical yet. Your data is still readable and searchable. Every primary shard exists. But not all replicas are assigned. This usually happens when you lose a node, and Elasticsearch hasn't finished rebalancing yet. You have time to fix it, but replicas are your safety net. If another node fails while you're yellow, you lose data.

RED means you're in trouble. At least one primary shard is missing. Data is gone or unreachable. Your cluster cannot fully serve requests. This is the emergency light. Time to act.

The Architecture Behind Health: Cluster Coordination

To understand why your cluster gets sick, you need to understand how it stays healthy.

Elasticsearch is fundamentally distributed. Your data is split across multiple nodes. Each node is independent. But they need to agree on one critical thing: where is my data? This is the job of the cluster coordinator (master node).

The Master Node: The Orchestrator

One node in your cluster is elected master. This node makes all the big decisions:

Where do shards live?
Is node X still alive, or did it fail?
When a node joins, where do its shards go?
Which indices can be created or deleted?

The master maintains the cluster state, a constantly updated map of the cluster. This map says: "Shard 0 of index-2026-04 is on node-1 (primary), node-2 (replica), and node-3 (replica)."

Why does this matter? Because if the master dies, the cluster needs to elect a new one. And if you don't have enough nodes to reach a quorum, the cluster freezes to prevent split-brain (where two masters disagree and corrupt your data).

Master Election: The Quorum Rule

Elasticsearch uses quorum voting. You need at least (N/2 + 1) master-eligible nodes to have a working master.

1 master-eligible node: no quorum, single point of failure
2 master-eligible nodes: no quorum (need 2/2 + 1 = 2, but split vote = tie)
3 master-eligible nodes: quorum = 2 (safe, survives 1 failure)
5 master-eligible nodes: quorum = 3 (safe, survives 2 failures)

This is why production clusters run 3 or 5 master-eligible nodes, not 2. A 2-node cluster can't form a quorum if one node fails. Your cluster goes RED and stops serving requests.

Recommended Production Setup:
- 3 master-eligible nodes (dedicated, small machines)
- 3+ data nodes (store and search data, large machines)
- 1+ coordinating nodes (optional, route queries, aggregate results)

This setup survives any single node failure.

Shard Allocation: How Data Spreads

Your index has 3 primary shards and 2 replicas per primary. That's 9 shards total (3 primary + 6 replica). Elasticsearch's job is to spread these 9 shards across your nodes so that:

No primary and replica on the same node (otherwise a single node failure loses data)
Replicas spread across different nodes (fault tolerance)
Load balanced (roughly equal shard count per node)

When everything works, this happens automatically. When a node fails, Elasticsearch:

Detects the failure (no heartbeat for 30 seconds)
Marks the node as dead
Reassigns its shards to other nodes
Creates new replicas to restore redundancy

This process is called rebalancing. It takes time. A large index might take minutes or hours to fully rebalance. During this time, your cluster is YELLOW (replicas missing), but still operational.

Common Health Scenarios

Scenario 1: New Node Joins

Before: 2 nodes, 6 shards each (fully replicated, GREEN)
New node joins
Action: Master rebalances, shards move to new node
During: YELLOW (shards initializing on new node)
After: GREEN (shards reassigned, balanced)
Timeline: Minutes to hours depending on shard size

Scenario 2: Node Failure

Before: 3 nodes, GREEN (all shards have replicas)
Node 2 crashes (network partition, power failure)
Immediately: YELLOW (node 2's shards gone, replicas missing)
Action: Master promotes replicas on nodes 1 and 3 to primary
       Creates new replicas on nodes 1 and 3
During: YELLOW (replicas initializing)
After: GREEN (all shards have replicas again)
Timeline: Seconds (replica promotion) + minutes (replica creation)

Scenario 3: Disk Full

Before: 3 nodes, GREEN
Node 1 disk reaches 85% capacity
Action: Elasticsearch refuses to assign new shards to node 1
Symptom: Some shards can't be assigned to node 1, cluster goes YELLOW
Fix: Delete old indices, or add disk space
After: Cluster rebalances, goes GREEN

Reading the Health Endpoint: What Each Field Means

The GET /_cluster/health API is your primary diagnostic tool. Here's what each field tells you:

Field	Meaning
`status`	GREEN (all good), YELLOW (missing replicas), RED (missing primary)
`number_of_nodes`	Total nodes in cluster
`number_of_data_nodes`	Nodes that store data
`active_primary_shards`	Primary shards assigned and healthy
`active_shards`	Primary + replica shards assigned and healthy
`relocating_shards`	Shards currently moving to another node
`initializing_shards`	Shards being created or recovered
`unassigned_shards`	Shards that haven't been assigned to a node
`delayed_unassigned_shards`	Shards waiting to be assigned (temporary delay)
`number_of_pending_tasks`	Master tasks waiting to be executed

Example: Degraded Cluster

{
  "status": "yellow",
  "number_of_nodes": 3,
  "active_primary_shards": 12,
  "active_shards": 24,
  "unassigned_shards": 12,
  "relocating_shards": 2,
  "initializing_shards": 4
}

Translation: 3 nodes, 12 primary shards assigned, but only 24 total shards assigned. That means 12 replicas are missing (unassigned). Also, 2 shards are moving, 4 are initializing. The cluster is rebalancing from a recent failure or node addition.

Diagnosing RED: Data Is Missing

A RED cluster means at least one primary shard has no home. This is an emergency.

Find the problematic shard:

GET /_cat/shards?health=red&v

This shows you which shards are unassigned. Look for entries with no node assignment.

Common causes:

Node failure with insufficient replicas - If a node fails and you had zero replicas, the primary shard is lost
- Fix: Restore from snapshot
Disk full on all nodes - Elasticsearch won't assign shards to nodes >85% full
- Fix: Delete old indices, add disk space, or adjust disk threshold setting
Allocation disabled - Someone (usually during disaster recovery) disabled shard allocation
- Fix: Re-enable with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
Too many relocating shards - Master is overloaded trying to rebalance
- Fix: Wait, or reduce concurrent recoveries with cluster settings

Diagnosing YELLOW: Replicas Are Missing

YELLOW is a warning, not a failure. You can still read and write. But you're one node failure away from RED.

Check which indices are yellow:

GET /_cat/indices?health=yellow&v

Check if it's stuck or still rebalancing:

GET /_cluster/health?wait_for_status=green&timeout=5m

This waits up to 5 minutes for the cluster to reach GREEN. If it times out, you're stuck yellow.

Why you might be stuck yellow:

Insufficient nodes - You have 1 data node but 2 replicas per shard. No place to put the replicas
- Fix: Add more nodes
Allocation disabled - Replicas won't be assigned if allocation is off
- Fix: PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
Allocation filters blocking replicas - You set a filter that prevents replicas on certain nodes
- Fix: Review allocation filtering rules

Monitoring: Don't Just React, Anticipate

Cluster health is reactive. It tells you what happened, not what will happen. For reliability, monitor proactively:

Alert on these:

Status == RED (obvious, immediate incident)
Status == YELLOW for >5 minutes (stuck rebalancing, investigate)
Unassigned shards > 0 for >10 minutes
Disk usage >85% on any data node
Heap usage >80% on any node
Relocating shards > 5 (recovery is slow)

Useful dashboard queries:

GET /_cat/nodes?v&h=name,ip,heap.percent,disk.percent,cpu,load_1m

This shows node-by-node health: heap usage, disk usage, CPU, load. Red flags: heap >80%, disk >85%.

GET /_nodes/stats/jvm,fs,indices

Deep dive: garbage collection pauses, segment count, cache hit rates. Useful for performance issues hiding behind a GREEN cluster.

Common Mistakes That Destroy Reliability

Mistake 1: Single master-eligible node

You think it works fine until that one node fails
Cluster elects no master, goes RED, stops serving requests
Fix: Always run 3+ master-eligible nodes

Mistake 2: Ignoring YELLOW for days

"It's yellow, but traffic is fine!" you say
Then a second node fails, cluster goes RED
Fix: Investigate YELLOW immediately, restore replicas

Mistake 3: All shards on one node

You didn't specify replicas or shard allocation rules
One node failure = RED cluster
Fix: Use allocation awareness, rack awareness, or zone awareness

Mistake 4: Disabling shard allocation and forgetting to re-enable

You disabled it during maintenance and moved on
Weeks later, replicas are still unassigned
Fix: Audit allocation settings regularly

Mistake 5: Not understanding recovery time

You expect replicas to be created instantly
But recovery depends on network bandwidth, index size, merge rate
You panic and manually delete/recreate indices, making it worse
Fix: Understand recovery SLOs for your cluster size

Putting It Together: Your First Cluster Health Audit

Here's what to do right now:

# Check overall health
curl http://localhost:9200/_cluster/health?pretty

# If not GREEN, check which indices are affected
curl http://localhost:9200/_cat/indices?health=yellow&v

# Check which shards are unassigned
curl http://localhost:9200/_cat/shards?health=yellow&v

# Check node status
curl http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,disk.percent,cpu

# Check allocation settings
curl http://localhost:9200/_cluster/settings?pretty

If you see YELLOW and unassigned shards, it's usually one of these:

A node is recovering (wait 5-10 min, check again)
You don't have enough nodes for your replica count (add nodes)
Disk is full (delete old data)
Allocation is disabled (re-enable it)

Conclusion: Health Is Visibility

Cluster health is not a number to ignore. It's your window into the distributed system running underneath your search and analytics.

GREEN means you're safe. YELLOW means you're vulnerable. RED means you have a real problem.

The key insight: Elasticsearch recovers automatically most of the time. Your job is to understand what's happening, monitor proactively, and know when to intervene.

Next step: Learn about shard allocation strategies and how to scale your cluster without triggering cascading failures.

About the Author

I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: https://github.com/iprithv

Building a Search Quality Evaluation Pipeline with OpenSearch Search Relevance

Prithvi S — Mon, 20 Apr 2026 00:34:27 +0000

Introduction

You've built a search engine. Users can query your data. Results come back. But here's the uncomfortable question: are the results good?

This isn't about whether the feature works. It works. But does it rank the most relevant documents first? Do users find what they're looking for? Are you optimizing for the right metrics?

If you're operating OpenSearch at scale, you've probably felt this pain. Search quality isn't a one-time configuration problem. It's an ongoing optimization challenge. You need to measure it, experiment with it, and improve it systematically.

That's where the OpenSearch Search Relevance plugin comes in.

In this post, I'll walk you through building an end-to-end search quality evaluation pipeline using the Search Relevance plugin. By the end, you'll understand how to create representative query sets, run controlled experiments, collect human judgments, compute relevance metrics, and iterate on your search configuration until results actually matter.

The Search Quality Problem

Before diving into the solution, let's frame the problem clearly.

Search quality has multiple dimensions:

Relevance - Does the top result match what the user searched for?
Completeness - Are all relevant documents in the top-K results?
Ranking - Are more relevant docs ranked higher than less relevant ones?
Precision - What fraction of returned results are actually useful?
Recall - What fraction of useful documents did we find?

The BM25 algorithm (OpenSearch's default) is good out of the box. But "good" isn't "perfect." And what's perfect for one use case (e-commerce product search) might be terrible for another (medical research papers).

You need a way to:

Define what "good" means for your domain
Measure it quantitatively
Test changes before deploying
Track improvements over time

This is exactly what a search quality evaluation pipeline does.

Meet the OpenSearch Search Relevance Plugin

The Search Relevance plugin is part of the opensearch-project ecosystem. It's designed specifically to solve this problem.

At its core, it orchestrates four key components:

Query Sets - Representative questions or search terms for your domain
Search Configurations - Different index analyzers, query types, and boosting settings
Experiments - Controlled comparisons between two search configurations
Judgments - Human-provided relevance labels for query-document pairs
Metrics - Computed evaluation scores: nDCG, precision, recall, MRR

The beauty is that it connects all of these together into a coherent workflow. You don't need to glue together five different tools. It's all built in.

Step 1: Build Your Query Set

Everything starts with queries.

A query set is a collection of representative search terms or questions for your domain. The quality of your query set directly affects the quality of your evaluation.

How to Design a Query Set

Think like your users. What would they actually search for?

For an e-commerce search engine, examples might be:

"blue running shoes size 10"
"wireless headphones under 100"
"coffee maker for 2 people"

For a documentation search:

"how to configure SSL certificates"
"what is a shard"
"troubleshooting connection timeouts"

For a code search:

"implement BFS algorithm"
"parse JSON to object"
"handle file not found error"

Include variety. Your query set should cover:

Short queries (1-2 terms) and long queries (5+ terms)
Common searches and edge cases
Different intent types (navigation, informational, transactional)
Different domains if your search spans multiple categories

Size matters. A good query set has 50-300 queries depending on your domain:

50-100: Initial evaluation (fast iteration)
100-200: Standard evaluation for most teams
200+: Large-scale benchmarking across multiple configurations

Create it incrementally. Start small, run experiments, learn what queries are most impactful, expand.

Storing Your Query Set

In the Search Relevance plugin, query sets are stored as OpenSearch documents. Here's a conceptual example:

{
  "_id": "ecommerce-base-v1",
  "name": "E-commerce Base Query Set",
  "description": "100 representative queries for product search",
  "queries": [
    {"id": "q001", "text": "blue running shoes"},
    {"id": "q002", "text": "wireless headphones"},
    ...
  ]
}

Each query has an ID and text. Simple. You can create query sets via the OpenSearch API or the Dashboards UI.

Step 2: Define Search Configurations

A search configuration is a snapshot of how you want to search: analyzers, query types, field boosting, synonym expansion, and more.

Think of it as: "This is one way to search."

What Goes Into a Configuration

Here are common things you might tune:

Analyzers

Standard analyzer vs. custom analyzer with stemming
Language-specific analyzers (English, French, etc.)
Phonetic analysis for typo tolerance

Query Types

BM25 query (default, term frequency + IDF)
Match phrase query (exact phrase matching)
Bool query with SHOULD/MUST/FILTER clauses
Multi-match across multiple fields

Field Boosting

Title field: 2x boost (more important)
Description field: 1x boost (baseline)
Tags field: 0.5x boost (less important)

Query Parameters

Fuzziness (tolerating typos)
Operator (AND vs. OR semantics)
Minimum should match (for multi-clause queries)

Creating a Configuration

Via API:

PUT /search-config/_doc/config-v1
{
  "name": "Default BM25",
  "description": "Standard BM25 with field boosting",
  "analyzer": "standard",
  "query_type": "multi_match",
  "field_weights": {
    "title": 2.0,
    "description": 1.0,
    "tags": 0.5
  }
}

You typically create 2-3 configurations to compare:

Your baseline (current production)
A proposed improvement (new analyzer or boosting strategy)
(Optional) A radically different approach

Step 3: Run an Experiment

Now the magic happens. You tell the plugin: "Compare config A vs. config B using my query set. Execute all queries against both and show me which is better."

The plugin does this:

Iterate through each query in your query set
Execute the query against your OpenSearch index using config A
Execute the same query against your index using config B
Capture the results (top-K documents for each)
Store everything indexed and ready for judgment

This is an experiment. The status is CREATED -> RUNNING -> COMPLETED.

Once completed, you have paired results: for each query, you can see side-by-side what config A returned vs. what config B returned.

Why This Matters

You now have:

Reproducible, deterministic comparisons
No randomness (same queries, same configs, same results every time)
Side-by-side results for human evaluation
A complete audit trail of what changed

Step 4: Collect Judgments

Here's where humans come in.

Judgments are relevance labels. A human expert looks at a query and says: "This document is highly relevant" or "This document is not relevant."

The Search Relevance plugin supports two types of judgments:

Explicit Judgments

A human grades each query-document pair on a scale. Most common:

0: Not relevant (wrong topic entirely)
1: Somewhat relevant (tangentially related)
2: Relevant (answers the query)
3: Highly relevant (perfect match)

The plugin UI presents your experiment results (config A vs. B side-by-side) and lets judges assign these grades.

Implicit Judgments

Collect signals from user behavior:

Click-through rate (user clicked this result)
Dwell time (user spent time reading this)
Skip rate (user skipped this and clicked something lower)

For many teams, explicit judgments from a small pool of domain experts (5-20 people) is enough to get strong signal.

Step 5: Compute Metrics

Once you have judgments, the plugin computes relevance metrics:

nDCG (Normalized Discounted Cumulative Gain)

This measures ranking quality. The intuition:

Relevant documents should rank high
Position matters (higher positions worth more)
Perfect ranking gets nDCG = 1.0
Random ranking gets nDCG near 0.5

Formula (simplified):

nDCG = (1/IDCG) * sum(relevance_grade / log2(position + 1))

When to use: Almost always. This is the gold standard for ranking quality.

Precision and Recall

Precision: What fraction of top-K results were relevant?

Precision@10 = (# relevant in top 10) / 10
Good for: User experience (are the top results useful?)

Recall: What fraction of relevant documents did we find?

Recall@100 = (# relevant in top 100) / (# relevant total)
Good for: Comprehensiveness (did we find everything?)

MRR (Mean Reciprocal Rank)

Average position of the first relevant document.

MRR = 1 / (average position of first relevant result)
Good for: Cases where only the first result matters (navigation queries)

Which Metric to Track

nDCG: Primary metric. Use nDCG@10 or nDCG@20 for most cases.
Precision@K: Secondary. Shows top-K quality.
Recall@K: If comprehensiveness matters (e.g., search across entire document corpus).
MRR: Only if navigational queries are critical.

Step 6: Iterate

The experiment outputs metrics. You see that config B's nDCG is 0.78 while config A's is 0.72. Config B wins.

Now what?

Deploy config B. Monitor search quality in production. But don't stop.

Run the next experiment. Try:

A different analyzer
Different field boosting
Additional query logic

Treat search quality like any other product: continuous improvement, guided by metrics.

Putting It All Together: The Pipeline

Here's the complete workflow:

1. Design Query Set
   |
   v
2. Create Search Configurations (Baseline + Proposed)
   |
   v
3. Run Experiment (Execute queries for both configs)
   |
   v
4. Collect Judgments (Humans grade results)
   |
   v
5. Compute Metrics (nDCG, precision, recall, MRR)
   |
   v
6. Analyze Results (Which config wins? By how much?)
   |
   v
7. Deploy Winner & Monitor
   |
   v
8. Repeat (Design next experiment)

Each cycle takes days to weeks (depending on judgment collection speed). But you're grounded in data, not guesses.

Practical Tips

Start small. 50 queries, 2 configurations, 10 judges. Run the pipeline end-to-end. You'll learn what works.

Judge consistently. Train your judges. Create a judgment guide. Have judges re-evaluate a subset for inter-rater agreement.

Track over time. Keep historical metrics. Did nDCG improve? By how much? This builds confidence that you're moving in the right direction.

Combine signals. Use nDCG as your primary metric, but also check precision, recall, and MRR. Sometimes improvements in one metric hurt another.

Automate where possible. Explicit judgments require humans, but experiment execution, metric computation, and result analysis should all be automated.

Version everything. Query sets, configurations, experiments, judgments. Treat them like code: track versions, enable reproducibility, enable rollback.

Challenges You'll Face

Judgment burden. Grading 100 queries x 10 results per query = 1,000 judgments. At 30 seconds per judgment, that's 8 hours. Parallelize across judges. Use inter-rater agreement to validate quality.

Query set quality. A bad query set produces meaningless metrics. Spend time upfront building representative queries. Validate with users or domain experts.

Config comparison fairness. Make sure both configs query the same data, same index, same relevance judgments. Isolate variables.

Metric interpretation. A 0.02 improvement in nDCG might be noise or might be significant. Track confidence intervals. Run multiple rounds.

Real-World Example

Let's say you run e-commerce search. Your current baseline achieves nDCG@10 = 0.68. Users complain that size/color variants aren't matching well.

You hypothesize: "If I boost the size and color fields more, users will find exact matches faster."

You create a new config with aggressive field boosting on size and color. Run an experiment with 100 queries and 15 judges.

Results:

Config A (baseline): nDCG@10 = 0.68, Precision@10 = 0.72
Config B (boosted): nDCG@10 = 0.71, Precision@10 = 0.75

Config B wins. You deploy it. Users are happier. Search quality improved by 4.4%.

Next experiment: Can we improve recall without hurting precision? Try a different query operator. Repeat.

Over 6 months, you've compounded these improvements. Your nDCG went from 0.68 to 0.79. That's real impact, measured and reproducible.

Getting Started

Install OpenSearch Search Relevance plugin. Follow the official docs: https://github.com/opensearch-project/search-relevance
Create your first query set. Start with 50-100 queries. Validate with users.
Set up 2 configurations. Baseline (current) and one proposed improvement.
Run an experiment. Let it complete. Study the results.
Collect judgments. Have 5-10 domain experts grade a subset of results.
Compute metrics. Let the plugin calculate nDCG, precision, recall, MRR.
Analyze. Which config won? By how much? Is it statistically significant?
Iterate. Deploy the winner, then design the next experiment.

Conclusion

Search quality doesn't happen by accident. It's engineered, measured, and continuously improved.

The OpenSearch Search Relevance plugin gives you the infrastructure to do this systematically. Query sets, configurations, experiments, judgments, metrics. All connected. All reproducible.

If you're operating search at scale, you owe it to your users to build this pipeline. Start small. Run your first experiment this week. Measure. Iterate. Improve.

Your search results will thank you. So will your users.

About the Author

I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: https://github.com/iprithv

How Lucene Executes a Boolean Query: The Hidden Optimization Layer

Prithvi S — Sat, 18 Apr 2026 00:32:12 +0000

When you run a search query against Elasticsearch, Solr, or any Lucene-powered system, something remarkable happens under the hood. That simple boolean query you wrote gets transformed, optimized, and executed through a sophisticated pipeline designed to return results in milliseconds. But most developers never see inside that black box. Let's change that.

The Journey: From Query to Results

Every time you execute a boolean query like:

(status:active AND (category:tech OR category:science)) NOT spam:true

Lucene doesn't just naively search for documents matching each condition. Instead, it performs a series of intelligent transformations and optimizations before a single byte is read from disk. Understanding this journey is the key to writing efficient queries and debugging slow search performance.

Part 1: Boolean Query Structure - The Basics

A BooleanQuery in Lucene is built from clauses:

BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("status", "active")), BooleanClause.Occur.MUST);
builder.add(new TermQuery(new Term("category", "tech")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("spam", "true")), BooleanClause.Occur.MUST_NOT);
BooleanQuery query = builder.build();

The four clause types create different semantics:

MUST: Document must match (hard constraint)
SHOULD: Document may match (soft constraint, boosts score)
MUST_NOT: Document must not match (hard constraint)
FILTER: Document must match, but scoring is skipped

The FILTER clause is often overlooked, but it's one of Lucene's most powerful optimizations for production workloads.

Part 2: Query Rewriting - The Invisible Optimization

Before anything else happens, BooleanQuery goes through a rewriting phase. This is where Lucene applies transformations to simplify and optimize the query tree.

Example transformations:

A MUST_NOT clause with nothing else becomes impossible (returns no results)
Single SHOULD clauses in a pure-SHOULD query become OR logic
Nested BooleanQueries get flattened
Queries with all FILTER clauses get optimized for constant-score execution

Query rewritten = query.rewrite(searcher.getIndexReader());

The rewritten query may look completely different from the original. This is where Lucene applies domain knowledge about query structures to reduce complexity.

Part 3: Weight Creation - Binding Query to Index Statistics

After rewriting, a Weight object is created. This is the marriage between your query and the actual index data. The Weight object:

Examines index statistics - How many documents contain each term? What's the frequency distribution?
Computes normalization factors - These will be used during scoring
Prepares scoring parameters - BM25 k1, b values; query boosts; field boosts

Weight weight = query.createWeight(searcher, ScoreMode.TOP_SCORES, 1.0f);

This is where Lucene decides: "Given that 'active' appears in 50% of documents but 'quantum-physics' appears in only 12 documents, how should I weight these terms when scoring?"

Part 4: The Scorer Tree - Building the Execution Plan

Now comes the magic. For each segment in the index, Lucene constructs a Scorer tree - a hierarchical structure that knows how to efficiently find matching documents.

For our example query (status:active AND category:tech) OR category:science, Lucene builds something like:

          DisjunctionScorer (OR)
         /                    \
   ConjunctionScorer (AND)   TermScorer(category:science)
   /                      \
 TermScorer(status:active)  TermScorer(category:tech)

Each node in this tree knows:

How to advance to the next matching document
How to compute scores for matching documents
How to skip documents that can't possibly match

This tree structure is critical. The order of operations and the types of scorers used determine query performance.

Part 5: Conjunction Optimization - The AND is Expensive

Here's a critical insight: finding documents that match both terms (AND/MUST) is expensive. You need to find the intersection of two postings lists.

Lucene uses several optimizations:

1. Selectivity-Aware Ordering
Lucene reorders conjunctions to start with the most selective term first:

// Instead of: status:active AND rare_term:value
// Lucene does: rare_term:value AND status:active

If rare_term:value matches only 5 documents, why waste time iterating through the 50% of documents matching status:active? Start with the rare term.

2. Two-Phase Iteration
For expensive predicates (like positions or more complex matching), Lucene uses a two-phase approach:

Phase 1: Approximate match - "Could this doc match?"
Phase 2: Exact match - "Does this doc actually match?"

This avoids computing expensive term positions for documents that will be filtered out anyway.

3. Skipping Strategy
Postings lists store skip pointers that allow jumping over blocks of documents. If you're looking for docID 10000 and the current posting is at docID 1000, you can skip forward without examining every document in between.

Part 6: Disjunction and WAND - The OR Optimization

Disjunctions (OR/SHOULD) are fundamentally different from conjunctions. You don't need to match all clauses; you just need to match at least one.

For simple disjunctions on small result sets, Lucene just iterates through all matching documents. But when you're returning top-10 results from millions of documents, Lucene applies the Weak AND (WAND) algorithm.

The WAND algorithm is elegant:

Maintain a threshold - The current minimum score of documents we're keeping
Skip high-cost scorers - If a scorer can't possibly contribute more points than our threshold, skip it
Update the threshold - As we collect results, raise it

This means for a query like (category:tech OR category:science OR category:business), Lucene doesn't score every document in those three categories. It:

Starts with documents from the most selective category
Uses block-level maximum scores to skip large blocks that can't reach the threshold
Only evaluates expensive scoring logic when necessary

For queries with 100+ SHOULD clauses (common in recommendation systems), WAND can reduce scoring work by 90%.

Part 7: The FILTER Clause Secret - Performance Magic

This is where many developers miss an optimization opportunity.

// Slow - scoring happens for all status:active documents
BooleanQuery slow = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("status", "active")), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("content", "machine learning")), BooleanClause.Occur.MUST)
    .build();

// Fast - filtering skips scoring for the status check
BooleanQuery fast = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("status", "active")), BooleanClause.Occur.FILTER)
    .add(new TermQuery(new Term("content", "machine learning")), BooleanClause.Occur.MUST)
    .build();

The difference: FILTER clauses are executed as constant-score queries. Lucene doesn't compute BM25 scores for the filtering predicate - it just checks "matches or doesn't match" and moves on. All the BM25 scoring effort goes to your actual ranked clause.

If you have a clause that's just a hard constraint (like status:active for filtering), use FILTER instead of MUST.

Part 8: Early Termination - Stopping Before You Read Everything

Here's a subtle but important optimization. When you ask for top-10 results, does Lucene really score every document in the index?

No. It uses early termination.

The trick: Lucene tracks the "worst of the best" scores it's collected so far. Once a scorer indicates "I cannot produce any score higher than X for the remaining documents", Lucene can stop.

This works because:

Postings lists are sorted by document ID (not score)
Block-level statistics tell Lucene the maximum possible score in each block
If that maximum is lower than our threshold, skip the entire block

For large indexes with million+ documents, early termination can reduce the number of documents scored from 100% to 5%.

Part 9: Real Code Example - Understanding explain()

Let's trace a real query using Lucene's explain() API:

Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("title", "machine learning")), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("tags", "python")), BooleanClause.Occur.SHOULD)
    .build();

Explanation explanation = searcher.explain(query, docID);
System.out.println(explanation);

Output:

10.5 = score(doc=42)
  10.2 = BM25 score from title:machine learning
    title occurrence frequency (2 times)
    inverse document frequency
    field length normalization
  0.3 = BM25 score from tags:python
    tags occurrence frequency (1 time)
    lower IDF (python is common)

Reading this output shows you exactly how Lucene computed the score. You'll see:

Which clauses contributed to the score
Whether term frequency or field length dominated
How IDF affected ranking
Whether MUST vs SHOULD affected the calculation

This is your debugging tool when a query returns unexpected ordering.

Part 10: Production Pitfalls and Debugging

Pitfall 1: Inefficient Query Ordering

// Bad: rare term second
status:active AND (rarely_updated:value)

// Good: rare term first  
(rarely_updated:value) AND status:active

Lucene tries to optimize this automatically, but explicit ordering in your application layer helps.

Pitfall 2: Too Many SHOULD Clauses Without Minimum Match

// Bad: 100 SHOULD clauses, each one scores every document
(category:a OR category:b OR category:c ... x100)

// Good: use minimum match or filtering
setMinimumNumberShouldMatch(1)

Pitfall 3: Scoring When You Only Need Filtering

// Bad: scoring overhead for a filter
status:active AND (premium:true)

// Good: use FILTER
+status:active +(premium:true)

Pitfall 4: Expensive Predicates in Conjunctions

// Bad: phrase queries are expensive, put them last
"exact phrase match" AND common_term:value

// Good: filter first, then expensive query
common_term:value AND "exact phrase match"

Part 11: Visualizing the Execution

When you understand query execution, you start thinking about queries differently:

Most selective terms first - Filter out documents early
Hard constraints as FILTER - Avoid wasted scoring
Expensive predicates last - Only evaluate when necessary
Watch clause count - 50+ SHOULD clauses trigger WAND; 100+ becomes slow

Use tools like Elasticsearch's profile API to see exactly what happened:

{
  "profile": {
    "shards": [{
      "searches": [{
        "query": [{
          "type": "BooleanQuery",
          "description": "(status:active AND content:machine learning)",
          "time_in_nanos": 1250000,
          "breakdown": {
            "score": 450000,
            "build_scorer": 350000,
            "create_weight": 200000,
            "advance": 250000
          }
        }]
      }]
    }]
  }
}

Conclusion: Think Like Lucene

Understanding how Lucene executes boolean queries transforms you from someone who writes queries to someone who can reason about their performance.

Key takeaways:

Rewriting is invisible but powerful - Your query gets transformed before execution
Scorer trees represent execution plans - More selective terms should be first
WAND optimizes disjunctions - SHOULD clauses with 50+ options still perform well
FILTER beats MUST for hard constraints - Skip unnecessary scoring
Early termination matters at scale - Top-10 doesn't mean score all million documents
Tools like explain() are your window inside - Always inspect queries that surprise you

The next time you write a search query, picture the scorer tree Lucene will build. Ask yourself: "Is this tree optimized for my workload?"

That's thinking like Lucene.

Further Reading:

Lucene Query Parser syntax
BM25 scoring model deep dive
Elasticsearch query profiling guide
Custom Query implementations in Lucene

Have you debugged a slow query by understanding its scorer tree? Share your experience in the comments.

About the Author

The Metadata Tree: How Apache Iceberg Finds the Right Files Without a Database

Prithvi S — Fri, 17 Apr 2026 00:32:40 +0000

When you're managing petabytes of data across hundreds of machines, every millisecond matters. Most data engineers assume you need a separate metadata system to keep track of what's where. Iceberg proves you don't. Instead, it bakes metadata intelligence directly into the file format itself, creating a self-describing hierarchy that can scale to massive datasets without external dependencies.

This is the story of how Iceberg's metadata architecture works, why Netflix designed it this way, and what it means for your data platform.

The Problem: Metadata Hell at Scale

Let's start with what happens in traditional data lakes.

You have files on HDFS or cloud storage. Lots of them. Thousands. Millions. Every time someone runs a query, the query engine needs to answer three questions:

Which files contain data relevant to this query?
What schema does each file have?
What values are in each column (for filtering)?

In Hive or standard Parquet setups, you usually solve this with an external metadata store. Maybe it's a Hive metastore (which is just a relational database). You track table locations, schemas, partitions, and statistics in that database. When you write new data, you update the database. When you read, you query the database first.

This approach has real problems:

External dependency: Your data isn't independent. The metadata store becomes a single point of failure.
Consistency issues: What happens when a write succeeds on HDFS but the metadata update fails? You now have files with no catalog entry.
Scalability: At petabyte scale, metadata queries become bottlenecks. A single database can't efficiently track billions of files.
Time travel is hard: Historical metadata isn't naturally preserved. Rollback operations require custom logic.
Schema evolution breaks: Renaming columns means updating the entire catalog and potentially invalidating downstream queries.

Iceberg's answer: What if metadata lived with the data, not in a separate system?

The Metadata Hierarchy: Five Layers of Intelligence

Iceberg organizes metadata in a strict hierarchy. Each layer builds on the one below it, creating a chain from the catalog pointer all the way down to individual data files.

Catalog (mutable pointer)
    ↓
Metadata File (JSON, immutable)
    ↓
Manifest List (immutable)
    ↓
Manifests (immutable)
    ↓
Data Files (immutable)

Let's walk through each layer.

Layer 1: The Catalog (The Single Mutable Piece)

The catalog is the entry point to your table. It's a simple pointer: "The current state of this table is defined by this metadata file."

That's it. Just a reference. And this reference is the only thing that ever changes.

When you execute a write operation, Iceberg doesn't modify the existing metadata. Instead, it:

Creates a new metadata file describing the new table state
Creates new manifest files describing new snapshots
Writes new data files
Atomically updates the catalog pointer using compare-and-swap (CAS) logic

If two writers collide, one's update is rejected, it retries, and the conflict resolves cleanly. This is optimistic concurrency control, and it's elegant because there's only one pointer to update.

The catalog itself is usually a simple file or an HTTP endpoint. For file-based catalogs, it's literally a pointer to a location in cloud storage. For REST catalogs (used by distributed systems like Polaris), it's an HTTP service that handles the pointer updates.

Layer 2: The Metadata File (The Table Schema)

Each snapshot has exactly one metadata file. It's a JSON file that describes the entire table at that point in time:

Schema: Column names, data types, field IDs (more on field IDs later)
Partition spec: How data is partitioned (by year? month? day?)
Current snapshot ID: Which snapshot is currently active
Snapshot history: All previous snapshots and their timestamps
Table properties: Custom metadata, format version, sort order

This metadata file is immutable. Once written, it never changes. Every time you write data, a new metadata file is created with an updated snapshot list.

Why immutability matters: Your entire table history is preserved. You can query any previous snapshot by ID. You can time-travel to last Tuesday. You can audit exactly what changed and when.

Layer 3: The Manifest List (The Snapshot View)

A manifest list is a file that describes all the manifests for a single snapshot.

Think of it as a table of contents: "This snapshot contains these manifests, which collectively describe all the data files in the table at this point in time."

The manifest list includes:

References to all manifest files
Partition specs for each manifest
Min/max values for each partition (for pruning)
File counts and row counts

When a query engine wants to scan a snapshot, it reads the manifest list first. This allows it to prune entire manifests before reading individual data files.

Why this level of indirection? It allows Iceberg to handle large tables elegantly. Instead of one giant list of all files, you have a list of manifests. If you have 10 million data files, you might have 1000 manifest files, and one manifest list. The query engine can prune at the manifest level before drilling into individual files.

Layer 4: Manifests (The File Inventory)

Each manifest is a file that lists a set of data files. It includes metadata about each file:

File path: Location of the data file
File format: Parquet, ORC, Avro
Row count: How many rows in this file
Byte size: File size
Column statistics: Min, max, null count per column
Partition values: What partition this file belongs to

Manifests are where the real filtering magic happens. Before a query engine reads a single data file, it can:

Check partition values: "Do I even need this file for this query?"
Check column statistics: "Are all values in this column outside my filter range?"
Prune aggressively: Skip entire files without touching them

For example, if you query WHERE year = 2024 AND user_id > 1000000, the manifest can tell you instantly which files have relevant data. No scanning required.

Layer 5: Data Files (The Actual Data)

Finally, at the bottom, you have the data files themselves. These are standard Parquet, ORC, or Avro files. They contain the actual table data.

The key point: They're immutable. Once written, they never change. All mutations (updates, deletes) are handled higher in the stack through delete files and new snapshot creation.

How the Hierarchy Enables Architecture Without Databases

Now you see the full picture. Let's connect it back to the original problem.

With this hierarchy, Iceberg can answer all three questions without an external metadata store:

Which files are relevant? Walk down from the catalog pointer through manifest list to manifests. File-level granularity statistics do the filtering.
What's the schema? It's in the metadata file, part of the immutable snapshot.
What values are in each column? Statistics are stored in manifests alongside each file reference.

The entire metadata layer is self-contained. You don't need a separate database. You don't need to manage catalog consistency separately from data consistency. It's all atomic, all together.

And because each snapshot is immutable, your entire table history is naturally preserved. Time travel queries aren't a special feature; they're just reading an older snapshot. Schema changes don't break the system; old data files continue to work with old schemas (thanks to field IDs, which we'll touch on next).

Field IDs: Why Names Are Dangerous

This is the secret weapon that makes schema evolution work.

In traditional tables, columns are identified by position or name. If you rename a column, downstream systems break. If you reorder columns, file readers get confused.

Iceberg uses field IDs instead. Every column has a unique numeric ID that never changes. When you ALTER TABLE to rename a column, the field ID stays the same. The data file format doesn't care about the name; it reads by ID.

Example:

-- Original table
CREATE TABLE users (
  id INT,              -- field ID 1
  name STRING,         -- field ID 2
  email STRING         -- field ID 3
)

-- Years later, rename a column
ALTER TABLE users RENAME COLUMN email TO email_address

-- Old data files still have field ID 3 pointing to email data
-- New queries use field ID 3, which now maps to email_address
-- Everything works transparently

This is why Iceberg tables can safely evolve their schemas over years without rewriting historical data.

Partition Evolution: Changing Strategy Without Rewrites

The same principle applies to partitioning.

Let's say you originally partitioned your data by year:

/year=2020/...
/year=2021/...
/year=2022/...

Later, you realize you need monthly partitions for performance:

/year=2023/month=01/...
/year=2023/month=02/...

With traditional data lakes, you'd have to rewrite all the old yearly partitioned data to the new monthly scheme. That's hours of work for large tables.

With Iceberg, you change the partition spec and move on. New writes use the new partition layout. Old writes keep the old layout. The manifest files track which partition spec applies to which data files. When you query, Iceberg handles both seamlessly.

Hidden Partitioning: The Query Engine Doesn't Know or Care

Here's another elegance: Your query doesn't explicitly reference partitions.

In Hive, you might write:

SELECT * FROM events WHERE year = 2024

Hive sees the partition column and uses it for partition pruning.

In Iceberg, you write:

SELECT * FROM events WHERE event_date >= '2024-01-01'

You don't mention partitioning at all. You just query the actual column. Iceberg's hidden partitioning handles partition transforms internally. If the table is partitioned by year(event_date), Iceberg applies the transform, prunes the right partitions, and returns the answer. The query engine never knows partitioning happened.

This is powerful because:

Queries are simpler and more portable across engines
You can change partition transforms without rewriting queries
The partition strategy is an implementation detail, not part of the query contract

Snapshots: Immutable Points in Time

Every write operation creates a new snapshot. A snapshot is an immutable view of the table at a point in time.

The catalog points to the current snapshot. Previous snapshots remain accessible. You can:

Query snapshot N: SELECT * FROM table VERSION AS OF snapshot_id_12345
Query by timestamp: SELECT * FROM table FOR SYSTEM_TIME AS OF '2026-04-10'
Audit history: Inspect the metadata file to see all snapshot timestamps
Rollback: Point the catalog back to an earlier snapshot

Because snapshots are immutable and complete, time travel isn't a performance problem. You're not scanning all history; you're just reading a different snapshot.

Copy-on-Write and Merge-on-Read: Two Paths to Mutation

Iceberg supports two different strategies for handling updates and deletes.

Copy-on-Write (CoW): When you delete a row, Iceberg reads the entire data file, filters out the deleted row, and writes a new file. The old file is marked as deleted in the manifest.

Pros: Clean reads, no delete reconciliation needed
Cons: Every mutation rewrites files (slow for write-heavy workloads)

Merge-on-Read (MoR): When you delete a row, Iceberg writes a separate delete file that marks rows as deleted. At read time, the query engine merges the data files and delete files to reconstruct the current state.

Pros: Fast writes, no file rewrites
Cons: Reads have to do reconciliation work

The manifest tracks both data files and delete files, so readers know which delete files apply to which data files.

Metadata Compaction: Keeping History Manageable

Over months and years, you accumulate many snapshots. The metadata files grow. The manifest lists grow.

Iceberg has automatic mechanisms to clean this up:

Metadata expiration: Old snapshots beyond a retention period are marked for deletion
Manifest compaction: Small manifest files are merged into larger ones
Orphan file cleanup: Files with no references in active snapshots are removed

You can also manually trigger compaction to optimize for query performance.

Why This Architecture Matters

Let's zoom out. Why is this design so important?

Scalability without infrastructure: You don't need a separate metadata store. Files themselves carry the metadata. This means Iceberg scales to petabytes without additional complexity.
ACID correctness: The atomic catalog pointer ensures that either a write succeeds completely or fails completely. There's no partial success, no consistency holes.
Engine independence: The metadata hierarchy is format-agnostic. Spark reads it the same way Trino does. The metadata is the contract between engines, not some database-specific schema.
History preservation: Because snapshots are immutable, you get time travel and audit trails for free. No special feature; just a natural consequence of the design.
Evolution without friction: Schema and partition evolution don't require data rewrites. Your table grows and changes safely.
Concurrent writes: Optimistic concurrency control at the metadata level means multiple writers can work simultaneously without locking. Conflicts resolve cleanly at the atomic point.

This is why Iceberg has become the standard for large-scale analytics. It's not that it's flashy or new. It's that it solves real problems at scale.

Bringing It All Together

The next time you write a query against an Iceberg table, remember the architecture beneath it:

The catalog pointer found your table
The metadata file described its schema
The manifest list identified relevant snapshots
The manifests pruned unnecessary files using statistics
The data files provided the actual rows
Delete files (if any) marked rows as removed

No external database needed. No consistency problems. No performance cliffs at scale.

Just files, organized intelligently, describing themselves.

That's Iceberg.

Resources

About the Author

Why Polaris Never Touches Your Cloud Credentials: Storage Config Internals

Prithvi S — Thu, 16 Apr 2026 08:16:23 +0000

Every data engineer has a nightmare: discovering that a credential spreadsheet with AWS keys got committed to git. Or worse, finding that production credentials are sitting in a YAML file on 50 developer laptops.

Most data platforms solve this by asking you to trust them with your cloud credentials. Snowflake stores them. Hive stores them. Glue stores them. Then they promise really hard not to leak them.

Apache Polaris takes a different approach entirely. It never asks for your cloud credentials at all.

Instead, it does something cleverly different: it establishes trust relationships with your cloud provider, then mints temporary, scoped credentials on-the-fly whenever an engine needs to read or write data. You set it up once. Then Polaris handles the rest.

This is the foundation of Polaris's entire security model, and it's worth understanding deeply. Not just because it's clever, but because it fundamentally changes what's possible in multi-tenant, regulated, or security-conscious environments.

Let's dig into how it works.

The Traditional Problem: Credential Storage

When you set up Snowflake to read from S3, you provide your AWS credentials. Snowflake stores them (encrypted, they promise). When a query runs, Snowflake uses those credentials to access S3.

This creates several problems:

1. Long-lived credentials in the system. If Snowflake's database gets compromised, those credentials are exposed for months or years until someone notices and rotates them.

2. One set of credentials for many operations. The same credential can be used to read, write, delete, or modify anything in your S3 account. There's no granularity.

3. Difficult audit trails. When suspicious S3 access happens, you can't pinpoint which Snowflake query or which user triggered it. The logs just show "snowflake_service_account accessed this bucket."

4. Compliance friction. Regulated organizations (healthcare, finance) have strict rules about where credentials can live. Storing them in Snowflake often violates those policies.

5. Credential rotation is manual and risky. You have to update credentials in Snowflake, hope nothing breaks mid-rotation, and coordinate with other systems.

Polaris was designed to solve all of these at once.

How Polaris Does It: The Trust Model

Instead of storing credentials, Polaris stores a configuration that establishes trust with your cloud provider. Let's walk through S3 as the example.

Step 1: Register Your Cloud Storage

When you create a catalog in Polaris, you provide a storage configuration. For S3, that looks like this:

{
  "storageType": "S3",
  "config": {
    "externalId": "polaris-prod-7f92ac",
    "roleArn": "arn:aws:iam::123456789012:role/polaris-catalog-role",
    "bucket": "my-company-data-lake"
  }
}

Notice what's not here: no AWS access key. No secret key. No credentials of any kind.

What is here is a reference to an IAM role that you've already created in AWS, plus an external ID that makes the trust relationship unique to this Polaris instance.

Step 2: Set Up the Trust Relationship in AWS

Before Polaris can mint credentials, you need to create that IAM role and configure it to trust Polaris. Here's the trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "polaris-prod-7f92ac"
        }
      }
    }
  ]
}

This says: "The Polaris service can assume this role, but only if it provides the external ID polaris-prod-7f92ac."

The external ID is crucial. It prevents confused deputy attacks. Even if an attacker compromises Polaris, they can't assume random IAM roles in other AWS accounts without the correct external ID.

Step 3: Attach Policies to the IAM Role

You then attach an S3 policy to that IAM role that limits what it can do:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-company-data-lake",
        "arn:aws:s3:::my-company-data-lake/*"
      ]
    }
  ]
}

This role can only read from your data lake bucket. It can't write, delete, or access anything else.

Now Polaris is set up. It has a configuration (not credentials) that points to this IAM role. It has an external ID. And the trust relationship is wired up in AWS.

The Credential Vending Flow

Here's where the magic happens. When a Spark engine wants to read data from a Polaris-managed table:

Request Phase

The Spark engine calls the Polaris REST API:

GET /v1/catalogs/my_catalog/namespaces/raw/tables/customers/data

Polaris receives this request and extracts the context: who is asking, what are they trying to do, and what table do they want to access.

Authorization Phase

Polaris checks its RBAC model. Does this principal have TABLE_READ_DATA permission on the customers table? It consults its role hierarchy:

The user's identity is bound to a principal role (e.g., "analytics_engineers")
That principal role is granted a catalog role (e.g., "read_raw_data")
That catalog role has TABLE_READ_DATA on the customers table

If the authorization check passes, Polaris moves to the next phase.

Credential Minting Phase

Polaris looks up the storage configuration for this table. It sees:

roleArn: arn:aws:iam::123456789012:role/polaris-catalog-role
externalId: polaris-prod-7f92ac

It then calls AWS STS AssumeRole:

aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/polaris-catalog-role \
  --external-id polaris-prod-7f92ac \
  --duration-seconds 900  # 15 minutes

AWS validates the external ID, checks the trust policy, and returns:

{
  "Credentials": {
    "AccessKeyId": "ASIA...",
    "SecretAccessKey": "...",
    "SessionToken": "...",
    "Expiration": "2026-04-16T09:28:00Z"
  }
}

These are temporary credentials. They expire in 15 minutes. They can only do what the IAM role allows (in this case, read S3).

Scope Restriction Phase

Polaris could stop here, but it doesn't. It further restricts these credentials to just the table being accessed. It uses S3 path restrictions or additional policy layers to ensure the credential can only touch s3://my-company-data-lake/raw/customers/, not s3://my-company-data-lake/sensitive/.

Response Phase

Polaris returns the temporary credentials to Spark:

{
  "credentials": {
    "aws_access_key_id": "ASIA...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
    "expires_at": "2026-04-16T09:28:00Z"
  },
  "path": "s3://my-company-data-lake/raw/customers/"
}

Spark now has everything it needs. It can read data for 15 minutes. After that, the credential is useless.

Why This Matters: The Security Benefits

This design has profound implications:

1. No long-lived credentials stored anywhere. Polaris doesn't store AWS keys. Your laptop doesn't have them. They're generated on-demand and expire quickly.

2. Instant revocation. If you need to immediately revoke a user's access, you update their Polaris role. The next credential mint fails. There's no delay.

3. Audit trails. AWS logs show exactly which Polaris instance, with which external ID, assumed the role. You can trace every data access back to a specific user and query.

4. Fine-grained access control. Different tables can have different IAM roles with different permissions. Read-only tables get read-only roles. Write-enabled tables get write roles. A user's access to each table is independently controlled.

5. Multi-cloud compatibility. Polaris supports the same pattern for GCS (using service account tokens) and Azure (using managed identities). The mechanism changes, but the principle is the same: temporary, scoped credentials.

6. Compliance-friendly. Regulated organizations can enforce policies like "credentials must expire in under 30 minutes" or "all access must be auditable." Polaris handles both automatically.

The GCS and Azure Equivalents

The S3 pattern generalizes to other clouds.

Google Cloud Storage

With GCS, you don't provide a credential. Instead, you provide a service account:

{
  "storageType": "GCS",
  "config": {
    "projectId": "my-gcp-project",
    "serviceAccount": "polaris@my-gcp-project.iam.gserviceaccount.com"
  }
}

You configure GCS IAM so that Polaris's service account can impersonate a restricted role. When a credential is needed, Polaris calls GCS APIs to get a short-lived access token. Same pattern, different mechanism.

Azure

With Azure, you use managed identities:

{
  "storageType": "AZURE",
  "config": {
    "tenantId": "12345678-...",
    "storageAccount": "mycompanydatalake",
    "containerId": "raw"
  }
}

Polaris (running in a managed identity or service principal) gets a short-lived token from Azure AD. Again, the principle is identical: temporary, scoped, revocable credentials.

Credential Caching and Performance

One question you might have: doesn't minting a new credential for every request add latency?

Yes, but Polaris optimizes for this. It caches credentials locally. If the same user asks for credentials for the same table within a few minutes, Polaris returns the cached credential instead of calling the cloud provider again. This reduces latency to under 10ms in most cases.

The tradeoff is acceptable: an extra 100-200ms on the first request for a credential is well worth the security benefits of never storing cloud credentials.

Deployment Implications

How do you actually deploy this? Polaris itself needs to run somewhere, and it needs to be able to call AWS STS (or GCS, or Azure AD).

Typically, you run Polaris in a Kubernetes cluster with a Kubernetes service account. You configure IRSA (IAM Roles for Service Accounts) to bind that service account to an IAM role. Polaris then inherits permissions to call STS.

The configuration looks like:

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/polaris-service-role

This means Polaris gets credentials to call AWS APIs, but those credentials are also temporary and scoped. You've just nested trust relationships.

What Gets Stored in Polaris

Since Polaris doesn't store cloud credentials, what does it store?

Storage configurations (roleArn, externalId, bucket, project, etc.)
Entity metadata (table names, schemas, partitions)
RBAC definitions (which roles have which privileges)
Audit logs (who accessed what, when)

All of this lives in Polaris's metadata store, typically a PostgreSQL database. The metadata store itself should be encrypted at rest and in transit, but it doesn't contain cloud credentials. Even if the metadata store is compromised, an attacker can't access your data lake.

Real-World Example: A Multi-Tenant SaaS

Imagine you're building a data platform SaaS. You have 100 customers, each with their own S3 bucket. You can't ask each customer for their AWS credentials (security nightmare for them). Instead:

Each customer creates an IAM role in their AWS account and trusts your Polaris instance
They register that role ARN in Polaris during onboarding
Your single Polaris instance now manages access to 100 buckets securely
Each customer's queries get credentials scoped to their bucket only
You can audit which customer accessed what, when

This is impossible with traditional credential storage.

Looking Forward: Polaris v1.3.0 Enhancements

Polaris v1.3.0 extends this pattern to federated catalogs. You can now register external catalogs (Snowflake, AWS Glue, Databricks) with Polaris, and Polaris will vend credentials for them too.

This means you could have a single Polaris instance managing access across Iceberg catalogs, Glue catalogs, and Snowflake, all without storing credentials for any of them.

Conclusion

The reason Polaris never touches your cloud credentials is because it doesn't have to. By establishing trust relationships upfront and minting temporary credentials on-demand, Polaris achieves something that traditional data platforms can't: security without storing secrets.

This is why enterprises are moving to Polaris. Not just because Iceberg is open-source, but because the entire access control model is built for environments where credentials are liabilities, not assets.

If you're building data infrastructure at scale, this pattern is worth understanding. It might change how you think about credential management in your own systems.

Want to learn more?

I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. Follow my work on GitHub: https://github.com/iprithv