<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Prithvi S</title>
    <description>The latest articles on Forem by Prithvi S (@iprithv).</description>
    <link>https://forem.com/iprithv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869317%2Fe48d8dde-3457-4eca-881a-f414fac5b86e.jpg</url>
      <title>Forem: Prithvi S</title>
      <link>https://forem.com/iprithv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iprithv"/>
    <language>en</language>
    <item>
      <title>How OpenSearch Plugins Really Work: Architecture &amp; Extension Points</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Wed, 15 Apr 2026 00:32:11 +0000</pubDate>
      <link>https://forem.com/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-59k1</link>
      <guid>https://forem.com/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-59k1</guid>
      <description>&lt;h1&gt;
  
  
  How OpenSearch Plugins Really Work: Architecture &amp;amp; Extension Points
&lt;/h1&gt;

&lt;p&gt;OpenSearch is powerful out of the box, but its true flexibility comes from plugins. Yet most developers treat plugins as black boxes: you install them, they work, and you move on. But what if you need to build one? Or understand why a plugin broke after an upgrade? Or design a system that integrates with OpenSearch's plugin ecosystem?&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through how plugins actually work: compilation, packaging, installation, and the extension points that make customization possible. By the end, you'll understand the mechanics well enough to build your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plugin Lifecycle: From Source to Running Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Writing and Compiling a Plugin
&lt;/h3&gt;

&lt;p&gt;A plugin is a Java project with dependencies on OpenSearch core. At minimum, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gradle"&gt;&lt;code&gt;&lt;span class="k"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;compileOnly&lt;/span&gt; &lt;span class="s2"&gt;"org.opensearch:opensearch:${opensearch_version}"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;compileOnly&lt;/code&gt; is critical: your plugin compiles against OpenSearch, but doesn't bundle it. The plugin will run inside the OpenSearch JVM, using the host's core libraries.&lt;/p&gt;

&lt;p&gt;Your plugin entry point is a class that extends &lt;code&gt;Plugin&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyCustomPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyCustomQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple declaration tells OpenSearch: "I provide a custom query type called &lt;code&gt;my_custom_query&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Building the Plugin Artifact
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;gradle build&lt;/code&gt;, you produce a .zip file containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-plugin-1.0.0.zip
├── opensearch-plugin-descriptor.properties
├── lib/
│   ├── my-plugin-1.0.0.jar
│   └── my-dependencies.jar (if any third-party libs needed)
├── bin/ (optional: scripts)
└── config/ (optional: default settings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;opensearch-plugin-descriptor.properties&lt;/code&gt; file is the plugin manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;my-custom-plugin&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;My custom query plugin&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="py"&gt;opensearch.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2.13.0&lt;/span&gt;
&lt;span class="py"&gt;java.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;11&lt;/span&gt;
&lt;span class="py"&gt;classname&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;com.example.MyCustomPlugin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This manifest declares: which OpenSearch version the plugin targets, what Java version it needs, and crucially, the entry point class name.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Installation via the opensearch-plugin Tool
&lt;/h3&gt;

&lt;p&gt;You install via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/opensearch-plugin &lt;span class="nb"&gt;install &lt;/span&gt;file:///path/to/my-plugin-1.0.0.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool does several things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verifies the manifest&lt;/strong&gt; - reads &lt;code&gt;opensearch-plugin-descriptor.properties&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version checks&lt;/strong&gt; - ensures the plugin targets the installed OpenSearch version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts&lt;/strong&gt; - unpacks to &lt;code&gt;plugins/my-custom-plugin/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks permissions&lt;/strong&gt; - prompts for confirmation if the plugin requests additional security permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires a restart&lt;/strong&gt; - the tool itself does not restart the node; you restart it so the new plugin code can load&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After restart, your plugin code is live.&lt;/p&gt;
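
&lt;p&gt;For sanity checks, the same CLI can list and remove plugins. (These are the standard &lt;code&gt;opensearch-plugin&lt;/code&gt; subcommands; the plugin name below is the hypothetical one from this post.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# show every plugin installed on this node
./bin/opensearch-plugin list

# uninstall; like install, this takes effect after a node restart
./bin/opensearch-plugin remove my-custom-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;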

&lt;h2&gt;
  
  
  Class Loader Isolation and Bootstrap
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Your plugin code runs in the same JVM as OpenSearch core. How does OpenSearch prevent your plugin from accidentally (or maliciously) breaking core?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class Loader Isolation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenSearch uses a custom &lt;code&gt;PluginClassLoader&lt;/code&gt; for each plugin. This loader is a child of the core class loader, but has its own namespace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core classes (org.opensearch.*) resolve from the main class loader&lt;/li&gt;
&lt;li&gt;Plugin classes resolve from the plugin's class loader first&lt;/li&gt;
&lt;li&gt;If a class isn't found in the plugin loader, it falls back to core&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents version conflicts. If your plugin wants to use a specific version of a library, it can bundle it, and its class loader will find that version first without conflicting with core.&lt;/p&gt;
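
&lt;p&gt;The delegation order can be sketched with the plain JDK &lt;code&gt;ClassLoader&lt;/code&gt; API. This is a simplified, hypothetical model for illustration only; the real &lt;code&gt;PluginClassLoader&lt;/code&gt; carries extra machinery for security and module handling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Map;

// Child-first delegation sketch: classes bundled with the "plugin" win,
// anything else falls back to the parent (core) loader.
class ChildFirstLoader extends ClassLoader {
    private final Map&amp;lt;String, byte[]&amp;gt; bundled; // bytecode shipped in the plugin zip

    ChildFirstLoader(ClassLoader core, Map&amp;lt;String, byte[]&amp;gt; bundled) {
        super(core);
        this.bundled = bundled;
    }

    @Override
    protected Class&amp;lt;?&amp;gt; loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class&amp;lt;?&amp;gt; c = findLoadedClass(name);
            if (c == null &amp;amp;&amp;amp; bundled.containsKey(name)) {
                byte[] b = bundled.get(name);
                c = defineClass(name, b, 0, b.length); // plugin's bundled copy wins
            }
            if (c == null) {
                c = super.loadClass(name, false);      // fall back to core
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;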

&lt;p&gt;&lt;strong&gt;Bootstrap Contract:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When OpenSearch starts, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Discovers all plugins in &lt;code&gt;plugins/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Reads each plugin's descriptor&lt;/li&gt;
&lt;li&gt;Creates a &lt;code&gt;PluginClassLoader&lt;/code&gt; for each&lt;/li&gt;
&lt;li&gt;Instantiates each plugin's entry point class via reflection&lt;/li&gt;
&lt;li&gt;Calls lifecycle methods: &lt;code&gt;onIndexModule()&lt;/code&gt;, &lt;code&gt;onNodeStarted()&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a plugin fails to load, OpenSearch will refuse to start. This is intentional: it's safer to fail loudly than to silently omit a plugin that applications might depend on.&lt;/p&gt;
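
&lt;p&gt;A plugin opts into those lifecycle calls simply by overriding them. A minimal sketch (the hook names come from the core &lt;code&gt;Plugin&lt;/code&gt; base class; the comment bodies are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;public class MyLifecyclePlugin extends Plugin {
    @Override
    public void onIndexModule(IndexModule indexModule) {
        // Called as each index on this node is created:
        // the place to register per-index listeners or settings.
    }

    @Override
    public void onNodeStarted() {
        // Called once the node is fully up: a safe point to start
        // background work that needs a running cluster.
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;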

&lt;h2&gt;
  
  
  Extension Points: How Plugins Hook Into OpenSearch
&lt;/h2&gt;

&lt;p&gt;A plugin doesn't have direct access to internal OpenSearch code. Instead, it implements well-defined &lt;strong&gt;extension point interfaces&lt;/strong&gt;. OpenSearch discovers these implementations and calls them at the right moments.&lt;/p&gt;

&lt;h3&gt;
  
  
  SearchPlugin: Custom Query Types and Aggregations
&lt;/h3&gt;

&lt;p&gt;The most common extension point for search-focused plugins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySearchPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom query types&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getAggregations&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom aggregations&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyAggregation:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getScoreFunctions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom scoring functions&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyScoreFunction:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once registered, your custom query is available via the REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"my_custom_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"boost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ActionPlugin: Custom REST and Transport Actions
&lt;/h3&gt;

&lt;p&gt;For plugins that need custom REST endpoints or transport operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyActionPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;ActionPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getActions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;INSTANCE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TransportMyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;RestHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getRestHandlers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RestController&lt;/span&gt; &lt;span class="n"&gt;restController&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
            &lt;span class="nc"&gt;ClusterSettings&lt;/span&gt; &lt;span class="n"&gt;clusterSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IndexScopedSettings&lt;/span&gt; &lt;span class="n"&gt;indexScopedSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;SettingsFilter&lt;/span&gt; &lt;span class="n"&gt;settingsFilter&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedWriteableRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedWriteableRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedXContentRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedXContentRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;DiscoveryNodes&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodesInCluster&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ClusterState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;clusterStateSupplier&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RestMyHandler&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can hit a custom endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /_plugin/my-action
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"param1"&lt;/span&gt;: &lt;span class="s2"&gt;"value"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MapperPlugin: Custom Field Types
&lt;/h3&gt;

&lt;p&gt;If you need a new field type (beyond standard text, keyword, numeric, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyMapperPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;MapperPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TypeParser&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getMappers&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomFieldMapper&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can use it in mappings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"custom_field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EnginePlugin: Custom Lucene Behavior
&lt;/h3&gt;

&lt;p&gt;For advanced use cases, you can hook into the Lucene engine itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyEnginePlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;EnginePlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EngineFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getEngineFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IndexSettings&lt;/span&gt; &lt;span class="n"&gt;indexSettings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomEngine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IngestPlugin: Custom Processors
&lt;/h3&gt;

&lt;p&gt;For plugins that process documents during ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyIngestPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;IngestPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Factory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getProcessors&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_processor"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factories&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyIngestProcessor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use in pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/_ingest/pipeline/my_pipeline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"processors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"my_processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Example: The Search Relevance Plugin
&lt;/h2&gt;

&lt;p&gt;OpenSearch's own &lt;strong&gt;search-relevance plugin&lt;/strong&gt; demonstrates these concepts in action. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom query types for A/B testing search relevance&lt;/li&gt;
&lt;li&gt;Custom aggregations for metrics collection&lt;/li&gt;
&lt;li&gt;REST endpoints to manage experiments&lt;/li&gt;
&lt;li&gt;System indexes (prefixed with &lt;code&gt;.plugins-search-rel-&lt;/code&gt;) to store experiment state&lt;/li&gt;
&lt;li&gt;Concurrent search request deciders (OpenSearch 2.17+) for custom query execution strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plugin is battle-tested in production, used by teams optimizing ranking and relevance across massive datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Indexes: How Plugins Store Their Own State
&lt;/h2&gt;

&lt;p&gt;Most non-trivial plugins need to persist data. Rather than requiring external storage, they use &lt;strong&gt;system indexes&lt;/strong&gt; within OpenSearch itself.&lt;/p&gt;

&lt;p&gt;System indexes are prefixed with &lt;code&gt;.plugins-&lt;/code&gt; or &lt;code&gt;.opendistro-&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.plugins-search-rel-&amp;lt;version&amp;gt;-experiments
.plugins-search-rel-&amp;lt;version&amp;gt;-notes
.plugins-ml-config
.opendistro-job-scheduler-lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The challenge: how do you evolve the schema without breaking existing deployments?&lt;/p&gt;

&lt;p&gt;OpenSearch plugins use a &lt;strong&gt;schema versioning pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ensureIndexInitialized&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;indexExists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;createIndex&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getIndexMeta&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currentVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrDefault&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"schema_version"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;fromVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;toVersion&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use Put Mapping API to add new fields (additive only)&lt;/span&gt;
    &lt;span class="c1"&gt;// Never remove or change existing field types&lt;/span&gt;
    &lt;span class="n"&gt;putMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newFields&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old documents coexist with new schema&lt;/li&gt;
&lt;li&gt;Upgrades are backwards compatible&lt;/li&gt;
&lt;li&gt;No downtime required for schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance and Reliability Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Startup Time
&lt;/h3&gt;

&lt;p&gt;Each plugin adds to startup time. Large plugins or plugins that do heavy initialization can slow cluster startup. Monitor this in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class Loader Memory
&lt;/h3&gt;

&lt;p&gt;Each plugin gets its own class loader, which holds its loaded classes in memory. More plugins mean a higher memory footprint, so keep the plugin count reasonable.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Stability
&lt;/h3&gt;

&lt;p&gt;OpenSearch's plugin APIs are versioned with OpenSearch itself: a plugin declares the OpenSearch version it was built against. When a new OpenSearch version ships, plugins must be recompiled and retested against it. This is by design: it ensures plugins stay compatible with core.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Plugins run in the same JVM as OpenSearch core. A malicious or buggy plugin can crash the entire node. Only install plugins from trusted sources. In multi-tenant environments, consider network isolation or separate clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own Plugin: Where to Start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the plugin template:&lt;/strong&gt; OpenSearch provides a &lt;code&gt;plugin-template&lt;/code&gt; repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement your extension point&lt;/strong&gt; (SearchPlugin, ActionPlugin, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write tests&lt;/strong&gt; - use OpenSearch's testing framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the .zip&lt;/strong&gt; - &lt;code&gt;gradle build&lt;/code&gt; produces the artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install locally&lt;/strong&gt; - &lt;code&gt;./bin/opensearch-plugin install file://...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test end-to-end&lt;/strong&gt; - verify your REST endpoint/query/aggregation works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; - host on artifact repository or GitHub Releases&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenSearch plugins are not magic. They're well-structured Java code that hooks into OpenSearch via extension points. Understanding this architecture demystifies plugin behavior, helps you troubleshoot issues, and opens the door to building custom extensions.&lt;/p&gt;

&lt;p&gt;Whether you're optimizing search relevance, integrating with custom systems, or building observability tooling, the plugin architecture gives you the hooks you need without compromising core stability.&lt;/p&gt;

&lt;p&gt;The next time a plugin breaks after an upgrade, you'll know exactly where to look. And when you need to build one, you'll have a mental model of how the pieces fit together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to explore further?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch Plugin Developer Guide: &lt;a href="https://opensearch.org/docs/latest/plugins/intro/" rel="noopener noreferrer"&gt;https://opensearch.org/docs/latest/plugins/intro/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Plugin Template Repository: &lt;a href="https://github.com/opensearch-project/plugin-template" rel="noopener noreferrer"&gt;https://github.com/opensearch-project/plugin-template&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Search Relevance Plugin (Prithvi's contribution): &lt;a href="https://github.com/opensearch-project/dashboards-search-relevance" rel="noopener noreferrer"&gt;https://github.com/opensearch-project/dashboards-search-relevance&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>search</category>
      <category>database</category>
      <category>data</category>
    </item>
    <item>
      <title>Inverted Index Explained: How Elasticsearch Achieves Sub-Millisecond Search on Billions of Documents</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:32:45 +0000</pubDate>
      <link>https://forem.com/iprithv/inverted-index-explained-how-elasticsearch-achieves-sub-millisecond-search-on-billions-of-documents-3la6</link>
      <guid>https://forem.com/iprithv/inverted-index-explained-how-elasticsearch-achieves-sub-millisecond-search-on-billions-of-documents-3la6</guid>
      <description>&lt;p&gt;Imagine you're building a search feature for your product catalog. You have 10 million products, and you need to return relevant results in under 100 milliseconds. You decide to use PostgreSQL's full-text search, so you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'wireless headphones'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works. But then you get 100 million products. Then a billion. The queries crawl from 100ms to 5 seconds. Your users leave. Your boss asks why.&lt;/p&gt;

&lt;p&gt;The answer isn't "use a bigger database." The answer is "use a different data structure."&lt;/p&gt;

&lt;p&gt;Elasticsearch doesn't store data the way PostgreSQL does. It uses something called an &lt;strong&gt;inverted index&lt;/strong&gt;, and that one difference is why Elasticsearch can search a billion documents in 2-5 milliseconds while traditional databases take seconds.&lt;/p&gt;

&lt;p&gt;This post dives into how that magic works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an Inverted Index?
&lt;/h2&gt;

&lt;p&gt;Think of a book. At the back, there's an index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Elasticsearch ... pages 45, 78, 120, 156
Performance ... pages 45, 89, 203
Database ... pages 12, 78, 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index maps &lt;strong&gt;words to page numbers&lt;/strong&gt;. When you want to find information about "Performance," you look it up once and jump directly to those pages. You don't read every single page.&lt;/p&gt;

&lt;p&gt;That's the core idea of an inverted index.&lt;/p&gt;

&lt;p&gt;Now imagine instead of a book, you have documents. Your "index" maps &lt;strong&gt;terms to document IDs&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;elasticsearch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;performance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's "inverted" because it flips the relationship. A &lt;strong&gt;forward index&lt;/strong&gt; says "doc1 contains terms: elasticsearch, performance, scalability." An &lt;strong&gt;inverted index&lt;/strong&gt; says "term elasticsearch is in documents: 1, 3, 5, 8."&lt;/p&gt;

&lt;p&gt;Why does this matter? Because searching becomes trivially fast.&lt;/p&gt;

&lt;p&gt;When someone searches for "elasticsearch," Elasticsearch doesn't scan all documents. It looks up "elasticsearch" in the index once and gets back a list of document IDs. Done. One term-dictionary lookup plus a single postings-list traversal.&lt;/p&gt;
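&lt;p&gt;To make that concrete, here's a toy inverted index in Python. This is just the shape of the idea, not how Lucene encodes it on disk:&lt;/p&gt;

```python
# A toy inverted index: each term maps to the list of doc IDs containing it.
docs = {
    1: "elasticsearch is powerful",
    2: "postgres is a database",
    3: "elasticsearch is a database",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, []).append(doc_id)

# A query is a single dictionary lookup; no document is ever scanned.
print(index["elasticsearch"])  # [1, 3]
print(index["database"])       # [2, 3]
```

&lt;p&gt;Whether you have three documents or a billion, the lookup cost is one probe plus the length of the postings list, never the size of the corpus.&lt;/p&gt;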

&lt;h2&gt;
  
  
  Under the Hood: How Elasticsearch Builds and Uses Inverted Indices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Text Analysis (Before Indexing)
&lt;/h3&gt;

&lt;p&gt;Before a document gets indexed, its text goes through an analyzer pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"settings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"analysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"custom_analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"char_filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"html_strip"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"tokenizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"lowercase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"porter_stem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Removes HTML tags&lt;/strong&gt; (character filter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splits text into tokens&lt;/strong&gt; (tokenizer): "Elasticsearch is powerful" becomes ["Elasticsearch", "is", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowercases tokens&lt;/strong&gt; (filter): ["elasticsearch", "is", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removes stop words&lt;/strong&gt; (filter): ["elasticsearch", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stems words&lt;/strong&gt; (filter): ["elasticsearch", "power"]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now "powerful" and "powers" both map to the same root "power," so a search for "power" finds both.&lt;/p&gt;

&lt;p&gt;The analyzer is completely customizable. For medical documents, you might preserve technical terms. For e-commerce, you might add synonym expansion (so "laptop" matches "notebook").&lt;/p&gt;
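&lt;p&gt;The pipeline above can be sketched in a few lines of Python. This is a toy approximation: the char filter step is omitted, and &lt;code&gt;naive_stem&lt;/code&gt; is a crude stand-in for the real Porter stemmer:&lt;/p&gt;

```python
import re

STOP_WORDS = {"is", "a", "the", "and", "of"}
SUFFIXES = ("ful", "ing", "ed", "s")  # toy stand-in for porter_stem

def naive_stem(token):
    # Strip the first matching suffix, if any.
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # tokenize
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [naive_stem(t) for t in tokens]               # stem

print(analyze("Elasticsearch is powerful"))  # ['elasticsearch', 'power']
```

&lt;p&gt;The same function runs at index time and at query time, which is why a query for "power" can match a document containing "powerful."&lt;/p&gt;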

&lt;h3&gt;
  
  
  Step 2: Segment Creation (The Immutable Index)
&lt;/h3&gt;

&lt;p&gt;Here's where Elasticsearch gets clever. Instead of maintaining one large mutable index, it creates immutable &lt;strong&gt;segments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When documents arrive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;They sit in an in-memory buffer&lt;/li&gt;
&lt;li&gt;Every ~1 second (the refresh interval), the buffer is written out as a new segment (into the filesystem cache; a separate, less frequent flush fsyncs to disk)&lt;/li&gt;
&lt;li&gt;Each segment is an inverted index, but immutable&lt;/li&gt;
&lt;li&gt;Multiple segments are searched in parallel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why immutable? Because it's fast. You never have to lock or rebalance. You just append new segments. If a crash happens mid-write, you have the translog to recover from.&lt;/p&gt;

&lt;p&gt;Here's what a tiny two-document segment looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;INVERTED INDEX (Segment 1):

Term          | Postings List (Doc IDs)
--------------|----------------------
elasticsearch | [1, 2]
powerful      | [1]
scales        | [2]
horizontally  | [2]

DOCUMENT STORE:
Doc 1: "elasticsearch is powerful"
Doc 2: "elasticsearch scales horizontally"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Term Lookup (Lightning Fast)
&lt;/h3&gt;

&lt;p&gt;When you search for "elasticsearch," here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The query arrives at a coordinator node&lt;/li&gt;
&lt;li&gt;It broadcasts the query to all relevant shards&lt;/li&gt;
&lt;li&gt;Each shard performs a &lt;strong&gt;binary search&lt;/strong&gt; on the sorted terms in its segments&lt;/li&gt;
&lt;li&gt;Found "elasticsearch"? Return the postings list: [1, 2]&lt;/li&gt;
&lt;li&gt;Fetch those documents from the document store&lt;/li&gt;
&lt;li&gt;Return to coordinator, which merges results from all shards&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic: &lt;strong&gt;binary search on sorted terms is O(log N)&lt;/strong&gt;. On a million terms, that's ~20 comparisons. Then you get the postings list and you're done.&lt;/p&gt;
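&lt;p&gt;Steps 3 and 4 on a single segment can be sketched like this. Lucene actually uses an FST-based term dictionary rather than a flat sorted list, but the complexity argument is the same:&lt;/p&gt;

```python
import bisect

# Sorted term dictionary and postings lists for the two-document segment above.
terms = ["elasticsearch", "horizontally", "powerful", "scales"]
postings = {
    "elasticsearch": [1, 2],
    "horizontally": [2],
    "powerful": [1],
    "scales": [2],
}

def lookup(term):
    # O(log N) binary search over the sorted term dictionary.
    i = bisect.bisect_left(terms, term)
    if i != len(terms) and terms[i] == term:
        return postings[term]
    return []  # term not present in this segment

print(lookup("elasticsearch"))  # [1, 2]
print(lookup("kafka"))          # []
```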

&lt;h3&gt;
  
  
  Step 4: Segment Merging (Background Optimization)
&lt;/h3&gt;

&lt;p&gt;Over time, you accumulate many small segments. Searching 100 segments is slower than searching 1 large segment. So Elasticsearch periodically merges them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Segment 1 (1000 docs) + Segment 2 (1000 docs) -&amp;gt; Merged Segment (2000 docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The merge is invisible to you. It happens in the background. Old segments are deleted. The new merged segment is searched going forward.&lt;/p&gt;
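&lt;p&gt;Conceptually, a merge is just a union of the segments' term dictionaries with their postings lists concatenated and re-sorted. A toy sketch (a real merge also drops documents marked as deleted):&lt;/p&gt;

```python
def merge_segments(seg_a, seg_b):
    # Union the term dictionaries; combine and re-sort each postings list.
    merged = {}
    for segment in (seg_a, seg_b):
        for term, doc_ids in segment.items():
            merged.setdefault(term, []).extend(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}

seg1 = {"elasticsearch": [1, 2], "powerful": [1]}
seg2 = {"elasticsearch": [3], "scales": [3]}
print(merge_segments(seg1, seg2))
# {'elasticsearch': [1, 2, 3], 'powerful': [1], 'scales': [3]}
```

&lt;p&gt;Because segments are immutable, the merge can build the new segment off to the side and atomically swap it in, which is why searches never block.&lt;/p&gt;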

&lt;h2&gt;
  
  
  Why Inverted Index Beats Traditional Databases
&lt;/h2&gt;

&lt;p&gt;Let's compare searching 1 billion documents for "elasticsearch":&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL with Full-Text Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'elasticsearch'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Without an expression index, PostgreSQL evaluates &lt;code&gt;to_tsvector&lt;/code&gt; for every row: a full sequential scan&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;GIN index&lt;/strong&gt; (itself a form of inverted index) speeds up matching, but it lives inside a general-purpose row store&lt;/li&gt;
&lt;li&gt;For "elasticsearch," the database must:

&lt;ul&gt;
&lt;li&gt;Find the candidate rows via the index&lt;/li&gt;
&lt;li&gt;Fetch and visibility-check each matching heap tuple&lt;/li&gt;
&lt;li&gt;Compute &lt;code&gt;ts_rank&lt;/code&gt; per row to order by relevance, since ranking can't use the index&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a billion products, this takes &lt;strong&gt;3-10 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticsearch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;GET /products/_search
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"match"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elasticsearch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Looks up "elasticsearch" in the inverted index (binary search, ~30 comparisons)&lt;/li&gt;
&lt;li&gt;Gets back a postings list&lt;/li&gt;
&lt;li&gt;Fetches the top 10 documents&lt;/li&gt;
&lt;li&gt;Returns results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time: &lt;strong&gt;2-5 milliseconds&lt;/strong&gt; on a properly sized cluster.&lt;/p&gt;

&lt;p&gt;The difference: inverted indices are designed specifically for text search. B-trees are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-Off: Updates vs Queries
&lt;/h2&gt;

&lt;p&gt;But inverted indices have a cost: updates are expensive.&lt;/p&gt;

&lt;p&gt;When you update a document in Elasticsearch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The old document is marked for deletion&lt;/li&gt;
&lt;li&gt;A new document is indexed (goes through analysis, creates new segment)&lt;/li&gt;
&lt;li&gt;A merge eventually removes the deleted document&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes milliseconds to seconds, not microseconds. Elasticsearch is eventually consistent.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you just UPDATE a row. Done immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So when do you use each?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Transactional workloads, frequent updates, complex joins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticsearch:&lt;/strong&gt; Text search, logs, analytics, observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-World Performance Numbers
&lt;/h2&gt;

&lt;p&gt;Here are actual numbers from production Elasticsearch clusters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Elasticsearch&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search 1M docs&lt;/td&gt;
&lt;td&gt;2-3ms&lt;/td&gt;
&lt;td&gt;100-200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search 1B docs&lt;/td&gt;
&lt;td&gt;5-10ms&lt;/td&gt;
&lt;td&gt;3-10s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation (cardinality)&lt;/td&gt;
&lt;td&gt;5-20ms&lt;/td&gt;
&lt;td&gt;500ms-2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index throughput&lt;/td&gt;
&lt;td&gt;100K docs/sec&lt;/td&gt;
&lt;td&gt;10K-50K docs/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory per 50GB data&lt;/td&gt;
&lt;td&gt;8-16GB (compressed)&lt;/td&gt;
&lt;td&gt;50GB+ (uncompressed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The compression factor is huge: Elasticsearch's inverted index compresses 3-5x tighter than raw JSON because terms are deduplicated and encoded efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relevance Scoring with BM25
&lt;/h2&gt;

&lt;p&gt;Now that we can find documents fast, the next question is: which results should be first?&lt;/p&gt;

&lt;p&gt;Elasticsearch uses &lt;strong&gt;BM25&lt;/strong&gt;, a probabilistic relevance framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BM25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_frequency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inverse_document_frequency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain English:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Term frequency:&lt;/strong&gt; how many times does "elasticsearch" appear in the document? (more = higher score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inverse document frequency:&lt;/strong&gt; how rare is "elasticsearch"? (rare terms like "llama-index" rank higher than common terms like "the")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document length normalization:&lt;/strong&gt; prevent long documents from always ranking highest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if you search "elasticsearch performance," a document mentioning "elasticsearch" 5 times and "performance" 3 times ranks higher than a document mentioning each once.&lt;/p&gt;
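&lt;p&gt;A minimal sketch of the scoring function, using the standard constants k1=1.2 and b=0.75 (Lucene's production implementation adds caching and a few refinements):&lt;/p&gt;

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    # Inverse document frequency: rare terms contribute more.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency with saturation, normalized by document length.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# A doc mentioning the term 5 times outranks one mentioning it once.
print(bm25_score(tf=5, doc_len=100, avg_doc_len=100, num_docs=1_000_000, doc_freq=1_000))
print(bm25_score(tf=1, doc_len=100, avg_doc_len=100, num_docs=1_000_000, doc_freq=1_000))
```

&lt;p&gt;Run it and the five-occurrence document scores higher, but not five times higher: term frequency saturates, which is exactly what you want.&lt;/p&gt;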

&lt;p&gt;You can customize this with field boosting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"boost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now matches in the title count twice as much. Perfect for building relevant search experiences.&lt;/p&gt;
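&lt;p&gt;As an illustration, here's how that boosted query could be assembled programmatically in Python. The field names mirror the JSON above; the helper function and any client call it would feed (such as &lt;code&gt;es.search&lt;/code&gt;) are assumptions for the sketch, not part of the article's setup:&lt;/p&gt;

```python
# Build the boosted bool query from above as a plain dict. In a real app this
# body would be passed to a search client, e.g. es.search(index="articles", body=query).
def boosted_query(term, boosted_field, other_field, boost=2):
    """Weight matches in boosted_field more heavily than matches in other_field."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {boosted_field: {"query": term, "boost": boost}}},
                    {"match": {other_field: term}},
                ]
            }
        }
    }

query = boosted_query("elasticsearch", "title", "body")
print(query["query"]["bool"]["must"][0]["match"]["title"]["boost"])  # 2
```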

&lt;h2&gt;
  
  
  Common Mistakes (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Too Many Shards
&lt;/h3&gt;

&lt;p&gt;You have 1GB of data and create 100 shards. Each shard has 10MB.&lt;/p&gt;

&lt;p&gt;Problem: search latency goes through the roof because you're coordinating across 100 shards, and overhead dominates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; aim for 10-50GB per shard. If you have 1TB of data, 20-100 shards is reasonable.&lt;/p&gt;
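&lt;p&gt;The rule of thumb above is easy to encode. A minimal sketch (the function name and the rounding choice are mine, not any Elasticsearch API):&lt;/p&gt;

```python
def recommended_shards(total_gb, min_shard_gb=10, max_shard_gb=50):
    """Turn the 10-50GB-per-shard rule of thumb into a (low, high) shard-count range."""
    low = max(1, round(total_gb / max_shard_gb))
    high = max(1, round(total_gb / min_shard_gb))
    return low, high

print(recommended_shards(1000))  # 1TB of data -> (20, 100)
print(recommended_shards(1))     # 1GB of data -> (1, 1), nowhere near 100 shards
```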

&lt;h3&gt;
  
  
  Mistake 2: Ignoring Refresh Interval
&lt;/h3&gt;

&lt;p&gt;You index a document and try to search it immediately. Nothing.&lt;/p&gt;

&lt;p&gt;That's because the default refresh interval is 1 second. Your data sits in the buffer for up to 1 second before becoming searchable.&lt;/p&gt;

&lt;p&gt;For near-real-time search, you might lower this to 100ms. But each refresh creates a new segment, and merging costs CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balance:&lt;/strong&gt; 500ms-1s for most use cases. Only lower for critical real-time systems.&lt;/p&gt;
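&lt;p&gt;If you do need to change it, the refresh interval is just an index setting. A sketch of the settings body (the helper is hypothetical; with the official Python client you would send it via something like &lt;code&gt;es.indices.put_settings&lt;/code&gt;):&lt;/p&gt;

```python
def refresh_settings(interval="1s"):
    """Build the settings body that controls how often new segments become searchable."""
    return {"index": {"refresh_interval": interval}}

# Near-real-time systems might accept the extra segment-merge cost:
fast = refresh_settings("100ms")
# Bulk loads often disable refresh entirely ("-1"), then restore it afterwards:
bulk = refresh_settings("-1")
print(fast, bulk)
```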

&lt;h3&gt;
  
  
  Mistake 3: Bad Analyzer Configuration
&lt;/h3&gt;

&lt;p&gt;You don't configure an analyzer, so Elasticsearch uses the default standard analyzer.&lt;/p&gt;

&lt;p&gt;Now when users search "AWS S3", they get poor or empty results because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the query tokenizes to "aws" and "s3"&lt;/li&gt;
&lt;li&gt;documents that mention only "S3" never produce an "aws" token, and nothing tells Elasticsearch that "AWS S3" and "S3" mean the same thing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom analyzer with synonym expansion fixes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"synonyms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"synonym"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"synonyms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AWS S3 =&amp;gt; s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"machine learning =&amp;gt; ml"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
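&lt;p&gt;To see why the synonym filter helps, here's a toy simulation of index-time synonym expansion in Python. This is not the Lucene implementation, just a sketch of the idea: multi-token phrases collapse to a canonical term before indexing:&lt;/p&gt;

```python
def apply_synonyms(tokens, rules):
    """Tiny index-time synonym filter: each rule maps a phrase tuple to one replacement token."""
    out, i = [], 0
    while i != len(tokens):
        matched = False
        for phrase, replacement in rules:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append(replacement)
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out

# Mirrors the "AWS S3 => s3" and "machine learning => ml" rules above:
rules = [(("aws", "s3"), "s3"), (("machine", "learning"), "ml")]
print(apply_synonyms(["aws", "s3", "bucket"], rules))        # ['s3', 'bucket']
print(apply_synonyms(["machine", "learning", "models"], rules))  # ['ml', 'models']
```

&lt;p&gt;Because the same filter runs at index time and query time, both sides collapse to "s3" and the documents match.&lt;/p&gt;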



&lt;h2&gt;
  
  
  Conclusion: Why Inverted Index Matters
&lt;/h2&gt;

&lt;p&gt;The inverted index is deceptively simple: a mapping from terms to document IDs. But this simple data structure enables Elasticsearch to do what traditional databases struggle with: search billions of documents in milliseconds.&lt;/p&gt;
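&lt;p&gt;That term-to-document mapping fits in a few lines of Python, which is part of why it scales so well. A toy version (real engines add postings compression, skip lists, and scoring on top):&lt;/p&gt;

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "fast search engine", 2: "search billions of documents", 3: "fast lookups"}
index = build_index(docs)
print(index["search"])  # [1, 2]
print(index["fast"])    # [1, 3]
```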

&lt;p&gt;The key insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inverted index is designed for text search&lt;/strong&gt;, not general-purpose queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable segments&lt;/strong&gt; enable fast, lockless indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary search on terms&lt;/strong&gt; makes lookup blazing fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BM25 scoring&lt;/strong&gt; automatically ranks results by relevance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trade-off:&lt;/strong&gt; fast reads, slower updates. Worth it for search workloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building a search feature, a logging system, or an observability platform, understanding how Elasticsearch works under the hood will save you from common mistakes and help you build systems that scale.&lt;/p&gt;

&lt;p&gt;Next step? Learn how to scale Elasticsearch horizontally with sharding, tune refresh and flush intervals for your workload, and customize analyzers for your domain.&lt;/p&gt;

&lt;p&gt;Happy searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html" rel="noopener noreferrer"&gt;Elasticsearch Inverted Index Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25" rel="noopener noreferrer"&gt;BM25 Algorithm Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;GitHub: Elasticsearch Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html" rel="noopener noreferrer"&gt;How to Tune Elasticsearch for Your Workload&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm Prithvi S, a Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>search</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>How Apache Iceberg's Metadata Architecture Enables ACID at Scale</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:08:56 +0000</pubDate>
      <link>https://forem.com/iprithv/how-apache-icebergs-metadata-architecture-enables-acid-at-scale-54kh</link>
      <guid>https://forem.com/iprithv/how-apache-icebergs-metadata-architecture-enables-acid-at-scale-54kh</guid>
      <description>&lt;p&gt;When you have a petabyte of data across millions of files in cloud storage, how do you ensure that reads are consistent, writes don't collide, and schema changes don't break everything? Traditional data lakes punt on this problem. Apache Iceberg solves it with an elegant metadata architecture that brings SQL-table reliability to distributed storage without needing a centralized database.&lt;/p&gt;

&lt;p&gt;Let me walk you through how it works, why each layer matters, and what makes it fundamentally different from older table formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Why Traditional Data Lakes Are Unreliable
&lt;/h2&gt;

&lt;p&gt;Before Iceberg, data lakes operated like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data engineers wrote Parquet files to S3&lt;/li&gt;
&lt;li&gt;A Hive metastore tracked table schemas and partition locations&lt;/li&gt;
&lt;li&gt;Queries discovered files by scanning directories or querying the metastore&lt;/li&gt;
&lt;li&gt;Updates meant either rewriting entire partitions or leaving data inconsistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This worked for append-only workflows. But the moment you needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt; (add a column) without rewriting data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic deletes&lt;/strong&gt; without breaking other writers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time travel&lt;/strong&gt; (query historical snapshots)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent writes&lt;/strong&gt; without conflicts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...you hit a wall. The metastore was a single point of contention, and there was no reliable way to track which files belonged to which version of the table.&lt;/p&gt;

&lt;p&gt;Iceberg fixes this by building a versioned metadata system directly into the table format. No metastore required (though one can help). Just immutable snapshots and a pointer to the current state.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metadata Hierarchy: Five Layers of Metadata
&lt;/h2&gt;

&lt;p&gt;Iceberg organizes metadata in a clean bottom-to-top hierarchy. Each layer is immutable, and each layer is built from the layer below it. Here's how it works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Data Files
&lt;/h3&gt;

&lt;p&gt;At the bottom are your actual data files: Parquet, ORC, or Avro files stored in cloud storage (S3, GCS, Azure Blob, or local filesystems). These contain the raw table data, partitioned and compressed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-bucket/warehouse/db/table/data/
  00001-abc123.parquet    &amp;lt;- 50MB, partition year=2024, month=01
  00002-def456.parquet    &amp;lt;- 60MB, partition year=2024, month=01
  00003-ghi789.parquet    &amp;lt;- 45MB, partition year=2024, month=02
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From Iceberg's perspective, these files are opaque. It doesn't care about their internal structure. What matters is that each file has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A file path&lt;/li&gt;
&lt;li&gt;File format (parquet/orc/avro)&lt;/li&gt;
&lt;li&gt;Partition values (year=2024, month=01)&lt;/li&gt;
&lt;li&gt;Column-level statistics (min/max/null counts per column)&lt;/li&gt;
&lt;li&gt;File size and record count&lt;/li&gt;
&lt;/ul&gt;
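&lt;p&gt;Those per-file attributes can be modeled as a small record. A Python sketch (illustrative field names loosely following the spec, not an actual Iceberg API):&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class DataFileEntry:
    """What Iceberg tracks per data file; the file contents stay opaque."""
    file_path: str
    file_format: str        # "parquet", "orc", or "avro"
    partition: dict         # e.g. {"year": 2024, "month": 1}
    record_count: int
    file_size_in_bytes: int
    # per-column statistics keyed by field ID, used later for pruning
    lower_bounds: dict = field(default_factory=dict)
    upper_bounds: dict = field(default_factory=dict)

entry = DataFileEntry(
    file_path="s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet",
    file_format="parquet",
    partition={"year": 2024, "month": 1},
    record_count=1234567,
    file_size_in_bytes=52428800,
)
print(entry.partition["month"])  # 1
```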

&lt;h3&gt;
  
  
  Layer 2: Manifest Files
&lt;/h3&gt;

&lt;p&gt;Above data files sit manifest files. A manifest file is an Avro file that lists the data files belonging to a snapshot, along with metadata about each one.&lt;/p&gt;

&lt;p&gt;Think of a manifest like a file listing with extra info:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snapshot_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9223372036854775807&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequence_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PARQUET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spec_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"partition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"record_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234567&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_size_in_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;52428800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"column_sizes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10485760&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15728640&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;26214400&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"value_counts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234567&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"null_value_counts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lower_bounds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"upper_bounds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9999&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The manifest records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which data files are live vs deleted&lt;/li&gt;
&lt;li&gt;Partition values for each file&lt;/li&gt;
&lt;li&gt;Column-level statistics (min/max/null counts) for pruning&lt;/li&gt;
&lt;li&gt;File sizes and record counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the magic starts. Because every data file's statistics are recorded in a manifest, query engines can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read a single manifest file (much smaller than scanning all data files)&lt;/li&gt;
&lt;li&gt;Prune files based on partition values or column statistics&lt;/li&gt;
&lt;li&gt;Skip reading files that don't match the query filter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a table with a million data files, you might read a handful of manifest files and determine that only 500 data files match your query. No metadata service needed; it's all in the manifests.&lt;/p&gt;
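&lt;p&gt;Here's a toy version of that pruning step in Python, assuming each manifest entry carries per-column lower and upper bounds as shown above (the dict layout is simplified for the sketch):&lt;/p&gt;

```python
def prune(files, column, lo, hi):
    """Keep only files whose [min, max] range for a column overlaps [lo, hi]."""
    kept = []
    for f in files:
        f_min, f_max = f["lower"][column], f["upper"][column]
        # ranges overlap when each range's max is at or above the other's min
        if f_max >= lo and hi >= f_min:
            kept.append(f)
    return kept

files = [
    {"path": "a.parquet", "lower": {"user_id": 100}, "upper": {"user_id": 999}},
    {"path": "b.parquet", "lower": {"user_id": 5000}, "upper": {"user_id": 9999}},
]
# WHERE user_id BETWEEN 200 AND 300 only needs the first file:
print([f["path"] for f in prune(files, "user_id", 200, 300)])  # ['a.parquet']
```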

&lt;h3&gt;
  
  
  Layer 3: Manifest List
&lt;/h3&gt;

&lt;p&gt;A manifest list is an Avro file that references all the manifest files for a given snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"manifest_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-bucket/warehouse/db/table/metadata/10001-abc.avro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"manifest_length"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"partition_spec_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added_snapshot_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9223372036854775807&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added_files_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"existing_files_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deleted_files_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added_rows_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18750000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"existing_rows_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;312500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deleted_rows_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;125000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"partitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"contains_null"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lower_bound"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"upper_bound"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The manifest list aggregates statistics from all manifests for that snapshot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many files were added/existing/deleted?&lt;/li&gt;
&lt;li&gt;How many rows were added/deleted?&lt;/li&gt;
&lt;li&gt;What partition ranges does this snapshot contain?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Because query engines need to know if a snapshot is even relevant before reading all its manifests. The manifest list answers that in one read.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Metadata File (JSON)
&lt;/h3&gt;

&lt;p&gt;Above the manifest list sits the metadata file. This is a JSON file that contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table schema and column definitions&lt;/li&gt;
&lt;li&gt;Partition spec (how to partition the table)&lt;/li&gt;
&lt;li&gt;Current snapshot ID (pointer to the "live" snapshot)&lt;/li&gt;
&lt;li&gt;Snapshot history (all past snapshots)&lt;/li&gt;
&lt;li&gt;Table properties and settings&lt;/li&gt;
&lt;li&gt;Sort order definitions&lt;/li&gt;
&lt;li&gt;Schema evolution history (field IDs and renames)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example metadata file structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format-version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table-uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123-def-456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-bucket/warehouse/db/table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last-sequence-number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last-updated-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1712973955000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last-column-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"struct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"event_timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"partition-spec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"source-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current-snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9223372036854775807&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snapshots"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9223372036854775807&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1712973955000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"append"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spark.app.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"app-20240412-123456"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"manifest-list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-bucket/warehouse/db/table/metadata/v1.manifest.list"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single JSON file is the table's source of truth. It tells you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What columns exist (with field IDs, not just names)&lt;/li&gt;
&lt;li&gt;How the table is partitioned&lt;/li&gt;
&lt;li&gt;Which snapshot is current&lt;/li&gt;
&lt;li&gt;The full history of all snapshots ever created&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 5: The Catalog Pointer (The Only Mutable Piece)
&lt;/h3&gt;

&lt;p&gt;Finally, at the very top sits the catalog. The catalog's job is simple: store a pointer to the current metadata file location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_identifier (db.table) -&amp;gt; s3://bucket/warehouse/db/table/metadata/v123.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the only mutable piece in the entire system. When you commit a write, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new metadata file (immutable)&lt;/li&gt;
&lt;li&gt;Create new manifest files (immutable)&lt;/li&gt;
&lt;li&gt;Create new data files (immutable)&lt;/li&gt;
&lt;li&gt;Update the catalog pointer to point to the new metadata file (atomic CAS operation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the pointer update fails (because another writer already updated it), you retry with conflict detection. This gives you serializable isolation without a database.&lt;/p&gt;
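&lt;p&gt;The commit protocol can be sketched in a few lines. This toy Python version uses an in-memory lock to stand in for the catalog's compare-and-swap; real catalogs get atomicity from a database transaction or a conditional write:&lt;/p&gt;

```python
import threading

class Catalog:
    """Toy catalog: a single mutable pointer with compare-and-swap semantics."""
    def __init__(self, initial):
        self._pointer = initial
        self._lock = threading.Lock()

    def current(self):
        return self._pointer

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._pointer == expected:
                self._pointer = new
                return True
            return False

def commit(catalog, write_metadata, max_retries=5):
    """Optimistic commit: write immutable metadata, then CAS the pointer; retry on conflict."""
    for attempt in range(max_retries):
        base = catalog.current()
        new_metadata = write_metadata(base)   # new immutable metadata file
        if catalog.compare_and_swap(base, new_metadata):
            return new_metadata
        # another writer won the race: re-read the pointer and retry
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.json")
print(commit(cat, lambda base: base + "+append"))  # v1.json+append
```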

&lt;h2&gt;
  
  
  Why This Matters: Three Critical Properties
&lt;/h2&gt;

&lt;p&gt;This five-layer hierarchy gives you three things that traditional data lakes don't have:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Atomicity Without a Database
&lt;/h3&gt;

&lt;p&gt;Traditional approach: Write files to S3, update the metastore database. If the database update fails, you have orphaned files. If the application crashes mid-write, the metastore is inconsistent.&lt;/p&gt;

&lt;p&gt;Iceberg approach: Write everything (metadata files, manifests, data files) to immutable storage. The only atomic operation is the catalog pointer update: a compare-and-swap on a key-value store or database, or an atomic rename on a filesystem such as HDFS that supports it (plain S3 has no atomic rename, which is why S3 deployments rely on a catalog service for the swap). If that fails, the table state is unchanged; any files already written are simply unreferenced and can be garbage-collected later.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Schema Evolution Without Rewrites
&lt;/h3&gt;

&lt;p&gt;With Hive tables, columns are identified by position or name. Rename a column and old data files no longer match it; reorder or drop columns and position-based readers can silently return the wrong values. Evolving the schema safely often means rewriting the existing data files.&lt;/p&gt;

&lt;p&gt;Iceberg uses field IDs instead of names or positions. Column 1 is always "event_id" internally, even if you rename the external column or reorder columns. Old data files don't change. New data files use the new schema. Queries automatically reconcile both.&lt;/p&gt;

&lt;p&gt;Example: You have a table with schema &lt;code&gt;(id, user_id, event_time)&lt;/code&gt;. You want to add a &lt;code&gt;source&lt;/code&gt; column:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old data files don't have &lt;code&gt;source&lt;/code&gt; (field ID 4)&lt;/li&gt;
&lt;li&gt;New data files do have &lt;code&gt;source&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Iceberg handles missing columns transparently (old files are read with &lt;code&gt;source&lt;/code&gt; as NULL)&lt;/li&gt;
&lt;li&gt;No rewrites&lt;/li&gt;
&lt;/ul&gt;
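
&lt;p&gt;A few lines of Python can sketch the field-ID mechanism (an illustrative model, not Iceberg's actual API; all names here are hypothetical):&lt;/p&gt;

```python
# Sketch: field-ID based column resolution (illustrative, not Iceberg's API).
# Each data file stores values keyed by field ID; the current schema maps
# column names to IDs. Files written before a column existed simply lack
# that ID, and the reader fills in None (NULL) without rewriting anything.

CURRENT_SCHEMA = {"id": 1, "user_id": 2, "event_time": 3, "source": 4}

# An old file written with the 3-column schema, and a new file with 4 columns.
old_file_row = {1: 101, 2: 7, 3: "2026-01-01T00:00:00Z"}
new_file_row = {1: 102, 2: 9, 3: "2026-04-01T00:00:00Z", 4: "mobile"}

def project(row, schema):
    # Resolve every requested column by field ID; missing IDs become None.
    return {name: row.get(fid) for name, fid in schema.items()}

print(project(old_file_row, CURRENT_SCHEMA)["source"])  # None (NULL fill)
print(project(new_file_row, CURRENT_SCHEMA)["source"])  # mobile
```

&lt;p&gt;Renaming a column only changes the name-to-ID mapping; the data files themselves never change.&lt;/p&gt;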

&lt;h3&gt;
  
  
  3. Time Travel and Snapshot Isolation
&lt;/h3&gt;

&lt;p&gt;Every write creates a new immutable snapshot. Snapshots are never mutated or deleted (unless explicitly expired). This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can query the table as it existed 30 days ago&lt;/li&gt;
&lt;li&gt;Concurrent reads aren't blocked by concurrent writes&lt;/li&gt;
&lt;li&gt;Snapshot expiration is manual (you decide when old snapshots are garbage)&lt;/li&gt;
&lt;/ul&gt;
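
&lt;p&gt;Time travel boils down to picking the right snapshot from the metadata file's snapshot log. A minimal sketch of that selection (an illustrative data model, not Iceberg's API; engines expose this through options such as timestamp-based "as of" queries):&lt;/p&gt;

```python
# Sketch: find the snapshot that was current at a given point in time.
import operator

snapshots = [
    {"snapshot_id": 1, "timestamp_ms": 1_700_000_000_000},
    {"snapshot_id": 2, "timestamp_ms": 1_702_000_000_000},
    {"snapshot_id": 3, "timestamp_ms": 1_704_000_000_000},
]

def snapshot_as_of(log, ts_ms):
    # Latest snapshot whose commit time is at or before ts_ms.
    eligible = [s for s in log if operator.le(s["timestamp_ms"], ts_ms)]
    return max(eligible, key=lambda s: s["timestamp_ms"]) if eligible else None

print(snapshot_as_of(snapshots, 1_703_000_000_000)["snapshot_id"])  # 2
```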

&lt;h2&gt;
  
  
  Write Modes: Copy-on-Write vs Merge-on-Read
&lt;/h2&gt;

&lt;p&gt;When you delete or update rows, Iceberg gives you two options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Copy-on-Write (CoW)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When you update/delete rows in file X, rewrite the entire file&lt;/li&gt;
&lt;li&gt;Pros: Readers always see clean data files, no performance penalty&lt;/li&gt;
&lt;li&gt;Cons: Updates are expensive (rewrite entire files)&lt;/li&gt;
&lt;li&gt;Best for: Read-heavy workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Merge-on-Read (MoR)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When you update/delete rows, write a separate delete file (position delete or equality delete)&lt;/li&gt;
&lt;li&gt;Pros: Updates are fast (just write a small delete file)&lt;/li&gt;
&lt;li&gt;Cons: Readers must merge data files + delete files, slight read penalty&lt;/li&gt;
&lt;li&gt;Best for: Write-heavy workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Iceberg v2 spec introduced position deletes (deleted by row position) and equality deletes (deleted by column value). This gives engines flexibility in how they reconcile deletes.&lt;/p&gt;
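&lt;p&gt;A merge-on-read scan with position deletes can be sketched like this (an illustrative model of the idea, not the actual file format):&lt;/p&gt;

```python
# Sketch: merge-on-read with position deletes. A position delete file lists
# (data_file_path, row_position) pairs; the reader skips those rows while
# scanning the untouched data file.

data_file = {
    "path": "s3://bucket/data/f1.parquet",
    "rows": ["row0", "row1", "row2", "row3"],
}
position_deletes = {("s3://bucket/data/f1.parquet", 1),
                    ("s3://bucket/data/f1.parquet", 3)}

def scan_with_deletes(f, deletes):
    # Yield only rows whose (path, position) is not marked deleted.
    return [row for pos, row in enumerate(f["rows"])
            if (f["path"], pos) not in deletes]

print(scan_with_deletes(data_file, position_deletes))  # ['row0', 'row2']
```

&lt;p&gt;The update itself only wrote the small delete set; the cost of reconciling it is paid at read time, which is exactly the MoR trade-off.&lt;/p&gt;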

&lt;h2&gt;
  
  
  Multi-Engine Interoperability
&lt;/h2&gt;

&lt;p&gt;Here's what's remarkable: Spark, Trino, Flink, Presto, Hive, and Impala all read the same metadata format. The spec is engine-agnostic. A data engineer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write data with Spark&lt;/li&gt;
&lt;li&gt;Query it with Trino&lt;/li&gt;
&lt;li&gt;Delete rows with Flink&lt;/li&gt;
&lt;li&gt;Time travel with Presto&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...all on the same table, without format conversions.&lt;/p&gt;

&lt;p&gt;This is possible because Iceberg separates the format spec from the execution engine. The metadata hierarchy is standardized. Query engines just implement readers for that standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimistic Concurrency Control
&lt;/h2&gt;

&lt;p&gt;With multiple writers, how does Iceberg prevent conflicts? Via optimistic concurrency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writer A reads the current metadata file&lt;/li&gt;
&lt;li&gt;Writer B reads the current metadata file&lt;/li&gt;
&lt;li&gt;Writer A finishes its changes, creates a new metadata file, tries to update the catalog pointer&lt;/li&gt;
&lt;li&gt;Update succeeds (CAS operation)&lt;/li&gt;
&lt;li&gt;Writer B finishes its changes, creates a new metadata file, tries to update the catalog pointer&lt;/li&gt;
&lt;li&gt;Update fails (pointer no longer points to the metadata file Writer B read)&lt;/li&gt;
&lt;li&gt;Writer B detects the conflict, re-reads the current metadata, recomputes its changes, and retries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives you serializable isolation. Writers don't block; they just retry on conflicts. For most workloads (few concurrent writers), conflicts are rare. For high-concurrency scenarios, you might need more sophisticated conflict resolution, but the default retry mechanism is sound.&lt;/p&gt;
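
&lt;p&gt;The commit-and-retry loop can be sketched as follows (a toy model: a lock stands in for the catalog store's atomic compare-and-swap, and the function names are hypothetical):&lt;/p&gt;

```python
# Sketch: optimistic commit via compare-and-swap on the catalog pointer.
# Real catalogs do this against DynamoDB, a JDBC database, Nessie, etc.
import threading

class Catalog:
    def __init__(self, pointer):
        self.pointer = pointer
        self._lock = threading.Lock()  # stands in for the store's atomic CAS

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.pointer == expected:
                self.pointer = new
                return True
            return False

def commit(catalog, build_metadata, max_retries=5):
    for _ in range(max_retries):
        base = catalog.pointer            # read the current metadata pointer
        new_meta = build_metadata(base)   # write new immutable files
        if catalog.compare_and_swap(base, new_meta):
            return new_meta               # commit succeeded atomically
        # Conflict: another writer committed first; re-read and retry.
    raise RuntimeError("too many conflicting commits")

cat = Catalog("v122.json")
print(commit(cat, lambda base: "v123.json"))  # v123.json
```

&lt;p&gt;Note that everything except the pointer swap is conflict-free by construction, because all the files written are immutable.&lt;/p&gt;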

&lt;h2&gt;
  
  
  Partition Evolution: Change Partitioning Without Rewriting
&lt;/h2&gt;

&lt;p&gt;Suppose you initially partitioned by &lt;code&gt;month(event_time)&lt;/code&gt;, and now you want to partition by &lt;code&gt;day(event_time)&lt;/code&gt; for better file pruning.&lt;/p&gt;

&lt;p&gt;Traditional approach: Rewrite the entire table with the new partition scheme.&lt;/p&gt;

&lt;p&gt;Iceberg approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old snapshots keep their old partition scheme&lt;/li&gt;
&lt;li&gt;New writes use the new partition scheme&lt;/li&gt;
&lt;li&gt;Manifest files track partition spec ID per file&lt;/li&gt;
&lt;li&gt;Queries automatically handle mixed partition layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No rewrites needed. This is huge for large tables where rewriting would take hours.&lt;/p&gt;
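
&lt;p&gt;Here is a sketch of how a planner can prune files across mixed partition specs (illustrative only; real manifests store transform definitions, not Python lambdas):&lt;/p&gt;

```python
# Sketch: planning over mixed partition specs. Each data file records the
# partition spec it was written under, so one table can hold month-partitioned
# old files and day-partitioned new files at the same time.

SPECS = {
    1: ("month", lambda ts: ts[:7]),    # e.g. "2026-04"
    2: ("day",   lambda ts: ts[:10]),   # e.g. "2026-04-15"
}

files = [
    {"path": "f1", "spec_id": 1, "partition": "2026-03"},
    {"path": "f2", "spec_id": 2, "partition": "2026-04-15"},
    {"path": "f3", "spec_id": 2, "partition": "2026-04-16"},
]

def prune(candidates, query_ts):
    # Apply each file's own spec transform to the query value, then compare.
    keep = []
    for f in candidates:
        _, transform = SPECS[f["spec_id"]]
        if f["partition"] == transform(query_ts):
            keep.append(f["path"])
    return keep

print(prune(files, "2026-04-15T10:00:00"))  # ['f2']
```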

&lt;h2&gt;
  
  
  Conclusion: Why This Architecture Wins
&lt;/h2&gt;

&lt;p&gt;Iceberg's metadata hierarchy achieves something remarkable: &lt;strong&gt;ACID guarantees on immutable cloud storage&lt;/strong&gt;, without sacrificing performance or requiring a centralized database.&lt;/p&gt;

&lt;p&gt;The design principles are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Correctness over performance&lt;/strong&gt; - atomic commits matter more than throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutability&lt;/strong&gt; - makes caching, parallelism, and disaster recovery trivial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt; - every snapshot is preserved, enabling time travel and rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine-agnostic&lt;/strong&gt; - the spec is open, allowing diverse tools to interoperate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File-level granularity&lt;/strong&gt; - statistics are recorded per file, enabling efficient pruning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building a data platform or migrating from Hive/Delta, understand this architecture. It's not just a file format; it's a rethinking of how to manage massive datasets reliably.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to dive deeper?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/apache/iceberg" rel="noopener noreferrer"&gt;https://github.com/apache/iceberg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Specification: &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;https://iceberg.apache.org/spec/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Java API: &lt;code&gt;org.apache.iceberg&lt;/code&gt; package in the repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm Prithvi S, a Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Stop Guessing on Search Tuning: Using OpenSearch Search Relevance to Improve Results</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:05:53 +0000</pubDate>
      <link>https://forem.com/iprithv/stop-guessing-on-search-tuning-using-opensearch-search-relevance-to-improve-results-2f0j</link>
      <guid>https://forem.com/iprithv/stop-guessing-on-search-tuning-using-opensearch-search-relevance-to-improve-results-2f0j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Search quality problems cost users and revenue. OpenSearch Search Relevance lets you measure exactly what's broken, iterate on fixes, and prove improvement with metrics. This guide walks you through a real-world search tuning workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Search That Works for You, Not Your Users
&lt;/h2&gt;

&lt;p&gt;You've built a solid search system. Elasticsearch. OpenSearch. Full-text search running smoothly. But then you hear it:&lt;/p&gt;

&lt;p&gt;"Why can't I find what I'm looking for?"&lt;br&gt;
"Your search results are terrible."&lt;br&gt;
"I have to refine my query five times to get what I need."&lt;/p&gt;

&lt;p&gt;This is the gap between &lt;em&gt;search that works&lt;/em&gt; and &lt;em&gt;search that matters&lt;/em&gt;. Your system might be technically sound, but the relevance—the quality of what you return—is broken. And here's the problem: you can't improve what you can't measure.&lt;/p&gt;

&lt;p&gt;Most teams guess. They tweak analyzers, adjust boost factors, shuffle query logic, deploy, and hope. Sometimes it helps. Sometimes it makes things worse. Nobody knows because they're not measuring.&lt;/p&gt;

&lt;p&gt;This is where OpenSearch Search Relevance comes in.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is OpenSearch Search Relevance?
&lt;/h2&gt;

&lt;p&gt;OpenSearch Search Relevance is a plugin ecosystem that turns search tuning from guesswork into science. It does three key things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Captures ground truth:&lt;/strong&gt; You build a query set—representative questions your users actually ask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs experiments:&lt;/strong&gt; You configure multiple search strategies and compare them side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computes metrics:&lt;/strong&gt; You get nDCG, precision, recall, and MRR scores—the same metrics information retrieval researchers use&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result: you know exactly what's working, what isn't, and by how much.&lt;/p&gt;
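
&lt;p&gt;To make the metrics concrete, here is the standard nDCG@k computation on graded judgments (the plugin computes this for you; the sketch just shows the math, with made-up grades):&lt;/p&gt;

```python
# Sketch: nDCG@k from graded relevance judgments (standard IR formula).
import math

def dcg(grades):
    # Discounted cumulative gain: each grade discounted by log2(rank + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = sorted(ranked_grades, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / denom if denom else 0.0

# Grades of the top results actually returned for one query (3=perfect, 0=bad)
returned = [1, 0, 3, 2, 0]
print(round(ndcg_at_k(returned), 2))  # 0.71
```

&lt;p&gt;A perfect ranking scores 1.0; the gap below 1.0 is exactly how much your ordering differs from the ideal one.&lt;/p&gt;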


&lt;h2&gt;
  
  
  A Real-World Scenario: E-commerce Search Gone Wrong
&lt;/h2&gt;

&lt;p&gt;Let's say you're an e-commerce platform. Your search is powered by OpenSearch. Basic setup: BM25 scoring, standard analyzer, some field boosting. Reasonable. But your metrics show a problem:&lt;/p&gt;

&lt;p&gt;Users searching "comfortable running shoes" get back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hiking boots (nope)&lt;/li&gt;
&lt;li&gt;Dress shoes on sale (nope)&lt;/li&gt;
&lt;li&gt;A few running shoes at the bottom of page 2 (finally!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your nDCG score at position 10 is 0.42. That's bad. Users are leaving frustrated.&lt;/p&gt;

&lt;p&gt;The question: what's broken? The analyzer? The query type? The field weights? Without measurement, you're shooting in the dark.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Build Your Query Set
&lt;/h2&gt;

&lt;p&gt;First, you collect representative queries. Not made-up queries—real questions from your search logs, support tickets, user research. For the e-commerce example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"comfortable running shoes"&lt;/li&gt;
&lt;li&gt;"women's waterproof winter boots"&lt;/li&gt;
&lt;li&gt;"lightweight hiking shoes under $100"&lt;/li&gt;
&lt;li&gt;"best cross-training footwear"&lt;/li&gt;
&lt;li&gt;"slip-on shoes for office"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You grade the relevance of top results for each query. The grading is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grade 3: Perfect match (what you wanted to buy)&lt;/li&gt;
&lt;li&gt;Grade 2: Good match (acceptable alternative)&lt;/li&gt;
&lt;li&gt;Grade 1: Poor match (tangentially related)&lt;/li&gt;
&lt;li&gt;Grade 0: Irrelevant (why is this here?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes your ground truth. This is what good search looks like for your domain.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 2: Set Up Your Baseline Search Configuration
&lt;/h2&gt;

&lt;p&gt;You define how search works today. For our e-commerce example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Query:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bool:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;must:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;multi_match:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;query:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;user_query&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;fields:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"title^2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;filter:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;term:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"is_active"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Analyzer:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tokenization,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lowercase)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your baseline. We'll measure it, then try to beat it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Run an Experiment
&lt;/h2&gt;

&lt;p&gt;Now you hypothesize: "The standard analyzer is losing words. If we use a synonym-aware analyzer and boost title matches more, we'll rank better."&lt;/p&gt;

&lt;p&gt;You create a second search configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Query:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bool:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;must:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;multi_match:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;query:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;user_query&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;fields:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"title^3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tags^2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;filter:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;term:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"is_active"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Analyzer:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom_with_synonyms"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Tokenizer:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;standard&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Filters:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lowercase,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;stop_words,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;synonym&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(running/jogging/athletic)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You run a PAIRWISE_COMPARISON experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take your query set&lt;/li&gt;
&lt;li&gt;Execute each query with both configurations&lt;/li&gt;
&lt;li&gt;Present results side by side to human evaluators (or use implicit signals like CTR)&lt;/li&gt;
&lt;li&gt;Compute metrics for each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline (standard analyzer):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nDCG@10: 0.42&lt;/li&gt;
&lt;li&gt;Precision@10: 0.35&lt;/li&gt;
&lt;li&gt;Recall: 0.48&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Variant (synonym-aware, title boost):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nDCG@10: 0.68&lt;/li&gt;
&lt;li&gt;Precision@10: 0.58&lt;/li&gt;
&lt;li&gt;Recall: 0.71&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a 62% improvement in nDCG. The variant wins decisively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Iterate
&lt;/h2&gt;

&lt;p&gt;One win doesn't mean you're done. You run another experiment:&lt;/p&gt;

&lt;p&gt;"What if we add a custom BM25 parameter tuning? Default is k1=1.2, b=0.75. We have short product titles—maybe b=0.5 would work better (less impact from field length)."&lt;/p&gt;

&lt;p&gt;You create variant #3, measure it, and compare all three. Now you have data-driven evidence, not hunches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You stop guessing.&lt;/strong&gt; Every change is measured against your ground truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You build confidence.&lt;/strong&gt; When nDCG goes from 0.42 to 0.68, you're not wondering if you broke something—you know you improved it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You compound gains.&lt;/strong&gt; Each iteration may be small, but over months, small improvements stack: ten consecutive 10% wins compound to roughly a 2.6x total improvement (1.1^10 ≈ 2.59).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You communicate value.&lt;/strong&gt; When leadership asks "Did the search redesign help?", you show metrics, not opinions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You catch regressions.&lt;/strong&gt; New feature breaks relevance? Your metrics catch it before users do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Implementation: Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A recent OpenSearch cluster with the search-relevance plugin installed&lt;/li&gt;
&lt;li&gt;OpenSearch Dashboards with dashboards-search-relevance plugin&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a query set&lt;/strong&gt; via the UI or API&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST to search-relevance query set endpoint&lt;/li&gt;
&lt;li&gt;Provide queries + graded judgments&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define search configurations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store in index templates or as JSON documents&lt;/li&gt;
&lt;li&gt;Configurations are just OpenSearch queries + analyzer settings&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an experiment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specify baseline vs variant(s)&lt;/li&gt;
&lt;li&gt;Set experiment type (PAIRWISE_COMPARISON, etc.)&lt;/li&gt;
&lt;li&gt;Run via API or UI&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evaluate results&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard shows nDCG, precision, recall, MRR&lt;/li&gt;
&lt;li&gt;Human evaluators refine judgments (optional)&lt;/li&gt;
&lt;li&gt;Export results for reporting&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example: Creating a Query Set via API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"localhost:9200/.search-relevance-queries/_doc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'{
    "query_set_name": "ecommerce_footwear_q2_2026",
    "queries": [
      {
        "query_text": "comfortable running shoes",
        "judgments": [
          {"doc_id": "shoe_123", "grade": 3},
          {"doc_id": "shoe_456", "grade": 3},
          {"doc_id": "boot_789", "grade": 0}
        ]
      }
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Best Practices for Search Quality Evaluation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Represent your domain.&lt;/strong&gt; Query sets should reflect real user behavior. If 40% of your queries are brand-specific ("Nike Air Max"), weight them accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Grade consistently.&lt;/strong&gt; Define grade rubrics upfront. Grade 3 should mean the same thing across all evaluators. Consider inter-rater agreement checks (Kappa scores).&lt;/p&gt;
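
&lt;p&gt;One way to run that agreement check is Cohen's kappa over two graders' judgments of the same documents (a minimal sketch; the grade lists below are made up):&lt;/p&gt;

```python
# Sketch: Cohen's kappa for two graders judging the same ranked documents.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of documents where both grades match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's grade frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = [3, 2, 0, 3, 1, 0, 2, 3]  # grader A's judgments
b = [3, 2, 1, 3, 1, 0, 2, 2]  # grader B's judgments
print(round(cohens_kappa(a, b), 2))  # 0.67
```

&lt;p&gt;Values near 1.0 indicate strong agreement; if kappa is low, tighten the rubric before trusting the judgments.&lt;/p&gt;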

&lt;p&gt;&lt;strong&gt;3. Start with quick wins.&lt;/strong&gt; Don't boil the ocean. Fix analyzer issues, obvious field weight problems, missing synonyms. You'll see 20-30% gains fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Measure multiple metrics.&lt;/strong&gt; nDCG is great, but also track precision (false positives matter) and recall (missed results matter). Together they tell the full story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. A/B test in production.&lt;/strong&gt; Once confident in experiments, shadow your baseline for a week. Real user behavior &amp;gt; offline metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Monitor over time.&lt;/strong&gt; As your catalog changes, re-evaluate. New product types? Seasonal shifts? Your query set may need updates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overfitting to your query set.&lt;/strong&gt; If you tune only to 20 queries, you might break search for the other 980 query types you haven't measured.&lt;br&gt;
&lt;em&gt;Fix: Expand your query set regularly. Aim for 100+ representative queries.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring search latency.&lt;/strong&gt; You improved nDCG by 10%, but query time went from 50ms to 500ms. Users see slower search as worse search.&lt;br&gt;
&lt;em&gt;Fix: Track latency alongside relevance metrics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting about cold starts.&lt;/strong&gt; Your new analyzer is great, but what about rare queries with few matches? Relevance breaks down at the edges.&lt;br&gt;
&lt;em&gt;Fix: Define fallback strategies. What happens when your perfect query gets zero results?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not evaluating at scale.&lt;/strong&gt; Your query set works, but only 30% of real user queries are covered. The other 70% are long-tail.&lt;br&gt;
&lt;em&gt;Fix: Use implicit signals (CTR, dwell time) to sample long-tail queries.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start small.&lt;/strong&gt; Pick 50 queries. Grade top-10 results. Measure baseline nDCG.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hypothesize.&lt;/strong&gt; What analyzer issue, field weight problem, or query logic gap would improve things?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experiment.&lt;/strong&gt; Create one variant. Run the experiment. Compare metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate.&lt;/strong&gt; Keep the winner. Try the next improvement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale.&lt;/strong&gt; Once confident, expand your query set and refine your configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Search quality problems are hidden until you measure them. OpenSearch Search Relevance gives you the tools to turn search tuning from guesswork into data-driven iteration.&lt;/p&gt;

&lt;p&gt;Your users asked for better search. Now you can prove you delivered.&lt;/p&gt;




&lt;p&gt;I'm Prithvi S, a Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>search</category>
      <category>database</category>
      <category>data</category>
    </item>
    <item>
      <title>Beyond Keywords: Unpacking Elasticsearch's Inverted Index for Sub-Millisecond Search</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Fri, 10 Apr 2026 07:06:20 +0000</pubDate>
      <link>https://forem.com/iprithv/beyond-keywords-unpacking-elasticsearchs-inverted-index-for-sub-millisecond-search-30ji</link>
      <guid>https://forem.com/iprithv/beyond-keywords-unpacking-elasticsearchs-inverted-index-for-sub-millisecond-search-30ji</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The Need for Speed in Search
&lt;/h2&gt;

&lt;p&gt;In today's data-driven world, users expect instant access to information. Whether it's finding a product on an e-commerce site, searching through legal documents, or sifting through vast logs for an anomaly, every millisecond counts. Traditional relational databases, while excellent for structured queries and transactional integrity, often fall short when it comes to the complex, full-text search requirements of modern applications.&lt;/p&gt;

&lt;p&gt;This is where Elasticsearch shines. At its core, Elasticsearch is a distributed, RESTful search and analytics engine capable of tackling billions of documents and returning results in sub-millisecond times. But what's the magic behind this speed? The answer lies in its foundational data structure: the Inverted Index.&lt;/p&gt;

&lt;p&gt;In this deep dive, we'll peel back the layers of Elasticsearch to understand the inverted index: what it is, how it works, and why it's the cornerstone of Elasticsearch's remarkable performance. We'll also cover how it differs from traditional database indexing and why it's crucial for achieving lightning-fast, highly relevant search experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Traditional Database Search
&lt;/h2&gt;

&lt;p&gt;Consider a typical relational database table, perhaps storing articles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;The Quick Brown Fox&lt;/td&gt;
&lt;td&gt;The quick brown fox jumps over the lazy dog.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lazy Dog's Adventures&lt;/td&gt;
&lt;td&gt;The lazy dog enjoys chasing squirrels in the park.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Squirrels and Foxes&lt;/td&gt;
&lt;td&gt;Foxes and squirrels often cross paths in the wild.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you wanted to find articles containing the word "fox," you might write a SQL query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%fox%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query, while simple, is problematic for large datasets. The &lt;code&gt;LIKE '%fox%'&lt;/code&gt; clause with a leading wildcard prevents the database from using a standard B-tree index efficiently. The database would likely resort to a full table scan, reading every row and checking the &lt;code&gt;content&lt;/code&gt; column, which is incredibly slow for millions of records. Even with full-text search extensions built into some databases, they often struggle with the scale and feature set that dedicated search engines offer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the Inverted Index: A Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;Instead of mapping records to the words they contain (like a traditional index), an inverted index flips this relationship. It maps &lt;em&gt;words&lt;/em&gt; to the &lt;em&gt;documents&lt;/em&gt; (or articles, in our example) that contain them.&lt;/p&gt;

&lt;p&gt;Let's take our example sentences and build a simple inverted index. First, we break down each document into individual terms, a process called tokenization, and normalize them (e.g., lowercase).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document 1:&lt;/strong&gt; "The quick brown fox jumps over the lazy dog."&lt;br&gt;
Tokens: &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;quick&lt;/code&gt;, &lt;code&gt;brown&lt;/code&gt;, &lt;code&gt;fox&lt;/code&gt;, &lt;code&gt;jumps&lt;/code&gt;, &lt;code&gt;over&lt;/code&gt;, &lt;code&gt;lazy&lt;/code&gt;, &lt;code&gt;dog&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document 2:&lt;/strong&gt; "The lazy dog enjoys chasing squirrels in the park."&lt;br&gt;
Tokens: &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;lazy&lt;/code&gt;, &lt;code&gt;dog&lt;/code&gt;, &lt;code&gt;enjoys&lt;/code&gt;, &lt;code&gt;chasing&lt;/code&gt;, &lt;code&gt;squirrels&lt;/code&gt;, &lt;code&gt;in&lt;/code&gt;, &lt;code&gt;park&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document 3:&lt;/strong&gt; "Foxes and squirrels often cross paths in the wild."&lt;br&gt;
Tokens: &lt;code&gt;foxes&lt;/code&gt;, &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;squirrels&lt;/code&gt;, &lt;code&gt;often&lt;/code&gt;, &lt;code&gt;cross&lt;/code&gt;, &lt;code&gt;paths&lt;/code&gt;, &lt;code&gt;in&lt;/code&gt;, &lt;code&gt;the&lt;/code&gt;, &lt;code&gt;wild&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let's construct the inverted index:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Document List (Posting List)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;and&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;brown&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chasing&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cross&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dog&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;enjoys&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fox&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;foxes&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;in&lt;/td&gt;
&lt;td&gt;2, 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jumps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lazy&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;often&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;over&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;park&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;paths&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quick&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;squirrels&lt;/td&gt;
&lt;td&gt;2, 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;the&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wild&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Each unique term is listed once.&lt;/li&gt;
&lt;li&gt;  Next to each term is a "posting list" – a list of document IDs where that term appears.&lt;/li&gt;
&lt;li&gt;  Terms like "fox" and "foxes" are treated separately here, but in a real-world Elasticsearch setup, an analyzer could stem them to a common root (&lt;code&gt;fox&lt;/code&gt;) if desired.&lt;/li&gt;
&lt;/ul&gt;
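&lt;p&gt;The construction above can be sketched in a few lines of Python (a toy model for illustration, not how Lucene actually stores its index):&lt;/p&gt;

```python
import re
from collections import defaultdict

docs = {
    1: "The quick brown fox jumps over the lazy dog.",
    2: "The lazy dog enjoys chasing squirrels in the park.",
    3: "Foxes and squirrels often cross paths in the wild.",
}

def tokenize(text):
    # Lowercase and split on non-letters: a crude stand-in for an analyzer.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

# Map each term to the set of document IDs containing it (the posting list).
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted[term].add(doc_id)

print(sorted(inverted["dog"]))        # [1, 2]
print(sorted(inverted["squirrels"]))  # [2, 3]
```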
&lt;h2&gt;
  
  
  How Elasticsearch Uses the Inverted Index for Speed
&lt;/h2&gt;

&lt;p&gt;When you query Elasticsearch for "fox," it doesn't scan documents. Instead, it looks the term "fox" up in the inverted index's term dictionary - a fast, near-constant-time operation - and reads off the posting list, which immediately returns &lt;code&gt;Document 1&lt;/code&gt;. If you searched for "lazy dog," it would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Look up "lazy" → &lt;code&gt;[1, 2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Look up "dog" → &lt;code&gt;[1, 2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Perform an intersection of these two posting lists: &lt;code&gt;[1, 2] INTERSECT [1, 2] = [1, 2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; The result is &lt;code&gt;Document 1&lt;/code&gt; and &lt;code&gt;Document 2&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fast because posting lists are sorted by document ID, so two lists can be intersected with a single linear-time merge.&lt;/p&gt;
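&lt;p&gt;The merge behind that intersection looks roughly like this (a simplified sketch; Lucene's real posting lists are compressed and support skip-ahead):&lt;/p&gt;

```python
def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Posting lists for "lazy" and "dog" from the table above.
print(intersect([1, 2], [1, 2]))  # [1, 2]
```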
&lt;h3&gt;
  
  
  Beyond Simple Lookups: Position and Frequency
&lt;/h3&gt;

&lt;p&gt;Real-world search is more complex than just knowing if a word exists in a document. We need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How often a word appears (term frequency)&lt;/li&gt;
&lt;li&gt;  Where it appears (position)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, the inverted index stores more than just document IDs in its posting lists. It also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Term Frequency (TF)&lt;/strong&gt;: How many times a term appears in a document. This is crucial for relevance scoring (e.g., a document mentioning "Elasticsearch" ten times is likely more relevant than one mentioning it once).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Term Position&lt;/strong&gt;: The positions within the document where the term occurs. This is vital for phrase queries ("quick brown fox") and proximity searches (words appearing near each other).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's enhance our inverted index fragment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Document (ID, TF, Positions)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fox&lt;/td&gt;
&lt;td&gt;1: (TF=1, Positions: [3])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lazy&lt;/td&gt;
&lt;td&gt;1: (TF=1, Positions: [7]), 2: (TF=1, Positions: [1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dog&lt;/td&gt;
&lt;td&gt;1: (TF=1, Positions: [8]), 2: (TF=1, Positions: [2])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quick&lt;/td&gt;
&lt;td&gt;1: (TF=1, Positions: [1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;brown&lt;/td&gt;
&lt;td&gt;1: (TF=1, Positions: [2])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, if we search for the phrase "quick brown fox":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Elasticsearch finds documents containing "quick" at position &lt;code&gt;X&lt;/code&gt;, "brown" at &lt;code&gt;X+1&lt;/code&gt;, and "fox" at &lt;code&gt;X+2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; It quickly identifies Document 1 as a match because "quick" is at 1, "brown" at 2, and "fox" at 3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This additional metadata, while increasing the storage footprint of the index itself, drastically improves the speed and accuracy of complex full-text searches.&lt;/p&gt;
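&lt;p&gt;The position check can be sketched as follows (a toy illustration using only the three terms from the fragment above; real phrase queries also handle slop and multi-valued fields):&lt;/p&gt;

```python
# Positions of each term per document, taken from the enhanced index fragment.
postings = {
    "quick": {1: [1]},
    "brown": {1: [2]},
    "fox":   {1: [3]},
}

def phrase_match(terms, postings):
    """Return doc IDs where the terms appear at consecutive positions."""
    # Candidate documents must contain every term in the phrase.
    common = set.intersection(*(set(postings[t]) for t in terms))
    hits = []
    for doc in common:
        starts = postings[terms[0]][doc]
        # For some start position p, term k must sit at position p + k.
        if any(all(p + k in postings[terms[k]][doc] for k in range(1, len(terms)))
               for p in starts):
            hits.append(doc)
    return sorted(hits)

print(phrase_match(["quick", "brown", "fox"], postings))  # [1]
```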
&lt;h2&gt;
  
  
  The Indexing Pipeline: From Raw Text to Inverted Index
&lt;/h2&gt;

&lt;p&gt;How does raw text transform into this optimized inverted index? Elasticsearch uses a sophisticated indexing pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Document Arrival&lt;/strong&gt;: A new document (e.g., a JSON object) is sent to an Elasticsearch coordinating node.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Shard Routing&lt;/strong&gt;: The coordinating node determines which primary shard the document belongs to. This is typically done using a hash of the document's ID. For example, &lt;code&gt;shard = hash(document_id) % num_primary_shards&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Write Buffer&lt;/strong&gt;: The document is temporarily stored in an in-memory buffer on the data node hosting the target primary shard.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Analysis&lt;/strong&gt;: The document's text fields undergo an "analysis" process. This is where text is transformed:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Character Filters&lt;/strong&gt;: Clean up text (e.g., remove HTML tags, convert special characters).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tokenizer&lt;/strong&gt;: Breaks the text into individual terms (tokens). The "standard" tokenizer splits on whitespace and punctuation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Token Filters&lt;/strong&gt;: Process tokens (e.g., lowercase them, remove stop words like "the", apply stemming to reduce words to their root form like "fishing" -&amp;gt; "fish", handle synonyms, etc.).
This analyzed output is what forms the terms in the inverted index.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Segment Creation&lt;/strong&gt;: Periodically (every 1 second by default), the contents of the in-memory write buffer are written out as a new "segment" - a mini inverted index that is immutable once written. This is the point at which the data becomes searchable, and the process is called a "refresh."&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /my-index/_doc
{ "title": "Elasticsearch is fast" }
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;This single document hits the write buffer, gets analyzed (e.g., "elasticsearch", "fast"), and then after a refresh, these terms appear in a new segment.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;: For high availability and read scalability, Elasticsearch replicates the document to its replica shards.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit&lt;/strong&gt;: To ensure durability, segments are periodically "flushed." This involves writing all in-memory segments to a new commit point on disk and creating a commit file that lists all known segments. The transaction log (translog) is also flushed, ensuring that even if the system crashes, no data is lost between flushes.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Merging&lt;/strong&gt;: Over time, many small segments accumulate. Searching across numerous small segments can be less efficient. Elasticsearch intelligently merges these smaller segments into larger ones in the background. This process is resource-intensive but crucial for maintaining search performance.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;
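&lt;p&gt;The analysis step (step 4) can be pictured as a small function pipeline. This toy version chains a character filter, a tokenizer, and two token filters, roughly in the order Elasticsearch applies them; the stop-word list and regexes are simplified stand-ins for the real analyzers:&lt;/p&gt;

```python
import re

STOP_WORDS = {"the", "a", "an", "in", "of", "and"}

def char_filter(text):
    # Strip HTML tags, as an html_strip character filter would.
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    # Split on whitespace and punctuation, like the standard tokenizer.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def token_filters(tokens):
    # Lowercase, then drop stop words.
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>The</b> quick brown FOX!"))  # ['quick', 'brown', 'fox']
```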

&lt;h2&gt;
  
  
  Importance of Immutability
&lt;/h2&gt;

&lt;p&gt;Each segment in Elasticsearch is immutable. Once written to disk, it cannot be changed. When a document is updated, Elasticsearch doesn't modify the existing segment. Instead, it marks the old document as deleted (logically, not physically) and indexes the new version of the document into a new segment. Deleted documents are eventually removed during segment merging.&lt;/p&gt;

&lt;p&gt;This immutability offers significant advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Concurrency&lt;/strong&gt;: No need for complex locking mechanisms when reading segments, as they never change.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Caching&lt;/strong&gt;: Segments can be aggressively cached by the operating system, improving read performance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reliability&lt;/strong&gt;: Simpler to manage and recover from failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Relevance Scoring: Beyond Just Existence
&lt;/h2&gt;

&lt;p&gt;The inverted index provides the raw material for finding documents, but how does Elasticsearch rank them? This is where relevance scoring comes in, and the inverted index's detailed term information is invaluable.&lt;/p&gt;

&lt;p&gt;Elasticsearch uses a scoring algorithm called &lt;strong&gt;BM25 (Best Match 25)&lt;/strong&gt; by default. While a full explanation of BM25 is beyond this post, it primarily considers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Term Frequency (TF)&lt;/strong&gt;: How often a term appears in a document. More occurrences generally mean higher relevance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Inverse Document Frequency (IDF)&lt;/strong&gt;: How rare a term is across all documents. Rarer terms are more significant.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Field Length&lt;/strong&gt;: Documents where a term appears in a shorter field (e.g., title) are often considered more relevant than if it appears in a very long field.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inverted index provides the Term Frequency and Term Position directly. The IDF can be calculated across all segments. By combining these factors, BM25 assigns a score to each matching document, allowing Elasticsearch to return the most relevant results first.&lt;/p&gt;
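&lt;p&gt;To make those three factors concrete, here is a bare-bones BM25 scorer for a single term (the textbook formula with the usual defaults k1=1.2 and b=0.75; Lucene's production implementation adds refinements on top of this):&lt;/p&gt;

```python
import math

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total number of documents in the index
    doc_len / avg_doc_len: field length normalization inputs
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare term (df=1) outscores a common one (df=900) at the same tf.
rare = bm25_score(tf=2, df=1, n_docs=1000, doc_len=100, avg_doc_len=120)
common = bm25_score(tf=2, df=900, n_docs=1000, doc_len=100, avg_doc_len=120)
print(rare > common)  # True
```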

&lt;h2&gt;
  
  
  Conclusion: The Unsung Hero of Fast Search
&lt;/h2&gt;

&lt;p&gt;The inverted index is more than just an internal detail; it's fundamental to understanding why Elasticsearch excels at search. It's a clever reversal of traditional indexing, transforming the challenge of full-text search into an efficient lookup and set intersection problem.&lt;/p&gt;

&lt;p&gt;From its role in fast lookups and phrase queries to providing the necessary statistics for sophisticated relevance scoring, the inverted index empowers Elasticsearch to deliver the millisecond-level search experiences that users now demand. By understanding this core component, you gain valuable insight into how to design, optimize, and troubleshoot your Elasticsearch deployments for peak performance and accuracy.&lt;/p&gt;

&lt;p&gt;So next time you perform a lightning-fast search, take a moment to appreciate the unsung hero working tirelessly beneath the surface: the mighty inverted index.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>search</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>How Apache Polaris Vends Credentials: Securing Data Access Without Sharing Keys</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Thu, 09 Apr 2026 09:10:24 +0000</pubDate>
      <link>https://forem.com/iprithv/how-apache-polaris-vends-credentials-securing-data-access-without-sharing-keys-156i</link>
      <guid>https://forem.com/iprithv/how-apache-polaris-vends-credentials-securing-data-access-without-sharing-keys-156i</guid>
      <description>&lt;p&gt;The modern data warehouse demands a fundamental shift in how we think about access control. When you build multi-tenant systems at scale, the traditional approach - distributing long-lived API keys or database credentials - becomes a security nightmare. Apache Polaris solves this elegantly: vend temporary, scoped credentials on demand, revoke instantly, audit everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Why Long-Lived Credentials Don't Scale
&lt;/h2&gt;

&lt;p&gt;At Netflix, Cloudera, or any major data platform, you're managing access across hundreds of users, services, and applications. If you hand out permanent API keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Revocation is impossible&lt;/strong&gt; - a compromised key stays valid until you manually rotate it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trails are fuzzy&lt;/strong&gt; - you don't know which key accessed which data when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance is painful&lt;/strong&gt; - SOC2, HIPAA, and PCI-DSS demand time-limited, traceable access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key rotation is a nightmare&lt;/strong&gt; - updating thousands of clients, coordinating across teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope is too broad&lt;/strong&gt; - a key that works today still works tomorrow, even if access should have expired&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why cloud providers moved away from permanent credentials. AWS uses temporary STS tokens. GCP uses short-lived access tokens. Azure has managed identities. The pattern is clear: trust should be ephemeral, scoped, and revocable.&lt;/p&gt;

&lt;p&gt;Polaris applies this principle to data catalogs and table access.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Polaris Vends Credentials
&lt;/h2&gt;

&lt;p&gt;Polaris is an open-source, REST-first Iceberg catalog that implements the Iceberg REST API. Unlike traditional catalogs (which require direct database access or assume long-lived credentials), Polaris mints temporary credentials on every access.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Authorization Check (Who Are You? What Can You Do?)
&lt;/h3&gt;

&lt;p&gt;When a client requests data access, Polaris first checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this principal (user/service) authenticated?&lt;/li&gt;
&lt;li&gt;Do they have a role with permission to access this table?&lt;/li&gt;
&lt;li&gt;Is the access read-only or read-write?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This uses Polaris's two-tier RBAC model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principal Roles&lt;/strong&gt; - assigned to service principals (identities)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Roles&lt;/strong&gt; - define actual permissions (TABLE_READ_DATA, TABLE_WRITE_DATA, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: A data analyst gets role &lt;code&gt;analyst_prod&lt;/code&gt;, which is granted &lt;code&gt;TABLE_READ_DATA&lt;/code&gt; on &lt;code&gt;catalog.sales.transactions&lt;/code&gt;. A service account gets role &lt;code&gt;etl_writer&lt;/code&gt;, which gets &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt; on &lt;code&gt;catalog.etl.staging&lt;/code&gt;.&lt;/p&gt;
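&lt;p&gt;The two-tier check can be modeled in a few lines of Python (an illustrative flattened model, not Polaris source code; the role and table names are the hypothetical ones from the example above):&lt;/p&gt;

```python
# Principals are assigned principal roles; catalog roles carry the actual
# privilege grants. This sketch flattens the principal-role -> catalog-role
# mapping into one lookup for brevity.
principal_roles = {"alice": {"analyst_prod"}, "etl-svc": {"etl_writer"}}
catalog_role_grants = {
    "analyst_prod": {("catalog.sales.transactions", "TABLE_READ_DATA")},
    "etl_writer":   {("catalog.etl.staging", "TABLE_WRITE_DATA")},
}

def is_authorized(principal, table, privilege):
    """True if any of the principal's roles grants the privilege on the table."""
    return any(
        (table, privilege) in catalog_role_grants.get(role, set())
        for role in principal_roles.get(principal, set())
    )

print(is_authorized("alice", "catalog.sales.transactions", "TABLE_READ_DATA"))  # True
print(is_authorized("alice", "catalog.etl.staging", "TABLE_WRITE_DATA"))        # False
```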

&lt;h3&gt;
  
  
  2. Storage Configuration Lookup
&lt;/h3&gt;

&lt;p&gt;Polaris queries its configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which cloud provider hosts this table? (AWS S3, GCS, Azure Blob)&lt;/li&gt;
&lt;li&gt;What credentials should be used for minting? (Polaris's service role)&lt;/li&gt;
&lt;li&gt;Are there any table-specific overrides?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Credential Minting
&lt;/h3&gt;

&lt;p&gt;Here's where the magic happens. Polaris calls the cloud provider's API to mint temporary, scoped credentials:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AWS (S3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assume-role(
  role_arn=polaris-service-role,
  session_name=client-session-xyz,
  session_duration=15m,
  policy=restrict-to-s3://bucket/table-path/
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns: temporary AWS credentials (access key, secret key, and session token) valid for 15 minutes, scoped to the specific table path.&lt;/p&gt;
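&lt;p&gt;In boto3 terms, the AWS flow looks roughly like this. The role ARN, bucket, and path are placeholders, and the exact session policy Polaris generates may differ; the key idea is that an inline session policy intersects with the role's permissions, so the vended credentials can never exceed the table path:&lt;/p&gt;

```python
import json

def scoped_session_policy(bucket, table_prefix, read_only=True):
    """Inline session policy restricting temporary credentials to one table path."""
    actions = ["s3:GetObject"] if read_only else [
        "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": actions,
            "Resource": f"arn:aws:s3:::{bucket}/{table_prefix}/*",
        }],
    }

def vend_s3_credentials(sts_client, role_arn, bucket, table_prefix):
    # Effective permissions = role policy INTERSECT session policy.
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="client-session-xyz",
        DurationSeconds=900,  # 15 minutes
        Policy=json.dumps(scoped_session_policy(bucket, table_prefix)),
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken, Expiration.
    return resp["Credentials"]

policy = scoped_session_policy("data", "catalog/table")
print(policy["Statement"][0]["Resource"])  # arn:aws:s3:::data/catalog/table/*
```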

&lt;p&gt;&lt;strong&gt;For GCS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create-service-account-key(
  service_account=polaris-sa@project.iam.gserviceaccount.com,
  lifetime=15m,
  custom-claims={ "resource": "gs://bucket/table-path" }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns: a short-lived OAuth2 access token valid for 15 minutes, scoped to the table path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Azure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get-managed-identity-token(
  resource=https://storage.azure.com,
  lifetime=15m,
  scope=/subscriptions/xxx/resourceGroups/yyy/providers/Microsoft.Storage/storageAccounts/zzz/blobServices/default/containers/table-path
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns: short-lived bearer token (OAuth2) valid for 15 minutes, scoped to the container path.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Scope Restriction
&lt;/h3&gt;

&lt;p&gt;The credentials are scoped to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Path&lt;/strong&gt; - exact table location (e.g., &lt;code&gt;s3://data/catalog/table/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; - read-only (GET) vs read-write (GET, PUT, DELETE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt; - typically 15 minutes (configurable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A client can't use these credentials to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access other tables&lt;/li&gt;
&lt;li&gt;Write data to a read-only table&lt;/li&gt;
&lt;li&gt;Perform actions after expiration&lt;/li&gt;
&lt;/ul&gt;
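&lt;p&gt;From the client's point of view, the three restrictions compose into a single predicate: any request outside the path, operation set, or time window fails at the storage layer. A sketch (illustrative only; real enforcement happens in the cloud provider's policy engine, not client code):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

class VendedCredential:
    """A vended credential's scope: path prefix, allowed operations, expiry."""

    def __init__(self, path_prefix, operations, ttl_minutes=15):
        self.path_prefix = path_prefix
        self.operations = set(operations)
        self.expires_at = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)

    def allows(self, path, operation, now=None):
        now = now or datetime.now(timezone.utc)
        return (now < self.expires_at
                and operation in self.operations
                and path.startswith(self.path_prefix))

cred = VendedCredential("s3://data/catalog/table/", {"GET"})
print(cred.allows("s3://data/catalog/table/part-0.parquet", "GET"))  # True
print(cred.allows("s3://data/other-table/part-0.parquet", "GET"))    # False
print(cred.allows("s3://data/catalog/table/part-0.parquet", "PUT"))  # False
```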

&lt;h3&gt;
  
  
  5. Return to Client
&lt;/h3&gt;

&lt;p&gt;Polaris returns the temporary credentials to the client. The client's query engine (Spark, Trino, Presto, DuckDB, etc.) receives these credentials and uses them to read/write data directly to object storage.&lt;/p&gt;

&lt;p&gt;No long-lived secrets are distributed. The client never sees Polaris's service credentials.&lt;/p&gt;
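&lt;p&gt;In practice, a client asks for vended credentials as part of loading the table. The &lt;code&gt;X-Iceberg-Access-Delegation&lt;/code&gt; header comes from the Iceberg REST Catalog spec; the endpoint URL and token below are placeholders:&lt;/p&gt;

```python
import urllib.parse

CATALOG = "https://polaris.example.com/api/catalog"  # hypothetical endpoint

def load_table_request(prefix, namespace, table):
    """Build the Iceberg REST loadTable request that asks the catalog to
    vend storage credentials along with the table metadata."""
    path = f"/v1/{prefix}/namespaces/{urllib.parse.quote(namespace)}/tables/{table}"
    headers = {
        "Authorization": "Bearer <oauth-token>",  # placeholder client token
        # Signals that the client wants temporary storage credentials
        # returned in the LoadTableResult's config map.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return CATALOG + path, headers

url, headers = load_table_request("my_catalog", "sales", "transactions")
print(url)
```

&lt;p&gt;Query engines that speak the Iceberg REST protocol send this header for you; the sketch just makes the handshake visible.&lt;/p&gt;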

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security Benefits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instant Revocation&lt;/strong&gt; - Delete a principal's role, all future requests are denied instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Grained Access&lt;/strong&gt; - per-table, per-operation, per-principal permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt; - every credential mint event is logged (who, when, which table, read/write)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance-Ready&lt;/strong&gt; - temporal credentials, immutable audit trails, no shared secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast Radius&lt;/strong&gt; - if a credential leaks, it's only valid for 15 minutes and only for one table&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Operational Benefits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No Credential Rotation&lt;/strong&gt; - a fresh credential is minted for every request, so there is nothing to rotate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Key Distribution&lt;/strong&gt; - no need to distribute, store, or rotate permanent keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Cloud Ready&lt;/strong&gt; - same API works with S3, GCS, Azure, MinIO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client Simplicity&lt;/strong&gt; - clients just receive credentials and query - they don't manage them&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Business Benefits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Aligned&lt;/strong&gt; - meets SOC2, HIPAA, PCI-DSS, FedRAMP requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Control&lt;/strong&gt; - audit who accessed what, charge accordingly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt; - enforce data mesh principles (teams own their data, Polaris mediates access)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Example: Data Mesh at Scale
&lt;/h2&gt;

&lt;p&gt;Imagine you're running a data mesh with 50 teams, each owning their own datasets. Without Polaris:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each team issues permanent API keys to consumers&lt;/li&gt;
&lt;li&gt;Keys spread across configuration files, CI/CD pipelines, notebooks&lt;/li&gt;
&lt;li&gt;A leaked key compromises an entire dataset&lt;/li&gt;
&lt;li&gt;Revocation requires manual updates across dozens of systems&lt;/li&gt;
&lt;li&gt;Audit trails are incomplete (keys used by multiple systems)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Polaris:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each consumer requests access via Polaris API (authenticated via OIDC, OAuth2, mTLS)&lt;/li&gt;
&lt;li&gt;Polaris checks if consumer's identity has permission&lt;/li&gt;
&lt;li&gt;Polaris mints a 15-minute credential scoped to the specific table and operation&lt;/li&gt;
&lt;li&gt;Consumer queries the data with the temporary credential&lt;/li&gt;
&lt;li&gt;On the next request, a new short-lived credential is minted; the old one simply expires&lt;/li&gt;
&lt;li&gt;Revoke a consumer's role, and all future requests fail instantly&lt;/li&gt;
&lt;li&gt;Audit logs show exactly which identity accessed which table at what time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Version 1.3.0 Features (January 2026)
&lt;/h2&gt;

&lt;p&gt;Recent Polaris releases added:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Federated Credential Vending&lt;/strong&gt; - Polaris can mint credentials for external catalogs (Snowflake, AWS Glue) instead of clients using their own credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPA Integration&lt;/strong&gt; - externalize authorization logic to Open Policy Agent for complex policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic Tables&lt;/strong&gt; - support Delta Lake and Hudi alongside Iceberg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Reporting&lt;/strong&gt; - pluggable framework to report table metrics (row/byte counts, commits)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Polaris is an Apache Software Foundation project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/apache/polaris" rel="noopener noreferrer"&gt;https://github.com/apache/polaris&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;https://polaris.apache.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REST API:&lt;/strong&gt; Implements Iceberg REST Catalog spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a data platform, data mesh, or multi-tenant system, Polaris's credential vending model is worth studying. It's a pattern that applies beyond Iceberg - any system managing shared resource access can benefit from temporal, scoped credentials.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>polaris</category>
      <category>security</category>
      <category>api</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
