B-Tree for Equality, Sort, Range (indexing strategies in MongoDB)

#mongodb #database #index #performance

When creating an index, you don't need the full details of the queries it will serve, but the following access patterns characteristics:

A key prefix with fields used for equality predicates to query one or multiple discrete values. This key can also be used for partitioning, sharding, hashing, distributing, or merging sorted ranges.
The sorting fields to enable filtering over a continuous range of values, facilitate efficient sort and pagination, or simply get some entries clustered together.
Additional selective fields that can be used by range predicates.
Additional fields included to cover additional filters and projection and avoid fetching the document.

Maintaining indexes during inserts, deletes, and updates is synchronous to guarantee consistency when querying though secondary indexes. Therefore, it’s important to limit the number of indexes to a small set that supports your application queries performance.

In this post, I'll consider only the fields that have no array in the JSON path so that there is at maximum one index entry per document.

Access pattern: videos by category sorted by publishing date

In the previous post, with a search index, I queried the "Games" category for videos longer than 30 seconds that were published recently. To rerun this query with regular indexes on current state and let the query planner select the appropriate index, I can adjust the aggregation pipeline to:

db.youstats.aggregate([  
  {  
    $match: {  
      category: "Music",  
      duration: { $gt: 30 }  
    }  
  },  
  {  
    $sort: {  
      publishedDate: -1  
    }  
  },  
  {  
    $limit: 10  
  },  
  {  
    $project: {  
      _id: 0,  
      author: 1,  
      title: 1,  
      duration: 1,  
      type: 1,  
      publishedDate: 1  
    }  
  }  
]);

I can also run the same with a simple find():

db.youstats.find({  
  category: "Music",  
  duration: { $gt: 30 }.      ,  
},{
  author: 1                   ,
  title: 1                    ,
  duration: 1                 ,
  type: 1                     ,
  publishedDate: 1            ,
}).sort({ publishedDate: -1 }).limit(10)

This is a simple query with Equality (to get one category), Sort (to fetch only the last published) and Range (to get a duration bound) on top level fields, for which each document has a unique value. Without an index, the query is slow.

I create an index with the key fields following the Equality, Sort, Range:

db.youstats.createIndex({ 
 category: 1               , // for Equality on category
 publishedDate: -1         , // for Sort within each category
 duration: 1               , // for additional Range filtering
});

Queries with a category filter go directly to relevant B-Tree leaves, returning ordered rows without extra seeks. While reading, it filters by duration and fetches documents, stopping after finding 10 entries, as shown in the execution plan (.explain("executionStats").executionStats).

db.youstats.find({  
  category: "Music"       , 
  duration: { $gt: 10 }   ,  
},{
  author: 1               ,
  title: 1                ,
  duration: 1             ,
  type: 1                 ,
  publishedDate: 1        ,
}).sort({
  publishedDate: -1 }).limit(10).explain("executionStats").executionStats

The summary indicates that 10 keys were read, resulting in 10 documents being fetched to return the 10 documents for the result:

  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 10,

The IXSCAN details how it did it:

         indexBounds: {
            category: [ '["Music", "Music"]' ],
            publishedDate: [ '[MaxKey, MinKey]' ],
            duration: [ '(10, inf.0]' ]
          },
          keysExamined: 10,
          seeks: 1,

The first index bound returns a single range for category: "Music", the second reads all this range in order of publishedDate, and the third one skips duration outside [10, infinity], without changing the order.

It is often suggested that indexes should begin with a selective field, but this is wrong. Here I have the best index and the number of categories is limited. The misconception arises from the potential for an index on a selective field to serve multiple queries, because selective fields are often used for filters. However, we will see in the next post that this index can serve different filters.
Additionally, MongoDB indexes, using WiredTiger engine, leverage prefix compression, which is more effective with common prefixes in keys. The less space they take, the more index pages stay in memory.

In the next post, we will explore how such indexes can still be utilized with less selective filtering on categories, and that the low cardinality of the first field in the key can be advantageous.

AssemblyAI Voice Agents Challenge 🗣️

Running through July 27, the AssemblyAI Voice Agents is all about building with Universal-Streaming, AssemblyAI's most advanced real-time transcription API. Universal-Streaming is ultra fast (300ms latency!), ultra accurate, and offers intelligent endpointing to keep conversations flowing naturally.

Start building 🏗️

Top comments (2)

Dun • May 27

Great explanation of index design! One thing to consider, though, is that starting an index with a low-cardinality field like category can sometimes be less efficient for workloads with more diverse queries, especially if category isn’t commonly filtered on. In those cases, leading with a field that has higher selectivity for your common filters may still perform better. It really seems to depend on the query patterns and data distribution within each use case.