<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gayathri Jetti</title>
    <description>The latest articles on Forem by Gayathri Jetti (@gayathriabcde).</description>
    <link>https://forem.com/gayathriabcde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904661%2F3ce4f423-ff5c-4ca3-89dc-0e65b7b78bdc.jpg</url>
      <title>Forem: Gayathri Jetti</title>
      <link>https://forem.com/gayathriabcde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gayathriabcde"/>
    <language>en</language>
    <item>
      <title>Video Search with MongoDB</title>
      <dc:creator>Gayathri Jetti</dc:creator>
      <pubDate>Wed, 06 May 2026 07:17:48 +0000</pubDate>
      <link>https://forem.com/gayathriabcde/video-search-with-mongodb-5a1p</link>
      <guid>https://forem.com/gayathriabcde/video-search-with-mongodb-5a1p</guid>
      <description>&lt;p&gt;By: &lt;a class="mentioned-user" href="https://dev.to/gayathriabcde"&gt;@gayathriabcde&lt;/a&gt;,    &lt;a class="mentioned-user" href="https://dev.to/gotlur_parnika_27c3841815"&gt;@gotlur_parnika_27c3841815&lt;/a&gt; , &lt;a class="mentioned-user" href="https://dev.to/jhv_07"&gt;@jhv_07&lt;/a&gt; , and &lt;a class="mentioned-user" href="https://dev.to/gnana_1905"&gt;@gnana_1905&lt;/a&gt; , under the guidance of &lt;a class="mentioned-user" href="https://dev.to/chanda_rajkumar"&gt;@chanda_rajkumar&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Building a Retrieval-Augmented Generation (RAG) application usually follows the same basic steps: chunk the text, embed it, store the embeddings in a vector database, and search by similarity. This works well for text-based documents, but the steps can't be applied as-is to videos, especially procedural videos, where each frame is connected to the ones around it. &lt;/p&gt;

&lt;p&gt;While building a RAG system aimed at making procedural videos easy to search through, I realized that treating a video as a 'bag of chunks' is extremely inefficient. The sheer amount of data that each frame generates was another problem we faced. In this post, I'll explain how I attempted to use MongoDB's &lt;code&gt;$graphLookup&lt;/code&gt; to solve this problem. &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. The Prerequisite Problem:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Videos aren't just isolated concepts grouped together: the frames are chronologically sequenced, and the actions performed in earlier frames tend to have an action-result relationship with the frames that follow, depending on what the video is demonstrating. For example, if an AI search engine is asked "How do I tighten the 10mm bolt?", a standard vector search will return the single most semantically similar frame, the exact moment the bolt is being tightened. But that doesn't fully answer the user's question, because the user never learns what has to be done before the bolt can be turned. &lt;/p&gt;

&lt;p&gt;This is where prerequisite steps become necessary. They guide the user through the entire process leading up to the result, answering the query in the most thorough and educational way possible. &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Vector DB vs Graph DB:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To do this, the worker needed to analyze each video and store it as a chronological chain of nodes: a graph rather than a flat list. Two things are needed to do this effectively: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A Vector Database&lt;/strong&gt; that performs semantic search over both the audio transcripts and the visual descriptions, so the exact moment the user needs can be found quickly.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A database that can traverse backward&lt;/strong&gt; through the links between frames and return the prerequisite steps. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A dedicated graph database would make traversing the nodes easier, but only if we were linking nodes that sit far apart in the video. Since we are primarily focusing on procedural videos right now, and since those are essentially linked lists where each node points to the node right before it, a graph database, built for highly interconnected data, felt unnecessary. On top of that, maintaining two databases, one for semantic search and one for the relationships, and keeping their states in sync is extremely difficult. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Node Schema:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6sbv5jr04b5t9kkhuc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6sbv5jr04b5t9kkhuc2.png" alt="Node Schema Visualization" width="800" height="818"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. The Solution: MongoDB Atlas&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MongoDB Atlas is a document database that natively supports both Vector Search and graph traversal, which let me take a unified approach, modelling each video moment as a document with two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A &lt;code&gt;vector_int8&lt;/code&gt; field&lt;/strong&gt; that stores compressed, int8-quantized embeddings. This keeps the storage footprint small while still supporting fast semantic search (a sample index definition is sketched just after this list).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A &lt;code&gt;graph_edges&lt;/code&gt; object&lt;/strong&gt; that contains simple &lt;code&gt;prev_node_id&lt;/code&gt; and &lt;code&gt;next_node_id&lt;/code&gt; pointers. These act as the "links" that turn our flat list of frames into a traversable timeline.&lt;/li&gt;
&lt;/ol&gt;
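
&lt;p&gt;For reference, here is roughly what an Atlas Vector Search index definition over the &lt;code&gt;vector_int8&lt;/code&gt; path could look like. The number of dimensions shown is an assumption and has to match whichever embedding model is actually used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch of an Atlas Vector Search index definition for vector_int8.
// numDimensions (1024) is an assumption; it must match the embedding model.
{
  "fields": [
    {
      "type": "vector",
      "path": "vector_int8",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;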




&lt;h2&gt;
  
  
  &lt;strong&gt;4. The Single-Pipeline Logic:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The main optimization happens in the retrieval API. Instead of running a vector search, extracting the matching node's ID, and then executing a second query to fetch its prerequisite steps, MongoDB handles everything in a single aggregation pipeline.&lt;/p&gt;

&lt;p&gt;Here is how the search is executed:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 1: The Vector Search&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, we use the &lt;code&gt;$vectorSearch&lt;/code&gt; stage to find the "Target Node"—the specific moment that best matches what the user is asking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$vectorSearch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;index&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vector_index&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vector_int8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;queryVector&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;user_query_int8&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;numCandidates&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;limit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
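
&lt;p&gt;The &lt;code&gt;queryVector&lt;/code&gt; has to live in the same int8 space as the stored embeddings. As a minimal sketch (not the project's actual helper), one simple way to scalar-quantize a float query embedding is to scale by its largest absolute component and clamp to the int8 range; with cosine similarity, scaling each vector independently does not change the angle, so only the rounding introduces error.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: scalar-quantize a float embedding to int8.
// This is an illustrative assumption, not the project's actual helper.
function toInt8(embedding) {
  // Scale by the largest absolute component, then clamp to [-128, 127].
  const maxAbs = Math.max(...embedding.map(Math.abs)) || 1;
  return embedding.map(function (v) {
    return Math.max(-128, Math.min(127, Math.round((v / maxAbs) * 127)));
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;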



&lt;h3&gt;
  
  
  &lt;strong&gt;Stage 2: Traversal&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Immediately after finding the target node, we send it directly into &lt;code&gt;$graphLookup&lt;/code&gt;. This stage uses the &lt;code&gt;prev_node_id&lt;/code&gt; field to automatically walk backward in time, fetching the exact prerequisite steps the user needs to see.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$graphLookup&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;from&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;video_nodes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;startWith&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$graph_edges.prev_node_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;connectFromField&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;graph_edges.prev_node_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;connectToField&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;as&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prerequisite_steps&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;maxDepth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;depthField&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;steps_backward&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
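
&lt;p&gt;Putting the two stages together, the backend issues them as one &lt;code&gt;aggregate()&lt;/code&gt; call. The sketch below uses the Node.js driver; the connection string, database and collection names, and the &lt;code&gt;embedQuery()&lt;/code&gt; helper are assumptions for illustration, not the project's exact code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: run $vectorSearch + $graphLookup as a single pipeline.
// The URI, db/collection names, and embedQuery() are hypothetical.
const { MongoClient } = require("mongodb");

async function searchWithPrerequisites(queryText) {
  const client = new MongoClient(process.env.MONGODB_URI);
  try {
    const nodes = client.db("video_rag").collection("video_nodes");
    const queryVector = await embedQuery(queryText); // int8 embedding of the user query

    const pipeline = [
      {
        "$vectorSearch": {
          "index": "vector_index",
          "path": "vector_int8",
          "queryVector": queryVector,
          "numCandidates": 50,
          "limit": 3
        }
      },
      {
        "$graphLookup": {
          "from": "video_nodes",
          "startWith": "$graph_edges.prev_node_id",
          "connectFromField": "graph_edges.prev_node_id",
          "connectToField": "_id",
          "as": "prerequisite_steps",
          "maxDepth": 5,
          "depthField": "steps_backward"
        }
      }
    ];

    // $vectorSearch must be the first stage; each hit comes back with its
    // prerequisite_steps array attached.
    return await nodes.aggregate(pipeline).toArray();
  } finally {
    await client.close();
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One practical note: &lt;code&gt;$graphLookup&lt;/code&gt; does not guarantee the order of the documents it returns, so sorting &lt;code&gt;prerequisite_steps&lt;/code&gt; by the &lt;code&gt;steps_backward&lt;/code&gt; depth field (descending) on the application side restores chronological order before the steps are shown to the user.&lt;/p&gt;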



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytm68y21mn2r1yr3jk4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytm68y21mn2r1yr3jk4f.png" alt=" " width="800" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why This Architecture Wins:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Single Query Execution:&lt;/strong&gt; With a single query from the backend, MongoDB returns a structured document containing the target answer and its chronological context, ready to be sent to the frontend.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Configurable Depth:&lt;/strong&gt; We can change how many prerequisite steps the user receives simply by adjusting the &lt;code&gt;maxDepth&lt;/code&gt; value.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Simple Architecture:&lt;/strong&gt; We get the benefits of a graph structure without having to learn specialized query languages or maintain two separate databases.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Conclusion:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To summarize, the usual RAG approach of treating data as a "bag of chunks" fails for procedural videos because it ignores the dependencies between frames. To solve this without adding new operational problems, I used MongoDB Atlas to build a unified architecture that returns meaningful results efficiently by combining a vector search index with &lt;code&gt;$graphLookup&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;If you are building an AI search tool for any kind of chronological or step-by-step data, preserving the relational links between your nodes is what matters most. This unified approach keeps the backend simple, avoids unnecessary operational overhead, and ultimately gives the user a much more thorough and educational answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodvf1hc4ed2s0xy7t25x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodvf1hc4ed2s0xy7t25x.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A limited local implementation in which a video is processed, illustrating how the frame-nodes are inserted into MongoDB&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh8sblqyqmtvaorfc4ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh8sblqyqmtvaorfc4ai.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;After the video is processed, a search jumps to the exact second at which the answer to the query is discussed in the video&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>mongodb</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
