<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gokul Kannan</title>
    <description>The latest articles on Forem by Gokul Kannan (@gokul_kannan_1011).</description>
    <link>https://forem.com/gokul_kannan_1011</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904495%2F19204239-5958-4316-9813-7e25e813c347.png</url>
      <title>Forem: Gokul Kannan</title>
      <link>https://forem.com/gokul_kannan_1011</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gokul_kannan_1011"/>
    <language>en</language>
    <item>
      <title>Vector Databases in RAG - Day 2</title>
      <dc:creator>Gokul Kannan</dc:creator>
      <pubDate>Sun, 03 May 2026 10:43:27 +0000</pubDate>
      <link>https://forem.com/gokul_kannan_1011/vector-databases-in-rag-day-2-18m4</link>
      <guid>https://forem.com/gokul_kannan_1011/vector-databases-in-rag-day-2-18m4</guid>
      <description>&lt;p&gt;In Day-1, we understood about the overview of a RAG system and what are its components and how it helps the LLM to generate more accurate and contextual responses. Now, lets see about the storage of the data using Vector Databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Database
&lt;/h2&gt;

&lt;p&gt;Let's assume we have a PDF, and this is our private data. I want my LLM to have context about this PDF, so that I can ask any query related to it and get a response.&lt;/p&gt;

&lt;p&gt;Now, we need to store this PDF data in a format from which the LLM can fetch the data and give us relevant responses.&lt;br&gt;
Here, a Vector Database helps us store the PDF data in a numerical format that can be used to fetch the relevant data for the LLM.&lt;/p&gt;

&lt;p&gt;A vector database stores data in the form of vectors (arrays of numbers).&lt;/p&gt;

&lt;p&gt;A vector database is a specialized database designed to store and search vector embeddings (numerical representations of data). Unlike traditional RDBMS systems that use exact matching (like SQL queries), vector databases are optimized for similarity search. Examples include ChromaDB, Pinecone, FAISS, and Qdrant.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Text =&amp;gt; converted into numbers using embeddings&lt;br&gt;
Image =&amp;gt; converted into numbers&lt;br&gt;
Audio =&amp;gt; converted into numbers&lt;/p&gt;

&lt;p&gt;These numbers capture meaning, not just raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does "Numerical Format" Mean?
&lt;/h2&gt;

&lt;p&gt;It means that any kind of data (text, image, or audio) is converted into a numerical format using some encoding algorithm and saved into the DB.&lt;/p&gt;

&lt;p&gt;First, we break the PDF data down into chunks, where each chunk holds a certain number of characters. Each chunk is then converted into a vector with some number of dimensions. To get a clear understanding, let's look at the example below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Today is Wednesday&lt;/li&gt;
&lt;li&gt;Tomorrow is Thursday&lt;/li&gt;
&lt;li&gt;I am travelling today&lt;/li&gt;
&lt;li&gt;Wednesday is a nice series&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now I need this data converted into some sort of numerical format, i.e. vectors. Here, let's consider a simple one-hot encoding.&lt;/p&gt;

&lt;p&gt;First, let's find all the unique words and list them as an array.&lt;/p&gt;

&lt;p&gt;[Today, is, Wednesday, Tomorrow, Thursday, I, am, Travelling, a, nice, series]&lt;/p&gt;

&lt;p&gt;Now let's assign the values 1 and 0 for each word in the same array order: 1 if the word occurs in the sentence, 0 if not.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;1 1 1 0 0 0 0 0 0 0 0 &lt;/li&gt;
&lt;li&gt;0 1 0 1 1 0 0 0 0 0 0&lt;/li&gt;
&lt;li&gt;1 0 0 0 0 1 1 1 0 0 0&lt;/li&gt;
&lt;li&gt;0 1 1 0 0 0 0 0 1 1 1&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, we have now converted each sentence into a numerical format. This is a very basic encoding algorithm; real embedding models produce much richer vectors.&lt;/p&gt;
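
&lt;p&gt;As a minimal sketch, the one-hot encoding above can be reproduced in a few lines of Python (the vocabulary order follows first occurrence, and words are lowercased so "Today" and "today" match):&lt;/p&gt;

```python
# Minimal one-hot encoder for the four example sentences.
sentences = [
    "Today is Wednesday",
    "Tomorrow is Thursday",
    "I am travelling today",
    "Wednesday is a nice series",
]

def build_vocab(sentences):
    # Collect unique words in order of first occurrence.
    vocab = []
    for s in sentences:
        for word in s.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def one_hot(sentence, vocab):
    # 1 if the word occurs in the sentence, 0 if not.
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

vocab = build_vocab(sentences)
vectors = [one_hot(s, vocab) for s in sentences]
print(vectors[0])  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```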

&lt;h2&gt;
  
  
  How do we do this conversion?
&lt;/h2&gt;

&lt;p&gt;Now we understand why we need to convert the data into a numerical format, and that we use different kinds of encoding algorithms to do that conversion.&lt;/p&gt;

&lt;p&gt;So, how do we convert our data into a format which can be stored in a vector database? The answer is Embedding Models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is this Embedding Model?
&lt;/h2&gt;

&lt;p&gt;An embedding model helps us convert our data into vectors, which can then be stored in a vector DB.&lt;/p&gt;

&lt;p&gt;There are different embedding models available. One such model is nomic-embed, which has 768 dimensions, meaning each chunk of data is represented as a 768-dimensional vector.&lt;/p&gt;

&lt;p&gt;Data &lt;br&gt;
↓&lt;br&gt;
[nomic-embed -Embedding Model]&lt;br&gt;
↓&lt;br&gt;
768D[] vectors &lt;br&gt;
↓&lt;br&gt;
VectorDB&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do We Need to Save in a Numerical Format?
&lt;/h2&gt;

&lt;p&gt;We might ask: why can't we just store the same text data in the DB and do a normal text search? In that case, we would only be matching isolated words, and we couldn't extract the context or semantic meaning from them.&lt;/p&gt;

&lt;p&gt;A vector DB helps us find data with a similar meaning by doing a similarity (semantic) search.&lt;/p&gt;

&lt;p&gt;Now let's understand the whole flow where this vector DB gets used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your data (PDF, docs, DB) → converted into embeddings&lt;/li&gt;
&lt;li&gt;Stored in a vector DB&lt;/li&gt;
&lt;li&gt;When user asks a question → it is also converted into a vector&lt;/li&gt;
&lt;li&gt;Vector DB finds similar content&lt;/li&gt;
&lt;li&gt;That content is sent to the LLM for answer generation&lt;/li&gt;
&lt;/ol&gt;
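
&lt;p&gt;The five steps above can be sketched end to end in Python. This is only a toy: the embedding function is a hypothetical character-frequency counter standing in for a real embedding model, and the "vector DB" is a plain list:&lt;/p&gt;

```python
import math

# Toy end-to-end sketch of the RAG flow. toy_embed is a hypothetical
# stand-in for a real embedding model; the store is a plain list rather
# than a real vector DB.
def toy_embed(text):
    # Character-frequency vector over a-z (purely illustrative).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: embed the documents and store the vectors.
documents = [
    "Employees get 20 days of paid leave per year.",
    "The office cafeteria opens at 8 am.",
]
store = [(doc, toy_embed(doc)) for doc in documents]

# Steps 3-4: the user's question is also converted into a vector.
question = "How many paid leave days do employees get?"
q_vec = toy_embed(question)

# Step 5: similarity search; the best chunk would be sent to the LLM.
best_doc = max(store, key=lambda item: cosine(item[1], q_vec))[0]
print(best_doc)  # the leave-policy document
```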

&lt;p&gt;Let's understand with an example.&lt;/p&gt;

&lt;p&gt;Imagine a company has:&lt;/p&gt;

&lt;p&gt;1000 PDFs (policies, FAQs, manuals)&lt;br&gt;
They want a chatbot to answer questions based on these documents&lt;/p&gt;

&lt;p&gt;Step 1: Convert documents into vectors&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each paragraph is converted into numbers (embeddings) using an Embedding Model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 2: Store in Vector Database&lt;/p&gt;

&lt;p&gt;Step 3: User asks a question&lt;/p&gt;

&lt;p&gt;Step 4: Convert question into vector&lt;/p&gt;

&lt;p&gt;Step 5: Similarity Search&lt;/p&gt;

&lt;p&gt;Vector DB compares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question vector&lt;/li&gt;
&lt;li&gt;Stored document vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It finds the closest match (similar meaning)&lt;/p&gt;

&lt;h2&gt;
  
  
  How does a vector DB find similar meanings?
&lt;/h2&gt;

&lt;p&gt;A Vector Database is a type of database designed to store data as numerical vectors (embeddings) and efficiently retrieve similar data by performing similarity searches using metrics like cosine similarity.&lt;/p&gt;

&lt;p&gt;Let’s imagine we reduce everything to 2D (real systems use 100s–1000s of dimensions).&lt;/p&gt;

&lt;p&gt;We take 5 words:&lt;/p&gt;

&lt;p&gt;Cat&lt;br&gt;
Dog&lt;br&gt;
Tiger&lt;br&gt;
Car&lt;br&gt;
Bus&lt;/p&gt;

&lt;p&gt;Now imagine they are plotted like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ↑ Y-axis
    |
    |        Tiger .(0.8, 0.9)
    |
    |   Cat .(0.6, 0.7)
    |   Dog .(0.65, 0.6)
    |
    |
    |
    |                    Car .(0.1, 0.2)
    |                    Bus .(0.15, 0.25)
    |
    +--------------------------------→ X-axis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cat, Dog, Tiger are close → similar meaning&lt;br&gt;
Car, Bus are close → similar meaning&lt;br&gt;
Animals are far from vehicles → very different&lt;/p&gt;

&lt;p&gt;Step 1: User query&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let’s say user searches: "Lion"&lt;/li&gt;
&lt;li&gt;We convert "Lion" into a vector: Lion → (0.75, 0.85)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 2: Compare with existing points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now the vector DB calculates similarity using something like:
&lt;ul&gt;
&lt;li&gt;Cosine similarity - this measures the angle between two vectors, not just the distance&lt;/li&gt;
&lt;li&gt;OR Euclidean distance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 3: Find nearest neighbors&lt;/p&gt;

&lt;p&gt;Finally,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector DB returns: Tiger, Cat (top matches)&lt;/li&gt;
&lt;/ul&gt;
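
&lt;p&gt;The search above can be sketched with plain Python, using the 2D points from the plot (cosine similarity compares the angle between vectors):&lt;/p&gt;

```python
import math

# 2D word vectors from the plot above.
points = {
    "Cat": (0.6, 0.7),
    "Dog": (0.65, 0.6),
    "Tiger": (0.8, 0.9),
    "Car": (0.1, 0.2),
    "Bus": (0.15, 0.25),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = (0.75, 0.85)  # "Lion"
ranked = sorted(points, key=lambda w: cosine_similarity(points[w], query),
                reverse=True)
print(ranked[:2])  # ['Tiger', 'Cat'] - the top matches
```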

&lt;p&gt;In reality:&lt;/p&gt;

&lt;p&gt;Embedding models are not 2D; they have 768, 1536, or more dimensions.&lt;/p&gt;

&lt;p&gt;Vector DBs use optimized algorithms like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ANN (Approximate Nearest Neighbor)&lt;/li&gt;
&lt;li&gt;KNN (K Nearest Neighbor)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is Vector DB mandatory for RAG?
&lt;/h2&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) is an approach/architecture. In one common approach, we use a vector DB to retrieve the relevant data.&lt;/p&gt;

&lt;p&gt;Here, the vector DB is used in the retriever layer to perform semantic search.&lt;/p&gt;

&lt;p&gt;Instead of Vector DB, RAG can also use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyword search (like SQL LIKE)&lt;/li&gt;
&lt;li&gt;APIs or databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, to be clear:&lt;/p&gt;

&lt;p&gt;RAG is not just an LLM + a Vector DB.&lt;br&gt;
Instead,&lt;br&gt;
RAG is LLM + Retrieval (a vector DB is one way to do retrieval).&lt;/p&gt;

&lt;p&gt;So, RAG is an approach where an LLM retrieves relevant external data (often using a vector database) and uses it to generate more accurate, context-aware responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;A vector database performs similarity search by representing data (chunks of text, images, or audio) as high-dimensional vectors. If we consider this a multi-dimensional space, each item is stored as a point in that space.&lt;br&gt;
When a query is given, it is also converted into a vector, and the database uses similarity metrics such as cosine similarity to measure how close the query vector is to the stored vectors.&lt;br&gt;
Based on this, it retrieves the most relevant results by selecting the vectors that are closest in semantic meaning.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>beginners</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Hello World of RAG - Day 1</title>
      <dc:creator>Gokul Kannan</dc:creator>
      <pubDate>Sat, 02 May 2026 11:11:22 +0000</pubDate>
      <link>https://forem.com/gokul_kannan_1011/hello-world-of-rag-day-1-15l9</link>
      <guid>https://forem.com/gokul_kannan_1011/hello-world-of-rag-day-1-15l9</guid>
      <description>&lt;p&gt;As a beginner in understanding LLMs, when I heard the term RAG-Retrieval Augmented Generation, I assumed it was a technique used within LLMs. However, from this session, I learned that RAG is all about use of our own custom or private data along with an LLM to generate more relevant responses.&lt;/p&gt;

&lt;p&gt;Before understanding RAG, we need clarity on what exactly these LLMs are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does a Model mean?
&lt;/h2&gt;

&lt;p&gt;A model is essentially an equation. Let's say we have the equation&lt;br&gt;
y = mx + c&lt;br&gt;
This is a straight-line equation.&lt;br&gt;
If values of x and y are provided, the system just tweaks the values of m and c to find the best fit.&lt;br&gt;
Let's say x = 1 &amp;amp; y = 2; now I can have m=1 &amp;amp; c=1, or m=0 &amp;amp; c=2, etc. Here it learns different patterns. This process is called learning.&lt;/p&gt;
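
&lt;p&gt;To make "tweaking m and c" concrete, here is a tiny sketch that learns m and c for y = mx + c from data points, using the closed-form least-squares formulas (the data points here are made up for illustration):&lt;/p&gt;

```python
# Learn m and c for y = m*x + c via closed-form least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
c = mean_y - m * mean_x
print(m, c)  # 2.0 1.0 - the line that generated the data
```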

&lt;h2&gt;
  
  
  Parameters and Weights
&lt;/h2&gt;

&lt;p&gt;Similarly, in an AI model, the equation is much larger, with billions of parameters. The more complex the equation, the more patterns the model can learn, and so the relevance and accuracy improve. A model's predictions vary based on the data it was trained on.&lt;br&gt;
This is the reason why bigger models often perform better. That's why AI companies like OpenAI, Google, and Anthropic come up with models containing billions of parameters, which helps them learn complex relationships.&lt;/p&gt;

&lt;p&gt;And along with these parameters, we have something called weights.&lt;br&gt;
For example: m1x^2 + m2x^3&lt;br&gt;
Here m1 and m2 are called weights. These are the values learned from the data during training, and they act as deciding factors.&lt;br&gt;
For example:&lt;br&gt;
When a model learns about animals,&lt;br&gt;
"Cat" gets one weight&lt;br&gt;
"Dog" gets another&lt;br&gt;
"Lion" gets another&lt;/p&gt;

&lt;p&gt;Based on the weights, the relevance changes. And using this the model could prioritize the importance of one over the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an LLM do?
&lt;/h2&gt;

&lt;p&gt;Now that we understand what a model is, what does an LLM actually do?&lt;br&gt;
The answer: "It just predicts the next word."&lt;/p&gt;

&lt;p&gt;If you ask an AI a question, it does not understand it the way we do. Instead, it takes the question or prompt as input, predicts the next word, and then uses the predicted word as part of the input to predict the next one. This repeats until it generates the complete response. This is also why it streams the response word by word rather than giving the whole response at once.&lt;/p&gt;
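
&lt;p&gt;This next-word loop can be illustrated with a toy "model" - here just a hypothetical lookup table of likely next words, nothing like a real LLM:&lt;/p&gt;

```python
# Toy autoregressive generation: predict a word, append it, repeat.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}

def generate(prompt, max_new_tokens=4):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        last = tokens[-1]
        if last not in NEXT_WORD:
            break
        # The predicted word becomes part of the input for the next step.
        tokens.append(NEXT_WORD[last])
    return " ".join(tokens)

print(generate("the"))  # "the cat sat on the"
```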

&lt;h2&gt;
  
  
  But how does it predict the next word?
&lt;/h2&gt;

&lt;p&gt;It uses the weights the pretrained model already has for all the data it was trained on. What if we ask about something the model wasn't trained on? Will it say "I don't know"?&lt;br&gt;
No, it just hallucinates.&lt;/p&gt;

&lt;p&gt;For example, the model is trained only on:&lt;br&gt;
Cats&lt;br&gt;
Dogs&lt;br&gt;
and if we ask about:&lt;br&gt;
Lions&lt;/p&gt;

&lt;p&gt;The model was never exposed to data related to "Lions".&lt;br&gt;
Instead of saying&lt;br&gt;
“I don’t know”,&lt;br&gt;
the model confidently gives a wrong answer. This is called hallucination.&lt;/p&gt;

&lt;p&gt;This is why RAG becomes necessary. We give the model context by providing our private data, so that it doesn't hallucinate when we ask anything related to that data; instead, it uses this private data to generate the response rather than relying only on its pretrained knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temperature
&lt;/h2&gt;

&lt;p&gt;Temperature controls the creativity of the model.&lt;br&gt;
It usually ranges from 0 to 1:&lt;/p&gt;

&lt;p&gt;Low Temperature (0.1)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More factual&lt;/li&gt;
&lt;li&gt;More stable&lt;/li&gt;
&lt;li&gt;Less creative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Medium Temperature (0.5)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balanced output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High Temperature (0.9)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More creative&lt;/li&gt;
&lt;li&gt;More imaginative&lt;/li&gt;
&lt;li&gt;Higher chance of hallucination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Temperature does not directly control truth.&lt;br&gt;
It controls randomness.&lt;/p&gt;
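
&lt;p&gt;Under the hood, temperature typically divides the model's raw scores (logits) before the softmax that turns them into next-word probabilities. A small sketch (the logits here are made-up numbers):&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate words
low = softmax_with_temperature(logits, 0.1)   # sharply favours the top word
high = softmax_with_temperature(logits, 0.9)  # spreads probability around
```

&lt;p&gt;At low temperature almost all the probability lands on the top-scoring word (stable, predictable output); at high temperature the other candidates get a real chance of being sampled, which is where the extra creativity - and the extra hallucination risk - comes from.&lt;/p&gt;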

&lt;h2&gt;
  
  
  LLM and SLM
&lt;/h2&gt;

&lt;p&gt;We don't always need a bigger model that knows everything when we actually just need it for specific use cases. In that situation, we may need a specialized model. This is where an SLM helps.&lt;/p&gt;

&lt;p&gt;SLM - Small Language Model&lt;br&gt;
This helps with specific use cases, for example: chatbots, domain-specific tasks, voice assistants.&lt;br&gt;
These models may have millions of parameters instead of billions.&lt;/p&gt;

&lt;p&gt;It is much cheaper and smaller than an LLM.&lt;/p&gt;

&lt;p&gt;LLM - Large Language Model&lt;br&gt;
It is a generalized model that has knowledge from different domains and billions of parameters. Examples: Claude, Gemini, and ChatGPT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need RAG?
&lt;/h2&gt;

&lt;p&gt;All these LLMs have a few major limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Outdated knowledge - They may not know about recent events. They only know the data they were trained on.&lt;/li&gt;
&lt;li&gt;Hallucination - As an outcome of the first limitation, when we ask about something it doesn't know, it hallucinates.&lt;/li&gt;
&lt;li&gt;They don't have any knowledge of private data they cannot access. Examples: private business data, HR documents, finance documents, project reports, project management tool data, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where the RAG comes into picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG - Retrieval Augmented Generation
&lt;/h2&gt;

&lt;p&gt;The name itself is fairly self-explanatory.&lt;/p&gt;

&lt;p&gt;RAG typically involves three main steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; – Relevant data is fetched from external sources like PDFs, databases, internal files, knowledge bases or documents based on the user’s query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Augment&lt;/strong&gt; – The retrieved data is added to the prompt/context that is sent to the pre-trained LLM.&lt;br&gt;
(Important: we are not modifying or retraining the model itself, just giving it extra context.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate&lt;/strong&gt; – The LLM uses both its pre-trained knowledge and the retrieved context to generate a more accurate and relevant response.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So instead of relying only on its pretrained data, the LLM looks up the retrieved private data and then generates the response. This way, RAG helps the LLM overcome the limitations mentioned above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does this private data get stored?
&lt;/h2&gt;

&lt;p&gt;The private data is stored inside a database known as a Vector Database. The vector database is a concept with several implementations.&lt;/p&gt;

&lt;p&gt;For example: private data like Azure DevOps board content, HR policy documents, Jira content, internal business docs.&lt;br&gt;
None of these are fed directly to the LLM. Instead, they are converted and stored intelligently.&lt;/p&gt;

&lt;p&gt;How are these documents stored?&lt;br&gt;
Documents are broken into smaller parts called chunks.&lt;br&gt;
These chunks are sentence groups or paragraphs,&lt;br&gt;
not individual words.&lt;br&gt;
This is because meaning comes from context, not from isolated words.&lt;br&gt;
These contextual chunks help the RAG system give more relevant responses.&lt;/p&gt;
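
&lt;p&gt;A naive chunker can be sketched like this - splitting on sentence boundaries and grouping a few sentences per chunk (real pipelines use smarter, overlap-aware splitters):&lt;/p&gt;

```python
def chunk_text(text, sentences_per_chunk=2):
    # Split into sentences, then group them so each chunk keeps context
    # rather than isolated words.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        group = sentences[i:i + sentences_per_chunk]
        chunks.append(". ".join(group) + ".")
    return chunks

doc = "Leave policy allows 20 days. Carry-over is capped. Approvals go via HR."
print(chunk_text(doc))
# ['Leave policy allows 20 days. Carry-over is capped.', 'Approvals go via HR.']
```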

&lt;h2&gt;
  
  
  What is a Vector?
&lt;/h2&gt;

&lt;p&gt;A vector has Magnitude and Direction.&lt;/p&gt;

&lt;p&gt;Each chunk is converted into a numerical vector.&lt;br&gt;
Example:&lt;br&gt;
A paragraph about lions becomes&lt;br&gt;
P1=[...700 dimensions]&lt;br&gt;
P2=[...700 dimensions]&lt;br&gt;
P3=[...700 dimensions]&lt;br&gt;
Here P1, P2 and P3 are points in a graph, each defined by 700 dimensions.&lt;br&gt;
For our understanding: in a 2D graph, we represent a point with x and y values, so a point P1 can be defined as (x, y). Similarly, we can define a point with any number of dimensions.&lt;br&gt;
Now the system measures the distance between the vectors and finds the relevant information.&lt;/p&gt;

&lt;p&gt;It checks which points are closer to P1, and finds P2 and P3 by measuring the distance.&lt;/p&gt;

&lt;p&gt;Example :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple&lt;/li&gt;
&lt;li&gt;Orange&lt;/li&gt;
&lt;li&gt;Pear&lt;/li&gt;
&lt;li&gt;Lemon&lt;/li&gt;
&lt;li&gt;Doctor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here all the fruit-related words stay close together, while the word "Doctor" stays farther from them.&lt;br&gt;
This is how relevance works.&lt;/p&gt;

&lt;h2&gt;
  
  
  How are relevant chunks found?
&lt;/h2&gt;

&lt;p&gt;When we say it measures the distance between vectors, this involves different kinds of algorithms.&lt;br&gt;
Examples:&lt;br&gt;
ANN - Approximate Nearest Neighbors&lt;br&gt;
KNN - K-Nearest Neighbors&lt;br&gt;
These help quickly find the most relevant chunks.&lt;br&gt;
The same idea is used in:&lt;/p&gt;

&lt;p&gt;Spotify, Netflix recommendations&lt;br&gt;
Amazon suggestions&lt;br&gt;
YouTube feed&lt;br&gt;
Social media recommendations &lt;/p&gt;
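
&lt;p&gt;A brute-force K-Nearest-Neighbors search is easy to sketch; ANN libraries approximate exactly this, trading a little accuracy for speed at large scale (the 2D points below are made up for illustration):&lt;/p&gt;

```python
import math

def knn(query, labelled_points, k=2):
    # Exact k-NN: sort all points by Euclidean distance to the query.
    def distance(item):
        _, vec = item
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, query)))
    return [name for name, _ in sorted(labelled_points, key=distance)[:k]]

labelled_points = [
    ("Tiger", (0.8, 0.9)),
    ("Cat", (0.6, 0.7)),
    ("Dog", (0.65, 0.6)),
    ("Car", (0.1, 0.2)),
    ("Bus", (0.15, 0.25)),
]
print(knn((0.75, 0.85), labelled_points))  # ['Tiger', 'Cat']
```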

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The flow of a RAG system is:&lt;br&gt;
User asks a query - the prompt&lt;br&gt;
↓&lt;br&gt;
The system converts the prompt into a vector&lt;br&gt;
↓&lt;br&gt;
The system retrieves the related chunks of data (stored in a vector DB)&lt;br&gt;
↓&lt;br&gt;
The retrieved context, along with the prompt, goes into the LLM&lt;br&gt;
↓&lt;br&gt;
The LLM generates the answer using the retrieved relevant chunks&lt;br&gt;
↓&lt;br&gt;
The user receives a better response with their context.&lt;/p&gt;

&lt;p&gt;LLM - predicts&lt;br&gt;
Vector DB - stores your private data as vectors&lt;br&gt;
RAG - provides the context by searching the vector DB for relevant chunks and sending those to the LLM.&lt;br&gt;
The result: more contextual responses.&lt;/p&gt;

&lt;p&gt;RAG is a method where an LLM retrieves relevant information (often from pre-indexed data in a vector database) and uses it to generate a more accurate answer.&lt;/p&gt;

&lt;p&gt;One simple analogy for a clear understanding:&lt;br&gt;
If you are joining a project, you may need a senior person's help to learn about a particular application.&lt;br&gt;
RAG can be that senior person, in the following way.&lt;br&gt;
Here, the data means your application's documentation files and Jira/ADO data. This is private data.&lt;/p&gt;

&lt;p&gt;LLM &amp;lt;--&amp;gt; YOUR DATA &lt;br&gt;
 |________| &lt;br&gt;
     ↓ (Combining these two)&lt;br&gt;
   RAG (Now this acts as that Senior Person-You can interact with)&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>llm</category>
      <category>ai</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
