<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Indumathi R</title>
    <description>The latest articles on Forem by Indumathi R (@indumathi_r_afd5683658092).</description>
    <link>https://forem.com/indumathi_r_afd5683658092</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911149%2Fc5d7ba2c-0411-4699-8861-c8b3af1e2ded.png</url>
      <title>Forem: Indumathi R</title>
      <link>https://forem.com/indumathi_r_afd5683658092</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/indumathi_r_afd5683658092"/>
    <language>en</language>
    <item>
      <title>Day 3 - Chunking - RAG</title>
      <dc:creator>Indumathi R</dc:creator>
      <pubDate>Sun, 10 May 2026 09:24:48 +0000</pubDate>
      <link>https://forem.com/indumathi_r_afd5683658092/day-3-chunking-rag-2a19</link>
      <guid>https://forem.com/indumathi_r_afd5683658092/day-3-chunking-rag-2a19</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is chunking?&lt;/strong&gt;&lt;br&gt;
It is one of the steps in the RAG pipeline: dividing a large document into several small parts. Each small part is called a chunk, so chunking simply means dividing. Let's consider the following passage:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times. Unlike traditional databases (like MySQL or PostgreSQL) that read from a hard drive, Redis operates in the computer's main memory, which is significantly faster.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We are going to give the whole passage to the embedding model. It will generate a point (let's call it P1), which is stored in a vector DB. There is a small problem with this approach. If I ask a query like "How does Redis function?", the intended answer is "database, cache, message broker, and streaming engine". However, since the entire passage is stored as a single point, retrieval won't return just that specific part; it returns the entire passage. To get only the relevant part as the answer and leave out the irrelevant parts, chunking is very important.&lt;/p&gt;

&lt;p&gt;Chunking can be performed in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discrete chunking&lt;/li&gt;
&lt;li&gt;Semantic chunking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How small should a chunk be, or what should the size of a chunk be?&lt;/strong&gt;&lt;br&gt;
If I ask an LLM "How are you?" and it answers "the sun rises in the east", the statement is not wrong, it is just irrelevant to the question. An LLM won't simply say "I don't know"; it tries to make up some answer. By means of chunking, we are going to tweak the way in which the LLM provides answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discrete chunking&lt;/strong&gt;&lt;br&gt;
Discrete chunking uses a fixed rule to generate chunks. Let's look at some types of discrete chunking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed chunking&lt;/strong&gt;&lt;br&gt;
If I set the size to 25 characters, each chunk will contain exactly 25 characters: in a paragraph, the first 25 characters go into chunk 1, the next 25 into chunk 2, and so on. If we split the Redis passage into 25-character pieces, the first chunk would be &lt;code&gt;Redis is a high-speed, in&lt;/code&gt;, the second chunk would be &lt;code&gt;-memory data structure st&lt;/code&gt;, etc. Looking at these chunks, we can see that the meaning of the words is lost due to the splitting. What can we infer from the chunk &lt;code&gt;Redis is a high-speed, in&lt;/code&gt;? The meaning is lost, right?&lt;/p&gt;
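
&lt;p&gt;A minimal sketch of fixed chunking in Python:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def fixed_chunks(text, size=25):
    # Slice the text into pieces of exactly `size` characters,
    # with no regard for word or sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

passage = "Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine."
for chunk in fixed_chunks(passage):
    print(repr(chunk))   # the cuts fall mid-word, so meaning is lost
&lt;/code&gt;&lt;/pre&gt;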

&lt;p&gt;&lt;strong&gt;How can we do chunking better here?&lt;/strong&gt;&lt;br&gt;
Instead of cutting at exactly 25 characters, we can keep going until the sentence completes, i.e. 25 characters and then up to the next full stop. In this case, chunk 1 would be &lt;code&gt;Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine.&lt;/code&gt;&lt;/p&gt;
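
&lt;p&gt;A rough sketch of this sentence-aware variant, using a naive split on ". " (a real splitter would also handle abbreviations and other punctuation):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def sentence_chunks(text):
    # Split on ". " so every chunk ends at a completed sentence
    # instead of being cut off at an arbitrary character count.
    parts = text.split(". ")
    return [p if p.endswith(".") else p + "." for p in parts if p]
&lt;/code&gt;&lt;/pre&gt;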

&lt;p&gt;&lt;strong&gt;Overlapping chunks&lt;/strong&gt;&lt;br&gt;
As the heading suggests, some words are shared between consecutive chunks. Consider the first sentence, &lt;code&gt;Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine&lt;/code&gt;, and the second sentence, &lt;code&gt;It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times&lt;/code&gt;. If overlapping chunking is applied, a few words from the end of one chunk are repeated at the start of the next. That is,&lt;br&gt;
chunk 1 would be &lt;code&gt;Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine&lt;/code&gt; and chunk 2 would be &lt;code&gt;database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times&lt;/code&gt;.&lt;/p&gt;
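
&lt;p&gt;A small sketch of overlapping chunking, where each chunk carries the last few words of the previous one (the overlap of 8 words is an arbitrary choice for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def overlapping_chunks(sentences, overlap_words=8):
    # Prepend the tail of the previous chunk to the next one,
    # so shared context is not lost at the boundary.
    chunks, carry = [], ""
    for sentence in sentences:
        chunks.append((carry + " " + sentence).strip())
        carry = " ".join(sentence.split()[-overlap_words:])
    return chunks

sentences = [
    "Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine.",
    "It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times.",
]
for c in overlapping_chunks(sentences):
    print(c)
&lt;/code&gt;&lt;/pre&gt;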

&lt;p&gt;Sometimes closely related texts end up plotted as points far from each other. Overlapping chunking reduces this to some extent, because the shared words carry context across the chunk boundary.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Day 2 - RAG - What is Vector DB ?</title>
      <dc:creator>Indumathi R</dc:creator>
      <pubDate>Fri, 08 May 2026 02:13:27 +0000</pubDate>
      <link>https://forem.com/indumathi_r_afd5683658092/day-2-rag-what-is-vector-db--527m</link>
      <guid>https://forem.com/indumathi_r_afd5683658092/day-2-rag-what-is-vector-db--527m</guid>
      <description>&lt;p&gt;To recall, integrating our private documents with an LLM is called RAG.&lt;/p&gt;

&lt;p&gt;Let's assume we have some PDFs containing our data. The data in each PDF is broken down into chunks based on some criteria, and each chunk is fed as input to a model, more specifically an embedding model. This model will generate a point. How is the point generated?&lt;/p&gt;

&lt;p&gt;Lets take a simple example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Today is Wednesday&lt;/li&gt;
&lt;li&gt;Tomorrow is Thursday&lt;/li&gt;
&lt;li&gt;I am travelling today&lt;/li&gt;
&lt;li&gt;Wednesday is a nice series&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's now construct a list containing only the unique words from the above set of sentences:&lt;br&gt;
&lt;code&gt;Today, is, Wednesday, Tomorrow, Thursday, I, am, travelling, a, nice, series&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We are now going to convert each of the 4 sentences into a number format. For every word in the unique list, we check whether the input sentence contains it: if it does, we write 1 in that position, otherwise 0.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0&lt;br&gt;
0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0&lt;br&gt;
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0&lt;br&gt;
0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This method of conversion is called &lt;strong&gt;one-hot encoding&lt;/strong&gt; (each sentence vector here is a binary bag of words built from the per-word encodings).&lt;/p&gt;
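
&lt;p&gt;A small sketch that reproduces the vectors above (words are lowercased, so "Today" and "today" count as the same word):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def binary_vectors(sentences):
    # Build the vocabulary of unique words in order of first appearance,
    # then mark 1 if the sentence contains the word, 0 otherwise.
    vocab = []
    for s in sentences:
        for w in s.lower().split():
            if w not in vocab:
                vocab.append(w)
    words_per_sentence = [set(s.lower().split()) for s in sentences]
    return vocab, [[1 if w in words else 0 for w in vocab]
                   for words in words_per_sentence]

sentences = [
    "Today is Wednesday",
    "Tomorrow is Thursday",
    "I am travelling today",
    "Wednesday is a nice series",
]
vocab, vectors = binary_vectors(sentences)
for v in vectors:
    print(v)   # e.g. [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
&lt;/code&gt;&lt;/pre&gt;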

&lt;p&gt;Now coming to RAG: based on the context it was trained on, the embedding model generates a point for each chunk. The generated point is multidimensional (x, y, z, a, ...). These points enable semantic search. What is semantic search? Meaning-based search: it tells us how closely two points are related to each other. For each chunk, a point is generated, and the model plots it based on its context, so related points appear together.&lt;/p&gt;

&lt;p&gt;A vector DB provides a place to store related points together, and when querying the data, it returns the related data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do we say that two points are close to each other?&lt;/strong&gt;&lt;br&gt;
When the distance between two points is small, we say they are close to each other. Closeness is relative, though: with only two points we can't say much, we need to bring in another point for comparison. To find the distance between points, there are several measures: Euclidean distance, cosine similarity, and Manhattan distance.&lt;/p&gt;

&lt;p&gt;Let's take cosine similarity and see how it works:&lt;br&gt;
Say three points (p1, p2, p3) are plotted on a graph, and a straight line is drawn from the origin to each of them. To compare p1 and p2 against p3, we note the angle each of their lines makes with the line to p3 and take the cosine of that angle. The smallest angle gives the largest cosine (closest to 1), and that point is the most similar to p3.&lt;/p&gt;
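
&lt;p&gt;A minimal sketch of cosine similarity in plain Python (the three points are hypothetical 2D coordinates for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def cosine_similarity(p, q):
    # cos(theta) = (p . q) / (|p| * |q|); a value close to 1 means
    # a small angle between the vectors, i.e. very similar points.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

p1, p2, p3 = [1.0, 0.2], [0.1, 1.0], [0.9, 0.3]
print(cosine_similarity(p1, p3))   # close to 1: small angle, nearest to p3
print(cosine_similarity(p2, p3))   # much smaller: large angle
&lt;/code&gt;&lt;/pre&gt;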

&lt;p&gt;Suppose there are 100 points. If I want to find the nearest points to a point named x, I need to calculate the distance from x to all the remaining points; only then can I arrive at the nearest ones. This brute-force approach is time consuming.&lt;/p&gt;

&lt;p&gt;So the pipeline for RAG is: data is given to an embedding model (e.g. nomic-embed-text), which generates a point (a mathematical representation of the data). This point is stored in a vector DB. Some examples of vector DBs are ChromaDB (general purpose), Pinecone, FAISS (fast similarity search), and Qdrant.&lt;/p&gt;

&lt;p&gt;If I ask a query, it is sent to the embedding model, which generates a query point; the vector DB then returns the stored points (say, the 5 nearest) that are closest to the query point. This is all about vector DBs.&lt;/p&gt;
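
&lt;p&gt;A minimal sketch of this pipeline using ChromaDB (here relying on Chroma's default built-in embedding function; in a real setup you would plug in a model such as nomic-embed-text):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import chromadb

client = chromadb.Client()
collection = client.create_collection(name="docs")

# Each chunk is embedded into a point and stored in the vector DB.
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Redis functions as a database, cache, message broker, and streaming engine.",
        "Redis keeps data in RAM rather than on disk.",
    ],
)

# The query is embedded into a point too, and the DB returns the
# stored chunks whose points lie nearest to the query point.
results = collection.query(query_texts=["How does Redis function?"], n_results=2)
print(results["documents"])
&lt;/code&gt;&lt;/pre&gt;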

</description>
      <category>ai</category>
      <category>database</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Day 1 - RAG</title>
      <dc:creator>Indumathi R</dc:creator>
      <pubDate>Mon, 04 May 2026 04:11:00 +0000</pubDate>
      <link>https://forem.com/indumathi_r_afd5683658092/day-1-rag-416l</link>
      <guid>https://forem.com/indumathi_r_afd5683658092/day-1-rag-416l</guid>
      <description>&lt;p&gt;RAG stands for Retrieval Augmented Generation. Why do we even need RAG? To answer this, let's take a look at what LLMs and SLMs are.&lt;/p&gt;

&lt;p&gt;LLM (Large Language Model): data across several categories (generalized) is given as input, and from that a model is created. What is a model? To understand this, let's take the mathematical equation of a straight line:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;y = mx + c&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Let's take the x values to be 1, 2, 3, 4, 5 and the y values to be 2, 4, 6, 8, 10. We can pick values for m and c that produce our desired y values (like 2, 4, etc.). Instead of a simple linear equation, we can also consider quadratic, cubic, or higher-order equations (with terms like x^2, x^3, ...). When we say a model has 4B or 120B parameters, it refers to one big equation: using the input data, a mathematical equation is created. The larger the equation, the better the result, i.e. if a model is exposed to and trained on a large amount of data, the generated results will be more relevant.&lt;/p&gt;
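
&lt;p&gt;A tiny illustration of the "model as an equation" idea: for this data, m = 2 and c = 0 fit perfectly.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The "model" here is just y = m*x + c with learned values for m and c.
def model(x, m=2, c=0):
    return m * x + c

for x in [1, 2, 3, 4, 5]:
    print(x, model(x))   # 1 2, 2 4, 3 6, 4 8, 5 10
&lt;/code&gt;&lt;/pre&gt;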

&lt;p&gt;LLMs predict the next word. If we give "hello", it may give "hello world". We can control how the output is generated by the LLM, for example more factual or more imaginative. This is determined by a factor used in LLMs called &lt;strong&gt;Temperature&lt;/strong&gt;. The lower the temperature, the more factual and predictable the output; the higher the temperature, the more imaginative it will be.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Temperature is meant for a single query&lt;/code&gt;&lt;/p&gt;
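
&lt;p&gt;Under the hood, temperature rescales the model's next-word scores before sampling. A sketch with hypothetical scores for three candidate words:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing the scores by the temperature before softmax:
    # a low temperature sharpens the distribution (predictable words),
    # a high temperature flattens it (more varied, imaginative words).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # hypothetical scores for 3 candidate words
print(softmax_with_temperature(logits, temperature=0.2))   # sharply peaked
print(softmax_with_temperature(logits, temperature=2.0))   # much flatter
&lt;/code&gt;&lt;/pre&gt;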

&lt;p&gt;SLM (Small Language Model)&lt;br&gt;
Instead of training a model on vast amounts of data across all categories, training it on data from a specific domain to solve a set of tasks from that domain (like speech-to-text) is referred to as a small language model.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Think of it like this, LLMs are generic and SLMs are specific&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If we ask the LLM a question based on the data it was trained on, we get a good result. But if we ask a question that is out of the scope of the training data, it will still try to answer, i.e. it makes up an answer on its own. This is called &lt;strong&gt;hallucination&lt;/strong&gt; (it won't say "I don't know" unless we explicitly prompt it to).&lt;/p&gt;

&lt;p&gt;Analogy: let's take the GPT-OSS model (released around 2025). If we ask the model now about the Iran-Israel war, it won't know about it, because the war falls outside the data it was trained on.&lt;/p&gt;

&lt;p&gt;In the same way, think about this: in our company, we have data stored in docs, wikis, etc. The models out there (Gemini, Claude) won't know about it. If we are somehow able to link an LLM with our private data, we can use that LLM internally in our company or for personal use. This is RAG: linking an LLM with our data and asking the LLM questions about that data.&lt;/p&gt;

&lt;p&gt;One approach to get an LLM to answer queries on private data is to train the LLM on that data. This is one way, but not the only way.&lt;/p&gt;

&lt;p&gt;Another way is uploading documents into a vector DB. Before going deeper, let's first ask: what is a vector? Something that has direction and magnitude. For our case, we won't be dealing with direction, only with magnitude.&lt;/p&gt;

&lt;p&gt;We will break the document into several chunks, convert them into points, and plot them on a graph. Let's just plot apple, orange, pear, and doctor as points on a graph. Which two points are relevant here? Apple and doctor ("an apple a day keeps the doctor away"). How relevant, and how do we find out? Two points are said to be close if the distance between them is small. (This is with respect to 2D; in practice it can go up to 700 dimensions or more.)&lt;/p&gt;
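
&lt;p&gt;A quick sketch of the distance check, with made-up 2D coordinates chosen so that apple and doctor land close together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def euclidean(p, q):
    # Straight-line distance between two points; the same formula
    # works whether the points have 2 dimensions or 700.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

apple, doctor, pear = [0.9, 0.8], [0.8, 0.7], [0.1, 0.9]   # hypothetical
print(euclidean(apple, doctor))   # small distance: closely related
print(euclidean(apple, pear))     # larger distance: less related
&lt;/code&gt;&lt;/pre&gt;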

&lt;p&gt;Why did we put doctor close to apple? Normally a sentence is broken into chunks, and these chunks are given to an embedding model, which generates points based on the context it was trained on. Points that end up close together are related to each other.&lt;/p&gt;

&lt;p&gt;In essence, our private document is broken down into several chunks. For each chunk, a point is generated and stored in a vector DB.&lt;/p&gt;

&lt;p&gt;Analogy: ANN (Approximate Nearest Neighbour) is one of the algorithms used in platforms like Spotify to find relevance between items and suggest related ones.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
