Forem: Ramya Perumal

RAG - Dense Embedding

Ramya Perumal — Wed, 20 May 2026 03:03:30 +0000

Dense means continuous.

When text is converted into a numerical representation called a vector (point) that contains continuous values, it is called a dense embedding.

Unlike sparse vectors, where many values are zero, dense vectors contain meaningful numerical values across most dimensions.

Example

A dense vector may look like:
[0.123, -0.456, 0.789, 0.245, ...]

Multi-Dimensional Representation

Each vector is represented in an n-dimensional space.
This means:

Every value in the vector represents one dimension
Each dimension contains some numerical value other than zero
Similar meanings are stored closer together in vector space

All vectors are stored in a mathematical space called latent space.

Words or sentences with similar meanings are usually positioned closer together inside this latent space.

How Dense Embeddings are Generated

To convert text into vectors, we can use:

Embedding Models
Examples:

nomic-embed-text
BGE (Beijing Academy of Artificial Intelligence General Embedding) models

Transformer Models
Examples:

all-MiniLM-L6-v2
Nomic Transformer

These models are commonly available through:

Hugging Face
Ollama

Relationship Between LLMs and Transformers

LLMs internally use transformer architecture.

A transformer mainly contains two parts:

Encoder
Decoder

Encoder
The encoder converts text into embeddings (vectors).

Decoder
The decoder processes embeddings and generates human-readable text.

In embedding models, the encoder part is mainly used to generate vector representations.

Methods to Generate Embeddings

Embeddings can be generated in two ways:

1. Using Dedicated Embedding Models

These models are specifically trained for embedding generation.

Examples

nomic-embed-text
BGE models

This is the most common and efficient approach in RAG systems.

2. Using General LLMs Through Prompting

A general-purpose LLM can also generate embeddings by giving prompts that instruct the model to convert text into vector representations.

This approach is sometimes used in vectorless RAG systems.

Disadvantage
Higher computational cost
Slower performance
More token consumption

Measuring Embedding and Retrieval Accuracy

To measure retrieval accuracy effectively, unit tests should be written for the RAG pipeline.

The test cases should include:

Expected inputs
Expected outputs
Different query scenarios
Edge cases
Semantic similarity checks

This helps evaluate how accurately the embedding model retrieves relevant information.

Similarity Methods Used in Dense Embeddings

Dense embeddings commonly use one of the following similarity measurement methods:

Cosine Similarity

This is the most commonly used similarity method in RAG applications.

It measures the angle between vectors rather than physical distance.

If the vectors point in similar directions, the similarity score becomes higher.

Euclidean Distance

Measures the straight-line distance between vectors in vector space.

Dot Product

Measures similarity by multiplying corresponding vector values and summing them.

Why the Same Embedding Model Must Be Used

The same embedding model should be used for both:

Data ingestion phase
Retrieval phase

If different embedding models are used, the generated vectors may exist in completely different latent spaces or vector distributions.

As a result:

Similarity calculations become inaccurate
Retrieval quality decreases
Relevant chunks may not be retrieved correctly

Using the same embedding model ensures that both stored documents and user queries are represented consistently in the same vector space.

Sparse Embeddings

Sparse embeddings use TF-IDF and BM25 mechanisms for retrieval.

In sparse embeddings, vectors are generated mainly based on keyword frequency and importance rather than semantic meaning.

The combination of BM25 and vector search is called hybrid search.

Tools such as OpenSearch and Elasticsearch support hybrid search by combining:

Traditional keyword-based retrieval
Semantic vector-based retrieval

Similar to one-hot encoding, sparse embeddings generate vectors based on text frequency. Most values in the vector remain 0, while only important terms receive higher numerical values.

Example

[3.91, 0, 0, 1.62]

In this representation:

Higher values indicate more important or frequently occurring terms
Zero values indicate terms that are absent or not important in the document

Sparse embeddings mainly focus on exact keyword matching and are highly effective for traditional search use cases.

RAG- Understanding of Embedding

Ramya Perumal — Sun, 17 May 2026 22:43:09 +0000

What is Embedding?

After text is split into chunks, the next process is called embedding. In this step, each chunk is converted into vectors (points in vector space). In vector-based RAG systems, chunks are converted into vectors so that semantic search can be performed efficiently.

Why Do We Need to Convert Chunks into Vectors?

The main goal of a RAG application is to achieve semantic search.

Semantic
Example
The word feline is related to the cat family, even though the words are different. Understanding that “feline” and “cat” are related is called semantic understanding.

Similarity
When a user asks a query, semantically related chunks are returned even though the exact words in the chunks may be different.

Semantic Similarity
Semantic similarity combines:

Intent
Context
Meaning

The purpose is to establish relationships between the user query and the documents stored in the RAG system. This allows the system to retrieve relevant information from the database and provide it to the LLM for further processing.

Words that are semantically related are usually stored closer together in multi-dimensional vector space.

Cosine Similarity
To determine how close vectors are to each other, cosine similarity is commonly used.

When a user query arrives:

The query is converted into a vector
Cosine similarity is calculated between the query vector and stored vectors
The closest vectors are retrieved

Retrieval Methodologies

Two major retrieval methodologies are used:

1. KNN (K-Nearest Neighbors)
KNN compares the query vector with all stored vectors one by one to find the nearest neighbors.

Advantage
More accurate retrieval

Disadvantage
Slow for very large datasets

2. ANN (Approximate Nearest Neighbors)
ANN approximately finds the nearest vectors instead of comparing every single point.

This method is mainly used when:

The document volume is huge
Faster retrieval is required
Time constraints exist

ANN improves retrieval speed while sacrificing a small amount of accuracy.

Why Cosine Similarity Instead of Sine or Tangent?

Cosine similarity works effectively because:
If two vectors are very close and highly related, the cosine similarity value approaches 1. If the angle between vectors increases, the cosine similarity value decreases, meaning the vectors are less related

Why Not Sine or Tangent?
For small angles:

Sine values remain close to 0
Tangent values can fluctuate significantly

These measurements are not stable for semantic comparison. Cosine similarity provides a more reliable way to measure semantic closeness between vectors.

Embedding Dimensions
Embedding models can generate vectors with dimensions ranging from 256 to 3000 or more.

The dimension size depends on the embedding model and the amount of contextual information it captures.

Generally:

Higher dimensions capture richer semantic information
Lower dimensions are faster and cheaper but may lose context

Types of Embedding Models
Choosing an embedding model completely depends on the application scenario.

1. Based on Query Type

Symmetric Models
Symmetric embedding models are used when the query and the documents are similar in structure and length.

Examples
nomic-embed-text
Qwen embeddings

These are commonly used in semantic search systems.

Asymmetric Models
Asymmetric embedding models are used when:

Queries are short
Documents are long

Example
Google Gemini embedding models

These models are optimized for retrieving long documents from short queries.

2. Based on Retrieval Type

Dense Embeddings
Dense embeddings mainly focus on semantic meaning.

These embeddings generate dense vectors where most values contain meaningful information.

Examples
Cohere embedding models
ChatGPT OSS 120B embeddings

Advantage
Better semantic understanding

Sparse Embeddings
Sparse embeddings mainly focus on exact keyword matching.

They commonly use the BM25 (Best Match 25) algorithm, which is based on:

TF (Term Frequency)
IDF (Inverse Document Frequency)

TF-IDF Concepts

TF (Term Frequency)
Measures how many times a word appears in a document.

IDF (Inverse Document Frequency)
Measures how important a word is across the entire document collection.Words that appear too frequently across all documents are considered less important.

Transformer Architecture

The transformer architecture was a major breakthrough for LLMs.
Transformers mainly contain:

Encoder
Decoder

Encoder
The encoder converts text into embeddings (vectors).

Decoder
The decoder converts embeddings back into human-readable text after processing.

This architecture enables modern LLMs to understand and generate natural language effectively.

Choosing a Vector Database

Chroma
Open source
Easy to set up
Suitable for basic and small-scale applications

FAISS
Better for large document collections
Optimized for high-performance semantic search
Commonly used in production-scale retrieval systems

RAG - Sliding Window, Token Based Chunking and PDF Chunking Packages

Ramya Perumal — Thu, 14 May 2026 23:25:36 +0000

Sliding Window Chunking

Sliding Window Chunking is a more intensive chunking mechanism.

In this method, a window size is defined based on a character or token limit. Instead of creating completely separate chunks, the window moves forward gradually while keeping part of the previous content.

The character or token limit is called the window size
The amount the window moves forward each time is called the step size

This is a stricter form of overlapping chunking.

How it Works

Suppose:

Window size = 500 characters
Step size = 100 characters

The first chunk may contain characters 1–500.
The second chunk starts after moving 100 characters and may contain characters 101–600.

Because of this overlap, related information is repeatedly included across chunks.

Benefits

The major benefit of this method is that semantically related points are stored closer together in the vector database, almost like clusters. This improves retrieval in scenarios where context changes frequently.

Disadvantages
Problem 1: Higher Token Consumption

Since overlapping data is repeatedly embedded, the embedding model consumes more tokens. This increases computational cost unless local embedding models are used.

Problem 2: Duplicate Retrieval

Because related chunks are stored very close together, the LLM may retrieve multiple duplicate or nearly identical chunks instead of retrieving different contexts.

As a result:

Context diversity decreases
Token usage increases
Retrieval efficiency may reduce

Where Sliding Window Chunking is Useful

Sliding window chunking is useful in scenarios where context switching happens frequently.

Example: Source Code

In coding-related datasets:

Different parts of the code may not be directly related
One service or module may trigger another service indirectly
Important context may exist across multiple sections

For example, in microservices architecture:

One service event may trigger another service
Related logic may exist in different files or services

Sliding window chunking helps preserve such relationships, even though it comes with higher token consumption.

Token Based Chunking

Token-based chunking mainly focuses on cost optimization and model limitations.

LLMs process text as tokens rather than words.

Depending on the tokenizer and model:

One word may become a single token
One word may become multiple tokens
Sometimes even a single character can become a token

Since models have token input limits, token-based chunking is used to ensure the content stays within the allowed token size.

In this method:

Text is split based on token count
Chunks are converted into embedding vectors
Vectors are stored in the vector database

This method is mainly used when working with strict token constraints.

TOON (Token-Oriented Object Notation)

TOON stands for Token-Oriented Object Notation.

It is an alternative representation format designed to reduce token usage compared to JSON.

JSON is human-readable, but repeated keys increase token consumption.

Repeated structures and keys increase token usage.

TOON reduces repeated keys and represents the same information in a more token-efficient format.

The purpose is to reduce embedding and inference cost while preserving context.

LLMLingua

LLMLingua is a framework used for prompt compression.

It converts user queries or prompts into simplified versions while preserving the original meaning and context.

The main goal is:

Reduce token consumption
Lower inference cost
Improve efficiency

However, aggressive compression may sometimes reduce retrieval quality compared to the original JSON or text structure.

Summary of Chunking Methods

The commonly used chunking methods are:

Fixed Chunking
Overlapping Chunking
Semantic Chunking
Embedding-Based Chunking
Sliding Window Chunking
Token-Based Chunking

These methods represent different approaches and trade-offs.

In real-world applications, multiple chunking methods are often combined depending on:

Dataset type
Retrieval quality
Cost constraints
Token limitations
Application requirements

There is no single chunking strategy that works best for every dataset.

PDF Reading in RAG Systems

To process documents such as company internal communication files, PDFs must first be converted into readable text.

Several libraries are commonly used for this purpose under LangChain Framework:

PyPDFLoader
PyPDF
PyMuPDF

Different document types require different processing approaches. A single package may not work effectively for all document formats.

Challenges in PDF Processing

Documents may contain:

Scanned images
Multi-column layouts
Tables
Handwritten text
Two-sided scanned pages

Because of this, preprocessing becomes an important step.

Tools Used in PDF Processing
Camelot
Camelot is commonly used to extract table content from PDFs.

Tesseract
Tesseract or computer vision models are used to convert scanned images into readable text documents.

Final RAG Flow for Documents

Raw documents are collected
Images, tables, and scanned content are converted into text
Data is cleaned
Documents are split into chunks using chunking methods
Chunks are converted into embedding vectors
Vectors are stored in the vector database for retrieval

RAG - Chunking

Ramya Perumal — Mon, 11 May 2026 03:16:49 +0000

What is chunking

Chunking is the process of breaking data into smaller pieces called chunks. Chunking happens before the data is fed into an embedding model, which converts each chunk into a vector (point) and stores the converted vectors in a vector database.

Why chunking Matters in RAG

Data can contain different types of context while still relating to the same topic.

From the above example, we may have a paragraph related to the Redis database that contains multiple contexts. An embedding model like nomic-embed-text converts the entire paragraph into a single vector point and stores it in the database.

This is where chunking plays a major role. Proper chunking helps retrieve only the most relevant information and avoids unrelated content.

For example, if a chunk contains information about both Python and Java, a query about Python may also retrieve Java-related information because both topics exist in the same chunk. Effective chunking helps prevent unrelated data from being retrieved.

Even an entire document can be stored as a single chunk. However, the purpose of chunking is to split the data into smaller meaningful sections so that only relevant data is retrieved for the user query while avoiding irrelevant information.

Chunking Method(Discrete way - formula methodology)

Fixed Chunking

Fixed chunking is the most common chunking method. In this approach, a fixed character or token limit is assigned to every chunk.

There is no single best chunking strategy for all datasets. Choosing the right chunk size usually requires a trial-and-error approach.

Disadvantage
A chunk may break in the middle of a sentence, resulting in incomplete context. This can reduce retrieval quality and may lead to irrelevant results.

Solution
One way to overcome this issue is to allow the chunk to continue until the sentence ends by checking for punctuation such as "." or spaces.

Overlapping chunking

In some cases, related information may be stored far apart in vector space due to the embedding model’s understanding. As a result, the LLM may miss relevant information during retrieval.

To overcome this issue, overlapping chunking is used.

In overlapping chunking, each chunk includes a portion of the previous chunk’s ending content. This helps the embedding model place related chunks closer together in the vector database.

The purpose of overlapping is to improve retrieval by making semantically related chunks easier to find.

Disadvantage

There is a possibility that irrelevant information may also be retrieved because of the overlap.

Example

Suppose:

Paragraph 1 is related to Topic A
Paragraph 2 is related to Topic B

If overlapping is applied, a query about Topic B may also retrieve some information from Topic A because part of Paragraph 1 overlaps with Paragraph 2.

In such scenarios, storing these chunks closer together may not be necessary. This is where semantic chunking becomes useful.

Semantic Chunking

Another scenario is when two paragraphs discuss the same topic but are not strongly related to each other. Normally, these paragraphs may still be stored nearby in the vector database. In such cases, overlapping chunking may not be necessary.

Semantic chunking solves this problem by grouping content based on meaning rather than fixed size.

In this method, each sentence is compared with the previous chunk using a similarity threshold value.

If the similarity score is below the threshold value, the sentence becomes a separate chunk.
If the similarity score is above the threshold value, it is added to the current chunk.

Libraries such as NLTK can be used to implement semantic chunking. The threshold value is configurable based on the use case.

Embedded Chunking

In embedding-based chunking, embedding models are used instead of libraries like NLTK.

This method works by calculating cosine similarity between sentences and grouping semantically similar sentences into chunks.

Advantage
Better semantic understanding
More accurate chunk boundaries

Disadvantage
Higher computational cost
Additional embedding model usage cost

Choosing the Right Chunking Method

Choosing a chunking method always involves trade-offs. There is no single chunking strategy that works for all datasets.

The best chunking method depends on:

Dataset type
Cost
Time
Retrieval accuracy requirements
Embedding model behavior

Different applications may require different chunking strategies to achieve the best RAG performance.

RAG - Vector DB

Ramya Perumal — Thu, 30 Apr 2026 17:56:45 +0000

What is a Vector Database?

A vector database is a database used to store vectors (points in space) where data with similar meanings are positioned close together. These vectors are generated using embedding models or LLM embedding models. One of the embedding models is nomic-embed-text. We can download this model using Ollama.

Why Vector DB in RAG?

One-hot encoding is a technique used to convert categorical data (like words) into binary vectors.

How it works:

Each unique word in a vocabulary is mapped to a vector that is mostly zeros except for a single 1 at a specific index.

Example:

Today is Wednesday
Tomorrow is Thursday
I am travelling Today
Wednesday is a nice series

Vocabulary values:

[Today, is, Wednesday, Tomorrow, Thursday, I, am, Travelling, a, nice, series]

Vector representation:

Line 1 = [1,1,1,0,0,0,0,0,0,0,0]
Line 2 = [0,1,0,1,1,0,0,0,0,0,0]
Line 3 = [1,0,0,0,0,1,1,1,0,0,0]
Line 4 = [0,1,1,0,0,0,0,0,1,1,1]

Disadvantages:
No semantic meaning
High dimensionality
Not scalable

Because of these limitations, modern RAG systems use vector databases where chunks are converted into vectors in a high-dimensional space, where similar meanings are positioned close together.

How Data is Stored In a vector DB:

Documents will be split into chunks. Each chunk will be converted into a vector using an embedding model. The resulting vector will be stored in the vector DB. Chunks with similar semantic meaning are stored closer together in vector space.

Similarity Search

When a user query arrives, the LLM will search for the vectors that are closest to the user query by distance.

To calculate the distance, we can use:
Euclidean Distance (based on the Pythagorean theorem)
Manhattan method
Cosine similarity (finds the smaller angle to the user vector)

Calculating similarity against every vector becomes computationally expensive. For that, we use ANN and KNN algorithms.

Popular Vector DBs

Some of the popular vector databases are:
Chroma
FAISS
Pinecone
Qdrant – commonly used for embeddings, semantic search, and image similarity search.
MongoDB – It also has vector database support

End-to-End Flow

Data Ingestion
Data or documents will be split into chunks.
Each chunk will be converted into vectors using embedding models
Stored in the vector DB.

Data Retrieval
User query will be converted into a vector using an embedding model. Semantically related vectors will be obtained using search algorithms in the vector DB. Along with the user query, the retrieved chunks are provided to the LLM as context to get output in human-readable format.

Introduction to RAG

Ramya Perumal — Mon, 27 Apr 2026 19:45:04 +0000

Title: 40-days training on RAG(Day 1)

RAG is Retrieval-Augmented Generation.

What is a Model?
A model is nothing but an equation.
Example:
y=mx+c
During training, values of x and y will be provided. The model has to find the appropriate values of m and c and try to make a line that best fits the graph. The values of m and c may vary depending on the use case.

What is a Parameter?
A parameter is nothing but a variable that is learned during training.
In the above equation:
m is a parameter
c is a parameter

If the number of parameters is more, the model can learn more complex patterns.

What is Temperature
Temperature controls the model's creativity. It usually ranges from 0 to 1.
Lower temperature gives more factual answers.
Higher temperature gives more imaginative answers.

Temperature is passed along with the prompt input.
Usually, it is kept around 0.5 for balanced output.

SLM
SLM stands for Small Language Model.
It usually has fewer billion parameters and is trained for a particular domain or specific tasks.
Training cost can still be high, similar to LLMs, depending on the use case.
Example: smallest ai - provides voice-based smaller AI models.

LLM
LLM stands for Large Language Model.
It usually has billions of parameters and contains knowledge from many domains. It is called a generalized model.
Example: gpt-oss-120b.

How LLM Works
The primary functionality of an LLM is to predict the next word correctly.
It generates text by predicting one word after another based on previous words.
Sometimes LLMs generate incorrect information confidently. This is called hallucination.
Example:
If the model knows about cats and dogs but has limited knowledge about lions, it may generate irrelevant or incorrect content.
Hallucination can be reduced by writing proper prompts and providing correct context.

What is RAG?
RAG stands for Retrieval-Augmented Generation.

It is a method used to provide private or external knowledge such as:
Company policies
HR policy documents
Internal business documents

This information is given to the LLM so it can generate human-readable answers based on that content.

Where is Private Data Stored?
Private data is usually stored in a database called a Vector Database.

How Documents are Stored
Documents are split into smaller parts called chunks.
These chunks are converted into numerical vectors and stored in the vector database.

To search relevant chunks quickly, algorithms like:

ANN (Approximate Nearest Neighbors)
KNN (K-Nearest Neighbors)
are commonly used.

These kind of algorithm used to find next suggestion in spotify app , amazon etc..

Thank you Syed Jafer for conducting this wonderful course.