<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Soham Sharma</title>
    <description>The latest articles on Forem by Soham Sharma (@sohamactive).</description>
    <link>https://forem.com/sohamactive</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3410188%2F7172c825-50d0-469a-8c70-6e8975334e98.jpeg</url>
      <title>Forem: Soham Sharma</title>
      <link>https://forem.com/sohamactive</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sohamactive"/>
    <language>en</language>
    <item>
      <title>From Hallucinations to Grounded AI: Building a Gemini RAG System with Qdrant</title>
      <dc:creator>Soham Sharma</dc:creator>
      <pubDate>Wed, 04 Mar 2026 16:48:16 +0000</pubDate>
      <link>https://forem.com/sohamactive/from-hallucinations-to-grounded-ai-building-a-gemini-rag-system-with-qdrant-ae5</link>
      <guid>https://forem.com/sohamactive/from-hallucinations-to-grounded-ai-building-a-gemini-rag-system-with-qdrant-ae5</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/mlh-built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built with Google Gemini
&lt;/h2&gt;

&lt;p&gt;Large Language Models are powerful, but they still struggle with one major issue — hallucinations.&lt;/p&gt;

&lt;p&gt;While building AI assistants, I often found that models could generate answers that sounded convincing but were not actually grounded in real data. That led me to explore &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; and build a system that allows Gemini to answer questions using real documents instead of guesses.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Large Language Models are incredibly powerful, but they have a well-known limitation: they can generate answers that sound convincing but are actually wrong. This behavior is called &lt;strong&gt;AI hallucination&lt;/strong&gt;, where a model produces fluent text that is not grounded in real facts or evidence.&lt;/p&gt;

&lt;p&gt;These hallucinations don’t happen randomly — they usually occur because of structural limitations in how LLMs work.&lt;/p&gt;

&lt;p&gt;Some common causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited context window&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLMs can only “remember” a fixed number of tokens in a conversation. When the context becomes too long, earlier information may drop out of the window, causing the model to lose important instructions or details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long or complex documents&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When documents are very large, the model may struggle to reason over the entire content and can miss dependencies between different parts of the text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outdated training knowledge&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLMs rely on training data collected at a specific point in time. If new information appears after that, the model may provide answers based on stale or incomplete knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Probabilistic text generation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Language models generate responses by predicting the most likely next token rather than verifying facts, which can lead to confident but incorrect outputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1l2kcwq5uz3gqftck43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1l2kcwq5uz3gqftck43.png" alt="Illustration showing AI hallucination and developer frustration"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Retrieval-Augmented Generation (RAG)
&lt;/h3&gt;

&lt;p&gt;For applications like &lt;strong&gt;document search, knowledge assistants, or research tools&lt;/strong&gt;, these limitations become a serious problem. Users need answers that are grounded in real documents, not guesses.&lt;/p&gt;

&lt;p&gt;This challenge led me to explore &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; — a technique that helps language models answer questions using real data instead of relying only on what they remember.&lt;/p&gt;

&lt;p&gt;Instead of relying only on the model’s training data, a RAG system first retrieves relevant information from external documents and then uses that information as context when generating an answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66jn26ro5dwbhls8ct0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66jn26ro5dwbhls8ct0g.png" alt="Before vs After illustration showing LLM vs RAG system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea is simple: rather than asking the model to rely purely on memory, we give it access to the correct information at the moment it generates the response.&lt;/p&gt;

&lt;p&gt;By grounding responses in retrieved documents, RAG systems help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce hallucinations&lt;/li&gt;
&lt;li&gt;provide answers based on real data&lt;/li&gt;
&lt;li&gt;work with private or domain-specific knowledge&lt;/li&gt;
&lt;li&gt;keep information up-to-date without retraining the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsawafeht1qgaba73sgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsawafeht1qgaba73sgo.png" alt="Visual explanation of the RAG pipeline"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  System Architecture
&lt;/h3&gt;

&lt;p&gt;The system is built as a &lt;strong&gt;Retrieval-Augmented Generation (RAG) pipeline&lt;/strong&gt; using &lt;strong&gt;FastAPI&lt;/strong&gt;, &lt;strong&gt;Google Gemini (for both LLMs and embeddings)&lt;/strong&gt;, and &lt;strong&gt;Qdrant (as the vector database)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Users upload documents, which are processed into embeddings and stored in a vector database. When a query is made, relevant document chunks are retrieved and used as context for Gemini to generate a grounded response.&lt;/p&gt;




&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTML, CSS, JavaScript&lt;/td&gt;
&lt;td&gt;User interface for uploading PDFs and interacting with the chatbot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;Handles routes, request processing, and serves the frontend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini &lt;code&gt;gemini-embedding-001&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Converts queries and documents into vector embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;Stores embeddings and retrieves similar document chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Flash models&lt;/td&gt;
&lt;td&gt;Generates answers based on retrieved context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Backend Modules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Defines endpoints (&lt;code&gt;/upload&lt;/code&gt;, &lt;code&gt;/search&lt;/code&gt;, &lt;code&gt;/chat&lt;/code&gt;) and initializes services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ingest.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extracts PDF text, cleans it, and splits it into overlapping chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;embeddings.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generates embeddings for queries and document chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector DB Utility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qdrant_utils.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Handles Qdrant connection and collection initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;search.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Performs retrieval and generates answers using Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chat.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Implements conversational RAG with memory and streaming responses&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
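&lt;p&gt;The ingestion module's key trick is splitting text into &lt;em&gt;overlapping&lt;/em&gt; chunks, so that a sentence cut at a chunk boundary still appears whole in the neighboring chunk. A dependency-free sketch of the idea (the sizes here are illustrative assumptions, not the exact settings in &lt;code&gt;ingest.py&lt;/code&gt;):&lt;/p&gt;

```python
# Sketch of overlapping chunking, as done conceptually in ingest.py.
# chunk_size and overlap are illustrative, not the project's real values.
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into chunks of up to chunk_size characters,
    each sharing `overlap` characters with the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```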




&lt;h3&gt;
  
  
  Data Flow
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User uploads a PDF document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text is extracted and split into chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chunks are converted into embeddings using Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embeddings and text are stored in Qdrant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User submits a query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query is embedded and matched against stored vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Relevant document chunks are retrieved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini generates an answer using the retrieved context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
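&lt;p&gt;To make steps 5–7 concrete: vector search boils down to embedding the query and ranking stored chunks by similarity. In the real system Qdrant does this at scale over Gemini embeddings; the toy 3-dimensional vectors below are stand-ins just to show the mechanics:&lt;/p&gt;

```python
import math

# Toy illustration of steps 5-7: cosine-similarity top-k retrieval.
# The 3-dim vectors are stand-ins for real Gemini embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """store: list of (chunk_text, vector) pairs; returns best-matching texts."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k]]

store = [
    ("Qdrant stores embeddings.",    [0.9, 0.1, 0.0]),
    ("Gemini generates the answer.", [0.1, 0.9, 0.1]),
    ("PDFs are split into chunks.",  [0.0, 0.2, 0.9]),
]
# A query vector close to the first chunk's embedding retrieves it first.
context = top_k([0.85, 0.15, 0.05], store, k=2)
```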




&lt;h3&gt;
  
  
  Tech Arsenal
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Google Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Gemini &lt;code&gt;gemini-embedding-001&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Database&lt;/td&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;HTML, CSS, JavaScript&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  System Workflow
&lt;/h3&gt;

&lt;h5&gt;
  
  
  1) Overall RAG System Workflow
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11etxz7tureqe0lyuv94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11etxz7tureqe0lyuv94.png" alt="Overall RAG system workflow diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  2) Document Ingestion Pipeline
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6043ebd6jw77da7knb2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6043ebd6jw77da7knb2f.png" alt="Document ingestion pipeline diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  3) Query Retrieval Pipeline
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x6kv338p2ha8bqn233t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x6kv338p2ha8bqn233t.png" alt="Query retrieval pipeline diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  4) Conversational Chat Flow
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn88dczzgfu8t5nturcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn88dczzgfu8t5nturcb.png" alt="Conversational chat flow diagram"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;To make the system easier to explore, I built a simple web interface where users can upload documents and interact with the RAG system in real time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Interaction
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Note: the hosted demo may respond slowly or behave inconsistently because it is deployed on Render.&lt;/em&gt;&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://rag-miee.onrender.com/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;rag-miee.onrender.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;




&lt;h4&gt;
  
  
  Video
&lt;/h4&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/rNZZKdj1-MI"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h4&gt;
  
  
  GitHub Repo
&lt;/h4&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Sohamactive" rel="noopener noreferrer"&gt;
        Sohamactive
      &lt;/a&gt; / &lt;a href="https://github.com/Sohamactive/Rag-implementation-using-qdrant-gemini" rel="noopener noreferrer"&gt;
        Rag-implementation-using-qdrant-gemini
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Gemini-Qdrant RAG Backend&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;This is a production-ready RAG (Retrieval-Augmented Generation) backend built with FastAPI. It uses Google Gemini Embeddings for high-quality semantic search and Qdrant as the vector database.&lt;/p&gt;
&lt;p&gt;The pipeline is non-blocking, designed for concurrent requests, and optimized for indexing documents.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;⚙️ Setup and Installation&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;python -m venv .venv
.venv/scripts/activate
pip install -r requirements.txt
fastapi dev backend/main.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in the root of your project directory and set your access keys and database URL.&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;# Gemini API Key (Required for embedding and generation)
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Qdrant Vector Database Connection
QDRANT_URL="http://localhost:6333"
QDRANT_API_KEY="" # Use only if your Qdrant instance requires it
QDRANT_COLLECTION="rag_documents_768"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Future endeavors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[ ] Implementing GraphRAG, PageIndex, and other RAG variants&lt;/li&gt;
&lt;li&gt;[ ] Deploying a full-fledged app&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Sohamactive/Rag-implementation-using-qdrant-gemini" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This project helped me understand how &lt;strong&gt;RAG actually works in practice&lt;/strong&gt;, not just in theory. Building the pipeline made me see how document ingestion, chunking, embeddings, vector search, and LLM generation all connect together. 🧩🤖&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I also learned how &lt;strong&gt;embeddings convert text into vectors&lt;/strong&gt; and why they are essential for semantic search. 🔢🔍&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working with &lt;strong&gt;vector databases like Qdrant&lt;/strong&gt; helped me understand how similarity search and top-k retrieval power document-based AI systems. 🗂️⚡&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And honestly, &lt;strong&gt;working with documentation can sometimes be a headache&lt;/strong&gt; 🥲. A lot of the learning came from experimenting, debugging, and figuring things out along the way. 🛠️😇&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Future Endeavours ⛰️
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Adding an authentication and user management system to manage chats and create &lt;strong&gt;separate databases for each user&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding support for &lt;strong&gt;different database services like MongoDB&lt;/strong&gt;, both local and cloud-based.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using &lt;strong&gt;local LLMs&lt;/strong&gt; to minimize inference costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experimenting with &lt;strong&gt;advanced RAG architectures&lt;/strong&gt; such as &lt;a href="https://github.com/VectifyAI/PageIndex" rel="noopener noreferrer"&gt;PageIndex&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploying the system properly on &lt;strong&gt;cloud platforms like AWS, Azure, or GCP&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Google Gemini Feedback
&lt;/h2&gt;

&lt;h4&gt;
  
  
  LLM API — Smooth Experience 👍👍
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calling the Gemini generative models was straightforward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;google-genai&lt;/code&gt; SDK has a clean interface. Methods like:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.models.generate_content()
client.models.embed_content()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;feel intuitive, and getting a working prototype running was quick.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The models themselves performed well for RAG use cases. Once retrieved document chunks were provided as context, Gemini produced grounded responses and streaming worked reliably in conversational flows.&lt;/li&gt;
&lt;/ul&gt;
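&lt;p&gt;For reference, a grounded call looks roughly like this. The prompt template and model name are my assumptions rather than the project's exact code, and the API call only runs when a key is present in the environment:&lt;/p&gt;

```python
import os

# Sketch of grounded generation: retrieved chunks are stuffed into the
# prompt so Gemini answers from evidence. The prompt wording below is an
# assumption, not the project's actual template.
def build_grounded_prompt(chunks, question):
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    ["Qdrant stores vector embeddings."], "What does Qdrant store?"
)

if os.environ.get("GEMINI_API_KEY"):  # only call the API when a key exists
    from google import genai
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # assumed Flash model id
        contents=prompt,
    )
    print(response.text)
```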




&lt;h4&gt;
  
  
  Embeddings API — Documentation Could Be Clearer
&lt;/h4&gt;

&lt;p&gt;The embeddings API worked well overall, but the documentation could be easier to navigate.&lt;/p&gt;

&lt;p&gt;While integrating &lt;code&gt;gemini-embedding-001&lt;/code&gt;, some configuration details were not immediately obvious or well explained, especially parameters like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task_type = "RETRIEVAL_QUERY"
task_type = "RETRIEVAL_DOCUMENT"
output_dimensionality
batch_size_limits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because of that, I ended up relying on &lt;strong&gt;trial and error&lt;/strong&gt; (and occasionally AI coding tools) to determine the correct request structure.&lt;/p&gt;

&lt;p&gt;The API itself works well, but a &lt;strong&gt;more consolidated explanation of embedding configuration&lt;/strong&gt; would make the developer experience even smoother.&lt;/p&gt;
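&lt;p&gt;For anyone hitting the same wall, here is a sketch of the request shape that ties those scattered parameters together, assuming the &lt;code&gt;google-genai&lt;/code&gt; SDK. The &lt;code&gt;task_type_for&lt;/code&gt; helper is my own convenience, and the API call runs only when a key is set:&lt;/p&gt;

```python
import os

# Sketch of the embedding request shape: task_type distinguishes queries
# from documents, and output_dimensionality controls vector size
# (768 here, matching a 768-dim Qdrant collection).
def task_type_for(role):
    """Map a role ("query" or "document") to the Gemini task_type string."""
    return {"query": "RETRIEVAL_QUERY", "document": "RETRIEVAL_DOCUMENT"}[role]

if os.environ.get("GEMINI_API_KEY"):  # only call the API when a key exists
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=["What is RAG?"],
        config=types.EmbedContentConfig(
            task_type=task_type_for("query"),
            output_dimensionality=768,
        ),
    )
    vector = result.embeddings[0].values  # list of 768 floats
```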




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp30qxa811hgxmdt6q01w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp30qxa811hgxmdt6q01w.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>geminireflections</category>
      <category>gemini</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
