<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aayush Mishra</title>
    <description>The latest articles on Forem by Aayush Mishra (@the_aayush_mishra).</description>
    <link>https://forem.com/the_aayush_mishra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3477586%2Fcfb3aa69-46ee-413f-9006-289c12936fc3.JPG</url>
      <title>Forem: Aayush Mishra</title>
      <link>https://forem.com/the_aayush_mishra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/the_aayush_mishra"/>
    <language>en</language>
    <item>
      <title>Building a Local Documentation Chatbot with Python, FAISS, and OpenAI</title>
      <dc:creator>Aayush Mishra</dc:creator>
      <pubDate>Mon, 08 Sep 2025 18:12:19 +0000</pubDate>
      <link>https://forem.com/the_aayush_mishra/building-a-local-documentation-chatbot-with-python-faiss-and-openai-55oc</link>
      <guid>https://forem.com/the_aayush_mishra/building-a-local-documentation-chatbot-with-python-faiss-and-openai-55oc</guid>
      <description>&lt;p&gt;We’ve all faced this situation: you’re buried in a &lt;strong&gt;massive wall of documentation&lt;/strong&gt;, desperately trying to track down one tiny use case. It feels exactly like hunting for &lt;em&gt;a needle in a haystack&lt;/em&gt;.  &lt;/p&gt;

&lt;p&gt;So, how do we make this easier?  &lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; powered by &lt;strong&gt;LLMs&lt;/strong&gt; comes in. Instead of endlessly scrolling, you can &lt;strong&gt;query the docs in plain English&lt;/strong&gt; and get precise, contextual answers back—just like chatting with a subject-matter expert.  &lt;/p&gt;

&lt;p&gt;And here’s the best part:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don’t have to stick to OpenAI’s SDK.
&lt;/li&gt;
&lt;li&gt;You can even spin this up with an &lt;strong&gt;on-prem LLM&lt;/strong&gt; for full control and data privacy.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re curious about setting up an on-prem LLM with &lt;strong&gt;Ollama&lt;/strong&gt;, check out this detailed guide: &lt;a href="https://dev.to/the_aayush_mishra/setting-up-rag-locally-with-ollama-a-beginner-friendly-guide-428m"&gt;Setting up RAG locally with Ollama – A Beginner-Friendly Guide&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;In this post, I’ll walk you through how I built a &lt;strong&gt;local RAG-powered chatbot&lt;/strong&gt; on top of the &lt;a href="https://docs.python.org/3/using/cmdline.html" rel="noopener noreferrer"&gt;Python Command Line Documentation&lt;/a&gt;. By the end, you’ll see how documentation can transform from a static reference into an &lt;strong&gt;interactive, intelligent assistant&lt;/strong&gt;. &lt;/p&gt;




&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;At a glance, here’s the &lt;strong&gt;end-to-end flow&lt;/strong&gt; we’ll be building:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scrape Documentation&lt;/strong&gt; — Extract raw text directly from the official Python docs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk the Text&lt;/strong&gt; — Split large documents into smaller, context-friendly segments.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate Embeddings&lt;/strong&gt; — Convert text chunks into vector representations using an embedding model (e.g., OpenAI or any on-prem alternative).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store in FAISS&lt;/strong&gt; — Persist embeddings locally in a FAISS vector store for efficient semantic search.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG Pipeline&lt;/strong&gt; — Retrieve the most relevant chunks and feed them into the LLM to generate accurate, context-aware answers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expose API with FastAPI&lt;/strong&gt; — Wrap everything into a lightweight API so the system can be queried just like a chatbot.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjilqou12fevbvfi22rne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjilqou12fevbvfi22rne.png" alt="Architecture Diagram" width="800" height="600"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It’s always a good practice to &lt;strong&gt;pen down your idea and flow first&lt;/strong&gt; before jumping into implementation.&lt;br&gt;&lt;br&gt;
A clear sketch or outline helps avoid confusion later and keeps your architecture consistent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Let’s Implement the Process
&lt;/h2&gt;

&lt;p&gt;Now that we understand the big picture, let’s roll up our sleeves and implement the chatbot step by step.&lt;br&gt;
We’ll start with scraping the documentation content from the official Python website.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you start, make sure you have &lt;strong&gt;Python 3.10+&lt;/strong&gt; installed and the following libraries available.&lt;br&gt;&lt;br&gt;
You can install them using &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core libraries&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai faiss-cpu numpy  &lt;span class="c"&gt;# pickle ships with the standard library&lt;/span&gt;

&lt;span class="c"&gt;# Web scraping (if using requests/BeautifulSoup)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;requests beautifulsoup4 lxml

&lt;span class="c"&gt;# FastAPI for serving the chatbot API&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;faiss-cpu&lt;/em&gt; can be swapped for the GPU-enabled FAISS package if you have compatible hardware; see the &lt;a href="https://faiss.ai/index.html" rel="noopener noreferrer"&gt;FAISS documentation&lt;/a&gt; for details.
&lt;/li&gt;
&lt;/ul&gt;
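
&lt;p&gt;One more prerequisite: the OpenAI client we’ll instantiate later reads your API key from the environment, so export it before running any of the scripts (replace the placeholder with your own key):&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;export OPENAI_API_KEY="sk-..."
&lt;/code&gt;&lt;/pre&gt;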




&lt;h3&gt;
  
  
  1. Scraping the Content
&lt;/h3&gt;

&lt;p&gt;The Python documentation is written in HTML and hosted online.&lt;br&gt;
To use it in our chatbot, we need to extract the text from the web page so we can later process it into embeddings.&lt;/p&gt;

&lt;p&gt;We’ll use two handy libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;requests&lt;/strong&gt; : Fetch the webpage HTML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BeautifulSoup&lt;/strong&gt; : Parse the HTML and extract only the meaningful text.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="c1"&gt;# URL of the Python command-line documentation page
&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.python.org/3/using/cmdline.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_cmdline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Scrape the Python command line documentation page.

    Steps:
    1. Make a GET request to the documentation URL.
    2. Parse the HTML content using BeautifulSoup.
    3. Extract only the main section of the page (role=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;),
       which contains the actual documentation (excluding headers, nav, footers).
    4. Return the extracted text as a clean string.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Fetch the HTML page
&lt;/span&gt;    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the main content div
&lt;/span&gt;    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Preview the first 500 characters of the scraped content : Best practice
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_cmdline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What’s happening here?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;We request the Python docs page using &lt;code&gt;requests.get()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;BeautifulSoup parses the HTML so we can search inside it.&lt;/li&gt;
&lt;li&gt;We specifically grab the &lt;code&gt;div&lt;/code&gt; with &lt;code&gt;role="main"&lt;/code&gt;, because that’s where the actual documentation content lives. This avoids scraping menus, sidebars, or footers.
&lt;/li&gt;
&lt;li&gt;We return the cleaned text, and in the &lt;strong&gt;main&lt;/strong&gt; block we print a preview of the first 500 characters to confirm it’s working. This is a good testing practice that keeps modules somewhat independent of one another (more on the development pattern later).&lt;/li&gt;
&lt;/ul&gt;



&lt;h3&gt;
  
  
  1.5. Normalizing the Text
&lt;/h3&gt;

&lt;p&gt;Before we move to chunking, it’s useful to normalize the content a bit.&lt;br&gt;&lt;br&gt;
Documentation often contains extra line breaks, inconsistent spacing, or formatting artifacts.&lt;br&gt;&lt;br&gt;
By collapsing whitespace and trimming redundant line breaks, we ensure cleaner input for chunking and embeddings.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Normalize scrapped documentation text.
    - Collapse multiple spaces/newlines
    - Strip leading/trailing whitespace
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# collapse multiple spaces/newlines
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
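
&lt;p&gt;If you adopt this step, a natural place to wire it in is right after scraping. A minimal sketch, assuming both functions live in the same module:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;text = normalize_text(scrape_cmdline())
&lt;/code&gt;&lt;/pre&gt;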



&lt;h3&gt;
  
  
  2. Chunking the Scraped Content
&lt;/h3&gt;

&lt;p&gt;Large documents are too big to embed in one shot, so we’ll split them into chunks of roughly 800 characters.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Split text into smaller chunks of roughly `chunk_size`.
    This helps embeddings capture semantic meaning without hitting token limits.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_len&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Here’s what this function does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses a default chunk size of 800 characters to balance context and token efficiency&lt;/li&gt;
&lt;li&gt;Splits the text into lines first, then aggregates lines into chunks&lt;/li&gt;
&lt;li&gt;Closes a chunk once the size limit is reached, so no segment grows unbounded&lt;/li&gt;
&lt;li&gt;Preserves semantic continuity by joining the collected lines back together&lt;/li&gt;
&lt;/ul&gt;
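
&lt;p&gt;As a quick sanity check, you can feed &lt;code&gt;chunk_text&lt;/code&gt; some synthetic multi-line text (the sample below is just an illustration) and confirm the chunk sizes:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;# Build a throwaway document of 100 short lines (~3,000 characters total)
sample = "\n".join(f"line {i}: some documentation text" for i in range(100))

chunks = chunk_text(sample, chunk_size=800)

# Each chunk closes at the first line boundary past ~800 characters,
# so lengths land slightly above the target.
print(len(chunks), [len(c) for c in chunks])
&lt;/code&gt;&lt;/pre&gt;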




&lt;h3&gt;
  
  
  3. Generate Embeddings &amp;amp; Build FAISS Index
&lt;/h3&gt;

&lt;p&gt;Now that we have clean, chunked text, the next step is to make it &lt;strong&gt;searchable by meaning&lt;/strong&gt; rather than just raw keywords.&lt;br&gt;&lt;br&gt;
To achieve this, we’ll:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generate vector embeddings&lt;/strong&gt; — Each chunk of text is transformed into a high-dimensional vector using OpenAI’s &lt;code&gt;text-embedding-3-small&lt;/code&gt; model.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a FAISS index&lt;/strong&gt; — These vectors are stored in a FAISS index, enabling &lt;strong&gt;fast semantic similarity search&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persist artifacts&lt;/strong&gt; — Both the document chunks and FAISS index are saved locally so we don’t need to recompute them every run.
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SCRAP_SCRIPT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrape_cmdline&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;CHUNK_SCRIPT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize OpenAI client
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Scrape documentation
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_cmdline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Chunk documentation
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate embeddings for each chunk
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Convert embeddings into NumPy array (FAISS requires float32)
&lt;/span&gt;&lt;span class="n"&gt;emb_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create and populate FAISS index
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb_np&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save both docs and index locally
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmdline_docs.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmdline.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedding index built and stored locally!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
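
&lt;p&gt;One optimization worth knowing about: the list comprehension above issues one API call per chunk. The embeddings endpoint also accepts a list of inputs, so you can embed every chunk in a single request (order is preserved in &lt;code&gt;response.data&lt;/code&gt;; mind the per-request input limits for very large corpora):&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;# Batch variant: one request for all chunks instead of one per chunk
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=docs,
)
embeddings = [item.embedding for item in resp.data]
&lt;/code&gt;&lt;/pre&gt;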

&lt;h4&gt;
  
  
  Why FAISS?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for similarity search: handles millions of embeddings efficiently.&lt;/li&gt;
&lt;li&gt;Local &amp;amp; lightweight: no external dependency once the index is built.&lt;/li&gt;
&lt;li&gt;Scalable: you can swap in approximate nearest-neighbor (ANN) indexes for larger datasets.&lt;/li&gt;
&lt;li&gt;More on &lt;a href="https://ai.meta.com/tools/faiss/" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; in a future post.&lt;/li&gt;
&lt;/ul&gt;
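
&lt;p&gt;To make the scalability point concrete, here’s a minimal sketch of swapping the flat index for an IVF-based approximate nearest-neighbor index. The random vectors below are placeholders standing in for real embeddings:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import faiss
import numpy as np

d = 1536  # dimensionality of text-embedding-3-small vectors
xb = np.random.rand(10_000, d).astype("float32")  # placeholder corpus

nlist = 100                       # number of clusters to partition the corpus into
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer used for clustering
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)    # IVF indexes must be trained before vectors are added
index.add(xb)

index.nprobe = 10  # clusters visited per query: recall vs. speed trade-off
D, I = index.search(xb[:1], 3)  # approximate top-3 neighbors
&lt;/code&gt;&lt;/pre&gt;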


&lt;h3&gt;
  
  
  4. Query with RAG Pipeline
&lt;/h3&gt;

&lt;p&gt;Now let’s write a script that takes a user question, retrieves relevant chunks, and generates an answer using GPT.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Load FAISS index + document chunks
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmdline.index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmdline_docs.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_cmdline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve relevant doc chunks and answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question
    using Retrieval-Augmented Generation (RAG).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Embed the query
&lt;/span&gt;    &lt;span class="n"&gt;q_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

    &lt;span class="c1"&gt;# Search in FAISS for top-k similar chunks
&lt;/span&gt;    &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;q_emb&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

    &lt;span class="c1"&gt;# Construct prompt for GPT
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a Python documentation assistant.
    Use the following docs context to answer the question.
    Cite specific flags or examples when possible.

    Context:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# choose a model of your liking.
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Adding the testing block to test RAG Process with a static prompt. 
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_cmdline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does the -m option do in Python?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;




&lt;ul&gt;
&lt;li&gt;We can run this script directly:
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;python &amp;lt;rag_file&amp;gt;.py
&lt;/code&gt;&lt;/pre&gt;



&lt;ul&gt;
&lt;li&gt;This lets us check the answers the LLM produces from our FAISS index before wiring up an API.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Exposing an API for Chatbot Simulation Using FastAPI
&lt;/h3&gt;

&lt;p&gt;To make this interactive, we can expose it through a FastAPI interface and, if we want, integrate it with UI components:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Query&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RAG_FILE&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ask_cmdline&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask about Python cmdline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    API endpoint to query the chatbot.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;ask_cmdline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;




&lt;ul&gt;
&lt;li&gt;Run this command to bring your API live:
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn &amp;lt;rag_file&amp;gt;:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;ul&gt;
&lt;li&gt;Now you can use curl to ask any question about the documentation and fetch the answer, as shown below.&lt;/li&gt;
&lt;/ul&gt;
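
&lt;p&gt;For example, with uvicorn running on its default host and port:&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;curl "http://127.0.0.1:8000/ask?q=What+does+the+-m+option+do+in+Python%3F"
&lt;/code&gt;&lt;/pre&gt;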




&lt;h3&gt;
  
  
  Ideas for a Production-Grade Chatbot
&lt;/h3&gt;

&lt;p&gt;Once the basic pipeline is working, here are a few ways to enhance it for real-world usage:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Store Metadata with Chunks&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your content spans multiple pages or external links, enhance your chunking strategy to include &lt;strong&gt;metadata&lt;/strong&gt; such as page numbers, URLs, or section titles.
&lt;/li&gt;
&lt;li&gt;When the chatbot retrieves a chunk, you can return this metadata alongside the answer, enabling users to &lt;strong&gt;jump directly to the source&lt;/strong&gt; of information (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Frontend &amp;amp; Contextual Memory&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate a &lt;strong&gt;frontend UI&lt;/strong&gt; using an existing chatbot component to make interactions more user-friendly and visually appealing.
&lt;/li&gt;
&lt;li&gt;For production-grade usage, maintain a &lt;strong&gt;memory of previous messages&lt;/strong&gt; so the bot can provide context-aware responses and carry on multi-turn conversations naturally.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
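
&lt;p&gt;For the first idea, here’s a minimal sketch of what metadata-carrying chunks could look like, reusing &lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;URL&lt;/code&gt;, and &lt;code&gt;pickle&lt;/code&gt; from the indexing script. The field names are hypothetical; adapt them to your source material:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;# Store each chunk together with where it came from (hypothetical fields)
records = [
    {"text": chunk, "source": URL, "section": "Command line and environment"}
    for chunk in docs
]

with open("cmdline_docs.pkl", "wb") as f:
    pickle.dump(records, f)

# FAISS search returns row indices, so at query time records[i] yields
# both the chunk text and the metadata to surface alongside the answer.
&lt;/code&gt;&lt;/pre&gt;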




&lt;p&gt;With these improvements, you now have a chatbot capable of &lt;strong&gt;answering “needle-in-a-haystack” queries&lt;/strong&gt; from massive, often messy technical documentation.&lt;br&gt;&lt;br&gt;
Even when the content is dense or full of jargon, your RAG pipeline ensures that the chatbot can provide &lt;strong&gt;accurate, contextually relevant answers&lt;/strong&gt; efficiently.&lt;/p&gt;



</description>
      <category>llm</category>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Setting up RAG Locally with Ollama: A Beginner-Friendly Guide</title>
      <dc:creator>Aayush Mishra</dc:creator>
      <pubDate>Wed, 03 Sep 2025 21:07:12 +0000</pubDate>
      <link>https://forem.com/the_aayush_mishra/setting-up-rag-locally-with-ollama-a-beginner-friendly-guide-428m</link>
      <guid>https://forem.com/the_aayush_mishra/setting-up-rag-locally-with-ollama-a-beginner-friendly-guide-428m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is one of the most powerful ways to make LLMs more useful by grounding them in your own data. Instead of relying only on a model's pretraining, RAG lets you ask questions over PDFs, docs, or databases and get precise, context-aware answers.&lt;/p&gt;

&lt;p&gt;In this post, we'll set up a local RAG pipeline using Ollama, so you can run everything privately on your machine without cloud costs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is RAG?
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by enabling them to access and utilize external knowledge sources during response generation.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;br&gt;
Instead of relying only on what the LLM was trained on, RAG lets it search a knowledge base (like your PDFs, notes, or datastores) and combine that retrieved information with its reasoning capabilities.&lt;/p&gt;

&lt;p&gt;Quick example:&lt;br&gt;
Imagine you need to understand your company's 120-page policy manual. Instead of manually searching through it, you can just ask: "What's the travel reimbursement policy?" A RAG system will fetch the relevant paragraph from the PDF, and the LLM will generate a clear answer based on that specific content.&lt;/p&gt;

&lt;p&gt;At its core, RAG works in two key phases:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Document Retrieval Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Documents are converted into numerical vectors (embeddings) using specialized models and stored in vector databases optimized for similarity search (e.g., FAISS, ChromaDB).&lt;/p&gt;

&lt;p&gt;When you ask a question, the system performs semantic search to find the most relevant chunks of text based on cosine similarity or other distance metrics.&lt;/p&gt;
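
&lt;p&gt;As a toy illustration of that scoring step (three-dimensional vectors stand in for real embeddings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -&amp;gt; float:
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query   = np.array([0.9, 0.1, 0.0])
chunk_a = np.array([0.8, 0.2, 0.1])  # points the same way as the query
chunk_b = np.array([0.0, 0.1, 0.9])  # points elsewhere

print(cosine_similarity(query, chunk_a))  # high score: retrieved
print(cosine_similarity(query, chunk_b))  # low score: skipped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;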
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Response Generation Phase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The retrieved chunks are passed as context into the LLM prompt template.&lt;/p&gt;

&lt;p&gt;The model then generates an answer grounded in your specific documents, combining retrieved facts with natural language generation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Local RAG with Ollama?
&lt;/h2&gt;

&lt;p&gt;Running RAG locally provides several compelling advantages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy and Data Security&lt;/strong&gt; — Your documents never leave your local machine. No risk of sending sensitive data to third-party APIs or cloud services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt; — Zero API call costs. Perfect for experimentation, high-volume usage, or continuous operation without budget constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Experimentation&lt;/strong&gt; — You can easily test multiple models (Llama 3.1, Mistral, CodeLlama) and compare their performance for your specific use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline Capability&lt;/strong&gt; — Complete independence from internet connectivity once models are downloaded.&lt;/p&gt;

&lt;p&gt;Ollama makes this possible by providing a simple interface to run optimized open-source models locally with minimal setup overhead.&lt;/p&gt;


&lt;h2&gt;
  
  
  Implementation Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Download and Configure Ollama
&lt;/h3&gt;

&lt;p&gt;Download Ollama from: &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;Ollama Download Page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run Ollama after downloading it, then verify the installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pull Required Models
&lt;/h3&gt;

&lt;p&gt;Before running any code, you need to download both the embedding model and the language model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the embedding model (essential for document vectorization)&lt;/span&gt;
ollama pull nomic-embed-text

&lt;span class="c"&gt;# Pull the language model for text generation&lt;/span&gt;
ollama pull llama3.1:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify your models are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Start Ollama Service
&lt;/h3&gt;

&lt;p&gt;Make Ollama available for API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python Dependencies
&lt;/h3&gt;

&lt;p&gt;Ensure you have Python 3.9+ installed, then list the required libraries in a &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama-index-core
llama-index-embeddings-ollama
llama-index-llms-ollama
llama-index-vector-stores-chroma
chromadb
pypdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Library Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama-index-core&lt;/code&gt; → Core framework for document loading, indexing, and query processing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-index-embeddings-ollama&lt;/code&gt; → Integrates Ollama for generating embeddings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-index-llms-ollama&lt;/code&gt; → Integrates Ollama as the language model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-index-vector-stores-chroma&lt;/code&gt; → ChromaDB connector for vector storage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chromadb&lt;/code&gt; → Vector database backend for similarity search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pypdf&lt;/code&gt; → PDF parsing and text extraction&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Complete RAG Implementation
&lt;/h2&gt;

&lt;p&gt;Create your project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── data/           # Place your PDF files here
├── test_rag.py     # Main implementation
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;test_rag.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.embeddings.ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OllamaEmbedding&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt;


&lt;span class="c1"&gt;# Initialize the embedding model
&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nomic-embed-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;300.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Increased timeout for large documents
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the LLM with optimized settings
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Confirm with `ollama list`
&lt;/span&gt;    &lt;span class="n"&gt;request_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;300.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Lower temperature for more factual responses
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set global configurations
&lt;/span&gt;&lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;
&lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_and_index_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load documents and create vector index&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if data directory exists
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data directory &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. Please create it and add your PDF files.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load documents from the data folder
&lt;/span&gt;    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No documents found in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="c1"&gt;# Build vector index from documents
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create query engine with specified retrieval parameters&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Number of relevant chunks to retrieve
&lt;/span&gt;        &lt;span class="n"&gt;response_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;             &lt;span class="c1"&gt;# Compact response generation
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_rag_system&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Test the RAG system with sample queries&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Load documents and create index
&lt;/span&gt;        &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_and_index_documents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Create query engine
&lt;/span&gt;        &lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Sample test queries
&lt;/span&gt;        &lt;span class="n"&gt;test_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this document in 3 lines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the main topics covered in these documents?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAG System Test Results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Test &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: SUCCESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: FAILED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="c1"&gt;# Main execution
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting RAG Pipeline Test...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test the complete system
&lt;/span&gt;    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_rag_system&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;RAG system is working correctly!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You can now use the query_engine to ask questions about your documents.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;RAG system test failed. Check the error messages above.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prepare your documents&lt;/strong&gt;: Place PDF files in a &lt;code&gt;data/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the test&lt;/strong&gt;: Execute &lt;code&gt;python test_rag.py&lt;/code&gt; to verify everything works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive usage&lt;/strong&gt;: After a successful test run, you can use the functions individually, as sketched below&lt;/li&gt;
&lt;/ol&gt;
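
&lt;p&gt;For example, here is a minimal interactive sketch. It assumes the functions above live in &lt;code&gt;test_rag.py&lt;/code&gt; (the import path is an assumption; adjust it to wherever you saved the script):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical interactive usage -- assumes the functions above are
# saved in test_rag.py in the current directory.
from test_rag import load_and_index_documents, create_query_engine

index = load_and_index_documents()                      # build the vector index once
query_engine = create_query_engine(index, similarity_top_k=3)

# Ask ad-hoc questions against your indexed documents
response = query_engine.query("What are the main topics covered in these documents?")
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;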

&lt;h3&gt;
  
  
  Testing Your Setup
&lt;/h3&gt;

&lt;p&gt;The code includes a comprehensive testing function that will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify document loading works correctly&lt;/li&gt;
&lt;li&gt;Test vector index creation&lt;/li&gt;
&lt;li&gt;Run sample queries to ensure end-to-end functionality&lt;/li&gt;
&lt;li&gt;Provide clear success/failure feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advanced Configuration Options
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chunk Size Optimization&lt;/strong&gt;: Adjust chunk sizes based on your document types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;    &lt;span class="c1"&gt;# Default: good for most documents
&lt;/span&gt;&lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;  &lt;span class="c1"&gt;# Maintains context between chunks
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
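
&lt;p&gt;For instance, terse reference pages often retrieve better with smaller chunks, while long-form guides benefit from larger ones. The values below are illustrative starting points (my assumptions, not library defaults), so measure retrieval quality on your own corpus before settling on them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from llama_index.core import Settings

# Illustrative values -- tune per corpus, these are not prescriptive:
Settings.chunk_size = 512      # smaller chunks suit terse, reference-style pages
Settings.chunk_overlap = 100   # roughly 10-20% overlap preserves context at boundaries

# Settings.chunk_size = 2048   # larger chunks suit long-form guides and tutorials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;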



&lt;p&gt;&lt;strong&gt;Retrieval Tuning&lt;/strong&gt;: Modify similarity search parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Retrieve more chunks for complex queries
&lt;/span&gt;    &lt;span class="n"&gt;response_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tree_summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Better for longer documents
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
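
&lt;p&gt;As a rule of thumb, a higher &lt;code&gt;similarity_top_k&lt;/code&gt; gives the model more supporting context for multi-part questions, at the cost of extra latency and token usage, while &lt;code&gt;tree_summarize&lt;/code&gt; builds the answer by summarizing retrieved chunks hierarchically, which tends to help when the relevant material is spread across long documents.&lt;/p&gt;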






&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;p&gt;For production deployment, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Use chunk sizes and similarity thresholds appropriate to your document types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Implement logging to track query patterns and response quality (see the sketch below)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt;: ChromaDB performs well for most use cases; consider FAISS for larger datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Selection&lt;/strong&gt;: Test different models to find the best balance of speed and accuracy for your specific use case&lt;/li&gt;
&lt;/ul&gt;
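
&lt;p&gt;For the monitoring point, one lightweight option is to wrap the query engine in a logging helper. The sketch below uses only the standard library; the log file name and format are assumptions to adapt to your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import time

# Hypothetical log destination -- point this wherever your deployment collects logs
logging.basicConfig(filename="rag_queries.log", level=logging.INFO)

def logged_query(query_engine, question):
    """Run a query and record latency and response size for later analysis."""
    start = time.perf_counter()
    response = query_engine.query(question)
    elapsed = time.perf_counter() - start
    logging.info("query=%r latency=%.2fs response_chars=%d",
                 question, elapsed, len(str(response)))
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;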

&lt;p&gt;This implementation provides a solid foundation for document-based question answering, with complete privacy and no ongoing API costs when paired with an on-prem model. The modular structure makes it easy to customize for specific requirements while maintaining reliability and performance.&lt;/p&gt;




&lt;p&gt;For questions or improvements to this setup, feel free to reach out. &lt;/p&gt;

&lt;p&gt;This is a basic setup; in future posts, we will explore more enhanced, production-grade RAG architectures that achieve better similarity search. Meanwhile, if you want to read more about similarity scoring, the following links can help:&lt;br&gt;
 &lt;a href="https://www.sciencedirect.com/topics/computer-science/similarity-score" rel="noopener noreferrer"&gt;Similarity Score&lt;/a&gt;&lt;br&gt;
 &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener noreferrer"&gt;Cosine Similarity&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>python</category>
    </item>
  </channel>
</rss>
