Forem: Prathap Daniel Rajasooriyar

Supercharge Your Terminal: ShellGPT + ChromaDB + LangChain for Context-Aware Automation

Prathap Daniel Rajasooriyar — Mon, 01 Sep 2025 10:16:14 +0000

A Smarter Way to Work in the CLI

The command line has always been a place for power users — fast, flexible, and unforgiving. But what if your terminal could do more than just run commands? What if it could understand your intent, execute the right actions, and even pull answers from your own notes before acting?

In this guide, we’ll start with ShellGPT — an AI-powered CLI companion that can chat, generate commands, and execute them directly in your system.

And we won’t stop there. ShellGPT already comes with a built-in function called execute_shell_command, which allows it to run generated commands. We’ll add a brand‑new custom function call (tool) to its toolset — one that can query your knowledge store and return relevant documents directly inside your CLI session.

Then we’ll take it further by integrating ChromaDB and LangChain to add Retrieval‑Augmented Generation (RAG), wiring that custom tool to your notes so the terminal can reason over your personal knowledge base. The result is a context‑aware assistant that doesn’t just respond — it acts, informed by your own data.

Meet ShellGPT: Your AI-Powered Command Line Companion

ShellGPT is a versatile command-line productivity tool powered by large language models, such as GPT-4. It enhances your terminal experience by intelligently generating shell commands, code snippets, and documentation—all without leaving the CLI. Designed to streamline workflows and reduce context switching, ShellGPT supports Linux, macOS, and Windows, and works with major shells such as Bash, Zsh, PowerShell, and CMD.

Key Features:

Command Generation: Instantly generate shell commands tailored to your OS and shell.
Code Assistance: Use --code to generate or annotate code directly from the terminal.
Chat & REPL Modes: Maintain conversational sessions or interactively explore ideas.
Function Calling: Define and execute custom Python functions via GPT.
Role Customisation: Create roles to tailor GPT responses for specific tasks.
Local Model Support: Optionally connect to local LLMs, such as Ollama.

ShellGPT is a powerful tool for automation and system administration because, through function calling, it can execute the commands it generates, including system-level commands when it runs with admin or sudo privileges.

🗃
To explore ShellGPT in depth, including installation instructions, usage examples, and advanced configuration options, head over to the official ShellGPT GitHub repository.

Here is an example from the GitHub repo in which ShellGPT uses a function call to generate a system command and then execute it:

Unlocking Context: ShellGPT + ChromaDB

Granting ShellGPT access to your personal notes and documents through a vector database, such as ChromaDB, enables it to become a context-aware assistant. Instead of relying solely on generic knowledge, it can now reason over your own data—tailored to your workflows, preferences, and domain expertise.

Key Benefits:

Semantic Search Over Your Knowledge: Retrieve relevant information from your notes using natural language queries.
Contextual Command Suggestions: ShellGPT can generate commands or code snippets based on the content of your documents.
Conversational Recall: Ask ShellGPT questions like “What did I note about Docker networking?” and get precise answers drawn from your own writing.
Enhanced Learning & Debugging: Use your notes as a knowledge base to troubleshoot errors, explore concepts, or revisit tutorials—all within the CLI.
Privacy-Preserving Intelligence: Since ChromaDB runs locally, your data stays on your machine—giving you control without sacrificing capability.

Setup Guide: Wiring ShellGPT to Your Notes with ChromaDB & LangChain

Step 1: Initialize ChromaDB and Store Your Notes

To enable semantic search over your documents, you'll first need to initialize ChromaDB and populate it with embeddings.

💡
If you do not already have a ChromaDB vector store set up, see From Markdown to Meaning: Turn Obsidian Notes into a Conversational AI for an example of converting your Markdown notes into embeddings stored in ChromaDB (Code Sections 1 to 5).

Step 2: Install ShellGPT and Enable Function Calling

ShellGPT will act as your conversational interface.

To install: ShellGPT GitHub repository
To enable Function Calling: Function Calling

Step 3: Build a LangChain Driver for Retrieval

Install the following required packages in your environment:

pip install langchain
pip install langchain-chroma 
pip install langchain-openai

The following Python driver will handle semantic queries over your ChromaDB collection, utilising LangChain’s retrievers to retrieve 5 documents that match the query.

# query.py
# Save this in ~/.config/shell_gpt/

import subprocess
import argparse
import os
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever 
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

# Path to the persist directory
persist_directory = 'Path/To/Chroma/Database' # <-- Change this path
collection_name = "your_collection" # <-- Change this to your collection's name

# Initialize the embedding model
embedding = OpenAIEmbeddings()

# Initialize the Chroma vector store with the embedding function and persist_directory
vectordb = Chroma(
    collection_name=collection_name,
    persist_directory=persist_directory,
    embedding_function=embedding
)

# Initialize the language model
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

# Initialize the compressor
compressor = LLMChainExtractor.from_llm(llm)

#init base retriever
base_retriever=vectordb.as_retriever(search_kwargs={"k": 5})

# Function to pretty print documents
def pretty_print_docs(docs):
    for i, d in enumerate(docs):
        # Print the document index, metadata, and content
        print(f"Document {i+1}:")
        #print("Metadata:")
        for key, value in d.metadata.items():
            print(f"  {key}: {value}")
        #print("\nContent:")
        print(d.page_content)
        print("\n" + "-" * 100 + "\n")

# Main execution block
if __name__ == "__main__":
    # Set up argument parsing 
    parser = argparse.ArgumentParser(description="Query the vector store")
    parser.add_argument("question", type=str, help="The question to query the vector store with")
    args = parser.parse_args()
    # Get the question from command-line arguments 
    question = args.question
    # Debugging output to verify the received question 
    print(f"Received question: {question}") 
    # Retrieve relevant documents 
    retrieved_docs = base_retriever.invoke(question)
    # Debugging output 
    print(f"Retrieved {len(retrieved_docs)} documents.")
    # Pretty print the retrieved documents
    pretty_print_docs(retrieved_docs)

Save the above as a Python file, for example, query.py in your Shell-GPT Folder: ~/.config/shell_gpt

To test the driver file:

python query.py "your question here based on ChromaDB documents"

Here is an example of the query Python driver retrieving 5 documents based on my question:

 > python3 query.py "how to use python for pentesting?"
Received question: how to use python for pentesting?
Retrieved 5 documents.
Document 1:
  path:---------------------------------------\Python for Pentest\10 Extra Challenges.md
Document Name: 10 Extra Challenges.md
Path: ----------------------------------------\Python for Pentest\10 Extra Challenges.md
Based on what we have covered in this room, here are a few suggestions about how you could expand these tools or start building your own using Python:
- Use DNS requests to enumerate potential subdomains
- Build the keylogger to send the capture keystrokes to a server you built using Python
- Grab the banner of services running on open ports
- Crawl the target website to download .js library files included
- Try to build a Windows executable for each and see if they work as stand-alone applications on a Windows target
- Implement threading in enumeration and brute-forcing scripts to make them run faster

----------------------------------------------------------------------------------------------------

Document 2:
  path: ----------------------------------------\Buffer Overflow - TCM.md
Fuzzing Python script: 1.py
.
.
.

💡
LangChain’s base retriever offers minimal filtering and ranking—it simply returns documents based on raw similarity scores. To implement an accurate and focused document retriever that uses contextual compression and re-ranking, refer to https://bohowhizz.hashnode.dev/from-markdown-to-meaning#heading-code-section-6-implementing-a-query-system

Step 4: Create a Custom Function for ShellGPT

ShellGPT already supports function calling, and comes with built‑in tools like execute_shell_command. In this step, we’ll add our own custom function to its toolset — one that acts as a bridge between ShellGPT and the query.py driver we built earlier.

This new function will:

Accept a natural‑language question from ShellGPT.
Pass that question to the query.py driver you built in Step 3.
Retrieve relevant documents from your ChromaDB store.
Return those documents so ShellGPT can use the LLM to craft a context‑aware answer.

By wiring this in, you’re effectively teaching ShellGPT to “look things up” in your personal notes before answering — turning it into a context‑aware terminal assistant.

Install the following required package in your environment:

pip install instructor

# query_chat.py
# Save this in ~/.config/shell_gpt/functions
import subprocess
import os
from pydantic import Field
from instructor import OpenAISchema

class Function(OpenAISchema):
    """
    Pass the (question) to get related documents that are returned as output (result)
    """
    question: str = Field(..., example="from my notes how to create golden ticket?", description="user query to pass as the question",)

    class Config:
        title = "chromadb_query"

    @classmethod
    def execute(cls, question: str) -> str:
        # subprocess does not expand "~" in an argument list, so resolve it explicitly
        script_path = os.path.expanduser("~/.config/shell_gpt/query.py") # <-- Change this path if necessary
        command = ["python3", script_path, question] # <-- Change python3 to python if necessary
        result = subprocess.run(command, capture_output=True, text=True)

        # Debugging output to verify command execution
        print(f"Running command: {command}")
        print(f"Command output: {result.stdout}")
        print(f"Command error (if any): {result.stderr}")

        # Check if the command was successful 
        if result.returncode == 0:
            return result.stdout.strip()
        else:
            return f"Error: {result.stderr.strip()}"
# Function to pretty print documents 
def pretty_print_docs(docs):
    print("Debug: Entered pretty_print_docs")
    if not docs:
        print("No documents found.")
    else:
        split_docs = docs.split('\n----------------------------------------------------------------------------------------------------\n')
        print(f"Debug: Number of documents found: {len(split_docs)}")
        for i, doc in enumerate(split_docs):
            doc_content = doc.strip()
            # Skip the lines containing metadata 
            if any(keyword in doc_content for keyword in ["Received question", "Retrieved"]): 
                continue
            # Check if the document is not empty before printing
            if doc_content:
                print(f"Document {i+1}:\n\n{doc}\n{'-' * 100}")
# Example usage 
if __name__ == "__main__": 
    # Sample question 
    test_question = input("Please enter your question: ")
    # Call the function 
    output = Function.execute(test_question)
    # Print the output
    print("Output from Function.execute:")
    print(output)
    # Pretty print the retrieved documents
    pretty_print_docs(output)

Save the above as a Python file, for example, query_chat.py in your Shell-GPT Folder: ~/.config/shell_gpt/functions
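The bridge inside execute can also be tried in isolation before wiring it into ShellGPT. The following is a minimal sketch, not ShellGPT's own code: run_driver is a hypothetical helper name, the timeout value is arbitrary, and it assumes the driver's dependencies are installed for the same interpreter that runs the function. It resolves "~" explicitly, because subprocess does not expand it when given an argument list.

```python
import os
import subprocess
import sys
import tempfile


def run_driver(script_path: str, question: str, timeout: float = 60.0) -> str:
    """Run a driver script with one positional argument and return its stdout."""
    # subprocess passes list arguments through untouched, so "~" must be expanded here.
    resolved = os.path.expanduser(script_path)
    # sys.executable reuses the interpreter running this code (an assumption:
    # the driver's packages are installed in that same environment).
    result = subprocess.run(
        [sys.executable, resolved, question],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode == 0:
        return result.stdout.strip()
    return f"Error: {result.stderr.strip()}"


if __name__ == "__main__":
    # Hypothetical stand-in for query.py: it only echoes the question back.
    with tempfile.TemporaryDirectory() as tmp:
        fake_driver = os.path.join(tmp, "fake_query.py")
        with open(fake_driver, "w", encoding="utf-8") as fh:
            fh.write("import sys\nprint('Received question: ' + sys.argv[1])\n")
        print(run_driver(fake_driver, "how to scan ports?"))
```

Passing the question as a separate list element, rather than building a shell string, keeps quotes and special characters in the question from being interpreted by a shell.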

How It Works

ShellGPT receives your query in CHAT or REPL mode.
The chromadb_query function runs query.py with your question.
query.py uses LangChain to pull the top 5 relevant documents from ChromaDB.
ShellGPT uses those documents as context to generate a tailored, informed answer.

RAG Demo with ShellGPT

In the following example, I asked ShellGPT: “How to use Python to create a port scanner?” — but with the added context “from my notes.”

ShellGPT retrieved 5 relevant documents from my ChromaDB store and used OpenAI’s LLM to generate a tailored response based on that personal context. This showcases Retrieval-Augmented Generation (RAG) in action: instead of relying solely on general knowledge, ShellGPT reasons over my own notes to deliver a precise, informed answer.

>sgpt --repl temp
Entering REPL mode, press Ctrl+C to exit.
>>> from my notes how to use python to create a port scanner?
Running command: ['python3', '------------/.config/shell_gpt/query.py', 'how to
use python to create a port scanner']
Command output: Received question: how to use python to create a port scanner
Retrieved 5 documents.
Document 1:
  path: ------------------------------\Python for Pentest\4 Port Scanner.md
Document Name: 4 Port Scanner.md
Path: ---------------------------------\Python for Pentest\4 Port Scanner.md
In this task, we will be looking at a script to build a simple port scanner.
The code:
.
.
.
----------------------------------------------------------------------------------------------------


▌ @FunctionCall chromadb_query(question="how to use python to create a port scanner")

To create a simple port scanner in Python, you can use the socket library. Here's a basic example:


 import socket

 def scan_ports(ip, ports):
     open_ports = []
.
.
.
This script attempts to connect to each port in the specified range and lists the open ones. Adjust the IP and port
range as needed.

Before vs After RAG

Without RAG	With RAG
Generic answer from model’s training data	Answer grounded in my own notes
May miss my preferred tools or methods	Matches my documented workflows
No awareness of past experiments	Recalls exactly what I’ve done before

💡
Pro Tip: If you're using ShellGPT in CHAT or REPL mode, you can go beyond just reading the output. You can, for example, ask ShellGPT to save the generated portscanner.py file to your desktop and even execute it against a custom IP and port range—all from the terminal.

⚠
Security Note: Since ShellGPT can execute system-level commands, always review generated commands before running them. Avoid using elevated privileges unless absolutely necessary, and keep sensitive files or credentials out of its reach. For risky or unfamiliar operations, test in a sandbox or VM first to protect your main environment.

If this integration sparked ideas for your own setup, I’d love to hear how you’re using ShellGPT to personalise your terminal experience. Whether you're querying notes, automating system tasks, or just experimenting with RAG workflows, the beauty lies in adapting these tools to your unique context.

From Markdown to Meaning: Turn Your Obsidian Notes into a Conversational Database Using LangChain, Python, and ChromaDB

Prathap Daniel Rajasooriyar — Thu, 28 Aug 2025 11:11:46 +0000

Lost in Notes

If you're like me, your Obsidian vault has grown into a sprawling collection of hundreds of notes scattered across dozens of folders. I haven't fully utilised the tags function and haven't linked notes efficiently. So the majority of my notes are floating orphan nodes. What started as an organised knowledge management system has become an overwhelming maze where finding specific information feels like searching for a needle in a haystack.

Simple text searches return too many irrelevant results or miss contextually relevant notes that use different terminology. Even worse, I know that perfect piece of information exists somewhere in my vault, but I just can't remember which note contains it or what exact words I used.

I needed a smarter way to search through my notes—one that could understand context, find semantic relationships, and surface relevant information regardless of the specific words used.

Surfacing What Matters

To solve this challenge, I built a system that processes all my Obsidian notes into a vector space, enabling intelligent querying through LangChain and Python. Here's what this solution accomplishes:

• Semantic Search: Transform notes into embeddings that capture meaning, not just keywords

• OCR: Extract text from linked screenshots and images and embed them into the notes

• Context Preservation: Include original document headers and full path in every document chunk after splitting to maintain traceability

• Cross-Note Connections: Surface semantic relationships between notes that might otherwise remain hidden

• Folder-Agnostic Retrieval: Find relevant content regardless of how (or how poorly) your notes are organised

Document Retrieval Proof of Concept

For this proof of concept, I have vectorised a few documents related to Pentesting. Given the limited number of documents vectorised, the retrieved data from the Vector Space is impressive.

Code Walkthrough: From Structure to Semantics

My goal is to recursively process every document in the Obsidian vault, extract text from embedded images (internal and external links), and seamlessly integrate that content back into the original document flow. After that, each document is stored in a ChromaDB vector store, with careful attention to preserving its context, structure, and metadata for meaningful retrieval.

🗃
If you'd like to follow along or tweak the code yourself, feel free to grab the Jupyter notebook from my GitHub repo: Neural-and-Firewall-Blog

Code Section 1: Setting Up Required Packages and Environment

Requirements:

langchain
langchain-chroma
langchain-community
langchain-openai
instructor
pytesseract
Pillow
chromadb

Save the above as requirements.txt and then run the following command: pip install -r requirements.txt

Setting Up Your OpenAI API Key:

To run the parts of this project that use OpenAI’s models (for vector embedding, contextual retrieval, or generating answers), you’ll need an OpenAI API key. This key authenticates your requests and links them to your OpenAI account. You can get your API key by:

Signing in to your OpenAI account.
Navigating to View API Keys in your account settings.
Creating a new key and copying it somewhere safe.

Option 1 — Store It as an Environment Variable (recommended)

This keeps your key out of your code and version control.

macOS / Linux (bash/zsh):
export OPENAI_API_KEY="your_api_key_here"

Windows (PowerShell; setx takes effect in new terminal sessions, not the current one):
setx OPENAI_API_KEY "your_api_key_here"

Then in Python:
import os
api_key = os.getenv("OPENAI_API_KEY")

Option 2 — Paste It Directly in Your Code (quick, but less secure)

If you’re just testing locally and want the fastest setup: api_key = "your_api_key_here"

⚠
Warning: Avoid committing your API key to GitHub or sharing it publicly.

Tip: If you’re running this in a Jupyter Notebook, you can set the environment variable in a cell:
%env OPENAI_API_KEY=your_api_key_here

This will make it available to all subsequent cells in that notebook session.
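Whichever option you choose, it can help to fail early with a readable message when the key is missing, instead of hitting an authentication error halfway through embedding. A minimal sketch, assuming the variable name OPENAI_API_KEY used above; require_api_key and mask are hypothetical helper names, not part of any library:

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key


def mask(key, visible=4):
    """Hide all but the last few characters so the key can be printed safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


if __name__ == "__main__":
    try:
        print("Using key:", mask(require_api_key()))
    except RuntimeError as err:
        print(err)
```

Printing only the masked form keeps the full key out of notebook output that might later be shared.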

Loading the Obsidian Vault

# ObsidianLoader lives in the langchain-community package listed in requirements.txt
from langchain_community.document_loaders import ObsidianLoader
# Load Markdown files from Obsidian directory 
loader = ObsidianLoader(path="Path/To/Obsidian/Vault/", collect_metadata=True) # <-- Change this path
documents = loader.load()
# Base directory for images 
image_base_path = "Path/To/Obsidian/Vault/Attachments/Folder" # <-- Change this path

Code Section 2: Setting up Tesseract for OCR

I am using Tesseract OCR, as my use case involves converting screen-captured notes to text. Tesseract does not work well with handwritten notes. If you need OCR for handwritten notes, please follow the alternative solution I provided below for OCR.

To install Tesseract on Windows:

Download and run the Windows installer (.exe) from the UB-Mannheim/tesseract GitHub repository.
Add Tesseract to your PATH
If you are using Jupyter Notebook, it is best to restart the Notebook.

In the following code block, we will define two functions.

The first function handles local image files stored in your local drive: it opens the image and passes it to Tesseract to retrieve any embedded text.
The second function works with image URLs: it sends an HTTP request with headers to fetch the image. It tries twice, and if successful, converts the response into an image object and applies the same OCR process.

Both functions include error handling to manage cases where the image is unreadable or the request fails.

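import os
from io import BytesIO  # wraps downloaded bytes so PIL can open them

import pytesseract
import requests  # used by extract_text_from_image_url
from PIL import Image, UnidentifiedImageError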

# Function to extract text from images 
def extract_text_from_image(image_path):
    try:
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img)
        return text
    except UnidentifiedImageError:
        print(f"Cannot identify image file: {image_path}")
        return ""

# Function to extract text from image urls
def extract_text_from_image_url(image_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        'Referer': image_url,
        'Accept-Language': 'en-US,en;q=0.9'
    }

    attempts = 2  # Number of attempts
    session = requests.Session()  # Create a session object

    for attempt in range(attempts):
        try:
            response = session.get(image_url, headers=headers)
            response.raise_for_status()  # Check if the request was successful

            img = Image.open(BytesIO(response.content))
            text = pytesseract.image_to_string(img)
            return text
        except UnidentifiedImageError:
            print(f"Cannot identify image file from URL: {image_url}")
            return ""
        except requests.exceptions.RequestException as e:
            print(f"Error downloading image (attempt {attempt + 1}): {e}")
            if attempt < attempts - 1:
                print("Retrying...")
            else:
                return ""

Alternative: Setting up EasyOCR for OCR

EasyOCR is suitable for extracting text from handwritten notes. It is very sensitive: it will try to convert everything in the image to text. If you would like to use EasyOCR, you can follow this setup.

EasyOCR depends on PyTorch. Run this in your terminal:

# For CPU only version: 
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# For GPU version: 
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Now install EasyOCR itself:
pip install easyocr

In the following code block, we will define two functions.

The first function handles local image files stored in your local drive: it opens the image and passes it to EasyOCR to retrieve any embedded text.
The second function works with image URLs: it sends an HTTP request with headers to fetch the image. It tries twice, and if successful, converts the response into an image object and applies the same OCR process.

Both functions include error handling to manage cases where the image is unreadable or the request fails.

import easyocr
import os
import requests  # used by extract_text_from_image_url
from PIL import Image, UnidentifiedImageError
import numpy as np
import cv2

def extract_text_from_image(image_path):
    reader = easyocr.Reader(['en'], gpu=True)  # You can add more languages like ['en', 'hi']
    results = reader.readtext(image_path, detail=0)  # detail=0 returns just the text
    return "\n".join(results)

def extract_text_from_image_url(image_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
        'Referer': image_url,
        'Accept-Language': 'en-US,en;q=0.9'
    }

    attempts = 2
    session = requests.Session()
    reader = easyocr.Reader(['en'], gpu=True)

    for attempt in range(attempts):
        try:
            response = session.get(image_url, headers=headers)
            response.raise_for_status()

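            # Convert image bytes to NumPy array for EasyOCR
            img_array = np.frombuffer(response.content, np.uint8)
            img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
            # cv2.imdecode returns None instead of raising when the bytes are not a valid image
            if img is None:
                print(f"Cannot identify image file from URL: {image_url}")
                return ""

            results = reader.readtext(img, detail=0)
            return "\n".join(results)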

        except UnidentifiedImageError:
            print(f"Cannot identify image file from URL: {image_url}")
            return ""
        except requests.exceptions.RequestException as e:
            print(f"Error downloading image (attempt {attempt + 1}): {e}")
            if attempt < attempts - 1:
                print("Retrying...")
            else:
                return ""
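Both OCR variants share the same download-and-retry loop, so it can be pulled out and exercised on its own. The following is a sketch under stated assumptions, not the notebook's code: with_retries is a hypothetical helper, the fetch argument stands for any zero-argument callable (for example, one wrapping session.get), and retrying on OSError is a choice made here because the requests exceptions derive from it.

```python
import time


def with_retries(fetch, attempts=2, delay=0.0, retry_on=(OSError,)):
    """Call fetch() up to `attempts` times; return its result, or "" if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            print(f"Attempt {attempt} failed: {exc}")
            # Optionally wait before the next attempt; skip the wait after the last one.
            if attempt < attempts and delay:
                time.sleep(delay)
    return ""


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_download():
        # Hypothetical stand-in for an HTTP request: fails once, then succeeds.
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("temporary failure")
        return "image-bytes"

    print(with_retries(flaky_download))
```

Returning an empty string on failure mirrors the functions above, so a failed download simply contributes no extracted text to the note.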

Code Section 3: Extract and Embed Text from Image Links

The following code block processes the collection of documents, scanning each one for embedded image references—either URLs or local file paths. For every image found, it uses OCR to extract any visible text and appends that text directly beneath the image reference in the document. It also updates each document with its name and source path, enriching the content with both metadata and extracted insights. This approach is especially useful for making image-based information searchable and accessible within text-based workflows.

import requests
from io import BytesIO
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import urllib.parse

for doc in documents:
    content = doc.page_content  # Access the content attribute of the Document object
    doc_name = os.path.basename(doc.metadata['source'])  # Assuming the document metadata contains the source file path
    path = doc.metadata.get('path', 'Unknown path') # Extract the source from metadata
    lines = content.split('\n')
    for i, line in enumerate(lines):
        if line.startswith('![') and line.endswith(')'):
            # Handle image URL
            start_index = line.find('(') + 1
            end_index = line.find(')')
            image_url = line[start_index:end_index]
            extracted_text = extract_text_from_image_url(image_url)
            lines[i] += f"\nExtracted Text from {image_url}:\n{extracted_text}"
        elif line.startswith('![[') and line.endswith(']]'):
            # Handle local image path
            image_name = line[3:-2]  # Extract the image file name from the Markdown link
            image_path = os.path.join(image_base_path, image_name)
            if os.path.exists(image_path):
                # Extract text from the image using OCR
                extracted_text = extract_text_from_image(image_path)
                # Append the extracted text to the corresponding line in the document
                lines[i] += f"\nExtracted Text from {image_path}:\n{extracted_text}"
    # Update the document's content with the document name and modified lines
    doc.page_content = f"Document Name: {doc_name}\nPath: {path}\n\n" + '\n'.join(lines)

print(f"Total documents processed: {len(documents)}")

To view any document content: print(documents[0].page_content)

An example of an image extraction and embedding:

While it may not be the best, I am content considering my use case.

To view document metadata: print(documents[2].metadata)

{'source': '2 Note_Title.md', 'path': '\Path\To\Obsidian\2 Note_Title.md', 'created': 1756102692.0840187, 'last_modified': 1726299048.0, 'last_accessed': 1756116860.309166}
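The line scanning in Code Section 3 relies on startswith and endswith, which only recognises an image reference that fills the whole line. As an alternative sketch (my own illustration, not the notebook's code), regular expressions can pick out references anywhere in a line and also tolerate the optional size suffix of an Obsidian embed such as ![[image.png|300]]. The pattern names and find_image_refs are hypothetical.

```python
import re

# ![alt text](target "optional title"): standard Markdown image; target is a URL or path
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# ![[file.png]] or ![[file.png|300]]: Obsidian embed with an optional part after "|"
OBSIDIAN_EMBED = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def find_image_refs(line):
    """Return (kind, target) pairs for a line: Obsidian embeds first, then Markdown links."""
    refs = [("embed", m.group(1).strip()) for m in OBSIDIAN_EMBED.finditer(line)]
    # Remove the embeds before looking for Markdown links so the two syntaxes stay separate.
    remainder = OBSIDIAN_EMBED.sub("", line)
    refs += [("link", m.group(1)) for m in MARKDOWN_IMAGE.finditer(remainder)]
    return refs


if __name__ == "__main__":
    for sample in ["![[Pasted image 1.png|300]]", "See ![shot](https://example.com/a.png) here"]:
        print(find_image_refs(sample))
```

An "embed" target would then go to extract_text_from_image after joining it with image_base_path, and a "link" target to extract_text_from_image_url.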

Code Section 4: Splitting by Headings and Chunking

When working with Markdown notes, headings (#, ##, ###) are boundaries that organise content into meaningful sections. Splitting by headings first ensures that the context stays intact.

After splitting by headings, we still break longer sections into smaller overlapping chunks (e.g., 200–500 words) so they fit within embedding model limits and can be matched more precisely during search. This two‑step process — headings first, then chunking — gives the best balance between context preservation and retrieval accuracy.

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.schema import Document  # Adjust the import based on your project structure

# Initialize the MarkdownHeaderTextSplitter to recognize headers from # to ######
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Heading 1"), 
                         ("##", "Heading 2"), 
                         ("###", "Heading 3"), 
                         ("####", "Heading 4"), 
                         ("#####", "Heading 5"), 
                         ("######", "Heading 6")]
)

# Initialize a character-based splitter for chunking
chunk_size = 3500  # Set your desired chunk size
chunk_overlap = 300  # Set your desired chunk overlap
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

all_chunks = []

for doc_index, doc in enumerate(documents):
    #print(f"Processing document {doc_index + 1}/{len(documents)}")

    if not isinstance(doc.page_content, str):
        print("Warning: Document page_content is not a string.")
        continue

    # Split document based on headers
    header_split_docs = header_splitter.split_text(doc.page_content)
    #print(f"Header split resulted in {len(header_split_docs)} parts for document {doc_index + 1}")

    # Convert split Documents into new Documents with consistent metadata
    for subdoc_index, subdoc in enumerate(header_split_docs):
        # Ensure subdoc is a Document object
        if isinstance(subdoc, Document):
            #print(f"Processing subdoc {subdoc_index + 1}/{len(header_split_docs)}")
            # Extract the content
            content = subdoc.page_content

            # Split the content into smaller chunks
            chunks = recursive_splitter.split_text(content)
            for chunk_index, chunk in enumerate(chunks):
                new_metadata = {
                    'path': doc.metadata.get('path', 'Unknown')
                }
                # Create a new Document for each chunk ensuring it retains metadata
                chunk_doc = Document(page_content=chunk, metadata=new_metadata)
                all_chunks.append(chunk_doc)
                #print(f"Created chunk {chunk_index + 1}/{len(chunks)} for subdoc {subdoc_index + 1}")
        else:
            print(f"Warning: Header split result {subdoc_index + 1} is not a Document.")

# Print the number of split documents created
print(f"Total number of chunks created: {len(all_chunks)}")

To confirm whether the metadata of the documents, such as document path, headings, etc., are preserved, you can use the following code.

documents[0]

all_chunks[0]
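To see the two-step idea without any LangChain dependency, here is a simplified standard-library sketch. It is an illustration only: split_by_headings and chunk_text are hypothetical helpers, the chunker cuts fixed character windows rather than following RecursiveCharacterTextSplitter's separator-aware logic, and the heading pattern does not skip lines starting with # inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into (heading, body) pairs; text before any heading gets heading ''."""
    sections, heading, body = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the section collected so far, unless it is completely empty.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_text(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    note = "intro\n# Recon\nscan the host\n## Ports\nuse a port scanner"
    for heading, body in split_by_headings(note):
        print(repr(heading), chunk_text(body, chunk_size=12, overlap=3))
```

The overlap means the last few characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still fully present in at least one chunk when the overlap is long enough.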

Code Section 5: Building a Vector Store

A vector store is a database designed for storing and retrieving documents as embeddings — numerical representations that capture meaning rather than just keywords. Storing as embeddings allows the system to match concepts, not just exact words, powering more intelligent search and retrieval. ChromaDB stands out for being open‑source, easy to integrate with LangChain, and optimised for quick similarity searches.

Using OpenAIEmbeddings, each chunk is converted into a high‑dimensional vector and saved in a persistent Chroma collection. The script then prints the total number of stored entries to confirm the database is ready.

import langchain_openai
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever 
from langchain.retrievers.document_compressors import LLMChainExtractor

persist_directory = 'Path/To/New/Chroma/Directory' # <-- Change this path
collection_name = "blog_collection" # <-- Change to any name

embedding = OpenAIEmbeddings()
vectordb = Chroma(
    collection_name=collection_name,
    persist_directory=persist_directory,
    embedding_function=embedding
)

vectordb.add_documents(documents=all_chunks)

Once the documents are added to the vector store, add_documents returns the list of their UUIDs. In a notebook or REPL, this shows up as output similar to the following:

['7cab8851-c4b7-4a05-895a-01c92c8af564',
 'b5c1bcb1-0067-44a5-8d04-a30678c48537',
 '6cdf533a-75f4-44a7-a993-2280f352ff4f',
 '9a553c5c-f327-48a5-8321-a6e46759ace9',
 'f57bb100-9408-4cd6-b78c-df63f18cc705',
 '197b851b-814a-4659-b9e0-aa5ba244727c',
 '9fa84d43-1ffb-450d-84e1-a1eef0175e1f',
 '9b9fd7d7-4f77-4efa-a74a-76c10d1dd661',
 'e84148d7-3e73-4308-8b86-cfd7b3c93af0']

To find the total collection count: print(vectordb._collection.count())

Code Section 6: Implementing a Query System

LangChain offers several retrievers for querying the vector database. After some experimentation, I settled on the following five-step approach:

Initial Search: Fetch the 20 documents most similar to the query, along with ChromaDB's distance scores (lower scores mean better matches).
Threshold Filtering: Keep only documents with a score of 0.4 or lower, which filters out weak matches.
Embedding Preparation: Convert the query and the filtered documents into embeddings (numerical vectors) for further comparison.
MMR Selection: Apply Maximal Marginal Relevance (MMR) to re‑rank the results so that the final selection (up to 5 documents) is both highly relevant and diverse.
Compression Layer: Pass the selected documents through a language model (GPT‑3.5) that compresses them and extracts only the most useful context, making the output leaner and more focused.

This way, we retain only the most useful context, eliminating noise for accurate and focused document retrieval.
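Before wiring this up with LangChain, it may help to see the threshold and MMR steps in isolation. The sketch below is a simplified, dependency-free illustration and not LangChain's implementation: the 2-dimensional vectors are invented, `cosine`, `mmr_select`, and `retrieve` are hypothetical helper names, cosine distance (1 minus cosine similarity) stands in for Chroma's score, and the LLM compression step is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Greedy MMR: repeatedly pick the document with the best trade-off between
    similarity to the query and similarity to documents already picked."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

def retrieve(query_vec, docs, max_distance, k, lambda_mult):
    """Threshold filter followed by MMR selection; returns document names."""
    # Keep only candidates whose distance is at or below the threshold
    kept = [(name, vec) for name, vec in docs if 1 - cosine(query_vec, vec) <= max_distance]
    picked = mmr_select(query_vec, [vec for _, vec in kept], k, lambda_mult)
    return [kept[i][0] for i in picked]

# Invented toy data: two near-identical documents, one distinct but relevant, one off-topic.
query = [1.0, 0.0]
docs = [
    ("A", [1.0, 0.30]),
    ("A-near-duplicate", [1.0, 0.32]),
    ("B", [1.0, -0.35]),
    ("off-topic", [0.0, 1.0]),
]

print(retrieve(query, docs, max_distance=0.4, k=2, lambda_mult=0.7))
print(retrieve(query, docs, max_distance=0.4, k=2, lambda_mult=1.0))
```

With these toy vectors, the threshold removes the off-topic document in both calls. At lambda_mult=0.7 the selection is A and B, because the near-duplicate is penalised for repeating A; at lambda_mult=1.0 diversity is ignored and the near-duplicate is picked instead of B.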

import numpy as np
from typing import List
from langchain.schema import BaseRetriever, Document
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.vectorstores.utils import maximal_marginal_relevance

score_threshold = 0.4  # lower = better in Chroma distances

class MMRThresholdRetriever(BaseRetriever):
    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Step 1: fetch more docs with scores
        docs_and_scores = vectordb.similarity_search_with_score(query, k=20)  # k = number of candidates to fetch (top 20)

        # Step 2: filter by threshold
        filtered = [(doc, score) for doc, score in docs_and_scores if score <= score_threshold]
        if not filtered:
            return []

        # Step 3: prepare embeddings as NumPy arrays (documents are embedded in a single batched call)
        query_emb = np.array(vectordb._embedding_function.embed_query(query))
        doc_embs = np.array(vectordb._embedding_function.embed_documents([doc.page_content for doc, _ in filtered]))

        # Step 4: apply MMR; k caps the number of documents returned (at most 5)
        selected_indices = maximal_marginal_relevance(query_emb, doc_embs, k=5, lambda_mult=0.7)
        return [filtered[i][0] for i in selected_indices]

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        return self._get_relevant_documents(query)

# Instantiate retriever
mmr_retriever = MMRThresholdRetriever()

# Add compression
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_retriever=mmr_retriever,
    base_compressor=compressor
)

def pretty_print_docs(docs):
    for i, d in enumerate(docs):
        # Print the document index, metadata, and content
        print(f"Document {i+1}:")
        #print("Metadata:")
        for key, value in d.metadata.items():
            print(f"  {key}: {value}")
        #print("\nContent:")
        print(d.page_content)
        print("\n" + "-" * 100 + "\n")

You can explore how the output changes by tweaking two values in the code above:

score_threshold: Think of it as the maximum distance a result is allowed to have. Because Chroma returns distances, lower scores mean closer matches.

Lower → Only very strong matches make it through (more precise, but you might miss some useful ones).
Higher → Lets in weaker matches too (more coverage, but more noise).

lambda_mult: The balance between “most relevant” and “most different.”

Higher → Prioritises relevance over variety (results are closely related, but may be repetitive).
Lower → Prioritises variety over strict relevance (more diverse, but some results may be less on‑point).
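To get a feel for the threshold before touching the real collection, here is a small sketch that applies the same `score <= score_threshold` filter used in the retriever to a list of invented distances. The numbers and the `surviving` helper are hypothetical; real scores depend on your documents and embedding model.

```python
# Hypothetical distances returned for one query (lower = closer match).
distances = [0.12, 0.25, 0.38, 0.41, 0.55, 0.72]

def surviving(distances, score_threshold):
    """Apply the same `score <= score_threshold` filter as the retriever."""
    return [d for d in distances if d <= score_threshold]

for threshold in (0.2, 0.4, 0.6):
    kept = surviving(distances, threshold)
    print(f"score_threshold={threshold}: kept {len(kept)} of {len(distances)} -> {kept}")
```

Printing the scores from similarity_search_with_score for a few typical questions is a reasonable way to choose a starting value for your own notes.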

It's time to put the retrieval system to the test and see how it performs!

question = "how to create key logger?"
print("\n=== Source Documents ===\n")
compressed_docs = compression_retriever.invoke(question)  # invoke() supersedes the deprecated get_relevant_documents()
pretty_print_docs(compressed_docs)

Although the retriever was allowed to return up to 5 documents, it returned only 1: it identified that document as the most relevant and filtered out the others because they were less relevant. For the small set of documents I vectorised, this works very well.

Making Notes Conversational

Once the documents are transformed into vector embeddings and stored in the database, we have a smart, queryable knowledge base that opens up possibilities beyond simple search. The vectorised format lets us integrate with large language models (LLMs): relevant information is pulled from the knowledge base automatically and handed to the model as context, which can then inform decision-making, generate reports, or even trigger follow-up actions in agentic workflows. This pattern is called Retrieval-Augmented Generation (RAG).

The beauty of this approach is that our documents become dynamic, accessible knowledge, turning our static note collection into an intelligent, responsive resource.

A Simple RAG Demo

We'll use a short code snippet to demonstrate RAG with RetrievalQA, a LangChain chain that lets an LLM answer questions using context pulled from custom data sources. Grounding the model in retrieved context tends to produce answers that are more accurate, specific, and context-aware.

from langchain.chains import RetrievalQA

# Create the RetrievalQA chain with the compression retriever
qa_llm = OpenAI(
    temperature=0,
    model="gpt-3.5-turbo-instruct"
)

qa_chain = RetrievalQA.from_chain_type(
    llm=qa_llm,
    retriever=compression_retriever,   # <-- our threshold-aware MMR + compression
    return_source_documents=True,
    chain_type="stuff"  # you can change to "map_reduce" or "refine" if needed
)

question = "how to create key logger?"
result = qa_chain.invoke({"query": question})

print("\n=== Answer ===\n")
print(result["result"])

print("\n=== Source Documents ===\n")
pretty_print_docs(result["source_documents"])
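If the "stuff" chain type above feels opaque, the idea is simply that every retrieved document is placed into a single prompt alongside the question. The sketch below illustrates that idea only: the `Doc` class is a stand-in for LangChain's Document, and the template wording and `build_stuff_prompt` helper are my own assumptions, not LangChain's built-in prompt.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for LangChain's Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical template; LangChain's built-in prompt wording differs.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say you don't know.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)

def build_stuff_prompt(question, docs):
    """Concatenate every retrieved document into one context block ("stuffing")."""
    context = "\n\n".join(d.page_content for d in docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Invented example documents standing in for retrieved notes.
docs = [
    Doc("Chroma persists collections to the directory given by persist_directory.",
        {"path": "notes/chroma.md"}),
    Doc("MMR balances relevance against diversity when selecting documents.",
        {"path": "notes/retrieval.md"}),
]

prompt = build_stuff_prompt("Where does Chroma store its data?", docs)
print(prompt)
```

Because everything lands in one prompt, this style works best when the retriever returns a handful of short passages, which is exactly what the threshold, MMR, and compression steps are there to ensure.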

If you found this approach helpful, I'd love to hear how you adapt it to your own workflow. Every note-taking setup is unique, and that's precisely what makes these kinds of experiments so rewarding.

Happy note-querying!