<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mounesh Kodi</title>
    <description>The latest articles on Forem by Mounesh Kodi (@mounesh_k).</description>
    <link>https://forem.com/mounesh_k</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3700847%2F72429caf-4809-4298-87e1-c0c46a588240.png</url>
      <title>Forem: Mounesh Kodi</title>
      <link>https://forem.com/mounesh_k</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mounesh_k"/>
    <language>en</language>
    <item>
      <title>How I Built an Offline-First RAG System That’s 10x Faster (at 19)</title>
      <dc:creator>Mounesh Kodi</dc:creator>
      <pubDate>Fri, 09 Jan 2026 05:49:00 +0000</pubDate>
      <link>https://forem.com/mounesh_k/how-i-built-an-offline-first-rag-system-thats-10x-faster-at-19-3l6i</link>
      <guid>https://forem.com/mounesh_k/how-i-built-an-offline-first-rag-system-thats-10x-faster-at-19-3l6i</guid>
      <description>&lt;p&gt;&lt;em&gt;A technical deep-dive into IntraMind - how I built a production RAG system with 60% context compression and sub-10ms&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built IntraMind, an offline-first RAG system that achieves:&lt;/p&gt;

&lt;p&gt;10x faster retrieval than baseline systems&lt;br&gt;
 100% offline operation (zero cloud dependencies)&lt;br&gt;
 40-60% context compression with custom algorithm&lt;br&gt;
 Sub-10ms cached queries&lt;br&gt;
 470+ documents indexed in production&lt;/p&gt;

&lt;p&gt;Tech Stack: Python, ChromaDB, Ollama, Sentence Transformers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
As a CS student, I was drowning in research papers. Over 400 PDFs, DOCX files, and scanned documents with no efficient way to search through them.&lt;br&gt;
Existing solutions sucked:&lt;br&gt;
❌ Cloud RAG systems - Not uploading my university's research papers to some random cloud&lt;br&gt;
❌ Local alternatives - Slow (30s+ per query), memory-heavy (4GB+), terrible context handling&lt;br&gt;
❌ Enterprise tools - $10k/year for features I didn't need&lt;br&gt;
So I did what any sleep-deprived student would do: built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;br&gt;
IntraMind follows a classic RAG pipeline but with significant optimizations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│  Documents  │ (PDF, DOCX, Images)
└──────┬──────┘
       │ OCR + Parsing
       ▼
┌─────────────┐
│  Chunking   │ Semantic boundary-aware
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Embedding  │ all-MiniLM-L6-v2 (384-dim)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  ChromaDB   │ Persistent vector store
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Query     │ Semantic search (&amp;lt;1ms)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Neuro-Weaver│ Context compression (our secret sauce)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   LLM       │ Local Ollama inference
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Answer    │ + Source citations
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Component Breakdown&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding Model&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Why? 384-dim vectors, excellent speed/quality balance
# Optimized for: academic content, technical documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Vector Database&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"}
)

# Persistent storage, disk-based, sub-1ms retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Async Processing Pipeline&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

async def process_batch(documents):
    tasks = [process_document(doc) for doc in documents]
    return await asyncio.gather(*tasks)

# Result: 73% faster batch uploads (45s → 12s for 3 PDFs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
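&lt;p&gt;If you want to run the pattern as-is, here is a self-contained sketch: &lt;code&gt;process_document&lt;/code&gt; below is a hypothetical stand-in for the real parse/chunk/embed step, simulating the I/O-bound work that makes the fan-out worthwhile.&lt;/p&gt;

```python
import asyncio

async def process_document(doc: str) -> str:
    # Hypothetical stand-in for parsing + chunking + embedding one file;
    # the await lets other documents make progress while this one waits on I/O
    await asyncio.sleep(0)
    return f"processed:{doc}"

async def process_batch(documents):
    # Fan out one task per document and await them all concurrently
    return await asyncio.gather(*(process_document(d) for d in documents))

results = asyncio.run(process_batch(["paper_a.pdf", "paper_b.pdf"]))
```

&lt;p&gt;The speedup comes from overlapping waits, not parallel CPU work: with CPU-bound parsing you would combine this with a process pool instead.&lt;/p&gt;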

&lt;p&gt;&lt;strong&gt;The Innovation: Neuro-Weaver&lt;/strong&gt;&lt;br&gt;
This is where it gets interesting. Most RAG systems waste the context window by dumping redundant chunks into the prompt.&lt;br&gt;
Neuro-Weaver compresses retrieved context through semantic deduplication:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def neuro_weaver_compress(chunks, query, threshold=0.85):
    """
    Proprietary context compression algorithm.
    Achieves 40-60% token reduction with &amp;lt;2% accuracy loss.
    """
    # Step 1: Rank chunks by query relevance
    scored_chunks = []
    for chunk in chunks:
        similarity = cosine_similarity(embed(query), embed(chunk))
        scored_chunks.append((chunk, similarity))
    scored_chunks.sort(key=lambda x: x[1], reverse=True)

    # Step 2: Extract query-relevant sentences
    sentences = []
    for chunk, score in scored_chunks:
        if score &amp;gt; 0.7:  # Relevance threshold
            sentences.extend(extract_sentences(chunk))

    # Step 3: Remove semantic duplicates
    unique_sentences = []
    for sent in sentences:
        is_duplicate = False
        for existing in unique_sentences:
            if cosine_similarity(embed(sent), embed(existing)) &amp;gt; threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique_sentences.append(sent)

    # Step 4: Reconstruct context with semantic boundaries
    return reconstruct_with_transitions(unique_sentences)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;p&gt;Query-aware extraction (not just top-k chunks)&lt;br&gt;
Cosine similarity deduplication (threshold: 0.85)&lt;br&gt;
Semantic boundary preservation&lt;br&gt;
Adaptive compression based on content type&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
Input context:  4000 chars (avg)&lt;br&gt;
Output context: 1600 chars (avg)&lt;br&gt;
Reduction:      60%&lt;br&gt;
Accuracy loss:  &amp;lt;2% (measured on academic Q&amp;amp;A)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks&lt;/strong&gt;&lt;br&gt;
I ran comprehensive tests on v1.0 vs v1.1:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;v1.0&lt;/th&gt;&lt;th&gt;v1.1&lt;/th&gt;&lt;th&gt;Improvement&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Batch Upload (3 PDFs)&lt;/td&gt;&lt;td&gt;45s&lt;/td&gt;&lt;td&gt;12s&lt;/td&gt;&lt;td&gt;⚡ 73% faster&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cold Query&lt;/td&gt;&lt;td&gt;15s&lt;/td&gt;&lt;td&gt;14.98s&lt;/td&gt;&lt;td&gt;Baseline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cached Query&lt;/td&gt;&lt;td&gt;15s&lt;/td&gt;&lt;td&gt;0.01s&lt;/td&gt;&lt;td&gt;🚀 1500x faster&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context Size&lt;/td&gt;&lt;td&gt;4000 chars&lt;/td&gt;&lt;td&gt;1600 chars&lt;/td&gt;&lt;td&gt;📉 60% smaller&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Memory Usage&lt;/td&gt;&lt;td&gt;2.5 GB&lt;/td&gt;&lt;td&gt;1.5 GB&lt;/td&gt;&lt;td&gt;💾 40% reduction&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Model Size&lt;/td&gt;&lt;td&gt;1.8 GB&lt;/td&gt;&lt;td&gt;986 MB&lt;/td&gt;&lt;td&gt;📦 45% smaller&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Caching Strategy&lt;/strong&gt;&lt;br&gt;
The 1500x speedup comes from a hybrid approach:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from functools import lru_cache
import hashlib

@lru_cache(maxsize=128)
def cached_query(query_hash):
    # LRU cache for frequent queries
    return retrieve_and_generate(query_hash)

def pre_warm_cache():
    # Pre-compute common query patterns on startup
    common_queries = load_query_patterns()
    for q in common_queries:
        cached_query(hash(q))

# Adaptive learning: track query patterns, pre-warm cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
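&lt;p&gt;One caveat worth noting: Python's built-in &lt;code&gt;hash()&lt;/code&gt; is salted per interpreter run, so keys built with it cannot survive a restart or be shared across processes. A stable alternative (a sketch of mine, using the &lt;code&gt;hashlib&lt;/code&gt; module that the snippet above imports) is to digest a normalized form of the query:&lt;/p&gt;

```python
import hashlib

def query_key(query: str) -> str:
    # Normalize whitespace and case so trivially different phrasings
    # share a cache entry, then take a SHA-256 digest that is stable
    # across interpreter runs (unlike the built-in hash())
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

&lt;p&gt;Stable keys also make it possible to persist the cache to disk between sessions, which matters for an offline-first system.&lt;/p&gt;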

&lt;p&gt;Real-World Use Case&lt;br&gt;
Scenario: University research lab with 500+ ML papers&lt;br&gt;
Before IntraMind:&lt;/p&gt;

&lt;p&gt;Manual Ctrl+F across PDFs: ~5 min per query&lt;br&gt;
Organizing papers: Nightmare&lt;br&gt;
Cloud concerns: Can't upload sensitive research&lt;br&gt;
Cost: $200/month for Mendeley + cloud RAG tools&lt;/p&gt;

&lt;p&gt;After IntraMind:&lt;/p&gt;

&lt;p&gt;Semantic search: &amp;lt;10ms (cached), ~15s (cold)&lt;br&gt;
Zero organization needed: AI handles retrieval&lt;br&gt;
Complete privacy: Everything local&lt;br&gt;
Cost: $0 (runs on existing hardware)&lt;/p&gt;

&lt;p&gt;Actual query example:&lt;br&gt;
Q: "What are the different types of persistence in data structures?"&lt;/p&gt;

&lt;p&gt;A: "There are three main types of persistence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Partial Persistence: Only past versions are accessible&lt;/li&gt;
&lt;li&gt;Full Persistence: All versions can be accessed and modified&lt;/li&gt;
&lt;li&gt;Confluent Persistence: Allows merging of different versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sources: Advanced_DS.pdf (similarity: 0.89), &lt;br&gt;
         Algorithms_Book.pdf (similarity: 0.76)"&lt;/p&gt;

&lt;p&gt;Query time: 0.009s (cached)&lt;br&gt;
Context reduction: 42%&lt;/p&gt;

&lt;p&gt;Technical Challenges Solved&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OCR for Scanned PDFs&lt;br&gt;
Many academic papers are scan-only. Solution:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytesseract
from pdf2image import convert_from_path

def extract_text_with_ocr(pdf_path):
    images = convert_from_path(pdf_path)
    text = ""
    for img in images:
        # Confidence-based filtering
        data = pytesseract.image_to_data(img, output_type='dict')
        for i, conf in enumerate(data['conf']):
            if int(conf) &amp;gt; 60:  # Only high-confidence text
                text += data['text'][i] + " "
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Context Window Management&lt;br&gt;
Early versions constantly hit LLM context limits (4096 tokens). Neuro-Weaver solved this by:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Intelligent chunk selection&lt;br&gt;
Redundancy removal&lt;br&gt;
Semantic compression&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Encryption for Compliance&lt;br&gt;
Added AES-256-GCM for HIPAA/GDPR compliance:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def encrypt_document(data, password):
    salt = os.urandom(16)
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=100000  # NIST recommended minimum
    )
    key = kdf.derive(password.encode())

    aesgcm = AESGCM(key)
    nonce = os.urandom(12)
    ciphertext = aesgcm.encrypt(nonce, data, None)

    # Store the salt and nonce with the ciphertext so the
    # document can be decrypted later
    return salt + nonce + ciphertext
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
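&lt;p&gt;For completeness, here is a hypothetical decryption counterpart (not part of the original write-up). It assumes the 16-byte salt and 12-byte nonce are stored prepended to the ciphertext, and reuses the same PBKDF2 parameters as encryption; names like &lt;code&gt;decrypt_document&lt;/code&gt; are mine.&lt;/p&gt;

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_key(password: str, salt: bytes) -> bytes:
    # Must match the encryption side exactly: PBKDF2-HMAC-SHA256,
    # 32-byte key, 100000 iterations
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=100000,
    )
    return kdf.derive(password.encode())

def decrypt_document(blob: bytes, password: str) -> bytes:
    # Assumed layout: 16-byte salt, then 12-byte nonce, then ciphertext+tag
    salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
    key = derive_key(password, salt)
    # AESGCM.decrypt raises InvalidTag on a wrong password or tampered data
    return AESGCM(key).decrypt(nonce, ciphertext, None)
```

&lt;p&gt;Because GCM authenticates the ciphertext, a wrong password fails loudly instead of silently producing garbage, which is the behavior you want for compliance-sensitive documents.&lt;/p&gt;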

&lt;p&gt;What I Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Performance &amp;gt; Features
Users don't care about 47 cool features if queries take 30 seconds.&lt;/li&gt;
&lt;li&gt;Privacy is a moat
Organizations are desperate for AI that doesn't require cloud uploads. This is a HUGE market.&lt;/li&gt;
&lt;li&gt;Quantization is underrated
Q4_K_M quantization gave me:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;40% model size reduction&lt;br&gt;
2-3x inference speedup&lt;br&gt;
&amp;lt;2% accuracy loss&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Good docs &amp;gt; Marketing
My technical documentation converted more pilot partners than any marketing copy.&lt;/li&gt;
&lt;li&gt;Offline != Slow
With proper optimization, offline systems can match or beat cloud performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open Questions I'm Exploring&lt;/p&gt;

&lt;p&gt;Multi-modal RAG: How to handle equations, charts, and diagrams in research papers?&lt;br&gt;
Collaborative knowledge bases: Can multiple researchers share a vector store without centralization?&lt;br&gt;
Active learning: Should the system learn from user feedback to improve retrieval over time?&lt;br&gt;
Cross-lingual RAG: How to handle papers in different languages efficiently?&lt;/p&gt;

&lt;p&gt;Current Status &amp;amp; What's Next&lt;br&gt;
IntraMind is currently in pilot phase. We're working with 3 research institutions and looking for 2 more partners.&lt;br&gt;
Roadmap:&lt;/p&gt;

&lt;p&gt;✅ v1.0: Basic RAG pipeline&lt;br&gt;
✅ v1.1: Neuro-Weaver compression + caching&lt;br&gt;
🚧 v1.2: Multi-modal support (images, tables)&lt;br&gt;
📋 v2.0: Collaborative features&lt;br&gt;
📋 Research paper submission (Neuro-Weaver algorithm)&lt;br&gt;
📋 Patent filing exploration&lt;/p&gt;

&lt;p&gt;Tech Stack:&lt;/p&gt;

&lt;p&gt;Backend: Python 3.12+, FastAPI&lt;br&gt;
Embeddings: Sentence Transformers (all-MiniLM-L6-v2)&lt;br&gt;
Vector DB: ChromaDB (persistent)&lt;br&gt;
LLM: Ollama (local inference, quantized models)&lt;br&gt;
OCR: Tesseract + pdf2image&lt;br&gt;
Encryption: AES-256-GCM&lt;/p&gt;

&lt;p&gt;Try It Yourself&lt;br&gt;
IntraMind is open for pilot partners:&lt;br&gt;
Requirements:&lt;/p&gt;

&lt;p&gt;Research institution/lab/library&lt;br&gt;
453+ documents to index&lt;br&gt;
Willing to provide honest feedback&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;p&gt;Free 1 month deployment and setup&lt;br&gt;
Custom configuration for your use case&lt;br&gt;
Priority support during pilot&lt;br&gt;
Influence on roadmap&lt;/p&gt;

&lt;p&gt;Discussion&lt;br&gt;
For the community:&lt;/p&gt;

&lt;p&gt;What context compression techniques have you tried in RAG systems?&lt;br&gt;
How do you handle OCR quality issues in academic papers?&lt;br&gt;
What's your experience with offline vs. cloud LLM inference?&lt;/p&gt;

&lt;p&gt;For students/researchers:&lt;br&gt;
Would an offline RAG system solve any pain points you currently have? What features would be most valuable?&lt;br&gt;
Drop a comment or reach out—always happy to discuss RAG optimization, privacy-first AI, or offline architectures!&lt;/p&gt;

&lt;p&gt;About Me:&lt;br&gt;
I'm Mounesh Kodi, 19, founder of CruxLabx. I build AI systems that prioritize privacy and performance.&lt;/p&gt;

&lt;p&gt;💼 LinkedIn: linkedin.com/in/mounesh-kodi&lt;br&gt;
🐙 GitHub: github.com/crux-ecosystem&lt;br&gt;
🌐 Website: cruxlabx.dev&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Building in public at 19. Follow my journey!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
