<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nrk Raju Guthikonda</title>
    <description>The latest articles on Forem by Nrk Raju Guthikonda (@kennedyraju55).</description>
    <link>https://forem.com/kennedyraju55</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875587%2F0e0fea57-3e20-4e0a-bf89-f91e1bb899e0.png</url>
      <title>Forem: Nrk Raju Guthikonda</title>
      <link>https://forem.com/kennedyraju55</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kennedyraju55"/>
    <language>en</language>
    <item>
      <title>Your Voice Assistant Doesn't Need the Cloud — Here's How I Built 5 Offline NLP Tools</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:35:15 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/your-voice-assistant-doesnt-need-the-cloud-heres-how-i-built-5-offline-nlp-tools-4n44</link>
      <guid>https://forem.com/kennedyraju55/your-voice-assistant-doesnt-need-the-cloud-heres-how-i-built-5-offline-nlp-tools-4n44</guid>
      <description>&lt;p&gt;Every time I build an AI-powered tool that requires an internet connection, I feel a small pang of guilt. We've normalized shipping software that stops working the moment a cloud API goes down, a subscription lapses, or a user happens to be on an airplane. But here's the thing: most NLP tasks — sentiment analysis, text summarization, conversational AI, even voice assistants — don't &lt;em&gt;need&lt;/em&gt; the cloud anymore.&lt;/p&gt;

&lt;p&gt;Over the past year, I've built a series of open-source tools that prove this point. They handle voice calls, language tutoring, sentiment dashboards, news digestion, and research paper Q&amp;amp;A — all running locally with &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; and models like Gemma 3 4B. No API keys. No cloud bills. No data leaving your machine.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through the patterns I've found most effective for building offline-first NLP applications in Python, with real code from five of my projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Offline NLP Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;The conversation around AI tooling is dominated by cloud-first thinking. GPT-4o, Claude, Gemini — they're brilliant, but they come with strings attached:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Every prompt you send is processed on someone else's server. For healthcare data, legal documents, or personal conversations, that's a non-starter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: API calls add up fast. A sentiment analysis pipeline processing 10,000 documents a day can cost hundreds of dollars per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Cloud APIs have rate limits, outages, and deprecation cycles. Your local GPU doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: A local model on an M-series Mac or a decent NVIDIA card starts streaming tokens in milliseconds, with no network round trip or provider queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience building search and retrieval systems, I've learned that the best AI tool is the one that's always available. Local LLMs make that possible for a surprising range of NLP tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Ollama as Your Local AI Runtime
&lt;/h2&gt;

&lt;p&gt;Every project I'll discuss uses the same foundation: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; running a local model (typically Gemma 3 4B). The setup is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
ollama pull gemma3:4b

&lt;span class="c"&gt;# Verify it's running&lt;/span&gt;
ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the Python integration pattern I use across all my projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a prompt to the local Ollama instance and return the response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function is the beating heart of every tool I build. It's simple, it's reliable, and it works identically whether you're on a MacBook, a Linux workstation, or a Windows machine with WSL.&lt;/p&gt;
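&lt;p&gt;A quick smoke test, assuming Ollama is running and &lt;code&gt;gemma3:4b&lt;/code&gt; has been pulled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One-off sanity check for the helper above
print(query_local_llm("In one sentence, what is retrieval-augmented generation?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;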

&lt;h2&gt;
  
  
  Project 1: CallPilot — A Voice AI Assistant
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/callpilot" rel="noopener noreferrer"&gt;CallPilot&lt;/a&gt; is probably the most ambitious project in this collection. It's an AI-powered outbound phone call assistant: you give it a phone number and instructions ("Book a dentist appointment for Tuesday at 3pm"), and it handles the entire conversation.&lt;/p&gt;

&lt;p&gt;The architecture bridges Twilio's real-time voice streaming with an AI backend, using RAG (Retrieval-Augmented Generation) with ChromaDB to give the AI access to personal documents like insurance cards or medical records during calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve relevant context from local vector store for RAG.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chroma_db_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duckdb+parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./vectorstore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight here is that voice AI doesn't have to mean "send everything to a cloud transcription service." The RAG pipeline runs entirely locally — your documents are chunked, embedded, and stored in a local ChromaDB instance. When the AI needs context during a call, it queries the vector store on your machine.&lt;/p&gt;
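&lt;p&gt;For illustration, here's a minimal ingestion sketch using ChromaDB's persistent client and its default embedding function (the repo's real pipeline may chunk and embed differently):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

def ingest_documents(docs: list[str], persist_dir: str = "./vectorstore") -&amp;gt; None:
    """Chunk plain-text documents and persist them in a local vector store."""
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("documents")
    for doc_id, doc in enumerate(docs):
        # Naive fixed-size chunking (~1000 characters); real pipelines
        # usually split on sentence or section boundaries instead
        chunks = [doc[i:i + 1000] for i in range(0, len(doc), 1000)]
        collection.add(
            documents=chunks,
            ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;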

&lt;p&gt;While CallPilot currently uses OpenAI's Realtime API for the voice streaming component (real-time bidirectional audio is still a hard problem for local models), the entire knowledge retrieval pipeline is local. As local speech-to-text and text-to-speech models improve, the goal is to make this fully offline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 2: Language Learning Bot — Conversational AI for Education
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/language-learning-bot" rel="noopener noreferrer"&gt;Language Learning Bot&lt;/a&gt; is a polyglot companion that supports 15 languages through conversation practice, vocabulary drills, and structured lessons — all powered by a local LLM via Ollama.&lt;/p&gt;

&lt;p&gt;The conversation engine adapts to beginner, intermediate, or advanced levels and provides real-time corrections with grammar explanations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tutor_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build a language tutor system prompt for the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a friendly &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; language tutor.
The student&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s level is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Rules:
- Respond primarily in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with English translations in parentheses
- Correct any grammar mistakes gently, explaining the rule
- Adapt vocabulary complexity to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; level
- Include cultural context when relevant
- End each response with a follow-up question to keep practicing

Student says: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Usage with local Ollama
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;create_tutor_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spanish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beginner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yo quiero ir al parque&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What makes this project compelling for offline use is the privacy angle. Language learners make mistakes — that's the whole point. Having those mistakes processed locally, never logged on a remote server, creates a psychologically safer learning environment. Every chat session, vocabulary list, and progress metric stays on the user's machine in a local JSON store.&lt;/p&gt;
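&lt;p&gt;The persistence layer is deliberately boring. A minimal sketch of the idea (the field names and file name here are my own, not the repo's schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

def save_session(session: dict, store: str = "progress.json") -&amp;gt; None:
    """Append one chat session to a local JSON file; nothing leaves disk."""
    path = Path(store)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(session)
    path.write_text(json.dumps(history, indent=2, ensure_ascii=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;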

&lt;h2&gt;
  
  
  Project 3: Sentiment Analysis Dashboard — Text Analytics with Streamlit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;Sentiment Analysis Dashboard&lt;/a&gt; processes text files through an LLM-powered classification pipeline with confidence scores, trend detection, and word cloud generation.&lt;/p&gt;

&lt;p&gt;The core analysis pattern uses structured prompting to get consistent, parseable output from the local LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze sentiment of text using the local LLM with structured output.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze the sentiment of the following text.
Return ONLY a JSON object with these fields:
- sentiment: one of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mixed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
- confidence: float between 0.0 and 1.0
- key_phrases: list of 3-5 important phrases from the text
- summary: one-sentence summary of the overall tone

Text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

JSON:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;raw_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Streamlit dashboard renders these results into interactive visualizations — sentiment distribution charts, sliding-window trend analysis, and word clouds. The entire pipeline processes text in seconds per entry (with batch support), compared to the minutes per entry that manual review takes.&lt;/p&gt;

&lt;p&gt;What I find most valuable here is the consistency. A human reviewer's sentiment judgment drifts throughout the day based on fatigue and mood. The local LLM produces consistent classifications with quantified confidence scores, and it does it without sending your text data to any third party.&lt;/p&gt;
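&lt;p&gt;One caveat worth flagging: calling &lt;code&gt;json.loads&lt;/code&gt; on raw model output is brittle, because local models occasionally wrap the JSON in a Markdown fence. A small defensive wrapper (my own addition, not from the repo) keeps batch runs alive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def parse_llm_json(raw: str) -&amp;gt; dict:
    """Parse JSON from LLM output, tolerating Markdown code fences."""
    # Strip a leading ```json (or bare ```) fence and a trailing ``` if present
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;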

&lt;h2&gt;
  
  
  Project 4: News Digest Generator — Information Triage at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/news-digest-generator" rel="noopener noreferrer"&gt;News Digest Generator&lt;/a&gt; tackles information overload. Drop a folder of &lt;code&gt;.txt&lt;/code&gt; news articles on it and get back a structured, categorized digest with sentiment analysis and trend detection.&lt;/p&gt;

&lt;p&gt;The categorization pipeline is where the local LLM really shines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;num_categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Group articles into topic categories using the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;titles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Given these news articles, group them into exactly
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_categories&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; topic categories.

Articles:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Return a JSON array where each element has:
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: short category name
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_indices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: list of article index numbers
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 2-3 sentence summary of this topic cluster

JSON:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The digest output includes key headlines, topic summaries, per-article sentiment, trending themes, and a forward-looking outlook section. It's the kind of tool that journalists, analysts, and researchers can run on sensitive or proprietary content without worrying about data leakage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 5: Research Paper Q&amp;amp;A — RAG for Academic Literature
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/research-paper-qa" rel="noopener noreferrer"&gt;Research Paper Q&amp;amp;A&lt;/a&gt; lets you drop PDF research papers into a folder and ask questions about them in natural language. It uses a RAG pipeline with ChromaDB to chunk, embed, and retrieve relevant passages, then feeds them to Gemma 4 for answer generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_paper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paper_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer a question about a research paper using RAG.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve the most relevant chunks
&lt;/span&gt;    &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paper_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Based on the following excerpts from a research paper,
answer the question. Only use information from the provided excerpts.
If the answer isn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t in the excerpts, say so.

Excerpts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is perhaps the most natural fit for offline AI. Researchers often work with pre-publication papers, proprietary datasets, or materials under NDA. A local RAG pipeline means you can ask "What methodology did they use for the control group?" without that question — or the paper's content — ever touching an external server.&lt;/p&gt;
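&lt;p&gt;The &lt;code&gt;retrieve_chunks&lt;/code&gt; helper is elided above. A minimal sketch of what it could look like, assuming an in-memory ChromaDB collection with its default embedding function (the repo's implementation may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

def retrieve_chunks(question: str, chunks: list[str], top_k: int = 5) -&amp;gt; list[str]:
    """Return the chunks most semantically similar to the question."""
    client = chromadb.EphemeralClient()  # in-memory; re-embeds on every call
    collection = client.get_or_create_collection("paper")
    collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])
    results = collection.query(query_texts=[question], n_results=min(top_k, len(chunks)))
    return results["documents"][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real pipeline you'd embed once at ingestion time rather than per question; the sketch trades efficiency for brevity.&lt;/p&gt;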

&lt;h2&gt;
  
  
  Patterns I Keep Coming Back To
&lt;/h2&gt;

&lt;p&gt;After building these five projects (and many more — I'm at 116+ open-source repos now), I've seen certain patterns prove themselves repeatedly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured prompting for parseable output&lt;/strong&gt;: Always ask the LLM to return JSON with a specific schema. It makes downstream processing predictable and testable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local vector stores for RAG&lt;/strong&gt;: ChromaDB with persistent storage is lightweight enough to embed in any project. The retrieval quality with even small embedding models is excellent for domain-specific content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama as the universal runtime&lt;/strong&gt;: By standardizing on Ollama's API, every project works with any compatible model. Swap Gemma for Llama or Mistral with a single config change (sketched after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CLI-first, web-second&lt;/strong&gt;: Every project starts as a Click CLI tool, then gets a Streamlit or Gradio web UI. This ensures the core logic is clean, testable, and scriptable before any UI complexity enters the picture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Privacy by architecture, not policy&lt;/strong&gt;: When the LLM runs on &lt;code&gt;localhost:11434&lt;/code&gt;, there's no privacy policy to read. The data physically cannot leave the machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
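&lt;p&gt;That config change can be as small as one environment variable. A sketch of the pattern, reusing &lt;code&gt;query_local_llm&lt;/code&gt; from earlier (the variable name is my own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# One env var selects the model tag, so swapping gemma3:4b
# for llama3.2 or mistral is a one-line change
MODEL = os.environ.get("LOCAL_LLM_MODEL", "gemma3:4b")

print(query_local_llm("Summarize local-first AI in one sentence.", model=MODEL))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;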

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to explore any of these projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
ollama pull gemma3:4b

&lt;span class="c"&gt;# Clone any project&lt;/span&gt;
git clone https://github.com/kennedyraju55/sentiment-analysis-dashboard.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sentiment-analysis-dashboard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Run the CLI&lt;/span&gt;
python main.py analyze &lt;span class="nt"&gt;--file&lt;/span&gt; sample.txt

&lt;span class="c"&gt;# Or launch the web UI&lt;/span&gt;
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All five projects follow the same structure: install Ollama, pull a model, clone the repo, install dependencies, and run. No API keys. No account creation. No cloud configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem is evolving fast. Models are getting smaller and more capable. Ollama recently added vision model support, which opens up entirely new offline use cases — document OCR, image-based Q&amp;amp;A, multimodal assistants. I'm actively building tools that leverage these capabilities.&lt;/p&gt;

&lt;p&gt;The thesis is simple: &lt;strong&gt;if your NLP tool requires an internet connection and it doesn't strictly need one, you're shipping a worse product than you could be.&lt;/strong&gt; Local LLMs have crossed the quality threshold for production use in dozens of NLP tasks. The tools I've shared here prove it.&lt;/p&gt;

&lt;p&gt;Every one of these projects is MIT-licensed and open source. Clone them, break them, improve them. That's the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Nrk Raju Guthikonda&lt;/strong&gt; is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, working on semantic indexing and retrieval-augmented generation. Outside of work, he maintains 116+ open-source repositories exploring AI, NLP, healthcare tech, developer tools, and creative applications — all built with local LLMs and a privacy-first philosophy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐙 GitHub: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✍️ Blog: &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 LinkedIn: &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;linkedin.com/in/nrk-raju-guthikonda-504066a8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>nlp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Built 5 AI Developer Tools That Run Entirely on My Laptop — No API Keys, No Cloud, No Limits</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:29:52 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/i-built-5-ai-developer-tools-that-run-entirely-on-my-laptop-no-api-keys-no-cloud-no-limits-234</link>
      <guid>https://forem.com/kennedyraju55/i-built-5-ai-developer-tools-that-run-entirely-on-my-laptop-no-api-keys-no-cloud-no-limits-234</guid>
      <description>&lt;p&gt;Every developer has felt the friction: you want AI to help with a mundane task — writing standup notes, reviewing a pull request, generating boilerplate — but the moment you reach for a cloud API, you hit rate limits, accumulate costs, or worse, realize you can't send proprietary code to a third-party endpoint.&lt;/p&gt;

&lt;p&gt;What if the AI lived on your machine? No API keys. No network dependency. No billing surprises. Just a local model serving intelligent responses over localhost.&lt;/p&gt;

&lt;p&gt;Over the past year, I've built a suite of open-source developer productivity tools that run entirely on local LLMs using &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; and Google's Gemma model family. In this post, I'll walk through the architecture, share real code, and explain why local-first AI is the most practical path for developer tooling today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Local LLMs for Developer Tools?
&lt;/h2&gt;

&lt;p&gt;Cloud-hosted LLMs are powerful, but they carry trade-offs that matter in daily engineering workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost accumulates fast.&lt;/strong&gt; A team of ten engineers each making 50 AI-assisted queries per day burns through API credits quickly. Local inference is free after the initial model download.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline-first matters.&lt;/strong&gt; Planes, coffee shops with spotty Wi-Fi, corporate VPNs that block external endpoints — local models don't care.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy is non-negotiable.&lt;/strong&gt; When you're reviewing code from a private repository or generating reports that reference internal project names, sending that context to a remote API is a risk. Local inference keeps everything on-disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency is predictable.&lt;/strong&gt; No cold starts, no queue wait times, no variable response times based on provider load. A 4B parameter model on a modern laptop with 16 GB RAM responds in 1–3 seconds consistently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience building production search and retrieval systems, I've learned that the best developer tools are the ones with zero friction to adopt. Local LLMs eliminate the biggest friction point: setup and credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: Ollama + Gemma + FastAPI
&lt;/h2&gt;

&lt;p&gt;The architecture I've converged on across multiple projects is deliberately simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│              Developer's Laptop              │
│                                              │
│  ┌──────────┐    HTTP     ┌──────────────┐  │
│  │  FastAPI  │ ◄────────► │   Ollama      │  │
│  │  App      │  localhost  │   (Gemma 4)   │  │
│  │  :8000    │   :11434    │   4B params   │  │
│  └──────────┘             └──────────────┘  │
│       ▲                                      │
│       │  Browser / CLI / IDE Plugin          │
│  ┌──────────┐                                │
│  │   User   │                                │
│  └──────────┘                                │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; handles model management and inference. One command pulls a model, and it serves an OpenAI-compatible API on &lt;code&gt;localhost:11434&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 3 (4B)&lt;/strong&gt; is the sweet spot — small enough to run on laptops without a dedicated GPU, capable enough for code understanding, summarization, and generation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; provides the application layer: prompt engineering, input validation, structured output parsing, and a clean UI or CLI interface.&lt;/p&gt;
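&lt;p&gt;Because that OpenAI-compatible endpoint lives under &lt;code&gt;/v1&lt;/code&gt;, you can also point the standard &lt;code&gt;openai&lt;/code&gt; client at Ollama. A quick sketch (the API key is a placeholder; Ollama ignores it, but the client requires one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;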

&lt;h2&gt;
  
  
  Getting Started: Ollama in 60 Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama (macOS/Linux)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull the model (one-time ~2.5 GB download)&lt;/span&gt;
ollama pull gemma3:4b

&lt;span class="c"&gt;# Verify it's running&lt;/span&gt;
curl http://localhost:11434/api/tags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, download the installer from &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt; and Ollama runs as a background service automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 1: AI Standup Generator
&lt;/h2&gt;

&lt;p&gt;Every morning, the same ritual: open your git log, skim through Jira tickets, and type up a standup update that nobody will remember five minutes later. The &lt;a href="https://github.com/kennedyraju55/standup-generator" rel="noopener noreferrer"&gt;standup-generator&lt;/a&gt; automates this entirely.&lt;/p&gt;

&lt;p&gt;You feed it bullet points about what you worked on, and the local LLM transforms them into a structured standup report with "Yesterday," "Today," and "Blockers" sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_standup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a concise engineering standup assistant.
Given these raw notes, produce a structured standup report
with sections: Yesterday, Today, Blockers.
Keep each bullet under 15 words.

Raw notes:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_notes&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low temperature (0.3)&lt;/strong&gt; keeps output deterministic — standups shouldn't be creative writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream disabled&lt;/strong&gt; for simplicity in CLI/API mode; enable it for real-time UI feedback (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;httpx over requests&lt;/strong&gt; because it's async-friendly when you graduate to FastAPI endpoints.&lt;/li&gt;
&lt;/ul&gt;
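&lt;p&gt;When streaming is enabled, Ollama returns newline-delimited JSON objects, each carrying a &lt;code&gt;response&lt;/code&gt; fragment. A minimal sketch of consuming that stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import httpx

def stream_standup(prompt: str) -&amp;gt; None:
    """Print the standup as tokens arrive instead of waiting for the full reply."""
    with httpx.stream(
        "POST",
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:4b", "prompt": prompt, "stream": True},
        timeout=60.0,
    ) as response:
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
    print()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;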

&lt;h2&gt;
  
  
  Project 2: AI Code Review Bot
&lt;/h2&gt;

&lt;p&gt;Code reviews are where local AI shines brightest. You absolutely should not send your team's proprietary code to a third-party API for review. The &lt;a href="https://github.com/kennedyraju55/code-review-bot" rel="noopener noreferrer"&gt;code-review-bot&lt;/a&gt; runs a local Gemma model to analyze diffs and surface issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior code reviewer. Analyze this code for:
1. Bugs or logic errors
2. Security vulnerabilities
3. Performance concerns
4. Readability improvements

Be specific. Reference line numbers. Skip style nitpicks.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
{code}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
    response = httpx.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.2, "num_ctx": 8192},
        },
        timeout=60.0,
    )
    return response.json()["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;num_ctx: 8192&lt;/code&gt; — this extends the context window so the model can ingest larger files. For a 4B model, 8K tokens is the practical ceiling before quality degrades.&lt;/p&gt;
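&lt;p&gt;Since oversized files silently degrade output, I'd guard the input before sending it. A rough sketch using a characters-per-token heuristic (the 3:1 ratio is an approximation, not something Ollama reports):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def clamp_to_context(code: str, max_chars: int = 24_000) -&amp;gt; str:
    """Keep the prompt under roughly 8K tokens, assuming ~3 characters per token."""
    if len(code) &amp;lt;= max_chars:
        return code
    half = max_chars // 2
    # Keep the head and tail of the file and drop the middle as a crude compromise
    return code[:half] + "\n# ... truncated ...\n" + code[-half:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;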

&lt;h2&gt;
  
  
  Project 3: Cover Letter Generator
&lt;/h2&gt;

&lt;p&gt;Job applications are tedious. The &lt;a href="https://github.com/kennedyraju55/cover-letter-generator" rel="noopener noreferrer"&gt;cover-letter-generator&lt;/a&gt; takes a job description and your resume bullets, then produces a tailored cover letter — all without sending your personal career history to OpenAI's servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_cover_letter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;job_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resume_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resume_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resume_points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write a professional cover letter for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Job Description:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Candidate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Key Qualifications:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Requirements:
- 3 paragraphs maximum
- Specific connections between qualifications and job requirements
- Professional but authentic tone
- No generic filler sentences
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;45.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature at 0.5 here — slightly higher than standup or code review because cover letters benefit from a touch of variability while staying professional.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond AI: The Full Developer Toolkit
&lt;/h2&gt;

&lt;p&gt;Not every productivity tool needs an LLM. Two other projects in my toolkit solve pure engineering problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennedyraju55/apiwatch" rel="noopener noreferrer"&gt;apiwatch&lt;/a&gt;&lt;/strong&gt; — An API contract testing and health monitoring CLI. You define API contracts in YAML, and apiwatch continuously validates your endpoints against those contracts. It catches breaking changes, performance degradation, and response schema violations before they hit production. Think of it as a lightweight Pact alternative that runs from a single CLI command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennedyraju55/loadlens" rel="noopener noreferrer"&gt;loadlens&lt;/a&gt;&lt;/strong&gt; — A load testing and capacity planning toolkit built in Python. It helps teams understand their actual throughput — including why "8 RPS per machine" might be less impressive than it sounds when you factor in connection overhead, payload size, and downstream dependencies.&lt;/p&gt;

&lt;p&gt;Both tools follow the same philosophy: zero external dependencies for core functionality, runs anywhere Python runs, and delivers value in under five minutes of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns That Work Across All These Tools
&lt;/h2&gt;

&lt;p&gt;After building 116+ open-source repositories, I've seen certain patterns consistently emerge:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Structured Prompts with Clear Constraints
&lt;/h3&gt;

&lt;p&gt;The biggest improvement in local LLM output comes not from model size but from prompt structure. Always tell the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What role to assume&lt;/li&gt;
&lt;li&gt;What input format to expect&lt;/li&gt;
&lt;li&gt;What output format you need&lt;/li&gt;
&lt;li&gt;What to exclude (often more important than what to include)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Temperature as a Knob, Not a Setting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;0.1–0.2&lt;/td&gt;
&lt;td&gt;Deterministic, factual analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standup reports&lt;/td&gt;
&lt;td&gt;0.2–0.3&lt;/td&gt;
&lt;td&gt;Structured but slightly varied phrasing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cover letters&lt;/td&gt;
&lt;td&gt;0.4–0.6&lt;/td&gt;
&lt;td&gt;Natural language that doesn't sound robotic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative writing&lt;/td&gt;
&lt;td&gt;0.7–0.9&lt;/td&gt;
&lt;td&gt;Exploratory, varied output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Timeout Budgets
&lt;/h3&gt;

&lt;p&gt;Local models on CPU can take 10–30 seconds for complex prompts. Always set explicit timeouts and provide user feedback (progress indicators or streaming responses) so the tool doesn't feel broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Graceful Degradation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TimeoutException&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ Ollama is not running. Start it with: ollama serve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Ollama isn't running, the tool should say so — not crash with a stack trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: The Local AI Developer Stack
&lt;/h2&gt;

&lt;p&gt;The trajectory is clear. Models are getting smaller and more capable. Gemma 4 at 4B parameters today outperforms GPT-3.5 on many code tasks. By next year, we'll likely have sub-2B models that handle most developer productivity use cases.&lt;/p&gt;

&lt;p&gt;I'm working on expanding this toolkit to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Git commit message generation&lt;/strong&gt; from staged diffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generator&lt;/strong&gt; that reads code and produces API docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test case suggester&lt;/strong&gt; that analyzes functions and proposes edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All local. All open source. All free.&lt;/p&gt;
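
&lt;p&gt;To give a flavor of the first item, here's roughly what the core will look like (the prompt wording and model choice are placeholders, not final):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

import httpx

def suggest_commit_message() -&gt; str:
    """Draft a commit message from whatever is currently staged."""
    diff = subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        return "Nothing staged."
    response = httpx.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": (
                "Write a one-line conventional commit message for this diff. "
                "Output only the message.\n\n" + diff
            ),
            "options": {"temperature": 0.2},  # factual task, keep it low
            "stream": False,
        },
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()["response"].strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;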

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Every project mentioned in this post is open source and ready to run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pull a model: &lt;code&gt;ollama pull gemma3:4b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Clone any repo and follow the README&lt;/li&gt;
&lt;li&gt;Start building your own local AI tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The best developer tools are the ones you control completely. When the AI runs on your machine, you own the entire stack — model, data, and output. No vendor lock-in, no usage caps, no privacy concerns.&lt;/p&gt;

&lt;p&gt;Start local. Ship faster.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, working on semantic indexing and retrieval-augmented generation (RAG) systems. He maintains 116+ open-source repositories exploring AI, developer tools, healthcare technology, and creative applications of local LLMs.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐙 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✍️ &lt;strong&gt;Dev.to:&lt;/strong&gt; &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;linkedin.com/in/nrk-raju-guthikonda-504066a8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Stop Sending Your Security Alerts to Cloud AI — Build Local LLM Tools Instead</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:25:14 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/stop-sending-your-security-alerts-to-cloud-ai-build-local-llm-tools-instead-1dgl</link>
      <guid>https://forem.com/kennedyraju55/stop-sending-your-security-alerts-to-cloud-ai-build-local-llm-tools-instead-1dgl</guid>
      <description>&lt;p&gt;Every time a security analyst pastes a suspicious log entry into a cloud-based AI chatbot, they might be handing adversaries a roadmap. That firewall alert contains your internal IP ranges. That phishing email reveals which executives are being targeted. That threat intelligence report maps your entire attack surface.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. As a Senior Software Engineer at Microsoft working on Copilot Search Infrastructure, I spend my days thinking about how AI systems ingest, index, and retrieve sensitive data at scale. That experience taught me a foundational principle: &lt;strong&gt;the most dangerous data leak is the one disguised as a productivity tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built five open-source security AI tools — all powered by local LLMs through Ollama — that never send a single byte to the cloud. Here is why you should do the same, and how to get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Security Data Must Never Leave Your Network
&lt;/h2&gt;

&lt;p&gt;This is not theoretical paranoia. It is operational reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Compliance Exposure
&lt;/h3&gt;

&lt;p&gt;NIST SP 800-171, SOC 2, HIPAA, and GDPR all impose strict controls on where sensitive data can be processed. The moment you paste a security alert into a cloud AI service, you have potentially created a compliance violation. Many cloud AI providers reserve the right, in their terms of service, to use input data for model improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Adversarial Intelligence Leakage
&lt;/h3&gt;

&lt;p&gt;Security alerts are not just operational noise — they are intelligence. An alert about a brute-force attempt on &lt;code&gt;admin@internal-crm.yourcompany.com&lt;/code&gt; tells an attacker three things: you have a CRM system, it uses that naming convention, and it is internet-facing. Sending this to a third-party API, even an encrypted one, expands your blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Supply Chain Risk
&lt;/h3&gt;

&lt;p&gt;Cloud AI providers are themselves targets. A breach at your AI provider could expose every query ever sent — including your security telemetry. Running locally eliminates this entire attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Latency in Incident Response
&lt;/h3&gt;

&lt;p&gt;During an active incident, you cannot afford to wait for API rate limits or deal with cloud outages. Local inference means your AI triage tools work even when the network is compromised — which is exactly when you need them most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local LLM Stack: Ollama + Python
&lt;/h2&gt;

&lt;p&gt;The architecture is simpler than you might expect. Ollama provides a local REST API that is compatible with the interface patterns most developers already know. Here is the foundation every tool in my security suite shares:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Interface to local Ollama instance for security analysis.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a prompt to the local LLM. No data leaves localhost.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Verify Ollama is running before processing sensitive data.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the low temperature setting of &lt;code&gt;0.3&lt;/code&gt;. For security analysis, you want deterministic, factual responses — not creative writing. This is a deliberate architectural choice that differs from most chatbot configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Security Alert Analyzer
&lt;/h2&gt;

&lt;p&gt;Let me walk through a concrete example: triaging a cybersecurity alert. The key insight is that not everything requires an LLM. Pattern extraction (IPs, hashes, CVEs) is best handled by regex, while the LLM handles contextual analysis and summarization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;threat_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_iocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract Indicators of Compromise without an LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b(?:\d{1,3}\.){3}\d{1,3}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CVE-\d{4}-\d{4,}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;md5_hashes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-fA-F0-9]{32}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256_hashes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-fA-F0-9]{64}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LocalLLM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full alert analysis: regex extraction + local LLM summarization.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;iocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_iocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior SOC analyst. Analyze this security alert and provide:
1. Threat severity (CRITICAL/HIGH/MEDIUM/LOW)
2. Attack type classification
3. Recommended immediate actions
4. IOC summary

Alert:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Extracted IOCs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Score based on IOC density and keyword severity
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exploit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ransomware&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threat_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hybrid approach — deterministic extraction plus LLM analysis — gives you the reliability of pattern matching with the contextual intelligence of a language model. And everything stays on &lt;code&gt;localhost:11434&lt;/code&gt;.&lt;/p&gt;
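
&lt;p&gt;Wiring it together looks like this. The alert text is a made-up sample using documentation IP space:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;llm = LocalLLM()
if not llm.health_check():
    raise SystemExit("Ollama is not running. Start it with: ollama serve")

sample = (
    "CRITICAL: repeated brute-force logins from 203.0.113.42 against "
    "vpn.example.com, possible CVE-2024-3400 exploit attempt"
)
alert = analyze_alert(sample, llm)
print(f"Threat score: {alert.threat_score:.1f}/10")  # 9.5 for this sample
print(alert.summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;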

&lt;h2&gt;
  
  
  Five Tools, Zero Cloud Dependencies
&lt;/h2&gt;

&lt;p&gt;I have built and open-sourced a suite of security tools that follow this architecture. Each one solves a real problem I have encountered in production environments:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cybersecurity Alert Summarizer
&lt;/h3&gt;

&lt;p&gt;The flagship tool. It ingests raw security alerts, extracts IOCs (IPs, domains, hashes, CVEs), queries a local CVE database for CVSS scores, calculates weighted threat scores, and generates executive-ready summaries. The correlation engine links related alerts across multiple data sources — critical for spotting coordinated attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Ollama, Click CLI, FastAPI, Rich, Docker&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/cybersecurity-alert-summarizer" rel="noopener noreferrer"&gt;cybersecurity-alert-summarizer&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. DocShield — Privacy-First Document Analysis
&lt;/h3&gt;

&lt;p&gt;A multi-agent system using Gemma 4 that reads, explains, and audits sensitive documents. While originally built for medical documents (HIPAA compliance demands local processing), the architecture applies to any document type containing sensitive data — contracts, financial reports, legal discovery. Five specialized agents (Orchestrator, Reader, Explainer, Checker, Bill Analyzer) work in a pipeline, each with a focused responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Gemma 4, Flask, Multi-Agent Pipeline, Docker&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/docshield" rel="noopener noreferrer"&gt;docshield&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Password Strength Advisor
&lt;/h3&gt;

&lt;p&gt;Goes far beyond "must contain uppercase and special character." This tool calculates Shannon entropy with pattern penalty scoring, checks against a local breach database with leet-speak variation detection, generates NIST SP 800-63B compliant policies, and creates cryptographically secure passwords using Fisher-Yates shuffling. The LLM provides natural-language explanations of why a password is weak.&lt;/p&gt;
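
&lt;p&gt;The entropy core is simple enough to show. A simplified sketch (the actual tool layers pattern penalties and breach checks on top of this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import Counter

def shannon_entropy_bits(password: str) -&gt; float:
    """Per-character Shannon entropy times length = total bits."""
    n = len(password)
    if n == 0:
        return 0.0
    per_char = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(password).values()
    )
    return per_char * n

print(shannon_entropy_bits("password"))      # 22.0 bits: short, repeated "s"
print(shannon_entropy_bits("xK9#mQ2$vL7!"))  # ~43 bits: longer, all distinct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;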

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Ollama, Click, Streamlit, FastAPI&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/password-strength-advisor" rel="noopener noreferrer"&gt;password-strength-advisor&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Phishing Email Detector
&lt;/h3&gt;

&lt;p&gt;Analyzes email headers, body text, and embedded URLs to classify phishing attempts. The local LLM examines linguistic patterns (urgency cues, authority impersonation, grammatical anomalies) while deterministic checks handle SPF/DKIM validation and URL reputation lookups against local threat feeds. No email content ever leaves the analysis machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Threat Intelligence Summarizer
&lt;/h3&gt;

&lt;p&gt;Ingests threat intelligence reports (STIX/TAXII feeds, vendor advisories, CVE bulletins) and produces actionable summaries for different audiences — technical IOC lists for the SOC team, risk assessments for management, patch priority lists for the infrastructure team. The LLM translates dense technical reports into audience-appropriate language.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;Every tool in this suite follows the same layered design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           Input Layer (CLI / Web / API)      │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Deterministic Processing Layer          │
│  (Regex, Pattern Matching, Scoring, DB)     │
│  → No LLM needed, fast, reliable            │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Local LLM Analysis Layer                │
│  (Ollama → Gemma 4 / Llama 3.2)            │
│  → Contextual analysis, summarization       │
│  → 127.0.0.1 only, no external calls        │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Output Layer (Rich CLI / Streamlit)     │
│  → Formatted reports, threat dashboards     │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical design decision is the &lt;strong&gt;separation between deterministic and LLM layers&lt;/strong&gt;. Pattern extraction, scoring, and database lookups do not need an LLM and should not use one. The LLM handles what it is good at: contextual understanding, summarization, and natural-language generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started in 5 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a security-optimized model&lt;/span&gt;
ollama pull gemma4

&lt;span class="c"&gt;# Clone any tool from the suite&lt;/span&gt;
git clone https://github.com/kennedyraju55/cybersecurity-alert-summarizer.git
&lt;span class="nb"&gt;cd &lt;/span&gt;cybersecurity-alert-summarizer

&lt;span class="c"&gt;# Install and run&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cyber_alert.cli &lt;span class="nt"&gt;--alert&lt;/span&gt; alerts/sample.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For model selection, I recommend &lt;strong&gt;Gemma 4&lt;/strong&gt; for its strong reasoning capabilities and multimodal support, or &lt;strong&gt;Llama 3.2 (3B)&lt;/strong&gt; if you need faster inference on limited hardware. Both run comfortably on a machine with 16GB RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Cloud AI is transformative for many use cases. Security is not one of them. The data you are analyzing — alerts, logs, threat intel, credentials, internal network topology — is precisely the data that adversaries want. Every cloud API call is an exposure surface.&lt;/p&gt;

&lt;p&gt;Local LLMs have reached a capability threshold where they handle security analysis tasks effectively. The tools exist. The models are free. The only cost is the compute you already own.&lt;/p&gt;

&lt;p&gt;In my experience building production AI systems that process sensitive data at scale, the architecture that wins is the one that minimizes data movement. For security tooling, that means local inference, local storage, and zero external dependencies.&lt;/p&gt;

&lt;p&gt;Build local. Analyze local. Keep your security data where it belongs — on your network.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, focused on semantic indexing and retrieval-augmented generation (RAG). He maintains 116+ open-source repositories, including a suite of security AI tools powered by local LLMs. His work explores the intersection of AI, privacy, and practical security tooling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;@kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev.to: &lt;a href="https://dev.to/nrk_raju"&gt;nrk_raju&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;nrk-raju-guthikonda&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Your Financial Data Should Never Leave Your Machine — Here's How I Built 5 AI Tools That Prove It</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:18:22 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/your-financial-data-should-never-leave-your-machine-heres-how-i-built-5-ai-tools-that-prove-it-517l</link>
      <guid>https://forem.com/kennedyraju55/your-financial-data-should-never-leave-your-machine-heres-how-i-built-5-ai-tools-that-prove-it-517l</guid>
      <description>&lt;p&gt;Every day, millions of people paste their bank statements into ChatGPT. They upload invoices to cloud AI services. They feed their tax documents into web apps that promise "AI-powered analysis."&lt;/p&gt;

&lt;p&gt;And every single one of them is handing their most sensitive data — income, spending habits, debt, investments — to a third-party server they don't control.&lt;/p&gt;

&lt;p&gt;I'm a Senior Software Engineer at Microsoft, working on Copilot Search Infrastructure (semantic indexing, RAG pipelines). I build AI systems for a living. And I'm here to tell you: &lt;strong&gt;there is absolutely no reason your financial data needs to leave your machine to get intelligent analysis.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the past year, I've built 116+ open-source projects — and a significant chunk of them are financial AI tools that run entirely on your local hardware. No API keys. No cloud calls. No data exfiltration. Just you, your machine, and a local LLM.&lt;/p&gt;

&lt;p&gt;Let me show you how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Privacy Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;When you use a cloud-based AI service to analyze your finances, here's what actually happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your data travels across the internet&lt;/strong&gt; — bank statements, transaction histories, salary information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It lands on someone else's servers&lt;/strong&gt; — often in a jurisdiction you didn't choose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It may be used for training&lt;/strong&gt; — many services reserve the right to use your inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It persists in logs&lt;/strong&gt; — even "deleted" data can live in backups, caches, and audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's one breach away from exposure&lt;/strong&gt; — and financial data is the #1 target for attackers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't hypothetical. As someone who works on enterprise-grade AI infrastructure at Microsoft, I've seen firsthand how seriously large organizations take data residency and privacy. The question is: why don't individuals demand the same protections?&lt;/p&gt;

&lt;p&gt;The answer used to be "because local AI wasn't good enough." That's no longer true.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local LLM Stack: Ollama + Gemma
&lt;/h2&gt;

&lt;p&gt;The foundation of every financial tool I've built is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/strong&gt; — A local LLM runtime that makes running models as easy as &lt;code&gt;ollama pull gemma3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.google.dev/gemma" rel="noopener noreferrer"&gt;Gemma&lt;/a&gt;&lt;/strong&gt; — Google's open-weight model family, optimized for efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — Because the ecosystem for data processing is unmatched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; — For REST APIs when you want programmatic access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; — For web UIs that non-technical users can actually use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what the core integration looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalFinancialLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Interface to local Ollama instance for financial analysis.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a financial analysis prompt to the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;full_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a financial analyst assistant.
Analyze the following financial data and provide actionable insights.

Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Data/Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide a structured analysis with key findings and recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Low temp for financial accuracy
&lt;/span&gt;                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="c1"&gt;# Usage — everything stays on your machine
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalFinancialLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Categorize these transactions and identify unusual spending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly household expenses for March 2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API key. No cloud endpoint. No terms of service that let a corporation train on your bank statements. The model runs on your hardware, processes your data in your RAM, and the results never leave your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Tools, Zero Cloud Dependencies
&lt;/h2&gt;

&lt;p&gt;Let me walk you through the financial AI tools I've built and open-sourced. Every single one follows the same architecture: local LLM, local data, local results.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Household Budget Analyzer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/household-budget-analyzer" rel="noopener noreferrer"&gt;kennedyraju55/household-budget-analyzer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the tool I use for my own family's finances. Feed it a CSV of transactions and it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-powered spending analysis&lt;/strong&gt; with savings recommendations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-categorization&lt;/strong&gt; — maps "Whole Foods" to Groceries, "AT&amp;amp;T" to Utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget vs. actual comparison&lt;/strong&gt; — shows exactly where you're over or under&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recurring expense detection&lt;/strong&gt; — catches hidden subscriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings goal tracking&lt;/strong&gt; with estimated completion dates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly trend analysis&lt;/strong&gt; — visualizes spending patterns over time
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;budget_analyzer.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_expenses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_category_breakdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;detect_recurring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SavingsGoal&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load your private financial data — stays 100% local
&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_expenses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/expenses.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_category_breakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Detect subscriptions you forgot about
&lt;/span&gt;&lt;span class="n"&gt;recurring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_recurring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recurring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📌 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Track savings with AI-powered projections
&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SavingsGoal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Emergency Fund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;monthly_contribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🎯 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_progress&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📅 Estimated completion: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Financial Report Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/financial-report-generator" rel="noopener noreferrer"&gt;kennedyraju55/financial-report-generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generates professional financial reports from raw data — the kind of output you'd expect from an analyst, but produced entirely on your laptop. Income statements, expense breakdowns, cash flow analysis, and forward-looking projections. All powered by Gemma running locally through Ollama.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Invoice Extractor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/invoice-extractor" rel="noopener noreferrer"&gt;kennedyraju55/invoice-extractor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Drop in an invoice (text, PDF content, or structured data) and the local LLM extracts vendor name, amounts, line items, dates, and tax information into clean structured JSON. Perfect for small business owners who process dozens of invoices monthly and don't want their vendor relationships and pricing sitting on someone else's server.&lt;/p&gt;
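
&lt;p&gt;The pattern underneath is straightforward: ask for JSON, then parse it. A minimal sketch (not the repo's exact prompt; a production version also needs a retry path for when the model wraps the JSON in prose):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import requests

EXTRACT_PROMPT = """Extract vendor, invoice_date, total, tax, and line_items
from the invoice below. Return ONLY valid JSON, with no commentary.

Invoice:
{invoice_text}
"""

def extract_invoice(invoice_text: str) -&gt; dict:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3",
            "prompt": EXTRACT_PROMPT.format(invoice_text=invoice_text),
            "options": {"temperature": 0.1},  # extraction wants determinism
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return json.loads(response.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;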

&lt;h3&gt;
  
  
  4. Sentiment Analysis Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;kennedyraju55/sentiment-analysis-dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While not strictly a "finance" tool, this is invaluable for market analysis. Feed it earnings call transcripts, financial news, or analyst reports and get sentiment scoring with explanations. I built this with a Streamlit dashboard so you can visualize sentiment trends over time — useful for anyone doing their own investment research.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Stock Report Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/stock-report-generator" rel="noopener noreferrer"&gt;kennedyraju55/stock-report-generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-powered technical analysis and risk assessment for stocks. It generates structured reports covering price analysis, risk factors, and market context. Again — your investment research stays private. No cloud service knows which stocks you're analyzing or what positions you're considering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;Every tool follows the same proven architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│                   Your Machine                    │
│                                                   │
│  ┌─────────────┐    ┌──────────────────────────┐ │
│  │  Streamlit   │    │     FastAPI REST API     │ │
│  │  Web UI      │    │     (localhost:8000)     │ │
│  │  (:8501)     │    │                          │ │
│  └──────┬───────┘    └────────────┬─────────────┘ │
│         │                         │               │
│         ▼                         ▼               │
│  ┌────────────────────────────────────────────┐   │
│  │          Core Analysis Engine               │   │
│  │   (Python: pandas, data processing)         │   │
│  └─────────────────────┬──────────────────────┘   │
│                        │                          │
│                        ▼                          │
│  ┌────────────────────────────────────────────┐   │
│  │         Ollama (localhost:11434)            │   │
│  │         Running Gemma 3/4 Model            │   │
│  │         ~4-8GB RAM                         │   │
│  └────────────────────────────────────────────┘   │
│                                                   │
│  ┌────────────────────────────────────────────┐   │
│  │    Local Storage (CSV/JSON)                │   │
│  │    Your data. Your machine. Your rules.    │   │
│  └────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘

        ❌ NOTHING crosses this boundary ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No outbound network calls&lt;/strong&gt; — the entire stack runs on localhost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-driven&lt;/strong&gt; — YAML files control model selection, temperature, and categories (see the sample config after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual interface&lt;/strong&gt; — CLI for power users, Streamlit for everyone else&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-ready&lt;/strong&gt; — &lt;code&gt;docker compose up&lt;/code&gt; and you're running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tested&lt;/strong&gt; — pytest suites with 80%+ coverage on core logic&lt;/li&gt;
&lt;/ul&gt;
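
&lt;p&gt;"Config-driven" in practice means a small YAML file per tool. The exact keys vary by project; a representative shape (illustrative, not any one repo's actual file) and how it's loaded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

# Representative config.yaml contents (keys are illustrative)
CONFIG = yaml.safe_load("""
model: gemma3
temperature: 0.3
categories:
  - groceries
  - utilities
  - transport
output_format: json
""")

print(CONFIG["model"])  # gemma3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;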

&lt;h2&gt;
  
  
  What You Need to Get Started
&lt;/h2&gt;

&lt;p&gt;The hardware requirements are surprisingly modest:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;10 GB (for models)&lt;/td&gt;
&lt;td&gt;20 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;Any modern x86/ARM&lt;/td&gt;
&lt;td&gt;Apple Silicon / GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;td&gt;NVIDIA for speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Linux, macOS, Windows&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting started takes five minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# 2. Pull a model&lt;/span&gt;
ollama pull gemma3

&lt;span class="c"&gt;# 3. Clone any of the tools&lt;/span&gt;
git clone https://github.com/kennedyraju55/household-budget-analyzer.git
&lt;span class="nb"&gt;cd &lt;/span&gt;household-budget-analyzer

&lt;span class="c"&gt;# 4. Set up Python environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# 5. Run it&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; budget_analyzer.cli analyze &lt;span class="nt"&gt;--file&lt;/span&gt; expenses.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Matters Beyond Privacy
&lt;/h2&gt;

&lt;p&gt;Privacy is the headline, but it's not the only benefit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost: $0 forever.&lt;/strong&gt; No monthly API fees. No per-token charges. No surprise bills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed: No network latency.&lt;/strong&gt; Results come back in seconds, not after a round-trip to Virginia.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability: No outages.&lt;/strong&gt; Your tools work on an airplane, in a cabin with no WiFi, during a cloud provider's bad day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control: Swap models freely.&lt;/strong&gt; Gemma too small? Try Llama 3. Want something lighter? Use Gemma 2B. No vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance: GDPR/HIPAA-friendly by default.&lt;/strong&gt; If data never leaves your machine, you eliminate an entire class of data-transfer and third-party-processor risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;I've now published 116+ open-source repositories, spanning healthcare AI, education tools, developer utilities, creative AI, and yes — financial tools. They all share the same conviction: &lt;strong&gt;powerful AI doesn't require surrendering your data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Working on Copilot Search Infrastructure at Microsoft has given me a deep appreciation for what's possible with semantic indexing and RAG at scale. But it's also shown me that the most impactful AI isn't always the biggest model or the most expensive infrastructure. Sometimes it's a 4-billion-parameter model running on your laptop, analyzing your family's budget without telling anyone about it.&lt;/p&gt;

&lt;p&gt;The tools are open source. The models are open weight. The only thing standing between you and private financial AI is a &lt;code&gt;git clone&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;All projects are MIT-licensed and ready to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🏠 &lt;a href="https://github.com/kennedyraju55/household-budget-analyzer" rel="noopener noreferrer"&gt;household-budget-analyzer&lt;/a&gt; — Budget tracking &amp;amp; spending analysis&lt;/li&gt;
&lt;li&gt;📊 &lt;a href="https://github.com/kennedyraju55/financial-report-generator" rel="noopener noreferrer"&gt;financial-report-generator&lt;/a&gt; — Professional report generation&lt;/li&gt;
&lt;li&gt;📄 &lt;a href="https://github.com/kennedyraju55/invoice-extractor" rel="noopener noreferrer"&gt;invoice-extractor&lt;/a&gt; — Structured data extraction from invoices&lt;/li&gt;
&lt;li&gt;📈 &lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;sentiment-analysis-dashboard&lt;/a&gt; — Market sentiment scoring&lt;/li&gt;
&lt;li&gt;📉 &lt;a href="https://github.com/kennedyraju55/stock-report-generator" rel="noopener noreferrer"&gt;stock-report-generator&lt;/a&gt; — Technical analysis &amp;amp; risk assessment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star them, fork them, make them better. And stop sending your bank statements to the cloud.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published by Nrk Raju Guthikonda — Senior Software Engineer at Microsoft (Copilot Search Infrastructure). Builder of 116+ open-source AI tools. Passionate about private, local-first AI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>jellyfin</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Semantic Search at Scale: What I Learned Building RAG Infrastructure at Microsoft Copilot</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:13:16 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/semantic-search-at-scale-what-i-learned-building-rag-infrastructure-at-microsoft-copilot-3d0h</link>
      <guid>https://forem.com/kennedyraju55/semantic-search-at-scale-what-i-learned-building-rag-infrastructure-at-microsoft-copilot-3d0h</guid>
      <description>&lt;p&gt;I work on Microsoft Copilot's Search Infrastructure team, where I focus on semantic indexing and RAG (Retrieval-Augmented Generation). The challenges of building search at scale are fundamentally different from what you encounter in tutorials. Here's what building production RAG taught me — and how I applied those lessons to my open-source projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tutorial vs. Production Gap
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials show this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Tutorial RAG
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_in_vector_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# At query time
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for a demo. It fails spectacularly in production because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunking strategy matters enormously&lt;/strong&gt; — naive splitting breaks mid-sentence, mid-paragraph, mid-concept&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding quality varies by domain&lt;/strong&gt; — a model trained on web text performs poorly on legal contracts or medical records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-k retrieval isn't enough&lt;/strong&gt; — you need re-ranking, filtering, and relevance scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window management&lt;/strong&gt; — stuffing 5 chunks into a prompt wastes tokens on irrelevant content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt; — documents update, and your index needs to stay current without full re-indexing&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lesson 1: Chunking Is an Art, Not a Split
&lt;/h2&gt;

&lt;p&gt;The biggest mistake in RAG is treating chunking as &lt;code&gt;text.split(max_length)&lt;/code&gt;. Good chunking preserves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic boundaries&lt;/strong&gt; — paragraphs, sections, logical units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — each chunk should be understandable in isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overlap&lt;/strong&gt; — some repetition between chunks prevents information loss at boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; — source document, section header, page number
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_split_by_headers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_token_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_merge_paragraphs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_merge_paragraphs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_token_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;merged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
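
&lt;p&gt;For the sketch above to run, it needs the two helpers it assumes. Minimal stand-ins look like this — a whitespace token estimate and a Markdown-style header split; a production version would use a real tokenizer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

class SemanticChunker:
    # ... chunk() and _merge_paragraphs() as shown above ...

    def _token_count(self, text):
        # Crude estimate: whitespace-separated words. Swap in a real
        # tokenizer (e.g. tiktoken) for accurate token budgeting.
        return len(text.split())

    def _split_by_headers(self, document):
        # Split on Markdown-style headers, keeping each header with its body.
        parts = re.split(r"(?m)^(?=#{1,6} )", document)
        return [p for p in parts if p.strip()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;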



&lt;p&gt;In my experience building production systems, domain-specific chunking strategies consistently outperform generic ones on retrieval relevance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: Re-ranking Changes Everything
&lt;/h2&gt;

&lt;p&gt;Vector similarity search returns "similar" results. Similar isn't the same as "relevant." A re-ranker bridges this gap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Phase 1: Broad retrieval (cast a wide net)
&lt;/span&gt;    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 2: Re-rank with cross-encoder
&lt;/span&gt;    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 3: Return top-k after re-ranking
&lt;/span&gt;    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve 3x what you need, re-rank, then take the top results. This consistently improves answer quality by 15-25% over pure vector search.&lt;/p&gt;
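
&lt;p&gt;The &lt;code&gt;self.reranker&lt;/code&gt; above can be an off-the-shelf cross-encoder. A sketch using the sentence-transformers library — the model name is one reasonable default, not a requirement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # A cross-encoder reads query and candidate together, so it scores
        # actual relevance rather than embedding-space proximity.
        self.model = CrossEncoder(model_name)

    def score(self, query, text):
        return float(self.model.predict([(query, text)])[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;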

&lt;h2&gt;
  
  
  Lesson 3: Hybrid Search Beats Pure Semantic
&lt;/h2&gt;

&lt;p&gt;Pure embedding-based search misses exact matches. If a user searches for "error code E4012", semantic search might return results about "error handling" instead of the specific error code.&lt;/p&gt;

&lt;p&gt;The solution is hybrid search: combine semantic similarity with keyword/BM25 matching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;semantic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;keyword_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production RAG systems, hybrid search with Reciprocal Rank Fusion consistently outperforms either retrieval method on its own.&lt;/p&gt;
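
&lt;p&gt;The &lt;code&gt;bm25_search&lt;/code&gt; half doesn't need a search server — at this scale an in-memory index is enough. A sketch with the rank-bm25 package (whitespace tokenization here is a simplification you'd refine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi

class KeywordIndex:
    def __init__(self, docs):
        # docs: objects with .id and .text attributes
        self.docs = docs
        self.bm25 = BM25Okapi([d.text.lower().split() for d in docs])

    def search(self, query, top_k=15):
        scores = self.bm25.get_scores(query.lower().split())
        ranked = sorted(zip(self.docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;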

&lt;h2&gt;
  
  
  Lesson 4: Evaluation Is Non-Negotiable
&lt;/h2&gt;

&lt;p&gt;You can't improve what you can't measure. Every RAG system needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval metrics&lt;/strong&gt;: Recall@K, MRR, NDCG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation metrics&lt;/strong&gt;: Faithfulness, relevance, completeness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end metrics&lt;/strong&gt;: User satisfaction, task completion rate
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_retrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;recall_at_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;mrr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_docs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Recall@5
&lt;/span&gt;        &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# MRR
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recall@5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mrr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
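
&lt;p&gt;Usage is just a labeled test set — even fifty hand-labeled query/document pairs will surface regressions. The IDs and numbers below are made up for illustration, and &lt;code&gt;engine&lt;/code&gt; stands in for whatever class exposes your &lt;code&gt;search()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;test_set = [
    ("how do i rotate api keys", ["doc-031", "doc-094"]),
    ("error code E4012", ["doc-377"]),
]

metrics = engine.evaluate_retrieval(test_set)
print(metrics)  # e.g. {'recall@5': 0.75, 'mrr': 0.66}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;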



&lt;h2&gt;
  
  
  Applying These Lessons to Open Source
&lt;/h2&gt;

&lt;p&gt;I've applied lessons learned from working on production-scale search to my open-source projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pdf-chat-assistant&lt;/strong&gt; — RAG over PDF documents with semantic chunking and re-ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;personal-knowledge-base&lt;/strong&gt; — Local RAG over your personal documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;study-buddy-bot&lt;/strong&gt; — RAG over textbook content for educational Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is the same. The scale is different. But the patterns transfer perfectly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip chunking strategy&lt;/strong&gt; — it's the most impactful optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always re-rank&lt;/strong&gt; — the cost is minimal, the quality improvement is significant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use hybrid search&lt;/strong&gt; — semantic + keyword catches what pure semantic misses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure everything&lt;/strong&gt; — build evaluation into your pipeline from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start local&lt;/strong&gt; — you can build and test great RAG systems on a single machine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full collection of RAG-powered tools is on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The patterns are open source. Build better search.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he builds semantic indexing and RAG systems at scale. He maintains 116+ open-source repositories. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>todayisearched</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Developer's Guide to Running LLMs Locally: Ollama, Gemma 4, and Why Your Side Projects Don't Need an API Key</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:09:13 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/the-developers-guide-to-running-llms-locally-ollama-gemma-4-and-why-your-side-projects-dont-54oe</link>
      <guid>https://forem.com/kennedyraju55/the-developers-guide-to-running-llms-locally-ollama-gemma-4-and-why-your-side-projects-dont-54oe</guid>
      <description>&lt;p&gt;Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call?&lt;/p&gt;

&lt;p&gt;I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local LLMs?
&lt;/h2&gt;

&lt;p&gt;Before diving into the how, let's talk about why:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Zero Cost Per Request
&lt;/h3&gt;

&lt;p&gt;Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Rate Limits
&lt;/h3&gt;

&lt;p&gt;I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Privacy by Default
&lt;/h3&gt;

&lt;p&gt;No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Offline Capability
&lt;/h3&gt;

&lt;p&gt;Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Reproducibility
&lt;/h3&gt;

&lt;p&gt;Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: 5 Minutes to Your First Local LLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install Ollama
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Windows&lt;/span&gt;
&lt;span class="c"&gt;# Download from https://ollama.com/download&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Pull Gemma 4
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads the model (~5GB). One-time cost, then it's on your machine forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Test It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4 &lt;span class="s2"&gt;"Explain quantum computing in one paragraph"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. You now have a local LLM running on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Applications with Python + Ollama
&lt;/h2&gt;

&lt;p&gt;Here's a minimal Python application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# That's literally it
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the SOLID principles in software engineering?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding Structure: The Pattern I Use in 90+ Projects
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.&lt;/p&gt;
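
&lt;p&gt;A domain tool is typically just a thin subclass. A hypothetical example of the shape — not code from a specific repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class CodeReviewer(LocalLLMApp):
    SYSTEM = "You are a meticulous code reviewer. Be specific and concise."

    def review(self, diff):
        return self.generate(
            f"Review this diff and list issues by severity:\n\n{diff}",
            temperature=0.2,
            system=self.SYSTEM,
        )

reviewer = CodeReviewer()
print(reviewer.review(open("change.patch").read()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;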

&lt;h3&gt;
  
  
  Adding a Web Interface: Streamlit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Local AI Tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_area&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One import and a dozen lines on top of the base class — a full web interface for your local AI tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding an API: FastAPI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;

&lt;span class="nd"&gt;@api.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a REST API that any frontend, mobile app, or service can call — all running locally.&lt;/p&gt;
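
&lt;p&gt;Assuming the file is &lt;code&gt;main.py&lt;/code&gt;, start it with &lt;code&gt;uvicorn main:api --port 8000&lt;/code&gt;; then any client can call it — for example, from Python with requests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

r = requests.post(
    "http://localhost:8000/analyze",
    json={"text": "Summarize why local inference has no per-token cost.",
          "temperature": 0.2},
)
print(r.json()["result"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;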

&lt;h2&gt;
  
  
  Docker: One-Command Deployment
&lt;/h2&gt;

&lt;p&gt;Every project I build ships with this &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-data:/root/.ollama&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8501:8501"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_HOST=http://ollama:11434&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker compose up&lt;/code&gt; — that's the entire deployment story. Works on any machine with Docker and a GPU.&lt;/p&gt;
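
&lt;p&gt;The only glue the app itself needs is to respect the &lt;code&gt;OLLAMA_HOST&lt;/code&gt; variable that the compose file sets. A minimal sketch — inside the container it resolves to the &lt;code&gt;ollama&lt;/code&gt; service, and outside Docker it falls back to a locally installed Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import ollama

# OLLAMA_HOST is injected by docker-compose; default to localhost otherwise
client = ollama.Client(host=os.environ.get("OLLAMA_HOST", "http://localhost:11434"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;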

&lt;h2&gt;
  
  
  Performance: What to Expect
&lt;/h2&gt;

&lt;p&gt;On consumer hardware (RTX 3080, 16GB RAM):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple Q&amp;amp;A&lt;/strong&gt;: 0.5-1 second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paragraph generation&lt;/strong&gt;: 2-5 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document analysis (2-3 pages)&lt;/strong&gt;: 5-15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form generation (1000+ words)&lt;/strong&gt;: 15-30 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are practical, usable response times for interactive applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Cloud vs. Local
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prototyping&lt;/td&gt;
&lt;td&gt;✅ Zero cost&lt;/td&gt;
&lt;td&gt;❌ Token costs add up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive data&lt;/td&gt;
&lt;td&gt;✅ Privacy by default&lt;/td&gt;
&lt;td&gt;❌ Requires BAA/DPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production (small scale)&lt;/td&gt;
&lt;td&gt;✅ Fixed hardware cost&lt;/td&gt;
&lt;td&gt;✅ Easy to scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production (large scale)&lt;/td&gt;
&lt;td&gt;❌ Hardware limits&lt;/td&gt;
&lt;td&gt;✅ Elastic scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline/air-gapped&lt;/td&gt;
&lt;td&gt;✅ Works anywhere&lt;/td&gt;
&lt;td&gt;❌ Requires internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cutting-edge capability&lt;/td&gt;
&lt;td&gt;❌ Smaller models&lt;/td&gt;
&lt;td&gt;✅ Latest models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  90+ Projects and Counting
&lt;/h2&gt;

&lt;p&gt;I've applied this pattern across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Patient intake, lab results, EHR de-identification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt;: Contract analysis, brief generation, compliance checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Education&lt;/strong&gt;: Study bots, exam generators, flashcard creators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative&lt;/strong&gt;: Story generators, poetry engines, mood journals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Tools&lt;/strong&gt;: Code review, API docs, performance profiling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: Budget analyzers, financial report summarizers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Vulnerability scanners, alert summarizers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.&lt;/p&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start building locally. Your AI projects don't need an API key.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>sideprojects</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Creative AI Without the Cloud: Building Story Generators, Poetry Engines, and More with Local LLMs</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:06:39 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/creative-ai-without-the-cloud-building-story-generators-poetry-engines-and-more-with-local-llms-cfa</link>
      <guid>https://forem.com/kennedyraju55/creative-ai-without-the-cloud-building-story-generators-poetry-engines-and-more-with-local-llms-cfa</guid>
      <description>&lt;p&gt;There's a common misconception that creative AI — story generation, poetry, songwriting, art descriptions — requires massive cloud models. GPT-4, Claude, Gemini — the bigger the better, right?&lt;/p&gt;

&lt;p&gt;Not necessarily. I've built seven creative AI tools that run entirely on local hardware using Gemma 4 via Ollama. They generate compelling stories, poems, song lyrics, and creative content without sending a single byte to the cloud. Here's what I learned about making local LLMs creative.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local Creative AI Matters
&lt;/h2&gt;

&lt;p&gt;Creative writing involves personal expression. When you use a cloud-based AI to help with creative work, you're sharing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your creative ideas&lt;/strong&gt; — which could be used for training data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your writing style&lt;/strong&gt; — which becomes part of the model's knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your intellectual property&lt;/strong&gt; — stories, poems, and lyrics you co-create&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For professional writers, this is a real concern. If your unpublished novel's plot gets fed into training data, who owns that idea?&lt;/p&gt;

&lt;p&gt;Local LLMs eliminate this entirely. Your creative work stays on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Creative AI Suite
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Story Generator
&lt;/h3&gt;

&lt;p&gt;The story generator creates narrative fiction from prompts, with control over genre, tone, length, and style.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryGenerator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;literary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pov&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; story in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pov&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; person with a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; style.

Premise: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;premise&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Requirements:
- Strong opening hook
- Vivid sensory details
- Character development
- Satisfying resolution
- Show, don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t tell
- Natural dialogue (if applicable)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the temperature: &lt;strong&gt;0.8&lt;/strong&gt;. For creative writing, we want diversity and surprise. This is the opposite of clinical or legal applications where we used 0.1-0.2. Creative AI needs to take risks.&lt;/p&gt;
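
&lt;p&gt;Usage is a one-liner. A quick sketch with a hypothetical premise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;gen = StoryGenerator()

# The premise here is just an example; any one-sentence hook works
story = gen.generate(
    premise="A lighthouse keeper discovers the fog is alive",
    genre="gothic horror",
    length="short",
)
print(story)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;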

&lt;h3&gt;
  
  
  2. Poetry Engine
&lt;/h3&gt;

&lt;p&gt;The poetry engine handles multiple forms: sonnets, haiku, free verse, limericks, villanelles, and more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;POETRY_FORMS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 lines, iambic pentameter, ABAB CDCD EFEF GG rhyme scheme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 lines: 5-7-5 syllables, nature imagery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free_verse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No fixed meter or rhyme, but with rhythm and imagery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limerick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5 lines, AABBA rhyme, humorous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;villanelle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;19 lines, 5 tercets + 1 quatrain, two refrains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contemplative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;form_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;POETRY_FORMS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free form poetry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compose a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; poem about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; mood.

Form requirements: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;form_rules&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Write with:
- Vivid imagery and metaphor
- Emotional resonance
- Precise word choice
- Natural rhythm even within formal constraints&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# poetry sits at the top of the temperature spectrum&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Song Lyric Writer
&lt;/h3&gt;

&lt;p&gt;Generates lyrics with verse-chorus structure, including chord suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Family Story Creator
&lt;/h3&gt;

&lt;p&gt;A unique tool that generates personalized stories for children using family members as characters. Parents input names and traits, and the AI creates bedtime stories featuring their actual family.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Creative Writing Coach
&lt;/h3&gt;

&lt;p&gt;Analyzes drafts and provides constructive feedback on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pacing and structure&lt;/li&gt;
&lt;li&gt;Character voice consistency&lt;/li&gt;
&lt;li&gt;Show vs. tell balance&lt;/li&gt;
&lt;li&gt;Dialogue naturalness&lt;/li&gt;
&lt;li&gt;Opening hook strength&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Temperature Spectrum
&lt;/h2&gt;

&lt;p&gt;The most important lesson from building both clinical and creative AI tools is understanding the temperature parameter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Application&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clinical summarization&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Accuracy over creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal analysis&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;Reasoning with minimal hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;Correct syntax, some flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Educational content&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Balanced: accurate but engaging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business writing&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;Professional but not robotic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative fiction&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;Surprising, expressive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poetry/experimental&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;Maximum creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This spectrum emerged from building 90+ tools across every domain. It's not in any textbook — it's practical knowledge from thousands of generations across different use cases.&lt;/p&gt;
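
&lt;p&gt;If you want to encode the spectrum directly in code, a simple lookup with a conservative default works well. This is a sketch — the names are mine, not a library API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Practical temperature defaults by task family, from the table above
TEMPERATURE_BY_TASK = {
    "clinical_summary": 0.1,
    "legal_analysis": 0.2,
    "code_generation": 0.3,
    "educational": 0.5,
    "business_writing": 0.6,
    "creative_fiction": 0.8,
    "poetry": 0.9,
}

def options_for(task):
    # Unknown tasks get a conservative default rather than a creative one
    return {"temperature": TEMPERATURE_BY_TASK.get(task, 0.3)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;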

&lt;h2&gt;
  
  
  Running on Consumer Hardware
&lt;/h2&gt;

&lt;p&gt;All seven creative tools run on a single machine with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 3080 or equivalent (8GB+ VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM&lt;/strong&gt;: 16GB system RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: 10GB for the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generation times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short story (500 words)&lt;/strong&gt;: 5-10 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poem&lt;/strong&gt;: 2-4 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Song lyrics&lt;/strong&gt;: 4-8 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Story feedback&lt;/strong&gt;: 8-15 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast enough for interactive creative sessions where you generate, read, regenerate, and iterate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Creative Tools
&lt;/h2&gt;

&lt;p&gt;All tools are available on GitHub:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/family-story-creator" rel="noopener noreferrer"&gt;family-story-creator&lt;/a&gt; — Personalized family stories&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/mood-journal-bot" rel="noopener noreferrer"&gt;mood-journal-bot&lt;/a&gt; — AI-powered mood journaling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/standup-generator" rel="noopener noreferrer"&gt;standup-generator&lt;/a&gt; — Creative standup comedy bits&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/diary-journal-organizer" rel="noopener noreferrer"&gt;diary-journal-organizer&lt;/a&gt; — Smart journal organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Creative AI doesn't need to be a cloud service. With local LLMs, your creative work stays private, runs fast, and costs nothing per generation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories spanning healthcare, legal, education, creative AI, and developer tools. Find his work on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>showdev</category>
    </item>
    <item>
      <title>AI Tutoring That Works Offline: Building Education Tools That Don't Need the Internet</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:03:06 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/ai-tutoring-that-works-offline-building-education-tools-that-dont-need-the-internet-cd7</link>
      <guid>https://forem.com/kennedyraju55/ai-tutoring-that-works-offline-building-education-tools-that-dont-need-the-internet-cd7</guid>
      <description>&lt;p&gt;The digital divide in education isn't just about having a device — it's about having reliable internet. According to the FCC, over 14 million students in the US lack adequate home internet access. Cloud-based AI tutoring tools are useless to them.&lt;/p&gt;

&lt;p&gt;I built a suite of education AI tools that run entirely offline using local LLMs. No internet required after initial setup. No student data leaving the device. No subscription costs per student.&lt;/p&gt;

&lt;p&gt;Here's how local AI can democratize education technology.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Connectivity Problem
&lt;/h2&gt;

&lt;p&gt;Most "AI in education" products assume always-on internet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT needs cloud access for every interaction&lt;/li&gt;
&lt;li&gt;Khan Academy's AI tutor requires constant connectivity
&lt;/li&gt;
&lt;li&gt;Google's learning tools depend on Google Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a two-tier education system: students with fast internet get AI-powered learning, and everyone else gets left behind. Rural schools, developing countries, and low-income households — the communities that need educational technology most — can't use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Education AI That Runs Locally
&lt;/h2&gt;

&lt;p&gt;My education tools use Gemma 4 via Ollama to provide:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Study Buddy Bot
&lt;/h3&gt;

&lt;p&gt;An interactive study companion that helps students work through problems step-by-step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StudyBuddy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;  &lt;span class="c1"&gt;# elementary, middle, high, college
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;help_with_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a patient &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-level &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tutor.

A student asks for help with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Rules:
1. Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t give the answer directly
2. Guide them through the reasoning step by step
3. Ask questions to check understanding
4. Use analogies appropriate for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; level
5. Encourage when they make progress
6. Correct misconceptions gently

Provide the next step of guidance.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Minimal sketch of the elided helper: one call to the local model&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choice: &lt;strong&gt;never give the answer directly&lt;/strong&gt;. The Socratic method works better for learning, and it's also safer — if the LLM makes a mistake in the final answer, the student learns the wrong thing. But if the LLM asks guiding questions, the student develops their own understanding.&lt;/p&gt;
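
&lt;p&gt;Wiring it up takes two lines — the subject and problem below are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;buddy = StudyBuddy(subject="algebra", level="middle")

# Returns a guiding question or hint, never the final answer
hint = buddy.help_with_problem("Solve for x: 3x + 7 = 22")
print(hint)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;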

&lt;h3&gt;
  
  
  2. Exam Generator
&lt;/h3&gt;

&lt;p&gt;Creates practice exams calibrated to curriculum standards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_exam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                  &lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiple_choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;essay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; difficulty exam on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Requirements:
- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; questions total
- Mix of types: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Multiple choice: 4 options each, one correct
- Include an answer key with explanations
- Align with standard curriculum for this topic
- Vary Bloom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s taxonomy levels (remember, understand, apply, analyze)

Output as structured JSON.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# format="json" nudges the model toward parseable output&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Teachers can generate unlimited practice exams without internet. Each generation is unique, reducing cheating on take-home practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Reading List Manager
&lt;/h3&gt;

&lt;p&gt;Analyzes reading materials and suggests study paths.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts key concepts from textbook chapters&lt;/li&gt;
&lt;li&gt;Identifies prerequisite knowledge gaps&lt;/li&gt;
&lt;li&gt;Suggests reading order for optimal learning&lt;/li&gt;
&lt;li&gt;Generates chapter summaries and study guides&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Flashcard Generator
&lt;/h3&gt;

&lt;p&gt;Transforms any text into spaced-repetition flashcards.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts key terms and definitions&lt;/li&gt;
&lt;li&gt;Creates question-answer pairs from content&lt;/li&gt;
&lt;li&gt;Generates "explain this concept" cards for deeper understanding&lt;/li&gt;
&lt;li&gt;Exports to Anki-compatible format (see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
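
&lt;p&gt;The Anki export is the simplest part: Anki imports plain tab-separated text, so no special library is needed. A minimal sketch — &lt;code&gt;cards&lt;/code&gt; stands in for whatever question-answer pairs the LLM step produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

def export_anki(cards, path="flashcards.txt"):
    # cards: iterable of (front, back) pairs from the generation step;
    # Anki's importer accepts tab-separated text files directly
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for front, back in cards:
            writer.writerow([front, back])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;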

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The total cost to deploy these tools is a one-time hardware purchase — a laptop with a decent GPU. After downloading the model once (with internet), everything runs offline indefinitely.&lt;/p&gt;

&lt;p&gt;Compare this to per-student SaaS subscriptions that cost schools thousands annually, require constant internet, and send student data to cloud servers (a FERPA concern).&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment for Schools
&lt;/h2&gt;

&lt;p&gt;Every tool ships with Docker for easy deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One command to start the entire education AI suite&lt;/span&gt;
docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A school IT administrator can set up a single server that runs all the tools. Students connect via the local network — no internet needed. Student interactions stay on the school's own hardware.&lt;/p&gt;

&lt;p&gt;For individual students, the tools run on any laptop with 8GB+ RAM. Not gaming-PC specs — just a reasonable modern laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum alignment&lt;/strong&gt; — mapping generated content to Common Core and state standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt; — Gemma 4 supports multiple languages, enabling tutoring in students' native languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teacher dashboards&lt;/strong&gt; — aggregated (anonymized) analytics showing class-wide concept gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — screen reader support and simplified interfaces for students with disabilities&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tools are open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/study-buddy-bot" rel="noopener noreferrer"&gt;study-buddy-bot&lt;/a&gt; — Interactive AI study companion&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/exam-generator" rel="noopener noreferrer"&gt;exam-generator&lt;/a&gt; — Practice exam generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/reading-list-manager" rel="noopener noreferrer"&gt;reading-list-manager&lt;/a&gt; — Smart reading path suggestions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/flashcard-generator" rel="noopener noreferrer"&gt;flashcard-generator&lt;/a&gt; — Automated flashcard creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Education AI should work for every student, not just those with fast internet and school budgets for SaaS subscriptions. Local LLMs make that possible today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He builds privacy-first AI tools across healthcare, legal, education, and enterprise domains. Explore his 116+ open-source repositories on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>From Side Projects to 116 Repositories: How I Built an Open-Source AI Portfolio While Working Full-Time at Microsoft</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:01:36 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/from-side-projects-to-116-repositories-how-i-built-an-open-source-ai-portfolio-while-working-1c8f</link>
      <guid>https://forem.com/kennedyraju55/from-side-projects-to-116-repositories-how-i-built-an-open-source-ai-portfolio-while-working-1c8f</guid>
      <description>&lt;p&gt;Two years ago, I had a handful of GitHub repositories — mostly experimental scripts and weekend hacks. Today, I maintain 116 original repositories spanning healthcare AI, legal tech, developer tools, creative AI, education, finance, and security.&lt;/p&gt;

&lt;p&gt;Every single one is original work. Zero forks. All built with a consistent philosophy: AI should run locally, respect privacy, and solve real problems.&lt;/p&gt;

&lt;p&gt;Here's what I learned building this portfolio while working full-time as a Senior Software Engineer on Microsoft's Copilot Search Infrastructure team.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 90-Local-LLM Rule
&lt;/h2&gt;

&lt;p&gt;Early on, I made a decision that shaped everything: every AI project would run locally. No cloud API keys. No data transmission. No per-token costs.&lt;/p&gt;

&lt;p&gt;This wasn't just a technical preference — it was a product thesis. I believe the future of AI isn't centralized cloud APIs but distributed local inference. And I wanted to prove it was practical by building 90+ working applications across every domain I could think of.&lt;/p&gt;

&lt;p&gt;The stack is consistent across all projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; (or earlier Gemma models) via &lt;strong&gt;Ollama&lt;/strong&gt; for inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for core logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for API layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for user interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This consistency means each project builds on patterns from previous ones. The tenth healthcare tool took a fraction of the time of the first because the architecture was battle-tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking Domains That Matter
&lt;/h2&gt;

&lt;p&gt;I didn't build 116 "todo app with AI" variations. Each project targets a real problem in a specific domain:&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare (15+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Patient intake summarizers that keep PHI on-premise&lt;/li&gt;
&lt;li&gt;Lab result interpreters with clinical context&lt;/li&gt;
&lt;li&gt;EHR de-identification tools&lt;/li&gt;
&lt;li&gt;Medical document assistants&lt;/li&gt;
&lt;li&gt;Mental health check-in tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The healthcare tools are built around a single principle: &lt;strong&gt;no patient data should ever leave the hospital's network&lt;/strong&gt;. Every one runs entirely offline after initial model download.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legal Tech (8+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Contract clause analyzers&lt;/li&gt;
&lt;li&gt;Legal brief generators&lt;/li&gt;
&lt;li&gt;Compliance checkers&lt;/li&gt;
&lt;li&gt;Court case summarizers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Legal AI has the same confidentiality imperative as healthcare — attorney-client privilege doesn't survive a round trip to a cloud API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer Tools (20+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Code review assistants&lt;/li&gt;
&lt;li&gt;API documentation generators&lt;/li&gt;
&lt;li&gt;Git analytics dashboards&lt;/li&gt;
&lt;li&gt;Performance profiling tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tools I actually use in my day job. Building them made me a better engineer, and open-sourcing them helped others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Education, Finance, Security, Creative AI (50+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Exam generators and tutoring bots&lt;/li&gt;
&lt;li&gt;Financial report analyzers&lt;/li&gt;
&lt;li&gt;Security audit tools&lt;/li&gt;
&lt;li&gt;Story generators and poetry engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each domain taught me something about how LLMs interact with domain-specific knowledge. Medical terminology behaves differently than legal jargon, which behaves differently than financial reporting language. The prompting strategies that work for clinical summarization fail for creative writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;After 116 repos, I've converged on a pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── src/
│   ├── core/          # Domain logic (no LLM dependency)
│   ├── llm/           # LLM integration layer
│   ├── api/           # FastAPI endpoints
│   └── ui/            # Streamlit interface
├── tests/
├── docker-compose.yml # One-command deployment
├── README.md          # Problem, solution, architecture, demo
└── .env.example       # Configuration template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Separate domain logic from LLM integration&lt;/strong&gt; — the core business logic should work with any model, or even without one (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always provide both API and UI&lt;/strong&gt; — API for integration, UI for demos and non-technical users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-first deployment&lt;/strong&gt; — &lt;code&gt;docker compose up&lt;/code&gt; should be the only command needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive README&lt;/strong&gt; — every project explains the problem it solves, not just how to run it&lt;/li&gt;
&lt;/ol&gt;
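
&lt;p&gt;The first principle is the one that pays off most. In practice it means the &lt;code&gt;core/&lt;/code&gt; package depends on a tiny interface rather than on Ollama directly. A minimal sketch — the names are my own, not from any specific repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

import ollama

class TextGenerator(Protocol):
    # The only thing core/ knows about the LLM layer
    def generate(self, prompt: str) -&amp;gt; str: ...

class OllamaGenerator:
    # Lives in src/llm/; swap in a canned stub for unit tests
    def __init__(self, model="gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str) -&amp;gt; str:
        return self.client.generate(model=self.model, prompt=prompt)["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the domain code only sees &lt;code&gt;TextGenerator&lt;/code&gt;, its tests can run without a model at all.&lt;/p&gt;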

&lt;h2&gt;
  
  
  Time Management: Building While Working Full-Time
&lt;/h2&gt;

&lt;p&gt;The most common question I get: "How do you build this much while working full-time?"&lt;/p&gt;

&lt;p&gt;The honest answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reuse patterns aggressively&lt;/strong&gt; — that project template above means I can scaffold a new project in 20 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build in domains you know&lt;/strong&gt; — working on Copilot Search taught me RAG patterns that directly informed my retrieval-augmented projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small, focused projects&lt;/strong&gt; — each repo solves one problem well. A contract analyzer doesn't try to also manage cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekend sprints&lt;/strong&gt; — most projects start as Saturday afternoon prototypes. If the prototype works, it gets a full README and Docker setup the next day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate everything else&lt;/strong&gt; — I have scripts for repo creation, README generation, and deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What This Portfolio Has Done for My Career
&lt;/h2&gt;

&lt;p&gt;Building 116 original repositories has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepened my expertise&lt;/strong&gt; — you don't truly understand RAG until you've built it for healthcare, legal, and education domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Created a public body of work&lt;/strong&gt; — every repo is a verifiable, runnable demonstration of skill&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opened conversations&lt;/strong&gt; — colleagues and recruiters reference specific projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributed to open source&lt;/strong&gt; — over 50 projects have README-driven documentation that helps others learn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built credibility in AI/ML&lt;/strong&gt; — a portfolio this size, with this consistency, demonstrates sustained commitment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advice for Building Your Own Portfolio
&lt;/h2&gt;

&lt;p&gt;If you're considering building a similar open-source portfolio:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick a consistent stack&lt;/strong&gt; — don't learn a new framework for each project. Master one stack and push it to its limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solve real problems&lt;/strong&gt; — "GPT wrapper" projects don't demonstrate skill. Privacy-first healthcare AI demonstrates both technical ability and domain understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the README first&lt;/strong&gt; — if you can't explain the problem and solution clearly, the project isn't ready&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ship Docker&lt;/strong&gt; — if someone can't run your project with a single command, they won't try it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be original&lt;/strong&gt; — forking and modifying existing projects teaches less than building from scratch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay consistent&lt;/strong&gt; — 116 repos didn't happen overnight. Commit to building something new every week&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full portfolio is available at &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt; and showcased at &lt;a href="https://kennedyraju55.github.io" rel="noopener noreferrer"&gt;kennedyraju55.github.io&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, specializing in semantic indexing and RAG systems. He maintains 116+ original open-source repositories. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>opensource</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>Contract Analysis with Local LLMs: Why Law Firms Should Stop Sending Documents to the Cloud</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:58:00 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/contract-analysis-with-local-llms-why-law-firms-should-stop-sending-documents-to-the-cloud-41hi</link>
      <guid>https://forem.com/kennedyraju55/contract-analysis-with-local-llms-why-law-firms-should-stop-sending-documents-to-the-cloud-41hi</guid>
      <description>&lt;p&gt;Legal documents are among the most sensitive files in any organization. Yet the current wave of "AI-powered contract review" tools wants you to upload those documents to cloud APIs — exposing client confidentiality, attorney-client privilege, and trade secrets to third-party servers.&lt;/p&gt;

&lt;p&gt;I built an alternative: a contract clause analyzer that runs entirely on your machine using Gemma 4 via Ollama. Zero cloud transmission. Complete confidentiality. Here's how it works and why it matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Confidentiality Problem
&lt;/h2&gt;

&lt;p&gt;When a law firm uploads a contract to a cloud-based AI tool, several things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Attorney-client privilege may be waived&lt;/strong&gt; — transmitting privileged documents to a third party without proper safeguards can constitute a waiver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client data leaves your control&lt;/strong&gt; — even with encryption, the cloud provider processes the text in plaintext during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory exposure increases&lt;/strong&gt; — GDPR, CCPA, and industry regulations impose strict requirements on data processing locations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive intelligence leaks&lt;/strong&gt; — M&amp;amp;A contracts, employment agreements, and IP licenses contain strategic information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The American Bar Association's Model Rule 1.6 requires lawyers to make "reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client." Sending contracts to a cloud LLM is a gray area that many ethics committees are actively scrutinizing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Local-First Contract Analysis
&lt;/h2&gt;

&lt;p&gt;My contract-clause-analyzer uses a three-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│         Document Input Layer            │
│  (PDF, DOCX, TXT parsing + OCR)        │
├─────────────────────────────────────────┤
│         Clause Extraction Engine        │
│  (Section splitting, clause typing,     │
│   reference resolution)                 │
├─────────────────────────────────────────┤
│         LLM Analysis Layer              │
│  (Gemma 4 via Ollama — local only)     │
│  Risk scoring, term comparison,         │
│  plain-English summaries                │
└─────────────────────────────────────────┘
         ↕ Everything stays on localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 1: Document Parsing
&lt;/h3&gt;

&lt;p&gt;Contracts come in every format. The parser handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF&lt;/strong&gt; — both text-based and scanned (with Tesseract OCR fallback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOCX&lt;/strong&gt; — preserving section structure and numbering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain text&lt;/strong&gt; — for already-extracted content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key challenge is preserving document structure. A clause that says "Subject to Section 4.2(a)" needs to be linked to that section. The parser builds a section tree that maintains these cross-references.&lt;/p&gt;
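
&lt;p&gt;A sketch of the underlying structure — the names are illustrative, not the actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class Section:
    number: str                 # e.g. "4.2(a)"
    heading: str
    text: str
    children: list = field(default_factory=list)
    references: list = field(default_factory=list)  # section numbers cited in the text

def build_index(root, index=None):
    # A flat index lets a reference like "Section 4.2(a)" resolve in one lookup
    if index is None:
        index = {}
    index[root.number] = root
    for child in root.children:
        build_index(child, index)
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;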

&lt;h3&gt;
  
  
  Stage 2: Clause Extraction
&lt;/h3&gt;

&lt;p&gt;Not every paragraph in a contract is a "clause" worth analyzing. The extraction engine identifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operative clauses&lt;/strong&gt; — obligations, rights, conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate&lt;/strong&gt; — standard terms that still matter (governing law, dispute resolution, force majeure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Definitions&lt;/strong&gt; — terms that affect interpretation of other clauses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedules and exhibits&lt;/strong&gt; — referenced attachments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each clause is classified by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CLAUSE_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnif&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hold harmless&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defend and indemnify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation_of_liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation of liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggregate liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cap on damages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;terminat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expiration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancellation rights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidentiality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidential&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-disclosure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proprietary information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip_assignment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intellectual property&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;work product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignment of rights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non_compete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-compete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-solicitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restrictive covenant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;net 30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compensation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warranty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;represent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guarantee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_majeure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force majeure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act of god&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beyond reasonable control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;governing_law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;governing law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jurisdiction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;venue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispute_resolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arbitration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispute resolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_protection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data protection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDPR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;personal data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
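
&lt;p&gt;To make the map concrete, here is a minimal sketch of how keyword tagging can drive clause detection. It is my own simplification rather than the repo's exact segmentation logic; &lt;code&gt;split_into_clauses&lt;/code&gt; is a naive stand-in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def split_into_clauses(text: str) -&amp;gt; list:
    # Naive stand-in: treat blank-line-separated blocks as candidate clauses.
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]

def tag_clauses(contract_text: str, keyword_map: dict) -&amp;gt; list:
    """Tag each candidate clause with every clause type whose keywords it mentions."""
    tagged = []
    for clause in split_into_clauses(contract_text):
        lowered = clause.lower()
        types = [ctype for ctype, kws in keyword_map.items()
                 if any(kw.lower() in lowered for kw in kws)]
        if types:
            tagged.append({"text": clause, "types": types})
    return tagged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;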



&lt;h3&gt;
  
  
  Stage 3: LLM Analysis
&lt;/h3&gt;

&lt;p&gt;This is where Gemma 4 shines. For each extracted clause, the LLM provides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Assessment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_clause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;party_role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a contract analysis assistant. Analyze this &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clause 
from the perspective of the &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;party_role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (the party you are advising).

Clause:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide:
1. RISK_LEVEL: HIGH, MEDIUM, or LOW
2. KEY_ISSUES: List specific concerns (max 5)
3. MISSING_PROTECTIONS: What standard protections are absent
4. PLAIN_ENGLISH: Explain what this clause means in simple terms
5. NEGOTIATION_POINTS: Suggested changes to improve the party&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s position

Be specific. Reference exact language from the clause.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The temperature is set to 0.2 — slightly higher than in clinical applications, because legal analysis benefits from some reasoning diversity, but still low enough to avoid hallucinating contract terms that don't exist.&lt;/p&gt;
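
&lt;p&gt;The numbered, ALL-CAPS labels in the prompt are what make the response parseable without demanding strict JSON. Here is a minimal sketch of what a parser like &lt;code&gt;_parse_analysis&lt;/code&gt; might look like; it is a simplified stand-in, not the repo's exact implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

FIELDS = ["RISK_LEVEL", "KEY_ISSUES", "MISSING_PROTECTIONS",
          "PLAIN_ENGLISH", "NEGOTIATION_POINTS"]

def parse_analysis(raw: str) -&amp;gt; dict:
    """Split the response on the labeled headings the prompt asked for."""
    result = {}
    for i, field in enumerate(FIELDS):
        nxt = FIELDS[i + 1] if i + 1 &amp;lt; len(FIELDS) else None
        # Capture everything between this label and the next one (or end of text).
        tail = rf"(?={nxt})" if nxt else r"\Z"
        match = re.search(rf"{field}\s*:\s*(.*?){tail}", raw, re.DOTALL)
        result[field.lower()] = match.group(1).strip() if match else ""
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;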

&lt;p&gt;&lt;strong&gt;Comparative Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tool can also compare clauses against a library of "standard" terms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_to_standard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;standard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;standard_library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;standard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No standard template available for this clause type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare this contract clause to the standard template below.

ACTUAL CLAUSE:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

STANDARD TEMPLATE:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;standard&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Identify:
1. DEVIATIONS: Where the actual clause differs from standard
2. FAVORABLE_TERMS: Terms that are better than standard (for our client)
3. UNFAVORABLE_TERMS: Terms that are worse than standard
4. MISSING_TERMS: Standard protections that are absent&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    response = self.client.generate(
        model="gemma4",
        prompt=prompt,
        options={"temperature": 0.2}
    )
    # _parse_comparison is a hypothetical helper mirroring _parse_analysis above.
    return self._parse_comparison(response["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is incredibly powerful for junior associates who need to review contracts against firm templates — they get an instant markup of deviations without consuming senior-partner time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. M&amp;amp;A Due Diligence
&lt;/h3&gt;

&lt;p&gt;During an acquisition, the legal team might review hundreds of contracts. The tool can (a minimal batch driver is sketched after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch-process all vendor agreements&lt;/li&gt;
&lt;li&gt;Flag non-standard termination clauses (change of control triggers)&lt;/li&gt;
&lt;li&gt;Identify IP assignment gaps&lt;/li&gt;
&lt;li&gt;Summarize aggregate exposure from indemnification clauses&lt;/li&gt;
&lt;/ul&gt;
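
&lt;p&gt;The batch driver for this kind of review can be very small. A sketch, assuming the clause pipeline above is wrapped in a hypothetical &lt;code&gt;analyze_contract(path)&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

def batch_review(contract_dir: str, out_path: str) -&amp;gt; None:
    """Run the analyzer over every PDF in a folder, writing one JSONL report."""
    with open(out_path, "w") as out:
        for pdf in sorted(Path(contract_dir).glob("*.pdf")):
            # analyze_contract is a hypothetical wrapper around the
            # extraction and analysis stages described above.
            report = analyze_contract(str(pdf))
            out.write(json.dumps({"file": pdf.name, "report": report}) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;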

&lt;h3&gt;
  
  
  2. Employment Agreement Review
&lt;/h3&gt;

&lt;p&gt;HR and legal teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare non-compete scopes across different state jurisdictions&lt;/li&gt;
&lt;li&gt;Flag overly broad IP assignment clauses&lt;/li&gt;
&lt;li&gt;Ensure severance terms are consistent across employee levels&lt;/li&gt;
&lt;li&gt;Identify clauses that may not be enforceable in specific states&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Vendor Contract Management
&lt;/h3&gt;

&lt;p&gt;Procurement teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score vendor contracts by risk level&lt;/li&gt;
&lt;li&gt;Track SLA terms across multiple vendors&lt;/li&gt;
&lt;li&gt;Flag auto-renewal clauses before they trigger&lt;/li&gt;
&lt;li&gt;Ensure data protection addenda are present and adequate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On a consumer GPU (RTX 3080):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single clause analysis&lt;/strong&gt;: 1-3 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full contract (50 pages)&lt;/strong&gt;: 2-5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing (100 contracts)&lt;/strong&gt;: 3-4 hours unattended&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These times are comparable to cloud APIs — but without the per-token costs that make batch processing expensive. A 50-page contract might cost $2-5 in cloud API tokens. Locally, after the one-time hardware investment, the marginal cost is electricity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Legal AI
&lt;/h2&gt;

&lt;p&gt;The legal industry is at an inflection point with AI. Firms that adopt AI will outcompete those that don't. But adopting cloud-based AI for sensitive legal work creates risks that may outweigh the benefits.&lt;/p&gt;

&lt;p&gt;Local LLMs offer a third path: the productivity gains of AI without the confidentiality risks of cloud processing. As models like Gemma 4 continue to improve, the quality gap between local and cloud inference will continue to shrink.&lt;/p&gt;

&lt;p&gt;The code is open source and ready to deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/contract-clause-analyzer" rel="noopener noreferrer"&gt;contract-clause-analyzer&lt;/a&gt; — Full contract analysis pipeline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/legal-brief-generator" rel="noopener noreferrer"&gt;legal-brief-generator&lt;/a&gt; — Generate legal brief drafts from case notes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/ai-compliance-checker" rel="noopener noreferrer"&gt;ai-compliance-checker&lt;/a&gt; — Regulatory compliance analysis&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He builds privacy-first AI tools across healthcare, legal, and enterprise domains. Explore his 116+ open-source repositories on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I Built a Privacy-First Healthcare AI Agent Using MCP and Local LLMs</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:54:45 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/how-i-built-a-privacy-first-healthcare-ai-agent-using-mcp-and-local-llms-kg9</link>
      <guid>https://forem.com/kennedyraju55/how-i-built-a-privacy-first-healthcare-ai-agent-using-mcp-and-local-llms-kg9</guid>
      <description>&lt;p&gt;Most healthcare AI demos have a fatal flaw: they send patient data to the cloud. That's not just a bad practice — it's a regulatory minefield. HIPAA violations can cost $50,000 per incident, and "but our AI vendor said it was secure" isn't a defense.&lt;/p&gt;

&lt;p&gt;I decided to build healthcare AI tools that solve this problem at the architecture level. No patient data ever leaves the machine. Zero cloud API calls. Complete HIPAA compliance by design, not by policy.&lt;/p&gt;

&lt;p&gt;Here's how I built a suite of healthcare AI agents — including a patient intake summarizer, lab results interpreter, EHR de-identifier, and medical document assistant — all running locally with Gemma 4 via Ollama.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Cloud-Based Healthcare AI
&lt;/h2&gt;

&lt;p&gt;Every time a healthcare organization sends patient data to a cloud LLM API, they're creating:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A HIPAA liability&lt;/strong&gt; — PHI (Protected Health Information) transmitted to a third party requires a Business Associate Agreement, encryption in transit and at rest, and audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single point of failure&lt;/strong&gt; — API outages mean your clinical workflow stops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A cost that scales linearly&lt;/strong&gt; — every patient encounter means another API call, and token costs add up fast in healthcare where documents are long&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A trust problem&lt;/strong&gt; — patients and providers increasingly ask "where does my data go?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution isn't to avoid AI — it's to bring the AI to the data instead of sending the data to the AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Local LLM + MCP Pattern
&lt;/h2&gt;

&lt;p&gt;My architecture uses three core components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           Clinical Application              │
│  (Streamlit UI / FastAPI / CLI)             │
├─────────────────────────────────────────────┤
│           MCP Server Layer                  │
│  (Tool definitions, prompt templates,       │
│   FHIR resource handlers)                  │
├─────────────────────────────────────────────┤
│           Ollama Runtime                    │
│  (Gemma 4 model, local inference,          │
│   zero network transmission)               │
└─────────────────────────────────────────────┘
         ↕ Everything stays on localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; layer is what makes this modular. Instead of hardcoding LLM interactions, each healthcare capability is exposed as an MCP tool (a registration sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;summarize_intake&lt;/code&gt; — processes patient intake forms into structured clinical summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;interpret_lab_results&lt;/code&gt; — analyzes lab values against reference ranges with clinical context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deidentify_ehr&lt;/code&gt; — strips PHI from electronic health records while preserving clinical meaning&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;analyze_document&lt;/code&gt; — multi-agent document analysis for medical records&lt;/li&gt;
&lt;/ul&gt;
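
&lt;p&gt;As a sketch of the registration side, assuming the official &lt;code&gt;mcp&lt;/code&gt; Python SDK's FastMCP helper (the wiring here is illustrative; the summarizer class itself appears later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP("healthcare-tools")

@mcp.tool()
def summarize_intake(intake_text: str, format: str = "structured") -&amp;gt; dict:
    """Summarize a patient intake form into a structured clinical summary."""
    # Delegates to the local Ollama-backed IntakeSummarizer shown later
    # in this post; nothing leaves the machine.
    return IntakeSummarizer().summarize(intake_text, format)

if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;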

&lt;h3&gt;
  
  
  Why MCP?
&lt;/h3&gt;

&lt;p&gt;MCP provides a standardized interface between AI models and tools. For healthcare, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interoperability&lt;/strong&gt; — any MCP-compatible client can use the healthcare tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composability&lt;/strong&gt; — chain multiple tools (e.g., de-identify → summarize → flag risks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testability&lt;/strong&gt; — each tool can be tested independently with known inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt; — every tool invocation is logged with inputs and outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building the Patient Intake Summarizer
&lt;/h2&gt;

&lt;p&gt;Let me walk through one tool in detail. The Patient Intake Summarizer takes unstructured intake forms and produces structured clinical summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge
&lt;/h3&gt;

&lt;p&gt;Patient intake forms are messy. They contain free-text descriptions mixed with medical terminology, abbreviations, and varying formats. A typical intake might read:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"52F, presenting with lower back pain x 3 weeks, worse with sitting. PMH: DM2 on metformin 500mg BID, HTN on lisinopril 10mg daily. No known allergies. Family hx: mother had MI at 62."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A clinician can parse this instantly. An LLM needs structured prompting to extract the same information reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntakeSummarizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intake_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intake_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Low temp for clinical accuracy
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a clinical documentation assistant. 
Summarize the following patient intake form into a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; summary.

IMPORTANT: Extract ALL of the following categories:
- Demographics (age, sex, presenting complaint)
- Medical History (conditions, surgeries, hospitalizations)  
- Current Medications (drug, dose, frequency)
- Allergies (drug, food, environmental)
- Family History (conditions, relationships)
- Social History (occupation, habits, living situation)
- Risk Factors (clinical flags requiring attention)
- Missing Information (gaps that need follow-up)

Intake Form:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide the summary in structured JSON format.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is &lt;strong&gt;temperature 0.1&lt;/strong&gt;. For creative writing, you want high temperature. For clinical summarization, you want the model to be as deterministic and faithful to the source text as possible. Hallucinated medical information isn't creative — it's dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Format Output
&lt;/h3&gt;

&lt;p&gt;The summarizer supports three output formats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Brief&lt;/strong&gt; — 2-3 sentence overview for quick triage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detailed&lt;/strong&gt; — paragraph-form comprehensive summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt; — JSON with categorized fields for EHR integration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The structured format is particularly valuable because it can be directly ingested by downstream systems — no manual re-entry, no copy-paste errors.&lt;/p&gt;
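
&lt;p&gt;For the intake example quoted earlier, the structured output comes back shaped roughly like this. The exact field contents vary run to run, so treat it as an illustration of the shape, not a canonical result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;{
    "demographics": {"age": 52, "sex": "F",
                     "presenting_complaint": "lower back pain x 3 weeks, worse with sitting"},
    "medical_history": ["type 2 diabetes", "hypertension"],
    "current_medications": [
        {"drug": "metformin", "dose": "500mg", "frequency": "BID"},
        {"drug": "lisinopril", "dose": "10mg", "frequency": "daily"},
    ],
    "allergies": [],
    "family_history": [{"condition": "MI", "relation": "mother", "age_at_onset": 62}],
    "risk_factors": ["family history of early cardiac disease"],
    "missing_information": ["social history", "surgical history"],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;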

&lt;h2&gt;
  
  
  Lab Results Interpreter
&lt;/h2&gt;

&lt;p&gt;The lab interpreter is more complex because it needs reference ranges and clinical context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REFERENCE_RANGES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glucose_fasting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mg/dL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hba1c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;14.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creatinine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mg/dL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 50+ lab values
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;interpret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;REFERENCE_RANGES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_classify_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Only call LLM for abnormal values or when context matters
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interpretation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_llm_interpret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interpretation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is within normal range.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interpretation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interpretation&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the optimization: we only call the LLM for abnormal values or when patient context might change the interpretation. A normal glucose in a diabetic patient means something different than in a healthy patient — that's when the LLM adds value. For straightforward normal results, a rule-based response is faster and just as accurate.&lt;/p&gt;
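
&lt;p&gt;The rule-based side of that split is plain threshold logic. A minimal sketch of the classifier, written as a free function here for brevity (a simplified stand-in for &lt;code&gt;_classify_value&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_value(value: float, ref: dict) -&amp;gt; str:
    """Bucket a lab value against its reference range, checking critical bounds first."""
    if ref is None:
        return "unknown"
    if "critical_low" in ref and value &amp;lt;= ref["critical_low"]:
        return "critical_low"
    if "critical_high" in ref and value &amp;gt;= ref["critical_high"]:
        return "critical_high"
    if value &amp;lt; ref["low"]:
        return "low"
    if value &amp;gt; ref["high"]:
        return "high"
    return "normal"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;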

&lt;h2&gt;
  
  
  EHR De-identification
&lt;/h2&gt;

&lt;p&gt;De-identification is critical for research, training, and any scenario where clinical data needs to be shared without exposing patient identity.&lt;/p&gt;

&lt;p&gt;The tool identifies and removes 18 HIPAA identifier categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names, dates, phone numbers, emails&lt;/li&gt;
&lt;li&gt;Social Security numbers, medical record numbers&lt;/li&gt;
&lt;li&gt;Geographic data smaller than a state&lt;/li&gt;
&lt;li&gt;Biometric identifiers, device identifiers&lt;/li&gt;
&lt;li&gt;URLs, IP addresses, account numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM approach has an advantage over regex-based de-identification: it understands context. "Dr. Smith recommended the Smith protocol" — the first "Smith" is PHI, the second is a medical protocol name. A regex would remove both; the LLM preserves the clinically meaningful reference.&lt;/p&gt;
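
&lt;p&gt;In practice a conservative design combines both: deterministic regexes catch the unambiguously patterned identifiers, and the LLM handles the context-dependent ones like names. A minimal sketch of the regex pre-pass (my simplification; the repo may structure this differently):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Unambiguously patterned identifiers; names, addresses, and other
# context-dependent PHI are left to the LLM pass.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def regex_prepass(text: str) -&amp;gt; str:
    """Replace clearly patterned PHI with bracketed placeholders before the LLM pass."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;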

&lt;h2&gt;
  
  
  Docker Deployment
&lt;/h2&gt;

&lt;p&gt;Every tool ships with Docker Compose for one-command deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-data:/root/.ollama&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8501:8501"&lt;/span&gt;  &lt;span class="c1"&gt;# Streamlit UI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;  &lt;span class="c1"&gt;# FastAPI API&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_HOST=http://ollama:11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker compose up&lt;/code&gt; and you have a fully functional healthcare AI tool running locally. No API keys, no cloud accounts, no data leaving your network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Impact
&lt;/h2&gt;

&lt;p&gt;Across the four healthcare tools, the architecture delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero data transmission&lt;/strong&gt; — verified with network monitoring, no outbound connections during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency responses&lt;/strong&gt; — Gemma 4 on a consumer GPU generates clinical summaries in 800ms-2s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent accuracy&lt;/strong&gt; — low temperature + structured prompting produces reliable, reproducible outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete audit trail&lt;/strong&gt; — every tool invocation logged with timestamp, input hash, and output (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
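
&lt;p&gt;The audit logging is deliberately boring. A sketch of the idea, with illustrative field names: hash the input so the log can prove what was processed without storing raw PHI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import time

def log_invocation(tool: str, input_text: str, output: dict,
                   path: str = "audit.jsonl") -&amp;gt; None:
    """Append one audit record per tool call; inputs are stored only as hashes."""
    record = {
        "ts": time.time(),
        "tool": tool,
        "input_sha256": hashlib.sha256(input_text.encode("utf-8")).hexdigest(),
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;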

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm currently exploring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FHIR R4 integration&lt;/strong&gt; — mapping tool outputs to FHIR resources for EHR interoperability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A2A (Agent-to-Agent) protocol&lt;/strong&gt; — enabling healthcare agents to collaborate (e.g., intake summarizer triggers lab interpreter which triggers risk assessment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated evaluation&lt;/strong&gt; — benchmarking accuracy across institutions without sharing data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code is open source. If you're building healthcare AI that respects patient privacy, check out the repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/patient-intake-summarizer" rel="noopener noreferrer"&gt;patient-intake-summarizer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/lab-results-interpreter" rel="noopener noreferrer"&gt;lab-results-interpreter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/ehr-deidentifier" rel="noopener noreferrer"&gt;ehr-deidentifier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/docshield" rel="noopener noreferrer"&gt;docshield&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He has built 116+ open-source repositories, including a suite of privacy-first healthcare AI tools. Find his work on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Building RAG Pipelines That Actually Work: Lessons from Microsoft Copilot</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:45:40 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/building-rag-pipelines-that-actually-work-lessons-from-microsoft-copilot-1fnn</link>
      <guid>https://forem.com/kennedyraju55/building-rag-pipelines-that-actually-work-lessons-from-microsoft-copilot-1fnn</guid>
      <description>&lt;p&gt;Most RAG tutorials show you the happy path. You chunk a handful of PDFs, toss them into a vector store, wire up an LLM, and — magic — your chatbot answers questions about your documents. Demo complete. Applause.&lt;/p&gt;

&lt;p&gt;Here's what those tutorials don't show you: what happens when you deploy RAG at scale. When your corpus isn't 10 PDFs but 10 million documents. When your latency budget is 200 milliseconds, not "however long it takes." When a wrong answer isn't a minor inconvenience but a trust-destroying event for millions of users.&lt;/p&gt;

&lt;p&gt;I work on Microsoft Copilot's Search Infrastructure team, where my focus is semantic indexing and retrieval-augmented generation. I've also built over 116 open-source repositories, many of which experiment with RAG patterns across healthcare, developer tools, education, and creative AI. What follows is a distillation of what I've learned — the patterns that survive contact with production, and the failure modes that tutorials conveniently skip.&lt;/p&gt;




&lt;h2&gt;
  
  
  What RAG Actually Is (Quick Refresher)
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation is a simple idea: instead of asking an LLM to answer from memory alone, you first &lt;em&gt;retrieve&lt;/em&gt; relevant documents, then feed them as context alongside the user's query. The basic flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Embed → Retrieve from Index → Augment Prompt → Generate Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That five-step pipeline hides an enormous amount of complexity. Every arrow in that diagram is a place where things can go wrong. Let's walk through each stage and talk about what actually matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking Strategies That Matter
&lt;/h2&gt;

&lt;p&gt;Chunking is the most underrated part of the RAG pipeline. Get it wrong and nothing downstream can save you — not a better embedding model, not a smarter LLM, not a fancier retrieval algorithm. Garbage chunks in, garbage answers out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fixed-Size Chunking
&lt;/h3&gt;

&lt;p&gt;The naive approach: split every N tokens. It's fast, deterministic, and almost always wrong. A 512-token window doesn't care that it just sliced a paragraph in half, separated a code function from its docstring, or split a table across two chunks. The resulting fragments lack semantic coherence, which means your embeddings will be noisy and your retrieval will suffer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Chunking
&lt;/h3&gt;

&lt;p&gt;A better approach respects the natural boundaries of text. Sentences, paragraphs, sections — these are the units humans write in, and they're the units that produce coherent embeddings. The key insight is that a chunk should be a self-contained unit of meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive and Hierarchical Chunking
&lt;/h3&gt;

&lt;p&gt;For structured documents (markdown, HTML, code), recursive chunking splits along structural boundaries first — headers, then paragraphs, then sentences — falling back to smaller splits only when a section exceeds your token budget. This preserves the document's inherent hierarchy and produces chunks that actually make sense.&lt;/p&gt;
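
&lt;p&gt;A sketch of the recursive idea: try the coarsest separator first, and only descend to finer splits when a piece still exceeds the budget (&lt;code&gt;count_tokens&lt;/code&gt; is a whitespace stand-in for your real tokenizer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Coarse-to-fine separators: headers, paragraphs, lines, sentences.
SEPARATORS = ["\n## ", "\n\n", "\n", ". "]

def count_tokens(text: str) -&amp;gt; int:
    return len(text.split())  # stand-in; swap in your model's tokenizer

def recursive_chunk(text: str, max_tokens: int = 512, level: int = 0) -&amp;gt; list:
    """Split along structural boundaries, descending only when a piece is too big."""
    if count_tokens(text) &amp;lt;= max_tokens or level &amp;gt;= len(SEPARATORS):
        return [text]
    chunks = []
    for piece in text.split(SEPARATORS[level]):
        if piece.strip():
            chunks.extend(recursive_chunk(piece, max_tokens, level + 1))
    return chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;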

&lt;h3&gt;
  
  
  Overlapping Windows
&lt;/h3&gt;

&lt;p&gt;Here's a pattern that pays for itself immediately: overlap between adjacent chunks. Without overlap, information that spans a chunk boundary is effectively invisible to retrieval. A query about concept X might match the end of chunk 4 and the beginning of chunk 5, but neither chunk alone scores high enough to be retrieved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;semantic_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_into_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep overlap sentences
&lt;/span&gt;            &lt;span class="n"&gt;overlap_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="n"&gt;current_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlap_sentences&lt;/span&gt;
            &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I keep two trailing sentences as overlap. This is a deliberate choice — enough context to preserve cross-boundary meaning, but not so much that you're bloating your index with redundant content. Tune the overlap to your domain: technical documentation tends to need more overlap than conversational text.&lt;/p&gt;




&lt;h2&gt;
  
  
  Embedding Models — Choosing Wisely
&lt;/h2&gt;

&lt;p&gt;Your embedding model is the lens through which your entire corpus is viewed. Choose poorly and retrieval becomes a game of chance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Landscape
&lt;/h3&gt;

&lt;p&gt;OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; is the default choice for many teams, and it's a solid baseline — 1536 dimensions, reasonable performance across domains, easy API integration. But it's not always the right answer. Open-source models like &lt;strong&gt;BGE-large&lt;/strong&gt;, &lt;strong&gt;E5-large-v2&lt;/strong&gt;, and the &lt;strong&gt;sentence-transformers&lt;/strong&gt; family offer competitive quality with significant advantages: no API costs at scale, lower latency (run locally or on your own GPU fleet), and the ability to fine-tune on your domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain-Specific vs. General-Purpose
&lt;/h3&gt;

&lt;p&gt;If your corpus is specialized — legal documents, medical literature, codebases — a general-purpose embedding model may not capture the nuances that matter. A model fine-tuned on biomedical text will understand that "MI" means "myocardial infarction," not "Michigan." The MTEB leaderboard is your friend here: benchmark models against your actual query distribution, not generic benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimensionality Tradeoffs
&lt;/h3&gt;

&lt;p&gt;Higher dimensions capture more nuance but cost more in storage and search latency. At scale, the difference between 384 and 1536 dimensions is not academic — it's the difference between fitting your index in memory or needing distributed infrastructure. I've seen 768-dimensional models outperform 1536-dimensional ones on domain-specific tasks after fine-tuning. Measure, don't assume.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asymmetric Embedding
&lt;/h3&gt;

&lt;p&gt;This is the insight that separates production RAG from tutorial RAG: &lt;strong&gt;the query and the document should not be embedded the same way.&lt;/strong&gt; A query like "How do I reset my password?" is semantically different from a documentation passage that contains the answer. Models like E5 handle this explicitly with &lt;code&gt;query:&lt;/code&gt; and &lt;code&gt;passage:&lt;/code&gt; prefixes. If your embedding model supports asymmetric encoding, use it. The retrieval quality improvement is substantial and essentially free.&lt;/p&gt;
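
&lt;p&gt;With the E5 family this is just a prefixing convention applied before encoding. A minimal sketch with sentence-transformers, using the model name from the E5 model card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 was trained with explicit role prefixes, so queries and passages
# are deliberately embedded differently.
query_vec = model.encode("query: How do I reset my password?",
                         normalize_embeddings=True)
passage_vecs = model.encode(
    ["passage: To reset your password, open Settings and choose Security."],
    normalize_embeddings=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;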




&lt;h2&gt;
  
  
  Retrieval Ranking — Beyond Cosine Similarity
&lt;/h2&gt;

&lt;p&gt;Cosine similarity against a dense vector index is the starting point, not the finish line. In production, you need a ranking pipeline, not a single similarity score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid Search: Dense + Sparse
&lt;/h3&gt;

&lt;p&gt;Dense retrieval (vector search) excels at semantic matching — it understands that "automobile" and "car" are related. Sparse retrieval (BM25, keyword matching) excels at exact matching — it knows that "error code 0x8007045D" is a precise string, not a semantic concept. Neither alone is sufficient. The winning combination is both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dense_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sparse_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dense_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparse_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reciprocal Rank Fusion (RRF) is elegant because it doesn't require score normalization — it works purely on rank positions. The constant 60 in the denominator is a standard dampening factor that prevents top-ranked results from dominating. The &lt;code&gt;alpha&lt;/code&gt; parameter controls the dense-vs-sparse balance; 0.7 is a reasonable starting point, but you should tune it against your evaluation set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-Ranking with Cross-Encoders
&lt;/h3&gt;

&lt;p&gt;Bi-encoders (your embedding model) are fast because they encode queries and documents independently. Cross-encoders are &lt;em&gt;accurate&lt;/em&gt; because they process the query-document pair jointly, capturing fine-grained interactions. The pattern: retrieve broadly with a bi-encoder, then re-rank the top candidates with a cross-encoder. A model like &lt;code&gt;cross-encoder/ms-marco-MiniLM-L-12-v2&lt;/code&gt; can re-rank 100 candidates in tens of milliseconds on a GPU.&lt;/p&gt;
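&lt;p&gt;The retrieve-then-re-rank step is only a few lines with the sentence-transformers &lt;code&gt;CrossEncoder&lt;/code&gt; API (a sketch; the candidate list is assumed to come from the hybrid search above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, candidates, top_k=10):
    # Score each (query, passage) pair jointly -- expensive per pair,
    # but cheap for ~100 candidates scored as a single batch.
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;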

&lt;h3&gt;
  
  
  Metadata Filtering
&lt;/h3&gt;

&lt;p&gt;Not all retrieval should be purely semantic. If a user asks about "Python 3.11 features," you should filter by language and version &lt;em&gt;before&lt;/em&gt; running vector search, not after. Pre-filtering reduces the search space and eliminates false positives that would otherwise waste context window budget.&lt;/p&gt;
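&lt;p&gt;A minimal in-memory sketch of the idea (real vector stores such as Qdrant or Chroma apply the filter inside the index, which is what you want at scale; the over-retrieve-then-filter step here is a stand-in for true pre-filtering):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def filtered_search(query, index, metadata, filters, k=10):
    # Restrict the candidate set by metadata so the top-k slots
    # aren't wasted on wrong-language or wrong-version documents.
    allowed = {
        doc_id for doc_id, meta in metadata.items()
        if all(meta.get(key) == value for key, value in filters.items())
    }
    results = index.search(embed(query), k=k * 5)  # over-retrieve, then filter
    return [doc_id for doc_id in results if doc_id in allowed][:k]

# filtered_search("What's new in 3.11?", index, metadata,
#                 filters={"language": "python", "version": "3.11"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;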

&lt;h3&gt;
  
  
  The "Lost in the Middle" Problem
&lt;/h3&gt;

&lt;p&gt;Research from Stanford showed that LLMs pay disproportionate attention to the beginning and end of their context window, often ignoring information in the middle. This has direct implications for how you order retrieved passages. Don't dump them in retrieval-rank order: place your most relevant chunks at the beginning of the context, or better yet sandwich them, alternating the strongest results between the start and the end so the weakest land in the middle where attention is lowest.&lt;/p&gt;
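&lt;p&gt;A tiny reordering helper along those lines (the sandwich placement is one common mitigation, not the only one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sandwich_order(chunks_best_first):
    # Alternate the strongest chunks between the front and the back,
    # leaving the weakest in the middle where attention is lowest.
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["A", "B", "C", "D", "E"] (best to worst) becomes ["A", "C", "E", "D", "B"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;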




&lt;h2&gt;
  
  
  Context Window Management
&lt;/h2&gt;

&lt;p&gt;You've retrieved your chunks. Now you need to fit them — along with a system prompt, the user's query, and room for the response — into a fixed token budget. This is a packing problem, and it deserves more attention than it gets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Budgeting
&lt;/h3&gt;

&lt;p&gt;Be explicit about your budget allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TOTAL_CONTEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;  &lt;span class="c1"&gt;# or 128k, depends on your model
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;RESPONSE_RESERVE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;USER_QUERY_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;  &lt;span class="c1"&gt;# estimate or measure
&lt;/span&gt;
&lt;span class="n"&gt;CONTEXT_BUDGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOTAL_CONTEXT&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT_TOKENS&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;RESPONSE_RESERVE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;USER_QUERY_TOKENS&lt;/span&gt;
&lt;span class="c1"&gt;# = 6468 tokens for retrieved passages
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every token you spend on a low-relevance chunk is a token you can't spend on a high-relevance one. Rank your chunks by retrieval score and pack greedily until the budget is full.&lt;/p&gt;
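&lt;p&gt;The greedy packer is short enough to show in full (a sketch; the length-based &lt;code&gt;count_tokens&lt;/code&gt; is a crude stand-in for your model's real tokenizer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def count_tokens(text):
    return len(text) // 4  # rough estimate; swap in the model's tokenizer

def pack_chunks(ranked_chunks, budget=CONTEXT_BUDGET):
    # Take chunks in relevance order until the next one would overflow.
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost &amp;gt; budget:
            break  # or `continue` to try squeezing in smaller chunks
        packed.append(chunk)
        used += cost
    return packed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;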

&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;

&lt;p&gt;When your top chunks exceed the budget, you have two choices: drop chunks or compress them. Compression techniques range from simple extractive summarization (keep only the most relevant sentences within each chunk) to LLM-based summarization. The tradeoff is latency vs. information density. In latency-sensitive pipelines, extractive approaches win.&lt;/p&gt;
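&lt;p&gt;A minimal extractive compressor, reusing the &lt;code&gt;embed&lt;/code&gt; helper from retrieval (a sketch: the sentence splitting is naive, and it assumes &lt;code&gt;embed&lt;/code&gt; returns normalized vectors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def compress_chunk(chunk, query_vec, keep=3):
    # Keep only the sentences most similar to the query,
    # preserving their original order for readability.
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    scored = [(i, float(np.dot(embed(s), query_vec)))
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, key=lambda x: x[1], reverse=True)[:keep])
    return ". ".join(sentences[i] for i, _ in top) + "."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;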

&lt;h3&gt;
  
  
  Strategy Selection: Stuff vs. Map-Reduce vs. Refine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuff:&lt;/strong&gt; Concatenate all retrieved chunks into a single prompt. Simple, fast, works when everything fits in the context window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map-Reduce:&lt;/strong&gt; Process each chunk independently, then aggregate the results. Necessary when the total retrieved content exceeds the context window. More LLM calls, higher latency, but handles scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refine:&lt;/strong&gt; Process chunks sequentially, refining the answer with each new chunk. Produces high-quality answers but has the highest latency. Use for offline or batch workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, I default to Stuff with aggressive filtering. If your retrieval and ranking are good, you shouldn't need more than 5–8 highly relevant chunks to answer most questions.&lt;/p&gt;
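&lt;p&gt;When Stuff doesn't fit, Map-Reduce is the usual fallback. Here's a sketch of the map and reduce steps, where &lt;code&gt;llm&lt;/code&gt; is a hypothetical completion helper for whatever model you're running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def map_reduce(query, chunks):
    # Map: answer the question against each chunk independently.
    partials = [
        llm(f"Context:\n{chunk}\n\nQuestion: {query}\n"
            "Answer from this context only:")
        for chunk in chunks
    ]
    # Reduce: synthesize the per-chunk answers into one.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(f"Combine these partial answers into one answer.\n"
               f"Partial answers:\n{joined}\n\nQuestion: {query}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;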




&lt;h2&gt;
  
  
  Common Failure Modes (And How to Debug Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retrieval Misses
&lt;/h3&gt;

&lt;p&gt;The document exists in your corpus but wasn't retrieved. Debug by running the query embedding against the target document's embedding directly — if the similarity is low, the problem is in your chunking or embedding model. If the similarity is high but the document wasn't in the top-K, your index may have quantization issues or you're not retrieving enough candidates before re-ranking.&lt;/p&gt;
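&lt;p&gt;That first check takes four lines (a sketch; &lt;code&gt;target_text&lt;/code&gt; is the chunk you expected to come back):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

q = embed("How do I reset my password?")
d = embed(target_text)  # the chunk that should have been retrieved
sim = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
# Low similarity: chunking or embedding problem.
# High similarity but missing from the top-K: index or candidate-count problem.
print(f"cosine similarity: {sim:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;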

&lt;h3&gt;
  
  
  Context Poisoning
&lt;/h3&gt;

&lt;p&gt;You retrieved 10 chunks, but 7 of them are irrelevant. The LLM now has to distinguish signal from noise, and it doesn't always succeed. The fix is upstream: better chunking, better ranking, and aggressive relevance thresholds. Drop any chunk below a minimum similarity score rather than always returning a fixed K.&lt;/p&gt;
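&lt;p&gt;In code, that's the difference between slicing to a fixed K and filtering by score (a sketch; the threshold value must be tuned per embedding model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MIN_SIMILARITY = 0.45  # tune against your evaluation set

def select_chunks(scored_results, k=10):
    # Return AT MOST k chunks, and only those above the relevance floor.
    # Returning fewer chunks -- or none -- beats poisoning the context.
    return [(doc, score) for doc, score in scored_results[:k]
            if score &amp;gt;= MIN_SIMILARITY]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;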

&lt;h3&gt;
  
  
  Hallucination Despite Correct Context
&lt;/h3&gt;

&lt;p&gt;The right chunk was retrieved and included in the prompt, but the LLM still hallucinated. This is often a prompt engineering problem. Explicit instructions like "Answer based ONLY on the provided context. If the context doesn't contain the answer, say so" are essential, not optional. Also consider: is the relevant information buried in a long passage? The "lost in the middle" effect applies within individual chunks too.&lt;/p&gt;
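&lt;p&gt;A grounded prompt template along those lines (a starting point, not gospel; the exact wording that works varies by model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;GROUNDED_PROMPT = """You are a question-answering assistant.
Answer based ONLY on the provided context.
If the context does not contain the answer, say "I don't know."
Do not use outside knowledge.

Context:
{context}

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(context=packed_context, question=user_query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;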

&lt;h3&gt;
  
  
  Stale Embeddings
&lt;/h3&gt;

&lt;p&gt;Your documents were updated but the embeddings weren't re-computed. This is the RAG equivalent of a cache invalidation bug. Build your indexing pipeline with incremental updates from day one. Track document hashes and re-embed only what changed. At scale, a full re-index is a multi-hour, multi-GPU operation — you don't want to do it unnecessarily.&lt;/p&gt;
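&lt;p&gt;A hash-tracking sketch for incremental re-embedding (&lt;code&gt;embed_and_upsert&lt;/code&gt; is a stand-in for your index's write path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

def doc_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_index(documents, state_path="index_state.json"):
    # Load each document's hash as of the last indexing run.
    try:
        with open(state_path) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    for doc_id, text in documents.items():
        h = doc_hash(text)
        if seen.get(doc_id) != h:  # new or changed document
            embed_and_upsert(doc_id, text)
            seen[doc_id] = h
    with open(state_path, "w") as f:
        json.dump(seen, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;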




&lt;h2&gt;
  
  
  Lessons from Scale
&lt;/h2&gt;

&lt;p&gt;What changes when you go from a demo to a production system serving millions of users?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index management becomes a first-class concern.&lt;/strong&gt; You need index versioning, blue-green deployments for index updates, and the ability to roll back a bad index without downtime. Your index is as critical as your database — treat it that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency budgets force hard tradeoffs.&lt;/strong&gt; At Microsoft scale, every millisecond matters. You might skip re-ranking on low-importance queries. You might use a smaller embedding model for initial retrieval and reserve the expensive cross-encoder for the final top-10. Tiered retrieval architectures are common in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring is non-negotiable.&lt;/strong&gt; Track retrieval precision and recall against labeled query sets. Monitor embedding drift over time. Alert on sudden drops in answer quality. Log the full pipeline: query → retrieved chunks → generated answer, so you can debug failures post-hoc. The RAG pipeline that you can't observe is the RAG pipeline that silently degrades.&lt;/p&gt;
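&lt;p&gt;Even a structured one-liner per request buys you post-hoc debuggability (a sketch; the chunk dicts with &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; keys are an assumed shape):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import time
import uuid

def log_rag_trace(query, chunks, answer, latency_ms):
    # One JSON line per request: grep-able, diff-able, replayable.
    print(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "chunk_ids": [c["id"] for c in chunks],
        "chunk_scores": [c["score"] for c in chunks],
        "answer": answer,
        "latency_ms": latency_ms,
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;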

&lt;p&gt;&lt;strong&gt;Evaluation is continuous.&lt;/strong&gt; Build evaluation sets that reflect your actual query distribution. Automated metrics (faithfulness, relevance, answer correctness) run on every pipeline change. Human evaluation catches what automated metrics miss. This isn't optional — it's how you maintain quality over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline is deceptively simple to prototype and genuinely hard to operate at scale. The architecture diagram fits on a napkin: embed, retrieve, generate. But the difference between a demo and a production system lives in the details — how you chunk documents, which embedding model you choose, how you rank and filter results, how you manage the context window, and how you monitor the whole thing.&lt;/p&gt;

&lt;p&gt;My advice: build incrementally. Start with the simplest version that works, instrument everything, and let your evaluation data tell you where to invest next. Don't over-engineer the retrieval before you've verified your chunking is sound. Don't add re-ranking before you've confirmed your base retrieval is reasonable.&lt;/p&gt;

&lt;p&gt;And don't skip the boring parts. Chunking and ranking aren't glamorous, but they're where production RAG systems are won or lost. The LLM is the easy part.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he works on Semantic Indexing and Retrieval-Augmented Generation. He has built 116+ open-source repositories spanning AI/ML, healthcare, developer tools, and creative AI. Find his work on GitHub at &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>ai</category>
      <category>mojo</category>
    </item>
  </channel>
</rss>
