<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nrk Raju Guthikonda</title>
    <description>The latest articles on Forem by Nrk Raju Guthikonda (@kennedyraju55).</description>
    <link>https://forem.com/kennedyraju55</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875587%2F0e0fea57-3e20-4e0a-bf89-f91e1bb899e0.png</url>
      <title>Forem: Nrk Raju Guthikonda</title>
      <link>https://forem.com/kennedyraju55</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kennedyraju55"/>
    <language>en</language>
    <item>
      <title>Your Voice Assistant Doesn't Need the Cloud — Here's How I Built 5 Offline NLP Tools</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:35:15 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/your-voice-assistant-doesnt-need-the-cloud-heres-how-i-built-5-offline-nlp-tools-4n44</link>
      <guid>https://forem.com/kennedyraju55/your-voice-assistant-doesnt-need-the-cloud-heres-how-i-built-5-offline-nlp-tools-4n44</guid>
      <description>&lt;p&gt;Every time I build an AI-powered tool that requires an internet connection, I feel a small pang of guilt. We've normalized shipping software that stops working the moment a cloud API goes down, a subscription lapses, or a user happens to be on an airplane. But here's the thing: most NLP tasks — sentiment analysis, text summarization, conversational AI, even voice assistants — don't &lt;em&gt;need&lt;/em&gt; the cloud anymore.&lt;/p&gt;

&lt;p&gt;Over the past year, I've built a series of open-source tools that prove this point. They handle voice calls, language tutoring, sentiment dashboards, news digestion, and research paper Q&amp;amp;A — all running locally with &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; and models like Gemma 3 4B. No API keys. No cloud bills. No data leaving your machine.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through the patterns I've found most effective for building offline-first NLP applications in Python, with real code from five of my projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Offline NLP Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;The conversation around AI tooling is dominated by cloud-first thinking. GPT-4o, Claude, Gemini — they're brilliant, but they come with strings attached:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Every prompt you send is processed on someone else's server. For healthcare data, legal documents, or personal conversations, that's a non-starter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: API calls add up fast. A sentiment analysis pipeline processing 10,000 documents a day can cost hundreds of dollars per month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Cloud APIs have rate limits, outages, and deprecation cycles. Your local GPU doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: A local model on an M-series Mac or a decent NVIDIA card starts streaming tokens in milliseconds, with no network round trip or provider queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience building search and retrieval systems, I've learned that the best AI tool is the one that's always available. Local LLMs make that possible for a surprising range of NLP tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Ollama as Your Local AI Runtime
&lt;/h2&gt;

&lt;p&gt;Every project I'll discuss uses the same foundation: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; running a local model (typically Gemma 3 4B). The setup is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
ollama pull gemma3:4b

&lt;span class="c"&gt;# Verify it's running&lt;/span&gt;
ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the Python integration pattern I use across all my projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a prompt to the local Ollama instance and return the response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function is the beating heart of every tool I build. It's simple, it's reliable, and it works identically whether you're on a MacBook, a Linux workstation, or a Windows machine with WSL.&lt;/p&gt;
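&lt;p&gt;A quick smoke test, assuming Ollama is running and &lt;code&gt;gemma3:4b&lt;/code&gt; has been pulled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One-off sanity check for the helper above
print(query_local_llm("In one sentence, what is retrieval-augmented generation?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;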

&lt;h2&gt;
  
  
  Project 1: CallPilot — A Voice AI Assistant
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/callpilot" rel="noopener noreferrer"&gt;CallPilot&lt;/a&gt; is probably the most ambitious project in this collection. It's an AI-powered outbound phone call assistant: you give it a phone number and instructions ("Book a dentist appointment for Tuesday at 3pm"), and it handles the entire conversation.&lt;/p&gt;

&lt;p&gt;The architecture bridges Twilio's real-time voice streaming with an AI backend, using RAG (Retrieval-Augmented Generation) with ChromaDB to give the AI access to personal documents like insurance cards or medical records during calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve relevant context from local vector store for RAG.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chroma_db_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duckdb+parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./vectorstore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight here is that voice AI doesn't have to mean "send everything to a cloud transcription service." The RAG pipeline runs entirely locally — your documents are chunked, embedded, and stored in a local ChromaDB instance. When the AI needs context during a call, it queries the vector store on your machine.&lt;/p&gt;
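&lt;p&gt;For illustration, here's a minimal ingestion sketch using ChromaDB's persistent client and its default embedding function (the repo's real pipeline may chunk and embed differently):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

def ingest_documents(docs: list[str], persist_dir: str = "./vectorstore") -&amp;gt; None:
    """Chunk plain-text documents and persist them in a local vector store."""
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("documents")
    for doc_id, doc in enumerate(docs):
        # Naive fixed-size chunking (~1000 characters); real pipelines
        # usually split on sentence or section boundaries instead
        chunks = [doc[i:i + 1000] for i in range(0, len(doc), 1000)]
        collection.add(
            documents=chunks,
            ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;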

&lt;p&gt;While CallPilot currently uses OpenAI's Realtime API for the voice streaming component (real-time bidirectional audio is still a hard problem for local models), the entire knowledge retrieval pipeline is local. As local speech-to-text and text-to-speech models improve, the goal is to make this fully offline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 2: Language Learning Bot — Conversational AI for Education
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/language-learning-bot" rel="noopener noreferrer"&gt;Language Learning Bot&lt;/a&gt; is a polyglot companion that supports 15 languages through conversation practice, vocabulary drills, and structured lessons — all powered by a local LLM via Ollama.&lt;/p&gt;

&lt;p&gt;The conversation engine adapts to beginner, intermediate, or advanced levels and provides real-time corrections with grammar explanations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tutor_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build a language tutor system prompt for the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a friendly &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; language tutor.
The student&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s level is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Rules:
- Respond primarily in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with English translations in parentheses
- Correct any grammar mistakes gently, explaining the rule
- Adapt vocabulary complexity to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; level
- Include cultural context when relevant
- End each response with a follow-up question to keep practicing

Student says: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Usage with local Ollama
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;create_tutor_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spanish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beginner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yo quiero ir al parque&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What makes this project compelling for offline use is the privacy angle. Language learners make mistakes — that's the whole point. Having those mistakes processed locally, never logged on a remote server, creates a psychologically safer learning environment. Every chat session, vocabulary list, and progress metric stays on the user's machine in a local JSON store.&lt;/p&gt;
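&lt;p&gt;The persistence layer is deliberately boring. A minimal sketch of the idea (the field names and file name here are my own, not the repo's schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

def save_session(session: dict, store: str = "progress.json") -&amp;gt; None:
    """Append one chat session to a local JSON file; nothing leaves disk."""
    path = Path(store)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(session)
    path.write_text(json.dumps(history, indent=2, ensure_ascii=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;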

&lt;h2&gt;
  
  
  Project 3: Sentiment Analysis Dashboard — Text Analytics with Streamlit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;Sentiment Analysis Dashboard&lt;/a&gt; processes text files through an LLM-powered classification pipeline with confidence scores, trend detection, and word cloud generation.&lt;/p&gt;

&lt;p&gt;The core analysis pattern uses structured prompting to get consistent, parseable output from the local LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze sentiment of text using the local LLM with structured output.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze the sentiment of the following text.
Return ONLY a JSON object with these fields:
- sentiment: one of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neutral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mixed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
- confidence: float between 0.0 and 1.0
- key_phrases: list of 3-5 important phrases from the text
- summary: one-sentence summary of the overall tone

Text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

JSON:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;raw_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Streamlit dashboard renders these results into interactive visualizations — sentiment distribution charts, sliding-window trend analysis, and word clouds. The entire pipeline processes text in seconds per entry (with batch support), compared to the minutes per entry that manual review takes.&lt;/p&gt;

&lt;p&gt;What I find most valuable here is the consistency. A human reviewer's sentiment judgment drifts throughout the day based on fatigue and mood. The local LLM produces consistent classifications with quantified confidence scores, and it does it without sending your text data to any third party.&lt;/p&gt;
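&lt;p&gt;One caveat worth flagging: calling &lt;code&gt;json.loads&lt;/code&gt; on raw model output is brittle, because local models occasionally wrap the JSON in a Markdown fence. A small defensive wrapper (my own addition, not from the repo) keeps batch runs alive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def parse_llm_json(raw: str) -&amp;gt; dict:
    """Parse JSON from LLM output, tolerating Markdown code fences."""
    # Strip a leading ```json (or bare ```) fence and a trailing ``` if present
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;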

&lt;h2&gt;
  
  
  Project 4: News Digest Generator — Information Triage at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/news-digest-generator" rel="noopener noreferrer"&gt;News Digest Generator&lt;/a&gt; tackles information overload. Drop a folder of &lt;code&gt;.txt&lt;/code&gt; news articles on it and get back a structured, categorized digest with sentiment analysis and trend detection.&lt;/p&gt;

&lt;p&gt;The categorization pipeline is where the local LLM really shines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;num_categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Group articles into topic categories using the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;titles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Given these news articles, group them into exactly
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_categories&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; topic categories.

Articles:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Return a JSON array where each element has:
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: short category name
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_indices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: list of article index numbers
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 2-3 sentence summary of this topic cluster

JSON:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The digest output includes key headlines, topic summaries, per-article sentiment, trending themes, and a forward-looking outlook section. It's the kind of tool that journalists, analysts, and researchers can run on sensitive or proprietary content without worrying about data leakage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 5: Research Paper Q&amp;amp;A — RAG for Academic Literature
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kennedyraju55/research-paper-qa" rel="noopener noreferrer"&gt;Research Paper Q&amp;amp;A&lt;/a&gt; lets you drop PDF research papers into a folder and ask questions about them in natural language. It uses a RAG pipeline with ChromaDB to chunk, embed, and retrieve relevant passages, then feeds them to Gemma 4 for answer generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_paper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paper_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer a question about a research paper using RAG.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve the most relevant chunks
&lt;/span&gt;    &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paper_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Based on the following excerpts from a research paper,
answer the question. Only use information from the provided excerpts.
If the answer isn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t in the excerpts, say so.

Excerpts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;query_local_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is perhaps the most natural fit for offline AI. Researchers often work with pre-publication papers, proprietary datasets, or materials under NDA. A local RAG pipeline means you can ask "What methodology did they use for the control group?" without that question — or the paper's content — ever touching an external server.&lt;/p&gt;
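&lt;p&gt;The &lt;code&gt;retrieve_chunks&lt;/code&gt; helper is elided above. A minimal sketch of what it could look like, assuming an in-memory ChromaDB collection with its default embedding function (the repo's implementation may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

def retrieve_chunks(question: str, chunks: list[str], top_k: int = 5) -&amp;gt; list[str]:
    """Return the chunks most semantically similar to the question."""
    client = chromadb.EphemeralClient()  # in-memory; re-embeds on every call
    collection = client.get_or_create_collection("paper")
    collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])
    results = collection.query(query_texts=[question], n_results=min(top_k, len(chunks)))
    return results["documents"][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real pipeline you'd embed once at ingestion time rather than per question; the sketch trades efficiency for brevity.&lt;/p&gt;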

&lt;h2&gt;
  
  
  Patterns I Keep Coming Back To
&lt;/h2&gt;

&lt;p&gt;After building these five projects (and many more — I'm at 116+ open-source repos now), I've seen certain patterns prove themselves repeatedly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured prompting for parseable output&lt;/strong&gt;: Always ask the LLM to return JSON with a specific schema. It makes downstream processing predictable and testable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local vector stores for RAG&lt;/strong&gt;: ChromaDB with persistent storage is lightweight enough to embed in any project. The retrieval quality with even small embedding models is excellent for domain-specific content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ollama as the universal runtime&lt;/strong&gt;: By standardizing on Ollama's API, every project works with any compatible model. Swap Gemma for Llama or Mistral with a single config change (sketched after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CLI-first, web-second&lt;/strong&gt;: Every project starts as a Click CLI tool, then gets a Streamlit or Gradio web UI. This ensures the core logic is clean, testable, and scriptable before any UI complexity enters the picture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Privacy by architecture, not policy&lt;/strong&gt;: When the LLM runs on &lt;code&gt;localhost:11434&lt;/code&gt;, there's no privacy policy to read. The data physically cannot leave the machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
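&lt;p&gt;That config change can be as small as one environment variable. A sketch of the pattern, reusing &lt;code&gt;query_local_llm&lt;/code&gt; from earlier (the variable name is my own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# One env var selects the model tag, so swapping gemma3:4b
# for llama3.2 or mistral is a one-line change
MODEL = os.environ.get("LOCAL_LLM_MODEL", "gemma3:4b")

print(query_local_llm("Summarize local-first AI in one sentence.", model=MODEL))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;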

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to explore any of these projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
ollama pull gemma3:4b

&lt;span class="c"&gt;# Clone any project&lt;/span&gt;
git clone https://github.com/kennedyraju55/sentiment-analysis-dashboard.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sentiment-analysis-dashboard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Run the CLI&lt;/span&gt;
python main.py analyze &lt;span class="nt"&gt;--file&lt;/span&gt; sample.txt

&lt;span class="c"&gt;# Or launch the web UI&lt;/span&gt;
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All five projects follow the same structure: install Ollama, pull a model, clone the repo, install dependencies, and run. No API keys. No account creation. No cloud configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem is evolving fast. Models are getting smaller and more capable. Ollama recently added vision model support, which opens up entirely new offline use cases — document OCR, image-based Q&amp;amp;A, multimodal assistants. I'm actively building tools that leverage these capabilities.&lt;/p&gt;

&lt;p&gt;The thesis is simple: &lt;strong&gt;if your NLP tool requires an internet connection and it doesn't strictly need one, you're shipping a worse product than you could be.&lt;/strong&gt; Local LLMs have crossed the quality threshold for production use in dozens of NLP tasks. The tools I've shared here prove it.&lt;/p&gt;

&lt;p&gt;Every one of these projects is MIT-licensed and open source. Clone them, break them, improve them. That's the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Nrk Raju Guthikonda&lt;/strong&gt; is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, working on semantic indexing and retrieval-augmented generation. Outside of work, he maintains 116+ open-source repositories exploring AI, NLP, healthcare tech, developer tools, and creative applications — all built with local LLMs and a privacy-first philosophy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐙 GitHub: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✍️ Blog: &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 LinkedIn: &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;linkedin.com/in/nrk-raju-guthikonda-504066a8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>nlp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Built 5 AI Developer Tools That Run Entirely on My Laptop — No API Keys, No Cloud, No Limits</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:29:52 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/i-built-5-ai-developer-tools-that-run-entirely-on-my-laptop-no-api-keys-no-cloud-no-limits-234</link>
      <guid>https://forem.com/kennedyraju55/i-built-5-ai-developer-tools-that-run-entirely-on-my-laptop-no-api-keys-no-cloud-no-limits-234</guid>
      <description>&lt;p&gt;Every developer has felt the friction: you want AI to help with a mundane task — writing standup notes, reviewing a pull request, generating boilerplate — but the moment you reach for a cloud API, you hit rate limits, accumulate costs, or worse, realize you can't send proprietary code to a third-party endpoint.&lt;/p&gt;

&lt;p&gt;What if the AI lived on your machine? No API keys. No network dependency. No billing surprises. Just a local model serving intelligent responses over localhost.&lt;/p&gt;

&lt;p&gt;Over the past year, I've built a suite of open-source developer productivity tools that run entirely on local LLMs using &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; and Google's Gemma model family. In this post, I'll walk through the architecture, share real code, and explain why local-first AI is the most practical path for developer tooling today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Local LLMs for Developer Tools?
&lt;/h2&gt;

&lt;p&gt;Cloud-hosted LLMs are powerful, but they carry trade-offs that matter in daily engineering workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost accumulates fast.&lt;/strong&gt; A team of ten engineers each making 50 AI-assisted queries per day burns through API credits quickly. Local inference is free after the initial model download.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline-first matters.&lt;/strong&gt; Planes, coffee shops with spotty Wi-Fi, corporate VPNs that block external endpoints — local models don't care.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy is non-negotiable.&lt;/strong&gt; When you're reviewing code from a private repository or generating reports that reference internal project names, sending that context to a remote API is a risk. Local inference keeps everything on-disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency is predictable.&lt;/strong&gt; No cold starts, no queue wait times, no variable response times based on provider load. A 4B parameter model on a modern laptop with 16 GB RAM responds in 1–3 seconds consistently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience building production search and retrieval systems, I've learned that the best developer tools are the ones with zero friction to adopt. Local LLMs eliminate the biggest friction point: setup and credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: Ollama + Gemma + FastAPI
&lt;/h2&gt;

&lt;p&gt;The architecture I've converged on across multiple projects is deliberately simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│              Developer's Laptop              │
│                                              │
│  ┌──────────┐    HTTP     ┌──────────────┐  │
│  │  FastAPI  │ ◄────────► │   Ollama      │  │
│  │  App      │  localhost  │   (Gemma 4)   │  │
│  │  :8000    │   :11434    │   4B params   │  │
│  └──────────┘             └──────────────┘  │
│       ▲                                      │
│       │  Browser / CLI / IDE Plugin          │
│  ┌──────────┐                                │
│  │   User   │                                │
│  └──────────┘                                │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; handles model management and inference. One command pulls a model, and it serves an OpenAI-compatible API on &lt;code&gt;localhost:11434&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 3 (4B)&lt;/strong&gt; is the sweet spot — small enough to run on laptops without a dedicated GPU, capable enough for code understanding, summarization, and generation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; provides the application layer: prompt engineering, input validation, structured output parsing, and a clean UI or CLI interface.&lt;/p&gt;
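&lt;p&gt;Because that OpenAI-compatible endpoint lives under &lt;code&gt;/v1&lt;/code&gt;, you can also point the standard &lt;code&gt;openai&lt;/code&gt; client at Ollama. A quick sketch (the API key is a placeholder; Ollama ignores it, but the client requires one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;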

&lt;h2&gt;
  
  
  Getting Started: Ollama in 60 Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama (macOS/Linux)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull the model (one-time ~2.5 GB download)&lt;/span&gt;
ollama pull gemma3:4b

&lt;span class="c"&gt;# Verify it's running&lt;/span&gt;
curl http://localhost:11434/api/tags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, download the installer from &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt; and Ollama runs as a background service automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project 1: AI Standup Generator
&lt;/h2&gt;

&lt;p&gt;Every morning, the same ritual: open your git log, skim through Jira tickets, and type up a standup update that nobody will remember five minutes later. The &lt;a href="https://github.com/kennedyraju55/standup-generator" rel="noopener noreferrer"&gt;standup-generator&lt;/a&gt; automates this entirely.&lt;/p&gt;

&lt;p&gt;You feed it bullet points about what you worked on, and the local LLM transforms them into a structured standup report with "Yesterday," "Today," and "Blockers" sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_standup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a concise engineering standup assistant.
Given these raw notes, produce a structured standup report
with sections: Yesterday, Today, Blockers.
Keep each bullet under 15 words.

Raw notes:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_notes&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low temperature (0.3)&lt;/strong&gt; keeps output deterministic — standups shouldn't be creative writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream disabled&lt;/strong&gt; for simplicity in CLI/API mode; enable it for real-time UI feedback (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;httpx over requests&lt;/strong&gt; because it's async-friendly when you graduate to FastAPI endpoints.&lt;/li&gt;
&lt;/ul&gt;
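&lt;p&gt;When streaming is enabled, Ollama returns newline-delimited JSON objects, each carrying a &lt;code&gt;response&lt;/code&gt; fragment. A minimal sketch of consuming that stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import httpx

def stream_standup(prompt: str) -&amp;gt; None:
    """Print the standup as tokens arrive instead of waiting for the full reply."""
    with httpx.stream(
        "POST",
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:4b", "prompt": prompt, "stream": True},
        timeout=60.0,
    ) as response:
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
    print()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;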

&lt;h2&gt;
  
  
  Project 2: AI Code Review Bot
&lt;/h2&gt;

&lt;p&gt;Code reviews are where local AI shines brightest. You absolutely should not send your team's proprietary code to a third-party API for review. The &lt;a href="https://github.com/kennedyraju55/code-review-bot" rel="noopener noreferrer"&gt;code-review-bot&lt;/a&gt; runs a local Gemma model to analyze diffs and surface issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior code reviewer. Analyze this code for:
1. Bugs or logic errors
2. Security vulnerabilities
3. Performance concerns
4. Readability improvements

Be specific. Reference line numbers. Skip style nitpicks.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
{code}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
    response = httpx.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.2, "num_ctx": 8192},
        },
        timeout=60.0,
    )
    return response.json()["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;num_ctx: 8192&lt;/code&gt; — this extends the context window so the model can ingest larger files. For a 4B model, 8K tokens is the practical ceiling before quality degrades.&lt;/p&gt;
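&lt;p&gt;Since oversized files silently degrade output, I'd guard the input before sending it. A rough sketch using a characters-per-token heuristic (the 3:1 ratio is an approximation, not something Ollama reports):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def clamp_to_context(code: str, max_chars: int = 24_000) -&amp;gt; str:
    """Keep the prompt under roughly 8K tokens, assuming ~3 characters per token."""
    if len(code) &amp;lt;= max_chars:
        return code
    half = max_chars // 2
    # Keep the head and tail of the file and drop the middle as a crude compromise
    return code[:half] + "\n# ... truncated ...\n" + code[-half:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;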

&lt;h2&gt;
  
  
  Project 3: Cover Letter Generator
&lt;/h2&gt;

&lt;p&gt;Job applications are tedious. The &lt;a href="https://github.com/kennedyraju55/cover-letter-generator" rel="noopener noreferrer"&gt;cover-letter-generator&lt;/a&gt; takes a job description and your resume bullets, then produces a tailored cover letter — all without sending your personal career history to OpenAI's servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_cover_letter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;job_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resume_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resume_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resume_points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write a professional cover letter for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Job Description:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Candidate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Key Qualifications:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resume_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Requirements:
- 3 paragraphs maximum
- Specific connections between qualifications and job requirements
- Professional but authentic tone
- No generic filler sentences
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;45.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature at 0.5 here — slightly higher than standup or code review because cover letters benefit from a touch of variability while staying professional.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond AI: The Full Developer Toolkit
&lt;/h2&gt;

&lt;p&gt;Not every productivity tool needs an LLM. Two other projects in my toolkit solve pure engineering problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennedyraju55/apiwatch" rel="noopener noreferrer"&gt;apiwatch&lt;/a&gt;&lt;/strong&gt; — An API contract testing and health monitoring CLI. You define API contracts in YAML, and apiwatch continuously validates your endpoints against those contracts. It catches breaking changes, performance degradation, and response schema violations before they hit production. Think of it as a lightweight Pact alternative that runs from a single CLI command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kennedyraju55/loadlens" rel="noopener noreferrer"&gt;loadlens&lt;/a&gt;&lt;/strong&gt; — A load testing and capacity planning toolkit built in Python. It helps teams understand their actual throughput — including why "8 RPS per machine" might be less impressive than it sounds when you factor in connection overhead, payload size, and downstream dependencies.&lt;/p&gt;

&lt;p&gt;Both tools follow the same philosophy: zero external dependencies for core functionality, runs anywhere Python runs, and delivers value in under five minutes of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns That Work Across All These Tools
&lt;/h2&gt;

&lt;p&gt;After building 116+ open-source repositories, I've seen certain patterns consistently emerge:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Structured Prompts with Clear Constraints
&lt;/h3&gt;

&lt;p&gt;The biggest improvement in local LLM output comes not from model size but from prompt structure. Always tell the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What role to assume&lt;/li&gt;
&lt;li&gt;What input format to expect&lt;/li&gt;
&lt;li&gt;What output format you need&lt;/li&gt;
&lt;li&gt;What to exclude (often more important than what to include)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Temperature as a Knob, Not a Setting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;0.1–0.2&lt;/td&gt;
&lt;td&gt;Deterministic, factual analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standup reports&lt;/td&gt;
&lt;td&gt;0.2–0.3&lt;/td&gt;
&lt;td&gt;Structured but slightly varied phrasing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cover letters&lt;/td&gt;
&lt;td&gt;0.4–0.6&lt;/td&gt;
&lt;td&gt;Natural language that doesn't sound robotic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative writing&lt;/td&gt;
&lt;td&gt;0.7–0.9&lt;/td&gt;
&lt;td&gt;Exploratory, varied output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Timeout Budgets
&lt;/h3&gt;

&lt;p&gt;Local models on CPU can take 10–30 seconds for complex prompts. Always set explicit timeouts and provide user feedback (progress indicators or streaming responses) so the tool doesn't feel broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Graceful Degradation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TimeoutException&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ Ollama is not running. Start it with: ollama serve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Ollama isn't running, the tool should say so — not crash with a stack trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: The Local AI Developer Stack
&lt;/h2&gt;

&lt;p&gt;The trajectory is clear. Models are getting smaller and more capable. Gemma 4 at 4B parameters today outperforms GPT-3.5 on many code tasks. By next year, we'll likely have sub-2B models that handle most developer productivity use cases.&lt;/p&gt;

&lt;p&gt;I'm working on expanding this toolkit to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Git commit message generation&lt;/strong&gt; from staged diffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generator&lt;/strong&gt; that reads code and produces API docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test case suggester&lt;/strong&gt; that analyzes functions and proposes edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All local. All open source. All free.&lt;/p&gt;
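
&lt;p&gt;To give a flavor of the first item, here's roughly what the core will look like (the prompt wording and model choice are placeholders, not final):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

import httpx

def suggest_commit_message() -&gt; str:
    """Draft a commit message from whatever is currently staged."""
    diff = subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        return "Nothing staged."
    response = httpx.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": (
                "Write a one-line conventional commit message for this diff. "
                "Output only the message.\n\n" + diff
            ),
            "options": {"temperature": 0.2},  # factual task, keep it low
            "stream": False,
        },
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()["response"].strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;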

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Every project mentioned in this post is open source and ready to run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pull a model: &lt;code&gt;ollama pull gemma3:4b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Clone any repo and follow the README&lt;/li&gt;
&lt;li&gt;Start building your own local AI tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The best developer tools are the ones you control completely. When the AI runs on your machine, you own the entire stack — model, data, and output. No vendor lock-in, no usage caps, no privacy concerns.&lt;/p&gt;

&lt;p&gt;Start local. Ship faster.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, working on semantic indexing and retrieval-augmented generation (RAG) systems. He maintains 116+ open-source repositories exploring AI, developer tools, healthcare technology, and creative applications of local LLMs.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐙 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✍️ &lt;strong&gt;Dev.to:&lt;/strong&gt; &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to/kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;linkedin.com/in/nrk-raju-guthikonda-504066a8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Stop Sending Your Security Alerts to Cloud AI — Build Local LLM Tools Instead</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:25:14 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/stop-sending-your-security-alerts-to-cloud-ai-build-local-llm-tools-instead-1dgl</link>
      <guid>https://forem.com/kennedyraju55/stop-sending-your-security-alerts-to-cloud-ai-build-local-llm-tools-instead-1dgl</guid>
      <description>&lt;p&gt;Every time a security analyst pastes a suspicious log entry into a cloud-based AI chatbot, they might be handing adversaries a roadmap. That firewall alert contains your internal IP ranges. That phishing email reveals which executives are being targeted. That threat intelligence report maps your entire attack surface.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. As a Senior Software Engineer at Microsoft working on Copilot Search Infrastructure, I spend my days thinking about how AI systems ingest, index, and retrieve sensitive data at scale. That experience taught me a foundational principle: &lt;strong&gt;the most dangerous data leak is the one disguised as a productivity tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built five open-source security AI tools — all powered by local LLMs through Ollama — that never send a single byte to the cloud. Here is why you should do the same, and how to get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Security Data Must Never Leave Your Network
&lt;/h2&gt;

&lt;p&gt;This is not theoretical paranoia. It is operational reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Compliance Exposure
&lt;/h3&gt;

&lt;p&gt;NIST SP 800-171, SOC 2, HIPAA, and GDPR all impose strict controls on where sensitive data can be processed. The moment you paste a security alert into a cloud AI service, you have potentially created a compliance violation. Many cloud AI providers reserve the right, in their terms of service, to use input data for model improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Adversarial Intelligence Leakage
&lt;/h3&gt;

&lt;p&gt;Security alerts are not just operational noise — they are intelligence. An alert about a brute-force attempt on &lt;code&gt;admin@internal-crm.yourcompany.com&lt;/code&gt; tells an attacker three things: you have a CRM system, it uses that naming convention, and it is internet-facing. Sending this to a third-party API, even an encrypted one, expands your blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Supply Chain Risk
&lt;/h3&gt;

&lt;p&gt;Cloud AI providers are themselves targets. A breach at your AI provider could expose every query ever sent — including your security telemetry. Running locally eliminates this entire attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Latency in Incident Response
&lt;/h3&gt;

&lt;p&gt;During an active incident, you cannot afford to wait for API rate limits or deal with cloud outages. Local inference means your AI triage tools work even when the network is compromised — which is exactly when you need them most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local LLM Stack: Ollama + Python
&lt;/h2&gt;

&lt;p&gt;The architecture is simpler than you might expect. Ollama provides a local REST API that is compatible with the interface patterns most developers already know. Here is the foundation every tool in my security suite shares:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Interface to local Ollama instance for security analysis.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a prompt to the local LLM. No data leaves localhost.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Verify Ollama is running before processing sensitive data.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the low temperature setting of &lt;code&gt;0.3&lt;/code&gt;. For security analysis, you want deterministic, factual responses — not creative writing. This is a deliberate architectural choice that differs from most chatbot configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Security Alert Analyzer
&lt;/h2&gt;

&lt;p&gt;Let me walk through a concrete example: triaging a cybersecurity alert. The key insight is that not everything requires an LLM. Pattern extraction (IPs, hashes, CVEs) is best handled by regex, while the LLM handles contextual analysis and summarization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;threat_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_iocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract Indicators of Compromise without an LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b(?:\d{1,3}\.){3}\d{1,3}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CVE-\d{4}-\d{4,}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;md5_hashes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-fA-F0-9]{32}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256_hashes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\b[a-fA-F0-9]{64}\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LocalLLM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full alert analysis: regex extraction + local LLM summarization.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;iocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_iocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior SOC analyst. Analyze this security alert and provide:
1. Threat severity (CRITICAL/HIGH/MEDIUM/LOW)
2. Attack type classification
3. Recommended immediate actions
4. IOC summary

Alert:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Extracted IOCs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Score based on IOC density and keyword severity
&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exploit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ransomware&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SecurityAlert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alert_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threat_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hybrid approach — deterministic extraction plus LLM analysis — gives you the reliability of pattern matching with the contextual intelligence of a language model. And everything stays on &lt;code&gt;localhost:11434&lt;/code&gt;.&lt;/p&gt;
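
&lt;p&gt;Wiring it together looks like this. The alert text is a made-up sample using documentation IP space:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;llm = LocalLLM()
if not llm.health_check():
    raise SystemExit("Ollama is not running. Start it with: ollama serve")

sample = (
    "CRITICAL: repeated brute-force logins from 203.0.113.42 against "
    "vpn.example.com, possible CVE-2024-3400 exploit attempt"
)
alert = analyze_alert(sample, llm)
print(f"Threat score: {alert.threat_score:.1f}/10")  # 9.5 for this sample
print(alert.summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;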

&lt;h2&gt;
  
  
  Five Tools, Zero Cloud Dependencies
&lt;/h2&gt;

&lt;p&gt;I have built and open-sourced a suite of security tools that follow this architecture. Each one solves a real problem I have encountered in production environments:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cybersecurity Alert Summarizer
&lt;/h3&gt;

&lt;p&gt;The flagship tool. It ingests raw security alerts, extracts IOCs (IPs, domains, hashes, CVEs), queries a local CVE database for CVSS scores, calculates weighted threat scores, and generates executive-ready summaries. The correlation engine links related alerts across multiple data sources — critical for spotting coordinated attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Ollama, Click CLI, FastAPI, Rich, Docker&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/cybersecurity-alert-summarizer" rel="noopener noreferrer"&gt;cybersecurity-alert-summarizer&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. DocShield — Privacy-First Document Analysis
&lt;/h3&gt;

&lt;p&gt;A multi-agent system using Gemma 4 that reads, explains, and audits sensitive documents. While originally built for medical documents (HIPAA compliance demands local processing), the architecture applies to any document type containing sensitive data — contracts, financial reports, legal discovery. Five specialized agents (Orchestrator, Reader, Explainer, Checker, Bill Analyzer) work in a pipeline, each with a focused responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Gemma 4, Flask, Multi-Agent Pipeline, Docker&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/docshield" rel="noopener noreferrer"&gt;docshield&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Password Strength Advisor
&lt;/h3&gt;

&lt;p&gt;Goes far beyond "must contain uppercase and special character." This tool calculates Shannon entropy with pattern penalty scoring, checks against a local breach database with leet-speak variation detection, generates NIST SP 800-63B compliant policies, and creates cryptographically secure passwords using Fisher-Yates shuffling. The LLM provides natural-language explanations of why a password is weak.&lt;/p&gt;
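
&lt;p&gt;The entropy core is simple enough to show. A simplified sketch (the actual tool layers pattern penalties and breach checks on top of this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import Counter

def shannon_entropy_bits(password: str) -&gt; float:
    """Per-character Shannon entropy times length = total bits."""
    n = len(password)
    if n == 0:
        return 0.0
    per_char = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(password).values()
    )
    return per_char * n

print(shannon_entropy_bits("password"))      # 22.0 bits: short, repeated "s"
print(shannon_entropy_bits("xK9#mQ2$vL7!"))  # ~43 bits: longer, all distinct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;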

&lt;p&gt;&lt;strong&gt;Tech:&lt;/strong&gt; Python, Ollama, Click, Streamlit, FastAPI&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/password-strength-advisor" rel="noopener noreferrer"&gt;password-strength-advisor&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Phishing Email Detector
&lt;/h3&gt;

&lt;p&gt;Analyzes email headers, body text, and embedded URLs to classify phishing attempts. The local LLM examines linguistic patterns (urgency cues, authority impersonation, grammatical anomalies) while deterministic checks handle SPF/DKIM validation and URL reputation lookups against local threat feeds. No email content ever leaves the analysis machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Threat Intelligence Summarizer
&lt;/h3&gt;

&lt;p&gt;Ingests threat intelligence reports (STIX/TAXII feeds, vendor advisories, CVE bulletins) and produces actionable summaries for different audiences — technical IOC lists for the SOC team, risk assessments for management, patch priority lists for the infrastructure team. The LLM translates dense technical reports into audience-appropriate language.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;Every tool in this suite follows the same layered design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           Input Layer (CLI / Web / API)      │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Deterministic Processing Layer          │
│  (Regex, Pattern Matching, Scoring, DB)     │
│  → No LLM needed, fast, reliable            │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Local LLM Analysis Layer                │
│  (Ollama → Gemma 4 / Llama 3.2)            │
│  → Contextual analysis, summarization       │
│  → 127.0.0.1 only, no external calls        │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│     Output Layer (Rich CLI / Streamlit)     │
│  → Formatted reports, threat dashboards     │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical design decision is the &lt;strong&gt;separation between deterministic and LLM layers&lt;/strong&gt;. Pattern extraction, scoring, and database lookups do not need an LLM and should not use one. The LLM handles what it is good at: contextual understanding, summarization, and natural-language generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started in 5 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a security-optimized model&lt;/span&gt;
ollama pull gemma4

&lt;span class="c"&gt;# Clone any tool from the suite&lt;/span&gt;
git clone https://github.com/kennedyraju55/cybersecurity-alert-summarizer.git
&lt;span class="nb"&gt;cd &lt;/span&gt;cybersecurity-alert-summarizer

&lt;span class="c"&gt;# Install and run&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cyber_alert.cli &lt;span class="nt"&gt;--alert&lt;/span&gt; alerts/sample.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For model selection, I recommend &lt;strong&gt;Gemma 4&lt;/strong&gt; for its strong reasoning capabilities and multimodal support, or &lt;strong&gt;Llama 3.2 (3B)&lt;/strong&gt; if you need faster inference on limited hardware. Both run comfortably on a machine with 16GB RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Cloud AI is transformative for many use cases. Security is not one of them. The data you are analyzing — alerts, logs, threat intel, credentials, internal network topology — is precisely the data that adversaries want. Every cloud API call is an exposure surface.&lt;/p&gt;

&lt;p&gt;Local LLMs have reached a capability threshold where they handle security analysis tasks effectively. The tools exist. The models are free. The only cost is the compute you already own.&lt;/p&gt;

&lt;p&gt;In my experience building production AI systems that process sensitive data at scale, the architecture that wins is the one that minimizes data movement. For security tooling, that means local inference, local storage, and zero external dependencies.&lt;/p&gt;

&lt;p&gt;Build local. Analyze local. Keep your security data where it belongs — on your network.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, focused on semantic indexing and retrieval-augmented generation (RAG). He maintains 116+ open-source repositories, including a suite of security AI tools powered by local LLMs. His work explores the intersection of AI, privacy, and practical security tooling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;@kennedyraju55&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dev.to: &lt;a href="https://dev.to/nrk_raju"&gt;nrk_raju&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;nrk-raju-guthikonda&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Your Financial Data Should Never Leave Your Machine — Here's How I Built 5 AI Tools That Prove It</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:18:22 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/your-financial-data-should-never-leave-your-machine-heres-how-i-built-5-ai-tools-that-prove-it-517l</link>
      <guid>https://forem.com/kennedyraju55/your-financial-data-should-never-leave-your-machine-heres-how-i-built-5-ai-tools-that-prove-it-517l</guid>
      <description>&lt;p&gt;Every day, millions of people paste their bank statements into ChatGPT. They upload invoices to cloud AI services. They feed their tax documents into web apps that promise "AI-powered analysis."&lt;/p&gt;

&lt;p&gt;And every single one of them is handing their most sensitive data — income, spending habits, debt, investments — to a third-party server they don't control.&lt;/p&gt;

&lt;p&gt;I'm a Senior Software Engineer at Microsoft, working on Copilot Search Infrastructure (semantic indexing, RAG pipelines). I build AI systems for a living. And I'm here to tell you: &lt;strong&gt;there is absolutely no reason your financial data needs to leave your machine to get intelligent analysis.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the past year, I've built 116+ open-source projects — and a significant chunk of them are financial AI tools that run entirely on your local hardware. No API keys. No cloud calls. No data exfiltration. Just you, your machine, and a local LLM.&lt;/p&gt;

&lt;p&gt;Let me show you how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Privacy Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;When you use a cloud-based AI service to analyze your finances, here's what actually happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your data travels across the internet&lt;/strong&gt; — bank statements, transaction histories, salary information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It lands on someone else's servers&lt;/strong&gt; — often in a jurisdiction you didn't choose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It may be used for training&lt;/strong&gt; — many services reserve the right to use your inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It persists in logs&lt;/strong&gt; — even "deleted" data can live in backups, caches, and audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's one breach away from exposure&lt;/strong&gt; — and financial data is the #1 target for attackers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't hypothetical. As someone who works on enterprise-grade AI infrastructure at Microsoft, I've seen firsthand how seriously large organizations take data residency and privacy. The question is: why don't individuals demand the same protections?&lt;/p&gt;

&lt;p&gt;The answer used to be "because local AI wasn't good enough." That's no longer true.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local LLM Stack: Ollama + Gemma
&lt;/h2&gt;

&lt;p&gt;The foundation of every financial tool I've built is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/strong&gt; — A local LLM runtime that makes running models as easy as &lt;code&gt;ollama pull gemma3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.google.dev/gemma" rel="noopener noreferrer"&gt;Gemma&lt;/a&gt;&lt;/strong&gt; — Google's open-weight model family, optimized for efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — Because the ecosystem for data processing is unmatched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; — For REST APIs when you want programmatic access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; — For web UIs that non-technical users can actually use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what the core integration looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalFinancialLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Interface to local Ollama instance for financial analysis.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a financial analysis prompt to the local LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;full_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a financial analyst assistant.
Analyze the following financial data and provide actionable insights.

Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Data/Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide a structured analysis with key findings and recommendations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Low temp for financial accuracy
&lt;/span&gt;                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="c1"&gt;# Usage — everything stays on your machine
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalFinancialLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Categorize these transactions and identify unusual spending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly household expenses for March 2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API key. No cloud endpoint. No terms of service that let a corporation train on your bank statements. The model runs on your hardware, processes your data in your RAM, and the results never leave your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Tools, Zero Cloud Dependencies
&lt;/h2&gt;

&lt;p&gt;Let me walk you through the financial AI tools I've built and open-sourced. Every single one follows the same architecture: local LLM, local data, local results.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Household Budget Analyzer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/household-budget-analyzer" rel="noopener noreferrer"&gt;kennedyraju55/household-budget-analyzer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the tool I use for my own family's finances. Feed it a CSV of transactions and it gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-powered spending analysis&lt;/strong&gt; with savings recommendations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-categorization&lt;/strong&gt; — maps "Whole Foods" to Groceries, "AT&amp;amp;T" to Utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget vs. actual comparison&lt;/strong&gt; — shows exactly where you're over or under&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recurring expense detection&lt;/strong&gt; — catches hidden subscriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings goal tracking&lt;/strong&gt; with estimated completion dates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly trend analysis&lt;/strong&gt; — visualizes spending patterns over time
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;budget_analyzer.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_expenses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_category_breakdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;detect_recurring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SavingsGoal&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load your private financial data — stays 100% local
&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_expenses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/expenses.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_category_breakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Detect subscriptions you forgot about
&lt;/span&gt;&lt;span class="n"&gt;recurring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_recurring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expenses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recurring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📌 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Track savings with AI-powered projections
&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SavingsGoal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Emergency Fund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;monthly_contribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🎯 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_progress&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📅 Estimated completion: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Financial Report Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/financial-report-generator" rel="noopener noreferrer"&gt;kennedyraju55/financial-report-generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generates professional financial reports from raw data — the kind of output you'd expect from an analyst, but produced entirely on your laptop. Income statements, expense breakdowns, cash flow analysis, and forward-looking projections. All powered by Gemma running locally through Ollama.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Invoice Extractor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/invoice-extractor" rel="noopener noreferrer"&gt;kennedyraju55/invoice-extractor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Drop in an invoice (text, PDF content, or structured data) and the local LLM extracts vendor name, amounts, line items, dates, and tax information into clean structured JSON. Perfect for small business owners who process dozens of invoices monthly and don't want their vendor relationships and pricing sitting on someone else's server.&lt;/p&gt;
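
&lt;p&gt;The pattern underneath is straightforward: ask for JSON, then parse it. A minimal sketch (not the repo's exact prompt; a production version also needs a retry path for when the model wraps the JSON in prose):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import requests

EXTRACT_PROMPT = """Extract vendor, invoice_date, total, tax, and line_items
from the invoice below. Return ONLY valid JSON, with no commentary.

Invoice:
{invoice_text}
"""

def extract_invoice(invoice_text: str) -&gt; dict:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3",
            "prompt": EXTRACT_PROMPT.format(invoice_text=invoice_text),
            "options": {"temperature": 0.1},  # extraction wants determinism
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return json.loads(response.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;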

&lt;h3&gt;
  
  
  4. Sentiment Analysis Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;kennedyraju55/sentiment-analysis-dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While not strictly a "finance" tool, this is invaluable for market analysis. Feed it earnings call transcripts, financial news, or analyst reports and get sentiment scoring with explanations. I built this with a Streamlit dashboard so you can visualize sentiment trends over time — useful for anyone doing their own investment research.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Stock Report Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kennedyraju55/stock-report-generator" rel="noopener noreferrer"&gt;kennedyraju55/stock-report-generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-powered technical analysis and risk assessment for stocks. It generates structured reports covering price analysis, risk factors, and market context. Again — your investment research stays private. No cloud service knows which stocks you're analyzing or what positions you're considering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;Every tool follows the same proven architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│                   Your Machine                    │
│                                                   │
│  ┌─────────────┐    ┌──────────────────────────┐ │
│  │  Streamlit   │    │     FastAPI REST API     │ │
│  │  Web UI      │    │     (localhost:8000)     │ │
│  │  (:8501)     │    │                          │ │
│  └──────┬───────┘    └────────────┬─────────────┘ │
│         │                         │               │
│         ▼                         ▼               │
│  ┌────────────────────────────────────────────┐   │
│  │          Core Analysis Engine               │   │
│  │   (Python: pandas, data processing)         │   │
│  └─────────────────────┬──────────────────────┘   │
│                        │                          │
│                        ▼                          │
│  ┌────────────────────────────────────────────┐   │
│  │         Ollama (localhost:11434)            │   │
│  │         Running Gemma 3/4 Model            │   │
│  │         ~4-8GB RAM                         │   │
│  └────────────────────────────────────────────┘   │
│                                                   │
│  ┌────────────────────────────────────────────┐   │
│  │    Local Storage (CSV/JSON)                │   │
│  │    Your data. Your machine. Your rules.    │   │
│  └────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘

        ❌ NOTHING crosses this boundary ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No outbound network calls&lt;/strong&gt; — the entire stack runs on localhost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-driven&lt;/strong&gt; — YAML files control model selection, temperature, and categories (see the sample config after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual interface&lt;/strong&gt; — CLI for power users, Streamlit for everyone else&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-ready&lt;/strong&gt; — &lt;code&gt;docker compose up&lt;/code&gt; and you're running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tested&lt;/strong&gt; — pytest suites with 80%+ coverage on core logic&lt;/li&gt;
&lt;/ul&gt;
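
&lt;p&gt;"Config-driven" in practice means a small YAML file per tool. The exact keys vary by project; a representative shape (illustrative, not any one repo's actual file) and how it's loaded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

# Representative config.yaml contents (keys are illustrative)
CONFIG = yaml.safe_load("""
model: gemma3
temperature: 0.3
categories:
  - groceries
  - utilities
  - transport
output_format: json
""")

print(CONFIG["model"])  # gemma3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;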

&lt;h2&gt;
  
  
  What You Need to Get Started
&lt;/h2&gt;

&lt;p&gt;The hardware requirements are surprisingly modest:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;10 GB (for models)&lt;/td&gt;
&lt;td&gt;20 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;Any modern x86/ARM&lt;/td&gt;
&lt;td&gt;Apple Silicon / GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Not required&lt;/td&gt;
&lt;td&gt;NVIDIA for speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Linux, macOS, Windows&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting started takes five minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# 2. Pull a model&lt;/span&gt;
ollama pull gemma3

&lt;span class="c"&gt;# 3. Clone any of the tools&lt;/span&gt;
git clone https://github.com/kennedyraju55/household-budget-analyzer.git
&lt;span class="nb"&gt;cd &lt;/span&gt;household-budget-analyzer

&lt;span class="c"&gt;# 4. Set up Python environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# 5. Run it&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; budget_analyzer.cli analyze &lt;span class="nt"&gt;--file&lt;/span&gt; expenses.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Matters Beyond Privacy
&lt;/h2&gt;

&lt;p&gt;Privacy is the headline, but it's not the only benefit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost: $0 forever.&lt;/strong&gt; No monthly API fees. No per-token charges. No surprise bills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed: No network latency.&lt;/strong&gt; Results come back in seconds, not after a round-trip to Virginia.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability: No outages.&lt;/strong&gt; Your tools work on an airplane, in a cabin with no WiFi, during a cloud provider's bad day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control: Swap models freely.&lt;/strong&gt; Gemma too small? Try Llama 3. Want something lighter? Use Gemma 2B. No vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance: GDPR/HIPAA-friendly by default.&lt;/strong&gt; If data never leaves your machine, you eliminate an entire class of data-transfer and third-party-processor risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;I've now published 116+ open-source repositories, spanning healthcare AI, education tools, developer utilities, creative AI, and yes — financial tools. They all share the same conviction: &lt;strong&gt;powerful AI doesn't require surrendering your data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Working on Copilot Search Infrastructure at Microsoft has given me a deep appreciation for what's possible with semantic indexing and RAG at scale. But it's also shown me that the most impactful AI isn't always the biggest model or the most expensive infrastructure. Sometimes it's a 4-billion-parameter model running on your laptop, analyzing your family's budget without telling anyone about it.&lt;/p&gt;

&lt;p&gt;The tools are open source. The models are open weight. The only thing standing between you and private financial AI is a &lt;code&gt;git clone&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;All projects are MIT-licensed and ready to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🏠 &lt;a href="https://github.com/kennedyraju55/household-budget-analyzer" rel="noopener noreferrer"&gt;household-budget-analyzer&lt;/a&gt; — Budget tracking &amp;amp; spending analysis&lt;/li&gt;
&lt;li&gt;📊 &lt;a href="https://github.com/kennedyraju55/financial-report-generator" rel="noopener noreferrer"&gt;financial-report-generator&lt;/a&gt; — Professional report generation&lt;/li&gt;
&lt;li&gt;📄 &lt;a href="https://github.com/kennedyraju55/invoice-extractor" rel="noopener noreferrer"&gt;invoice-extractor&lt;/a&gt; — Structured data extraction from invoices&lt;/li&gt;
&lt;li&gt;📈 &lt;a href="https://github.com/kennedyraju55/sentiment-analysis-dashboard" rel="noopener noreferrer"&gt;sentiment-analysis-dashboard&lt;/a&gt; — Market sentiment scoring&lt;/li&gt;
&lt;li&gt;📉 &lt;a href="https://github.com/kennedyraju55/stock-report-generator" rel="noopener noreferrer"&gt;stock-report-generator&lt;/a&gt; — Technical analysis &amp;amp; risk assessment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star them, fork them, make them better. And stop sending your bank statements to the cloud.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published by Nrk Raju Guthikonda — Senior Software Engineer at Microsoft (Copilot Search Infrastructure). Builder of 116+ open-source AI tools. Passionate about private, local-first AI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt; · &lt;a href="https://www.linkedin.com/in/nrk-raju-guthikonda-504066a8/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>jellyfin</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Semantic Search at Scale: What I Learned Building RAG Infrastructure at Microsoft Copilot</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:13:16 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/semantic-search-at-scale-what-i-learned-building-rag-infrastructure-at-microsoft-copilot-3d0h</link>
      <guid>https://forem.com/kennedyraju55/semantic-search-at-scale-what-i-learned-building-rag-infrastructure-at-microsoft-copilot-3d0h</guid>
      <description>&lt;p&gt;I work on Microsoft Copilot's Search Infrastructure team, where I focus on semantic indexing and RAG (Retrieval-Augmented Generation). The challenges of building search at scale are fundamentally different from what you encounter in tutorials. Here's what building production RAG taught me — and how I applied those lessons to my open-source projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tutorial vs. Production Gap
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials show this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Tutorial RAG
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_in_vector_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# At query time
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for a demo. It fails spectacularly in production because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunking strategy matters enormously&lt;/strong&gt; — naive splitting breaks mid-sentence, mid-paragraph, mid-concept&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding quality varies by domain&lt;/strong&gt; — a model trained on web text performs poorly on legal contracts or medical records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-k retrieval isn't enough&lt;/strong&gt; — you need re-ranking, filtering, and relevance scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window management&lt;/strong&gt; — stuffing 5 chunks into a prompt wastes tokens on irrelevant content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt; — documents update, and your index needs to stay current without full re-indexing&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lesson 1: Chunking Is an Art, Not a Split
&lt;/h2&gt;

&lt;p&gt;The biggest mistake in RAG is treating chunking as &lt;code&gt;text.split(max_length)&lt;/code&gt;. Good chunking preserves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic boundaries&lt;/strong&gt; — paragraphs, sections, logical units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — each chunk should be understandable in isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overlap&lt;/strong&gt; — some repetition between chunks prevents information loss at boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; — source document, section header, page number
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_split_by_headers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_token_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_merge_paragraphs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_merge_paragraphs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_token_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;merged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
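
&lt;p&gt;For the sketch above to run, it needs the two helpers it assumes. Minimal stand-ins look like this — a whitespace token estimate and a Markdown-style header split; a production version would use a real tokenizer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

class SemanticChunker:
    # ... chunk() and _merge_paragraphs() as shown above ...

    def _token_count(self, text):
        # Crude estimate: whitespace-separated words. Swap in a real
        # tokenizer (e.g. tiktoken) for accurate token budgeting.
        return len(text.split())

    def _split_by_headers(self, document):
        # Split on Markdown-style headers, keeping each header with its body.
        parts = re.split(r"(?m)^(?=#{1,6} )", document)
        return [p for p in parts if p.strip()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;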



&lt;p&gt;In my experience building production systems, domain-specific chunking strategies consistently outperform generic ones on retrieval relevance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: Re-ranking Changes Everything
&lt;/h2&gt;

&lt;p&gt;Vector similarity search returns "similar" results. Similar isn't the same as "relevant." A re-ranker bridges this gap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Phase 1: Broad retrieval (cast a wide net)
&lt;/span&gt;    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 2: Re-rank with cross-encoder
&lt;/span&gt;    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 3: Return top-k after re-ranking
&lt;/span&gt;    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve 3x what you need, re-rank, then take the top results. This consistently improves answer quality by 15-25% over pure vector search.&lt;/p&gt;
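
&lt;p&gt;The &lt;code&gt;self.reranker&lt;/code&gt; above can be an off-the-shelf cross-encoder. A sketch using the sentence-transformers library — the model name is one reasonable default, not a requirement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # A cross-encoder reads query and candidate together, so it scores
        # actual relevance rather than embedding-space proximity.
        self.model = CrossEncoder(model_name)

    def score(self, query, text):
        return float(self.model.predict([(query, text)])[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;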

&lt;h2&gt;
  
  
  Lesson 3: Hybrid Search Beats Pure Semantic
&lt;/h2&gt;

&lt;p&gt;Pure embedding-based search misses exact matches. If a user searches for "error code E4012", semantic search might return results about "error handling" instead of the specific error code.&lt;/p&gt;

&lt;p&gt;The solution is hybrid search: combine semantic similarity with keyword/BM25 matching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;semantic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;keyword_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production RAG systems, hybrid search with Reciprocal Rank Fusion consistently outperforms either retrieval method on its own.&lt;/p&gt;
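
&lt;p&gt;The &lt;code&gt;bm25_search&lt;/code&gt; half doesn't need a search server — at this scale an in-memory index is enough. A sketch with the rank-bm25 package (whitespace tokenization here is a simplification you'd refine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi

class KeywordIndex:
    def __init__(self, docs):
        # docs: objects with .id and .text attributes
        self.docs = docs
        self.bm25 = BM25Okapi([d.text.lower().split() for d in docs])

    def search(self, query, top_k=15):
        scores = self.bm25.get_scores(query.lower().split())
        ranked = sorted(zip(self.docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;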

&lt;h2&gt;
  
  
  Lesson 4: Evaluation Is Non-Negotiable
&lt;/h2&gt;

&lt;p&gt;You can't improve what you can't measure. Every RAG system needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval metrics&lt;/strong&gt;: Recall@K, MRR, NDCG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation metrics&lt;/strong&gt;: Faithfulness, relevance, completeness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end metrics&lt;/strong&gt;: User satisfaction, task completion rate
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_retrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;recall_at_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;mrr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_docs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Recall@5
&lt;/span&gt;        &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# MRR
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recall@5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recall_at_5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mrr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
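
&lt;p&gt;Usage is just a labeled test set — even fifty hand-labeled query/document pairs will surface regressions. The IDs and numbers below are made up for illustration, and &lt;code&gt;engine&lt;/code&gt; stands in for whatever class exposes your &lt;code&gt;search()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;test_set = [
    ("how do i rotate api keys", ["doc-031", "doc-094"]),
    ("error code E4012", ["doc-377"]),
]

metrics = engine.evaluate_retrieval(test_set)
print(metrics)  # e.g. {'recall@5': 0.75, 'mrr': 0.66}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;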



&lt;h2&gt;
  
  
  Applying These Lessons to Open Source
&lt;/h2&gt;

&lt;p&gt;I've applied lessons learned from working on production-scale search to my open-source projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pdf-chat-assistant&lt;/strong&gt; — RAG over PDF documents with semantic chunking and re-ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;personal-knowledge-base&lt;/strong&gt; — Local RAG over your personal documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;study-buddy-bot&lt;/strong&gt; — RAG over textbook content for educational Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is the same. The scale is different. But the patterns transfer perfectly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip chunking strategy&lt;/strong&gt; — it's the most impactful optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always re-rank&lt;/strong&gt; — the cost is minimal, the quality improvement is significant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use hybrid search&lt;/strong&gt; — semantic + keyword catches what pure semantic misses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure everything&lt;/strong&gt; — build evaluation into your pipeline from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start local&lt;/strong&gt; — you can build and test great RAG systems on a single machine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full collection of RAG-powered tools is on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The patterns are open source. Build better search.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he builds semantic indexing and RAG systems at scale. He maintains 116+ open-source repositories. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>todayisearched</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Developer's Guide to Running LLMs Locally: Ollama, Gemma 4, and Why Your Side Projects Don't Need an API Key</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:09:13 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/the-developers-guide-to-running-llms-locally-ollama-gemma-4-and-why-your-side-projects-dont-54oe</link>
      <guid>https://forem.com/kennedyraju55/the-developers-guide-to-running-llms-locally-ollama-gemma-4-and-why-your-side-projects-dont-54oe</guid>
      <description>&lt;p&gt;Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call?&lt;/p&gt;

&lt;p&gt;I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local LLMs?
&lt;/h2&gt;

&lt;p&gt;Before diving into the how, let's talk about why:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Zero Cost Per Request
&lt;/h3&gt;

&lt;p&gt;Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Rate Limits
&lt;/h3&gt;

&lt;p&gt;I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Privacy by Default
&lt;/h3&gt;

&lt;p&gt;No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Offline Capability
&lt;/h3&gt;

&lt;p&gt;Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Reproducibility
&lt;/h3&gt;

&lt;p&gt;Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: 5 Minutes to Your First Local LLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install Ollama
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Windows&lt;/span&gt;
&lt;span class="c"&gt;# Download from https://ollama.com/download&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Pull Gemma 4
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads the model (~5GB). One-time cost, then it's on your machine forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Test It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4 &lt;span class="s2"&gt;"Explain quantum computing in one paragraph"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. You now have a local LLM running on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Applications with Python + Ollama
&lt;/h2&gt;

&lt;p&gt;Here's a minimal Python application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# That's literally it
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the SOLID principles in software engineering?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding Structure: The Pattern I Use in 90+ Projects
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.&lt;/p&gt;
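
&lt;p&gt;A domain tool is typically just a thin subclass. A hypothetical example of the shape — not code from a specific repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class CodeReviewer(LocalLLMApp):
    SYSTEM = "You are a meticulous code reviewer. Be specific and concise."

    def review(self, diff):
        return self.generate(
            f"Review this diff and list issues by severity:\n\n{diff}",
            temperature=0.2,
            system=self.SYSTEM,
        )

reviewer = CodeReviewer()
print(reviewer.review(open("change.patch").read()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;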

&lt;h3&gt;
  
  
  Adding a Web Interface: Streamlit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Local AI Tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_area&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One import and a dozen lines on top of the base class — a full web interface for your local AI tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding an API: FastAPI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalLLMApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;

&lt;span class="nd"&gt;@api.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a REST API that any frontend, mobile app, or service can call — all running locally.&lt;/p&gt;
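
&lt;p&gt;Assuming the file is &lt;code&gt;main.py&lt;/code&gt;, start it with &lt;code&gt;uvicorn main:api --port 8000&lt;/code&gt;; then any client can call it — for example, from Python with requests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

r = requests.post(
    "http://localhost:8000/analyze",
    json={"text": "Summarize why local inference has no per-token cost.",
          "temperature": 0.2},
)
print(r.json()["result"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;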

&lt;h2&gt;
  
  
  Docker: One-Command Deployment
&lt;/h2&gt;

&lt;p&gt;Every project I build ships with this &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-data:/root/.ollama&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8501:8501"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_HOST=http://ollama:11434&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker compose up&lt;/code&gt; — that's the entire deployment story. Works on any machine with Docker and a GPU.&lt;/p&gt;
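
&lt;p&gt;The only glue the app itself needs is to respect the &lt;code&gt;OLLAMA_HOST&lt;/code&gt; variable that the compose file sets. A minimal sketch — inside the container it resolves to the &lt;code&gt;ollama&lt;/code&gt; service, and outside Docker it falls back to a locally installed Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import ollama

# OLLAMA_HOST is injected by docker-compose; default to localhost otherwise
client = ollama.Client(host=os.environ.get("OLLAMA_HOST", "http://localhost:11434"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;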

&lt;h2&gt;
  
  
  Performance: What to Expect
&lt;/h2&gt;

&lt;p&gt;On consumer hardware (RTX 3080, 16GB RAM):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple Q&amp;amp;A&lt;/strong&gt;: 0.5-1 second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paragraph generation&lt;/strong&gt;: 2-5 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document analysis (2-3 pages)&lt;/strong&gt;: 5-15 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form generation (1000+ words)&lt;/strong&gt;: 15-30 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are practical, usable response times for interactive applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Cloud vs. Local
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prototyping&lt;/td&gt;
&lt;td&gt;✅ Zero cost&lt;/td&gt;
&lt;td&gt;❌ Token costs add up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive data&lt;/td&gt;
&lt;td&gt;✅ Privacy by default&lt;/td&gt;
&lt;td&gt;❌ Requires BAA/DPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production (small scale)&lt;/td&gt;
&lt;td&gt;✅ Fixed hardware cost&lt;/td&gt;
&lt;td&gt;✅ Easy to scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production (large scale)&lt;/td&gt;
&lt;td&gt;❌ Hardware limits&lt;/td&gt;
&lt;td&gt;✅ Elastic scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline/air-gapped&lt;/td&gt;
&lt;td&gt;✅ Works anywhere&lt;/td&gt;
&lt;td&gt;❌ Requires internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cutting-edge capability&lt;/td&gt;
&lt;td&gt;❌ Smaller models&lt;/td&gt;
&lt;td&gt;✅ Latest models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  90+ Projects and Counting
&lt;/h2&gt;

&lt;p&gt;I've applied this pattern across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Patient intake, lab results, EHR de-identification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt;: Contract analysis, brief generation, compliance checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Education&lt;/strong&gt;: Study bots, exam generators, flashcard creators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative&lt;/strong&gt;: Story generators, poetry engines, mood journals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Tools&lt;/strong&gt;: Code review, API docs, performance profiling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: Budget analyzers, financial report summarizers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Vulnerability scanners, alert summarizers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.&lt;/p&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start building locally. Your AI projects don't need an API key.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>sideprojects</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Creative AI Without the Cloud: Building Story Generators, Poetry Engines, and More with Local LLMs</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:06:39 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/creative-ai-without-the-cloud-building-story-generators-poetry-engines-and-more-with-local-llms-cfa</link>
      <guid>https://forem.com/kennedyraju55/creative-ai-without-the-cloud-building-story-generators-poetry-engines-and-more-with-local-llms-cfa</guid>
      <description>&lt;p&gt;There's a common misconception that creative AI — story generation, poetry, songwriting, art descriptions — requires massive cloud models. GPT-4, Claude, Gemini — the bigger the better, right?&lt;/p&gt;

&lt;p&gt;Not necessarily. I've built seven creative AI tools that run entirely on local hardware using Gemma 4 via Ollama. They generate compelling stories, poems, song lyrics, and creative content without sending a single byte to the cloud. Here's what I learned about making local LLMs creative.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local Creative AI Matters
&lt;/h2&gt;

&lt;p&gt;Creative writing involves personal expression. When you use a cloud-based AI to help with creative work, you're sharing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your creative ideas&lt;/strong&gt; — which could be used for training data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your writing style&lt;/strong&gt; — which becomes part of the model's knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your intellectual property&lt;/strong&gt; — stories, poems, and lyrics you co-create&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For professional writers, this is a real concern. If your unpublished novel's plot gets fed into training data, who owns that idea?&lt;/p&gt;

&lt;p&gt;Local LLMs eliminate this entirely. Your creative work stays on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Creative AI Suite
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Story Generator
&lt;/h3&gt;

&lt;p&gt;The story generator creates narrative fiction from prompts, with control over genre, tone, length, and style.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryGenerator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;premise&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;literary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pov&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; story in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pov&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; person with a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; style.

Premise: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;premise&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Requirements:
- Strong opening hook
- Vivid sensory details
- Character development
- Satisfying resolution
- Show, don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t tell
- Natural dialogue (if applicable)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the temperature: &lt;strong&gt;0.8&lt;/strong&gt;. For creative writing, we want diversity and surprise. This is the opposite of clinical or legal applications where we used 0.1-0.2. Creative AI needs to take risks.&lt;/p&gt;
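
&lt;p&gt;Usage is a one-liner. A quick sketch with a hypothetical premise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;gen = StoryGenerator()

# The premise here is just an example; any one-sentence hook works
story = gen.generate(
    premise="A lighthouse keeper discovers the fog is alive",
    genre="gothic horror",
    length="short",
)
print(story)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;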

&lt;h3&gt;
  
  
  2. Poetry Engine
&lt;/h3&gt;

&lt;p&gt;The poetry engine handles multiple forms: sonnets, haiku, free verse, limericks, villanelles, and more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;POETRY_FORMS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 lines, iambic pentameter, ABAB CDCD EFEF GG rhyme scheme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 lines: 5-7-5 syllables, nature imagery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free_verse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No fixed meter or rhyme, but with rhythm and imagery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limerick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5 lines, AABBA rhyme, humorous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;villanelle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;19 lines, 5 tercets + 1 quatrain, two refrains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contemplative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;form_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;POETRY_FORMS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;free form poetry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compose a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; poem about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; mood.

Form requirements: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;form_rules&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Write with:
- Vivid imagery and metaphor
- Emotional resonance
- Precise word choice
- Natural rhythm even within formal constraints&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# poetry sits at the top of the temperature spectrum&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Song Lyric Writer
&lt;/h3&gt;

&lt;p&gt;Generates lyrics with verse-chorus structure, including chord suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Family Story Creator
&lt;/h3&gt;

&lt;p&gt;A unique tool that generates personalized stories for children using family members as characters. Parents input names and traits, and the AI creates bedtime stories featuring their actual family.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Creative Writing Coach
&lt;/h3&gt;

&lt;p&gt;Analyzes drafts and provides constructive feedback on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pacing and structure&lt;/li&gt;
&lt;li&gt;Character voice consistency&lt;/li&gt;
&lt;li&gt;Show vs. tell balance&lt;/li&gt;
&lt;li&gt;Dialogue naturalness&lt;/li&gt;
&lt;li&gt;Opening hook strength&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Temperature Spectrum
&lt;/h2&gt;

&lt;p&gt;The most important lesson from building both clinical and creative AI tools is understanding the temperature parameter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Application&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clinical summarization&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Accuracy over creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal analysis&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;Reasoning with minimal hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;Correct syntax, some flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Educational content&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Balanced: accurate but engaging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business writing&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;Professional but not robotic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative fiction&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;Surprising, expressive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poetry/experimental&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;Maximum creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This spectrum emerged from building 90+ tools across every domain. It's not in any textbook — it's practical knowledge from thousands of generations across different use cases.&lt;/p&gt;
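
&lt;p&gt;If you want to encode the spectrum directly in code, a simple lookup with a conservative default works well. This is a sketch — the names are mine, not a library API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Practical temperature defaults by task family, from the table above
TEMPERATURE_BY_TASK = {
    "clinical_summary": 0.1,
    "legal_analysis": 0.2,
    "code_generation": 0.3,
    "educational": 0.5,
    "business_writing": 0.6,
    "creative_fiction": 0.8,
    "poetry": 0.9,
}

def options_for(task):
    # Unknown tasks get a conservative default rather than a creative one
    return {"temperature": TEMPERATURE_BY_TASK.get(task, 0.3)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;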

&lt;h2&gt;
  
  
  Running on Consumer Hardware
&lt;/h2&gt;

&lt;p&gt;All seven creative tools run on a single machine with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 3080 or equivalent (8GB+ VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM&lt;/strong&gt;: 16GB system RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: 10GB for the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generation times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short story (500 words)&lt;/strong&gt;: 5-10 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poem&lt;/strong&gt;: 2-4 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Song lyrics&lt;/strong&gt;: 4-8 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Story feedback&lt;/strong&gt;: 8-15 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast enough for interactive creative sessions where you generate, read, regenerate, and iterate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Creative Tools
&lt;/h2&gt;

&lt;p&gt;All tools are available on GitHub:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/family-story-creator" rel="noopener noreferrer"&gt;family-story-creator&lt;/a&gt; — Personalized family stories&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/mood-journal-bot" rel="noopener noreferrer"&gt;mood-journal-bot&lt;/a&gt; — AI-powered mood journaling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/standup-generator" rel="noopener noreferrer"&gt;standup-generator&lt;/a&gt; — Creative standup comedy bits&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/diary-journal-organizer" rel="noopener noreferrer"&gt;diary-journal-organizer&lt;/a&gt; — Smart journal organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Creative AI doesn't need to be a cloud service. With local LLMs, your creative work stays private, runs fast, and costs nothing per generation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories spanning healthcare, legal, education, creative AI, and developer tools. Find his work on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>showdev</category>
    </item>
    <item>
      <title>AI Tutoring That Works Offline: Building Education Tools That Don't Need the Internet</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:03:06 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/ai-tutoring-that-works-offline-building-education-tools-that-dont-need-the-internet-cd7</link>
      <guid>https://forem.com/kennedyraju55/ai-tutoring-that-works-offline-building-education-tools-that-dont-need-the-internet-cd7</guid>
      <description>&lt;p&gt;The digital divide in education isn't just about having a device — it's about having reliable internet. According to the FCC, over 14 million students in the US lack adequate home internet access. Cloud-based AI tutoring tools are useless to them.&lt;/p&gt;

&lt;p&gt;I built a suite of education AI tools that run entirely offline using local LLMs. No internet required after initial setup. No student data leaving the device. No subscription costs per student.&lt;/p&gt;

&lt;p&gt;Here's how local AI can democratize education technology.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Connectivity Problem
&lt;/h2&gt;

&lt;p&gt;Most "AI in education" products assume always-on internet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT needs cloud access for every interaction&lt;/li&gt;
&lt;li&gt;Khan Academy's AI tutor requires constant connectivity
&lt;/li&gt;
&lt;li&gt;Google's learning tools depend on Google Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a two-tier education system: students with fast internet get AI-powered learning, and everyone else gets left behind. Rural schools, developing countries, and low-income households — the communities that need educational technology most — can't use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Education AI That Runs Locally
&lt;/h2&gt;

&lt;p&gt;My education tools use Gemma 4 via Ollama to provide:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Study Buddy Bot
&lt;/h3&gt;

&lt;p&gt;An interactive study companion that helps students work through problems step-by-step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StudyBuddy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;  &lt;span class="c1"&gt;# elementary, middle, high, college
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;help_with_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a patient &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-level &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tutor.

A student asks for help with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Rules:
1. Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t give the answer directly
2. Guide them through the reasoning step by step
3. Ask questions to check understanding
4. Use analogies appropriate for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; level
5. Encourage when they make progress
6. Correct misconceptions gently

Provide the next step of guidance.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Minimal sketch of the elided helper: one call to the local model&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choice: &lt;strong&gt;never give the answer directly&lt;/strong&gt;. The Socratic method works better for learning, and it's also safer — if the LLM makes a mistake in the final answer, the student learns the wrong thing. But if the LLM asks guiding questions, the student develops their own understanding.&lt;/p&gt;
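
&lt;p&gt;Wiring it up takes two lines — the subject and problem below are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;buddy = StudyBuddy(subject="algebra", level="middle")

# Returns a guiding question or hint, never the final answer
hint = buddy.help_with_problem("Solve for x: 3x + 7 = 22")
print(hint)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;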

&lt;h3&gt;
  
  
  2. Exam Generator
&lt;/h3&gt;

&lt;p&gt;Creates practice exams calibrated to curriculum standards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_exam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                  &lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiple_choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;essay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; difficulty exam on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.

Requirements:
- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; questions total
- Mix of types: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
- Multiple choice: 4 options each, one correct
- Include an answer key with explanations
- Align with standard curriculum for this topic
- Vary Bloom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s taxonomy levels (remember, understand, apply, analyze)

Output as structured JSON.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# format="json" nudges the model toward parseable output&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Teachers can generate unlimited practice exams without internet. Each generation is unique, reducing cheating on take-home practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Reading List Manager
&lt;/h3&gt;

&lt;p&gt;Analyzes reading materials and suggests study paths.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts key concepts from textbook chapters&lt;/li&gt;
&lt;li&gt;Identifies prerequisite knowledge gaps&lt;/li&gt;
&lt;li&gt;Suggests reading order for optimal learning&lt;/li&gt;
&lt;li&gt;Generates chapter summaries and study guides&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Flashcard Generator
&lt;/h3&gt;

&lt;p&gt;Transforms any text into spaced-repetition flashcards.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts key terms and definitions&lt;/li&gt;
&lt;li&gt;Creates question-answer pairs from content&lt;/li&gt;
&lt;li&gt;Generates "explain this concept" cards for deeper understanding&lt;/li&gt;
&lt;li&gt;Exports to Anki-compatible format (see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
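
&lt;p&gt;The Anki export is the simplest part: Anki imports plain tab-separated text, so no special library is needed. A minimal sketch — &lt;code&gt;cards&lt;/code&gt; stands in for whatever question-answer pairs the LLM step produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

def export_anki(cards, path="flashcards.txt"):
    # cards: iterable of (front, back) pairs from the generation step;
    # Anki's importer accepts tab-separated text files directly
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for front, back in cards:
            writer.writerow([front, back])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;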

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The total cost to deploy these tools is a one-time hardware purchase — a laptop with a decent GPU. After downloading the model once (with internet), everything runs offline indefinitely.&lt;/p&gt;

&lt;p&gt;Compare this to per-student SaaS subscriptions that cost schools thousands annually, require constant internet, and send student data to cloud servers (a FERPA concern).&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment for Schools
&lt;/h2&gt;

&lt;p&gt;Every tool ships with Docker for easy deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One command to start the entire education AI suite&lt;/span&gt;
docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A school IT administrator can set up a single server that runs all the tools. Students connect via the local network — no internet needed. Student interactions stay on the school's own hardware.&lt;/p&gt;

&lt;p&gt;For individual students, the tools run on any laptop with 8GB+ RAM. Not gaming-PC specs — just a reasonable modern laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum alignment&lt;/strong&gt; — mapping generated content to Common Core and state standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt; — Gemma 4 supports multiple languages, enabling tutoring in students' native languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teacher dashboards&lt;/strong&gt; — aggregated (anonymized) analytics showing class-wide concept gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — screen reader support and simplified interfaces for students with disabilities&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tools are open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/study-buddy-bot" rel="noopener noreferrer"&gt;study-buddy-bot&lt;/a&gt; — Interactive AI study companion&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/exam-generator" rel="noopener noreferrer"&gt;exam-generator&lt;/a&gt; — Practice exam generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/reading-list-manager" rel="noopener noreferrer"&gt;reading-list-manager&lt;/a&gt; — Smart reading path suggestions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/flashcard-generator" rel="noopener noreferrer"&gt;flashcard-generator&lt;/a&gt; — Automated flashcard creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Education AI should work for every student, not just those with fast internet and school budgets for SaaS subscriptions. Local LLMs make that possible today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He builds privacy-first AI tools across healthcare, legal, education, and enterprise domains. Explore his 116+ open-source repositories on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>From Side Projects to 116 Repositories: How I Built an Open-Source AI Portfolio While Working Full-Time at Microsoft</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 23:01:36 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/from-side-projects-to-116-repositories-how-i-built-an-open-source-ai-portfolio-while-working-1c8f</link>
      <guid>https://forem.com/kennedyraju55/from-side-projects-to-116-repositories-how-i-built-an-open-source-ai-portfolio-while-working-1c8f</guid>
      <description>&lt;p&gt;Two years ago, I had a handful of GitHub repositories — mostly experimental scripts and weekend hacks. Today, I maintain 116 original repositories spanning healthcare AI, legal tech, developer tools, creative AI, education, finance, and security.&lt;/p&gt;

&lt;p&gt;Every single one is original work. Zero forks. All built with a consistent philosophy: AI should run locally, respect privacy, and solve real problems.&lt;/p&gt;

&lt;p&gt;Here's what I learned building this portfolio while working full-time as a Senior Software Engineer on Microsoft's Copilot Search Infrastructure team.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 90-Local-LLM Rule
&lt;/h2&gt;

&lt;p&gt;Early on, I made a decision that shaped everything: every AI project would run locally. No cloud API keys. No data transmission. No per-token costs.&lt;/p&gt;

&lt;p&gt;This wasn't just a technical preference — it was a product thesis. I believe the future of AI isn't centralized cloud APIs but distributed local inference. And I wanted to prove it was practical by building 90+ working applications across every domain I could think of.&lt;/p&gt;

&lt;p&gt;The stack is consistent across all projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; (or earlier Gemma models) via &lt;strong&gt;Ollama&lt;/strong&gt; for inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for core logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for API layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for user interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This consistency means each project builds on patterns from previous ones. The tenth healthcare tool took a fraction of the time of the first because the architecture was battle-tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking Domains That Matter
&lt;/h2&gt;

&lt;p&gt;I didn't build 116 "todo app with AI" variations. Each project targets a real problem in a specific domain:&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare (15+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Patient intake summarizers that keep PHI on-premise&lt;/li&gt;
&lt;li&gt;Lab result interpreters with clinical context&lt;/li&gt;
&lt;li&gt;EHR de-identification tools&lt;/li&gt;
&lt;li&gt;Medical document assistants&lt;/li&gt;
&lt;li&gt;Mental health check-in tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The healthcare tools are built around a single principle: &lt;strong&gt;no patient data should ever leave the hospital's network&lt;/strong&gt;. Every one runs entirely offline after initial model download.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legal Tech (8+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Contract clause analyzers&lt;/li&gt;
&lt;li&gt;Legal brief generators&lt;/li&gt;
&lt;li&gt;Compliance checkers&lt;/li&gt;
&lt;li&gt;Court case summarizers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Legal AI has the same confidentiality imperative as healthcare — attorney-client privilege doesn't survive a round trip to a cloud API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer Tools (20+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Code review assistants&lt;/li&gt;
&lt;li&gt;API documentation generators&lt;/li&gt;
&lt;li&gt;Git analytics dashboards&lt;/li&gt;
&lt;li&gt;Performance profiling tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tools I actually use in my day job. Building them made me a better engineer, and open-sourcing them helped others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Education, Finance, Security, Creative AI (50+ repos)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Exam generators and tutoring bots&lt;/li&gt;
&lt;li&gt;Financial report analyzers&lt;/li&gt;
&lt;li&gt;Security audit tools&lt;/li&gt;
&lt;li&gt;Story generators and poetry engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each domain taught me something about how LLMs interact with domain-specific knowledge. Medical terminology behaves differently than legal jargon, which behaves differently than financial reporting language. The prompting strategies that work for clinical summarization fail for creative writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;After 116 repos, I've converged on a pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── src/
│   ├── core/          # Domain logic (no LLM dependency)
│   ├── llm/           # LLM integration layer
│   ├── api/           # FastAPI endpoints
│   └── ui/            # Streamlit interface
├── tests/
├── docker-compose.yml # One-command deployment
├── README.md          # Problem, solution, architecture, demo
└── .env.example       # Configuration template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Separate domain logic from LLM integration&lt;/strong&gt; — the core business logic should work with any model, or even without one (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always provide both API and UI&lt;/strong&gt; — API for integration, UI for demos and non-technical users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-first deployment&lt;/strong&gt; — &lt;code&gt;docker compose up&lt;/code&gt; should be the only command needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive README&lt;/strong&gt; — every project explains the problem it solves, not just how to run it&lt;/li&gt;
&lt;/ol&gt;
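
&lt;p&gt;The first principle is the one that pays off most. In practice it means the &lt;code&gt;core/&lt;/code&gt; package depends on a tiny interface rather than on Ollama directly. A minimal sketch — the names are my own, not from any specific repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

import ollama

class TextGenerator(Protocol):
    # The only thing core/ knows about the LLM layer
    def generate(self, prompt: str) -&amp;gt; str: ...

class OllamaGenerator:
    # Lives in src/llm/; swap in a canned stub for unit tests
    def __init__(self, model="gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str) -&amp;gt; str:
        return self.client.generate(model=self.model, prompt=prompt)["response"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the domain code only sees &lt;code&gt;TextGenerator&lt;/code&gt;, its tests can run without a model at all.&lt;/p&gt;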

&lt;h2&gt;
  
  
  Time Management: Building While Working Full-Time
&lt;/h2&gt;

&lt;p&gt;The most common question I get: "How do you build this much while working full-time?"&lt;/p&gt;

&lt;p&gt;The honest answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reuse patterns aggressively&lt;/strong&gt; — that project template above means I can scaffold a new project in 20 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build in domains you know&lt;/strong&gt; — working on Copilot Search taught me RAG patterns that directly informed my retrieval-augmented projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small, focused projects&lt;/strong&gt; — each repo solves one problem well. A contract analyzer doesn't try to also manage cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekend sprints&lt;/strong&gt; — most projects start as Saturday afternoon prototypes. If the prototype works, it gets a full README and Docker setup the next day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate everything else&lt;/strong&gt; — I have scripts for repo creation, README generation, and deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What This Portfolio Has Done for My Career
&lt;/h2&gt;

&lt;p&gt;Building 116 original repositories has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepened my expertise&lt;/strong&gt; — you don't truly understand RAG until you've built it for healthcare, legal, and education domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Created a public body of work&lt;/strong&gt; — every repo is a verifiable, runnable demonstration of skill&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opened conversations&lt;/strong&gt; — colleagues and recruiters reference specific projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributed to open source&lt;/strong&gt; — over 50 projects have README-driven documentation that helps others learn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built credibility in AI/ML&lt;/strong&gt; — a portfolio this size, with this consistency, demonstrates sustained commitment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advice for Building Your Own Portfolio
&lt;/h2&gt;

&lt;p&gt;If you're considering building a similar open-source portfolio:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick a consistent stack&lt;/strong&gt; — don't learn a new framework for each project. Master one stack and push it to its limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solve real problems&lt;/strong&gt; — "GPT wrapper" projects don't demonstrate skill. Privacy-first healthcare AI demonstrates both technical ability and domain understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the README first&lt;/strong&gt; — if you can't explain the problem and solution clearly, the project isn't ready&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ship Docker&lt;/strong&gt; — if someone can't run your project with a single command, they won't try it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be original&lt;/strong&gt; — forking and modifying existing projects teaches less than building from scratch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay consistent&lt;/strong&gt; — 116 repos didn't happen overnight. Commit to building something new every week&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full portfolio is available at &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt; and showcased at &lt;a href="https://kennedyraju55.github.io" rel="noopener noreferrer"&gt;kennedyraju55.github.io&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, specializing in semantic indexing and RAG systems. He maintains 116+ original open-source repositories. Read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>opensource</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>Contract Analysis with Local LLMs: Why Law Firms Should Stop Sending Documents to the Cloud</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:58:00 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/contract-analysis-with-local-llms-why-law-firms-should-stop-sending-documents-to-the-cloud-41hi</link>
      <guid>https://forem.com/kennedyraju55/contract-analysis-with-local-llms-why-law-firms-should-stop-sending-documents-to-the-cloud-41hi</guid>
      <description>&lt;p&gt;Legal documents are among the most sensitive files in any organization. Yet the current wave of "AI-powered contract review" tools wants you to upload those documents to cloud APIs — exposing client confidentiality, attorney-client privilege, and trade secrets to third-party servers.&lt;/p&gt;

&lt;p&gt;I built an alternative: a contract clause analyzer that runs entirely on your machine using Gemma 4 via Ollama. Zero cloud transmission. Complete confidentiality. Here's how it works and why it matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Confidentiality Problem
&lt;/h2&gt;

&lt;p&gt;When a law firm uploads a contract to a cloud-based AI tool, several things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Attorney-client privilege may be waived&lt;/strong&gt; — transmitting privileged documents to a third party without proper safeguards can constitute a waiver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client data leaves your control&lt;/strong&gt; — even with encryption, the cloud provider processes the text in plaintext during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory exposure increases&lt;/strong&gt; — GDPR, CCPA, and industry regulations impose strict requirements on data processing locations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive intelligence leaks&lt;/strong&gt; — M&amp;amp;A contracts, employment agreements, and IP licenses contain strategic information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The American Bar Association's Model Rule 1.6 requires lawyers to make "reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client." Sending contracts to a cloud LLM is a gray area that many ethics committees are actively scrutinizing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Local-First Contract Analysis
&lt;/h2&gt;

&lt;p&gt;My contract-clause-analyzer uses a three-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│         Document Input Layer            │
│  (PDF, DOCX, TXT parsing + OCR)        │
├─────────────────────────────────────────┤
│         Clause Extraction Engine        │
│  (Section splitting, clause typing,     │
│   reference resolution)                 │
├─────────────────────────────────────────┤
│         LLM Analysis Layer              │
│  (Gemma 4 via Ollama — local only)     │
│  Risk scoring, term comparison,         │
│  plain-English summaries                │
└─────────────────────────────────────────┘
         ↕ Everything stays on localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 1: Document Parsing
&lt;/h3&gt;

&lt;p&gt;Contracts come in every format. The parser handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF&lt;/strong&gt; — both text-based and scanned (with Tesseract OCR fallback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOCX&lt;/strong&gt; — preserving section structure and numbering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain text&lt;/strong&gt; — for already-extracted content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key challenge is preserving document structure. A clause that says "Subject to Section 4.2(a)" needs to be linked to that section. The parser builds a section tree that maintains these cross-references.&lt;/p&gt;
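
&lt;p&gt;A sketch of the underlying structure — the names are illustrative, not the actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class Section:
    number: str                 # e.g. "4.2(a)"
    heading: str
    text: str
    children: list = field(default_factory=list)
    references: list = field(default_factory=list)  # section numbers cited in the text

def build_index(root, index=None):
    # A flat index lets a reference like "Section 4.2(a)" resolve in one lookup
    if index is None:
        index = {}
    index[root.number] = root
    for child in root.children:
        build_index(child, index)
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;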

&lt;h3&gt;
  
  
  Stage 2: Clause Extraction
&lt;/h3&gt;

&lt;p&gt;Not every paragraph in a contract is a "clause" worth analyzing. The extraction engine identifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operative clauses&lt;/strong&gt; — obligations, rights, conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate&lt;/strong&gt; — standard terms that still matter (governing law, dispute resolution, force majeure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Definitions&lt;/strong&gt; — terms that affect interpretation of other clauses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedules and exhibits&lt;/strong&gt; — referenced attachments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each clause is classified by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CLAUSE_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnif&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hold harmless&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defend and indemnify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation_of_liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limitation of liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggregate liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cap on damages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;terminat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expiration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancellation rights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidentiality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidential&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-disclosure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proprietary information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ip_assignment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intellectual property&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;work product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignment of rights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non_compete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-compete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-solicitation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restrictive covenant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;net 30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compensation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warranty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;represent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guarantee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_majeure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force majeure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act of god&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beyond reasonable control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;governing_law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;governing law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jurisdiction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;venue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispute_resolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arbitration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispute resolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_protection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data protection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GDPR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;personal data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;privacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
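
&lt;p&gt;To make the map concrete, here is a minimal sketch of how keyword tagging can drive clause detection. It is my own simplification rather than the repo's exact segmentation logic; &lt;code&gt;split_into_clauses&lt;/code&gt; is a naive stand-in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def split_into_clauses(text: str) -&amp;gt; list:
    # Naive stand-in: treat blank-line-separated blocks as candidate clauses.
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]

def tag_clauses(contract_text: str, keyword_map: dict) -&amp;gt; list:
    """Tag each candidate clause with every clause type whose keywords it mentions."""
    tagged = []
    for clause in split_into_clauses(contract_text):
        lowered = clause.lower()
        types = [ctype for ctype, kws in keyword_map.items()
                 if any(kw.lower() in lowered for kw in kws)]
        if types:
            tagged.append({"text": clause, "types": types})
    return tagged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;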



&lt;h3&gt;
  
  
  Stage 3: LLM Analysis
&lt;/h3&gt;

&lt;p&gt;This is where Gemma 4 shines. For each extracted clause, the LLM provides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Assessment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_clause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;party_role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a contract analysis assistant. Analyze this &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clause 
from the perspective of the &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;party_role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (the party you are advising).

Clause:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide:
1. RISK_LEVEL: HIGH, MEDIUM, or LOW
2. KEY_ISSUES: List specific concerns (max 5)
3. MISSING_PROTECTIONS: What standard protections are absent
4. PLAIN_ENGLISH: Explain what this clause means in simple terms
5. NEGOTIATION_POINTS: Suggested changes to improve the party&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s position

Be specific. Reference exact language from the clause.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The temperature is set to 0.2 — slightly higher than in clinical applications, because legal analysis benefits from some reasoning diversity, but still low enough to avoid hallucinating contract terms that don't exist.&lt;/p&gt;
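
&lt;p&gt;The numbered, ALL-CAPS labels in the prompt are what make the response parseable without demanding strict JSON. Here is a minimal sketch of what a parser like &lt;code&gt;_parse_analysis&lt;/code&gt; might look like; it is a simplified stand-in, not the repo's exact implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

FIELDS = ["RISK_LEVEL", "KEY_ISSUES", "MISSING_PROTECTIONS",
          "PLAIN_ENGLISH", "NEGOTIATION_POINTS"]

def parse_analysis(raw: str) -&amp;gt; dict:
    """Split the response on the labeled headings the prompt asked for."""
    result = {}
    for i, field in enumerate(FIELDS):
        nxt = FIELDS[i + 1] if i + 1 &amp;lt; len(FIELDS) else None
        # Capture everything between this label and the next one (or end of text).
        tail = rf"(?={nxt})" if nxt else r"\Z"
        match = re.search(rf"{field}\s*:\s*(.*?){tail}", raw, re.DOTALL)
        result[field.lower()] = match.group(1).strip() if match else ""
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;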

&lt;p&gt;&lt;strong&gt;Comparative Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tool can also compare clauses against a library of "standard" terms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_to_standard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;standard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;standard_library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;standard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No standard template available for this clause type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare this contract clause to the standard template below.

ACTUAL CLAUSE:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

STANDARD TEMPLATE:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;standard&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Identify:
1. DEVIATIONS: Where the actual clause differs from standard
2. FAVORABLE_TERMS: Terms that are better than standard (for our client)
3. UNFAVORABLE_TERMS: Terms that are worse than standard
4. MISSING_TERMS: Standard protections that are absent&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    response = self.client.generate(
        model="gemma4",
        prompt=prompt,
        options={"temperature": 0.2}
    )
    # _parse_comparison is a hypothetical helper mirroring _parse_analysis above.
    return self._parse_comparison(response["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is incredibly powerful for junior associates who need to review contracts against firm templates — they get an instant markup of deviations without consuming senior-partner time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. M&amp;amp;A Due Diligence
&lt;/h3&gt;

&lt;p&gt;During an acquisition, the legal team might review hundreds of contracts. The tool can (a minimal batch driver is sketched after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch-process all vendor agreements&lt;/li&gt;
&lt;li&gt;Flag non-standard termination clauses (change of control triggers)&lt;/li&gt;
&lt;li&gt;Identify IP assignment gaps&lt;/li&gt;
&lt;li&gt;Summarize aggregate exposure from indemnification clauses&lt;/li&gt;
&lt;/ul&gt;
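
&lt;p&gt;The batch driver for this kind of review can be very small. A sketch, assuming the clause pipeline above is wrapped in a hypothetical &lt;code&gt;analyze_contract(path)&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

def batch_review(contract_dir: str, out_path: str) -&amp;gt; None:
    """Run the analyzer over every PDF in a folder, writing one JSONL report."""
    with open(out_path, "w") as out:
        for pdf in sorted(Path(contract_dir).glob("*.pdf")):
            # analyze_contract is a hypothetical wrapper around the
            # extraction and analysis stages described above.
            report = analyze_contract(str(pdf))
            out.write(json.dumps({"file": pdf.name, "report": report}) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;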

&lt;h3&gt;
  
  
  2. Employment Agreement Review
&lt;/h3&gt;

&lt;p&gt;HR and legal teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare non-compete scopes across different state jurisdictions&lt;/li&gt;
&lt;li&gt;Flag overly broad IP assignment clauses&lt;/li&gt;
&lt;li&gt;Ensure severance terms are consistent across employee levels&lt;/li&gt;
&lt;li&gt;Identify clauses that may not be enforceable in specific states&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Vendor Contract Management
&lt;/h3&gt;

&lt;p&gt;Procurement teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score vendor contracts by risk level&lt;/li&gt;
&lt;li&gt;Track SLA terms across multiple vendors&lt;/li&gt;
&lt;li&gt;Flag auto-renewal clauses before they trigger&lt;/li&gt;
&lt;li&gt;Ensure data protection addenda are present and adequate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On a consumer GPU (RTX 3080):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single clause analysis&lt;/strong&gt;: 1-3 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full contract (50 pages)&lt;/strong&gt;: 2-5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing (100 contracts)&lt;/strong&gt;: 3-4 hours unattended&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These times are comparable to cloud APIs — but without the per-token costs that make batch processing expensive. A 50-page contract might cost $2-5 in cloud API tokens. Locally, after the one-time hardware investment, the marginal cost is electricity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Legal AI
&lt;/h2&gt;

&lt;p&gt;The legal industry is at an inflection point with AI. Firms that adopt AI will outcompete those that don't. But adopting cloud-based AI for sensitive legal work creates risks that may outweigh the benefits.&lt;/p&gt;

&lt;p&gt;Local LLMs offer a third path: the productivity gains of AI without the confidentiality risks of cloud processing. As models like Gemma 4 continue to improve, the quality gap between local and cloud inference will continue to shrink.&lt;/p&gt;

&lt;p&gt;The code is open source and ready to deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/contract-clause-analyzer" rel="noopener noreferrer"&gt;contract-clause-analyzer&lt;/a&gt; — Full contract analysis pipeline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/legal-brief-generator" rel="noopener noreferrer"&gt;legal-brief-generator&lt;/a&gt; — Generate legal brief drafts from case notes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kennedyraju55/ai-compliance-checker" rel="noopener noreferrer"&gt;ai-compliance-checker&lt;/a&gt; — Regulatory compliance analysis&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He builds privacy-first AI tools across healthcare, legal, and enterprise domains. Explore his 116+ open-source repositories on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and read more on &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I Built a Privacy-First Healthcare AI Agent Using MCP and Local LLMs</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:54:45 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/how-i-built-a-privacy-first-healthcare-ai-agent-using-mcp-and-local-llms-kg9</link>
      <guid>https://forem.com/kennedyraju55/how-i-built-a-privacy-first-healthcare-ai-agent-using-mcp-and-local-llms-kg9</guid>
      <description>&lt;p&gt;Most healthcare AI demos have a fatal flaw: they send patient data to the cloud. That's not just a bad practice — it's a regulatory minefield. HIPAA violations can cost $50,000 per incident, and "but our AI vendor said it was secure" isn't a defense.&lt;/p&gt;

&lt;p&gt;I decided to build healthcare AI tools that solve this problem at the architecture level. No patient data ever leaves the machine. Zero cloud API calls. Complete HIPAA compliance by design, not by policy.&lt;/p&gt;

&lt;p&gt;Here's how I built a suite of healthcare AI agents — including a patient intake summarizer, lab results interpreter, EHR de-identifier, and medical document assistant — all running locally with Gemma 4 via Ollama.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Cloud-Based Healthcare AI
&lt;/h2&gt;

&lt;p&gt;Every time a healthcare organization sends patient data to a cloud LLM API, they're creating:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A HIPAA liability&lt;/strong&gt; — PHI (Protected Health Information) transmitted to a third party requires a Business Associate Agreement, encryption in transit and at rest, and audit trails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single point of failure&lt;/strong&gt; — API outages mean your clinical workflow stops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A cost that scales linearly&lt;/strong&gt; — every patient encounter means another API call, and token costs add up fast in healthcare where documents are long&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A trust problem&lt;/strong&gt; — patients and providers increasingly ask "where does my data go?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution isn't to avoid AI — it's to bring the AI to the data instead of sending the data to the AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Local LLM + MCP Pattern
&lt;/h2&gt;

&lt;p&gt;My architecture uses three core components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           Clinical Application              │
│  (Streamlit UI / FastAPI / CLI)             │
├─────────────────────────────────────────────┤
│           MCP Server Layer                  │
│  (Tool definitions, prompt templates,       │
│   FHIR resource handlers)                  │
├─────────────────────────────────────────────┤
│           Ollama Runtime                    │
│  (Gemma 4 model, local inference,          │
│   zero network transmission)               │
└─────────────────────────────────────────────┘
         ↕ Everything stays on localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; layer is what makes this modular. Instead of hardcoding LLM interactions, each healthcare capability is exposed as an MCP tool (a registration sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;summarize_intake&lt;/code&gt; — processes patient intake forms into structured clinical summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;interpret_lab_results&lt;/code&gt; — analyzes lab values against reference ranges with clinical context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deidentify_ehr&lt;/code&gt; — strips PHI from electronic health records while preserving clinical meaning&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;analyze_document&lt;/code&gt; — multi-agent document analysis for medical records&lt;/li&gt;
&lt;/ul&gt;
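
&lt;p&gt;As a sketch of the registration side, assuming the official &lt;code&gt;mcp&lt;/code&gt; Python SDK's FastMCP helper (the wiring here is illustrative; the summarizer class itself appears later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP("healthcare-tools")

@mcp.tool()
def summarize_intake(intake_text: str, format: str = "structured") -&amp;gt; dict:
    """Summarize a patient intake form into a structured clinical summary."""
    # Delegates to the local Ollama-backed IntakeSummarizer shown later
    # in this post; nothing leaves the machine.
    return IntakeSummarizer().summarize(intake_text, format)

if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;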

&lt;h3&gt;
  
  
  Why MCP?
&lt;/h3&gt;

&lt;p&gt;MCP provides a standardized interface between AI models and tools. For healthcare, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interoperability&lt;/strong&gt; — any MCP-compatible client can use the healthcare tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composability&lt;/strong&gt; — chain multiple tools (e.g., de-identify → summarize → flag risks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testability&lt;/strong&gt; — each tool can be tested independently with known inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt; — every tool invocation is logged with inputs and outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building the Patient Intake Summarizer
&lt;/h2&gt;

&lt;p&gt;Let me walk through one tool in detail. The Patient Intake Summarizer takes unstructured intake forms and produces structured clinical summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge
&lt;/h3&gt;

&lt;p&gt;Patient intake forms are messy. They contain free-text descriptions mixed with medical terminology, abbreviations, and varying formats. A typical intake might read:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"52F, presenting with lower back pain x 3 weeks, worse with sitting. PMH: DM2 on metformin 500mg BID, HTN on lisinopril 10mg daily. No known allergies. Family hx: mother had MI at 62."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A clinician can parse this instantly. An LLM needs structured prompting to extract the same information reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntakeSummarizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intake_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intake_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Low temp for clinical accuracy
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a clinical documentation assistant. 
Summarize the following patient intake form into a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; summary.

IMPORTANT: Extract ALL of the following categories:
- Demographics (age, sex, presenting complaint)
- Medical History (conditions, surgeries, hospitalizations)  
- Current Medications (drug, dose, frequency)
- Allergies (drug, food, environmental)
- Family History (conditions, relationships)
- Social History (occupation, habits, living situation)
- Risk Factors (clinical flags requiring attention)
- Missing Information (gaps that need follow-up)

Intake Form:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Provide the summary in structured JSON format.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is &lt;strong&gt;temperature 0.1&lt;/strong&gt;. For creative writing, you want high temperature. For clinical summarization, you want the model to be as deterministic and faithful to the source text as possible. Hallucinated medical information isn't creative — it's dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Format Output
&lt;/h3&gt;

&lt;p&gt;The summarizer supports three output formats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Brief&lt;/strong&gt; — 2-3 sentence overview for quick triage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detailed&lt;/strong&gt; — paragraph-form comprehensive summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt; — JSON with categorized fields for EHR integration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The structured format is particularly valuable because it can be directly ingested by downstream systems — no manual re-entry, no copy-paste errors.&lt;/p&gt;
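
&lt;p&gt;For the intake example quoted earlier, the structured output comes back shaped roughly like this. The exact field contents vary run to run, so treat it as an illustration of the shape, not a canonical result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;{
    "demographics": {"age": 52, "sex": "F",
                     "presenting_complaint": "lower back pain x 3 weeks, worse with sitting"},
    "medical_history": ["type 2 diabetes", "hypertension"],
    "current_medications": [
        {"drug": "metformin", "dose": "500mg", "frequency": "BID"},
        {"drug": "lisinopril", "dose": "10mg", "frequency": "daily"},
    ],
    "allergies": [],
    "family_history": [{"condition": "MI", "relation": "mother", "age_at_onset": 62}],
    "risk_factors": ["family history of early cardiac disease"],
    "missing_information": ["social history", "surgical history"],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;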

&lt;h2&gt;
  
  
  Lab Results Interpreter
&lt;/h2&gt;

&lt;p&gt;The lab interpreter is more complex because it needs reference ranges and clinical context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REFERENCE_RANGES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glucose_fasting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mg/dL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hba1c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;14.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creatinine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mg/dL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 50+ lab values
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;interpret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;REFERENCE_RANGES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_classify_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Only call LLM for abnormal values or when context matters
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interpretation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_llm_interpret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patient_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interpretation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is within normal range.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lab_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interpretation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interpretation&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the optimization: we only call the LLM for abnormal values or when patient context might change the interpretation. A normal glucose in a diabetic patient means something different than in a healthy patient — that's when the LLM adds value. For straightforward normal results, a rule-based response is faster and just as accurate.&lt;/p&gt;
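
&lt;p&gt;The rule-based side of that split is plain threshold logic. A minimal sketch of the classifier, written as a free function here for brevity (a simplified stand-in for &lt;code&gt;_classify_value&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_value(value: float, ref: dict) -&amp;gt; str:
    """Bucket a lab value against its reference range, checking critical bounds first."""
    if ref is None:
        return "unknown"
    if "critical_low" in ref and value &amp;lt;= ref["critical_low"]:
        return "critical_low"
    if "critical_high" in ref and value &amp;gt;= ref["critical_high"]:
        return "critical_high"
    if value &amp;lt; ref["low"]:
        return "low"
    if value &amp;gt; ref["high"]:
        return "high"
    return "normal"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;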

&lt;h2&gt;
  
  
  EHR De-identification
&lt;/h2&gt;

&lt;p&gt;De-identification is critical for research, training, and any scenario where clinical data needs to be shared without exposing patient identity.&lt;/p&gt;

&lt;p&gt;The tool identifies and removes 18 HIPAA identifier categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names, dates, phone numbers, emails&lt;/li&gt;
&lt;li&gt;Social Security numbers, medical record numbers&lt;/li&gt;
&lt;li&gt;Geographic data smaller than a state&lt;/li&gt;
&lt;li&gt;Biometric identifiers, device identifiers&lt;/li&gt;
&lt;li&gt;URLs, IP addresses, account numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM approach has an advantage over regex-based de-identification: it understands context. "Dr. Smith recommended the Smith protocol" — the first "Smith" is PHI, the second is a medical protocol name. A regex would remove both; the LLM preserves the clinically meaningful reference.&lt;/p&gt;
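
&lt;p&gt;In practice a conservative design combines both: deterministic regexes catch the unambiguously patterned identifiers, and the LLM handles the context-dependent ones like names. A minimal sketch of the regex pre-pass (my simplification; the repo may structure this differently):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Unambiguously patterned identifiers; names, addresses, and other
# context-dependent PHI are left to the LLM pass.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def regex_prepass(text: str) -&amp;gt; str:
    """Replace clearly patterned PHI with bracketed placeholders before the LLM pass."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;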

&lt;h2&gt;
  
  
  Docker Deployment
&lt;/h2&gt;

&lt;p&gt;Every tool ships with Docker Compose for one-command deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-data:/root/.ollama&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8501:8501"&lt;/span&gt;  &lt;span class="c1"&gt;# Streamlit UI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;  &lt;span class="c1"&gt;# FastAPI API&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_HOST=http://ollama:11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker compose up&lt;/code&gt; and you have a fully functional healthcare AI tool running locally. No API keys, no cloud accounts, no data leaving your network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Impact
&lt;/h2&gt;

&lt;p&gt;Across the four healthcare tools, the architecture delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero data transmission&lt;/strong&gt; — verified with network monitoring, no outbound connections during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency responses&lt;/strong&gt; — Gemma 4 on a consumer GPU generates clinical summaries in 800ms-2s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent accuracy&lt;/strong&gt; — low temperature + structured prompting produces reliable, reproducible outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete audit trail&lt;/strong&gt; — every tool invocation logged with timestamp, input hash, and output (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
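
&lt;p&gt;The audit logging is deliberately boring. A sketch of the idea, with illustrative field names: hash the input so the log can prove what was processed without storing raw PHI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import time

def log_invocation(tool: str, input_text: str, output: dict,
                   path: str = "audit.jsonl") -&amp;gt; None:
    """Append one audit record per tool call; inputs are stored only as hashes."""
    record = {
        "ts": time.time(),
        "tool": tool,
        "input_sha256": hashlib.sha256(input_text.encode("utf-8")).hexdigest(),
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;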

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm currently exploring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FHIR R4 integration&lt;/strong&gt; — mapping tool outputs to FHIR resources for EHR interoperability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A2A (Agent-to-Agent) protocol&lt;/strong&gt; — enabling healthcare agents to collaborate (e.g., intake summarizer triggers lab interpreter which triggers risk assessment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated evaluation&lt;/strong&gt; — benchmarking accuracy across institutions without sharing data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code is open source. If you're building healthcare AI that respects patient privacy, check out the repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/patient-intake-summarizer" rel="noopener noreferrer"&gt;patient-intake-summarizer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/lab-results-interpreter" rel="noopener noreferrer"&gt;lab-results-interpreter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/ehr-deidentifier" rel="noopener noreferrer"&gt;ehr-deidentifier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kennedyraju55/docshield" rel="noopener noreferrer"&gt;docshield&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He has built 116+ open-source repositories, including a suite of privacy-first healthcare AI tools. Find his work on &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://dev.to/kennedyraju55"&gt;dev.to&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Building RAG Pipelines That Actually Work: Lessons from Microsoft Copilot</title>
      <dc:creator>Nrk Raju Guthikonda</dc:creator>
      <pubDate>Sun, 12 Apr 2026 22:45:40 +0000</pubDate>
      <link>https://forem.com/kennedyraju55/building-rag-pipelines-that-actually-work-lessons-from-microsoft-copilot-1fnn</link>
      <guid>https://forem.com/kennedyraju55/building-rag-pipelines-that-actually-work-lessons-from-microsoft-copilot-1fnn</guid>
      <description>&lt;p&gt;Most RAG tutorials show you the happy path. You chunk a handful of PDFs, toss them into a vector store, wire up an LLM, and — magic — your chatbot answers questions about your documents. Demo complete. Applause.&lt;/p&gt;

&lt;p&gt;Here's what those tutorials don't show you: what happens when you deploy RAG at scale. When your corpus isn't 10 PDFs but 10 million documents. When your latency budget is 200 milliseconds, not "however long it takes." When a wrong answer isn't a minor inconvenience but a trust-destroying event for millions of users.&lt;/p&gt;

&lt;p&gt;I work on Microsoft Copilot's Search Infrastructure team, where my focus is semantic indexing and retrieval-augmented generation. I've also built over 116 open-source repositories, many of which experiment with RAG patterns across healthcare, developer tools, education, and creative AI. What follows is a distillation of what I've learned — the patterns that survive contact with production, and the failure modes that tutorials conveniently skip.&lt;/p&gt;




&lt;h2&gt;
  
  
  What RAG Actually Is (Quick Refresher)
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation is a simple idea: instead of asking an LLM to answer from memory alone, you first &lt;em&gt;retrieve&lt;/em&gt; relevant documents, then feed them as context alongside the user's query. The basic flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Embed → Retrieve from Index → Augment Prompt → Generate Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That five-step pipeline hides an enormous amount of complexity. Every arrow in that diagram is a place where things can go wrong. Let's walk through each stage and talk about what actually matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking Strategies That Matter
&lt;/h2&gt;

&lt;p&gt;Chunking is the most underrated part of the RAG pipeline. Get it wrong and nothing downstream can save you — not a better embedding model, not a smarter LLM, not a fancier retrieval algorithm. Garbage chunks in, garbage answers out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fixed-Size Chunking
&lt;/h3&gt;

&lt;p&gt;The naive approach: split every N tokens. It's fast, deterministic, and almost always wrong. A 512-token window doesn't care that it just sliced a paragraph in half, separated a code function from its docstring, or split a table across two chunks. The resulting fragments lack semantic coherence, which means your embeddings will be noisy and your retrieval will suffer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Chunking
&lt;/h3&gt;

&lt;p&gt;A better approach respects the natural boundaries of text. Sentences, paragraphs, sections — these are the units humans write in, and they're the units that produce coherent embeddings. The key insight is that a chunk should be a self-contained unit of meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive and Hierarchical Chunking
&lt;/h3&gt;

&lt;p&gt;For structured documents (markdown, HTML, code), recursive chunking splits along structural boundaries first — headers, then paragraphs, then sentences — falling back to smaller splits only when a section exceeds your token budget. This preserves the document's inherent hierarchy and produces chunks that actually make sense.&lt;/p&gt;
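
&lt;p&gt;A sketch of the recursive idea: try the coarsest separator first, and only descend to finer splits when a piece still exceeds the budget (&lt;code&gt;count_tokens&lt;/code&gt; is a whitespace stand-in for your real tokenizer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Coarse-to-fine separators: headers, paragraphs, lines, sentences.
SEPARATORS = ["\n## ", "\n\n", "\n", ". "]

def count_tokens(text: str) -&amp;gt; int:
    return len(text.split())  # stand-in; swap in your model's tokenizer

def recursive_chunk(text: str, max_tokens: int = 512, level: int = 0) -&amp;gt; list:
    """Split along structural boundaries, descending only when a piece is too big."""
    if count_tokens(text) &amp;lt;= max_tokens or level &amp;gt;= len(SEPARATORS):
        return [text]
    chunks = []
    for piece in text.split(SEPARATORS[level]):
        if piece.strip():
            chunks.extend(recursive_chunk(piece, max_tokens, level + 1))
    return chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;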

&lt;h3&gt;
  
  
  Overlapping Windows
&lt;/h3&gt;

&lt;p&gt;Here's a pattern that pays for itself immediately: overlap between adjacent chunks. Without overlap, information that spans a chunk boundary is effectively invisible to retrieval. A query about concept X might match the end of chunk 4 and the beginning of chunk 5, but neither chunk alone scores high enough to be retrieved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;semantic_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_into_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep overlap sentences
&lt;/span&gt;            &lt;span class="n"&gt;overlap_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="n"&gt;current_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlap_sentences&lt;/span&gt;
            &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I keep two trailing sentences as overlap. This is a deliberate choice — enough context to preserve cross-boundary meaning, but not so much that you're bloating your index with redundant content. Tune the overlap to your domain: technical documentation tends to need more overlap than conversational text.&lt;/p&gt;




&lt;h2&gt;
  
  
  Embedding Models — Choosing Wisely
&lt;/h2&gt;

&lt;p&gt;Your embedding model is the lens through which your entire corpus is viewed. Choose poorly and retrieval becomes a game of chance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Landscape
&lt;/h3&gt;

&lt;p&gt;OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; is the default choice for many teams, and it's a solid baseline — 1536 dimensions, reasonable performance across domains, easy API integration. But it's not always the right answer. Open-source models like &lt;strong&gt;BGE-large&lt;/strong&gt;, &lt;strong&gt;E5-large-v2&lt;/strong&gt;, and the &lt;strong&gt;sentence-transformers&lt;/strong&gt; family offer competitive quality with significant advantages: no API costs at scale, lower latency (run locally or on your own GPU fleet), and the ability to fine-tune on your domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain-Specific vs. General-Purpose
&lt;/h3&gt;

&lt;p&gt;If your corpus is specialized — legal documents, medical literature, codebases — a general-purpose embedding model may not capture the nuances that matter. A model fine-tuned on biomedical text will understand that "MI" means "myocardial infarction," not "Michigan." The MTEB leaderboard is your friend here: benchmark models against your actual query distribution, not generic benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimensionality Tradeoffs
&lt;/h3&gt;

&lt;p&gt;Higher dimensions capture more nuance but cost more in storage and search latency. At scale, the difference between 384 and 1536 dimensions is not academic — it's the difference between fitting your index in memory or needing distributed infrastructure. I've seen 768-dimensional models outperform 1536-dimensional ones on domain-specific tasks after fine-tuning. Measure, don't assume.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asymmetric Embedding
&lt;/h3&gt;

&lt;p&gt;This is the insight that separates production RAG from tutorial RAG: &lt;strong&gt;the query and the document should not be embedded the same way.&lt;/strong&gt; A query like "How do I reset my password?" is semantically different from a documentation passage that contains the answer. Models like E5 handle this explicitly with &lt;code&gt;query:&lt;/code&gt; and &lt;code&gt;passage:&lt;/code&gt; prefixes. If your embedding model supports asymmetric encoding, use it. The retrieval quality improvement is substantial and essentially free.&lt;/p&gt;
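
&lt;p&gt;With the E5 family this is just a prefixing convention applied before encoding. A minimal sketch with sentence-transformers, using the model name from the E5 model card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 was trained with explicit role prefixes, so queries and passages
# are deliberately embedded differently.
query_vec = model.encode("query: How do I reset my password?",
                         normalize_embeddings=True)
passage_vecs = model.encode(
    ["passage: To reset your password, open Settings and choose Security."],
    normalize_embeddings=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;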




&lt;h2&gt;
  
  
  Retrieval Ranking — Beyond Cosine Similarity
&lt;/h2&gt;

&lt;p&gt;Cosine similarity against a dense vector index is the starting point, not the finish line. In production, you need a ranking pipeline, not a single similarity score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid Search: Dense + Sparse
&lt;/h3&gt;

&lt;p&gt;Dense retrieval (vector search) excels at semantic matching — it understands that "automobile" and "car" are related. Sparse retrieval (BM25, keyword matching) excels at exact matching — it knows that "error code 0x8007045D" is a precise string, not a semantic concept. Neither alone is sufficient. The winning combination is both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dense_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sparse_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dense_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparse_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reciprocal Rank Fusion (RRF) is elegant because it doesn't require score normalization — it works purely on rank positions. The constant 60 in the denominator is a standard dampening factor that prevents top-ranked results from dominating. The &lt;code&gt;alpha&lt;/code&gt; parameter controls the dense-vs-sparse balance; 0.7 is a reasonable starting point, but you should tune it against your evaluation set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-Ranking with Cross-Encoders
&lt;/h3&gt;

&lt;p&gt;Bi-encoders (your embedding model) are fast because they encode queries and documents independently. Cross-encoders are &lt;em&gt;accurate&lt;/em&gt; because they process the query-document pair jointly, capturing fine-grained interactions. The pattern: retrieve broadly with a bi-encoder, then re-rank the top candidates with a cross-encoder. A model like &lt;code&gt;cross-encoder/ms-marco-MiniLM-L-12-v2&lt;/code&gt; can re-rank 100 candidates in tens of milliseconds on a GPU.&lt;/p&gt;
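&lt;p&gt;The retrieve-then-re-rank step is only a few lines with the sentence-transformers &lt;code&gt;CrossEncoder&lt;/code&gt; API (a sketch; the candidate list is assumed to come from the hybrid search above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, candidates, top_k=10):
    # Score each (query, passage) pair jointly -- expensive per pair,
    # but cheap for ~100 candidates scored as a single batch.
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;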

&lt;h3&gt;
  
  
  Metadata Filtering
&lt;/h3&gt;

&lt;p&gt;Not all retrieval should be purely semantic. If a user asks about "Python 3.11 features," you should filter by language and version &lt;em&gt;before&lt;/em&gt; running vector search, not after. Pre-filtering reduces the search space and eliminates false positives that would otherwise waste context window budget.&lt;/p&gt;
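&lt;p&gt;A minimal in-memory sketch of the idea (real vector stores such as Qdrant or Chroma apply the filter inside the index, which is what you want at scale; the over-retrieve-then-filter step here is a stand-in for true pre-filtering):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def filtered_search(query, index, metadata, filters, k=10):
    # Restrict the candidate set by metadata so the top-k slots
    # aren't wasted on wrong-language or wrong-version documents.
    allowed = {
        doc_id for doc_id, meta in metadata.items()
        if all(meta.get(key) == value for key, value in filters.items())
    }
    results = index.search(embed(query), k=k * 5)  # over-retrieve, then filter
    return [doc_id for doc_id in results if doc_id in allowed][:k]

# filtered_search("What's new in 3.11?", index, metadata,
#                 filters={"language": "python", "version": "3.11"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;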

&lt;h3&gt;
  
  
  The "Lost in the Middle" Problem
&lt;/h3&gt;

&lt;p&gt;Research from Stanford showed that LLMs pay disproportionate attention to the beginning and end of their context window, often ignoring information in the middle. This has direct implications for how you order retrieved passages. Don't dump them in retrieval-rank order: place your most relevant chunks at the beginning of the context, or better yet sandwich them, alternating the strongest results between the start and the end so the weakest land in the middle where attention is lowest.&lt;/p&gt;
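&lt;p&gt;A tiny reordering helper along those lines (the sandwich placement is one common mitigation, not the only one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sandwich_order(chunks_best_first):
    # Alternate the strongest chunks between the front and the back,
    # leaving the weakest in the middle where attention is lowest.
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["A", "B", "C", "D", "E"] (best to worst) becomes ["A", "C", "E", "D", "B"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;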




&lt;h2&gt;
  
  
  Context Window Management
&lt;/h2&gt;

&lt;p&gt;You've retrieved your chunks. Now you need to fit them — along with a system prompt, the user's query, and room for the response — into a fixed token budget. This is a packing problem, and it deserves more attention than it gets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Budgeting
&lt;/h3&gt;

&lt;p&gt;Be explicit about your budget allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TOTAL_CONTEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;  &lt;span class="c1"&gt;# or 128k, depends on your model
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;RESPONSE_RESERVE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;USER_QUERY_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;  &lt;span class="c1"&gt;# estimate or measure
&lt;/span&gt;
&lt;span class="n"&gt;CONTEXT_BUDGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOTAL_CONTEXT&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT_TOKENS&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;RESPONSE_RESERVE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;USER_QUERY_TOKENS&lt;/span&gt;
&lt;span class="c1"&gt;# = 6468 tokens for retrieved passages
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every token you spend on a low-relevance chunk is a token you can't spend on a high-relevance one. Rank your chunks by retrieval score and pack greedily until the budget is full.&lt;/p&gt;
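&lt;p&gt;The greedy packer is short enough to show in full (a sketch; the length-based &lt;code&gt;count_tokens&lt;/code&gt; is a crude stand-in for your model's real tokenizer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def count_tokens(text):
    return len(text) // 4  # rough estimate; swap in the model's tokenizer

def pack_chunks(ranked_chunks, budget=CONTEXT_BUDGET):
    # Take chunks in relevance order until the next one would overflow.
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost &amp;gt; budget:
            break  # or `continue` to try squeezing in smaller chunks
        packed.append(chunk)
        used += cost
    return packed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;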

&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;

&lt;p&gt;When your top chunks exceed the budget, you have two choices: drop chunks or compress them. Compression techniques range from simple extractive summarization (keep only the most relevant sentences within each chunk) to LLM-based summarization. The tradeoff is latency vs. information density. In latency-sensitive pipelines, extractive approaches win.&lt;/p&gt;
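&lt;p&gt;A minimal extractive compressor, reusing the &lt;code&gt;embed&lt;/code&gt; helper from retrieval (a sketch: the sentence splitting is naive, and it assumes &lt;code&gt;embed&lt;/code&gt; returns normalized vectors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def compress_chunk(chunk, query_vec, keep=3):
    # Keep only the sentences most similar to the query,
    # preserving their original order for readability.
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    scored = [(i, float(np.dot(embed(s), query_vec)))
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, key=lambda x: x[1], reverse=True)[:keep])
    return ". ".join(sentences[i] for i, _ in top) + "."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;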

&lt;h3&gt;
  
  
  Strategy Selection: Stuff vs. Map-Reduce vs. Refine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuff:&lt;/strong&gt; Concatenate all retrieved chunks into a single prompt. Simple, fast, works when everything fits in the context window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map-Reduce:&lt;/strong&gt; Process each chunk independently, then aggregate the results. Necessary when the total retrieved content exceeds the context window. More LLM calls, higher latency, but handles scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refine:&lt;/strong&gt; Process chunks sequentially, refining the answer with each new chunk. Produces high-quality answers but has the highest latency. Use for offline or batch workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, I default to Stuff with aggressive filtering. If your retrieval and ranking are good, you shouldn't need more than 5–8 highly relevant chunks to answer most questions.&lt;/p&gt;
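&lt;p&gt;When Stuff doesn't fit, Map-Reduce is the usual fallback. Here's a sketch of the map and reduce steps, where &lt;code&gt;llm&lt;/code&gt; is a hypothetical completion helper for whatever model you're running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def map_reduce(query, chunks):
    # Map: answer the question against each chunk independently.
    partials = [
        llm(f"Context:\n{chunk}\n\nQuestion: {query}\n"
            "Answer from this context only:")
        for chunk in chunks
    ]
    # Reduce: synthesize the per-chunk answers into one.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(f"Combine these partial answers into one answer.\n"
               f"Partial answers:\n{joined}\n\nQuestion: {query}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;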




&lt;h2&gt;
  
  
  Common Failure Modes (And How to Debug Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retrieval Misses
&lt;/h3&gt;

&lt;p&gt;The document exists in your corpus but wasn't retrieved. Debug by running the query embedding against the target document's embedding directly — if the similarity is low, the problem is in your chunking or embedding model. If the similarity is high but the document wasn't in the top-K, your index may have quantization issues or you're not retrieving enough candidates before re-ranking.&lt;/p&gt;
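&lt;p&gt;That first check takes four lines (a sketch; &lt;code&gt;target_text&lt;/code&gt; is the chunk you expected to come back):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

q = embed("How do I reset my password?")
d = embed(target_text)  # the chunk that should have been retrieved
sim = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
# Low similarity: chunking or embedding problem.
# High similarity but missing from the top-K: index or candidate-count problem.
print(f"cosine similarity: {sim:.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;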

&lt;h3&gt;
  
  
  Context Poisoning
&lt;/h3&gt;

&lt;p&gt;You retrieved 10 chunks, but 7 of them are irrelevant. The LLM now has to distinguish signal from noise, and it doesn't always succeed. The fix is upstream: better chunking, better ranking, and aggressive relevance thresholds. Drop any chunk below a minimum similarity score rather than always returning a fixed K.&lt;/p&gt;
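&lt;p&gt;In code, that's the difference between slicing to a fixed K and filtering by score (a sketch; the threshold value must be tuned per embedding model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MIN_SIMILARITY = 0.45  # tune against your evaluation set

def select_chunks(scored_results, k=10):
    # Return AT MOST k chunks, and only those above the relevance floor.
    # Returning fewer chunks -- or none -- beats poisoning the context.
    return [(doc, score) for doc, score in scored_results[:k]
            if score &amp;gt;= MIN_SIMILARITY]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;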

&lt;h3&gt;
  
  
  Hallucination Despite Correct Context
&lt;/h3&gt;

&lt;p&gt;The right chunk was retrieved and included in the prompt, but the LLM still hallucinated. This is often a prompt engineering problem. Explicit instructions like "Answer based ONLY on the provided context. If the context doesn't contain the answer, say so" are essential, not optional. Also consider: is the relevant information buried in a long passage? The "lost in the middle" effect applies within individual chunks too.&lt;/p&gt;
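&lt;p&gt;A grounded prompt template along those lines (a starting point, not gospel; the exact wording that works varies by model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;GROUNDED_PROMPT = """You are a question-answering assistant.
Answer based ONLY on the provided context.
If the context does not contain the answer, say "I don't know."
Do not use outside knowledge.

Context:
{context}

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(context=packed_context, question=user_query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;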

&lt;h3&gt;
  
  
  Stale Embeddings
&lt;/h3&gt;

&lt;p&gt;Your documents were updated but the embeddings weren't re-computed. This is the RAG equivalent of a cache invalidation bug. Build your indexing pipeline with incremental updates from day one. Track document hashes and re-embed only what changed. At scale, a full re-index is a multi-hour, multi-GPU operation — you don't want to do it unnecessarily.&lt;/p&gt;
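&lt;p&gt;A hash-tracking sketch for incremental re-embedding (&lt;code&gt;embed_and_upsert&lt;/code&gt; is a stand-in for your index's write path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

def doc_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_index(documents, state_path="index_state.json"):
    # Load each document's hash as of the last indexing run.
    try:
        with open(state_path) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    for doc_id, text in documents.items():
        h = doc_hash(text)
        if seen.get(doc_id) != h:  # new or changed document
            embed_and_upsert(doc_id, text)
            seen[doc_id] = h
    with open(state_path, "w") as f:
        json.dump(seen, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;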




&lt;h2&gt;
  
  
  Lessons from Scale
&lt;/h2&gt;

&lt;p&gt;What changes when you go from a demo to a production system serving millions of users?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index management becomes a first-class concern.&lt;/strong&gt; You need index versioning, blue-green deployments for index updates, and the ability to roll back a bad index without downtime. Your index is as critical as your database — treat it that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency budgets force hard tradeoffs.&lt;/strong&gt; At Microsoft scale, every millisecond matters. You might skip re-ranking on low-importance queries. You might use a smaller embedding model for initial retrieval and reserve the expensive cross-encoder for the final top-10. Tiered retrieval architectures are common in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring is non-negotiable.&lt;/strong&gt; Track retrieval precision and recall against labeled query sets. Monitor embedding drift over time. Alert on sudden drops in answer quality. Log the full pipeline: query → retrieved chunks → generated answer, so you can debug failures post-hoc. The RAG pipeline that you can't observe is the RAG pipeline that silently degrades.&lt;/p&gt;
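&lt;p&gt;Even a structured one-liner per request buys you post-hoc debuggability (a sketch; the chunk dicts with &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; keys are an assumed shape):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import time
import uuid

def log_rag_trace(query, chunks, answer, latency_ms):
    # One JSON line per request: grep-able, diff-able, replayable.
    print(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "chunk_ids": [c["id"] for c in chunks],
        "chunk_scores": [c["score"] for c in chunks],
        "answer": answer,
        "latency_ms": latency_ms,
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;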

&lt;p&gt;&lt;strong&gt;Evaluation is continuous.&lt;/strong&gt; Build evaluation sets that reflect your actual query distribution. Automated metrics (faithfulness, relevance, answer correctness) run on every pipeline change. Human evaluation catches what automated metrics miss. This isn't optional — it's how you maintain quality over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline is deceptively simple to prototype and genuinely hard to operate at scale. The architecture diagram fits on a napkin: embed, retrieve, generate. But the difference between a demo and a production system lives in the details — how you chunk documents, which embedding model you choose, how you rank and filter results, how you manage the context window, and how you monitor the whole thing.&lt;/p&gt;

&lt;p&gt;My advice: build incrementally. Start with the simplest version that works, instrument everything, and let your evaluation data tell you where to invest next. Don't over-engineer the retrieval before you've verified your chunking is sound. Don't add re-ranking before you've confirmed your base retrieval is reasonable.&lt;/p&gt;

&lt;p&gt;And don't skip the boring parts. Chunking and ranking aren't glamorous, but they're where production RAG systems are won or lost. The LLM is the easy part.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he works on Semantic Indexing and Retrieval-Augmented Generation. He has built 116+ open-source repositories spanning AI/ML, healthcare, developer tools, and creative AI. Find his work on GitHub at &lt;a href="https://github.com/kennedyraju55" rel="noopener noreferrer"&gt;github.com/kennedyraju55&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>ai</category>
      <category>mojo</category>
    </item>
  </channel>
</rss>
