<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: rautaditya2606</title>
    <description>The latest articles on Forem by rautaditya2606 (@rautaditya2606).</description>
    <link>https://forem.com/rautaditya2606</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2747140%2F696730cd-b32e-4fc6-ad1b-d6c8e7bf9df7.png</url>
      <title>Forem: rautaditya2606</title>
      <link>https://forem.com/rautaditya2606</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rautaditya2606"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent on a 4GB GPU</title>
      <dc:creator>rautaditya2606</dc:creator>
      <pubDate>Sun, 12 Apr 2026 20:57:55 +0000</pubDate>
      <link>https://forem.com/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</link>
      <guid>https://forem.com/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</guid>
      <description>&lt;p&gt;&lt;strong&gt;What I Built&lt;/strong&gt;&lt;br&gt;
I built a voice-controlled local AI agent that transcribes &lt;br&gt;
audio, classifies intent, and executes local tools — all &lt;br&gt;
visible through a transparent pipeline trace in a Gradio UI.&lt;br&gt;
The agent supports four intents: create file, write code, &lt;br&gt;
summarize text, and general chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
STT layer: Groq Whisper-large-v3 handles transcription via API.&lt;br&gt;
I chose Groq over local Whisper because my RTX 3050 (4GB VRAM) &lt;br&gt;
cannot run STT and an LLM simultaneously without OOM errors. &lt;br&gt;
Groq's API is actually faster (~300ms) than local whisper-small &lt;br&gt;
would have been.&lt;/p&gt;

&lt;p&gt;Intent layer: Ollama serves qwen2.5-coder:1.5b locally. The LLM &lt;br&gt;
returns a structured JSON intent that the tool router uses to &lt;br&gt;
decide which action to take.&lt;/p&gt;
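&lt;p&gt;A minimal sketch of that intent contract — the prompt wording, field names, and helper below are illustrative, not verbatim from the repo:&lt;/p&gt;

```python
import json

# The four intents the router understands (per the post); field names
# like "argument" are an assumption for illustration.
INTENTS = ("create_file", "write_code", "summarize", "general_chat")

SYSTEM_PROMPT = (
    "Classify the user's request. Reply with JSON only: "
    '{"intent": one of ' + ", ".join(INTENTS) + ', "argument": string}'
)

def valid_intent(reply_text):
    """Parse the LLM reply and verify it matches the intent contract."""
    data = json.loads(reply_text)
    if data.get("intent") not in INTENTS:
        raise ValueError("unknown intent: " + str(data.get("intent")))
    return data
```

&lt;p&gt;With the Ollama Python client, the model's reply would be passed through valid_intent before the router acts on it.&lt;/p&gt;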

&lt;p&gt;Tool layer: Four tools — create_file, write_code, summarize, &lt;br&gt;
general_chat. All file writes are sandboxed to output/.&lt;/p&gt;

&lt;p&gt;UI layer: Gradio displays transcription, detected intent, action &lt;br&gt;
taken, and a full pipeline trace with per-stage latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware Constraints and Decisions&lt;/strong&gt; &lt;br&gt;
My machine: Intel i5-12500H, RTX 3050 (4GB VRAM), 15GB RAM.&lt;/p&gt;

&lt;p&gt;The core constraint: 4GB VRAM cannot hold both a Whisper model &lt;br&gt;
and an LLM simultaneously.&lt;/p&gt;

&lt;p&gt;Decision 1 — STT via Groq API&lt;br&gt;
Running whisper-small locally uses ~1.5GB VRAM. That leaves &lt;br&gt;
only 2.5GB for the LLM, which isn't enough for a useful model. &lt;br&gt;
Offloading STT to Groq frees the entire 4GB for the LLM and &lt;br&gt;
actually improves latency.&lt;/p&gt;

&lt;p&gt;Decision 2 — qwen2.5-coder:1.5b via Ollama&lt;br&gt;
A 1.5B model at Q4 quantization fits comfortably in ~1.5GB VRAM.&lt;br&gt;
I initially tried the 7b variant but it exceeded available VRAM &lt;br&gt;
and caused Ollama to offload to RAM, significantly slowing &lt;br&gt;
inference.&lt;/p&gt;

&lt;p&gt;Decision 3 — Sequential pipeline&lt;br&gt;
STT completes before Ollama is called. This keeps peak VRAM &lt;br&gt;
usage under 2GB at any given time.&lt;/p&gt;
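&lt;p&gt;The sequential flow reduces to a three-stage runner with per-stage timing for the trace. The stage callables here are stand-ins, not the project's real call signatures:&lt;/p&gt;

```python
import time

def run_pipeline(audio, stt, classify, route):
    """Run STT, intent classification, and tool routing strictly in
    sequence, so no two model stages hold memory at the same time.
    Returns the final output plus a (stage, seconds) trace."""
    trace = []

    def stage(name, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        trace.append((name, time.perf_counter() - start))
        return result

    transcript = stage("stt", stt, audio)        # e.g. Groq Whisper call
    intent = stage("intent", classify, transcript)  # e.g. Ollama call
    output = stage("tool", route, intent)        # local tool execution
    return output, trace
```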

&lt;p&gt;*&lt;em&gt;Challenges I Faced *&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;VRAM management&lt;br&gt;
Loading two models simultaneously caused OOM errors. Solved &lt;br&gt;
by switching STT to Groq and keeping only the LLM local.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intent JSON parsing&lt;br&gt;
Ollama sometimes returns malformed JSON or wraps it in &lt;br&gt;
markdown code fences. Solved with a robust parser that &lt;br&gt;
strips fences and falls back to keyword matching if JSON &lt;br&gt;
parsing fails entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output sandboxing&lt;br&gt;
Naive file creation allowed path traversal (e.g. &lt;br&gt;
../../etc/passwd). Solved with path normalization and &lt;br&gt;
checking that the resolved path starts with the output/ &lt;br&gt;
directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gradio mic input format&lt;br&gt;
Gradio returns audio as a tuple (sample_rate, numpy_array) &lt;br&gt;
not a file path. Had to write it to a temp file before &lt;br&gt;
passing to Groq API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
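&lt;p&gt;The fence-stripping parser from challenge 2 can be sketched like this — the helper name and keyword lists are illustrative assumptions, not the repo's actual code:&lt;/p&gt;

```python
import json
import re

# Hypothetical keyword fallback table for when JSON parsing fails.
FALLBACK_KEYWORDS = {
    "create_file": ("create", "file"),
    "write_code": ("write", "code"),
    "summarize": ("summarize", "summary"),
}

def parse_intent(raw):
    """Extract an intent from an LLM reply that may be wrapped in
    markdown code fences or be malformed JSON."""
    text = raw.strip()
    # Strip leading ```json / trailing ``` fences if present
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "intent" in data:
            return data["intent"]
    except json.JSONDecodeError:
        pass
    # Keyword fallback when JSON parsing fails entirely
    lowered = raw.lower()
    for intent, words in FALLBACK_KEYWORDS.items():
        if any(w in lowered for w in words):
            return intent
    return "general_chat"
```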
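&lt;p&gt;And a minimal version of the challenge 3 sandbox check, assuming Python 3.9+ for Path.is_relative_to:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(requested_name):
    """Resolve the requested file path and refuse anything that
    escapes the output/ sandbox after ".." and symlink resolution."""
    candidate = (OUTPUT_DIR / requested_name).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError("path escapes the output/ sandbox")
    return candidate
```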

&lt;p&gt;&lt;strong&gt;What I'd Do Differently at Scale&lt;/strong&gt;&lt;br&gt;
For a production version of this system, I would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace Ollama with Triton Inference Server for proper 
model serving with batching and metrics endpoints.&lt;/li&gt;
&lt;li&gt;Add a message queue (Redis or RabbitMQ) between the UI 
and pipeline so multiple users don't block each other.&lt;/li&gt;
&lt;li&gt;Replace the flat logger with structured JSON logs shipped 
to an observability stack (Grafana + Loki).&lt;/li&gt;
&lt;li&gt;Add model versioning — config.yaml currently hardcodes 
model names. A proper MLOps setup uses a model registry.&lt;/li&gt;
&lt;li&gt;Containerize STT locally using a sidecar so the pipeline 
has no external API dependency in production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Model Benchmarking&lt;/h2&gt;

&lt;p&gt;I added a benchmarking tab — set models, prompt, iterations,&lt;br&gt;
get a latency table back.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:1.5b&lt;/td&gt;
&lt;td&gt;~3.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:7b&lt;/td&gt;
&lt;td&gt;~11.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For structured JSON intent extraction, the 1.5b model is&lt;br&gt;
3-4x faster with no meaningful accuracy difference. For a&lt;br&gt;
constrained task like this, bigger isn't better.&lt;/p&gt;
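&lt;p&gt;The benchmark loop boils down to timing a generate callable; the Ollama call mentioned in the docstring is one way to wire it up, not the tab's exact code:&lt;/p&gt;

```python
import statistics
import time

def bench(generate, prompt, iterations=5):
    """Call generate(prompt) repeatedly and return mean latency in
    seconds. In the app, generate would wrap an Ollama call, e.g.
    lambda p: ollama.generate(model="qwen2.5-coder:1.5b", prompt=p)."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```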
&lt;h2&gt;Persistent Memory&lt;/h2&gt;

&lt;p&gt;Every pipeline run is stored in SQLite — transcription,&lt;br&gt;
intent, action, output, and trace. Surfaces in the UI as&lt;br&gt;
a recent runs panel.&lt;/p&gt;

&lt;p&gt;This matters for two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; — if intent classification goes wrong, you
can see exactly what transcription and JSON the LLM
returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt; — every file written has a corresponding
memory entry with the voice command that triggered it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple schema, append-only, no ORM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;EXISTS&lt;/span&gt; &lt;span class="nf"&gt;runs &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;        &lt;span class="n"&gt;INTEGER&lt;/span&gt; &lt;span class="n"&gt;PRIMARY&lt;/span&gt; &lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;     &lt;span class="n"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
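&lt;p&gt;Writing a run into that table is a few lines of stdlib sqlite3 — the helper below is a sketch, not the repo's actual function:&lt;/p&gt;

```python
import datetime
import json
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp  TEXT,
    transcript TEXT,
    intent     TEXT,
    action     TEXT,
    output     TEXT,
    trace      TEXT
)"""

def record_run(conn, transcript, intent, action, output, trace):
    """Append one pipeline run; the trace list is stored as JSON text."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO runs (timestamp, transcript, intent, action, output, trace) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.datetime.utcnow().isoformat(), transcript, intent,
         action, output, json.dumps(trace)),
    )
    conn.commit()
```

&lt;p&gt;Parameterized queries keep voice-derived text from breaking the SQL, and the recent-runs panel is just a SELECT over this table.&lt;/p&gt;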



&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI" rel="noopener noreferrer"&gt;https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/rhGIQvi4Y74" rel="noopener noreferrer"&gt;https://youtu.be/rhGIQvi4Y74&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
