<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: rautaditya2606</title>
    <description>The latest articles on Forem by rautaditya2606 (@rautaditya2606).</description>
    <link>https://forem.com/rautaditya2606</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2747140%2F696730cd-b32e-4fc6-ad1b-d6c8e7bf9df7.png</url>
      <title>Forem: rautaditya2606</title>
      <link>https://forem.com/rautaditya2606</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rautaditya2606"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent on a 4GB GPU</title>
      <dc:creator>rautaditya2606</dc:creator>
      <pubDate>Sun, 12 Apr 2026 20:57:55 +0000</pubDate>
      <link>https://forem.com/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</link>
      <guid>https://forem.com/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</guid>
      <description>&lt;p&gt;&lt;strong&gt;What I Built&lt;/strong&gt;&lt;br&gt;
I built a voice-controlled local AI agent that transcribes &lt;br&gt;
audio, classifies intent, and executes local tools — all &lt;br&gt;
visible through a transparent pipeline trace in a Gradio UI.&lt;br&gt;
The agent supports four intents: create file, write code, &lt;br&gt;
summarize text, and general chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
STT layer: Groq Whisper-large-v3 handles transcription via API.&lt;br&gt;
I chose Groq over local Whisper because my RTX 3050 (4GB VRAM) &lt;br&gt;
cannot run STT and an LLM simultaneously without OOM errors. &lt;br&gt;
Groq's API is actually faster (~300ms) than local whisper-small &lt;br&gt;
would have been.&lt;/p&gt;

&lt;p&gt;Intent layer: Ollama serves qwen2.5-coder:1.5b locally. The LLM &lt;br&gt;
returns a structured JSON intent that the tool router uses to &lt;br&gt;
decide which action to take.&lt;/p&gt;
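&lt;p&gt;A minimal sketch of that intent contract — the prompt wording, field names, and helper below are illustrative, not verbatim from the repo:&lt;/p&gt;

```python
import json

# The four intents the router understands (per the post); field names
# like "argument" are an assumption for illustration.
INTENTS = ("create_file", "write_code", "summarize", "general_chat")

SYSTEM_PROMPT = (
    "Classify the user's request. Reply with JSON only: "
    '{"intent": one of ' + ", ".join(INTENTS) + ', "argument": string}'
)

def valid_intent(reply_text):
    """Parse the LLM reply and verify it matches the intent contract."""
    data = json.loads(reply_text)
    if data.get("intent") not in INTENTS:
        raise ValueError("unknown intent: " + str(data.get("intent")))
    return data
```

&lt;p&gt;With the Ollama Python client, the model's reply would be passed through valid_intent before the router acts on it.&lt;/p&gt;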

&lt;p&gt;Tool layer: Four tools — create_file, write_code, summarize, &lt;br&gt;
general_chat. All file writes are sandboxed to output/.&lt;/p&gt;

&lt;p&gt;UI layer: Gradio displays transcription, detected intent, action &lt;br&gt;
taken, and a full pipeline trace with per-stage latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware Constraints and Decisions&lt;/strong&gt; &lt;br&gt;
My machine: Intel i5-12500H, RTX 3050 (4GB VRAM), 15GB RAM.&lt;/p&gt;

&lt;p&gt;The core constraint: 4GB VRAM cannot hold both a Whisper model &lt;br&gt;
and an LLM simultaneously.&lt;/p&gt;

&lt;p&gt;Decision 1 — STT via Groq API&lt;br&gt;
Running whisper-small locally uses ~1.5GB VRAM. That leaves &lt;br&gt;
only 2.5GB for the LLM, which isn't enough for a useful model. &lt;br&gt;
Offloading STT to Groq frees the entire 4GB for the LLM and &lt;br&gt;
actually improves latency.&lt;/p&gt;

&lt;p&gt;Decision 2 — qwen2.5-coder:1.5b via Ollama&lt;br&gt;
A 1.5B model at Q4 quantization fits comfortably in ~1.5GB VRAM.&lt;br&gt;
I initially tried the 7b variant but it exceeded available VRAM &lt;br&gt;
and caused Ollama to offload to RAM, significantly slowing &lt;br&gt;
inference.&lt;/p&gt;

&lt;p&gt;Decision 3 — Sequential pipeline&lt;br&gt;
STT completes before Ollama is called. This keeps peak VRAM &lt;br&gt;
usage under 2GB at any given time.&lt;/p&gt;
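&lt;p&gt;The sequential flow reduces to a three-stage runner with per-stage timing for the trace. The stage callables here are stand-ins, not the project's real call signatures:&lt;/p&gt;

```python
import time

def run_pipeline(audio, stt, classify, route):
    """Run STT, intent classification, and tool routing strictly in
    sequence, so no two model stages hold memory at the same time.
    Returns the final output plus a (stage, seconds) trace."""
    trace = []

    def stage(name, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        trace.append((name, time.perf_counter() - start))
        return result

    transcript = stage("stt", stt, audio)        # e.g. Groq Whisper call
    intent = stage("intent", classify, transcript)  # e.g. Ollama call
    output = stage("tool", route, intent)        # local tool execution
    return output, trace
```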

&lt;p&gt;*&lt;em&gt;Challenges I Faced *&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;VRAM management&lt;br&gt;
Loading two models simultaneously caused OOM errors. Solved &lt;br&gt;
by switching STT to Groq and keeping only the LLM local.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intent JSON parsing&lt;br&gt;
Ollama sometimes returns malformed JSON or wraps it in &lt;br&gt;
markdown code fences. Solved with a robust parser that &lt;br&gt;
strips fences and falls back to keyword matching if JSON &lt;br&gt;
parsing fails entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output sandboxing&lt;br&gt;
Naive file creation allowed path traversal (e.g. &lt;br&gt;
../../etc/passwd). Solved with path normalization and &lt;br&gt;
checking that the resolved path starts with the output/ &lt;br&gt;
directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gradio mic input format&lt;br&gt;
Gradio returns audio as a tuple (sample_rate, numpy_array) &lt;br&gt;
not a file path. Had to write it to a temp file before &lt;br&gt;
passing to Groq API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
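&lt;p&gt;The fence-stripping parser from challenge 2 can be sketched like this — the helper name and keyword lists are illustrative assumptions, not the repo's actual code:&lt;/p&gt;

```python
import json
import re

# Hypothetical keyword fallback table for when JSON parsing fails.
FALLBACK_KEYWORDS = {
    "create_file": ("create", "file"),
    "write_code": ("write", "code"),
    "summarize": ("summarize", "summary"),
}

def parse_intent(raw):
    """Extract an intent from an LLM reply that may be wrapped in
    markdown code fences or be malformed JSON."""
    text = raw.strip()
    # Strip leading ```json / trailing ``` fences if present
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "intent" in data:
            return data["intent"]
    except json.JSONDecodeError:
        pass
    # Keyword fallback when JSON parsing fails entirely
    lowered = raw.lower()
    for intent, words in FALLBACK_KEYWORDS.items():
        if any(w in lowered for w in words):
            return intent
    return "general_chat"
```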
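&lt;p&gt;And a minimal version of the challenge 3 sandbox check, assuming Python 3.9+ for Path.is_relative_to:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(requested_name):
    """Resolve the requested file path and refuse anything that
    escapes the output/ sandbox after ".." and symlink resolution."""
    candidate = (OUTPUT_DIR / requested_name).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError("path escapes the output/ sandbox")
    return candidate
```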

&lt;p&gt;&lt;strong&gt;What I'd Do Differently at Scale&lt;/strong&gt;&lt;br&gt;
For a production version of this system, I would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace Ollama with Triton Inference Server for proper 
model serving with batching and metrics endpoints.&lt;/li&gt;
&lt;li&gt;Add a message queue (Redis or RabbitMQ) between the UI 
and pipeline so multiple users don't block each other.&lt;/li&gt;
&lt;li&gt;Replace the flat logger with structured JSON logs shipped 
to an observability stack (Grafana + Loki).&lt;/li&gt;
&lt;li&gt;Add model versioning — config.yaml currently hardcodes 
model names. A proper MLOps setup uses a model registry.&lt;/li&gt;
&lt;li&gt;Containerize STT locally using a sidecar so the pipeline 
has no external API dependency in production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Model Benchmarking&lt;/h2&gt;

&lt;p&gt;I added a benchmarking tab — set models, prompt, iterations,&lt;br&gt;
get a latency table back.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:1.5b&lt;/td&gt;
&lt;td&gt;~3.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:7b&lt;/td&gt;
&lt;td&gt;~11.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For structured JSON intent extraction, the 1.5b model is&lt;br&gt;
3-4x faster with no meaningful accuracy difference. For a&lt;br&gt;
constrained task like this, bigger isn't better.&lt;/p&gt;
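&lt;p&gt;The benchmark loop boils down to timing a generate callable; the Ollama call mentioned in the docstring is one way to wire it up, not the tab's exact code:&lt;/p&gt;

```python
import statistics
import time

def bench(generate, prompt, iterations=5):
    """Call generate(prompt) repeatedly and return mean latency in
    seconds. In the app, generate would wrap an Ollama call, e.g.
    lambda p: ollama.generate(model="qwen2.5-coder:1.5b", prompt=p)."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```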
&lt;h2&gt;Persistent Memory&lt;/h2&gt;

&lt;p&gt;Every pipeline run is stored in SQLite — transcription,&lt;br&gt;
intent, action, output, and trace. Surfaces in the UI as&lt;br&gt;
a recent runs panel.&lt;/p&gt;

&lt;p&gt;This matters for two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; — if intent classification goes wrong, you
can see exactly what transcription and JSON the LLM
returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt; — every file written has a corresponding
memory entry with the voice command that triggered it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple schema, append-only, no ORM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;EXISTS&lt;/span&gt; &lt;span class="nf"&gt;runs &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;        &lt;span class="n"&gt;INTEGER&lt;/span&gt; &lt;span class="n"&gt;PRIMARY&lt;/span&gt; &lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;     &lt;span class="n"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
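&lt;p&gt;Writing a run into that table is a few lines of stdlib sqlite3 — the helper below is a sketch, not the repo's actual function:&lt;/p&gt;

```python
import datetime
import json
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp  TEXT,
    transcript TEXT,
    intent     TEXT,
    action     TEXT,
    output     TEXT,
    trace      TEXT
)"""

def record_run(conn, transcript, intent, action, output, trace):
    """Append one pipeline run; the trace list is stored as JSON text."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO runs (timestamp, transcript, intent, action, output, trace) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.datetime.utcnow().isoformat(), transcript, intent,
         action, output, json.dumps(trace)),
    )
    conn.commit()
```

&lt;p&gt;Parameterized queries keep voice-derived text from breaking the SQL, and the recent-runs panel is just a SELECT over this table.&lt;/p&gt;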



&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI" rel="noopener noreferrer"&gt;https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/rhGIQvi4Y74" rel="noopener noreferrer"&gt;https://youtu.be/rhGIQvi4Y74&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
