<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dhruv Sood</title>
    <description>The latest articles on Forem by Dhruv Sood (@dhruvsood_565).</description>
    <link>https://forem.com/dhruvsood_565</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877223%2F056acf29-a9a2-48a2-a544-3acb563ed373.png</url>
      <title>Forem: Dhruv Sood</title>
      <link>https://forem.com/dhruvsood_565</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dhruvsood_565"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent (End-to-End)</title>
      <dc:creator>Dhruv Sood</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:27:04 +0000</pubDate>
      <link>https://forem.com/dhruvsood_565/building-a-voice-controlled-local-ai-agent-end-to-end-3k2i</link>
      <guid>https://forem.com/dhruvsood_565/building-a-voice-controlled-local-ai-agent-end-to-end-3k2i</guid>
      <description>&lt;p&gt;I recently built a fully local, voice-controlled AI agent that can listen to audio, understand user intent, and execute real actions like creating files or generating code. Here’s a quick breakdown of how it works and what I learned along the way.&lt;/p&gt;

&lt;p&gt;🧠 Architecture Overview&lt;/p&gt;

&lt;p&gt;The system follows a clean pipeline:&lt;/p&gt;

&lt;p&gt;Audio → Speech-to-Text → Intent Detection → Tool Execution → UI Output&lt;/p&gt;
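
&lt;p&gt;In code, that pipeline is little more than function composition. Here’s a minimal skeleton of the idea (the function names are illustrative, not the project’s exact signatures); each stage is fleshed out below:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def transcribe(audio_path):
    ...  # Speech-to-Text stage (Faster-Whisper)

def detect_intent(text):
    ...  # Intent Detection stage (phi3 via Ollama)

def execute_tool(intent, text):
    ...  # Tool Execution stage (files, code, summaries, chat)

def handle_request(audio_path):
    # Each stage feeds the next; the Flask UI renders the result.
    text = transcribe(audio_path)
    intent = detect_intent(text)
    return execute_tool(intent, text)
&lt;/code&gt;&lt;/pre&gt;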

&lt;p&gt;Speech-to-Text (STT): I used Faster-Whisper running locally for accurate and fast transcription.&lt;br&gt;
Intent Detection: A lightweight LLM (phi3:latest via Ollama) classifies what the user wants.&lt;br&gt;
Tool Execution: Based on intent, the system triggers actions like:&lt;br&gt;
Creating files&lt;br&gt;
Writing code&lt;br&gt;
Summarizing text&lt;br&gt;
General chat&lt;br&gt;
Frontend: Built with Flask + simple UI to visualize each stage of the pipeline.&lt;br&gt;
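
&lt;p&gt;For the STT stage, a minimal Faster-Whisper call looks like this (the model size and compute type here are illustrative CPU-friendly choices, not necessarily the exact settings I shipped):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from faster_whisper import WhisperModel

# "base" with int8 quantization keeps CPU transcription fast enough.
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(audio_path):
    segments, _info = model.transcribe(audio_path)
    # segments is a generator of Segment objects with a .text field.
    return " ".join(seg.text.strip() for seg in segments)
&lt;/code&gt;&lt;/pre&gt;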

&lt;p&gt;🤖 Why These Models?&lt;/p&gt;

&lt;p&gt;🎤 Faster-Whisper&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works locally (no API dependency)&lt;/li&gt;
&lt;li&gt;Handles multiple audio formats&lt;/li&gt;
&lt;li&gt;Good balance of speed and accuracy on CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🧠 Phi-3 (via Ollama)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight (~2–4GB runtime)&lt;/li&gt;
&lt;li&gt;Fast inference → avoids timeouts&lt;/li&gt;
&lt;li&gt;Reliable for structured outputs (JSON)&lt;/li&gt;
&lt;/ul&gt;
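
&lt;p&gt;That structured-output reliability is what makes Phi-3 usable as a classifier. A sketch of the intent call against Ollama’s local HTTP API (the prompt wording and label set are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import requests

INTENTS = ["create_file", "write_code", "summarize", "chat"]

def detect_intent_llm(text):
    prompt = (
        f"Classify the request into one of {INTENTS}. "
        'Reply as JSON, e.g. {"intent": "chat"}.\n'
        f"Request: {text}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3:latest", "prompt": prompt,
              "format": "json", "stream": False},
        timeout=30,
    )
    data = json.loads(resp.json()["response"])  # model output is JSON text
    return data.get("intent", "chat")
&lt;/code&gt;&lt;/pre&gt;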

&lt;p&gt;Initially, I tried larger models like Qwen, but they caused latency issues and frequent timeouts on my hardware. Switching to Phi-3 made the system much more responsive.&lt;/p&gt;

&lt;p&gt;⚙️ Key Features&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports both mic input and file upload&lt;/li&gt;
&lt;li&gt;Multi-intent handling (e.g., “create a file and write code”)&lt;/li&gt;
&lt;li&gt;Safety sandbox (all writes confined to an output/ folder; see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Chat history memory&lt;/li&gt;
&lt;li&gt;Streaming responses for better UX&lt;/li&gt;
&lt;/ul&gt;
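
&lt;p&gt;The sandbox deserves a quick illustration: every write the agent performs is confined to the output/ folder. A sketch of the path check (a hypothetical helper, not the project’s exact code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(filename):
    # Resolve the requested name, then make sure the sandbox is still
    # an ancestor; this blocks tricks like "../../etc/passwd".
    target = (SANDBOX / filename).resolve()
    if SANDBOX not in target.parents:
        raise ValueError(f"refusing to write outside output/: {filename}")
    return target
&lt;/code&gt;&lt;/pre&gt;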

&lt;p&gt;😵 Challenges I Faced&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;❌ Model Timeouts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Large models were too slow for real-time interaction. Requests would simply hang or return empty responses.&lt;/p&gt;

&lt;p&gt;👉 Fix: Switched to a smaller model and reduced token generation.&lt;/p&gt;
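
&lt;p&gt;Concretely, the fix had two halves: cap how many tokens Ollama may generate, and cap how long the client waits. Roughly (the values are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3:latest",
        "prompt": "Classify: create a file named notes.txt",
        "stream": False,
        # num_predict bounds generation length, so replies stay short.
        "options": {"num_predict": 256},
    },
    timeout=60,  # fail fast instead of hanging on a slow model
)
resp.raise_for_status()
print(resp.json()["response"])
&lt;/code&gt;&lt;/pre&gt;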

&lt;ol start="2"&gt;
&lt;li&gt;🎤 Speech-to-Text Errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whisper sometimes misheard:&lt;/p&gt;

&lt;p&gt;hello.py → hello.5&lt;br&gt;
dot py → .5&lt;/p&gt;

&lt;p&gt;👉 Fix: Added preprocessing rules to normalize filenames.&lt;/p&gt;
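
&lt;p&gt;The preprocessing is a small rule table plus regexes, roughly like this (the rules shown are a sample of the idea, not the full set):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

def normalize_filename(text):
    # "hello dot py" becomes "hello.py"
    text = re.sub(r"\s+dot\s+(\w+)", r".\1", text)
    # Whisper often turns "py" into "5" after a dot; undo known confusions.
    text = re.sub(r"\.5\b", ".py", text)
    return text

print(normalize_filename("create hello dot py"))  # create hello.py
print(normalize_filename("create hello.5"))       # create hello.py
&lt;/code&gt;&lt;/pre&gt;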

&lt;ol start="3"&gt;
&lt;li&gt;🧠 Incorrect Intent Detection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model often classified everything as “chat,” even when the user clearly wanted to create a file.&lt;/p&gt;

&lt;p&gt;👉 Fix: Added rule-based overrides (hybrid system = rules + LLM).&lt;/p&gt;
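
&lt;p&gt;The hybrid is simple: cheap keyword rules get first say, and the LLM only classifies what the rules don’t catch (detect_intent_llm being the Ollama call sketched earlier). Roughly:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RULES = [
    (("create", "file"), "create_file"),
    (("write", "code"), "write_code"),
    (("summarize",), "summarize"),
]

def detect_intent(text):
    lowered = text.lower()
    for keywords, intent in RULES:
        # Rule-based override: if every keyword appears, trust the rule.
        if all(word in lowered for word in keywords):
            return intent
    return detect_intent_llm(text)  # fall back to the LLM classifier
&lt;/code&gt;&lt;/pre&gt;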

&lt;ol start="4"&gt;
&lt;li&gt;🔄 Streaming Bugs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Enabling streaming broke responses because I was still parsing them like normal JSON.&lt;/p&gt;

&lt;p&gt;👉 Fix: Switched to chunk-based parsing for streaming responses.&lt;/p&gt;
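
&lt;p&gt;With "stream": true, Ollama sends one JSON object per line instead of a single document, so the client has to parse line by line. A sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import requests

def stream_response(prompt):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3:latest", "prompt": prompt, "stream": True},
        stream=True,  # let requests hand back the body incrementally
        timeout=60,
    )
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # each line is a complete JSON object
        yield chunk.get("response", "")
        if chunk.get("done"):
            break

print("".join(stream_response("Say hello")))
&lt;/code&gt;&lt;/pre&gt;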

&lt;p&gt;What I Learned&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller, faster models are often better for real-time systems.&lt;/li&gt;
&lt;li&gt;LLMs alone are not reliable for control logic; rules are essential.&lt;/li&gt;
&lt;li&gt;Preprocessing (especially for speech input) is critical.&lt;/li&gt;
&lt;li&gt;Good UX (like streaming) makes a huge difference in perception.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;This project taught me how to build a practical AI system: not just a model, but a full pipeline that works reliably in real-world conditions.&lt;/p&gt;

&lt;p&gt;If I were to extend this further, I’d add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time voice streaming&lt;/li&gt;
&lt;li&gt;Persistent memory (vector DB)&lt;/li&gt;
&lt;li&gt;Better UI with live token streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading! 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
