<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kaustubh Gole</title>
    <description>The latest articles on Forem by Kaustubh Gole (@kaustubh_gole_6).</description>
    <link>https://forem.com/kaustubh_gole_6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873053%2Fe814603b-5ca1-45e4-b4f8-4c2e11a61899.png</url>
      <title>Forem: Kaustubh Gole</title>
      <link>https://forem.com/kaustubh_gole_6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kaustubh_gole_6"/>
    <language>en</language>
    <item>
      <title>Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools</title>
      <dc:creator>Kaustubh Gole</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:04:58 +0000</pubDate>
      <link>https://forem.com/kaustubh_gole_6/voice-controlled-local-ai-agent-with-whisper-ollama-and-safe-local-tools-2bf7</link>
      <guid>https://forem.com/kaustubh_gole_6/voice-controlled-local-ai-agent-with-whisper-ollama-and-safe-local-tools-2bf7</guid>
<description>&lt;h1&gt;Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools&lt;/h1&gt;

&lt;p&gt;I built a voice-controlled local AI agent that accepts direct microphone input, transcribes speech, detects intent, and executes safe local actions inside a sandboxed output folder.&lt;/p&gt;

&lt;p&gt;This project was designed as a local-first demo, but I also focused on making it practical in real-world conditions. That meant adding fallback behavior, transparent pipeline visibility, and guardrails around file operations so the assistant stays useful without becoming risky.&lt;/p&gt;

&lt;h2&gt;What the project does&lt;/h2&gt;

&lt;p&gt;The app follows a simple but effective pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microphone input -&amp;gt; Speech-to-text -&amp;gt; Intent detection -&amp;gt; Tool execution -&amp;gt; Final output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It supports a few core actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a file&lt;/li&gt;
&lt;li&gt;Write code to a file&lt;/li&gt;
&lt;li&gt;Summarize text&lt;/li&gt;
&lt;li&gt;General chat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire flow is displayed in the UI so you can see exactly what the system heard, what it understood, and what action it took.&lt;/p&gt;
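&lt;p&gt;To make the flow concrete, here is a minimal Python sketch of that pipeline, with stub stages standing in for the real Whisper and Ollama calls. All names here are illustrative, not the project's actual API; the point is that every stage's result is kept so the UI can show the full trace.&lt;/p&gt;

```python
# Minimal sketch of the mic -> STT -> intent -> tool pipeline.
# Function names and bodies are illustrative placeholders.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: the real implementation calls Whisper here.
    return "create a file called notes.txt"

def detect_intent(transcript: str) -> str:
    # Placeholder: the real implementation asks a local LLM.
    return "create_file" if "create a file" in transcript else "general_chat"

def run_tool(intent: str, transcript: str) -> str:
    if intent == "create_file":
        return "created notes.txt in output/"
    return "chat: " + transcript

def run_pipeline(audio_bytes: bytes) -> dict:
    # Each intermediate result is returned so the UI can display
    # what was heard, what was understood, and what was done.
    transcript = transcribe(audio_bytes)
    intent = detect_intent(transcript)
    result = run_tool(intent, transcript)
    return {"transcript": transcript, "intent": intent, "result": result}
```

&lt;p&gt;Returning the whole trace as a dictionary, rather than just the final answer, is what makes the pipeline transparent in the UI.&lt;/p&gt;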

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;The project uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for the user interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper&lt;/strong&gt; for speech-to-text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local LLM-based intent reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for orchestration and local tool execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;requests&lt;/strong&gt; for API fallback transcription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxed file handling&lt;/strong&gt; inside an &lt;code&gt;output/&lt;/code&gt; directory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Architecture overview&lt;/h2&gt;

&lt;p&gt;The code is organized into small modules so each part of the pipeline stays focused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;app.py&lt;/code&gt; handles the Streamlit UI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stt.py&lt;/code&gt; handles transcription&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;intents.py&lt;/code&gt; detects what the user wants&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools.py&lt;/code&gt; performs safe local actions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pipeline.py&lt;/code&gt; connects everything together&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;config.py&lt;/code&gt; stores runtime settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That structure makes the application easier to debug and easier to extend later.&lt;/p&gt;

&lt;h3&gt;1. Audio input&lt;/h3&gt;

&lt;p&gt;The UI accepts direct microphone input using Streamlit’s audio component. I also kept file upload support and a manual text rerun option so the app remains usable if speech recognition is noisy.&lt;/p&gt;

&lt;h3&gt;2. Speech-to-text&lt;/h3&gt;

&lt;p&gt;The default transcription path uses a local Whisper model through HuggingFace Transformers.&lt;/p&gt;

&lt;p&gt;If local STT fails due to environment issues, the app can fall back to an API-based transcription path. That fallback is helpful on weaker machines or when local dependencies are not fully available.&lt;/p&gt;
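&lt;p&gt;That fallback strategy can be sketched generically: try the local transcriber first, and if it raises, degrade to the API path. In this sketch both transcribers are passed in as callables; the names and return shape are assumptions, not the project's exact code.&lt;/p&gt;

```python
# Hedged sketch of the STT fallback: prefer the local Whisper path,
# fall back to an API-based transcriber if the local one raises.

def transcribe_with_fallback(audio_path, local_stt, api_stt):
    try:
        return {"text": local_stt(audio_path), "backend": "local"}
    except Exception:
        # Local dependencies (ffmpeg, model weights) may be missing;
        # degrade gracefully to the remote path instead of failing hard.
        return {"text": api_stt(audio_path), "backend": "api"}
```

&lt;p&gt;Tagging the result with which backend produced it keeps the fallback visible in the UI instead of silent.&lt;/p&gt;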

&lt;h3&gt;3. Intent detection&lt;/h3&gt;

&lt;p&gt;Once the transcript is available, the app sends it to a local Ollama model to classify intent.&lt;/p&gt;

&lt;p&gt;Supported intents include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize_text&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model is unavailable, the app uses a keyword-based fallback parser so the pipeline still works.&lt;/p&gt;
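&lt;p&gt;A keyword fallback of that kind can be sketched in a few lines. The phrase lists below are illustrative, not the project's actual rules; note that more specific intents are checked before more general ones.&lt;/p&gt;

```python
# Sketch of a keyword-based fallback intent parser, used when the
# local LLM is unavailable. Phrase lists are illustrative only.

FALLBACK_KEYWORDS = {
    "write_code": ["write code", "write a script", "generate code"],
    "create_file": ["create a file", "make a file", "new file"],
    "summarize_text": ["summarize", "summary", "tl;dr"],
}

def fallback_intent(transcript: str) -> str:
    text = transcript.lower()
    # Dict order matters: check the more specific intents first,
    # and fall through to general chat when nothing matches.
    for intent, phrases in FALLBACK_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "general_chat"
```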

&lt;h3&gt;4. Tool execution&lt;/h3&gt;

&lt;p&gt;After intent detection, the pipeline routes to the correct tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;create_file()&lt;/code&gt; creates a safe empty file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write_code_file()&lt;/code&gt; generates and writes code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;summarize_text()&lt;/code&gt; returns a concise summary&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;general_chat()&lt;/code&gt; handles general conversational output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file-related actions are restricted to the &lt;code&gt;output/&lt;/code&gt; folder, which acts as a sandbox.&lt;/p&gt;
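&lt;p&gt;The routing step itself can be sketched as a dispatch table that maps intent names to tool functions. The bodies below are placeholders, not the project's real implementations; the useful pattern is that an unknown intent falls through to chat instead of erroring.&lt;/p&gt;

```python
# Sketch of intent-to-tool routing via a dispatch table.
# Tool bodies are placeholders for the real sandboxed implementations.

def create_file(args):
    return "created " + args.get("filename", "untitled.txt")

def write_code_file(args):
    return "wrote code"

def summarize_text(args):
    return "summary"

def general_chat(args):
    return "chat reply"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code_file,
    "summarize_text": summarize_text,
    "general_chat": general_chat,
}

def dispatch(intent: str, args: dict) -> str:
    # Unknown intents fall through to general chat instead of raising.
    tool = TOOLS.get(intent, general_chat)
    return tool(args)
```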

&lt;h2&gt;Safety guardrails&lt;/h2&gt;

&lt;p&gt;One of the most important design decisions was limiting file operations to a safe local directory.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No arbitrary path writes&lt;/li&gt;
&lt;li&gt;Filenames are sanitized&lt;/li&gt;
&lt;li&gt;File extensions are restricted&lt;/li&gt;
&lt;li&gt;Generated files stay inside &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the assistant much safer for demo and assignment use.&lt;/p&gt;
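&lt;p&gt;Here is a minimal sketch of how those guardrails can be enforced with the standard library, assuming an illustrative extension whitelist; the project's exact sanitization rules may differ.&lt;/p&gt;

```python
# Sketch of the sandboxing idea: sanitize the filename, restrict the
# extension, and refuse any path that resolves outside output/.
import re
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()
ALLOWED_EXTENSIONS = {".txt", ".md", ".py"}  # illustrative whitelist

def safe_output_path(filename: str) -> Path:
    # Keep only the base name, dropping any directory components,
    # and replace characters outside a safe set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", Path(filename).name)
    path = (OUTPUT_DIR / name).resolve()
    if path.suffix not in ALLOWED_EXTENSIONS:
        raise ValueError("extension not allowed: " + path.suffix)
    if OUTPUT_DIR not in path.parents:
        # Defense in depth: even after sanitization, verify the
        # resolved path still lives inside the sandbox.
        raise ValueError("path escapes the sandbox")
    return path
```

&lt;p&gt;With this in place, even a traversal attempt like &lt;code&gt;../../etc/passwd.txt&lt;/code&gt; collapses to a plain file inside &lt;code&gt;output/&lt;/code&gt;.&lt;/p&gt;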

&lt;h2&gt;Challenges I ran into&lt;/h2&gt;

&lt;h3&gt;Local STT dependencies&lt;/h3&gt;

&lt;p&gt;Speech-to-text on local machines can be fragile, especially when audio decoding libraries like &lt;code&gt;ffmpeg&lt;/code&gt; are missing.&lt;/p&gt;

&lt;p&gt;To reduce that friction, I added error handling and a fallback path for WAV files.&lt;/p&gt;

&lt;h3&gt;Local model availability&lt;/h3&gt;

&lt;p&gt;Local LLMs are useful, but they can fail if Ollama is not running or if the configured model is unavailable.&lt;/p&gt;

&lt;p&gt;To handle that, the app shows runtime diagnostics and falls back to simpler behavior when needed.&lt;/p&gt;
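&lt;p&gt;A lightweight diagnostic of that kind can be written with the standard library alone. This sketch probes Ollama's &lt;code&gt;/api/tags&lt;/code&gt; endpoint, which lists installed models on a standard install; the base URL and timeout here are assumptions.&lt;/p&gt;

```python
# Sketch of a runtime diagnostic: check whether an Ollama server is
# reachable before sending intent-classification requests.
import json
import urllib.request
import urllib.error

def ollama_available(base_url="http://localhost:11434", timeout=2.0):
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means the server answered sensibly
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

&lt;p&gt;Running this once at startup lets the UI show clearly whether the LLM path or the keyword fallback will be used.&lt;/p&gt;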

&lt;h3&gt;Noisy transcription&lt;/h3&gt;

&lt;p&gt;Speech recognition is not always perfect, especially with background noise or accents.&lt;/p&gt;

&lt;p&gt;To make the workflow more forgiving, I added a manual transcript edit box so the user can correct the text and rerun the intent and tool pipeline.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;This project reinforced a few important lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A good AI assistant is not just about the model&lt;/li&gt;
&lt;li&gt;Fallbacks matter as much as the primary path&lt;/li&gt;
&lt;li&gt;Transparency improves trust&lt;/li&gt;
&lt;li&gt;Safety constraints should be built in from the beginning&lt;/li&gt;
&lt;li&gt;A simple modular architecture makes debugging much easier&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Demo flow&lt;/h2&gt;

&lt;p&gt;For the video demo, I plan to show:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Direct microphone input&lt;/li&gt;
&lt;li&gt;Transcript generation&lt;/li&gt;
&lt;li&gt;Intent detection&lt;/li&gt;
&lt;li&gt;File creation or code generation&lt;/li&gt;
&lt;li&gt;Final output inside the app&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives a clean end-to-end view of how the system behaves.&lt;/p&gt;

&lt;h2&gt;Future improvements&lt;/h2&gt;

&lt;p&gt;A few enhancements I would add next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better support for compound commands&lt;/li&gt;
&lt;li&gt;Confirmation prompts before tool execution&lt;/li&gt;
&lt;li&gt;More tools, like search or note-taking&lt;/li&gt;
&lt;li&gt;Memory for multi-turn workflows&lt;/li&gt;
&lt;li&gt;Improved structured intent schemas with confidence scores&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project was a practical exercise in building a local-first voice assistant that is usable, safe, and transparent.&lt;/p&gt;

&lt;p&gt;Instead of aiming for a flashy demo with a single model call, I focused on the full pipeline: audio input, transcription, intent detection, tool routing, safe execution, and clear UI feedback.&lt;/p&gt;

&lt;p&gt;That combination made the system feel much more realistic and much easier to trust.&lt;/p&gt;

&lt;p&gt;If you want to try a similar build, start small, keep the architecture modular, and make failure cases visible from day one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>streamlit</category>
      <category>ollama</category>
    </item>
  </channel>
</rss>
