<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kaustubh Gole</title>
    <description>The latest articles on Forem by Kaustubh Gole (@kaustubh_gole_6).</description>
    <link>https://forem.com/kaustubh_gole_6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873053%2Fe814603b-5ca1-45e4-b4f8-4c2e11a61899.png</url>
      <title>Forem: Kaustubh Gole</title>
      <link>https://forem.com/kaustubh_gole_6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kaustubh_gole_6"/>
    <language>en</language>
    <item>
      <title>Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools</title>
      <dc:creator>Kaustubh Gole</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:04:58 +0000</pubDate>
      <link>https://forem.com/kaustubh_gole_6/voice-controlled-local-ai-agent-with-whisper-ollama-and-safe-local-tools-2bf7</link>
      <guid>https://forem.com/kaustubh_gole_6/voice-controlled-local-ai-agent-with-whisper-ollama-and-safe-local-tools-2bf7</guid>
<description>&lt;h1&gt;Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools&lt;/h1&gt;

&lt;p&gt;I built a voice-controlled local AI agent that accepts direct microphone input, transcribes speech, detects intent, and executes safe local actions inside a sandboxed output folder.&lt;/p&gt;

&lt;p&gt;This project was designed as a local-first demo, but I also focused on making it practical in real-world conditions. That meant adding fallback behavior, transparent pipeline visibility, and guardrails around file operations so the assistant stays useful without becoming risky.&lt;/p&gt;

&lt;h2&gt;What the project does&lt;/h2&gt;

&lt;p&gt;The app follows a simple but effective pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microphone input -&amp;gt; Speech-to-text -&amp;gt; Intent detection -&amp;gt; Tool execution -&amp;gt; Final output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It supports a few core actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a file&lt;/li&gt;
&lt;li&gt;Write code to a file&lt;/li&gt;
&lt;li&gt;Summarize text&lt;/li&gt;
&lt;li&gt;General chat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire flow is displayed in the UI so you can see exactly what the system heard, what it understood, and what action it took.&lt;/p&gt;
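&lt;p&gt;To make the flow concrete, here is a minimal Python sketch of that pipeline, with stub stages standing in for the real Whisper and Ollama calls. All names here are illustrative, not the project's actual API; the point is that every stage's result is kept so the UI can show the full trace.&lt;/p&gt;

```python
# Minimal sketch of the mic -> STT -> intent -> tool pipeline.
# Function names and bodies are illustrative placeholders.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: the real implementation calls Whisper here.
    return "create a file called notes.txt"

def detect_intent(transcript: str) -> str:
    # Placeholder: the real implementation asks a local LLM.
    return "create_file" if "create a file" in transcript else "general_chat"

def run_tool(intent: str, transcript: str) -> str:
    if intent == "create_file":
        return "created notes.txt in output/"
    return "chat: " + transcript

def run_pipeline(audio_bytes: bytes) -> dict:
    # Each intermediate result is returned so the UI can display
    # what was heard, what was understood, and what was done.
    transcript = transcribe(audio_bytes)
    intent = detect_intent(transcript)
    result = run_tool(intent, transcript)
    return {"transcript": transcript, "intent": intent, "result": result}
```

&lt;p&gt;Returning the whole trace as a dictionary, rather than just the final answer, is what makes the pipeline transparent in the UI.&lt;/p&gt;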

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;The project uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for the user interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper&lt;/strong&gt; for speech-to-text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local LLM-based intent reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for orchestration and local tool execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;requests&lt;/strong&gt; for API fallback transcription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxed file handling&lt;/strong&gt; inside an &lt;code&gt;output/&lt;/code&gt; directory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Architecture overview&lt;/h2&gt;

&lt;p&gt;The code is organized into small modules so each part of the pipeline stays focused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;app.py&lt;/code&gt; handles the Streamlit UI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stt.py&lt;/code&gt; handles transcription&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;intents.py&lt;/code&gt; detects what the user wants&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools.py&lt;/code&gt; performs safe local actions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pipeline.py&lt;/code&gt; connects everything together&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;config.py&lt;/code&gt; stores runtime settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That structure makes the application easier to debug and easier to extend later.&lt;/p&gt;

&lt;h3&gt;1. Audio input&lt;/h3&gt;

&lt;p&gt;The UI accepts direct microphone input using Streamlit’s audio component. I also kept file upload support and a manual text rerun option so the app remains usable if speech recognition is noisy.&lt;/p&gt;

&lt;h3&gt;2. Speech-to-text&lt;/h3&gt;

&lt;p&gt;The default transcription path uses a local Whisper model through HuggingFace Transformers.&lt;/p&gt;

&lt;p&gt;If local STT fails due to environment issues, the app can fall back to an API-based transcription path. That fallback is helpful on weaker machines or when local dependencies are not fully available.&lt;/p&gt;
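&lt;p&gt;That fallback strategy can be sketched generically: try the local transcriber first, and if it raises, degrade to the API path. In this sketch both transcribers are passed in as callables; the names and return shape are assumptions, not the project's exact code.&lt;/p&gt;

```python
# Hedged sketch of the STT fallback: prefer the local Whisper path,
# fall back to an API-based transcriber if the local one raises.

def transcribe_with_fallback(audio_path, local_stt, api_stt):
    try:
        return {"text": local_stt(audio_path), "backend": "local"}
    except Exception:
        # Local dependencies (ffmpeg, model weights) may be missing;
        # degrade gracefully to the remote path instead of failing hard.
        return {"text": api_stt(audio_path), "backend": "api"}
```

&lt;p&gt;Tagging the result with which backend produced it keeps the fallback visible in the UI instead of silent.&lt;/p&gt;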

&lt;h3&gt;3. Intent detection&lt;/h3&gt;

&lt;p&gt;Once the transcript is available, the app sends it to a local Ollama model to classify intent.&lt;/p&gt;

&lt;p&gt;Supported intents include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize_text&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model is unavailable, the app uses a keyword-based fallback parser so the pipeline still works.&lt;/p&gt;
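&lt;p&gt;A keyword fallback of that kind can be sketched in a few lines. The phrase lists below are illustrative, not the project's actual rules; note that more specific intents are checked before more general ones.&lt;/p&gt;

```python
# Sketch of a keyword-based fallback intent parser, used when the
# local LLM is unavailable. Phrase lists are illustrative only.

FALLBACK_KEYWORDS = {
    "write_code": ["write code", "write a script", "generate code"],
    "create_file": ["create a file", "make a file", "new file"],
    "summarize_text": ["summarize", "summary", "tl;dr"],
}

def fallback_intent(transcript: str) -> str:
    text = transcript.lower()
    # Dict order matters: check the more specific intents first,
    # and fall through to general chat when nothing matches.
    for intent, phrases in FALLBACK_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "general_chat"
```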

&lt;h3&gt;4. Tool execution&lt;/h3&gt;

&lt;p&gt;After intent detection, the pipeline routes to the correct tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;create_file()&lt;/code&gt; creates a safe empty file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write_code_file()&lt;/code&gt; generates and writes code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;summarize_text()&lt;/code&gt; returns a concise summary&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;general_chat()&lt;/code&gt; handles general conversational output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file-related actions are restricted to the &lt;code&gt;output/&lt;/code&gt; folder, which acts as a sandbox.&lt;/p&gt;
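&lt;p&gt;The routing step itself can be sketched as a dispatch table that maps intent names to tool functions. The bodies below are placeholders, not the project's real implementations; the useful pattern is that an unknown intent falls through to chat instead of erroring.&lt;/p&gt;

```python
# Sketch of intent-to-tool routing via a dispatch table.
# Tool bodies are placeholders for the real sandboxed implementations.

def create_file(args):
    return "created " + args.get("filename", "untitled.txt")

def write_code_file(args):
    return "wrote code"

def summarize_text(args):
    return "summary"

def general_chat(args):
    return "chat reply"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code_file,
    "summarize_text": summarize_text,
    "general_chat": general_chat,
}

def dispatch(intent: str, args: dict) -> str:
    # Unknown intents fall through to general chat instead of raising.
    tool = TOOLS.get(intent, general_chat)
    return tool(args)
```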

&lt;h2&gt;Safety guardrails&lt;/h2&gt;

&lt;p&gt;One of the most important design decisions was limiting file operations to a safe local directory.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No arbitrary path writes&lt;/li&gt;
&lt;li&gt;Filenames are sanitized&lt;/li&gt;
&lt;li&gt;File extensions are restricted&lt;/li&gt;
&lt;li&gt;Generated files stay inside &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the assistant much safer for demo and assignment use.&lt;/p&gt;
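&lt;p&gt;Here is a minimal sketch of how those guardrails can be enforced with the standard library, assuming an illustrative extension whitelist; the project's exact sanitization rules may differ.&lt;/p&gt;

```python
# Sketch of the sandboxing idea: sanitize the filename, restrict the
# extension, and refuse any path that resolves outside output/.
import re
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()
ALLOWED_EXTENSIONS = {".txt", ".md", ".py"}  # illustrative whitelist

def safe_output_path(filename: str) -> Path:
    # Keep only the base name, dropping any directory components,
    # and replace characters outside a safe set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", Path(filename).name)
    path = (OUTPUT_DIR / name).resolve()
    if path.suffix not in ALLOWED_EXTENSIONS:
        raise ValueError("extension not allowed: " + path.suffix)
    if OUTPUT_DIR not in path.parents:
        # Defense in depth: even after sanitization, verify the
        # resolved path still lives inside the sandbox.
        raise ValueError("path escapes the sandbox")
    return path
```

&lt;p&gt;With this in place, even a traversal attempt like &lt;code&gt;../../etc/passwd.txt&lt;/code&gt; collapses to a plain file inside &lt;code&gt;output/&lt;/code&gt;.&lt;/p&gt;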

&lt;h2&gt;Challenges I ran into&lt;/h2&gt;

&lt;h3&gt;Local STT dependencies&lt;/h3&gt;

&lt;p&gt;Speech-to-text on local machines can be fragile, especially when audio decoding libraries like &lt;code&gt;ffmpeg&lt;/code&gt; are missing.&lt;/p&gt;

&lt;p&gt;To reduce that friction, I added error handling and a fallback path for WAV files.&lt;/p&gt;

&lt;h3&gt;Local model availability&lt;/h3&gt;

&lt;p&gt;Local LLMs are useful, but they can fail if Ollama is not running or if the configured model is unavailable.&lt;/p&gt;

&lt;p&gt;To handle that, the app shows runtime diagnostics and falls back to simpler behavior when needed.&lt;/p&gt;
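&lt;p&gt;A lightweight diagnostic of that kind can be written with the standard library alone. This sketch probes Ollama's &lt;code&gt;/api/tags&lt;/code&gt; endpoint, which lists installed models on a standard install; the base URL and timeout here are assumptions.&lt;/p&gt;

```python
# Sketch of a runtime diagnostic: check whether an Ollama server is
# reachable before sending intent-classification requests.
import json
import urllib.request
import urllib.error

def ollama_available(base_url="http://localhost:11434", timeout=2.0):
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means the server answered sensibly
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

&lt;p&gt;Running this once at startup lets the UI show clearly whether the LLM path or the keyword fallback will be used.&lt;/p&gt;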

&lt;h3&gt;Noisy transcription&lt;/h3&gt;

&lt;p&gt;Speech recognition is not always perfect, especially with background noise or accents.&lt;/p&gt;

&lt;p&gt;To make the workflow more forgiving, I added a manual transcript edit box so the user can correct the text and rerun the intent and tool pipeline.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;This project reinforced a few important lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A good AI assistant is not just about the model&lt;/li&gt;
&lt;li&gt;Fallbacks matter as much as the primary path&lt;/li&gt;
&lt;li&gt;Transparency improves trust&lt;/li&gt;
&lt;li&gt;Safety constraints should be built in from the beginning&lt;/li&gt;
&lt;li&gt;A simple modular architecture makes debugging much easier&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Demo flow&lt;/h2&gt;

&lt;p&gt;For the video demo, I plan to show:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Direct microphone input&lt;/li&gt;
&lt;li&gt;Transcript generation&lt;/li&gt;
&lt;li&gt;Intent detection&lt;/li&gt;
&lt;li&gt;File creation or code generation&lt;/li&gt;
&lt;li&gt;Final output inside the app&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives a clean end-to-end view of how the system behaves.&lt;/p&gt;

&lt;h2&gt;Future improvements&lt;/h2&gt;

&lt;p&gt;A few enhancements I would add next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better support for compound commands&lt;/li&gt;
&lt;li&gt;Confirmation prompts before tool execution&lt;/li&gt;
&lt;li&gt;More tools, like search or note-taking&lt;/li&gt;
&lt;li&gt;Memory for multi-turn workflows&lt;/li&gt;
&lt;li&gt;Improved structured intent schemas with confidence scores&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project was a practical exercise in building a local-first voice assistant that is usable, safe, and transparent.&lt;/p&gt;

&lt;p&gt;Instead of aiming for a flashy demo with a single model call, I focused on the full pipeline: audio input, transcription, intent detection, tool routing, safe execution, and clear UI feedback.&lt;/p&gt;

&lt;p&gt;That combination made the system feel much more realistic and much easier to trust.&lt;/p&gt;

&lt;p&gt;If you want to try a similar build, start small, keep the architecture modular, and make failure cases visible from day one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>streamlit</category>
      <category>ollama</category>
    </item>
  </channel>
</rss>
