<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ben Racicot</title>
    <description>The latest articles on Forem by Ben Racicot (@benracicot).</description>
    <link>https://forem.com/benracicot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F190153%2Fc1d7c790-a350-4ba2-93db-9910bec25376.png</url>
      <title>Forem: Ben Racicot</title>
      <link>https://forem.com/benracicot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/benracicot"/>
    <language>en</language>
    <item>
      <title>Using LLaVA With Ollama on Mac - Without the Base64 Encoding</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:47 +0000</pubDate>
      <link>https://forem.com/benracicot/using-llava-with-ollama-on-mac-without-the-base64-encoding-3656</link>
      <guid>https://forem.com/benracicot/using-llava-with-ollama-on-mac-without-the-base64-encoding-3656</guid>
      <description>&lt;p&gt;Ollama supports vision models. LLaVA, Gemma 3, Moondream, Llama 3.2 Vision - pull them the same way you pull any other model. The inference works. The problem is the interface.&lt;/p&gt;

&lt;p&gt;Here's what using a vision model through Ollama's API looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;images&lt;/code&gt; field expects base64. For a typical screenshot, that's 50,000-200,000 characters pasted into a terminal command. Generate it with &lt;code&gt;base64 -i screenshot.png&lt;/code&gt;, paste it into the JSON payload. It works. Nobody does it twice voluntarily.&lt;/p&gt;
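&lt;p&gt;If you're calling the API from code anyway, the encoding stops being painful - let the script do it. A minimal sketch in Python (&lt;code&gt;vision_payload&lt;/code&gt; is an illustrative helper, not part of any library; the payload shape matches the curl example above):&lt;/p&gt;

```python
import base64
import json

def vision_payload(image_path, prompt, model="llava"):
    """Build the /api/generate request body, base64-encoding the image."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded]}

# POST json.dumps(vision_payload("screenshot.png", "What is in this image?"))
# to http://localhost:11434/api/generate and read the "response" field.
```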

&lt;p&gt;The CLI &lt;code&gt;ollama run llava&lt;/code&gt; supports a file path shorthand, but it's still a text-only workflow for a fundamentally visual task.&lt;/p&gt;

&lt;h2&gt;
  
  
  What vision models can do
&lt;/h2&gt;

&lt;p&gt;Vision-language models process an image and a text prompt together. They don't just classify. They reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image Q&amp;amp;A.&lt;/strong&gt; "What's the error in this screenshot?" "How many people are in this photo?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document understanding.&lt;/strong&gt; Point at a chart, table, or handwritten note. Ask it to extract data or describe relationships. Goes further than OCR - vision models understand layout and context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI analysis.&lt;/strong&gt; Screenshot a web page and ask the model to identify elements, describe layout, or spot accessibility issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scene description.&lt;/strong&gt; Detailed descriptions for accessibility narration, content tagging, or creative prompts.&lt;/p&gt;

&lt;p&gt;All of these work with Ollama's vision models on your Mac. The capability is there. What's missing is a way to use it that doesn't involve base64 strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drag-and-drop approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ModelPiper&lt;/a&gt; handles the encoding automatically. Drag an image onto the chat window. Type your question. The model sees both the image and text, responds in the same thread.&lt;/p&gt;

&lt;p&gt;If the model runs through Ollama, the request goes to Ollama's &lt;code&gt;/api/generate&lt;/code&gt; with the image payload. If it runs through ModelPiper's built-in engine, it goes to local llama.cpp. Either way, no base64 in sight.&lt;/p&gt;

&lt;p&gt;The chat shows the image alongside the response, so you can compare what the model said to what's actually in the picture. For iterative work - "now describe just the chart in the upper right" - reference the same image across multiple messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which models to use
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Moondream&lt;/td&gt;
&lt;td&gt;1.6B&lt;/td&gt;
&lt;td&gt;~1.5GB&lt;/td&gt;
&lt;td&gt;Simple descriptions, 8GB Macs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;~3GB&lt;/td&gt;
&lt;td&gt;Balanced quality/speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaVA 1.6&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;~5GB&lt;/td&gt;
&lt;td&gt;General-purpose image Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaVA 1.6&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;~9GB&lt;/td&gt;
&lt;td&gt;Complex visual reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 Vision&lt;/td&gt;
&lt;td&gt;11B&lt;/td&gt;
&lt;td&gt;~7GB&lt;/td&gt;
&lt;td&gt;Strong reasoning, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 Vision&lt;/td&gt;
&lt;td&gt;90B&lt;/td&gt;
&lt;td&gt;~48GB+&lt;/td&gt;
&lt;td&gt;Near-cloud quality (64GB+ Mac)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pull any of these with &lt;code&gt;ollama pull &amp;lt;model&amp;gt;&lt;/code&gt;. They appear in ModelPiper's model selector when Ollama is connected as a provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining vision with pipelines
&lt;/h2&gt;

&lt;p&gt;Vision models get more useful when chained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vision + OCR:&lt;/strong&gt; Apple Vision OCR extracts raw text, then a chat model summarizes or analyzes. More reliable than asking a vision model to read dense text directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision + TTS:&lt;/strong&gt; Describe an image, pipe the description to text-to-speech. Audio descriptions of visual content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision + Translation:&lt;/strong&gt; Describe in English, translate to another language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These multi-step workflows are where a &lt;a href="https://modelpiper.com/blog/ollama-pipelines-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;pipeline builder&lt;/a&gt; earns its complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations worth knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smaller models miss details.&lt;/strong&gt; Moondream and even LLaVA 7B will miss fine text in screenshots, misread chart numbers, and sometimes hallucinate details. For text extraction, Apple Vision OCR is more reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image downscaling.&lt;/strong&gt; Vision models resize images internally to 336x336 or 672x672 pixels. Fine details below that resolution are lost. Crop to the relevant portion before sending.&lt;/p&gt;
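&lt;p&gt;A quick back-of-envelope shows why cropping matters. Assuming the model sees at most 672 pixels on the long side (the screenshot widths here are illustrative):&lt;/p&gt;

```python
def scaled_px(original_width, feature_px, model_input=672):
    """Size in model pixels of a feature after the image's long side
    is downscaled to model_input."""
    return feature_px * model_input / original_width

full_shot = scaled_px(3024, 12)  # 12px label in a full screenshot: ~2.7px, unreadable
cropped = scaled_px(800, 12)     # same label after cropping to 800px: ~10px, legible
```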

&lt;p&gt;&lt;strong&gt;Memory pressure.&lt;/strong&gt; Vision models are larger than text-only models at the same parameter count because they include a vision encoder. LLaVA 7B uses more memory than a 7B text-only model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud is still better for hard tasks.&lt;/strong&gt; GPT-4 Vision and Claude outperform local models on complex document analysis and multi-object reasoning. For quick descriptions and simple Q&amp;amp;A, local models are good enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-vision-gui-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full article with setup steps and pipeline examples&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>computervision</category>
      <category>macos</category>
    </item>
    <item>
      <title>Ollama Pipelines on Mac: Chain Models Without Writing Glue Code</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:32 +0000</pubDate>
      <link>https://forem.com/benracicot/ollama-pipelines-on-mac-chain-models-without-writing-glue-code-1e26</link>
      <guid>https://forem.com/benracicot/ollama-pipelines-on-mac-chain-models-without-writing-glue-code-1e26</guid>
      <description>&lt;p&gt;Ollama runs one model at a time. Send it a prompt, get a response. For single-turn chat, that's enough.&lt;/p&gt;

&lt;p&gt;But useful work chains capabilities. Record a meeting, transcribe the audio, summarize the transcript, extract action items. Each step needs a different model or tool. Ollama handles one of those steps. The orchestration is your problem.&lt;/p&gt;

&lt;p&gt;The usual workaround is a Python script calling Ollama's API in a loop, piping output from one model into the next. Then you want to swap the summarization model, add a translation step, or figure out why step three produced garbage. Now you're maintaining a custom orchestration layer for something that should be a drag-and-drop operation.&lt;/p&gt;
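&lt;p&gt;That glue layer tends to look something like this - a sketch with hypothetical model names and prompts, against Ollama's real &lt;code&gt;/api/generate&lt;/code&gt; endpoint:&lt;/p&gt;

```python
import json
import urllib.request

def ollama_generate(model, prompt):
    """One blocking, non-streaming call to Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def run_chain(text, steps, generate=ollama_generate):
    """Feed text through a list of (model, prompt_template) steps in order."""
    for model, template in steps:
        text = generate(model, template.format(input=text))
    return text

# Hypothetical chain: summarize a transcript, then pull out action items.
steps = [
    ("llama3.2:3b", "Summarize this transcript:\n\n{input}"),
    ("llama3.2:3b", "List the action items from this summary:\n\n{input}"),
]
# summary = run_chain(transcript, steps)   # needs Ollama running locally
```

&lt;p&gt;Every new step is another entry in that list, another prompt template to debug, and another place for garbage to slip in.&lt;/p&gt;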

&lt;h2&gt;
  
  
  What a pipeline looks like
&lt;/h2&gt;

&lt;p&gt;A pipeline is a visual workflow where each block represents a model or operation. Data flows between blocks through connections on a canvas. You build workflows, not prompts.&lt;/p&gt;

&lt;p&gt;Block types include text generation, speech-to-text, text-to-speech, OCR, embedding, and image upscale. Text generation blocks can use any model from any connected provider - Ollama, ToolPiper's built-in llama.cpp, or a cloud API. The pipeline builder handles data flow automatically.&lt;/p&gt;

&lt;p&gt;Configurations are stored as JSON. Duplicate a pipeline, swap one block, have a variation running in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three pipelines you can build with Ollama models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Voice conversation: STT → LLM → TTS
&lt;/h3&gt;

&lt;p&gt;The simplest multi-model pipeline. Speech-to-text (Parakeet v3) transcribes your voice. An Ollama chat model reasons about the transcript. Text-to-speech reads the response aloud. Three blocks, three capabilities, one workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document Q&amp;amp;A: OCR → Embed → Index → Chat
&lt;/h3&gt;

&lt;p&gt;Drop a scanned PDF in. OCR (Apple Vision) extracts text. An embedding block indexes it in a local vector collection. A chat block with RAG context answers questions, citing specific passages. Documents stay on your Mac and become searchable through natural language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual content: Chat → Translate → TTS
&lt;/h3&gt;

&lt;p&gt;Ask your Ollama model a question in English. A second chat block translates the response. TTS reads it aloud in the target language. Changing the language is a one-field edit in the translation block's system prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Ollama fits in
&lt;/h2&gt;

&lt;p&gt;Connect Ollama as a provider in &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt;. Every downloaded model appears as an option in the pipeline builder's text generation blocks. No re-downloading, no format conversion.&lt;/p&gt;

&lt;p&gt;The practical advantage: you've already invested time pulling the right models. A 7B coding model, a 3B fast-chat model, a 13B for complex reasoning. In a pipeline, use each where it's strongest - fast model for classification, large one for generation - without managing separate API calls.&lt;/p&gt;

&lt;p&gt;The Ollama connection works through &lt;code&gt;localhost:11434&lt;/code&gt;. You'll need &lt;a href="https://modelpiper.com/blog/ollama-cors-fix-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;CORS configured&lt;/a&gt; for the browser-based builder to reach Ollama. Or use ToolPiper's built-in engine, which needs no CORS setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build one from scratch
&lt;/h2&gt;

&lt;p&gt;A three-block pipeline: transcribe an audio clip, then summarize the transcript. Meeting notes in two clicks.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the pipeline builder. Empty canvas with a block palette.&lt;/li&gt;
&lt;li&gt;Drag an STT block onto the canvas. Defaults to Parakeet v3 (Neural Engine).&lt;/li&gt;
&lt;li&gt;Drag a text generation block next to it. Select an Ollama model (Llama 3.2 3B works well for summarization). Set the system prompt: "Summarize this transcript in 3-5 concise bullet points."&lt;/li&gt;
&lt;li&gt;Draw a connection from STT output to the chat block input.&lt;/li&gt;
&lt;li&gt;Drop an audio file into the STT block. Click run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Parakeet transcribes. The transcript flows to your Ollama model. Summary appears in the output panel. Two models, one click, no scripting.&lt;/p&gt;

&lt;p&gt;Want to extend it? Add a TTS block after the summary to hear bullet points read aloud. Add a translation block between summarization and TTS for a multilingual workflow. Each extension is another block and another connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Latency compounds.&lt;/strong&gt; Each block adds processing time. A three-block voice pipeline adds ~1-2s total on M2 Max. Five blocks with OCR, embedding, retrieval, chat, and TTS takes longer. For real-time interaction, keep chains short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory adds up.&lt;/strong&gt; Each model block needs its own RAM. Voice chat (STT + 3B + TTS) needs ~3GB. Document Q&amp;amp;A (OCR + embeddings + 7B) might need 6-7GB. ToolPiper's resource monitor shows whether a pipeline's models fit before you run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-model chat doesn't need a pipeline.&lt;/strong&gt; If you're asking a question and reading an answer, the pipeline builder is overhead. Pipelines earn their complexity when the workflow involves more than one capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-pipelines-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full walkthrough with more pipeline examples&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>macos</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Adding Voice to Ollama on Mac: The 3-Model Chain</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:17 +0000</pubDate>
      <link>https://forem.com/benracicot/adding-voice-to-ollama-on-mac-the-3-model-chain-4hop</link>
      <guid>https://forem.com/benracicot/adding-voice-to-ollama-on-mac-the-3-model-chain-4hop</guid>
      <description>&lt;p&gt;Ollama runs language models. It doesn't listen and it doesn't speak. Type a question in the terminal, read the answer on screen. That's the entire interaction model.&lt;/p&gt;

&lt;p&gt;Voice changes what local AI feels like. Instead of typing and reading, you talk and listen. But getting there requires three separate AI models working together, and Ollama only handles one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What voice chat requires
&lt;/h2&gt;

&lt;p&gt;Three models running in sequence every time you speak:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-text (STT).&lt;/strong&gt; Your voice in, text transcription out. Needs a dedicated model - Whisper, Parakeet, or similar. Ollama doesn't include one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language model (LLM).&lt;/strong&gt; The transcribed text goes to your chat model. This is what Ollama does well. Any model you've pulled works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text-to-speech (TTS).&lt;/strong&gt; The model's text response gets converted to audio. Another dedicated model. Ollama doesn't include this either.&lt;/p&gt;

&lt;p&gt;The hard part isn't running each model. It's the coordination. STT output needs to feed into the LLM prompt. The LLM response needs to stream into TTS as tokens arrive, not after the full response completes. Latency between stages compounds - if each handoff adds 500ms, the conversation feels broken.&lt;/p&gt;

&lt;p&gt;You could wire this together with Python scripts, a Whisper server, and a TTS tool. Some people do. It takes hours of setup and the result is fragile.&lt;/p&gt;
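&lt;p&gt;The usual trick for the LLM-to-TTS handoff is to buffer streamed tokens and release text at sentence boundaries, so audio starts while generation continues. A sketch of that chunking step (the token list stands in for Ollama's streaming response; real tokenization is messier):&lt;/p&gt;

```python
SENTENCE_END = (".", "!", "?")

def sentence_chunks(tokens):
    """Yield sentence-sized chunks from a token stream so TTS can start
    speaking the first sentence while the LLM is still generating."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():            # flush whatever remains at end of stream
        yield buffer.strip()

# Stand-in for Ollama's streamed tokens:
tokens = ["Sure", ".", " Here", " are", " two", " points", ".", " Done"]
chunks = list(sentence_chunks(tokens))
# chunks == ["Sure.", "Here are two points.", "Done"]
```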

&lt;h2&gt;
  
  
  The pre-wired approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt; ships STT, LLM, and TTS as built-in backends, all running on Apple Silicon hardware acceleration. The &lt;code&gt;tp-local-voice-chat&lt;/code&gt; pipeline template wires all three together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STT:&lt;/strong&gt; Parakeet v3, running on Apple's Neural Engine. Transcribes in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM:&lt;/strong&gt; ToolPiper's bundled llama.cpp engine or your existing Ollama instance. Your Ollama models appear in the pipeline's LLM block alongside built-in models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTS options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PocketTTS&lt;/strong&gt; - Neural Engine. Near-instant generation. Best for conversational pace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soprano&lt;/strong&gt; - Metal GPU. Higher audio quality, slightly more latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orpheus&lt;/strong&gt; - Expressive model with emotional range. Best for content creation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three run entirely on-device. No audio leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency numbers
&lt;/h2&gt;

&lt;p&gt;Measured on an M2 Max with 32GB RAM; the 3B column uses Qwen 3.5 3B at Q4:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;3B Model&lt;/th&gt;
&lt;th&gt;7B Model&lt;/th&gt;
&lt;th&gt;13B Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT (Parakeet v3)&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM time-to-first-token&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~600ms&lt;/td&gt;
&lt;td&gt;~1200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS first audio (PocketTTS)&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total round-trip&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total RAM (STT + LLM + TTS)&lt;/td&gt;
&lt;td&gt;~3GB&lt;/td&gt;
&lt;td&gt;~5.5GB&lt;/td&gt;
&lt;td&gt;~9.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With a 3B model, the pause between your question and the spoken response is short enough to feel like the model is thinking. With a 13B model, the pause is noticeable - you start wondering if something broke before the first word arrives.&lt;/p&gt;

&lt;p&gt;For comparison, ChatGPT's voice mode responds in under a second on optimized server hardware. Local voice chat on consumer hardware can't match that speed, but it runs entirely on-device with no internet connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install ToolPiper from the Mac App Store or &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;modelpiper.com&lt;/a&gt;. A starter model downloads on first launch.&lt;/li&gt;
&lt;li&gt;If you have Ollama models, add Ollama as a provider - your models appear automatically.&lt;/li&gt;
&lt;li&gt;Open the pipeline templates, select &lt;code&gt;tp-local-voice-chat&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Choose your LLM and TTS voice. Click the microphone button. Talk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pipeline is three blocks connected in sequence: mic → STT → LLM → TTS → speaker. Push-to-talk by default (more predictable than continuous listening).&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Latency is real.&lt;/strong&gt; 1.5s round-trip with a 3B model is the floor. Larger models push it higher. Cloud voice assistants are faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three models in memory.&lt;/strong&gt; STT (~500MB) + LLM (2-5GB) + TTS (~300MB). On 8GB, stick with 3B chat models. On 16GB+, 7B is comfortable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No interruption handling.&lt;/strong&gt; If the model is speaking and you start talking, the current implementation doesn't stop TTS mid-sentence. You wait for it to finish or manually stop playback.&lt;/p&gt;

&lt;p&gt;For brainstorming, dictation review, and Q&amp;amp;A while your hands are busy - local voice chat works. For rapid-fire dialogue where sub-second latency matters, cloud voice modes are still faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-voice-chat-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full walkthrough with voice selection and pipeline customization&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>macos</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Ollama Chat Without Docker: Native Mac Alternatives to Open WebUI</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:00:47 +0000</pubDate>
      <link>https://forem.com/benracicot/ollama-chat-without-docker-native-mac-alternatives-to-open-webui-3dg4</link>
      <guid>https://forem.com/benracicot/ollama-chat-without-docker-native-mac-alternatives-to-open-webui-3dg4</guid>
      <description>&lt;p&gt;"Open WebUI needs Docker." Four words that filter out half the people who wanted an Ollama frontend on Mac.&lt;/p&gt;

&lt;p&gt;Docker Desktop on macOS allocates 2GB of RAM by default before you load a single model. Open WebUI's docs recommend bumping that to 4GB. On an 8GB MacBook Air - still the most common Mac Apple sells - that's half your memory gone before you type a prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Docker actually costs you on Mac
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAM you can't get back.&lt;/strong&gt; Docker Desktop runs a Linux VM through Apple's Hypervisor framework. That VM reserves memory at startup. On Apple Silicon, the GPU and CPU share the same unified memory pool. Every gigabyte Docker takes is a gigabyte your model can't use.&lt;/p&gt;

&lt;p&gt;A 7B model at Q4 quantization needs roughly 4-5GB. On 16GB: Docker (4GB) + Open WebUI's Python stack + 7B model + macOS overhead = right at the edge. On 8GB, you're past it.&lt;/p&gt;
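&lt;p&gt;The arithmetic behind that estimate, roughly - Q4 quantizations average around 4.5 bits per weight once quantization scales and a few higher-precision layers are counted, with KV cache and runtime buffers on top:&lt;/p&gt;

```python
def model_ram_gb(params_billions, bits_per_weight=4.5):
    """Rough RAM needed just for the quantized weights, in GB."""
    return params_billions * bits_per_weight / 8

weights = model_ram_gb(7)   # ~3.9 GB of weights for a 7B model at Q4
# Add roughly 0.5-1 GB for KV cache and runtime buffers: ~4.5-5 GB total.
```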

&lt;p&gt;&lt;strong&gt;30+ second cold start.&lt;/strong&gt; Docker Desktop boots its Linux VM (15-30s). Open WebUI's Python process adds 10-15s on top. A native Mac app launches in under a second. If you open and close your AI tool throughout the day, that startup tax compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four-layer update stack.&lt;/strong&gt; Docker Desktop, the Docker engine, the Open WebUI container image, and the Ollama connection each update independently. When something breaks - and it will - you're debugging across container boundaries. Is it Docker VM networking? A Python dependency inside the image? A port mapping conflict?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a Mac citizen.&lt;/strong&gt; No Spotlight indexing of conversations. No menu bar presence. No native notifications. No Keychain for credentials. Open WebUI runs in a browser tab that looks and feels like what it is: a Linux web application inside a virtual machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Native alternatives
&lt;/h2&gt;

&lt;p&gt;Three options connect to Ollama without containers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollama's own app
&lt;/h3&gt;

&lt;p&gt;Shipped in early 2026. Minimal: single conversation view, model selector, text input. No conversation history across sessions. No voice, no vision, no pipelines. Think of it as a calculator for language models - open, ask, close.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollamac Pro
&lt;/h3&gt;

&lt;p&gt;Third-party native Mac app built in SwiftUI. Conversation history, multiple model support, clean interface. One-time purchase. Deliberately scoped to multi-turn text chat and nothing more.&lt;/p&gt;

&lt;h3&gt;
  
  
  ToolPiper
&lt;/h3&gt;

&lt;p&gt;Native Swift app that bundles llama.cpp directly - same models, same GGUF format, same Metal GPU speed. Also connects to Ollama as an external provider, so existing models appear alongside the built-in engine.&lt;/p&gt;

&lt;p&gt;Beyond chat: voice conversation (STT + LLM + TTS chained locally), visual pipelines, per-model resource monitoring, 136 MCP tools, browser automation, OCR, RAG. The tradeoff is more surface area to learn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-side comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Open WebUI (Docker)&lt;/th&gt;
&lt;th&gt;Ollama App (Native)&lt;/th&gt;
&lt;th&gt;ToolPiper (Native)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Install steps&lt;/td&gt;
&lt;td&gt;5-7&lt;/td&gt;
&lt;td&gt;0 (built in)&lt;/td&gt;
&lt;td&gt;1 (Mac App Store)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM beyond models&lt;/td&gt;
&lt;td&gt;2-4GB&lt;/td&gt;
&lt;td&gt;~20MB&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first chat&lt;/td&gt;
&lt;td&gt;10-15 min&lt;/td&gt;
&lt;td&gt;~2 min&lt;/td&gt;
&lt;td&gt;~60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start&lt;/td&gt;
&lt;td&gt;30-45s&lt;/td&gt;
&lt;td&gt;Under 1s&lt;/td&gt;
&lt;td&gt;Under 1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update mechanism&lt;/td&gt;
&lt;td&gt;4 layers&lt;/td&gt;
&lt;td&gt;Ships with Ollama&lt;/td&gt;
&lt;td&gt;Auto-update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice (STT + TTS)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual pipelines&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource monitoring&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Per-model memory + GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When Docker still makes sense
&lt;/h2&gt;

&lt;p&gt;Server deployments and multi-user setups. Open WebUI in Docker gives you user accounts, shared conversations, and role-based access. That's a real use case native apps aren't designed for.&lt;/p&gt;

&lt;p&gt;Linux environments. Docker runs natively on Linux without the Hypervisor VM. The RAM overhead drops from gigabytes to megabytes. The performance tax that makes Docker a poor fit on macOS barely exists on Linux.&lt;/p&gt;

&lt;p&gt;Existing infrastructure. If your team already runs a Docker Compose stack, one more container is marginal cost. That's a pragmatic reason, not a technical one, and it's valid.&lt;/p&gt;

&lt;p&gt;For a single person on a Mac who wants to talk to local models, Docker is overhead that doesn't earn its keep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-no-docker-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full comparison with detailed walkthroughs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>macos</category>
      <category>docker</category>
      <category>ai</category>
    </item>
    <item>
      <title>Fix Ollama CORS Errors on Mac: One Environment Variable</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:59:55 +0000</pubDate>
      <link>https://forem.com/benracicot/fix-ollama-cors-errors-on-mac-one-environment-variable-1b88</link>
      <guid>https://forem.com/benracicot/fix-ollama-cors-errors-on-mac-one-environment-variable-1b88</guid>
      <description>&lt;p&gt;You pointed a web app at &lt;code&gt;localhost:11434&lt;/code&gt; and got nothing back. The browser console shows a CORS policy error. Ollama blocked your request on purpose.&lt;/p&gt;

&lt;p&gt;The fix is one environment variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;launchctl setenv OLLAMA_ORIGINS &lt;span class="s2"&gt;"*"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pkill Ollama&lt;span class="p"&gt;;&lt;/span&gt; open &lt;span class="nt"&gt;-a&lt;/span&gt; Ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sets &lt;code&gt;OLLAMA_ORIGINS&lt;/code&gt; and restarts Ollama. Works in about five seconds.&lt;/p&gt;

&lt;p&gt;This doesn't persist across reboots. For a permanent fix, add an export to &lt;code&gt;~/.zshrc&lt;/code&gt; - noting that shell config only reaches Ollama processes started from that shell (&lt;code&gt;ollama serve&lt;/code&gt;); the menu bar app never reads it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_ORIGINS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To scope it to specific origins instead of a wildcard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;launchctl setenv OLLAMA_ORIGINS &lt;span class="s2"&gt;"http://localhost:4200,http://localhost:3000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comma-separated, no spaces. Each origin is an exact match including port.&lt;/p&gt;
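&lt;p&gt;Exact match means scheme, host, and port must all agree. A toy checker illustrating the rule (this mirrors the behavior, not Ollama's actual implementation):&lt;/p&gt;

```python
def origin_allowed(origin, allowed):
    """Exact-match origin check: scheme, host, and port all count."""
    entries = [entry.strip() for entry in allowed.split(",")]
    return "*" in entries or origin in entries

allowed = "http://localhost:4200,http://localhost:3000"
origin_allowed("http://localhost:4200", allowed)    # True
origin_allowed("http://localhost:5173", allowed)    # False: port differs
origin_allowed("https://localhost:4200", allowed)   # False: scheme differs
```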

&lt;h2&gt;
  
  
  Verify it worked
&lt;/h2&gt;

&lt;p&gt;Open your browser's developer console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/api/tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your model list should print. If you still get a CORS error, Ollama hasn't picked up the new variable - kill and restart: &lt;code&gt;pkill Ollama; open -a Ollama&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Ollama serves on &lt;code&gt;localhost:11434&lt;/code&gt; without CORS headers. When a browser makes a cross-origin request, it sends a preflight &lt;code&gt;OPTIONS&lt;/code&gt; request. Ollama responds without &lt;code&gt;Access-Control-Allow-Origin&lt;/code&gt;, and the browser kills the actual request before it fires.&lt;/p&gt;

&lt;p&gt;This is a security decision, not a bug. Ollama's API can load and unload models - a destructive operation you don't want arbitrary webpages triggering. Shipping without CORS means every browser-based client has to opt in explicitly.&lt;/p&gt;

&lt;p&gt;The tradeoff is real. Security by default is the right call for an API server. But it means every new user hits the same wall on their first day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common gotchas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setting disappears after reboot.&lt;/strong&gt; &lt;code&gt;launchctl setenv&lt;/code&gt; doesn't persist across macOS restarts. The &lt;code&gt;~/.zshrc&lt;/code&gt; export is the permanent fix for terminal-launched Ollama; for the menu bar app, re-run the &lt;code&gt;launchctl setenv&lt;/code&gt; command after a reboot or automate it with a login item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homebrew Ollama ignores shell variables.&lt;/strong&gt; If you installed via &lt;code&gt;brew services start ollama&lt;/code&gt;, it runs as a launchd service that doesn't inherit your shell environment. Use &lt;code&gt;launchctl setenv&lt;/code&gt; at the system level, or edit the Homebrew plist directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CORS is fixed but requests hang.&lt;/strong&gt; If the browser error is gone but responses take forever, Ollama might be loading a model on first request. Large models (7B+) take 3-5 seconds to load from disk. Wait for the first response - subsequent requests will be fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the wildcard safe?&lt;/strong&gt; For local development, yes. Ollama only listens on &lt;code&gt;localhost&lt;/code&gt; by default - remote machines can't reach it regardless of the CORS setting. The wildcard allows any webpage on your machine to make requests to the API, which is fine for development. Only scope to specific origins if you've bound Ollama to &lt;code&gt;0.0.0.0&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;If managing environment variables per-tool isn't your idea of a good time - &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt; bundles llama.cpp directly with CORS headers built in. Same GGUF models, same Metal GPU acceleration, zero configuration. It also connects to your existing Ollama instance, so you can use both without choosing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-cors-fix-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full article with more edge cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>macos</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
