<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ben Racicot</title>
    <description>The latest articles on Forem by Ben Racicot (@benracicot).</description>
    <link>https://forem.com/benracicot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F190153%2Fc1d7c790-a350-4ba2-93db-9910bec25376.png</url>
      <title>Forem: Ben Racicot</title>
      <link>https://forem.com/benracicot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/benracicot"/>
    <language>en</language>
    <item>
      <title>Using LLaVA With Ollama on Mac - Without the Base64 Encoding</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:47 +0000</pubDate>
      <link>https://forem.com/benracicot/using-llava-with-ollama-on-mac-without-the-base64-encoding-3656</link>
      <guid>https://forem.com/benracicot/using-llava-with-ollama-on-mac-without-the-base64-encoding-3656</guid>
      <description>&lt;p&gt;Ollama supports vision models. LLaVA, Gemma 3, Moondream, Llama 3.2 Vision - pull them the same way you pull any other model. The inference works. The problem is the interface.&lt;/p&gt;

&lt;p&gt;Here's what using a vision model through Ollama's API looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;images&lt;/code&gt; field expects base64. For a typical screenshot, that's 50,000-200,000 characters pasted into a terminal command. Generate it with &lt;code&gt;base64 -i screenshot.png&lt;/code&gt;, paste it into the JSON payload. It works. Nobody does it twice voluntarily.&lt;/p&gt;
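&lt;p&gt;If you're calling the API from code anyway, the encoding stops being painful - let the script do it. A minimal sketch in Python (&lt;code&gt;vision_payload&lt;/code&gt; is an illustrative helper, not part of any library; the payload shape matches the curl example above):&lt;/p&gt;

```python
import base64
import json

def vision_payload(image_path, prompt, model="llava"):
    """Build the /api/generate request body, base64-encoding the image."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded]}

# POST json.dumps(vision_payload("screenshot.png", "What is in this image?"))
# to http://localhost:11434/api/generate and read the "response" field.
```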

&lt;p&gt;The CLI &lt;code&gt;ollama run llava&lt;/code&gt; supports a file path shorthand, but it's still a text-only workflow for a fundamentally visual task.&lt;/p&gt;

&lt;h2&gt;
  
  
  What vision models can do
&lt;/h2&gt;

&lt;p&gt;Vision-language models process an image and a text prompt together. They don't just classify. They reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image Q&amp;amp;A.&lt;/strong&gt; "What's the error in this screenshot?" "How many people are in this photo?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document understanding.&lt;/strong&gt; Point at a chart, table, or handwritten note. Ask it to extract data or describe relationships. Goes further than OCR - vision models understand layout and context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI analysis.&lt;/strong&gt; Screenshot a web page and ask the model to identify elements, describe layout, or spot accessibility issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scene description.&lt;/strong&gt; Detailed descriptions for accessibility narration, content tagging, or creative prompts.&lt;/p&gt;

&lt;p&gt;All of these work with Ollama's vision models on your Mac. The capability is there. What's missing is a way to use it that doesn't involve base64 strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drag-and-drop approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ModelPiper&lt;/a&gt; handles the encoding automatically. Drag an image onto the chat window. Type your question. The model sees both the image and text, responds in the same thread.&lt;/p&gt;

&lt;p&gt;If the model runs through Ollama, the request goes to Ollama's &lt;code&gt;/api/generate&lt;/code&gt; with the image payload. If it runs through ModelPiper's built-in engine, it goes to local llama.cpp. Either way, no base64 in sight.&lt;/p&gt;

&lt;p&gt;The chat shows the image alongside the response, so you can compare what the model said to what's actually in the picture. For iterative work - "now describe just the chart in the upper right" - reference the same image across multiple messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which models to use
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Moondream&lt;/td&gt;
&lt;td&gt;1.6B&lt;/td&gt;
&lt;td&gt;~1.5GB&lt;/td&gt;
&lt;td&gt;Simple descriptions, 8GB Macs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;~3GB&lt;/td&gt;
&lt;td&gt;Balanced quality/speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaVA 1.6&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;~5GB&lt;/td&gt;
&lt;td&gt;General-purpose image Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaVA 1.6&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;~9GB&lt;/td&gt;
&lt;td&gt;Complex visual reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 Vision&lt;/td&gt;
&lt;td&gt;11B&lt;/td&gt;
&lt;td&gt;~7GB&lt;/td&gt;
&lt;td&gt;Strong reasoning, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 Vision&lt;/td&gt;
&lt;td&gt;90B&lt;/td&gt;
&lt;td&gt;~48GB+&lt;/td&gt;
&lt;td&gt;Near-cloud quality (64GB+ Mac)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pull any of these with &lt;code&gt;ollama pull &amp;lt;model&amp;gt;&lt;/code&gt;. They appear in ModelPiper's model selector when Ollama is connected as a provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining vision with pipelines
&lt;/h2&gt;

&lt;p&gt;Vision models get more useful when chained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vision + OCR:&lt;/strong&gt; Apple Vision OCR extracts raw text, then a chat model summarizes or analyzes. More reliable than asking a vision model to read dense text directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision + TTS:&lt;/strong&gt; Describe an image, pipe the description to text-to-speech. Audio descriptions of visual content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision + Translation:&lt;/strong&gt; Describe in English, translate to another language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These multi-step workflows are where a &lt;a href="https://modelpiper.com/blog/ollama-pipelines-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;pipeline builder&lt;/a&gt; earns its complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations worth knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smaller models miss details.&lt;/strong&gt; Moondream and even LLaVA 7B will miss fine text in screenshots, misread chart numbers, and sometimes hallucinate details. For text extraction, Apple Vision OCR is more reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image downscaling.&lt;/strong&gt; Vision models resize images internally to 336x336 or 672x672 pixels. Fine details below that resolution are lost. Crop to the relevant portion before sending.&lt;/p&gt;
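&lt;p&gt;A quick back-of-envelope shows why cropping matters. Assuming the model sees at most 672 pixels on the long side (the screenshot widths here are illustrative):&lt;/p&gt;

```python
def scaled_px(original_width, feature_px, model_input=672):
    """Size in model pixels of a feature after the image's long side
    is downscaled to model_input."""
    return feature_px * model_input / original_width

full_shot = scaled_px(3024, 12)  # 12px label in a full screenshot: ~2.7px, unreadable
cropped = scaled_px(800, 12)     # same label after cropping to 800px: ~10px, legible
```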

&lt;p&gt;&lt;strong&gt;Memory pressure.&lt;/strong&gt; Vision models are larger than text-only models at the same parameter count because they include a vision encoder. LLaVA 7B uses more memory than a 7B text-only model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud is still better for hard tasks.&lt;/strong&gt; GPT-4 Vision and Claude outperform local models on complex document analysis and multi-object reasoning. For quick descriptions and simple Q&amp;amp;A, local models are good enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-vision-gui-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full article with setup steps and pipeline examples&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>computervision</category>
      <category>macos</category>
    </item>
    <item>
      <title>Ollama Pipelines on Mac: Chain Models Without Writing Glue Code</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:32 +0000</pubDate>
      <link>https://forem.com/benracicot/ollama-pipelines-on-mac-chain-models-without-writing-glue-code-1e26</link>
      <guid>https://forem.com/benracicot/ollama-pipelines-on-mac-chain-models-without-writing-glue-code-1e26</guid>
      <description>&lt;p&gt;Ollama runs one model at a time. Send it a prompt, get a response. For single-turn chat, that's enough.&lt;/p&gt;

&lt;p&gt;But useful work chains capabilities. Record a meeting, transcribe the audio, summarize the transcript, extract action items. Each step needs a different model or tool. Ollama handles one of those steps. The orchestration is your problem.&lt;/p&gt;

&lt;p&gt;The usual workaround is a Python script calling Ollama's API in a loop, piping output from one model into the next. Then you want to swap the summarization model, add a translation step, or figure out why step three produced garbage. Now you're maintaining a custom orchestration layer for something that should be a drag-and-drop operation.&lt;/p&gt;
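&lt;p&gt;That glue layer tends to look something like this - a sketch with hypothetical model names and prompts, against Ollama's real &lt;code&gt;/api/generate&lt;/code&gt; endpoint:&lt;/p&gt;

```python
import json
import urllib.request

def ollama_generate(model, prompt):
    """One blocking, non-streaming call to Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def run_chain(text, steps, generate=ollama_generate):
    """Feed text through a list of (model, prompt_template) steps in order."""
    for model, template in steps:
        text = generate(model, template.format(input=text))
    return text

# Hypothetical chain: summarize a transcript, then pull out action items.
steps = [
    ("llama3.2:3b", "Summarize this transcript:\n\n{input}"),
    ("llama3.2:3b", "List the action items from this summary:\n\n{input}"),
]
# summary = run_chain(transcript, steps)   # needs Ollama running locally
```

&lt;p&gt;Every new step is another entry in that list, another prompt template to debug, and another place for garbage to slip in.&lt;/p&gt;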

&lt;h2&gt;
  
  
  What a pipeline looks like
&lt;/h2&gt;

&lt;p&gt;A pipeline is a visual workflow where each block represents a model or operation. Data flows between blocks through connections on a canvas. You build workflows, not prompts.&lt;/p&gt;

&lt;p&gt;Block types include text generation, speech-to-text, text-to-speech, OCR, embedding, and image upscale. Text generation blocks can use any model from any connected provider - Ollama, ToolPiper's built-in llama.cpp, or a cloud API. The pipeline builder handles data flow automatically.&lt;/p&gt;

&lt;p&gt;Configurations are stored as JSON. Duplicate a pipeline, swap one block, have a variation running in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three pipelines you can build with Ollama models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Voice conversation: STT → LLM → TTS
&lt;/h3&gt;

&lt;p&gt;The simplest multi-model pipeline. Speech-to-text (Parakeet v3) transcribes your voice. An Ollama chat model reasons about the transcript. Text-to-speech reads the response aloud. Three blocks, three capabilities, one workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document Q&amp;amp;A: OCR → Embed → Index → Chat
&lt;/h3&gt;

&lt;p&gt;Drop a scanned PDF in. OCR (Apple Vision) extracts text. An embedding block indexes it in a local vector collection. A chat block with RAG context answers questions, citing specific passages. Documents stay on your Mac and become searchable through natural language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual content: Chat → Translate → TTS
&lt;/h3&gt;

&lt;p&gt;Ask your Ollama model a question in English. A second chat block translates the response. TTS reads it aloud in the target language. Changing the language is a one-field edit in the translation block's system prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Ollama fits in
&lt;/h2&gt;

&lt;p&gt;Connect Ollama as a provider in &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt;. Every downloaded model appears as an option in the pipeline builder's text generation blocks. No re-downloading, no format conversion.&lt;/p&gt;

&lt;p&gt;The practical advantage: you've already invested time pulling the right models. A 7B coding model, a 3B fast-chat model, a 13B for complex reasoning. In a pipeline, use each where it's strongest - fast model for classification, large one for generation - without managing separate API calls.&lt;/p&gt;

&lt;p&gt;The Ollama connection works through &lt;code&gt;localhost:11434&lt;/code&gt;. You'll need &lt;a href="https://modelpiper.com/blog/ollama-cors-fix-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;CORS configured&lt;/a&gt; for the browser-based builder to reach Ollama. Or use ToolPiper's built-in engine, which needs no CORS setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build one from scratch
&lt;/h2&gt;

&lt;p&gt;A three-block pipeline: transcribe an audio clip, then summarize the transcript. Meeting notes in two clicks.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the pipeline builder. Empty canvas with a block palette.&lt;/li&gt;
&lt;li&gt;Drag an STT block onto the canvas. Defaults to Parakeet v3 (Neural Engine).&lt;/li&gt;
&lt;li&gt;Drag a text generation block next to it. Select an Ollama model (Llama 3.2 3B works well for summarization). Set the system prompt: "Summarize this transcript in 3-5 concise bullet points."&lt;/li&gt;
&lt;li&gt;Draw a connection from STT output to the chat block input.&lt;/li&gt;
&lt;li&gt;Drop an audio file into the STT block. Click run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Parakeet transcribes. The transcript flows to your Ollama model. Summary appears in the output panel. Two models, one click, no scripting.&lt;/p&gt;

&lt;p&gt;Want to extend it? Add a TTS block after the summary to hear bullet points read aloud. Add a translation block between summarization and TTS for a multilingual workflow. Each extension is another block and another connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Latency compounds.&lt;/strong&gt; Each block adds processing time. A three-block voice pipeline adds ~1-2s total on M2 Max. Five blocks with OCR, embedding, retrieval, chat, and TTS takes longer. For real-time interaction, keep chains short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory adds up.&lt;/strong&gt; Each model block needs its own RAM. Voice chat (STT + 3B + TTS) needs ~3GB. Document Q&amp;amp;A (OCR + embeddings + 7B) might need 6-7GB. ToolPiper's resource monitor shows whether a pipeline's models fit before you run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-model chat doesn't need a pipeline.&lt;/strong&gt; If you're asking a question and reading an answer, the pipeline builder is overhead. Pipelines earn their complexity when the workflow involves more than one capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-pipelines-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full walkthrough with more pipeline examples&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>macos</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Adding Voice to Ollama on Mac: The 3-Model Chain</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:03:17 +0000</pubDate>
      <link>https://forem.com/benracicot/adding-voice-to-ollama-on-mac-the-3-model-chain-4hop</link>
      <guid>https://forem.com/benracicot/adding-voice-to-ollama-on-mac-the-3-model-chain-4hop</guid>
      <description>&lt;p&gt;Ollama runs language models. It doesn't listen and it doesn't speak. Type a question in the terminal, read the answer on screen. That's the entire interaction model.&lt;/p&gt;

&lt;p&gt;Voice changes what local AI feels like. Instead of typing and reading, you talk and listen. But getting there requires three separate AI models working together, and Ollama only handles one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What voice chat requires
&lt;/h2&gt;

&lt;p&gt;Three models running in sequence every time you speak:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-text (STT).&lt;/strong&gt; Your voice in, text transcription out. Needs a dedicated model - Whisper, Parakeet, or similar. Ollama doesn't include one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language model (LLM).&lt;/strong&gt; The transcribed text goes to your chat model. This is what Ollama does well. Any model you've pulled works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text-to-speech (TTS).&lt;/strong&gt; The model's text response gets converted to audio. Another dedicated model. Ollama doesn't include this either.&lt;/p&gt;

&lt;p&gt;The hard part isn't running each model. It's the coordination. STT output needs to feed into the LLM prompt. The LLM response needs to stream into TTS as tokens arrive, not after the full response completes. Latency between stages compounds - if each handoff adds 500ms, the conversation feels broken.&lt;/p&gt;

&lt;p&gt;You could wire this together with Python scripts, a Whisper server, and a TTS tool. Some people do. It takes hours of setup and the result is fragile.&lt;/p&gt;
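&lt;p&gt;The usual trick for the LLM-to-TTS handoff is to buffer streamed tokens and release text at sentence boundaries, so audio starts while generation continues. A sketch of that chunking step (the token list stands in for Ollama's streaming response; real tokenization is messier):&lt;/p&gt;

```python
SENTENCE_END = (".", "!", "?")

def sentence_chunks(tokens):
    """Yield sentence-sized chunks from a token stream so TTS can start
    speaking the first sentence while the LLM is still generating."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():            # flush whatever remains at end of stream
        yield buffer.strip()

# Stand-in for Ollama's streamed tokens:
tokens = ["Sure", ".", " Here", " are", " two", " points", ".", " Done"]
chunks = list(sentence_chunks(tokens))
# chunks == ["Sure.", "Here are two points.", "Done"]
```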

&lt;h2&gt;
  
  
  The pre-wired approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt; ships STT, LLM, and TTS as built-in backends, all running on Apple Silicon hardware acceleration. The &lt;code&gt;tp-local-voice-chat&lt;/code&gt; pipeline template wires all three together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STT:&lt;/strong&gt; Parakeet v3, running on Apple's Neural Engine. Transcribes in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM:&lt;/strong&gt; ToolPiper's bundled llama.cpp engine or your existing Ollama instance. Your Ollama models appear in the pipeline's LLM block alongside built-in models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTS options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PocketTTS&lt;/strong&gt; - Neural Engine. Near-instant generation. Best for conversational pace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soprano&lt;/strong&gt; - Metal GPU. Higher audio quality, slightly more latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orpheus&lt;/strong&gt; - Expressive model with emotional range. Best for content creation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three run entirely on-device. No audio leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency numbers
&lt;/h2&gt;

&lt;p&gt;Measured on an M2 Max with 32GB RAM; the 3B column uses Qwen 3.5 3B at Q4:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;3B Model&lt;/th&gt;
&lt;th&gt;7B Model&lt;/th&gt;
&lt;th&gt;13B Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT (Parakeet v3)&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM time-to-first-token&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~600ms&lt;/td&gt;
&lt;td&gt;~1200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS first audio (PocketTTS)&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total round-trip&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total RAM (STT + LLM + TTS)&lt;/td&gt;
&lt;td&gt;~3GB&lt;/td&gt;
&lt;td&gt;~5.5GB&lt;/td&gt;
&lt;td&gt;~9.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With a 3B model, the pause between your question and the spoken response is short enough to feel like the model is thinking. With a 13B model, the pause is noticeable - you start wondering if something broke before the first word arrives.&lt;/p&gt;

&lt;p&gt;For comparison, ChatGPT's voice mode responds in under a second on optimized server hardware. Local voice chat on consumer hardware can't match that speed, but it runs entirely on-device with no internet connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install ToolPiper from the Mac App Store or &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;modelpiper.com&lt;/a&gt;. A starter model downloads on first launch.&lt;/li&gt;
&lt;li&gt;If you have Ollama models, add Ollama as a provider - your models appear automatically.&lt;/li&gt;
&lt;li&gt;Open the pipeline templates, select &lt;code&gt;tp-local-voice-chat&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Choose your LLM and TTS voice. Click the microphone button. Talk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pipeline is three blocks connected in sequence: mic → STT → LLM → TTS → speaker. Push-to-talk by default (more predictable than continuous listening).&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Latency is real.&lt;/strong&gt; 1.5s round-trip with a 3B model is the floor. Larger models push it higher. Cloud voice assistants are faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three models in memory.&lt;/strong&gt; STT (~500MB) + LLM (2-5GB) + TTS (~300MB). On 8GB, stick with 3B chat models. On 16GB+, 7B is comfortable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No interruption handling.&lt;/strong&gt; If the model is speaking and you start talking, the current implementation doesn't stop TTS mid-sentence. You wait for it to finish or manually stop playback.&lt;/p&gt;

&lt;p&gt;For brainstorming, dictation review, and Q&amp;amp;A while your hands are busy - local voice chat works. For rapid-fire dialogue where sub-second latency matters, cloud voice modes are still faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-voice-chat-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full walkthrough with voice selection and pipeline customization&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>ai</category>
      <category>macos</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Ollama Chat Without Docker: Native Mac Alternatives to Open WebUI</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 01:00:47 +0000</pubDate>
      <link>https://forem.com/benracicot/ollama-chat-without-docker-native-mac-alternatives-to-open-webui-3dg4</link>
      <guid>https://forem.com/benracicot/ollama-chat-without-docker-native-mac-alternatives-to-open-webui-3dg4</guid>
      <description>&lt;p&gt;"Open WebUI needs Docker." Four words that filter out half the people who wanted an Ollama frontend on Mac.&lt;/p&gt;

&lt;p&gt;Docker Desktop on macOS allocates 2GB of RAM by default before you load a single model. Open WebUI's docs recommend bumping that to 4GB. On an 8GB MacBook Air - still the most common Mac Apple sells - that's half your memory gone before you type a prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Docker actually costs you on Mac
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAM you can't get back.&lt;/strong&gt; Docker Desktop runs a Linux VM through Apple's Hypervisor framework. That VM reserves memory at startup. On Apple Silicon, the GPU and CPU share the same unified memory pool. Every gigabyte Docker takes is a gigabyte your model can't use.&lt;/p&gt;

&lt;p&gt;A 7B model at Q4 quantization needs roughly 4-5GB. On 16GB: Docker (4GB) + Open WebUI's Python stack + 7B model + macOS overhead = right at the edge. On 8GB, you're past it.&lt;/p&gt;
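&lt;p&gt;The arithmetic behind that estimate, roughly - Q4 quantizations average around 4.5 bits per weight once quantization scales and a few higher-precision layers are counted, with KV cache and runtime buffers on top:&lt;/p&gt;

```python
def model_ram_gb(params_billions, bits_per_weight=4.5):
    """Rough RAM needed just for the quantized weights, in GB."""
    return params_billions * bits_per_weight / 8

weights = model_ram_gb(7)   # ~3.9 GB of weights for a 7B model at Q4
# Add roughly 0.5-1 GB for KV cache and runtime buffers: ~4.5-5 GB total.
```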

&lt;p&gt;&lt;strong&gt;30+ second cold start.&lt;/strong&gt; Docker Desktop boots its Linux VM (15-30s). Open WebUI's Python process adds 10-15s on top. A native Mac app launches in under a second. If you open and close your AI tool throughout the day, that startup tax compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four-layer update stack.&lt;/strong&gt; Docker Desktop, the Docker engine, the Open WebUI container image, and the Ollama connection each update independently. When something breaks - and it will - you're debugging across container boundaries. Is it Docker VM networking? A Python dependency inside the image? A port mapping conflict?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a Mac citizen.&lt;/strong&gt; No Spotlight indexing of conversations. No menu bar presence. No native notifications. No Keychain for credentials. Open WebUI runs in a browser tab that looks and feels like what it is: a Linux web application inside a virtual machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Native alternatives
&lt;/h2&gt;

&lt;p&gt;Three options connect to Ollama without containers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollama's own app
&lt;/h3&gt;

&lt;p&gt;Shipped in early 2026. Minimal: single conversation view, model selector, text input. No conversation history across sessions. No voice, no vision, no pipelines. Think of it as a calculator for language models - open, ask, close.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollamac Pro
&lt;/h3&gt;

&lt;p&gt;Third-party native Mac app built in SwiftUI. Conversation history, multiple model support, clean interface. One-time purchase. Deliberately scoped to multi-turn text chat and nothing more.&lt;/p&gt;

&lt;h3&gt;
  
  
  ToolPiper
&lt;/h3&gt;

&lt;p&gt;Native Swift app that bundles llama.cpp directly - same models, same GGUF format, same Metal GPU speed. Also connects to Ollama as an external provider, so existing models appear alongside the built-in engine.&lt;/p&gt;

&lt;p&gt;Beyond chat: voice conversation (STT + LLM + TTS chained locally), visual pipelines, per-model resource monitoring, 136 MCP tools, browser automation, OCR, RAG. The tradeoff is more surface area to learn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-side comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Open WebUI (Docker)&lt;/th&gt;
&lt;th&gt;Ollama App (Native)&lt;/th&gt;
&lt;th&gt;ToolPiper (Native)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Install steps&lt;/td&gt;
&lt;td&gt;5-7&lt;/td&gt;
&lt;td&gt;0 (built in)&lt;/td&gt;
&lt;td&gt;1 (Mac App Store)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM beyond models&lt;/td&gt;
&lt;td&gt;2-4GB&lt;/td&gt;
&lt;td&gt;~20MB&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first chat&lt;/td&gt;
&lt;td&gt;10-15 min&lt;/td&gt;
&lt;td&gt;~2 min&lt;/td&gt;
&lt;td&gt;~60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start&lt;/td&gt;
&lt;td&gt;30-45s&lt;/td&gt;
&lt;td&gt;Under 1s&lt;/td&gt;
&lt;td&gt;Under 1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update mechanism&lt;/td&gt;
&lt;td&gt;4 layers&lt;/td&gt;
&lt;td&gt;Ships with Ollama&lt;/td&gt;
&lt;td&gt;Auto-update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice (STT + TTS)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual pipelines&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource monitoring&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Per-model memory + GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-user&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When Docker still makes sense
&lt;/h2&gt;

&lt;p&gt;Server deployments and multi-user setups. Open WebUI in Docker gives you user accounts, shared conversations, and role-based access. That's a real use case native apps aren't designed for.&lt;/p&gt;

&lt;p&gt;Linux environments. Docker runs natively on Linux without the Hypervisor VM. The RAM overhead drops from gigabytes to megabytes. The performance tax that makes Docker a poor fit on macOS barely exists on Linux.&lt;/p&gt;

&lt;p&gt;Existing infrastructure. If your team already runs a Docker Compose stack, one more container is marginal cost. That's a pragmatic reason, not a technical one, and it's valid.&lt;/p&gt;

&lt;p&gt;For a single person on a Mac who wants to talk to local models, Docker is overhead that doesn't earn its keep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-no-docker-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full comparison with detailed walkthroughs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>macos</category>
      <category>docker</category>
      <category>ai</category>
    </item>
    <item>
      <title>Fix Ollama CORS Errors on Mac: One Environment Variable</title>
      <dc:creator>Ben Racicot</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:59:55 +0000</pubDate>
      <link>https://forem.com/benracicot/fix-ollama-cors-errors-on-mac-one-environment-variable-1b88</link>
      <guid>https://forem.com/benracicot/fix-ollama-cors-errors-on-mac-one-environment-variable-1b88</guid>
      <description>&lt;p&gt;You pointed a web app at &lt;code&gt;localhost:11434&lt;/code&gt; and got nothing back. The browser console shows a CORS policy error. Ollama blocked your request on purpose.&lt;/p&gt;

&lt;p&gt;The fix is one environment variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;launchctl setenv OLLAMA_ORIGINS &lt;span class="s2"&gt;"*"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pkill Ollama&lt;span class="p"&gt;;&lt;/span&gt; open &lt;span class="nt"&gt;-a&lt;/span&gt; Ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sets &lt;code&gt;OLLAMA_ORIGINS&lt;/code&gt; and restarts Ollama. Works in about five seconds.&lt;/p&gt;

&lt;p&gt;This doesn't persist across reboots. For a permanent fix, add an export to &lt;code&gt;~/.zshrc&lt;/code&gt; - noting that shell config only reaches Ollama processes started from that shell (&lt;code&gt;ollama serve&lt;/code&gt;); the menu bar app never reads it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_ORIGINS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To scope it to specific origins instead of a wildcard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;launchctl setenv OLLAMA_ORIGINS &lt;span class="s2"&gt;"http://localhost:4200,http://localhost:3000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comma-separated, no spaces. Each origin is an exact match including port.&lt;/p&gt;
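&lt;p&gt;Exact match means scheme, host, and port must all agree. A toy checker illustrating the rule (this mirrors the behavior, not Ollama's actual implementation):&lt;/p&gt;

```python
def origin_allowed(origin, allowed):
    """Exact-match origin check: scheme, host, and port all count."""
    entries = [entry.strip() for entry in allowed.split(",")]
    return "*" in entries or origin in entries

allowed = "http://localhost:4200,http://localhost:3000"
origin_allowed("http://localhost:4200", allowed)    # True
origin_allowed("http://localhost:5173", allowed)    # False: port differs
origin_allowed("https://localhost:4200", allowed)   # False: scheme differs
```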

&lt;h2&gt;
  
  
  Verify it worked
&lt;/h2&gt;

&lt;p&gt;Open your browser's developer console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/api/tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your model list should print. If you still get a CORS error, Ollama hasn't picked up the new variable - kill and restart: &lt;code&gt;pkill Ollama; open -a Ollama&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Ollama serves on &lt;code&gt;localhost:11434&lt;/code&gt; without CORS headers. When a browser makes a cross-origin request, it sends a preflight &lt;code&gt;OPTIONS&lt;/code&gt; request. Ollama responds without &lt;code&gt;Access-Control-Allow-Origin&lt;/code&gt;, and the browser kills the actual request before it fires.&lt;/p&gt;

&lt;p&gt;This is a security decision, not a bug. Ollama's API can load and unload models - a destructive operation you don't want arbitrary webpages triggering. Shipping without CORS means every browser-based client has to opt in explicitly.&lt;/p&gt;

&lt;p&gt;The tradeoff is real. Security by default is the right call for an API server. But it means every new user hits the same wall on their first day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common gotchas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setting disappears after reboot.&lt;/strong&gt; &lt;code&gt;launchctl setenv&lt;/code&gt; doesn't persist across macOS restarts. The &lt;code&gt;~/.zshrc&lt;/code&gt; export is the permanent fix for terminal-launched Ollama; for the menu bar app, re-run the &lt;code&gt;launchctl setenv&lt;/code&gt; command after a reboot or automate it with a login item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homebrew Ollama ignores shell variables.&lt;/strong&gt; If you installed via &lt;code&gt;brew services start ollama&lt;/code&gt;, it runs as a launchd service that doesn't inherit your shell environment. Use &lt;code&gt;launchctl setenv&lt;/code&gt; at the system level, or edit the Homebrew plist directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CORS is fixed but requests hang.&lt;/strong&gt; If the browser error is gone but responses take forever, Ollama might be loading a model on first request. Large models (7B+) take 3-5 seconds to load from disk. Wait for the first response - subsequent requests will be fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the wildcard safe?&lt;/strong&gt; For local development, yes. Ollama only listens on &lt;code&gt;localhost&lt;/code&gt; by default - remote machines can't reach it regardless of the CORS setting. The wildcard allows any webpage on your machine to make requests to the API, which is fine for development. Only scope to specific origins if you've bound Ollama to &lt;code&gt;0.0.0.0&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;If managing environment variables per-tool isn't your idea of a good time - &lt;a href="https://modelpiper.com?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;ToolPiper&lt;/a&gt; bundles llama.cpp directly with CORS headers built in. Same GGUF models, same Metal GPU acceleration, zero configuration. It also connects to your existing Ollama instance, so you can use both without choosing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelpiper.com/blog/ollama-cors-fix-mac?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ollama-cluster" rel="noopener noreferrer"&gt;Full article with more edge cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>macos</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
