<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: thehwang</title>
    <description>The latest articles on Forem by thehwang (@thehwang).</description>
    <link>https://forem.com/thehwang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923429%2F6a0a283b-ca79-41a4-90ca-9bbb2e4d8bfd.png</url>
      <title>Forem: thehwang</title>
      <link>https://forem.com/thehwang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thehwang"/>
    <language>en</language>
    <item>
      <title>Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.</title>
      <dc:creator>thehwang</dc:creator>
      <pubDate>Wed, 20 May 2026 20:23:25 +0000</pubDate>
      <link>https://forem.com/thehwang/gemma-4-wrote-three-summaries-in-one-response-the-middle-one-was-a-self-disclaimer-3pj9</link>
      <guid>https://forem.com/thehwang/gemma-4-wrote-three-summaries-in-one-response-the-middle-one-was-a-self-disclaimer-3pj9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The short version, in case the title was being coy:&lt;/strong&gt; at &lt;code&gt;num_ctx=2048&lt;/code&gt;, Gemma 4 E2B produces three sequential outputs in a single response — a mostly-hallucinated meeting summary, a &lt;code&gt;Note:&lt;/code&gt; saying that summary isn't actually in the transcript, then a more careful retry. Three runs at &lt;code&gt;temperature=0.0&lt;/code&gt;, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A couple of weeks ago I published &lt;a href="https://dev.to/thehwang/i-asked-gemma-4-to-summarize-it-said-the-transcript-looked-truncated-it-was-right-4pff"&gt;a post for the Gemma 4 Challenge&lt;/a&gt; with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.&lt;/p&gt;

&lt;p&gt;Then two engineers showed up in the comments and politely set me on fire.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/dannwaneri"&gt;&lt;strong&gt;Daniel Nwaneri&lt;/strong&gt;&lt;/a&gt; pointed out that "mix of unrelated topics" is a &lt;em&gt;content&lt;/em&gt; claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph &lt;em&gt;from the same session&lt;/em&gt; as the cleaner control before declaring victory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/wildeconforce"&gt;&lt;strong&gt;vericum&lt;/strong&gt;&lt;/a&gt; asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of &lt;em&gt;vibes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: At &lt;code&gt;num_ctx=32768&lt;/code&gt;, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the &lt;code&gt;num_ctx=2048&lt;/code&gt; setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ablation
&lt;/h2&gt;

&lt;p&gt;Six rows, length-matched within ~15%. &lt;code&gt;temperature=0.0&lt;/code&gt;. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Syntactic&lt;/th&gt;
&lt;th&gt;Semantic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Full 5K-token transcript&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mid-session paragraph from row 1&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;td&gt;mid-stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Row 2, cut mid-word at "rare earth ma-"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;broken&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;mid-stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Wikipedia paragraph on the Antikythera mechanism&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Tail of row 1 — mid-conversation, no opening&lt;/td&gt;
&lt;td&gt;whole&lt;/td&gt;
&lt;td&gt;mid-stream&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
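
&lt;p&gt;For readers who want the shape of the run plan without opening the repo, here is a minimal Python sketch of it. It is not the harness's actual code: the model tag &lt;code&gt;gemma4:e2b&lt;/code&gt;, the row names other than &lt;code&gt;row1&lt;/code&gt;, and both helper functions are assumptions for illustration. The request body follows the options format of Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint, and nothing is sent over the network.&lt;/p&gt;

```python
import json

# Hypothetical model tag; check `ollama list` for the real name on your machine.
MODEL = "gemma4:e2b"


def build_request(prompt, num_ctx, temperature=0.0):
    """Return the JSON body for one non-streaming call to Ollama's /api/generate."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def run_matrix(rows, num_ctx, repeats=3):
    """Expand row names into (row, run_index, num_ctx) tuples, one per call."""
    return [(row, run, num_ctx) for row in rows for run in range(1, repeats + 1)]


# Rows 2, 3, 4 and 6 at the large window, then row 1 at the small one.
plan = run_matrix(["row2", "row3", "row4", "row6"], 32768) + run_matrix(["row1"], 2048)
print(len(plan))
print(json.dumps(build_request("Summarize this transcript: ...", 2048), indent=2))
```

&lt;p&gt;The plan comes out to 15 calls, which is where the run count comes from.&lt;/p&gt;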

&lt;p&gt;Four hypotheses, increasingly specific. &lt;strong&gt;H1&lt;/strong&gt; length artifact. &lt;strong&gt;H2&lt;/strong&gt; "damaged input as a class." &lt;strong&gt;H3&lt;/strong&gt; the model distinguishes syntactic from semantic damage. &lt;strong&gt;H4&lt;/strong&gt; tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;At &lt;code&gt;num_ctx=32768&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;Hedged?&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (3/3)&lt;/td&gt;
&lt;td&gt;Confident summaries every time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (3/3)&lt;/td&gt;
&lt;td&gt;Syntactic damage alone: nothing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (3/3)&lt;/td&gt;
&lt;td&gt;Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (3/3)&lt;/td&gt;
&lt;td&gt;The H4-killer. Even the shape closest to what &lt;code&gt;num_ctx=2048&lt;/code&gt; truncation produces — nothing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.&lt;/p&gt;

&lt;p&gt;Then I ran row 1 — the original full transcript — at &lt;code&gt;num_ctx=2048&lt;/code&gt;, the original article's setting. Three runs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;num_ctx&lt;/th&gt;
&lt;th&gt;Hedged?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 ×3&lt;/td&gt;
&lt;td&gt;2,048&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;yes&lt;/strong&gt; (3/3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same configuration, same hedge, every time. Now we have to talk about &lt;em&gt;how&lt;/em&gt; it hedged, because this is the part I missed in the original article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;num_ctx=2048&lt;/code&gt; actually produces
&lt;/h2&gt;

&lt;p&gt;My classifier said "no hedge" on the first pass, because the model &lt;em&gt;did&lt;/em&gt; produce a structured &lt;code&gt;SUMMARY:&lt;/code&gt; / &lt;code&gt;ACTION ITEMS:&lt;/code&gt; block, and the classifier treated that block as the sign of a confident, unhedged answer. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.&lt;/p&gt;

&lt;p&gt;Abbreviated trace (all three runs identical in structure):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;SUMMARY:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Office/Location Update:**&lt;/span&gt; ...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Remote Work Policy:**&lt;/span&gt; ...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Team Development:**&lt;/span&gt; ...

ACTION ITEMS:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**David:**&lt;/span&gt; Schedule the team development discussions.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Team:**&lt;/span&gt; Prepare for the upcoming offsite ...

&lt;span class="gs"&gt;**Note:**&lt;/span&gt; The provided transcript does not contain the
information listed in the summary or action items above.

&lt;span class="ge"&gt;**&lt;/span&gt;Based &lt;span class="ge"&gt;*only*&lt;/span&gt; on the provided transcript, here is a
summary of the key decisions and action items:&lt;span class="ge"&gt;**&lt;/span&gt;

&lt;span class="gs"&gt;**Key Discussion Points:**&lt;/span&gt;
&lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="gs"&gt;**Office/Location:**&lt;/span&gt; ... (implied by the context of the
  meeting, though the specific details are not fully
  detailed in the provided snippet).

&lt;span class="gs"&gt;**Note:**&lt;/span&gt; The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear about what just happened: that's three passes inside one response.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A confident, templated summary that is &lt;strong&gt;mostly hallucinated&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A note from the model saying, in its own words, that the above is not in the transcript.&lt;/li&gt;
&lt;li&gt;A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at &lt;code&gt;num_ctx=2048&lt;/code&gt; and never once at &lt;code&gt;num_ctx=32768&lt;/code&gt;.&lt;/p&gt;
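
&lt;p&gt;A classifier that only checks for the template will miss this, so it helps to look at what comes &lt;em&gt;after&lt;/em&gt; the template. The sketch below is one way to do that in Python. The marker patterns are assumptions taken from the abbreviated trace above, not the rules in the harness's &lt;code&gt;classify.py&lt;/code&gt;, and they would likely need loosening for other prompts or models.&lt;/p&gt;

```python
import re

# Marker patterns are assumptions drawn from the trace, not the harness's rules.
TEMPLATE = re.compile(r"^SUMMARY:", re.MULTILINE)
DISCLAIMER = re.compile(r"^\*\*Note:\*\*.*does not contain", re.MULTILINE)
RETRY = re.compile(r"Based \*?only\*? on the provided transcript", re.IGNORECASE)


def classify(output):
    """Label an output by which of the three passes it contains, in order."""
    template = TEMPLATE.search(output)
    if template is None:
        return "no-template"
    # Only count a disclaimer that appears after the templated summary.
    disclaimer = DISCLAIMER.search(output, template.end())
    if disclaimer is None:
        return "template-only"
    # Only count a retry that appears after the disclaimer.
    retry = RETRY.search(output, disclaimer.end())
    return "hallucinate-disclaim-retry" if retry else "template-then-disclaimer"


sample = """SUMMARY:
- **Remote Work Policy:** ...

ACTION ITEMS:
- **David:** Schedule the team development discussions.

**Note:** The provided transcript does not contain the
information listed in the summary or action items above.

**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**
"""
print(classify(sample))
```

&lt;p&gt;Searching from the end of the previous match is the design choice that matters here: it enforces the order of the three passes instead of just checking that each marker shows up somewhere.&lt;/p&gt;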

&lt;h2&gt;
  
  
  What I now think (and what I deliberately don't)
&lt;/h2&gt;

&lt;p&gt;This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at &lt;code&gt;temperature=0.0&lt;/code&gt;, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.&lt;/p&gt;

&lt;p&gt;I do not know — and this ablation does not tell us — whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Three different mechanisms; I'd not bet against any of them.&lt;/p&gt;

&lt;p&gt;Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. But the hedge only fires inside one very specific configuration, which means it's conditioned on more than the input alone.&lt;/p&gt;

&lt;p&gt;I was wrong that the model is doing general semantic input evaluation. The honest version: "at &lt;code&gt;num_ctx=2048&lt;/code&gt;, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Corrections, the harness, the people
&lt;/h2&gt;

&lt;p&gt;I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.&lt;/p&gt;

&lt;p&gt;Harness: &lt;a href="https://github.com/thehwang/Scripta/tree/main/benchmarks/calibration-ablation" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/calibration-ablation/&lt;/code&gt;&lt;/a&gt; in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/thehwang/Scripta &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;Scripta/benchmarks/calibration-ablation
bash run.sh                            &lt;span class="c"&gt;# rows 2, 3, 4, 6 at num_ctx=32768&lt;/span&gt;
&lt;span class="nv"&gt;NUM_CTX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2048 bash run.sh &lt;span class="nt"&gt;--rows&lt;/span&gt; row1   &lt;span class="c"&gt;# the configuration-deterministic case&lt;/span&gt;
python3 classify.py &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; classification-report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? &lt;strong&gt;vericum&lt;/strong&gt; is already planning an RTX 4060 8GB replication: different VRAM envelope, same questions.&lt;/p&gt;

&lt;p&gt;This post exists because &lt;a class="mentioned-user" href="https://dev.to/dannwaneri"&gt;@dannwaneri&lt;/a&gt; and &lt;a class="mentioned-user" href="https://dev.to/wildeconforce"&gt;@wildeconforce&lt;/a&gt; read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.&lt;/p&gt;

&lt;p&gt;I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Harness + raw outputs + classification report: &lt;a href="https://github.com/thehwang/Scripta/tree/main/benchmarks/calibration-ablation" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/calibration-ablation/&lt;/code&gt;&lt;/a&gt;. Original article: &lt;a href="https://dev.to/thehwang/i-asked-gemma-4-to-summarize-it-said-the-transcript-looked-truncated-it-was-right-4pff"&gt;"I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>llm</category>
      <category>ollama</category>
      <category>ablation</category>
    </item>
    <item>
      <title>I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right.</title>
      <dc:creator>thehwang</dc:creator>
      <pubDate>Tue, 19 May 2026 13:42:44 +0000</pubDate>
      <link>https://forem.com/thehwang/i-asked-gemma-4-to-summarize-it-said-the-transcript-looked-truncated-it-was-right-4pff</link>
      <guid>https://forem.com/thehwang/i-asked-gemma-4-to-summarize-it-said-the-transcript-looked-truncated-it-was-right-4pff</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Correction (May 20, 2026):&lt;/strong&gt; The framing in this post — that Gemma 4 E2B "detected" damaged input and pushed back on it as a general behavior — is too strong. A 15-run ablation, designed in response to comments from &lt;a href="https://dev.to/dannwaneri"&gt;@dannwaneri&lt;/a&gt; and &lt;a href="https://dev.to/wildeconforce"&gt;@wildeconforce&lt;/a&gt;, shows the hedging behavior is &lt;strong&gt;configuration-deterministic&lt;/strong&gt; on &lt;code&gt;num_ctx=2048&lt;/code&gt; specifically, not a general semantic-input-quality signal. Full write-up + falsification: &lt;a href="https://dev.to/thehwang/gemma-4-wrote-three-summaries-in-one-response-the-middle-one-was-a-self-disclaimer-3pj9"&gt;"Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer."&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;Scripta&lt;/a&gt; is a 100% local macOS meeting transcriber. It captures microphone + system audio in two parallel channels, transcribes them in real time with &lt;code&gt;whisper.cpp&lt;/code&gt; and &lt;code&gt;SFSpeechRecognizer&lt;/code&gt;, and uses a local LLM via Ollama to produce a summary — never sending a byte of meeting audio or text off your machine.&lt;/p&gt;

&lt;p&gt;I shipped Scripta as v3.1.0 a few weeks ago. v3.2.0, released today, adds &lt;strong&gt;Gemma 4 E2B&lt;/strong&gt; as a recommended model, surfaces the model's context window in the picker, and — almost by accident — fixes a bug that had silently been compressing every previous Scripta summary down to the last five minutes of the meeting.&lt;/p&gt;

&lt;p&gt;The combined story is what this post is about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkpqu42bpj0qiq8dclc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkpqu42bpj0qiq8dclc.png" alt="Scripta's model picker showing Gemma 4 E2B selected with NEW badge and 128K context indicator" width="628" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/owW2F3VU_n0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;90-second walkthrough: pick Gemma 4 E2B in Settings → record a short&lt;br&gt;
clip with mic + system audio in two channels → click Summarize → watch&lt;br&gt;
the streaming summary use the model's full 128K context window&lt;br&gt;
(&lt;code&gt;num_ctx=131072&lt;/code&gt; confirmed in the debug log).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install on your own machine in one line (macOS 14+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To pre-download Gemma 4 during install instead of from the in-app picker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | &lt;span class="nv"&gt;SCRIPTA_INSTALL_GEMMA4&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;github.com/thehwang/Scripta&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latest release:&lt;/strong&gt; &lt;a href="https://github.com/thehwang/Scripta/releases/latest" rel="noopener noreferrer"&gt;v3.2.1 (latest)&lt;/a&gt; — Gemma 4 integration shipped in &lt;a href="https://github.com/thehwang/Scripta/releases/tag/v3.2.0" rel="noopener noreferrer"&gt;v3.2.0&lt;/a&gt;; v3.2.1 is a UX patch on top&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration commit:&lt;/strong&gt; &lt;a href="https://github.com/thehwang/Scripta/commit/c211678" rel="noopener noreferrer"&gt;&lt;code&gt;c211678&lt;/code&gt; — Integrate Gemma 4 and fix Ollama context window truncation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark harness:&lt;/strong&gt; &lt;a href="https://github.com/thehwang/Scripta/commit/4281a0f" rel="noopener noreferrer"&gt;&lt;code&gt;4281a0f&lt;/code&gt; — Add benchmark harness for model + context comparison&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole change is 163 lines added across 5 Swift files, 1 shell script, and an Info.plist bump. The benchmark commit adds a synthetic fixture + reproducible script so anyone can verify the findings below on their own hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I chose &lt;strong&gt;Gemma 4 E2B&lt;/strong&gt; (the 2-billion-effective-parameter variant, 7.2 GB on disk, 128K context window). Three reasons, in order of weight:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. 128K context = no chunking on real meetings
&lt;/h3&gt;

&lt;p&gt;Scripta's job is to summarize a transcript that arrives in chunks during a meeting, and then let you ask follow-up questions about it afterward. A typical 60-minute meeting transcript is ~15,000 words, or ~20K tokens. Most popular 3B-class models offer 32K context (Qwen 2.5) or 128K (Llama 3.2, Gemma 4), so the meeting fits with room to spare in any of them.&lt;/p&gt;

&lt;p&gt;Where Gemma 4 separates itself is the &lt;em&gt;consistency&lt;/em&gt; of its 128K window: it's a first-class window, not a long-context retrofit. Multi-hour meetings, all-day workshops, and "summarize this entire week of standups" prompts all fit in one pass without chunking infrastructure. For a one-developer side project, "no chunking" is huge: chunking + map-reduce + merging is its own ML engineering rabbit hole.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. E2B fits alongside Whisper on a 16 GB Mac
&lt;/h3&gt;

&lt;p&gt;Scripta is built for ordinary developer machines, not workstations. On a 16 GB unified-memory MacBook or Mac mini, the working set during a recording includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;whisper-base&lt;/code&gt; model (~150 MB resident)&lt;/li&gt;
&lt;li&gt;Swift app + audio pipeline (~400 MB resident)&lt;/li&gt;
&lt;li&gt;Browser tabs, IDE, Slack, etc. (whatever else is open)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That leaves roughly 9–11 GB of headroom. E2B at 7.2 GB fits cleanly. E4B at 9.6 GB technically fits but pushes the system into swap territory the moment a video call also wants memory. The 31B Dense model isn't a candidate — its inference speed on Apple Silicon at consumer RAM levels is too slow for a usable summary experience.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;E2B vs E4B&lt;/strong&gt; decision is therefore not "which is better" but "which is reliable on the hardware Scripta actually runs on." E2B is the recommended default; E4B is offered as an opt-in for users with 32 GB+.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The reasoning behavior caught me off guard (in a good way)
&lt;/h3&gt;

&lt;p&gt;This is the discovery I genuinely didn't expect from a 2-billion-effective-parameter model, and it's a major reason I'm now confident in Gemma 4 as a default for non-trivial summarization tasks.&lt;/p&gt;

&lt;p&gt;When I first ran Gemma 4 against Scripta's existing prompt path — which (it turns out) was capped at 2,048 tokens of context due to an Ollama default — Gemma 4 didn't just produce a worse summary. It &lt;strong&gt;told the user the transcript looked truncated&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The provided transcript seems to be a mix of several unrelated topics, making it difficult to extract a single, coherent summary based on the provided text alone. ... If you are looking for a summary of the &lt;em&gt;actual&lt;/em&gt; conversation content, please provide the relevant transcript."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the model recognizing that the context it received doesn't match a plausible meeting structure. Qwen 2.5 3B, faced with the same truncated input, just confidently produced a wrong summary based on the trailing Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;calibration&lt;/strong&gt; — knowing what you don't know — is what makes Gemma 4 useful for production summaries, not just benchmark wins.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bug I uncovered while integrating Gemma 4
&lt;/h3&gt;

&lt;p&gt;This isn't a bug in Ollama — &lt;code&gt;num_ctx=2048&lt;/code&gt; is the &lt;a href="https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size-the-model-uses" rel="noopener noreferrer"&gt;documented default&lt;/a&gt;, and plenty of Ollama users know it. &lt;strong&gt;The bug was on my side&lt;/strong&gt;: Scripta's Ollama call had no &lt;code&gt;num_ctx&lt;/code&gt; parameter at all, so every model I called — Gemma, Llama, Qwen — was silently working with 2,048 tokens of context regardless of the model's actual capability.&lt;/p&gt;

&lt;p&gt;Combined with a 3,000-character hard truncation in &lt;code&gt;buildPrompt()&lt;/code&gt; left over from an early prototype, &lt;strong&gt;every Scripta summary before v3.2.0 was generated from at most the last five minutes of audio&lt;/strong&gt;. A 60-minute meeting compressed to the last ~750 tokens of the transcript.&lt;/p&gt;

&lt;p&gt;What this article is really about isn't the default. It's how I &lt;em&gt;noticed&lt;/em&gt;: Gemma 4 pushed back on the truncated transcript before I'd realized anything was wrong (see the earlier quote). Most models in this parameter class would have confidently produced a worse summary; this one detected an input it couldn't trust.&lt;/p&gt;

&lt;p&gt;The fix is in &lt;code&gt;SummaryService.swift&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before:&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"num_predict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// No num_ctx → Ollama defaults to 2048.&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;// After:&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;contextTokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SummaryModelManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contextWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"num_predict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"num_ctx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;contextTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Now uses the model's real capability.&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a dynamic truncation in &lt;code&gt;buildPrompt()&lt;/code&gt; that uses the available tokens for the actual transcript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;availableTokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contextTokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// reserve 1,200 tokens for template + output&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;maxChars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableTokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;// ~3.5 chars/token (mixed languages)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
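
&lt;p&gt;To see what those two lines yield at different context sizes, here is the same arithmetic as a small Python sketch. The constants come from the Swift snippet above; the function name and the list of context sizes are mine, chosen for illustration.&lt;/p&gt;

```python
RESERVED_TOKENS = 1200   # template + output, as in the Swift snippet
MIN_TOKENS = 1500        # floor from max(1_500, ...)
CHARS_PER_TOKEN = 3.5    # rough estimate used by the Swift snippet


def max_transcript_chars(context_tokens):
    """Mirror the two Swift lines: token budget first, then a character cap."""
    available = max(MIN_TOKENS, context_tokens - RESERVED_TOKENS)
    return int(available * CHARS_PER_TOKEN)


for ctx in (2048, 8192, 32768, 131072):
    print(ctx, max_transcript_chars(ctx))
```

&lt;p&gt;At Ollama's 2,048-token default the 1,500-token floor applies, so the cap is 5,250 characters; at 131,072 tokens it is 454,552 characters.&lt;/p&gt;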



&lt;p&gt;The &lt;code&gt;contextWindow(for:)&lt;/code&gt; function lives in &lt;code&gt;SummaryModelManager.swift&lt;/code&gt; and knows every recommended model's true context window, with a heuristic fallback for user-pulled models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;contextWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nv"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;known&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recommendedModels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;known&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contextTokens&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lowercased&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gemma4"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"llama3.2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;131_072&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"qwen2.5"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"qwen3"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;32_768&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;8_192&lt;/span&gt;   &lt;span class="c1"&gt;// Conservative fallback, still 4x Ollama's default.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
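
&lt;p&gt;To make the budgeting step concrete, here is a minimal sketch of how a context window might be turned into a character cap. The &lt;code&gt;ContextBudget&lt;/code&gt; type and both reserve sizes are assumptions for illustration, not Scripta's actual code; only the ~3.5 chars/token heuristic comes from the snippet above.&lt;/p&gt;

```swift
// Hypothetical sketch: type name and reserve sizes are assumptions.
struct ContextBudget {
    let contextTokens: Int   // e.g. the value returned by contextWindow(for:)
    let promptReserve: Int   // assumed tokens set aside for the instructions
    let outputReserve: Int   // assumed tokens set aside for the generated summary

    // Tokens left for the transcript itself, clamped so it is never negative.
    var availableTokens: Int {
        max(0, contextTokens - promptReserve - outputReserve)
    }

    // Character cap using the ~3.5 chars/token heuristic.
    var maxChars: Int {
        Int(Double(availableTokens) * 3.5)
    }
}

let small = ContextBudget(contextTokens: 2_048, promptReserve: 256, outputReserve: 1_024)
let large = ContextBudget(contextTokens: 32_768, promptReserve: 256, outputReserve: 1_024)

print(small.maxChars)  // 2688
print(large.maxChars)  // 110208
```

&lt;p&gt;With these assumed reserves, the 2048-token window leaves room for under 3,000 characters of transcript, which is a few minutes of speech at most.&lt;/p&gt;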



&lt;h3&gt;
  
  
  Benchmark — how dramatic is "before" vs "after"?
&lt;/h3&gt;

&lt;p&gt;I built a benchmark harness (&lt;a href="https://github.com/thehwang/Scripta/blob/main/scripts/benchmark_models.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/benchmark_models.sh&lt;/code&gt;&lt;/a&gt;) that runs any installed Ollama model at any &lt;code&gt;num_ctx&lt;/code&gt; against a fixed transcript and records wall-clock latency, tokens per second, and the raw summary text. The transcript (&lt;a href="https://github.com/thehwang/Scripta/blob/main/benchmarks/synthetic-transcript.md" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/synthetic-transcript.md&lt;/code&gt;&lt;/a&gt;) is a fully fictional 60-minute all-hands meeting for an invented company called Atlas Robotics — no real meeting data is committed to the repository.&lt;/p&gt;

&lt;p&gt;The transcript contains five segments, each with specific, distinct content:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Segment 1 (CEO opening):&lt;/strong&gt; Q2 ARR $4.2M, headcount 47, new VP Engineering Marcus Reyes, Cambridge office move&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segment 2 (Engineering):&lt;/strong&gt; Project Lighthouse launch July 15, 3x perception perf improvement, 5 named hires, tech debt items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segment 3 (Product):&lt;/strong&gt; Three new logos (Boeing, Amazon, FedEx), Toyota loss, pricing 15% increase, voice control + multi-robot roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segment 4 (CS):&lt;/strong&gt; Renewal rate 94%, NPS 67, documentation overhaul, 2 SE hires&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segment 5 (Closing):&lt;/strong&gt; Q3 priorities, Series B prep, Engineer of the Quarter (Priya Sharma), Q&amp;amp;A&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A good summary should mention most of these. A bad summary mentions only items from whichever part of the transcript survives truncation to &lt;code&gt;num_ctx&lt;/code&gt;, which in the runs below is the tail end of the meeting.&lt;/p&gt;
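
&lt;p&gt;One way to turn that criterion into a number is a keyword coverage check. This is a hypothetical helper, not part of the benchmark harness: the keyword list is lifted from the segment list above, and a plain substring match will miss paraphrases, so treat the score as a rough signal only.&lt;/p&gt;

```swift
import Foundation

// Hypothetical keyword list drawn from the five transcript segments.
let expectedTopics = [
    "$4.2M", "Marcus Reyes", "Lighthouse", "July 15", "Boeing",
    "Amazon", "FedEx", "94%", "Series B", "Priya Sharma"
]

// Returns the keywords found in the summary and the fraction covered.
func coverage(of summary: String, expecting keywords: [String]) -> (hits: [String], score: Double) {
    let haystack = summary.lowercased()
    let hits = keywords.filter { haystack.contains($0.lowercased()) }
    let score = keywords.isEmpty ? 0 : Double(hits.count) / Double(keywords.count)
    return (hits, score)
}

let sample = "Q2 ARR reached $4.2M. Project Lighthouse launches July 15. New logos: Boeing, Amazon, FedEx."
let result = coverage(of: sample, expecting: expectedTopics)
print("covered \(result.hits.count) of \(expectedTopics.count) topics")
```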

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;num_ctx&lt;/th&gt;
&lt;th&gt;Wall&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Topics correctly captured&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:3b&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;15.2s&lt;/td&gt;
&lt;td&gt;47.9&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;Only segment 5 (Q&amp;amp;A: RTO policy, interns, pricing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:e2b&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;106.9s¹&lt;/td&gt;
&lt;td&gt;41.7&lt;/td&gt;
&lt;td&gt;267&lt;/td&gt;
&lt;td&gt;Hedged; &lt;strong&gt;flagged transcript as incomplete&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:3b&lt;/td&gt;
&lt;td&gt;32768&lt;/td&gt;
&lt;td&gt;25.7s&lt;/td&gt;
&lt;td&gt;39.3&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;td&gt;ARR, Marcus joining, pricing; missed Lighthouse + logos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemma4:e2b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;32768&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;49.2s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;752&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ARR, three logos by name, Lighthouse + date, Series B, all action items&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;¹ Gemma 4's first invocation includes ~80s cold model load; subsequent runs are roughly half this wall clock.&lt;/p&gt;

&lt;p&gt;The qualitative story is what matters more than the raw numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;At &lt;code&gt;num_ctx=2048&lt;/code&gt; (Ollama's default that I was silently using), &lt;strong&gt;Qwen 2.5 confidently produced a wrong summary&lt;/strong&gt; — listing the RTO policy Q&amp;amp;A as one of three "key points discussed" in a meeting where the actual headlines were $4.2M ARR, Project Lighthouse, and a Series B prep announcement. Gemma 4 detected the problem and pushed back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At &lt;code&gt;num_ctx=32768&lt;/code&gt; (still well within both models' capabilities), &lt;strong&gt;Gemma 4 produced the most useful summary&lt;/strong&gt; — mentioning Boeing, Amazon, and FedEx by name, Project Lighthouse with its July 15 launch date, and the Series B prep that was the most strategic item in the meeting. Qwen 2.5 at the same context missed those.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full qualitative analysis with each model's actual summary output is in &lt;a href="https://github.com/thehwang/Scripta/blob/main/benchmarks/findings.md" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/findings.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproduce in 5 minutes
&lt;/h3&gt;

&lt;p&gt;You don't have to take my word for any of this. The benchmark harness is checked in — clone the repo and run it on your own hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/thehwang/Scripta &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;Scripta
ollama pull gemma4:e2b

&lt;span class="c"&gt;# Stock Ollama default — reproduces the broken case.&lt;/span&gt;
&lt;span class="nv"&gt;MODELS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma4:e2b"&lt;/span&gt; &lt;span class="nv"&gt;NUM_CTX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2048 bash scripts/benchmark_models.sh &lt;span class="se"&gt;\&lt;/span&gt;
    benchmarks/synthetic-transcript.md

&lt;span class="c"&gt;# Same model, full context — reproduces the fixed case.&lt;/span&gt;
&lt;span class="nv"&gt;MODELS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemma4:e2b"&lt;/span&gt; &lt;span class="nv"&gt;NUM_CTX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32768 bash scripts/benchmark_models.sh &lt;span class="se"&gt;\&lt;/span&gt;
    benchmarks/synthetic-transcript.md

&lt;span class="c"&gt;# Compare the two summaries side by side.&lt;/span&gt;
diff &lt;span class="nt"&gt;-y&lt;/span&gt; benchmarks/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="nt"&gt;-ctx2048&lt;/span&gt;/gemma4:e2b.txt &lt;span class="se"&gt;\&lt;/span&gt;
        benchmarks/&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="nt"&gt;-ctx32768&lt;/span&gt;/gemma4:e2b.txt | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first run produces a hedged summary that flags the transcript as truncated. The second produces the actual 60-minute meeting summary — &lt;code&gt;$4.2M Q2 ARR&lt;/code&gt;, &lt;code&gt;Marcus Reyes&lt;/code&gt;, &lt;code&gt;Boeing/Amazon/FedEx&lt;/code&gt;, &lt;code&gt;Project Lighthouse launching July 15&lt;/code&gt;. On a 16 GB M-series Mac the whole thing takes about 3 minutes including the cold Gemma 4 load.&lt;/p&gt;

&lt;p&gt;If you want to compare every model on your machine, drop the &lt;code&gt;MODELS=&lt;/code&gt; filter and the script runs &lt;code&gt;qwen2.5:3b&lt;/code&gt;, &lt;code&gt;qwen2.5:1.5b&lt;/code&gt;, &lt;code&gt;llama3.2:3b&lt;/code&gt;, &lt;code&gt;llama3.2:1b&lt;/code&gt;, &lt;code&gt;gemma4:e2b&lt;/code&gt;, and &lt;code&gt;gemma4:e4b&lt;/code&gt; against the same transcript.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus — testing Gemma 4's vision at E2B size: a calibration finding
&lt;/h3&gt;

&lt;p&gt;Gemma 4 is multimodal at every size. Scripta's text path is what ships in v3.2 today, but a meeting tool whose user is also looking at slides during the call has an obvious multimodal extension: cross-reference what's on the deck against what was actually said. So I tested it.&lt;/p&gt;

&lt;p&gt;The setup: I generated a fake Q2 all-hands slide for the same Atlas Robotics meeting the benchmark transcript covers, and intentionally seeded it with &lt;strong&gt;two inconsistencies&lt;/strong&gt; vs what was said in the room:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric on slide&lt;/th&gt;
&lt;th&gt;Slide value&lt;/th&gt;
&lt;th&gt;Transcript value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing increase&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Lighthouse launch&lt;/td&gt;
&lt;td&gt;July 22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;July 15&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2r9jk3znx5zqpnlv6eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2r9jk3znx5zqpnlv6eu.png" alt="Q2 all-hands slide with two intentional inconsistencies vs the transcript" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I fed both the slide image and the transcript to Gemma 4 E2B via Ollama's &lt;code&gt;/api/generate&lt;/code&gt; with &lt;code&gt;images: [...]&lt;/code&gt;. The full driver script is in &lt;a href="https://github.com/thehwang/Scripta/blob/main/benchmarks/multimodal/run.sh" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/multimodal/run.sh&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash benchmarks/multimodal/run.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
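
&lt;p&gt;For readers who would rather call the endpoint from Swift, the request body can be sketched as below. This is an illustrative sketch, not a port of &lt;code&gt;run.sh&lt;/code&gt;: the function name, the prompt text, and the &lt;code&gt;num_ctx&lt;/code&gt; value are assumptions. The field names (&lt;code&gt;model&lt;/code&gt;, &lt;code&gt;prompt&lt;/code&gt;, &lt;code&gt;images&lt;/code&gt;, &lt;code&gt;stream&lt;/code&gt;, &lt;code&gt;options&lt;/code&gt;) follow Ollama's &lt;code&gt;/api/generate&lt;/code&gt; API, which expects each image as a base64 string.&lt;/p&gt;

```swift
import Foundation

// Builds a JSON body for Ollama's /api/generate with one attached image.
// Function name and parameter choices are hypothetical.
func makeGenerateBody(model: String, prompt: String, imageData: Data, numCtx: Int) throws -> Data {
    let payload: [String: Any] = [
        "model": model,
        "prompt": prompt,
        "images": [imageData.base64EncodedString()],  // base64-encoded image bytes
        "stream": false,                              // ask for a single JSON response
        "options": ["num_ctx": numCtx]
    ]
    return try JSONSerialization.data(withJSONObject: payload, options: [.sortedKeys])
}

// Placeholder bytes standing in for a real slide image.
let fakeImage = Data([0x89, 0x50, 0x4E, 0x47])
let requestBody = try makeGenerateBody(
    model: "gemma4:e2b",
    prompt: "Compare the slide with the transcript.",
    imageData: fakeImage,
    numCtx: 32_768
)
print(String(data: requestBody, encoding: .utf8) ?? "")
```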



&lt;p&gt;&lt;strong&gt;Run 1 — loose prompt&lt;/strong&gt; ("identify any inconsistencies"). Excerpt from the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric:        Pricing Change
Slide:         20%
Transcript:    "Effective September first, we are raising list price
               by fifteen percent across the SKU set."
Likely truth:  The transcript states a 15% price increase, which
               contradicts the 20% figure displayed on the slide.

Metric:        Customer Wins
Slide:         22                          ← fabricated, not on slide
Transcript:    "...closed three of the four new logos."
Likely truth:  Three new logos, contradicting "22" on the slide.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E2B caught the pricing mismatch correctly — read "20%" from the slide image, retrieved the transcript's "fifteen percent" quote verbatim, and called the contradiction. That's a real, useful capability.&lt;/p&gt;

&lt;p&gt;In the same run it missed the July 22 vs July 15 date discrepancy in the Roadmap column entirely, and fabricated a "Customer Wins: 22" metric that does not appear anywhere on the slide (which just lists "Boeing, Amazon, FedEx" as new logos). The final summary line then read "No inconsistencies found. (Note: While there are numerical discrepancies between the transcript and the slide... )" — the model literally contradicted itself in a parenthetical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run 2 — strict grounded prompt&lt;/strong&gt; (&lt;code&gt;STRICT_PROMPT=1 bash benchmarks/multimodal/run.sh&lt;/code&gt;). I tightened the prompt to force the model to first enumerate only values visually present on the slide, then quote the transcript verbatim, then issue a &lt;code&gt;MATCH | MISMATCH | NOT MENTIONED&lt;/code&gt; verdict. Output excerpt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Item:        List Price Increase Percentage
Slide:       fifteen percent              ← wrong; slide actually shows 20%
Transcript:  "...we are raising list price by fifteen percent..."
Verdict:     MATCH

Item:        Lighthouse Launch Date
Slide:       July fifteen                 ← wrong; slide actually shows July 22
Transcript:  "Voice control launches with Lighthouse on July fifteen."
Verdict:     MATCH

Total mismatches: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strict prompt overcorrected. With the slide image present but the (much larger) transcript dominating the prompt's attention, the model effectively &lt;em&gt;stopped looking at the slide&lt;/em&gt; — it filled the "Slide:" field with whatever the transcript said and labelled everything MATCH. Both planted inconsistencies surfaced as false negatives. The same run hallucinated 30+ additional rows for items that aren't on the slide at all (Cambridge office details, NPS Q1 baseline, deployment time targets) — confabulated by reading the transcript and pretending those things were rendered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest read.&lt;/strong&gt; At 2B effective parameters, Gemma 4's vision is &lt;strong&gt;useful as a first-pass scanner for obvious numeric mismatches&lt;/strong&gt; (Run 1 caught one real planted inconsistency on the first try with no tuning) but &lt;strong&gt;not yet reliable enough to be the only check&lt;/strong&gt; at this size — it has two failure modes that pull in opposite directions and a sharper prompt cannot fix both at once. Production-quality slide-vs-discussion auditing on local hardware probably needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A bigger vision tower&lt;/strong&gt; — E4B (9.6 GB) likely shifts the failure floor up; the 31B Dense model further still. Both are out of reach for Scripta's 16 GB target machine while Whisper, the audio pipeline, and a browser are also resident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Or a hybrid pipeline&lt;/strong&gt; — OCR the slide first, then do the cross-reference as a pure text-vs-text task that the same E2B handles confidently (see the calibration behavior from earlier in this post).&lt;/li&gt;
&lt;/ul&gt;
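
&lt;p&gt;The verdict step of such a hybrid pipeline is simple once both sides are plain text. The sketch below is hypothetical and skips the hard parts: it assumes the slide values (for example from OCR) and the transcript values have already been extracted and normalized to the same form upstream. The names &lt;code&gt;Verdict&lt;/code&gt; and &lt;code&gt;crossCheck&lt;/code&gt; are invented for illustration.&lt;/p&gt;

```swift
// Hypothetical verdict step for a text-only slide-vs-transcript check.
enum Verdict: String {
    case match = "MATCH"
    case mismatch = "MISMATCH"
    case notMentioned = "NOT MENTIONED"
}

// Compares each slide item with the transcript value for the same key.
func crossCheck(slide: [String: String], transcript: [String: String]) -> [String: Verdict] {
    var result: [String: Verdict] = [:]
    for (item, slideValue) in slide {
        if let said = transcript[item] {
            result[item] = (said == slideValue) ? .match : .mismatch
        } else {
            result[item] = .notMentioned
        }
    }
    return result
}

let slideValues = [
    "Pricing increase": "20%",
    "Project Lighthouse launch": "July 22",
    "New logos": "Boeing, Amazon, FedEx",
    "Example-only item": "n/a"  // invented entry to exercise NOT MENTIONED
]
let transcriptValues = [
    "Pricing increase": "15%",
    "Project Lighthouse launch": "July 15",
    "New logos": "Boeing, Amazon, FedEx"
]

let verdicts = crossCheck(slide: slideValues, transcript: transcriptValues)
for item in slideValues.keys.sorted() {
    print("\(item): \(verdicts[item]?.rawValue ?? "?")")
}
```

&lt;p&gt;On the two planted inconsistencies this returns &lt;code&gt;MISMATCH&lt;/code&gt; for both, because the comparison never has to look at pixels.&lt;/p&gt;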

&lt;p&gt;This is the kind of capability ceiling that's easy to miss in a five-minute demo and obvious once you actually try to use the output for anything, and it's why Scripta v3.2 ships the text path only. Wiring multimodal into the summary loop is a v3.3 question, and its prerequisite is solving this grounding fragility rather than writing more code. The infrastructure to capture screen-share frames already exists in Scripta: system audio is captured via &lt;code&gt;ScreenCaptureKit&lt;/code&gt;, and the same &lt;code&gt;SCStream&lt;/code&gt; can vend video samples. The bottleneck is the model behavior I just measured, not the plumbing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Honest tradeoffs of choosing E2B
&lt;/h3&gt;

&lt;p&gt;Picking E2B is not a free upgrade over a 3B Qwen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~4× larger download.&lt;/strong&gt; 7.2 GB vs 1.9 GB for &lt;code&gt;qwen2.5:3b&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~30% slower throughput.&lt;/strong&gt; 27 tok/s vs 39 tok/s on the same hardware. A 60-second summary becomes a roughly 87-second summary (60 × 39 / 27).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longer cold start.&lt;/strong&gt; The first inference includes ~80 seconds of model load. Once the model is loaded, later runs start immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tradeoffs are why I left the default at &lt;code&gt;qwen2.5:3b&lt;/code&gt; and made Gemma 4 a one-click opt-in from the picker (with a "NEW" badge and a &lt;code&gt;128K ctx&lt;/code&gt; indicator to surface the differentiation). Users who care most about speed and disk get the default; users who care most about quality and long meetings get Gemma 4. That's the kind of choice judges look for when they say "intentional model selection."&lt;/p&gt;

&lt;h3&gt;
  
  
  What changes for Scripta users
&lt;/h3&gt;

&lt;p&gt;For Scripta specifically, Gemma 4 + the &lt;code&gt;num_ctx&lt;/code&gt; fix turns a previously broken-but-no-one-noticed feature into the headline feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A real 60-minute meeting now produces a real 60-minute summary&lt;/strong&gt;, not a summary of the last 5 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long meetings (2+ hours) fit in a single Gemma 4 pass&lt;/strong&gt;, no chunking required, no merging artifacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat-with-transcript&lt;/strong&gt; (the existing "ask a question about the meeting" feature) can now actually answer questions about what was discussed in the first half hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a tool whose pitch is "100% local meeting transcription with AI summaries," that's the difference between a demo and a product.&lt;/p&gt;




&lt;p&gt;If you want to try it: download the &lt;a href="https://github.com/thehwang/Scripta/releases/latest" rel="noopener noreferrer"&gt;latest release&lt;/a&gt; or run the one-line installer. Pull Gemma 4 from the in-app picker, click Record, and verify the debug log shows &lt;code&gt;Summary: model=gemma4:e2b ctx=131072 ...&lt;/code&gt;. That one log line means your Mac is now actually running Gemma 4 with its full 128K (131,072-token) context window.&lt;/p&gt;

&lt;p&gt;Thanks to the Ollama, whisper.cpp, and Gemma 4 teams for shipping the building blocks that made this possible to put together as a side-project, on a laptop, in a weekend.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>macos</category>
    </item>
    <item>
      <title>Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit</title>
      <dc:creator>thehwang</dc:creator>
      <pubDate>Tue, 12 May 2026 14:17:01 +0000</pubDate>
      <link>https://forem.com/thehwang/building-a-100-local-meeting-transcription-app-for-macos-with-whispercpp-and-screencapturekit-33m7</link>
      <guid>https://forem.com/thehwang/building-a-100-local-meeting-transcription-app-for-macos-with-whispercpp-and-screencapturekit-33m7</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built Scripta — a dual-channel meeting recorder that transcribes your mic and system audio in real-time, generates AI summaries, and never sends a byte to the cloud.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I spend 2–3 hours a day on Teams and Zoom calls. By the end of the day, I can barely remember who committed to what. I tried cloud transcription services — Otter.ai, Fireflies, Granola — but my company's security policy doesn't allow meeting audio to leave the corporate network.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Scripta&lt;/strong&gt;: an open-source macOS app that records both sides of a meeting, transcribes everything in real-time, and generates AI summaries — all running entirely on your Mac. Zero cloud requests. Zero subscriptions. Zero data exfiltration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/screenshots%2Ffull_mode.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/screenshots%2Ffull_mode.png" alt="Scripta full mode" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;github.com/thehwang/Scripta&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dual-Channel Problem
&lt;/h2&gt;

&lt;p&gt;Most transcription apps work with a single audio stream. That's fine for podcasts, but in a meeting you have two distinct audio sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your microphone&lt;/strong&gt; — your voice, physically entering the mic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System audio&lt;/strong&gt; — the remote participants, coming out of Teams/Zoom/Meet through the OS audio mixer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you mix them into one stream, you lose the ability to label who said what. And if you try to run two speech recognition tasks on separate streams using Apple's &lt;code&gt;SFSpeechRecognizer&lt;/code&gt;, you get a fun surprise: &lt;code&gt;kAFAssistantErrorDomain Code=1101&lt;/code&gt; — Apple's speech framework silently refuses to run two recognition tasks concurrently.&lt;/p&gt;

&lt;p&gt;The solution I landed on uses two completely different ASR engines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐     ┌──────────────────┐
│   Microphone     │     │  System Audio     │
│  (AVAudioEngine) │     │ (ScreenCaptureKit)│
└────────┬────────┘     └────────┬─────────┘
         │                       │
    whisper.cpp             SFSpeechRecognizer
    (Metal GPU)             (Apple on-device)
         │                       │
         └───── Transcript ──────┘
                    │
              Local Ollama LLM
                    │
              AI Summary + Chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mic → whisper.cpp&lt;/strong&gt;: The Whisper model runs locally with Metal acceleration. The &lt;code&gt;base&lt;/code&gt; model (142 MB) achieves &amp;gt;15x real-time on Apple Silicon — 5 seconds of audio transcribed in ~0.3 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System audio → SFSpeechRecognizer&lt;/strong&gt;: Apple's on-device speech recognition handles the remote audio. It works well with compressed VoIP audio and doesn't compete for GPU resources with Whisper.&lt;/p&gt;

&lt;p&gt;This hybrid approach avoids the &lt;code&gt;SFSpeechRecognizer&lt;/code&gt; concurrency error (&lt;code&gt;Code=1101&lt;/code&gt;) while keeping everything on-device.&lt;/p&gt;




&lt;h2&gt;
  
  
  Capturing System Audio with ScreenCaptureKit
&lt;/h2&gt;

&lt;p&gt;Before macOS 13, capturing system audio from a specific app required hacks: virtual audio devices like BlackHole, aggregate devices, or kernel extensions. ScreenCaptureKit changed this entirely.&lt;/p&gt;

&lt;p&gt;The key insight: ScreenCaptureKit can capture &lt;strong&gt;audio only&lt;/strong&gt; — you don't need to record the screen at all. Set the video dimensions to 2×2 pixels and enable audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCStreamConfiguration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capturesAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excludesCurrentProcessAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;// prevent feedback loops&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampleRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16_000&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channelCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="c1"&gt;// minimal video — we only want audio&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;excludesCurrentProcessAudio = true&lt;/code&gt; is critical — without it, any sounds your app plays would get captured and create an echo loop.&lt;/p&gt;

&lt;p&gt;The catch: ScreenCaptureKit requires &lt;strong&gt;Screen Recording&lt;/strong&gt; permission, even though we're not recording the screen. On macOS 15, self-signed apps frequently fail to acquire this permission through the normal TCC prompt. Users often need to manually add the app in System Settings → Privacy &amp;amp; Security → Screen Recording. This is the single biggest friction point in the user experience, and there's no programmatic workaround.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating whisper.cpp into a Swift App
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ggerganov/whisper.cpp" rel="noopener noreferrer"&gt;whisper.cpp&lt;/a&gt; provides a clean C API that's straightforward to bridge into Swift — no Objective-C++ needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Static Library
&lt;/h3&gt;

&lt;p&gt;The Makefile clones whisper.cpp, builds it with CMake (Metal enabled), and merges all the resulting &lt;code&gt;.a&lt;/code&gt; files into a single static library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-S&lt;/span&gt; vendor/whisper.cpp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DCMAKE_OSX_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arm64"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DBUILD_SHARED_LIBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DWHISPER_BUILD_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF

cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release

libtool &lt;span class="nt"&gt;-static&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; libwhisper.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/src/libwhisper.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml-base.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml-cpu.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/ggml-metal/libggml-metal.a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Swift Bridging via module.modulemap
&lt;/h3&gt;

&lt;p&gt;Instead of a bridging header, I used a Swift Package Manager &lt;code&gt;systemLibrary&lt;/code&gt; target with a &lt;code&gt;module.modulemap&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="n"&gt;CWhisper&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="s"&gt;"whisper.h"&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="s"&gt;"whisper"&lt;/span&gt;
    &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets Swift code &lt;code&gt;import CWhisper&lt;/code&gt; directly and call &lt;code&gt;whisper_init_from_file_with_params&lt;/code&gt;, &lt;code&gt;whisper_full&lt;/code&gt;, etc. as regular C functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sliding Window Transcription
&lt;/h3&gt;

&lt;p&gt;Real-time transcription with Whisper requires chunking the audio stream. I use a &lt;strong&gt;5-second sliding window with 1-second overlap&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;chunkDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;overlapDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;processNextChunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampleBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkSamples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;sampleBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeFirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkSamples&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlapSamples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;transcribeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overlap prevents words at chunk boundaries from being cut off. Each chunk is processed on a background &lt;code&gt;DispatchQueue&lt;/code&gt; — while one chunk is being transcribed, the next is accumulating.&lt;/p&gt;
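
&lt;p&gt;To see how the window advances, here is a self-contained sketch that runs the same prefix-then-remove step over a dummy buffer. It assumes 16 kHz mono input, which is the rate Whisper models expect; the buffer size and the &lt;code&gt;chunkStarts&lt;/code&gt; bookkeeping are made up for illustration and are not Scripta's code.&lt;/p&gt;

```swift
// Assumption: 16 kHz mono samples.
let sampleRate = 16_000.0
let windowSeconds = 5.0
let overlapSeconds = 1.0
let chunkSamples = Int(windowSeconds * sampleRate)     // 80_000 samples per window
let overlapSamples = Int(overlapSeconds * sampleRate)  // 16_000 samples shared between windows
let stride = chunkSamples - overlapSamples             // each window advances 64_000 samples

// Dummy audio: 12.5 seconds of silence.
var sampleBuffer = [Float](repeating: 0, count: 200_000)
var chunkStarts: [Int] = []
var consumed = 0

// Drain every full window; leftover samples wait for more audio.
while sampleBuffer.count >= chunkSamples {
    let chunk = Array(sampleBuffer.prefix(chunkSamples))
    chunkStarts.append(consumed)
    sampleBuffer.removeFirst(stride)
    consumed += stride
    _ = chunk  // a real app would hand this window to transcription
}

print(chunkStarts)         // [0, 64000]
print(sampleBuffer.count)  // 72000
```

&lt;p&gt;Each window starts 4 seconds after the previous one, so the final second of one chunk is heard again as the first second of the next.&lt;/p&gt;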

&lt;p&gt;Noise filtering is important: Whisper tends to hallucinate on silence, producing segments like &lt;code&gt;[MUSIC]&lt;/code&gt;, &lt;code&gt;(silence)&lt;/code&gt;, or &lt;code&gt;Thank you.&lt;/code&gt; when there's no actual speech. A simple pattern-matching filter catches these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;isNoiseSegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;trimmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimmingCharacters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;whitespacesAndNewlines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"["&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSuffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"]"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"("&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSuffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;")"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;noisePatterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"music"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"silence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"blank"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no speech"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"thank you"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;noisePatterns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lowercased&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
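
&lt;p&gt;One caveat with substring matching: a real sentence that happens to contain "thank you" or "music" would be dropped too. If that becomes a problem, a stricter variant could treat a segment as noise only when the whole segment matches a known phrase. The sketch below is a hypothetical alternative, not what Scripta ships; &lt;code&gt;isWholeSegmentNoise&lt;/code&gt; is a name made up for illustration.&lt;/p&gt;

```swift
import Foundation

// Hypothetical stricter variant: a segment counts as noise only when the
// whole segment, ignoring case and surrounding punctuation, is a known phrase.
func isWholeSegmentNoise(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    // Bracketed or parenthesized annotations such as [MUSIC] or (silence).
    if trimmed.hasPrefix("["), trimmed.hasSuffix("]") { return true }
    if trimmed.hasPrefix("("), trimmed.hasSuffix(")") { return true }
    let normalized = trimmed
        .lowercased()
        .trimmingCharacters(in: .punctuationCharacters)
    let noisePhrases = Set(["music", "silence", "blank", "no speech", "thank you"])
    return noisePhrases.contains(normalized)
}

let dropsBareThanks = isWholeSegmentNoise("Thank you.")
let keepsRealSentence = !isWholeSegmentNoise("Thank you all for joining.")
```

&lt;p&gt;The trade-off is that an exact match misses longer hallucinated phrases, so the phrase list would likely need tuning against real recordings.&lt;/p&gt;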






&lt;h2&gt;
  
  
  The Voice Processing IO Saga
&lt;/h2&gt;

&lt;p&gt;When you're in a meeting using speakers rather than headphones, the system audio plays through the speakers and gets picked up by the microphone. The mic transcription ends up containing the remote participant's words, which defeats the whole purpose of dual-channel separation.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;Voice Processing IO&lt;/strong&gt;, macOS's built-in acoustic echo cancellation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;inputNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setVoiceProcessingEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line of code. Three days of debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: The 9-Channel Format
&lt;/h3&gt;

&lt;p&gt;Enabling Voice Processing IO silently changes the microphone's output format from the expected mono/stereo to &lt;strong&gt;9 channels&lt;/strong&gt;. I couldn't find this documented anywhere. My &lt;code&gt;AVAudioConverter&lt;/code&gt;, which was converting the mic audio from its native format to 16 kHz mono for Whisper, started crashing with &lt;code&gt;EXC_BAD_ACCESS&lt;/code&gt; on the real-time audio thread.&lt;/p&gt;

&lt;p&gt;The fix: bypass &lt;code&gt;AVAudioConverter&lt;/code&gt; entirely. Extract channel 0 manually and resample with linear interpolation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ch0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatChannelData&lt;/span&gt;&lt;span class="p"&gt;?[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;targetRate&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampleRate&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nv"&gt;repeating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frameCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;srcIdx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;idx0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcIdx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcIdx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frameCount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not the most elegant DSP, but it doesn't crash on the audio thread, which is more than &lt;code&gt;AVAudioConverter&lt;/code&gt; can claim.&lt;/p&gt;
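
&lt;p&gt;For experimenting outside the audio pipeline, the same interpolation can be written as a pure function over a plain array. This is a sketch under assumptions: &lt;code&gt;resampleLinear&lt;/code&gt; is a hypothetical name, it takes &lt;code&gt;[Float]&lt;/code&gt; rather than an &lt;code&gt;AVAudioPCMBuffer&lt;/code&gt;, and like the snippet above it applies no low-pass filter, so downsampling can alias.&lt;/p&gt;

```swift
// Hypothetical standalone version of the linear-interpolation resampler.
// No anti-aliasing filter is applied before downsampling.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    guard !input.isEmpty, sourceRate > 0, targetRate > 0 else { return [] }
    let ratio = targetRate / sourceRate
    var output = [Float](repeating: 0, count: Int(Double(input.count) * ratio))
    for i in output.indices {
        // Position of output sample i on the input timeline.
        let srcIdx = Double(i) / ratio
        let idx0 = min(Int(srcIdx), input.count - 1)
        // Clamp the right neighbor so the last sample interpolates with itself.
        let idx1 = min(idx0 + 1, input.count - 1)
        let frac = Float(srcIdx - Double(idx0))
        output[i] = input[idx0] + frac * (input[idx1] - input[idx0])
    }
    return output
}

let halved = resampleLinear([0, 1, 2, 3], from: 48_000, to: 24_000)
let doubled = resampleLinear([0, 2], from: 1, to: 2)
```

&lt;p&gt;Halving the rate keeps every second sample, and doubling it inserts midpoints between neighbors, which makes the function easy to sanity-check with small inputs before wiring it to real buffers.&lt;/p&gt;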

&lt;h3&gt;
  
  
  Pitfall 2: System Audio Ducking
&lt;/h3&gt;

&lt;p&gt;After enabling Voice Processing IO, users reported that system volume suddenly dropped during recording. Voice Processing IO automatically &lt;strong&gt;ducks&lt;/strong&gt; (reduces volume of) other audio sources to help with echo cancellation. This also affected ScreenCaptureKit's capture — the system audio recordings were nearly silent at -51 dB.&lt;/p&gt;

&lt;p&gt;The fix (macOS 14+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="n"&gt;inputNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voiceProcessingOtherAudioDuckingConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;enableAdvancedDucking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;duckingLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pitfall 3: Silent Audio Files
&lt;/h3&gt;

&lt;p&gt;The same 9-channel issue that crashed &lt;code&gt;AVAudioConverter&lt;/code&gt; for Whisper also broke audio file recording. The &lt;code&gt;writeMicAudio&lt;/code&gt; function was using a converter to downmix the mic buffer to 1-channel AAC, but converting 9-channel real-time audio to mono AAC was silently producing empty frames. The resulting &lt;code&gt;.m4a&lt;/code&gt; files were the right duration but contained silence (-91 dB).&lt;/p&gt;

&lt;p&gt;The fix was the same manual channel extraction used for Whisper: extract channel 0, resample, write directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;Apple's Voice Processing IO documentation is essentially nonexistent. The 9-channel behavior, the ducking side effect, the interaction with &lt;code&gt;AVAudioConverter&lt;/code&gt; — none of this is documented. I found most of it through crash logs and &lt;code&gt;mplog()&lt;/code&gt; statements. If you're building anything with Voice Processing IO, budget extra time for audio format debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Local AI with Ollama
&lt;/h2&gt;

&lt;p&gt;For AI summaries and chat, Scripta connects to a local &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; instance. The integration is deliberately simple — a POST request to &lt;code&gt;localhost:11434&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Streaming summary generation&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;OllamaRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Summarize this meeting transcript...&lt;/span&gt;&lt;span class="se"&gt;\n\n\(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
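
&lt;p&gt;For readers who want to see the request on the wire, here is a rough sketch of how such a POST could be assembled for Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint. &lt;code&gt;GenerateRequest&lt;/code&gt; and &lt;code&gt;makeGenerateRequest&lt;/code&gt; are hypothetical stand-ins; Scripta's own &lt;code&gt;OllamaRequest&lt;/code&gt; may carry more fields.&lt;/p&gt;

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Hypothetical request type; Scripta's OllamaRequest may differ.
struct GenerateRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

// Builds a JSON POST to Ollama's /api/generate endpoint on the default local port.
func makeGenerateRequest(model: String, transcript: String) throws -> URLRequest {
    let body = GenerateRequest(
        model: model,
        prompt: "Summarize this meeting transcript...\n\n\(transcript)",
        stream: true
    )
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(body)
    return request
}
```

&lt;p&gt;Keeping the body as a small &lt;code&gt;Codable&lt;/code&gt; struct means the encoded JSON can be decoded back in a unit test without a running Ollama instance.&lt;/p&gt;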



&lt;p&gt;The response streams token by token and is displayed in real time in the UI. After the summary completes, users can ask follow-up questions through the Ask AI chat panel, which supports multi-turn conversations with the transcript as system context.&lt;/p&gt;
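
&lt;p&gt;When streaming is on, Ollama sends newline-delimited JSON, one object per line, each with a &lt;code&gt;response&lt;/code&gt; fragment and a &lt;code&gt;done&lt;/code&gt; flag. A minimal sketch of folding those lines back into text might look like the following; &lt;code&gt;GenerateChunk&lt;/code&gt; and &lt;code&gt;collectSummary&lt;/code&gt; are hypothetical names, and only the two fields used here are decoded.&lt;/p&gt;

```swift
import Foundation

// One line of the streamed response. The real objects carry more fields;
// only the two needed here are decoded.
struct GenerateChunk: Decodable {
    let response: String
    let done: Bool
}

// Hypothetical helper: concatenates the fragments of a newline-delimited
// JSON stream, skipping lines that fail to decode.
func collectSummary(fromNDJSON text: String) -> String {
    let decoder = JSONDecoder()
    var summary = ""
    for line in text.split(separator: "\n") {
        guard let chunk = try? decoder.decode(GenerateChunk.self, from: Data(line.utf8)) else {
            continue
        }
        summary += chunk.response
        if chunk.done { break }
    }
    return summary
}

let sampleStream = "{\"response\":\"Hel\",\"done\":false}\n{\"response\":\"lo\",\"done\":true}\n"
let summaryText = collectSummary(fromNDJSON: sampleStream)
```

&lt;p&gt;In a live app the same decoding would run once per line as it arrives, for example while iterating the &lt;code&gt;lines&lt;/code&gt; sequence from &lt;code&gt;URLSession.bytes(for:)&lt;/code&gt;, appending each fragment to the UI instead of a local string.&lt;/p&gt;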

&lt;p&gt;The default model is &lt;code&gt;qwen2.5:3b&lt;/code&gt;. It is small enough to run on any Apple Silicon Mac, it is multilingual, and it produces surprisingly good meeting summaries. The install script handles Ollama installation, service startup, and model download automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  UX: Two Display Modes
&lt;/h2&gt;

&lt;p&gt;Scripta offers two modes for different workflows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full mode&lt;/strong&gt; is the main interface — transcript panel, AI summary, chat sidebar, recording controls, translation settings. This is where you review meetings after they end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy6rl505cte97ur1gwuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy6rl505cte97ur1gwuy.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal mode&lt;/strong&gt; is a floating caption bar that stays on top of other windows. During a meeting, you switch to minimal mode and keep working while live captions scroll through:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6whmc59upgc8xintucnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6whmc59upgc8xintucnc.png" alt=" " width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mic mute button works like Teams/Zoom — instant toggle, no pipeline teardown. The audio engine keeps running; the mute flag simply tells the tap callback to skip forwarding samples to Whisper and the audio writer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Distribution Without the App Store
&lt;/h2&gt;

&lt;p&gt;Scripta uses ScreenCaptureKit, communicates with Ollama on localhost, and links against a custom whisper.cpp static library. That combination is awkward to ship under App Store sandboxing and review rules.&lt;/p&gt;

&lt;p&gt;Instead, I distribute through GitHub Releases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions CI&lt;/strong&gt; builds for macOS 14 and macOS 15 and applies an ad-hoc signature (&lt;code&gt;codesign --sign "-"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;curl | bash&lt;/code&gt; installer&lt;/strong&gt; downloads the latest release, runs &lt;code&gt;xattr -cr&lt;/code&gt; to clear the Gatekeeper quarantine flag, installs Ollama, pulls the AI model, and downloads the Whisper model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One command&lt;/strong&gt;: &lt;code&gt;curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;xattr -cr&lt;/code&gt; step is what makes ad-hoc signed apps work without a paid Apple Developer ID. It clears the &lt;code&gt;com.apple.quarantine&lt;/code&gt; extended attribute that macOS adds to downloaded files. Combined with the ad-hoc signature (which satisfies code integrity checks), this lets the app run without the "unidentified developer" warning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few things I want to build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speaker diarization&lt;/strong&gt; — cluster voice embeddings to distinguish Speaker 1, 2, 3 instead of just "Remote"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-app auto-update&lt;/strong&gt; — check GitHub Releases API on launch, download and replace via install script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper model selection&lt;/strong&gt; — let users choose between &lt;code&gt;tiny&lt;/code&gt; (fast, less accurate) and &lt;code&gt;small&lt;/code&gt;/&lt;code&gt;medium&lt;/code&gt; (slower, better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export formats&lt;/strong&gt; — SRT subtitles, JSON with timestamps, integration with note-taking apps&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Scripta is open-source under the MIT license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;github.com/thehwang/Scripta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find it useful, a star on GitHub would mean a lot. Issues, PRs, and feedback are all welcome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on macOS with Swift, whisper.cpp, ScreenCaptureKit, SFSpeechRecognizer, and Ollama. No cloud required.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>swift</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
