Forem: thehwang

Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.

thehwang — Wed, 20 May 2026 20:23:25 +0000

The short version, in case the title was being coy: at num_ctx=2048, Gemma 4 E2B produces three sequential outputs in a single response — a mostly-hallucinated meeting summary, a Note: saying that summary isn't actually in the transcript, then a more careful retry. Three runs at temperature=0.0, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.

A couple of weeks ago I published a post for the Gemma 4 Challenge with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.

Then two engineers showed up in the comments and politely set me on fire.

Daniel Nwaneri pointed out that "mix of unrelated topics" is a content claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph from the same session as the cleaner control before declaring victory.

vericum asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of vibes.

So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.

TL;DR: At num_ctx=32768, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the num_ctx=2048 setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.

The ablation

Six rows, length-matched within ~15%. temperature=0.0. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.

Row	Content	Syntactic	Semantic
1	Full 5K-token transcript	whole	whole
2	Mid-session paragraph from row 1	whole	mid-stream
3	Row 2, cut mid-word at "rare earth ma-"	broken	mid-stream
4	Wikipedia paragraph on the Antikythera mechanism	whole	whole
6	Tail of row 1 — mid-conversation, no opening	whole	mid-stream

Four hypotheses, increasingly specific. H1 length artifact. H2 "damaged input as a class." H3 the model distinguishes syntactic from semantic damage. H4 tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.
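For readers who want to picture the run matrix, here is a minimal Python sketch of how a rows-by-runs grid could be sent to Ollama's /api/generate endpoint. It is not the harness from the repo: the helper names (`build_request`, `run_once`, `plan_runs`), the row labels, and the localhost URL (Ollama's default port) are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: Ollama's default local endpoint

def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming /api/generate payload with temperature 0."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0, "num_ctx": num_ctx},
    }

def run_once(payload: dict) -> str:
    """POST one payload to a running Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def plan_runs(rows: dict, num_ctx: int, runs_per_row: int = 3) -> list:
    """Expand {row_label: prompt} into (row_label, run_index, payload) tuples."""
    return [
        (label, i, build_request("gemma4:e2b", prompt, num_ctx))
        for label, prompt in rows.items()
        for i in range(runs_per_row)
    ]
```

Four rows at three runs each expand to twelve payloads; calling `run_once` on each one needs a local Ollama server with the model already pulled.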

The result

At num_ctx=32768:

Row	Hedged?	Notes
2	no (3/3)	Confident summaries every time.
3	no (3/3)	Syntactic damage alone: nothing.
4	no (3/3)	Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably.
6	no (3/3)	The H4-killer. Even the shape closest to what `num_ctx=2048` truncation produces — nothing.

That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.

Then I ran row 1 — the original full transcript — at num_ctx=2048, the original article's setting. Three runs.

Row	num_ctx	Hedged?
1 ×3	2,048	yes (3/3)

Same configuration, same hedge, every time. Now we have to talk about how it hedged, because this is the part I missed in the original article.

What `num_ctx=2048` actually produces

My classifier said "no hedge" on the first pass, because the model did produce a structured SUMMARY: / ACTION ITEMS: block, which the classifier was looking for as a positive case. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.

Abbreviated trace (all three runs identical in structure):

SUMMARY:
- **Office/Location Update:** ...
- **Remote Work Policy:** ...
- **Team Development:** ...

ACTION ITEMS:
- **David:** Schedule the team development discussions.
- **Team:** Prepare for the upcoming offsite ...

**Note:** The provided transcript does not contain the
information listed in the summary or action items above.

**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**

**Key Discussion Points:**
* **Office/Location:** ... (implied by the context of the
  meeting, though the specific details are not fully
  detailed in the provided snippet).

**Note:** The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.

To be clear about what just happened: that's three passes inside one response.

A confident, templated summary that is mostly hallucinated.
A note from the model saying, in its own words, that the above is not in the transcript.
A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."

The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at num_ctx=2048 and never once at num_ctx=32768.
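The classifier miss described above comes down to treating "has a SUMMARY: block" as proof of a clean run. A rough sketch of a check that looks for the three-pass shape instead could be written as follows. This is not the repo's classify.py; the labels, the regexes, and the hedge-word list are assumptions drawn only from the trace shown here.

```python
import re

SUMMARY_RE = re.compile(r"^SUMMARY:", re.MULTILINE)
# A disclaimer line, with or without Markdown bold, at the start of a line.
NOTE_RE = re.compile(r"^\*\*Note:\*\*|^Note:", re.MULTILINE)
HEDGE_WORDS = ("implied", "inferred", "not fully detailed")

def classify(output: str) -> str:
    """Label one model output as 'multi_pass', 'hedged', or 'clean'."""
    summary = SUMMARY_RE.search(output)
    # Only a note that follows the templated block counts as a self-disclaimer.
    note_after = NOTE_RE.search(output, summary.end()) if summary else None
    if note_after:
        return "multi_pass"
    if NOTE_RE.search(output) or any(w in output.lower() for w in HEDGE_WORDS):
        return "hedged"
    return "clean"
```

A keyword check like this is crude: a confident summary that happens to use the word "inferred" would be mislabeled, so reading the raw outputs remains the real test.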

What I now think (and what I deliberately don't)

This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at temperature=0.0, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.

I do not know — and this ablation does not tell us — whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Three different mechanisms; I'd not bet against any of them.

Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. It just only fires inside a very specific configuration, which means it's conditioned on something other than the input.

I was wrong that the model is doing general semantic input evaluation. The honest version: "at num_ctx=2048, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.

Corrections, the harness, the people

I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.

Harness: benchmarks/calibration-ablation/ in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.

git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh                            # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1   # the configuration-deterministic case
python3 classify.py > classification-report.md

Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? vericum is already planning an RTX 4060 8GB replication, different VRAM envelope, same questions.

This post exists because @dannwaneri and @wildeconforce read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.

I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.

Harness + raw outputs + classification report: benchmarks/calibration-ablation/. Original article: "I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."

I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right.

thehwang — Tue, 19 May 2026 13:42:44 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Correction (May 20, 2026): The framing in this post — that Gemma 4 E2B "detected" damaged input and pushed back on it as a general behavior — is too strong. A 15-run ablation, designed in response to comments from @dannwaneri and @wildeconforce, shows the hedging behavior is configuration-deterministic on num_ctx=2048 specifically, not a general semantic-input-quality signal. Full write-up + falsification: "Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer."

What I Built

Scripta is a 100% local macOS meeting transcriber. It captures microphone + system audio in two parallel channels, transcribes them in real time with whisper.cpp and SFSpeechRecognizer, and uses a local LLM via Ollama to produce a summary — never sending a byte of meeting audio or text off your machine.

I shipped Scripta as v3.1.0 a few weeks ago. v3.2.0, released today, adds Gemma 4 E2B as a recommended model, surfaces the model's context window in the picker, and — almost by accident — fixes a bug that had silently been compressing every previous Scripta summary down to the last five minutes of the meeting.

The combined story is what this post is about.

Demo

90-second walkthrough: pick Gemma 4 E2B in Settings → record a short
clip with mic + system audio in two channels → click Summarize → watch
the streaming summary use the model's full 128K context window
(num_ctx=131072 confirmed in the debug log).

Install on your own machine in one line (macOS 14+):

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

To pre-download Gemma 4 during install instead of from the in-app picker:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | SCRIPTA_INSTALL_GEMMA4=1 bash

Code

Repository: github.com/thehwang/Scripta
Latest release: v3.2.1 (Gemma 4 integration shipped in v3.2.0; v3.2.1 is a UX patch on top)
Integration commit: c211678 — Integrate Gemma 4 and fix Ollama context window truncation
Benchmark harness: 4281a0f — Add benchmark harness for model + context comparison

The whole change is 163 lines added across 5 Swift files, 1 shell script, and an Info.plist bump. The benchmark commit adds a synthetic fixture + reproducible script so anyone can verify the findings below on their own hardware.

How I Used Gemma 4

I chose Gemma 4 E2B (the 2-billion-effective-parameter variant, 7.2 GB on disk, 128K context window). Three reasons, in order of weight:

1. 128K context = no chunking on real meetings

Scripta's job is to summarize a transcript that arrives in chunks during a meeting and then answer follow-up questions about it afterward. A typical 60-minute meeting transcript is ~15,000 words → ~20K tokens. With most popular 3B-class models offering 32K context (Qwen 2.5) or 128K (Llama 3.2, Gemma 4), the meeting fits with room to spare for any of them.

Where Gemma 4 separates is the consistency of its 128K window: it's a first-class window, not a long-context retrofit. Multi-hour meetings, all-day workshops, and "summarize this entire week of standups" prompts all fit in one pass without chunking infrastructure. For a one-developer side project, "no chunking" is huge — chunking + map-reduce + merging is its own ML engineering rabbit hole.
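As a back-of-the-envelope check on the "fits in one pass" claim, the sketch below converts a context window into minutes of meeting, using only the estimate above (~20K tokens per 60 minutes). The function name and the optional reserve parameter are illustrative, and real token counts vary with language and tokenizer.

```python
TOKENS_PER_HOUR = 20_000  # the estimate used above for a 60-minute meeting

def minutes_that_fit(num_ctx: int, reserve_tokens: int = 0) -> float:
    """Minutes of transcript that fit in num_ctx tokens at the estimated rate."""
    tokens_per_minute = TOKENS_PER_HOUR / 60
    return max(0, num_ctx - reserve_tokens) / tokens_per_minute

for ctx in (32_768, 131_072):
    print(f"num_ctx={ctx}: ~{minutes_that_fit(ctx):.0f} min of transcript")
```

At that rate a 32K window holds roughly 98 minutes of meeting and a 128K window roughly 393 minutes (about six and a half hours), before setting aside anything for the prompt template or the output.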

2. E2B fits alongside Whisper on a 16 GB Mac

Scripta is built for ordinary developer machines, not workstations. On a 16 GB unified-memory MacBook or Mac mini, the working set during a recording includes:

whisper-base model (~150 MB resident)
Swift app + audio pipeline (~400 MB resident)
Browser tabs, IDE, Slack, etc. (whatever else is open)

That leaves roughly 9–11 GB of headroom. E2B at 7.2 GB fits cleanly. E4B at 9.6 GB technically fits but pushes the system into swap territory the moment a video call also wants memory. The 31B Dense model isn't a candidate — its inference speed on Apple Silicon at consumer RAM levels is too slow for a usable summary experience.

The E2B vs E4B decision is therefore not "which is better" but "which is reliable on the hardware Scripta actually runs on." E2B is the recommended default; E4B is offered as an opt-in for users with 32 GB+.

3. The reasoning behavior caught me off guard (in a good way)

This is the discovery I genuinely didn't expect from a 2-billion-effective-parameter model, and it's a major reason I'm now confident in Gemma 4 as a default for non-trivial summarization tasks.

When I first ran Gemma 4 against Scripta's existing prompt path — which (it turns out) was capped at 2,048 tokens of context due to an Ollama default — Gemma 4 didn't just produce a worse summary. It told the user the transcript looked truncated:

"The provided transcript seems to be a mix of several unrelated topics, making it difficult to extract a single, coherent summary based on the provided text alone. ... If you are looking for a summary of the actual conversation content, please provide the relevant transcript."

That's the model recognizing that the context it received doesn't match a plausible meeting structure. Qwen 2.5 3B, faced with the same truncated input, just confidently produced a wrong summary based on the trailing Q&A.

This calibration — knowing what you don't know — is what makes Gemma 4 useful for production summaries, not just benchmark wins.

The bug I uncovered while integrating Gemma 4

This isn't a bug in Ollama — num_ctx=2048 is the documented default, and plenty of Ollama users know it. The bug was on my side: Scripta's Ollama call had no num_ctx parameter at all, so every model I called — Gemma, Llama, Qwen — was silently working with 2,048 tokens of context regardless of the model's actual capability.

Combined with a 3,000-character hard truncation in buildPrompt() left over from an early prototype, every Scripta summary before v3.2.0 was generated from at most the last five minutes of audio. A 60-minute meeting compressed to the last ~750 tokens of the transcript.

What this article is really about isn't the default. It's how I noticed: Gemma 4 pushed back on the truncated transcript before I'd realized anything was wrong (see the earlier quote). Most models in this parameter class would have confidently produced a worse summary; this one detected an input it couldn't trust.

The fix is in SummaryService.swift:

// Before:
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        // No num_ctx → Ollama defaults to 2048.
    ]
]

// After:
let contextTokens = SummaryModelManager.contextWindow(for: modelName)
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        "num_ctx": contextTokens,  // Now uses the model's real capability.
    ]
]

Plus a dynamic truncation in buildPrompt() that uses the available tokens for the actual transcript:

let availableTokens = max(1_500, contextTokens - 1200)  // 1200 reserves for template + output
let maxChars = Int(Double(availableTokens) * 3.5)        // ~3.5 chars/token (mixed languages)
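To see what those two lines work out to, here is the same arithmetic as a small Python sketch. The constants come from the Swift above. The function names and the keep-the-tail policy in `fit_transcript` are assumptions, not a copy of buildPrompt().

```python
def max_transcript_chars(context_tokens: int,
                         reserve_tokens: int = 1_200,
                         floor_tokens: int = 1_500,
                         chars_per_token: float = 3.5) -> int:
    """Tokens left for the transcript after the reserve, expressed as characters."""
    available = max(floor_tokens, context_tokens - reserve_tokens)
    return int(available * chars_per_token)

def fit_transcript(transcript: str, context_tokens: int) -> str:
    """Keep the most recent text when the transcript exceeds the budget (assumed policy)."""
    limit = max_transcript_chars(context_tokens)
    return transcript if len(transcript) <= limit else transcript[-limit:]

for ctx in (2_048, 32_768, 131_072):
    print(ctx, max_transcript_chars(ctx))
```

The budget comes to 5,250 characters at num_ctx=2048, 110,488 at 32,768, and 454,552 at 131,072. One thing the numbers make visible: at 2,048 the 1,500-token floor plus the 1,200-token reserve adds up to more than the window, so the floor only helps once the context is larger than about 2,700 tokens.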

The contextWindow(for:) function lives in SummaryModelManager.swift and knows every recommended model's true context window, with a heuristic fallback for user-pulled models:

static func contextWindow(for modelName: String) -> Int {
    if let known = recommendedModels.first(where: { $0.name == modelName }) {
        return known.contextTokens
    }
    let lower = modelName.lowercased()
    if lower.contains("gemma4") || lower.contains("llama3.2") { return 131_072 }
    if lower.contains("qwen2.5") || lower.contains("qwen3")  { return 32_768 }
    return 8_192   // Conservative fallback, still 4x Ollama's default.
}

Benchmark — how dramatic is "before" vs "after"?

I built a benchmark harness (scripts/benchmark_models.sh) that runs any installed Ollama model at any num_ctx against a fixed transcript and records wall-clock latency, tokens per second, and the raw summary text. The transcript (benchmarks/synthetic-transcript.md) is a fully fictional 60-minute all-hands meeting for an invented company called Atlas Robotics — no real meeting data is committed to the repository.

The transcript contains five segments, each with specific, distinct content:

Segment 1 (CEO opening): Q2 ARR $4.2M, headcount 47, new VP Engineering Marcus Reyes, Cambridge office move
Segment 2 (Engineering): Project Lighthouse launch July 15, 3x perception perf improvement, 5 named hires, tech debt items
Segment 3 (Product): Three new logos (Boeing, Amazon, FedEx), Toyota loss, pricing 15% increase, voice control + multi-robot roadmap
Segment 4 (CS): Renewal rate 94%, NPS 67, documentation overhaul, 2 SE hires
Segment 5 (Closing): Q3 priorities, Series B prep, Engineer of the Quarter (Priya Sharma), Q&A

A good summary should mention most of these. A bad summary will only mention items from the segment that fits within num_ctx.
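One way to make "mentions most of these" measurable is a per-segment keyword check. The sketch below is an assumption about how that could be scored, not what benchmark_models.sh does; the fact list is a hand-picked subset of the segment contents listed above.

```python
SEGMENT_FACTS = {
    "1 CEO opening": ["$4.2M", "Marcus Reyes", "Cambridge"],
    "2 Engineering": ["Lighthouse", "July 15"],
    "3 Product": ["Boeing", "Amazon", "FedEx", "Toyota"],
    "4 CS": ["94%", "NPS"],
    "5 Closing": ["Series B", "Priya Sharma"],
}

def segment_coverage(summary: str) -> dict:
    """Map each segment to the fraction of its key facts found in the summary."""
    text = summary.lower()
    return {
        segment: sum(fact.lower() in text for fact in facts) / len(facts)
        for segment, facts in SEGMENT_FACTS.items()
    }
```

A summary built only from the tail of the transcript would score near zero on segments 1 through 4. Substring matching is blunt, though: it misses paraphrases such as "fifteen percent", so it works as a first filter rather than a verdict.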

Model	num_ctx	Wall	tok/s	Output	Topics correctly captured
qwen2.5:3b	2048	15.2s	47.9	59	Only segment 5 (Q&A: RTO policy, interns, pricing)
gemma4:e2b	2048	106.9s¹	41.7	267	Hedged; flagged transcript as incomplete
qwen2.5:3b	32768	25.7s	39.3	222	ARR, Marcus joining, pricing; missed Lighthouse + logos
gemma4:e2b	32768	49.2s	27.1	752	ARR, three logos by name, Lighthouse + date, Series B, all action items

¹ Gemma 4's first invocation includes ~80s cold model load; subsequent runs are roughly half this wall clock.

The qualitative story is what matters more than the raw numbers:

At num_ctx=2048 (Ollama's default that I was silently using), Qwen 2.5 confidently produced a wrong summary — listing the RTO policy Q&A as one of three "key points discussed" in a meeting where the actual headlines were $4.2M ARR, Project Lighthouse, and a Series B prep announcement. Gemma 4 detected the problem and pushed back.
At num_ctx=32768 (still well within both models' capabilities), Gemma 4 produced the most useful summary — mentioning Boeing, Amazon, and FedEx by name, Project Lighthouse with its July 15 launch date, and the Series B prep that was the most strategic item in the meeting. Qwen 2.5 at the same context missed those.

Full qualitative analysis with each model's actual summary output is in benchmarks/findings.md.

Reproduce in 5 minutes

You don't have to take my word for any of this. The benchmark harness is checked in — clone the repo and run it on your own hardware:

git clone https://github.com/thehwang/Scripta && cd Scripta
ollama pull gemma4:e2b

# Stock Ollama default — reproduces the broken case.
MODELS="gemma4:e2b" NUM_CTX=2048 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Same model, full context — reproduces the fixed case.
MODELS="gemma4:e2b" NUM_CTX=32768 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Compare the two summaries side by side.
diff -y benchmarks/*-ctx2048/gemma4:e2b.txt \
        benchmarks/*-ctx32768/gemma4:e2b.txt | less

The first run produces a hedged summary that flags the transcript as truncated. The second produces the actual 60-minute meeting summary — $4.2M Q2 ARR, Marcus Reyes, Boeing/Amazon/FedEx, Project Lighthouse launching July 15. On a 16 GB M-series Mac the whole thing takes about 3 minutes including the cold Gemma 4 load.

If you want to compare every model on your machine, drop the MODELS= filter and the script runs qwen2.5:3b, qwen2.5:1.5b, llama3.2:3b, llama3.2:1b, gemma4:e2b, and gemma4:e4b against the same transcript.

Bonus — testing Gemma 4's vision at E2B size: a calibration finding

Gemma 4 is multimodal at every size. Scripta's text path is what ships in v3.2 today, but a meeting tool whose user is also looking at slides during the call has an obvious multimodal extension: cross-reference what's on the deck against what was actually said. So I tested it.

The setup: I generated a fake Q2 all-hands slide for the same Atlas Robotics meeting the benchmark transcript covers, and intentionally seeded it with two inconsistencies vs what was said in the room:

Metric on slide	Slide value	Transcript value
Pricing increase	20%	15%
Project Lighthouse launch	July 22	July 15

Then I fed both the slide image and the transcript to Gemma 4 E2B via Ollama's /api/generate with images: [...]. The full driver script is in benchmarks/multimodal/run.sh.

bash benchmarks/multimodal/run.sh
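For anyone calling the endpoint without the driver script, the request shape is small. Ollama's /api/generate accepts an `images` list of base64-encoded images next to the prompt. The sketch below only builds that payload; the helper name is hypothetical, and the contents of run.sh are not reproduced here.

```python
import base64

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an /api/generate payload carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
```

The payload is then POSTed as JSON to the local Ollama server, the same way as a text-only request.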

Run 1 — loose prompt ("identify any inconsistencies"). Excerpt from the output:

Metric:        Pricing Change
Slide:         20%
Transcript:    "Effective September first, we are raising list price
               by fifteen percent across the SKU set."
Likely truth:  The transcript states a 15% price increase, which
               contradicts the 20% figure displayed on the slide.

Metric:        Customer Wins
Slide:         22                          ← fabricated, not on slide
Transcript:    "...closed three of the four new logos."
Likely truth:  Three new logos, contradicting "22" on the slide.

E2B caught the pricing mismatch correctly — read "20%" from the slide image, retrieved the transcript's "fifteen percent" quote verbatim, and called the contradiction. That's a real, useful capability.

In the same run it missed the July 22 vs July 15 date discrepancy in the Roadmap column entirely, and fabricated a "Customer Wins: 22" metric that does not appear anywhere on the slide (which just lists "Boeing, Amazon, FedEx" as new logos). The final summary line then read "No inconsistencies found. (Note: While there are numerical discrepancies between the transcript and the slide... )" — the model literally contradicted itself in a parenthetical.

Run 2 — strict grounded prompt (STRICT_PROMPT=1 bash benchmarks/multimodal/run.sh). I tightened the prompt to force the model to first enumerate only values visually present on the slide, then quote the transcript verbatim, then issue a MATCH | MISMATCH | NOT MENTIONED verdict. Output excerpt:

Item:        List Price Increase Percentage
Slide:       fifteen percent              ← wrong; slide actually shows 20%
Transcript:  "...we are raising list price by fifteen percent..."
Verdict:     MATCH

Item:        Lighthouse Launch Date
Slide:       July fifteen                 ← wrong; slide actually shows July 22
Transcript:  "Voice control launches with Lighthouse on July fifteen."
Verdict:     MATCH

Total mismatches: 0

The strict prompt overcorrected. With the slide image present but the (much larger) transcript dominating the prompt's attention, the model effectively stopped looking at the slide — it filled the "Slide:" field with whatever the transcript said and labelled everything MATCH. Both planted inconsistencies surfaced as false negatives. The same run hallucinated 30+ additional rows for items that aren't on the slide at all (Cambridge office details, NPS Q1 baseline, deployment time targets) — confabulated by reading the transcript and pretending those things were rendered.

The honest read. At 2B effective parameters, Gemma 4's vision is useful as a first-pass scanner for obvious numeric mismatches (Run 1 caught one real planted inconsistency on the first try with no tuning) but not yet reliable enough to be the only check at this size — it has two failure modes that pull in opposite directions and a sharper prompt cannot fix both at once. Production-quality slide-vs-discussion auditing on local hardware probably needs:

A bigger vision tower — E4B (9.6 GB) likely shifts the failure floor up; the 31B Dense model further still. Both are out of reach for Scripta's 16 GB target machine while Whisper, the audio pipeline, and a browser are also resident.
Or a hybrid pipeline — OCR the slide first, then do the cross-reference as a pure text-vs-text task that the same E2B handles confidently (see the calibration behavior from earlier in this post).
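The comparison step of the hybrid option is simple once both sides are plain values. The sketch below assumes the hard parts are already done: the slide has been OCR'd and both sources have been normalized into the same keys and units (for example "fifteen percent" into 15). The key names are hypothetical.

```python
def verdict(slide_value, transcript_value) -> str:
    """Compare one normalized slide value with the transcript's value."""
    if transcript_value is None:
        return "NOT MENTIONED"
    return "MATCH" if slide_value == transcript_value else "MISMATCH"

def audit(slide: dict, transcript: dict) -> dict:
    """Issue a verdict per slide item; only items present on the slide are checked."""
    return {item: verdict(value, transcript.get(item)) for item, value in slide.items()}

slide = {"pricing_increase_pct": 20, "lighthouse_launch": "July 22"}
transcript = {"pricing_increase_pct": 15, "lighthouse_launch": "July 15"}
print(audit(slide, transcript))
# {'pricing_increase_pct': 'MISMATCH', 'lighthouse_launch': 'MISMATCH'}
```

Because the loop iterates over slide items only, rows for things that are not on the slide cannot appear, which by construction rules out the extra-rows failure seen in Run 2.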

This is the kind of capability ceiling that's easy to miss in a five-minute demo and obvious once you actually try to use the output for anything, and it's why Scripta v3.2 ships the text path only. Wiring multimodal into the summary loop is a v3.3 question whose prerequisite is solving this grounding fragility, not a coding task — the infrastructure to capture screen-share frames already exists in Scripta (system audio is captured via ScreenCaptureKit, the same SCStream can vend video samples), so the bottleneck is the model behavior I just measured, not the plumbing.

Honest tradeoffs of choosing E2B

Picking E2B is not a free upgrade over a 3B Qwen:

Nearly 4× larger download. 7.2 GB vs 1.9 GB for qwen2.5:3b.
~30% slower throughput. 27 tok/s vs 39 tok/s on the same hardware. For the same output length, a 60-second summary becomes roughly an 87-second one.
Longer cold start. The first inference includes ~80 seconds of model load. Once the model is resident, loads are effectively instant.

These tradeoffs are why I left the default at qwen2.5:3b and made Gemma 4 a one-click opt-in from the picker (with a "NEW" badge and a 128K ctx indicator to surface the differentiation). Users who care most about speed and disk get the default; users who care most about quality and long meetings get Gemma 4. That's the kind of choice judges look for when they say "intentional model selection."

What changes for Scripta users

For Scripta specifically, Gemma 4 + the num_ctx fix turns a previously broken-but-no-one-noticed feature into the headline feature:

A real 60-minute meeting now produces a real 60-minute summary, not a summary of the last 5 minutes.
Long meetings (2+ hours) fit in a single Gemma 4 pass, no chunking required, no merging artifacts.
Chat-with-transcript (the existing "ask a question about the meeting" feature) can now actually answer questions about what was discussed in the first half hour.

For a tool whose pitch is "100% local meeting transcription with AI summaries," that's the difference between a demo and a product.

If you want to try it: download the latest release or run the one-line installer. Pull Gemma 4 from the in-app picker, click Record, and verify the debug log shows Summary: model=gemma4:e2b ctx=131072 ... That one log line means the request went out with the full 131,072-token (128K) window rather than Ollama's 2,048-token default.

Thanks to the Ollama, whisper.cpp, and Gemma 4 teams for shipping the building blocks that made this possible to put together as a side-project, on a laptop, in a weekend.

Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit

thehwang — Tue, 12 May 2026 14:17:01 +0000

How I built Scripta — a dual-channel meeting recorder that transcribes your mic and system audio in real-time, generates AI summaries, and never sends a byte to the cloud.

I spend 2–3 hours a day on Teams and Zoom calls. By the end of the day, I can barely remember who committed to what. I tried cloud transcription services — Otter.ai, Fireflies, Granola — but my company's security policy doesn't allow meeting audio to leave the corporate network.

So I built Scripta: an open-source macOS app that records both sides of a meeting, transcribes everything in real-time, and generates AI summaries — all running entirely on your Mac. Zero cloud requests. Zero subscriptions. Zero data exfiltration.

GitHub: github.com/thehwang/Scripta

The Dual-Channel Problem

Most transcription apps work with a single audio stream. That's fine for podcasts, but in a meeting you have two distinct audio sources:

Your microphone — your voice, physically entering the mic
System audio — the remote participants, coming out of Teams/Zoom/Meet through the OS audio mixer

If you mix them into one stream, you lose the ability to label who said what. And if you try to run two speech recognition tasks on separate streams using Apple's SFSpeechRecognizer, you get a fun surprise: kAFAssistantErrorDomain Code=1101 — Apple's speech framework silently refuses to run two recognition tasks concurrently.

The solution I landed on uses two completely different ASR engines:

┌─────────────────┐     ┌──────────────────┐
│   Microphone     │     │  System Audio     │
│  (AVAudioEngine) │     │ (ScreenCaptureKit)│
└────────┬────────┘     └────────┬─────────┘
         │                       │
    whisper.cpp             SFSpeechRecognizer
    (Metal GPU)             (Apple on-device)
         │                       │
         └───── Transcript ──────┘
                    │
              Local Ollama LLM
                    │
              AI Summary + Chat

Mic → whisper.cpp: The Whisper model runs locally with Metal acceleration. The base model (142 MB) achieves >15x real-time on Apple Silicon — 5 seconds of audio transcribed in ~0.3 seconds.

System audio → SFSpeechRecognizer: Apple's on-device speech recognition handles the remote audio. It works well with compressed VoIP audio and doesn't compete for GPU resources with Whisper.

This hybrid approach avoids the SFSpeechRecognizer concurrency error (Code=1101) while keeping everything on-device.

Capturing System Audio with ScreenCaptureKit

Before macOS 13, capturing system audio from a specific app required hacks: virtual audio devices like BlackHole, aggregate devices, or kernel extensions. ScreenCaptureKit changed this entirely.

The key insight: ScreenCaptureKit can capture audio only — you don't need to record the screen at all. Set the video dimensions to 2×2 pixels and enable audio:

let config = SCStreamConfiguration()
config.capturesAudio = true
config.excludesCurrentProcessAudio = true  // prevent feedback loops
config.sampleRate = 16_000
config.channelCount = 1
config.width = 2   // minimal video — we only want audio
config.height = 2

excludesCurrentProcessAudio = true is critical — without it, any sounds your app plays would get captured and create an echo loop.

The catch: ScreenCaptureKit requires Screen Recording permission, even though we're not recording the screen. On macOS 15, self-signed apps frequently fail to acquire this permission through the normal TCC prompt. Users often need to manually add the app in System Settings → Privacy & Security → Screen Recording. This is the single biggest friction point in the user experience, and there's no programmatic workaround.

Integrating whisper.cpp into a Swift App

whisper.cpp provides a clean C API that's straightforward to bridge into Swift — no Objective-C++ needed.

Building the Static Library

The Makefile clones whisper.cpp, builds it with CMake (Metal enabled), and merges all the resulting .a files into a single static library:

cmake -B build -S vendor/whisper.cpp \
    -DCMAKE_OSX_ARCHITECTURES="arm64" \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_METAL=ON \
    -DWHISPER_BUILD_TESTS=OFF

cmake --build build --config Release

libtool -static -o libwhisper.a \
    build/src/libwhisper.a \
    build/ggml/src/libggml.a \
    build/ggml/src/libggml-base.a \
    build/ggml/src/libggml-cpu.a \
    build/ggml/src/ggml-metal/libggml-metal.a

Swift Bridging via module.modulemap

Instead of a bridging header, I used a Swift Package Manager systemLibrary target with a module.modulemap:

module CWhisper {
    header "whisper.h"
    link "whisper"
    export *
}

This lets Swift code import CWhisper directly and call whisper_init_from_file_with_params, whisper_full, etc. as regular C functions.

Sliding Window Transcription

Real-time transcription with Whisper requires chunking the audio stream. I use a 5-second sliding window with 1-second overlap:

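let sampleRate = 16_000.0                               // matches the capture configuration
let chunkDuration: TimeInterval = 5.0
let overlapDuration: TimeInterval = 1.0
let chunkSamples = Int(chunkDuration * sampleRate)      // 80,000 samples
let overlapSamples = Int(overlapDuration * sampleRate)  // 16,000 samples

func processNextChunk() {
    // Wait until a full window has accumulated.
    guard sampleBuffer.count >= chunkSamples else { return }
    let chunk = Array(sampleBuffer.prefix(chunkSamples))
    // Drop everything except the final second, so the next window overlaps this one.
    sampleBuffer.removeFirst(chunkSamples - overlapSamples)
    transcribeChunk(chunk)
}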

The overlap prevents words at chunk boundaries from being cut off. Each chunk is processed on a background DispatchQueue — while one chunk is being transcribed, the next is accumulating.
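The chunking described above can be isolated into a small value type, which makes the overlap behaviour easy to check without any audio hardware. This is a minimal sketch, not Scripta's code: SlidingWindow and its property names are hypothetical, and the toy sizes (window of 5, overlap of 1) stand in for the real 80,000 and 16,000 samples at 16 kHz.

```swift
// Hypothetical helper: accumulates samples and emits fixed-size windows
// that share `overlapSamples` samples with the previous window.
struct SlidingWindow {
    let chunkSamples: Int
    let overlapSamples: Int
    private(set) var buffer: [Float] = []

    init(chunkSamples: Int, overlapSamples: Int) {
        precondition(overlapSamples >= 0 && overlapSamples < chunkSamples)
        self.chunkSamples = chunkSamples
        self.overlapSamples = overlapSamples
    }

    /// Appends new samples and returns every full window now available.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        buffer.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while buffer.count >= chunkSamples {
            chunks.append(Array(buffer.prefix(chunkSamples)))
            // Advance by the hop size, keeping the overlap for the next window.
            buffer.removeFirst(chunkSamples - overlapSamples)
        }
        return chunks
    }
}

// Toy sizes so the behaviour is easy to read: window of 5, overlap of 1.
var window = SlidingWindow(chunkSamples: 5, overlapSamples: 1)
let first = window.append([0, 1, 2, 3, 4, 5, 6])
let second = window.append([7, 8])
print(first)   // [[0.0, 1.0, 2.0, 3.0, 4.0]]
print(second)  // [[4.0, 5.0, 6.0, 7.0, 8.0]]
```

Sample 4 appears in both windows, which is the point of the overlap. It also means the overlapped second is transcribed twice, so repeated words presumably need de-duplicating downstream; the post does not describe how Scripta handles that.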

Noise filtering is important: Whisper tends to hallucinate on silence, producing segments like [MUSIC], (silence), or Thank you. when there's no actual speech. A simple pattern-matching filter catches these:

static func isNoiseSegment(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.hasPrefix("[") && trimmed.hasSuffix("]") { return true }
    if trimmed.hasPrefix("(") && trimmed.hasSuffix(")") { return true }
    let noisePatterns = ["music", "silence", "blank", "no speech", "thank you"]
    return noisePatterns.contains { trimmed.lowercased().contains($0) }
}
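One trade-off in the filter above: because it uses a substring match, it also drops genuine speech that happens to contain one of the patterns, such as "Thank you all for joining." If that matters, a stricter variant could treat a segment as noise only when the whole segment matches. The sketch below is an assumption about how that might look, not the author's implementation; isNoiseSegmentStrict is a hypothetical name.

```swift
import Foundation

// Hypothetical stricter variant: a segment counts as noise only when the
// whole segment, ignoring case and surrounding punctuation, is a known phrase.
func isNoiseSegmentStrict(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.isEmpty { return true }
    // Bracketed or parenthesised annotations such as [MUSIC] or (silence).
    if trimmed.hasPrefix("[") && trimmed.hasSuffix("]") { return true }
    if trimmed.hasPrefix("(") && trimmed.hasSuffix(")") { return true }
    let noisePhrases: Set<String> = ["music", "silence", "blank", "no speech", "thank you"]
    let normalized = trimmed
        .lowercased()
        .trimmingCharacters(in: CharacterSet.punctuationCharacters.union(.whitespaces))
    return noisePhrases.contains(normalized)
}

print(isNoiseSegmentStrict("[MUSIC]"))                     // true
print(isNoiseSegmentStrict(" Thank you. "))                // true
print(isNoiseSegmentStrict("Thank you all for joining."))  // false
```

The cost of the stricter match is that a hallucinated "Thank you." glued onto other hallucinated text would slip through, so which variant is better depends on how the noise actually shows up in practice.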

The Voice Processing IO Saga

When you take a meeting on speakers rather than headphones, the system audio plays through the speakers and gets picked up by the microphone. The mic transcription ends up containing the remote participant's words, which defeats the whole purpose of dual-channel separation.

The fix: Voice Processing IO, macOS's system-level acoustic echo cancellation:

try inputNode.setVoiceProcessingEnabled(true)

One line of code. Three days of debugging.

Pitfall 1: The 9-Channel Format

Enabling Voice Processing IO silently changes the microphone's output format from the expected mono/stereo to 9 channels. No documentation mentions this. My AVAudioConverter — which was converting the mic audio from its native format to mono 16kHz for Whisper — started crashing with EXC_BAD_ACCESS on the real-time audio thread.

The fix: bypass AVAudioConverter entirely. Extract channel 0 manually and resample with linear interpolation:

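// Use channel 0 only; the voice-processed buffer carries extra channels.
guard let ch0 = buffer.floatChannelData?[0] else { return }
let frameCount = Int(buffer.frameLength)
guard frameCount > 0 else { return }
let targetRate = 16_000.0                       // Whisper expects 16 kHz mono
let ratio = targetRate / buffer.format.sampleRate
var resampled = [Float](repeating: 0, count: Int(Double(frameCount) * ratio))
for i in 0..<resampled.count {
    let srcIdx = Double(i) / ratio              // fractional position in the source
    let idx0 = Int(srcIdx)
    let frac = Float(srcIdx - Double(idx0))
    // Linear interpolation between neighbouring source samples.
    resampled[i] = ch0[idx0] + frac * (ch0[min(idx0 + 1, frameCount - 1)] - ch0[idx0])
}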

Not the most elegant DSP, but it doesn't crash on the audio thread, which is more than AVAudioConverter can claim.
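To make the interpolation easier to reason about, here is the same idea as a standalone function over plain arrays, with two rates chosen so the arithmetic is exact. This is a sketch, not the code Scripta ships: resampleLinear and its parameter labels are hypothetical, and the extra index clamp is a defensive addition.

```swift
// Hypothetical standalone version of the linear-interpolation resampler.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    guard !input.isEmpty, sourceRate > 0, targetRate > 0 else { return [] }
    let outputCount = Int(Double(input.count) * targetRate / sourceRate)
    let step = sourceRate / targetRate  // source samples per output sample
    var output = [Float](repeating: 0, count: outputCount)
    for i in 0..<outputCount {
        let position = Double(i) * step
        // Clamp defensively in case rounding pushes the index past the end.
        let idx0 = min(Int(position), input.count - 1)
        let idx1 = min(idx0 + 1, input.count - 1)
        let frac = Float(position - Double(idx0))
        output[i] = input[idx0] + frac * (input[idx1] - input[idx0])
    }
    return output
}

// 32 kHz -> 16 kHz keeps every second sample.
print(resampleLinear([0, 1, 2, 3, 4, 5, 6, 7], from: 32_000, to: 16_000))  // [0.0, 2.0, 4.0, 6.0]
// 8 kHz -> 16 kHz inserts midpoints; the last sample is held.
print(resampleLinear([0, 2], from: 8_000, to: 16_000))                     // [0.0, 1.0, 2.0, 2.0]
```

One caveat worth knowing: downsampling this way applies no low-pass filter first, so any content above half the target rate (8 kHz here) can alias into the result. For speech headed to Whisper that is likely tolerable, but it is a trade-off rather than a free lunch.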

Pitfall 2: System Audio Ducking

After enabling Voice Processing IO, users reported that system volume suddenly dropped during recording. Voice Processing IO automatically ducks (reduces volume of) other audio sources to help with echo cancellation. This also affected ScreenCaptureKit's capture — the system audio recordings were nearly silent at -51 dB.

The fix (macOS 14+):

inputNode.voiceProcessingOtherAudioDuckingConfiguration =
    .init(enableAdvancedDucking: false, duckingLevel: .min)

Pitfall 3: Silent Audio Files

The same 9-channel issue that crashed AVAudioConverter for Whisper also broke audio file recording. The writeMicAudio function was using a converter to downsample the mic buffer to 1-channel AAC — but converting 9-channel real-time audio to mono AAC was silently producing empty frames. The resulting .m4a files were the right duration but contained silence (-91 dB).

The fix was the same manual channel extraction used for Whisper: extract channel 0, resample, write directly.
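Both this bug and the ducking one showed up as levels far below normal speech (-51 dB and -91 dB), so a cheap level check on outgoing buffers could flag the problem before a whole meeting is recorded as silence. The post does not say how those figures were measured; the sketch below is one assumed way to compute an RMS level in dB relative to full scale, with rmsLevelDB and the -120 dB floor as hypothetical choices.

```swift
import Foundation

// Hypothetical helper: RMS level of a buffer in dB relative to full scale (1.0).
func rmsLevelDB(_ samples: [Float], silenceFloor: Float = -120) -> Float {
    guard !samples.isEmpty else { return silenceFloor }
    let meanSquare = samples.reduce(Float(0)) { $0 + $1 * $1 } / Float(samples.count)
    let rms = meanSquare.squareRoot()
    // log10(0) is -infinity, so report the floor for digital silence.
    guard rms > 0 else { return silenceFloor }
    return max(silenceFloor, 20 * log10(rms))
}

print(rmsLevelDB([1, -1, 1, -1]))  // 0.0
print(rmsLevelDB([0, 0, 0, 0]))    // -120.0
```

A buffer at one tenth of full scale comes out at roughly -20 dB, so a threshold somewhere well below typical speech (the exact value would need tuning) is enough to separate "quiet room" from "the converter is emitting empty frames".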

Lessons Learned

Apple's Voice Processing IO documentation is essentially nonexistent. The 9-channel behavior, the ducking side effect, the interaction with AVAudioConverter — none of this is documented. I found most of it through crash logs and mplog() statements. If you're building anything with Voice Processing IO, budget extra time for audio format debugging.

Local AI with Ollama

For AI summaries and chat, Scripta connects to a local Ollama instance. The integration is deliberately simple — a POST request to localhost:11434:

// Streaming summary generation
let request = OllamaRequest(
    model: modelName,
    prompt: "Summarize this meeting transcript...\n\n\(transcript)",
    stream: true
)

The response streams token by token and is displayed in real time in the UI. After the summary completes, users can ask follow-up questions through the Ask AI chat panel: multi-turn conversations with the transcript as system context.
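For anyone wiring this up themselves, the streaming side is mostly JSON plumbing: with stream set to true, Ollama's /api/generate endpoint returns one JSON object per line, each carrying a response fragment and a done flag. The sketch below shows the request body and the per-line decoding. It assumes the app uses /api/generate (the post only names the host and port), and GenerateRequest, GenerateChunk, makeGenerateRequest, and parseChunk are hypothetical names, not Scripta's OllamaRequest type. The stream is simulated so the example runs without a server.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Request body for Ollama's /api/generate endpoint.
struct GenerateRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

// One line of the streamed response; each line is a standalone JSON object.
struct GenerateChunk: Codable {
    let response: String
    let done: Bool
}

// Builds the POST request. The endpoint path is an assumption about Scripta.
func makeGenerateRequest(model: String, transcript: String) throws -> URLRequest {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body = GenerateRequest(
        model: model,
        prompt: "Summarize this meeting transcript...\n\n\(transcript)",
        stream: true
    )
    request.httpBody = try JSONEncoder().encode(body)
    return request
}

// Decodes one streamed line into its text fragment and done flag.
func parseChunk(_ line: String) throws -> GenerateChunk {
    try JSONDecoder().decode(GenerateChunk.self, from: Data(line.utf8))
}

// Simulated stream so the sketch runs without a server.
let lines = [
    #"{"model":"qwen2.5:3b","response":"The team ","done":false}"#,
    #"{"model":"qwen2.5:3b","response":"agreed.","done":false}"#,
    #"{"model":"qwen2.5:3b","response":"","done":true}"#,
]
var summary = ""
for line in lines {
    let chunk = try parseChunk(line)
    summary += chunk.response
    if chunk.done { break }
}
print(summary)  // The team agreed.
```

In an app, the lines would come from the network instead, for example by iterating the lines of URLSession's bytes(for:) async sequence and appending each fragment to the UI as it arrives.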

The default model is qwen2.5:3b — small enough to run on any Apple Silicon Mac, multilingual, and produces surprisingly good meeting summaries. The install script handles Ollama installation, service startup, and model download automatically.

UX: Two Display Modes

Scripta offers two modes for different workflows:

Full mode is the main interface — transcript panel, AI summary, chat sidebar, recording controls, translation settings. This is where you review meetings after they end.

Minimal mode is a floating caption bar that stays on top of other windows. During a meeting, you switch to minimal mode and keep working while live captions scroll through.

The mic mute button works like Teams/Zoom — instant toggle, no pipeline teardown. The audio engine keeps running; the mute flag simply tells the tap callback to skip forwarding samples to Whisper and the audio writer.
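The mute behaviour described above boils down to a flag checked inside the tap callback. A minimal sketch of that gate follows, assuming a lock-protected Bool; MicGate and its members are hypothetical, and the post does not say which synchronisation primitive Scripta actually uses.

```swift
import Foundation

// Hypothetical mute gate: the tap keeps firing, samples are just dropped while muted.
final class MicGate {
    private let lock = NSLock()
    private var muted = false

    var isMuted: Bool {
        get { lock.lock(); defer { lock.unlock() }; return muted }
        set { lock.lock(); muted = newValue; lock.unlock() }
    }

    /// Called from the tap callback; forwards only when unmuted.
    func handle(_ samples: [Float], forward: ([Float]) -> Void) {
        if isMuted { return }
        forward(samples)
    }
}

let gate = MicGate()
var forwarded: [[Float]] = []
gate.handle([0.1, 0.2]) { forwarded.append($0) }
gate.isMuted = true
gate.handle([0.3]) { forwarded.append($0) }  // dropped while muted
gate.isMuted = false
gate.handle([0.4]) { forwarded.append($0) }
print(forwarded.count)  // 2
```

Because nothing is torn down, unmuting takes effect on the very next buffer, which is what makes the toggle feel instant.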

Distribution Without the App Store

Scripta uses ScreenCaptureKit, communicates with Ollama on localhost, and links against a custom whisper.cpp static library. Together with an installer that sets up Ollama and downloads models, that combination is a poor fit for App Store sandboxing and review.

Instead, I distribute through GitHub Releases:

GitHub Actions CI builds for macOS 14 and macOS 15 and signs each build ad hoc (codesign --sign "-")
curl | bash installer downloads the latest release, runs xattr -cr to clear the Gatekeeper quarantine flag, installs Ollama, pulls the AI model, and downloads the Whisper model
One command: curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

The xattr -cr step is what makes ad-hoc signed apps work without a paid Apple Developer ID. It clears the com.apple.quarantine extended attribute that macOS adds to downloaded files. Combined with the ad-hoc signature (which satisfies code integrity checks), this lets the app run without the "unidentified developer" warning.

What's Next

A few things I want to build:

Speaker diarization — cluster voice embeddings to distinguish Speaker 1, 2, 3 instead of just "Remote"
In-app auto-update — check GitHub Releases API on launch, download and replace via install script
Whisper model selection — let users choose between tiny (fast, less accurate) and small/medium (slower, better)
Export formats — SRT subtitles, JSON with timestamps, integration with note-taking apps

Try It

Scripta is open-source under the MIT license.

Install:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash