<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gabriele Mastrapasqua</title>
    <description>The latest articles on Forem by Gabriele Mastrapasqua (@gabrielemastrapasqua).</description>
    <link>https://forem.com/gabrielemastrapasqua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F71092%2F2baf511b-f51e-4826-bdd5-16b26f7deb31.jpeg</url>
      <title>Forem: Gabriele Mastrapasqua</title>
      <link>https://forem.com/gabrielemastrapasqua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gabrielemastrapasqua"/>
    <language>en</language>
    <item>
      <title>Extending Qwen3-TTS: clone voices once, reuse everywhere (pure C)</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:36:03 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/extending-qwen3-tts-clone-voices-once-reuse-everywhere-pure-c-271o</link>
      <guid>https://forem.com/gabrielemastrapasqua/extending-qwen3-tts-clone-voices-once-reuse-everywhere-pure-c-271o</guid>
      <description>&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt; — a pure C inference engine for Qwen3-TTS.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR — turn any 30-second clip into a first-class Qwen3-TTS voice
&lt;/h2&gt;

&lt;p&gt;Qwen3-TTS ships with &lt;strong&gt;9 preset speakers&lt;/strong&gt;. That's it. You can't add your own, you can't use the 1.7B instruct feature on a cloned voice, and every new clone has to re-run the 200 ms ECAPA-TDNN encoder from scratch.&lt;/p&gt;

&lt;p&gt;This post is about tearing that ceiling down.&lt;/p&gt;

&lt;p&gt;With the pure-C engine at &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt; you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎙️ Clone &lt;strong&gt;any voice&lt;/strong&gt; from 30 seconds of audio (ECAPA-TDNN speaker encoder implemented from scratch)&lt;/li&gt;
&lt;li&gt;💾 Save it as a portable &lt;strong&gt;&lt;code&gt;.qvoice&lt;/code&gt; file&lt;/strong&gt; and load it anywhere — CLI, HTTP server, streaming pipeline, one-shot generation&lt;/li&gt;
&lt;li&gt;🎛️ Combine a cloned voice with &lt;strong&gt;&lt;code&gt;--instruct&lt;/code&gt; style prompts&lt;/strong&gt; on 1.7B (sad / happy / angry / solemn) — something the Base model alone can't do&lt;/li&gt;
&lt;li&gt;🎯 Get &lt;strong&gt;bit-identical output&lt;/strong&gt; across runs, processes, and machines via the &lt;code&gt;WDELTA&lt;/code&gt; weight-delta format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;.qvoice&lt;/code&gt; file is a new way to &lt;em&gt;extend&lt;/em&gt; Qwen3-TTS's voice set: drop the file next to the binary, point &lt;code&gt;--load-voice&lt;/code&gt; at it, and the model speaks with your voice like it was one of the originals.&lt;/p&gt;

&lt;p&gt;That portability comes in three flavors. All produce the same voice identity; they differ in how much of the Base model's weight signature they carry along:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Mel correlation vs Base&lt;/th&gt;
&lt;th&gt;Fidelity&lt;/th&gt;
&lt;th&gt;Works with instruct?&lt;/th&gt;
&lt;th&gt;Use when…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇 &lt;code&gt;.qvoice&lt;/code&gt; &lt;strong&gt;WDELTA&lt;/strong&gt; (LZ4 full delta)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;785 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.000&lt;/strong&gt; (bit-identical)&lt;/td&gt;
&lt;td&gt;Perfect, PCM-identical&lt;/td&gt;
&lt;td&gt;✅ yes (1.7B)&lt;/td&gt;
&lt;td&gt;You're building a reusable voice asset — server, streaming, product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈 &lt;code&gt;.qvoice&lt;/code&gt; &lt;strong&gt;standard&lt;/strong&gt; (TPAD + WOVR)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good; small prosody drift&lt;/td&gt;
&lt;td&gt;⚠️ Base only&lt;/td&gt;
&lt;td&gt;Default for sharing — fits in chat, sounds right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉 &lt;code&gt;.bin&lt;/code&gt; &lt;strong&gt;embedding only&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4 KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;not measured&lt;/em&gt; (subjectively ~60–70 %)&lt;/td&gt;
&lt;td&gt;Voice drifts, timbre loose&lt;/td&gt;
&lt;td&gt;❌ no&lt;/td&gt;
&lt;td&gt;You have 4 kilobytes to spend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline: &lt;strong&gt;WDELTA makes a cloned voice a first-class citizen of the CustomVoice model&lt;/strong&gt;. You clone once on Base, save a &lt;code&gt;.qvoice&lt;/code&gt;, and the CV model loads it and treats it exactly like one of the nine built-in speakers — same latency, same server behavior, same streaming support, now with instruct-style control on top.&lt;/p&gt;

&lt;p&gt;All audio below is hosted in the same repo — follow the &lt;em&gt;wav&lt;/em&gt; links to listen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Samples — listen for yourself
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🇮🇹 Italian — &lt;em&gt;Galatea&lt;/em&gt; / Riccardo Fasol · &lt;a href="https://archive.org/details/galatea_0908_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Buongiorno a tutti, oggi vi racconto una breve storia, con la voce clonata da una registrazione libera."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; — 30 s from LibriVox · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/it_galatea_fasol.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000, bit-identical) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; (embedding only) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇬🇧 English — &lt;em&gt;The Gift of the Magi&lt;/em&gt; / Phil Chenevert · &lt;a href="https://archive.org/details/5belovedstories_ohenry_pc_librivox" rel="noopener noreferrer"&gt;LibriVox, CC0&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hello everyone, today I am speaking with a voice cloned from a freely licensed recording."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/en_ohenry_chenevert.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇪🇸 Spanish — &lt;em&gt;Don Quijote&lt;/em&gt; / Lu · &lt;a href="https://archive.org/details/donquijote_2507_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hola a todos, hoy les hablo con una voz clonada a partir de una grabación de dominio público."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/es_quijote_lu.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇫🇷 French — &lt;em&gt;Le dernier jour d'un condamné&lt;/em&gt; / Bidou · &lt;a href="https://archive.org/details/dernierjour_2203_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Bonjour à tous, aujourd'hui je vous parle avec une voix clonée à partir d'un enregistrement libre."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/fr_hugo_bidou.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost table — how much you pay for each tier
&lt;/h3&gt;

&lt;p&gt;All numbers on Apple M1 8-core, 16 GB RAM, 4 threads, 0.6B model, cold start.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;File size&lt;/th&gt;
&lt;th&gt;Create &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Generate (wall)&lt;/th&gt;
&lt;th&gt;What's inside&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.bin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;td&gt;~20.8 s&lt;/td&gt;
&lt;td&gt;~11 s&lt;/td&gt;
&lt;td&gt;1024 ECAPA-TDNN floats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;standard &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;16 MB&lt;/td&gt;
&lt;td&gt;~19.6 s&lt;/td&gt;
&lt;td&gt;~9.6 s&lt;/td&gt;
&lt;td&gt;Embedding + &lt;code&gt;text_projection&lt;/code&gt; + &lt;code&gt;codec_embedding&lt;/code&gt; + pad embeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WDELTA &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;785 MB&lt;/td&gt;
&lt;td&gt;~28.5 s&lt;/td&gt;
&lt;td&gt;~13.8 s&lt;/td&gt;
&lt;td&gt;LZ4 int16 deltas for all 402 talker + CP tensors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verdict: &lt;strong&gt;standard 16 MB is the sensible default.&lt;/strong&gt; Go to WDELTA only when you need bit-identical output or to combine a cloned voice with &lt;code&gt;--instruct&lt;/code&gt; style control (1.7B). Go to &lt;code&gt;.bin&lt;/code&gt; only if 4 KB is the whole budget.&lt;/p&gt;

&lt;p&gt;All input clips are 30-second excerpts from the 30 s mark (skipping the LibriVox preamble), 24 kHz mono PCM. Full attribution in &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts/blob/main/samples/voice_clone_refs/ATTRIBUTION.md" rel="noopener noreferrer"&gt;&lt;code&gt;samples/voice_clone_refs/ATTRIBUTION.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Now the technical part — how we got there.&lt;/p&gt;

&lt;h2&gt;
  
  
  From audio to identity: the ECAPA-TDNN pipeline
&lt;/h2&gt;

&lt;p&gt;The speaker encoder is an ECAPA-TDNN (&lt;em&gt;Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network&lt;/em&gt;), designed for speaker verification. Its job: take variable-length audio and produce a fixed-size vector that captures &lt;em&gt;who&lt;/em&gt; is speaking, independent of &lt;em&gt;what&lt;/em&gt; they're saying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Mel spectrogram
&lt;/h3&gt;

&lt;p&gt;Raw 24 kHz audio → 1024-point FFT, hop 256, 128 mel bins → &lt;code&gt;[T, 128]&lt;/code&gt;, where T grows at ≈ 94 frames per second of audio (24000 / 256 = 93.75). For 30 s of audio, roughly &lt;code&gt;[2810, 128]&lt;/code&gt;.&lt;/p&gt;
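&lt;p&gt;The frame math is worth pinning down. A minimal counter, assuming no centering or padding (the engine's exact STFT policy may differ by a frame or two):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;assert.h&amp;gt;

/* Frames produced by an STFT with window n_fft and hop:
 * one frame per hop once the first full window fits.
 * The padding policy here is an assumption, not the engine's code. */
static int mel_frames(int n_samples, int n_fft, int hop) {
    if (n_samples &amp;lt; n_fft) return 0;
    return 1 + (n_samples - n_fft) / hop;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;30 s at 24 kHz gives &lt;code&gt;1 + (720000 - 1024) / 256 = 2809&lt;/code&gt; frames, matching the rough &lt;code&gt;[2810, 128]&lt;/code&gt; above up to the padding convention.&lt;/p&gt;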

&lt;h3&gt;
  
  
  Step 2 — TDNN + SE-Res2Net blocks
&lt;/h3&gt;

&lt;p&gt;Four convolutional blocks process the mel spectrogram. The first is a plain 1D conv (128→512, k=5). The next three are Squeeze-and-Excitation Res2Net blocks: each splits 512 channels into 8 groups of 64, runs cascaded dilated convolutions (dilations 2, 3, 4) over them, and reweights channels with a small attention block. The effective receptive field grows to several hundred milliseconds by the last block.&lt;/p&gt;
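&lt;p&gt;The squeeze-and-excitation step is small enough to sketch whole. This is illustrative, not the engine's kernel: real SE blocks also carry biases, and the bottleneck width &lt;code&gt;B&lt;/code&gt; here is a stand-in parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Squeeze-and-Excitation sketch: global-average "squeeze" per channel,
 * a small bottleneck with ReLU, a sigmoid gate, channel-wise rescale. */
static void se_rescale(float *x /* [C][T] */, int C, int T,
                       const float *w1 /* [B][C] */,
                       const float *w2 /* [C][B] */, int B) {
    float *s = calloc((size_t)C, sizeof *s);
    float *h = calloc((size_t)B, sizeof *h);
    for (int c = 0; c &amp;lt; C; c++) {            /* squeeze: mean over time */
        for (int t = 0; t &amp;lt; T; t++) s[c] += x[c * T + t];
        s[c] /= (float)T;
    }
    for (int b = 0; b &amp;lt; B; b++) {            /* bottleneck + ReLU */
        for (int c = 0; c &amp;lt; C; c++) h[b] += w1[b * C + c] * s[c];
        if (h[b] &amp;lt; 0.0f) h[b] = 0.0f;
    }
    for (int c = 0; c &amp;lt; C; c++) {            /* excite: sigmoid gate */
        float z = 0.0f;
        for (int b = 0; b &amp;lt; B; b++) z += w2[c * B + b] * h[b];
        float g = 1.0f / (1.0f + expf(-z));
        for (int t = 0; t &amp;lt; T; t++) x[c * T + t] *= g;
    }
    free(s);
    free(h);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With all-zero weights the gate is sigmoid(0) = 0.5 and every channel is simply halved, which makes a handy smoke test.&lt;/p&gt;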

&lt;h3&gt;
  
  
  Step 3 — Multi-layer feature aggregation
&lt;/h3&gt;

&lt;p&gt;The three SE-Res2Net outputs are concatenated channel-wise (→ &lt;code&gt;[1536, T]&lt;/code&gt;) and passed through one more TDNN. The network now has access to features at every level of abstraction simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Attentive Statistics Pooling
&lt;/h3&gt;

&lt;p&gt;The most important step — the one that collapses variable-length time into a fixed-size vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1536, T] → mean, std across T → concat [hidden, mean, std] → [4608, T]
         → TDNN(4608→128) → tanh → Conv1d(128→1536) → softmax over T
         → weighted mean + weighted std → [3072]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attention learns &lt;em&gt;which frames matter most&lt;/em&gt;. Sustained vowels reveal more about vocal tract shape than fricatives or silence — the network weights them higher. This is also why &lt;strong&gt;varied reference audio beats long monotone audio&lt;/strong&gt;: more variation, richer pooling. 30 s is our default sweet spot.&lt;/p&gt;
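&lt;p&gt;The last arrow of that diagram, the weighted statistics, fits in a few lines of C. A sketch that assumes the per-frame scores are already computed (the score network above is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;math.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Attentive statistics pooling, final step: softmax the per-frame
 * scores over T, then take the attention-weighted mean and std of
 * each channel. Output layout: [2*C] = weighted means ++ weighted stds. */
static void attn_stats(const float *h /* [C][T] */,
                       const float *score /* [T] */,
                       int C, int T, float *out /* [2*C] */) {
    float w[4096];                         /* sketch: assumes T &amp;lt;= 4096 */
    float m = score[0], z = 0.0f;
    for (int t = 1; t &amp;lt; T; t++) if (score[t] &amp;gt; m) m = score[t];
    for (int t = 0; t &amp;lt; T; t++) { w[t] = expf(score[t] - m); z += w[t]; }
    for (int t = 0; t &amp;lt; T; t++) w[t] /= z;
    for (int c = 0; c &amp;lt; C; c++) {
        float mu = 0.0f, var = 0.0f;
        for (int t = 0; t &amp;lt; T; t++) mu += w[t] * h[c * T + t];
        for (int t = 0; t &amp;lt; T; t++) {
            float d = h[c * T + t] - mu;
            var += w[t] * d * d;
        }
        out[c] = mu;
        out[C + c] = sqrtf(var &amp;gt; 0.0f ? var : 0.0f);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;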

&lt;h3&gt;
  
  
  Step 5 — Final projection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conv1d(3072 → enc_dim, kernel=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;enc_dim&lt;/th&gt;
&lt;th&gt;hidden&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.6B-Base&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B-Base&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output — 1024 or 2048 floats — replaces the discrete speaker token in the transformer prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug hidden by a coincidence
&lt;/h2&gt;

&lt;p&gt;Voice cloning worked on 0.6B. On 1.7B it sounded completely wrong. The cause was a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;enc_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// hardcoded — wrong for 1.7B&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On 0.6B, &lt;code&gt;enc_dim == hidden == 1024&lt;/code&gt; by coincidence. On 1.7B, &lt;code&gt;enc_dim == 2048&lt;/code&gt;, so we were writing 1024 valid floats into a 2048-dim slot — the rest was uninitialized memory. The first half of the hidden state got a real speaker; the second half got garbage.&lt;/p&gt;

&lt;p&gt;The fix was reading &lt;code&gt;enc_dim&lt;/code&gt; from &lt;code&gt;config.json&lt;/code&gt;. &lt;strong&gt;Lesson:&lt;/strong&gt; when two model sizes "work" but one sounds wrong, check whether shared code accidentally matches by coincidence rather than by design.&lt;/p&gt;
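&lt;p&gt;The shape of the fix, with a deliberately naive scanner standing in for the engine's real config parsing (&lt;code&gt;config_int&lt;/code&gt; is a made-up helper):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Naive scan for an integer field in an in-memory config.json.
 * Illustrative only; use a real JSON parser in production. */
static int config_int(const char *json, const char *key, int fallback) {
    char pat[64];
    snprintf(pat, sizeof pat, "\"%s\"", key);
    const char *p = strstr(json, pat);
    if (!p) return fallback;
    p = strchr(p + strlen(pat), ':');
    return p ? atoi(p + 1) : fallback;
}

/* The one-line fix, instead of the hardcoded 1024:
 *   enc-&amp;gt;enc_dim = config_int(json, "enc_dim", -1); */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;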

&lt;h2&gt;
  
  
  Why 1.7B clones better
&lt;/h2&gt;

&lt;p&gt;After the fix, 1.7B consistently produced more faithful clones. Two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2048-dim embedding&lt;/strong&gt; vs 1024-dim — twice the capacity to capture breathiness, nasality, micro-timing in phoneme transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4× transformer parameters&lt;/strong&gt; — the model can actually &lt;em&gt;use&lt;/em&gt; the richer embedding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A detailed speaker embedding is only useful if the model has the capacity to condition on those details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;.qvoice&lt;/code&gt; v3 format
&lt;/h2&gt;

&lt;p&gt;Cloning isn't free (~200 ms of ECAPA-TDNN inference per 30 s of audio), and, more importantly, a raw embedding alone loses prosody. The &lt;code&gt;.qvoice&lt;/code&gt; format stores everything needed to reproduce the clone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QVCE magic + version 3
├── Speaker embedding        (1024 or 2048 floats)
├── Reference text + ICL codec tokens (optional)
├── META                     (language, voice name, source model size, flags)
├── TPAD                     (source model's tts_pad/bos/eos embeddings, 12 KB)
├── WOVR                     (text_projection + codec_embedding, 16 MB)
└── WDELTA                   (LZ4 int16 deltas for all talker+CP weights, ~785 MB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section is optional. You pick the trade-off.&lt;/p&gt;
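&lt;p&gt;Optional sections keep the loader simple: walk tagged records, take what's present, skip the rest. A sketch over an in-memory buffer, assuming a 4-byte tag plus u32 length per section (the real on-disk framing may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Walk a tag+length chunk stream and return the payload of `tag`,
 * skipping sections the caller doesn't want. Framing is assumed. */
static const uint8_t *find_section(const uint8_t *buf, size_t n,
                                   const char tag[4], uint32_t *len) {
    size_t off = 0;
    while (off + 8 &amp;lt;= n) {
        uint32_t l;
        memcpy(&amp;amp;l, buf + off + 4, sizeof l);
        if (memcmp(buf + off, tag, 4) == 0) {
            *len = l;
            return buf + off + 8;
        }
        off += 8 + l;
    }
    return NULL;   /* section absent: caller falls back to a lighter tier */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;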

&lt;h2&gt;
  
  
  How we got to bit-identical
&lt;/h2&gt;

&lt;p&gt;Base and CustomVoice share &lt;strong&gt;99.98 %&lt;/strong&gt; of transformer weights (cosine ≈ 0.9999 per layer). But BF16 values differ at 87 % of positions, and those micro-differences accumulate autoregressively. Closing the gap was a three-step elimination:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TPAD (+12 KB)&lt;/strong&gt; — override the source model's &lt;code&gt;tts_pad_embed&lt;/code&gt;. Mel correlation 0.756.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WOVR (+16 MB)&lt;/strong&gt; — override &lt;code&gt;text_projection&lt;/code&gt; and &lt;code&gt;codec_embedding&lt;/code&gt; entirely. Mel correlation 0.711, RTF 1.60.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WDELTA (+785 MB, LZ4)&lt;/strong&gt; — int16 deltas for every remaining layer. Mel correlation &lt;strong&gt;1.000&lt;/strong&gt;, PCM bit-identical.&lt;/li&gt;
&lt;/ol&gt;
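&lt;p&gt;Why can an int16 delta be lossless at all? BF16 weights are 16-bit values, so one scheme that works is subtracting the raw encodings with wraparound: reconstruction is exact, and near-identical weights yield tiny deltas that LZ4 compresses well. Whether WDELTA uses exactly this encoding is an assumption here; the repo defines the real format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Lossless 16-bit delta over raw bf16 encodings: wraparound arithmetic
 * guarantees delta_apply(base, delta_make(base, cv)) == cv for every
 * pair, and near-identical weights produce tiny deltas. */
static inline uint16_t delta_make(uint16_t base_bits, uint16_t cv_bits) {
    return (uint16_t)(cv_bits - base_bits);
}
static inline uint16_t delta_apply(uint16_t base_bits, uint16_t delta) {
    return (uint16_t)(base_bits + delta);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;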

&lt;p&gt;Two things bit us along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial layer replacement is worse than none.&lt;/strong&gt; Replacing the 5 most-divergent layers out of 28 dropped quality below the no-replacement baseline. The transformer is a chain; mismatched interfaces at layer boundaries cost more than uniform small differences everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Code Predictor has its own weights.&lt;/strong&gt; Even after replacing all 28 talker layers, codebooks 5–15 still diverged until we also deltaed the CP's 86 tensors and rebuilt its &lt;code&gt;gate_up_fused&lt;/code&gt; buffer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LZ4 vs zlib
&lt;/h2&gt;

&lt;p&gt;We started with zlib. It produced smaller files but decompression dominated load time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;File (0.6B)&lt;/th&gt;
&lt;th&gt;Decompress&lt;/th&gt;
&lt;th&gt;Total wall&lt;/th&gt;
&lt;th&gt;vs preset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;zlib&lt;/td&gt;
&lt;td&gt;510 MB&lt;/td&gt;
&lt;td&gt;~4 s&lt;/td&gt;
&lt;td&gt;15.9 s&lt;/td&gt;
&lt;td&gt;+32 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LZ4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;785 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12.8 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7 %&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a one-shot load at startup, decompression speed matters more than file size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Style control + cloned voice
&lt;/h2&gt;

&lt;p&gt;On 1.7B + WDELTA you can finally combine &lt;code&gt;--instruct&lt;/code&gt; with a cloned voice — something the Base model alone can't do, because it was never trained with both signals together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; silvio_17b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Una notizia importante."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="s2"&gt;"Parla con voce triste e malinconica"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; sad.wav

./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; silvio_17b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Una notizia importante."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="s2"&gt;"Parla con voce allegra e entusiasta"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; happy.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voice identity stays constant; instruct modulates rhythm, pacing, and emphasis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands, for the curious
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone once (needs Base + CV of same size)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; speaker.wav &lt;span class="nt"&gt;-l&lt;/span&gt; Italian &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--voice-name&lt;/span&gt; &lt;span class="s2"&gt;"Mario"&lt;/span&gt; &lt;span class="nt"&gt;--target-cv&lt;/span&gt; qwen3-tts-0.6b &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--save-voice&lt;/span&gt; voices/mario_06b.qvoice

&lt;span class="c"&gt;# Use anywhere (only needs CV + .qvoice)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; voices/mario_06b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Ciao, come va?"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;--target-cv&lt;/code&gt;, you get the 16 MB standard format. With it, the 785 MB WDELTA.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test every model size.&lt;/strong&gt; Dimension bugs hide behind coincidences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longer audio helps, but not linearly.&lt;/strong&gt; Diversity beats duration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding dimension &lt;em&gt;is&lt;/em&gt; quality.&lt;/strong&gt; 1024 → 2048 is a clear audible jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your file formats.&lt;/strong&gt; v1 &lt;code&gt;.qvoice&lt;/code&gt; silently corrupted output on a size mismatch; v2+ fails loudly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B test by listening.&lt;/strong&gt; Unit tests pass on garbage outputs. Ears don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The encoder captures everything, not just the voice&lt;/strong&gt; — background music, room noise, a second speaker. Clean your input or run Demucs first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Style control and voice cloning live in separate worlds — until you bridge them with weight deltas.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Source, deeper dives, and benchmarks: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;. For the full weight-analysis story behind WDELTA, see &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts/blob/main/blog/cross-model-voice-analysis.md" rel="noopener noreferrer"&gt;cross-model-voice-analysis.md&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>c</category>
      <category>ai</category>
      <category>tts</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Optimizing a Qwen3-TTS Engine in Pure C: Lessons from 1990s Game Programming</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Mon, 16 Mar 2026 21:34:23 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/optimizing-a-qwen3-tts-engine-lessons-from-1990s-game-programming-440n</link>
      <guid>https://forem.com/gabrielemastrapasqua/optimizing-a-qwen3-tts-engine-lessons-from-1990s-game-programming-440n</guid>
      <description>&lt;p&gt;&lt;em&gt;How cache alignment, SIMD intrinsics (NEON/AVX), pipeline threading, and lessons from 1990s game programming nearly tripled the inference speed of Qwen3-TTS, reducing RTF from 3.5 to 1.26.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Point
&lt;/h2&gt;

&lt;p&gt;We have a &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;pure C inference engine&lt;/a&gt; for qwen3-tts, a text-to-speech model with&lt;br&gt;
a 28-layer transformer (Talker), a 5-layer code predictor, and a convolutional&lt;br&gt;
speech decoder. No Python, no PyTorch, no GPU — just C, Apple Accelerate BLAS,&lt;br&gt;
and SIMD intrinsics (NEON on ARM, AVX on x86) on an Apple M1 with 16 GB RAM.&lt;/p&gt;

&lt;p&gt;After getting the pipeline correct and implementing the first round of SIMD&lt;br&gt;
kernels (NEON/AVX: fused 2-row bf16 matvec, unified QKV dispatch, fused gate+up SwiGLU),&lt;br&gt;
we were at &lt;strong&gt;RTF ~3.5&lt;/strong&gt; on short text and &lt;strong&gt;RTF ~2.5&lt;/strong&gt; on longer text (the fixed&lt;br&gt;
costs of prefill and speech decoding amortize over longer audio).&lt;/p&gt;

&lt;p&gt;This post covers the optimizations that brought us to &lt;strong&gt;RTF ~1.26&lt;/strong&gt; (server warm,&lt;br&gt;
long text) — up to a 2.7x total speedup with zero algorithmic changes and zero&lt;br&gt;
new dependencies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;RTF&lt;/strong&gt; = Real-Time Factor = processing_time / audio_duration. Lower is better.&lt;br&gt;
RTF &amp;lt; 1.0 means faster than real-time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The Abrash Instinct: Cache Alignment Still Matters
&lt;/h2&gt;

&lt;p&gt;If you grew up reading Michael Abrash's &lt;em&gt;Graphics Programming Black Book&lt;/em&gt;&lt;br&gt;
(1997), you remember the chapters on data alignment. Abrash hammered on a&lt;br&gt;
simple point: on the 386 and 486, unaligned memory accesses caused extra&lt;br&gt;
wait-states that destroyed performance. Word-aligned on 386, dword-aligned&lt;br&gt;
on 486 — he had the diagrams, the tables, the rules.&lt;/p&gt;

&lt;p&gt;John Carmack talked about this too in his &lt;code&gt;.plan&lt;/code&gt; files and QuakeCon talks,&lt;br&gt;
in his typical informal way — "align your structs, pack your data, think about&lt;br&gt;
cache lines." But the systematic treatment, the benchmarks, the rules of thumb?&lt;br&gt;
That was Abrash. Chapter after chapter of the Black Book devoted to data&lt;br&gt;
alignment, struct layout, and how the CPU bus punishes you for sloppy memory&lt;br&gt;
access patterns.&lt;/p&gt;

&lt;p&gt;Here's the thing: &lt;strong&gt;those lessons still apply.&lt;/strong&gt; The penalty isn't wait-states&lt;br&gt;
anymore — it's SIMD throughput. Modern CPUs have SIMD units (128-bit NEON on&lt;br&gt;
ARM, 256-bit AVX on x86) that operate on aligned data natively. When you feed&lt;br&gt;
misaligned buffers to BLAS routines like &lt;code&gt;cblas_sgemm&lt;/code&gt;, the library can't&lt;br&gt;
use its fastest SIMD paths. Apple Accelerate checks alignment at runtime and&lt;br&gt;
falls back to slower code when buffers aren't aligned.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Fix: 3 Lines of Code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;aligned_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posix_memalign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We replaced every &lt;code&gt;malloc()&lt;/code&gt; and &lt;code&gt;calloc()&lt;/code&gt; in the hot path with&lt;br&gt;
&lt;code&gt;posix_memalign(64, ...)&lt;/code&gt;. Not just the BLAS buffers — the KV caches, the&lt;br&gt;
decode buffers, the prefill temporaries. Everything that touches a SIMD&lt;br&gt;
instruction or a BLAS call got 64-byte alignment (a full cache line on x86, half of the M1's 128-byte line).&lt;/p&gt;
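&lt;p&gt;One wrinkle: &lt;code&gt;posix_memalign&lt;/code&gt; has no zeroing variant, so the &lt;code&gt;calloc()&lt;/code&gt; call sites need the clearing done by hand. A sketch of the companion helper (hypothetical name, not the engine's exact code):&lt;/p&gt;

```c
#include <stdlib.h>
#include <string.h>

/* calloc-style companion to aligned_malloc (hypothetical helper):
 * posix_memalign has no zeroing variant, so the aligned buffer is
 * cleared manually after allocation. */
static inline void *aligned_calloc(size_t nmemb, size_t size) {
    void *ptr = NULL;
    size_t total = nmemb * size;  /* caller ensures nmemb*size cannot overflow */
    if (posix_memalign(&ptr, 64, total) != 0) return NULL;
    memset(ptr, 0, total);
    return ptr;
}
```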
&lt;h3&gt;
  
  
  The Result: 24% Total Speedup
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill (BLAS sgemm)&lt;/td&gt;
&lt;td&gt;475ms&lt;/td&gt;
&lt;td&gt;260ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;45%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech Decoder (BLAS sgemm)&lt;/td&gt;
&lt;td&gt;2,580ms&lt;/td&gt;
&lt;td&gt;1,648ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor (SIMD matvec)&lt;/td&gt;
&lt;td&gt;66.4 ms/f&lt;/td&gt;
&lt;td&gt;60.8 ms/f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prefill stage — which does batch matrix multiplication via &lt;code&gt;cblas_sgemm&lt;/code&gt; —&lt;br&gt;
nearly doubled in speed. The speech decoder, which also relies heavily on&lt;br&gt;
sgemm for its convolutions, improved by 36%. Even the Code Predictor, which&lt;br&gt;
uses our hand-written SIMD bf16 matvec kernel (NEON/AVX), gained 9% from aligned KV&lt;br&gt;
cache and decode buffers.&lt;/p&gt;

&lt;p&gt;And the output is &lt;strong&gt;bit-identical&lt;/strong&gt;. Same seed, same text, same bytes in the&lt;br&gt;
WAV file. The 24% was pure implementation overhead: performance we had been leaving on the table.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why 64 Bytes?
&lt;/h3&gt;

&lt;p&gt;The M1's L1 cache line is 128 bytes on the P-cores, but the common denominator&lt;br&gt;
across ARM and x86 is 64 bytes. The BLAS library needs at least 16-byte&lt;br&gt;
alignment for NEON (32-byte for AVX), but 64 bytes guarantees that no buffer&lt;br&gt;
straddles a cache line boundary unnecessarily. It's the sweet spot for&lt;br&gt;
cross-platform code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;posix_memalign&lt;/code&gt; is POSIX standard — it works on Linux and macOS without any&lt;br&gt;
platform-specific code, and under WSL2 as well (native Windows would need&lt;br&gt;
&lt;code&gt;_aligned_malloc&lt;/code&gt; instead). Three lines of code, zero dependencies, cross-platform.&lt;/p&gt;
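&lt;p&gt;If native Windows ever mattered, a thin shim would cover it (hypothetical; the engine itself targets POSIX platforms):&lt;/p&gt;

```c
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   /* _aligned_malloc / _aligned_free */
#endif

/* Hypothetical portability shim: posix_memalign on POSIX systems,
 * _aligned_malloc on native Windows. Note the matching free differs. */
static inline void *xaligned_malloc(size_t size) {
#ifdef _WIN32
    return _aligned_malloc(size, 64);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, 64, size) != 0) return NULL;
    return ptr;
#endif
}

static inline void xaligned_free(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```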
&lt;h2&gt;
  
  
  The Speech Decoder: Scalar Code Hiding in Plain Sight
&lt;/h2&gt;

&lt;p&gt;After the alignment win, we profiled again. The speech decoder still took&lt;br&gt;
~1,650ms for 62 frames. Digging into the code, we found something&lt;br&gt;
embarrassing: &lt;strong&gt;six scalar RMSNorm loops&lt;/strong&gt; that were never converted to SIMD.&lt;/p&gt;

&lt;p&gt;The Talker and Code Predictor used our SIMD-optimized &lt;code&gt;qwen_rms_norm()&lt;/code&gt;&lt;br&gt;
(NEON on ARM, AVX on x86)&lt;br&gt;
function. But the speech decoder had its own hand-written scalar version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: scalar, called 480 times per generation (60 frames x 8 layers)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n_frames&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;inv_rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sqrtf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;xn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;inv_rms&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// After: one line&lt;/span&gt;
&lt;span class="n"&gt;qwen_rms_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;attn_norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_frames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dec_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SIMD version (NEON on ARM, AVX on x86) processes 4-8 floats per iteration&lt;br&gt;
with fused multiply-accumulate, versus one float at a time in the scalar version.&lt;/p&gt;

&lt;p&gt;Same story with RoPE (rotary position embeddings) — the speech decoder had a&lt;br&gt;
scalar loop doing paired rotations at 32 elements per head. We replaced it&lt;br&gt;
with SIMD intrinsics that process 4 pairs at once, fusing Q and K rotation&lt;br&gt;
in the same pass (shown here with NEON; AVX variant in &lt;code&gt;qwen_tts_kernels_avx.c&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// NEON: 4-wide fused Q+K rotation&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cos_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sin_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;vmlsq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmlaq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;vmlsq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmlaq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also replaced the scalar attention dot-product loop with our SIMD-optimized&lt;br&gt;
windowed causal attention kernel (NEON/AVX) — online softmax with wide dot&lt;br&gt;
products and fused V accumulation.&lt;/p&gt;

&lt;p&gt;And the VQ dequantization step, which did per-frame scalar matrix-vector&lt;br&gt;
products for codebook projection, was batched into a single &lt;code&gt;cblas_sgemm&lt;/code&gt;&lt;br&gt;
call across all frames.&lt;/p&gt;
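&lt;p&gt;The batching rests on a simple identity: N matrix-vector products against the same weight matrix equal one matrix-matrix product. A dependency-free sketch with toy dimensions, using a naive loop as a stand-in for &lt;code&gt;cblas_sgemm&lt;/code&gt;:&lt;/p&gt;

```c
/* Identity behind the change: per-frame matvecs y_t = W x_t (t = 0..T-1)
 * are one GEMM Y = X W^T, which a single cblas_sgemm call computes.
 * naive_gemm stands in for BLAS; dimensions are tiny and illustrative. */
enum { T = 3, DIN = 4, DOUT = 2 };   /* frames, input dim, output dim */

static void matvec(float W[DOUT][DIN], float x[DIN], float y[DOUT]) {
    for (int o = 0; o < DOUT; o++) {
        y[o] = 0.0f;
        for (int i = 0; i < DIN; i++) y[o] += W[o][i] * x[i];
    }
}

static void naive_gemm(float X[T][DIN], float W[DOUT][DIN], float Y[T][DOUT]) {
    for (int t = 0; t < T; t++)
        for (int o = 0; o < DOUT; o++) {
            Y[t][o] = 0.0f;
            for (int i = 0; i < DIN; i++) Y[t][o] += X[t][i] * W[o][i];
        }
}
```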

&lt;p&gt;&lt;strong&gt;Combined result: speech decoder 11% faster&lt;/strong&gt; (1,446ms to 1,288ms).&lt;/p&gt;
&lt;h2&gt;
  
  
  Eliminating Per-Token Malloc
&lt;/h2&gt;

&lt;p&gt;The generation loop was doing malloc/free for every token:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;topk_filter()&lt;/code&gt;: &lt;code&gt;malloc(vocab_size * sizeof(float))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per sample&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;topp_filter()&lt;/code&gt;: &lt;code&gt;malloc(vocab_size * sizeof(int))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per sample&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embed_one_text_token()&lt;/code&gt;: two &lt;code&gt;malloc(text_hidden * sizeof(float))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per text token&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen_talker_prefill()&lt;/code&gt;: 14 large buffers allocated and freed per generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a typical generation of 60 frames, that's ~120 malloc/free pairs just for&lt;br&gt;
sampling, plus ~14 large buffer allocations for prefill.&lt;/p&gt;

&lt;p&gt;We pre-allocated everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sampling buffers persist as module-level statics (allocated once on first call)&lt;/li&gt;
&lt;li&gt;Text embedding temps stored in the context struct&lt;/li&gt;
&lt;li&gt;Prefill buffers (including ~50MB of f32 weight conversion temps) persist across
generations&lt;/li&gt;
&lt;/ul&gt;
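&lt;p&gt;The sampling buffers follow a grow-only scratch pattern; a minimal sketch with hypothetical names (the real code sizes these off the vocab on the first call, and single-threaded use is assumed since sampling runs on the generation thread):&lt;/p&gt;

```c
#include <stdlib.h>

/* Grow-only scratch buffer, allocated once and reused across samples
 * (hypothetical helper illustrating the pattern; not thread-safe). */
static float *g_topk_scratch = NULL;
static size_t g_topk_cap = 0;

static float *topk_scratch(size_t n) {
    if (n > g_topk_cap) {
        free(g_topk_scratch);                 /* free(NULL) is a no-op */
        g_topk_scratch = (float *)malloc(n * sizeof(float));
        g_topk_cap = g_topk_scratch ? n : 0;
    }
    return g_topk_scratch;
}
```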

&lt;p&gt;The single-run impact is negligible (&amp;lt;1%), but in &lt;strong&gt;server mode&lt;/strong&gt;, where the&lt;br&gt;
model handles many sequential requests, the second request runs &lt;strong&gt;38% faster&lt;/strong&gt;&lt;br&gt;
because all buffers are warm in cache and there is no allocation overhead.&lt;/p&gt;

&lt;p&gt;The generation loop now has &lt;strong&gt;zero per-token malloc calls&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Text Embedding Cache: Avoiding Redundant Work
&lt;/h2&gt;

&lt;p&gt;Each text token goes through a two-layer MLP projection (bf16 lookup → fc1 2048×2048&lt;br&gt;
SiLU → fc2 1024×2048) — about 12 million FLOPs per token. For a 57-token prompt,&lt;br&gt;
that's ~29ms of pure compute. On a server handling the same or similar requests, this&lt;br&gt;
work is entirely redundant.&lt;/p&gt;

&lt;p&gt;We added two levels of caching:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Special token cache&lt;/strong&gt; (computed once at model load): &lt;code&gt;tts_pad&lt;/code&gt;, &lt;code&gt;tts_bos&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;tts_eos&lt;/code&gt; are used in &lt;em&gt;every&lt;/em&gt; request. Pre-computing them at load time eliminates&lt;br&gt;
3 matvec pairs per generation — trivial change, zero runtime cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU hash map&lt;/strong&gt; for all text tokens: An open-addressing hash table maps &lt;code&gt;token_id →&lt;br&gt;
float[hidden]&lt;/code&gt; with 2048 slots. On a cache hit, a single 4KB memcpy replaces two bf16&lt;br&gt;
matrix-vector multiplications. The table uses Knuth multiplicative hashing with linear&lt;br&gt;
probing and LRU eviction when full.&lt;/p&gt;
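&lt;p&gt;The lookup side of that table can be sketched as follows (illustrative names, the hidden dimension shrunk for brevity, and the LRU eviction on a full table elided):&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

#define EMB_SLOTS  2048       /* power of two, as in the real table      */
#define EMB_HIDDEN 4          /* 1024 in the engine; tiny for the sketch */

typedef struct {
    int32_t token_id;         /* -1 marks an empty slot */
    float vec[EMB_HIDDEN];
} emb_slot_t;

static emb_slot_t emb_cache[EMB_SLOTS];

static void emb_cache_init(void) {
    for (int i = 0; i < EMB_SLOTS; i++) emb_cache[i].token_id = -1;
}

/* Knuth multiplicative hash, masked to the power-of-two table size. */
static uint32_t emb_hash(uint32_t id) {
    return (id * 2654435761u) & (EMB_SLOTS - 1);
}

/* Returns the cached embedding row, or NULL on a miss. */
static const float *emb_cache_get(int32_t id) {
    for (uint32_t p = 0, h = emb_hash((uint32_t)id); p < EMB_SLOTS; p++) {
        emb_slot_t *s = &emb_cache[(h + p) & (EMB_SLOTS - 1)];
        if (s->token_id == id) return s->vec;
        if (s->token_id == -1) return NULL;   /* empty slot ends the probe */
    }
    return NULL;
}

static void emb_cache_put(int32_t id, const float *vec) {
    for (uint32_t p = 0, h = emb_hash((uint32_t)id); p < EMB_SLOTS; p++) {
        emb_slot_t *s = &emb_cache[(h + p) & (EMB_SLOTS - 1)];
        if (s->token_id == -1 || s->token_id == id) {
            s->token_id = id;
            memcpy(s->vec, vec, sizeof s->vec);
            return;
        }
    }
    /* table full: the real code evicts the least-recently-used entry here */
}
```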

&lt;p&gt;Memory cost: 2048 × 1024 × 4 bytes = &lt;strong&gt;8MB&lt;/strong&gt; — negligible compared to the ~1.2GB&lt;br&gt;
model weights. Always active (both CLI and server) since the overhead is near-zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: 14% faster on long-text server cold call&lt;/strong&gt; (RTF 1.55 → 1.33). On warm calls&lt;br&gt;
the improvement is smaller (~2%) because subsequent requests already benefit from OS&lt;br&gt;
page cache and buffer reuse.&lt;/p&gt;
&lt;h2&gt;
  
  
  Decoder Thread: Pipeline Parallelism
&lt;/h2&gt;

&lt;p&gt;The TTS pipeline has three stages: Talker generates a codec token, the Code Predictor&lt;br&gt;
fills in 15 more codebook entries, then the speech decoder converts those codes to&lt;br&gt;
audio. The original code ran these strictly sequentially — the speech decoder waited&lt;br&gt;
until ALL frames were generated, then processed everything in one batch.&lt;/p&gt;

&lt;p&gt;But the speech decoder is completely independent of the Talker and Code Predictor.&lt;br&gt;
It reads completed codec frames and writes audio. No shared weights, no shared KV&lt;br&gt;
cache. And it's already designed for incremental operation: the pre-transformer uses&lt;br&gt;
sliding-window causal attention (window=72), and the ConvNet is fully causal.&lt;/p&gt;

&lt;p&gt;The fix: a producer-consumer pipeline with two threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Talker → CP → push frame] → [Talker → CP → push frame] → ...
Decoder thread: [wait] → [decode chunk] → [wait] → [decode chunk] → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main thread pushes completed frames to a mutex-guarded queue. The decoder thread&lt;br&gt;
wakes on a condition variable, pulls available frames, and decodes them incrementally&lt;br&gt;
using the existing streaming decoder path. At the end, the main thread joins the&lt;br&gt;
decoder thread and collects the accumulated audio.&lt;/p&gt;

&lt;p&gt;~150 lines of pthreads code: mutex + condvar queue, producer push, consumer loop,&lt;br&gt;
join + audio collection.&lt;/p&gt;
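&lt;p&gt;The core of those ~150 lines is a cursor-based queue guarded by a mutex and condition variable; a minimal sketch with illustrative names (frame payloads and chunked decode elided):&lt;/p&gt;

```c
#include <pthread.h>

#define MAX_FRAMES 1024

/* Single-producer/single-consumer frame queue (illustrative sketch).
 * The producer only appends, so the consumer tracks its own cursor. */
typedef struct {
    int frames[MAX_FRAMES];   /* stand-in for codec frame payloads       */
    int count;                /* frames pushed so far (< MAX_FRAMES)     */
    int done;                 /* producer finished                       */
    pthread_mutex_t mu;
    pthread_cond_t cv;
} frame_queue_t;

static void fq_push(frame_queue_t *q, int frame) {
    pthread_mutex_lock(&q->mu);
    q->frames[q->count++] = frame;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

static void fq_finish(frame_queue_t *q) {
    pthread_mutex_lock(&q->mu);
    q->done = 1;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

/* Consumer: block until new frames exist (or the producer is done),
 * then return count so the caller can decode frames [*cursor, count). */
static int fq_wait(frame_queue_t *q, int *cursor) {
    pthread_mutex_lock(&q->mu);
    while (q->count == *cursor && !q->done)
        pthread_cond_wait(&q->cv, &q->mu);
    int avail = q->count;
    pthread_mutex_unlock(&q->mu);
    return avail;
}
```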
&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLI short (~5s audio)&lt;/td&gt;
&lt;td&gt;RTF 2.01&lt;/td&gt;
&lt;td&gt;RTF 1.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server short cold&lt;/td&gt;
&lt;td&gt;RTF 1.85&lt;/td&gt;
&lt;td&gt;RTF 1.50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server long warm&lt;/td&gt;
&lt;td&gt;RTF 1.31&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RTF 1.26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gain is largest on short text where the speech decoder is a bigger fraction of&lt;br&gt;
total time. On long text, Talker+CP dominate and the decoder overlap has less to&lt;br&gt;
hide. The "drain" at the end (waiting for the decoder to finish its last chunk) is&lt;br&gt;
only ~500ms on short text.&lt;/p&gt;

&lt;p&gt;One trade-off: the decoder thread competes with the main thread for CPU cores and&lt;br&gt;
memory bandwidth. Talker+CP ms/frame increases slightly (~10%) due to contention,&lt;br&gt;
but the net wall-time improvement from overlapping far exceeds this cost.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quickselect: When the Algorithm Is the Bug
&lt;/h2&gt;

&lt;p&gt;After all the SIMD and threading work, we noticed the "Codec head+sampling"&lt;br&gt;
line in the timing report: &lt;strong&gt;93ms&lt;/strong&gt; for 101 frames. That's almost 1ms per frame&lt;br&gt;
spent on... sampling? Something was off.&lt;/p&gt;

&lt;p&gt;The top-k filter used &lt;strong&gt;selection sort&lt;/strong&gt; to find the k-th largest logit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// O(k × n) — selection sort to find top-k threshold&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;max_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;max_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;k=50&lt;/code&gt; and &lt;code&gt;n=3072&lt;/code&gt; (codec vocabulary), that's &lt;strong&gt;153,600 comparisons per&lt;br&gt;
frame&lt;/strong&gt; × 101 frames = 15.5M comparisons. Selection sort is O(k·n); with&lt;br&gt;
&lt;code&gt;k=50&lt;/code&gt;, that's 50× the work of an O(n) selection algorithm.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;quickselect&lt;/strong&gt; (Hoare's algorithm). It finds the k-th element in&lt;br&gt;
O(n) average time using 3-way partitioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;quickselect_kth_largest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;pivot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="c1"&gt;// 3-way partition: [&amp;gt;pivot] [==pivot] [&amp;lt;pivot]&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: 93ms → 21ms (4.4× faster).&lt;/strong&gt; Output bit-identical — same threshold,&lt;br&gt;
same filtering, same samples. The only thing that changed was how fast we find&lt;br&gt;
the threshold value.&lt;/p&gt;
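&lt;p&gt;Filling in the 3-way partition elided above, a complete self-contained version might look like this (a sketch; the engine's actual partition code may differ in detail):&lt;/p&gt;

```c
/* Quickselect for the top-k threshold: returns the k-th largest value
 * (k >= 1) by iteratively 3-way partitioning into [>pivot][==pivot][<pivot].
 * Mutates arr; O(n) expected time. */
static float quickselect_kth_largest(float *arr, int n, int k) {
    int lo = 0, hi = n - 1, target = k - 1;   /* index in descending order */
    while (lo < hi) {
        float pivot = arr[lo + (hi - lo) / 2];
        int i = lo, lt = lo, gt = hi;
        while (i <= gt) {
            if (arr[i] > pivot) {             /* belongs left of the == run  */
                float t = arr[i]; arr[i] = arr[lt]; arr[lt] = t;
                lt++; i++;
            } else if (arr[i] < pivot) {      /* belongs right of the == run */
                float t = arr[i]; arr[i] = arr[gt]; arr[gt] = t;
                gt--;
            } else {
                i++;
            }
        }
        if (target < lt)      hi = lt - 1;    /* k-th largest is in [>pivot] */
        else if (target > gt) lo = gt + 1;    /* ... in [<pivot]             */
        else                  return pivot;   /* inside the [==pivot] run    */
    }
    return arr[lo];
}
```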

&lt;p&gt;We also checked softmax (3 scalar passes over vocab) and top-p (O(n²) full&lt;br&gt;
sort). Softmax turned out to be ~1.5ms total — with &lt;code&gt;-ffast-math&lt;/code&gt; on macOS,&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; is already vectorized by the compiler via Accelerate. And top-p is&lt;br&gt;
skipped entirely at the default &lt;code&gt;top_p=1.0&lt;/code&gt;. So quickselect was the only&lt;br&gt;
sampling fix that mattered.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming Pipeline: Closing the Last Gap
&lt;/h2&gt;

&lt;p&gt;With streaming mode (&lt;code&gt;--stream&lt;/code&gt;), the user hears audio as it generates — chunks&lt;br&gt;
of ~0.8s arrive progressively. But streaming was &lt;strong&gt;30% slower&lt;/strong&gt; than normal mode&lt;br&gt;
(RTF 2.0 vs 1.4). Why?&lt;/p&gt;

&lt;p&gt;Normal mode uses a &lt;strong&gt;decoder thread&lt;/strong&gt;: the speech decoder runs in the background&lt;br&gt;
while Talker+CP generate the next frame. The two stages overlap in time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Gen F1] [Gen F2] [Gen F3] ...
Decoder thread:          [Dec F1] [Dec F2] ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But streaming mode ran the decoder &lt;strong&gt;synchronously in the main thread&lt;/strong&gt;. Every&lt;br&gt;
10 frames, the main thread stopped generating to decode audio and call the&lt;br&gt;
callback. The main thread was blocked during decode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Gen F1-10] [DECODE+CALLBACK] [Gen F11-20] [DECODE+CALLBACK] ...
                             ^^^^ BLOCKED ^^^^               ^^^^ BLOCKED ^^^^
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: use the decoder thread for streaming too. Instead of accumulating audio&lt;br&gt;
in a buffer, the decoder thread calls the audio callback directly. The main&lt;br&gt;
thread never blocks on decode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb_userdata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;cb_aborted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;dt_append_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_samples&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The callback (&lt;code&gt;fwrite&lt;/code&gt; + &lt;code&gt;fflush&lt;/code&gt; to a WAV file, or &lt;code&gt;send()&lt;/code&gt; to an HTTP socket)&lt;br&gt;
is invoked from the decoder thread only, so plain &lt;code&gt;fwrite&lt;/code&gt; or &lt;code&gt;send()&lt;/code&gt; calls need no extra synchronization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: Streaming RTF 2.04 → 1.38&lt;/strong&gt; — identical to normal mode. The change&lt;br&gt;
was &lt;code&gt;-80&lt;/code&gt; lines, &lt;code&gt;+53&lt;/code&gt; lines (net simpler!), because we deleted the entire&lt;br&gt;
synchronous streaming code path and unified everything through the decoder&lt;br&gt;
thread.&lt;/p&gt;

&lt;p&gt;The output is &lt;strong&gt;bit-identical&lt;/strong&gt; across all four modes: CLI normal, CLI streaming,&lt;br&gt;
HTTP server normal, HTTP server streaming. Same seed, same speaker, same language&lt;br&gt;
→ same bytes in the WAV file.&lt;/p&gt;
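&lt;p&gt;A minimal user-side callback might look like this — a sketch, not the project's exact API: the callback signature is inferred from the engine snippet above, and &lt;code&gt;wav_sink_t&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```c
#include <stdio.h>

/* Hypothetical sink: the callback signature (samples, count, userdata;
 * non-zero return aborts) is inferred from the engine snippet above. */
typedef struct {
    FILE *f;
    long total_samples;
} wav_sink_t;

static int write_chunk_cb(const float *samples, int n, void *userdata) {
    wav_sink_t *sink = (wav_sink_t *)userdata;
    if (fwrite(samples, sizeof(float), (size_t)n, sink->f) != (size_t)n)
        return 1;               /* non-zero tells the engine to abort */
    fflush(sink->f);            /* push the chunk out immediately */
    sink->total_samples += n;
    return 0;                   /* keep generating */
}
```

&lt;p&gt;Because the decoder thread is the only caller, the sink needs no locking of its own.&lt;/p&gt;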
&lt;h2&gt;
  
  
  Batch vvexpf: Transcendentals Are Expensive One at a Time
&lt;/h2&gt;

&lt;p&gt;After the algorithmic wins, we went hunting for smaller gains. The SwiGLU&lt;br&gt;
activation function in every transformer layer computes &lt;code&gt;x * sigmoid(x)&lt;/code&gt;, and&lt;br&gt;
sigmoid needs &lt;code&gt;expf()&lt;/code&gt;. In a 28-layer Talker and a 5-layer Code Predictor&lt;br&gt;
running 15 passes per frame, that's ~163,000 individual &lt;code&gt;expf()&lt;/code&gt; calls per&lt;br&gt;
audio frame.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;expf()&lt;/code&gt; is a transcendental function — high latency, hard to pipeline.&lt;br&gt;
But calling them one by one wastes the CPU's SIMD units. The fix: batch them.&lt;/p&gt;

&lt;p&gt;On macOS, Apple's Accelerate framework provides &lt;code&gt;vvexpf()&lt;/code&gt; — a vectorized&lt;br&gt;
exponential that processes an entire array at once using optimized SIMD paths&lt;br&gt;
internally. We wrote a &lt;code&gt;qwen_swiglu_inplace()&lt;/code&gt; kernel that computes&lt;br&gt;
&lt;code&gt;gate = up * gate / (1 + exp(-gate))&lt;/code&gt; over the full intermediate&lt;br&gt;
dimension, with a single batched &lt;code&gt;vvexpf&lt;/code&gt; handling all the exponentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;qwen_swiglu_inplace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="cp"&gt;#if defined(__APPLE__) &amp;amp;&amp;amp; defined(USE_BLAS)
&lt;/span&gt;    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = -gate&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vneg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = exp(-gate)  (batch)&lt;/span&gt;
    &lt;span class="n"&gt;vvexpf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = 1 + exp(-gate)&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vsadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = x / (1 + exp(-gate))  →  sigmoid(x) * x via up vector&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vdiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="c1"&gt;// scalar fallback — compiler auto-vectorizes with -ffast-math&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;expf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]));&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: Code Predictor 8% faster&lt;/strong&gt; (76 ms/f → 70 ms/f). Those ~163K scalar&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; calls per frame collapsed into ~206 batched &lt;code&gt;vvexpf&lt;/code&gt; calls. Not a&lt;br&gt;
headline number, but it's free — the output is bit-identical and the code is&lt;br&gt;
actually cleaner than the inline scalar loop it replaced.&lt;/p&gt;

&lt;p&gt;The Abrash lesson applies here too: just as he taught us that unaligned memory&lt;br&gt;
access wastes bus cycles, calling transcendentals one at a time wastes SIMD&lt;br&gt;
lanes. The hardware &lt;em&gt;wants&lt;/em&gt; to process 4-8 values at once — you just have to&lt;br&gt;
feed it that way.&lt;/p&gt;
&lt;h2&gt;
  
  
  SIMD BF16 Accumulation: One More Scalar Loop
&lt;/h2&gt;

&lt;p&gt;The codec embedding lookup accumulates 15 codebook vectors per audio frame —&lt;br&gt;
each a BF16-to-F32 conversion followed by a vector add. The original code did&lt;br&gt;
this scalar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;src_bf16&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We wrote &lt;code&gt;qwen_bf16_accum_f32()&lt;/code&gt; with NEON and AVX2 paths. The NEON version&lt;br&gt;
processes 8 BF16 values per iteration — load, shift-widen to F32, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// NEON: 8-wide BF16→F32 accumulate&lt;/span&gt;
&lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;bf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_bf16&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vreinterpretq_f32_u32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vshll_n_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vget_low_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vreinterpretq_f32_u32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vshll_n_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vget_high_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;vaddq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;f0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vaddq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AVX2 version does the same with 256-bit registers — &lt;code&gt;cvtepu16_epi32&lt;/code&gt; to&lt;br&gt;
zero-extend, &lt;code&gt;slli_epi32&lt;/code&gt; to shift into F32 position, &lt;code&gt;add_ps&lt;/code&gt; to accumulate.&lt;/p&gt;
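&lt;p&gt;As a standalone sketch (the function name and the scalar tail are assumptions; the project's real &lt;code&gt;qwen_bf16_accum_f32()&lt;/code&gt; may differ in details), the AVX2 path looks like this:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Accumulate n BF16 values into an F32 buffer: dst[i] += bf16_to_f32(src[i]).
static void bf16_accum_f32(float *dst, const uint16_t *src, int n) {
    int i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {
        __m128i bf  = _mm_loadu_si128((const __m128i *)(src + i)); // 8 x u16
        __m256i u32 = _mm256_cvtepu16_epi32(bf);                   // zero-extend
        __m256  f   = _mm256_castsi256_ps(_mm256_slli_epi32(u32, 16)); // shift into F32 position
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_loadu_ps(dst + i), f));
    }
#endif
    for (; i < n; i++) {                        // scalar tail / fallback
        uint32_t bits = (uint32_t)src[i] << 16;
        float val;
        memcpy(&val, &bits, sizeof val);
        dst[i] += val;
    }
}
```

&lt;p&gt;On machines without AVX2 the tail loop is the whole function — which is exactly the original scalar code.&lt;/p&gt;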

&lt;p&gt;The per-frame impact is small (~0.5-1ms), but it adds up over hundreds of&lt;br&gt;
frames and eliminates yet another scalar loop hiding in a SIMD codebase —&lt;br&gt;
exactly the kind of thing Abrash warned about: the fast path is only fast if&lt;br&gt;
&lt;em&gt;all&lt;/em&gt; the code on it is optimized.&lt;/p&gt;
&lt;h2&gt;
  
  
  Delta Prefill: Reusing the KV Cache Across Requests
&lt;/h2&gt;

&lt;p&gt;The Talker's prompt has a fixed structure: ChatML header, speaker token,&lt;br&gt;
language token, codec control tokens, then the actual text. For a server&lt;br&gt;
handling multiple requests with the same speaker and language, the prefix&lt;br&gt;
is identical every time — but we were re-prefilling it from scratch on every&lt;br&gt;
call.&lt;/p&gt;

&lt;p&gt;Causal attention gives us a nice property: prefix tokens produce identical&lt;br&gt;
KV cache entries regardless of what comes after. If the first 8 tokens of&lt;br&gt;
the prompt match the previous request, their KV entries are already in the&lt;br&gt;
cache. We just need to prefill the &lt;em&gt;new&lt;/em&gt; tokens.&lt;/p&gt;

&lt;p&gt;The implementation compares the current input embeddings against the previous&lt;br&gt;
call's cached embeddings (stored in &lt;code&gt;prev_input_embeds&lt;/code&gt;). If the first N&lt;br&gt;
embeddings match, we skip to position N and only process the delta:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1: [header][speaker][lang][codec][text_A]  →  full prefill (18 tokens)
Request 2: [header][speaker][lang][codec][text_B]  →  delta prefill (skip 8, process 10)
Request 3: [header][speaker][lang][codec][text_C]  →  delta prefill (skip 8, process 7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the speaker or language changes, the prefix differs and we fall back to&lt;br&gt;
full prefill automatically — no special-casing needed.&lt;/p&gt;
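&lt;p&gt;The prefix comparison itself is simple. Here's a sketch (names and the bit-exact comparison policy are assumptions, not the project's exact code):&lt;/p&gt;

```c
#include <string.h>

/* Compare the new prompt's input embeddings against the ones cached from
 * the previous request and return how many leading token positions match.
 * Tokens 0..match-1 already have valid KV cache entries; prefill resumes
 * at position `match`. */
static int matching_prefix_len(const float *prev, int prev_tokens,
                               const float *cur, int cur_tokens, int dim) {
    int max = prev_tokens < cur_tokens ? prev_tokens : cur_tokens;
    for (int t = 0; t < max; t++) {
        /* bit-exact compare is safe: identical tokens produce identical
         * embedding lookups, so any byte difference means a different token */
        if (memcmp(prev + (size_t)t * dim, cur + (size_t)t * dim,
                   (size_t)dim * sizeof(float)) != 0)
            return t;
    }
    return max;
}
```

&lt;p&gt;With the match length in hand, the prefill loop simply starts at that position instead of zero; a mismatch at position 0 (new speaker or language) degrades to a full prefill for free.&lt;/p&gt;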

&lt;p&gt;&lt;strong&gt;Result: ~50% prefill time savings on repeated speaker&lt;/strong&gt; in server mode. For&lt;br&gt;
a chatbot or voice assistant scenario where you're generating many responses&lt;br&gt;
in the same voice, this eliminates the biggest fixed cost in the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantization: What the 1.7B Model Taught Us
&lt;/h2&gt;

&lt;p&gt;We'd already tried INT4 and INT8 quantization on the 0.6B model and found&lt;br&gt;
them slower or neutral — the matrices are too small (hidden=1024) to be&lt;br&gt;
bandwidth-bound, so dequantization overhead dominates. But the 1.7B model&lt;br&gt;
has &lt;code&gt;hidden=2048&lt;/code&gt; and &lt;code&gt;intermediate=6144&lt;/code&gt; — 4× larger matrices. Time to&lt;br&gt;
revisit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INT8 (&lt;code&gt;--int8&lt;/code&gt;): 20% Talker speedup on 1.7B.&lt;/strong&gt; Per-row absmax quantization&lt;br&gt;
at load time (scale = max(|row|) / 127), NEON int8 matvec for decode. The&lt;br&gt;
Talker went from 79.3 ms/f to 67.4 ms/f. Audio quality is good — no&lt;br&gt;
perceptible degradation in A/B tests.&lt;/p&gt;
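&lt;p&gt;The load-time quantization and the matching dequant-free dot product can be sketched like this (a simplified F32-input version for clarity — the real loader reads BF16 rows and the real matvec uses NEON int8 intrinsics):&lt;/p&gt;

```c
#include <stdint.h>
#include <math.h>

/* Per-row absmax INT8 quantization: scale = max(|row|) / 127.
 * Dequantized weight = q[i] * scale. Returns the per-row scale. */
static float quantize_row_int8(const float *row, int8_t *q, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(row[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 127.0f;
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (int i = 0; i < n; i++) {
        float v = row[i] * inv;                           /* map into [-127, 127] */
        q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round to nearest */
    }
    return scale;
}

/* Decode-time dot product: accumulate int8 weights against f32 activations,
 * applying the per-row scale once at the end. */
static float int8_row_dot(const int8_t *q, float scale, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += (float)q[i] * x[i];
    return acc * scale;
}
```

&lt;p&gt;Folding the scale back in once per row — rather than per weight — is what keeps the unpack overhead low enough for INT8 to win on the larger matrices.&lt;/p&gt;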

&lt;p&gt;&lt;strong&gt;INT4 Q4_0 (&lt;code&gt;--int4&lt;/code&gt;): no speedup, actually 4% slower.&lt;/strong&gt; We used the same&lt;br&gt;
nibble-packed format as llama.cpp (32 weights per block, 16 bytes + 1 fp32&lt;br&gt;
scale). The NEON unpack path needs AND, SHR, subtract-8, widen, convert —&lt;br&gt;
about 8 ops per 32 weights versus 1 op for BF16 (&lt;code&gt;vshll&lt;/code&gt;). Even at 2048-wide,&lt;br&gt;
the compute overhead exceeds the bandwidth savings.&lt;/p&gt;
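&lt;p&gt;For reference, a scalar unpack of one such block looks like this (a sketch — the struct name and the low-nibble-first ordering are assumptions, and the post's fp32 scale is used as stated):&lt;/p&gt;

```c
#include <stdint.h>

/* One Q4_0-style block as described above: 32 weights packed as nibbles
 * into 16 bytes, plus one fp32 scale. Stored values are nibbles minus 8. */
typedef struct {
    float scale;
    uint8_t qs[16];
} q4_block_t;

static void q4_unpack_block(const q4_block_t *b, float *out /* 32 floats */) {
    for (int j = 0; j < 16; j++) {
        int lo = (b->qs[j] & 0x0F) - 8;  /* mask, re-center around zero */
        int hi = (b->qs[j] >> 4)   - 8;  /* shift, re-center */
        out[2 * j]     = (float)lo * b->scale;
        out[2 * j + 1] = (float)hi * b->scale;
    }
}
```

&lt;p&gt;Count the work per pair of weights — mask, shift, two subtracts, two int-to-float converts, two multiplies — against a single shift for BF16, and the INT4 result stops being surprising.&lt;/p&gt;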

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Talker ms/f&lt;/th&gt;
&lt;th&gt;CP ms/f&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1.7B BF16&lt;/td&gt;
&lt;td&gt;79.3&lt;/td&gt;
&lt;td&gt;87.0&lt;/td&gt;
&lt;td&gt;4.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B INT8&lt;/td&gt;
&lt;td&gt;67.4&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;3.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B INT4&lt;/td&gt;
&lt;td&gt;82.6&lt;/td&gt;
&lt;td&gt;81.7&lt;/td&gt;
&lt;td&gt;4.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.6B BF16&lt;/td&gt;
&lt;td&gt;22.5&lt;/td&gt;
&lt;td&gt;82.0&lt;/td&gt;
&lt;td&gt;2.15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The takeaway: quantization is not a universal win. It depends on whether&lt;br&gt;
you're compute-bound or bandwidth-bound at your specific matrix dimensions.&lt;br&gt;
INT8 hits the sweet spot for 1.7B — enough bandwidth reduction to matter,&lt;br&gt;
low enough unpack overhead (3 NEON ops vs BF16's 1) to not eat the gains.&lt;br&gt;
INT4's nibble unpacking (8 ops) crosses the break-even point. And on 0.6B,&lt;br&gt;
nothing helps because you're compute-bound anyway.&lt;/p&gt;

&lt;p&gt;This echoes what Abrash wrote about optimization traps: "the fastest code&lt;br&gt;
is the code you don't execute." INT4 adds &lt;em&gt;more&lt;/em&gt; code per weight (unpack,&lt;br&gt;
shift, subtract, widen, convert, scale, accumulate) than BF16 (shift,&lt;br&gt;
accumulate). The memory savings are real, but speed is what matters for&lt;br&gt;
realtime TTS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Analyzed and Skipped
&lt;/h2&gt;

&lt;p&gt;Not every optimization idea pans out. Here's what we investigated and rejected:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Struct field reordering&lt;/strong&gt; (est. 3-7%, actual: 0%). The &lt;code&gt;qwen_tts_ctx_t&lt;/code&gt;&lt;br&gt;
struct is 7.6 KB spanning 119 cache lines. We built a layout analyzer and&lt;br&gt;
found that the hot decode fields (KV cache pointers, decode buffers) already&lt;br&gt;
sit on adjacent cache lines 112-118. More importantly, the struct is accessed&lt;br&gt;
via pointer indirection — the CPU loads the pointer once and the struct stays&lt;br&gt;
in L1. The bottleneck is the data these pointers &lt;em&gt;reference&lt;/em&gt; (multi-MB weight&lt;br&gt;
matrices), not the 8-byte pointer loads themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L1 cache blocking for matvec&lt;/strong&gt; (est. 3-5%, actual: not worth the complexity).&lt;br&gt;
Our bf16 matvec kernel already processes 2 rows at a time with 8 SIMD&lt;br&gt;
accumulators (NEON/AVX), doing 32 elements per inner loop iteration. The input vector&lt;br&gt;
(4 KB for hidden=1024) fits entirely in L1. The weight matrix access is&lt;br&gt;
sequential, which the hardware prefetcher handles well. The bottleneck is&lt;br&gt;
main memory bandwidth (~10 GB/s effective out of 68 GB/s peak), not cache&lt;br&gt;
misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefetch hints in CP loop&lt;/strong&gt; (est. 0.5-1%, actual: not possible). Each Code&lt;br&gt;
Predictor layer has ~26 MB of weights. The M1's shared L2 is 12 MB. You&lt;br&gt;
can't prefetch what doesn't fit. The hardware prefetcher handles sequential&lt;br&gt;
access within each matvec just fine — it's the layer &lt;em&gt;transitions&lt;/em&gt; that cause&lt;br&gt;
cold misses, and those are unavoidable without smaller weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INT4/INT8 quantization on 0.6B&lt;/strong&gt; (tested, slower or neutral). See the&lt;br&gt;
Quantization section above — the 0.6B model's hidden=1024 matrices are&lt;br&gt;
compute-bound, not bandwidth-bound. Quantization only helped on 1.7B (INT8:&lt;br&gt;
20% Talker speedup), while INT4 was slower even there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Softmax SIMD vectorization&lt;/strong&gt; (est. 2-4×, actual: not worth it). After&lt;br&gt;
quickselect reduced total sampling from 93ms to 21ms, softmax is only ~1.5ms&lt;br&gt;
of the remaining 21ms. With &lt;code&gt;-ffast-math&lt;/code&gt;, the compiler already vectorizes&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; via platform libraries (Accelerate on macOS, libm on Linux). No&lt;br&gt;
headroom for custom NEON/AVX exp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech decoder depthwise conv / LayerNorm SIMD&lt;/strong&gt; (est. 1.5-3×, actual: not&lt;br&gt;
worth it). The speech decoder runs in a background thread overlapped with&lt;br&gt;
generation. It finishes &lt;em&gt;before&lt;/em&gt; Talker+CP complete — it's not the bottleneck.&lt;br&gt;
ConvNeXt depthwise conv does 1.4M FLOPs vs 838M FLOPs for the BLAS-accelerated&lt;br&gt;
pointwise convolutions. Optimizing 0.2% of the decoder's compute is pointless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separating INT8 fields from CP layer struct&lt;/strong&gt; (est. 2-3% cache, actual: not&lt;br&gt;
worth it). Only 5 layers × 264 bytes = 1.3KB total. The bottleneck is the&lt;br&gt;
weight data (26MB per layer), not the 8-byte pointer loads in the struct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  0.6B Model (Primary Target)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;After all optimizations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Talker&lt;/td&gt;
&lt;td&gt;46.9 ms/f&lt;/td&gt;
&lt;td&gt;~22 ms/f&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor&lt;/td&gt;
&lt;td&gt;104.7 ms/f&lt;/td&gt;
&lt;td&gt;~60 ms/f (batch vvexpf: 70→60)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech Decoder&lt;/td&gt;
&lt;td&gt;~2,600ms (blocking)&lt;/td&gt;
&lt;td&gt;overlapped (background thread)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill&lt;/td&gt;
&lt;td&gt;~1,800ms&lt;/td&gt;
&lt;td&gt;~1,000–1,600ms (delta: ~500ms repeat)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codec head+sampling&lt;/td&gt;
&lt;td&gt;93ms&lt;/td&gt;
&lt;td&gt;21ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-token malloc calls&lt;/td&gt;
&lt;td&gt;~120+&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI, short ~5s audio)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.4–1.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI, long ~17s audio)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI &lt;code&gt;--stream&lt;/code&gt;)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~1.4–1.7&lt;/strong&gt; (same as normal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (server warm, short)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.39&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (server warm, long)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.26&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1.7B Model (with INT8)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;BF16&lt;/th&gt;
&lt;th&gt;INT8 (&lt;code&gt;--int8&lt;/code&gt;)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Talker&lt;/td&gt;
&lt;td&gt;79.3 ms/f&lt;/td&gt;
&lt;td&gt;67.4 ms/f (&lt;strong&gt;20% faster&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor&lt;/td&gt;
&lt;td&gt;87.0 ms/f&lt;/td&gt;
&lt;td&gt;78.7 ms/f&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.59&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All on an Apple M1 8-core, 16 GB RAM, 4 threads. RTF improves with longer&lt;br&gt;
audio because prefill is a fixed cost that amortizes over more frames. The&lt;br&gt;
speech decoder runs in a background thread, overlapping most of its work with&lt;br&gt;
generation — including streaming mode, where the decoder thread calls the audio&lt;br&gt;
callback directly. Server mode with embedding cache, warm buffers, delta prefill,&lt;br&gt;
and decoder thread overlap delivers the best RTF at &lt;strong&gt;1.26&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alignment matters more than you think.&lt;/strong&gt; A 24% speedup from&lt;br&gt;
&lt;code&gt;posix_memalign&lt;/code&gt; is absurd in 2026, but BLAS libraries really do check&lt;br&gt;
alignment and choose different code paths. Abrash was right in 1997 and&lt;br&gt;
he's right now.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile before you optimize.&lt;/strong&gt; We nearly implemented L1 cache blocking&lt;br&gt;
for the matvec kernel — a complex change — before realizing the kernel was&lt;br&gt;
already bandwidth-bound and the complexity would gain nothing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Look for scalar code in SIMD codebases.&lt;/strong&gt; When different components are&lt;br&gt;
written at different times, it's easy for one file to miss an optimization&lt;br&gt;
all the others have. We found six scalar RMSNorm loops hiding in the speech decoder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-malloc decode loops matter for servers.&lt;/strong&gt; The single-run difference&lt;br&gt;
is negligible, but for a long-running server handling request after request,&lt;br&gt;
eliminating allocation churn in the hot loop adds up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache computed results, not just data.&lt;/strong&gt; The LRU text embedding cache&lt;br&gt;
avoids recomputing token projections (12M FLOPs each) across requests. At&lt;br&gt;
8MB for 2048 tokens, it's practically free. The lesson: when you spot a&lt;br&gt;
pure function called repeatedly with the same inputs, memoize it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline independent stages.&lt;/strong&gt; The speech decoder doesn't share any state&lt;br&gt;
with the Talker or Code Predictor. Once we recognized that, overlapping them&lt;br&gt;
with a simple producer-consumer thread was ~150 lines for a 14-19% speedup.&lt;br&gt;
Look for stages in your pipeline that only consume the output of previous&lt;br&gt;
stages — those are free parallelism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check your algorithms, not just your SIMD.&lt;/strong&gt; A 4× sampling speedup from&lt;br&gt;
replacing selection sort with quickselect — no intrinsics, no threading,&lt;br&gt;
just a better algorithm. Profile first, but when you find O(kn) in a hot&lt;br&gt;
loop, fix the algorithm before reaching for SIMD.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unify code paths.&lt;/strong&gt; Streaming was 30% slower because it had its own&lt;br&gt;
synchronous decode path. When we unified it with the decoder thread (the&lt;br&gt;
same path normal mode uses), the gap disappeared. Two code paths that do&lt;br&gt;
the same thing will always diverge in performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch your transcendentals.&lt;/strong&gt; Calling &lt;code&gt;expf()&lt;/code&gt; 163,000 times per frame&lt;br&gt;
is slower than calling &lt;code&gt;vvexpf()&lt;/code&gt; 206 times — same math, same result,&lt;br&gt;
8% faster. SIMD units want batches. This is the Abrash data alignment&lt;br&gt;
lesson in a different guise: don't waste hardware lanes by feeding values&lt;br&gt;
one at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exploit causal structure for caching.&lt;/strong&gt; Causal attention means prefix&lt;br&gt;
tokens produce identical KV entries regardless of suffix. Delta prefill&lt;br&gt;
cuts server prefill time in half for repeated speakers — zero accuracy&lt;br&gt;
cost, because the math guarantees identical outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantization is not free compression.&lt;/strong&gt; INT8 works on 1.7B (20% win)&lt;br&gt;
because the matrices are large enough to be bandwidth-bound. INT4 loses&lt;br&gt;
on every model size we tested — the nibble unpack overhead exceeds the&lt;br&gt;
bandwidth savings. Always measure before assuming "smaller weights = faster."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the old books.&lt;/strong&gt; Abrash's &lt;em&gt;Graphics Programming Black Book&lt;/em&gt; and&lt;br&gt;
Carmack's &lt;code&gt;.plan&lt;/code&gt; files are from another era, but the principles — cache&lt;br&gt;
friendliness, data alignment, knowing your memory hierarchy — are timeless.&lt;br&gt;
The specific rules change (64-byte cache lines instead of dword alignment),&lt;br&gt;
but the instinct to think about how data flows through the CPU is exactly&lt;br&gt;
the same. Every optimization in this post — alignment, SIMD batching,&lt;br&gt;
pipeline parallelism, algorithmic complexity — traces back to ideas those&lt;br&gt;
two articulated thirty years ago. The hardware evolved; the thinking didn't.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This is part of the &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt;&lt;br&gt;
project — a pure C inference engine for Qwen3-TTS text-to-speech models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tts</category>
      <category>audio</category>
      <category>c</category>
    </item>
    <item>
      <title>Building a Text-to-Speech Engine in Pure C</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Mon, 09 Mar 2026 14:49:04 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/building-a-text-to-speech-engine-in-pure-c-59h4</link>
      <guid>https://forem.com/gabrielemastrapasqua/building-a-text-to-speech-engine-in-pure-c-59h4</guid>
      <description>&lt;p&gt;I built a &lt;strong&gt;pure C inference engine&lt;/strong&gt; for &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice" rel="noopener noreferrer"&gt;Qwen3-TTS&lt;/a&gt;, Alibaba's open-source text-to-speech model. The goal: run high-quality multilingual TTS on CPU, with zero Python dependencies, inspired by &lt;a href="https://github.com/antirez" rel="noopener noreferrer"&gt;antirez's&lt;/a&gt; approach to minimal C inference engines (specifically his &lt;a href="https://github.com/antirez/qwen-asr" rel="noopener noreferrer"&gt;qwen-asr&lt;/a&gt; project). The code is on GitHub: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What started as a "let's just get the basic pipeline working" exercise turned into a full-featured TTS engine with streaming output, an HTTP server, voice cloning, and custom voice design — all in a single C binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why pure C?
&lt;/h2&gt;

&lt;p&gt;The official Qwen3-TTS runs on PyTorch with the usual stack of transformers, tokenizers, and CUDA. That's fine for a GPU server, but I wanted something that runs anywhere — a single binary, no runtime dependencies, just mmap the model weights and go.&lt;/p&gt;

&lt;p&gt;The result: &lt;code&gt;make blas&lt;/code&gt;, point it at a model directory, and you get a ~200KB binary that does everything.&lt;/p&gt;
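
&lt;p&gt;The "mmap the weights and go" idea can be sketched in a few lines. This is an illustrative snippet, not the project's actual loader (which also parses the safetensors header): map the file read-only and let the OS page weights in on demand.&lt;/p&gt;

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch: map a weights file read-only. Error handling
   is trimmed; the real loader also parses the safetensors header. */
static const void *map_weights(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after close */
    if (p == MAP_FAILED) { *len = 0; return NULL; }
    *len = (size_t)st.st_size;
    return p;
}
```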

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Qwen3-TTS is a three-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Talker&lt;/strong&gt; — a 28-layer causal Qwen3 LLM (0.6B or 1.7B params) with GQA, RoPE, and SwiGLU that generates discrete audio frame tokens from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Predictor&lt;/strong&gt; — a 5-layer transformer that runs 15 sequential passes per frame, filling in the remaining codebook entries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Decoder&lt;/strong&gt; — a causal ConvNet with Snake activations, ResBlocks, and 480x upsampling that converts discrete codes to 24kHz audio&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage was reimplemented from scratch in C. The model supports 9 preset voices, 10 languages, and both 0.6B and 1.7B model sizes (auto-detected from weights).&lt;/p&gt;
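
&lt;p&gt;The per-frame control flow of the first two stages looks roughly like this — a stub sketch to show the 1 + 15 codebook split described above; the function names are illustrative, not the engine's real API:&lt;/p&gt;

```c
#define NUM_CODEBOOKS 16  /* 1 from the Talker + 15 from the Code Predictor */

/* Stubs standing in for the real model stages, to show control flow only. */
static int talker_step(int frame)      { return frame; }
static int cp_step(int frame, int cb)  { return frame * 100 + cb; }

/* One frame: the Talker emits codebook 0, then the Code Predictor
   runs 15 sequential passes to fill in codebooks 1..15. */
static void generate_frame(int frame, int codes[NUM_CODEBOOKS]) {
    codes[0] = talker_step(frame);
    for (int cb = 1; cb < NUM_CODEBOOKS; cb++)
        codes[cb] = cp_step(frame, cb);
}
```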

&lt;h2&gt;
  
  
  BF16 weights, float32 compute
&lt;/h2&gt;

&lt;p&gt;The model weights are stored in bfloat16 and memory-mapped directly from standard HuggingFace safetensors files. On Apple Silicon, bf16-to-f32 conversion is essentially free (it's just a left shift), and this approach gives &lt;strong&gt;bit-identical results&lt;/strong&gt; to the Python reference with greedy decoding.&lt;/p&gt;
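
&lt;p&gt;The widening really is just a shift into the top half of a 32-bit word — a minimal sketch:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

/* bfloat16 is the top 16 bits of an IEEE-754 float32, so widening is a
   16-bit left shift followed by a bit-cast (memcpy avoids aliasing UB). */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```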

&lt;p&gt;I did experiment with INT4 quantization, but for the 0.6B model the matrices are too small to be bandwidth-bound — the Q4 unpack overhead actually made it &lt;strong&gt;20% slower&lt;/strong&gt;. BF16 turned out to be the sweet spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming output
&lt;/h2&gt;

&lt;p&gt;The speech decoder is fully causal (no lookahead), which made streaming architecturally possible. The engine generates N frames, decodes a chunk through the speech decoder, and writes audio immediately — no need to wait for the full sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pipe raw PCM to an audio player for real-time playback&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello world"&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    play &lt;span class="nt"&gt;-t&lt;/span&gt; raw &lt;span class="nt"&gt;-r&lt;/span&gt; 24000 &lt;span class="nt"&gt;-e&lt;/span&gt; signed &lt;span class="nt"&gt;-b&lt;/span&gt; 16 &lt;span class="nt"&gt;-c&lt;/span&gt; 1 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First audio arrives within ~1 second. The speech decoder uses incremental decoding with KV caching, so each streaming chunk is O(chunk_size) rather than re-processing the full sequence.&lt;/p&gt;
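
&lt;p&gt;The chunked loop behind this is simple in outline. A stub sketch — the chunk size and function names are illustrative, not the project's real API:&lt;/p&gt;

```c
#define FRAMES_PER_CHUNK 4  /* illustrative chunk size */

/* Stubs standing in for real generation/decoding; decode_and_write_chunk
   returns the number of frames it decoded so the flow can be checked. */
static void generate_frame_codes(int frame)          { (void)frame; }
static int  decode_and_write_chunk(int first, int n) { (void)first; return n; }

/* Generate frames, and every FRAMES_PER_CHUNK frames (or at the end)
   decode just that chunk; the decoder's KV cache makes each call
   O(chunk) instead of O(total sequence). */
static int stream_tts(int total_frames) {
    int pending = 0, decoded = 0;
    for (int f = 0; f < total_frames; f++) {
        generate_frame_codes(f);
        pending++;
        if (pending == FRAMES_PER_CHUNK || f + 1 == total_frames) {
            decoded += decode_and_write_chunk(f + 1 - pending, pending);
            pending = 0;
        }
    }
    return decoded;
}
```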

&lt;h2&gt;
  
  
  HTTP server
&lt;/h2&gt;

&lt;p&gt;The engine includes an embedded HTTP server — no nginx, no FastAPI, just start it and send requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start server (model loaded once, shared across requests)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--serve&lt;/span&gt; 8080

&lt;span class="c"&gt;# Generate speech&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/tts &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"text":"Hello world","speaker":"ryan","language":"English"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also has an &lt;strong&gt;OpenAI-compatible endpoint&lt;/strong&gt; (&lt;code&gt;/v1/audio/speech&lt;/code&gt;) so you can use it as a drop-in replacement for OpenAI's TTS API in existing apps, plus a streaming endpoint that sends chunked PCM as it generates.&lt;/p&gt;
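
&lt;p&gt;A hypothetical request against the OpenAI-compatible endpoint — the field names below follow OpenAI's &lt;code&gt;/v1/audio/speech&lt;/code&gt; schema (&lt;code&gt;model&lt;/code&gt;/&lt;code&gt;input&lt;/code&gt;/&lt;code&gt;voice&lt;/code&gt;); the exact fields this server accepts may differ, so check the project README:&lt;/p&gt;

```shell
# Illustrative only: OpenAI-style field names, against a local server
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-tts","input":"Hello world","voice":"ryan"}' \
  -o speech.wav
```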

&lt;h2&gt;
  
  
  Voice cloning
&lt;/h2&gt;

&lt;p&gt;Using the Base model variant, you can clone any voice from a few seconds of reference audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, this is my cloned voice."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; reference.wav &lt;span class="nt"&gt;-o&lt;/span&gt; cloned.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, this runs a full ECAPA-TDNN speaker encoder to extract a 1024-dim speaker embedding from the reference audio's mel spectrogram. You can save and reload embeddings to avoid re-extracting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract and save&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; ref.wav &lt;span class="nt"&gt;--save-voice&lt;/span&gt; my_voice.bin &lt;span class="nt"&gt;-o&lt;/span&gt; out.wav

&lt;span class="c"&gt;# Reuse later (instant)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Another sentence"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--load-voice&lt;/span&gt; my_voice.bin &lt;span class="nt"&gt;-o&lt;/span&gt; out2.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speech tokenizer encoder (Mimi-based, with 4-stage strided convolutions, an 8-layer transformer, and split RVQ quantization) was also implemented for the full ICL (in-context learning) cloning mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  VoiceDesign
&lt;/h2&gt;

&lt;p&gt;The 1.7B VoiceDesign model can create entirely new voices from natural language descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-voice-design &lt;span class="nt"&gt;-l&lt;/span&gt; English &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"A deep male voice with a British accent, speaking slowly and calmly"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, this is a test."&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; british.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No reference audio needed — just describe what you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Style and emotion control
&lt;/h2&gt;

&lt;p&gt;The 1.7B CustomVoice model supports an &lt;code&gt;--instruct&lt;/code&gt; flag to control speaking style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"I cannot believe you did that."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"Speak in a very angry and aggressive tone"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; angry.wav

./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"I cannot believe you did that."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"Speak very slowly and softly, in a sad whisper"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; whisper.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same text, completely different delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On Apple Silicon (M-series, 4 threads):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Per-frame&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.6B&lt;/td&gt;
&lt;td&gt;~0.7-0.86x realtime&lt;/td&gt;
&lt;td&gt;Talker 24ms + CP 70ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;~0.48x realtime&lt;/td&gt;
&lt;td&gt;Talker 92ms + CP 75ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bottleneck is the Code Predictor — 15 sequential autoregressive passes per frame, no way around it.&lt;/p&gt;
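
&lt;p&gt;Those per-frame numbers line up with the realtime figures, assuming the 12 Hz frame rate implied by the model name (each frame covers ~83 ms of audio); the small remaining gap is the speech decoder and overhead. A back-of-envelope check:&lt;/p&gt;

```c
/* Back-of-envelope realtime factor, assuming a 12 Hz frame rate
   (each frame = 1000/12 ms of audio). Inputs are the table's
   per-frame Talker + Code Predictor costs. */
static double realtime_factor(double talker_ms, double cp_ms) {
    double audio_ms_per_frame = 1000.0 / 12.0;   /* ~83.3 ms */
    return audio_ms_per_frame / (talker_ms + cp_ms);
}
```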

&lt;p&gt;Key optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NEON-optimized bf16 matvec&lt;/strong&gt; with multi-row fusion (2-row fused dispatch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused gate+up projections&lt;/strong&gt; for SwiGLU in both Talker and Code Predictor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified QKV dispatch&lt;/strong&gt; to reduce threading overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NEON kernels&lt;/strong&gt; for RMSNorm, attention (dot+V accum), RoPE, Snake activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused argmax+matvec&lt;/strong&gt; in the Code Predictor hot loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;im2col + BLAS sgemm&lt;/strong&gt; for the ConvNet decoder, with tiling for large sequences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental speech decoder&lt;/strong&gt; with KV cache for streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-thread dispatch_apply&lt;/strong&gt; (sweet spot — 8 threads hit the memory bandwidth ceiling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting from ~0.4x realtime pre-optimization, these brought the 0.6B model to ~0.86x realtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metal GPU detour
&lt;/h2&gt;

&lt;p&gt;I implemented a full Metal GPU backend — compute shaders, GPU-side transformer, the works. The result? &lt;strong&gt;~1.3x slower&lt;/strong&gt; than the optimized NEON CPU path. On Apple Silicon, CPU and GPU share the same memory bus, so there's no bandwidth advantage. The NEON path was already near-optimal for these model sizes. Deleted the whole thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The debugging journey
&lt;/h2&gt;

&lt;p&gt;Getting bit-identical output required tracking down some non-obvious issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model config says &lt;code&gt;"interleaved": true&lt;/code&gt; for RoPE, but the Python code actually uses NeoX split-half rotation (the opposite!)&lt;/li&gt;
&lt;li&gt;The Code Predictor's first codebook uses the &lt;em&gt;Talker's&lt;/em&gt; codec embedding, not its own&lt;/li&gt;
&lt;li&gt;Snake activations store alpha and beta in &lt;strong&gt;log space&lt;/strong&gt; — &lt;code&gt;sin²(exp(alpha) * x)&lt;/code&gt;, not &lt;code&gt;sin²(alpha * x)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;All convolutions in the speech decoder are causal (left-only padding), including transposed convolutions&lt;/li&gt;
&lt;li&gt;ResBlock dilations are [1, 3, 9], not [1, 1, 1] as you might assume&lt;/li&gt;
&lt;li&gt;The 1.7B model needs a projection layer (2048→1024) between the Talker and Code Predictor that isn't in the 0.6B&lt;/li&gt;
&lt;/ul&gt;
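
&lt;p&gt;The RoPE gotcha in the first item is worth spelling out: for a head of dimension &lt;code&gt;d&lt;/code&gt;, split-half (NeoX) rotation pairs &lt;code&gt;x[i]&lt;/code&gt; with &lt;code&gt;x[i + d/2]&lt;/code&gt;, while interleaved would pair &lt;code&gt;x[2i]&lt;/code&gt; with &lt;code&gt;x[2i+1]&lt;/code&gt;. A minimal split-half rotation (illustrative, not the engine's NEON kernel; &lt;code&gt;cos_v&lt;/code&gt;/&lt;code&gt;sin_v&lt;/code&gt; are the precomputed angle terms for this position):&lt;/p&gt;

```c
/* NeoX split-half RoPE: rotate (x[i], x[i + d/2]) as a 2-D vector.
   cos_v/sin_v hold d/2 precomputed angle terms for one position. */
static void rope_neox_split_half(float *x, const float *cos_v,
                                 const float *sin_v, int d) {
    int half = d / 2;
    for (int i = 0; i < half; i++) {
        float a = x[i], b = x[i + half];
        x[i]        = a * cos_v[i] - b * sin_v[i];
        x[i + half] = a * sin_v[i] + b * cos_v[i];
    }
}
```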

&lt;p&gt;Each of these was a "why doesn't my output match?" rabbit hole. The final validation: correlation 0.999996 with the Python reference across the full pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build&lt;/span&gt;
make blas

&lt;span class="c"&gt;# Basic usage&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, how are you?"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; hello.wav

&lt;span class="c"&gt;# Stream to speaker&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello world"&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    play &lt;span class="nt"&gt;-t&lt;/span&gt; raw &lt;span class="nt"&gt;-r&lt;/span&gt; 24000 &lt;span class="nt"&gt;-e&lt;/span&gt; signed &lt;span class="nt"&gt;-b&lt;/span&gt; 16 &lt;span class="nt"&gt;-c&lt;/span&gt; 1 -

&lt;span class="c"&gt;# Start HTTP server&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--serve&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project supports macOS (ARM/x86), Linux (ARM/x86), and Windows via WSL2. NEON and AVX SIMD paths are included. The 0.6B model needs ~3 GB of memory, the 1.7B needs ~8 GB.&lt;/p&gt;

&lt;p&gt;The code is on GitHub: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;&lt;/p&gt;

</description>
      <category>c</category>
      <category>ai</category>
      <category>tts</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
