<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gabriele Mastrapasqua</title>
    <description>The latest articles on Forem by Gabriele Mastrapasqua (@gabrielemastrapasqua).</description>
    <link>https://forem.com/gabrielemastrapasqua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F71092%2F2baf511b-f51e-4826-bdd5-16b26f7deb31.jpeg</url>
      <title>Forem: Gabriele Mastrapasqua</title>
      <link>https://forem.com/gabrielemastrapasqua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gabrielemastrapasqua"/>
    <language>en</language>
    <item>
      <title>Extending Qwen3-TTS: clone voices once, reuse everywhere (pure C)</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:36:03 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/extending-qwen3-tts-clone-voices-once-reuse-everywhere-pure-c-271o</link>
      <guid>https://forem.com/gabrielemastrapasqua/extending-qwen3-tts-clone-voices-once-reuse-everywhere-pure-c-271o</guid>
      <description>&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt; — a pure C inference engine for Qwen3-TTS.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR — turn any 30-second clip into a first-class Qwen3-TTS voice
&lt;/h2&gt;

&lt;p&gt;Qwen3-TTS ships with &lt;strong&gt;9 preset speakers&lt;/strong&gt;. That's it. You can't add your own, you can't use the 1.7B instruct feature on a cloned voice, and every new clone has to re-run the 200 ms ECAPA-TDNN encoder from scratch.&lt;/p&gt;

&lt;p&gt;This post is about tearing that ceiling down.&lt;/p&gt;

&lt;p&gt;With the pure-C engine at &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt; you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎙️ Clone &lt;strong&gt;any voice&lt;/strong&gt; from 30 seconds of audio (ECAPA-TDNN speaker encoder implemented from scratch)&lt;/li&gt;
&lt;li&gt;💾 Save it as a portable &lt;strong&gt;&lt;code&gt;.qvoice&lt;/code&gt; file&lt;/strong&gt; and load it anywhere — CLI, HTTP server, streaming pipeline, one-shot generation&lt;/li&gt;
&lt;li&gt;🎛️ Combine a cloned voice with &lt;strong&gt;&lt;code&gt;--instruct&lt;/code&gt; style prompts&lt;/strong&gt; on 1.7B (sad / happy / angry / solemn) — something the Base model alone can't do&lt;/li&gt;
&lt;li&gt;🎯 Get &lt;strong&gt;bit-identical output&lt;/strong&gt; across runs, processes, and machines via the &lt;code&gt;WDELTA&lt;/code&gt; weight-delta format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;.qvoice&lt;/code&gt; file is a new way to &lt;em&gt;extend&lt;/em&gt; Qwen3-TTS's voice set: drop the file next to the binary, point &lt;code&gt;--load-voice&lt;/code&gt; at it, and the model speaks with your voice like it was one of the originals.&lt;/p&gt;

&lt;p&gt;That portability comes in three flavors. All produce the same voice identity; they differ in how much of the Base model's weight signature they carry along:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Mel correlation vs Base&lt;/th&gt;
&lt;th&gt;Fidelity&lt;/th&gt;
&lt;th&gt;Works with instruct?&lt;/th&gt;
&lt;th&gt;Use when…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇 &lt;code&gt;.qvoice&lt;/code&gt; &lt;strong&gt;WDELTA&lt;/strong&gt; (LZ4 full delta)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;785 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.000&lt;/strong&gt; (bit-identical)&lt;/td&gt;
&lt;td&gt;Perfect, PCM-identical&lt;/td&gt;
&lt;td&gt;✅ yes (1.7B)&lt;/td&gt;
&lt;td&gt;You're building a reusable voice asset — server, streaming, product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈 &lt;code&gt;.qvoice&lt;/code&gt; &lt;strong&gt;standard&lt;/strong&gt; (TPAD + WOVR)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good; small prosody drift&lt;/td&gt;
&lt;td&gt;⚠️ Base only&lt;/td&gt;
&lt;td&gt;Default for sharing — fits in chat, sounds right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉 &lt;code&gt;.bin&lt;/code&gt; &lt;strong&gt;embedding only&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4 KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;not measured&lt;/em&gt; (subjectively ~60–70 %)&lt;/td&gt;
&lt;td&gt;Voice drifts, timbre loose&lt;/td&gt;
&lt;td&gt;❌ no&lt;/td&gt;
&lt;td&gt;You have 4 kilobytes to spend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline: &lt;strong&gt;WDELTA makes a cloned voice a first-class citizen of the CustomVoice model&lt;/strong&gt;. You clone once on Base, save a &lt;code&gt;.qvoice&lt;/code&gt;, and the CV model loads it and treats it exactly like one of the nine built-in speakers — same latency, same server behavior, same streaming support, now with instruct-style control on top.&lt;/p&gt;

&lt;p&gt;All audio below is hosted in the same repo — follow the &lt;em&gt;wav&lt;/em&gt; links to listen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Samples — listen for yourself
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🇮🇹 Italian — &lt;em&gt;Galatea&lt;/em&gt; / Riccardo Fasol · &lt;a href="https://archive.org/details/galatea_0908_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Buongiorno a tutti, oggi vi racconto una breve storia, con la voce clonata da una registrazione libera."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; — 30 s from LibriVox · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/it_galatea_fasol.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000, bit-identical) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; (embedding only) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_it_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇬🇧 English — &lt;em&gt;The Gift of the Magi&lt;/em&gt; / Phil Chenevert · &lt;a href="https://archive.org/details/5belovedstories_ohenry_pc_librivox" rel="noopener noreferrer"&gt;LibriVox, CC0&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hello everyone, today I am speaking with a voice cloned from a freely licensed recording."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/en_ohenry_chenevert.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_en_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇪🇸 Spanish — &lt;em&gt;Don Quijote&lt;/em&gt; / Lu · &lt;a href="https://archive.org/details/donquijote_2507_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hola a todos, hoy les hablo con una voz clonada a partir de una grabación de dominio público."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/es_quijote_lu.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_es_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🇫🇷 French — &lt;em&gt;Le dernier jour d'un condamné&lt;/em&gt; / Bidou · &lt;a href="https://archive.org/details/dernierjour_2203_librivox" rel="noopener noreferrer"&gt;LibriVox, PD&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Bonjour à tous, aujourd'hui je vous parle avec une voix clonée à partir d'un enregistrement libre."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;📥 &lt;strong&gt;Input reference&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/fr_hugo_bidou.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  🎤 Voice clone output — 3 storage formats
&lt;/h4&gt;

&lt;p&gt;🥇 &lt;strong&gt;Top — WDELTA, 785 MB&lt;/strong&gt; (mel 1.000) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_wdelta_785mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥈 &lt;strong&gt;Mid — standard &lt;code&gt;.qvoice&lt;/code&gt;, 16 MB&lt;/strong&gt; (mel 0.71) · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_standard_16mb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🥉 &lt;strong&gt;Light — &lt;code&gt;.bin&lt;/code&gt;, 4 KB&lt;/strong&gt; · &lt;a href="https://raw.githubusercontent.com/gabriele-mastrapasqua/qwen3-tts/main/samples/voice_clone_refs/outputs/out_fr_bin_4kb.wav" rel="noopener noreferrer"&gt;wav&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost table — how much you pay for each tier
&lt;/h3&gt;

&lt;p&gt;All numbers on Apple M1 8-core, 16 GB RAM, 4 threads, 0.6B model, cold start.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;File size&lt;/th&gt;
&lt;th&gt;Create &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Generate (wall)&lt;/th&gt;
&lt;th&gt;What's inside&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.bin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;td&gt;~20.8 s&lt;/td&gt;
&lt;td&gt;~11 s&lt;/td&gt;
&lt;td&gt;1024 ECAPA-TDNN floats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;standard &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;16 MB&lt;/td&gt;
&lt;td&gt;~19.6 s&lt;/td&gt;
&lt;td&gt;~9.6 s&lt;/td&gt;
&lt;td&gt;Embedding + &lt;code&gt;text_projection&lt;/code&gt; + &lt;code&gt;codec_embedding&lt;/code&gt; + pad embeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WDELTA &lt;code&gt;.qvoice&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;785 MB&lt;/td&gt;
&lt;td&gt;~28.5 s&lt;/td&gt;
&lt;td&gt;~13.8 s&lt;/td&gt;
&lt;td&gt;LZ4 int16 deltas for all 402 talker + CP tensors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verdict: &lt;strong&gt;standard 16 MB is the sensible default.&lt;/strong&gt; Go to WDELTA only when you need bit-identical output or to combine a cloned voice with &lt;code&gt;--instruct&lt;/code&gt; style control (1.7B). Go to &lt;code&gt;.bin&lt;/code&gt; only if 4 KB is the whole budget.&lt;/p&gt;

&lt;p&gt;All input clips are 30-second excerpts from the 30 s mark (skipping the LibriVox preamble), 24 kHz mono PCM. Full attribution in &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts/blob/main/samples/voice_clone_refs/ATTRIBUTION.md" rel="noopener noreferrer"&gt;&lt;code&gt;samples/voice_clone_refs/ATTRIBUTION.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Now the technical part — how we got there.&lt;/p&gt;

&lt;h2&gt;
  
  
  From audio to identity: the ECAPA-TDNN pipeline
&lt;/h2&gt;

&lt;p&gt;The speaker encoder is an ECAPA-TDNN (&lt;em&gt;Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network&lt;/em&gt;), designed for speaker verification. Its job: take variable-length audio and produce a fixed-size vector that captures &lt;em&gt;who&lt;/em&gt; is speaking, independent of &lt;em&gt;what&lt;/em&gt; they're saying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Mel spectrogram
&lt;/h3&gt;

&lt;p&gt;Raw 24 kHz audio → 1024-point FFT, hop 256, 128 mel bins → &lt;code&gt;[T, 128]&lt;/code&gt;, where T grows at ≈ 94 frames per second of audio (24000 / 256 = 93.75). For 30 s of audio, roughly &lt;code&gt;[2810, 128]&lt;/code&gt;.&lt;/p&gt;
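&lt;p&gt;The frame math is worth pinning down. A minimal counter, assuming no centering or padding (the engine's exact STFT policy may differ by a frame or two):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;assert.h&amp;gt;

/* Frames produced by an STFT with window n_fft and hop:
 * one frame per hop once the first full window fits.
 * The padding policy here is an assumption, not the engine's code. */
static int mel_frames(int n_samples, int n_fft, int hop) {
    if (n_samples &amp;lt; n_fft) return 0;
    return 1 + (n_samples - n_fft) / hop;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;30 s at 24 kHz gives &lt;code&gt;1 + (720000 - 1024) / 256 = 2809&lt;/code&gt; frames, matching the rough &lt;code&gt;[2810, 128]&lt;/code&gt; above up to the padding convention.&lt;/p&gt;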

&lt;h3&gt;
  
  
  Step 2 — TDNN + SE-Res2Net blocks
&lt;/h3&gt;

&lt;p&gt;Four convolutional blocks process the mel spectrogram. The first is a plain 1D conv (128→512, k=5). The next three are Squeeze-and-Excitation Res2Net blocks: each splits 512 channels into 8 groups of 64, runs cascaded dilated convolutions (dilations 2, 3, 4) over them, and reweights channels with a small attention block. The effective receptive field grows to several hundred milliseconds by the last block.&lt;/p&gt;
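&lt;p&gt;The squeeze-and-excitation step is small enough to sketch whole. This is illustrative, not the engine's kernel: real SE blocks also carry biases, and the bottleneck width &lt;code&gt;B&lt;/code&gt; here is a stand-in parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Squeeze-and-Excitation sketch: global-average "squeeze" per channel,
 * a small bottleneck with ReLU, a sigmoid gate, channel-wise rescale. */
static void se_rescale(float *x /* [C][T] */, int C, int T,
                       const float *w1 /* [B][C] */,
                       const float *w2 /* [C][B] */, int B) {
    float *s = calloc((size_t)C, sizeof *s);
    float *h = calloc((size_t)B, sizeof *h);
    for (int c = 0; c &amp;lt; C; c++) {            /* squeeze: mean over time */
        for (int t = 0; t &amp;lt; T; t++) s[c] += x[c * T + t];
        s[c] /= (float)T;
    }
    for (int b = 0; b &amp;lt; B; b++) {            /* bottleneck + ReLU */
        for (int c = 0; c &amp;lt; C; c++) h[b] += w1[b * C + c] * s[c];
        if (h[b] &amp;lt; 0.0f) h[b] = 0.0f;
    }
    for (int c = 0; c &amp;lt; C; c++) {            /* excite: sigmoid gate */
        float z = 0.0f;
        for (int b = 0; b &amp;lt; B; b++) z += w2[c * B + b] * h[b];
        float g = 1.0f / (1.0f + expf(-z));
        for (int t = 0; t &amp;lt; T; t++) x[c * T + t] *= g;
    }
    free(s);
    free(h);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With all-zero weights the gate is sigmoid(0) = 0.5 and every channel is simply halved, which makes a handy smoke test.&lt;/p&gt;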

&lt;h3&gt;
  
  
  Step 3 — Multi-layer feature aggregation
&lt;/h3&gt;

&lt;p&gt;The three SE-Res2Net outputs are concatenated channel-wise (→ &lt;code&gt;[1536, T]&lt;/code&gt;) and passed through one more TDNN. The network now has access to features at every level of abstraction simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Attentive Statistics Pooling
&lt;/h3&gt;

&lt;p&gt;The most important step — the one that collapses variable-length time into a fixed-size vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1536, T] → mean, std across T → concat [hidden, mean, std] → [4608, T]
         → TDNN(4608→128) → tanh → Conv1d(128→1536) → softmax over T
         → weighted mean + weighted std → [3072]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attention learns &lt;em&gt;which frames matter most&lt;/em&gt;. Sustained vowels reveal more about vocal tract shape than fricatives or silence — the network weights them higher. This is also why &lt;strong&gt;varied reference audio beats long monotone audio&lt;/strong&gt;: more variation, richer pooling. 30 s is our default sweet spot.&lt;/p&gt;
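&lt;p&gt;The last arrow of that diagram, the weighted statistics, fits in a few lines of C. A sketch that assumes the per-frame scores are already computed (the score network above is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;math.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Attentive statistics pooling, final step: softmax the per-frame
 * scores over T, then take the attention-weighted mean and std of
 * each channel. Output layout: [2*C] = weighted means ++ weighted stds. */
static void attn_stats(const float *h /* [C][T] */,
                       const float *score /* [T] */,
                       int C, int T, float *out /* [2*C] */) {
    float w[4096];                         /* sketch: assumes T &amp;lt;= 4096 */
    float m = score[0], z = 0.0f;
    for (int t = 1; t &amp;lt; T; t++) if (score[t] &amp;gt; m) m = score[t];
    for (int t = 0; t &amp;lt; T; t++) { w[t] = expf(score[t] - m); z += w[t]; }
    for (int t = 0; t &amp;lt; T; t++) w[t] /= z;
    for (int c = 0; c &amp;lt; C; c++) {
        float mu = 0.0f, var = 0.0f;
        for (int t = 0; t &amp;lt; T; t++) mu += w[t] * h[c * T + t];
        for (int t = 0; t &amp;lt; T; t++) {
            float d = h[c * T + t] - mu;
            var += w[t] * d * d;
        }
        out[c] = mu;
        out[C + c] = sqrtf(var &amp;gt; 0.0f ? var : 0.0f);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;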

&lt;h3&gt;
  
  
  Step 5 — Final projection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conv1d(3072 → enc_dim, kernel=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;enc_dim&lt;/th&gt;
&lt;th&gt;hidden&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.6B-Base&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B-Base&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output — 1024 or 2048 floats — replaces the discrete speaker token in the transformer prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug hidden by a coincidence
&lt;/h2&gt;

&lt;p&gt;Voice cloning worked on 0.6B. On 1.7B it sounded completely wrong. The cause was a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;enc_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// hardcoded — wrong for 1.7B&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On 0.6B, &lt;code&gt;enc_dim == hidden == 1024&lt;/code&gt; by coincidence. On 1.7B, &lt;code&gt;enc_dim == 2048&lt;/code&gt;, so we were writing 1024 valid floats into a 2048-dim slot — the rest was uninitialized memory. The first half of the hidden state got a real speaker; the second half got garbage.&lt;/p&gt;

&lt;p&gt;The fix was reading &lt;code&gt;enc_dim&lt;/code&gt; from &lt;code&gt;config.json&lt;/code&gt;. &lt;strong&gt;Lesson:&lt;/strong&gt; when two model sizes "work" but one sounds wrong, check whether shared code accidentally matches by coincidence rather than by design.&lt;/p&gt;
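&lt;p&gt;The shape of the fix, with a deliberately naive scanner standing in for the engine's real config parsing (&lt;code&gt;config_int&lt;/code&gt; is a made-up helper):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Naive scan for an integer field in an in-memory config.json.
 * Illustrative only; use a real JSON parser in production. */
static int config_int(const char *json, const char *key, int fallback) {
    char pat[64];
    snprintf(pat, sizeof pat, "\"%s\"", key);
    const char *p = strstr(json, pat);
    if (!p) return fallback;
    p = strchr(p + strlen(pat), ':');
    return p ? atoi(p + 1) : fallback;
}

/* The one-line fix, instead of the hardcoded 1024:
 *   enc-&amp;gt;enc_dim = config_int(json, "enc_dim", -1); */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;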

&lt;h2&gt;
  
  
  Why 1.7B clones better
&lt;/h2&gt;

&lt;p&gt;After the fix, 1.7B consistently produced more faithful clones. Two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2048-dim embedding&lt;/strong&gt; vs 1024-dim — twice the capacity to capture breathiness, nasality, micro-timing in phoneme transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4× transformer parameters&lt;/strong&gt; — the model can actually &lt;em&gt;use&lt;/em&gt; the richer embedding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A detailed speaker embedding is only useful if the model has the capacity to condition on those details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;.qvoice&lt;/code&gt; v3 format
&lt;/h2&gt;

&lt;p&gt;Cloning isn't free (~200 ms of ECAPA-TDNN inference per 30 s of audio), and, more importantly, a raw embedding alone loses prosody. The &lt;code&gt;.qvoice&lt;/code&gt; format stores everything needed to reproduce the clone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QVCE magic + version 3
├── Speaker embedding        (1024 or 2048 floats)
├── Reference text + ICL codec tokens (optional)
├── META                     (language, voice name, source model size, flags)
├── TPAD                     (source model's tts_pad/bos/eos embeddings, 12 KB)
├── WOVR                     (text_projection + codec_embedding, 16 MB)
└── WDELTA                   (LZ4 int16 deltas for all talker+CP weights, ~785 MB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section is optional. You pick the trade-off.&lt;/p&gt;
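&lt;p&gt;Optional sections keep the loader simple: walk tagged records, take what's present, skip the rest. A sketch over an in-memory buffer, assuming a 4-byte tag plus u32 length per section (the real on-disk framing may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Walk a tag+length chunk stream and return the payload of `tag`,
 * skipping sections the caller doesn't want. Framing is assumed. */
static const uint8_t *find_section(const uint8_t *buf, size_t n,
                                   const char tag[4], uint32_t *len) {
    size_t off = 0;
    while (off + 8 &amp;lt;= n) {
        uint32_t l;
        memcpy(&amp;amp;l, buf + off + 4, sizeof l);
        if (memcmp(buf + off, tag, 4) == 0) {
            *len = l;
            return buf + off + 8;
        }
        off += 8 + l;
    }
    return NULL;   /* section absent: caller falls back to a lighter tier */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;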

&lt;h2&gt;
  
  
  How we got to bit-identical
&lt;/h2&gt;

&lt;p&gt;Base and CustomVoice share &lt;strong&gt;99.98 %&lt;/strong&gt; of transformer weights (cosine ≈ 0.9999 per layer). But BF16 values differ at 87 % of positions, and those micro-differences accumulate autoregressively. Closing the gap was a three-step elimination:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TPAD (+12 KB)&lt;/strong&gt; — override the source model's &lt;code&gt;tts_pad_embed&lt;/code&gt;. Mel correlation 0.756.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WOVR (+16 MB)&lt;/strong&gt; — override &lt;code&gt;text_projection&lt;/code&gt; and &lt;code&gt;codec_embedding&lt;/code&gt; entirely. Mel correlation 0.711, RTF 1.60.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WDELTA (+785 MB, LZ4)&lt;/strong&gt; — int16 deltas for every remaining layer. Mel correlation &lt;strong&gt;1.000&lt;/strong&gt;, PCM bit-identical.&lt;/li&gt;
&lt;/ol&gt;
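&lt;p&gt;Why can an int16 delta be lossless at all? BF16 weights are 16-bit values, so one scheme that works is subtracting the raw encodings with wraparound: reconstruction is exact, and near-identical weights yield tiny deltas that LZ4 compresses well. Whether WDELTA uses exactly this encoding is an assumption here; the repo defines the real format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;assert.h&amp;gt;

/* Lossless 16-bit delta over raw bf16 encodings: wraparound arithmetic
 * guarantees delta_apply(base, delta_make(base, cv)) == cv for every
 * pair, and near-identical weights produce tiny deltas. */
static inline uint16_t delta_make(uint16_t base_bits, uint16_t cv_bits) {
    return (uint16_t)(cv_bits - base_bits);
}
static inline uint16_t delta_apply(uint16_t base_bits, uint16_t delta) {
    return (uint16_t)(base_bits + delta);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;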

&lt;p&gt;Two things bit us along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial layer replacement is worse than none.&lt;/strong&gt; Replacing the 5 most-divergent layers out of 28 dropped quality below the no-replacement baseline. The transformer is a chain; mismatched interfaces at layer boundaries cost more than uniform small differences everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Code Predictor has its own weights.&lt;/strong&gt; Even after replacing all 28 talker layers, codebooks 5–15 still diverged until we also deltaed the CP's 86 tensors and rebuilt its &lt;code&gt;gate_up_fused&lt;/code&gt; buffer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LZ4 vs zlib
&lt;/h2&gt;

&lt;p&gt;We started with zlib. It produced smaller files but decompression dominated load time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;File (0.6B)&lt;/th&gt;
&lt;th&gt;Decompress&lt;/th&gt;
&lt;th&gt;Total wall&lt;/th&gt;
&lt;th&gt;vs preset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;zlib&lt;/td&gt;
&lt;td&gt;510 MB&lt;/td&gt;
&lt;td&gt;~4 s&lt;/td&gt;
&lt;td&gt;15.9 s&lt;/td&gt;
&lt;td&gt;+32 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LZ4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;785 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12.8 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7 %&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a one-shot load at startup, decompression speed matters more than file size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Style control + cloned voice
&lt;/h2&gt;

&lt;p&gt;On 1.7B + WDELTA you can finally combine &lt;code&gt;--instruct&lt;/code&gt; with a cloned voice — something the Base model alone can't do, because it was never trained with both signals together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; silvio_17b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Una notizia importante."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="s2"&gt;"Parla con voce triste e malinconica"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; sad.wav

./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; silvio_17b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Una notizia importante."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="s2"&gt;"Parla con voce allegra e entusiasta"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; happy.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voice identity stays constant; instruct modulates rhythm, pacing, and emphasis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands, for the curious
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone once (needs Base + CV of same size)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; speaker.wav &lt;span class="nt"&gt;-l&lt;/span&gt; Italian &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--voice-name&lt;/span&gt; &lt;span class="s2"&gt;"Mario"&lt;/span&gt; &lt;span class="nt"&gt;--target-cv&lt;/span&gt; qwen3-tts-0.6b &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--save-voice&lt;/span&gt; voices/mario_06b.qvoice

&lt;span class="c"&gt;# Use anywhere (only needs CV + .qvoice)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--load-voice&lt;/span&gt; voices/mario_06b.qvoice &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Ciao, come va?"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;--target-cv&lt;/code&gt;, you get the 16 MB standard format. With it, the 785 MB WDELTA.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test every model size.&lt;/strong&gt; Dimension bugs hide behind coincidences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longer audio helps, but not linearly.&lt;/strong&gt; Diversity beats duration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding dimension &lt;em&gt;is&lt;/em&gt; quality.&lt;/strong&gt; 1024 → 2048 is a clear audible jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your file formats.&lt;/strong&gt; v1 &lt;code&gt;.qvoice&lt;/code&gt; silently corrupted output on a size mismatch; v2+ fails loudly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B test by listening.&lt;/strong&gt; Unit tests pass on garbage outputs. Ears don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The encoder captures everything, not just the voice&lt;/strong&gt; — background music, room noise, a second speaker. Clean your input or run Demucs first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Style control and voice cloning live in separate worlds — until you bridge them with weight deltas.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Source, deeper dives, and benchmarks: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;. For the full weight-analysis story behind WDELTA, see &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts/blob/main/blog/cross-model-voice-analysis.md" rel="noopener noreferrer"&gt;cross-model-voice-analysis.md&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>c</category>
      <category>ai</category>
      <category>tts</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Optimizing a Qwen3-TTS Engine in Pure C: Lessons from 1990s Game Programming</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Mon, 16 Mar 2026 21:34:23 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/optimizing-a-qwen3-tts-engine-lessons-from-1990s-game-programming-440n</link>
      <guid>https://forem.com/gabrielemastrapasqua/optimizing-a-qwen3-tts-engine-lessons-from-1990s-game-programming-440n</guid>
      <description>&lt;p&gt;&lt;em&gt;How cache alignment, SIMD intrinsics (NEON/AVX), pipeline threading, and lessons from 1990s game programming nearly tripled the inference speed of Qwen3-TTS, reducing RTF from 3.5 to 1.26.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Point
&lt;/h2&gt;

&lt;p&gt;We have a &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;pure C inference engine&lt;/a&gt; for qwen3-tts, a text-to-speech model with&lt;br&gt;
a 28-layer transformer (Talker), a 5-layer code predictor, and a convolutional&lt;br&gt;
speech decoder. No Python, no PyTorch, no GPU — just C, Apple Accelerate BLAS,&lt;br&gt;
and SIMD intrinsics (NEON on ARM, AVX on x86) on an Apple M1 with 16 GB RAM.&lt;/p&gt;

&lt;p&gt;After getting the pipeline correct and implementing the first round of SIMD&lt;br&gt;
kernels (NEON/AVX: fused 2-row bf16 matvec, unified QKV dispatch, fused gate+up SwiGLU),&lt;br&gt;
we were at &lt;strong&gt;RTF ~3.5&lt;/strong&gt; on short text and &lt;strong&gt;RTF ~2.5&lt;/strong&gt; on longer text (the fixed&lt;br&gt;
costs of prefill and speech decoding amortize over longer audio).&lt;/p&gt;

&lt;p&gt;This post covers the optimizations that brought us to &lt;strong&gt;RTF ~1.26&lt;/strong&gt; (server warm,&lt;br&gt;
long text) — up to a 2.7x total speedup with zero algorithmic changes and zero&lt;br&gt;
new dependencies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;RTF&lt;/strong&gt; = Real-Time Factor = processing_time / audio_duration. Lower is better.&lt;br&gt;
RTF &amp;lt; 1.0 means faster than real-time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The Abrash Instinct: Cache Alignment Still Matters
&lt;/h2&gt;

&lt;p&gt;If you grew up reading Michael Abrash's &lt;em&gt;Graphics Programming Black Book&lt;/em&gt;&lt;br&gt;
(1997), you remember the chapters on data alignment. Abrash hammered on a&lt;br&gt;
simple point: on the 386 and 486, unaligned memory accesses caused extra&lt;br&gt;
wait-states that destroyed performance. Word-aligned on 386, dword-aligned&lt;br&gt;
on 486 — he had the diagrams, the tables, the rules.&lt;/p&gt;

&lt;p&gt;John Carmack talked about this too in his &lt;code&gt;.plan&lt;/code&gt; files and QuakeCon talks,&lt;br&gt;
in his typical informal way — "align your structs, pack your data, think about&lt;br&gt;
cache lines." But the systematic treatment, the benchmarks, the rules of thumb?&lt;br&gt;
That was Abrash. Chapter after chapter of the Black Book devoted to data&lt;br&gt;
alignment, struct layout, and how the CPU bus punishes you for sloppy memory&lt;br&gt;
access patterns.&lt;/p&gt;

&lt;p&gt;Here's the thing: &lt;strong&gt;those lessons still apply.&lt;/strong&gt; The penalty isn't wait-states&lt;br&gt;
anymore — it's SIMD throughput. Modern CPUs have SIMD units (128-bit NEON on&lt;br&gt;
ARM, 256-bit AVX on x86) that operate on aligned data natively. When you feed&lt;br&gt;
misaligned buffers to BLAS routines like &lt;code&gt;cblas_sgemm&lt;/code&gt;, the library can't&lt;br&gt;
use its fastest SIMD paths. Apple Accelerate checks alignment at runtime and&lt;br&gt;
falls back to slower code when buffers aren't aligned.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Fix: 3 Lines of Code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;aligned_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posix_memalign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We replaced every &lt;code&gt;malloc()&lt;/code&gt; and &lt;code&gt;calloc()&lt;/code&gt; in the hot path with&lt;br&gt;
&lt;code&gt;posix_memalign(64, ...)&lt;/code&gt;. Not just the BLAS buffers — the KV caches, the&lt;br&gt;
decode buffers, the prefill temporaries. Everything that touches a SIMD&lt;br&gt;
instruction or a BLAS call got 64-byte alignment (a full cache line on x86, half of the M1's 128-byte line).&lt;/p&gt;
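&lt;p&gt;One wrinkle: &lt;code&gt;posix_memalign&lt;/code&gt; has no zeroing variant, so the &lt;code&gt;calloc()&lt;/code&gt; call sites need the clearing done by hand. A sketch of the companion helper (hypothetical name, not the engine's exact code):&lt;/p&gt;

```c
#include <stdlib.h>
#include <string.h>

/* calloc-style companion to aligned_malloc (hypothetical helper):
 * posix_memalign has no zeroing variant, so the aligned buffer is
 * cleared manually after allocation. */
static inline void *aligned_calloc(size_t nmemb, size_t size) {
    void *ptr = NULL;
    size_t total = nmemb * size;  /* caller ensures nmemb*size cannot overflow */
    if (posix_memalign(&ptr, 64, total) != 0) return NULL;
    memset(ptr, 0, total);
    return ptr;
}
```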
&lt;h3&gt;
  
  
  The Result: 24% Total Speedup
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill (BLAS sgemm)&lt;/td&gt;
&lt;td&gt;475ms&lt;/td&gt;
&lt;td&gt;260ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;45%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech Decoder (BLAS sgemm)&lt;/td&gt;
&lt;td&gt;2,580ms&lt;/td&gt;
&lt;td&gt;1,648ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor (SIMD matvec)&lt;/td&gt;
&lt;td&gt;66.4 ms/f&lt;/td&gt;
&lt;td&gt;60.8 ms/f&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.4s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prefill stage — which does batch matrix multiplication via &lt;code&gt;cblas_sgemm&lt;/code&gt; —&lt;br&gt;
nearly doubled in speed. The speech decoder, which also relies heavily on&lt;br&gt;
sgemm for its convolutions, improved by 36%. Even the Code Predictor, which&lt;br&gt;
uses our hand-written SIMD bf16 matvec kernel (NEON/AVX), gained 9% from aligned KV&lt;br&gt;
cache and decode buffers.&lt;/p&gt;

&lt;p&gt;And the output is &lt;strong&gt;bit-identical&lt;/strong&gt;. Same seed, same text, same bytes in the&lt;br&gt;
WAV file. The 24% was pure implementation overhead: performance we had been leaving on the table.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why 64 Bytes?
&lt;/h3&gt;

&lt;p&gt;The M1's L1 cache line is 128 bytes on the P-cores, but the common denominator&lt;br&gt;
across ARM and x86 is 64 bytes. The BLAS library needs at least 16-byte&lt;br&gt;
alignment for NEON (32-byte for AVX), but 64 bytes guarantees that no buffer&lt;br&gt;
straddles a cache line boundary unnecessarily. It's the sweet spot for&lt;br&gt;
cross-platform code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;posix_memalign&lt;/code&gt; is POSIX standard — it works on Linux and macOS without any&lt;br&gt;
platform-specific code, and under WSL2 as well (native Windows would need&lt;br&gt;
&lt;code&gt;_aligned_malloc&lt;/code&gt; instead). Three lines of code, zero dependencies, cross-platform.&lt;/p&gt;
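&lt;p&gt;If native Windows ever mattered, a thin shim would cover it (hypothetical; the engine itself targets POSIX platforms):&lt;/p&gt;

```c
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   /* _aligned_malloc / _aligned_free */
#endif

/* Hypothetical portability shim: posix_memalign on POSIX systems,
 * _aligned_malloc on native Windows. Note the matching free differs. */
static inline void *xaligned_malloc(size_t size) {
#ifdef _WIN32
    return _aligned_malloc(size, 64);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, 64, size) != 0) return NULL;
    return ptr;
#endif
}

static inline void xaligned_free(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```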
&lt;h2&gt;
  
  
  The Speech Decoder: Scalar Code Hiding in Plain Sight
&lt;/h2&gt;

&lt;p&gt;After the alignment win, we profiled again. The speech decoder still took&lt;br&gt;
~1,650ms for 62 frames. Digging into the code, we found something&lt;br&gt;
embarrassing: &lt;strong&gt;six scalar RMSNorm loops&lt;/strong&gt; that were never converted to SIMD.&lt;/p&gt;

&lt;p&gt;The Talker and Code Predictor used our SIMD-optimized &lt;code&gt;qwen_rms_norm()&lt;/code&gt;&lt;br&gt;
(NEON on ARM, AVX on x86)&lt;br&gt;
function. But the speech decoder had its own hand-written scalar version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: scalar, called 480 times per generation (60 frames x 8 layers)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n_frames&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;inv_rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sqrtf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_sq&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;xn&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;inv_rms&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// After: one line&lt;/span&gt;
&lt;span class="n"&gt;qwen_rms_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;attn_norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_frames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dec_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SIMD version (NEON on ARM, AVX on x86) processes 4-8 floats per iteration&lt;br&gt;
with fused multiply-accumulate, versus one float at a time in the scalar version.&lt;/p&gt;

&lt;p&gt;Same story with RoPE (rotary position embeddings) — the speech decoder had a&lt;br&gt;
scalar loop doing paired rotations at 32 elements per head. We replaced it&lt;br&gt;
with SIMD intrinsics that process 4 pairs at once, fusing Q and K rotation&lt;br&gt;
in the same pass (shown here with NEON; AVX variant in &lt;code&gt;qwen_tts_kernels_avx.c&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// NEON: 4-wide fused Q+K rotation&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cos_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sin_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;vmlsq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmlaq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;vmlsq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmlaq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vmulq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;si&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also replaced the scalar attention dot-product loop with our SIMD-optimized&lt;br&gt;
windowed causal attention kernel (NEON/AVX) — online softmax with wide dot&lt;br&gt;
products and fused V accumulation.&lt;/p&gt;

&lt;p&gt;And the VQ dequantization step, which did per-frame scalar matrix-vector&lt;br&gt;
products for codebook projection, was batched into a single &lt;code&gt;cblas_sgemm&lt;/code&gt;&lt;br&gt;
call across all frames.&lt;/p&gt;
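&lt;p&gt;The batching rests on a simple identity: N matrix-vector products against the same weight matrix equal one matrix-matrix product. A dependency-free sketch with toy dimensions, using a naive loop as a stand-in for &lt;code&gt;cblas_sgemm&lt;/code&gt;:&lt;/p&gt;

```c
/* Identity behind the change: per-frame matvecs y_t = W x_t (t = 0..T-1)
 * are one GEMM Y = X W^T, which a single cblas_sgemm call computes.
 * naive_gemm stands in for BLAS; dimensions are tiny and illustrative. */
enum { T = 3, DIN = 4, DOUT = 2 };   /* frames, input dim, output dim */

static void matvec(float W[DOUT][DIN], float x[DIN], float y[DOUT]) {
    for (int o = 0; o < DOUT; o++) {
        y[o] = 0.0f;
        for (int i = 0; i < DIN; i++) y[o] += W[o][i] * x[i];
    }
}

static void naive_gemm(float X[T][DIN], float W[DOUT][DIN], float Y[T][DOUT]) {
    for (int t = 0; t < T; t++)
        for (int o = 0; o < DOUT; o++) {
            Y[t][o] = 0.0f;
            for (int i = 0; i < DIN; i++) Y[t][o] += X[t][i] * W[o][i];
        }
}
```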

&lt;p&gt;&lt;strong&gt;Combined result: speech decoder 11% faster&lt;/strong&gt; (1,446ms to 1,288ms).&lt;/p&gt;
&lt;h2&gt;
  
  
  Eliminating Per-Token Malloc
&lt;/h2&gt;

&lt;p&gt;The generation loop was doing malloc/free for every token:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;topk_filter()&lt;/code&gt;: &lt;code&gt;malloc(vocab_size * sizeof(float))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per sample&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;topp_filter()&lt;/code&gt;: &lt;code&gt;malloc(vocab_size * sizeof(int))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per sample&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embed_one_text_token()&lt;/code&gt;: two &lt;code&gt;malloc(text_hidden * sizeof(float))&lt;/code&gt; + &lt;code&gt;free()&lt;/code&gt; per text token&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen_talker_prefill()&lt;/code&gt;: 14 large buffers allocated and freed per generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a typical generation of 60 frames, that's ~120 malloc/free pairs just for&lt;br&gt;
sampling, plus ~14 large buffer allocations for prefill.&lt;/p&gt;

&lt;p&gt;We pre-allocated everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sampling buffers persist as module-level statics (allocated once on first call)&lt;/li&gt;
&lt;li&gt;Text embedding temps stored in the context struct&lt;/li&gt;
&lt;li&gt;Prefill buffers (including ~50MB of f32 weight conversion temps) persist across
generations&lt;/li&gt;
&lt;/ul&gt;
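&lt;p&gt;The sampling buffers follow a grow-only scratch pattern; a minimal sketch with hypothetical names (the real code sizes these off the vocab on the first call, and single-threaded use is assumed since sampling runs on the generation thread):&lt;/p&gt;

```c
#include <stdlib.h>

/* Grow-only scratch buffer, allocated once and reused across samples
 * (hypothetical helper illustrating the pattern; not thread-safe). */
static float *g_topk_scratch = NULL;
static size_t g_topk_cap = 0;

static float *topk_scratch(size_t n) {
    if (n > g_topk_cap) {
        free(g_topk_scratch);                 /* free(NULL) is a no-op */
        g_topk_scratch = (float *)malloc(n * sizeof(float));
        g_topk_cap = g_topk_scratch ? n : 0;
    }
    return g_topk_scratch;
}
```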

&lt;p&gt;The single-run impact is negligible (&amp;lt;1%), but in &lt;strong&gt;server mode&lt;/strong&gt;, where the&lt;br&gt;
model handles many sequential requests, the second request runs &lt;strong&gt;38% faster&lt;/strong&gt;&lt;br&gt;
because all buffers are warm in cache and there is no allocation overhead.&lt;/p&gt;

&lt;p&gt;The generation loop now has &lt;strong&gt;zero per-token malloc calls&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Text Embedding Cache: Avoiding Redundant Work
&lt;/h2&gt;

&lt;p&gt;Each text token goes through a two-layer MLP projection (bf16 lookup → fc1 2048×2048&lt;br&gt;
SiLU → fc2 1024×2048) — about 12 million FLOPs per token. For a 57-token prompt,&lt;br&gt;
that's ~29ms of pure compute. On a server handling the same or similar requests, this&lt;br&gt;
work is entirely redundant.&lt;/p&gt;

&lt;p&gt;We added two levels of caching:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Special token cache&lt;/strong&gt; (computed once at model load): &lt;code&gt;tts_pad&lt;/code&gt;, &lt;code&gt;tts_bos&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;tts_eos&lt;/code&gt; are used in &lt;em&gt;every&lt;/em&gt; request. Pre-computing them at load time eliminates&lt;br&gt;
3 matvec pairs per generation — trivial change, zero runtime cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU hash map&lt;/strong&gt; for all text tokens: An open-addressing hash table maps &lt;code&gt;token_id →&lt;br&gt;
float[hidden]&lt;/code&gt; with 2048 slots. On a cache hit, a single 4KB memcpy replaces two bf16&lt;br&gt;
matrix-vector multiplications. The table uses Knuth multiplicative hashing with linear&lt;br&gt;
probing and LRU eviction when full.&lt;/p&gt;
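&lt;p&gt;The lookup side of that table can be sketched as follows (illustrative names, the hidden dimension shrunk for brevity, and the LRU eviction on a full table elided):&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

#define EMB_SLOTS  2048       /* power of two, as in the real table      */
#define EMB_HIDDEN 4          /* 1024 in the engine; tiny for the sketch */

typedef struct {
    int32_t token_id;         /* -1 marks an empty slot */
    float vec[EMB_HIDDEN];
} emb_slot_t;

static emb_slot_t emb_cache[EMB_SLOTS];

static void emb_cache_init(void) {
    for (int i = 0; i < EMB_SLOTS; i++) emb_cache[i].token_id = -1;
}

/* Knuth multiplicative hash, masked to the power-of-two table size. */
static uint32_t emb_hash(uint32_t id) {
    return (id * 2654435761u) & (EMB_SLOTS - 1);
}

/* Returns the cached embedding row, or NULL on a miss. */
static const float *emb_cache_get(int32_t id) {
    for (uint32_t p = 0, h = emb_hash((uint32_t)id); p < EMB_SLOTS; p++) {
        emb_slot_t *s = &emb_cache[(h + p) & (EMB_SLOTS - 1)];
        if (s->token_id == id) return s->vec;
        if (s->token_id == -1) return NULL;   /* empty slot ends the probe */
    }
    return NULL;
}

static void emb_cache_put(int32_t id, const float *vec) {
    for (uint32_t p = 0, h = emb_hash((uint32_t)id); p < EMB_SLOTS; p++) {
        emb_slot_t *s = &emb_cache[(h + p) & (EMB_SLOTS - 1)];
        if (s->token_id == -1 || s->token_id == id) {
            s->token_id = id;
            memcpy(s->vec, vec, sizeof s->vec);
            return;
        }
    }
    /* table full: the real code evicts the least-recently-used entry here */
}
```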

&lt;p&gt;Memory cost: 2048 × 1024 × 4 bytes = &lt;strong&gt;8MB&lt;/strong&gt; — negligible compared to the ~1.2GB&lt;br&gt;
model weights. Always active (both CLI and server) since the overhead is near-zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: 14% faster on long-text server cold call&lt;/strong&gt; (RTF 1.55 → 1.33). On warm calls&lt;br&gt;
the improvement is smaller (~2%) because subsequent requests already benefit from OS&lt;br&gt;
page cache and buffer reuse.&lt;/p&gt;
&lt;h2&gt;
  
  
  Decoder Thread: Pipeline Parallelism
&lt;/h2&gt;

&lt;p&gt;The TTS pipeline has three stages: Talker generates a codec token, the Code Predictor&lt;br&gt;
fills in 15 more codebook entries, then the speech decoder converts those codes to&lt;br&gt;
audio. The original code ran these strictly sequentially — the speech decoder waited&lt;br&gt;
until ALL frames were generated, then processed everything in one batch.&lt;/p&gt;

&lt;p&gt;But the speech decoder is completely independent of the Talker and Code Predictor.&lt;br&gt;
It reads completed codec frames and writes audio. No shared weights, no shared KV&lt;br&gt;
cache. And it's already designed for incremental operation: the pre-transformer uses&lt;br&gt;
sliding-window causal attention (window=72), and the ConvNet is fully causal.&lt;/p&gt;

&lt;p&gt;The fix: a producer-consumer pipeline with two threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Talker → CP → push frame] → [Talker → CP → push frame] → ...
Decoder thread: [wait] → [decode chunk] → [wait] → [decode chunk] → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main thread pushes completed frames to a mutex-guarded queue. The decoder thread&lt;br&gt;
wakes on a condition variable, pulls available frames, and decodes them incrementally&lt;br&gt;
using the existing streaming decoder path. At the end, the main thread joins the&lt;br&gt;
decoder thread and collects the accumulated audio.&lt;/p&gt;

&lt;p&gt;~150 lines of pthreads code: mutex + condvar queue, producer push, consumer loop,&lt;br&gt;
join + audio collection.&lt;/p&gt;
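&lt;p&gt;The core of those ~150 lines is a cursor-based queue guarded by a mutex and condition variable; a minimal sketch with illustrative names (frame payloads and chunked decode elided):&lt;/p&gt;

```c
#include <pthread.h>

#define MAX_FRAMES 1024

/* Single-producer/single-consumer frame queue (illustrative sketch).
 * The producer only appends, so the consumer tracks its own cursor. */
typedef struct {
    int frames[MAX_FRAMES];   /* stand-in for codec frame payloads       */
    int count;                /* frames pushed so far (< MAX_FRAMES)     */
    int done;                 /* producer finished                       */
    pthread_mutex_t mu;
    pthread_cond_t cv;
} frame_queue_t;

static void fq_push(frame_queue_t *q, int frame) {
    pthread_mutex_lock(&q->mu);
    q->frames[q->count++] = frame;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

static void fq_finish(frame_queue_t *q) {
    pthread_mutex_lock(&q->mu);
    q->done = 1;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

/* Consumer: block until new frames exist (or the producer is done),
 * then return count so the caller can decode frames [*cursor, count). */
static int fq_wait(frame_queue_t *q, int *cursor) {
    pthread_mutex_lock(&q->mu);
    while (q->count == *cursor && !q->done)
        pthread_cond_wait(&q->cv, &q->mu);
    int avail = q->count;
    pthread_mutex_unlock(&q->mu);
    return avail;
}
```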
&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLI short (~5s audio)&lt;/td&gt;
&lt;td&gt;RTF 2.01&lt;/td&gt;
&lt;td&gt;RTF 1.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server short cold&lt;/td&gt;
&lt;td&gt;RTF 1.85&lt;/td&gt;
&lt;td&gt;RTF 1.50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server long warm&lt;/td&gt;
&lt;td&gt;RTF 1.31&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RTF 1.26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gain is largest on short text where the speech decoder is a bigger fraction of&lt;br&gt;
total time. On long text, Talker+CP dominate and the decoder overlap has less to&lt;br&gt;
hide. The "drain" at the end (waiting for the decoder to finish its last chunk) is&lt;br&gt;
only ~500ms on short text.&lt;/p&gt;

&lt;p&gt;One trade-off: the decoder thread competes with the main thread for CPU cores and&lt;br&gt;
memory bandwidth. Talker+CP ms/frame increases slightly (~10%) due to contention,&lt;br&gt;
but the net wall-time improvement from overlapping far exceeds this cost.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quickselect: When the Algorithm Is the Bug
&lt;/h2&gt;

&lt;p&gt;After all the SIMD and threading work, we noticed the "Codec head+sampling"&lt;br&gt;
line in the timing report: &lt;strong&gt;93ms&lt;/strong&gt; for 101 frames. That's almost 1ms per frame&lt;br&gt;
spent on... sampling? Something was off.&lt;/p&gt;

&lt;p&gt;The top-k filter used &lt;strong&gt;selection sort&lt;/strong&gt; to find the k-th largest logit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// O(k × n) — selection sort to find top-k threshold&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;max_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;max_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;max_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;k=50&lt;/code&gt; and &lt;code&gt;n=3072&lt;/code&gt; (codec vocabulary), that's &lt;strong&gt;153,600 comparisons per&lt;br&gt;
frame&lt;/strong&gt; × 101 frames = 15.5M comparisons. Selection sort is O(k·n); with&lt;br&gt;
&lt;code&gt;k=50&lt;/code&gt;, that's 50× the work of an O(n) selection algorithm.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;quickselect&lt;/strong&gt; (Hoare's algorithm). It finds the k-th element in&lt;br&gt;
O(n) average time using 3-way partitioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;quickselect_kth_largest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;pivot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="c1"&gt;// 3-way partition: [&amp;gt;pivot] [==pivot] [&amp;lt;pivot]&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: 93ms → 21ms (4.4× faster).&lt;/strong&gt; Output bit-identical — same threshold,&lt;br&gt;
same filtering, same samples. The only thing that changed was how fast we find&lt;br&gt;
the threshold value.&lt;/p&gt;
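&lt;p&gt;Filling in the 3-way partition elided above, a complete self-contained version might look like this (a sketch; the engine's actual partition code may differ in detail):&lt;/p&gt;

```c
/* Quickselect for the top-k threshold: returns the k-th largest value
 * (k >= 1) by iteratively 3-way partitioning into [>pivot][==pivot][<pivot].
 * Mutates arr; O(n) expected time. */
static float quickselect_kth_largest(float *arr, int n, int k) {
    int lo = 0, hi = n - 1, target = k - 1;   /* index in descending order */
    while (lo < hi) {
        float pivot = arr[lo + (hi - lo) / 2];
        int i = lo, lt = lo, gt = hi;
        while (i <= gt) {
            if (arr[i] > pivot) {             /* belongs left of the == run  */
                float t = arr[i]; arr[i] = arr[lt]; arr[lt] = t;
                lt++; i++;
            } else if (arr[i] < pivot) {      /* belongs right of the == run */
                float t = arr[i]; arr[i] = arr[gt]; arr[gt] = t;
                gt--;
            } else {
                i++;
            }
        }
        if (target < lt)      hi = lt - 1;    /* k-th largest is in [>pivot] */
        else if (target > gt) lo = gt + 1;    /* ... in [<pivot]             */
        else                  return pivot;   /* inside the [==pivot] run    */
    }
    return arr[lo];
}
```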

&lt;p&gt;We also checked softmax (3 scalar passes over vocab) and top-p (O(n²) full&lt;br&gt;
sort). Softmax turned out to be ~1.5ms total — with &lt;code&gt;-ffast-math&lt;/code&gt; on macOS,&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; is already vectorized by the compiler via Accelerate. And top-p is&lt;br&gt;
skipped entirely at the default &lt;code&gt;top_p=1.0&lt;/code&gt;. So quickselect was the only&lt;br&gt;
sampling fix that mattered.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming Pipeline: Closing the Last Gap
&lt;/h2&gt;

&lt;p&gt;With streaming mode (&lt;code&gt;--stream&lt;/code&gt;), the user hears audio as it generates — chunks&lt;br&gt;
of ~0.8s arrive progressively. But streaming was &lt;strong&gt;30% slower&lt;/strong&gt; than normal mode&lt;br&gt;
(RTF 2.0 vs 1.4). Why?&lt;/p&gt;

&lt;p&gt;Normal mode uses a &lt;strong&gt;decoder thread&lt;/strong&gt;: the speech decoder runs in the background&lt;br&gt;
while Talker+CP generate the next frame. The two stages overlap in time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Gen F1] [Gen F2] [Gen F3] ...
Decoder thread:          [Dec F1] [Dec F2] ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But streaming mode ran the decoder &lt;strong&gt;synchronously in the main thread&lt;/strong&gt;. Every&lt;br&gt;
10 frames, the main thread stopped generating to decode audio and call the&lt;br&gt;
callback. The main thread was blocked during decode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:    [Gen F1-10] [DECODE+CALLBACK] [Gen F11-20] [DECODE+CALLBACK] ...
                             ^^^^ BLOCKED ^^^^               ^^^^ BLOCKED ^^^^
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: use the decoder thread for streaming too. Instead of accumulating audio&lt;br&gt;
in a buffer, the decoder thread calls the audio callback directly. The main&lt;br&gt;
thread never blocks on decode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;audio_cb_userdata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;cb_aborted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;dt_append_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_samples&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The callback (&lt;code&gt;fwrite&lt;/code&gt; + &lt;code&gt;fflush&lt;/code&gt; to a WAV file, or &lt;code&gt;send()&lt;/code&gt; to an HTTP socket)&lt;br&gt;
is invoked from the decoder thread only, so plain &lt;code&gt;fwrite&lt;/code&gt; or &lt;code&gt;send()&lt;/code&gt; calls need no extra synchronization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: Streaming RTF 2.04 → 1.38&lt;/strong&gt; — identical to normal mode. The change&lt;br&gt;
was &lt;code&gt;-80&lt;/code&gt; lines, &lt;code&gt;+53&lt;/code&gt; lines (net simpler!), because we deleted the entire&lt;br&gt;
synchronous streaming code path and unified everything through the decoder&lt;br&gt;
thread.&lt;/p&gt;

&lt;p&gt;The output is &lt;strong&gt;bit-identical&lt;/strong&gt; across all four modes: CLI normal, CLI streaming,&lt;br&gt;
HTTP server normal, HTTP server streaming. Same seed, same speaker, same language&lt;br&gt;
→ same bytes in the WAV file.&lt;/p&gt;
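&lt;p&gt;A minimal user-side callback might look like this — a sketch, not the project's exact API: the callback signature is inferred from the engine snippet above, and &lt;code&gt;wav_sink_t&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```c
#include <stdio.h>

/* Hypothetical sink: the callback signature (samples, count, userdata;
 * non-zero return aborts) is inferred from the engine snippet above. */
typedef struct {
    FILE *f;
    long total_samples;
} wav_sink_t;

static int write_chunk_cb(const float *samples, int n, void *userdata) {
    wav_sink_t *sink = (wav_sink_t *)userdata;
    if (fwrite(samples, sizeof(float), (size_t)n, sink->f) != (size_t)n)
        return 1;               /* non-zero tells the engine to abort */
    fflush(sink->f);            /* push the chunk out immediately */
    sink->total_samples += n;
    return 0;                   /* keep generating */
}
```

&lt;p&gt;Because the decoder thread is the only caller, the sink needs no locking of its own.&lt;/p&gt;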
&lt;h2&gt;
  
  
  Batch vvexpf: Transcendentals Are Expensive One at a Time
&lt;/h2&gt;

&lt;p&gt;After the algorithmic wins, we went hunting for smaller gains. The SwiGLU&lt;br&gt;
activation function in every transformer layer computes &lt;code&gt;x * sigmoid(x)&lt;/code&gt;, and&lt;br&gt;
sigmoid needs &lt;code&gt;expf()&lt;/code&gt;. In a 28-layer Talker and a 5-layer Code Predictor&lt;br&gt;
running 15 passes per frame, that's ~163,000 individual &lt;code&gt;expf()&lt;/code&gt; calls per&lt;br&gt;
audio frame.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;expf()&lt;/code&gt; is a transcendental function — high latency, hard to pipeline.&lt;br&gt;
But calling them one by one wastes the CPU's SIMD units. The fix: batch them.&lt;/p&gt;

&lt;p&gt;On macOS, Apple's Accelerate framework provides &lt;code&gt;vvexpf()&lt;/code&gt; — a vectorized&lt;br&gt;
exponential that processes an entire array at once using optimized SIMD paths&lt;br&gt;
internally. We wrote a &lt;code&gt;qwen_swiglu_inplace()&lt;/code&gt; kernel that computes&lt;br&gt;
&lt;code&gt;gate = up * gate / (1 + exp(-gate))&lt;/code&gt; over the full intermediate&lt;br&gt;
dimension, with a single batched &lt;code&gt;vvexpf&lt;/code&gt; handling all the exponentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;qwen_swiglu_inplace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="cp"&gt;#if defined(__APPLE__) &amp;amp;&amp;amp; defined(USE_BLAS)
&lt;/span&gt;    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = -gate&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vneg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = exp(-gate)  (batch)&lt;/span&gt;
    &lt;span class="n"&gt;vvexpf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = 1 + exp(-gate)&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vsadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// gate = x / (1 + exp(-gate))  →  sigmoid(x) * x via up vector&lt;/span&gt;
    &lt;span class="n"&gt;vDSP_vdiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="c1"&gt;// scalar fallback — compiler auto-vectorizes with -ffast-math&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;expf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]));&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: Code Predictor 8% faster&lt;/strong&gt; (76 ms/f → 70 ms/f). Those ~163K scalar&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; calls per frame collapsed into ~206 batched &lt;code&gt;vvexpf&lt;/code&gt; calls. Not a&lt;br&gt;
headline number, but it's free — the output is bit-identical and the code is&lt;br&gt;
actually cleaner than the inline scalar loop it replaced.&lt;/p&gt;

&lt;p&gt;The Abrash lesson applies here too: just as he taught us that unaligned memory&lt;br&gt;
access wastes bus cycles, calling transcendentals one at a time wastes SIMD&lt;br&gt;
lanes. The hardware &lt;em&gt;wants&lt;/em&gt; to process 4-8 values at once — you just have to&lt;br&gt;
feed it that way.&lt;/p&gt;
&lt;h2&gt;
  
  
  SIMD BF16 Accumulation: One More Scalar Loop
&lt;/h2&gt;

&lt;p&gt;The codec embedding lookup accumulates 15 codebook vectors per audio frame —&lt;br&gt;
each a BF16-to-F32 conversion followed by a vector add. The original code did&lt;br&gt;
this scalar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;src_bf16&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We wrote &lt;code&gt;qwen_bf16_accum_f32()&lt;/code&gt; with NEON and AVX2 paths. The NEON version&lt;br&gt;
processes 8 BF16 values per iteration — load, shift-widen to F32, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// NEON: 8-wide BF16→F32 accumulate&lt;/span&gt;
&lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;bf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_bf16&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vreinterpretq_f32_u32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vshll_n_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vget_low_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;float32x4_t&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vreinterpretq_f32_u32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vshll_n_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vget_high_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;vaddq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;f0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;vst1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vaddq_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vld1q_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AVX2 version does the same with 256-bit registers — &lt;code&gt;cvtepu16_epi32&lt;/code&gt; to&lt;br&gt;
zero-extend, &lt;code&gt;slli_epi32&lt;/code&gt; to shift into F32 position, &lt;code&gt;add_ps&lt;/code&gt; to accumulate.&lt;/p&gt;
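&lt;p&gt;As a standalone sketch (the function name and the scalar tail are assumptions; the project's real &lt;code&gt;qwen_bf16_accum_f32()&lt;/code&gt; may differ in details), the AVX2 path looks like this:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Accumulate n BF16 values into an F32 buffer: dst[i] += bf16_to_f32(src[i]).
static void bf16_accum_f32(float *dst, const uint16_t *src, int n) {
    int i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {
        __m128i bf  = _mm_loadu_si128((const __m128i *)(src + i)); // 8 x u16
        __m256i u32 = _mm256_cvtepu16_epi32(bf);                   // zero-extend
        __m256  f   = _mm256_castsi256_ps(_mm256_slli_epi32(u32, 16)); // shift into F32 position
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_loadu_ps(dst + i), f));
    }
#endif
    for (; i < n; i++) {                        // scalar tail / fallback
        uint32_t bits = (uint32_t)src[i] << 16;
        float val;
        memcpy(&val, &bits, sizeof val);
        dst[i] += val;
    }
}
```

&lt;p&gt;On machines without AVX2 the tail loop is the whole function — which is exactly the original scalar code.&lt;/p&gt;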

&lt;p&gt;The per-frame impact is small (~0.5-1ms), but it adds up over hundreds of&lt;br&gt;
frames and eliminates yet another scalar loop hiding in a SIMD codebase —&lt;br&gt;
exactly the kind of thing Abrash warned about: the fast path is only fast if&lt;br&gt;
&lt;em&gt;all&lt;/em&gt; the code on it is optimized.&lt;/p&gt;
&lt;h2&gt;
  
  
  Delta Prefill: Reusing the KV Cache Across Requests
&lt;/h2&gt;

&lt;p&gt;The Talker's prompt has a fixed structure: ChatML header, speaker token,&lt;br&gt;
language token, codec control tokens, then the actual text. For a server&lt;br&gt;
handling multiple requests with the same speaker and language, the prefix&lt;br&gt;
is identical every time — but we were re-prefilling it from scratch on every&lt;br&gt;
call.&lt;/p&gt;

&lt;p&gt;Causal attention gives us a nice property: prefix tokens produce identical&lt;br&gt;
KV cache entries regardless of what comes after. If the first 8 tokens of&lt;br&gt;
the prompt match the previous request, their KV entries are already in the&lt;br&gt;
cache. We just need to prefill the &lt;em&gt;new&lt;/em&gt; tokens.&lt;/p&gt;

&lt;p&gt;The implementation compares the current input embeddings against the previous&lt;br&gt;
call's cached embeddings (stored in &lt;code&gt;prev_input_embeds&lt;/code&gt;). If the first N&lt;br&gt;
embeddings match, we skip to position N and only process the delta:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1: [header][speaker][lang][codec][text_A]  →  full prefill (18 tokens)
Request 2: [header][speaker][lang][codec][text_B]  →  delta prefill (skip 8, process 10)
Request 3: [header][speaker][lang][codec][text_C]  →  delta prefill (skip 8, process 7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the speaker or language changes, the prefix differs and we fall back to&lt;br&gt;
full prefill automatically — no special-casing needed.&lt;/p&gt;
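&lt;p&gt;The prefix comparison itself is simple. Here's a sketch (names and the bit-exact comparison policy are assumptions, not the project's exact code):&lt;/p&gt;

```c
#include <string.h>

/* Compare the new prompt's input embeddings against the ones cached from
 * the previous request and return how many leading token positions match.
 * Tokens 0..match-1 already have valid KV cache entries; prefill resumes
 * at position `match`. */
static int matching_prefix_len(const float *prev, int prev_tokens,
                               const float *cur, int cur_tokens, int dim) {
    int max = prev_tokens < cur_tokens ? prev_tokens : cur_tokens;
    for (int t = 0; t < max; t++) {
        /* bit-exact compare is safe: identical tokens produce identical
         * embedding lookups, so any byte difference means a different token */
        if (memcmp(prev + (size_t)t * dim, cur + (size_t)t * dim,
                   (size_t)dim * sizeof(float)) != 0)
            return t;
    }
    return max;
}
```

&lt;p&gt;With the match length in hand, the prefill loop simply starts at that position instead of zero; a mismatch at position 0 (new speaker or language) degrades to a full prefill for free.&lt;/p&gt;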

&lt;p&gt;&lt;strong&gt;Result: ~50% prefill time savings on repeated speaker&lt;/strong&gt; in server mode. For&lt;br&gt;
a chatbot or voice assistant scenario where you're generating many responses&lt;br&gt;
in the same voice, this eliminates the biggest fixed cost in the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantization: What the 1.7B Model Taught Us
&lt;/h2&gt;

&lt;p&gt;We'd already tried INT4 and INT8 quantization on the 0.6B model and found&lt;br&gt;
them slower or neutral — the matrices are too small (hidden=1024) to be&lt;br&gt;
bandwidth-bound, so dequantization overhead dominates. But the 1.7B model&lt;br&gt;
has &lt;code&gt;hidden=2048&lt;/code&gt; and &lt;code&gt;intermediate=6144&lt;/code&gt; — 4× larger matrices. Time to&lt;br&gt;
revisit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INT8 (&lt;code&gt;--int8&lt;/code&gt;): 20% Talker speedup on 1.7B.&lt;/strong&gt; Per-row absmax quantization&lt;br&gt;
at load time (scale = max(|row|) / 127), NEON int8 matvec for decode. The&lt;br&gt;
Talker went from 79.3 ms/f to 67.4 ms/f. Audio quality is good — no&lt;br&gt;
perceptible degradation in A/B tests.&lt;/p&gt;
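&lt;p&gt;The load-time quantization and the matching dequant-free dot product can be sketched like this (a simplified F32-input version for clarity — the real loader reads BF16 rows and the real matvec uses NEON int8 intrinsics):&lt;/p&gt;

```c
#include <stdint.h>
#include <math.h>

/* Per-row absmax INT8 quantization: scale = max(|row|) / 127.
 * Dequantized weight = q[i] * scale. Returns the per-row scale. */
static float quantize_row_int8(const float *row, int8_t *q, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(row[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 127.0f;
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (int i = 0; i < n; i++) {
        float v = row[i] * inv;                           /* map into [-127, 127] */
        q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round to nearest */
    }
    return scale;
}

/* Decode-time dot product: accumulate int8 weights against f32 activations,
 * applying the per-row scale once at the end. */
static float int8_row_dot(const int8_t *q, float scale, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += (float)q[i] * x[i];
    return acc * scale;
}
```

&lt;p&gt;Folding the scale back in once per row — rather than per weight — is what keeps the unpack overhead low enough for INT8 to win on the larger matrices.&lt;/p&gt;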

&lt;p&gt;&lt;strong&gt;INT4 Q4_0 (&lt;code&gt;--int4&lt;/code&gt;): no speedup, actually 4% slower.&lt;/strong&gt; We used the same&lt;br&gt;
nibble-packed format as llama.cpp (32 weights per block, 16 bytes + 1 fp32&lt;br&gt;
scale). The NEON unpack path needs AND, SHR, subtract-8, widen, convert —&lt;br&gt;
about 8 ops per 32 weights versus 1 op for BF16 (&lt;code&gt;vshll&lt;/code&gt;). Even at 2048-wide,&lt;br&gt;
the compute overhead exceeds the bandwidth savings.&lt;/p&gt;
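&lt;p&gt;For reference, a scalar unpack of one such block looks like this (a sketch — the struct name and the low-nibble-first ordering are assumptions, and the post's fp32 scale is used as stated):&lt;/p&gt;

```c
#include <stdint.h>

/* One Q4_0-style block as described above: 32 weights packed as nibbles
 * into 16 bytes, plus one fp32 scale. Stored values are nibbles minus 8. */
typedef struct {
    float scale;
    uint8_t qs[16];
} q4_block_t;

static void q4_unpack_block(const q4_block_t *b, float *out /* 32 floats */) {
    for (int j = 0; j < 16; j++) {
        int lo = (b->qs[j] & 0x0F) - 8;  /* mask, re-center around zero */
        int hi = (b->qs[j] >> 4)   - 8;  /* shift, re-center */
        out[2 * j]     = (float)lo * b->scale;
        out[2 * j + 1] = (float)hi * b->scale;
    }
}
```

&lt;p&gt;Count the work per pair of weights — mask, shift, two subtracts, two int-to-float converts, two multiplies — against a single shift for BF16, and the INT4 result stops being surprising.&lt;/p&gt;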

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Talker ms/f&lt;/th&gt;
&lt;th&gt;CP ms/f&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1.7B BF16&lt;/td&gt;
&lt;td&gt;79.3&lt;/td&gt;
&lt;td&gt;87.0&lt;/td&gt;
&lt;td&gt;4.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B INT8&lt;/td&gt;
&lt;td&gt;67.4&lt;/td&gt;
&lt;td&gt;78.7&lt;/td&gt;
&lt;td&gt;3.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B INT4&lt;/td&gt;
&lt;td&gt;82.6&lt;/td&gt;
&lt;td&gt;81.7&lt;/td&gt;
&lt;td&gt;4.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.6B BF16&lt;/td&gt;
&lt;td&gt;22.5&lt;/td&gt;
&lt;td&gt;82.0&lt;/td&gt;
&lt;td&gt;2.15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The takeaway: quantization is not a universal win. It depends on whether&lt;br&gt;
you're compute-bound or bandwidth-bound at your specific matrix dimensions.&lt;br&gt;
INT8 hits the sweet spot for 1.7B — enough bandwidth reduction to matter,&lt;br&gt;
low enough unpack overhead (3 NEON ops vs BF16's 1) to not eat the gains.&lt;br&gt;
INT4's nibble unpacking (8 ops) crosses the break-even point. And on 0.6B,&lt;br&gt;
nothing helps because you're compute-bound anyway.&lt;/p&gt;

&lt;p&gt;This echoes what Abrash wrote about optimization traps: "the fastest code&lt;br&gt;
is the code you don't execute." INT4 adds &lt;em&gt;more&lt;/em&gt; code per weight (unpack,&lt;br&gt;
shift, subtract, widen, convert, scale, accumulate) than BF16 (shift,&lt;br&gt;
accumulate). The memory savings are real, but speed is what matters for&lt;br&gt;
realtime TTS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Analyzed and Skipped
&lt;/h2&gt;

&lt;p&gt;Not every optimization idea pans out. Here's what we investigated and rejected:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Struct field reordering&lt;/strong&gt; (est. 3-7%, actual: 0%). The &lt;code&gt;qwen_tts_ctx_t&lt;/code&gt;&lt;br&gt;
struct is 7.6 KB spanning 119 cache lines. We built a layout analyzer and&lt;br&gt;
found that the hot decode fields (KV cache pointers, decode buffers) already&lt;br&gt;
sit on adjacent cache lines 112-118. More importantly, the struct is accessed&lt;br&gt;
via pointer indirection — the CPU loads the pointer once and the struct stays&lt;br&gt;
in L1. The bottleneck is the data these pointers &lt;em&gt;reference&lt;/em&gt; (multi-MB weight&lt;br&gt;
matrices), not the 8-byte pointer loads themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L1 cache blocking for matvec&lt;/strong&gt; (est. 3-5%, actual: not worth the complexity).&lt;br&gt;
Our bf16 matvec kernel already processes 2 rows at a time with 8 SIMD&lt;br&gt;
accumulators (NEON/AVX), doing 32 elements per inner loop iteration. The input vector&lt;br&gt;
(4 KB for hidden=1024) fits entirely in L1. The weight matrix access is&lt;br&gt;
sequential, which the hardware prefetcher handles well. The bottleneck is&lt;br&gt;
main memory bandwidth (~10 GB/s effective out of 68 GB/s peak), not cache&lt;br&gt;
misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefetch hints in CP loop&lt;/strong&gt; (est. 0.5-1%, actual: not possible). Each Code&lt;br&gt;
Predictor layer has ~26 MB of weights. The M1's shared L2 is 12 MB. You&lt;br&gt;
can't prefetch what doesn't fit. The hardware prefetcher handles sequential&lt;br&gt;
access within each matvec just fine — it's the layer &lt;em&gt;transitions&lt;/em&gt; that cause&lt;br&gt;
cold misses, and those are unavoidable without smaller weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INT4/INT8 quantization on 0.6B&lt;/strong&gt; (tested, slower or neutral). See the&lt;br&gt;
Quantization section above — the 0.6B model's hidden=1024 matrices are&lt;br&gt;
compute-bound, not bandwidth-bound. Quantization only helped on 1.7B (INT8:&lt;br&gt;
20% Talker speedup), while INT4 was slower even there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Softmax SIMD vectorization&lt;/strong&gt; (est. 2-4×, actual: not worth it). After&lt;br&gt;
quickselect reduced total sampling from 93ms to 21ms, softmax is only ~1.5ms&lt;br&gt;
of the remaining 21ms. With &lt;code&gt;-ffast-math&lt;/code&gt;, the compiler already vectorizes&lt;br&gt;
&lt;code&gt;expf&lt;/code&gt; via platform libraries (Accelerate on macOS, libm on Linux). No&lt;br&gt;
headroom for custom NEON/AVX exp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech decoder depthwise conv / LayerNorm SIMD&lt;/strong&gt; (est. 1.5-3×, actual: not&lt;br&gt;
worth it). The speech decoder runs in a background thread overlapped with&lt;br&gt;
generation. It finishes &lt;em&gt;before&lt;/em&gt; Talker+CP complete — it's not the bottleneck.&lt;br&gt;
ConvNeXt depthwise conv does 1.4M FLOPs vs 838M FLOPs for the BLAS-accelerated&lt;br&gt;
pointwise convolutions. Optimizing 0.2% of the decoder's compute is pointless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separating INT8 fields from CP layer struct&lt;/strong&gt; (est. 2-3% cache, actual: not&lt;br&gt;
worth it). Only 5 layers × 264 bytes = 1.3KB total. The bottleneck is the&lt;br&gt;
weight data (26MB per layer), not the 8-byte pointer loads in the struct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  0.6B Model (Primary Target)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;After all optimizations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Talker&lt;/td&gt;
&lt;td&gt;46.9 ms/f&lt;/td&gt;
&lt;td&gt;~22 ms/f&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor&lt;/td&gt;
&lt;td&gt;104.7 ms/f&lt;/td&gt;
&lt;td&gt;~60 ms/f (batch vvexpf: 70→60)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech Decoder&lt;/td&gt;
&lt;td&gt;~2,600ms (blocking)&lt;/td&gt;
&lt;td&gt;overlapped (background thread)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill&lt;/td&gt;
&lt;td&gt;~1,800ms&lt;/td&gt;
&lt;td&gt;~1,000–1,600ms (delta: ~500ms repeat)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codec head+sampling&lt;/td&gt;
&lt;td&gt;93ms&lt;/td&gt;
&lt;td&gt;21ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-token malloc calls&lt;/td&gt;
&lt;td&gt;~120+&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI, short ~5s audio)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.4–1.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI, long ~17s audio)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (CLI &lt;code&gt;--stream&lt;/code&gt;)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~1.4–1.7&lt;/strong&gt; (same as normal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (server warm, short)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.39&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF (server warm, long)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.26&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1.7B Model (with INT8)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;BF16&lt;/th&gt;
&lt;th&gt;INT8 (&lt;code&gt;--int8&lt;/code&gt;)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Talker&lt;/td&gt;
&lt;td&gt;79.3 ms/f&lt;/td&gt;
&lt;td&gt;67.4 ms/f (&lt;strong&gt;20% faster&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Predictor&lt;/td&gt;
&lt;td&gt;87.0 ms/f&lt;/td&gt;
&lt;td&gt;78.7 ms/f&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.59&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All on an Apple M1 8-core, 16 GB RAM, 4 threads. RTF improves with longer&lt;br&gt;
audio because prefill is a fixed cost that amortizes over more frames. The&lt;br&gt;
speech decoder runs in a background thread, overlapping most of its work with&lt;br&gt;
generation — including streaming mode, where the decoder thread calls the audio&lt;br&gt;
callback directly. Server mode with embedding cache, warm buffers, delta prefill,&lt;br&gt;
and decoder thread overlap delivers the best RTF at &lt;strong&gt;1.26&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alignment matters more than you think.&lt;/strong&gt; A 24% speedup from&lt;br&gt;
&lt;code&gt;posix_memalign&lt;/code&gt; is absurd in 2026, but BLAS libraries really do check&lt;br&gt;
alignment and choose different code paths. Abrash was right in 1997 and&lt;br&gt;
he's right now.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile before you optimize.&lt;/strong&gt; We nearly implemented L1 cache blocking&lt;br&gt;
for the matvec kernel — a complex change — before realizing the kernel was&lt;br&gt;
already bandwidth-bound and the complexity would gain nothing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Look for scalar code in SIMD codebases.&lt;/strong&gt; When different components are&lt;br&gt;
written at different times, it's easy for one file to miss an optimization&lt;br&gt;
all the others have. We found six scalar RMSNorm loops hiding in the speech decoder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-malloc decode loops matter for servers.&lt;/strong&gt; The single-run difference&lt;br&gt;
is negligible, but for a long-running server handling request after request,&lt;br&gt;
eliminating allocation churn in the hot loop adds up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache computed results, not just data.&lt;/strong&gt; The LRU text embedding cache&lt;br&gt;
avoids recomputing token projections (12M FLOPs each) across requests. At&lt;br&gt;
8MB for 2048 tokens, it's practically free. The lesson: when you spot a&lt;br&gt;
pure function called repeatedly with the same inputs, memoize it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline independent stages.&lt;/strong&gt; The speech decoder doesn't share any state&lt;br&gt;
with the Talker or Code Predictor. Once we recognized that, overlapping them&lt;br&gt;
with a simple producer-consumer thread was ~150 lines for a 14-19% speedup.&lt;br&gt;
Look for stages in your pipeline that only consume the output of previous&lt;br&gt;
stages — those are free parallelism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check your algorithms, not just your SIMD.&lt;/strong&gt; A 4× sampling speedup from&lt;br&gt;
replacing selection sort with quickselect — no intrinsics, no threading,&lt;br&gt;
just a better algorithm. Profile first, but when you find O(kn) in a hot&lt;br&gt;
loop, fix the algorithm before reaching for SIMD.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unify code paths.&lt;/strong&gt; Streaming was 30% slower because it had its own&lt;br&gt;
synchronous decode path. When we unified it with the decoder thread (the&lt;br&gt;
same path normal mode uses), the gap disappeared. Two code paths that do&lt;br&gt;
the same thing will always diverge in performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch your transcendentals.&lt;/strong&gt; Calling &lt;code&gt;expf()&lt;/code&gt; 163,000 times per frame&lt;br&gt;
is slower than calling &lt;code&gt;vvexpf()&lt;/code&gt; 206 times — same math, same result,&lt;br&gt;
8% faster. SIMD units want batches. This is the Abrash data alignment&lt;br&gt;
lesson in a different guise: don't waste hardware lanes by feeding values&lt;br&gt;
one at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exploit causal structure for caching.&lt;/strong&gt; Causal attention means prefix&lt;br&gt;
tokens produce identical KV entries regardless of suffix. Delta prefill&lt;br&gt;
cuts server prefill time in half for repeated speakers — zero accuracy&lt;br&gt;
cost, because the math guarantees identical outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantization is not free compression.&lt;/strong&gt; INT8 works on 1.7B (20% win)&lt;br&gt;
because the matrices are large enough to be bandwidth-bound. INT4 loses&lt;br&gt;
on every model size we tested — the nibble unpack overhead exceeds the&lt;br&gt;
bandwidth savings. Always measure before assuming "smaller weights = faster."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the old books.&lt;/strong&gt; Abrash's &lt;em&gt;Graphics Programming Black Book&lt;/em&gt; and&lt;br&gt;
Carmack's &lt;code&gt;.plan&lt;/code&gt; files are from another era, but the principles — cache&lt;br&gt;
friendliness, data alignment, knowing your memory hierarchy — are timeless.&lt;br&gt;
The specific rules change (64-byte cache lines instead of dword alignment),&lt;br&gt;
but the instinct to think about how data flows through the CPU is exactly&lt;br&gt;
the same. Every optimization in this post — alignment, SIMD batching,&lt;br&gt;
pipeline parallelism, algorithmic complexity — traces back to ideas those&lt;br&gt;
two articulated thirty years ago. The hardware evolved; the thinking didn't.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This is part of the &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;qwen3-tts&lt;/a&gt;&lt;br&gt;
project — a pure C inference engine for Qwen3-TTS text-to-speech models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tts</category>
      <category>audio</category>
      <category>c</category>
    </item>
    <item>
      <title>Building a Text-to-Speech Engine in Pure C</title>
      <dc:creator>Gabriele Mastrapasqua</dc:creator>
      <pubDate>Mon, 09 Mar 2026 14:49:04 +0000</pubDate>
      <link>https://forem.com/gabrielemastrapasqua/building-a-text-to-speech-engine-in-pure-c-59h4</link>
      <guid>https://forem.com/gabrielemastrapasqua/building-a-text-to-speech-engine-in-pure-c-59h4</guid>
      <description>&lt;p&gt;I built a &lt;strong&gt;pure C inference engine&lt;/strong&gt; for &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice" rel="noopener noreferrer"&gt;Qwen3-TTS&lt;/a&gt;, Alibaba's open-source text-to-speech model. The goal: run high-quality multilingual TTS on CPU, with zero Python dependencies, inspired by &lt;a href="https://github.com/antirez" rel="noopener noreferrer"&gt;antirez's&lt;/a&gt; approach to minimal C inference engines (specifically his &lt;a href="https://github.com/antirez/qwen-asr" rel="noopener noreferrer"&gt;qwen-asr&lt;/a&gt; project). The code is on GitHub: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What started as a "let's just get the basic pipeline working" exercise turned into a full-featured TTS engine with streaming output, an HTTP server, voice cloning, and custom voice design — all in a single C binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why pure C?
&lt;/h2&gt;

&lt;p&gt;The official Qwen3-TTS runs on PyTorch with the usual stack of transformers, tokenizers, and CUDA. That's fine for a GPU server, but I wanted something that runs anywhere — a single binary, no runtime dependencies, just mmap the model weights and go.&lt;/p&gt;

&lt;p&gt;The result: &lt;code&gt;make blas&lt;/code&gt;, point it at a model directory, and you get a ~200KB binary that does everything.&lt;/p&gt;
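
&lt;p&gt;The "mmap the weights and go" idea can be sketched in a few lines. This is an illustrative snippet, not the project's actual loader (which also parses the safetensors header): map the file read-only and let the OS page weights in on demand.&lt;/p&gt;

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch: map a weights file read-only. Error handling
   is trimmed; the real loader also parses the safetensors header. */
static const void *map_weights(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after close */
    if (p == MAP_FAILED) { *len = 0; return NULL; }
    *len = (size_t)st.st_size;
    return p;
}
```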

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Qwen3-TTS is a three-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Talker&lt;/strong&gt; — a 28-layer causal Qwen3 LLM (0.6B or 1.7B params) with GQA, RoPE, and SwiGLU that generates discrete audio frame tokens from text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Predictor&lt;/strong&gt; — a 5-layer transformer that runs 15 sequential passes per frame, filling in the remaining codebook entries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Decoder&lt;/strong&gt; — a causal ConvNet with Snake activations, ResBlocks, and 480x upsampling that converts discrete codes to 24kHz audio&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage was reimplemented from scratch in C. The model supports 9 preset voices, 10 languages, and both 0.6B and 1.7B model sizes (auto-detected from weights).&lt;/p&gt;
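
&lt;p&gt;The per-frame control flow of the first two stages looks roughly like this — a stub sketch to show the 1 + 15 codebook split described above; the function names are illustrative, not the engine's real API:&lt;/p&gt;

```c
#define NUM_CODEBOOKS 16  /* 1 from the Talker + 15 from the Code Predictor */

/* Stubs standing in for the real model stages, to show control flow only. */
static int talker_step(int frame)      { return frame; }
static int cp_step(int frame, int cb)  { return frame * 100 + cb; }

/* One frame: the Talker emits codebook 0, then the Code Predictor
   runs 15 sequential passes to fill in codebooks 1..15. */
static void generate_frame(int frame, int codes[NUM_CODEBOOKS]) {
    codes[0] = talker_step(frame);
    for (int cb = 1; cb < NUM_CODEBOOKS; cb++)
        codes[cb] = cp_step(frame, cb);
}
```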

&lt;h2&gt;
  
  
  BF16 weights, float32 compute
&lt;/h2&gt;

&lt;p&gt;The model weights are stored in bfloat16 and memory-mapped directly from standard HuggingFace safetensors files. On Apple Silicon, bf16-to-f32 conversion is essentially free (it's just a left shift), and this approach gives &lt;strong&gt;bit-identical results&lt;/strong&gt; to the Python reference with greedy decoding.&lt;/p&gt;
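
&lt;p&gt;The widening really is just a shift into the top half of a 32-bit word — a minimal sketch:&lt;/p&gt;

```c
#include <stdint.h>
#include <string.h>

/* bfloat16 is the top 16 bits of an IEEE-754 float32, so widening is a
   16-bit left shift followed by a bit-cast (memcpy avoids aliasing UB). */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```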

&lt;p&gt;I did experiment with INT4 quantization, but for the 0.6B model the matrices are too small to be bandwidth-bound — the Q4 unpack overhead actually made it &lt;strong&gt;20% slower&lt;/strong&gt;. BF16 turned out to be the sweet spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming output
&lt;/h2&gt;

&lt;p&gt;The speech decoder is fully causal (no lookahead), which made streaming architecturally possible. The engine generates N frames, decodes a chunk through the speech decoder, and writes audio immediately — no need to wait for the full sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pipe raw PCM to an audio player for real-time playback&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello world"&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    play &lt;span class="nt"&gt;-t&lt;/span&gt; raw &lt;span class="nt"&gt;-r&lt;/span&gt; 24000 &lt;span class="nt"&gt;-e&lt;/span&gt; signed &lt;span class="nt"&gt;-b&lt;/span&gt; 16 &lt;span class="nt"&gt;-c&lt;/span&gt; 1 -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First audio arrives within ~1 second. The speech decoder uses incremental decoding with KV caching, so each streaming chunk is O(chunk_size) rather than re-processing the full sequence.&lt;/p&gt;
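
&lt;p&gt;The chunked loop behind this is simple in outline. A stub sketch — the chunk size and function names are illustrative, not the project's real API:&lt;/p&gt;

```c
#define FRAMES_PER_CHUNK 4  /* illustrative chunk size */

/* Stubs standing in for real generation/decoding; decode_and_write_chunk
   returns the number of frames it decoded so the flow can be checked. */
static void generate_frame_codes(int frame)          { (void)frame; }
static int  decode_and_write_chunk(int first, int n) { (void)first; return n; }

/* Generate frames, and every FRAMES_PER_CHUNK frames (or at the end)
   decode just that chunk; the decoder's KV cache makes each call
   O(chunk) instead of O(total sequence). */
static int stream_tts(int total_frames) {
    int pending = 0, decoded = 0;
    for (int f = 0; f < total_frames; f++) {
        generate_frame_codes(f);
        pending++;
        if (pending == FRAMES_PER_CHUNK || f + 1 == total_frames) {
            decoded += decode_and_write_chunk(f + 1 - pending, pending);
            pending = 0;
        }
    }
    return decoded;
}
```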

&lt;h2&gt;
  
  
  HTTP server
&lt;/h2&gt;

&lt;p&gt;The engine includes an embedded HTTP server — no nginx, no FastAPI, just start it and send requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start server (model loaded once, shared across requests)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--serve&lt;/span&gt; 8080

&lt;span class="c"&gt;# Generate speech&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/tts &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"text":"Hello world","speaker":"ryan","language":"English"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also has an &lt;strong&gt;OpenAI-compatible endpoint&lt;/strong&gt; (&lt;code&gt;/v1/audio/speech&lt;/code&gt;) so you can use it as a drop-in replacement for OpenAI's TTS API in existing apps, plus a streaming endpoint that sends chunked PCM as it generates.&lt;/p&gt;
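
&lt;p&gt;A hypothetical request against the OpenAI-compatible endpoint — the field names below follow OpenAI's &lt;code&gt;/v1/audio/speech&lt;/code&gt; schema (&lt;code&gt;model&lt;/code&gt;/&lt;code&gt;input&lt;/code&gt;/&lt;code&gt;voice&lt;/code&gt;); the exact fields this server accepts may differ, so check the project README:&lt;/p&gt;

```shell
# Illustrative only: OpenAI-style field names, against a local server
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-tts","input":"Hello world","voice":"ryan"}' \
  -o speech.wav
```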

&lt;h2&gt;
  
  
  Voice cloning
&lt;/h2&gt;

&lt;p&gt;Using the Base model variant, you can clone any voice from a few seconds of reference audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, this is my cloned voice."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; reference.wav &lt;span class="nt"&gt;-o&lt;/span&gt; cloned.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, this runs a full ECAPA-TDNN speaker encoder to extract a 1024-dim speaker embedding from the reference audio's mel spectrogram. You can save and reload embeddings to avoid re-extracting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract and save&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ref-audio&lt;/span&gt; ref.wav &lt;span class="nt"&gt;--save-voice&lt;/span&gt; my_voice.bin &lt;span class="nt"&gt;-o&lt;/span&gt; out.wav

&lt;span class="c"&gt;# Reuse later (instant)&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b-base &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Another sentence"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--load-voice&lt;/span&gt; my_voice.bin &lt;span class="nt"&gt;-o&lt;/span&gt; out2.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speech tokenizer encoder (Mimi-based, with 4-stage strided convolutions, an 8-layer transformer, and split RVQ quantization) was also implemented for the full ICL (in-context learning) cloning mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  VoiceDesign
&lt;/h2&gt;

&lt;p&gt;The 1.7B VoiceDesign model can create entirely new voices from natural language descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-voice-design &lt;span class="nt"&gt;-l&lt;/span&gt; English &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"A deep male voice with a British accent, speaking slowly and calmly"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, this is a test."&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; british.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No reference audio needed — just describe what you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Style and emotion control
&lt;/h2&gt;

&lt;p&gt;The 1.7B CustomVoice model supports an &lt;code&gt;--instruct&lt;/code&gt; flag to control speaking style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"I cannot believe you did that."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"Speak in a very angry and aggressive tone"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; angry.wav

./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-1.7b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"I cannot believe you did that."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--instruct&lt;/span&gt; &lt;span class="s2"&gt;"Speak very slowly and softly, in a sad whisper"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; whisper.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same text, completely different delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On Apple Silicon (M-series, 4 threads):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Per-frame&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.6B&lt;/td&gt;
&lt;td&gt;~0.7-0.86x realtime&lt;/td&gt;
&lt;td&gt;Talker 24ms + CP 70ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;~0.48x realtime&lt;/td&gt;
&lt;td&gt;Talker 92ms + CP 75ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bottleneck is the Code Predictor — 15 sequential autoregressive passes per frame, no way around it.&lt;/p&gt;
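
&lt;p&gt;Those per-frame numbers line up with the realtime figures, assuming the 12 Hz frame rate implied by the model name (each frame covers ~83 ms of audio); the small remaining gap is the speech decoder and overhead. A back-of-envelope check:&lt;/p&gt;

```c
/* Back-of-envelope realtime factor, assuming a 12 Hz frame rate
   (each frame = 1000/12 ms of audio). Inputs are the table's
   per-frame Talker + Code Predictor costs. */
static double realtime_factor(double talker_ms, double cp_ms) {
    double audio_ms_per_frame = 1000.0 / 12.0;   /* ~83.3 ms */
    return audio_ms_per_frame / (talker_ms + cp_ms);
}
```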

&lt;p&gt;Key optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NEON-optimized bf16 matvec&lt;/strong&gt; with multi-row fusion (2-row fused dispatch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused gate+up projections&lt;/strong&gt; for SwiGLU in both Talker and Code Predictor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified QKV dispatch&lt;/strong&gt; to reduce threading overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NEON kernels&lt;/strong&gt; for RMSNorm, attention (dot+V accum), RoPE, Snake activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused argmax+matvec&lt;/strong&gt; in the Code Predictor hot loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;im2col + BLAS sgemm&lt;/strong&gt; for the ConvNet decoder, with tiling for large sequences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental speech decoder&lt;/strong&gt; with KV cache for streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-thread dispatch_apply&lt;/strong&gt; (sweet spot — 8 threads hit the memory bandwidth ceiling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting from ~0.4x realtime pre-optimization, these brought the 0.6B model to ~0.86x realtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metal GPU detour
&lt;/h2&gt;

&lt;p&gt;I implemented a full Metal GPU backend — compute shaders, GPU-side transformer, the works. The result? &lt;strong&gt;~1.3x slower&lt;/strong&gt; than the optimized NEON CPU path. On Apple Silicon, CPU and GPU share the same memory bus, so there's no bandwidth advantage. The NEON path was already near-optimal for these model sizes. Deleted the whole thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The debugging journey
&lt;/h2&gt;

&lt;p&gt;Getting bit-identical output required tracking down some non-obvious issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model config says &lt;code&gt;"interleaved": true&lt;/code&gt; for RoPE, but the Python code actually uses NeoX split-half rotation (the opposite!)&lt;/li&gt;
&lt;li&gt;The Code Predictor's first codebook uses the &lt;em&gt;Talker's&lt;/em&gt; codec embedding, not its own&lt;/li&gt;
&lt;li&gt;Snake activations store alpha and beta in &lt;strong&gt;log space&lt;/strong&gt; — &lt;code&gt;sin²(exp(alpha) * x)&lt;/code&gt;, not &lt;code&gt;sin²(alpha * x)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;All convolutions in the speech decoder are causal (left-only padding), including transposed convolutions&lt;/li&gt;
&lt;li&gt;ResBlock dilations are [1, 3, 9], not [1, 1, 1] as you might assume&lt;/li&gt;
&lt;li&gt;The 1.7B model needs a projection layer (2048→1024) between the Talker and Code Predictor that isn't in the 0.6B&lt;/li&gt;
&lt;/ul&gt;
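
&lt;p&gt;The RoPE gotcha in the first item is worth spelling out: for a head of dimension &lt;code&gt;d&lt;/code&gt;, split-half (NeoX) rotation pairs &lt;code&gt;x[i]&lt;/code&gt; with &lt;code&gt;x[i + d/2]&lt;/code&gt;, while interleaved would pair &lt;code&gt;x[2i]&lt;/code&gt; with &lt;code&gt;x[2i+1]&lt;/code&gt;. A minimal split-half rotation (illustrative, not the engine's NEON kernel; &lt;code&gt;cos_v&lt;/code&gt;/&lt;code&gt;sin_v&lt;/code&gt; are the precomputed angle terms for this position):&lt;/p&gt;

```c
/* NeoX split-half RoPE: rotate (x[i], x[i + d/2]) as a 2-D vector.
   cos_v/sin_v hold d/2 precomputed angle terms for one position. */
static void rope_neox_split_half(float *x, const float *cos_v,
                                 const float *sin_v, int d) {
    int half = d / 2;
    for (int i = 0; i < half; i++) {
        float a = x[i], b = x[i + half];
        x[i]        = a * cos_v[i] - b * sin_v[i];
        x[i + half] = a * sin_v[i] + b * cos_v[i];
    }
}
```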

&lt;p&gt;Each of these was a "why doesn't my output match?" rabbit hole. The final validation: correlation 0.999996 with the Python reference across the full pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build&lt;/span&gt;
make blas

&lt;span class="c"&gt;# Basic usage&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello, how are you?"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; hello.wav

&lt;span class="c"&gt;# Stream to speaker&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Hello world"&lt;/span&gt; &lt;span class="nt"&gt;--stdout&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    play &lt;span class="nt"&gt;-t&lt;/span&gt; raw &lt;span class="nt"&gt;-r&lt;/span&gt; 24000 &lt;span class="nt"&gt;-e&lt;/span&gt; signed &lt;span class="nt"&gt;-b&lt;/span&gt; 16 &lt;span class="nt"&gt;-c&lt;/span&gt; 1 -

&lt;span class="c"&gt;# Start HTTP server&lt;/span&gt;
./qwen_tts &lt;span class="nt"&gt;-d&lt;/span&gt; qwen3-tts-0.6b &lt;span class="nt"&gt;--serve&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project supports macOS (ARM/x86), Linux (ARM/x86), and Windows via WSL2. NEON and AVX SIMD paths are included. The 0.6B model needs ~3 GB of memory, the 1.7B needs ~8 GB.&lt;/p&gt;

&lt;p&gt;The code is on GitHub: &lt;a href="https://github.com/gabriele-mastrapasqua/qwen3-tts" rel="noopener noreferrer"&gt;gabriele-mastrapasqua/qwen3-tts&lt;/a&gt;&lt;/p&gt;

</description>
      <category>c</category>
      <category>ai</category>
      <category>tts</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
