<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: SupaCtx</title>
    <description>The latest articles on Forem by SupaCtx (@supactx_8c8c0c94591ec0e1f).</description>
    <link>https://forem.com/supactx_8c8c0c94591ec0e1f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3667163%2F44a0438a-0659-4066-a51d-79c4bfcd88bc.png</url>
      <title>Forem: SupaCtx</title>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/supactx_8c8c0c94591ec0e1f"/>
    <language>en</language>
    <item>
      <title>How We Built a Free Voice Cloning Tool That Supports 646 Languages</title>
      <dc:creator>SupaCtx</dc:creator>
      <pubDate>Sun, 12 Apr 2026 10:37:48 +0000</pubDate>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f/how-we-built-a-free-voice-cloning-tool-that-supports-646-languages-3h6h</link>
      <guid>https://forem.com/supactx_8c8c0c94591ec0e1f/how-we-built-a-free-voice-cloning-tool-that-supports-646-languages-3h6h</guid>
      <description>&lt;p&gt;If you've ever tried to add multilingual text-to-speech to your app, you know the pain: ElevenLabs caps at 32 languages, PlayHT at 132, and the pricing scales fast. We built &lt;a href="https://omnivoice.pro" rel="noopener noreferrer"&gt;OmniVoice&lt;/a&gt; — a free, open-source voice generator that covers 646 languages with zero-shot voice cloning. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most TTS APIs force you to choose between quality and coverage. Want natural-sounding English? Easy. Want the same quality in Yoruba, Kazakh, or Cantonese? Good luck. And if you need voice cloning across languages — where a speaker's voice stays consistent regardless of the language — you're basically out of options.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;OmniVoice uses a &lt;strong&gt;non-autoregressive diffusion language model&lt;/strong&gt; — a single-stage architecture that skips the typical two-step "text → tokens → audio" pipeline. Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-0.6B as text encoder&lt;/strong&gt; — LLM initialization dramatically improves intelligibility across languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-codebook random masking&lt;/strong&gt; — the diffusion process operates on all codebook levels simultaneously, avoiding the quality degradation of cascaded approaches (see the sketch below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;581k hours of open-source training data&lt;/strong&gt; — no proprietary datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: &lt;strong&gt;2.85% WER&lt;/strong&gt; (vs. ElevenLabs' 10.95%) and &lt;strong&gt;0.830 speaker similarity&lt;/strong&gt; (vs. 0.655) on standardized benchmarks.&lt;/p&gt;
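
&lt;p&gt;To make the masking idea concrete, here's a minimal sketch of one way to read full-codebook random masking: draw a single random mask over the whole grid of codec tokens and apply it to every codebook level in one step, rather than masking and predicting level by level. The tensor shape, the &lt;code&gt;MASK_ID&lt;/code&gt; value, and the function name are illustrative assumptions, not the actual OmniVoice training code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

MASK_ID = 8192  # hypothetical id reserved for the mask token


def full_codebook_random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Mask random positions across every codebook level in one step.

    tokens: (num_codebooks, num_frames) integer tensor of codec token ids.
    A cascaded approach would mask and predict one level at a time; here the
    random mask covers the full grid simultaneously.
    """
    mask = torch.rand(tokens.shape).lt(mask_ratio)  # bool mask over all levels at once
    masked = tokens.clone()
    masked[mask] = MASK_ID
    return masked, mask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;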

&lt;h2&gt;
  
  
  Voice Cloning in 3 Lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;omnivoice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmniVoice&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OmniVoice&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from OmniVoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reference_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 3-30 seconds of audio
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No fine-tuning, no training, no API keys. The model clones the voice from 3-30 seconds of reference audio and works cross-lingually — record in English, generate in Japanese.&lt;/p&gt;
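
&lt;p&gt;Because cloning is cross-lingual, the same reference clip can drive speech in another language. A quick sketch, under the assumption that the &lt;code&gt;tts()&lt;/code&gt; call above accepts non-English text directly (the Japanese string is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same English reference clip, Japanese output (illustrative text).
engine.tts(
    text="こんにちは、OmniVoice へようこそ",
    reference_audio="speaker.wav",   # the same 3-30 second reference
    output="output_ja.wav"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;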

&lt;h2&gt;
  
  
  Voice Design (No Audio Needed)
&lt;/h2&gt;

&lt;p&gt;This is the feature that surprised us most during development. You can create entirely new voices from text descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to the future of speech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;voice_design&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A young female speaker with a British accent, medium pitch, calm and professional tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;designed_voice.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine gender, age, pitch, accents (10 English variants, 12 Chinese dialects), and speaking styles freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On a single GPU, OmniVoice runs at &lt;strong&gt;RTF 0.025&lt;/strong&gt; (~40x real-time). A 10-second clip generates in ~250ms. For production deployments, the OpenAI-compatible REST API wrapper (&lt;a href="https://github.com/pasadei/OmniVoice-local" rel="noopener noreferrer"&gt;OmniVoice-local&lt;/a&gt;) makes integration straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/v1/audio/speech &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "input": "Hello world",
    "voice": "reference_speaker",
    "model": "omnivoice"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
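
&lt;p&gt;If you'd rather call the wrapper from Python, a minimal client against the same endpoint could look like the sketch below. It assumes the response body is raw audio bytes, as in the OpenAI API this wrapper mirrors; adjust if OmniVoice-local returns something else.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal client for the OpenAI-compatible /v1/audio/speech endpoint above.
# Assumption: the server returns raw audio bytes in the response body.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello world",
        "voice": "reference_speaker",
        "model": "omnivoice",
    },
    timeout=60,
)
resp.raise_for_status()

with open("speech.wav", "wb") as f:
    f.write(resp.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;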



&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser demo&lt;/strong&gt; (no signup): &lt;a href="https://omnivoice.pro" rel="noopener noreferrer"&gt;omnivoice.pro&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Space&lt;/strong&gt;: &lt;a href="https://huggingface.co/spaces/k2-fsa/OmniVoice" rel="noopener noreferrer"&gt;k2-fsa/OmniVoice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/k2-fsa/OmniVoice" rel="noopener noreferrer"&gt;k2-fsa/OmniVoice&lt;/a&gt; (Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2604.00688" rel="noopener noreferrer"&gt;arXiv:2604.00688&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Caveat
&lt;/h2&gt;

&lt;p&gt;The Higgs-audio tokenizer (from Boson AI) requires an extended license if you exceed 100k monthly active users. Below that threshold, it's fully free under Apache 2.0.&lt;/p&gt;




&lt;p&gt;We'd love feedback from anyone working on multilingual apps, accessibility tools, or content localization. What languages or features would matter most for your use case?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Why AI Video Feels Unreliable — and What Reference-to-Video Fixes</title>
      <dc:creator>SupaCtx</dc:creator>
      <pubDate>Thu, 18 Dec 2025 08:02:24 +0000</pubDate>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f/why-ai-video-feels-unreliable-and-what-reference-to-video-fixes-k1p</link>
      <guid>https://forem.com/supactx_8c8c0c94591ec0e1f/why-ai-video-feels-unreliable-and-what-reference-to-video-fixes-k1p</guid>
      <description>&lt;p&gt;AI video generation looks great in demos.&lt;br&gt;
Clips are sharp, motion is smooth, and results can feel cinematic.&lt;/p&gt;

&lt;p&gt;But once you try to reuse the same character or build a real workflow, things fall apart.&lt;/p&gt;

&lt;p&gt;The problem isn’t realism.&lt;br&gt;
It’s control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why text and images aren’t enough
&lt;/h2&gt;

&lt;p&gt;Most AI video tools rely on text prompts or single images.&lt;/p&gt;

&lt;p&gt;Text explains ideas.&lt;br&gt;
Images lock appearance.&lt;/p&gt;

&lt;p&gt;But neither describes how something &lt;strong&gt;moves&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Motion, timing, posture, and physical behavior are what make a character feel consistent.&lt;br&gt;
That information doesn’t live in text or images — it lives in video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reference video as a control layer
&lt;/h2&gt;

&lt;p&gt;A short reference video carries exactly what’s missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how a character moves&lt;/li&gt;
&lt;li&gt;how actions flow over time&lt;/li&gt;
&lt;li&gt;how behavior stays consistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of asking the model to guess, reference-to-video lets it reuse motion and identity.&lt;/p&gt;

&lt;p&gt;Generation becomes directed, not random.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this changes AI video workflows
&lt;/h2&gt;

&lt;p&gt;With reference-to-video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;characters stay stable&lt;/li&gt;
&lt;li&gt;motion becomes reusable&lt;/li&gt;
&lt;li&gt;scenes feel intentional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You stop regenerating until something “looks right” and start planning outcomes.&lt;/p&gt;

&lt;p&gt;That’s the difference between demos and real tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  A practical example: Wan 2.6
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://vidthis.ai/features/wan2-6" rel="noopener noreferrer"&gt;Models like wan 2.6&lt;/a&gt; treat reference video as a core input, not a bonus feature.&lt;/p&gt;

&lt;p&gt;With just a few seconds of reference, it can preserve identity and motion while placing characters into new scenes or narratives.&lt;/p&gt;

&lt;p&gt;This makes AI video far more predictable — and far more usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The missing piece
&lt;/h2&gt;

&lt;p&gt;AI video didn’t struggle because models lacked power.&lt;/p&gt;

&lt;p&gt;It struggled because creators lacked leverage.&lt;/p&gt;

&lt;p&gt;Reference-to-video provides that missing control layer.&lt;br&gt;
And once it’s in place, AI video starts to behave like a system you can actually build with.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
