<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: SupaCtx</title>
    <description>The latest articles on Forem by SupaCtx (@supactx_8c8c0c94591ec0e1f).</description>
    <link>https://forem.com/supactx_8c8c0c94591ec0e1f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3667163%2F44a0438a-0659-4066-a51d-79c4bfcd88bc.png</url>
      <title>Forem: SupaCtx</title>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/supactx_8c8c0c94591ec0e1f"/>
    <language>en</language>
    <item>
      <title>How We Built a Free Voice Cloning Tool That Supports 646 Languages</title>
      <dc:creator>SupaCtx</dc:creator>
      <pubDate>Sun, 12 Apr 2026 10:37:48 +0000</pubDate>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f/how-we-built-a-free-voice-cloning-tool-that-supports-646-languages-3h6h</link>
      <guid>https://forem.com/supactx_8c8c0c94591ec0e1f/how-we-built-a-free-voice-cloning-tool-that-supports-646-languages-3h6h</guid>
      <description>&lt;p&gt;If you've ever tried to add multilingual text-to-speech to your app, you know the pain: ElevenLabs caps at 32 languages, PlayHT at 132, and the pricing scales fast. We built &lt;a href="https://omnivoice.pro" rel="noopener noreferrer"&gt;OmniVoice&lt;/a&gt; — a free, open-source voice generator that covers 646 languages with zero-shot voice cloning. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most TTS APIs force you to choose between quality and coverage. Want natural-sounding English? Easy. Want the same quality in Yoruba, Kazakh, or Cantonese? Good luck. And if you need voice cloning across languages — where a speaker's voice stays consistent regardless of the language — you're basically out of options.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;OmniVoice uses a &lt;strong&gt;non-autoregressive diffusion language model&lt;/strong&gt; — a single-stage architecture that skips the typical two-step "text → tokens → audio" pipeline. Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-0.6B as text encoder&lt;/strong&gt; — LLM initialization dramatically improves intelligibility across languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-codebook random masking&lt;/strong&gt; — the diffusion process operates on all codebook levels simultaneously, avoiding the quality degradation of cascaded approaches (see the sketch below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;581k hours of open-source training data&lt;/strong&gt; — no proprietary datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: &lt;strong&gt;2.85% WER&lt;/strong&gt; (vs. ElevenLabs' 10.95%) and &lt;strong&gt;0.830 speaker similarity&lt;/strong&gt; (vs. 0.655) on standardized benchmarks.&lt;/p&gt;
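
&lt;p&gt;To make the masking idea concrete, here's a minimal sketch of one way to read full-codebook random masking: draw a single random mask over the whole grid of codec tokens and apply it to every codebook level in one step, rather than masking and predicting level by level. The tensor shape, the &lt;code&gt;MASK_ID&lt;/code&gt; value, and the function name are illustrative assumptions, not the actual OmniVoice training code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

MASK_ID = 8192  # hypothetical id reserved for the mask token


def full_codebook_random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Mask random positions across every codebook level in one step.

    tokens: (num_codebooks, num_frames) integer tensor of codec token ids.
    A cascaded approach would mask and predict one level at a time; here the
    random mask covers the full grid simultaneously.
    """
    mask = torch.rand(tokens.shape).lt(mask_ratio)  # bool mask over all levels at once
    masked = tokens.clone()
    masked[mask] = MASK_ID
    return masked, mask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;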

&lt;h2&gt;
  
  
  Voice Cloning in 3 Lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;omnivoice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmniVoice&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OmniVoice&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from OmniVoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reference_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speaker.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 3-30 seconds of audio
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No fine-tuning, no training, no API keys. The model clones the voice from 3-30 seconds of reference audio and works cross-lingually — record in English, generate in Japanese.&lt;/p&gt;
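
&lt;p&gt;Because cloning is cross-lingual, the same reference clip can drive speech in another language. A quick sketch, under the assumption that the &lt;code&gt;tts()&lt;/code&gt; call above accepts non-English text directly (the Japanese string is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same English reference clip, Japanese output (illustrative text).
engine.tts(
    text="こんにちは、OmniVoice へようこそ",
    reference_audio="speaker.wav",   # the same 3-30 second reference
    output="output_ja.wav"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;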

&lt;h2&gt;
  
  
  Voice Design (No Audio Needed)
&lt;/h2&gt;

&lt;p&gt;This is the feature that surprised us most during development. You can create entirely new voices from text descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to the future of speech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;voice_design&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A young female speaker with a British accent, medium pitch, calm and professional tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;designed_voice.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine gender, age, pitch, accents (10 English variants, 12 Chinese dialects), and speaking styles freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On a single GPU, OmniVoice runs at &lt;strong&gt;RTF 0.025&lt;/strong&gt; (~40x real-time). A 10-second clip generates in ~250ms. For production deployments, the OpenAI-compatible REST API wrapper (&lt;a href="https://github.com/pasadei/OmniVoice-local" rel="noopener noreferrer"&gt;OmniVoice-local&lt;/a&gt;) makes integration straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/v1/audio/speech &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "input": "Hello world",
    "voice": "reference_speaker",
    "model": "omnivoice"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
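
&lt;p&gt;If you'd rather call the wrapper from Python, a minimal client against the same endpoint could look like the sketch below. It assumes the response body is raw audio bytes, as in the OpenAI API this wrapper mirrors; adjust if OmniVoice-local returns something else.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal client for the OpenAI-compatible /v1/audio/speech endpoint above.
# Assumption: the server returns raw audio bytes in the response body.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello world",
        "voice": "reference_speaker",
        "model": "omnivoice",
    },
    timeout=60,
)
resp.raise_for_status()

with open("speech.wav", "wb") as f:
    f.write(resp.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;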



&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser demo&lt;/strong&gt; (no signup): &lt;a href="https://omnivoice.pro" rel="noopener noreferrer"&gt;omnivoice.pro&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Space&lt;/strong&gt;: &lt;a href="https://huggingface.co/spaces/k2-fsa/OmniVoice" rel="noopener noreferrer"&gt;k2-fsa/OmniVoice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/k2-fsa/OmniVoice" rel="noopener noreferrer"&gt;k2-fsa/OmniVoice&lt;/a&gt; (Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2604.00688" rel="noopener noreferrer"&gt;arXiv:2604.00688&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Caveat
&lt;/h2&gt;

&lt;p&gt;The Higgs-audio tokenizer (from Boson AI) requires an extended license if you exceed 100k monthly active users. Below that threshold, it's fully free under Apache 2.0.&lt;/p&gt;




&lt;p&gt;We'd love feedback from anyone working on multilingual apps, accessibility tools, or content localization. What languages or features would matter most for your use case?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Why AI Video Feels Unreliable — and What Reference-to-Video Fixes</title>
      <dc:creator>SupaCtx</dc:creator>
      <pubDate>Thu, 18 Dec 2025 08:02:24 +0000</pubDate>
      <link>https://forem.com/supactx_8c8c0c94591ec0e1f/why-ai-video-feels-unreliable-and-what-reference-to-video-fixes-k1p</link>
      <guid>https://forem.com/supactx_8c8c0c94591ec0e1f/why-ai-video-feels-unreliable-and-what-reference-to-video-fixes-k1p</guid>
      <description>&lt;p&gt;AI video generation looks great in demos.&lt;br&gt;
Clips are sharp, motion is smooth, and results can feel cinematic.&lt;/p&gt;

&lt;p&gt;But once you try to reuse the same character or build a real workflow, things fall apart.&lt;/p&gt;

&lt;p&gt;The problem isn’t realism.&lt;br&gt;
It’s control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why text and images aren’t enough
&lt;/h2&gt;

&lt;p&gt;Most AI video tools rely on text prompts or single images.&lt;/p&gt;

&lt;p&gt;Text explains ideas.&lt;br&gt;
Images lock appearance.&lt;/p&gt;

&lt;p&gt;But neither describes how something &lt;strong&gt;moves&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Motion, timing, posture, and physical behavior are what make a character feel consistent.&lt;br&gt;
That information doesn’t live in text or images — it lives in video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reference video as a control layer
&lt;/h2&gt;

&lt;p&gt;A short reference video carries exactly what’s missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how a character moves&lt;/li&gt;
&lt;li&gt;how actions flow over time&lt;/li&gt;
&lt;li&gt;how behavior stays consistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of asking the model to guess, reference-to-video lets it reuse motion and identity.&lt;/p&gt;

&lt;p&gt;Generation becomes directed, not random.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this changes AI video workflows
&lt;/h2&gt;

&lt;p&gt;With reference-to-video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;characters stay stable&lt;/li&gt;
&lt;li&gt;motion becomes reusable&lt;/li&gt;
&lt;li&gt;scenes feel intentional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You stop regenerating until something “looks right” and start planning outcomes.&lt;/p&gt;

&lt;p&gt;That’s the difference between demos and real tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  A practical example: Wan 2.6
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://vidthis.ai/features/wan2-6" rel="noopener noreferrer"&gt;Models like wan 2.6&lt;/a&gt; treat reference video as a core input, not a bonus feature.&lt;/p&gt;

&lt;p&gt;With just a few seconds of reference, it can preserve identity and motion while placing characters into new scenes or narratives.&lt;/p&gt;

&lt;p&gt;This makes AI video far more predictable — and far more usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The missing piece
&lt;/h2&gt;

&lt;p&gt;AI video didn’t struggle because models lacked power.&lt;/p&gt;

&lt;p&gt;It struggled because creators lacked leverage.&lt;/p&gt;

&lt;p&gt;Reference-to-video provides that missing control layer.&lt;br&gt;
And once it’s in place, AI video starts to behave like a system you can actually build with.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
