Alibaba Cloud’s Qwen crew has open-sourced Qwen3-TTS, a household of multilingual text-to-speech fashions that focus on three core duties in a single stack, voice clone, voice design, and prime quality speech technology.


Mannequin household and capabilities
Qwen3-TTS makes use of a 12Hz speech tokenizer and a couple of language mannequin sizes, 0.6B and 1.7B, packaged into 3 essential duties. The open launch exposes 5 fashions, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS, Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset audio system, and Qwen3-TTS-12Hz-1.7B-VoiceDesign free of charge type voice creation from pure language descriptions, together with the Qwen3-TTS-Tokenizer-12Hz codec.
All fashions assist 10 languages, Chinese language, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. CustomVoice variants ship with 9 curated timbres, equivalent to Vivian, a vibrant younger Chinese language feminine voice, Ryan, a dynamic English male voice, and Ono_Anna, a playful Japanese feminine voice, every with a brief description that encodes timbre and talking fashion.
The VoiceDesign mannequin maps textual content directions on to new voices, for instance ‘communicate in a nervous teenage male voice with rising intonation’ and may then be mixed with the Base mannequin by first producing a brief reference clip and reusing it by way of create_voice_clone_prompt.


Structure, tokenizer, and streaming path
Qwen3-TTS is a twin monitor language mannequin, one monitor predicts discrete acoustic tokens from textual content, the opposite handles alignment and management indicators. The system is educated on greater than 5 million hours of multilingual speech in 3 pre coaching levels that transfer from normal mapping, to prime quality information, to lengthy context assist as much as 32,768 tokens.
A key element is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, about 80 ms per token, and makes use of 16 quantizers with a 2048 entry codebook. On LibriSpeech take a look at clear it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireredTTS 2 and different current semantic tokenizers, whereas utilizing the same or decrease body charge.
The tokenizer is applied as a pure left context streaming decoder, so it may emit waveforms as quickly as sufficient tokens can be found. With 4 tokens per packet, every streaming packet carries 320 ms of audio. The non-DiT decoder and BigVGAN free design reduces decode price and simplifies batching.
On the language mannequin facet, the analysis crew experiences finish to finish streaming measurements on a single vLLM backend with torch.compile and CUDA Graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, the primary packet latency is round 97 ms and 101 ms, with actual time elements of 0.288 and 0.313 respectively. Even at concurrency 6, first packet latency stays round 299 ms and 333 ms.


Alignment and management
Submit coaching makes use of a staged alignment pipeline. First, Direct Choice Optimization aligns generated speech with human preferences on multilingual information. Then GSPO with rule primarily based rewards improves stability and prosody. A closing speaker tremendous tuning stage on the Base mannequin yields goal speaker variants whereas preserving the core capabilities of the overall mannequin.
Instruction following is applied in a ChatML fashion format, the place textual content directions about fashion, emotion or tempo are prepended to the enter. This identical interface powers VoiceDesign, CustomVoice fashion prompts, and tremendous grained edits for cloned audio system.
Benchmarks, zero shot cloning, and multilingual speech
On the Seed-TTS take a look at set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base mannequin reaches a Phrase Error Price of 0.77 on test-zh and 1.24 on test-en. The analysis crew highlights the 1.24 WER on test-en as cutting-edge among the many in contrast techniques, whereas the Chinese language WER is near, however not decrease than, the most effective CosyVoice 3 rating.


On a multilingual TTS take a look at set masking 10 languages, Qwen3-TTS achieves the bottom WER in 6 languages, Chinese language, English, Italian, French, Korean, and Russian, and aggressive efficiency on the remaining 4 languages, whereas additionally acquiring the very best speaker similarity in all 10 languages in comparison with MiniMax-Speech and ElevenLabs Multilingual v2.
Cross-lingual evaluations present that Qwen3-TTS-12Hz-1.7B-Base reduces blended error charge for a number of language pairs, equivalent to zh-to-ko, the place the error drops from 14.4 for CosyVoice3 to 4.82, a couple of 66 p.c relative discount.
On InstructTTSEval, the Qwen3TTS-12Hz-1.7B-VD VoiceDesign mannequin units new cutting-edge scores amongst open supply fashions on Description-Speech Consistency and Response Precision in each Chinese language and English, and is aggressive with business techniques like Hume and Gemini on a number of metrics.
Key Takeaways
- Full open supply multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite that covers 3 duties in a single stack, prime quality TTS, 3 second voice cloning, and instruction primarily based voice design throughout 10 languages utilizing the 12Hz tokenizer household.
- Environment friendly discrete codec and actual time streaming: The Qwen3-TTS-Tokenizer-12Hz makes use of 16 codebooks at 12.5 frames per second, reaches robust PESQ, STOI and UTMOS scores, and helps packetized streaming with about 320 ms of audio per packet and sub 120 ms first packet latency for the 0.6B and 1.7B fashions within the reported setup.
- Job particular mannequin variants: The discharge affords Base fashions for cloning and generic TTS, CustomVoice fashions with 9 predefined audio system and elegance prompts, and a VoiceDesign mannequin that generates new voices straight from pure language descriptions which might then be reused by the Base mannequin.
- Sturdy alignment and multilingual high quality: A multi stage alignment pipeline with DPO, GSPO and speaker tremendous tuning provides Qwen3-TTS low phrase error charges and excessive speaker similarity, with lowest WER in 6 of 10 languages and the most effective speaker similarity in all 10 languages among the many evaluated techniques, and cutting-edge zero shot English cloning on Seed TTS.
Try the Mannequin Weights, Repo and Playground. Additionally, be at liberty to observe us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you may be part of us on telegram as effectively.
