Spark-TTS: Text-to-Speech Model with Single-Stream Decoupled Tokens [pdf]

Spark-TTS is an efficient text-to-speech model built on BiCodec, a novel single-stream speech codec that decouples speech tokens. The model allows separate control over speaker attributes and linguistic content, and it combines the Qwen2.5 LLM with chain-of-thought generation. The paper also introduces the VoxBox dataset and reports extensive experiments showing state-of-the-art zero-shot voice cloning.
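
To make the idea of "single-stream decoupled tokens" concrete, here is a minimal illustrative sketch in Python. It is not the Spark-TTS or BiCodec implementation: the class `DecoupledSpeech`, the functions `to_single_stream` and `swap_speaker`, the marker strings, and the token layout are all hypothetical, chosen only to show how two separate token types (speaker attributes and linguistic content) could be flattened into one sequence for a language model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker tokens; the summary does not specify the real vocabulary.
SPEAKER_START = "<speaker>"
CONTENT_START = "<content>"


@dataclass
class DecoupledSpeech:
    """Toy container for the two decoupled token types (assumed layout)."""
    speaker_tokens: List[int]  # who is speaking (speaker attributes)
    content_tokens: List[int]  # what is said (linguistic content)


def to_single_stream(speech: DecoupledSpeech) -> List[str]:
    """Flatten both token types into one sequence a language model could consume."""
    stream = [SPEAKER_START]
    stream += [f"g{t}" for t in speech.speaker_tokens]
    stream.append(CONTENT_START)
    stream += [f"s{t}" for t in speech.content_tokens]
    return stream


def swap_speaker(content_from: DecoupledSpeech,
                 speaker_from: DecoupledSpeech) -> DecoupledSpeech:
    """Keep one utterance's content but take the speaker tokens from another."""
    return DecoupledSpeech(
        speaker_tokens=list(speaker_from.speaker_tokens),
        content_tokens=list(content_from.content_tokens),
    )


if __name__ == "__main__":
    utterance = DecoupledSpeech(speaker_tokens=[1, 2], content_tokens=[10, 11, 12])
    reference = DecoupledSpeech(speaker_tokens=[7, 8], content_tokens=[20])
    cloned = swap_speaker(content_from=utterance, speaker_from=reference)
    print(to_single_stream(cloned))
```

Because the two token types are kept apart until the final flattening step, changing the voice in this toy setup only requires replacing the speaker tokens; the content tokens are left untouched. That is the intuition behind customizing speaker attributes independently of linguistic content.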

Submitted on 3 Mar 2025.
Authors listed include Xinsheng Wang and others.
Addresses zero-shot text-to-speech synthesis.
Utilizes BiCodec for speech token decoupling.
Submitted to ACL 2025.
