🧀 BigCheese.ai

Spark-TTS: Text-to-Speech Model with Single-Stream Decoupled Tokens [pdf]

Spark-TTS is an efficient text-to-speech model built on BiCodec, a novel single-stream speech codec that decouples speech into separate token types. The model allows customization of both speaker attributes and linguistic content, using the Qwen2.5 LLM with chain-of-thought generation. The paper also introduces the VoxBox dataset and reports extensive experiments showing state-of-the-art zero-shot voice cloning.

  • Submitted on 3 Mar 2025.
  • Authors include Xinsheng Wang and others.
  • Addresses zero-shot text-to-speech synthesis.
  • Utilizes BiCodec for speech token decoupling.
  • Submitted to ACL 2025.