Spark-TTS is an efficient text-to-speech model built on BiCodec, a novel single-stream speech codec. The model supports customization of speaker attributes and linguistic content, and it uses the Qwen2.5 LLM with chain-of-thought generation. The authors also introduce the VoxBox dataset and report extensive experiments showing state-of-the-art zero-shot voice cloning.