
Moshi: A speech-text foundation model for real-time dialogue


Moshi is a speech-text foundation model for real-time dialogue by Kyutai Labs. The model, released as an open GitHub repository, is built around Mimi, a streaming neural audio codec that delivers low-latency, high-quality audio processing. The repo provides implementations for PyTorch, MLX for M-series Macs, and Rust (used in production), with development focused on real-time applications and dialogue systems.
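To make the codec claim concrete, here is a minimal sketch of round-tripping audio through Mimi. It uses the Hugging Face transformers port of the codec (model id "kyutai/mimi") purely for illustration; the repository itself ships its own PyTorch, MLX, and Rust loaders, and the sine-wave input below is a placeholder standing in for real speech.

```python
# A hedged sketch (not code from the Moshi repo): round-tripping audio through
# the Mimi codec via the Hugging Face `transformers` port ("kyutai/mimi").
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of 24 kHz placeholder audio (a 440 Hz tone standing in for speech).
sample_rate = feature_extractor.sampling_rate  # 24_000 Hz
t = np.arange(sample_rate) / sample_rate
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

inputs = feature_extractor(raw_audio=waveform, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # discrete codes at ~12.5 Hz
    audio = model.decode(codes).audio_values                  # reconstructed 24 kHz waveform

print(codes.shape)  # (batch, num_codebooks, num_frames) -- roughly 13 frames for 1 s
print(audio.shape)  # (batch, 1, num_samples)
```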

  • Moshi uses Mimi, a state-of-the-art streaming neural audio codec.
  • It processes 24 kHz audio down to a 12.5 Hz representation.
  • Moshi includes a Temporal Transformer with 7 billion parameters.
  • Moshi achieves a theoretical latency of 160 ms (see the arithmetic sketch after this list).
  • The repo includes versions for PyTorch, MLX for macOS, and Rust.
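The numbers in the list fit together with a bit of arithmetic: at 12.5 Hz, each Mimi frame covers 1,920 samples of 24 kHz audio, i.e. 80 ms, and the 160 ms theoretical latency corresponds to two such frames (the project describes it as Mimi's 80 ms frame size plus 80 ms of acoustic delay). A minimal check of the same calculation:

```python
# Sanity-check arithmetic for the figures above (no Moshi code involved).
SAMPLE_RATE_HZ = 24_000   # Mimi input/output sample rate
FRAME_RATE_HZ = 12.5      # Mimi latent frame rate

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # 1920.0 samples per frame
frame_duration_ms = 1_000 / FRAME_RATE_HZ            # 80.0 ms per frame

# 160 ms theoretical latency == two 80 ms frames
# (one Mimi frame plus an equal acoustic delay, per the project's description).
theoretical_latency_ms = 2 * frame_duration_ms

print(samples_per_frame, frame_duration_ms, theoretical_latency_ms)  # 1920.0 80.0 160.0
```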