
Moshi: A speech-text foundation model for real-time dialogue


Moshi is a speech-text foundation model for real-time dialogue by Kyutai Labs. The model, released as an open GitHub repository, is built around Mimi, a streaming neural audio codec that delivers low-latency, high-quality audio processing. The repo provides implementations for PyTorch, MLX for M-series Macs, and Rust (used in production), with development focused on real-time applications and dialogue systems.
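To make the codec claim concrete, here is a minimal sketch of round-tripping audio through Mimi. It uses the Hugging Face transformers port of the codec (model id "kyutai/mimi") purely for illustration; the repository itself ships its own PyTorch, MLX, and Rust loaders, and the sine-wave input below is a placeholder standing in for real speech.

```python
# A hedged sketch (not code from the Moshi repo): round-tripping audio through
# the Mimi codec via the Hugging Face `transformers` port ("kyutai/mimi").
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of 24 kHz placeholder audio (a 440 Hz tone standing in for speech).
sample_rate = feature_extractor.sampling_rate  # 24_000 Hz
t = np.arange(sample_rate) / sample_rate
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

inputs = feature_extractor(raw_audio=waveform, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # discrete codes at ~12.5 Hz
    audio = model.decode(codes).audio_values                  # reconstructed 24 kHz waveform

print(codes.shape)  # (batch, num_codebooks, num_frames) -- roughly 13 frames for 1 s
print(audio.shape)  # (batch, 1, num_samples)
```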

  • Moshi uses Mimi, a state-of-the-art streaming neural audio codec.
  • It processes 24 kHz audio down to a 12.5 Hz representation.
  • Moshi includes a Temporal Transformer with 7 billion parameters.
  • Moshi achieves a theoretical latency of 160 ms (see the arithmetic sketch after this list).
  • The repo includes versions for PyTorch, MLX for macOS, and Rust.
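The numbers in the list fit together with a bit of arithmetic: at 12.5 Hz, each Mimi frame covers 1,920 samples of 24 kHz audio, i.e. 80 ms, and the 160 ms theoretical latency corresponds to two such frames (the project describes it as Mimi's 80 ms frame size plus 80 ms of acoustic delay). A minimal check of the same calculation:

```python
# Sanity-check arithmetic for the figures above (no Moshi code involved).
SAMPLE_RATE_HZ = 24_000   # Mimi input/output sample rate
FRAME_RATE_HZ = 12.5      # Mimi latent frame rate

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # 1920.0 samples per frame
frame_duration_ms = 1_000 / FRAME_RATE_HZ            # 80.0 ms per frame

# 160 ms theoretical latency == two 80 ms frames
# (one Mimi frame plus an equal acoustic delay, per the project's description).
theoretical_latency_ms = 2 * frame_duration_ms

print(samples_per_frame, frame_duration_ms, theoretical_latency_ms)  # 1920.0 80.0 160.0
```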