Kyutai Introduces 2B-Parameter Streaming TTS with 220 ms Latency Trained on 2.5 M Hours of Audio
Kyutai, an open AI research lab, unveiled a streaming Text-to-Speech model built on about 2 billion parameters. It delivers clear audio in as little as 220 milliseconds while preserving high fidelity. The system was trained on 2.5 million hours of speech and is released under CC-BY-4.0. This opens large-scale speech synthesis to on-device and autonomous AI applications with full transparency into the model and its training recipe.
Its streaming design lies at the core of this release. On a single NVIDIA L40 GPU, it can serve thirty-two simultaneous streams at under 350 ms of end-to-end delay. For a single user, it generates speech in about 220 ms, enabling responsive chatbots, virtual assistants, and live captioning. These speeds come from Delayed Streams Modeling, which processes text in segments and begins speaking before the full input arrives.
- Model size: about 2 billion parameters
- Training data: 2.5 million hours of speech
- Latency: 220 ms for one user; under 350 ms for 32 users on an L40 GPU
- Languages: English and French
- License: CC-BY-4.0
This performance rests on the Delayed Streams Modeling technique: the model begins generating audio before the entire sentence has been processed. That design trades a small amount of lookahead context for faster response times while retaining voice naturalness and timing consistency. Instead of waiting for the full text, the model operates on each chunk as it arrives, keeping speech smooth and coherent even at speed.
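To make the idea concrete, here is a minimal Python sketch of chunk-wise streaming synthesis. The `synthesize_chunk` callback and the chunking scheme are illustrative placeholders, not Kyutai's actual API; they only show how audio can be emitted as text arrives rather than after the full sentence is available.

```python
from typing import Callable, Iterable, Iterator


def stream_tts(
    text_chunks: Iterable[str],
    synthesize_chunk: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield audio for each text chunk as soon as it is synthesized,
    rather than waiting for the full input text."""
    for chunk in text_chunks:
        # Audio for this chunk is emitted while later chunks are still arriving.
        yield synthesize_chunk(chunk)


# Example: feeding text piece by piece, e.g. from an LLM that streams tokens.
# `fake_tts` stands in for the real synthesizer.
def fake_tts(chunk: str) -> bytes:
    return chunk.encode("utf-8")  # placeholder "audio" bytes


for audio in stream_tts(["Hello there,", " how can I", " help you today?"], fake_tts):
    pass  # play or buffer the audio here
```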
All code and the full training recipe are in Kyutai’s GitHub repository to allow complete reproducibility and wider contributions.
Model weights and inference scripts appear on Hugging Face to make setup straightforward for developers, researchers, and product teams. The CC-BY-4.0 license lets any group adapt or integrate the model into commercial or research projects, provided they credit the source.
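As a hedged sketch, the weights can be pulled with the standard Hugging Face Hub client; the repository id below is a placeholder and should be replaced with the model id listed on Kyutai's Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the actual model id published by Kyutai.
local_dir = snapshot_download(repo_id="kyutai/<streaming-tts-model>")
print(f"Model files downloaded to: {local_dir}")
```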
The release supports both batch and streaming modes, offering a foundation for voice cloning, interactive agents, accessibility tools, and other speech applications. With ready-to-use models in English and French, it fits into multilingual pipelines.
By cutting latency to roughly 220 milliseconds, the system minimizes the delay end users perceive between text input and spoken output. That speed suits a wide variety of speech tasks:
- Conversational AI: realistic voice interfaces that reply nearly in real time
- Assistive tech: screen readers and voice feedback systems that speak as soon as text arrives
- Media production: rapid voiceover drafts and edits in a few seconds
- On-device deployment: efficient inference on limited hardware without major slowdowns
A single L40 GPU serving thirty-two streams at once lets operators scale voice services more economically in cloud settings, reducing infrastructure costs for large deployments.
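As a rough illustration, and assuming the reported figure of 32 concurrent streams per L40, capacity planning reduces to simple division; actual throughput will vary with audio length, sampling rate, and serving stack.

```python
import math

# Assumption from the article: one L40 GPU serves 32 concurrent streams.
streams_per_gpu = 32
concurrent_users = 1000  # hypothetical target load

gpus_needed = math.ceil(concurrent_users / streams_per_gpu)
print(f"{gpus_needed} L40 GPUs for {concurrent_users} concurrent streams")  # -> 32
```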
Kyutai’s streaming TTS rollout delivers a blend of high-fidelity speech and near-live performance under an open license. Its reproducible code, multilingual support, and elastic scaling give it an edge over closed-source alternatives. Developers and researchers can now build conversational systems, accessibility solutions, and media pipelines with the same model, backed by transparent design and clear reuse terms.