Nvidia today introduced Granary, a massive open-source collection of speech data covering 25 European languages, and released two multilingual speech AI models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. These resources address gaps in automatic speech recognition (ASR) and automatic speech translation (AST) for languages with limited annotated material.
Granary emerged from a joint effort with Carnegie Mellon University and Fondazione Bruno Kessler. It includes roughly one million hours of audio, split into 650,000 hours for ASR and 350,000 for AST. The compilation spans nearly all official European Union languages, plus Russian and Ukrainian. Special attention went to those with minimal prior resources, such as Croatian, Estonian, and Maltese.
Key points of the Granary dataset:
• Dataset size: One million hours of speech
• Language coverage: 25 European languages
• Pseudo-labeling pipeline: Uses Nvidia NeMo’s Speech Data Processor to structure unlabeled public audio and improve quality, cutting the need for extensive human annotation
• Task support: Prepares data for both transcription and translation
• Open access: Freely available for developers to use in production-scale training
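The pseudo-labeling step described above can be sketched in miniature: an ASR model transcribes unlabeled audio, and only high-confidence hypotheses are kept as training labels. The sketch below assumes a hypothetical `transcribe_with_confidence` stand-in for a real model; the filtering logic, not the transcriber, is the point, and a production pipeline like NeMo's Speech Data Processor does far more (segmentation, normalization, deduplication).

```python
# Minimal sketch of confidence-based pseudo-labeling. The transcriber here is
# a hypothetical stub returning canned results; a real pipeline would call an
# ASR model and apply many more quality filters.

def transcribe_with_confidence(audio_path: str) -> tuple[str, float]:
    """Hypothetical ASR call returning (hypothesis, confidence in [0, 1])."""
    fake_outputs = {
        "clip_001.wav": ("hyvää huomenta", 0.94),
        "clip_002.wav": ("unintelligible mumbling", 0.31),
        "clip_003.wav": ("dobro jutro", 0.88),
    }
    return fake_outputs[audio_path]

def pseudo_label(audio_paths, min_confidence=0.8):
    """Keep only high-confidence hypotheses as pseudo-labels for training."""
    labeled = []
    for path in audio_paths:
        text, conf = transcribe_with_confidence(path)
        if conf >= min_confidence:
            labeled.append({"audio": path, "text": text})
    return labeled

clips = ["clip_001.wav", "clip_002.wav", "clip_003.wav"]
print(pseudo_label(clips))  # the low-confidence clip_002 is dropped
```

Filtering on model confidence is what lets a pipeline like this reduce, though not eliminate, the need for human annotation.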
Because the collection is curated, models trained on Granary converge faster: Nvidia reports that teams need roughly half the data volume to reach target accuracy compared with other popular datasets, making Granary well suited to rapid prototyping and to languages with scarce annotated examples.
Canary-1b-v2 packs one billion parameters into an encoder–decoder framework trained on Granary. It handles bidirectional transcription and translation across 25 European languages, up from four in its previous version.
Highlights of Canary-1b-v2 include:
• Performance: Matches accuracy of models three times its size and runs up to ten times faster
• Multitask ability: Excels at both ASR and AST in a single system
• Output features: Punctuation, capitalization, timestamps at word and segment levels, and even timestamped translations
• Architecture: FastConformer encoder, Transformer decoder, and a unified SentencePiece vocabulary across languages
• Robustness: Maintains accuracy under noisy conditions and limits false outputs
• ASR benchmarks: 7.15% WER on AMI, 10.82% on LibriSpeech Clean
• AST benchmarks: COMET scores of 79.3 for other-to-English and 84.56 for English-to-other
• License and deployment: Offered under CC BY 4.0 and optimized for Nvidia GPUs for efficient training and inference
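Word error rate (WER), the metric quoted in the ASR benchmarks above, is the word-level edit distance between hypothesis and reference (substitutions + insertions + deletions), divided by the number of reference words. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of about 16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 7.15% WER on AMI means roughly one word in fourteen is wrong, notable given that AMI is a noisy meeting-speech benchmark.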
Parakeet-tdt-0.6b-v3 carries 600 million parameters and extends the Parakeet line to full European language support. It targets high-volume, low-latency transcription, offers built-in language detection, and handles up to 24 minutes of audio in a single inference pass.
Parakeet-tdt-0.6b-v3 key features:
• Automatic language recognition: No need for external prompts
• Real-time transcription: Processes long audio segments in a single pass
• Scalability: Designed for batch workloads with word-level timestamps, punctuation, and capitalization
• Audio resilience: Delivers stable results on complex inputs like numbers or songs and in difficult recording environments
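Recordings longer than the model's 24-minute window must be split before inference. A minimal sketch of fixed-size chunking with a small overlap so words at chunk boundaries are not cut; the 24-minute limit comes from the text, while the 10-second overlap is an illustrative choice, not a documented Parakeet parameter:

```python
MAX_CHUNK_SEC = 24 * 60  # per-pass limit stated for Parakeet-tdt-0.6b-v3
OVERLAP_SEC = 10         # illustrative overlap between consecutive chunks

def chunk_spans(total_sec: float):
    """Return (start, end) second offsets covering the full recording."""
    spans = []
    start = 0.0
    while start < total_sec:
        end = min(start + MAX_CHUNK_SEC, total_sec)
        spans.append((start, end))
        if end == total_sec:
            break
        start = end - OVERLAP_SEC  # step back so chunks overlap slightly
    return spans

# A 50-minute recording needs three passes.
print(chunk_spans(50 * 60))
```

Each span would then be transcribed independently, with the overlapping region used to stitch hypotheses together.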
Developers, researchers, and businesses can employ these tools to create:
• Multilingual chatbots
• Customer service voice agents
• Near-real-time translation services
Open access to these resources supports a broader set of spoken languages and the creation of diverse voice-driven applications across Europe.