NVIDIA Brings Speech AI to 25 European Languages With New Open-Source Toolkit

Artificial intelligence might seem omnipresent, yet it covers only a small slice of the world’s roughly 7,000 languages, leaving a large share of speakers without support. NVIDIA wants to close this gap in language AI, starting with Europe’s diverse linguistic map. Its goal is to bring speech recognition and translation to underrepresented languages.

NVIDIA has released a new open-source toolkit that lets developers build high-quality speech AI for 25 European languages. Coverage extends beyond the major languages to often-neglected ones such as Croatian, Estonian and Maltese, whose speakers are usually out of reach for big tech players.

The toolkit aims to support features users often take for granted, such as multilingual chatbots that grasp local expressions, customer support assistants that follow every detail, and translation engines that return accurate results almost instantly.

At the heart of this plan sits Granary, a huge repository of recorded human speech. It holds roughly one million hours of audio, carefully organized to train AI systems in transcription and translation tasks.

Alongside this repository, NVIDIA is releasing two AI models designed for language tasks:

Canary-1b-v2, a large-scale model aimed at top transcription and translation accuracy.
Parakeet-tdt-0.6b-v3, a rapid model for scenarios that demand real-time processing.

Researchers will present a full paper on Granary at this month’s Interspeech conference in the Netherlands, offering in-depth details on methodology and results. Meanwhile, developers interested in hands-on work can download the dataset and both models immediately from Hugging Face and begin experiments.

The release hinges on an automated pipeline that generated the dataset from previously unstructured recordings. Training speech AI normally demands huge amounts of labeled audio, and manual annotation is slow, monotonous and expensive, often taking many months.

NVIDIA’s speech AI team, working with Carnegie Mellon University and Fondazione Bruno Kessler, built the automated workflow using NVIDIA’s NeMo toolkit. The pipeline filters out background noise, segments long recordings into discrete utterances, infers transcripts via alignment models, and applies quality checks to drop low-confidence samples. The output is clean, structured data ready for AI training.
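To make two of those steps concrete, here is a minimal Python sketch of segmentation and quality filtering. It is an illustration only, not the Granary pipeline or the NeMo API: the `Utterance` class, the function names, the energy-based silence rule and the threshold values are all hypothetical assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """Hypothetical record for one segmented piece of speech."""
    start: float       # start time in seconds
    end: float         # end time in seconds
    text: str          # machine-generated transcript
    confidence: float  # model confidence in [0, 1]


def segment_by_silence(energies: List[float], frame_sec: float,
                       silence_threshold: float) -> List[Tuple[float, float]]:
    """Return (start, end) spans, in seconds, covering runs of frames
    whose energy is above the silence threshold."""
    spans = []
    start = None  # index of the first frame of the current run, if any
    for i, energy in enumerate(energies):
        if energy > silence_threshold and start is None:
            start = i
        elif energy <= silence_threshold and start is not None:
            spans.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:  # recording ended while speech was ongoing
        spans.append((start * frame_sec, len(energies) * frame_sec))
    return spans


def drop_low_confidence(utterances: List[Utterance],
                        min_confidence: float) -> List[Utterance]:
    """Keep only utterances whose confidence meets the threshold."""
    return [u for u in utterances if u.confidence >= min_confidence]


# Toy input: seven half-second frames with two bursts of speech.
spans = segment_by_silence([0.0, 0.9, 0.8, 0.0, 0.0, 0.7, 0.6],
                           frame_sec=0.5, silence_threshold=0.1)
candidates = [
    Utterance(spans[0][0], spans[0][1], "dobar dan", 0.95),
    Utterance(spans[1][0], spans[1][1], "???", 0.30),
]
kept = drop_low_confidence(candidates, min_confidence=0.8)
```

In this toy run the second utterance is discarded because its confidence falls below the assumed 0.8 cutoff. A real pipeline would rely on trained models for noise filtering, alignment and scoring rather than a fixed energy threshold.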

The project is more than a technical milestone; it is a step forward in digital inclusion for underrepresented languages across Europe. Developers in cities like Riga or Zagreb can now create voice-based tools that understand local speech patterns, accents, and dialects. They can also work faster and with less computing time: tests show that reaching a given accuracy takes about half as much Granary training data as it does with other widely used datasets.

The two models show how Granary can power real-world language applications. Canary-1b-v2 delivers translation and transcription accuracy on par with systems three times larger while running at up to ten times their throughput. Parakeet-tdt-0.6b-v3 can transcribe a 24-minute meeting recording in a single pass, automatically detecting the spoken language. Both models handle punctuation, capitalization, and word-level timestamps, capabilities essential for production-quality voice applications.

By sharing these open-source tools and the methods behind them with developers around the world, NVIDIA goes beyond a typical product release. The company aims to spark a wave of new applications and build a future where AI understands every language and dialect, regardless of a user’s origin.