
Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1600+ Languages

How do you build a single speech recognition system that can understand thousands of languages, including many that never had working ASR (automatic speech recognition) models before? Meta AI has released Omnilingual ASR, an open source speech recognition suite that scales to more than 1,600 languages and can be extended to unseen languages with just a few speech-text examples, without retraining the model.

Data and language coverage

The supervised training data comes from a combined corpus called AllASR. AllASR contains 120,710 hours of labeled speech paired with transcripts across 1,690 languages. This corpus merges several sources, including open source datasets, internal and licensed corpora, partner-created data, and a commissioned collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech for 348 languages, with data collected through field work with local organizations and speakers in regions such as Africa and South Asia. Prompts are open ended, so speakers produce natural monologues in their own language instead of reading fixed sentences, which yields more realistic acoustic and lexical variation.

https://ai.meta.com/analysis/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/

For self-supervised pre-training, the wav2vec 2.0 encoders are trained on a large unlabeled speech corpus. The pre-training dataset contains 3.84M hours of speech with language identification across 1,239 languages, plus another 460K hours without language identification. The total unlabeled audio used for pre-training is therefore about 4.3M hours. This is still significantly smaller than the 12M hours used by USM, which makes the reported results more interesting from a data efficiency perspective.


Model family

Omnilingual ASR exposes three main model families that all share the same wav2vec 2.0 speech encoder backbone:

  1. SSL encoders (OmniASR W2V)
    Self-supervised wav2vec 2.0 encoders with the following parameter counts:
    omniASR_W2V_300M with 317,390,592 parameters
    omniASR_W2V_1B with 965,514,752 parameters
    omniASR_W2V_3B with 3,064,124,672 parameters
    omniASR_W2V_7B with 6,488,487,168 parameters
    These models are trained with the standard wav2vec 2.0 contrastive objective. After training, the quantizer is discarded and the encoder is used as a speech representation backbone.
  2. CTC (connectionist temporal classification) ASR models
    CTC models add a simple linear layer on top of the encoder and train end to end with a character-level CTC loss (a minimal sketch of this head follows the list below). The released CTC models range from 325,494,996 parameters to 6,504,786,132 parameters and reach real-time factors as low as 0.001 for the 300M model on an A100, for 30-second audio with batch size 1.
  3. LLM ASR models
    LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a language-model-like Transformer that operates on character-level tokens plus special tokens marking the start and end of the transcription. Training uses standard next-token prediction on sequences of the form g_s(x), g_t(⟨start⟩), g_t(y), g_t(⟨end⟩), where g_s is the speech encoder and g_t is the text embedding matrix. The LLM ASR family ranges from about 1.63B parameters for omniASR_LLM_300M to 7,801,041,536 parameters for omniASR_LLM_7B. A separate omniASR_LLM_7B_ZS checkpoint with 7,810,900,608 parameters is used for zero-shot ASR.
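
To make the CTC family concrete, here is a minimal PyTorch sketch of a character-level CTC head on top of a speech encoder, as described in item 2 above. The encoder interface, model dimension, and vocabulary size are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Character-level CTC head: a single linear projection over encoder frames.

    `encoder` stands in for a wav2vec 2.0 encoder that maps waveforms to
    frame features of width `d_model`; both are assumptions for this sketch.
    """
    def __init__(self, encoder: nn.Module, d_model: int = 1024, vocab_size: int = 256):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(d_model, vocab_size + 1)  # index 0 = CTC blank

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(waveforms)   # (batch, time, d_model)
        logits = self.proj(frames)         # (batch, time, vocab_size + 1)
        return logits.log_softmax(dim=-1)

# End-to-end training uses the standard CTC loss; nn.CTCLoss expects
# log-probs shaped (time, batch, classes).
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```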

All LLM ASR models support optional language conditioning. Languages are represented as {language_code}_{script}, such as eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified Chinese script. A learned embedding for the language-script identifier is injected into the decoder input. During training, the language ID token is sometimes dropped, so the model can also operate without explicit language tags at inference.
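
Here is a minimal sketch of how such a decoder input sequence could be assembled, including the optional language-ID embedding. The dimensions, token ids, and helper names are assumptions for illustration, not the released API.

```python
import torch
import torch.nn as nn

D_MODEL = 1024           # assumed shared embedding width
START, END = 1, 2        # hypothetical special token ids
LANG_IDS = {"eng_Latn": 0, "cmn_Hans": 1}  # illustrative subset

text_embed = nn.Embedding(256 + 3, D_MODEL)        # g_t: characters + specials
lang_embed = nn.Embedding(len(LANG_IDS), D_MODEL)  # learned language-ID table

def build_decoder_input(speech_feats, target_ids, lang=None):
    """Concatenate [lang?, g_s(x), g_t(<start>), g_t(y), g_t(<end>)]."""
    parts = []
    if lang is not None:  # conditioning is optional, mirroring training dropout
        parts.append(lang_embed(torch.tensor([LANG_IDS[lang]])))
    parts.append(speech_feats)                       # g_s(x): (T, D_MODEL)
    parts.append(text_embed(torch.tensor([START])))
    parts.append(text_embed(target_ids))             # g_t(y): character tokens
    parts.append(text_embed(torch.tensor([END])))
    return torch.cat(parts, dim=0)                   # one decoder sequence

seq = build_decoder_input(torch.randn(50, D_MODEL),
                          torch.randint(3, 259, (12,)), lang="eng_Latn")
print(seq.shape)  # torch.Size([65, 1024])
```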

Zero-shot ASR with context examples and SONAR

The supervised models cover more than 1,600 languages. However, many languages still have no transcribed ASR data. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero-shot mode trained with context examples.

During training for the zero-shot variant, the decoder consumes N + 1 speech-text pairs from the same language. The first N pairs act as context and the final pair is the target. All pairs are embedded with the speech encoder and text embedding matrix, then concatenated into a single decoder input sequence. The loss is still next-token prediction on the target transcription. This teaches the decoder to infer the mapping from speech to text in a given language from a small prompt of in-language examples.

At inference, the omniASR_LLM_7B_ZS model can receive a few speech-text examples from any language, including languages not present in training, and then transcribe new utterances in that language without updating weights. This is in-context learning for ASR.
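
Zero-shot prompting then reduces to concatenating several such segments. A sketch reusing the hypothetical build_decoder_input helper from the earlier block; the per-pair layout and the bare ⟨start⟩ token on the target are illustrative assumptions.

```python
def build_zero_shot_prompt(context_pairs, target_speech):
    """Concatenate N in-language (speech, transcript) pairs, then the target.

    The decoder is expected to continue the sequence with the target
    transcript; no weights are updated.
    """
    segments = [build_decoder_input(feats, ids) for feats, ids in context_pairs]
    # Target utterance: speech features plus <start>; transcript left to decode.
    segments.append(torch.cat([target_speech,
                               text_embed(torch.tensor([START]))], dim=0))
    return torch.cat(segments, dim=0)

# Three in-context examples from an unseen language prime the decoder.
prompt = build_zero_shot_prompt(
    [(torch.randn(40, D_MODEL), torch.randint(3, 259, (10,))) for _ in range(3)],
    torch.randn(45, D_MODEL),
)
```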

The system includes an example retrieval mechanism based on SONAR, a multilingual multimodal encoder that projects audio and text into a shared embedding space. The target audio is embedded once, then nearest-neighbor search over a database of speech-text pairs selects the most relevant examples to include in the context window. This SONAR-based selection improves zero-shot performance compared with random example selection or simple text similarity.
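
The retrieval step itself is plain nearest-neighbor search in the shared embedding space. A minimal sketch using cosine similarity in PyTorch; the SONAR embedding call is stubbed out with a random tensor, since the exact pipeline API is not covered here.

```python
import torch
import torch.nn.functional as F

def retrieve_context(query_emb, db_embs, k=3):
    """Pick the k speech-text pairs whose embeddings are closest to the
    target audio embedding, by cosine similarity."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), db_embs, dim=-1)
    return sims.topk(k).indices.tolist()  # indices into the example database

# query_emb = sonar_speech_encoder(target_audio)  # hypothetical call
query_emb = torch.randn(1024)                      # stand-in embedding
db_embs = torch.randn(500, 1024)                   # 500 candidate pairs
print(retrieve_context(query_emb, db_embs, k=3))
```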


Quality and benchmarks

The omniASR_LLM_7B model achieves character error rate below 10 percent for 78 percent of the more than 1,600 supported languages.
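
For reference, character error rate (CER) is the character-level Levenshtein edit distance between hypothesis and reference, normalized by the reference length. A self-contained implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, hc in enumerate(h, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (rc != hc)))   # substitution
        prev = curr
    return prev[-1] / max(len(r), 1)

print(cer("omnilingual", "omnilingal"))  # one deletion -> ~0.09
```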

The research team reports that on multilingual benchmarks such as FLEURS 102, the 7B LLM ASR model outperforms the 7B CTC models and also surpasses Google USM variants in average character error rate, despite using about 4.3M unlabeled hours instead of 12M and a simpler pre-training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM-style decoder is an effective path for high-coverage multilingual ASR.

Key Takeaways

  1. Omnilingual ASR provides open source ASR coverage for more than 1,600 languages and can generalize to more than 5,400 languages using zero-shot in-context learning.
  2. The models are built on large-scale wav2vec 2.0 encoders trained on about 4.3M hours of unlabeled audio, 3.84M hours of which carry language identification across 1,239 languages.
  3. The suite includes wav2vec 2.0 encoders, CTC ASR, LLM ASR, and a dedicated zero-shot LLM ASR model, with encoder sizes from 300M to 7B parameters and LLM ASR up to about 7.8B parameters.
  4. The 7B LLM ASR model achieves character error rate below 10 percent on 78 percent of the more than 1,600 supported languages, which is competitive with or better than prior multilingual systems in low-resource settings.

Omnilingual ASR is a significant systems-level contribution because it treats multilingual ASR as an extensible framework, not a fixed language list. It combines a 7B wav2vec 2.0 encoder, CTC and LLM ASR decoders, and a zero-shot LLM ASR model that can adapt to new languages with a few in-context examples, reaches character error rate below 10 percent on 78 percent of more than 1,600 supported languages, and releases everything under Apache 2.0 and CC BY 4.0. Overall, this release establishes Omnilingual ASR as the most extensible open source speech recognition system currently available.


Check out the Paper, Repo and Technical details.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
