18485271. Universal Monolingual Output Layer for Multilingual Speech Recognition simplified abstract (GOOGLE LLC)
Contents
Universal Monolingual Output Layer for Multilingual Speech Recognition
Organization Name
GOOGLE LLC
Inventor(s)
Chao Zhang of Mountain View CA (US)
Tara N. Sainath of Jersey City NJ (US)
Trevor Strohman of Mountain View CA (US)
Shuo-yiin Chang of Sunnyvale CA (US)
Universal Monolingual Output Layer for Multilingual Speech Recognition - A simplified explanation of the abstract
This abstract first appeared for US patent application 18485271, titled 'Universal Monolingual Output Layer for Multilingual Speech Recognition'.
Simplified Explanation
The method described in the abstract uses a multilingual automated speech recognition (ASR) model to recognize speech in multiple supported languages by processing a sequence of acoustic frames. The model includes an audio encoder, a language identification (LID) predictor, and a decoder with a monolingual output layer whose output nodes are shared across a plurality of language-specific wordpiece models.
- The method receives a sequence of acoustic frames and generates a higher order feature representation for each frame using an audio encoder.
- A language prediction representation is generated for each higher order feature representation by a language identification (LID) predictor.
- The decoder generates a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, the language prediction representation, and a sequence of non-blank symbols.
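The three steps above can be sketched as a minimal forward pass. This is an illustrative toy model with randomly initialized weights, not the patented implementation; all dimensions, weight names, and the concatenation of encoder features with the LID prediction are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not specified in the patent abstract.
FRAME_DIM, ENC_DIM, NUM_LANGS, VOCAB = 80, 64, 3, 100

# Audio encoder: maps each acoustic frame to a higher order feature.
W_enc = rng.normal(size=(FRAME_DIM, ENC_DIM)) * 0.1

# LID predictor: maps each encoder feature to a per-frame language distribution.
W_lid = rng.normal(size=(ENC_DIM, NUM_LANGS)) * 0.1

# Decoder: consumes the encoder feature together with the language
# prediction and emits a distribution over possible output symbols.
W_dec = rng.normal(size=(ENC_DIM + NUM_LANGS, VOCAB)) * 0.1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recognize(frames):
    """frames: (T, FRAME_DIM) -> (T, VOCAB) per-frame symbol distributions."""
    enc = np.tanh(frames @ W_enc)              # higher order feature per frame
    lid = softmax(enc @ W_lid)                 # language prediction per frame
    joint = np.concatenate([enc, lid], axis=-1)
    return softmax(joint @ W_dec)              # distribution over results

probs = recognize(rng.normal(size=(5, FRAME_DIM)))
```

Each row of `probs` is a valid probability distribution over the output vocabulary, conditioned on both the acoustic features and the per-frame language prediction.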
---
- Potential Applications
- Multilingual speech recognition systems
- Language identification in speech processing applications
- Problems Solved
- Improving accuracy in recognizing speech in multiple languages
- Enhancing language identification capabilities in automated systems
- Benefits
- Increased efficiency in processing multilingual speech data
- Enhanced user experience in speech recognition applications
- Potential Commercial Applications
- Optimizing Multilingual Speech Recognition Systems
---
- Possible Prior Art
One area of potential prior art is the use of language models in speech recognition systems to improve accuracy and performance. Researchers have also explored various techniques to enhance multilingual speech recognition, including language identification and feature extraction methods.
- Unanswered Questions
- How does the multilingual ASR model handle accents and dialects in speech recognition?
The abstract does not provide details on how the model addresses variations in accents and dialects, which can impact the accuracy of speech recognition.
- What is the computational complexity of the proposed method compared to existing multilingual ASR models?
The abstract does not mention the computational resources required for implementing the described method, which could be a crucial factor in practical applications.
Original Abstract Submitted
A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
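The "monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models" can be illustrated with a toy sketch: a single, universal set of output-node indices is reused across languages, and the predicted language determines which wordpiece inventory interprets each node. The inventories and tokens below are hypothetical examples, not from the patent.

```python
# Hypothetical per-language wordpiece inventories. Node index i refers to
# the i-th wordpiece of whichever language the LID predictor selects, so
# one universal output layer serves every supported language.
WORDPIECES = {
    "en": ["<blank>", "_the", "_cat", "_sat"],
    "es": ["<blank>", "_el", "_gato", "_se"],
}
NUM_NODES = 4  # shared output-layer size across all languages

def decode(node_ids, lang):
    """Map shared output-node indices to language-specific wordpieces."""
    return [WORDPIECES[lang][i] for i in node_ids]

english = decode([1, 2, 3], "en")
spanish = decode([1, 2], "es")
```

The same node indices yield different surface tokens depending on the predicted language, which is the sense in which the output layer is both universal (one set of nodes) and monolingual (interpreted through one language's wordpiece model at a time).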