18485271. Universal Monolingual Output Layer for Multilingual Speech Recognition simplified abstract (GOOGLE LLC)

From WikiPatents

Universal Monolingual Output Layer for Multilingual Speech Recognition

Organization Name

GOOGLE LLC

Inventor(s)

Chao Zhang of Mountain View CA (US)

Bo Li of Santa Clara CA (US)

Tara N. Sainath of Jersey City NJ (US)

Trevor Strohman of Mountain View CA (US)

Shuo-yiin Chang of Sunnyvale CA (US)

Universal Monolingual Output Layer for Multilingual Speech Recognition - A simplified explanation of the abstract

This abstract first appeared for US patent application 18485271, titled 'Universal Monolingual Output Layer for Multilingual Speech Recognition'.

Simplified Explanation

The method described in the abstract involves using a multilingual automated speech recognition (ASR) model to recognize speech in multiple languages by processing a sequence of acoustic frames. The model includes an audio encoder, a language identification (LID) predictor, and a decoder with a monolingual output layer.

  • The method receives a sequence of acoustic frames and generates a higher order feature representation for each frame using an audio encoder.
  • A language prediction representation is generated for each higher order feature representation by a language identification (LID) predictor.
  • The decoder generates a probability distribution over possible speech recognition results based on the higher order feature representation, the language prediction representation, and the sequence of previously emitted non-blank symbols (see the sketch below).
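
The following is a minimal, hypothetical sketch of how these three components could be wired together, assuming PyTorch; the module names, dimensions, and combination scheme are illustrative assumptions and not details taken from the patent application.

  import torch
  import torch.nn as nn

  class MultilingualASRSketch(nn.Module):
      """Illustrative sketch: audio encoder + LID predictor + decoder."""

      def __init__(self, feat_dim=80, enc_dim=512, num_languages=12, vocab_size=4096):
          super().__init__()
          # Audio encoder: acoustic frames -> higher order feature representations
          self.audio_encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
          # LID predictor: higher order features -> per-frame language prediction
          self.lid_predictor = nn.Linear(enc_dim, num_languages)
          # Decoder inputs: features, language prediction, and previous non-blank symbols
          self.symbol_embedding = nn.Embedding(vocab_size, enc_dim)
          self.joint = nn.Linear(enc_dim + num_languages + enc_dim, vocab_size)

      def forward(self, acoustic_frames, prev_symbols):
          # (batch, time, feat_dim) -> (batch, time, enc_dim)
          features, _ = self.audio_encoder(acoustic_frames)
          # Per-frame language prediction representation
          lang_pred = torch.softmax(self.lid_predictor(features), dim=-1)
          # Embed previous non-blank symbols (only the most recent one, for brevity)
          sym = self.symbol_embedding(prev_symbols)[:, -1:, :].expand(-1, features.size(1), -1)
          # Probability distribution over possible speech recognition results
          logits = self.joint(torch.cat([features, lang_pred, sym], dim=-1))
          return torch.log_softmax(logits, dim=-1)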

---

Potential Applications

- Multilingual speech recognition systems
- Language identification in speech processing applications

Problems Solved

- Improving accuracy in recognizing speech in multiple languages
- Enhancing language identification capabilities in automated systems

Benefits

- Increased efficiency in processing multilingual speech data
- Enhanced user experience in speech recognition applications

Potential Commercial Applications

- Optimizing multilingual speech recognition systems

---

Possible Prior Art

One potential example of prior art in this field is the use of language models in speech recognition systems to improve accuracy and performance. Researchers have also explored various techniques to enhance multilingual speech recognition, including language identification and feature extraction methods.

Unanswered Questions

How does the multilingual ASR model handle accents and dialects in speech recognition?

The abstract does not provide details on how the model addresses variations in accents and dialects, which can impact the accuracy of speech recognition.

What is the computational complexity of the proposed method compared to existing multilingual ASR models?

The abstract does not mention the computational resources required for implementing the described method, which could be a crucial factor in practical applications.


Original Abstract Submitted

A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
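
The abstract states that the monolingual output layer has output nodes that are shared across several language-specific wordpiece models, but does not describe the sharing mechanism. The sketch below illustrates one plausible interpretation using toy vocabularies: each language reuses the same pool of output node indices and interprets a chosen node through its own wordpiece inventory, with the LID prediction selecting the language. The class name, mapping scheme, and example wordpieces are assumptions for illustration only.

  from typing import Dict, List

  class SharedOutputLayerSketch:
      """Hypothetical shared output layer over language-specific wordpiece models."""

      def __init__(self, language_wordpieces: Dict[str, List[str]]):
          # One shared pool of output nodes; every language reuses the same node
          # indices but reads them through its own wordpiece inventory.
          self.num_nodes = max(len(v) for v in language_wordpieces.values())
          self.wordpieces = language_wordpieces

      def decode_node(self, node_index: int, language: str) -> str:
          # The predicted language decides which wordpiece model interprets the node.
          pieces = self.wordpieces[language]
          return pieces[node_index] if node_index < len(pieces) else "<unk>"

  # Example usage with toy vocabularies:
  layer = SharedOutputLayerSketch({
      "en": ["_the", "_and", "ing"],
      "es": ["_el", "_y", "ando"],
  })
  print(layer.decode_node(0, "en"))  # "_the"
  print(layer.decode_node(0, "es"))  # "_el"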