Google LLC (20240290321). CHUNK-WISE ATTENTION FOR LONGFORM ASR simplified abstract

From WikiPatents

CHUNK-WISE ATTENTION FOR LONGFORM ASR

Organization Name

Google LLC

Inventor(s)

Yongqiang Wang of Kirkland WA (US)

Yu Zhang of Mountain View CA (US)

Wei Han of Mountain View CA (US)

Parisa Haghani of Mountain View CA (US)

Pedro J. Moreno Mengibar of Jersey City NJ (US)

CHUNK-WISE ATTENTION FOR LONGFORM ASR - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240290321, titled 'CHUNK-WISE ATTENTION FOR LONGFORM ASR'.

The method described in the abstract involves processing training data containing multilingual unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances.

  • Generating target quantized vector tokens and token indexes for un-transcribed non-synthetic speech utterances.
  • Creating contrastive context vectors from masked audio features and deriving a contrastive loss term.
  • Generating alignment outputs and probability distributions over speech recognition hypotheses.
  • Pre-training an audio encoder based on the contrastive, alignment output, and non-synthetic speech loss terms.
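The contrastive step above can be sketched as an InfoNCE-style objective: a context vector computed from masked audio features is scored against its target quantized vector token and against distractor tokens. This is a minimal illustration only; the abstract does not specify the exact loss form, and the function names and temperature value here are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form): the context vector
    derived from masked audio features should score high against its
    target quantized vector token and low against distractor tokens
    sampled from other time steps."""
    sims = [cosine(context, target)] + [cosine(context, d) for d in distractors]
    logits = np.array(sims) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))              # true target sits at index 0

# Illustrative usage: a context vector aligned with its target yields a
# near-zero loss when the distractors point in other directions.
ctx = np.array([1.0, 0.0])
loss = contrastive_loss(ctx, np.array([1.0, 0.0]),
                        [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
```

A loss like this pushes the encoder to produce context vectors that identify the correct quantized token among distractors, which is how the contrastive term shapes the audio representation during pre-training.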

Key Features and Innovation

  • Utilizes multilingual data for training, improving language coverage.
  • Incorporates contrastive learning to enhance speech recognition accuracy.
  • Integrates alignment outputs to refine speech recognition hypotheses.
  • Pre-trains audio encoder for better performance on non-synthetic speech utterances.

Potential Applications

  • Enhanced multilingual speech recognition systems.
  • Improved accuracy in transcribing non-synthetic speech.
  • Language learning tools for unspoken textual utterances.

Problems Solved

  • Addressing the challenge of accurately transcribing unspoken multilingual text.
  • Improving speech recognition performance on non-synthetic speech data.

Benefits

  • Increased accuracy in transcribing multilingual speech data.
  • Enhanced performance in recognizing non-synthetic speech utterances.

Commercial Applications

Potential commercial applications include:

  • Multilingual transcription services.
  • Speech recognition software for various industries.
  • Language learning platforms with improved speech recognition capabilities.

Questions about the Technology

1. How does the method handle multilingual data during training?
2. What are the specific loss terms used for pre-training the audio encoder?


Frequently Updated Research

Stay updated on the latest advancements in multilingual speech recognition and contrastive learning techniques to enhance the performance of this technology.


Original Abstract Submitted

A method includes receiving training data including a corpus of multilingual unspoken textual utterances, a corpus of multilingual un-transcribed non-synthetic speech utterances, and a corpus of multilingual transcribed non-synthetic speech utterances. For each un-transcribed non-synthetic speech utterance, the method includes generating a target quantized vector token and a target token index, generating contrastive context vectors from corresponding masked audio features, and deriving a contrastive loss term. The method also includes generating an alignment output, generating a first probability distribution over possible speech recognition hypotheses for the alignment output, and determining an alignment output loss term. The method also includes generating a second probability distribution over possible speech recognition hypotheses and determining a non-synthetic speech loss term. The method also includes pre-training an audio encoder based on the contrastive loss term, the alignment output loss term, and the non-synthetic speech loss term.
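The abstract names three loss terms and says the audio encoder is pre-trained on all of them. A rough sketch of how that could look is below; the softmax cross-entropy over hypotheses and the unit loss weights are assumptions, since the application does not specify either.

```python
import math

def hypothesis_nll(logits, target_index):
    """Negative log-likelihood of the reference hypothesis under a softmax
    distribution over possible speech recognition hypotheses (assumed form,
    computed with the log-sum-exp trick for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def pretraining_loss(contrastive, alignment, non_synthetic,
                     w_c=1.0, w_a=1.0, w_n=1.0):
    """Weighted sum of the three loss terms named in the abstract.
    The weights w_c, w_a, w_n are illustrative placeholders."""
    return w_c * contrastive + w_a * alignment + w_n * non_synthetic

# Illustrative usage: with uniform logits over three hypotheses, the NLL of
# any hypothesis is log(3).
nll = hypothesis_nll([1.0, 1.0, 1.0], 0)
total = pretraining_loss(contrastive=0.5, alignment=nll, non_synthetic=nll)
```

The key point the sketch illustrates is that the encoder receives gradient signal from all three objectives at once, rather than being trained on each corpus in isolation.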