18592590. Text Injection For Training Auxiliary Tasks In Speech Recognition Models simplified abstract (GOOGLE LLC)

From WikiPatents

Text Injection For Training Auxiliary Tasks In Speech Recognition Models

Organization Name

GOOGLE LLC

Inventor(s)

Shaan Jagdeep Patrick Bijwadia of San Francisco CA (US)

Shuo-yiin Chang of Sunnyvale CA (US)

Tara N. Sainath of Jersey City NJ (US)

Weiran Wang of San Jose CA (US)

Zhong Meng of Mountain View CA (US)

Text Injection For Training Auxiliary Tasks In Speech Recognition Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 18592590, titled 'Text Injection For Training Auxiliary Tasks In Speech Recognition Models'.

The abstract describes a joint auxiliary task and automatic speech recognition (ASR) model comprising an encoder that processes acoustic frames into higher-order feature representations, and a multi-output HAT (hybrid autoregressive transducer) decoder that generates, at each output step, a distribution over speech recognition hypotheses together with an indication of whether that step corresponds to an auxiliary token for a particular task. The model is trained on paired audio data whose transcriptions are annotated with auxiliary tokens, and on unpaired textual utterances annotated with the same auxiliary tokens.

  • Encoder processes acoustic frames into higher-order feature representations
  • Multi-output HAT decoder generates speech recognition hypotheses at each output step
  • Decoder also indicates whether each output step corresponds to an auxiliary token for a particular task
  • Training uses paired audio-transcription data and unpaired textual utterances, both annotated with ground-truth auxiliary tokens
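The two decoder outputs described above can be sketched as follows. This is a toy illustration only: the layer shapes, the single-linear-layer "encoder", and the class and variable names are assumptions for demonstration, not the patent's actual HAT architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

class JointAuxASRSketch:
    """Toy stand-in: encoder -> per-frame features; decoder -> (vocab dist, aux-token prob)."""

    def __init__(self, feat_dim=8, enc_dim=16, vocab=32):
        self.W_enc = rng.normal(size=(feat_dim, enc_dim)) * 0.1
        self.W_vocab = rng.normal(size=(enc_dim, vocab)) * 0.1
        self.w_aux = rng.normal(size=(enc_dim,)) * 0.1

    def encode(self, frames):
        # Higher-order feature representation per acoustic frame (here just linear + tanh).
        return np.tanh(frames @ self.W_enc)

    def decode(self, feats):
        # Output 1: probability distribution over possible speech recognition hypotheses.
        vocab_probs = softmax(feats @ self.W_vocab)
        # Output 2: per-step indication of whether the step is an auxiliary token.
        aux_prob = sigmoid(feats @ self.w_aux)
        return vocab_probs, aux_prob

model = JointAuxASRSketch()
frames = rng.normal(size=(20, 8))        # 20 acoustic frames
vocab_probs, aux_prob = model.decode(model.encode(frames))
print(vocab_probs.shape, aux_prob.shape)  # one distribution and one indicator per step
```

The point of the sketch is the multi-output head: the same decoder state feeds both the hypothesis distribution and the auxiliary-token indicator, so the auxiliary task (e.g., detecting a special token) shares representations with recognition.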

Potential Applications:
  • Speech recognition systems
  • Language translation tools
  • Voice-controlled devices

Problems Solved:
  • Improving the accuracy of speech recognition
  • Enhancing the performance of auxiliary tasks in ASR models

Benefits:
  • Enhanced speech recognition capabilities
  • Improved accuracy in recognizing auxiliary-task tokens

Commercial Applications:
  • Integration into smart speakers
  • Implementation in customer service chatbots
  • Inclusion in language learning applications

Questions about the technology:
  1. How does the model differentiate between auxiliary tokens and regular speech recognition hypotheses?
  2. What are the potential challenges in training the model with paired and unpaired data?

Frequently Updated Research:
  • Ongoing studies on optimizing the training process for joint auxiliary tasks and ASR models.


Original Abstract Submitted

A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output HAT decoder to generate at each of the plurality of output steps a probability distribution over possible speech recognition hypotheses, and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a JEIT training process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.
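The JEIT training process in the abstract combines a loss on paired audio-transcription data with a text-injection loss on unpaired text, which (having no audio) can only be scored through the decoder's language-model-like path. A minimal numpy sketch of such a combined objective is below; the function names, the internal-LM scoring, and the mixing weight `lam` are illustrative assumptions, not the patent's specified procedure.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def paired_asr_loss(model_logp, target_ids):
    # NLL of the annotated transcription (ground-truth auxiliary tokens included in target_ids).
    return -np.mean([model_logp[t, tok] for t, tok in enumerate(target_ids)])

def text_injection_loss(ilm_logp, target_ids):
    # Unpaired text has no corresponding audio, so it is scored by the
    # decoder's internal-LM path only (an assumption of this sketch).
    return -np.mean([ilm_logp[t, tok] for t, tok in enumerate(target_ids)])

def jeit_loss(model_logp, ilm_logp, paired_ids, text_ids, lam=0.5):
    # Combined objective: paired ASR loss plus a weighted text-injection loss.
    return paired_asr_loss(model_logp, paired_ids) + lam * text_injection_loss(ilm_logp, text_ids)

rng = np.random.default_rng(1)
model_logp = log_softmax(rng.normal(size=(4, 10)))  # full-model log-probs on a paired utterance
ilm_logp = log_softmax(rng.normal(size=(5, 10)))    # internal-LM log-probs on an unpaired utterance
loss = jeit_loss(model_logp, ilm_logp, [2, 7, 1, 9], [3, 3, 0, 8, 5])
print(float(loss))
```

Because both target sequences carry the same ground-truth auxiliary tokens, the auxiliary task receives supervision from text-only data as well as from transcribed audio, which is the "text injection" idea in the title.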