20240029719. Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection simplified abstract (GOOGLE LLC)

Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

Organization Name

GOOGLE LLC

Inventor(s)

Shaan Jagdeep Patrick Bijwadia of San Francisco, CA (US)

Shuo-yiin Chang of Sunnyvale, CA (US)

Bo Li of Fremont, CA (US)

Yanzhang He of Palo Alto, CA (US)

Tara N. Sainath of Jersey City, NJ (US)

Chao Zhang of Mountain View, CA (US)

Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240029719, titled 'Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection.'

Simplified Explanation

The patent application describes a single end-to-end (E2E) multitask model that combines a speech recognition model with an endpointer model. The speech recognition model includes an audio encoder that encodes a sequence of audio frames into higher-order feature representations, and a decoder that generates probability distributions over possible speech recognition hypotheses based on those representations. The endpointer model switches between two modes: voice activity detection (VAD) mode and end-of-query (EOQ) detection mode. In VAD mode, it determines whether each input audio frame contains speech. In EOQ detection mode, it receives the latent representations output by the audio encoder and determines whether each latent representation includes final silence.

  • The patent application proposes a multitask model that integrates speech recognition and endpointer functionality.
  • The speech recognition model uses an audio encoder and a decoder to process audio frames and generate speech recognition hypotheses.
  • The endpointer model operates in VAD mode to detect speech in input audio frames, and in EOQ detection mode to identify final silence in the encoder's latent representations.
  • The model aims to provide a single, unified solution for speech recognition and endpointing tasks; a minimal code sketch follows below.
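
The sketch below shows one way the described architecture could be wired together in PyTorch. It is illustrative only: the layer types, dimensions, and names (UnifiedASREndpointer, frame_proj, the mode argument) are assumptions rather than details from the patent, and the patent's switch connection is modeled as a simple conditional that routes either the raw frames or the encoder latents into a shared endpointer head.

  import torch
  import torch.nn as nn

  class UnifiedASREndpointer(nn.Module):
      """Single multitask model: ASR encoder/decoder plus a switched endpointer."""

      def __init__(self, feat_dim=80, hidden_dim=256, vocab_size=1024):
          super().__init__()
          # Audio encoder: audio frames -> higher-order feature representations.
          self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
          # Decoder: feature representations -> distributions over hypotheses.
          self.decoder = nn.Linear(hidden_dim, vocab_size)
          # Projection so raw frames match the endpointer input size (VAD path).
          self.frame_proj = nn.Linear(feat_dim, hidden_dim)
          # Shared endpointer head; the switch chooses which signal feeds it.
          self.endpointer = nn.Linear(hidden_dim, 1)

      def forward(self, frames, mode="vad"):
          # frames: (batch, time, feat_dim)
          latents, _ = self.encoder(frames)                      # (B, T, H)
          hyp_logprobs = torch.log_softmax(self.decoder(latents), dim=-1)
          # Switch connection (assumed form): VAD mode reads the raw input
          # frames, EOQ detection mode reads the encoder's latents.
          ep_input = self.frame_proj(frames) if mode == "vad" else latents
          ep_probs = torch.sigmoid(self.endpointer(ep_input)).squeeze(-1)
          return hyp_logprobs, ep_probs                          # (B,T,V), (B,T)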

Potential Applications

  • Speech recognition systems for various applications such as transcription services, voice assistants, and automated customer service.
  • Endpointing systems for audio processing applications like call center analytics, voice activity detection, and audio segmentation.

Problems Solved

  • Improved accuracy and efficiency in speech recognition by combining speech recognition and endpointing models into a single multitask model.
  • Enhanced endpointing performance by utilizing higher-order feature representations and latent representations of audio frames.

Benefits

  • Simplified architecture by integrating speech recognition and endpointing models, reducing the need for separate systems.
  • Improved accuracy in speech recognition by leveraging higher-order feature representations.
  • Efficient endpointing by utilizing latent representations and reducing the need for additional processing steps.


Original Abstract Submitted

A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate between a VAD mode and an EOQ detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each of the latent representations, whether the latent representation includes final silence.
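
As a hypothetical usage of the sketch above, the two endpointer modes described in the abstract would be exercised as follows; the 0.5 decision threshold is an assumption, not a value from the patent.

  model = UnifiedASREndpointer()
  frames = torch.randn(1, 200, 80)  # 200 frames of assumed 80-dim features

  # VAD mode: per-frame probability that the input frame contains speech.
  _, speech_probs = model(frames, mode="vad")

  # EOQ detection mode: per-frame probability that the latent representation
  # includes final silence; thresholding picks the endpointing frame(s).
  hyp_logprobs, eoq_probs = model(frames, mode="eoq")
  endpoint_frames = (eoq_probs[0] > 0.5).nonzero().flatten()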