Google LLC (20240290320). Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition simplified abstract

From WikiPatents

Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition

Organization Name

Google LLC

Inventor(s)

Wenqian Huang of Mountain View CA (US)

Hao Zhang of Jericho NY (US)

Shankar Kumar of New York NY (US)

Shuo-yiin Chang of Sunnyvale CA (US)

Tara N. Sainath of Jersey City NJ (US)

Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240290320, titled 'Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition'.

The abstract describes a joint segmenting and ASR model comprising an encoder that processes acoustic frames into higher-order feature representations and a decoder that generates speech recognition hypotheses along with end-of-segment indications. The model is trained on long-form speech data annotated with ground-truth end-of-segment labels.

  • Encoder processes acoustic frames to generate higher order feature representations
  • Decoder generates speech recognition hypotheses and end of segment indications
  • Trained on long-form speech data with ground-truth end of segment labels
  • Utilizes a language model teacher to inject end of segment labels into transcriptions
  • Aimed at improving speech recognition accuracy and segmentation in long-form speech
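The encoder/decoder data flow summarized above can be sketched in a few lines of NumPy. This is a hypothetical illustration only: the weights are random, the vocabulary is invented, and the real patent describes a trained neural model, not this toy. It shows the two per-step outputs the abstract names: a probability distribution over speech recognition hypotheses and an end-of-segment (EOS) indication.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<blank>", "hello", "world", "<eos>"]  # hypothetical vocabulary

class JointSegmentingASR:
    """Toy sketch of the joint model's data flow: encoder -> higher-order
    features; decoder -> token distribution plus an EOS probability per
    output step. Random weights; illustrates shapes and outputs only."""

    def __init__(self, feat_dim=8, hidden_dim=16, vocab_size=len(VOCAB)):
        self.w_enc = rng.normal(size=(feat_dim, hidden_dim))
        self.w_tok = rng.normal(size=(hidden_dim, vocab_size))
        self.w_eos = rng.normal(size=(hidden_dim, 1))

    def encode(self, frames):
        # Higher-order feature representation for each acoustic frame.
        return np.tanh(frames @ self.w_enc)

    def decode(self, features):
        # Per output step: softmax over possible hypotheses...
        logits = features @ self.w_tok
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # ...plus a sigmoid end-of-segment indication.
        eos = 1.0 / (1.0 + np.exp(-(features @ self.w_eos)))
        return probs, eos.squeeze(-1)

model = JointSegmentingASR()
frames = rng.normal(size=(5, 8))  # 5 acoustic frames of dimension 8
probs, eos = model.decode(model.encode(frames))
print(probs.shape, eos.shape)  # (5, 4) (5,)
```

Each of the 5 output steps yields a normalized distribution over the 4-token vocabulary and a scalar EOS probability, matching the abstract's "at each of a plurality of output steps" formulation.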

Potential Applications:

  • Enhanced speech recognition systems
  • Improved segmentation of long-form speech data
  • Language transcription services

Problems Solved:

  • Accurately segmenting long-form speech data
  • Enhancing the performance of automatic speech recognition systems

Benefits:

  • Higher accuracy in speech recognition
  • Improved transcription quality for long-form speech
  • Enhanced user experience in speech-to-text applications

Commercial Applications: As an advanced speech recognition and segmentation technology, this invention can be utilized in transcription services, call center analytics, voice assistants, and any application requiring accurate speech recognition and segmentation.

Prior Art: Researchers can explore existing patents and publications related to joint segmenting and ASR models, encoder-decoder architectures in speech recognition, and language model distillation techniques.

Frequently Updated Research: Stay updated on advancements in speech recognition models, encoder-decoder architectures, and language model training techniques to enhance the performance of the joint segmenting and ASR model.

Questions about the Technology:

  1. How does the joint segmenting and ASR model improve the accuracy of speech recognition?
  2. What are the key differences between traditional speech recognition systems and this innovative model?


Original Abstract Submitted

A joint segmenting and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame. The model also includes a decoder to generate, based on the higher order feature representation at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses, and an indication of whether the corresponding output step corresponds to an end of segment (EOS). The model is trained on a set of training samples, each training sample including audio data characterizing multiple segments of long-form speech, and a corresponding transcription of the long-form speech, the corresponding transcription annotated with ground-truth EOS labels obtained via distillation from a language model teacher that receives the corresponding transcription as input and injects the ground-truth EOS labels into the corresponding transcription between semantically complete segments.
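The last step of the abstract, a teacher that injects EOS labels between semantically complete segments of a transcription, can be illustrated with a minimal stand-in. Here a simple punctuation rule plays the teacher's role; the patent describes a trained language model teacher, and the `inject_eos_labels` function and `<eos>` token name are assumptions for this sketch.

```python
import re

def inject_eos_labels(transcription: str, eos_token: str = "<eos>") -> str:
    """Stand-in for the language-model teacher: split the transcription
    into semantically complete segments (here approximated by sentence-final
    punctuation) and inject an EOS label between consecutive segments."""
    segments = re.split(r"(?<=[.!?])\s+", transcription.strip())
    return f" {eos_token} ".join(segments)

labeled = inject_eos_labels("how are you today? i am fine. thanks")
print(labeled)
# how are you today? <eos> i am fine. <eos> thanks
```

The labeled transcription would then serve as a ground-truth target for training the joint model, so the decoder learns to emit EOS indications at semantic boundaries rather than, say, at fixed durations or pauses.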