Google LLC (20240290320). Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition: simplified abstract
Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition
Organization Name
Google LLC
Inventor(s)
Wenqian Huang of Mountain View CA (US)
Shankar Kumar of New York NY (US)
Shuo-yiin Chang of Sunnyvale CA (US)
Tara N. Sainath of Jersey City NJ (US)
Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition - A simplified explanation of the abstract
This abstract first appeared for US patent application 20240290320, titled 'Semantic Segmentation With Language Models For Long-Form Automatic Speech Recognition'.
The abstract describes a joint segmenting and ASR model with an encoder that processes acoustic frames into higher-order feature representations and a decoder that generates speech recognition hypotheses along with end-of-segment indications; the model is trained on long-form speech data annotated with ground-truth end-of-segment labels.
- Encoder processes acoustic frames to generate higher-order feature representations
- Decoder generates speech recognition hypotheses and end-of-segment indications
- Trained on long-form speech data with ground-truth end-of-segment labels
- Utilizes a language model teacher to inject end-of-segment labels into transcriptions
- Aimed at improving speech recognition accuracy and segmentation in long-form speech
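To make the encoder/decoder division of labor concrete, here is a minimal, hypothetical sketch of a single decoder step that emits both a distribution over vocabulary tokens and an end-of-segment (EOS) probability for one encoder output step. All names, the toy "projection" weights, and the vocabulary are illustrative assumptions, not the patent's implementation.

```python
import math

# Toy vocabulary standing in for the model's output space (assumption).
VOCAB = ["hello", "world", "<blank>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decoder_step(feature):
    """Map one higher-order feature vector to (token distribution, P(EOS)).

    `feature` stands in for the encoder's higher-order representation of a
    single acoustic frame; here it is just a list of floats, and the
    per-token weights below are arbitrary placeholders.
    """
    token_logits = [sum(feature) * w for w in (1.0, -0.5, 0.1)]
    eos_logit = feature[-1]  # pretend one dimension tracks segment endings
    return softmax(token_logits), sigmoid(eos_logit)

dist, p_eos = decoder_step([0.2, -0.1, 1.5])
assert abs(sum(dist) - 1.0) < 1e-9  # valid probability distribution over VOCAB
assert 0.0 < p_eos < 1.0            # EOS indication expressed as a probability
```

The key structural point the sketch illustrates is that segmentation is not a separate post-processing pass: every decoding step jointly yields recognition hypotheses and an EOS signal.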
Potential Applications:
- Enhanced speech recognition systems
- Improved segmentation of long-form speech data
- Language transcription services
Problems Solved:
- Addressing challenges in accurately segmenting long-form speech data
- Enhancing the performance of automatic speech recognition systems
Benefits:
- Higher accuracy in speech recognition
- Improved transcription quality for long-form speech
- Enhanced user experience in speech-to-text applications
Commercial Applications: Advanced Speech Recognition and Segmentation Technology for Transcription Services. This technology can be utilized in transcription services, call center analytics, voice assistants, and any application requiring accurate speech recognition and segmentation.
Prior Art: Researchers can explore existing patents and publications related to joint segmenting and ASR models, encoder-decoder architectures in speech recognition, and language model distillation techniques.
Frequently Updated Research: Stay updated on advancements in speech recognition models, encoder-decoder architectures, and language model training techniques to enhance the performance of the joint segmenting and ASR model.
Questions about the Technology:
1. How does the joint segmenting and ASR model improve the accuracy of speech recognition?
2. What are the key differences between traditional speech recognition systems and this innovative model?
Original Abstract Submitted
A joint segmenting and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame. The model also includes a decoder to generate, based on the higher order feature representation at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses, and an indication of whether the corresponding output step corresponds to an end of segment (EOS). The model is trained on a set of training samples, each training sample including audio data characterizing multiple segments of long-form speech and a corresponding transcription of the long-form speech, the corresponding transcription annotated with ground-truth EOS labels obtained via distillation from a language model teacher that receives the corresponding transcription as input and injects the ground-truth EOS labels into the corresponding transcription between semantically complete segments.
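The distillation step described above can be sketched as follows: a language-model "teacher" scores each word boundary in an unpunctuated transcription and injects an EOS label wherever a segment looks semantically complete. The teacher below is a hand-written stub, not a real language model, and the threshold, function names, and trigger words are all illustrative assumptions.

```python
def toy_teacher_eos_prob(prefix_words):
    """Stub for P(end of segment | prefix) from an LM teacher.

    A real teacher would compare the likelihood of continuing vs. ending
    the segment under a trained language model; this stub simply fires
    after a few hand-picked segment-final words (an assumption).
    """
    return 0.9 if prefix_words and prefix_words[-1] in {"today", "tomorrow"} else 0.1

def inject_eos(transcription, teacher=toy_teacher_eos_prob, threshold=0.5):
    """Insert <eos> labels between semantically complete segments."""
    words, out = transcription.split(), []
    for i, word in enumerate(words, start=1):
        out.append(word)
        if teacher(words[:i]) >= threshold:
            out.append("<eos>")  # ground-truth EOS label for training
    return " ".join(out)

labeled = inject_eos("it might rain today let's stay inside")
# labeled → "it might rain today <eos> let's stay inside"
```

The labeled transcription then serves as supervision for the joint model, so the student learns to predict segment boundaries from audio that the teacher inferred from text alone.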