US Patent Application 18304064. Joint Segmenting and Automatic Speech Recognition simplified abstract

From WikiPatents

Joint Segmenting and Automatic Speech Recognition

Organization Name

Google LLC


Inventor(s)

Ronny Huang of Mountain View CA (US)


Shuo-yiin Chang of Sunnyvale CA (US)


David Rybach of Munich (DE)


Rohit Prakash Prabhavalkar of Palo Alto CA (US)


Tara N. Sainath of Jersey City NJ (US)


Cyril Allauzen of Mountain View CA (US)


Charles Caleb Peyser of New York NY (US)


Zhiyun Lu of Brooklyn NY (US)


Joint Segmenting and Automatic Speech Recognition - A simplified explanation of the abstract

  • This abstract appeared in US patent application number 18304064, titled 'Joint Segmenting and Automatic Speech Recognition'

Simplified Explanation

The abstract describes a model that combines speech segmentation and automatic speech recognition (ASR). The model consists of an encoder and a decoder. The encoder takes in a sequence of acoustic frames that represent spoken utterances and generates a higher-order feature representation for each frame. From this feature representation, the decoder generates, at each output step, a probability distribution over possible speech recognition hypotheses and an indication of whether that step corresponds to the end of a speech segment.
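The encoder/decoder interface described above can be sketched as follows. This is an illustrative toy model, not the patent's implementation: the single-matrix "encoder" and "decoder", the layer sizes, and the sigmoid end-of-segment head are all assumptions standing in for a real neural architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class JointSegmentingASR:
    """Toy sketch: encoder maps acoustic frames to higher-order features;
    decoder emits, per output step, a distribution over recognition
    hypotheses plus an end-of-speech-segment indication."""

    def __init__(self, frame_dim, feature_dim, vocab_size):
        # Hypothetical parameters; a real model would use a neural
        # encoder and an autoregressive decoder.
        self.enc_w = rng.standard_normal((frame_dim, feature_dim)) * 0.1
        self.dec_w = rng.standard_normal((feature_dim, vocab_size)) * 0.1
        self.eos_w = rng.standard_normal(feature_dim) * 0.1

    def encode(self, frames):
        # frames: (T, frame_dim) -> higher-order features (T, feature_dim)
        return np.tanh(frames @ self.enc_w)

    def decode(self, features):
        # Probability distribution over hypotheses at each output step.
        logits = features @ self.dec_w
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Per-step probability that this step ends a speech segment.
        eos_prob = 1.0 / (1.0 + np.exp(-(features @ self.eos_w)))
        return probs, eos_prob

model = JointSegmentingASR(frame_dim=80, feature_dim=16, vocab_size=32)
frames = rng.standard_normal((50, 80))  # 50 acoustic frames
features = model.encode(frames)
token_probs, eos_probs = model.decode(features)
```

At inference time, thresholding `eos_probs` would let the model emit segment boundaries jointly with the recognition hypotheses, which is the point of training the two tasks together.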

The joint segmenting and ASR model is trained on a set of training samples, each containing audio data of spoken utterances and their corresponding transcriptions. The transcriptions are modified by inserting an "end of speech segment" token based on a set of heuristic rules and exceptions applied to the training sample.
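The ground-truth token insertion might look like the sketch below. The abstract does not spell out the actual heuristic rules or exceptions, so the punctuation rule, the abbreviation exception list, and the `<eos>` token name are all hypothetical placeholders.

```python
EOS = "<eos>"  # hypothetical end-of-speech-segment ground-truth token

# Hypothetical exception list: tokens that end in "." but do not end a
# speech segment. The patent's real rules/exceptions are not given here.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}

def insert_eos_tokens(transcription):
    """Insert EOS after sentence-final punctuation, skipping abbreviations."""
    out = []
    for tok in transcription.split():
        out.append(tok)
        ends_sentence = tok.endswith((".", "?", "!"))
        if ends_sentence and tok.lower() not in ABBREVIATIONS:
            out.append(EOS)
    return " ".join(out)
```

For example, `insert_eos_tokens("dr. smith arrived.")` yields `"dr. smith arrived. <eos>"`: the exception list keeps the abbreviation from triggering a spurious segment boundary, which is the role the "exceptions" play in the training-data pipeline.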


Original Abstract Submitted

A joint segmenting and ASR model includes an encoder and decoder. The encoder configured to: receive a sequence of acoustic frames characterizing one or more utterances; and generate, at each output step, a higher order feature representation for a corresponding acoustic frame. The decoder configured to: receive the higher order feature representation and generate, at each output step: a probability distribution over possible speech recognition hypotheses, and an indication of whether the corresponding output step corresponds to an end of speech segment. The joint segmenting and ASR model trained on a set of training samples, each training sample including: audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.