Emitting Word Timings with End-to-End Models

Organization Name

Google LLC

Inventor(s)

Tara N. Sainath of Jersey City NJ (US)

Basilio Garcia Castillo of Mountain View CA (US)

David Rybach of Munich (DE)

Trevor Strohman of Mountain View CA (US)

Ruoming Pang of New York NY (US)

Emitting Word Timings with End-to-End Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240321263 titled 'Emitting Word Timings with End-to-End Models'.

The method described in the abstract processes training examples, each containing audio data for a spoken utterance and its ground truth transcription. For each word, the method inserts a placeholder symbol before the word, identifies the ground truth alignment for the beginning and end of the word, determines the word's beginning and ending word pieces, and generates a constrained alignment for each of those two word pieces. The constrained alignments are then applied to an attention head of a second pass decoder.

Training examples contain audio data and transcriptions
A placeholder symbol is inserted before each word
Ground truth alignments are identified for the beginning and end of each word
Beginning and ending word pieces are determined, and a constrained alignment is generated for each
The constrained alignments are applied to an attention head of the second pass decoder
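
The per-word steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation described in the application: the placeholder string `<wb>`, the 60 ms frame duration, the `Word` record, and the fixed-size `toy_word_pieces` splitter (a stand-in for a real word-piece model) are all assumptions introduced here.

```python
from dataclasses import dataclass

PLACEHOLDER = "<wb>"  # hypothetical placeholder symbol, not named in the abstract
FRAME_MS = 60         # assumed duration of one encoder frame, in milliseconds


@dataclass
class Word:
    text: str
    start_ms: int  # ground truth alignment for the beginning of the word
    end_ms: int    # ground truth alignment for the end of the word


def toy_word_pieces(word, size=3):
    """Stand-in for a real word-piece model: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def build_targets(words):
    """Return the token sequence and (token_index, role, frame) constraints."""
    tokens = []
    constraints = []
    for w in words:
        tokens.append(PLACEHOLDER)         # placeholder goes before the word
        first = len(tokens)                # index of the beginning word piece
        tokens.extend(toy_word_pieces(w.text))
        last = len(tokens) - 1             # index of the ending word piece
        constraints.append((first, "begin", w.start_ms // FRAME_MS))
        constraints.append((last, "end", w.end_ms // FRAME_MS))
    return tokens, constraints


tokens, constraints = build_targets([Word("hello", 120, 480), Word("hi", 600, 720)])
print(tokens)
print(constraints)
```

For a word that maps to a single word piece, the beginning and ending constraints land on the same token; how the method handles that case is not stated in the abstract.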

Potential Applications:
- Speech recognition technology
- Language translation systems
- Voice-controlled devices

Problems Solved:
- Improving accuracy in speech recognition
- Enhancing the performance of language processing systems

Benefits:
- Higher accuracy in transcribing spoken language
- Improved efficiency in language translation
- Enhanced user experience with voice-controlled devices

Commercial Applications:
Title: Enhanced Speech Recognition Technology for Improved Language Processing
This technology can be utilized in various industries such as:
- Customer service for automated call centers
- Language learning applications
- Voice-activated virtual assistants

Questions about the technology:
1. How does this method improve the accuracy of speech recognition systems?
2. What are the potential limitations of using constrained alignments in language processing systems?

Frequently Updated Research: Stay updated on advancements in speech recognition technology and language processing systems to leverage the latest innovations in the field.

Original Abstract Submitted

A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word, identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
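
The abstract does not say how the constrained alignments are applied to the attention head. One plausible reading, shown here purely as an assumption, is an auxiliary training loss that rewards the head for placing attention probability on the constrained frame of each beginning or ending word piece. The function name `attention_constraint_loss`, the `(token_index, role, frame)` tuple layout, and the cross-entropy form are hypothetical choices for this sketch.

```python
import math


def attention_constraint_loss(attn, constraints, eps=1e-9):
    """Average negative log of the attention mass on each constrained frame.

    attn: list of rows, one per output token; each row is a probability
          distribution over encoder frames for a single attention head.
    constraints: (token_index, role, frame) tuples, where role is "begin"
          or "end" (kept for readability; it does not affect the loss).
    """
    if not constraints:
        return 0.0
    total = 0.0
    for token_idx, _role, frame in constraints:
        total += -math.log(attn[token_idx][frame] + eps)
    return total / len(constraints)


# Two tokens, two frames: a head that attends exactly to the constrained
# frames incurs (almost) zero loss; a uniform head is penalized.
constraints = [(0, "begin", 0), (1, "end", 1)]
focused = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(attention_constraint_loss(focused, constraints))
print(attention_constraint_loss(uniform, constraints))
```

In a sketch like this, the term would be added to the decoder's usual training loss so that, at inference time, the constrained head's attention peaks could be read off as word start and end times.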