Emitting Word Timings with End-to-End Models

Organization Name

Google LLC

Inventor(s)

Tara N. Sainath of Jersey City NJ (US)

Basilio Garcia Castillo of Mountain View CA (US)

David Rybach of Munich (DE)

Trevor Strohman of Mountain View CA (US)

Ruoming Pang of New York NY (US)

Emitting Word Timings with End-to-End Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 18680797 titled 'Emitting Word Timings with End-to-End Models'.

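The method described in the abstract trains a speech recognition model using word-level alignment constraints. For each word in a training utterance, the beginning and ending word pieces are tied to the ground truth alignments for the beginning and end of that word, and the resulting constrained alignments are applied to an attention head of a second pass decoder. The steps are: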

Receiving a training example with audio data and ground truth transcription
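Inserting a placeholder symbol before each word in the utterance
Identifying the ground truth alignment for the beginning and end of each word
Determining the beginning word piece and ending word piece of each word
Generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece
Aligning the first constrained alignment with the ground truth beginning of the word and the second with the ground truth end
Constraining an attention head of a second pass decoder by applying the first and second constrained alignments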
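
The data preparation steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation from the patent application: the `<wb>` placeholder string, the `Word` record, the use of frame indices for ground truth alignments, and the choice to give a single-piece word both constraints are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder symbol; the abstract does not name it.
PLACEHOLDER = "<wb>"


@dataclass
class Word:
    pieces: List[str]  # word pieces, assumed to come from an existing tokenizer
    start_frame: int   # ground truth alignment for the beginning of the word
    end_frame: int     # ground truth alignment for the end of the word


def build_targets(words: List[Word]) -> Tuple[List[str], List[List[int]]]:
    """Return the target token sequence and, per token, its constrained frames.

    An empty list means the token is unconstrained.
    """
    tokens: List[str] = []
    constrained: List[List[int]] = []
    for word in words:
        # Insert a placeholder symbol before the word (left unconstrained here).
        tokens.append(PLACEHOLDER)
        constrained.append([])
        last = len(word.pieces) - 1
        for i, piece in enumerate(word.pieces):
            frames: List[int] = []
            if i == 0:
                # First constrained alignment: beginning word piece -> word start.
                frames.append(word.start_frame)
            if i == last:
                # Second constrained alignment: ending word piece -> word end.
                frames.append(word.end_frame)
            tokens.append(piece)
            constrained.append(frames)
    return tokens, constrained


example = [Word(["_hel", "lo"], 3, 10), Word(["_world"], 12, 20)]
tokens, constrained = build_targets(example)
for token, frames in zip(tokens, constrained):
    print(token, frames)
```

Interior word pieces of longer words stay unconstrained in this sketch, since the abstract only describes constraints for the beginning and ending word pieces.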

Potential Applications:
- Speech recognition technology
- Language translation systems
- Voice-controlled devices

Problems Solved:
- Improving accuracy in aligning spoken words with transcriptions
- Enhancing the performance of speech recognition systems

Benefits:
- Increased efficiency in processing spoken language
- Enhanced accuracy in transcribing audio data

Commercial Applications:
Title: Advanced Speech Recognition Technology for Improved Transcription Accuracy
This technology can be utilized in various industries, such as:
- Customer service for automated call transcription
- Legal and medical transcription services
- Language learning applications

Questions about Advanced Speech Recognition Technology:

1. How does this method improve the accuracy of speech recognition systems?

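  - During training, the beginning and ending word pieces of each word are constrained to the ground truth alignments for the beginning and end of that word, and these constraints are applied to an attention head of the second pass decoder.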

2. What are the potential applications of this technology beyond speech recognition?

  - This technology can also be applied in language translation systems and voice-controlled devices.

Original Abstract Submitted

A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
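
The abstract does not say how the constrained alignments are applied to the attention head. One possible form, shown here purely as an assumption, is an auxiliary training loss that penalizes a constrained token for placing attention mass away from its ground truth frame. The function name `constrained_attention_loss`, the list-of-lists attention layout (one row of frame weights per target token), and the negative-log form of the penalty are all hypothetical.

```python
import math
from typing import List


def constrained_attention_loss(
    attention: List[List[float]],
    constrained_frames: List[List[int]],
    eps: float = 1e-8,
) -> float:
    """Average -log(attention mass on the constrained frames) over constrained tokens.

    attention[t][f] is the attention weight of target token t on audio frame f;
    each row is assumed to sum to 1. constrained_frames[t] lists the frames that
    token t is tied to (an empty list means the token is unconstrained).
    """
    total = 0.0
    count = 0
    for weights, frames in zip(attention, constrained_frames):
        if not frames:
            continue  # unconstrained tokens contribute nothing
        mass = sum(weights[f] for f in frames)
        total += -math.log(mass + eps)  # eps guards against log(0)
        count += 1
    return total / count if count else 0.0


# Token 0 is unconstrained; token 1 is tied to frame 1 and attends there fully,
# so the loss is close to zero.
attention = [[0.25, 0.25, 0.25, 0.25], [0.0, 1.0, 0.0, 0.0]]
print(constrained_attention_loss(attention, [[], [1]]))
```

In a real training setup a term like this would be added to the decoder's usual objective with some weight, and a soft window around each ground truth frame could replace the single-frame target. Both choices are design decisions that the abstract leaves open.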