18680797. Emitting Word Timings with End-to-End Models simplified abstract (Google LLC)
Emitting Word Timings with End-to-End Models
Organization Name: Google LLC
Inventor(s)
Tara N. Sainath of Jersey City NJ (US)
Basilio Garcia Castillo of Mountain View CA (US)
Trevor Strohman of Mountain View CA (US)
Ruoming Pang of New York NY (US)
Emitting Word Timings with End-to-End Models - A simplified explanation of the abstract
This abstract first appeared for US patent application 18680797, titled 'Emitting Word Timings with End-to-End Models'.
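The method described in the abstract trains a speech recognition model to emit word timings. It ties the beginning and end of each spoken word in a training utterance to ground truth alignments, then uses the resulting constrained alignments to constrain an attention head of a second pass decoder. The steps are:
- Receiving a training example that includes audio data for a spoken utterance and its ground truth transcription
- Inserting a placeholder symbol before each word in the utterance
- Determining the beginning word piece and the ending word piece of each word
- Generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece
- Aligning the first constrained alignment with the ground truth alignment for the beginning of the word, and the second with the ground truth alignment for the end of the word
- Constraining an attention head of a second pass decoder by applying the first and second constrained alignments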
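The data preparation steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the placeholder string `<wb>`, the `Word` and `Target` types, and the toy `split_into_word_pieces` helper are hypothetical names introduced here, and ground truth alignments are assumed to be given as audio frame indices.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical placeholder symbol; the abstract does not name the actual token.
PLACEHOLDER = "<wb>"


@dataclass
class Word:
    text: str
    start_frame: int  # assumed ground truth alignment for the beginning of the word
    end_frame: int    # assumed ground truth alignment for the end of the word


@dataclass
class Target:
    token: str
    begin_frame: Optional[int] = None  # first constrained alignment, if any
    end_frame: Optional[int] = None    # second constrained alignment, if any


def split_into_word_pieces(word: str, max_len: int = 3) -> List[str]:
    """Toy stand-in for a real word-piece model: fixed-size character chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def build_targets(words: List[Word]) -> List[Target]:
    """Insert a placeholder before each word and constrain its edge word pieces."""
    targets: List[Target] = []
    for word in words:
        # Placeholder symbol inserted before the word; left unconstrained here.
        targets.append(Target(PLACEHOLDER))
        pieces = split_into_word_pieces(word.text)
        for i, piece in enumerate(pieces):
            target = Target(piece)
            if i == 0:
                # Beginning word piece: align with the start of the word.
                target.begin_frame = word.start_frame
            if i == len(pieces) - 1:
                # Ending word piece: align with the end of the word.
                target.end_frame = word.end_frame
            targets.append(target)
    return targets


if __name__ == "__main__":
    example = [Word("hello", 10, 25), Word("hi", 30, 36)]
    for t in build_targets(example):
        print(t)
```

In this sketch a word that maps to a single word piece carries both constraints on the same token, and interior word pieces stay unconstrained. How the actual method treats those cases is not stated in the abstract.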
Potential Applications:
- Speech recognition technology
- Language translation systems
- Voice-controlled devices
Problems Solved:
- Improving accuracy in aligning spoken words with transcriptions
- Enhancing the performance of speech recognition systems
Benefits:
- Increased efficiency in processing spoken language
- Enhanced accuracy in transcribing audio data
Commercial Applications:
Title: Advanced Speech Recognition Technology for Improved Transcription Accuracy
This technology can be utilized in various industries, such as:
- Customer service for automated call transcription
- Legal and medical transcription services
- Language learning applications
Questions about Advanced Speech Recognition Technology:
1. How does this method improve the accuracy of speech recognition systems?
- The method uses constrained alignments to align spoken words with ground truth transcriptions, enhancing accuracy.
2. What are the potential applications of this technology beyond speech recognition?
- This technology can also be applied in language translation systems and voice-controlled devices.
Original Abstract Submitted
A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
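As a rough sketch of the final step, one plausible way to constrain an attention head is to penalize the head when it places little probability on the frame that a constrained alignment points to. The abstract does not specify the loss, so the negative log probability used below, the function name `attention_constraint_loss`, and the frame-index representation of alignments are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def attention_constraint_loss(
    attention: Sequence[Sequence[float]],
    constrained_frames: Sequence[Optional[int]],
    eps: float = 1e-9,
) -> float:
    """Average negative log attention probability at each constrained frame.

    attention[i] is the attention distribution over audio frames produced by
    one head at decoder step i. constrained_frames[i] is the frame index that
    step should attend to, or None when the step is unconstrained.
    This loss form is an assumption, not taken from the abstract.
    """
    total = 0.0
    count = 0
    for weights, frame in zip(attention, constrained_frames):
        if frame is None:
            continue  # unconstrained decoder steps contribute nothing
        total += -math.log(weights[frame] + eps)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    attention = [
        [0.25, 0.25, 0.25, 0.25],  # placeholder step, unconstrained
        [0.0, 1.0, 0.0, 0.0],      # beginning word piece, attends to frame 1
        [0.1, 0.1, 0.1, 0.7],      # ending word piece, mostly on frame 3
    ]
    print(attention_constraint_loss(attention, [None, 1, 3]))
```

A term like this would typically be added to the decoder's usual training loss, so the constrained head learns to peak at word boundaries while the remaining heads stay free.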