17478064. SEPARATING ACOUSTIC AND LINGUISTIC INFORMATION IN NEURAL TRANSDUCER MODELS FOR END-TO-END SPEECH RECOGNITION simplified abstract (International Business Machines Corporation)

From WikiPatents


Organization Name

International Business Machines Corporation

Inventor(s)

Gakuto Kurata of Tokyo (JP)

SEPARATING ACOUSTIC AND LINGUISTIC INFORMATION IN NEURAL TRANSDUCER MODELS FOR END-TO-END SPEECH RECOGNITION - A simplified explanation of the abstract

This abstract first appeared for US patent application 17478064 titled 'SEPARATING ACOUSTIC AND LINGUISTIC INFORMATION IN NEURAL TRANSDUCER MODELS FOR END-TO-END SPEECH RECOGNITION'.

Simplified Explanation

The patent application describes a computer-implemented method for training a Recurrent Neural Network Transducer (RNN-T) for speech recognition. The method trains two RNN-T models on the same set of audio data: one that predicts label sequences in the forward direction and one that predicts them in the backward direction. The two models share a common encoder, and only the trained forward model is used for inference.

  • Two separate RNN-T models are trained: one predicting label sequences in the forward direction, the other in the backward direction.
  • Both models share a common encoder; each model has its own prediction network and joint network.
  • The common encoder processes the input audio data and extracts acoustic features.
  • The forward prediction network predicts label sequences left to right, while the backward prediction network predicts them right to left.
  • Each joint network combines the output of the common encoder with the output of its model's prediction network.
  • Only the trained forward model is used for inference, i.e., predicting label sequences for new audio data.
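The architecture described above can be sketched in NumPy. This is a minimal illustration only: the dimensions, weight names, and the simple feed-forward stand-ins for the recurrent encoder and prediction networks are assumptions for clarity, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the patent)
FEAT, ENC, PRED, VOCAB = 8, 16, 16, 5

def linear(x, w, b):
    return x @ w + b

def one_hot(labels, size):
    out = np.zeros((len(labels), size))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def joint(enc_out, pred_out, w_joint):
    # Combine every (time frame, label position) pair -> (T, U, VOCAB) logits
    T, U = enc_out.shape[0], pred_out.shape[0]
    combined = np.concatenate(
        [np.repeat(enc_out[:, None, :], U, axis=1),
         np.repeat(pred_out[None, :, :], T, axis=0)], axis=-1)
    return combined @ w_joint

# One shared ("common") encoder, used by both RNN-Ts
w_enc = rng.normal(size=(FEAT, ENC)); b_enc = np.zeros(ENC)

# Separate prediction networks for forward and backward label order
w_fwd = rng.normal(size=(VOCAB, PRED)); b_fwd = np.zeros(PRED)
w_bwd = rng.normal(size=(VOCAB, PRED)); b_bwd = np.zeros(PRED)

# Separate joint networks combining encoder and prediction outputs
w_j1 = rng.normal(size=(ENC + PRED, VOCAB))
w_j2 = rng.normal(size=(ENC + PRED, VOCAB))

audio = rng.normal(size=(4, FEAT))   # 4 acoustic frames
labels = np.array([1, 3, 2])         # toy label sequence

enc_out = np.tanh(linear(audio, w_enc, b_enc))  # shared encoder output

# First RNN-T: forward prediction network + first joint network
pred_fwd = np.tanh(linear(one_hot(labels, VOCAB), w_fwd, b_fwd))
logits_fwd = joint(enc_out, pred_fwd, w_j1)

# Second RNN-T: backward prediction network + second joint network,
# fed the reversed label sequence
pred_bwd = np.tanh(linear(one_hot(labels[::-1], VOCAB), w_bwd, b_bwd))
logits_bwd = joint(enc_out, pred_bwd, w_j2)

print(logits_fwd.shape, logits_bwd.shape)  # (4, 3, 5) (4, 3, 5)
```

In a real system the two (T, U, VOCAB) logit tensors would each be trained with an RNN-T loss; at inference time only the forward branch (shared encoder, forward prediction network, first joint network) would be kept.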

Potential applications of this technology:

  • Speech recognition systems: The trained RNN-T models can be used in speech recognition systems to convert spoken language into written text.
  • Voice assistants: The technology can be used in voice assistants to accurately transcribe user commands and queries.
  • Transcription services: The RNN-T models can be utilized in transcription services to automatically transcribe audio recordings into text.

Problems solved by this technology:

  • Improved accuracy: Training separate RNN-T models for forward and backward prediction helps improve the accuracy of speech recognition by considering both past and future context.
  • Handling long sequences: The RNN-T models can effectively handle long audio sequences by predicting label sequences in both forward and backward directions.
  • Adaptability: The trained models can adapt to different speakers and speech patterns, making them suitable for various applications.

Benefits of this technology:

  • Enhanced speech recognition accuracy: By utilizing both forward and backward prediction, the RNN-T models can capture a broader context and improve the accuracy of speech recognition.
  • Efficient training: The shared common encoder lets the two models learn acoustic representations jointly, reducing the computational resources required for training.
  • Real-time inference: The trained forward prediction model can be used for real-time speech recognition, enabling fast and accurate transcription of spoken language.


Original Abstract Submitted

A computer-implemented method is provided for training a Recurrent Neural Network Transducer (RNN-T). The method includes training, by inputting a set of audio data, a first RNN-T which includes a common encoder, a forward prediction network, and a first joint network combining outputs of both the common encoder and the forward prediction network. The forward prediction network predicts label sequences forward. The method further includes training, by inputting the set of audio data, a second RNN-T which includes the common encoder, a backward prediction network, and a second joint network combining outputs of both the common encoder and the backward prediction network. The backward prediction network predicts label sequences backward. The trained first RNN-T is used for inference.