US Patent Application 18078460. METHOD OF TRAINING SPEECH RECOGNITION MODEL, ELECTRONIC DEVICE AND STORAGE MEDIUM simplified abstract

METHOD OF TRAINING SPEECH RECOGNITION MODEL, ELECTRONIC DEVICE AND STORAGE MEDIUM

Organization Name

BEIJING XIAOMI MOBILE SOFTWARE CO., LTD.

Inventor(s)

Zengwei Yao of Beijing (CN)

Liyong Guo of Beijing (CN)

Daniel Povey of Beijing (CN)

Long Lin of Beijing (CN)

Fangjun Kuang of Beijing (CN)

Wei Kang of Beijing (CN)

Mingshuang Luo of Beijing (CN)

Quandong Wang of Beijing (CN)

Yuxiang Kong of Beijing (CN)

METHOD OF TRAINING SPEECH RECOGNITION MODEL, ELECTRONIC DEVICE AND STORAGE MEDIUM - A simplified explanation of the abstract

This abstract first appeared for US patent application 18078460 titled 'METHOD OF TRAINING SPEECH RECOGNITION MODEL, ELECTRONIC DEVICE AND STORAGE MEDIUM'.

Simplified Explanation

The patent application describes a method for training a speech recognition model. Here are the key points:

  • The method involves inputting speech data from multiple training samples into a teacher model and a to-be-trained speech recognition model separately.
  • The teacher model and the to-be-trained speech recognition model generate an embedding and encoded data, respectively.
  • The embedding is subjected to multi-codebook quantization to obtain quantized codebook data (a quantizer sketch follows this list).
  • A loss is calculated from the encoded data, the quantized codebook data, and the text data in the training sample (see the training-loop sketch below).
  • Training of the to-be-trained speech recognition model is stopped when the loss falls to or below a preset loss threshold and/or the number of completed training iterations exceeds a preset count.
  • The result is a trained speech recognition model that can accurately recognize speech.
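
The multi-codebook quantization step can be pictured as a product-quantization-style lookup. The sketch below is a minimal illustration under assumptions the abstract does not fix: the teacher embedding is split into num_codebooks sub-vectors, each matched by nearest-neighbour search against its own codebook of learnable centroids, and the per-codebook indices form the quantized codebook data. The class name, codebook sizes, and quantization form are hypothetical.

  # Minimal multi-codebook quantizer sketch (assumed product-quantization form).
  import torch

  class MultiCodebookQuantizer(torch.nn.Module):
      def __init__(self, embed_dim: int, num_codebooks: int = 8, codebook_size: int = 256):
          super().__init__()
          assert embed_dim % num_codebooks == 0
          self.num_codebooks = num_codebooks
          self.sub_dim = embed_dim // num_codebooks
          # One codebook of codebook_size centroids per sub-vector (illustrative sizes).
          self.codebooks = torch.nn.Parameter(
              torch.randn(num_codebooks, codebook_size, self.sub_dim))

      def forward(self, embedding: torch.Tensor) -> torch.Tensor:
          # embedding: (batch, frames, embed_dim) outputted by the teacher model.
          b, t, _ = embedding.shape
          sub = embedding.view(b, t, self.num_codebooks, self.sub_dim)
          # Squared distance from each sub-vector to every centroid of its codebook.
          dist = ((sub.unsqueeze(3) - self.codebooks) ** 2).sum(-1)
          # Quantized codebook data: one centroid index per codebook per frame.
          return dist.argmin(dim=-1)  # shape (batch, frames, num_codebooks)

The resulting index tensor is what the to-be-trained model would learn to predict, as sketched next.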

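A hedged sketch of how the pieces could fit together in a training loop follows. The teacher/student call signatures, the codebook-prediction head, and the use of a CTC loss for the text term are illustrative assumptions; the abstract only fixes the overall flow: a distillation target from the quantized teacher embedding, a loss that also uses the text data, and a stop condition on the loss value and/or the number of training iterations.

  # Hedged end-to-end training-loop sketch; helper names and signatures are hypothetical.
  import torch
  import torch.nn.functional as F

  def train(student, teacher, quantizer, pred_head, loader, optimizer,
            loss_threshold: float = 0.1, max_steps: int = 100_000,
            distill_weight: float = 0.3):
      teacher.eval()
      step = 0
      for speech, frame_lens, text, text_lens in loader:
          # 1. The same speech goes into the teacher and the to-be-trained model.
          with torch.no_grad():
              embedding = teacher(speech)        # teacher embedding
              targets = quantizer(embedding)     # quantized codebook data
          encoded = student(speech)              # encoded data, (batch, frames, dim)

          # 2. Distillation term: predict every codebook index from the encoded data.
          logits = pred_head(encoded)            # (batch, frames, num_codebooks, codebook_size)
          distill = F.cross_entropy(logits.flatten(0, 2), targets.flatten())

          # 3. Supervised term on the text data (CTC chosen here only as an example;
          #    frame_lens are assumed to be the encoder output lengths).
          log_probs = student.ctc_output(encoded).log_softmax(-1)
          asr = F.ctc_loss(log_probs.transpose(0, 1), text, frame_lens, text_lens)

          loss = asr + distill_weight * distill
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          step += 1

          # 4. Stop when the loss is small enough and/or enough steps have been run.
          if loss.item() <= loss_threshold or step >= max_steps:
              return student
      return student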

Original Abstract Submitted

A method of training a speech recognition model is provided. The method includes that: speech data of each of a plurality of training samples is inputted into a teacher model and a to-be-trained speech recognition model separately. Additionally, an embedding outputted by the teacher model and encoded data outputted by the to-be-trained speech recognition model are obtained. Furthermore, quantized codebook data is obtained by performing a multi-codebook quantization on the embedding. A loss is calculated based on the encoded data, the quantized codebook data, and text data in the training sample. Moreover, a trained speech recognition model is obtained by stopping training the to-be-trained speech recognition model when the loss is less than or equal to a preset loss threshold and/or trained times is greater than preset trained times.