20240029711. USING CORRECTIONS, OF PREDICTED TEXTUAL SEGMENTS OF SPOKEN UTTERANCES, FOR TRAINING OF ON-DEVICE SPEECH RECOGNITION MODEL simplified abstract (GOOGLE LLC)

Organization Name

GOOGLE LLC

Inventor(s)

Françoise Beaufays of Mountain View CA (US)

Johan Schalkwyk of Scarsdale NY (US)

Giovanni Motta of San Jose CA (US)

USING CORRECTIONS, OF PREDICTED TEXTUAL SEGMENTS OF SPOKEN UTTERANCES, FOR TRAINING OF ON-DEVICE SPEECH RECOGNITION MODEL - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240029711, titled 'USING CORRECTIONS, OF PREDICTED TEXTUAL SEGMENTS OF SPOKEN UTTERANCES, FOR TRAINING OF ON-DEVICE SPEECH RECOGNITION MODEL'.

Simplified Explanation

The abstract of this patent application describes how a client device can use a user's corrections of transcribed speech to train its on-device speech recognition model. Here is a simplified explanation of the abstract:

  • The client device receives audio data capturing a user's spoken utterance.
  • The audio data is processed using an on-device speech recognition model to generate a predicted textual segment that represents the spoken utterance.
  • The predicted textual segment is rendered visually and/or audibly to the user.
  • If the user corrects the predicted textual segment to an alternate textual segment, the client device receives that correction as further user interface input.
  • A gradient is generated by comparing at least part of the model's predicted output with ground truth output that corresponds to the alternate (corrected) textual segment.
  • The gradient is used on the client device to update the weights of the on-device speech recognition model, and/or it is transmitted to a remote system that uses it to update the global weights of a global speech recognition model.
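
The steps above can be illustrated with a minimal sketch. It is not the method claimed in the application: a toy softmax classifier over a three-word vocabulary stands in for the on-device speech recognition model, a two-number feature vector stands in for audio data, and cross-entropy is assumed as the comparison between predicted output and ground truth. The names `VOCAB`, `predict`, `gradient_from_correction`, and `apply_gradient` are all hypothetical.

```python
import math

# Hypothetical toy vocabulary; a real model predicts over far larger outputs.
VOCAB = ["their", "there", "they're"]

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, features):
    """Return (predicted word, probabilities) for one feature vector."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return VOCAB[best], probs

def gradient_from_correction(probs, features, corrected_word):
    """Cross-entropy gradient w.r.t. the weights, treating the user's
    correction as the ground truth output (assumed loss, not from the source)."""
    target = VOCAB.index(corrected_word)
    return [[(p - (1.0 if i == target else 0.0)) * x for x in features]
            for i, p in enumerate(probs)]

def apply_gradient(weights, grad, lr=0.5):
    """One plain gradient-descent step on the on-device weights."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad)]

# Stand-in for audio features and initial on-device weights.
weights = [[0.2, 0.1], [0.0, 0.0], [0.0, 0.0]]
features = [1.0, 2.0]

predicted, probs = predict(weights, features)               # model renders its guess
grad = gradient_from_correction(probs, features, "there")   # user corrects it
weights = apply_gradient(weights, grad)                     # local weight update
updated, new_probs = predict(weights, features)
```

In this toy setup the single correction is enough to change the prediction for the same input. In the flow the abstract describes, the same `grad` could instead (or additionally) be sent to a remote system rather than applied locally.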

Potential applications of this technology:

  • Improving speech recognition accuracy on client devices.
  • Enhancing user experience in voice-controlled applications.
  • Enabling real-time speech recognition and correction.

Problems solved by this technology:

  • Inaccurate speech recognition on client devices.
  • Limited ability to correct and improve speech recognition models on the device.
  • Dependence on remote systems for updating speech recognition models.

Benefits of this technology:

  • Improved accuracy and reliability of speech recognition on client devices.
  • Enhanced user satisfaction and productivity in voice-controlled applications.
  • Reduced reliance on remote systems for speech recognition model updates.


Original Abstract Submitted

Processor(s) of a client device can: receive audio data that captures a spoken utterance of a user of the client device; process, using an on-device speech recognition model, the audio data to generate a predicted textual segment that is a prediction of the spoken utterance; cause at least part of the predicted textual segment to be rendered (e.g., visually and/or audibly); receive further user interface input that is a correction of the predicted textual segment to an alternate textual segment; and generate a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment. The gradient is used, by processor(s) of the client device, to update weights of the on-device speech recognition model and/or is transmitted to a remote system for use in remote updating of global weights of a global speech recognition model.
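
The remote path in the last sentence can be sketched as follows. The abstract does not say how the remote system combines gradients, so this sketch assumes simple averaging across client devices followed by one gradient-descent step. The function names and the learning rate are hypothetical.

```python
def average_gradients(client_grads):
    """Element-wise mean of same-shaped gradient matrices from several clients.
    Averaging is an assumption; the abstract does not specify the aggregation."""
    n = len(client_grads)
    rows = len(client_grads[0])
    cols = len(client_grads[0][0])
    return [[sum(g[i][j] for g in client_grads) / n for j in range(cols)]
            for i in range(rows)]

def update_global_weights(global_weights, client_grads, lr=0.5):
    """Apply one gradient-descent step to the global model's weights."""
    avg = average_gradients(client_grads)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(global_weights, avg)]

# Gradients received from two hypothetical client devices.
grad_a = [[1.0, 0.0], [0.0, -1.0]]
grad_b = [[3.0, 0.0], [0.0, -3.0]]

global_weights = [[1.0, 1.0], [0.0, 0.0]]
new_global = update_global_weights(global_weights, [grad_a, grad_b])
```

Only gradients cross the network in this sketch; the audio data and the corrected text stay on each client device.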