18157303. SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD simplified abstract (MICROSOFT TECHNOLOGY LICENSING, LLC)

From WikiPatents

SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD

Organization Name

MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor(s)

Yashesh Gaur of Bellevue WA (US)

Jinyu Li of Redmond WA (US)

Liang Lu of Redmond WA (US)

Hirofumi Inaguma of Kyoto (JP)

Yifan Gong of Sammamish WA (US)

SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD - A simplified explanation of the abstract

This abstract first appeared for US patent application 18157303, titled 'SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD'.

Simplified Explanation

The abstract describes a computing system that converts audio input into a text transcription using a sequence-to-sequence speech recognition model. The model may assign external-model text tokens to frames of the audio input, so that each of these tokens has an external-model alignment within the audio. From the audio input the system generates hidden states, and from those hidden states it generates output text tokens, each with its own output alignment within the audio. For each output text token, the latency between its output alignment and the external-model alignment may be below a predetermined threshold. The system then outputs the text transcription.

  • The computing system receives audio input and converts it into text transcription.
  • It uses a sequence-to-sequence speech recognition model to generate a text transcription.
  • The model assigns external-model text tokens to frames in the audio input.
  • Each external-model text token has an alignment within the audio input.
  • The system generates hidden states based on the audio input, and output text tokens based on those hidden states.
  • Each output text token has a corresponding output alignment within the audio input.
  • For each output text token, the latency between the output alignment and the external-model alignment may be below a predetermined threshold.
  • The system outputs the text transcription.
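
The latency condition in the list above can be illustrated with a small per-token check. This is only a sketch under stated assumptions, not the method claimed in the application: it assumes alignments are expressed as audio frame indices, that latency is the signed delay of the output alignment relative to the external-model alignment, and that "below" means strictly less than. The names `TokenAlignment`, `token_latencies`, and `within_latency_threshold` are hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TokenAlignment:
    """A text token and the audio frame it is aligned to (hypothetical type)."""
    token: str
    frame: int  # assumption: alignment is a frame index in the audio input


def token_latencies(
    output: Sequence[TokenAlignment],
    external: Sequence[TokenAlignment],
) -> List[int]:
    """Return, per token, how many frames the output alignment lags the
    external-model alignment (negative if the output token comes earlier)."""
    if len(output) != len(external):
        raise ValueError("both alignments must cover the same number of tokens")
    return [o.frame - e.frame for o, e in zip(output, external)]


def within_latency_threshold(
    output: Sequence[TokenAlignment],
    external: Sequence[TokenAlignment],
    max_latency_frames: int,
) -> bool:
    """True if every token's latency is strictly below the threshold."""
    latencies = token_latencies(output, external)
    return all(latency < max_latency_frames for latency in latencies)


# Made-up alignments for a three-token transcription.
external = [TokenAlignment("the", 3), TokenAlignment("cat", 9), TokenAlignment("sat", 15)]
output = [TokenAlignment("the", 5), TokenAlignment("cat", 12), TokenAlignment("sat", 16)]

print(token_latencies(output, external))
print(within_latency_threshold(output, external, max_latency_frames=4))
```

In this made-up example the output tokens lag the external-model alignment by 2, 3, and 1 frames, so the check passes with a threshold of 4 frames and fails with a threshold of 3. The abstract does not say how the threshold is enforced, so the sketch only verifies the condition after the fact.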

Potential Applications

  • Speech-to-text transcription services
  • Voice assistants and virtual agents
  • Transcribing audio recordings for documentation purposes
  • Real-time captioning for live events or broadcasts

Problems Solved

  • Efficient and accurate conversion of audio input into text transcription
  • Keeping the latency between each token's output alignment and its external-model alignment below a predetermined threshold
  • Improving the performance of sequence-to-sequence speech recognition models

Benefits

  • Faster and more accurate transcription of audio input
  • Improved synchronization between the text transcription and the audio input
  • Enhanced usability of speech recognition systems
  • Increased accessibility for individuals with hearing impairments


Original Abstract Submitted

A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.