SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD

Organization Name

MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor(s)

Yashesh Gaur of Bellevue WA (US)

Jinyu Li of Redmond WA (US)

Liang Lu of Redmond WA (US)

Hirofumi Inaguma of Kyoto (JP)

Yifan Gong of Sammamish WA (US)

SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD - A simplified explanation of the abstract

This abstract first appeared for US patent application 18157303 titled 'SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION WITH LATENCY THRESHOLD'.

Simplified Explanation

The abstract describes a computing system that converts an audio input into a text transcription using a sequence-to-sequence speech recognition model. The model assigns external-model text tokens to frames of the audio input, so each external-model text token has an external-model alignment within the audio. From the audio input the system generates hidden states, and from those hidden states it generates output text tokens, each with its own output alignment within the audio. For each output text token, the latency between its output alignment and the external-model alignment is below a predetermined threshold. The system then outputs the text transcription.

The computing system receives an audio input and converts it into a text transcription.
It uses a sequence-to-sequence speech recognition model to generate the text transcription.
The model assigns external-model text tokens to frames in the audio input.
Each external-model text token has an external-model alignment within the audio input.
The system generates hidden states from the audio input, and output text tokens from the hidden states.
Each output text token has a corresponding output alignment within the audio input.
For each output text token, the latency between the output alignment and the external-model alignment is below a predetermined threshold.
The system outputs the text transcription.
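
The latency condition in the points above can be illustrated with a minimal sketch. It is not taken from the patent application. It assumes that each alignment is a single frame index per token, that latency is the output frame index minus the external-model frame index, and that the threshold is expressed in frames. The function names and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def token_latencies(output_alignment: Sequence[int],
                    external_alignment: Sequence[int]) -> List[int]:
    """Per-token latency in frames (assumed: output frame minus external-model frame)."""
    if len(output_alignment) != len(external_alignment):
        raise ValueError("both alignments must cover the same number of tokens")
    return [out - ext for out, ext in zip(output_alignment, external_alignment)]


def within_latency_threshold(output_alignment: Sequence[int],
                             external_alignment: Sequence[int],
                             threshold_frames: int) -> bool:
    """True if every token's latency is strictly below the threshold."""
    latencies = token_latencies(output_alignment, external_alignment)
    return all(latency < threshold_frames for latency in latencies)


# Hypothetical alignments for a four-token transcription (frame indices).
external = [3, 10, 18, 27]   # frame at which the external model places each token
output = [5, 13, 20, 35]     # frame at which the recognizer emits each token

print(token_latencies(output, external))                                 # [2, 3, 2, 8]
print(within_latency_threshold(output, external, threshold_frames=6))    # False
print(within_latency_threshold(output, external, threshold_frames=10))   # True
```

In this sketch the last token is emitted 8 frames after its external-model alignment, so the sequence fails a 6-frame threshold but passes a 10-frame one. The abstract does not state how the system enforces the threshold; the check here only shows what the condition means for a finished set of alignments.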

Potential Applications

Speech-to-text transcription services
Voice assistants and virtual agents
Transcribing audio recordings for documentation purposes
Real-time captioning for live events or broadcasts

Problems Solved

Efficient and accurate conversion of audio input into text transcription
Keeping the latency between each output alignment and the external-model alignment below a predetermined threshold
Improving the performance of sequence-to-sequence speech recognition models

Benefits

Faster and more accurate transcription of audio input
Improved synchronization between the text transcription and the audio input
Enhanced usability of speech recognition systems
Increased accessibility for individuals with hearing impairments

Original Abstract Submitted

A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
