MICROSOFT TECHNOLOGY LICENSING, LLC (20240257815). TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM simplified abstract

From WikiPatents

TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM

Organization Name

MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor(s)

Naoyuki Kanda of Bellevue WA (US)

Takuya Yoshioka of Bellevue WA (US)

Zhuo Chen of Bellevue WA (US)

Jinyu Li of Sammamish WA (US)

Yashesh Gaur of Redmond WA (US)

Zhong Meng of Mercer Island WA (US)

Xiaofei Wang of Bellevue WA (US)

Xiong Xiao of Bothell WA (US)

TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240257815, titled 'TRAINING AND USING A TRANSCRIPT GENERATION MODEL ON A MULTI-SPEAKER AUDIO STREAM'.

The abstract describes a method for generating a transcript from a multi-speaker audio stream using a transcript generation model. The model processes audio containing overlapping speech from multiple speakers and outputs words interleaved with channel change (CC) symbols, which mark the points where adjacent words come from different speakers talking at the same time. The words are then sorted into transcript lines based on those symbols.

  • Audio data with multiple speakers and overlapping speech is processed to generate frame embeddings.
  • A transcript generation model then creates words and channel change symbols from the frame embeddings.
  • A channel change symbol is placed between each pair of adjacent words that are spoken by different speakers at the same time.
  • The words and channel change symbols are organized into transcript lines, creating a multi-speaker transcript.
  • The inclusion of channel change symbols improves the accuracy and efficiency of multi-speaker transcription.
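
The sorting step in the list above can be illustrated with a small Python sketch. It is not taken from the patent application: it assumes two virtual channels, assumes each channel change symbol simply switches to the other channel, and uses the hypothetical token spelling "<cc>" and the hypothetical function name tokens_to_lines.

```python
from typing import Iterable, List

# Hypothetical spelling of the channel change symbol (an assumption,
# not taken from the patent application).
CC = "<cc>"


def tokens_to_lines(tokens: Iterable[str], num_channels: int = 2) -> List[str]:
    """Sort words into one transcript line per channel.

    Assumption: every CC symbol moves the output to the next channel,
    wrapping around after the last one.
    """
    channels: List[List[str]] = [[] for _ in range(num_channels)]
    current = 0  # the first word is assigned to channel 0
    for token in tokens:
        if token == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(token)
    # Drop channels that never received a word.
    return [" ".join(words) for words in channels if words]


if __name__ == "__main__":
    stream = ["hello", "how", CC, "I'm", CC, "are", "you", CC, "fine"]
    for line in tokens_to_lines(stream):
        print(line)
```

With the example stream, the sketch prints "hello how are you" followed by "I'm fine": the overlapping words of the second speaker end up on their own transcript line.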

Potential Applications:

  • Automated transcription services for meetings, conferences, and interviews.
  • Enhancing accessibility for individuals with hearing impairments.
  • Improving the efficiency of audio data analysis in research and business settings.

Problems Solved:

  • Efficiently transcribing audio data with multiple speakers and overlapping speech.
  • Accurately identifying speaker changes during overlapping speech.

Benefits:

  • Saves time and resources compared to manual transcription.
  • Increases the accuracy of transcriptions in complex audio environments.
  • Enhances accessibility for individuals who rely on transcriptions.

Commercial Applications:

Multi-Speaker Audio Transcription Technology for Enhanced Accessibility and Efficiency

This technology can be used in transcription services, audio analysis software, and accessibility tools for industries such as media, healthcare, and education.

Questions about Multi-Speaker Audio Transcription Technology:

1. How does this technology improve the accuracy of transcriptions in complex audio environments?

  - The inclusion of channel change symbols helps identify speaker changes during overlapping speech, leading to more accurate transcriptions.

2. What are the potential applications of this technology beyond transcription services?

  - This technology can also be used in audio analysis software, accessibility tools, and research applications.


Original Abstract Submitted

The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
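
To make the placement rule for channel change symbols concrete, here is a minimal Python sketch that builds a single word sequence from overlapping speech. It is an illustration only, not the procedure claimed in the application: the per-word timestamps and channel ids are hypothetical inputs, "<cc>" is an assumed spelling of the symbol, and serialize is a hypothetical function name.

```python
from typing import List, Optional, Tuple

# Hypothetical spelling of the channel change symbol (an assumption).
CC = "<cc>"

# Each entry is (word, time in seconds, channel id); all three fields are
# illustrative assumptions about what timed, per-speaker input might look like.
TimedWord = Tuple[str, float, int]


def serialize(words: List[TimedWord]) -> List[str]:
    """Order words by time and put a CC symbol between adjacent words
    that come from different channels."""
    ordered = sorted(words, key=lambda w: w[1])  # stable sort by time
    tokens: List[str] = []
    previous_channel: Optional[int] = None
    for text, _time, channel in ordered:
        if previous_channel is not None and channel != previous_channel:
            tokens.append(CC)
        tokens.append(text)
        previous_channel = channel
    return tokens


if __name__ == "__main__":
    overlapping = [
        ("hello", 0.0, 0), ("how", 0.4, 0), ("are", 0.8, 0), ("you", 1.0, 0),
        ("I'm", 0.6, 1), ("fine", 1.2, 1),
    ]
    print(" ".join(serialize(overlapping)))
```

For the example input, the sketch prints "hello how <cc> I'm <cc> are you <cc> fine", so every pair of adjacent words from different channels is separated by exactly one symbol.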