18157070. HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO simplified abstract (MICROSOFT TECHNOLOGY LICENSING, LLC)

From WikiPatents

HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

Organization Name

MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor(s)

Naoyuki Kanda of Bellevue WA (US)

Xuankai Chang of Baltimore MD (US)

Yashesh Gaur of Redmond WA (US)

Xiaofei Wang of Bellevue WA (US)

Zhong Meng of Mercer Island WA (US)

Takuya Yoshioka of Bellevue WA (US)

HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO - A simplified explanation of the abstract

This abstract first appeared for US patent application 18157070, titled 'HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO'.

Simplified Explanation

Abstract

A hypothesis stitcher for speech recognition of long-form audio is described as providing higher accuracy and reduced computational cost. The system splits the audio stream into shorter segments, identifies the speakers within each segment, performs automatic speech recognition (ASR) on each segment to generate short-segment hypotheses, merges at least a portion of these hypotheses into a first merged hypothesis set, inserts stitching symbols, including a window change (WC) symbol, into the merged set, and consolidates the set into a first consolidated hypothesis using a network-based hypothesis stitcher. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific or multi-speaker stitchers and may support multiple options for differing hypothesis configurations.
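The segmentation step can be pictured with a minimal Python sketch. The abstract only states that the stream is split into segments; the fixed window length, the overlap between windows, and the names `segment_stream`, `window_sec`, and `hop_sec` are assumptions made here for illustration, not details taken from the application.

```python
from typing import List, Tuple


def segment_stream(total_sec: float, window_sec: float,
                   hop_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of windows that cover the whole stream.

    Assumption: windows have a fixed length and may overlap when
    hop_sec < window_sec. The abstract does not specify either detail.
    """
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_sec:
        # Clip the last window so it does not run past the end of the audio.
        end = min(start + window_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break
        start += hop_sec
    return windows
```

For example, `segment_stream(10, 4, 2)` yields four windows starting at 0, 2, 4, and 6 seconds, each overlapping its neighbor by two seconds. Each window would then be passed to speaker identification and ASR separately.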

Bullet Points

  • Hypothesis stitcher for speech recognition of long-form audio
  • Segments audio stream into smaller segments
  • Identifies speakers within each segment
  • Performs ASR on each segment to generate short-segment hypotheses
  • Merges hypotheses into a first merged hypothesis set
  • Inserts stitching symbols, including a window change (WC) symbol, into the merged set
  • Consolidates the merged set into a first consolidated hypothesis using a network-based stitcher
  • Multiple variations, including alignment-based and serialized stitchers
  • Can operate as speaker-specific or multi-speaker stitchers
  • Supports multiple options for differing hypothesis configurations
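The merging and symbol-insertion steps listed above can be sketched as follows. This is an illustrative sketch of a speaker-specific arrangement, not the method claimed in the application: the `<WC>` token spelling, the `Hypothesis` tuple layout, and the function name `merge_with_wc` are all hypothetical. The network-based stitcher that would consume the merged sequence is not shown, since the abstract does not describe its architecture.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WC = "<WC>"  # hypothetical spelling of the window change (WC) symbol

# One short-segment hypothesis: (window index, speaker id, recognized words).
Hypothesis = Tuple[int, str, List[str]]


def merge_with_wc(hyps: List[Hypothesis]) -> Dict[str, List[str]]:
    """Join each speaker's hypotheses in window order, separated by WC."""
    by_speaker: Dict[str, List[Tuple[int, List[str]]]] = defaultdict(list)
    for window_idx, speaker, words in hyps:
        by_speaker[speaker].append((window_idx, words))

    merged: Dict[str, List[str]] = {}
    for speaker, items in by_speaker.items():
        items.sort(key=lambda item: item[0])  # order by window index
        tokens: List[str] = []
        for i, (_, words) in enumerate(items):
            if i > 0:
                tokens.append(WC)  # mark the boundary between windows
            tokens.extend(words)
        merged[speaker] = tokens
    return merged
```

Given the hypotheses `(0, "spk1", ["hello", "world"])` and `(1, "spk1", ["world", "again"])`, the merged sequence for `spk1` is `hello world <WC> world again`. The repeated word around the WC symbol is the kind of duplication a stitcher would be expected to resolve when it consolidates the set into a single hypothesis.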

Potential Applications

  • Transcription services for long-form audio recordings
  • Voice assistants for processing extended conversations
  • Call center analytics for analyzing customer interactions
  • Language learning platforms for transcribing and analyzing spoken language

Problems Solved

  • Improved accuracy in speech recognition of long-form audio
  • Reduced computational cost for processing large audio streams
  • Efficient handling of multiple speakers within the audio stream
  • Seamless stitching of short-segment hypotheses for a coherent transcription

Benefits

  • Higher accuracy in transcribing long-form audio
  • Faster and more efficient processing of large audio streams
  • Improved understanding of multi-speaker conversations
  • Enhanced user experience with voice assistants and transcription services


Original Abstract Submitted

A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.