18046041. ONLINE SPEAKER DIARIZATION USING LOCAL AND GLOBAL CLUSTERING simplified abstract (SAMSUNG ELECTRONICS CO., LTD.)

From WikiPatents

ONLINE SPEAKER DIARIZATION USING LOCAL AND GLOBAL CLUSTERING

Organization Name

SAMSUNG ELECTRONICS CO., LTD.

Inventor(s)

Myungjong Kim of Milpitas CA (US)

Taeyeon Ki of Milpitas CA (US)

Vijendra Raj Apsingekar of San Jose CA (US)

Sungjae Park of Seoul (KR)

SeungBeom Ryu of Suwon (KR)

Hyuk Oh of Seoul (KR)

ONLINE SPEAKER DIARIZATION USING LOCAL AND GLOBAL CLUSTERING - A simplified explanation of the abstract

This abstract first appeared for US patent application 18046041, titled 'ONLINE SPEAKER DIARIZATION USING LOCAL AND GLOBAL CLUSTERING'.

Simplified Explanation

The patent application describes a method for online speaker diarization, i.e., determining who is speaking when in an audio stream containing speech activity. The method generates an embedding vector for each segment of the audio stream and clusters those vectors at two scales: within local windows, and within larger global windows that each span two or more local windows. Different clusters correspond to different speakers.

  • Obtaining at least a portion of an audio stream containing speech activity
  • Generating an embedding vector that represents each segment of the audio stream
  • Clustering the embedding vectors within each local window to perform speaker identification
  • Clustering the embedding vectors again within each global window, where each global window includes two or more local windows
  • Presenting sequences of speaker identities based on the local-window and global-window speaker identification
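The per-window clustering step can be illustrated with a minimal greedy sketch. This is a hypothetical illustration, not the patented algorithm: the abstract does not specify a particular clustering method, and the cosine-similarity threshold and running-centroid update here are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_segments(embeddings, threshold=0.8):
    """Greedily assign each segment embedding to the most similar
    existing cluster (speaker), or start a new cluster if no cluster
    is similar enough. Returns one speaker label per segment."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = -1, -1.0
        for i, c in enumerate(centroids):
            s = cosine(emb, c)
            if s > best_sim:
                best, best_sim = i, s
        if best >= 0 and best_sim >= threshold:
            # Fold the segment into the matched cluster's running centroid.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
        else:
            # No sufficiently similar cluster: this segment starts a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

For example, four toy 2-D embeddings where the first, second, and fourth point in roughly the same direction would yield labels `[0, 0, 1, 0]`, i.e., two speakers.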

Potential applications of this technology:

  • Speaker identification in call centers or customer service applications
  • Voice recognition and authentication systems
  • Forensic analysis of audio recordings
  • Automatic transcription and captioning services

Problems solved by this technology:

  • Efficient and accurate speaker identification in audio streams with multiple speakers
  • Reducing the need for manual speaker identification and annotation
  • Handling variations in speech patterns and accents

Benefits of this technology:

  • Improved accuracy and reliability in speaker identification
  • Time-saving and cost-effective compared to manual identification methods
  • Scalable for large audio datasets
  • Can be integrated into existing speech processing systems


Original Abstract Submitted

A method includes obtaining at least a portion of an audio stream containing speech activity. At least the portion of the audio stream includes multiple segments. The method also includes, for each of the multiple segments, generating an embedding vector that represents the segment. The method further includes, within each of multiple local windows, clustering the embedding vectors into one or more clusters to perform speaker identification. Different clusters correspond to different speakers. The method also includes presenting at least one first sequence of speaker identities based on the speaker identification performed for the local windows. The method further includes, within each of multiple global windows, clustering the embedding vectors into one or more clusters to perform speaker identification. Each global window includes two or more local windows. In addition, the method includes presenting at least one second sequence of speaker identities based on the speaker identification performed for the global windows.
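The two-scale window structure described in the abstract, where each global window covers two or more local windows, can be sketched as a simple partition of segment indices. The window sizes here are hypothetical parameters; the abstract does not specify how windows are sized.

```python
def make_windows(num_segments, local_size, locals_per_global):
    """Partition segment indices 0..num_segments-1 into fixed-size local
    windows, then group consecutive local windows into global windows.
    Each global window therefore spans `locals_per_global` local windows
    (fewer at the tail end of the stream)."""
    local_windows = [list(range(i, min(i + local_size, num_segments)))
                     for i in range(0, num_segments, local_size)]
    global_windows = [sum(local_windows[i:i + locals_per_global], [])
                      for i in range(0, len(local_windows), locals_per_global)]
    return local_windows, global_windows
```

With 10 segments, local windows of 3 segments, and 2 local windows per global window, this yields local windows `[[0,1,2],[3,4,5],[6,7,8],[9]]` and global windows `[[0,1,2,3,4,5],[6,7,8,9]]`. Clustering within each local window gives a fast first sequence of speaker identities; re-clustering over each larger global window gives the second, refined sequence.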