Google LLC (20240135934). EVALUATION-BASED SPEAKER CHANGE DETECTION EVALUATION METRICS simplified abstract

EVALUATION-BASED SPEAKER CHANGE DETECTION EVALUATION METRICS

Organization Name

Google LLC

Inventor(s)

Guanlong Zhao of Long Island City NY (US)

Quan Wang of Hoboken NJ (US)

Han Lu of Redmond WA (US)

Yiling Huang of Edgewater NJ (US)

Jason Pelecanos of Mountain View CA (US)

EVALUATION-BASED SPEAKER CHANGE DETECTION EVALUATION METRICS - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240135934, titled 'EVALUATION-BASED SPEAKER CHANGE DETECTION EVALUATION METRICS'.

Simplified Explanation

The method described in the abstract evaluates a sequence transduction model that predicts speaker changes in audio data containing multiple speakers. Here is a simplified explanation of the abstract:

  • Obtain a training sample with audio data from multiple speakers and ground-truth speaker change intervals.
  • Process the audio data to predict speaker changes using a sequence transduction model.
  • Label predicted speaker change tokens as correct if they overlap with ground-truth intervals.
  • Determine a precision metric from the number of correct predictions and the total number of predicted tokens (see the sketch after this list).
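
To make the overlap-and-precision step concrete, here is a minimal Python sketch. It is an illustrative reconstruction, not the application's implementation: it assumes each predicted speaker change token reduces to a single timestamp in seconds and that each ground-truth interval is a (start, end) pair; the function names are ours.

  # Minimal sketch of overlap-based labeling and precision (illustrative only).
  def label_predictions(predicted_tokens, ground_truth_intervals):
      """Label each predicted speaker-change timestamp as correct if it
      falls inside any ground-truth speaker change interval."""
      labels = []
      for token_time in predicted_tokens:
          correct = any(start <= token_time <= end
                        for start, end in ground_truth_intervals)
          labels.append(correct)
      return labels

  def precision(predicted_tokens, ground_truth_intervals):
      """Precision = correct predictions / total predictions."""
      if not predicted_tokens:
          return 0.0
      labels = label_predictions(predicted_tokens, ground_truth_intervals)
      return sum(labels) / len(predicted_tokens)

  # Example: three predicted change points, two ground-truth intervals.
  predicted = [4.8, 12.1, 20.5]              # seconds
  ground_truth = [(4.5, 5.5), (19.8, 21.0)]  # (start, end) in seconds
  print(precision(predicted, ground_truth))  # 2 of 3 overlap -> ~0.67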

Potential Applications

This technology could be applied in various fields such as speech recognition, speaker diarization, and audio transcription.

Problems Solved

This technology helps accurately detect speaker changes in audio data with multiple speakers, improving the performance of speech processing systems.

Benefits

  • Enhanced accuracy in identifying speaker changes
  • Improved performance of speech recognition systems
  • Efficient processing of multi-utterance audio data

Potential Commercial Applications

"Speaker Change Detection Technology for Enhanced Speech Processing Systems"

Possible Prior Art

One possible form of prior art is traditional speaker diarization methods, which may be less accurate or efficient at handling multi-utterance audio data with multiple speakers.

Unanswered Questions

How does the model handle overlapping speech between speakers?

The abstract does not specify how the model deals with overlapping speech segments where multiple speakers are talking simultaneously.

What is the computational complexity of the sequence transduction model?

The abstract does not provide information on the computational resources required to train and deploy the model for predicting speaker changes in audio data.


Original Abstract Submitted

A method includes obtaining a multi-utterance training sample that includes audio data characterizing utterances spoken by two or more different speakers and obtaining ground-truth speaker change intervals indicating time intervals in the audio data where speaker changes among the two or more different speakers occur. The method also includes processing the audio data to generate a sequence of predicted speaker change tokens using a sequence transduction model. For each corresponding predicted speaker change token, the method includes labeling the corresponding predicted speaker change token as correct when the predicted speaker change token overlaps with one of the ground-truth speaker change intervals. The method also includes determining a precision metric of the sequence transduction model based on a number of the predicted speaker change tokens labeled as correct and a total number of the predicted speaker change tokens in the sequence of predicted speaker change tokens.
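
Read literally, the precision metric in the final sentence of the abstract reduces to a simple ratio (the notation below is ours, not the application's):

$\text{precision} = \frac{N_{\text{correct}}}{N_{\text{total}}}$

where $N_{\text{correct}}$ is the number of predicted speaker change tokens labeled as correct and $N_{\text{total}}$ is the total number of predicted speaker change tokens in the sequence.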