20240013772. Multi-Channel Voice Activity Detection simplified abstract (Google LLC)

From WikiPatents

Multi-Channel Voice Activity Detection

Organization Name

Google LLC

Inventor(s)

Nolan Andrew Miller of Seattle WA (US)

Ramin Mehran of Mountain View CA (US)

Multi-Channel Voice Activity Detection - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240013772 titled 'Multi-Channel Voice Activity Detection'.

Simplified Explanation

The patent application describes a method for multi-channel voice activity detection. The method receives a sequence of input frames of streaming multi-channel audio captured by an array of microphones. A location fingerprint model uses the audio features of each channel to determine the location of the audio source relative to the user device. An application-specific classifier then outputs a score indicating the likelihood that the audio corresponds to a particular audio type that the application is configured to process. Based on that score, the method determines whether to accept or reject the audio for processing.

  • The method involves analyzing streaming multi-channel audio captured by an array of microphones.
  • It uses a location fingerprint model to determine the location of the audio source relative to the user device.
  • An application-specific classifier generates a score indicating the likelihood that the audio corresponds to a specific audio type.
  • The method decides whether to accept or reject the audio for processing based on the generated score.
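The steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the method claimed in the application: the abstract does not say how the location fingerprint model or the classifier work, nor what the classifier takes as input. Here the fingerprint is assumed to be each channel's share of the frame energy, the classifier is assumed to consume that fingerprint and compare it with a hypothetical `target` fingerprint, and `threshold` is an arbitrary cutoff. All function names are made up for the example.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # one frame, indexed as [channel][sample]


def location_fingerprint(frame: Frame) -> List[float]:
    """Stand-in fingerprint (assumption): each channel's share of total energy."""
    energies = [sum(s * s for s in channel) for channel in frame]
    total = sum(energies)
    if total == 0.0:
        # Silent frame: no directional cue, so spread the weight evenly.
        return [1.0 / len(energies)] * len(energies)
    return [e / total for e in energies]


def classifier_score(fingerprint: Sequence[float], target: Sequence[float]) -> float:
    """Stand-in classifier (assumption): 1 minus the total variation distance.

    Stays within [0, 1] when `target` is non-negative and sums to 1.
    """
    l1 = sum(abs(f - t) for f, t in zip(fingerprint, target))
    return 1.0 - 0.5 * l1


def accept_audio(frames: Sequence[Frame], target: Sequence[float], threshold: float) -> bool:
    """Average the per-frame scores and accept when the mean reaches `threshold`."""
    if not frames:
        return False  # nothing to score, so nothing to accept
    scores = [classifier_score(location_fingerprint(f), target) for f in frames]
    return sum(scores) / len(scores) >= threshold


# Toy input: microphone 0 is much louder than microphone 1.
frames = [[[1.0, -1.0], [0.1, -0.1]]]
print(accept_audio(frames, target=[1.0, 0.0], threshold=0.8))
print(accept_audio(frames, target=[0.0, 1.0], threshold=0.8))
```

With this toy input the first call accepts the audio, because the energy pattern matches the expected source near microphone 0, and the second call rejects it. A real system would presumably replace both stand-ins with trained models.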

Potential Applications:

  • Voice recognition systems: The method can be used in voice recognition systems to accurately detect and process voice commands or speech in multi-channel audio.
  • Audio surveillance: It can be applied in audio surveillance systems to identify and analyze specific audio types or events in multi-channel audio recordings.
  • Teleconferencing: The method can enhance the audio quality and intelligibility in teleconferencing systems by selectively processing audio based on the generated score.

Problems Solved:

  • Accurate voice activity detection: The method solves the problem of accurately detecting voice activity in multi-channel audio by considering the location of the audio source and using an application-specific classifier.
  • Efficient audio processing: By determining whether to accept or reject the audio for processing based on the generated score, the method optimizes the use of computational resources by only processing relevant audio.

Benefits:

  • Improved accuracy: The method improves the accuracy of voice activity detection by incorporating location information and using an application-specific classifier.
  • Enhanced audio quality: By selectively processing audio based on the generated score, the method can enhance the audio quality and intelligibility in various applications.
  • Efficient resource utilization: The method optimizes the use of computational resources by only processing audio that is likely to be relevant, resulting in improved efficiency.


Original Abstract Submitted

A method for multi-channel voice activity detection includes receiving a sequence of input frames characterizing streaming multi-channel audio captured by an array of microphones. Each channel of the streaming multi-channel audio includes respective audio features captured by a separate dedicated microphone. The method also includes determining, using a location fingerprint model, a location fingerprint indicating a location of a source of the multi-channel audio relative to the user device based on the respective audio features of each channel of the multi-channel audio. The method also includes generating an output from an application-specific classifier. The first score indicates a likelihood that the multi-channel audio corresponds to a particular audio type that the particular application is configured to process. The method also includes determining whether to accept or reject the multi-channel audio for processing by the particular application based on the first score generated as output from the application-specific classifier.
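To make the "sequence of input frames" with one channel per microphone more concrete, the sketch below shows one possible way to arrange raw samples into such frames. It assumes the samples arrive interleaved (microphone 0, microphone 1, ..., then repeating), which is a common PCM layout but is not stated in the abstract. The function name `frame_multichannel` and its parameters are hypothetical.

```python
from typing import List, Sequence


def frame_multichannel(
    interleaved: Sequence[float], num_channels: int, frame_len: int
) -> List[List[List[float]]]:
    """Split interleaved samples into frames indexed as [channel][sample].

    Assumes an interleaved layout; a trailing partial frame is dropped.
    """
    if num_channels <= 0 or frame_len <= 0:
        raise ValueError("num_channels and frame_len must be positive")
    # De-interleave: channel c owns samples c, c + num_channels, ...
    channels = [list(interleaved[c::num_channels]) for c in range(num_channels)]
    # Only use as many samples as the shortest channel provides.
    usable = min(len(ch) for ch in channels)
    num_frames = usable // frame_len
    return [
        [ch[i * frame_len:(i + 1) * frame_len] for ch in channels]
        for i in range(num_frames)
    ]


# Two microphones, five samples each; frames of two samples per channel.
samples = [0, 10, 1, 11, 2, 12, 3, 13, 4, 14]
for frame in frame_multichannel(samples, num_channels=2, frame_len=2):
    print(frame)
```

Each resulting frame keeps the channels separate, so per-channel audio features can be computed before any step that compares the channels with one another.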