17808653. Conditioned Separation of Arbitrary Sounds based on Machine Learning Models simplified abstract (GOOGLE LLC)

From WikiPatents

Conditioned Separation of Arbitrary Sounds based on Machine Learning Models

Organization Name

GOOGLE LLC

Inventor(s)

Beat Gfeller of Duebendorf (CH)

Kevin Ian Kilgour of Zurich (CH)

Marco Tagliasacchi of Kilchberg (CH)

Aren Jansen of Mountain View CA (US)

Scott Thomas Wisdom of Boston MA (US)

Qingqing Huang of Palo Alto CA (US)

Conditioned Separation of Arbitrary Sounds based on Machine Learning Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 17808653, titled 'Conditioned Separation of Arbitrary Sounds based on Machine Learning Models'.

Simplified Explanation

The patent application describes methods for training a neural network to separate audio sources from a given audio waveform, using both audio clips and textual descriptions of the audio as training data. The methods generate a shared representation of audio and text in which the audio embedding of a given clip and the text embedding of its description lie close to each other. This shared representation is then used to train a neural network to separate the target audio source from the input audio waveform.

  • The patent application proposes a method for training a neural network to separate audio sources from an audio waveform.
  • The method uses both audio clips and textual descriptions of the audio as training data.
  • A shared representation is generated by embedding the audio and text, ensuring that the embeddings of a given audio clip and its textual description are close to each other.
  • The shared representation is then used to train a neural network to separate the target audio source from the input audio waveform.
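The shared-representation idea above can be sketched in a few lines. This is a minimal illustration, not the patent's actual model: the two "encoders" are hypothetical random linear projections standing in for trained audio and text networks, and the pair distance is what training would minimize so that a clip and its description fall within a threshold distance of each other.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8  # dimensionality of the shared (joint) embedding space

# Hypothetical stand-ins for real encoders: fixed random projections
# mapping toy audio features (16-dim) and text features (12-dim)
# into the same shared space.
W_audio = rng.normal(size=(16, EMBED_DIM))
W_text = rng.normal(size=(12, EMBED_DIM))

def embed_audio(audio_features):
    v = audio_features @ W_audio
    return v / np.linalg.norm(v)  # unit-normalize the embedding

def embed_text(text_features):
    v = text_features @ W_text
    return v / np.linalg.norm(v)

def pair_distance(audio_features, text_features):
    # Euclidean distance between the two embeddings of one training pair.
    # Training would push this distance below a chosen threshold for
    # matching (clip, description) pairs.
    return np.linalg.norm(embed_audio(audio_features) - embed_text(text_features))

audio = rng.normal(size=16)
text = rng.normal(size=12)
d = pair_distance(audio, text)
# Two unit vectors are at most distance 2.0 apart.
assert 0.0 <= d <= 2.0
```

In a real system the projections would be learned jointly (e.g. with a contrastive objective) so that matched pairs land close together and mismatched pairs stay apart.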

Potential Applications

  • Speech enhancement: The technology can be used to separate speech from background noise in audio recordings, improving speech intelligibility.
  • Music source separation: It can be applied to separate individual instruments or vocals from a music recording, allowing for remixing or isolating specific elements.
  • Audio transcription: The technology can assist in transcribing audio recordings by separating different speakers or audio sources, making it easier to transcribe each source separately.

Problems Solved

  • Difficulty in separating specific audio sources from a mixture of sounds in an audio waveform.
  • Lack of training data that combines both audio clips and textual descriptions, making it challenging to train neural networks for audio source separation.

Benefits

  • Improved accuracy: By incorporating textual descriptions, the neural network can better understand the target audio source, leading to more accurate separation.
  • Versatility: The technology can be applied to various audio source separation tasks, such as speech enhancement, music source separation, and audio transcription.
  • Enhanced user experience: Separating audio sources can improve the quality and intelligibility of audio recordings, benefiting users in different domains such as media production, transcription services, and communication systems.


Original Abstract Submitted

Example methods include receiving training data comprising a plurality of audio clips and a plurality of textual descriptions of audio. The methods include generating a shared representation comprising a joint embedding. An audio embedding of a given audio clip is within a threshold distance of a text embedding of a textual description of the given audio clip. The methods include generating, based on the joint embedding, a conditioning vector and training, based on the conditioning vector, a neural network to: receive (i) an input audio waveform, and (ii) an input comprising one or more of an input textual description of a target audio source in the input audio waveform, or an audio sample of the target audio source, separate audio corresponding to the target audio source from the input audio waveform, and output the separated audio corresponding to the target audio source in response to the receiving of the input.
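The abstract's conditioning step can be illustrated with a toy forward pass. This is a sketch under assumptions the abstract does not specify: the conditioning vector (derived from the joint embedding) is mapped to a per-frequency soft mask applied to a spectrogram-like mixture. Mask-based conditioning is a common choice in source separation, not necessarily the patent's architecture, and `W_mask` here is a hypothetical learned weight.

```python
import numpy as np

rng = np.random.default_rng(1)

EMBED_DIM = 8
N_FRAMES, N_BINS = 4, 16  # toy magnitude-spectrogram shape

# Hypothetical learned weights mapping the conditioning vector
# to a per-bin mask logit.
W_mask = rng.normal(size=(EMBED_DIM, N_BINS))

def separate(mixture, conditioning_vector):
    """Apply a conditioning-dependent soft mask to the input mixture.

    mixture: (N_FRAMES, N_BINS) non-negative array, the input audio
        waveform represented as a magnitude spectrogram.
    conditioning_vector: (EMBED_DIM,) vector derived from the joint
        embedding of the target source's text description or audio sample.
    """
    logits = conditioning_vector @ W_mask   # (N_BINS,)
    mask = 1.0 / (1.0 + np.exp(-logits))    # sigmoid -> values in (0, 1)
    return mixture * mask                   # broadcast mask over frames

mixture = np.abs(rng.normal(size=(N_FRAMES, N_BINS)))
cond = rng.normal(size=EMBED_DIM)
separated = separate(mixture, cond)

assert separated.shape == mixture.shape
assert np.all(separated <= mixture)  # a soft mask only attenuates
```

Because the mask depends on the conditioning vector, the same trained network can isolate different target sources from the same mixture simply by swapping the text description or audio sample used to condition it.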