US Patent Application 18226545: Audio-Visual Separation of On-Screen Sounds based on Machine Learning Models (simplified abstract)

Audio-Visual Separation of On-Screen Sounds based on Machine Learning Models

Organization Name

GOOGLE LLC

Inventor(s)

Efthymios Tzinis of Urbana IL (US)

Scott Wisdom of Boston MA (US)

Aren Jansen of Mountain View CA (US)

John R. Hershey of Winchester MA (US)

Audio-Visual Separation of On-Screen Sounds based on Machine Learning Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 18226545, titled 'Audio-Visual Separation of On-Screen Sounds based on Machine Learning Models'.

Simplified Explanation

This patent application describes apparatus and methods that use a neural network to separate the audio sources in a video and isolate those produced by on-screen objects.

  • The method receives an audio waveform associated with a plurality of video frames.
  • A neural network estimates one or more audio sources present in the waveform.
  • The network also generates an audio embedding for each estimated source.
  • Comparing each audio embedding with a video embedding determines whether the corresponding source matches an object visible in the video frames.
  • Based on this comparison, the network predicts a modified version of the audio waveform that keeps the audio sources corresponding to on-screen objects (see the sketch after this list).
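
The sketch below, in PyTorch, makes these steps concrete. The patent does not disclose any architecture, so everything here is an assumption made for illustration only: the Conv1d stand-in separator, the pooling-based audio embedder, the flattening video embedder, the cosine-similarity comparison, and the sigmoid remixing weights are not the claimed implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class OnScreenSeparator(nn.Module):
      """Hypothetical sketch: separate audio sources, embed them, compare
      with a video embedding, and remix only the on-screen sources."""

      def __init__(self, num_sources=4, embed_dim=128):
          super().__init__()
          self.num_sources = num_sources
          # Stand-in separator: maps the mixture waveform to per-source
          # waveform estimates (the patent does not specify this network).
          self.separator = nn.Conv1d(1, num_sources, kernel_size=1025, padding=512)
          # Stand-in audio embedder: one embedding vector per estimated source.
          self.audio_embedder = nn.Sequential(
              nn.AdaptiveAvgPool1d(256),
              nn.Flatten(1),
              nn.Linear(256, embed_dim),
          )
          # Stand-in video embedder: all frames -> a single video embedding.
          self.video_embedder = nn.Sequential(
              nn.Flatten(1),
              nn.LazyLinear(embed_dim),
          )

      def forward(self, waveform, frames):
          # waveform: (batch, samples); frames: (batch, T, C, H, W).
          sources = self.separator(waveform.unsqueeze(1))  # (B, S, samples)
          b, s, n = sources.shape
          # Embed each estimated source independently.
          audio_emb = self.audio_embedder(sources.reshape(b * s, 1, n))
          audio_emb = audio_emb.reshape(b, s, -1)          # (B, S, D)
          video_emb = self.video_embedder(frames)          # (B, D)
          # Compare each audio embedding with the video embedding; a high
          # similarity is read as "this source belongs to an on-screen object".
          sim = F.cosine_similarity(audio_emb, video_emb.unsqueeze(1), dim=-1)
          on_screen = torch.sigmoid(sim)                   # (B, S) soft weights
          # Predict the modified waveform: sources weighted by on-screen score.
          remixed = (on_screen.unsqueeze(-1) * sources).sum(dim=1)  # (B, samples)
          return remixed, sources, on_screen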


Original Abstract Submitted

Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
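
A hypothetical usage example of the sketch above shows how the abstract's final step would surface: the remixed output is the predicted version of the waveform containing only sources that correspond to on-screen objects. The batch size, sample rate, frame count, and frame resolution are invented for illustration.

  import torch

  # Hypothetical usage of the OnScreenSeparator sketch defined above.
  model = OnScreenSeparator(num_sources=4, embed_dim=128)
  waveform = torch.randn(2, 16000)       # 2 clips, 1 s of audio at an assumed 16 kHz
  frames = torch.randn(2, 8, 3, 64, 64)  # 8 RGB frames per clip at an assumed 64x64

  remixed, sources, on_screen = model(waveform, frames)
  print(remixed.shape)    # torch.Size([2, 16000]): waveform with only on-screen sources
  print(sources.shape)    # torch.Size([2, 4, 16000]): all estimated source waveforms
  print(on_screen.shape)  # torch.Size([2, 4]): per-source on-screen scores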