DETERMINING AUDIO AND VIDEO REPRESENTATIONS USING SELF-SUPERVISED LEARNING

Organization Name

Inventor(s)

This abstract first appeared for US patent application 20240257496 titled 'DETERMINING AUDIO AND VIDEO REPRESENTATIONS USING SELF-SUPERVISED LEARNING'.

Original Abstract Submitted

Embodiments are disclosed for training a system to generate audio and video representations using self-supervised learning. The method may include receiving a video signal including an audio component and a video component. A first machine learning model is trained to determine a representation of the audio component using a contrastive learning task and a temporal learning task. A second machine learning model is trained to determine a representation of the video component using the contrastive learning task and the temporal learning task. By training the machine learning models using both contrastive learning tasks and temporal learning tasks, the machine learning models learn short-term features, long-term features, and semantic features of the input data.
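
As a rough illustration only, the sketch below shows one way a contrastive objective and a temporal objective could be combined into a single training loss. The abstract does not specify either task, so everything here is an assumption: the contrastive task is taken to be an InfoNCE-style loss that pairs the audio and video embeddings of the same clip, the temporal task is taken to be a binary "are these two segments in the correct order?" prediction, and the names `info_nce`, `temporal_order_loss`, `combined_loss`, and `temporal_weight` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def info_nce(anchors, positives, temperature=0.1):
    """Assumed contrastive task: anchors[i] should match positives[i]
    and not positives[j] for j != i (InfoNCE-style cross-entropy)."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

def temporal_order_loss(scores, labels):
    """Assumed temporal task: binary cross-entropy on logits that say
    whether a pair of segments is in the correct temporal order."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)

def combined_loss(audio_emb, video_emb, order_scores, order_labels,
                  temporal_weight=1.0):
    # Hypothetical weighting; the source does not say how the tasks are mixed.
    contrastive = info_nce(audio_emb, video_emb)
    temporal = temporal_order_loss(order_scores, order_labels)
    return contrastive + temporal_weight * temporal
```

In this sketch, embeddings from matching audio and video clips yield a lower contrastive term than mismatched ones, and the temporal term falls as the order predictions become more confident and correct. In practice the embeddings and order logits would come from the two trained models rather than being supplied by hand.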