IMPROVEMENT OF AUDIO-VISUAL QUESTION ANSWERING

This abstract first appeared for US patent application 20250104701 titled 'IMPROVEMENT OF AUDIO-VISUAL QUESTION ANSWERING'.

Original Abstract Submitted

The present disclosure describes techniques for improving audio-visual question answering. A machine learning model is configured for audio-visual question answering (AVQA). The machine learning model comprises a first sub-model configured to capture semantic audio information and output an audio spatial feature map x. The machine learning model comprises a second sub-model configured to extract visual features x and audio features x and further configured to obtain a question vector x. The machine learning model comprises a third sub-model configured to capture audio-visual correspondence at a granular level. A balanced AVQA dataset is created. The balanced AVQA dataset comprises a balanced answer distribution in each question category. The machine learning model is trained to answer questions about visual objects, sounds, and their associations in videos using at least a subset of the balanced AVQA dataset.
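
The abstract does not say how the balanced answer distribution is obtained. As an illustration only, the sketch below assumes one simple approach: within each question category, every answer is downsampled to the count of the rarest answer. The function names and the "category" and "answer" keys are hypothetical and are not taken from the application.

```python
import random
from collections import Counter, defaultdict


def balance_answers(samples, seed=0):
    """Downsample so each answer occurs equally often within its category.

    `samples` is a list of dicts with the hypothetical keys "category"
    and "answer". Downsampling is an assumed strategy, not one stated
    in the abstract.
    """
    rng = random.Random(seed)

    # Group samples first by question category, then by answer.
    groups = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        groups[sample["category"]][sample["answer"]].append(sample)

    balanced = []
    for by_answer in groups.values():
        # The rarest answer in the category sets the per-answer quota.
        quota = min(len(items) for items in by_answer.values())
        for items in by_answer.values():
            balanced.extend(rng.sample(items, quota))
    return balanced


def answer_counts(samples):
    """Count samples per (category, answer) pair."""
    return Counter((s["category"], s["answer"]) for s in samples)


# Toy data: "yes" dominates one category, "three" dominates the other.
samples = (
    [{"category": "existential", "answer": "yes"} for _ in range(8)]
    + [{"category": "existential", "answer": "no"} for _ in range(2)]
    + [{"category": "counting", "answer": "two"} for _ in range(3)]
    + [{"category": "counting", "answer": "three"} for _ in range(5)]
)
balanced = balance_answers(samples)
counts = answer_counts(balanced)
```

With this toy input, each answer in the "existential" category is kept twice and each answer in the "counting" category three times, so a model cannot score well by always guessing the most frequent answer for a category. Downsampling discards data; collecting or generating more examples for rare answers would be an alternative way to reach the same balance.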