20240046628. HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT simplified abstract (TSINGHUA UNIVERSITY)


HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT

Organization Name

TSINGHUA UNIVERSITY

Inventor(s)

Wenwu Zhu of Beijing (CN)

Xin Wang of Beijing (CN)

Pinci Yang of Beijing (CN)

HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240046628, titled 'HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT'.

Simplified Explanation

The abstract describes a method for audio-visual question answering. The audio embedding of an input video clip is fused with a baseline model, and with the video embedding and the question embedding, at the early, middle, and late stages of a hierarchical feature fusing process. This yields three answer probability distributions, which are added using preset weights and then averaged (the hierarchical integration step) to produce a final answer.

  • The method fuses the audio embedding of an input video clip with a baseline model, and with the video embedding and the question embedding.
  • Fusion takes place at three stages of a hierarchical feature fusing process: an early stage, a middle stage, and a late stage.
  • The process produces three answer probability distributions: a first, a second, and a third answer probability distribution.
  • The three distributions are added using preset weights, and the weighted sum is then averaged. The abstract calls this step hierarchical integration.
  • The final answer is generated from the integrated result.
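
The integration step in the last two bullets can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the abstract does not give the weight values, say how the average is taken, or say how the final answer is selected from the integrated result. The sketch assumes one distribution per stage, division by the number of stages, and selection of the highest-scoring answer. The function names, weights, and candidate answers are all hypothetical.

```python
from typing import List, Sequence


def integrate_answers(
    distributions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Add the per-stage answer distributions using preset weights, then average.

    Assumption: "averaged" means dividing the weighted sum by the number
    of stages; the abstract does not spell this out.
    """
    if not distributions or len(distributions) != len(weights):
        raise ValueError("need one weight per distribution")
    num_answers = len(distributions[0])
    if any(len(dist) != num_answers for dist in distributions):
        raise ValueError("all distributions must cover the same answers")

    # Weighted sum over the stages, one score per candidate answer.
    combined = [0.0] * num_answers
    for dist, weight in zip(distributions, weights):
        for i, prob in enumerate(dist):
            combined[i] += weight * prob

    # Average over the number of stages.
    return [score / len(distributions) for score in combined]


def pick_answer(scores: Sequence[float], answers: Sequence[str]) -> str:
    """Return the candidate with the highest integrated score (assumed rule)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return answers[best]


# Hypothetical example: three candidate answers, one distribution per stage.
early = [0.6, 0.3, 0.1]
middle = [0.2, 0.5, 0.3]
late = [0.1, 0.7, 0.2]
preset_weights = [0.2, 0.3, 0.5]  # illustrative values only

scores = integrate_answers([early, middle, late], preset_weights)
print(pick_answer(scores, ["guitar", "piano", "violin"]))  # prints "piano"
```

Under these assumptions the averaged scores need not sum to 1. Dividing by the number of stages rescales every score by the same factor, so it does not change which answer ranks highest.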

Potential applications of this technology:

  • Audio-visual question answering systems
  • Video analysis and understanding
  • Natural language processing and understanding
  • Human-computer interaction

Problems solved by this technology:

  • Improving the accuracy and performance of audio-visual question answering systems
  • Enhancing the integration of audio and visual information in video analysis
  • Addressing the challenges of understanding and processing natural language queries in multimedia applications

Benefits of this technology:

  • Improved accuracy and reliability in answering audio-visual questions
  • Enhanced understanding and analysis of audio and visual information in videos
  • More efficient and effective human-computer interaction in multimedia applications


Original Abstract Submitted

A hierarchical audio-visual feature fusing method for audio-visual question answering and a product relate to the field of audio-visual question answering. By fusing audio embedding in an input video clip with a baseline model as well as video embedding and question embedding respectively at an early stage, a middle stage and a late stage in a hierarchical feature fusing process, a first answer probability distribution, a second answer probability distribution and a third answer probability distribution are obtained, and the answer probability distributions are added based on preset weights, and then averaged for hierarchical integration to generate a final answer.