20240046628. HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT simplified abstract (TSINGHUA UNIVERSITY)
HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT
Organization Name
Inventor(s)
HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT - A simplified explanation of the abstract
This abstract first appeared for US patent application 20240046628 titled 'HIERARCHICAL AUDIO-VISUAL FEATURE FUSING METHOD FOR AUDIO-VISUAL QUESTION ANSWERING AND PRODUCT'.
Simplified Explanation
The abstract describes a method for audio-visual question answering in which the audio embedding of an input video clip is fused with a baseline model, together with the video embedding and the question embedding, at the early, middle, and late stages of a hierarchical feature fusing process. Each stage yields an answer probability distribution; the three distributions are added using preset weights and then averaged, and this hierarchical integration produces the final answer.
- The audio embedding from an input video clip is fused with a baseline model, together with the video embedding and the question embedding, at different stages of a hierarchical feature fusing process.
- The hierarchical feature fusing process includes an early stage, a middle stage, and a late stage.
- Each stage yields one answer probability distribution, giving a first, a second, and a third answer probability distribution.
- The answer probability distributions are added based on preset weights and then averaged for hierarchical integration.
- The final answer is generated through the hierarchical integration of the answer probability distributions.
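The integration step in the last two points can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the abstract does not give the weight values, the candidate answers, or the exact averaging rule, so the function names, the example numbers, and the reading of "averaged" as division by the number of stages are all hypothetical.

```python
from typing import List, Sequence


def integrate_answers(
    distributions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Add the per-stage distributions using preset weights, then average.

    Assumption: "averaged" is read here as dividing the weighted sum by
    the number of stages. The abstract does not spell this out.
    """
    if not distributions:
        raise ValueError("at least one distribution is required")
    if len(distributions) != len(weights):
        raise ValueError("one weight is required per distribution")
    num_answers = len(distributions[0])
    if any(len(d) != num_answers for d in distributions):
        raise ValueError("all distributions must cover the same answers")

    fused = [0.0] * num_answers
    for dist, weight in zip(distributions, weights):
        for i, prob in enumerate(dist):
            fused[i] += weight * prob
    return [score / len(distributions) for score in fused]


def pick_answer(scores: Sequence[float], candidates: Sequence[str]) -> str:
    """Return the candidate answer with the highest integrated score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best]


# Hypothetical candidate answers and stage outputs (not from the patent).
candidates = ["piano", "guitar", "violin"]
early = [0.5, 0.25, 0.25]    # first answer probability distribution
middle = [0.25, 0.5, 0.25]   # second answer probability distribution
late = [0.5, 0.25, 0.25]     # third answer probability distribution
weights = [1.0, 1.0, 2.0]    # preset weights, chosen only for illustration

scores = integrate_answers([early, middle, late], weights)
final_answer = pick_answer(scores, candidates)
print(final_answer)
```

Because the division by the stage count is a constant, it does not change which answer scores highest; the preset weights alone decide how much each stage contributes to the final choice.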
Potential applications of this technology:
- Audio-visual question answering systems
- Video analysis and understanding
- Natural language processing and understanding
- Human-computer interaction
Problems solved by this technology:
- Improving the accuracy and performance of audio-visual question answering systems
- Enhancing the integration of audio and visual information in video analysis
- Addressing the challenges of understanding and processing natural language queries in multimedia applications
Benefits of this technology:
- Improved accuracy and reliability in answering audio-visual questions
- Enhanced understanding and analysis of audio and visual information in videos
- More efficient and effective human-computer interaction in multimedia applications
Original Abstract Submitted
A hierarchical audio-visual feature fusing method for audio-visual question answering and a product relate to the field of audio-visual question answering. By fusing audio embedding in an input video clip with a baseline model as well as video embedding and question embedding respectively at an early stage, a middle stage and a late stage in a hierarchical feature fusing process, a first answer probability distribution, a second answer probability distribution and a third answer probability distribution are obtained, and the answer probability distributions are added based on preset weights, and then averaged for hierarchical integration to generate a final answer.