17946400. Multi-Granularity Alignment for Visual Question Answering simplified abstract (Samsung Electronics Co., Ltd.)
Multi-Granularity Alignment for Visual Question Answering
Organization Name
Samsung Electronics Co., Ltd.
Inventor(s)
Peixi Xiong of Evanston IL (US)
Yilin Shen of Santa Clara CA (US)
Hongxia Jin of San Jose CA (US)
Multi-Granularity Alignment for Visual Question Answering - A simplified explanation of the abstract
This abstract first appeared for US patent application 17946400, titled 'Multi-Granularity Alignment for Visual Question Answering'.
Simplified Explanation
The patent application describes a method for answering natural-language questions about images. Here are the key points:
- The method involves accessing an image and a question about the image.
- Two sets of features are extracted from the image, one at a first level of granularity and one at a second level of granularity.
- Two sets of text features are extracted from the question at the same two levels of granularity.
- The method then generates two outputs, one per granularity level, each representing the alignment between the image features and the text features at that level.
- Finally, the answer to the question is determined based on both outputs.
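The steps above can be illustrated with a minimal Python sketch. It is not the implementation claimed in the application. The abstract does not say how features are extracted, how alignment is computed, or how the two outputs are combined, so everything concrete here is an assumption made for illustration: the "coarse" and "fine" labels for the two granularity levels, the hand-written 2-D feature vectors that stand in for real extractors, the softmax-over-cosine-similarity alignment, the averaging of the two outputs, and the candidate-answer vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(image_feats, text_feats):
    """Hypothetical alignment step for one granularity level.

    For each text feature, image features are weighted by a softmax over
    cosine similarity; the weighted image vectors are averaged over all
    text features to give one output vector.
    """
    dim = len(image_feats[0])
    output = [0.0] * dim
    for t in text_feats:
        exps = [math.exp(cosine(t, i)) for i in image_feats]
        total = sum(exps)
        for weight, i in zip(exps, image_feats):
            for k in range(dim):
                output[k] += (weight / total) * i[k] / len(text_feats)
    return output


def determine_answer(first_output, second_output, candidates):
    """Assumed answer step: average both outputs, pick the best-scoring candidate."""
    combined = [(a + b) / 2 for a, b in zip(first_output, second_output)]
    return max(
        candidates,
        key=lambda name: sum(c * x for c, x in zip(candidates[name], combined)),
    )


# Toy stand-ins for extracted features (assumed values, not real model output).
coarse_image = [[1.0, 0.0]]              # e.g. one whole-image vector
coarse_text = [[1.0, 0.0]]               # e.g. one whole-question vector
fine_image = [[1.0, 0.0], [0.0, 1.0]]    # e.g. two region vectors
fine_text = [[1.0, 0.0]]                 # e.g. one word vector

first_output = align(coarse_image, coarse_text)
second_output = align(fine_image, fine_text)

# Hypothetical candidate answers embedded in the same 2-D space.
candidates = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
predicted = determine_answer(first_output, second_output, candidates)
print(predicted)
```

With these toy vectors the sketch selects "red", because both alignment outputs point mostly along the first axis. A real system would replace the hard-coded vectors with learned image and text encoders and would likely learn the alignment and answer steps rather than fix them by hand.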
Potential applications of this technology:
- Image-based question answering systems
- Visual search engines
- Automated image analysis tools
Problems solved by this technology:
- Difficulty in understanding and answering questions about images using natural language
- Lack of efficient methods for aligning image features with text features
Benefits of this technology:
- Potentially more accurate answers to questions about images, since image and text features are aligned at two levels of granularity rather than one
- Enhanced capabilities for image analysis and understanding
Original Abstract Submitted
In one embodiment, a method includes accessing an image and a natural-language question regarding the image and extracting, from the image, a first set of image features at a first level of granularity and a second set of image features at a second level of granularity. The method further includes extracting, from the question, a first set of text features at the first level of granularity and a second set of text features at the second level of granularity; generating a first output representing an alignment between the first set of image features and the first set of text features; generating a second output representing an alignment between the second set of image features and the second set of text features; and determining an answer to the question based on the first output and the second output.