17946400. Multi-Granularity Alignment for Visual Question Answering simplified abstract (Samsung Electronics Co., Ltd.)

From WikiPatents

Multi-Granularity Alignment for Visual Question Answering

Organization Name

Samsung Electronics Co., Ltd.

Inventor(s)

Peixi Xiong of Evanston IL (US)

Yilin Shen of Santa Clara CA (US)

Hongxia Jin of San Jose CA (US)

Multi-Granularity Alignment for Visual Question Answering - A simplified explanation of the abstract

This abstract first appeared for US patent application 17946400, titled 'Multi-Granularity Alignment for Visual Question Answering'.

Simplified Explanation

The patent application describes a method for answering natural-language questions about images. Here are the key points:

  • The method involves accessing an image and a question about the image.
  • The image is analyzed to extract two sets of features at different levels of detail.
  • The question is also analyzed to extract two sets of features at the same levels of detail as the image features.
  • The method then generates two outputs, one per level of detail, each representing the alignment between the image features and the text features at that level.
  • Finally, the answer to the question is determined based on the outputs.
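
The steps above can be illustrated with a minimal Python sketch. The abstract does not say how features are extracted, how alignment is computed, or how the two outputs are combined, so everything here is an assumption made for illustration: the toy two-dimensional feature vectors, the attention-style `align` function, the summation used to fuse the two outputs, and the candidate-answer lookup in `determine_answer` are all hypothetical and are not taken from the patent application.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align(image_feats, text_feats):
    """Hypothetical alignment step (an assumption, not the patented method).

    Each text feature attends over the image features; the attended
    image vectors are averaged into a single output vector.
    """
    dim = len(image_feats[0])
    pooled = [0.0] * dim
    for t in text_feats:
        weights = softmax([dot(t, v) for v in image_feats])
        for w, v in zip(weights, image_feats):
            for i in range(dim):
                pooled[i] += w * v[i] / len(text_feats)
    return pooled

def determine_answer(first_output, second_output, candidates):
    """Hypothetical answer step: sum both outputs, pick the best-matching candidate."""
    fused = [a + b for a, b in zip(first_output, second_output)]
    return max(candidates, key=lambda name: dot(fused, candidates[name]))

# Toy features standing in for real extractors (all values are made up).
# First level of granularity: e.g. two image regions and one question-level feature.
image_feats_1 = [[1.0, 0.0], [0.0, 1.0]]
text_feats_1 = [[2.0, 0.0]]
# Second level of granularity: e.g. finer image details and two word-level features.
image_feats_2 = [[0.9, 0.1], [0.2, 0.8]]
text_feats_2 = [[1.5, 0.0], [1.0, 0.0]]

# Hypothetical candidate answers embedded in the same toy space.
candidates = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}

first_output = align(image_feats_1, text_feats_1)    # alignment at level 1
second_output = align(image_feats_2, text_feats_2)   # alignment at level 2
answer = determine_answer(first_output, second_output, candidates)
print(answer)
```

In this toy setup, both alignment outputs lean toward the first image feature, so the fused vector matches the "red" candidate best. A real system would replace the hand-written vectors with learned image and text encoders and would likely use a trained classifier instead of the lookup shown here.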

Potential applications of this technology:

  • Image-based question answering systems
  • Visual search engines
  • Automated image analysis tools

Problems solved by this technology:

  • Difficulty in understanding and answering questions about images using natural language
  • Lack of efficient methods for aligning image features with text features

Benefits of this technology:

  • Potentially more accurate answers to questions about images, since alignment is computed at two levels of granularity
  • Enhanced capabilities for image analysis and understanding


Original Abstract Submitted

In one embodiment, a method includes accessing an image and a natural-language question regarding the image and extracting, from the image, a first set of image features at a first level of granularity and a second set of image features at a second level of granularity. The method further includes extracting, from the question, a first set of text features at the first level of granularity and a second set of text features at the second level of granularity; generating a first output representing an alignment between the first set of image features and the first set of text features; generating a second output representing an alignment between the second set of image features and the second set of text features; and determining an answer to the question based on the first output and the second output.