US Patent Application 17804724. REGIONAL-TO-LOCAL ATTENTION FOR VISION TRANSFORMERS simplified abstract

From WikiPatents

REGIONAL-TO-LOCAL ATTENTION FOR VISION TRANSFORMERS

Organization Name

International Business Machines Corporation

Inventor(s)

Richard Chen of Baldwin Place NY (US)

Rameswar Panda of Medford MA (US)

Quanfu Fan of Lexington MA (US)

REGIONAL-TO-LOCAL ATTENTION FOR VISION TRANSFORMERS - A simplified explanation of the abstract

This abstract first appeared for US patent application 17804724, titled 'REGIONAL-TO-LOCAL ATTENTION FOR VISION TRANSFORMERS'.

Simplified Explanation

- The patent application describes techniques and apparatus for analyzing visual content using a visual transformer.
- The visual content item is divided into regions, and a first set of tokens is generated, with each token representing a regional feature from a different region.
- A second set of tokens is generated, with each token representing a local feature from one of the regions.
- Using a hierarchical vision transformer, at least one feature map is generated by analyzing the first set of tokens and the second set of tokens separately.
- Based on the feature map, at least one vision task is performed.
- The innovation allows for efficient analysis of visual content by utilizing regional and local features and a hierarchical vision transformer.
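
The token-generation steps above can be sketched roughly as follows. This is a minimal illustration and not the method claimed in the application: mean pooling stands in for whatever learned projection a real system would use, the image is a plain 2D list of numbers, and the names `mean_pool`, `make_tokens`, `region_size`, and `patch_size` are hypothetical.

```python
from statistics import mean


def mean_pool(image, top, left, size):
    """Average the size x size block whose top-left corner is (top, left)."""
    return mean(
        image[r][c]
        for r in range(top, top + size)
        for c in range(left, left + size)
    )


def make_tokens(image, region_size, patch_size):
    """Return (regional_tokens, local_tokens) for a 2D list of numbers.

    regional_tokens[i] is one value summarizing region i (assumption: mean pooling).
    local_tokens[i] is the list of patch-level values that fall inside region i.
    """
    height, width = len(image), len(image[0])
    if height % region_size or width % region_size or region_size % patch_size:
        raise ValueError("image, region, and patch sizes must divide evenly")

    regional_tokens, local_tokens = [], []
    for top in range(0, height, region_size):
        for left in range(0, width, region_size):
            # First set: one coarse token per region.
            regional_tokens.append(mean_pool(image, top, left, region_size))
            # Second set: several fine tokens, each from a patch inside this region.
            local_tokens.append([
                mean_pool(image, t, l, patch_size)
                for t in range(top, top + region_size, patch_size)
                for l in range(left, left + region_size, patch_size)
            ])
    return regional_tokens, local_tokens


# Toy 4x4 "image" with pixel values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
regional, local = make_tokens(image, region_size=2, patch_size=1)
```

In this toy setup each 2x2 region yields one regional token and four local tokens, and the two sets stay paired by region index, which is what lets a later stage relate coarse and fine features of the same area.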


Original Abstract Submitted

Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the visual content item. Each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item. At least one feature map is generated for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer. At least one vision task is performed based on the at least one feature map.
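
The abstract does not spell out how the two token sets are analyzed. One plausible reading, suggested by the title, is that regional tokens first exchange information among themselves, and each region's tokens are then processed locally. The sketch below illustrates that reading only; it is an assumption, not the claimed method. It uses single-head attention with identity projections (no learned weights), covers a single stage rather than a full hierarchy, and the names `self_attention` and `regional_to_local` are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V = tokens)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product score between this query and every token.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs


def regional_to_local(regional_tokens, local_tokens):
    """One assumed regional-to-local step.

    regional_tokens: list of vectors, one per region.
    local_tokens: list (per region) of lists of vectors.
    """
    # Step 1: regional tokens attend to each other across the whole image.
    regional = self_attention(regional_tokens)

    # Step 2: within each region, the updated regional token and that region's
    # local tokens attend to each other; other regions are not visible here.
    new_regional, new_local = [], []
    for region_token, region_locals in zip(regional, local_tokens):
        mixed = self_attention([region_token] + region_locals)
        new_regional.append(mixed[0])
        new_local.append(mixed[1:])
    return new_regional, new_local


# Two regions, two local tokens each, 2-dimensional features (toy values).
regional_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_tokens = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
new_regional, new_local = regional_to_local(regional_tokens, local_tokens)
```

Under this reading, local tokens never attend directly to tokens in other regions; any cross-region context reaches them only through their regional token. A hierarchical model would presumably repeat such a step at several resolutions, which this sketch omits.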