17573630. INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION simplified abstract (Samsung Electronics Co., Ltd.)

INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION

Organization Name

Samsung Electronics Co., Ltd.

Inventor(s)

Ling Li of Sunnyvale, CA (US)

Ali Shafiee Ardestani of Santa Clara, CA (US)

Joseph H. Hassoun of San Jose, CA (US)

INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION - A simplified explanation of the abstract

This abstract first appeared for US patent application 17573630 titled 'INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION'.

Simplified Explanation

Abstract

This patent application describes a vision transformer made up of multiple layers, each with multiple attention heads. Some attention heads apply an attention mask before the Softmax operation, while others do not. Each mask is combined element-wise with the Query-Key products that form the attention map. The masks can be hard masks or soft masks, and a learnable bias α can be added to the diagonal elements of the attention map.
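
A minimal sketch of such a masked attention head, assuming PyTorch; the function name, tensor shapes, and scaled dot-product form are illustrative assumptions rather than details from the patent:

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """One attention head; the mask is applied before the Softmax.

    q, k, v: (num_patches, head_dim) tensors.
    mask:    (num_patches, num_patches) tensor, or None for an
             unmasked head.
    """
    # Attention map from the Query and Key vectors.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    if mask is not None:
        # Element-wise product with the mask, before the Softmax.
        scores = scores * mask
    return F.softmax(scores, dim=-1) @ v
```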

Patent/Innovation Explanation

  • The vision transformer includes multiple layers, each containing multiple attention heads.
  • Some attention heads have an attention mask added before the Softmax operation; the remaining heads are unmasked.
  • Each attention mask is combined element-wise with the Query-Key products that form the attention map.
  • A hard mask selects the closest neighbors of a patch and ignores patches further away.
  • A soft mask multiplies the weights of a patch's closest neighbors by a magnification factor and passes the weights of patches further away unchanged.
  • A learnable bias α can be added to the diagonal elements of the attention map (see the sketch after this list).
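
One sketch of how the hard mask, the soft mask, and the diagonal bias might be built, assuming PyTorch, a square grid of patches, and a Chebyshev-distance neighborhood; the radius, magnification factor, and helper name are illustrative assumptions. (Note that with a purely multiplicative hard mask, zeroed entries still contribute after the Softmax; practical implementations often use an additive mask of -inf instead.)

```python
import torch

def neighborhood_masks(grid_size, radius=1, magnification=2.0):
    """Hypothetical hard and soft masks over a grid_size x grid_size patch grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size),
        indexing="ij"), dim=-1).reshape(-1, 2).float()
    # Chebyshev distance between every pair of patches (an assumed neighborhood).
    dist = (coords[:, None, :] - coords[None, :, :]).abs().max(dim=-1).values
    near = dist <= radius

    # Hard mask: keep the closest neighbors of each patch, ignore the rest.
    hard_mask = near.float()
    # Soft mask: magnify the closest neighbors, pass the others unchanged.
    soft_mask = torch.where(near, magnification * torch.ones_like(dist),
                            torch.ones_like(dist))
    return hard_mask, soft_mask

# Learnable bias α added to the diagonal elements of an attention map.
num_patches = 14 * 14
alpha = torch.nn.Parameter(torch.zeros(()))
attn_map = torch.randn(num_patches, num_patches)  # stand-in attention map
attn_map = attn_map + alpha * torch.eye(num_patches)
```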

Potential Applications

  • Image recognition and classification
  • Object detection and tracking
  • Video analysis and understanding
  • Natural language processing
  • Medical imaging analysis

Problems Solved

  • Improved attention mechanism in vision transformers
  • Enhanced ability to focus on relevant image patches or features
  • Better handling of spatial relationships in visual data

Benefits

  • Improved accuracy and performance in vision tasks
  • Increased interpretability and explainability of the model's attention mechanism
  • More efficient processing of visual data
  • Potential for transfer learning and generalization to various domains


Original Abstract Submitted

A vision transformer includes L layers, and H attention heads in each layer. An h′ of the attention heads include an attention mask added before a Softmax operation, and an h of the attention heads include unmasked attention heads in which H=h′+h. Each attention mask multiplies a Query vector and a Key vector to form element-wise products. At least one attention mask is a hard mask that selects closest neighbors of a patch and ignores patches further away than the closest neighbors of the patch. Alternatively, at least one attention mask includes a soft mask that multiplies weights of closest neighbors of a patch by a magnification factor and passes weights of patches that are further away than the closest neighbors of the patch. A learnable bias α may be added to diagonal elements of the at least one attention map.
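
As a rough illustration of the head split H = h′ + h, here is a sketch of an attention layer whose first h′ heads are masked and whose remaining h heads are not, assuming PyTorch; the class name, the linear projections, and the choice to add the diagonal bias α only to the masked heads are assumptions made for illustration, not the patent's specified design.

```python
import torch
import torch.nn.functional as F

class MixedMaskAttention(torch.nn.Module):
    """Attention layer with h' masked heads and h unmasked heads (H = h' + h)."""

    def __init__(self, dim, num_heads, num_masked, mask):
        super().__init__()
        assert num_masked <= num_heads                 # h' <= H
        self.num_heads = num_heads
        self.num_masked = num_masked                   # h'
        self.head_dim = dim // num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.register_buffer("mask", mask)             # (num_patches, num_patches)
        self.alpha = torch.nn.Parameter(torch.zeros(()))  # learnable diagonal bias

    def forward(self, x):                              # x: (batch, num_patches, dim)
        b, n, _ = x.shape
        q, k, v = (self.qkv(x)
                   .reshape(b, n, 3, self.num_heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))            # each: (b, H, n, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5

        # First h' heads: element-wise mask plus bias α on the diagonal,
        # applied before the Softmax; the remaining h heads stay unmasked.
        masked = scores[:, :self.num_masked] * self.mask
        masked = masked + self.alpha * torch.eye(n, device=x.device)
        scores = torch.cat([masked, scores[:, self.num_masked:]], dim=1)

        out = F.softmax(scores, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))
```

For example, a layer built with num_heads=12 and num_masked=6 corresponds to H=12, h′=6, and h=6 in the abstract's notation.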