17573630. INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION simplified abstract (Samsung Electronics Co., Ltd.)

INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION

Organization Name

Samsung Electronics Co., Ltd.

Inventor(s)

Ling Li of Sunnyvale, CA (US)

Ali Shafiee Ardestani of Santa Clara, CA (US)

Joseph H. Hassoun of San Jose, CA (US)

INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION - A simplified explanation of the abstract

This abstract first appeared for US patent application 17573630 titled 'INTEGRATING SPATIAL LOCALITY INTO IMAGE TRANSFORMERS WITH MASKED ATTENTION'.

Simplified Explanation

Abstract

This patent application describes a vision transformer made up of multiple layers, each with multiple attention heads. Some attention heads apply an attention mask before the Softmax operation, while others do not. Each mask is combined element-wise with the Query-Key products that form the attention map. The masks can be hard masks or soft masks, and a learnable bias α can be added to the diagonal elements of the attention map.
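
A minimal sketch of such a masked attention head, assuming PyTorch; the function name, tensor shapes, and scaled dot-product form are illustrative assumptions rather than details from the patent:

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """One attention head; the mask is applied before the Softmax.

    q, k, v: (num_patches, head_dim) tensors.
    mask:    (num_patches, num_patches) tensor, or None for an
             unmasked head.
    """
    # Attention map from the Query and Key vectors.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    if mask is not None:
        # Element-wise product with the mask, before the Softmax.
        scores = scores * mask
    return F.softmax(scores, dim=-1) @ v
```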

Patent/Innovation Explanation

  • The vision transformer includes multiple layers, each containing multiple attention heads.
  • Some attention heads have an attention mask added before the Softmax operation; the remaining heads are unmasked.
  • Each attention mask is combined element-wise with the Query-Key products that form the attention map.
  • A hard mask selects the closest neighbors of a patch and ignores patches further away.
  • A soft mask multiplies the weights of a patch's closest neighbors by a magnification factor and passes the weights of patches further away unchanged.
  • A learnable bias α can be added to the diagonal elements of the attention map (see the sketch after this list).
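
One sketch of how the hard mask, the soft mask, and the diagonal bias might be built, assuming PyTorch, a square grid of patches, and a Chebyshev-distance neighborhood; the radius, magnification factor, and helper name are illustrative assumptions. (Note that with a purely multiplicative hard mask, zeroed entries still contribute after the Softmax; practical implementations often use an additive mask of -inf instead.)

```python
import torch

def neighborhood_masks(grid_size, radius=1, magnification=2.0):
    """Hypothetical hard and soft masks over a grid_size x grid_size patch grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size),
        indexing="ij"), dim=-1).reshape(-1, 2).float()
    # Chebyshev distance between every pair of patches (an assumed neighborhood).
    dist = (coords[:, None, :] - coords[None, :, :]).abs().max(dim=-1).values
    near = dist <= radius

    # Hard mask: keep the closest neighbors of each patch, ignore the rest.
    hard_mask = near.float()
    # Soft mask: magnify the closest neighbors, pass the others unchanged.
    soft_mask = torch.where(near, magnification * torch.ones_like(dist),
                            torch.ones_like(dist))
    return hard_mask, soft_mask

# Learnable bias α added to the diagonal elements of an attention map.
num_patches = 14 * 14
alpha = torch.nn.Parameter(torch.zeros(()))
attn_map = torch.randn(num_patches, num_patches)  # stand-in attention map
attn_map = attn_map + alpha * torch.eye(num_patches)
```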

Potential Applications

  • Image recognition and classification
  • Object detection and tracking
  • Video analysis and understanding
  • Natural language processing
  • Medical imaging analysis

Problems Solved

  • Improved attention mechanism in vision transformers
  • Enhanced ability to focus on relevant image patches or features
  • Better handling of spatial relationships in visual data

Benefits

  • Improved accuracy and performance in vision tasks
  • Increased interpretability and explainability of the model's attention mechanism
  • More efficient processing of visual data
  • Potential for transfer learning and generalization to various domains


Original Abstract Submitted

A vision transformer includes L layers, and H attention heads in each layer. An h′ of the attention heads include an attention mask added before a Softmax operation, and an h of the attention heads include unmasked attention heads in which H=h′+h. Each attention mask multiplies a Query vector and a Key vector to form element-wise products. At least one attention mask is a hard mask that selects closest neighbors of a patch and ignores patches further away than the closest neighbors of the patch. Alternatively, at least one attention mask includes a soft mask that multiplies weights of closest neighbors of a patch by a magnification factor and passes weights of patches that are further away than the closest neighbors of the patch. A learnable bias α may be added to diagonal elements of the at least one attention map.
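
As a rough illustration of the head split H = h′ + h, here is a sketch of an attention layer whose first h′ heads are masked and whose remaining h heads are not, assuming PyTorch; the class name, the linear projections, and the choice to add the diagonal bias α only to the masked heads are assumptions made for illustration, not the patent's specified design.

```python
import torch
import torch.nn.functional as F

class MixedMaskAttention(torch.nn.Module):
    """Attention layer with h' masked heads and h unmasked heads (H = h' + h)."""

    def __init__(self, dim, num_heads, num_masked, mask):
        super().__init__()
        assert num_masked <= num_heads                 # h' <= H
        self.num_heads = num_heads
        self.num_masked = num_masked                   # h'
        self.head_dim = dim // num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.register_buffer("mask", mask)             # (num_patches, num_patches)
        self.alpha = torch.nn.Parameter(torch.zeros(()))  # learnable diagonal bias

    def forward(self, x):                              # x: (batch, num_patches, dim)
        b, n, _ = x.shape
        q, k, v = (self.qkv(x)
                   .reshape(b, n, 3, self.num_heads, self.head_dim)
                   .permute(2, 0, 3, 1, 4))            # each: (b, H, n, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5

        # First h' heads: element-wise mask plus bias α on the diagonal,
        # applied before the Softmax; the remaining h heads stay unmasked.
        masked = scores[:, :self.num_masked] * self.mask
        masked = masked + self.alpha * torch.eye(n, device=x.device)
        scores = torch.cat([masked, scores[:, self.num_masked:]], dim=1)

        out = F.softmax(scores, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))
```

For example, a layer built with num_heads=12 and num_masked=6 corresponds to H=12, h′=6, and h=6 in the abstract's notation.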