20240020948. EFFICIENTFORMER VISION TRANSFORMER simplified abstract (Unknown Organization)


EFFICIENTFORMER VISION TRANSFORMER

Organization Name

Unknown Organization

Inventor(s)

Jian Ren of Hermosa Beach CA (US)

Yang Wen of San Jose CA (US)

Ju Hu of Los Angeles CA (US)

Georgios Evangelidis of Wien (AT)

Sergey Tulyakov of Santa Monica CA (US)

Yanyu Li of Malden MA (US)

Geng Yuan of Medford MA (US)

EFFICIENTFORMER VISION TRANSFORMER - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240020948 titled 'EFFICIENTFORMER VISION TRANSFORMER'.

Simplified Explanation

The abstract describes a vision transformer network designed for mobile devices, particularly smart eyewear and augmented reality (AR)/virtual reality (VR) devices. The network patch-embeds the input image with a convolution stem, then processes it through two stacks of metablock (MB) stages. Each stage has its own layer configuration and includes a token mixer, and the 3D MB stages additionally include a multi-head self-attention (MHSA) processing block.

  • The vision transformer network is optimized for low latency so it can run on mobile devices such as smart eyewear and AR/VR devices.
  • The network begins with a convolution stem that patch-embeds the input image.
  • Two stacks of stages follow the stem: the first includes at least two stages of 4D metablocks (MB4D), and the second includes at least two stages of 3D metablocks (MB3D).
  • Each MB stage, and each MB within a stage, has its own layer configuration and includes a token mixer.
  • The MB3D stages additionally include a multi-head self-attention (MHSA) processing block (a minimal sketch of this layout follows the list).
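
A minimal PyTorch sketch of the layout described above. This is an illustration under stated assumptions, not the patented configuration: the pooling token mixer, the channel width (48), the block counts, and the omission of downsampling between stages are all illustrative choices, and ConvStem, MB4D, MB3D, and EfficientFormerSketch are hypothetical names.

import torch
import torch.nn as nn

class ConvStem(nn.Module):
    # Patch-embeds the input image with two strided 3x3 convolutions.
    def __init__(self, in_ch=3, dim=48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2), nn.ReLU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(),
        )
    def forward(self, x):
        return self.stem(x)

class MB4D(nn.Module):
    # 4D metablock: pooling token mixer + conv MLP on a (B, C, H, W) tensor.
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1))
    def forward(self, x):
        y = self.norm1(x)
        x = x + self.pool(y) - y          # pooling acts as the token mixer
        return x + self.mlp(self.norm2(x))

class MB3D(nn.Module):
    # 3D metablock: MHSA token mixer + linear MLP on (B, N, C) tokens.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class EfficientFormerSketch(nn.Module):
    # Convolution stem -> stack of MB4D stages -> stack of MB3D stages.
    # Downsampling between stages is omitted for brevity.
    def __init__(self, dim=48, n_4d=2, n_3d=2):
        super().__init__()
        self.stem = ConvStem(dim=dim)
        self.mb4d = nn.Sequential(*[MB4D(dim) for _ in range(n_4d)])
        self.mb3d = nn.ModuleList([MB3D(dim) for _ in range(n_3d)])
    def forward(self, x):
        x = self.mb4d(self.stem(x))       # stays 4D: (B, C, H, W)
        x = x.flatten(2).transpose(1, 2)  # reshape to 3D tokens: (B, N, C)
        for blk in self.mb3d:
            x = blk(x)
        return x

out = EfficientFormerSketch()(torch.randn(1, 3, 224, 224))  # -> (1, 3136, 48)

The reshape between the two stacks is what the 4D/3D naming reflects: the MB4D stages operate on the full (batch, channel, height, width) feature map, while the MB3D stages operate on a flattened token sequence suitable for self-attention.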

Potential Applications:

  • Mobile devices: The network can run on smart eyewear and AR/VR devices, enabling real-time image processing and analysis directly on the device.
  • Augmented Reality: The network can enhance AR experiences by providing efficient, accurate image processing on mobile hardware.
  • Virtual Reality: The network can improve VR experiences by enabling low-latency image processing, enhancing the realism and interactivity of virtual environments.

Problems Solved:

  • Latency: The network is designed for extremely low latency, addressing the challenge of real-time image processing on mobile devices (a simple timing check is sketched after this list).
  • Mobile Device Compatibility: By optimizing the network for mobile hardware, the technology addresses resource-intensive image processing on devices with limited computational power and battery life.
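
As a rough illustration of how a latency claim might be checked, here is a minimal timing loop for the hypothetical EfficientFormerSketch defined earlier. Desktop CPU timing is only a stand-in for on-device profiling; a real mobile deployment would export the model (e.g. to a mobile runtime) and measure on the target hardware.

import time
import torch

model = EfficientFormerSketch().eval()   # hypothetical sketch from above
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):                   # warm-up iterations
        model(x)
    t0 = time.perf_counter()
    for _ in range(20):
        model(x)
    t1 = time.perf_counter()

print(f"mean latency: {(t1 - t0) / 20 * 1e3:.1f} ms per image")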

Benefits:

  • Low Latency: The network's design ensures minimal delay in processing input images, enabling real-time applications on mobile devices.
  • Mobile Device Compatibility: The network is specifically tailored for mobile devices, making it usable on smart eyewear devices and AR/VR devices without compromising performance.
  • Enhanced AR/VR Experiences: By providing efficient image processing capabilities, the network improves the quality and interactivity of AR and VR experiences on mobile devices.


Original Abstract Submitted

A vision transformer network having extremely low latency and usable on mobile devices, such as smart eyewear devices and other augmented reality (AR) and virtual reality (VR) devices. The transformer network processes an input image, and the network includes a convolution stem configured to patch embed the image. A first stack of stages including at least two stages of 4-dimension (4D) metablocks (MB4D) follow the convolution stem. A second stack of stages including at least two stages of 3-dimension MBs (MB3D) follow the MB4D stages. Each of the MB4D stages and each of the MB3D stages include different layer configurations, and each of the MB4D stages and each of the MB3D stages include a token mixer. The MB3D stages each additionally include a multi-head self attention (MHSA) processing block.