18520083. Vector-Quantized Image Modeling simplified abstract (GOOGLE LLC)

From WikiPatents

Vector-Quantized Image Modeling

Organization Name

GOOGLE LLC

Inventor(s)

Jiahui Yu of Jersey City NJ (US)

Xin Li of Santa Clara CA (US)

Han Zhang of Sunnyvale CA (US)

Vijay Vasudevan of Los Altos Hills CA (US)

Alexander Yeong-Shiuh Ku of Brooklyn NY (US)

Jason Michael Baldridge of Austin TX (US)

Yuanzhong Xu of Mountain View CA (US)

Jing Yu Koh of Austin TX (US)

Thang Minh Luong of Santa Clara CA (US)

Gunjan Baid of San Francisco CA (US)

Zirui Wang of San Francisco CA (US)

Yonghui Wu of Palo Alto CA (US)

Vector-Quantized Image Modeling - A simplified explanation of the abstract

This abstract first appeared for US patent application 18520083, titled 'Vector-Quantized Image Modeling'.

Simplified Explanation

The present disclosure describes a Vector-quantized Image Modeling (VIM) approach using vision transformers and improved codebook handling to enhance image modeling tasks.

  • The approach involves pretraining a machine learning model, such as a Transformer model, to predict rasterized image tokens autoregressively.
  • Discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN), which improves efficiency and reconstruction fidelity.
  • The improved ViT-VQGAN enhances vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
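The core operation described above, quantizing continuous encoder outputs into discrete image tokens, can be sketched as a nearest-codebook-entry lookup. This is a minimal illustration of vector quantization in general, not the patent's actual implementation; the sizes, the random codebook, and the NumPy formulation are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the patent: 8 codebook entries, 4-dim latents.
codebook = rng.normal(size=(8, 4))   # learned codebook (random here for illustration)
latents = rng.normal(size=(16, 4))   # encoder outputs for 16 image patches

# Vector quantization: map each latent vector to the index of its nearest
# codebook entry (squared Euclidean distance).
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)        # discrete image tokens, shape (16,)

# A Transformer would then model this rasterized token sequence
# autoregressively: p(t_1) * p(t_2 | t_1) * ... * p(t_16 | t_1..t_15).
print(tokens.shape)
```

In a ViT-VQGAN-style pipeline, the encoder producing `latents` would be a Vision Transformer, and the token sequence would be decoded back to pixels by the VQGAN decoder.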

Potential Applications

  • Image generation
  • Image classification
  • Representation learning

Problems Solved

  • Enhanced efficiency in image modeling
  • Improved reconstruction fidelity
  • Better handling of codebooks

Benefits

  • Higher quality image generation
  • More accurate image classification
  • Improved unsupervised representation learning

Potential Commercial Applications

Enhanced Image Generation and Classification Using Vision Transformers

Possible Prior Art

No prior art information is available at this time.

Unanswered Questions

=== How does the ViT-VQGAN approach compare to other existing image modeling techniques in terms of performance and efficiency? ===

=== Are there any limitations or drawbacks to using the ViT-VQGAN approach for image modeling tasks? ===


Original Abstract Submitted

Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
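The abstract's "predict rasterized image tokens autoregressively" step can be illustrated with a toy sampling loop. The stand-in model below returns a uniform distribution purely to show the control flow; the vocabulary size, sequence length, and `toy_model` function are hypothetical placeholders, and a real VIM model would run a Transformer over the token prefix.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = 8      # toy codebook size (illustrative, not from the patent)
seq_len = 16   # number of rasterized image tokens to generate

def toy_model(prefix):
    """Stand-in for a Transformer: returns next-token probabilities.

    A real model would attend over `prefix`; here we emit a fixed
    uniform distribution just to demonstrate the autoregressive loop.
    """
    return np.full(vocab, 1.0 / vocab)

# Autoregressive generation: sample t_i conditioned on t_1..t_{i-1}.
tokens = []
for _ in range(seq_len):
    probs = toy_model(tokens)
    tokens.append(int(rng.choice(vocab, p=probs)))

print(len(tokens))  # the VQGAN decoder would map these tokens back to pixels
```

For conditioned generation (e.g., class-conditioned), the conditioning signal would simply be prepended to, or otherwise injected into, the prefix that the model attends over.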