18454459. PROMPT TUNING FOR ZERO-SHOT COMPOSITIONAL LEARNING IN MACHINE LEARNING SYSTEMS simplified abstract (SAMSUNG ELECTRONICS CO., LTD.)


PROMPT TUNING FOR ZERO-SHOT COMPOSITIONAL LEARNING IN MACHINE LEARNING SYSTEMS

Organization Name

SAMSUNG ELECTRONICS CO., LTD.

Inventor(s)

Lingyu Zhang of Cupertino CA (US)

Ting Hua of Santa Clara CA (US)

Yilin Shen of Santa Clara CA (US)

Hongxia Jin of San Jose CA (US)

PROMPT TUNING FOR ZERO-SHOT COMPOSITIONAL LEARNING IN MACHINE LEARNING SYSTEMS - A simplified explanation of the abstract

This abstract first appeared for US patent application 18454459 titled 'PROMPT TUNING FOR ZERO-SHOT COMPOSITIONAL LEARNING IN MACHINE LEARNING SYSTEMS'.

The method described in the abstract involves prompt tuning a pre-trained vision-language model so that it selects an attribute label and an object label matching the content of an image.

  • The model is trained to choose one attribute label and one object label for each image.
  • Prompt tuning generates textual features for object labels and attribute labels using two separate textual encoders, and image features using a vision encoder.
  • Layer-specific learnable prompt tokens are created from intermediate outputs of the encoders and appended to the inputs of specified layers to improve model performance (see the sketch after this list).
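
A minimal sketch of how such pairwise attribute-object selection could look, assuming a frozen CLIP-style setup with one textual encoder for object labels, a second for attribute labels, and a vision encoder. The toy encoder classes, dimensions, and label token IDs below are illustrative placeholders, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen textual encoder (one instance for objects, one for attributes)."""
    def __init__(self, vocab_size=1000, dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)          # (num_labels, dim)

class ToyVisionEncoder(nn.Module):
    """Stand-in for a frozen vision encoder mapping an image to a feature vector."""
    def __init__(self, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)

    def forward(self, image):
        return self.proj(image.flatten(1))    # (batch, dim)

object_encoder = ToyTextEncoder()      # plays the role of the "first textual encoder"
attribute_encoder = ToyTextEncoder()   # plays the role of the "second textual encoder"
vision_encoder = ToyVisionEncoder()

# Toy inputs: one image plus tokenized attribute and object labels (IDs are made up).
image = torch.randn(1, 3, 32, 32)
attribute_tokens = torch.tensor([[11], [12], [13]])   # e.g. "red", "wet", "old"
object_tokens = torch.tensor([[21], [22]])            # e.g. "apple", "car"

img_feat = F.normalize(vision_encoder(image), dim=-1)                  # (1, D)
attr_feats = F.normalize(attribute_encoder(attribute_tokens), dim=-1)  # (A, D)
obj_feats = F.normalize(object_encoder(object_tokens), dim=-1)         # (O, D)

# Score every attribute-object pair by adding the image's similarity to each label.
pair_scores = (img_feat @ attr_feats.T).unsqueeze(-1) + (img_feat @ obj_feats.T).unsqueeze(1)  # (1, A, O)
best = pair_scores.flatten(1).argmax(dim=1).item()
best_attr, best_obj = divmod(best, obj_feats.size(0))
print(f"selected attribute index {best_attr}, object index {best_obj}")
```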

Potential Applications:

  • Image captioning
  • Visual question answering
  • Content-based image retrieval

Problems Solved:

  • Enhancing the accuracy of vision-language models
  • Improving the understanding of complex visual content

Benefits:

  • More precise image analysis
  • Better matching of image content with textual labels

Commercial Applications: Enhanced Image Labeling Technology for Improved Visual Understanding. This technology can be used in industries such as e-commerce, social media, and healthcare for automated image tagging, content recommendation, and medical image analysis.

Questions about the technology:

1. How does prompt tuning improve the performance of vision-language models?

  Prompt tuning helps the model focus on specific attributes and objects in an image, leading to more accurate labeling.

2. What are the potential drawbacks of using learnable prompt tokens in the model?

  Learnable prompt tokens may introduce additional complexity to the model and require more computational resources.
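
As a rough, hypothetical illustration of that cost: appending prompt tokens lengthens the input of every prompted layer, and self-attention cost grows roughly with the square of the sequence length. All numbers below are assumptions for illustration, not figures from the patent.

```python
# Hypothetical back-of-the-envelope estimate (all values are assumptions, not
# from the patent): extra attention cost from appending learnable prompt tokens.
seq_len = 77            # assumed base text-token sequence length
prompt_len = 8          # assumed number of learnable prompt tokens per layer
base_cost = seq_len ** 2
prompted_cost = (seq_len + prompt_len) ** 2
print(f"attention-cost increase per prompted layer: "
      f"{(prompted_cost / base_cost - 1):.1%}")   # roughly 22% in this toy setting
```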


Original Abstract Submitted

A method includes obtaining an image, a set of attribute labels, and a set of object labels and performing prompt tuning of a pre-trained vision-language model having first and second textual encoders and a vision encoder. The model is trained during prompt tuning to select one attribute label and one object label that match content in the image. Performing the prompt tuning includes, for each attribute label-object label pair, generating object textual features associated with the object label using the first textual encoder, generating attribute textual features associated with the attribute label using the second textual encoder, and generating image features associated with the image using the vision encoder. Intermediate outputs from initial layers of the textual encoders and the vision encoder are combined to generate layer-specific learnable prompt tokens that are appended to inputs of specified layers in the first and second textual encoders and the vision encoder.
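
A minimal sketch of the layer-specific prompt-token idea from the abstract, assuming transformer-style encoders. The mean pooling, the per-layer linear fusion, and all shapes below are illustrative assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class LayerPromptGenerator(nn.Module):
    """Combines intermediate outputs of the two textual encoders and the vision
    encoder into layer-specific learnable prompt tokens."""
    def __init__(self, dim, prompt_len, num_layers):
        super().__init__()
        # One learnable projection per specified layer (hence "layer-specific").
        self.fusers = nn.ModuleList(
            nn.Linear(3 * dim, prompt_len * dim) for _ in range(num_layers)
        )
        self.prompt_len = prompt_len
        self.dim = dim

    def forward(self, layer_idx, obj_hidden, attr_hidden, img_hidden):
        # Pool each encoder's intermediate output (mean over the token axis),
        # concatenate the three summaries, and project them to prompt tokens.
        pooled = torch.cat(
            [obj_hidden.mean(dim=1), attr_hidden.mean(dim=1), img_hidden.mean(dim=1)],
            dim=-1,
        )                                                    # (B, 3*D)
        prompts = self.fusers[layer_idx](pooled)             # (B, P*D)
        return prompts.view(-1, self.prompt_len, self.dim)   # (B, P, D)

# Toy usage: intermediate outputs from the initial layers of the three encoders.
B, D, P = 2, 512, 8
generator = LayerPromptGenerator(dim=D, prompt_len=P, num_layers=4)
obj_hidden = torch.randn(B, 77, D)    # from the first textual encoder (object label)
attr_hidden = torch.randn(B, 77, D)   # from the second textual encoder (attribute label)
img_hidden = torch.randn(B, 197, D)   # from the vision encoder (image patches)

prompt_tokens = generator(0, obj_hidden, attr_hidden, img_hidden)   # (B, P, D)
layer_input = torch.cat([prompt_tokens, obj_hidden], dim=1)         # appended to a specified layer's input
print(layer_input.shape)   # torch.Size([2, 85, 512])
```

In a setup like this, prompt tuning would typically update only small fusion modules such as these while the pre-trained encoders stay frozen, which is the usual motivation for prompt tuning over full fine-tuning.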