Samsung Electronics Co., Ltd. (20240119077). APPARATUS AND METHOD FOR SHARING AND PRUNING WEIGHTS FOR VISION AND LANGUAGE MODELS simplified abstract

APPARATUS AND METHOD FOR SHARING AND PRUNING WEIGHTS FOR VISION AND LANGUAGE MODELS

Organization Name

Samsung Electronics Co., Ltd.

Inventor(s)

Shangqian Gao of Pittsburgh PA (US)

Burak Uzkent of Sunnyvale CA (US)

Yilin Shen of San Jose CA (US)

Hongxia Jin of San Jose CA (US)

APPARATUS AND METHOD FOR SHARING AND PRUNING WEIGHTS FOR VISION AND LANGUAGE MODELS - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240119077 titled 'APPARATUS AND METHOD FOR SHARING AND PRUNING WEIGHTS FOR VISION AND LANGUAGE MODELS'.

Simplified Explanation

The abstract describes a method of performing multimodal tasks with a multimodal model that pairs a text encoder with a vision encoder. The model extracts a text feature from a query and an image feature from one or more input images, then outputs a response based on the similarity between the two. The weight vectors of both encoders are pruned and shared according to a pruning vector and a sharing vector generated by a hypernetwork, and the hypernetwork and multimodal model are jointly trained to minimize differences between weight vectors (across the two encoders and across layers of the text encoder) and the number of model parameters.

  • Explanation (a code sketch follows this list):

- Method for performing multimodal tasks using a model with text and vision encoders
- Obtains text and image features, then outputs a response based on their similarity
- Weight vectors of the encoders are pruned and shared according to vectors generated by a hypernetwork
- Hypernetwork and model are jointly trained to minimize differences between weight vectors and the number of parameters
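
To make the inference flow concrete, here is a minimal sketch of the similarity-based response step. It assumes CLIP-style encoders that map text and images into a shared embedding space; answer_query, the encoder interfaces, and the best-match response rule are all hypothetical, since the abstract does not specify them.

    # Minimal sketch of the similarity-based response step (hypothetical).
    import torch
    import torch.nn.functional as F

    def answer_query(text_encoder, vision_encoder, query_tokens, images):
        """Return the input image that best matches the text query.

        text_encoder / vision_encoder are stand-ins for the patent's
        encoders; the abstract does not specify their architectures.
        """
        with torch.no_grad():
            text_feature = text_encoder(query_tokens)    # shape (1, d)
            image_features = vision_encoder(images)      # shape (n, d)
        # Cosine similarity between the query and each candidate image.
        sims = F.cosine_similarity(text_feature, image_features, dim=-1)
        # One plausible reading of "outputting a response based on
        # similarity": return the best-matching image.
        return images[sims.argmax()]

In a retrieval setting the response could equally be a ranked list of images; the abstract leaves the exact form of the response open.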

  • Potential Applications:

- Image and text retrieval systems
- Multimodal chatbots
- Content recommendation systems

  • Problems Solved:

- Efficiently processing multimodal data
- Improving response accuracy in multimodal tasks

  • Benefits:

- Enhanced performance in multimodal tasks
- Reduced computational complexity
- Improved model training efficiency

  • Potential Commercial Applications:

- E-commerce product recommendation systems
- Customer service chatbots
- Image search engines

  • Possible Prior Art:

- Previous methods for multimodal feature extraction and similarity calculation
- Existing models for text and image processing in isolation

  • Unanswered Questions:

    1. How does the hypernetwork generate sharing and pruning vectors for the weight vectors of the encoders?

The abstract mentions that a hypernetwork is used to generate sharing and pruning vectors for the weight vectors of the text and vision encoders. However, it does not provide details on the specific mechanism or algorithm used by the hypernetwork to perform this task.
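
For illustration only, one plausible design is a small hypernetwork that maps a learned per-layer embedding to a soft pruning mask and a soft sharing gate. Nothing below is taken from the patent: the MLP structure, the sigmoid gating, and the weight-modulation rule in the trailing comment are all assumptions.

    import torch
    import torch.nn as nn

    class HyperNetwork(nn.Module):
        """Hypothetical hypernetwork: maps a learned per-layer embedding
        to a pruning vector (soft channel mask) and a sharing vector
        (soft gate between shared and layer-specific weights)."""

        def __init__(self, num_layers: int, embed_dim: int, width: int):
            super().__init__()
            self.layer_embed = nn.Embedding(num_layers, embed_dim)
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, embed_dim),
                nn.ReLU(),
                nn.Linear(embed_dim, 2 * width),
            )
            self.width = width

        def forward(self, layer_idx: torch.Tensor):
            h = self.mlp(self.layer_embed(layer_idx))
            prune_vec, share_vec = h.split(self.width, dim=-1)
            # Sigmoid keeps both vectors in (0, 1) so they act as soft gates.
            return torch.sigmoid(prune_vec), torch.sigmoid(share_vec)

    # One way the vectors could modulate a layer's weight matrix (again,
    # an assumption, not the patent's stated mechanism):
    #   effective_w = prune_vec[:, None] * (share_vec[:, None] * shared_w
    #                 + (1 - share_vec)[:, None] * layer_w)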

    2. What specific metrics are used to measure the differences between weight vectors and parameters during model training?

The abstract states that the model is trained to minimize differences between weight vectors in the text and vision encoders, as well as differences between weight vectors in different layers of the text encoder. However, it does not specify the exact metrics or loss functions used to quantify these differences and guide the training process.
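
A common choice for such a penalty would be a squared L2 (Frobenius) distance between the relevant weight tensors, as in the hypothetical sketch below; the metric, the loss weights, and the function names here are assumptions, not details from the filing.

    import torch

    def weight_alignment_loss(text_weights, vision_weights,
                              lam_cross=1.0, lam_within=1.0):
        """Hypothetical regularizer: squared L2 distances between
        corresponding text/vision weights and between adjacent layers of
        the text encoder. Assumes corresponding layers share a shape,
        which weight sharing across encoders would make natural."""
        # Difference between weight vectors in the text and vision encoders.
        cross = sum((wt - wv).pow(2).sum()
                    for wt, wv in zip(text_weights, vision_weights))
        # Difference between weight vectors in different layers of the
        # text encoder (here: adjacent layers).
        within = sum((text_weights[i] - text_weights[i + 1]).pow(2).sum()
                     for i in range(len(text_weights) - 1))
        return lam_cross * cross + lam_within * within

The third quantity from the abstract, the number of parameters, could be approximated by penalizing the magnitude of the pruning vectors so that the objective stays differentiable.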


Original Abstract Submitted

a method of performing a multimodal tasks by using a multimodal model that includes a text encoder and a vision encoder, may include obtaining a text feature from the query via the text encoder; obtaining an image feature from the one or more input images via the vision encoder; and outputting a response to the query based on similarity between the text feature and the image feature, wherein weights vectors of the text encoder and the vision encoder are pruned and shared according to a sharing vector and a pruning vector that are generated by a hypernetwork, and wherein the hypernetwork and the multimodal model are jointly trained to minimize at least one of a difference between the weight vectors in the text encoder and the vision encoder, a difference between the weight vectors in different layers of the text encoder, and a number of parameters in the multimodal model.
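
Reading the abstract end to end, a single joint update of the multimodal model and the hypernetwork might look like the sketch below. This only illustrates how the pieces could fit together; the task loss, the gate interface, the accessor methods, and the optimizer are all hypothetical.

    import torch

    def joint_training_step(model, hypernet, optimizer, batch,
                            alignment_loss_fn, lam=0.1):
        """One hypothetical joint update. model, hypernet, and the
        accessors used here are illustrative stand-ins; the abstract only
        states that the two are jointly trained."""
        optimizer.zero_grad()
        # The hypernetwork emits per-layer (pruning, sharing) vectors that
        # the model uses to build its effective weights (see the earlier
        # HyperNetwork sketch for one way to generate them).
        gates = [hypernet(torch.tensor(i)) for i in range(model.num_layers)]
        task_loss = model(batch, gates)  # e.g., a contrastive retrieval loss
        # Regularizer from the earlier sketch: pull weight vectors together
        # across encoders and across text-encoder layers.
        reg_loss = alignment_loss_fn(model.text_weights(),
                                     model.vision_weights())
        (task_loss + lam * reg_loss).backward()
        optimizer.step()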