20230081171. Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models simplified abstract (Google LLC)

Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models

Organization Name

Google LLC

Inventor(s)

Han Zhang of Sunnyvale CA (US)

Jing Yu Koh of Singapore (SG)

Jason Michael Baldridge of Austin TX (US)

Yinfei Yang of Sunnyvale CA (US)

Honglak Lee of Ann Arbor MI (US)

Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 20230081171 titled 'Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models'.

Simplified Explanation

The patent application describes a computer-implemented method for generating image renditions of scenes from textual descriptions.

  • The method applies a neural network for text-to-image generation.
  • The network is trained contrastively: image renditions associated with the same textual description attract each other, while renditions associated with different textual descriptions repel each other.
  • Training is driven by mutual information between corresponding pairs, which include image-to-image pairs and text-to-image pairs (see the sketch after this list).
  • Given the textual description of a scene, the method predicts the output image rendition.
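The attract/repel training signal described above corresponds to an InfoNCE-style contrastive objective, which maximizes a lower bound on the mutual information between paired embeddings. The sketch below is a minimal illustration only; the function names, the PyTorch framework, the temperature value, and the way the two pair types are combined are assumptions, not the patent's actual implementation.

  import torch
  import torch.nn.functional as F

  def info_nce(anchors, positives, temperature=0.1):
      # Each anchor is attracted to its matching positive (same batch
      # index) and repelled from every other item in the batch.
      a = F.normalize(anchors, dim=-1)
      p = F.normalize(positives, dim=-1)
      logits = a @ p.t() / temperature                  # (B, B) similarities
      targets = torch.arange(a.size(0), device=a.device)
      return F.cross_entropy(logits, targets)

  def cross_modal_loss(gen_img_emb, real_img_emb, txt_emb):
      # Combine the two pair types named in the abstract: an
      # image-to-image pair (generated vs. real rendition of the same
      # description) and a text-to-image pair (rendition vs. its text).
      loss_img_img = info_nce(gen_img_emb, real_img_emb)
      loss_txt_img = info_nce(gen_img_emb, txt_emb)
      return loss_img_img + loss_txt_img

With a batch of matched embeddings, for example cross_modal_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)), the loss falls as each rendition becomes more similar to its own description and reference image than to the other items in the batch.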

Potential Applications

This technology has potential applications in various fields, including:

  • Content creation: It can be used to generate visual content based on textual descriptions, such as in video games, virtual reality, and animation.
  • Design and architecture: It can assist in visualizing architectural designs or interior layouts based on textual descriptions.
  • Advertising and marketing: It can be used to create visual representations of products or concepts described in text for promotional purposes.
  • Accessibility: It can help visually impaired individuals by generating image renditions of scenes described in text.

Problems Solved

This technology addresses the following problems:

  • Bridging the gap between textual descriptions and visual representations.
  • Automating the process of generating image renditions based on textual input.
  • Improving the accuracy and quality of text-to-image generation.

Benefits

The use of this technology offers several benefits:

  • Time and cost savings: It reduces the need for manual creation of visual content from textual descriptions.
  • Enhanced creativity: It provides a tool for artists, designers, and content creators to visualize their ideas more easily.
  • Improved accessibility: It enables visually impaired individuals to better understand and interact with visual content.
  • Consistency and scalability: It enables consistent, scalable generation of image renditions from textual input.


Original Abstract Submitted

A computer-implemented method includes receiving, by a computing device, a particular textual description of a scene. The method also includes applying a neural network for text-to-image generation to generate an output image rendition of the scene, the neural network having been trained to cause two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other based on mutual information between a plurality of corresponding pairs, wherein the plurality of corresponding pairs comprise an image-to-image pair and a text-to-image pair. The method further includes predicting the output image rendition of the scene.
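For the prediction step in the abstract's final sentence, the following is a hedged sketch of inference, assuming a trained text encoder and image generator; the module interfaces and the noise-concatenation design are illustrative assumptions rather than the patent's actual architecture.

  import torch

  def render_scene(description, text_encoder, generator, noise_dim=128):
      # Hypothetical inference step: embed the textual description,
      # sample a noise vector for stochastic variety, and decode one
      # image rendition of the scene. `text_encoder` and `generator`
      # stand in for the trained modules; their interfaces are assumed.
      with torch.no_grad():
          txt = text_encoder(description)               # (1, D) text embedding
          z = torch.randn(1, noise_dim)                 # stochastic input
          return generator(torch.cat([txt, z], dim=-1))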