20240054991. SPOKEN QUERY PROCESSING FOR IMAGE SEARCH simplified abstract (Adobe Inc.)

SPOKEN QUERY PROCESSING FOR IMAGE SEARCH

Organization Name

Adobe Inc.

Inventor(s)

Ajay Jain of San Jose (US)

Sanjeev Tagra of Redmond, WA (US)

Sachin Soni of New Delhi (IN)

Ryan Rozich of Austin, TX (US)

Nikaash Puri of New Delhi (IN)

Jonathan Roeder of San Jose (US)

SPOKEN QUERY PROCESSING FOR IMAGE SEARCH - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240054991, titled 'SPOKEN QUERY PROCESSING FOR IMAGE SEARCH'.

Simplified Explanation

The abstract describes a patent application for an image search system that uses a multi-modal model to determine how relevant images are to a spoken query. The multi-modal model pairs a spoken language model, which extracts features from the spoken query, with an image processing model, which extracts features from an image. A relevance score for the image and the spoken query is then computed from the extracted features. The model is trained with a curriculum approach: the spoken language model is first trained on audio data alone, and the spoken language model and image processing model are then trained jointly on a dataset of spoken queries and their associated images.

  • The patent application describes an image search system built on a multi-modal model.
  • The multi-modal model pairs a spoken language model with an image processing model.
  • The spoken language model extracts features from the spoken query.
  • The image processing model extracts features from the image.
  • The multi-modal model computes a relevance score for the image and the spoken query from the extracted features (see the sketch after this list).
  • Training follows a curriculum: the spoken language model is first trained on audio data alone.
  • The two models are then trained jointly on a dataset of spoken queries and their associated images.
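
The patent does not publish an implementation, so the following is a minimal sketch of the scoring step, assuming PyTorch and treating both models as generic encoders that embed into a shared space. Every name here (AudioEncoder, ImageEncoder, relevance_score, EMBED_DIM) is hypothetical; a real system would likely use pretrained speech and vision backbones rather than these toy MLPs, and cosine similarity is just one plausible choice of relevance score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256  # hypothetical shared embedding size

class AudioEncoder(nn.Module):
    """Stand-in for the spoken language model: maps spoken-query
    features (e.g., a flattened mel-spectrogram) to an embedding."""
    def __init__(self, in_dim=80 * 100, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class ImageEncoder(nn.Module):
    """Stand-in for the image processing model: maps image features
    to an embedding in the same space."""
    def __init__(self, in_dim=3 * 64 * 64, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def relevance_score(audio_emb, image_emb):
    """Cosine similarity of the two embeddings as the relevance score."""
    return (audio_emb * image_emb).sum(dim=-1)

# Toy usage: score a batch of 4 spoken queries against 4 images.
audio = torch.randn(4, 80 * 100)      # placeholder spoken-query features
images = torch.randn(4, 3 * 64 * 64)  # placeholder image features
a_enc, i_enc = AudioEncoder(), ImageEncoder()
scores = relevance_score(a_enc(audio), i_enc(images))
print(scores)  # one relevance score per (query, image) pair
```

Because both encoders produce unit-normalized vectors in one shared space, a query can be scored against an entire image index with a single matrix multiply, which is what makes this two-tower layout attractive for search.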

Potential Applications:

  • Improved image search systems that can understand spoken queries.
  • Enhanced user experience in searching for images using voice commands.
  • Integration of voice-based image search in various applications, such as virtual assistants or smart home devices.

Problems Solved:

  • Overcoming the limitations of traditional text-based image search systems, which require users to type their queries.
  • Enabling users to search for images using natural language spoken queries.
  • Improving the accuracy and relevance of image search results.

Benefits:

  • More intuitive and convenient image search experience for users.
  • Increased accuracy in retrieving relevant images based on spoken queries.
  • Potential for improved accessibility for individuals with disabilities who may have difficulty typing or using traditional search interfaces.


Original Abstract Submitted

An image search system uses a multi-modal model to determine relevance of images to a spoken query. The multi-modal model includes a spoken language model that extracts features from the spoken query and an image processing model that extracts features from an image. The multi-modal model determines a relevance score for the image and the spoken query based on the extracted features. The multi-modal model is trained using a curriculum approach that includes training the spoken language model using audio data. Subsequently, a training dataset comprising a plurality of spoken queries and one or more images associated with each spoken query is used to jointly train the spoken language model and an image processing model to provide a trained multi-modal model.
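
As a rough illustration of the two-stage curriculum described above, the sketch below (same hypothetical PyTorch setup as earlier) first pretrains a toy audio encoder on audio alone, then jointly trains the audio and image encoders on paired data. The abstract does not name either training objective; the reconstruction loss in stage 1 and the symmetric contrastive loss in stage 2 are common stand-ins, not necessarily what the patent uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders standing in for the spoken language model and the
# image processing model (hypothetical sizes, as in the earlier sketch).
audio_enc = nn.Sequential(nn.Linear(8000, 512), nn.ReLU(), nn.Linear(512, 256))
image_enc = nn.Sequential(nn.Linear(12288, 512), nn.ReLU(), nn.Linear(512, 256))

# --- Stage 1: train the spoken language model on audio data alone. ---
# An autoencoder-style reconstruction loss is used here purely as a
# placeholder pretraining objective.
decoder = nn.Linear(256, 8000)
opt1 = torch.optim.Adam(
    list(audio_enc.parameters()) + list(decoder.parameters()), lr=1e-3
)
for step in range(100):
    audio = torch.randn(32, 8000)  # placeholder audio features
    loss = F.mse_loss(decoder(audio_enc(audio)), audio)
    opt1.zero_grad()
    loss.backward()
    opt1.step()

# --- Stage 2: jointly train both models on (spoken query, image) pairs. ---
# A symmetric contrastive (InfoNCE-style) loss pushes the relevance score
# of matching pairs above that of mismatched pairs.
opt2 = torch.optim.Adam(
    list(audio_enc.parameters()) + list(image_enc.parameters()), lr=1e-4
)
for step in range(100):
    audio = torch.randn(32, 8000)    # spoken queries
    images = torch.randn(32, 12288)  # images paired index-wise with queries
    a = F.normalize(audio_enc(audio), dim=-1)
    v = F.normalize(image_enc(images), dim=-1)
    logits = a @ v.t() / 0.07        # pairwise relevance scores
    targets = torch.arange(32)       # i-th query matches i-th image
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```

The point of the curriculum is that stage 1 gives the spoken language model a usable audio representation before stage 2 asks it to align that representation with images, which typically stabilizes the joint training.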