Google LLC (20240330362). SYSTEM AND METHOD FOR GENERATING VISUAL CAPTIONS simplified abstract

From WikiPatents

SYSTEM AND METHOD FOR GENERATING VISUAL CAPTIONS

Organization Name

Google LLC

Inventor(s)

Ruofei Du of San Francisco CA (US)

Alex Olwal of Santa Cruz CA (US)

Xingyu Liu of Los Angeles CA (US)

SYSTEM AND METHOD FOR GENERATING VISUAL CAPTIONS - A simplified explanation of the abstract

This abstract first appeared in US patent application 20240330362, titled 'SYSTEM AND METHOD FOR GENERATING VISUAL CAPTIONS'.

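The patent application describes a method and device that receive audio data through a sensor on a computing device, convert the audio data to text, extract a portion of the text, and input that portion into a neural network-based language model. The model returns at least one of a type of visual images, a source of the visual images, the content of the visual images, or a confidence score for the visual images. The device then uses these outputs to select at least one visual image to display alongside the audio.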

  • The device can receive audio data through a sensor on a computing device.
  • It converts the audio data to text and extracts a portion of the text.
  • The extracted text is input into a neural network-based language model.
  • The language model outputs at least one of a type of visual images, a source of the visual images, the content of the visual images, or a confidence score for the visual images.
  • Based on these outputs, the device determines at least one visual image to show on the computing device's display, supplementing the audio data and facilitating communication.
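The steps above can be sketched as a small pipeline. This is a minimal, hypothetical Python sketch, not the patent's implementation: the speech recognizer and language model are passed in as plain functions, and `VisualIntent`, `extract_portion`, the eight-word window, the 0.5 confidence threshold, and the two `fake_*` stand-in models are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VisualIntent:
    """Hypothetical container for the language model's output."""
    visual_type: str   # assumed categories, e.g. "photo", "map", "none"
    source: str        # assumed sources, e.g. "image_search"
    content: str       # what the visual should depict
    confidence: float  # assumed to lie in [0, 1]


def extract_portion(transcript: str, max_words: int = 8) -> str:
    """Keep only the most recent words (the window size is an assumption)."""
    if max_words <= 0:
        return ""
    words = transcript.split()
    return " ".join(words[-max_words:])


def caption_pipeline(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    language_model: Callable[[str], VisualIntent],
    threshold: float = 0.5,
) -> Optional[VisualIntent]:
    """Audio -> text -> text portion -> language model -> optional visual."""
    text = speech_to_text(audio)
    portion = extract_portion(text)
    intent = language_model(portion)
    # Suggest a visual only when the model is confident enough (assumed rule).
    return intent if intent.confidence >= threshold else None


# Hypothetical stand-ins for a real speech recognizer and language model.
def fake_speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_language_model(portion: str) -> VisualIntent:
    if "Golden Gate" in portion:
        return VisualIntent("photo", "image_search", "Golden Gate Bridge", 0.9)
    return VisualIntent("none", "none", "", 0.1)


if __name__ == "__main__":
    result = caption_pipeline(
        b"last weekend we walked across the Golden Gate Bridge",
        fake_speech_to_text,
        fake_language_model,
    )
    print(result)
```

Passing the recognizer and model in as callables keeps the sketch independent of any particular speech or language-model library; a real device would plug in its own components there.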

Potential Applications:

  • Assistive technology for individuals with hearing impairments.
  • Real-time translation services for multilingual communication.
  • Enhancing accessibility in educational settings for students with learning disabilities.

Problems Solved:

  • Improving communication for individuals with hearing impairments.
  • Streamlining the process of converting audio data to visual information.
  • Enhancing the user experience in accessing and understanding audio content.

Benefits:

  • Increased accessibility for individuals with hearing impairments.
  • Improved efficiency in converting audio data to visual images.
  • Enhanced communication and understanding in diverse language settings.

Commercial Applications:

Title: "Enhanced Audio-to-Visual Conversion Technology for Improved Communication"

This technology can be utilized in:

  • Smart devices for real-time language translation.
  • Educational tools for students with learning disabilities.
  • Communication devices for individuals with hearing impairments.

Questions about the technology:

  1. How does this technology improve communication for individuals with hearing impairments?
  2. What are the potential limitations of using a neural network-based language model for visual image extraction?


Original Abstract Submitted

Methods and devices are provided where a device may receive audio data via a sensor of a computing device. The device may convert the audio data to text and extract a portion of the text. The device may input the portion of the text to a neural network-based language model to obtain at least one of a type of visual images, a source of the visual images, a content of the visual images, or a confidence score for the visual images. The device may determine at least one visual image based on at least one of the type of the visual images, the source of the visual images, the content of the visual images, or the confidence score for each of the visual images. The at least one visual image may be output on a display of the computing device to supplement the audio data and facilitate a communication.
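The selection step in the abstract, choosing at least one visual image based on its type, source, content, or confidence score, can be illustrated with a simple filter-and-rank routine. This is a hypothetical sketch only: the abstract does not say how these attributes are combined, so the exact-match filters on type and source, the case-insensitive substring test on content, the 0.5 minimum confidence, the example candidates, and the names `CandidateImage` and `select_visuals` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateImage:
    """Hypothetical candidate visual carrying the four attributes the abstract lists."""
    visual_type: str
    source: str
    content: str
    confidence: float


def select_visuals(
    candidates: List[CandidateImage],
    wanted_type: Optional[str] = None,
    wanted_source: Optional[str] = None,
    wanted_content: Optional[str] = None,
    min_confidence: float = 0.5,
    top_k: int = 1,
) -> List[CandidateImage]:
    """Filter candidates by type, source, content, and confidence; best first."""

    def matches(c: CandidateImage) -> bool:
        if wanted_type is not None and c.visual_type != wanted_type:
            return False
        if wanted_source is not None and c.source != wanted_source:
            return False
        if wanted_content is not None and wanted_content.lower() not in c.content.lower():
            return False
        return c.confidence >= min_confidence

    eligible = [c for c in candidates if matches(c)]
    # Highest confidence first; keep at most top_k images (never negative).
    eligible.sort(key=lambda c: c.confidence, reverse=True)
    return eligible[: max(top_k, 0)]


# Made-up candidates for illustration only.
CANDIDATES = [
    CandidateImage("photo", "image_search", "Golden Gate Bridge at sunset", 0.82),
    CandidateImage("map", "maps", "Golden Gate Bridge, San Francisco", 0.91),
    CandidateImage("photo", "personal_album", "Golden Gate Bridge with friends", 0.40),
]

if __name__ == "__main__":
    for image in select_visuals(CANDIDATES, wanted_type="photo"):
        print(image.content)
```

Filtering first and ranking by confidence second keeps the rule easy to inspect; other weightings of the four attributes would be equally consistent with the abstract.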