20240054683. ENHANCED USER EXPERIENCE THROUGH BI-DIRECTIONAL AUDIO AND VISUAL SIGNAL GENERATION simplified abstract (MICROSOFT TECHNOLOGY LICENSING, LLC)

ENHANCED USER EXPERIENCE THROUGH BI-DIRECTIONAL AUDIO AND VISUAL SIGNAL GENERATION

Organization Name

MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor(s)

Sunando Sengupta of Reading (GB)

Alexandros Neofytou of London (GB)

Eric Chris Wolfgang Sommerlade of Oxford (GB)

Yang Liu of Reading (GB)

ENHANCED USER EXPERIENCE THROUGH BI-DIRECTIONAL AUDIO AND VISUAL SIGNAL GENERATION - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240054683 titled 'ENHANCED USER EXPERIENCE THROUGH BI-DIRECTIONAL AUDIO AND VISUAL SIGNAL GENERATION'.

Simplified Explanation

The patent application describes a computer-implemented method for training a neural network to generate an output signal in a different modality from an input signal: the input may be either a sound signal or a visual image, and the output is the corresponding signal in the other modality.

  • The method first trains a model using a first pair of visual and audio networks to build a set of codebooks from known visual and audio signals.
  • The model is then trained using a second pair of visual and audio networks to refine the set of codebooks using augmented visual and audio signals.
  • The first and second visual networks are equally weighted, as are the first and second audio networks (a minimal sketch of this two-stage scheme follows this list).
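The application does not disclose a concrete architecture, but the description is consistent with a vector-quantization (VQ) style setup in which both modalities are encoded into a shared discrete codebook. The PyTorch sketch below is one plausible reading: the toy encoders, dimensions, and the parameter-tying interpretation of "equally weighted" are all assumptions, not the patent's specification.

```python
# Hypothetical sketch of the two-pair training scheme (not the patent's code).
# Assumptions: a VQ-VAE-style shared codebook, toy MLP encoders, and
# "equally weighted" read as the paired networks sharing parameters.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder mapping a flat input to codebook-dimensional vectors."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, code_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class SharedCodebook(nn.Module):
    """Discrete codebook shared by the visual and audio branches."""
    def __init__(self, num_codes: int, code_dim: int):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor):
        # Snap each embedding to its nearest code; straight-through gradient.
        dists = torch.cdist(z, self.codes.weight)        # (batch, num_codes)
        idx = dists.argmin(dim=-1)                       # nearest code indices
        z_q = self.codes(idx)
        return z + (z_q - z).detach(), idx

code_dim, num_codes = 64, 512
codebook = SharedCodebook(num_codes, code_dim)
visual_net_1 = Encoder(in_dim=1024, code_dim=code_dim)  # first visual network
audio_net_1 = Encoder(in_dim=128, code_dim=code_dim)    # first audio network
visual_net_2 = Encoder(in_dim=1024, code_dim=code_dim)  # second visual network
audio_net_2 = Encoder(in_dim=128, code_dim=code_dim)    # second audio network

# One reading of "equally weighted": the second pair starts from (or shares)
# the first pair's parameters.
visual_net_2.load_state_dict(visual_net_1.state_dict())
audio_net_2.load_state_dict(audio_net_1.state_dict())

# Stage 1: known signals through the first pair build the codebook; a
# simple alignment loss pulls paired modalities onto the same codes.
img, aud = torch.randn(8, 1024), torch.randn(8, 128)    # stand-in batch
zq_v, _ = codebook(visual_net_1(img))
zq_a, _ = codebook(audio_net_1(aud))
stage1_loss = nn.functional.mse_loss(zq_v, zq_a)

# Stage 2: augmented signals through the second pair refine the same codebook.
zq_v2, _ = codebook(visual_net_2(img + 0.01 * torch.randn_like(img)))
zq_a2, _ = codebook(audio_net_2(aud + 0.01 * torch.randn_like(aud)))
stage2_loss = nn.functional.mse_loss(zq_v2, zq_a2)
```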

Potential applications of this technology:

  • Cross-modal translation: The method can convert sound signals into visual images or vice versa, with applications such as speech-to-image translation or generating visual representations of music (an illustrative inference path is sketched after this list).
  • Accessibility: The technology can provide alternative modalities for individuals with sensory impairments, for example by converting visual content into sound signals for visually impaired users.
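To make the cross-modal translation idea concrete, here is a hypothetical sound-to-image inference path built on the `Encoder` and `SharedCodebook` classes from the training sketch above. The decoder and all shapes are invented for illustration; the filing does not describe a decoder.

```python
# Hypothetical sound-to-image inference, reusing audio_net_1 and codebook
# from the training sketch above. The decoder is invented for illustration.
import torch
import torch.nn as nn

image_decoder = nn.Sequential(            # toy decoder: codes -> flat image
    nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1024)
)

def sound_to_image(audio_feat: torch.Tensor) -> torch.Tensor:
    z = audio_net_1(audio_feat)           # continuous audio embedding
    z_q, _ = codebook(z)                  # snap to the shared discrete codes
    return image_decoder(z_q)             # render the codes as an image

fake_audio = torch.randn(1, 128)          # stand-in audio features
image = sound_to_image(fake_audio)        # flat image, shape (1, 1024)
```

The reverse direction (image to sound) would mirror this path through a visual encoder and an audio decoder, with the shared codebook acting as the common interchange between modalities.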

Problems solved by this technology:

  • Modality conversion: The method addresses the challenge of converting signals between different modalities accurately and efficiently.
  • Training with augmented data: The use of augmented visual and audio signals helps improve the accuracy and robustness of the model; example augmentations are sketched below.
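The filing does not enumerate which augmentations are applied, so the toy transforms below are common generic choices (flip, noise, gain, time masking), shown only to illustrate what "augmented visual and audio signals" could mean in practice.

```python
# Generic augmentations for illustration only; the patent does not
# specify the transformations it uses.
import torch

def augment_visual(img: torch.Tensor) -> torch.Tensor:
    """Random horizontal flip plus small Gaussian pixel noise."""
    if torch.rand(()) < 0.5:
        img = torch.flip(img, dims=[-1])
    return img + 0.01 * torch.randn_like(img)

def augment_audio(spec: torch.Tensor) -> torch.Tensor:
    """Random gain plus a short time-mask on a spectrogram-like tensor."""
    spec = spec * (0.8 + 0.4 * torch.rand(()))
    start = int(torch.randint(0, max(spec.shape[-1] - 8, 1), (1,)))
    spec = spec.clone()
    spec[..., start:start + 8] = 0.0
    return spec
```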

Benefits of this technology:

  • Enhanced communication: The ability to convert signals between different modalities can enable more effective communication and understanding across different sensory channels.
  • Accessibility and inclusion: The technology can help make content and information accessible to individuals with sensory impairments, promoting inclusivity.
  • Creative applications: The method opens up possibilities for creative applications, such as generating visual representations of music or creating immersive audiovisual experiences.


Original Abstract Submitted

In various embodiments, a computer-implemented method of training a neural network for creating an output signal of different modality from an input signal is described. In embodiments, the first modality may be a sound signal or a visual image and where the output signal would be a visual image or a sound signal, respectively. In embodiments a model is trained using a first pair of visual and audio networks to train a set of codebooks using known visual signals and the audio signals and using a second pair of visual and audio networks to further train the set of codebooks using the augmented visual signals and the augmented audio signals. Further, the first and the second visual networks are equally weighted and where the first and the second audio networks are equally weighted.