GOOGLE LLC (20240265917). Cascaded Audiovisual Automatic Speech Recognition Models simplified abstract

From WikiPatents

Cascaded Audiovisual Automatic Speech Recognition Models

Organization Name

GOOGLE LLC

Inventor(s)

Oscar Chang of New York NY (US)

Cascaded Audiovisual Automatic Speech Recognition Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240265917, titled 'Cascaded Audiovisual Automatic Speech Recognition Models'.

The method described in the patent application processes a sequence of acoustic frames with an audio encoder, which generates an acoustic higher-order feature representation at each output step. For each acoustic frame that is paired with a video frame, an audiovisual encoder combines the acoustic representation with that video frame to produce an audiovisual higher-order feature representation, and a joint network uses it to generate a probability distribution over possible speech recognition hypotheses. For acoustic frames that have no paired video frame, the joint network generates the distribution from the acoustic representation alone.

  • Receiving a sequence of acoustic frames
  • Generating, with an audio encoder, an acoustic higher-order feature representation at each output step
  • Generating, with an audiovisual encoder, an audiovisual higher-order feature representation for each acoustic frame paired with a video frame
  • Generating, with a joint network, a probability distribution over speech recognition hypotheses from the audiovisual representation when a video frame is available
  • Generating the same kind of distribution from the acoustic representation alone when an acoustic frame has no paired video frame
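
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch of the cascade and its audio-only fallback, not the implementation from the patent application: the function names (`audio_encoder`, `audiovisual_encoder`, `joint_network`, `recognize`), the toy arithmetic inside each function, the vocabulary size, and the use of `None` to mark a missing video frame are all assumptions made for the example. A real system would use trained neural networks in place of these stand-ins.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def audio_encoder(acoustic_frame: Vector) -> Vector:
    """Hypothetical stand-in: acoustic frame -> acoustic representation."""
    return [math.tanh(x) for x in acoustic_frame]


def audiovisual_encoder(acoustic_repr: Vector, video_frame: Vector) -> Vector:
    """Hypothetical stand-in: fuses the audio encoder's output with a video frame.

    It consumes the acoustic representation rather than the raw acoustic
    frame, which is what makes the two encoders a cascade.
    """
    video_summary = sum(video_frame) / len(video_frame)
    return [math.tanh(a + video_summary) for a in acoustic_repr]


def joint_network(representation: Vector, vocab_size: int = 4) -> Vector:
    """Hypothetical stand-in: representation -> distribution over hypotheses."""
    logits = [
        sum(r * math.cos(k + i) for i, r in enumerate(representation))
        for k in range(vocab_size)
    ]
    return softmax(logits)


def recognize(
    acoustic_frames: Sequence[Vector],
    video_frames: Sequence[Optional[Vector]],
) -> List[Vector]:
    """Return one probability distribution per acoustic frame."""
    distributions = []
    for acoustic_frame, video_frame in zip(acoustic_frames, video_frames):
        acoustic_repr = audio_encoder(acoustic_frame)
        if video_frame is not None:
            # Paired frame: the joint network sees the audiovisual representation.
            representation = audiovisual_encoder(acoustic_repr, video_frame)
        else:
            # Unpaired frame: fall back to the acoustic representation alone.
            representation = acoustic_repr
        distributions.append(joint_network(representation))
    return distributions


# Three acoustic frames; the second one has no paired video frame.
acoustic = [[0.1, -0.3, 0.5], [0.2, 0.0, -0.1], [0.4, 0.4, 0.4]]
video = [[0.9, 0.8], None, [0.1, 0.2]]
dists = recognize(acoustic, video)
```

Because both encoders in this sketch emit vectors of the same size, the same joint network can serve either branch, so a missing video frame changes only which representation is passed in.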

Potential Applications:
  • Speech recognition systems
  • Audiovisual content analysis
  • Multimodal communication technologies

Problems Solved:
  • Enhancing speech recognition accuracy
  • Improving audiovisual synchronization
  • Facilitating multimodal data processing

Benefits:
  • Increased accuracy in speech recognition
  • Enhanced audiovisual content analysis
  • Improved performance in multimodal communication

Commercial Applications:
Title: Advanced Speech Recognition and Audiovisual Analysis Technology
This technology can be utilized in various industries, such as:
  • Telecommunications
  • Media and entertainment
  • Security and surveillance

Questions about the technology:
  1. How does this technology improve speech recognition accuracy?
  2. What are the potential challenges in implementing this technology in real-world applications?

Frequently Updated Research: Stay updated on the latest advancements in speech recognition technology and audiovisual analysis to enhance the performance and capabilities of this innovation.


Original Abstract Submitted

A method includes receiving a sequence of acoustic frames and generating, by an audio encoder, at each of a plurality of output steps, an acoustic higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. For each acoustic frame in the sequence of acoustic frames paired with a corresponding video frame, the method includes generating, by an audiovisual encoder, an audiovisual higher-order feature representation for the corresponding acoustic higher-order feature frame and the corresponding video frame; and generating, by a joint network, at an output step, a probability distribution over possible speech recognition hypotheses based on the audiovisual higher-order feature representation. The method, for each corresponding acoustic frame in the sequence of acoustic frames not paired with a corresponding video frame, includes generating, by the joint network, at an output step, a probability distribution over possible speech recognition hypotheses based on the acoustic higher-order feature representation.