Google LLC (20240265917). Cascaded Audiovisual Automatic Speech Recognition Models simplified abstract

From WikiPatents

Cascaded Audiovisual Automatic Speech Recognition Models

Organization Name

Google LLC

Inventor(s)

Oscar Chang of New York NY (US)

Cascaded Audiovisual Automatic Speech Recognition Models - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240265917 titled 'Cascaded Audiovisual Automatic Speech Recognition Models'.

The method described in the abstract processes a sequence of acoustic frames with an audio encoder, which produces an acoustic higher-order feature representation for the corresponding frame at each output step. Where an acoustic frame is paired with a video frame, an audiovisual encoder combines the acoustic representation and that video frame into an audiovisual higher-order feature representation. A joint network then generates a probability distribution over possible speech recognition hypotheses, using the audiovisual representation when a paired video frame exists and the acoustic representation alone when it does not.

  • Receiving a sequence of acoustic frames and generating, with an audio encoder, an acoustic higher-order feature representation at each output step.
  • For each acoustic frame paired with a video frame, generating an audiovisual higher-order feature representation from the acoustic representation and the video frame, using an audiovisual encoder.
  • Using a joint network to generate a probability distribution over possible speech recognition hypotheses from the audiovisual representation.
  • For acoustic frames not paired with a video frame, generating the probability distribution from the acoustic representation alone, using the same joint network.
  • Utilizing audio and video data to improve speech recognition accuracy.
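
The routing described in the steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`audio_encoder`, `audiovisual_encoder`, `joint_network`, `recognize`), the use of `None` to mark a missing video frame, and the toy arithmetic inside each function are hypothetical stand-ins for learned networks, not the implementation described in the patent application. The only part taken from the abstract is the control flow: every acoustic frame passes through the audio encoder, and the joint network receives the audiovisual representation when a paired video frame exists and the acoustic representation otherwise.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def audio_encoder(acoustic_frame: Vector) -> Vector:
    # Toy stand-in (assumption): a real audio encoder is a learned network.
    return [math.tanh(x) for x in acoustic_frame]


def audiovisual_encoder(acoustic_repr: Vector, video_frame: Vector) -> Vector:
    # Toy fusion (assumption): shift each acoustic feature by the mean of
    # the video frame. The output keeps the acoustic representation's size,
    # so the joint network can consume either representation.
    video_summary = sum(video_frame) / len(video_frame)
    return [math.tanh(a + video_summary) for a in acoustic_repr]


def joint_network(representation: Vector, vocab_size: int = 4) -> Vector:
    # Toy projection (assumption): map the representation to one score per
    # hypothesis, then normalize with softmax.
    pooled = sum(representation)
    logits = [pooled * (k + 1) / vocab_size for k in range(vocab_size)]
    return softmax(logits)


def recognize(
    acoustic_frames: Sequence[Vector],
    video_frames: Sequence[Optional[Vector]],
) -> List[Vector]:
    """Return one probability distribution per acoustic frame.

    video_frames[i] is None when acoustic frame i has no paired video frame.
    """
    distributions = []
    for acoustic_frame, video_frame in zip(acoustic_frames, video_frames):
        acoustic_repr = audio_encoder(acoustic_frame)
        if video_frame is not None:
            # Paired frame: cascade through the audiovisual encoder.
            joint_input = audiovisual_encoder(acoustic_repr, video_frame)
        else:
            # Unpaired frame: fall back to the acoustic representation.
            joint_input = acoustic_repr
        distributions.append(joint_network(joint_input))
    return distributions


if __name__ == "__main__":
    acoustic = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    video = [[1.0, 0.0, 0.5], None, [0.2, 0.2, 0.2]]  # middle frame has no video
    for step, dist in enumerate(recognize(acoustic, video)):
        print(step, [round(p, 3) for p in dist])
```

In this sketch the same joint network serves both branches, which is why the two encoders are written to emit representations of the same size. Audio-only input is handled by passing `None` for every video frame.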

Potential Applications:

  • Speech recognition systems
  • Audiovisual content analysis
  • Multimodal data processing

Problems Solved:

  • Enhancing speech recognition accuracy
  • Integrating audio and video data effectively
  • Improving overall performance of audiovisual systems

Benefits:

  • Increased accuracy in speech recognition
  • Enhanced understanding of audiovisual content
  • Improved performance of multimodal systems

Commercial Applications:

Title: Enhanced Speech Recognition System with Audiovisual Integration

This technology can be applied in various industries, such as:

  • Security and surveillance
  • Virtual assistants
  • Video conferencing platforms

Questions about the technology:

  1. How does the integration of audio and video data improve speech recognition accuracy?
  2. What are the potential challenges in implementing this technology in real-world applications?


Original Abstract Submitted

A method includes receiving a sequence of acoustic frames and generating, by an audio encoder, at each of a plurality of output steps, an acoustic higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. For each acoustic frame in the sequence of acoustic frames paired with a corresponding video frame, the method includes generating, by an audiovisual encoder, an audiovisual higher-order feature representation for the corresponding acoustic higher-order feature frame and the corresponding video frame; and generating, by a joint network, at an output step, a probability distribution over possible speech recognition hypotheses based on the audiovisual higher-order feature representation. The method, for each corresponding acoustic frame in the sequence of acoustic frames not paired with a corresponding video frame, includes generating, by the joint network, at an output step, a probability distribution over possible speech recognition hypotheses based on the acoustic higher-order feature representation.