18020851. ELECTRONIC DEVICE, SPEECH RECOGNITION METHOD THEREFOR, AND MEDIUM simplified abstract (HUAWEI TECHNOLOGIES CO., LTD.)

From WikiPatents

ELECTRONIC DEVICE, SPEECH RECOGNITION METHOD THEREFOR, AND MEDIUM

Organization Name

HUAWEI TECHNOLOGIES CO., LTD.

Inventor(s)

Lei Qin of Shenzhen (CN)

Lele Zhang of Shenzhen (CN)

Hao Liu of Beijing (CN)

Yuewan Lu of Shenzhen (CN)

ELECTRONIC DEVICE, SPEECH RECOGNITION METHOD THEREFOR, AND MEDIUM - A simplified explanation of the abstract

This abstract first appeared for US patent application 18020851, titled 'ELECTRONIC DEVICE, SPEECH RECOGNITION METHOD THEREFOR, AND MEDIUM'.

Simplified Explanation

Embodiments of this application provide a speech recognition method that combines facial depth imaging with audio analysis. The method obtains a facial depth image of the user, captured with a depth camera, together with the voice to be recognized. It recognizes a mouth shape feature from the depth image and a voice feature from the audio, fuses the two into a single audio-video feature, and recognizes the user's utterance from that fused feature.

  • The method involves obtaining a facial depth image and a voice recording from a user.
  • The facial depth image is captured using a depth camera.
  • The method recognizes the shape of the user's mouth from the facial depth image.
  • Voice features are extracted from the audio recording.
  • The voice and mouth shape features are fused together to create an audio-video feature.
  • The audio-video feature is used to recognize the voice utterance made by the user.
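
The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the abstract does not say how the mouth shape feature, the voice feature, the fusion, or the final recognition are computed. The function names, the row-wise depth descriptor, the per-frame RMS energy, fusion by concatenation, and nearest-template matching are all hypothetical stand-ins chosen to keep the example short.

```python
import math
from typing import Dict, List, Sequence

DepthImage = List[List[float]]  # rows of depth values from a depth camera


def mouth_shape_feature(depth: DepthImage, top: int, bottom: int,
                        left: int, right: int) -> List[float]:
    """Hypothetical descriptor: mean depth of each row inside the mouth box,
    measured relative to the nearest row (assumption, not from the abstract)."""
    rows = [row[left:right] for row in depth[top:bottom]]
    row_means = [sum(r) / len(r) for r in rows]
    base = min(row_means)
    return [m - base for m in row_means]


def voice_feature(samples: Sequence[float], frame_len: int) -> List[float]:
    """Hypothetical descriptor: RMS energy of each non-overlapping frame."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]


def fuse(voice: List[float], mouth: List[float]) -> List[float]:
    """Assumed fusion: plain concatenation into one audio-video feature."""
    return voice + mouth


def recognize(feature: List[float], templates: Dict[str, List[float]]) -> str:
    """Assumed recognizer: label of the closest stored template."""
    return min(templates, key=lambda label: math.dist(feature, templates[label]))


# Toy inputs: a 4x4 depth image and four audio samples.
depth_image = [
    [500.0, 500.0, 500.0, 500.0],
    [500.0, 498.0, 498.0, 500.0],
    [500.0, 490.0, 490.0, 500.0],
    [500.0, 510.0, 510.0, 500.0],
]
audio = [0.0, 0.0, 1.0, -1.0]

mouth = mouth_shape_feature(depth_image, top=2, bottom=4, left=1, right=3)
voice = voice_feature(audio, frame_len=2)
audio_video = fuse(voice, mouth)

# Hypothetical templates for two utterances.
templates = {
    "open": [0.0, 1.0, 0.0, 20.0],
    "closed": [0.0, 1.0, 0.0, 0.0],
}
print(recognize(audio_video, templates))
```

A practical system would likely replace each stage with a learned model. The sketch only shows the data flow from the two inputs to one fused feature and a single recognition step.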

Potential applications of this technology:

  • Speech recognition systems: This method can be used in speech recognition systems to improve accuracy by combining visual information from the user's mouth shape with audio analysis.
  • Human-computer interaction: The technology can be used to enable more natural and intuitive interactions between users and computers, such as voice-controlled interfaces that also take into account visual cues from the user's mouth movements.
  • Accessibility: The method can benefit individuals with speech impairments by providing a more accurate and robust speech recognition system that incorporates visual information.

Problems solved by this technology:

  • Limited accuracy of audio-only recognition: Combining the user's mouth shape with audio analysis gives the recognizer a second source of evidence, which can improve accuracy.
  • Sensitivity to audio conditions: Fusing audio and visual features can make recognition more robust to poor audio quality or background noise, because the mouth shape feature comes from the depth image rather than the microphone.
  • Less natural interaction: Taking the user's mouth movements into account allows voice interfaces to behave more naturally and intuitively.
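
The robustness point can be illustrated with a toy example. The numbers below are invented for illustration and are not taken from the patent application: two syllables have similar voice features but clearly different mouth shape features, and the observed audio is assumed to be distorted by noise. Nearest-template matching and fusion by concatenation are assumptions as well.

```python
import math

# Hypothetical templates: (voice feature, mouth shape feature) per syllable.
# The two syllables sound alike but look different (lips closed vs. open).
TEMPLATES = {
    "ba": ([1.0, 0.0], [1.0, 0.0]),
    "da": ([0.8, 0.2], [0.0, 1.0]),
}


def nearest(observed, candidates):
    """Return the label whose template is closest in Euclidean distance."""
    return min(candidates, key=lambda label: math.dist(observed, candidates[label]))


def recognize_audio_only(voice):
    """Match on the voice feature alone."""
    return nearest(voice, {k: v for k, (v, _) in TEMPLATES.items()})


def recognize_fused(voice, mouth):
    """Match on the concatenated audio-video feature (assumed fusion)."""
    return nearest(voice + mouth, {k: v + m for k, (v, m) in TEMPLATES.items()})


# The user says "ba", but noise pushes the voice feature towards "da".
noisy_voice = [0.85, 0.15]
observed_mouth = [0.9, 0.1]

print(recognize_audio_only(noisy_voice))
print(recognize_fused(noisy_voice, observed_mouth))
```

In this constructed case the audio-only match picks the wrong syllable, while the fused feature recovers the intended one. Real gains depend on the features and models actually used.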

Benefits of this technology:

  • Enhanced speech recognition accuracy: Fusing audio and visual features can reduce recognition errors compared with using audio alone.
  • Improved accessibility: Individuals with speech impairments may benefit from a recognizer that also uses visual information from the mouth.
  • Natural and intuitive interaction: Voice-controlled interfaces can respond to what the user says and to how the mouth moves while saying it.


Original Abstract Submitted

Embodiments of this application provide a speech recognition method. The speech recognition method includes: obtaining a facial depth image and a to-be-recognized voice of a user, where the facial depth image is an image collected by using a depth camera; recognizing a mouth shape feature from the facial depth image, and recognizing a voice feature from a to-be-recognized audio; and fusing the voice feature and the mouth shape feature into an audio-video feature, and recognizing, based on the audio-video feature, a voice uttered by the user.