20240020977. SYSTEM AND METHOD FOR MULTIMODAL VIDEO SEGMENTATION IN MULTI-SPEAKER SCENARIO simplified abstract (PING AN TECHNOLOGY (SHENZHEN) CO., LTD.)

From WikiPatents

Organization Name

PING AN TECHNOLOGY (SHENZHEN) CO., LTD.

Inventor(s)

Xinyi Wu of Palo Alto CA (US)

Tian Xia of Palo Alto CA (US)

Xinlu Yu of Palo Alto CA (US)

Ziyi Chen of Palo Alto CA (US)

Iek-Heng Chu of Palo Alto CA (US)

Sirui Xu of Palo Alto CA (US)

Mei Han of Palo Alto CA (US)

Jing Xiao of Palo Alto CA (US)

Peng Chang of Palo Alto CA (US)

SYSTEM AND METHOD FOR MULTIMODAL VIDEO SEGMENTATION IN MULTI-SPEAKER SCENARIO - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240020977, titled 'SYSTEM AND METHOD FOR MULTIMODAL VIDEO SEGMENTATION IN MULTI-SPEAKER SCENARIO'.

Simplified Explanation

The abstract describes a system and method that segment the transcript of a multi-speaker video into sentences and detect speaker changes between adjacent sentences. The video is then segmented into video clips based on the transcript and the speaker change information.

  • The transcript of a video with multiple speakers is segmented into sentences.
  • Speaker change information is detected between each pair of adjacent sentences based on the video's audio content, visual content, or both.
  • The video is segmented into video clips using the transcript and speaker change information.
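
The three steps above can be sketched in Python. This is only an illustrative sketch, not the method claimed in the application: the Sentence type, the split_sentences and segment_clips helpers, the word-level timestamps, and the rule that a new clip starts at every flagged speaker change are all assumptions made for the example. The speaker change flags are supplied by hand here; in a real system they would come from an audio or visual detector, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the start of the video
    end: float


def split_sentences(words: List[Tuple[str, float, float]]) -> List[Sentence]:
    """Group (word, start, end) tuples into sentences.

    Assumption: a sentence ends at a word that ends with '.', '?' or '!'.
    """
    sentences: List[Sentence] = []
    current: List[Tuple[str, float, float]] = []
    for word, start, end in words:
        current.append((word, start, end))
        if word.endswith((".", "?", "!")):
            text = " ".join(w for w, _, _ in current)
            sentences.append(Sentence(text, current[0][1], current[-1][2]))
            current = []
    if current:  # trailing words without closing punctuation
        text = " ".join(w for w, _, _ in current)
        sentences.append(Sentence(text, current[0][1], current[-1][2]))
    return sentences


def segment_clips(
    sentences: List[Sentence], changes: List[bool]
) -> List[Tuple[float, float]]:
    """Merge adjacent sentences into (start, end) clips.

    changes[i] is True when the speaker changes between sentences[i]
    and sentences[i + 1]. Assumption: each flagged change starts a new clip.
    """
    if len(changes) != max(len(sentences) - 1, 0):
        raise ValueError("need exactly one flag per pair of adjacent sentences")
    clips: List[Tuple[float, float]] = []
    for i, sentence in enumerate(sentences):
        if i > 0 and not changes[i - 1]:
            # Same speaker as the previous sentence: extend the current clip.
            clip_start, _ = clips[-1]
            clips[-1] = (clip_start, sentence.end)
        else:
            clips.append((sentence.start, sentence.end))
    return clips


# Toy transcript with word-level timestamps (made-up data).
words = [
    ("Hello", 0.0, 0.4), ("there.", 0.4, 0.9),
    ("How", 1.0, 1.2), ("are", 1.2, 1.4), ("you?", 1.4, 1.8),
    ("Fine,", 2.5, 2.9), ("thanks.", 2.9, 3.4),
]
sentences = split_sentences(words)
# Hand-written flags: same speaker for sentences 1-2, new speaker for sentence 3.
changes = [False, True]
clips = segment_clips(sentences, changes)
print(clips)  # [(0.0, 1.8), (2.5, 3.4)]
```

Representing the speaker change information as one flag per pair of adjacent sentences keeps the clip-cutting step independent of how the flags are produced, so an audio-based or visual-based detector could be swapped in without changing segment_clips.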

Potential Applications:

  • Video editing software could use this technology to automatically segment videos with multiple speakers into clips for easier editing.
  • Transcription services could benefit from this technology by automatically segmenting video transcripts into sentences and identifying speaker changes.

Problems Solved:

  • Manual segmentation of videos with multiple speakers can be time-consuming and tedious. This technology automates the process, saving time and effort.
  • Identifying speaker changes in a video can be challenging, especially when there are multiple speakers. This technology detects speaker changes based on audio content, visual content, or both.

Benefits:

  • Increased efficiency in video editing and transcription processes.
  • Improved accuracy in segmenting videos and detecting speaker changes.
  • Reduced need for manual intervention in video segmentation tasks.


Original Abstract Submitted

A system and method for multimodal video segmentation in a multi-speaker scenario are provided. A transcript of a video with a plurality of speakers is segmented into a plurality of sentences. Speaker change information is detected between each two adjacent sentences of the plurality of sentences based on at least one of audio content or visual content of the video. The video is segmented into a plurality of video clips based on the transcript of the video and the speaker change information.