University of Rochester (20240305944). FIRST-PERSON AUDIO-VISUAL OBJECT LOCALIZATION SYSTEMS AND METHODS simplified abstract

From WikiPatents

FIRST-PERSON AUDIO-VISUAL OBJECT LOCALIZATION SYSTEMS AND METHODS

Organization Name

University of Rochester

Inventor(s)

Chenliang Xu of Pittsford NY (US)

Chao Huang of Rochester NY (US)

Yapeng Tian of Plano TX (US)

FNU Anurag Kumar of Bothell WA (US)

FIRST-PERSON AUDIO-VISUAL OBJECT LOCALIZATION SYSTEMS AND METHODS - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240305944, titled 'FIRST-PERSON AUDIO-VISUAL OBJECT LOCALIZATION SYSTEMS AND METHODS'.

Simplified Explanation

This patent application describes a localization system that uses images and synchronized audio from a video source to correlate audio elements with visual features and estimate geometric transformations between images.

Key Features and Innovation
  • Image input receives images from a video source.
  • Audio input receives synchronized audio from the video source.
  • Audio feature disentanglement network correlates audio elements with visual features.
  • Geometry-based feature aggregation module estimates geometric transformations between images and aggregates visual features.
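One way to picture how an audio element could be correlated with visual features is a similarity map over a grid of image features. The sketch below is only an illustration under stated assumptions: the abstract does not say how the correlation is computed, so the cosine-similarity scoring, the two-dimensional toy embeddings, and the helper names `cosine`, `localization_map`, and `argmax_cell` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def localization_map(audio_embedding, visual_features):
    """Score every cell of a grid of visual feature vectors against one audio embedding."""
    return [[cosine(audio_embedding, cell) for cell in row] for row in visual_features]

def argmax_cell(score_map):
    """Return the (row, column) of the highest-scoring cell."""
    cells = ((r, c) for r, row in enumerate(score_map) for c in range(len(row)))
    return max(cells, key=lambda rc: score_map[rc[0]][rc[1]])

# Toy 2x2 feature grid; each cell holds a made-up 2-D visual feature vector.
visual = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.5, 0.5]],
]
# Made-up embedding for one separated audio element (assumed to share the visual feature space).
audio = [0.0, 2.0]

scores = localization_map(audio, visual)
peak = argmax_cell(scores)  # the cell whose visual feature best matches the audio element
```

In this toy setup the highest score falls on the cell whose feature vector points in the same direction as the audio embedding. A real system would learn both embeddings rather than hand-pick them.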

Potential Applications

This technology can be used in augmented reality, virtual reality, video editing, and surveillance systems.

Problems Solved

The system addresses the challenge of associating distinct audio elements with their corresponding visual features in a video stream.

Benefits
  • Improved synchronization of audio and visual elements.
  • Enhanced accuracy in estimating geometric transformations.
  • Increased efficiency in processing audio-visual data.

Commercial Applications

Potential commercial uses include AR/VR applications, video production tools, and security systems. The market implications include improved user experiences and enhanced data analysis capabilities.

Questions about the Technology

1. How does the audio feature disentanglement network correlate audio elements with visual features?
2. What are the primary advantages of using a geometry-based feature aggregation module in this localization system?


Original Abstract Submitted

A localization system may include an image input that receives images from a video source and an audio input that receives, from the video source, audio synchronized with the images. The localization system may also include an audio feature disentanglement network that correlates distinct audio elements from the audio input with corresponding visual features from the image input. Additionally, the localization system may include a geometry-based feature aggregation module that estimates a geometric transformation between two or more images from the video source and aggregates the visual features. Various other devices, systems, and methods are also disclosed.
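The geometry-based aggregation step can be illustrated with a deliberately simplified sketch. The abstract does not specify the type of geometric transformation or the aggregation rule, so the code below assumes a pure integer translation estimated from matched points and a plain average of aligned features. The helper names `estimate_translation`, `warp`, and `aggregate`, and the single-channel toy grids, are hypothetical.

```python
def estimate_translation(points_a, points_b):
    """Least-squares translation taking points_a onto points_b (the mean displacement)."""
    n = len(points_a)
    dx = sum(bx - ax for (ax, _), (bx, _) in zip(points_a, points_b)) / n
    dy = sum(by - ay for (_, ay), (_, by) in zip(points_a, points_b)) / n
    return dx, dy

def warp(grid, dx, dy, fill=0.0):
    """Shift a 2-D grid by integer (dx, dy); cells with no source value get `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # source cell in the un-warped grid
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = grid[sy][sx]
    return out

def aggregate(current, neighbor, dx, dy):
    """Average the current grid with the neighbor grid warped into the current frame."""
    aligned = warp(neighbor, dx, dy)
    return [[(c + a) / 2 for c, a in zip(c_row, a_row)]
            for c_row, a_row in zip(current, aligned)]

# Matched (x, y) points: positions in a neighboring frame and in the current frame.
neighbor_pts = [(0, 1), (1, 0)]
current_pts = [(1, 1), (2, 0)]
dx, dy = estimate_translation(neighbor_pts, current_pts)
shift = (round(dx), round(dy))  # assumption: whole-cell shifts only

# Toy single-channel feature grids; the same object sits one cell apart in the two frames.
neighbor = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
current = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
fused = aggregate(current, neighbor, *shift)
```

After alignment, the object's feature response from the neighboring frame lands on the same cell as in the current frame, so averaging reinforces it instead of smearing it across two cells. First-person video typically involves camera motion that a translation cannot capture, so a richer transformation model would likely be needed in practice.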