Google LLC (20240282294). Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts simplified abstract

From WikiPatents

Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts

Organization Name

Google LLC

Inventor(s)

Qingqing Huang of Palo Alto CA (US)

Daniel Sung-Joon Park of Hoboken NJ (US)

Aren Jansen of Mountain View CA (US)

Timo Immanuel Denk of Zürich (CH)

Yue Li of Mountain View CA (US)

Ravi Ganti of Santa Clara CA (US)

Dan Ellis of New York NY (US)

Tao Wang of Sunnyvale CA (US)

Wei Han of Redwood City CA (US)

Joonseok Lee of Fremont CA (US)

Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240282294 titled 'Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts'.

Simplified Explanation:

This patent application describes a method in which a machine-learned text generation model produces a corpus of sentences, each describing a type of audio. A machine-learned audio classification model then processes a set of audio recordings, pairing each recording with the sentences closest to it in the model's joint audio-text embedding space to form training data. The matched sentences are processed with a machine-learned generation model to obtain an intermediate representation, which a machine-learned cascaded diffusion model processes to produce audio data; the diffusion model is trained on the difference between that audio data and the original recording.
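The matching step can be illustrated with a small sketch. The embedding dimensions, function names, and toy data below are illustrative assumptions, not details from the patent: the idea is simply to rank candidate caption sentences by cosine similarity to an audio clip inside a shared embedding space.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def closest_sentences(audio_emb, sentence_embs, k=2):
    """Return indices of the k caption sentences closest to the audio
    embedding (by cosine similarity) in the joint audio-text space."""
    a = l2_normalize(audio_emb)
    s = l2_normalize(sentence_embs)
    sims = s @ a                      # one cosine similarity per sentence
    return np.argsort(-sims)[:k]      # most similar first

# Toy embeddings: 4 candidate caption sentences, one audio clip whose
# embedding happens to resemble sentence 2.
rng = np.random.default_rng(0)
sentence_embs = rng.normal(size=(4, 8))
audio_emb = sentence_embs[2] + 0.01 * rng.normal(size=8)

print(closest_sentences(audio_emb, sentence_embs, k=2))
```

In a real system the embeddings would come from the trained audio and text towers of the classification model; the selected sentences then serve as pseudo-captions for the recording.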

  • The patent application involves generating textual data describing audio types using a machine-learned text generation model.
  • A machine-learned audio classification model processes audio recordings to build training data pairing each recording with matching generated sentences.
  • The method uses a joint audio-text embedding space to match audio recordings with descriptive sentences.
  • An intermediate representation of the sentences is created using a machine-learned generation model.
  • A machine-learned cascaded diffusion model processes the intermediate representation to obtain audio data, and is trained based on the difference between that audio data and the original recording.
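The training signal in the last bullet can be sketched as a standard denoising-diffusion objective. This is a hedged illustration, not the patent's implementation: the dummy model, array sizes, and noise schedule value are all assumptions. One step noises the target audio, asks the model to predict the noise given the noisy audio and the text-derived conditioning, and scores the prediction against the true noise.

```python
import numpy as np

rng = np.random.default_rng(1)

def diffusion_training_loss(model, audio, cond, alpha_bar):
    """MSE between the true injected noise and the model's noise estimate
    at one diffusion timestep (alpha_bar is the cumulative noise level)."""
    eps = rng.normal(size=audio.shape)                          # true noise
    noisy = np.sqrt(alpha_bar) * audio + np.sqrt(1 - alpha_bar) * eps
    eps_hat = model(noisy, cond)                                # predicted noise
    return np.mean((eps_hat - eps) ** 2)                        # difference driving training

def dummy_model(noisy, cond):
    # Stand-in for the cascaded diffusion network's noise prediction.
    return np.zeros_like(noisy)

audio = rng.normal(size=16)   # toy target audio frame
cond = rng.normal(size=8)     # toy intermediate (text-derived) representation
loss = diffusion_training_loss(dummy_model, audio, cond, alpha_bar=0.9)
print(loss)
```

Minimizing this loss over many recordings is one common way a conditioned diffusion model is fit to reproduce the original audio.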

Potential Applications: This technology could be applied to text-to-audio content generation, audio classification, and automatic labeling of audio datasets.

Problems Solved: This technology addresses the scarcity of paired audio-text training data, by generating descriptive sentences with a text model and matching them to recordings, and the broader challenge of generating and classifying audio based on textual descriptions.

Benefits: The benefits of this technology include improved audio classification accuracy, enhanced data processing capabilities, and streamlined content generation processes.

Commercial Applications: This technology could be utilized in industries such as music streaming services, podcast platforms, and speech recognition systems to enhance audio processing and classification capabilities.

Prior Art: Researchers and developers can explore prior art related to machine-learned audio classification models, text generation models, and cascaded diffusion models to understand the existing technology landscape.

Frequently Updated Research: Researchers are constantly working on improving machine learning models for audio processing and text generation, which could lead to advancements in this technology.

Questions about the Technology:

  1. How does this technology improve the efficiency of audio classification processes?
  2. What are the potential limitations of using machine-learned models for audio data processing?


Original Abstract Submitted

A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
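The "cascaded" aspect of the diffusion model can be sketched structurally: a base stage generates a coarse, low-sample-rate signal from the intermediate representation, and an upsampling stage refines it to the target rate. The functions, lengths, and upsampling factor below are illustrative assumptions (the stages are placeholders, not actual diffusion samplers).

```python
import numpy as np

def base_stage(intermediate, coarse_len=400):
    # Placeholder for the first diffusion stage: "generate" a coarse
    # signal conditioned (here, only via the seed) on the representation.
    seed = int(abs(intermediate.sum()) * 1e3) % 2**32
    rng = np.random.default_rng(seed)
    return rng.normal(size=coarse_len)

def upsample_stage(coarse, factor=4):
    # Placeholder for the super-resolution stage: naive sample repetition
    # stands in for a second, upsampling diffusion model.
    return np.repeat(coarse, factor)

def cascaded_generate(intermediate):
    """Chain the stages: coarse generation, then refinement to full rate."""
    coarse = base_stage(intermediate)
    return upsample_stage(coarse)

audio = cascaded_generate(np.ones(8))
print(audio.shape)  # (1600,)
```

In practice each stage would itself be a trained diffusion model, with later stages conditioned on the earlier stage's output as well as the text-derived representation.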