18676743. Unsupervised Learning of Disentangled Speech Content and Style Representation simplified abstract (Google LLC)

Unsupervised Learning of Disentangled Speech Content and Style Representation

Organization Name

Google LLC

Inventor(s)

Ruoming Pang of New York NY (US)

Andros Tjandra of Mountain View CA (US)

Yu Zhang of Mountain View CA (US)

Shigeki Karita of Mountain View CA (US)

Unsupervised Learning of Disentangled Speech Content and Style Representation - A simplified explanation of the abstract

This abstract first appeared for US patent application 18676743, titled 'Unsupervised Learning of Disentangled Speech Content and Style Representation'.

The abstract describes a linguistic content and speaking style disentanglement model that includes a content encoder, a style encoder, and a decoder. The content encoder generates a latent representation of linguistic content for input speech, while the style encoder generates a latent representation of speaking style for the same input speech. The model is trained to disentangle speaking style information from linguistic content and vice versa.

  • Content encoder receives input speech and generates latent representation of linguistic content
  • Style encoder receives input speech and generates latent representation of speaking style
  • Decoder generates output speech based on latent representations of linguistic content and speaking style
  • Model is trained to disentangle speaking style information from linguistic content
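For concreteness, below is a minimal PyTorch sketch of how three such modules could fit together. The module names, layer choices, dimensions, and the use of mean pooling for the style vector are illustrative assumptions, not details from the patent application.

```python
# Minimal sketch of the three components described above.
# Architectural choices here are assumptions, not the patent's design.
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Maps input speech features to a frame-level latent content sequence."""
    def __init__(self, feat_dim=80, content_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, content_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * content_dim, content_dim)

    def forward(self, speech):                 # speech: (batch, frames, feat_dim)
        hidden, _ = self.rnn(speech)
        return self.proj(hidden)               # (batch, frames, content_dim)


class StyleEncoder(nn.Module):
    """Summarizes input speech into a single utterance-level style vector."""
    def __init__(self, feat_dim=80, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, speech):
        hidden, _ = self.rnn(speech)
        return hidden.mean(dim=1)              # (batch, style_dim), pooled over frames


class Decoder(nn.Module):
    """Reconstructs speech features from content frames plus a style vector."""
    def __init__(self, content_dim=128, style_dim=64, feat_dim=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + style_dim, 256, batch_first=True)
        self.out = nn.Linear(256, feat_dim)

    def forward(self, content, style):
        # Broadcast the utterance-level style vector across all content frames.
        style_tiled = style.unsqueeze(1).expand(-1, content.size(1), -1)
        hidden, _ = self.rnn(torch.cat([content, style_tiled], dim=-1))
        return self.out(hidden)                # (batch, frames, feat_dim)
```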

Potential Applications:

  • Speech synthesis with different speaking styles
  • Voice conversion for personalized speech
  • Language translation with preserved speaking style

Problems Solved:

  • Separating linguistic content from speaking style
  • Enhancing speech synthesis and voice conversion accuracy

Benefits:

  • Improved naturalness and expressiveness in synthesized speech
  • Personalized voice conversion for various applications
  • Enhanced language translation with style preservation

Commercial Applications:

  • Title: Advanced Speech Synthesis and Voice Conversion Technology
  • Description: This technology can be used in virtual assistants, customer service bots, language learning apps, and the entertainment industry to create unique voices and enhance the user experience.

Questions about the technology:

  1. How does this technology improve speech synthesis compared to traditional methods?
  2. What are the potential challenges in implementing this technology in real-time applications?


Original Abstract Submitted

A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
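To illustrate the last sentence of the abstract, the snippet below reuses the hypothetical modules from the earlier sketch and pairs a content representation with a style representation taken either from the same utterance or from a different one. The L1 reconstruction loss is one plausible unsupervised training signal, not necessarily the objective described in the application.

```python
# Illustrative usage, assuming the ContentEncoder, StyleEncoder, and Decoder
# classes from the sketch above. Shapes and the loss choice are assumptions.
import torch
import torch.nn.functional as F

content_enc, style_enc, dec = ContentEncoder(), StyleEncoder(), Decoder()

speech_a = torch.randn(1, 200, 80)   # 200 frames of 80-dim speech features
speech_b = torch.randn(1, 150, 80)   # a different utterance

# Reconstruction: content and style both taken from the same utterance.
recon = dec(content_enc(speech_a), style_enc(speech_a))
loss = F.l1_loss(recon, speech_a)

# Style transfer: linguistic content of A rendered with the style of B,
# corresponding to decoding with "the same or different input speech".
converted = dec(content_enc(speech_a), style_enc(speech_b))
```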