Google LLC (20240312449). Unsupervised Learning of Disentangled Speech Content and Style Representation simplified abstract

From WikiPatents

Unsupervised Learning of Disentangled Speech Content and Style Representation

Organization Name

Google LLC

Inventor(s)

Ruoming Pang of New York NY (US)

Andros Tjandra of Mountain View CA (US)

Yu Zhang of Mountain View CA (US)

Shigeki Karita of Mountain View CA (US)

Unsupervised Learning of Disentangled Speech Content and Style Representation - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240312449, titled 'Unsupervised Learning of Disentangled Speech Content and Style Representation'.

The patent application describes a model that can disentangle linguistic content and speaking style in speech.

  • Content encoder receives input speech and generates a latent representation of linguistic content.
  • Style encoder receives input speech and generates a latent representation of speaking style.
  • Decoder generates output speech based on the latent representations of content and style.
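The three components above can be sketched as follows. This is a minimal illustration, not the patented implementation: the mel-spectrogram input, the layer dimensions, and the class names are all assumptions chosen for clarity. The key structural point it demonstrates is that the content encoder produces a frame-level representation, the style encoder produces a single utterance-level vector, and the decoder can combine content from one utterance with style from another.

```python
import numpy as np

rng = np.random.default_rng(0)

class ContentEncoder:
    """Frame-level linguistic content: one latent vector per input frame (illustrative)."""
    def __init__(self, n_mels=80, dim=64):
        self.w = rng.standard_normal((n_mels, dim)) * 0.01
    def __call__(self, speech):           # speech: (frames, n_mels)
        return np.tanh(speech @ self.w)   # -> (frames, dim)

class StyleEncoder:
    """Utterance-level speaking style: a single latent vector (illustrative)."""
    def __init__(self, n_mels=80, dim=16):
        self.w = rng.standard_normal((n_mels, dim)) * 0.01
    def __call__(self, speech):                        # speech: (frames, n_mels)
        return np.tanh(speech.mean(axis=0) @ self.w)   # -> (dim,)

class Decoder:
    """Generates output frames from a content sequence plus a style vector."""
    def __init__(self, content_dim=64, style_dim=16, n_mels=80):
        self.w = rng.standard_normal((content_dim + style_dim, n_mels)) * 0.01
    def __call__(self, content, style):
        # Tile the single style vector across every content frame, then project.
        style_tiled = np.broadcast_to(style, (content.shape[0], style.shape[0]))
        return np.concatenate([content, style_tiled], axis=1) @ self.w

# Voice conversion: linguistic content from utterance A, speaking style from utterance B.
speech_a = rng.standard_normal((120, 80))   # 120 frames of 80-dim mel features
speech_b = rng.standard_normal((95, 80))
content_enc, style_enc, dec = ContentEncoder(), StyleEncoder(), Decoder()
out = dec(content_enc(speech_a), style_enc(speech_b))
print(out.shape)  # (120, 80): A's content length, rendered with B's style vector
```

Note that the output length follows the content utterance, while the style utterance contributes only a fixed-size vector; this mirrors the abstract's statement that the decoder may take style "for the same or different input speech".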

Potential Applications:

  • Speech synthesis with different speaking styles
  • Voice conversion for personalized speech
  • Language translation with style preservation

Problems Solved:

  • Separating content and style in speech data
  • Enhancing naturalness and expressiveness in speech synthesis
  • Improving cross-lingual voice conversion

Benefits:

  • Customizable speech generation
  • Enhanced naturalness in synthesized speech
  • Improved accuracy in voice conversion tasks

Commercial Applications:

This advanced speech synthesis and voice conversion technology can be used in industries such as:

  • Entertainment (creating unique character voices)
  • Customer service (personalized automated responses)
  • Language learning (accent adaptation in language courses)

Questions about the technology: 1. How does this model improve upon existing speech synthesis techniques?

  - This model allows for the separation of content and style in speech, enabling more customizable and natural-sounding speech synthesis.

2. Can this technology be applied to real-time speech processing?

  - Yes, with further optimization, this technology could potentially be used for real-time applications such as voice assistants or live translation services.


Original Abstract Submitted

A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.