18525475. SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS simplified abstract (GOOGLE LLC)


SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Organization Name

GOOGLE LLC

Inventor(s)

Ye Jia of Mountain View CA (US)

Zhifeng Chen of Mountain View CA (US)

Yonghui Wu of Fremont CA (US)

Jonathan Shen of Mountain View CA (US)

Ruoming Pang of Mountain View CA (US)

Ron J. Weiss of New York NY (US)

Ignacio Lopez Moreno of Brooklyn NY (US)

Fei Ren of Mountain View CA (US)

Yu Zhang of Mountain View CA (US)

Quan Wang of Hoboken NJ (US)

Patrick An Phu Nguyen of Palo Alto CA (US)

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS - A simplified explanation of the abstract

This abstract first appeared for US patent application 18525475, titled 'SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS'.

Simplified Explanation

The patent application describes methods, systems, and apparatus for synthesizing speech from text in the voice of a target speaker. A speaker encoder engine converts an audio representation of the target speaker's speech into a speaker vector, and a spectrogram generation engine uses that vector together with the input text to generate an audio representation of the text spoken in the target speaker's voice. The main steps are:

  • Obtaining an audio representation of speech of a target speaker
  • Obtaining input text for speech synthesis in the voice of the target speaker
  • Generating a speaker vector using a speaker encoder engine
  • Generating an audio representation of the input text in the voice of the target speaker using a spectrogram generation engine
  • Providing the audio representation of the synthesized speech for output
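
The steps above can be sketched as a two-stage pipeline. The following Python example is only an illustration of the data flow, not the method claimed in the application: the class names ToySpeakerEncoder and ToySpectrogramGenerator, the synthesize function, and the arithmetic inside them are all hypothetical stand-ins for trained neural networks, which the abstract does not describe in detail.

```python
import math
from typing import List, Sequence

Vector = List[float]
Spectrogram = List[List[float]]  # one inner list per output frame


class ToySpeakerEncoder:
    """Hypothetical stand-in for the speaker encoder engine (untrained)."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def embed(self, audio: Sequence[float]) -> Vector:
        # Pool the audio samples into `dim` buckets, then L2-normalise.
        # A real encoder would be a neural network trained to tell
        # speakers apart; this arithmetic is only a placeholder.
        sums = [0.0] * self.dim
        for i, sample in enumerate(audio):
            sums[i % self.dim] += sample
        norm = math.sqrt(sum(s * s for s in sums)) or 1.0
        return [s / norm for s in sums]


class ToySpectrogramGenerator:
    """Hypothetical stand-in for the spectrogram generation engine."""

    def generate(self, text: str, speaker_vector: Vector) -> Spectrogram:
        # One frame per character, scaled by the speaker vector, so the
        # output depends on both the text and the speaker.
        return [[(ord(ch) / 255.0) * v for v in speaker_vector] for ch in text]


def synthesize(
    reference_audio: Sequence[float],
    text: str,
    encoder: ToySpeakerEncoder,
    generator: ToySpectrogramGenerator,
) -> Spectrogram:
    # Generate the speaker vector from the target speaker's audio.
    speaker_vector = encoder.embed(reference_audio)
    # Generate the audio representation of the text in that voice
    # and return it for output.
    return generator.generate(text, speaker_vector)


if __name__ == "__main__":
    frames = synthesize(
        [0.1, 0.2, 0.3, 0.4, 0.5], "hi",
        ToySpeakerEncoder(), ToySpectrogramGenerator(),
    )
    print(len(frames), "frames of", len(frames[0]), "values each")
```

The point of the sketch is the separation of concerns: the speaker's identity enters the generator only through the speaker vector, so the same generator can in principle be reused for any speaker whose audio the encoder can embed.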

Potential Applications

This technology can be used in various applications such as:

  • Virtual assistants
  • Voice-controlled devices
  • Language translation services

Problems Solved

This technology helps in:

  • Creating more natural and personalized speech synthesis
  • Enhancing user experience in interacting with AI systems
  • Improving accessibility for individuals with speech impairments

Benefits

The benefits of this technology include:

  • Customizable speech synthesis in different voices
  • Improved accuracy and naturalness in synthesized speech
  • Enhanced user engagement and interaction with AI systems

Potential Commercial Applications

This technology can be applied in commercial settings such as:

  • Call centers for automated customer service
  • Entertainment industry for voiceovers and dubbing
  • Education sector for language learning applications

Possible Prior Art

One example of prior art in speech synthesis technology is the use of neural network models for generating speech from text inputs. Another example is the development of voice cloning techniques for replicating a specific speaker's voice.

Unanswered Questions

How does this technology handle different accents and dialects in speech synthesis?

The patent application does not specifically address how the system adapts to various accents and dialects when synthesizing speech in the voice of the target speaker. It is unclear whether the technology can reproduce regional variations in speech patterns.

What is the computational complexity of the speech synthesis process described in the patent application?

The patent application does not provide information on the computational resources required for the speech synthesis process. Understanding the computational complexity of the system would be crucial for assessing its feasibility and scalability in real-world applications.


Original Abstract Submitted

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
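
The abstract states that the speaker encoder engine is trained to distinguish speakers from one another, but it does not say how speaker vectors are compared. As a minimal sketch, assuming cosine similarity is the comparison (a common choice for embedding vectors, though an assumption here), vectors from the same speaker should score closer to 1 than vectors from different speakers. The vectors below are invented for illustration and do not come from any trained model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical speaker vectors: two utterances from one speaker and
# one utterance from a different speaker.
speaker_a_utt1 = [0.9, 0.1, 0.0]
speaker_a_utt2 = [0.8, 0.2, 0.1]
speaker_b_utt1 = [0.1, 0.2, 0.9]

same = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
different = cosine_similarity(speaker_a_utt1, speaker_b_utt1)
print(f"same speaker: {same:.3f}, different speakers: {different:.3f}")
```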