Google LLC (20240112667). SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS simplified abstract

From WikiPatents

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Organization Name

Google LLC

Inventor(s)

Ye Jia of Mountain View CA (US)

Zhifeng Chen of Mountain View CA (US)

Yonghui Wu of Fremont CA (US)

Jonathan Shen of Mountain View CA (US)

Ruoming Pang of Mountain View CA (US)

Ron J. Weiss of New York NY (US)

Ignacio Lopez Moreno of Brooklyn NY (US)

Fei Ren of Mountain View CA (US)

Yu Zhang of Mountain View CA (US)

Quan Wang of Hoboken NJ (US)

Patrick An Phu Nguyen of Palo Alto CA (US)

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240112667, titled 'SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS'.

Simplified Explanation

The patent application describes methods, systems, and apparatus for synthesizing speech from text in the voice of a target speaker. A speaker encoder engine converts an audio sample of the target speaker into a speaker vector, and a spectrogram generation engine combines that vector with the input text to produce the synthesized audio. The actions are:

  • Obtaining an audio representation of speech of a target speaker
  • Obtaining input text for speech synthesis in the voice of the target speaker
  • Generating a speaker vector using a speaker encoder engine
  • Generating an audio representation of the input text spoken in the voice of the target speaker using a spectrogram generation engine
  • Providing the audio representation of the input text spoken in the voice of the target speaker for output
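
The five actions above can be sketched as a small pipeline. This is a minimal illustration only, not the method claimed in the application: the function names (toy_speaker_encoder, toy_spectrogram_generator, synthesize), the vector size, and the arithmetic inside each function are hypothetical stand-ins for trained neural networks. The sketch only shows how the data flows: reference audio goes into an encoder that yields a speaker vector, and the text plus that vector go into a generator that yields a spectrogram-like output.

```python
from typing import List, Sequence

Vector = List[float]
Spectrogram = List[List[float]]  # one row per frame, one column per frequency bin


def toy_speaker_encoder(reference_audio: Sequence[float], dim: int = 4) -> Vector:
    """Hypothetical stand-in for a trained speaker encoder.

    Folds the audio samples into a fixed-size vector and scales it to unit length.
    """
    sums = [0.0] * dim
    for i, sample in enumerate(reference_audio):
        sums[i % dim] += sample
    norm = sum(x * x for x in sums) ** 0.5 or 1.0  # avoid dividing by zero
    return [x / norm for x in sums]


def toy_spectrogram_generator(text: str, speaker_vector: Vector, bins: int = 4) -> Spectrogram:
    """Hypothetical stand-in for a trained spectrogram generator.

    Emits one frame per character; each frame is shifted by the speaker vector,
    so the same text yields different output for different speakers.
    """
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([base + speaker_vector[b % len(speaker_vector)] for b in range(bins)])
    return frames


def synthesize(reference_audio: Sequence[float], text: str) -> Spectrogram:
    """Runs the steps in order: encode the speaker, then generate conditioned output."""
    speaker_vector = toy_speaker_encoder(reference_audio)
    return toy_spectrogram_generator(text, speaker_vector)


# Made-up reference audio for two speakers, and the same input text for both.
speaker_a = [0.1, -0.2, 0.3, 0.4, 0.5]
speaker_b = [0.5, 0.1, 0.0, 0.0]
out_a = synthesize(speaker_a, "hi")
out_b = synthesize(speaker_b, "hi")
```

In this sketch the text and the speaker identity enter the generator as separate inputs, which mirrors the separation between the speaker encoder engine and the spectrogram generation engine described in the abstract.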

Potential Applications

This technology could be used in various applications such as:

  • Voice assistants
  • Voice cloning
  • Dubbing in movies or TV shows

Problems Solved

This technology helps in:

  • Creating personalized voice synthesis
  • Improving naturalness and quality of synthesized speech

Benefits

The benefits of this technology include:

  • Enhanced user experience with more natural and personalized voice synthesis
  • Efficient generation of speech in different voices

Potential Commercial Applications

Potential commercial applications of this technology include:

  • Voice banking services
  • Customized voice messages for businesses

Possible Prior Art

Possible prior art includes existing speech synthesis systems that use neural networks to generate speech from text. This application builds on such methods by conditioning the synthesis on a speaker vector derived from audio of the target speaker.

Unanswered Questions

How does this technology handle different accents or languages in speech synthesis?

The patent application does not provide specific details on how the system adapts to different accents or languages in speech synthesis.

What is the computational complexity of the proposed method for speech synthesis?

The patent application does not discuss the computational complexity of the proposed method for speech synthesis.


Original Abstract Submitted

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
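
The abstract says the speaker encoder engine is trained to distinguish speakers from one another, but it does not say how speaker vectors are compared. One common way to compare such vectors is cosine similarity, and the sketch below assumes that choice purely for illustration. The function names, the threshold of 0.75, and the three example vectors are all made up and do not come from the application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Returns the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("speaker vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float], threshold: float = 0.75) -> bool:
    """Hypothetical decision rule: vectors above the threshold count as one speaker."""
    return cosine_similarity(a, b) >= threshold


# Made-up speaker vectors: two utterances from one speaker, one from another.
alice_1 = [0.9, 0.1, 0.0]
alice_2 = [0.8, 0.2, 0.1]
bob = [0.1, 0.2, 0.9]

alice_matches_alice = same_speaker(alice_1, alice_2)
alice_matches_bob = same_speaker(alice_1, bob)
```

Under this assumption, an encoder that distinguishes speakers well would place utterances from the same speaker close together and utterances from different speakers far apart, as the made-up vectors here do.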