18421116. Parallel Tacotron Non-Autoregressive and Controllable TTS simplified abstract (Google LLC)


Parallel Tacotron Non-Autoregressive and Controllable TTS

Organization Name

Google LLC

Inventor(s)

Isaac Elias of Mountain View, CA (US)

Jonathan Shen of Mountain View, CA (US)

Yu Zhang of Mountain View, CA (US)

Ye Jia of Mountain View, CA (US)

Ron J. Weiss of New York, NY (US)

Yonghui Wu of Fremont, CA (US)

Byungha Chun of Tokyo (JP)

Parallel Tacotron Non-Autoregressive and Controllable TTS - A simplified explanation of the abstract

This abstract first appeared for US patent application 18421116, titled 'Parallel Tacotron Non-Autoregressive and Controllable TTS'.

Simplified Explanation

The method for training a non-autoregressive TTS model encodes a reference audio signal and an input text sequence, predicts a duration for each phoneme, generates mel-frequency spectrogram sequences, and trains the model on the resulting duration and spectrogram losses (a code sketch of the loss computation follows the list below).

  • Encoding the reference audio signal into a variational embedding that disentangles style/prosody information from the audio.
  • Encoding the input text sequence into an encoded text sequence.
  • Predicting a phoneme duration for each phoneme in the input text sequence.
  • Determining a phoneme duration loss from the predicted phoneme durations and a reference phoneme duration.
  • Generating one or more predicted mel-frequency spectrogram sequences for the input text sequence.
  • Determining a final spectrogram loss from the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence.
  • Training the TTS model on the final spectrogram loss and the corresponding phoneme duration loss.
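
The abstract specifies these losses only at a high level. As a minimal illustrative sketch (not the patented implementation), the two losses might be computed and combined as follows; all arrays, shapes, the loss choices (L2 for durations, L1 for spectrograms), and the equal weighting are assumptions for illustration:

  import numpy as np

  # Hypothetical example: 3 phonemes, 14 spectrogram frames, 80 mel bins.
  # The "predicted_*" arrays stand in for outputs of the (unspecified) model.
  predicted_durations = np.array([4.2, 6.1, 3.8])   # frames per phoneme
  reference_durations = np.array([4.0, 6.0, 4.0])   # e.g., from forced alignment (assumed)

  # Phoneme duration loss: L2 between predicted and reference durations.
  duration_loss = np.mean((predicted_durations - reference_durations) ** 2)

  # The method generates one or more predicted spectrogram sequences and
  # compares each against the single reference spectrogram.
  reference_spec = np.random.randn(14, 80)          # [frames, mel bins]
  predicted_specs = [reference_spec + 0.1 * np.random.randn(14, 80) for _ in range(3)]

  # Final spectrogram loss: L1, averaged over every predicted sequence.
  spectrogram_loss = np.mean([np.mean(np.abs(p - reference_spec)) for p in predicted_specs])

  # Train on both losses; the equal weighting here is an assumption.
  total_loss = spectrogram_loss + duration_loss
  print(f"duration={duration_loss:.4f} spec={spectrogram_loss:.4f} total={total_loss:.4f}")

In an actual trainer, total_loss would be backpropagated jointly through the text encoder, the variational audio encoder, the duration predictor, and the spectrogram decoder.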

Potential Applications

This technology can be applied in speech synthesis, virtual assistants, audiobooks, language learning tools, and accessibility tools for visually impaired individuals.

Problems Solved

This technology solves the problem of generating natural-sounding speech from text input by disentangling style/prosody information from audio signals and accurately predicting phoneme durations.

Benefits

The benefits of this technology include improved speech synthesis quality, enhanced expressiveness in generated speech, and better alignment between text input and synthesized speech.

Potential Commercial Applications

Potential commercial applications of this technology include developing TTS software for various industries such as entertainment, education, customer service, and healthcare.

Possible Prior Art

Prior art in the field of TTS includes autoregressive models, attention mechanisms in sequence-to-sequence models, and variational autoencoders for speech synthesis.

What is the impact of this technology on speech synthesis quality?

This technology significantly improves speech synthesis quality by disentangling style/prosody information from audio signals, accurately predicting phoneme durations, and generating mel-frequency spectrogram sequences that closely match the reference spectrogram.

How does this technology compare to autoregressive TTS models in terms of efficiency?

This technology is more efficient than autoregressive TTS models because it does not generate speech frame by frame: all spectrogram frames can be produced in parallel, which yields faster inference and lower per-utterance latency. The toy comparison below illustrates the difference.
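
As a toy illustration (not the patented architecture), the following contrasts serial autoregressive decoding with single-pass parallel decoding. The functions decode_step and decode_all are hypothetical stand-ins for the respective decoders:

  import numpy as np

  def decode_step(prev_frame, state):
      # Hypothetical autoregressive decoder: emits one 80-bin frame per call,
      # and each call depends on the frame produced by the previous call.
      return np.tanh(prev_frame + state), state

  def decode_all(encoded_text, num_frames):
      # Hypothetical parallel decoder: emits every frame in a single pass.
      return np.tanh(np.tile(encoded_text, (num_frames, 1)))

  encoded_text = np.random.randn(80)
  num_frames = 500

  # Autoregressive: a serial loop; step t cannot start until step t-1 finishes.
  frame, state = np.zeros(80), np.random.randn(80)
  ar_frames = []
  for _ in range(num_frames):
      frame, state = decode_step(frame, state)
      ar_frames.append(frame)

  # Non-autoregressive: one call produces all frames at once, so the work can
  # be dispatched as a single batched, parallelizable operation.
  nar_frames = decode_all(encoded_text, num_frames)

The serial loop's latency grows with the number of frames, while the parallel call performs comparable arithmetic in one batched operation, which is where the faster inference comes from.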


Original Abstract Submitted

A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.