18487227. Attention-Based Clockwork Hierarchical Variational Encoder simplified abstract (GOOGLE LLC)

From WikiPatents

Attention-Based Clockwork Hierarchical Variational Encoder

Organization Name

GOOGLE LLC

Inventor(s)

Robert Clark of Hertfordshire (GB)

Chun-an Chan of Mountain View CA (US)

Vincent Wan of London (GB)

Attention-Based Clockwork Hierarchical Variational Encoder - A simplified explanation of the abstract

This abstract first appeared for US patent application 18487227, titled 'Attention-Based Clockwork Hierarchical Variational Encoder'.

Simplified Explanation

The abstract describes a method for representing an intended prosody in synthesized speech. The method receives a text utterance and selects an utterance embedding that represents the intended prosody. For each syllable in the utterance, it then predicts the syllable's duration by decoding a prosodic syllable embedding, with an attention mechanism attending to the linguistic features of each phoneme in that syllable. The predicted duration is used to generate a plurality of fixed-length predicted frames for the syllable.

  • The method represents and synthesizes speech with a desired prosody.
  • It uses an utterance embedding to capture the intended prosody.
  • The duration of each syllable is predicted by decoding a prosodic syllable embedding.
  • An attention mechanism attends to the linguistic features of each phoneme in the syllable.
  • Fixed-length predicted frames are generated based on the predicted duration.
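
The per-syllable steps listed above can be illustrated with a small, self-contained sketch. It is not the patented model: the dot-product attention, the linear duration head with a softplus, the 5 ms frame length, and every function and parameter name here (`attend`, `predict_duration_ms`, `num_frames`, `FRAME_MS`) are assumptions chosen only to show how attention over phoneme features could feed a duration prediction that then fixes the number of fixed-length frames.

```python
import math

# Assumed frame length in milliseconds; the abstract only says "fixed-length".
FRAME_MS = 5.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, phoneme_features):
    """Dot-product attention over the phonemes of one syllable.

    `query` stands in for the prosodic syllable embedding (hypothetical);
    `phoneme_features` holds one linguistic feature vector per phoneme.
    Returns the attention-weighted context vector and the weights.
    """
    scores = [sum(q * f for q, f in zip(query, feats))
              for feats in phoneme_features]
    weights = softmax(scores)
    dim = len(phoneme_features[0])
    context = [sum(w * feats[i] for w, feats in zip(weights, phoneme_features))
               for i in range(dim)]
    return context, weights


def predict_duration_ms(context, head_weights, bias):
    """Toy linear duration head; softplus keeps the result positive."""
    z = sum(c * w for c, w in zip(context, head_weights)) + bias
    return math.log1p(math.exp(z)) * 100.0  # arbitrary scale to milliseconds


def num_frames(duration_ms, frame_ms=FRAME_MS):
    """Number of fixed-length frames needed to cover the predicted duration."""
    return max(1, round(duration_ms / frame_ms))


if __name__ == "__main__":
    syllable_embedding = [1.0, 0.0]                  # hypothetical values
    phonemes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical features
    context, weights = attend(syllable_embedding, phonemes)
    duration = predict_duration_ms(context, [0.5, -0.2], 0.1)
    print(weights, duration, num_frames(duration))
```

In this sketch the predicted duration alone determines how many frames a syllable receives, which mirrors the abstract's statement that the frames are generated "based on the predicted duration"; how a real system fills those frames with pitch or energy values is outside what the abstract specifies.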

Potential applications of this technology:

  • Text-to-speech systems can generate speech with specific prosodic patterns.
  • Voice assistants can mimic human-like intonation and emphasis.
  • Language learning tools can provide pronunciation guidance with correct prosody.

Problems solved by this technology:

  • Lack of naturalness in synthesized speech.
  • Difficulty in conveying intended prosody in text-based communication.
  • Inability to capture and reproduce complex prosodic patterns.

Benefits of this technology:

  • Improved naturalness and expressiveness in synthesized speech.
  • Enhanced communication and understanding in text-based interactions.
  • Better pronunciation guidance and language learning experience.


Original Abstract Submitted

A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
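
The hierarchy the abstract relies on (an utterance has at least one word, each word at least one syllable, each syllable at least one phoneme) could be modeled roughly as follows. This is a minimal sketch under stated assumptions: the class names, fields, and the `all_syllables` helper are hypothetical and do not come from the patent application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    symbol: str
    linguistic_features: List[float]  # hypothetical feature vector


@dataclass
class Syllable:
    phonemes: List[Phoneme]

    def __post_init__(self):
        # The abstract requires each syllable to have at least one phoneme.
        if not self.phonemes:
            raise ValueError("a syllable needs at least one phoneme")


@dataclass
class Word:
    text: str
    syllables: List[Syllable]

    def __post_init__(self):
        # The abstract requires each word to have at least one syllable.
        if not self.syllables:
            raise ValueError("a word needs at least one syllable")


@dataclass
class TextUtterance:
    words: List[Word]

    def __post_init__(self):
        # The abstract requires the utterance to have at least one word.
        if not self.words:
            raise ValueError("an utterance needs at least one word")

    def all_syllables(self) -> List[Syllable]:
        """Flatten the hierarchy so per-syllable steps can iterate in order."""
        return [syl for word in self.words for syl in word.syllables]
```

A per-syllable procedure such as the duration prediction described in the abstract would then iterate over `all_syllables()` and read each syllable's `phonemes` to obtain the linguistic features it attends to.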