20240013770. TEXT-TO-SPEECH (TTS) PROCESSING simplified abstract (Amazon Technologies, Inc.)
TEXT-TO-SPEECH (TTS) PROCESSING
Organization Name
Amazon Technologies, Inc.
Inventor(s)
Jaime Lorenzo Trueba of Cambridge (GB)
Thomas Renaud Drugman of Carnieres (BE)
Viacheslav Klimkov of Gdansk (PL)
Srikanth Ronanki of Cambridge (GB)
Thomas Edward Merritt of Cambridge (GB)
Andrew Paul Breen of Norwich (GB)
Roberto Barra-chicote of Cambridge (GB)
TEXT-TO-SPEECH (TTS) PROCESSING - A simplified explanation of the abstract
This abstract first appeared for US patent application 20240013770, titled 'TEXT-TO-SPEECH (TTS) PROCESSING'.
Simplified Explanation
During text-to-speech processing, a speech model generates audio data that corresponds to input text data. A spectrogram estimator estimates the frequency spectrogram of the speech, and the resulting spectrogram data is used to condition the speech model. Acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, can be separately encoded into context vectors. The spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
- The patent application describes a text-to-speech method in which a speech model is conditioned on an estimated frequency spectrogram.
- A speech model is used to generate audio data based on input text data.
- A spectrogram estimator is employed to estimate the frequency spectrogram of the speech.
- The frequency spectrogram data is used to condition the speech model.
- Acoustic features corresponding to different segments of the input text data can be encoded into context vectors.
- The spectrogram estimator uses these context vectors to create the frequency spectrogram.
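The data flow described above can be illustrated with a minimal Python sketch. It is not the method claimed in the application: every function name here (`toy_embed`, `encode_level`, `estimate_spectrogram`, `speech_model`, `synthesize`) is hypothetical, the hash-based embeddings and fixed sinusoidal mixing stand in for trained neural networks, and the frame and bin counts are arbitrary. The sketch only shows the order of operations: each segment level is encoded into its own context vector, the separate vectors feed a spectrogram estimator, and the estimated spectrogram conditions the model that produces audio samples.

```python
import hashlib
import math
from typing import List, Sequence

Vector = List[float]


def toy_embed(token: str, dim: int = 4) -> Vector:
    """Deterministic placeholder for a learned embedding; values lie in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]


def encode_level(tokens: Sequence[str], dim: int = 4) -> Vector:
    """Encode one segment level (phonemes, syllables, or words) into one context vector.

    Assumption: averaging token embeddings stands in for a trained encoder.
    """
    if not tokens:
        return [0.0] * dim
    embeddings = [toy_embed(t, dim) for t in tokens]
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def estimate_spectrogram(
    context_vectors: Sequence[Vector], n_frames: int = 5, n_bins: int = 8
) -> List[Vector]:
    """Hypothetical spectrogram estimator: maps the separate context vectors
    to an n_frames x n_bins grid of non-negative magnitudes."""
    joined = [x for vec in context_vectors for x in vec]
    if not joined:
        raise ValueError("at least one non-empty context vector is required")
    spectrogram = []
    for t in range(n_frames):
        frame = []
        for k in range(n_bins):
            # Arbitrary fixed mixing weights in [0, 1]; a real estimator would learn these.
            total = sum(
                x * abs(math.sin((i + 1) * (k + 1) + t)) for i, x in enumerate(joined)
            )
            frame.append(total / len(joined))
        spectrogram.append(frame)
    return spectrogram


def speech_model(spectrogram: Sequence[Vector], samples_per_frame: int = 16) -> Vector:
    """Toy speech model conditioned on the spectrogram: each frame is rendered
    as a sum of sinusoids weighted by that frame's bin magnitudes."""
    audio = []
    for frame in spectrogram:
        for n in range(samples_per_frame):
            sample = sum(
                mag * math.sin(math.pi * (k + 1) * n / samples_per_frame)
                for k, mag in enumerate(frame)
            )
            audio.append(sample / len(frame))
    return audio


def synthesize(
    phonemes: Sequence[str], syllables: Sequence[str], words: Sequence[str]
) -> Vector:
    """Run the full toy pipeline: separate encodings -> spectrogram -> audio samples."""
    contexts = [encode_level(phonemes), encode_level(syllables), encode_level(words)]
    return speech_model(estimate_spectrogram(contexts))


audio = synthesize(["HH", "AH", "L", "OW"], ["HH-AH", "L-OW"], ["hello"])
print(len(audio))
```

Keeping one encoder call per segment level mirrors the "separately encoded" wording of the abstract; a real system would presumably replace each placeholder with a trained network.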
Potential Applications:
- Text-to-speech systems and applications
- Voice assistants and virtual agents
- Audiobook narration and production
- Accessibility tools for visually impaired individuals
Problems Solved:
- Enhances the quality and naturalness of synthesized speech
- Improves the accuracy of speech generation based on input text data
- Enables better representation of different segments of the input text data
Benefits:
- More realistic and natural-sounding synthesized speech
- Improved intelligibility and clarity of synthesized speech
- Enhanced user experience in text-to-speech applications
- Greater flexibility in encoding and representing different aspects of input text data
Original Abstract Submitted
During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.