Amazon Technologies, Inc. (20240296827). TEXT-TO-SPEECH (TTS) PROCESSING simplified abstract

From WikiPatents

TEXT-TO-SPEECH (TTS) PROCESSING

Organization Name

Amazon Technologies, Inc.

Inventor(s)

Jaime Lorenzo Trueba of Cambridge (GB)

Thomas Renaud Drugman of Carnieres (BE)

Viacheslav Klimkov of Gdansk (PL)

Srikanth Ronanki of Cambridge (GB)

Thomas Edward Merritt of Cambridge (GB)

Andrew Paul Breen of Norwich (GB)

Roberto Barra-Chicote of Cambridge (GB)

TEXT-TO-SPEECH (TTS) PROCESSING - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240296827, titled 'TEXT-TO-SPEECH (TTS) PROCESSING'.

The abstract of the patent application describes a text-to-speech process in which a speech model generates audio data corresponding to input text data. A spectrogram estimator creates a frequency spectrogram of the speech, which is used to condition the speech model. Acoustic features from different segments of the input text data are separately encoded into context vectors, which the spectrogram estimator uses to create the frequency spectrogram.

  • Speech model generates audio data from input text data
  • Spectrogram estimator creates frequency spectrogram of speech
  • Acoustic features from text segments encoded into context vectors
  • Context vectors used to create frequency spectrogram
  • Conditioning of speech model based on frequency spectrogram
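The separate encoding of segment-level features into context vectors can be illustrated with a toy sketch. This is not code from the patent: the random linear projections below are hypothetical stand-ins for learned encoders, and the feature dimensions are arbitrary assumptions chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, dim=8):
    """Encode a sequence of acoustic feature vectors into one fixed-size
    context vector (a random projection of the mean, standing in for a
    learned per-segment-level encoder)."""
    feats = np.asarray(features, dtype=float)
    proj = rng.standard_normal((feats.shape[1], dim))
    return feats.mean(axis=0) @ proj

def spectrogram_estimator(context_vectors, n_frames=4, n_bins=5):
    """Combine the separately encoded context vectors and map them to a
    (frames x frequency-bins) spectrogram estimate."""
    combined = np.concatenate(context_vectors)
    w = rng.standard_normal((combined.size, n_frames * n_bins))
    return (combined @ w).reshape(n_frames, n_bins)

# Hypothetical per-segment acoustic features for one utterance:
phoneme_feats  = rng.standard_normal((12, 3))   # one row per phoneme
syllable_feats = rng.standard_normal((5, 3))    # one row per syllable
word_feats     = rng.standard_normal((3, 3))    # one row per word

# Each segment level is encoded separately, as the abstract describes.
contexts = [encode(f) for f in (phoneme_feats, syllable_feats, word_feats)]
spec = spectrogram_estimator(contexts)
print(spec.shape)  # (4, 5)
```

The key point the sketch mirrors is that phoneme-, syllable-, and word-level features each get their own context vector before the spectrogram estimator combines them.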

Potential Applications

  • Text-to-speech systems
  • Speech recognition technology
  • Language translation tools

Problems Solved

  • Enhancing the accuracy and naturalness of synthesized speech
  • Improving the performance of speech models
  • Enhancing the quality of text-to-speech systems

Benefits

  • More realistic and natural-sounding synthesized speech
  • Increased accuracy in speech recognition
  • Enhanced user experience in language translation applications

Commercial Applications

Advanced Text-to-Speech Technology for Enhanced User Experience. This technology can be utilized in:

  • Virtual assistants
  • Customer service chatbots
  • Language learning applications

Questions about the technology

  1. How does the separate encoding of acoustic features into context vectors improve the performance of the speech model?
  2. What are the potential limitations of using frequency spectrograms to condition speech models?


Original Abstract Submitted

During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
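The conditioning step the abstract describes (a speech model producing audio from spectrogram data) can also be sketched. This is a toy, vocoder-style stand-in, not the patented model: the random projection, the frame size of 80 samples, and the normalization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def speech_model(spectrogram, samples_per_frame=80):
    """Toy conditional speech model: each estimated spectrogram frame is
    mixed through a random projection to yield one block of audio
    samples (stand-in for a learned model conditioned on the
    frequency-spectrogram data)."""
    n_frames, n_bins = spectrogram.shape
    w = rng.standard_normal((n_bins, samples_per_frame))
    audio = (spectrogram @ w).reshape(-1)        # one sample block per frame
    return audio / (np.abs(audio).max() + 1e-9)  # normalize to [-1, 1]

spec = rng.standard_normal((4, 5))  # assumed 4 frames x 5 frequency bins
audio = speech_model(spec)
print(audio.shape)  # (320,)
```

The shape arithmetic is the point: 4 spectrogram frames times 80 samples per frame yields 320 output audio samples, showing how spectrogram data drives the speech model's output.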