20240013770. TEXT-TO-SPEECH (TTS) PROCESSING simplified abstract (Amazon Technologies, Inc.)

TEXT-TO-SPEECH (TTS) PROCESSING

Organization Name

Amazon Technologies, Inc.

Inventor(s)

Jaime Lorenzo Trueba of Cambridge (GB)

Thomas Renaud Drugman of Carnieres (BE)

Viacheslav Klimkov of Gdansk (PL)

Srikanth Ronanki of Cambridge (GB)

Thomas Edward Merritt of Cambridge (GB)

Andrew Paul Breen of Norwich (GB)

Roberto Barra-Chicote of Cambridge (GB)

TEXT-TO-SPEECH (TTS) PROCESSING - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240013770, titled 'TEXT-TO-SPEECH (TTS) PROCESSING'.

Simplified Explanation

During text-to-speech processing, a speech model generates output audio data that corresponds to input text data. A spectrogram estimator estimates the frequency spectrogram of the speech, and that frequency-spectrogram data is used to condition the speech model. Acoustic features corresponding to different segments of the input text data, such as phoneme-, syllable-, and word-level features, can be separately encoded into context vectors. The spectrogram estimator uses these separate context vectors to create the frequency spectrogram.

  • The patent application describes a method for improving text-to-speech processing.
  • A speech model is used to generate audio data based on input text data.
  • A spectrogram estimator is employed to estimate the frequency spectrogram of the speech.
  • The frequency spectrogram data is used to condition the speech model.
  • Acoustic features corresponding to different segments of the input text data can be encoded into context vectors.
  • The spectrogram estimator uses these context vectors to create the frequency spectrogram, as illustrated in the sketch below.
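
As a rough illustration of the multi-level encoding idea, the sketch below separately encodes phoneme-, syllable-, and word-level inputs into context vectors and combines them to predict a frequency (mel) spectrogram. This is a minimal, hypothetical PyTorch sketch, not the implementation claimed in the application; the module names, feature dimensions, and the choice of GRU encoders are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLevelSpectrogramEstimator(nn.Module):
    """Toy spectrogram estimator: separate encoders turn phoneme-, syllable-,
    and word-level inputs into context vectors, which are combined to predict
    a frequency spectrogram (here, 80 mel bins per phoneme position)."""

    def __init__(self, phoneme_vocab=100, syllable_dim=8, word_dim=16,
                 context_dim=64, n_mels=80):
        super().__init__()
        # One encoder per linguistic level, each producing its own context vectors.
        self.phoneme_embedding = nn.Embedding(phoneme_vocab, context_dim)
        self.phoneme_encoder = nn.GRU(context_dim, context_dim, batch_first=True)
        self.syllable_encoder = nn.GRU(syllable_dim, context_dim, batch_first=True)
        self.word_encoder = nn.GRU(word_dim, context_dim, batch_first=True)
        # Maps the concatenated context vectors to one spectrogram frame per step.
        self.decoder = nn.Linear(3 * context_dim, n_mels)

    def forward(self, phoneme_ids, syllable_feats, word_feats):
        # phoneme_ids: (batch, T); syllable_feats: (batch, T, syllable_dim);
        # word_feats: (batch, T, word_dim), all aligned to phoneme positions.
        phon_ctx, _ = self.phoneme_encoder(self.phoneme_embedding(phoneme_ids))
        syl_ctx, _ = self.syllable_encoder(syllable_feats)
        word_ctx, _ = self.word_encoder(word_feats)
        combined = torch.cat([phon_ctx, syl_ctx, word_ctx], dim=-1)
        return self.decoder(combined)  # (batch, T, n_mels)


# Example: one utterance of 12 phonemes with per-phoneme syllable/word features.
estimator = MultiLevelSpectrogramEstimator()
spectrogram = estimator(torch.randint(0, 100, (1, 12)),
                        torch.randn(1, 12, 8),
                        torch.randn(1, 12, 16))
print(spectrogram.shape)  # torch.Size([1, 12, 80])
```

Keeping the encoders separate, rather than concatenating all inputs before a single encoder, mirrors the abstract's statement that features at different levels are encoded into separate context vectors before the spectrogram is estimated.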

Potential Applications:

  • Text-to-speech systems and applications
  • Voice assistants and virtual agents
  • Audiobook narration and production
  • Accessibility tools for visually impaired individuals

Problems Solved:

  • Enhances the quality and naturalness of synthesized speech
  • Improves the accuracy of speech generation based on input text data
  • Enables better representation of different segments of the input text data

Benefits:

  • More realistic and natural-sounding synthesized speech
  • Improved intelligibility and clarity of synthesized speech
  • Enhanced user experience in text-to-speech applications
  • Greater flexibility in encoding and representing different aspects of input text data


Original Abstract Submitted

During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
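
To illustrate the conditioning step described in the abstract, the following hypothetical sketch feeds estimated spectrogram frames into a small waveform decoder, so the speech model's output audio is driven by the frequency-spectrogram data. This is an assumed toy architecture for illustration only; the frame size, hidden dimensions, and decoder structure are not taken from the patent.

```python
import torch
import torch.nn as nn

class SpectrogramConditionedSpeechModel(nn.Module):
    """Toy speech model whose waveform output is conditioned on an estimated
    frequency spectrogram: each spectrogram frame is projected to a
    conditioning vector that drives a small recurrent waveform decoder."""

    def __init__(self, n_mels=80, hidden=128, samples_per_frame=256):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, hidden)           # conditioning per frame
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_samples = nn.Linear(hidden, samples_per_frame)

    def forward(self, spectrogram):
        # spectrogram: (batch, frames, n_mels), e.g. from a spectrogram estimator
        cond = torch.tanh(self.frame_proj(spectrogram))        # (batch, frames, hidden)
        states, _ = self.decoder(cond)                         # (batch, frames, hidden)
        frames = self.to_samples(states)                       # (batch, frames, samples)
        return frames.reshape(frames.size(0), -1)              # (batch, total samples)


# Example: 12 estimated spectrogram frames -> 12 * 256 = 3072 audio samples.
speech_model = SpectrogramConditionedSpeechModel()
audio = speech_model(torch.randn(1, 12, 80))
print(audio.shape)  # torch.Size([1, 3072])
```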