US Patent Application 18213929. ELECTRONIC DEVICE AND METHOD OF GENERATING TEXT-TO-SPEECH MODEL FOR PROSODY CONTROL OF THE ELECTRONIC DEVICE simplified abstract

From WikiPatents

ELECTRONIC DEVICE AND METHOD OF GENERATING TEXT-TO-SPEECH MODEL FOR PROSODY CONTROL OF THE ELECTRONIC DEVICE

Organization Name

Samsung Electronics Co., Ltd.


Inventor(s)

Junesig Sung of Gyeonggi-do (KR)


Taehoon Kim of Gyeonggi-do (KR)


Nikos Ellinas of Athens (GR)


Pirros Tsiakoulis of Athens (GR)


Hyoungmin Park of Gyeonggi-do (KR)


ELECTRONIC DEVICE AND METHOD OF GENERATING TEXT-TO-SPEECH MODEL FOR PROSODY CONTROL OF THE ELECTRONIC DEVICE - A simplified explanation of the abstract

  • This abstract first appeared for US patent application number 18213929, titled 'ELECTRONIC DEVICE AND METHOD OF GENERATING TEXT-TO-SPEECH MODEL FOR PROSODY CONTROL OF THE ELECTRONIC DEVICE'

Simplified Explanation

This abstract describes an electronic device that generates a text-to-speech (TTS) model for prosody control. The device receives training data made up of phonemes, the basic units of speech sound. It determines a prosody value for each phoneme (a measure of properties such as rhythm and intonation) and groups the phonemes into prosody clusters based on those values. For a text in the training data, the device extracts the corresponding phoneme sequence. For a spoken utterance of that text, it extracts a prosody cluster index sequence by selecting a cluster according to the prosody values of the utterance. Finally, it generates the TTS model from the phoneme sequence and the prosody cluster index sequence.
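
The data preparation steps described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the abstract does not say what the prosody value is or which clustering method is used, so the sketch assumes a single pitch value in Hz per phoneme and a plain one-dimensional k-means. The sample data and the names nearest, kmeans_1d, prosody_index_sequence and build_training_pairs are hypothetical, and training the TTS model itself is out of scope.

```python
from typing import List, Sequence

# Hypothetical training data: each entry pairs a text with its phoneme
# sequence and one ASSUMED prosody value per phoneme (mean pitch in Hz).
TRAINING_DATA = [
    {"text": "hi", "phonemes": ["HH", "AY"], "prosody": [110.0, 180.0]},
    {"text": "no", "phonemes": ["N", "OW"], "prosody": [120.0, 250.0]},
    {"text": "go", "phonemes": ["G", "OW"], "prosody": [105.0, 240.0]},
]


def nearest(centroids: Sequence[float], value: float) -> int:
    """Return the index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - value))


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 50) -> List[float]:
    """Plain 1-D k-means (an assumed clustering method); returns sorted centroids."""
    if k < 1 or not values:
        raise ValueError("need at least one value and k >= 1")
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        buckets: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(centroids, v)].append(v)
        # An empty bucket keeps its previous centroid.
        updated = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def prosody_index_sequence(prosody: Sequence[float], centroids: Sequence[float]) -> List[int]:
    """Map each prosody value of an utterance to the index of its cluster."""
    return [nearest(centroids, p) for p in prosody]


def build_training_pairs(data: List[dict], k: int) -> List[dict]:
    """Cluster all prosody values, then pair each phoneme sequence
    with the prosody cluster index sequence of its utterance."""
    all_values = [p for item in data for p in item["prosody"]]
    centroids = kmeans_1d(all_values, k)
    return [
        {
            "phonemes": item["phonemes"],
            "prosody_clusters": prosody_index_sequence(item["prosody"], centroids),
        }
        for item in data
    ]


if __name__ == "__main__":
    for pair in build_training_pairs(TRAINING_DATA, k=3):
        print(pair)
```

Because the centroids are returned in ascending order, a higher cluster index in this sketch means a higher assumed pitch. A TTS model trained on such pairs could in principle be steered at synthesis time by changing the index sequence, although the abstract does not describe how the model uses the indices.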


Original Abstract Submitted

According to certain embodiments, an electronic device, comprises: a memory storing therein instructions; and a processor electrically connected to the memory and configured to execute the instructions, wherein, when the instructions are executed by the processor, the processor receives training data comprising a plurality of phenomes; determines a prosody value for each one of the plurality of phenomes in the training data; clusters the plurality of phenomes based on the prosody value for each one of the plurality of phenomes in the training data, thereby resulting in a plurality of prosody clusters; extracts a phoneme sequence corresponding to a text in the training data; extracts a prosody cluster index sequence corresponding to an utterance of the text by selecting one of the plurality of clusters based on prosody values of the utterance of the text; and generates a text-to-speech (TTS) model based on the phoneme sequence and the prosody cluster index sequence.