POSITION-BASED TEXT-TO-SPEECH MODEL

Organization Name

ADOBE INC.

Inventor(s)

Puneet Mathur of Sunnyvale CA US

Franck Dernoncourt of Spokane WA US

Quan Hung Tran of San Jose CA US

Jiuxiang Gu of Baltimore MD US

Ani Nenkova of Philadelphia PA US

Vlad Ion Morariu of Potomac MD US

Rajiv Bhawanji Jain of Falls Church VA US

Dinesh Manocha of Bethesda MD US

This abstract first appeared for US patent application 20250095631, titled 'POSITION-BASED TEXT-TO-SPEECH MODEL'.

Original Abstract Submitted

Position-based text-to-speech model and training techniques are described. A digital document, for instance, is received by an audio synthesis service. The audio synthesis service uses a text-to-speech model to generate digital audio from text included in the digital document. The text-to-speech model, for instance, is configured to generate a text encoding and a document positional encoding from an initial text sequence of the digital document. The document positional encoding is based on a location of the text encoding within the digital document. By decoding the text encoding and the document positional encoding, the text-to-speech model then generates digital audio that includes a spectrogram having a reordered text sequence, which is different from the initial text sequence.
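
The abstract does not say how the two encodings are computed or combined, so the following is only a minimal sketch under stated assumptions, not the method claimed in the application. It assumes each piece of text carries a page index and normalized x/y coordinates, builds a document positional encoding from those with standard sinusoidal features, and adds it to a text encoding. The names `TextBlock`, `toy_text_encoding`, `document_positional_encoding`, `encode`, and `reading_order` are hypothetical. The toy character-count encoder stands in for a learned text encoder, and the coordinate sort in `reading_order` is a simple heuristic standing in for the learned decoder that produces the reordered sequence.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical container: a piece of text plus where it sits in the document."""
    text: str
    page: int   # zero-based page index
    x: float    # left edge, normalized to [0, 1]
    y: float    # top edge, normalized to [0, 1]


def sinusoidal(value: float, dim: int) -> List[float]:
    """Encode one scalar as `dim` sinusoidal features (`dim` must be even)."""
    features = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features


def document_positional_encoding(block: TextBlock, dim_per_axis: int = 4) -> List[float]:
    """Assumed layout: concatenate encodings of page, x, and y."""
    # Scale the normalized coordinates so nearby blocks get distinct features.
    return (sinusoidal(block.page, dim_per_axis)
            + sinusoidal(block.x * 100, dim_per_axis)
            + sinusoidal(block.y * 100, dim_per_axis))


def toy_text_encoding(text: str, dim: int) -> List[float]:
    """Stand-in for a learned text encoder: character counts hashed into `dim` buckets."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def encode(block: TextBlock) -> List[float]:
    """Combine the two encodings by element-wise addition (an assumption)."""
    pos = document_positional_encoding(block)
    txt = toy_text_encoding(block.text, dim=len(pos))
    return [t + p for t, p in zip(txt, pos)]


def reading_order(blocks: List[TextBlock]) -> List[TextBlock]:
    """Heuristic stand-in for learned reordering: sort by page, then top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.page, b.y, b.x))


# Initial text sequence as a document parser might emit it (out of reading order).
initial = [
    TextBlock("Figure caption", page=0, x=0.1, y=0.9),
    TextBlock("Title", page=0, x=0.1, y=0.05),
    TextBlock("Body paragraph", page=0, x=0.1, y=0.3),
]
encodings = [encode(b) for b in initial]
reordered = [b.text for b in reading_order(initial)]
```

In this sketch the positional information travels with every text encoding, which is what would let a downstream decoder emit speech in an order different from the extraction order. A real system would replace both the toy encoder and the sort with trained components.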