Two-Level Text-To-Speech Systems Using Synthetic Training Data

Organization Name

Google LLC

Inventor(s)

Lev Finkelstein of Mountain View CA (US)

Chun-an Chan of Mountain View CA (US)

Byungha Chun of Tokyo (JP)

Norman Casagrande of London (GB)

Yu Zhang of Mountain View CA (US)

Robert Andrew James Clark of Hertfordshire (GB)

Vincent Wan of London (GB)

This abstract first appeared for US patent application 20250078808, titled 'Two-Level Text-To-Speech Systems Using Synthetic Training Data'.

Original Abstract Submitted

A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second accent/dialect.
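
The data flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the implementation claimed in the application: every name here (`TrainingExample`, `Conditioning`, `build_synthetic_pairs`, `run_pipeline`, the toy stand-ins, and the `"en-GB"` identifier) is hypothetical, and the synthesizer and trainer are placeholders that return silent waveforms rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]  # assumption: audio represented as a list of samples


@dataclass(frozen=True)
class TrainingExample:
    audio: Waveform   # spoken by the target speaker in the first accent/dialect
    transcript: str   # corresponding transcript


@dataclass(frozen=True)
class Conditioning:
    speaker_embedding: Tuple[float, ...]
    accent_id: str    # identifies the second accent/dialect


# Hypothetical interfaces for the two components the abstract mentions.
Synthesizer = Callable[[TrainingExample, str], Waveform]
TtsModel = Callable[[str, Conditioning], Waveform]
Trainer = Callable[[List[Tuple[str, Waveform]]], TtsModel]


def build_synthetic_pairs(
    data: Sequence[TrainingExample],
    synthesize_in_accent: Synthesizer,
    second_accent: str,
) -> List[Tuple[str, Waveform]]:
    """Pair each transcript with synthesized speech in the second accent."""
    return [(ex.transcript, synthesize_in_accent(ex, second_accent)) for ex in data]


def run_pipeline(
    data: Sequence[TrainingExample],
    synthesize_in_accent: Synthesizer,
    train_tts: Trainer,
    input_text: str,
    conditioning: Conditioning,
) -> Waveform:
    """Build synthetic pairs, train a TTS system on them, then synthesize."""
    pairs = build_synthetic_pairs(data, synthesize_in_accent, conditioning.accent_id)
    model = train_tts(pairs)
    return model(input_text, conditioning)


# Toy stand-ins (assumptions): one silent sample per character of text.
def toy_synthesize(example: TrainingExample, accent_id: str) -> Waveform:
    return [0.0] * len(example.transcript)


def toy_train(pairs: List[Tuple[str, Waveform]]) -> TtsModel:
    def model(text: str, conditioning: Conditioning) -> Waveform:
        return [0.0] * len(text)
    return model


data = [TrainingExample(audio=[0.1, 0.2], transcript="hello")]
cond = Conditioning(speaker_embedding=(0.0, 1.0), accent_id="en-GB")
waveform = run_pipeline(data, toy_synthesize, toy_train, "good morning", cond)
```

Passing the synthesizer and trainer in as callables keeps the sketch agnostic about how either stage is built, which the abstract does not specify.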