18598996. SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS simplified abstract (Datum Point Labs Inc.)
SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS
Organization Name: Datum Point Labs Inc.
Inventor(s)
SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS - A simplified explanation of the abstract
This abstract first appeared for US patent application 18598996, titled 'SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS'.
The abstract describes a system and method for text-to-speech synthesis in which the system receives input text, a reference spectrogram, and at least one of an emotion ID or a speaker ID. The system generates vector representations of the input text and the reference spectrogram, uses a variance adaptor to modify a combined representation that includes those vectors along with an embedding of the emotion or speaker ID, and generates an audio waveform with a decoder. The key steps, with an illustrative code sketch after the list, are:
- System and method for text-to-speech synthesis
- Receives input text, reference spectrogram, and emotion or speaker ID
- Generates vector representations of input text and reference spectrogram
- Uses a variance adaptor to modify the combined representation, which includes an embedding of the emotion or speaker ID
- Generates audio waveform using a decoder
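The following minimal PyTorch sketch wires these steps together. Every concrete choice here (module types, layer sizes, the name TTSPipeline, and the simple linear stand-ins for the variance adaptor and decoder) is an assumption made for illustration; the patent application does not disclose the underlying architectures.

import torch
import torch.nn as nn

class TTSPipeline(nn.Module):
    """Hypothetical sketch of the described flow; not the patented implementation."""

    def __init__(self, vocab_size=256, n_mels=80, d_model=256,
                 n_emotions=8, n_speakers=16):
        super().__init__()
        # First encoder: vector representation of the input text.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        # Second encoder: vector representation of the reference spectrogram.
        self.ref_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # Embeddings looked up from the emotion ID and speaker ID.
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        # Stand-in variance adaptor: modifies the combined representation.
        self.variance_adaptor = nn.Linear(d_model, d_model)
        # Stand-in decoder: maps the modified representation to audio samples.
        self.decoder = nn.Linear(d_model, 1)

    def forward(self, text_ids, ref_spectrogram, emotion_id=None, speaker_id=None):
        text_repr = self.text_encoder(text_ids)              # (B, T, d_model)
        _, ref_state = self.ref_encoder(ref_spectrogram)     # (1, B, d_model)
        combined = text_repr + ref_state.transpose(0, 1)     # broadcast over T
        if emotion_id is not None:
            combined = combined + self.emotion_emb(emotion_id).unsqueeze(1)
        if speaker_id is not None:
            combined = combined + self.speaker_emb(speaker_id).unsqueeze(1)
        modified = self.variance_adaptor(combined)
        return self.decoder(modified).squeeze(-1)            # (B, T) waveform samples

model = TTSPipeline()
text_ids = torch.randint(0, 256, (1, 42))      # token IDs for the input text
ref_spec = torch.randn(1, 120, 80)             # reference mel spectrogram
waveform = model(text_ids, ref_spec, emotion_id=torch.tensor([3]))

In a practical system the decoder would be an autoregressive or vocoder-style network, and the resulting waveform would be played through a speaker, as the abstract notes.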
Potential Applications:
- Voice assistants
- Audiobooks
- Language translation services

Problems Solved:
- Enhancing the naturalness and expressiveness of synthesized speech
- Personalizing speech synthesis based on an emotion or speaker ID

Benefits:
- Improved user experience
- Enhanced communication of emotions through synthesized speech
Commercial Applications:
Suggested title: "Advanced Text-to-Speech Synthesis System for Enhanced User Experience"
This technology can be used in industries such as:
- Customer service
- Entertainment
- Education
Questions about Text-to-Speech Synthesis:
1. How does the system generate vector representations of the input text and reference spectrogram?
A first encoder generates the vector representation of the input text, and a second encoder generates the vector representation of the reference spectrogram.
2. What is the role of the variance adaptor in modifying the vector representations?
The variance adaptor produces a modified vector representation from a combined representation that includes the text vector, the reference-spectrogram vector, and an embedding of the emotion ID or speaker ID. A minimal illustration of that combination step follows.
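This snippet shows only the combination step; the shapes and the choice of summation are assumptions, since the abstract states only that the representations are combined.

import torch

d_model = 256
text_repr = torch.randn(1, 42, d_model)   # per-token vectors from the text encoder
ref_repr = torch.randn(1, d_model)        # utterance-level vector from the spectrogram encoder
emotion_emb = torch.randn(1, d_model)     # embedding looked up from the emotion ID

# Broadcast the utterance-level vectors across the text time axis and sum them
# into the combined representation that a variance adaptor would then modify.
combined = text_repr + ref_repr.unsqueeze(1) + emotion_emb.unsqueeze(1)
print(combined.shape)                     # torch.Size([1, 42, 256])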
Original Abstract Submitted
Embodiments described herein provide systems and methods for text to speech synthesis. A system receives, via a data interface, an input text, a reference spectrogram, and at least one of an emotion ID or speaker ID. The system generates, via a first encoder, a vector representation of the input text. The system generates, via a second encoder, a vector representation of the reference spectrogram. The system generates, via a variance adaptor, a modified vector representation based on a combined representation including a combination of the vector representation of the input text, the vector representation of the reference spectrogram, and at least one of an embedding of the emotion ID or an embedding of the speaker ID. The system generates, via a decoder, an audio waveform based on the modified vector representation. The generated audio waveform may be played via a speaker.