18598996. SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS simplified abstract (Datum Point Labs Inc.)
SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS
Organization Name: Datum Point Labs Inc.
Inventor(s)
SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS - A simplified explanation of the abstract
This abstract first appeared for US patent application 18598996, titled 'SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS'.
The abstract describes a system and method for text-to-speech synthesis in which the system receives input text, a reference spectrogram, and at least one of an emotion ID or a speaker ID. The system generates vector representations of the input text and the reference spectrogram, uses a variance adaptor to modify a combined representation that includes those vectors along with an embedding of the emotion or speaker ID, and generates an audio waveform with a decoder. The key steps, with an illustrative code sketch after the list, are:
- System and method for text-to-speech synthesis
- Receives input text, reference spectrogram, and emotion or speaker ID
- Generates vector representations of input text and reference spectrogram
- Uses a variance adaptor to modify the combined representation, which includes an embedding of the emotion or speaker ID
- Generates audio waveform using a decoder
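The following minimal PyTorch sketch wires these steps together. Every concrete choice here (module types, layer sizes, the name TTSPipeline, and the simple linear stand-ins for the variance adaptor and decoder) is an assumption made for illustration; the patent application does not disclose the underlying architectures.

import torch
import torch.nn as nn

class TTSPipeline(nn.Module):
    """Hypothetical sketch of the described flow; not the patented implementation."""

    def __init__(self, vocab_size=256, n_mels=80, d_model=256,
                 n_emotions=8, n_speakers=16):
        super().__init__()
        # First encoder: vector representation of the input text.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        # Second encoder: vector representation of the reference spectrogram.
        self.ref_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # Embeddings looked up from the emotion ID and speaker ID.
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        # Stand-in variance adaptor: modifies the combined representation.
        self.variance_adaptor = nn.Linear(d_model, d_model)
        # Stand-in decoder: maps the modified representation to audio samples.
        self.decoder = nn.Linear(d_model, 1)

    def forward(self, text_ids, ref_spectrogram, emotion_id=None, speaker_id=None):
        text_repr = self.text_encoder(text_ids)              # (B, T, d_model)
        _, ref_state = self.ref_encoder(ref_spectrogram)     # (1, B, d_model)
        combined = text_repr + ref_state.transpose(0, 1)     # broadcast over T
        if emotion_id is not None:
            combined = combined + self.emotion_emb(emotion_id).unsqueeze(1)
        if speaker_id is not None:
            combined = combined + self.speaker_emb(speaker_id).unsqueeze(1)
        modified = self.variance_adaptor(combined)
        return self.decoder(modified).squeeze(-1)            # (B, T) waveform samples

model = TTSPipeline()
text_ids = torch.randint(0, 256, (1, 42))      # token IDs for the input text
ref_spec = torch.randn(1, 120, 80)             # reference mel spectrogram
waveform = model(text_ids, ref_spec, emotion_id=torch.tensor([3]))

In a practical system the decoder would be an autoregressive or vocoder-style network, and the resulting waveform would be played through a speaker, as the abstract notes.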
Potential Applications:
- Voice assistants
- Audiobooks
- Language translation services

Problems Solved:
- Enhancing the naturalness and expressiveness of synthesized speech
- Personalizing speech synthesis based on an emotion or speaker ID

Benefits:
- Improved user experience
- Enhanced communication of emotions through synthesized speech
Commercial Applications:
Suggested title: "Advanced Text-to-Speech Synthesis System for Enhanced User Experience"
This technology can be used in industries such as:
- Customer service
- Entertainment
- Education
Questions about Text-to-Speech Synthesis:
1. How does the system generate vector representations of the input text and reference spectrogram?
A first encoder generates the vector representation of the input text, and a second encoder generates the vector representation of the reference spectrogram.
2. What is the role of the variance adaptor in modifying the vector representations?
The variance adaptor produces a modified vector representation from a combined representation that includes the text vector, the reference-spectrogram vector, and an embedding of the emotion ID or speaker ID. A minimal illustration of that combination step follows.
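This snippet shows only the combination step; the shapes and the choice of summation are assumptions, since the abstract states only that the representations are combined.

import torch

d_model = 256
text_repr = torch.randn(1, 42, d_model)   # per-token vectors from the text encoder
ref_repr = torch.randn(1, d_model)        # utterance-level vector from the spectrogram encoder
emotion_emb = torch.randn(1, d_model)     # embedding looked up from the emotion ID

# Broadcast the utterance-level vectors across the text time axis and sum them
# into the combined representation that a variance adaptor would then modify.
combined = text_repr + ref_repr.unsqueeze(1) + emotion_emb.unsqueeze(1)
print(combined.shape)                     # torch.Size([1, 42, 256])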
Original Abstract Submitted
Embodiments described herein provide systems and methods for text to speech synthesis. A system receives, via a data interface, an input text, a reference spectrogram, and at least one of an emotion ID or speaker ID. The system generates, via a first encoder, a vector representation of the input text. The system generates, via a second encoder, a vector representation of the reference spectrogram. The system generates, via a variance adaptor, a modified vector representation based on a combined representation including a combination of the vector representation of the input text, the vector representation of the reference spectrogram, and at least one of an embedding of the emotion ID or an embedding of the speaker ID. The system generates, via a decoder, an audio waveform based on the modified vector representation. The generated audio waveform may be played via a speaker.