SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS

SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS - A simplified explanation of the abstract

This abstract first appeared for US patent application 18604278, titled 'SYSTEMS AND METHODS FOR TEXT-TO-SPEECH SYNTHESIS'.

The patent application describes a system and method for text-to-speech synthesis using reference spectrograms and style vectors.

The system receives an input text, a first reference spectrogram, and a second reference spectrogram via a data interface.
Encoders generate a vector representation of each input.
A combined representation is generated from the vector representations of the two reference spectrograms.
Cross attention is performed between the combined representation and the vector representation of the input text to generate a style vector.
A decoder generates an audio waveform based on the modified vector representation, conditioned by the style vector.
The style vector conditions speech generation via conditional layer normalization.
The generated audio waveform can be played through a speaker and used for communication by a digital avatar interface.
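
The conditional layer normalization step can be illustrated with a minimal NumPy sketch. The abstract does not say how the style vector enters the normalization, so this assumes a common formulation in which the style vector is linearly projected to a per-feature scale and shift. The function name, the projection matrices `w_scale` and `w_shift`, and all dimensions are hypothetical and are not taken from the application.

```python
import numpy as np

def conditional_layer_norm(hidden, style, w_scale, w_shift, eps=1e-5):
    """Layer normalization whose scale and shift depend on a style vector.

    hidden:  (T, d) decoder hidden states, one row per time step
    style:   (s,)   style vector
    w_scale: (s, d) hypothetical projection from style to per-feature scale
    w_shift: (s, d) hypothetical projection from style to per-feature shift
    """
    # Normalize each time step over its feature dimension.
    mean = hidden.mean(axis=-1, keepdims=True)
    var = hidden.var(axis=-1, keepdims=True)
    normalized = (hidden - mean) / np.sqrt(var + eps)

    # Assumption: scale is 1 plus a style-dependent offset, shift is style-dependent.
    scale = 1.0 + style @ w_scale  # (d,)
    shift = style @ w_shift        # (d,)
    return normalized * scale + shift

# Random stand-ins for real decoder states and a real style vector.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))
style = rng.normal(size=4)
w_scale = rng.normal(size=(4, 8)) * 0.1
w_shift = rng.normal(size=(4, 8)) * 0.1

out = conditional_layer_norm(hidden, style, w_scale, w_shift)
print(out.shape)
```

Under this formulation a zero style vector reduces the layer to plain layer normalization, so the style vector only modulates how the normalized features are rescaled and offset.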

Potential Applications:

- Text-to-speech applications
- Digital avatars for communication
- Speech synthesis for various industries such as entertainment, education, and customer service

Problems Solved:

- Enhances the naturalness and expressiveness of synthesized speech
- Allows for personalized speech generation based on style vectors
- Improves the quality and accuracy of text-to-speech systems

Benefits:

- Improved user experience in interacting with digital avatars
- Enhanced communication through synthesized speech
- Customizable speech generation for different contexts and styles

Commercial Applications:

Title: Advanced Text-to-Speech Synthesis Technology

This technology can be utilized in:

- Virtual assistants
- Interactive voice response systems
- E-learning platforms
- Entertainment industry for voiceovers and dubbing

Questions about Text-to-Speech Synthesis:

1. How does the system generate style vectors for speech synthesis?

The system performs cross attention between the combined representation of the two reference spectrograms and the vector representation of the input text. The result is a style vector, which conditions the speech generation process.
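
The cross-attention step can be sketched in NumPy as follows. The abstract does not specify how the two reference representations are combined, which side supplies the queries, or how the attention output becomes a single vector. This sketch assumes the combined representation is a concatenation of the two reference encodings along the time axis, the text representation supplies the queries, and the attended output is mean-pooled over text positions. The function names, weight matrices, and dimensions are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_vector(text_repr, ref_a, ref_b, w_q, w_k, w_v):
    """Single-head scaled dot-product cross attention.

    text_repr: (N, d)  encoded input text
    ref_a:     (Ta, d) encoded first reference spectrogram
    ref_b:     (Tb, d) encoded second reference spectrogram
    """
    # Assumption: combine the references by concatenating along time.
    combined = np.concatenate([ref_a, ref_b], axis=0)  # (Ta+Tb, d)

    q = text_repr @ w_q  # queries from the text,        (N, d_k)
    k = combined @ w_k   # keys from the references,     (Ta+Tb, d_k)
    v = combined @ w_v   # values from the references,   (Ta+Tb, d_v)

    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    attended = weights @ v              # (N, d_v)

    # Assumption: mean-pool over text positions to get one style vector.
    return attended.mean(axis=0), weights

# Random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
text_repr = rng.normal(size=(5, 8))
ref_a = rng.normal(size=(7, 8))
ref_b = rng.normal(size=(9, 8))
w_q = rng.normal(size=(8, 16))
w_k = rng.normal(size=(8, 16))
w_v = rng.normal(size=(8, 4))

style, weights = style_vector(text_repr, ref_a, ref_b, w_q, w_k, w_v)
print(style.shape, weights.shape)
```

With text-side queries, each text position weights the reference frames by relevance, so the pooled vector reflects the parts of both references that best match the content being spoken. Other arrangements are equally consistent with the abstract.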

2. What are the potential applications of this text-to-speech synthesis technology?

This technology can be used in various industries such as entertainment, education, and customer service for enhancing communication through synthesized speech.

Original Abstract Submitted

Embodiments described herein provide systems and methods for text to speech synthesis. A system receives, via a data interface, an input text, a first reference spectrogram, and a second reference spectrogram. The system generates, via encoders, vector representations of each of the inputs. The system generates a combined representation based on the vector representation of the first reference spectrogram and the vector representation of the second reference spectrogram. The system performs cross attention between the combined representation and the vector representation of the input text to generate a style vector. The system may generate, via a decoder, an audio waveform based on the modified vector representation and conditioned by the style vector where the style vector conditions the speech generation via conditional layer normalization. The generated audio waveform may be played via a speaker. The generated audio may be used in communication by a digital avatar interface.