Generating Expressive Speech Audio From Text Data

Organization Name

Electronic Arts Inc.

Inventor(s)

Siddharth Gururani of Santa Clara CA (US)

Kilol Gupta of Redwood City CA (US)

Dhaval Shah of Redwood City CA (US)

Zahra Shakeri of Newark CA (US)

Jervis Pinto of Toronto (CA)

Mohsen Sardari of Burlingame CA (US)

Navid Aghdaie of San Jose CA (US)

Kazi Zaman of Foster City CA (US)

Generating Expressive Speech Audio From Text Data - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240290316, titled 'Generating Expressive Speech Audio From Text Data'.

The abstract describes a system for generating expressive speech audio in video game development using a machine-learned synthesizer.

A user interface receives user-input text and a speech style selection.
The machine-learned synthesizer includes a text encoder, a speech style encoder, and a decoder.
The text encoder generates one or more text encodings from the user-input text.
The speech style encoder processes a set of speech style features associated with the selected style to generate a speech style encoding.
The text encodings and the speech style encoding are combined, and the combined encodings are decoded to generate predicted acoustic features.
One or more modules process the predicted acoustic features; these include a machine-learned vocoder that generates the waveform of the expressive speech audio.
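
The encode, combine, and decode steps listed above can be sketched as a toy data flow. This is a minimal illustration, not the patented implementation: the seeded random matrices stand in for learned weights, the style table, the dimensions, and all function names are hypothetical, and concatenation is assumed as the combining step because the abstract does not say how the encodings are combined. The sketch stops at the predicted acoustic features.

```python
import random
from typing import Dict, List

Vector = List[float]

# Hypothetical sizes chosen only for illustration.
TEXT_DIM, STYLE_FEATS, STYLE_DIM, ACOUSTIC_DIM = 8, 3, 4, 5


def make_matrix(rows: int, cols: int, seed: int) -> List[Vector]:
    """Fixed pseudo-random weights standing in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Assumption: each selectable style maps to a small set of style features.
STYLE_FEATURES: Dict[str, Vector] = {
    "neutral": [0.0, 0.0, 0.0],
    "excited": [1.0, 0.8, 0.3],
}

STYLE_W = make_matrix(STYLE_DIM, STYLE_FEATS, seed=1)
DECODER_W = make_matrix(ACOUSTIC_DIM, TEXT_DIM + STYLE_DIM, seed=2)


def text_encoder(text: str) -> List[Vector]:
    """Toy text encoder: one one-hot encoding per character."""
    encodings = []
    for ch in text:
        vec = [0.0] * TEXT_DIM
        vec[ord(ch) % TEXT_DIM] = 1.0
        encodings.append(vec)
    return encodings


def style_encoder(features: Vector) -> Vector:
    """Toy style encoder: a single linear map over the style features."""
    return matvec(STYLE_W, features)


def combine(text_encodings: List[Vector], style_encoding: Vector) -> List[Vector]:
    """Assumed combination: append the style encoding to every text encoding."""
    return [enc + style_encoding for enc in text_encodings]


def decoder(combined: List[Vector]) -> List[Vector]:
    """Toy decoder: one predicted acoustic-feature frame per combined encoding."""
    return [matvec(DECODER_W, c) for c in combined]


def synthesize_features(text: str, style: str) -> List[Vector]:
    """Run text and style through encode -> combine -> decode."""
    text_enc = text_encoder(text)
    style_enc = style_encoder(STYLE_FEATURES[style])
    return decoder(combine(text_enc, style_enc))
```

Calling `synthesize_features("hi", "excited")` returns one feature frame per input character. Because the style encoding is part of every combined encoding, the same text with a different style yields different predicted features.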

Potential Applications:
- Enhancing user experience in video games with realistic and expressive speech audio.
- Improving character interactions and dialogue in video game development.

Problems Solved:
- Providing a system for developers to easily generate expressive speech audio.
- Enhancing the immersion and engagement of players in video games.

Benefits:
- Streamlining the process of creating speech audio in video game development.
- Enhancing the overall quality and realism of speech audio in video games.

Commercial Applications:
- This technology can be used by video game developers to enhance the audio experience in their games, potentially leading to increased player engagement and satisfaction.

Prior Art:
- Researchers and developers in the field of speech synthesis and audio generation may have explored similar techniques for creating expressive speech audio in various applications.

Frequently Updated Research:
- Stay updated on advancements in machine learning techniques for speech synthesis and audio generation to further improve the system's capabilities.

Questions about the Technology:
1. How does the machine-learned synthesizer differentiate between various speech styles?
2. What are the potential limitations of using this system in video game development?

Original Abstract Submitted

A system for use in video game development to generate expressive speech audio comprises a user interface configured to receive user-input text data and a user selection of a speech style. The system includes a machine-learned synthesizer comprising a text encoder, a speech style encoder and a decoder. The machine-learned synthesizer is configured to generate one or more text encodings derived from the user-input text data, using the text encoder of the machine-learned synthesizer; generate a speech style encoding by processing a set of speech style features associated with the selected speech style using the speech style encoder of the machine-learned synthesizer; combine the one or more text encodings and the speech style encoding to generate one or more combined encodings; and decode the one or more combined encodings with the decoder of the machine-learned synthesizer to generate predicted acoustic features. The system includes one or more modules configured to process the predicted acoustic features, the one or more modules comprising a machine-learned vocoder configured to generate a waveform of the expressive speech audio.
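
The final step of the abstract, turning predicted acoustic features into a waveform, can be illustrated with a toy stand-in. The abstract specifies a machine-learned vocoder and does not say what the acoustic features contain. The sketch below instead uses a plain sinusoidal oscillator, and assumes each feature frame is a hypothetical (fundamental frequency in Hz, amplitude) pair covering 10 ms of audio at 16 kHz. It shows only the interface of the stage (frames in, samples out), not how a learned vocoder works.

```python
import math
from typing import List, Tuple

SAMPLE_RATE = 16000   # assumed sample rate in Hz
FRAME_LEN = 160       # assumed samples per feature frame (10 ms at 16 kHz)


def toy_vocoder(frames: List[Tuple[float, float]]) -> List[float]:
    """Render (f0_hz, amplitude) frames as a sine wave.

    The phase is carried across frame boundaries so the tone stays
    continuous when the frequency changes between frames.
    """
    waveform: List[float] = []
    phase = 0.0
    for f0_hz, amplitude in frames:
        step = 2.0 * math.pi * f0_hz / SAMPLE_RATE  # phase advance per sample
        for _ in range(FRAME_LEN):
            waveform.append(amplitude * math.sin(phase))
            phase += step
    return waveform


# Two frames: a 220 Hz tone followed by a quieter 440 Hz tone.
samples = toy_vocoder([(220.0, 0.5), (440.0, 0.25)])
```

Here `samples` holds one float per audio sample, `FRAME_LEN` samples for each input frame.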