20230197057. Speech Recognition Using Unspoken Text and Speech Synthesis simplified abstract (Google LLC)

Speech Recognition Using Unspoken Text and Speech Synthesis

Organization Name

Google LLC

Inventor(s)

Zhehuai Chen of Jersey City NJ (US)

Andrew M. Rosenberg of Brooklyn NY (US)

Bhuvana Ramabhadran of Mt. Kisco NY (US)

Pedro J. Moreno Mengibar of Jersey City NJ (US)

Speech Recognition Using Unspoken Text and Speech Synthesis - A simplified explanation of the abstract

This abstract first appeared for US patent application 20230197057 titled 'Speech Recognition Using Unspoken Text and Speech Synthesis'.

Simplified Explanation

The abstract describes a method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison. Here are the key points:

  • The method involves obtaining a set of training text utterances.
  • For each training text utterance, the method generates a synthetic speech representation using the GAN-based TTS model.
  • An adversarial discriminator of the GAN determines an adversarial loss term that measures the acoustic noise disparity between the synthetic speech representation and a non-synthetic speech representation selected from a set of spoken training utterances.
  • The parameters of the GAN-based TTS model are updated based on the adversarial loss term at each output step for each training text utterance (a minimal training-step sketch follows this list).
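
The following is a minimal, runnable sketch of the training step described above. It assumes PyTorch and uses tiny linear layers as stand-ins for the GAN-based TTS generator and the adversarial discriminator; the feature sizes, learning rates, and loss choices are illustrative assumptions, not the architecture from the patent application.

```python
import torch
import torch.nn as nn

FEAT_DIM, TEXT_DIM = 80, 32                      # assumed toy feature sizes

tts_model = nn.Linear(TEXT_DIM, FEAT_DIM)        # stand-in GAN-based TTS generator
discriminator = nn.Linear(FEAT_DIM, 1)           # stand-in adversarial discriminator
adv_criterion = nn.BCEWithLogitsLoss()
tts_optimizer = torch.optim.Adam(tts_model.parameters(), lr=1e-4)
disc_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def training_step(text_utterance, spoken_utterance):
    """One output step for one training text utterance."""
    # 1. Generate a synthetic speech representation with the GAN-based TTS model.
    synthetic = tts_model(text_utterance)

    # 2. Adversarial loss term for the TTS model: the generator is penalized when
    #    the discriminator can tell its output apart from real speech.
    fake_score = discriminator(synthetic)
    adv_loss = adv_criterion(fake_score, torch.ones_like(fake_score))

    # 3. Update the GAN-based TTS model parameters based on the adversarial loss.
    tts_optimizer.zero_grad()
    adv_loss.backward()
    tts_optimizer.step()

    # 4. Update the discriminator by contrasting a non-synthetic speech
    #    representation (from the spoken training utterances) with the synthetic one.
    real_score = discriminator(spoken_utterance)
    fake_score = discriminator(synthetic.detach())
    disc_loss = (adv_criterion(real_score, torch.ones_like(real_score))
                 + adv_criterion(fake_score, torch.zeros_like(fake_score)))
    disc_optimizer.zero_grad()
    disc_loss.backward()
    disc_optimizer.step()
    return adv_loss.item(), disc_loss.item()

# Example call with random stand-in features for one text/speech pair.
adv, disc = training_step(torch.randn(1, TEXT_DIM), torch.randn(1, FEAT_DIM))
```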

Potential applications of this technology:

  • Text-to-speech systems: The method can be used to improve the quality and naturalness of synthetic speech generated by TTS models.
  • Speech recognition systems: The method can enhance the accuracy and robustness of speech recognition models by training them alongside the TTS model.
  • Assistive technologies: This technology can benefit individuals with speech impairments or disabilities by providing more realistic and intelligible synthetic speech.

Problems solved by this technology:

  • Acoustic noise disparity: The method addresses the challenge of minimizing the difference in acoustic noise between synthetic speech and real speech, improving the quality and realism of synthetic speech.
  • Training efficiency: By training the TTS and speech recognition models simultaneously, the method streamlines the learning process and can reduce overall training time (a joint-loss sketch follows this list).
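
As a rough illustration of the simultaneous training mentioned above, the sketch below combines a speech recognition loss computed on the synthetic speech with the adversarial loss in a single objective. The stand-in linear models, vocabulary size, and loss weighting are assumptions made for the example, not the patent's formulation.

```python
import torch
import torch.nn as nn

FEAT_DIM, TEXT_DIM, VOCAB = 80, 32, 100          # assumed toy sizes

tts_model = nn.Linear(TEXT_DIM, FEAT_DIM)        # stand-in GAN-based TTS generator
asr_model = nn.Linear(FEAT_DIM, VOCAB)           # stand-in speech recognition model
discriminator = nn.Linear(FEAT_DIM, 1)           # stand-in adversarial discriminator

adv_criterion = nn.BCEWithLogitsLoss()
asr_criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(tts_model.parameters()) + list(asr_model.parameters()), lr=1e-4)

def joint_step(text_feats, target_tokens, adv_weight=0.1):
    # Synthesize speech from text, then recognize it and score it adversarially.
    synthetic = tts_model(text_feats)
    asr_loss = asr_criterion(asr_model(synthetic), target_tokens)
    fake_score = discriminator(synthetic)
    adv_loss = adv_criterion(fake_score, torch.ones_like(fake_score))

    # One backward pass updates the TTS and recognition models together.
    loss = asr_loss + adv_weight * adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call: a batch of 4 text utterances, each with one target token.
loss = joint_step(torch.randn(4, TEXT_DIM), torch.randint(0, VOCAB, (4,)))
```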

Benefits of this technology:

  • Improved speech quality: The method aims to generate synthetic speech that closely resembles natural speech, enhancing the listening experience for users.
  • Enhanced speech recognition accuracy: By training the speech recognition model alongside the TTS model, the method can improve the recognition of synthetic speech, leading to more accurate transcription and understanding of spoken language.
  • Versatile applications: The technology can be applied to various domains, including voice assistants, audiobooks, language learning, and accessibility tools, providing benefits to a wide range of users.


Original Abstract Submitted

A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
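
For readers who want to map the claim language onto code, here is one more minimal sketch, using the same kind of hypothetical linear stand-ins as above, of the nested structure in the abstract: for each training text utterance, at each of its output steps, a synthetic speech representation is generated, an adversarial loss term is determined, and the TTS parameters are updated.

```python
import torch
import torch.nn as nn

tts_model = nn.Linear(32, 80)                    # stand-in GAN-based TTS model
discriminator = nn.Linear(80, 1)                 # stand-in adversarial discriminator
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(tts_model.parameters(), lr=1e-4)

# Toy data: 3 training text utterances, each with 10 output steps of 32-dim features.
training_text_utterances = [torch.randn(10, 32) for _ in range(3)]

for text_utterance in training_text_utterances:       # plurality of training text utterances
    for step_features in text_utterance:              # plurality of output steps
        synthetic = tts_model(step_features)          # synthetic speech representation
        score = discriminator(synthetic)
        adv_loss = criterion(score, torch.ones_like(score))   # adversarial loss term
        optimizer.zero_grad()
        adv_loss.backward()                           # update GAN-based TTS parameters
        optimizer.step()
```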