US Patent Application 17661832: Speaker Embeddings for Improved Automatic Speech Recognition (simplified abstract)


Speaker Embeddings for Improved Automatic Speech Recognition

Organization Name

Google LLC


Inventor(s)

Fadi Biadsy of Mountain View, CA (US)

Dirk Ryan Padfield of Seattle, WA (US)

Victoria Zayats of Seattle, WA (US)

Speaker Embeddings for Improved Automatic Speech Recognition - A simplified explanation of the abstract

This abstract first appeared for US patent application 17661832, titled 'Speaker Embeddings for Improved Automatic Speech Recognition'.

Simplified Explanation

The patent application describes a method for converting atypical speech into a more typical, canonical representation by conditioning a speech conversion model on a speaker embedding; a hypothetical code sketch follows the list below.

  • The method involves receiving a reference audio signal of a target speaker with atypical speech.
  • A speaker embedding network generates a speaker embedding that captures the speaker characteristics of the target speaker.
  • The method also involves receiving a speech conversion request with input audio data of the target speaker's utterance.
  • The speaker embedding is used to bias a speech conversion model to convert the input audio data into a more typical representation of the target speaker's utterance.
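
The application does not specify the embedding network's architecture, so the following is a minimal, hypothetical PyTorch sketch of one common choice: a d-vector-style recurrent encoder that averages frame-level states into a fixed-size, L2-normalized speaker embedding. The class names, dimensions, and log-mel input format are illustrative assumptions, not details from the filing.

  import torch
  import torch.nn as nn

  class SpeakerEmbeddingNetwork(nn.Module):
      """Maps reference audio (as log-mel frames) to a fixed-size speaker
      embedding. Illustrative architecture, not the one claimed."""

      def __init__(self, n_mels: int = 80, hidden: int = 256, embed_dim: int = 128):
          super().__init__()
          self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
          self.proj = nn.Linear(hidden, embed_dim)

      def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
          # mel_frames: (batch, time, n_mels) computed from the reference audio signal
          outputs, _ = self.encoder(mel_frames)
          pooled = outputs.mean(dim=1)            # average over time (d-vector style)
          embedding = self.proj(pooled)
          return nn.functional.normalize(embedding, dim=-1)

  # Usage: embed a reference utterance from the target speaker (dummy data).
  reference_mels = torch.randn(1, 400, 80)        # ~4 s of log-mel frames
  speaker_embedding = SpeakerEmbeddingNetwork()(reference_mels)   # shape (1, 128)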


Original Abstract Submitted

A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.
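
The abstract says the speaker embedding is used to "bias" the speech conversion model but does not disclose the mechanism. One plausible reading, sketched below in the same hypothetical PyTorch style, is feature-wise conditioning (FiLM), where the embedding is projected to per-channel scale and shift parameters applied inside the conversion model; this is an illustrative assumption, not the patented method.

  import torch
  import torch.nn as nn

  class BiasedSpeechConversionModel(nn.Module):
      """Toy conversion model that maps atypical-speech mel frames to a
      canonical representation, conditioned on a speaker embedding via
      FiLM-style scale/shift (an assumed biasing mechanism)."""

      def __init__(self, n_mels: int = 80, hidden: int = 256, embed_dim: int = 128):
          super().__init__()
          self.encoder = nn.Linear(n_mels, hidden)
          self.film = nn.Linear(embed_dim, 2 * hidden)  # per-channel (scale, shift)
          self.decoder = nn.Linear(hidden, n_mels)      # canonical mel frames out

      def forward(self, input_mels, speaker_embedding):
          # input_mels: (batch, time, n_mels), the atypical utterance
          # speaker_embedding: (batch, embed_dim), from the embedding network
          h = torch.relu(self.encoder(input_mels))
          scale, shift = self.film(speaker_embedding).chunk(2, dim=-1)
          h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # broadcast over time
          return self.decoder(h)

  # End-to-end flow matching the claim: embed the reference, then bias conversion.
  # Reuses speaker_embedding from the previous sketch.
  input_mels = torch.randn(1, 200, 80)              # dummy atypical utterance
  canonical_mels = BiasedSpeechConversionModel()(input_mels, speaker_embedding)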