US Patent Application 17826987. TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER simplified abstract



Organization Name

Tencent America LLC

Inventor(s)

Chunlei Zhang of Bellevue WA (US)

Jiachen Lian of Palo Alto CA (US)

Dong Yu of Palo Alto CA (US)

TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER - A simplified explanation of the abstract

This abstract first appeared for US patent application 17826987, titled 'TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER'.

Simplified Explanation

- The patent application describes a method for voice conversion using a conditional disentangled sequential variational auto-encoder (C-DSVAE).
- The method receives input speech segments and encodes them with a shared encoder to generate a speaker embedding and a content embedding.
- A speaker encoder and a content encoder then encode the posterior distributions of the speaker embedding and the content embedding, respectively, to obtain encoded results.
- A content bias is enabled, and the content embedding is reshaped using the content bias.
- Finally, a reconstructed speech output is generated from the encoded results and the reshaped content embedding.


Original Abstract Submitted

A method, system, apparatus, and computer-readable medium for voice conversion using a conditional disentangled sequential variational auto-encoder (C-DSVAE) is provided. The method, performed by at least one processor, includes receiving input speech segments, encoding the input speech segments via a shared encoder to generate a speaker embedding and a content embedding, and encoding a posterior distribution of the speaker embedding via a speaker encoder and encoding a posterior distribution of the content embedding via a content encoder to obtain encoded results. The method further includes enabling a content bias, reshaping the content embedding using the content bias, and generating a reconstructed speech output based on the encoded results and the reshaped content embedding.
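
The data flow described in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the applicant's implementation: the class `ToyCDSVAE`, its layer sizes, the random (untrained) weights, the time-averaging used for the speaker path, and the additive form of the content bias are all hypothetical choices made here, since the abstract does not specify them. A real system would use trained neural networks and acoustic features such as spectrogram frames.

```python
import math
import random


def make_weights(rows, cols, rng):
    """Random weight matrix (list of rows); stands in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(x, weights):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def posterior_sample(stats, z_dim, rng=None):
    """Split stats into (mean, log-variance) and draw z = mean + sigma * eps.

    With rng=None the mean is returned, which keeps the sketch deterministic.
    """
    mean, log_var = stats[:z_dim], stats[z_dim:]
    if rng is None:
        return mean
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


class ToyCDSVAE:
    """Hypothetical toy model; layer shapes and names are illustrative only."""

    def __init__(self, feat_dim=4, hidden_dim=6, z_dim=3, seed=0):
        rng = random.Random(seed)
        self.z_dim = z_dim
        self.shared = make_weights(hidden_dim, feat_dim, rng)         # shared encoder
        self.speaker_head = make_weights(2 * z_dim, hidden_dim, rng)  # speaker encoder
        self.content_head = make_weights(2 * z_dim, hidden_dim, rng)  # content encoder
        self.decoder = make_weights(feat_dim, 2 * z_dim, rng)

    def encode(self, frames, rng=None):
        """Return (z_speaker, z_content): one vector per utterance, one per frame."""
        hidden = [[math.tanh(h) for h in linear(f, self.shared)] for f in frames]
        # Speaker path (assumption): average over time so the embedding is global.
        pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
        z_speaker = posterior_sample(linear(pooled, self.speaker_head), self.z_dim, rng)
        # Content path: a separate posterior for every frame.
        z_content = [posterior_sample(linear(h, self.content_head), self.z_dim, rng)
                     for h in hidden]
        return z_speaker, z_content

    def decode(self, z_speaker, z_content, content_bias=None):
        """Reconstruct frames; an optional per-frame bias is added to the content."""
        if content_bias is not None:
            # Assumption: "reshaping" is modeled here as an additive shift.
            z_content = [[c + b for c, b in zip(frame, bias)]
                         for frame, bias in zip(z_content, content_bias)]
        # Concatenate the speaker vector with each content vector, then decode.
        return [linear(z_speaker + frame, self.decoder) for frame in z_content]


def convert(model, source_frames, target_frames, content_bias=None):
    """Zero-shot conversion: content from the source, speaker from the target."""
    _, source_content = model.encode(source_frames)
    target_speaker, _ = model.encode(target_frames)
    return model.decode(target_speaker, source_content, content_bias)


data_rng = random.Random(1)
source = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 frames
target = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]  # 8 frames
model = ToyCDSVAE()
converted = convert(model, source, target)
print(len(converted), len(converted[0]))
```

The converted output has one frame per source frame, because the content sequence comes from the source utterance while the single speaker vector comes from the target. Passing a `content_bias` (one vector per source frame) shifts the content embedding before decoding, which is one simple way to picture the "reshaping" step.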