US Patent Application 17826987. TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER simplified abstract
Inventor(s)
Chunlei Zhang of Bellevue WA (US)
Jiachen Lian of Palo Alto CA (US)
TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER - A simplified explanation of the abstract
This abstract first appeared for US patent application 17826987, titled 'TECHNIQUES FOR IMPROVED ZERO-SHOT VOICE CONVERSION WITH A CONDITIONAL DISENTANGLED SEQUENTIAL VARIATIONAL AUTO-ENCODER'.
Simplified Explanation
- The patent application describes a method for voice conversion using a conditional disentangled sequential variational auto-encoder (C-DSVAE).
- The method receives input speech segments and passes them through a shared encoder to generate a speaker embedding and a content embedding.
- A speaker encoder then encodes a posterior distribution of the speaker embedding, and a separate content encoder encodes a posterior distribution of the content embedding, to obtain the encoded results.
- A content bias is enabled, and the content embedding is reshaped using the content bias.
- Finally, a reconstructed speech output is generated from the encoded results and the reshaped content embedding.
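The steps above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the applicants' implementation: the layer sizes, the random untrained weights, the helper names (`linear`, `project`, `sample`, `encode`, `decode`), the use of diagonal Gaussian posteriors, and in particular the treatment of the content bias as a per-frame label whose vector is added to the content latent are all hypothetical choices made for the example. The abstract does not say how the bias is computed or applied.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: features per frame, hidden units, speaker-latent size,
# content-latent size, and number of discrete content-bias labels.
D, H, S, C, K = 8, 16, 4, 6, 5

def linear(n_in, n_out):
    """Random, untrained weights standing in for a learned layer."""
    return [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def project(W, v):
    """Multiply row vector v (length n_in) by W (n_in x n_out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def sample(mu, logvar):
    """Reparameterized draw from a diagonal Gaussian posterior (assumed form)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1) for m, lv in zip(mu, logvar)]

W_shared = linear(D, H)
W_spk_mu, W_spk_lv = linear(H, S), linear(H, S)
W_con_mu, W_con_lv = linear(H, C), linear(H, C)
W_bias = linear(K, C)      # one bias vector per label (assumption)
W_dec = linear(S + C, D)

def encode(frames, bias_labels):
    # Shared encoder: one hidden vector per input frame.
    hidden = [[math.tanh(a) for a in project(W_shared, f)] for f in frames]
    # Speaker embedding: averaged over time, so it is constant across frames.
    spk = [sum(col) / len(hidden) for col in zip(*hidden)]
    # Speaker encoder: posterior over a single utterance-level latent.
    z_s = sample(project(W_spk_mu, spk), project(W_spk_lv, spk))
    # Content encoder: posterior over one latent per frame.
    z_c = [sample(project(W_con_mu, h), project(W_con_lv, h)) for h in hidden]
    # Content bias (assumed additive): shift each frame's content latent by
    # the vector associated with that frame's label.
    z_c = [[z + b for z, b in zip(zc, W_bias[k])] for zc, k in zip(z_c, bias_labels)]
    return z_s, z_c

def decode(z_s, z_c):
    # Decoder: every frame sees the same speaker latent plus its own content latent.
    return [project(W_dec, z_s + zc) for zc in z_c]

T = 20  # frames in the toy segment
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
labels = [random.randrange(K) for _ in range(T)]

z_s, z_c = encode(frames, labels)
recon = decode(z_s, z_c)
print(len(recon), len(recon[0]))  # prints "20 8"
```

The abstract itself only describes reconstruction. In DSVAE-style voice conversion, the usual move is to decode the content latents of a source utterance with the speaker latent of a target utterance, which in this sketch would be `decode(encode(target, target_labels)[0], encode(source, source_labels)[1])`.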
Original Abstract Submitted
A method, system, apparatus, and computer-readable medium for voice conversion using a conditional disentangled sequential variational auto-encoder (C-DSVAE) is provided. The method, performed by at least one processor, includes receiving input speech segments, encoding the input speech segments via a shared encoder to generate a speaker embedding and a content embedding, and encoding a posterior distribution of the speaker embedding via a speaker encoder and encoding a posterior distribution of the content embedding via a content encoder to obtain encoded results. The method further includes enabling a content bias, reshaping the content embedding using the content bias, and generating a reconstructed speech output based on the encoded results and the reshaped content embedding.