ISCA Archive Interspeech 2022

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We demonstrated that simultaneously disentangling a content embedding and a speaker embedding from one utterance is feasible for downstream tasks such as speaker verification and zero-shot VC. In this study, we continue this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to discard phonetic-structure information during learning, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose conditional DSVAE, a new model that introduces a content bias as a condition on the prior modeling and thereby reshapes the content embedding sampled from the posterior distribution. In experiments on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness of the baseline prior and achieve better phoneme classification accuracy than the DSVAE baseline. Meanwhile, this change also yields improved zero-shot VC performance.
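The central change described above — replacing a fixed, content-agnostic prior on the content branch with one conditioned on a content bias — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the shapes, the linear prior network, and all variable names are illustrative assumptions. It only shows how the KL regularizer on the content posterior would be scored against a fixed N(0, I) prior versus a bias-conditioned prior.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
T, D = 4, 8  # frames and content-embedding dim (illustrative)

# Stand-ins for the posterior stats produced by the content encoder.
mu_q = rng.normal(size=(T, D))
logvar_q = rng.normal(scale=0.1, size=(T, D))

# Baseline DSVAE: a fixed N(0, I) content prior, blind to phonetic structure.
kl_fixed = kl_diag_gauss(mu_q, logvar_q, np.zeros((T, D)), np.zeros((T, D)))

# Conditional DSVAE (sketch): prior stats predicted from a frame-level
# content bias; a random linear map stands in for the prior network.
content_bias = rng.normal(size=(T, D))
W_mu = rng.normal(size=(D, D)) * 0.1
W_lv = rng.normal(size=(D, D)) * 0.1
mu_p, logvar_p = content_bias @ W_mu, content_bias @ W_lv
kl_cond = kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)

# Both KL terms are valid (non-negative) regularizers; only the conditional
# one lets the prior follow the phonetic content frame by frame.
print(kl_fixed >= 0.0 and kl_cond >= 0.0)
```

The design point is that the KL term no longer pulls every frame's content posterior toward the same uninformative center: a prior that tracks the content bias can preserve phonetic structure in the sampled content embedding.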