ISCA Archive Interspeech 2024

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Streaming voice conversion has gained popularity for its applicability to real-time applications. The recently proposed DualVC 2 achieves robust, high-quality streaming voice conversion with a latency of approximately 180 ms. However, DualVC 2 is built on the recognition-synthesis framework: its multi-level cascaded models cannot be jointly optimized, and its ASR encoder causes severe performance drops when small chunks are used. To address these issues, we propose DualVC 3, an end-to-end model. It uses K-means clustered SSL features to guide the training of the content encoder and adopts an optional language model for pseudo-context generation to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in both subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
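As a rough illustration of what "K-means clustered SSL features" can look like in practice, the sketch below clusters frame-level feature vectors and maps each frame to the index of its nearest centroid, giving one discrete label per frame. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy 4-dimensional vectors stand in for real SSL features, the function names (`kmeans`, `quantize`, `nearest`) and all parameter values are hypothetical, and the abstract does not specify how the resulting labels enter the training objective.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to a single feature frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the list of centroids."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        labels = [nearest(f, centroids) for f in frames]
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def quantize(frames, centroids):
    """Map every frame to a discrete cluster index."""
    return [nearest(f, centroids) for f in frames]


# Toy stand-in for SSL features (assumption): 100 frames of dimension 4,
# drawn from two well-separated groups.
data_rng = random.Random(1)
frames = [[data_rng.gauss(center, 0.1) for _ in range(4)]
          for center in (0.0, 5.0) for _ in range(50)]

centroids = kmeans(frames, num_clusters=2)
targets = quantize(frames, centroids)  # one discrete label per frame
```

In a real system the frames would come from a pretrained SSL model and the number of clusters would be far larger; the per-frame indices in `targets` could then serve as discrete content labels for supervising a content encoder, which is one plausible reading of the abstract rather than a confirmed detail.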