ISCA Archive Interspeech 2025

Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced Discriminator

Desheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu

Zero-shot voice conversion (VC) can be achieved by extracting the source linguistic content and the unseen target speaker information, then reconstructing mel-spectrograms from these representations. In this paper, we propose a novel zero-shot VC method. First, we disentangle content and speaker information by training the content encoder from scratch, integrating a supervised phoneme classification network with speaker normalization and content restoration modules. Second, we enhance the speaker encoder by applying a consistency loss, ensuring the extraction of accurate and robust speaker representations. Finally, we introduce an attention-enhanced discriminator for adversarial training to generate high-fidelity mel-spectrograms. Experimental results show that the proposed method achieves outstanding VC performance in terms of both speech quality and speaker similarity for unseen speakers.
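
To make two of the ideas above concrete, the following is a minimal illustrative sketch and not the method as implemented in this paper. It assumes that speaker normalization can be approximated by per-channel instance normalization over time, and that the consistency loss can be approximated by a cosine distance between two speaker embeddings. The abstract specifies neither form, and the function names `instance_normalize` and `cosine_consistency_loss` are hypothetical.

```python
import numpy as np


def instance_normalize(features, eps=1e-5):
    """Normalize each channel of a (channels, frames) array over time.

    Assumption: instance normalization stands in for the speaker
    normalization module, whose exact form the abstract does not give.
    """
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)


def cosine_consistency_loss(emb_a, emb_b, eps=1e-8):
    """Return 1 - cosine similarity between two speaker embeddings.

    Assumption: a cosine distance stands in for the consistency loss;
    the abstract does not state which distance is used.
    """
    num = float(np.dot(emb_a, emb_b))
    den = float(np.linalg.norm(emb_a) * np.linalg.norm(emb_b)) + eps
    return 1.0 - num / den


# Toy data: a random "mel-spectrogram" with 80 bins and 120 frames.
rng = np.random.default_rng(0)
mel = rng.normal(loc=3.0, scale=2.0, size=(80, 120))
normalized = instance_normalize(mel)

# Toy speaker embeddings: identical vectors should give a loss near 0.
embedding = rng.normal(size=256)
same_loss = cosine_consistency_loss(embedding, embedding)
opposite_loss = cosine_consistency_loss(embedding, -embedding)
```

After normalization, every channel has approximately zero mean and unit variance over time, which removes per-utterance channel statistics that often carry speaker information. The loss is close to 0 for matching embeddings and close to 2 for opposite ones, so minimizing it pulls the two embeddings toward the same direction.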