ISCA Archive Interspeech 2023

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed

Meiying Chen, Zhiyao Duan

Recent advancements in neural speech synthesis have renewed interest in voice conversion (VC) that goes beyond timbre transfer. Controllability of para-linguistic parameters such as pitch and speed is crucial in various applications. However, existing studies either lack interpretability or provide only global control at the utterance level. This paper introduces ControlVC, the first neural voice conversion system to enable time-varying controls on pitch and speed. ControlVC uses pre-trained encoders to generate pitch and linguistic embeddings, which are combined and converted to speech by a vocoder. Speed control is achieved by TD-PSOLA pre-processing, while pitch control is achieved by manipulating the pitch contour before feeding it into the encoder. Systematic subjective and objective evaluations show that ControlVC significantly outperforms self-constructed baselines on speech quality and controllability for non-parallel zero-shot conversion while achieving time-varying control.
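
As an illustration of what manipulating the pitch contour with a time-varying control could look like, the following is a minimal sketch and not the authors' implementation. It assumes a frame-level F0 contour in Hz with 0 marking unvoiced frames, a 10 ms frame hop, and a control curve given in semitones; the function names `semitone_curve` and `shift_pitch_contour` are hypothetical. The TD-PSOLA speed control and the encoders are not shown.

```python
import numpy as np


def semitone_curve(control_times, control_semitones, frame_times):
    """Linearly interpolate sparse (time, semitone) control points
    to one shift value per analysis frame (hypothetical helper)."""
    return np.interp(frame_times, control_times, control_semitones)


def shift_pitch_contour(f0_hz, semitones):
    """Scale each voiced frame by 2**(semitones / 12).

    Assumption: unvoiced frames are encoded as f0 == 0 and are left at 0.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    semitones = np.asarray(semitones, dtype=float)
    if f0_hz.shape != semitones.shape:
        raise ValueError("f0_hz and semitones must have the same shape")
    voiced = f0_hz > 0
    shifted = np.zeros_like(f0_hz)
    shifted[voiced] = f0_hz[voiced] * 2.0 ** (semitones[voiced] / 12.0)
    return shifted


# Toy contour: five frames at an assumed 10 ms hop, one unvoiced frame.
frame_times = np.arange(5) * 0.01
f0 = np.array([200.0, 200.0, 0.0, 200.0, 200.0])

# Control curve that ramps from 0 to +12 semitones over the utterance.
curve = semitone_curve([0.0, 0.04], [0.0, 12.0], frame_times)
modified_f0 = shift_pitch_contour(f0, curve)
```

In this sketch the shift grows frame by frame, so the last voiced frame ends one octave above the first, which is the kind of within-utterance variation that a single global pitch factor cannot express. In a system like the one described, the modified contour would then be passed to the pitch encoder.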