ISCA Archive Interspeech 2024

Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion

Ji Sub Um, Hoirin Kim

Research on zero-shot voice conversion has grown in recent years. Many conventional studies use dynamic layers to perform conversion for unseen speakers. Our aim is to extend these dynamic methods so that they also transmit content information. To this end, we propose AGRN-VC, which uses ConvNeXt V2 modules with adaptive global response normalization (AGRN) layers to convey content information. When conveying this information, it is crucial that the source speaker's information is not transmitted along with it. We therefore adopt auxiliary learning with cluster-based pseudo labels: a pseudo-label classification task performed on the content encoder's output helps the encoder focus on content information while excluding speaker information. We compare the proposed model against several baseline models using subjective and objective metrics. The proposed approach achieves better converted speech quality than the baselines in terms of speaker similarity and naturalness.
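To make the idea of an adaptive normalization layer concrete, the following is a minimal sketch, not the authors' implementation. The `grn` function follows the global response normalization of ConvNeXt V2 (per-channel L2 norm over time, divided by the mean norm across channels, then a scaled residual). The adaptive part is an assumption: here the scale and shift are predicted from a conditioning vector by two hypothetical linear maps, `w_gamma` and `w_beta`. The abstract does not specify how AGRN derives its parameters, so all names and shapes below are illustrative.

```python
import math

def grn(x, gamma, beta, eps=1e-6):
    """Global response normalization over a (channels x frames) feature map.

    x is a list of channels, each a list of frame values;
    gamma and beta hold one value per channel.
    """
    # Global aggregation: L2 norm of each channel over time.
    norms = [math.sqrt(sum(v * v for v in ch)) for ch in x]
    # Divisive normalization: each channel norm relative to the mean norm.
    mean_norm = sum(norms) / len(norms)
    scale = [n / (mean_norm + eps) for n in norms]
    # Calibration with a residual connection: gamma * (x * scale) + beta + x.
    return [[g * v * s + b + v for v in ch]
            for ch, g, b, s in zip(x, gamma, beta, scale)]

def linear(weights, cond):
    """Bias-free linear map: one output per row of weights."""
    return [sum(w * c for w, c in zip(row, cond)) for row in weights]

def adaptive_grn(x, cond, w_gamma, w_beta):
    """Assumed adaptive variant: gamma and beta are predicted from cond.

    w_gamma and w_beta are hypothetical (channels x cond_dim) matrices.
    """
    gamma = linear(w_gamma, cond)
    beta = linear(w_beta, cond)
    return grn(x, gamma, beta)
```

With all-zero `w_gamma` and `w_beta`, the layer reduces to the identity because of the residual term, which mirrors the zero initialization of the scale and shift in ConvNeXt V2's GRN.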