ISCA Archive Interspeech 2023

Robust Feature Decoupling in Voice Conversion by Using Locality-Based Instance Normalization

Yewei Gu, Xianfeng Zhao, Xiaowei Yi

Numerous style transfer methods have shown that instance normalization (IN) is a simple yet effective way to remove style information. However, few studies have examined whether the channel-wise feature statistics, such as mean and standard deviation (std), are consistent locally and globally, an oversight that ultimately leads to insufficient feature decoupling. In this paper, we propose locality-based instance normalization (LoIN), which imposes statistical feature consistency constraints on latent feature maps. In the training phase, LoIN performs normalization using local feature statistics calculated on randomly selected frames rather than on the entire set of frames. LoIN is lightweight, computationally inexpensive, and transferable to any IN-driven voice conversion (VC) method. Experimental results show that LoIN improves disentanglement and transfer performance, with gains in both speaker similarity and content consistency.
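To make the contrast between global and local statistics concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature map of shape (channels, frames), a uniformly random subset of frames drawn without replacement, and a fallback to standard IN outside training. The names `instance_norm`, `local_instance_norm`, and `num_local_frames` are hypothetical and do not come from the paper.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Standard IN: per-channel statistics over all frames.

    x has shape (channels, frames).
    """
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)


def local_instance_norm(x, num_local_frames, rng, training=True, eps=1e-5):
    """Sketch of locality-based normalization (assumed form).

    During training, per-channel mean and std are computed on a random
    subset of frames and then applied to every frame. How the paper
    selects frames (scattered or contiguous) is not stated in the
    abstract; uniform sampling without replacement is an assumption.
    """
    _, frames = x.shape
    if not training or num_local_frames >= frames:
        # Assumption: fall back to global statistics outside training.
        return instance_norm(x, eps)
    idx = rng.choice(frames, size=num_local_frames, replace=False)
    local = x[:, idx]
    mean = local.mean(axis=1, keepdims=True)
    std = local.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(loc=2.0, scale=3.0, size=(4, 50))
    out = local_instance_norm(features, num_local_frames=10, rng=rng)
    print(out.shape)
```

Because the statistics come from a subset of frames, the normalized output is only approximately zero-mean per channel; under this reading, that mismatch between local and global statistics is what acts as the consistency constraint during training.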