ISCA Archive SSW 2023

PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko

This paper describes a novel approach to non-parallel many-to-many voice conversion (VC) that utilizes a variant of the conditional variational autoencoder (VAE) called a perturbation-resistant VAE (PRVAE). In VAE-based VC, it is commonly assumed that the encoder extracts content from the input speech while removing source speaker information. Following this extraction, the decoder generates output from the extracted content and target speaker information. However, in practice, the encoded features may still retain source speaker information, which can lead to a degradation of speech quality during speaker conversion tasks. To address this issue, we propose a perturbation-resistant encoder trained to match the encoded features of the input speech with those of a pseudo-speech generated through a content-preserving transformation of the input speech’s fundamental frequency and spectral envelope using a combination of pure signal processing techniques. Our experimental results demonstrate that this straightforward constraint significantly enhances the performance in non-parallel many-to-many speaker conversion tasks. Audio samples can be accessed at http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/prvaevc/.
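The matching constraint described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear `toy_encoder`, the L1 distance in `consistency_loss`, and `warp_frequency_axis`, which is only a crude stand-in for the content-preserving transformation of fundamental frequency and spectral envelope. The point is simply that the penalty is zero when the encoder output is unchanged by the perturbation and grows when it is not.

```python
import numpy as np

def warp_frequency_axis(spec, alpha):
    """Hypothetical stand-in perturbation (NOT the paper's method):
    stretch or compress the frequency axis of a (bins, frames)
    magnitude spectrogram by a factor alpha via linear interpolation."""
    n_bins = spec.shape[0]
    bins = np.arange(n_bins)
    # Positions in the original spectrum that each output bin reads from.
    src = np.clip(bins / alpha, 0, n_bins - 1)
    # Interpolate every frame (column) at the warped positions.
    return np.stack(
        [np.interp(src, bins, spec[:, t]) for t in range(spec.shape[1])],
        axis=1,
    )

def toy_encoder(spec, weights):
    """Hypothetical encoder: one linear layer with tanh,
    mapping (bins, frames) to (dims, frames)."""
    return np.tanh(weights @ spec)

def consistency_loss(spec, weights, alpha):
    """Assumed L1 distance between the encoding of the input and the
    encoding of its perturbed copy; the paper may use another distance."""
    z = toy_encoder(spec, weights)
    z_pert = toy_encoder(warp_frequency_axis(spec, alpha), weights)
    return float(np.mean(np.abs(z - z_pert)))

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((64, 20)))     # toy "spectrogram"
weights = 0.1 * rng.standard_normal((8, 64))     # toy encoder weights

loss_identity = consistency_loss(spec, weights, alpha=1.0)  # no perturbation
loss_warped = consistency_loss(spec, weights, alpha=1.2)    # warped input
```

In a training setup of this kind, such a term would be added to the usual VAE objective so that the encoder is pushed toward features that do not change under the perturbation; how the terms are weighted is not stated in the abstract.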