ISCA Archive Interspeech 2021

TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion

Ziyi Chen, Pengyuan Zhang

Voice conversion (VC) techniques aim to modify the speaker identity and style of an utterance while preserving its linguistic content. Although many VC methods exist, the state of the art is still a cascade of automatic speech recognition (ASR) and text-to-speech (TTS). This paper presents a new transformer-based vector-quantized autoencoder structure with connectionist temporal classification (CTC) loss for non-parallel VC, inspired by the cascaded ASR and TTS approach. The proposed method combines CTC loss and vector quantization to extract high-level linguistic information that is free of speaker information. Objective and subjective evaluations on Mandarin datasets show that speech converted by the proposed model is better than that of the baselines in naturalness, rhythm, and speaker similarity.
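As a rough illustration of the two ingredients named above, the following minimal sketch shows a nearest-neighbour vector quantization step and the label-collapsing rule that underlies CTC. It is not the authors' implementation: the function names `quantize` and `ctc_collapse`, the toy codebook, and the choice of squared Euclidean distance are all assumptions made for this example, and the transformer encoder, the decoder, and the training losses are omitted.

```python
from typing import List, Sequence, Tuple


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each frame with its nearest codebook vector.

    Hypothetical helper: squared Euclidean distance is assumed here.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Distance from this frame to every codebook entry.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        idx = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


def ctc_collapse(labels: Sequence[int], blank: int = 0) -> List[int]:
    """Apply the CTC collapsing rule: merge repeated labels, then drop blanks."""
    out: List[int] = []
    prev = None
    for label in labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


if __name__ == "__main__":
    # Toy 2-dimensional codebook and encoder frames (illustrative values only).
    codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
    frames = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2], [0.8, 0.7]]
    indices, quantized = quantize(frames, codebook)
    print(indices)
    print(ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
```

In this sketch the discrete bottleneck discards fine-grained variation in each frame, while the collapsing rule maps a frame-level label sequence onto a shorter symbol sequence; how the paper actually couples the two is not specified in the abstract.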