ISCA Archive Interspeech 2022

Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion

Tuan Nam Nguyen, Ngoc-Quan Pham, Alexander Waibel

Accent conversion (AC) aims to generate synthetic audio by changing the pronunciation patterns and prosody of source speakers (in source audio) while preserving voice quality and linguistic content. Since no parallel corpus exists that contains pairs of utterances with the same content, spoken by the same speakers in different accents, we work on a solution to synthesize one as training input. The training pipeline consists of two steps. First, a voice conversion (VC) model is constructed to synthesize a training data set containing pairs of utterances in the same voice but in two different accents. Second, an AC model is trained on the synthesized data to convert source-accented speech to target-accented speech. Given the recognized success of self-supervised speech representation learning (wav2vec 2.0) on speech problems such as VC, speech recognition, speech translation, and speech-to-speech translation, we adopt this architecture, with some customization, to train the AC model in the second step. With just 9 hours of synthesized training data, an encoder initialized with the weights of the pre-trained wav2vec 2.0 model outperforms an LSTM-based encoder.
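The core architectural choice, initializing the AC encoder from a pre-trained wav2vec 2.0 checkpoint instead of training an LSTM encoder from scratch, can be sketched as below. This is a minimal illustration rather than the authors' implementation: the checkpoint name, the decoder head, and the loss are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): an accent-conversion model whose
# encoder is initialized from pre-trained wav2vec 2.0 weights. The decoder
# here is a placeholder projection to target-accent mel frames.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AccentConversionModel(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        # Encoder: self-supervised speech representation, loaded from a
        # public checkpoint (assumed name; the paper does not specify one).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size
        # Decoder: illustrative frame-level head; the paper's actual
        # decoder architecture is not detailed in the abstract.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_mels)
        )

    def forward(self, source_waveform: torch.Tensor) -> torch.Tensor:
        # source_waveform: (batch, samples) raw 16 kHz audio
        features = self.encoder(source_waveform).last_hidden_state
        return self.decoder(features)  # (batch, frames, n_mels)

# Training pairs come from step 1: a voice-conversion model yields
# (source-accent, target-accent) utterances in the same voice, so the AC
# model can be fit with a simple frame-level regression loss.
model = AccentConversionModel()
wav = torch.randn(1, 16000)                     # one second of dummy audio
mel_prediction = model(wav)
dummy_target = torch.zeros_like(mel_prediction)  # stands in for target-accent mels
loss = nn.functional.l1_loss(mel_prediction, dummy_target)
```

In practice the synthesized target-accent utterance would supply the regression target (with some frame alignment or an attention-based decoder); the zero tensor above only keeps the sketch self-contained.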