ISCA Archive Interspeech 2023

Decoupling Segmental and Prosodic Cues of Non-native Speech through Vector Quantization

Waris Quamer, Anurag Das, Ricardo Gutierrez-Osuna

Accent conversion (AC) seeks to transform utterances from a non-native speaker so that they sound native-like. Compared to voice conversion, which generally treats accent and voice quality as one, AC provides a finer-grained decomposition of speech. This paper presents an AC system that further decomposes an accent into its segmental and prosodic characteristics, and provides independent control of both channels. The system uses conventional modules (acoustic model, speaker/prosody encoders, seq2seq model) to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. However, naive application of this idea prevents the system from learning and transferring prosody. We show that vector quantization and removal of repeated codewords allow the system to transfer prosody and improve voice similarity, as verified by objective and perceptual measures.
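As a rough illustration of the two operations named in the abstract, the sketch below maps each frame-level feature vector to its nearest codeword and then collapses consecutive duplicates. It is a toy example under stated assumptions, not the system described in the paper: the function names `quantize` and `collapse_repeats`, the fixed three-entry codebook, the Euclidean nearest-neighbor rule, and the 2-dimensional frames are all hypothetical. One plausible reading, which is an assumption here, is that collapsing repeats strips duration cues from the segmental stream, so that timing must come from the prosody channel.

```python
from itertools import groupby


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codeword (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def collapse_repeats(codes):
    """Keep one codeword index per run of consecutive identical indices."""
    return [code for code, _run in groupby(codes)]


# Hypothetical codebook and frame features, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [
    [0.1, -0.1],
    [0.2, 0.0],
    [0.9, 1.1],
    [1.0, 1.0],
    [3.8, 4.1],
    [0.0, 0.1],
]

codes = quantize(frames, codebook)   # frame-rate index sequence
collapsed = collapse_repeats(codes)  # same sequence with runs merged
print(codes, collapsed)
```

In this toy input, six frames reduce to four codeword indices. A codeword that recurs later in the sequence is kept, since only adjacent repeats are merged.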