ISCA Archive Interspeech 2023

Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis

Joun Yeop Lee, Jae-Sung Bae, Seongkyu Mun, Jihwan Lee, Ji-Hyun Lee, Hoon-Young Cho, Chanwoo Kim

Although recent zero-shot text-to-speech (zs-TTS) models achieve high speech quality, their speaker similarity still falls short. Speaker similarity can be decomposed into two components: one that stays consistent within a speaker (timbre) and one that varies from utterance to utterance (cadence). In this paper, we propose a timbre-cadence speaker encoder for zs-TTS that improves speaker similarity by modeling both components. To disentangle timbre and cadence more efficiently, we employ a hierarchical structure. The cadence embedding is encoded first, using VICReg to enlarge the variation among the utterance-level embeddings within a batch. Next, the timbre embedding is extracted after subtracting the cadence embedding, and it is trained with a loss between the timbre embedding and a speaker ID-based speaker embedding. Additionally, we propose an effective data augmentation method called speaker mixing augmentation, in which two short utterances from different speakers are concatenated, to make the zs-TTS model more robust.
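Two of the ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the list-based waveform and embedding representations, and the `max_len` parameter are assumptions made for illustration. The variance term follows the hinge-on-standard-deviation form of VICReg with its usual defaults (`gamma=1.0`, `eps=1e-4`), and it is only one of VICReg's three terms (variance, invariance, covariance). How the paper combines these terms is not stated in the abstract.

```python
import math


def speaker_mix(utt_a, utt_b, max_len=None):
    """Concatenate two short utterances from different speakers.

    utt_a, utt_b: sequences of samples (or frames); hypothetical representation.
    Returns the mixed sequence and the index where the second speaker starts.
    """
    mixed = list(utt_a) + list(utt_b)
    boundary = len(utt_a)
    if max_len is not None:
        # Optional truncation to a fixed length (an assumption, not from the paper).
        mixed = mixed[:max_len]
        boundary = min(boundary, max_len)
    return mixed, boundary


def variance_term(embeddings, gamma=1.0, eps=1e-4):
    """VICReg-style variance hinge over a batch of embeddings.

    For each embedding dimension, penalize a batch standard deviation
    below gamma: mean over dimensions of max(0, gamma - sqrt(var + eps)).
    Requires at least two embeddings in the batch.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    total = 0.0
    for d in range(dim):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)  # unbiased variance
        std = math.sqrt(var + eps)
        total += max(0.0, gamma - std)
    return total / dim
```

In this sketch, a batch whose cadence embeddings have all collapsed to the same vector gets a penalty close to `gamma`, while a batch whose embeddings are spread out in every dimension gets a penalty of zero. That matches the abstract's description of enlarging the variation among embeddings within a batch.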