ISCA Archive SSW 2023

Cross-lingual transfer using phonological features for resource-scarce text-to-speech

Johannes Abraham Louw

In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multispeaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resource-scarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.
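
To illustrate the idea of replacing phoneme identities with phonological feature vectors, the following is a minimal sketch under stated assumptions, not the feature set or input pipeline used in this work. The feature names, the binary values, and the helper functions `encode`, `features_seen`, and `is_covered` are all hypothetical and heavily simplified. The sketch shows why a target-language sound such as the Afrikaans voiceless velar fricative /x/, which is absent from a toy English-like source inventory, can still be described entirely by features that the source inventory already exercises.

```python
# Toy phonological feature inventory (an assumption for illustration only).
FEATURES = ("consonantal", "voiced", "nasal", "labial",
            "coronal", "dorsal", "continuant")

# Simplified binary feature assignments: 1 = present, 0 = absent.
# A small English-like source inventory (hypothetical subset).
SOURCE_PHONEMES = {
    "p": (1, 0, 0, 1, 0, 0, 0),
    "b": (1, 1, 0, 1, 0, 0, 0),
    "m": (1, 1, 1, 1, 0, 0, 0),
    "s": (1, 0, 0, 0, 1, 0, 1),
    "k": (1, 0, 0, 0, 0, 1, 0),
}

# A phoneme that occurs only in the target language in this toy setup.
TARGET_ONLY_PHONEMES = {
    "x": (1, 0, 0, 0, 0, 1, 1),  # voiceless velar fricative
}


def encode(phonemes, table):
    """Map a phoneme sequence to a list of feature vectors (one per phoneme)."""
    return [list(table[p]) for p in phonemes]


def features_seen(table):
    """Return, per feature, 1 if any phoneme in the table has it active."""
    seen = [0] * len(FEATURES)
    for vec in table.values():
        seen = [a | b for a, b in zip(seen, vec)]
    return seen


def is_covered(vec, seen):
    """True if every active feature in vec is active somewhere in the source."""
    return all(s >= v for v, s in zip(vec, seen))


if __name__ == "__main__":
    # Feature-vector input for the sequence /s k/.
    print(encode(["s", "k"], SOURCE_PHONEMES))
    # /x/ is a new combination, but each of its features was seen in the source.
    x_vec = TARGET_ONLY_PHONEMES["x"]
    print(x_vec in SOURCE_PHONEMES.values())
    print(is_covered(x_vec, features_seen(SOURCE_PHONEMES)))
```

In this toy setup /x/ is a new combination of features rather than a new input symbol: it shares the dorsal feature with /k/ and the continuant feature with /s/. A phoneme-ID input, by contrast, would need a new embedding for /x/ that has never been trained.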