ISCA Archive SpeechProsody 2024

Prosodic characteristics of English-accented Swedish neural TTS

Christina Tånnander, Jim O'Regan, David House, Jens Edlund, Jonas Beskow

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, even though no prosodic labels are used in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missing durational differences. We verified TTS quality in three ways. Automatic speech recognition yielded low error rates, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a ranking of perceived English-accentedness, obtained through pairwise comparisons by 20 human listeners, correlated strongly with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations consistent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, the work will help implement speech phenomena such as code switching in TTS.
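The listener evaluation described above, a ranking derived from pairwise comparisons and then correlated with the targeted accentedness, can be illustrated with a minimal sketch. The abstract does not state how the ranking or the correlation was computed, so everything here is an assumption: the ranking is taken from simple win rates, the correlation is Spearman's rank correlation, and the stimulus names, target levels, and judgements are made-up toy data, not results from the study.

```python
from collections import Counter


def win_rates(judgements):
    """Fraction of comparisons in which each stimulus was judged more accented.

    judgements: iterable of (more_accented, less_accented) pairs.
    """
    wins, seen = Counter(), Counter()
    for more, less in judgements:
        wins[more] += 1
        seen[more] += 1
        seen[less] += 1
    return {item: wins[item] / seen[item] for item in seen}


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical stimuli with targeted accentedness levels (toy values).
targets = {"stim_a": 0.0, "stim_b": 0.5, "stim_c": 1.0}
# Hypothetical listener judgements: (judged more accented, judged less accented).
judgements = [("stim_c", "stim_a"), ("stim_c", "stim_b"), ("stim_b", "stim_a")]

rates = win_rates(judgements)
names = sorted(targets)
rho = spearman([targets[n] for n in names], [rates[n] for n in names])
```

With these toy judgements the perceived ordering matches the targeted ordering exactly, so the correlation is perfect. Real listener data would contain disagreements, and a model-based ranking (for example Bradley-Terry) may be preferable to raw win rates when the comparisons are sparse or unbalanced.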