ISCA Archive SSW 2023

Advocating for text input in multi-speaker text-to-speech systems

Gérard Bailly, Martin Lenglet, Olivier Perrotin, Esther Klabbers

Nowadays text-to-speech synthesis (TTS) systems are most commonly trained using phonetic input. This is mostly due to the poor performance of the letter-to-sound (L2S) mapping (in particular with languages with opaque orthography) performed by end-to-end TTS: the empirical distribution of the words sampled in the sole training corpus cannot compete with pronunciation dictionaries. Taylor and Richmond [1] actually reported letter-to-sound errors – implicitly performed by end-to-end systems from raw text input – close to 10%. This paper nevertheless shows that speakers produce lawful phonological variations and that end-to-end TTS systems trained to accept text input – once trained adequately – can capture these variations of pronunciation that are strong markers of sociolinguistic features. We illustrate such variations on liaisons and schwas in French and r-linking in British English. We therefore advocate for restoring text input for TTS, so that the many aspects of style variations (produced by speakers as well as stylistic variations) encoded by suprasegmental features can also be reflected in actual variations of pronunciation.