ISCA Archive Interspeech 2024

Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis

Miku Nishihara, Dan Wells, Korin Richmond, Aidan Pine

Global style tokens (GSTs) allow for rich modelling of the variation in a speech corpus and subsequent control of text-to-speech synthesis (TTS). However, certain styles of speech may be marked by variation along multiple dimensions, complicating the interpretation and control of learned style tokens. One example is hyperarticulated or ‘clear’ speech, such as speech directed toward listeners with hearing impairments or toward language learners in the classroom. In English, clear speech is characterised by reduced speaking rate, increased F0, more careful articulation of vowels and plosive consonants, and other factors. We present a method for simplifying control of style tokens by applying principal components analysis (PCA) to GST weights from a TTS system trained on both plain and clear speech. We identify the axes of variation in PCA space with the acoustic correlates of clear speech in English and show that we can synthesise either style by moving along a single dimension in that space.
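The core idea, PCA over per-utterance GST weight vectors followed by a one-dimensional sweep mapped back into weight space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The helper names (`fit_pca`, `project`, `weights_at`), the 10-token count, the synthetic Dirichlet-sampled "plain" and "clear" weights, and the choice to hold all non-swept axes at zero are assumptions made for illustration. In this synthetic setup the style difference dominates the variance, so it lands on the first principal axis. With real GST weights, which axis corresponds to clear speech would have to be identified empirically.

```python
import numpy as np


def fit_pca(weights, n_components=2):
    """Fit PCA to an (n_utterances, n_tokens) matrix of GST weights via SVD."""
    mean = weights.mean(axis=0)
    centered = weights - mean
    # Rows of vt are orthonormal principal axes, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, vt[:n_components], explained[:n_components]


def project(weights, mean, components):
    """Coordinates of each utterance's GST weights in PCA space."""
    return (weights - mean) @ components.T


def weights_at(mean, components, axis, value):
    """Map a point on one PCA axis (other axes held at 0) back to GST weights."""
    coords = np.zeros(components.shape[0])
    coords[axis] = value
    return mean + coords @ components


# Synthetic stand-in (assumption) for per-utterance GST weights over 10 tokens.
# Each row sums to 1, like softmax attention weights. Real weights would come
# from the style-token attention layer of a trained TTS model.
rng = np.random.default_rng(0)
n_tokens = 10
alpha_plain = np.full(n_tokens, 2.0)
alpha_plain[0] = 20.0  # hypothetical "plain" utterances favour token 0
alpha_clear = np.full(n_tokens, 2.0)
alpha_clear[1] = 20.0  # hypothetical "clear" utterances favour token 1
plain = rng.dirichlet(alpha_plain, size=50)
clear = rng.dirichlet(alpha_clear, size=50)
all_weights = np.vstack([plain, clear])

mean, components, explained = fit_pca(all_weights)
plain_pc1 = project(plain, mean, components)[:, 0]
clear_pc1 = project(clear, mean, components)[:, 0]
print(f"PC1 explains {explained[0]:.0%} of variance")
print(f"mean PC1: plain {plain_pc1.mean():+.2f}, clear {clear_pc1.mean():+.2f}")

# Sweep along PC1 from the plain-style mean to the clear-style mean; each
# reconstructed vector could then condition the synthesiser in place of GST weights.
for v in np.linspace(plain_pc1.mean(), clear_pc1.mean(), 5):
    w = weights_at(mean, components, axis=0, value=v)
    print(f"PC1={v:+.2f} -> weights {np.round(w, 2)}")
```

Because every GST weight vector here sums to 1, the centred data, and therefore the principal axes, lie in the sum-to-zero subspace, so reconstructed vectors also sum to 1. They are not guaranteed to stay non-negative, however, if the sweep moves far beyond the range of the observed data.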