ISCA Archive SpeechProsody 2024

Is there an uncanny valley for speech? Investigating listeners’ evaluations of realistic TTS voices

Alice Ross, Martin Corley, Catherine Lai

The exploration of uncanny valley effects (UVE) – a distaste for entities that appear almost, but not quite, human – has been a productive topic of research in human-robot interaction. Meanwhile, realistic text-to-speech (TTS) voices are increasingly encountered in various settings. In this work, we aim to describe the relationship between the perceived human-likeness and pleasantness of TTS voices and seek evidence of auditory UVE in listeners’ evaluations. In an online between-subjects experiment, listeners rated an array of manipulated TTS voices, trained using a single speaker’s data. The evidence obtained is compatible with a slight plateau in a generally positive correlation between realism and approval. All the TTS voices used received ratings of below 50% on average for ‘human-likeness’, and therefore conclusions about UVE, i.e. negative reactions to voices perceived as very human-like, cannot be drawn from these data. Our results suggest that, although a correlation exists, high realism may not be necessary for relatively high approval; on average, voices with decreased pitch variation were rated about twice as highly for being ‘pleasant’ and ‘friendly’ as they were ‘like a human’. The relationship between pitch variation and perceived realism is examined and identified as a direction for further research.