ISCA Archive Interspeech 2006

A model of the regularities underlying speaker variation: evidence from hybrid synthesis

Susan R. Hertz

This paper presents the framework of a speech model, tentatively called the "hybrid model," which offers an explanation of how listeners can identify phonemes in an incoming speech signal despite the vast amount of cross-speaker and contextual variation. Fundamental to the model are two basic speech units into which listeners process the incoming speech stream: acoustic consonant clusters and acoustic nuclei. Acoustic nuclei are responsible for speaker identity, whereas acoustic consonant clusters are more generic and can even be substituted across speakers with negligible impact on speech quality. The paper focuses on acoustic consonant clusters, showing that much of their variability is perceptually irrelevant and explaining how the hybrid model accounts for listeners' ability to parse them into phonemes. The paper supports the model as applied to English by drawing on experiments in hybrid synthesis, a technique in which speech is produced by splicing together segments from different speakers [1].