ISCA Archive SpeechProsody 2006

Speech recognition only with supra-segmental features - hearing speech as music -

Nobuaki Minematsu, Tazuko Nishimura, Takao Murakami, Keikichi Hirose

This paper proposes a novel paradigm of speech recognition in which only supra-segmental features are used. Absolute properties of speech events, such as formants and spectra, are completely discarded, and only the relative and differential properties of the events are extracted as phonic contrasts. The phonic contrasts are treated as supra-segmental features, and they are shown mathematically not to carry non-linguistic features such as speaker, age, and gender. This leads us to expect that speaker-independent speech recognition should be possible with reference models built from a single speaker's speech. Experiments on isolated vowel sequence recognition show that this expectation is correct and that the new paradigm outperforms the conventional one using more than four thousand speakers, even for noisy speech. Capturing only the contrasts between sounds and the structure they form is common when listening to music, which suggests that the proposed paradigm hears speech as music.
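
The invariance claim above can be illustrated with a small numerical sketch. The abstract does not state how events are modeled or how contrasts are measured, so everything here is an assumption made for illustration: each speech event is a one-dimensional Gaussian, a contrast is the Bhattacharyya distance between two events, and a change of speaker is an invertible affine map x -> a*x + b. The function names and the numeric values are hypothetical.

```python
import math
from itertools import combinations


def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between 1-D Gaussians N(m1, v1) and N(m2, v2)."""
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def contrast_structure(events):
    """Pairwise distances between all events, each given as (mean, variance)."""
    return [bhattacharyya_1d(*a, *b) for a, b in combinations(events, 2)]


def affine(events, a, b):
    """Map every event through x -> a*x + b (assumed model of a speaker change)."""
    return [(a * m + b, a * a * v) for m, v in events]


# Made-up (mean, variance) values for three vowel-like events of one speaker.
speaker_a = [(1.0, 0.5), (3.0, 0.8), (6.0, 0.3)]
# The same events after an assumed speaker transformation.
speaker_b = affine(speaker_a, a=1.7, b=-2.0)

struct_a = contrast_structure(speaker_a)
struct_b = contrast_structure(speaker_b)

# The absolute properties differ, but the contrasts agree up to rounding.
for da, db in zip(struct_a, struct_b):
    print(f"{da:.6f}  {db:.6f}")
```

Under these assumptions the scale factor a cancels in both terms of the distance and the offset b cancels in the mean difference, so the set of contrasts is the same for both speakers even though every mean and variance has changed. This is only a toy case; the mathematical argument referred to in the abstract is not reproduced here.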