ISCA Archive SLaTE 2007

Are learners myna birds to the averaged distributions of native speakers? - a note of warning from a serious speech engineer -

Nobuaki Minematsu

Current speech recognition technology consists of clearly separated modules: acoustic models, language models, a pronunciation dictionary, and a decoder. CALL systems often use the acoustic matching module to compare a learner's utterance to the templates stored in the system. The acoustic template of a phrase is usually built by collecting utterances of that phrase spoken by native speakers and estimating their averaged distribution. If phoneme-based comparison is required, phoneme-based templates must be prepared, and Hidden Markov Models are often adopted for training them. In this framework, a learner's utterance is acoustically and directly compared to the averaged distributions, and the notorious mismatch problem then arises more or less inevitably. I wonder whether this framework is pedagogically sound enough. No children acquire language by imitating their parents' voices acoustically, and male learners do not have to produce female voices even when a female teacher asks them to repeat after her. What in a learner's utterance should be acoustically matched with what in a teacher's utterance? I consider that current speech technology has no good answer to this question, and this paper proposes a candidate answer by regarding speech as music.
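To make the criticized framework concrete, the following is a minimal sketch of template-based acoustic matching as described above: per-frame native feature vectors are averaged into a (mean, variance) template, and a learner's utterance is scored by its diagonal-Gaussian log-likelihood under that template. All function names and the toy feature representation are illustrative assumptions, not part of the paper; a real CALL system would use MFCC streams and HMM state alignment rather than fixed frame counts.

```python
import math

def build_template(native_utterances):
    # Hypothetical helper: average native utterances frame by frame
    # into per-frame (mean, variance) lists. Assumes all utterances
    # are time-aligned to the same number of frames and dimensions.
    n = len(native_utterances)
    frames = len(native_utterances[0])
    dim = len(native_utterances[0][0])
    means, variances = [], []
    for t in range(frames):
        mu = [sum(u[t][d] for u in native_utterances) / n for d in range(dim)]
        # Floor the variance to avoid division by zero for constant dims.
        var = [max(sum((u[t][d] - mu[d]) ** 2 for u in native_utterances) / n,
                   1e-6) for d in range(dim)]
        means.append(mu)
        variances.append(var)
    return means, variances

def score(learner_utterance, template):
    # Mean per-coefficient log-likelihood of the learner's frames
    # under a diagonal Gaussian built from the averaged template.
    means, variances = template
    ll, count = 0.0, 0
    for t, frame in enumerate(learner_utterance):
        for d, x in enumerate(frame):
            mu, var = means[t][d], variances[t][d]
            ll += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
            count += 1
    return ll / count
```

A learner whose features happen to sit far from the native average, e.g. because of a different vocal tract length, receives a low score under this scheme regardless of pronunciation quality, which is exactly the mismatch problem the abstract points to.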