Automatic estimation of pronunciation proficiency poses a specific difficulty. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances, but envelope patterns are also affected by differences between speakers. To develop a sound and stable method for automatic estimation, the envelope changes caused by linguistic factors and those caused by extra-linguistic factors should be properly separated. To this end, in our previous study [1] we proposed a mathematically guaranteed and linguistically valid speaker-invariant representation of pronunciation, called speech structure. Since then, we have also tested this representation for ASR [2, 3, 4], and through these studies we have learned how to apply speech structures more effectively to various tasks. In this paper, we focus on a proficiency estimation experiment reported in [1] and, using the recently developed techniques for the structures, carry out that experiment again under different conditions. Here, we use a smaller unit of structural analysis, speaker-invariant substructures, together with relative structural distances between a learner and a teacher. Results show a higher correlation between human and machine ratings, as well as far greater robustness to speaker differences than widely used GOP scores.
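To give a concrete flavor of the structural comparison described above, the following is a minimal sketch, not the exact procedure of [1]: it assumes each phoneme-sized speech event is modeled as a Gaussian, builds a structure as the matrix of pairwise Bhattacharyya distances between events, and scores a learner against a teacher by the difference between their structure matrices. The function names, the Gaussian event models, and the Euclidean scoring are illustrative assumptions.

```python
# Illustrative sketch only: a "structure" as the matrix of pairwise
# Bhattacharyya distances between Gaussian models of speech events,
# and a simple learner-vs-teacher comparison of two such structures.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians (mean, covariance)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def structure_matrix(events):
    """events: list of (mean, cov) pairs, one per speech event.
    Returns the full matrix of pairwise distances (the 'structure')."""
    n = len(events)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya(events[i][0], events[i][1],
                              events[j][0], events[j][1])
            D[i, j] = D[j, i] = d
    return D

def structural_distance(learner_events, teacher_events):
    """A simple learner-vs-teacher score: Euclidean distance between
    the upper triangles of the two structure matrices."""
    Dl = structure_matrix(learner_events)
    Dt = structure_matrix(teacher_events)
    iu = np.triu_indices_from(Dl, k=1)
    return np.linalg.norm(Dl[iu] - Dt[iu])
```

Because every event is compared only with every other event, such a structure captures the relative arrangement of speech sounds rather than their absolute spectral positions, which is the intuition behind the claimed robustness to speaker differences.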
[1] N. Minematsu, "Pronunciation assessment based upon the phonological distortions observed in language learners' utterances," Proc. INTERSPEECH, pp. 1669–1672, 2004.
[2] Y. Qiao et al., "Random discriminant structure analysis for continuous Japanese vowel recognition," Proc. ASRU, pp. 576–581, 2007.
[3] S. Asakawa et al., "Multi-stream parameterization for structural speech recognition," Proc. ICASSP, pp. 4097–4100, 2008.
[4] N. Minematsu et al., "Implementation of robust speech recognition by simulating infants' speech perception based on the invariant sound shape embedded in utterances," Proc. SPECOM, 2009.