ISCA Archive AVSP 2010

Detection of specific mispronunciations using audiovisual features

Sébastien Picard, G. Ananthakrishnan, Preben Wik, Olov Engwall, Sherif Abdou

This paper introduces a general approach for binary classification of audiovisual data. The intended application is the detection of specific phonemic mispronunciations using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained by applying a Time Varying Discrete Cosine Transform (TV-DCT) to the audio log-spectrum as well as to the image sequences. The concatenated feature vectors from both modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels, with an average of 58 training instances per vowel. Performance was largely unaffected when the system was tested on data from a speaker who was not included in the training data.

Index Terms: Time Varying-DCT, Genetic Algorithms, MRMR, CAPT
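
The feature extraction step named in the abstract can be illustrated with a minimal sketch. It assumes that the TV-DCT amounts to a two-dimensional DCT taken across frequency and time, truncated to a few low-order coefficients, so that segments of different durations map to vectors of the same length. This reading, the function names dct_basis and tv_dct_features, and the coefficient counts are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def dct_basis(n_coeffs, n_points):
    """Return the first n_coeffs orthonormal DCT-II basis vectors as rows."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_points)[None, :]
    basis = np.cos(np.pi * (n + 0.5) * k / n_points)
    basis *= np.sqrt(2.0 / n_points)
    basis[0] /= np.sqrt(2.0)  # scale the constant vector to unit norm
    return basis


def tv_dct_features(log_spec, n_freq_coeffs=4, n_time_coeffs=3):
    """Project a (n_bins, n_frames) log-spectrum onto low-order DCT bases.

    Assumption: the same truncated 2-D DCT is applied along frequency and
    time. The output length depends only on the coefficient counts, not on
    the number of frames in the segment.
    """
    n_bins, n_frames = log_spec.shape
    freq_basis = dct_basis(n_freq_coeffs, n_bins)
    time_basis = dct_basis(n_time_coeffs, n_frames)
    coeffs = freq_basis @ log_spec @ time_basis.T
    return coeffs.ravel()


# Two hypothetical segments with different durations (random stand-in data).
rng = np.random.default_rng(0)
short_segment = rng.normal(size=(32, 10))
long_segment = rng.normal(size=(32, 25))
features_short = tv_dct_features(short_segment)
features_long = tv_dct_features(long_segment)
```

Under the same assumption, the transform could be applied to each pixel trajectory or image row of the video frames, and the audio and visual vectors joined with np.concatenate before feature selection and SVM training.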