ISCA Archive ICSLP 2002

Optimal selection of speech data for automatic speech recognition systems

Arkadiusz Nagórski, Lou Boves, Herman Steeneken

This paper presents a method for selecting a limited set of maximally information-rich speech data from a database for optimal training and diagnostic testing of Automatic Speech Recognition (ASR) systems. The method uses Principal Component Analysis (PCA) to map the variance of the speech material in a database into a low-dimensional space, followed by clustering and a selection technique. A very straightforward implementation of this procedure automatically detects at least two criteria for classifying speakers of standard Dutch, viz. gender and the way in which the /r/ is produced. To verify the power of the technique to improve ASR, data sets of equal size, selected either with this method or at random, were used to train a recognition system on Dutch connected digits. The results show an improvement in recognition performance when the optimally selected data sets were used, especially in conditions where the sub-corpora used for training were relatively small.
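The three stages named in the abstract (PCA projection, clustering, selection) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or selection rule the authors used, so everything specific here is an assumption made for illustration: k-means as the clustering step, a round-robin pick of the items nearest each cluster centroid as the selection step, synthetic Gaussian feature vectors in place of real speech features, and the hypothetical function names `pca_project`, `kmeans`, and `select_subset`.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def _assign(Z, centroids):
    """Label each row of Z with the index of its nearest centroid."""
    dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)


def kmeans(Z, k, n_iter=50, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = _assign(Z, centroids)
        # Keep the old centroid if a cluster happens to become empty.
        updated = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return _assign(Z, centroids), centroids


def select_subset(Z, labels, centroids, n_select):
    """Assumed selection rule: take items round-robin across clusters,
    nearest to each centroid first, so every cluster is represented."""
    queues = []
    for j in range(len(centroids)):
        members = np.flatnonzero(labels == j)
        dist = np.linalg.norm(Z[members] - centroids[j], axis=1)
        queues.append(list(members[np.argsort(dist)]))
    chosen = []
    while len(chosen) < n_select and any(queues):
        for queue in queues:
            if queue and len(chosen) < n_select:
                chosen.append(int(queue.pop(0)))
    return sorted(chosen)


# Synthetic stand-in for per-speaker feature vectors (not real speech data).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(40, 10))
group_b = rng.normal(loc=5.0, scale=1.0, size=(40, 10))
X = np.vstack([group_a, group_b])

Z = pca_project(X, n_components=2)          # low-dimensional map
labels, centroids = kmeans(Z, k=2)          # group similar items
subset = select_subset(Z, labels, centroids, n_select=10)
print(subset)
```

The round-robin rule is only one way to spread a fixed budget over the clusters; a rule that samples each cluster in proportion to its size, or that favors items far from the centroid, would fit the same outline.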