ISCA Archive Interspeech 2012

Feature extraction based on hearing system signal processing for robust large vocabulary speech recognition

Qi Peter Li, Xie Sun

A new auditory-based feature extraction algorithm for robust speech recognition is developed by modeling the signal processing functions of the hearing system. Typically, the performance of acoustic models trained on clean speech drops significantly when tested on noisy speech; thus, recognition systems that perform well in the lab often fail to work robustly in the field. To address this problem, we have developed features based on a set of modules that simulate the signal processing functions of the cochlea, such as the auditory transform, hair cells, and equal-loudness functions. The features are then applied to the Wall Street Journal task. To simulate field conditions, the training data are near-clean speech, while white and babble noise are added to the testing data. As our experiments show, without added noise the proposed features perform similarly to MFCC, RASTA-PLP, and PLP features. When noise is added and the system is tested at different SNR levels, the proposed auditory features significantly outperform the others. For example, at the 10 dB SNR level, which is often encountered in real applications, the proposed auditory features achieve 65.53% accuracy, while the best of the other features, RASTA-PLP, achieves 36.33%. The proposed features thus provide an absolute gain of 29.20% in recognition accuracy. Overall, our experiments show that the proposed auditory features are strongly robust in mismatched and noisy conditions in speech recognition.
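To make the pipeline described above concrete, the following is a minimal, illustrative sketch of a cochlea-inspired feature extractor, not the authors' exact algorithm: a log-spaced filterbank stands in for the auditory transform, a cube-root compression stands in for the hair-cell model, and the PLP-style equal-loudness curve weights the bands before log compression and a DCT produce cepstral coefficients. All parameter values (band count, frame sizes, compression exponent) are assumptions for illustration.

```python
import numpy as np

def auditory_features(x, sr=16000, n_bands=32, frame=400, hop=160, n_ceps=13):
    """Illustrative cochlea-inspired features (sketch, not the paper's method):
    filterbank -> hair-cell compression -> equal-loudness -> log -> DCT."""
    # Frame and window the signal.
    n_frames = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame)
    # Power spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    # Triangular filterbank on a log-frequency grid
    # (a simple stand-in for the auditory transform).
    centers = np.geomspace(100.0, 0.9 * sr / 2, n_bands + 2)
    fbank = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        lo, c, hi = centers[b], centers[b + 1], centers[b + 2]
        fbank[b] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                      (hi - freqs) / (hi - c)), 0.0, None)
    band_energy = spec @ fbank.T
    # Hair-cell nonlinearity: cube-root amplitude compression
    # (one common choice; the paper may use a different model).
    compressed = np.cbrt(band_energy)
    # Equal-loudness weighting (PLP-style approximation emphasizing 1-4 kHz).
    f2 = centers[1:-1] ** 2
    eql = (f2 / (f2 + 1.6e5)) ** 2 * ((f2 + 1.44e6) / (f2 + 9.61e6))
    weighted = compressed * eql
    # Log compression, then DCT-II to decorrelate; keep n_ceps coefficients.
    logspec = np.log(weighted + 1e-10)
    n = np.arange(n_bands)
    dct = np.cos(np.pi / n_bands * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return logspec @ dct.T
```

A one-second 440 Hz tone at 16 kHz, for instance, yields a (98, 13) feature matrix with these settings (25 ms frames, 10 ms hop).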

Index Terms: Speech feature extraction, auditory-based feature, robust speech recognition, cochlea, auditory transform