ISCA Archive Eurospeech 2003

Integrated pitch and MFCC extraction for speech reconstruction and speech recognition applications

Xu Shao, Ben P. Milner, Stephen J. Cox

This paper proposes an integrated speech front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a set of frequency bands by an auditory model. The outputs of this model are then used to extract both robust pitch estimates and MFCC vectors. Initial tests used a 128-channel auditory model, but results show that this can be reduced significantly, to between 23 and 32 channels. A detailed analysis of pitch classification accuracy and RMS pitch error shows the system to be more robust than both comb-function and LPC-based pitch extraction. Speech recognition results show that the auditory-based cepstral coefficients give very similar performance to conventional MFCCs. Spectrograms and informal listening tests also reveal that speech reconstructed from the auditory-based cepstral coefficients and pitch is of similar quality to that reconstructed from conventional MFCCs and pitch.
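As a rough illustration of two quantities named in the abstract, the following sketch shows how per-channel energies from a filterbank might be turned into cepstral coefficients (log compression followed by a DCT), and how an RMS pitch error could be computed. It is not the authors' implementation: the function names, the choice of 13 coefficients, the unnormalized DCT-II, the energy floor, and the convention that a pitch value of 0 marks an unvoiced frame are all assumptions made for the example.

```python
import numpy as np


def channel_energies_to_cepstra(energies, num_ceps=13):
    """Map per-channel energies (frames x channels) to cepstral coefficients.

    Hypothetical helper: applies a log, then an unnormalized DCT-II across
    the channel axis, which is the usual last stage of MFCC extraction.
    """
    energies = np.asarray(energies, dtype=float)
    num_channels = energies.shape[-1]
    # Floor the energies so the log is defined for silent channels (assumed floor).
    log_energies = np.log(np.maximum(energies, 1e-10))
    n = np.arange(num_channels)
    k = np.arange(num_ceps)[:, None]
    # DCT-II basis: cos(pi * k * (2n + 1) / (2N)), one row per coefficient.
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * num_channels))
    return log_energies @ basis.T


def rms_pitch_error(reference_hz, estimate_hz):
    """RMS error in Hz over frames where both tracks are voiced.

    Assumption: a value of 0 denotes an unvoiced frame in either track.
    """
    ref = np.asarray(reference_hz, dtype=float)
    est = np.asarray(estimate_hz, dtype=float)
    voiced = (ref > 0) & (est > 0)
    if not voiced.any():
        return float("nan")
    return float(np.sqrt(np.mean((est[voiced] - ref[voiced]) ** 2)))


if __name__ == "__main__":
    # Two frames of synthetic energies for a 23-channel filterbank.
    rng = np.random.default_rng(0)
    frames = rng.uniform(0.1, 1.0, size=(2, 23))
    print(channel_energies_to_cepstra(frames).shape)
    print(rms_pitch_error([100.0, 0.0, 200.0], [110.0, 120.0, 190.0]))
```

Because the DCT operates only on the channel axis, the same function applies unchanged whether the filterbank has 128, 32, or 23 channels; only the input width differs.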