Introduction. The main motivation for this project was to study the performance of non-linear speech analysis methods in automatic speech recognition. Specifically, we selected the wavelet transform as a promising non-linear tool for signal analysis, one that has already been applied successfully in many tasks, such as image recognition and compression, where it led to standards such as JPEG2000. The plan was to perform a comparative analysis between the standard mel-cepstral and wavelet-based feature sets and to evaluate the baseline speech recognition rates of the two parameterization methods.
We start with a brief description of the Fourier and wavelet transforms from the perspective of joint time-frequency analysis, focusing on the localization properties of the two transforms. The ability of a transform to properly capture short-time events is determined by the localization properties of its basis functions and is one of the prerequisites for successful application in speech processing. The Fourier transform offers constant time-frequency resolution, whereas the wavelet transform provides better frequency resolution at low frequencies and better time localization of transient phenomena. This closely resembles the first stage of human auditory perception, the excitation of the basilar membrane, in that the wavelet transform introduces a roughly logarithmic frequency sensitivity.

We carried out comparative within-language and cross-language experiments on the Slovenian and English SpeechDat2 databases using the standard mel-cepstral and the wavelet-based feature sets. The tool used for automatic speech recognition was the reference recogniser built around the HTK toolkit. This enabled us to conduct controlled experiments on six different subsets of the SpeechDat2 vocabularies (yes/no sentences, city names, phonetically rich words, digits, etc.).
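To make the localization trade-off concrete, the following sketch implements a minimal multi-level Haar wavelet decomposition in pure Python. The Haar wavelet, function names, and test signal are illustrative choices, not the parameterization used in the experiments; the point is only to show how each decomposition level halves the time resolution while splitting off an octave band, which produces the roughly logarithmic frequency tiling described above.

```python
import math

def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficients.

    The approximation is a half-rate low-pass view of the signal (finer
    frequency resolution at low frequencies); the detail coefficients
    localize transient, high-frequency events in time.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_dwt(signal, levels):
    """Multi-level decomposition: repeatedly split the approximation band.

    Each level covers one octave, so the frequency axis is tiled roughly
    logarithmically, loosely mirroring basilar-membrane sensitivity.
    """
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        coeffs.append(detail)   # detail band for this octave
    coeffs.append(approx)       # final coarse approximation
    return coeffs

# A short signal with a single transient at sample 4 (illustrative only):
x = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
bands = haar_dwt(x, 3)
# The transient appears in the first-level detail coefficients,
# localized in time to the sample pair (4, 5).
```

Because the Haar filters are orthonormal, the decomposition preserves signal energy across levels; a short-time Fourier transform of the same signal would instead spread the transient across all frequency bins of the analysis window with a fixed time resolution.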