Speech recognition systems are often designed under controlled working conditions (usually known as "laboratory" conditions) that in many cases do not correspond to real recognition environments. In practice, recognition rates obtained under laboratory conditions are severely degraded by unexpected factors, which can be classified into two groups: a) factors extrinsic to speech, such as environmental noise, reverberation, channel distortions (due to acquisition, processing and/or transmission devices) and bandwidth reduction, and b) factors intrinsic to speech (unexpected speech signals entering the recognition system), such as spontaneous speech, the user's failure to adapt to the syntactic/grammatical constraints of the system, stress and others. From now on, we will use the term "noise" in a general sense for all those extrinsic or intrinsic perturbations that are added to the clean speech, producing noisy speech as a result. In this paper, both single-channel and multi-channel approaches to speech enhancement [1,2] for a word spotting system are presented. The single-channel approach is intended for cases in which no reference of the noise source is available; speech enhancement can then be accomplished with techniques such as spectral subtraction or classical filtering. The multi-channel approach needs at least one reference correlated with the noise source, and in this case techniques such as adaptive filtering are usually used [3]. We compare these two approaches in a word spotting system in which several Spanish keywords are embedded in spontaneous utterances.
Keywords: Speech Recognition, Word Spotting, Spectral Subtraction, Adaptive Filtering, Noise Cancelling, Speech Prediction
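To illustrate the multi-channel idea (a noise reference correlated with the noise in the primary channel), the following is a minimal sketch of adaptive noise cancellation using the LMS algorithm. This is an assumption for illustration only: the abstract does not state which adaptive algorithm is used, and the sinusoidal "speech" stand-in, the two-tap noise path (0.8, 0.4), the tap count and the step size `mu` are all hypothetical. The adaptive filter estimates the noise in the primary channel from the reference, and the error signal `e[n] = d[n] - y[n]` serves as the enhanced-speech estimate.

```python
import math
import random


def lms_noise_canceller(primary, reference, num_taps=8, mu=0.005):
    """LMS adaptive noise cancellation (illustrative sketch).

    primary:   noisy speech d[n] = s[n] + v[n]
    reference: noise reference x[n], correlated with v[n] but not with s[n]
    Returns (enhanced, weights), where enhanced[n] = e[n] = d[n] - y[n].
    """
    weights = [0.0] * num_taps
    buffer = [0.0] * num_taps  # most recent reference samples, newest first
    enhanced = []
    for d, x in zip(primary, reference):
        buffer = [x] + buffer[:-1]
        y = sum(w * b for w, b in zip(weights, buffer))  # noise estimate
        e = d - y  # enhanced-speech estimate
        # LMS update: w <- w + 2 * mu * e * x_buffer
        weights = [w + 2 * mu * e * b for w, b in zip(weights, buffer)]
        enhanced.append(e)
    return enhanced, weights


def simulate(n_samples=10000, seed=0):
    """Synthetic two-channel test; every signal here is a hypothetical stand-in."""
    rng = random.Random(seed)
    speech = [0.5 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
    reference = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Hypothetical acoustic path from the noise source to the primary microphone.
    noise = [0.8 * reference[n] + 0.4 * (reference[n - 1] if n > 0 else 0.0)
             for n in range(n_samples)]
    primary = [s + v for s, v in zip(speech, noise)]
    enhanced, weights = lms_noise_canceller(primary, reference)
    return speech, noise, enhanced, weights


if __name__ == "__main__":
    speech, noise, enhanced, weights = simulate()
    n_tail = 2000  # evaluate after the filter has converged
    noise_power = sum(v * v for v in noise[-n_tail:]) / n_tail
    residual_power = sum((e - s) ** 2 for e, s in
                         zip(enhanced[-n_tail:], speech[-n_tail:])) / n_tail
    print(f"noise power before: {noise_power:.3f}, residual after: {residual_power:.4f}")
    print("first two taps:", [round(w, 2) for w in weights[:2]])
```

With a white reference, the first two taps converge near the hypothetical path coefficients and the residual noise power drops well below the input noise power. A single-channel method such as spectral subtraction would instead have to estimate the noise spectrum from the noisy signal alone, since no reference channel is available.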