Research on human speech perception indicates that the temporal envelopes of the speech signal are the main carriers of linguistic information. In automatic speech recognition (ASR), the long-term temporal envelopes of subband signals are replaced with short-time spectral envelopes to characterize the linguistic information in the speech signal. Past studies have repeatedly shown that temporal fluctuations of the spectral trajectories outside the range [1, 12] Hz can be harmful to speech recognition. This study investigates the significance of temporal modulations for phoneme identification in a machine system. Both long-term temporal envelopes and short-term spectral envelopes are used as front-end features. Results indicate that, with long-term analysis, temporal modulations above 16 Hz contribute significantly to phoneme identification in both clean and noisy conditions, whereas with short-term analysis, modulations above 16 Hz are not robust.
Index Terms: multistream, temporal modulations, phone recognition