ISCA Archive Eurospeech 1995

Transition-based feature extraction within frame-based recognition

Zhihong Hu, Etienne Barnard, Ronald A. Cole

Current frame-based speech recognition systems sample speech at a fixed set of locations relative to each frame, which complicates modeling the temporal dynamics of speech. This work shows that explicitly using transitional information when extracting features gives a better model of the acoustic-phonetic structure and yields higher word-level recognition performance. In the proposed approach, the features represent local transitional information: a constant number of features is selected at each time frame, but the features are sampled near the areas of greatest spectral change within a relatively long window. Explicitly modeling transitions in this way also captures local contextual information. With this technique, the word-level error rate decreased by up to 30% on the databases we tested.
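
The sampling scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the spectral-change measure, the window length, or the number of sampling points, so everything concrete here is an assumption made for illustration: the Euclidean distance between adjacent spectral frames as the change measure, the hypothetical function names `spectral_change` and `transition_features`, and the default values of `half_window` and `n_points`. It is not the authors' implementation.

```python
import numpy as np


def spectral_change(spec):
    """Frame-to-frame spectral change for spec of shape (n_frames, n_bins).

    Assumption: change is measured as the Euclidean distance between
    adjacent frames. change[i] compares frame i with frame i - 1, and
    change[0] is set to 0 because frame 0 has no predecessor.
    """
    diff = np.diff(spec, axis=0)
    return np.concatenate([[0.0], np.linalg.norm(diff, axis=1)])


def transition_features(spec, t, half_window=10, n_points=3):
    """Sample a constant number of spectral vectors around frame t.

    Within the window [t - half_window, t + half_window] (clipped to the
    utterance), pick the n_points frames with the largest spectral change
    and concatenate their spectra in time order. The window must contain
    at least n_points frames for the feature count to stay constant.
    """
    n_frames = spec.shape[0]
    lo = max(0, t - half_window)
    hi = min(n_frames, t + half_window + 1)
    if hi - lo < n_points:
        raise ValueError("window holds fewer frames than n_points")

    change = spectral_change(spec)[lo:hi]
    # Largest changes first; a stable sort keeps earlier frames on ties.
    top = np.argsort(-change, kind="stable")[:n_points]
    picks = np.sort(top + lo)  # back to absolute frame indices, time order
    return picks, spec[picks].reshape(-1)
```

Each call returns `n_points * n_bins` values regardless of where the transitions fall, so the output could feed a frame-based classifier that expects a fixed-length input, while the sampling locations follow the signal instead of a fixed offset pattern.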