ISCA Archive RSR 1997

Robust features and environmental compensation: a few comments

Nelson Morgan

This is a brief note commenting on a few points related to two excellent keynote papers, by Greenberg [3] and by Stern et al. [5]. In a sense, Stern's paper describes the current technology: in particular, approaches to adjusting ASR systems built on phone-based or sub-phone-based HMMs in order to improve performance in the presence of noise and linear channel effects. Greenberg's paper, on the other hand, gives a direction for the future, focusing on aspects of spoken language that he believes our current systems do not incorporate.

At first glance, the papers might seem almost unrelated. Greenberg's paper focuses on characteristics of conversational speech that indicate limitations of current ASR technology. He suggests a wide-ranging, multi-tiered strategy as the fundamental solution to the poor performance that machine recognizers exhibit under unexpected testing conditions. Stern's paper describes the approaches to noise and channel robustness developed at CMU and elsewhere over the last decade, and as such is a good review of what can be done with the techniques that Greenberg criticizes. The papers are not really contradictory; faced with the requirement of improving recognition performance, a good engineer will both consider new directions and exploit the existing ones as fully as possible.

The CMU group has placed considerable emphasis on exploiting a range of solutions to linear disturbances, including both model-based and feature-based compensations. When information about the nature of the disturbance (or about the "clean" signal) is available, methods pioneered by the CMU group show the extent to which the problem can be reduced. Other methods show how iterative approaches such as expectation-maximization (EM) can be used to improve the probability estimates despite interfering signals or convolutional error.
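
To make the idea of a feature-based compensation for a linear channel concrete, here is a minimal sketch. It is not taken from either keynote paper. It assumes log-spectral (or cepstral) features, in which a fixed linear channel appears as a constant offset added to every frame, and it uses simple mean subtraction over time as the compensation. The function name, the toy feature values, and the channel offsets are all hypothetical.

```python
def mean_subtract(frames):
    """Subtract the per-coefficient mean over time from a list of feature frames.

    Assumption: frames are log-spectral or cepstral vectors, so a fixed
    linear channel contributes the same additive offset to every frame.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]


# Toy "clean" feature frames (made-up values, for illustration only).
clean = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5], [2.0, 0.0, 1.0]]

# Hypothetical channel: a constant offset per coefficient in the log domain.
channel = [0.7, -0.3, 1.2]
filtered = [[c + h for c, h in zip(frame, channel)] for frame in clean]

normalized_clean = mean_subtract(clean)
normalized_filtered = mean_subtract(filtered)

# Largest difference between the two normalized versions.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(normalized_clean, normalized_filtered)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

Under these assumptions the normalized features are the same, up to floating-point rounding, whether or not the channel was applied. Additive noise does not reduce to a constant offset in this domain, which is one reason more elaborate model-based and iterative methods are needed.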

We do not yet know what engineering techniques will be required to implement a system incorporating all the levels that Greenberg suggests. When we do, however, a real implementation is likely to be statistical, and as such will still require mathematical characterizations such as the ones Stern presents (though perhaps not these same ones).