Current-generation speech recognition systems seek to identify words via analysis of their underlying phonological constituents. Although this stratagem works well for carefully enunciated speech emanating from a pristine acoustic environment, it has fared less well for speech produced under more realistic conditions, such as (1) moderate to high levels of background noise, (2) moderately reverberant acoustic environments, and (3) spontaneous, informal conversation. Under such "real-world" conditions, the acoustic properties of speech make it difficult to partition the acoustic stream into readily definable phonological units, thus rendering the process of word recognition highly vulnerable to departures from "canonical" patterns. Analysis of informal, spontaneous speech indicates that the stability of linguistic representation is more likely to reside at the syllabic and phrasal levels than at the phonological level. In consequence, attempts to represent words merely as sequences of phones, and to derive meaning from simple chains of lexical entities, are unlikely to yield high levels of recognition performance under such real-world conditions.
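To make this failure mode concrete, the sketch below is illustrative only: the phone labels, the lexicon, and the reduced pronunciation are assumptions for exposition, not data from this work. It shows how a lexicon keyed on canonical phone chains rejects a common conversational reduction outright:

```python
# Illustrative sketch: exact matching against canonical phone chains,
# the brittle strategy argued against above. All labels are hypothetical.
CANONICAL_LEXICON = {
    ("p", "r", "aa", "b", "ax", "b", "l", "iy"): "probably",
}

def lookup(phones):
    """Recognize a word only if its observed phone chain matches exactly."""
    return CANONICAL_LEXICON.get(tuple(phones), "<unknown>")

# Careful, canonical pronunciation: recognized.
print(lookup(["p", "r", "aa", "b", "ax", "b", "l", "iy"]))  # probably

# Reduced conversational form: rejected outright, even though much of
# the syllabic structure survives.
print(lookup(["p", "r", "aa", "l", "iy"]))                  # <unknown>
```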
A multi-tiered representation of speech is proposed, one in which only partial information from each of many levels of linguistic abstraction is required to identify lexical and phrasal elements reliably. Such tiers of linguistic abstraction are unified through a hierarchically organized process of temporal binding and are, in principle, highly tolerant of the sorts of "distortions" imposed on speech in the real world.
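One way such temporal binding might be realized is sketched below. This is a minimal illustration under stated assumptions: the tier names, time intervals, confidence scores, and the two-tier acceptance criterion are all invented for exposition; no concrete algorithm is specified in the text.

```python
# Sketch: cues from several tiers of abstraction are "bound" when their
# time intervals overlap; a lexical hypothesis is accepted from partial,
# multi-tier evidence rather than a complete phone chain.
from dataclasses import dataclass

@dataclass
class Cue:
    tier: str       # e.g. "phone", "syllable", "phrase" (assumed tier names)
    label: str
    t_start: float  # seconds
    t_end: float
    score: float    # confidence in [0, 1]

def overlaps(a: Cue, b: Cue) -> bool:
    """Temporal binding criterion: two cues bind if their intervals overlap."""
    return a.t_start < b.t_end and b.t_start < a.t_end

def word_evidence(word_cues, require_tiers=2):
    """Score a lexical hypothesis from partial information: accept it when
    temporally bound cues span at least `require_tiers` distinct tiers."""
    bound = [c for c in word_cues
             if any(overlaps(c, d) for d in word_cues if d is not c)]
    tiers = {c.tier for c in bound}
    if len(tiers) >= require_tiers:
        return sum(c.score for c in bound) / len(bound)
    return 0.0

# Fragmentary phone evidence alone would fail, but syllabic and phrasal
# cues that overlap it in time push the hypothesis over the threshold.
cues = [
    Cue("syllable", "praa",           0.10, 0.28, 0.9),
    Cue("phone",    "l",              0.27, 0.33, 0.4),
    Cue("phrase",   "stress-initial", 0.08, 0.40, 0.7),
]
print(word_evidence(cues))  # nonzero: partial, multi-tier evidence suffices
```

The design point the sketch is meant to convey is tolerance: because no single tier must be complete, a missing or distorted phone (as in the reduced pronunciation above) degrades the evidence gracefully instead of vetoing the word.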