We report our recent development of a general feature-based statistical framework for automatic speech recognition. The feature-based atomic units of speech are designed to provide a parsimonious scheme for sharing speech data across words and across phones, and a unified way to account for context-dependent behavior in speech. We describe in detail the design considerations for the recognizer and the key aspects of the design process. This process, which we call lexicon "compilation", consists of three elements: 1) establishing a feature-specification system; 2) constructing a probabilistic and fractional temporal overlap pattern across the features; and 3) mapping the feature-overlap pattern to a state-transition graph. A standard phonetic classification task from the TIMIT database serves as a testbed for evaluating the recognizer. The experimental results show error-rate reductions ranging from 15% to 27% compared with a conventional context-independent phonetic classifier.
Keywords: speech recognition, features, non-linear phonology, hidden Markov model, articulatory gestures