We present methods of detector design in the Automatic Speech Attribute Transcription project. This paper details the results of a student-led, cross-site collaboration between Georgia Institute of Technology, The Ohio State University and Rutgers University. The work reported in this paper describes and evaluates the detection-based ASR paradigm and discusses phonetic attribute classes, methods of detecting framewise phonetic attributes and methods of combining attribute detectors for ASR.
We use Multi-Layer Perceptrons, Hidden Markov Models and Support Vector Machines to compute confidence scores for several prescribed sets of phonetic attribute classes. We use Conditional Random Fields (CRFs) and knowledge-based rescoring of phone lattices to combine framewise detection scores for continuous phone recognition on the TIMIT database. With CRFs, we achieve a phone accuracy of 70.63%, outperforming the baseline and enhanced HMM systems, by incorporating all of the attribute detectors discussed in the paper.