ISCA Archive Eurospeech 1993

Recognition confidence measures for spontaneous spoken dialog

Sheryl R. Young, Wayne Ward

This paper reports on a new technique for evaluating confidence in word strings produced by a speech recognition system used for processing limited-domain spontaneous dialogs. The technique can also be used to produce confidence metrics for non-spontaneous speech. Spontaneous speech is especially difficult because unknown words, verbal noise, and speech repairs and edits are common phenomena that complicate the basic speaker-independent continuous speech recognition process. Our goal is to produce a confidence metric for spontaneously generated word strings that combines acoustic and higher-level knowledge sources through the use of Bayesian Updating. This confidence measure takes into account each knowledge source's reliability and its ability to differentially discriminate misrecognitions. This work is part of our larger project on automatically detecting and acquiring the meaning of out-of-vocabulary words. In estimating acoustic confidence, we first normalize the word score produced by the recognizer. This is done by subtracting the log-probability score for an all-phone recognition from the log-probability word score and normalizing for length. The all-phone score is generated by running the speech recognizer on the utterance, allowing any triphone to follow any other triphone, with a trigram probability for triphone sequences. A triphone is a context-dependent phone model. Trigrams of the triphone sequences are computed from a large corpus of English language text. We use Bayesian Updating to turn the normalized word score into a confidence measure. For this, words are grouped into classes using alternate grouping methods. For each word class, we estimate how often a word in that class, seen with a particular score, was correctly recognized. This estimate is made by running the recognition system on a training set of data. This gives us a direct measure of the confidence with which we can reject or accept a word based on acoustic measures.
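
The acoustic confidence steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how length normalization is done or how scores are discretized, so dividing by the frame count and using fixed-width score bins are assumptions made here for illustration. The function names, the `bin_width` parameter, and the word class labels are likewise hypothetical. The per-class table simply records the fraction of training words that were correctly recognized at each score level, which is the quantity the abstract says is estimated.

```python
import math
from collections import defaultdict


def normalized_score(word_logprob, allphone_logprob, n_frames):
    """Word log-probability minus all-phone log-probability, per frame.

    Assumption: "normalizing for length" is taken to mean dividing by
    the number of frames the word spans.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return (word_logprob - allphone_logprob) / n_frames


def score_bin(score, bin_width=0.5):
    """Map a normalized score to an integer bin index (assumed discretization)."""
    return math.floor(score / bin_width)


def estimate_confidence_table(training, bin_width=0.5):
    """Estimate the fraction of correct recognitions per (word class, score bin).

    `training` is an iterable of (word_class, normalized_score, was_correct)
    tuples obtained by running a recognizer on labeled training data.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for word_class, score, was_correct in training:
        key = (word_class, score_bin(score, bin_width))
        counts[key][0] += int(was_correct)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def confidence(table, word_class, score, bin_width=0.5, default=None):
    """Look up the estimated confidence for a word; `default` if unseen."""
    return table.get((word_class, score_bin(score, bin_width)), default)


# Toy training data with hypothetical class labels.
training = [
    ("function", -0.2, True),
    ("function", -0.3, True),
    ("function", -0.4, False),
    ("content", -1.2, False),
]
table = estimate_confidence_table(training)
```

With such a table, a word could be accepted or rejected by comparing `confidence(table, word_class, score)` against a threshold chosen for the application. Combining this acoustic estimate with higher-level knowledge sources, as the abstract describes, is outside the scope of this sketch.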