A common cause of errors in spoken language systems is the presence of out-of-vocabulary (OOV) words in the input. Named entities (people, places, organizations, etc.) are a particularly important class of OOVs. In this paper we focus on detecting OOV named entities (NEs) for two-way English/Iraqi speech-to-speech translation. Our approach builds on a Maximum Entropy (MaxEnt) classifier trained on a suite of contextual features. These features include n-gram context, part-of-speech tags (both supervised and unsupervised), and a novel word posterior feature computed from the trajectory of the word posteriors within the utterance. Our experimental results show that fusion (both early and late) of these novel word posterior features with the rest of the contextual features significantly improves detection accuracy for OOV NEs. However, we also observe that the same features that perform well on OOV NEs can hurt the detection of in-vocabulary NEs. Therefore, the choice of features should be based on the expected occurrence of OOV NEs.
Index Terms: named entity detection, ASR confidence estimation, conversational speech, speech-to-speech translation