ISCA Archive SLAM 2014

Proper name retrieval from diachronic documents for automatic speech transcription using lexical and temporal context

Irina Illina, Dominique Fohr, Georges Linarès

Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary coverage of a speech transcription system by automatically retrieving new proper names from contemporary diachronic text documents. The idea is to use in-vocabulary proper names as anchors to collect new, linked proper names from the diachronic corpus. Our assumption is that time is an important feature for capturing name-to-context dependencies, which was confirmed by temporal mismatch experiments. We studied a method based on mutual information and proposed a new method based on the cosine similarity measure that dynamically augments the vocabulary of the automatic speech recognition system. Recognition results show a significant reduction of the word error rate when the augmented vocabulary is used for broadcast news transcription.
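As an illustration only, the anchor-based retrieval idea can be sketched as ranking candidate proper names by the cosine similarity between their lexical contexts and the context of an in-vocabulary anchor name. This is a minimal sketch under stated assumptions, not the formulation used in the paper: contexts are taken to be simple document-level co-occurrence counts, and the function names (`context_vector`, `cosine`, `rank_candidates`) are hypothetical.

```python
import math
from collections import Counter


def context_vector(name, documents):
    """Count the words that co-occur with `name` in the same document.

    Assumption: each document is a list of tokens, and the whole document
    serves as the lexical context of the name.
    """
    counts = Counter()
    for doc in documents:
        if name in doc:
            counts.update(word for word in doc if word != name)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(anchor, candidates, documents):
    """Sort candidate names by context similarity to the anchor, best first."""
    anchor_vec = context_vector(anchor, documents)
    scores = {
        cand: cosine(anchor_vec, context_vector(cand, documents))
        for cand in candidates
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In such a sketch, the top-ranked candidates would be the names added to the recognizer vocabulary; restricting `documents` to texts from the same time period as the audio would reflect the temporal assumption stated above.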

Index Terms: speech recognition, out-of-vocabulary words, proper names, vocabulary augmentation