ISCA Archive Interspeech 2005
ISCA Archive Interspeech 2005

Diachronic vocabulary adaptation for broadcast news transcription

Alexandre Allauzen, Jean-Luc Gauvain

This article investigates the use of Internet news sources to automatically adapt the vocabulary of a French and an English broadcast news transcription system. A specific method is developed to gather training, development and test corpora from selected websites, normalizing them for further use. A vectorial vocabulary adaptation algorithm is described which interpolates word frequencies estimated on adaptation corpora to directly maximize lexical coverage on a development corpus. To test the generality of this approach, experiments were carried out simultaneously in French and in English (UK) on a daily basis for the month May 2004. In both languages, the OOV rate is reduced by more than a half.