ISCA Archive Eurospeech 2003

Training data optimization for language model adaptation

Xiaoshan Fang, Jianfeng Gao, Jianfeng Li, Huanye Sheng

Language model (LM) adaptation is a necessary step when an LM is applied to speech recognition. Because the available in-domain (task-specific) data set is usually too small for LM training, the task of LM adaptation is to use out-domain data to improve the performance of the in-domain model. LM adaptation faces two problems. One is the poor quality of the out-domain training data. The other is the mismatch between the n-gram distribution of the out-domain data set and that of the in-domain data set. This paper presents two methods, filtering and distribution adaptation, that address these two problems respectively. First, a bootstrapping method is presented that filters a suitable portion for our task out of two large out-domain data sets of variable quality. Then a new algorithm is proposed that adjusts the n-gram distribution of the two data sets to that of a small, task-specific data set. We also take care to prevent over-fitting during adaptation. All resulting models are evaluated on a realistic application, email dictation. Experiments show that each method improves performance, and that the combined method achieves a perplexity reduction of 24% to 80%.
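To make the filtering idea concrete, the following is a minimal sketch of a bootstrapped, perplexity-based filter. It is an illustration under stated assumptions, not the method of the paper: the abstract does not give the selection criterion, so the add-one-smoothed unigram model, the perplexity threshold, the number of rounds, and the names `train_unigram`, `perplexity`, and `bootstrap_filter` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word occurrences; returns (counts, total token count)."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def perplexity(sentence, model, vocab_size):
    """Per-word perplexity under an add-one-smoothed unigram model (assumed)."""
    counts, total = model
    words = sentence.split()
    if not words:
        return float("inf")
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / len(words))


def bootstrap_filter(seed, pool, threshold, rounds=2):
    """Select pool sentences that look in-domain (hypothetical procedure).

    Each round retrains the model on the seed plus the sentences selected
    in the previous round, then re-scores the whole pool.
    """
    vocab_size = len({w for s in seed + pool for w in s.split()})
    selected = []
    for _ in range(rounds):
        model = train_unigram(seed + selected)
        selected = [
            s for s in pool if perplexity(s, model, vocab_size) < threshold
        ]
    return selected


# Toy data: a small in-domain seed and a mixed-quality out-domain pool.
seed = ["please send the report by email", "reply to the email today"]
pool = ["send the email today", "zebra quantum flux capacitor"]
selected = bootstrap_filter(seed, pool, threshold=15.0)
```

On this toy data only the email-like sentence falls below the threshold. A real system would use a higher-order n-gram model with proper smoothing and would tune the threshold on held-out in-domain data.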