ISCA Archive ISCSLP 2000

Lexicon Optimization for Chinese Language Modeling

Jun Zhao, Jianfeng Gao, Eric Chang, Mingjing Li

In this paper, we present an approach to lexicon optimization for Chinese language modeling. The method is an iterative procedure consisting of two phases: lexicon generation and lexicon pruning. In the first phase, we extract appropriate new words from a very large training corpus using statistical approaches. In the second phase, we prune the lexicon to fit a preset memory limit using a perplexity minimization criterion. Experimental results show a character perplexity reduction of up to 6% relative to the baseline lexicon.
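
As a rough illustration of the pruning phase, the following minimal sketch removes words from a lexicon one at a time, each time dropping the word whose removal yields the lowest character perplexity on held-out text. It is not the authors' implementation. Everything here is an assumption made for illustration: the function names (`segment`, `char_perplexity`, `prune`), the greedy forward maximum-matching segmenter, the add-one-smoothed unigram model, and the use of lexicon size as a stand-in for the memory limit. Perplexity is normalized by the number of characters rather than words, so that lexicons producing different segmentations remain comparable.

```python
import math
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed segmenter).

    Characters not covered by a lexicon word become single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def char_perplexity(train, test, lexicon):
    """Per-character perplexity of `test` under a unigram word model.

    The model is estimated from `train` with add-one smoothing (an
    assumption; one extra vocabulary slot is reserved for unseen words).
    """
    counts = Counter(segment(train, lexicon))
    total = sum(counts.values())
    vocab = len(counts) + 1
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab))
        for w in segment(test, lexicon)
    )
    # Normalize by character count, not word count.
    return math.exp(-log_prob / len(test))


def prune(train, heldout, lexicon, limit):
    """Greedily shrink `lexicon` to at most `limit` multi-character words.

    `limit` is a hypothetical proxy for the preset memory limit.
    """
    lexicon = set(lexicon)
    while len(lexicon) > limit:
        # Drop the word whose removal leaves the lowest perplexity.
        # sorted() makes tie-breaking deterministic.
        victim = min(
            sorted(lexicon),
            key=lambda w: char_perplexity(train, heldout, lexicon - {w}),
        )
        lexicon.remove(victim)
    return lexicon
```

This greedy loop re-segments the corpus for every candidate removal, so it is only practical on toy data; a realistic system would need a far cheaper estimate of each word's contribution to perplexity.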