ISCA Archive ISCSLP 2002

Investigation and analysis on designing Chinese balance corpus

Rile Hu, Chengqing Zong, Juha Iso-Sipila, Bo Xu

Statistical methods have recently become the dominant approach in computational linguistics and natural language processing, and the corpus is the basis of any statistical method. How to keep a corpus balanced during collection is an important issue. In this paper, we report the results of our investigation and analysis of some real corpora and propose a scheme for keeping the balance in corpus design. We also present suggestions for the composition of a corpus.