ISCA Archive ISCSLP 2006

Chinese Character-based Segmentation & POS-tagging and Named Entity Identification with a CRF Chunker

Xinhui Hu, Hideki Kashioka

In this paper, we propose a character-based conditional random field (CRF) chunker to identify Chinese named entity words in text. Its input comes from a character-based tagger that performs segmentation and part-of-speech (POS) tagging simultaneously. The character-based tagger is trained on a corpus in which each character is tagged with both its position in a word (POC) and the POS tag of that word. The chunker is trained on an IOB2-tagged corpus in which each character is labelled with POC, POS, and chunk tags (one of B, I, O). Four kinds of named entities are the identification targets: personal names, location names, organization names, and other proper nouns. In experiments on the People's Daily corpus, the CRF chunker obtained better results than the maximum entropy (ME) model and the support vector machine model when similar features were used. We also confirmed that bigram features for the CRF chunker are superior to unigram features, and that adding POS information improves identification by nearly 1%.

Keywords: Chinese Segmentation & POS tagging, Named Entity Identification, Character-based Model, ME, CRF
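As an illustration of the character-level labelling described in the abstract, the following minimal Python sketch converts a segmented, POS-tagged sentence with marked entities into rows of (character, POC, POS, chunk tag). Several details are assumptions made for the example rather than taken from the paper: the POC tag set (B/I/E/S), the entity-type suffixes on the chunk tags (PER, LOC, ORG), the POS labels in the sample input, and the function names `poc_tags` and `to_char_rows`.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, str]                      # (word, POS tag)
Chunk = Tuple[Optional[str], List[Word]]    # (entity type or None, words)
Row = Tuple[str, str, str, str]             # (char, POC, POS, chunk tag)


def poc_tags(word: str) -> List[str]:
    """Position-of-character tags for one word.

    Assumed scheme (not confirmed by the abstract):
    S = single-character word, B = first, I = inside, E = last.
    """
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["I"] * (len(word) - 2) + ["E"]


def to_char_rows(chunks: List[Chunk]) -> List[Row]:
    """Flatten chunks into one row per character with IOB2 chunk tags.

    In IOB2, the first character of every entity gets B, the remaining
    characters of that entity get I, and everything else gets O.
    The "-TYPE" suffix is an illustrative assumption.
    """
    rows: List[Row] = []
    for ent_type, words in chunks:
        first_char_of_chunk = True
        for word, pos in words:
            for ch, poc in zip(word, poc_tags(word)):
                if ent_type is None:
                    chunk_tag = "O"
                else:
                    prefix = "B-" if first_char_of_chunk else "I-"
                    chunk_tag = prefix + ent_type
                first_char_of_chunk = False
                rows.append((ch, poc, pos, chunk_tag))
    return rows


# Hypothetical sample sentence; POS labels are illustrative only.
example: List[Chunk] = [
    ("PER", [("胡锦涛", "nr")]),
    (None, [("访问", "v")]),
    ("ORG", [("北京", "ns"), ("大学", "n")]),
]

if __name__ == "__main__":
    for row in to_char_rows(example):
        print("\t".join(row))
```

Each row carries the POC and POS columns that a character-based tagger would supply, plus the chunk column a chunker would be trained to predict. A multi-word entity such as the third chunk keeps word-internal POC boundaries while sharing a single B/I chunk span.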