ISCA Archive Eurospeech 1991

A technique to automatically assign parts-of-speech to words taking into account word-ending information through a probabilistic model

Giulio Maltese, Federico Mancini

A system that automatically tags arbitrary text with the part-of-speech of each word is described. The system is based on a probabilistic model in which the words of a given sequence are assumed to be the output symbols of a Hidden Markov Model whose states are pairs of parts-of-speech. Using a 17-tag set, the rate of correctly tagged words ranged from 96.2% to 97.2% on various texts. The system proved to be quite effective even with a small set of initial statistics. For words that never occurred in the training data, we employed a statistical technique based on word-ending frequencies. This technique resulted in a 22% decrease in tagging error rate with a 260,000-word reference vocabulary and a 49% decrease with a 20,000-word vocabulary.
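The word-ending idea can be illustrated with a minimal sketch. The abstract does not specify how ending frequencies are estimated or combined with the Hidden Markov Model, so everything below is an assumption made for illustration: the function names, the maximum ending length of 3, the longest-ending-first fallback, the plain relative-frequency estimate, and the toy English data and tag names.

```python
from collections import Counter, defaultdict


def build_ending_stats(tagged_words, max_len=3):
    """Count how often each word ending (1..max_len chars) occurs with each tag."""
    stats = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            stats[w[-n:]][tag] += 1
    return stats


def guess_tag_distribution(word, stats, max_len=3):
    """Estimate P(tag | ending) from the longest ending seen in training.

    Returns an empty dict when no ending of the word was ever observed.
    (Hypothetical fallback scheme, not taken from the paper.)
    """
    w = word.lower()
    for n in range(min(max_len, len(w)), 0, -1):
        counts = stats.get(w[-n:])
        if counts:
            total = sum(counts.values())
            return {tag: c / total for tag, c in counts.items()}
    return {}


# Toy training data (illustrative only).
training = [
    ("running", "VERB"), ("jumping", "VERB"), ("building", "NOUN"),
    ("quickly", "ADV"), ("slowly", "ADV"), ("happiness", "NOUN"),
]
stats = build_ending_stats(training)

# "walking" is not in the training data, but its ending "ing" is.
dist_walking = guess_tag_distribution("walking", stats)
# "oddly" has no known 3-letter ending, so the 2-letter ending "ly" is used.
dist_oddly = guess_tag_distribution("oddly", stats)
print(dist_walking, dist_oddly)
```

In a tagger of the kind described, such a distribution would stand in for the missing word statistics of an unknown word when computing its output probability in each state. How the authors actually integrate it is not stated in the abstract.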