In this paper, we introduce structure into a bigram language model by means of word equivalence classes. These classes are trained automatically with an iterative clustering algorithm that finds a (local) optimum of a given clustering criterion. We show that the conventional maximum-likelihood criterion performs well but has the disadvantage that the number of word classes must be specified in advance. We therefore modify this criterion using a special form of cross-validation, the leaving-one-out technique. The resulting algorithm finds both the unknown classification and the unknown number of classes at the same time. Clustering experiments were carried out on an English and a German text corpus comprising 1.1 million and 100,000 words, respectively. Compared to a word bigram model, the class model with automatically clustered classes reduced the perplexity by more than 10%. Combinations with the word model and with linguistically defined parts of speech led to a further improvement of up to 37%.
Keywords: Stochastic Language Modelling, Statistical Clustering, Leaving-One-Out Method
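To make the iterative clustering concrete, the sketch below illustrates exchange-style clustering under the maximum-likelihood criterion: each word is greedily moved to the class that most increases the classification-dependent part of the training-set log-likelihood, until no move helps (a local optimum). This is a minimal sketch under our own assumptions, not the paper's implementation: the function and parameter names (`exchange_cluster`, `num_classes`) are illustrative, the criterion is recomputed naively rather than updated incrementally, and the leaving-one-out modification that removes the need to fix the number of classes is omitted.

```python
import math
from collections import defaultdict

def criterion(bigram_counts, word_to_class):
    """Classification-dependent part of the ML bigram log-likelihood:
    sum N(g1,g2) log N(g1,g2) - sum N(g,.) log N(g,.) - sum N(.,g) log N(.,g).
    The word-unigram term is constant under reclassification and omitted."""
    cc = defaultdict(int)     # class bigram counts N(g1, g2)
    left = defaultdict(int)   # predecessor marginals N(g, .)
    right = defaultdict(int)  # successor marginals N(., g)
    for (w1, w2), n in bigram_counts.items():
        g1, g2 = word_to_class[w1], word_to_class[w2]
        cc[(g1, g2)] += n
        left[g1] += n
        right[g2] += n
    f = sum(n * math.log(n) for n in cc.values())
    f -= sum(n * math.log(n) for n in left.values())
    f -= sum(n * math.log(n) for n in right.values())
    return f

def exchange_cluster(corpus, num_classes, max_iters=20):
    """Greedy exchange clustering: move each word to the class that
    maximizes the criterion; stop at a local optimum."""
    bigram_counts = defaultdict(int)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram_counts[(w1, w2)] += 1
    vocab = sorted(set(corpus))
    # initialize with a round-robin assignment of words to classes
    word_to_class = {w: i % num_classes for i, w in enumerate(vocab)}
    best = criterion(bigram_counts, word_to_class)
    for _ in range(max_iters):
        improved = False
        for w in vocab:
            keep = word_to_class[w]
            for g in range(num_classes):
                if g == keep:
                    continue
                word_to_class[w] = g  # tentative move
                f = criterion(bigram_counts, word_to_class)
                if f > best + 1e-12:
                    best, keep, improved = f, g, True
            word_to_class[w] = keep  # commit the best class found
        if not improved:
            break
    return word_to_class, best

if __name__ == "__main__":
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    classes, f = exchange_cluster(corpus, num_classes=3)
    print(classes, f)
```

On real corpora the criterion would be updated incrementally per tentative move rather than recomputed, since the full recomputation makes each sweep quadratic in the number of classes times the vocabulary size.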