ISCA Archive Eurospeech 1995

Algorithms for bigram and trigram word clustering

Sven Martin, Jörg Liermann, Hermann Ney

This paper presents and analyzes improved algorithms for clustering words into bigram and trigram equivalence classes, together with experimental results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words or more) can be clustered within a few days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38-million-word training corpus.
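The abstract does not spell out the clustering criterion or the search procedure, so the following is only an illustrative sketch and not the authors' implementation. It assumes a class bigram model of the form p(w|v) = p(w|g_w) · p(g_w|g_v) with maximum-likelihood estimates, and a naive exchange search that moves one word at a time to the class that most increases the training log-likelihood. The function names `class_bigram_log_likelihood` and `exchange_clustering`, the round-robin initialization, and the iteration limit are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(bigrams, word_class):
    """Log-likelihood of the bigram counts under an assumed class bigram model.

    bigrams:    Counter mapping (v, w) -> count of word v followed by word w
    word_class: dict mapping each word to an integer class id
    """
    class_pair = Counter()  # N(g_v, g_w): class bigram counts
    pred_class = Counter()  # N(g_v, .): class counts in predecessor position
    succ_class = Counter()  # N(., g_w): class counts in successor position
    succ_word = Counter()   # N(., w): word counts in successor position
    for (v, w), n in bigrams.items():
        gv, gw = word_class[v], word_class[w]
        class_pair[(gv, gw)] += n
        pred_class[gv] += n
        succ_class[gw] += n
        succ_word[w] += n

    ll = 0.0
    for (v, w), n in bigrams.items():
        gv, gw = word_class[v], word_class[w]
        p_word = succ_word[w] / succ_class[gw]            # p(w | g_w)
        p_class = class_pair[(gv, gw)] / pred_class[gv]   # p(g_w | g_v)
        ll += n * math.log(p_word * p_class)
    return ll


def exchange_clustering(bigrams, num_classes, max_iterations=10):
    """Naive exchange search: recomputes the full criterion for every trial move."""
    words = sorted({x for pair in bigrams for x in pair})
    # Hypothetical initialization: assign words to classes round-robin.
    word_class = {w: i % num_classes for i, w in enumerate(words)}
    best = class_bigram_log_likelihood(bigrams, word_class)

    for _ in range(max_iterations):
        moved = False
        for w in words:
            original = word_class[w]
            best_class = original
            for g in range(num_classes):
                if g == original:
                    continue
                word_class[w] = g  # tentatively move w to class g
                ll = class_bigram_log_likelihood(bigrams, word_class)
                if ll > best + 1e-12:
                    best, best_class = ll, g
            word_class[w] = best_class  # keep the best move, or restore
            if best_class != original:
                moved = True
        if not moved:  # no word changed class: local optimum reached
            break
    return word_class, best
```

A possible usage, on a toy token sequence: `tokens = "the cat sat the dog sat".split()`, then `exchange_clustering(Counter(zip(tokens, tokens[1:])), num_classes=2)`. Because this sketch recomputes every count from scratch for each trial move, its cost grows quickly with vocabulary and corpus size; that kind of cost is what makes an efficient implementation matter for a 38-million-word corpus.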