ISCA Archive ICSLP 1996

Word clustering with parallel spoken language corpora

Ye-Yi Wang, John Lafferty, Alex Waibel

In this paper we introduce a word clustering algorithm that uses a bilingual parallel corpus to group together words in the source and target languages. Our method generalizes previous mutual information clustering algorithms for monolingual data by incorporating a statistical translation model. Preliminary experiments show that the algorithm can effectively exploit the constraints implicit in bilingual data to extract classes that are well suited to machine translation tasks.
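For orientation, the kind of objective used in monolingual mutual information clustering can be sketched as follows. This is only an illustration of the monolingual baseline that the abstract says the method generalizes, not the bilingual algorithm itself: the function name `average_mutual_information`, the toy corpus, and the hand-picked class assignments are all assumptions made for the example, and the statistical translation model is not represented here.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between adjacent word classes.

    Hypothetical helper for illustration: `tokens` is a list of words and
    `word_to_class` maps each word to a class label.
    """
    classes = [word_to_class[w] for w in tokens]
    # Count adjacent class pairs (class bigrams).
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts for the left and right positions of each bigram.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p_joint = n / total
        # p(c1, c2) / (p(c1) * p(c2)) simplifies to n * total / (left * right).
        ami += p_joint * math.log2(n * total / (left[c1] * right[c2]))
    return ami


# Toy corpus (an assumption, not data from the paper).
tokens = "a b a b a b".split()
two_classes = {"a": 0, "b": 1}
one_class = {"a": 0, "b": 0}

ami_two = average_mutual_information(tokens, two_classes)
ami_one = average_mutual_information(tokens, one_class)
```

A greedy clustering procedure of this family would prefer the assignment with the higher score: merging both words into one class discards all information about which class follows which, so `ami_one` is zero while `ami_two` is positive.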