In this work we compare several methods for clustering words into equivalence classes within a bigram language model, applied to a domain-specific recognition task (train timetable enquiry). Although the perplexity values obtained with the various methods differ, the word error rates eventually achieved are very similar. We examine this behavior in light of the peculiarities of word usage in tasks of this kind.
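
To make the quantities above concrete, the following minimal sketch shows how a bigram model over word classes might assign probabilities and how perplexity could be computed from it. It assumes the common factorization P(w | previous class) = P(class of w | previous class) × P(w | class of w) with unsmoothed maximum-likelihood estimates; the exact formulation and the clustering methods compared in this work may differ. The word-to-class map, the toy sentences, and all function names are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: a hand-made word-to-class map and two short queries.
WORD2CLASS = {"from": "PREP", "to": "PREP", "paris": "CITY", "lyon": "CITY"}
SENTENCES = [["from", "paris", "to", "lyon"], ["to", "paris"]]
START = "<s>"  # pseudo-class marking the start of a sentence


def train_class_bigram(sentences, word2class):
    """Collect the counts needed for unsmoothed maximum-likelihood estimates."""
    word_counts = Counter()     # N(w)
    class_counts = Counter()    # N(c)
    bigram_counts = Counter()   # N(c_prev, c)
    history_counts = Counter()  # N(c_prev), counted as a history
    for sentence in sentences:
        prev = START
        for word in sentence:
            cls = word2class[word]
            word_counts[word] += 1
            class_counts[cls] += 1
            bigram_counts[(prev, cls)] += 1
            history_counts[prev] += 1
            prev = cls
    return word_counts, class_counts, bigram_counts, history_counts


def word_prob(word, prev_class, model, word2class):
    """P(word | prev_class) = P(class | prev_class) * P(word | class)."""
    word_counts, class_counts, bigram_counts, history_counts = model
    cls = word2class[word]
    p_class = bigram_counts[(prev_class, cls)] / history_counts[prev_class]
    p_word = word_counts[word] / class_counts[cls]
    return p_class * p_word


def perplexity(sentences, model, word2class):
    """exp of the average negative log-probability per word.

    No smoothing is applied, so an unseen class bigram has probability zero
    and math.log raises ValueError; real systems would smooth the estimates.
    """
    log_sum, n_words = 0.0, 0
    for sentence in sentences:
        prev = START
        for word in sentence:
            log_sum += math.log(word_prob(word, prev, model, word2class))
            n_words += 1
            prev = word2class[word]
    return math.exp(-log_sum / n_words)


MODEL = train_class_bigram(SENTENCES, WORD2CLASS)
```

Calling `perplexity(SENTENCES, MODEL, WORD2CLASS)` gives the perplexity on the toy training data. Changing `WORD2CLASS` changes the perplexity, but that alone does not determine the word error rate a recognizer would achieve.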