Stochastic language models based on word n-grams require huge amounts of training material and storage, especially for large-vocabulary systems. N-grams based on word classes need much less training material and achieve higher coverage. Building the classes on the basis of linguistic characteristics has the advantage that new words can be mapped to them easily. Generating linguistically oriented language models requires training corpora in which each word is assigned its linguistic class. For this task we use commercially available linguistic knowledge bases of high coverage: a German lexicon and a grammar of a machine translation system. We first generate an initial language model using information derived from grammatical parses of the training material. In the next step, the linguistic structure represented statistically by the initial language model is extrapolated to any lexically tagged text. The initial language model thus serves as the basis of a bootstrapping process. Using this technique, we present a tool that assigns to each word of a lexically tagged text its linguistically oriented class, taking the sentence context into account. First evaluations show that 91.3% of the classes are assigned correctly.
Keywords: stochastic language models, large-vocabulary speech recognition, tagging, n-grams, linguistically oriented classes.
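
As an illustration of context-dependent class assignment, the following minimal sketch picks, for each word of a lexically tagged sentence, one class out of the candidates offered by a lexicon, using a class bigram model and a Viterbi search. It is not the tool described above: the bigram order, the class names (DET, PRON, NOUN, VERB), the probability values, the floor for unseen class pairs, and the function names are all assumptions made for the example. Word-given-class probabilities are left out, and every word is assumed to have at least one candidate class.

```python
import math

# Hypothetical class-bigram probabilities P(current_class | previous_class).
# "<s>" marks the sentence start. All values are invented for illustration.
BIGRAM = {
    ("<s>", "DET"): 0.6,
    ("<s>", "PRON"): 0.4,
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.1,
    ("PRON", "NOUN"): 0.2,
    ("PRON", "VERB"): 0.8,
    ("NOUN", "VERB"): 0.7,
    ("NOUN", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.1,
    ("VERB", "NOUN"): 0.5,
}
FLOOR = 1e-6  # assumed probability for class pairs missing from BIGRAM


def log_p(prev, cur):
    """Log probability of class `cur` following class `prev`."""
    return math.log(BIGRAM.get((prev, cur), FLOOR))


def assign_classes(candidates):
    """Return the most probable class sequence under the bigram model.

    `candidates` holds one list of candidate classes per word, as a
    lexical tagger might supply them.
    """
    if not candidates:
        return []
    # best[c] = (log score, class path) of the best path ending in class c
    best = {c: (log_p("<s>", c), [c]) for c in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for cur in options:
            # Extend every surviving path by `cur` and keep the best one.
            score, path = max(
                ((s + log_p(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# "die Frau singt": "die" is ambiguous between determiner and pronoun.
sentence = [["DET", "PRON"], ["NOUN"], ["VERB"]]
print(assign_classes(sentence))
```

With these toy numbers the ambiguous first word is resolved by its right-hand context, since a determiner is far more likely than a pronoun to precede a noun. A higher-order model would follow the same pattern with longer class histories as search states.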