One of the challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) of German is the language's complex morphology and high degree of compounding. These lead to high out-of-vocabulary (OOV) rates and poor Language Model (LM) probability estimates. In such cases, building LMs at the morpheme level can be a better choice, since it yields higher lexical coverage and lower LM perplexities. On the other hand, a successful approach to improving LM probability estimation is to incorporate features of words using feature-based LMs. In this paper, we use features derived for morphemes as well as for words, thereby combining the benefits of morpheme-level and feature-rich modeling. We compare the performance of stream-based, class-based, and factored LMs (FLMs). Relative reductions of around 1.5% in Word Error Rate (WER) are achieved compared to the best previous results obtained using FLMs.
Index Terms: language model, morpheme, stream-based, class-based, factored