ISCA Archive Interspeech 2014

Word-phrase-entity language models: getting more mileage out of n-grams

Michael Levit, Sarangarajan Parthasarathy, Shuangyu Chang, Andreas Stolcke, Benoît Dumoulin

We present a modification of the traditional n-gram language modeling approach that departs from the word-level data representation and seeks to re-express the training text in terms of tokens that could be either words, common phrases or instances of one or several classes. Our iterative optimization algorithm considers alternative parses of the corpus in terms of these tokens, re-estimates token n-gram probabilities and also updates within-class distributions. In this paper, we focus on the cold start approach that only assumes availability of the word-level training corpus, as well as a number of generic class definitions. Applied to the calendar scenario in the personal assistant domain, our approach reduces word error rates by more than 13% relative to the word-only n-gram language models. Only a small fraction of these improvements can be ascribed to a larger vocabulary.
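As an illustration of the token re-expression the abstract describes, here is a minimal Python sketch with invented toy classes and phrases. The paper's actual algorithm considers alternative probabilistic parses and iterates to convergence; this sketch shows only a single greedy parsing-and-counting pass, so all names and definitions below are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

# Toy definitions (assumed for illustration, not the paper's actual inventory):
# a class maps to its member word sequences; phrases are merged into one token.
CLASSES = {"@day": {("monday",), ("friday",)}}
PHRASES = {("set", "up")}

def parse(words):
    """Greedily re-express a word sequence as (token, surface) pairs,
    preferring longer matches: class members first, then phrases, then words."""
    out, i = [], 0
    while i < len(words):
        for n in (2, 1):
            span = tuple(words[i:i + n])
            if len(span) == n:
                cls = next((c for c, m in CLASSES.items() if span in m), None)
                if cls:
                    out.append((cls, span)); i += n
                    break
                if span in PHRASES:
                    out.append(("_".join(span), span)); i += n
                    break
        else:
            out.append((words[i], (words[i],))); i += 1
    return out

def reestimate(corpus):
    """One pass: collect token bigram counts and within-class surface counts."""
    bigrams = Counter()
    within = {c: Counter() for c in CLASSES}
    for sent in corpus:
        pairs = parse(sent.split())
        tokens = [t for t, _ in pairs]
        for a, b in zip(["<s>"] + tokens, tokens + ["</s>"]):
            bigrams[(a, b)] += 1
        for tok, surface in pairs:
            if tok in within:
                within[tok][surface] += 1
    return bigrams, within

corpus = ["set up a meeting on monday", "set up lunch on friday"]
bigrams, within = reestimate(corpus)
```

In a full iterative scheme, the re-estimated token n-gram probabilities and within-class distributions would then re-score alternative parses of the corpus on the next iteration, which is where the gains reported in the paper come from.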

doi: 10.21437/Interspeech.2014-168

Cite as: Levit, M., Parthasarathy, S., Chang, S., Stolcke, A., Dumoulin, B. (2014) Word-phrase-entity language models: getting more mileage out of n-grams. Proc. Interspeech 2014, 666-670, doi: 10.21437/Interspeech.2014-168

@inproceedings{levit14_interspeech,
  author={Michael Levit and Sarangarajan Parthasarathy and Shuangyu Chang and Andreas Stolcke and Benoît Dumoulin},
  title={{Word-phrase-entity language models: getting more mileage out of n-grams}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={666--670},
  doi={10.21437/Interspeech.2014-168}
}