ISCA Archive Interspeech 2005

Statistical language models for large vocabulary spontaneous speech recognition in Dutch

Jacques Duchateau, Dong Hoon Van Uytsel, Hugo Van Hamme, Patrick Wambacq

In state-of-the-art large vocabulary automatic recognition systems, a large statistical language model is used, typically an N-gram. However, estimating this model requires a large database of sentences or texts in the same style as the recognition task. For spontaneous speech, no such database is available, since it would have to consist of accurate, and thus expensive, orthographic transcriptions of spoken audio.

This paper investigates how readily available large newspaper corpora can be used to improve language models for spontaneous speech recognition, even though the two language styles differ considerably. A technique is proposed that automatically selects appropriate newspaper articles based on perplexity and subsequently uses these texts in the language model estimation. Recognition experiments on spontaneous broadcast speech in Dutch showed significant improvements using this technique.
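
The abstract does not spell out how the perplexity-based selection is carried out, so the following is only a minimal sketch of the general idea under stated assumptions: a small seed set of spontaneous speech transcriptions is used to train a model, each newspaper article is scored by its perplexity under that model, and the lowest-perplexity articles are kept. The add-one smoothed unigram model, the function names, and the `keep_fraction` parameter are illustrative choices, not the authors' method.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words in the in-domain (spontaneous speech) seed text."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return counts, sum(counts.values())


def perplexity(text, counts, total, vocab_size):
    """Per-word perplexity of `text` under an add-one smoothed unigram model.

    A unigram model is an assumption made for brevity; a real system
    would more likely use a higher-order N-gram.
    """
    words = text.lower().split()
    if not words:
        return float("inf")
    log_prob = 0.0
    for w in words:
        p = (counts.get(w, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))


def select_articles(articles, seed_sentences, keep_fraction=0.5):
    """Keep the fraction of articles with the lowest perplexity.

    `keep_fraction` is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    counts, total = train_unigram(seed_sentences)
    # Vocabulary covers both the seed text and the candidate articles,
    # so every scored word receives a non-zero smoothed probability.
    vocab = set(counts)
    for article in articles:
        vocab.update(article.lower().split())
    ranked = sorted(
        articles,
        key=lambda a: perplexity(a, counts, total, len(vocab)),
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    seed = [
        "well you know i think it is uh fine",
        "yeah i mean you know it is okay",
    ]
    articles = [
        "you know i think it is fine",
        "the ministry announced quarterly fiscal deficit projections",
    ]
    for article in select_articles(articles, seed):
        print(article)
```

The selected articles would then be added to the training text for the final language model. How they are weighted against the in-domain transcriptions is left open here.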


doi: 10.21437/Interspeech.2005-22

Cite as: Duchateau, J., Uytsel, D.H.V., Hamme, H.V., Wambacq, P. (2005) Statistical language models for large vocabulary spontaneous speech recognition in dutch. Proc. Interspeech 2005, 1301-1304, doi: 10.21437/Interspeech.2005-22

@inproceedings{duchateau05_interspeech,
  author={Jacques Duchateau and Dong Hoon Van Uytsel and Hugo Van Hamme and Patrick Wambacq},
  title={{Statistical language models for large vocabulary spontaneous speech recognition in dutch}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={1301--1304},
  doi={10.21437/Interspeech.2005-22},
  issn={2958-1796}
}