ISCA Archive Interspeech 2005
ISCA Archive Interspeech 2005

Building topic specific language models from webdata using competitive models

Abhinav Sethy, Panayiotis G. Georgiou, Shrikanth Narayanan

The ability to build topic specific language models, rapidly and with minimal human effort, is a critical need for fast deployment and portability of ASR across different domains. The World Wide Web (WWW) promises to be an excellent textual data resource for creating topic specific language models. In this paper we describe an iterative web crawling approach which uses a competitive set of adaptive models comprised of a generic topic independent background language model, a noise model representing spurious text encountered in web based data (Webdata), and a topic specific model to generate query strings using a relative entropy based approach for WWW search engines and to weight the downloaded Webdata appropriately for building topic specific language models. We demonstrate how this system can be used to rapidly build language models for a specific domain given just an initial set of example utterances and how it can address the various issues attached with Webdata. In our experiments we were able to achieve a 20% reduction in perplexity for our target medical domain. The gains in perplexity translated to a 4% improvement in ASR word error rate (absolute) corresponding to a relative gain of 14%.