We present a semi-supervised language modeling technique that improves search performance on terms that lack training data. Probabilities estimated from automatic transcripts of a large corpus of in-domain audio are added to an existing language model (LM). Our method requires no development data or external resources, yet it achieves 70% of the gain possible from manual transcription of the same audio. This contrasts sharply with the modest gains of previous semi-supervised LM experiments. We compare the value of additional resources (labor or data) with that of semi-supervised learning. For cases where human effort is available, we describe a transcription regime that efficiently closes the remaining performance gap.
Index Terms: keyword search (KWS), language modeling, conversational telephone speech (CTS)