ISCA Archive Interspeech 2015

Enhancing low resource keyword spotting with automatically retrieved web documents

Le Zhang, Damianos Karakos, William Hartmann, Roger Hsiao, Richard Schwartz, Stavros Tsakalidis

Keyword Spotting (KWS) systems developed for low resource languages with very little transcribed audio suffer from a small vocabulary, and hence a high out-of-vocabulary (OOV) rate, and from a weak language model. In this paper, we propose to augment such systems using automatically retrieved web documents. Our procedure can find large volumes of web documents similar to a small pool of training transcriptions within a few hours, by querying a search engine with automatically generated query terms. We then use simple language identification to extract high-confidence text for lexicon expansion and language modeling. Experiments using six very limited language packs (VLLP) from the IARPA-Babel program show that web documents can cut the OOV rate by half on the development set and, on average, improve keyword spotting performance by 2.8 points absolute, as measured by the Actual Term Weighted Value (ATWV). In particular, we find that most of the gains (8.7 points on average) come from keywords that were OOV in the baseline system and become in-vocabulary (IV) through lexicon expansion. These gains are obtained even after using subword units (unsupervised syllable-like units and sequences of phones), which are known to greatly enhance OOV keyword search performance.
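
To make the OOV-rate claim concrete, the following minimal sketch shows how a token-level OOV rate might be measured before and after lexicon expansion. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy word lists, and the `min_count` frequency threshold are all hypothetical, and the frequency filter merely stands in for the language-identification filtering described in the abstract.

```python
from collections import Counter


def oov_rate(vocab, tokens):
    """Return the fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocab)
    return missing / len(tokens)


def expand_lexicon(base_vocab, web_tokens, min_count=2):
    """Add web words seen at least min_count times to the base vocabulary.

    The frequency threshold is an assumption made for this sketch; the
    abstract only says that high-confidence text is selected with simple
    language identification.
    """
    counts = Counter(web_tokens)
    frequent = {word for word, count in counts.items() if count >= min_count}
    return set(base_vocab) | frequent


# Toy data (hypothetical, for illustration only).
base_vocab = {"the", "cat", "sat"}
web_tokens = ["the", "dog", "dog", "ran", "mat", "mat"]
dev_tokens = ["the", "dog", "sat", "on", "the", "mat"]

expanded_vocab = expand_lexicon(base_vocab, web_tokens)
rate_before = oov_rate(base_vocab, dev_tokens)
rate_after = oov_rate(expanded_vocab, dev_tokens)

print(f"OOV rate before expansion: {rate_before:.3f}")
print(f"OOV rate after expansion:  {rate_after:.3f}")
```

On this toy data, three of the six development tokens are OOV against the base vocabulary, and only one remains OOV after expansion. A keyword containing "dog" or "mat" would move from OOV to IV, which is the kind of conversion the abstract credits for most of the gains.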