State-of-the-art spoken term detection (STD) systems are built on top of large vocabulary speech recognition engines, which generate lattices that encode candidate occurrences of each in-vocabulary query. These lattices specifiy start and stop times of hypothesized term occurrences, providing a clear opportunity to return to the acoustics to incorporate novel confidence measures for verification. In this paper, we introduce a novel exemplar distance metric to the recently proposed discriminative point process modeling (DPPM) framework and use the resulting whole word models to generate STD confidence scores. In doing so, we introduce STD to a completely distinct acoustic modeling pipeline, trading Gaussian mixture models (GMM) for multi-layer perceptrons and replacing dictionary-derived hidden Markov models (HMM) with exemplarbased point process models. We find that whole word DPPM scores both perform comparably and are complementary to lattice posterior scores produced by a state-of-the-art speech recognition engine.
Index Terms: spoken term detection, point process model, discriminative training, whole word model