State-of-the-art vocabulary-independent spoken term detection methods
are typically based on variants of the dynamic time warping (DTW) algorithm
since DTW, being based on acoustic sequence matching, allows robust
retrieval in settings where linguistic resources are scarce. However,
DTW comes with a high computational cost, which limits its practicality
on a deployed server. To address this, we investigate the efficacy of subsampling
and propose a neural network architecture to reduce the computational
load of DTW-based keyword search. We use a time-subsampled RNN to reduce
the frame rate of the document as well as the dimensionality of its representation,
while training it to preserve the cost incurred along the DTW alignment
path. This reduces the computational complexity (both
space and time) of the search algorithm.
Experiments on the
Turkish and Zulu limited language packs of the IARPA Babel program
show that the proposed methods considerably reduce CPU
time (by a factor of 88) and memory usage (by a factor of 18) with only a small loss
in search accuracy (0.0270 ATWV). Moreover, even at very high compression
levels, where search precision is lower, high recall rates are maintained,
which opens up the possibility of multi-resolution search.
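
To make the complexity argument concrete, the following is a minimal sketch of subsequence DTW over frame sequences, together with a naive time subsampler. It is an illustration under stated assumptions, not the method evaluated here: plain frame averaging stands in for the trained time-subsampled RNN, the feature dimensionality is left unchanged, and the names `subsample` and `subsequence_dtw` are hypothetical.

```python
import numpy as np


def subsample(frames, factor):
    """Average every `factor` consecutive frames.

    Assumption: a naive stand-in for the trained time-subsampled RNN.
    """
    n = (len(frames) // factor) * factor  # drop the ragged tail
    return frames[:n].reshape(-1, factor, frames.shape[1]).mean(axis=1)


def subsequence_dtw(query, doc):
    """Find the best match of `query` anywhere inside `doc`.

    Returns (accumulated cost, end frame in doc, number of cost-matrix cells).
    """
    # Pairwise Euclidean frame distances, shape (len(query), len(doc)).
    dist = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=2)
    acc = np.full(dist.shape, np.inf)
    acc[0, :] = dist[0, :]  # the match may start at any document frame
    for i in range(1, dist.shape[0]):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, dist.shape[1]):
            acc[i, j] = dist[i, j] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    end = int(np.argmin(acc[-1]))  # the match may end at any document frame
    return float(acc[-1, end]), end, dist.size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc = rng.normal(size=(200, 8))  # synthetic "document": 200 frames, 8 dims
    query = doc[40:60]               # query copied from inside the document

    _, end_full, cells_full = subsequence_dtw(query, doc)
    _, end_sub, cells_sub = subsequence_dtw(subsample(query, 4), subsample(doc, 4))
    print(end_full, cells_full)
    print(end_sub, cells_sub)
```

Subsampling both the query and the document in time by a factor k shrinks the DTW cost matrix by roughly k squared (here from 20 x 200 cells to 5 x 50 cells for k = 4). Reducing the feature dimensionality additionally lowers the cost of each frame-distance computation.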