We present a new speech indexing and search scheme called Randomized Acoustic Indexing and Logarithmic-time Search (RAILS) that enables scalable query-by-example spoken term detection in the zero-resource regime. RAILS is derived from our recent investigation into the application of randomized hashing and approximate nearest neighbor search algorithms to raw acoustic features. Our approach permits an approximate search through hundreds of hours of speech audio in a matter of seconds, and may be applied to any language without the need for a training corpus, acoustic model, or pronunciation lexicon. The fidelity of the approximation is controlled through a small number of easily interpretable parameters that allow a trade-off between search accuracy and speed.
Index Terms: speech indexing, zero resource, query-by-example, spoken term detection, locality sensitive hashing
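
To make the idea of randomized hashing followed by approximate nearest neighbor search concrete, the following is a minimal sketch and not the RAILS implementation itself. It assumes random-hyperplane (sign) hashing of feature vectors into bit signatures, a sorted signature list searched by bisection, and a small window of neighboring entries ranked by Hamming distance. All function names and the parameters `n_bits`, `beam`, and `max_hamming` are hypothetical and chosen only for illustration; they stand in for the kind of accuracy/speed controls mentioned above.

```python
import bisect
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Pack the signs of the projections of vec into one integer."""
    sig = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        sig = (sig << 1) | (1 if dot >= 0.0 else 0)
    return sig


def build_index(frames, planes):
    """Sorted list of (signature, frame_id); lookups are then O(log n)."""
    return sorted((signature(f, planes), i) for i, f in enumerate(frames))


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def search(query, index, planes, beam=4, max_hamming=2):
    """Bisect to the query signature, then inspect a window of up to
    2 * beam neighboring entries and keep those within max_hamming bits."""
    q = signature(query, planes)
    pos = bisect.bisect_left(index, (q, -1))
    lo, hi = max(0, pos - beam), min(len(index), pos + beam)
    hits = [(hamming(q, s), i) for s, i in index[lo:hi]]
    return sorted(h for h in hits if h[0] <= max_hamming)


if __name__ == "__main__":
    # Toy "frames": four 2-dimensional vectors standing in for acoustic features.
    frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    planes = make_hyperplanes(dim=2, n_bits=8)
    index = build_index(frames, planes)
    print(search(frames[2], index, planes))
```

In this sketch, a larger `n_bits` makes signatures more discriminative, while a larger `beam` examines more candidates per query at higher cost. A single sorted list only finds neighbors that agree on the leading bits, so a practical system would need additional machinery (for example, several lists under different bit orderings) that is not shown here.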