We study spoken term detection, the task of determining whether and where a given word or phrase appears in a given segment of speech, in the setting of limited training data. This setting is becoming increasingly important as interest grows in porting spoken term detection to multiple low-resource languages and acoustic environments. We propose a discriminative algorithm that aims to maximize the area under the receiver operating characteristic curve (AUC), a measure often used to evaluate the performance of spoken term detection systems. We implement the approach using a set of feature functions based on multilayer perceptron classifiers of phones and articulatory features, and experiment on data drawn from the Switchboard database of conversational telephone speech. Our approach outperforms a baseline HMM-based system by a large margin across a number of training set sizes.
Index Terms: spoken term detection, discriminative training, AUC, structural SVM