ISCA Archive Interspeech 2012

Acoustic and data-driven features for robust speech activity detection

Samuel Thomas, Sri Harish Mallidi, Thomas Janu, Hynek Hermansky, Nima Mesgarani, Xinhui Zhou, Shihab Shamma, Tim Ng, Bing Zhang, Long Nguyen, Spyros Matsoukas

In this paper we evaluate different features for speech activity detection (SAD). Several signal processing techniques are used to derive acoustic features that capture attributes of speech useful for distinguishing speech segments in noise. The acoustic features include short-term spectral features and long-term modulation features, both derived using Frequency Domain Linear Prediction (FDLP), and joint spectro-temporal features extracted using 2D filters on a cortical representation of speech. Posteriors of speech and non-speech from a trained multi-layer perceptron are also used as data-driven features for this task. These feature extraction techniques form part of an elaborate front-end in which information spanning several hundreds of milliseconds of the signal is used, along with heteroscedastic linear discriminant analysis for dimensionality reduction. Processed feature outputs from the proposed front-end are used to train SAD systems based on Gaussian mixture models for processing speech from multiple languages transmitted over noisy radio communication channels under the ongoing DARPA Robust Automatic Transcription of Speech (RATS) program. The proposed front-end performs significantly better than standard acoustic feature extraction techniques in these noisy conditions.
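To make the FDLP idea concrete, the following is a minimal, hypothetical sketch of the core technique, not the authors' implementation: fitting an all-pole (linear prediction) model to the DCT of a signal segment yields an approximation of its temporal Hilbert envelope. The function name, model order, and single-band setting are assumptions for illustration; practical FDLP front-ends apply this per critical band over long analysis windows.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, order=40, n_points=512):
    """Approximate the temporal envelope of segment x via FDLP (single-band sketch)."""
    # DCT of the time-domain segment; linear prediction on these
    # coefficients models the signal's temporal envelope.
    c = dct(x, type=2, norm='ortho')
    # Autocorrelation of the DCT coefficients at lags 0..order.
    r = np.correlate(c, c, mode='full')[len(c) - 1:len(c) + order]
    # Solve the Yule-Walker equations for the LP coefficients.
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    lpc = np.concatenate(([1.0], -a))
    # Envelope is proportional to 1/|A(e^jw)|^2 sampled on the unit circle.
    spectrum = np.fft.rfft(lpc, 2 * n_points)
    return 1.0 / (np.abs(spectrum[:n_points]) ** 2 + 1e-12)

# Toy usage: envelope of a one-second amplitude-modulated tone at 8 kHz.
t = np.arange(8000) / 8000.0
env = fdlp_envelope(np.sin(2 * np.pi * 1000 * t) * (1 + np.cos(2 * np.pi * 4 * t)))
```

Likewise, a GMM-based SAD back-end of the kind the abstract describes can be sketched as two class-conditional mixture models scored frame by frame. The feature matrices below are synthetic stand-ins for the HLDA-reduced front-end outputs, and the mixture sizes and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-ins for front-end features: rows are frames,
# columns are (hypothetically HLDA-reduced) feature dimensions.
rng = np.random.default_rng(0)
feats_speech = rng.normal(1.0, 1.0, size=(2000, 40))
feats_nonspeech = rng.normal(-1.0, 1.0, size=(2000, 40))

# One diagonal-covariance GMM per class, a standard SAD back-end.
gmm_speech = GaussianMixture(n_components=8, covariance_type='diag',
                             random_state=0).fit(feats_speech)
gmm_nonspeech = GaussianMixture(n_components=8, covariance_type='diag',
                                random_state=0).fit(feats_nonspeech)

def sad_decision(frames, threshold=0.0):
    """Label a frame as speech when its log-likelihood ratio exceeds the threshold."""
    llr = gmm_speech.score_samples(frames) - gmm_nonspeech.score_samples(frames)
    return llr > threshold

print(sad_decision(rng.normal(1.0, 1.0, size=(10, 40))))
```

In practice the decision threshold is tuned on held-out data to trade off missed speech against false alarms, and the per-frame decisions are typically smoothed over time.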

Index Terms: Speech Activity Detection, Features for SAD