Automatic detection of speech in audio streams has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In many applications, speech activity detection must be performed on highly degraded audio streams. In this paper, we address the challenge of speech activity detection under highly degraded channel conditions. We present two two-pass modified cumulative sum (CUSUM) approaches, one based on maximum a posteriori (MAP) adaptation and the other on regularized feature-based maximum likelihood linear regression (RFMLLR) adaptation, and compare them to a single-pass modified CUSUM baseline that uses Gaussian mixture models (GMMs) of the speech and non-speech classes. The systems are evaluated on two test sets, each consisting of data from eight highly degraded channels. Our two-pass MAP adaptation system reduces the total error by 27%-54% relative to the single-pass baseline. We also present experiments showing additional relative gains of 3%-25% from using channel-specific speech and non-speech GMMs instead of a single channel-independent GMM for each class.
Index Terms: speech activity detection, adaptation
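
For context, the modified CUSUM detectors build on Page's cumulative sum test, which accumulates the frame-level log-likelihood ratio between the speech and non-speech GMMs. A minimal sketch of the standard recursion (the paper's modifications are not shown, and the symbols below are illustrative notation rather than the paper's own) is:

\[
S_0 = 0, \qquad
S_t = \max\!\left(0,\; S_{t-1} + \log \frac{p(\mathbf{x}_t \mid \lambda_{\mathrm{speech}})}{p(\mathbf{x}_t \mid \lambda_{\mathrm{non\text{-}speech}})}\right),
\]

where \(\mathbf{x}_t\) is the feature vector at frame \(t\), \(\lambda_{\mathrm{speech}}\) and \(\lambda_{\mathrm{non\text{-}speech}}\) denote the two GMMs, and a change to speech is declared when \(S_t\) exceeds a decision threshold \(h\).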