We present an algorithm for identifying the location of sibilant phones in noisy speech. Our algorithm does not attempt to identify sibilant onsets and offsets directly but instead detects a sustained increase in power over the entire duration of a sibilant phone. The normalized estimate of the sibilant power in each of 14 frequency bands forms the input to two Gaussian mixture models that are trained on sibilant and non-sibilant frames, respectively. The likelihood ratio of the two models is then used to classify each frame. We evaluate the performance of our algorithm on the TIMIT database and demonstrate that the classification accuracy exceeds 80% at a 0 dB signal-to-noise ratio for additive white noise.
Index Terms: sibilant speech, spectrographic mask estimation, speech classification, speech segregation
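As a minimal sketch of the frame-classification step described above (not the paper's implementation): two Gaussian mixture models are fit to 14-band power features, and each frame is labeled by thresholding their log-likelihood ratio. All data below is synthetic stand-in data; the band means, mixture sizes, and threshold are illustrative assumptions.

```python
# Sketch of GMM likelihood-ratio classification on 14-band power
# features. Training data here is synthetic: "sibilant" frames have
# elevated power in all bands, "non-sibilant" frames do not.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
N_BANDS = 14  # one normalized power estimate per frequency band

sib = rng.normal(loc=1.0, scale=0.3, size=(500, N_BANDS))   # stand-in sibilant frames
non = rng.normal(loc=0.0, scale=0.3, size=(500, N_BANDS))   # stand-in non-sibilant frames

gmm_sib = GaussianMixture(n_components=2, random_state=0).fit(sib)
gmm_non = GaussianMixture(n_components=2, random_state=0).fit(non)

def classify(frames, threshold=0.0):
    """Label a frame sibilant when the per-frame log-likelihood ratio
    log p(x | sibilant) - log p(x | non-sibilant) exceeds threshold."""
    llr = gmm_sib.score_samples(frames) - gmm_non.score_samples(frames)
    return llr > threshold

# Held-out frames drawn from each class should mostly be labeled correctly.
print(classify(rng.normal(1.0, 0.3, (5, N_BANDS))))
print(classify(rng.normal(0.0, 0.3, (5, N_BANDS))))
```

The threshold on the log-likelihood ratio trades misses against false alarms; it would normally be tuned on held-out data rather than fixed at zero.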