ISCA Archive ICSLP 2002

Speech, music and songs discrimination in the context of handsets variability

Hassan Ezzaidi, Jean Rouat

This paper addresses the problem of discriminating speech, music, and music with songs in telephony under handset variability. Two systems are proposed. The first system uses three Gaussian Mixture Models (GMMs), one each for speech, music and songs. Each GMM comprises 8 Gaussians trained on very short sessions. Twenty-six speakers (13 females, 13 males) were randomly chosen from the SPIDRE corpus. The music was drawn from a large set of data and comprises various styles. Over 138 minutes of test material, a speech discrimination score of 97.9% is obtained when no channel normalization is used. This performance is obtained with a relatively short analysis frame (32 ms sliding window, buffering of 100 ms). When channel normalization is used, a substantial score reduction (on the order of 10 to 20%) is observed. The second system is designed for applications requiring shorter processing times along with shorter training sessions. It is based on an empirical transformation of the MFCC that enhances the dynamical evolution of tonality. On average, it yields an acceptable discrimination rate of 90% (speech/music) and 84% (speech, music, and songs with music).
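
The decision rule of a three-model GMM classifier such as the first system can be illustrated with a minimal sketch. It is not the authors' implementation: the diagonal covariances, the summing of frame log-likelihoods over a buffer, the function names, and the toy model parameters (2 components in 2 dimensions, rather than 8 Gaussians over real features) are all assumptions made for illustration only.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) triples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_buffer(frames, models):
    """Sum frame log-likelihoods per class and return (best label, all scores)."""
    scores = {
        label: sum(gmm_log_likelihood(f, gmm) for f in frames)
        for label, gmm in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical, hand-picked toy models (not trained on any data).
TOY_MODELS = {
    "speech": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "music": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
    "songs": [(0.5, [-5.0, 5.0], [1.0, 1.0]), (0.5, [-4.0, 6.0], [1.0, 1.0])],
}

# A short buffer of made-up feature frames standing in for one decision window.
buffer_frames = [[5.2, 4.9], [5.8, 4.3], [5.5, 4.6]]
label, scores = classify_buffer(buffer_frames, TOY_MODELS)
print(label)
```

In a real system the frames would be feature vectors extracted from the analysis windows, and each model would be trained on labelled speech, music, or song data; how many frames fall into one buffer depends on the window shift, which the abstract does not state.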