This paper addresses the recognition of environmental sounds, which plays a critical role in how humans perceive the auditory context of audiovisual material. A variety of features have been proposed for audio recognition, some frame-based and others segmental. We propose a two-stage framework that combines modeling at these two levels. In the first stage, Gaussian Mixture Models (GMMs) are built on short-term features and perform a pre-classification. In the second stage, which is engaged only when the GMMs are not certain about the result, Support Vector Machines (SVMs) refine the output hypothesis; the two feature levels are combined by using the posterior estimates of the GMMs together with segmental features as the SVMs' input. Experiments on a sound dataset show that the proposed framework improves on traditional methods.
Index Terms: environmental sound classification, model combination, GMMs, SVMs
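
The decision flow of the two-stage framework can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function names, the equal-prior assumption, the use of the maximum posterior as the certainty measure, and the threshold value 0.9 are all assumptions made for illustration. The second-stage classifier is passed in as a plain callable standing in for the trained SVMs.

```python
import math
from typing import Callable, List, Sequence


def gmm_posteriors(log_likelihoods: Sequence[float]) -> List[float]:
    """Convert per-class GMM log-likelihoods to posteriors (equal priors assumed)."""
    peak = max(log_likelihoods)  # subtract the maximum for numerical stability
    weights = [math.exp(ll - peak) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def two_stage_classify(
    log_likelihoods: Sequence[float],
    segmental_features: Sequence[float],
    second_stage: Callable[[List[float]], int],
    confidence_threshold: float = 0.9,  # hypothetical value, not from the paper
) -> int:
    """Return a class index, deferring to the second stage when the GMMs are unsure."""
    posteriors = gmm_posteriors(log_likelihoods)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)

    # Stage 1: accept the GMM pre-classification when it is confident enough.
    if posteriors[best] >= confidence_threshold:
        return best

    # Stage 2: combine GMM posteriors with segmental features as the
    # input vector of the second-stage classifier (an SVM in the paper).
    combined = posteriors + list(segmental_features)
    return second_stage(combined)
```

With two classes, log-likelihoods of -10.0 and -20.0 give a maximum posterior well above the threshold, so the GMM decision is returned directly; log-likelihoods of -10.0 and -10.1 are nearly tied, so the combined vector is handed to the second stage.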