In recent years, designing the coding and pooling structures in layered networks has been shown to be a useful method for learning high-level feature representations for visual data. Yet such learning structures have not been extensively studied for audio signals. In this paper, we investigate different pooling strategies built on a sparse coding scheme and propose a temporal pyramid pooling method to extract discriminative and shift-invariant feature representations. We demonstrate the superiority of our new feature representation over traditional features on an acoustic event classification task.
Index Terms: sparse coding, pooling, acoustic event classification