ISCA Archive Interspeech 2022

Deep Segment Model for Acoustic Scene Classification

Yajian Wang, Jun Du, Hang Chen, Qing Wang, Chin-Hui Lee

Most state-of-the-art acoustic scene classification (ASC) techniques adopt convolutional neural networks (CNNs) because of their strong ability to learn local deep features. However, CNN-based approaches cannot effectively describe the structure of sound events in an audio clip, which is a key element in distinguishing acoustic scenes with similar characteristics; this is where the acoustic segment model (ASM) based approach shows its superiority. To take full advantage of both types of approaches, we propose a novel deep segment model (DSM) for ASC. DSM employs a fully convolutional neural network (FCNN) as a deep feature extractor, which then guides the ASM to better capture semantic information among sound events. Specifically, the FCNN-based encoder is trained with a multi-task objective, classifying each clip into both three coarse-grained and ten fine-grained acoustic scenes, to extract multi-level acoustic features. Moreover, an entropy-based decision fusion strategy is designed to further exploit the complementarity of the FCNN-based and DSM-based systems. The final system achieves an accuracy of 80.4% on the DCASE2021 Task1b audio dataset, a relative error rate reduction of about 15% over the FCNN-based system.
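The abstract does not spell out how the entropy-based decision fusion works, so the following is only a minimal sketch of one plausible form, not the authors' method. It assumes each system outputs a posterior distribution over the scene classes and that the system with the lower Shannon entropy (the more confident one) should receive the larger weight. The function names `entropy` and `entropy_weighted_fusion`, the weighting rule, and the example posteriors are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(p_a, p_b):
    """Fuse two posterior vectors, weighting the lower-entropy one more.

    ASSUMPTION: this weighting rule is illustrative only; the paper's
    actual fusion strategy is not described in the abstract.
    Returns the fused posterior and the weight given to the first system.
    """
    if len(p_a) != len(p_b) or len(p_a) < 2:
        raise ValueError("posteriors must have the same length (>= 2)")

    # Entropy of the uniform distribution, used to normalize to [0, 1].
    max_h = math.log(len(p_a))

    # Confidence = 1 - normalized entropy (clamped against rounding error).
    c_a = max(0.0, 1.0 - entropy(p_a) / max_h)
    c_b = max(0.0, 1.0 - entropy(p_b) / max_h)

    total = c_a + c_b
    # If both systems are (near) uniform, fall back to equal weights.
    w_a = 0.5 if total < 1e-12 else c_a / total
    w_b = 1.0 - w_a

    fused = [w_a * a + w_b * b for a, b in zip(p_a, p_b)]
    return fused, w_a


# Hypothetical posteriors over three classes from the two systems.
p_fcnn = [0.40, 0.45, 0.15]  # uncertain, weakly prefers class 1
p_dsm = [0.90, 0.05, 0.05]   # confident, prefers class 0

fused, w_fcnn = entropy_weighted_fusion(p_fcnn, p_dsm)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(f"weight on FCNN: {w_fcnn:.3f}, predicted class: {predicted}")
```

In this toy case the two systems disagree, and the fused decision follows the more confident (lower-entropy) system. A hard-switching variant, which simply picks the lower-entropy system's prediction, would be an equally plausible reading of "entropy-based" fusion.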