ISCA Archive Interspeech 2022

Temporal Self Attention-Based Residual Network for Environmental Sound Classification

Achyut Tripathi, Konark Paul

In recent years, attention mechanisms have performed remarkably well at learning representative and prototypical features for tasks such as the classification of distinct sounds and images. Classifying environmental sounds is as challenging as classifying speech and music. Semantically irrelevant frames and silent frames are two major issues that persist in environmental sound classification (ESC). This paper presents a linear self-attention (LSA) mechanism with a learnable memory unit that encodes the temporal and spectral characteristics of the spectrograms used to train the deep ESC model. The memory unit can be built from two linear layers followed by a normalization layer. Unlike traditional self-attention mechanisms, the proposed LSA mechanism has a linear computational cost. The efficacy of the proposed method is evaluated on two benchmark ESC datasets, ESC-10 and DCASE-2019 Task-1A. The experiments and results show that the model trained with the proposed attention mechanism efficiently learns temporal and spectral information from the spectrogram of a signal. The performance of the proposed deep ESC model is comparable or superior to that of state-of-the-art attention-based deep ESC models.
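To make the idea of a memory unit built from two linear layers and a normalization layer more concrete, the following is a minimal NumPy sketch of one possible reading, not the authors' implementation. Everything beyond "two linear layers followed by a normalization" is an assumption made for illustration: the function name `linear_memory_attention`, the weight names `w_key` and `w_value`, the number of memory slots `S`, the softmax over memory slots, and the choice of layer normalization. For a spectrogram with T frames and d features per frame, attending to S fixed memory slots costs on the order of T·S·d operations, which is linear in T, whereas standard self-attention compares every frame with every other frame and costs on the order of T²·d.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each frame (row) to zero mean and roughly unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def linear_memory_attention(x, w_key, w_value):
    """Hypothetical memory-based attention (illustrative only).

    x:       (T, d) spectrogram frames, T time steps with d features each
    w_key:   (d, S) first linear layer, maps frames to S memory slots
    w_value: (S, d) second linear layer, maps slot weights back to features
    """
    # First linear layer: similarity of each frame to each memory slot.
    # The softmax over slots is an assumption, not stated in the abstract.
    attn = softmax(x @ w_key, axis=-1)      # (T, S)
    # Second linear layer: read out features from the memory.
    out = attn @ w_value                    # (T, d)
    # Normalization after both linear layers (layer norm is an assumption).
    return layer_norm(out), attn


# Toy example with random data; sizes are arbitrary.
rng = np.random.default_rng(0)
T, d, S = 100, 64, 16
x = rng.standard_normal((T, d))
w_key = rng.standard_normal((d, S)) * 0.1
w_value = rng.standard_normal((S, d)) * 0.1

out, attn = linear_memory_attention(x, w_key, w_value)
print(out.shape, attn.shape)
```

Because the memory weights are shared across all inputs rather than computed from each input, no T-by-T attention map is ever formed; doubling the number of frames roughly doubles the work in this sketch.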