ISCA Archive Interspeech 2024

Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging

Yuwu Tang, Ziang Ma, Haitao Zhang

Pre-trained transformer-based models have become the mainstream approach for audio tagging. Transformer-based models deliver high performance at the cost of large model size and slow inference, while pre-training methods rely heavily on large-scale data and vast computing resources. We argue that a more lightweight CNN-based backbone with customized feature learning can achieve performance comparable to transformers. We therefore propose an efficient audio tagging framework that captures richer feature information through several enhanced feature learning blocks. We further employ knowledge distillation (KD) and propose a normalized KD loss with adaptive temperature coefficients determined by the sample distribution. Extensive experiments demonstrate that our method achieves state-of-the-art results with a lightweight CNN-based model.
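
As a rough illustration of the adaptive-temperature idea mentioned above, the sketch below implements a generic knowledge-distillation loss in which logits are normalized per sample and the softening temperature is scaled by the teacher's prediction entropy. The function name, the entropy-based temperature rule, and the softmax formulation are assumptions for illustration only (audio tagging is typically multi-label with sigmoid outputs, and the paper's exact normalized KD formulation may differ).

```python
# Hypothetical sketch of a KD loss with per-sample adaptive temperature.
# This is NOT the paper's exact formulation; the logit normalization and the
# teacher-entropy-based temperature rule are illustrative assumptions.
import torch
import torch.nn.functional as F


def adaptive_temperature_kd_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 base_tau: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) with a per-sample temperature.

    student_logits, teacher_logits: (batch, num_classes) raw logits.
    base_tau: base temperature, rescaled per sample below.
    """
    # Normalize logits per sample (zero mean, unit variance) so the
    # distillation signal is less sensitive to raw logit magnitude.
    def normalize(x: torch.Tensor) -> torch.Tensor:
        return (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-6)

    s = normalize(student_logits)
    t = normalize(teacher_logits)

    # Per-sample temperature: sharper teacher predictions (low entropy)
    # get a lower temperature, flatter ones a higher temperature.
    with torch.no_grad():
        p_t = F.softmax(t, dim=-1)
        entropy = -(p_t * p_t.clamp_min(1e-12).log()).sum(dim=-1)           # (batch,)
        tau = base_tau * (1.0 + entropy / entropy.mean().clamp_min(1e-6))   # (batch,)
    tau = tau.unsqueeze(-1)                                                  # (batch, 1)

    log_p_s = F.log_softmax(s / tau, dim=-1)
    p_t_soft = F.softmax(t / tau, dim=-1)

    # Temperature-scaled KL per sample, rescaled by tau^2 as in standard KD.
    kd = F.kl_div(log_p_s, p_t_soft, reduction='none').sum(dim=-1)
    kd = kd * tau.squeeze(-1) ** 2
    return kd.mean()
```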