ISCA Archive Interspeech 2022

iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning

Kun Chen, Jun Wang, Feng Deng, Xiaorui Wang

Automated audio captioning (AAC) is the task of generating a text description of an audio recording. Audio captions describe sound events, the acoustic scene, and the relationships between events. Currently, most AAC systems are based on an encoder-decoder architecture, in which the decoder predicts the caption solely from the features extracted from the audio clip. As a result, learning more informative audio features allows the decoder to generate more appropriate descriptions. This paper proposes an approach that guides caption generation with multi-level information extracted from the audio clip. Specifically, we use two modules to obtain acoustic information for semantic expression: (1) a module that combines channel attention and spatial attention to focus on important features, and (2) a trained keyword prediction module that generates word-level guidance information. We apply our modules to the CNN-Transformer architecture and experiment on the Clotho dataset. The results show that the proposed approach significantly improves the scores of various evaluation metrics and achieves state-of-the-art performance in the cross-entropy training stage.
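
The abstract does not spell out how the two modules are built. Below is a minimal PyTorch sketch of how such components are commonly realized: a CBAM-style channel-then-spatial attention block applied to the CNN encoder's spectrogram feature maps, and a multi-label classification head for keyword prediction. The class names, reduction ratio, kernel size, and pooling choices are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """CBAM-style block: channel attention followed by spatial attention
    over a (batch, channels, time, freq) CNN feature map. Hypothetical sketch."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: squeeze spatial dims, produce per-channel weights.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 2-channel (avg + max) map -> 1-channel mask.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled spatial statistics.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from per-position channel statistics.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))


class KeywordPredictor(nn.Module):
    """Multi-label keyword head over pooled audio features (hypothetical).
    Its output can serve as word-level guidance for the Transformer decoder."""

    def __init__(self, feat_dim: int, keyword_vocab: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, keyword_vocab)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> keyword probabilities (batch, keyword_vocab)
        return torch.sigmoid(self.fc(feats.mean(dim=1)))
```

In such a setup, the keyword head would typically be trained with a binary cross-entropy loss against keywords mined from the reference captions, and its predictions concatenated with (or attended to by) the decoder inputs; this is one plausible reading of "word-level guidance information," not a confirmed detail of the paper.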