ISCA Archive Interspeech 2023

Joint Time and Frequency Transformer for Chinese Opera Classification

Qiang Li, Beibei Hu

The Transformer has recently gained attention and is widely used in audio tasks. Most approaches compute attention directly over the entire time-frequency space or only along the temporal dimension. This paper presents a joint time and frequency model for Chinese opera classification. A shallow convolutional block extracts localized low-level semantic features and reduces the feature map size. In addition, criss-cross attention and factorised self-attention are employed in the model to extract time and frequency representations. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on a large Chinese opera dataset with fewer model parameters.
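The core idea of factorised self-attention over a time-frequency feature map can be illustrated with a minimal sketch: attend along the time axis for each frequency bin, then along the frequency axis for each time step, instead of over the full T×F grid at once. This is an illustrative toy with identity Q/K/V projections and a single head, not the authors' implementation (a real model would use learned projection matrices and multiple heads).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., L, C) — scaled dot-product attention over the L axis,
    # with identity projections standing in for learned Q, K, V weights
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def factorised_tf_attention(x):
    # x: (T, F, C) time-frequency feature map (hypothetical shapes)
    # 1) attend along time independently for each frequency bin
    t_out = self_attention(np.swapaxes(x, 0, 1))   # (F, T, C)
    t_out = np.swapaxes(t_out, 0, 1)               # back to (T, F, C)
    # 2) attend along frequency independently for each time step
    return self_attention(t_out)                   # (T, F, C)

x = np.random.default_rng(0).standard_normal((8, 16, 32))
y = factorised_tf_attention(x)
print(y.shape)  # (8, 16, 32)
```

Factorising this way reduces attention cost from O((T·F)²) for full time-frequency attention to O(T²·F + F²·T), which is one reason such models can match full-attention performance with fewer resources.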