Transformers have recently gained increasing attention and are widely used in audio tasks. However, most existing approaches compute attention either directly over the entire time-frequency space or only along the temporal dimension. This paper presents a joint time and frequency model for Chinese opera classification. A shallow convolutional block extracts localized low-level semantic features and reduces the feature-map size. Criss-cross attention and factorised self-attention are then employed to extract representations along both the time and frequency dimensions. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on a large Chinese opera dataset with fewer model parameters.
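To illustrate the idea of attending along the time and frequency axes separately rather than over the full time-frequency grid, the following is a minimal sketch of factorised self-attention in PyTorch. It is not the paper's exact architecture; the module name, head count, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorisedSelfAttention(nn.Module):
    """Illustrative factorised self-attention: one attention pass along the
    time axis and one along the frequency axis of a
    (batch, time, freq, channels) feature map, instead of attending over
    all time*freq tokens jointly. Hyperparameters are placeholders, not
    the paper's configuration."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f, c = x.shape
        # Attend along time: treat each frequency bin as its own sequence.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt = xt + self.time_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(b, f, t, c).permute(0, 2, 1, 3)
        # Attend along frequency: treat each time frame as its own sequence.
        xf = x.reshape(b * t, f, c)
        xf = xf + self.freq_attn(xf, xf, xf, need_weights=False)[0]
        return xf.reshape(b, t, f, c)

# Toy input: batch of 2, 32 time frames, 16 frequency bins, 64 channels.
x = torch.randn(2, 32, 16, 64)
y = FactorisedSelfAttention(dim=64)(x)
print(y.shape)  # torch.Size([2, 32, 16, 64])
```

Each axial pass costs attention over sequences of length t or f rather than t*f, which is where the parameter and compute savings over full time-frequency attention come from.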