ISCA Archive Interspeech 2023

TridentSE: Guiding Speech Enhancement with 32 Global Tokens

Dacheng Yin, Zhiyuan Zhao, Chuanxin Tang, Zhiwei Xiong, Chong Luo

This paper presents TridentSE, a novel speech enhancement architecture that efficiently combines local details with global information. The architecture uses a time-frequency bin level representation to capture detailed information and a small number of global tokens to process global information. Cross attention modules transfer information between the local and global representations, and the global tokens are split into two groups that process inter-frame and intra-frame information. A metric discriminator is used to increase perceptual quality and to improve performance over previous speech enhancement methods. With lower computational cost, TridentSE achieves a PESQ of 3.47 on the VoiceBank+DEMAND dataset and a PESQ of 3.44 on the DNS no-reverb test set, outperforming most previous methods. Visualization shows that the global tokens exhibit diverse and interpretable global patterns.
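The exchange between the local and global representations can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, adds residual connections as a guess, and does not model the split of the tokens into inter-frame and intra-frame groups. The names `cross_attention` and `exchange` and all sizes other than the 32 tokens are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross attention.

    Each query is replaced by a softmax-weighted average of the context
    vectors. Learned projections are omitted (an assumption made to keep
    the sketch short).
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ci for qi, ci in zip(q, c)) for c in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[k] for w, c in zip(weights, context)) for k in range(dim)]
        )
    return outputs


def exchange(local_feats, global_tokens):
    """One local/global exchange; the residual additions are an assumption."""
    # Gather: each global token reads from every time-frequency bin.
    gathered = cross_attention(global_tokens, local_feats)
    new_tokens = [[g + u for g, u in zip(tok, upd)]
                  for tok, upd in zip(global_tokens, gathered)]
    # Broadcast: each bin reads back from the updated global tokens.
    broadcast = cross_attention(local_feats, new_tokens)
    new_local = [[x + u for x, u in zip(feat, upd)]
                 for feat, upd in zip(local_feats, broadcast)]
    return new_local, new_tokens


rng = random.Random(0)
# Toy sizes; only the token count of 32 comes from the paper's title.
NUM_FRAMES, NUM_BINS, DIM, NUM_TOKENS = 6, 5, 8, 32
local_feats = [[rng.gauss(0, 1) for _ in range(DIM)]
               for _ in range(NUM_FRAMES * NUM_BINS)]
global_tokens = [[rng.gauss(0, 1) for _ in range(DIM)]
                 for _ in range(NUM_TOKENS)]
new_local, new_tokens = exchange(local_feats, global_tokens)
```

In this sketch each direction computes one attention score per (bin, token) pair, so the cost grows linearly with the number of time-frequency bins instead of quadratically, as it would if every bin attended to every other bin.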