ISCA Archive Interspeech 2022

Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation

Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li

Recent studies indicate that deep learning (DL) based methods are effective for acoustic echo cancellation (AEC) in the presence of background noise and nonlinear distortion. However, variations in content and speaker degrade the performance of such DL-based AEC models. In this study, we propose an AEC model that takes phonetic and speaker identity features as auxiliary inputs, and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identity features extracted by a contrastive predictive coding network, a self-supervised pretraining model, and the complex spectrum generated by the short-time Fourier transform are treated as the spectrum pattern inputs to DPCTNet. In addition, DPCTNet adopts an encoder-decoder architecture improved by inserting a dual-path transformer, which models both the extracted inputs within a single frame and the dependence between consecutive frames. Comparative experimental results show that AEC performance can be improved by explicitly considering phonetic and speaker identity features.
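
As a rough illustration of the input construction and the dual-path idea described in the abstract, the following NumPy sketch computes a complex spectrum, combines it with per-frame auxiliary features, and forms the two views that a dual-path model would process (within a frame and across frames). This is not the authors' implementation: the abstract does not specify how the features are combined, so simple concatenation along the feature axis is an assumption, as are the function names (`stft`, `build_inputs`, `dual_path_views`), the frame length, the hop size, and the use of random values as a stand-in for contrastive predictive coding features.

```python
import numpy as np


def stft(signal, frame_len=320, hop=160):
    """Complex short-time Fourier transform with a Hann window.

    Returns an array of shape (frames, freq_bins). The frame length and
    hop size are illustrative assumptions, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def build_inputs(mic, aux_features):
    """Combine the complex spectrum with per-frame auxiliary features.

    aux_features has shape (frames, aux_dim) and is a hypothetical
    stand-in for phonetic and speaker identity features. Concatenation
    along the feature axis is an assumption made for this sketch.
    """
    spec = stft(mic)
    n = min(spec.shape[0], aux_features.shape[0])
    return np.concatenate(
        [spec.real[:n], spec.imag[:n], aux_features[:n]], axis=-1
    )


def dual_path_views(x):
    """Return the two sequence views used by a dual-path model.

    intra: each frame treated as a sequence over feature bins.
    inter: each feature bin treated as a sequence over frames.
    """
    intra = x[:, :, None]    # (frames, features, 1)
    inter = x.T[:, :, None]  # (features, frames, 1)
    return intra, inter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mic = rng.standard_normal(16000)      # one second at an assumed 16 kHz
    aux = rng.standard_normal((99, 64))   # placeholder auxiliary features
    x = build_inputs(mic, aux)
    intra, inter = dual_path_views(x)
    print(x.shape, intra.shape, inter.shape)
```

In a full model, a transformer block would be applied to each view in turn, so that one pass captures structure inside a frame and the other captures dependence between consecutive frames.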