ISCA Archive Interspeech 2022

Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network

Wenjing Liu, Chuan Xie

Target speaker extraction aims to extract the target speaker's voice from mixed utterances based on auxiliary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. Most existing works perform this fusion with a simple operation, namely concatenation. However, this naive approach, which directly fuses the speaker embedding into the acoustic representation, may not effectively explore the potential cross-modal correlation. In this work, we propose a gated convolutional fusion approach that explores global conditional modeling and a trainable gating mechanism to learn sophisticated interactions between the speaker embedding and the acoustic representation. Experiments on the WSJ0-2mix-extr dataset demonstrate the efficacy of the proposed fusion approach, which performs favorably against other fusion methods, with considerable improvements in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.
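To make the contrast between the two fusion styles concrete, the following is a minimal NumPy sketch, not the network proposed in this paper. It assumes a WaveNet-style gated activation with global conditioning, in which the speaker embedding is projected and added to both a filter branch and a gate branch at every frame. The function names, the parameter dictionary, the layer sizes, and the use of 1x1 convolutions (written as matrix products over the channel axis) are all illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def concat_fusion(acoustic, speaker_emb):
    """Baseline: tile the embedding over time and stack it on the channels.

    acoustic: (channels, frames), speaker_emb: (emb_dim,)
    returns:  (channels + emb_dim, frames)
    """
    frames = acoustic.shape[1]
    tiled = np.repeat(speaker_emb[:, None], frames, axis=1)
    return np.concatenate([acoustic, tiled], axis=0)


def init_params(channels, emb_dim, out_channels, seed=0):
    """Hypothetical random weights; in practice these would be trained."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "W_f": scale * rng.standard_normal((out_channels, channels)),
        "V_f": scale * rng.standard_normal((out_channels, emb_dim)),
        "W_g": scale * rng.standard_normal((out_channels, channels)),
        "V_g": scale * rng.standard_normal((out_channels, emb_dim)),
    }


def gated_conv_fusion(acoustic, speaker_emb, params):
    """Assumed gated fusion: tanh(filter) * sigmoid(gate).

    Each branch is a 1x1 convolution of the acoustic features plus a
    projection of the speaker embedding that is broadcast over all frames
    (the global condition).
    """
    filt = params["W_f"] @ acoustic + (params["V_f"] @ speaker_emb)[:, None]
    gate = params["W_g"] @ acoustic + (params["V_g"] @ speaker_emb)[:, None]
    return np.tanh(filt) * sigmoid(gate)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acoustic = rng.standard_normal((4, 10))  # 4 channels, 10 frames
    speaker_emb = rng.standard_normal(3)     # 3-dimensional embedding
    params = init_params(channels=4, emb_dim=3, out_channels=5, seed=1)
    print(concat_fusion(acoustic, speaker_emb).shape)
    print(gated_conv_fusion(acoustic, speaker_emb, params).shape)
```

In this sketch, concatenation only places the embedding next to the acoustic features, whereas the gated variant lets the embedding modulate, per channel, how much of each filtered feature is passed on.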