ISCA Archive Interspeech 2023

CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement

Nursadul Mamun, John H. L. Hansen

It is widely known that multi-speaker babble noise greatly degrades speech intelligibility. However, suppressing noise without introducing artifacts into the speech is challenging at low signal-to-noise ratios (SNRs), and even more so when the noise is speech-like, as babble noise is. Deep learning-based systems either enhance the magnitude response and reuse the distorted noisy phase, or enhance the complex spectrogram directly. The frequency transformation block (FTB) has emerged as a useful architecture for implicitly capturing harmonic correlation, which is especially important for listeners with hearing loss (hearing aid and cochlear implant users). This study proposes a complex-valued frequency transformation network (CFTNet) for speech enhancement, which leverages both a complex-valued U-Net and FTB to capture sufficient low-level contextual information. The proposed system learns a complex transformation matrix to accurately recover speech in the time-frequency domain from a noisy spectrogram. Experimental results demonstrate that the proposed system achieves significant improvements over state-of-the-art networks in both seen and unseen noise conditions. Furthermore, the proposed CFTNet can suppress highly nonstationary noise without creating the musical artifacts commonly observed with conventional enhancement methods.
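
The distinction drawn above between enhancing only the magnitude (while reusing the noisy phase) and enhancing the complex spectrogram can be illustrated with a small toy sketch. This is not the CFTNet model: the bin values are made up, the function names are hypothetical, and the "oracle" targets are computed from the clean signal purely for illustration, whereas a real system would have to estimate them from the noisy input.

```python
import cmath

def magnitude_only_enhance(noisy, target_mag):
    """Set each bin to the target magnitude but keep the noisy phase."""
    return [cmath.rect(m, cmath.phase(y)) for y, m in zip(noisy, target_mag)]

def complex_mask_enhance(noisy, mask):
    """Multiply each bin by a complex mask, changing magnitude and phase."""
    return [m * y for y, m in zip(noisy, mask)]

def total_error(estimate, reference):
    """Sum of absolute complex differences over all bins."""
    return sum(abs(e - r) for e, r in zip(estimate, reference))

# Toy "spectrogram": three time-frequency bins (hypothetical values).
clean = [1 + 1j, 0.5 - 2j, -1.5 + 0.25j]
noise = [0.3 - 0.8j, -1 + 0.5j, 0.7 + 0.9j]
noisy = [s + n for s, n in zip(clean, noise)]

# Oracle targets derived from the clean signal (illustration only).
oracle_mag = [abs(s) for s in clean]
oracle_mask = [s / y for s, y in zip(clean, noisy)]

mag_estimate = magnitude_only_enhance(noisy, oracle_mag)
cplx_estimate = complex_mask_enhance(noisy, oracle_mask)

mag_error = total_error(mag_estimate, clean)
cplx_error = total_error(cplx_estimate, clean)
print(f"magnitude-only error: {mag_error:.3f}")
print(f"complex-mask error:   {cplx_error:.3f}")
```

Even with a perfect magnitude target, the magnitude-only estimate retains a residual error because the noisy phase is reused, while a complex-valued mask can in principle correct both magnitude and phase.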