ISCA Archive Interspeech 2022

Improve Speech Enhancement using Perception-High-Related Time-Frequency Loss

Ding Zhao, Zhan Zhang, Bin Yu, Yuehai Wang

Commonly used speech enhancement (SE) training losses, such as the mean absolute error (MAE) loss and the short-time Fourier transform (STFT) loss, are poorly matched to perceptual speech quality, which leads to suboptimal training results. To tackle this problem, we propose a new loss named the perception-high-related time-frequency (PHRTF) loss. The proposed loss extends the STFT loss with a trainable module named the perceptual spectrum mask predictor (PSMP), which predicts a perceptual spectrum mask (PSM) from the magnitude spectra of the enhanced and clean speech. The PHRTF loss then multiplies the amplitude error spectrum (AES) by the PSM to emphasize perception-relevant loss components, so that the loss correlates strongly with speech quality. We conduct experiments on the VoiceBank-DEMAND dataset, and the results show that the PHRTF loss correlates significantly more strongly with speech quality than other losses. Moreover, when training Wave-U-Net, the PHRTF loss outperforms the other losses, improving PESQ by 0.32 over the MAE loss and by 0.19 over the STFT loss. We also apply the PHRTF loss to a more advanced SE model, and the resulting system outperforms other competitive baselines.
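The abstract describes the loss only at a high level, but the mechanism (an amplitude error spectrum weighted by a learned perceptual mask) can be sketched. Below is a minimal PyTorch illustration of how such a masked time-frequency loss could be assembled; the PSMP architecture, its layer sizes, the STFT parameters, and the function names are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn

class PSMP(nn.Module):
    """Hypothetical perceptual spectrum mask predictor (PSMP).

    The paper does not specify the module's architecture here; this
    stand-in maps the enhanced and clean magnitude spectra to a
    per-bin mask in (0, 1)."""
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_bins),
            nn.Sigmoid(),  # mask values in (0, 1)
        )

    def forward(self, mag_enh, mag_clean):
        # mag_*: (batch, frames, n_bins) magnitude spectra
        return self.net(torch.cat([mag_enh, mag_clean], dim=-1))

def phrtf_loss(enh_wave, clean_wave, psmp, n_fft=512, hop=128):
    """Sketch of a PHRTF-style loss: the amplitude error spectrum (AES)
    is weighted by the predicted perceptual spectrum mask (PSM)."""
    window = torch.hann_window(n_fft, device=enh_wave.device)

    def mag(x):
        # (batch, samples) -> (batch, frames, n_bins) magnitude spectrum
        return torch.stft(x, n_fft, hop, window=window,
                          return_complex=True).abs().transpose(1, 2)

    mag_enh, mag_clean = mag(enh_wave), mag(clean_wave)
    aes = (mag_enh - mag_clean).abs()   # amplitude error spectrum
    psm = psmp(mag_enh, mag_clean)      # perceptual spectrum mask
    return (psm * aes).mean()           # emphasize perception-relevant errors
```

In this sketch the mask re-weights the plain magnitude error bin by bin, which is the stated intent of the PHRTF loss; how the PSMP itself is trained to track perceptual quality is described in the full paper, not reproduced here.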