ISCA Archive Interspeech 2022
ISCA Archive Interspeech 2022

Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement

Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki

Recent speech enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech enhancement method by estimating the magnitude and phase of a complex adaptive Wiener filter. In this method, a noise-robust vector-quantized variational autoencoder is utilized for estimating the magnitude Wiener filter by using the Itakura-Saito divergence on time-frequency domain, while the phase of the Wiener filter is estimated by a convolutional recurrent network using the scale-invariant signal-to-noise ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech enhancement studies and achieved the PESQ score of 2.85 and STOI score of 0.94, which is better than the state-of-art method based on cIRM estimation in the 2020 Deep Noise Challenge.