ISCA Archive Interspeech 2022
ISCA Archive Interspeech 2022

Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech

Yixuan Zhang, Heming Wang, DeLiang Wang

Estimating fundamental frequency (F0) from an audio signal is a necessary step in many tasks such as speech synthesis and speech analysis. Although high estimation accuracy has been achieved for clean speech, it is still challenging for F0 estimation to handle noisy speech, mainly because of the corruption of harmonic structure caused by noise. In this paper, we view F0 estimation as a multi-class classification problem and train a frequency-domain densely-connected convolutional neural network (DC-CRN) to estimate F0 from noisy speech. The proposed model significantly outperforms baseline methods in terms of detection rate. We find that using complex short-time Fourier transform (STFT) as input produces better performance compared to using magnitude STFT as input. Furthermore, we explore improving F0 estimation with speech enhancement. Although the F0 estimation model trained on clean speech performs well on enhanced speech, the distortion introduced by the speech enhancement model limits the estimation performance. We propose a cascade model which consists of two modules that optimize enhanced speech and estimated F0 in turn. Experimental results show that the cascade model brings further improvements to the DC-CRN model, especially in low signal-to-noise ratio (SNR) conditions.