ISCA Archive Interspeech 2023

FusedF0: Improving DNN-based F0 Estimation by Fusion of Summary-Correlograms and Raw Waveform Representations of Speech Signals

Eray Eren, Lee Ngee Tan, Abeer Alwan

DSP-based fundamental frequency (F0) estimation algorithms, such as the multi-band summary-correlogram (MBSC) algorithm, are robust to noisy speech. Recent studies show that DNNs that map raw waveform segments directly to F0 estimates can outperform DSP-based methods. However, the generalization and noise robustness of these DNNs have not been fully addressed. We propose a hybrid DSP- and DNN-based approach to F0 estimation. Key contributions include: (a) a modified version of MBSC that is substantially faster than the original algorithm while maintaining the accuracy of its F0 estimates; (b) a method for fusing DSP features with raw waveform representations in a DNN architecture to obtain noise-robust F0 estimation; and (c) a demonstration that auxiliary DSP features improve generalization with a relatively small number of DNN parameters. On the PTDB-TUG database, the proposed algorithm outperforms the MBSC and CREPE DNN baselines (including optimized versions) on clean speech and on noisy speech at 20, 10, and 0 dB SNR.
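To make the evaluation setting concrete, the following minimal sketch mixes noise into a signal at a target SNR and estimates F0 from the peak of a single-band autocorrelation. It is a simplified stand-in for illustration only: it is not MBSC, not CREPE, and not the proposed fusion network. The function names `add_noise_at_snr` and `autocorr_f0`, the white Gaussian noise, the synthetic 200 Hz tone, the 16 kHz sample rate, and the 50 to 400 Hz search range are all assumptions that do not come from the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Return signal plus white Gaussian noise scaled to the requested SNR (hypothetical helper)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the gain so that 10*log10(signal_power / (gain^2 * noise_power)) == snr_db.
    gain = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [x + gain * n for x, n in zip(signal, noise)]


def autocorr_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 in Hz as sample_rate / lag at the autocorrelation peak (hypothetical helper)."""
    min_lag = int(sample_rate / f0_max)  # shortest period considered
    max_lag = int(sample_rate / f0_min)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation; fewer terms at longer lags favors the first peak.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Assumed toy setup: a 40 ms frame of a 200 Hz sine sampled at 16 kHz.
SAMPLE_RATE = 16000
TRUE_F0 = 200.0
clean = [math.sin(2.0 * math.pi * TRUE_F0 * n / SAMPLE_RATE) for n in range(640)]

rng = random.Random(0)  # fixed seed so the noise is reproducible
for snr_db in (20, 10, 0):
    noisy = add_noise_at_snr(clean, snr_db, rng)
    print(snr_db, "dB SNR ->", autocorr_f0(noisy, SAMPLE_RATE), "Hz")
```

A pure tone is far easier than real speech, so this toy says nothing about the relative accuracy of the methods compared in the paper. It only illustrates what "F0 estimation at a given SNR" means operationally.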