ISCA Archive Interspeech 2024

YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation

Xuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang

Pitch estimation is of fundamental importance in audio processing and music information retrieval. YOLO is a well-developed model designed for object detection in images. Here we introduce YOLOv7 into the pitch estimation task and improve it by adding a time-frequency (TF) dual-branch structure to the model, motivated by human auditory pitch perception. An additional advantage of the model over state-of-the-art (SOTA) models is that it achieves joint pitch estimation and voicing determination simply by adding an unvoiced class, without a separate voiced/unvoiced detection step. Experiments show that, for both music and speech, the proposed TF dual-branch boosts pitch estimation accuracy over the backbone. Our model exhibits superior pitch estimation performance over the SOTA and shows minimal performance degradation in noisy conditions. The overall accuracy on the MDB-stem-synth dataset peaks at 99.4%, and the voicing determination F-score reaches 99.9%.
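The idea of handling voicing through one extra class can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the placement of the unvoiced class at the last index, and the bin-to-frequency mapping (20-cent bins starting at a hypothetical 32.70 Hz) are all assumptions made for illustration only.

```python
from typing import List, Tuple


def bin_to_hz(k: int, fmin: float = 32.70, cents_per_bin: float = 20.0) -> float:
    """Map a pitch-bin index to Hz (hypothetical mapping, assumed for this sketch)."""
    return fmin * 2.0 ** (k * cents_per_bin / 1200.0)


def decode_frames(scores: List[List[float]]) -> List[Tuple[bool, float]]:
    """Decode per-frame class scores into (voiced, f0_hz) pairs.

    Each frame holds scores for the pitch bins followed by one extra
    "unvoiced" class (assumed here to be the last index). A single argmax
    yields both the voicing decision and the pitch estimate.
    """
    decoded = []
    for frame in scores:
        unvoiced_idx = len(frame) - 1
        best = max(range(len(frame)), key=frame.__getitem__)
        if best == unvoiced_idx:
            decoded.append((False, 0.0))  # unvoiced frame: no pitch reported
        else:
            decoded.append((True, bin_to_hz(best)))
    return decoded


if __name__ == "__main__":
    # Two toy frames with two pitch bins plus the unvoiced class.
    toy_scores = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]
    for voiced, f0 in decode_frames(toy_scores):
        print(voiced, round(f0, 2))
```

Because the voicing decision falls out of the same argmax as the pitch bin, no second classifier or threshold is needed in this sketch, which is the property the abstract highlights.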