ISCA Archive Interspeech 2025

WhisperMSS: A Two-Stage Framework for Mandarin Singing Transcription and Segmentation Using Pretrained Models

Ruoxuan Liang, Xiangjian Zeng, Zhen Liu, Qingqiang Wu, RuiChen Zhang, Le Ren

This paper addresses the challenges of Mandarin singing transcription and segmentation by proposing a two-stage framework based on Whisper. Our research makes three key contributions. First, we enhance transcription accuracy via WhisperMLT, which incorporates a Chinese-specific text embedding layer, a CTC branch atop the encoder, and a Transformer-based contextual network. Second, we optimize CTC posterior probabilities through syllable-aligned pseudo-labeling, which generates one-hot frame-level labels from timestamp-annotated datasets. Finally, we achieve precise segmentation with CTC-Vseg, which implements silence label insertion, constrained state transitions, and dynamic programming-based path optimization. Experiments demonstrate superior performance in Mandarin singing segmentation, offering novel solutions for audio processing tasks.
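
To illustrate the syllable-aligned pseudo-labeling idea, the following is a minimal sketch of how one-hot frame-level labels might be derived from timestamp annotations. It is not the authors' implementation: the function name `frame_labels`, the 20 ms frame hop, the toy vocabulary size, and the use of id 0 for frames not covered by any syllable are all assumptions made for illustration.

```python
from typing import List, Tuple


def frame_labels(
    syllables: List[Tuple[float, float, int]],  # (start_sec, end_sec, token_id)
    num_frames: int,
    frame_sec: float = 0.02,  # assumed frame hop (20 ms), not taken from the paper
    vocab_size: int = 5,      # assumed toy vocabulary size
    blank_id: int = 0,        # assumed id for frames outside every syllable
) -> List[List[int]]:
    """Return one one-hot row per frame, built from syllable timestamps."""
    ids = [blank_id] * num_frames
    for start, end, tok in syllables:
        # Convert times to frame indices and clamp them to the valid range.
        first = max(0, int(round(start / frame_sec)))
        last = min(num_frames, int(round(end / frame_sec)))
        for t in range(first, last):
            ids[t] = tok
    # Expand each frame's token id into a one-hot vector.
    return [[1 if v == i else 0 for v in range(vocab_size)] for i in ids]


# Hypothetical annotation: two syllables separated by a short gap.
labels = frame_labels([(0.0, 0.04, 2), (0.06, 0.10, 3)], num_frames=6)
frame_ids = [row.index(1) for row in labels]
print(frame_ids)
```

In this sketch each frame receives exactly one label, so the rows could serve as frame-level targets for a CTC branch; how the paper actually handles frames between syllables and overlapping timestamps is not stated in the abstract.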