This paper addresses the challenges of Mandarin singing transcription and segmentation by proposing a two-stage framework based on Whisper. Our research makes three key contributions. First, we enhance transcription accuracy with WhisperMLT, which incorporates a Chinese-specific text embedding layer, a CTC branch atop the encoder, and a Transformer-based contextual network. Second, we optimize the CTC posterior probabilities through syllable-aligned pseudo-labeling, which generates one-hot frame-level labels from timestamp-annotated datasets. Finally, we achieve precise segmentation with CTC-Vseg, which combines silence label insertion, constrained state transitions, and dynamic programming-based path optimization. Experiments demonstrate superior performance in Mandarin singing segmentation, offering a new approach for related audio processing tasks.
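To make the segmentation stage concrete, the following is a minimal sketch of a Viterbi-style alignment over frame-level posteriors with optional silence states inserted between syllables. It is an illustration under stated assumptions, not the CTC-Vseg implementation: the function name `segment_with_silence`, the parameter `sil_id`, the exact transition rules (stay, advance by one state, or skip an optional silence), and the input layout (a `T x V` array of log posteriors) are all hypothetical choices made for this example.

```python
import numpy as np

def segment_with_silence(log_post, syllable_ids, sil_id=0):
    """Align a known syllable sequence to frames (illustrative sketch only).

    log_post:     (T, V) array of frame-level log posteriors (assumed layout).
    syllable_ids: non-empty list of label indices, in sung order.
    sil_id:       index of the silence label (hypothetical parameter).
    Returns a list of (label, start_frame, end_frame), end exclusive.
    """
    # Silence label insertion: sil, s1, sil, s2, ..., sN, sil.
    # Even positions are optional silences, odd positions are syllables.
    states = [sil_id]
    for s in syllable_ids:
        states += [s, sil_id]

    T, S = log_post.shape[0], len(states)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)

    # A path may start in the leading silence or directly in the first syllable.
    score[0, 0] = log_post[0, states[0]]
    score[0, 1] = log_post[0, states[1]]

    for t in range(1, T):
        for j in range(S):
            # Constrained transitions: stay, or advance by one state.
            cands = [(score[t - 1, j], j)]
            if j >= 1:
                cands.append((score[t - 1, j - 1], j - 1))
            # A syllable may also be entered by skipping the silence before it.
            if j >= 2 and j % 2 == 1:
                cands.append((score[t - 1, j - 2], j - 2))
            best, prev = max(cands)
            score[t, j] = best + log_post[t, states[j]]
            back[t, j] = prev

    # A path may end in the last syllable or in the trailing silence.
    j = S - 1 if score[T - 1, S - 1] >= score[T - 1, S - 2] else S - 2

    # Backtrace the best state for every frame.
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t, j]
        path.append(j)
    path.reverse()

    # Merge consecutive frames that share a state into segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[t - 1]:
            segments.append((states[path[start]], start, t))
            start = t
    return segments
```

The sketch assumes at least as many frames as syllables. Frame indices in the returned segments can be converted to seconds by multiplying by the frame hop of whichever encoder produced the posteriors.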