ISCA Archive Interspeech 2024

Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training

Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

Sequence-to-sequence voice conversion (seq2seq VC) can be trained without pre-aligning source and target speech lengths, but non-monotonic attention matrices may cause unnatural conversion. VC-T, a streaming VC model based on the neural transducer, guarantees monotonic alignments and outperforms seq2seq VC. However, VC-T requires guiding alignments, and its training is time-consuming because of large tensor computations. The guiding alignments are generated from manually annotated phoneme labels, which is labor-intensive, yet they are essential for pruning improbable paths and thus stabilizing VC-T training. This work proposes a two-stage VC-T training pipeline for fast convergence: 1) pre-training VC-T to learn probable paths, which form a matrix optimized with an L1 loss, and 2) fine-tuning the pre-trained VC-T so that it outputs a probable tensor from the start. This enables VC-T training without guiding alignments. Experiments show that our pipeline achieves superior streaming VC while significantly reducing training time compared with the conventional VC-T.
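The shape distinction between the two stages can be illustrated with a minimal, hypothetical sketch (not the authors' code): in pre-training the encoder and prediction states are combined only along a single assumed monotonic path, giving a matrix-sized output trained with L1 loss, while fine-tuning expands to the full transducer lattice tensor. All module names, shapes, and the diagonal path below are assumptions for illustration only.

```python
# Hypothetical sketch of the matrix (pre-training) vs. tensor (fine-tuning)
# outputs in a transducer-style VC model; not the paper's implementation.
import torch
import torch.nn as nn

T, U, D = 100, 80, 256          # source frames, target frames, hidden size (assumed)
MEL = 80                        # mel-spectrogram dimension (assumed)

encoder = nn.Linear(MEL, D)     # stand-ins for the real encoder,
prediction = nn.Linear(MEL, D)  # prediction network, and
joint = nn.Linear(D, MEL)       # joint network

src = torch.randn(T, MEL)       # source-speaker features
tgt = torch.randn(U, MEL)       # target-speaker features

enc = encoder(src)              # (T, D)
pred = prediction(tgt)          # (U, D)

# Stage 1: pre-training. Combine states only along one probable monotonic
# path (here, a simple linear time warp), so the output is a (U, MEL) matrix
# that can be optimized directly with L1 loss.
path = torch.linspace(0, T - 1, U).long()            # assumed diagonal path
matrix_out = joint(enc[path] + pred)                 # (U, MEL)
pretrain_loss = nn.functional.l1_loss(matrix_out, tgt)

# Stage 2: fine-tuning. Expand to the full (T, U, MEL) lattice tensor and
# apply the transducer objective (omitted here), starting from weights that
# already concentrate probability mass near monotonic paths.
tensor_out = joint(enc.unsqueeze(1) + pred.unsqueeze(0))  # (T, U, MEL)
```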