ISCA Archive Interspeech 2023

Improving WaveRNN with Heuristic Dynamic Blending for Fast and High-Quality GPU Vocoding

Muyang Du, Chuan Liu, Jiaxing Qi, Junjie Lai

Auto-regressive vocoders are typically less efficient at inference because of their serial nature, which makes it difficult to fully utilize graphics processing units (GPUs). In this context, batched inference with upsampled feature folding can be used to speed up vocoding. However, the speech quality degradation caused by blending the folded waveform segments makes this technique hard to apply in production. To address this issue, we propose a novel blending approach called heuristic dynamic blending (HDB), which mitigates the voice trembling and echo artifacts of conventional static blending. We also propose a parallel HDB algorithm that runs on GPUs and significantly reduces the additional time overhead introduced by the naive HDB algorithm. Experimental results demonstrate that a multi-band WaveRNN with HDB can effectively improve parallelism for real-time GPU vocoding while maintaining high speech quality comparable to non-folding inference.
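To make the folding and blending terminology concrete, the following is a minimal NumPy sketch of the conventional static blending baseline only, not of HDB. The function names `fold` and `static_blend`, the linear cross-fade window, and the segment and overlap sizes are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def fold(x, seg_len, overlap):
    """Split a 1-D signal into overlapping segments (hypothetical helper).

    Each segment holds seg_len + overlap samples and consecutive
    segments share `overlap` samples. The tail is zero-padded.
    """
    assert overlap > 0 and seg_len > 0
    n_seg = max(1, int(np.ceil((len(x) - overlap) / seg_len)))
    total = n_seg * seg_len + overlap
    padded = np.pad(x, (0, total - len(x)))
    return np.stack(
        [padded[i * seg_len : i * seg_len + seg_len + overlap] for i in range(n_seg)]
    )


def static_blend(segments, overlap):
    """Unfold segments with a fixed linear cross-fade (assumed window shape)."""
    n_seg, width = segments.shape
    hop = width - overlap
    out = np.zeros(n_seg * hop + overlap)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for i, seg in enumerate(segments):
        seg = seg.astype(float)  # astype returns a copy, input is untouched
        if i > 0:
            seg[:overlap] *= fade_in      # ramp up at the segment head
        if i < n_seg - 1:
            seg[-overlap:] *= fade_out    # ramp down at the segment tail
        out[i * hop : i * hop + width] += seg
    return out


# Sanity check: when overlapping samples agree, blending is lossless.
signal = np.sin(np.linspace(0.0, 20.0, 1000))
batch = fold(signal, seg_len=300, overlap=50)   # shape (4, 350)
restored = static_blend(batch, overlap=50)[: len(signal)]
```

In this sketch the overlapping samples are identical, so the cross-fade reconstructs the input exactly. In folded vocoder inference each segment would be generated separately, so the overlapping regions generally differ, and a fixed cross-fade of mismatched waveforms is where artifacts of the kind described above can arise.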