ISCA Archive Interspeech 2022

Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU

Hiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda

In this work, we propose a fast, high-quality neural vocoder for CPU implementation. The main approaches to fast inference with an autoregressive model are 1) subband-based vocoding and 2) multi-sample prediction. Our previous work demonstrated that combining the two works well for simultaneous generation of up to two samples without quality degradation. To further increase the number of simultaneously generated samples while maintaining quality, we focus on the association between subband signals and multiple samples. Our proposed vocoder jointly models these associations with a multivariate Gaussian. Experiments show that our proposed four-sample vocoder is 1.47 times faster than the conventional two-sample equivalent. For both acoustic features extracted from natural speech and those predicted by TTS, the proposed method generates up to four samples simultaneously without any significant degradation in naturalness. Its naturalness is also comparable to that of the conventional two-sample method.
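As a rough illustration of what jointly modeling subband signals and multiple samples with a multivariate Gaussian can look like, the following minimal sketch draws one vector covering every subband and every simultaneously generated sample from a single Gaussian with a full covariance. It is not the authors' implementation. The number of subbands, the helper name `sample_joint`, and the randomly filled mean and Cholesky factor are assumptions made for illustration; in an actual vocoder these parameters would be predicted by the network at each autoregressive step.

```python
import numpy as np


def sample_joint(mean, chol, rng):
    """Draw x = mean + L z with z ~ N(0, I), so that x ~ N(mean, L L^T)."""
    z = rng.standard_normal(mean.shape[0])
    return mean + chol @ z


# Hypothetical sizes: 4 subbands (an assumption), 4 samples per step.
n_subbands, n_samples = 4, 4
dim = n_subbands * n_samples

rng = np.random.default_rng(0)

# Placeholder parameters standing in for network outputs (assumption).
mean = np.zeros(dim)
chol = np.tril(rng.standard_normal((dim, dim)) * 0.1)
# Force a strictly positive diagonal so that L L^T is a valid covariance.
chol[np.diag_indices(dim)] = np.abs(np.diag(chol)) + 0.5

# One draw yields all subbands and all samples at once; the off-diagonal
# entries of L are what couple them.
x = sample_joint(mean, chol, rng)
frame = x.reshape(n_subbands, n_samples)  # rows: subbands, columns: samples
```

Setting the off-diagonal entries of `chol` to zero would reduce the sketch to independent per-element Gaussians, which is the case where associations between subbands and samples are ignored.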