ISCA Archive Interspeech 2022

Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU

Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail Kudinov, Jiansheng Wei

Recently, score-based diffusion probabilistic modeling has shown encouraging results on various tasks, outperforming other popular generative modeling frameworks in terms of quality. However, to unlock its potential and make diffusion models practical, special effort is needed to enable a more efficient iterative sampling procedure on CPU devices. In this paper, we focus on applying the most promising techniques from the recent literature on diffusion modeling to Grad-TTS, a diffusion-based text-to-speech system, in order to accelerate it. We compare various reverse diffusion sampling schemes, the technique of progressive distillation, GAN-based diffusion modeling, and score-based generative modeling in latent space. Experimental results demonstrate that Grad-TTS can be sped up by a factor of up to 4.5 compared to the vanilla model, achieving a real-time factor of 0.15 on CPU while keeping synthesis quality competitive with that of conventional text-to-speech baselines.
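To make the "iterative sampling procedure" referred to above concrete, the following is a minimal sketch of Euler-Maruyama reverse-SDE sampling for a variance-preserving diffusion, of the kind Grad-TTS-style models iterate at inference time. It is illustrative only: the linear noise schedule, the hyperparameter names (`BETA_MIN`, `BETA_MAX`), and the analytic Gaussian score (a stand-in for a learned score network) are assumptions, not the paper's actual configuration.

```python
import numpy as np

# Illustrative linear noise schedule (values are assumptions, not the paper's).
BETA_MIN, BETA_MAX = 0.05, 20.0

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_bar(t):
    # exp(-integral_0^t beta(s) ds) for the linear schedule above.
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

def analytic_score(x, t, mu0=2.0, var0=0.25):
    # If the data distribution is N(mu0, var0), the forward marginal at time t
    # is Gaussian, so its score is available in closed form. In a real TTS
    # model this would be the output of a trained score network.
    a = alpha_bar(t)
    mean = mu0 * np.sqrt(a)
    var = var0 * a + (1.0 - a)
    return -(x - mean) / var

def reverse_diffusion(n_samples=20000, n_steps=100, seed=0):
    """Integrate the reverse SDE from t=1 to t=0 with Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)  # start from the N(0, 1) prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        b = beta(t)
        # Reverse-SDE drift: f(x, t) - g(t)^2 * score, with f = -0.5*beta*x
        # and g^2 = beta; stepping backward in time flips its sign.
        drift = -0.5 * b * x - b * analytic_score(x, t)
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(n_samples)
    return x

samples = reverse_diffusion()
# Samples should approach the data distribution N(2.0, 0.25).
print(samples.mean(), samples.std())
```

Each of the `n_steps` iterations costs one score evaluation, which is why reducing the step count (via better samplers, progressive distillation, or GAN/latent variants) translates directly into CPU speed-up.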