ISCA Archive Interspeech 2022

Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver

Hideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe

Diffusion generative models, which generate data by the time-reverse dynamics of diffusion processes, have attracted much attention recently and have already been applied in the speech domain, for example to speech waveform synthesis. Diffusion generative models initially had the disadvantage of slow synthesis, but many fast samplers have since been proposed, and this disadvantage is being overcome. The authors have previously proposed an efficient sampler based on a second-order approximation derived from the Itô-Taylor series, with some success. This study further examines the possibility of incorporating third-order terms and experimentally verifies that a vocoder using this method can synthesize high-fidelity fullband (48 kHz) speech signals faster than real time. It is also shown that the method is applicable to speech bandwidth extension from wideband (16 kHz) to fullband (48 kHz).
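The claim of synthesis "faster than real time" is commonly expressed as a real-time factor (RTF): synthesis time divided by the duration of the generated audio, with RTF below 1 meaning faster than real time. The following is a minimal sketch of that arithmetic only. The function name `real_time_factor` and the timing numbers are hypothetical illustrations and are not taken from the paper.

```python
def real_time_factor(synthesis_seconds: float,
                     num_samples: int,
                     sample_rate_hz: int = 48_000) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    Hypothetical helper for illustration; 48 kHz is the fullband
    sampling rate mentioned in the abstract.
    """
    if num_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("num_samples and sample_rate_hz must be positive")
    audio_seconds = num_samples / sample_rate_hz
    return synthesis_seconds / audio_seconds


# Assumed example values (not measurements from the paper):
# 480,000 samples at 48 kHz is 10 s of audio, synthesized in 2.5 s.
rtf = real_time_factor(synthesis_seconds=2.5, num_samples=480_000)
print(f"RTF = {rtf}, faster than real time: {rtf < 1.0}")
```

In practice the synthesis time would be measured around the sampler call, so the RTF depends on the hardware and on the number of solver steps used.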