ISCA Archive Interspeech 2023

LightVoc: An Upsampling-Free GAN Vocoder Based On Conformer And Inverse Short-time Fourier Transform

Dinh Son Dang, Tung Lam Nguyen, Bao Thang Ta, Tien Thanh Nguyen, Thi Ngoc Anh Nguyen, Dang Linh Le, Nhat Minh Le, Van Hai Do

Most neural vocoders based on generative adversarial networks (GANs) rely on iterative upsampling to generate audio sequences from mel-spectrograms and on dilated convolutions to expand their receptive fields. However, iterative upsampling increases the network's complexity and thus slows inference. Moreover, convolutional neural networks are geared towards extracting fine-grained local information and still struggle to capture long-term dependencies. In this work, we propose LightVoc, an efficient and high-quality GAN-based neural vocoder that replaces all upsampling blocks with a stack of Conformer blocks and uses a novel combination of discriminators to generate high-resolution waveforms over the full band. In our experiments on the LJSpeech dataset, LightVoc produces audio quality comparable to HiFi-GAN V1 while being 52.5 times faster in CPU-based inference.
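
To illustrate why an inverse short-time Fourier transform (iSTFT) output stage can remove the need for learned upsampling, the following is a minimal NumPy sketch, not the LightVoc implementation. It assumes the network emits a magnitude and a phase value per frequency bin at the frame rate of the input; the abstract does not specify this, and the names `stft` and `istft_head` as well as the values `n_fft=1024` and `hop=256` are hypothetical choices made for the example. The iSTFT then expands each frame into `hop` new waveform samples through overlap-add, so the time resolution is recovered by a fixed transform rather than by stacked upsampling layers.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Forward STFT. Used here only to fabricate frame-level inputs
    that stand in for a network's predictions (an assumption)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def istft_head(magnitude, phase, n_fft=1024, hop=256):
    """Turn frame-rate magnitude/phase into a waveform by overlap-add.

    Hypothetical output stage: no learned upsampling is involved; each
    frame contributes n_fft windowed samples, shifted by hop samples.
    """
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)  # rebuild the complex spectrogram
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out_len = n_fft + hop * (n_frames - 1)
    audio = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        start = i * hop
        audio[start : start + n_fft] += frames[i]
        norm[start : start + n_fft] += window**2
    # Divide by the summed squared window; the clamp avoids division by
    # zero at the signal edges, where the reconstruction is not exact.
    return audio / np.maximum(norm, 1e-8)
```

With these assumed settings, a signal of 1024 + 256 * 80 samples yields 81 frames of 513 bins each, and passing `np.abs(spec)` and `np.angle(spec)` of those frames to `istft_head` returns a waveform of the original length that matches the input away from the edges. In a vocoder, the magnitude and phase would instead come from the network, conditioned on the mel-spectrogram.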