ISCA Archive SIGUL 2023

Nepali Text-to-Speech Synthesis using Tacotron2 for Melspectrogram Generation

Supriya Khadka, Ranju G.C., Prabin Paudel, Rahul Shah, Basanta Joshi

This paper proposes a method for generating high-quality synthesized Nepali speech from text, using a Tacotron2 model for melspectrogram generation. Speech synthesis proceeds in two phases: melspectrogram generation and vocoding. The Nepali text is preprocessed and tokenized before being fed into the Tacotron2 model, which generates melspectrograms. The model is trained on a publicly available OpenSLR dataset for the Nepali language and fine-tuned on a new dataset created by the authors. Fine-tuning improves the model's performance and adapts it to language-specific characteristics. Incremental learning is further employed to continually update the model with new data, so that it can generalize and adapt to evolving contexts. The melspectrograms are then passed to HiFiGAN and WaveGlow vocoders, which produce the synthesized speech. Finally, post-processing techniques are applied to refine the generated output and enhance its naturalness. The synthesized speech was evaluated qualitatively and obtained a Mean Opinion Score of 4.03 for naturalness, the highest among Nepali text-to-speech systems reported to date.
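The two-phase flow described in the abstract (tokenized text to melspectrogram, then melspectrogram to waveform) can be illustrated with a minimal sketch. This is not the authors' implementation: the symbol inventory, the names `normalize`, `text_to_sequence`, and `synthesize`, and the choice of character-level tokens over the Devanagari Unicode block are all assumptions made for illustration. The acoustic model and vocoder are passed in as plain callables, standing in for a trained Tacotron2 model and a HiFiGAN or WaveGlow vocoder.

```python
import unicodedata
from typing import Callable, List

# Hypothetical symbol inventory (an assumption, not taken from the paper):
# a padding symbol, basic punctuation, and the Devanagari block U+0900..U+097F.
PAD = "_"
PUNCTUATION = [" ", ",", ".", "?", "!"]
DEVANAGARI = [chr(cp) for cp in range(0x0900, 0x0980)]
SYMBOLS = [PAD] + PUNCTUATION + DEVANAGARI
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


def text_to_sequence(text: str) -> List[int]:
    """Map normalized text to symbol IDs, dropping characters outside the inventory."""
    return [SYMBOL_TO_ID[ch] for ch in normalize(text) if ch in SYMBOL_TO_ID]


def synthesize(text: str, acoustic_model: Callable, vocoder: Callable):
    """Phase 1: token IDs -> melspectrogram. Phase 2: melspectrogram -> waveform.

    Both callables are placeholders for trained models supplied by the caller.
    """
    token_ids = text_to_sequence(text)
    mel = acoustic_model(token_ids)
    return vocoder(mel)
```

Keeping the models as injected callables means the same text front end can be reused with either vocoder, which matches the abstract's description of sending the same melspectrograms to both HiFiGAN and WaveGlow.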