ISCA Archive Interspeech 2023

EdenTTS: A Simple and Efficient Parallel Text-to-speech Architecture with Collaborative Duration-alignment Learning

Youneng Ma, Junyi He, Meimei Wu, Guangyue Hu, Haojun Fei

In pursuit of high inference speed, many non-autoregressive neural text-to-speech (TTS) models have recently been proposed for parallel speech synthesis. A critical challenge of parallel speech generation lies in learning the text-speech alignment. Existing methods usually require an external aligner for guidance or involve a complex training process. In this work, we propose EdenTTS, a simple and efficient parallel TTS architecture that jointly learns duration prediction, text-speech alignment, and speech generation in a single fully differentiable model. The alignment is learned implicitly in our architecture. A novel energy-modulated attention mechanism is proposed for alignment guidance, which leads to fast and stable convergence of our model. Our model can be easily implemented and trained. Experiments demonstrate that our method generates high-quality speech with high training efficiency.
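The abstract does not give the exact form of the energy-modulated attention, so the following is only a generic sketch of the idea it names: scaled dot-product attention between speech-side queries and text-side keys, with an additive energy term that steers the attention weights toward a plausible (here, near-diagonal) alignment path. The diagonal bias and all tensor shapes below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def energy_biased_attention(queries, keys, energy_bias):
    """Scaled dot-product attention with an additive energy term.

    queries:     (T_mel, d)  speech-side queries
    keys:        (T_text, d) text-side keys
    energy_bias: (T_mel, T_text) additive scores that modulate the
        attention, e.g. to favor a monotonic, near-diagonal alignment.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d) + energy_bias
    return softmax(scores, axis=-1)

# Toy example with a hypothetical diagonal-prior energy term.
rng = np.random.default_rng(0)
T_mel, T_text, d = 6, 4, 8
q = rng.standard_normal((T_mel, d))
k = rng.standard_normal((T_text, d))
i = np.arange(T_mel)[:, None] / T_mel
j = np.arange(T_text)[None, :] / T_text
bias = -10.0 * (i - j) ** 2  # penalize off-diagonal positions
attn = energy_biased_attention(q, k, bias)
print(attn.shape)  # (6, 4); each row is a distribution over text tokens
```

Because the bias enters before the softmax, the mechanism stays fully differentiable, consistent with the abstract's claim that alignment is learned jointly with the rest of the model.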