The AHOLAB Text-to-Speech system for Blizzard Challenge 2021

Victor García Romillo, Inma Hernáez Rioja, Eva Navas

In this paper we present the text-to-speech synthesis system proposed for the Blizzard Challenge 2021 by the Aholab Signal Processing Group. The goal of the challenge is to build a synthetic voice from a provided speech corpus recorded in European Spanish. The challenge comprises two tasks: synthesising text containing only Spanish words, and synthesising Spanish text containing a small number of English words. Our system uses Tacotron-2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to generate the audio signal from the spectrograms. A Spanish linguistic front-end module transforms grapheme sequences into phoneme sequences. To improve the robustness of the system and make the alignments in the acoustic model easier to learn, a loss term based on prior knowledge was added to the acoustic model. The evaluation shows that our system performed well on both tasks.

doi: 10.21437/Blizzard.2021-11

Cite as: García Romillo, V., Hernáez Rioja, I., Navas, E. (2021) The AHOLAB Text-to-Speech system for Blizzard Challenge 2021. Proc. The Blizzard Challenge 2021, 64-69, doi: 10.21437/Blizzard.2021-11

