The TAL Speech Synthesis System for Blizzard Challenge 2021

Shaotong Guo, Shuaiting Chen, Dao Zhou, Gang He, Changbin Chen

This paper introduces the TAL speech synthesis system for Blizzard Challenge 2021, which aims to synthesize a voice as similar as possible to that of the provided target speaker. We built a Spanish speech synthesis system based on a pre-trained BERT model, global style tokens (GST), and HiFi-GAN for task 2021-SH1. First, we use a modified open-source Spanish front-end to generate Spanish phoneme sequences from the input Spanish text. Then, we construct a modified GST model that conditions the encoder on linguistic features. The acoustic model is trained on two speakers and then fine-tuned on the target speaker from the provided corpus. To speed up synthesis while maintaining speech quality, we use HiFi-GAN, an efficient, high-fidelity GAN-based vocoder, to convert mel-spectrograms into speech waveforms. The evaluation results show that our system performs well, especially in the word error rate evaluation.
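As a rough illustration of the three-stage structure described above (front-end, acoustic model, vocoder), the following minimal sketch composes the stages as plain callables. It is not the authors' implementation: the class name `TTSPipeline`, the `toy_*` stand-in functions, the 80 mel bins, and the hop length of 256 samples are all assumptions chosen for illustration, and the stand-ins only produce placeholder values of the right shape.

```python
from dataclasses import dataclass
from typing import Callable, List

Phonemes = List[str]
Mel = List[List[float]]   # frames x mel bins
Waveform = List[float]    # audio samples


@dataclass
class TTSPipeline:
    """Hypothetical container that chains the three stages in order."""
    front_end: Callable[[str], Phonemes]
    acoustic_model: Callable[[Phonemes], Mel]
    vocoder: Callable[[Mel], Waveform]

    def synthesize(self, text: str) -> Waveform:
        phonemes = self.front_end(text)      # text -> phoneme sequence
        mel = self.acoustic_model(phonemes)  # phonemes -> mel-spectrogram
        return self.vocoder(mel)             # mel-spectrogram -> waveform


# Toy stand-ins (assumptions): they only mimic the input/output shapes.
def toy_front_end(text: str) -> Phonemes:
    # Treats each letter as one "phoneme"; a real front-end does G2P.
    return [ch for ch in text.lower() if ch.isalpha()]


def toy_acoustic_model(phonemes: Phonemes, n_mels: int = 80) -> Mel:
    # One all-zero frame per phoneme; a real model predicts many frames.
    return [[0.0] * n_mels for _ in phonemes]


def toy_vocoder(mel: Mel, hop_length: int = 256) -> Waveform:
    # Emits hop_length silent samples per frame.
    return [0.0] * (len(mel) * hop_length)


pipeline = TTSPipeline(toy_front_end, toy_acoustic_model, toy_vocoder)
wave = pipeline.synthesize("Hola mundo")
```

In a real system, each stand-in would be replaced by the corresponding trained component; keeping the stages behind simple callables makes it straightforward to swap the vocoder or fine-tune the acoustic model independently.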

