This paper presents the Sogou speech synthesis system for Blizzard Challenge 2018. The corpus released to participants this year is a 6.5-hour children’s audiobook in British English, the same data as in the 2017 release. We build a parametric system for this task. First, a multi-speaker DNN-BLSTM model is used to model mel spectrograms. Then, in place of a conventional vocoder, a modified WaveNet model conditioned on the predicted mel features generates 16-bit speech waveforms at 32 kHz.
This is the first time Sogou has joined the Blizzard Challenge, although we have been developing speech synthesis for years. The identifier for our system is J. The results show that our submitted system performed well on all criteria.