ISCA Archive Interspeech 2021

Transformer-Based Acoustic Modeling for Streaming Speech Synthesis

Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Koehler, Qing He

Transformer models have shown promising results in neural speech synthesis due to their superior ability to model long-term dependencies compared to recurrent networks. However, the computational complexity of transformers grows quadratically with sequence length, making them impractical for many real-time applications. To address this complexity issue in the speech synthesis domain, this paper proposes an efficient transformer-based acoustic model that runs at constant speed regardless of input sequence length, making it well suited to streaming speech synthesis applications. The proposed model uses a transformer network to predict the prosody features at phone rate, followed by an Emformer network that predicts the frame-rate spectral features in a streaming manner. Both the transformer and the Emformer in the proposed architecture use a self-attention mechanism that incorporates explicit long-term information, thus providing improved speech naturalness for long utterances. In our experiments, we use a WaveRNN neural vocoder that takes the predicted spectral features as input and generates the final audio. The overall architecture achieves human-like speech quality on both short and long utterances while maintaining low latency and a low real-time factor. Our mean opinion score (MOS) evaluation shows that for short utterances, the proposed model achieves a MOS of 4.213, compared with a ground-truth MOS of 4.307; for long utterances, it also produces high-quality speech with a MOS of 4.201, compared with a ground-truth MOS of 4.360.
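The contrast between quadratic full self-attention and constant per-chunk streaming attention can be illustrated with a rough cost count. The sketch below is not the authors' implementation: it only counts query-key score computations, and the chunk size, left-context length, and number of memory slots are hypothetical values chosen for illustration, since the abstract does not state them.

```python
def full_attention_cost(num_frames: int) -> int:
    """Query-key score computations when every frame attends to every frame."""
    return num_frames * num_frames


def chunk_attention_cost(chunk: int, left_context: int, memory_slots: int) -> int:
    """Score computations for ONE chunk in a simplified streaming scheme.

    Each frame in the chunk attends to the chunk itself, a fixed-length left
    context, and a fixed number of memory slots summarizing older history.
    All three sizes are illustrative assumptions, not values from the paper.
    """
    return chunk * (chunk + left_context + memory_slots)


def streaming_attention_cost(num_frames: int, chunk: int,
                             left_context: int, memory_slots: int) -> int:
    """Total score computations over an utterance processed chunk by chunk."""
    num_chunks = -(-num_frames // chunk)  # ceiling division
    return num_chunks * chunk_attention_cost(chunk, left_context, memory_slots)


if __name__ == "__main__":
    # Hypothetical settings: 32-frame chunks, 64 frames of left context, 4 memory slots.
    for frames in (1000, 2000, 4000):
        full = full_attention_cost(frames)
        stream = streaming_attention_cost(frames, 32, 64, 4)
        print(f"{frames} frames: full={full}, streaming={stream}")
```

Under these assumptions, the cost of each chunk does not depend on how long the utterance is, so the total grows roughly linearly with the number of frames, whereas the full-attention count quadruples each time the sequence length doubles.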