ISCA Archive Blizzard 2021

Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS

Shilun Lin, Wenchao Su, Li Meng, Fenglong Xie, Xinhui Li, Li Lu

This paper presents Nana-HDR, a new non-attentive, non-autoregressive model for TTS that combines a Transformer-based Dense-fuse encoder with an RNN-based decoder. It consists of three main parts: first, a novel Dense-fuse encoder that uses dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion; second, a single-layer non-autoregressive RNN-based decoder; third, a duration predictor that replaces the attention model and connects the hybrid encoder and decoder. Experiments indicate that Nana-HDR exploits the advantages of each component: the strong text-encoding ability of the Transformer-based encoder, stateful decoding free of exposure bias and local information preference, and the stable alignment provided by the duration predictor. Owing to these advantages, Nana-HDR achieves competitive naturalness and robustness on two Mandarin corpora and shows potential on a small Spanish corpus from the Blizzard Challenge 2021.
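
The abstract does not say how the duration predictor connects the encoder to the decoder. A common approach in duration-based TTS models is length regulation: each encoder state is repeated for as many output frames as its predicted duration, so the decoder receives a frame-level sequence without any attention alignment. The sketch below illustrates that general idea only; the function name `length_regulate`, the list-based representation, and the example values are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def length_regulate(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand token-level encoder states to frame level.

    Hypothetical helper: each encoder state is copied `durations[i]` times,
    which is one common way a duration predictor can stand in for attention.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")

    frames: List[List[float]] = []
    for state, duration in zip(encoder_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per output frame; a duration of 0 drops the token.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Illustrative values only: three tokens with 2-dimensional encoder states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 0, 3]
frames = length_regulate(states, predicted_durations)
print(len(frames))
```

Because the output length is simply the sum of the durations, the full frame sequence is known before decoding starts, which is what allows a decoder to run non-autoregressively over it.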

doi: 10.21437/Blizzard.2021-4

Cite as: Lin, S., Su, W., Meng, L., Xie, F., Li, X., Lu, L. (2021) Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS. Proc. The Blizzard Challenge 2021, 25-30, doi: 10.21437/Blizzard.2021-4

  author={Shilun Lin and Wenchao Su and Li Meng and Fenglong Xie and Xinhui Li and Li Lu},
  title={{Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS}},
  booktitle={Proc. The Blizzard Challenge 2021},