ISCA Archive Interspeech 2014

TTS synthesis with bidirectional LSTM based recurrent neural networks

Yuchen Fan, Yao Qian, Feng-Long Xie, Frank K. Soong

Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered, context-dependent HMM TTS systems. However, long time-span contextual effects in a speech utterance are still not easy to accommodate, owing to the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are commonly used to constrain speech parameter trajectory generation in HMM-based TTS [2]. In this paper, Recurrent Neural Networks (RNNs) with Bidirectional Long Short-Term Memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure cascaded with upper hidden layers with a bidirectional LSTM-RNN structure, can outperform either the conventional decision-tree-based HMM TTS system or a DNN TTS system, both objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
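
The hybrid structure described in the abstract (feed-forward lower layers cascaded with bidirectional LSTM upper layers) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the random initialization, the single feed-forward and single BLSTM layer, the omission of bias terms, and all names (`init_params`, `lstm_pass`, `hybrid_forward`) are assumptions made only for illustration. It uses the Python standard library alone and performs no training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    # Small random weights; the initialization scheme is an assumption.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dense_tanh(W, x):
    # Feed-forward layer applied to one frame independently of its neighbours.
    return [math.tanh(a) for a in matvec(W, x)]


def lstm_pass(W, frames, hidden, reverse=False):
    # One LSTM direction over the whole utterance.
    # W maps the concatenation [x_t; h_prev] to the four gate pre-activations.
    h = [0.0] * hidden
    c = [0.0] * hidden
    outputs = [None] * len(frames)
    steps = range(len(frames))
    for t in (reversed(steps) if reverse else steps):
        z = matvec(W, frames[t] + h)  # list concatenation of input and state
        i = [sigmoid(v) for v in z[:hidden]]                # input gate
        f = [sigmoid(v) for v in z[hidden:2 * hidden]]      # forget gate
        o = [sigmoid(v) for v in z[2 * hidden:3 * hidden]]  # output gate
        g = [math.tanh(v) for v in z[3 * hidden:]]          # candidate cell
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs[t] = h
    return outputs


def init_params(in_dim, ff_dim, hidden, out_dim, seed=0):
    # All dimensions here are hypothetical, not taken from the paper.
    rng = random.Random(seed)
    return {
        "hidden": hidden,
        "ff": rand_matrix(ff_dim, in_dim, rng),
        "fwd": rand_matrix(4 * hidden, ff_dim + hidden, rng),
        "bwd": rand_matrix(4 * hidden, ff_dim + hidden, rng),
        "out": rand_matrix(out_dim, 2 * hidden, rng),
    }


def hybrid_forward(params, frames):
    # Lower layer: feed-forward, frame by frame.
    lower = [dense_tanh(params["ff"], x) for x in frames]
    # Upper layer: one LSTM reads left to right, the other right to left.
    hidden = params["hidden"]
    fwd = lstm_pass(params["fwd"], lower, hidden)
    bwd = lstm_pass(params["bwd"], lower, hidden, reverse=True)
    # Linear output layer on the concatenated states of both directions.
    return [matvec(params["out"], hf + hb) for hf, hb in zip(fwd, bwd)]
```

Calling `hybrid_forward(init_params(5, 8, 4, 3), frames)` on a list of 5-dimensional input frames returns one 3-dimensional output vector per frame. Because the backward pass carries state from the end of the utterance toward the start, the output at every frame depends on the entire input sequence, which is the property the abstract relies on for capturing long time-span context.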


doi: 10.21437/Interspeech.2014-443

Cite as: Fan, Y., Qian, Y., Xie, F.-L., Soong, F.K. (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. Proc. Interspeech 2014, 1964-1968, doi: 10.21437/Interspeech.2014-443

@inproceedings{fan14_interspeech,
  author={Yuchen Fan and Yao Qian and Feng-Long Xie and Frank K. Soong},
  title={{TTS synthesis with bidirectional LSTM based recurrent neural networks}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1964--1968},
  doi={10.21437/Interspeech.2014-443},
  issn={2308-457X}
}