ISCA Archive Interspeech 2015

An investigation of recurrent neural network architectures for statistical parametric speech synthesis

Sivanand Achanta, Tejas Godambe, Suryakanth V. Gangashetty

In this paper, we investigate two recurrent neural network (RNN) architectures for statistical parametric speech synthesis (SPSS): the Elman RNN and the recently proposed clockwork RNN [1]. Of late, deep neural networks have been used for SPSS; these predict every frame independently of previous predictions and hence require post-processing to ensure smooth evolution of the speech parameters. RNNs, on the other hand, are intuitively better suited for the task as they inherently model temporal dependencies, but their use was restricted by the difficulty of training them. Lately, techniques such as sparse initialization, Nesterov's accelerated gradient, gradient clipping and leaky integration (LI) have been shown to overcome this difficulty, and we study the utility of these techniques for the SPSS task. In addition, we show that the clockwork RNN is equivalent to an Elman RNN with a particular form of LI. This perspective helps explain why a simple Elman RNN with LI units performs well on sequential tasks.
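As a rough illustration (not taken from the paper), the leaky-integration recurrence referred to above can be sketched as below. The names li_elman_step, W, U, b and the leak rate alpha are generic placeholders; a clockwork-style schedule corresponds to replacing the constant alpha with a per-unit 0/1 value that switches on only at each module's clock ticks.

import numpy as np

def li_elman_step(x_t, h_prev, W, U, b, alpha):
    """One leaky-integration Elman RNN step:
    h_t = (1 - alpha) * h_prev + alpha * tanh(W x_t + U h_prev + b).
    If alpha is a per-unit vector of 0/1 values that turns on only at a
    module's clock ticks, the update reduces to a clockwork-RNN-style one
    (units keep their previous state between ticks)."""
    h_tilde = np.tanh(W @ x_t + U @ h_prev + b)
    return (1.0 - alpha) * h_prev + alpha * h_tilde

# Usage with random parameters (illustrative only)
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.standard_normal((d_h, d_in)) * 0.1
U = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)
alpha = np.full(d_h, 0.5)  # uniform leak rate; a clockwork scheme would vary this per unit group
h = np.zeros(d_h)
h = li_elman_step(rng.standard_normal(d_in), h, W, U, b, alpha)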