ISCA Archive Interspeech 2012

Speech restoration based on a deep learning autoencoder with layer-wise pretraining

Xugang Lu, Shigeki Matsuda, Chiori Hori, Hideki Kashioka

A neural network can be used to “remember” speech patterns by encoding the statistical regularities of speech in its parameters; clean speech can then be “recalled” when noisy speech is input to the network. Adding more hidden layers increases network capacity, but as the number of hidden layers grows (a deep network), traditional training strategies easily become trapped in poor local solutions, so a deep network can sometimes perform even worse than a shallow one. In this study, we explore a greedy layer-wise pretraining strategy for training a deep autoencoder (DAE) for speech restoration, and apply the restored speech to noise-robust speech recognition. The DAE is first pretrained layer by layer using a quasi-Newton optimization algorithm, with each layer treated as a shallow autoencoder whose hidden output serves as the input to the next layer. The pretrained layers are stacked and “unrolled” into a DAE, and the pretrained parameters serve as the initialization for fine-tuning the full network. The trained DAE is then used as a filter to restore clean speech from noisy input. Noise-robust speech recognition experiments were conducted to examine the performance of the trained deep network. Experimental results show that the DAE trained with the pretraining process significantly improved speech restoration from noisy input.

Index Terms: Deep learning, autoencoder, noise reduction, speech recognition
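
To make the training procedure concrete, the following is a minimal NumPy/SciPy sketch of greedy layer-wise pretraining followed by unrolling, not the authors' implementation: it assumes sigmoid hidden units, a mean-squared reconstruction loss, and SciPy's L-BFGS as the quasi-Newton optimizer (details the abstract does not specify), and it omits the supervised fine-tuning on (noisy, clean) speech pairs that would follow.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, hidden_dim, max_iter=50):
    """Train one shallow autoencoder X -> sigmoid(X W1 + b1) -> X_hat,
    minimizing mean-squared reconstruction error with L-BFGS (a
    quasi-Newton method). Gradients are left to finite differences
    here for brevity; an analytic gradient would be used in practice."""
    n, d = X.shape
    h = hidden_dim
    rng = np.random.default_rng(0)

    def unpack(theta):
        i = 0
        W1 = theta[i:i + d * h].reshape(d, h); i += d * h
        b1 = theta[i:i + h];                   i += h
        W2 = theta[i:i + h * d].reshape(h, d); i += h * d
        b2 = theta[i:i + d]
        return W1, b1, W2, b2

    def loss(theta):
        W1, b1, W2, b2 = unpack(theta)
        H = sigmoid(X @ W1 + b1)
        return np.mean((H @ W2 + b2 - X) ** 2)

    theta0 = 0.01 * rng.standard_normal(2 * d * h + h + d)
    res = minimize(loss, theta0, method="L-BFGS-B",
                   options={"maxiter": max_iter})
    W1, b1, W2, b2 = unpack(res.x)
    H = sigmoid(X @ W1 + b1)      # hidden code: the input to the next layer
    return (W1, b1), (W2, b2), H

def pretrain_dae(X, layer_dims):
    """Greedy layer-wise pretraining: each layer is a shallow autoencoder
    trained on the preceding layer's hidden output. The encoder/decoder
    pairs are then stacked and 'unrolled' (decoders in reverse order) to
    give the initial parameters of the deep autoencoder."""
    encoders, decoders = [], []
    H = X
    for h_dim in layer_dims:
        enc, dec, H = pretrain_layer(H, h_dim)
        encoders.append(enc)
        decoders.insert(0, dec)   # decoder stack mirrors the encoder stack
    return encoders + decoders    # initialization for fine-tuning

def dae_forward(params, X):
    """Pass feature frames through the unrolled DAE; sigmoid on all but
    the final (linear) reconstruction layer."""
    H = X
    for k, (W, b) in enumerate(params):
        Z = H @ W + b
        H = Z if k == len(params) - 1 else sigmoid(Z)
    return H

# Toy usage on hypothetical 24-dim feature frames: pretrain on clean
# frames, then the stacked network would be fine-tuned on (noisy, clean)
# pairs before serving as a restoration filter (fine-tuning omitted).
clean = np.random.default_rng(1).standard_normal((200, 24))
params = pretrain_dae(clean, layer_dims=[16, 8])
noisy = clean + 0.1 * np.random.default_rng(2).standard_normal(clean.shape)
restored = dae_forward(params, noisy)
```

In this sketch the “unrolling” step simply concatenates the encoder stack with the decoders in reverse order, so the pretrained shallow autoencoders jointly define one deep network whose parameters are refined end to end during fine-tuning.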