ISCA Archive Interspeech 2012

Pipelined back-propagation for context-dependent deep neural networks

Xie Chen, Adam Eversole, Gang Li, Dong Yu, Frank Seide

The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional Gaussian-mixture-based HMMs. For example, a CD-DNN-HMM trained on the 2000h Fisher corpus achieves a 14.4% word error rate on the Hub5'00-FSH speaker-independent phone-call transcription task, compared to 19.6% obtained by a state-of-the-art, conventional, discriminatively trained GMM-based HMM. That CD-DNN-HMM, however, took 59 days to train on a modern GPGPU: the immense computational cost of minibatch-based back-propagation (BP) training is a major roadblock. Unlike the familiar Baum-Welch training for conventional HMMs, BP cannot be efficiently parallelized across data. In this paper we show that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server. Using 2 and 4 GPGPUs, we achieve 1.9- and 3.3-fold end-to-end speed-ups, at parallelization efficiencies of 0.95 and 0.82, respectively, with no loss of recognition accuracy.
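To make the layer-wise pipelining concrete, below is a minimal single-process sketch in Python/NumPy. It is an illustrative assumption, not the authors' multi-GPU implementation: the toy task, the names make_batch, link_fwd, link_bwd, and the two-layer network are all invented for the example. Two layers stand in for two pipeline stages that would each occupy one GPGPU; both stages are busy at every step, but on different minibatches, so layer 1's weights are updated with gradients that are a pipeline slot stale. That delayed-update approximation is the price pipelined BP pays for keeping all cards busy.

import numpy as np
from collections import deque

# Toy simulation of pipelined back-propagation (hypothetical sketch).
# Layer 1 and layer 2 act as two pipeline stages ("GPU 0" and "GPU 1").

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_IN, D_HID, D_OUT, LR = 20, 32, 1, 0.2
W1 = rng.normal(0.0, 0.1, (D_IN, D_HID))   # stage-1 weights ("GPU 0")
W2 = rng.normal(0.0, 0.1, (D_HID, D_OUT))  # stage-2 weights ("GPU 1")

def make_batch(n=32):
    # Linearly separable toy task: label 1 iff the features sum to > 0.
    x = rng.normal(size=(n, D_IN))
    y = (x.sum(axis=1, keepdims=True) > 0).astype(float)
    return x, y

pending = deque()   # (x, h) batches whose backward pass hasn't reached stage 1
link_fwd = None     # activations travelling stage 1 -> stage 2
link_bwd = None     # gradients travelling stage 2 -> stage 1

for step in range(2000):
    # Stage 2: forward + backward + immediate update on the batch it received.
    new_link_bwd = None
    if link_fwd is not None:
        h, y = link_fwd
        p = sigmoid(h @ W2)
        delta2 = (p - y) / len(y)          # cross-entropy with sigmoid output
        new_link_bwd = delta2 @ W2.T       # gradient handed back to stage 1
        W2 -= LR * (h.T @ delta2)

    # Stage 1: delayed backward pass for an older batch. Note that W1 has
    # already been used for newer forward passes since this batch went through;
    # this staleness is the pipelined approximation.
    if link_bwd is not None:
        x_old, h_old = pending.popleft()
        delta1 = link_bwd * h_old * (1.0 - h_old)  # sigmoid'(z) = h*(1-h)
        W1 -= LR * (x_old.T @ delta1)

    # Stage 1: forward pass on a fresh minibatch (concurrent on real hardware).
    x, y = make_batch()
    h = sigmoid(x @ W1)
    pending.append((x, h))
    link_fwd, link_bwd = (h, y), new_link_bwd

# Quick held-out check that training converged despite the stale gradients.
x, y = make_batch(1000)
acc = ((sigmoid(sigmoid(x @ W1) @ W2) > 0.5) == y).mean()
print(f"held-out accuracy after pipelined training: {acc:.3f}")

The essential point of the sketch is the scheduling: within one step, stage 2 consumes the activations produced in the previous step while stage 1 is already forwarding the next minibatch, so on real hardware the two matrix-multiply workloads would overlap across GPGPUs rather than run sequentially.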

Index Terms: speech recognition, deep neural networks, parallelization, GPGPU