Feed-forward neural network language models (NNLMs) are known to improve both perplexity and word error rate (WER) for speech recognition compared with conventional n-gram language models. We present experimental results showing how much the WER can be improved by increasing the scale of the NNLM in terms of model size and training data. However, training time can become very long. We implemented a hierarchical NNLM approximation to speed up training by splitting up events and parallelizing training, as well as by reducing the output vocabulary size of each sub-network. This reduced training time by a factor of about 20, e.g. from 50 days to 2 days, with no degradation in WER. Using English Broadcast News data (350M words), we obtained significant improvements over the baseline n-gram language model that are competitive with recently published recurrent neural network language model (RNNLM) results.
Index Terms: neural network language models
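
The hierarchical approximation described above splits the output vocabulary across smaller sub-networks so that no single softmax covers the full vocabulary. The sketch below is not the authors' implementation; it is a minimal illustration of the general class-factored output layer idea that such approximations build on, where p(word | history) is factored into p(class | history) * p(word | class, history). All sizes, variable names, and the simple two-level class/word split are illustrative assumptions.

```python
# Minimal sketch of a class-factored ("hierarchical") output layer for a
# feed-forward NNLM. The full-vocabulary softmax p(w | h) is approximated as
# p(c(w) | h) * p(w | c(w), h), so each sub-softmax only covers a small slice
# of the vocabulary. All sizes and names here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

V = 10_000          # full output vocabulary size
n_classes = 100     # number of word classes (sub-networks)
words_per_class = V // n_classes
H = 200             # hidden-layer size of the NNLM

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Class-level output weights: one softmax over n_classes instead of V words.
W_class = rng.normal(scale=0.01, size=(n_classes, H))
# Word-level output weights: one small softmax per class.
W_word = rng.normal(scale=0.01, size=(n_classes, words_per_class, H))

def word_prob(hidden, word_id):
    """p(word | history) under the class factorization, given the hidden vector."""
    c = word_id // words_per_class            # class of the target word
    i = word_id % words_per_class             # index of the word within its class
    p_class = softmax(W_class @ hidden)[c]    # p(class | history)
    p_word = softmax(W_word[c] @ hidden)[i]   # p(word | class, history)
    return p_class * p_word

hidden = rng.normal(size=H)                   # stand-in for the NNLM hidden layer
print(word_prob(hidden, word_id=4321))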