In this paper we describe a method to perform sequence-discriminative
training of neural network acoustic models without the need for frame-level
cross-entropy pre-training. We use the lattice-free version of the
maximum mutual information (MMI) criterion: LF-MMI. To make its computation
feasible, we use a phone n-gram language model in place of the word
language model. To further reduce its space and time complexity, we
compute the objective function using neural network outputs at one
third the standard frame rate. These changes enable us to perform the
computation for the forward-backward algorithm on GPUs. Furthermore, the
reduced output frame rate also provides a significant speed-up during
decoding.
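
For orientation, the MMI criterion is commonly written in the form below. This is the standard textbook statement, not an equation quoted from this paper. The symbols are notation assumed here: $O_u$ is the observation sequence of utterance $u$, $W_u$ its reference transcript, $S_W$ the state sequence for word sequence $W$, $\theta$ the model parameters, and $\kappa$ an acoustic scale. In a lattice-free setup as described above, the denominator sum would run over the sequences allowed by the phone n-gram language model rather than over word lattices.

```latex
\mathcal{F}_{\mathrm{MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta(O_u \mid S_{W_u})^{\kappa} \, P(W_u)}
         {\sum_{W} p_\theta(O_u \mid S_{W})^{\kappa} \, P(W)}
```

The denominator is the expensive term, which is why the choice of language model and the output frame rate both affect the cost of the forward-backward computation.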
We present results on five different LVCSR tasks with training data
ranging from 100 to 2100 hours. Models trained with LF-MMI provide
a relative word error rate reduction of ~11.5% over those trained
with the cross-entropy objective function, and ~8% over those trained
with the cross-entropy and sMBR objective functions. A further relative
reduction of ~2.5% can be obtained by fine-tuning these models
with the word-lattice-based sMBR objective function.
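
The reductions quoted above are relative, not absolute. As a minimal sketch of that arithmetic, the helper below computes a relative word error rate reduction from two WER values. The function name and the WER figures are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (percent), for illustration only.
baseline = 20.0   # assumed cross-entropy system
improved = 17.7   # assumed sequence-trained system

reduction = relative_wer_reduction(baseline, improved)
print(f"relative WER reduction: {reduction:.1f}%")
```

With these assumed numbers, an absolute drop of 2.3 points corresponds to a relative reduction of about 11.5%.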