ISCA Archive Interspeech 2012

Discriminative feature-space transforms using deep neural networks

George Saon, Brian Kingsbury

We present a deep neural network (DNN) architecture which learns time-dependent offsets to acoustic feature vectors according to a discriminative objective function such as maximum mutual information (MMI) between the reference words and the transformed acoustic observation sequence. A key ingredient in this technique is a greedy layer-wise pretraining of the network based on minimum squared error between the DNN outputs and the offsets provided by a linear feature-space MMI (FMMI) transform. Next, the weights of the pretrained network are updated with stochastic gradient ascent by backpropagating the MMI gradient through the DNN layers. Experiments on a 50-hour English broadcast news transcription task show a 4% relative improvement using a 6-layer DNN transform over a state-of-the-art speaker-adapted system with FMMI and model-space discriminative training.
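The pretraining stage described above can be illustrated with a minimal sketch: a small network maps each feature vector x to an offset o(x), the transformed feature is x + o(x), and the weights are fit by minimizing squared error against target offsets from a linear transform. All dimensions, the single hidden layer, and the random linear "FMMI-like" transform below are illustrative assumptions, not the paper's actual configuration (which uses real acoustic features, several layers, and a trained FMMI transform, followed by MMI fine-tuning).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper uses real acoustic features).
D = 40        # feature dimension
H = 64        # hidden layer width
N = 256       # number of frames

# One-hidden-layer network mapping a feature vector x to an offset o(x);
# the transformed feature is x + o(x).
W1 = rng.normal(0, 0.1, (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, D)); b2 = np.zeros(D)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2          # hidden activations, predicted offsets

# Targets: offsets produced by a linear transform, standing in here for
# the FMMI transform of the paper (A below is random, purely for illustration).
A = rng.normal(0, 0.05, (D, D))
X = rng.normal(0, 1.0, (N, D))     # synthetic "acoustic" frames
T = X @ A                          # target offsets for each frame

# Minimum-squared-error pretraining by full-batch gradient descent.
lr = 0.05
for _ in range(500):
    h, O = forward(X)
    err = O - T                    # dE/dO for squared-error loss E
    gW2 = h.T @ err / N; gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h**2) # backprop through tanh
    gW1 = X.T @ dh / N; gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, O = forward(X)
mse = float(((O - T) ** 2).mean())
X_transformed = X + O              # offset-transformed features
```

In the paper, this MSE pretraining only initializes the network; the weights are then updated further by backpropagating the MMI gradient (sequence-level, requiring lattices and an acoustic model), which is beyond this sketch.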

Index Terms: speech recognition, deep neural networks

doi: 10.21437/Interspeech.2012-4

Cite as: Saon, G., Kingsbury, B. (2012) Discriminative feature-space transforms using deep neural networks. Proc. Interspeech 2012, 14-17, doi: 10.21437/Interspeech.2012-4

@inproceedings{saon12_interspeech,
  author={George Saon and Brian Kingsbury},
  title={{Discriminative feature-space transforms using deep neural networks}},
  year=2012,
  booktitle={Proc. Interspeech 2012},
  pages={14--17},
  doi={10.21437/Interspeech.2012-4}
}