ISCA Archive Interspeech 2014

Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons

Ahmed Hussen Abdelaziz, Dorothea Kolossa

Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video streams to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability measure features to stream weights. The input to the MLP is a feature vector containing different model-based and signal-based reliability measures. The MLP is trained using dynamic oracle stream weights as target outputs, which are found using a recently proposed expectation maximization algorithm. This new approach of MLP-based stream weight estimation has been evaluated on the Grid audio-visual corpus and has outperformed the best baseline, yielding a 23.72% average relative error rate reduction.
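The idea of mapping reliability measures to a per-frame stream weight can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the commonly used log-linear combination, in which a weight `lam` in [0, 1] scales the audio log-likelihood and `1 - lam` scales the video log-likelihood, and it assumes a single-hidden-layer MLP with a sigmoid output. The function names, the network size, and all feature and weight values are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function; keeps the estimated stream weight inside (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_stream_weight(
    features: Sequence[float],
    w_hidden: Sequence[Sequence[float]],
    b_hidden: Sequence[float],
    w_out: Sequence[float],
    b_out: float,
) -> float:
    """Map one frame's reliability features to a stream weight.

    Assumed architecture (not taken from the paper): one tanh hidden
    layer followed by a single sigmoid output unit.
    """
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def combine_log_likelihoods(
    log_b_audio: Sequence[float],
    log_b_video: Sequence[float],
    lam: float,
) -> List[float]:
    """Weighted sum of per-state audio and video log-likelihoods.

    Assumed form: lam * log b_audio + (1 - lam) * log b_video.
    """
    return [
        lam * la + (1.0 - lam) * lv
        for la, lv in zip(log_b_audio, log_b_video)
    ]


# Hypothetical example: two reliability features for one frame.
frame_features = [0.8, 0.1]
lam = mlp_stream_weight(
    frame_features,
    w_hidden=[[1.0, -0.5], [0.3, 0.7]],  # made-up weights
    b_hidden=[0.0, 0.1],
    w_out=[1.2, -0.4],
    b_out=0.0,
)
# Per-state log-likelihoods for the same frame (made-up values).
combined = combine_log_likelihoods([-1.0, -2.5], [-3.0, -0.5], lam)
```

In this sketch a weight near 1 lets the audio stream dominate the decoding score, and a weight near 0 shifts the decision toward the video stream. In training, the oracle stream weights mentioned in the abstract would serve as the regression targets for the MLP output.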


doi: 10.21437/Interspeech.2014-292

Cite as: Abdelaziz, A.H., Kolossa, D. (2014) Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons. Proc. Interspeech 2014, 1144-1148, doi: 10.21437/Interspeech.2014-292

@inproceedings{abdelaziz14_interspeech,
  author={Ahmed Hussen Abdelaziz and Dorothea Kolossa},
  title={{Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1144--1148},
  doi={10.21437/Interspeech.2014-292},
  issn={2308-457X}
}