ISCA Archive ICSLP 1996

Using the visual component in automatic speech recognition

N. M. Brooke

The movements of talkers' faces are known to convey visual cues that can improve speech intelligibility, especially in the presence of noise or hearing impairment. This suggests that visible facial gestures could be exploited to enhance speech intelligibility in automatic systems. Handling the volume of data represented by images of talkers' faces implies some form of data compression. Rather than using conventional feature extraction approaches, image coding and compression can be achieved using data-driven, statistically oriented techniques such as artificial neural networks (ANNs) or principal component analysis (PCA). A major issue is how to combine the audio and visual data so that the best use can be made of the two modalities together. Perceptual experiments may offer guidance on suitable machine architectures, many of which currently use hidden Markov models (HMMs).
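
As an illustration of the data-driven compression mentioned above, the following is a minimal sketch of PCA applied to image frames, written in Python with NumPy. It is not the paper's implementation: the function names (fit_pca, encode, decode), the 16x16 frame size, the number of components, and the synthetic data standing in for face images are all assumptions made for the example.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Fit PCA to frames of shape (n_frames, n_pixels).

    Returns the mean frame and the top principal axes (one per row).
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are orthonormal principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def encode(frames, mean, components):
    """Project frames onto the principal axes to get compact codes."""
    return (frames - mean) @ components.T

def decode(codes, mean, components):
    """Reconstruct approximate frames from their codes."""
    return codes @ components + mean

# Synthetic stand-in for image data (assumption): 200 frames of 16x16 pixels,
# generated from 5 underlying factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
basis = rng.normal(size=(5, 256))
frames = latent @ basis + 0.01 * rng.normal(size=(200, 256))

mean, components = fit_pca(frames, n_components=5)
codes = encode(frames, mean, components)      # 256 values per frame reduced to 5
recon = decode(codes, mean, components)
rel_error = np.linalg.norm(frames - recon) / np.linalg.norm(frames)
```

In a sketch like this, each frame is represented by a handful of coefficients instead of its full pixel array, and such coefficients could then serve as the visual feature vector passed to a recogniser. How many components real face images would need is an empirical question that this example does not address.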